How I Won a Parallel Programming Marathon Without Knowing Parallel Programming

Introduction: What Is Parallel Programming?

Parallel programming is a computing paradigm that allows tasks to be divided and executed simultaneously. It leverages multiple processing units—like CPU cores or GPU threads—to perform large computations more efficiently. While the CPU (Central Processing Unit) is designed for general-purpose, sequential tasks, the GPU (Graphics Processing Unit) is optimized for handling multiple parallel operations across large datasets.

Parallelism is increasingly important in high-performance computing, gaming, data analysis, and AI training. But what if I told you that you could win a parallel programming marathon without being an expert in it? That’s exactly what happened to me.

Prior Experience: A Taste of Parallelism

Prior to the competition, I had briefly worked on a research project at INPE involving the parallelization of legacy Fortran code for the SX-Aurora TSUBASA hardware. Rather than writing any explicit parallel code, my contribution was to restructure the code so the compiler could apply automatic parallelization. This meant simplifying loop structures, often by removing conditional branches inside loops, and making the computation within each loop more predictable. These seemingly small changes significantly improved the compiler's ability to parallelize the code on its own.
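
To make that concrete, here is the kind of transformation I mean, sketched in C++ rather than the original Fortran (the loop body is made up; only the shape of the change matters): hoisting a loop-invariant branch out of a hot loop leaves the compiler with a simple, predictable loop it can vectorize or parallelize on its own.

    // Before: a loop-invariant branch is evaluated on every iteration and
    // gets in the way of automatic vectorization/parallelization.
    void scale_before(float* data, int n, bool normalize, float factor) {
        for (int i = 0; i < n; ++i) {
            if (normalize) data[i] /= factor;
            else           data[i] *= factor;
        }
    }

    // After: the branch is hoisted out, and each loop body is a single,
    // predictable operation the compiler can handle by itself.
    void scale_after(float* data, int n, bool normalize, float factor) {
        if (normalize) {
            for (int i = 0; i < n; ++i) data[i] /= factor;
        } else {
            for (int i = 0; i < n; ++i) data[i] *= factor;
        }
    }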

The first and most important step in any parallel computing task is recognizing sections of code that can be executed independently. Sequential dependencies, where one operation depends on the result of another, are bottlenecks to parallelization. During the competition, our first task for every problem was to isolate these independent regions.
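
A minimal illustration of the difference (the functions and data are placeholders):

    // Independent iterations: out[i] depends only on in[i], so every
    // iteration could be handed to a different thread.
    void squares(const float* in, float* out, int n) {
        for (int i = 0; i < n; ++i)
            out[i] = in[i] * in[i];
    }

    // Sequential dependency: each iteration needs the previous result, so the
    // iterations cannot simply be distributed across threads. (A pattern like
    // this prefix sum can still be parallelized, but only with a specialized
    // algorithm such as a parallel scan.)
    void prefix_sums(const float* in, float* out, int n) {
        if (n == 0) return;
        out[0] = in[0];
        for (int i = 1; i < n; ++i)
            out[i] = out[i - 1] + in[i];
    }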

It’s worth noting that not all parallelization needs to operate at this micro level. For example, in games, separate CPUs might handle game logic, rendering, and physics, each as independent tasks. But in this marathon, the challenge was data parallelism—parallelizing the same task across many data points.

Choosing the Right Hardware

One of the strategic decisions in the marathon was choosing which hardware to target: the CPU or the GPU. We were provided with a powerful 16-core CPU and an NVIDIA RTX 4090 GPU, and each came with its own strengths and challenges.

In general:

  • CPUs are designed for versatility and are easier to program for, especially when using frameworks like OpenMP, which abstracts much of the complexity of parallel thread management.
  • GPUs, on the other hand, are optimized for massive parallelism. They can execute thousands of threads simultaneously, offering significantly greater speedup—if the problem is well-suited to their architecture and the code is written carefully.

The trade-off is clear: the CPU is easier to parallelize, but the GPU delivers much higher performance potential.
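
To make the CPU side of that trade-off concrete, this is roughly what the OpenMP route looks like (the loop itself is just a placeholder): a single directive on an independent loop, and the runtime takes care of spawning threads and splitting the iterations across cores.

    // Compile with an OpenMP-enabled compiler, e.g. g++ -fopenmp
    #include <omp.h>

    void squares_parallel(const float* in, float* out, int n) {
        // One pragma is enough: OpenMP distributes the iterations over the
        // available CPU threads.
        #pragma omp parallel for
        for (int i = 0; i < n; ++i)
            out[i] = in[i] * in[i];
    }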

Our team had some experience with OpenMP, and using it would’ve allowed us to get quick results with minimal effort. But we decided to go further and aim for maximum performance by focusing on GPU-based solutions. This meant writing and adapting our code to run on the RTX 4090.

This choice wasn’t just about raw power—it was also a scoring strategy. The marathon rewarded higher speedups with more points, and we knew the GPU was the only path to reach those top-tier scores. Of course, this came with the cost of a steeper learning curve and more trial and error during development. But it was worth trying; we had nothing to lose.

Ultimately, choosing the GPU wasn’t just about chasing performance. It was a deliberate decision to tackle the hardest problems head-on and get the most out of the available hardware, even if it meant learning new tools like CUDA and Thrust on the fly.

Memory and Execution Considerations

CPU vs GPU Execution

On CPUs, multiple threads can read and write to shared memory with relatively little penalty. But on GPUs, things work differently.

A GPU is organized hierarchically:

  • Grids contain Blocks
  • Blocks contain Threads

Threads are executed in groups of 32 called warps, and every thread in a warp runs the same instruction in lockstep (NVIDIA calls this SIMT: Single Instruction, Multiple Threads). If your code contains if or switch statements that send threads of the same warp down different paths, only the threads that match the condition proceed while the rest sit idle, wasting valuable compute time.
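
Here is a small sketch of what divergence looks like inside a kernel (the computation is made up): within a warp, the threads whose input takes the if branch and the threads whose input takes the else branch cannot run at the same time, so the two paths execute one after the other.

    __global__ void divergent_kernel(const float* in, float* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        // Data-dependent branch: the threads of a warp that take one path
        // sit idle while the others execute theirs, and vice versa.
        if (in[i] > 0.0f) {
            out[i] = sqrtf(in[i]);
        } else {
            out[i] = 0.0f;
        }
    }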

GPU Memory Hierarchy

There are four main types of memory in the GPU architecture:

  1. Global Memory: Accessible by all threads, but slow (it lives off-chip).
  2. Local Memory: Private to a single thread; despite the name, it also resides in off-chip memory and is similarly slow.
  3. Shared Memory: Accessible by all threads within a block; on-chip and much faster.
  4. Registers: The fastest storage of all, but very limited in quantity.

To maximize GPU performance:

  • Minimize accesses to global memory.
  • Prefer shared memory for intermediate results within blocks.
  • Use registers for frequently accessed variables.
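
As a small sketch of these rules (the 3-point average is a made-up computation): each block stages the slice of the input it needs in shared memory once, and each thread keeps the value it reuses in a register instead of re-reading global memory.

    __global__ void average3(const float* in, float* out, int n) {
        extern __shared__ float tile[];             // one slice per block + halo
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int t = threadIdx.x + 1;

        // Stage this block's slice (plus one halo element on each side) in
        // shared memory so the three reads below don't all go to global memory.
        tile[t] = (i < n) ? in[i] : 0.0f;
        if (threadIdx.x == 0)
            tile[0] = (i > 0) ? in[i - 1] : 0.0f;
        if (threadIdx.x == blockDim.x - 1)
            tile[t + 1] = (i + 1 < n) ? in[i + 1] : 0.0f;
        __syncthreads();

        if (i < n) {
            float center = tile[t];                 // kept in a register
            out[i] = (tile[t - 1] + center + tile[t + 1]) / 3.0f;
        }
    }
    // Launch with: average3<<<blocks, threads, (threads + 2) * sizeof(float)>>>(...)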

Atomic Writes and Synchronization

Parallel code introduces race conditions. Suppose two threads try to update the same variable:

  • Thread A reads the value.
  • Thread B writes a new value.
  • Thread A writes its (now stale) value, overwriting B’s update.

This is problematic when aggregating results, like in counters or sums.

The solution? Atomic operations. These are special instructions that ensure only one thread at a time updates a memory location. However, they have a cost—they stall other threads while they complete.

To minimize this:

  • Let each thread write to its own local memory.
  • Use atomic operations only at the end to merge results.
  • On GPUs, use shared memory for per-block accumulation before doing a final atomic operation on global memory.

This strategy reduces contention and improves overall throughput.
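
Put together, the pattern looks something like this generic sum kernel (not one of the actual contest solutions; it assumes a power-of-two block size):

    __global__ void sum_kernel(const float* in, float* result, int n) {
        extern __shared__ float partial[];
        int tid = threadIdx.x;
        int i = blockIdx.x * blockDim.x + tid;

        // Each thread writes to its own slot: no contention at this stage.
        partial[tid] = (i < n) ? in[i] : 0.0f;
        __syncthreads();

        // Tree reduction in shared memory: the block collapses its values
        // without touching global memory.
        for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
            if (tid < stride) partial[tid] += partial[tid + stride];
            __syncthreads();
        }

        // Only one atomic operation per block ever reaches global memory.
        if (tid == 0) atomicAdd(result, partial[0]);
    }
    // Launch with: sum_kernel<<<blocks, threads, threads * sizeof(float)>>>(...)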

Thrust and CUDA

To program on the GPU, we used two main approaches:

Thrust

Thrust is a high-level C++ library developed by NVIDIA. It abstracts away much of the complexity involved in GPU memory management and parallel execution. It feels familiar to anyone who’s used modern languages with functional-style APIs (like JavaScript or Python).

Here are the main functions we used:

  • thrust::transform: Like map, applies a function to each element.
  • thrust::reduce: Performs a parallel reduction (e.g., summing values).
  • thrust::sort: Sorts data; can use a custom comparator. Uses a radix sort for primitive types, which is very fast.

These functions run in parallel on the GPU and are incredibly helpful for beginners in CUDA.
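
A small example of how they compose (the data is made up; the real problems were more involved):

    #include <thrust/device_vector.h>
    #include <thrust/transform.h>
    #include <thrust/reduce.h>
    #include <thrust/sort.h>
    #include <thrust/functional.h>
    #include <vector>
    #include <cstdio>

    // Plain functor so the example compiles without extended device lambdas.
    struct square {
        __host__ __device__ float operator()(float v) const { return v * v; }
    };

    int main() {
        std::vector<float> host = {3.0f, 1.0f, 4.0f, 1.0f, 5.0f};
        thrust::device_vector<float> x = host;      // copied to the GPU
        thrust::device_vector<float> y(x.size());

        // transform: like map, applies the functor to every element in parallel.
        thrust::transform(x.begin(), x.end(), y.begin(), square());

        // reduce: parallel reduction, here a plain sum.
        float total = thrust::reduce(y.begin(), y.end(), 0.0f);

        // sort: parallel sort with a custom comparator (descending order).
        thrust::sort(y.begin(), y.end(), thrust::greater<float>());

        std::printf("sum of squares: %f\n", total);
        return 0;
    }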

CUDA

For more complex problems, we used low-level CUDA. This required:

  • Manual memory allocation (cudaMalloc, cudaMemcpy)
  • Defining the number of threads, blocks, and grids
  • Writing GPU kernels

We didn’t dive too deep into raw CUDA but used it where Thrust was too limited.
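
For reference, the skeleton of a raw CUDA solution looks roughly like this (a trivial kernel, just to show the moving parts):

    #include <cuda_runtime.h>
    #include <vector>

    __global__ void add_one(float* data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] += 1.0f;
    }

    int main() {
        int n = 1 << 20;
        std::vector<float> host(n, 0.0f);

        // Manual memory management: allocate on the device, copy the input over.
        float* dev = nullptr;
        cudaMalloc(&dev, n * sizeof(float));
        cudaMemcpy(dev, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);

        // Choose the launch configuration: threads per block and number of blocks.
        int threads = 256;
        int blocks = (n + threads - 1) / threads;
        add_one<<<blocks, threads>>>(dev, n);

        // Copy the result back and release the device memory.
        cudaMemcpy(host.data(), dev, n * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(dev);
        return 0;
    }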

Results

There were three challenges in the marathon:

  • Problem A asked us to compute distances between points and rank them. We fully parallelized it on the GPU and scored 487.8x speedup points, the highest single-problem score in the competition.
  • Problem B, despite our efforts, offered little to no room for optimization through parallelism.
  • Problem C was a recursive mathematical formula. We simplified it algebraically to isolate the iteration dependency, which let us parallelize the computation and perform the final summation with a GPU reduction, as sketched below. It scored 373.7x speedup points.
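
To give a flavor of the problem C trick without reproducing the actual formula, take a purely hypothetical recurrence s_n = s_{n-1} + f(n). Unrolled, it becomes s_N = s_0 + f(1) + f(2) + ... + f(N), and the right-hand side has no iteration dependency, so each term can be computed by its own thread and the terms combined with a parallel reduction.

    #include <thrust/transform_reduce.h>
    #include <thrust/iterator/counting_iterator.h>
    #include <thrust/functional.h>
    #include <thrust/execution_policy.h>

    struct f_term {
        __host__ __device__ double operator()(long long i) const {
            return 1.0 / static_cast<double>(i * i);   // placeholder for the real f(n)
        }
    };

    double solve(long long N, double s0) {
        // Every f(i) is evaluated independently on the GPU and the results
        // are combined with a parallel reduction.
        return s0 + thrust::transform_reduce(
            thrust::device,
            thrust::counting_iterator<long long>(1),
            thrust::counting_iterator<long long>(N + 1),
            f_term(), 0.0, thrust::plus<double>());
    }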

We finished first overall, with a total of 862.8 points, ranking first in both the on-site and remote competition categories.


Final Thoughts

Winning this marathon wasn't about knowing everything about GPUs or CUDA. It was about:

  • Understanding the principles of parallelism
  • Making the code easier to parallelize
  • Choosing the right tools
  • Learning fast and iterating quickly

Sometimes, just not getting in the way of the compiler is already half the job done.