Parallel Monte Carlo
Calculate the value of $\pi$ through parallel direct Monte Carlo.
A unit circle (radius 1) is inscribed in a square with side length 2 (spanning -1 to 1 in each dimension). The area of the circle is $\pi$, the area of the square is 4, and the ratio is $\pi/4$. This means that, if you throw $N$ darts uniformly at random at the square, approximately $M = N\pi/4$ of those darts will land inside the circle.
Hence the strategy: throw $N$ darts randomly at the square, count how many of them ($M$) land inside the circle, and approximate $\pi \approx 4M/N$.
Sequential implementation:
function mc(N)
    M = 0 # number of darts that landed in the circle
    for i in 1:N
        if rand()^2 + rand()^2 < 1.0
            M += 1
        end
    end
    pi = 4 * M / N
    return pi
end
N = 100_000_000
mc(N)
3.14171236
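Note that rand() returns values in $[0, 1)$, so mc above samples only the first quadrant and counts hits inside the quarter circle; by symmetry, the hit ratio is still $\pi/4$. For illustration, an equivalent variant that samples the full square from -1 to 1, matching the description above, could look like the following sketch (the name mc_fullsquare is not part of the original example):

function mc_fullsquare(N)
    M = 0 # number of darts that landed in the circle
    for i in 1:N
        # draw both coordinates uniformly from [-1, 1)
        x = 2 * rand() - 1
        y = 2 * rand() - 1
        if x^2 + y^2 < 1.0
            M += 1
        end
    end
    return 4 * M / N
end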
Parallelization with tmapreduce
To parallelize the Monte Carlo simulation, we use tmapreduce with + as the reduction operator. For the map part, we take 1:N as our input collection and "throw one dart" per element.
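As a toy illustration of this map-reduce pattern (not part of the original example), the following call squares each element of 1:100 in the do-block (the "map") and combines the results with + (the "reduce"):

using OhMyThreads

# parallel sum of squares: the do-block is the map function, + is the reduction operator
tmapreduce(+, 1:100) do i
    i^2
end
# expected result: 338350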
using OhMyThreads

function mc_parallel(N; kwargs...)
    M = tmapreduce(+, 1:N; kwargs...) do i
        rand()^2 + rand()^2 < 1.0
    end
    pi = 4 * M / N
    return pi
end
# or alternatively
#
# function mc_parallel(N)
#     M = @tasks for _ in 1:N
#         @set reducer = +
#         rand()^2 + rand()^2 < 1.0
#     end
#     pi = 4 * M / N
#     return pi
# end
mc_parallel(N)
3.14156496
Let's run a quick benchmark.
using BenchmarkTools
using Base.Threads: nthreads
@assert nthreads() > 1 # make sure we have multiple Julia threads
@show nthreads() # print out the number of threads
@btime mc($N) samples=10 evals=3;
@btime mc_parallel($N) samples=10 evals=3;
nthreads() = 10
301.636 ms (0 allocations: 0 bytes)
41.864 ms (68 allocations: 5.81 KiB)
Static scheduling
Because the workload is highly uniform, it makes sense to also try the StaticScheduler and compare the performance of static and dynamic scheduling (with default parameters).
using OhMyThreads: StaticScheduler
@btime mc_parallel($N; scheduler=:dynamic) samples=10 evals=3; # default
@btime mc_parallel($N; scheduler=:static) samples=10 evals=3;
41.839 ms (68 allocations: 5.81 KiB)
41.838 ms (68 allocations: 5.81 KiB)
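Since StaticScheduler is imported above, the same comparison can equivalently be expressed by passing a scheduler object instead of the :static symbol; OhMyThreads accepts both forms. A usage sketch, reusing the imports from above (no separate benchmark numbers are claimed here):

@btime mc_parallel($N; scheduler=StaticScheduler()) samples=10 evals=3; # scheduler object, equivalent to scheduler=:static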
Manual parallelization
First, using the index_chunks function, we divide the iteration interval 1:N into nthreads() parts. Then, we apply a regular (sequential) map to spawn a Julia task per chunk. Each task will locally and independently perform a sequential Monte Carlo simulation. Finally, we fetch the results and compute the average estimate for $\pi$.
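To see what index_chunks produces, here is a small toy illustration (hypothetical sizes, not part of the original example); it yields consecutive index ranges of roughly equal length that together cover the input range:

using OhMyThreads: index_chunks

collect(index_chunks(1:10; n = 3))
# consecutive ranges covering 1:10, typically [1:4, 5:7, 8:10]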
using OhMyThreads: @spawn, index_chunks

function mc_parallel_manual(N; nchunks = nthreads())
    tasks = map(index_chunks(1:N; n = nchunks)) do idcs
        @spawn mc(length(idcs))
    end
    pi = sum(fetch, tasks) / nchunks
    return pi
end
mc_parallel_manual(N)
3.14180504
And this is the performance:
@btime mc_parallel_manual($N) samples=10 evals=3;
30.224 ms (65 allocations: 5.70 KiB)
It is faster than mc_parallel above because the task-local computation mc(length(idcs)) is faster than the implicit task-local computation within tmapreduce (which itself is a mapreduce). The following benchmark compares the two per-chunk computations directly:
idcs = first(index_chunks(1:N; n = nthreads()))
@btime mapreduce($+, $idcs) do i
    rand()^2 + rand()^2 < 1.0
end samples=10 evals=3;
@btime mc($(length(idcs))) samples=10 evals=3;
41.750 ms (0 allocations: 0 bytes)
30.148 ms (0 allocations: 0 bytes)
This page was generated using Literate.jl.