# Parallel Monte Carlo

Calculate the value of $\pi$ through parallel direct Monte Carlo.

A unit circle is inscribed inside a unit square with side length 2 (from -1 to 1). The area of the circle is $\pi$, the area of the square is 4, and the ratio is $\pi/4$. This means that, if you throw $N$ darts randomly at the square, approximately $M=N\pi/4$ of those darts will land inside the unit circle.

Throw darts randomly at a unit square and count how many of them ($M$) landed inside of a unit circle. Approximate $\pi \approx 4M/N$.

## Sequential implementation:

```
function mc(N)
M = 0 # number of darts that landed in the circle
for i in 1:N
if rand()^2 + rand()^2 < 1.0
M += 1
end
end
pi = 4 * M / N
return pi
end
N = 100_000_000
mc(N)
```

`3.14171236`

## Parallelization with `tmapreduce`

To parallelize the Monte Carlo simulation, we use `tmapreduce`

with `+`

as the reduction operator. For the map part, we take `1:N`

as our input collection and "throw one dart" per element.

```
using OhMyThreads
function mc_parallel(N; kwargs...)
M = tmapreduce(+, 1:N; kwargs...) do i
rand()^2 + rand()^2 < 1.0
end
pi = 4 * M / N
return pi
end
# or alternatively
#
# function mc_parallel(N)
# M = @tasks for _ in 1:N
# @set reducer = +
# rand()^2 + rand()^2 < 1.0
# end
# pi = 4 * M / N
# return pi
# end
mc_parallel(N)
```

`3.14156496`

Let's run a quick benchmark.

```
using BenchmarkTools
using Base.Threads: nthreads
@assert nthreads() > 1 # make sure we have multiple Julia threads
@show nthreads() # print out the number of threads
@btime mc($N) samples=10 evals=3;
@btime mc_parallel($N) samples=10 evals=3;
```

```
nthreads() = 10
301.636 ms (0 allocations: 0 bytes)
41.864 ms (68 allocations: 5.81 KiB)
```

### Static scheduling

Because the workload is highly uniform, it makes sense to also try the `StaticScheduler`

and compare the performance of static and dynamic scheduling (with default parameters).

```
using OhMyThreads: StaticScheduler
@btime mc_parallel($N; scheduler=:dynamic) samples=10 evals=3; # default
@btime mc_parallel($N; scheduler=:static) samples=10 evals=3;
```

```
41.839 ms (68 allocations: 5.81 KiB)
41.838 ms (68 allocations: 5.81 KiB)
```

## Manual parallelization

First, using the `index_chunks`

function, we divide the iteration interval `1:N`

into `nthreads()`

parts. Then, we apply a regular (sequential) `map`

to spawn a Julia task per chunk. Each task will locally and independently perform a sequential Monte Carlo simulation. Finally, we fetch the results and compute the average estimate for $\pi$.

```
using OhMyThreads: @spawn, index_chunks
function mc_parallel_manual(N; nchunks = nthreads())
tasks = map(index_chunks(1:N; n = nchunks)) do idcs
@spawn mc(length(idcs))
end
pi = sum(fetch, tasks) / nchunks
return pi
end
mc_parallel_manual(N)
```

`3.14180504`

And this is the performance:

`@btime mc_parallel_manual($N) samples=10 evals=3;`

```
30.224 ms (65 allocations: 5.70 KiB)
```

It is faster than `mc_parallel`

above because the task-local computation `mc(length(idcs))`

is faster than the implicit task-local computation within `tmapreduce`

(which itself is a `mapreduce`

).

```
idcs = first(index_chunks(1:N; n = nthreads()))
@btime mapreduce($+, $idcs) do i
rand()^2 + rand()^2 < 1.0
end samples=10 evals=3;
@btime mc($(length(idcs))) samples=10 evals=3;
```

```
41.750 ms (0 allocations: 0 bytes)
30.148 ms (0 allocations: 0 bytes)
```

*This page was generated using Literate.jl.*