Julia Set

In this example, we will compute an image of the Julia set in parallel. We will explore the scheduler and ntasks options, which can be used to improve load balancing.

The value of a single pixel of the Julia set, which corresponds to a point in the complex plane, can be computed by the following iteration procedure.

function _compute_pixel(i, j, n; max_iter = 255, c = -0.79 + 0.15 * im)
    # map pixel indices (i, j) to a point in [-2, 2] × [-2, 2] in the complex plane
    x = -2.0 + (j - 1) * 4.0 / (n - 1)
    y = -2.0 + (i - 1) * 4.0 / (n - 1)

    z = x + y * im
    iter = max_iter
    for k in 1:max_iter
        if abs2(z) > 4.0
            iter = k - 1
            break
        end
        z = z^2 + c
    end
    return iter
end
_compute_pixel (generic function with 1 method)

Note that the value of a pixel is the number of iterations performed for the corresponding complex input number before the iterate escapes (or max_iter if it never does). Hence, the computational workload is non-uniform across pixels.
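A quick check (not part of the original tutorial) makes this non-uniformity concrete: a corner pixel escapes immediately, while a pixel near the center of the image needs many more iterations.

```julia
# Corner pixel: z = -2 - 2im has |z|^2 = 8 > 4, so the loop breaks at
# the first iteration and the pixel value is 0.
fast = _compute_pixel(1, 1, 2000)

# Center pixel: z ≈ 0, so the orbit needs many iterations before it
# escapes (or never does, in which case the value is max_iter = 255).
slow = _compute_pixel(1000, 1000, 2000)

fast, slow  # fast == 0, slow > fast
```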

Sequential computation

In our naive implementation, we just loop over the dimensions of the image matrix and call the pixel kernel above.

function compute_juliaset_sequential!(img)
    N = size(img, 1)
    for j in 1:N
        for i in 1:N
            img[i, j] = _compute_pixel(i, j, N)
        end
    end
    return img
end

N = 2000
img = zeros(Int, N, N)
compute_juliaset_sequential!(img);

Let's look at the result:

using Plots
p = heatmap(img)

Parallelization

The Julia set computation above is a map! operation: we apply a function to each element of the array. Hence, we can use tmap! for parallelization. We use CartesianIndices to map between linear and two-dimensional Cartesian indices.
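As a brief aside (not from the original tutorial), here is how this index mapping works on a small matrix:

```julia
A = zeros(3, 4)
cart = CartesianIndices(A)
# Julia arrays are column-major, so linear index 5 is row 2, column 2.
c = cart[5]
Tuple(c)  # (2, 2)
```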

using OhMyThreads: tmap!

function compute_juliaset_parallel!(img; kwargs...)
    N = size(img, 1)
    cart = CartesianIndices(img)
    tmap!(img, eachindex(img); kwargs...) do idx
        c = cart[idx]
        _compute_pixel(c[1], c[2], N)
    end
    return img
end

# or alternatively (requires: using OhMyThreads: @tasks)
#
# function compute_juliaset_parallel!(img; kwargs...)
#     N = size(img, 1)
#     cart = CartesianIndices(img)
#     @tasks for idx in eachindex(img)
#         c = cart[idx]
#         img[idx] = _compute_pixel(c[1], c[2], N)
#     end
#     return img
# end

N = 2000
img = zeros(Int, N, N)
compute_juliaset_parallel!(img);
p = heatmap(img)

Benchmark

Let's benchmark the variants above.

using BenchmarkTools
using Base.Threads: nthreads

N = 2000
img = zeros(Int, N, N)

@show nthreads()

@btime compute_juliaset_sequential!($img) samples=10 evals=3;
@btime compute_juliaset_parallel!($img) samples=10 evals=3;
nthreads() = 10
  131.295 ms (0 allocations: 0 bytes)
  31.422 ms (68 allocations: 6.09 KiB)

As hoped, the parallel implementation is much faster!

Dynamic vs static scheduling

As stated above, the per-pixel computation is non-uniform. Hence, we benefit from the load balancing of the default dynamic scheduler, which divides the overall workload into tasks that can then be dynamically distributed among threads to even out the per-thread load. We can try to fine-tune and improve the load balancing further by increasing the ntasks parameter of the scheduler, that is, by creating more tasks with smaller per-task workloads.
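As a rough sketch of the idea (using only Base.Threads, not OhMyThreads internals; the function and parameter names here are illustrative, not from the original tutorial), creating more tasks than threads lets the runtime hand out chunks dynamically:

```julia
using Base.Threads: @spawn

# Split the work into `ntasks` chunks; each chunk becomes its own task,
# which the Julia scheduler can assign to whichever thread is free.
function psum_dynamic(work::Vector{Int}; ntasks = 8)
    chunks = Iterators.partition(eachindex(work), cld(length(work), ntasks))
    tasks = map(chunks) do idxs
        @spawn sum(@view work[idxs])
    end
    return sum(fetch, tasks)
end

psum_dynamic(collect(1:100))  # 5050
```

With more tasks than threads, a thread that finishes an easy chunk early simply picks up the next pending task, which is exactly the effect of raising ntasks above.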

using OhMyThreads: DynamicScheduler

@btime compute_juliaset_parallel!($img; ntasks=N, scheduler=:dynamic) samples=10 evals=3;
  17.438 ms (12018 allocations: 1.11 MiB)

Note that while this turns out to be a bit faster, it comes at the expense of many more allocations.

To quantify the impact of load balancing, we can opt out of dynamic scheduling and use the StaticScheduler instead, which doesn't provide any form of load balancing.

using OhMyThreads: StaticScheduler

@btime compute_juliaset_parallel!($img; scheduler=:static) samples=10 evals=3;
  30.097 ms (73 allocations: 6.23 KiB)

This page was generated using Literate.jl.