Tutorial: missing value handling

This tutorial illustrates the usage of Transducers.jl by stepping through various handling of missing values.

using Transducers

Dot product

Here is a simple way to compute a dot product using MapSplat:

zip(1:3, 10:2:14) |> MapSplat(*) |> sum

Let's see what it does step by step. First we create a "printer" transducer using the following function (see Map):

xf_printer(label) = Map() do x
    println(label, ": ", x)
    return x  # just return it as-is
end

This transducer just pass-through the input while printing its value (prefixed by a label). Let's sandwich the previous MapSplat(*) with it:

zip(1:3, 10:2:14) |>
xf_printer(" input") |>
MapSplat(*) |>
xf_printer("output") |>
sum

 input: (1, 10)
output: 10
 input: (2, 12)
output: 24
 input: (3, 14)
output: 42

You can see that the input tuple (1, 10) is splatted into function * by MapSplat which then outputs 10. This is repeated for all inputs.

Perhaps unfortunately, this way of computing a dot product propagates any missing values contained in the input arrays to the result (which may actually be desired in certain cases).

xs = [1, missing, 3, 2]
ys = [10, 14, missing, 12]
zip(xs, ys) |> MapSplat(*) |> sum

missing

However, it is very simple to ignore any missing values using OfType:

zip(xs, ys) |> OfType(Tuple{Vararg{Number}}) |> MapSplat(*) |> sum

Here, Tuple{Vararg{Number}} is a type that matches with a tuple of any length with numbers. It does not match with a tuple if it has a missing.

@assert (1, 0.5) isa Tuple{Vararg{Number}}
@assert (1, 0.5, 2im) isa Tuple{Vararg{Number}}
@assert !((1, missing) isa Tuple{Vararg{Number}})

The part ... |> OfType(Tuple{Vararg{Number}}) |> MapSplat(*) can be factored out using opcompose:

xf_mdot = opcompose(OfType(Tuple{Vararg{Number}}), MapSplat(*));

or equivalently

OfType(Tuple{Vararg{Number}}) ⨟ MapSplat(*)

The transducer xf_mdot can be used where previously OfType(Tuple{Vararg{Number}}) |> MapSplat(*) was used:

zip(xs, ys) |> xf_mdot |> sum

Covariance

Transducer xf_mdot above can also be used to compute the covariance. First, we need the number of pairs of elements in xs and ys that both of them are not missing:

nonmissings = zip(xs, ys) |> Map(x -> x isa Tuple{Vararg{Number}}) |> count

Finally, we have to pre-process the input to xf_mdot by subtracting the average. It's easy to do this with Map:

using Statistics: mean

function xf_demean(xs, ys)
    xmean = mean(skipmissing(xs))
    ymean = mean(skipmissing(ys))
    return Map(((x, y),) -> (x - xmean, y - ymean))
end

We can then compute the covariance by combining xf_demean and xf_mdot:

s = zip(xs, ys) |> xf_demean(xs, ys) |> xf_mdot |> sum
s / nonmissings

1.0

In xf_demean, the averages of the vectors xs and ys are computed separately. It is also easy to compute the averages of the elements where both xs and ys are non-missing:

function xf_demean2(xs, ys)
    n, xsum, ysum =
        zip(xs, ys) |>
        OfType(Tuple{Vararg{Number}}) |>
        Map(((x, y),) -> (1, x, y)) |>
        Broadcasting() |>
        sum
    xmean = xsum / n
    ymean = ysum / n
    return Map(((x, y),) -> (x - xmean, y - ymean))
end

s = zip(xs, ys) |> xf_demean2(xs, ys) |> xf_mdot |> sum
s / nonmissings

0.5

In xf_demean2, we used Broadcasting transducer to broadcast elements of the tuple (1, x, y) over the reducing function of sum (i.e., +).

Advanced: `TeeRF` and `ProductRF`

Alternatively, it can also be computed using by combining foldxl, ProductRF, TeeRF, and DataTools.inc1 (see below for how TeeRF and ProductRF work):

using DataTools: inc1

function xf_demean3(xs, ys)
    n, (xsum, ysum) =
        zip(xs, ys) |>
        OfType(Tuple{Vararg{Number}}) |>
        foldxl(TeeRF(inc1, ProductRF(+, +)))
    xmean = xsum / n
    ymean = ysum / n
    return Map(((x, y),) -> (x - xmean, y - ymean))
end

s = zip(xs, ys) |> xf_demean3(xs, ys) |> xf_mdot |> sum
s / nonmissings

0.5

reduce, sum, etc. in Base support dims argument. Transducers.jl does not support this argument as of writing. However, this can easily be emulated using eachcol, eachrow, or eachslice iterators in Base.

xs = [
    0       missing 1       2
    3       4       5       missing
    missing 6       7       missing
]

eachcol(xs) |> Broadcasting() |> NotA(Missing) |> sum

3-element Vector{Any}:
  3
 12
 13

Here, we use NotA transducer that filters out missing values:

[1, 2, missing, 3] |> NotA(Missing) |> collect

3-element Vector{Int64}:
 1
 2
 3

Above computation returns the sum over each row without taking into account the relationship within a column. Another possibly useful reduction is the sum of the columns with no missing values. This can easily be done by filtering before:

eachcol(xs) |> Filter(x -> !any(ismissing, x)) |> Broadcasting() |> sum

3-element Vector{Int64}:
 1
 5
 7

`findmax` and `findmin`

Another useful operation is findmax/findmin. Using Filter, missing values can be filtered out by

filtered_pairs = [1, 3, missing, 0] |> pairs |> Filter(!(ismissing ∘ last))
collect(filtered_pairs)

3-element Vector{Pair{Int64, Union{Missing, Int64}}}:
 1 => 1
 2 => 3
 4 => 0

These key-value pairs can be accumulated by the following reducing step function:

#                     ,--- current state
#                     |
#                     |              ,-- input
#                     |              |
function findmax_step((argmax, max), (index, value))
    argmax, max = value > max ? (index, value) : (argmax, max)
    return argmax => max
    #        \
    #         \__ next state
end

foldxl(findmax_step, filtered_pairs)

2 => 3

or equivalently

[1, 3, missing, 0] |> pairs |> Filter(!(ismissing ∘ last)) |> foldxl(findmax_step)

2 => 3

foldxl is like foldl but always uses Transducers.jl's extended fold protocol. It also has the unary curried method foldxl(rf) defined as xs -> foldxl(rf, xs). It is handy to use in the piping context as in the latter example.

DataTools.rightif can be used for defining findmax/findmin-like functions on the fly:

using DataTools: rightif

foldxl(rightif(<, last), filtered_pairs)

Pair{Int64, Union{Missing, Int64}}(2, 3)

Side note: why `Pair{Int,Union{Missing,Int}}`?

The result type just above using rightif is Pair{Int,Union{Missing,Int}}:

typeof(foldxl(rightif(<, last), filtered_pairs))

Pair{Int64, Union{Missing, Int64}}

This is because that's the element type of pairs([1, 3, missing, 0]) and rightif does not re-construct the input Pair like findmax_step:

[1, 3, missing, 0] |> pairs |> first |> typeof

Pair{Int64, Union{Missing, Int64}}

We can avoid this by pre-processing the input with MapSplat(Pair):

foldxl(rightif(<, last), filtered_pairs |> MapSplat(Pair))

2 => 3

`findmin`

Similarly, we can define findmin with

function findmin_step((argmin, min), (index, value))
    argmin, min = value < min ? (index, value) : (argmin, min)
    return argmin => min
end

foldxl(findmin_step, filtered_pairs)

4 => 0

and

foldxl(rightif(>, last), filtered_pairs)

Pair{Int64, Union{Missing, Int64}}(4, 0)

Extrema (`findminmax`)

To compute findmax and findmax in a single sweep, we can use TeeRF to "fan out" the input stream to multiple reducing step functions:

foldxl(TeeRF(findmin_step, findmax_step), filtered_pairs)

(4 => 0, 2 => 3)

or equivalently

foldxl(TeeRF(rightif(>, last), rightif(<, last)), filtered_pairs)

(Pair{Int64, Union{Missing, Int64}}(4, 0), Pair{Int64, Union{Missing, Int64}}(2, 3))

In general, multiple folds on a same collection

a₁ = foldxl(rf₁, xs)
a₂ = foldxl(rf₂, xs)
...
aₙ = foldxl(rfₙ, xs)

can be fused into a single fold using TeeRF

a₁, a₂, ..., aₙ = foldxl(TeeRF(rf₁, rf₂, ..., rfₙ), xs)

provided that the input collection xs and the reducing functions rf₁, rf₂, ..., and rfₙ are not stateful.

More fusing by transforming reducing functions

In the above computation, we have a reducing (step) function

rf = TeeRF(rightif(>, last), rightif(<, last));

a transducer

xf = Filter(!(ismissing ∘ last));

and an iterable

xs = pairs([1, 3, missing, 0]);

In Transducers.jl, a transducer acts as iterator transformation xf(xs) as well as reducing function transformation xf'(rf). Thus, the following calls are equivalent:

@assert foldxl(rf, xf, xs) == (4 => 0, 2 => 3)
@assert foldxl(rf, xf(xs)) == (4 => 0, 2 => 3)
@assert foldxl(xf'(rf), xs) == (4 => 0, 2 => 3)

By exploiting this equality, we can fuse more computations by moving the transformation on the side of reducing function. For example, we can compute non-missing extrema and count missings at the same time:

[1, 3, missing, 0] |>
pairs |>
MapSplat(Pair) |>  # avoid `Pair{Int,Union{Missing, Int}}`
foldxl(TeeRF(
    Map(ismissing ∘ last)'(+),  # count number of missings
    Filter(!(ismissing ∘ last))'(TeeRF(
        rightif(>, last),  # find non-missing minimum
        rightif(<, last),  # find non-missing maximum
    )),
))

(1, (4 => 0, 2 => 3))

Using ProductRF, we can compute findmax of individual and zipped items at the same time

zip(
    pairs([1, 3, missing, 0]),  # produces k1 => v1
    pairs([4, missing, 6, 5]),  # produces k2 => v2
) |>
Map() do ((k1, v1), (k2, v2))
    (k1 => v1, k2 => v2, (k1, k2) => (v1, v2))
end |>
foldxl(ProductRF(
    Filter(!(ismissing ∘ last))'(rightif(<, last)),  # max of k1 => v1
    Filter(!(ismissing ∘ last))'(rightif(<, last)),  # max of k2 => v2
    Filter(((_, (v1, v2)),) -> v1 !== missing && v2 !== missing)'(
        rightif(<, last)  # max (k1, k2) => (v1, v2)
    ),
))

(2 => 3, 3 => 6, (1, 1) => (1, 4))

ProductRF is like TeeRF but acts on the input that is already a tuple. That is to say, given a collection of n-tuple xs, multiple folds on a same collection

a₁ = foldxl(rf₁, (x[1] for x in xs))
a₂ = foldxl(rf₂, (x[2] for x in xs))
...
aₙ = foldxl(rfₙ, (x[n] for x in xs))

can be fused into a single fold using ProductRF

a₁, a₂, ..., aₙ = foldxl(ProductRF(rf₁, rf₂, ..., rfₙ), xs)

provided that the input collection xs and the reducing functions rf₁, rf₂, ..., and rfₙ are not stateful.

Early termination

Base's maximum reports the maximum to be missing when it receives a container with a missing:

maximum([1, 2, missing, 3])

missing

We can obtain the same behavior by using isless instead of > in findmax_step′:

function findmax_step′((argmax, max), (index, value))
    argmax, max = isless(max, value) ? (index, value) : (argmax, max)
    return argmax => max
end

foldl(findmax_step′, pairs([1, 2, missing, 3]))

3 => missing

foldl(findmax_step′, ...) does not stop even after it observed missing. We can easily add early termination by using ReduceIf:

[1, 2, missing, 3] |> pairs |> ReduceIf(ismissing ∘ last) |> foldxl(findmax_step′)

3 => missing

Note that ReduceIf(f) is not same as TakeWhile(!f):

[1, 2, missing, 3] |> pairs |> TakeWhile(!(ismissing ∘ last)) |> foldxl(findmax_step′)

2 => 2

That is to say, TakeWhile does not evaluate the inner reducing step function with the item that triggers the early termination. That's why we need ReduceIf here.

`findmin` with `missing` and `NaN`

Unfortunately, we do not have min-compatible "total" ordering in Base. Thus, we need to create a function that special-cases missing and NaN:

function isgreater(x, y)
    xisnan = x != x
    xisnan isa Bool || return false  # x is missing
    yisnan = y != y
    yisnan isa Bool || return true  # y is missing
    xisnan && return false
    yisnan && return true
    return isless(y, x)
end

@assert isgreater(2, 1)
@assert isgreater(1, missing)
@assert isgreater(NaN, missing)
@assert isgreater(1, NaN)
@assert !isgreater(1, 2)
@assert !isgreater(missing, 1)
@assert !isgreater(missing, NaN)
@assert !isgreater(NaN, 1)

Using isgreater instead of <, we can define findmin_step′ like findmax_step′:

function findmin_step′((argmin, min), (index, value))
    argmin, min = isgreater(min, value) ? (index, value) : (argmin, min)
    return argmin => min
end

foldl(findmax_step′, pairs([1, 2, missing, 3]))

3 => missing

Extrema (`findminmax`) with early termination

As before, we can fuse findmin_step′ and findmax_step′ using TeeRF. This can also be composed with ReduceIf:

[1, 2, 3, 0, 2] |>
pairs |>
ReduceIf(ismissing ∘ last) |>
foldxl(TeeRF(findmin_step′, findmax_step′))

(4 => 0, 3 => 3)

[1, 2, missing, 0, 2] |>
pairs |>
ReduceIf(ismissing ∘ last) |>
foldxl(TeeRF(findmin_step′, findmax_step′))

(3 => missing, 3 => missing)

Ad-hoc imputation

Using Scan, it is straightforward to fill missing items with the last non-missing item (last observation carried forward):

[1, 3, missing, 0, 2, missing, missing] |>
pairs |>
MapSplat(Pair) |>
Scan() do last, (k, v)
    ismissing(v) ? last : k => v
end |>
collect

7-element Vector{Pair{Int64, Int64}}:
 1 => 1
 2 => 3
 2 => 3
 4 => 0
 5 => 2
 5 => 2
 5 => 2

rightif can also be used:

[1, 3, missing, 0, 2, missing, missing] |>
pairs |>
MapSplat(Pair) |>
Scan(rightif(!(ismissing ∘ right), last)) |>
collect

Note that the output still may contain missing if the first item is missing:

[missing, 1, missing] |>
pairs |>
MapSplat(Pair) |>
Scan(rightif(!(ismissing ∘ right), last)) |>
collect

3-element Vector{Pair{Int64, Union{Missing, Int64}}}:
 1 => missing
 2 => 1
 2 => 1

This can be worked around by specifying init argument for Scan:

[missing, 1, missing] |>
pairs |>
MapSplat(Pair) |>
Scan(rightif(!(ismissing ∘ right), last), 0 => 0) |>
collect

3-element Vector{Pair{Int64, Int64}}:
 0 => 0
 2 => 1
 2 => 1

This page was generated using Literate.jl.