Tutorial: missing value handling
This tutorial illustrates the usage of Transducers.jl by stepping through various handling of missing values.
using Transducers
Dot product
Here is a simple way to compute a dot product using MapSplat
:
zip(1:3, 10:2:14) |> MapSplat(*) |> sum
76
Let's see what it does step by step. First we create a "printer" transducer using the following function (see Map
):
xf_printer(label) = Map() do x
println(label, ": ", x)
return x # just return it as-is
end
This transducer just pass-through the input while printing its value (prefixed by a label
). Let's sandwich the previous MapSplat(*)
with it:
zip(1:3, 10:2:14) |>
xf_printer(" input") |>
MapSplat(*) |>
xf_printer("output") |>
sum
input: (1, 10)
output: 10
input: (2, 12)
output: 24
input: (3, 14)
output: 42
You can see that the input tuple (1, 10)
is splatted into function *
by MapSplat
which then outputs 10
. This is repeated for all inputs.
Perhaps unfortunately, this way of computing a dot product propagates any missing values contained in the input arrays to the result (which may actually be desired in certain cases).
xs = [1, missing, 3, 2]
ys = [10, 14, missing, 12]
zip(xs, ys) |> MapSplat(*) |> sum
missing
However, it is very simple to ignore any missing values using OfType
:
zip(xs, ys) |> OfType(Tuple{Vararg{Number}}) |> MapSplat(*) |> sum
34
Here, Tuple{Vararg{Number}}
is a type that matches with a tuple of any length with numbers. It does not match with a tuple if it has a missing
.
@assert (1, 0.5) isa Tuple{Vararg{Number}}
@assert (1, 0.5, 2im) isa Tuple{Vararg{Number}}
@assert !((1, missing) isa Tuple{Vararg{Number}})
The part ... |> OfType(Tuple{Vararg{Number}}) |> MapSplat(*)
can be factored out using opcompose
:
xf_mdot = opcompose(OfType(Tuple{Vararg{Number}}), MapSplat(*));
or equivalently
OfType(Tuple{Vararg{Number}}) ⨟ MapSplat(*)
The transducer xf_mdot
can be used where previously OfType(Tuple{Vararg{Number}}) |> MapSplat(*)
was used:
zip(xs, ys) |> xf_mdot |> sum
34
Covariance
Transducer xf_mdot
above can also be used to compute the covariance. First, we need the number of pairs of elements in xs
and ys
that both of them are not missing
:
nonmissings = zip(xs, ys) |> Map(x -> x isa Tuple{Vararg{Number}}) |> count
2
Finally, we have to pre-process the input to xf_mdot
by subtracting the average. It's easy to do this with Map
:
using Statistics: mean
function xf_demean(xs, ys)
xmean = mean(skipmissing(xs))
ymean = mean(skipmissing(ys))
return Map(((x, y),) -> (x - xmean, y - ymean))
end
We can then compute the covariance by combining xf_demean
and xf_mdot
:
s = zip(xs, ys) |> xf_demean(xs, ys) |> xf_mdot |> sum
s / nonmissings
1.0
In xf_demean
, the averages of the vectors xs
and ys
are computed separately. It is also easy to compute the averages of the elements where both xs
and ys
are non-missing
:
function xf_demean2(xs, ys)
n, xsum, ysum =
zip(xs, ys) |>
OfType(Tuple{Vararg{Number}}) |>
Map(((x, y),) -> (1, x, y)) |>
Broadcasting() |>
sum
xmean = xsum / n
ymean = ysum / n
return Map(((x, y),) -> (x - xmean, y - ymean))
end
s = zip(xs, ys) |> xf_demean2(xs, ys) |> xf_mdot |> sum
s / nonmissings
0.5
In xf_demean2
, we used Broadcasting
transducer to broadcast elements of the tuple (1, x, y)
over the reducing function of sum
(i.e., +
).
Advanced: TeeRF
and ProductRF
Alternatively, it can also be computed using by combining foldxl
, ProductRF
, TeeRF
, and DataTools.inc1
(see below for how TeeRF
and ProductRF
work):
using DataTools: inc1
function xf_demean3(xs, ys)
n, (xsum, ysum) =
zip(xs, ys) |>
OfType(Tuple{Vararg{Number}}) |>
foldxl(TeeRF(inc1, ProductRF(+, +)))
xmean = xsum / n
ymean = ysum / n
return Map(((x, y),) -> (x - xmean, y - ymean))
end
s = zip(xs, ys) |> xf_demean3(xs, ys) |> xf_mdot |> sum
s / nonmissings
0.5
Vectorized reduction
reduce
, sum
, etc. in Base
support dims
argument. Transducers.jl does not support this argument as of writing. However, this can easily be emulated using eachcol
, eachrow
, or eachslice
iterators in Base
.
xs = [
0 missing 1 2
3 4 5 missing
missing 6 7 missing
]
if VERSION < v"1.1"
using Compat: eachcol
end
eachcol(xs) |> Broadcasting() |> NotA(Missing) |> sum
3-element Vector{Any}:
3
12
13
Here, we use NotA
transducer that filters out missing
values:
[1, 2, missing, 3] |> NotA(Missing) |> collect
3-element Vector{Int64}:
1
2
3
Above computation returns the sum over each row without taking into account the relationship within a column. Another possibly useful reduction is the sum of the columns with no missing values. This can easily be done by filtering before:
eachcol(xs) |> Filter(x -> !any(ismissing, x)) |> Broadcasting() |> sum
3-element Vector{Int64}:
1
5
7
findmax
and findmin
Another useful operation is findmax
/findmin
. Using Filter
, missing
values can be filtered out by
filtered_pairs = [1, 3, missing, 0] |> pairs |> Filter(!(ismissing ∘ last))
collect(filtered_pairs)
3-element Vector{Pair{Int64, Union{Missing, Int64}}}:
1 => 1
2 => 3
4 => 0
These key-value pairs can be accumulated by the following reducing step function:
# ,--- current state
# |
# | ,-- input
# | |
function findmax_step((argmax, max), (index, value))
argmax, max = value > max ? (index, value) : (argmax, max)
return argmax => max
# \
# \__ next state
end
foldxl(findmax_step, filtered_pairs)
2 => 3
or equivalently
[1, 3, missing, 0] |> pairs |> Filter(!(ismissing ∘ last)) |> foldxl(findmax_step)
2 => 3
foldxl
is like foldl
but always uses Transducers.jl's extended fold protocol. It also has the unary curried method foldxl(rf)
defined as xs -> foldxl(rf, xs)
. It is handy to use in the piping context as in the latter example.
DataTools.rightif
can be used for defining findmax
/findmin
-like functions on the fly:
using DataTools: rightif
foldxl(rightif(<, last), filtered_pairs)
Pair{Int64, Union{Missing, Int64}}(2, 3)
Side note: why Pair{Int,Union{Missing,Int}}
?
The result type just above using rightif
is Pair{Int,Union{Missing,Int}}
:
typeof(foldxl(rightif(<, last), filtered_pairs))
Pair{Int64, Union{Missing, Int64}}
This is because that's the element type of pairs([1, 3, missing, 0])
and rightif
does not re-construct the input Pair
like findmax_step
:
[1, 3, missing, 0] |> pairs |> first |> typeof
Pair{Int64, Union{Missing, Int64}}
We can avoid this by pre-processing the input with MapSplat(Pair)
:
foldxl(rightif(<, last), filtered_pairs |> MapSplat(Pair))
2 => 3
findmin
Similarly, we can define findmin
with
function findmin_step((argmin, min), (index, value))
argmin, min = value < min ? (index, value) : (argmin, min)
return argmin => min
end
foldxl(findmin_step, filtered_pairs)
4 => 0
and
foldxl(rightif(>, last), filtered_pairs)
Pair{Int64, Union{Missing, Int64}}(4, 0)
Extrema (findminmax
)
To compute findmax
and findmax
in a single sweep, we can use TeeRF
to "fan out" the input stream to multiple reducing step functions:
foldxl(TeeRF(findmin_step, findmax_step), filtered_pairs)
(4 => 0, 2 => 3)
or equivalently
foldxl(TeeRF(rightif(>, last), rightif(<, last)), filtered_pairs)
(Pair{Int64, Union{Missing, Int64}}(4, 0), Pair{Int64, Union{Missing, Int64}}(2, 3))
In general, multiple folds on a same collection
a₁ = foldxl(rf₁, xs)
a₂ = foldxl(rf₂, xs)
...
aₙ = foldxl(rfₙ, xs)
can be fused into a single fold using TeeRF
a₁, a₂, ..., aₙ = foldxl(TeeRF(rf₁, rf₂, ..., rfₙ), xs)
provided that the input collection xs
and the reducing functions rf₁
, rf₂
, ..., and rfₙ
are not stateful.
More fusing by transforming reducing functions
In the above computation, we have a reducing (step) function
rf = TeeRF(rightif(>, last), rightif(<, last));
xf = Filter(!(ismissing ∘ last));
and an iterable
xs = pairs([1, 3, missing, 0]);
In Transducers.jl, a transducer acts as iterator transformation xf(xs)
as well as reducing function transformation xf'(rf)
. Thus, the following calls are equivalent:
@assert foldxl(rf, xf, xs) == (4 => 0, 2 => 3)
@assert foldxl(rf, xf(xs)) == (4 => 0, 2 => 3)
@assert foldxl(xf'(rf), xs) == (4 => 0, 2 => 3)
By exploiting this equality, we can fuse more computations by moving the transformation on the side of reducing function. For example, we can compute non-missing extrema and count missings at the same time:
[1, 3, missing, 0] |>
pairs |>
MapSplat(Pair) |> # avoid `Pair{Int,Union{Missing, Int}}`
foldxl(TeeRF(
Map(ismissing ∘ last)'(+), # count number of missings
Filter(!(ismissing ∘ last))'(TeeRF(
rightif(>, last), # find non-missing minimum
rightif(<, last), # find non-missing maximum
)),
))
(1, (4 => 0, 2 => 3))
Using ProductRF
, we can compute findmax
of individual and zip
ped items at the same time
zip(
pairs([1, 3, missing, 0]), # produces k1 => v1
pairs([4, missing, 6, 5]), # produces k2 => v2
) |>
Map() do ((k1, v1), (k2, v2))
(k1 => v1, k2 => v2, (k1, k2) => (v1, v2))
end |>
foldxl(ProductRF(
Filter(!(ismissing ∘ last))'(rightif(<, last)), # max of k1 => v1
Filter(!(ismissing ∘ last))'(rightif(<, last)), # max of k2 => v2
Filter(((_, (v1, v2)),) -> v1 !== missing && v2 !== missing)'(
rightif(<, last) # max (k1, k2) => (v1, v2)
),
))
(2 => 3, 3 => 6, (1, 1) => (1, 4))
ProductRF
is like TeeRF
but acts on the input that is already a tuple. That is to say, given a collection of n
-tuple xs
, multiple folds on a same collection
a₁ = foldxl(rf₁, (x[1] for x in xs))
a₂ = foldxl(rf₂, (x[2] for x in xs))
...
aₙ = foldxl(rfₙ, (x[n] for x in xs))
can be fused into a single fold using ProductRF
a₁, a₂, ..., aₙ = foldxl(ProductRF(rf₁, rf₂, ..., rfₙ), xs)
provided that the input collection xs
and the reducing functions rf₁
, rf₂
, ..., and rfₙ
are not stateful.
Early termination
Base
's maximum
reports the maximum to be missing
when it receives a container with a missing
:
maximum([1, 2, missing, 3])
missing
We can obtain the same behavior by using isless
instead of >
in findmax_step′
:
function findmax_step′((argmax, max), (index, value))
argmax, max = isless(max, value) ? (index, value) : (argmax, max)
return argmax => max
end
foldl(findmax_step′, pairs([1, 2, missing, 3]))
3 => missing
foldl(findmax_step′, ...)
does not stop even after it observed missing
. We can easily add early termination by using ReduceIf
:
[1, 2, missing, 3] |> pairs |> ReduceIf(ismissing ∘ last) |> foldxl(findmax_step′)
3 => missing
Note that ReduceIf(f)
is not same as TakeWhile(!f)
:
[1, 2, missing, 3] |> pairs |> TakeWhile(!(ismissing ∘ last)) |> foldxl(findmax_step′)
2 => 2
That is to say, TakeWhile
does not evaluate the inner reducing step function with the item that triggers the early termination. That's why we need ReduceIf
here.
findmin
with missing
and NaN
Unfortunately, we do not have min
-compatible "total" ordering in Base
. Thus, we need to create a function that special-cases missing
and NaN
:
function isgreater(x, y)
xisnan = x != x
xisnan isa Bool || return false # x is missing
yisnan = y != y
yisnan isa Bool || return true # y is missing
xisnan && return false
yisnan && return true
return isless(y, x)
end
@assert isgreater(2, 1)
@assert isgreater(1, missing)
@assert isgreater(NaN, missing)
@assert isgreater(1, NaN)
@assert !isgreater(1, 2)
@assert !isgreater(missing, 1)
@assert !isgreater(missing, NaN)
@assert !isgreater(NaN, 1)
Using isgreater
instead of <
, we can define findmin_step′
like findmax_step′
:
function findmin_step′((argmin, min), (index, value))
argmin, min = isgreater(min, value) ? (index, value) : (argmin, min)
return argmin => min
end
foldl(findmax_step′, pairs([1, 2, missing, 3]))
3 => missing
Extrema (findminmax
) with early termination
As before, we can fuse findmin_step′
and findmax_step′
using TeeRF
. This can also be composed with ReduceIf
:
[1, 2, 3, 0, 2] |>
pairs |>
ReduceIf(ismissing ∘ last) |>
foldxl(TeeRF(findmin_step′, findmax_step′))
(4 => 0, 3 => 3)
[1, 2, missing, 0, 2] |>
pairs |>
ReduceIf(ismissing ∘ last) |>
foldxl(TeeRF(findmin_step′, findmax_step′))
(3 => missing, 3 => missing)
Ad-hoc imputation
Using Scan
, it is straightforward to fill missing
items with the last non-missing
item (last observation carried forward):
[1, 3, missing, 0, 2, missing, missing] |>
pairs |>
MapSplat(Pair) |>
Scan() do last, (k, v)
ismissing(v) ? last : k => v
end |>
collect
7-element Vector{Pair{Int64, Int64}}:
1 => 1
2 => 3
2 => 3
4 => 0
5 => 2
5 => 2
5 => 2
rightif
can also be used:
[1, 3, missing, 0, 2, missing, missing] |>
pairs |>
MapSplat(Pair) |>
Scan(rightif(!(ismissing ∘ right), last)) |>
collect
Note that the output still may contain missing
if the first item is missing
:
[missing, 1, missing] |>
pairs |>
MapSplat(Pair) |>
Scan(rightif(!(ismissing ∘ right), last)) |>
collect
3-element Vector{Pair{Int64, Union{Missing, Int64}}}:
1 => missing
2 => 1
2 => 1
This can be worked around by specifying init
argument for Scan
:
[missing, 1, missing] |>
pairs |>
MapSplat(Pair) |>
Scan(rightif(!(ismissing ∘ right), last), 0 => 0) |>
collect
3-element Vector{Pair{Int64, Int64}}:
0 => 0
2 => 1
2 => 1
This page was generated using Literate.jl.