Sometimes we get CSV files which lack a header. For example:
POKI_RUN_COMMAND{{cat data/headerless.csv}}HERE
You can use Miller to add a header. The --implicit-csv-header flag applies positionally indexed labels:
POKI_RUN_COMMAND{{mlr --csv --implicit-csv-header cat data/headerless.csv}}HERE
Following that, you can rename the positionally indexed labels to names with meaning for your context.
For example:
POKI_RUN_COMMAND{{mlr --csv --implicit-csv-header label name,age,status data/headerless.csv}}HERE
Likewise, if you need to produce CSV which is lacking its header, you can pipe Miller’s output
to the system command sed 1d, or you can use Miller’s --headerless-csv-output option:
POKI_RUN_COMMAND{{head -5 data/colored-shapes.dkvp | mlr --ocsv cat}}HERE
POKI_RUN_COMMAND{{head -5 data/colored-shapes.dkvp | mlr --ocsv --headerless-csv-output cat}}HERE
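For instance, the sed route mentioned above simply deletes the header line from otherwise-normal CSV output:
POKI_CARDIFY(head -5 data/colored-shapes.dkvp | mlr --ocsv cat | sed 1d)HERE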
Lastly, we often say “CSV” or “TSV” when what we really have is
positionally indexed data in columns separated by commas or tabs,
respectively. In this case it’s perhaps simpler to just use NIDX
format, which was designed for this purpose. (See also
POKI_PUT_LINK_FOR_PAGE(file-formats.html)HERE.) For example:
POKI_RUN_COMMAND{{mlr --inidx --ifs comma --oxtab cut -f 1,3 data/headerless.csv}}HERE
Bulk rename of fields
Suppose you want to replace spaces with underscores in your column names:
POKI_RUN_COMMAND{{cat data/spaces.csv}}HERE
The simplest way is to use mlr rename with -g (for global
replace, i.e. not just the first occurrence of a space within each field name) and -r
for pattern-matching (rather than explicit single-column renames):
POKI_RUN_COMMAND{{mlr --csv rename -g -r ' ,_' data/spaces.csv}}HERE
POKI_RUN_COMMAND{{mlr --csv --opprint rename -g -r ' ,_' data/spaces.csv}}HERE
You can also do this with a for-loop:
POKI_RUN_COMMAND{{cat data/bulk-rename-for-loop.mlr}}HERE
POKI_RUN_COMMAND{{mlr --icsv --opprint put -f data/bulk-rename-for-loop.mlr data/spaces.csv}}HERE
Full field renames and reassigns
Using Miller 5.0.0’s map literals and assigning to $*, you can fully generalize
mlr rename,
mlr reorder,
etc.:
POKI_RUN_COMMAND{{cat data/small}}HERE
POKI_INCLUDE_AND_RUN_ESCAPED(data/full-reorg.sh)HERE
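For instance, a one-shot rename-reorder-and-drop using a map literal might look like the following. (This is just a sketch, assuming the field names a, b, i, x, y shown above; the included script may differ.)
POKI_CARDIFY(mlr --opprint put '$* = {"id": $i, "name": $a, "value": $x}' data/small)HERE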
Numbering and renumbering records
The awk-like built-in variable NR is incremented for each input record:
POKI_RUN_COMMAND{{cat data/small}}HERE
POKI_RUN_COMMAND{{mlr put '$nr = NR' data/small}}HERE
However, this is the record number within the original input stream
— not after any filtering you may have done:
POKI_RUN_COMMAND{{mlr filter '$a == "wye"' then put '$nr = NR' data/small}}HERE
There are two good options here. One is to use the cat verb with -n:
POKI_RUN_COMMAND{{mlr filter '$a == "wye"' then cat -n data/small}}HERE
The other is to keep your own counter within the put DSL:
POKI_RUN_COMMAND{{mlr filter '$a == "wye"' then put 'begin {@n = 1} $n = @n; @n += 1' data/small}}HERE
The difference is a matter of taste (although mlr cat -n puts the counter first).
Data-cleaning examples
Here are some ways to use the type-checking options as described in
the POKI_PUT_LINK_FOR_PAGE(reference-dsl.html#Type-test_and_type-assertion_expressions)HERE.
Suppose you have the following data file, with inconsistent typing for its boolean field.
(Also imagine that, for the sake of discussion, we have a million-line file
rather than a four-line file, so we can’t see it all at once and some
automation is called for.)
POKI_RUN_COMMAND{{cat data/het-bool.csv}}HERE
One option is to coerce everything to boolean, or integer:
POKI_RUN_COMMAND{{mlr --icsv --opprint put '$reachable = boolean($reachable)' data/het-bool.csv}}HERE
POKI_RUN_COMMAND{{mlr --icsv --opprint put '$reachable = int(boolean($reachable))' data/het-bool.csv}}HERE
A second option is to flag badly formatted data within the output stream:
POKI_RUN_COMMAND{{mlr --icsv --opprint put '$format_ok = is_string($reachable)' data/het-bool.csv}}HERE
Or perhaps to flag badly formatted data outside the output stream:
POKI_RUN_COMMAND{{mlr --icsv --opprint put 'if (!is_string($reachable)) {eprint "Malformed at NR=".NR} ' data/het-bool.csv}}HERE
A third way is to abort the process on the first instance of bad data:
POKI_RUN_COMMAND_TOLERATING_ERROR{{mlr --csv put '$reachable = asserting_string($reachable)' data/het-bool.csv}}HERE
Splitting nested fields
Suppose you have a TSV file like this:
POKI_INCLUDE_ESCAPED(data/nested.tsv)HERE
The simplest option is to use mlr nest:
POKI_RUN_COMMAND{{mlr --tsv nest --explode --values --across-records -f b --nested-fs : data/nested.tsv}}HERE
POKI_RUN_COMMAND{{mlr --tsv nest --explode --values --across-fields -f b --nested-fs : data/nested.tsv}}HERE
While mlr nest is simplest, let’s also take a look at a few ways to do this using the
put DSL.
One option to split out the colon-delimited values in the b
column is to use splitnv to create an integer-indexed map and loop
over it, adding new fields to the current record:
POKI_RUN_COMMAND{{mlr --from data/nested.tsv --itsv --oxtab put 'o=splitnv($b, ":"); for (k,v in o) {$["p".k]=v}'}}HERE
while another is to loop over the same map from splitnv and use
it (with put -q to suppress printing the original record) to produce
multiple records:
POKI_RUN_COMMAND{{mlr --from data/nested.tsv --itsv --oxtab put -q 'o=splitnv($b, ":"); for (k,v in o) {emit mapsum($*, {"b":v})}'}}HERE
POKI_RUN_COMMAND{{mlr --from data/nested.tsv --tsv put -q 'o=splitnv($b, ":"); for (k,v in o) {emit mapsum($*, {"b":v})}'}}HERE
Showing differences between successive queries
Suppose you have a database query which you run at one point in time, producing the first output below,
then again later, producing the second. Here they’ve been saved to data/previous_counters.csv and data/current_counters.csv, respectively:
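POKI_RUN_COMMAND{{cat data/previous_counters.csv}}HERE
POKI_RUN_COMMAND{{cat data/current_counters.csv}}HERE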
And suppose you want to compute, for each key, the difference between the
current counter and the previous one. Since the color names aren’t all in the same order, nor
are they all present on both sides, we can’t just paste the two files
side-by-side and do some column-four-minus-column-two arithmetic.
First, rename counter columns to make them distinct:
POKI_RUN_COMMAND{{mlr --csv rename count,previous_count data/previous_counters.csv > data/prevtemp.csv}}HERE
POKI_RUN_COMMAND{{cat data/prevtemp.csv}}HERE
POKI_RUN_COMMAND{{mlr --csv rename count,current_count data/current_counters.csv > data/currtemp.csv}}HERE
POKI_RUN_COMMAND{{cat data/currtemp.csv}}HERE
Then, join on the key field(s), and use unsparsify to zero-fill counters
absent on one side but present on the other. Use --ul and
--ur to emit unpaired records (namely, purple in the previous data and yellow in the current):
POKI_INCLUDE_AND_RUN_ESCAPED(data/previous-to-current.sh)HERE
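The included script does something along these lines. (This is a sketch, assuming the key column is named color; the joined-then-unsparsified records are ready for a subtraction into a hypothetical count_delta field.)
POKI_CARDIFY(mlr --icsv --opprint join -j color --ul --ur -u -f data/prevtemp.csv then unsparsify --fill-with 0 then put '$count_delta = $current_count - $previous_count' data/currtemp.csv)HERE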
Finding missing dates
Suppose you have some date-stamped data which may (or may not) be missing entries for one or more dates:
POKI_RUN_COMMAND{{head -n 10 data/miss-date.csv}}HERE
POKI_RUN_COMMAND{{wc -l data/miss-date.csv}}HERE
Since there are 1372 lines in the data file, some automation is called for.
To find the missing dates, you can convert the dates to seconds since the epoch
using strptime, then compute adjacent differences (the cat -n
simply inserts record-counters):
POKI_INCLUDE_AND_RUN_ESCAPED(data/miss-date-1.sh)HERE
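That first step might be written along these lines. (This is just a sketch, assuming the file is CSV with a single date column named date in YYYY-MM-DD format; the included script may differ.)
POKI_CARDIFY(mlr --icsv --from data/miss-date.csv cat -n then put '$stamp = strptime($date, "%Y-%m-%d")' then step -a delta -f stamp)HERE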
Then, filter for adjacent difference not being 86400 (the number of seconds in a day):
POKI_INCLUDE_AND_RUN_ESCAPED(data/miss-date-2.sh)HERE
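Continuing the same sketch, the filter step might look like this, using the hypothetical stamp_delta field from above and skipping the first record (whose delta is trivial):
POKI_CARDIFY(mlr --icsv --from data/miss-date.csv cat -n then put '$stamp = strptime($date, "%Y-%m-%d")' then step -a delta -f stamp then filter '$n > 1 && $stamp_delta != 86400')HERE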
Given this, it’s now easy to see where the gaps are:
POKI_RUN_COMMAND{{mlr cat -n then filter '$n >= 770 && $n <= 780' data/miss-date.csv}}HERE
POKI_RUN_COMMAND{{mlr cat -n then filter '$n >= 1115 && $n <= 1125' data/miss-date.csv}}HERE
Two-pass algorithms
Miller is a streaming record processor; commands are performed once per
record. This makes Miller particularly suitable for single-pass algorithms,
allowing many of its verbs to process files that are (much) larger than the
amount of RAM present in your system. (Of course, Miller verbs such as
sort, tac, etc. all must ingest and retain all input records
before emitting any output records.) You can also use out-of-stream variables
to perform multi-pass computations, at the price of retaining all input records
in memory.
Two-pass algorithms: computation of percentages
For example, mapping numeric values down a column to the percentage
between their min and max values is two-pass: on the first pass you find the
min and max values, then on the second, map each record’s value to a
percentage.
POKI_INCLUDE_AND_RUN_ESCAPED(data/two-pass-percentage.sh)HERE
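A sketch of such a two-pass computation over the x field of data/small (the included script above may differ in details):
POKI_CARDIFY(mlr --opprint --from data/small put -q 'begin { @min = 10e10; @max = -10e10 } @min = min(@min, $x); @max = max(@max, $x); @recs[NR] = $*; end { for (k in @recs) { @recs[k]["x_pct"] = 100 * (@recs[k]["x"] - @min) / (@max - @min) } emit @recs, "NR" }')HERE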
Two-pass algorithms: line-number ratios
Similarly, finding the total record count requires first reading through
all the data:
POKI_INCLUDE_AND_RUN_ESCAPED(data/two-pass-record-numbers.sh)HERE
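A sketch of that, computing each record’s fraction of the total record count into hypothetical nr and nr_frac fields (again, the included script above may differ):
POKI_CARDIFY(mlr --opprint --from data/small put -q '@count = NR; @recs[NR] = mapsum($*, {"nr": NR}); end { for (k in @recs) { @recs[k]["nr_frac"] = @recs[k]["nr"] / @count } emit @recs, "NR" }')HERE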
Two-pass algorithms: records having max value
The idea is to retain records having the largest value of n in the
following data:
POKI_RUN_COMMAND{{mlr --itsv --opprint cat data/maxrows.tsv}}HERE
Of course, the largest value of n isn’t known until after
all data have been read. Using an out-of-stream variable we can retain all
records as they are read, then filter them at the end:
POKI_RUN_COMMAND{{cat data/maxrows.mlr}}HERE
POKI_RUN_COMMAND{{mlr --itsv --opprint put -q -f data/maxrows.mlr data/maxrows.tsv}}HERE
Rectangularizing data
Suppose you have a method (in whatever language) which is printing things of the form
POKI_INCLUDE_ESCAPED(data/rect-outer.txt)HERE
and then calls another method which prints things of the form
POKI_INCLUDE_ESCAPED(data/rect-middle.txt)HERE
and then, perhaps, that second method calls a third method which prints things of the form
POKI_INCLUDE_ESCAPED(data/rect-inner.txt)HERE
with the result that your program’s output is
POKI_INCLUDE_ESCAPED(data/rect.txt)HERE
The idea here is that middles starting with a 1 belong to the outer value of 1,
and so on. (For example, the outer values might be account IDs, the middle
values might be invoice IDs, and the inner values might be invoice line-items.)
If you want all the middle and inner lines to have the context of which outers
they belong to, you can modify your software to pass all those through your
methods. Alternatively, don’t refactor your code just to handle some
ad-hoc log-data formatting — instead, use the following to rectangularize
the data. The idea is to use an out-of-stream variable to accumulate fields
across records. Clear that variable when you see an outer ID; accumulate
fields; emit output when you see the inner IDs.
POKI_INCLUDE_AND_RUN_ESCAPED(data/rect.sh)HERE
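In outline, with hypothetical field names outer_id and inner_id standing in for whatever actually distinguishes your outer and inner lines (middle lines fall into the else branch), such a script might look like this:
POKI_CARDIFY(mlr put -q 'if (is_present($outer_id)) { @context = $* } elif (is_present($inner_id)) { emit mapsum(@context, $*) } else { @context = mapsum(@context, $*) }' data/rect.txt)HERE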
Regularizing ragged CSV
Miller handles compliant CSV: in particular, it’s an error if the
number of data fields in a given data line doesn’t match the number of
fields in the header line. But in the event that you have a CSV file in which some lines
have fewer than the full number of fields, you can use Miller to pad them out.
The trick is to use NIDX format, for which each line stands on its own without
respect to a header line.
POKI_RUN_COMMAND{{cat data/ragged.csv}}HERE
POKI_INCLUDE_AND_RUN_ESCAPED(data/ragged-csv.sh)HERE
or, more simply,
POKI_INCLUDE_AND_RUN_ESCAPED(data/ragged-csv-2.sh)HERE
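Spelled out by hand, the NIDX idea is to track the maximum field count seen so far and pad shorter records out to it, along these lines. (This is just a sketch, which assumes the first record carries the full field count; the included scripts above may differ.)
POKI_CARDIFY(mlr --from data/ragged.csv --fs comma --nidx put '@maxnf = max(@maxnf, NF); @nf = NF; while (@nf < @maxnf) { @nf += 1; $[@nf] = "" }')HERE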
Feature-counting
Suppose you have some heterogeneous data like this:
POKI_INCLUDE_ESCAPED(data/features.json)HERE
A reasonable question to ask is: how many occurrences of each field name are
there? And in what percentage of all records does each field appear? Since the
denominator of the percentage is not known until the end, this is a two-pass
algorithm:
POKI_INCLUDE_ESCAPED(data/feature-count.mlr)HERE
Then
POKI_RUN_COMMAND{{mlr --json put -q -f data/feature-count.mlr data/features.json}}HERE
POKI_RUN_COMMAND{{mlr --ijson --opprint put -q -f data/feature-count.mlr data/features.json}}HERE
Unsparsing
The previous section discussed how to fill out missing data fields within
CSV having a full header line, so that the list of all field names is present
within the header line. Next, let’s look at a related problem: we have
data where each record has various key names, but we want to produce rectangular
output having the union of all key names.
For example, suppose you have JSON input like this:
POKI_RUN_COMMAND{{cat data/sparse.json}}HERE
There are field names a, b, v, u,
x, w in the data — but not all in every record. Since
we don’t know the names of all the keys until we’ve read them all,
this needs to be a two-pass algorithm. On the first pass, remember all the
unique key names and all the records; on the second pass, loop through the
records filling in absent values, then producing output. Use put -q
since we don’t want to produce per-record output, only emitting output in
the end block:
POKI_RUN_COMMAND{{cat data/unsparsify.mlr}}HERE
POKI_RUN_COMMAND{{mlr --json put -q -f data/unsparsify.mlr data/sparse.json}}HERE
POKI_RUN_COMMAND{{mlr --ijson --ocsv put -q -f data/unsparsify.mlr data/sparse.json}}HERE
POKI_RUN_COMMAND{{mlr --ijson --opprint put -q -f data/unsparsify.mlr data/sparse.json}}HERE
There is a keystroke-saving verb for this: mlr unsparsify.
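For example, this should give the same rectangular output as above (unsparsify fills absent values with empty strings by default):
POKI_CARDIFY(mlr --ijson --opprint unsparsify data/sparse.json)HERE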
Parsing log-file output
This, of course, depends highly on what’s in your log files. But, as
an example, suppose you have log-file lines such as
POKI_CARDIFY(2015-10-08 08:29:09,445 INFO com.company.path.to.ClassName @ [sometext] various/sorts/of data {& punctuation} hits=1 status=0 time=2.378)HERE
I prefer to pre-filter with grep and/or sed to extract the structured text, then hand that to Miller. Example:
POKI_CARDIFY(grep 'various sorts' *.log | sed 's/.*} //' | mlr --fs space --repifs --oxtab stats1 -a min,p10,p50,p90,max -f time -g status)HERE
Memoization with out-of-stream variables
The recursive function for the Fibonacci sequence is famous for its computational complexity.
Namely, using
f(0)=1,
f(1)=1,
f(n)=f(n-1)+f(n-2) for n≥2,
the evaluation tree branches left as well as right at each non-trivial level, resulting in millions
or more paths to the root 0/1 nodes for larger n. This program
POKI_INCLUDE_ESCAPED(data/fibo-uncached.sh)HERE
produces output like this:
Note that the time it takes to evaluate the function is blowing up exponentially as the input argument
increases. Using @-variables, which persist across records, we can cache and reuse the results
of previous computations:
POKI_INCLUDE_ESCAPED(data/fibo-cached.sh)HERE
with output like this: