Common patterns
POKI_PUT_TOC_HERE

Headerless CSV on input or output

Sometimes we get CSV files which lack a header. For example: POKI_RUN_COMMAND{{cat data/headerless.csv}}HERE

You can use Miller to add a header. The --implicit-csv-header flag applies positionally indexed labels (1, 2, 3, and so on): POKI_RUN_COMMAND{{mlr --csv --implicit-csv-header cat data/headerless.csv}}HERE

Following that, you can rename the positionally indexed labels to names with meaning for your context. For example: POKI_RUN_COMMAND{{mlr --csv --implicit-csv-header label name,age,status data/headerless.csv}}HERE

Likewise, if you need to produce CSV lacking a header line, you can pipe Miller’s output to the system command sed 1d, or use Miller’s --headerless-csv-output option: POKI_RUN_COMMAND{{head -5 data/colored-shapes.dkvp | mlr --ocsv cat}}HERE POKI_RUN_COMMAND{{head -5 data/colored-shapes.dkvp | mlr --ocsv --headerless-csv-output cat}}HERE
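If you want to go header-free on both input and output, you can combine the two flags. For example, something like this (a sketch along the same lines, using the same file) selects the first two columns while reading and writing without header lines: POKI_CARDIFY(mlr --csv --implicit-csv-header --headerless-csv-output cut -f 1,2 data/headerless.csv)HERE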

Lastly, we often say “CSV” or “TSV” when what we really have is positionally indexed data in columns separated by commas or tabs, respectively. In that case it’s perhaps simpler to use NIDX format, which was designed for this purpose. (See also POKI_PUT_LINK_FOR_PAGE(file-formats.html)HERE.) For example: POKI_RUN_COMMAND{{mlr --inidx --ifs comma --oxtab cut -f 1,3 data/headerless.csv}}HERE

Bulk rename of fields

Suppose you want to replace spaces with underscores in your column names: POKI_RUN_COMMAND{{cat data/spaces.csv}}HERE

The simplest way is to use mlr rename with -g (global replace, not just the first occurrence of a space within each field name) and -r (pattern-matching, rather than explicit single-column renames): POKI_RUN_COMMAND{{mlr --csv rename -g -r ' ,_' data/spaces.csv}}HERE POKI_RUN_COMMAND{{mlr --csv --opprint rename -g -r ' ,_' data/spaces.csv}}HERE

You can also do this with a for-loop: POKI_RUN_COMMAND{{cat data/bulk-rename-for-loop.mlr}}HERE POKI_RUN_COMMAND{{mlr --icsv --opprint put -f data/bulk-rename-for-loop.mlr data/spaces.csv}}HERE
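The shape of such a for-loop is: build a new map whose keys are the transformed names, then assign that map back to $*. Here is a minimal sketch of the same idea, using an out-of-stream variable as scratch space (the included file above may differ in its details):

mlr --icsv --opprint put '
  @newrec = {};
  for (oldk, v in $*) {
    @newrec[gsub(oldk, " ", "_")] = v;
  }
  $* = @newrec
' data/spaces.csv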

Search-and-replace over all fields

How do you do $name = gsub($name, "old", "new") across all fields? POKI_RUN_COMMAND{{cat data/sar.csv}}HERE POKI_RUN_COMMAND{{cat data/sar.mlr}}HERE POKI_RUN_COMMAND{{mlr --csv put -f data/sar.mlr data/sar.csv}}HERE
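One way to write this is a for-loop over $*, assigning back through computed field names; here is a minimal sketch which touches only string-valued fields (data/sar.mlr may differ in its details):

mlr --csv put '
  for (k, v in $*) {
    if (is_string(v)) {
      $[k] = gsub(v, "old", "new");
    }
  }
' data/sar.csv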

Full field renames and reassigns

Using Miller 5.0.0’s map literals and assigning to $*, you can fully generalize mlr rename, mlr reorder, etc.: POKI_RUN_COMMAND{{cat data/small}}HERE POKI_INCLUDE_AND_RUN_ESCAPED(data/full-reorg.sh)HERE
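For instance, here is a sketch (using the field names from data/small) which renames, reorders, and computes a new field in a single assignment to $*: POKI_CARDIFY(mlr --opprint put '$* = {"A": $a, "B": $b, "sum": $x + $y}' data/small)HERE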

Numbering and renumbering records

The awk-like built-in variable NR is incremented for each input record: POKI_RUN_COMMAND{{cat data/small}}HERE POKI_RUN_COMMAND{{mlr put '$nr = NR' data/small}}HERE

However, this is the record number within the original input stream — not after any filtering you may have done: POKI_RUN_COMMAND{{mlr filter '$a == "wye"' then put '$nr = NR' data/small}}HERE

There are two good options here. One is to use the cat verb with -n: POKI_RUN_COMMAND{{mlr filter '$a == "wye"' then cat -n data/small}}HERE

The other is to keep your own counter within the put DSL: POKI_RUN_COMMAND{{mlr filter '$a == "wye"' then put 'begin {@n = 1} $n = @n; @n += 1' data/small}}HERE

The difference is a matter of taste (although mlr cat -n puts the counter first).

Options for dealing with duplicate rows

If your data has records appearing multiple times, you can use mlr uniq to show and/or count the unique records.

If you want to look at partial uniqueness — for example, to show only the first record for each unique combination of the account_id and account_status fields — you might use mlr head -n 1 -g account_id,account_status. See also mlr head.
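For example, to show each distinct account_id/account_status pair along with how many times it occurs (the filename here is hypothetical): POKI_CARDIFY(mlr --icsv --opprint uniq -g account_id,account_status -c yourfile.csv)HERE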

Data-cleaning examples

Here are some ways to use the type-checking options described in POKI_PUT_LINK_FOR_PAGE(reference-dsl.html#Type-test_and_type-assertion_expressions)HERE. Suppose you have the following data file, with inconsistent typing for its boolean field. (Also imagine that, for the sake of discussion, we have a million-line file rather than a four-line file, so we can’t see it all at once and some automation is called for.) POKI_RUN_COMMAND{{cat data/het-bool.csv}}HERE

One option is to coerce everything to boolean, or integer: POKI_RUN_COMMAND{{mlr --icsv --opprint put '$reachable = boolean($reachable)' data/het-bool.csv}}HERE POKI_RUN_COMMAND{{mlr --icsv --opprint put '$reachable = int(boolean($reachable))' data/het-bool.csv}}HERE

A second option is to flag badly formatted data within the output stream: POKI_RUN_COMMAND{{mlr --icsv --opprint put '$format_ok = is_string($reachable)' data/het-bool.csv}}HERE

Or perhaps to flag badly formatted data outside the output stream: POKI_RUN_COMMAND{{mlr --icsv --opprint put 'if (!is_string($reachable)) {eprint "Malformed at NR=".NR} ' data/het-bool.csv}}HERE

A third way is to abort the process on the first instance of bad data: POKI_RUN_COMMAND_TOLERATING_ERROR{{mlr --csv put '$reachable = asserting_string($reachable)' data/het-bool.csv}}HERE
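The same type-testing functions can also be used simply to drop the malformed records; for example, something like: POKI_CARDIFY(mlr --icsv --opprint filter 'is_string($reachable)' data/het-bool.csv)HERE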

Splitting nested fields

Suppose you have a TSV file like this: POKI_INCLUDE_ESCAPED(data/nested.tsv)HERE

The simplest option is to use mlr nest: POKI_RUN_COMMAND{{mlr --tsv nest --explode --values --across-records -f b --nested-fs : data/nested.tsv}}HERE POKI_RUN_COMMAND{{mlr --tsv nest --explode --values --across-fields -f b --nested-fs : data/nested.tsv}}HERE

While mlr nest is simplest, let’s also take a look at a few ways to do this using the put DSL.

One option to split out the colon-delimited values in the b column is to use splitnv to create an integer-indexed map and loop over it, adding new fields to the current record: POKI_RUN_COMMAND{{mlr --from data/nested.tsv --itsv --oxtab put 'o=splitnv($b, ":"); for (k,v in o) {$["p".k]=v}'}}HERE

Another option is to loop over the same map from splitnv and use it (with put -q to suppress printing of the original record) to produce multiple records: POKI_RUN_COMMAND{{mlr --from data/nested.tsv --itsv --oxtab put -q 'o=splitnv($b, ":"); for (k,v in o) {emit mapsum($*, {"b":v})}'}}HERE POKI_RUN_COMMAND{{mlr --from data/nested.tsv --tsv put -q 'o=splitnv($b, ":"); for (k,v in o) {emit mapsum($*, {"b":v})}'}}HERE

Showing differences between successive queries

Suppose you have a database query which you run at one point in time, producing the output on the left, then again later producing the output on the right:
POKI_RUN_COMMAND{{cat data/previous_counters.csv}}HERE POKI_RUN_COMMAND{{cat data/current_counters.csv}}HERE

Now suppose you want to compute, for each key, the difference in its counter between the two queries. Since the color names aren’t all in the same order, nor are they all present on both sides, we can’t just paste the two files side by side and do some column-four-minus-column-two arithmetic.

First, rename counter columns to make them distinct: POKI_RUN_COMMAND{{mlr --csv rename count,previous_count data/previous_counters.csv > data/prevtemp.csv}}HERE POKI_RUN_COMMAND{{cat data/prevtemp.csv}}HERE POKI_RUN_COMMAND{{mlr --csv rename count,current_count data/current_counters.csv > data/currtemp.csv}}HERE POKI_RUN_COMMAND{{cat data/currtemp.csv}}HERE

Then, join on the key field(s), and use unsparsify to zero-fill counters absent on one side but present on the other. Use --ul and --ur to emit unpaired records (namely, purple on the left and yellow on the right): POKI_INCLUDE_AND_RUN_ESCAPED(data/previous-to-current.sh)HERE
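Putting that together, the join step looks something like the following sketch, assuming the key column is named color (the included script may differ in its details): POKI_CARDIFY(mlr --icsv --opprint join -u -j color --ul --ur -f data/prevtemp.csv then unsparsify --fill-with 0 then put '$count_delta = $current_count - $previous_count' data/currtemp.csv)HERE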

Finding missing dates

Suppose you have some date-stamped data which may (or may not) be missing entries for one or more dates: POKI_RUN_COMMAND{{head -n 10 data/miss-date.csv}}HERE POKI_RUN_COMMAND{{wc -l data/miss-date.csv}}HERE

Since there are 1372 lines in the data file, some automation is called for. To find the missing dates, you can convert the dates to seconds since the epoch using strptime, then compute adjacent differences (the cat -n simply inserts record-counters): POKI_INCLUDE_AND_RUN_ESCAPED(data/miss-date-1.sh)HERE

Then, filter for adjacent difference not being 86400 (the number of seconds in a day): POKI_INCLUDE_AND_RUN_ESCAPED(data/miss-date-2.sh)HERE

Given this, it’s now easy to see where the gaps are: POKI_RUN_COMMAND{{mlr cat -n then filter '$n >= 770 && $n <= 780' data/miss-date.csv}}HERE POKI_RUN_COMMAND{{mlr cat -n then filter '$n >= 1115 && $n <= 1125' data/miss-date.csv}}HERE
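Both steps can also be strung into a single pipeline, using the step verb for the adjacent differences. Here is a sketch, assuming the date column is named date and is formatted as %Y-%m-%d (the included scripts may differ in their details): POKI_CARDIFY(mlr --icsv --opprint cat -n then put '$t = strptime($date, "%Y-%m-%d")' then step -a delta -f t then filter '$n > 1 && $t_delta != 86400' data/miss-date.csv)HERE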

Two-pass algorithms

Miller is a streaming record processor; commands are performed once per record. This makes Miller particularly suitable for single-pass algorithms, allowing many of its verbs to process files that are (much) larger than the amount of RAM present in your system. (Of course, Miller verbs such as sort, tac, etc. all must ingest and retain all input records before emitting any output records.) You can also use out-of-stream variables to perform multi-pass computations, at the price of retaining all input records in memory.
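As a quick illustration of how out-of-stream variables persist across the whole record stream, the following counts shapes in data/colored-shapes.dkvp (seen above) and emits the totals only in the end block: POKI_CARDIFY(mlr --opprint put -q '@count_by_shape[$shape] += 1; end {emit @count_by_shape, "shape"}' data/colored-shapes.dkvp)HERE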

Two-pass algorithms: computation of percentages

For example, mapping numeric values down a column to the percentage between their min and max values is two-pass: on the first pass you find the min and max values, then on the second, map each record’s value to a percentage. POKI_INCLUDE_AND_RUN_ESCAPED(data/two-pass-percentage.sh)HERE
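To make the arithmetic concrete: if a column’s minimum is 2 and its maximum is 10, then the value 6 maps to 100 * (6 - 2) / (10 - 2) = 50 percent.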

Two-pass algorithms: line-number ratios

Similarly, finding the total record count requires first reading through all the data: POKI_INCLUDE_AND_RUN_ESCAPED(data/two-pass-record-numbers.sh)HERE

Two-pass algorithms: records having max value

The idea is to retain records having the largest value of n in the following data: POKI_RUN_COMMAND{{mlr --itsv --opprint cat data/maxrows.tsv}}HERE

Of course, the largest value of n isn’t known until after all data have been read. Using an out-of-stream variable we can retain all records as they are read, then filter them at the end: POKI_RUN_COMMAND{{cat data/maxrows.mlr}}HERE POKI_RUN_COMMAND{{mlr --itsv --opprint put -q -f data/maxrows.mlr data/maxrows.tsv}}HERE

Rectangularizing data

Suppose you have a method (in whatever language) which is printing things of the form POKI_INCLUDE_ESCAPED(data/rect-outer.txt)HERE and then calls another method which prints things of the form POKI_INCLUDE_ESCAPED(data/rect-middle.txt)HERE and then, perhaps, that second method calls a third method which prints things of the form POKI_INCLUDE_ESCAPED(data/rect-inner.txt)HERE so that your program’s output is POKI_INCLUDE_ESCAPED(data/rect.txt)HERE The idea here is that middles starting with a 1 belong to the outer value of 1, and so on. (For example, the outer values might be account IDs, the middle values might be invoice IDs, and the inner values might be invoice line-items.)

If you want all the middle and inner lines to carry the context of which outers they belong to, one option is to modify your software to pass that context through all those methods. Alternatively, rather than refactoring your code just to handle some ad-hoc log-data formatting, you can use Miller to rectangularize the data after the fact. The idea is to use an out-of-stream variable to accumulate fields across records: clear that variable when you see an outer ID, accumulate fields as they arrive, and emit output when you see the inner IDs. POKI_INCLUDE_AND_RUN_ESCAPED(data/rect.sh)HERE

Regularizing ragged CSV

Miller handles compliant CSV: in particular, it’s an error if the number of data fields in a given data line doesn’t match the number of fields in the header line. But in the event that you have a CSV file in which some lines have fewer than the full number of fields, you can use Miller to pad them out. The trick is to use NIDX format, for which each line stands on its own without respect to a header line. POKI_RUN_COMMAND{{cat data/ragged.csv}}HERE POKI_INCLUDE_AND_RUN_ESCAPED(data/ragged-csv.sh)HERE or, more simply, POKI_INCLUDE_AND_RUN_ESCAPED(data/ragged-csv-2.sh)HERE
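Here is a sketch of that NIDX approach, assuming the full width is three columns (the included scripts may differ in their details): POKI_CARDIFY(mlr --inidx --ifs comma --onidx --ofs comma put '$["2"] = is_present($["2"]) ? $["2"] : ""; $["3"] = is_present($["3"]) ? $["3"] : ""' data/ragged.csv)HERE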

Feature-counting

Suppose you have some heterogeneous data like this: POKI_INCLUDE_ESCAPED(data/features.json)HERE

A reasonable question to ask is: how many occurrences of each field name are there? And what percentage of the total record count has each field? Since the denominator of the percentage is not known until the end, this is a two-pass algorithm: POKI_INCLUDE_ESCAPED(data/feature-count.mlr)HERE

Then POKI_RUN_COMMAND{{mlr --json put -q -f data/feature-count.mlr data/features.json}}HERE POKI_RUN_COMMAND{{mlr --ijson --opprint put -q -f data/feature-count.mlr data/features.json}}HERE
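The same two-pass idea can also be written inline on the command line; here is a sketch which may differ in detail from data/feature-count.mlr: POKI_CARDIFY(mlr --ijson --opprint put -q 'for (k, v in $*) {@counts[k] += 1} end {for (k, v in @counts) {emit mapsum({"feature": k}, {"count": v, "percent": 100 * v / NR})}}' data/features.json)HERE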

Unsparsing

The previous section discussed how to fill out missing data fields within CSV having a full header line, where the list of all field names is present in the header. Next, let’s look at a related problem: we have data where each record has various key names, but we want to produce rectangular output having the union of all key names.

For example, suppose you have JSON input like this: POKI_RUN_COMMAND{{cat data/sparse.json}}HERE

There are field names a, b, v, u, x, w in the data — but not all in every record. Since we don’t know the names of all the keys until we’ve read them all, this needs to be a two-pass algorithm. On the first pass, remember all the unique key names and all the records; on the second pass, loop through the records filling in absent values, then producing output. Use put -q since we don’t want to produce per-record output, only emitting output in the end block: POKI_RUN_COMMAND{{cat data/unsparsify.mlr}}HERE POKI_RUN_COMMAND{{mlr --json put -q -f data/unsparsify.mlr data/sparse.json}}HERE POKI_RUN_COMMAND{{mlr --ijson --ocsv put -q -f data/unsparsify.mlr data/sparse.json}}HERE POKI_RUN_COMMAND{{mlr --ijson --opprint put -q -f data/unsparsify.mlr data/sparse.json}}HERE

There is a keystroke-saving verb for this: mlr unsparsify.
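With it, something like the following produces output similar to the above: POKI_CARDIFY(mlr --ijson --opprint unsparsify data/sparse.json)HERE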

Parsing log-file output

This, of course, depends highly on what’s in your log files. But, as an example, suppose you have log-file lines such as the following: POKI_CARDIFY(2015-10-08 08:29:09,445 INFO com.company.path.to.ClassName @ [sometext] various/sorts/of data {& punctuation} hits=1 status=0 time=2.378)HERE I prefer to pre-filter with grep and/or sed to extract the structured text, then hand that to Miller. Example: POKI_CARDIFY(grep 'various sorts' *.log | sed 's/.*} //' | mlr --fs space --repifs --oxtab stats1 -a min,p10,p50,p90,max -f time -g status)HERE

Memoization with out-of-stream variables

The recursive function for the Fibonacci sequence is famous for its computational complexity. Namely, using f(0)=1, f(1)=1, f(n)=f(n-1)+f(n-2) for n≥2, the evaluation tree branches left as well as right at each non-trivial level, resulting in millions or more paths down to the f(0)/f(1) base cases for larger n. This program POKI_INCLUDE_ESCAPED(data/fibo-uncached.sh)HERE

produces output like this:

i  o      fcount  seconds_delta
1  1      1       0
2  2      3       0.000039101
3  3      5       0.000015974
4  5      9       0.000019073
5  8      15      0.000026941
6  13     25      0.000036955
7  21     41      0.000056028
8  34     67      0.000086069
9  55     109     0.000134945
10 89     177     0.000217915
11 144    287     0.000355959
12 233    465     0.000506163
13 377    753     0.000811815
14 610    1219    0.001297235
15 987    1973    0.001960993
16 1597   3193    0.003417969
17 2584   5167    0.006215811
18 4181   8361    0.008294106
19 6765   13529   0.012095928
20 10946  21891   0.019592047
21 17711  35421   0.031193972
22 28657  57313   0.057254076
23 46368  92735   0.080307961
24 75025  150049  0.129482031
25 121393 242785  0.213325977
26 196418 392835  0.334423065
27 317811 635621  0.605969906
28 514229 1028457 0.971235037

Note that the time it takes to evaluate the function is blowing up exponentially as the input argument increases. Using @-variables, which persist across records, we can cache and reuse the results of previous computations: POKI_INCLUDE_ESCAPED(data/fibo-cached.sh)HERE

with output like this:

i  o      fcount seconds_delta
1  1      1      0
2  2      3      0.000053883
3  3      3      0.000035048
4  5      3      0.000045061
5  8      3      0.000014067
6  13     3      0.000028849
7  21     3      0.000028133
8  34     3      0.000027895
9  55     3      0.000014067
10 89     3      0.000015020
11 144    3      0.000012875
12 233    3      0.000033140
13 377    3      0.000014067
14 610    3      0.000012875
15 987    3      0.000029087
16 1597   3      0.000013828
17 2584   3      0.000013113
18 4181   3      0.000012875
19 6765   3      0.000013113
20 10946  3      0.000012875
21 17711  3      0.000013113
22 28657  3      0.000013113
23 46368  3      0.000015974
24 75025  3      0.000012875
25 121393 3      0.000013113
26 196418 3      0.000012875
27 317811 3      0.000013113
28 514229 3      0.000012875
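
Note that with caching, fcount no longer grows and the per-call time stays flat. For reference, the heart of the cached version is a user-defined function which consults an @-variable before recursing; here is a minimal sketch of the memoization pattern (the included data/fibo-cached.sh may differ in its details):

func f(n) {
    if (is_present(@memo[n])) {
        return @memo[n];
    }
    rv = n < 2 ? 1 : f(n-1) + f(n-2);
    @memo[n] = rv;
    return rv;
}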