POKI_PUT_TOC_HERE

Parsing log-file output

How best to do this depends on what’s in your log files. As an example, suppose you have log-file lines such as POKI_CARDIFY(2015-10-08 08:29:09,445 INFO com.company.path.to.ClassName @ [sometext] various/sorts/of data {& punctuation} hits=1 status=0 time=2.378)HERE I prefer to pre-filter with grep and/or sed to extract the structured key-value text, then hand that to Miller. Example: POKI_CARDIFY(grep 'various sorts' *.log | sed 's/.*} //' | mlr --fs space --repifs --oxtab stats1 -a min,p10,p50,p90,max -f time -g status)HERE

Rectangularizing data

Suppose you have a method (in whatever language) which prints things of the form POKI_INCLUDE_ESCAPED(data/rect-outer.txt)HERE and then calls another method which prints things of the form POKI_INCLUDE_ESCAPED(data/rect-middle.txt)HERE and then, perhaps, that second method calls a third method which prints things of the form POKI_INCLUDE_ESCAPED(data/rect-inner.txt)HERE with the result that your program’s output is POKI_INCLUDE_ESCAPED(data/rect.txt)HERE The idea here is that middles starting with a 1 belong to the outer value of 1, and so on. (For example, the outer values might be account IDs, the middle values might be invoice IDs, and the inner values might be invoice line-items.) If you want all the middle and inner lines to carry the context of which outers they belong to, you can modify your software to pass the outer and middle IDs down through your methods. Alternatively, you can use the following to rectangularize the data. The idea is to use an out-of-stream variable to accumulate fields across records: clear that variable when you see an outer ID, accumulate fields as you see middle IDs, and emit output when you see the inner IDs. POKI_INCLUDE_AND_RUN_ESCAPED(data/rect.sh)HERE
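
For readers who want to see the accumulate-and-emit logic outside of Miller, here is a minimal Python sketch of the same idea. The sample lines and the field names (outer, middle, inner1, inner2) are hypothetical stand-ins and may not match data/rect.txt; the dictionary named context plays the role of the out-of-stream variable.

```python
# Hypothetical sample lines; the real data/rect.txt may differ.
LINES = [
    "outer=1",
    "middle=10",
    "inner1=100,inner2=101",
    "middle=11",
    "inner1=120,inner2=121",
    "outer=2",
    "middle=12",
    "inner1=130,inner2=131",
]


def parse(line):
    """Split a 'k1=v1,k2=v2' line into a dict, keeping values as strings."""
    return dict(pair.split("=", 1) for pair in line.split(","))


def rectangularize(lines):
    """Carry outer/middle context forward; emit one full record per inner line."""
    context = {}  # stands in for the out-of-stream variable
    for line in lines:
        fields = parse(line)
        if "outer" in fields:
            context = dict(fields)       # new outer ID: clear accumulated state
        elif "middle" in fields:
            context.update(fields)       # accumulate (or replace) the middle ID
        else:
            yield {**context, **fields}  # inner line: emit with full context


for record in rectangularize(LINES):
    print(",".join(f"{k}={v}" for k, v in record.items()))
```

Each emitted record carries its outer and middle IDs alongside the inner fields, so the output is rectangular.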

Bulk rename of field names

POKI_RUN_COMMAND{{cat data/spaces.csv}}HERE POKI_RUN_COMMAND{{mlr --csv --rs lf rename -r -g ' ,_' data/spaces.csv}}HERE POKI_RUN_COMMAND{{mlr --csv --irs lf --opprint rename -r -g ' ,_' data/spaces.csv}}HERE

You can also do this with a for-loop, but note that it puts the modified fields after the unmodified fields: POKI_RUN_COMMAND{{cat data/bulk-rename-for-loop.mlr}}HERE POKI_RUN_COMMAND{{mlr --icsv --irs lf --opprint put -f data/bulk-rename-for-loop.mlr data/spaces.csv}}HERE

Headerless CSV on input or output

Sometimes we get CSV files which lack a header. For example: POKI_RUN_COMMAND{{cat data/headerless.csv}}HERE

You can use Miller to add a header: the --implicit-csv-header option applies positionally indexed labels: POKI_RUN_COMMAND{{mlr --csv --rs lf --implicit-csv-header cat data/headerless.csv}}HERE POKI_RUN_COMMAND{{mlr --icsv --irs lf --implicit-csv-header --opprint cat data/headerless.csv}}HERE

Following that, you can rename the positionally indexed labels to names with meaning for your context. For example: POKI_RUN_COMMAND{{mlr --csv --rs lf --implicit-csv-header label name,age,status data/headerless.csv}}HERE POKI_RUN_COMMAND{{mlr --icsv --rs lf --implicit-csv-header --opprint label name,age,status data/headerless.csv}}HERE

Likewise, if you need to produce CSV which is lacking its header, you can pipe Miller’s output to the system command sed 1d, or you can use Miller’s --headerless-csv-output option: POKI_RUN_COMMAND{{head -5 data/colored-shapes.dkvp | mlr --ocsv cat}}HERE POKI_RUN_COMMAND{{head -5 data/colored-shapes.dkvp | mlr --ocsv --headerless-csv-output cat}}HERE

Regularizing ragged CSV

Miller handles compliant CSV: in particular, it’s an error if the number of data fields in a given data line doesn’t match the number of header fields. But in the event that you have a CSV file in which some lines have fewer than the full number of fields, you can use Miller to pad them out. The trick is to use NIDX format, for which each line stands on its own without respect to a header line. POKI_RUN_COMMAND{{cat data/ragged.csv}}HERE POKI_INCLUDE_AND_RUN_ESCAPED(data/ragged-csv.sh)HERE or, more simply, POKI_INCLUDE_AND_RUN_ESCAPED(data/ragged-csv-2.sh)HERE
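
As a point of comparison, the padding step can be sketched in Python. This is only an illustration under assumptions: the sample text is hypothetical (it is not the contents of data/ragged.csv), short rows are padded with empty strings out to the header width, and rows longer than the header are left untouched.

```python
import csv
import io

# Hypothetical ragged input; not the contents of data/ragged.csv.
RAGGED = "a,b,c\n1,2,3\n4,5\n6\n"


def pad_ragged(text, fill=""):
    """Pad every short data row with `fill` so it is as wide as the header."""
    rows = list(csv.reader(io.StringIO(text)))
    header, data = rows[0], rows[1:]
    # [fill] * n is an empty list when n <= 0, so long rows pass through as-is.
    padded = [row + [fill] * (len(header) - len(row)) for row in data]
    return [header] + padded


for row in pad_ragged(RAGGED):
    print(",".join(row))
```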

Finding missing dates

Suppose you have some date-stamped data which may (or may not) be missing entries for one or more dates: POKI_RUN_COMMAND{{head -n 10 data/miss-date.csv}}HERE POKI_RUN_COMMAND{{wc -l data/miss-date.csv}}HERE

To find these, you can convert the dates to seconds since the epoch using strptime, then compute adjacent differences (the cat -n simply inserts record-counters): POKI_INCLUDE_AND_RUN_ESCAPED(data/miss-date-1.sh)HERE

Then, filter for adjacent difference not being 86400 (the number of seconds in a day): POKI_INCLUDE_AND_RUN_ESCAPED(data/miss-date-2.sh)HERE

Given this, it’s now easy to see where the gaps are: POKI_RUN_COMMAND{{mlr cat -n then filter '$n >= 770 && $n <= 780' data/miss-date.csv}}HERE POKI_RUN_COMMAND{{mlr cat -n then filter '$n >= 1115 && $n <= 1125' data/miss-date.csv}}HERE
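
The same seconds-since-epoch differencing can be sketched in Python. The date format (YYYY-MM-DD), the sample dates, and the function names here are assumptions for illustration; the input is assumed to be sorted already. Unlike a step-style delta, this sketch skips the first record rather than assigning it a difference of zero.

```python
from datetime import datetime, timezone

SECONDS_PER_DAY = 86400


def to_epoch_seconds(day):
    """Parse an assumed YYYY-MM-DD string as midnight UTC, in epoch seconds."""
    parsed = datetime.strptime(day, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    return int(parsed.timestamp())


def find_gaps(days):
    """Return (record_number, date, delta_seconds) wherever the step is not one day."""
    gaps = []
    for n in range(1, len(days)):
        delta = to_epoch_seconds(days[n]) - to_epoch_seconds(days[n - 1])
        if delta != SECONDS_PER_DAY:
            gaps.append((n + 1, days[n], delta))  # 1-based, like cat -n
    return gaps


# Hypothetical sample with two days missing between the 6th and the 9th.
DATES = ["2012-03-05", "2012-03-06", "2012-03-09", "2012-03-10"]
print(find_gaps(DATES))
```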

Two-pass algorithms

Miller is a streaming record processor; commands are performed once per record. This makes Miller particularly suitable for single-pass algorithms, allowing many of its verbs to process files that are (much) larger than the amount of RAM present in your system. (Of course, some Miller verbs, such as sort and tac, must ingest and retain all input records before emitting any output records.) You can also use out-of-stream variables to perform multi-pass computations, at the price of retaining all input records in memory.

Two-pass algorithms: computation of percentages

For example, mapping numeric values down a column to the percentage between their min and max values is two-pass: on the first pass you find the min and max values, then on the second, map each record’s value to a percentage. POKI_INCLUDE_AND_RUN_ESCAPED(data/two-pass-percentage.sh)HERE
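
To make the two passes explicit, here is a small Python sketch. It assumes the column has already been read into a non-empty list of numbers, and it maps every value to 0.0 when min and max coincide; both choices are mine, not necessarily what the script above does.

```python
def to_percentages(values):
    """Map each value to its percentage position between min and max."""
    # First pass: find the extremes.
    lo, hi = min(values), max(values)
    span = hi - lo
    # Second pass: rescale each value (guarding against a zero span).
    return [100.0 * (v - lo) / span if span else 0.0 for v in values]


print(to_percentages([2, 4, 10]))
```

Note that the whole column has to be held in memory between the two passes, which is exactly the price mentioned above.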

Two-pass algorithms: line-number ratios

Similarly, finding the total record count requires first reading through all the data: POKI_INCLUDE_AND_RUN_ESCAPED(data/two-pass-record-numbers.sh)HERE

Two-pass algorithms: records having max value

The idea is to retain records having the largest value of n in the following data: POKI_RUN_COMMAND{{mlr --itsv --irs lf --opprint cat data/maxrows.tsv}}HERE

Of course, the largest value of n isn’t known until after all data have been read. Using an out-of-stream variable we can retain all records as they are read, then filter them at the end: POKI_RUN_COMMAND{{cat data/maxrows.mlr}}HERE POKI_RUN_COMMAND{{mlr --itsv --irs lf --opprint put -q -f data/maxrows.mlr data/maxrows.tsv}}HERE
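
The retain-then-filter pattern looks like this as a Python sketch. The records and the field name n are hypothetical and do not reproduce data/maxrows.tsv.

```python
def rows_with_max(records, key):
    """Retain all records, then keep only those tied for the largest `key`."""
    retained = list(records)  # everything is held in memory, as with oosvars
    if not retained:
        return []
    best = max(rec[key] for rec in retained)
    return [rec for rec in retained if rec[key] == best]


# Hypothetical records; two of them tie for the largest n.
ROWS = [
    {"a": "x", "n": 3},
    {"a": "y", "n": 7},
    {"a": "z", "n": 7},
    {"a": "w", "n": 1},
]
print(rows_with_max(ROWS, "n"))
```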

Filtering paragraphs of text

The idea is to use a record separator which is a pair of newlines. Then, if you want each paragraph to be a record with a single value, use a field-separator which isn’t present in the input data (e.g. a control-A which is octal 001). Or, if you want each paragraph to have its lines as separate values, use newline as field separator. POKI_RUN_COMMAND{{cat paragraphs.txt}}HERE POKI_RUN_COMMAND{{mlr --from paragraphs.txt --nidx --rs '\n\n' --fs '\001' filter '$1 =~ "the"'}}HERE POKI_RUN_COMMAND{{mlr --from paragraphs.txt --nidx --rs '\n\n' --fs '\n' cut -f 1,3}}HERE
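
For comparison, here is a Python sketch of the same two operations: splitting on blank lines to get paragraph records, then either matching against the whole paragraph or selecting lines within each paragraph. The sample text is hypothetical (it is not paragraphs.txt), and the line positions are 1-based to mirror cut -f 1,3.

```python
import re

# Hypothetical sample text; not the contents of paragraphs.txt.
TEXT = (
    "The quick brown fox\njumped over\nthe lazy dogs.\n\n"
    "Now is the time\nfor all good people.\n\n"
    "Sphinx of black\nquartz, judge my vow.\n"
)


def paragraphs(text):
    """Split on a pair of newlines, dropping empty trailing pieces."""
    return [p.strip("\n") for p in text.split("\n\n") if p.strip()]


def filter_paragraphs(text, pattern):
    """Keep paragraphs in which the (case-sensitive) regex matches anywhere."""
    return [p for p in paragraphs(text) if re.search(pattern, p)]


def cut_lines(text, positions):
    """For each paragraph, keep the lines at the given 1-based positions."""
    result = []
    for p in paragraphs(text):
        lines = p.split("\n")
        result.append([lines[i - 1] for i in positions if i <= len(lines)])
    return result


print(filter_paragraphs(TEXT, "the"))
print(cut_lines(TEXT, [1, 3]))
```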

Doing arithmetic on fields with currency symbols

POKI_INCLUDE_ESCAPED(data/dollar-sign.txt)HERE

Program timing

This admittedly artificial example demonstrates using Miller’s time and stats functions to introspectively acquire some information about Miller’s own runtime. The delta function computes the difference between successive timestamps. POKI_INCLUDE_ESCAPED(data/timing-example.txt)HERE

Using out-of-stream variables

One of Miller’s strengths is its compact notation: for example, given input of the form POKI_RUN_COMMAND{{head -n 5 ../data/medium}}HERE you can simply do POKI_RUN_COMMAND{{mlr --oxtab stats1 -a sum -f x ../data/medium}}HERE or POKI_RUN_COMMAND{{mlr --opprint stats1 -a sum -f x -g b ../data/medium}}HERE rather than the more tedious POKI_INCLUDE_AND_RUN_ESCAPED(oosvar-example-sum.sh)HERE or POKI_INCLUDE_AND_RUN_ESCAPED(oosvar-example-sum-grouped.sh)HERE

The former (mlr stats1 et al.) has the advantages of being easier to type, being less error-prone to type, and running faster.

Nonetheless, out-of-stream variables (which I whimsically call oosvars), begin/end blocks, and emit statements give you the ability to implement logic, if you wish to do so, which isn’t present in other Miller verbs. (If you find yourself using the same out-of-stream-variable logic over and over, please file a request at https://github.com/johnkerl/miller/issues to get it implemented directly in C as a Miller verb of its own.)

The following examples compute some things using oosvars which are already computable using Miller verbs, by way of providing food for thought.

Mean without/with oosvars

POKI_RUN_COMMAND{{mlr --opprint stats1 -a mean -f x data/medium}}HERE POKI_INCLUDE_AND_RUN_ESCAPED(data/mean-with-oosvars.sh)HERE

Keyed mean without/with oosvars

POKI_RUN_COMMAND{{mlr --opprint stats1 -a mean -f x -g a,b data/medium}}HERE POKI_INCLUDE_AND_RUN_ESCAPED(data/keyed-mean-with-oosvars.sh)HERE

Variance and standard deviation without/with oosvars

POKI_RUN_COMMAND{{mlr --oxtab stats1 -a count,sum,mean,var,stddev -f x data/medium}}HERE POKI_RUN_COMMAND{{cat variance.mlr}}HERE POKI_RUN_COMMAND{{mlr --oxtab put -q -f variance.mlr data/medium}}HERE You can also do this keyed, of course, imitating the keyed-mean example above.
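
The streaming-accumulator idea behind this kind of computation can be sketched in Python as follows. This is my own illustration using running count, sum, and sum of squares; it is not a transcription of variance.mlr. It computes the sample variance (dividing by n - 1), so it needs at least two values, and the sum-of-squares form can lose precision on data with a large mean and a small spread.

```python
import math


class RunningStats:
    """Accumulate count, sum, and sum of squares one value at a time."""

    def __init__(self):
        self.n = 0
        self.sumx = 0.0
        self.sumx2 = 0.0

    def add(self, x):
        self.n += 1
        self.sumx += x
        self.sumx2 += x * x

    def finish(self):
        """Compute the summary at end of stream; requires n >= 2."""
        mean = self.sumx / self.n
        var = (self.sumx2 - self.n * mean * mean) / (self.n - 1)
        return {
            "count": self.n,
            "sum": self.sumx,
            "mean": mean,
            "var": var,
            "stddev": math.sqrt(var),
        }


stats = RunningStats()
for value in [1.0, 2.0, 3.0, 4.0]:
    stats.add(value)
print(stats.finish())
```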

Min/max without/with oosvars

POKI_RUN_COMMAND{{mlr --oxtab stats1 -a min,max -f x data/medium}}HERE POKI_RUN_COMMAND{{mlr --oxtab put -q '@x_min = min(@x_min, $x); @x_max = max(@x_max, $x); end{emitf @x_min, @x_max}' data/medium}}HERE

Keyed min/max without/with oosvars

POKI_RUN_COMMAND{{mlr --opprint stats1 -a min,max -f x -g a data/medium}}HERE POKI_INCLUDE_AND_RUN_ESCAPED(data/keyed-min-max-with-oosvars.sh)HERE

Delta without/with oosvars

POKI_RUN_COMMAND{{mlr --opprint step -a delta -f x data/small}}HERE POKI_RUN_COMMAND{{mlr --opprint put '$x_delta = ispresent(@last) ? $x - @last : 0; @last = $x' data/small}}HERE

Keyed delta without/with oosvars

POKI_RUN_COMMAND{{mlr --opprint step -a delta -f x -g a data/small}}HERE POKI_RUN_COMMAND{{mlr --opprint put '$x_delta = ispresent(@last[$a]) ? $x - @last[$a] : 0; @last[$a]=$x' data/small}}HERE
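
The keyed version boils down to remembering the last value seen per group. A minimal Python sketch, with hypothetical records and with the field names a and x borrowed from the commands above:

```python
def keyed_delta(records, key="a", field="x"):
    """Append <field>_delta: difference from the previous record in the same group."""
    last = {}  # last value of `field` seen for each group, like @last[$a]
    out = []
    for rec in records:
        group = rec[key]
        # The first record in each group gets a delta of 0.
        delta = rec[field] - last[group] if group in last else 0
        last[group] = rec[field]
        out.append({**rec, f"{field}_delta": delta})
    return out


# Hypothetical records; not the contents of data/small.
RECORDS = [
    {"a": "pan", "x": 1},
    {"a": "eks", "x": 5},
    {"a": "pan", "x": 4},
    {"a": "eks", "x": 2},
]
for rec in keyed_delta(RECORDS):
    print(rec)
```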

Exponentially weighted moving averages without/with oosvars

POKI_INCLUDE_AND_RUN_ESCAPED(verb-example-ewma.sh)HERE POKI_INCLUDE_AND_RUN_ESCAPED(oosvar-example-ewma.sh)HERE
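
For reference, the usual EWMA recurrence is short enough to write out. This Python sketch assumes the conventional form, in which the first value seeds the average and each later value is blended in with weight alpha; I have not checked it line-for-line against the scripts above, so treat it as illustrative.

```python
def ewma(values, alpha):
    """Exponentially weighted moving average, seeded with the first value."""
    smoothed = []
    prev = None
    for x in values:
        # First value passes through; later ones blend with the running average.
        prev = x if prev is None else alpha * x + (1 - alpha) * prev
        smoothed.append(prev)
    return smoothed


print(ewma([1, 2, 3], 0.5))
```

A larger alpha tracks the input more closely, while a smaller alpha smooths more heavily.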