POKI_PUT_TOC_HERE

Command overview

Whereas the Unix toolkit is made of the separate executables cat, tail, cut, sort, etc., Miller has subcommands, invoked as follows: POKI_INCLUDE_ESCAPED(subcommand-example.txt)HERE

These fall into categories as follows:

  cat, cut, head, sort, tac, tail, top, uniq
    Analogs of their Unix-toolkit namesakes, discussed below as well as in POKI_PUT_LINK_FOR_PAGE(feature-comparison.html)HERE

  filter, put, step
    awk-like functionality

  histogram, stats1, stats2
    Statistically oriented

  group-by, group-like, having-fields
    Particularly oriented toward POKI_PUT_LINK_FOR_PAGE(record-heterogeneity.html)HERE, although all Miller commands can handle heterogeneous records

  count-distinct, label, rename, reorder
    These draw from other sources (see also POKI_PUT_LINK_FOR_PAGE(originality.html)HERE): count-distinct is SQL-ish, and rename can be done by sed (which does it faster: see POKI_PUT_LINK_FOR_PAGE(performance.html)HERE).

On-line help

Examples:

POKI_RUN_COMMAND{{mlr --help}}HERE POKI_RUN_COMMAND{{mlr sort --help}}HERE

then-chaining

In accord with the Unix philosophy, you can pipe data into or out of Miller. For example: POKI_CARDIFY(mlr cut --complement -f os_version *.dat | mlr sort -f hostname,uptime)HERE

You can, if you like, instead simply chain commands together using the then keyword: POKI_CARDIFY(mlr cut --complement -f os_version then sort -f hostname,uptime *.dat)HERE Here’s a performance comparison: POKI_INCLUDE_ESCAPED(data/then-chaining-performance.txt)HERE

There are two reasons to use then-chaining. One is performance, although I don’t expect this to be a win in all cases. Then-chaining avoids redundant string-parsing and string-formatting at each pipeline step: input records are parsed once, fed through each pipeline stage in memory, and output records are formatted once. On the other hand, Miller is single-threaded, while modern systems usually have multiple processors, and when streaming-data programs operate through pipes, each one can use its own CPU. Rest assured you get the same results either way.

The other reason to use then-chaining is for simplicity: you don’t have to re-type formatting flags (e.g. --csv --rs lf --fs tab) at every pipeline stage.
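
For instance, here is the same two-stage job written both ways (a sketch, assuming a hypothetical CSV file input.csv with hostname and uptime columns):

  # Pipe form: the format flags must be repeated for each mlr invocation
  mlr --icsv --ocsv cut -f hostname,uptime input.csv | mlr --icsv --opprint sort -f hostname

  # Then-chain form: the format flags appear only once
  mlr --icsv --opprint cut -f hostname,uptime then sort -f hostname input.csv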

I/O options

Formats

Options:

  --dkvp    --idkvp    --odkvp
  --nidx    --inidx    --onidx
  --csv     --icsv     --ocsv
  --csvlite --icsvlite --ocsvlite
  --pprint  --ipprint  --opprint  --right
  --xtab    --ixtab    --oxtab

These are as discussed in POKI_PUT_LINK_FOR_PAGE(file-formats.html)HERE, with the exception of --right which makes pretty-printed output right-aligned:
POKI_RUN_COMMAND{{mlr --opprint cat data/small}}HERE POKI_RUN_COMMAND{{mlr --opprint --right cat data/small}}HERE

Additional notes:

Record/field/pair separators

Miller has record separators IRS and ORS, field separators IFS and OFS, and pair separators IPS and OPS. For example, in the DKVP line a=1,b=2,c=3, the record separator is newline, the field separator is comma, and the pair separator is the equals sign. These are the default values; an example of overriding them follows the options list below.

Options:

  --rs --irs --ors
  --fs --ifs --ofs --repifs
  --ps --ips --ops
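
For example, to read semicolon-delimited fields with colon-separated pairs, writing default DKVP output (a minimal sketch; the input line is made up):

  echo 'a:1;b:2;c:3' | mlr --ifs ';' --ips ':' cat
  # prints: a=1,b=2,c=3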

Number formatting

The command-line option --ofmt {format string} is the global number format for commands which generate numeric output, e.g. stats1, stats2, histogram, and step, as well as mlr put. Examples: POKI_CARDIFY(--ofmt %.9le --ofmt %.6lf --ofmt %.0lf)HERE

These are just C printf formats applied to double-precision numbers. Please don’t use %s or %d. Additionally, if you use leading width (e.g. %18.12lf) then the output will contain embedded whitespace, which may not be what you want if you pipe the output to something else, particularly CSV. I use Miller’s pretty-print format (mlr --opprint) to column-align numerical data.
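
For instance, here is the global format applied to stats1 output (a sketch using the data/medium file from the other examples on this page):

  # Six decimal places on all numeric output from stats1
  mlr --ofmt '%.6lf' stats1 -a mean,sum -f x data/medium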

To apply formatting to a single field, overriding the global ofmt, use the fmtnum function within mlr put. For example: POKI_RUN_COMMAND{{echo 'x=3.1,y=4.3' | mlr put '$z=fmtnum($x*$y,"%08lf")'}}HERE POKI_RUN_COMMAND{{echo 'x=0xffff,y=0xff' | mlr put '$z=fmtnum(int($x*$y),"%08llx")'}}HERE

Input conversion from hexadecimal is done automatically on fields handled by mlr put and mlr filter as long as the field value begins with "0x". To apply output conversion to hexadecimal on a single column, you may use fmtnum, or the keystroke-saving hexfmt function. Example: POKI_RUN_COMMAND{{echo 'x=0xffff,y=0xff' | mlr put '$z=hexfmt($x*$y)'}}HERE

Regular expressions

Miller lets you use regular expressions (of type POSIX.2) in the following contexts:

An i after the double-quoted regular-expression string signifies case-insensitive matching.

For filter and put, if the regular expression is a string literal (the normal case), it is precompiled at process start and reused thereafter, which is efficient. If the regular expression is a more complex expression, including string concatenation using ., or a column name (in which case you can take regular expressions from input data!), then regexes are compiled on each record, which works but is less efficient. Also, in this case there is no way to specify case-insensitive matching.
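
For instance, here is a string-literal regex, precompiled once, where the trailing i requests case-insensitive matching (a sketch using the data/small file from the other examples on this page):

  # Passes through records whose a field matches pan, PAN, Pan, ...
  mlr filter '$a =~ "^PAN$"i' data/small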

Example: POKI_RUN_COMMAND{{cat data/regex-in-data.dat}}HERE POKI_RUN_COMMAND{{mlr filter '$name =~ $regex' data/regex-in-data.dat}}HERE

Data transformations

cat

Most useful for format conversions (see POKI_PUT_LINK_FOR_PAGE(file-formats.html)HERE), and for concatenating multiple same-schema CSV files so that the output has a single header:
POKI_RUN_COMMAND{{cat a.csv}}HERE POKI_RUN_COMMAND{{cat b.csv}}HERE POKI_RUN_COMMAND{{mlr --csv cat a.csv b.csv}}HERE POKI_RUN_COMMAND{{mlr --icsv --oxtab cat a.csv b.csv}}HERE

check

POKI_RUN_COMMAND{{mlr check --help}}HERE

count-distinct

POKI_RUN_COMMAND{{mlr count-distinct --help}}HERE POKI_RUN_COMMAND{{mlr count-distinct -f a,b then sort -nr count data/medium}}HERE

cut

POKI_RUN_COMMAND{{mlr cut --help}}HERE
POKI_RUN_COMMAND{{mlr --opprint cat data/small}}HERE POKI_RUN_COMMAND{{mlr --opprint cut -f y,x,i data/small}}HERE
POKI_RUN_COMMAND{{echo 'a=1,b=2,c=3' | mlr cut -f b,c,a}}HERE POKI_RUN_COMMAND{{echo 'a=1,b=2,c=3' | mlr cut -o -f b,c,a}}HERE

filter

POKI_RUN_COMMAND{{mlr filter --help}}HERE

Field names must be specified using a $ in filter and put expressions, even though they don’t appear in the data stream. For integer-indexed data, this looks like awk’s $1,$2,$3. Likewise, enclose string literals in double quotes in filter expressions even though they don’t appear in file data. In particular, mlr filter '$x=="abc"' passes through the record x=abc.
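
A minimal runnable illustration of both points:

  # $x refers to the field named x; "abc" is a string literal
  echo 'x=abc' | mlr filter '$x == "abc"'
  # prints: x=abc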

The filter command supports the same built-in variables as for put, all awk-inspired: NF, NR, FNR, FILENUM, and FILENAME. This selects the 2nd record from each matching file: POKI_RUN_COMMAND{{mlr filter 'FNR == 2' data/small*}}HERE

Expressions may be arbitrarily complex: POKI_RUN_COMMAND{{mlr --opprint filter '$a == "pan" || $b == "wye"' data/small}}HERE
POKI_RUN_COMMAND{{mlr --opprint filter '($x > 0.5 && $y > 0.5) || ($x < 0.5 && $y < 0.5)' then stats2 -a corr -f x,y data/medium}}HERE
POKI_RUN_COMMAND{{mlr --opprint filter '($x > 0.5 && $y < 0.5) || ($x < 0.5 && $y > 0.5)' then stats2 -a corr -f x,y data/medium}}HERE
Newlines within the expression are ignored, which can help increase legibility of complex expressions: POKI_INCLUDE_ESCAPED(filter-multiline-example.txt)HERE

group-by

POKI_RUN_COMMAND{{mlr group-by --help}}HERE

This is similar to sort but with less work. Namely, Miller’s sort has three steps: read through the data and append linked lists of records, one for each unique combination of the key-field values; after all records are read, sort the key-field values; then print each record-list. The group-by operation simply omits the middle sort. An example should make this clearer.
POKI_RUN_COMMAND{{mlr --opprint group-by a data/small}}HERE POKI_RUN_COMMAND{{mlr --opprint sort -f a data/small}}HERE

In this example, since the sort is on field a, the first step is to group together all records having the same value for field a; the second step is to sort the distinct a-field values pan, eks, and wye into eks, pan, and wye; the third step is to print out the record-list for a=eks, then the record-list for a=pan, then the record-list for a=wye. The group-by operation omits the middle sort and just puts like records together, for those times when a sort isn’t desired. In particular, with group-by, the grouping-field values appear in the order in which they were encountered in the data stream, which in some cases may be more interesting to you.

group-like

POKI_RUN_COMMAND{{mlr group-like --help}}HERE

This groups together records having the same schema (i.e. same ordered list of field names) which is useful for making sense of time-ordered output as described in POKI_PUT_LINK_FOR_PAGE(record-heterogeneity.html)HERE — in particular, in preparation for CSV or pretty-print output.
POKI_RUN_COMMAND{{mlr cat data/het.dkvp}}HERE POKI_RUN_COMMAND{{mlr --opprint group-like data/het.dkvp}}HERE

having-fields

POKI_RUN_COMMAND{{mlr having-fields --help}}HERE

Similar to group-like, this retains records having specified fields.
POKI_RUN_COMMAND{{mlr cat data/het.dkvp}}HERE
POKI_RUN_COMMAND{{mlr having-fields --at-least resource data/het.dkvp}}HERE
POKI_RUN_COMMAND{{mlr having-fields --which-are resource,ok,loadsec data/het.dkvp}}HERE

head

POKI_RUN_COMMAND{{mlr head --help}}HERE Note that head is distinct from top: head shows fields which appear first in the data stream; top shows fields which are numerically largest (or smallest).
POKI_RUN_COMMAND{{mlr --opprint head -n 4 data/medium}}HERE POKI_RUN_COMMAND{{mlr --opprint head -n 1 -g b data/medium}}HERE

histogram

POKI_RUN_COMMAND{{mlr histogram --help}}HERE This is just a histogram; there’s not too much to say here. A note about binning, by example: suppose you use --lo 0.0 --hi 1.0 --nbins 10 -f x. Input numbers less than 0 or greater than 1 aren’t counted in any bin. Input numbers equal to 1 are counted in the last bin. That is, bin 0 has 0.0 ≤ x < 0.1, bin 1 has 0.1 ≤ x < 0.2, etc., but bin 9 has 0.9 ≤ x ≤ 1.0. POKI_RUN_COMMAND{{mlr --opprint put '$x2=$x**2;$x3=$x2*$x' then histogram -f x,x2,x3 --lo 0 --hi 1 --nbins 10 data/medium}}HERE

join

POKI_RUN_COMMAND{{mlr join --help}}HERE Examples:

Join larger table with IDs with smaller ID-to-name lookup table, showing only paired records:
POKI_RUN_COMMAND{{mlr --icsvlite --opprint cat data/join-left-example.csv}}HERE
POKI_RUN_COMMAND{{mlr --icsvlite --opprint cat data/join-right-example.csv}}HERE
POKI_RUN_COMMAND{{mlr --icsvlite --opprint join -u -j id -r idcode -f data/join-left-example.csv data/join-right-example.csv}}HERE

Same, but with sorting the input first:
POKI_RUN_COMMAND{{mlr --icsvlite --opprint sort -f idcode then join -j id -r idcode -f data/join-left-example.csv data/join-right-example.csv}}HERE

Same, but showing only unpaired records:
POKI_RUN_COMMAND{{mlr --icsvlite --opprint join --np --ul --ur -u -j id -r idcode -f data/join-left-example.csv data/join-right-example.csv}}HERE

Use prefixing options to disambiguate between otherwise identical non-join field names:
POKI_RUN_COMMAND{{mlr --csvlite --opprint cat data/self-join.csv data/self-join.csv}}HERE
POKI_RUN_COMMAND{{mlr --csvlite --opprint join -j a --lp left_ --rp right_ -f data/self-join.csv data/self-join.csv}}HERE

Use zero join columns:
POKI_RUN_COMMAND{{mlr --csvlite --opprint join -j "" --lp left_ --rp right_ -f data/self-join.csv data/self-join.csv}}HERE

label

POKI_RUN_COMMAND{{mlr label --help}}HERE See also rename.

Example: Files such as /etc/passwd, /etc/group, and so on have implicit field names which are found in section-5 manpages. These field names may be made explicit as follows: POKI_INCLUDE_ESCAPED(data/label-example.txt)HERE

Likewise, if you have CSV/CSV-lite input data which has somehow been bereft of its header line, you can re-add a header line using --implicit-csv-header and label: POKI_RUN_COMMAND{{cat data/headerless.csv}}HERE POKI_RUN_COMMAND{{mlr --csv --rs lf --implicit-csv-header cat data/headerless.csv}}HERE POKI_RUN_COMMAND{{mlr --csv --rs lf --implicit-csv-header label name,age,status data/headerless.csv}}HERE POKI_RUN_COMMAND{{mlr --icsv --rs lf --implicit-csv-header --opprint label name,age,status data/headerless.csv}}HERE

put

POKI_RUN_COMMAND{{mlr put --help}}HERE

Field names must be specified using a $ in filter and put expressions, even though they don’t appear in the data stream. For integer-indexed data, this looks like awk’s $1,$2,$3. Likewise, enclose string literals in double quotes in put expressions even though they don’t appear in file data. In particular, mlr put '$x="abc"' creates the field x=abc.
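
A minimal runnable illustration (note the single = for assignment, versus == for comparison):

  echo 'a=1,b=2' | mlr put '$x = "abc"'
  # prints: a=1,b=2,x=abc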

Multiple expressions may be given, separated by semicolons, and each may refer to the ones before: POKI_RUN_COMMAND{{ruby -e '10.times{|i|puts "i=#{i}"}' | mlr --opprint put '$j=$i+1;$k=$i+$j'}}HERE

Miller supports the following five built-in variables for filter and put, all awk-inspired: NF, NR, FNR, FILENUM, and FILENAME. POKI_RUN_COMMAND{{mlr --opprint put '$nf=NF; $nr=NR; $fnr=FNR; $filenum=FILENUM; $filename=FILENAME' data/small data/small2}}HERE Newlines within the expression are ignored, which can help increase legibility of complex expressions: POKI_INCLUDE_ESCAPED(put-multiline-example.txt)HERE

regularize

POKI_RUN_COMMAND{{mlr regularize --help}}HERE

This exists since hash-map software in various languages and tools encountered in the wild does not always print similar rows with fields in the same order: mlr regularize helps clean that up.

See also reorder.

rename

POKI_RUN_COMMAND{{mlr rename --help}}HERE
POKI_RUN_COMMAND{{mlr --opprint cat data/small}}HERE POKI_RUN_COMMAND{{mlr --opprint rename i,INDEX,b,COLUMN2 data/small}}HERE

As discussed in POKI_PUT_LINK_FOR_PAGE(performance.html)HERE, sed is significantly faster than Miller at doing this. However, Miller is format-aware, so it knows to do renames only within specified field keys and not any others, nor in field values which may happen to contain the same pattern. Example:
POKI_RUN_COMMAND{{sed 's/y/COLUMN5/g' data/small}}HERE
POKI_RUN_COMMAND{{mlr rename y,COLUMN5 data/small}}HERE
See also label.

reorder

POKI_RUN_COMMAND{{mlr reorder --help}}HERE This pivots specified field names to the start or end of the record, for example when you have highly multi-column data and want to bring a field or two to the front of the line where you can give them a quick visual scan.
POKI_RUN_COMMAND{{mlr --opprint cat data/small}}HERE
POKI_RUN_COMMAND{{mlr --opprint reorder -f i,b data/small}}HERE POKI_RUN_COMMAND{{mlr --opprint reorder -e -f i,b data/small}}HERE

sample

POKI_RUN_COMMAND{{mlr sample --help}}HERE

This is reservoir-sampling: select k items from n with uniform probability and no repeats in the sample. (If n is less than k, then of course only n samples are produced.) With -g {field names}, produce a k-sample for each distinct value of the specified field names. POKI_INCLUDE_ESCAPED(data/sample-example.txt)HERE

Note that no output is produced until all inputs are in. Another way to do sampling, which works in the streaming case, is mlr filter 'urand() < 0.001' where you tune the 0.001 to meet your needs.

sort

POKI_RUN_COMMAND{{mlr sort --help}}HERE

Example: POKI_RUN_COMMAND{{mlr --opprint sort -f a -nr x data/small}}HERE

Here’s an example filtering log data: suppose multiple threads (labeled here by color) are all logging progress counts to a single log file. The log file is (by nature) chronological, so the progress of various threads is interleaved: POKI_RUN_COMMAND{{head -n 10 data/multicountdown.dat}}HERE

We can group these by thread by sorting on the thread ID (here, color). Since Miller’s sort is stable, this means that timestamps within each thread’s log data are still chronological: POKI_RUN_COMMAND{{head -n 20 data/multicountdown.dat | mlr --opprint sort -f color}}HERE

Any records not having all specified sort keys will appear at the end of the output, in the order they were encountered, regardless of the specified sort order: POKI_RUN_COMMAND{{mlr sort -n x data/sort-missing.dkvp}}HERE POKI_RUN_COMMAND{{mlr sort -nr x data/sort-missing.dkvp}}HERE

stats1

POKI_RUN_COMMAND{{mlr stats1 --help}}HERE These are simple univariate statistics on one or more number-valued fields (count and mode apply to non-numeric fields as well), optionally categorized by one or more other fields.
POKI_RUN_COMMAND{{mlr --oxtab stats1 -a count,sum,min,p10,p50,mean,p90,max -f x,y data/medium}}HERE
POKI_RUN_COMMAND{{mlr --opprint stats1 -a mean -f x,y -g b then sort -f b data/medium}}HERE
POKI_RUN_COMMAND{{mlr --opprint stats1 -a p50,p99 -f u,v -g color then put '$ur=$u_p99/$u_p50;$vr=$v_p99/$v_p50' data/colored-shapes.dkvp}}HERE
POKI_RUN_COMMAND{{mlr --opprint count-distinct -f shape then sort -nr count data/colored-shapes.dkvp}}HERE
POKI_RUN_COMMAND{{mlr --opprint stats1 -a mode -f color -g shape data/colored-shapes.dkvp}}HERE

stats2

POKI_RUN_COMMAND{{mlr stats2 --help}}HERE These are simple bivariate statistics on one or more pairs of number-valued fields, optionally categorized by one or more fields.
POKI_RUN_COMMAND{{mlr --oxtab put '$x2=$x*$x; $xy=$x*$y; $y2=$y**2' then stats2 -a cov,corr -f x,y,y,y,x2,xy,x2,y2 data/medium}}HERE
POKI_RUN_COMMAND{{mlr --opprint put '$x2=$x*$x; $xy=$x*$y; $y2=$y**2' then stats2 -a linreg-ols,r2 -f x,y,y,y,xy,y2 -g a data/medium}}HERE

Here’s a simple line-fit example. The x and y fields of the data/medium dataset are independent, each uniformly distributed on the unit interval. Here we remove half the data and fit a line to it. POKI_INCLUDE_ESCAPED(data/linreg-example.txt)HERE

I use pgr for plotting.

(Thanks Drew Kunas for a good conversation about PCA!)

Here’s an example estimating time-to-completion for a set of jobs. Input data comes from a log file, with number of work units left to do in the count field and accumulated seconds in the upsec field, labeled by the color field: POKI_RUN_COMMAND{{head -n 10 data/multicountdown.dat}}HERE We can do a linear regression on count remaining as a function of time: with c = m*u+b we want to find the time when the count goes to zero, i.e. u=-b/m. POKI_RUN_COMMAND{{mlr --oxtab stats2 -a linreg-pca -f upsec,count -g color then put '$donesec = -$upsec_count_pca_b/$upsec_count_pca_m' data/multicountdown.dat}}HERE

step

POKI_RUN_COMMAND{{mlr step --help}}HERE Most Miller commands are record-at-a-time, with the exception of stats1, stats2, and histogram, which compute aggregate output. The step command is intermediate: it allows the option of adding fields which are functions of fields from previous records. Rsum is short for running sum.
POKI_RUN_COMMAND{{mlr --opprint step -a delta,rsum,counter -f x data/medium | head -15}}HERE
POKI_RUN_COMMAND{{mlr --opprint step -a delta,rsum,counter -f x -g a data/medium | head -15}}HERE
Example deriving uptime-delta from system uptime: POKI_INCLUDE_ESCAPED(data/ping-delta-example.txt)HERE

tac

POKI_RUN_COMMAND{{mlr tac --help}}HERE

Prints the records in the input stream in reverse order. Note: this requires Miller to retain all input records in memory before any output records are produced.
POKI_RUN_COMMAND{{mlr --icsv --opprint cat a.csv}}HERE POKI_RUN_COMMAND{{mlr --icsv --opprint cat b.csv}}HERE POKI_RUN_COMMAND{{mlr --icsv --opprint tac a.csv b.csv}}HERE
POKI_RUN_COMMAND{{mlr --icsv --opprint put '$filename=FILENAME' then tac a.csv b.csv}}HERE

tail

POKI_RUN_COMMAND{{mlr tail --help}}HERE

Prints the last n records in the input stream, optionally by category.
POKI_RUN_COMMAND{{mlr --opprint tail -n 4 data/colored-shapes.dkvp}}HERE
POKI_RUN_COMMAND{{mlr --opprint tail -n 1 -g shape data/colored-shapes.dkvp}}HERE

top

POKI_RUN_COMMAND{{mlr top --help}}HERE Note that top is distinct from headhead shows fields which appear first in the data stream; top shows fields which are numerically largest (or smallest).
POKI_RUN_COMMAND{{mlr --opprint top -n 4 -f x data/medium}}HERE POKI_RUN_COMMAND{{mlr --opprint top -n 2 -f x -g a then sort -f a data/medium}}HERE

uniq

POKI_RUN_COMMAND{{mlr uniq --help}}HERE
POKI_RUN_COMMAND{{wc -l data/colored-shapes.dkvp}}HERE
POKI_RUN_COMMAND{{mlr uniq -g color,shape data/colored-shapes.dkvp}}HERE
POKI_RUN_COMMAND{{mlr --opprint uniq -g color,shape -c then sort -f color,shape data/colored-shapes.dkvp}}HERE

Functions for filter and put

POKI_RUN_COMMAND{{mlr --help-all-functions}}HERE

Data types

Miller’s input and output are all string-oriented: there is (as of August 2015 anyway) no support for binary record packing. In this sense, everything is a string in and out of Miller. During processing, field names are always strings, even if they have names like "3"; field values are usually strings. A field value’s ability to be interpreted as a non-string type has meaning only when comparison or function operations are done on it. And it is an error condition if Miller encounters non-numeric (or otherwise mistyped) data in a field on which it has been asked to do numeric (or otherwise type-specific) operations.
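
For example, arithmetic in put requests numeric interpretation of its operands, while unmodified fields pass through as the strings they came in as (a sketch; hex scanning is as described under I/O options above):

  echo 'x=0x10,y=3' | mlr put '$z = $x + $y'
  # prints: x=0x10,y=3,z=19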

Field values are treated as numeric for the following:

For mlr put and mlr filter:

Null data

One of Miller’s key features is its support for heterogeneous data. Accordingly, if you try to sort on field hostname when not all records in the data stream have a field named hostname, it is not an error (although you could pre-filter the data stream using mlr having-fields --at-least hostname then sort ...). Rather, records lacking one or more sort keys are simply output contiguously by mlr sort.

Field values may also be null by being specified with a present key but an empty value: e.g. sending x=,y=2 to mlr put '$z=$x+$y'.
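
To experiment with the rules below, a one-liner such as the one from the paragraph above is handy:

  # x has a present key but an empty (null) value
  echo 'x=,y=2' | mlr put '$z = $x + $y'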

Rules for null-handling: