POKI_PUT_TOC_HERE
Command overview
Whereas the Unix toolkit is made of the separate executables cat, tail, cut,
sort, etc., Miller has subcommands, invoked as follows:
POKI_INCLUDE_ESCAPED(data/subcommand-example.txt)HERE
These fall into categories as follows:
cat, cut, head, sort, tac, tail, top, uniq:
Analogs of their Unix-toolkit namesakes, discussed below as well as in
POKI_PUT_LINK_FOR_PAGE(feature-comparison.html)HERE

filter, put, step:
awk-like functionality

histogram, stats1, stats2:
Statistically oriented

group-by, group-like, having-fields:
Particularly oriented toward POKI_PUT_LINK_FOR_PAGE(record-heterogeneity.html)HERE,
although all Miller commands can handle heterogeneous records

count-distinct, label, rename, reorder:
These draw from other sources (see also POKI_PUT_LINK_FOR_PAGE(originality.html)HERE):
count-distinct is SQL-ish, and rename can be done by sed (which does it faster:
see POKI_PUT_LINK_FOR_PAGE(performance.html)HERE).
On-line help
Examples:
POKI_RUN_COMMAND{{mlr --help}}HERE
POKI_RUN_COMMAND{{mlr sort --help}}HERE
I/O options
Formats
Options:
--dkvp --idkvp --odkvp
--nidx --inidx --onidx
--csv --icsv --ocsv
--csvlite --icsvlite --ocsvlite
--pprint --ipprint --opprint --right
--xtab --ixtab --oxtab
--json --ijson --ojson
These are as discussed in POKI_PUT_LINK_FOR_PAGE(file-formats.html)HERE, with the exception of --right
which makes pretty-printed output right-aligned:
POKI_RUN_COMMAND{{mlr --opprint cat data/small}}HERE
|
POKI_RUN_COMMAND{{mlr --opprint --right cat data/small}}HERE
|
Additional notes:
Use --csv, --pprint, etc. when the input and output formats are the same.
Use --icsv --opprint, etc. when you want format conversion as part of what Miller does to your data.
DKVP (key-value-pair) format is the default for input and output. So,
--oxtab is the same as --idkvp --oxtab.
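For example, here is a hedged sketch of these notes in action, using the data/a.csv and data/small files which appear in later examples:
# CSV input, pretty-printed output (format conversion):
$ mlr --icsv --opprint cat data/a.csv
# DKVP is the default input format, so only the output-format flag is needed:
$ mlr --oxtab cat data/small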
Compression
Options:
--prepipe {command}
The prepipe command is anything which reads from standard input and produces data acceptable to
Miller. Nominally this allows you to use whichever decompression utilities you have installed on your
system, on a per-file basis. If the command has flags, quote them: e.g. mlr --prepipe 'zcat -cf'. Examples:
# These two produce the same output:
$ gunzip < myfile1.csv.gz | mlr cut -f hostname,uptime
$ mlr --prepipe gunzip cut -f hostname,uptime myfile1.csv.gz
# With multiple input files you need --prepipe:
$ mlr --prepipe gunzip cut -f hostname,uptime myfile1.csv.gz myfile2.csv.gz
$ mlr --prepipe gunzip --idkvp --oxtab cut -f hostname,uptime myfile1.dat.gz myfile2.dat.gz
# Similar to the above, but with compressed output as well as input:
$ gunzip < myfile1.csv.gz | mlr cut -f hostname,uptime | gzip > outfile.csv.gz
$ mlr --prepipe gunzip cut -f hostname,uptime myfile1.csv.gz | gzip > outfile.csv.gz
$ mlr --prepipe gunzip cut -f hostname,uptime myfile1.csv.gz myfile2.csv.gz | gzip > outfile.csv.gz
# Similar to the above, but with different compression tools for input and output:
$ gunzip < myfile1.csv.gz | mlr cut -f hostname,uptime | xz -z > outfile.csv.xz
$ xz -cd < myfile1.csv.xz | mlr cut -f hostname,uptime | gzip > outfile.csv.gz
$ mlr --prepipe 'xz -cd' cut -f hostname,uptime myfile1.csv.xz myfile2.csv.xz | xz -z > outfile.csv.xz
... etc.
Record/field/pair separators
Miller has record separators IRS and ORS, field
separators IFS and OFS, and pair separators IPS and
OPS. For example, in the DKVP line a=1,b=2,c=3, the record
separator is newline, field separator is comma, and pair separator is the
equals sign. These are the default values.
Options:
--rs --irs --ors
--fs --ifs --ofs --repifs
--ps --ips --ops
You can change a separator from input to output via e.g. --ifs =
--ofs :. Or, you can specify that the same separator is to be used for
input and output via e.g. --fs :.
The pair separator is only relevant to DKVP format.
Pretty-print and xtab formats ignore the separator arguments altogether.
The --repifs option means that multiple successive occurrences of the
field separator count as one. For example, in CSV data we often signify nulls
by empty strings, e.g. 2,9,,,,,6,5,4. On the other hand, if the field
separator is a space, it might be more natural to parse 2  4  5 (with repeated
spaces) the same as 2 4 5 (with single spaces): --repifs --ifs ' ' lets this happen. In fact,
the --ipprint option above is internally implemented in terms of
--repifs.
Just write out the desired separator, e.g. --ofs '|'. But you
may use the symbolic names newline, space, tab,
pipe, or semicolon if you like.
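For example, here is a hedged sketch (myfile.dkvp is a hypothetical colon-delimited key-value file):
# Read colon-separated fields, write tab-separated fields:
$ mlr --ifs : --ofs tab cat myfile.dkvp
# Use the same separator for input and output, by symbolic name:
$ mlr --fs semicolon cat myfile.dkvp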
Number formatting
The command-line option --ofmt {format string} is the global
number format for commands which generate numeric output, e.g.
stats1, stats2, histogram, and step, as
well as mlr put. Examples:
POKI_CARDIFY(--ofmt %.9le --ofmt %.6lf --ofmt %.0lf)HERE
These are just C printf formats applied to double-precision
numbers. Please don’t use %s or %d. Additionally, if
you use leading width (e.g. %18.12lf) then the output will contain
embedded whitespace, which may not be what you want if you pipe the output to
something else, particularly CSV. I use Miller’s pretty-print format
(mlr --opprint) to column-align numerical data.
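For example, a hedged sketch applying a global format to stats output (data/medium appears in later examples):
$ mlr --ofmt '%.6lf' --oxtab stats1 -a mean,sum -f x,y data/medium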
To apply formatting to a single field, overriding the global
ofmt, use fmtnum function within mlr put. For example:
POKI_RUN_COMMAND{{echo 'x=3.1,y=4.3' | mlr put '$z=fmtnum($x*$y,"%08lf")'}}HERE
POKI_RUN_COMMAND{{echo 'x=0xffff,y=0xff' | mlr put '$z=fmtnum(int($x*$y),"%08llx")'}}HERE
Input conversion from hexadecimal is done automatically on fields handled
by mlr put and mlr filter as long as the field value begins
with "0x". To apply output conversion to hexadecimal on a single column, you
may use fmtnum, or the keystroke-saving hexfmt function.
Example:
POKI_RUN_COMMAND{{echo 'x=0xffff,y=0xff' | mlr put '$z=hexfmt($x*$y)'}}HERE
Data transformations
bar
Cheesy bar-charting.
POKI_RUN_COMMAND{{mlr bar -h}}HERE
POKI_RUN_COMMAND{{mlr --opprint cat data/small}}HERE
POKI_RUN_COMMAND{{mlr --opprint bar --lo 0 --hi 1 -f x,y data/small}}HERE
POKI_RUN_COMMAND{{mlr --opprint bar --lo 0.4 --hi 0.6 -f x,y data/small}}HERE
POKI_RUN_COMMAND{{mlr --opprint bar --auto -f x,y data/small}}HERE
bootstrap
POKI_RUN_COMMAND{{mlr bootstrap --help}}HERE
The canonical use for bootstrap sampling is to put error bars on statistical quantities, such as mean. For example:
$ mlr --opprint stats1 -a mean,count -f u -g color data/colored-shapes.dkvp
color u_mean u_count
yellow 0.497129 1413
red 0.492560 4641
purple 0.494005 1142
green 0.504861 1109
blue 0.517717 1470
orange 0.490532 303
$ mlr --opprint bootstrap then stats1 -a mean,count -f u -g color data/colored-shapes.dkvp
color u_mean u_count
yellow 0.500651 1380
purple 0.501556 1111
green 0.503272 1068
red 0.493895 4702
blue 0.512529 1496
orange 0.521030 321
$ mlr --opprint bootstrap then stats1 -a mean,count -f u -g color data/colored-shapes.dkvp
color u_mean u_count
yellow 0.498046 1485
blue 0.513576 1417
red 0.492870 4595
orange 0.507697 307
green 0.496803 1075
purple 0.486337 1199
$ mlr --opprint bootstrap then stats1 -a mean,count -f u -g color data/colored-shapes.dkvp
color u_mean u_count
blue 0.522921 1447
red 0.490717 4617
yellow 0.496450 1419
purple 0.496523 1192
green 0.507569 1111
orange 0.468014 292
cat
Most useful for format conversions (see
POKI_PUT_LINK_FOR_PAGE(file-formats.html)HERE), and concatenating multiple
same-schema CSV files to have the same header:
POKI_RUN_COMMAND{{mlr cat -h}}HERE
POKI_RUN_COMMAND{{cat data/a.csv}}HERE
|
POKI_RUN_COMMAND{{cat data/b.csv}}HERE
|
POKI_RUN_COMMAND{{mlr --csv cat data/a.csv data/b.csv}}HERE
|
|
POKI_RUN_COMMAND{{mlr --icsv --oxtab cat data/a.csv data/b.csv}}HERE
|
POKI_RUN_COMMAND{{mlr --csv cat -n data/a.csv data/b.csv}}HERE
|
check
POKI_RUN_COMMAND{{mlr check --help}}HERE
decimate
POKI_RUN_COMMAND{{mlr decimate --help}}HERE
count-distinct
POKI_RUN_COMMAND{{mlr count-distinct --help}}HERE
POKI_RUN_COMMAND{{mlr count-distinct -f a,b then sort -nr count data/medium}}HERE
POKI_RUN_COMMAND{{mlr count-distinct -n -f a,b data/medium}}HERE
cut
POKI_RUN_COMMAND{{mlr cut --help}}HERE
POKI_RUN_COMMAND{{mlr --opprint cat data/small}}HERE
|
POKI_RUN_COMMAND{{mlr --opprint cut -f y,x,i data/small}}HERE
|
POKI_RUN_COMMAND{{echo 'a=1,b=2,c=3' | mlr cut -f b,c,a}}HERE
|
POKI_RUN_COMMAND{{echo 'a=1,b=2,c=3' | mlr cut -o -f b,c,a}}HERE
|
filter
POKI_RUN_COMMAND{{mlr filter --help}}HERE
Field names for filter
Field names must be specified using a $ in filter and put expressions, even though they don’t appear
in the data stream. For integer-indexed data, this looks like
awk’s $1,$2,$3. Likewise, enclose string literals in
double quotes in filter expressions even though they don’t
appear in file data. In particular, mlr filter '$x=="abc"' passes
through the record x=abc. If field names have special characters such
as . then you can use braces, e.g. '${field.name}'.
You may also use a computed field name in square brackets, e.g.
POKI_RUN_COMMAND{{echo a=3,b=4 | mlr filter '$["x"] < 0.5'}}HERE
Built-in variables for filter
The filter command supports the same built-in variables as for put, all awk-inspired: NF,
NR, FNR, FILENUM, and FILENAME, as well as
the mathematical constants PI and E.
This selects the 2nd
record from each matching file:
POKI_RUN_COMMAND{{mlr filter 'FNR == 2' data/small*}}HERE
Expression formatting for filter
Expressions may be arbitrarily complex:
POKI_RUN_COMMAND{{mlr --opprint filter '$a == "pan" || $b == "wye"' data/small}}HERE
POKI_RUN_COMMAND{{mlr --opprint filter '($x > 0.5 && $y > 0.5) || ($x < 0.5 && $y < 0.5)' then stats2 -a corr -f x,y data/medium}}HERE
|
POKI_RUN_COMMAND{{mlr --opprint filter '($x > 0.5 && $y < 0.5) || ($x < 0.5 && $y > 0.5)' then stats2 -a corr -f x,y data/medium}}HERE
|
Newlines within the expression are ignored, which can help increase legibility of complex expressions:
POKI_INCLUDE_ESCAPED(data/filter-multiline-example.txt)HERE
grep
POKI_RUN_COMMAND{{mlr grep -h}}HERE
group-by
POKI_RUN_COMMAND{{mlr group-by --help}}HERE
This is similar to sort but with less work. Namely, Miller’s
sort has three steps: read through the data and append linked lists of records,
one for each unique combination of the key-field values; after all records
are read, sort the key-field values; then print each record-list. The group-by
operation simply omits the middle sort. An example should make this more
clear.
POKI_RUN_COMMAND{{mlr --opprint group-by a data/small}}HERE
|
POKI_RUN_COMMAND{{mlr --opprint sort -f a data/small}}HERE
|
In this example, since the sort is on field a, the first step is
to group together all records having the same value for field a; the
second step is to sort the distinct a-field values pan,
eks, and wye into eks, pan, and
wye; the third step is to print out the record-list for
a=eks, then the record-list for a=pan, then the record-list
for a=wye. The group-by operation omits the middle sort and just puts
like records together, for those times when a sort isn’t desired. In
particular, the ordering of group-by fields for group-by is the order in which
they were encountered in the data stream, which in some cases may be more interesting
to you.
group-like
POKI_RUN_COMMAND{{mlr group-like --help}}HERE
This groups together records having the same schema (i.e. the same ordered list of field names),
which is useful for making sense of time-ordered output as described in
POKI_PUT_LINK_FOR_PAGE(record-heterogeneity.html)HERE — in particular, in
preparation for CSV or pretty-print output.
POKI_RUN_COMMAND{{mlr cat data/het.dkvp}}HERE
|
POKI_RUN_COMMAND{{mlr --opprint group-like data/het.dkvp}}HERE
|
having-fields
POKI_RUN_COMMAND{{mlr having-fields --help}}HERE
Similar to group-like, this retains records having specified fields.
POKI_RUN_COMMAND{{mlr cat data/het.dkvp}}HERE
|
POKI_RUN_COMMAND{{mlr having-fields --at-least resource data/het.dkvp}}HERE
|
POKI_RUN_COMMAND{{mlr having-fields --which-are resource,ok,loadsec data/het.dkvp}}HERE
|
head
POKI_RUN_COMMAND{{mlr head --help}}HERE
Note that head is distinct from top
— head shows records which appear first in the data stream;
top shows the largest (or smallest) values of specified fields.
POKI_RUN_COMMAND{{mlr --opprint head -n 4 data/medium}}HERE
|
POKI_RUN_COMMAND{{mlr --opprint head -n 1 -g b data/medium}}HERE
|
histogram
POKI_RUN_COMMAND{{mlr histogram --help}}HERE
This is just a histogram; there’s not too much to say here. A note about
binning, by example: Suppose you use --lo 0.0 --hi 1.0 --nbins 10 -f
x. The input numbers less than 0 or greater than 1 aren’t counted
in any bin. Input numbers equal to 1 are counted in the last bin. That is, bin
0 has 0.0 ≤ x < 0.1, bin 1 has 0.1 ≤ x < 0.2,
etc., but bin 9 has 0.9 ≤ x ≤ 1.0.
POKI_RUN_COMMAND{{mlr --opprint put '$x2=$x**2;$x3=$x2*$x' then histogram -f x,x2,x3 --lo 0 --hi 1 --nbins 10 data/medium}}HERE
join
POKI_RUN_COMMAND{{mlr join --help}}HERE
Examples:
Join larger table with IDs with smaller ID-to-name lookup table, showing only paired records:
POKI_RUN_COMMAND{{mlr --icsvlite --opprint cat data/join-left-example.csv}}HERE
|
POKI_RUN_COMMAND{{mlr --icsvlite --opprint cat data/join-right-example.csv}}HERE
|
POKI_RUN_COMMAND{{mlr --icsvlite --opprint join -u -j id -r idcode -f data/join-left-example.csv data/join-right-example.csv}}HERE
|
Same, but with sorting the input first:
POKI_RUN_COMMAND{{mlr --icsvlite --opprint sort -f idcode then join -j id -r idcode -f data/join-left-example.csv data/join-right-example.csv}}HERE
|
Same, but showing only unpaired records:
POKI_RUN_COMMAND{{mlr --icsvlite --opprint join --np --ul --ur -u -j id -r idcode -f data/join-left-example.csv data/join-right-example.csv}}HERE
|
Use prefixing options to disambiguate between otherwise identical non-join field names:
POKI_RUN_COMMAND{{mlr --csvlite --opprint cat data/self-join.csv data/self-join.csv}}HERE
|
POKI_RUN_COMMAND{{mlr --csvlite --opprint join -j a --lp left_ --rp right_ -f data/self-join.csv data/self-join.csv}}HERE
|
Use zero join columns:
POKI_RUN_COMMAND{{mlr --csvlite --opprint join -j "" --lp left_ --rp right_ -f data/self-join.csv data/self-join.csv}}HERE
|
label
POKI_RUN_COMMAND{{mlr label --help}}HERE
See also rename.
Example: Files such as /etc/passwd, /etc/group, and so on
have implicit field names which are found in section-5 manpages. These field names may be made explicit as follows:
POKI_INCLUDE_ESCAPED(data/label-example.txt)HERE
Likewise, if you have CSV/CSV-lite input data which has somehow been bereft of its header line, you can re-add a header line using --implicit-csv-header and label:
POKI_RUN_COMMAND{{cat data/headerless.csv}}HERE
POKI_RUN_COMMAND{{mlr --csv --rs lf --implicit-csv-header cat data/headerless.csv}}HERE
POKI_RUN_COMMAND{{mlr --csv --rs lf --implicit-csv-header label name,age,status data/headerless.csv}}HERE
POKI_RUN_COMMAND{{mlr --icsv --rs lf --implicit-csv-header --opprint label name,age,status data/headerless.csv}}HERE
merge-fields
POKI_RUN_COMMAND{{mlr merge-fields --help}}HERE
This is like mlr stats1 but all accumulation is done across fields
within each given record: horizontal rather than vertical statistics, if you
will.
Examples:
POKI_RUN_COMMAND{{mlr --csvlite --opprint cat data/inout.csv}}HERE
POKI_RUN_COMMAND{{mlr --csvlite --opprint merge-fields -a min,max,sum -c _in,_out data/inout.csv}}HERE
POKI_RUN_COMMAND{{mlr --csvlite --opprint merge-fields -k -a sum -c _in,_out data/inout.csv}}HERE
nest
POKI_RUN_COMMAND{{mlr nest -h}}HERE
put
POKI_RUN_COMMAND{{mlr put --help}}HERE
Field names for put
Field names must be specified using a $ in filter and put expressions, even though
they don’t appear in the data stream. For integer-indexed data, this
looks like awk’s $1,$2,$3. Likewise, enclose string
literals in double quotes in put expressions even though they
don’t appear in file data. In particular, mlr put '$x="abc"'
creates the field x=abc and mlr filter '$x=="abc"' passes
through the field x if it has the value abc. If field names
have special characters such as . then you can use braces, e.g.
'${field.name}'.
You may also use a computed field name in square brackets, e.g.
POKI_RUN_COMMAND{{echo s=green,t=blue,a=3,b=4 | mlr put '$[$s."_".$t] = $a * $b'}}HERE
Built-in variables for put
Miller supports the following five built-in variables for filter and put, all awk-inspired:
NF, NR, FNR, FILENUM, and
FILENAME, as well as the mathematical constants PI and
E. Lastly, the ENV hashmap allows read access to environment
variables, e.g. ENV["HOME"] or ENV["foo_".$hostname].
Expression formatting for put
Multiple expressions may be given, separated by semicolons, and each may refer to the ones before:
POKI_RUN_COMMAND{{ruby -e '10.times{|i|puts "i=#{i}"}' | mlr --opprint put '$j = $i + 1; $k = $i +$j'}}HERE
Newlines within the expression are ignored, which can help increase legibility of complex expressions:
POKI_INCLUDE_AND_RUN_ESCAPED(data/put-multiline-example.txt)HERE
Semicolons, newlines, and curly braces for put
Miller uses semicolons as statement separators, not statement terminators. This means you can write:
POKI_INCLUDE_ESCAPED(data/semicolon-example.txt)HERE
Semicolons are optional after closing curly braces (which close conditionals and loops as discussed below).
POKI_RUN_COMMAND{{echo x=1,y=2 | mlr put 'while (NF < 10) { $[NF+1] = ""} $foo = "bar"'}}HERE
POKI_RUN_COMMAND{{echo x=1,y=2 | mlr put 'while (NF < 10) { $[NF+1] = ""}; $foo = "bar"'}}HERE
Semicolons are required between statements even if those statements are on
separate lines. Newlines are for your convenience but have no syntactic
meaning: line endings do not terminate statements. For example, adjacent
assignment statements must be separated by semicolons even if those statements
are on separate lines:
POKI_INCLUDE_ESCAPED(data/newline-example.txt)HERE
Bodies for all compound statements must be enclosed in curly braces, even if the body is a single statement:
POKI_CARDIFY{{mlr put 'if ($x == 1) $y = 2 # Syntax error}}HERE
POKI_CARDIFY{{mlr put 'if ($x == 1) { $y = 2 } # This is OK}}HERE
Bodies for compound statements may be empty:
POKI_CARDIFY{{mlr put 'if ($x == 1) { } # This no-op is syntactically acceptable}}HERE
Out-of-stream variables for put
There are three kinds of variables in Miller:
Built-in variables, as discussed above:
These are written all in capital letters, such as NR,
NF, FILENAME, and only a small, specific set of them is
defined by Miller.
Their values change from one record to the next as Miller scans through
your input data stream: NR is the count of records so far encountered
in the input stream, starting at 1; NF is the number of fields in the
current input record; FILENAME is the current file name; and so on as
detailed above.
Their scope is global: you can refer to them in any filter
or put statement. Their values are assigned by the input-record reader:
POKI_RUN_COMMAND{{mlr --csv put '$nr = NR' data/a.csv}}HERE
POKI_RUN_COMMAND{{mlr --csv repeat -n 3 then put '$nr = NR' data/a.csv}}HERE
These are read-only for the mlr put and mlr filter
DSLs: they may be assigned from, e.g. $nr=NR, but they may not be
assigned to: NR=100 is a syntax error.
You can output built-in variables indirectly, by assigning them to a
non-built-in variable: e.g. $nr = NR adds a field named nr to
each output record, containing the value of NR as of when that record
was ingested.
Fields within stream records, as discussed above:
These are prefixed with a dollar sign, such as $quantity,
$hostname, etc.
Their names depend on the contents of your input data stream, and their
values change from one record to the next as Miller scans through your input
data stream.
They are scoped to the current record of the filter or
put command in which they appear.
These are read-write: you can do $y=2*$x, $x=$x+1, etc.
Records are Miller’s output: field names present in the input
stream are passed through to output (written to standard output) unless fields
are removed with cut, or records are excluded with filter or
put -q, etc. Simply assign a value to a field and it will be output.
Out-of-stream variables, presented here:
These are prefixed with an at-sign, e.g. @sum. Furthermore,
unlike built-in variables and stream-record fields, they are maintained in an
arbitrarily nested hashmap: you can do @sum += $quantity, or
@sum[$color] += $quantity, or @sum[$color][$shape] +=
$quantity. The keys for the multi-level hashmap can be any expression which
evaluates to string or integer: e.g. @sum[NR] = $a + $b,
@sum[$a."-".$b] = $x, etc.
Their names and their values are entirely under your control; they change
only when you assign to them.
Just as for field names in stream records, if you want to define out-of-stream variables
with special characters such as . then you can use braces, e.g. '@{variable.name}["index"]'.
You may use a computed key in square brackets, e.g.
POKI_RUN_COMMAND{{echo s=green,t=blue,a=3,b=4 | mlr put -q '@[$s."_".$t] = $a * $b; emit all'}}HERE
Out-of-stream variables are scoped to the put command in which they appear.
In particular, if you have two or more put commands separated by then,
each put will have its own set of out-of-stream variables:
POKI_RUN_COMMAND{{cat data/a.dkvp}}HERE
POKI_RUN_COMMAND{{mlr put '@sum += $a; end {emit @sum}' then put 'ispresent($a) {$a=10*$a; @sum += $a}; end {emit @sum}' data/a.dkvp}}HERE
Out-of-stream variables are read-write: you can do $sum=@sum, @sum=$sum,
etc.
You can output these in four ways: (1) assign them
to stream-record fields, e.g. $cumulative_sum = @sum; (2) use
emit/emitp/emitf, e.g. @sum += $x; emit @sum
which produces an extra output record such as sum=3.1648382; (3) use
the dump or edump keywords, which immediately print all
out-of-stream variables as a JSON data structure to the standard output or
standard error (respectively), or (4) use the print or eprint
keywords which immediately print an expression to standard output or standard
error, respectively. Note that dump, edump, print,
and eprint don’t output records which participate in
then-chaining; rather, they’re just immediate prints to
stdout/stderr.
Features of out-of-stream variables, and examples of their use, will be
presented in the following sections.
Begin/end blocks for put
Miller supports an awk-like begin/end syntax. The
statements in the begin block are executed before any input records
are read; the statements in the end block are executed after the last
input record is read. (If you want to execute some statement at the start of
each file, not at the start of the first file as with begin, you might
use a pattern/action block of the form FNR == 1 { ... }.) All
statements outside of begin or end are, of course, executed
on every input record. Semicolons separate statements inside or outside of
begin/end blocks; semicolons are required between begin/end block bodies and
any subsequent statement. For example:
POKI_INCLUDE_AND_RUN_ESCAPED(data/begin-end-example-1.sh)HERE
Since uninitialized out-of-stream variables default to 0 for
addition/subtraction and 1 for multiplication when they appear on expression
right-hand sides (as in awk), the above can be written more succinctly
as
POKI_INCLUDE_AND_RUN_ESCAPED(data/begin-end-example-2.sh)HERE
The put -q option is a shorthand which suppresses printing of each
output record, with only emit statements being output. So to get only
summary outputs, one could write
POKI_INCLUDE_AND_RUN_ESCAPED(data/begin-end-example-3.sh)HERE
We can do similarly with multiple out-of-stream variables:
POKI_INCLUDE_AND_RUN_ESCAPED(data/begin-end-example-4.sh)HERE
This is of course not much different than
POKI_INCLUDE_AND_RUN_ESCAPED(data/begin-end-example-5.sh)HERE
Note that it’s a syntax error for begin/end blocks to refer to field names (beginning with $),
since these execute outside the context of input records.
Indexed out-of-stream variables for put
Using an index on the @count and @sum variables, we get the benefit of the
-g (group-by) option which mlr stats1 and various other Miller commands have:
POKI_INCLUDE_AND_RUN_ESCAPED(data/begin-end-example-6.sh)HERE
POKI_INCLUDE_AND_RUN_ESCAPED(data/begin-end-example-7.sh)HERE
Indices can be arbitrarily deep — here there are two or more of them:
POKI_INCLUDE_AND_RUN_ESCAPED(data/begin-end-example-6a.sh)HERE
POKI_INCLUDE_AND_RUN_ESCAPED(data/begin-end-example-7a.sh)HERE
The idea is that stats1, and other Miller commands, encapsulate
frequently-used patterns with a minimum of keystroking (and run a little
faster), whereas using out-of-stream variables you have more flexibility and
control in what you do. Out-of-stream variables, along with pattern/action
blocks and begin/end blocks, give you flexibility in what you can do with
Miller.
Begin/end blocks can be mixed with pattern/action blocks. For example:
POKI_INCLUDE_AND_RUN_ESCAPED(data/begin-end-example-8.sh)HERE
Emit statements for put
As noted above, there are three ways to output out-of-stream variables:
(1) Assign them to stream-record fields, e.g. $cumulative_sum = @sum;
(2) Use emit, e.g. @sum += $x; emit @sum which produces an
extra output record such as sum=3.1648382; (3) Use the dump
keyword, which immediately prints all out-of-stream variables to the standard
output as a JSON data structure. Note that the latter aren’t output records
which participate in then-chaining; rather, they’re just an
immediate print to stdout. This section is about emit.
There are three variants: emitf, emit, and
emitp. Keep in mind that out-of-stream variables are a nested,
multi-level hashmap (directly viewable as JSON using dump), whereas
Miller output records are lists of single-level key-value pairs. The three emit
variants allow you to control how the multilevel hashmaps are flattened down to
output records.
Use emitf to output several out-of-stream variables side-by-side in the same output record.
For emitf these mustn’t have indexing using @name[...]. Example:
POKI_RUN_COMMAND{{mlr put -q '@count += 1; @x_sum += $x; @y_sum += $y; end { emitf @count, @x_sum, @y_sum}' data/small}}HERE
Use emit to output an out-of-stream variable. If it’s non-indexed you’ll get a simple key-value pair:
POKI_RUN_COMMAND{{cat data/small}}HERE
POKI_RUN_COMMAND{{mlr put -q '@sum += $x; end { dump }' data/small}}HERE
POKI_RUN_COMMAND{{mlr put -q '@sum += $x; end { emit @sum }' data/small}}HERE
If it’s indexed then use as many names after emit as there are indices:
POKI_RUN_COMMAND{{mlr put -q '@sum[$a] += $x; end { dump }' data/small}}HERE
POKI_RUN_COMMAND{{mlr put -q '@sum[$a] += $x; end { emit @sum, "a" }' data/small}}HERE
POKI_RUN_COMMAND{{mlr put -q '@sum[$a][$b] += $x; end { dump }' data/small}}HERE
POKI_RUN_COMMAND{{mlr put -q '@sum[$a][$b] += $x; end { emit @sum, "a", "b" }' data/small}}HERE
POKI_RUN_COMMAND{{mlr put -q '@sum[$a][$b][$i] += $x; end { dump }' data/small}}HERE
POKI_RUN_COMMAND{{mlr put -q '@sum[$a][$b][$i] += $x; end { emit @sum, "a", "b", "i" }' data/small}}HERE
Now for emitp: if you have as many names following emit as
there are levels in the out-of-stream variable’s hashmap, then emit and emitp do the same
thing. Where they differ is when you don’t specify as many names as there are hashmap levels. In this
case, Miller needs to flatten multiple map indices down to output-record keys: emitp includes full
prefixing (hence the p in emitp) while emit takes the deepest hashmap key as the
output-record key:
POKI_RUN_COMMAND{{mlr put -q '@sum[$a][$b] += $x; end { dump }' data/small}}HERE
POKI_RUN_COMMAND{{mlr put -q '@sum[$a][$b] += $x; end { emit @sum, "a" }' data/small}}HERE
POKI_RUN_COMMAND{{mlr put -q '@sum[$a][$b] += $x; end { emit @sum }' data/small}}HERE
POKI_RUN_COMMAND{{mlr put -q '@sum[$a][$b] += $x; end { emitp @sum, "a" }' data/small}}HERE
POKI_RUN_COMMAND{{mlr put -q '@sum[$a][$b] += $x; end { emitp @sum }' data/small}}HERE
POKI_RUN_COMMAND{{mlr --oxtab put -q '@sum[$a][$b] += $x; end { emitp @sum }' data/small}}HERE
Use --oflatsep to specify the character which joins multilevel
keys for emitp (it defaults to a colon):
POKI_RUN_COMMAND{{mlr put -q --oflatsep / '@sum[$a][$b] += $x; end { emitp @sum, "a" }' data/small}}HERE
POKI_RUN_COMMAND{{mlr put -q --oflatsep / '@sum[$a][$b] += $x; end { emitp @sum }' data/small}}HERE
POKI_RUN_COMMAND{{mlr --oxtab put -q --oflatsep / '@sum[$a][$b] += $x; end { emitp @sum }' data/small}}HERE
Emit-all statements for put
Use emit all (or emit @*, which is synonymous) to output all
out-of-stream variables. You can use the following idiom to get various
accumulators output side-by-side (reminiscent of mlr stats1):
POKI_RUN_COMMAND{{mlr --opprint put -q '@v["sum"] += $x; @v["count"] += 1; end{emit all}' data/small}}HERE
POKI_RUN_COMMAND{{mlr --opprint put -q '@v[$a][$b]["sum"] += $x; @v[$a][$b]["count"] += 1; end{emit @*,"a","b"}' data/small}}HERE
Unset statements for put
You can clear a map key by assigning the empty string as its value: $x="" or @x="".
Using unset you can remove the key entirely. Examples:
POKI_RUN_COMMAND{{cat data/small}}HERE
POKI_RUN_COMMAND{{mlr put 'unset $x, $a' data/small}}HERE
This can also be done, of course, using mlr cut -x. You can also clear out-of-stream variables, at the base name level, or at an indexed sublevel:
POKI_RUN_COMMAND{{mlr put -q '@sum[$a][$b] += $x; end { dump; unset @sum; dump }' data/small}}HERE
POKI_RUN_COMMAND{{mlr put -q '@sum[$a][$b] += $x; end { dump; unset @sum["eks"]; dump }' data/small}}HERE
If you use unset all (or unset @* which is synonymous), that will unset all out-of-stream
variables which have been defined up to that point.
More variable assignments for put
There are three remaining kinds of variable assignment using out-of-stream
variables, the last two of which use the $* syntax:
Recursive copy of out-of-stream variables
Out-of-stream variable assigned to full stream record
Full stream record assigned to an out-of-stream variable
Example recursive copy of out-of-stream variables:
POKI_RUN_COMMAND{{mlr --opprint put -q '@v["sum"] += $x; @v["count"] += 1; end{dump; @w = @v; dump}' data/small}}HERE
Example of out-of-stream variable assigned to full stream record, where the 2nd record is stashed, and the 4th record is overwritten with that:
POKI_RUN_COMMAND{{mlr put 'NR == 2 {@keep = $*}; NR == 4 {$* = @keep}' data/small}}HERE
Example of full stream record assigned to an out-of-stream variable, finding
the record for which the x field has the largest value in the input
stream:
POKI_RUN_COMMAND{{cat data/small}}HERE
POKI_RUN_COMMAND{{mlr --opprint put -q 'isnull(@xmax) || $x > @xmax {@xmax=$x; @recmax=$*}; end {emit @recmax}' data/small}}HERE
Pattern-action blocks for put
These are reminiscent of awk syntax. They can be used to allow
assignments to be done only when appropriate — e.g. for math-function
domain restrictions, regex-matching, and so on:
POKI_RUN_COMMAND{{mlr cat data/put-gating-example-1.dkvp}}HERE
POKI_RUN_COMMAND{{mlr put '$x > 0.0 { $y = log10($x); $z = sqrt($y) }' data/put-gating-example-1.dkvp}}HERE
POKI_RUN_COMMAND{{mlr cat data/put-gating-example-2.dkvp}}HERE
POKI_RUN_COMMAND{{mlr put '$a =~ "([a-z]+)_([0-9]+)" { $b = "left_\1"; $c = "right_\2" }' data/put-gating-example-2.dkvp}}HERE
This produces heterogeneous output which Miller, of course, has no problems
with (see POKI_PUT_LINK_FOR_PAGE(record-heterogeneity.html)HERE). But if you
want homogeneous output, the curly braces can be replaced with a semicolon
between the expression and the body statements. This causes put to
evaluate the boolean expression (along with any side effects, namely,
regex-captures \1, \2, etc.) but doesn’t use it as a
criterion for whether subsequent assignments should be executed. Instead,
subsequent assignments are done unconditionally:
POKI_RUN_COMMAND{{mlr put '$x > 0.0; $y = log10($x); $z = sqrt($y)' data/put-gating-example-1.dkvp}}HERE
POKI_RUN_COMMAND{{mlr put '$a =~ "([a-z]+)_([0-9]+)"; $b = "left_\1"; $c = "right_\2"' data/put-gating-example-2.dkvp}}HERE
If-statements for put
These are again reminiscent of awk. Pattern-action blocks are a special case of if with no
elif or else blocks, no if keyword, and parentheses optional around the boolean expression:
POKI_CARDIFY{{mlr put 'NR == 4 {$foo = "bar"}'}}HERE
POKI_CARDIFY{{mlr put 'if (NR == 4) {$foo = "bar"}'}}HERE
Compound statements use elif (rather than elsif or else if):
POKI_INCLUDE_ESCAPED(data/if-chain.sh)HERE
While and do-while loops for put
Miller’s while and do-while are unsurprising in
comparison to various languages, as are break and continue:
POKI_INCLUDE_AND_RUN_ESCAPED(data/while-example-1.sh)HERE
POKI_INCLUDE_AND_RUN_ESCAPED(data/while-example-2.sh)HERE
A break or continue within nested conditional blocks or
if-statements will, of course, propagate to the innermost loop enclosing them,
if any. A break or continue outside a loop is a syntax error
that will be flagged as soon as the expression is parsed, before any input
records are ingested.
The existence of while, do-while, and for loops
in Miller’s DSL means that you can create infinite-loop scenarios
inadvertently. In particular, please recall that DSL statements are executed
once if in begin or end blocks, and once per record
otherwise. For example, while (NR < 10) will never terminate as
NR is only incremented between records.
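By contrast, here is a hedged sketch of a loop which does terminate, since its counter is advanced inside the loop body:
$ echo 'x=1,y=2' | mlr put '$n = 0; while ($n < 3) { $n = $n + 1 }'
# Expected output: x=1,y=2,n=3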
For-loops for put
While Miller’s while and do-while statements are
much as in many other languages, for loops are more idiosyncratic to
Miller. They are loops over key-value pairs, whether in stream records or
out-of-stream variables: more reminiscent of foreach, as in (for
example) PHP.
There are two variants: for-loop over key-value pairs in the current
stream record and for-loop over key-value pairs in an out-of-stream
variable. In each case the in keyword specifies the hashmap being
iterated over, and the variable names between for and in are
bound to the keys and values, respectively, of the hashmap’s key-value pairs on
each loop iteration. As with while and do-while, a
break or continue within nested control structures will
propagate to the innermost loop enclosing them, if any, and a break or
continue outside a loop is a syntax error that will be flagged as soon
as the expression is parsed, before any input records are ingested.
For-loop over the current stream record:
POKI_RUN_COMMAND{{cat data/for-srec-example.tbl}}HERE
POKI_INCLUDE_AND_RUN_ESCAPED(data/for-srec-example-1.sh)HERE
POKI_RUN_COMMAND{{mlr --from data/small --opprint put 'for (k,v in $*) { $[k."_type"] = typeof(v) }'}}HERE
Note that the value of the current field in the for-loop can be gotten either using the bound
variable value, or through a computed field name using square brackets as in $[key].
Important note: to avoid inconsistent looping behavior in case you’re
setting new fields (and/or unsetting existing ones) while looping over the
record, Miller makes a copy of the record before the loop: loop variables
are bound from the copy and all other reads/writes involve the record
itself:
POKI_INCLUDE_AND_RUN_ESCAPED(data/for-srec-example-2.sh)HERE
It can be confusing to modify the stream record while iterating over a copy of it, so
instead you might find it simpler to use an out-of-stream variable in the loop and only update
the stream record after the loop:
POKI_INCLUDE_AND_RUN_ESCAPED(data/for-srec-example-3.sh)HERE
No triple-for: As of Miller 4.1.0 there is no C-style triple-for of the form
POKI_CARDIFY{{for (i = 1; i <= 10; i++) { ... } # No such}}HERE
but this can be synthesized using out-of-stream variables and while:
POKI_CARDIFY{{@i = 1; while (@i <= 10) {...; @i += 1}}}HERE
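Here is a runnable, hedged version of that pattern, summing the integers 1 through 10 into an out-of-stream variable (the echo input is just a placeholder record):
$ echo 'x=1' | mlr put -q '@i = 1; @sum = 0; while (@i <= 10) { @sum += @i; @i += 1 }; end { emit @sum }'
# Expected output: sum=55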
For-loop over out-of-stream variable: This is similar to looping
over the current stream record except for additional degrees of freedom: you
can start iterating on sub-hashmaps of an out-of-stream variable; you can loop
over nested keys; you can loop over all out-of-stream variables. As with
for-loops over stream records, the bound variables are bound to a copy of the
sub-hashmap as it was before the loop started. The sub-hashmap is specified by
square-bracketed indices after in, and additional deeper indices are
bound to loop key-variables. The terminal values are bound to the loop
value-variable whenever the keys are neither too shallow, nor too deep. Example
indexing is as follows:
POKI_INCLUDE_ESCAPED(data/for-oosvar-example-0a.txt)HERE
That’s confusing in the abstract, so a concrete example is in order.
Suppose the out-of-stream variable @myvar is populated as follows:
POKI_INCLUDE_AND_RUN_ESCAPED(data/for-oosvar-example-0b.sh)HERE
Then the too-shallow parts — indexed by the basename myvar
and the index "nesting-is-too-shallow" — have depth two
(basename and one index specify a terminal value) and can be gotten as follows:
POKI_INCLUDE_AND_RUN_ESCAPED(data/for-oosvar-example-0c.sh)HERE
POKI_INCLUDE_AND_RUN_ESCAPED(data/for-oosvar-example-0d.sh)HERE
Note that it would take more than these two indices to reach the deeper values in the hashmap so they
aren’t bound in either of these for-loops.
By contrast, the "just-right" parts have depth three (basename and
two indices specify a terminal value) and can be gotten at by any of the
following:
POKI_INCLUDE_AND_RUN_ESCAPED(data/for-oosvar-example-0e.sh)HERE
POKI_INCLUDE_AND_RUN_ESCAPED(data/for-oosvar-example-0f.sh)HERE
POKI_INCLUDE_AND_RUN_ESCAPED(data/for-oosvar-example-0g.sh)HERE
Note that three key levels are specified here: basename and two indices.
So these for-loops don’t produce the depth-two or depth-four entries in
the hashmap.
Filter statements for put
You can use filter within put. In fact, the
following two are synonymous:
POKI_RUN_COMMAND{{mlr filter 'NR==2 || NR==3' data/small}}HERE
POKI_RUN_COMMAND{{mlr put 'filter NR==2 || NR==3' data/small}}HERE
The former, of course, is much easier to type. But the latter allows you to define more complex expressions
for the filter, and/or do other things in addition to the filter:
POKI_RUN_COMMAND{{mlr put '@running_sum += $x; filter @running_sum > 1.3' data/small}}HERE
POKI_RUN_COMMAND{{mlr put '$z = $x * $y; filter $z > 0.3' data/small}}HERE
A note on the complexity of put
One of Miller’s strengths is its brevity: it’s much quicker
— and less error-prone — to type mlr stats1 -a sum -f x,y -g
a,b than having to track summation variables as in awk, or using
Miller’s out-of-stream variables. And the more language features
Miller’s put-DSL has (for-loops, if-statements, nested control
structures, etc.) then the less powerful it begins to seem: because of
the other programming-language features it doesn’t have.
When I was originally prototyping Miller in 2015, the decision I had was
whether to hand-code in a low-level language like C or Rust, with my own
hand-rolled DSL, or whether to use a higher-level language (like Python or Lua
or Nim) and let the put statements be handled by the implementation
language’s own eval: the implementation language would take the
place of a DSL. Multiple performance experiments showed me I could get better
throughput using the former, and using C in particular — by a wide margin. So
Miller is C under the hood with a hand-rolled DSL.
I do want to keep focusing on what Miller is good at — concise notation,
low latency, and high throughput — and not add too much in terms of
high-level-language features to the DSL. That said, some sort of looping over
field names is a basic thing to want. As of 4.1.0 we have recursive
for/while/if structures on about the same complexity level as awk.
While I’m excited by these powerful language features, I hope to keep new
features beyond 4.1.0 focused on Miller’s sweet spot which is speed plus
simplicity.
regularize
POKI_RUN_COMMAND{{mlr regularize --help}}HERE
This exists since hash-map software in various languages and tools
encountered in the wild does not always print similar rows with fields in the
same order: mlr regularize helps clean that up.
See also reorder.
rename
POKI_RUN_COMMAND{{mlr rename --help}}HERE
POKI_RUN_COMMAND{{mlr --opprint cat data/small}}HERE
|
POKI_RUN_COMMAND{{mlr --opprint rename i,INDEX,b,COLUMN2 data/small}}HERE
|
As discussed in POKI_PUT_LINK_FOR_PAGE(performance.html)HERE, sed
is significantly faster than Miller at doing this. However, Miller is
format-aware, so it knows to do renames only within specified field keys and
not any others, nor in field values which may happen to contain the same
pattern. Example:
POKI_RUN_COMMAND{{sed 's/y/COLUMN5/g' data/small}}HERE
|
POKI_RUN_COMMAND{{mlr rename y,COLUMN5 data/small}}HERE
|
See also label.
reorder
POKI_RUN_COMMAND{{mlr reorder --help}}HERE
This pivots specified field names to the start or end of the record — for
example when you have highly multi-column data and you want to bring a field or
two to the front of the line where you can give it a quick visual scan.
POKI_RUN_COMMAND{{mlr --opprint cat data/small}}HERE
|
POKI_RUN_COMMAND{{mlr --opprint reorder -f i,b data/small}}HERE
|
POKI_RUN_COMMAND{{mlr --opprint reorder -e -f i,b data/small}}HERE
|
repeat
POKI_RUN_COMMAND{{mlr repeat --help}}HERE
This is useful in at least two ways: one, as a data-generator as in the
above example using urand(); two, for reconstructing individual
samples from data which has been count-aggregated:
POKI_RUN_COMMAND{{cat data/repeat-example.dat}}HERE
POKI_RUN_COMMAND{{mlr repeat -f count then cut -x -f count data/repeat-example.dat}}HERE
After expansion with repeat, such data can then be sent on to
stats1 -a mode, or (if the data are numeric) to stats1 -a
p10,p50,p90, etc.
reshape
POKI_RUN_COMMAND{{mlr reshape --help}}HERE
sample
POKI_RUN_COMMAND{{mlr sample --help}}HERE
This is reservoir-sampling: select k items from n with
uniform probability and no repeats in the sample. (If n is less than
k, then of course only n samples are produced.) With -g
{field names}, produce a k-sample for each distinct value of the
specified field names.
POKI_INCLUDE_ESCAPED(data/sample-example.txt)HERE
Note that no output is produced until all inputs are in. Another way to do
sampling, which works in the streaming case, is mlr filter 'urand() <
0.001' where you tune the 0.001 to meet your needs.
sec2gmt
POKI_RUN_COMMAND{{mlr sec2gmt -h}}HERE
shuffle
POKI_RUN_COMMAND{{mlr shuffle -h}}HERE
sort
POKI_RUN_COMMAND{{mlr sort --help}}HERE
Example:
POKI_RUN_COMMAND{{mlr --opprint sort -f a -nr x data/small}}HERE
Here’s an example filtering log data: suppose multiple threads (labeled here by color) are all logging progress counts to a single log file. The log file is (by nature) chronological, so the progress of various threads is interleaved:
POKI_RUN_COMMAND{{head -n 10 data/multicountdown.dat}}HERE
We can group these by thread by sorting on the thread ID (here,
color). Since Miller’s sort is stable, this means that
timestamps within each thread’s log data are still chronological:
POKI_RUN_COMMAND{{head -n 20 data/multicountdown.dat | mlr --opprint sort -f color}}HERE
Any records not having all specified sort keys will appear at the end of the output, in the order they
were encountered, regardless of the specified sort order:
POKI_RUN_COMMAND{{mlr sort -n x data/sort-missing.dkvp}}HERE
POKI_RUN_COMMAND{{mlr sort -nr x data/sort-missing.dkvp}}HERE
stats1
POKI_RUN_COMMAND{{mlr stats1 --help}}HERE
These are simple univariate statistics on one or more number-valued fields
(count and mode apply to non-numeric fields as well),
optionally categorized by one or more other fields.
POKI_RUN_COMMAND{{mlr --oxtab stats1 -a count,sum,min,p10,p50,mean,p90,max -f x,y data/medium}}HERE
|
POKI_RUN_COMMAND{{mlr --opprint stats1 -a mean -f x,y -g b then sort -f b data/medium}}HERE
|
POKI_RUN_COMMAND{{mlr --opprint stats1 -a p50,p99 -f u,v -g color then put '$ur=$u_p99/$u_p50;$vr=$v_p99/$v_p50' data/colored-shapes.dkvp}}HERE
|
POKI_RUN_COMMAND{{mlr --opprint count-distinct -f shape then sort -nr count data/colored-shapes.dkvp}}HERE
|
POKI_RUN_COMMAND{{mlr --opprint stats1 -a mode -f color -g shape data/colored-shapes.dkvp}}HERE
|
stats2
POKI_RUN_COMMAND{{mlr stats2 --help}}HERE
These are simple bivariate statistics on one or more pairs of number-valued
fields, optionally categorized by one or more fields.
POKI_RUN_COMMAND{{mlr --oxtab put '$x2=$x*$x; $xy=$x*$y; $y2=$y**2' then stats2 -a cov,corr -f x,y,y,y,x2,xy,x2,y2 data/medium}}HERE
|
POKI_RUN_COMMAND{{mlr --opprint put '$x2=$x*$x; $xy=$x*$y; $y2=$y**2' then stats2 -a linreg-ols,r2 -f x,y,y,y,xy,y2 -g a data/medium}}HERE
|
Here’s a simple line-fit example. The x and y
fields of the data/medium dataset are just independent uniformly
distributed on the unit interval. Here we remove half the data and fit a line to it.
POKI_INCLUDE_ESCAPED(data/linreg-example.txt)HERE
I use pgr for plotting.
(Thanks Drew Kunas for a good conversation about PCA!)
Here’s an example estimating time-to-completion for a set of jobs.
Input data comes from a log file, with number of work units left to do in the
count field and accumulated seconds in the upsec field,
labeled by the color field:
POKI_RUN_COMMAND{{head -n 10 data/multicountdown.dat}}HERE
We can do a linear regression on count remaining as a function of time: with c = m*u+b we want to find the
time when the count goes to zero, i.e. u=-b/m.
POKI_RUN_COMMAND{{mlr --oxtab stats2 -a linreg-pca -f upsec,count -g color then put '$donesec = -$upsec_count_pca_b/$upsec_count_pca_m' data/multicountdown.dat}}HERE
step
POKI_RUN_COMMAND{{mlr step --help}}HERE
Most Miller commands are record-at-a-time, with the exception of stats1,
stats2, and histogram which compute aggregate output. The
step command is intermediate: it allows the option of adding fields
which are functions of fields from previous records. Rsum is short for running sum.
POKI_RUN_COMMAND{{mlr --opprint step -a delta,rsum,counter -f x data/medium | head -15}}HERE
|
POKI_RUN_COMMAND{{mlr --opprint step -a delta,rsum,counter -f x -g a data/medium | head -15}}HERE
|
POKI_RUN_COMMAND{{mlr --opprint step -a ewma -f x -d 0.1,0.9 ../doc/data/medium | head -15}}HERE
|
POKI_RUN_COMMAND{{mlr --opprint step -a ewma -f x -d 0.1,0.9 -o smooth,rough ../doc/data/medium | head -15}}HERE
|
Example deriving uptime-delta from system uptime:
POKI_INCLUDE_ESCAPED(data/ping-delta-example.txt)HERE
tac
POKI_RUN_COMMAND{{mlr tac --help}}HERE
Prints the records in the input stream in reverse order. Note: this
requires Miller to retain all input records in memory before any output records
are produced.
POKI_RUN_COMMAND{{mlr --icsv --opprint cat data/a.csv}}HERE
|
POKI_RUN_COMMAND{{mlr --icsv --opprint cat data/b.csv}}HERE
|
POKI_RUN_COMMAND{{mlr --icsv --opprint tac data/a.csv data/b.csv}}HERE
|
POKI_RUN_COMMAND{{mlr --icsv --opprint put '$filename=FILENAME' then tac data/a.csv data/b.csv}}HERE
|
tail
POKI_RUN_COMMAND{{mlr tail --help}}HERE
Prints the last n records in the input stream, optionally by category.
POKI_RUN_COMMAND{{mlr --opprint tail -n 4 data/colored-shapes.dkvp}}HERE
|
POKI_RUN_COMMAND{{mlr --opprint tail -n 1 -g shape data/colored-shapes.dkvp}}HERE
|
top
POKI_RUN_COMMAND{{mlr top --help}}HERE
Note that top is distinct from head
— head shows records which appear first in the data stream;
top shows the largest (or smallest) values of specified fields.
POKI_RUN_COMMAND{{mlr --opprint top -n 4 -f x data/medium}}HERE
|
POKI_RUN_COMMAND{{mlr --opprint top -n 2 -f x -g a then sort -f a data/medium}}HERE
|
uniq
POKI_RUN_COMMAND{{mlr uniq --help}}HERE
POKI_RUN_COMMAND{{wc -l data/colored-shapes.dkvp}}HERE
|
POKI_RUN_COMMAND{{mlr uniq -g color,shape data/colored-shapes.dkvp}}HERE
|
POKI_RUN_COMMAND{{mlr --opprint uniq -g color,shape -c then sort -f color,shape data/colored-shapes.dkvp}}HERE
|
POKI_RUN_COMMAND{{mlr --opprint uniq -n -g color,shape data/colored-shapes.dkvp}}HERE
|
then-chaining
In accord with the
Unix philosophy, you can pipe data into or out of
Miller. For example:
POKI_CARDIFY(mlr cut --complement -f os_version *.dat | mlr sort -f hostname,uptime)HERE
You can, if you like, instead simply chain commands together using the
then keyword:
POKI_CARDIFY(mlr cut --complement -f os_version then sort -f hostname,uptime *.dat)HERE
Here’s a performance comparison:
POKI_INCLUDE_ESCAPED(data/then-chaining-performance.txt)HERE
There are two reasons to use then-chaining: one is for performance, although I
don’t expect this to be a win in all cases. Using then-chaining avoids
redundant string-parsing and string-formatting at each pipeline step: instead
input records are parsed once, they are fed through each pipeline stage in
memory, and then output records are formatted once. On the other hand, Miller
is single-threaded, while modern systems are usually multi-processor, and when
streaming-data programs operate through pipes, each one can use a CPU. Rest
assured you get the same results either way.
The other reason to use then-chaining is for simplicity: you don’t
have to re-type formatting flags (e.g. --csv --rs lf --fs tab) at every
pipeline stage.
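For example, a hedged sketch (log.csv is a hypothetical file with hostname and uptime columns):
# Piped: the format flags must be repeated at each stage:
$ mlr --csv cut -f hostname,uptime log.csv | mlr --csv sort -f hostname
# Then-chained: the flags are given once:
$ mlr --csv cut -f hostname,uptime then sort -f hostname log.csv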
Functions for filter and put
POKI_RUN_COMMAND{{mlr --help-all-functions}}HERE
Data types
Miller’s input and output are all string-oriented: there is (as of
August 2015 anyway) no support for binary record packing. In this sense,
everything is a string in and out of Miller. During processing, field names
are always strings, even if they have names like "3"; field values are usually
strings. Field values’ ability to be interpreted as a non-string type
only has meaning when comparison or function operations are done on them. And
it is an error condition if Miller encounters non-numeric (or otherwise
mistyped) data in a field in which it has been asked to do numeric (or
otherwise type-specific) operations.
Field values are treated as numeric for the following:
Numeric sort: mlr sort -n, mlr sort -nr.
Statistics: mlr histogram, mlr stats1, mlr stats2.
Cross-record arithmetic: mlr step.
For mlr put and mlr filter:
Miller’s types for function processing are null (empty
string), error, string, float (double-precision),
int (64-bit signed), and boolean.
On input, string values representable as numbers, e.g. "3" or "3.1", are
treated as int or float, respectively. If a record has x=1,y=2 then
mlr put '$z=$x+$y' will produce x=1,y=2,z=3, and mlr put
'$z=$x.$y' gives an error. To coerce back to string for processing, use
the string function: mlr put '$z=string($x).string($y)' will
produce x=1,y=2,z=12.
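As a runnable sketch of the coercion just described:
$ echo 'x=1,y=2' | mlr put '$z=$x+$y'
x=1,y=2,z=3
$ echo 'x=1,y=2' | mlr put '$z=string($x).string($y)'
x=1,y=2,z=12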
On input, string values representable as boolean (e.g. "true",
"false") are not automatically treated as boolean. (This is
because "true" and "false" are ordinary words, and auto
string-to-boolean on a column consisting of words would result in some strings
mixed with some booleans.) Use the boolean function to coerce: e.g.
giving the record x=1,y=2,w=false to mlr put '$z=($x<$y) ||
boolean($w)'.
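As a hedged, runnable form of that example:
$ echo 'x=1,y=2,w=false' | mlr put '$z=($x<$y) || boolean($w)'
# Expected output: x=1,y=2,w=false,z=true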
Functions take types as described in mlr --help-all-functions:
for example, log10 takes float input and produces float output,
gmt2sec maps string to int, and sec2gmt maps int to string.
All math functions described in mlr --help-all-functions take
integer as well as float input.
Null data: empty and absent
One of Miller’s key features is its support for heterogeneous
data. For example, take mlr sort: if you try to sort on field
hostname when not all records in the data stream have a field
named hostname, it is not an error (although you could pre-filter the
data stream using mlr having-fields --at-least hostname then sort
...). Rather, records lacking one or more sort keys are simply output
contiguously by mlr sort.
Miller has two kinds of null data:
Empty: a field name is present in a record (or in an out-of-stream
variable) with empty value: e.g. x=,y=2 in the data input stream, or
assignment $x="" or @x="" in mlr put.
Absent: a field name is not present, e.g. input record is
x=1,y=2 and a put or filter expression refers to
$z. Or, reading an out-of-stream variable which hasn’t been
assigned a value yet,
e.g. mlr put -q '@sum += $x; end{emit @sum}' or mlr put -q
'@sum[$a][$b] += $x; end{emit @sum, "a", "b"}'.
You can test these programmatically using the functions
isempty/isnotempty, isabsent/ispresent, and
isnull/isnotnull. For the last pair, note that null means
either empty or absent.
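For example, a hedged sketch applying these tests to an empty field, a present field, and an absent field:
$ echo 'x=,y=3' | mlr put '$xe=isempty($x); $yp=ispresent($y); $za=isabsent($z)'
# Expected: xe=true, yp=true, and za=true are appended to the record.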
Rules for null-handling:
- Records with one or more empty sort-field values sort after records with
all sort-field values present:
POKI_RUN_COMMAND{{mlr cat data/sort-null.dat}}HERE
POKI_RUN_COMMAND{{mlr sort -n a data/sort-null.dat}}HERE
POKI_RUN_COMMAND{{mlr sort -nr a data/sort-null.dat}}HERE
- Functions/operators which have one or more empty arguments produce empty output: e.g.
POKI_RUN_COMMAND{{echo 'x=2,y=3' | mlr put '$a=$x+$y'}}HERE
POKI_RUN_COMMAND{{echo 'x=,y=3' | mlr put '$a=$x+$y'}}HERE
POKI_RUN_COMMAND{{echo 'x=,y=3' | mlr put '$a=log($x);$b=log($y)'}}HERE
with the exception that the min and max functions are
special: if one argument is non-null, it wins:
POKI_RUN_COMMAND{{echo 'x=,y=3' | mlr put '$a=min($x,$y);$b=max($x,$y)'}}HERE
- Functions of absent variables (e.g. mlr put '$y =
log10($nonesuch)') evaluate to absent, and arithmetic/bitwise/boolean
operators with both operands being absent evaluate to absent.
Arithmetic operators with one absent operand return the other operand.
More specifically, absent values act like zero for addition/subtraction, and
one for multiplication. Furthermore, any expression which evaluates to
absent is not stored in the output record:
POKI_RUN_COMMAND{{echo 'x=2,y=3' | mlr put '$a=$u+$v; $b=$u+$y; $c=$x+$y'}}HERE
POKI_RUN_COMMAND{{echo 'x=2,y=3' | mlr put '$a=min($x,$v);$b=max($u,$y);$c=min($u,$v)'}}HERE
The reasoning is as follows:
Empty values are explicit in the data so they should explicitly affect accumulations:
mlr put '@sum += $x'
should accumulate numeric x values into the sum but an empty
x, when encountered in the input data stream, should make the sum
non-numeric. To work around this you can use the
isnotnull function as follows:
mlr put 'isnotnull($x) { @sum += $x }'
Absent stream-record values should not break accumulations, since Miller
by design handles heterogeneous data: the running @sum in
mlr put '@sum += $x'
should not be invalidated for records which have no x.
Absent out-of-stream-variable values are precisely what allow you to write
mlr put '@sum += $x'. Otherwise you would have to write
mlr put 'begin{@sum = 0}; @sum += $x' —
which is tolerable — but for
mlr put 'begin{...}; @sum[$a][$b] += $x'
you’d have to pre-initialize @sum for all values of $a and $b in your
input data stream, which is intolerable.
The penalty for the absent feature is that misspelled variables can be hard to find:
e.g. in mlr put 'begin{@sumx = 10}; ...; update @sumx somehow per-record; ...; end {@something = @sum * 2}'
the accumulator is spelt @sumx in the begin-block but @sum in the end-block, where since it
is absent, @sum*2 evaluates to 2.
Since absent plus absent is absent (and likewise for other operators),
accumulations such as @sum += $x work correctly on heterogeneous data,
as do within-record formulas if both operands are absent. If one operand is
present, you may get behavior you don’t desire. To work around this
— namely, to set an output field only for records which have all the
inputs present — you can use a pattern-action block with
ispresent:
POKI_RUN_COMMAND{{mlr cat data/het.dkvp}}HERE
POKI_RUN_COMMAND{{mlr put 'ispresent($loadsec) { $loadmillis = $loadsec * 1000 }' data/het.dkvp}}HERE
POKI_RUN_COMMAND{{mlr put '$loadmillis = (ispresent($loadsec) ? $loadsec : 0.0) * 1000' data/het.dkvp}}HERE
If you’re interested in a formal description of how empty and absent
fields participate in arithmetic, here’s a table for plus (other
arithmetic/boolean/bitwise operators are similar):
POKI_RUN_COMMAND{{mlr --print-type-arithmetic-info}}HERE
String literals
You can use the following backslash escapes for strings such as between the double quotes in contexts such as
mlr filter '$name =~ "..."',
mlr put '$name = $othername . "..."',
mlr put '$name = sub($name, "...", "...")', etc.:
\a: ASCII code 0x07 (alarm/bell)
\b: ASCII code 0x08 (backspace)
\f: ASCII code 0x0c (formfeed)
\n: ASCII code 0x0a (LF/linefeed/newline)
\r: ASCII code 0x0d (CR/carriage return)
\t: ASCII code 0x09 (tab)
\v: ASCII code 0x0b (vertical tab)
\\: backslash
\": double quote
\123: Octal 123, etc. for \000 up to \377
\x7f: Hexadecimal 7f, etc. for \x00 up to \xff
See also https://en.wikipedia.org/wiki/Escape_sequences_in_C.
These replacements apply only to strings you key in for the DSL expressions for filter and put:
that is, if you type \t in a string literal for a filter/put expression, it will be turned into a tab character. If you want a backslash followed by a t, then please type \\t.
However, these replacements are not done automatically within your data stream. If you wish to make these
replacements, you can do, for example, for a field named field, mlr put '$field = gsub($field, "\\t",
"\t")'. If you need to make such a replacement for all fields in your data, you should probably simply use the
system sed command.
Regular expressions
Miller lets you use regular expressions (of type POSIX.2) in the following contexts:
In mlr filter with =~ or !=~, e.g. mlr filter '$url =~ "http.*com"i'
In mlr put with sub or gsub, e.g. mlr put '$url = sub($url, "http.*com", "")'
In mlr having-fields, e.g. mlr having-fields --any-matching '^sda[0-9]'
In mlr cut, e.g. mlr cut -r -f '^status$,^sda[0-9]'
In mlr rename, e.g. mlr rename -r '^(sda[0-9]).*$,dev/\1'
In mlr grep, e.g. mlr --csv grep 00188555487 myfiles*.csv
Points demonstrated by the above examples:
There are no implicit start-of-string or end-of-string anchors; please
use ^ and/or $ explicitly.
Miller regexes are wrapped with double quotes rather than slashes.
The i after the ending double quote indicates a case-insensitive
regex.
Capture groups are wrapped with (...) rather than
\(...\); use \( and \) to match against parentheses.
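As a runnable sketch of the case-insensitive form (the field name and data are made up for illustration):
$ echo 'url=HTTP://EXAMPLE.COM' | mlr filter '$url =~ "http.*com"i'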
For filter and put, if the regular expression is a string
literal (the normal case), it is precompiled at process start and reused
thereafter, which is efficient. If the regular expression is a more complex
expression, including string concatenation using ., or a column name
(in which case you can take regular expressions from input data!), then regexes
are compiled on each record, which works but is less efficient. As well, in this
case there is no way to specify case-insensitive matching.
Example:
POKI_RUN_COMMAND{{cat data/regex-in-data.dat}}HERE
POKI_RUN_COMMAND{{mlr filter '$name =~ $regex' data/regex-in-data.dat}}HERE
Regex captures
Regex captures of the form \0 through \9 are supported as
follows:
Captures have in-function context for sub and gsub.
For example, the first \1,\2 pair belong to the first sub and
the second \1,\2 pair belong to the second sub:
mlr put '$b = sub($a, "(..)_(...)", "\2-\1"); $c = sub($a, "(..)_(.)(..)", ":\1:\2:\3")'
Captures endure for the entirety of a put for the =~
and !=~ operators. For example, here the \1,\2 are set by the
=~ operator and are used by both subsequent assignment statements:
mlr put '$a =~ "(..)_(....)"; $b = "left_\1"; $c = "right_\2"'
The captures are not retained across multiple puts. For example, here the
\1,\2 won’t be expanded from the regex capture:
mlr put '$a =~ "(..)_(....)"' then {... something else ...} then put '$b = "left_\1"; $c = "right_\2"'
Captures are ignored in filter for the =~ and
!=~ operators. For example, there is no mechanism provided to refer to
the first (..) as \1 or to the second (....) as
\2 in the following filter statement:
mlr filter '$a =~ "(..)_(....)"'
Up to nine matches are supported: \1 through \9, while
\0 is the entire match string; \15 is treated as \1
followed by an unrelated 5.
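Here is a runnable sketch tying the capture rules together (the field name and data are illustrative): the =~ sets \1 and \2, which the subsequent assignments in the same put expand:
$ echo 'a=ab_cdef' | mlr put '$a =~ "(..)_(....)"; $left = "\1"; $right = "\2"'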
Operator precedence
Operators are listed in order of decreasing precedence, highest first.
Operators                 Associativity
---------                 -------------
()                        left to right
**                        right to left
! ~ unary+ unary- &       right to left
binary* / // %            left to right
binary+ binary- .         left to right
<< >>                     left to right
&                         left to right
^                         left to right
|                         left to right
< <= > >=                 left to right
== != =~ !=~              left to right
&&                        left to right
^^                        left to right
||                        left to right
? :                       right to left
=                         N/A for Miller (there is no $a=$b=$c)
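For example, a quick sketch (the input record is arbitrary) showing that * binds more tightly than + and that parentheses override this; per the table above, $y should come out as 14 and $z as 20:
$ echo 'x=1' | mlr put '$y = 2 + 3 * 4; $z = (2 + 3) * 4'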
Operator and function semantics
Functions are in general pass-throughs straight to the system-standard C
library.
The min and max functions are different from other
multi-argument functions which return null if any of their inputs are null: for
min and max, by contrast, if one argument is null, the other
is returned.
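As a sketch (field names are illustrative; y is absent from the input record), per the rule above both outputs should be the present value 3:
$ echo 'x=3' | mlr put '$lo = min($x, $y); $hi = max($x, $y)'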
Symmetrically with respect to the bitwise OR, XOR, and AND operators
|, ^, &, Miller has the logical operators
||, ^^, &&; the logical XOR ^^ has no counterpart in C.
The exponentiation operator ** is familiar from many languages.
The regex-match and regex-not-match operators =~ and
!=~ are similar to those in Ruby and Perl.
Arithmetic
Input scanning
Numbers in Miller are double-precision float or 64-bit signed integers.
Anything scannable as int, e.g. 123 or 0xabcd, is treated as
an integer; otherwise, input scannable as float (4.56 or 8e9)
is treated as float; everything else is a string.
If you want all numbers to be treated as floats, then you may use
float() in your filter/put expressions (e.g. replacing $c = $a *
$b with $c = float($a) * float($b)) — or, more simply, use
mlr filter -F and mlr put -F, which force all numeric input,
whether from expression literals or field values, to float. Likewise mlr
stats1 -F and mlr step -F force integerable accumulators (such as
count) to be done in floating-point.
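A brief sketch (the values are arbitrary) contrasting default integer arithmetic with float-forcing via float() and -F:
# Integer inputs, integer result:
$ echo 'a=3,b=2' | mlr put '$c = $a // $b'
# Forced to float, either per-expression or globally:
$ echo 'a=3,b=2' | mlr put '$c = float($a) // float($b)'
$ echo 'a=3,b=2' | mlr put -F '$c = $a // $b'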
Conversion by math routines
For most math functions, integers are cast to float on input, and produce
float output: e.g. exp(0) = 1.0 rather than 1. The
following, however, produce integer output if their inputs are integers:
+ - * / // % abs
ceil floor max min round
roundm sgn. As well, stats1 -a min, stats1 -a
max, stats1 -a sum, step -a delta, and step -a
rsum produce integer output if their inputs are integers.
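For example, a sketch (input values are arbitrary): per the lists above, exp should produce float output even for an integer argument, while abs and roundm should keep integer output for integer inputs:
$ echo 'x=0' | mlr put '$e = exp($x); $a = abs(-7); $r = roundm(23, 10)'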
Conversion by arithmetic operators
The sum, difference, and product of integers are again integers, except
when the result would overflow a 64-bit integer, at which point Miller converts the
result to float.
The short of it is that Miller does this transparently for you so you
needn’t think about it.
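For example, a sketch (the constants are chosen only to force overflow): the product exceeds 2^63-1 and so should be converted to float, while the sum stays in range and remains an integer:
$ echo 'x=1' | mlr put '$p = 5000000000 * 5000000000; $s = 5000000000 + 5000000000'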
Implementation details of this, for the interested: integer adds and
subtracts overflow by at most one bit so it suffices to check sign-changes.
Thus, Miller allows you to add and subtract arbitrary 64-bit signed integers,
converting only to float precisely when the result is less than -2^63
or greater than 2^63-1. Multiplies, on the other hand, can overflow
by a word size and a sign-change technique does not suffice to detect overflow.
Instead Miller tests whether the floating-point product exceeds the
representable integer range. Now, 64-bit integers have 64-bit precision while
IEEE-doubles have only 52-bit mantissas — so, there are 53 bits of precision including the
implicit leading one. The following experiment explicitly demonstrates the
resolution at this range:
    64-bit integer      64-bit integer             Cast to double     Back to 64-bit
            in hex          in decimal                                        integer
0x7ffffffffffff9ff 9223372036854774271 9223372036854773760.000000 0x7ffffffffffff800
0x7ffffffffffffa00 9223372036854774272 9223372036854773760.000000 0x7ffffffffffff800
0x7ffffffffffffbff 9223372036854774783 9223372036854774784.000000 0x7ffffffffffffc00
0x7ffffffffffffc00 9223372036854774784 9223372036854774784.000000 0x7ffffffffffffc00
0x7ffffffffffffdff 9223372036854775295 9223372036854774784.000000 0x7ffffffffffffc00
0x7ffffffffffffe00 9223372036854775296 9223372036854775808.000000 0x8000000000000000
0x7ffffffffffffffe 9223372036854775806 9223372036854775808.000000 0x8000000000000000
0x7fffffffffffffff 9223372036854775807 9223372036854775808.000000 0x8000000000000000
That is, one cannot check an integer product to see if it is precisely
greater than 2^63-1 or less than -2^63 using either integer
arithmetic (it may have already overflowed) or using double-precision (due to
granularity). Instead Miller checks for overflow in 64-bit integer
multiplication by seeing whether the absolute value of the double-precision
product exceeds the largest representable IEEE double less than 2^63,
which we see from the listing above is 9223372036854774784. (An alternative
would be to do all integer multiplies using handcrafted multi-word 128-bit
arithmetic. This approach is not taken.)
Pythonic division
Division and remainder are
pythonic:
Quotient of integers is floating-point: 7/2 is 3.5.
Integer division is done with //: 7//2 is 3.
This rounds toward negative infinity.
Remainders are non-negative.
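Putting these together in one sketch (the input record is arbitrary); per the rules above one would expect a=3.5, b=3, c=-4, and d=1:
$ echo 'x=1' | mlr put '$a = 7 / 2; $b = 7 // 2; $c = -7 // 2; $d = -7 % 2'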