POKI_PUT_TOC_HERE

Overview

Miller handles name-indexed data using several formats: some you probably know by name, such as CSV, TSV, and JSON — and other formats you’re likely already seeing and using in your structured data. Additionally, Miller gives you the option of including comments within your data.

Examples

POKI_RUN_COMMAND{{mlr --usage-data-format-examples}}HERE

CSV/TSV/ASV/USV/etc.

When mlr is invoked with the --csv or --csvlite option, key names are found on the first record and values are taken from subsequent records. This includes the case of CSV-formatted files. See POKI_PUT_LINK_FOR_PAGE(record-heterogeneity.html)HERE for how Miller handles changes of field names within a single data stream.

Miller has record separator RS and field separator FS, just as awk does. For TSV, use --fs tab; to convert TSV to CSV, use --ifs tab --ofs comma, etc. (See also POKI_PUT_LINK_FOR_PAGE(reference.html#Record/field/pair_separators)HERE.)

TSV (tab-separated values): the following are synonymous pairs:

ASV (ASCII-separated values): the flags --asv, --iasv, --oasv, --asvlite, --iasvlite, and --oasvlite are analogous except they use ASCII FS and RS 0x1f and 0x1e, respectively.

USV (Unicode-separated values): likewise, the flags --usv, --iusv, --ousv, --usvlite, --iusvlite, and --ousvlite use Unicode FS and RS U+241F (UTF-8 0x0xe2909f) and U+241E (UTF-8 0xe2909e), respectively.

Miller’s --csv flag supports RFC-4180 CSV ( https://tools.ietf.org/html/rfc4180). This includes CRLF line-terminators by default, regardless of platform.

Here are the differences between CSV and CSV-lite:

Here are things they have in common:

DKVP: Key-value pairs

Miller’s default file format is DKVP, for delimited key-value pairs. Example: POKI_RUN_COMMAND{{mlr cat data/small}}HERE Such data are easy to generate, e.g. in Ruby with POKI_CARDIFY(puts "host=#{hostname},seconds=#{t2-t1},message=#{msg}")HERE POKI_CARDIFY{{puts mymap.collect{|k,v| "#{k}=#{v}"}.join(',')}}HERE or print statements in various languages, e.g. POKI_CARDIFY(echo "type=3,user=$USER,date=$date\n";)HERE POKI_CARDIFY{{logger.log("type=3,user=$USER,date=$date\n");}}HERE

Fields lacking an IPS will have positional index (starting at 1) used as the key, as in NIDX format. For example, dish=7,egg=8,flint is parsed as "dish" => "7", "egg" => "8", "3" => "flint" and dish,egg,flint is parsed as "1" => "dish", "2" => "egg", "3" => "flint".

As discussed in POKI_PUT_LINK_FOR_PAGE(record-heterogeneity.html)HERE, Miller handles changes of field names within the same data stream. But using DKVP format this is particularly natural. One of my favorite use-cases for Miller is in application/server logs, where I log all sorts of lines such as

resource=/path/to/file,loadsec=0.45,ok=true
record_count=100, resource=/path/to/file
resource=/some/other/path,loadsec=0.97,ok=false

etc. and I just log them as needed. Then later, I can use grep, mlr --opprint group-like, etc. to analyze my logs.

See POKI_PUT_LINK_FOR_PAGE(reference.html)HERE regarding how to specify separators other than the default equals-sign and comma.

NIDX: Index-numbered (toolkit style)

With --inidx --ifs ' ' --repifs, Miller splits lines on whitespace and assigns integer field names starting with 1. This recapitulates Unix-toolkit behavior.

Example with index-numbered output:
POKI_RUN_COMMAND{{cat data/small}}HERE POKI_RUN_COMMAND{{mlr --onidx --ofs ' ' cat data/small}}HERE

Example with index-numbered input:
POKI_RUN_COMMAND{{cat data/mydata.txt}}HERE POKI_RUN_COMMAND{{mlr --inidx --ifs ' ' --odkvp cat data/mydata.txt}}HERE

Example with index-numbered input and output:
POKI_RUN_COMMAND{{cat data/mydata.txt}}HERE POKI_RUN_COMMAND{{mlr --nidx --fs ' ' --repifs cut -f 2,3 data/mydata.txt}}HERE

Tabular JSON

JSON is a format which supports arbitrarily deep nesting of “objects” (hashmaps) and “arrays” (lists), while Miller is a tool for handling tabular data only. This means Miller cannot (and should not) handle arbitrary JSON. (Check out jq.)

But if you have tabular data represented in JSON then Miller can handle that for you.

Single-level JSON objects

An array of single-level objects is, quite simply, a table: POKI_RUN_COMMAND{{mlr --json head -n 2 then cut -f color,shape data/json-example-1.json}}HERE POKI_RUN_COMMAND{{mlr --json --jvstack head -n 2 then cut -f color,u,v data/json-example-1.json}}HERE POKI_RUN_COMMAND{{mlr --ijson --opprint stats1 -a mean,stddev,count -f u -g shape data/json-example-1.json}}HERE

Nested JSON objects

Additionally, Miller can tabularize nested objects by concatentating keys: POKI_RUN_COMMAND{{mlr --json --jvstack head -n 2 data/json-example-2.json}}HERE POKI_RUN_COMMAND{{mlr --ijson --opprint head -n 4 data/json-example-2.json}}HERE

Note in particular that as far as Miller’s put and filter, as well as other I/O formats, are concerned, these are simply field names with colons in them: POKI_RUN_COMMAND{{mlr --json --jvstack head -n 1 then put '${values:uv} = ${values:u} * ${values:v}' data/json-example-2.json}}HERE

Arrays

Arrays aren’t supported in Miller’s put/filter DSL. By default, JSON arrays are read in as integer-keyed maps.

Suppose you have arrays like this in our input data: POKI_RUN_COMMAND{{cat data/json-example-3.json}}HERE

Then integer indices (starting from 0 and counting up) are used as map keys: POKI_RUN_COMMAND{{mlr --ijson --oxtab cat data/json-example-3.json}}HERE

When the data are written back out as JSON, field names are re-expanded as above, but what were arrays on input are now maps on output: POKI_RUN_COMMAND{{mlr --json --jvstack cat data/json-example-3.json}}HERE

This is non-ideal, but it allows Miller (5.x release being latest as of this writing) to handle JSON arrays at all.

You might also use mlr --json-skip-arrays-on-input or mlr --json-fatal-arrays-on-input. To truly handle JSON, please use a JSON-processing tool such as jq.

Formatting JSON options

JSON isn’t a parameterized format, so RS, FS, PS aren’t specifiable. Nonetheless, you can do the following:

Again, please see jq for a truly powerful, JSON-specific tool.

JSON non-streaming

The JSON parser Miller uses does not return until all input is parsed: in particular this means that, unlike for other file formats, Miller does not (at present) handle JSON files in tail -f contexts.

PPRINT: Pretty-printed tabular

Miller’s pretty-print format is like CSV, but column-aligned. For example, compare
POKI_RUN_COMMAND{{mlr --ocsv cat data/small}}HERE POKI_RUN_COMMAND{{mlr --opprint cat data/small}}HERE
Note that while Miller is a line-at-a-time processor and retains input lines in memory only where necessary (e.g. for sort), pretty-print output requires it to accumulate all input lines (so that it can compute maximum column widths) before producing any output. This has two consequences: (a) pretty-print output won’t work on tail -f contexts, where Miller will be waiting for an end-of-file marker which never arrives; (b) pretty-print output for large files is constrained by available machine memory.

See POKI_PUT_LINK_FOR_PAGE(record-heterogeneity.html)HERE for how Miller handles changes of field names within a single data stream.

For output only (this isn’t supported in the input-scanner as of 5.0.0) you can use --barred with pprint output format: POKI_RUN_COMMAND{{mlr --opprint --barred cat data/small}}HERE

XTAB: Vertical tabular

This is perhaps most useful for looking a very wide and/or multi-column data which causes line-wraps on the screen (but see also https://github.com/twosigma/ngrid for an entirely different, very powerful option). Namely:
POKI_INCLUDE_ESCAPED(data/system-file-opprint-example.txt)HERE
POKI_INCLUDE_ESCAPED(data/system-file-oxtab-example.txt)HERE
POKI_INCLUDE_ESCAPED(data/system-file-ojson-example.txt)HERE

Markdown tabular

Markdown format looks like this: POKI_RUN_COMMAND{{mlr --omd cat data/small}}HERE which renders like this when dropped into various web tools (e.g. github comments):

As of Miller 4.3.0, markdown format is supported only for output, not input.

Data-conversion keystroke-savers

While you can do format conversion using mlr --icsv --ojson cat myfile.csv, there are also keystroke-savers for this purpose, such as mlr --c2j cat myfile.csv. For a complete list: POKI_RUN_COMMAND{{mlr --usage-format-conversion-keystroke-saver-options}}HERE

Autodetect of line endings

Default line endings (--irs and --ors) are 'auto' which means autodetect from the input file format, as long as the input file(s) have lines ending in either LF (also known as linefeed, '\n', 0x0a, Unix-style) or CRLF (also known as carriage-return/linefeed pairs, '\r\n', 0x0d 0x0a, Windows style).

If both IRS and ORS are auto (which is the default) then LF input will lead to LF output and CRLF input will lead to CRLF output, regardless of the platform you’re running on.

The line-ending autodetector triggers on the first line ending detected in the input stream. E.g. if you specify a CRLF-terminated file on the command line followed by an LF-terminated file then autodetected line endings will be CRLF.

If you use --ors {something else} with (default or explicitly specified) --irs auto then line endings are autodetected on input and set to what you specify on output.

If you use --irs {something else} with (default or explicitly specified) --ors auto then the output line endings used are LF on Unix/Linux/BSD/MacOSX, and CRLF on Windows.

See also POKI_PUT_LINK_FOR_PAGE(reference.html#Record/field/pair_separators)HERE for more information about record/field/pair separators.

Comments in data

You can include comments within your data files, and either have them ignored, or passed directly through to the standard output as soon as they are encountered: POKI_RUN_COMMAND{{mlr --usage-comments-in-data}}HERE

Examples: POKI_RUN_COMMAND{{cat data/budget.csv}}HERE POKI_RUN_COMMAND{{mlr --skip-comments --icsv --opprint sort -nr quantity data/budget.csv}}HERE POKI_RUN_COMMAND{{mlr --pass-comments --icsv --opprint sort -nr quantity data/budget.csv}}HERE