POKI_PUT_TOC_HERE
Suppose you have this CSV data file:
POKI_RUN_COMMAND{{cat example.csv}}HERE

mlr cat is like cat -- it passes the data through unmodified.
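For example, a minimal pass-through might look like this (a sketch, assuming the example.csv shown above):

mlr --icsv --ocsv cat example.csv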
But it can also do format conversion (here, you can pretty-print in tabular format):
POKI_RUN_COMMAND{{mlr --icsv --opprint cat example.csv}}HERE

mlr head and mlr tail count records rather than lines. Whether you’re getting the first few records or the last few, the CSV header is included either way:
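For example (a sketch; the choice of 4 records is arbitrary):

mlr --icsv --opprint head -n 4 example.csv
mlr --icsv --opprint tail -n 4 example.csv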
You can sort primarily alphabetically on one field, then secondarily numerically descending on another field:
POKI_RUN_COMMAND{{mlr --icsv --opprint sort -f shape -nr index example.csv}}HERE

You can use cut to retain only specified fields, in the same order they appeared in the input data:
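For instance (a sketch using field names from example.csv above):

mlr --icsv --opprint cut -f flag,shape example.csv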
You can also use cut -o to retain only specified fields in your preferred order:
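For instance (again a sketch against example.csv; here the fields come out in the order given on the command line):

mlr --icsv --opprint cut -o -f flag,shape example.csv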
You can use cut -x to omit fields you don’t care about:
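For instance (a sketch dropping the flag field):

mlr --icsv --opprint cut -x -f flag example.csv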
You can use filter to keep only records you care about:
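For example, keeping only the red shapes (a sketch; the filter expression is illustrative):

mlr --icsv --opprint filter '$color == "red"' example.csv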
You can use put to create new fields which are computed from other fields:
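For example (a sketch; the ratio field is illustrative):

mlr --icsv --opprint put '$ratio = $quantity / $rate' example.csv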
Even though Miller’s main selling point is name-indexing, sometimes you really want to refer to a field by its positional index. Use $[[3]] to access the name of field 3 or $[[[3]]] to access the value of field 3:
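For example (a sketch; the first command renames field 3, the second overwrites its value):

mlr --icsv --opprint put '$[[3]] = "NEW"' example.csv
mlr --icsv --opprint put '$[[[3]]] = "NEW"' example.csv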
OK, CSV and pretty-print are fine. But Miller can also convert between a few other formats -- let’s take a look at JSON output:
POKI_RUN_COMMAND{{mlr --icsv --ojson put '$ratio = $quantity/$rate; $shape = toupper($shape)' example.csv}}HERE

Or, JSON output with vertical-formatting flags:
POKI_RUN_COMMAND{{mlr --icsv --ojsonx tail -n 2 example.csv}}HERE

Now suppose you want to sort the data on a given column, and then take the top few in that ordering. You can use Miller’s then feature to pipe commands together. Here are the records with the top three index values:
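A command along these lines does it (a sketch chaining sort and head):

mlr --icsv --opprint sort -nr index then head -n 3 example.csv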
Lots of Miller commands take a -g option for group-by: here, head -n 1 -g shape outputs the first record for each distinct value of the shape field. This means we’re finding the record with the highest index field for each distinct shape field:
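For example (a sketch; the data is sorted by descending index first, so head -n 1 -g shape picks the highest-index record per shape):

mlr --icsv --opprint sort -nr index then head -n 1 -g shape example.csv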
Statistics can be computed with or without group-by field(s):
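For instance, with no group-by field (a sketch parallel to the grouped examples that follow):

mlr --icsv --opprint --from example.csv stats1 -a count,min,mean,max -f quantity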
POKI_RUN_COMMAND{{mlr --icsv --opprint --from example.csv stats1 -a count,min,mean,max -f quantity -g shape}}HERE

POKI_RUN_COMMAND{{mlr --icsv --opprint --from example.csv stats1 -a count,min,mean,max -f quantity -g shape,color}}HERE

If your output has a lot of columns, you can use XTAB format to line things up vertically for you instead:
POKI_RUN_COMMAND{{mlr --icsv --oxtab --from example.csv stats1 -a p0,p10,p25,p50,p75,p90,p99,p100 -f rate}}HERE

Often we want to print output to the screen. Miller does this by default, as we’ve seen in the previous examples.
Sometimes we want to print output to another file: just use '> outputfilenamegoeshere' at the end of your command:
% mlr --icsv --opprint cat example.csv > newfile.csv
# Output goes to the new file;
# nothing is printed to the screen.
% cat newfile.csv
color  shape    flag index quantity rate
yellow triangle 1    11    43.6498  9.8870
red    square   1    15    79.2778  0.0130
red    circle   1    16    13.8103  2.9010
red    square   0    48    77.5542  7.4670
purple triangle 0    51    81.2290  8.5910
red    square   0    64    77.1991  9.5310
purple triangle 0    65    80.1405  5.8240
yellow circle   1    73    63.9785  4.2370
yellow circle   1    87    63.5058  8.3350
purple square   0    91    72.3735  8.2430
You can also use mlr -I to modify a file in place, replacing its contents with Miller’s output:

% cp example.csv newfile.txt
% cat newfile.txt
color,shape,flag,index,quantity,rate
yellow,triangle,1,11,43.6498,9.8870
red,square,1,15,79.2778,0.0130
red,circle,1,16,13.8103,2.9010
red,square,0,48,77.5542,7.4670
purple,triangle,0,51,81.2290,8.5910
red,square,0,64,77.1991,9.5310
purple,triangle,0,65,80.1405,5.8240
yellow,circle,1,73,63.9785,4.2370
yellow,circle,1,87,63.5058,8.3350
purple,square,0,91,72.3735,8.2430
% mlr -I --icsv --opprint cat newfile.txt
% cat newfile.txt
color  shape    flag index quantity rate
yellow triangle 1    11    43.6498  9.8870
red    square   1    15    79.2778  0.0130
red    circle   1    16    13.8103  2.9010
red    square   0    48    77.5542  7.4670
purple triangle 0    51    81.2290  8.5910
red    square   0    64    77.1991  9.5310
purple triangle 0    65    80.1405  5.8240
yellow circle   1    73    63.9785  4.2370
yellow circle   1    87    63.5058  8.3350
purple square   0    91    72.3735  8.2430
Using mlr -I you can bulk-operate on lots of files, e.g.:

mlr -I --csv cut -x -f unwanted_column_name *.csv
Lastly, using tee within put, you can split your input data into separate files per one or more field names:
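A command along these lines produces the per-shape files shown below (a sketch; put -q suppresses the normal record stream while tee writes each record to a file named after its shape):

mlr --icsv --ocsv --from example.csv put -q 'tee > $shape . ".csv", $*'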
POKI_RUN_COMMAND{{cat circle.csv}}HERE

POKI_RUN_COMMAND{{cat square.csv}}HERE

POKI_RUN_COMMAND{{cat triangle.csv}}HERE
For example, data that looks like this in CSV form:

shape,flag,index
circle,1,24
square,0,36
can be written with each record as a line of delimited key-value pairs:

shape=circle,flag=1,index=24
shape=square,flag=0,index=36
Data written this way are called DKVP, for delimited key-value pairs.
We’ve also already seen other ways to write the same data:

CSV:
shape,flag,index
circle,1,24
square,0,36

PPRINT:
shape  flag index
circle 1    24
square 0    36

JSON:
[
  {
    "shape": "circle",
    "flag": 1,
    "index": 24
  },
  {
    "shape": "square",
    "flag": 0,
    "index": 36
  }
]

DKVP:
shape=circle,flag=1,index=24
shape=square,flag=0,index=36

XTAB:
shape circle
flag  1
index 24

shape square
flag  0
index 36
Anything we can do with CSV input data, we can do with any other format input data.
And you can read from one format, do any record-processing, and output to the same format as the input, or to a different output format.

Since SQL clients such as mysql can emit TSV, their output is easy to parse with mlr --tsv or mlr --tsvlite. This means I can do some (or all, or none) of my data processing within SQL queries, and some (or none, or all) of my data processing using Miller -- whichever is most convenient for my needs at the moment.
For example, using default output formatting in mysql we get formatting like Miller’s --opprint --barred:
$ mysql --database=mydb -e 'show columns in mytable'
+------------------+--------------+------+-----+---------+-------+
| Field            | Type         | Null | Key | Default | Extra |
+------------------+--------------+------+-----+---------+-------+
| id               | bigint(20)   | NO   | MUL | NULL    |       |
| category         | varchar(256) | NO   |     | NULL    |       |
| is_permanent     | tinyint(1)   | NO   |     | NULL    |       |
| assigned_to      | bigint(20)   | YES  |     | NULL    |       |
| last_update_time | int(11)      | YES  |     | NULL    |       |
+------------------+--------------+------+-----+---------+-------+
Using mysql’s -B flag we get TSV output:
$ mysql --database=mydb -B -e 'show columns in mytable' | mlr --itsvlite --opprint cat
Field            Type         Null Key Default Extra
id               bigint(20)   NO   MUL NULL    -
category         varchar(256) NO   -   NULL    -
is_permanent     tinyint(1)   NO   -   NULL    -
assigned_to      bigint(20)   YES  -   NULL    -
last_update_time int(11)      YES  -   NULL    -
$ mysql --database=mydb -B -e 'show columns in mytable' | mlr --itsvlite --ojson --jlistwrap --jvstack cat
[
{
  "Field": "id",
  "Type": "bigint(20)",
  "Null": "NO",
  "Key": "MUL",
  "Default": "NULL",
  "Extra": ""
},
{
  "Field": "category",
  "Type": "varchar(256)",
  "Null": "NO",
  "Key": "",
  "Default": "NULL",
  "Extra": ""
},
{
  "Field": "is_permanent",
  "Type": "tinyint(1)",
  "Null": "NO",
  "Key": "",
  "Default": "NULL",
  "Extra": ""
},
{
  "Field": "assigned_to",
  "Type": "bigint(20)",
  "Null": "YES",
  "Key": "",
  "Default": "NULL",
  "Extra": ""
},
{
  "Field": "last_update_time",
  "Type": "int(11)",
  "Null": "YES",
  "Key": "",
  "Default": "NULL",
  "Extra": ""
}
]
$ mysql --database=mydb -B -e 'select * from mytable' > query.tsv
$ mlr --from query.tsv --t2p stats1 -a count -f id -g category,assigned_to
category assigned_to id_count
special  10000978    207
special  10003924    385
special  10009872    168
standard 10000978    524
standard 10003924    392
standard 10009872    108
...
mysql> CREATE TABLE abixy(
  a VARCHAR(32),
  b VARCHAR(32),
  i BIGINT(10),
  x DOUBLE,
  y DOUBLE
);
Query OK, 0 rows affected (0.01 sec)

bash$ mlr --onidx --fs comma cat data/medium > medium.nidx

mysql> LOAD DATA LOCAL INFILE 'medium.nidx' REPLACE INTO TABLE abixy FIELDS TERMINATED BY ',' ;
Query OK, 10000 rows affected (0.07 sec)
Records: 10000  Deleted: 0  Skipped: 0  Warnings: 0

mysql> SELECT COUNT(*) AS count FROM abixy;
+-------+
| count |
+-------+
| 10000 |
+-------+
1 row in set (0.00 sec)

mysql> SELECT * FROM abixy LIMIT 10;
+------+------+------+---------------------+---------------------+
| a    | b    | i    | x                   | y                   |
+------+------+------+---------------------+---------------------+
| pan  | pan  |    1 |  0.3467901443380824 |  0.7268028627434533 |
| eks  | pan  |    2 |  0.7586799647899636 |  0.5221511083334797 |
| wye  | wye  |    3 | 0.20460330576630303 | 0.33831852551664776 |
| eks  | wye  |    4 | 0.38139939387114097 | 0.13418874328430463 |
| wye  | pan  |    5 |  0.5732889198020006 |  0.8636244699032729 |
| zee  | pan  |    6 |  0.5271261600918548 | 0.49322128674835697 |
| eks  | zee  |    7 |  0.6117840605678454 |  0.1878849191181694 |
| zee  | wye  |    8 |  0.5985540091064224 |   0.976181385699006 |
| hat  | wye  |    9 | 0.03144187646093577 |  0.7495507603507059 |
| pan  | wye  |   10 |  0.5026260055412137 |  0.9526183602969864 |
+------+------+------+---------------------+---------------------+
mysql> SELECT a, b, COUNT(*) AS count FROM abixy GROUP BY a, b ORDER BY COUNT DESC;
+------+------+-------+
| a    | b    | count |
+------+------+-------+
| zee  | wye  |   455 |
| pan  | eks  |   429 |
| pan  | pan  |   427 |
| wye  | hat  |   426 |
| hat  | wye  |   423 |
| pan  | hat  |   417 |
| eks  | hat  |   417 |
| pan  | zee  |   413 |
| eks  | eks  |   413 |
| zee  | hat  |   409 |
| eks  | wye  |   407 |
| zee  | zee  |   403 |
| pan  | wye  |   395 |
| wye  | pan  |   392 |
| zee  | eks  |   391 |
| zee  | pan  |   389 |
| hat  | eks  |   389 |
| wye  | eks  |   386 |
| wye  | zee  |   385 |
| hat  | zee  |   385 |
| hat  | hat  |   381 |
| wye  | wye  |   377 |
| eks  | pan  |   371 |
| hat  | pan  |   363 |
| eks  | zee  |   357 |
+------+------+-------+
25 rows in set (0.01 sec)
$ mlr --opprint uniq -c -g a,b then sort -nr count data/medium
a   b   count
zee wye 455
pan eks 429
pan pan 427
wye hat 426
hat wye 423
pan hat 417
eks hat 417
eks eks 413
pan zee 413
zee hat 409
eks wye 407
zee zee 403
pan wye 395
hat pan 363
eks zee 357
$ mysql -D miller -B -e 'select * from abixy' | mlr --itsv --opprint uniq -c -g a,b then sort -nr count
a   b   count
zee wye 455
pan eks 429
pan pan 427
wye hat 426
hat wye 423
pan hat 417
eks hat 417
eks eks 413
pan zee 413
zee hat 409
eks wye 407
zee zee 403
pan wye 395
wye pan 392
zee eks 391
zee pan 389
hat eks 389
wye eks 386
hat zee 385
wye zee 385
hat hat 381
wye wye 377
eks pan 371
hat pan 363
eks zee 357
Since each DKVP line carries its own field names, you can process such data with the system grep or what have you. It also means that not every line needs to have the same list of field names (“schema”).
Again, all the examples in the CSV section apply here — just change
the input-format flags. But there’s more you can do when not all the
records have the same shape.
Writing a program — in any language whatsoever — you can have
it print out log lines as it goes along, with items for various events jumbled
together. After the program has finished running you can sort it all out,
filter it, analyze it, and learn from it.
Suppose your program has printed something like this:
POKI_RUN_COMMAND{{cat log.txt}}HERE
Each print statement simply contains local information: the current
timestamp, whether a particular cache was hit or not, etc. Then using either
the system grep command, or Miller’s having-fields, or is_present, we can pick out the parts we want and analyze them:
POKI_INCLUDE_AND_RUN_ESCAPED(10-1.sh)HERE
POKI_INCLUDE_AND_RUN_ESCAPED(10-2.sh)HERE
Alternatively, we can simply group the similar data for a better look:
POKI_RUN_COMMAND{{mlr --opprint group-like log.txt}}HERE
POKI_RUN_COMMAND{{mlr --opprint group-like then sec2gmt time log.txt}}HERE