• No output at all
• Fields not selected
• Diagnosing delimiter specifications
• Error-output in certain string cases
• How do I examine then-chaining?
• I assigned $9 and it’s not 9th
• Why doesn’t mlr cut put fields in the order I want?
• NR is not consecutive after then-chaining
• Why am I not seeing all possible joins occur?
• What about XML or JSON file formats?

Number one FAQ

Please use mlr --csv --rs lf for native Un*x (linefeed-terminated) CSV files. Instead of specifying --rs lf on each invocation, you can set MLR_CSV_DEFAULT_RS=lf in your shell environment as a one-time setup step: e.g. put export MLR_CSV_DEFAULT_RS=lf in your ~/.bashrc or ~/.zshrc, or setenv MLR_CSV_DEFAULT_RS lf in your ~/.cshrc.
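For example, here is a minimal sketch of the two approaches; the data file and field name are hypothetical:

$ mlr --csv --rs lf cut -f hostname data/hosts.csv   # --rs lf given on each invocation
$ export MLR_CSV_DEFAULT_RS=lf                       # one-time setup, e.g. in ~/.bashrc
$ mlr --csv cut -f hostname data/hosts.csv           # LF line endings now assumed by default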
No output at all

Check the line-terminators of the data, e.g. with the command-line file program. Example: for CSV, Miller’s default line terminator is CR/LF (carriage return followed by linefeed, following RFC 4180). If your CSV has *nix-standard LF line endings, Miller will keep reading the file, looking for a CR/LF which never appears. Solution in this case: tell Miller the input has LF line-terminators, e.g. mlr --csv --rs lf {remaining arguments ...}. Also try od -xcv and/or cat -e on your file to check for non-printable characters.

Fields not selected

Check the field-separators of the data, e.g. with the command-line head program. Example: for CSV, Miller’s default field separator is comma; if your data is tab-delimited, e.g. aTABbTABc, then Miller won’t find three fields named a, b, and c but rather just one named aTABbTABc. Solution in this case: mlr --fs tab {remaining arguments ...}. Also try od -xcv and/or cat -e on your file to check for non-printable characters.

Diagnosing delimiter specifications

# Use the `file` command to see if there are CR/LF terminators (in this case,
# there are not):
$ file data/colours.csv
data/colours.csv: UTF-8 Unicode text

# Look at the file to find names of fields:
$ cat data/colours.csv
KEY;DE;EN;ES;FI;FR;IT;NL;PL;RO;TR
masterdata_colourcode_1;Weiß;White;Blanco;Valkoinen;Blanc;Bianco;Wit;Biały;Alb;Beyaz
masterdata_colourcode_2;Schwarz;Black;Negro;Musta;Noir;Nero;Zwart;Czarny;Negru;Siyah

# Try (unsuccessfully) to extract a few fields:
$ mlr --csv cut -f KEY,PL,RO data/colours.csv
(no output)

# Use LF record separator (--rs lf) since the file doesn't have CR/LF line
# endings -- but still unsuccessfully:
$ mlr --csv --rs lf cut -f KEY,PL,RO data/colours.csv
(only blank lines appear)

# Use XTAB output format to get a sharper picture of where records/fields
# are being split:
$ mlr --icsv --irs lf --oxtab cat data/colours.csv
KEY;DE;EN;ES;FI;FR;IT;NL;PL;RO;TR masterdata_colourcode_1;Weiß;White;Blanco;Valkoinen;Blanc;Bianco;Wit;Biały;Alb;Beyaz

KEY;DE;EN;ES;FI;FR;IT;NL;PL;RO;TR masterdata_colourcode_2;Schwarz;Black;Negro;Musta;Noir;Nero;Zwart;Czarny;Negru;Siyah

# Using XTAB output format makes it clearer that KEY;DE;...;RO;TR is being
# treated as a single field name in the CSV header, and likewise each
# subsequent line is being treated as a single field value. This is because
# the default field separator is a comma but we have semicolons here.

# Use XTAB again with a different field separator (--ifs semicolon):
$ mlr --icsv --irs lf --ifs semicolon --oxtab cat data/colours.csv
KEY masterdata_colourcode_1
DE  Weiß
EN  White
ES  Blanco
FI  Valkoinen
FR  Blanc
IT  Bianco
NL  Wit
PL  Biały
RO  Alb
TR  Beyaz

KEY masterdata_colourcode_2
DE  Schwarz
EN  Black
ES  Negro
FI  Musta
FR  Noir
IT  Nero
NL  Zwart
PL  Czarny
RO  Negru
TR  Siyah

# Using the new field separator, retry the cut:
$ mlr --csv --rs lf --fs semicolon cut -f KEY,PL,RO data/colours.csv
KEY;PL;RO
masterdata_colourcode_1;Biały;Alb
masterdata_colourcode_2;Czarny;Negru

Error-output in certain string cases

mlr put '$y = string($x)' then put '$z = $y . $y' gives (error) on numeric data such as x=123, while mlr put '$z = string($x) . string($x)' and mlr put '$y = string($x); $z = $y . $y' do not. This is because in the first case y is computed and stored as a string, then re-parsed as an integer, for which string concatenation is an invalid operator. In the second case, both operands are cast within a single expression; in the third case, both assignments are within the same put statement, where type information is maintained for the duration of all assignments in the put.
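Here is a sketch of how this looks at the command line, using a one-record input with x=123 (exact output formatting may differ across Miller versions):

$ echo 'x=123' | mlr put '$y = string($x)' then put '$z = $y . $y'
x=123,y=123,z=(error)
$ echo 'x=123' | mlr put '$z = string($x) . string($x)'
x=123,z=123123
$ echo 'x=123' | mlr put '$y = string($x); $z = $y . $y'
x=123,y=123,z=123123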
How do I examine then-chaining?

Miller’s then-chaining is intended to function the same as Unix pipes, but with less typing. You can print your data one pipeline step at a time, to see how the intermediate output of one step becomes the input to the next. First, look at the input data:

$ cat data/then-example.csv
Status,Payment_Type,Amount
paid,cash,10.00
pending,debit,20.00
paid,cash,50.00
pending,credit,40.00
paid,debit,30.00

Next, run the first step of the chain on its own:

$ mlr --icsv --rs lf --opprint count-distinct -f Status,Payment_Type data/then-example.csv
Status  Payment_Type count
paid    cash         2
pending debit        1
pending credit       1
paid    debit        1

Then add the next then-step:

$ mlr --icsv --rs lf --opprint count-distinct -f Status,Payment_Type then sort -nr count data/then-example.csv
Status  Payment_Type count
paid    cash         2
pending debit        1
pending credit       1
paid    debit        1

This is the same as piping one Miller invocation into another:

$ mlr --csv --rs lf count-distinct -f Status,Payment_Type data/then-example.csv | mlr --icsv --rs lf --opprint sort -nr count
Status  Payment_Type count
paid    cash         2
pending debit        1
pending credit       1
paid    debit        1

I assigned $9 and it’s not 9th

Miller records are ordered lists of key-value pairs. For NIDX format, DKVP format when keys are missing, or CSV/CSV-lite format with --implicit-csv-header, Miller will sequentially assign keys of the form 1, 2, etc. But these are not integer array indices: they’re just field names taken from the initial field ordering in the input data.

$ echo x,y,z | mlr --dkvp cat
1=x,2=y,3=z

$ echo x,y,z | mlr --dkvp put '$6="a";$4="b";$55="cde"'
1=x,2=y,3=z,6=a,4=b,55=cde

$ echo x,y,z | mlr --nidx cat
x,y,z

$ echo x,y,z | mlr --csv --rs lf --implicit-csv-header cat
1,2,3
x,y,z

$ echo x,y,z | mlr --dkvp rename 2,999
1=x,999=y,3=z

$ echo x,y,z | mlr --dkvp rename 2,newname
1=x,newname=y,3=z

$ echo x,y,z | mlr --csv --rs lf --implicit-csv-header reorder -f 3,1,2
3,1,2
z,x,y

Why doesn’t mlr cut put fields in the order I want?

Example: columns x,i,a were requested, but they appear here in the order a,i,x:

$ cat data/small
a=pan,b=pan,i=1,x=0.3467901443380824,y=0.7268028627434533
a=eks,b=pan,i=2,x=0.7586799647899636,y=0.5221511083334797
a=wye,b=wye,i=3,x=0.20460330576630303,y=0.33831852551664776
a=eks,b=wye,i=4,x=0.38139939387114097,y=0.13418874328430463
a=wye,b=pan,i=5,x=0.5732889198020006,y=0.8636244699032729

$ mlr cut -f x,i,a data/small
a=pan,i=1,x=0.3467901443380824
a=eks,i=2,x=0.7586799647899636
a=wye,i=3,x=0.20460330576630303
a=eks,i=4,x=0.38139939387114097
a=wye,i=5,x=0.5732889198020006

This is because, by default, cut outputs the selected fields in the order they appear in the input data. The -o flag makes the output order match the order given with -f:

$ mlr cut -o -f x,i,a data/small
x=0.3467901443380824,i=1,a=pan
x=0.7586799647899636,i=2,a=eks
x=0.20460330576630303,i=3,a=wye
x=0.38139939387114097,i=4,a=eks
x=0.5732889198020006,i=5,a=wye
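If you want the requested fields first but want to keep the remaining fields as well, the reorder verb is an alternative; the following is a sketch rather than part of the example above:

$ mlr reorder -f x,i,a data/small   # moves x, i, a to the front of each record; b and y are kept after them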
NR is not consecutive after then-chaining

Given this input data:

$ cat data/small
a=pan,b=pan,i=1,x=0.3467901443380824,y=0.7268028627434533
a=eks,b=pan,i=2,x=0.7586799647899636,y=0.5221511083334797
a=wye,b=wye,i=3,x=0.20460330576630303,y=0.33831852551664776
a=eks,b=wye,i=4,x=0.38139939387114097,y=0.13418874328430463
a=wye,b=pan,i=5,x=0.5732889198020006,y=0.8636244699032729

$ mlr filter '$x > 0.5' then put '$NR = NR' data/small
a=eks,b=pan,i=2,x=0.7586799647899636,y=0.5221511083334797,NR=2
a=wye,b=pan,i=5,x=0.5732889198020006,y=0.8636244699032729,NR=5

NR is set when each record is read from the input and is not renumbered as records pass through the then-chain. By contrast, NF is dynamically updated as fields are added to or removed from the current record:

$ echo x=1,y=2,z=3 | mlr put '$nf1 = NF; $u = 4; $nf2 = NF; unset $x,$y,$z; $nf3 = NF'
nf1=3,u=4,nf2=5,nf3=3

If you want consecutive numbering within a then-chain step, keep your own counter using out-of-stream variables:

$ mlr --opprint --from data/small put '
    begin{ @nr1 = 0 }
    @nr1 += 1;
    $nr1 = @nr1
  ' \
  then filter '$x>0.5' \
  then put '
    begin{ @nr2 = 0 }
    @nr2 += 1;
    $nr2 = @nr2
  '
a   b   i x                  y                  nr1 nr2
eks pan 2 0.7586799647899636 0.5221511083334797 2   1
wye pan 5 0.5732889198020006 0.8636244699032729 5   2

Why am I not seeing all possible joins occur?

For example, the right file here has nine records, and the left file should add in the hostname column, so the join output should also have nine records:

$ mlr --icsvlite --opprint cat data/join-u-left.csv
hostname              ipaddr
nadir.east.our.org    10.3.1.18
zenith.west.our.org   10.3.1.27
apoapsis.east.our.org 10.4.5.94

$ mlr --icsvlite --opprint cat data/join-u-right.csv
ipaddr    timestamp  bytes
10.3.1.27 1448762579 4568
10.3.1.18 1448762578 8729
10.4.5.94 1448762579 17445
10.3.1.27 1448762589 12
10.3.1.18 1448762588 44558
10.4.5.94 1448762589 8899
10.3.1.27 1448762599 0
10.3.1.18 1448762598 73425
10.4.5.94 1448762599 12200

$ mlr --icsvlite --opprint join -j ipaddr -f data/join-u-left.csv data/join-u-right.csv
ipaddr    hostname              timestamp  bytes
10.3.1.27 zenith.west.our.org   1448762579 4568
10.4.5.94 apoapsis.east.our.org 1448762579 17445
10.4.5.94 apoapsis.east.our.org 1448762589 8899
10.4.5.94 apoapsis.east.our.org 1448762599 12200

Records are missing because, by default, join expects its inputs to be sorted by the join key, and these files are not sorted on ipaddr. The -u flag tells join to handle unsorted input:

$ mlr --icsvlite --opprint join -u -j ipaddr -f data/join-u-left.csv data/join-u-right.csv
ipaddr    hostname              timestamp  bytes
10.3.1.27 zenith.west.our.org   1448762579 4568
10.3.1.18 nadir.east.our.org    1448762578 8729
10.4.5.94 apoapsis.east.our.org 1448762579 17445
10.3.1.27 zenith.west.our.org   1448762589 12
10.3.1.18 nadir.east.our.org    1448762588 44558
10.4.5.94 apoapsis.east.our.org 1448762589 8899
10.3.1.27 zenith.west.our.org   1448762599 0
10.3.1.18 nadir.east.our.org    1448762598 73425
10.4.5.94 apoapsis.east.our.org 1448762599 12200

What about XML or JSON file formats?

Miller handles flat, non-recursive tabular data: records which are ordered lists of key-value pairs, as in DKVP. XML and JSON are nested formats, but tabular data can be represented in them as well. For example, the same two records look like this in DKVP, XML, and JSON:

# DKVP
x=1,y=2
z=3

# XML
<table>
  <record>
    <field> <key> x </key> <value> 1 </value> </field>
    <field> <key> y </key> <value> 2 </value> </field>
  </record>
  <record>
    <field> <key> z </key> <value> 3 </value> </field>
  </record>
</table>

# JSON
[{"x":1,"y":2},{"z":3}]
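If your data arrives as nested JSON, one option is to flatten it into DKVP lines before feeding it to mlr. The following is a minimal sketch; it assumes the separate jq tool (not part of Miller), and the file name data.json is hypothetical:

$ cat data.json
[{"x":1,"y":2},{"z":3}]

$ jq -r '.[] | to_entries | map("\(.key)=\(.value)") | join(",")' data.json
x=1,y=2
z=3

The resulting DKVP lines can then be piped directly into mlr.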