name: inverse
layout: true
class: center, middle, inverse
---
# Advanced text editing

With `sed` and `awk`

.footnote[Marek Šuppa
Ondrej Jariabka
Adrián Matejov]

---
layout: false

# Why UNIX for Data Science?

- The tools we'll learn about today will look strange and obsolete (their syntax almost certainly will)

--

- But the reason why we learn about them is simple: they are present virtually everywhere

--

> A language that doesn't affect the way you think about programming, is not worth knowing.
>
> -- Alan Perlis, Epigrams on Programming

???

That's because they are required for POSIX compliance: https://pubs.opengroup.org/onlinepubs/9699919799/

https://en.wikiquote.org/wiki/Alan_Perlis#Epigrams_on_Programming,_1982

---
class: middle, center, inverse

# `sed`

Aka "**s**tream **ed**itor"

---

# `sed`

Takes in a stream of text _line by line_ and transforms it in one go.

--

The syntax of `sed` commands is

.center[`[addr]X[options]`]

where `X` is a single-letter `sed` command (`s` in the example below).

--

`sed [cmd] [filename]` or `cat [filename] | sed [cmd]`

```bash
$ cat text.txt
sed is a Unix utility that transforms text.
sed was developed from 1973 to 1974 by Lee E. McMahon of Bell Labs.
sed was based on the scripting features of the interactive editor ed.

$ cat text.txt | sed 's/Unix/UNIX/'
sed is a UNIX utility that transforms text.
sed was developed from 1973 to 1974 by Lee E. McMahon of Bell Labs.
sed was based on the scripting features of the interactive editor ed.

$ sed 's/Unix/UNIX/' text.txt
sed is a UNIX utility that transforms text.
sed was developed from 1973 to 1974 by Lee E. McMahon of Bell Labs.
sed was based on the scripting features of the interactive editor ed.
```

---

# `sed`: substitution

.center[*The most common use case of `sed`, denoted by `s`*]

The syntax of the `s` command is `s/[regex]/[replacement]/[flags]`

--

`sed 's/[regex]/[replacement]/'` - replace text matched by `[regex]` with `[replacement]`

```bash
$ cat text.txt
sed is a Unix utility that transforms text.
sed was developed from 1973 to 1974 by Lee E. McMahon of Bell Labs.
sed was based on the scripting features of the interactive editor ed.

$ cat text.txt | sed 's/the/THE/'
sed is a Unix utility that transforms text.
sed was developed from 1973 to 1974 by Lee E. McMahon of Bell Labs.
sed was based on `THE` scripting features of the interactive editor ed.
```

- by default, only the first match *on the line* gets replaced
- this can be changed with the `g` flag

---

# `sed`: (global) substitution

`sed 's/[regex]/[replacement]/g'` - replace text matched by `[regex]` with `[replacement]` **globally** (every occurrence on the line)

```bash
$ cat text.txt
sed is a Unix utility that transforms text.
sed was developed from 1973 to 1974 by Lee E. McMahon of Bell Labs.
sed was based on the scripting features of the interactive editor ed.

$ cat text.txt | sed 's/the/THE/g'
sed is a Unix utility that transforms text.
sed was developed from 1973 to 1974 by Lee E. McMahon of Bell Labs.
sed was based on `THE` scripting features of `THE` interactive editor ed.
```

---

# `sed`: (extended) regular expressions

.left-eq-column[
.pure-table.pure-table-striped.mkdown-table[
| expr      | description                  |
|-----------|------------------------------|
| `.`       | any character                |
| `[ ]`     | character class (or `[^ ]`)  |
| `^`       | beginning of the line        |
| `$`       | end of the line              |
| `?`       | match once or not at all     |
| `+`       | match 1+ times               |
| `*`       | match 0+ times               |
| `{2,7}`   | two to seven matches         |
| `[r]∣[e]` | match regex `[r]` or `[e]`   |
| `([r])`   | reference for regex `[r]`    |
]
]

.right-eq-column[
Extended regular expressions can be turned on with `-E`.

```bash
$ echo hello | sed -E 's/[a-m]+/XXXX/'
XXXXo

$ echo hello | sed -E 's/[lia]{2}/ZZ/'
heZZo
```
]

---

# `sed`: regex references and alternatives

Once part of a regex gets enclosed in parentheses `()`, it can be referenced later.

--

The `m`-th enclosed regex can be referenced via `\m`

```bash
$ cat tenses.txt
I was there.
He will be here.
It is everywhere.

$ cat tenses.txt | sed -E 's/([her]+)/[\1]/'
I was t[here].
H[e] will be here.
It is [e]verywhere.
```

--

Using `|`, alternatives can be provided inside the parentheses (`()`).

```bash
$ cat tenses.txt | sed -E 's/.*(is|was).*/# Found \1 on this line/'
# Found was on this line
He will be here.
# Found is on this line
```

---

# `sed`: regex references and alternatives

```bash
$ cat repetition.txt
abc`abc`
djaejk
asdhrj
bbccdd
xxs`xxs`
```

The references can also be used directly in the regular expression:

```bash
$ cat repetition.txt | sed -E 's/^(.*)\1$/\1/'
abc
djaejk
asdhrj
bbccdd
xxs
```

---

# `sed`: referencing the whole match

If we want to reference the whole match, we can use `&`.

--

*Suppose we have the following text*

```bash
$ cat text.txt
sed is a Unix utility that transforms text.
sed was developed from 1973 to 1974 by Lee E. McMahon of Bell Labs.
sed was based on the scripting features of the interactive editor ed.
```

*The `sed` command below will put all numbers into square brackets:*

```bash
$ cat text.txt | sed -E 's/[0-9]+/[&]/g'
sed is a Unix utility that transforms text.
sed was developed from [1973] to [1974] by Lee E. McMahon of Bell Labs.
sed was based on the scripting features of the interactive editor ed.
```

---

# `sed`: `[addr]`

Recall that `sed` commands have the following structure:

.center[`[addr]X[options]`]

--

Let's discuss `[addr]` a bit.

--

------------------------

.left-eq-smaller-column[
`sed "[cmd]"` - apply `[cmd]` on all lines

`sed "5 [cmd]"` - apply `[cmd]` on line 5

`sed "$ [cmd]"` - apply `[cmd]` on the last line
]

.right-eq-larger-column[
```bash
$ cat tenses.txt | grep here
I was t`here`.
He will be `here`.
It is everyw`here`.

$ cat tenses.txt | sed "2 s/here/home/"
I was there.
He will be `home`.
It is everywhere.

$ cat tenses.txt | sed "$ s/where/one/"
I was there.
He will be here.
It is every`one`.
```
]

---

# `sed`: `[addr]` via regex

Regular expressions can also be used as an address.
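The address forms seen so far (a line number, `$` for the last line, and a regex) can be compared side by side. The snippet below is only a sketch: it recreates `tenses.txt` with `printf`, and the variable names and the `/will/` address are made up for illustration.

```bash
# Recreate the sample file so the example is self-contained.
printf '%s\n' 'I was there.' 'He will be here.' 'It is everywhere.' > tenses.txt

# Line-number address: only line 2 is changed.
by_number=$(sed '2 s/here/home/' tenses.txt)

# Last-line address: single quotes keep the shell away from the `$`.
last_line=$(sed '$ s/where/one/' tenses.txt)

# Regex address: only lines matching /will/ are changed.
by_regex=$(sed '/will/ s/here/home/' tenses.txt)

echo "$by_number"
echo "$last_line"
echo "$by_regex"
```

Single quotes are used here instead of the double quotes on the slides so that the shell never tries to interpret `$`.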
`sed "/was/ s/here/orn/"` - on lines that match `was`, replace `here` with `orn`

--

```bash
$ cat tenses.txt
I `was` there.
He will be here.
It is everywhere.

$ cat tenses.txt | sed "/was/ s/here/orn/"
I was torn.
He will be here.
It is everywhere.
```

---

# `sed`: other commands

`sed "[addr] d"` - delete lines described by `[addr]`

`sed "[addr] p"` - print lines described by `[addr]`

Note that the space between `[addr]` and the command is optional.

--

```bash
$ cat text.txt
sed is a Unix utility that transforms text.
sed was developed from 1973 to 1974 by Lee E. McMahon of Bell Labs.
sed was based on the scripting features of the interactive editor ed.
```

The following deletes the second line:

```bash
$ cat text.txt | sed 2d
sed is a Unix utility that transforms text.
sed was based on the scripting features of the interactive editor ed.
```

---

# `sed`: useful options

`-i` - edit file "in place"

```bash
$ cat tenses.txt
I `was` there.
He will be here.
It is everywhere.

$ sed -i "/was/ s/here/orn/" tenses.txt

$ cat tenses.txt
I was torn.
He will be here.
It is everywhere.
```

--

`-n` - do not automatically print every line (only explicitly printed lines are shown)

- works nicely in combination with the `p` command

```bash
# Print specific (third) line of a file
$ sed -n 3p tenses.txt
It is everywhere.
```

---

# `sed`: custom separator

`sed` is well known for its `/` separator (`s/foo/bar/` has become somewhat commonplace).

--

But suppose we want to get rid of `http://` in `http://data.science.com`.

--

Thankfully, almost any other character can be used as a separator, most commonly `#`:

```bash
$ echo "http://data.science.com" | sed 's#http://##'
data.science.com
```

---
class: middle, center, inverse

# `awk`

The simplest and most effective programming language you'll learn in 20 minutes

???

https://en.wikiquote.org/wiki/Alan_Perlis#Epigrams_on_Programming,_1982

---

# `awk`

> A language that doesn't affect the way you think about programming, is not worth knowing.
>
> -- Alan Perlis, Epigrams on Programming

The name is an acronym of its authors' surnames: **A**ho, **W**einberger and **K**ernighan.

--

.left-eq-column[
It follows the pattern-action paradigm.

```
pattern1 { action1 }
pattern2 { action2; action3 }
...
```
]

.right-eq-column[
**pattern**:
- regular expression, numerical expression, string expression or a combination of these
- by default each line matches

**action**:
- executable code (the default action is to print the line)
]

---

# `awk`: quick example

```bash
$ cat people.txt
Amelia 555-5553 amelia.zodiacusque@gmail.com F
Anthony 555-3412 anthony.asserturo@hotmail.com A
Becky 555-7685 becky.algebrarum@gmail.com A
Bill 555-1675 bill.drowning@hotmail.com A
Broderick 555-0542 broderick.aliquotiens@yahoo.com R
Camilla 555-2912 camilla.infusarum@skynet.be R
Fabius 555-1234 fabius.undevicesimus@ucb.edu F
```

--

.left-eq-column[
Show phone numbers only:

```bash
$ cat people.txt | awk '{ print $2 }'
555-5553
555-3412
555-7685
555-1675
555-0542
555-2912
555-1234
```
]

.right-eq-column[
Show emails only:

```bash
$ cat people.txt | awk '{ print $3 }'
amelia.zodiacusque@gmail.com
anthony.asserturo@hotmail.com
becky.algebrarum@gmail.com
bill.drowning@hotmail.com
broderick.aliquotiens@yahoo.com
camilla.infusarum@skynet.be
fabius.undevicesimus@ucb.edu
```
]

---

# `awk`: patterns

- *empty* - action(s) executed for each input line (default if pattern not specified)
- `/[regex]/` - action(s) will be executed if the regular expression matches the line
- `BEGIN` - action(s) executed before the input gets processed
- `END` - action(s) executed after the input gets processed

--

```
$ cat awktext.txt
AWK was created at Bell Labs in the 1970s.
Its name is derived from the surnames of its authors.
The acronym is pronounced the same as the bird auk.

$ cat awktext.txt | awk '/is/'
Its name is derived from the surnames of its authors.
The acronym is pronounced the same as the bird auk.
```

???

Regexes allow us to use `awk` much like `grep`.
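The `BEGIN` and `END` patterns from the list above can be sketched together with a regex pattern in a single program. This is an illustrative example rather than part of the original material: the file is recreated with `printf`, and the `matches:` / `total:` labels and the counter `n` are made up.

```bash
# Recreate the sample file so the example is self-contained.
printf '%s\n' \
  'AWK was created at Bell Labs in the 1970s.' \
  'Its name is derived from the surnames of its authors.' \
  'The acronym is pronounced the same as the bird auk.' > awktext.txt

# BEGIN runs once before any input is read,
# /is/ runs for every line matching the regex,
# END runs once after the last line has been processed.
out=$(awk 'BEGIN { print "matches:" }
           /is/  { n++; print "  " $0 }
           END   { print "total:", n }' awktext.txt)
echo "$out"
```

The counter `n` needs no declaration because `awk` treats an unset variable as zero.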
---

# `awk`: patterns & pre-filled variables

Internally, `awk` works along two dimensions: lines (called records) and "columns" (called fields)

- `RS` - internal variable that contains the "record separator"
  - newline (`\n`) by default
- `FS` - internal variable that contains the "field separator"
  - space (`' '`) by default, which splits fields on any run of whitespace
  - can be set via the `-F` flag (e.g. `awk -F:`)

--

`awk` pre-fills quite a few other variables:

.left-eq-column[
- `NR` - number of records (rows or lines) `awk` has processed so far
- `NF` - number of fields (columns) in the current record (line)
]

.right-eq-column[
Each field (column) has its own "special" variable:
- `$1`: the first field
- `$N`: the `N`-th field
- `$0`: the whole record (row or line)
]

.clear-both[
```bash
$ echo 'foo:123:bar:789' | awk -F: '{ print $3, $2, $0 }'
bar 123 foo:123:bar:789
```
]

---

# `awk`: patterns & pre-filled variables II

_Print everything from the third line onwards_

```bash
$ cat people.txt | awk 'NR>2'
Becky 555-7685 becky.algebrarum@gmail.com A
Bill 555-1675 bill.drowning@hotmail.com A
Broderick 555-0542 broderick.aliquotiens@yahoo.com R
Camilla 555-2912 camilla.infusarum@skynet.be R
Fabius 555-1234 fabius.undevicesimus@ucb.edu F
```

_Print all names of friends (`F` in the last column)_

```bash
$ cat people.txt | awk '$4 == "F" {print $1}'
Amelia
Fabius
```

_Print all phone numbers of relatives (`R` in the last column)_

```bash
$ cat people.txt | awk '$4 == "R" {print $2}'
555-0542
555-2912
```

---

# `awk`: operators and variables

- All standard operators work out of the box
  - That is, `>`, `<`, `>=`, `<=`, `==` and `!=` work as you'd expect them to
- Custom variables need no declaration: they start out as zero (or an empty string, or an empty array).

```bash
$ ls *.txt -l
-rw-rw-r--. 1 mrshu mrshu `149` Nov 14 23:18 awktext.txt
-rw-r--r--. 1 mrshu mrshu `  0` Nov  2 11:29 newfile.txt
-rw-rw-r--. 1 mrshu mrshu `420` Nov 18 21:18 people.txt
-rw-rw-r--. 1 mrshu mrshu ` 35` Nov 14 13:56 repetition.txt
-rw-rw-r--. 1 mrshu mrshu ` 59` Nov 14 16:52 tenses_new.txt
-rw-rw-r--. 1 mrshu mrshu ` 48` Nov 14 15:57 tenses.txt
-rw-rw-r--. 1 mrshu mrshu `182` Nov 14 12:18 text.txt
```

--

_Sum the size of all files of at least 100 bytes:_

```bash
$ ls *.txt -l | awk '$5 >= 100 {sum += $5} END { print sum }'
751
```

--

_What's the average file size (rounded to two decimal places)?_

```bash
$ ls *.txt -l | awk '{sum += $5} END { printf "avg=%.2f\n", sum/NR }'
avg=127.57
```

---

# `awk`: operators and variables II

- Increment (`++`, `+=`) and decrement (`--`, `-=`) operators work out of the box
- Associative arrays are automatically initialized

```bash
$ cat people.txt
Amelia 555-5553 amelia.zodiacusque@gmail.com F
Anthony 555-3412 anthony.asserturo@hotmail.com A
Becky 555-7685 becky.algebrarum@gmail.com A
Bill 555-1675 bill.drowning@hotmail.com A
Broderick 555-0542 broderick.aliquotiens@yahoo.com R
Camilla 555-2912 camilla.infusarum@skynet.be R
Fabius 555-1234 fabius.undevicesimus@ucb.edu F
```

_How many acquaintances (A) and relatives (R) do we have in our dataset?_

```bash
$ cat people.txt | awk '{ p[$4]++ } END { print "A:", p["A"], "| R:", p["R"] }'
A: 3 | R: 2
```

---

# `awk`: control statements

- All the standard control statements (`if`/`else`, `while`, `for`, `break`, `continue`) work as you would expect them to, with C-like syntax

```bash
$ cat people.txt
Amelia 555-5553 amelia.zodiacusque@gmail.com F
Anthony 555-3412 anthony.asserturo@hotmail.com A
Becky 555-7685 becky.algebrarum@gmail.com A
Bill 555-1675 bill.drowning@hotmail.com A
Broderick 555-0542 broderick.aliquotiens@yahoo.com R
Camilla 555-2912 camilla.infusarum@skynet.be R
Fabius 555-1234 fabius.undevicesimus@ucb.edu F
```

_How many acquaintances (A), friends (`F`) and relatives (R) do we have in our dataset?_

```bash
$ cat people.txt | awk '{ p[$4]++ } END { for(i in p) print i, ":", p[i] }'
A : 3
R : 2
F : 2
```

---

# `awk`: actions (built-in functions)

- `print`
  - the default action if not specified
  - prints the string out to the standard output

```bash
# awk concatenates strings automatically
# this basically generates a CSV
$ cat people.txt | awk '{ print $1 "," $2 "," $4 }'
Amelia`,`555-5553`,`F
Anthony`,`555-3412`,`A
Becky`,`555-7685`,`A
Bill`,`555-1675`,`A
Broderick`,`555-0542`,`R
Camilla`,`555-2912`,`R
Fabius`,`555-1234`,`F
```

- `printf "[formatstr]", variable`
  - prints out the `variable` according to `[formatstr]`
  - `[formatstr]` can contain
      - `%s`: string
      - `%d`: integer
      - `%f`: float

---

# `awk`: actions (built-in functions) II

- `length(s)` - return the length of string `s`
- `tolower(s)` - lowercase the string `s`
- `toupper(s)` - uppercase the string `s`
- `gsub(r, s, t)` - replace every match of the regular expression `r` with `s` in the string `t` (`$0` if `t` is not provided)
- `system(c)` - run the command `c`

---

# `awk`: sample implementation

AWK's secret weapon is the pattern-action paradigm:

```
pattern1 { action1 }
pattern2 { action2; action3 }
...
```

--

It allows not just for short (at most 2 lines) and simple-yet-powerful programs but also for a simple implementation.

--

```python
import re
import sys

# Hypothetical rule table: (compiled regex, callable action) pairs.
patterns_actions = [(re.compile("is"), lambda line: print(line, end=""))]

for line in sys.stdin:
    for pattern, action in patterns_actions:
        if pattern.search(line):  # like awk, match anywhere in the line
            action(line)
```

---
class: middle, center, inverse

# Useful commands

---

# `wget`

- _"web get"_ -- a tool for downloading files from the internet
- supports HTTP, HTTPS and FTP protocols
- `wget [URL] -O [filename]` - saves `[URL]` to `[filename]`
  - setting filename to `-` makes the output go to standard output

```bash
$ wget uniba.sk
--2020-11-18 22:56:41--  http://uniba.sk/
Resolving uniba.sk (uniba.sk)... 158.195.6.138
Connecting to uniba.sk (uniba.sk)|158.195.6.138|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://uniba.sk/ [following]
--2020-11-18 22:56:41--  https://uniba.sk/
Connecting to uniba.sk (uniba.sk)|158.195.6.138|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘index.html’

index.html [ <=> ] 37.07K --.-KB/s in 0.04s

2020-11-18 22:56:41 (869 KB/s) - ‘index.html’ saved [37964]
```

- `-q` makes the output "quiet" (doesn't print extended info)

---

# `curl`

- the name stands for "Client URL" but _"`cat` URL"_ is a great mnemonic
- `curl` outputs the file it reads from the network to `stdout` by default

```bash
$ curl uniba.sk
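<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>301 Moved Permanently</title>
</head><body>
<h1>Moved Permanently</h1>
<p>The document has moved <a href="https://uniba.sk/">here</a>.</p>
<hr>
<address>Apache/2.2.22 (Debian) Server at uniba.sk Port 80</address>
</body></html>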
```

- `curl -o [filename] [url]` saves `[url]` to `[filename]` (also works with)
- redirecting `stdout` to a file does the same thing

```bash
$ curl uniba.sk > index.html
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   299  100   299    0     0   4462      0 --:--:-- --:--:-- --:--:--  4462
```

- `-s` makes the output "silent" (no progress meter or error messages)

???

https://ec.haxx.se/curl/curl-name

---

# `wget` vs `curl`

Much of their functionality is the same. There are a few important differences though.

.left-eq-column[
### **`wget`**:
- is a bit older and available on more devices (due to being part of GNU)
- capable of doing recursive downloads (as in _"save all you find on this URL to disk"_)
- can be found on BusyBox (albeit as a stripped-down clone)
- can be typed in using only the left hand on a QWERTY keyboard!
]

.right-eq-column[
### **`curl`**:
- works much better with pipes and Unix scripts in general
- has upload capabilities
- supports more protocols (even ones like `TELNET`, `IMAP` or `SMTP`)
- comes pre-installed on macOS and Windows 10 (!)
]

???

https://daniel.haxx.se/docs/curl-vs-wget.html

---

# `diff`

Show differences between two files, line by line.

.left-eq-column[
```
$ cat tenses.txt
I was there.
He will be here.
It is everywhere.
```
]

.right-eq-column[
```
$ cat tenses_new.txt
I was `where you were not`.
He will be here.
It is `in there`.
```
]

.clear-both[
```
$ diff tenses.txt tenses_new.txt
1c1
< I was there.
---
> I was where you were not.
3c3
< It is everywhere.
---
> It is in there.
```
]

If you'd like to see the diff side-by-side, you can use `diff -y` or (even) `vimdiff`.
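A small script can reproduce the comparison above. It is a sketch under a few assumptions: both files are recreated with `printf`, the variable name `out` is arbitrary, and `-y` assumes a GNU or BSD `diff` (it is not part of POSIX).

```bash
# Recreate both sample files so the example is self-contained.
printf '%s\n' 'I was there.' 'He will be here.' 'It is everywhere.' > tenses.txt
printf '%s\n' 'I was where you were not.' 'He will be here.' 'It is in there.' > tenses_new.txt

# diff exits with status 1 when the files differ, so `|| true`
# keeps a script running under `set -e` from aborting here.
out=$(diff tenses.txt tenses_new.txt || true)
echo "$out"

# Side-by-side view (GNU/BSD extension).
diff -y tenses.txt tenses_new.txt || true
```

The exit status is useful in its own right: `diff -q a b` can serve as a quick "are these files identical?" check inside an `if`.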