Advanced text editing

With sed and awk

Marek Šuppa
Ondrej Jariabka
Adrián Matejov

1 / 55

Why UNIX for Data Science?

  • The tools we'll learn about today will sound strange and obsolete (their syntax almost certainly will)

  • But the reason why we learn about them is simple: they are present virtually everywhere

A language that doesn't affect the way you think about programming, is not worth knowing.

-- Alan Perlis, Epigrams on Programming

4 / 55

sed

Aka "stream editor"

5 / 55

sed

Takes in a stream of text line by line and transforms it in a single pass.

The syntax of sed commands is

[addr]X[options]

where X is a single-letter sed command (s in the example below).

sed [cmd] [filename] or cat [filename] | sed [cmd]

$ cat text.txt
sed is a Unix utility that transforms text.
sed was developed from 1973 to 1974 by Lee E. McMahon of Bell Labs.
sed was based on the scripting features of the interactive editor ed.
$ cat text.txt | sed 's/Unix/UNIX/'
sed is a UNIX utility that transforms text.
sed was developed from 1973 to 1974 by Lee E. McMahon of Bell Labs.
sed was based on the scripting features of the interactive editor ed.
$ sed 's/Unix/UNIX/' text.txt
sed is a UNIX utility that transforms text.
sed was developed from 1973 to 1974 by Lee E. McMahon of Bell Labs.
sed was based on the scripting features of the interactive editor ed.
8 / 55

sed: substitution

The most common use case of sed, denoted by s

The syntax of the s command is s/[regex]/[replacement]/[flags]

sed 's/[regex]/[replacement]/'

  • replace text matched by [regex] with [replacement]
$ cat text.txt
sed is a Unix utility that transforms text.
sed was developed from 1973 to 1974 by Lee E. McMahon of Bell Labs.
sed was based on the scripting features of the interactive editor ed.
$ cat text.txt | sed 's/the/THE/'
sed is a Unix utility that transforms text.
sed was developed from 1973 to 1974 by Lee E. McMahon of Bell Labs.
sed was based on THE scripting features of the interactive editor ed.
  • by default, only the first match on the line gets replaced

  • this can be changed with the g flag

10 / 55

sed: (global) substitution

sed 's/[regex]/[replacement]/g'

  • replace text matched by [regex] with [replacement] globally (every occurrence on the line)
$ cat text.txt
sed is a Unix utility that transforms text.
sed was developed from 1973 to 1974 by Lee E. McMahon of Bell Labs.
sed was based on the scripting features of the interactive editor ed.
$ cat text.txt | sed 's/the/THE/g'
sed is a Unix utility that transforms text.
sed was developed from 1973 to 1974 by Lee E. McMahon of Bell Labs.
sed was based on THE scripting features of THE interactive editor ed.
11 / 55

sed: (extended) regular expressions

expr      description
.         any single character
[ ]       character class ([^ ] negates it)
^         beginning of the line
$         end of the line
?         match zero or one time
+         match one or more times
*         match zero or more times
{2,7}     match two to seven times
[r]|[e]   match regex [r] or regex [e]
([r])     group regex [r] so that it can be referenced later

Extended regular expressions can be turned on with -E.

$ echo hello | sed -E 's/[a-m]+/XXXX/'
XXXXo
$ echo hello | sed -E 's/[lia]{2}/ZZ/'
heZZo
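
The anchors, the optional marker and alternation from the table can be tried the same way. This is only a small sketch: the word list is made up for illustration, and the variable names are arbitrary.

```shell
# Hypothetical sample input; the words are made up for illustration.
words='color
colour
colouur
cat'

# ? makes the preceding "u" optional; ^ and $ anchor the match to the whole line.
optional=$(printf '%s\n' "$words" | sed -E 's/^colou?r$/MATCH/')

# | picks between alternatives inside a group; {2} repeats the previous item exactly twice.
alternation=$(printf '%s\n' "$words" | sed -E 's/^(cat|colou{2}r)$/FOUND/')

printf '%s\n---\n%s\n' "$optional" "$alternation"
```

Lines that do not match the pattern pass through unchanged, which is why the last two words survive the first command and the first two survive the second.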
12 / 55

sed: regex references and alternatives

Once part of a regex gets enclosed in parentheses (), it can be referenced later.

The m-th parenthesized group can be referenced via \m

$ cat tenses.txt
I was there.
He will be here.
It is everywhere.
$ cat tenses.txt | sed -E 's/([her]+)/[\1]/'
I was t[here].
H[e] will be here.
It is [e]verywhere.

Using |, alternatives can be provided inside the parentheses.

$ cat tenses.txt | sed -E 's/.*(is|was).*/# Found \1 on this line/'
# Found was on this line
He will be here.
# Found is on this line
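
With more than one group, \1 and \2 can be combined to reorder parts of a line. A minimal sketch, assuming input in a made-up "surname, name" layout:

```shell
# Hypothetical input in "surname, name" form, made up for this sketch.
names='Kernighan, Brian
Aho, Alfred'

# Group 1 captures the text before the comma, group 2 the text after it;
# the replacement emits them in the opposite order.
swapped=$(printf '%s\n' "$names" | sed -E 's/^([^,]+), (.+)$/\2 \1/')

printf '%s\n' "$swapped"
```

Groups are numbered by the position of their opening parenthesis, counted from the left.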
15 / 55

sed: regex references and alternatives

$ cat repetition.txt
abcabc
djaejk
asdhrj
bbccdd
xxsxxs

The references can also be used directly in the regular expression:

$ cat repetition.txt | sed -E 's/^(.*)\1$/\1/'
abc
djaejk
asdhrj
bbccdd
xxs
16 / 55

sed: referencing the whole match

If we want to reference the whole match, we can use &.

Suppose we have the following text:

$ cat text.txt
sed is a Unix utility that transforms text.
sed was developed from 1973 to 1974 by Lee E. McMahon of Bell Labs.
sed was based on the scripting features of the interactive editor ed.

The sed command below will put all numbers into square brackets:

$ cat text.txt | sed -E 's/[0-9]+/[&]/g'
sed is a Unix utility that transforms text.
sed was developed from [1973] to [1974] by Lee E. McMahon of Bell Labs.
sed was based on the scripting features of the interactive editor ed.
18 / 55

sed: [addr]

Recall that sed commands have the following structure: [addr]X[options]

Let's discuss [addr] a bit.


sed "[cmd]"

  • apply [cmd] on all lines

sed "5 [cmd]"

  • apply [cmd] on line 5

sed '$ [cmd]'

  • apply [cmd] on the last line (single quotes keep the shell from expanding $)
$ cat tenses.txt | grep here
I was there.
He will be here.
It is everywhere.
$ cat tenses.txt | sed "2 s/here/home/"
I was there.
He will be home.
It is everywhere.
$ cat tenses.txt | sed "$ s/where/one/"
I was there.
He will be here.
It is everyone.
21 / 55

sed: [addr] via regex

Regular expressions can also be used as an address.

sed "/was/ s/here/orn/"

  • on lines that match was, replace here with orn
$ cat tenses.txt
I was there.
He will be here.
It is everywhere.
$ cat tenses.txt | sed "/was/ s/here/orn/"
I was torn.
He will be here.
It is everywhere.
23 / 55

sed: other commands

sed "[addr] d"

  • delete lines described by [addr]

sed "[addr] p"

  • print lines described by [addr] (in addition to the automatic output, so they appear twice unless -n is used)

Note that the space between [addr] and the command is optional.

$ cat text.txt
sed is a Unix utility that transforms text.
sed was developed from 1973 to 1974 by Lee E. McMahon of Bell Labs.
sed was based on the scripting features of the interactive editor ed.

The following deletes the second line:

$ cat text.txt | sed 2d
sed is a Unix utility that transforms text.
sed was based on the scripting features of the interactive editor ed.
25 / 55

sed: useful options

-i

  • edit file "in place" (GNU sed syntax; BSD/macOS sed requires a backup suffix argument, e.g. -i '')
$ cat tenses.txt
I was there.
He will be here.
It is everywhere.
$ sed -i "/was/ s/here/orn/" tenses.txt
$ cat tenses.txt
I was torn.
He will be here.
It is everywhere.

-n

  • suppress the automatic printing of every line
  • works nicely in combination with the p command
# Print specific (third) line of a file
$ sed -n 3p tenses.txt
It is everywhere.
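
Addresses, the p and d commands and -n combine freely. The sketch below reuses the three sample sentences from tenses.txt as a shell variable; the variable names are arbitrary.

```shell
# Sample input reused from the slides (tenses.txt before any in-place edit).
tenses='I was there.
He will be here.
It is everywhere.'

# -n plus a regex address and p prints only the matching lines, much like grep.
only_will=$(printf '%s\n' "$tenses" | sed -n '/will/p')

# Without -n, p prints the addressed line a second time.
doubled=$(printf '%s\n' "$tenses" | sed '1p')

# d with the $ address drops the last line; single quotes keep $ away from the shell.
no_last=$(printf '%s\n' "$tenses" | sed '$d')

printf '%s\n---\n%s\n---\n%s\n' "$only_will" "$doubled" "$no_last"
```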
27 / 55

sed: custom separator

sed is well known for its / separator (s/foo/bar/ has become somewhat commonplace).

But suppose we want to get rid of http:// in http://data.science.com.

Thankfully, almost any other character can be used as the separator; # is a common choice:

$ echo "http://data.science.com" | sed 's#http://##'
data.science.com
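
File paths are another case where a different separator pays off. A small sketch comparing the escaped and the unescaped form; the path is hypothetical:

```shell
# Hypothetical path, chosen only because it is full of slashes.
path='/usr/local/bin/tool'

# With / as the separator every slash in the pattern must be escaped...
escaped=$(printf '%s\n' "$path" | sed 's/\/usr\/local/\/opt/')

# ...while | as the separator lets the same command be written plainly.
plain=$(printf '%s\n' "$path" | sed 's|/usr/local|/opt|')

printf '%s\n%s\n' "$escaped" "$plain"
```

Both commands produce the same result; the second one is simply easier to read.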
30 / 55

awk

The simplest and most effective programming language you'll learn in 20 minutes

31 / 55

awk

A language that doesn't affect the way you think about programming, is not worth knowing.

-- Alan Perlis, Epigrams on Programming

The name is formed from the initials of its authors: Aho, Weinberger and Kernighan.

It follows the pattern-action paradigm.

pattern1 { action1 }
pattern2 { action2; action3 }
...

pattern:

  • regular expression, numerical expression, string expression or a combination of these
  • by default each line matches

action:

  • executable code (the default action is to print the line)
33 / 55

awk: quick example

$ cat people.txt
Amelia 555-5553 amelia.zodiacusque@gmail.com F
Anthony 555-3412 anthony.asserturo@hotmail.com A
Becky 555-7685 becky.algebrarum@gmail.com A
Bill 555-1675 bill.drowning@hotmail.com A
Broderick 555-0542 broderick.aliquotiens@yahoo.com R
Camilla 555-2912 camilla.infusarum@skynet.be R
Fabius 555-1234 fabius.undevicesimus@ucb.edu F

Show phone numbers only:

$ cat people.txt | awk '{ print $2 }'
555-5553
555-3412
555-7685
555-1675
555-0542
555-2912
555-1234

Show emails only:

$ cat people.txt | awk '{ print $3 }'
amelia.zodiacusque@gmail.com
anthony.asserturo@hotmail.com
becky.algebrarum@gmail.com
bill.drowning@hotmail.com
broderick.aliquotiens@yahoo.com
camilla.infusarum@skynet.be
fabius.undevicesimus@ucb.edu
35 / 55

awk: patterns

  • empty

    • action(s) executed for each input line (default if pattern not specified)
  • /[regex]/

    • action(s) will be executed if the regular expression matches the line
  • BEGIN

    • action(s) executed before the input gets processed
  • END

    • action(s) executed after the input gets processed
$ cat awktext.txt
AWK was created at Bell Labs in the 1970s.
Its name is derived from the surnames of its authors.
The acronym is pronounced the same as the bird auk.
$ cat awktext.txt | awk '/is/'
Its name is derived from the surnames of its authors.
The acronym is pronounced the same as the bird auk.
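
BEGIN and END can be combined with a regex pattern in one program. A minimal sketch that counts matching lines; the counter name hits is an arbitrary choice:

```shell
# Sample text reused from the slide above.
awktext='AWK was created at Bell Labs in the 1970s.
Its name is derived from the surnames of its authors.
The acronym is pronounced the same as the bird auk.'

# BEGIN runs once before any input, the /is/ rule runs for each matching line,
# and END runs once after the last line has been read.
report=$(printf '%s\n' "$awktext" | awk '
  BEGIN { print "start" }
  /is/  { hits++ }
  END   { print "lines matching /is/:", hits }
')

printf '%s\n' "$report"
```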

Regexes allow us to use awk much like grep.

37 / 55

awk: patterns & pre-filled variables

Internally, awk works along two dimensions: lines (called records) and columns (called fields)

  • RS
    • internal variable that contains the "record separator"
    • newline (\n) by default
  • FS
    • internal variable that contains the "field separator"
    • a single space (' ') by default, which splits fields on any run of spaces or tabs
    • can be set via the -F flag (e.g. awk -F:)

awk pre-fills quite a few other variables:

  • NR
    • number of records (lines) awk has read so far
  • NF
    • number of fields (columns) in the current record (line)

Each field (column) has its own "special" variable:

  • $1: the first field
  • $N: the N-th field
  • $0: the whole record (line)
$ echo 'foo:123:bar:789' | awk -F: '{ print $3, $2, $0 }'
bar 123 foo:123:bar:789
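
NR and NF can be printed like any other value, and since NF holds the number of fields, $NF refers to the last field. A small sketch on made-up colon-separated records:

```shell
# Made-up colon-separated records, in the spirit of the example above.
records='foo:123:bar
baz:456'

# NR is the number of the current record, NF the number of fields in it,
# and $NF therefore refers to the last field of each record.
summary=$(printf '%s\n' "$records" | awk -F: '{ print NR, NF, $NF }')

printf '%s\n' "$summary"
```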
39 / 55

awk: patterns & pre-filled variables II

Print everything from the third line onwards

$ cat people.txt | awk 'NR>2'
Becky 555-7685 becky.algebrarum@gmail.com A
Bill 555-1675 bill.drowning@hotmail.com A
Broderick 555-0542 broderick.aliquotiens@yahoo.com R
Camilla 555-2912 camilla.infusarum@skynet.be R
Fabius 555-1234 fabius.undevicesimus@ucb.edu F

Print all names of friends (F in the last column)

$ cat people.txt | awk '$4 == "F" {print $1}'
Amelia
Fabius

Print all phone numbers of relatives (R in the last column)

$ cat people.txt | awk '$4 == "R" {print $2}'
555-0542
555-2912
40 / 55

awk: operators and variables

  • All standard operators work out of the box
    • That is, >, <, >=, <=, == and != work as you'd expect them to
  • User-defined variables need no declaration: they start out as zero (or the empty string).
$ ls -l *.txt
-rw-rw-r--. 1 mrshu mrshu 149 Nov 14 23:18 awktext.txt
-rw-r--r--. 1 mrshu mrshu 0 Nov 2 11:29 newfile.txt
-rw-rw-r--. 1 mrshu mrshu 420 Nov 18 21:18 people.txt
-rw-rw-r--. 1 mrshu mrshu 35 Nov 14 13:56 repetition.txt
-rw-rw-r--. 1 mrshu mrshu 59 Nov 14 16:52 tenses_new.txt
-rw-rw-r--. 1 mrshu mrshu 48 Nov 14 15:57 tenses.txt
-rw-rw-r--. 1 mrshu mrshu 182 Nov 14 12:18 text.txt

Sum the sizes of all files that are at least 100 bytes large:

$ ls -l *.txt | awk '$5 >= 100 {sum += $5} END { print sum }'
751

What's the average file size (rounded to two decimal places)?

$ ls -l *.txt | awk '{sum += $5} END { printf "avg=%.2f\n", sum/NR }'
avg=127.57
43 / 55

awk: operators and variables II

  • Increment (++, +=) and decrement (--, -=) operators work out of the box
  • Associative arrays need no declaration; a missing element starts out empty, so p[key]++ just works
$ cat people.txt
Amelia 555-5553 amelia.zodiacusque@gmail.com F
Anthony 555-3412 anthony.asserturo@hotmail.com A
Becky 555-7685 becky.algebrarum@gmail.com A
Bill 555-1675 bill.drowning@hotmail.com A
Broderick 555-0542 broderick.aliquotiens@yahoo.com R
Camilla 555-2912 camilla.infusarum@skynet.be R
Fabius 555-1234 fabius.undevicesimus@ucb.edu F

How many acquaintances (A) and relatives (R) do we have in our dataset?

$ cat people.txt | awk '{ p[$4]++ } END { print "A:", p["A"], "| R:", p["R"] }'
A: 3 | R: 2
44 / 55

awk: control statements

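  • All the standard control statements (if/else, while, for, break, continue) work as you would expect them to, with C/Python-like syntax
  • for (i in p) visits the keys of the array p in an unspecified order, so the order of the output lines may differ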
$ cat people.txt
Amelia 555-5553 amelia.zodiacusque@gmail.com F
Anthony 555-3412 anthony.asserturo@hotmail.com A
Becky 555-7685 becky.algebrarum@gmail.com A
Bill 555-1675 bill.drowning@hotmail.com A
Broderick 555-0542 broderick.aliquotiens@yahoo.com R
Camilla 555-2912 camilla.infusarum@skynet.be R
Fabius 555-1234 fabius.undevicesimus@ucb.edu F

How many acquaintances (A), friends (F) and relatives (R) do we have in our dataset?

$ cat people.txt | awk '{ p[$4]++ } END { for(i in p) print i, ":", p[i] }'
A : 3
R : 2
F : 2
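
The other control statements look the same as in C. A minimal sketch with if/else and a counting for loop; the input records (a name followed by a count) are made up:

```shell
# Three made-up records: a name followed by a count.
counts='Amelia 3
Bill 0
Camilla 2'

# if/else picks the branch; the C-style for loop appends one "*" per unit of $2.
bars=$(printf '%s\n' "$counts" | awk '{
  if ($2 == 0) {
    print $1, "(none)"
  } else {
    stars = ""
    for (i = 0; i < $2; i++) stars = stars "*"
    print $1, stars
  }
}')

printf '%s\n' "$bars"
```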
45 / 55

awk: actions (built-in functions)

  • print
    • the default action if none is specified (it then prints the whole line, $0)
    • prints its arguments to standard output
# awk concatenates adjacent strings, no operator needed
# this basically generates a CSV
$ cat people.txt | awk '{ print $1 "," $2 "," $4 }'
Amelia,555-5553,F
Anthony,555-3412,A
Becky,555-7685,A
Bill,555-1675,A
Broderick,555-0542,R
Camilla,555-2912,R
Fabius,555-1234,F
  • printf "[formatstr]", variable
    • prints out the variable according to [formatstr]
    • [formatstr] can contain
      • %s: string
      • %d: integer
      • %f: float
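
The format specifiers can be mixed in one format string and accept the usual width and precision modifiers. A small sketch, assuming made-up records with an item name, a quantity and a unit price:

```shell
# Made-up records: an item name, a quantity and a unit price.
items='apple 3 0.5
melon 1 2.25'

# %-6s left-aligns a string in 6 columns, %d prints an integer,
# %.2f prints a float with two decimals; printf adds no newline on its own.
table=$(printf '%s\n' "$items" | awk '{ printf "%-6s %d x %.2f = %.2f\n", $1, $2, $3, $2 * $3 }')

printf '%s\n' "$table"
```

Unlike print, printf does not append a newline, hence the explicit \n at the end of the format string.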
46 / 55

awk: actions (built-in functions) II

  • length(s)
    • return the length of string s
  • tolower(s)
    • lowercase the string s
  • toupper(s)
    • uppercase the string s
  • gsub(r, s, t)
    • replace every match of the regular expression r with the substitution s in the string t ($0 if t is not provided)
  • system(c)
    • run the command c
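
Several of these functions can be tried on a single record. A minimal sketch; the input line is made up, and n is an arbitrary variable name:

```shell
# A made-up record to exercise the string functions.
line='Hello, awk world'

# length() counts characters, toupper() returns an upper-cased copy, and
# gsub() edits its target in place ($0 here) and returns the number of replacements.
result=$(printf '%s\n' "$line" | awk '{
  n = gsub(/o/, "0")
  print length($0), toupper($1), n, $0
}')

printf '%s\n' "$result"
```

Because gsub changes $0, the fields are split again afterwards, so $1 already contains the replaced character when toupper sees it.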
47 / 55

awk: sample implementation

AWK's secret weapon is the pattern-action paradigm:

pattern1 { action1 }
pattern2 { action2; action3 }
...

It allows not only for short (often just one or two lines) yet powerful programs, but also for a simple implementation of awk's main loop:

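# Python-style sketch of awk's main loop
for line in file.readlines():
    for pattern, actions in patterns_actions:
        if pattern.search(line):  # search, not match: awk regexes may match anywhere in the line
            eval(actions)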
50 / 55

Useful commands

51 / 55

wget

  • "web get" -- a tool for downloading files from the internet
  • supports HTTP, HTTPS and FTP protocols
  • wget [URL] -O [filename]
    • saves [URL] to [filename]
    • setting filename to - makes the output go to standard output
$ wget uniba.sk
--2020-11-18 22:56:41-- http://uniba.sk/
Resolving uniba.sk (uniba.sk)... 158.195.6.138
Connecting to uniba.sk (uniba.sk)|158.195.6.138|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://uniba.sk/ [following]
--2020-11-18 22:56:41-- https://uniba.sk/
Connecting to uniba.sk (uniba.sk)|158.195.6.138|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘index.html’
index.html [ <=> ] 37.07K --.-KB/s in 0.04s
2020-11-18 22:56:41 (869 KB/s) - ‘index.html’ saved [37964]
  • -q makes the output "quiet" (doesn't print extended info)
52 / 55

curl

  • the name stands for "Client URL" but "cat URL" is a great mnemonic
  • curl outputs the file it reads from the network to stdout by default
$ curl uniba.sk
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>301 Moved Permanently</title>
</head><body>
<h1>Moved Permanently</h1>
<p>The document has moved <a href="https://uniba.sk/">here</a>.</p>
<hr>
<address>Apache/2.2.22 (Debian) Server at uniba.sk Port 80</address>
</body></html>
  • curl -o [filename] [url] saves [url] to [filename] (also works with)
  • forwarding stdout to a file does the same thing
$ curl uniba.sk > index.html
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   299  100   299    0     0   4462      0 --:--:-- --:--:-- --:--:--  4462
  • -s makes the output "silent" (doesn't print extended info)
53 / 55

wget vs curl

Much of their functionality is the same. There are a few important differences, though:

wget:

  • is a bit older and available on more devices (due to being part of GNU)

  • capable of doing recursive downloads (as in "save all you find on this URL to disk")

  • can be found on busybox (albeit as a stripped-down clone)

  • can be typed in using only the left hand on a qwerty keyboard!

curl:

  • works much better with pipes and Unix scripts in general

  • has upload capabilities

  • supports more protocols (even ones like TELNET, IMAP or SMTP)

  • comes pre-installed on macOS and Windows 10 (!)

54 / 55

diff

Show differences between two files, line by line.

$ cat tenses.txt
I was there.
He will be here.
It is everywhere.
$ cat tenses_new.txt
I was where you were not.
He will be here.
It is in there.
$ diff tenses.txt tenses_new.txt
1c1
< I was there.
---
> I was where you were not.
3c3
< It is everywhere.
---
> It is in there.

If you'd like to see the diff side-by-side, you can use diff -y or (even) vimdiff.
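
In the default output format, a line such as 1c1 is a change command: line 1 of the first file has to be changed to obtain line 1 of the second file. The sketch below recreates the two sample files in a scratch directory (created with mktemp, an assumption about the environment) and extracts just these commands.

```shell
# Recreate the two sample files from the slide in a scratch directory.
dir=$(mktemp -d)
printf '%s\n' 'I was there.' 'He will be here.' 'It is everywhere.' > "$dir/tenses.txt"
printf '%s\n' 'I was where you were not.' 'He will be here.' 'It is in there.' > "$dir/tenses_new.txt"

# diff exits with status 1 when the files differ, so "|| true" keeps
# scripts running under "set -e" from stopping here.
changes=$(diff "$dir/tenses.txt" "$dir/tenses_new.txt" || true)

# Keep only the change commands: the lines that start with a line number.
commands=$(printf '%s\n' "$changes" | grep -E '^[0-9]')

printf '%s\n' "$commands"
rm -r "$dir"
```

The non-zero exit status is also what makes diff usable as a test in scripts: status 0 means the files are identical.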

55 / 55
