With sed and awk
Marek Šuppa
Ondrej Jariabka
Adrián Matejov
The tools we'll learn about today will sound strange and obsolete (their syntax almost certainly will).
But the reason we learn about them is simple: they are present virtually everywhere.
That's because they are required for POSIX compliance: https://pubs.opengroup.org/onlinepubs/9699919799/
A language that doesn't affect the way you think about programming, is not worth knowing.
-- Alan Perlis, Epigrams on Programming
https://en.wikiquote.org/wiki/Alan_Perlis#Epigrams_on_Programming,_1982
sed
Aka "stream editor"
sed
Takes in a stream of text line by line and transforms it in one go.
Usage: sed [cmd] [filename] or cat [filename] | sed [cmd]
    $ cat text.txt
    sed is a Unix utility that transforms text.
    sed was developed from 1973 to 1974 by Lee E. McMahon of Bell Labs.
    sed was based on the scripting features of the interactive editor ed.
    $ cat text.txt | sed 's/Unix/UNIX/'
    sed is a UNIX utility that transforms text.
    sed was developed from 1973 to 1974 by Lee E. McMahon of Bell Labs.
    sed was based on the scripting features of the interactive editor ed.
    $ sed 's/Unix/UNIX/' text.txt
    sed is a UNIX utility that transforms text.
    sed was developed from 1973 to 1974 by Lee E. McMahon of Bell Labs.
    sed was based on the scripting features of the interactive editor ed.
The syntax of sed commands is [addr]X[options], where X is a single-letter sed command (s in the example above).
sed: substitution
The most common use case of sed, denoted by s.
The syntax of the s command is s/[regex]/[replacement]/[flags]
sed 's/[regex]/[replacement]/' replaces [regex] with [replacement]
    $ cat text.txt
    sed is a Unix utility that transforms text.
    sed was developed from 1973 to 1974 by Lee E. McMahon of Bell Labs.
    sed was based on the scripting features of the interactive editor ed.
    $ cat text.txt | sed 's/the/THE/'
    sed is a Unix utility that transforms text.
    sed was developed from 1973 to 1974 by Lee E. McMahon of Bell Labs.
    sed was based on THE scripting features of the interactive editor ed.
By default, only the first match on each line gets replaced; this can be changed with the g flag.
sed: (global) substitution
sed 's/[regex]/[replacement]/g' replaces [regex] with [replacement] globally (every occurrence on the line)
    $ cat text.txt
    sed is a Unix utility that transforms text.
    sed was developed from 1973 to 1974 by Lee E. McMahon of Bell Labs.
    sed was based on the scripting features of the interactive editor ed.
    $ cat text.txt | sed 's/the/THE/g'
    sed is a Unix utility that transforms text.
    sed was developed from 1973 to 1974 by Lee E. McMahon of Bell Labs.
    sed was based on THE scripting features of THE interactive editor ed.
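As a small sketch of how the flags change the result on a single line, the following compares no flag, g, and a numeric flag. The sample sentence is made up for illustration. The numeric flag, which replaces only the N-th match, is not covered above but is part of standard sed.

```shell
# Made-up sample line with three occurrences of "the".
line='the cat and the dog and the bird'

first=$(printf '%s\n' "$line" | sed 's/the/THE/')    # first match only
all=$(printf '%s\n' "$line" | sed 's/the/THE/g')     # every match on the line
second=$(printf '%s\n' "$line" | sed 's/the/THE/2')  # only the second match

printf '%s\n' "$first" "$all" "$second"
```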
sed: (extended) regular expressions
expr | description |
---|---|
. | any character |
[ ] | character class (or [^ ] for its negation) |
^ | beginning of the line |
$ | end of the line |
? | match once or not at all |
+ | match 1+ times |
* | match 0+ times |
{2,7} | two to seven matches |
[r]∣[e] | match regex [r] or [e] |
([r]) | group regex [r] so that it can be referenced later |
Extended regular expressions can be turned on with -E.
    $ echo hello | sed -E 's/[a-m]+/XXXX/'
    XXXXo
    $ echo hello | sed -E 's/[lia]{2}/ZZ/'
    heZZo
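A few more rows of the table can be tried out the same way. This is only a sketch; the input strings are invented for the purpose.

```shell
# ? makes the preceding character optional.
opt=$(printf '%s\n' 'color colour colouur' | sed -E 's/colou?r/X/g')

# {2,7} needs between two and seven repetitions, so a lone "a" is left alone.
rep=$(printf '%s\n' 'a aa aaa' | sed -E 's/a{2,7}/Z/g')

# | picks either alternative.
alt=$(printf '%s\n' 'cat dog bird' | sed -E 's/cat|dog/pet/g')

printf '%s\n' "$opt" "$rep" "$alt"
```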
sed: regex references and alternatives
Once part of a regex gets enclosed in parentheses (), it can be referenced later.
The m-th enclosed regex can be referenced via \m
    $ cat tenses.txt
    I was there.
    He will be here.
    It is everywhere.
    $ cat tenses.txt | sed -E 's/([her]+)/[\1]/'
    I was t[here].
    H[e] will be here.
    It is [e]verywhere.
Using |, alternatives can be provided inside the parentheses.
    $ cat tenses.txt | sed -E 's/.*(is|was).*/# Found \1 on this line/'
    # Found was on this line
    He will be here.
    # Found is on this line
sed: regex references and alternatives
    $ cat repetition.txt
    abcabc
    djaejk
    asdhrj
    bbccdd
    xxsxxs
The references can also be used directly in the regular expression:
    $ cat repetition.txt | sed -E 's/^(.*)\1$/\1/'
    abc
    djaejk
    asdhrj
    bbccdd
    xxs
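Two common uses of numbered references, sketched on invented input: reordering captured groups in the replacement, and using a reference inside the regex itself to collapse a doubled word.

```shell
# Swap two words by referencing the groups in reverse order.
swapped=$(printf '%s\n' 'Kernighan Brian' | sed -E 's/([A-Za-z]+) ([A-Za-z]+)/\2 \1/')

# \1 inside the regex matches a repetition of whatever group 1 captured.
deduped=$(printf '%s\n' 'the the end' | sed -E 's/([a-z]+) \1/\1/')

printf '%s\n' "$swapped" "$deduped"
```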
sed: referencing the whole match
If we want to reference the whole match, we can use &.
Suppose we have the following text:
    $ cat text.txt
    sed is a Unix utility that transforms text.
    sed was developed from 1973 to 1974 by Lee E. McMahon of Bell Labs.
    sed was based on the scripting features of the interactive editor ed.
The sed command below will put all numbers into square brackets:
    $ cat text.txt | sed -E 's/[0-9]+/[&]/g'
    sed is a Unix utility that transforms text.
    sed was developed from [1973] to [1974] by Lee E. McMahon of Bell Labs.
    sed was based on the scripting features of the interactive editor ed.
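Because & is special in the replacement, a literal ampersand has to be escaped as \&. A minimal sketch on made-up input showing both behaviours:

```shell
# & stands for the whole match, here each phone-like number.
marked=$(printf '%s\n' 'call 555-1234 or 555-9876' | sed -E 's/[0-9]{3}-[0-9]{4}/<&>/g')

# \& inserts a literal ampersand instead of the match.
literal=$(printf '%s\n' 'salt and pepper' | sed 's/and/\&/')

printf '%s\n' "$marked" "$literal"
```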
sed: [addr]
Recall that sed commands have the following structure: [addr]X[options]
Let's discuss [addr] a bit.
sed "[cmd]" applies [cmd] on all lines
sed "5 [cmd]" applies [cmd] on line 5
sed "$ [cmd]" applies [cmd] on the last line
    $ cat tenses.txt | grep here
    I was there.
    He will be here.
    It is everywhere.
    $ cat tenses.txt | sed "2 s/here/home/"
    I was there.
    He will be home.
    It is everywhere.
    $ cat tenses.txt | sed "$ s/where/one/"
    I was there.
    He will be here.
    It is everyone.
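The address forms can be sketched without a file by feeding the three lines of tenses.txt through printf. Single quotes are used so the shell never tries to interpret the $ address. The range form addr1,addr2 on the last command is not covered above; it is included here as an assumption about what readers may want next, and it is standard sed.

```shell
tenses=$(printf '%s\n' 'I was there.' 'He will be here.' 'It is everywhere.')

second=$(printf '%s\n' "$tenses" | sed '2 s/here/home/')   # line 2 only
last=$(printf '%s\n' "$tenses" | sed '$ s/where/one/')     # last line only
range=$(printf '%s\n' "$tenses" | sed '1,2 s/\.$/!/')      # lines 1 through 2

printf '%s\n' "$second" '--' "$last" '--' "$range"
```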
sed: [addr] via regex
Regular expressions can also be used as an address.
sed "/was/ s/here/orn/": on lines matching was, replace here with orn
    $ cat tenses.txt
    I was there.
    He will be here.
    It is everywhere.
    $ cat tenses.txt | sed "/was/ s/here/orn/"
    I was torn.
    He will be here.
    It is everywhere.
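A regex address is handy when the same text appears on several lines but only some of them should change. In this sketch the settings lines are hypothetical; only the line that starts with port is edited.

```shell
# Hypothetical settings; "80" appears twice but only one line should change.
settings=$(printf '%s\n' 'host=example.org' 'port=80' 'backup_port=80')

result=$(printf '%s\n' "$settings" | sed '/^port/ s/80/8080/')

printf '%s\n' "$result"
```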
sed: other commands
sed "[addr] d" deletes the lines matching [addr]
sed "[addr] p" prints the lines matching [addr]
Note that the space between [addr] and the command is optional.
    $ cat text.txt
    sed is a Unix utility that transforms text.
    sed was developed from 1973 to 1974 by Lee E. McMahon of Bell Labs.
    sed was based on the scripting features of the interactive editor ed.
The following deletes the second line:
    $ cat text.txt | sed 2d
    sed is a Unix utility that transforms text.
    sed was based on the scripting features of the interactive editor ed.
sed: useful options
-i edits the file in place
    $ cat tenses.txt
    I was there.
    He will be here.
    It is everywhere.
    $ sed -i "/was/ s/here/orn/" tenses.txt
    $ cat tenses.txt
    I was torn.
    He will be here.
    It is everywhere.
-n suppresses the automatic printing of each line; it is usually combined with the p command
    # Print specific (third) line of a file
    $ sed -n 3p tenses.txt
    It is everywhere.
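Combining d, p and -n with a regex address gives grep-like filters. The sketch below reuses the three tenses.txt lines and also shows why p is normally paired with -n: without it, the matching line is printed twice.

```shell
tenses=$(printf '%s\n' 'I was there.' 'He will be here.' 'It is everywhere.')

kept=$(printf '%s\n' "$tenses" | sed '/will/d')        # drop matching lines
matched=$(printf '%s\n' "$tenses" | sed -n '/was/p')   # print only matching lines
doubled=$(printf '%s\n' "$tenses" | sed '/was/p')      # no -n: the match appears twice

printf '%s\n' "$kept" '--' "$matched" '--' "$doubled"
```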
sed: custom separator
sed is well known for its / separator (s/foo/bar/ has become somewhat commonplace).
But suppose we want to get rid of http:// in http://data.science.com.
Thankfully, basically any other character can be used as a separator, most commonly #:
    $ echo "http://data.science.com" | sed 's#http://##'
    data.science.com
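File paths are another case where a custom separator pays off. The sketch below, on a made-up path, shows that a | separator gives the same result as escaping every slash.

```shell
path='/usr/local/bin/tool'

# With / as the separator every slash in the pattern must be escaped.
escaped=$(printf '%s\n' "$path" | sed 's/\/usr\/local/\/opt/')

# With | as the separator the pattern can be written as-is.
custom=$(printf '%s\n' "$path" | sed 's|/usr/local|/opt|')

printf '%s\n' "$escaped" "$custom"
```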
awk
The simplest and most effective programming language you'll learn in 20 minutes
awk
A language that doesn't affect the way you think about programming, is not worth knowing.
-- Alan Perlis, Epigrams on Programming
The name is an acronym of its authors' surnames: Aho, Weinberger and Kernighan.
It follows the pattern-action paradigm.
    pattern1 { action1 }
    pattern2 { action2; action3 }
    ...
pattern: selects the lines the action applies to
action: what to do with each selected line
awk: quick example
    $ cat people.txt
    Amelia 555-5553 amelia.zodiacusque@gmail.com F
    Anthony 555-3412 anthony.asserturo@hotmail.com A
    Becky 555-7685 becky.algebrarum@gmail.com A
    Bill 555-1675 bill.drowning@hotmail.com A
    Broderick 555-0542 broderick.aliquotiens@yahoo.com R
    Camilla 555-2912 camilla.infusarum@skynet.be R
    Fabius 555-1234 fabius.undevicesimus@ucb.edu F
Show phone numbers only:
    $ cat people.txt | awk '{ print $2 }'
    555-5553
    555-3412
    555-7685
    555-1675
    555-0542
    555-2912
    555-1234
Show emails only:
    $ cat people.txt | awk '{ print $3 }'
    amelia.zodiacusque@gmail.com
    anthony.asserturo@hotmail.com
    becky.algebrarum@gmail.com
    bill.drowning@hotmail.com
    broderick.aliquotiens@yahoo.com
    camilla.infusarum@skynet.be
    fabius.undevicesimus@ucb.edu
awk: patterns
- empty: matches every line
- /[regex]/: matches the lines that match [regex]
- BEGIN: matches once, before the first line is read
- END: matches once, after the last line has been read
    $ cat awktext.txt
    AWK was created at Bell Labs in the 1970s.
    Its name is derived from the surnames of its authors.
    The acronym is pronounced the same as the bird auk.
    $ cat awktext.txt | awk '/is/'
    Its name is derived from the surnames of its authors.
    The acronym is pronounced the same as the bird auk.
Regexes allow us to use awk much like grep.
awk: patterns & pre-filled variables
Internally, awk works along two dimensions: lines (called records, i.e. rows) and "columns" (called fields).
- RS: the record separator, a newline (\n) by default
- FS: the field separator, a space (' ') by default, which splits on any run of whitespace; can be changed with the -F flag (e.g. awk -F:)
awk pre-fills quite a few other variables:
- NR: the number of records awk has already processed
- NF: the number of fields in the current record
Each field (column) has its own "special" variable:
- $1: the first field
- $N: the N-th field
- $0: the whole record (row or line)
    $ echo 'foo:123:bar:789' | awk -F: '{ print $3, $2, $0 }'
    bar 123 foo:123:bar:789
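NR and NF are easiest to understand by printing them for every line. The sketch below uses made-up input with a different number of fields on each line. It also uses $NF, the last field of the record, which follows from the $N rule above but is not spelled out there.

```shell
data=$(printf '%s\n' 'a b c' 'd e' 'f')

# NR counts records read so far; NF counts fields in the current record.
counts=$(printf '%s\n' "$data" | awk '{ print NR ": " NF " fields, last = " $NF }')

printf '%s\n' "$counts"
```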
awk: patterns & pre-filled variables II
Print everything from the third line onwards:
    $ cat people.txt | awk 'NR>2'
    Becky 555-7685 becky.algebrarum@gmail.com A
    Bill 555-1675 bill.drowning@hotmail.com A
    Broderick 555-0542 broderick.aliquotiens@yahoo.com R
    Camilla 555-2912 camilla.infusarum@skynet.be R
    Fabius 555-1234 fabius.undevicesimus@ucb.edu F
Print all names of friends (F in the last column):
    $ cat people.txt | awk '$4 == "F" {print $1}'
    Amelia
    Fabius
Print all phone numbers of relatives (R in the last column):
    $ cat people.txt | awk '$4 == "R" {print $2}'
    555-0542
    555-2912
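Patterns can also be combined. The following sketch runs on the same people.txt data (inlined so that it is self-contained) and uses && and || together with the ~ regex-match operator. None of these three operators is introduced above, so treat them as an assumption about a natural next step; all are standard awk.

```shell
people=$(cat <<'EOF'
Amelia 555-5553 amelia.zodiacusque@gmail.com F
Anthony 555-3412 anthony.asserturo@hotmail.com A
Becky 555-7685 becky.algebrarum@gmail.com A
Bill 555-1675 bill.drowning@hotmail.com A
Broderick 555-0542 broderick.aliquotiens@yahoo.com R
Camilla 555-2912 camilla.infusarum@skynet.be R
Fabius 555-1234 fabius.undevicesimus@ucb.edu F
EOF
)

# Acquaintances whose e-mail field matches the regex /gmail/.
gmail_acq=$(printf '%s\n' "$people" | awk '$4 == "A" && $3 ~ /gmail/ { print $1 }')

# Anyone after line 5, or any friend.
late_or_friend=$(printf '%s\n' "$people" | awk 'NR > 5 || $4 == "F" { print $1 }')

printf '%s\n' "$gmail_acq" '--' "$late_or_friend"
```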
awk: operators and variables
>, <, >=, <=, == and != work as you'd expect them to
    $ ls *.txt -l
    -rw-rw-r--. 1 mrshu mrshu 149 Nov 14 23:18 awktext.txt
    -rw-r--r--. 1 mrshu mrshu 0 Nov 2 11:29 newfile.txt
    -rw-rw-r--. 1 mrshu mrshu 420 Nov 18 21:18 people.txt
    -rw-rw-r--. 1 mrshu mrshu 35 Nov 14 13:56 repetition.txt
    -rw-rw-r--. 1 mrshu mrshu 59 Nov 14 16:52 tenses_new.txt
    -rw-rw-r--. 1 mrshu mrshu 48 Nov 14 15:57 tenses.txt
    -rw-rw-r--. 1 mrshu mrshu 182 Nov 14 12:18 text.txt
Sum the size of all files of at least 100 bytes:
    $ ls *.txt -l | awk '$5 >= 100 {sum += $5} END { print sum }'
    751
What's the average file size (rounded to two decimal places)?
    $ ls *.txt -l | awk '{sum += $5} END { printf "avg=%.2f\n", sum/NR }'
    avg=127.57
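The same accumulate-then-report shape works for other aggregates. As a sketch, the name and size columns from the listing above are inlined as two-column input (an assumption made to keep the example independent of ls), and awk finds the largest file and counts the files of at least 100 bytes.

```shell
sizes=$(cat <<'EOF'
awktext.txt 149
newfile.txt 0
people.txt 420
repetition.txt 35
tenses_new.txt 59
tenses.txt 48
text.txt 182
EOF
)

# Track the running maximum; BEGIN sets a starting value below any real size.
largest=$(printf '%s\n' "$sizes" | awk 'BEGIN { max = -1 } $2 > max { max = $2; name = $1 } END { print name, max }')

# Count the records that satisfy the pattern.
big_count=$(printf '%s\n' "$sizes" | awk '$2 >= 100 { n++ } END { print n }')

printf '%s\n' "$largest" "$big_count"
```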
awk: operators and variables II
Increment (++, +=) and decrement (--, -=) operators work out of the box
    $ cat people.txt
    Amelia 555-5553 amelia.zodiacusque@gmail.com F
    Anthony 555-3412 anthony.asserturo@hotmail.com A
    Becky 555-7685 becky.algebrarum@gmail.com A
    Bill 555-1675 bill.drowning@hotmail.com A
    Broderick 555-0542 broderick.aliquotiens@yahoo.com R
    Camilla 555-2912 camilla.infusarum@skynet.be R
    Fabius 555-1234 fabius.undevicesimus@ucb.edu F
How many acquaintances (A) and relatives (R) do we have in our dataset?
    $ cat people.txt | awk '{ p[$4]++ } END { print "A:", p["A"], "| R:", p["R"] }'
    A: 3 | R: 2
awk: control statements
Control statements (if/else, while, for, break, continue) work as you would expect them to, with C/Python-like syntax
    $ cat people.txt
    Amelia 555-5553 amelia.zodiacusque@gmail.com F
    Anthony 555-3412 anthony.asserturo@hotmail.com A
    Becky 555-7685 becky.algebrarum@gmail.com A
    Bill 555-1675 bill.drowning@hotmail.com A
    Broderick 555-0542 broderick.aliquotiens@yahoo.com R
    Camilla 555-2912 camilla.infusarum@skynet.be R
    Fabius 555-1234 fabius.undevicesimus@ucb.edu F
How many acquaintances (A), friends (F) and relatives (R) do we have in our dataset?
    $ cat people.txt | awk '{ p[$4]++ } END { for(i in p) print i, ":", p[i] }'
    A : 3
    R : 2
    F : 2
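The order in which for (i in p) visits the keys is not specified by awk, which is why the output above is not alphabetical. A sketch that pipes the result through sort to make it deterministic, plus an if/else inside an action, on the same people.txt data:

```shell
people=$(cat <<'EOF'
Amelia 555-5553 amelia.zodiacusque@gmail.com F
Anthony 555-3412 anthony.asserturo@hotmail.com A
Becky 555-7685 becky.algebrarum@gmail.com A
Bill 555-1675 bill.drowning@hotmail.com A
Broderick 555-0542 broderick.aliquotiens@yahoo.com R
Camilla 555-2912 camilla.infusarum@skynet.be R
Fabius 555-1234 fabius.undevicesimus@ucb.edu F
EOF
)

# Count per category, then sort because the for-in order is unspecified.
by_category=$(printf '%s\n' "$people" | awk '{ p[$4]++ } END { for (i in p) print i, p[i] }' | sort)

# if/else inside an action: friends versus everyone else.
split_counts=$(printf '%s\n' "$people" | awk '{ if ($4 == "F") f++; else other++ } END { print f, other }')

printf '%s\n' "$by_category" "$split_counts"
```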
awk: actions (built-in functions)
print: prints its arguments
    # awk concatenates strings automatically
    # this basically generates a CSV
    $ cat people.txt | awk '{ print $1 "," $2 "," $4 }'
    Amelia,555-5553,F
    Anthony,555-3412,A
    Becky,555-7685,A
    Bill,555-1675,A
    Broderick,555-0542,R
    Camilla,555-2912,R
    Fabius,555-1234,F
printf "[formatstr]", variable: prints variable according to [formatstr]
[formatstr] can contain
- %s: string
- %d: integer
- %f: float
awk: actions (built-in functions) II
- length(s): returns the length of s
- tolower(s): returns s converted to lowercase
- toupper(s): returns s converted to uppercase
- gsub(r, s, t): replaces every match of the regex r with the substitution s in the t string ($0 if not provided)
- system(c): executes the shell command c
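A short sketch of a few of these functions working together, on two lines taken from people.txt (only the first two columns are used):

```shell
contacts=$(printf '%s\n' 'Amelia 555-5553' 'Bill 555-1675')

# toupper and length feed a printf format string.
shout=$(printf '%s\n' "$contacts" | awk '{ printf "%s has %d chars\n", toupper($1), length($1) }')

# gsub edits the second field in place, removing every dash.
digits=$(printf '%s\n' "$contacts" | awk '{ gsub(/-/, "", $2); print $1, $2 }')

printf '%s\n' "$shout" "$digits"
```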
awk: sample implementation
AWK's secret weapon is the pattern-action paradigm:
    pattern1 { action1 }
    pattern2 { action2; action3 }
    ...
It allows not just for short (at most 2 lines) and simple-yet-powerful programs but also for a simple implementation:
    for line in file.readlines():
        for pattern, actions in patterns_actions:
            if pattern.match(line):
                eval(actions)
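The nested loop can be observed from the outside: every line is tested against every rule, in order, so one line may trigger several actions. A sketch with two rules and made-up input:

```shell
# "ab" matches both rules, "b" only the second, "c" neither.
rules=$(printf '%s\n' 'ab' 'b' 'c' | awk '/a/ { print "has a:", $0 } /b/ { print "has b:", $0 }')

printf '%s\n' "$rules"
```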
wget
wget [URL] -O [filename] downloads [URL] to [filename]
Using - as [filename] makes the output go to standard output
    $ wget uniba.sk
    --2020-11-18 22:56:41--  http://uniba.sk/
    Resolving uniba.sk (uniba.sk)... 158.195.6.138
    Connecting to uniba.sk (uniba.sk)|158.195.6.138|:80... connected.
    HTTP request sent, awaiting response... 301 Moved Permanently
    Location: https://uniba.sk/ [following]
    --2020-11-18 22:56:41--  https://uniba.sk/
    Connecting to uniba.sk (uniba.sk)|158.195.6.138|:443... connected.
    HTTP request sent, awaiting response... 200 OK
    Length: unspecified [text/html]
    Saving to: ‘index.html’
    index.html    [ <=> ]  37.07K  --.-KB/s    in 0.04s
    2020-11-18 22:56:41 (869 KB/s) - ‘index.html’ saved [37964]
-q makes the output "quiet" (doesn't print extended info)
curl
"cat URL" is a great mnemonic
curl outputs the file it reads from the network to stdout by default
    $ curl uniba.sk
    <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
    <html><head>
    <title>301 Moved Permanently</title>
    </head><body>
    <h1>Moved Permanently</h1>
    <p>The document has moved <a href="https://uniba.sk/">here</a>.</p>
    <hr>
    <address>Apache/2.2.22 (Debian) Server at uniba.sk Port 80</address>
    </body></html>
curl -o [filename] [url] saves [url] to [filename] (also works with)
Redirecting stdout to a file does the same thing
    $ curl uniba.sk > index.html
      % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                     Dload  Upload   Total   Spent    Left  Speed
    100   299  100   299    0     0   4462      0 --:--:-- --:--:-- --:--:--  4462
-s makes the output "silent" (doesn't print extended info)
wget vs curl
Much of their functionality is the same. There are a few important differences, though.
wget:
- is a bit older and available on more devices (due to being part of GNU)
- is capable of doing recursive downloads (as in "save all you find on this URL to disk")
- can be found on busybox (albeit as a stripped-down clone)
- can be typed in using only the left hand on a qwerty keyboard!
curl:
- works much better with pipes and Unix scripts in general
- has upload capabilities
- supports more protocols (even ones like TELNET, IMAP or SMTP)
- comes pre-installed on macOS and Windows 10 (!)
diff
Show differences between two files, line by line.
    $ cat tenses.txt
    I was there.
    He will be here.
    It is everywhere.
    $ cat tenses_new.txt
    I was where you were not.
    He will be here.
    It is in there.
    $ diff tenses.txt tenses_new.txt
    1c1
    < I was there.
    ---
    > I was where you were not.
    3c3
    < It is everywhere.
    ---
    > It is in there.
If you'd like to see the diff side-by-side, you can use diff -y or (even) vimdiff.
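In scripts, the exit status of diff is often more useful than its output: 0 means the files are identical and 1 means they differ. The sketch below recreates the two files in temporary locations. It assumes mktemp is available, which is common but not guaranteed everywhere.

```shell
# Temporary copies of tenses.txt and tenses_new.txt (mktemp is assumed to exist).
old=$(mktemp)
new=$(mktemp)
printf '%s\n' 'I was there.' 'He will be here.' 'It is everywhere.' > "$old"
printf '%s\n' 'I was where you were not.' 'He will be here.' 'It is in there.' > "$new"

# Capture the exit status without letting a non-zero value abort the script.
status=0
changes=$(diff "$old" "$new") || status=$?

same_status=0
diff "$old" "$old" > /dev/null || same_status=$?

printf '%s\n' "$changes"
printf 'differ: %s, identical: %s\n' "$status" "$same_status"
rm -f "$old" "$new"
```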