Standard Input/Output and Intro to Text Processing

From terminals, via pipes, to processing actual text

Marek Šuppa
Ondrej Jariabka
Adrián Matejov

Why UNIX-like for Data Science?

It teaches you the Unix philosophy, which is to

  • Write programs that do one thing and do it well

  • Write programs to work together

  • Write programs to handle text streams, because that is a universal interface

-- Doug McIlroy, creator of Unix pipes

Or in other words

KISS: Keep It Simple, Stupid

Input / Output in Computing History

  • Standardizing input/output was a major breakthrough of UNIX

  • Unlike in previous systems, the input/output devices were abstracted away

  • In addition, programmers needed to do absolutely nothing to have standard input/output set up for their programs

Input / Output Abstracted

The standard streams for input, output and error

  • Standard error was added to Unix in the 1970s after several wasted phototypesetting runs ended with error messages being typeset instead of displayed on the user's terminal

Standard Input Output

By default, any process (command/application) has access to:

  • stdin (0): standard input (keyboard)

  • stdout (1): standard output (terminal)

  • stderr (2): standard error (terminal)

Also referred to as "standard I/O streams".

From the point of view of the process, these are files like any other.

Represented by "file descriptors" (IDs) 0, 1 and 2.

By default, they are all connected to the terminal.
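
The claim that descriptors 0, 1 and 2 start out connected to the terminal can be checked from the shell itself. The following is a minimal sketch; `describe_fd` is a hypothetical helper name introduced here for illustration, while `[ -t N ]` is the standard test for "descriptor N refers to a terminal".

```shell
# describe_fd is a hypothetical helper, not a standard command.
describe_fd() {
    # [ -t N ] succeeds when file descriptor N refers to a terminal.
    if [ -t "$1" ]; then
        echo "fd $1: terminal"
    else
        echo "fd $1: not a terminal"
    fi
}

# Check stdin (0), stdout (1) and stderr (2) of the current shell.
for fd in 0 1 2; do
    describe_fd "$fd"
done
```

Run interactively, all three descriptors should be reported as terminals. As soon as one of them is connected to a file or another process, the corresponding line changes to "not a terminal".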

Useful Stream I/O commands

  • echo string
    • Outputs all of its (string) arguments to stdout
$ echo Hello
Hello
$ echo "Hi there"
Hi there
  • cat FILE
    • Outputs the contents of FILE to stdout
    • When no FILE is specified (or when FILE is -), read stdin
$ cat text.txt
This is a sample text from the text.txt file.
$ cat
Hi
Hi
there!
there!

Stream Redirection: Output

  • command > file.txt
    • the standard output of command will be redirected to file.txt
      $ echo Hello > output.txt
      $ cat output.txt
      Hello
      $ cat output.txt > file.txt
      $ cat file.txt
      Hello

  • command >> file.txt
    • the standard output of command will be appended to file.txt
      $ echo Hi > output.txt
      $ echo there! >> output.txt
      $ cat output.txt
      Hi
      there!

Note that > overwrites the contents of the file -- anything that was in it will be removed.

Stream Redirection: Input

  • command < file.txt
    • the input to command will come from file.txt
$ echo Hello > file.txt
$ cat < file.txt
Hello
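
One way to see the difference between passing a file name and redirecting input is to look at what the command knows about its source. This is a small sketch using `wc -l`; the file name and its contents are made up for the example.

```shell
workdir=$(mktemp -d)
printf 'one\ntwo\nthree\n' > "$workdir/file.txt"

# As an argument: wc opens the file itself, so it can print the file name.
wc -l "$workdir/file.txt"

# Via redirection: the shell opens the file as stdin, wc never sees a name.
line_count=$(wc -l < "$workdir/file.txt")
echo "lines: $line_count"

rm -r "$workdir"
```

In the second call `wc` only prints the count, because from its point of view the data simply arrived on standard input.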

Stream Redirection: (Error) Output

  • Standard I/O can also be referenced via numbers

    • 0 for in, 1 for out, 2 for err
  • For instance, the following sends the error output to /dev/null

$ pip install somepackage 2> /dev/null

  • It is possible to combine multiple redirections in the same command

  • The following sends stdout to output.log and stderr to error.log

$ pip install unknownpkg 1> output.log 2> error.log
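
The `pip` example depends on the network and on what is installed, so here is a self-contained sketch of the same idea. `noisy` is a hypothetical stand-in for any command that writes to both streams.

```shell
workdir=$(mktemp -d)

# Hypothetical command: reports progress on stdout and a problem on stderr.
noisy() {
    echo "step 1 done"
    echo "warning: something looks off" >&2
}

# Send each stream to its own file.
noisy 1> "$workdir/output.log" 2> "$workdir/error.log"

out_text=$(cat "$workdir/output.log")
err_text=$(cat "$workdir/error.log")
echo "output.log: $out_text"
echo "error.log:  $err_text"

rm -r "$workdir"
```

Each log file ends up with exactly one of the two lines, which shows that the two streams are independent even though both normally appear on the same terminal.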

Stream Redirection: (Error) Output II

  • stdout and stderr can be combined into the same file using &>
$ pip install somepkg &> errout.txt
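
Note that `&>` is an extension of shells such as bash and zsh. In a plain POSIX `sh` the same effect is usually written as `> file 2>&1`, and the order of the two redirections matters. The sketch below illustrates this with a hypothetical `noisy` function that writes one line to each stream.

```shell
workdir=$(mktemp -d)

# Hypothetical command that writes one line to each stream.
noisy() { echo "to stdout"; echo "to stderr" >&2; }

# Portable spelling of "&> both.txt": redirect stdout, then point stderr at it.
noisy > "$workdir/both.txt" 2>&1
both_lines=$(wc -l < "$workdir/both.txt")

# Reversed order: stderr is pointed at the *old* stdout first,
# so only stdout ends up in the file and stderr escapes into the capture.
leaked=$(noisy 2>&1 > "$workdir/only_out.txt")
only_out=$(cat "$workdir/only_out.txt")

echo "lines in both.txt: $both_lines"
echo "only_out.txt: $only_out"
echo "escaped: $leaked"

rm -r "$workdir"
```

Redirections are processed left to right, which is why `2>&1 > file` does not combine the two streams in the file.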

Pipes

Perhaps the single most striking invention in Unix (source)

  • A simple way of forwarding the stdout from one command to the stdin of another one.

  • The pipe is denoted | and is written between the two commands:

$ command1 | command2
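
To make this concrete, the sketch below compares a pipe with the same computation done through an intermediate file. The sample words are made up for the example.

```shell
workdir=$(mktemp -d)

# Without a pipe: store the intermediate result in a file, then read it back.
printf 'pear\napple\nfig\n' > "$workdir/tmp.txt"
via_file=$(sort < "$workdir/tmp.txt")

# With a pipe: stdout of printf becomes stdin of sort, no file needed.
via_pipe=$(printf 'pear\napple\nfig\n' | sort)

echo "$via_pipe"

rm -r "$workdir"
```

Both variants produce the same sorted list. The pipe simply removes the need to name and clean up the intermediate file.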

Pipes II

  • Similarly to &>, there exists a way of sending both the stdout and stderr to the stdin of another process

  • This is done via the |& shortcut (not used that much in practice though)

$ command1 |& command2
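
Like `&>`, the `|&` shortcut is a bash/zsh extension. The portable spelling is `2>&1 |`. The following sketch uses a hypothetical `noisy` function to show what a plain pipe carries and what the merged form carries.

```shell
# Hypothetical command that writes one line to each stream.
noisy() { echo "to stdout"; echo "to stderr" >&2; }

# A plain pipe carries only stdout; "to stderr" bypasses wc
# and still shows up on the terminal.
plain=$(noisy | wc -l)

# "2>&1 |" is the portable spelling of "|&": both streams reach wc.
merged=$(noisy 2>&1 | wc -l)

echo "plain pipe:  $plain line(s)"
echo "merged pipe: $merged line(s)"
```

The first count covers only the stdout line, the second covers both lines.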

Text processing utilities

head and tail

  • head

    • outputs the first (by default 10) lines of the file
    • -n k only outputs the first k lines
    • -n -k outputs all but the last k lines
  • tail

    • outputs the last (by default 10) lines of the file
    • -n k only outputs the last k lines
    • -n +k outputs everything from line k onwards
    • -f keeps running and prints new lines as they are appended to the file
$ cat file.txt
1 Adam
2 Beatrice
3 Cynthia
4 David
5 Emma
$ head -n 3 file.txt
1 Adam
2 Beatrice
3 Cynthia
$ head -n -3 file.txt
1 Adam
2 Beatrice
$ tail -n 3 file.txt
3 Cynthia
4 David
5 Emma
$ tail -n +4 file.txt
4 David
5 Emma
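
Combining the two commands with a pipe gives a simple way to cut a range of lines out of a file. The sketch below recreates the sample file from above and extracts lines 2 to 4.

```shell
workdir=$(mktemp -d)
printf '1 Adam\n2 Beatrice\n3 Cynthia\n4 David\n5 Emma\n' > "$workdir/file.txt"

# Lines 2 to 4: keep the first 4 lines, then drop everything before line 2.
middle=$(head -n 4 "$workdir/file.txt" | tail -n +2)
echo "$middle"

rm -r "$workdir"
```

This prints the lines for Beatrice, Cynthia and David.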

sort

  • sorts the input (file or stream of characters)

  • -r prints the sorted lines in reverse

  • -n makes the sorting numeric

  • -k m sort by column m (set sort key)

$ cat file.txt
02 Beatrice
5 Emma
-4 David
3 Cynthia
13 Adam
$ sort file.txt
02 Beatrice
13 Adam
3 Cynthia
-4 David
5 Emma
$ sort -r file.txt
5 Emma
-4 David
3 Cynthia
13 Adam
02 Beatrice
$ sort -n file.txt
-4 David
02 Beatrice
3 Cynthia
5 Emma
13 Adam
$ sort -k 2 file.txt
13 Adam
02 Beatrice
3 Cynthia
-4 David
5 Emma
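
The result of a plain `sort` depends on the locale. The output shown for `sort file.txt` appears to come from a locale whose collation rules ignore the minus sign. The sketch below, which assumes the same sample file, forces byte-wise comparison with `LC_ALL=C` and also shows a numeric sort restricted to the first column.

```shell
workdir=$(mktemp -d)
printf '02 Beatrice\n5 Emma\n-4 David\n3 Cynthia\n13 Adam\n' > "$workdir/file.txt"

# Byte-wise comparison: "-" (ASCII 45) sorts before every digit (ASCII 48-57).
bytewise=$(LC_ALL=C sort "$workdir/file.txt")
echo "$bytewise"

# Numeric comparison on the first column only (key starts and ends at field 1).
numeric=$(LC_ALL=C sort -k 1,1n "$workdir/file.txt")
echo "$numeric"

rm -r "$workdir"
```

With `LC_ALL=C` the line `-4 David` moves to the top of the plain sort. Setting the locale explicitly is a common way to make sorting reproducible across machines.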

uniq

  • removes adjacent duplicate lines

    • so the input usually needs to be sorted first (any order that puts identical lines next to each other works)
  • -d only outputs duplicates

  • -u only outputs unique lines (those without duplicates)

  • -i makes the comparison case-insensitive

  • -c prints out the number of repetitions

$ cat file.txt
Amy
Amy
Bob
Carol
Carol
Carol
$ uniq file.txt
Amy
Bob
Carol
$ uniq -d file.txt
Amy
Carol
$ uniq -u file.txt
Bob
$ uniq -c file.txt
2 Amy
1 Bob
3 Carol
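
The requirement that duplicates sit next to each other is easy to miss. A small sketch with made-up names shows what happens when the input is not sorted.

```shell
# Unsorted input: the two "Amy" lines are not adjacent.
names=$(printf 'Amy\nBob\nAmy\n')

# uniq alone only compares neighbouring lines, so nothing is removed.
unsorted_result=$(printf '%s\n' "$names" | uniq)

# Sorting first brings the duplicates together.
sorted_result=$(printf '%s\n' "$names" | sort | uniq)

echo "uniq only:"
echo "$unsorted_result"
echo "sort | uniq:"
echo "$sorted_result"
```

Without `sort`, all three lines survive. With it, only `Amy` and `Bob` remain.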

sort + uniq == strong combo

  • Suppose we are given the following task:
    • Input: A list of users who tried to log in with the wrong password. Each line corresponds to one attempt.
    • Output: A list of users sorted by their number of unsuccessful login attempts, from most to least.
$ cat attempts.txt
joe
lena
joe
joe
garfield
lena
woe
garfield
$ cat attempts.txt | sort
garfield
garfield
joe
joe
joe
lena
lena
woe
$ cat attempts.txt | sort | uniq -c
2 garfield
3 joe
2 lena
1 woe
$ cat attempts.txt | sort | uniq -c | sort
1 woe
2 garfield
2 lena
3 joe
$ cat attempts.txt | sort | uniq -c | sort -r
3 joe
2 lena
2 garfield
1 woe
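
The final `sort -r` compares the lines as text, which depends on how `uniq -c` pads the counts. A slightly more robust variant, sketched below as a suggestion rather than as part of the original solution, compares the counts numerically with `-n`.

```shell
workdir=$(mktemp -d)
printf 'joe\nlena\njoe\njoe\ngarfield\nlena\nwoe\ngarfield\n' > "$workdir/attempts.txt"

# -n compares the counts as numbers, so 10 ranks above 9 regardless of padding.
ranking=$(sort "$workdir/attempts.txt" | uniq -c | sort -rn)
echo "$ranking"

# The user with the most failed attempts is on the first line.
top_line=$(echo "$ranking" | head -n 1)
echo "top: $top_line"

rm -r "$workdir"
```

Here `sort` reads `attempts.txt` directly instead of going through `cat`, which behaves the same with one process fewer.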

tr

  • the standard usage is tr [SET1] [SET2]

  • translate from one set of characters ([SET1]) to the other ([SET2])

  • the sets (which need to be of the same size) can be defined as

    • explicit list of characters (abcdef)
    • implicit list of characters (a-z, A-Z or 0-9)
    • named groups like [:digit:] (all digits), [:alpha:] (all letters) or [:lower:] and [:upper:] (lower and upper case letters)
$ echo "Hi There" | tr abcdef ABCDEF
Hi ThErE
$ echo "Hi There" | tr a-z A-Z
HI THERE
$ echo "Hi There" | tr '[:upper:]' '[:lower:]'
hi there
  • -d removes the characters specified in [SET1]
$ echo "Hi There" | tr -d abcdef
Hi Thr
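
`tr` fits naturally in front of the sort and uniq combination, for example to count names regardless of capitalization. The sketch below uses made-up input. The character classes are quoted so that the shell does not try to expand them as file name patterns.

```shell
# Normalize case first so that "Joe" and "joe" count as the same user.
counts=$(printf 'Joe\njoe\nLENA\nlena\njoe\n' \
    | tr '[:upper:]' '[:lower:]' \
    | sort \
    | uniq -c \
    | sort -rn)
echo "$counts"

# -d with a named group: strip all digits from a string.
no_digits=$(echo "Room 101" | tr -d '[:digit:]')
echo "$no_digits"
```

The first pipeline reports three attempts for `joe` and two for `lena`. The second command leaves `Room ` with its trailing space.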

Debugging Pipes: per partes

  • Pipes were first implemented as (temporary) files
  • When working on a long command with a ton of pipes, it's the best debugging option available
$ cat attempts.txt | sort > file1.txt
$ cat file1.txt
[ ... output omitted ... ]
$ cat file1.txt | uniq -c > file2.txt
$ cat file2.txt
[ ... output omitted ... ]
$ cat file2.txt | sort -r > file3.txt
$ cat file3.txt
[ ... output omitted ... ]
  • Once we see the commands work part-by-part (per partes), we can put it all together
$ cat attempts.txt | sort | uniq -c | sort -r
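
A possible alternative to splitting the pipeline by hand, not taken from the slides, is the standard `tee` utility, which copies its input to a file while passing it on unchanged. The sketch below assumes a small `attempts.txt` and hypothetical stage file names.

```shell
workdir=$(mktemp -d)
printf 'joe\nlena\njoe\n' > "$workdir/attempts.txt"

# tee writes a copy of each intermediate stage to a file
# while the data keeps flowing through the pipeline.
final=$(sort "$workdir/attempts.txt" \
    | tee "$workdir/stage1.txt" \
    | uniq -c \
    | tee "$workdir/stage2.txt" \
    | sort -rn)

stage1=$(cat "$workdir/stage1.txt")
echo "after sort:"
echo "$stage1"
echo "final:"
echo "$final"

rm -r "$workdir"
```

The stage files can be inspected afterwards just like `file1.txt` and `file2.txt` above, while the full pipeline still runs in one go.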

This is the massive advantage of having a powerful shell:

  • you are in an environment you know
  • you can interactively try stuff out (cobble it together) if you want
  • you very quickly see what works and what does not
