Finding content on the disk

With grep, find and xargs

Marek Šuppa
Ondrej Jariabka
Adrián Matejov

Why UNIX for Data Science?

  • Why so many commands? Isn't GUI nicer, faster and overall better?

  • The answer is yes, up to a point: a GUI is nicer and faster for discovering the available options, but for pure execution the CLI tends to be unmatched in speed

  • In other words, by using a mouse you don't build muscle memory

  • Great tools for your toolbox

  • Parallel processing

"The clumsiness of people who have to engage their brain at every step is unbearably painful to watch, at least to me, and that's what the novice-friendly software makes people do, because there's no elegance in them, it's just a mass of features to be learned by rote... it's sadly obvious that we are moving into a way of working that is predominantly conscious, for which I believe the human brain was never prepared. we no longer have the time to let skills sink into the autonomous nervous system, as it were, and even if we try, the criminal in Redmond, WA, has a new, incompatible version out by the time we learned the last version... One of the joys of learning to ride a bicycle is to stop thinking about it -- the feeling that I had successfully programmed my body to master a bicycle at least thrilled me as a kid (except I didn't know the verb "to program")... we need to communicate to users that learning to use Emacs is like learning to ride a bicycle -- it does take some time and effort, it's a worth-while skill to have, and then you never forget. I firmly believe that the novice-friendly software is like giving people several sets of supporting wheels so they won't tilt, but could get moving right away, and then never taking them off, preferring that they keep using them and moving so slowly that they always need them. of course, if you argue that they should remove the supporting wheels to such users, they will, admittedly correctly, argue that they will fall splat on the ground and ruin their three-piece suits. clearly a no-go."

http://groups.google.com/group/comp.emacs/msg/821a0f04bab91864?dmode=source&output=gplain

https://news.ycombinator.com/item?id=2657135

A tale of grep

Suppose you would like to find a specific line (say "guide to the galaxy") that you know is saved in some file on the disk.

How would you go about doing that?

Well, turns out grep can be of help!

Up until now we used it in the following way

$ grep [regex] [file]

But it can also be applied recursively on directories and their content.

grep with recursion

grep -R

  • read all files under each directory, recursively, and follow symbolic links
  • starts from the current working directory by default
$ tree
.
├── a
│   ├── file1.txt
│   └── file2.txt
├── b
│   └── file40.txt
├── c
│   └── file3.txt
├── d
└── g
    └── book.txt

5 directories, 5 files
$ grep -Ri 'guide to the galaxy'
g/book.txt:The Hitchiker's Guide to the Galaxy
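
The recursive search can be tried on a throwaway tree. This is only a minimal sketch: it assumes GNU grep, the directory layout and file contents are made up for illustration, and the starting point . is passed explicitly.

```shell
# Hypothetical sandbox: two directories, one file containing the phrase.
workdir=$(mktemp -d)
mkdir -p "$workdir/a" "$workdir/g"
printf 'nothing to see here\n' > "$workdir/a/file1.txt"
printf "The Hitchhiker's Guide to the Galaxy\n" > "$workdir/g/book.txt"

# -R descends into every directory, -i makes the match case-insensitive.
matches=$(cd "$workdir" && grep -Ri 'guide to the galaxy' .)
echo "$matches"
```

Each hit is printed as path:matching line, which is what makes the recursive mode useful for locating the file.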

grep's Extended Regular Expressions

  • grep uses so-called "basic regular expressions" (BRE) by default.

  • What we normally consider "regular expressions" are actually "extended regular expressions" (ERE).

  • The main difference is which metacharacters need a backslash:

    • BRE: \?, \+, \{, \} (\? and \+ are GNU extensions to BRE)
    • ERE: ?, +, {, }

EREs can be turned on in grep by passing the -E option.


Sample: find all Uniba login IDs in the current directory:

$ grep -Rn -E '[a-z]+[0-9]{1,3}'
l/access.log:11:Unauthorized access attmept from login ID novak123
l/access.log:18:Successfuly authorized roy47
z/login.py:4:log_in(username='rob5')
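
To see the backslashing difference side by side, the same pattern can be written in both dialects. A small sketch, assuming GNU grep (the \+ in the BRE form is a GNU extension); the sample line is invented and -o, which prints only the matched part, is used here just to make the result easy to compare.

```shell
sample='login novak123 ok'

# BRE (default): + and the interval braces must be backslashed.
bre=$(printf '%s\n' "$sample" | grep -o '[a-z]\+[0-9]\{1,3\}')

# ERE (-E): the same pattern with bare + and { }.
ere=$(printf '%s\n' "$sample" | grep -o -E '[a-z]+[0-9]{1,3}')

echo "BRE: $bre"
echo "ERE: $ere"
```

Both commands should extract the same login ID; only the escaping differs.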

But what if we only wanted to search in, say, .py files?

find

One-stop solution for figuring out what is where

find

  • A command for locating (finding) files and directories
  • It looks through all the subdirectories recursively
  • If no starting-point is specified it starts from .
$ tree
.
├── a
│   ├── file1.txt
│   └── file2.txt
├── b
│   └── file40.txt
├── c
│   └── file3.txt
└── d

4 directories, 4 files
$ find
.
./d
./c
./c/file3.txt
./b
./b/file40.txt
./a
./a/file2.txt
./a/file1.txt
$ find .
.
./d
./c
./c/file3.txt
./b
./b/file40.txt
./a
./a/file2.txt
./a/file1.txt

find: looking up filenames

Can be done with the -name option / flag

find [path] -name [pattern]

  • [pattern] supports the following wildcards:
    • *: matches any string (of any length)
    • ?: matches a single character
    • [ ]: matches a single character from the specified character class
$ find .
.
./d
./c
./c/file3.txt
./b
./b/file40.txt
./a
./a/file2.txt
./a/file1.txt
$ find . -name "*.txt"
./c/file3.txt
./b/file40.txt
./a/file2.txt
./a/file1.txt
$ find . -name "file?.txt"
./c/file3.txt
./a/file2.txt
./a/file1.txt
$ find . -name 'file[2-4].txt'
./c/file3.txt
./a/file2.txt
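
The wildcard examples above can be reproduced in a scratch directory. This is a sketch with made-up file names; the patterns are quoted so that the shell passes them to find unexpanded, and sort is added only because find does not guarantee any output order.

```shell
# Hypothetical sandbox with three .txt files and one other file.
workdir=$(mktemp -d)
mkdir -p "$workdir/a" "$workdir/b"
touch "$workdir/a/file1.txt" "$workdir/a/file2.txt" \
      "$workdir/b/file40.txt" "$workdir/b/notes.md"

# * matches any string, ? matches exactly one character.
txt_files=$(cd "$workdir" && find . -name '*.txt' | sort)
single_digit=$(cd "$workdir" && find . -name 'file?.txt' | sort)

echo "$txt_files"
echo "$single_digit"
```

file40.txt is missing from the second result because ? stands for exactly one character.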

find: looking up paths

For matching full paths, -path is a good choice

find [path] -path [pattern]

  • [pattern] is applied on the whole path, not just the name
  • it once again supports wildcards
$ find .
.
./d
./c
./c/file3.txt
./b
./b/file40.txt
./a
./a/file2.txt
./a/file1.txt
$ find . -path '*a*'
./a
./a/file2.txt
./a/file1.txt
$ find . -path './?'
./d
./c
./b
./a
$ find . -path '*c/file[0-9].txt'
./c/file3.txt

The pattern needs to match the whole path (this example does not):

$ find . -path '*c/file[0-9]'

find: looking up via regex

If wildcards do not suffice, we can also utilize the "full power of regexes"

find [path] -regex [pattern]

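  • [pattern] is a regular expression matched against the whole path; GNU find uses Emacs-style syntax by default, and -regextype selects a different dialect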
$ find .
.
./d
./c
./c/file3.txt
./b
./b/file40.txt
./a
./a/file2.txt
./a/file1.txt
$ find . -regex '.*file.*'
./c/file3.txt
./b/file40.txt
./a/file2.txt
./a/file1.txt
$ find . -regex '.*[bc]/file.*'
./c/file3.txt
./b/file40.txt

The regex needs to match the whole path, so the following examples match nothing:

$ find . -regex 'file'
$ find . -regex '[bc]/file.*'
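
The whole-path rule is easy to check directly. A minimal sketch with invented file names, assuming a find that supports -regex (GNU and BSD find do):

```shell
# Hypothetical sandbox with one file in each of two directories.
workdir=$(mktemp -d)
mkdir -p "$workdir/b" "$workdir/c"
touch "$workdir/b/file40.txt" "$workdir/c/file3.txt"

# 'file' alone never equals a whole path such as ./c/file3.txt.
partial=$(cd "$workdir" && find . -regex 'file' | sort)

# Adding .* on both sides lets the expression cover the whole path.
whole=$(cd "$workdir" && find . -regex '.*/file.*' | sort)

echo "partial: [$partial]"
echo "$whole"
```

The first search comes back empty, the second lists both files.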

find: looking up via attributes

Files and directories have various attributes find can take a look at:

  • type

    • directory, file, symlink, ...
  • timestamps

    • last status change, e.g. of permissions or ownership (-ctime)
    • last access (-atime)
    • last modification (-mtime)
  • file size

  • owner and group

  • permissions

find: looking up by file/dir type

find [path] -type [type]

  • [type] can be one of the following
    • f: "normal" file
    • d: directory
    • b/c: block/character device
    • p: named pipe
    • l: symlink
    • s: socket
$ find .
.
./d
./c
./c/file3.txt
./b
./b/file40.txt
./a
./a/file2.txt
./a/file1.txt
$ find . -type d
.
./d
./c
./b
./a
$ find . -type f
./c/file3.txt
./b/file40.txt
./a/file2.txt
./a/file1.txt

find: looking up by timestamps

find [path] -mmin [n]

  • file last modified [n] minutes ago

find [path] -mtime [n]

  • file last modified [n] days ago

Flags for other timestamps:

  • -cmin / -ctime
    • time of last attribute change
  • -amin / -atime
    • time of last access
$ ls -al
drwxrwxr-x 7 mrshu mrshu 4096 Nov 9 10:34 .
drwxrwxr-x 2 mrshu mrshu 4096 Nov 7 11:39 a
drwxrwx--x 2 mrshu mrshu 4096 Nov 7 11:40 b
drwxrwxr-x 2 mrshu mrshu 4096 Nov 7 11:57 c
drwxrwxr-x 2 mrshu mrshu 4096 Nov 7 12:36 d
drwxrwxr-x 2 mrshu mrshu 4096 Nov 9 10:34 e
$ date
Mon 09 Nov 2020 10:55:23 AM UTC

Last modified exactly 22 minutes ago:

$ find . -mmin 22
.
./e

Last modified 1 full day ago (between 24 and 48 hours; fractions of a day are ignored):

$ find . -mtime 1
./a
./d
./b
./c
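
The day arithmetic can be verified by backdating a file. This sketch assumes GNU touch (its -d option accepts relative dates such as '36 hours ago') and uses hypothetical file names.

```shell
workdir=$(mktemp -d)
touch "$workdir/fresh.txt"
# Backdate the modification time by 36 hours (GNU touch syntax).
touch -d '36 hours ago' "$workdir/yesterday.txt"

# -mtime 1: age in whole days is exactly 1 (between 24 and 48 hours).
one_day=$(cd "$workdir" && find . -type f -mtime 1)
# -mtime 0: modified less than 24 hours ago.
today=$(cd "$workdir" && find . -type f -mtime 0)

echo "$one_day"
echo "$today"
```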

find: looking up by size

find [path] -size [n]

  • [n] can be followed by various units:
    • b for 512-byte blocks (the default)
    • c for bytes
    • k for Kilobytes (units of 1024 bytes)
    • M for Megabytes (units of 1048576 bytes)
    • G for Gigabytes (units of 1073741824 bytes)

find [path] -empty

  • find empty files and directories
$ find .
.
./d
./c
./c/file3.txt
./b
./b/file40.txt
./a
./a/file2.txt
./a/file1.txt
$ find . -empty
./d
./c/file3.txt
./a/file1.txt
$ find . -size 0
./c/file3.txt
./a/file1.txt

find: a note on [n]

By default, [n] matches the exact value (of time/date or size)

This behaviour can be altered via the + and - prefixes

  • +[n]

    • matches all values larger than [n]
  • -[n]

    • matches all values smaller than [n]
  • [n]

    • matches exactly [n]
$ find . -mtime -3
.
./a
./d
./b
./c
./e
$ find /boot -size +15M
/boot/initrd.img-5.4.0-47-generic
/boot/initrd.img-5.4.0-52-generic
/boot/initrd.img-5.4.0-51-generic
$ ls -hs /boot/initrd.img-5.4.0-47-generic
78M /boot/initrd.img-5.4.0-47-generic
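
The three forms of [n] can be compared on files of known size. A small sketch with made-up file names; the c unit (bytes) is used so that no rounding to larger units is involved.

```shell
workdir=$(mktemp -d)
: > "$workdir/empty.bin"                       # 0 bytes
head -c 100 /dev/zero > "$workdir/small.bin"   # 100 bytes
head -c 5000 /dev/zero > "$workdir/big.bin"    # 5000 bytes

exact=$(cd "$workdir" && find . -type f -size 100c)     # exactly 100 bytes
larger=$(cd "$workdir" && find . -type f -size +1000c)  # more than 1000 bytes
smaller=$(cd "$workdir" && find . -type f -size -100c)  # fewer than 100 bytes

echo "exact: $exact, larger: $larger, smaller: $smaller"
```

With larger units GNU find rounds sizes up to whole units before comparing, so -size -1M matches only empty files; working in bytes avoids that surprise.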

find: looking up by user/group

find [path] -user [user]

  • only show files and directories owned by [user]

find [path] -group [group]

  • only show files and directories that belong to group [group]
$ ls -l /etc
[ ... 160 lines omitted ... ]
drwxr-xr-x 2 root root 4096 Sep 19 19:52 sensors.d
-rw-r--r-- 1 root root 14464 Feb 16 2020 services
-rw-r----- 1 root shadow 1463 Sep 19 20:15 shadow
-rw-r----- 1 root shadow 1595 Sep 19 20:14 shadow-
-rw-r--r-- 1 root root 146 Jul 31 16:29 shells
drwxr-xr-x 2 root root 4096 Jul 31 16:28 skel
[ ... 32 lines omitted ... ]
$ find /etc/ -group shadow
/etc/shadow-
/etc/gshadow
/etc/shadow
/etc/gshadow-

find: looking up by permissions

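find [path] -readable

  • only show files and directories the current user can read

find [path] -writable

  • only show files and directories the current user can write to

find [path] -executable

  • only show files the current user can execute and directories the current user can search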
$ find .
.
./d
./c
./c/file3.txt
./b
./b/file40.txt
./a
./a/file2.txt
./a/file1.txt
$ find . -executable
.
./d
./c
./b
./a
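
Only directories show up in the listing above because the execute bit on a directory means "may be searched", and none of the .txt files has it set. Restricting the search to regular files makes this visible. A sketch assuming GNU find (-executable is a GNU extension) and hypothetical file names:

```shell
workdir=$(mktemp -d)
printf '#!/bin/sh\necho hi\n' > "$workdir/run.sh"
printf 'plain text\n' > "$workdir/notes.txt"
chmod 755 "$workdir/run.sh"      # execute bit set
chmod 644 "$workdir/notes.txt"   # no execute bit

# -type f filters out directories, which are "executable" when searchable.
exec_files=$(cd "$workdir" && find . -type f -executable)
echo "$exec_files"
```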

find: looking up by permissions II

find [path] -perm [mode]

  • [mode] can be specified in octal or symbolic format
  • [mode] can have various prefixes:
    • [mode]: exactly these permission bits are set
    • -[mode]: all of these permission bits are set (others may be set too)
    • /[mode]: at least one of these permission bits is set
$ ls -al
total 24
drwxrwxr-x 6 mrshu mrshu 4096 Nov 7 12:36 .
drwxr-xr-x 7 mrshu mrshu 4096 Nov 9 10:16 ..
drwxrwxr-x 2 mrshu mrshu 4096 Nov 7 11:39 a
drwxrwx--x 2 mrshu mrshu 4096 Nov 7 11:40 b
drwxrwxr-x 2 mrshu mrshu 4096 Nov 7 11:57 c
drwxrwxr-x 2 mrshu mrshu 4096 Nov 7 12:36 d
$ find . -perm 771
./b
$ find . -perm -774
.
./a
./d
./c
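
The three prefix forms give different answers on the same pair of directories. A minimal sketch assuming GNU find (the / prefix is a GNU feature); the directory names are invented, and the sandbox itself is created by mktemp with mode 700, so it matches none of the tests.

```shell
workdir=$(mktemp -d)
mkdir "$workdir/open" "$workdir/closed"
chmod 775 "$workdir/open"     # rwxrwxr-x
chmod 771 "$workdir/closed"   # rwxrwx--x

exact=$(cd "$workdir" && find . -perm 771)            # mode is exactly 771
at_least=$(cd "$workdir" && find . -perm -774)        # all bits of 774 are set
any_of=$(cd "$workdir" && find . -perm /003 | sort)   # others have w or x

echo "exact: $exact"
echo "at least: $at_least"
echo "$any_of"
```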

find: combining search patterns

Name-based and attribute-based tests can be combined in a single find call; a file must satisfy all of them to match.

For example:

  • find all empty .txt files
$ find . -name "*.txt" -empty
  • find all files modified in the last 20 minutes, whose filename contains "image"
$ find . -name "*image*" -mmin -20
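
Tests written next to each other are joined with an implicit AND, which a tiny sandbox can confirm. The file names below are hypothetical.

```shell
workdir=$(mktemp -d)
: > "$workdir/empty.txt"                # empty, name matches
: > "$workdir/empty.log"                # empty, name does not match
printf 'data\n' > "$workdir/full.txt"   # name matches, not empty

# Both tests must hold: the name ends in .txt AND the file is empty.
empty_txt=$(cd "$workdir" && find . -name '*.txt' -empty)
echo "$empty_txt"
```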

find: actions on matches

-delete

  • delete all matched files

-exec [command] \;

  • run command [command] for each matched file/directory
  • string {} is replaced with the matched file/directory

-ok [command] \;

  • same thing as -exec but asks for user confirmation before running the command
$ cat c/file3.txt
$ cat b/file40.txt
This is 40
$ cat a/file2.txt
This is us
$ cat a/file1.txt
$ find . -name "*.txt" -exec echo {} \;
./c/file3.txt
./b/file40.txt
./a/file2.txt
./a/file1.txt
$ find . -name "*.txt" -exec cat {} \;
This is 40
This is us
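
The following sketch shows both actions on made-up files, assuming GNU or BSD find (-delete is not in POSIX). Note that -delete is written last, so only files that passed the earlier tests are removed.

```shell
workdir=$(mktemp -d)
printf 'This is 40\n' > "$workdir/file40.txt"
printf 'This is us\n' > "$workdir/file2.txt"
: > "$workdir/file1.txt"

# -exec ... \; runs the command once per match; {} becomes the matched path.
runs=$(cd "$workdir" && find . -name '*.txt' -exec echo run {} \; | wc -l | tr -d ' ')

# Remove only the empty .txt file, then list what is left.
(cd "$workdir" && find . -name '*.txt' -empty -delete)
left=$(cd "$workdir" && find . -name '*.txt' | sort)

echo "echo was run $runs times"
echo "$left"
```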

xargs

Constructing commands on the fly

xargs

  • Allows us to "parametrize" the commands we run by passing input via a pipe

  • Reads items from standard input and passes them as arguments to a command

xargs [command]

  • [command] can be any command we would like to run on the input

  • -I {} runs [command] once per input line, replacing {} in it with that line

$ find . -type f | xargs -I{} echo "File: {}"
File: ./c/file3.txt
File: ./b/file40.txt
File: ./a/file2.txt
File: ./a/file1.txt

This can be used as an alternative to -delete or -exec:

$ find . -type f -empty | xargs -I{} rm {}
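
How xargs builds its command line depends on whether -I is given. A minimal sketch with invented input:

```shell
# Without -I, xargs appends as many items as fit to a single command line.
batched=$(printf 'a\nb\nc\n' | xargs echo)

# With -I{}, the command is run once per input line.
per_line=$(printf 'a\nb\nc\n' | xargs -I{} echo "item: {}")

echo "$batched"
echo "$per_line"
```

The first call produces one line with all three items, the second produces one line per item.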

xargs II

xargs [command]

  • -t

    • print each command prior to execution
  • -p

    • similar to -ok in find
    • asks for confirmation before executing a command
  • -P [max-procs]

    • run the commands in parallel, using at most [max-procs] processes at a time

Ask before removing each empty file:

$ find . -type f -empty | xargs -p -I{} rm {}
rm ./c/file3.txt ?...y
rm ./a/file1.txt ?...y

Back to the grep tale

Task: find all Uniba IDs in the *.py files in the current directory (recursively)

Solution:

$ find . -name '*.py' -type f | xargs grep -n -E '[a-z]+[0-9]{1,3}'
./z/login.py:4:log_in(username='rob5')
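
The whole pipeline can be rehearsed on a sandbox. This is a sketch with made-up files; -H is added as a precaution so that grep prints the file name even when xargs happens to pass it a single file.

```shell
workdir=$(mktemp -d)
mkdir -p "$workdir/l" "$workdir/z"
printf 'access by novak123\n' > "$workdir/l/access.log"
printf "import os\nlog_in(username='rob5')\n" > "$workdir/z/login.py"

# find selects the .py files; xargs hands them to grep as arguments.
hits=$(cd "$workdir" && find . -name '*.py' -type f |
       xargs grep -H -n -E '[a-z]+[0-9]{1,3}')
echo "$hits"
```

The login ID in access.log is not reported, because find never passes that file to grep.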

Useful commands

time

  • Lets you "time" (measure the time needed for) the execution of a specific command

  • A shell keyword in bash, but a standalone program (typically /usr/bin/time) also exists

  • Very nice for benchmarking competing approaches to solving the same problem

Suppose we'd like to benchmark the following two commands:

$ find ./foo -type f -name "*.txt" -exec rm {} \;
$ find ./foo -type f -name "*.txt" | xargs rm

On a folder with 1000 files in it, here are the results:

$ time find ./foo -type f -name "*.txt" -exec rm {} \;
0.35s user 0.11s system 99% cpu 0.467 total
$ time find ./foo -type f -name "*.txt" | xargs rm
0.00s user 0.01s system 75% cpu 0.016 total

As we can see, the xargs approach is much faster in this measurement (0.016 s vs. 0.467 s in total): -exec ... \; starts a new rm process for every file, whereas xargs passes many file names to a single rm. Various benchmarks tend to agree.
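
The number of processes started by each approach can be counted without deleting anything by substituting echo for rm, since every echo invocation prints exactly one line. A sketch with hypothetical file names:

```shell
workdir=$(mktemp -d)
for i in 1 2 3 4 5; do : > "$workdir/f$i.txt"; done

# One echo process per file: five output lines.
per_file=$(cd "$workdir" && find . -name '*.txt' -exec echo {} \; | wc -l | tr -d ' ')

# One echo process for the whole batch: a single output line.
batched=$(cd "$workdir" && find . -name '*.txt' | xargs echo | wc -l | tr -d ' ')

echo "-exec started $per_file processes, xargs started $batched"
```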

Timing the parallel xargs

Let's use time to demonstrate the difference that the -P option of xargs can make.

The program we'll test this on is very simple -- just sleep (wait) for a bit (like an extensive computation would).

Sleep for 1, 2, 3, 4 and 5 seconds serially (about 15 sec in total):

$ time echo 1 2 3 4 5 | tr ' ' '\n' | xargs -I{} sleep {}
real 0m15.017s
user 0m0.007s
sys 0m0.012s

Sleep for 1, 2, 3, 4 and 5 seconds in parallel (about 5 sec in total):

$ time echo 1 2 3 4 5 | tr ' ' '\n' | xargs -P 5 -I{} sleep {}
real 0m5.012s
user 0m0.007s
sys 0m0.014s
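
A shorter variant of the same experiment can be scripted with date instead of time, so the two durations end up in variables. This is a sketch only: it assumes an xargs that supports -P (GNU and BSD xargs do), and the one-second sleeps are an arbitrary choice to keep the run short.

```shell
# Serial: three one-second sleeps run back to back (about 3 s).
start=$(date +%s)
printf '1\n1\n1\n' | xargs -I{} sleep {}
serial=$(( $(date +%s) - start ))

# Parallel: -P 3 allows up to three sleeps at once (about 1 s).
start=$(date +%s)
printf '1\n1\n1\n' | xargs -P 3 -I{} sleep {}
parallel=$(( $(date +%s) - start ))

echo "serial: ${serial}s, parallel: ${parallel}s"
```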