Finding content on the disk

With grep, find and xargs

Marek Šuppa
Ondrej Jariabka
Adrián Matejov

Why UNIX for Data Science?

Why so many commands? Isn't a GUI nicer, faster and overall better?
The answer is yes, up to a point: a GUI is nicer and faster for discovering the available options, but when it comes to pure execution, the CLI tends to be unmatched in speed
In other words, by using a mouse you don't build muscle memory
Great tools for your toolbox
Parallel processing

"The clumsiness of people who have to engage their brain at every step is unbearably painful to watch, at least to me, and that's what the novice-friendly software makes people do, because there's no elegance in them, it's just a mass of features to be learned by rote... it's sadly obvious that we are moving into a way of working that is predominantly conscious, for which I believe the human brain was never prepared. we no longer have the time to let skills sink into the autonomous nervous system, as it were, and even if we try, the criminal in Redmond, WA, has a new, incompatible version out by the time we learned the last version... One of the joys of learning to ride a bicycle is to stop thinking about it -- the feeling that I had successfully programmed my body to master a bicycle at least thrilled me as a kid (except I didn't know the verb "to program")... we need to communicate to users that learning to use Emacs is like learning to ride a bicycle -- it does take some time and effort, it's a worth-while skill to have, and then you never forget. I firmly believe that the novice-friendly software is like giving people several sets of supporting wheels so they won't tilt, but could get moving right away, and then never taking them off, preferring that they keep using them and moving so slowly that they always need them. of course, if you argue that they should remove the supporting wheels to such users, they will, admittedly correctly, argue that they will fall splat on the ground and ruin their three-piece suits. clearly a no-go."

http://groups.google.com/group/comp.emacs/msg/821a0f04bab91864?dmode=source&output=gplain

https://news.ycombinator.com/item?id=2657135

A tale of `grep`

Suppose you would like to find a specific string (say, "guide to the galaxy") that you know is saved in some file on the disk.

How would you go about doing that?

Well, it turns out grep can be of help!

Up until now we have used it in the following way:

$ grep [regex] [file]

But it can also be applied recursively on directories and their content.

`grep` with recursion

grep -R

read all files under each directory, recursively, and follow symbolic links
starts from the current working directory by default

$ tree
.
├── a
│   ├── file1.txt
│   └── file2.txt
├── b
│   └── file40.txt
├── c
│   └── file3.txt
├── d
└── g
    └── book.txt
5 directories, 5 files

$ grep -Ri 'guide to the galaxy'
g/book.txt:The Hitchiker's Guide to the Galaxy
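
To try the recursive search without touching real data, here is a minimal sketch that rebuilds a small part of the tree above in a temporary directory. The directory and the file contents are made up for illustration; -l is added so that only the names of matching files are printed.

```shell
# Build a scratch tree; names and contents are illustrative assumptions.
demo=$(mktemp -d)
mkdir -p "$demo/a" "$demo/g"
echo "nothing to see here" > "$demo/a/file1.txt"
echo "The Hitchhiker's Guide to the Galaxy" > "$demo/g/book.txt"

# -R: recurse into directories, -i: ignore case, -l: list matching file names only
matches=$(cd "$demo" && grep -Ril 'guide to the galaxy' .)
echo "$matches"
```

Passing . explicitly makes the starting directory visible in the output paths.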

`grep`'s Extended Regular Expressions

grep uses so-called "basic regular expressions" (BRE) by default.
What we normally consider "regular expressions" are actually "extended regular expressions" (ERE)
The main difference is which metacharacters need a backslash to get their special meaning:
- BRE: \?, \+, \{, \} (without the backslash they match literally)
- ERE: ?, +, {, }

EREs can be turned on in grep by passing the -E parameter.
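
As a quick check of the BRE/ERE difference, the sketch below runs the same pattern in both syntaxes on one made-up input line. It assumes GNU grep, where \+ is available in BRE as an extension; -o prints only the matched part.

```shell
line='login ID novak123'

# BRE (default): + and { } need a backslash to act as operators
bre=$(printf '%s\n' "$line" | grep -o '[a-z]\+[0-9]\{1,3\}')

# ERE (-E): the same operators are written without backslashes
ere=$(printf '%s\n' "$line" | grep -oE '[a-z]+[0-9]{1,3}')

echo "BRE: $bre"
echo "ERE: $ere"
```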

Sample: find all Uniba login IDs in the current directory:

$ grep -Rn -E '[a-z]+[0-9]{1,3}'
l/access.log:11:Unauthorized access attmept from login ID novak123
l/access.log:18:Successfuly authorized roy47
z/login.py:4:log_in(username='rob5')

But what if we only wanted to search in, say, .py files?

`find`

One-stop solution for figuring out what is where

A command for locating (finding) files and directories
It looks through all the subdirectories recursively
If no starting-point is specified it starts from .

$ tree
.
├── a
│   ├── file1.txt
│   └── file2.txt
├── b
│   └── file40.txt
├── c
│   └── file3.txt
└── d
4 directories, 4 files

$ find
.
./d
./c
./c/file3.txt
./b
./b/file40.txt
./a
./a/file2.txt
./a/file1.txt
$ find .
.
./d
./c
./c/file3.txt
./b
./b/file40.txt
./a
./a/file2.txt
./a/file1.txt

`find`: looking up filenames

Can be done with the -name option / flag

find [path] -name [pattern]

[pattern] supports the following wildcards:
- *: matches any string (of any length)
- ?: matches a single character
- [ ]: matches a single character from the specified character class

$ find .
.
./d
./c
./c/file3.txt
./b
./b/file40.txt
./a
./a/file2.txt
./a/file1.txt

$ find . -name "*.txt"
./c/file3.txt
./b/file40.txt
./a/file2.txt
./a/file1.txt

$ find . -name "file?.txt"
./c/file3.txt
./a/file2.txt
./a/file1.txt

$ find . -name 'file[2-4].txt'
./c/file3.txt
./a/file2.txt
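
The patterns above are quoted for a reason: an unquoted wildcard is expanded by the shell before find ever sees it. A small sketch of the effect, using a throwaway directory with hypothetical file names:

```shell
demo=$(mktemp -d)
mkdir "$demo/sub"
touch "$demo/notes.txt" "$demo/sub/other.txt"
cd "$demo" || exit 1

# Unquoted: the shell expands *.txt to notes.txt before find starts,
# so find only ever looks for that one name.
unquoted=$(find . -name *.txt | sort)

# Quoted: find receives the pattern itself and applies it at every level.
quoted=$(find . -name '*.txt' | sort)

echo "unquoted: $unquoted"
echo "quoted:   $quoted"
```

If several files in the current directory match, the unquoted form typically fails outright, because find then receives extra arguments it cannot interpret.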

`find`: looking up paths

For matching full paths, -path is a good choice

find [path] -path [pattern]

[pattern] is applied on the whole path, not just the name
it once again supports wildcards

$ find .
.
./d
./c
./c/file3.txt
./b
./b/file40.txt
./a
./a/file2.txt
./a/file1.txt

$ find . -path '*a*'
./a
./a/file2.txt
./a/file1.txt

$ find . -path './?'
./d
./c
./b
./a

$ find . -path '*c/file[0-9].txt'
./c/file3.txt

The pattern needs to match the whole path (this example does not):

$ find . -path '*c/file[0-9]'

`find`: looking up via regex

If wildcards do not suffice, we can also utilize the "full power of regexes"

find [path] -regex [pattern]

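[pattern] is a regular expression matched against the whole path (GNU find uses Emacs-style syntax by default; -regextype selects another one, e.g. posix-extended)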

$ find .
.
./d
./c
./c/file3.txt
./b
./b/file40.txt
./a
./a/file2.txt
./a/file1.txt

$ find . -regex '.*file.*'
./c/file3.txt
./b/file40.txt
./a/file2.txt
./a/file1.txt

$ find . -regex '.*[bc]/file.*'
./c/file3.txt
./b/file40.txt

The regex needs to match the whole path, so the examples below find nothing:

$ find . -regex 'file'

$ find . -regex '[bc]/file.*'
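
If ERE syntax is preferred, GNU find can be told to use it with -regextype. The following is a sketch under that assumption (other find implementations may differ); the scratch tree mirrors part of the example above.

```shell
demo=$(mktemp -d)
mkdir -p "$demo/b" "$demo/c"
touch "$demo/b/file40.txt" "$demo/c/file3.txt"
cd "$demo" || exit 1

# posix-extended enables ERE syntax such as (b|c) and {1}; the regex still
# has to match the entire path, hence the leading \./
one_digit=$(find . -regextype posix-extended -regex '\./(b|c)/file[0-9]{1}\.txt')
echo "$one_digit"
```

Note that -regextype has to appear before the -regex it should affect.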

`find`: looking up via attributes

Files and directories have various attributes find can take a look at:

type
- directory, file, symlink, ...
timestamps
- last status change, i.e. of metadata such as permissions or owner (-ctime)
- last access (-atime)
- last modification of the contents (-mtime)
file size
owner and group
permissions

`find`: looking up by file/dir type

find [path] -type [type]

[type] can be one of the following
- f: "normal" file
- d: directory
- b/c: block/character device
- p: named pipe
- l: symlink
- s: socket

$ find .
.
./d
./c
./c/file3.txt
./b
./b/file40.txt
./a
./a/file2.txt
./a/file1.txt

$ find . -type d
.
./d
./c
./b
./a

$ find . -type f
./c/file3.txt
./b/file40.txt
./a/file2.txt
./a/file1.txt

`find`: looking up by timestamps

find [path] -mmin [n]

file last modified [n] minutes ago

find [path] -mtime [n]

file last modified [n] days ago

Flags for other timestamps:

-cmin / -ctime
- time of last attribute change
-amin / -atime
- time of last access

$ ls -al
drwxrwxr-x 7 mrshu mrshu 4096 Nov  9 10:34 .
drwxrwxr-x 2 mrshu mrshu 4096 Nov  7 11:39 a
drwxrwx--x 2 mrshu mrshu 4096 Nov  7 11:40 b
drwxrwxr-x 2 mrshu mrshu 4096 Nov  7 11:57 c
drwxrwxr-x 2 mrshu mrshu 4096 Nov  7 12:36 d
drwxrwxr-x 2 mrshu mrshu 4096 Nov  9 10:34 e

$ date
Mon 09 Nov 2020 10:55:23 AM UTC

Last modified exactly 22 minutes ago:

$ find . -mmin 22
.
./e

Last modified between 24 and 48 hours ago (the age is counted in whole days, any fraction is dropped):

$ find . -mtime 1
./a
./d
./b
./c
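
The whole-day counting of -mtime can be reproduced with back-dated files. This sketch assumes GNU touch (for the relative date given to -d); the file names are made up.

```shell
demo=$(mktemp -d)
touch "$demo/now.txt"
# GNU touch: back-date a file by 36 hours (1.5 days)
touch -d '36 hours ago' "$demo/yesterday.txt"
cd "$demo" || exit 1

# 1.5 days is truncated to 1, so -mtime 1 matches the back-dated file
one_day=$(find . -type f -mtime 1)
# a file modified within the last 24 hours has an age of 0 days
zero_days=$(find . -type f -mtime 0)

echo "-mtime 1: $one_day"
echo "-mtime 0: $zero_days"
```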

`find`: looking up by size

find [path] -size [n]

[n] can be followed by various units:
- b for 512-byte blocks (the default)
- c for bytes
- k for Kilobytes (units of 1024 bytes)
- M for Megabytes (units of 1048576 bytes)
- G for Gigabytes (units of 1073741824 bytes)

find [path] -empty

find empty files and directories

$ find .
.
./d
./c
./c/file3.txt
./b
./b/file40.txt
./a
./a/file2.txt
./a/file1.txt

$ find . -empty
./d
./c/file3.txt
./a/file1.txt

$ find . -size 0
./c/file3.txt
./a/file1.txt

`find`: a note on `[n]`

By default, [n] matches the exact value (of time/date or size)

This behaviour can be altered via the + and - prefixes

+[n]
- matches all values larger than [n]
-[n]
- matches all values smaller than [n]
[n]
- matches exactly [n]

$ find . -mtime -3
.
./a
./d
./b
./c
./e

$ find /boot -size +15M
/boot/initrd.img-5.4.0-47-generic
/boot/initrd.img-5.4.0-52-generic
/boot/initrd.img-5.4.0-51-generic
$ ls -hs /boot/initrd.img-5.4.0-47-generic
78M /boot/initrd.img-5.4.0-47-generic
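
The three forms of [n] can be compared side by side on files of known size. A minimal sketch with hypothetical file names; the c unit (bytes) is used so that no rounding to blocks is involved.

```shell
demo=$(mktemp -d)
cd "$demo" || exit 1
: > empty.bin                          # 0 bytes
head -c 100  /dev/zero > small.bin     # 100 bytes
head -c 5000 /dev/zero > large.bin     # 5000 bytes

exact=$(find . -type f -size 100c)     # exactly 100 bytes
larger=$(find . -type f -size +100c)   # more than 100 bytes
smaller=$(find . -type f -size -100c)  # less than 100 bytes

echo "exactly 100 bytes: $exact"
echo "more than 100:     $larger"
echo "less than 100:     $smaller"
```

With larger units such as k or M, find rounds sizes up to whole units before comparing, so byte units are the safer choice when exact limits matter.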

`find`: looking up by user/group

find [path] -user [user]

only show files and directories owned by [user]

find [path] -group [group]

only show files and directories that belong to group [group]

$ ls -l /etc
[ ... 160 lines omitted ... ]
drwxr-xr-x 2 root root       4096 Sep 19 19:52 sensors.d
-rw-r--r-- 1 root root      14464 Feb 16  2020 services
-rw-r----- 1 root shadow     1463 Sep 19 20:15 shadow
-rw-r----- 1 root shadow     1595 Sep 19 20:14 shadow-
-rw-r--r-- 1 root root        146 Jul 31 16:29 shells
drwxr-xr-x 2 root root       4096 Jul 31 16:28 skel
[ ... 32 lines omitted ... ]
$ find /etc/ -group shadow
/etc/shadow-
/etc/gshadow
/etc/shadow
/etc/gshadow-

`find`: looking up by permissions

find [path] -readable

only show files that the current user can read

find [path] -writable

only show files that the current user can write to

find [path] -executable

only show files the current user can execute (for directories: search)

$ find .
.
./d
./c
./c/file3.txt
./b
./b/file40.txt
./a
./a/file2.txt
./a/file1.txt

$ find . -executable
.
./d
./c
./b
./a

`find`: looking up by permissions II

find [path] -perm [mode]

[mode] can be specified in octal or symbolic format

[mode] can have various prefixes:
- [mode]: exactly the [mode] permissions are set
- -[mode]: all of the [mode] permission bits are set (others may be set too)
- /[mode]: any of the [mode] permission bits is set

$ ls -al
total 24
drwxrwxr-x 6 mrshu mrshu 4096 Nov  7 12:36 .
drwxr-xr-x 7 mrshu mrshu 4096 Nov  9 10:16 ..
drwxrwxr-x 2 mrshu mrshu 4096 Nov  7 11:39 a
drwxrwx--x 2 mrshu mrshu 4096 Nov  7 11:40 b
drwxrwxr-x 2 mrshu mrshu 4096 Nov  7 11:57 c
drwxrwxr-x 2 mrshu mrshu 4096 Nov  7 12:36 d
$ find . -perm 771
./b
$ find . -perm -774
.
./a
./d
./c
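
The three prefix forms are easiest to tell apart on files with known modes. A sketch with made-up file names, assuming GNU find for the /[mode] form:

```shell
demo=$(mktemp -d)
cd "$demo" || exit 1
touch public.txt group.txt private.txt
chmod 644 public.txt    # rw-r--r--
chmod 640 group.txt     # rw-r-----
chmod 600 private.txt   # rw-------

# exactly 640
exact=$(find . -type f -perm 640)
# all bits of 640 set: matches 640 and 644, but not 600
at_least=$(find . -type f -perm -640 | sort)
# group-read or other-read set: again 640 and 644
any_of=$(find . -type f -perm /044 | sort)

echo "exactly 640: $exact"
echo "at least 640:"; echo "$at_least"
echo "any of 044:";   echo "$any_of"
```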

`find`: combining search patterns

Search criteria on the name and attribute level can easily be combined; a file is reported only if all of them match.

For example:

find all empty .txt files

$ find . -name "*.txt" -empty

find all files modified in the last 20 minutes, whose filename contains "image"

$ find . -name "*image*" -mmin -20

`find`: actions on matches

-delete

delete all matched files

-exec [command] \;

run command [command] for each matched file/directory
string {} is replaced with the matched file/directory

-ok [command] \;

same thing as -exec but asks for user confirmation before running the command

$ cat c/file3.txt
$ cat b/file40.txt
This is 40
$ cat a/file2.txt
This is us
$ cat a/file1.txt

$ find . -name "*.txt" -exec echo {} \;
./c/file3.txt
./b/file40.txt
./a/file2.txt
./a/file1.txt
$ find . -name "*.txt" -exec cat {} \;
This is 40
This is us
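
Besides the \; terminator, -exec also accepts +, which passes many matches to a single invocation of the command instead of starting it once per match. A sketch that makes the difference visible by counting output lines of echo (file names are made up):

```shell
demo=$(mktemp -d)
cd "$demo" || exit 1
touch one.txt two.txt three.txt

# \; runs echo once per file: three invocations, three output lines
per_file=$(find . -name '*.txt' -exec echo {} \; | wc -l)
# + appends as many matches as fit to one invocation: a single output line
batched=$(find . -name '*.txt' -exec echo {} + | wc -l)

echo "runs with the per-file form: $per_file"
echo "runs with the batched form:  $batched"
```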

`xargs`

Constructing commands on the fly

Allows us to "parametrize" the commands we run by passing input via pipe
Reads items from standard input and runs a command on them; with -I, the command runs once per input line

xargs [command]

command can be any command we would like to run on each line
-I {} will replace {} in the [command] by the input

$ find . -type f | xargs -I{} echo "File: {}"
File: ./c/file3.txt
File: ./b/file40.txt
File: ./a/file2.txt
File: ./a/file1.txt

This can be easily used as a clearer alternative to -delete or -exec

$ find . -type f -empty | xargs -I{} rm {}
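
One caveat with piping file names into xargs: by default it still interprets quotes and backslashes in its input, so unusual file names can break the command. A common way around this, sketched below on hypothetical files, is to separate names with NUL bytes using find's -print0 and xargs' -0 (both available in GNU and BSD tools).

```shell
demo=$(mktemp -d)
cd "$demo" || exit 1
: > "my notes.txt"     # empty file with a space in its name
: > plain.txt          # empty file
echo data > keep.txt   # non-empty file, must survive

# -print0 ends each name with a NUL byte; xargs -0 splits on NUL only
find . -type f -empty -print0 | xargs -0 rm --
remaining=$(find . -type f)
echo "$remaining"
```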

`xargs` II

xargs [command]

-t
- print each command prior to execution
-p
- similar to -ok in find
- asks for confirmation before executing a command
-P [max-procs]
- run the commands in parallel, on at most [max-procs] processes

Ask before removing each empty file:

$ find . -type f -empty | xargs -p -I{} rm {} 
rm ./c/file3.txt ?...y
rm ./a/file1.txt ?...y

Back to the `grep` tale

Task: find all Uniba IDs in the *.py files in the current directory (recursively)

Solution:

$ find . -name '*.py' -type f | xargs grep -n -E '[a-z]+[0-9]{1,3}'
./z/login.py:4:log_in(username='rob5')
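
For this particular task, GNU grep can also do the file name filtering itself through its --include option. This is an alternative sketch, not the approach taken in the slides; the scratch files below are made up for illustration.

```shell
demo=$(mktemp -d)
cd "$demo" || exit 1
mkdir l z
echo "Successfully authorized roy47" > l/access.log
echo "log_in(username='rob5')" > z/login.py

# --include limits a recursive search to file names matching the glob (GNU grep)
hits=$(grep -Rn -E --include='*.py' '[a-z]+[0-9]{1,3}' .)
echo "$hits"
```

The find | xargs pipeline remains the more general tool, since find can filter on attributes that grep knows nothing about.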

Useful commands

`time`

Lets you "time" (measure the time needed for) the execution of a specific command
Built into bash, but a standalone program also exists
Very handy for benchmarking competing approaches to solving the same problem

Suppose we'd like to benchmark the following two commands:

find ./foo -type f -name "*.txt" -exec rm {} \; 
find ./foo -type f -name "*.txt" | xargs rm

On a folder with 1000 files in it, here are the results:

time find ./foo -type f -name "*.txt" -exec rm {} \;
0.35s user 0.11s system 99% cpu 0.467 total
time find ./foo -type f -name "*.txt" | xargs rm
0.00s user 0.01s system 75% cpu 0.016 total

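As we can see, the xargs approach is considerably faster here (0.016 s vs. 0.467 s in total): -exec ... \; starts one rm process per file, while xargs passes many file names to a single rm. Various benchmarks tend to agree:

https://shapeshed.com/unix-xargs/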

`time`ing the parallel `xargs`

Let's use time to demonstrate the difference that -P can make in xargs.

The program we'll test this on is very simple -- just sleep (wait) for a bit (like an extensive computation would).

Sleep for 1, 2, 3, 4 and 5 seconds serially (about 15 sec in total):

$ time echo 1 2 3 4 5 | tr ' ' '\n' | xargs -I{} sleep {}
real    0m15.017s
user    0m0.007s
sys     0m0.012s

Sleep for 1, 2, 3, 4 and 5 seconds in parallel (about 5 sec in total):

$ time echo 1 2 3 4 5 | tr ' ' '\n' | xargs -P 5 -I{} sleep {}
real    0m5.012s
user    0m0.007s
sys     0m0.014s
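
The same comparison can be scripted without the time keyword by taking timestamps before and after each run. The sketch below uses three one-second sleeps to keep the experiment short; measurements are in whole seconds, so the numbers are only approximate.

```shell
# Three one-second sleeps, one after another
start=$(date +%s)
printf '1\n1\n1\n' | xargs -I{} sleep {}
serial=$(( $(date +%s) - start ))

# The same three sleeps, with up to three processes at once
start=$(date +%s)
printf '1\n1\n1\n' | xargs -P 3 -I{} sleep {}
parallel=$(( $(date +%s) - start ))

echo "serial:   ${serial}s"
echo "parallel: ${parallel}s"
```

The serial run should take roughly the sum of the sleeps, the parallel run roughly the longest single sleep.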
