name: inverse
layout: true
class: center, middle, inverse
---
# Finding content on the disk
With `grep`, `find` and `xargs`

.footnote[Marek Šuppa
Ondrej Jariabka
Adrián Matejov]
---
layout: false
# Why UNIX for Data Science?

- Why so many commands? Isn't a GUI nicer, faster and overall better?
  - The answer is yes: a GUI is nicer and faster for _discovering_ various options, but when it comes to pure execution, the CLI tends to be unmatched in speed
  - In other words, by using a mouse you don't build muscle memory

--

- Great tools for your toolbox
- Parallel processing

???

"The clumsiness of people who have to engage their brain at every step is unbearably painful to watch, at least to me, and that's what the novice-friendly software makes people do, because there's no elegance in them, it's just a mass of features to be learned by rote... it's sadly obvious that we are moving into a way of working that is predominantly _conscious_, for which I believe the human brain was never prepared. we no longer have the time to let skills sink into the autonomous nervous system, as it were, and even if we try, the criminal in Redmond, WA, has a new, incompatible version out by the time we learned the last version... One of the joys of learning to ride a bicycle is to stop thinking about it -- the feeling that I had successfully programmed my body to master a bicycle at least thrilled me as a kid (except I didn't know the verb "to program")... we need to communicate to users that learning to use Emacs is like learning to ride a bicycle -- it does take some time and effort, it's a worth-while skill to have, and then you never forget. I firmly believe that the novice-friendly software is like giving people several sets of supporting wheels so they won't tilt, but could get moving right away, and then never taking them off, preferring that they keep using them and moving so slowly that they always need them. of course, if you argue that they should remove the supporting wheels to such users, they will, admittedly correctly, argue that they will fall _splat_ on the ground and ruin their three-piece suits. clearly a no-go."
http://groups.google.com/group/comp.emacs/msg/821a0f04bab91864?dmode=source&output=gplain
https://news.ycombinator.com/item?id=2657135

---
class: middle, center, inverse
# A tale of `grep`
---
# A tale of `grep`

Suppose you would like to find a specific string (say, "guide to the galaxy") that you know is saved in some file on the disk.

How would you go about doing that?

--

Well, it turns out `grep` can help! Up until now we have used it in the following way:

```bash
$ grep [regex] [file]
```

--

But it can also be applied recursively to directories and their content.

---
# `grep` with recursion

`grep -R`
- read all files under each directory, recursively, and follow symbolic links
- starts from the current working directory by default

--

```bash
$ tree
.
├── a
│   ├── file1.txt
│   └── file2.txt
├── b
│   └── file40.txt
├── c
│   └── file3.txt
├── d
└── g
    └── book.txt

5 directories, 5 files
```

```bash
$ grep -Ri 'guide to the galaxy'
g/book.txt:The Hitchhiker's `Guide to the Galaxy`
```

---
# `grep`'s Extended Regular Expressions

- `grep` uses so-called "basic regular expressions" (BRE) by default.

--

- What we normally consider "regular expressions" are actually "extended regular expressions" (ERE)

--

- The main difference is in backslashing the meta characters:
  - **BRE**: `\?`, `\+`, `\{`, `\}`
  - **ERE**: `?`, `+`, `{`, `}`

EREs can be turned on in `grep` by passing the `-E` parameter.

--

-------------------

Sample: find all Uniba login IDs in the current directory:

--

```bash
$ grep -Rn -E '[a-z]+[0-9]{1,3}'
l/access.log:11:Unauthorized access attempt from login ID `novak123`
l/access.log:18:Successfully authorized `roy47`
z/login.py:4:log_in(username='`rob5`')
```

--

But what if we only wanted to search in, say, `.py` files?
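---
# `grep`: BRE vs. ERE in practice

A small sketch of the backslashing difference between BRE and ERE, using the login ID pattern from the previous slide. The scratch file and its contents are made up for this illustration, and `\+` in a BRE is a GNU extension, so other `grep` implementations may behave differently.

```bash
# Hypothetical sample file, created only for this illustration.
workdir=$(mktemp -d)
printf 'login novak123\nlogin roy47\nno digits here\n' > "$workdir/access.log"

# BRE (the default): + and {} must be backslashed to act as operators.
bre_hits=$(grep -c '[a-z]\+[0-9]\{1,3\}' "$workdir/access.log")

# ERE (-E): the same pattern without the backslashes.
ere_hits=$(grep -c -E '[a-z]+[0-9]{1,3}' "$workdir/access.log")

# Unescaped in a BRE, + and { are literal characters, so nothing matches here.
# grep exits with status 1 when there is no match, hence the "|| true".
literal_hits=$(grep -c '[a-z]+[0-9]{1,3}' "$workdir/access.log" || true)

echo "BRE: $bre_hits, ERE: $ere_hits, unescaped BRE: $literal_hits"
rm -r "$workdir"
```

With GNU `grep`, this should print `BRE: 2, ERE: 2, unescaped BRE: 0`.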
---
class: middle, center, inverse
# `find`
One-stop solution for figuring out what is where
---
# `find`

- A command for locating (*finding*) files and directories
- It looks through all the subdirectories recursively
- If no starting point is specified, it starts from `.`

.left-eq-column[
```bash
$ tree
.
├── a
│   ├── file1.txt
│   └── file2.txt
├── b
│   └── file40.txt
├── c
│   └── file3.txt
└── d

4 directories, 4 files
```
]

.right-eq-column[
```bash
$ find
.
./d
./c
./c/file3.txt
./b
./b/file40.txt
./a
./a/file2.txt
./a/file1.txt

$ find .
.
./d
./c
./c/file3.txt
./b
./b/file40.txt
./a
./a/file2.txt
./a/file1.txt
```
]
---
# `find`: looking up filenames

Can be done with the `-name` option / flag

.left-eq-column[
`find [path] -name [pattern]`
- `[pattern]` supports the following wildcards:
  - `*`: matches any string (of any length)
  - `?`: matches a single character
  - `[ ]`: matches a single character from the specified character class

```bash
$ find .
.
./d
./c
./c/file3.txt
./b
./b/file40.txt
./a
./a/file2.txt
./a/file1.txt
```
]

.right-eq-column[
```bash
$ find `.` -name "*.txt"
./c/file3.txt
./b/file40.txt
./a/file2.txt
./a/file1.txt
```

```bash
$ find `.` -name "file?.txt"
./c/file3.txt
./a/file2.txt
./a/file1.txt
```

```bash
$ find `.` -name 'file[2-4].txt'
./c/file3.txt
./a/file2.txt
```
]
---
# `find`: looking up paths

For matching full paths, `-path` is a good choice

.left-eq-column[
`find [path] -path [pattern]`
- `[pattern]` is applied to the whole path, not just the name
- it once again supports wildcards

```bash
$ find .
.
./d
./c
./c/file3.txt
./b
./b/file40.txt
./a
./a/file2.txt
./a/file1.txt
```
]

.right-eq-column[
```bash
$ find . -path '*a*'
./`a`
./`a`/file2.txt
./`a`/file1.txt
```

```bash
$ find . -path './?'
./d
./c
./b
./a
```

```bash
$ find . -path '*c/file[0-9].txt'
./`c/file3.txt`
```

The pattern needs to match the whole path (this example does not, so it prints nothing):

```bash
$ find . -path '*c/file[0-9]'
```
]
---
# `find`: looking up via regex

If wildcards do not suffice, we can also use the "full power of regexes"

.left-eq-column[
`find [path] -regex [pattern]`
- `[pattern]` can be any "standard" regular expression

```bash
$ find .
.
./d
./c
./c/file3.txt
./b
./b/file40.txt
./a
./a/file2.txt
./a/file1.txt
```
]

.right-eq-column[
```bash
$ find . -regex '.*file.*'
./c/file3.txt
./b/file40.txt
./a/file2.txt
./a/file1.txt
```

```bash
$ find . -regex '.*[bc]/file.*'
./c/file3.txt
./b/file40.txt
```

The regex needs to match the whole path, so the examples below match nothing:

```bash
$ find . -regex 'file'
```

```bash
$ find . -regex '[bc]/file.*'
```
]
---
# `find`: looking up via attributes

Files and directories have various attributes `find` can take a look at:

- type
  - directory, file, symlink, ...
- timestamps
  - last status (attribute) change (`-ctime`)
  - last access (`-atime`)
  - last modification (`-mtime`)
- file size
- owner and group
- permissions

---
# `find`: looking up by file/dir type

.left-eq-column[
`find [path] -type [type]`
- `[type]` can be one of the following
  - `f`: "normal" file
  - `d`: directory
  - `b`/`c`: block/character device
  - `p`: named pipe
  - `l`: symlink
  - `s`: socket

```bash
$ find .
.
./d
./c
./c/file3.txt
./b
./b/file40.txt
./a
./a/file2.txt
./a/file1.txt
```
]

.right-eq-column[
```bash
$ find . -type d
.
./d
./c
./b
./a
```

```bash
$ find . -type f
./c/file3.txt
./b/file40.txt
./a/file2.txt
./a/file1.txt
```
]
---
# `find`: looking up by timestamps

.left-eq-smaller-column[
`find [path] -mmin [n]`
- file last modified `[n]` minutes ago

`find [path] -mtime [n]`
- file last modified `[n]` days ago

------------

Flags for other timestamps:
- `-cmin` / `-ctime`
  - time of last status (attribute) change
- `-amin` / `-atime`
  - time of last access
]

.right-eq-larger-column[
```bash
$ ls -al
drwxrwxr-x 7 mrshu mrshu 4096 Nov 9 10:34 .
drwxrwxr-x 2 mrshu mrshu 4096 Nov 7 11:39 a
drwxrwx--x 2 mrshu mrshu 4096 Nov 7 11:40 b
drwxrwxr-x 2 mrshu mrshu 4096 Nov 7 11:57 c
drwxrwxr-x 2 mrshu mrshu 4096 Nov 7 12:36 d
drwxrwxr-x 2 mrshu mrshu 4096 Nov 9 10:34 e
```

```bash
$ date
Mon 09 Nov 2020 10:55:23 AM UTC
```

Last modified exactly 22 minutes ago:

```bash
$ find . -mmin 22
.
./e
```

Last modified 1 full day ago:

```bash
$ find . -mtime 1
./a
./d
./b
./c
```
]
---
# `find`: looking up by size

.left-eq-column[
`find [path] -size [n]`
- `[n]` can be followed by various units:
  - **`b`** for 512-byte blocks (the default)
  - **`c`** for bytes
  - **`k`** for Kilobytes (units of 1024 bytes)
  - **`M`** for Megabytes (units of 1048576 bytes)
  - **`G`** for Gigabytes (units of 1073741824 bytes)

`find [path] -empty`
- find empty files and directories
]

.right-eq-column[
```bash
$ find .
.
./d
./c
./c/file3.txt
./b
./b/file40.txt
./a
./a/file2.txt
./a/file1.txt
```

```bash
$ find . -empty
./d
./c/file3.txt
./a/file1.txt
```

```bash
$ find . -size 0
./c/file3.txt
./a/file1.txt
```
]
---
# `find`: a note on `[n]`

By default, `[n]` matches the exact value (of time/date or size).

This behaviour can be altered via the `+` and `-` prefixes:

- `+[n]`
  - matches all values larger than `[n]`
- `-[n]`
  - matches all values smaller than `[n]`
- `[n]`
  - matches exactly `[n]`

.left-eq-smaller-column[
```bash
$ find . -mtime -3
.
./a
./d
./b
./c
./e
```
]

.right-eq-larger-column[
```bash
$ find /boot -size +15M
/boot/initrd.img-5.4.0-47-generic
/boot/initrd.img-5.4.0-52-generic
/boot/initrd.img-5.4.0-51-generic

$ ls -hs /boot/initrd.img-5.4.0-47-generic
78M /boot/initrd.img-5.4.0-47-generic
```
]
---
# `find`: looking up by user/group

`find [path] -user [user]`
- only show files and directories owned by `[user]`

`find [path] -group [group]`
- only show files and directories that belong to group `[group]`

```bash
$ ls -l /etc
[ ... 160 lines omitted ... ]
drwxr-xr-x 2 root root 4096 Sep 19 19:52 sensors.d
-rw-r--r-- 1 root root 14464 Feb 16 2020 services
-rw-r----- 1 root shadow 1463 Sep 19 20:15 shadow
-rw-r----- 1 root shadow 1595 Sep 19 20:14 shadow-
-rw-r--r-- 1 root root 146 Jul 31 16:29 shells
drwxr-xr-x 2 root root 4096 Jul 31 16:28 skel
[ ... 32 lines omitted ... ]

$ find /etc/ -group shadow
/etc/shadow-
/etc/gshadow
/etc/shadow
/etc/gshadow-
```

---
# `find`: looking up by permissions

`find [path] -readable`
- only show files that are readable

`find [path] -writable`
- only show files that are writable

`find [path] -executable`
- only show files that are executable

.left-eq-column[
```bash
$ find .
.
./d
./c
./c/file3.txt
./b
./b/file40.txt
./a
./a/file2.txt
./a/file1.txt
```
]

.right-eq-column[
```bash
$ find . -executable
.
./d
./c
./b
./a
```
]
---
# `find`: looking up by permissions II

`find [path] -perm [mode]`
- `[mode]` can be specified in octal or symbolic format
- `[mode]` can have various prefixes:
  - `[mode]`: exactly `[mode]` permissions are set
  - `-[mode]`: at least `[mode]` permissions are set
  - `/[mode]`: at least one of the `[mode]` permission bits is set

```bash
$ ls -al
total 24
drwxrwx`r`-x 6 mrshu mrshu 4096 Nov 7 12:36 .
drwxr-x`r`-x 7 mrshu mrshu 4096 Nov 9 10:16 ..
drwxrwx`r`-x 2 mrshu mrshu 4096 Nov 7 11:39 a
drwxrwx`-`-x 2 mrshu mrshu 4096 Nov 7 11:40 b
drwxrwx`r`-x 2 mrshu mrshu 4096 Nov 7 11:57 c
drwxrwx`r`-x 2 mrshu mrshu 4096 Nov 7 12:36 d

$ find . -perm 771
./b

$ find . -perm -774
.
./a
./d
./c
```

---
# `find`: combining search patterns

Search patterns on the name and attribute level can easily be combined.

For example:

- find all empty .txt files

```bash
$ find . -name "*.txt" -empty
```

- find all files modified in the last 20 minutes, whose filename contains "image"

```bash
$ find . -name "*image*" -mmin -20
```

---
# `find`: actions on matches

`-delete`
- delete all matched files

`-exec [command] \;`
- run command `[command]` for each matched file/directory
- string `{}` is replaced with the matched file/directory

`-ok [command] \;`
- same thing as `-exec` but asks for user confirmation before running the command

.left-eq-smaller-column[
```bash
$ cat c/file3.txt
$ cat b/file40.txt
This is 40
$ cat a/file2.txt
This is us
$ cat a/file1.txt
```
]

.right-eq-larger-column[
```bash
$ find . -name "*.txt" -exec echo {} \;
./c/file3.txt
./b/file40.txt
./a/file2.txt
./a/file1.txt

$ find . -name "*.txt" -exec cat {} \;
This is 40
This is us
```
]
---
class: middle, center, inverse
# `xargs`
Constructing commands on the fly
---
# `xargs`

- Allows us to "parametrize" the commands we run by passing input via a pipe
- Processes standard input and "applies" a command to it (with `-I`, once per input line)

`xargs [command]`
- `[command]` can be any command we would like to run on the input
- `-I {}` will replace `{}` in the `[command]` with each input line

```bash
$ find . -type f | xargs -I{} echo "File: {}"
File: ./c/file3.txt
File: ./b/file40.txt
File: ./a/file2.txt
File: ./a/file1.txt
```

--

This can be easily used as a clearer alternative to `-delete` or `-exec`

```bash
$ find . -type f -empty | xargs -I{} rm {}
```

---
# `xargs` II

`xargs [command]`
- `-t`
  - print each command prior to execution
- `-p`
  - similar to `-ok` in `find`
  - asks for confirmation before executing a command
- `-P [max-procs]`
  - run the commands in parallel, on at most `[max-procs]` processes

----------

Ask before removing each empty file:

```bash
$ find . -type f -empty | xargs -p -I{} rm {}
rm ./c/file3.txt ?...y
rm ./a/file1.txt ?...y
```

---
# Back to the `grep` tale

**Task**: find all Uniba IDs in the `*.py` files in the current directory (recursively)

**Solution**:

```bash
$ find . -name '*.py' -type f | xargs grep -n -E '[a-z]+[0-9]{1,3}'
z/login.py:4:log_in(username='`rob5`')
```

---
class: inverse, center, middle
# Useful commands
---
# `time`

- Lets you "time" (measure the time needed for) the execution of a specific command
- Built into `bash`, but a standalone program also exists
- Very nice for benchmarking competing approaches to solving the same problem

--

Suppose we'd like to benchmark the following two commands:

```bash
find ./foo -type f -name "*.txt" -exec rm {} \;
find ./foo -type f -name "*.txt" | xargs rm
```

--

On a folder with 1000 files in it, here are the results:

```bash
time find ./foo -type f -name "*.txt" -exec rm {} \;
0.35s user 0.11s system 99% cpu 0.467 total

time find ./foo -type f -name "*.txt" | xargs rm
0.00s user 0.01s system 75% cpu 0.016 total
```

As we can see, the `xargs` approach is considerably faster in this run: 0.016 s vs. 0.467 s in total ([various](https://www.everythingcli.org/find-exec-vs-find-xargs/) [benchmarks](https://danielmiessler.com/blog/linux-xargs-vs-exec/) tend to agree)

???

https://shapeshed.com/unix-xargs/

---
# `time`ing the parallel `xargs`

Let's use `time` to demonstrate the difference that the `-P` option of `xargs` can make.

The program we'll test this on is very simple: just `sleep` (wait) for a bit, as an expensive computation would.

Sleep for 1, 2, 3, 4 and 5 seconds serially (about 15 sec in total):

```bash
$ time echo 1 2 3 4 5 | tr ' ' '\n' | xargs -I{} sleep {}
real 0m15.017s
user 0m0.007s
sys 0m0.012s
```

Sleep for 1, 2, 3, 4 and 5 seconds in parallel (about 5 sec in total):

```bash
$ time echo 1 2 3 4 5 | tr ' ' '\n' | xargs -P 5 -I{} sleep {}
real 0m5.012s
user 0m0.007s
sys 0m0.014s
```
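---
# `-exec` vs. `xargs`: counting invocations

One plausible reason for the gap measured on the `time` slide is that `-exec ... \;` starts the command once per match, while `xargs` (without `-I`) passes many matches to a single invocation. The sketch below counts invocations by counting output lines of `echo`. The scratch directory and file names are made up for the demo, and `-print0` / `-0` are assumed to be available (they are in GNU and BSD tools).

```bash
# Scratch directory with a few hypothetical files (names are made up).
workdir=$(mktemp -d)
touch "$workdir/one.txt" "$workdir/two.txt" "$workdir/with space.txt"

# -exec ... \; runs echo once per match: one output line per file.
per_file=$(find "$workdir" -type f -name '*.txt' -exec echo run {} \; | grep -c '^run')

# -exec ... + appends as many matches as fit to a single invocation.
batched=$(find "$workdir" -type f -name '*.txt' -exec echo run {} + | grep -c '^run')

# xargs batches as well; -print0 together with -0 keeps file names
# containing spaces intact.
via_xargs=$(find "$workdir" -type f -name '*.txt' -print0 | xargs -0 echo run | grep -c '^run')

echo "per file: $per_file, batched: $batched, xargs: $via_xargs"
rm -r "$workdir"
```

This should print `per file: 3, batched: 1, xargs: 1`: three processes in the first case, one in each of the other two. Note that `-I{}` switches `xargs` back to one invocation per input line, so the batching benefit only applies without it.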