name: inverse
layout: true
class: center, middle, inverse
---
# Users, Groups and Regular Expressions

User management and the most useful tool UNIX can give you

.footnote[Marek Šuppa<br>Ondrej Jariabka<br>Adrián Matejov]

---
layout: false

# Why UNIX-like for Data Science?

If for nothing else, it's worth it for **regular expressions**.

> Knowing [regular expressions] can mean the difference between solving a problem in 3 steps and solving it in 3,000 steps. When you’re a nerd, you forget that the problems you solve with a couple keystrokes can take other people days of tedious, error-prone work to slog through.

-- Cory Doctorow

https://www.theguardian.com/technology/2012/dec/04/ict-teach-kids-regular-expressions

---
class: middle, inverse

# Users and Groups

---
# Users

- UNIX was devised with "collaboration in mind"

- The concept of users plays a central role

--

- Same thing with Linux: it is a multi-user OS

- Each user is identified with a `UID`
  - Their actions (e.g. started processes or created files) are associated with this `UID`

???

- You know, sharing is caring and all that.

- In principle, UNIX was built so that people could collaborate on
  documents, something basically unheard of in the 1970s

---
# How do I become a user?

Via logging in. Two things need to happen:

1. Identification
  - By passing in the username

2. Authentication
  - By providing a password
  - Or other methods like SSH/HW crypto keys

---
# Where is info about users stored?

In general, two files:

- `/etc/passwd`
  - Can be read by everyone

- `/etc/shadow`
  - Can only be read by root (or "special users")
  - Actually contains the hashed passwords

???

The concept of shadowing came from the need to make the password hashes a bit
more secure, so that they could not be brute-forced by any user capable
of logging in.

Linux was kind of lucky: shadowing was ported there very early and has
stayed in place ever since.

---
# `/etc/passwd`

A file full of colon (`:`) delimited fields like

```bash
jsmith`:`x`:`1001`:`1000`:`Joe Smith,Room 7,(234)555-8910,j@smi.th`:`/home/jsmith`:`/bin/sh
```

--

Each field has a specific meaning:

1. `jsmith`: the username (generally lowercase)

--

2. `x`: password (the `x` here means the password is in `/etc/shadow`)

--

3. `1001`: the user's `UID`

--

4. `1000`: the user's primary `GID` (Group ID)

--

5. `Joe Smith,Room 7,(234)555-8910,j@smi.th`: some further (contact) details about the user

--

6. `/home/jsmith`: home directory path

--

7. `/bin/sh`: user's default shell
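
As a rough illustration of the field layout, the sample line can be split on colons in the shell. This is only a sketch: the line is the example from this slide (not a real account), and the seven variable names are labels chosen here.

```bash
# Sample line from the slide, not read from the real /etc/passwd
line='jsmith:x:1001:1000:Joe Smith,Room 7,(234)555-8910,j@smi.th:/home/jsmith:/bin/sh'

# IFS=: makes read split the line on colons, one variable per field
IFS=: read -r username password uid gid gecos home shell <<< "$line"

echo "user=$username uid=$uid gid=$gid"
echo "home=$home shell=$shell"
```

The commas inside the 5th field are left alone, since only the colon acts as a separator.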

???

The 5th field is actually the GECOS field (https://en.wikipedia.org/wiki/Gecos_field), a historical curiosity

---

# `/etc/shadow`

Similar to `/etc/passwd` in format, for example

```bash
jsmith:`$6$rTDC8QprwvDu`.:15377:0:99999:7:::
daemon:`*`:17206:0:99999:7:::
```

--

Once again, each field has a specific meaning:

1. `jsmith`: the username
2. the hashed password
  - empty: empty password
  - `!` or `*`: account is password locked, login only possible via other means (SSH)
  - `!!`: password not set yet
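3. `15377`: day of the last password change
4. `0`: minimum number of days before the password may be changed again
5. `99999`: maximum number of days before the password must be changed
6. `7`: number of days before expiration at which the user starts being warned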

--

The day of the last password change is counted from the "beginning of the UNIX epoch": **1 January 1970**. The remaining numbers are durations in days.
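
The day count of the last password change can be turned into a calendar date with a little arithmetic. A minimal sketch, assuming GNU `date` (the `-d @SECONDS` form is not available in BSD/macOS `date`); the variable names are made up for this example.

```bash
# 15377 is the day of the last password change from the sample line
last_change_days=15377

# Days since the epoch -> seconds since the epoch -> formatted date
# Assumption: GNU date, which accepts -d @SECONDS
last_change_date=$(date -u -d "@$((last_change_days * 86400))" +%F)

echo "$last_change_date"   # 2012-02-07
```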

---
# Groups

- A useful concept for allowing groups of users to access a set of resources

  - Could be files, special devices (printers, GPUs ...) or programs

--

- Uniquely identified by a `GID`

- Can have an access password (quite uncommon these days)

- From a group's point of view there are

  - **users**: those that are associated with / part of it
  - **others**: everyone else

- Information about them is stored in `/etc/group` and `/etc/gshadow`

---
# `/etc/group` and `/etc/gshadow`

- `/etc/group`
```
sudo:x:3:mrshu,vidriduch,adman
lp:x:7:daemon,lp,mrshu
```
  - name
  - password (or `x`, in which case it is shadowed)
  - `GID`
  - comma separated list of usernames

--

- `/etc/gshadow`
```
sudo:!::
lp:!!::
```
  - name
  - password (or `!`, `!!`, `*`)
  - list of administrators
  - list of users

---
# User groups

- Each user can be in multiple groups

- Just one of them is primary (its `GID` is right after `UID` in `/etc/passwd`)

--

- We can get the list of groups we are in by running the `groups` command:

```
$ groups
mrshu sudo lp
```

- To get the groups of other users, pass their username as a parameter

```
$ groups adman
adman : adman sudo
```
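
To see where this information comes from, the same answer can be pieced together from the two files with `grep` and `cut`. This is a sketch over hypothetical sample files in the `/etc/passwd` and `/etc/group` formats, not a description of how `groups` is actually implemented.

```bash
# Hypothetical sample data, created only for this demonstration
passwd_file=$(mktemp)
group_file=$(mktemp)
cat > "$passwd_file" <<'EOF'
mrshu:x:1001:1001:Marek:/home/mrshu:/bin/bash
adman:x:1002:1002:Adrian:/home/adman:/bin/bash
EOF
cat > "$group_file" <<'EOF'
mrshu:x:1001:
adman:x:1002:
sudo:x:27:mrshu,vidriduch,adman
lp:x:7:daemon,lp,mrshu
EOF

user=adman

# Primary group: its GID is the 4th field of the user's passwd line
primary_gid=$(grep "^$user:" "$passwd_file" | cut -d: -f 4)
primary=$(grep "^[^:]*:[^:]*:$primary_gid:" "$group_file" | cut -d: -f 1)

# Other groups: the user appears in the member list (4th field of the group file)
others=$(grep -E "^[^:]*:[^:]*:[^:]*:(.*,)?$user(,.*)?\$" "$group_file" | cut -d: -f 1)

echo "$user : $primary $others"
rm -f "$passwd_file" "$group_file"
```

For this sample data the last line prints `adman : adman sudo`.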

---

# `root` user

- the account of the system administrator

- in the UNIX security model, the `root` user is considered "all-powerful"

- this user traditionally has `UID` 0 and home directory `/root`

- it is also associated with a specific `root` group (`GID` is also 0)

--

## `sudo`

- stands for "superuser do" or "substitute user do"

- allows "normal" users to run commands as `root`

- only for users specified in its configuration (`/etc/sudoers`)
  - sometimes it is enough to be part of a special group (like `sudo`)

---
# Useful commands

- `id`
  - find out what your current identity is (along with `UID` and `GID`s)

```
$ id
uid=1001(mrshu) gid=1001(mrshu) groups=1001(mrshu),27(sudo)
```
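
Scripts sometimes need the same information, for instance to check whether they are running as `root`. A minimal sketch, relying only on the `-u` and `-g` options of `id`; the variable name is chosen for this example.

```bash
# id -u prints only the numeric UID of the current user; root has UID 0
if [ "$(id -u)" -eq 0 ]; then
    running_as="root"
else
    running_as="a regular user"
fi

echo "Running as $running_as (UID $(id -u), primary GID $(id -g))"
```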

--

- `su USER`
  - change to some other `USER` (short for "substitute user")
  - if called without arguments, assumes that `USER` is `root`
  - if you know `root`'s password, this is how you can get `root` privileges
  - `su -` starts a login shell, which is effectively the same thing as logging in as a different user

--

- `passwd`
  - change your UNIX password
  - `root` can also use it to change passwords of other users (`passwd USER`)

---
class: middle, inverse

# Regular Expressions

---
# Regular Expressions

- aka "regex" or "regexp"

- a quick way of describing a particular pattern of characters in text

- allows for extremely effective search and replace

--

- can be found everywhere on *NIX systems, but especially in text editors

- originated in the `ed` editor, but you'll mostly encounter it through the `grep` program

--

- in general `grep` outputs lines which match a given regex pattern

???

The name grep itself comes from the `ed` command:

> “One afternoon I asked Ken Thompson if he could lift the regular expression recognizer out of the editor and make a one-pass program to do it. He said yes. The next morning I found a note in my mail announcing a program named grep. It worked like a charm. When asked what that funny name meant, Ken said it was obvious. It stood for the editor command that it simulated, g/re/p (global regular expression print).”

-- [Chapter 9, On the Early History and Impact of Unix Tools to Build the Tools for a New Millenium](http://www.columbia.edu/~rh120/ch001j.c11)

https://medium.com/@rualthanzauva/grep-was-a-private-command-of-mine-for-quite-a-while-before-i-made-it-public-ken-thompson-a40e24a5ef48

---
# Using the `grep` command

**Task**: show lines in `file.txt` that match the regular expression `regexp`.

--

There are various ways of doing it:

- file as an argument

  - `grep "regexp" file.txt`

- input redirected from the file (standard input redirection)

  - `grep "regexp" < file.txt`

- data passed through a pipe

  - `cat file.txt | grep "regexp"`
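
The three invocations should produce the same output, which is easy to check. A small sketch using a temporary sample file (its name and content are made up for this demonstration):

```bash
# Hypothetical sample file, created only for this demonstration
file=$(mktemp)
printf '%s\n' joe2 molly13 nemo7 > "$file"

as_argument=$(grep "mo" "$file")       # file name as an argument
via_redirect=$(grep "mo" < "$file")    # standard input redirected from the file
via_pipe=$(cat "$file" | grep "mo")    # standard input fed by a pipe

echo "$as_argument"
if [ "$as_argument" = "$via_redirect" ] && [ "$via_redirect" = "$via_pipe" ]; then
    echo "identical output"
fi
rm -f "$file"
```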

---
# RegExp Patterns

```bash
$ cat file.txt
1 a.smith1
2 joe2
3 molly13
4 nemo7
5 rob5
6 roy8
```
- character(s)
```bash
$ cat file.txt | grep o
2 j`o`e2
3 m`o`lly13
4 nem`o`7
5 r`o`b5
6 r`o`y8
```

- strings of characters
```bash
$ cat file.txt | grep mo
3 `mo`lly13
4 ne`mo`7
```

---
# RegExp Patterns: Dot

```bash
$ cat file.txt
a.smith1
joe2
molly13
nemo7
rob5
roy8
```

- any character (denoted by a dot `.`)
```bash
$ cat file.txt | grep "o.."
j`oe2`
m`oll`y13
r`ob5`
r`oy8`
```
- an explicit dot can be expressed as `\.`
```bash
$ cat file.txt | grep "\."
a`.`smith1
```

---
# RegExp Patterns: Character Classes

```bash
$ cat file.txt
a.smith1
joe2
molly13
nemo7
rob5
roy8
```

- a class of characters (denoted `[]`)

  - "find all lines which contain `2`, `3` or `5`"
    ```bash
    $ cat file.txt | grep "[235]"
    joe`2`
    molly1`3`
    rob`5`
    ```

--

  - "find all lines where `o` is followed by either `e` or `y`"
    ```bash
    $ cat file.txt | grep "o[ey]"
    j`oe`2
    r`oy`8
    ```

---
# RegExp Patterns: Ranges I

```bash
$ cat file.txt
1 a.smith1
2 joe2
3 molly13
4 nemo7
5 rob5
6 roy8
```

- character classes can also be specified as ranges (e.g. `[a-z]` or `[0-9]`)

  - "find all lines with three lowercase letters (`[a-z]`) followed by a digit from `4` to `9`"
    ```bash
    $ cat file.txt | grep "[a-z][a-z][a-z][4-9]"
    4 n`emo7`
    5 `rob5`
    6 `roy8`
    ```
  - the repetition can be written more compactly with a count in curly braces (`\{3\}` by default, plain `{3}` with `grep -E`)
    ```bash
    $ cat file.txt | grep "[a-z]\{3\}[4-9]"
    4 n`emo7`
    5 `rob5`
    6 `roy8`
    ```
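
How the curly braces are written depends on which syntax `grep` is using: in the default (basic) syntax they are escaped, with `-E` (extended syntax) they are not. A short sketch over a temporary copy of the sample file, showing that both forms select the same lines:

```bash
# Temporary copy of the sample data from this slide
file=$(mktemp)
printf '%s\n' '1 a.smith1' '2 joe2' '3 molly13' '4 nemo7' '5 rob5' '6 roy8' > "$file"

# Basic syntax (the grep default): the braces are escaped
basic=$(grep '[a-z]\{3\}[4-9]' "$file")

# Extended syntax (grep -E): the braces are written plainly
extended=$(grep -E '[a-z]{3}[4-9]' "$file")

echo "$extended"
rm -f "$file"
```

Quoting the pattern also keeps the shell from interpreting the brackets as a file name pattern before `grep` sees them.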

---
# RegExp Patterns: Ranges II

```bash
$ cat file.txt
1 a.smith1
2 joe2
3 molly13
4 nemo7
5 rob5
6 roy8
```
- invert the class by putting `^` at the beginning of the definition (`[^ ]`)

  - "find all lines with three lowercase letters (`[a-z]`) followed by a character that is **not** a digit from `4` to `9`"
    ```bash
    $ cat file.txt | grep "[a-z][a-z][a-z][^4-9]"
    1 a.`smit`h1
    2 `joe2`
    3 `moll`y13
    4 `nemo`7
    ```

---

# RegExp Patterns: Repetitions
.left-eq-column[
```bash
$ cat text.txt
So, looking at the lock or the silk?
```

Repetitions can be applied to any character or character class.

]

.right-eq-column[
Three basic repetition operators:
- `\?`: match once or not at all
- `\+`: match **one or more** times
- `*`: match **zero or more** times

]

--

.clear-both[

---------

Match all `l`s followed by zero or one `o`:
```bash
$ cat text.txt | grep "lo\?"
So, `lo`oking at the `lo`ck or the si`l`k?
```

Match all `l`s followed by one or more `o`s:
```bash
$ cat text.txt | grep "lo\+"
So, `loo`king at the `lo`ck or the silk?
```

Match all `l`s followed by zero or more `o`s:
```bash
$ cat text.txt | grep "lo*"
So, `loo`king at the `lo`ck or the si`l`k?
```
]
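
To inspect exactly what each operator matched, the matches can be printed on their own. A sketch using `grep -o` (print only the matched parts) together with `-E`, where `?` and `+` are written without the backslash; the `matches` helper is a made-up name for this example.

```bash
text='So, looking at the lock or the silk?'

matches() {
    # Print every match of the extended regex $1 in the text, separated by spaces
    echo "$text" | grep -o -E "$1" | paste -s -d ' ' -
}

zero_or_one=$(matches 'lo?')
one_or_more=$(matches 'lo+')
zero_or_more=$(matches 'lo*')

echo "lo? -> $zero_or_one"     # lo lo l
echo "lo+ -> $one_or_more"     # loo lo
echo "lo* -> $zero_or_more"    # loo lo l
```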

---
# RegExp Patterns: Anchors

.left-eq-column[
```bash
$ cat file.txt
1 a.smith1
2 joe2
3 molly13
4 nemo7
5 rob5
6 roy8
```
]
.right-eq-column[
Anchors are two very important "special characters":

- `^`: match the beginning of the line
- `$`: match the end of the line
]

--

.clear-both[
.left-eq-column[
Find numbers at the beginning:
```bash
$ cat file.txt | grep "^[0-9]\+"
`1` a.smith1
`2` joe2
`3` molly13
`4` nemo7
`5` rob5
`6` roy8
```

]

.right-eq-column[
Find numbers at the end:
```bash
$ cat file.txt | grep "[0-9]\+$"
1 a.smith`1`
2 joe`2`
3 molly`13`
4 nemo`7`
5 rob`5`
6 roy`8`
```
]
]
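
Using both anchors at once forces the pattern to describe the whole line, which is handy for validation. A sketch on a temporary copy of the sample names (without the leading numbers), using `grep -E` so that `+` needs no backslash:

```bash
# Temporary copy of the sample names, one per line
file=$(mktemp)
printf '%s\n' a.smith1 joe2 molly13 nemo7 rob5 roy8 > "$file"

# The entire line must be: one or more lowercase letters, then exactly one digit
whole_line=$(grep -E '^[a-z]+[0-9]$' "$file")

echo "$whole_line"
rm -f "$file"
```

Here `a.smith1` is rejected because of the dot and `molly13` because it ends in two digits.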

---

# Using the `grep` command II

.left-eq-column[
- `grep PATTERNS FILE`

  - prints lines that match patterns

  - `-i`: make the search case-insensitive (**i**gnore-case)

  - `-v`: print lines that do not match the pattern (in**v**ert)

  - `-o`: output only the matched part of the line (**o**nly)

  - `-n`: include the line number in the output (**n**umber)
]

.right-eq-column[
```bash
$ cat file.txt
a.smith1
joe2
molly13
nemo7
rob5
roy8

$ cat file.txt | grep "[0-5]\$" -n
1:a.smith`1`
2:joe`2`
3:molly1`3`
5:rob`5`

$ cat file.txt | grep "[0-5]\$" -n -v
4:nemo7
6:roy8

$ echo "Hello World!" | grep -i world
Hello `World`!

$ echo "Hello World!" | grep -i world -o
World
```
]
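
The options can be freely combined. A small sketch on made-up input lines, storing each result in a variable so the effect of every combination is easy to compare:

```bash
# Made-up input, three lines
input=$(printf '%s\n' 'Hello World!' 'goodbye' 'hello again')

numbered=$(echo "$input" | grep -n -i 'hello')     # line numbers, ignoring case
only_match=$(echo "$input" | grep -o -i 'hello')   # just the matched text
inverted=$(echo "$input" | grep -v -i 'hello')     # lines without a match

echo "$numbered"
echo "$only_match"
echo "$inverted"
```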

---

class: middle, inverse

# Useful Commands

`cut` and `paste`

---

# `cut`

- cut out a field from a text file, based on some separator

- `-d DELIM` set a specific delimiter (TAB by default)

- `-f FIELDS`
  - specify fields (starting from 1) to cut out
  - can be a number (like `-f 2`) or a list (like `-f 2,5`)
  - or a `<from>-<to>` format (like `-f 2-4`)

```bash
$ cut -d: -f 3 /etc/group | tail -n 5
972
84
971
970
969

$ cut -d: -f 1,3 /etc/group | tail -n 5
flatpak:972
screen:84
firebird:971
nm-fortisslvpn:970
docker:969
```
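
`cut` pairs naturally with `grep` on colon-delimited files. A sketch over a hypothetical three-line file in the `/etc/passwd` format (the accounts are invented for this example):

```bash
# Hypothetical sample in the /etc/passwd format
passwd_file=$(mktemp)
cat > "$passwd_file" <<'EOF'
root:x:0:0:root:/root:/bin/bash
daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin
jsmith:x:1001:1000:Joe Smith:/home/jsmith:/bin/sh
EOF

# Fields 1 and 7: username and default shell
name_and_shell=$(cut -d: -f 1,7 "$passwd_file")

# Drop accounts whose shell ends in nologin, then keep only the username
login_users=$(grep -v 'nologin$' "$passwd_file" | cut -d: -f 1)

echo "$name_and_shell"
echo "$login_users"
rm -f "$passwd_file"
```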

---

# `paste`

.left-eq-column[
- join files horizontally (like horizontal `cat`)

- `-d` sets the delimiter (TAB by default)

- `-s` appends data in **s**erial rather than in parallel
]
.right-eq-column[
```bash
$ cat names.txt
Mark Smith
Bobby Brown
Sue Miller
Jenny Igotit

$ cat numbers.txt
555-1234
555-9876
555-6743
867-5309
```
]

.clear-both[

.left-eq-column[
```bash
$ paste names.txt numbers.txt
Mark Smith      555-1234
Bobby Brown     555-9876
Sue Miller      555-6743
Jenny Igotit    867-5309
```
]

.right-eq-column[
```bash
$ paste -d, names.txt numbers.txt
Mark Smith,555-1234
Bobby Brown,555-9876
Sue Miller,555-6743
Jenny Igotit,867-5309
```
]

]

.clear-both[
```bash
$ paste -s names.txt numbers.txt
Mark Smith      Bobby Brown     Sue Miller      Jenny Igotit
555-1234        555-9876        555-6743        867-5309
```
]
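
A common use of `-s` together with `-d` is collapsing a column into a single delimited line, which `cut` can later take apart again. A sketch on a temporary copy of the sample numbers:

```bash
# Temporary copy of the sample phone numbers
numbers=$(mktemp)
printf '%s\n' 555-1234 555-9876 555-6743 867-5309 > "$numbers"

# -s joins all lines of the file into one row, -d, separates them with commas
one_line=$(paste -s -d, "$numbers")
echo "$one_line"

# cut reverses the step: take field 2 of the comma-separated row
second=$(echo "$one_line" | cut -d, -f 2)
echo "$second"
rm -f "$numbers"
```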

???

Example taken straight from the great Wikipedia:

https://en.wikipedia.org/wiki/Paste_(Unix)
