Files and directories

Finding your way around the disk and the files on it.

Marek Šuppa
Ondrej Jariabka
Adrián Matejov

Why Linux for Data Science?

The Data Science Venn Diagram

What's on this computer?

  • lscpu -e

    • displays information about the available CPUs

    • -e (--extended) prints the output as a human-readable table with one row per CPU

  • free

    • report on how RAM is being used

    • by default reports the values in kibibytes (KiB == 1024 B)

    • normally used as free -m (MiB: mebibytes)

    • can also report megabytes (1000² bytes) via free --mega

  • top / htop

    • shows the combined CPU and RAM information in one "graphical" application
    • also lists the running processes and gives you a quick answer to "what is running on this computer"
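
The difference between the units that free reports can be checked with a bit of shell arithmetic. This is a minimal sketch: the reading of 8000000 KiB and the variable names are made up for illustration.

```shell
# Hypothetical reading: 8000000 KiB of total RAM, the default unit of `free`.
total_kib=8000000

total_bytes=$((total_kib * 1024))         # 1 KiB = 1024 bytes
total_mib=$((total_kib / 1024))           # 1 MiB = 1024 KiB (integer division truncates)
total_mb=$((total_bytes / 1000 / 1000))   # 1 MB  = 1000^2 bytes

echo "$total_kib KiB = $total_mib MiB = $total_mb MB"
# prints: 8000000 KiB = 7812 MiB = 8192 MB
```

The same amount of memory therefore shows up as a smaller number in mebibytes than in megabytes.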

What devices are available on this computer?

  • lspci

    • lists all PCI devices (connected to PCI buses)
  • lsusb

    • lists all USB ports and devices connected to them
  • lsblk

    • lists all block devices (basically all disks)
    • -f flag makes the output show a bit more about the filesystem (like its type or usage)

Block (data) devices

Massive variability:

  • floppy disk

  • CD / DVD

  • network disk (Samba, NFS, ...)

  • hard disk

  • USB disk (of various types)

Disk partitioning

  • Allows one big disk to be "logically" split into smaller parts

  • These partitions are handled as "independent disks"

  • Allows for specific parts of the filesystem to be dealt with differently

    • Separate user data from system files

    • Different filesystem types on different partitions


Image from Wikipedia

What is a file system?

  • A way of managing

    1. where a piece of data starts and finishes on a block device (disk)

    2. what its name is (filename), when it was created, ...

    3. where the user can find it (what directory it resides in)

    4. on a specific device (e.g. optical discs vs. flash drives vs. hard disks)

Standard file system types you are likely to encounter:

  • Windows: FAT32, NTFS, ReFS
  • macOS: APFS
  • Linux: EXT2, EXT3, EXT4, XFS, btrfs

Folder structure in Linux filesystems

  • Directories organized in a tree

  • There is one central "root" directory

    • It is denoted / (forward slash) but also called "filesystem root"

  • All the other (non-root) "data devices" are attached at so-called "mountpoints"

    • Normally an empty directory
    • Once the device is mounted, the directory contains its contents

    • This is in contrast to "disks" (separate partitions) on Windows (with names like C:\, D:\ and so on)

  • Folders in a path are also separated by a forward slash (/)

    • For instance /home/jane/Documents/homework.txt

    • Note that Windows uses the other slash: backslash (\)

graph TD
root("/")
root --> boot("boot")
root --> dev("dev")
root --> media("media")
media --> cdrom[cdrom]
media --> usb["USB"]
root --> proc("proc")
root --> etc("etc")
root --> usr("usr")
usr --> usrbin("bin")
root --> bin("bin")
root --> home("home")
home --> jane("jane")
jane --> Documents
jane --> Downloads
jane --> Photos
jane --> Music
Documents --> homework([homework.txt])
root --> tmp("tmp")

Folder /home

  • The home directories of the system's users

    /home/jane

  • Contains various user data, configuration files, possibly even user's applications

  • Also denoted via the tilde ~ sign (which you see after you log in)

  • Prepended to a username, it indicates that user's home directory

    • for instance ~jane would mean /home/jane in our example

The tilde sign is a nice example of how the circumstances in which technologies were created shape decisions we live with even now.

Here is what Wikipedia says:

This convention derives from the Lear-Siegler ADM-3A terminal in common use during the 1970s, which happened to have the tilde symbol and the word "Home" (for moving the cursor to the upper left) on the same key.

https://en.wikipedia.org/wiki/Tilde#Computing

Folder /etc

  • From the Latin et cetera (and so on)

  • Configuration of the whole system

  • Things you can find here:

    • scripts that are executed on system startup

    • list of users and their hashed passwords

    • network settings

    • list of available shells

    • filesystem settings (i.e. where to mount which disk)

    • ...

  • Basically the nerve centre of the whole system

Folder /proc

  • Information about the system and the processes running on it in the form of text files

    • Implemented as a special "virtual filesystem", mounted to the /proc folder
  • /proc/version: version of the kernel that's running

  • /proc/cpuinfo: information about CPU

  • /proc/meminfo: information about RAM
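
Because these are ordinary text files, the usual text tools work on them. A minimal sketch, assuming a Linux system whose /proc/cpuinfo has one "processor" line per logical CPU (this holds on common x86 and ARM machines); the variable names are illustrative:

```shell
# Kernel version string (a single line of text).
kernel_version=$(cat /proc/version)
echo "$kernel_version"

# Number of logical CPUs: count the "processor" entries in /proc/cpuinfo.
cpu_count=$(grep -c '^processor' /proc/cpuinfo)
echo "CPUs: $cpu_count"

# Total RAM in KiB, taken from the MemTotal line of /proc/meminfo.
mem_total_kib=$(awk '/^MemTotal:/ { print $2 }' /proc/meminfo)
echo "RAM: $mem_total_kib KiB"
```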

Folder /boot

  • Files (and directories) necessary for the system boot

  • Generally contains the kernel and often the initial RAM filesystem (initramfs) that is used during early boot, before the real root filesystem is mounted

Folders /bin and /usr/bin

  • Both contain applications that can be executed from the command line

  • /bin is reserved for system applications (like who or whoami)

  • /usr/bin contains applications that are not system-critical (like a web browser for instance)

  • /usr has many more interesting subdirectories; these generally contain files necessary for running the applications installed in /usr/bin

Folder /tmp

  • Temporary files of the system and its users

  • These are not expected to "survive a reboot"

  • On many Linux distributions, this means that the files are not written to disk but stay in RAM

Generally implemented via tmpfs -- https://en.wikipedia.org/wiki/Tmpfs

Folder /dev

  • As the name suggests, it contains "devices"

    • These behave as (special) files on the filesystem

    • There are two types

  • Character devices

    • A stream of characters
    • /dev/tty0, /dev/input/by-path/platform-thinkpad_acpi-event
  • Block devices

    • Provide access to blocks of data (useful for disks)
    • /dev/sda1, /dev/sda2
  • Also contains "pseudo devices"

    • /dev/null: accepts and discards all input (basically your own black hole)
    • /dev/random: produces a continuous stream of random data
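
The pseudo devices can be tried out directly from the shell. A small sketch: it reads from /dev/urandom, the non-blocking sibling of /dev/random (an assumption added here, it is not named on the slide), and the variable name is illustrative.

```shell
# Anything written to /dev/null is discarded; the command still succeeds.
echo "this text disappears" > /dev/null

# /dev/null is a character device, which the -c test confirms.
if [ -c /dev/null ]; then
    echo "/dev/null is a character device"
fi

# Read exactly 8 random bytes and print them as 16 hexadecimal digits.
random_hex=$(head -c 8 /dev/urandom | od -An -tx1 | tr -d ' \n')
echo "random bytes: $random_hex"
```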

Mounting devices

  • To attach (mount) a device, we'd run
$ mount [device] [folder]

so for instance

$ mount /dev/sdc /media/MyUSBKey
  • To detach (unmount) a device, we'd run
$ umount [folder]

so for instance

$ umount /media/MyUSBKey
  • Currently mounted devices can be found by running just
$ mount
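
Mounting and unmounting usually require root, but looking at what is mounted does not. On Linux the kernel also exposes the mount list in /proc/mounts, which is easier to process than the output of mount. A sketch under that assumption, with illustrative variable names:

```shell
# Columns of /proc/mounts: device, mountpoint, filesystem type, options, ...
# Print the device and filesystem type of whatever is mounted at the root (/).
root_mount=$(awk '$2 == "/" { print $1, $3 }' /proc/mounts)
echo "root filesystem: $root_mount"

# Count how many filesystems are mounted in total.
mount_count=$(wc -l < /proc/mounts)
echo "mounted filesystems: $mount_count"
```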

Mounting devices: hotplug

  • Mounting by hand is not necessary in normal usage: things like USB disks are mounted automatically (hotplug)

    • Each disk will get its own temporary directory (on Fedora these are in /run/media/<USER>, on other distros in /media)

    • After the disk gets detached, the folder disappears (gets removed)

  • Still, before unplugging the disk, it is necessary to run umount (or at least sync) to ensure all the data has been written to it

Useful commands: disks

  • df

    • reports disk usage
  • df -h

    • make the report human readable
    • defaults to powers of 1024, -H will give you powers of 1000
  • df --total

    • compute the grand total of all available disks as well
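
df also accepts a path and then reports only on the filesystem that holds it. A short sketch: the -P flag (POSIX output format) is not mentioned above and is used here only to get a stable column layout, and the variable name is illustrative.

```shell
# Usage of the filesystem that holds the current directory, in powers of 1024.
df -h .

# The same report in powers of 1000.
df -H .

# Keep only the capacity column (5th field) of the data line.
used_percent=$(df -P . | awk 'NR == 2 { print $5 }')
echo "used: $used_percent"
```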

Useful commands: filesystem navigation

  • pwd

    • show the path to the current directory
  • ls

    • list the content of the current directory
  • cd [directory]

    • change the current working directory (where you currently are in the filesystem) to [directory]

    • [directory] can be either relative or absolute

cd # go to the home directory
cd ~ # go to the home directory
cd /home/jane/Documents # go to /home/jane/Documents
cd .. # go one level above
cd Documents # go to the directory `Documents` in the current folder
cd - # go to the previously visited directory
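
The commands above can be tried safely in a throwaway directory tree. A minimal sketch: the sandbox variable and the tree created by mktemp are made up for illustration, so nothing in a real home directory is touched.

```shell
# Build a temporary tree that mimics /home/jane.
sandbox=$(mktemp -d)
mkdir -p "$sandbox/home/jane/Documents" "$sandbox/home/jane/Downloads"

cd "$sandbox/home/jane/Documents"   # absolute path
cd ..                               # one level up: .../home/jane
cd Downloads                        # relative path: .../home/jane/Downloads
cd - > /dev/null                    # previously visited directory: .../home/jane
pwd                                 # ends with /home/jane
```

cd - prints the directory it switches to, which is why its output is sent to /dev/null here.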

Aside: absolute vs. relative paths

  • Absolute paths start from the root (/)

    • For instance /home/jane/Downloads/homework.pdf or /usr/bin/whoami

    • No matter where you are on the disk, it will always resolve to the same place

  • Relative paths do not start from the root (/) but they are resolved from the current working directory

    • Special path . represents the current directory and .. the parent directory
$ cd /home/joe
$ cat Documents/homework.txt
This is joe's homework in /home/joe/Documents/homework.txt
$ cd /home/jane
$ cat ./Documents/homework.txt
This is jane's homework in /home/jane/Documents/homework.txt
$ cd - # go to the previously visited directory
$ pwd
/home/joe

Aside: hidden files and directories

  • Files and folders whose name starts with a dot (.) are ignored by ls

  • They still exist on the disk but to list them, one needs to use the -a or -A flag

$ ls
a.txt b.txt c.txt data.dat test.txt
$ ls -a
. .. a.txt b.txt c.txt data.dat .hidden_file test.txt
  • Note the first two "special paths" in the listing

    • . represents current directory

    • .. represents parent directory

  • These can also be chained together

$ cd ../../../../ # move four directories higher
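
The difference between the two flags is that -A leaves out the . and .. entries. A small sketch in a throwaway directory, assuming GNU ls; the sandbox variable and file names are illustrative.

```shell
sandbox=$(mktemp -d)
cd "$sandbox"
touch a.txt b.txt .hidden_file

visible=$(ls | wc -l)         # 2: a.txt and b.txt
with_dots=$(ls -a | wc -l)    # 5: adds ., .. and .hidden_file
almost_all=$(ls -A | wc -l)   # 3: like -a but without . and ..

echo "ls: $visible, ls -a: $with_dots, ls -A: $almost_all"
```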

Bash command/path expansion

Bash was written by "hackers" for "hackers": efficiency is one of its core tenets

  • It is not necessary to type out the full command name or the full path

  • After <Tab> gets pressed, Bash will try to autocomplete the rest of the command or file path

  • If there are multiple options, it will show them all

  • Depending on the configuration, you can cycle through them by pressing <Tab> (and cycle back with <Shift-Tab>)

  • For example:
$ lsc<Tab>

will become

$ lscpu

and

$ cd Doc<Tab>

will become

$ cd Documents/

(provided you are in a directory with a Documents/ directory in it.)

Bash history

  • The history of all commands you type into Bash is (normally) saved to ~/.bash_history

  • You can get to them by:

    1. Pressing the up/down arrow keys
    2. Executing the history command
  • To search in the history, press Ctrl-r and start typing a part of the command (press Ctrl-r again to cycle through older matches)

Useful commands: files and directories

  • mkdir [directory]

    • creates a directory named [directory]
  • rmdir [directory]

    • removes an empty directory [directory] (if it's not empty, it'll let you know)
  • rm [file or directory]

    • removes a file or, with -r, a directory

    • the -r flag makes rm apply itself recursively to the subfolders of its argument

$ rm -r /home/jane

would remove every file in /home/jane, along with any directories and their content
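
A safer way to see how rmdir and rm -r differ is to try them in a throwaway directory. A minimal sketch: the sandbox variable and the project directory are made up for illustration.

```shell
sandbox=$(mktemp -d)
cd "$sandbox"

mkdir project
touch project/notes.txt

# rmdir refuses to remove a directory that still has content.
if ! rmdir project 2> /dev/null; then
    echo "rmdir failed: project is not empty"
fi

# rm -r removes the directory together with everything inside it.
rm -r project

if [ ! -e project ]; then
    echo "project is gone"
fi
```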

Useful commands: files and directories

  • cp [source1] [source2] [target]

    • copies files and folders from [source] to [target]

    • there can be multiple sources

    • -r makes cp recursively copy the subfolders as well (without it, cp skips directories and tells you so)

  • mv [source1] [source2] [target]

    • moves files and directories from [source] to [target]

    • there can be multiple sources
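
Both commands can be tried in a throwaway directory. A short sketch: the sandbox variable and all file and directory names are illustrative.

```shell
sandbox=$(mktemp -d)
cd "$sandbox"
mkdir tests backup archive
echo "first" > tests/a.txt
echo "second" > b.txt
echo "third" > c.txt

# Several sources, one target directory: both files are copied into backup/.
cp b.txt c.txt backup

# -r is needed to copy a directory; tests/ ends up as backup/tests/.
cp -r tests backup

# mv takes multiple sources too; the originals no longer exist afterwards.
mv b.txt c.txt archive

ls backup archive
```

When there are several sources, the target has to be an existing directory.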

Special paths and wildcards

  • Both [source] and [target] can be any paths, relative or absolute, even the special ones like . and ..
$ cd /home/jane
$ cp -r /tmp/tests . # Recursively copies /tmp/tests to /home/jane
$ cd /home/joe/Documents
$ cp -r /tmp/tests .. # Recursively copies /tmp/tests to /home/joe
  • When working with paths, we can use wildcards like ? and *
$ ls
a.txt b.txt c.txt data.dat test.txt
$ ls *.txt
a.txt b.txt c.txt test.txt
$ ls ?.txt
a.txt b.txt c.txt
  • In the last example ?.txt is expanded into all matching objects in the directory (works with files and directories alike)

  • In other words ls ?.txt is equivalent to ls a.txt b.txt c.txt
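
Since the shell does the expansion before the command runs, echo is a handy way to preview what a pattern will match. A minimal sketch in a throwaway directory; the sandbox variable is illustrative and the file names mirror the listing above.

```shell
sandbox=$(mktemp -d)
cd "$sandbox"
touch a.txt b.txt c.txt data.dat test.txt

# echo shows exactly the arguments that ls would receive.
echo ?.txt    # a.txt b.txt c.txt
echo *.txt    # a.txt b.txt c.txt test.txt

# Count the matches by loading them into the positional parameters.
set -- ?.txt
echo "single-character matches: $#"
```

Previewing a pattern with echo is a cheap habit to pick up before passing the same pattern to rm.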

Final caveat

There is no undo button here

Contrary to popular belief, Unix is user friendly. It just happens to be very selective about who it decides to make friends with.

-- unknown

These systems expect you to know what you are doing.

Please use rm -r sparingly and with care.
