Git

Introduction to (source code) version control

Marek Šuppa
Ondrej Jariabka
Adrián Matejov

Why git for Data Science?

  • Standardized way of tracking your code and analyses (plus history thereof)
  • Helps avoid "versioning hell" (you know, files like essay.doc, essay_v2.doc, essay_final.doc)

  • Gives you the ability to "jump in time"

  • Helps you make your work "reproducible"

  • Makes it a bit more straightforward to work on common (larger) projects with others

Git

The Global Information Tracker(TM)

Linus actually claims it does not mean anything...

Git

  • A distributed version control system

  • Will keep track of the changes you make to the files it tracks

  • If you screw things up (e.g. accidentally remove a file), you can get back to a previous state

  • Allows for these changes to be easily transferred to others

  • Originally designed as a source code version control system for the Linux kernel

  • Free and open-source software distributed under the GNU GPLv2 license

  • Currently the de facto standard for source code versioning

Git: initializing the repository

  • Git stores its metadata (along with "snapshots") in a special .git folder

  • A folder which contains this .git folder is called a "repository"

  • A new repository can be initialized using the git init command

$ mkdir repo
$ cd repo
$ git init
Initialized empty Git repository in /tmp/repo/.git/
$ ls -alh
total 84K
drwxrwxr-x 3 mrshu mrshu 4.0K Nov 28 14:39 .
drwxrwxrwt 108 root root 72K Nov 28 14:39 ..
drwxrwxr-x 7 mrshu mrshu 4.0K Nov 28 14:39 .git

Git: recording changes

  • Files in the repository can be
    • untracked
    • tracked (Git knows about it)
  • If a file is tracked, it can also be
    • unmodified
    • modified
    • staged
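
One way to picture these states is as a comparison of a file's content in three places: the last commit, the staging area, and the working directory. The Python sketch below is a simplified illustration only; the function file_state and its parameters are made up for this example, and real Git can report a file as both staged and modified at once.

```python
from typing import Optional


def file_state(committed: Optional[str], staged: Optional[str], working: str) -> str:
    """Classify a file by comparing its content in three places.

    committed: content in the last commit (None if the file is not in it)
    staged:    content in the staging area (None if Git does not track the file)
    working:   content currently on disk
    """
    if staged is None:
        return "untracked"
    if working != staged:
        # Simplification: real Git would also list a staged part, if any.
        return "modified"
    if staged != committed:
        return "staged"
    return "unmodified"


# Hypothetical life cycle of a README file.
print(file_state(None, None, "My Analysis\n"))             # untracked
print(file_state(None, "My Analysis\n", "My Analysis\n"))  # staged
print(file_state("My Analysis\n", "My Analysis\n", "My Analysis\n"))  # unmodified
print(file_state("My Analysis\n", "My Analysis\n", "My Analysis\nmore\n"))  # modified
```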


Git: recording changes II

  • To find out what the status of files in your repository is, use git status
$ git status
On branch master
No commits yet
nothing to commit (create/copy files and use "git add" to track)
  • Let's add some content to a file and see what the status looks like
$ echo "My Analysis" > README
$ git status
On branch master
No commits yet
Untracked files:
(use "git add <file>..." to include in what will be committed)
README
nothing added to commit but untracked files present (use "git add" to track)

Git: tracking new files

  • As the git status said, we can use git add to track a file
$ git add README
$ git status
On branch master
No commits yet
Changes to be committed:
(use "git rm --cached <file>..." to unstage)
new file: README
  • And to commit the "staged" changes, we can run git commit
$ git commit
[master (root-commit) b3d8a54] Add README
1 file changed, 1 insertion(+)
create mode 100644 README

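Note: git commit opens your default editor (most likely vim) so that you can write the commit message, here "Add README". Running git commit -m "Add README" passes the message directly and skips the editor.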

Git: tracking new files II

The whole process once again, visualized in a pretty picture

Git: tracking files again

Let's use the same process to add one more file.

$ echo "Licensed under the terms of the CC-0 license." > LICENSE

The git commands are essentially the same:

$ git status
On branch master
Untracked files:
(use "git add <file>..." to include in what will be committed)
LICENSE
nothing added to commit but untracked files present (use "git add" to track)
$ git add LICENSE
$ git commit
[master 98bef79] Add LICENSE
1 file changed, 1 insertion(+)
create mode 100644 LICENSE

Git: tracking changes

Suppose we change one of the files we track:

$ echo -e "\nThis repo contains the analysis of git usage." >> README

Running git status shows what has happened:

$ git status
On branch master
Changes not staged for commit:
(use "git add <file>..." to update what will be committed)
(use "git restore <file>..." to discard changes in working directory)
modified: README
no changes added to commit (use "git add" and/or "git commit -a")

Git: tracking changes II

Using the git diff command, we can see what has changed:

$ git diff
diff --git a/README b/README
index dd0c36f..302b24f 100644
--- a/README
+++ b/README
@@ -1 +1,3 @@
My Analysis
+
+This repo contains the analysis of git usage.

And we can again add the change in, just like before:

$ git add README
$ git commit
[master cd18d6e] Add some more description to README
1 file changed, 2 insertions(+)

Git: seeing what's happened

  • To see the history of a git repository, we can run git log:
$ git log
commit 96cee8d998f7306527fa360cb2dda6edb1dffc2f (HEAD -> master)
Author: mrshu <mr@shu.io>
Date:   Mon Nov 30 13:30:45 2020 +0000

    Add some more description to README

commit 98bef799dad1374e9a6bdd3cb0e31ab98d90f028
Author: mrshu <mr@shu.io>
Date:   Sat Nov 28 21:28:28 2020 +0000

    Add LICENSE

commit b3d8a54c03255fa93355edc78c3494e4b4c4ef4a
Author: mrshu <mr@shu.io>
Date:   Sat Nov 28 20:54:02 2020 +0000

    Add README

Git: intro commands review

  • git init

    • initializes (an empty) repository
  • git status

    • shows the status of the repository (modified, added, untracked files)
  • git diff

    • shows the changes that have not been staged yet (working directory compared to the staging area)
  • git add [file]

    • stage [file] for the next commit (this also starts tracking a new file)
  • git commit

    • record the staged changes as a new snapshot

How Git works on the inside

A brief introduction to Git internals

Git: some fundamentals

  • Git stores snapshots, not diffs between commits

    • (This makes Git fast)

  • Everything in Git is referenced by (and validated via) cryptographic hashes

    • Every object is identified by the SHA-1 hash of its contents

    • Among other things, this ensures the integrity of the data
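
To make the hashing concrete, here is a minimal Python sketch of how the identifier of a file's contents (a "blob") can be computed: Git hashes a short header (the word blob, the content length, and a NUL byte) followed by the content itself. The function name git_blob_id is made up for this illustration.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 object id Git would assign to a blob with this content."""
    # Git prepends the header "blob <size>\0" before hashing the raw bytes.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


# An empty file always gets the same well-known id.
empty_id = git_blob_id(b"")
print(empty_id)  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

The same content always yields the same id, so a corrupted object no longer matches its name. For a real file, git hash-object <file> should print the same value.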

Git: what's inside a commit

  • commit: a pointer to a tree, combined with metadata
  • tree: a set of pointers to specific blobs (and to other trees, for subdirectories)
  • blob: a specific version of a file
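
The relationship between the three object types can be sketched with a few Python dataclasses. This is a toy model under simplifying assumptions, not Git's actual storage format: the class and field names are invented here, trees are flat, and objects are linked directly instead of by hash.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Blob:
    content: bytes  # one specific version of a file


@dataclass
class Tree:
    # Maps a file name to the blob holding that file's content.
    entries: Dict[str, Blob] = field(default_factory=dict)


@dataclass
class Commit:
    tree: Tree  # the snapshot this commit points to
    message: str  # metadata
    author: str  # metadata
    parent: Optional["Commit"] = None


readme = Blob(b"My Analysis\n")
first = Commit(Tree({"README": readme}), "Add README", "mrshu")

license_blob = Blob(b"Licensed under the terms of the CC-0 license.\n")
second = Commit(
    Tree({"README": readme, "LICENSE": license_blob}),
    "Add LICENSE",
    "mrshu",
    parent=first,
)

# The unchanged README is the very same blob in both snapshots.
print(second.tree.entries["README"] is first.tree.entries["README"])  # True
print(sorted(second.tree.entries))  # ['LICENSE', 'README']
```

Reusing the blob for an unchanged file is what keeps full snapshots cheap.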

Git: what's inside a repository

Each commit links to its "parent" (if it has one).

In its simplest form, a repository is a set of linked commits.
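
Following the parent links from the newest commit back to the first one is, in essence, what git log does. A rough Python sketch of that walk follows; the COMMITS table is a hand-written stand-in for the object store, filled with the abbreviated hashes of the example repository, and the function name history is made up.

```python
from typing import Dict, Iterator, Optional, Tuple

# Toy commit table: abbreviated hash -> (parent hash or None, message).
COMMITS: Dict[str, Tuple[Optional[str], str]] = {
    "b3d8a54": (None, "Add README"),
    "98bef79": ("b3d8a54", "Add LICENSE"),
    "96cee8d": ("98bef79", "Add some more description to README"),
}


def history(start: str) -> Iterator[Tuple[str, str]]:
    """Yield (hash, message) pairs from `start` back to the root commit."""
    current: Optional[str] = start
    while current is not None:
        parent, message = COMMITS[current]
        yield current, message
        current = parent


for commit_hash, message in history("96cee8d"):
    print(commit_hash, message)
```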

But how does Git know which commit we are currently on?

Git: look where HEAD is

HEAD is a special file that says which commit the repository is currently pointing to.

$ cat .git/HEAD
ref: refs/heads/master

As shown above, HEAD can point to a "reference" -- another file that holds the actual commit hash.

$ cat .git/refs/heads/master
96cee8d998f7306527fa360cb2dda6edb1dffc2f

These references are branches (stored under refs/heads) or tags (stored under refs/tags).
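
The two-step lookup can be written down in a few lines of Python. This is only a sketch under simplifying assumptions: the function name resolve_head is made up, and it ignores details such as packed references, which is why git rev-parse HEAD remains the reliable way to ask a real repository. The example builds a throwaway directory that mimics the two files shown above.

```python
import tempfile
from pathlib import Path


def resolve_head(git_dir: Path) -> str:
    """Return the commit hash that HEAD currently resolves to."""
    head = (git_dir / "HEAD").read_text().strip()
    if head.startswith("ref: "):
        # Symbolic HEAD: follow the reference to the file holding the hash.
        ref_path = git_dir / head[len("ref: "):]
        return ref_path.read_text().strip()
    # Otherwise HEAD holds a commit hash directly (a "detached HEAD").
    return head


with tempfile.TemporaryDirectory() as tmp:
    git_dir = Path(tmp) / ".git"
    (git_dir / "refs" / "heads").mkdir(parents=True)
    (git_dir / "HEAD").write_text("ref: refs/heads/master\n")
    (git_dir / "refs" / "heads" / "master").write_text(
        "96cee8d998f7306527fa360cb2dda6edb1dffc2f\n"
    )
    resolved = resolve_head(git_dir)
    print(resolved)
```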

Git branching

A quick look at (probably) the most famous Git feature.

Git: creating new branch

Creating a new branch is easy -- just run git branch [branchname].

Internally, Git creates a new pointer with the name of your branch.

It will point to the same commit as HEAD did at the time.

$ git branch testing
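
Seen as data, creating a branch is just adding one more name for an existing commit. The Python sketch below models the references as a plain dictionary; the names refs, head and create_branch are invented for this illustration and do not correspond to Git's own code.

```python
from typing import Dict

# Toy model of the reference store: branch name -> commit hash.
refs: Dict[str, str] = {"master": "96cee8d"}
head = "master"  # HEAD names the branch we are currently on


def create_branch(name: str) -> None:
    """Add a new pointer to the commit HEAD resolves to; HEAD itself stays put."""
    if name in refs:
        raise ValueError(f"a branch named '{name}' already exists")
    refs[name] = refs[head]


create_branch("testing")
print(refs)  # {'master': '96cee8d', 'testing': '96cee8d'}
print(head)  # master
```

No files are copied and no commit is created, which is why branching is cheap.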

Git: switching branches

As the git log output below shows, we are still on the master branch:

$ git log --oneline --decorate
96cee8d (HEAD -> master, testing) Add some more description to README
98bef79 Add LICENSE
b3d8a54 Add README

But we can easily switch with git checkout:

$ git checkout testing
Switched to branch 'testing'
$ git log --oneline --decorate
96cee8d (HEAD -> testing, master) Add some more description to README
98bef79 Add LICENSE
b3d8a54 Add README

And check which branch we are on with git branch:

$ git branch
  master
* testing

Git: switching branches II

Here is what the situation looks like, visually.

Before:

After (running git checkout testing):



git checkout can also be used on anything else that resolves to a Git commit (such as tags or commit hashes).

Git: working in branches

Let's suppose we make some changes in the current (testing) branch

$ echo "print('Analysis is done in here')" > analysis.py

And commit them to the repository.

$ git add analysis.py
$ git commit
[testing f0aa1ae] Add analysis.py
1 file changed, 1 insertion(+)
create mode 100644 analysis.py

Visually, the situation will look as follows:

Git: working in branches II

But what if we'd like to go back and make licensing clearer in README?

Not a big deal. We'll check out master and make the changes there.

$ git checkout master
Switched to branch 'master'
$ echo -e "\n\nThis project is released to the public domain." >> README
$ git diff
diff --git a/README b/README
index 302b24f..82b5c99 100644
--- a/README
+++ b/README
@@ -1,3 +1,6 @@
My Analysis
This repo contains the analysis of git usage.
+
+
+This project is released to the public domain.
$ git add README
$ git commit
[master f9aa801] Update README to mention licensing
Date: Mon Nov 30 14:24:18 2020 +0000
1 file changed, 3 insertions(+)

Git: working in branches III

By making changes in both the master and the testing branch, we have created a so-called "divergent history".

$ git log --oneline --decorate --graph --all
* f9aa801 (HEAD -> master) Update README to mention licensing
| * f0aa1ae (testing) Add analysis.py
|/
* 96cee8d Add some more description to README
* 98bef79 Add LICENSE
* b3d8a54 Add README
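
The commit where the two branches split (96cee8d here) is their most recent common ancestor, which Git uses as the base when it merges; git merge-base master testing reports it. The Python sketch below finds it in a hand-written copy of the graph above. It is a simplification that assumes a simple history like this one, and the names PARENTS, ancestors and common_ancestor are made up for the example.

```python
from typing import Dict, List, Optional, Set

# Toy commit graph for the diverged history: hash -> list of parent hashes.
PARENTS: Dict[str, List[str]] = {
    "b3d8a54": [],
    "98bef79": ["b3d8a54"],
    "96cee8d": ["98bef79"],
    "f0aa1ae": ["96cee8d"],  # tip of testing
    "f9aa801": ["96cee8d"],  # tip of master
}


def ancestors(commit: str) -> Set[str]:
    """Return the commit itself plus everything reachable via parent links."""
    seen: Set[str] = set()
    stack = [commit]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS[current])
    return seen


def common_ancestor(a: str, b: str) -> Optional[str]:
    """Walk back from `a` and return the first commit also reachable from `b`."""
    reachable_from_b = ancestors(b)
    queue = [a]
    visited: Set[str] = set()
    while queue:
        current = queue.pop(0)
        if current in reachable_from_b:
            return current
        if current not in visited:
            visited.add(current)
            queue.extend(PARENTS[current])
    return None


print(common_ancestor("f9aa801", "f0aa1ae"))  # 96cee8d
```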

Git: merging branches together

To get out of the divergent history state, we can merge the histories together.

$ git merge testing
Merge made by the 'recursive' strategy.
analysis.py | 1 +
1 file changed, 1 insertion(+)
create mode 100644 analysis.py

And here is what the history looks like now:

$ git log --oneline --decorate --graph --all
* 0c1dc34 (HEAD -> master) Merge branch 'testing'
|\
| * f0aa1ae (testing) Add analysis.py
* | f9aa801 Update README to mention licensing
|/
* 96cee8d Add some more description to README
* 98bef79 Add LICENSE
* b3d8a54 Add README

Once we are done with it, we can also delete the testing branch:

$ git branch -d testing
Deleted branch testing (was f0aa1ae).

Git: merging branches together II

Here is what the situation looked like before:

And after:

Git: branching recap

  • git branch

    • lists the branches and marks the one you are on
  • git branch [branchname]

    • creates a new branch pointing at the current HEAD commit
  • git checkout [branchname]

    • switches to the [branchname] branch
  • git log --oneline --decorate --graph --all

    • draws the history of all branches as a graph
  • git merge [branchname]

    • merges [branchname] into the current branch
  • git branch -d [branchname]

    • deletes a (merged) branch

Git: working with remotes

Using Git to collaborate with others

Git: cloning a repository

$ git clone https://gitlab.com/vidriduch/davos-hall-of-fame.git
Cloning into 'davos-hall-of-fame'...
remote: Enumerating objects: 7, done.
remote: Counting objects: 100% (7/7), done.
remote: Compressing objects: 100% (5/5), done.
remote: Total 7 (delta 0), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (7/7), 606 bytes | 101.00 KiB/s, done.

This creates a new directory called davos-hall-of-fame, which contains the cloned repository.

$ cd davos-hall-of-fame/

Inside it, we can create a new branch in which we'll make our changes:

$ git checkout -b mrshu/hall-of-fame
Switched to a new branch 'mrshu/hall-of-fame'

(git checkout -b creates a new branch and switches into it right away)

Git: pushing changes up

Let's add my name to hall_of_fame.md:

$ echo "mrshu" >> hall_of_fame.md
$ git diff
diff --git a/hall_of_fame.md b/hall_of_fame.md
index e69de29..01dd831 100644
--- a/hall_of_fame.md
+++ b/hall_of_fame.md
@@ -0,0 +1 @@
+mrshu

And let's commit it in.

$ git add hall_of_fame.md
$ git commit
[mrshu/hall-of-fame 22573c5] Add mrshu to hall_of_fame.md
1 file changed, 1 insertion(+)

And push it to the remote:

$ git push --set-upstream origin mrshu/hall-of-fame

Git: remotes visually

Git != GitHub != GitLab

  • Git as a technology is completely independent from the "web frontends", such as GitHub and GitLab

  • By learning to use Git you learn the fundamentals that power all of them

  • GitHub and/or GitLab are businesses

    • Businesses tend to go under.
    • Git is free and open-source. It will most probably survive for a bit longer.