name: inverse
layout: true
class: center, middle, inverse
---
# Git

Introduction to (source code) version control

.footnote[Marek Šuppa
Ondrej Jariabka
Adrián Matejov]

---
layout: false

# Why `git` for Data Science?

- Standardized way of tracking your code and analyses (plus the history thereof)

--

- Helps avoid "versioning hell" (you know, files like `essay.doc`, `essay_v2.doc`, `essay_final.doc`)

--

- Gives you the ability to "jump in time"

--

- Helps you make your work "reproducible"

--

- Makes it a bit more straightforward to work on shared (larger) projects with others

---
class: middle, center, inverse

# Git

The Global Information Tracker(TM)

???

Linus actually claims it does not mean anything...

---

# Git

- A **distributed** version control system

--

- Will **keep track of the changes** you make to the files it tracks

--

- If you screw things up (e.g. accidentally remove some file), you can **get back to a previous state**

--

- Allows for these **changes** to be **easily transferred** to others

--

- Originally designed as a source code version control system for the **Linux kernel**

--

- **Free and open-source** software distributed under the GNU GPLv2 license

--

- Currently the de facto standard for source code versioning

---

# Git: initializing the repository

- Git stores its metadata (along with "snapshots") in a special `.git` folder
- A folder which contains this `.git` folder is called a "*repository*"

--

- A new repository can be initialized using the `git init` command

```
$ mkdir repo
$ cd repo
$ git init
Initialized empty Git repository in /tmp/repo/.git/
$ ls -alh
total 84K
drwxrwxr-x   3 mrshu mrshu 4.0K Nov 28 14:39 .
drwxrwxrwt 108 root  root   72K Nov 28 14:39 ..
drwxrwxr-x   7 mrshu mrshu 4.0K Nov 28 14:39 .git
```

---

# Git: recording changes

.left-eq-column[
- Files in the repository can be
  - **untracked**
  - **tracked** (Git knows about them)
]

.right-eq-column[
- If a file is tracked, it can also be
  - **unmodified**
  - **modified**
  - **staged**
]
.center[![:scale 80%](./images/lifecycle.png)]

???

Image from https://git-scm.com/book/en/v2/Git-Basics-Recording-Changes-to-the-Repository

---

# Git: recording changes II

- To find out the status of the files in your repository, use `git status`

```
$ `git status`
On branch master

No commits yet

nothing to commit (create/copy files and use "git add" to track)
```

--

- Let's add some content to a file and see what the status looks like

```
$ echo "My Analysis" > README
$ `git status`
On branch master

No commits yet

Untracked files:
  (use "git add <file>..." to include in what will be committed)
        README

nothing added to commit but untracked files present (use "git add" to track)
```

---

# Git: tracking new files

- As `git status` suggested, we can use `git add` to track a file

```
$ `git add README`
$ `git status`
On branch master

No commits yet

Changes to be committed:
  (use "git rm --cached <file>..." to unstage)
        new file:   README
```

--

- And to commit the "staged" changes, we can run `git commit`

```
$ `git commit`
[master (root-commit) b3d8a54] Add README
 1 file changed, 1 insertion(+)
 create mode 100644 README
```

*Note: `git commit` opens up your default editor (most likely `vim`) so that you can write the commit message*.

---

# Git: tracking new files II

![:scale 100%](images/Git_illustration.png)

.center[The whole process once again, visualized in a pretty picture]

???

Image from https://geo-python.github.io/site/lessons/L2/git-basics.html

---

# Git: tracking files again

Let's use the same process to add one more file.

```
$ echo "Licensed under the terms of the CC-0 license." > LICENSE
```

--

The `git` commands are essentially the same:

```
$ `git status`
On branch master
Untracked files:
  (use "git add <file>..." to include in what will be committed)
        LICENSE

nothing added to commit but untracked files present (use "git add" to track)
$ `git add LICENSE`
$ `git commit`
[master 98bef79] Add LICENSE
 1 file changed, 1 insertion(+)
 create mode 100644 LICENSE
```

---

# Git: tracking changes

Suppose we change one of the files we track:

```
$ echo -e "\nThis repo contains the analysis of git usage." >> README
```

--

Running `git status` shows what has happened:

```
$ `git status`
On branch master
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
        `modified:   README`

no changes added to commit (use "git add" and/or "git commit -a")
```

---

# Git: tracking changes II

Using the `git diff` command, we can see what has changed:

```
$ `git diff`
diff --git a/README b/README
index dd0c36f..302b24f 100644
--- a/README
+++ b/README
@@ -1 +1,3 @@
 My Analysis
+
+This repo contains the analysis of git usage.
```

--

And we can again add the change in, just like before:

```
$ `git add README`
$ `git commit`
[master 96cee8d] Add some more description to README
 1 file changed, 2 insertions(+)
```

---

# Git: seeing what's happened

- To see the history of a Git repository, we can run `git log`:

```
$ `git log`
commit 96cee8d998f7306527fa360cb2dda6edb1dffc2f (HEAD -> master)
Author: mrshu
Date:   Mon Nov 30 13:30:45 2020 +0000

    Add some more description to README

commit 98bef799dad1374e9a6bdd3cb0e31ab98d90f028
Author: mrshu
Date:   Sat Nov 28 21:28:28 2020 +0000

    Add LICENSE

commit b3d8a54c03255fa93355edc78c3494e4b4c4ef4a
Author: mrshu
Date:   Sat Nov 28 20:54:02 2020 +0000

    Add README
```

---

# Git: intro commands review

- `git init` - initializes (an empty) repository
- `git status` - shows the status of the repository (modified, staged, untracked files)
- `git diff` - shows the changes that have not been staged yet
- `git add [file]` - starts tracking `[file]` or stages its changes (adds it to the staging area, so that it can be committed)
- `git commit` - commits the staged changes as a new snapshot

---
class: middle, inverse, center

# How Git works on the inside

A brief introduction to Git internals

---

# Git: some fundamentals

- Git stores snapshots, not diffs between commits
  - (This makes Git [fast](https://git-scm.com/about/small-and-fast))

.center[![:scale 50%](images/snapshots-vs-delta.png)]

--

- Everything in Git is referenced by (and validated via) cryptographic hashes
  - Everything has a SHA-1 hash
  - Among other things, this ensures the integrity of the data

???

https://batmat.github.io/presentations/git-next-level/prez.html#slide-9

---

# Git: what's inside a commit

- **commit**: a pointer to a tree, combined with metadata
- **tree**: a set of pointers to specific blobs (and other trees)
- **blob**: a specific version of a file

.center[![:scale 100%](images/commit-and-tree.png)]

???

https://git-scm.com/book/en/v2/Git-Branching-Branches-in-a-Nutshell

---

# Git: what's inside a repository

Each commit links to its "parent" (if it has one).

In its simplest form, a repository is a set of linked commits.

.center[![:scale 100%](images/commits-and-parents.png)]

--

But how does Git know which commit we are currently on?

???

https://git-scm.com/book/en/v2/Git-Branching-Branches-in-a-Nutshell

---

# Git: look where `HEAD` is

`HEAD` is a special file which says which commit the repository is currently pointing to.

```bash
$ cat .git/HEAD
ref: refs/heads/master
```

--

Just as above, `HEAD` can point to "references" -- other files with actual hashes.
```bash
$ cat .git/refs/heads/master
96cee8d998f7306527fa360cb2dda6edb1dffc2f
```

.center[![:scale 55%](images/branch-and-history.png)]

These references are also called *branches* (or *tags*).

---
class: middle, inverse, center

# Git branching

A quick look at (probably) the most famous Git feature.

---

# Git: creating a new branch

Creating a new branch is easy -- just run `git branch [branchname]`.

--

Internally, Git creates a new pointer with the name of your branch.

--

It will point to the same commit as `HEAD` did at the time.

--

```bash
$ git branch testing
```

.center[![:scale 60%](images/head-to-master.png)]

---

# Git: switching branches

As the `git log` output below shows, we are still on the `master` branch:

```
$ `git log --oneline --decorate`
96cee8d (`HEAD -> master`, `testing`) Add some more description to README
98bef79 Add LICENSE
b3d8a54 Add README
```

--

But we can easily switch with `git checkout`:

```
$ git checkout testing
Switched to branch 'testing'
$ `git log --oneline --decorate`
96cee8d (`HEAD -> testing`, `master`) Add some more description to README
98bef79 Add LICENSE
b3d8a54 Add README
```

--

And check which branch we are on with `git branch`:

```
$ `git branch`
  master
* testing
```

---

# Git: switching branches II

Here is what the situation looks like, visually.

.left-eq-column[
**Before**:
.center[![:scale 100%](images/head-to-master.png)]
]

.right-eq-column[
**After** (running `git checkout testing`):
.center[![:scale 100%](images/head-to-testing.png)]
]

.clear-both[
.center.font-small[`git checkout` can also be used on anything else that resolves to a Git commit (like *tags*, `HEAD` and others)]
]

---

# Git: working in branches

Let's suppose we make some changes in the current (`testing`) branch

```bash
$ echo "print('Analysis is done in here')" > analysis.py
```

And then stage and commit them to the repository.

```bash
$ `git add analysis.py`
$ `git commit`
[testing f0aa1ae] Add analysis.py
 1 file changed, 1 insertion(+)
 create mode 100644 analysis.py
```

--

Visually, the situation will look as follows:

.center[![:scale 60%](images/advance-testing.png)]

---

# Git: working in branches II

But what if we'd like to go back and make licensing clearer in README?

--

Not a big deal. We'll check out `master` and add the changes there.

```
$ git checkout master
Switched to branch 'master'
```

```bash
$ echo -e "\n\nThis project is released to the public domain." >> README
$ `git diff`
diff --git a/README b/README
index 302b24f..82b5c99 100644
--- a/README
+++ b/README
@@ -1,3 +1,6 @@
 My Analysis
 
 This repo contains the analysis of git usage.
+
+
+This project is released to the public domain.
$ `git add README`
$ `git commit`
[master f9aa801] Update README to mention licensing
 Date: Mon Nov 30 14:24:18 2020 +0000
 1 file changed, 3 insertions(+)
```

---

# Git: working in branches III

By making changes in both the `master` and the `testing` branch, we have created a so-called "**divergent history**".

.center[![:scale 60%](images/advance-master.png)]

```bash
$ `git log --oneline --decorate --graph --all`
* f9aa801 (HEAD -> master) Update README to mention licensing
| * f0aa1ae (testing) Add analysis.py
|/
* 96cee8d Add some more description to README
* 98bef79 Add LICENSE
* b3d8a54 Add README
```

---

# Git: merging branches together

To get out of the **divergent history** state, we can merge the histories together.

```
$ `git merge testing`
Merge made by the 'recursive' strategy.
 analysis.py | 1 +
 1 file changed, 1 insertion(+)
 create mode 100644 analysis.py
```

--

And here is what the history looks like now:

```
$ `git log --oneline --decorate --graph --all`
*   0c1dc34 (HEAD -> master) Merge branch 'testing'
|\
| * f0aa1ae (testing) Add analysis.py
* | f9aa801 Update README to mention licensing
|/
* 96cee8d Add some more description to README
* 98bef79 Add LICENSE
* b3d8a54 Add README
```

--

Once we are done with it, we can also delete the `testing` branch:

```
$ `git branch -d testing`
Deleted branch testing (was f0aa1ae).
```

---

# Git: merging branches together II

Here is what the situation looked like **before**:

.center[![:scale 60%](images/basic-merging-1.png)]

--

And **after**:

.center[![:scale 60%](images/basic-merging-2.png)]

---

# Git: branching recap

- `git branch` - lists the branches and marks the one you are on
- `git branch [branchname]` - creates a new branch at the current `HEAD`
- `git checkout [branchname]` - switches to the `[branchname]` branch
- `git log --oneline --decorate --graph --all` - graphs out the history of all branches
- `git merge [branchname]` - merges `[branchname]` into the current branch
- `git branch -d [branchname]` - deletes a (merged) branch

---
class: center, middle, inverse

# Git: working with remotes

Using Git to collaborate with others

---

# Git: cloning a repository

```bash
$ `git clone https://gitlab.com/vidriduch/davos-hall-of-fame.git`
Cloning into 'davos-hall-of-fame'...
remote: Enumerating objects: 7, done.
remote: Counting objects: 100% (7/7), done.
remote: Compressing objects: 100% (5/5), done.
remote: Total 7 (delta 0), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (7/7), 606 bytes | 101.00 KiB/s, done.
```

This creates a new directory called `davos-hall-of-fame`, containing the repository.
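
Where does the directory name come from? As a rough sketch (not Git's actual implementation, which handles more edge cases), the default name is the last path component of the URL with a trailing `.git` removed. The helper name `default_clone_dir` is hypothetical and used only for illustration.

```python
def default_clone_dir(url: str) -> str:
    """Approximate the directory name `git clone` picks by default.

    Simplified assumption: take the last path component of the URL
    and drop a trailing ".git". Real Git covers more edge cases.
    """
    # Ignore trailing slashes, then keep only the last path component.
    name = url.rstrip("/").rsplit("/", 1)[-1]
    # Drop the conventional ".git" suffix, if present.
    if name.endswith(".git"):
        name = name[: -len(".git")]
    return name


print(default_clone_dir("https://gitlab.com/vidriduch/davos-hall-of-fame.git"))
```

A different name can be chosen by passing it as an extra argument: `git clone [url] [directory]`.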
--

```bash
$ cd davos-hall-of-fame/
```

Inside it, we can create a new branch for our changes:

```
$ `git checkout -b mrshu/hall-of-fame`
Switched to a new branch 'mrshu/hall-of-fame'
```

(`git checkout -b` creates a new branch and switches to it right away)

---

# Git: pushing changes up

Let's add my name to `hall_of_fame.md`:

```bash
$ echo "mrshu" >> hall_of_fame.md
$ `git diff`
diff --git a/hall_of_fame.md b/hall_of_fame.md
index e69de29..01dd831 100644
--- a/hall_of_fame.md
+++ b/hall_of_fame.md
@@ -0,0 +1 @@
+mrshu
```

--

And let's commit it.

```bash
$ `git add hall_of_fame.md`
$ `git commit`
[mrshu/hall-of-fame 22573c5] Add mrshu to hall_of_fame.md
 1 file changed, 1 insertion(+)
```

--

And push it to the remote:

```
$ git push --set-upstream origin mrshu/hall-of-fame
```

---

# Git: remotes visually

![:scale 100%](https://geo-python.github.io/site/_images/pull-push-illustration.png)

???

https://geo-python.github.io/site/lessons/L2/git-basics.html

---

# Git != GitHub != GitLab

- Git as a technology is completely independent of the "web frontends", such as GitHub and GitLab
- By learning to use Git you learn the fundamentals that power all of them
- GitHub and/or GitLab are businesses
  - Businesses tend to go under.
- Git is free and open-source. It will most probably survive for a bit longer.
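
---

# Git: hashing a blob by hand

The fundamentals slide says that everything in Git has a SHA-1 hash. As a minimal sketch (assuming the default SHA-1 object format, and using the hypothetical helper name `git_blob_hash`), a blob's ID is the SHA-1 of a short header followed by the file contents:

```python
import hashlib


def git_blob_hash(content: bytes) -> str:
    """Return the object ID Git assigns to a blob with this content.

    Assumes the default SHA-1 object format: Git hashes the header
    "blob <size in bytes>" plus a NUL byte, followed by the raw contents.
    """
    header = b"blob " + str(len(content)).encode("ascii") + b"\x00"
    return hashlib.sha1(header + content).hexdigest()


# An empty file, like `hall_of_fame.md` before the first edit.
print(git_blob_hash(b""))
```

The empty-file result starts with `e69de29`, the same prefix that the `index e69de29..01dd831` line shows in the `hall_of_fame.md` diff. Running `git hash-object [file]` should report the same ID for a file with the same contents.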