Version Control for Scientists

TRANSFORM 2021

April 22 - 11am UTC

objectives

share my practices

take time to look closer

gain some insights

CONTENTS

lots (but not all) about git

curvenote - jupyter notebooks

data versioning

burning git questions

the git book

set up

pre-reqs:

  • Install git

  • Install VS Code

  • Install the 'Git Graph' extension

set up GitHub

configure ssh

$ ssh-keygen -t rsa -b 4096 -C "your_email@example.com"

generating text

curl -s http://metaphorpsum.com/paragraphs/N
curl -s http://metaphorpsum.com/sentences/N
curl -s http://metaphorpsum.com/paragraphs/N >> file1.txt

Set up a working area

(base) ✔ ~/dev/t21 
16:31 $ ls -l
total 0
drwxr-xr-x  2 stevejpurves  staff  64 20 Apr 16:31 repo_one
(base) ✔ ~/dev/t21 
16:32 $ 

pinned items in the tutorial slack channel

Cloning

Use Case: Getting someone else's code and taking a good look at it

Find some code on GitHub!

git clone _______________

SSH or HTTP

look around

git clone

ls -l

ls -la

ls -la .git 

git branch

git checkout

git log

git log --graph --oneline --decorate --all

get a prettier command line

jump into VSCode!

Questions Answered:

Q: Is there a benefit to using a desktop client? I get confused sometimes using the command line, but then I tried a client and it didn't help. Which one is best?

Q: What is the minimum I need to know to use git alone?

theory

let's talk more about how git works

git is a database

a <key: value> store

trees

versions

refs => tags, branches

HEAD
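
A minimal Python sketch of the key/value idea, for illustration only. It assumes git's default SHA-1 object format, where a file's contents are stored as a "blob" whose key is the hash of a short header plus the content. The names `blob_key`, `store_blob`, `objects` and `refs` are hypothetical stand-ins, not git's real storage, and the toy lets a branch name a blob directly, whereas in real git a branch names a commit, which names a tree, which names blobs.

```python
import hashlib

def blob_key(content: bytes) -> str:
    """Key that git's default (SHA-1) object format gives to a file's contents."""
    header = b"blob " + str(len(content)).encode() + b"\0"
    return hashlib.sha1(header + content).hexdigest()

# Toy stand-ins for the object database and the refs (not git's on-disk layout).
objects = {}   # key (hash) -> value (content)
refs = {}      # name (branch or tag) -> key

def store_blob(content: bytes) -> str:
    """Store content under its hash and return that hash."""
    key = blob_key(content)
    objects[key] = content
    return key

key = store_blob(b"hello scientists\n")
refs["refs/heads/main"] = key   # a branch is just a name for a key (simplified)
head = "refs/heads/main"        # HEAD normally names a branch, not a key

print(key)
print(objects[refs[head]])
```

Because the key is derived from the content, identical content always gets the same key. An empty file, for example, hashes to e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 in every SHA-1 repository.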

Branches

[detached] head

[diagram: a chain of commits (0sjw3a, j3b4j2, d9jw7s, 4j3n2h, 9kwjcu, 2ksjh7, js87sj) with labels for a branch, MAIN and HEAD]

3 "trees"
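
A toy Python model of the three "trees", assuming the slide means the working directory, the index (staging area) and HEAD, as described in the git book. Each tree is reduced to a plain dict of filename to content, and the function names are hypothetical; this is a sketch of the idea, not of how git stores anything.

```python
# Toy model: each "tree" is a dict of filename -> content.
working = {}   # working directory: files as they are on disk
index = {}     # index / staging area: what the next commit will contain
head = {}      # HEAD: the snapshot recorded by the last commit

def add(name):
    """Like `git add <name>`: copy one file from the working directory to the index."""
    index[name] = working[name]

def commit():
    """Like `git commit`: the index becomes the new HEAD snapshot."""
    head.clear()
    head.update(index)

def reset_hard():
    """Like `git reset --hard HEAD`: tracked files and the index are restored from HEAD.
    Untracked files in the working directory are left alone."""
    for name in set(index) | set(head):
        working.pop(name, None)
    working.update(head)
    index.clear()
    index.update(head)

working["notes.txt"] = "draft 1"
add("notes.txt")
commit()

working["notes.txt"] = "draft 2"   # edited on disk, but not staged
unstaged_change = working["notes.txt"] != index["notes.txt"]

reset_hard()                       # throws the unstaged edit away
print(unstaged_change, working["notes.txt"])
```

Seen this way, `git status` is a comparison between neighbouring trees: working directory against index, and index against HEAD.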

distributed version control system

github

forks & Pull Requests

parent

fork

PR

Questions Answered:

Q: What is a fork?

Q: What is a pull request?

git up close

walk through building out a toy repository

git config & INIT

git config

git config --global --list

git config user.name "Steve Purves"

git config --global user.name "Steve Purves"

git init

git config user.name "Steve Local"

git config --list

Q: How to tell git to forget my repo?

git config --global alias.graph 'log --graph --oneline --decorate --all'

git config --global --unset alias.graph

git ALIAS

start committing

git branch -m master main

git add

git rm

git commit

git log

git log --graph --oneline --decorate --all

git diff

TAG & BRANCH

git show-ref

git tag

git branch

git merge

CONFLICTS

💥
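
A small Python sketch of when a merge conflicts, under a simplifying assumption: the comparison is done on a whole file at a time, whereas git works on smaller regions within a file. The function name `merge_file` and the sample strings are hypothetical.

```python
def merge_file(base, ours, theirs):
    """Three-way merge of one file's content, at whole-file granularity.

    base   -- content at the common ancestor commit
    ours   -- content on the branch we are on
    theirs -- content on the branch being merged in
    Returns (merged_content, conflict), where merged_content is None on conflict.
    """
    if ours == theirs:
        return ours, False     # both sides agree
    if ours == base:
        return theirs, False   # only they changed it: take theirs
    if theirs == base:
        return ours, False     # only we changed it: keep ours
    return None, True          # both changed it differently: conflict

print(merge_file("v = 1500", "v = 1500", "v = 1480"))
print(merge_file("v = 1500", "v = 1520", "v = 1480"))
```

The common ancestor is what makes the decision possible: without `base` there is no way to tell which side actually changed.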

Questions Answered:

Q: What is the minimum I need to know to use git alone?

git LEVEL 2

STASH

RESET || REVERT

rebase

commit --amend

the stash

git stash [push]

git stash pop

git stash list

---

git stash push -m "stash_name"

git stash pop stash@{n}

git stash apply stash@{n}

git stash branch <branchname> [stash@{n}]

RESET | REVERT

git reset --hard HEAD

git reset --keep HEAD
git revert <commit>

rebase

# set up a branch with changes

git checkout -b branch2

# commit, commit, commit

git rebase main

git rebase main branch2

# do it again with conflicts

rebase ONTO


git rebase --onto <newparent> <start> <branch>

commit --amend

git commit --amend -m "my new message"

# you are rewriting history!

DIVE INTO VSCODE

getting my code onto github

remember

A GitHub repo is just a copy of your git repo on someone else's (Microsoft's) computer

cd ..
mkdir repo_two
cd repo_two

git init --bare

cd ../repo_one

# make a change
git commit -am "changes"

git push ../repo_two

git remote add local ~/dev/t21/repo_two

you already have a repo locally*

git remote add origin _____________

git push -u origin main

git checkout -b my-branch

git push

git tag 'v1.0.1'

git push --tags

working with others

  1. publishing

  2. contributing

publishing

sharing your repo in any way is publishing

pushing to a public repo on GitHub, GitLab, etc. is the most common way

once you publish, beware of `rewriting history`

git commit --amend -m "my corrected message"

git rebase main

git reset

publishing

git push --force-with-lease
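
A toy Python model of the check behind `--force-with-lease`, as a sketch only: the forced push goes through if the remote branch still points at the commit you last saw there, and is rejected otherwise. The function name, the dict standing in for the remote, and the commit ids are all hypothetical.

```python
def push_force_with_lease(remote_refs, branch, new_commit, last_seen):
    """Overwrite a remote branch only if it still points where we last saw it.

    remote_refs -- dict of branch name -> commit id on the remote (toy stand-in)
    last_seen   -- the commit id our local copy believes the remote branch is at
    Returns True if the push was accepted, False if it was rejected.
    """
    if remote_refs.get(branch) != last_seen:
        return False               # someone else pushed in the meantime
    remote_refs[branch] = new_commit
    return True

remote = {"main": "0sjw3a"}

# We amended our commit and nobody else has pushed: accepted.
first = push_force_with_lease(remote, "main", "j3b4j2", last_seen="0sjw3a")

# A colleague still believes main is at 0sjw3a: rejected, our work survives.
second = push_force_with_lease(remote, "main", "d9jw7s", last_seen="0sjw3a")

print(first, second, remote)
```

A plain `--force` would skip the comparison and overwrite the branch in both cases.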

contributing

original

fork

PR

PR

upstream

origin

contributing

Ideally:

  1. make a fork
  2. clone your fork
  3. push to your fork
  4. open a PR back to the main repo
  5. configure upstream remote

contributing

What usually happens:

  1. clone the repo
  2. end up making local changes
  3. realise you want to contribute these back
  4. create a fork
  5. reconfigure local remotes
  6. push to your fork
  7. open a PR back to the main repo

contributing - EXAMPLE

git clone git@github.com:euclidity/t21-lorem-ipsum.git

# make local changes

git commit

git push   # argh! no permissions

# fork the repo on GitHub

git remote -v

git remote rename origin upstream

git remote add origin ___________________

git push -u origin master   # if your branch is still called master

git push -u origin main     # if your branch is called main

# open PR

best practices

Q: Best practices for working on a project owned by someone else?

Q: Is git just for collaboration? Or should I be using it on my own projects too?

Q: When collaborating, the check out, commit, merge, etc. model makes sense. When working alone, should we just commit as we go? Or be more structured?

 

Q: Why is main + branch_mr_X + branch_mr_y not good practice?

Q: What is a sensible policy when it comes to branches? How many should I be using?

Q: How to deal with others changing a file before you try to merge? (some merges require too much time for a quick adjustment) Does git have a staging ground?

Q: When should I merge my work?

gitflow

2 long lived branches: main, dev

all other branches are short lived

all commits to main are releases

hotfix branches direct from main

all other branches are from dev

consistent naming

hot/issue-232, feat/new-shiny, fix/things

continuous integration

the ideal: all developers work on main

no long lived branches

merge as frequently as possible

concurrent development via architecture

google: component based development

But GIT IS NOT ENOUGH

WE NEED A

reproducible workflow

Versioning for notebooks

with curvenote

using github with jupyter notebooks is difficult

no meaningful diffs

conflicts because of metadata, etc.

difficult to resolve conflicts

works better with no outputs

notebooks can get large and binary

github doesn't render well
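
To illustrate why notebooks version better with no outputs, here is a minimal Python sketch that clears outputs and execution counts from a notebook before it is committed. It assumes the notebook is in the usual JSON layout with a top-level `cells` list; the function name `strip_outputs` and the sample notebook are hypothetical, and dedicated tools handle more cases than this does.

```python
import copy
import json

def strip_outputs(notebook: dict) -> dict:
    """Return a copy of a notebook with code-cell outputs and counters cleared."""
    clean = copy.deepcopy(notebook)          # leave the caller's notebook untouched
    for cell in clean.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []             # drop plots, tables, printed text
            cell["execution_count"] = None   # drop the In [n] counter
    return clean

# A tiny, hand-written notebook used only for this example.
notebook = {
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Velocity model"]},
        {
            "cell_type": "code",
            "execution_count": 7,
            "metadata": {},
            "outputs": [{"output_type": "stream", "name": "stdout", "text": ["1500\n"]}],
            "source": ["print(1500)"],
        },
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 5,
}

clean = strip_outputs(notebook)
print(json.dumps(clean["cells"][1], indent=1))
```

With outputs and counters gone, re-running a notebook without editing it no longer changes the file, so diffs show only real edits to the source.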

gempy-github gempy-cn

JUPYTEXT

VERSIONING NOTEBOOKS EXAMPLE

Wait what just happened?

  1. Pulled a git repo
  2. Saved to a curvenote project
  3. Committed our notebook back into git
  4. We have a git commit including curvenote metadata
  5. Ran our notebook to produce outputs
  6. Saved to Curvenote again
  7. Share, pull out figures into a paper

data versioning

with DVC

DVC What?

initialise

dvc init

git status

git commit -am "init dvc"

add remote

dvc remote add mydrive gdrive://1uW00kcqk8VCRvBoYltmtQ7xq3EdIqPk4

dvc remote list

add DATA & PUSH

cp -r ../data .

dvc add data

git status

dvc push

git status

COMMIT DVC METADATA TO GIT

09:11 $ git status
On branch main
Your branch is ahead of 'origin/main' by 1 commit.
  (use "git push" to publish your local commits)
Changes to be committed:
  (use "git reset HEAD <file>..." to unstage)

        modified:   .dvc/config
        modified:   .gitignore
        new file:   data.dvc


git commit -am "added dataset"

git push

Wait what just happened?

  1. pulled a git repo
  2. moved our data inside of our repo
  3. initialised dvc
  4. set up a data remote
  5. added our data to dvc & pushed
  6. committed all metadata into git & pushed

 

a git commit linked to our data
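
The idea of linking a commit to data can be sketched in Python: the large file stays out of git, and a small pointer recording its hash is committed instead. This is a toy illustration of the concept, not DVC's actual file format or hashing scheme; `make_pointer`, the pointer fields and the file names are all hypothetical.

```python
import hashlib
import json
import tempfile
from pathlib import Path

def make_pointer(data_file: Path) -> dict:
    """Describe a data file by name, size and content hash (toy pointer record)."""
    content = data_file.read_bytes()
    return {
        "path": data_file.name,
        "size": len(content),
        "sha256": hashlib.sha256(content).hexdigest(),
    }

with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "survey.csv"
    data.write_bytes(b"depth,velocity\n100,2.1\n")

    # The small pointer file is what would be committed to git.
    pointer = make_pointer(data)
    pointer_file = Path(tmp) / "survey.csv.pointer.json"
    pointer_file.write_text(json.dumps(pointer, indent=2))

    # If the data changes, its hash no longer matches the committed pointer.
    data.write_bytes(b"depth,velocity\n100,2.1\n200,2.4\n")
    data_changed = make_pointer(data)["sha256"] != pointer["sha256"]

print(pointer["path"], pointer["size"], data_changed)
```

Checking out an older commit then gives you the older pointer, which identifies exactly which version of the data that code was run against.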

summary - GIT

use git to version your code

 

learn command line essentials

 

adopt and use a GUI tool e.g. VSCode

WE NEED A

reproducible workflow