Lecture 3¶

Version Control with `git`¶

This lecture borrows content from https://github.com/rdadolf/git-tutorial. It has been edited for cs109 and for various versions of cs207.

git is a tool that automates and enhances a lot of the tasks that arise when dealing with larger, longer-living, and collaborative projects. It has also become the common underpinning to many popular online code repositories, GitHub being the most popular. Others include GitLab and Bitbucket.

We'll go over the basics of git, but we should point out that a lot of talented people have given git tutorials, and we won't do any better than they have. In fact, if you're interested in learning git deeply and have some time on your hands, I suggest you read the Git Book. Scott Chacon and Ben Straub have done a tremendous job, and if you want to understand both the interfaces and the mechanisms behind git, this is the place to start.

Table of Contents¶

Lecture 3: Version Control with Git

Git Basics¶

Some more terminology for reinforcement¶

The first thing to understand about git is that the contents of your project are stored in several different states and forms at any given time. If you think about what version control is, this might not be surprising: in order to remember every change that's ever been made, you need to store a record of those changes somewhere, and to be able to handle multiple people changing the same code, you need to have different copies of the project and a way to combine them.

You can think about git operating on four different areas:

Git Commands

The Working Directory¶

The working directory is what you're currently looking at.

When you use an editor to modify a file, the changes are made to the working directory.

The Staging Area¶

The staging area is a place to collect a set of changes made to your project.

If you have changed three files to fix a bug, you will add all three to the staging area so that you can remember the changes as one historical entity.

The staging area is also called the index.

You move files from the working directory to the index using the command git add.

Changes in files in the staging area are still not part of any repository!

The Local Repository¶

The local repository is the place where git stores everything you've ever done to your project.

Even when you delete a file, a copy is stored in the repo (this is necessary for always being able to undo any change).

It's important to note that a local repository doesn't look much at all like your project files or directories.

Git has its own way of storing all the information, and if you're curious what it looks like, look in the .git directory in the working directory of your project.

Files are moved from the index to the local repository via the command git commit.

The Upstream Repository¶

When working in a team, every member will be working on their own local repository.

An upstream repository allows everyone to agree on a single version of history. If two people have made changes on their local repositories, they will combine those changes in the upstream repository. In our case this upstream repository is hosted by GitHub.
- Note: This need not be the case; SEAS provides git hosting, as do companies like Atlassian (bitbucket).

This upstream repository is also called a remote in git parlance.

The standard GitHub remote is called the origin: it is the repository which is given a web page on GitHub.

One usually moves code from local to remote repositories using git push, and in the other direction using git fetch.

You can think of most git operations as moving code or metadata from one of these areas to another.

Common Tasks in the version control of files.¶

Forking a repository¶

Forking a repository is performed on GitHub. You did this already in homework 1.

github_forking

Forking brings a repository into your own namespace. It's really a cloning process (see below), but it's done between two remotes on the server. In other words it creates a second upstream repository on the server, called the origin.

The forking process on GitHub will ask you where you want to fork the repository. Choose your own GitHub id.

If you're following along, be sure to replace my ID with your own ID.

The forking procedure leaves you with your own repository, <user_name>/playground.

my_github_fork

Cloning a repository¶

Now that we have a fork of the playground repository, let's clone it down to our local machines. Again, you already did this in homework 1. Now we're going to work through the meaning of the commands that you used. You will be asked to repeat a lot of the steps from homework 1, but this time the exact details will be explained.

A Short Pedagogical Aside¶

In case you're wondering why you didn't just do this tutorial for your homework, here is a brief explanation. There is evidence that people learn very well and retain knowledge best when concepts are introduced and then re-iterated and recalled at a later time Make it Stick: The Science of Successful Learning. In this class, we first showed you some of the git mechanics. Now we'll redo everything to help reinforce the mechanics and shed light on the actual process. This time around, please pay very close attention to the steps you're taking and try to relate them to the concepts described in the tutorial text.

Getting organized¶

Create a CS107 directory: ~/Classes/CS107/. Some of you have probably already created a similar directory structure. If not, please create one now. Inside this directory, you should put your course repo, the playground repo, and any other CS107-related directories and files.
Inside ~/Classes/CS107/ (or similar directory) create a directory called git_tutorial.
Move into the git_tutorial/ directory.

clone

clone

Cloning a repository does two things:

It takes a repository from somewhere (usually an upstream repository) and makes a local copy (your new local repository)
It creates the most recent copy of all of the files in the project (your new working directory).

This is generally how you will start working on a project for the first time.

Clone your forked playground repo to your `CS107/git_tutorial/` directory.¶

Cloning a repository depends a lot on the type of repository you're using. If you're cloning out of a directory on the machine you're currently on, it's just the path to the <project>.git file.

NOTE: From this point on, you will see cells containing code. You should type those commands into your terminal, NOT a Jupyter notebook. We use a notebook simply for demo purposes and to help you follow the steps.

WARNING! The code in the following cells is always preceeded by a combination of the following commands:

%%bash and
cd /tmp or cd /tmp/playground.

DO NOT type any of those commands into your terminal. They are used only in the notebook environment!

In [ ]:

%%bash
cd /tmp
rm -rf playground # remove if it exists
git clone https://github.com/dsondak/playground.git

In [ ]:

%%bash
ls -a /tmp/playground

Poking around¶

We have a nice smelling fresh repository. We'll explore the repo from the Git point of view using Git commands.

`log`¶

Log tells you all the changes that have occured in this project as of now.

In [ ]:

%%bash
cd /tmp/playground
git log

commit hash Each one of these "commits" is an SHA hash.

It uniquely identifies all actions that have happened to this repository previously.

The long string of hex digits next to commit is the long hash and identifies the unique commit.

There is some interesting history here: How much of a git sha is generally considered necessary to uniquely identify a change in a given codebase?

Getting help with commands¶

If you ever need help on a command, you can find the git man pages by hyphenating git and the command name.
Try it!

man git-log

Press the spacebar to scoll down and q to quit.

`status`¶

git_status

Status is your window into the current state of your project.

It can tell you which files you have changed and which files you currently have in your staging area.

You should use git status every other command in git!

This is especially true in the beginning when you're just learning to understand how things work. (Eventually you can probably relax on this.)

In [ ]:

%%bash
cd /tmp/playground
git status

Pay close attention to that text!

It says we are on the master branch of our local repository, and that this branch is up-to-date with the master branch of the upstream repository (or remote) named origin.

We know this because clone brings down a copy of the remote branch: origin/master.

origin/masterrepresents the local copy of the branch that came from the upstream repository (nicknamed origin in this case).

Branches are different, co-existing versions of your project.

Here we have encountered two of them, but remember there is a third one in the repository we forked from, and perhaps many more, depending on who else made these forks.

Branches represent a snapshot of the project by someone at some particular point in time. In general you will only care about your own branches and those of the "parent" remotes you forked/cloned from. We'll have much more to say about branches later.

Configuration information is stored in a special file called config, in a hidden folder called .git in your working directory. (The index and the local repository are stored there as well...more on that in a bit.)

Reminder: Hidden files and directories are preceded by a dot. The only way to see them is to type ls -a where the a option tells the ls command to list hidden files and directories.

A few special files and directories¶

`config`¶

In [ ]:

%%bash
cd /tmp/playground
cat .git/config

Notice that this file tells us about a remote called origin which is simply the Github repository we cloned from. So the process of cloning left us with a remote. The file also tells us about a branch called master, which "tracks" a remote branch called master at origin.

`.gitignore`¶

.gitignore tells us what files to ignore when adding files to the index and comitting to the local repository.

We use this file to ignore temporary data files and such when working in our repository.

Some `.gitignore` anatomy¶

Folders are indicated with a / at the end, in which case all files in that folder are ignored.

One of the lines in the .gitignore file is *.so. That line tells Git to ignore all files with the extension .so.

Some comments on `.gitignore`¶

A .gitignore file can be specialized to a specific language. The one below is specialized to the Python language.

Note that when creating a GitHub repo, you are asked if you want to create a .gitignore file. You don't have to create one, but it's a good idea. Of course, you can always add one later if you so desire.

In [ ]:

%%bash
cd /tmp/playground
cat .gitignore

Making changes¶

Ok! Enough poking around. Let's get down to business and add some files into our folder.

Now let's say that we want to add a new file to the project. The canonical sequence is "edit–add–commit–push".

In [ ]:

%%bash
cd /tmp/playground
echo 'In-class demo' >> README.md
git status

We've modified a file in the working directory, but it hasn't been staged yet. Make sure you read and understand the output.

Your local master branch does not contain anything that is not on the remote master branch. So git says: Your branch is up to date with origin/master.

You have some untracked files in your local directory that git is not keeping track of. Git senses this and informs you of this fact and goes one more step to inform you of what those untracked files are. Sometimes you want to stage these files and sometimes you don't. The decision is yours.

Git also tells you that there is nothing to commit but that there are some untracked files and maybe you want to start tracking them.

`add`¶

git_add

When you've made a change to a set of files and are ready to create a commit, the first step is to add all of the changed files to the staging area. That is what add is for.

Remember that what you see in the filesystem is your working directory, so the way to see what's in the staging area is with the status command.

This also means that if you add something to the staging area and then edit it again, you'll need to add the file to the staging area again if you want to remember the new changes.
- See the Staging Modified Files section at Git - Recording Changes to the Repository.

In [ ]:

%%bash
cd /tmp/playground
git add README.md
git status

Now our file is in the staging area (index) waiting to be committed. The file is still not even in our local repository.

WARNING: Do NOT use `git add .`¶

Instead of doing git add world.md you could use git add . in the top level of the repository. This adds all new files and changed files to the index, and is particularly useful if you have created multiple new files. You should be careful with this because it's a very annoying if you decide that you didn't want to add a file. I usually avoid this if I can, but sometimes it's the way to go. Note: The git add . sequence is far over-used and can cause collaboration problems. Please refrain from using it, especially if you're new to git.

`commit`¶

git_commit

When you're satisfied with the changes you've added to your staging area, you can commit those changes to your local repository with the commit command. Those changes will have a permanent record in the repository from now on.

Every commit has two features you should be aware of:

The first is a hash. This is a unique identifier for all of the information about that commit, including the code changes, the timestamp, and the author. We saw this already when we used git log earlier.
The second is a commit message. This is text that you can (and should) add to a commit to describe what the changes were.

Good commit messages are important!

Commit messages are a way of quickly telling your future self and your collaborators what a commit was about. For even a moderately sized project, digging through tens or hundreds of commits to find the change you're looking for is a nightmare without friendly summaries.

By convention, commit messages start with a single-line summary, then an empty line, then a more comprehensive description of the changes.

This is an okay commit message. The changes are small, and the summary is sufficient to describe what happened.

This is better. The summary captures the important information (major shift, direct vs. helper), and the full commit message describes what the high-level changes were.

This. Don't do this.

In [ ]:

%%bash
cd /tmp/playground
git commit -m "Modifying README for class prep."

In [ ]:

%%bash
cd /tmp/playground
git status

The git commit -m version is just a way to specify a commit message without opening a text editor. If you use a text editor you just say git commit.

Another nice command is to use git commit with the -a option: git commit -a. Note that git commit -a is shorthand to stage and commit a file which is already tracked all at once. It will not stage a file that is not yet tracked!

In [ ]:

%%bash
cd /tmp/playground
git branch -av

We see that our branch, master, has one more commit than the origin/master branch, the local copy of the branch that came from the upstream repository (nicknamed origin in this case). Let's push the changes.

`push`¶

git_push

The push command takes the changes you have made to your local repository and attempts to update a remote repository with them. If you're the only person working with both of these (which is how a solo GitHub project would work), then push should always succeed.

In [ ]:

%%bash
cd /tmp/playground
git push
git status

You can go to your remote repo and see the changes!

Note to Mac Users¶

At this point, please add the .DS_Store file into your .gitignore file. You shouldn't version this annoying file. Here's what it does: .DS_Store.

Then, do a git rm .DS_Store in each directory that contains .DS_Store. Note: You should try to use your new-found Unix skills to execute a single Unix command line to recursively remove all .DS_Store files!

Please do this both for your playground and course repos. Be sure to commit and push!

Breakout Room¶

Figure out whose birthday is closest to a holiday.
Modify (or create) your .gitignore file.
Add at least one file or directory to the .gitignore that you would like to ignore. You should discuss this with your group.
Make sure you push these changes to your remote repo!

Git habits¶

Commit early, commit often.

Git is more effective when used at a fine granularity. For starters, you can't undo what you haven't committed, so committing lots of small changes makes it easier to find the right rollback point. Also, merging becomes a lot easier when you only have to deal with a handful of conflicts.

Commit unrelated changes separately.

Identifying the source of a bug or understanding the reason why a particular piece of code exists is much easier when commits focus on related changes. Some of this has to do with simplifying commit messages and making it easier to look through logs, but it has other related benefits: commits are smaller and simpler, and merge conflicts are confined to only the commits which actually have conflicting code.

Do not commit binaries and other temporary files.

Git is meant for tracking changes. In nearly all cases, the only meaningful difference between the contents of two binaries is that they are different. If you change source files, compile, and commit the resulting binary, git sees an entirely different file. The end result is that the git repository (which contains a complete history, remember) begins to become bloated with the history of many dissimilar binaries. Worse, there's often little advantage to keeping those files in the history. An argument can be made for periodically snapshotting working binaries, but things like object files, compiled python files, and editor auto-saves are basically wasted space.

Ignore files which should not be committed

Git comes with a built-in mechanism for ignoring certain types of files. Placing filenames or wildcards in a .gitignore file placed in the top-level directory (where the .git directory is also located) will cause git to ignore those files when checking file status. This is a good way to ensure you don't commit the wrong files accidentally, and it also makes the output of git status somewhat cleaner.

Always make a branch for new changes

While it's tempting to work on new code directly in the master branch, it's usually a good idea to create a new one instead, especially for team-based projects. The major advantage to this practice is that it keeps logically disparate change sets separate. This means that if two people are working on improvements in two different branches, when they merge, the actual workflow is reflected in the git history. Plus, explicitly creating branches adds some semantic meaning to your branch structure. Moreover, there is very little difference in how you use git.

Write good commit messages

I cannot understate the importance of this.

Seriously. Write good commit messages.

Lecture 3¶

Version Control with git¶

Table of Contents¶

Git Basics¶

Some more terminology for reinforcement¶

The Working Directory¶

The Staging Area¶

The Local Repository¶

The Upstream Repository¶

Common Tasks in the version control of files.¶

Forking a repository¶

Cloning a repository¶

A Short Pedagogical Aside¶

Getting organized¶

Clone your forked playground repo to your CS107/git_tutorial/ directory.¶

Poking around¶

log¶

Getting help with commands¶

status¶

A few special files and directories¶

config¶

.gitignore¶

Some .gitignore anatomy¶

Some comments on .gitignore¶

Making changes¶

add¶

WARNING: Do NOT use git add .¶

commit¶

push¶

Note to Mac Users¶

Breakout Room¶

Git habits¶

Version Control with `git`¶

Clone your forked playground repo to your `CS107/git_tutorial/` directory.¶

`log`¶

`status`¶

`config`¶

`.gitignore`¶

Some `.gitignore` anatomy¶

Some comments on `.gitignore`¶

`add`¶

WARNING: Do NOT use `git add .`¶

`commit`¶

`push`¶