The Global Information Tracker (GIT) Blog Part-2
14th April 2024
This blog is the second half of a two-part series on Git. If you have not read the first part, I would highly recommend reading it for context before continuing with this one.
Table of contents
Part-3 Advanced Git : In advanced Git we explore concepts like rebasing and manipulating history, look into various workflows, and much more
Part-4 Conclusion : In this section one gets a peek at how Git works under the hood, i.e. plumbing commands
Note: If you are still curious after reading the blogs and want to dive deeper, looking into the documentation and code is the best way to do it : ) These are linked below in the references section.
Part-3 Advanced Git
3.1 Git rebasing
Working on a project across multiple branches can quickly make its history very complex. That might seem fine to someone who has been on the project since the beginning, but for someone new joining the team or wanting to contribute, understanding a tangled history can be an uphill task. To make the Git history understandable we shall look into the concept of rebasing. Git rebase allows us to move a feature branch to a new parent commit, i.e. it detaches the feature branch from the commit it started at and re-attaches it to the tip of another branch, typically main. The benefit is that this makes the history linear, which makes it easy for someone to follow the history of the repo without going off on tangents whenever they see branches in it. Essentially, git rebase gives you the power to rewrite history, be it adding new changes, modifying old ones or deleting a few of them.
The beauty of rebase is twofold. One, it allows you to do fast-forward merges instead of the more complex 3-way merges. Two, it allows you to stay up to date with the main branch without creating a merge commit. The way you perform a rebase is quite similar to a merge: you check out the branch you want to move and then run git rebase <the name of the branch you want to move it onto>. In addition to this, git rebase allows us to manipulate individual commits. To get to this level of wizardry all you have to do is add the -i flag to make the rebase interactive, i.e. git rebase -i <the branch you want to rebase onto>. A text editor opens up with the list of commits that are about to be moved, each prefixed with a command. The commands you will reach for most often are pick, reword, edit, squash, fixup and drop. One can combine multiple commits into a single commit by replacing pick with squash on the commits to be folded into the one above them and then saving the file. When one squashes a group of commits this way, Git opens the editor again so that their commit messages can be combined into a single new message. The edit command lets one stop at a specific commit and amend it; think of it like updating a commit that you made a while ago (back to the future?). Don’t worry, if things get complicated you can always abort the rebase with git rebase --abort.
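To make the rebase-then-fast-forward idea concrete, here is a minimal sketch. It assumes Git 2.28 or newer (for git init -b); the throwaway directory, the file names and the "Demo" identity are made up for illustration.

```shell
set -e
# Hypothetical identity so the sketch runs without any Git configuration.
export GIT_AUTHOR_NAME="Demo" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo" GIT_COMMITTER_EMAIL="demo@example.com"

work=$(mktemp -d)                      # throwaway directory for the demo
git init -q -b main "$work/repo"
cd "$work/repo"

echo "base" > base.txt
git add base.txt && git commit -q -m "Base commit"

# Start a feature branch and commit on it.
git checkout -q -b feature
echo "feature" > feature.txt
git add feature.txt && git commit -q -m "Add feature"

# Meanwhile main moves ahead, so the two branches diverge.
git checkout -q main
echo "hotfix" > hotfix.txt
git add hotfix.txt && git commit -q -m "Hotfix on main"

# Replay the feature commit on top of the tip of main.
git checkout -q feature
git rebase -q main

# main is now a direct ancestor of feature, so this merge fast-forwards.
git checkout -q main
git merge -q --ff-only feature

git log --oneline --graph              # a single straight line, no merge commit
```

The --ff-only flag is there only as a safety net: the merge would fail instead of quietly creating a merge commit if the rebase had not happened.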
Given the rapid pace of development, one may end up committing too many changes at once; later, one can revise the commit history to make it clear and readable for other developers. To do this, first start an interactive session with git rebase -i <branch you want to move this to>. Next, change pick to edit on the commit you want to go back to and leave the other lines as they are, since deleting a line from the list drops that commit from the history. When the rebase stops at that commit, run git reset --mixed HEAD~1, which moves HEAD one step back (HEAD~n moves it n steps back). The --mixed flag resets the index but keeps all the files/changes of the commit you are altering in the working directory, so you can commit them again in smaller pieces and then run git rebase --continue. This might remind you of how, back in the basics, we used git reset --hard to discard changes and go back to the most recent commit.
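The key step above is git reset --mixed HEAD~1. The sketch below shows only that step, on the latest commit of an ordinary branch rather than inside an interactive rebase, so that it can run unattended. It assumes Git 2.28 or newer; the file names and identity are hypothetical.

```shell
set -e
export GIT_AUTHOR_NAME="Demo" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo" GIT_COMMITTER_EMAIL="demo@example.com"

work=$(mktemp -d)
git init -q -b main "$work/repo"
cd "$work/repo"

echo "base" > base.txt
git add base.txt && git commit -q -m "Base commit"

# One oversized commit that touches two unrelated files.
echo "docs" > docs.txt
echo "code" > code.txt
git add docs.txt code.txt && git commit -q -m "Add docs and code"

# Move HEAD back one commit; --mixed resets the index but leaves the files on disk.
git reset -q --mixed HEAD~1

# Commit the same changes again as two smaller commits.
git add docs.txt && git commit -q -m "Add docs"
git add code.txt && git commit -q -m "Add code"

git log --oneline
```

Inside an interactive rebase the same reset and re-commit sequence is run while Git is stopped at the commit marked edit, followed by git rebase --continue.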
An interesting thing to note is that while a rebase is in progress you are not on any branch (HEAD is detached). You only come back to a branch once the session ends, i.e. after the final git rebase --continue. So think of rebasing as a technique that’s done in isolation.
Now, if you ever want to take a look at a branch that you deleted without merging, or if you moved back further than intended while using git reset, don’t worry: there is a place where all of these actions/commits are recorded. git reflog lists every position HEAD has been at, i.e. each commit, checkout, reset, merge and rebase you performed locally. The reflog is a chronological listing of our history, without regard for the repository’s branch structure. This lets us find the dangling commits that would otherwise be lost from the project history. To retrieve commits you lost, check out the commit ID shown in the reflog, create a branch there, then check out main and merge that branch. This way one can undo such mistakes.
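Here is a small sketch of recovering a commit after resetting too far. It assumes Git 2.28 or newer and uses made-up file and branch names; HEAD@{1} is reflog syntax for "where HEAD was one move ago".

```shell
set -e
export GIT_AUTHOR_NAME="Demo" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo" GIT_COMMITTER_EMAIL="demo@example.com"

work=$(mktemp -d)
git init -q -b main "$work/repo"
cd "$work/repo"

echo "one" > file.txt
git add file.txt && git commit -q -m "First"
echo "two" >> file.txt
git commit -q -am "Second"

# Oops: step back one commit too many and lose "Second".
git reset -q --hard HEAD~1

# The reflog still remembers where HEAD used to be.
git reflog

# Park a branch on the position HEAD had before the reset, then merge it back.
git branch recovered "HEAD@{1}"
git merge -q --ff-only recovered

git log --oneline
```

In a real repository you would read the commit ID off the git reflog output instead of relying on HEAD@{1} being the right entry.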
3.2 Multi user workflow
Git works as an enabler for programmers to collaborate on projects. In this section, let’s try to understand the various kinds of workflows, the different developer platforms and how people collaborate in the real world using Git. When you want to "contribute" to a project you need access to the project’s repo, which can be reached over the internet, over a local network or even on your own filesystem. These kinds of repos are called "remote repositories", remote because they live outside your own repository. The first step to contribute to a project like this is to clone its codebase: git clone <weblink of repo> can be used if you are cloning the repo from a developer site like "GitHub", and git clone <path of codebase> can be used if the repo is in a shared workspace.
To check if your repo has any remote connections one can run git remote. If your repo is a clone it will usually show an origin connection; origin is the name Git gives to the remote pointing at the repo we cloned. If the repo you want to connect to is present locally, then git remote add <name for remote> <path to the repo> creates a connection with it. Apart from accessing repos locally, one can use SSH to access repos on someone else’s computer. Each time an update is made in the remote repo one needs to fetch it with git fetch <remote repo name>. One can list remote branches with git branch -r. To work on these remote branches you either have to merge them into an existing branch or create a new branch off them. Just as git fetch brings changes in from a remote repo, git push <remote name> <branch name> sends our branches out to a remote repo. An important thing to note is that pushing a branch does not push the tags associated with it; you have to push these as well if required, for example with git push <remote name> <tag name> or git push --tags. There is also git push --follow-tags, which sends the annotated tags that point at the commits being pushed. Pushing creates local branches in the destination repo, which is why it is a necessary tool for maintaining public repositories, whereas fetching from a remote repo creates remote branches in your own.
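The remote commands above can be tried entirely on the local filesystem. This is a minimal sketch assuming Git 2.28 or newer; the repository names original and mine, the branch, the tag and the identity are hypothetical.

```shell
set -e
export GIT_AUTHOR_NAME="Demo" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo" GIT_COMMITTER_EMAIL="demo@example.com"

work=$(mktemp -d)

# A repository that plays the role of the "original" project.
git init -q -b main "$work/original"
cd "$work/original"
echo "hello" > readme.txt
git add readme.txt && git commit -q -m "Initial commit"

# Cloning sets up a remote called origin automatically.
git clone -q "$work/original" "$work/mine"
cd "$work/mine"
git remote -v

# Do some work on a new branch and tag it.
git checkout -q -b feature
echo "more" >> readme.txt
git commit -q -am "Extend readme"
git tag v0.1

# Push the branch; the tag has to be pushed separately.
git push -q origin feature
git push -q origin v0.1

# Pick up anything new from the remote and list the remote branches.
git fetch -q origin
git branch -r
```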
If you are building something big you would mostly have either a centralised or a distributed workflow. In a centralised workflow all the developer repos are connected to one central repo, and everyone pushes to and fetches from it only. These central repos are typically hosted on a server to allow online collaboration between developers around the world. To create one you need to run git init --bare <central-repo-name>.git. As you can see, we have used the --bare flag so as to not create a working directory but to just store the project history. Think of this as a combined history file of the various developers working on the project. Once the central repo is up, we ask all the developers to add it as a remote in their individual repos. If there is already a remote connection, use git remote rm <existing connection name> to remove it and set up a connection only with the central repo. If your central repo is on a server, then git remote add <name for remote> ssh://<user>@<host>/<path to central-repo.git> is used to establish a connection with it.
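A central repo can be set up locally to see how the pieces fit together. The sketch below assumes Git 2.28 or newer; central.git, dev and the identity are made-up names, and a local path stands in for the ssh:// address you would use with a real server.

```shell
set -e
export GIT_AUTHOR_NAME="Demo" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo" GIT_COMMITTER_EMAIL="demo@example.com"

work=$(mktemp -d)

# The central repo: history only, no working directory.
git init -q --bare -b main "$work/central.git"

# A developer repo with some work in it.
git init -q -b main "$work/dev"
cd "$work/dev"
echo "start" > notes.txt
git add notes.txt && git commit -q -m "Start project"

# Connect the developer repo to the central one and publish main.
git remote add origin "$work/central.git"
git push -q origin main
git remote -v
```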
We push the history of our work to the central repo with git push origin main. The central repo contains the history of our entire project. To avoid merge conflicts and divergence between the origin/main branch and your local main branch, remember to never rebase commits that have been pushed to a shared repo. If you really need to change a public commit, use git revert. Given that multiple developers are working against the central repo, there may come a time when one tries to push changes to origin and they are rejected. This happens because the origin/main branch has progressed since the last time one fetched from it, and the central repo only accepts pushes that are fast-forwards. To overcome this, one needs to fetch the changes from origin and rebase one’s commits onto origin/main before pushing again. This takes in the changes other developers have made and replays one’s own changes after them. The benefit of this rejection is that no one can overwrite another’s work with an ordinary push.
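The rejected push and the fetch, rebase, push recovery can be reproduced with two local developer repos. This is a sketch assuming Git 2.28 or newer; alice, bob, the file names and the identity are hypothetical.

```shell
set -e
export GIT_AUTHOR_NAME="Demo" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo" GIT_COMMITTER_EMAIL="demo@example.com"

work=$(mktemp -d)
git init -q --bare -b main "$work/central.git"

# Alice starts the project and publishes main.
git init -q -b main "$work/alice"
cd "$work/alice"
git remote add origin "$work/central.git"
echo "a" > a.txt
git add a.txt && git commit -q -m "Alice: first commit"
git push -q origin main

# Bob clones, commits and pushes before Alice's next push.
git clone -q "$work/central.git" "$work/bob"
cd "$work/bob"
echo "b" > b.txt
git add b.txt && git commit -q -m "Bob: add b.txt"
git push -q origin main

# Alice commits without fetching first, so her push is not a fast-forward.
cd "$work/alice"
echo "c" > c.txt
git add c.txt && git commit -q -m "Alice: add c.txt"
if git push -q origin main 2>/dev/null; then
  echo "push accepted"
else
  echo "push rejected: origin/main has moved on"
fi

# Fetch, replay the local commit on top of origin/main, then push again.
git fetch -q origin
git rebase -q origin/main
git push -q origin main

git log --oneline
```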
When we move on to a distributed workflow, where we host repos on a remote server or on a DVCS platform (like GitHub/GitLab), we ideally would not want any Tom, Dick, and Harry to be able to push to the official central repository, since it could lead to multiple issues. Security is one of the many problems that can arise.
To maintain sanity in our codebase it is highly recommended that every developer has both a public and a private repository. The way many DVCS platforms deal with this is through authentication, i.e. each user is required to authenticate (over HTTPS or SSH) before they can push code to a repo. Reading from a public repo, however, needs no authentication, so anyone can download the code and work on it.
In general, the workflow that everyone follows is known as the integrator workflow. It requires that everyone pull from a single place, the official repository, and push all of their changes to their own public repositories. In this way, additions from one contributor can be approved, integrated, and made available to everyone without interrupting the development being done by others. While this setup forces one to keep track of more remotes, it also makes it much easier to work with a large number of developers. Security is far less of a worry in an integrator workflow, because the maintainers are the only ones with write access to the "official" repository.
There’s an interesting side-effect to this kind of security. By giving each developer their own public repository, the integrator workflow creates a more stable developer environment for open-source software projects. Should the lead developer stop maintaining the "official" repository, any of the other participants could take over by simply designating their public repository as the new "official" project. These aspects make Git a distributed version control system, one where there is no single central repository that Git forces everyone to rely upon.
The integrator workflow is generally bulky, i.e. a contributor shares a complete branch. To overcome this, let’s look into the concept of patch workflows. In a patch workflow the communication happens at the level of individual commits. To create patches we use git format-patch <name of branch the commits are missing from>. Running this creates one .patch file for every commit in the current branch that is missing from the named branch (for example main). Instead of having to use git log to look at the history and figure out what changes have been made, a patch condenses this information and makes it very quick and easy to look at the changes. Back in the day, when DVCS platforms weren’t so developed, patches were the way to highlight what had changed, and these patch files used to be mailed to other developers. This is why Git ships with a git send-email command for mailing .patch files. The recipient downloads the patch files and uses git am < <path to patch file> to apply the changes commit by commit. The git am command reads from something called "standard input", and the "<" character is how we turn a file’s contents into standard input.
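A patch round trip can be sketched with two local repos. This assumes Git 2.28 or newer; official, contributor, the fix-typo branch and the identity are made-up names, and the patch file is handed over through a shared directory instead of email.

```shell
set -e
export GIT_AUTHOR_NAME="Demo" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo" GIT_COMMITTER_EMAIL="demo@example.com"

work=$(mktemp -d)

# The maintainer's "official" repository.
git init -q -b main "$work/official"
cd "$work/official"
echo "Helo world" > readme.txt
git add readme.txt && git commit -q -m "Add readme"

# The contributor works in their own clone, on a separate branch.
git clone -q "$work/official" "$work/contributor"
cd "$work/contributor"
git checkout -q -b fix-typo
echo "Hello world" > readme.txt
git commit -q -am "Fix typo in readme"

# One .patch file per commit that is on this branch but not on main.
# format-patch prints the path of each file it writes (a single one here).
patch_file=$(git format-patch -o "$work/patches" main)
echo "created $patch_file"

# The maintainer applies the patch from standard input.
cd "$work/official"
git am -q < "$patch_file"

git log --oneline
```

Note that the maintainer never adds the contributor's repo as a remote; the .patch file carries the change, the author and the commit message by itself.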
The benefit of patches is the modularisation of changes into commits. This gives one the freedom to isolate the commits they want and move them around as they please. From the maintainer’s perspective, patches also provide the same security as the integrator workflow: they still won’t have to give anyone access to the "official" repository, and they won’t have to keep track of every developer’s remote repository. As a developer, you’re most likely to use patches when you want to fix a bug in someone else’s project. After fixing it, you can send them a patch of the resulting commit. For this kind of one-time fix, it’s much more convenient to generate a patch than to set up a public Git repository.
3.3 Tips and Tricks
Let’s get into the juicy part of this blog, where we dive into various practical Git tips and tricks that one can use in everyday life. Each paragraph below is a different tip or trick I picked up along the way during undergrad.
An easy way to look at the logs of various branches when the repo becomes very big is git log <since>..<until>, which displays the commits that are reachable from <until> but not from <since>. Adding the --stat flag shows which files were changed in each commit, which helps one decide exactly which files to look into. git log -n N lets you look at the last N commits from the current HEAD. On a linear history this is equivalent to git log HEAD~N..HEAD, which is more tedious and less intuitive.
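The log variants above can be compared on a tiny repo. A minimal sketch, assuming Git 2.28 or newer; the branch, file names and identity are hypothetical.

```shell
set -e
export GIT_AUTHOR_NAME="Demo" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo" GIT_COMMITTER_EMAIL="demo@example.com"

work=$(mktemp -d)
git init -q -b main "$work/repo"
cd "$work/repo"

echo "base" > base.txt
git add base.txt && git commit -q -m "Base commit"

git checkout -q -b feature
echo "one" > one.txt
git add one.txt && git commit -q -m "Feature step one"
echo "two" > two.txt
git add two.txt && git commit -q -m "Feature step two"

# Commits that are on feature but not on main.
git log --oneline main..feature

# Which files did the latest commit touch?
git log --stat -n 1

# The last two commits from the current HEAD.
git log --oneline -n 2
```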
Once you are done with the project or have finished developing a version of it, it is always good to have a local copy of it. This can be done with git archive <branch name> --format=zip --output=../<name>.zip (or --format=tar with a .tar file name), depending on your preference. These files can also be used to share the project with someone who doesn’t have Git installed (for example, a client). Similarly, git bundle create ../repo.bundle <branch name> creates a bundle file. Unlike the zip or tar files, a bundle file contains the history/log of all the commits that were made while building the project. Bundle files also allow people to clone the repo without having to interact with the remote copy, by just cloning the bundle.
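The difference between an archive and a bundle shows up when you try both. The sketch below assumes Git 2.28 or newer and a tar command on the PATH; the file names and identity are made up. The -b main on the clone is needed because this bundle contains only the main branch and no HEAD reference.

```shell
set -e
export GIT_AUTHOR_NAME="Demo" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo" GIT_COMMITTER_EMAIL="demo@example.com"

work=$(mktemp -d)
git init -q -b main "$work/repo"
cd "$work/repo"

echo "a" > a.txt
git add a.txt && git commit -q -m "Add a.txt"
echo "b" > b.txt
git add b.txt && git commit -q -m "Add b.txt"

# A snapshot of the files only: no .git folder, no history.
git archive --format=tar --output="$work/project.tar" main
tar -tf "$work/project.tar"

# A bundle keeps the full history of the branch.
git bundle create "$work/repo.bundle" main

# Anyone holding the bundle file can clone from it like from a remote.
git clone -q -b main "$work/repo.bundle" "$work/copy"
git -C "$work/copy" log --oneline
```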
Using a .gitignore file one can make files/directories stored in the working directory invisible to Git. Files like your API keys or generated files can be left untracked without cluttering the output of git status. To ignore files with a specific extension we can use the "*" wildcard. For example, we can use *.exe and *.out to ignore the executables generated in a project that’s written in C.
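A quick way to check that an ignore rule does what you expect is git check-ignore. A minimal sketch, assuming Git 2.28 or newer; the file names are hypothetical.

```shell
set -e
work=$(mktemp -d)
git init -q -b main "$work/repo"
cd "$work/repo"

# A source file plus two generated executables.
echo 'int main(void) { return 0; }' > main.c
touch a.out program.exe

# Ignore every file ending in .exe or .out.
printf '%s\n' '*.exe' '*.out' > .gitignore

# Only .gitignore and main.c show up as untracked.
git status --short

# Ask Git which rule matches a given file.
git check-ignore -v a.out program.exe
```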
If you ever find yourself in a pickle where the work in progress is not yet complete enough to commit and an emergency task has come up, you can stash the changes you have made with git stash. After completing the emergency task you can come back to the work in progress with git stash apply. A stash is not tied to the branch it was created on, so you can stash on one branch and apply on another.
Another exciting aspect of Git is its hooks, which are located in the hooks folder inside the .git directory. A hook is a script that Git executes every time a particular event occurs in a repository. Each of the .sample scripts in the hooks directory represents a different event that you can listen for (removing the .sample extension activates the script), and each of them can do anything, from automatically creating and publishing releases to enforcing a commit policy and making sure a project compiles. For more details regarding hooks refer to the official Git documentation.
To look at the difference between two commits one can use the git diff command, where git diff <commit-1>..<commit-2> shows the exact changes that have been made. The git diff command is incredibly useful for pinpointing contributions from other developers. Using git diff without any arguments shows the changes in the working directory that have not been staged yet.
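The stash round trip described above looks like this in practice. A minimal sketch, assuming Git 2.28 or newer; the file names, commit messages and identity are made up.

```shell
set -e
export GIT_AUTHOR_NAME="Demo" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo" GIT_COMMITTER_EMAIL="demo@example.com"

work=$(mktemp -d)
git init -q -b main "$work/repo"
cd "$work/repo"

echo "notes" > notes.txt
git add notes.txt && git commit -q -m "Add notes"

# Work in progress that is not ready to be committed.
echo "half-finished idea" >> notes.txt

# Put it aside; the working directory is clean again.
git stash -q
git status --short

# Deal with the emergency.
echo "urgent fix" > fix.txt
git add fix.txt && git commit -q -m "Urgent fix"

# Bring the work in progress back.
git stash apply -q
git diff
```

git stash apply leaves the entry on the stash list; git stash pop would apply it and remove it in one go.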
One can also use the git checkout command to check out individual files and see what their various versions looked like. For example, git checkout HEAD <filename> restores the file to its version in the most recent commit, discarding any uncommitted changes to it.
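A tiny sketch of restoring a single file, assuming Git 2.28 or newer; config.txt and the identity are hypothetical. The -- simply separates the commit from the file name.

```shell
set -e
export GIT_AUTHOR_NAME="Demo" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo" GIT_COMMITTER_EMAIL="demo@example.com"

work=$(mktemp -d)
git init -q -b main "$work/repo"
cd "$work/repo"

echo "good" > config.txt
git add config.txt && git commit -q -m "Add config"

# An edit we regret.
echo "broken" > config.txt

# Throw the edit away and go back to the committed version.
git checkout HEAD -- config.txt
cat config.txt
```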
Aliasing commands lets you move around Git faster. This can be achieved by changing the config file with commands like git config --global alias.co checkout (use --local instead of --global to limit the alias to the current repo), which lets you run the checkout command with just git co. One can learn more about it by referring to the Git documentation. Git includes a long list of configuration options, all of which can be found in the official manual. Note that storing your global configuration in a plaintext file makes it incredibly easy to transfer your settings to a new Git installation: just copy ~/.gitconfig onto your new machine.
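Aliases can be tried without touching your global settings by scoping them to a scratch repo. A minimal sketch, assuming Git 2.28 or newer; the alias st, the branch name and the identity are made up for illustration.

```shell
set -e
export GIT_AUTHOR_NAME="Demo" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo" GIT_COMMITTER_EMAIL="demo@example.com"

work=$(mktemp -d)
git init -q -b main "$work/repo"
cd "$work/repo"
echo "x" > x.txt
git add x.txt && git commit -q -m "Add x.txt"

# --local stores the aliases in this repo's .git/config only.
git config --local alias.co checkout
git config --local alias.st "status --short"

# "git co" now behaves exactly like "git checkout".
git co -q -b experiment
git st

# List the aliases defined so far.
git config --get-regexp '^alias\.'
```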
Part-4 Conclusion
In the last leg of the blog, let’s take a look at the actual internals of Git. Plumbing refers to the low-level commands that give us access to the actual internal representation of the codebase. Let’s start with something very basic, i.e. what happens when we make a commit. The plumbing command used here is git cat-file commit HEAD, which shows us the commit object and its fields: tree, parent, author, committer and the commit message. Git represents every snapshot as a tree object, and each commit is tied to a parent, which in turn is a commit.
The structure in which Git works is: you have blob objects that store file data, above which are tree objects that store other trees and blob objects. Then you have commit objects that tie trees into the project history. When we run git ls-tree <commit or tree ID> (for example git ls-tree HEAD) we get to look at the various blobs and trees. We can take a look at a particular blob with git cat-file blob <blob-ID>, which prints the file data stored in the blob. Git shares the same blob across multiple trees when a file is unchanged. All of the data is stored in the objects folder inside the .git directory. Each folder inside it is named after the first two characters of an object ID and holds the objects (blobs, trees and commits) whose IDs start with those characters. git gc packs these loose objects into compressed pack files and cleans up objects that are no longer reachable, which helps when the repo grows very big. Git already runs it automatically from time to time, but running git gc manually every now and then is harmless and keeps your repository optimised.
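To see the commit, tree and blob objects for yourself, the following sketch walks from HEAD down to the file contents. It assumes Git 2.28 or newer; hello.txt and the identity are hypothetical.

```shell
set -e
export GIT_AUTHOR_NAME="Demo" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo" GIT_COMMITTER_EMAIL="demo@example.com"

work=$(mktemp -d)
git init -q -b main "$work/repo"
cd "$work/repo"

echo "hello plumbing" > hello.txt
git add hello.txt && git commit -q -m "Add hello.txt"

# The commit object: tree, author, committer and message (no parent on a first commit).
git cat-file commit HEAD

# The tree the commit points to: one line per entry (mode, type, ID, name).
git ls-tree HEAD

# Look up the ID of the blob behind hello.txt and print what it stores.
blob_id=$(git rev-parse HEAD:hello.txt)
git cat-file -t "$blob_id"
git cat-file blob "$blob_id"
```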
When we perform a git commit, the index is first written out as a tree object. A commit object is then created that points to this tree and to the parent commit, and this is what generates the commit ID. One can perform this entire process manually with plumbing commands such as git write-tree and git commit-tree. A commit created this way is dangling until a branch points to it, so the last step is to update the current branch. This can be done by editing the branch’s file under refs/heads in the .git directory (the file that HEAD points to), or more safely with git update-ref.
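The manual version of a commit can be sketched with four plumbing commands. This assumes Git 2.28 or newer and creates the very first commit of a repo, so there is no parent to pass (a later commit would add -p <parent ID> to git commit-tree); the file name and identity are made up.

```shell
set -e
export GIT_AUTHOR_NAME="Demo" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo" GIT_COMMITTER_EMAIL="demo@example.com"

work=$(mktemp -d)
git init -q -b main "$work/repo"
cd "$work/repo"

echo "hello" > hello.txt

# 1. Stage the file: stores the blob and records it in the index.
git update-index --add hello.txt

# 2. Write the index out as a tree object.
tree=$(git write-tree)

# 3. Create a commit object that points to that tree.
commit=$(git commit-tree "$tree" -m "Manual commit")

# 4. The commit is dangling until a branch points to it.
git update-ref refs/heads/main "$commit"

git log --oneline
```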
As we start to wrap things up remember that as you migrate these skills to real-world projects, Git is merely a tool for tracking your files, not a cure-all for managing software projects. No amount of intimate Git knowledge can make up for a haphazard set of conventions within a development team.
This blog was meant to prepare you for the realities of distributed software development and not to transform you into a Git expert overnight. You should be able to manage your own projects, collaborate with other Git users, and perhaps most importantly, understand exactly what any other piece of Git documentation is trying to convey. With all of these convenient features, it’s easy to get so caught up in designing the perfect workflow that you lose sight of Git’s underlying purpose. As you add new commands to your repertoire, remember that Git should always make it easier to develop a software project, never harder. If you ever find that Git is causing more harm than good, don’t be scared to drop some of the advanced features and go back to the basics. Pat yourself on the back for sticking through the blogs and understanding a portion of the Global Information Tracker, also known as Git.
While it’s impossible to cover all of Git’s supporting features in a two-part blog series such as this, I hope that you now have a clearer picture of Git’s numerous capabilities. In the resources/references I have shared links to some resources one can use to dive into the depths of Git. If you have made it this far, let me know your thoughts: you can mail me your feedback with the subject Git blog 808. I hope you enjoyed the journey of trying to understand Git a little better : )
Resources/References
1. Git documentation
2. Ry's Git Tutorial Book
3. Interactive Git Tutorials
4. Pro Git Book
5. Git CheatSheet