Working with and Understanding Git (Part 0)
In the beginning there was always a need to keep data safe. We used cp, sync, rsync, and similar tools to do that, and then came RCS, CVS, and VSS, of which CVS and VSS were centralized. The idea was mostly collaboration-driven, so that multiple people could have their own versions. But most of the time that meant a lot of network activity. Enter Git.
It gives every user their own complete version control system. And at what cost? Many earlier version control systems used what is called delta storage, meaning only the change between one version and the next is stored, whereas Git uses snapshot storage, meaning each commit records the entire state of the project. This may appear to be a huge overhead, but it is not, because of Git's data model.
Screenshot: which contemporary version control systems use delta storage and which use snapshot storage.
Git treats every file, irrespective of type, as a blob. It prefixes the contents with a small header giving the object type and size, names the result by its SHA-1 sum, and stores it compressed with zlib. Git also periodically gathers these objects into pack files, which save further space by storing similar objects as deltas of one another. All of this happens under the hood, so it is not as bad on storage as you might have imagined.
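To make the naming scheme concrete, here is a minimal Python sketch of how a blob gets its name. The function name git_object is made up for illustration; the framing (type, a space, the size, a NUL byte, then the contents) is the standard loose-object layout.

```python
import hashlib
import zlib
from typing import Tuple


def git_object(obj_type: str, body: bytes) -> Tuple[str, bytes]:
    """Return (40-digit SHA-1 name, zlib-compressed bytes) for an object.

    git_object is a hypothetical helper name, not part of any Git API.
    """
    # Git hashes "<type> <size>\0" followed by the raw contents.
    header = f"{obj_type} {len(body)}".encode() + b"\x00"
    store = header + body
    return hashlib.sha1(store).hexdigest(), zlib.compress(store)


name, compressed = git_object("blob", b"hello world\n")
print(name)                         # 3b18e512dba79e4c8300dd08aeb37f8e728b8dad
print(zlib.decompress(compressed))  # b'blob 12\x00hello world\n'
```

The name depends only on the type and the bytes, which is why the same file content always ends up under the same name no matter who adds it.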
One of the main concepts in Git's data model is that it is composed of blobs (your stuff), trees (folders), commits (each of which can point to parent commits), and tags. A commit is the snapshot recorded when you tell Git to write to its database, and a tag is a named pointer to a specific object, usually a commit. All of these are immutable, and that is what gives Git its power in branching and merging.
We will see that later.
Let's understand these entities a bit better with an example:
$ mkdir test-project
$ cd test-project
$ git init
Initialized empty Git repository in .git/
$ echo 'hello world' > file.txt
$ git add .
$ git commit -a -m "initial commit"
[master (root-commit) 54196cc] initial commit
1 file changed, 1 insertion(+)
create mode 100644 file.txt
$ echo 'hello world!' >file.txt
$ git commit -a -m "add emphasis"
[master c4d59f3] add emphasis
1 file changed, 1 insertion(+), 1 deletion(-)
What are the 7 digits of hex that Git responded to the commit with? It turns out that every object in the Git history is stored under a 40-digit hex name. That name is the SHA-1 hash of the object's contents; among other things, this ensures that Git will never store the same data twice (since identical data is given an identical SHA-1 name), and that the contents of a Git object will never change (since that would change the object's name as well). The 7-character hex strings here are simply abbreviations of such 40-character strings. Abbreviations can be used everywhere the 40-character strings can be used, so long as they are unambiguous.
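The "so long as they are unambiguous" rule can be sketched in a few lines of Python. This is only an illustration of the idea, not how Git implements its lookup; resolve_prefix and the known list are hypothetical names, and the object names are the ones that appear in this walkthrough.

```python
from typing import Iterable


def resolve_prefix(prefix: str, names: Iterable[str]) -> str:
    """Expand an abbreviated object name to the single full name it matches."""
    prefix = prefix.lower()
    matches = [n for n in names if n.startswith(prefix)]
    if not matches:
        raise KeyError(f"no object matches {prefix!r}")
    if len(matches) > 1:
        raise ValueError(f"{prefix!r} is ambiguous")
    return matches[0]


# Full names taken from the example repository in this post.
known = [
    "54196cc2703dc165cbd373a65a4dcf22d50ae7f7",
    "c4d59f390b9cfd4318117afde11d601c1085f241",
    "3b18e512dba79e4c8300dd08aeb37f8e728b8dad",
]
print(resolve_prefix("54196cc", known))
```

With only a handful of objects even two or three characters are enough; larger repositories need longer abbreviations before a prefix becomes unique.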
It is expected that the content of the commit object you created while following the example above generates a different SHA-1 hash than the one shown above because the commit object records the time when it was created and the name of the person performing the commit.
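A small Python sketch shows why your hashes will differ. It builds the text of a parentless commit and hashes it the way Git frames objects. The author identities, timestamps, and the fixed +0000 timezone are invented for the example, and commit_name is a hypothetical helper.

```python
import hashlib


def commit_name(tree: str, author: str, timestamp: int, message: str) -> str:
    """SHA-1 name of a parentless commit (hypothetical helper)."""
    ident = f"{author} {timestamp} +0000"  # timezone fixed for simplicity
    body = (
        f"tree {tree}\n"
        f"author {ident}\n"
        f"committer {ident}\n"
        f"\n"
        f"{message}\n"
    ).encode()
    store = f"commit {len(body)}".encode() + b"\x00" + body
    return hashlib.sha1(store).hexdigest()


tree = "92b8b694ffb1675e5975148e1121810081dbdffe"
a = commit_name(tree, "Alice <alice@example.com>", 1700000000, "initial commit")
b = commit_name(tree, "Alice <alice@example.com>", 1700000001, "initial commit")
c = commit_name(tree, "Bob <bob@example.com>", 1700000000, "initial commit")
print(a != b, a != c)  # True True
```

The tree and the message are identical in all three cases; one second of difference or a different name is enough to produce a completely different commit name.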
We can ask Git about this particular object with the cat-file command.
$ git cat-file -t 54196cc2
commit
$ git cat-file commit 54196cc2
tree 92b8b694ffb1675e5975148e1121810081dbdffe
initial commit
A tree can refer to one or more "blob" objects, each corresponding to a file. In addition, a tree can also refer to other tree objects, thus creating a directory hierarchy. You can examine the contents of any tree using ls-tree (remember that a long enough initial portion of the SHA-1 will also work):
$ git ls-tree 92b8b694
100644 blob 3b18e512dba79e4c8300dd08aeb37f8e728b8dad file.txt
Thus we see that this tree has one file in it. The SHA-1 hash is a reference to that file's data:
$ git cat-file -t 3b18e512
blob
A "blob" is just file data, which we can also examine with cat-file:
$ git cat-file blob 3b18e512
hello world
Note that this is the old file data; the tree named by the initial commit is a snapshot of the directory state as it was recorded by the first commit.
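For the curious, a tree can be rebuilt by hand too. The sketch below assumes a tree containing only plain files, so it sorts entries by file name; real Git has an extra rule for ordering subdirectories that is ignored here. blob_sha and tree_name are made-up helper names.

```python
import hashlib


def blob_sha(data: bytes) -> bytes:
    """Raw 20-byte SHA-1 of a blob, the form stored inside tree entries."""
    return hashlib.sha1(b"blob %d\x00" % len(data) + data).digest()


def tree_name(entries) -> str:
    """entries: iterable of (mode, filename, raw_sha) for plain files only."""
    # Each entry is "<mode> <filename>\0" followed by the 20 raw hash bytes.
    body = b"".join(
        mode.encode() + b" " + filename.encode() + b"\x00" + raw_sha
        for mode, filename, raw_sha in sorted(entries, key=lambda e: e[1])
    )
    return hashlib.sha1(b"tree %d\x00" % len(body) + body).hexdigest()


old_tree = tree_name([("100644", "file.txt", blob_sha(b"hello world\n"))])
new_tree = tree_name([("100644", "file.txt", blob_sha(b"hello world!\n"))])
print(old_tree)
print(new_tree)
```

You can compare the printed names with the tree names Git reports in this walkthrough. Because a tree embeds the hashes of its blobs, changing one character in one file changes the tree name as well.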
All of these objects are stored under their SHA-1 names inside the Git directory:
$ find .git/objects
.git/objects/info
.git/objects/
.git/objects/3b
.git/objects/3b/18e512dba79e4c8300dd08aeb37f8e728b8dad
.git/objects/92
.git/objects/92/b8b694ffb1675e5975148e1121810081dbdffe
.git/objects/54
.git/objects/54/196cc2703dc165cbd373a65a4dcf22d50ae7f7
.git/objects/a0
.git/objects/a0/423896973644771497bdc03eb99d5281615b51
.git/objects/d0
.git/objects/d0/492b368b66bdabf2ac1fd8c92b39d3db916e59
.git/objects/c4
.git/objects/c4/d59f390b9cfd4318117afde11d601c1085f241
and the contents of these files are just the compressed data plus a header identifying their length and their type. The type is one of blob, tree, commit, or tag.
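The directory layout and the header can be reproduced with a short Python sketch. It writes into a throwaway temporary directory rather than a real .git/objects, and write_loose and read_loose are hypothetical helper names.

```python
import hashlib
import os
import tempfile
import zlib


def write_loose(objects_dir: str, obj_type: str, body: bytes) -> str:
    """Store an object under <first 2 hex digits>/<remaining 38>."""
    store = f"{obj_type} {len(body)}".encode() + b"\x00" + body
    name = hashlib.sha1(store).hexdigest()
    path = os.path.join(objects_dir, name[:2], name[2:])
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(zlib.compress(store))
    return name


def read_loose(objects_dir: str, name: str):
    """Decompress an object and split the header from the contents."""
    with open(os.path.join(objects_dir, name[:2], name[2:]), "rb") as f:
        store = zlib.decompress(f.read())
    header, _, body = store.partition(b"\x00")
    obj_type, size = header.decode().split(" ")
    if int(size) != len(body):
        raise ValueError("size in header does not match contents")
    return obj_type, body


with tempfile.TemporaryDirectory() as objects_dir:
    name = write_loose(objects_dir, "blob", b"hello world!\n")
    obj_type, body = read_loose(objects_dir, name)

print(name[:2] + "/" + name[2:], obj_type, body)
```

The two-character directory plus 38-character file name is the same split visible in the find output above.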
The simplest commit to find is the HEAD commit, which we can find from .git/HEAD:
$ cat .git/HEAD
ref: refs/heads/master
As you can see, this tells us which branch we're currently on, and it tells us this by naming a file under the .git directory, which itself contains a SHA-1 name referring to a commit object, which we can examine with cat-file:
$ cat .git/refs/heads/master
c4d59f390b9cfd4318117afde11d601c1085f241
$ git cat-file -t c4d59f39
commit
$ git cat-file commit c4d59f39
tree d0492b368b66bdabf2ac1fd8c92b39d3db916e59
parent 54196cc2703dc165cbd373a65a4dcf22d50ae7f7
author 1143418702 -0500
committer 1143418702 -0500
The "tree" object here refers to the new state of the tree:
$ git ls-tree d0492b36
100644 blob a0423896973644771497bdc03eb99d5281615b51 file.txt
$ git cat-file blob a0423896
hello world!
and the "parent" object refers to the previous commit:
$ git cat-file commit 54196cc2
tree 92b8b694ffb1675e5975148e1121810081dbdffe
initial commit
The tree object is the tree we examined first, and this commit is unusual in that it lacks any parent.
Most commits have only one parent, but it is also common for a commit to have multiple parents. In that case the commit represents a merge, with the parent references pointing to the heads of the merged branches.
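The parent count is easy to read out of the commit text itself. The following Python sketch parses the header lines of a commit; parse_commit and kind are illustrative names, and the merge example uses two parent names from this post purely as placeholders.

```python
def parse_commit(text: str) -> dict:
    """Split a commit's text into its tree, its parents, and its message."""
    headers, _, message = text.partition("\n\n")
    commit = {"tree": None, "parents": [], "message": message}
    for line in headers.splitlines():
        key, _, value = line.partition(" ")
        if key == "tree":
            commit["tree"] = value
        elif key == "parent":
            commit["parents"].append(value)
    return commit


def kind(commit: dict) -> str:
    """Classify a commit by how many parents it records."""
    count = len(commit["parents"])
    return "root" if count == 0 else "ordinary" if count == 1 else "merge"


second = parse_commit(
    "tree d0492b368b66bdabf2ac1fd8c92b39d3db916e59\n"
    "parent 54196cc2703dc165cbd373a65a4dcf22d50ae7f7\n"
    "\n"
    "add emphasis\n"
)
# Hypothetical merge: two "parent" lines, one per merged branch head.
merge = parse_commit(
    "tree d0492b368b66bdabf2ac1fd8c92b39d3db916e59\n"
    "parent 54196cc2703dc165cbd373a65a4dcf22d50ae7f7\n"
    "parent c4d59f390b9cfd4318117afde11d601c1085f241\n"
    "\n"
    "merge example\n"
)
print(kind(second), kind(merge))  # ordinary merge
```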
So now we know how Git uses the object database to represent a project's history:
- "commit" objects refer to "tree" objects representing the snapshot of a directory tree at a particular point in the history, and refer to "parent" commits to show how they're connected into the project history.
- "tree" objects represent the state of a single directory, associating directory names to "blob" objects containing file data and "tree" objects containing subdirectory information.
- "blob" objects contain file data without any other structure.
- References to commit objects at the head of each branch are stored in files under .git/refs/heads/.
- The name of the current branch is stored in .git/HEAD
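The last two points can be tied together with a small Python sketch that follows HEAD to a commit name. It builds a fake Git directory in a temporary folder and assumes refs are stored as individual files; real repositories may also keep refs in a packed-refs file, which this sketch ignores. resolve_head is a hypothetical helper name.

```python
import os
import tempfile


def resolve_head(git_dir: str) -> str:
    """Follow HEAD through "ref: ..." indirections to a commit name."""
    with open(os.path.join(git_dir, "HEAD")) as f:
        value = f.read().strip()
    while value.startswith("ref: "):
        ref = value[len("ref: "):]
        with open(os.path.join(git_dir, *ref.split("/"))) as f:
            value = f.read().strip()
    return value


with tempfile.TemporaryDirectory() as git_dir:
    os.makedirs(os.path.join(git_dir, "refs", "heads"))
    with open(os.path.join(git_dir, "HEAD"), "w") as f:
        f.write("ref: refs/heads/master\n")
    with open(os.path.join(git_dir, "refs", "heads", "master"), "w") as f:
        f.write("c4d59f390b9cfd4318117afde11d601c1085f241\n")
    head_commit = resolve_head(git_dir)

print(head_commit)  # c4d59f390b9cfd4318117afde11d601c1085f241
```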
This leads us to references, which are lightweight, movable, mutable pointers. They point to commits so that we can branch and merge.
In the diagram, the gray box is the mutable pointer that we will use for branching and merging. Now we start changing code; the exclamation marks show the changed code. Let's see what Git does.
Git creates new objects and simply moves the reference, since it is mutable.
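Here is a toy model of that behaviour in Python. It is deliberately simplified and does not use Git's real object format: objects go into an append-only dictionary keyed by hash, while refs is an ordinary dictionary whose entries get overwritten. All names here (objects, refs, store, commit) are invented for the illustration.

```python
import hashlib

objects = {}  # name -> content; entries are added, never changed
refs = {}     # branch name -> commit name; freely overwritten


def store(content: bytes) -> str:
    """Add content under its hash; existing entries are left untouched."""
    name = hashlib.sha1(content).hexdigest()
    objects.setdefault(name, content)
    return name


def commit(branch: str, snapshot: bytes) -> str:
    """Create a new commit-like object and move the branch ref to it."""
    parent = refs.get(branch) or "none"
    name = store(b"parent " + parent.encode() + b"\n" + snapshot)
    refs[branch] = name  # only the pointer moves
    return name


first = commit("master", b"hello world\n")
second = commit("master", b"hello world!\n")
print(refs["master"] == second, first in objects, len(objects))  # True True 2
```

After the second commit the first object is still there, byte for byte; the only thing that changed in place is the small entry in refs.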
We will cover branching, merging, and workflows in another post.