Groovy Fun with Git - Part 1 of 3

Pro Git, Scott Chacon's great book on Git, has a chapter on Git internals that is a must-read if you want to look under the hood and see how Git does things. There are many blog posts, slide presentations, and video tutorials on this topic. I find Chacon's treatment to be the shortest and the clearest.

In Pro Git's Chapter 10, Git Internals, Scott uses a very simple example to show how you can run small experiments to explore the internals of Git. He shows how Git builds its on-disk data structures (in the .git/objects directory) as you tell it to do things, and how you can use Git's "plumbing" commands to examine the data structure as it grows.

Reading along, I found out how brilliant Git's internal design is. You can inspect it with ordinary text tools and a few Git commands. You don't need specialized GUIs or hex dump tools (compare with the nightmare from Microsoft known as the registry). Scott shows how Git is essentially a key-value store, with simple constraints, modeled after the Unix filesystem design. The brilliant part of the design is its amazing simplicity. The data structure basically consists of two node types: a "blob" node that holds zlib-compressed file content, and a "tree" node that holds a list of pointers to other "blob" and "tree" nodes. This allows a recursive implementation of object trees (the revision tree). Git also has node types, notably the "commit" node, that reference revision trees and hold commit metadata. The commit nodes form a linked list: each commit points at its immediately preceding commit, its parent (a merge commit points at more than one parent). The ultimate parent is the first commit. Later commits are children of previous commits. A branch is simply a reference to a commit node. That's it! Git's internal design can be described in one paragraph. For the details, see Chapter 10 of Scott's book. Of course, learning Git's internals also requires some exploration of commands and their effect on the internal structures.
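To make the "key-value store" idea concrete, here is a minimal sketch of how the key of a blob is derived: it is the SHA-1 of a short header ("blob", the content size in bytes, and a NUL byte) followed by the content itself. The function name blob_key is my own, and the sketch assumes a repository using the default SHA-1 object format and a system that provides sha1sum (on macOS, shasum would be the likely substitute). It does not touch any repository.

```shell
#!/usr/bin/env bash
# blob_key: hypothetical helper, not part of Git.
# Reads content on stdin and prints the key Git would assign to it as a blob:
# SHA-1 over "blob <size-in-bytes>\0<content>".
blob_key() {
  local tmp size
  tmp=$(mktemp)
  cat > "$tmp"                              # buffer stdin so we can measure it
  size=$(wc -c < "$tmp" | tr -d ' ')        # byte count, stripped of padding
  { printf 'blob %s\0' "$size"; cat "$tmp"; } | sha1sum | cut -d' ' -f1
  rm -f "$tmp"
}

# The example content used in Pro Git's Git Internals chapter:
printf 'test content\n' | blob_key
# prints d670460b4b4aece5915caf5c68d12f560a9fe3e4
```

Inside a repository, "git hash-object --stdin" (without -w) should report the same key for the same content without writing anything.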

You can explore Git's object database as you do a Unix filesystem. You can also look at Git as a NoSQL database with a high-level query language (porcelain commands) and a low-level language (plumbing commands). The internals of the data structures are a masterpiece of simple, elegant design that is not hidden from view, but available and explorable via OS tools and low-level commands. I wish the major databases (MySQL, MS SQL Server, etc.) provided that kind of visibility into their internals, at that level of accessibility. Git is a tremendous demonstration of Linus Torvalds's design genius.

As I followed along, and later while doing my own simple exploratory experiments, I found that I was repeatedly using a few Git plumbing commands to examine what the object database looked like. Essentially you need "git cat-file -p" to print out the content of a Git object. Scott uses a few other plumbing commands to create objects in Git. I decided to refrain from using plumbing commands to make things happen since, in practice, this is a dangerous habit to get into. You can easily corrupt your Git repository with a single command. But plumbing commands that are read-only and help examine the database should be fine.
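As an illustration of the read-only style of inspection, here is a small sketch that wraps "git cat-file". The function name show_object is hypothetical. It uses only the -t (type), -s (size), and -p (pretty-print) options, none of which modify the repository, and it has to be run from inside a Git working tree.

```shell
#!/usr/bin/env bash
# show_object: hypothetical read-only helper around "git cat-file".
# Argument: an object ID or any revision expression Git understands,
# for example HEAD, 'HEAD^{tree}', or HEAD:README.
show_object() {
  local id=$1
  printf 'type: %s\n' "$(git cat-file -t "$id")"   # blob, tree, commit, or tag
  printf 'size: %s\n' "$(git cat-file -s "$id")"   # size of the content in bytes
  git cat-file -p "$id"                            # human-readable content
}

# Usage, from inside a repository with at least one commit:
#   show_object HEAD            # the latest commit: tree, parent, author, message
#   show_object 'HEAD^{tree}'   # the top-level tree that commit points at
```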

In most of my experiments, I needed to see a snapshot of the data structure, similar to what you see with "tree .git/objects" but showing more information. I conduct my experiments with three vertical panes in Gnome Terminator. In the leftmost pane, I type commands. In the center pane, I show the output of "tree .git/objects". In the rightmost pane, I show a view similar to "tree .git/objects" but with information obtained from "git cat-file -p", or from a shell script that calls it. See the screenshot below.

The "git cat-file" plumbing command shows only one object. I needed to see all the objects as the repository grew. So I wrote a simple dump tool, a small Bash script that loops through the .git/objects directory, constructs each object's name (the SHA-1) from the two-character directory name concatenated with the 38-character file name, and calls "git cat-file -p <object name>". The output from the Bash script can be massaged with the usual Unix text tools (grep, sed, awk, sort, cut, etc.) to filter, sort, and format.
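The loop described above can be sketched roughly as follows. This is a reconstruction under my own assumptions, not the original script: the function names list_object_ids and dump_objects are hypothetical, only loose objects are visited (objects that Git has already packed under .git/objects/pack are skipped), and dump_objects must be run from inside the repository so that "git cat-file" can find it.

```shell
#!/usr/bin/env bash
# list_object_ids: print the full ID of every loose object under an objects
# directory by joining the two-character directory name with the file name.
list_object_ids() {
  local objects_dir=${1:-.git/objects}
  local dir file
  # Only two-hex-character directories hold loose objects; this skips info/ and pack/.
  for dir in "$objects_dir"/[0-9a-f][0-9a-f]; do
    [ -d "$dir" ] || continue
    for file in "$dir"/*; do
      [ -f "$file" ] || continue
      printf '%s%s\n' "${dir##*/}" "${file##*/}"
    done
  done
}

# dump_objects: print the type and pretty-printed content of each loose object.
# Read-only: it calls nothing but "git cat-file".
dump_objects() {
  local objects_dir=${1:-.git/objects}
  local id
  list_object_ids "$objects_dir" | while read -r id; do
    printf '== %s (%s)\n' "$id" "$(git cat-file -t "$id")"
    git cat-file -p "$id"
    echo
  done
}

# Usage, from the top of a working tree:
#   dump_objects                  # everything
#   dump_objects | grep '^== '    # just the object IDs and types
```

Keeping the ID construction in its own function makes it easy to pipe the list through sort or grep before asking Git for the content.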

As I did more and more experiments to explore different aspects and commands of Git, my little Bash script proved pretty hard to extend, as I wanted more and more features.

I considered replacing Bash with a higher-level scripting language, provided I could make the switch quickly, say within a few hours. There is a lot of good material about using Python as a sysadmin language. For me, that would be a major diversion. I know some Python, but not enough to work fast. I decided to try Groovy (I am mainly a Java developer) as a Bash replacement language. I was very pleasantly surprised at how easy that proved to be. It took less than two hours to translate my Bash script to a Groovy script. I just needed to Google how to call an OS command from Groovy and how to check its error and output. The rest was regular Groovy. I found that approach, Groovy for shell scripting, to be elegant, fast enough, and perfect for my purposes. So far my purpose is only to have a tool that allows me to explore the Git object database.

In subsequent posts, I will describe how to use my Groovy tool (gitobjects) to support probing experiments using Git. It really made this project fun for me. I will also go over the Groovy code.


       

