Git Mirror Anywhere using the Dumb Http Protocol

Lets talk about Git. If you’ve done any professional software development, you’ve probably heard about Git.

Git is a free and open source distributed version control system designed to handle everything from small to very large projects with speed and efficiency.

It’s a tiny, but powerful piece of software, that most software developers use every day. Even so, under the hood there’s dozens of powerful features that most developers don’t even know exist. Today I hope to introduce you to one of them, the “Dumb HTTP Protocol”.


I recently found myself in a position needing to mirror a git repo to a firewalled environment where I didn’t want to stand up a dedicated Git server. I had access to a blob storage account, that could be used to serve static content over HTTP, but no compute.

While investigating the Git protocol for Gitmask I had previously learned about something called the “Dumb HTTP Protocol”. Unlike the SSH and HTTP Git protocol that most of your are aware of, the Dumb HTTP protocol expects the bare Git repository to be served like normal files from the web server.

At first glance, this looks liked exactly what we want, however, as you read further into the documentation you’ll see “Basically, all you have to do is put a bare Git repository under your HTTP document root and set up a specific post-update hook, and you’re done”.

Since we don’t want to run a server at all, a post-hook script seems like a non-starter. Thankfully this is not the case, as long as you are ok with a bit of extra work.


The Dumb HTTP Protocol

Before we go to the solution, lets take a moment to dive into what actually happens when you attempt to clone from a Git remote using the Dumb HTTP protocol. Please note, some of the following examples are copied from the Git Documentation.

git clone http://server/simplegit-progit.git

The first thing this command does is pull down the info/refs file. This file is written by the update-server-info command in the Post hook, and does not normally exist.

GET $GIT_REPO_URL/info/refs HTTP/1.0

S: 200 OK
S:
S: 95dcfa3633004da0049d3d0fa03f80589cbcaf31	refs/heads/maint
S: ca82a6dff817ec66f44342007202690a93763949	refs/heads/master
S: 2cb58b79488a98d2721cea644875a8dd0026b115	refs/tags/v1.0
S: a3c2e2402b99163d1d59756e5f207ae21cccba4c	refs/tags/v1.0^{}

The returned content is a UNIX formatted text file describing each ref and its known value. The file should not include the default ref named HEAD.

Now you have a list of the remote references and SHA-1s. Next, you look for what the HEAD reference is so you know what to check out when you’re finished:

GET $GIT_REPO_URL/HEAD HTTP/1.0

ref: refs/heads/master

You need to check out the master branch when you’ve completed the process. At this point, you’re ready to start the walking process. Because your starting point is the ca82a6 commit object you saw in the info/refs file, you start by fetching that:

GET $GIT_REPO_URL/objects/ca/82a6dff817ec66f44342007202690a93763949 HTTP/1.0

(179 bytes of binary data)

You get an object back – that object is in loose format on the server, and you fetched it over a static HTTP GET request. You can zlib-uncompress it, strip off the header, and look at the commit content:

$ git cat-file -p ca82a6dff817ec66f44342007202690a93763949
tree cfda3bf379e4f8dba8717dee55aab78aef7f4daf
parent 085bb3bcb608e1e8451d4b2432f8ecbe6306e7e7
author Scott Chacon <schacon@gmail.com> 1205815931 -0700
committer Scott Chacon <schacon@gmail.com> 1240030591 -0700

Change version number

Next, you have two more objects to retrieve – cfda3b, which is the tree of content that the commit we just retrieved points to; and 085bb3, which is the parent commit:

GET $GIT_REPO_URL/objects/08/5bb3bcb608e1e8451d4b2432f8ecbe6306e7e7

(179 bytes of data)

To see what packfiles are available on this server, you need to get the objects/info/packs file, which contains a listing of them (also generated by update-server-info):

GET $GIT_REPO_URL/objects/info/packs
P pack-816a9b2334da9953e530f27bcac22082a9f5b835.pack

We’ll stop here. At this point we have a mechanism for retrieving information about the head of each branch, and a mechanism for retrieving the file content associated with a commit.

Git Compatible Static Content Repository

So how do we leverage this knowledge to generate a version of our Git repository, that we can serve using a simple HTTP content server (no post-hook.sh necessary)?

First we need to clone a bare version of our Git repository locally.

git clone --bare $GIT_REPO_URL

Then we’ll run the git update-server-info command on our bare repository, to generate the info files that Git clients expect.

cd $REPO_DIR
git update-server-info

At this point, we can copy this directory and serve it using a simple HTTP server (eg. S3 over CloudFront, Nginx, Apache, Artifactory, etc.).

References

  • https://git-scm.com/book/en/v2/Git-Internals-Transfer-Protocols
  • https://git-scm.com/docs/http-protocol

Jason Kulatunga

Build Automation & Infrastructure guy @Adobe. I write about, and play with, all sorts of new tech. All opinions are my own.

San Francisco, CA blog.thesparktree.com

Subscribe to Sparktree

Get the latest posts delivered right to your inbox.

or subscribe via RSS with Feedly!