GitHub saving space?

Posted by – 07/02/2009

I was browsing GitHub, getting to know the system and feeling pretty amazed by it (seriously… I felt like I had discovered Orkut for developers)… when a thought struck me: how do they save space?

Yes, I know git is pretty efficient when it comes to saving space. Yes, I also know that space is becoming cheaper with time, but still, they claim to have 50k+ developers hooked up to their servers… if they’re not doing something about space, things will get inefficient quite quickly. Rails alone seems to have 464 forks! If each of them is a bare clone of the ‘canonical’ repository on GitHub’s side, that is a lot of space wasted on duplicates…

Git has one amazing feature: it names every object it keeps track of by a hash of its content. It surely doesn’t seem too complicated to design a scheme that avoids having two copies of objects with the same hash. So all those forks of Rails, on GitHub’s side, would be just hardlinks to the ‘canonical’ repository…
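To make the idea concrete, here is a minimal sketch of hash-based deduplication with hardlinks. It is an illustration only, not how git or GitHub actually store objects: the function name `dedupe_with_hardlinks`, the flat `pool_dir` layout, and hashing the raw file bytes with SHA-1 are all assumptions made for the example. It also assumes the pool lives on the same filesystem as the repositories and outside of them, since hardlinks cannot cross filesystems.

```python
import hashlib
import os


def dedupe_with_hardlinks(pool_dir, object_dirs):
    """Replace identical files across object_dirs with hard links into pool_dir.

    Hypothetical helper: returns how many duplicate files were swapped for links.
    """
    os.makedirs(pool_dir, exist_ok=True)
    replaced = 0
    for object_dir in object_dirs:
        for root, _dirs, files in os.walk(object_dir):
            for name in files:
                path = os.path.join(root, name)
                with open(path, "rb") as handle:
                    digest = hashlib.sha1(handle.read()).hexdigest()
                pooled = os.path.join(pool_dir, digest)
                if not os.path.exists(pooled):
                    # First time this content is seen: the pool entry becomes
                    # a second name for the same file, so no data is copied.
                    os.link(path, pooled)
                elif not os.path.samefile(path, pooled):
                    # Same content is already pooled: link to it under a
                    # temporary name, then atomically replace the duplicate.
                    tmp = path + ".tmp-link"
                    os.link(pooled, tmp)
                    os.replace(tmp, path)
                    replaced += 1
    return replaced
```

Calling `dedupe_with_hardlinks("/srv/pool", ["/srv/rails.git/objects", "/srv/fork.git/objects"])` (made-up paths) would leave every file in place under its original name, but files with identical content would share one copy on disk.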

Surely the people working at GitHub are smart enough to have thought about it on their own… Who knows? Maybe that’s exactly what they’re doing already! If so, can I ask them to share their scheme, since it could be very useful to the rest of us? If not, can we beat them ;-D?


6 Comments on GitHub saving space?

  1. spectra says:

    @Tom,

    Thanks. That answers my question. Great service, BTW.

  2. We do indeed use alternates to share objects between forks of repositories. We also never do a gc with prune so that a parent repo will never lose objects that a fork might still be using. All of this keeps our total disk usage very minimal!

  3. Matt Palmer says:

    I’m pretty sure GitHub are well aware of the limitations of the --shared option… At any rate, disk space is piddlingly cheap, and they’ve got Engine Yard SANs behind them to provide pretty much unlimited storage. In the scaling game, disk space is pretty much the last thing you ever have to worry about these days.

  4. spectra says:

    @ulrik, Aqua,

    You are right! I didn’t notice they are using that at first, but a search in their blog brought this up.

    I believe it’s not documented in any other place that GitHub uses --shared (AKA alternates).

    BTW, how does that “warning” in the git-clone manpage about the --shared option:

    NOTE: this is a possibly dangerous operation; do not use it unless you understand what it does. If you clone your repository using this option, then delete branches in the source repository and then run git-gc(1) using the --prune option in the source repository, it may remove objects which are referenced by the cloned repository.

    affects GitHub?

  5. Aqua says:

    Yes, as ulrik says, they probably do a “git clone --shared [...]” operation.

  6. ulrik says:

    Hi, the git folks thought about this themselves first! It is called “alternates”, and you can instruct git clone to use alternates when setting up a repository (cloning locally). You can use this to save lots of space if you have 100 copies of the Linux source tree or similar!
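For readers curious what the “alternates” mentioned in the comments look like on disk: git keeps them in a plain text file, `objects/info/alternates`, inside the repository's git directory, listing other object directories one per line. `git clone --shared` sets this file up for you. The sketch below writes it by hand to show the mechanism. The function name `add_alternate` and the repository paths are hypothetical, and nothing here is claimed to be GitHub's actual setup.

```python
import os


def add_alternate(fork_git_dir, parent_git_dir):
    """Make fork_git_dir borrow objects from parent_git_dir via alternates.

    Hypothetical helper: both arguments are git directories (a bare
    repository, or the .git directory of a working copy).
    """
    parent_objects = os.path.abspath(os.path.join(parent_git_dir, "objects"))
    info_dir = os.path.join(fork_git_dir, "objects", "info")
    os.makedirs(info_dir, exist_ok=True)
    alternates_path = os.path.join(info_dir, "alternates")

    # Keep any alternates that are already listed, one path per line.
    entries = []
    if os.path.exists(alternates_path):
        with open(alternates_path) as handle:
            entries = [line.strip() for line in handle if line.strip()]

    if parent_objects not in entries:
        entries.append(parent_objects)

    with open(alternates_path, "w") as handle:
        handle.write("\n".join(entries) + "\n")
    return alternates_path
```

As the manpage warning quoted above says, the fork then depends on the parent keeping its objects, so pruning in the parent can break the fork.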
