Fast CDN-based repository clones on Bitbucket Cloud

February 2nd 2017 Erik van Zijst in Bitbucket Cloud, Mercurial

Mercurial or Git?

Git and Mercurial were both started in April of 2005 by Linux kernel developers. Both are fully distributed, similar in technical design, and both have grown large communities over the last 12 years.

A few years later dedicated code hosting sites began to spring up around these new source control systems. Bitbucket Cloud was one of those early sites, launched in 2008 as a Mercurial-only service.

A lot has happened since then. Today Bitbucket Cloud is one of the largest, and busiest, DVCS services, seamlessly working with both Mercurial and Git.

Our aim is for Bitbucket to abstract away most of the underlying SCM specifics in favor of a unified interface. That said, under the hood we try to take advantage of the individual strengths of both Git and Mercurial. To this end we recently did some work to improve our Mercurial integration.

Clonebundles: faster, CDN-based clones

When cloning a repository, both Mercurial and Git collect all the repo data the client needs and then generate a custom, compressed file that the client can apply locally. This is a very expensive process, especially for larger repos, and most of our server-side resources are spent serving clone and pull operations.

Luckily as of 3.6 (2015), Mercurial offers the ability for the server to host a pre-bundled snapshot, or clonebundle, of the entire repository externally. When cloning, clients then automatically download the clonebundle first, immediately and transparently followed by a pull for any recent changes not in the bundle.
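To make the two steps concrete, here is a minimal Python sketch of the flow. It is a toy model, not Mercurial's code: the `clone` function, the list-of-strings "repository" and the changeset names are all hypothetical and only illustrate how a snapshot plus a small pull adds up to a full clone.

```python
# Toy model of a clonebundle-assisted clone (not Mercurial's real code).

def clone(server_changesets, bundle_changesets=None):
    """Return (local_repo, pulled_from_server) for a fresh clone.

    server_changesets: everything the server currently has, oldest first.
    bundle_changesets: the pre-generated snapshot, or None if none is offered.
    """
    local = []
    if bundle_changesets is not None:
        # Step 1: apply the static snapshot that is hosted externally.
        local.extend(bundle_changesets)
    # Step 2: an ordinary pull transfers only what the client still lacks.
    have = set(local)
    missing = [cs for cs in server_changesets if cs not in have]
    local.extend(missing)
    return local, missing


server = [f"cs{i}" for i in range(1000)]
snapshot = server[:990]  # the bundle was generated 10 changesets ago

repo, pulled = clone(server, snapshot)
print(len(repo), "changesets cloned,", len(pulled), "pulled from the server")
```

Without a snapshot the same function falls back to pulling everything from the server, which is the expensive case described above.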

On the client you can tell when a clonebundle is being applied:

$ hg clone https://bitbucket.org/pypy/pypy
destination directory: pypy
applying clone bundle from https://media-api.atlassian.io/file/197da297-28a0-4a44-8f78-54c8ccafb845/binary?token=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJiNjRjYThlNi1lNzAzLTRjYjktOGYzMC1jYWU0NGY3MGVhMjQiLCJhY2Nlc3MiOnsidXJuOmZpbGVzdG9yZTpmaWxlOjE5N2RhMjk3LTI4YTAtNGE0NC04Zjc4LTU0YzhjY2FmYjg0NSI6WyJyZWFkIl19LCJuYmYiOjE0ODM2NTI3NzYsImV4cCI6MTQ4MzY1MzE5Nn0.TxibjiWrLa6F-qvyHjav65SfyNVRSFUmsLEmloH4-k0&client=b64ca8e6-e703-4cb9-8f30-cae44f70ea24
adding changesets
adding manifests
adding file changes
added 89375 changesets with 279660 changes to 46755 files (+179 heads)
finished applying clone bundle
searching for changes
adding changesets
adding manifests
adding file changes
added 9 changesets with 13 changes to 8 files (+1 heads)
updating to branch default
5528 files updated, 0 files merged, 0 files removed, 0 files unresolved

Here you can see that the clonebundle did the heavy lifting, providing the client with 89375 changesets, while Bitbucket only had to provide an additional 9 through the subsequent pull. What would normally have been a very costly, CPU-bound server-side operation became effectively free.
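As a quick sanity check on those numbers, the snippet below works out what share of the clone came from the static bundle. The two counts are taken from the output above; the variable names are ours.

```python
bundle_changesets = 89375  # "added 89375 changesets" while applying the bundle
pulled_changesets = 9      # "added 9 changesets" in the follow-up pull

total = bundle_changesets + pulled_changesets
bundle_share = bundle_changesets / total
print(f"{total} changesets in total, {bundle_share:.2%} from the clonebundle")
```

For this clone, roughly 99.99% of the changesets never had to be bundled by the Bitbucket servers at all.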

An additional benefit is that, because clonebundles are static files, they can be served much more quickly than bundles generated on the fly, leading to substantially faster clone times in some cases.

On very small repos, the latency of the additional HTTP request to fetch the bundle tends to outweigh the benefit of the faster download, so you may not see a clonebundle for every repository.
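A back-of-the-envelope model shows where that trade-off tips. This is only a sketch under simplifying assumptions: a fixed extra latency for the bundle request and constant transfer rates. The function name and every number below are hypothetical and do not describe Bitbucket's actual thresholds.

```python
def break_even_bytes(extra_latency_s, direct_bytes_per_s, cdn_bytes_per_s):
    """Repository size above which fetching a clonebundle is faster.

    Solves  size / direct = extra_latency + size / cdn  for size.
    """
    if cdn_bytes_per_s <= direct_bytes_per_s:
        return float("inf")  # the bundle never makes up for the extra request
    per_byte_saving = 1 / direct_bytes_per_s - 1 / cdn_bytes_per_s
    return extra_latency_s / per_byte_saving


# Hypothetical figures, chosen purely for illustration.
size = break_even_bytes(
    extra_latency_s=0.5,
    direct_bytes_per_s=2_000_000,
    cdn_bytes_per_s=10_000_000,
)
print(f"break-even at about {size / 1e6:.2f} MB")
```

Below the break-even size the extra round trip costs more than the faster download saves, which is why tiny repositories are better off without a bundle.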

Besides using Mercurial 3.6 or higher, there is nothing you need to do to take advantage of clonebundles.

GeneralDelta revlogs: improved compression

The past few years also saw an improvement in compression efficiency through a modification in Mercurial's revlog file format.

A revlog file is a series of revisions. When a new revision arrives, its content is compared against the previous revision in the file and a zlib-compressed delta/diff is generated and appended to the revlog. Similar to MPEG video's "key frames", the length of the "delta chain" is kept in check by occasionally storing a full revision.
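The idea can be sketched in a few lines of Python. This is a deliberately simplified toy, not the real revlog format: the `ToyRevlog` class, the prefix/suffix delta encoding and the fixed `MAX_CHAIN` limit are all assumptions made for illustration, and Mercurial's own heuristics for when to store a full revision differ.

```python
import zlib


def make_delta(base: bytes, new: bytes) -> bytes:
    """Toy delta: skip the shared prefix and suffix, store the changed middle."""
    limit = min(len(base), len(new))
    p = 0
    while p < limit and base[p] == new[p]:
        p += 1
    s = 0
    while s < limit - p and base[-1 - s] == new[-1 - s]:
        s += 1
    header = p.to_bytes(4, "big") + s.to_bytes(4, "big")
    return zlib.compress(header + new[p:len(new) - s])


def apply_delta(base: bytes, delta: bytes) -> bytes:
    raw = zlib.decompress(delta)
    p = int.from_bytes(raw[:4], "big")
    s = int.from_bytes(raw[4:8], "big")
    return base[:p] + raw[8:] + base[len(base) - s:]


class ToyRevlog:
    MAX_CHAIN = 3  # hypothetical limit; forces a "key frame" every few revisions

    def __init__(self):
        self.entries = []  # (is_full, payload, chain_length)

    def append(self, text: bytes) -> int:
        if not self.entries or self.entries[-1][2] >= self.MAX_CHAIN:
            # Start a new chain with a full, compressed revision.
            self.entries.append((True, zlib.compress(text), 0))
        else:
            # Store only the difference against the previous revision in the file.
            prev = self.read(len(self.entries) - 1)
            chain = self.entries[-1][2] + 1
            self.entries.append((False, make_delta(prev, text), chain))
        return len(self.entries) - 1

    def read(self, rev: int) -> bytes:
        is_full, payload, _ = self.entries[rev]
        if is_full:
            return zlib.decompress(payload)
        # Walk back along the delta chain until a full revision is found.
        return apply_delta(self.read(rev - 1), payload)


log = ToyRevlog()
texts = [b"line\n" * 50 + str(i).encode() for i in range(8)]
for t in texts:
    log.append(t)
print([is_full for is_full, _, _ in log.entries])
```

Reading a revision means replaying deltas back to the nearest full revision, which is why the chain length has to be bounded.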

Unfortunately a revlog linearizes the DAG, interleaving revisions from parallel branches, and so a revision's physical parent in the revlog may not be its actual logical parent. Diffing revisions of diverging branches can lead to larger deltas in the revlog, affecting the overall compression ratio.

In version 1.9 (2011) this was addressed with the introduction of "GeneralDelta", which adds a logical parent pointer that allows deltas to be computed against their true logical parent, leading to smaller repos.
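The effect is easy to reproduce with a small simulation. The sketch below is an illustration under our own assumptions, not Mercurial's implementation: it uses a toy prefix/suffix delta, two hypothetical branches "a" and "b" whose revisions are interleaved in storage order, and pseudo-random line content. It compares the total delta size when diffing against the physically preceding revision with the total when diffing against the recorded logical parent.

```python
import hashlib
import zlib


def delta_size(base: bytes, new: bytes) -> int:
    """Compressed size of a toy delta: the part of `new` that lies between
    the prefix and suffix it shares with `base`."""
    limit = min(len(base), len(new))
    p = 0
    while p < limit and base[p] == new[p]:
        p += 1
    s = 0
    while s < limit - p and base[-1 - s] == new[-1 - s]:
        s += 1
    return len(zlib.compress(new[p:len(new) - s]))


def line(branch: str, k: int) -> bytes:
    # Pseudo-random content so zlib cannot trivially squash repeated lines.
    return hashlib.sha1(f"{branch}{k}".encode()).hexdigest().encode() + b"\n"


root = b"".join(line("root", k) for k in range(50))

# Two branches keep appending to their own parent; their revisions are
# interleaved in the order they would be written to the revlog.
texts, parents = [root], [None]
tip = {"a": 0, "b": 0}
for k in range(20):
    for branch in ("a", "b"):
        parent = tip[branch]
        texts.append(texts[parent] + line(branch, k))
        parents.append(parent)
        tip[branch] = len(texts) - 1

revs = range(1, len(texts))
# Classic layout: delta against the physically preceding revision.
linear = sum(delta_size(texts[r - 1], texts[r]) for r in revs)
# GeneralDelta-style layout: delta against the recorded logical parent.
general = sum(delta_size(texts[parents[r]], texts[r]) for r in revs)

print(f"previous-revision deltas: {linear} bytes")
print(f"logical-parent deltas:    {general} bytes")
```

In this model each delta against the previous revision has to carry everything that differs between the two branches, while each delta against the logical parent carries only the one new line.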

New repositories created with 3.7 (2016) or higher automatically use this new format; existing repos, however, need to be converted first. Over the holiday season we did this on our end, so every repository on Bitbucket has become a little more efficient.

Mozilla's Gregory Szorc (incidentally also responsible for clonebundles) offers a ton more interesting detail on his blog.

Bundle2: a better wire transport format

When data is pushed or pulled, the bundle protocol is used. The original protocol, however, had limited extensibility and predated later features like bookmarks, phases and obsolescence, which led to multi-stage, non-atomic pushes that in rare cases could cause race conditions.

The newer bundle2 protocol has addressed these and other issues. It also supports GeneralDelta, so it not only reduces the chattiness of the original protocol but offers better compression too.
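The difference between a multi-stage push and a single bundled one can be illustrated with a toy model. Everything in this sketch is hypothetical (the `ToyRepo` class, the part names and the validation rule) and it is not Mercurial's wire protocol. It only shows why applying all parts inside one transaction avoids leaving the repository half-updated.

```python
import copy


class ToyRepo:
    def __init__(self):
        self.state = {"changesets": [], "bookmarks": {}}

    def apply_part(self, kind, payload):
        if kind == "changesets":
            self.state["changesets"].extend(payload)
        elif kind == "bookmarks":
            for name, node in payload.items():
                if node not in self.state["changesets"]:
                    raise ValueError(f"bookmark {name!r} points at unknown {node!r}")
                self.state["bookmarks"][name] = node
        else:
            raise ValueError(f"unknown part type {kind!r}")


def push_multi_stage(repo, parts):
    """Original-protocol style: every kind of data is its own step."""
    for kind, payload in parts:
        repo.apply_part(kind, payload)  # a failure leaves earlier steps applied


def push_single_bundle(repo, parts):
    """bundle2 style: all parts arrive together and share one transaction."""
    snapshot = copy.deepcopy(repo.state)
    try:
        for kind, payload in parts:
            repo.apply_part(kind, payload)
    except Exception:
        repo.state = snapshot  # roll back every part, not just the failing one
        raise


# The bookmark refers to a changeset that is not part of the push, so it fails.
parts = [("changesets", ["c1", "c2"]), ("bookmarks", {"feature": "c9"})]

staged, bundled = ToyRepo(), ToyRepo()
for push, repo in ((push_multi_stage, staged), (push_single_bundle, bundled)):
    try:
        push(repo, parts)
    except ValueError:
        pass

print("multi-stage:", staged.state)
print("one bundle: ", bundled.state)
```

After the failed push, the multi-stage repository has kept the changesets without the bookmark, while the bundled one is back in its original state.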

Until now, Bitbucket exclusively supported the original bundle1 protocol, but with the migration to GeneralDelta, we have also enabled bundle2. Clients running 3.5 or higher will now automatically use it. Older bundle1 clients remain supported too.