Importing CVS repositories with Graph Based Algorithms
The cvs2svn people, mainly Michael Haggerty have come up with a new algorithm (see his mail) for cvs2svn 2.0, which is based on GraphTheory. In the branch net.venge.monotone.cvsimport-branch-reconstruction, MarkusWanner maintains an implementation in C++, with some monotone specific modifications.
Overview
An import consists of three steps:
- parsing the RCS files to collect all events into blobs
- split dependency cycles between those blobs
- streamline the graph, to resolve branch affiliation
- consuming the blobs in the graph to create monotone revisions
Parsing the RCS files
In a first step, the RCS files are parsed to generate file deltas and hashes. In all subsequent steps, files are only referred to by the hash, which allows us to do most CVS conversions in memory (as an example, importing the mozilla repository peaked at around 1.8 gb memory usage, last time I tried). Additionally, we also collect all cvs_event
s into blobs. Such cvs_event
s (which should probably better be called rcs_event
s) can be:
cvs_commit
: the commit action for this single filecvs_symbol
: a symbol on a specific commitcvs_tag
: a tag for the file at the underlying symbolcvs_branch_start
: a branchpoint or start of a branch at the underlying symbolcvs_branch_end
: end of a branch
According to the RCS version numbers, dependencies between those events are recorded, i.e.:
- a commit of revision 1.2 depends on revision 1.1
- symbol
MY_TAG
depends on the commit of revision 1.4, thecvs_tag
forMY_TAG
then depends on that symbol
Depending on the type of these cvs_event
s, they are collected in blobs: cvs_commit
s with the same author and changelog end up in the same blob, obviously all equally named symbols also get their own blob. And based on the underlying symbol, all cvs_tag
, cvs_branch_start
and cvs_branch_end
events are also collected in a separate blob.
After this step, the dependency graph looks something like this one here, from the importing_cvs_small_real_repo
unit test:
Splitting dependency cycles
In a second step, we check the graph for dependency cycles using a depth first search. Upon discovery of such a cycle, we try to find the best split point, to split one blob of that cycle into two. The simplest, and currently used method is splitting at the largest gap in time. This is repeated until no more cycles are left, we then have an acyclic graph, but not a tree - meaning there can still be cross or forward edges.
See such a dependency cycle highlighted in red below, from the importing_cvs_cycle_splitter2
unit test. As commit "blob A" has the largest gap in time, it's the one to get split up into two independent blobs in that sample. The example above does not have any cycles.
Streamlining the graph
To figure out in what branch a certain blob belongs to, we'd like to get a tree of branches. Additionally, as CVS - unlike monotone - cannot represent multiple heads and merges, we'd like to eliminate all cross or forward edges. We do that in various ways, ... to be described...
A sample for such a cross or forward edge, from the small_real_repo
test:
Consuming the blobs
As we now have a tree of blobs (revisions), it's trivial to import that into monotone. See the streamlined graphs from our two example unit tests, first the simpler cycle_splitter2
:
cycle_splitter2
graph after streamlining
And here the small_real_repo
in all its beauty:
small_real_repo
graph after streamlining
Another Sample, from importing_cvs_cycle_splitter3
Involving cycles with branch starts:
- all the graph before modifications
- all first cycle encountered
- all no more cycles, w/o weak deps
- all streamlining 1
- all streamlining 2
- final graph, importable by monotone
Real World test repositories, and how they are currently failing
Tested with: 040f83b3bcd4c04adc7e3947d06a9ccffc223ad5
2008-05-01T14:00:49
- netbsd/othersrc: creating monotone revs:
../nvm.cbr/rcs_import.cc:5,018: invariant 'I(parent_blobs.size() == 1)' violated
- openssl/openssl: cycle splitting stage:
../nvm.cbr/rcs_import.cc:3,598: invariant 'I(false)' violated