How do you combine "Revision Control" with "Workflow" for R?
I remember coming across R users writing that they use "revision control" (also known as "source control"), and I am curious to know: how do you combine revision control with your statistical analysis workflow?
Two (very) interesting discussions cover how to deal with the workflow, but neither of them addresses the revision control element.
A Long Update To The Question: Following some of the answers, and Dirk's question in the comments, I would like to focus my question a bit more.
After reading the Wiki article about "revision control" (which I was previously not familiar with), it became clear to me that when using revision control, what one does is build a development structure for one's code. This structure either leads to a "final product" or to several branches.
When building something like, let's say, a website, there is usually one end product you work towards (the website), with some prototypes along the way.
But when doing a statistical analysis, the work (in my view) is different. Sometimes you know where you want to get to. But more often, you explore: explore cleaning the dataset, explore different methods for statistical analysis, and ask various questions of your data (and I am writing this knowing how Frank Harrell and other experienced statisticians feel about data dredging).
That is why the workflow question in statistical programming is (in my view) a serious and deep one, raising many issues. The simpler ones are technical:
- Which revision control software do you use (and why)?
- Which IDE do you use (and why)?
The more interesting questions are about work process:
- How do you structure your files?
- What do you keep as a separate file and what as a revision? Or, asking it a different way: what should be a "branch" and what should be a "sub-project" in your code? For example, when starting to explore your data, should a plot be created and then erased because it didn't lead anywhere (but kept as a revision), or should there be a backup file of that path?
How you solve this tension was my initial curiosity. The second question is "what might I be missing?". What rules (of thumb) should one follow to avoid common pitfalls when doing statistical programming with version control?
My intuition is that statistical programming is inherently different from software development (I am writing this without being a real expert in statistical programming, and even less so in software development). That's why I am unsure which of the lessons I have read here about version control would be applicable.
Thanks a lot, Tal
After reading your update, it seems like you are viewing the choice and use of a version control system as dictating the structure of your repository and workflow. In my opinion, version control is more akin to an insurance policy, as it provides the following services:
Backups. If something gets accidentally deleted, or the whims of fate fry your hard drive, your work can be recovered from the repository. With distributed version control, nothing short of the apocalypse can cause you to lose work, in which case you'll probably have other things to worry about anyway.
The mother of all undo buttons. Was the analysis looking better an hour ago? a day ago? a week ago? Version control provides a rewind button that allows you to travel back in time.
If you are the only person working on a project, the above two points probably outline how version control systems will affect the way you work.
The other side of version control systems is that they foster collaborative efforts by allowing people to experiment on an isolated copy or "branch" of the project material and then "merge" any positive changes back into the master copy. They also provide a means for project members to keep tabs on whose changes affected which lines of which files.
As an example, I keep all of my college coursework under version control in a Subversion repository. I am the only one who works on this repository so I never branch or merge the source-- I just commit and occasionally rewind. The ability to rewind my work reduces the risks of trying some sort of new analysis-- I just do it. If two hours later it looks like it wasn't such a good idea, I just revert the project files and try something different.
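The commit-and-rewind habit described above could be sketched roughly as follows. Note the assumption: the sketch uses git rather than Subversion (where `svn revert` plays a similar role), and the temporary directory, the file name `analysis.R`, and the commit message are made up for illustration.

```shell
#!/bin/sh
set -eu

# Hypothetical throwaway project so the sketch is self-contained.
proj=$(mktemp -d)
cd "$proj"
git init -q
# Identity set locally so commits work on a fresh machine (placeholder values).
git config user.name "Example Analyst"
git config user.email "analyst@example.com"

# Commit a known-good state of the analysis.
printf 'fit <- lm(y ~ x, data = d)\n' > analysis.R
git add analysis.R
git commit -q -m "Baseline linear model"

# Try something new without committing it...
printf 'fit <- loess(y ~ x, data = d)\n' > analysis.R

# ...and later decide it was a bad idea: restore the committed copy.
git checkout -- analysis.R

# The file is back to the committed baseline.
cat analysis.R
```

Uncommitted edits are discarded by the checkout, so this only rewinds to the last commit. Going further back would mean checking out an older revision instead.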
In contrast, almost all of my non-coursework package/program development is hosted under git. In this sort of setting I frequently want to experiment on a branch while having a stable master copy available. I use git rather than Subversion in these situations because git makes branching and merging an effortless task.
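As a rough illustration of experimenting on a branch while keeping a stable copy, a session might look something like this. The branch name `try-robust-regression`, the file contents, and the temporary directory are hypothetical, and the stable branch is looked up at run time because its name (master or main) depends on the local git setup.

```shell
#!/bin/sh
set -eu

proj=$(mktemp -d)
cd "$proj"
git init -q
git config user.name "Example Analyst"
git config user.email "analyst@example.com"

printf 'model <- "lm"\n' > analysis.R
git add analysis.R
git commit -q -m "Stable baseline"

# Remember the name of the stable branch (master or main, depending on setup).
stable=$(git symbolic-ref --short HEAD)

# Experiment on an isolated branch; the stable copy is untouched.
git checkout -q -b try-robust-regression
printf 'model <- "rlm"\n' > analysis.R
git commit -q -am "Try robust regression"

# The experiment worked out, so merge it back into the stable branch
# and remove the now-redundant experiment branch.
git checkout -q "$stable"
git merge -q try-robust-regression
git branch -q -d try-robust-regression

cat analysis.R
```

Had the experiment failed, switching back to the stable branch and deleting the experiment branch with `git branch -D` would have left the stable copy exactly as it was.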
The important point is that in both of these cases the structure of my repository and the workflow I use are not decided by my version control system-- they are decided by me. The only impact the version control has on my workflow is that it frees me from worrying about trying something new, deciding I don't like it, and then having to undo all the changes to get back to where I started. Because I use version control, I can follow Yogi Berra's advice:
When you come to a fork in the road, take it.
Because I can always go back and take it the other way.
I use git, myself. Local repositories, stored in the same directory as the R project. That way, if I eliminate a project down the road, the repository goes with it; I can work offline; and I don't have IRB, FERPA, or HIPAA issues to deal with.
If I need added backup assurance, I can push to a remote (secured!) repository.
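A minimal sketch of that setup, under some assumptions: the project directory `my-analysis`, the remote name `backup`, and the file `explore.R` are invented for illustration, and a bare repository in a temporary directory stands in for the real secured remote.

```shell
#!/bin/sh
set -eu

work=$(mktemp -d)
proj="$work/my-analysis"     # hypothetical R project directory
remote="$work/backup.git"    # stand-in for a secured remote repository

mkdir "$proj"
cd "$proj"
git init -q                  # the repository lives in .git inside the project
git config user.name "Example Analyst"
git config user.email "analyst@example.com"

printf 'summary(d)\n' > explore.R
git add explore.R
git commit -q -m "First exploration script"

# Optional extra backup: create a bare repository elsewhere and push to it.
git init -q --bare "$remote"
git remote add backup "$remote"
git push -q backup HEAD

# Both commands should print the same commit hash.
branch=$(git symbolic-ref --short HEAD)
git rev-parse HEAD
git --git-dir="$remote" rev-parse "refs/heads/$branch"
```

Because the history sits in `.git` inside the project directory, deleting the project deletes the local repository with it. The pushed copy on the remote is what survives.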
My workflow is not that different from Bernd's. I usually have a main directory where I put all my *.R code files. As soon as I have more than about 5 lines in a text file I start version control, in my case git. Most of my work is not in a team context, meaning that I'm the only one changing my code. As soon as I make a substantive change (yes, that is subjective) I do a check-in. I agree with Dirk that this process is orthogonal to the workflow.
There's a lot of room for personal idiosyncrasies in version control, but I recommend this one tip as a best practice: if you report results to others (e.g. in a journal article, to your team, or to management in your firm), ALWAYS do a version control check-in right before running results that go out to others. Invariably, 3 months later someone will look at your results and ask some question about the code which you can't answer unless you know the EXACT state of the code when you produced those results. So make it a practice and put in the comments "this is the version of the code that I used for 4th quarter financials" or whatever your use case is.
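One way this tip might look with git is sketched below. Everything named here is an assumption for illustration: the file `financials.R`, the tag `q4-financials`, and the idea of writing the commit hash to `results-commit.txt` next to the results. The advice above only asks for a check-in with a descriptive comment; the tag is an optional extra that makes that check-in easy to find later.

```shell
#!/bin/sh
set -eu

proj=$(mktemp -d)
cd "$proj"
git init -q
git config user.name "Example Analyst"
git config user.email "analyst@example.com"

# Check in right before running the results that go out to others.
printf 'revenue_model <- "v1"\n' > financials.R
git add financials.R
git commit -q -m "This is the version of the code that I used for 4th quarter financials"

# Optional: an annotated tag gives that exact state a memorable name.
git tag -a q4-financials -m "Code used for 4th quarter financials"

# Optional: store the commit hash alongside the results that get sent out.
git rev-parse HEAD > results-commit.txt

# Months later, the exact code behind the results can be shown again.
git show q4-financials:financials.R
```

The file `results-commit.txt` is deliberately left uncommitted here, since it describes the commit rather than belonging to it.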
Also keep in mind that version control is no replacement for a good backup plan. My motto is: "3 copies. 2 geographies. 1 mind at peace."
EDIT (Feb 24, 2010): Joel Spolsky, one of the founders of , just released a highly visual and very cool intro to Mercurial. This tutorial alone may be reason to adopt Mercurial if you have not already chosen a revision control system. I think when it comes to Git vs. Mercurial the most important advice is to choose one and use it. Maybe use what your friends/coworkers use or use the one with the best tutorial. But just use one already! ;)
Rather than focusing on revision control in particular, it sounds like you're really asking a bigger question about how statistical analysis compares to software development. That's an interesting question. Here are some thoughts:
Data analysis can be more like an art than a science. In a sense, you might want to look for inspiration to the process that an author would follow when writing a book more than the process that a software developer would follow. On the other hand, I have yet to encounter a software project that followed a straight line. And even at a theoretical level, there is a great amount of variance in software development methodologies. Of these, given that a statistical analysis can be a discovery process (i.e. one that can't be fully planned up front), it would make sense to follow something like an agile methodology (much more so than something like the waterfall methodology). In other words, you need to plan for your analysis to be iterative and self-reflective.
That said, I think the notion that statistical analysis is purely exploratory with no goal in mind is potentially problematic. That can lead to the point where you are 5 steps past your eureka moment, and have no way to get back to it. There is always a goal of some sort, even if the goal itself is changing. Moreover, if there is no goal, how will you know when you've reached the end?
One approach is to start off with one R file as you start a project (or a set of files like in the Josh and Bernd examples), and progressively add to it (so that it grows in size) as you make discoveries. This is especially true when you have data that needs to be kept as part of the analysis. This file should be checked in to version control regularly to ensure that you can always step backwards if you make mistakes (allowing for incremental gains). Version control systems are immensely helpful in development not just because they ensure that you don't lose things, but also because they provide you with a timeline. Tag your check-ins so that you know what's in them at a glance, and note major milestones. I love JD's point about checking in before submitting something.
Once you have reached your final set of conclusions, it's often best to create a final version of your file that summarizes your analysis from start to end. You might even consider putting this into a Sweave document so that it's fully self-contained and literate.
You should also give serious thought to what others around you are doing. Nothing makes me cringe more than to see people reinventing the wheel, especially when it means extra work for the group as a whole to integrate with.
Your decisions about which version control system to use, which IDE, etc. (implementation issues) are ultimately extremely low on the totem pole in relation to the overall project management. Just use any one of them properly and you're already 95% of the way there, and the differences between them are small in comparison to the alternative of using nothing.
Lastly, if you are using something like GitHub, Google Code, or R-Forge, you will note something that they all have in common: a suite of tools beyond just a version control system. Namely, you should consider using things like the issue tracking system and the wiki to document progress and log open issues/tasks. The more organized you are with your analysis, the greater the likelihood of success.