Academic Technical Debt

Technical debt is "a concept in programming that reflects the extra development work that arises when code that is easy to implement in the short run is used instead of applying the best overall solution."

Although technical debt is most often discussed as a problem for industrial software engineering, academia has a severe technical debt problem of its own. Even in the rare case that you can obtain the code and data used to generate the analysis in a paper, the code is often broken (I am presuming that it worked at some point in the past).

Getting it running again requires debugging a foreign codebase that may or may not (probably not) have been written to be understandable to anyone other than the original author. Often this means the code never gets fixed.

The problem gets worse the more complex the analysis is. I think the main issues are portability, documentation, and maintenance. Most academic code is neither portable, documented, nor maintained, and of the three, maintenance is the least appreciated.

Portability refers to the usability of software in different environments. In this case I mean the ability to run analysis code on different people's computers and at different times. Does your code run on OS X, Linux, and Windows? Are all software dependencies recorded in a manner that will allow them to be available later and in different environments? Most people would find this too tedious to check manually, which is why continuous integration systems such as Travis-CI should be used where possible.

Python has a very nice way of ensuring dependency portability: virtual environments. They allow you to record and load specific versions of dependencies, ensuring portability across time. R has not really had something similar, though in some cases things like deployr might work. I think a simpler solution may be to write analyses as R packages, with versions corresponding to releases of the analysis. This would encode versioned dependencies as well as make use of the existing infrastructure designed to test the portability of R packages (i.e., via Travis-CI and CRAN).
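As a rough base-R sketch of the same idea (recording exact versions so the environment can be reconstructed later, roughly what `pip freeze` does for a virtual environment), something like the following works; the file name is arbitrary and this is not a full substitute for virtual environments or versioned package dependencies:

```r
# A minimal sketch, base R only: record the exact version of every
# installed package alongside the analysis output so that the
# environment the analysis ran against is at least documented.
pkgs <- as.data.frame(installed.packages()[, c("Package", "Version")],
                      stringsAsFactors = FALSE)
write.csv(pkgs, "r-dependency-versions.csv", row.names = FALSE)

# If the analysis is itself an R package, its DESCRIPTION can also pin
# minimum versions, e.g.  Imports: data.table (>= 1.10.4)
```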

Documentation is something that I think is getting better. Anyone who has had to work on an unfamiliar codebase has probably encountered poorly designed code without adequate documentation. Documenting your code is especially important if you aren't a software developer. It helps to think of it as documentation for your future self as well.
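To make that concrete, here is a small sketch of the kind of comment block that helps a later reader; the function and its columns are hypothetical, and roxygen2-style comments are just one convention:

```r
# A minimal sketch of documentation for a hypothetical analysis helper,
# written as roxygen2-style comments (which are ordinary R comments).

#' Summarise simulation error by experimental condition.
#'
#' @param results data.frame with columns `condition` and `error`.
#' @return data.frame with one row per condition and its mean error.
summarize_error <- function(results) {
  aggregate(error ~ condition, data = results, FUN = mean)
}
```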

Software maintenance is very commonly overlooked. I think this is a moral/incentive failure on our part which is very much related to our current publication-based scientific reward system. After a paper is published, who cares about maintaining the software that was needed to get the results? The only reward after publication is citations, and unless the paper was about a software package, citations are little affected by how well the software works, so maintenance isn't something many people care about. If you care about your own scientific integrity, though, then at the very least you should make a good-faith effort to help anyone who has a problem getting your code to work.

There are two things that I think make all of the above easier and which appear to be underappreciated: dependency minimization and maintainability. Minimization is easy to understand. The fewer external dependencies your code has, the fewer chances that one of them decreases your code's portability, all else equal. Of course, all else is often not equal: if you are reimplementing code contained in external packages that is more reliable than what you would write yourself, then minimizing dependencies may not be good at all. In mlr we were previously using reshape2 and plyr for internal manipulation of data. These packages have been superseded by dplyr and tidyr. All of these packages work pretty well, but they have a lot of external dependencies that mlr doesn't currently have. So instead we've rewritten everything with data.table, which has no dependencies and can do everything we were using reshape2/plyr for.
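As a small illustration (not mlr's actual code), the kind of reshaping and grouped summarising we previously did with reshape2/plyr can be done with data.table alone; the data here are made up:

```r
library(data.table)

# Hypothetical data, just to illustrate the pattern.
dt <- data.table(id = 1:3, a = rnorm(3), b = rnorm(3))

# Wide-to-long and back, previously reshape2::melt()/dcast().
long <- melt(dt, id.vars = "id", variable.name = "measure")
wide <- dcast(long, id ~ measure)

# Grouped summaries, previously plyr::ddply().
long[, .(mean_value = mean(value)), by = measure]
```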

Dependency maintenance is obviously easier when you have fewer of them, but I think it also makes sense to think a bit about how likely the dependency is to be maintained itself. Is there a team of people working on the package? Do they have long-term incentives to maintain it?

This is one of the reasons I have contributed to mlr rather than caret, despite the latter's much greater popularity. Below is caret's contributor summary page.

It is apparent that almost all of the development is handled by one person. This results in things like bug fixes being much slower to make their way into the package. What do you think would happen if Max (topepo) were to stop working on the package? Without other regular contributors there aren't other people ready to step in to maintain the package, much less to continue its development.

In contrast mlr's team is much deeper.

If Bernd no longer worked on the package, other team members could step in almost immediately to lead future development, and there are enough regular contributors (the list continues far below) that maintenance wouldn't be an issue.