In recent years several fields of science have been hit with reproducibility scandals. Initially precipitated by John Ioannidis’s work on reproducibility in clinical research [1], [2], the reproducibility crisis has expanded to include domains like psychological science [3] and cancer research [4]. Because these fields are experiment based, reproducing their work is often time and resource intensive. Career pressures push scientists to chase big new results frequently, creating a perverse incentive structure which encourages the production of dubious or outright fraudulent results [5], [6]. Indeed, researchers have a wide variety of tools at their disposal for publishing dubious research, such as selective choices in experimental design and p-value games.
Computational sciences differ from purely experimental sciences in that the craft produces an inherently replicable record: the code. (Leaving aside for the moment that even single-threaded computational execution isn’t truly replicable [7], [8], we’ll assume it’s close enough.) The existence of this de facto record means that results which depend on computational code should, in principle, be verifiable on nearly any computer (barring special hardware constraints). I understand that there are many real reasons why this is difficult, perhaps most notably cross-platform development issues, but the existence of these problems does not preclude an effort to reduce their severity.
Computational science has an opportunity to get ahead of this problem. In fact, some domains such as astronomy [9] and density functional theory [10] have already recognized it and started to solve it. This is a good start, but we can do much better. Very broadly, there are a number of issues which may prevent an article’s replication, or discourage a researcher from preparing their work sufficiently to allow replication. These issues fall into a number of rough categories, which I walk through below.
Before I get too deep into it, I generally refer to the ACM definitions of reproducibility. As an overview:
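- Repeatability: the same team can rerun its own experiment, with the same setup, and obtain the same result.
- Replicability: a different team can obtain the same result using the original team’s own artifacts (code, data, and so on).
- Reproducibility: a different team can obtain the same result with an independently developed experimental setup.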
I understand that some of these terms may be controversial; however, I am adopting the terms of the world I usually work in, as this eases my mental burden on a daily basis. Please bear with me.
Certain software practices or impressions make reproducibility more difficult, or influence people’s opinions about whether they should release their code. Being careful to avoid falling into a few of these traps, or mitigating their consequences, can improve the likelihood that a given work can be replicated or reproduced.
Sharing one’s code exposes not only the research but also one’s coding practices to greater scrutiny. For some people this is simply too much, and they refuse to go through the effort of preparing their code for release, reasoning that they will be criticized no matter what. For them it just isn’t a rewarding activity.
I would remind them, however, that constructive criticism usually leads to better code, and may catch mistakes in your paper or, for example, make your code more performant.
Code developed as part of a large collection of work often cannot easily be shared. This isn’t always a deal-breaker, but developers have to spend a lot of time making sure test cases are correct and that other features in the code base don’t prevent their paper’s code from working on a multitude of computers.
It is usually a good idea to develop logically separate libraries in their own repositories and to use git’s submodule capability to tie the pieces together for a given research project (for example, `git submodule add <url> extern/mylib` pins an external library at a specific commit inside your paper’s repository).
Researchers often do not want to put in the work required to learn how to properly package their work. This results in ad hoc packages which are not cross-platform compatible, or which do not work outside the author’s own computer. Such work really is worth the cost, however. The result is a maintainable build which can grow without inducing much extra work from ever-expanding makefiles and the like. Not only will it make developing the code easier, but it will make sharing complex code possible.
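As a minimal sketch of what this can look like in Python (the package name, dependency list, and version here are hypothetical), a short `setup.py` built on setuptools lets anyone install the project with `pip install .` rather than fighting a hand-rolled makefile:

```python
# setup.py -- a minimal, hypothetical packaging sketch using setuptools.
from setuptools import setup, find_packages

setup(
    name="mypaper-code",        # hypothetical package name
    version="0.1.0",
    packages=find_packages(),   # discover all Python packages in the repo
    install_requires=[
        "numpy",                # declare dependencies so pip can resolve them
    ],
    python_requires=">=3.8",
)
```

Declaring dependencies this way is much of what lets code move between machines without the “works on my laptop” problem.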
Sometimes code cannot be developed without legal issues preventing or impeding its sharing. There is often no avoiding this, so here I discuss legal scenarios which commonly prevent scientific work from being reproduced.
In some cases, input data or parts of the code base may contain personally identifying information (PII) or confidential information which cannot be shared with the public. This is a common reason that code is not disclosed in the medical community. It can be mitigated, however, if care is taken to keep such information out of the code base and to scrub it from any released data.
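As a hypothetical sketch of that scrubbing step (the file paths and column names here are assumptions for illustration), one might strip known PII columns from a tabular dataset before release:

```python
# Hypothetical sketch: strip named PII columns from a CSV before release.
import csv

PII_COLUMNS = {"name", "ssn", "address"}  # assumed PII column names

def scrub_csv(src_path: str, dst_path: str) -> None:
    """Copy a CSV, dropping any column whose header is in PII_COLUMNS."""
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        reader = csv.reader(src)
        header = next(reader)
        keep = [i for i, col in enumerate(header) if col.lower() not in PII_COLUMNS]
        writer = csv.writer(dst)
        writer.writerow([header[i] for i in keep])
        for row in reader:
            writer.writerow([row[i] for i in keep])

scrub_csv("patients.csv", "patients_public.csv")
```

Note that dropping named columns is only a first step; true de-identification also has to consider quasi-identifiers that can be combined to re-identify individuals.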
Sometimes work is developed in secret as part of governmental efforts. This code is often export controlled and cannot be shared with the public. In some cases, however, citizens of that country can gain access to the code, so even in these instances good coding practices should be followed to ensure ease of reproduction.
Sometimes work is developed by a group of people who feel they must protect their work as part of their competitive advantage. Whether or not this is right is another question, one I’ve written a post about; what matters here is that if the code can’t be shared with the public, replication is impossible. In some instances the license is permissive enough that there is a process for obtaining the code which most people will be able to pass. In those instances, patches can be shared and replication should be possible.
There are some hardware and low-level issues which are impossible to avoid. Even under the best of circumstances, for the most diligent of researchers, these will be a problem.
In some cases, studies are done on unusually large computers, or with special computer components, which makes replication difficult because the replicator must use the same size or type of hardware. FPGAs, GPUs, and now neuromorphic chips are all specialty or niche computing devices with specialized compute characteristics which would have to be emulated on general-purpose computers. In addition, large-scale simulations may run on the world’s leading-edge petascale computers and would take a prohibitively long time on a more normal-scale machine. This makes replicating such results difficult.
Floating point arithmetic has limited precision and is not associative [11]. This means that differences in summation order between runs of a program can produce different output. Some scientific techniques are especially sensitive to this issue, requiring hundreds or even thousands of digits of precision, at which point normal floating point arithmetic is no longer enough [12]. In these cases it is necessary to know that this variation exists and how it may affect the scientific output, so that scientists attempting to replicate the result know what kind of normal variation to expect.
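A minimal Python demonstration of both effects: regrouping the same three values changes the result, and summing the same list in a naive order loses information that a compensated sum preserves.

```python
# Floating point addition is not associative: regrouping changes the result.
a, b, c = 0.1, 0.2, 0.3
print((a + b) + c == a + (b + c))  # False
print((a + b) + c, a + (b + c))    # 0.6000000000000001 0.6

# Summation order matters too: a naive left-to-right sum absorbs each 1.0
# into the huge 1e16 terms, while math.fsum tracks the lost low-order bits.
import math
values = [1e16, 1.0, -1e16] * 1000
print(sum(values))        # 0.0
print(math.fsum(values))  # 1000.0
```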
Some reproducibility problems come about because of the way we work. Usually these result from a sloppy workflow, or from not being specific about the research procedure and its requirements.
Computationally based science occupies a strange middle ground between theory and practice. Often a theoretical description of the proposed work is created first, after which the implementation is developed. Implementation surfaces bugs which must be fixed, and sometimes the fixes actually change the theoretical description, but the scientist doesn’t remember to update that description [13], [14]. This is why sharing the code generating a work’s results is so important: it is a truly accurate description of the algorithm, which a paper cannot match.
This is a serious and pervasive issue in almost every science. When several groups study the same question, each will make different choices about how to select participants, what options to provide, or what format the input data takes (image size, for instance). As a result, studies which would otherwise cover the same ground cannot be compared in an apples-to-apples manner. In the realm of computational science, the availability of source code should help alleviate this issue, since slightly different codes and datasets can be molded to fit each other’s restrictions more easily.
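As a hypothetical sketch of that molding step (the directory names and target size are assumptions for illustration), a few lines using the Pillow library can resample one study’s images to match the input size another study’s code expects:

```python
# Hypothetical sketch: resample one dataset's images to the input size
# another study's code expects, so the two pipelines can be compared.
from pathlib import Path
from PIL import Image

TARGET_SIZE = (256, 256)  # assumed input size of the other study's code

def conform_images(src_dir: str, dst_dir: str) -> None:
    Path(dst_dir).mkdir(parents=True, exist_ok=True)
    for path in sorted(Path(src_dir).glob("*.png")):
        with Image.open(path) as img:
            img.resize(TARGET_SIZE).save(Path(dst_dir) / path.name)

conform_images("dataset_a", "dataset_a_resized")
```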
As a thought experiment, I’ve come up with three types of workflows which would be possible in a perfect world, but which are currently either extremely difficult or impossible, usually due to a lack of source code transparency and usability.
Tools we build as a community should all take steps toward making these workflows easier.
Awareness of the reproducibility problem is building in all the sciences. Computational science has a chance to get ahead of it and set an example for other fields to follow. While many issues affect computational reproducibility, every one of them can be tackled or mitigated in some way. Raising awareness and building tools to address these issues will improve the research produced by all members of the community.