In recent years several fields of science have been hit with reproducibility scandals. Initially precipitated by John Ioannidis’s work on reproducibility in clinical research [1], [2], the reproducibility crisis has expanded to include domains like psychological science [3] and cancer research [4]. Because these fields are experiment based, reproducing their work is often time and resource intensive. Career pressures push scientists to chase big new results frequently, creating a perverse incentive structure which encourages the production of dubious or outright fraudulent results [5], [6]. Indeed, researchers have a wide variety of tools at their disposal for publishing dubious research, such as selective choices in experimental design and p-value games.
Computational sciences differ from purely experimental sciences in that the craft produces an inherently replicable record: the code. (Leaving aside for the moment that even single-threaded computational execution isn’t truly replicable [7], [8], we’ll assume it’s close enough.) The existence of this de facto record means that results which depend on computational code should, in principle, be verifiable on nearly any computer (barring special hardware constraints). I understand that there are many real reasons why this is difficult, perhaps most notably cross-platform development issues, but the existence of these problems does not preclude an effort to reduce their severity.
Computational science has an opportunity to get ahead of this problem. In fact, some domains such as astronomy [9] and density functional theory [10] have already recognized it and started to solve it. This is a good start, but we can do much better. Very broadly, there are a number of issues which may prevent an article’s replication, or discourage a researcher from preparing their work sufficiently to allow replication. These issues fall into a number of rough categories, which I walk through below.
Before I get too deep into it, I generally refer to the ACM definitions of reproducibility. As an overview:
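- Repeatability: the same team can rerun its own experiment, with the same setup, and obtain the same result.
- Replicability: a different team can obtain the same result using the original team’s own artifacts (code, data, and so on).
- Reproducibility: a different team can obtain the same result with an independently developed experimental setup.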
I understand that some of these terms may be controversial; however, I am adopting the terms of the world I usually work in, as this eases my mental burden on a daily basis. Please bear with me.
Certain software practices or impressions make reproducibility more difficult, or influence people’s opinions about whether they should release their code. Being careful to avoid falling into a few of these traps, or mitigating their consequences, can improve the likelihood that a given work can be replicated or reproduced.
Sharing one’s code exposes not only the research but also one’s coding practices to greater scrutiny. For some people this is simply too much, and they refuse to go through the effort of preparing their code for release, reasoning that they will be criticized no matter what. For them it just isn’t a rewarding activity.
I would remind them, however, that constructive criticism usually leads to better code, and may catch mistakes in your paper or, for example, make your code more performant.
Code developed as part of a large collection of work often cannot easily be shared. This isn’t always a deal-breaker, but developers have to spend a lot of time making sure test cases are correct and that other features in the code base don’t prevent their paper’s code from working on a multitude of computers.
It is usually a good idea to develop logically separate libraries in their own repositories and to use git’s submodule capability to tie the pieces together for a given research project (for example, `git submodule add <url> extern/mylib` pins an external library at a specific commit inside your paper’s repository).
Researchers often do not want to put in the work required to learn how to properly package their work. This results in ad hoc packages which are not cross-platform compatible, or which do not work outside the author’s own computer. Such work really is worth the cost, however. The result is a maintainable build which can grow without inducing much extra work from ever-expanding makefiles and the like. Not only will it make developing the code easier, but it will make sharing complex code possible.
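As a minimal sketch of what this can look like in Python (the package name, dependency list, and version here are hypothetical), a short `setup.py` built on setuptools lets anyone install the project with `pip install .` rather than fighting a hand-rolled makefile:

```python
# setup.py -- a minimal, hypothetical packaging sketch using setuptools.
from setuptools import setup, find_packages

setup(
    name="mypaper-code",        # hypothetical package name
    version="0.1.0",
    packages=find_packages(),   # discover all Python packages in the repo
    install_requires=[
        "numpy",                # declare dependencies so pip can resolve them
    ],
    python_requires=">=3.8",
)
```

Declaring dependencies this way is much of what lets code move between machines without the “works on my laptop” problem.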
Sometimes code cannot be developed without legal issues preventing or impeding its sharing. There is often no avoiding this, so here I discuss legal scenarios which commonly prevent scientific work from being reproduced.
In some cases, input data or parts of the code base may contain personally identifying information (PII) or confidential information which cannot be shared with the public. This is a common reason that code is not disclosed in the medical community. It can be mitigated, however, if care is taken to keep such information out of the code base and to scrub it from any released data.
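As a hypothetical sketch of that scrubbing step (the file paths and column names here are assumptions for illustration), one might strip known PII columns from a tabular dataset before release:

```python
# Hypothetical sketch: strip named PII columns from a CSV before release.
import csv

PII_COLUMNS = {"name", "ssn", "address"}  # assumed PII column names

def scrub_csv(src_path: str, dst_path: str) -> None:
    """Copy a CSV, dropping any column whose header is in PII_COLUMNS."""
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        reader = csv.reader(src)
        header = next(reader)
        keep = [i for i, col in enumerate(header) if col.lower() not in PII_COLUMNS]
        writer = csv.writer(dst)
        writer.writerow([header[i] for i in keep])
        for row in reader:
            writer.writerow([row[i] for i in keep])

scrub_csv("patients.csv", "patients_public.csv")
```

Note that dropping named columns is only a first step; true de-identification also has to consider quasi-identifiers that can be combined to re-identify individuals.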
Sometimes work is developed in secret as part of governmental efforts. This code is often export controlled and cannot be shared with the public. In some cases, however, citizens of that country can gain access to the code, so even in these instances good coding practices should be followed to ensure ease of reproduction.
Sometimes work is developed by a group of people who feel they must protect their work as part of their competitive advantage. Whether or not this is right is another question, one I’ve written a post about; what matters here is that if the code can’t be shared with the public, replication is impossible. In some instances the license is permissive enough that there is a process for obtaining the code which most people will be able to pass. In those instances, patches can be shared and replication should be possible.
There are some hardware and low-level issues which are impossible to avoid. Even under the best of circumstances, for the most diligent of researchers, these will be a problem.
In some cases, studies are done on unusually large computers, or with special computer components, which makes replication difficult because the replicator must use the same size or type of hardware. FPGAs, GPUs, and now neuromorphic chips are all specialty or niche computing devices with specialized compute characteristics which would have to be emulated on general-purpose computers. In addition, large-scale simulations may run on the world’s leading-edge petascale computers and would take a prohibitively long time on a more normal-scale machine. This makes replicating such results difficult.
Floating point arithmetic has limited precision and is not associative [11]. This means that differences in summation order between runs of a program can produce different output. Some scientific techniques are especially sensitive to this issue, requiring hundreds or even thousands of digits of precision, at which point normal floating point arithmetic is no longer enough [12]. In these cases it is necessary to know that this variation exists and how it may affect the scientific output, so that scientists attempting to replicate the result know what kind of normal variation to expect.
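A minimal Python demonstration of both effects: regrouping the same three values changes the result, and summing the same list in a naive order loses information that a compensated sum preserves.

```python
# Floating point addition is not associative: regrouping changes the result.
a, b, c = 0.1, 0.2, 0.3
print((a + b) + c == a + (b + c))  # False
print((a + b) + c, a + (b + c))    # 0.6000000000000001 0.6

# Summation order matters too: a naive left-to-right sum absorbs each 1.0
# into the huge 1e16 terms, while math.fsum tracks the lost low-order bits.
import math
values = [1e16, 1.0, -1e16] * 1000
print(sum(values))        # 0.0
print(math.fsum(values))  # 1000.0
```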
Some reproducibility problems come about because of the way we work. Usually these result from a sloppy workflow, or from not being specific about the research procedure and its requirements.
Computationally based science occupies a strange middle ground between theory and practice. Often a theoretical description of the proposed work is created first, after which the implementation is developed. Implementation surfaces bugs which must be fixed, and sometimes the fixes actually change the theoretical description, but the scientist doesn’t remember to update that description [13], [14]. This is why sharing the code generating a work’s results is so important: it is a truly accurate description of the algorithm, which a paper cannot match.
This is a serious and pervasive issue in almost every science. When several groups study the same question, each will make different choices about how to select participants, what options to provide, or what format the input data takes (image size, for instance). As a result, studies which would otherwise cover the same ground cannot be compared in an apples-to-apples manner. In the realm of computational science, the availability of source code should help alleviate this issue, since slightly different codes and datasets can be molded to fit each other’s restrictions more easily.
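As a hypothetical sketch of that molding step (the directory names and target size are assumptions for illustration), a few lines using the Pillow library can resample one study’s images to match the input size another study’s code expects:

```python
# Hypothetical sketch: resample one dataset's images to the input size
# another study's code expects, so the two pipelines can be compared.
from pathlib import Path
from PIL import Image

TARGET_SIZE = (256, 256)  # assumed input size of the other study's code

def conform_images(src_dir: str, dst_dir: str) -> None:
    Path(dst_dir).mkdir(parents=True, exist_ok=True)
    for path in sorted(Path(src_dir).glob("*.png")):
        with Image.open(path) as img:
            img.resize(TARGET_SIZE).save(Path(dst_dir) / path.name)

conform_images("dataset_a", "dataset_a_resized")
```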
As a thought experiment, I’ve come up with three types of workflows which would be possible in a perfect world, but which are currently either extremely difficult or impossible, usually due to a lack of source code transparency and usability.
Tools we build as a community should all take steps toward making these workflows easier.
Awareness of the reproducibility problem is building in all the sciences. Computational science has a chance to get ahead of it and set an example for other fields to follow. While many issues affect computational reproducibility, every one of them can be tackled or mitigated in some way. Raising awareness and building tools to address these issues will improve the research produced by all members of the community.