Researchers have called for more transparency from The Public Health Agency of Sweden regarding the COVID-19 estimates for Sweden. Recently, a report was released covering such estimates for the Stockholm region. Along with the report, the code used for these estimates was uploaded to GitHub, which makes it possible for others to review and critique the work. In this post we take a look at the reproducibility aspects of this release. We find that it is possible, to some extent, to reproduce the figures in the report, and we suggest many improvements to the repository.
Introduction
To strengthen and validate our scientific claims, replication by a completely independent study is important, but it is time-consuming, costly and hard in practice.
Researchers in Computer Science have therefore since the 1980s called for methods and tools for making scientific work reproducible. The principle of Reproducible Research (Buckheit & Donoho, 1995; Claerbout & Karrenbach, 1992; Association for Computing Machinery (ACM), n.d.) is to make data and computer code available for others to analyze and criticize.
Reproducible research is a minimum standard when full replication of a study by independent researchers is not possible (Peng, 2011).
The Public Health Agency of Sweden released a report (Folkhälsomyndigheten, 2020) on the 21st of April with accompanying code on GitHub (committed on the 23rd). It is of course very positive that the code has been made available. Here we review and evaluate it from a reproducibility point of view, using the requirements from (Monperrus, 2018) and (Leek et al., 2016) to evaluate commit bb616e9 (2020-04-24) of the repository.
It is important to note that we will not review or critique this code from a health perspective. There will not be a single exponential curve in this post.
This is the structure of the repository:
├── Data
│   ├── Data_2020-04-10Ny.txt
│   └── Sverige_population_2019.Rdata
├── LICENSE
├── README.md
├── Results
│   ├── Figures
│   │   ├── Incidence_number_infected_14Days_CI_non-reported_98.7perc_less_inf_100perc.pdf
│   │   ├── ...
│   └── Tables
│       ├── Raw_data_fitted_model_para_p_asymp0.9873infect0.11.txt
│       ├── ...
└── Script
    └── Estimate_SEIR_for_sharing_new_incidence.R
Evaluation
The repository must be findable and downloadable
An important requirement for a good data science repository is that it is findable and downloadable (Monperrus, 2018). The report (Folkhälsomyndigheten, 2020) itself does not contain a link to the code, but we found a link on another page of The Public Health Agency of Sweden website. We didn’t find the repository by Googling the name of the report, but the repository includes the name of the report and a link to it.
The repository must be under version control and include a license
The code has been released on GitHub under the GNU General Public License v2.0. We note that the license seems to cover the results that are in the repository as well. Results that are generated by this code should not be covered by this license, and the report states that all figures are copyrighted and that permission must be given by the copyright holder to publish them.
Best practices for licensing scientific code are a larger topic, and one that could be debated, so we leave that as future work.
The first commit is verified, while the following commits are not. Verifying commits is a way to allow people to see that the content comes from a trusted source. The commits are from an individual without a GitHub account.
See issue Clarify license on GitHub.
The repository must be documented
There are no instructions on how to run this code in the README.md file. There is no inventory, but the folders are self-explanatory, with names such as Script and Data.
There are some instructions inside the script-file, such as an instruction to set the absolute path to the project, something we didn’t need to do. Absolute paths should be avoided.
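One way to keep absolute paths out of the script is to launch it from the repository root and let it use relative paths. The following is a minimal sketch, assuming the script reads Data/ and writes Results/ relative to its working directory (which matches our experience of not having to set the path). The helper name run_from is hypothetical and not part of the repository.

```shell
# Run a command from a given directory without changing the caller's directory.
# run_from is a hypothetical helper name, not part of the repository.
run_from() {
  dir="$1"
  shift
  ( cd "$dir" && "$@" )   # subshell: the cd does not leak into the calling shell
}

# Example: run the analysis from the root of the clone, wherever it lives.
# run_from "$(git rev-parse --show-toplevel)" Rscript Script/Estimate_SEIR_for_sharing_new_incidence.R
```

Because the directory change happens in a subshell, the caller stays where it was, and nothing in the script needs to know where the clone lives.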
The script file itself is sparsely commented, and very long. A better structure is needed to make it more readable.
The repository must be exercisable
We can see that there is an R script in the Script folder, so we will attempt to run it as is.
Rscript Script/Estimate_SEIR_for_sharing_new_incidence.R
We get an error related to character encoding (Error: invalid multibyte character in parser at line 16). The file appears to be encoded in ISO-8859, which we can check with
file Script/Estimate_SEIR_for_sharing_new_incidence.R
We can easily save this in UTF-8 instead.
iconv -f iso-8859-1 -t utf-8 < Script/Estimate_SEIR_for_sharing_new_incidence.R > Script/Estimate_SEIR_for_sharing_new_incidence_utf8.R
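The conversion can also be made safe to re-run. The sketch below converts a file only when it fails UTF-8 validation. It assumes that any file failing the check is ISO-8859-1, as the file command reported for this script. The function name to_utf8 is our own.

```shell
# Convert a script to UTF-8 only when it is not already valid UTF-8.
# Assumption: a file that fails UTF-8 validation is ISO-8859-1.
# to_utf8 is a hypothetical helper name.
to_utf8() {
  src="$1"
  dst="$2"
  if iconv -f UTF-8 -t UTF-8 "$src" > /dev/null 2>&1; then
    cp "$src" "$dst"                              # already valid UTF-8: copy unchanged
  else
    iconv -f ISO-8859-1 -t UTF-8 "$src" > "$dst"  # re-encode from Latin-1
  fi
}

# Example with the paths used above:
# to_utf8 Script/Estimate_SEIR_for_sharing_new_incidence.R Script/Estimate_SEIR_for_sharing_new_incidence_utf8.R
```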
Since there is no documentation, we don’t know what the required environment is. In the code itself, there is a note that R 3.5.2 has been used, and the loaded packages are listed in one place. We don’t have a full description of the session or environment, so we have to work out ourselves which versions of the dependencies were used. Luckily, there are not many dependencies, and we can install them as follows.
install.packages(c("reshape2", "openxlsx", "RColorBrewer", "rootSolve", "deSolve"))
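A session description of the kind we were missing is cheap to produce. As a sketch, the authors could have committed the output of R's sessionInfo() next to the script. The function name record_session and the file name session_info.txt are hypothetical. The package list is the one installed above.

```shell
# Write R's session description (R version, platform, package versions) to a file.
# record_session and the default file name are hypothetical; sessionInfo() ships with R.
record_session() {
  out="${1:-session_info.txt}"
  command -v Rscript > /dev/null 2>&1 || { echo "Rscript not found" >&2; return 1; }
  Rscript \
    -e 'pkgs <- c("reshape2", "openxlsx", "RColorBrewer", "rootSolve", "deSolve")' \
    -e 'invisible(lapply(pkgs, library, character.only = TRUE))' \
    -e 'sessionInfo()' > "$out"
}

# Example: record_session session_info.txt
```

The packages are loaded first so that their versions appear in the output.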
Another alternative, a better one IMHO, is to use the checkpoint package. It lets the author fix a date, so that another user gets the package versions that were available at that time. We add the following to the top of the script.
install.packages("checkpoint")
library(checkpoint)
checkpoint("2017-04-22")
An even better alternative would have been to use, and make available, a Docker image (Merkel, 2014; Boettiger, 2014) that has R and all the dependencies installed.
In any case, it is often a good idea to include a Makefile so that someone can run make to run the script.
See pull request Fixes encoding error and makes runnable on mybinder.org on GitHub.
Input data lineage
The data used to produce the result is included in the repository. The file ./Data/Data_2020-04-10Ny.txt contains values up to 2020-04-10.
"Datum" "Incidens"
2020-02-17 1
2020-02-18 0
2020-02-19 0
Following the requirements set by (Leek et al., 2016), we should have received four things: the raw data, a tidy data set, a code book, and a script that translates the raw data into tidy data. This repository contains only a tidy data set, and one that has been edited. There is a comment in the code saying that the data set differs from the reported case data in that the imported cases were removed.
Normally, it is important that the rawest form of the data that the researchers have access to is included in the repository. Here, however, that data likely contains privacy-sensitive information, so it is understandable that it is not included. It would be good to at the very least include a dataset with the reported and imported cases as two columns in the same file. This would allow someone to rerun the code with new data.
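With such a file in place, the tidy dataset could be derived by a script instead of being edited by hand. The sketch below assumes a hypothetical three-column source file with the header "Datum" "Rapporterade" "Importerade" (date, reported cases, imported cases). It also assumes that Incidens is simply reported minus imported, which is our reading of the comment in the code. Neither the column names nor the source file exist in the repository.

```shell
# Derive the tidy incidence file from a hypothetical three-column source file.
# Assumed header: "Datum" "Rapporterade" "Importerade" (date, reported, imported).
# Assumption: Incidens = reported cases minus imported cases.
make_tidy() {
  awk 'NR == 1 { print "\"Datum\" \"Incidens\""; next }  # replace the header
       { print $1, $2 - $3 }' "$1"                        # date, domestic cases
}

# Example (the input file name is hypothetical):
# make_tidy Data/cases_reported_imported.txt > Data/Data_2020-04-10Ny.txt
```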
While there was no code book included in the repository, the dataset is simple, and the cleaning and the analysis seem to be well explained in the report.
In one place, there are some magic numbers. These numbers are present in the dataset and didn’t need to be input manually. Furthermore, they are not used for the analysis but seem to indicate that this analysis was run for other regions in Sweden.
df_riket <- data.frame(ARegion = "Riket", Pop = 2385128+ 5855459+ 2078886)
See issue Input data lineage on GitHub.
The repository must be complete
A repository is said to be complete if all numbers and figures from the paper can be re-computed from the code (Monperrus, 2018).
The script produces a long list of figures (12 of them) and tables, and while we have not checked the numbers in detail, they look reasonable. The report, on the other hand, contains further figures that are not generated by the code.
To be completely reproducible, the code must generate the report.
See issue Making the code complete on GitHub.
The repository must be durable
The specific commit used for the report should be archived and referenced from within the report. Zenodo makes it extremely simple to archive from a GitHub repository.
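Independently of which archive is used, the commit can be pinned with an annotated tag and exported as a snapshot. This is a minimal sketch. The helper name pin_commit and the tag name in the example are hypothetical, and bb616e9 is the commit reviewed in this post, used here only for illustration.

```shell
# Pin a commit with an annotated tag and export it as a tarball.
# pin_commit is a hypothetical helper name; run it inside the repository.
pin_commit() {
  commit="$1"
  tag="$2"
  git tag -a "$tag" -m "Code as used for the report" "$commit" &&
    git archive --format=tar.gz --prefix="$tag/" -o "$tag.tar.gz" "$tag"
}

# Example (tag name is hypothetical):
# pin_commit bb616e9 report-2020-04-21
```

The tag gives the report a stable name to cite, and the tarball is a self-contained snapshot that can be deposited in an archive.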
Related work
BenjaK identified and corrected issues related to character encoding, packages as well as many other issues such as No seed for RNG.
consideRatio submitted a pull request to make the code runnable on mybinder.org. This means that the code is runnable in RStudio online - try it out here.
The excellent Machine Learning Reproducibility Checklist (Pineau, 2018) lists many more requirements that we did not cover here.
Conclusion
It is very positive that this code was made available to the public to be reviewed and critiqued by anyone. It had some issues, some of which have already been corrected by others, but at the time of writing these improvements have not been included in the original repository. We suggested many improvements in terms of reproducibility, which could make it easier for other people to contribute to the model and the analysis. These improvements range from better documentation to better handling of input data.
This is the first code published on GitHub by the account FohmAnalys. I hope that the release of this code means that we can expect more openness and transparency in the future.
We, as scientists, have a lot to learn about open science and how to make code available, and I hope that this post could inspire you to share your own code.
Finally, a big thank you to The Public Health Agency of Sweden and the researchers working on this code for making it publicly available!
References
- Buckheit, J. B., & Donoho, D. L. (1995). Wavelab and reproducible research. In Wavelets and statistics (pp. 55–81). Springer.
- Claerbout, J. F., & Karrenbach, M. (1992). Electronic documents give reproducible research a new meaning. In SEG Technical Program Expanded Abstracts 1992 (pp. 601–604). Society of Exploration Geophysicists. https://doi.org/10.1190/1.1822162
- Association for Computing Machinery (ACM). Artifact Review and Badging. https://www.acm.org/publications/policies/artifact-review-badging
- Peng, R. D. (2011). Reproducible Research in Computational Science. Science, 334(6060), 1226–1227. https://doi.org/10.1126/science.1213847
- Folkhälsomyndigheten. (2020). Skattning av peakdag och antal infekterade i covid-19-utbrottet i Stockholms län februari-april 2020. Folkhälsomyndigheten. www.folkhalsomyndigheten.se/publicerat-material/
- Monperrus, M. (2018). How to make a good open-science repository? https://researchdata.springernature.com/users/336958-martin-monperrus/posts/57389-how-to-make-a-good-open-science-repository
- Leek, J., Collado-Torres, L., Reich, N. G., & Horton, N. (2016). How to share data with a statistician. https://github.com/jtleek/datasharing
- Merkel, D. (2014). Docker: Lightweight Linux Containers for Consistent Development and Deployment. Linux J., 2014(239). http://dl.acm.org/citation.cfm?id=2600239.2600241
- Boettiger, C. (2014). An introduction to Docker for reproducible research, with examples from the R environment. https://doi.org/10.1145/2723872.2723882
- Pineau, J. (2018). The Machine Learning Reproducibility Checklist (Version 2.0) (p. 1). https://www.cs.mcgill.ca/~jpineau/ReproducibilityChecklist.pdf
Suggested citation
If you would like to cite this work, here is a suggested citation in BibTeX format.