The process of writing a LaTeX document can be full of manual steps, resulting in a patchwork document that is neither exercisable nor complete. This makes it impossible to reproduce the document from code and data. In this post we will create a pipeline for compiling a LaTeX document that works both locally and using GitLab CI. This is part of a series to create the perfect open science git repository.
Introduction
When writing a document in LaTeX I like to use git for version control even if I am working alone on a project. This allows me to track my progress, have a backup, and make sure the document is completely reproducible from raw data. The principle of Reproducible Research (Buckheit & Donoho, 1995; Claerbout & Karrenbach, 1992; Association for Computing Machinery (ACM), n.d.) is to make data and computer code available for others to analyze and criticize.
A good open source repository is exercisable and complete (Monperrus, 2018; Association for Computing Machinery (ACM), n.d.). This means that it must be possible to fully reproduce the document, down to the last pixel, from running a single script in the repository.
In this post we will take a look at the practicalities of writing a reproducible document in LaTeX using a Gitlab CI pipeline to ensure that we pass these requirements.
This post is part of a series and follows Publication ready figures. To see more on requirements on open source repositories see Reproducibility aspects of the Swedish COVID–19 estimate report.
Our contributions
- We define three phases of document compilation that compile figures, compile the main document and test the compiled document against some set of known requirements.
- We construct a local compilation pipeline based on latexmk, make and Docker (Merkel, 2014).
- We construct a GitLab CI pipeline that automatically compiles the document when we push new code to the remote repository.
What we will need
In this post we will use git and target the GitLab CI pipeline framework, so you will need a repository on GitLab.
I recommend adding a .gitignore based on the GitLab TeX .gitignore template.
The local build system
We will use latexmk to build our LaTeX document. There are other build systems for LaTeX, such as rubber and latexrun, which can also be used, but latexmk has the advantage of being robust and already installed in the Docker image we are using.
We will use GNU make to trigger the latexmk build locally, or in the GitLab runner. The entry points will be slightly different in these cases.
Here we assume that running make figures is a very time-consuming step, so we would like to avoid running it every time.
The command that we run from the command line to compile the LaTeX document is make. This will first start a Docker container, mount the working directory into it, and run make pdf inside the container. Since the working directory is mounted, the pdf-file will remain after the Docker container has been shut down and removed.
The complete script, Makefile, can be seen in the GitLab repository.
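To make the wrapping step concrete, here is a minimal Python sketch that assembles the kind of docker run invocation described above. The image name and the /data mount point are taken from the docker commands quoted further down in this post. The helper name docker_make_command is hypothetical and is not part of the repository's Makefile.

```python
import os
import shlex


def docker_make_command(target, image="martisak/texlive2020", workdir=None):
    """Return the argv list for running `make <target>` inside a container.

    Hypothetical helper: it mirrors the docker invocation quoted in this
    post, it is not code taken from the repository.
    """
    workdir = workdir or os.getcwd()
    return [
        "docker", "run",
        "--rm",                     # remove the container when it exits
        "-w", "/data/",             # working directory inside the container
        "-v", f"{workdir}:/data",   # bind-mount the host working directory
        image,
        "make", target,
    ]


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch has no side effects.
    print(shlex.join(docker_make_command("pdf")))
```

Because the host directory is bind-mounted, anything make writes under /data in the container ends up in the working directory on the host, which is why the PDF survives the removal of the container.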
Generating figures
The figures reside in the subdirectory figures, which contains a Makefile. We can compile the figures locally with make -C figures or in a Docker container with
docker run --rm -w /data/ -v`pwd`:/data python:3.8 make -C /data/figures
Each figure is generated from raw data and plotted using a Python script. Each script generates a figure in TikZ format with the same base name as the script, but with the extension “.tex”.
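The figure scripts themselves are not shown in this post. As an illustration only, the following sketch turns a small two-column CSV data set into a TikZ picture using nothing but the Python standard library. It assumes the main document loads the pgfplots package. The function names and the sample data are hypothetical, and the author's scripts may well use a plotting library instead.

```python
import csv
import io
from pathlib import Path


def read_points(csv_text):
    """Parse two-column CSV text (with a header row) into float pairs."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader)  # skip the header row
    return [(float(x), float(y)) for x, y in reader]


def tikz_plot(points, xlabel="x", ylabel="y"):
    """Return a pgfplots picture that draws the points as a line plot."""
    coordinates = " ".join(f"({x:g},{y:g})" for x, y in points)
    return "\n".join([
        r"\begin{tikzpicture}",
        rf"\begin{{axis}}[xlabel={{{xlabel}}}, ylabel={{{ylabel}}}]",
        rf"\addplot coordinates {{{coordinates}}};",
        r"\end{axis}",
        r"\end{tikzpicture}",
        "",
    ])


def output_path(script_path):
    """Same base name as the generating script, with a .tex extension."""
    return Path(script_path).with_suffix(".tex")


if __name__ == "__main__":
    sample = "x,y\n0,1\n1,2.5\n2,4\n"  # stand-in for a raw data file
    print(tikz_plot(read_points(sample)))
```

In a real figure script the last step would write the result next to the script, for example with output_path(__file__).write_text(...), so that the LaTeX document can include the file by its base name.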
Compiling the document
Our document can be compiled using latexmk inside a Docker container with make. This is the same as running
docker run --rm -w /data/ -v`pwd`:/data martisak/texlive2020 make pdf
The document will be compiled inside the container using
latexmk -bibtex -pdf -pdflatex="pdflatex -interaction=nonstopmode" main.tex
The container we are using is based on a Docker image which has TeXLive 2020 installed on top of an Ubuntu base image.
Running unit tests
The test cases, written in Ruby, can either be run locally with
rspec spec/pdf_spec.rb
which is the same as make check, or in a Docker container with
docker run --rm -w /data/ -v`pwd`:/data ruby:2.7.1 sh -c "bundle update --bundler && make check"
This is the same as running make check_docker. For a more in-depth guide to LaTeX document unit testing see How to beat publisher PDF checks with LaTeX document unit testing.
LaTeX development environment
When writing a paper we would of course like to see the results of our changes in near real time, and not have to commit our changes to git in order to compile the document.
We can tweak the render make target a bit so that latexmk will be run with the -pvc flag (Wienke, 2018). This puts latexmk into its preview-and-continuously-update mode.
make clean render LATEXMK_OPTIONS_EXTRA=-pvc
This means we can run this command once and just edit our document in our favorite text editor.
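As a rough illustration of what the extra option does, the sketch below builds the latexmk command line quoted earlier and appends optional flags before the file name. Where exactly the Makefile expands LATEXMK_OPTIONS_EXTRA is an assumption here, and latexmk_command is a hypothetical helper name.

```python
import shlex


def latexmk_command(main="main.tex", extra_options=()):
    """Build the latexmk argv, with optional extra flags such as -pvc."""
    return [
        "latexmk",
        "-bibtex",
        "-pdf",
        "-pdflatex=pdflatex -interaction=nonstopmode",
        *extra_options,   # assumed position of LATEXMK_OPTIONS_EXTRA
        main,
    ]


if __name__ == "__main__":
    # One-off build, then the continuous preview variant.
    print(shlex.join(latexmk_command()))
    print(shlex.join(latexmk_command(extra_options=["-pvc"])))
```

With -pvc, latexmk keeps running, watches the source files and recompiles when they change, so the command only has to be started once per editing session.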
The GitLab CI pipeline
In GitLab we can run a pipeline for each commit using GitLab CI/CD. For this project we have defined three stages: the first stage, figures, creates the plots in Python; the second, build, compiles the LaTeX document; and the third, test, runs unit tests on the compiled PDF document.
The complete script, .gitlab-ci.yml, can be found in the GitLab repository.
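The ordering of the three stages can be imitated locally. The following sketch runs the make targets mentioned in this post one after another and stops at the first failure, which is roughly how a later CI stage is skipped when an earlier one fails. The function run_pipeline and its runner parameter are hypothetical and only serve as an illustration.

```python
import subprocess

# Stage names come from the post; the commands are the make targets it mentions.
STAGES = [
    ("figures", ["make", "-C", "figures"]),
    ("build", ["make", "pdf"]),
    ("test", ["make", "check"]),
]


def run_pipeline(stages=STAGES, runner=None):
    """Run the stages in order and stop at the first failure.

    `runner` takes an argv list and returns an exit code. It defaults to
    subprocess.call and can be replaced for dry runs.
    Returns the names of the stages that completed successfully.
    """
    runner = runner or subprocess.call
    completed = []
    for name, command in stages:
        if runner(command) != 0:
            break  # later stages are skipped, as in a CI pipeline
        completed.append(name)
    return completed


if __name__ == "__main__":
    # Dry run: pretend every command succeeds and only report the order.
    print(run_pipeline(runner=lambda command: 0))
```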
Compiling figures
Our first pipeline stage will compile figures according to Publication ready figures. For this we use the official python:3.8 Docker image. Any job artifacts created in this step will be carried over to the next stage.
figures:
  image: python:3.8
  stage: figures
  script:
    - make -C figures
  artifacts:
    untracked: true
    expire_in: 1 week
The figures are placed in the figures subdirectory and are built using a Makefile.
The reason for putting this step in its own stage is that we assume generating figures can take a very long time, for example if a machine learning model is trained in this step. In this way we can also keep it separate when running it locally, so that we don’t have to regenerate the figures every time we want to compile the LaTeX document.
Speeding up the build with caching
The figures stage can take a very long time since we need to download and install packages every time the stage runs. To avoid this we can use the example from Cache dependencies in GitLab CI/CD so that the figures stage becomes
variables:
  PIP_CACHE_DIR: "$CI_PROJECT_DIR/.cache/pip"

cache:
  key: "$CI_JOB_STAGE-$CI_COMMIT_REF_SLUG"
  paths:
    - .cache/pip
    - venv/

figures:
  image: python:3.8
  stage: figures
  before_script:
    - python -V
    - pip install virtualenv
    - virtualenv venv
    - source venv/bin/activate
  script:
    - make -C figures
  artifacts:
    untracked: true
    expire_in: 1 week
We are using a virtualenv (Gabor, 2020) to be able to cache the installed packages as well.
Care has to be taken with this, since the cache can become too big for GitLab to handle.
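One way to keep an eye on this is to measure the cached directories before relying on the cache. The sketch below sums the file sizes under the two cache paths used in the job definition. The helper names are hypothetical, and no particular size limit is assumed, since that depends on the GitLab instance.

```python
import os
from pathlib import Path


def directory_size(path):
    """Total size in bytes of all regular files below `path` (0 if missing)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = Path(root) / name
            if not file_path.is_symlink():  # do not follow or count symlinks
                total += file_path.stat().st_size
    return total


def cache_report(paths=(".cache/pip", "venv")):
    """Map each cached path to its size in MiB."""
    return {p: directory_size(p) / 2**20 for p in paths}


if __name__ == "__main__":
    for path, size in cache_report().items():
        print(f"{path}: {size:.1f} MiB")
```

Running this at the end of the figures job, or locally after installing the requirements, gives an early warning when a new dependency makes the virtualenv grow substantially.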
Compiling the LaTeX document
The second stage in the pipeline will compile the actual LaTeX document. Here, we need to use a Docker image that has LaTeX and all needed packages installed. The Docker image we use is martisak/texlive2020, which uses TeXLive 2020.
The job artifact of interest is of course the compiled pdf-document, but we include any untracked file so that any logfiles and other generated files will be included.
compile:
  image: martisak/texlive2020
  stage: build
  script:
    - make pdf
  dependencies:
    - figures
  artifacts:
    untracked: true
    expire_in: 1 week
    when: on_success
Running unit tests
The final stage of the pipeline will run unit tests on the created pdf-file. This is useful, for example, to make sure that the number of pages is as expected, that the fonts are embedded properly, and that any metadata is set correctly. We will cover these tests in detail in a later post; for now it is enough to say that they are written in Ruby, so we will use an appropriate Docker image.
test:
  image: ruby:2.7.1
  stage: test
  dependencies:
    - compile
  script:
    - bundle install
    - make check
  when: on_success
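The actual tests are written in Ruby and are not shown in this post. As a rough Python stand-in for two of the checks named above (page count and metadata), the sketch below parses the output of the pdfinfo tool from poppler. It assumes pdfinfo is installed, and the function names and expected values are hypothetical. Font embedding is not covered.

```python
import subprocess


def parse_pdfinfo(text):
    """Turn pdfinfo output ("Key:   value" lines) into a dictionary."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def check_info(info, expected_pages, expected_title=None):
    """Return a list of problems; an empty list means all checks passed."""
    problems = []
    pages = int(info.get("Pages", "0"))
    if pages != expected_pages:
        problems.append(f"expected {expected_pages} pages, found {pages}")
    if expected_title is not None and info.get("Title") != expected_title:
        problems.append(f"unexpected title: {info.get('Title')!r}")
    return problems


def check_pdf(path, expected_pages, expected_title=None):
    """Run pdfinfo on `path` and check the page count and title."""
    output = subprocess.run(
        ["pdfinfo", path], check=True, capture_output=True, text=True
    ).stdout
    return check_info(parse_pdfinfo(output), expected_pages, expected_title)
```

A call such as check_pdf("main.pdf", expected_pages=8) would then return an empty list when the compiled document has the expected length; the value 8 is only an example.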
Adding a “Download PDF” button
Now that we have gone through all of this, we would like to share our final document with others. I like using a GitLab badge for this.
Since we named our document main.pdf and the compilation job is named compile, we can find our document at https://gitlab.com/martisak/latex-pipeline/-/jobs/artifacts/master/raw/main.pdf?job=compile.
Of course, we need a fancy image to go with it, and we can generate one using shields.io.
You can add this badge either by adding it to your README.md or in your GitLab settings under General and Badges.
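To show how the pieces of the link fit together, here is a small sketch that builds the artifact URL from the project path, branch, file name and job name, and wraps it in a shields.io static badge. The badge text ("download", "PDF"), the colour and the helper names are hypothetical choices for illustration.

```python
from urllib.parse import quote, urlencode


def artifact_url(project, ref, path, job, host="https://gitlab.com"):
    """URL of a file in the latest successful artifacts of `job` on `ref`."""
    query = urlencode({"job": job})
    return f"{host}/{project}/-/jobs/artifacts/{ref}/raw/{path}?{query}"


def static_badge_url(label, message, color="blue"):
    """shields.io static badge URL; dashes and underscores are doubled."""
    def escape(text):
        return quote(text.replace("-", "--").replace("_", "__"))
    return f"https://img.shields.io/badge/{escape(label)}-{escape(message)}-{color}"


def badge_markdown(label, message, link):
    """Markdown for a clickable badge, suitable for a README.md."""
    return f"[![{label} {message}]({static_badge_url(label, message)})]({link})"


if __name__ == "__main__":
    url = artifact_url("martisak/latex-pipeline", "master", "main.pdf", "compile")
    print(badge_markdown("download", "PDF", url))
```

The same two URLs, the badge image and the link target, are what the badge form in the GitLab settings asks for.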
Related work
A common way of writing LaTeX documents together with others is to use Overleaf. Editing can be done by all authors in real time and the compilation of the document is very fast. However, the online version doesn’t allow us to run arbitrary code, or perform test cases on our document. Furthermore, the version control is hidden from us. Overleaf has a few ways of letting us share the work. In my work, some of the content is proprietary and can be sensitive until the document is reviewed. This means I am not able to use cloud solutions to write my documents. However, Overleaf provides a Docker image that can be deployed locally.
Many authors have looked into using GitLab CI for building LaTeX documents, for example Manik (2019), Lühr (2018), Khan (2018) and Ergus (2016). Ajayakumar (2020) wrote a very nice and complete guide, and used GitLab Pages to deploy the compiled document.
In this post we extend this work and make a complete pipeline that can also be run locally. Our pipeline consists of three stages, figures, build and test, each responsible for a separate part of the build process.
Conclusions
We have constructed a simple pipeline for compiling LaTeX documents in a Docker container. This fulfills the requirements that our repository shall be complete and exercisable (Monperrus, 2018; Association for Computing Machinery (ACM), n.d.).
To quickly get started, you can fork my repository on GitLab or use the cookiecutter template provided here.
In upcoming posts we will further look into defining test cases for documents, complicating the build with Pandoc and other tricks to annoy your co-authors.
References
- Buckheit, J. B., & Donoho, D. L. (1995). Wavelab and reproducible research. In Wavelets and statistics (pp. 55–81). Springer.
- Claerbout, J. F., & Karrenbach, M. (1992). Electronic documents give reproducible research a new meaning. In SEG Technical Program Expanded Abstracts 1992 (pp. 601–604). Society of Exploration Geophysicists. https://doi.org/10.1190/1.1822162
- Association for Computing Machinery (ACM). (n.d.). Artifact Review and Badging. https://www.acm.org/publications/policies/artifact-review-badging
- Monperrus, M. (2018). How to make a good open-science repository? https://researchdata.springernature.com/users/336958-martin-monperrus/posts/57389-how-to-make-a-good-open-science-repository
- Merkel, D. (2014). Docker: Lightweight Linux Containers for Consistent Development and Deployment. Linux J., 2014(239). http://dl.acm.org/citation.cfm?id=2600239.2600241
- Wienke, J. (2018). LaTeX Best Practices: Lessons Learned from Writing a PhD Thesis. https://www.semipol.de/2018/06/12/latex-best-practices.html
- Gabor, B. (2020). virtualenv. https://virtualenv.pypa.io/
- Manik, D. (2019). GitLab pipelines for every need: testing, documentation, and writing a paper. In deRSE 2019 - Konferenz für ForschungssoftwareentwicklerInnen in Deutschland. https://doi.org/10.5446/42490
- Lühr, L. (2018). Automate Awesome CV with XeLaTeX and GitLab CI. https://ayeks.de/post/2018-01-25-awesome-cv-cicd/
- Khan, S. (2018). Setting up GitLab to automatically generate PDFs from committed LaTeX files. https://sayantangkhan.github.io/latex-gitlab-ci.html
- Ergus, A. (2016). Using GitLab CI for Building LaTeX. https://github.com/aufenthaltsraum/stuff/wiki/Using-GitLab-CI-for-Building-LaTeX
- Ajayakumar, V. (2020). Continuous Integration of LaTeX projects with GitLab Pages. https://www.vipinajayakumar.com/continuous-integration-of-latex-projects-with-gitlab-pages.html
Suggested citation
If you would like to cite this work, here is a suggested citation in BibTeX format.