Martin Isaksson

When submitting a scientific paper to a conference or a journal, there is often a mandatory step of passing the automated PDF checks set up by that publication. This step can often be nerve-racking and cause many hours of LaTeX troubleshooting. In this post we will create a series of test cases to catch these problems early in the writing process so that you can submit your manuscript only once.

## Introduction

Recently, I submitted a scientific paper to an IEEE conference. For a manuscript to be accepted by the publishing system Editor's Assistant (EDAS) it has to pass an unknown number of unspecified test cases. This took far too many attempts, as can be seen in the figure to the right.

Here is one error that I received.

The gutter between columns is 0.165 inches wide (on page 3), but should be at least 0.2 inches.

Nowhere did it say that the gutter should be at least 0.2 inches. Another IEEE conference that I submitted to had a smallest gutter width of 0.16 inches, so it seems that this is up to each conference chair to decide. As you can imagine, widening the gutter makes some text spill over to the next page, which can push the document over the page limit. Uploading a document many times is a pain.

In this post we will create a series of test cases to catch these errors locally before submitting.

The publishing system gave this message for the final version of the manuscript that was uploaded without problems.

The paper has 6 pages, has a paper size of 8.5x11 in (letter), is formatted in 2 columns, with a gutter of 0.201 inches (smallest on pg. 5), the most common font size is 9.96 pt, the average line spacing is 11.95 pt, margins are 0.673 (L) x 0.653 (R) x 0.701 (T) x 0.990 (B) inches, uses PDF version 1.7 and was created by TeX.
It can take many attempts to pass EDAS PDF checks.

## Template

The techniques we use here can of course be applied to any PDF document. Here we look at a two-column conference paper, since it provides us with a number of interesting things to test that other formats don’t.

We use the example from the IEEE Manuscript Templates for Conference Proceedings to test these methods.

The IEEE Manuscript Templates for Conference Proceedings example is particularly interesting due to the multiple author bounding boxes. We can download the template and an example document and compile it directly to produce the PDF that we will run our test cases on.

## Test cases

### Requirements and setup

We need to understand the requirements of the publishing system. Some of these requirements can be found on the conference website. For example, we see that the page limit is 6 pages in a 10 point font, and that we should use the IEEE Manuscript Templates for Conference Proceedings template. Other than that, there is no more useful information.

Here are the requirements, gathered from various sources, that we are going to write test cases for in this post.

| Requirement | Value | Source |
| --- | --- | --- |
| Annotations | no | Obvious |
| Bookmarks | no | EDAS FAQ |
| Encrypted | no | Obvious |
| Font size | 10 pt | Conference CFP, (Shell, 2002) |
| Font type | no PS Type 3 fonts | Hearsay |
| Fonts embedded | embedded | EDAS FAQ |
| Language | English | Conference CFP |
| Maximum file size | 40 MB | Hearsay |
| Minimum bottom margin | 1 in | IEEE allowed paper sizes |
| Minimum gutter width | 0.16 in | EDAS fault |
| Minimum left/right margin | 0.625 in | IEEE allowed paper sizes |
| Minimum top margin | 0.65 in | IEEE allowed paper sizes, (Shell, 2002) |
| Number of columns | 2 | IEEE Manuscript Templates, (Shell, 2002) |
| Number of pages | 1 <= x <= 6 | Conference CFP |
| Paper size | Letter (612 pt x 792 pt) | (Shell, 2002) |
| Title | Title Case | EDAS FAQ |

Some of these values, for example the margins, were tweaked after a paper that passed the checks turned out to have margins narrower than those suggested by the IEEE requirements.

### Test framework

For the test framework we will use the popular Python test framework pytest (Krekel et al., 2004) with the PyMuPDF (McKie & Liu, 2020) package to interact with the PDF file. The entire script can be found in the GitLab repository How to beat publisher checks with LaTeX document unit testing.

We set up our requirements from the table above as follows in config.yml, in YAML format.
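The contents of config.yml are not reproduced here. As a rough illustration only, the requirements from the table could be captured in a structure like the following Python stand-in. The field names are assumptions made for this sketch; only the values come from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Requirements:
    """Limits taken from the requirements table; field names are hypothetical."""

    min_pages: int = 1
    max_pages: int = 6
    columns: int = 2
    paper_size_pt: tuple = (612, 792)  # US Letter, width x height in points
    min_gutter_in: float = 0.16
    min_left_right_margin_in: float = 0.625
    min_top_margin_in: float = 0.65
    min_bottom_margin_in: float = 1.0
    max_file_size_mb: int = 40
    allow_annotations: bool = False
    allow_bookmarks: bool = False
    allow_encryption: bool = False
    allow_type3_fonts: bool = False


REQUIREMENTS = Requirements()
```

Keeping all limits in one place means that a different conference only requires a change to these values, not to the test cases.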

### Annotations

It would be a bit embarrassing to submit a file with annotations still in it, so let’s start by checking that we didn’t add any.
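A minimal sketch of such a check, assuming the document has already been opened with PyMuPDF (for example with fitz.open on the compiled PDF) and that each page exposes annots(), as PyMuPDF pages do. The helper name count_annotations is made up for this example.

```python
def count_annotations(doc):
    """Count annotations over all pages of an opened PDF document.

    Assumes a PyMuPDF-style document: iterating it yields pages, and
    page.annots() yields the annotations on that page.
    """
    total = 0
    for page in doc:
        # Guard against an API that returns None for "no annotations".
        total += sum(1 for _ in (page.annots() or ()))
    return total
```

In a pytest test case we would then simply assert that count_annotations(doc) is zero.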

For the metadata fields creator, producer, author, title, subject, encryption and keywords we can simply check that the result is as expected by comparing to what we defined in the configuration file config.yml.

For the PDF version, we usually specify a minimum version so we define a separate test case for that. Should we need to change this in the document, we can add \pdfminorversion=7 to our preamble.
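As a rough sketch of a version check that does not depend on any PDF library, we can read the %PDF-x.y header at the start of the file. The function names are made up for this example, and the minimum version is left as a parameter because it depends on the publication. Note that a PDF can also declare a newer version in its document catalog, which this sketch ignores.

```python
import re


def pdf_header_version(path):
    """Return (major, minor) from the '%PDF-x.y' header of a file."""
    with open(path, "rb") as handle:
        head = handle.read(1024)  # the header sits at the very start
    match = re.search(rb"%PDF-(\d+)\.(\d+)", head)
    if match is None:
        raise ValueError(f"no PDF header found in {path!r}")
    return int(match.group(1)), int(match.group(2))


def meets_minimum_version(path, minimum):
    """True if the header version is at least `minimum`, e.g. (1, 5)."""
    return pdf_header_version(path) >= minimum
```

Comparing tuples keeps the check correct for versions such as 1.10 or 2.0, where a plain string comparison would go wrong.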

### Number of pages

It is quite common for a conference or a journal to have a maximum number of pages. The lowest number of pages is of course one, but we’d most likely want to use every inch of space available to us.

### Dimensions

To calculate margins and other dimensions, we need the dimensions of each page and of each bounding box within each page. In the process we also find the number of columns.

The basic algorithm is as follows: We first loop through each page and each bounding box within that page. For every bounding box we add an interval to an interval tree - one for the dimensions in the x-direction, and one for the dimensions in the y-direction. For this we will use the Python package intervaltree (Leib Halbert & Tretyakov, 2018).

Interval trees (Wikipedia contributors, 2020) are interesting in their own right, but we won’t go into the details of how they work. Here it is enough to say that we can do operations on these interval trees to find the widths of gutters and margins easily.

For each new bounding box we find, we add the interval between the left edge and the right edge to one interval tree. After we have done this for all bounding boxes we merge the overlap of these intervals so that we are left with a list of non-overlapping intervals. We do this both for the x-dimension (illustrated below in blue) and the y-dimension (illustrated below in red).
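The merging step can be illustrated without an interval tree at all. The sketch below is a plain-Python stand-in, not the code used in the script: it sorts the intervals and folds overlapping ones together. The sample coordinates are hypothetical x-extents in points.

```python
def merge_intervals(intervals):
    """Merge overlapping (start, end) intervals into a sorted, disjoint list.

    Intervals that merely touch (end == next start) are merged as well.
    """
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps (or touches) the previous interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Hypothetical x-extents (in points) of text blocks on a two-column page.
x_extents = [(48, 300), (48, 250), (315, 567), (330, 560), (315, 400)]
columns = merge_intervals(x_extents)
```

For a clean two-column page the result contains exactly two intervals, one per column, so the length of the merged list also gives us the number of columns.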

The first page with overlaid non-overlapping bounding boxes in the x-direction in blue and non-overlapping bounding boxes in the y-direction in red. To find the two columns, the first 12 bounding boxes were skipped.
The second page with overlaid non-overlapping bounding boxes in blue. There is only one non-overlapping bounding box in the y-direction.
The first and the second page illustrate how complicated the bounding box analysis can be.

We see that the first page contains things like the title and author blocks that straddle the gutter. This will affect how we can detect the columns and calculate the width of the gutter. Here, we take an easy way out and just skip the first 12 bounding boxes. We find the number of bounding boxes to skip by counting the red boxes in the figure. This problem also extends to pages where a top figure spans the two columns.

After we have calculated the non-overlapping intervals we can easily calculate their widths. For a two-column document, the first interval is the left margin, the second is the first column and the third interval is the gutter. Since the margins and gutter differ from page to page, we assert that all of them meet the requirements.
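To make the arithmetic concrete, here is a small sketch that derives the left margin, the gutter and the right margin from the merged column intervals of one page. It assumes that the intervals are given in points (72 points per inch) and that only the text columns are stored, so margins and gutters are the gaps between them. The function name and the sample coordinates are hypothetical.

```python
POINTS_PER_INCH = 72


def horizontal_layout(columns, page_width):
    """Return (left margin, gutters, right margin) in inches.

    `columns` is a sorted list of disjoint (x0, x1) intervals in points,
    one per text column; `page_width` is the page width in points.
    """
    left = columns[0][0] / POINTS_PER_INCH
    right = (page_width - columns[-1][1]) / POINTS_PER_INCH
    gutters = [
        (nxt[0] - cur[1]) / POINTS_PER_INCH
        for cur, nxt in zip(columns, columns[1:])
    ]
    return left, gutters, right


# Hypothetical merged column extents for one Letter page (612 pt wide).
left, gutters, right = horizontal_layout([(48, 300), (315, 567)], 612)
assert len(gutters) == 1          # two columns give exactly one gutter
assert min(gutters) >= 0.16       # minimum gutter width from the table
assert min(left, right) >= 0.625  # minimum left/right margin from the table
```

The same function applied to the y-direction intervals would give the top and bottom margins.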

Should the gutter be too narrow (something that happens far too often), we can tweak the column separation, for example by changing the LaTeX length \columnsep in the preamble.

Another thing that will affect this is microtype and its various options, for example protrusion.

Links and bookmarks are created by hyperref. I’d like to keep this package, but set the output to draft for the final publication in order to disable it.

Testing that we have no links or bookmarks in our document is simple: we just make sure that the list of links on each page is empty and that the bookmark list is empty.
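A minimal sketch of these two checks, assuming a document opened with PyMuPDF, where page.get_links() returns the list of links on a page and doc.get_toc() returns the list of bookmark (outline) entries. The helper names are made up for this example.

```python
def count_links(doc):
    """Count link objects over all pages (PyMuPDF: page.get_links())."""
    return sum(len(page.get_links()) for page in doc)


def count_bookmarks(doc):
    """Count outline entries (PyMuPDF: doc.get_toc() returns a list)."""
    return len(doc.get_toc())
```

Both counts should be zero for the version that is uploaded to the publishing system.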

### File size

The system that we upload our document to has a limit on the size of the document, and it is easy to test for this.
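A small sketch of such a check, using only the standard library. The 40 MB default comes from the requirements table. Whether the publishing system counts a megabyte as 10^6 or 2^20 bytes is not known, so the 2^20 used here is an assumption.

```python
import os

BYTES_PER_MB = 1024 * 1024  # assumption: 1 MB = 2**20 bytes


def file_size_mb(path):
    """Return the size of the file at `path` in megabytes."""
    return os.path.getsize(path) / BYTES_PER_MB


def within_size_limit(path, max_mb=40):
    """True if the file does not exceed `max_mb` megabytes."""
    return file_size_mb(path) <= max_mb
```

If the file is too large, the usual suspects are high-resolution bitmap figures.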

### Spelling and grammar

To test spelling and grammar I use LanguageTool, textidote and vale. However, the number of false positives is staggering, which makes these tools unusable for automatic testing. This is also a larger topic that deserves an in-depth analysis.

### Title in title case

The title shall be in title case. In this regard EDAS follows the Associated Press Stylebook and the New York Times style book. These state that only short prepositions and articles with four letters or less are lowercase.

The Python package titlecase uses a word list from the New York Times Manual of Style to decide which words shall be lowercase.
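To illustrate the rule without any dependency, here is a simplified sketch that flags words breaking a basic title-case convention: short articles, conjunctions and prepositions are lowercase unless they are the first or last word, and every other word starts with a capital letter. The word list is a small illustrative sample chosen for this example. It is neither the list used by EDAS nor the one shipped with titlecase.

```python
import string

# Illustrative list of short words kept lowercase; NOT the list used by
# EDAS or by the titlecase package.
SMALL_WORDS = {
    "a", "an", "the", "and", "but", "or", "nor", "for", "of",
    "in", "on", "at", "to", "by", "with", "from", "as",
}


def title_case_violations(title):
    """Return the words in `title` that break a simple title-case rule."""
    words = title.split()
    violations = []
    for index, word in enumerate(words):
        core = word.strip(string.punctuation)
        if not core or not core[0].isalpha():
            continue  # skip numbers and bare punctuation
        at_edge = index in (0, len(words) - 1)
        if core.lower() in SMALL_WORDS and not at_edge:
            ok = core == core.lower()  # small words stay lowercase
        else:
            ok = core[0].isupper()  # everything else is capitalized
        if not ok:
            violations.append(word)
    return violations
```

Only the first letter is checked for capitalized words, so mixed-case names such as LaTeX pass unchanged.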

### Required text

For my work I am required to put a pre-defined sentence in the acknowledgment section, so I want to test that the document contains this string. This test case can easily be modified to detect blacklisted words.
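A sketch of both variants, operating on text that has already been extracted from the PDF. Since extracted text contains line breaks wherever the typeset lines wrap, whitespace is normalized before comparing. Words hyphenated across lines are not handled here, and the function names are made up for this example.

```python
import re


def normalize(text):
    """Collapse all runs of whitespace (including line breaks) to one space."""
    return re.sub(r"\s+", " ", text).strip()


def contains_sentence(document_text, required):
    """True if `required` occurs in the text, ignoring line wrapping and case."""
    return normalize(required).lower() in normalize(document_text).lower()


def find_blacklisted(document_text, words):
    """Return the blacklisted words that occur as whole words in the text."""
    text = normalize(document_text).lower()
    return [
        word for word in words
        if re.search(rf"\b{re.escape(word.lower())}\b", text)
    ]
```

Matching on whole words avoids false alarms when a blacklisted word is part of a longer one.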

### Embedded fonts

We can test that the fonts are embedded by trying to extract them from each page. This can be optimized, since each font will be extracted several times. At the same time we check that the font is not a PostScript Type 3 font, since these can be bitmap fonts. This can happen, for example, when using matplotlib, since matplotlib uses Type 3 fonts by default. See (Oaks, 2014) or Publication ready figures for ways to get around this.
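As a sketch of how the font listing could be evaluated, the helper below works on entries shaped like the tuples that PyMuPDF's page.get_fonts() returns, which is an assumption about the layout: (xref, ext, type, basefont, name, encoding). An ext of "n/a" is taken to mean that the font cannot be extracted, which includes fonts that are not embedded. Each font is examined only once by remembering its xref. The helper name is made up for this example.

```python
def font_problems(font_entries):
    """Return human-readable problems for a list of font entries.

    Each entry is assumed to follow the layout of PyMuPDF's
    page.get_fonts(): (xref, ext, type, basefont, name, encoding, ...),
    where ext == "n/a" marks a font that cannot be extracted.
    """
    problems = []
    seen = set()
    for xref, ext, font_type, basefont, *_ in font_entries:
        if xref in seen:
            continue  # the same font is usually listed on many pages
        seen.add(xref)
        if font_type == "Type3":
            problems.append(f"{basefont}: Type 3 font")
        elif ext == "n/a":
            problems.append(f"{basefont}: not embedded")
    return problems
```

In a test case we would collect the entries from all pages and assert that the returned list is empty.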

## Results of the test cases

We can compile the example document and run our test suite with

This is the same as running make render test (using Docker containers), which gives us these results.

There are many test frameworks and PDF readers that could have been used instead of pytest and PyMuPDF. In the past I have used rspec with pdf/reader, which is easy to get started with, but since I am more familiar with Python I opted for that when it came to more advanced tests.

The IEEE Xpress PDF checks or The IEEE LaTeX Analyzer from the IEEE author center do not help us here, since a conference chair can specify other requirements in EDAS that are not checked by these tools.

## Discussion

### Tests not implemented

There are a few common mistakes that we didn’t create test cases for. The required font sizes are listed by (Shell, 2002), but they are hard to test for since figures and titles can have wildly different font sizes. Line spacing is similar to font size in this regard.

Other common mistakes, such as not referencing a figure in the text, are better suited to linting tools such as textidote.

## Conclusions

In this post we have implemented a few test cases to detect common mistakes in IEEE conference submissions before the conference PDF checker catches them. I hope that this saves you some frustration, and some time.

## References

1. Shell, M. (2002). How to use the IEEEtran LaTeX class. Journal of LaTeX Class Files, 12(4), 100–120.
2. Krekel, H., Oliveira, B., Pfannschmidt, R., Bruynooghe, F., Laugher, B., & Bruhin, F. (2004). pytest. https://github.com/pytest-dev/pytest
3. McKie, J. X., & Liu, R. (2020). PyMuPDF. GitHub. https://github.com/pymupdf/PyMuPDF
4. Leib Halbert, C., & Tretyakov, K. (2018). intervaltree. https://pypi.org/project/intervaltree/
5. Wikipedia contributors. (2020). Interval tree. Wikipedia, The Free Encyclopedia. https://en.wikipedia.org/w/index.php?title=Interval_tree&oldid=956136582
6. Oaks, J. (2014). Avoiding Type 3 fonts in matplotlib plots. http://phyletica.org/matplotlib-fonts/

## Suggested citation

If you would like to cite this work, here is a suggested citation in BibTeX format.