Scientific Document Processing


Why

Most of you have experience using programs like Microsoft Word to write papers and reports. The reason programs like these are so popular is that they are “what you see is what you get” (WYSIWYG) in their design. You see a graphical representation of the page, and of your document. Stylistic aspects like boldface, underlining, spacing, etc, are inserted using menus and buttons. MS Word and programs like it are easy to start using immediately. In many ways they mimic the typewriter (link for those of you who may not know what these ancient devices are)… you see a page, and you start typing.

There are however shortcomings of WYSIWYG programs like MS Word, especially for larger documents that include features like tables of contents, citations and bibliographies, figures and tables, etc. Technical documents that include mathematical equations can also be rather time consuming to generate in WYSIWYG word processors.

What is the alternative? There is another way of typesetting documents that involves separation of content and style. Content is specified using plain text, and then stylistic aspects are specified using markup (special keywords and codes).

One example of this style of document processing is HTML. Page content is specified using plain text, and stylistic aspects are specified using HTML tags (e.g. <br> for a line break, <div> to define a new section, <b> for boldface, <p> for a new paragraph, etc). Another example is CSS and how it works with HTML. An entire web site can be written in a generic way in HTML, and by referring to different CSS files, totally different stylistic presentations can be implemented. Another example is Markdown, in which simple markup tags can be used in a plaintext document to denote stylistic features. Markdown documents can be translated into HTML and several other formats using programs like pandoc. Another example, and the one that we will be looking at here, is called LaTeX (pronounced “lay-teck”).

Why not Microsoft Word? (a short diatribe)

Separating style and content frees your mind to focus on your thoughts, and writing, without constantly being taken out of the moment, to manually implement stylistic features of your document. If every time you start a new section, or compose a list, etc, you need to mentally pause your thoughts in order to find the menu or submenu element or ribbon-bar-button to enlarge the font, make it bold, insert bullet points, adjust spacing, etc, then that detracts from the continuity of your thoughts. In programs like MS Word, all of these stylistic aspects have to be implemented manually, throughout the document. In systems like LaTeX, style is specified separately from content, all in a single place, and so changing the look of your document can be done at a separate time from writing the content.

You all know, if you’ve ever written a large document like a scientific manuscript, that handling things like citations and bibliographies, figures, tables, equations, etc can be a huge nuisance. There are bibliography managers with citation features, such as EndNote, Zotero, and others, but in my experience they are rather difficult and unreliable to use. Imagine you are writing for a journal like Nature or Science, in which citations appear as numbers in increasing order as the document progresses. Imagine now you have 50 citations, and you decide that in the middle of your document, after citation #25, you need to add another one. Now you have to go through and manually change citations #26-50 to be 27-51. A similar issue exists with numbered equations, with Figures, and with Tables. Moving things around and you work on subsequent drafts of your article can often involve serious busywork in order to manually change citation, figure, equation, and table numbering.

In LaTeX, all of this is handled 100% automatically for you. At no point do you need to manually number citations, figures, equations, tables, sections, subsections, chapters, etc. These things are specified using LaTeX markup codes, and when you compile your LaTeX plaintext file (which generates a .pdf), all of the numbering is handled for you by the LaTeX system. This turns out to be a huge time saver, especially for long documents like a manuscript, or a thesis.

Side note: my PhD thesis (200+ pages) was written using LaTeX, and everything including citations, equations, figures, tables, sections, even a table of contents with page numbers, is generated automatically by LaTeX. What’s more, the thesis document itself is now > 25 years old, but I can still today compile it and generate a .pdf using the original LaTeX plaintext files.

Another reason why I like systems like LaTeX more than programs like MS Word is that Word files are encoded in a binary format, and are not immediately human-readable. To read a Word file one needs a program that knows how to open Word files. What’s more, Word file formats have changed over the years, and so there is no guarantee that some time in the future you will be able to easily open an old Word file. What’s more, different operating systems (Windows, Linux, Mac OSX) often are not 100% compatible and interchangable with respect to Word files. In contrast, systems like LaTeX use plain text ascii files. These are universally readable and 100% platform independent. ASCII files will be readable forever. LaTeX is also open-source and free, whereas MS Word is proprietary and licensed.

Have a look at the following sites to read more about the issue of Word vs LaTeX, and content vs style in document processing:

What

How

A few years ago I produced some introductory material for a plaintext_workshop I put on for interested profs and trainees in the Brain and Mind Institute at Western.

Here are some links to gentle introductions:

I will openly admit that getting to know LaTeX involves somewhat of a learning curve. The best place to start is by looking at example templates and modifying them to suit your needs.

Overleaf

Overleaf is a collaborative, online LaTeX editor. They have a free tier and also a paid tier. The big advantage is that you don’t have to install LaTeX or maintain packages on your own computer. The disadvantage is that you don’t have LaTeX on your own computer.

They provide some good introductory tutorials for LaTeX. They also provide a large number of templates that you can launch and modify.

Quarto

Quarto is a relatively new open-source scientific and technical publishing system, that uses plaintext markdown and/or jupyter notebooks to produce not just .pdf documents like reports and papers but also web pages, entire websites, interactive documents, books, and even presentations. I used Quarto to produce this course’s website as well as the website, notes, slides, homeworks, and exams, for my undergraduate statistics class Psychology 2812.

They have just released a new Quarto Manuscripts document type that looks interesting though I haven’t tried it yet.