7. Document Processing & Reproducible Research
1 Document Processing
Most of you have experience using programs like Microsoft Word to write papers and reports. The reason programs like these are so popular is that they are "what you see is what you get" (WYSIWYG) in their design. You see a graphical representation of the page, and of your document. Stylistic aspects like boldface, underlining, spacing, etc, are inserted using menus and buttons. MS Word and programs like it are easy to start using immediately. In many ways they mimic the typewriter (link for those of you who may not know what these ancient devices are)… you see a page, and you start typing.
There are however shortcomings of WYSIWYG programs like MS Word, especially for larger documents that include features like tables of contents, citations and bibliographies, figures and tables, etc. Technical documents that include mathematical equations can also be rather time consuming to generate in WYSIWYG word processors.
What is the alternative? There is another way of typesetting documents that involves separation of content and style. Content is specified using plain text, and then stylistic aspects are specified using markup (special keywords and codes).
One example of this style of document processing is HTML. Page content is specified using plain text, and stylistic aspects are specified using HTML tags (e.g. <br> for a line break, <div> to define a new section, <b> for boldface, etc). Another example is CSS and how it works with HTML. An entire web site can be written in a generic way in HTML, and by referring to different CSS files, totally different stylistic presentations can be implemented. Another example is Markdown, in which simple markup tags can be used in a plaintext document to denote stylistic features. Markdown documents can be translated into HTML and several other formats using programs like pandoc. Another example, and the one that we will be looking at here, is called LaTeX (pronounced "lay-teck").
1.1 Why not MS Word?
Separating style and content frees your mind to focus on your thoughts, and writing, without constantly being taken out of the moment, to manually implement stylistic features of your document. If every time you start a new section, or compose a list, etc, you need to mentally pause your thoughts in order to find the menu or submenu element or ribbon-bar-button to enlarge the font, make it bold, insert bullet points, adjust spacing, etc, then that detracts from the continuity of your thoughts. In programs like MS Word, all of these stylistic aspects have to be implemented manually, throughout the document. In systems like LaTeX, style is specified separately from content, all in a single place, and so changing the look of your document can be done at a separate time from writing the content.
You all know, if you've ever written a large document like a scientific manuscript, that handling things like citations and bibliographies, figures, tables, equations, etc can be a huge nuisance. There are bibliography managers with citation features, such as EndNote, but in my experience they are rather difficult and unreliable to use. Imagine you are writing for a journal like Nature or Science, in which citations appear as numbers in increasing order as the document progresses. Imagine now you have 50 citations, and you decide that in the middle of your document, after citation #25, you need to add another one. Now you have to go through and manually change citations #26-50 to be #27-51. A similar issue exists with numbered equations, with Figures, and with Tables. Moving things around as you work on subsequent drafts of your article can often involve serious busywork in order to manually change citation, figure, equation, and table numbering.
In LaTeX, all of this is handled 100% automatically for you. At no point do you need to manually number citations, figures, equations, tables, sections, subsections, chapters, etc. These things are specified using LaTeX markup codes, and when you compile your LaTeX plaintext file (which generates a .pdf), all of the numbering is handled for you by the LaTeX system. This turns out to be a huge time saver, especially for long documents like a manuscript, or a thesis.
Side note: my PhD thesis (200+ pages) was written using LaTeX, and everything including citations, equations, figures, tables, sections, even a table of contents with page numbers, is generated automatically by LaTeX. What's more, the thesis document itself is now almost 15 years old, but I can still today compile it and generate a .pdf using the original LaTeX plaintext files.
Another reason why I like systems like LaTeX more than programs like MS Word is that Word files are encoded in a binary format, and are not immediately human-readable. To read a Word file one needs a program that knows how to open Word files. What's more, Word file formats have changed over the years, so there is no guarantee that at some time in the future you will be able to easily open an old Word file. Different operating systems (Windows, Linux, Mac OS X) are also often not 100% compatible and interchangeable with respect to Word files. In contrast, systems like LaTeX use plain-text ASCII files. These are universally readable and 100% platform independent. LaTeX is also open-source and free, whereas MS Word is proprietary and licensed.
Have a look at the following sites to read more about the issue of Word vs LaTeX, and content vs style in document processing:
- Benefits of LaTeX typesetting
- Why LaTeX?
- Why LaTeX is superior to MS Word
- Why LaTeX?
- Five reasons you should use LaTeX and five tips for teaching it
Here are some links to recent LaTeX tutorials:
2 LaTeX
2.1 installing LaTeX
2.2 Basic LaTeX document structure
Every LaTeX document must have at the very least, a basic structure that looks like this:
% latexdemo1.tex
% a very basic document
\documentclass{article}
\begin{document}
Hello World!
\end{document}
This LaTeX plaintext file can be compiled from the command line by:
pdflatex latexdemo1.tex
Here is what the above document looks like when compiled into a .pdf: latexdemo1.pdf. You can see that it's not very interesting. We have our document text, "Hello World!", and we have a page number at the bottom.
The \documentclass command selects one of several pre-canned document styles. Some of the more relevant ones for us are article, which provides a basic style appropriate for scientific manuscripts and reports; report, which provides for longer reports containing several chapters; and book, for actual books. Other styles you may find useful are letter, for writing letters, and beamer, for presentations (i.e. using LaTeX instead of MS PowerPoint).
The document class command can be supplemented with one of a number of optional style options including things like font size, paper format (letter, A4, legal, etc), single vs two-columns, portrait vs landscape page orientation, etc.
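As a minimal sketch of how such options are passed (the file name and the particular combination of options are only an illustration, not one of the demo files from these notes), the options go in square brackets before the class name:

```latex
% classoptions.tex (hypothetical file name)
% 12pt:      base font size
% a4paper:   paper format
% twocolumn: two-column layout
\documentclass[12pt,a4paper,twocolumn]{article}
\usepackage{lipsum} % dummy text, so that the layout is visible

\begin{document}
\section*{Introduction}
\lipsum[1-4]
\end{document}
```

Removing twocolumn, or changing 12pt to 10pt, and recompiling changes the layout while the text itself stays the same.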
The idea here is that a given document class (e.g. article) provides a pre-determined systematic document appearance. This can be modified by other LaTeX commands. Typically these commands are entered after the \documentclass command but before the \begin{document} command.
2.3 Including Packages
In the header between the \documentclass and \begin{document} commands, one can also load LaTeX packages using the \usepackage{} command. There are a ton of packages available, and they all live in a central repository known as CTAN, the Comprehensive TeX Archive Network.
Here is one package that we will use in order to demonstrate the various document styles available: the lipsum package, which will insert chunks of dummy text so that we can see how different styles look. Here is an example:
% latexdemo2.tex
% a very basic document
\documentclass{article}
\usepackage{lipsum}

\title{A demo document in \LaTeX{}}
\author{John Doe}
\date{\today}

\begin{document}
\maketitle

\lipsum[1]

\section*{Introduction}
\lipsum[1-5]

\section*{Methods}
\lipsum[1-5]

\end{document}
Here is what the .pdf file looks like: latexdemo2.pdf.
The \lipsum[] commands generate paragraphs of dummy text. This results in a 4-page document that allows you to see the basic article style. We also see commands for the title, author, and date. These are specified in the header and are rendered differently depending on the document class.
As an exercise, try changing the \documentclass to book, recompile, and see what it looks like. Try report and do the same. You get the idea. By changing style commands independent of the content (which doesn't have to change), we can alter the appearance of our document to suit our needs.
2.4 Equations
Let's look at another example document. This time we've introduced equations:
% latexdemo3.tex
\documentclass{article}

\title{A demo document in \LaTeX{}}
\author{John Doe}
\date{\today}

\begin{document}
\maketitle

We will see two equations here. Equation \ref{eq:regression_line}
gives the formula for a line of best fit for a linear
regression. Equation \ref{eq:sample_mean} gives the formula for the
mean of a sample. It is not the same as Equation
\ref{eq:regression_line}.

This is an equation: $\hat{Y} = \beta_{0} + \beta_{1} X$.
\begin{equation}
\hat{Y} = \beta_{0} + \beta_{1} X
\label{eq:regression_line}
\end{equation}
\begin{equation}
\bar{X} = \frac{1}{N} \sum_{i=1}^{N} X_{i}
\label{eq:sample_mean}
\end{equation}

Note how the equation numbering is handled automatically for you.

\end{document}
Here is what the .pdf file looks like: latexdemo3.pdf.
Note a couple of things. First, equations are specified using LaTeX equation codes. A good summary of all the codes for the various math symbols is here.
Note also that nowhere in the LaTeX document itself did I manually specify the equation numbers. I only specified labels, and then within the document I can refer to these labels without having to worry about the numbering. As an exercise, try switching the order of the equations, and recompile. See how the numbering is automatically adjusted for you. Note also that you will likely need to issue the pdflatex command twice. The first time you issue it you might see a message like this:
LaTeX Warning: Label(s) may have changed. Rerun to get cross-references right.
The LaTeX system needs multiple passes to automatically handle all of the numbering for elements like equations, sections, tables, figures, citations, etc.
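The same label-and-reference mechanism applies to sections, figures, and tables. The following is a minimal sketch rather than one of the demo files from these notes: the file name, the label names, and the table contents are all made up for illustration, and an empty framed box stands in for a real graphic so that the file compiles on its own.

```latex
% crossrefdemo.tex (hypothetical file name)
\documentclass{article}

\begin{document}

\section{Results}
\label{sec:results}

Section \ref{sec:results} refers to Figure \ref{fig:placeholder}
and Table \ref{tab:numbers} by label, never by number.

\begin{figure}[ht]
  \centering
  % an empty 6cm x 3cm framed box stands in for a real graphic
  \fbox{\rule{0pt}{3cm}\rule{6cm}{0pt}}
  \caption{A placeholder figure.}
  \label{fig:placeholder} % the label must come after the caption
\end{figure}

\begin{table}[ht]
  \centering
  \caption{Some made-up numbers.}
  \label{tab:numbers}
  \begin{tabular}{lr}
    \hline
    Item & Value \\
    \hline
    A & 1 \\
    B & 2 \\
    \hline
  \end{tabular}
\end{table}

\end{document}
```

As with the equations, this needs two pdflatex passes before the references resolve. Adding a second figure ahead of the first one and recompiling should renumber everything without any change to the \ref commands.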
2.5 Citations & Bibliography
In LaTeX, citations and bibliographies are handled by a separate but related program (included with LaTeX) called BibTeX. BibTeX files are also plain text files, that list bibliography items in a standardized format. There is some documentation here.
Here is an example of a BibTeX file:
@article{poldrack2012future,
  title={The future of fMRI in cognitive neuroscience},
  author={Poldrack, Russell A},
  journal={Neuroimage},
  volume={62},
  number={2},
  pages={1216--1220},
  year={2012},
  publisher={Elsevier}
}

@article{myung2003tutorial,
  title={Tutorial on maximum likelihood estimation},
  author={Myung, In Jae},
  journal={Journal of Mathematical Psychology},
  volume={47},
  number={1},
  pages={90--100},
  year={2003},
  publisher={Elsevier}
}

@book{dawkins1996blind,
  title={The blind watchmaker: Why the evidence of evolution reveals a universe without design},
  author={Dawkins, Richard},
  year={1996},
  publisher={WW Norton \& Company}
}
You can see this .bib file has three entries: two journal articles and one book. Now within our LaTeX file, if we want to cite any of these articles, we need to do a couple of things.
First we need to specify the name of the .bib file containing the references using the \bibliography{} command. We also need to specify a style for our bibliography using the \bibliographystyle{} command. There are several styles built in; see Choosing a BibTeX Style. There are also styles you can import using the \usepackage command, e.g. the chicago style, the apacite style for APA style, and the natbib package for more customizable styles. Also see BibTeX Bibliography and LaTeX Style Formats for Biologists (though slightly outdated, I think).
Finally, we can use the \cite{} command to actually cite papers. Here is an example:
% latexdemo4.tex
\documentclass{article}

\title{A demo document in \LaTeX{}}
\author{John Doe}
\date{\today}

\usepackage{chicago}

\begin{document}
\maketitle

\section*{Introduction}

Here is a sentence in which we cite an article on fMRI
\cite{poldrack2012future}. Here is another sentence in which we cite a
paper on maximum likelihood estimation
\cite{myung2003tutorial}. Finally here is a citation of a book on
evolution \cite{dawkins1996blind}. Here is a sentence in which we cite
all three
\cite{poldrack2012future,myung2003tutorial,dawkins1996blind}.

\bibliographystyle{chicago}
\bibliography{latexdemorefs}

\end{document}
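To compile this we need to issue a series of commands. The first pdflatex run records the citation keys, bibtex builds the bibliography from the .bib file, and the last two pdflatex runs pull the bibliography in and then resolve the citations in the text:
pdflatex latexdemo4
bibtex latexdemo4
pdflatex latexdemo4
pdflatex latexdemo4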
Here is what the .pdf looks like: latexdemo4.pdf.
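For comparison, here is a minimal sketch of the same idea using the natbib package mentioned above. This is an illustration rather than one of the demo files from these notes: the file name is made up, and it assumes the latexdemorefs.bib file contains the entries listed earlier. natbib provides \citet for citations that are part of the sentence and \citep for parenthetical ones.

```latex
% natbibdemo.tex (hypothetical file name)
\documentclass{article}
\usepackage{natbib}

\begin{document}

% \citet gives a textual citation, \citep a parenthetical one
\citet{myung2003tutorial} gives a tutorial on maximum likelihood
estimation. The future of fMRI has also been discussed
\citep{poldrack2012future}.

\bibliographystyle{plainnat} % an author-year style shipped with natbib
\bibliography{latexdemorefs}

\end{document}
```

This compiles with the same sequence of pdflatex and bibtex commands as latexdemo4. Only the works actually cited in the text appear in the bibliography.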
A convenient way to get bibliography entries into .bib format is by using Google Scholar. Under each article you find there is a link called cite, and when you click that, a window pops up and one of the links at the bottom of this window is Import into BibTeX. If you click that you will see the entry in .bib format, and you can copy and paste it into your .bib file.
For the Mac there is a really nice free, open-source program called BibDesk that provides a pointy-clicky GUI interface to .bib files. You can even search PubMed, Google Scholar, etc and have the results imported automatically into a .bib file.
2.6 article template
Here is a link to a template for a basic scientific manuscript. There are some formatting codes that we haven't seen yet, but you can probably guess what they do, and if not you can look them up on google to find out. The template includes a title page, a bibliography, Figures plus Figure Captions, and all the standard article sections (Abstract, Introduction, Methods, Results, Discussion).
and the compiled article in .pdf format:
2.7 Next steps
My suggestion is to go through some of the first few sections of
I will openly admit that getting to know LaTeX involves somewhat of a learning curve. The best place to start is by looking at example templates and modifying them to suit your needs.
Here are some links to gentle introductions:
- An introduction to LaTeX
- LaTeX (wikibooks)
- Getting to Grips with LaTeX
- How to use BibTeX
- LaTeX Cheat Sheet (print this out and keep it handy)
3 Markdown
Markdown is a lightweight markup language originally intended as a way to automatically generate HTML code from easy-to-read formatting codes. It was developed by John Gruber and Aaron Swartz. Since its original conception it has sort of blossomed into a format that has gone way beyond just HTML.
Markdown documents can be written in any plain text editor, for indeed at the end of the day (much like LaTeX), Markdown documents are just plain text documents, annotated with special markup codes that specify formatting of common format elements, like headings, subheadings, lists, tables, inline graphics, code snippets (with syntax highlighting), etc. Here is a Markdown Cheatsheet that shows you the kind of things that are possible.
Many advanced text editors, like Sublime Text, and others, have Markdown knowledge "built in". The convention is that Markdown files have a .md filename suffix.
There are a number of ways to generate output-formatted documents from Markdown. Some text editors can generate HTML previews (and HTML output files) from Markdown — just google Markdown editors.
There are also programs like Marked that will show you an output-formatted preview, as you edit a Markdown file in any editor (Marked runs in the background).
Finally, the program called Pandoc can convert from Markdown (and many other formats) to many output formats including HTML, LaTeX, pdf, MS Word, etc.
Some people write papers in Markdown format:
4 Reproducible Research
In a typical scientific workflow, one has data analysis code written in one language (e.g. R or Python) stored in files somewhere, one might have Figures generated using another program, and their files stored somewhere else, and finally a manuscript in some other program, e.g. MS Word, with that file stored somewhere else.
This is non-optimal in the long run for a couple of reasons. First, if there is ever a change to the data, or a need to re-analyse data, or re-generate a Figure, doing so and incorporating that change into your manuscript document will involve several error-prone steps, like locating the appropriate data analysis code files, performing whatever steps are necessary to re-generate the Figure, and then finally replacing the old Figure with the new Figure in the manuscript document itself. Second, you end up with several different kinds of files, sometimes in several different locations. Third, if you want to share your data and analysis workflow with colleagues, this depends on them (1) being able to re-create the same set of steps as you did, and (2) it may depend on them having specific, particular software.
The idea behind so-called reproducible research is that your data analysis, graphical Figures, and manuscript all live in one single document… or at least they are tied together if they are in separate files… and that the document can be automatically prepared (e.g. compiling it in LaTeX) and in doing so, will execute the data analysis code, and graphical Figure generation, all in one automatic step. This is advantageous both for your own organization and also for sharing the data + analysis with others.
The insertion of code that can be executed, within a document, is sometimes called literate programming.
Here's a recent paper in PLoS Computational Biology:
Sandve, G. K., Nekrutenko, A., Taylor, J., & Hovig, E. (2013). Ten Simple Rules for Reproducible Computational Research. PLoS Computational Biology, 9(10), e1003285.
4.1 Sweave
One system for reproducible research that is worth looking at is Sweave (pronounced "ess weave"). Sweave is a tool that allows you to directly embed R code for data analysis and/or Figure generation into a LaTeX document. Sweave comes included with R so if you've installed R, you have Sweave.
The workflow using Sweave involves inserting special Sweave codes into your LaTeX document to mark R code chunks, and saving the file with a .Rnw suffix. The next step is to process this .Rnw file into a .tex file, using the command R CMD Sweave. Then you can compile the resulting .tex file using R CMD pdflatex. Here is a simple example:
% sweavedemo1.Rnw
% a very basic Sweave document
\documentclass{article}

\title{My manuscript}
\author{John Doe}
\date{\today}

\usepackage[margin=1.0in]{geometry}

\begin{document}
\maketitle

\abstract{This is my article. It will be revolutionary. It will overturn all the current theories.}

\section*{Introduction}

Hello world. Here is where I do my lit review.

\section*{Methods}

I haven't really done anything yet.

\section*{Results}

I have a list of numbers. Whoopee. Here is a plot of my 7 numbers:

<<echo=false,fig=true,height=4>>=
mylist <- c(1,2,3,4,5,4,3)
mlm <- mean(mylist)
plot(mylist, type="b", col="blue", lwd=2, xlab="my numbers", ylab="value")
@

The mean of my list turned out to be \Sexpr{round(mlm*100)/100}.

\section*{Discussion}

What would you like me to discuss?

\end{document}
We compile this into a .pdf in two steps:
R CMD Sweave sweavedemo1.Rnw
R CMD pdflatex sweavedemo1.tex
Here is the resulting .pdf file: sweavedemo1.pdf.
Notice a couple of things. In the .Rnw document, each chunk of R code opens with a <<>>= line (with any chunk options placed between the angle brackets) and closes with a line containing only @. Also note that the code is just R code; it runs the same as if you were running it independently in an R session. Also notice that we can access R variables that have been defined in the code so far, using the \Sexpr{} command.
What this all means, if you are starting to "get it", is that in principle you could have all of your data analysis and Figure generation code inside your manuscript file. When anything changes (new data, or modified analysis) all you would have to do is recompile the document, .Rnw -> .tex -> .pdf, and all of the changes will propagate automatically. It also makes it really easy to share with others, and it means someone else ought to be able to directly run the same exact analysis that you published in your paper. This would seem like a desirable feature in science.
See these links for some more information, and tutorials, about using Sweave:
4.2 Pweave
There is a system called Pweave that allows you to do a similar thing as Sweave, but with Python code instead of R code. In fact Pweave supports not just LaTeX but also HTML and Pandoc formats as well.
I don't have much experience yet with Pweave so I will not provide sample documents (yet). I will leave it up to you to get started with Pweave if you desire Python literate code.
Here is a LaTeX example using Pweave. You can see that the idea is very similar to Sweave.
4.3 MATweave
There appears to be a solution out there for integrating MATLAB with LaTeX as well, called MATweave. I haven't tried it myself but if you're interested in pursuing the reproducible research approach using MATLAB it might be worth checking out.
4.4 iPython notebook
Finally, another tool that might be useful to you is the iPython Notebook. This is a tool that comes with iPython that provides a web-based interactive environment for combining Python code, document text, graphics, and even rich media such as videos and web pages, all in a single document.
iPython notebooks can be saved in an iPython notebook format (.ipynb) or as Python code (.py). There is even an iPython Notebook Viewer where you can share iPython notebooks online. There is a tool called nbconvert that can convert iPython notebook files into many formats including HTML, LaTeX, Markdown, etc.
Here is how you start the iPython notebook from the command line, to automatically load the SciPy, NumPy, and matplotlib environment:
ipython notebook --pylab=inline
This should then automatically launch your web browser and show the current directory, and any iPython notebook files that reside there. To create a new one, click the "New Notebook" button.
Here is a link to a gallery of interesting iPython Notebooks. There are also some sample notebooks on the iPython Notebook Viewer homepage.
Here are some YouTube videos demonstrating the many features of iPython including the iPython Notebook:
- iPython in Depth, SciPy2013 Tutorial 1/3 (YouTube)
- iPython in Depth, SciPy2013 Tutorial 2/3 (YouTube)
- iPython in Depth, SciPy2013 Tutorial 3/3 (YouTube)
4.5 Including code listings
A common situation (e.g. in a course like this one) is where you want to include a code listing, or segments of code, in a document. The tools we have discussed all have the ability to do this.
For LaTeX, there is a package called listings that can include code listings in a nicely formatted way that is specific to whatever language the code is in. The listings package "knows about" many different programming languages. There are lots of customizable options for the listing package, see the documentation for details.
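Here is a minimal sketch of what using the listings package can look like. The file name, the style settings, and the Python function are made up for illustration, and the package documentation describes many more options than the few shown here.

```latex
% listingsdemo.tex (hypothetical file name)
\documentclass{article}
\usepackage{listings}

% settings applied to every listing in the document
\lstset{
  language=Python,
  basicstyle=\ttfamily\small,
  numbers=left,
  frame=single
}

\begin{document}

Here is a short Python function:

\begin{lstlisting}
def mean(values):
    """Return the arithmetic mean of a list of numbers."""
    return sum(values) / len(values)
\end{lstlisting}

% a whole source file can be pulled in instead
% (myscript.py is a hypothetical file name):
% \lstinputlisting{myscript.py}

\end{document}
```

Changing the language setting in \lstset is enough to switch the highlighting rules to another language that the package knows about.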
For iPython Notebook, of course, code blocks are shown inline as part of the notebook document itself; you don't need to do anything special.
4.6 knitr
There is another package called knitr which is very much like Sweave, but incorporates a number of fixes and simplifications that improve on some of the long-standing issues and gotchas with Sweave.
In addition, knitr can work with either LaTeX or Markdown input files. See the main knitr website for links to examples.
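To give a rough idea, here is a sketch of how the plot chunk from the Sweave example might be written for knitr with LaTeX input. The file name and the chunk label (myplot) are made up for illustration. Two differences from Sweave are assumed here: chunk options are written as R values (echo=FALSE rather than echo=false), and plots are included automatically, so no fig=true option is needed.

```latex
% knitrdemo1.Rnw (hypothetical file name)
\documentclass{article}
\begin{document}

Here is a plot of my 7 numbers:

<<myplot, echo=FALSE, fig.height=4>>=
mylist <- c(1,2,3,4,5,4,3)
mlm <- mean(mylist)
plot(mylist, type="b", col="blue", lwd=2, xlab="my numbers", ylab="value")
@

The mean of my list turned out to be \Sexpr{round(mlm, 2)}.

\end{document}
```

One way to compile this is to call knitr::knit2pdf("knitrdemo1.Rnw") from within an R session, which should run the R code and then LaTeX in a single step. Check the knitr documentation for the options available in your installed version.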
5 Some links
- How to ditch Word
- R and My Divorce From Word
- Markdown for scientific writing
- All you need is text — Markdown (via pandoc) for academia
- Detexify — a LaTeX symbol classifier (draw using your mouse)
- AsciiDoc — Text based document generation
- Asciidoctor — A fast text processor & publishing toolchain for converting AsciiDoc to HTML5, DocBook & more
- writeLaTeX — online collaborative LaTeX Editor (with templates)
- Intro to LaTeX course part 1 — through writeLaTeX
- Intro to LaTeX course part 2
- Intro to LaTeX course part 3
6 Assignment
Assignment 3 is due no later than Sun Nov 2, 2013, 23:59:59 EDT.