5. File input and output
Most of the time you will be writing programs that analyse data — whether those data are collected from experiments, or generated by models and simulations. We will need to be able to read in data from a file. It will also be useful to be able to write data out to a file.
This page is a useful summary of file input/output operations in Python/NumPy/SciPy.
For MATLAB see documentation pages for data import and export.
For R there is useful documentation for R Data Import/Export.
For C see the section in the C Bootcamp on Input and Output or C file input/output (Wikipedia) or C Programming/File IO (wikibooks).
1 ASCII vs binary files
There are two types of file formats: ASCII files (otherwise known as plain text files) and binary files. In fact, this is a lie, and there really is only one file type, namely binary files, since all data are ultimately stored as 0s and 1s (binary). But we have conventions, like the ASCII code, which allow us to make assumptions and make life easier. If a (binary) file is coded using a series of bytes, each of which corresponds to an ASCII code, then this file is in fact a "plain text" or "ASCII" file, and you can read it using any plain text editor like vim, emacs, sublime text or notepad (even MS Turd will open plain text files).
If what I just said sounds like gibberish to you then you might want to look at the following refresher on digital representation of data: A1. Digital Representation of Data, and also have a look at this web page on Ascii vs. Binary Files.
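To make the convention concrete, here is a minimal Python sketch (the string is just an illustrative choice, not part of any data file) showing that a piece of plain text is really a sequence of bytes, each holding one ASCII code:

```python
# Encode a short string as ASCII and inspect the underlying bytes.
text = "2 3\n"
raw = text.encode("ascii")               # bytes object: one byte per character

codes = list(raw)                        # the ASCII code of each character
bits = [format(b, "08b") for b in raw]   # the same bytes written as 0s and 1s

print(codes)   # [50, 32, 51, 10]
print(bits)    # ['00110010', '00100000', '00110011', '00001010']

# Decoding the bytes with the ASCII convention recovers the text.
assert raw.decode("ascii") == text
```

The character "2" is stored as the code 50, not as the number 2. That is the difference between a number written as text and a number stored in binary.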
2 Reading plain text files
If your data are stored in a plain text file (sometimes called an ASCII file) then each language has built-in functions that we can use to read the values.
Let's say we have a plain text file called data05.txt that looks like this:
2 3
4 5
6 7
8 9
Python / NumPy
The NumPy function loadtxt() is our friend here. Check out the help documentation for a full description. You can specify the variable type (e.g. "float" or "int", etc), and you can specify the delimiter that separates each column of data. The data are loaded into a NumPy array.
>>> from numpy import loadtxt
>>> data = loadtxt("data05.txt", dtype="float", delimiter=" ")
>>> data
array([[2., 3.],
       [4., 5.],
       [6., 7.],
       [8., 9.]])
>>> type(data)
<class 'numpy.ndarray'>
>>> data[:,0]
array([2., 4., 6., 8.])
>>> data[:,1]
array([3., 5., 7., 9.])
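If you don't have data05.txt handy, the following sketch recreates it first. The temporary directory is my own assumption so that the example runs anywhere. It also reads the values as integers, and leaves out the delimiter so that loadtxt() splits on any whitespace:

```python
import os
import tempfile

import numpy as np

# Recreate the example file in a temporary directory (hypothetical location).
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "data05.txt")
with open(path, "w") as f:
    f.write("2 3\n4 5\n6 7\n8 9\n")

# With no delimiter given, loadtxt splits on any whitespace.
data = np.loadtxt(path, dtype=int)

print(data.shape)   # (4, 2)
print(data[:, 0])   # first column
print(data[:, 1])   # second column
```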
MATLAB / Octave
There is a rather general purpose function called load() that will load all kinds of files. It tries to auto-detect the file type and behave appropriately. See the help documentation for a full description of all of the options.
>> data = load('data05.txt');
>> data

data =

     2     3
     4     5
     6     7
     8     9

>> whos data
  Name      Size            Bytes  Class     Attributes

  data      4x2                64  double

>> data(:,1)

ans =

     2
     4
     6
     8

>> data(:,2)

ans =

     3
     5
     7
     9
R
In R your best option is to use the read.table() function, which loads data into a variable type called a data frame, which is a very useful type for storing data tables. See the help documentation for a summary of all of the optional arguments (there are many). One of the optional arguments is called col.names and allows you to specify the names of each column of data. This allows you, if you wish, to access the data by name rather than by column number (see example code below).
> d <- read.table("data05.txt", sep=" ", col.names=c("pizza","cheeseburger"))
> d
  pizza cheeseburger
1     2            3
2     4            5
3     6            7
4     8            9
> d[,1]
[1] 2 4 6 8
> d[,2]
[1] 3 5 7 9
> names(d)
[1] "pizza"        "cheeseburger"
> d[,"pizza"]
[1] 2 4 6 8
> d[,"cheeseburger"]
[1] 3 5 7 9
> d$pizza
[1] 2 4 6 8
> d$cheeseburger
[1] 3 5 7 9
3 Reading binary files
A binary file, unlike a plain text file, is not directly readable or recognizable. If you try to open a binary file in a text editor you will see what appears to be random characters. An ascii file contains sequences of bytes (typically 8 bits) that are interpreted, when the file is loaded by a text editor, as ascii characters. In contrast, a binary file contains bits and bytes that do not necessarily correspond to ascii characters. For example, try opening a MS Turd file in a plain text editor and you will see lots of interesting stuff corresponding to custom binary representations of various properties of your document.
Why binary files? When you store data in a binary format, it typically results in a smaller file than when it is stored in a plaintext (ascii) file. Binary files are typically faster to read and write, as well.
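A rough illustration of the size difference, using only the Python standard library. The values are made up, and the exact ratio depends on how many digits your numbers have (very short numbers can actually take less space as text):

```python
import struct

values = list(range(1000000, 1001000))   # 1000 seven-digit integers

# Plain text: one number per line, one byte per character.
as_text = "".join(f"{v}\n" for v in values).encode("ascii")

# Binary: each value packed as a 4-byte little-endian integer.
as_binary = struct.pack(f"<{len(values)}i", *values)

print(len(as_text))    # 8000 bytes (7 digits + newline per value)
print(len(as_binary))  # 4000 bytes (4 bytes per value)
```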
Ascii files follow a prescribed format (sequences of 8-bit bytes, interpreted as ascii codes). In contrast, binary files can follow any arbitrary format that you want. If you keep the format hidden (proprietary) then you can make it so only your program(s) can open the data files, and nobody else can (unless they know the format). This is not a good thing in science, however, if we want our work to be open and available to all. If you store data in a binary format then (in my opinion) you ought to also provide the binary format, so that the file is accessible to all.
An Example
There is a file you can download called data05.bin that is in a custom binary format (that I made up for the purposes of this example). There is a header, followed by the data. The first 16 bytes of the header contain a character string (16 characters long). The last 8 bytes of the header contain two integers that tell us how many rows and columns (in that order) of data there are. Each integer is 4 bytes long. So the total header is 24 bytes long. The data follow this and are 4-byte integers each. The data are ordered row-wise, that is row 1 (v1,v2) followed by row 2 (v1,v2), etc. For those interested, I generated this data file in C using this program: writebinarydata.c.
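To see the layout in action without the downloaded file, here is a sketch using Python's struct module that builds the same structure in memory and then unpacks it again. It is not the writebinarydata.c program. The little-endian byte order ("<") is an assumption, and the header string and values are simply the ones that appear in the example output on this page:

```python
import struct

# --- write: build the 24-byte header followed by the data ---
label = b"October 15, 2013"                  # exactly 16 bytes
rows, cols = 5, 2
values = [v for i in range(1, rows + 1) for v in (i, 2 * i)]  # row-wise

header = struct.pack("<16sii", label, rows, cols)
body = struct.pack(f"<{rows * cols}i", *values)
blob = header + body

# --- read: unpack the header, then use it to size the data ---
hs, n_rows, n_cols = struct.unpack_from("<16sii", blob, 0)
flat = struct.unpack_from(f"<{n_rows * n_cols}i", blob, struct.calcsize("<16sii"))
table = [list(flat[r * n_cols:(r + 1) * n_cols]) for r in range(n_rows)]

print(len(header))          # 24
print(hs.decode("ascii"))   # October 15, 2013
print(table)                # [[1, 2], [2, 4], [3, 6], [4, 8], [5, 10]]
```

The reader only knows how much data to expect because the header says so. That is why the format has to be documented for anyone else to open the file.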
Here is how we would read this file:
C
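#include <stdio.h>

int main(int argc, char *argv[]) {

  // open in binary mode ("rb") so the bytes are read untranslated
  FILE *fid = fopen("data05.bin", "rb");
  if (fid == NULL) {
    printf("error: could not open data05.bin\n");
    return 1;
  }

  // read the header: a 16-character string, then two 4-byte integers
  char hs[16];
  fread(hs, 1, 16, fid);
  int rows;
  int cols;
  fread(&rows, 4, 1, fid);
  fread(&cols, 4, 1, fid);

  // the arrays below hold at most 5 rows, so refuse anything larger
  if (rows < 0 || rows > 5) {
    printf("error: unexpected number of rows (%d)\n", rows);
    fclose(fid);
    return 1;
  }

  // read 5 integer values for each of two variables
  // format is: v1 v2 v1 v2 v1 v2 v1 v2 v1 v2
  // where v1,v2 are each 4 bytes long
  int v1[5];
  int v2[5];
  int i;
  for (i=0; i<rows; i++) {
    fread(&(v1[i]), 4, 1, fid);
    fread(&(v2[i]), 4, 1, fid);
  }
  fclose(fid);

  // print to screen (hs is not null-terminated, so print it char by char)
  for (i=0; i<16; i++) {
    printf("%c", hs[i]);
  }
  printf("\nrows=%d, cols=%d\n", rows, cols);
  for (i=0; i<rows; i++) {
    printf("%3d %3d\n", v1[i], v2[i]);
  }
  return 0;
}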
MATLAB
>> fid = fopen('data05.bin', 'r');
>> hs = fread(fid,16,'int8=>char')';
>> rows = fread(fid, 1, 'int');
>> cols = fread(fid, 1, 'int');
>> d = fread(fid, rows*cols, 'int');
>> d = reshape(d,cols,rows)';
>> fclose(fid);
>> disp(['header = ',hs]);
header = October 15, 2013
>> disp(['rows = ',num2str(rows),' cols = ',num2str(cols)]);
rows = 5 cols = 2
>> d

d =

     1     2
     2     4
     3     6
     4     8
     5    10
Python / NumPy
>>> from numpy import fromfile
>>> fd = open("data05.bin", "rb")
>>> hs = fromfile(fd, dtype="<i1", count=16)
>>> hs = ''.join(map(chr,list(hs)))
>>> rows = fromfile(fd, dtype="<i4", count=1)[0]
>>> cols = fromfile(fd, dtype="<i4", count=1)[0]
>>> d = fromfile(fd, dtype="<i4", count=(rows*cols))
>>> d = d.reshape((rows,cols))
>>> fd.close()
>>> print("header = %s" % (hs))
header = October 15, 2013
>>> print("rows = %d, cols = %d" % (rows, cols))
rows = 5, cols = 2
>>> print(d)
[[ 1  2]
 [ 2  4]
 [ 3  6]
 [ 4  8]
 [ 5 10]]
R
> fd <- file("data05.bin", "rb")
> hs <- rawToChar(as.raw(readBin(fd, integer(), n=16, size=1)))
> rows <- readBin(fd, integer(), n=1, size=4)
> cols <- readBin(fd, integer(), n=1, size=4)
> d <- readBin(fd, integer(), n=(rows*cols), size=4)
> d <- matrix(d, nrow=rows, ncol = cols, byrow=TRUE)
> close(fd)
> cat("header = ", hs, "\n")
header =  October 15, 2013
> cat("rows =", rows, "cols =", cols, "\n")
rows = 5 cols = 2
> d
     [,1] [,2]
[1,]    1    2
[2,]    2    4
[3,]    3    6
[4,]    4    8
[5,]    5   10
4 Reading some common file formats
Each language has its own particular binary file format for storing data. MATLAB saves data in binary files with a .mat suffix. The save() command does this. The load() command loads .mat files back into memory.
In R, write.table() will save a data frame to a file, in an ASCII format. The save() function will save data in a binary format specific to R. It can be loaded again using the load() command.
In Python, NumPy has a function called save() that will save an array to a binary file in NumPy .npy format. The savez() function will save several arrays into a binary .npz archive. The load() function will load the data back into NumPy.
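Here is a small round-trip sketch of those three NumPy functions. The arrays, file names and temporary directory are arbitrary choices for illustration:

```python
import os
import tempfile

import numpy as np

tmpdir = tempfile.mkdtemp()          # hypothetical scratch location
a = np.array([[2, 3], [4, 5], [6, 7], [8, 9]])
b = np.linspace(0.0, 1.0, 5)

# One array -> one .npy file.
npy_path = os.path.join(tmpdir, "a.npy")
np.save(npy_path, a)
a2 = np.load(npy_path)

# Several arrays -> one .npz archive, retrieved by keyword name.
npz_path = os.path.join(tmpdir, "both.npz")
np.savez(npz_path, a=a, b=b)
with np.load(npz_path) as archive:
    a3 = archive["a"]
    b3 = archive["b"]

print(np.array_equal(a, a2), np.array_equal(a, a3), np.array_equal(b, b3))
```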
In plain Python (no NumPy involved) there is a way of saving Python variables to binary files, in a format called a pickle. See pickle - Python object serialization for the details.
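A minimal pickle sketch follows. The dictionary contents are invented for illustration. It serializes to a bytes object in memory; pickle.dump() and pickle.load() do the same thing with a file opened in binary mode:

```python
import pickle

record = {"subject": "s01", "rt_ms": [312, 287, 301], "valid": True}

blob = pickle.dumps(record)        # serialize to a bytes object
restored = pickle.loads(blob)      # rebuild an equal Python object

print(type(blob))                  # <class 'bytes'>
print(restored == record)          # True
```

One caution from the pickle documentation: only unpickle data that you trust, since loading a pickle can execute arbitrary code.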
In C, there is no standard binary format for files; it's up to you to design your own.
Other common binary file types
Each language has the ability to load a number of other commonly used binary file types, including image files (jpeg, tiff, etc) and some other common types such as Microsoft Excel files.
For R, see R Data Import/Export for details of the different formats R can load. There is even a package to read and write MATLAB .mat files, see Package R.matlab.
For Python / NumPy, one option to know about is the pandas library. There are a range of file input/output options listed here: IO Tools. NumPy also has a number of different file i/o options, see Input and output for details.
In MATLAB there is a function called xlsread() that will read MS Excel files, see here for documentation.
For common neuroimaging file formats, e.g. NIfTI and ANALYZE, you can find libraries and third-party packages for each language that will read and write these formats. Just search on Google.
5 Saving data to files
Each of the file reading methods discussed above has a matching file writing function. See the corresponding documentation for details.
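As one example of such a pair, NumPy's savetxt() is the counterpart of loadtxt(). The sketch below writes the example table as plain text and reads it back. The output path and the "%d" integer format are my own choices for illustration:

```python
import os
import tempfile

import numpy as np

data = np.array([[2, 3], [4, 5], [6, 7], [8, 9]])

path = os.path.join(tempfile.mkdtemp(), "out.txt")   # hypothetical path
np.savetxt(path, data, fmt="%d", delimiter=" ")

with open(path) as f:
    contents = f.read()
print(contents)          # the file is human-readable text

back = np.loadtxt(path, dtype=int)
print(np.array_equal(data, back))
```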
6 Text (ascii) or Binary?
The question of which file format to use as you go forward and write programs for analysing your data is an interesting one to consider. For long-term archival purposes, I would suggest storing your data in an ascii format, so that it remains readable by human eyes. There will always be programs to read in ascii files. The risk of storing data in a binary format is that (a) whatever program you used to save the data will no longer be easily accessible in the future, and/or (b) you may not even remember what the binary format is. The disadvantage of storing data in an ascii or plain text format is that the files will be larger than if they were stored in a binary format. The availability and affordability of large amounts of storage is growing so quickly however that I don't think one has to worry too much about this problem.
7 Exercises
Exercises 16, 17 and 18 will get you reading in files of different formats.