5. File input and output

1. ASCII vs binary files
2. Reading plain text files
3. Reading binary files
4. Reading some common file formats
5. Saving data to files
6. Text (ascii) or Binary?
7. Exercises

Most of the time you will be writing programs that analyse data, whether those data are collected from experiments or generated by models and simulations. You will need to be able to read data in from a file, and it will also be useful to be able to write data out to a file.

This page is a useful summary of file input/output operations in Python/NumPy/SciPy.

For MATLAB see documentation pages for data import and export.

For R there is useful documentation for R Data Import/Export.

For C see the section in the C Bootcamp on Input and Output or C file input/output (Wikipedia) or C Programming/File IO (wikibooks).

1 ASCII vs binary files

There are two types of file formats: ASCII files (otherwise known as plain text files) and binary files. In fact, this is a lie: there really is only one file type, namely binary files, since all data are ultimately stored as 0s and 1s (binary). But we have conventions, like the ASCII code, which allow us to make assumptions and make life easier. If every byte in a (binary) file corresponds to an ASCII code, then the file is in fact a "plain text" or "ascii" file, and you can read it using any plain text editor such as vim, emacs, Sublime Text or Notepad (even MS Word will open plain text files).

If what I just said sounds like gibberish to you then you might want to look at the following refresher on digital representation of data: A1. Digital Representation of Data, and also have a look at this web page on Ascii vs. Binary Files.
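To make the distinction concrete, here is a minimal Python sketch. The file name ascii_demo.txt is made up for illustration and the file is created in a temporary directory. It writes one line of text and then reads the same file back as raw bytes, showing that each byte is simply the ASCII code of one character.

```python
import os
import tempfile

# Hypothetical file name, created in a temporary directory
path = os.path.join(tempfile.mkdtemp(), "ascii_demo.txt")

# Write one line of plain text (newline="\n" avoids platform translation)
with open(path, "w", encoding="ascii", newline="\n") as f:
    f.write("2 3\n")

# Read the same file back as raw bytes ("rb" = read, binary)
with open(path, "rb") as f:
    raw = f.read()

codes = list(raw)            # one integer (0-255) per byte
text = raw.decode("ascii")   # the same bytes interpreted as ASCII

print(codes)        # [50, 32, 51, 10]
print(repr(text))   # '2 3\n'
```

The four bytes are the ASCII codes for "2", space, "3" and newline. A text editor performs the decoding step for you every time it opens a file.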

2 Reading plain text files

If your data are stored in a plain text file (sometimes called an ascii file) then each language has built-in functions that we can use to read values.

Let's say we have a plain text file called data05.txt that looks like this:

2 3
4 5
6 7
8 9

Python / NumPy

The NumPy function loadtxt() is our friend here. Check out the help documentation for a full description. You can specify the variable type (e.g. "float" or "int", etc), and you can specify the delimiter that separates each column of data. The data are loaded into a NumPy array.

>>> from numpy import loadtxt
>>> data = loadtxt("data05.txt", dtype="float", delimiter=" ")
>>> data
array([[2., 3.],
       [4., 5.],
       [6., 7.],
       [8., 9.]])
>>> type(data)
<class 'numpy.ndarray'>
>>> data[:,0]
array([2., 4., 6., 8.])
>>> data[:,1]
array([3., 5., 7., 9.])

MATLAB / Octave

There is a rather general purpose function called load() that will load all kinds of files. It tries to auto-detect the file type and behave appropriately. See the help documentation for a full description of all of the options.

>> data = load('data05.txt');
>> data

data =

     2     3
     4     5
     6     7
     8     9

>> whos data
  Name      Size            Bytes  Class     Attributes

  data      4x2                64  double              

>> data(:,1)

ans =

     2
     4
     6
     8

>> data(:,2)

ans =

     3
     5
     7
     9

R

In R your best option is to use the read.table() function, which loads data into a variable type called a data frame, which is a very useful type for storing data tables. See the help documentation for a summary of all of the optional arguments (there are many). One of the optional arguments is called col.names and allows you to specify the names of each column of data. This allows you, if you wish, to access the data by name rather than by column number (see example code below).

> d <- read.table("data05.txt", sep=" ", col.names=c("pizza","cheeseburger"))
> d
  pizza cheeseburger
1     2            3
2     4            5
3     6            7
4     8            9
> d[,1]
[1] 2 4 6 8
> d[,2]
[1] 3 5 7 9
> names(d)
[1] "pizza"        "cheeseburger"
> d[,"pizza"]
[1] 2 4 6 8
> d[,"cheeseburger"]
[1] 3 5 7 9
> d$pizza
[1] 2 4 6 8
> d$cheeseburger
[1] 3 5 7 9

3 Reading binary files

A binary file, unlike a plain text file, is not directly readable or recognizable. If you try to open a binary file in a text editor you will see what appears to be random characters. An ascii file contains sequences of bytes (typically 8 bits) that are interpreted, when the file is loaded by a text editor, as ascii characters. In contrast, a binary file contains bits and bytes that do not necessarily correspond to ascii characters. For example, try opening an MS Word file in a plain text editor and you will see lots of interesting stuff corresponding to custom binary representations of various properties of your document.

Why binary files? When you store data in a binary format, it typically results in a smaller file than when it is stored in a plaintext (ascii) file. Binary files are typically faster to read and write, as well.
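As a rough illustration of the size difference, the sketch below encodes the same four integers both ways using only the Python standard library. The values are arbitrary ones chosen for this example, and little-endian 4-byte integers are an assumption made here.

```python
import struct

# Arbitrary example values (each fits in a signed 4-byte integer)
values = [123456789, 987654321, 192837465, 564738291]

# Text: each value written as decimal digits, one value per line
as_text = "".join(f"{v}\n" for v in values).encode("ascii")

# Binary: each value packed as a little-endian 4-byte integer
as_binary = struct.pack("<4i", *values)

print(len(as_text))    # 40 bytes (9 digits + newline per value)
print(len(as_binary))  # 16 bytes (4 bytes per value)
```

The saving depends on the data. A single-digit value takes only two bytes as text (the digit plus a newline) but still four bytes in this binary encoding, which is why the text above says "typically".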

Ascii files follow a prescribed format (sequences of 8-bit bytes, interpreted as ascii codes). In contrast, binary files can follow any arbitrary format that you want. If you keep the format hidden (proprietary) then you can make it so only your program(s) can open the data files, and nobody else can (unless they know the format). This is not a good thing in science, however, if we want our work to be open and available to all. If you store data in a binary format then (in my opinion) you ought to also provide the binary format, so that the file is accessible to all.

An Example

There is a file you can download called data05.bin that is in a custom binary format (that I made up for the purposes of this example). There is a header, followed by the data. The first 16 bytes of the header contain a character string (16 characters long). The last 8 bytes of the header contain two integers that tell us how many rows and columns (in that order) of data there are. Each integer is 4 bytes long, so the total header is 24 bytes long. The data follow the header and are 4-byte integers each. They are ordered row-wise, that is, row 1 (v1,v2) followed by row 2 (v1,v2), etc. For those interested, I generated this data file in C using this program: writebinarydata.c.

Here is how we would read this file:

C

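#include <stdio.h>
#include <stdint.h>

#define MAXROWS 5

int main(int argc, char *argv[]) {

  // open in binary mode ("rb") so the bytes are read unmodified
  FILE *fid = fopen("data05.bin", "rb");
  if (fid == NULL) {
    perror("data05.bin");
    return 1;
  }
  // read the header: a 16-character string (not null-terminated)
  char hs[16];
  fread(hs, 1, 16, fid);
  // then two 4-byte integers: number of rows, number of columns
  int32_t rows;
  int32_t cols;
  fread(&rows, 4, 1, fid);
  fread(&cols, 4, 1, fid);
  // the arrays below hold at most MAXROWS rows of two columns
  if (rows < 0 || rows > MAXROWS || cols != 2) {
    fprintf(stderr, "unexpected size: rows=%d, cols=%d\n", (int)rows, (int)cols);
    fclose(fid);
    return 1;
  }
  // read 5 integer values for each of two variables
  // format is: v1 v2 v1 v2 v1 v2 v1 v2 v1 v2
  // where v1,v2 are each 4 bytes long
  int32_t v1[MAXROWS];
  int32_t v2[MAXROWS];
  int i;
  for (i=0; i<rows; i++) {
    fread(&(v1[i]), 4, 1, fid);
    fread(&(v2[i]), 4, 1, fid);
  }
  fclose(fid);

  // print to screen
  for (i=0; i<16; i++) {
    printf("%c", hs[i]);
  }
  printf("\nrows=%d, cols=%d\n", (int)rows, (int)cols);
  for (i=0; i<rows; i++) {
    printf("%3d %3d\n", (int)v1[i], (int)v2[i]);
  }

  return 0;
}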

MATLAB

>> fid = fopen('data05.bin', 'r');
>> hs = fread(fid,16,'int8=>char')';
>> rows = fread(fid, 1, 'int');
>> cols = fread(fid, 1, 'int');
>> d = fread(fid, rows*cols, 'int');
>> d = reshape(d,cols,rows)';
>> fclose(fid);
>> disp(['header = ',hs]);
header = October 15, 2013
>> disp(['rows = ',num2str(rows),' cols = ',num2str(cols)]);
rows = 5 cols = 2
>> d

d =

     1     2
     2     4
     3     6
     4     8
     5    10

Python / NumPy

>>> import numpy as np
>>> fd = open("data05.bin", "rb")
>>> hs = np.fromfile(fd, dtype="<i1", count=16)
>>> hs = hs.tobytes().decode("ascii")
>>> rows = np.fromfile(fd, dtype="<i4", count=1)[0]
>>> cols = np.fromfile(fd, dtype="<i4", count=1)[0]
>>> d = np.fromfile(fd, dtype="<i4", count=(rows*cols))
>>> d = d.reshape((rows,cols))
>>> fd.close()
>>> print("header = %s" % (hs))
header = October 15, 2013
>>> print("rows = %d, cols = %d" % (rows, cols))
rows = 5, cols = 2
>>> print(d)
[[ 1  2]
 [ 2  4]
 [ 3  6]
 [ 4  8]
 [ 5 10]]

R

> fd <- file("data05.bin", "rb")
> hs <- rawToChar(as.raw(readBin(fd, integer(), n=16, size=1)))
> rows <- readBin(fd, integer(), n=1, size=4)
> cols <- readBin(fd, integer(), n=1, size=4)
> d <- readBin(fd, integer(), n=(rows*cols), size=4)
> d <- matrix(d, nrow=rows, ncol = cols, byrow=TRUE)
> close(fd)
> cat("header = ", hs, "\n")
header =  October 15, 2013 
> cat("rows =", rows, "cols =", cols, "\n")
rows = 5 cols = 2
> d
     [,1] [,2]
[1,]    1    2
[2,]    2    4
[3,]    3    6
[4,]    4    8
[5,]    5   10
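For completeness, a file with the same layout can also be produced and read back with Python's standard struct module. This is only a sketch: the file name demo05.bin is hypothetical (it is written to a temporary directory so the real data05.bin is not touched), and little-endian byte order is assumed, matching the "<i4" used in the NumPy example above.

```python
import os
import struct
import tempfile

header_text = b"October 15, 2013"   # exactly 16 bytes
rows = [(1, 2), (2, 4), (3, 6), (4, 8), (5, 10)]

# Hypothetical file name, in a temporary directory
path = os.path.join(tempfile.mkdtemp(), "demo05.bin")

# Write: 16-byte string, then rows and cols, then the data row by row
with open(path, "wb") as f:
    f.write(struct.pack("<16s", header_text))
    f.write(struct.pack("<2i", len(rows), 2))
    for v1, v2 in rows:
        f.write(struct.pack("<2i", v1, v2))

# Read: the 24-byte header first, then rows*cols 4-byte integers
with open(path, "rb") as f:
    hs, nrows, ncols = struct.unpack("<16s2i", f.read(24))
    flat = struct.unpack(f"<{nrows * ncols}i", f.read(4 * nrows * ncols))

# Regroup the flat sequence into one list per row
data = [list(flat[r * ncols:(r + 1) * ncols]) for r in range(nrows)]

print(hs.decode("ascii"), nrows, ncols)
print(data)
```

The format string "<16s2i" describes the header directly: a 16-byte string followed by two 4-byte integers, 24 bytes in total. With five rows of two 4-byte values, the whole file comes to 64 bytes.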

4 Reading some common file formats

Each language has its own particular binary file format for storing data. MATLAB saves data in binary files with a .mat suffix. The save() command does this. The load() command loads .mat files back into memory.

In R, write.table() will save a data frame to a file, in an ascii format. The save() function will save data in a binary format specific to R. It can be loaded again using the load() command.

In Python, NumPy has a function called save() that will save an array to a binary file in NumPy .npy format. The savez() function will save several arrays into a binary .npz archive. The load() function will load the data back into NumPy.
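A short sketch of that round trip is shown here. The array contents and the file names a.npy and both.npz are made up for illustration, and the files are written to a temporary directory.

```python
import os
import tempfile

import numpy as np

tmpdir = tempfile.mkdtemp()
a = np.array([[2, 3], [4, 5], [6, 7], [8, 9]])
b = np.linspace(0.0, 1.0, 5)

# One array -> one .npy file
npy_path = os.path.join(tmpdir, "a.npy")
np.save(npy_path, a)
a2 = np.load(npy_path)

# Several arrays -> one .npz archive; the keyword names become the keys
npz_path = os.path.join(tmpdir, "both.npz")
np.savez(npz_path, a=a, b=b)
with np.load(npz_path) as archive:
    a3 = archive["a"]
    b3 = archive["b"]

print(np.array_equal(a, a2), np.array_equal(a, a3), np.array_equal(b, b3))
```

Loading an .npz archive gives back a dictionary-like object, so each array is retrieved by the name it was saved under.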

In plain Python (no NumPy involved) there is a way of saving Python variables to binary files, in a format called a pickle. See pickle - Python object serialization for the details.
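A minimal pickle round trip might look like this. The dictionary contents and the file name record.pkl are invented for the example.

```python
import os
import pickle
import tempfile

# An arbitrary Python object to save (made-up values)
record = {"subject": "s01", "rts": [0.31, 0.29, 0.35], "n_trials": 3}

path = os.path.join(tempfile.mkdtemp(), "record.pkl")

# Pickle files are binary, so open with "wb" and "rb"
with open(path, "wb") as f:
    pickle.dump(record, f)

with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == record)   # True
```

One caution from the pickle documentation: only load pickle files that come from a source you trust, because unpickling can execute arbitrary code.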

In C, there is no standard binary format for files; it's up to you to design your own.

Other common binary file types

Each language has the ability to load a number of other commonly used binary file types, including image files (jpeg, tiff, etc) and some other common types such as Microsoft Excel files.

For R, see R Data Import/Export for details of the different formats R can load. There is even a package to read and write MATLAB .mat files, see Package R.matlab.

For Python / NumPy, one option to know about is the pandas library. There are a range of file input/output options listed here: IO Tools. NumPy also has a number of different file i/o options, see Input and output for details.

In MATLAB there is a function called xlsread() that will read MS Excel files (newer MATLAB releases recommend readtable() or readmatrix() instead), see here for documentation.

For common neuroimaging file formats, e.g. NIfTI and ANALYZE, you can find third-party libraries for each language that will read and write these formats. A web search will turn them up.

5 Saving data to files

Each of the file reading methods discussed above has a matching file writing function. See the corresponding documentation for details.
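For example, the NumPy counterpart of loadtxt() is savetxt(). The sketch below writes an array in the same layout as data05.txt and reads it back. The file name out05.txt is hypothetical and the file is written to a temporary directory.

```python
import os
import tempfile

import numpy as np

data = np.array([[2, 3], [4, 5], [6, 7], [8, 9]])
path = os.path.join(tempfile.mkdtemp(), "out05.txt")

# fmt="%d" writes plain integers; the delimiter matches data05.txt
np.savetxt(path, data, fmt="%d", delimiter=" ")

# Look at the file as ordinary text
with open(path) as f:
    contents = f.read()
print(contents)

# Read it back with the matching reader
reloaded = np.loadtxt(path, dtype=int, delimiter=" ")
print(np.array_equal(data, reloaded))
```

Without the fmt argument, savetxt() writes every value in scientific notation with many digits, which is still valid input for loadtxt() but harder to read by eye.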

6 Text (ascii) or Binary?

The question of which file format to use as you go forward and write programs for analysing your data is an interesting one to consider. For long-term archival purposes, I would suggest storing your data in an ascii format, so that it remains readable by human eyes. There will always be programs to read in ascii files. The risk of storing data in a binary format is that (a) whatever program you used to save the data will no longer be easily accessible in the future, and/or (b) you may not even remember what the binary format is. The disadvantage of storing data in an ascii or plain text format is that the files will be larger than if they were stored in a binary format. Large amounts of storage are becoming available and affordable so quickly, however, that I don't think one has to worry too much about this problem.

7 Exercises

Exercises 16, 17 and 18 will get you reading in files of different formats.