Scientific Computing (Psychology 9040a)

Fall, 2021


Input & Output

Most of the time you will be writing programs that analyse data—whether those data are collected from experiments, or generated by models and simulations. We will need to be able to read in data from a file. It will also be useful to be able to write data out to a file.

The MathWorks online documentation has a page devoted to importing and exporting data, here:

Data Import and Export

Here we will go over how to read and write some common types of files, including ASCII files (plain text), MATLAB .mat files, and other binary formats. The MathWorks also has a (quite lengthy) page listing all of the file formats that MATLAB knows how to import:

Supported File Formats for Import and Export

In general, there are two types of file formats: ASCII files (otherwise known as plain text files) and binary files. In fact, this is a lie and there really is only one file type, namely binary files, since all data are ultimately stored as 0s and 1s (binary). But we have conventions, like the ASCII code, which allow us to make assumptions that make life easier. If a (binary) file is coded as a series of bytes, each of which corresponds to an ASCII code, then the file is a “plain text” or ASCII file, and you can read it using any plain text editor such as vim, emacs, Sublime Text, or Notepad (even MS Word will open plain text files). Binary files include image formats such as .png and .jpeg, sound files such as .mp3, and video files such as .mp4 and .mov. Again, though, all files are really binary; it is just that many programs know how to interpret the 0s and 1s of a plain text file as ASCII codes.

Plain text files

If your data are stored in a plain text file (ASCII) then you can use the MATLAB function load to load in the file. For example, let’s say we have a plain text file called mydata.txt that contains the following:

2 3
4 5
6 7
8 9

Then we can use the load command to load the data:

>> d = load('mydata.txt');
>> whos
  Name      Size            Bytes  Class     Attributes

  d         4x2                64  double              

>> d

d =

     2     3
     4     5
     6     7
     8     9

You can also load the data without giving the load function an output variable. In this case the data will be stored in a new variable with the same name as the file, with any file suffix (such as .txt) stripped:

>> load mydata.txt
>> whos
  Name        Size            Bytes  Class     Attributes

  mydata      4x2                64  double              

>> mydata

mydata =

     2     3
     4     5
     6     7
     8     9

For loading ASCII files, the file must contain a rectangular table of numbers, with an equal number of elements in each row. Delimiters such as spaces, commas, semicolons or tabs can be used, but they have to be the same throughout the file. If these conditions are not met, MATLAB will complain. For example, if our data file mydata2.txt looks like this:

2 3
4 5
6 7 8
8 9

MATLAB will complain about the number of columns not being the same:

>> load mydata2.txt
Error using load
Number of columns on line 3 of ASCII file mydata2.txt must be the same as previous lines.

To save data to an ASCII file, you can use the save command. For example, let’s say we have data stored in a variable called data that looks like this:

data =

    0.6557    0.7577
    0.0357    0.7431
    0.8491    0.3922
    0.9340    0.6555
    0.6787    0.1712

Then we can use save with the -ascii flag to save this into an ASCII file:

>> save mynewdata.txt data -ascii

The first argument (mynewdata.txt) is the filename of the new file to be created. The second argument (data) is the name of the variable to be saved to the file, and the third argument (-ascii) is a flag that tells MATLAB to save the data in plain text (ASCII) format. If we now look at the newly created file mynewdata.txt (for example by opening it in the MATLAB text editor), it looks like this:

   6.5574070e-01   7.5774013e-01
   3.5711679e-02   7.4313247e-01
   8.4912931e-01   3.9222702e-01
   9.3399325e-01   6.5547789e-01
   6.7873515e-01   1.7118669e-01

Note how the values have been saved in scientific notation, with 8 significant digits.
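
If 8 significant digits are not enough, save also accepts a -double flag alongside -ascii, which writes 16 significant digits instead. Here is a minimal sketch of a round trip; the filename mynewdata_double.txt and the random data are placeholders chosen for illustration only:

```matlab
% placeholder data (any numeric matrix would do)
data = rand(5,2);

% -ascii together with -double writes 16 significant digits per value
save('mynewdata_double.txt', 'data', '-ascii', '-double');

% read the values back and measure the round-trip error
data_in = load('mynewdata_double.txt');
max_err = max(abs(data_in(:) - data(:)));
```

The functional form save(filename, variables, flags) used here is equivalent to the command form shown above, and is easier to use when the filename is held in a variable.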

If you want finer control over how things are stored in an ASCII file, you can write (and read) at a lower level using MATLAB’s built-in functions fprintf and fscanf. These mirror the functions of the same name that may be familiar to you if you have programmed in C before. Here is an example of writing to an ASCII file where we want a very specific format:

data = [
    0.6557    0.7577
    0.0357    0.7431
    0.8491    0.3922
    0.9340    0.6555
    0.6787    0.1712
];

fid = fopen('myfile.txt','w');
fprintf(fid, 'myfile.txt contains some data\n');
for i=1:size(data,1)
    fprintf(fid,'item 1.1: %.4f, item 1.2: %.4f\n', data(i,1), data(i,2));
end
fprintf(fid, 'end of data\n');
fclose(fid);

This creates a file called myfile.txt that looks like this:

myfile.txt contains some data
item 1.1: 0.6557, item 1.2: 0.7577
item 1.1: 0.0357, item 1.2: 0.7431
item 1.1: 0.8491, item 1.2: 0.3922
item 1.1: 0.9340, item 1.2: 0.6555
item 1.1: 0.6787, item 1.2: 0.1712
end of data
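
Reading such a file back works the same way in reverse. The sketch below first rewrites myfile.txt exactly as above (so that it runs on its own), then uses fgetl to skip the header line and fscanf to pull out the numbers. The variable names header, vals and data_in are introduced here for illustration only:

```matlab
% write the file as in the fprintf example
data = [0.6557 0.7577; 0.0357 0.7431; 0.8491 0.3922; 0.9340 0.6555; 0.6787 0.1712];
fid = fopen('myfile.txt','w');
fprintf(fid, 'myfile.txt contains some data\n');
for i=1:size(data,1)
    fprintf(fid,'item 1.1: %.4f, item 1.2: %.4f\n', data(i,1), data(i,2));
end
fprintf(fid, 'end of data\n');
fclose(fid);

% read it back
fid = fopen('myfile.txt','r');
header = fgetl(fid);    % first line of text, without the newline
% fscanf reapplies the format until the input stops matching (here, at
% 'end of data'), filling a 2-by-N matrix column by column
vals = fscanf(fid, 'item 1.1: %f, item 1.2: %f\n', [2 Inf]);
fclose(fid);
data_in = vals';        % transpose back to N-by-2
```

Because fscanf fills its output column by column, asking for a [2 Inf] result and then transposing is a common way to recover a table that was written row by row.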

Binary files

MATLAB has its own binary file format, denoted by a .mat file suffix. With no other options, the save and load functions in MATLAB use this default binary format. The advantages of MATLAB’s binary format over an ASCII format are that (1) your data files will be smaller, and (2) you can store more than one variable (and variables of different kinds) in a single file. For example, here we store a scalar variable called mynumber, a vector called myvector, a matrix called mymatrix and a structure called mystructure in a single binary .mat file called myfile.mat:

>> whos
  Name             Size            Bytes  Class     Attributes

  mymatrix         4x2                64  double              
  mynumber         1x1                 8  double              
  mystructure      1x1               840  struct              
  myvector         1x7                56  double              

>> save myfile mynumber myvector mymatrix mystructure

Now there is a file in my working directory called myfile.mat. I can now load the file (and all the variables contained within it) into MATLAB’s memory using the load function. First I clear the memory to demonstrate that I’m not cheating:

>> clear
>> whos

Now I load the file:

>> load myfile
>> whos
  Name             Size            Bytes  Class     Attributes

  mymatrix         4x2                64  double              
  mynumber         1x1                 8  double              
  mystructure      1x1               840  struct              
  myvector         1x7                56  double              
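
You do not have to load everything at once. As a rough sketch, the code below lists the contents of a .mat file without loading it, loads a single variable, and loads the whole file into a structure so that nothing in the workspace gets overwritten. The filename myfile_demo.mat and the variable values are made up for illustration:

```matlab
% placeholder variables standing in for the workspace shown above
mynumber = 3.14;
myvector = 1:7;
mymatrix = rand(4,2);
save('myfile_demo.mat', 'mynumber', 'myvector', 'mymatrix');

% list what the file contains without loading anything
whos('-file', 'myfile_demo.mat')

% load just one variable, returned as a field of a structure
V = load('myfile_demo.mat', 'myvector');

% load everything into a structure, one field per saved variable
contents = load('myfile_demo.mat');
names = fieldnames(contents);
```

Loading into a structure (contents.myvector and so on) is often the safer habit inside functions and scripts, since it avoids silently replacing variables that already exist.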

ASCII or binary?

The question of which file format to use as you go forward and write programs for analysing your data is an interesting one to consider.

For long-term archival purposes, many suggest storing your data in ASCII format, so that it remains readable by human eyes. There will also always be programs to read ASCII files. It is possible to write MATLAB data structures like arrays and matrices to ASCII files using a built-in function called dlmwrite().
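
In recent versions of MATLAB (R2019a and later), the MathWorks documentation recommends writematrix and readmatrix in place of dlmwrite and dlmread. Here is a minimal sketch, assuming one of those versions; the filename mymatrix.csv and the values are placeholders:

```matlab
% placeholder data
A = [1.5 2.25; 3.125 4.0625];

% the .csv extension selects comma-delimited plain text
writematrix(A, 'mymatrix.csv');

% read the matrix back in
B = readmatrix('mymatrix.csv');
```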

The downside of storing large amounts of data in ASCII format is of course that it is very inefficient from a file size point of view. Every ASCII character in a file is represented by one 8-bit byte. So for example if you store the following in an ASCII file:

1.2345678901
2.3456789012
3.4567890123
4.5678901234
5.6789012345

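You will use 12 bytes for each of the five floating-point numbers listed (one byte per character, including the decimal point), for a total of 60 bytes, plus one more byte per line for the newline character. If you were to store those 5 numbers using single-precision floating-point binary format, you would only require 4 bytes per number (not 12), for a total of 20 bytes, one third the file size. (Single precision only preserves about 7 significant digits; double precision, at 8 bytes per number, would still need only 40 bytes.)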
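
You can check this arithmetic directly. The following sketch writes the five numbers to a text file and to a single-precision binary file, then asks the operating system for the two file sizes using dir. The filenames five.asc and five.bin are made up for this example:

```matlab
x = [1.2345678901; 2.3456789012; 3.4567890123; 4.5678901234; 5.6789012345];

% plain text: 12 characters per number plus one newline byte per line
fid = fopen('five.asc','w');
fprintf(fid, '%.10f\n', x);
fclose(fid);

% binary: 4 bytes per number in single precision
fid = fopen('five.bin','w');
fwrite(fid, x, 'float32');
fclose(fid);

a = dir('five.asc');
b = dir('five.bin');
fprintf('ASCII: %d bytes, binary: %d bytes\n', a.bytes, b.bytes);
```

The ASCII file comes out at 65 bytes rather than 60 because each of the five lines also ends with a newline character; the binary file is 20 bytes.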

In addition, reading and writing ASCII files is very slow compared to reading and writing raw bytes. As an example, here is some code that generates a (10,000 x 25) matrix of values and stores it both in an ASCII file with 10 decimal places of precision and in a binary file using single-precision floating-point format (4 bytes per value).

%% make up a matrix of dummy signals
t = 0:.001:10-.001; t=t';
x1 = sin(t*2*pi*1);
x2 = cos(t*2*pi*2);
x3 = x1 + randn(size(x1));
x4 = x2 + randn(size(x2));
x5 = randn(size(x1));
S = [x1 x2 x3 x4 x5];
S = [S S S S S];
[n,m] = size(S);

%% write to an ASCII file
dlmwrite('trial1.asc', S, 'precision', '%.10f');

%% write to a binary file using single-precision format (4 bytes)
fid = fopen('trial1.bin','w');
count = fwrite(fid, S, 'float32', 'ieee-le');
fclose(fid);

The binary file is of course 1,000,000 bytes (about 1 MB) but the ASCII file is 3,374,820 bytes (about 3.4 MB).

Here is some code to demonstrate how much faster reading in binary data is compared to ASCII data. We read in the file 1,000 times to simulate what one might do for example when reading in 1,000 trials of an experiment:

%% speed tests
nreps = 1000;   % number of repeated reads (kept separate from n, the row count of S)

tic
for i=1:nreps
    trial1 = dlmread('trial1.asc',',');
end
t=toc;
fprintf("ASCII took %.3f seconds for %d reads\n", t, nreps);

tic
for i=1:nreps
    fid = fopen('trial1.bin','r');
    % read the full n-by-m matrix, using the same byte order it was written with
    trial1 = fread(fid, [n,m], 'float32', 0, 'ieee-le');
    fclose(fid);
end
t=toc;
fprintf("binary took %.3f seconds for %d reads\n", t, nreps);

ASCII took 103.458 seconds for 1000 reads
binary took 0.142 seconds for 1000 reads

That is not a typo. Reading in the ASCII file 1,000 times took 103.458 seconds, while the binary reads took 142 milliseconds: more than 700 times faster. Exact timings will vary from machine to machine, but dealing with binary files is clearly much faster.

The potential risk of storing data in a binary format is that you must remember what format the data are stored in (how many bytes per value, integer vs floating-point, whether there is a header with a certain number of bytes, and so on).
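
One way to reduce that risk is to have the file describe itself. The sketch below is one possible convention, not a standard format: it writes a small header (two 32-bit unsigned integers holding the number of rows and columns) ahead of the single-precision data, and fixes the byte order to little-endian when the file is opened. The filename trial_with_header.bin and the header layout are assumptions made for this example:

```matlab
S = rand(4,3);    % placeholder data

% write: 8-byte header (rows, cols as uint32), then the values as float32
fid = fopen('trial_with_header.bin', 'w', 'ieee-le');
fwrite(fid, size(S), 'uint32');
fwrite(fid, S, 'float32');
fclose(fid);

% read: the header tells us how to shape the data that follow
fid = fopen('trial_with_header.bin', 'r', 'ieee-le');
dims = fread(fid, [1 2], 'uint32');
S_in = fread(fid, dims, 'float32');
fclose(fid);
```

You still have to remember the layout of the header itself, so it is worth writing that down in a plain text README stored alongside the data.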

All files are just a series of 1s and 0s

Just a reminder that all files, whether ‘binary’ or ASCII or .xls or .jpg or .mp3 or .mp4, are actually just a series of 1s and 0s. It is how those 1s and 0s are interpreted by the program reading them in that distinguishes them as text files, or Excel spreadsheets, or photos, or songs, or videos.

As an illustration, here is some code that writes thirteen 8-bit bytes to a file called ‘message.txt’. Each byte is given as a binary literal (the 0b prefix, available since MATLAB R2019b); the literals show only seven digits because the leading 0 of each byte is omitted. Execute the code and then open the file ‘message.txt’ in your favourite text editor or word processing program (or in MATLAB). We wrote 13 x 8 = 104 1s and 0s to a file, but when a text editor reads in that file it interprets each group of 8 bits as a byte, assumes those bytes encode ASCII characters, and displays those characters on the screen.

%% write 8-bit bytes to a file using binary bits
fid = fopen('message.txt','w');
fwrite(fid,0b1001000,'uint8');
fwrite(fid,0b1100101,'uint8');
fwrite(fid,0b1101100,'uint8');
fwrite(fid,0b1101100,'uint8');
fwrite(fid,0b1101111,'uint8');
fwrite(fid,0b0101100,'uint8');
fwrite(fid,0b0100000,'uint8');
fwrite(fid,0b1110111,'uint8');
fwrite(fid,0b1101111,'uint8');
fwrite(fid,0b1110010,'uint8');
fwrite(fid,0b1101100,'uint8');
fwrite(fid,0b1100100,'uint8');
fwrite(fid,0b0100001,'uint8');
fclose(fid);
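
You can also go the other way inside MATLAB, reading the raw bytes back and choosing for yourself how to interpret them. The sketch below writes the same thirteen byte values (given here in decimal) to a separate file, message_demo.txt, a name chosen so that it does not overwrite the file above. It then reads them back and views them once as bit patterns and once as ASCII characters:

```matlab
% the same 13 byte values as above, written in decimal
bytes_out = uint8([72 101 108 108 111 44 32 119 111 114 108 100 33]);
fid = fopen('message_demo.txt','w');
fwrite(fid, bytes_out, 'uint8');
fclose(fid);

% read every byte in the file as an unsigned 8-bit integer
fid = fopen('message_demo.txt','r');
bytes = fread(fid, [1 Inf], 'uint8');
fclose(fid);

bits = dec2bin(bytes, 8);   % one row of eight 0/1 characters per byte
msg  = char(bytes);         % the same bytes interpreted as ASCII codes
```

The first row of bits is 01001000, which is the first byte written in the code above with its leading 0 restored, and msg holds the text that a text editor would display.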