
Exercise 14 solutions

A note about this exercise: it turns out that solving the regression problem \(Y = X \beta\) using the OLS equation \(\hat{\beta} = \left( X^{\top}X \right)^{-1} X^{\top} Y\) is numerically inefficient. In particular, explicitly inverting a matrix is almost never necessary. A standard technique in linear algebra called QR decomposition is both more efficient and more numerically stable than computing the inverse directly. John D. Cook's blog post Don't invert that matrix discusses this in some detail.

In MATLAB you can solve the regression equation \(Ax=b\) directly by using the backslash (mldivide) operator:

x = A\b

In NumPy the equivalent is the linalg.lstsq() function from the linalg submodule of NumPy. Note that it returns a tuple (solution, residuals, rank, singular values), so the solution itself is the first element:

x = linalg.lstsq(A, b)[0]

In R it's the qr.solve() function:

x = qr.solve(A,b)

In any case, here are solutions using the full OLS equation from the exercise, for your reference. If nothing else, it's good practice at manipulating matrices and vectors.
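As a quick sanity check, here is a sketch in NumPy (using made-up illustrative data, not the exercise data) showing that the explicit OLS formula and the QR-based least-squares solver give the same coefficients:

```python
import numpy as np

# illustrative data only (not the exercise data)
rng = np.random.default_rng(0)
n = 20
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
beta_true = np.array([2.0, -1.0, 0.5])
y = X @ beta_true + 0.01 * rng.normal(size=n)

# explicit OLS normal-equations solution: b = inv(X'X) X'y
b_inv = np.linalg.inv(X.T @ X) @ X.T @ y

# QR-based least-squares solution (no explicit inverse)
b_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]

print(np.allclose(b_inv, b_lstsq))  # prints True
```

For a small, well-conditioned problem like this the two agree to numerical precision; the advantage of the QR route shows up with larger or nearly collinear design matrices.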

Python

e14.py

import numpy as np

# enter the data as column vectors
inc = np.array([50,35,80,25,90,75,65,95,70,110], dtype=float)
inc = inc * 1000
age = np.array([35,40,45,25,70,55,50,60,45,50], dtype=float)
edu = np.array([4,2,6,0,4,6,4,6,8,8], dtype=float)

# dependent variable Y
Y = inc
# matrix of independent variables
# including a column of 1s
X = np.column_stack([np.ones(len(age)), age, edu])

# find beta weights for:
# Y = X * b
b = np.linalg.inv(X.T @ X) @ X.T @ Y

# estimated values of Y
Yhat = X @ b

print(b)

MATLAB / Octave

e14.m

% enter the data as column vectors
inc = [50,35,80,25,90,75,65,95,70,110]';
inc = inc * 1000;
age = [35,40,45,25,70,55,50,60,45,50]';
edu = [4,2,6,0,4,6,4,6,8,8]';

% dependent variable Y
Y = inc;
% matrix of independent variables
% including a column of 1s
X = [ones(length(age),1) age edu];

% find beta weights for:
% Y = X * b
b = inv(X' * X) * X' * Y;

% estimated values of Y
Yhat = X * b;

disp(['b = ',mat2str(b,8)]);

R

e14.R

# enter the data as column vectors
inc <- c(50,35,80,25,90,75,65,95,70,110)
inc <- inc * 1000
age <- c(35,40,45,25,70,55,50,60,45,50)
edu <- c(4,2,6,0,4,6,4,6,8,8)

# dependent variable Y
Y <- matrix(inc, ncol=1)
# matrix of independent variables
# including a column of 1s
X <- matrix(c(rep(1,length(age)), age, edu), ncol=3)

library(MASS) # to get ginv() function
# find beta weights for:
# Y = X * b
b <- ginv(t(X) %*% X) %*% t(X) %*% Y

# estimated values of Y
Yhat <- X %*% b

cat("b = ", b)

C

There are no built-in matrix types in C. One can create multi-dimensional arrays, but the language provides no linear-algebra operations on them. The best bet for doing matrix work in C is to use a third-party library such as the GNU Scientific Library (GSL).

To use this library you first have to download and install it; there are installation instructions in the documentation. On my Mac I use Homebrew, and I can install it by opening a Terminal and typing:

brew install gsl

Once installed, you have to reference the library in two places: first in your C code itself, by including the appropriate header, and second at compile time, by linking against the library. See the sample code below for an example.

The GSL has many, many useful functions and types for scientific programming of all kinds; there is a table of contents here. Each section has code examples to help get you on your way.

There is also a library called Apophenia, by Ben Klemens, that provides a rich statistical library in C; it has a companion book, Modeling With Data, and a blog.

e14.c

// compile with:
// gcc -o e14 e14.c `pkg-config --cflags --libs gsl`

#include <stdio.h>
#include <gsl/gsl_linalg.h>
#include <gsl/gsl_blas.h>

// a little function to take an array of doubles and return a gsl vector
//
gsl_vector *array2vec(double a[], int n) {
  gsl_vector *v = gsl_vector_alloc(n);
  int i;
  for (i=0; i<n; i++) {
    gsl_vector_set(v, i, a[i]);
  }
  return v;
}

int main(int argc, char *argv[]) {

  double income[] = {50000, 35000, 80000, 25000, 90000,
                     75000, 65000, 95000, 70000, 110000};
  double age[] = {35, 40, 45, 25, 70, 55, 50, 60, 45, 50};
  double education[] = {4, 2, 6, 0, 4, 6, 4, 6, 8, 8};
  double ones[] = {1, 1, 1, 1, 1, 1, 1, 1, 1, 1};

  // stick data into gsl vectors
  gsl_vector *inc_v = array2vec(income, 10);
  gsl_vector *age_v = array2vec(age, 10);
  gsl_vector *edu_v = array2vec(education, 10);
  gsl_vector *one_v = array2vec(ones, 10);

  // create a gsl matrix X to hold [ones, age, education]
  gsl_matrix *X = gsl_matrix_alloc(10,3);
  // copy in the columns
  gsl_matrix_set_col(X, 0, one_v);
  gsl_matrix_set_col(X, 1, age_v);
  gsl_matrix_set_col(X, 2, edu_v);

  // allocate a gsl vector b to hold the beta weights
  gsl_vector *b = gsl_vector_alloc(3);

  // our equation is inc_v = X * b and we want to solve for b
  // we could use the OLS equation b = inv(X'X)X'Y but we won't
  // instead we will use QR decomposition, which turns out to be way more efficient
  // http://www.gnu.org/software/gsl/manual/html_node/QR-Decomposition.html
  gsl_vector *tau = gsl_vector_alloc(3);
  gsl_vector *resid = gsl_vector_alloc(10);
  gsl_linalg_QR_decomp(X, tau);
  gsl_linalg_QR_lssolve(X, tau, inc_v, b, resid);

  // print the results (the 3 coefficients b) to the screen
  printf("\nb = [\n");
  gsl_vector_fprintf(stdout, b, "%.3f");
  printf("]\n\n");

  // free the various heap memory that we allocated
  gsl_vector_free(resid);
  gsl_vector_free(tau);
  gsl_vector_free(b);
  gsl_matrix_free(X);
  gsl_vector_free(one_v);
  gsl_vector_free(edu_v);
  gsl_vector_free(age_v);
  gsl_vector_free(inc_v);

  return 0;
}

Paul Gribble | fall 2014
This work is licensed under a Creative Commons Attribution 4.0 International License