Exercise 14 solutions
A note about this exercise. It turns out that solving the regression problem \(Y = X \beta\) using the OLS equation \(\hat{\beta} = \left( X^{\top}X \right)^{-1} X^{\top} Y\) is actually not very efficient numerically. In particular, taking the inverse of a matrix is typically not needed. There is a more advanced method in linear algebra called QR decomposition that is more efficient and more numerically stable than taking the inverse directly. There's a blog post by John D. Cook that talks about this in some detail: Don't invert that matrix.
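To make the idea concrete, here is a minimal NumPy sketch (my own illustration, not part of the exercise) of how a QR decomposition turns the least-squares problem into a triangular solve, with no explicit matrix inverse. The design matrix X and response y here are just made-up random data:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))   # made-up 10x3 design matrix
y = rng.normal(size=10)        # made-up response vector

# explicit OLS formula: beta = inv(X'X) X'y
beta_inv = np.linalg.inv(X.T @ X) @ X.T @ y

# QR route: with X = QR, the normal equations reduce to R beta = Q'y
Q, R = np.linalg.qr(X)
beta_qr = np.linalg.solve(R, Q.T @ y)

print(np.allclose(beta_inv, beta_qr))  # True

Since \(X = QR\) with \(Q\) having orthonormal columns, the normal equations \(X^{\top}X\beta = X^{\top}Y\) reduce to \(R\beta = Q^{\top}Y\), which is cheap to solve because \(R\) is upper triangular.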
In MATLAB you can solve the regression equation \(Ax=b\) more directly by using the slash operator:
x = A\b
In NumPy the equivalent is the linalg.lstsq() function from the linalg submodule; note that it returns a tuple whose first element is the least-squares solution (see the quick check after these snippets):
x = linalg.lstsq(A, b)[0]
In R it's the qr.solve() function:
x <- qr.solve(A, b)
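As a quick check (a minimal sketch using the data from this exercise, not part of the original solutions), the NumPy least-squares solver gives the same beta weights as the explicit OLS formula:

import numpy as np

# exercise data: income (dependent), age and education (independent)
inc = np.array([50, 35, 80, 25, 90, 75, 65, 95, 70, 110], dtype=float) * 1000
age = np.array([35, 40, 45, 25, 70, 55, 50, 60, 45, 50], dtype=float)
edu = np.array([4, 2, 6, 0, 4, 6, 4, 6, 8, 8], dtype=float)

# design matrix with a column of 1s for the intercept
X = np.column_stack((np.ones(len(age)), age, edu))

# explicit OLS formula: b = inv(X'X) X'Y
b_ols = np.linalg.inv(X.T @ X) @ X.T @ inc

# least-squares solver; first element of the returned tuple is the solution
b_lstsq = np.linalg.lstsq(X, inc, rcond=None)[0]

print(np.allclose(b_ols, b_lstsq))  # True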
In any case, here are solutions using the full OLS equation from the exercise, for your reference. It's a good exercise at least to get you manipulating matrices and vectors.
Python
e14.py
from numpy import *
from numpy.linalg import inv

# enter the data as column vectors
inc = array([50,35,80,25,90,75,65,95,70,110],"float")
inc = inc * 1000
age = array([35,40,45,25,70,55,50,60,45,50],"float")
edu = array([4,2,6,0,4,6,4,6,8,8],"float")

# dependent variable Y
Y = inc

# matrix of independent variables
# including a column of 1s
X = array([ones(len(age)), age, edu])
X = transpose(X)

# find beta weights for:
# Y = X * b
b = dot(dot(inv(dot(X.T, X)), X.T), Y)

# estimated values of Y
Yhat = dot(X, b)

print(b)
MATLAB / Octave
e14.m
% enter the data as column vectors
inc = [50,35,80,25,90,75,65,95,70,110]';
inc = inc * 1000;
age = [35,40,45,25,70,55,50,60,45,50]';
edu = [4,2,6,0,4,6,4,6,8,8]';

% dependent variable Y
Y = inc;

% matrix of independent variables
% including a column of 1s
X = [ones(length(age),1) age edu];

% find beta weights for:
% Y = X * b
b = inv(X' * X) * X' * Y;

% estimated values of Y
Yhat = X * b;

disp(['b = ',mat2str(b,8)]);
R
e14.R
# enter the data as column vectors
inc <- c(50,35,80,25,90,75,65,95,70,110)
inc <- inc * 1000
age <- c(35,40,45,25,70,55,50,60,45,50)
edu <- c(4,2,6,0,4,6,4,6,8,8)

# dependent variable Y
Y <- matrix(inc, ncol=1)

# matrix of independent variables
# including a column of 1s
X <- matrix(c(rep(1,length(age)), age, edu), ncol=3)

library(MASS) # to get ginv() function

# find beta weights for:
# Y = X * b
b <- ginv(t(X) %*% X) %*% t(X) %*% Y

# estimated values of Y
Yhat <- X %*% b

cat("b = ", b)
C
There are no built-in types for matrices in C. One can have multi-dimensional arrays, but they are sort of hacked together as arrays of pointers to arrays. The best bet for doing matrix stuff in C is to use a third-party library such as the GNU Scientific Library (GSL).
To use this library you have to first download it to your machine, and then install it. There are installation instructions in the documentation. On my Mac I use Homebrew and I just install it by opening a Terminal and typing:
brew install gsl
Once installed, you have to reference the library in two places: in your C code itself, by including the appropriate headers, and again when compiling, by linking against the library. See the sample code below for an example.
The GSL has many, many useful functions and types for scientific programming of all kinds; there is a table of contents here. Each section has code examples to help get you on your way.
There is also a library called Apophenia, a rich statistical library in C by Ben Klemens, with a companion book, Modeling With Data (and a blog).
e14.c
// compile with:
// gcc -o e14 e14.c `pkg-config --cflags --libs gsl`

#include <stdio.h>
#include <gsl/gsl_linalg.h>
#include <gsl/gsl_blas.h>

// a little function to take an array of doubles and return a gsl vector
gsl_vector *array2vec(double a[], int n) {
  gsl_vector *v = gsl_vector_alloc(n);
  int i;
  for (i=0; i<n; i++) {
    gsl_vector_set(v, i, a[i]);
  }
  return v;
}

int main(int argc, char *argv[]) {

  double income[] = {50000, 35000, 80000, 25000, 90000, 75000, 65000, 95000, 70000, 110000};
  double age[] = {35, 40, 45, 25, 70, 55, 50, 60, 45, 50};
  double education[] = {4, 2, 6, 0, 4, 6, 4, 6, 8, 8};
  double ones[] = {1, 1, 1, 1, 1, 1, 1, 1, 1, 1};

  // stick data into gsl vectors
  gsl_vector *inc_v = array2vec(income, 10);
  gsl_vector *age_v = array2vec(age, 10);
  gsl_vector *edu_v = array2vec(education, 10);
  gsl_vector *one_v = array2vec(ones, 10);

  // create a gsl matrix X to hold [ones, age, education]
  gsl_matrix *X = gsl_matrix_alloc(10,3);

  // copy in the columns
  gsl_matrix_set_col(X, 0, one_v);
  gsl_matrix_set_col(X, 1, age_v);
  gsl_matrix_set_col(X, 2, edu_v);

  // allocate a gsl vector b to hold the beta weights
  gsl_vector *b = gsl_vector_alloc(3);

  // our equation is inc_v = X * b and we want to solve for b
  // we could use the OLS equation b = inv(X'X)X'Y but we won't
  // instead we will use QR decomposition, which turns out to be way more efficient
  // http://www.gnu.org/software/gsl/manual/html_node/QR-Decomposition.html
  gsl_vector *tau = gsl_vector_alloc(3);
  gsl_vector *resid = gsl_vector_alloc(10);
  gsl_linalg_QR_decomp(X, tau);
  gsl_linalg_QR_lssolve(X, tau, inc_v, b, resid);

  // print the results (the 3 coefficients b) to the screen
  printf("\nb = [\n");
  gsl_vector_fprintf(stdout, b, "%.3f");
  printf("]\n\n");

  // free the various heap memory that we allocated
  gsl_vector_free(resid);
  gsl_vector_free(tau);
  gsl_vector_free(b);
  gsl_matrix_free(X);
  gsl_vector_free(one_v);
  gsl_vector_free(edu_v);
  gsl_vector_free(age_v);
  gsl_vector_free(inc_v);

  return 0;
}