Principal Components Analysis in Excel VBA
by Nick V. Flor • March 9, 2016 • @ProfessorF
I love Excel as an Exploratory Data Analysis tool. I’ve written macros to scrape data off of Twitter (like the millions of tweets for every #GOPDebate), and then to analyze that data in various ways—both quantitatively and qualitatively, both with & without hypotheses!
I have to say, it’s both astonishing and mind-boggling what you find when you let the data speak for itself. In later posts, I’ll discuss some of my findings.
But for now I’m on a mission to get citizens to level-up their “data analysis” skills—it’s the only way to prevent being fooled by politicians and charlatans.
Anyway, one particularly useful statistical procedure for analyzing large amounts of data is Principal Components Analysis (“PCA”). It is both a way of reducing the complexity (dimensionality) of your data, and finding structure in your data.
Of course Excel doesn’t implement PCA, and the math on the internet is cryptic. So it’s not clear how to implement the necessary functions. Here’s one way I do PCA, using Gram-Schmidt QR decomposition.
Note: fmMult is my own matrix multiplication function, you can probably get away with Excel’s mmult. I also reindex the matrices from 0..N-1. This makes it easier for me to convert the code to C# or JavaScript.
' ' QR Decomposition Function using Gram-Schmidt ' Adapted from pseudo-code in http://www.math.umn.edu/~olver/ln_/qr.pdf ' Function GramSchmidt(ByRef A, ByRef R) Dim mRows, nCols Dim i, j, k
getNumRowsCols A, mRows, nCols R = createMatrix(mRows, nCols) For j = 0 To (nCols - 1) R(j, j) = 0 For i = 0 To (mRows - 1) R(j, j) = R(j, j) + A(i, j) ^ 2 Next R(j, j) = Sqr(R(j, j)) For i = 0 To (mRows - 1) A(i, j) = A(i, j) / R(j, j) Next For k = (j + 1) To (nCols - 1) R(j, k) = 0 For i = 0 To (mRows - 1) R(j, k) = R(j, k) + A(i, j) * A(i, k) Next For i = 0 To (mRows - 1) A(i, k) = A(i, k) - A(i, j) * R(j, k) Next Next Next End Function Function pfGramSchmidt(AR As Range, Optional loops As Integer = -1) Dim A, Q, R, i, j, E
If loops < 0 Then loops = 0 ' this is really one iteration A = reindexRange(AR) getNumRowsCols A, mRows, nCols R = createIdentity(mRows, nCols) E = createIdentity(mRows, nCols) For i = 0 To loops A = fmMult(R, A) ' A=R*Q GramSchmidt A, R ' A is now Q, Diagonals are Eigenvalues E = fmMult(E, A) ' E: Eigenvectors Next ' ' Return a Range with Q, R, and Eigenvalues ' ReDim res(mRows - 1, (3 * nCols) - 1) For i = 0 To mRows - 1 For j = 0 To nCols - 1 res(i, j) = A(i, j) res(i, nCols + j) = R(i, j) res(i, 2 * nCols + j) = E(i, j) Next Next pfGramSchmidt = res End Function
If you have any questions about how to use this, or to implement it yourself, I’m always happy to help.
Good luck.