A practical approach to Principal Component Analysis (PCA)

Lorenzo D'Isidoro
6 min read · Nov 16, 2020

A practical approach to Principal Component Analysis using Python.

Introduction

In this article we talk about Principal Component Analysis (PCA) and how it can be implemented using the NumPy Python library.

If you’ve never heard of PCA, just know for now that it is an unsupervised linear transformation technique for dimensionality reduction. This kind of analysis is used in exploratory data analysis and machine learning: the aim is to reduce the dataset by compressing it into a new feature subspace, keeping only the set of eigenvectors, called principal components, that contain most of the information. More details can be found on Wikipedia or in the Machine Learning with Python book.

NumPy is a library for the Python programming language that adds support for large, multi-dimensional arrays and matrices.

Overview

The aim of this analysis is to reduce the dataset by transforming it into a feature subspace of lower dimensionality than the original. In other words, we compress the dataset by extracting only its most relevant features, so the data is projected into a new, smaller feature space.
Suppose we have an n×d-dimensional dataset X, with n elements and d features.

Each feature vector x is initially d-dimensional; a d×k-dimensional projection matrix W will be defined to map x into a new k-dimensional subspace, producing a vector called z.
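In symbols, treating x as a 1×d row vector:

z = x\,W, \qquad x \in \mathbb{R}^{1 \times d}, \; W \in \mathbb{R}^{d \times k}, \; z \in \mathbb{R}^{1 \times k}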

The following example assigns values to the dataset X in order to plot the first two features; in this case, the plot shows a negative correlation between the first and second features (x1 and x2).
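A minimal sketch of such a plot, using hypothetical negatively correlated values (matplotlib is an extra dependency here):

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical data: 100 samples whose first two features are negatively
# correlated, mimicking the situation described in the text.
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = -x1 + rng.normal(scale=0.3, size=100)

plt.scatter(x1, x2)
plt.xlabel("x1")
plt.ylabel("x2")
plt.show()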

To find out whether there are negative or positive correlations between the other features, many more pairs of columns would have to be plotted on separate planes, or a multi-dimensional plot with one axis per feature would have to be drawn; neither approach works well.
A better way is a PCA chart, which projects the correlations between all the features onto a two-dimensional plane: strongly correlated features form a cluster of elements in the chart, as you can see in the following image.

PCA chart

So, again, PCA gives you the possibility to “compress” a lot of data into a few areas of the graph, where each area contains a group of strongly correlated elements.

Covariance matrix

The first step is to build the d×d-dimensional covariance matrix C and decompose it into eigenvectors (the principal components), of which we keep k, where k is the dimensionality of the new feature subspace. The covariance between two features j and k is calculated as follows
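\sigma_{jk} = \frac{1}{n} \sum_{i=1}^{n} \left( x_j^{(i)} - \mu_j \right) \left( x_k^{(i)} - \mu_k \right)

where \mu_j and \mu_k are the sample means of features j and k.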

Equivalently, the whole covariance matrix C can be calculated using this formula
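C = \frac{1}{n} X_c^{\top} X_c, \qquad X_c = X - \mu

where X_c is the mean-centered dataset; the entry at position (j, k) of C is exactly \sigma_{jk}.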

At the end of the next section you can see how to calculate the covariance matrix using NumPy; this library is used because it provides methods that greatly simplify these operations, although in this case the calculation is quite simple.
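As a quick check, here is a minimal sketch that computes the covariance matrix both by hand and with np.cov, on a hypothetical 2-feature array (the values are invented, not taken from the Wine dataset):

import numpy as np

# Toy 2-feature array; mean-center it before computing the covariance.
X_toy = np.array([[2.0, 8.0],
                  [4.0, 6.0],
                  [6.0, 4.0],
                  [8.0, 2.0]])
X_c = X_toy - X_toy.mean(axis=0)

# np.cov normalizes by n - 1 by default, while the formula above uses 1/n;
# the constant only rescales the eigenvalues, it does not change the
# eigenvectors, so either convention works for PCA.
C_manual = X_c.T @ X_c / (X_toy.shape[0] - 1)
C_numpy = np.cov(X_toy, rowvar=False)  # rowvar=False: columns are features
print(np.allclose(C_manual, C_numpy))  # True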

Eigenvectors and eigenvalues

As said before, the aim is to reduce the dataset by compressing it into a new feature subspace, keeping only the set of eigenvectors that contain the most relevant information. Eigenvalues and eigenvectors can be calculated using the homogeneous linear system and the characteristic polynomial, both defined below
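C\,v = \lambda\,v \quad \Longleftrightarrow \quad (C - \lambda I)\,v = 0

where I is the d×d identity matrix, v is an eigenvector, and \lambda is the corresponding eigenvalue.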

The eigenvalues will be the zeros of the characteristic polynomial:
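\det(C - \lambda I) = 0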

For example, for d = 3 the steps to calculate the eigenvalues of the matrix C are
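\det \begin{pmatrix} c_{11} - \lambda & c_{12} & c_{13} \\ c_{21} & c_{22} - \lambda & c_{23} \\ c_{31} & c_{32} & c_{33} - \lambda \end{pmatrix} = 0

which, expanded (for instance by Laplace expansion along the first row), yields a third-degree polynomial in \lambda.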

So for d = 3 we get a third-degree equation whose roots are the eigenvalues, and they can be obtained by applying Ruffini’s rule
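For instance, with an invented symmetric 3×3 matrix (chosen for illustration, not computed from the Wine data):

C = \begin{pmatrix} 2 & 1 & 0 \\ 1 & 2 & 0 \\ 0 & 0 & 3 \end{pmatrix}, \qquad \det(C - \lambda I) = (3 - \lambda)\left[ (2 - \lambda)^2 - 1 \right] = -\lambda^3 + 7\lambda^2 - 15\lambda + 9

Setting this to zero gives \lambda^3 - 7\lambda^2 + 15\lambda - 9 = 0; among the divisors of 9, \lambda = 1 is a root, and Ruffini’s rule factors the cubic as (\lambda - 1)(\lambda - 3)^2, so the eigenvalues are \lambda_1 = 1 and \lambda_2 = \lambda_3 = 3.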

Eigenvalues for d = 3

To calculate the eigenvectors, each eigenvalue must be substituted back into the homogeneous linear system
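(C - \lambda_i I)\,v_i = 0, \qquad i = 1, \dots, d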

Eigenvectors formula

Solving this equation for each eigenvalue amounts to solving a homogeneous linear system, whose non-trivial solutions are the eigenvectors.
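Continuing the toy example above, for \lambda_1 = 1:

(C - I)\,v = \begin{pmatrix} 1 & 1 & 0 \\ 1 & 1 & 0 \\ 0 & 0 & 2 \end{pmatrix} v = 0 \;\Rightarrow\; v_1 + v_2 = 0, \; v_3 = 0 \;\Rightarrow\; v = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 \\ -1 \\ 0 \end{pmatrix}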

Python implementation

It is possible to implement what we have seen so far with a few lines of code using the NumPy library; the following simple script calculates the covariance matrix, the eigenvectors, and the eigenvalues.
The code uses the dataset called Wine; this and other datasets are available here. First of all the Wine dataset is loaded using the pandas library, then the 13 features and their values are loaded into the matrix X, ignoring the labels (which are in column 0).
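A minimal sketch of those steps (the dataset URL below is an assumption, pointing at the UCI repository copy of Wine; the linked script may differ in the details):

import numpy as np
import pandas as pd

# Load the Wine dataset (assumed URL: the UCI repository copy).
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data"
df = pd.read_csv(url, header=None)

# Column 0 contains the class labels; columns 1..13 are the 13 features.
X = df.iloc[:, 1:].values  # shape (n, 13)

# Mean-center the features before computing the covariance matrix.
X_centered = X - X.mean(axis=0)

# Covariance matrix C (13 x 13); rowvar=False treats columns as features.
C = np.cov(X_centered, rowvar=False)

# Eigen-decomposition; eigh suits symmetric matrices such as C and returns
# one eigenvector per column.
eigenvalues, eigenvectors = np.linalg.eigh(C)
print("Eigenvalues:", eigenvalues)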

This code can be found at this link

Dataset Transformation

The eigenvalues measure how much variance each eigenvector captures: they are sorted in descending order, and the k eigenvectors corresponding to the k largest eigenvalues, the most informative ones, are kept. The variance explained by eigenvalue \lambda_i can be calculated as the ratio between \lambda_i and the sum of all the eigenvalues
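\frac{\lambda_i}{\sum_{j=1}^{d} \lambda_j}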

At this point the eigen-pairs, defined as the mapping between eigenvalues and eigenvectors, can be sorted in descending order by eigenvalue and, in this case for k = 2, the two columns of the most discriminating eigenvectors can be stacked to create the projection matrix W. It is the matrix of basis vectors with the greatest variance, one vector per column; they are a subset of those in V, which contains one eigenvector per column (source: wikipedia.org).
Multiplying the initial n×d-dimensional dataset X by the projection matrix W produces the transformed n×2-dimensional dataset X_pca. This choice of k is for example purposes only; in practice the number of principal components k must be chosen based on the computational efficiency and the performance to be obtained.
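In matrix form:

X_{pca} = X\,W, \qquad X \in \mathbb{R}^{n \times d}, \; W \in \mathbb{R}^{d \times 2}, \; X_{pca} \in \mathbb{R}^{n \times 2}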

Python implementation

Let’s add a few lines of code to the previous script to complete the transformation of our dataset.
The array of eigen-pairs is created and sorted in descending order; after that, NumPy’s hstack function, which stacks arrays in sequence horizontally (as said before, they are the columns of the most discriminating eigenvectors), is used to build the projection matrix W, and the dot product of X with W yields the transformed n×2-dimensional dataset.
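A sketch of this second part, reusing the variable names from the previous snippet (again an approximation, not necessarily the linked code):

# Pair each eigenvalue with its eigenvector (one eigenvector per column)
# and sort the pairs in descending order of eigenvalue.
eigen_pairs = [(np.abs(eigenvalues[i]), eigenvectors[:, i])
               for i in range(len(eigenvalues))]
eigen_pairs.sort(key=lambda pair: pair[0], reverse=True)

# Stack the two most discriminating eigenvectors as columns: W is d x 2.
W = np.hstack((eigen_pairs[0][1][:, np.newaxis],
               eigen_pairs[1][1][:, np.newaxis]))

# Project the centered dataset onto the new 2-dimensional subspace.
X_pca = X_centered.dot(W)
print("Transformed shape:", X_pca.shape)  # (n, 2)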

This code can be found at this link

The whole script can be found here. To run it you need Python 3 and pip installed; if so, run the following commands to install the dependencies and launch the script

$ pip3 install numpy
$ pip3 install pandas
$ python3 pca_test.py

So it was shown, in a practical and fast way, how the initial dataset can be compressed into a new feature subspace by selecting only the set of eigenvectors that contain the most relevant information, going from d to k columns of dataset features, with k << d.
