In this post, we will learn about Principal Component Analysis (PCA) — a popular dimensionality reduction technique in Machine Learning. Our goal is to form an intuitive understanding of PCA without going into all the mathematical details.
At the time of writing this post, the population of the United States is roughly 325 million. You may think millions of people will have a million different ideas, opinions, and thoughts. After all, every person is unique. Right?
Wrong!
Humans are like sheep. We follow a herd. It’s sad but true.
Let’s say you select 20 top political questions in the United States and ask millions of people to answer these questions using a yes or a no. Here are a few examples:
1. Do you support gun control?
2. Do you support a woman’s right to abortion?
And so on and so forth. Technically, you can get 2^20 (more than a million) different answer sets because you have 20 questions and each question has to be answered with a yes or a no.
In practice, you will notice the number of distinct answer sets is much, much smaller. In fact, you can replace the top 20 questions with a single question
“Are you a democrat or a republican?”
and predict the answers to the rest of the questions with a high degree of accuracy. So, this 20-dimensional data is compressed to a single dimension, and not much information is lost!
This is exactly what PCA allows us to do. In multi-dimensional data, it helps us find the directions that are most useful and contain the most information, allowing us to extract the essential information while reducing the number of dimensions.
We will need some mathematical tools to understand PCA, so let’s begin with an important concept in statistics called variance.
What is variance?
The variance measures the spread of the data. In Figure 1 (a), the points have a high variance because they are spread out, but in Figure 1 (b), the points have a low variance because they are close together.
Also, note that in Figure 1 (a) the variance is not the same in all directions. The direction of maximum variance is especially important. Let’s see why.
Why do we care about the direction of maximum variance?
Variance encodes the information contained in the data. For example, consider 2D data where each point is represented by two coordinates (x, y). For n such points, you need 2n numbers to represent this data. Now consider a special case where, for every data point, the value along the y-axis is 0 (or constant). This is shown in Figure 2.
It is fair to say that there is no (or very little) information along the y-axis. You can compactly represent this data using n numbers to represent its value along the x-axis and only 1 common number to represent the constant along the y-axis. Because there is more variance along the x-axis, there is more information, and hence we have to use more numbers to represent this data. On the other hand, since there is no variance along the y-axis, a single number can be used to represent all information contained in n points along this axis.
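To make this concrete, here is a minimal Python sketch (the data is synthetic and the variable names are our own):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100

x = rng.uniform(-10, 10, size=n)      # spread out along the x-axis: high variance
y = np.full(n, 3.0)                   # constant along the y-axis: zero variance

points = np.column_stack([x, y])      # n points, stored as 2n numbers
print(np.var(points, axis=0))         # roughly [33, 0]

# Since y never changes, n numbers for the x values plus the single constant 3.0
# describe the data exactly: n + 1 numbers instead of 2n.
```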
What is Principal Component Analysis?
Now consider a slightly more complicated dataset shown in Figure 3 using red dots. The data is spread in a shape that roughly looks like an ellipse. The major axis of the ellipse is the direction of maximum variance and as we know now, it is the direction of maximum information. This direction, represented by the blue line in Figure 3, is called the first principal component of the data.
The second principal component is the direction of maximum variance perpendicular to the direction of the first principal component. In 2D, there is only one direction that is perpendicular to the first principal component, and so that is the second principal component. This is shown in Figure 3 using a green line.
Now consider 3D data spread like an ellipsoid (shown in Figure 4). The first principal component is represented by the blue line. There is an entire plane that is perpendicular to the first principal component. Therefore, there are infinite directions to choose from and the second principal component is chosen to be the direction of maximum variance in this plane. As you may have guessed, the third principal component is simply the direction perpendicular to both the first and second principal components.
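If you want to check this numerically, here is a small Python sketch (everything in it is synthetic and illustrative): it verifies that the second principal component is the maximum-variance direction within the plane perpendicular to the first.

```python
import numpy as np

rng = np.random.default_rng(0)

# Ellipsoid-like cloud of 3D points, rotated so that the axes are not special
X = rng.normal(size=(2000, 3)) * [5.0, 2.0, 0.5]
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
X = X @ Q.T

# The principal components are the eigenvectors of the covariance matrix,
# ordered by decreasing eigenvalue (i.e. by decreasing variance)
eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
pc1, pc2, pc3 = eigvecs[:, np.argsort(eigvals)[::-1]].T

# Build an arbitrary basis (u, w) of the plane perpendicular to pc1 ...
u = np.cross(pc1, [1.0, 0.0, 0.0])
u /= np.linalg.norm(u)
w = np.cross(pc1, u)

# ... then sweep directions in that plane and keep the one with maximum variance
angles = np.linspace(0.0, np.pi, 1800)
directions = np.cos(angles)[:, None] * u + np.sin(angles)[:, None] * w
best = directions[np.argmax(np.var(X @ directions.T, axis=0))]

print(abs(np.dot(best, pc2)))   # close to 1: same direction as the second principal component
```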
PCA and Dimensionality Reduction
In the beginning of this post, we had mentioned that the biggest motivation for PCA is dimensionality reduction. In other words, we want to capture information contained in the data using fewer dimensions.
Let’s consider the 3D data shown in Figure 4. Every data point has 3 coordinates – x, y, and z – which represent its values along the X, Y and Z axes. Notice that the three principal components are nothing but a new set of axes because they are perpendicular to each other. We can call these axes formed by the principal components X’, Y’ and Z’.
In fact, you can rotate the X, Y, Z axes along with all the data points in 3D such that the X-axis aligns with the first principal component, the Y-axis aligns with the second principal component, and the Z-axis aligns with the third principal component. By applying this rotation, we can transform any point (x, y, z) in the XYZ coordinate system to a point (x’, y’, z’) in the new X’Y’Z’ coordinate system. It is the same information presented in a different coordinate system, but the beauty of this new coordinate system X’Y’Z’ is that the information contained in X’ is maximum, followed by Y’ and then Z’. If you drop the coordinate z’ for every point (x’, y’, z’), we still retain most of the information, but now we need only two dimensions to represent this data.
This may look like a small saving, but if you have 1000 dimensional data, you may be able to reduce the dimension dramatically to maybe just 20 dimensions. In addition to reducing the dimension, PCA will also remove noise in the data.
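Here is a minimal Python sketch of the idea described above (synthetic data, our own variable names): we rotate 3D points into the coordinate system defined by the principal components, drop z’, and measure how little is lost.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic 3D data: lots of spread along one direction, very little along another
X = rng.normal(size=(500, 3)) * [4.0, 1.5, 0.2]            # rows are (x, y, z) points
X = X @ np.linalg.qr(rng.normal(size=(3, 3)))[0]            # random rotation so no axis is special

# Principal components: eigenvectors of the covariance matrix, sorted by eigenvalue
mean = X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(X - mean, rowvar=False))
eigvecs = eigvecs[:, np.argsort(eigvals)[::-1]]             # columns: X', Y', Z' directions

# (x', y', z'): the same points expressed in the X'Y'Z' coordinate system
X_prime = (X - mean) @ eigvecs

# Keep only (x', y'): two numbers per point instead of three
X_2d = X_prime[:, :2]

# Mapping back to 3D (with z' set to 0) shows how little information was lost
X_approx = X_2d @ eigvecs[:, :2].T + mean
print("average reconstruction error:", np.mean(np.linalg.norm(X - X_approx, axis=1)))
```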
What are Eigenvectors and Eigenvalues of a Matrix?
In the next section, we will explain step by step how PCA is calculated, but before we do that, we need to understand what eigenvectors and eigenvalues are.
Consider a 3×3 matrix $\mathbf{A}$ and a special vector $\mathbf{v}$. If we multiply the matrix with this vector, the resulting vector $\mathbf{A}\mathbf{v}$ points in exactly the same direction as $\mathbf{v}$; it is simply a scaled copy of it. That is what makes the vector special.

An eigenvector of a matrix $\mathbf{A}$ is a vector whose direction does not change when the matrix is multiplied with it. In other words,

$$\mathbf{A}\mathbf{v} = \lambda \mathbf{v}$$

where $\lambda$ is a scalar called the eigenvalue corresponding to the eigenvector $\mathbf{v}$.
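As a quick sanity check, here is a small Python sketch (the matrix below is our own illustrative example, not the one from the original figures) showing that multiplying a matrix by one of its eigenvectors only scales it:

```python
import numpy as np

# An illustrative symmetric 3x3 matrix (not the one from the original figures)
A = np.array([[2.0, 0.0, 0.0],
              [0.0, 3.0, 4.0],
              [0.0, 4.0, 9.0]])

eigenvalues, eigenvectors = np.linalg.eig(A)
lam = eigenvalues[0]                  # an eigenvalue of A
v = eigenvectors[:, 0]                # the corresponding eigenvector

print(A @ v)                          # multiplying by A ...
print(lam * v)                        # ... only scales v; the direction is unchanged
```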
How to Calculate PCA?
Usually, you can easily find the principal components of given data using a linear algebra package of your choice. In the next post, we will learn how to use the PCA class in OpenCV. Here, we briefly explain the steps for calculating PCA so you get a sense of how it is implemented in various math packages.
Here are the steps for calculating PCA. We have explained the steps using 3D data for simplicity, but the same idea applies to any number of dimensions.
- Assemble a data matrix: The first step is to assemble all the data points into a matrix where each column is one data point. A data matrix, $\mathbf{D}$, of $n$ 3D points looks something like this

$$\mathbf{D} = \begin{bmatrix} x_1 & x_2 & \cdots & x_n \\ y_1 & y_2 & \cdots & y_n \\ z_1 & z_2 & \cdots & z_n \end{bmatrix}$$
- Calculate Mean: The next step is to calculate the mean (average) of all data points. Note that if the data is 3D, the mean is also a 3D point with x, y and z coordinates. Similarly, if the data is m-dimensional, the mean will also be m-dimensional. The mean $\boldsymbol{\mu}$ is calculated as

$$\boldsymbol{\mu} = \frac{1}{n}\left( \mathbf{d}_1 + \mathbf{d}_2 + \cdots + \mathbf{d}_n \right)$$

where $\mathbf{d}_i$ is the $i$-th column (data point) of $\mathbf{D}$.
- Subtract Mean from data matrix: We next create another matrix $\mathbf{M}$ by subtracting the mean $\boldsymbol{\mu}$ from every data point (column) of $\mathbf{D}$

$$\mathbf{M} = \begin{bmatrix} \mathbf{d}_1 - \boldsymbol{\mu} & \mathbf{d}_2 - \boldsymbol{\mu} & \cdots & \mathbf{d}_n - \boldsymbol{\mu} \end{bmatrix}$$
- Calculate the Covariance matrix: Remember we want to find the direction of maximum variance. The covariance matrix captures the information about the spread of the data. The diagonal elements of a covariance matrix are the variances along the X, Y and Z axes. The off-diagonal elements represent the covariance between two dimensions (X and Y, Y and Z, Z and X). The covariance matrix, $\mathbf{C}$, is calculated using the following product

$$\mathbf{C} = \mathbf{M} \mathbf{M}^T$$

where $T$ represents the transpose operation. (Some definitions include a factor of $\frac{1}{n}$; this only rescales $\mathbf{C}$ and does not change the principal directions.) The matrix $\mathbf{C}$ is of size $m \times m$, where $m$ is the number of dimensions (which is 3 in our example). Figure 5 shows how the covariance matrix changes depending on the spread of data in different directions.
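The step this all builds toward, hinted at by the eigenvector discussion above, is to compute the eigenvectors of $\mathbf{C}$: sorted by decreasing eigenvalue, they are the principal components, and the eigenvalues are the variances along those directions. Here is a minimal Python sketch following the steps above (the data is synthetic and all names are our own):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Step 1: data matrix D, each COLUMN is one 3D point, so D is 3 x n
D = rng.normal(size=(3, n)) * np.array([[4.0], [1.5], [0.3]])

# Step 2: mean of all points (a 3D point)
mu = D.mean(axis=1, keepdims=True)

# Step 3: matrix M, the mean subtracted from every column of D
M = D - mu

# Step 4: covariance matrix C = M M^T (here with a 1/n factor, which only rescales it)
C = (M @ M.T) / n

# Final step: eigenvectors of C, sorted by decreasing eigenvalue, are the
# principal components; the eigenvalues are the variances along them.
eigenvalues, eigenvectors = np.linalg.eigh(C)
order = np.argsort(eigenvalues)[::-1]
print("variances along principal components:", eigenvalues[order])
print("principal components (as columns):\n", eigenvectors[:, order])
```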
If you are more interested in why this procedure works, here is an excellent article titled A Geometric Interpretation of the Covariance Matrix.
There is another widespread dimensionality reduction and visualization technique called t-SNE.
In my opinion, the example (questions with a yes/no answer) is not ideal:
a) In some countries, the US distinction (Republican/Democrat) is unknown. It also varies across time (e.g., Lincoln was a Republican... in 1860; George Wallace was a Democrat up to 1970; the racial question was not seen the same way 150 to 60 years ago).
b) I know a Republican (Tea Party) who is... pro-abortion (another example: binational Israeli Americans who think Republicans will favor Israel more may choose Republican even if they can be seen as liberals... or not).
Even in the Republican Senate, some Republicans sometimes vote against their party (e.g., McCain); if the problem were one-dimensional, they could not. I am afraid the information which is lost is... the most interesting part (e.g., Trump and Schwarzenegger are both Republicans and have different positions on global warming...) and its evolution over time (and the way one can attract votes among the “outliers”).
Synthetic data (x = t + random(); y = t + random(); t varies) can be easier for any reader to grasp, and it is a 2-dimensional problem; if t is allowed to vary “a lot”, one can summarize it as a 1D problem (less storage).
Sure, no dataset would be perfect, and therefore dimensionality reduction is lossy. I really don’t care about synthetic datasets, as they lead to a boring article and the audience cannot even relate to them. You will find hundreds of such articles on the internet that you can refer to, or write one yourself :P.
How did you make your illustrations:
with political data?
with synthetic datasets?
with hand-drawing software?
Among hundreds of articles, tutorials (which are often self-consistent) and datasets on the internet, I can quote:
Wikipedia
R help (but Octave or OpenCV can be used, too)
I avoid... increasing redundancy and distraction (politics being the worst source of distraction: hundreds of tutorials very wisely, and freely!, prefer... iris flowers, helping beginners to concentrate on the topic: I bet I am “free” not to add to the confusion; some people would feel “free” to note that it is worse than hundreds of tutorials on the web).
There is already such a source for EVERY machine learning concept. It is called Wikipedia and it is the best source if you are looking for plain hard math, some good examples etc.
https://en.wikipedia.org/wiki/Principal_component_analysis
My goal in this article was to provide both intuition and some entertainment.
Clearly, it is not for people like you who want more rigor from a blog post. Really? This blog is deliberately written in a more relaxed style, and if you want rigor, you should be reading scientific papers or Wikipedia.
Yeah, a relaxed style, coming from a teacher and an authority, does not excuse some of the oversights in this post. See my comment below. Really disappointing.
Also I should mention that sassing everybody in the comments is not helping your case.
It is not a matter of rigor, nor of style:
politics can stir passions, inhibiting the rational reasoning which must occur in the end. Your example / interpretation cannot explain why some/most US citizens are not one-dimensional, nor how they can evolve (perhaps because they are not that one-dimensional). I am not an expert in US politics, and never will be; thus I can find counterexamples, but I do not have any theory (even their parties: Democrats and Republicans seem to have, to some extent, swapped their positions with respect to racial issues over 100 years, less so in Wallace’s case).
I read scientific papers (in a domain that is uninteresting in this context, and for image processing) and Wikipedia (I verified Wallace’s position with it) in many domains.
🙂 I think we have both made our points, and we have at least one thing that we agree on — politics stirs passion. I understand why you feel political commentary is distracting.
There are also other posts on this blog which have no political commentary that you may like and not feel that “it is worse than hundreds of tutorials on the web” 😀
Well, I noticed your post because I know there are very good tutorials here (otherwise I could not have installed OpenCV on a NanoPi). I protested (and did not intend there to be any consequences) because this post was worse / less excellent than the others (I never write that a post is good: as I carefully choose my links, it would be tiring...).
BTW: did you notice MV hinted at classification (which is another topic) instead of “dimension reduction”? From an expert’s point of view (I am not an expert, but experts may protest instead of learning ML), or with some (Wikipedian, say) knowledge of US history, dimension reduction is not relevant to explain shifts in US policies: the first component of a poll is not the relevant variable (perhaps it is the 2nd... or 2^n-th one; perhaps it varies with time and the skills of politicians; perhaps none of the components is relevant and PCA is relevant for other topics).
I know you are right about Wikipedia; however, Wikipedia can be fully blocked in some countries, which might have consequences for your audience (its statistical, mathematical and computer science articles are often excellent).
Wikipedia is even capable of some kind of introspection: https://en.wikipedia.org/wiki/Censorship_of_Wikipedia
Turkish version: https://tr.wikipedia.org/wiki/Vikipedi%27nin_sans%C3%BCrlenmesi
Yeah, tossing in politics as an example for classification is a no-go.
I get enough politics without needing it to infect my ML learning until I want to explore the issues myself.
There is another defect:
illustrations show continuous (real-valued) vectors;
political classification uses (in a very, perhaps over-, simplified way) categorical variables (0/1: the category “do not know” does not exist), which cannot be visualized the same way.
Most images have real-valued features/colors, gradients, ..., at least when starting.
A well-known dataset (the iris one, 1938: before flower power, so one cannot be distracted by politics; ask Google or Wikipedia for “iris flower dataset”) or a synthetic one (where one knows the solution) would have been more appropriate.
Discrete-valued functions can be transformed to continuous ones all the time for analysis. It is not a problem.
You are free to write your own article with the “more appropriate” dataset. I will choose examples that I like :D.
Well, in a technical/scientific blog, I would have expected... a technical answer (and I doubt your example is consistent with the illustrations: when I am not competent in a topic, I check texts for internal self-consistency).
With respect to freedom:
One is free *not* to write anything (PCA has been known since 1938; Benzécri, a Wikipedia tag, analysed Lebanese polls in 1970, and their most interesting part: why people do not answer some questions; they had, IIRC, interesting ways of displaying the information).
One is “free” to read, and to look for expert advice without belittling it.
For instance, GNU Linux Magazine France dedicates 80 pages to neural networks and pattern recognition and, in the same month (January 2018...), hints at looking at http://callingbullshit.org//wiki where one neural network from the 1990s and a very recent study from face recognition gurus are analysed... (and results on testing sets are explained...).
This site, with known examples, makes a great contrast with your astrology example (is the Huff Post a scientific journal? What are you hinting at?).
OTOH, I managed to enhance the quality of grey conversion of some political video by looking at the 3 color channels, noticing that R and G are very correlated (this is verified on some extracts), and that the first (un-normed) component in BGR space is (0.5, 0.8, 1), very different from the classical color-to-grey conversion (this is the first step in finding good features to track, which works in monochrome; if the 2nd and 3rd components are noise, one might hope trackers will work better).
Links to the low-quality (but politically interesting, whether true or false) video can be found, if they have not been withdrawn for very political reasons, at https://www.heise.de/tp/features/Pech-fuer-Erdogan-PKK-deckt-MIT-Agenten-Netzwerk-auf-3935816.html?seite=all
The idea of using PCA to reduce the number of image channels stems from the early ’80s: satellite channels were noisy, and they sometimes tried to make a good composite out of 3... Later, it was replaced with physical analysis of the channels’ biases (but I do not have info about the PKK’s cameras, and it might be very difficult to ask them...).
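A minimal sketch of this kind of channel analysis, in case anyone wants to try it (the file path is a placeholder, and the (0.5, 0.8, 1) weights are specific to the extracts I looked at; recompute them for your own material):

```python
import numpy as np
import cv2

# "frame.png" is a placeholder path; use any extracted video frame
frame = cv2.imread("frame.png")                      # H x W x 3, BGR, uint8
pixels = frame.reshape(-1, 3).astype(np.float64)     # one row per pixel

# First principal component of the (B, G, R) values
C = np.cov(pixels - pixels.mean(axis=0), rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(C)
w = eigenvectors[:, np.argmax(eigenvalues)]
w *= np.sign(w.sum())                                # fix the arbitrary sign of the eigenvector
print("un-normed BGR weights:", w / w.max())         # compare with the classical grey weights

# PCA-weighted grey image
grey = (pixels @ (w / w.sum())).reshape(frame.shape[:2])
grey = np.clip(grey, 0, 255).astype(np.uint8)
```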
The OP is stuck in a local minimum. =)
Also, political bias often requires severe dropout in order to prevent overfitting of preconceptions.
As a rule of thumb, I read the news only once a week. If you are getting more politics than you like, stop reading the news 😀
Satya, you’re missing the point. MV is right; this was a poor take for a couple of reasons. First of all, you don’t actually cite the study of political views; you just speculate as to what would happen. In my experience as a political researcher your conclusions are exactly what someone who reads the news once a week would expect – but they’re not actually the case. So, issue 1: base assumption not true.
Issue 2: base assumption, coming from a data scientist like you, though false, pushes people to believe that political issues invariably fall on a singular bicameral political axis. Your authority makes it dangerous for you to advance this unfounded claim.
Issue 3: Over and over and over and over in data science, we have to go over with people how humans are not data points. Outliers matter. “not much information is lost,” by which you mean a model gets maybe 80% right, leaves out outliers—and when we’re talking about humans, outliers MATTER. I recommend you read Weapons of Math Destruction. In your position as a teacher and blogger, you need to think about how the way you frame your teaching affects larger society. We also had this same issue with your use of the Lena image in your very first OpenCV course. You need to be much, MUCH more aware of the impact of your framings. It seems you’re either not employing or not listening to people who could help you fix this.
The leading example was meant to show data has lower dimensionality than it may appear at the outset. It succeeds in making that point.
Second, a blog post is not a scientific paper. It does not have to be rigorous. If I were being rigorous, I would cite papers.
Third, I truly believe reading the news every day makes people stupid. This is especially true when people subscribe to only one newspaper. Otherwise, half the nation would not be “shocked” at the results of the election :P. Clearly, they had been reading the analysis of astrologers at the Huffington Post who predicted a 98% chance of a win for Hillary while throwing away common sense. They also claimed to use rigorous data science! Most news sources (on both sides of the political aisle) use data to essentially lie or create fake outrage. So you are better off consuming less of it.
Well, which nation / which half of the nation was shocked?
Can you prove they read the Huff Post?
Was there a simultaneous measure of their IQ (before / after reading the Huff Post)?
Politics seems *very* distracting.
To be fair:
a) the link given right at the end is excellent
b) Image processing is not all of data science (that is why I shyly hinted that politics is not one-dimensional): in political science, a poll result (1 sample) is much more expensive than... a phone picture (300*600*3 pixels). Even if, sometimes, a picture gets reduced to a one-dimensional object, it does not matter in a video... (but how a Republican or a Democrat differs from their caricatural one-dimensional representation might be a way to win elections). (Not distracting.)
c) If there had been no political reference, the post would have been much better/excellent (not distracting: one topic is enough to begin with).
As a rule of thumb, if you feel the need to toss politics into places other than the news, perhaps you should be a reporter.
Satya, thanks for your post, interesting as always.
And forget the haters…
🙂 Thanks for the kind words.
Haters is a strong word. They are simply people with different opinions.
It is just that if authors / writers / creators start caring too much about public opinion and pleasing everybody, we lose our originality and are left with bland and unoriginal content.
Thanks, again.
Pulling a militor vacuum there. Nice job.
Very informative article. I have been struggling to understand PCA and LDA for some time; this article helped a lot.
Thanks, Pradeep. If you have not already done so, please also check out the follow-up article on Eigenfaces:
https://learnopencv.com/eigenface-using-opencv-c-python/
Sorry, I’ll take the opposite opinion and say that the political example was spot on. It immediately made the point.