# Principal Components Analysis

## The data

File wind.data contains daily average wind speeds at 5 meteorological stations in the Republic of Ireland (This is part of a data set taken at 12 meteorological stations. The original file was downloaded from the StatLib Data Repository and its analysis is discused in
Haslett, J., Raftery, A. E. (1989) Space-time Modelling with Long-memory Dependence: Assessing Ireland's Wind Power Resource, with Discussion. Applied Statistics 38, 1-50.

To run these examples, you must load package descriptive and the data:

load("descriptive")$s2 : read_matrix (file_search ("wind.data"))$


Before reducing the dimensionality, we need to know the true dimension of the multivariate variable,

p: length(first(s2));


$5$

Sample size,

n: length(s2);


$100$

Let's calculate the multivariate mean,

mean(s2);


$\left[ 9.948 , 10.16 , 10.87 , 15.72 , 14.84 \right]$

The variances,

var(s2);


$\left[ 17.22 , 14.99 , 15.48 , 32.18 , 24.42 \right]$

Since data are of similar nature and scale, we don't need to standardize them.

The multivariate scatter plot to see if variables are correlated.

set_draw_defaults(
xlabel     = "",
ylabel     = "")$scatterplot( s2, nclasses = 5, fill_color = blue, fill_density = 0.3, color = red, point_size = 1/2, dimensions = [850,750], xtics = 5)$


There seems to be linear correlation among all variables, so we can calculate the correlation matrix,

cor(s2);


$\pmatrix{1.0&0.8476&0.8804&0.824&0.752\cr 0.8476&1.0&0.8736&0.6903& 0.7825\cr 0.8804&0.8736&1.0&0.7764&0.8323\cr 0.824&0.6903&0.7764&1.0 &0.7294\cr 0.752&0.7825&0.8323&0.7294&1.0\cr }$

## The model

In order to see if we can reduce the dimensionality of this sample, we calculate the principal components,

pc: principal_components(s2);


$\left[ \left[ 87.57 , 8.753 , 5.515 , 1.889 , 1.613 \right] , \left[ 83.13 , 8.31 , 5.235 , 1.793 , 1.531 \right] , \\ \pmatrix{ 0.4149&0.03379&-0.4757&-0.581&-0.5126\cr 0.369&-0.3657&-0.4298& 0.7237&-0.1469\cr 0.3959&-0.2178&-0.2181&-0.2749&0.8201\cr 0.5548& 0.7744&0.1857&0.2319&0.06498\cr 0.4765&-0.4669&0.712&-0.09605&- 0.1969\cr } \right]$

The returned list contains three elements, the first of them is a list with the variances of the principal components in descending order. We see that the first variance is much greater than the second one:

first(pc);


$\left[ 87.57 , 8.753 , 5.515 , 1.889 , 1.613 \right]$

The second element is the list with percentages of total variation explained by each principal component:

second(pc);


$\left[ 83.13 , 8.31 , 5.235 , 1.793 , 1.531 \right]$

We can accumulate percentages to help to decide how many variables we are going to take for further analysis. We see that with the two first principal componentes we explain about 91.44% of total variance:

block([ap: copy(pc[2])],
for k:2 thru length(ap) do ap[k]: ap[k]+ap[k-1],
ap);


$\left[ 83.13 , 91.44 , 96.68 , 98.47 , 100.0 \right]$

We can also plot a Pareto chart showing the percentages graphically:

draw2d(
fill_density  = 0.2,
apply(bars, makelist([k, pc[2][k], 1/2], k, p)),
points_joined = true,
point_type    = filled_circle,
point_size    = 2,
points(makelist([k, pc[2][k]], k, p)),
xlabel        = "Variances",
ylabel        = "Percentages",
xtics         = setify(makelist([concat("PC",k),k], k, p))) $ Finally, the last element returned by function principal_components is the rotation matrix: rot: third(pc);  $\pmatrix{0.4149&0.03379&-0.4757&-0.581&-0.5126\cr 0.369&-0.3657&- 0.4298&0.7237&-0.1469\cr 0.3959&-0.2178&-0.2181&-0.2749&0.8201\cr 0.5548&0.7744&0.1857&0.2319&0.06498\cr 0.4765&-0.4669&0.712&-0.09605 &-0.1969\cr }$ These are the coefficients we need to calculate the p principal components $Y_1, Y_2, Y_3, \ldots, Yp$ from the original variables $X_1, X_2, X_3, \ldots, Xp.$ If the observed data corresponding to one metheorological station is the vector $(x_1, x_2, x_3, x_4, x_5),$ the first principal component is calculated this way: matrix([x1, x2, x3, x4, x5]) . col(rot,1);  $0.4765\,{\it x_5}+0.5548\,{\it x_4}+0.3959\,{\it x_3}+0.369\, {\it x_2}+0.4149\,{\it x_1}$ ... the second matrix([x1, x2, x3, x4, x5]) . col(rot,2);  $-0.4669\,{\it x_5}+0.7744\,{\it x_4}-0.2178\,{\it x_3}-0.3657\, {\it x_2}+0.03379\,{\it x_1}$ ... and so on. In general, the transformed multivariate sample is calculated this way: new: s2 . rot$


And if we want to use the first two components, we remove the last three columns from the transformed sample:

new2: submatrix(new, 3,4,5);


$\pmatrix{30.66&0.9512\cr 27.93&1.78\cr 22.64&-0.567\cr 13.44&-4.203 \cr 23.64&-1.542\cr 18.23&-3.966\cr 24.12&0.1813\cr 27.78&-0.2169 \cr 25.23&-1.561\cr 20.34&-3.698\cr 28.55&4.444\cr 29.73&-1.084\cr 8.702&-0.7874\cr 8.754&2.462\cr 14.22&-0.04463\cr 21.12&1.576\cr 28.8&-0.469\cr 30.19&-10.2\cr 10.48&0.4667\cr 11.37&0.1515\cr 19.18& -4.55\cr 18.99&0.08038\cr 32.9&-4.489\cr 44.54&-2.166\cr 27.14&- 2.229\cr 43.18&-5.436\cr 49.67&-3.064\cr 33.06&0.8554\cr 50.74&- 0.9644\cr 32.28&-0.2534\cr 24.75&-2.612\cr 14.98&-0.08559\cr 18.81& 1.342\cr 16.99&-0.3552\cr 28.55&0.4827\cr 34.3&0.8969\cr 45.56& 0.7853\cr 30.51&-3.537\cr 42.18&0.6286\cr 39.6&0.9222\cr 32.79&- 0.9442\cr 39.74&-1.838\cr 37.8&5.232\cr 42.83&4.742\cr 30.05&-2.859 \cr 24.46&3.398\cr 39.77&-1.495\cr 34.39&-1.734\cr 29.94&2.494\cr 40.82&6.545\cr 25.03&0.6239\cr 17.42&3.25\cr 27.6&-0.2265\cr 42.01&- 0.1073\cr 16.71&-0.5108\cr 30.1&-0.8958\cr 45.95&-3.996\cr 32.98& 0.9275\cr 33.2&0.4342\cr 32.0&3.716\cr 30.16&1.459\cr 35.01&4.616 \cr 19.65&1.3\cr 26.0&-3.86\cr 14.44&2.485\cr 21.26&4.471\cr 27.66& 1.426\cr 31.58&-1.637\cr 23.97&-2.048\cr 36.51&3.325\cr 40.06&-3.581 \cr 33.96&-0.7837\cr 30.13&3.338\cr 33.26&4.878\cr 23.83&6.641\cr 38.61&1.058\cr 36.82&-3.485\cr 20.68&-0.9488\cr 32.66&-6.451\cr 22.35&-4.885\cr 19.41&-4.469\cr 27.15&-1.928\cr 27.42&-1.599\cr 37.65&-0.6065\cr 34.42&-6.523\cr 17.13&-5.456\cr 31.67&-0.5124\cr 36.0&-2.487\cr 31.92&-1.82\cr 23.97&-2.088\cr 20.15&0.6859\cr 19.49& -1.63\cr 18.92&0.6379\cr 23.3&1.388\cr 26.37&1.804\cr 19.06&-0.8494 \cr 12.86&-0.1988\cr 22.46&-1.34\cr 24.96&-5.134\cr 15.1&-2.422\cr }$