Data transformations


Standardization subtracts the sample mean from each element and then divides the result by the standard deviation. The package descriptive must be loaded before calling the function standardize.
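As a minimal sketch, assuming the sample is the integers 1 to 9 (the input is an assumption; any nine equally spaced increasing values standardize to the same list), the call could look like this:

```maxima
load(descriptive)$

/* assumed sample: the integers 1, 2, ..., 9 */
s : makelist(k, k, 1, 9)$

/* subtract the mean of s and divide by its standard deviation */
z : standardize(s);
```

The middle element equals the mean, so its standardized value is 0, and the remaining values are symmetric around it, as in the list below.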


\[\left[ -{{2\,\sqrt{3}}\over{\sqrt{5}}} , -{{3^{{{3}\over{2}}} }\over{2\,\sqrt{5}}} , -{{\sqrt{3}}\over{\sqrt{5}}} , -{{\sqrt{3} }\over{2\,\sqrt{5}}} , 0 , {{\sqrt{3}}\over{2\,\sqrt{5}}} , {{\sqrt{ 3}}\over{\sqrt{5}}} , {{3^{{{3}\over{2}}}}\over{2\,\sqrt{5}}} , {{2 \,\sqrt{3}}\over{\sqrt{5}}} \right]\]

When the argument is a matrix, each row is interpreted as an individual (record) and each column as a variable; every element is standardized with respect to the mean and standard deviation of its column:

m : matrix([3,3,7,8],[4,6,5,0],[7,6,5,4],[7,2,8,3],[0,6,2,6]) $
standardize(m);

\[ \pmatrix{-{{6\,\sqrt{10}}\over{5\,\sqrt{87}}}&-{{8}\over{\sqrt{5}\, \sqrt{19}}}&{{8\,\sqrt{10}}\over{5\,\sqrt{53}}}&{{19}\over{\sqrt{5} \,\sqrt{46}}}\cr -{{\sqrt{10}}\over{5\,\sqrt{87}}}&{{7}\over{\sqrt{5 }\,\sqrt{19}}}&-{{2\,\sqrt{10}}\over{5\,\sqrt{53}}}&-{{21}\over{ \sqrt{5}\,\sqrt{46}}}\cr {{14\,\sqrt{10}}\over{5\,\sqrt{87}}}&{{7 }\over{\sqrt{5}\,\sqrt{19}}}&-{{2\,\sqrt{10}}\over{5\,\sqrt{53}}}&- {{1}\over{\sqrt{5}\,\sqrt{46}}}\cr {{14\,\sqrt{10}}\over{5\,\sqrt{87 }}}&-{{13}\over{\sqrt{5}\,\sqrt{19}}}&{{13\,\sqrt{10}}\over{5\, \sqrt{53}}}&-{{6}\over{\sqrt{5}\,\sqrt{46}}}\cr -{{21\,\sqrt{10} }\over{5\,\sqrt{87}}}&{{7}\over{\sqrt{5}\,\sqrt{19}}}&-{{17\,\sqrt{ 10}}\over{5\,\sqrt{53}}}&{{9}\over{\sqrt{5}\,\sqrt{46}}}\cr } \]
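As a check on the first entry, and assuming (as the output above suggests) that the matrix case divides by the standard deviation computed with denominator n-1: the first column 3, 4, 7, 7, 0 has mean 21/5, and its squared deviations sum to 174/5, so

\[ z_{11} = \frac{3 - \frac{21}{5}}{\sqrt{\frac{1}{4}\cdot\frac{174}{5}}} = -\frac{6}{5}\cdot\frac{\sqrt{10}}{\sqrt{87}} = -\frac{6\,\sqrt{10}}{5\,\sqrt{87}} \]

which agrees with the top-left element of the matrix.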

Selecting records in multivariate samples

Given a multivariate sample, we can extract subsamples according to certain conditions.

n : matrix([A,3,7,8],[B,6,5,0],[A,6,5,4],[A,2,8,3],[B,6,2,6]);

/* function 'takeA' will be applied to all rows,
   extracting those beginning with letter A */
takeA(v) := is(first(v)='A)$
subsample(n, takeA);

\[ \pmatrix{A&3&7&8\cr A&6&5&4\cr A&2&8&3\cr } \]

Continuing with this example, we can select rows using several conditions and, at the same time, remove or reorder columns.

takeA5(v) := is(first(v)='A and second(v) < 5)$
subsample(n, takeA5, 2, 4, 3);

\[ \pmatrix{3&8&7\cr 2&3&8\cr } \]

Now we keep those individuals whose components sum to more than a given value.

subsample(m, lambda([z], apply("+", z) > 20));

\[ \pmatrix{3&3&7&8\cr 7&6&5&4\cr } \]
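The same pattern covers other selections. The sketch below is an illustration only: the predicate name takeB is hypothetical and not part of the package. It keeps the rows of n that begin with B and have a positive fourth component, returning only columns 1 and 4.

```maxima
load(descriptive)$

n : matrix([A,3,7,8],[B,6,5,0],[A,6,5,4],[A,2,8,3],[B,6,2,6])$

/* hypothetical predicate: first entry is B and fourth entry is positive */
takeB(v) := is(first(v) = 'B and fourth(v) > 0)$

/* keep matching rows, then return columns 1 and 4 only */
r : subsample(n, takeB, 1, 4);
```

If subsample behaves as in the examples above, only the last row of n qualifies, so r should be the one-row matrix with entries B and 6.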

Column transformations in multivariate samples

While the function subsample filters records (rows) in multivariate samples, the function transform_sample can be used to transform, remove, and create variables (columns).

transform_sample needs three arguments: the sample matrix, a list with the names of the variables, and a list indicating how to build the new matrix.

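Given a matrix with columns x1 to x4, we want to build a new matrix whose columns are, in order: the logarithm of x1, the product of x2 and x3, x1 unchanged, and a constant column of ones.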

d : matrix([3,3,7,8],[4,6,5,0],[7,6,5,4],[7,2,8,3],[1,6,2,6]) $

transform_sample(d,
    [x1, x2, x3, x4],
    [log(x1), x2*x3, x1, makelist(1,k,length(d))]);

\[ \pmatrix{\log 3&21&3&1\cr \log 4&30&4&1\cr \log 7&30&7&1\cr \log 7& 16&7&1\cr 0&12&1&1\cr } \]

This time we want to standardize the fourth column of the original matrix. Note the single quote operator applied to the list of transforming expressions: it is necessary to prevent standardize from being evaluated before transform_sample is called.

transform_sample(d,
    [x1, x2, x3, x4],
    '[x1, standardize(x4), x1^2]);

\[ \pmatrix{3&{{19}\over{2\,\sqrt{46}}}&9\cr 4&-{{21}\over{2\,\sqrt{46 }}}&16\cr 7&-{{1}\over{2\,\sqrt{46}}}&49\cr 7&-{{3}\over{\sqrt{46}}} &49\cr 1&{{9}\over{2\,\sqrt{46}}}&1\cr } \]

Box-Cox transformations (no estimation)

Box-Cox transformations help bring data closer to a Gaussian distribution, which many statistical procedures assume. They are defined as \[ \mbox{bc}(x, k, m) = \left\{ \begin{array}{ll} \log(x+m) & \mbox{if } k=0 \\ \frac{(x+m)^k - 1}{k} & \mbox{otherwise} \end{array} \right. \] where k is the power to which the shifted data are raised, and m is a constant shift chosen so that x+m is positive. We plot Box-Cox transformations for several values of the parameter k,

bc(x,k,m) := 
    if k=0
        then log(x+m)
        else ((x+m)^k-1) / k $

draw2d(
    yrange     = [-2,2],
    line_width = 2,
    grid       = true,
    xaxis      = true,
    yaxis      = true,
    map(lambda([z], [color = random_color(),
                     key   = string('k = z),
                     explicit( bc(x,z,0), x, 0, 3)]),
        [-1, 0, 1/2, 1, 2, 3]) ) $
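The logarithm is used for k = 0 because it is the limit of the power branch as k tends to 0, so the family varies continuously with k. Writing y = x + m > 0:

\[ \lim_{k \to 0} \frac{y^k - 1}{k} = \lim_{k \to 0} \frac{e^{k \log y} - 1}{k} = \log y \]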


Here is a worked example to show how to apply this procedure,

/* the sample and its histogram */
m: [299.9,195.0,155.8,478.7,396.7,640.7,457.8,46.59,298.5,109.3,
    269.0,270.6,442.5,277.2,284.8,300.8,251.1,320.9,674.0,177.0] $
histogram(m, fill_density = 0.5) $
[Figure: Histogram 1]
/* since there seems to be some positive skewness,
   we apply a Box-Cox transformation */
m2: bc(m,1/2,0)$
histogram(m2, fill_density = 0.5) $
[Figure: Histogram 2]
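Besides comparing histograms by eye, one could compare the sample skewness before and after the transformation. The sketch below assumes the skewness function from the descriptive package and is only one possible check, not part of the original procedure.

```maxima
load(descriptive)$

bc(x,k,m) :=
    if k=0
        then log(x+m)
        else ((x+m)^k-1) / k $

m: [299.9,195.0,155.8,478.7,396.7,640.7,457.8,46.59,298.5,109.3,
    269.0,270.6,442.5,277.2,284.8,300.8,251.1,320.9,674.0,177.0] $
m2: bc(m,1/2,0)$

/* skewness of the original and of the transformed sample;
   a value closer to 0 suggests a more symmetric distribution */
sk : [skewness(m), skewness(m2)];
```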

This was an example of a Box-Cox transformation applied to a sample list. Let's now transform columns of a matrix,

d : matrix([3,3,7,8],[4,6,5,0],[7,6,5,4],[7,2,8,3],[1,6,2,6]);

transform_sample(d,
    [x1, x2, x3, x4],
    '[bc(x3, 6, 0), bc(x1, 0, 1)]);

\[ \pmatrix{19608&\log 4\cr 2604&\log 5\cr 2604&\log 8\cr {{87381 }\over{2}}&\log 8\cr {{21}\over{2}}&\log 2\cr } \]
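The first row can be checked by hand from the definition of bc, using the first row of d, where x3 = 7 and x1 = 3:

\[ \mbox{bc}(7, 6, 0) = \frac{7^6 - 1}{6} = \frac{117648}{6} = 19608, \qquad \mbox{bc}(3, 0, 1) = \log(3 + 1) = \log 4 \]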

Again, do not forget the single quote.

© 2011-2016, TecnoStats.