Processing math: 100%

Statistical graphics

Generalities
Histograms
Scatterplots
Barsplots
Piecharts
Boxplots
Stemplots
Starplots
Statistical charts in complex scenes

Generalities

In general, arguments to statistical graphics are data and options in equation format (option = value). Options are of two types:

specific to each stats plot
general draw options

There are three functions associated to each graph. In the case of histograms, we have:

histogram: for plotting in any terminal, works in all interfaces.
wxhistogram: for embedded graphics in wxMaxima and iMaxima interfaces.
histogram_description: a draw object to be included in complex and multiple objects scenes.

To run these examples, you must load package descriptive and set some global options:

load("descriptive")$

set_draw_defaults(
  terminal   = png,
  xlabel     = "",
  ylabel     = "")$

Histograms

Specific options (defaults in parentheses):

nclasses (10): number of classes, a positive integer
frequency (absolute): absolute, relative, density or percent
htics (auto): auto, endpoints, intervals, or a list of labels

We begin with an histogram of absolute frequencies. File pidigits.data is part of the Maxima distribution and contains the first one hundred digits of number $\pi$ .

/* read sample */
s1 : read_list (file_search ("pidigits.data")); 

histogram ( 
    s1,

    /* specific histogram options */
    nclasses = 8,

    /* local draw options affecting the histogram */
    fill_density = 0.3,
    dimensions   = [400,300],
    fill_color   = red,
    line_width   = 3) $

An histogram of percents and more draw options.

histogram (
    s1,

    /* specific histogram options */
    nclasses = 8,
    frequency = percent,

    /* local draw options affecting the histogram */
    fill_density = 0.3,
    fill_color   = green,
    line_width   = 3,

    /* global draw options */
    grid             = true,
    title            = "A green histogram",
    xlabel           = "Pi digit",
    ylabel           = "Percentages (%)",
    dimensions       = [400, 400],
    background_color = cyan) $

Fixing the limits of the plot from -2 to 12. The number of classes is 3.

histogram (
    s1,
    nclasses     = [-2,12,3],
    fill_density = 0.2, 
    dimensions   = [400,300])$

Fixing the limits of the plot from -2 to 12. The number of classes is 10, the default.

histogram (
  s1,
  nclasses     = [-2,12],
  fill_density = 0.2, 
  dimensions   = [400,300])$

We can also play with tic marks on the x-axis via the htics option. (This is a contibution by Len Brin.)

fpprintprec: 5$

histogram (
  s1,
  nclasses     = [-2,12,5],
  htics        = 'intervals,
  fill_density = 0.2, 
  dimensions   = [460,300])$

Instead of writing the intervals on the x-axis, we can write the end points of the classes.

histogram (
  s1,
  nclasses     = 6,
  htics        = endpoints,
  fill_density = 0.6, 
  dimensions   = [400,300])$

Write user defined labels for the classes.

histogram (
  s1,
  nclasses     = [-2,12,6],
  htics        = ["Cl-1","Cl-2","Cl-3","Cl-4","Cl-5","Cl-6"],
  fill_density = 0.2, 
  dimensions   = [400,300])$

See what happens when the number of labels is less than the number of classes.

histogram (
  s1,
  nclasses     = 6,
  htics        = ["Cl 1","Cl 2","Cl 3"],
  fill_density = 0.2, 
  dimensions   = [400,300])$

Or when the list of labels is empty. See also the relative frequency scale.

histogram (
  s1,
  nclasses     = 6,
  frequency    = relative,
  htics        = [],
  fill_color   = navy,
  fill_density = 0.2, 
  dimensions   = [400,300])$

Now we want a scene with two objects, the histogram and the Gaussian density. Since the histogram must be consider as any other draw object, we need to call histogram_description to build the complete scene.

/* Simulate a Gaussian sample of size 1000 */
( load("distrib"),
  m: 14, /* mean */
  s: 2,  /* standard deviation */
  s2: random_normal(m, s, 1000) ) $ 

/* See that the order objects are plotted is important */
draw2d(
    grid       = true, 
    dimensions = [400,300],
    histogram_description(
        s2,
        nclasses     = 9,
        frequency    = density,
        fill_density = 0.5),
    explicit(pdf_normal(x,m,s), x, m - 3*s, m + 3* s) ) $

Note that histogram_description only returns draw directives to be used later by draw:

histogram_description(
  s2,
  nclasses     = 9,
  frequency    = relative,
  fill_density = 0.5);

[[fill_density = 0.5], xrange = [6.4658, 20.402], 
yrange = [- 0.013, 0.273], [], bars([7.8594, 0.006, 1.3936], 
[9.253, 0.021, 1.3936], [10.647, 0.075, 1.3936], [12.04, 0.174, 1.3936], 
[13.434, 0.26, 1.3936], [14.827, 0.239, 1.3936], [16.221, 0.148, 1.3936], 
[17.614, 0.066, 1.3936], [19.008, 0.011, 1.3936])]

The histogram function automatically sets the x and y ranges, ignoring options xrange and yrange:

histogram (
  s1,
  nclasses     = 8,
  xrange       = [-5,15],
  yrange       = [-15,40],
  fill_density = 0.5, 
  dimensions   = [400,300])$

If you want specific values for the axes ranges, make use of histogram_description:

draw2d(
  histogram_description (
     s1,
     nclasses     = 8,
     fill_density = 0.2),
  grid       = true,
  xrange     = [-5,15],
  yrange     = [-15,40], 
  dimensions = [400,300]) $

Scatterplots

An univariate scatter plot.

load (distrib)$

scatterplot(
  random_normal(0,1,200),
  xaxis      = true,
  point_size = 2,
  dimensions = [600,150])$

A bivariate scatter plot. File wind.data is part of the Maxima distribution and contains wind speeds from five Irish weather stations.

s2 : read_matrix (file_search ("wind.data"))$

scatterplot(
  submatrix(s2, 1,2,3),
  title      = "Data from stations #4 and #5",
  point_type = diamant,
  point_size = 2,
  color      = blue, 
  dimensions = [400,300])$

A trivariate scatter plot.

scatterplot(
    submatrix (s2, 1,2),
    dimensions = [700,500]);

A multivariate scatter plot.

scatterplot(
    s2,
    nclasses     = 5,
    fill_color   = blue,
    fill_density = 0.3,
    color        = red,
    point_size   = 1/2,
    dimensions   = [900,800],
    xtics        = 5) $

Barsplots

Specific options (defaults in parentheses):

box_width (3/4): a number between zero and 1
grouping (clustered): or stacked
groups_gap (1): distance between clusters, must be a positive integer
ars_colors ([]): list of colors for bars. If the list is shorter than the number of samples, colors are generated randomly
frequency (absolute): absolute, relative or percent
ordering (orderlessp): orderlessp or ordergreatp
sample_keys ([]): entries for legend
start_at (0): starting point on the x axis

Plotting absolute frequencies. File biomed.data is part of the Maxima distribution and contains biomedical data, it is a multivariate sampleof six dimensions. The first variable corresponds to the group (A or B) and the second is the age of the patient. Both variables are discrete.

m : read_matrix (file_search ("biomed.data"))$

/* barsplot for ages */
barsplot(
  col(m,2),

  box_width  = 1,
  title      = "Ages",
  xlabel     = "years", 
  dimensions = [400,300] )$

Barsplot of two discrete variables. Note that sample spaces are different.

barsplot(
  col(m,1), col(m,2),

  box_width    = 1,
  bars_colors  = [salmon, green],
  title        = "Patient ages and groups",
  xlabel       = "Age - Group",
  fill_density = 3/4, 
  dimensions   = [400,300] ) $

Barsplots in a multiplot context. When plotting complex or multiplot scenes, we need to call barsplot_description, this function creates a graphic object to be later plotted by any of the draw' functions.

 /* Remember that 'gr2d' is a scene object used by package 'draw'.
   We create two 2d scenes and then we build the multiplot */

bp1 :
  gr2d(
    ylabel = "Relative frequencies",
    color = red,
    barsplot_description(
        col(m,1),
        box_width = 1,
        bars_colors = [blue],
        frequency = relative))$
bp2:
  gr2d(
    ylabel = "Percentages (%)",
    color = blue,
    barsplot_description(
        col(m,2),
        box_width = 1,
        fill_density = 0.5,
        label_orientation = 'vertical,
        bars_colors = [red],
        frequency = percent))$

/* vertical multiplot */
draw(
  dimensions = [500, 500],
  bp1,
  bp2 ) $

Example of an horizontal multiplot.

draw(
  dimensions = [850, 300],
  columns    = 2,
  bp1,
  bp2 ) $

If we want to get an idea on how are ages and groups related, we can plot a stacked barsplot:

agesA: col(subsample (m, lambda([v], v[1] = A)), 2) $
agesB: col(subsample (m, lambda([v], v[1] = B)), 2) $

barsplot(
  agesA, agesB,

  groups_gap  = 3,
  bars_colors = [red, blue],
  grouping    = stacked,
  sample_keys = ["Group A", "Group B"],
  title       = "Groups A and B within ages",
  ylabel      = "# of individuals",
  dimensions  = [600, 400] )$

Comparing two samples:

l1: makelist(random(10),k,1,50)$
l2: makelist(random(10),k,1,100)$

barsplot(
  l1, l2,

  box_width   = 1,
  bars_colors = [violet, gray],
  frequency   = percent,
  sample_keys = ["1st sample", "2nd sample"], 
  dimensions  = [400,300] ) $

We now have four samples of a categorical (Yes, No, Maybe) variable sampled in four different cities.

r1: makelist([Yes, No, Maybe][random(3)+1],k,1,150) $
r2: makelist([Yes, No, Maybe][random(3)+1],k,1,50) $
r3: makelist([Yes, No, Maybe][random(3)+1],k,1,75) $
r4: makelist([Yes, No, Maybe][random(3)+1],k,1,200) $

barsplot(
  r1, r2, r3, r4,

  title        = "Asking for something in four populations",
  ylabel       = "# of individuals",
  fill_density = 0.5,
  dimensions   = [600, 400],
  groups_gap  = 3,
  frequency   = relative,
  sample_keys = ["Betanzos", "Carballo", "Ferrol", "A Coruna"],
  yrange      = [0, 0.6], /* expanding the y-range upwards to
                             get more place for the legend */
  ordering    = ordergreatp )$

As above, but with stacked bars.

barsplot(
  r1, r2, r3, r4,

  groups_gap  = 1,
  fill_density = 0.5,
  dimensions   = [600, 400],
  sample_keys = ["Betanzos", "Carballo", "Ferrol", "A Coruna"],
  grouping    = stacked,
  ordering    = ordergreatp,
  bars_colors = [green, blue, yellow, brown],
  dimensions  = [400,300],
  xrange      = [-1,4] /* Now, we expand the x-range rightwards
                          to get more space for the legend */)$

With option start_at, we can draw more than one bars diagrams on the x axis. Labels are vertically oriented. In the next plot, we want three independent barsplots in one scene. Again, we need the barsplot_description mechanism:

/* Sample simulation */
sample1: makelist([Yes, No][random(2)+1],k,1,20) $
sample2: makelist([Yes, No, Maybe][random(3)+1],k,1,100) $
sample31: makelist([A,B,C,D][random(4)+1],k,1,100) $
sample32: makelist([A,B,C,D][random(4)+1],k,1,100) $ 

draw2d(
  label_orientation = 'vertical,
  dimensions   = [400,300],

  barsplot_description(
    sample1,
    bars_colors = [red],
    sample_keys = ["Are you smoker?"],
    start_at    = 10 ),

  barsplot_description(
    sample2,
    sample_keys = ["Do you like pizza?"],
    start_at    = 1 ),

  fill_density = 0.3,
  barsplot_description(
    sample31, sample32,
    bars_colors = [orange, blue],
    sample_keys = ["Political party (men)", "Political party (women)"],
    start_at    = 25 ),

    xrange = [0, 40],
    yrange = [0, 50] )$

Piecharts

Specific options (defaults in parentheses):

sector_colors ([]): list of colors for sectors. If the list is shorter than the number of sectors, colors are generated randomly.
pie_center ([0,0]): a pair of numbers.
pie_radius (1): a positive number.

A simple pie chart

s1 : read_list (file_search ("pidigits.data"))$

piechart(
    s1,
    proportional_axes = xy,
    xrange            = [-2, 4],
    yrange            = [-1.1, 1.1],
    dimensions        = [550,300],
    title             = "Digit frequencies in pi" )$

User defined colors for sectors

s: makelist(random(6)+1, k, 100)$

piechart(
    s,
    sector_colors     = [red,blue,yellow,green,black,orange],
    proportional_axes = xy,
    dimensions        = [450,300],
    xrange            = [-1.5, 1.5] ) $

Two pie charts with different radii. For more than one piechart, make use of piechart_description.

s: makelist(random(6)+1, k, 100)$
a: makelist(['yes, 'no, "do not ask me"][1+random(3)], k, 1, 75)$

draw2d(
    proportional_axes = xy,
    xrange            = [-2, 3.5],
    dimensions        = [400,300],
    piechart_description(
        s,
        sector_colors = [red,blue,orange,navy,gray,black]),
    piechart_description(
        a,
        sector_colors = [green,yellow,cyan],
        pie_center    = [-1,1.5],
        pie_radius    = 1/2)) $

Boxplots

Specific options (defaults in parentheses):

box_width (3/4): widths for boxes.
box_orientation (vertical): vertical or horizontal.
range (inf): sets the interval for outliers. See examples below.
outliers_size (1): Point size for isolated outliers.

Comparing the variablility of three samples.

A : [[6, 4, 6, 2, 4, 8, 6, 4, 6, 4, 3, 2],
     [8, 10, 7, 9, 12, 8, 10],
     [16, 13, 17, 12, 11, 18, 13, 18, 14, 12]]$

boxplot(
    A, 
    dimensions   = [400,300]) $

As above, but horizontally oriented, together with the grid.

boxplot(
  A,
  box_orientation = horizontal,
  grid            = true, 
  dimensions      = [400,300]) $

Boxplot generates its own tics on the axes. If we want to write our own marks, we have to use the boxplot_description mechanism.

/* Setting tics on the vertical axis */

draw2d(
  boxplot_description(A, box_orientation=horizontal,grid=true),
  dimensions = [400,300],
  ytics = {["Sample 1", 1], ["Sample 2", 2], ["Sample 3", 3]} )$

Fine tunning our plot with draw options.

s2 : read_matrix(file_search("wind.data"))$

boxplot(
  s2,
  box_width = 0.2,

  background_color = light_gray,
  font             = "Times-Roman",
  font_size        = 17,
  title            = "Windspeeds in knots",
  xlabel           = "Number of Station",
  color            = red,
  line_width       = 2,
  dimensions       = [400,300] )$

With option range it is possible to handle outliers, which are those observations outside the interval $[q_1-r \cdot IQR, q_3+r \cdot IQR]$ , where $q_i$ is the i-th quartile and $IQR=q_3-q_1$ is the interquartile range. When outliers are present, they are plotted as isolated points, and whiskers are allocated at the extremes of the rest of observations.

B: [[7, 15, 5, 8, 6, 5, 7, 3, 1],
    [10, 8, 12, 8, 11, 9, 20],
    [23, 17, 19, 7, 22, 19]] $

boxplot (B, range=1)$

Outliers in horizontal orientation,

boxplot (B, range=1.5, box_orientation = horizontal)$

Adding a polygonal line. In this case, we make use of function boxplot_description, which describes the boxes, and add it to draw2d together with other graphic objects.

draw2d(
        /* set the terminal */
        terminal = wxt,

        /* boxplot */
        boxplot_description(
          B,
          range            = 1.5,
          line_width       = 3,
          outliers_size    = 2,
          color            = red,
          background_color = light_gray),

        /* polygonal line joining the medians */
        points_joined = true,
        /* color and line_width have being modified
           during the boxplot definition */
        color         = blue,
        line_width    = 2,
        points(map(median, B)),
        xtics = {["Low",1],["Medium",2],["High",3]}) $

Stemplots

This stem-and-leaf function is a kind contribution by Len Brin. It admits option leaf_unit, which must be a power of ten.

/* Simulating a Gaussian sample */

load(distrib)$
sample: random_normal(15, 6, 100)$

stemplot(
  sample,
  leaf_unit = 0.1);

 1|69
 4|47
 5|226
 6|8
 7|247
 8|1344669
 9|0678
10|56
11|256678
12|0035899
13|23444789
14|04445
15|224556
16|3458
17|112467899
18|00234
19|45
20|1234
21|46
22|127799
23|19
24|1334
25|12
26|58
28|37
key: 6|3 =  6.3

Changing the value of leaf_unit.

stemplot(
  sample,
  leaf_unit = 1);

0|2245556777888889999
1|000011222222223333333334444444555566666777777888888888899
2|000012223333344444557789
key: 6|3 =  63

Starplots

Specific options (defaults in parentheses):

stars_colors ([]): list of colors for bars. If the list is shorter than the number of samples, colors are generated randomly.
frequency (absolute): absolute or relative.
ordering (orderlessp): orderlessp or ordergreatp.
sample_keys ([]): entries for legend.
star_center ([0,0]): a pair of numbers.
star_radius (1): a positive number.

Simulating two discrete samples with the same sample space.

l1: makelist(random(10),k,1,50)$
l2: makelist(random(10),k,1,200)$

/* Plot based on absolute frequencies.
   Location and radius defined by the user. */
starplot(
  l1, l2,
  stars_colors = [blue,red],
  sample_keys  = ["1st sample", "2nd sample"],
  star_center  = [1,2],
  star_radius  = 4,

  proportional_axes = xy,
  line_width        = 2, 
  dimensions        = [400,300] ) $

As above, but plotting relative frequencies. Default location and radius.

starplot(
  l1, l2,
  stars_colors = [blue, light_blue],
  frequency    = relative,
  sample_keys  = ["1st sample", "2nd sample"],

  proportional_axes = xy,
  line_width        = 2, 
  dimensions        = [400,300] ) $

Using starplot_description in a multiplot context.

pl1:
  gr2d(
    starplot_description(
      l1, l2,
      stars_colors      = [blue, light_blue],
      sample_keys       = ["1st sample", "2nd sample"],
      star_center       = [1,2],
      star_radius       = 4,
      proportional_axes = xy,
      line_width        = 2,
      title             = "Absolute frequencies") ) $

pl2:
  gr2d(
    starplot_description(
      l1, l2,
      stars_colors      = [blue, light_blue],
      frequency         = relative,
      sample_keys       = ["1st sample", "2nd sample"],
      proportional_axes = xy,
      line_width        = 2,
      title             = "Relative frequencies") ) $

draw(
  dimensions = [600, 400],
  columns    = 2,
  pl1, pl2)$

Statistical charts in complex scenes

Statistical charts described above cannot be displayed together with other graphics elements unless you make use of the following mechanism. You can describe an statistical chart without plotting it, and then use this description as any other graphics element when invoking the draw function:

s1 : read_list (file_search ("pidigits.data"))$

/* variable scene stores the histogram description*/
scene: histogram_description (
         s1,
         title        = "pi digits",
         xlabel       = "digits",
         ylabel       = "Absolute frequency",
         fill_color   = grey,
         fill_density = 0.6);

$\left[ \left[ {\it title}=\mbox{ pi digits } , {\it xlabel}= \mbox{ digits } , {\it ylabel}=\mbox{ Absolute frequency } , {\it fill\_color}={\it grey} , \\ {\it fill\_density}=0.6 \right] , {\it xrange}=\left[ -0.45 , 9.45 \right] , {\it yrange}=\left[ - 0.65 , 13.65 \right] , \left[ \right] , \\ {\it bars}\left(\left[ 0.45 , 8.0 , 0.9 \right] , \left[ 1.35 , 8.0 , 0.9 \right] , \left[ 2.25 , 12.0 , 0.9 \right] , \\ \left[ 3.15 , 12.0 , 0.9 \right] , \left[ 4.05 , 10.0 , 0.9 \right] , \left[ 4.95 , 8.0 , 0.9 \right] , \\ \left[ 5.85 , 9.0 , 0.9 \right] , \left[ 6.75 , 8.0 , 0.9 \right] , \left[ 7.65 , 12.0 , 0.9 \right] , \left[ 8.55 , 13.0 , 0.9 \right] \right) \right]$

/* line and histogram, in this order */
draw2d(
  color = blue,
  explicit(2*x,x,-5,20),
  scene) $

If we don't want the blue line behind the bars, change the order of the graphic objects. We also add the grid lines:

draw2d(
  scene,
  color  = blue,
  grid   = true,
  xrange = auto,
  xtics  = auto,
  explicit(2*x,x,-5,20) ) $

Finally, this is the method for building multiplots with statistical charts (output in postscript format):

s3 : read_matrix (file_search ("biomed.data"))$
bp1: barsplot_description(
       col(s3,1),
       title        = "Groups of patients",
       xlabel       = "Group",
       ylabel       = "# of individuals",
       fill_density = 0.5)$
bp2: barsplot_description(
       col(s3,2),
       title        = "Ages",
       xlabel       = "years",
       fill_density = 0.3)$
draw(gr2d(bp1),
     gr2d(bp2),
     dimensions = 100*[10, 10],
     terminal   = eps_color)$