Statistics

This chapter describes the statistical functions provided by the PLT Scheme Science Collection. The basic statistical functions include functions to compute the mean, variance, and standard deviation. More advanced functions allow you to calculate absolute deviation, skewness, and kurtosis, as well as the median and arbitrary percentiles. The algorithms use recurrance relations to compute average quantities in a stable way, without large intermediate values that might overflow.

The functions described in this chapter are defined in the statistics.ss file in the science collection and are made available ising the following form:

(require (planet "statistics.ss" ("williams" "science.plt")))

8.1  Mean, Standard Deviation, and Variance

mean

Function:
(mean data)

Contract:

(-> (vectorof real?) real?)

This function returns the arithmetic mean of data.

variance

Function:
(variance data mu)
(variance data)

Contract:

(case->
   (-> (vector-of real?) real? (>=/c 0.0))
   (-> (vector-of real?) (>=/c 0.0)))

This function returns the sample variance of data. If mu is not provided, it is calculated by a call to (mean data).

standard-deviation

Function:
(standard-deviation data mu)
(standard-deviation data)

Contract:

(case->
   (-> (vector-of real?) real? (>=/c 0.0))
   (-> (vector-of real?) (>=/c 0.0)))

The standard deviation is defined as the square root of the variance. This function returns the square root of the corresponding variance function above.

variance-with-fixed-mean

Function:
(variance-with-fixed-mean data mu)

Contract:

(-> (vector-of real?) real? (>=/c 0.0))

This function returns an unbiased estimate of the variance of data when the population mean, mu, of the underlying distribution is known a priori.

standard-deviation-with-fixed-mean

Function:
(standard-deviation-with-fixed-mean data mu)

Contract:

(-> (vector-of real?) real> (>=/c 0.0))

This function returns the standard deviation of data with a fixed population mean, mu. The result is the square root of the variance-with-fixed-mean function.

8.2  Absolute Deviation

absolute-deviation

Function:
(absolute-deviation data mu)
(absolute-deviation data)

Contract:

(case->
   (-> (vector-of real?) real? (>=/c 0.0))
   (-> (vector-of real?) (>=/c 0.0)))

This function returns the absolute deviation of data relative to the given value of the mean, mu. If mu is not provided, it is calculated by a call to (mean data). This function is also useful if you want to calculate the absolute deviation relative to any value other than the mean, such as zero or the median.

8.3  Higher Moments (Skewness and Kurtosis)

skew

Function:
(skew data mu sigma)
(skew data)

Contract:

(case->
   (-> (vector-of real?) real? (>=/c 0.0) real?)
   (-> (vector-of real?) real?)))

The shewness measures the symmetry of the tails of a distribution. This function returns the skewness of data using the given values of the mean, mu and standard deviation, sigma. If mu and sigma are not provided, they are calculated by calls to (mean data) and (standard-deviation data mu).

kurtosis

Function:
(kurtosis data mu sigma)
(kurtosis data)

Contract:

(case->
   (-> (vector-of real?) real? (>=/c 0.0) real?)
   (-> (vector-of real?) real?)))

The kurtosis measures how sharply peaked a distribution is relative to its width. This function returns the kurtosis of data using the given values of the mean, mu and standard deviation, sigma. If mu and sigma are not provided, they are calculated by calls to (mean data) and (standard-deviation data mu).

8.4  Autocorrelation

lag-1-autocorrelation

Function:
(lag-1-autocorrelation data mu)
(lag-1-autocorrelation data)

Contract:

(case->
   (-> non-empty-vector-of-reals real? real?)
   (-> non-empty-vector-of-reals real?))

This function returns the lag-1 autocorrelation of data using the given value of the mean, mu. If mu is not provided, it is calculated by a call to (mean data).

8.5  Covariance

covariance

Function:
(covariance data1 data 2 mu1 mu2)
(covariance data1 data2)

Contract:

(case->
   (->r ((data1 (vectorof real?))
         (data2 (and/c (vectorof real?)
                       (lambda (x)
                         (= (vector-length data1)
                            (vector-length data2)))))
         (mu1 real?)
         (mu2 real?))
        real?)
   (->r ((data1 (vectorof real?))
         (data2 (and/c (vectorof real?)
                       (lambda (x)
                         (= (vector-length data1)
                            (vector-length data2)))))
        real?))

This function returns the covariance of data1 and data2, which must both be the same length, using the given values of the means, mu1 and mu2. If the values of mu1 and mu2 are not given, they are calculated using calls to (mean data1) and (mean data2), respectively.

covariance-with-fixed-means

Function:
(covariance-with-fixed-means data1 data2)

Contract:

(->r ((data1 (vectorof real?))
      (data2 (and/c (vectorof real?)
                    (lambda (x)
                      (= (vector-length data1)
                         (vector-length data2)))))
      (mu1 real?)
      (mu2 real?))
     real?)

This function returns the covariance of data1 and data2, which must both be the same length, when the population means, mu1 and mu2, of the underlying distributions are known a priori.

8.6  Weighted Samples

weighted-mean

Function:
(weighted-mean w data)

Contract:

(->r ((w (vectorof real?))
      (data (and/c (vectorof real?)
                   (lambda (x)
                     (= (vector-length w)
                        (vector-length data))))))
      real?)

This function returns the weighted mean of data using weights w.

weighted-variance

Function:
(weighted-variance w data wmu)
(weighted-variance w data)

Contract:

(case->
   (->r ((w (vectorof real?))
         (data (and/c (vectorof real?)
                      (lambda (x)
                        (= (vector-length w)
                           (vector-length data)))))
         (mu real?)
        (>=/c 0.0))
   (->r ((w (vectorof real?))
         (data (and/c (vectorof real?)
                      (lambda (x)
                        (= (vector-length w)
                           (vector-length data)))))
        (>=/c 0.0)))

This function returns the weighted variance of data using weights w and the given weighted mean, wmu. If wmu is not provided, it is calculated by a call to (weighted-mean w data).

weighted-standard-deviation

Function:
(weighted-standard-deviation w data wmu)
(weighted-standard-deviation w data)

Contract:

(case->
   (->r ((w (vectorof real?))
         (data (and/c (vectorof real?)
                      (lambda (x)
                        (= (vector-length w)
                           (vector-length data)))))
         (mu real?)
        (>=/c 0.0))
   (->r ((w (vectorof real?))
         (data (and/c (vectorof real?)
                      (lambda (x)
                        (= (vector-length w)
                           (vector-length data)))))
        (>=/c 0.0)))

The standard deviation is defined as the square root of the variance. This function returns the square root of the corresponding weighted-variance function above.

weighted-variance-with-fixed-mean

Function:
(weighted-variance-with-fixed-mean w data wmu)

Contract:

(->r ((w (vectorof real?))
      (data (and/c (vectorof real?)
                   (lambda (x)
                     (= (vector-length w)
                        (vector-length data)))))
      (mu real?)
     (>=/c 0.0))

This function returns an unbiased estimate of the weighted variance of data using weights w when the weighted population mean, wmu, of the underlying distribution is known a priori.

weighted-standard-deviation-with-fixed-mean

Function:
(weighted-standard-deviation-with-fixed-mean w data wmu)

Contract:

(->r ((w (vectorof real?))
      (data (and/c (vectorof real?)
                   (lambda (x)
                     (= (vector-length w)
                        (vector-length data)))))
      (mu real?)
     (>=/c 0.0))

This function returns the weighted standard deviation of data using weights w with a fixed weighted population mean, wmu. The result is the square root of the weighted-variance-with-fixed-mean function.

weighted-absolute-deviation

Function:
(weighted-absolute-deviation w data wmu)
(weighted-absolute-deviation w data)

Contract:

(case->
   (->r ((w (vectorof real?))
         (data (and/c (vectorof real?)
                      (lambda (x)
                        (= (vector-length w)
                           (vector-length data)))))
         (mu real?)
        (>=/c 0.0))
   (->r ((w (vectorof real?))
         (data (and/c (vectorof real?)+
                      (lambda (x)
                        (= (vector-length w)
                           (vector-length data)))))
        (>=/c 0.0)))

This function returns the weighted absolute deviation of data using weights s relative to the given value of the weighted mean, wmu. If wmu is not provided, it is calculated by a call to (weighted-mean w data). This function is also useful if you want to calculate the weighted absolute deviation relative to any value other than the weighted mean, such as zero or the weighted median.

weighted-skew

Function:
(weighted-skew w data wmu wsigma)
(weighted-skew w data)

Contract:

(case->
   (->r ((w (vectorof real?))
         (data (and/c (vectorof real?)
                      (lambda (x)
                        (= (vector-length w)
                           (vector-length data)))))
         (mu real?)
         (sigma (>=/c 0.0)))
        real?)
   (->r ((w (vectorof real?))
         (data (and/c (vectorof real?)
                      (lambda (x)
                        (= (vector-length w)
                           (vector-length data)))))
        real?))

The shewness measures the symmetry of the tails of a distribution. This function returns the weighted skewness of data using weights w and the given values of the weighted mean, wmu and weighted standard deviation, wsigma. If wmu and wsigma are not provided, they are calculated by calls to (weighted-mean w data) and (weighted-standard-deviation w data wmu).

weighted-kurtosis

Function:
(weighted-kurtosis w data wmu wsigma)
(weighted-kurtosis w data)

Contract:

(case->
   (->r ((w (vectorof real?))
         (data (and/c (vectorof real?)
                      (lambda (x)
                        (= (vector-length w)
                           (vector-length data)))))
         (mu real?)
         (sigma (>=/c 0.0)))
        real?)
   (->r ((w (vectorof real?))
         (data (and/c (vectorof real?)
                      (lambda (x)
                        (= (vector-length w)
                           (vector-length data)))))
        real?))

The kurtosis measures how sharply peaked a distribution is relative to its width. This function returns the weighted kurtosis of data using weights w and the given values of the weighted mean, wmu and weighted standard deviation, wsigma. If wmu and wsigma are not provided, they are calculated by calls to (weighted-mean w data) and (weighted-standard-deviation w data wmu).

8.7  Maximum and Minimum

maximum

Function:
(maximum data)

Contract:

(-> non-empty-vector-of-reals? real?)

This function returns the maximum value in data.

minimum

Function:
(minimun data)

Contract:

(-> non-empty-vector-of-reals? real?)

This function returns the minimum value in data.

minimum-maximum

Function:
(minimun-maximum data)

Contract:

(-> non-empty-vector-of-reals? (values real? real?))

This function returns the minimum and maximum values in data as multiple values.

maximum-index

Function:
(maximum-index data)

Contract:

(-> non-empty-vector-of-reals? natural-number?)

This function returns the index of the maximum value in data. When there are several equal maximum elements, the index of the first one is chosen.

minimum-index

Function:
(minimun-index data)

Contract:

(-> non-empty-vector-of-reals? natural-number?)

This function returns the index of the minimum value in data. When there are several equal minimum elements, the index of the first one is chosen.

minimum-maximum-index

Function:
(minimun-maximum-index data)

Contract:

(-> non-empty-vector-of-reals?
    (values natural-number? number?))

This function returns the indices of the minimum and maximum values in data as multiple values. When there are several equal minimum or maximum elements, the index of the first ones are chosen.

8.8  Median and Percentiles

The median and percentile functions described in this section operate on sorted data. The contracts for these functions enforce this. Also, for convenience we use quantiles measured on a scale of 0 to 1 instead of percentiles (which use a scale of 0 to 100).

median-from-sorted-data

Function:
(median-from-sorted-data sorted-data)

Contract:

(-> (and/c non-empty-vector-of-reals? sorted?)
    real?)

This function returns the median value of sorted-data. When the dataset has an odd number of elements, the median is the value of element (n - 1)/2. When the dataset has an even number of elements, the median is the mean of the two nearest middle values, elements (n - 1)/2 and n/2.

quantile-from-sorted-data

Function:
(qualtile-from-sorted-data sorted-data f)

Contract:

(-> (and/c non-empty-vector-of-reals? sorted?)
    (real-in 0.0 1.0) real?)

This function returns a quantile calue of sorted-data. The quantile is determined by the value f, a fraction between 0 and 1. For example, to compute the value of the 75th percentile, f should have the value 0.75.

The quantile is found by interpolation using the formula

quantile = 1 - delta(x[i]) + delta(x[i+1])

where i is floor((n - 1)f) and delta is (n - 1)f - 1.

8.9  Example

This example generates two vectors of data from a unit Gaussian distribution and a vector of cosine squared weighting data. All of the vectors are of length 1,000. These data are used to test all of the statistics functions.

(require (planet "gaussian.ss" ("williams" "science.plt")
                 "random-distributions"))
(require (planet "statistics.ss" ("williams" "science.plt")))
(require (planet "math.ss" ("williams" "science.plt")))

(define (naive-sort! data)
  (let loop ()
    (let ((n (vector-length data))
          (sorted? #t))
      (do ((i 1 (+ i 1)))
          ((= i n) data)
        (if (< (vector-ref data i)
               (vector-ref data (- i 1)))
            (let ((t (vector-ref data i)))
              (vector-set! data i (vector-ref data (- i 1)))
              (vector-set! data (- i 1) t)
              (set! sorted? #f))))
        (if (not sorted?)
            (loop)))))

(let ((data1 (make-vector 1000))
      (data2 (make-vector 1000))
      (w     (make-vector 1000)))
  (do ((i 0 (+ i 1)))
      ((= i 1000) (void))
    ;; Random data from unit gaussian
    (vector-set! data1 i (random-unit-gaussian))
    (vector-set! data2 i (random-unit-gaussian))
    ;; Cos^2 weighting
    (vector-set! w i
      (expt (cos (- (* 2.0 pi (/ i 1000.0)) pi)) 2)))
  (printf "Statistics Example~n")
  (printf "                                mean = ~a~n"
          (mean data1))
  (printf "                            variance = ~a~n"
          (variance data1))
  (printf "                  standard deviation = ~a~n"
          (standard-deviation data1))
  (printf "                   variance from 0.0 = ~a~n"
          (variance-with-fixed-mean data1 0.0))
  (printf "         standard deviation from 0.0 = ~a~n"
          (standard-deviation-with-fixed-mean data1 0.0))
  (printf "                  absolute deviation = ~a~n"
          (absolute-deviation data1))
  (printf "         absolute deviation from 0.0 = ~a~n"
          (absolute-deviation data1 0.0))
  (printf "                                skew = ~a~n"
          (skew data1))
  (printf "                            kurtosis = ~a~n"
          (kurtosis data1))
  (printf "               lag-1 autocorrelation = ~a~n"
          (lag-1-autocorrelation data1))
  (printf "                          covariance = ~a~n"
          (covariance data1 data2))
  (printf "                       weighted mean = ~a~n"
          (weighted-mean w data1))
  (printf "                   weighted variance = ~a~n"
          (weighted-variance w data1))
  (printf "         weighted standard deviation = ~a~n"
          (weighted-standard-deviation w data1))
  (printf "          weighted variance from 0.0 = ~a~n" 
          (weighted-variance-with-fixed-mean w data1 0.0))
  (printf "weighted standard deviation from 0.0 = ~a~n" 
          (weighted-standard-deviation-with-fixed-mean w data1 0.0))
  (printf "         weighted absolute deviation = ~a~n"
          (weighted-absolute-deviation w data1))
  (printf "weighted absolute deviation from 0.0 = ~a~n"
          (weighted-absolute-deviation w data1 0.0))
  (printf "                       weighted skew = ~a~n"
          (weighted-skew w data1))
  (printf "                   weighted kurtosis = ~a~n"
          (weighted-kurtosis w data1))
  (printf "                             maximum = ~a~n"
          (maximum data1))
  (printf "                             minimum = ~a~n"
          (minimum data1))
  (printf "              index of maximum value = ~a~n"
          (maximum-index data1))
  (printf "              index of minimum value = ~a~n"
          (minimum-index data1))
  (naive-sort! data1)
  (printf "                              median = ~a~n"
          (median-from-sorted-data data1))
  (printf "                        10% quantile = ~a~n"
          (quantile-from-sorted-data data1 .1))
  (printf "                        20% quantile = ~a~n"
          (quantile-from-sorted-data data1 .2))
  (printf "                        30% quantile = ~a~n"
          (quantile-from-sorted-data data1 .3))
  (printf "                        40% quantile = ~a~n"
          (quantile-from-sorted-data data1 .4))
  (printf "                        50% quantile = ~a~n"
          (quantile-from-sorted-data data1 .5))
  (printf "                        60% quantile = ~a~n"
          (quantile-from-sorted-data data1 .6))
  (printf "                        70% quantile = ~a~n"
          (quantile-from-sorted-data data1 .7))
  (printf "                        80% quantile = ~a~n"
          (quantile-from-sorted-data data1 .8))
  (printf "                        90% quantile = ~a~n"
          (quantile-from-sorted-data data1 .9))
)

Produces the following output:

Statistics Example
                                mean = 0.03457693091555611
                            variance = 1.0285343857083422
                  standard deviation = 1.0141668431320077
                   variance from 0.0 = 1.028701415474174
         standard deviation from 0.0 = 1.014249188056946
                  absolute deviation = 0.7987180852601665
         absolute deviation from 0.0 = 0.7987898146946209
                                skew = 0.043402934671178436
                            kurtosis = 0.17722452271704014
               lag-1 autocorrelation = 0.0029930889831972143
                          covariance = 0.005782911085590894
                       weighted mean = 0.05096139259270008
                   weighted variance = 1.0500293763787367
         weighted standard deviation = 1.0247094107007786
          weighted variance from 0.0 = 1.0510513958491579
weighted standard deviation from 0.0 = 1.0252079768755011
         weighted absolute deviation = 0.8054378524718832
weighted absolute deviation from 0.0 = 0.8052440544958938
                       weighted skew = 0.046448729539282155
                   weighted kurtosis = 0.3050060704791675
                             maximum = 3.731148814104969
                             minimum = -3.327265864298485
              index of maximum value = 502
              index of minimum value = 476
                              median = 0.019281803306206644
                        10% quantile = -1.243869878615807
                        20% quantile = -0.7816243947573505
                        30% quantile = -0.4708703241429585
                        40% quantile = -0.2299309332835332
                        50% quantile = 0.019281803306206644
                        60% quantile = 0.30022966479982344
                        70% quantile = 0.5317978807508836
                        80% quantile = 0.832291888537874
                        90% quantile = 1.3061151234700463