8 Statistics

    8.1 Running Statistics

    8.2 Running Statistics Example

    8.3 Mean, Standard Deviation, and Variance

    8.4 Absolute Deviation

    8.5 Higher Moments (Skewness and Kurtosis)

    8.6 Autocorrelation

    8.7 Covariance

    8.8 Correlation

    8.9 Weighted Samples

    8.10 Maximum and Minimum

    8.11 Median and Quartiles

    8.12 Statistics Example

This chapter describes the statistical functions provided by the PLT Scheme Science Collection. The basic statistical functions compute the mean, variance, and standard deviation. More advanced functions calculate the absolute deviation, skewness, and kurtosis, as well as the median and arbitrary percentiles. The algorithms use recurrence relations to compute average quantities in a stable way, without large intermediate values that might overflow.

The functions described in this chapter are defined in the "statistics.ss" file in the science collection and are made available using the form:

 (require (planet williams/science/statistics))
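
The recurrence idea mentioned in the introduction can be illustrated with a minimal sketch. The helper recurrence-mean below is hypothetical and is not part of the collection; it only shows how a mean can be updated one value at a time so that no large running sum is formed. Whether the collection uses exactly this update is an assumption.

```scheme
#lang scheme

;; Hypothetical helper (not part of the science collection).
;; Computes the arithmetic mean of a list with the recurrence
;;   mean_k = mean_(k-1) + (x_k - mean_(k-1)) / k
;; so the intermediate value is always a mean, never a large sum.
(define (recurrence-mean xs)
  (let loop ((xs xs) (k 1) (m 0.0))
    (if (null? xs)
        m
        (loop (cdr xs)
              (+ k 1)
              (+ m (/ (- (car xs) m) k))))))

(recurrence-mean '(1.0 2.0 3.0 4.0)) ; => 2.5
```

With three values of 1e308, a naive (/ (+ 1e308 1e308 1e308) 3) overflows to +inf.0, while (recurrence-mean '(1e308 1e308 1e308)) stays at 1e308.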

8.1 Running Statistics

A running statistics object accumulates a minimal set of statistics (n, min, max, mean, variance, and standard deviation) for a set of data values. A running statistics object does not require that a sequence (e.g., list or vector) of the data values be maintained.

(statistics? x)  boolean?
  x : any/c
Returns #t if x is a running statistics object.

(make-statistics)  statistics?
Returns a new, empty running statistics object.

(statistics-reset! s)  void?
  s : statistics?
Resets the running statistics object s.

(statistics-tally! s x)  void?
  s : statistics?
  x : real?
Updates the running statistics object s with the value of x.

(statistics-n s)  exact-nonnegative-integer?
  s : statistics?
Returns the number of values that have been added to the running statistics object s. This value is zero initially and after a call to statistics-reset!.

(statistics-min s)  real?
  s : statistics?
Returns the minimum value that has been added to the running statistics object s. This value is +inf.0 initially and after a call to statistics-reset!.

(statistics-max s)  real?
  s : statistics?
Returns the maximum value that has been added to the running statistics object s. This value is -inf.0 initially and after a call to statistics-reset!.

(statistics-mean s)  real?
  s : statistics?
Returns the arithmetic mean of the values that have been added to the running statistics object s. This value is zero initially and after a call to statistics-reset!.

(statistics-variance s)  real?
  s : statistics?
Returns the estimated, or sample, variance of the values that have been added to the running statistics object s. This value is zero initially and after a call to statistics-reset!.

(statistics-standard-deviation s)  real?
  s : statistics?
Returns the standard deviation of the values that have been added to the running statistics object s. This is the square root of the value returned by statistics-variance.
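
To make the idea of accumulating statistics without storing the data concrete, here is a minimal sketch of one common technique, a Welford-style update. The names acc, make-empty-acc, acc-tally!, and acc-variance are hypothetical and are not the collection's implementation; the initial values simply mirror those documented above (+inf.0 for the minimum, -inf.0 for the maximum, and zero for the mean and variance).

```scheme
#lang scheme

;; Hypothetical sketch, not the collection's source: an accumulator
;; holding n, min, max, mean, and the sum of squared deviations (m2).
(define-struct acc (n min max mean m2) #:mutable)

(define (make-empty-acc)
  (make-acc 0 +inf.0 -inf.0 0.0 0.0))

;; Welford-style update: constant work and constant storage per value.
(define (acc-tally! a x)
  (let* ((n (+ (acc-n a) 1))
         (delta (- x (acc-mean a)))
         (mean (+ (acc-mean a) (/ delta n))))
    (set-acc-n! a n)
    (set-acc-min! a (min (acc-min a) x))
    (set-acc-max! a (max (acc-max a) x))
    (set-acc-mean! a mean)
    (set-acc-m2! a (+ (acc-m2 a) (* delta (- x mean))))))

;; Sample variance. This sketch returns zero until two values are seen.
(define (acc-variance a)
  (if (< (acc-n a) 2)
      0.0
      (/ (acc-m2 a) (- (acc-n a) 1))))

(define a (make-empty-acc))
(for ((x '(2.0 4.0 4.0 6.0)))
  (acc-tally! a x))
```

After the four tallies, the accumulator holds n = 4, min = 2.0, max = 6.0, a mean of 4.0, and a sample variance of 8/3 (up to floating-point rounding).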

8.2 Running Statistics Example

This example generates 100 random numbers between 0.0 and 10.0 and maintains running statistics on the values.

  #lang scheme
  (require (planet williams/science/statistics)
           (planet williams/science/random-distributions))
  
  (define (test)
    (let ((stat (make-statistics)))
      (for ((i (in-range 100)))
        (statistics-tally! stat (random-flat 0.0 10.0)))
      (printf "Running Statistics Example~n")
      (printf "                 n = ~a~n" (statistics-n stat))
      (printf "               min = ~a~n" (statistics-min stat))
      (printf "               max = ~a~n" (statistics-max stat))
      (printf "              mean = ~a~n" (statistics-mean stat))
      (printf "          variance = ~a~n" (statistics-variance stat))
      (printf "standard deviation = ~a~n" (statistics-standard-deviation stat))))
  
  (test)

Produces the following results.

Running Statistics Example
                 n = 100
               min = 0.11100957474903939
               max = 9.938914540059452
              mean = 5.466640451797567
          variance = 8.677003172428925
standard deviation = 2.945675333846031

8.3 Mean, Standard Deviation, and Variance

(mean data)  real?
  data : sequence-of-real?
Returns the arithmetic mean of data.

(mean-and-variance data)  real? (>=/c 0.0)
  data : sequence-of-real?
Returns the arithmetic mean and the estimated, or sample, variance of data as multiple values. These values are computed in a single pass through data.

(variance data mean)  (>=/c 0.0)
  data : sequence-of-real?
  mean : real?
(variance data)  (>=/c 0.0)
  data : sequence-of-real?
Returns the estimated, or sample, variance of data relative to the given value of mean. If mean is not provided, the variance is relative to the arithmetic mean and is computed in a single pass through data.

(standard-deviation data mean)  (>=/c 0.0)
  data : sequence-of-real?
  mean : real?
(standard-deviation data)  (>=/c 0.0)
  data : sequence-of-real?
Returns the estimated, or sample, standard deviation of data relative to the given value of mean. If mean is not provided, the standard deviation is relative to the arithmetic mean and is computed in a single pass through data. The standard deviation is defined as the square root of the variance.

(sum-of-squares data mean)  (>=/c 0.0)
  data : sequence-of-real?
  mean : real?
(sum-of-squares data)  (>=/c 0.0)
  data : sequence-of-real?
Returns the total sum of squares of data about the mean. If mean is not provided, it is calculated by a call to (mean data).

(variance-with-fixed-mean data mean)  (>=/c 0.0)
  data : sequence-of-real?
  mean : real?
Returns an unbiased estimate of the variance of data when the population mean, mean, of the underlying distribution is known a priori.

(standard-deviation-with-fixed-mean data mean)  (>=/c 0.0)
  data : sequence-of-real?
  mean : real?
Returns the standard deviation of data for a fixed population mean, mean. The result is the square root of the value returned by variance-with-fixed-mean.
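
The difference between the sample variance and the variance with a fixed mean comes down to the normalization. The sketch below uses hypothetical helpers (sum-sq-dev, sample-variance, fixed-mean-variance) and the conventional definitions: dividing by n - 1 when the mean is estimated from the data, and by n when the mean is known a priori. It is a two-pass illustration of the formulas, not the collection's single-pass implementation.

```scheme
#lang scheme

;; Hypothetical helpers illustrating the two normalizations.
;; Sum of squared deviations of the list xs about the value m.
(define (sum-sq-dev xs m)
  (apply + (map (lambda (x) (expt (- x m) 2)) xs)))

;; Mean estimated from the data: divide by n - 1.
(define (sample-variance xs)
  (let ((m (/ (apply + xs) (length xs))))
    (/ (sum-sq-dev xs m) (- (length xs) 1))))

;; Mean known a priori: divide by n.
(define (fixed-mean-variance xs m)
  (/ (sum-sq-dev xs m) (length xs)))

(sample-variance '(1 2 3 4))         ; => 5/3
(fixed-mean-variance '(1 2 3 4) 5/2) ; => 5/4
```

Exact rational inputs are used here only so that the two results can be compared without rounding.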

8.4 Absolute Deviation

(absolute-deviation data mean)  (>=/c 0.0)
  data : sequence-of-real?
  mean : real?
(absolute-deviation data)  (>=/c 0.0)
  data : sequence-of-real?
Returns the absolute deviation of data relative to the given value of the mean, mean. If mean is not provided, it is calculated by a call to (mean data). This function is also useful if you want to calculate the absolute deviation about any value other than the mean, such as zero or the median.
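
As a small illustration of taking the deviation about an arbitrary value, the hypothetical helper below computes the mean of the absolute deviations of a list about a center c. The 1/n normalization is an assumption based on the conventional definition; it is not taken from the collection's source.

```scheme
#lang scheme

;; Hypothetical helper: mean absolute deviation of the list xs about c.
(define (absolute-deviation-sketch xs c)
  (/ (apply + (map (lambda (x) (abs (- x c))) xs))
     (length xs)))

(absolute-deviation-sketch '(1 2 3 4) 5/2) ; about the mean => 1
(absolute-deviation-sketch '(1 2 3 4) 0)   ; about zero     => 5/2
```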

8.5 Higher Moments (Skewness and Kurtosis)

(skew data mean sd)  real?
  data : sequence-of-real?
  mean : real?
  sd : (>=/c 0.0)
(skew data)  real?
  data : sequence-of-real?
Returns the skewness of data using the given values of the mean, mean, and standard deviation, sd. The skewness measures the asymmetry of the tails of a distribution. If mean and sd are not provided, they are calculated by a call to mean-and-variance.

(kurtosis data mean sd)  real?
  data : sequence-of-real?
  mean : real?
  sd : (>=/c 0.0)
(kurtosis data)  real?
  data : sequence-of-real?
Returns the kurtosis of data using the given values of the mean, mean, and standard deviation, sd. The kurtosis measures how sharply peaked a distribution is relative to its width. If mean and sd are not provided, they are calculated by a call to mean-and-variance.
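
For reference, the conventional definitions of these two moments are sketched below. They are an assumption, not taken from the collection's source; the subtraction of 3 in the kurtosis (so that a Gaussian distribution has kurtosis zero) is at least consistent with the small kurtosis value reported for Gaussian data in the example of Section 8.12.

```latex
\[
\mathrm{skew} = \frac{1}{N} \sum_{i=1}^{N}
  \left( \frac{x_i - \hat{\mu}}{\hat{\sigma}} \right)^{3},
\qquad
\mathrm{kurtosis} = \frac{1}{N} \sum_{i=1}^{N}
  \left( \frac{x_i - \hat{\mu}}{\hat{\sigma}} \right)^{4} - 3
\]
```

Here N is the number of data values, and the mean and standard deviation symbols correspond to the arguments mean and sd.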

8.6 Autocorrelation

(lag-1-autocorrelation data mean)  real?
  data : nonempty-sequence-of-real?
  mean : real?
(lag-1-autocorrelation data)  real?
  data : nonempty-sequence-of-real?
Returns the lag-1 autocorrelation of data using the given value of the mean, mean. If mean is not provided, it is calculated by a call to (mean data).
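
A minimal sketch of the quantity involved, assuming the common definition of the lag-1 autocorrelation (the sum of products of adjacent deviations divided by the sum of squared deviations). The helper lag-1-sketch is hypothetical and works on lists with at least two distinct values.

```scheme
#lang scheme

;; Hypothetical helper, assuming the common definition
;;   a1 = sum_{i=1..n-1} (x[i] - m)(x[i-1] - m) / sum_{i=0..n-1} (x[i] - m)^2
;; where m is the arithmetic mean of the list xs.
(define (lag-1-sketch xs)
  (let* ((m (/ (apply + xs) (length xs)))
         (ds (map (lambda (x) (- x m)) xs))
         ;; pair each deviation with its predecessor
         (num (apply + (map * (cdr ds) (drop-right ds 1))))
         (den (apply + (map * ds ds))))
    (/ num den)))

(lag-1-sketch '(1 2 3 4))   ; steadily rising   => 1/4
(lag-1-sketch '(1 -1 1 -1)) ; alternating signs => -3/4
```

A positive result indicates that neighboring values tend to lie on the same side of the mean; a negative result indicates that they tend to alternate.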

8.7 Covariance

(covariance data1 data2 mean1 mean2)  real?
  data1 : nonempty-sequence-of-real?
  data2 : nonempty-sequence-of-real?
  mean1 : real?
  mean2 : real?
(covariance data1 data2)  real?
  data1 : nonempty-sequence-of-real?
  data2 : nonempty-sequence-of-real?
Returns the covariance of data1 and data2 using the given values of mean1 and mean2. If the values of mean1 and mean2 are not given, they are calculated using calls to (mean data1) and (mean data2), respectively.

8.8 Correlation

(correlation data1 data2)  real?
  data1 : nonempty-sequence-of-real?
  data2 : nonempty-sequence-of-real?
Returns the Pearson correlation coefficient between data1 and data2.
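
The Pearson coefficient is the covariance of the two data sets divided by the product of their standard deviations. The sketch below uses hypothetical helpers (mean-of, pearson-sketch) on lists of equal length; the 1/(n - 1) factors cancel, so only sums of products of deviations are needed. It illustrates the definition and is not the collection's implementation.

```scheme
#lang scheme

;; Hypothetical helpers, not part of the science collection.
(define (mean-of xs)
  (/ (apply + xs) (length xs)))

;; Pearson r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2)),
;; where dx and dy are deviations from the respective means.
(define (pearson-sketch xs ys)
  (let* ((mx (mean-of xs))
         (my (mean-of ys))
         (dx (map (lambda (x) (- x mx)) xs))
         (dy (map (lambda (y) (- y my)) ys))
         (sxy (apply + (map * dx dy)))
         (sxx (apply + (map * dx dx)))
         (syy (apply + (map * dy dy))))
    (/ sxy (sqrt (* sxx syy)))))

(pearson-sketch '(1 2 3) '(2 4 6)) ; perfectly correlated      => 1
(pearson-sketch '(1 2 3) '(6 4 2)) ; perfectly anti-correlated => -1
```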

8.9 Weighted Samples

(weighted-mean weights data)  real?
  weights : sequence-of-real?
  data : sequence-of-real?
Returns the weighted mean of data using weights, weights.
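
As a quick illustration of the definition, the weighted mean is the sum of each value times its weight, divided by the sum of the weights. The helper weighted-mean-sketch is hypothetical; it takes the weights first, mirroring the argument order documented above, and works on lists of equal length.

```scheme
#lang scheme

;; Hypothetical helper: weighted mean = sum(w[i] * x[i]) / sum(w[i]).
(define (weighted-mean-sketch ws xs)
  (/ (apply + (map * ws xs))
     (apply + ws)))

(weighted-mean-sketch '(3 1) '(10 30)) ; => 15
(weighted-mean-sketch '(1 1) '(10 30)) ; equal weights give the plain mean => 20
```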

(weighted-variance weights data wmean)  (>=/c 0.0)
  weights : sequence-of-real?
  data : sequence-of-real?
  wmean : real?
(weighted-variance weights data)  (>=/c 0.0)
  weights : sequence-of-real?
  data : sequence-of-real?
Returns the weighted variance of data using weights, weights, and the given weighted mean, wmean. If wmean is not provided, it is calculated by a call to (weighted-mean weights data).

(weighted-standard-deviation weights data wmean)  (>=/c 0.0)
  weights : sequence-of-real?
  data : sequence-of-real?
  wmean : real?
(weighted-standard-deviation weights data)  (>=/c 0.0)
  weights : sequence-of-real?
  data : sequence-of-real?
Returns the weighted standard deviation of data using weights, weights. The result is the square root of the value returned by the corresponding call to weighted-variance.

(weighted-variance-with-fixed-mean weights data wmean)  (>=/c 0.0)
  weights : sequence-of-real?
  data : sequence-of-real?
  wmean : real?
Returns an unbiased estimate of the weighted variance of data using weights, weights, when the weighted population mean, wmean, of the underlying population is known a priori.

(weighted-standard-deviation-with-fixed-mean weights data wmean)  (>=/c 0.0)
  weights : sequence-of-real?
  data : sequence-of-real?
  wmean : real?
Returns the weighted standard deviation of data using weights, weights, with a fixed population mean, wmean. The result is the square root of the value returned by weighted-variance-with-fixed-mean.

(weighted-absolute-deviation weights data wmean)  (>=/c 0.0)
  weights : sequence-of-real?
  data : sequence-of-real?
  wmean : real?
(weighted-absolute-deviation weights data)  (>=/c 0.0)
  weights : sequence-of-real?
  data : sequence-of-real?
Returns the weighted absolute deviation of data using weights, weights, relative to the given value of the weighted mean, wmean. If wmean is not provided, it is calculated by a call to (weighted-mean weights data). This function is also useful if you want to calculate the weighted absolute deviation about any value other than the mean, such as zero or the weighted median.

(weighted-skew weights data wmean wsd)  (>=/c 0.0)
  weights : sequence-of-real?
  data : sequence-of-real?
  wmean : real?
  wsd : (>=/c 0.0)
(weighted-skew weights data)  (>=/c 0.0)
  weights : sequence-of-real?
  data : sequence-of-real?
Returns the weighted skewness of data using weights, weights, and the given values of the weighted mean, wmean, and weighted standard deviation, wsd. The skewness measures the asymmetry of the tails of a distribution. If wmean and wsd are not provided, they are calculated by calls to (weighted-mean weights data) and (weighted-standard-deviation weights data wmean).

(weighted-kurtosis weights data wmean wsd)  (>=/c 0.0)
  weights : sequence-of-real?
  data : sequence-of-real?
  wmean : real?
  wsd : (>=/c 0.0)
(weighted-kurtosis weights data)  (>=/c 0.0)
  weights : sequence-of-real?
  data : sequence-of-real?
Returns the weighted kurtosis of data using weights, weights, and the given values of the weighted mean, wmean, and weighted standard deviation, wsd. The kurtosis measures how sharply peaked a distribution is relative to its width. If wmean and wsd are not provided, they are calculated by calls to (weighted-mean weights data) and (weighted-standard-deviation weights data wmean).

8.10 Maximum and Minimum

(maximum data)  real?
  data : nonempty-sequence-of-real?
Returns the maximum value in data.

(minimum data)  real?
  data : nonempty-sequence-of-real?
Returns the minimum value in data.

(minimum-maximum data)  real? real?
  data : nonempty-sequence-of-real?
Returns the minimum and maximum values in data as multiple values.

(maximum-index data)  exact-nonnegative-integer?
  data : nonempty-sequence-of-real?
Returns the index of the maximum value in data. When there are several equal maximum elements, the index of the first one is chosen.

(minimum-index data)  exact-nonnegative-integer?
  data : nonempty-sequence-of-real?
Returns the index of the minimum value in data. When there are several equal minimum elements, the index of the first one is chosen.

(minimum-maximum-index data)  exact-nonnegative-integer? exact-nonnegative-integer?
  data : nonempty-sequence-of-real?
Returns the indices of the minimum and maximum values in data as multiple values. When there are several equal minimum or maximum elements, the index of the first one is chosen.
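
The tie-breaking rule described for the index functions (the first of several equal elements is chosen) can be sketched as follows. The helper first-max-index is hypothetical and works on a nonempty vector; it is an illustration of the rule, not the collection's implementation.

```scheme
#lang scheme

;; Hypothetical helper: index of the first maximum in a nonempty vector.
(define (first-max-index v)
  (let loop ((i 1) (best 0))
    (cond ((= i (vector-length v)) best)
          ;; a strict > keeps the earliest index when values tie
          ((> (vector-ref v i) (vector-ref v best)) (loop (+ i 1) i))
          (else (loop (+ i 1) best)))))

(first-max-index (vector 3 7 7 1)) ; => 1
```

Using a strict comparison is what makes the earlier of two equal maxima win; the analogous minimum search would use a strict <.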

8.11 Median and Quartiles

The median and quantile functions described in this section operate on sorted data. The contracts for these functions enforce this. Also, for convenience, we use quantiles, measured on a scale of 0 to 1, instead of percentiles, which use a scale of 0 to 100.

(median-from-sorted-data sorted-data)  real?
  sorted-data : nonempty-sorted-vector-of-real?
Returns the median value of sorted-data. When the dataset has an odd number of elements, n, the median is the value of element (n - 1)/2. When n is even, the median is the mean of the two nearest middle values, elements (n - 1)/2 and n/2. Elements are indexed from zero and the divisions are integer divisions (rounded down).

(quantile-from-sorted-data sorted-data f)  real?
  sorted-data : nonempty-sorted-vector-of-real?
  f : (real-in 0.0 1.0)
Returns a quantile value of sorted-data. The quantile is determined by the value f, a fraction between 0 and 1. For example, to compute the 75th percentile, f should have the value 0.75.

The quantile is found by interpolation using the formula:

quantile = (1 - delta) × x[i] + delta × x[i + 1]

where i is floor((n - 1) × f) and delta is (n - 1) × f - i.
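
The interpolation can be sketched in a few lines. The helper quantile-sketch below is hypothetical and assumes the usual linear interpolation between the two sorted elements x[i] and x[i + 1] that bracket position (n - 1) × f, with i = floor((n - 1) × f) and delta = (n - 1) × f - i. The special case for f = 1, where there is no element to the right, is a choice made for this sketch.

```scheme
#lang scheme

;; Hypothetical helper: quantile of a nonempty sorted vector by linear
;; interpolation at position (n - 1) * f, for f between 0 and 1.
(define (quantile-sketch sorted f)
  (let* ((n (vector-length sorted))
         (pos (* (- n 1) f))
         (i (inexact->exact (floor pos)))
         (delta (- pos i)))
    (if (= i (- n 1))
        ;; f = 1: position is the last element, nothing to interpolate
        (vector-ref sorted i)
        (+ (* (- 1 delta) (vector-ref sorted i))
           (* delta (vector-ref sorted (+ i 1)))))))

(define v (vector 1.0 2.0 3.0 4.0))
(quantile-sketch v 0.25) ; => 1.75
(quantile-sketch v 0.5)  ; => 2.5
```

For this even-length vector, the 0.5 quantile equals the mean of the two middle elements, which agrees with the description of the median above.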

8.12 Statistics Example

This example generates two vectors from a unit Gaussian distribution and a vector of cosine-squared weighting data. All of the vectors are of length 1,000. These data are used to exercise most of the statistics functions.

  #lang scheme
  (require (planet williams/science/random-distributions/gaussian)
           (planet williams/science/statistics)
           (planet williams/science/math))
  
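  ;; Sorts the vector data in place, in ascending order, by repeating
  ;; bubble-sort passes until a pass makes no swaps.
  (define (naive-sort! data)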
    (let loop ()
      (let ((n (vector-length data))
            (sorted? #t))
        (do ((i 1 (+ i 1)))
            ((= i n) data)
          (when (< (vector-ref data i)
                   (vector-ref data (- i 1)))
            (let ((t (vector-ref data i)))
              (vector-set! data i (vector-ref data (- i 1)))
              (vector-set! data (- i 1) t)
              (set! sorted? #f))))
        (unless sorted?
          (loop)))))
  
  (let ((data1 (make-vector 1000))
        (data2 (make-vector 1000))
        (w     (make-vector 1000)))
    (do ((i 0 (+ i 1)))
        ((= i 1000) (void))
  
      (vector-set! data1 i (random-unit-gaussian))
      (vector-set! data2 i (random-unit-gaussian))
  
      (vector-set! w i
        (expt (cos (- (* 2.0 pi (/ i 1000.0)) pi)) 2)))
    (printf "Statistics Example~n")
    (printf "                                mean = ~a~n"
            (mean data1))
    (printf "                            variance = ~a~n"
            (variance data1))
    (printf "                  standard deviation = ~a~n"
            (standard-deviation data1))
    (printf "                   variance from 0.0 = ~a~n"
            (variance-with-fixed-mean data1 0.0))
    (printf "         standard deviation from 0.0 = ~a~n"
            (standard-deviation-with-fixed-mean data1 0.0))
    (printf "                  absolute deviation = ~a~n"
            (absolute-deviation data1))
    (printf "         absolute deviation from 0.0 = ~a~n"
            (absolute-deviation data1 0.0))
    (printf "                                skew = ~a~n"
            (skew data1))
    (printf "                            kurtosis = ~a~n"
            (kurtosis data1))
    (printf "               lag-1 autocorrelation = ~a~n"
            (lag-1-autocorrelation data1))
    (printf "                          covariance = ~a~n"
            (covariance data1 data2))
    (printf "                       weighted mean = ~a~n"
            (weighted-mean w data1))
    (printf "                   weighted variance = ~a~n"
            (weighted-variance w data1))
    (printf "         weighted standard deviation = ~a~n"
            (weighted-standard-deviation w data1))
    (printf "          weighted variance from 0.0 = ~a~n"
            (weighted-variance-with-fixed-mean w data1 0.0))
    (printf "weighted standard deviation from 0.0 = ~a~n"
            (weighted-standard-deviation-with-fixed-mean w data1 0.0))
    (printf "         weighted absolute deviation = ~a~n"
            (weighted-absolute-deviation w data1))
    (printf "weighted absolute deviation from 0.0 = ~a~n"
            (weighted-absolute-deviation w data1 0.0))
    (printf "                       weighted skew = ~a~n"
            (weighted-skew w data1))
    (printf "                   weighted kurtosis = ~a~n"
            (weighted-kurtosis w data1))
    (printf "                             maximum = ~a~n"
            (maximum data1))
    (printf "                             minimum = ~a~n"
            (minimum data1))
    (printf "              index of maximum value = ~a~n"
            (maximum-index data1))
    (printf "              index of minimum value = ~a~n"
            (minimum-index data1))
    (naive-sort! data1)
    (printf "                              median = ~a~n"
            (median-from-sorted-data data1))
    (printf "                        10% quantile = ~a~n"
            (quantile-from-sorted-data data1 0.1))
    (printf "                        20% quantile = ~a~n"
            (quantile-from-sorted-data data1 0.2))
    (printf "                        30% quantile = ~a~n"
            (quantile-from-sorted-data data1 0.3))
    (printf "                        40% quantile = ~a~n"
            (quantile-from-sorted-data data1 0.4))
    (printf "                        50% quantile = ~a~n"
            (quantile-from-sorted-data data1 0.5))
    (printf "                        60% quantile = ~a~n"
            (quantile-from-sorted-data data1 0.6))
    (printf "                        70% quantile = ~a~n"
            (quantile-from-sorted-data data1 0.7))
    (printf "                        80% quantile = ~a~n"
            (quantile-from-sorted-data data1 0.8))
    (printf "                        90% quantile = ~a~n"
            (quantile-from-sorted-data data1 0.9)))

Produces the following output:

Statistics Example
                                mean = 0.03457693091555611
                            variance = 1.0285343857083435
                  standard deviation = 1.0141668431320083
                   variance from 0.0 = 1.028701415474174
         standard deviation from 0.0 = 1.014249188056946
                  absolute deviation = 0.7987180852601665
         absolute deviation from 0.0 = 0.7987898146946209
                                skew = 0.04340293467117837
                            kurtosis = 0.17722452271702993
               lag-1 autocorrelation = 0.0029930889831972143
                          covariance = 0.005782911085590894
                       weighted mean = 0.05096139259270008
                   weighted variance = 1.0500293763787367
         weighted standard deviation = 1.0247094107007786
          weighted variance from 0.0 = 1.0510513958491579
weighted standard deviation from 0.0 = 1.0252079768755011
         weighted absolute deviation = 0.8054378524718832
weighted absolute deviation from 0.0 = 0.8052440544958938
                       weighted skew = 0.046448729539282155
                   weighted kurtosis = 0.3050060704791675
                             maximum = 3.731148814104969
                             minimum = -3.327265864298485
              index of maximum value = 502
              index of minimum value = 476
                              median = 0.019281803306206644
                        10% quantile = -1.243869878615807
                        20% quantile = -0.7816243947573505
                        30% quantile = -0.4708703241429585
                        40% quantile = -0.2299309332835332
                        50% quantile = 0.019281803306206644
                        60% quantile = 0.30022966479982344
                        70% quantile = 0.5317978807508836
                        80% quantile = 0.832291888537874
                        90% quantile = 1.3061151234700463