Version: 4.1.1

8Statistics

This chapter describes the statistical functions provided by the PLT Scheme Science Collection. The basic statistical functions include functions to compute the mean, variance, and standard deviation, More advanced functions allow you to calculate absolute deviation, skewness, and kurtosis, as well as the median and arbitrary percentiles. The algorithms use recurrance relations to compute average quantities in a stable way, without large intermediate values that might overflow.

The functions described in this chapter are defined in the "statistics.ss" file in the science collection and are made available using the form:

 (require (planet williams/science/statistics))

8.1Mean, Standard Deviation, and Variance

 (mean data) → real? data : (vectorof real?)

Returns the arithmetic mean of "data".

 (variance data mu) → (>=/c 0.0) data : (vectorof real?) mu : real? (variance data) → (>=/c 0.0) data : (vectorof real?)

Returns the sample variance of data. If mu is not provided, it is calculated by a call to (mean data).

 (standard-deviation data mu) → (>=/c 0.0) data : (vectorof real?) mu : real? (standard-deviation data) → (>=/c 0.0) data : (vectorof real?)

Returns the standard deviation of data. The standard deviation is defined as the square root of the variance. The result is the square root of the corresponding variance function.

 (variance-with-fixed-mean data mu) → (>=/c 0.0) data : (vectorof real?) mu : real?

Returns an unbiased estimate of the variance of data when the population mean mu of the underlying distribution is known a priori.

 (standard-deviation-with-fixed-mean data mu) → (>=/c 0.0)

data : (vectorof real?)

mu : real?

Returns the standard deviation of data with a fixed population mean mu. The result is the square root of the variance-with-fixed-mean function.

8.2Absolute Deviation

 (absolute-deviation data mu) → (>=/c 0.0) data : (vectorof real?) mu : real? (absolute-deviation data) → (>=/c 0.0) data : (vectorof real?)

Returns the absolute devistion of data relative to the given value of the mean mu. If mu is not provided, it is calculated by a call to (mean data). This function is also useful if you want to calculate the absolute deviation to any value other than the mean, such as zero or the median.

8.3Higher Moments (Skewness and Kurtosis)

 (skew data mu sigma) → real? data : (vectorof real?) mu : real? sigma : (>=/c 0.0) (skew data) → real? data : (vectorof real?)

Returns the skewness of data using the given values of the mean mu and standard deviation sigma. The skewness measures the symmetry of the tails of a distribution. If mu and sigma are not provided, they are calculated by calls to (mean data) and (standard-deviation data mu).

 (kurtosis data mu sigma) → real? data : (vectorof real?) mu : real? sigma : (>=/c 0.0) (kurtosis data) → real? data : (vectorof real?)

Returns the kurtosis of data using the given values of the mean mu and standard deviation sigma. The kurtosis measures how sharply peaked a distribution is relative to its width. If mu and sigma are not provided, they are calculated by calls to (mean data) and (standard-deviation data mu).

8.4Autocorrelation

(lag-1-autocorrelation data mu sigma)  real?

data

:

 (and/c (vectorof real?) (lambda (x) (> (vector-length data) 0)))

mu : real?

sigma : (>=/c 0.0)

(lag-1-autocorrelation data)  real?

data

:

 (and/c (vectorof real?) (lambda (x) (> (vector-length data) 0)))

Returns the lag-1 autocorrelation of data using the given value of the mean mu. If mu is not provided, it is calculated by a call to (mean data).

8.5Covariance

(covariance data1 data2 mu1 mu2)  real?

data1 : (vectorof real?)

data2

:

 (and/c (vectorof real?) (lambda (x) (= (vector-length data1) (vector-length data2))))

mu1 : real?

mu2 : real?

(covariance data1 data2)  real?

data1 : (vectorof real?)

data2

:

 (and/c (vectorof real?) (lambda (x) (= (vector-length data1) (vector-length data2))))

Returns the covariance of data1 and data2, which must both be the same length, using the given values of mu1 and mu2. If the values of mu1 and mu2 are not given, they are calculated using calls to (mean data1) and (mean data2), respectively.

 (covariance-with-fixed-means data1 data2 mu1 mu2) → real?

data1 : (vectorof real?)

data2

:

 (and/c (vectorof real?) (lambda (x) (= (vector-length data1) (vector-length data2))))

mu1 : real?

mu2 : real?

Returns the covariance of data1 and data2, which must both be the same length, when the population means mu1 and mu2 of the underlying distributions are known a priori.

8.6Weighted Samples

(weighted-mean w data)  real?

w : (vectorof real?)

data

:

 (and/c (vectorof real?) (lambda (x) (= (vector-length w) (vector-length data))))

Returns the weighted mean of data using weights w.

(weighted-variance w data wmu)  (>=/c 0.0)

w : (vectorof real?)

data

:

 (and/c (vectorof real?) (lambda (x) (= (vector-length w) (vector-length data))))

wmu : real?

(weighted-variance w data)  (>=/c 0.0)

w : (vectorof real?)

data

:

 (and/c (vectorof real?) (lambda (x) (= (vector-length w) (vector-length data))))

Returns the weighted variance of data using weights w and the given weighted mean wmu. If wmu is not provided, it is calculated by a call ro (weighted-mean w data).

(weighted-standard-deviation w data wmu)  (>=/c 0.0)

w : (vectorof real?)

data

:

 (and/c (vectorof real?) (lambda (x) (= (vector-length w) (vector-length data))))

wmu : real?

(weighted-standard-deviation w data)  (>=/c 0.0)

w : (vectorof real?)

data

:

 (and/c (vectorof real?) (lambda (x) (= (vector-length w) (vector-length data))))

Returns the weighted standard deviation of data using weights w. The standard deviation is defined as the square root of the variance. The result is the square root of the corresponding weighted-variance function.

 (weighted-variance-with-fixed-mean w data wmu) → (>=/c 0.0)

w : (vectorof real?)

data

:

 (and/c (vectorof real?) (lambda (x) (= (vector-length w) (vector-length data))))

wmu : real?

Returns an unbiased estimate of the weighted variance of data using weights w when the weighted population mean wmu of the underlying population is known a priori.

 (weighted-standard-deviation-with-fixed-mean w data wmu) → (>=/c 0.0)

w : (vectorof real?)

data

:

 (and/c (vectorof real?) (lambda (x) (= (vector-length w) (vector-length data))))

wmu : real?

Returns the weighted standard deviation of data using weights w with a fixed population mean mu. The result is the square root of the weighted-variance-with-fixed-mean function.

(weighted-absolute-deviation w data wmu)  (>=/c 0.0)

w : (vectorof real?)

data

:

 (and/c (vectorof real?) (lambda (x) (= (vector-length w) (vector-length data))))

wmu : real?

(weighted-absolute-deviation w data)  (>=/c 0.0)

w : (vectorof real?)

data

:

 (and/c (vectorof real?) (lambda (x) (= (vector-length w) (vector-length data))))

Returns the weighted absolute devistion of data using weights w relative to the given value of the weighted mean wmu. If wmu is not provided, it is calculated by a call to (weighted-mean w data). This function is also useful if you want to calculate the weighted absolute deviation to any value other than the mean, such as zero or the weighted median.

(weighted-skew w data wmu wsigma)  (>=/c 0.0)

w : (vectorof real?)

data

:

 (and/c (vectorof real?) (lambda (x) (= (vector-length w) (vector-length data))))

wmu : real?

wsigma : (>=/c 0.0)

(weighted-skew w data)  (>=/c 0.0)

w : (vectorof real?)

data

:

 (and/c (vectorof real?) (lambda (x) (= (vector-length w) (vector-length data))))

Returns the weighted skewness of data using weights w using the given values of the weighted mean wmu and weighted standard deviation wsigma. The skewness measures the symmetry of the tails of a distribution. If wmu and wsigma are not provided, they are calculated by calls to (weighted-mean w data) and (weighted-standard-deviation w data wmu).

(weighted-kurtosis w data wmu wsigma)  (>=/c 0.0)

w : (vectorof real?)

data

:

 (and/c (vectorof real?) (lambda (x) (= (vector-length w) (vector-length data))))

wmu : real?

wsigma : (>=/c 0.0)

(weighted-kurtosis w data)  (>=/c 0.0)

w : (vectorof real?)

data

:

 (and/c (vectorof real?) (lambda (x) (= (vector-length w) (vector-length data))))

Returns the weighted kurtosis of data using weights w using the given values of the weighted mean wmu and weighted standard deviation wsigma. The kurtosis measures how sharply peaked a distribution is relative to its width. If wmu and wsigma are not provided, they are calculated by calls to (weighted-mean w data) and (weighted-standard-deviation w data wmu).

8.7Maximum and Minimum

(maximum data)  real?

data

:

 (and/c (vectorof real?) (lambda (x) (> (vector-length data) 0)))

Returns the maximum value in data.

(minimum data)  real?

data

:

 (and/c (vectorof real?) (lambda (x) (> (vector-length data) 0)))

Returns the minimum value in data.

(minimum-maximum data)

 real? real?

data

:

 (and/c (vectorof real?) (lambda (x) (> (vector-length data) 0)))

Returns the minimum and maximum values on data as multiple values.

(maximum-index data)  natural-number/c

data

:

 (and/c (vectorof real?) (lambda (x) (> (vector-length data) 0)))

Returns the index of the maximum value in data. When there are several equal maximum elements, the index of the first one is chosen.

(minimum-index data)  natural-number/c

data

:

 (and/c (vectorof real?) (lambda (x) (> (vector-length data) 0)))

Returns the index of the minimum value in data. When there are several equal minimum elements, the index of the first one is chosen.

(minimum-maximum-index data)

 natural-number/c natural-number/c

data

:

 (and/c (vectorof real?) (lambda (x) (> (vector-length data) 0)))

Returns the indices of the minimum and maximum values in data as multiple values. When there are several equal minimum or maximum elements, the index of the first ones are chosen.

8.8Median and Percentiles

Thw median and percentile functions described in this section operate on sorted data. The contracts for these functions enforce this. Also, for convenience we use quantiles measured on a scale of 0 to 1 instead of percentiled, which ise a scale of 0 to 100).

(median-from-sorted-data sorted-data)  real?

sorted-data

:

 (and/c (vectorof real?) (lambda (x) (> (vector-length sorted-data) 0)) sorted?)

Returns the median value of sorted-data. When the dataset has an odd number of elements, the median is the value of element (n - 1)/2. When the dataset has an even number of elements, the median is the mean of the two nearest middle values, elements (n - 1)/2 and n/2.

(percentile-from-sorted-data sorted-data f)  real?

sorted-data

:

 (and/c (vectorof real?) (lambda (x) (> (vector-length sorted-data) 0)) sorted?)

f : (real-in 0.0 1.0)

Returns a quantile value of sorted-data. The quantile is determined by the value f, a fraction between 0 and 1. For example to compute the 75th percentile, f should have the value 0.75.

The quantile is found by interpolation using the formula:

quantile = 1 - delta(x[i]) + delta(x(i + 1))

where i is floor((n - 1) × f) and delta is (n - 1) × f - 1.

8.9Statistics Example

This example generates two vectors from a unit Gaussian distribution and a vector of conse squared weighting data. All of the vectors are of length 1,000. Thes data are used to test all of the statistics functions.

 (require (planet williams/science/random-distributions/gaussian) (planet williams/science/statistics) (planet williams/science/math)) (define (naive-sort! data) (let loop () (let ((n (vector-length data)) (sorted? #t)) (do ((i 1 (+ i 1))) ((= i n) data) (when (< (vector-ref data i) (vector-ref data (- i 1))) (let ((t (vector-ref data i))) (vector-set! data i (vector-ref data (- i 1))) (vector-set! data (- i 1) t) (set! sorted? #f)))) (unless sorted? (loop))))) (let ((data1 (make-vector 1000)) (data2 (make-vector 1000)) (w     (make-vector 1000))) (do ((i 0 (+ i 1))) ((= i 1000) (void)) (vector-set! data1 i (random-unit-gaussian)) (vector-set! data2 i (random-unit-gaussian)) (vector-set! w i (expt (cos (- (* 2.0 pi (/ i 1000.0)) pi)) 2))) (printf "Statistics Example~n") (printf "                                mean = ~a~n" (mean data1)) (printf "                            variance = ~a~n" (variance data1)) (printf "                  standard deviation = ~a~n" (standard-deviation data1)) (printf "                   variance from 0.0 = ~a~n" (variance-with-fixed-mean data1 0.0)) (printf "         standard deviation from 0.0 = ~a~n" (standard-deviation-with-fixed-mean data1 0.0)) (printf "                  absolute deviation = ~a~n" (absolute-deviation data1)) (printf "         absolute deviation from 0.0 = ~a~n" (absolute-deviation data1 0.0)) (printf "                                skew = ~a~n" (skew data1)) (printf "                            kurtosis = ~a~n" (kurtosis data1)) (printf "               lag-1 autocorrelation = ~a~n" (lag-1-autocorrelation data1)) (printf "                          covariance = ~a~n" (covariance data1 data2)) (printf "                       weighted mean = ~a~n" (weighted-mean w data1)) (printf "                   weighted variance = ~a~n" (weighted-variance w data1)) (printf "         weighted standard deviation = ~a~n" (weighted-standard-deviation w data1)) (printf "          weighted variance from 0.0 = ~a~n" (weighted-variance-with-fixed-mean w data1 0.0)) (printf "weighted standard deviation from 0.0 = ~a~n" (weighted-standard-deviation-with-fixed-mean w data1 0.0)) (printf "         weighted absolute deviation = ~a~n" (weighted-absolute-deviation w data1)) (printf "weighted absolute deviation from 0.0 = ~a~n" (weighted-absolute-deviation w data1 0.0)) (printf "                       weighted skew = ~a~n" (weighted-skew w data1)) (printf "                   weighted kurtosis = ~a~n" (weighted-kurtosis w data1)) (printf "                             maximum = ~a~n" (maximum data1)) (printf "                             minimum = ~a~n" (minimum data1)) (printf "              index of maximum value = ~a~n" (maximum-index data1)) (printf "              index of minimum value = ~a~n" (minimum-index data1)) (naive-sort! data1) (printf "                              median = ~a~n" (median-from-sorted-data data1)) (printf "                        10% quantile = ~a~n" (quantile-from-sorted-data data1 0.1)) (printf "                        20% quantile = ~a~n" (quantile-from-sorted-data data1 0.2)) (printf "                        30% quantile = ~a~n" (quantile-from-sorted-data data1 0.3)) (printf "                        40% quantile = ~a~n" (quantile-from-sorted-data data1 0.4)) (printf "                        50% quantile = ~a~n" (quantile-from-sorted-data data1 0.5)) (printf "                        60% quantile = ~a~n" (quantile-from-sorted-data data1 0.6)) (printf "                        70% quantile = ~a~n" (quantile-from-sorted-data data1 0.7)) (printf "                        80% quantile = ~a~n" (quantile-from-sorted-data data1 0.8)) (printf "                        90% quantile = ~a~n" (quantile-from-sorted-data data1 0.9)))

Produces the following output:

 Statistics Example mean = 0.03457693091555611 variance = 1.0285343857083422 standard deviation = 1.0141668431320077 variance from 0.0 = 1.028701415474174 standard deviation from 0.0 = 1.014249188056946 absolute deviation = 0.7987180852601665 absolute deviation from 0.0 = 0.7987898146946209 skew = 0.043402934671178436 kurtosis = 0.17722452271704014 lag-1 autocorrelation = 0.0029930889831972143 covariance = 0.005782911085590894 weighted mean = 0.05096139259270008 weighted variance = 1.0500293763787367 weighted standard deviation = 1.0247094107007786 weighted variance from 0.0 = 1.0510513958491579 weighted standard deviation from 0.0 = 1.0252079768755011 weighted absolute deviation = 0.8054378524718832 weighted absolute deviation from 0.0 = 0.8052440544958938 weighted skew = 0.046448729539282155 weighted kurtosis = 0.3050060704791675 maximum = 3.731148814104969 minimum = -3.327265864298485 index of maximum value = 502 index of minimum value = 476 median = 0.019281803306206644 10% quantile = -1.243869878615807 20% quantile = -0.7816243947573505 30% quantile = -0.4708703241429585 40% quantile = -0.2299309332835332 50% quantile = 0.019281803306206644 60% quantile = 0.30022966479982344 70% quantile = 0.5317978807508836 80% quantile = 0.832291888537874 90% quantile = 1.3061151234700463