1 Basic statistical functions

When a vector represents numerical data, there are a number of standard functions that will be useful for any statistical calculations:

  • sum sums all values in the vector
  • mean computes the sample mean, i.e. \(\bar{x}=\frac{1}{n} \sum_{i=1}^n x_i\)
  • median computes the sample median
  • min and max compute the sample minimum and maximum
  • range returns the sample minimum and maximum as a vector of length two; the difference between them can be obtained with diff(range(x))
  • var and sd compute the sample variance and standard deviation, i.e. \(s^2=\frac{1}{n-1} \sum_{i=1}^n (x_i-\bar{x})^2\) and \(s=\sqrt{s^2}\)
  • quantile computes the min, lower quartile, median, upper quartile, and max. Other quantiles can be computed using the probs argument
  • summary calculates the min, lower quartile, median, mean, upper quartile, and max.

For illustration, consider these 54 measurements of leaf biomass:

leafbiomass <- c(0.430, 0.400, 0.450, 0.820, 0.520, 1.320, 0.900, 1.180, 0.480, 0.210, 
                 0.270, 0.310, 0.650, 0.180, 0.520, 0.300, 0.580, 0.480, 0.580, 0.580,
                 0.410, 0.480, 1.760, 1.210, 1.180, 0.830, 1.220, 0.770, 1.020, 0.130,
                 0.680, 0.610, 0.700, 0.820, 0.760, 0.770, 1.690, 1.480, 0.740, 1.240,
                 1.120, 0.750, 0.390, 0.870, 0.410, 0.560, 0.550, 0.670, 1.260, 0.965,
                 0.840, 0.970, 1.070, 1.220)
mean(leafbiomass) ## compute the sample mean
## [1] 0.7649074

We can check the mean function is working by using sum and length to directly calculate \(\frac{1}{n} \sum_{i=1}^n x_i\):

sum(leafbiomass)/length(leafbiomass)
## [1] 0.7649074

Similarly, for the standard deviation:

sd(leafbiomass)
## [1] 0.3780717
sqrt(sum((leafbiomass-mean(leafbiomass))^2)/(length(leafbiomass)-1))
## [1] 0.3780717

The other functions are used in the same way:

min(leafbiomass)
## [1] 0.13
median(leafbiomass)
## [1] 0.72
quantile(leafbiomass)
##     0%    25%    50%    75%   100% 
## 0.1300 0.4800 0.7200 1.0075 1.7600
summary(leafbiomass)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1300  0.4800  0.7200  0.7649  1.0075  1.7600
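
The functions not demonstrated above can be tried on a small made-up vector. The values in x below are purely illustrative (they are not part of the leaf biomass data) and were chosen so that the results are round numbers. This sketch shows that range returns the minimum and maximum as a pair, that var is the square of sd, and how the probs argument of quantile selects other quantiles:

```r
x <- c(2, 4, 4, 5, 10)   # illustrative values, not real data

range(x)                 # minimum and maximum as a vector of length two
diff(range(x))           # maximum minus minimum
var(x)                   # sample variance, with divisor n - 1
sd(x)^2                  # squaring the standard deviation recovers the variance
quantile(x, probs = c(0.1, 0.9))   # 10% and 90% quantiles (default interpolation)
```

Here the sample mean is 5 and the squared deviations sum to 36, so the variance is 36/4 = 9 and the standard deviation is 3.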

R Help: sum, mean, median, min, max, range, sd, var, quantile, summary

1.1 Distributions

Unsurprisingly, R provides a range of functions to support calculations with standard probability distributions. There are a large number of probability distributions available, but we will only need a few. To see which distributions are available, use the command help(Distributions).

For every distribution there are four functions. The functions for each distribution begin with a particular letter to indicate the functionality:

Letter Function
“d” evaluates the probability density (or mass) function, \(f(x)\)
“p” evaluates the cumulative distribution function, \(F(x)=P[X \leq x]\), i.e. the probability that the specified random variable is less than or equal to the given argument.
“q” evaluates the inverse cumulative distribution function (quantile function), \(F^{-1}(q)\), i.e. the value \(x\) such that \(P[X \leq x] = q\). Used to obtain critical values associated with particular probabilities \(q\).
“r” generates random numbers

The appropriate functions for common distributions are given below, along with their parameter arguments.

  • Normal distribution: dnorm, pnorm, qnorm, rnorm. Parameters: mean (\(\mu\)) and sd (\(\sigma\)).
  • \(t\) distribution: dt, pt, qt, rt. Parameter: df.
  • \(\chi^2\) distribution: dchisq, pchisq, qchisq, rchisq. Parameter: df.
  • Binomial: dbinom, pbinom, qbinom, rbinom. Parameters: size (\(n\)) and prob (\(p\)).
  • Poisson: dpois, ppois, qpois, rpois. Parameter: lambda (\(\lambda\)).
  • Uniform: dunif, punif, qunif, runif. Parameters: min and max.
  • Beta: dbeta, pbeta, qbeta, rbeta. Parameters: shape1 (\(a\)), shape2 (\(b\)).
  • Gamma: dgamma, pgamma, qgamma, rgamma. Parameters: shape (\(\alpha\)), rate (\(\beta\)).

R also has distribution functions for the test statistics of the rank sum test (qwilcox etc.) and the signed rank test (qsignrank etc.). See Practical 6 for more information on how to use these.

We illustrate the four types of functions below in the context of the Normal distribution, but you can substitute any of the distributions and functions listed above for the Normal (though remember to change the parameters).
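
As a sketch of how the same four prefixes carry over to another distribution, here they are applied to a Binomial. The values size = 10 and prob = 0.5 are illustrative assumptions, not taken from any data set in these notes:

```r
# Binomial with n = 10 trials and success probability p = 0.5 (illustrative values)
dbinom(3, size = 10, prob = 0.5)     # probability mass, P[X = 3]
pbinom(3, size = 10, prob = 0.5)     # cumulative probability, P[X <= 3]
qbinom(0.5, size = 10, prob = 0.5)   # smallest x with P[X <= x] >= 0.5
rbinom(5, size = 10, prob = 0.5)     # five random draws (these change on every run)

# For a discrete distribution, the "p" function is the running sum of the "d" function
sum(dbinom(0:3, size = 10, prob = 0.5))
```

The last line should agree with the pbinom call, since \(P[X \leq 3] = \sum_{k=0}^{3} P[X = k]\).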

R Help: Available distributions

1.1.1 Density functions

We begin with the density function, dnorm. Given a set of values, it returns the value of the Normal pdf at each point. If you only give the points, it assumes a mean of zero and a standard deviation of one, i.e. the standard Normal pdf \(\phi(z)\). To use different values for the mean and standard deviation, specify them in the optional mean and sd arguments:

dnorm(0)
## [1] 0.3989423
dnorm(-3:3)
## [1] 0.004431848 0.053990967 0.241970725 0.398942280 0.241970725 0.053990967
## [7] 0.004431848
dnorm(20, mean=10, sd=5)
## [1] 0.01079819
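
In the same spirit as the earlier check of mean against sum and length, dnorm can be checked against the Normal pdf written out by hand, \(f(x)=\frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)\). The variable names x, mu and sigma below are chosen just for this sketch:

```r
x     <- 20   # point at which to evaluate the density
mu    <- 10   # mean
sigma <- 5    # standard deviation

dnorm(x, mean = mu, sd = sigma)                              # built-in density
exp(-(x - mu)^2 / (2 * sigma^2)) / (sigma * sqrt(2 * pi))    # the same value from the formula
```

Both lines should print the value 0.01079819 obtained above.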

1.1.2 Cumulative distribution functions

The second type of function is pnorm, which evaluates the cumulative distribution function of a Normal distribution. Given a number or a vector of numbers, it computes the probability that a normally distributed random number will be less than or equal to each value. It accepts the same options as dnorm and defaults to the standard Normal, i.e. \(\Phi(z)\):

pnorm(0) ## should be 0.5
## [1] 0.5
pnorm(1.96) ## should be ~0.975
## [1] 0.9750021
pnorm(20, mean=10, sd=5)
## [1] 0.9772499

pnorm (and all the “p” functions) is particularly useful when computing \(p\)-values in significance tests.

To find the probability that a random number is larger than the given number, i.e. \(1-F(x)\) rather than \(F(x)\), set the lower.tail option to FALSE:

pnorm(0,lower.tail=FALSE)
## [1] 0.5
pnorm(1,lower.tail=FALSE)
## [1] 0.1586553
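
As a minimal sketch of how these tail probabilities become \(p\)-values, suppose a test statistic is standard Normal under the null hypothesis. The observed value z = 1.96 is a hypothetical number chosen for illustration:

```r
z <- 1.96   # hypothetical observed value of a standard Normal test statistic

pnorm(z, lower.tail = FALSE)            # one-sided p-value, P[Z > z]
2 * pnorm(abs(z), lower.tail = FALSE)   # two-sided p-value, P[|Z| > |z|]

# Probability of falling in an interval, P[a < X <= b] = F(b) - F(a)
pnorm(1) - pnorm(-1)
```

The two-sided value relies on the symmetry of the Normal distribution about its mean, so this doubling would not carry over unchanged to a skewed distribution such as the \(\chi^2\).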

1.1.3 Inverse cumulative distribution functions

The next type of function is qnorm, which is the inverse of pnorm, so qnorm is \(F^{-1}(q)\). The idea behind qnorm is that you give it a probability \(q\), and it returns the number \(x\) such that \(F(x) = P[X \leq x] = q\). This is particularly useful for finding the critical values of a distribution associated with a particular significance level:

qnorm(0.975) ## should be about 1.96
## [1] 1.959964
qnorm(0.5) ## should be 0
## [1] 0
qnorm(0.25,mean=2,sd=2)
## [1] 0.6510205
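
A short sketch of the critical-value use, assuming a significance level of 5% (the name alpha is chosen just for this example). It also confirms that qnorm and pnorm undo one another when given the same mean and sd:

```r
alpha <- 0.05   # illustrative significance level

qnorm(c(alpha / 2, 1 - alpha / 2))   # two-sided critical values, about -1.96 and 1.96
qnorm(alpha, lower.tail = FALSE)     # upper one-sided critical value

# qnorm undoes pnorm, provided the same mean and sd are used in both calls
pnorm(qnorm(0.25, mean = 2, sd = 2), mean = 2, sd = 2)
```

The final line should return the probability 0.25 that was passed in.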

1.1.4 Random number generation

The last type of function we examine is rnorm, which generates normally distributed random numbers. Its first argument is the number of random numbers to generate, and it has optional arguments to specify the mean and standard deviation:

rnorm(4)
## [1]  0.2167549 -0.5424926  0.8911446  0.5959806
rnorm(4,mean=3,sd=10)
## [1] 19.3561800  9.8927544 -9.8124663  0.8685548
mean(rnorm(1000,mean=3))
## [1] 2.96563
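
Because rnorm draws new random numbers on every call, rerunning the commands above will give values different from those printed here. If reproducible results are needed, one option is to fix the random number seed with set.seed before generating. The seed value 42 below is an arbitrary choice:

```r
set.seed(42)      # 42 is an arbitrary seed; any integer works
a <- rnorm(4)

set.seed(42)      # resetting the same seed restarts the same sequence
b <- rnorm(4)

identical(a, b)   # TRUE: both calls produced the same four numbers
```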