When a vector represents numerical data, a number of standard functions are useful for statistical calculations:

- `sum` sums all values in the vector.
- `mean` computes the sample mean, i.e. \(\bar{x}=\frac{1}{n} \sum_{i=1}^n x_i\).
- `median` computes the sample median.
- `min` and `max` compute the sample minimum and maximum.
- `range` returns the sample minimum and maximum as a vector of length two; for a vector `x`, the difference between them is `diff(range(x))`.
- `var` and `sd` compute the sample variance, \(s^2=\frac{1}{n-1} \sum_{i=1}^n (x_i-\bar{x})^2\), and the sample standard deviation, \(s=\sqrt{s^2}\).
- `quantile` computes the min, max, median, and lower and upper quartiles. Other quantiles can be computed using the `probs` argument.
- `summary` calculates the min, max, mean, median, and lower and upper quartiles.

For illustration, consider these 54 measurements of leaf biomass:
leafbiomass <- c(0.430, 0.400, 0.450, 0.820, 0.520, 1.320, 0.900, 1.180, 0.480, 0.210,
0.270, 0.310, 0.650, 0.180, 0.520, 0.300, 0.580, 0.480, 0.580, 0.580,
0.410, 0.480, 1.760, 1.210, 1.180, 0.830, 1.220, 0.770, 1.020, 0.130,
0.680, 0.610, 0.700, 0.820, 0.760, 0.770, 1.690, 1.480, 0.740, 1.240,
1.120, 0.750, 0.390, 0.870, 0.410, 0.560, 0.550, 0.670, 1.260, 0.965,
0.840, 0.970, 1.070, 1.220)
mean(leafbiomass) ## compute the sample mean
## [1] 0.7649074
We can check the `mean` function is working by using `sum` and `length` to directly calculate \(\frac{1}{n} \sum_{i=1}^n x_i\):
sum(leafbiomass)/length(leafbiomass)
## [1] 0.7649074
Similarly, for the standard deviation:
sd(leafbiomass)
## [1] 0.3780717
sqrt(sum((leafbiomass-mean(leafbiomass))^2)/(length(leafbiomass)-1))
## [1] 0.3780717
The other functions are fairly straightforward:
min(leafbiomass)
## [1] 0.13
median(leafbiomass)
## [1] 0.72
quantile(leafbiomass)
## 0% 25% 50% 75% 100%
## 0.1300 0.4800 0.7200 1.0075 1.7600
summary(leafbiomass)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1300 0.4800 0.7200 0.7649 1.0075 1.7600
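The functions `range`, `var` and `quantile` with a `probs` argument are not demonstrated above. The following is a minimal sketch on a small made-up vector `x` (hypothetical values, not the leaf biomass data), chosen so that the results are easy to check by hand. Note that `range` returns the minimum and maximum as a pair, so the spread between them needs `diff`.

```r
# Small made-up data set (hypothetical values, chosen for easy arithmetic)
x <- c(2, 4, 4, 5, 7, 9)

range(x)          # minimum and maximum: 2 9
diff(range(x))    # maximum minus minimum: 7

var(x)            # sample variance (divides by n - 1)
sd(x)^2           # same value, since sd is the square root of var

# Quantiles other than the quartiles, via the probs argument
quantile(x, probs = c(0.1, 0.9))
```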
R Help: sum, mean, median, min, max, range, sd, var, quantile, summary
Unsurprisingly, R provides a range of functions to support calculations with standard probability distributions. A large number of probability distributions are available, but we will only need a few. To see which distributions are available, use the command `help(Distributions)`.
For every distribution there are four functions. The functions for each distribution begin with a particular letter to indicate the functionality:
Letter | Function |
---|---|
“d” | evaluates the probability density (or mass) function, \(f(x)\) |
“p” | evaluates the cumulative distribution function, \(F(x)=P[X \le x]\), i.e. the probability that the specified random variable is less than or equal to the given argument. |
“q” | evaluates the inverse cumulative distribution function (quantile function), \(F^{-1}(q)\), i.e. the value \(x\) such that \(P[X \le x] = q\). Used to obtain critical values associated with particular probabilities \(q\). |
“r” | generates random numbers |
The appropriate functions for common distributions are given below, along with their parameter arguments.

- Normal: `dnorm`, `pnorm`, `qnorm`, `rnorm`. Parameters: `mean` (\(\mu\)) and `sd` (\(\sigma\)).
- Student's \(t\): `dt`, `pt`, `qt`, `rt`. Parameter: `df`.
- Chi-squared: `dchisq`, `pchisq`, `qchisq`, `rchisq`. Parameter: `df`.
- Binomial: `dbinom`, `pbinom`, `qbinom`, `rbinom`. Parameters: `size` (\(n\)) and `prob` (\(p\)).
- Poisson: `dpois`, `ppois`, `qpois`, `rpois`. Parameter: `lambda` (\(\lambda\)).
- Uniform: `dunif`, `punif`, `qunif`, `runif`. Parameters: `min` and `max`.
- Beta: `dbeta`, `pbeta`, `qbeta`, `rbeta`. Parameters: `shape1` (\(a\)) and `shape2` (\(b\)).
- Gamma: `dgamma`, `pgamma`, `qgamma`, `rgamma`. Parameters: `shape` (\(\alpha\)) and `rate` (\(\beta\)).

R also has distribution functions for the test statistics of the rank sum test (`qwilcox` etc.) and the signed rank test (`qsignrank`). See Practical 6 for more information on how to use these.
We illustrate the four types of functions below in the context of the Normal distribution, but you can replace the Normal distribution with any of the distributions listed above (though remember to change the parameters).
R Help: Available distributions
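As a brief sketch of how the same four prefixes carry over to other distributions, the calls below use the binomial, Poisson, \(t\) and uniform functions. The parameter values and variable names are arbitrary choices for illustration only.

```r
# Binomial: P[X = 3] and P[X <= 3] for 10 trials with success probability 0.5
b_mass <- dbinom(3, size = 10, prob = 0.5)
b_cum  <- pbinom(3, size = 10, prob = 0.5)

# Poisson: P[X <= 2] when lambda = 3
pois_cum <- ppois(2, lambda = 3)

# t distribution: upper 2.5% critical value with 10 degrees of freedom
t_crit <- qt(0.975, df = 10)

# Uniform: five random numbers between 0 and 10
u <- runif(5, min = 0, max = 10)

print(c(b_mass, b_cum, pois_cum, t_crit))
print(u)
```

Only the prefix and the parameter names change from one distribution to the next; the first argument plays the same role as in the Normal functions.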
For example, let's look at the functions for the Normal distribution. The first function we look at is the density function, `dnorm`. Given a set of values it returns the value of the Normal pdf at each point. If you give only the points, it assumes a mean of zero and a standard deviation of one, i.e. the standard Normal pdf \(\phi(z)\). To use different values for the mean and standard deviation, specify them in the optional `mean` and `sd` arguments:
dnorm(0)
## [1] 0.3989423
dnorm(-3:3)
## [1] 0.004431848 0.053990967 0.241970725 0.398942280 0.241970725 0.053990967
## [7] 0.004431848
dnorm(20, mean=10, sd=5)
## [1] 0.01079819
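In the same spirit as the earlier checks on `mean` and `sd`, we can compare `dnorm` with the Normal density formula \(f(x)=\frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)\). This is a sketch only; the variable names `x`, `mu`, `sigma` and `by_hand` are chosen here for illustration.

```r
# Same point and parameters as dnorm(20, mean=10, sd=5)
x     <- 20
mu    <- 10
sigma <- 5

# Normal density evaluated directly from the formula
by_hand <- exp(-(x - mu)^2 / (2 * sigma^2)) / (sigma * sqrt(2 * pi))

by_hand
dnorm(x, mean = mu, sd = sigma)

# TRUE if the two agree up to floating-point tolerance
isTRUE(all.equal(by_hand, dnorm(x, mean = mu, sd = sigma)))
```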
The second type of function is `pnorm`, which evaluates the cumulative distribution function of a Normal distribution. Given a number or a vector of numbers, it computes the probability that a normally distributed random variable will be less than or equal to each number. It accepts the same options as `dnorm` and defaults to the standard Normal behaviour, i.e. \(\Phi(z)\):
pnorm(0) ## should be 0.5
## [1] 0.5
pnorm(1.96) ## should be ~0.975
## [1] 0.9750021
pnorm(20, mean=10, sd=5)
## [1] 0.9772499
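Since \(F(x)\) is the area under the density up to \(x\), `pnorm` should agree with a numerical integral of `dnorm`. The sketch below checks this using the base R function `integrate`, and then computes the probability of falling in an interval as a difference of two `pnorm` values. The names `area` and `inside` are illustrative.

```r
# Area under the standard Normal density from -Inf to 1.96
area <- integrate(dnorm, lower = -Inf, upper = 1.96)$value
area                 # approximately equal to pnorm(1.96)

# P[-1.96 < Z <= 1.96] for a standard Normal Z, roughly 0.95
inside <- pnorm(1.96) - pnorm(-1.96)
inside
```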
`pnorm` (and all the “p” functions) is particularly useful when computing \(p\)-values in significance tests.

To find the probability that a random variable is larger than the given number, i.e. \(1-F(x)\) rather than \(F(x)\), set the `lower.tail` option to `FALSE`:
pnorm(0,lower.tail=FALSE)
## [1] 0.5
pnorm(1,lower.tail=FALSE)
## [1] 0.1586553
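For instance, a two-sided \(p\)-value for a test statistic that is standard Normal under the null hypothesis can be sketched as follows. The observed value `z_obs` is a made-up number for illustration, not a result from the data above.

```r
z_obs <- 2.1   # hypothetical observed test statistic

# Probability of a value at least this extreme in either tail
p_two_sided <- 2 * pnorm(abs(z_obs), lower.tail = FALSE)
p_two_sided

# The same quantity written with 1 - F(x) instead of lower.tail
2 * (1 - pnorm(abs(z_obs)))
```

Using `lower.tail=FALSE` avoids the subtraction from one, which tends to be more accurate when the tail probability is very small.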
The next type of function is `qnorm`, which is the inverse of `pnorm`, so `qnorm` is \(F^{-1}(q)\). You give it a probability value \(q\), and it returns the number \(x\) such that \(F(x) = P[X \le x] = q\). This is particularly useful for finding the critical value of a distribution associated with a particular significance level.
qnorm(0.975) ## should be about 1.96
## [1] 1.959964
qnorm(0.5) ## should be 0
## [1] 0
qnorm(0.25,mean=2,sd=2)
## [1] 0.6510205
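To make the link to significance levels concrete, here is a small sketch of the critical values for a two-sided test, together with a check that `qnorm` undoes `pnorm`. The significance level `alpha` of 0.05 and the variable names are assumptions made for the example.

```r
alpha <- 0.05                      # assumed significance level

# Upper critical value: the point with probability 1 - alpha/2 below it
z_crit <- qnorm(1 - alpha / 2)
z_crit

# By symmetry of the Normal, the lower critical value is its negative
qnorm(alpha / 2)

# qnorm is the inverse of pnorm, so this recovers 1.3
qnorm(pnorm(1.3))
```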
The last type of function we examine is `rnorm`, which generates normally distributed random numbers. Its first argument is the number of random numbers to generate, and it has optional arguments to specify the mean and standard deviation. Because the numbers are random, your output will differ from the values shown here:
rnorm(4)
## [1] 0.2167549 -0.5424926 0.8911446 0.5959806
rnorm(4,mean=3,sd=10)
## [1] 19.3561800 9.8927544 -9.8124663 0.8685548
mean(rnorm(1000,mean=3))
## [1] 2.96563
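Because `rnorm` returns different numbers on every call, results such as those above cannot be reproduced exactly. A common approach is to fix the seed of the random number generator with `set.seed` first. A minimal sketch, where the seed value 1 and the variable names are arbitrary choices:

```r
set.seed(1)                        # fix the starting point of the generator
a <- rnorm(5, mean = 3, sd = 10)

set.seed(1)                        # resetting the seed replays the same sequence
b <- rnorm(5, mean = 3, sd = 10)

identical(a, b)                    # TRUE

# With a large sample, the sample mean and sd should be close to the parameters
big <- rnorm(10000, mean = 3, sd = 10)
mean(big)
sd(big)
```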