In this session we perform simple simulation exercises in order to investigate the properties of the sample mean introduced in the lectures. In this workshop we will use the loops and write our own functions in R.
To start, create a new R script by clicking on the corresponding menu button (in the top left); and save it somewhere in your computer. You can now write all your code into this file, and then execute it in R using the menu button Run
or by simple pressing ctrl
\(+\) enter
.
We’ll be using the hospital
data set from the durham
library in our sampling experiments. So, before we begin you’ll need to load the data:
durham
library and the hospital
data set.disch
(or similar) containing the data for the number of discharges in the hospital
data.
# solution
library(durham)
data(hospital)
disch <- hospital$discharges
Recall that our general approach to sampling is to sample without replacement, so we don’t repeatedly observe the same individual from the population. We can use R
’s sample
function to take random samples from a specified population vector, and we can do so either with or without replacement.
The sample
function allows us to take a sample of a specified size from a vector of values which is treated as the population.
If we have a population vector pop
and we want to take a sample of size n
without replacement, then we use the following command:
sample(pop, n)
If we want to sample with replacement, then we can add the additional optional argument replace=TRUE
.
Finally, the default is for all of the elements of the population to have equal probability. This can be changed to specified probability weights using the prob
argument, but we won’t be looking into that any further. See the help file (by running ?sample
) for more details on the sample
function.
mean
. Do this in three lines, one for each sample size.# (a)
sample(disch, 4)
## [1] 2190 87 1153 1766
# (b)
sample(disch, 4) ## unsurprisingly different
## [1] 1009 1369 410 1093
# (c)
mean(sample(disch,4))
## [1] 287.75
mean(sample(disch,16))
## [1] 939.375
mean(sample(disch,64))
## [1] 859.2969
# (d)
# potentially quite different values.
# We would expect the mean of these means to be mu,
# so the values should be centred about mu but its
# difficult to say more with only a few points
\(~\)
We will be performing a repeated sampling experiment where we repeat the sampling many times to generate different samples each time. Then we can use the many samples to produce many different sample means, which we can explore to check their behaviour against what we would expect from the theory seen in lectures.
To simplify the repeated sampling calculation, we will use a loop to repeat the calculations and we will write our own function to take care of the calculations we wish to perform during each iteration.
As with other programming languages (such as Python), we can combine multiple commands in R
into our own custom functions that perform more complicated tasks or calculations. Throughout this course, we will be writing our own functions to solve specific problems and today we will learn the basic syntax of how to create a function.
When creating a new function, it needs to have a name, probably at least one argument (although it doesn’t have to), and a body of code that does something. At the end it usually should (although doesn’t have to) return a value or object out of the function.
The general syntax for writing your own function is
name.of.function <- function(arg1, arg2, arg3=2) {
# function code to do some useful stuff
return(something) # return value
}
name.of.function
: is the function’s name. This can be any valid variable name, but you should avoid using names that are used elsewhere in R, such as mean
, function
, plot
, etc.arg1
, arg2
, arg3
: these are the arguments of the function. You can write a function with any number of arguments, or none at all. These can be any R object: numbers, strings, arrays, data frames, or even other functions. Essentially, this is the list of everything that is needed for the name.of.function
function to run. Some arguments have default values specified, such as arg3
in our example, which is set to 2
unless otherwise specified. Arguments without a default must have a value supplied for the function to run. You do not need to provide a value for those arguments with a default as they are considered as optional, and when omitted the function will simply use the default value in its definition.{}
brackets is run every time the function is called. Note that unlike Python where the code inside the function is indented, with R
the code inside the function must be enclosed in curly braces {}
. This code might be very long or very short. Ideally functions are short and do just one thing – problems are rarely too small to benefit from some abstraction. Sometimes a large function is unavoidable, but usually these can be in turn constructed from a number of small functions.return()
or omitting the return
statement entirely), whereas a function that does a mathematical operation might return a number, or a vector.
For example, we can write a function to compute the sum of squares of two numbers as
sum.of.squares <- function(x,y) {
return(x^2 + y^2)
}
and we can then evaluate
sum.of.squares(3,4)
## [1] 25
MyFun
that takes a single argument x
and returns the value of \(1+x+0.1x^2\).MyFunPlotter
, which takes a single argument x
, and which draws a scatterplot
of the points (x, MyFun(x))
. Evalate your function for \(x=0,\dots,100\). Hint: See the previous workshop for how to draw scatterplots, and use your function from the previous part of the question.
# (a)
MyFun <- function(x){
return(1+x+0.1*x^2)
}
# (b)
MyFun(0)
## [1] 1
# (c)
MyFun(0:100)
## [1] 1.0 2.1 3.4 4.9 6.6 8.5 10.6 12.9 15.4 18.1
## [11] 21.0 24.1 27.4 30.9 34.6 38.5 42.6 46.9 51.4 56.1
## [21] 61.0 66.1 71.4 76.9 82.6 88.5 94.6 100.9 107.4 114.1
## [31] 121.0 128.1 135.4 142.9 150.6 158.5 166.6 174.9 183.4 192.1
## [41] 201.0 210.1 219.4 228.9 238.6 248.5 258.6 268.9 279.4 290.1
## [51] 301.0 312.1 323.4 334.9 346.6 358.5 370.6 382.9 395.4 408.1
## [61] 421.0 434.1 447.4 460.9 474.6 488.5 502.6 516.9 531.4 546.1
## [71] 561.0 576.1 591.4 606.9 622.6 638.5 654.6 670.9 687.4 704.1
## [81] 721.0 738.1 755.4 772.9 790.6 808.5 826.6 844.9 863.4 882.1
## [91] 901.0 920.1 939.4 958.9 978.6 998.5 1018.6 1038.9 1059.4 1080.1
## [101] 1101.0
# (d)
MyFunPlotter <- function(x){
plot(x=x,y=MyFun(x))
}
MyFunPlotter(0:100)
\(~\)
A “loop”” is a way to repeat a sequence of instructions under certain conditions. They allow you to automate parts of your code that are in need of repetition. We will look at two ways of creating loops in our code:
for
loops in R.apply
functions in R.In R a for
loop takes the following general form:
for (variable in sequence) {
## code to repeat goes here
}
where variable
is a name given to the iteration variable and which takes each possible value in the vector sequence
at each pass through the loop.
Here is a quick trivial example, printing the square root of the integers one to ten:
for (x in 1:10) {
print(sqrt(x))
}
for
.for
loop to:
# (a)
sqrt(1:10) ## use a vector instead
## [1] 1.000000 1.414214 1.732051 2.000000 2.236068 2.449490 2.645751 2.828427
## [9] 3.000000 3.162278
# (b)
for(i in 1:5){
print(i)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
for(i in 10:3){
print(i^2)
}
## [1] 100
## [1] 81
## [1] 64
## [1] 49
## [1] 36
## [1] 25
## [1] 16
## [1] 9
for(i in 1:10){
print("My Name")
}
## [1] "My Name"
## [1] "My Name"
## [1] "My Name"
## [1] "My Name"
## [1] "My Name"
## [1] "My Name"
## [1] "My Name"
## [1] "My Name"
## [1] "My Name"
## [1] "My Name"
\(~\)
While for
loops are helpful when we need to explictly repeat a block of code, sometimes we just want to apply the same function (or calculations) to all of the elements of a vector. We can use for
to do this, but R provides a function which does exactly that - sapply
.
sapply
applies a specified function to every element of a vector and returns a vector formed from the results.
sapply(X, FUN)
applies the function FUN
to every element of the vector X
it then returns a vector containing the values of FUN(X[1])
, FUN(X[2])
, and so on. So, to replicate the square-root example above using sapply
, we would write
sapply(1:10, FUN=sqrt)
or alternatively
sapply(1:10, FUN=function(i){sqrt(i)})
which applies the square root function to each of the integers 1 to 10. Unlike a loop, sapply
automatically returns its results as a vector without us having to write code for that (our for
loop above only printed the values).
There are other variations of apply
which work with matrices and other data structures, and we will see those later.
If we combine this technique of looping with the ability to write our own functions, then we have a very flexible way of re-writing a standard loop in a vectorised way. In general, using the sapply
function is to be preferred to a for
loop particularly when we want to keep the results of the calculations from each iteration. However, for
loops are still useful and more natural in certain cases (where we do not want the output values, or where each iteration has a dependency on the calculations at the previous step).
sapply
to re-write your for-loops. You should write your answers in one line of code for each part:
sapply(1:5, FUN=function(i) { i })
## [1] 1 2 3 4 5
sapply(10:3, FUN=function(i) { i^2 })
## [1] 100 81 64 49 36 25 16 9
sapply(1:10, FUN=function(i) { 'My Name' })
## [1] "My Name" "My Name" "My Name" "My Name" "My Name" "My Name" "My Name"
## [8] "My Name" "My Name" "My Name"
\(~\)
sapply
and the functions sample
and mean
.n
that returns a vector of the means of 1000 samples, each of size n
, from the discharges data. Hint: This is a one line function that places a version of the command used in the previous question within the function.sapply(1:30, FUN=function(i) { mean(sample(disch, 4)) } )
## [1] 541.50 987.50 729.75 1188.50 860.75 583.50 1613.00 1074.00 814.50
## [10] 860.25 1078.00 1176.00 1031.25 816.00 623.00 339.25 1231.00 967.00
## [19] 795.00 1282.50 777.75 820.50 580.50 545.00 770.75 641.75 857.50
## [28] 892.75 951.50 672.75
dischSample <- function(n){
return(sapply(1:1000, FUN=function(i) { mean(sample(disch, n)) } ))
}
means4 <- dischSample(4)
means16 <- dischSample(16)
means64 <- dischSample(64)
We now have samples of size 1000 from each of three sampling distributions of \(\bar{X}\), for when \(n=4\), \(n=16\), \(n=64\). As our samples of \(\bar{X}\) values are quite large in size, we might expect that each of these distributions is quite similar to the distribution of the underlying theoretical population of sample means.
boxplot
comparing each of the three samples on the same scale. (Hint: see the previous workshop for how to draw a boxplot.)
mean(means4)
## [1] 808.1368
sd(means4)
## [1] 293.6133
mean(means16)
## [1] 816.885
sd(means16)
## [1] 139.8785
mean(means64)
## [1] 818.0686
sd(means64)
## [1] 64.11728
boxplot(means4, means16, means64, names = c("means4", "means16", "means64"))
\(~\)
R makes it easy to combine multiple plots into one overall graph, using either the par
or layout
functions.
With the par
function, we specify the argument mfrow=c(nr, nc)
to split the plot window into a grid of nr x nc plots that are filled in by row. For example, to divide the plot window into a 2x2 grid we call
par(mfrow=c(2,2))
This has no immediate effect, but the next four plots drawn (using plot
, hist
, boxplot
, etc) will be drawn in the 2x2 grid.
To revert to a single plot layout, we ask for a 1x1 grid via par(mfrow=c(1,1))
.
par
function with the mfrow
argument to configure to plot window for a column of three vertical plots. Note: this function will have no effect until you start drawing new plots.xlim
argument to specify the horizontal axis limits, and range
will determine the min and max of a vector).
par(mfrow=c(3,1))
hist(means4, xlim=range(means4))
hist(means16, xlim=range(means4))
hist(means64, xlim=range(means4))
It is often useful to add simple straight lines to lines to plots, which can be achieved using the abline
function. abline
can be used in three different ways:
h
argument, abline(h=3)
draws a horizontal line at \(y=3\)v
argument, abline(v=5)
draws a vertical line at \(x=5\)a
and b
arguments representing the intercept and slope respectively; abline(a=1,b=2)
draws the line at \(y=1+2x\)abline
can be customised using any of the usual colour and line modifications using colour (col
), line types and widths (lty
, and lwd
).
par(mfrow=c(3,1))
hist(means4, xlim=range(means4))
abline(v=mean(disch),col='red')
hist(means16, xlim=range(means4))
abline(v=mean(disch),col='red')
hist(means64, xlim=range(means4))
abline(v=mean(disch),col='red')
\(~\) According to theory seen in lectures, the sample mean is an unbiased estimator of the population mean, \(\mu\), and has variance denoted \(\sigma^2_{\bar{X}}\). For Normal data or large sample sizes, the sampling distribution is either exactly or approximately Normal with the above mean and variance.
## distributions are all centred about mu (as they should be since Xbar is unbiased)
## spread decreases with sample size (due primarily to the 1/n term in the variance of Xbar)
## the distribution looks "more Normal" as n gets bigger (central limit theorem)
\(~\)
If you finish, you can investigate some of the additional problems below.
We can use R
to investigate the properties of the distribution of the sample mean more closely. The theory tells us that the variance of the sample mean, \(\bar{X}\), for a sample of size \(n\) from a finite population of size \(N\) is given by:
\[ \text{Var}[\bar{X}] = \sigma^2_{\bar{X}} = \left(\frac{N-n}{N-1}\right)\frac{\sigma^2}{n}, \sigma^2 = \frac{1}{N} \sum_{i=1}^{N}(X_i-E(X_i))^2 \] where \(\sigma^2\) is the population variance of the original data, \(X_i\). An unbiased estimator of this variance is the estimated variance, given by \[ S^2_{\bar{X}} = \left(1-\frac{n}{N}\right)\frac{s^2}{n}, \] where \(s^2\) is the sample variance.
n
that returns a vector of 1000 realisations of \(S^2_{\bar{X}}\) for samples of n
from the discharges data.## Careful with this step, as using `var` on the discharges vector uses the sample variance formula which
## assumes the sample mean is not mu and which divides by (c-1). We need the population variance as
## mean(disch)=mu since this is our population, and we want to divide by N. So either we need to correct the
## output from var or compute directly.
# (a)
N <- length(disch)
sigma2 <- sum((disch-mean(disch))^2)/N
## or
sigma2 <- (N-1)/N * var(disch)
## compute sigma_xbar for n=4,16,64 - we should probably do this with a function!
trueV4 <- (N-4)/(N-1) * sigma2/4
trueV16 <- (N-16)/(N-1) * sigma2/16
trueV64 <- (N-64)/(N-1) * sigma2/64
# (b)
## easiest to do this with a new function to compute S^2_Xbar for a given sample
S2Xbar <- function(x){
n <- length(x)
return( (1-n/N)*var(x)/n)
}
## then just swap `mean` for `S2Xbar` from our previous sampling function
dischVariances <- function(n){
return(sapply(1:1000, FUN=function(i) { S2Xbar(sample(disch, n)) } ))
}
# (c)
vars4 <- dischVariances(4)
vars16 <- dischVariances(16)
vars64 <- dischVariances(64)
# (d), (e), and (f)
par(mfrow=c(3,1))
hist(vars4, xlim=range(vars4))
abline(v=trueV4,col='red')
hist(vars16, xlim=range(vars4))
abline(v=trueV16,col='red')
hist(vars64, xlim=range(vars4))
abline(v=trueV64,col='red')