1 Programming

R is a statistical programming language, so it also provides the standard programming constructs such as loops, if statements, and the ability to write functions.

1.1 Writing your own functions

A function needs to have a name, probably at least one argument (although it doesn’t have to), and a body of code that does something. At the end it should usually (although it doesn’t have to) return an object. The general syntax for writing your own function is

name.of.function <- function(arg1, arg2, arg3=2, ...) {
  # function code to do some useful stuff
  return(something) # return value 
}
  • name.of.function: is the function’s name. This can be any valid variable name, but you should avoid using names that are already used elsewhere in R, such as mean, plot, etc. (reserved words such as function or if cannot be used as names at all).
  • arg1, arg2, arg3: these are the arguments of the function. You can write a function with any number of arguments, or none at all. These can be any R object: numbers, strings, arrays, data frames, or even other functions; anything that is needed for the name.of.function function to run. Some arguments have default values specified, such as arg3 in our example. Arguments without a default must have a value supplied for the function to run. You do not need to provide a value for arguments with a default, as they are considered optional; when one is omitted, the function simply uses the default value from its definition.
  • The ... argument: The ..., or ellipsis, element in the function definition allows other unspecified optional arguments to be passed into the function; these are usually passed on to another function. This technique is often used in plotting, but has uses in many other places.
  • Function body: The function code within the {} braces is run every time the function is called. This code might be very long or very short. Ideally functions are short and do just one thing; problems are rarely too small to benefit from some abstraction. Sometimes a large function is unavoidable, but usually it can in turn be constructed from a number of small functions. More on that below.
  • Return value: The function returns the value passed to return() or, if return() is not used, the value of the last expression evaluated. A function need not return anything useful; for example, a function that makes a plot might be called only for its side effect, whereas a function that does a mathematical operation might return a number or a vector.

For example, we can write a function to compute the sum of squares of two numbers as

sum.of.squares <- function(x,y) {
  return(x^2 + y^2)
}

and we can then evaluate

sum.of.squares(3,4)
## [1] 25
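
Default values and the ... argument described in the list above can be tried out in the same way. The following is only a minimal sketch: scaled.mean is a hypothetical function invented for illustration, with an optional scale argument and a ... that is forwarded to the built-in mean().

```r
# scaled.mean is a hypothetical helper, used only for illustration
scaled.mean <- function(x, scale = 1, ...) {
  # anything captured by ... is forwarded unchanged to mean()
  return(scale * mean(x, ...))
}

scaled.mean(c(1, 2, 3))                 # scale omitted, so the default 1 is used
## [1] 2
scaled.mean(c(1, 2, 3), scale = 10)     # default overridden by the caller
## [1] 20
scaled.mean(c(1, 2, NA), na.rm = TRUE)  # na.rm is passed on to mean() via ...
## [1] 1.5
```

Because scale has a default, the first call runs without it; x has no default, so leaving it out would produce an error.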

1.2 Local vs global variables

Now, it’s not necessarily the case that you must use return() at the end of your function. If we don’t explicitly return something, then R will automatically return the results of evaluating the last statement inside the function. The reason you return an object (aside from making your code more readable) is if you’ve saved the value of your statements into an object inside the function. Variables created inside a function only exist within that function, and won’t appear outside in your workspace. See how it works in the following two examples:

fun1 <- function(x) {
    3 * x - 1
}
fun1(5)
## [1] 14
fun2 <- function(x) {
    y <- 3 * x - 1
}
fun2(5)

In the first function, I just evaluate the statement 3*x-1 without saving it anywhere inside the function. So when I run fun1(5), the result comes popping out of the function. However, when I call fun2(5), nothing is printed. That’s because the last statement in fun2 is an assignment, and R returns the value of an assignment invisibly: the value is still returned (so z <- fun2(5) would store 14 in z), but it is not shown at the console. The object y itself doesn’t exist outside the function. If I try to use y outside of the function, I will encounter an error because it only exists within the local environment of the function. I can make the value of y print as usual by adding return(y) at the end of the function fun2, but I can’t return the object itself; it’s stuck inside the function.
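
A short sketch of this behaviour is given below. The names fun2b, fun3 and y.inside are hypothetical, chosen so that they do not clash with anything above, and the exists() check assumes there is no object called y.inside already in the workspace.

```r
fun2b <- function(x) {
  y.inside <- 3 * x - 1   # local variable; the assignment's value is returned invisibly
}
z <- fun2b(5)             # nothing is printed, but the value can still be captured
z
## [1] 14
exists("y.inside")        # the local variable never reaches the workspace
## [1] FALSE

fun3 <- function(x) {
  y.inside <- 3 * x - 1
  return(y.inside)        # explicit return, so the value prints when called
}
fun3(5)
## [1] 14
```

Only the value leaves the function; the name y.inside is discarded once the call finishes.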

1.3 Repeating calculations using loops

Conceptually, a loop is a way to repeat a sequence of instructions under certain conditions. Loops allow you to automate parts of your code that need repeating. Typically, there are two types of loops. Loops that execute a prescribed number of times, controlled by a counter or index that advances at each iteration, are written as for loops in R. Loops that test a logical condition at each iteration are written as while or repeat loops (a repeat loop relies on a break statement inside its body to stop).

In R a for loop takes the following general form

for (variable in sequence) { 
  ## code to repeat goes here
}

where variable is the name given to the iteration variable, which takes each value in the vector sequence in turn, one per pass through the loop. Here is a quick trivial example, printing the square root of the integers one to ten:

for (x in 1:10) {
  print(sqrt(x))
}
## [1] 1
## [1] 1.414214
## [1] 1.732051
## [1] 2
## [1] 2.236068
## [1] 2.44949
## [1] 2.645751
## [1] 2.828427
## [1] 3
## [1] 3.162278

As with the example above, there is often no need to explicitly write for loops to repeat calculations in R code, as most built-in functions and arithmetic can be evaluated for vector arguments anyway (and usually more efficiently). For the example above, simply evaluating sqrt(1:10) would give the answer we need for rather less typing:

sqrt(1:10)
##  [1] 1.000000 1.414214 1.732051 2.000000 2.236068 2.449490 2.645751 2.828427
##  [9] 3.000000 3.162278

A break statement is used inside a loop (for, while, or repeat) to stop the iterations and jump to the code outside of the loop. In a nested looping situation, where there is a loop inside another loop, this statement exits from the innermost loop that is being evaluated.

x <- 1:5
for (val in x) {
    if (val == 3){
        break
    }
    print(val)
}
## [1] 1
## [1] 2

Conversely, a next statement is useful when we want to skip the rest of the code in the current iteration of a loop without terminating the loop. On encountering next, R skips any further evaluation and starts the next iteration of the loop.

x <- 1:5

for (val in x) {
    if (val == 3){
        next
    }
    print(val)
}
## [1] 1
## [1] 2
## [1] 4
## [1] 5
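
The condition-based loops mentioned at the start of this section, while and repeat, follow a similar pattern. The sketch below is illustrative only; the variable names n and count are arbitrary choices.

```r
# while: the condition is tested before every pass through the loop
n <- 1
while (n < 100) {
  n <- n * 2
}
n
## [1] 128

# repeat: there is no condition, so a break is needed to stop the loop
count <- 0
repeat {
  count <- count + 1
  if (count >= 3) {
    break
  }
}
count
## [1] 3
```

The while loop stops as soon as its condition becomes FALSE, whereas repeat would run forever without the break.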

R Help: for

1.4 Repeating calculations using the apply functions

We saw above how to use a for loop to apply the same code to a collection of objects described by the sequence over which we are looping. However, we can achieve the same results by writing a function to perform the code within the body of the loop, and then applying that function to every element of sequence. Handily, R has a family of functions, the ‘apply’ family, which do exactly that. We will use the following two members of the apply family:

  • sapply - applies a function to every element of a vector and returns a vector formed from the results
  • apply - applies a function to either the rows or the columns of a matrix (or data frame)

Each of these has an argument FUN which takes a function to apply to each element of the object. So, to replicate the simple example above using sapply, we would write

sapply(1:10, FUN=sqrt)
##  [1] 1.000000 1.414214 1.732051 2.000000 2.236068 2.449490 2.645751 2.828427
##  [9] 3.000000 3.162278

Unlike a loop, sapply automatically returns its results as a vector (or whatever form is most natural) without us having to write code for that. Therefore, if we combine this technique with the ability to write our own functions then we have a very flexible way of re-writing a standard loop in a vectorised way. In general, using an apply-type function is to be preferred to a for loop particularly when we want to keep the results of the calculations from each iteration. However, for loops are still useful and more natural in certain cases (where we do not want the output values, or where each iteration has a dependency on the calculations at the previous step).
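
As a rough sketch of both situations, the code below applies a user-written function with sapply, writes the equivalent for loop, and then shows a loop where each step depends on the previous one. The names cube.plus.one, results and doubled are hypothetical and chosen only for illustration.

```r
# cube.plus.one is a hypothetical function used only for illustration
cube.plus.one <- function(x) {
  x^3 + 1
}

# sapply collects the results into a vector for us
sapply(1:5, FUN = cube.plus.one)
## [1]   2   9  28  65 126

# the equivalent for loop needs a vector created in advance to hold the results
results <- numeric(5)
for (i in 1:5) {
  results[i] <- cube.plus.one(i)
}

# each step depends on the previous one, so a for loop is the more natural choice
doubled <- numeric(5)
doubled[1] <- 1
for (i in 2:5) {
  doubled[i] <- 2 * doubled[i - 1]
}
doubled
## [1]  1  2  4  8 16
```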

1.4.1 Repeating calculations for each row/column of a matrix

The apply function can be used to evaluate the same function for either every row or every column of a given matrix (or data frame). To apply the function over rows we supply the argument MARGIN=1, and to apply to each column we set MARGIN=2. We must also provide the function we wish to apply in the FUN argument.

For example, to calculate the means of each column in the mtcars data set, we could write

data(mtcars)
apply(mtcars, MARGIN=2, FUN=mean)
##        mpg        cyl       disp         hp       drat         wt       qsec 
##  20.090625   6.187500 230.721875 146.687500   3.596563   3.217250  17.848750 
##         vs         am       gear       carb 
##   0.437500   0.406250   3.687500   2.812500

Thus apply is very useful for quickly computing summaries and calculations across entire data sets.
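
The difference between the two MARGIN settings is easiest to see on a small matrix. This is only a sketch; the matrix m and the anonymous function computing the spread of each column are made up for illustration.

```r
m <- matrix(1:6, nrow = 2)   # filled column by column
m
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6

apply(m, MARGIN = 1, FUN = sum)   # one result per row
## [1]  9 12
apply(m, MARGIN = 2, FUN = sum)   # one result per column
## [1]  3  7 11

# FUN can also be a function we write ourselves
apply(m, MARGIN = 2, FUN = function(col) max(col) - min(col))
## [1] 1 1 1
```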

R Help: sapply, apply

1.5 If statements

A standard programming construct is the if statement, which is used to tell R that we want to make a choice based on the result of a test.

if(test){
  ## do this code if TRUE
} else{
  ## do this code if FALSE
}

If the test is TRUE, then the code inside the if statement (i.e., the lines in the curly braces underneath it) is executed. If the test is FALSE, the body of the else is executed instead. Only one or the other is ever executed:

x <- -5
if(x >= 0){
   print("Non-negative number")
} else {
   print("Negative number")
}
## [1] "Negative number"

We can chain a sequence of if and else statements together to consider a sequence of alternative test conditions:

x <- 0
if (x < 0) {
   print("Negative number")
} else if (x > 0) {
   print("Positive number")
} else {
   print("Zero")
}
## [1] "Zero"
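
An if statement tests a single value, so to make the same choice for every element of a vector we can wrap the chain above in a function and use sapply. The sketch below assumes a hypothetical function name, describe.sign; for simple two-way choices the built-in ifelse function works on a whole vector directly.

```r
# describe.sign is a hypothetical function wrapping the chain of tests above
describe.sign <- function(x) {
  if (x < 0) {
    "Negative number"
  } else if (x > 0) {
    "Positive number"
  } else {
    "Zero"
  }
}

sapply(c(-5, 0, 2), FUN = describe.sign)
## [1] "Negative number" "Zero"            "Positive number"

# ifelse evaluates the test for every element of the vector at once
ifelse(c(-5, 0, 2) >= 0, "Non-negative number", "Negative number")
## [1] "Negative number"     "Non-negative number" "Non-negative number"
```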