As R is a statistical programming language, it also provides the standard programming constructs such as loops, if statements, and the ability to write functions.
A function needs a name, usually at least one argument (although none is required), and a body of code that does something. It should usually finish by returning an object (although this is not required either). The general syntax for writing your own function is
name.of.function <- function(arg1, arg2, arg3 = 2, ...) {
  # function code to do some useful stuff
  return(something) # return value
}
name.of.function: this is the function’s name. It can be any valid variable name, but you should avoid names that are already used elsewhere in R, such as mean, function, plot, etc.
arg1, arg2, arg3: these are the arguments of the function. You can write a function with any number of arguments, or none at all. They can be any R object: numbers, strings, arrays, data frames, or even other functions; anything that name.of.function needs in order to run. Some arguments have default values specified, such as arg3 in our example. Arguments without a default must have a value supplied for the function to run. Arguments with a default are optional: when they are omitted, the function simply uses the default value in its definition.
... argument: the ..., or ellipsis, element in the function definition allows other unspecified optional arguments to be passed into the function, which are usually passed on to another function. This technique is often used in plotting, but has uses in many other places.
Function body: the code inside the {} brackets is run every time the function is called. This code might be very long or very short. Ideally functions are short and do just one thing; problems are rarely too small to benefit from some abstraction. Sometimes a large function is unavoidable, but usually it can in turn be constructed from a number of small functions. More on that below.
For example, we can write a function to compute the sum of squares of two numbers as
sum.of.squares <- function(x, y) {
  return(x^2 + y^2)
}
and we can then evaluate
sum.of.squares(3,4)
## [1] 25
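The default values and the ... argument described above can be sketched in the same way. This is a minimal illustration only: sum.of.powers is a hypothetical function, not part of the original example, and it assumes we want to forward any extra arguments to the built-in sum().

```r
# Hypothetical helper for illustration only.
# 'power' has a default value, so it is optional;
# anything supplied through '...' is passed on to sum().
sum.of.powers <- function(x, power = 2, ...) {
  return(sum(x^power, ...))
}

sum.of.powers(c(3, 4))                     # uses the default power = 2
## [1] 25
sum.of.powers(c(3, 4), power = 3)          # overrides the default
## [1] 91
sum.of.powers(c(3, NA, 4), na.rm = TRUE)   # na.rm travels through ... to sum()
## [1] 25
```

Because na.rm is not an argument of sum.of.powers itself, it is only understood once it reaches sum(), which is the usual reason for including ... in a definition.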
Now, you do not have to use return() at the end of your function. If we don’t explicitly return something, then R will automatically return the result of evaluating the last statement inside the function. Using return() makes your code more readable, and it matters most when you have saved the value of your statements into an object inside the function. Variables created inside a function only exist within that function, and won’t appear outside in your workspace. See how it works in the following two examples:
fun1 <- function(x) {
  3 * x - 1
}
fun1(5)
## [1] 14
fun2 <- function(x) {
  y <- 3 * x - 1
}
fun2(5)
In the first function, I just evaluate the statement 3*x-1 without saving it anywhere inside the function. So when I run fun1(5), the result comes popping out of the function. However, when I call fun2(5), nothing is printed. That’s because the last statement in fun2 is an assignment, and R returns the value of an assignment invisibly, so the result does not appear at the console. The object y that I saved my result into doesn’t exist outside the function either. If I try to use y outside of the function, I will encounter errors because it only exists within the local environment of the function. I can pass the value of y out of the function by adding return(y) at the end of fun2, but I can’t return the object itself; it’s stuck inside the function.
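A short sketch of what happens to the result of fun2 may help here. It relies only on standard R behaviour: the value of the final assignment is still returned, just invisibly, so it can be captured or forced to print, while y itself never reaches the workspace.

```r
fun2 <- function(x) {
  y <- 3 * x - 1   # y is local to the function
}

z <- fun2(5)       # capture the invisibly returned value
z
## [1] 14

(fun2(5))          # wrapping the call in parentheses forces printing
## [1] 14

# Check the current environment only; this is FALSE provided
# you have not created an object called y yourself.
exists("y", inherits = FALSE)
```

Adding an explicit return(y) as the last line of fun2 makes the value print when the function is called directly, which is usually the clearer choice.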
Conceptually, a loop is a way to repeat a sequence of instructions under certain conditions. Loops allow you to automate parts of your code that need repetition. Typically, there are two types of loops. Loops that execute a prescribed number of times, controlled by a counter or index that is incremented at each iteration, are written as for loops in R. Loops that are based on testing a logical condition at each iteration are while or repeat loops.
In R, a for loop takes the following general form
for (variable in sequence) {
  ## code to repeat goes here
}
where variable is the name given to the iteration variable, which takes each value in the vector sequence in turn at each pass through the loop. Here is a quick trivial example, printing the square root of the integers one to ten:
for (x in 1:10) {
  print(sqrt(x))
}
## [1] 1
## [1] 1.414214
## [1] 1.732051
## [1] 2
## [1] 2.236068
## [1] 2.44949
## [1] 2.645751
## [1] 2.828427
## [1] 3
## [1] 3.162278
As the example above suggests, there is often no need to explicitly write for loops to repeat calculations in R code, as most built-in functions and arithmetic can be evaluated for vector arguments anyway (and usually more efficiently). For the example above, simply evaluating sqrt(1:10) gives the answer we need with rather less typing:
sqrt(1:10)
## [1] 1.000000 1.414214 1.732051 2.000000 2.236068 2.449490 2.645751 2.828427
## [9] 3.000000 3.162278
A break statement is used inside a loop (for, while, or repeat) to stop the iterations and jump to the code after the loop. In a nested looping situation, where there is a loop inside another loop, this statement exits only from the innermost loop that is being evaluated.
x <- 1:5
for (val in x) {
  if (val == 3) {
    break
  }
  print(val)
}
## [1] 1
## [1] 2
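The nested case mentioned above can be sketched as follows. The loop variables i and j are chosen purely for illustration; the point is that break leaves the inner loop while the outer loop carries on.

```r
for (i in 1:2) {
  for (j in 1:3) {
    if (j == 2) {
      break             # leaves the inner loop only
    }
    print(c(i, j))
  }
  # execution continues here, and the outer loop moves to the next i
}
## [1] 1 1
## [1] 2 1
```

Only the pairs with j equal to 1 are printed, once for each value of i, which shows that the outer loop was not interrupted.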
Conversely, a next statement is useful when we want to skip the rest of the code in the current iteration of a loop without terminating the loop. On encountering next, R skips any further evaluation and starts the next iteration of the loop.
x <- 1:5
for (val in x) {
  if (val == 3) {
    next
  }
  print(val)
}
## [1] 1
## [1] 2
## [1] 4
## [1] 5
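The while and repeat loops mentioned earlier follow the same pattern. The sketch below is illustrative only, with made-up variable names (i, n, total): a while loop tests its condition before each pass, whereas a repeat loop runs until a break is reached.

```r
# while: the condition is tested before each pass
i <- 1
while (i <= 3) {
  print(i)
  i <- i + 1
}
## [1] 1
## [1] 2
## [1] 3

# repeat: loops until an explicit break
total <- 0
n <- 0
repeat {
  n <- n + 1
  total <- total + n
  if (total > 10) {
    break
  }
}
n
## [1] 5
```

A repeat loop has no stopping condition of its own, so forgetting the break leaves it running indefinitely.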
We saw above how to use a for loop to apply the same code to a collection of objects described by the sequence over which we are looping. However, we can achieve the same results by writing a function to perform the code within the body of the loop, and then applying that function to every element of sequence. Handily, R has a family of functions, the ‘apply’ family, which do exactly that. We will use the following two members of the apply function family:
sapply - applies a function to every element of a vector and returns a vector formed from the results
apply - applies a function to either the rows or the columns of a matrix (or data frame)
Each of these has an argument FUN which takes a function to apply to each element of the object. So, to replicate the simple example above using sapply, we would write
sapply(1:10, FUN=sqrt)
## [1] 1.000000 1.414214 1.732051 2.000000 2.236068 2.449490 2.645751 2.828427
## [9] 3.000000 3.162278
Unlike a loop, sapply automatically returns its results as a vector (or whatever form is most natural) without us having to write code for that. Therefore, if we combine this technique with the ability to write our own functions then we have a very flexible way of re-writing a standard loop in a vectorised way. In general, an apply-type function is to be preferred to a for loop, particularly when we want to keep the results of the calculations from each iteration. However, for loops are still useful and more natural in certain cases (where we do not want the output values, or where each iteration depends on the calculations at the previous step).
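To make the comparison concrete, here is a minimal sketch using a hypothetical function, cube.minus.one, which is not part of the original text. The for loop has to create and fill its own results vector, whereas sapply collects the results for us.

```r
# Hypothetical function for illustration only
cube.minus.one <- function(x) {
  x^3 - 1
}

# for loop: we create the results vector and fill it ourselves
res.loop <- numeric(5)
for (i in 1:5) {
  res.loop[i] <- cube.minus.one(i)
}

# sapply: the results are gathered into a vector automatically
res.sapply <- sapply(1:5, FUN = cube.minus.one)
res.sapply
## [1]   0   7  26  63 124

identical(res.loop, res.sapply)
## [1] TRUE
```

Both approaches give the same answer here; the sapply version simply needs less bookkeeping.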
The apply function can be used to evaluate the same function for either every row or every column of a given matrix (or data frame). To apply the function over rows we supply the argument MARGIN=1, and to apply it to each column we set MARGIN=2. We must also provide the function we wish to apply in the FUN argument.
For example, to calculate the mean of each column in the mtcars data set, we could write
data(mtcars)
apply(mtcars, MARGIN=2, FUN=mean)
## mpg cyl disp hp drat wt qsec
## 20.090625 6.187500 230.721875 146.687500 3.596563 3.217250 17.848750
## vs am gear carb
## 0.437500 0.406250 3.687500 2.812500
Thus apply is very useful for quickly computing summaries and calculations across entire data sets.
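The difference between the two MARGIN settings is easiest to see on a small matrix. The matrix m below is made up for illustration and is not part of the original example.

```r
m <- matrix(1:6, nrow = 2)   # 2 rows, 3 columns, filled column by column
m
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6

apply(m, MARGIN = 1, FUN = sum)   # one value per row
## [1]  9 12

apply(m, MARGIN = 2, FUN = sum)   # one value per column
## [1]  3  7 11
```

With MARGIN=1 the result has one entry per row, and with MARGIN=2 one entry per column.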
A standard programming construct is the if statement, which is used to tell R that we want to make a choice based on the result of a test.
if (test) {
  ## do this code if TRUE
} else {
  ## do this code if FALSE
}
If the test is TRUE, then the code inside the if statement (i.e., the lines in the curly braces underneath it) is executed. If the test is FALSE, the body of the else is executed instead. Only one or the other is ever executed:
x <- -5
if (x >= 0) {
  print("Non-negative number")
} else {
  print("Negative number")
}
## [1] "Negative number"
We can chain a sequence of if and else statements together to consider a sequence of alternative test conditions:
x <- 0
if (x < 0) {
  print("Negative number")
} else if (x > 0) {
  print("Positive number")
} else {
  print("Zero")
}
## [1] "Zero"
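The chained test above can also be wrapped in a function and combined with sapply. This is a sketch only: describe.sign is a hypothetical name introduced here for illustration. Since an if statement expects a single TRUE or FALSE, sapply is used to call the function once for each element of the vector.

```r
# Hypothetical function wrapping the chained if/else shown above
describe.sign <- function(x) {
  if (x < 0) {
    return("Negative number")
  } else if (x > 0) {
    return("Positive number")
  } else {
    return("Zero")
  }
}

# One call per element; the results are collected into a character vector
sapply(c(-5, 0, 3), FUN = describe.sign)
## [1] "Negative number" "Zero"            "Positive number"
```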