t.test
The t.test function in the stats package produces a variety of t-tests. Unlike most statistical packages, the default assumes unequal variances for the two samples; however, we can override this with the var.equal argument.
# independent 2-sample t-test assuming equal variances and a pooled variance estimate
t.test(x,y,var.equal=TRUE)
# independent 2-sample t-test with unequal variances and the Welch df modification
t.test(x,y)
t.test(x,y,var.equal=FALSE)
You can use the alternative="less" or alternative="greater" arguments to specify a one-sided test.
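For example, to test the alternative hypothesis that the mean of x exceeds the mean of y (x and y standing in for any two numeric vectors):
# one-sided test: H1 is that mean(x) > mean(y)
t.test(x, y, alternative="greater")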
A paired-sample \(t\)-test is obtained from the same function by specifying paired=TRUE and ensuring both samples are the same size:
# paired t-test
t.test(x,y,paired=TRUE)
Finally, a one-sample \(t\)-test can be performed to test the population mean \(\mu\) of a sample against a specific value by supplying only a single sample and a value for mu:
# one sample t-test
t.test(y, mu=3) # H0: mu=3
R Help: t.test
The nonparametric tests for comparing two samples can be performed in much the same way, only this time we use the wilcox.test function.
# independent 2-group Mann-Whitney U Test
wilcox.test(y,x) # where y and x are numeric
# paired 2-group Signed Rank Test
wilcox.test(y1,y2,paired=TRUE) # where y1 and y2 are numeric
R Help: wilcox.test
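wilcox.test also accepts the alternative and mu arguments in the same way as t.test, so, as a sketch, a one-sample signed rank test or a one-sided two-sample test could be written as:
# one-sample Wilcoxon signed rank test, H0: location mu=3
wilcox.test(y, mu=3)
# one-sided Mann-Whitney test: shift of y relative to x is positive
wilcox.test(y, x, alternative="greater")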
lm
Linear models can be fitted by the method of least squares in R using the function lm. Suppose our response variable is \(y\) and we have a predictor variable \(x\); if we want to model \(y\) as a linear function of \(x\), we can use lm to do this:
model <- lm(y ~ x)
Alternatively, if we have a data frame called dataset with columns a and b, then we could fit the linear regression of a on b without having to extract the columns, by using the data argument:
model <- lm(a ~ b, data=dataset)
The function lm fits a linear model to data, and we specify the model using a ‘formula’ in which the response variable appears on the left-hand side, separated by a ~ from the explanatory variable(s). The formula provides a flexible way to specify many different functional forms for the relationship. The data argument tells R where to look for the variables used in the formula. Note: the ~ (tilde) symbol should be interpreted as ‘is modelled as’.
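As an aside, a formula is an R object in its own right and can be created before any data exist (y and x here are placeholder names):
f <- y ~ x   # read as 'y is modelled as x'
class(f)     # "formula"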
For example, consider the data below from a study in central Florida in which 15 alligators were captured and two measurements were made on each animal: the weight (in pounds) and the snout vent length (in inches; the distance from the back of the head to the end of the nose). The data were analysed on the log scale (natural logarithms), and the goal is to determine whether there is a linear relationship between the two variables:
alligator <- data.frame(
lnLength = c(3.87, 3.61, 4.33, 3.43, 3.81, 3.83, 3.46, 3.76,
3.50, 3.58, 4.19, 3.78, 3.71, 3.73, 3.78),
lnWeight = c(4.87, 3.93, 6.46, 3.33, 4.38, 4.70, 3.50, 4.50,
3.58, 3.64, 5.90, 4.43, 4.38, 4.42, 4.25)
)
model <- lm(lnWeight~lnLength,data=alligator)
Inspecting the fitted lm object shows us the model call and the estimated coefficients:
model
##
## Call:
## lm(formula = lnWeight ~ lnLength, data = alligator)
##
## Coefficients:
## (Intercept) lnLength
## -8.476 3.431
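That is, the fitted regression line is \(\widehat{\text{lnWeight}} = -8.476 + 3.431\,\text{lnLength}\).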
We can also add the fitted regression line to a plot by simply passing the fitted model to abline:
plot(y=alligator$lnWeight, x=alligator$lnLength)
abline(model,col='red')
R Help: lm
The summary function in R is a multi-purpose function that can be used to describe an R object. It can be applied to vectors, matrices, data frames, linear models (lm objects), anova decompositions and many other R objects. Applying summary to our lm object via summary(model) gives a summary of the key elements of the regression:
summary(model)
##
## Call:
## lm(formula = lnWeight ~ lnLength, data = alligator)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.24348 -0.03186 0.03740 0.07727 0.12669
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.4761 0.5007 -16.93 0.00000000030795 ***
## lnLength 3.4311 0.1330 25.80 0.00000000000149 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1229 on 13 degrees of freedom
## Multiple R-squared: 0.9808, Adjusted R-squared: 0.9794
## F-statistic: 665.8 on 1 and 13 DF, p-value: 0.000000000001495
There is a lot of information in this output, but the key quantities are:
- Residuals: summary statistics of the residuals from the regression. Note that the mean of the residuals is always 0 and so is omitted.
- Coefficients: a table with one row per coefficient. The first row is labelled (Intercept), indicating that it contains information about the fitted intercept; subsequent rows are named after the other explanatory variables in the model formula after the ~.
- The Estimate column gives the least squares estimates of the coefficients.
- The Std. Error column gives the corresponding standard error for each coefficient.
- The t value column contains the \(t\) test statistic for a test of the hypothesis \(H_0: \beta_i=0\) against \(H_1: \beta_i\neq 0\), for each coefficient \(\beta_i\).
- The Pr(>|t|) column is the \(p\)-value associated with each such test, where a low \(p\)-value gives evidence to reject the null hypothesis that the coefficient is 0 (with the significance levels indicated by the stars).
R Help: summary
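To see how these columns relate, the t value is simply the estimate divided by its standard error, and the p-value comes from a \(t\) distribution on the residual degrees of freedom (13 here). A quick check of the lnLength row using the alligator model:
ctab <- coef(summary(model))                 # the coefficient table shown above
tval <- ctab["lnLength", "Estimate"] / ctab["lnLength", "Std. Error"]
tval                                         # matches the t value column, 25.80
2 * pt(abs(tval), df=13, lower.tail=FALSE)   # matches Pr(>|t|)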
R also has a number of functions that, when applied to the results of a linear regression, will return key quantities such as residuals and fitted values:
- coef(model) and coefficients(model) – return the estimated model coefficients as a vector \((b_0,b_1)\)
- fitted(model) and fitted.values(model) – return the vector of fitted values, \(\widehat{y}_i=b_0+b_1 x_i\)
- resid(model) and residuals(model) – return the vector of residual values, \(e_i=y_i-\widehat{y}_i\)
- confint(model, level=0.95) – confidence intervals for the model parameters
- predict(model, newdata, level=0.95) – predictions for the location of the regression line, or for \(y\), at new values of the explanatory variable supplied in the data frame newdata
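For example, using the alligator model fitted above (the log snout vent length of 3.9 is just an illustrative new value):
confint(model, level=0.95)                   # 95% CIs for the intercept and slope
new <- data.frame(lnLength=3.9)              # a hypothetical new alligator
predict(model, new, interval="confidence")   # CI for the mean log weight
predict(model, new, interval="prediction")   # PI for an individual log weight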
We can also extract the individual elements from the summary output of a regression analysis. Suppose we save the result of a call to the summary function on an lm object as summ. By using the $ operator we can extract even more valuable information from the components of the regression summary:
- summ$residuals – extracts the regression residuals, as resid above
- summ$coefficients – returns the \(p \times 4\) coefficient summary table with columns for the estimated coefficients, their standard errors, t-statistics and corresponding (two-sided) p-values
- summ$sigma – the regression standard error
- summ$r.squared, summ$adj.r.squared – the regression \(R^2\) and adjusted \(R^2\) respectively
- summ$cov.unscaled – the \(p\times p\) unscaled (co)variance matrix of the least squares coefficients; multiplying it by summ$sigma^2 gives the estimated covariance matrix, whose diagonal elements are the estimated variances of the \(p\) coefficients and whose off-diagonals are the estimated covariances between each pair of coefficients
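As a short sketch, saving the summary of the alligator model and pulling out a few of these components:
summ <- summary(model)
summ$sigma                          # residual standard error, 0.1229
summ$r.squared                      # 0.9808
summ$sigma^2 * summ$cov.unscaled    # estimated covariance matrix of the coefficients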
The following commands are for future reference and show the syntax for creating linear models with multiple predictors, models with quadratic (or higher power) terms, and models with interaction terms.
## multiple regression involving 2 predictors
model1 <- lm(y ~ x1 + x2, data=myData)
## quadratic regression on x1. Note each higher order term must be
## enclosed within the I() function
model2 <- lm(y ~ x1 + I(x1^2), data=myData)
## multiple regression with interaction
model3 <- lm(y ~ x1 + x2 + x1:x2, data=myData)
## more concise way of expressing model3
model3a <- lm(y ~ x1*x2, data=myData)
R Help: lm, formula for more details on specifying a regression.