In this course, we will learn how to use R and RStudio for data analysis and visualization. Before we do that, you will need to install R and RStudio locally on your computer. Both are free to download from the links below. Please note that you need to install R first and then RStudio.
Both R and RStudio are also available on the University network via AppsAnywhere.
https://appsanywhere.durham.ac.uk/
Watch the video for step-by-step instructions on how to install R and RStudio on your computer.
Data consist of information coming from observations, counts, measurements, or responses.
Statistics is the science of collecting, organizing, analyzing, and interpreting data in order to make decisions.
A population is the collection of all outcomes, responses, measurements, or counts that are of interest. Populations may be finite or infinite. If a population of values consists of a fixed number of these values, the population is said to be finite. If, on the other hand, a population consists of an endless succession of values, the population is an infinite one.
A sample is a subset of a population.
A parameter is a numerical description of a population characteristic.
A statistic is a numerical description of a sample characteristic.
The study of statistics has two major branches, descriptive statistics and inferential statistics:
Descriptive statistics is the branch of statistics that involves the organization, summarization, and display of data.
Inferential statistics is the branch of statistics that involves using a sample to draw conclusions about a population, e.g. estimation and hypothesis testing.
| | Population | Sample |
|---|---|---|
| Size | \(N\) | \(n\) |
| | **Parameter** | **Statistic** |
| Mean | \(\mu\) | \(\bar{x}\) |
| Variance | \(\sigma^2\) | \(s^2\) |
| Standard deviation | \(\sigma\) | \(s\) |
| Proportion | \(\pi\) | \(\hat{\pi}\) |
| Correlation | \(\rho\) | \(r\) |
There are several ways of collecting data:
The characteristics of big data are often summarized as the four Vs: volume, velocity, variety, and veracity.
Data sets can consist of two types of data:
Qualitative (categorical) data consist of attributes, labels, or nonnumerical entries, e.g. names of cities, gender, etc.
Quantitative data consist of numerical measurements or counts, e.g. height, weight, age. Quantitative data can be further classified as:
Discrete data result when the number of possible values is either a finite number or a "countable" number, e.g. the number of phone calls you receive on any given day.
Continuous data result from infinitely many possible values that correspond to some continuous scale that covers a range of values without gaps, interruptions, or jumps. e.g. height, weight, sales and market shares.
Cross-sectional data: Data on different entities (e.g. workers, consumers, firms, governmental units) for a single time period. For example, data on test scores in different school districts.
Time series data: Data for a single entity (e.g. person, firm, country) collected at multiple time periods. For example, the rate of inflation and unemployment for a country over the last 10 years.
Panel data: Data for multiple entities in which each entity is observed at two or more time periods. For example, the daily prices of a number of stocks over two years.
Nominal: Categories only, data cannot be arranged in an ordering scheme. (e.g. Marital status: single, married etc.)
Ordinal: Categories are ordered, but differences cannot be determined or they are meaningless (e.g. poor, average, good)
Interval: Differences between values are meaningful, but there is no natural starting point and ratios are meaningless (e.g. we cannot say that a temperature of 80\(^{\circ}\)F is twice as hot as 40\(^{\circ}\)F)
Ratio: Like the interval level, but there is a natural zero starting point and ratios are meaningful (e.g. a price of £100 is twice as much as £50)
Measures of central tendency provide numerical information about a `typical’ observation in the data.
\[\;\;\;\; \text{Sample mean:}\;\;\; \bar{x}=\frac{1}{n}\sum_{i=1}^{n}x_i\]
The variation (dispersion) of a set of observations refers to the variability that they exhibit.
Range = maximum data value - minimum data value
The variance measures the variability or spread of the observations from the mean.
\[\text{Sample variance:}\;\;\;s^2=\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2\]
A shortcut formula for the sample variance is given by \[s^2=\frac{1}{n-1}\left\{\sum_{i=1}^{n}x^2_i-n\bar{x}^2\right\}\]
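As a quick sanity check, the following R sketch compares the shortcut formula with R's built-in var(); the vector x is made up purely for illustration.

# Shortcut formula vs built-in var() (x is a made-up vector)
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)
(sum(x^2) - n * mean(x)^2) / (n - 1)  # shortcut formula, about 4.571
var(x)                                # same value from var()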
The standard deviation (\(s\)) of a data set is the square root of the sample variance.
Skewness is a measure of the asymmetry of the distribution.
Kurtosis measures the degree of peakedness or flatness of the distribution.
The empirical rule states that, for normally distributed data, 68% of the data falls within one standard deviation of the mean, 95% within two standard deviations, and 99.7% within three standard deviations.
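The percentages in the empirical rule come directly from the standard normal cdf, which we can verify in R:

# Empirical rule percentages from the standard normal cdf
pnorm(1) - pnorm(-1)  # about 0.68
pnorm(2) - pnorm(-2)  # about 0.95
pnorm(3) - pnorm(-3)  # about 0.997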
The \(z\)-score of an observation tells us the number of standard deviations that the observation is from the mean, that is, how far the observation is from the mean in units of standard deviation.
\[z=\frac{x-\bar{x}}{s}\]
As the \(z\)-score has no unit, it can be used to compare values from different data sets or to compare values within the same data set. The mean of \(z\)-scores is 0 and the standard deviation is 1.
Note that \(s>0\) so if \(z\) is negative, the corresponding \(x\)-value is below the mean. If \(z\) is positive, the corresponding \(x\)-value is above the mean. And if \(z=0\), the corresponding \(x\)-value is equal to the mean.
The 25th percentile, \(Q_1\), is often referred to as the first quartile.
The 50th percentile (the median), \(Q_2\), is referred to as the second or middle quartile.
The 75th percentile, \(Q_3\), is referred to as the third quartile.
The interquartile range (\(IQR\)) of a data set is the difference between the third and first quartiles (\(IQR = Q_3 - Q_1\)).
The IQR is a measure of variation that gives you an idea of how much the middle 50% of the data varies.
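In R, quantile() and IQR() compute these directly; the sketch below uses the exam grades from the worked example later in this section.

# Quartiles and IQR of the exam grades (defined again in the example below)
x <- c(88, 51, 63, 85, 79, 65, 79, 70, 73, 77)
quantile(x, probs = c(0.25, 0.5, 0.75))  # Q1 = 66.25, median = 75, Q3 = 79
IQR(x)                                   # Q3 - Q1 = 12.75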
To graph a boxplot (a box-and-whisker plot), we need the following values (called the five-number summary): the minimum, \(Q_1\), the median, \(Q_3\), and the maximum.
The box represents the interquartile range (\(IQR\)), which contains the middle 50% of values.
Some data sets contain outliers or extreme values, observations that fall well outside the overall pattern of the data. Boxplots can help us identify such values if some rules of thumb are used, e.g. (a short computation follows the rules below):
Outlier: Cases with values between 1.5 and 3 box lengths (the box length is the interquartile range) from the upper or lower edge of the box.
Extremes: Cases with values more than 3 box lengths from the upper or lower edge of the box.
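A minimal sketch of these fences for the same exam grades; no grade falls outside, so the boxplot flags no outliers:

# Outlier (1.5 box lengths) and extreme (3 box lengths) fences
x <- c(88, 51, 63, 85, 79, 65, 79, 70, 73, 77)
q1 <- unname(quantile(x, 0.25)); q3 <- unname(quantile(x, 0.75))
iqr <- q3 - q1
c(q1 - 1.5 * iqr, q3 + 1.5 * iqr)  # outlier fences: 47.125, 98.125
c(q1 - 3 * iqr, q3 + 3 * iqr)      # extreme-value fences: 28, 117.25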
Frequency distributions are tabular or graphical presentations of data that show each category for a variable and the frequency of the category’s occurrence in the data set. Percentages for each category are often reported instead of, or in addition to, the frequencies.
The mode can be used in this case as a measure of central tendency.
Bar charts and pie charts are often used to display the results of categorical (qualitative) variables. Pie charts are most useful for variables with relatively few categories, since they become cluttered and difficult to read when a variable has many categories.
The accounting final exam grades of 10 students are: 88, 51, 63, 85, 79, 65, 79, 70, 73, and 77. Their study programs, respectively, are: MA, MA, MBA, MBA, MBA, MBA, MBA, MSc, MSc, and MSc.
The sample mean grade is \[\bar{x}=\frac{1}{n}\sum_{i=1}^{n}x_i=\frac{1}{10}(88+51+\ldots+77)=73\]
Next we arrange the data from the lowest to the highest grade: 51, 63, 65, 70, 73, 77, 79, 79, 85, 88. The median grade is 75, which is located midway between the 5th and 6th ordered data points: \((73+77)/2=75\).
The mode is 79, since it appears twice while every other grade appears only once.
The range is \(88-51=37\).
The variance \[s^2=\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2=\frac{1}{9}((88-73)^2+\ldots+(77-73)^2)=123.78\]
The standard deviation: \(s=\sqrt{123.78}=11.13\)
The coefficient of variation: \(CV=s/\bar{x}=11.13/73=0.1525\)
Empirical rule: the empirical rule states that, for normally distributed data, 68% of the data falls within one standard deviation of the mean. In our example, this means that about 68% of the grades fall between 61.87 and 84.13 (\(73\pm 11.12555\)).
\(~\)
# R code for the "Accounting final exam grades" example
# Data example
grades<-c(88,51,63,85,79,65,79,70,73,77)
program<-factor(c("MA","MA","MBA","MBA","MBA","MBA","MBA","MSc","MSc","MSc"))
# Number of observations
length(grades)
## [1] 10
# Mean, Median, Variance, standard deviation, range, quantile
mean(grades)
## [1] 73
median(grades)
## [1] 75
var(grades)
## [1] 123.7778
sd(grades)
## [1] 11.12555
range(grades)
## [1] 51 88
quantile(grades,probs=c(0,0.25,0.5,0.75,1))
## 0% 25% 50% 75% 100%
## 51.00 66.25 75.00 79.00 88.00
# Summary
summary(grades)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 51.00 66.25 75.00 73.00 79.00 88.00
# Calculate z-score
(grades-mean(grades))/sd(grades)
## [1] 1.3482484 -1.9774310 -0.8988323 1.0785987 0.5392994 -0.7190658
## [7] 0.5392994 -0.2696497 0.0000000 0.3595329
scale(grades)
## [,1]
## [1,] 1.3482484
## [2,] -1.9774310
## [3,] -0.8988323
## [4,] 1.0785987
## [5,] 0.5392994
## [6,] -0.7190658
## [7,] 0.5392994
## [8,] -0.2696497
## [9,] 0.0000000
## [10,] 0.3595329
## attr(,"scaled:center")
## [1] 73
## attr(,"scaled:scale")
## [1] 11.12555
\(~\)
# Histograms present frequencies for values grouped into intervals.
hist(grades,xlab="grades", main="Histogram of grades")
# Boxplot
boxplot(grades,xlab="grades")
\(~\)
Stem-and-leaf plots: each score on a variable is divided into two parts, the stem gives the leading digits and the leaf shows the trailing digits.
The accounting final exam grades (arranged from the lowest to the highest grade) are: 51, 63, 65, 70, 73, 77, 79, 79, 85, 88.
\(~\)
# Stem-and-leaf plot.
stem(grades)
##
## The decimal point is 1 digit(s) to the right of the |
##
## 5 | 1
## 6 | 35
## 7 | 03799
## 8 | 58
Dot plot: a simple graph that shows the relative positions of the data points.
col2<-as.character(factor(program,labels=c("red","blue","orange")))
dotchart(grades, labels=factor(1:10), groups=program, pch=16, col=col2, xlab="Grades",xlim=c(45,100))
\(~\)
# Frequency table
table(program)
## program
## MA MBA MSc
## 2 5 3
# Pie and Bar charts
pie(table(program))
barplot(table(program))
For a continuous random variable, the role of the probability mass function is taken by a density function, \(f(x)\), which has the properties that \(f(x) \geq 0\) and \[\int_{-\infty}^{\infty} f (x) dx = 1\]
For any \(a < b\), the probability that \(X\) falls in the interval \((a, b)\) is the area under the density function between \(a\) and \(b\): \[P(a < X < b) =\int_{a}^{b} f (x) dx\]
Thus the probability that a continuous random variable \(X\) takes on any particular value is 0: \[P(X = c) =\int_c^c f (x) dx = 0\]
If \(X\) is a continuous random variable, then \[P(a < X < b) = P(a \leq X < b) = P(a < X \leq b)\] Note that this is not true for a discrete random variable.
The cumulative distribution function (cdf) of a continuous random variable \(X\) is defined as: \[F(x) = P(X \leq x)=\int_{-\infty}^x f (u) du\]
The cdf can be used to evaluate the probability that \(X\) falls in an interval: \[P(a \leq X \leq b) = \int_a^b f (x) dx = F(b) - F(a)\]
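As a numerical illustration, we can check these properties in R for the standard normal density dnorm():

# Density/cdf relationship checked numerically for the standard normal
integrate(dnorm, -Inf, Inf)    # total area under f(x) is 1
integrate(dnorm, -1, 1)$value  # P(-1 < X < 1) as an integral
pnorm(1) - pnorm(-1)           # the same probability via F(b) - F(a)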
If \(X\) is a continuous random variable with density \(f (x)\), then \[\mu= E(X) =\int_{-\infty}^{\infty}x f (x) dx\] or, more generally, for any function \(g\), \[E(g(X)) =\int_{-\infty}^{\infty}g(x) f (x) dx\]
The variance of \(X\) is \[\sigma^2=Var(X) = E\left\{[X - E(X)]^2\right\}=\int_{-\infty}^{\infty}(x -\mu)^2 f (x) dx\]
The variance of \(X\) is the average value of the squared deviation of \(X\) from its mean.
The variance of \(X\) can also be expressed as \(Var(X)=E(X^2)-[E(X)]^2\) .
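A minimal numerical check of this identity for \(X \sim N(0,1)\), whose variance is known to be 1:

# Var(X) = E(X^2) - [E(X)]^2, checked by numerical integration
ex2 <- integrate(function(x) x^2 * dnorm(x), -Inf, Inf)$value  # E(X^2)
ex  <- integrate(function(x) x * dnorm(x), -Inf, Inf)$value    # E(X)
ex2 - ex^2  # close to 1, the variance of a standard normal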
For the continuous uniform distribution on an interval \([a,b]\), the density is \(f(x)=\frac{1}{b-a}\) for \(a\leq x \leq b\) and 0 otherwise. The cumulative distribution function is \[F(x)=\left\{\begin{array}{lll}0& \text{for}& x<a\\ \frac{x-a}{b-a}& \text{for}& a\leq x <b \\ 1&\text{for}& x \geq b\end{array}\right.\]
A special case is the standard uniform distribution on \([0,1]\), where \(f (x) = 1\) for \(0 \leq x \leq 1\).
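In R, dunif() and punif() implement this density and cdf, defaulting to the special case \(a=0\), \(b=1\):

# Uniform density and cdf; defaults are min = 0, max = 1
dunif(0.5)                     # f(0.5) = 1
punif(0.5)                     # F(0.5) = (0.5 - 0)/(1 - 0) = 0.5
punif(1.5, min = 0, max = 2)   # F(1.5) = (1.5 - 0)/(2 - 0) = 0.75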
The standard normal distribution has density \[f (z) = \frac{1}{\sqrt{2\pi}} e^{-z^2/2}, \quad -\infty<z<\infty\]
We write \(Z \sim N(0,1)\) as a short way of saying `\(Z\) follows a standard normal distribution with mean 0 and variance 1'.
To standardize any variable \(X\) (into \(Z\)) we calculate \(Z\) as: \[ Z = \frac{X - \mu} { \sigma}\] The \(Z\)-score calculated above indicates how many standard deviations \(X\) is from the mean.
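For example (with made-up values \(\mu = 73\) and \(\sigma = 11.13\)), the two pnorm() calls below give the same probability, once directly and once via the \(z\)-score:

# P(X <= 80) for X ~ N(73, 11.13^2), directly and after standardizing
pnorm(80, mean = 73, sd = 11.13)
pnorm((80 - 73) / 11.13)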
If \(X\sim N(\mu, \sigma^2)\) then \(Y=e^X\) (\(y > 0\)) has a log-normal distribution with mean \(E(Y)=e^{\mu+\sigma^2/2}\) and variance \(V(Y)=(e^{\sigma^2}-1)e^{2\mu+\sigma^2}\).
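A quick simulation sketch of these formulas (\(\mu\) and \(\sigma\) chosen arbitrarily):

# Simulated vs theoretical mean and variance of a log-normal
set.seed(1)
mu <- 0; sigma <- 0.5
y <- exp(rnorm(1e5, mean = mu, sd = sigma))
c(mean(y), var(y))                                                # simulated
c(exp(mu + sigma^2/2), (exp(sigma^2) - 1) * exp(2*mu + sigma^2))  # theoretical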
We consider here three probability distributions derived from the normal distribution:
These distributions are mainly useful for statistical inference, e.g. hypothesis testing and confidence intervals (to follow).
If \(Z_1, Z_2, \ldots, Z_n\) are independent \(N(0,1)\) random variables, then \(U = Z_1^2 + \cdots + Z_n^2\) follows the chi-square distribution with \(n\) degrees of freedom, written \(U \sim \chi^2_n\). If \(Z \sim N(0,1)\) and \(U \sim \chi^2_n\) and \(Z\) and \(U\) are independent, then the distribution of \(Z/\sqrt{U/n}\) is called the \(t\) distribution with \(n\) degrees of freedom.
Let \(U\) and \(V\) be independent chi-square random variables with \(m\) and \(n\) degrees of freedom, respectively. The distribution of \[W=\frac{U/m}{V/n}\] is called the \(F\) distribution with \(m\) and \(n\) degrees of freedom and is denoted by \(F_{m,n}\).
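The constructions above can be mimicked by simulation; the sketch below compares simulated quantiles of \(Z/\sqrt{U/n}\) with the exact \(t_n\) quantiles from qt() (the value of \(n\) is chosen arbitrarily):

# Simulated t construction vs exact t quantiles
set.seed(1)
n <- 5
z <- rnorm(1e5)
u <- rchisq(1e5, df = n)
quantile(z / sqrt(u / n), probs = c(0.05, 0.5, 0.95))  # simulated
qt(c(0.05, 0.5, 0.95), df = n)                         # exact t_5 quantiles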
If \(X\sim N(\mu, \sigma^2)\) and \(Y = aX + b\) with \(a \neq 0\), the density of \(Y\) is \[f_Y (y) = \frac{1}{|a|\sigma\sqrt{2\pi }}\exp\left[-\frac{1}{2}\left(\frac{y-b-a\mu}{a\sigma}\right)^2\right]\]
Thus, \(Y = aX + b\) follows a normal distribution with mean \(a \mu + b\) and standard deviation \(|a|\sigma\).
If \(X \sim N(\mu, \sigma^2)\) and \(Y = aX + b\), then \(Y \sim N(a \mu + b, a^2\sigma^2)\).
Can you use this result to show that \(Z = (X-\mu)/\sigma \sim N(0,1)\)?
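A simulation sketch of this result (the values of \(a\), \(b\), \(\mu\) and \(\sigma\) are made up), including the standardization asked about above:

# Y = aX + b is normal with mean a*mu + b and variance a^2*sigma^2
set.seed(1)
x <- rnorm(1e5, mean = 2, sd = 3)   # X ~ N(2, 9)
y <- 5 * x + 1
c(mean(y), var(y))                  # close to 5*2 + 1 = 11 and 5^2*3^2 = 225
z <- (x - 2) / 3                    # a = 1/sigma, b = -mu/sigma
c(mean(z), var(z))                  # close to 0 and 1, i.e. Z ~ N(0,1)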
The joint density function \(f(x,y)\) of two continuous random variables \(X\) and \(Y\) is such that \[f(x,y) \geq 0\] \[\int_{-\infty}^{\infty}\int_{-\infty}^{\infty}f(x,y)\;dxdy=1\] \[\int_{c}^{d}\int_{a}^{b}f(x,y)dxdy=P(a \leq X \leq b, c \leq Y \leq d )\] The marginal density function of \(X\) is \[f_X(x)=\int_{-\infty}^{\infty}f(x,y)\;dy\] Similarly, the marginal density function of \(Y\) is \[f_Y(y)=\int_{-\infty}^{\infty}f(x,y)\;dx\]
The cdf of two continuous random variables \(X\) and \(Y\) can be obtained as \[F(x,y)=\int_{-\infty}^{x}\int_{-\infty}^{y}f(u,v)dudv\] and \[f(x,y)=\frac{\partial^2}{\partial x \partial y}F(x,y)\] wherever the derivative is defined.
The expected value of a constant is the constant itself, i.e. if \(c\) is a constant, \(E(c)=c\).
The variance of a constant is zero, i.e. if \(c\) is a constant, \(Var(c)=0\).
If \(a\) and \(b\) are constants, and \(Y = aX + b\), then \(E(Y)=a E(X)+b\) and \(Var(Y) = a^2Var(X)\) (if \(Var(X)\) exists).
If \(X\) and \(Y\) are independent, then \(E(XY) = E(X)E(Y)\) and \[Var(X+ Y)=Var(X) + Var(Y)\] \[Var(X- Y)=Var(X) + Var(Y)\]
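A quick simulation check of the variance rules for independent variables (the standard deviations are made up):

# For independent X and Y, Var(X + Y) and Var(X - Y) both equal Var(X) + Var(Y)
set.seed(1)
x <- rnorm(1e5, sd = 2)    # Var(X) = 4
y <- rnorm(1e5, sd = 3)    # Var(Y) = 9
c(var(x + y), var(x - y))  # both close to 13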
If \(X\) and \(Y\) are independent random variables and \(g\) and \(h\) are fixed functions, then \[E[g(X)h(Y )] = E[g(X)]E[h(Y )]\]
Let \(f(x,y)\) be the joint density function of random variables \(X\) and \(Y\). The conditional expectation of \(X\), given \(Y=y\), is defined as \[E(X|Y=y)=\sum_x x f_{X|Y}(x|Y=y)\;\;\; \text{if $X$ is discrete}\] \[E(X|Y=y)=\int_{-\infty}^{\infty} x f_{X|Y}(x|Y=y)dx\;\;\; \text{if $X$ is continuous}\] The conditional variance of \(X\), given \(Y=y\), is defined as \[Var(X|Y=y)=\sum_x [x-E(X|Y=y)]^2 f_{X|Y}(x|Y=y)\;\;\; \text{if $X$ is discrete}\] \[Var(X|Y=y)=\int_{-\infty}^{\infty} [x-E(X|Y=y)]^2 f_{X|Y}(x|Y=y)dx\;\;\; \text{if $X$ is continuous}\]