Introduction to R

Sun 19 January 2014

This tutorial is a practical guide to getting started with R to perform simple data analysis tasks that you might otherwise do in Microsoft Excel. The emphasis is firmly on the practical, and it is meant for beginners who have only heard of R or are just starting out with it. It assumes familiarity with simple analysis techniques like split-aggregate-combine and basic statistical concepts such as mean, median and frequency distribution. It also assumes some familiarity with command-line tools, though this is certainly not a necessity.

So let's dive right in.

Installing R

R can be downloaded for free for Windows, Mac OS X or Linux from the R Project website. This will install all the required libraries and some base packages that you will need to perform simple data analyses.

A better alternative (in my opinion) is to install RStudio - a free, open-source IDE - which greatly improves usability. Installing RStudio will also install R if you don't already have it on your system. For the purposes of this tutorial, I will assume that you are using RStudio, though the exercises can be completed with just R.

Getting Help

At any time in the R console, you can use the help command to get more information about any function.

# Get help for any function e.g. data.frame
help(data.frame)

You can also use the ?? command to search the documentation.

# Search the documentation for instances of the term 'plyr'
??plyr

Packages

R relies on its extensive community to build and share pieces of code, referred to as packages, to perform most of the complex tasks. In fact, the basic functionality itself is provided by the base package that is loaded by default when you start R. Packages are distributed through the Comprehensive R Archive Network (CRAN) and can be installed using the install.packages command.

# Install the ggplot2 package
install.packages('ggplot2', dependencies = T)

Some packages depend on functionality provided by other packages. The dependencies = T parameter (T is shorthand for TRUE) makes sure that all required packages are installed as well.
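Note that install.packages only downloads and installs a package; you still have to load it into your session with library() before you can use it. A quick sketch (shown here with the base package tools so it runs without installing anything; for ggplot2 you would write library(ggplot2)):

```r
# install.packages() installs a package; library() loads it into the session
library(tools)
# search() shows everything currently attached
"package:tools" %in% search()
## [1] TRUE
```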

There are literally thousands of R packages available for download and they perform pretty much any statistical/analysis task that you can imagine. To search for a package that does what you want to do, just Google <task> r cran and you'll probably get a result on the first page. I've provided a list of some packages that I've personally found useful at the end of this post.

Basic Data Types

Here are some basic data types that R supports -

a = 12
class(a)
## [1] "numeric"

b = 12.56
class(b)
## [1] "numeric"

c = "This is a character string"
class(c)
## [1] "character"

d = as.Date("2014-01-15")
class(d)
## [1] "Date"

There are many more data types derived from these base classes (and used in various libraries), but we won't go into detail about them at this point. You can easily convert between data types using the as.XXX functions, where XXX is replaced with the target data type. The last example above converts a character string to a Date.
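One thing to watch out for: a conversion that can't be parsed yields NA (with a warning), which matters when you're cleaning messy data.

```r
# A well-formed string converts cleanly
as.numeric("3.14")
## [1] 3.14

# A malformed string converts to NA (with a "NAs introduced by coercion" warning)
as.numeric("three")
## [1] NA
```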

Important Data Structures

The two most important data structures in R that you need to know about are Vectors and Data Frames.

Vector

A Vector can be thought of as a single column of data. It can be constructed using the c() function.

e = c(1, 2, 3, 4)
e
## [1] 1 2 3 4

f = c("Eenie", "Meenie", "Mienee", "Mo")
f
## [1] "Eenie"  "Meenie" "Mienee" "Mo"

# Get the number of elements in the vector
length(f)
## [1] 4

A vector must contain items of the same type. If you pass items of different types to the c() constructor, they will all be coerced to the most flexible of those types (logical, then numeric, then character).
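For example, mixing a number, a string and a logical value coerces everything to character:

```r
# Everything is coerced to character, the most flexible type present
mixed = c(1, "two", TRUE)
mixed
## [1] "1"    "two"  "TRUE"
class(mixed)
## [1] "character"
```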

You can reference a single element within a Vector using the [] operator.

# Indices start at 1, not 0!
e[1]
## [1] 1

f[2]
## [1] "Meenie"

# Oops!
f[5]
## [1] NA

The last example used an index beyond the end of the vector, and R returned NA. NA is R's equivalent of the NULL value and denotes missing data (rather than zero data). It's an important thing to note.
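To test for missing values, use the is.na() function; comparing against NA with == does not work, because any comparison with NA is itself NA.

```r
x = c(1, NA, 3)

# is.na() is the right way to test for missing values
is.na(x)
## [1] FALSE  TRUE FALSE

# Comparing against NA just returns NA
x == NA
## [1] NA NA NA
```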

Data Frame

A Data Frame is the most commonly used data structure in R and can be thought of as a spreadsheet (or a matrix, though matrices are a separate, distinct data structure in R) - it has rows and columns and contains values.

g = data.frame(col1 = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), col2 = c(11, 12, 13, 
    14, 15, 16, 17, 18, 19, 20))
g
##    col1 col2
## 1     1   11
## 2     2   12
## 3     3   13
## 4     4   14
## 5     5   15
## 6     6   16
## 7     7   17
## 8     8   18
## 9     9   19
## 10   10   20

Data Frames can be constructed using the data.frame constructor or from other data structures using the as.data.frame function. col1 and col2 are the column names in the example above. You can also specify row names, though that's optional and R assigns numbers in the sequence 1..n if you omit this.

There are a few different ways to access the data in a data frame -

# Get the first column by number
g[, 1]
##  [1]  1  2  3  4  5  6  7  8  9 10

# Get the first column by name
g$col1
##  [1]  1  2  3  4  5  6  7  8  9 10

# Get the second value of the first column
g$col1[2]
## [1] 2

# Get the value in the 2nd row and 2nd column
g[2, 2]
## [1] 12

# Get the number of rows in the data frame
nrow(g)
## [1] 10

# Get the number of columns in the data frame
ncol(g)
## [1] 2

As you can see, items are addressed in the [row, col] format and a missing index signifies all values. Thus g[,] would select all the values in the data frame (all rows and all columns).
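Negative indices exclude rather than select, which is a handy complement to the [row, col] addressing described above (sketched here on a small stand-alone data frame):

```r
h = data.frame(col1 = 1:3, col2 = 4:6)

# A negative row index drops that row
h[-1, ]
##   col1 col2
## 2    2    5
## 3    3    6

# A negative column index drops that column (a single remaining
# column is returned as a vector)
h[, -2]
## [1] 1 2 3
```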

You can add additional columns easily in a couple of ways -

# Directly define a new column name
g$col3 = g$col2 * rnorm(10)
# Use the cbind() (column bind) function
g = cbind(g, col4 = c(31, 32, 33, 34, 35, 36, 37, 38, 39, 40))

Similarly, you can use the rbind() function to add more rows to the data frame.
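A minimal rbind() sketch (using a small stand-alone data frame rather than g, since g's col3 values depend on the random numbers above):

```r
df = data.frame(col1 = c(1, 2), col2 = c(11, 12))

# Append a new row; the column names must match the existing ones
df = rbind(df, data.frame(col1 = 3, col2 = 13))
df
##   col1 col2
## 1    1   11
## 2    2   12
## 3    3   13
```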

Basic Filtering

R provides a very easy way to filter the data once it is in the right form. Let's see some examples -

# Return all rows where the value in column 1 is odd
g[g$col1 %% 2 == 1, ]
##   col1 col2    col3 col4
## 1    1   11 -8.1003   31
## 3    3   13 -0.4155   33
## 5    5   15 13.2743   35
## 7    7   17 -0.4592   37
## 9    9   19 -9.3142   39

# Return all the rows where the value in column 2 is greater than 18
g[g$col2 > 18, ]
##    col1 col2   col3 col4
## 9     9   19 -9.314   39
## 10   10   20  3.546   40

# Return all the rows where the value in column 1 is even and the value in
# column 2 is odd
g[g$col1 %% 2 == 0 & g$col2 %% 2 == 1, ]
## [1] col1 col2 col3 col4
## <0 rows> (or 0-length row.names)
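The subset() function offers a slightly more readable way to express the same kind of filter; inside it you can refer to columns by name without the df$ prefix (sketched on a small stand-alone data frame):

```r
df = data.frame(col1 = 1:10, col2 = 11:20)

# Equivalent to df[df$col1 %% 2 == 1, ]
subset(df, col1 %% 2 == 1)
##   col1 col2
## 1    1   11
## 3    3   13
## 5    5   15
## 7    7   17
## 9    9   19
```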

A commonly used task is to filter a data frame to only rows that do not contain a NA value. Here's how you would do it -

# Add a column with some NAs
g = cbind(g, col5 = c(31,32,NA,34,35,36,37,NA,39,40))
g
##    col1 col2       col3 col4 col5
## 1     1   11  -8.810477   31   31
## 2     2   12 -11.964423   32   32
## 3     3   13 -19.017651   33   NA
## 4     4   14  14.609760   34   34
## 5     5   15  12.961873   35   35
## 6     6   16 -18.831297   36   36
## 7     7   17  -1.416232   37   37
## 8     8   18  18.072227   38   NA
## 9     9   19  12.517877   39   39
## 10   10   20   1.049552   40   40
# Show rows with no NA values (using the complete.cases function)
g[complete.cases(g),]
##    col1 col2       col3 col4 col5
## 1     1   11  -8.810477   31   31
## 2     2   12 -11.964423   32   32
## 4     4   14  14.609760   34   34
## 5     5   15  12.961873   35   35
## 6     6   16 -18.831297   36   36
## 7     7   17  -1.416232   37   37
## 9     9   19  12.517877   39   39
## 10   10   20   1.049552   40   40
# Set all NA values to 0
g[is.na(g)] = 0
g
##    col1 col2       col3 col4 col5
## 1     1   11  -8.810477   31   31
## 2     2   12 -11.964423   32   32
## 3     3   13 -19.017651   33    0
## 4     4   14  14.609760   34   34
## 5     5   15  12.961873   35   35
## 6     6   16 -18.831297   36   36
## 7     7   17  -1.416232   37   37
## 8     8   18  18.072227   38    0
## 9     9   19  12.517877   39   39
## 10   10   20   1.049552   40   40

Basic Statistical Metrics

The base package provides functions to calculate basic statistical measures such as mean, median and standard deviation.

# Calculate the mean of the first column
mean(g$col1)
## [1] 5.5

# Calculate the median of the second column
median(g$col2)
## [1] 15.5

# Calculate the standard deviation of the third column
sd(g$col3)
## [1] 7.907

You can also get the summary statistics of the entire data frame using the summary function -

summary(g)
##       col1            col2           col3              col4     
##  Min.   : 1.00   Min.   :11.0   Min.   :-13.668   Min.   :31.0  
##  1st Qu.: 3.25   1st Qu.:13.2   1st Qu.: -8.892   1st Qu.:33.2  
##  Median : 5.50   Median :15.5   Median : -0.869   Median :35.5  
##  Mean   : 5.50   Mean   :15.5   Mean   : -2.352   Mean   :35.5  
##  3rd Qu.: 7.75   3rd Qu.:17.8   3rd Qu.:  1.439   3rd Qu.:37.8  
##  Max.   :10.00   Max.   :20.0   Max.   : 13.274   Max.   :40.0

Sorting Data

R provides a convenient order function to sort data in vectors and data frames. Here are a few examples -

# Sort data frame by col3
g[order(g$col3),]
##    col1 col2       col3 col4 col5
## 3     3   13 -19.017651   33    0
## 6     6   16 -18.831297   36   36
## 2     2   12 -11.964423   32   32
## 1     1   11  -8.810477   31   31
## 7     7   17  -1.416232   37   37
## 10   10   20   1.049552   40   40
## 9     9   19  12.517877   39   39
## 5     5   15  12.961873   35   35
## 4     4   14  14.609760   34   34
## 8     8   18  18.072227   38    0
# Sort data frame by col1 in reverse (descending) order (note the '-' sign)
g[order(-g$col1),]
##    col1 col2       col3 col4 col5
## 10   10   20   1.049552   40   40
## 9     9   19  12.517877   39   39
## 8     8   18  18.072227   38    0
## 7     7   17  -1.416232   37   37
## 6     6   16 -18.831297   36   36
## 5     5   15  12.961873   35   35
## 4     4   14  14.609760   34   34
## 3     3   13 -19.017651   33    0
## 2     2   12 -11.964423   32   32
## 1     1   11  -8.810477   31   31

Plotting Graphs

The base package provides some rudimentary graphics capabilities that are more than sufficient for initial data analysis when you are just exploring the data. For more advanced visualizations, packages like ggplot2 may be used.

Here are some examples using the base package -

# A scatterplot of col1 vs. col2
plot(g$col1, g$col2)

# A line plot of col2 vs. col3
plot(g$col2, g$col3)
lines(g$col2, g$col3)

# A histogram of a 1000 randomly generated numbers from the normal
# distribution
hist(rnorm(1000))

There are a bunch of parameters that can be passed into the plotting functions to change the aesthetics. For example -

hist(rnorm(1000, mean = 2, sd = 5), main = "This is a fancy title", xlab = "This is the X-axis title", 
    ylab = "This is the Y-axis title", col = "blue")

Aggregation Techniques

The split-aggregate-combine philosophy is the cornerstone of data analysis. R makes it extremely simple to do this. Let's look at a few examples but first, let's create a data set we can work with.

# Create an awesome data set
# Set of dates -- all days in 2013
dates = seq(from = as.Date("2013-01-01"), to = as.Date("2013-12-31"), by = "day")
# Set of cities
cities = c("Seattle", "New York", "San Francisco", "Denver")
# Create a data frame populated with some random temperature data
temps = data.frame(date = rep(dates, 4), city = c(rep(cities[1], 365), rep(cities[2], 
    365), rep(cities[3], 365), rep(cities[4], 365)), temp = c(c(sort(rnorm(180, 
    mean = 40, sd = 8)) + rnorm(180) * 10, sort(rnorm(185, mean = 40, sd = 8), 
    decreasing = T) + rnorm(185) * 10), c(sort(rnorm(180, mean = 50, sd = 8)) + 
    rnorm(180) * 10, sort(rnorm(185, mean = 50, sd = 8), decreasing = T) + rnorm(185) * 
    10), c(sort(rnorm(180, mean = 70, sd = 8)) + rnorm(180) * 10, sort(rnorm(185, 
    mean = 70, sd = 8), decreasing = T) + rnorm(185) * 10), c(sort(rnorm(180, 
    mean = 30, sd = 8)) + rnorm(180) * 10, sort(rnorm(185, mean = 30, sd = 8), 
    decreasing = T) + rnorm(185) * 10)))

head(temps)
##         date    city  temp
## 1 2013-01-01 Seattle 10.48
## 2 2013-01-02 Seattle 51.13
## 3 2013-01-03 Seattle 42.92
## 4 2013-01-04 Seattle 14.79
## 5 2013-01-05 Seattle 27.71
## 6 2013-01-06 Seattle 31.08

# Basic summary stats
summary(temps)
##       date                       city          temp      
##  Min.   :2013-01-01   Denver       :365   Min.   :-14.2  
##  1st Qu.:2013-04-02   New York     :365   1st Qu.: 33.1  
##  Median :2013-07-02   San Francisco:365   Median : 46.5  
##  Mean   :2013-07-02   Seattle      :365   Mean   : 47.4  
##  3rd Qu.:2013-10-01                       3rd Qu.: 60.3  
##  Max.   :2013-12-31                       Max.   :110.6

Let's do some basic data analysis using visualization -

par(mfrow = c(2, 2))
plot(temps[temps$city == cities[1], 1], temps[temps$city == cities[1], 3], main = cities[1])
plot(temps[temps$city == cities[2], 1], temps[temps$city == cities[2], 3], main = cities[2])
plot(temps[temps$city == cities[3], 1], temps[temps$city == cities[3], 3], main = cities[3])
plot(temps[temps$city == cities[4], 1], temps[temps$city == cities[4], 3], main = cities[4])

Now, say you want to calculate the mean and median daily temperature in each of the cities. You could do it by filtering the rows (either by taking indices 1 - 365, 366 - 730 and so on, or by using the filtering technique described above), but there's an easier way using the aggregate function.

# Calculate the average daily temperature in each city
aggregate(temp ~ city, data = temps, mean)
##            city  temp
## 1        Denver 29.81
## 2      New York 50.03
## 3 San Francisco 68.56
## 4       Seattle 41.02

# Calculate the median daily temperature in each city
aggregate(temp ~ city, data = temps, median)
##            city  temp
## 1        Denver 28.90
## 2      New York 49.94
## 3 San Francisco 69.03
## 4       Seattle 41.40

The aggregate function, as the name implies, aggregates the data across one or more dimensions and displays the results. In the example above, we aggregate the temp column across cities and apply the mean function to calculate the value.
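For simple one-dimensional splits, the base tapply() function does the same job and returns a named vector instead of a data frame (sketched on a tiny stand-alone data set):

```r
df = data.frame(city = rep(c("A", "B"), each = 3),
                temp = c(10, 12, 14, 20, 22, 24))

# Split temp by city and apply mean to each group
tapply(df$temp, df$city, mean)
##  A  B
## 12 22
```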

Let's take this further and calculate the average monthly temperature -

# Create a new column for the month name
temps$month = format(temps$date, "%B")
# Do the aggregation
monthly_means = aggregate(temp ~ city + month, data = temps, mean)
monthly_means[1:4, ]
##            city month  temp
## 1        Denver April 29.38
## 2      New York April 51.67
## 3 San Francisco April 71.93
## 4       Seattle April 42.84

Extra Credit

The package reshape has a bunch of functions that allow you to transform the data into the format you want. For example, the monthly average data above can be transformed into a more intuitive (sorting notwithstanding) format using this package.

# Load the library. You might need to install it first using install.packages().
library(reshape)
## Loading required package: plyr
## 
## Attaching package: 'reshape'
## 
## The following objects are masked from 'package:plyr':
## 
##     rename, round_any

# Reshape the data
reshape(monthly_means, v.names = "temp", idvar = "city", timevar = "month", 
    direction = "wide")
##            city temp.April temp.August temp.December temp.February
## 1        Denver      29.38       33.53         17.15         26.44
## 2      New York      51.67       55.12         35.55         44.54
## 3 San Francisco      71.93       74.22         56.50         59.85
## 4       Seattle      42.84       44.63         28.58         33.84
##   temp.January temp.July temp.June temp.March temp.May temp.November
## 1        18.30     44.52     44.98      25.19    35.62         23.81
## 2        37.71     63.44     63.33      48.89    52.16         45.16
## 3        59.20     80.79     81.50      67.04    76.12         61.89
## 4        30.23     53.19     50.77      40.66    49.68         36.25
##   temp.October temp.September
## 1        26.90          31.96
## 2        50.11          52.61
## 3        62.74          70.46
## 4        36.00          45.24

Reading and Writing Data

R allows you to read and write data easily from and to various data sources. The two most common ones are reading from a CSV or from a database such as MySQL.

Reading from CSV

R provides a handy read.csv() function to read data from CSV files. To test it out, I'll use this CSV file (you should download it too to follow along).

# Read the CSV
precip = read.csv("precipitation.csv")
head(precip)
##   State_id YEAR Month PRECIP..in.
## 1 '212142' 1889     1        0.58
## 2 '212142' 1889     2        0.65
## 3 '212142' 1889     3        0.78
## 4 '212142' 1889     4        1.53
## 5 '212142' 1889     5        1.94
## 6 '212142' 1889     6        3.71

class(precip)
## [1] "data.frame"

# Display column names
names(precip)
## [1] "State_id"    "YEAR"        "Month"       "PRECIP..in."

# Rename the columns
names(precip) = c("state_id", "year", "month", "inches")
head(precip)
##   state_id year month inches
## 1 '212142' 1889     1   0.58
## 2 '212142' 1889     2   0.65
## 3 '212142' 1889     3   0.78
## 4 '212142' 1889     4   1.53
## 5 '212142' 1889     5   1.94
## 6 '212142' 1889     6   3.71

There are a bunch of parameters that can be provided to the read.csv function for different formats that data might be in (for example, lack of a header row, different types of separators, how strings are encoded etc.). Look at the documentation by typing help(read.csv) for more information.
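For instance, the header, sep and col.names parameters together handle a headerless, semicolon-separated file (sketched here with an inline text connection so it runs without an external file):

```r
# Two rows of headerless, semicolon-separated data
raw = "1;0.58\n2;0.65"

precip2 = read.csv(textConnection(raw), header = FALSE, sep = ";",
                   col.names = c("month", "inches"))
precip2
##   month inches
## 1     1   0.58
## 2     2   0.65
```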

Write to CSV

Once you are done with the analysis, you might want to save the results in a file to import into other tools. A good way to export tabular data is the CSV format and R provides the convenient write.csv function to enable this.

Let's save the temperature data we created above to a file -

write.csv(temps, file = "temperatures.csv", row.names = FALSE)

Voila! The file temperatures.csv should now be saved in your current working directory. The row.names parameter is set to FALSE so that the row names are not written to the file (by default, write.csv includes them as an extra first column).

Reading from MySQL

The RMySQL package provides a pretty good interface to read from and write to a MySQL database. Here's how you use it -

# Load the library
library(RMySQL)
# Connect to the MySQL server
conn = dbConnect(MySQL(), 
                 host = "[host-name]", 
                 user = "[username]", 
                 password = "[password]", 
                 dbname = "[db name]")
# Build a query
query = "SELECT * FROM some_table WHERE some_column > some_value"
# Execute the query and retrieve the data
data = dbGetQuery(conn, query)

Writing to MySQL

dbGetQuery basically does a dbSendQuery followed by fetch to retrieve the results. You can execute an INSERT statement just as easily by using the dbSendQuery function instead.

# Write to the database
query = "INSERT INTO some_table (col1, col2) VALUES (val1, val2)"
dbSendQuery(conn, query)

Miscellaneous Tips

Working Directory

R has the concept of a working directory - a path on the filesystem - where it reads data from and writes results to. By default, it is likely set to your home directory or the directory from where R is executed. It is a good idea to set this path to a clean path before starting a project.

# Get current working directory
getwd()
## [1] "/Users/fahmed/code/projects/analysis/fahmed/misc"
# Set the working directory
setwd("~/analysis/projects/rtutorial")

Saving the session

R keeps the user-defined and calculated variables in memory. It also provides the ability to write this information to a file should one want to come back to an analysis at a later point. By default, it stores the session in a file named .RData in the current working directory. This can be overridden to explicitly save the session (and only selected variables) to a file of one's choice.

# Save all the variables in memory to a file called snapshot1.RData
save.image(file = "snapshot1.RData")
# Read the file back into memory
load("snapshot1.RData")

The variables that you defined prior to saving the session will be available once again after the load function.
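Two related housekeeping functions are worth knowing: ls() lists the variables currently in memory, and rm() removes them.

```r
x = 42

# List the variables in the current environment
"x" %in% ls()
## [1] TRUE

# Remove a variable
rm(x)
"x" %in% ls()
## [1] FALSE
```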

Useful Base Functions

# seq - Generate a sequence of data
seq(10)
##  [1]  1  2  3  4  5  6  7  8  9 10
seq(from = 1, to = 20, by = 2)
##  [1]  1  3  5  7  9 11 13 15 17 19
seq(from = as.Date("2014-01-01"), to = as.Date("2014-01-07"), by = "days")
## [1] "2014-01-01" "2014-01-02" "2014-01-03" "2014-01-04" "2014-01-05"
## [6] "2014-01-06" "2014-01-07"

# rep - Repeat a sequence
rep(1, 10)
##  [1] 1 1 1 1 1 1 1 1 1 1
a = c(1, 2, 3, 4)
rep(a, 3)
##  [1] 1 2 3 4 1 2 3 4 1 2 3 4

# Data type conversion
as.numeric("12")
## [1] 12
as.character(12)
## [1] "12"
as.Date("2013-01-12")
## [1] "2013-01-12"

Useful Packages

Hopefully, at this point, you are able to perform some simple analyses in R! There is, of course, much to learn and the best way is to just dive in and try. The R community is very helpful and there's a ton of material on the Internet. Good luck!

Last Edited: Jan 19, 2014 14:45