15.071x Analytics Edge Course: 15.071x Analytics Edge, Introduction to R

This page covers getting started in R and basic data analysis.

Some General Tips

If you ever see a plus sign in your R console instead of the greater-than sign, R is waiting for you to finish an incomplete command. Hit the escape key to get back to the prompt.
To get help on any function, type a question mark and then the name of the function (for example, ?sqrt).
To add new observations to a data frame use the rbind() function.
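As a minimal sketch of the rbind() tip, assuming a small made-up data frame called Scores (the names and values are illustrative only, not from the course data):

```r
# Hypothetical data frame with two observations
Scores = data.frame(Name = c("Ann", "Bob"), Score = c(90, 85))

# The new observation needs the same column names as the data frame
NewRow = data.frame(Name = "Cid", Score = 78)

# rbind() returns a new data frame, so assign the result back to keep it
Scores = rbind(Scores, NewRow)
nrow(Scores)
```

Note that rbind() does not modify Scores in place; without the assignment the added row would be lost.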

"Vocabulary" / Definitions

Data Structure: It is a particular way of organizing information so it can be efficiently stored and retrieved by the computer. Examples include arrays (or lists), hash tables, structs, and sets.
Variable Names: case-sensitive; they must not start with a number or contain spaces.
Vectors: created with the c() function, or the seq() function.
Data frames: created with the data.frame() function, or by reading in a csv file with the read.csv() function.
Observation: row of values within a data frame.
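The definitions above can be sketched in a few lines. The names Country, Year and Example, and the values in them, are assumptions made up for illustration:

```r
# c() combines individual values into a vector
Country = c("Brazil", "China", "India")

# seq(from, to, by) builds a regular numeric sequence: 2000, 2005, 2010
Year = seq(2000, 2010, 5)

# data.frame() binds vectors of equal length together as columns
Example = data.frame(Country, Year)

# Each row is one observation; this selects the second one
Example[2, ]
```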

Reading in Files

To read in a csv file, first navigate to the directory on your computer containing the file. This can be done on a Mac by going to the "Misc" menu and selecting "Change Working Directory...", and on a Windows machine by going to the "File" menu and selecting "Change dir...". Then read the file into R with the following command:

DataFrame = read.csv("filename.csv")

"DataFrame" should be replaced with what you want to call your data frame in R, and "filename.csv" should be replaced with the name of the file you want to read in.
If you are using R without a graphical interface, either include the full path in "filename.csv" or change the working directory with the commands below. To show the directory you are currently working in:

getwd()

To set a new working directory, pass its path as a string (the path shown is a placeholder):

setwd("path/to/directory")
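As a self-contained sketch of reading a file by its full path, the example below first writes a tiny csv to a temporary location and then reads it back. The column names Id and Value and the variable Path are assumptions for illustration:

```r
# Create a small csv in a temporary location so the example can run anywhere
Path = tempfile(fileext = ".csv")
write.csv(data.frame(Id = 1:3, Value = c(2.5, 3.0, 4.5)), Path, row.names = FALSE)

# Giving read.csv() the full path works regardless of the working directory
DataFrame = read.csv(Path)
str(DataFrame)
```

With a real file you would skip the first two commands and pass your own path or file name to read.csv().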

Summarizing and Subsetting Data

To summarize a data frame, the str and summary functions are helpful:

str(DataFrame)
summary(DataFrame)

To create a subset of a data frame, use the subset function. Here are a few examples:

NewDataFrame = subset(DataFrame, Variable1 <= 20 | Variable2 == 1)
NewDataFrame = subset(DataFrame, Variable1 > 100)
NewDataFrame = subset(DataFrame, Variable1 < 10 & Variable2 != 1)

The pipe symbol means "or", the ampersand symbol means "and", the double-equals means "exactly equal to", and the exclamation followed by an equals sign means "not equal to".
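To see the operators in action, here is a small sketch on a made-up data frame with four observations (the values are assumptions chosen so the results are easy to check by hand):

```r
DataFrame = data.frame(Variable1 = c(5, 8, 50, 150),
                       Variable2 = c(1, 2, 1, 2))

# "or": keep rows where at least one condition holds (rows 1, 2 and 3)
EitherRows = subset(DataFrame, Variable1 <= 20 | Variable2 == 1)

# "and": keep rows where both conditions hold (row 2 only)
BothRows = subset(DataFrame, Variable1 < 10 & Variable2 != 1)

nrow(EitherRows)
nrow(BothRows)
```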

Basic Data Analysis

Useful functions for computing statistics about a variable:

mean(DataFrame$Variable1)
sd(DataFrame$Variable1)
summary(DataFrame$Variable1)
which.max(DataFrame$Variable1)
which.min(DataFrame$Variable1)
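Note that which.max and which.min return the position (row number) of the largest or smallest value, not the value itself. A short sketch, assuming a made-up data frame with a Name column, shows how to use that position to look up another variable:

```r
DataFrame = data.frame(Name = c("A", "B", "C"),
                       Variable1 = c(12, 40, 7))

# Position of the largest and smallest values of Variable1
Largest = which.max(DataFrame$Variable1)
Smallest = which.min(DataFrame$Variable1)

# Use the position to find which observation it belongs to
DataFrame$Name[Largest]
DataFrame$Name[Smallest]
```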

Plotting

Scatterplots:

plot(DataFrame$Variable1, DataFrame$Variable2)
plot(DataFrame$Variable1, DataFrame$Variable2, xlab="X-axis Label", ylab="Y-axis Label", main="Main Title of Plot")

Histograms:

hist(DataFrame$Variable1)
hist(DataFrame$Variable1, xlab="X-axis Label", ylab="Y-axis Label", main="Main Title of Plot")

Boxplots:

boxplot(DataFrame$Variable1)
boxplot(DataFrame$Variable1 ~ DataFrame$Variable2)
boxplot(DataFrame$Variable1 ~ DataFrame$Variable2, xlab="X-axis Label", ylab="Y-axis Label", main="Main Title of Plot")

Summary Tables

Tables of counts:

table(DataFrame$Variable1)
table(DataFrame$Variable2 == 1)
table(DataFrame$Variable1, DataFrame$Variable2)

Table of summary statistics (like pivot tables in Excel):

tapply(DataFrame$Variable1, DataFrame$Variable2, mean)
tapply(DataFrame$Variable1, DataFrame$Variable2, min, na.rm=TRUE)
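The na.rm=TRUE argument is passed on to the summary function and tells it to ignore missing values. A minimal sketch with made-up vectors Value and Group (both names are assumptions) shows the difference:

```r
Value = c(1, NA, 3, 4)
Group = c("a", "a", "b", "b")

# Without na.rm, any group containing a missing value gives NA
WithNA = tapply(Value, Group, min)

# With na.rm=TRUE, missing values are dropped before taking the minimum
WithoutNA = tapply(Value, Group, min, na.rm = TRUE)
WithoutNA
```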

External libraries

External libraries enable the user to add new features to the standard R environment.
To install a library, use the install.packages command as follows:

install.packages("ROCR")

This downloads and installs the library in your R environment. You still need to load it explicitly afterwards, in each new R session, with the library command:

library(ROCR)

More on tapply() using a very simple example

Say x is a vector of 10 integers: 2, 4, 6, 7, 9, 11, 12, 15, 16, 17, and say g contains the group (a, b, or c) associated with each element of x. This means g is a, a, a, a, b, b, b, c, c, c.

x=c(2, 4, 6, 7, 9, 11, 12, 15, 16, 17)
g=c("a","a","a","a","b","b","b","c","c","c")

To show the mean of each group, call tapply() as follows:

tapply(x, g, mean)

The result will be grouped by g.

          a        b        c 
    4.75000 10.66667 16.00000

To show the max (or min) of each group, call tapply() as follows:

tapply(x, g, max)

The result will be the max of group a, group b, and group c:

         a  b  c 
         7 12 17

Condition in tapply: to count the entries of x in each group g where x > 9, call tapply() as follows:

tapply(x>9, g, sum)

The result will count the total number of entries x in each group that are greater than 9

         a b c 
         0 2 3
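The counting works because x > 9 is a logical vector, and R treats TRUE as 1 and FALSE as 0 in arithmetic. As a sketch of the same idea, replacing sum with mean gives the share of entries above 9 in each group (the names Counts and Shares are assumptions for illustration):

```r
x = c(2, 4, 6, 7, 9, 11, 12, 15, 16, 17)
g = c("a", "a", "a", "a", "b", "b", "b", "c", "c", "c")

# sum() of a logical vector counts the TRUE values in each group
Counts = tapply(x > 9, g, sum)

# mean() of a logical vector gives the proportion of TRUE values in each group
Shares = tapply(x > 9, g, mean)
Shares
```

Here group a has no entries above 9, group b has two of three, and all of group c's entries qualify.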


Tuesday, September 29, 2015
