This page covers getting started with R and basic data analysis.
Some General Tips
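- If you ever see a plus sign in your R console instead of the greater-than sign, R is waiting for you to finish an incomplete command (often a missing closing parenthesis or quote). Hit the escape key (or Ctrl+C in a terminal) to cancel and get back to the prompt.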
- To get help on any function, type a question mark and then the name of the function (for example, ?sqrt).
- To add new observations to a data frame use the rbind() function.
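As a minimal sketch of the rbind() tip above, the following adds one observation to a small data frame. The data frame `scores` and its columns are made up for illustration.

```r
# Hypothetical data frame with two observations
scores <- data.frame(name = c("Ann", "Bob"), score = c(90, 85))

# The new observation must use the same column names
new_row <- data.frame(name = "Cara", score = 78)

# rbind() returns a new data frame; assign it to keep the result
scores <- rbind(scores, new_row)
nrow(scores)  # 3
```

Note that rbind() does not modify the original object in place, which is why the result is assigned back to `scores`.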
"Vocabulary" / Definitions
- Data Structure: It is a particular way of organizing information so it can be efficiently stored and retrieved by the computer. Examples include arrays (or lists), hash tables, structs, and sets.
- Variable Names: they are case-sensitive, don't start them with a number, and don't use spaces.
- Vectors: created with the c() function, or the seq() function.
- Data frames: created with the data.frame() function, or by reading in a csv file with the read.csv() function.
- Observation: row of values within a data frame.
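The definitions above can be tried out with a short sketch like the one below. The names `ages`, `evens`, and `people` are illustrative only.

```r
# Vectors built with c() and seq()
ages  <- c(23, 35, 41)         # explicit values
evens <- seq(2, 10, by = 2)    # 2 4 6 8 10

# A data frame built from equal-length vectors (hypothetical columns)
people <- data.frame(age = ages, id = seq(1, 3))

# Each row is one observation; this selects the second one
second <- people[2, ]
second$age                     # 35

# Names are case-sensitive: `Ages` would be a different object from `ages`
```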
Reading in Files
To read in a csv file, first navigate to the directory on your computer containing the file. This can be done on a Mac by going to the "Misc" menu, and then selecting "Change Working Directory...", and on a Windows machine by going to the "File" menu, and then selecting "Change dir...". Then read the file into R with the following command:
DataFrame = read.csv("filename.csv")
"DataFrame" should be replaced with what you want to call your data frame in R, and "filename.csv" should be replaced with the name of the file you want to read in.
If you are using R without a graphical interface, either put the full path to the file in place of "filename.csv" or change the working directory with the commands below. To show the directory you are working with:
getwd()
To set a new directory, pass its path to setwd() (the path below is a placeholder):
setwd("path/to/directory")
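As an illustration of reading a file by its full path instead of changing the working directory, this sketch writes a tiny sample file to a temporary folder and reads it back. The file name and its contents are invented for the example.

```r
# Write a small sample file to a temporary location (illustrative data only)
csv_path <- file.path(tempdir(), "example.csv")
write.csv(data.frame(Variable1 = c(5, 150), Variable2 = c(1, 0)),
          csv_path, row.names = FALSE)

# Read it back using the full path, regardless of the working directory
DataFrame <- read.csv(csv_path)
str(DataFrame)
```

file.path() joins the folder and file name with the right separator for your operating system.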
Summarizing and Subsetting Data
To summarize a data frame, the str and summary functions are helpful:
str(DataFrame)
summary(DataFrame)
To create a subset of a data frame, use the subset function. Here are a few examples:
NewDataFrame = subset(DataFrame, Variable1 <= 20 | Variable2 == 1)
NewDataFrame = subset(DataFrame, Variable1 > 100)
NewDataFrame = subset(DataFrame, Variable1 < 10 & Variable2 != 1)
The pipe symbol means "or", the ampersand symbol means "and", the double-equals means "exactly equal to", and the exclamation followed by an equals sign means "not equal to".
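To see how these operators behave, here is a small sketch on a made-up data frame (the values are chosen only for illustration):

```r
DataFrame <- data.frame(Variable1 = c(5, 15, 150, 8),
                        Variable2 = c(1, 2, 1, 3))

# "or": rows where Variable1 <= 20, or Variable2 == 1, or both
either <- subset(DataFrame, Variable1 <= 20 | Variable2 == 1)
nrow(either)     # 4 (every row satisfies at least one condition)

# "and": rows where Variable1 < 10 and Variable2 is not 1
both <- subset(DataFrame, Variable1 < 10 & Variable2 != 1)
both$Variable1   # 8
```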
Basic Data Analysis
Useful functions for computing statistics about a variable:
mean(DataFrame$Variable1)
sd(DataFrame$Variable1)
summary(DataFrame$Variable1)
which.max(DataFrame$Variable1)
which.min(DataFrame$Variable1)
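One detail worth checking for yourself: which.max and which.min return the position of the largest or smallest value, not the value itself. A short sketch with made-up numbers:

```r
values <- c(12, 47, 31, 5)    # illustrative values

which.max(values)             # 2, the position of the largest value
values[which.max(values)]     # 47, the largest value itself
max(values)                   # 47, the same result more directly
which.min(values)             # 4, the position of the smallest value

# In a data frame, the index can pull out the whole observation:
# DataFrame[which.max(DataFrame$Variable1), ]
```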
Plotting
Scatterplots:
plot(DataFrame$Variable1, DataFrame$Variable2)
plot(DataFrame$Variable1, DataFrame$Variable2, xlab="X-axis Label", ylab="Y-axis Label", main="Main Title of Plot")
Histograms:
hist(DataFrame$Variable1)
hist(DataFrame$Variable1, xlab="X-axis Label", ylab="Y-axis Label", main="Main Title of Plot")
Boxplots:
boxplot(DataFrame$Variable1)
boxplot(DataFrame$Variable1 ~ DataFrame$Variable2)
boxplot(DataFrame$Variable1 ~ DataFrame$Variable2, xlab="X-axis Label", ylab="Y-axis Label", main="Main Title of Plot")
Summary Tables
Tables of counts:
table(DataFrame$Variable1)
table(DataFrame$Variable2 == 1)
table(DataFrame$Variable1, DataFrame$Variable2)
Table of summary statistics (like pivot tables in Excel):
tapply(DataFrame$Variable1, DataFrame$Variable2, mean)
tapply(DataFrame$Variable1, DataFrame$Variable2, min, na.rm=TRUE)
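The effect of na.rm=TRUE is easiest to see when a group contains a missing value. The sketch below uses a made-up data frame with one NA:

```r
DataFrame <- data.frame(Variable1 = c(10, NA, 30, 40),
                        Variable2 = c(1, 1, 2, 2))

counts <- table(DataFrame$Variable2)   # two rows with 1, two rows with 2

# Without na.rm, a group containing NA yields NA
with_na <- tapply(DataFrame$Variable1, DataFrame$Variable2, min)

# With na.rm = TRUE, missing values are dropped before computing
mins <- tapply(DataFrame$Variable1, DataFrame$Variable2, min, na.rm = TRUE)
mins   # group 1: 10, group 2: 30
```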
External libraries
External libraries enable the user to add new features to the standard R environment. To install a library, use the install.packages command as follows:
install.packages("ROCR")
This downloads and installs the library into your R environment. You still need to load it explicitly afterwards with the library command before you can use it:
library(ROCR)
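Since install.packages only needs to run once per machine while library runs in every session, one common pattern is to install only when the package is missing. The helper name `ensure_package` below is hypothetical, a sketch rather than a standard R function:

```r
# Hypothetical helper: install a package only if it is missing, then load it
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)   # needs internet access and a configured CRAN mirror
  }
  library(pkg, character.only = TRUE)   # character.only lets us pass a string
}

# Usage (not run here): ensure_package("ROCR")
```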
More on tapply() using a very simple example
Say x is a vector of 10 integers (2, 4, 6, 7, 9, 11, 12, 15, 16, 17) and g contains the group (a, b, or c) associated with each element of x. Here g is a, a, a, a, b, b, b, c, c, c.
x=c(2, 4, 6, 7, 9, 11, 12, 15, 16, 17)
g=c("a","a","a","a","b","b","b","c","c","c")
To show the mean of each group, use this tapply() call:
tapply(x, g, mean)
The result is grouped by g:
       a        b        c
 4.75000 10.66667 16.00000
To show the max (or min or range) of each group, use the following tapply() call:
tapply(x, g, max)
The result is the max of group a, group b, and group c:
 a  b  c
 7 12 17
Conditions in tapply(): to count the entries of x in each group g where x>9, use the following tapply() call:
tapply(x>9, g, sum)
The result will count the total number of entries x in each group that are greater than 9
a b c
0 2 3
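The same idea extends to other functions. As a sketch that reuses the x and g from this example, the following shows group sizes, the share of entries above 9, and range, which returns two values per group:

```r
x <- c(2, 4, 6, 7, 9, 11, 12, 15, 16, 17)
g <- c("a", "a", "a", "a", "b", "b", "b", "c", "c", "c")

sizes <- tapply(x, g, length)     # group sizes: a = 4, b = 3, c = 3
share <- tapply(x > 9, g, mean)   # share of entries above 9: a = 0, b = 2/3, c = 1

# range returns two values per group, so the result is a list
r <- tapply(x, g, range)
r[["a"]]                          # 2 7
```

Taking the mean of a TRUE/FALSE vector works because R treats TRUE as 1 and FALSE as 0, the same reason sum counts the matches.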