6 Data subsetting
1 Subsetting and Manipulation (data sumo)
With a good basic set of moves for subsetting and manipulating data, you can overpower any dataset no matter how large and powerful they may be. Then, you will have strong data Sumo.
Subsetting and manipulating data is probably the commonest activity for anyone who works with data. This is a core activity for exploratory data analysis, but is also extensively used in simple data acquisition, analysis and graphing, while also being related to more general data manipulating activities, for example database queries. This page is an introduction to the core syntax and some of the tools for manipulating and subsetting data in R.
1.1 Objectives
Indexing concept
Using
which()
and subsettingSelection on data.frame objects
Using
aggregate()
Practice exercises
2 Indexing concept
If you would like to slice and dice your data, you will need to learn all about indexing!
The basics of the indexing concept in R
syntax is very simple, where data storage objects like vectors (1 dimension), matrices (2 dimensions) and arrays (3 or more dimensions) store individual data values that can be accessed by the “address” of the dimension(s).
2.1 How indexing works
Say you have a numeric vector called my_vector
that has 10 values. The index values will be 1
to 10
, with each value corresponding consecutively to the data value at that position.
<- c(11.3, 11.2, 10.4, 10.4, 8.7,
my_vector 10.8, 10.5, 10.3, 9.7, 11.2)
my_vector
[1] 11.3 11.2 10.4 10.4 8.7 10.8 10.5 10.3 9.7 11.2
Notice the [1]
in the R console output? This indicates the index of value right next to it and the R system will provide an index value for longer vectors as the wrap in the console. If we could see the actual index values it would look something like this:
> my_vector
1] 11.3 11.2 10.4 10.4 8.7 10.8 10.5 10.3 9.7 11.2
[# ^ ^ ^ ^ ^ ^ ^ ^ ^ ^
# 1 2 3 4 5 6 7 8 9 10
2.2 Vectors
You can create vector subsets by manipulating the index. Vector objects have indices in 1 dimension. For example, my_vector[1:i]
, where i
is the length of the vector.
## Vectors ####
# Try this
<- c(11.3, 11.2, 10.4, 10.4,
my_vector 8.7, 10.8, 10.5, 10.3, 9.7, 11.2)
# Return all values
# Typical way
my_vector # Blank index implies all index values
my_vector[ ] 1:10] # Returns all index values explicitly
my_vector[
# Return the first 3 values
1:3 # Reminder of the function of the colon operator ":"
1:3] # Notice consecutive indices can use the ":" operator
my_vector[
# Return 5th and 9th values
c(5, 9)] # Notice we have to place non-consecutive index values in the c() function my_vector[
2.3 Matrices
Matrix objects have 2 dimensions denoted as my_matrix[1:i, 1:j]
, where i
is the number of rows and j
is the number of columns.
## Matrices ####
# Try this
<- matrix(data = c(2,3,4,5,6,6,6,6),
my_matrix nrow = 2, byrow = T)
# notice how the arguments arranged the data
my_matrix
# Flash challenge: make a matrix with the same data vector above to look like...
# [,1] [,2]
# [1,] 2 6
# [2,] 3 6
# [3,] 4 6
# [4,] 5 6
# "Slicing" out a row or column
1, ] # Slice out row 1
my_matrix[3] # Slice out column 3
my_matric[ ,
# Matrix columns and rows often have names
names(my_matrix) # No names yet
nrow(my_matrix) # Returns number of rows (useful for large matrices)
rownames(my_matrix) # No row names; 2 rows, need two names
rownames(my_matrix) <- c("dogs", "cats")
# Now the rows have names!
my_matrix rownames(my_matrix) # Get them this way too!
# Flash challenge: Name the columns of my_matrix "a", "b", "c", "d" with colnames()
my_matrix
# Should look like this:
# a b c d
# dogs 2 3 4 5
# cats 6 6 6 6
# You can also slice out matrix portions by name
"dogs", c("b", "d")]
my_matrix[
# Finally, functions act on values, not the index
mean(my_matrix["dogs", c("b", "d")])
2.4 Arrays
Arrays are data objects with more than 2 dimensions (well, technically a matrix with 2 dimensions is also an array, but let’s ignore that for now). Array dimensions are denoted as my_array[1:i, 1:j, 1:k]
, where i
is the number of rows and j
the columns and k
the “depth” of i * j
.
## Arrays ####
# Try this
# help(runif)
# help(round)
# Try it to see what it does...
<- round(runif(n = 27, min = 0, max = 100), 0)
my_vec # See what we did there?
my_vec
length(my_vec) # Just checking
<- array(data = my_vec,
my_array dim = c(3, 3, 3))
my_array
# Flash challenge:
# Specify and print the 1st and 3rd slice of the k dimension of my_array
# Assuming my_array has dimensions i, j, k like my_array[i,j,k]
3 which()
and subsetting
Subsetting data objects is done by exploiting the index system. We usually do this by either specifying index values explicitly (effective, but it requires that you know A LOT about the data object), or by constructing queries that choose subset of data based on particular values. The which()
function is a powerful way to construct queries.
# Try this
help(which) # Notice how the x argument is required to be a LOGICAL vector?
# Make a NUMERIC vector
<- c(3, 4, 5, 4, 3, 4, 5, 6, 6, 7)
vector_a
# Use a boolean phrase to ask which elements of vector_a are greater than 5
> 5 # Interesting... it is a LOGICAL vector!
vector_a
# which() will return the index values of TRUE values
# In other words, WHICH values in vector_a are greater than 5?
which(vector_a > 5)
What is the point of all this? THE POINT is to be able to use expressions to obtain indices and values in data structures…
# What VALUES in vector_a are > 5?
which(vector_a > 5)]
vector_a[
# This also works on vectors of other types
# Consider a character vector
<- c("wheat", "maize", "wheat", "maize", "wheat", "wheat")
char_vec
# Which elements are equivalent to "wheat"?
== "wheat"
char_vec which(char_vec == "wheat")
which(char_vec == "wheat")] # This works
char_vec[ == "wheat"] # Same output
char_vec[ char_vec
# Flash challenge: Explain in your own words why
# the previous 2 lines of code have identical output?
We are just beginning to scratch the surface of possibilities with the which()
function. Keep this function in mind and practice it when you can.
4 Selection on data.frame
objects
Data frames are the ultimate data object for getting, storing, organizing and analyzing data. A good scientist must learn to communicate the subtlety of data. A good statistician must learn not to underestimate the subtletly of data. A good student must learn that
subtlety may exist, even in simple data.
There are a couple of data object types that have a special characteristic in that they store data of different types, where vectors, matrices and arrays can only store one type of data (e.g., numeric, character, logical, etc.). The special data objects that can contain multiple data types are list objects, and data frames. Here we will focus on data frames.
Data frames can have different vector types arranged by column but there is a constraint that each vector must be the same length, that is, each ROW is considered an observation for each variable value (though there may be missing data coded by NA
).
There are a few ways to think about selecting values in a data frame. The first is simply to access values through the variable names, which can either be done by using the data frame name with the $
operator and the variable name, or by using the [ , ]
syntax with either the variable name or the column number of the variable of interest see here.
A second powerful way to access variables in a data frame is by selecting particular rows of a data frame. This may be done by selecting the rows of a data based on values of one or more variables. We will practice doing this using which()
, the [ , ]
syntax, and boolean phrases is the following code block.
For the following section, we will use the OrchardSprays
dataset that exists as a data frame in the in-built {datasets}
package. You can use help(OrchardSprays)
to see the help page (the help page is characteristically terse, so some description is given here).
4.1 OrchardSprays
data
This is a classic dataset based on an experiment looking at how a chemical additive could be used to deter honeybees from being attracted to crops and subsequently killed by pesticides.
The experiment involved having a treatment
consisting of adding a “lime sulfur emulsion” (honeybee deterrent) in increasing concentrations to a sucrose solution. The treatment
variable had 8 levels including a control (no deterrent) and 7 other levels with increasing concentration of the deterrent. The treatment levels were named A (the highest amount of deterrent), B (second highest deterrent) through to G (lowest deterrent) and H (control - no deterrent) The decrease
variable was a measure of the quantity of sucrose solution that was taken by honeybees (the prediction here is that higher concentrations of the deterrent should result in a lower decrease in the sucrose solution).
The experiment involved a Latin Square design, with the order of the 8 treatments arranged randomly in an array of 8 columns (the purpose of this design is to randomize any effect of the treatment
ORDER or POSITION on the response variable). This resulted in an 8 row by 8 column experiment. The response was measured after placing 100 honeybees into an experimental chamber with the 64 containers of sucrose solution.
## OrchardSprays ####
## Understand the data - an important step
# Try this
# Load the OrchardSpray data using the data() function
data(OrchardSprays) # Should see OrchardSprays <promise> in the Global Env.
# Look at the data head()
head(OrchardSprays) # First 6 rows
# Look at variable types with str()
help(str) # Good function to see info about data object
str(OrchardSprays)
# First let's just look at the data
# Don't worry too much about the code for these graphs if you have not encountered it before
boxplot(decrease ~ treatment, data = OrchardSprays,
main = "The pattern fits the prediction",
ylab = "Amount of sucrose consumed",
xlab = "Lime sulpher treatment amount in decreasing order (H = control)")
# This is the experimental design
# Latin Square is kind of like Sudoku
# No treatment can be in row or column more than once
plot(x = OrchardSprays$colpos, # NB use of $ syntax to access data
y = OrchardSprays$rowpos,
pch = as.character(OrchardSprays$treatment),
xlim = c(0,9), ylim = c(0,9),
main = "The Latin Square design of treatments",
xlab = "\"Column\" position",
ylab = "\"Row\" position")
4.2 Practice selecting parts a data frame
Selecting particular parts of a data frame based on the values of one variable is a common and extremely useful task.
## Practice selecting parts a data frame ####
# Select the rows of the dataset for treatment "D"
# (Pseudocode steps to solve)
# Break it down to make small steps easy to read
# 01 Boolean phrase to identify rows where treatment value is "D"
# 02 which() to obtain index of TRUE in boolean vector
# 03 Exploit [ , ] syntax with data frame object to slice out rows
# 01 Boolean phrase
$treatment # Just print variable to compare visually to boolean
OrchardSprays$treatment == "D" # logical vector - TRUE in "D" positions
OrchardSprays
# 02 which()
which(OrchardSprays$treatment == "D") # Index of TRUE values
<- which(OrchardSprays$treatment == "D") # Place index in a variable
my_selection # Just checking
my_selection
# 03 Exploit [ , ] syntax with data frame object to slice out rows
OrchardSprays[my_selection, ]
# Flash challenge: Select and print all rows at "colpos" values of 2
4.3 Selection based on more than one variable value
Using the basic building blocks of boolean selection, more complex rules for selecting data can be made.
## Compound boolean for selection ####
# Select all rows of the data frame where
# rowpos equals 4 OR 6 AND treatment equals "A" OR "H"
# What we expect is exactly 2 values (A or H) for each powpos (4 or 6)
# rowpos 4 and 6
$rowpos == 4 # The 4s
OrchardSprays$rowpos == 6 # The 6s
OrchardSprays
$rowpos == 4 | OrchardSprays$rowpos == 6 # All together
OrchardSprays
# now with which()
which(OrchardSprays$rowpos == 4) # The 4s
which(OrchardSprays$rowpos == 6) # The 6s
which(OrchardSprays$rowpos == 4 | OrchardSprays$rowpos == 6) # All together
# treatment A and H
which(OrchardSprays$treatment == "A" | OrchardSprays$treatment == "H") # All together
# Now we need the intersection of value that are in both our which() vectors
which((OrchardSprays$rowpos == 4 | OrchardSprays$rowpos == 6) & # It works
$treatment == "A" | OrchardSprays$treatment == "H") )
(OrchardSprays
# NB this is a long way of spelling out our selection,
# but trying to be very explicit with what is going on
<- which((OrchardSprays$rowpos == 4 | OrchardSprays$rowpos == 6) &
my_selec2 $treatment == "A" | OrchardSprays$treatment == "H") )
(OrchardSprays
# Double check it works and is similar to expectation...
OrchardSprays[my_selec2, ]
# Flash challenge: Calculate the mean of decrease for treatment "A"
# and the mean of decrease for treatment "H"
5 aggregate()
function
We often may wish to summarize parts of a data set according to some index of variable values. A very convenient tool for the is the aggregate()
function, which we will practice here.
help(aggregate)
# A few important things to note about how this function works:
# The "x" argument is a data object you input, but should only contain numeric values usually
# If a data.frame object is input as x, a data.frame object is the output
# The "by" argument must be a list() object and can be one or more indices
# The FUN argument is the name of the function that will act on the "x" argument data
# Let's try a few examples
## 1 calculate the mean of decrease by treatment in OrchardSprays
aggregate(x = OrchardSprays$decrease,
# NB use of list() and naming it "treatment"
by = list(treatment = OrchardSprays$treatment),
FUN = mean)
# we can "recycle" the code above to apply different functions
# standard deviation with sd()
aggregate(x = OrchardSprays$decrease,
# NB use of list() and naming it "treatment"
by = list(treatment = OrchardSprays$treatment),
FUN = sd)
# Range with range()
aggregate(x = OrchardSprays$decrease,
# NB use of list() and naming it "treatment"
by = list(treatment = OrchardSprays$treatment),
FUN = range)
# What if we want several summary statistics?
aggregate(x = OrchardSprays$decrease,
# NB use of list() and naming it "treatment"
by = list(treatment = OrchardSprays$treatment),
# NB use of function()
FUN = function(x) c(mean = mean(x), # Add naming
sd = sd(x),
range = range(x)))
## Example of use of aggregate object
# Say you would like to graph a barplot of the MEAN of decrease by treatment
# and you would like to show STANDARD DEVIATION error bars
# Make data frame with summary values using aggregate()
<- aggregate(x = OrchardSprays$decrease,
my_mean by = list(treatment = OrchardSprays$treatment),
FUN = mean)
<- aggregate(x = OrchardSprays$decrease,
my_sd by = list(treatment = OrchardSprays$treatment),
FUN = sd)
my_mean
my_sd
# Tidy things up in a new data frame using data.frame()
# Take care of naming variables for clarity
help(data.frame) # Continue using help() as a good habit
<- data.frame(treatment = my_mean$treatment,
new_data mean = my_mean$x,
sd = my_sd$x)
# Looks good
new_data
# There is a lot going on in the following code
# The point is to show what is possible
<- barplot(new_data$mean,# use mean for barheight
(bar_centers ylim = c(0, 115),
ylab = "Mean solution decrease (+- 1 SD)",
xlab = "Treatment"))
# NB bar_centers holds the numerical position value of the bars...
help(arrows) # Use to draw error bars
arrows(x0 = bar_centers,
x1 = bar_centers,
y0 = new_data$mean , # start error bar at top of the bar!
y1 = new_data$mean + new_data$sd, # end error bar here!
angle = 90,
length = 0.1)
# Last step: label the x axis
axis(side = 1,
at = bar_centers,
labels = new_data$treatment)
# Flash challenge: Draw a new barplot by recycling the code above
# This time, add error bars showing on both the top and the bottom of the mean values
6 Practice exercises
For the following exercises, use the trees
dataset built into R
, which has Girth
, Height
and Volume
variables for 31 Black Cherry trees.
# Examine the data
help(trees)
data(trees)
str(trees)
6.1
Show code to calculate the mean Girth
of Black Cherry trees with Height
less than 75 ft.
6.2
Use help(cut)
and then use the cut()
function to create a new factor variable based on the Height
numeric variable in the trees
dataset. Try setting the breaks
argument to 2 or 3. Rename the levels of your new factor to something meaningful. Show the code.
6.3
Using the new factor from question 2, use aggregate()
to calculate the mean and standard deviation of all three variables in the trees
data. Show your code and report the results to 2 decimal points of accuracy.
6.4
Show the code using which()
and boolean phrases as appropriate to find the rows in the trees
dataset where Girth
is higher than 11 and Height
is lower than 75.
6.5
Run the following code:
<- data.frame(volume = c(4,5,6,5,6,7,6,5,6,8,7,3,8,7,NA,10),
data_1 population = c("A","A","A","A","A","A","A","A",
"B","B","B","B","B","B","B","B"))
Use aggregate()
to calculate the mean of volume
for each population
(hint: you may need to use help for the functions involved and pay close attention to your data frame…).
6.6
Write a plausible practice question involving any aspect of using which()
, boolean phrases and/or aggregate()
involving the in-built R
dataset iris
.
Harper Adams Data Science
This website supports students and staff at Harper Adams University and the MSc in Data Science for Global Agriculture, Food, and Environment led by Ed Harris.