Manipulating data in R
Data manipulation
- This involves ways of selecting, inserting, deleting, sorting and summarising data.
- It is the process of making data more organized.
- Common packages used in data manipulation in R are:
- dplyr
- data.table
- Each of these packages has its own pros and cons.
- We will focus on dplyr in this tutorial
Using dplyr for data manipulation
- dplyr has 5 main verbs used for the common data manipulation tasks.
- select - used to select one or more columns
- filter - used to select rows/cases based on particular criteria
- arrange - used to sort rows based on one or more columns
- mutate - used to compute new columns
- summarise - used to compute data summaries based on particular column(s)
dplyr::select()
- It is used to select one or more columns from a data frame.
- Below is an example
library(dplyr)

# Keep only the columns whose names start with the given prefixes
data_sub <- select(sldata,
                   starts_with("new_SLID"),
                   starts_with("CID"),
                   starts_with("Landscape"),
                   starts_with("new_SID"),
                   starts_with("HDDS"),
                   starts_with("HDA_index"))
data_sub[1:3, ]
## new_SLID CID Landscape new_SID HDDS HDA_index
## 1 111011 Nicaragua Nicaragua-Honduras El Tuma 6 12.6
## 2 111021 Nicaragua Nicaragua-Honduras El Tuma 5 10.0
## 3 111031 Nicaragua Nicaragua-Honduras El Tuma 6 10.0
dplyr::select() continued
- You can also use select() to remove variables from a data set by prefixing their names with a minus sign
- Below we remove the HDA_index variable from the data
data_sub <- select(data_sub, -HDA_index)
data_sub[1:3,]
## new_SLID CID Landscape new_SID HDDS
## 1 111011 Nicaragua Nicaragua-Honduras El Tuma 6
## 2 111021 Nicaragua Nicaragua-Honduras El Tuma 5
## 3 111031 Nicaragua Nicaragua-Honduras El Tuma 6
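The three common ways of using select() can be sketched on a small made-up data frame. The data frame `toy` and its values are invented for illustration; only the column names echo the ones used in this tutorial.

```r
library(dplyr)

# Hypothetical toy data; the values are invented for illustration only
toy <- data.frame(
  CID = c("Kenya", "Kenya", "Honduras"),
  Landscape = c("East Africa", "East Africa", "Nicaragua-Honduras"),
  HDDS = c(7, 8, 6),
  HDA_index = c(10.0, 12.6, 10.0)
)

# Keep columns by name
by_name <- select(toy, CID, HDDS)

# Keep every column whose name starts with a given prefix
by_prefix <- select(toy, starts_with("HD"))

# Drop a column by prefixing its name with a minus sign
dropped <- select(toy, -HDA_index)

names(by_name)
names(by_prefix)
names(dropped)
```

Selecting by name is usually the simplest choice; helpers such as starts_with() are handy when several columns share a prefix.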
dplyr::filter()
- It is used with logical conditions to select the cases (rows) that meet the given criteria
- Below we select all cases from the West Africa landscape.
data_sub <- filter(sldata,
                   Landscape == "West Africa")
dim(data_sub)
## [1] 600 35
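Several conditions can be combined in one filter() call. The sketch below uses a small invented data frame (`toy` is hypothetical, not the tutorial data) to show AND and OR conditions.

```r
library(dplyr)

# Hypothetical toy data; the values are invented for illustration only
toy <- data.frame(
  CID = c("Kenya", "Kenya", "Honduras", "Cameroon"),
  HDDS = c(7, 8, 6, 10)
)

# A comma between conditions means AND
kenya_high <- filter(toy, CID == "Kenya", HDDS >= 8)

# Use | for OR, and %in% to match any of several values
either <- filter(toy, CID %in% c("Honduras", "Cameroon") | HDDS > 7)

nrow(kenya_high)
nrow(either)
```

Note that filter() drops rows where the condition evaluates to NA, so missing values never pass a filter unless you test for them explicitly with is.na().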
dplyr::arrange()
- Used to order the rows of a data set
- Can be used with one column or multiple columns
- Below we arrange the data set by the country column (CID), followed by the Landscape column
sldata <- arrange(sldata, CID, Landscape)
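By default arrange() sorts in ascending order; wrapping a column in desc() reverses it. A minimal sketch on an invented data frame (`toy` is hypothetical):

```r
library(dplyr)

# Hypothetical toy data; the values are invented for illustration only
toy <- data.frame(
  CID = c("Kenya", "Honduras", "Kenya", "Cameroon"),
  HDDS = c(7, 6, 8, 10)
)

# Sort by country (ascending), then by HDDS from highest to lowest
sorted <- arrange(toy, CID, desc(HDDS))
sorted
```

The second column only matters for breaking ties in the first one, which is why the two Kenya rows end up ordered by HDDS.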
dplyr::mutate()
- It is used to compute or add new columns to a data frame
- Below we compute a new variable, HHibi, from the variables Farm_size_Ha and HH_size already in the data frame.
sldata <- mutate(sldata,
                 HHibi = (Farm_size_Ha / (HH_size * 365)) * 100)
summary(sldata$HHibi)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0000 0.0402 0.1370 1.0893 0.3288 1774.0275 20
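The same formula can be tried on a small invented data frame to see how mutate() behaves, including with missing values. The data frame `toy` and the extra column `HHibi_rounded` are hypothetical and only serve the illustration.

```r
library(dplyr)

# Hypothetical toy data; the values are invented for illustration only
toy <- data.frame(
  Farm_size_Ha = c(1.0, 2.5, NA),
  HH_size = c(4, 5, 6)
)

toy <- mutate(toy,
  HHibi = (Farm_size_Ha / (HH_size * 365)) * 100,
  # A column created earlier in the same call can be reused right away
  HHibi_rounded = round(HHibi, 3)
)
toy
```

A missing input gives a missing result, which is consistent with the NA's reported by summary() above.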
dplyr::summarise()
- Used to compute summary statistics from the data
- It is mostly used together with the group_by() function to create summaries for each group
- Computed below is the average HDDS value for the whole data set.
hdds_summary <- summarise(sldata,
                          avg_hdds = mean(HDDS))
hdds_summary[,]
## [1] 7.9092
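Two details are worth a quick sketch: handling missing values inside the summary function, and producing one summary row per group. The data frame `toy` below is invented for illustration.

```r
library(dplyr)

# Hypothetical toy data; the values are invented for illustration only
toy <- data.frame(
  CID = c("Kenya", "Kenya", "Honduras", "Honduras"),
  HDDS = c(7, 9, 6, NA)
)

# Without na.rm = TRUE a single missing value would make the mean NA
overall <- summarise(toy,
  avg_hdds = mean(HDDS, na.rm = TRUE),
  n_rows = n()
)

# Grouping first gives one summary row per country
by_country <- summarise(group_by(toy, CID),
  avg_hdds = mean(HDDS, na.rm = TRUE)
)

overall
by_country
```

Several summaries can be computed in one call, and n() counts the rows in the current group (or in the whole data set when it is ungrouped).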
Combining all or some operations
- Using the pipe operator (%>%), we can combine several operations in one step and make our code more readable.
- The pipe operator can be read as "then"
- Below is a simple example
hdds_summary <- sldata %>%
  group_by(CID) %>%
  summarise(avg_hdds = round(mean(HDDS), 1)) %>%
  arrange(desc(avg_hdds))
## `summarise()` ungrouping output (override with `.groups` argument)
hdds_summary[3:5,]
## # A tibble: 3 x 2
## CID avg_hdds
## <fct> <dbl>
## 1 Cameroon 10.1
## 2 Honduras 8
## 3 Kenya 7.6
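To see what the pipe buys us, the sketch below writes the same steps twice on an invented data frame (`toy` is hypothetical): once as nested calls and once as a pipeline.

```r
library(dplyr)

# Hypothetical toy data; the values are invented for illustration only
toy <- data.frame(
  CID = c("Kenya", "Kenya", "Honduras", "Cameroon"),
  HDDS = c(7, 8, 6, 10)
)

# Nested calls have to be read from the inside out
nested <- arrange(
  summarise(group_by(toy, CID), avg_hdds = mean(HDDS)),
  desc(avg_hdds)
)

# The same steps with the pipe read top to bottom:
# take toy, then group, then summarise, then sort
piped <- toy %>%
  group_by(CID) %>%
  summarise(avg_hdds = mean(HDDS)) %>%
  arrange(desc(avg_hdds))

all.equal(as.data.frame(nested), as.data.frame(piped))
```

The pipe passes the result on its left as the first argument of the function on its right, so both versions do the same work. From R 4.1 onwards, base R also offers a native pipe, |>, which can be used in the same way for simple chains like this one.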
General advice
- Practice, practice and practice some more
- Google is your friend
- Know how to reproduce your error