Manipulating data in R
Nov 20, 2016
Parmutia Makui

Data manipulation

  • Data manipulation covers ways of selecting, inserting, deleting, sorting and summarising data.
  • It is the process of making data more organized.
  • Common packages used in data manipulation in R are:
    • dplyr
    • data.table
  • Each of these packages has its own pros and cons.
  • We will focus on dplyr in this tutorial.

Using dplyr for data manipulation

  • dplyr has 5 main verbs used for the common data manipulation tasks.
    • select - used to select one or more columns
    • filter - used to select rows/cases that meet particular criteria
    • arrange - used to sort data based on one or more columns
    • mutate - used to compute new columns
    • summarise - used to compute data summaries based on particular column(s)

dplyr::select()

  • It is used to select one or more columns from a data frame.
  • The example below keeps the columns whose names start with the given prefixes.
data_sub <- select(sldata,
                   starts_with("new_SLID"),
                   starts_with("CID"),
                   starts_with("Landscape"),
                   starts_with("new_SID"),
                   starts_with("HDDS"),
                   starts_with("HDA_index"))

data_sub[1:3,]
##   new_SLID       CID          Landscape new_SID HDDS HDA_index
## 1   111011 Nicaragua Nicaragua-Honduras El Tuma    6      12.6
## 2   111021 Nicaragua Nicaragua-Honduras El Tuma    5      10.0
## 3   111031 Nicaragua Nicaragua-Honduras El Tuma    6      10.0
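
As a rough sketch of the two most common ways to pick columns, the snippet below selects by exact name and by prefix. The `toy` data frame and its values are invented for illustration; only the column names echo the tutorial's data.

```r
library(dplyr)

# Hypothetical toy data (values are made up, not taken from sldata)
toy <- data.frame(
  new_SLID  = c(1, 2, 3),
  CID       = c("Kenya", "Kenya", "Ghana"),
  HDDS      = c(6, 5, 7),
  HDA_index = c(12.6, 10.0, 9.5)
)

# Select columns by their exact names
by_name <- select(toy, CID, HDDS)

# Select every column whose name starts with "HD"
by_prefix <- select(toy, starts_with("HD"))

names(by_name)
names(by_prefix)
```

Selecting by name is the simpler choice when only a few columns are needed; helpers such as starts_with() are handy when many columns share a prefix.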

dplyr::select() continued

  • You can also use select to remove variables from a data set by putting a minus sign in front of the column name.
  • Below we remove the HDA_index variable from the data.
data_sub <- select(data_sub, -HDA_index)

data_sub[1:3,]
##   new_SLID       CID          Landscape new_SID HDDS
## 1   111011 Nicaragua Nicaragua-Honduras El Tuma    6
## 2   111021 Nicaragua Nicaragua-Honduras El Tuma    5
## 3   111031 Nicaragua Nicaragua-Honduras El Tuma    6
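
The same minus-sign idea extends to several columns at once. The sketch below uses an invented `toy` data frame (an assumption, not the tutorial's data) and drops two columns in a single call.

```r
library(dplyr)

# Hypothetical toy data (values are made up)
toy <- data.frame(
  CID       = c("Kenya", "Ghana"),
  HDDS      = c(6, 7),
  HDA_index = c(12.6, 9.5)
)

# Drop two columns in one call by negating a vector of column names
trimmed <- select(toy, -c(HDDS, HDA_index))

names(trimmed)
```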

dplyr::filter()

  • It is used with logical conditions to select the cases (rows) that meet the given criteria.
  • Below we select all cases from the West Africa landscape.
data_sub <- filter(sldata,
                   Landscape=="West Africa")

dim(data_sub)
## [1] 600  35
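
filter() also accepts several conditions, which are combined with a logical AND. The following is a minimal sketch on invented data (the `toy` data frame is hypothetical); note that rows where a condition evaluates to NA are dropped.

```r
library(dplyr)

# Hypothetical toy data (values are made up); one Landscape is missing
toy <- data.frame(
  Landscape = c("West Africa", "East Africa", "West Africa", NA),
  HDDS      = c(6, 5, 9, 7)
)

# Conditions separated by commas must all be TRUE for a row to be kept
west_high <- filter(toy, Landscape == "West Africa", HDDS > 7)

# %in% matches against several allowed values
africa <- filter(toy, Landscape %in% c("West Africa", "East Africa"))

nrow(west_high)
nrow(africa)
```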

dplyr::arrange()

  • Used to order rows of data
  • Can be used with one column or multiple columns
  • Below we arrange the data set by the country column (CID), followed by the Landscape column.
sldata <- arrange(sldata, CID, Landscape)
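
By default arrange() sorts in ascending order; wrapping a column in desc() reverses that. The sketch below, on an invented `toy` data frame, sorts by country and then by HDDS from highest to lowest within each country.

```r
library(dplyr)

# Hypothetical toy data (values are made up)
toy <- data.frame(
  CID  = c("Kenya", "Ghana", "Kenya", "Ghana"),
  HDDS = c(5, 7, 8, 6)
)

# Ascending by CID, then descending by HDDS within each CID
sorted <- arrange(toy, CID, desc(HDDS))

sorted
```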

dplyr::mutate()

  • It is used to compute or add new columns to a data frame
  • Below we compute a new variable, HHibi, from the Farm_size_Ha and HH_size variables already in the data frame.
sldata <- mutate(sldata, 
                 HHibi = (Farm_size_Ha/(HH_size * 365))*100)

summary(sldata$HHibi)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.      NA's 
##    0.0000    0.0402    0.1370    1.0893    0.3288 1774.0275        20
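
A single mutate() call can create several columns, and later columns may refer to ones created earlier in the same call. The sketch below splits the HHibi calculation into two steps on invented data; the intermediate column name `person_days` is hypothetical. A missing input gives a missing result, which is one way NA's like those in the summary above can arise.

```r
library(dplyr)

# Hypothetical toy data (values are made up); one farm size is missing
toy <- data.frame(
  Farm_size_Ha = c(1.5, 0.5, NA),
  HH_size      = c(4, 2, 5)
)

out <- mutate(
  toy,
  person_days = HH_size * 365,                    # intermediate column (hypothetical name)
  HHibi       = (Farm_size_Ha / person_days) * 100 # uses the column defined just above
)

out
```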

dplyr::summarise()

  • Used to compute summary statistics from the data.
  • Computed below is the average HDDS value.
  • It is mostly used together with the group_by function to create summaries per group.
hdds_summary <- summarise(sldata,
                          avg_hdds = mean(HDDS))

hdds_summary[,]
## [1] 7.9092
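
summarise() can return several statistics at once. The sketch below uses an invented `toy` data frame with one missing HDDS value to show that mean() returns NA unless na.rm = TRUE is set; the output column names are chosen only for illustration.

```r
library(dplyr)

# Hypothetical toy data (values are made up); one HDDS value is missing
toy <- data.frame(HDDS = c(6, 5, NA, 9))

stats <- summarise(
  toy,
  n_rows    = n(),                      # number of rows
  n_missing = sum(is.na(HDDS)),         # how many HDDS values are NA
  avg_hdds  = mean(HDDS, na.rm = TRUE)  # mean over the non-missing values
)

stats
```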

Combining all or some operations

  • Using the “piping” operator (%>%), we can make our code more readable and combine several operations in one step.
  • The “piping” operator can be read as “then”.
  • Below is a simple example: group the data by country, compute the mean HDDS, then sort from highest to lowest.
hdds_summary <- sldata %>%
  group_by(CID) %>%
  summarise(avg_hdds = round(mean(HDDS), 1)) %>%
  arrange(desc(avg_hdds))
## `summarise()` ungrouping output (override with `.groups` argument)
hdds_summary[3:5,]
## # A tibble: 3 x 2
##   CID      avg_hdds
##   <fct>       <dbl>
## 1 Cameroon     10.1
## 2 Honduras      8  
## 3 Kenya         7.6
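
A slightly longer pipeline, sketched on invented data, chains filter, group_by, summarise and arrange. Passing `.groups = "drop"` to summarise() is one way to avoid the ungrouping message shown above; the `toy` data frame and its values are assumptions for illustration.

```r
library(dplyr)

# Hypothetical toy data (values are made up)
toy <- data.frame(
  CID  = c("Kenya", "Kenya", "Ghana", "Ghana", "Ghana"),
  HDDS = c(8, 7, 5, 6, NA)
)

result <- toy %>%
  filter(!is.na(HDDS)) %>%                       # keep rows with a known HDDS
  group_by(CID) %>%                              # then group by country
  summarise(avg_hdds = round(mean(HDDS), 1),     # then average within each group
            n        = n(),
            .groups  = "drop") %>%               # return an ungrouped result
  arrange(desc(avg_hdds))                        # then sort, highest first

result
```

Read top to bottom, each line takes the result of the previous one, which is why the operator is read as “then”.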

General advice

  • Practice, practice and practice some more
  • Google is your friend
  • Know how to reproduce your error