Data manipulation

Data manipulation is the process of making data more organized.
It involves selecting, inserting, deleting, sorting and summarising data.
Common packages used in data manipulation in R are:
- dplyr
- data.table
Each package has its own pros and cons.
We will focus on dplyr in this tutorial.

Using dplyr for data manipulation

dplyr has 5 main verbs that cover the most common data manipulation tasks:
- select - used to select one or more columns
- filter - used to select rows/cases that meet particular criteria
- arrange - used to sort data based on one or more columns
- mutate - used to compute new columns
- summarise - used to compute data summaries based on particular column(s)

dplyr::select()

It is used to select one or more columns from a data frame.
In the example below, starts_with() picks every column whose name begins with the given prefix.

data_sub <- select(sldata,
                   starts_with("new_SLID"),
                   starts_with("CID"),
                   starts_with("Landscape"),
                   starts_with("new_SID"),
                   starts_with("HDDS"),
                   starts_with("HDA_index"))

data_sub[1:3,]

##   new_SLID       CID          Landscape new_SID HDDS HDA_index
## 1   111011 Nicaragua Nicaragua-Honduras El Tuma    6      12.6
## 2   111021 Nicaragua Nicaragua-Honduras El Tuma    5      10.0
## 3   111031 Nicaragua Nicaragua-Honduras El Tuma    6      10.0
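
As a small, self-contained sketch of the same idea, the code below uses the built-in iris data set (an assumption made here so the example runs without sldata) to show three common ways of choosing columns with select(): by prefix, by name, and by a range of names.

```r
library(dplyr)

# iris ships with R; it stands in for sldata purely for illustration.

# Select by prefix: every column whose name starts with "Sepal".
sepal_cols <- select(iris, starts_with("Sepal"))

# Select by name: columns come back in the order they are listed.
by_name <- select(iris, Species, Petal.Length)

# Select a range: all columns from Sepal.Length through Petal.Length.
by_range <- select(iris, Sepal.Length:Petal.Length)

names(sepal_cols)
names(by_name)
names(by_range)
```

Listing the columns explicitly is usually the safer choice when other columns may share the same prefix.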

dplyr::select() continued

You can also use select() to remove variables from a data set by prefixing the column name with a minus sign.
Below we remove the HDA_index variable from the data.

data_sub <- select(data_sub, -HDA_index)

data_sub[1:3,]

##   new_SLID       CID          Landscape new_SID HDDS
## 1   111011 Nicaragua Nicaragua-Honduras El Tuma    6
## 2   111021 Nicaragua Nicaragua-Honduras El Tuma    5
## 3   111031 Nicaragua Nicaragua-Honduras El Tuma    6
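
The minus sign also works with several columns at once and with helpers such as starts_with(). The sketch below uses the built-in iris data set as a stand-in (an assumption, not the tutorial's data).

```r
library(dplyr)

# Drop every column whose name starts with "Petal".
no_petal <- select(iris, -starts_with("Petal"))

# Drop two named columns in one call.
no_two <- select(iris, -c(Sepal.Width, Species))

names(no_petal)
names(no_two)
```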

dplyr::filter()

It is used with logical statements to select the cases (rows) that meet the given criteria.
Below we select all cases from the West Africa landscape.

data_sub <- filter(sldata,
                   Landscape=="West Africa")

dim(data_sub)

## [1] 600  35
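
Several conditions can be combined in one filter() call. The following sketch uses the built-in mtcars data set (an assumption, chosen only so the code runs on its own) to show conditions joined with "and", with "or", and with %in%.

```r
library(dplyr)

# Conditions separated by commas must all be true (logical AND).
efficient_4cyl <- filter(mtcars, cyl == 4, mpg > 30)

# The | operator keeps rows where either condition is true (logical OR).
four_or_six <- filter(mtcars, cyl == 4 | cyl == 6)

# %in% is a compact way to match against several values.
four_or_six_in <- filter(mtcars, cyl %in% c(4, 6))

nrow(efficient_4cyl)
nrow(four_or_six)
nrow(four_or_six_in)
```

Note that filter() drops rows where the condition evaluates to NA, so missing values never pass a comparison.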

dplyr::arrange()

It is used to order the rows of a data frame.
It can sort by one column or by multiple columns.
Below we arrange the data set by the country column (CID), followed by the Landscape column.

sldata <- arrange(sldata, CID, Landscape)
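
By default arrange() sorts in ascending order; wrapping a column in desc() reverses that. A minimal sketch on the built-in mtcars data set (used here as an assumption in place of sldata):

```r
library(dplyr)

# Sort by number of cylinders (ascending), then by mpg (descending)
# within each cylinder count.
sorted <- arrange(mtcars, cyl, desc(mpg))

head(sorted[, c("cyl", "mpg")], 3)
```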

dplyr::mutate()

It is used to compute new columns and add them to a data frame.
Below we compute a new variable, HHibi, from the variables Farm_size_Ha and HH_size, which are already in the data frame.

sldata <- mutate(sldata, 
                 HHibi = (Farm_size_Ha/(HH_size * 365))*100)

summary(sldata$HHibi)

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.      NA's 
##    0.0000    0.0402    0.1370    1.0893    0.3288 1774.0275        20
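
A summary like the one above can contain NA's and very large values. The sketch below uses a small made-up data frame (the names farms, farm_ha and hh_size are hypothetical, not columns of sldata) to show how missing inputs and a zero denominator carry through mutate(), and one possible way to guard against the latter.

```r
library(dplyr)

# Hypothetical toy data: one complete row, one zero household size,
# one missing farm size.
farms <- data.frame(farm_ha = c(2, 5, NA),
                    hh_size = c(4, 0, 3))

# A plain ratio: NA inputs give NA, division by zero gives Inf.
farms <- mutate(farms, ha_per_person = farm_ha / hh_size)

# One possible guard: only divide when the household size is positive.
farms <- mutate(farms,
                ha_per_person_safe = ifelse(hh_size > 0,
                                            farm_ha / hh_size,
                                            NA_real_))

farms
```

Whether a zero denominator should become NA is a judgment call that depends on the analysis; the guard here is only one option.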

dplyr::summarise()

It is used to compute summary statistics from the data.
summarise() is mostly used together with group_by() to create summaries per group.
Computed below is the average HDDS value across the whole data set.

hdds_summary <- summarise(sldata,
                          avg_hdds = mean(HDDS))

hdds_summary[,]

## [1] 7.9092
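
When a column contains missing values, mean() returns NA unless na.rm = TRUE is set. The sketch below illustrates this, and a grouped summary, on the built-in airquality data set (an assumption; its Ozone column contains NA's).

```r
library(dplyr)

# Several summaries can be computed in one call.
overall <- summarise(airquality,
                     avg_ozone    = mean(Ozone),               # NA: Ozone has missing values
                     avg_ozone_rm = mean(Ozone, na.rm = TRUE), # missing values ignored
                     n_rows       = n())                       # number of rows

# With group_by(), the same summary is computed once per group.
by_month <- summarise(group_by(airquality, Month),
                      avg_ozone = mean(Ozone, na.rm = TRUE))

overall
by_month
```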

Combining all or some operations

Using the “piping” operator (%>%), we can make our code more readable and combine several operations in one step.
The operator passes the result on its left to the function on its right, so it can be read as “then”.
Below is a simple example.

hdds_summary <- sldata %>%
  group_by(CID) %>%
  summarise(avg_hdds = round(mean(HDDS), 1)) %>%
  arrange(desc(avg_hdds))

## `summarise()` ungrouping output (override with `.groups` argument)

hdds_summary[3:5,]

## # A tibble: 3 x 2
##   CID      avg_hdds
##   <fct>       <dbl>
## 1 Cameroon     10.1
## 2 Honduras      8  
## 3 Kenya         7.6
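
To see what the pipe buys in readability, the sketch below writes the same kind of pipeline twice on the built-in mtcars data set (an assumption, used so the code runs without sldata): once with %>% and once as nested function calls.

```r
library(dplyr)

# Piped version: reads top to bottom as "take mtcars, then group,
# then summarise, then arrange".
piped <- mtcars %>%
  group_by(cyl) %>%
  summarise(avg_mpg = round(mean(mpg), 1)) %>%
  arrange(desc(avg_mpg))

# Nested version: the same steps, but read from the inside out.
nested <- arrange(
  summarise(group_by(mtcars, cyl),
            avg_mpg = round(mean(mpg), 1)),
  desc(avg_mpg)
)

piped
```

Both versions give the same result; the piped form keeps the steps in the order they are carried out.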

General advice

- Practice, practice and practice some more
- Google is your friend
- Know how to reproduce your error