Working with Data in R

Exploratory Data Analysis with the tidyverse

Dr. Arun Mitra
Assistant Professor, Dept of Community & Family Medicine
All India Institute of Medical Sciences, Bibinagar - Hyderabad

What is Exploratory Data Analysis?

  • EDA is the critical first step.

  • EDA is a state of mind.

  • EDA is exploring your ideas.

  • EDA has no strict rules.

  • EDA helps understand your data.

  • EDA is an iterative cycle.

  • EDA is a creative process.

Definition of EDA

“detective work – numerical detective work – or counting detective work – or graphical detective work”

  • Tukey, 1977 Page 1, Exploratory Data Analysis

How to do EDA?

  • The easiest way to do EDA is to use questions as tools to guide your investigation.

  • EDA is an important part of any data analysis, even if the questions are known already.

Asking the right questions

Key to asking quality questions is to generate a large quantity of questions.

It is difficult to ask revealing questions at the start of the analysis.

But, each new question will expose a new aspect and increase your chance of making a discovery.

What Questions to ask?

  • What type of variation occurs within your variables?

  • What type of covariation occurs between your variables?

  • Whether your data meets your expectations or not.

  • Whether the quality of your data is robust or not.

Steps of a data analysis project

  • Preparing Tidy Data
    • Data Cleaning
    • Data Wrangling
  • Data Exploration
    • Transformation
    • Visualization
  • Statistical Analysis
  • Prepare Results
  • Draw Inferences
  • Report Findings

Data Wrangling in R

dplyr Package

The dplyr is a powerful R-package to manipulate, clean and summarize unstructured data.

In short, it makes data exploration and data manipulation easy and fast in R.

Verbs of the dplyr

  • There are many verbs in dplyr that are useful.

Using the pipe operator (|> or %>%)

Let us load some data

library(tidyverse)
library(here)
library(rio)

filepath <- here("data", "who_tubercolosis_data.csv")

tb <- filepath |>
  import(setclass = "tibble")

tb
# A tibble: 3,850 × 18
   country     who_region  year      pop incidence_100k incidence_number
   <chr>       <chr>      <int>    <int>          <dbl>            <int>
 1 Afghanistan EMR         2000 20093756            190            38000
 2 Afghanistan EMR         2001 20966463            189            40000
 3 Afghanistan EMR         2002 21979923            189            42000
 4 Afghanistan EMR         2003 23064851            189            44000
 5 Afghanistan EMR         2004 24118979            189            46000
 6 Afghanistan EMR         2005 25070798            189            47000
 7 Afghanistan EMR         2006 25893450            189            49000
 8 Afghanistan EMR         2007 26616792            189            50000
 9 Afghanistan EMR         2008 27294031            189            52000
10 Afghanistan EMR         2009 28004331            189            53000
# ℹ 3,840 more rows
# ℹ 12 more variables: hiv_percent <dbl>, hiv_incidence_100k <dbl>,
#   hiv_number <int>, mort_nohiv_100k <dbl>, mort_nohiv_number <int>,
#   mort_hiv_100k <dbl>, mort_hiv_number <int>, mort_100k <dbl>,
#   mort_number <int>, case_fatality_ratio <dbl>,
#   new_incidence_100k <dbl>, case_detection_percent <dbl>

Take a Quick Look at the data

tb |>
  glimpse()
Rows: 3,850
Columns: 18
$ country                <chr> "Afghanistan", "Afghanistan", "Afghan…
$ who_region             <chr> "EMR", "EMR", "EMR", "EMR", "EMR", "E…
$ year                   <int> 2000, 2001, 2002, 2003, 2004, 2005, 2…
$ pop                    <int> 20093756, 20966463, 21979923, 2306485…
$ incidence_100k         <dbl> 190, 189, 189, 189, 189, 189, 189, 18…
$ incidence_number       <int> 38000, 40000, 42000, 44000, 46000, 47…
$ hiv_percent            <dbl> 0.36, 0.30, 0.26, 0.23, 0.22, 0.22, 0…
$ hiv_incidence_100k     <dbl> 0.68, 0.57, 0.49, 0.44, 0.41, 0.42, 0…
$ hiv_number             <int> 140, 120, 110, 100, 100, 100, 110, 12…
$ mort_nohiv_100k        <dbl> 67.00, 62.00, 56.00, 57.00, 51.00, 46…
$ mort_nohiv_number      <int> 14000, 13000, 12000, 13000, 12000, 12…
$ mort_hiv_100k          <dbl> 0.15, 0.17, 0.27, 0.25, 0.21, 0.19, 0…
$ mort_hiv_number        <int> 31, 35, 60, 57, 50, 48, 46, 45, 48, 5…
$ mort_100k              <dbl> 67.00, 62.00, 56.00, 57.00, 51.00, 46…
$ mort_number            <int> 14000, 13000, 12000, 13000, 12000, 12…
$ case_fatality_ratio    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ new_incidence_100k     <dbl> 35, 48, 63, 60, 76, 87, 98, 108, 104,…
$ case_detection_percent <dbl> 19, 26, 33, 32, 40, 46, 52, 57, 55, 4…

Check the first few rows

tb |>
  head()
# A tibble: 6 × 18
  country     who_region  year     pop incidence_100k incidence_number
  <chr>       <chr>      <int>   <int>          <dbl>            <int>
1 Afghanistan EMR         2000  2.01e7            190            38000
2 Afghanistan EMR         2001  2.10e7            189            40000
3 Afghanistan EMR         2002  2.20e7            189            42000
4 Afghanistan EMR         2003  2.31e7            189            44000
5 Afghanistan EMR         2004  2.41e7            189            46000
6 Afghanistan EMR         2005  2.51e7            189            47000
# ℹ 12 more variables: hiv_percent <dbl>, hiv_incidence_100k <dbl>,
#   hiv_number <int>, mort_nohiv_100k <dbl>, mort_nohiv_number <int>,
#   mort_hiv_100k <dbl>, mort_hiv_number <int>, mort_100k <dbl>,
#   mort_number <int>, case_fatality_ratio <dbl>,
#   new_incidence_100k <dbl>, case_detection_percent <dbl>

Check the dimensions and the column names

tb |>
  dim()
[1] 3850   18
tb |>
  names()
 [1] "country"                "who_region"             "year"                  
 [4] "pop"                    "incidence_100k"         "incidence_number"      
 [7] "hiv_percent"            "hiv_incidence_100k"     "hiv_number"            
[10] "mort_nohiv_100k"        "mort_nohiv_number"      "mort_hiv_100k"         
[13] "mort_hiv_number"        "mort_100k"              "mort_number"           
[16] "case_fatality_ratio"    "new_incidence_100k"     "case_detection_percent"

Lets find the unique countries in the bottom 50 rows of the dataset

unique(tail(tb, n = 50)$country)
[1] "Yemen"    "Zambia"   "Zimbabwe"
tb |>
  tail(50)  |>
  distinct(country)
# A tibble: 3 × 1
  country 
  <chr>   
1 Yemen   
2 Zambia  
3 Zimbabwe

distinct() and count()

The distinct() function will return the distinct values of a column, while count() provides both the distinct values of a column and then number of times each value shows up.

tb  |>
  distinct(who_region)
# A tibble: 6 × 1
  who_region
  <chr>     
1 EMR       
2 EUR       
3 AFR       
4 WPR       
5 AMR       
6 SEA       
tb  |>
  count(who_region)
# A tibble: 6 × 2
  who_region     n
  <chr>      <int>
1 AFR          835
2 AMR          808
3 EMR          396
4 EUR          967
5 SEA          196
6 WPR          648

arrange()

The arrange() function does what it sounds like. It takes a data frame or tbl and arranges (or sorts) by column(s) of interest.

Use the desc() function to arrange by descending.

tb |>
  count(who_region) |>
  arrange(n)
# A tibble: 6 × 2
  who_region     n
  <chr>      <int>
1 SEA          196
2 EMR          396
3 WPR          648
4 AMR          808
5 AFR          835
6 EUR          967
tb |>
  count(who_region) |>
  arrange(-n) # alt: arrange(desc(n))
# A tibble: 6 × 2
  who_region     n
  <chr>      <int>
1 EUR          967
2 AFR          835
3 AMR          808
4 WPR          648
5 EMR          396
6 SEA          196

Logical Operators in R

  • If you want to satisfy all of multiple conditions, you can use the “and” operator, &.

  • The “or” operator | (the vertical pipe character, shift-backslash) will return a subset that meet any of the conditions.

filter()

  • Rows with year 2015 and above
tb|>
  filter(year >= 2015)
# A tibble: 648 × 18
   country     who_region  year    pop incidence_100k incidence_number
   <chr>       <chr>      <int>  <int>          <dbl>            <int>
 1 Afghanistan EMR         2015 3.37e7          189              64000
 2 Afghanistan EMR         2016 3.47e7          189              65000
 3 Afghanistan EMR         2017 3.55e7          189              67000
 4 Albania     EUR         2015 2.92e6           16                480
 5 Albania     EUR         2016 2.93e6           16                480
 6 Albania     EUR         2017 2.93e6           20                580
 7 Algeria     AFR         2015 3.99e7           74              30000
 8 Algeria     AFR         2016 4.06e7           70              29000
 9 Algeria     AFR         2017 4.13e7           70              29000
10 American S… WPR         2015 5.55e4            8.3                5
# ℹ 638 more rows
# ℹ 12 more variables: hiv_percent <dbl>, hiv_incidence_100k <dbl>,
#   hiv_number <int>, mort_nohiv_100k <dbl>, mort_nohiv_number <int>,
#   mort_hiv_100k <dbl>, mort_hiv_number <int>, mort_100k <dbl>,
#   mort_number <int>, case_fatality_ratio <dbl>,
#   new_incidence_100k <dbl>, case_detection_percent <dbl>

filter()

  • All entries for India
tb|>
  filter(country == "India")
# A tibble: 18 × 18
   country who_region  year        pop incidence_100k incidence_number
   <chr>   <chr>      <int>      <int>          <dbl>            <int>
 1 India   SEA         2000 1053050912            289          3040000
 2 India   SEA         2001 1071477855            288          3090000
 3 India   SEA         2002 1089807112            287          3130000
 4 India   SEA         2003 1108027848            285          3160000
 5 India   SEA         2004 1126135777            282          3180000
 6 India   SEA         2005 1144118674            279          3190000
 7 India   SEA         2006 1161977719            274          3180000
 8 India   SEA         2007 1179681239            268          3160000
 9 India   SEA         2008 1197146906            261          3130000
10 India   SEA         2009 1214270132            254          3090000
11 India   SEA         2010 1230980691            247          3050000
12 India   SEA         2011 1247236029            241          3000000
13 India   SEA         2012 1263065852            234          2960000
14 India   SEA         2013 1278562207            228          2920000
15 India   SEA         2014 1293859294            223          2880000
16 India   SEA         2015 1309053980            217          2840000
17 India   SEA         2016 1324171354            211          2790000
18 India   SEA         2017 1339180127            204          2740000
# ℹ 12 more variables: hiv_percent <dbl>, hiv_incidence_100k <dbl>,
#   hiv_number <int>, mort_nohiv_100k <dbl>, mort_nohiv_number <int>,
#   mort_hiv_100k <dbl>, mort_hiv_number <int>, mort_100k <dbl>,
#   mort_number <int>, case_fatality_ratio <dbl>,
#   new_incidence_100k <dbl>, case_detection_percent <dbl>

Filter by year and country

tb |>
  filter(year >= 2015 & country == "India")
# A tibble: 3 × 18
  country who_region  year        pop incidence_100k incidence_number
  <chr>   <chr>      <int>      <int>          <dbl>            <int>
1 India   SEA         2015 1309053980            217          2840000
2 India   SEA         2016 1324171354            211          2790000
3 India   SEA         2017 1339180127            204          2740000
# ℹ 12 more variables: hiv_percent <dbl>, hiv_incidence_100k <dbl>,
#   hiv_number <int>, mort_nohiv_100k <dbl>, mort_nohiv_number <int>,
#   mort_hiv_100k <dbl>, mort_hiv_number <int>, mort_100k <dbl>,
#   mort_number <int>, case_fatality_ratio <dbl>,
#   new_incidence_100k <dbl>, case_detection_percent <dbl>

The %in% function

  • To filter() a categorical variable for only certain levels, we can use the %in% operator.

  • Let’s see data from India, Nepal, Pakistan and Bangladesh First we will have to figure out how those are spelled in this dataset.

  • Open the spreadsheet viewer and find out.

  • We’ll see a way to find them in code later on in the course.

# Create the Indian Subcontinent Variable
indian_subcont <- c(
            "India",
            "Nepal",
            "Pakistan",
            "Bangladesh",
            "Afghanistan"
            )

# Filter using the %in% function
tb  |>
  filter(country %in% indian_subcont) |>
  count(country)
# A tibble: 6 × 2
  country         n
  <chr>       <int>
1 Afghanistan    18
2 Bangladesh     18
3 India          18
4 Nepal          18
5 Pakistan       18
6 Sri Lanka      18

summarize()

  • The summarize() function summarizes multiple values to a single value.

  • On its own the summarize() function doesn’t seem to be all that useful.

The dplyr package provides a few convenience functions called n() and n_distinct() that tell you the number of observations or the number of distinct values of a particular variable.

summarize() is the same as summarise()

summarize() in Action

tb|>
  summarize(n = n(),
            n_countries = n_distinct(country),
            n_years = n_distinct(year),
            mean_hiv_percent = mean(hiv_percent, na.rm = TRUE),
            sd_hiv_percent = sd(hiv_percent, na.rm = TRUE))
# A tibble: 1 × 5
      n n_countries n_years mean_hiv_percent sd_hiv_percent
  <int>       <int>   <int>            <dbl>          <dbl>
1  3850         218      18             12.1           17.7

group_by()

  • We saw that summarize() isn’t that useful on its own. Neither is group_by().

  • All this does is takes an existing data frame and converts it into a grouped data frame where operations are performed by group.

  • The real power comes in where group_by() and summarize() are used together. First, write the group_by() statement. Then pipe the result to a call to summarize().

group_by() and summarize()

tb |>
  group_by(who_region)  |>
  summarize(n_countries = n_distinct(country),
            mean_inc = mean(incidence_100k, na.rm = TRUE),
            sd_inc = sd(incidence_100k, na.rm = TRUE)) |>
  ungroup()
# A tibble: 6 × 4
  who_region n_countries mean_inc sd_inc
  <chr>            <int>    <dbl>  <dbl>
1 AFR                 47    297.   248. 
2 AMR                 47     32.8   45.6
3 EMR                 22     83.7  123. 
4 EUR                 55     42.7   51.1
5 SEA                 11    260.   151. 
6 WPR                 36    126.   153. 

mutate()

  • Mutate creates a new variable or modifies an existing one.

if_else()

Lets create a column called ind_sub if the country is in the Indian Subcontinent.

tb |>
  mutate(ind_sub = if_else(country %in% indian_subcont,
                              "Indian Subcontinent", "Others")) |>
  select(country, ind_sub) |>
  filter(ind_sub != "Others") |>
  distinct()
# A tibble: 6 × 2
  country     ind_sub            
  <chr>       <chr>              
1 Afghanistan Indian Subcontinent
2 Bangladesh  Indian Subcontinent
3 India       Indian Subcontinent
4 Nepal       Indian Subcontinent
5 Pakistan    Indian Subcontinent
6 Sri Lanka   Indian Subcontinent

group_by() and summarize()

tb |>
  mutate(ind_sub = if_else(country %in% indian_subcont,
                              "Indian Subcontinent", "Others")) |>
  group_by(ind_sub)  |>
  summarize(n_countries = n_distinct(country),
            mean_inc = mean(incidence_100k, na.rm = TRUE),
            sd_inc = sd(incidence_100k, na.rm = TRUE)) |>
  ungroup()
# A tibble: 2 × 4
  ind_sub             n_countries mean_inc sd_inc
  <chr>                     <int>    <dbl>  <dbl>
1 Indian Subcontinent           6     194.   70.0
2 Others                      212     123.  181. 

case_when()

Alternative of if_else()

if_else() vs case_when()

Note that the if_else() function may result in slightly shorter code if you only need to code for 2 options.

For more options, nested if_else() statements become hard to read and could result in mismatched parentheses so case_when() will be a more elegant solution.

join()

  • Typically in a data science or data analysis project one would have to work with many sources of data.

  • The researcher must be able to combine multiple datasets to answer the questions he or she is interested in.

  • As with the other dplyr verbs, there are different families of verbs that are designed to work with relational data and one of the most commonly used family of verbs are the mutating joins.

More on join()

  • left_join(x, y) which combines all columns in data frame x with those in data frame y but only retains rows from x.

  • right_join(x, y) also keeps all columns but operates in the opposite direction, returning only rows from y.

  • full_join(x, y) combines all columns of x with all columns of y and retains all rows from both data frames.

  • inner_join(x, y) combines all columns present in either x or y but only retains rows that are present in both data frames.

Visual representation of the left_join()

Visual representation of the right_join()

Visual representation of the full_join()

pivot()

Most often, when working with our data we may have to reshape our data from long format to wide format and back. We can use the pivot family of functions to achieve this task.

Other Useful Functions

drop_na()

The drop_na() function is extremely useful for when we need to subset a variable to remove missing values.

select()

While the filter() function allows you to return only certain rows matching a condition, the select() function returns only certain columns. The first argument is the data, and subsequent arguments are the columns you want.

More Resources for dplyr

Questions?

Thank You!