Exercise 2: Working with Data

About the dataset

The dataset contains information on deaths due to COVID-19 in the 14 districts of Kerala state. It is available for download from the Government of Kerala COVID-19 Dashboard (https://dashboard.kerala.gov.in/covid/).

Step 1: Load Packages

library(tidyverse)  # for the core tidy data packages (dplyr, readr, ggplot2, ...)
library(here)       # for organising file paths
library(rio)        # for importing data
library(janitor)    # for data cleaning and exploration 
library(gtsummary)  # for publication-ready tables

Step 2: Load Data

Create the path to the file

filepath <- here("data", "kerala_covid_deaths.rds")

Read data

df <- read_rds(filepath)  # filepath was already built with here()
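
If the read step fails, the usual cause is a wrong path. The following is a minimal sketch of checking that a file exists before reading it. It uses a temporary file as a hypothetical stand-in for data/kerala_covid_deaths.rds, so the file name and the toy values are assumptions, not part of the exercise data.

```r
library(readr)

# Hypothetical stand-in for the real .rds file, written to a temp folder
tmp_path <- file.path(tempdir(), "toy_deaths.rds")
write_rds(data.frame(age = c(31, 87)), tmp_path)

# Stop early with a clear error if the path is wrong
stopifnot(file.exists(tmp_path))

# Read the file back into R
toy <- read_rds(tmp_path)
```

The same file.exists() check can be applied to filepath before calling read_rds().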

3. Are the column names of the dataset clean? If not, clean them.

df <- df |> 
  janitor::clean_names()

names(df)

[1] "sl_no"                    "date_reported"           
[3] "district"                 "name"                    
[5] "place"                    "age"                     
[7] "sex"                      "date_of_death"           
[9] "history_traveler_contact"

4. Check the class and structure of the dataset

df |> 
  class()

[1] "tbl_df"     "tbl"        "data.frame"

df |> 
  str()

tibble [26,982 × 9] (S3: tbl_df/tbl/data.frame)
 $ sl_no                   : num [1:26982] 1 2 3 4 5 6 7 8 9 10 ...
 $ date_reported           : Date[1:26982], format: "2021-10-21" "2021-10-21" ...
 $ district                : Factor w/ 14 levels "Alappuzha","Ernakulam",..: 12 12 12 12 12 12 12 12 12 12 ...
 $ name                    : chr [1:26982] "Anu john b" "Dinamony k" "Govinda pillai" "I danam" ...
 $ place                   : chr [1:26982] "Kattakada" "Kilimanoor" "Pallichal" "Thiruvananthapuram" ...
 $ age                     : num [1:26982] 31 87 77 65 49 88 72 74 58 55 ...
 $ sex                     : Factor w/ 2 levels "Male","Female": 1 1 1 1 1 1 1 1 2 1 ...
 $ date_of_death           : Date[1:26982], format: "2021-10-14" "2021-10-08" ...
 $ history_traveler_contact: logi [1:26982] NA NA NA NA NA NA ...

5. Remove the variables not necessary for the analysis

df <- df |> 
  select(-c("sl_no", "name", "place", "history_traveler_contact"))

view(df)

Four variables (sl_no, name, place and history_traveler_contact) have been removed, as they are not needed for the analysis.

6. Describing the dataset

Write R code to check the number of rows and columns.

df |> 
  dim()

[1] 26982     5

df |> 
  nrow()

[1] 26982

df |> 
  ncol()

[1] 5

df |> 
  glimpse()

Rows: 26,982
Columns: 5
$ date_reported <date> 2021-10-21, 2021-10-21, 2021-10-21, 2021-10-21, 2021-10…
$ district      <fct> Thiruvananthapuram, Thiruvananthapuram, Thiruvananthapur…
$ age           <dbl> 31, 87, 77, 65, 49, 88, 72, 74, 58, 55, 88, 87, 58, 57, …
$ sex           <fct> Male, Male, Male, Male, Male, Male, Male, Male, Female, …
$ date_of_death <date> 2021-10-14, 2021-10-08, 2021-10-18, 2021-10-17, 2021-10…

df |> as_tibble()

# A tibble: 26,982 × 5
   date_reported district             age sex    date_of_death
   <date>        <fct>              <dbl> <fct>  <date>       
 1 2021-10-21    Thiruvananthapuram    31 Male   2021-10-14   
 2 2021-10-21    Thiruvananthapuram    87 Male   2021-10-08   
 3 2021-10-21    Thiruvananthapuram    77 Male   2021-10-18   
 4 2021-10-21    Thiruvananthapuram    65 Male   2021-10-17   
 5 2021-10-21    Thiruvananthapuram    49 Male   2021-10-11   
 6 2021-10-21    Thiruvananthapuram    88 Male   2021-10-17   
 7 2021-10-21    Thiruvananthapuram    72 Male   2021-09-10   
 8 2021-10-21    Thiruvananthapuram    74 Male   2021-10-06   
 9 2021-10-21    Thiruvananthapuram    58 Female 2021-10-12   
10 2021-10-21    Thiruvananthapuram    55 Male   2021-10-18   
# ℹ 26,972 more rows

How many districts are there?

df |> 
  distinct(district)

# A tibble: 14 × 1
   district          
   <fct>             
 1 Thiruvananthapuram
 2 Kollam            
 3 Pathanamthitta    
 4 Alappuzha         
 5 Kottayam          
 6 Ernakulam         
 7 Thrissur          
 8 Palakkad          
 9 Malappuram        
10 Kozhikode         
11 Wayanad           
12 Kannur            
13 Kasaragod         
14 Idukki

df |> 
  count(district) |> nrow()

[1] 14

Solution

There are 14 districts.

How many deaths occurred in each district?

df |> 
  count(district) |> 
  arrange(n)

# A tibble: 14 × 2
   district               n
   <fct>              <int>
 1 Idukki               406
 2 Wayanad              458
 3 Kasaragod            532
 4 Pathanamthitta       902
 5 Kottayam            1237
 6 Alappuzha           1599
 7 Kannur              1884
 8 Kollam              2203
 9 Malappuram          2403
10 Ernakulam           2621
11 Palakkad            2645
12 Kozhikode           2813
13 Thrissur            3106
14 Thiruvananthapuram  4173

According to this dataset, when did the maximum number of deaths occur?

df |> 
  count(date_reported) |> 
  arrange(-n)

# A tibble: 431 × 2
   date_reported     n
   <date>        <int>
 1 2021-06-06      227
 2 2021-06-02      213
 3 2021-08-25      213
 4 2021-09-21      213
 5 2021-06-07      210
 6 2021-06-05      209
 7 2021-09-15      208
 8 2021-06-13      206
 9 2021-05-29      198
10 2021-08-19      197
# ℹ 421 more rows

Solution

The maximum number of deaths (227) was reported on 2021-06-06 (June 6, 2021). Because the code counts by date_reported, this is the peak reporting date rather than the peak date of death.
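
The count above groups by date_reported. To ask when deaths actually occurred, the same pattern can be applied to date_of_death. The sketch below illustrates the difference on a toy tibble; the four rows are hypothetical values, not taken from the Kerala dataset.

```r
library(dplyr)

# Toy data (hypothetical values, not from the Kerala dataset)
toy <- tibble(
  date_reported = as.Date(c("2021-06-06", "2021-06-06", "2021-06-06", "2021-06-07")),
  date_of_death = as.Date(c("2021-06-01", "2021-06-03", "2021-06-03", "2021-06-03"))
)

# Date on which the most deaths were reported
peak_reported <- toy |>
  count(date_reported, sort = TRUE) |>
  slice(1)

# Date on which the most deaths occurred
peak_death <- toy |>
  count(date_of_death, sort = TRUE) |>
  slice(1)
```

In this toy example the two peaks fall on different dates, which is why it matters which date variable is counted.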

Which are the top five districts in COVID-19 deaths?

df |> 
  count(district) |> 
  arrange(-n) |> 
  slice(1:5)

# A tibble: 5 × 2
  district               n
  <fct>              <int>
1 Thiruvananthapuram  4173
2 Thrissur            3106
3 Kozhikode           2813
4 Palakkad            2645
5 Ernakulam           2621

The top five districts are Thiruvananthapuram, Thrissur, Kozhikode, Palakkad and Ernakulam.
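
An alternative to arrange() followed by slice() is dplyr's slice_max(). The sketch below rebuilds the district counts by hand from the output printed earlier, so it runs without the data file. The object names district_counts and top5 are illustrative.

```r
library(dplyr)

# District counts copied from the count(district) output shown earlier
district_counts <- tibble(
  district = c("Idukki", "Wayanad", "Kasaragod", "Pathanamthitta", "Kottayam",
               "Alappuzha", "Kannur", "Kollam", "Malappuram", "Ernakulam",
               "Palakkad", "Kozhikode", "Thrissur", "Thiruvananthapuram"),
  n = c(406, 458, 532, 902, 1237, 1599, 1884, 2203, 2403, 2621,
        2645, 2813, 3106, 4173)
)

# Keep the five rows with the largest n, sorted from highest to lowest
top5 <- district_counts |>
  slice_max(n, n = 5)

# Share of all deaths that fall in the top five districts
top5_share <- sum(top5$n) / sum(district_counts$n)
```

Here top5_share is about 0.57, so the top five districts account for a little over half of the deaths in this dataset.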

Is there a delay between death and reporting of death? If yes, how many days is the delay?

# Hint: use mutate() to subtract one date variable from the other, then try the mean() function

df |> 
  mutate(
    delay = date_reported - date_of_death) |>
  pull(delay) |> 
  mean(na.rm = TRUE) |> 
  round(1)

Time difference of 6.2 days

Yes. On average, deaths were reported 6.2 days after they occurred.
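
Subtracting two Date columns gives a difftime object. Converting it to a plain number of days makes it easier to compute other summaries, such as the median, or to check for negative delays, which would point to data entry errors. The sketch below uses the first three rows shown in the printed tibble earlier. The column name delay_days and the summary names are illustrative.

```r
library(dplyr)

# First three rows of the dataset, as printed earlier
toy <- tibble(
  date_reported = as.Date(c("2021-10-21", "2021-10-21", "2021-10-21")),
  date_of_death = as.Date(c("2021-10-14", "2021-10-08", "2021-10-18"))
)

delay_summary <- toy |>
  # Convert the difftime to a plain number of days
  mutate(delay_days = as.numeric(date_reported - date_of_death, units = "days")) |>
  summarise(
    mean_delay   = mean(delay_days, na.rm = TRUE),
    median_delay = median(delay_days, na.rm = TRUE),
    n_negative   = sum(delay_days < 0, na.rm = TRUE)  # reported before death?
  )
```

For these three rows the delays are 7, 13 and 3 days, so the median is 7 and there are no negative delays.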

Is this delay same across all the districts?

df |> 
  mutate(
    delay = date_reported - date_of_death) |>
  group_by(district) |>
  summarize(
    mean_delay = mean(delay, na.rm = TRUE))

# A tibble: 14 × 2
   district           mean_delay    
   <fct>              <drtn>        
 1 Alappuzha           9.597248 days
 2 Ernakulam          10.152614 days
 3 Idukki              7.551724 days
 4 Kannur              7.351911 days
 5 Kasaragod           4.460526 days
 6 Kollam              6.452565 days
 7 Kottayam            7.047696 days
 8 Kozhikode           4.633843 days
 9 Malappuram          5.778610 days
10 Palakkad            4.883554 days
11 Pathanamthitta      4.966741 days
12 Thiruvananthapuram  4.978672 days
13 Thrissur            5.018674 days
14 Wayanad             3.742358 days

No. The mean delay varies across districts, from about 3.7 days in Wayanad to about 10.2 days in Ernakulam.

Create a new categorical variable representing age as a dichotomous variable

df <- df |> 
  mutate(
    age_group = 
      ifelse(
        age <= 60, "<=60 Years", ">60 Years"))

view(df)

Now, two categories have been created: <=60 and >60.
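
It is worth checking how the boundary value (exactly 60) and missing ages are handled. The sketch below does this on a toy vector of ages. It uses dplyr's if_else(), a stricter variant of ifelse(), and stores the result as a factor so the categories keep a fixed order in tables. The toy ages are hypothetical.

```r
library(dplyr)

# Toy ages around the cut-off, plus a missing value
toy <- tibble(age = c(59, 60, 61, NA))

toy <- toy |>
  mutate(
    # Age 60 falls in the lower group; a missing age stays missing
    age_group = if_else(age <= 60, "<=60 Years", ">60 Years"),
    # Fix the order of the categories
    age_group = factor(age_group, levels = c("<=60 Years", ">60 Years"))
  )

# One row per category, including NA
toy |> count(age_group)
```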

Create a new categorical variable representing the Wave of COVID (Use a cut off of 2021-04-01)

df <- df |> 
  mutate(
    wave = ifelse(
      date_of_death <= as.Date("2021-04-01"), "First Wave", "Second Wave"))
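
As with the age groups, a small example makes the cut-off behaviour explicit: with <=, a death on 2021-04-01 itself is assigned to the first wave. The sketch below uses three hypothetical dates around the cut-off. The name wave_cutoff is illustrative.

```r
library(dplyr)

# Cut-off stored as a Date so the comparison is Date against Date
wave_cutoff <- as.Date("2021-04-01")

# Toy dates: the day before, the day of, and the day after the cut-off
toy <- tibble(
  date_of_death = as.Date(c("2021-03-31", "2021-04-01", "2021-04-02"))
) |>
  mutate(
    wave = if_else(date_of_death <= wave_cutoff, "First Wave", "Second Wave")
  )

# Number of deaths in each wave
wave_counts <- toy |> count(wave)
```

Applying count(wave) to df in the same way gives the number of deaths in each wave for the full dataset.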

ASSIGNMENTS

What is the distribution of deaths across time?
Visualize COVID-19 mortality in Kerala across the districts. Which are the high-burden districts?
Do these districts remain the same in both COVID waves? Comment.