Exercise 2: Working with Data
About the dataset
The dataset contains information on deaths due to COVID-19 in the 14 districts of Kerala state. It is available for download from the Government of Kerala COVID-19 Dashboard (https://dashboard.kerala.gov.in/covid/).
Step 1: Load Packages
library(tidyverse) # core packages for tidy data operations
library(here) # for organising file paths
library(rio) # for importing data
library(janitor) # for data cleaning and exploration
library(gtsummary) # for publication-ready tables
Step 2: Load Data
Create path to the file to read
filepath <- here("data", "kerala_covid_deaths.rds")
Read data
# filepath was already built with here(), so it can be passed to read_rds() directly
df <- read_rds(filepath)
3. Are the names of the dataset clean? If not, clean them.
df <- df |>
  janitor::clean_names()
names(df)
[1] "sl_no" "date_reported"
[3] "district" "name"
[5] "place" "age"
[7] "sex" "date_of_death"
[9] "history_traveler_contact"
4. Check the class and structure of the dataset
df |>
  class()
[1] "tbl_df" "tbl" "data.frame"
df |>
  str()
tibble [26,982 × 9] (S3: tbl_df/tbl/data.frame)
$ sl_no : num [1:26982] 1 2 3 4 5 6 7 8 9 10 ...
$ date_reported : Date[1:26982], format: "2021-10-21" "2021-10-21" ...
$ district : Factor w/ 14 levels "Alappuzha","Ernakulam",..: 12 12 12 12 12 12 12 12 12 12 ...
$ name : chr [1:26982] "Anu john b" "Dinamony k" "Govinda pillai" "I danam" ...
$ place : chr [1:26982] "Kattakada" "Kilimanoor" "Pallichal" "Thiruvananthapuram" ...
$ age : num [1:26982] 31 87 77 65 49 88 72 74 58 55 ...
$ sex : Factor w/ 2 levels "Male","Female": 1 1 1 1 1 1 1 1 2 1 ...
$ date_of_death : Date[1:26982], format: "2021-10-14" "2021-10-08" ...
$ history_traveler_contact: logi [1:26982] NA NA NA NA NA NA ...
5. Remove the variables not necessary for the analysis
df <- df |>
  select(-c("sl_no", "name", "place", "history_traveler_contact"))
view(df)
Here we have removed the variables not necessary for the analysis.
6. Describing the dataset
Write R code to find the number of rows and columns.
df |>
  dim()
[1] 26982 5
df |>
  nrow()
[1] 26982
df |>
  ncol()
[1] 5
df |>
  glimpse()
Rows: 26,982
Columns: 5
$ date_reported <date> 2021-10-21, 2021-10-21, 2021-10-21, 2021-10-21, 2021-10…
$ district <fct> Thiruvananthapuram, Thiruvananthapuram, Thiruvananthapur…
$ age <dbl> 31, 87, 77, 65, 49, 88, 72, 74, 58, 55, 88, 87, 58, 57, …
$ sex <fct> Male, Male, Male, Male, Male, Male, Male, Male, Female, …
$ date_of_death <date> 2021-10-14, 2021-10-08, 2021-10-18, 2021-10-17, 2021-10…
df |> as_tibble()
# A tibble: 26,982 × 5
date_reported district age sex date_of_death
<date> <fct> <dbl> <fct> <date>
1 2021-10-21 Thiruvananthapuram 31 Male 2021-10-14
2 2021-10-21 Thiruvananthapuram 87 Male 2021-10-08
3 2021-10-21 Thiruvananthapuram 77 Male 2021-10-18
4 2021-10-21 Thiruvananthapuram 65 Male 2021-10-17
5 2021-10-21 Thiruvananthapuram 49 Male 2021-10-11
6 2021-10-21 Thiruvananthapuram 88 Male 2021-10-17
7 2021-10-21 Thiruvananthapuram 72 Male 2021-09-10
8 2021-10-21 Thiruvananthapuram 74 Male 2021-10-06
9 2021-10-21 Thiruvananthapuram 58 Female 2021-10-12
10 2021-10-21 Thiruvananthapuram 55 Male 2021-10-18
# ℹ 26,972 more rows
How many districts are there?
df |>
  distinct(district)
# A tibble: 14 × 1
district
<fct>
1 Thiruvananthapuram
2 Kollam
3 Pathanamthitta
4 Alappuzha
5 Kottayam
6 Ernakulam
7 Thrissur
8 Palakkad
9 Malappuram
10 Kozhikode
11 Wayanad
12 Kannur
13 Kasaragod
14 Idukki
df |>
  count(district) |>
  nrow()
[1] 14
There are 14 districts.
How many deaths occurred in each district?
df |>
  count(district) |>
  arrange(n)
# A tibble: 14 × 2
district n
<fct> <int>
1 Idukki 406
2 Wayanad 458
3 Kasaragod 532
4 Pathanamthitta 902
5 Kottayam 1237
6 Alappuzha 1599
7 Kannur 1884
8 Kollam 2203
9 Malappuram 2403
10 Ernakulam 2621
11 Palakkad 2645
12 Kozhikode 2813
13 Thrissur 3106
14 Thiruvananthapuram 4173
According to this dataset, when did the maximum number of deaths occur?
df |>
  count(date_reported) |>
  arrange(desc(n))
# A tibble: 431 × 2
date_reported n
<date> <int>
1 2021-06-06 227
2 2021-06-02 213
3 2021-08-25 213
4 2021-09-21 213
5 2021-06-07 210
6 2021-06-05 209
7 2021-09-15 208
8 2021-06-13 206
9 2021-05-29 198
10 2021-08-19 197
# ℹ 421 more rows
By date of reporting, the maximum number of deaths (227) was recorded on June 6, 2021.
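The count above groups by date_reported, so it identifies the busiest reporting day. If the question is read as asking on which day the most deaths occurred, the same pattern can be applied to date_of_death. The following is a minimal sketch, not the exercise's official answer. It uses a small invented tibble (the hypothetical name toy) in place of df, and slice_max() instead of arrange() to pick the top row.

```r
library(dplyr)

# Hypothetical toy data standing in for df; the dates are invented.
toy <- tibble(
  date_reported = as.Date(c("2021-06-06", "2021-06-06", "2021-06-07", "2021-06-07")),
  date_of_death = as.Date(c("2021-06-01", "2021-06-01", "2021-06-01", "2021-06-03"))
)

# Count deaths by the day they occurred and keep the day with the highest count
peak_by_death <- toy |>
  count(date_of_death) |>
  slice_max(n, n = 1)

peak_by_death
```

With the real data, replacing toy with df should give the peak day by date of death, which may differ from the peak reporting day.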
Which are the top five districts in COVID-19 deaths?
df |>
  count(district) |>
  arrange(desc(n)) |>
  slice(1:5)
# A tibble: 5 × 2
district n
<fct> <int>
1 Thiruvananthapuram 4173
2 Thrissur 3106
3 Kozhikode 2813
4 Palakkad 2645
5 Ernakulam 2621
The top five districts are Thiruvananthapuram, Thrissur, Kozhikode, Palakkad, and Ernakulam.
Is there a delay between death and reporting of death? If yes, how many days is the delay?
# Hint: use mutate() to subtract the relevant variables from each other; you can try the mean() function
df |>
  mutate(delay = date_reported - date_of_death) |>
  pull(delay) |>
  mean(na.rm = TRUE) |>
  round(1)
Time difference of 6.2 days
Yes. On average, deaths were reported 6.2 days after they occurred.
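Subtracting two Date columns returns a difftime object, which is why the result prints as "Time difference of ... days". For further arithmetic it can be convenient to convert the delay to a plain number of days. The sketch below assumes a small toy tibble (a hypothetical name) in place of df. Its first two rows reuse dates shown in the output above, and the third row has an invented missing date of death to show how NA values are handled.

```r
library(dplyr)

# Hypothetical toy data standing in for df
toy <- tibble(
  date_reported = as.Date(c("2021-10-21", "2021-10-21", "2021-10-21")),
  date_of_death = as.Date(c("2021-10-14", "2021-10-08", NA))
)

delay_summary <- toy |>
  # Convert the difftime to a numeric number of days
  mutate(delay_days = as.numeric(date_reported - date_of_death, units = "days")) |>
  summarise(
    mean_delay   = mean(delay_days, na.rm = TRUE),
    median_delay = median(delay_days, na.rm = TRUE),
    n_missing    = sum(is.na(delay_days))  # rows where a delay could not be computed
  )

delay_summary
```

Reporting the median alongside the mean is a design choice: a few very late reports can pull the mean upward.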
Is this delay same across all the districts?
df |>
  mutate(delay = date_reported - date_of_death) |>
  group_by(district) |>
  summarize(mean_delay = mean(delay, na.rm = TRUE))
# A tibble: 14 × 2
district mean_delay
<fct> <drtn>
1 Alappuzha 9.597248 days
2 Ernakulam 10.152614 days
3 Idukki 7.551724 days
4 Kannur 7.351911 days
5 Kasaragod 4.460526 days
6 Kollam 6.452565 days
7 Kottayam 7.047696 days
8 Kozhikode 4.633843 days
9 Malappuram 5.778610 days
10 Palakkad 4.883554 days
11 Pathanamthitta 4.966741 days
12 Thiruvananthapuram 4.978672 days
13 Thrissur 5.018674 days
14 Wayanad 3.742358 days
No. The mean delay varies across districts, from about 3.7 days in Wayanad to about 10.2 days in Ernakulam.
Create a new categorical variable representing age as a dichotomous variable
df <- df |>
  mutate(
    # Label matches the condition: ages up to and including 60
    age_group = ifelse(age <= 60, "<=60 Years", ">60 Years"))
view(df)
Two categories have now been created: <=60 years and >60 years.
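As an alternative sketch, dplyr's if_else() is a stricter version of ifelse(), and storing the result as a factor fixes the order of the two categories in later tables. The example below uses an invented toy tibble (a hypothetical name) rather than df, with labels chosen to match the age <= 60 condition. The NA row shows that a missing age stays missing instead of being assigned to a group.

```r
library(dplyr)

# Hypothetical ages, including the boundary value 60 and a missing value
toy <- tibble(age = c(31, 60, 61, 87, NA))

toy <- toy |>
  mutate(
    age_group = if_else(age <= 60, "<=60 Years", ">60 Years"),
    # Store as a factor so the categories keep a fixed order
    age_group = factor(age_group, levels = c("<=60 Years", ">60 Years"))
  )

toy |>
  count(age_group)
```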
Create a new categorical variable representing the Wave of COVID (Use a cut off of 2021-04-01)
df <- df |>
  mutate(
    # Compare against an explicit Date rather than a character string
    wave = ifelse(date_of_death <= as.Date("2021-04-01"), "First Wave", "Second Wave"))
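To check how the cut-off behaves at the boundary, the sketch below applies the same rule to a few invented dates around 2021-04-01. The names toy and wave_cutoff are hypothetical and used only for illustration. A death on the cut-off date itself falls in the first wave because the comparison is <=, and a missing date of death yields a missing wave.

```r
library(dplyr)

# Cut-off taken from the exercise text
wave_cutoff <- as.Date("2021-04-01")

# Hypothetical dates on either side of the cut-off, plus a missing value
toy <- tibble(
  date_of_death = as.Date(c("2021-03-31", "2021-04-01", "2021-04-02", NA))
)

toy <- toy |>
  mutate(wave = if_else(date_of_death <= wave_cutoff, "First Wave", "Second Wave"))

toy
```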
ASSIGNMENTS
What is the distribution of deaths across time?
Visualize COVID-19 mortality in Kerala across the different districts. Which are the high-burden districts?
Do these districts remain the same in both COVID waves? Comment.