Exercise 1: Working with Data

In this workbook we will learn to work with data in R, using responses collected from the participants of this workshop through a Google Form. The data have the following variables:

Timestamp of the response
Age
State/UT of residence
Gender
Years of professional experience
Area of practice
Whether student or faculty
Frequency of using data analysis tools
Software of choice for data analysis
No. of conferences/workshops attended in 2023

Step 1: Loading data and packages

## Loading the packages
library(tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

✔ ggplot2 3.4.3     ✔ purrr   1.0.2
✔ tibble  3.2.1     ✔ dplyr   1.1.3
✔ tidyr   1.3.0     ✔ stringr 1.5.0
✔ readr   2.1.4     ✔ forcats 1.0.0

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

library(here)

here() starts at C:/Users/Arun/Dropbox/PhD/Workshops/communicating-research-workshop

library(gtsummary)

# Loading the data

conf_data <- read_csv(here("data","quest_particip.csv"))

Rows: 29 Columns: 10

── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (7): Timestamp, State/UT of residence, Gender, Specialty area/ Area of p...
dbl (3): Age (in years), Years of professional experience (in numbers), Freq...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
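
The message above points to two ways of quieting the column specification. A minimal sketch of both follows, using a small temporary file as a stand-in for quest_particip.csv. The file contents and the object names `demo_quiet` and `demo_typed` are made up for illustration and are not part of the workshop data.

```r
library(readr)

# Hypothetical stand-in for quest_particip.csv, written to a temporary file
tmp <- tempfile(fileext = ".csv")
writeLines(c("Age (in years),Gender", "49,Male", "22,Female"), tmp)

# Option 1: let readr guess the column types but suppress the message
demo_quiet <- read_csv(tmp, show_col_types = FALSE)

# Option 2: declare the column types explicitly
demo_typed <- read_csv(
  tmp,
  col_types = cols(
    `Age (in years)` = col_double(),
    Gender = col_character()
  )
)

# Print the full column specification that was used
spec(demo_typed)
```

Declaring the types is more typing, but it makes the import fail loudly if a column ever arrives in an unexpected format.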

Step 2: Looking at the data

## View the data in a spreadsheet-style viewer (uncomment to run)
# conf_data |> view()

## Dimensions of the data: number of rows, then number of columns
conf_data |>
  dim()

[1] 29 10

## Glimpse the data
conf_data |>
  glimpse()

Rows: 29
Columns: 10
$ Timestamp                                       <chr> "31/01/2024 17:41:04",…
$ `Age (in years)`                                <dbl> 49, 22, 26, 29, 33, 45…
$ `State/UT of residence`                         <chr> "Delhi", "Assam", "And…
$ Gender                                          <chr> "Male", "Male", "Femal…
$ `Years of professional experience (in numbers)` <dbl> 20, 1, 2, 3, 5, 18, 1,…
$ `Specialty area/ Area of practice`              <chr> "Clinical departments"…
$ `Whether student or faculty`                    <chr> "Faculty", "Student", …
$ `Frequency of using Data analysis tools`        <dbl> 5, 1, 3, 2, 2, 4, 2, 5…
$ `Software of choice for data analysis`          <chr> "SPSS", "R", "MS Excel…
$ `No. of conferences/workshops attended in 2023` <chr> "None", "1-3", "More t…

## Check names of variables
conf_data |>
  names()

 [1] "Timestamp"                                    
 [2] "Age (in years)"                               
 [3] "State/UT of residence"                        
 [4] "Gender"                                       
 [5] "Years of professional experience (in numbers)"
 [6] "Specialty area/ Area of practice"             
 [7] "Whether student or faculty"                   
 [8] "Frequency of using Data analysis tools"       
 [9] "Software of choice for data analysis"         
[10] "No. of conferences/workshops attended in 2023"

Step 3: Cleaning the variable names

## Clean the names of the dataset
conf_data <-
  conf_data |>
  janitor::clean_names()

## Check if the names are cleaned
conf_data |>
  names()

 [1] "timestamp"                                   
 [2] "age_in_years"                                
 [3] "state_ut_of_residence"                       
 [4] "gender"                                      
 [5] "years_of_professional_experience_in_numbers" 
 [6] "specialty_area_area_of_practice"             
 [7] "whether_student_or_faculty"                  
 [8] "frequency_of_using_data_analysis_tools"      
 [9] "software_of_choice_for_data_analysis"        
[10] "no_of_conferences_workshops_attended_in_2023"
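
To see the renaming in isolation, here is a small sketch that applies `janitor::clean_names()` to a one-row tibble. The tibble `demo` is hypothetical; it only reuses three of the original column names shown earlier.

```r
library(tibble)

# Hypothetical one-row tibble reusing three of the original column names
demo <- tibble(
  `Age (in years)` = 49,
  `State/UT of residence` = "Delhi",
  `Whether student or faculty` = "Faculty"
)

# clean_names() converts the names to lower-case snake_case
demo_clean <- janitor::clean_names(demo)

names(demo_clean)
```

Names without spaces or special characters can be typed without backticks, which is why the cleaning is done before any further analysis.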

Step 4: Data Exploration

## look at gender of participants
conf_data |>
  count(gender)

# A tibble: 3 × 2
  gender                n
  <chr>             <int>
1 Female               13
2 Male                 13
3 Prefer not to say     3

## look at place of residence
conf_data |>
  count(state_ut_of_residence)

# A tibble: 16 × 2
   state_ut_of_residence     n
   <chr>                 <int>
 1 Andhra Pradesh            2
 2 Assam                     1
 3 Bihar                     1
 4 Chattisgarh               1
 5 Delhi                     2
 6 Gujarat                   1
 7 Haryana                   1
 8 Karnataka                 5
 9 Kerala                    6
10 Maharashtra               1
11 Orissa                    1
12 Pondicherry               1
13 Punjab                    1
14 Tamil Nadu                3
15 Uttar Pradesh             1
16 West Bengal               1
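
With many categories it can help to list the most common ones first. The sketch below assumes a made-up tibble `demo` with a single `state` column and uses the `sort` argument of `count()`.

```r
library(dplyr)

# Hypothetical data: six participants from three states
demo <- tibble(
  state = c("Kerala", "Karnataka", "Kerala", "Delhi", "Kerala", "Karnataka")
)

# sort = TRUE orders the result from the largest count to the smallest
state_counts <- demo |>
  count(state, sort = TRUE)

state_counts
```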

## Check the class of a variable
conf_data |>
  select(years_of_professional_experience_in_numbers) |> # select() keeps the variable as a one-column tibble
  class()

[1] "tbl_df"     "tbl"        "data.frame"

# A different way: pull() extracts the variable as a vector
conf_data |>
  pull(years_of_professional_experience_in_numbers) |>  # pull the variable out as a vector
  class()

[1] "numeric"
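
The difference between the two approaches matters when the next function expects a vector. A minimal sketch, assuming a toy tibble `demo` built from the first five experience values shown in the glimpse output:

```r
library(dplyr)

# Toy data: the first five experience values from the glimpse() output
demo <- tibble(exp_years = c(20, 1, 2, 3, 5))

# select() returns a tibble with one column
as_tibble_col <- demo |> select(exp_years)

# pull() returns the values themselves as a numeric vector
as_vector <- demo |> pull(exp_years)

is.data.frame(as_tibble_col) # the selected column is still a data frame
is.numeric(as_vector)        # the pulled column is a plain numeric vector

# Functions such as mean() work on the vector
mean(as_vector)
```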

### Years of professional experience summary
conf_data$years_of_professional_experience_in_numbers |> # `$` extracts a single variable from a dataset as a vector
  summary()

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00    5.00   10.00   12.48   20.00   32.00

### Age of the participants
conf_data |>
  pull(age_in_years) |>
  summary()

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  22.00   32.00   40.00   39.21   47.00   62.00

conf_data$age_in_years |> summary()

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  22.00   32.00   40.00   39.21   47.00   62.00

Exercise 1

Which software do the participants prefer for data analysis?
Can you tell how many participants are students?

Grouping and summarising data

conf_data |>
  group_by(gender) |> # select the variable to group by
  summarise(mean_age = mean(age_in_years)) # choose the variable to summarise and the function to apply

# A tibble: 3 × 2
  gender            mean_age
  <chr>                <dbl>
1 Female                37.3
2 Male                  43.5
3 Prefer not to say     28.7
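
Several summaries can be computed in a single `summarise()` call. The sketch below uses a made-up tibble `demo` with hypothetical ages; the column names `n`, `mean_age` and `sd_age` are chosen only for illustration.

```r
library(dplyr)

# Hypothetical data: two participants per gender group
demo <- tibble(
  gender = c("Female", "Female", "Male", "Male"),
  age = c(30, 40, 35, 45)
)

age_by_gender <- demo |>
  group_by(gender) |>
  summarise(
    n = n(),              # number of participants in each group
    mean_age = mean(age), # mean age in each group
    sd_age = sd(age)      # standard deviation of age in each group
  )

age_by_gender
```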

Exercise 2

What is the median years of experience in the student and faculty groups?
How does the frequency of using data analysis tools vary among participants who use different software for data analysis?

Step 5: Data Wrangling

Renaming variables

# Renaming the variables

conf_data <- conf_data |>
  rename(
    # new_name = old_name
    exp_years = years_of_professional_experience_in_numbers,
    area_spec = specialty_area_area_of_practice,
    conf_no = no_of_conferences_workshops_attended_in_2023,
    da_tool = software_of_choice_for_data_analysis,
    freq_da_tool = frequency_of_using_data_analysis_tools
  )


## Checking the new names

conf_data |> names()

 [1] "timestamp"                  "age_in_years"              
 [3] "state_ut_of_residence"      "gender"                    
 [5] "exp_years"                  "area_spec"                 
 [7] "whether_student_or_faculty" "freq_da_tool"              
 [9] "da_tool"                    "conf_no"

Data Wrangling - Creating a new variable

# Creating a new variable called seniority from years of experience
conf_data <- conf_data |>
  mutate(seniority = if_else( # new_variable_name = if_else(condition, value if TRUE, value if FALSE)
    exp_years >= mean(exp_years), # condition: experience at or above the mean
    "Seniors", # value when the condition is TRUE
    "Juniors"  # value when the condition is FALSE
  ))

## Looking at the new variable

conf_data |> count(seniority)

# A tibble: 2 × 2
  seniority     n
  <chr>     <int>
1 Juniors      15
2 Seniors      14
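
The workshop data have no missing values in years of experience, but a single `NA` would make `mean()` return `NA` and leave every participant unclassified. A minimal sketch of guarding against that with `na.rm = TRUE`, using a made-up tibble `demo` that contains one missing value:

```r
library(dplyr)

# Hypothetical data with one missing value
demo <- tibble(exp_years = c(0, 5, 10, 20, NA))

demo <- demo |>
  mutate(seniority = if_else(
    exp_years >= mean(exp_years, na.rm = TRUE), # mean of the non-missing values
    "Seniors",
    "Juniors"
  ))

# The row with missing experience stays NA instead of being guessed
demo
```

Here the mean of the non-missing values is 8.75, so the first two rows are coded as Juniors and the next two as Seniors.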

Exercise 3

Can you make a new variable using the variable on frequency of use of data analysis tools? Code - 4,5 as more frequent | 1,2,3 as less frequent

Step 6: Data Visualization

# Plot 1: Create a histogram of Age

conf_data |> # Step 1: Specify Dataset
  ggplot() + # Step 2: Initiate the plot
  geom_histogram( # Step 3: Add Geometry (starts with `geom_`)
    aes(x = age_in_years)) + # Step 4: Add Aesthetics (within the `aes()`)
  labs(
    title = "Age Distribution",
    x = "Age in years"  )
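
By default `geom_histogram()` splits the data into 30 bins and prints a message suggesting a better value, which is rarely a good fit for a small sample. The sketch below sets the bin width explicitly. It assumes a toy data frame `demo` built from the first six ages in the glimpse output, and the 5-year bin width is an arbitrary choice for illustration.

```r
library(ggplot2)

# Toy data: the first six ages from the glimpse() output
demo <- data.frame(age_in_years = c(49, 22, 26, 29, 33, 45))

age_hist <- ggplot(demo) +
  geom_histogram(
    aes(x = age_in_years),
    binwidth = 5,    # each bar covers 5 years of age
    colour = "white" # outline the bars so neighbours are easy to tell apart
  ) +
  labs(
    title = "Age Distribution",
    x = "Age in years",
    y = "Number of participants"
  )

age_hist
```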

# Plot 2: Create a barchart of gender

conf_data |> # Step 1: Specify Dataset
  ggplot() + # Step 2: Initiate the plot
  geom_bar(  # Step 3: Add Geometry (starts with `geom_`)
    aes(x = gender, # Step 4a: Add Aesthetics (within the `aes()`)
        fill = gender)) + # Step 4b: Add Fill color (within the `aes()`)
  labs(
    title = "Gender of participants in the workshop",
    x = "Gender",
    caption = "IAPSMCON Pre conference workshop exercise"
  )

# Plot 3: Create a boxplot of years of experience by participant type

conf_data |> # Step 1: Specify Dataset
  ggplot() + # Step 2: Initiate the plot
  geom_boxplot(  # Step 3: Add Geometry (starts with `geom_`)
    aes(x = whether_student_or_faculty, # Step 4a: Add grouping variable (within the `aes()`)
        y = exp_years, # Step 4b: Add continuous variable (within the `aes()`)
        fill = whether_student_or_faculty)) + # Step 4c: Add Fill color (within the `aes()`)
  labs(
    title = "Years of experience among the participants",
    fill = "Participant type",
    x = "Participant type",
    y = "Years of experience"
  )

Exercise 4

Can you make a histogram of the frequency of using data analysis tools? What happens when you set fill to the new variable you created in Exercise 3?
Can you make a bar chart of the seniority of the participants?
Can you create a boxplot of the frequency of using data analysis tools among people who use different software?

Creating summary tables

Table 1: Descriptive statistics of participants at the IAPSMCON workshop, by gender

Characteristic                          Female, N = 13¹   Male, N = 13¹   Prefer not to say, N = 3¹
Age (years)                             37.3 (9.7)        43.5 (9.8)      28.7 (3.5)
State of residence
    Andhra Pradesh                      1 (7.7%)          1 (7.7%)        0 (0%)
    Assam                               0 (0%)            1 (7.7%)        0 (0%)
    Bihar                               1 (7.7%)          0 (0%)          0 (0%)
    Chattisgarh                         0 (0%)            1 (7.7%)        0 (0%)
    Delhi                               1 (7.7%)          1 (7.7%)        0 (0%)
    Gujarat                             1 (7.7%)          0 (0%)          0 (0%)
    Haryana                             1 (7.7%)          0 (0%)          0 (0%)
    Karnataka                           3 (23%)           1 (7.7%)        1 (33%)
    Kerala                              3 (23%)           2 (15%)         1 (33%)
    Maharashtra                         1 (7.7%)          0 (0%)          0 (0%)
    Orissa                              0 (0%)            1 (7.7%)        0 (0%)
    Pondicherry                         0 (0%)            1 (7.7%)        0 (0%)
    Punjab                              0 (0%)            1 (7.7%)        0 (0%)
    Tamil Nadu                          0 (0%)            2 (15%)         1 (33%)
    Uttar Pradesh                       0 (0%)            1 (7.7%)        0 (0%)
    West Bengal                         1 (7.7%)          0 (0%)          0 (0%)
Years of experience                     10.5 (9.4)        16.8 (7.7)      2.3 (1.2)
area_spec
    Clinical departments                3 (23%)           3 (23%)         1 (33%)
    Public health/ Community medicine   10 (77%)          10 (77%)        2 (67%)
Designation
    Faculty                             6 (46%)           10 (77%)        1 (33%)
    Student                             7 (54%)           3 (23%)         2 (67%)
freq_da_tool
    1                                   1 (7.7%)          3 (23%)         0 (0%)
    2                                   4 (31%)           1 (7.7%)        2 (67%)
    3                                   1 (7.7%)          2 (15%)         0 (0%)
    4                                   5 (38%)           5 (38%)         1 (33%)
    5                                   2 (15%)           2 (15%)         0 (0%)
da_tool
    Don't use                           0 (0%)            1 (7.7%)        0 (0%)
    Jamovi                              1 (7.7%)          0 (0%)          0 (0%)
    MS Excel                            3 (23%)           2 (15%)         0 (0%)
    R                                   2 (15%)           2 (15%)         0 (0%)
    SPSS                                4 (31%)           6 (46%)         2 (67%)
    STATA                               3 (23%)           2 (15%)         1 (33%)
conf_no
    1-3                                 5 (38%)           7 (54%)         1 (33%)
    4-6                                 5 (38%)           2 (15%)         1 (33%)
    More than 6                         2 (15%)           3 (23%)         0 (0%)
    None                                1 (7.7%)          1 (7.7%)        1 (33%)
seniority
    Juniors                             9 (69%)           3 (23%)         3 (100%)
    Seniors                             4 (31%)           10 (77%)        0 (0%)
¹ Mean (SD); n (%)
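
The code behind the table is not shown in this workbook. The sketch below is one way a table of this shape could be built with `tbl_summary()` from the gtsummary package loaded in Step 1. It is an assumption, not the workshop's actual code: the tibble `demo`, its values, and the caption are made up for illustration.

```r
library(dplyr)
library(gtsummary)

# Hypothetical data: six participants with three variables
demo <- tibble(
  gender = c("Female", "Female", "Female", "Male", "Male", "Male"),
  age_in_years = c(26, 33, 41, 49, 22, 45),
  whether_student_or_faculty = c(
    "Student", "Faculty", "Faculty", "Faculty", "Student", "Faculty"
  )
)

demo_table <- demo |>
  tbl_summary(
    by = gender, # one column per gender group
    # Numeric variables with few distinct values are treated as categorical
    # by default, so the type is stated explicitly for this small example
    type = list(age_in_years ~ "continuous"),
    # Report mean (SD) for continuous variables, as in the table footnote
    statistic = list(all_continuous() ~ "{mean} ({sd})"),
    # Replace raw variable names with readable row labels
    label = list(
      age_in_years ~ "Age (years)",
      whether_student_or_faculty ~ "Designation"
    )
  ) |>
  modify_caption("Descriptive statistics of the toy data, by gender")

demo_table
```

Rows such as area_spec and da_tool in the table above still show raw variable names; adding them to the `label` list would give them readable labels in the same way.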

Inline coding

There were 29 participants at the pre-conference workshop. The mean age of the participants was 39.2068966 years.
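
In a Quarto document, an inline expression such as `` `r nrow(conf_data)` `` is replaced by its value when the document is rendered. The long decimal in the sentence above suggests the mean was inserted without rounding. A minimal sketch of rounding first, assuming a vector `ages` that holds only the first six ages from the glimpse output as a stand-in for `conf_data$age_in_years`:

```r
# First six ages from the glimpse() output; a stand-in for conf_data$age_in_years
ages <- c(49, 22, 26, 29, 33, 45)

n_participants <- length(ages)
mean_age <- round(mean(ages), 1) # round to one decimal place before printing

# The same expressions could be placed inline in the text of the document
sentence <- paste0(
  "There were ", n_participants, " participants. ",
  "Their mean age was ", mean_age, " years."
)

sentence
```

Written inline, the rounded mean would be `` `r round(mean(conf_data$age_in_years), 1)` ``.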