Spatial Data Science

for Public Health

Best Practices and Case Studies



Dr. Arun Mitra Peddireddy
SCTIMST, Trivandrum


Spatial Epidemiology Series     |     Webinar No. 3     |     04 Nov 2023

Outline






  1. Introduction
  2. Foundations of Spatial Data Science
  3. Best Practices in Spatial Data Science
  4. Case Studies from India
  5. Challenges and Future Directions
  6. Q&A Session










Introduction

GIS and Public Health

  • Offers a fresh perspective on public health.

  • Makes it possible to overlay data on its spatial representation.

  • Supports better planning and decision-making.

  • The convergence of many new sub-disciplines:

    • medical geography
    • public health informatics
    • data science

Map of the plague in the province of Bari, Naples, 1690-1692

The map shows areas most affected and the boundaries of a military quarantine imposed to prevent its spread to neighboring towns and to other provinces.

Applications of GIS in Public Health

  • disease surveillance
  • environmental health
  • infectious diseases
    • mathematical modelling
    • agent based modelling
  • population genetics
  • medical imaging
  • cancer biology

While traditional uses of GIS in healthcare remain relevant, newer methods and advancing technology could be transformative for public health research.

What is Spatial Data Science?

Definition


Spatial data science (SDS) is a subset of Data Science that focuses on the unique characteristics of spatial data, moving beyond simply looking at where things happen to understand why they happen there.

CARTO - https://carto.com/what-is-spatial-data-science

Like data science, spatial data science seems to be a field that arises bottom-up in and from many existing scientific disciplines and industrial activities concerned with application of spatial data, rather than being a sub-discipline of an existing scientific discipline.

Edzer Pebesma, Roger Bivand - Spatial Data Science With Applications in R

How is it different from Data Science?


Why Spatial Data Science for Public Health?




  • Wealth of Spatial Data

  • 70% of all data that is generated has spatial attributes

  • Routine health data can be geo-referenced

  • Provide a gateway for researchers and practitioners to examine the role and harness the power of SDS in public health

  • Coupled with the emerging field of spatial statistics, the analysis of this location-based data is opening new directions for public health.










Foundational Concepts

Spatial Dependence and Complete Spatial Randomness


Spatial dependence is “the propensity for nearby locations to influence each other and to possess similar attributes”.

This means natural phenomena are not spatially distributed at random.

  • temperature,
  • rainfall,
  • population density,
  • socio-economic conditions etc.

It can be measured by the indices of Spatial Autocorrelation.

Spatial Autocorrelation

Refers to the presence of systematic spatial variation in a mapped variable.

The terms spatial association and spatial dependence are often used to reflect spatial autocorrelation as well.
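
As a rough illustration of how spatial autocorrelation can be quantified, the sketch below computes the global Moran's I, I = (n / S0) · Σi Σj wij (xi − x̄)(xj − x̄) / Σi (xi − x̄)², in base R on a toy 5 × 5 grid with rook-contiguity weights. The grid, the two test patterns and the helper names `rook_weights()` and `morans_i()` are illustrative assumptions and not part of the talk; for real analyses a tested implementation such as the one in the spdep package is preferable.

```r
# Rook-contiguity weights for an nr x nc grid (cells sharing an edge are neighbours).
rook_weights <- function(nr, nc) {
  n <- nr * nc
  rows <- rep(seq_len(nr), times = nc)   # row index of each cell (column-major order)
  cols <- rep(seq_len(nc), each = nr)    # column index of each cell
  w <- matrix(0, n, n)
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      # A Manhattan distance of 1 means the two cells share an edge
      if (abs(rows[i] - rows[j]) + abs(cols[i] - cols[j]) == 1) w[i, j] <- 1
    }
  }
  w
}

# Global Moran's I for values x and weights matrix w
morans_i <- function(x, w) {
  z <- x - mean(x)
  (length(x) / sum(w)) * sum(w * outer(z, z)) / sum(z^2)
}

w <- rook_weights(5, 5)
m <- matrix(0, 5, 5)

# A smooth gradient across columns: neighbouring cells hold similar values
gradient <- rep(1:5, each = 5)
# A checkerboard: every neighbour holds the opposite value
checker <- as.vector((row(m) + col(m)) %% 2)

morans_i(gradient, w)  # positive: similar values sit next to each other
morans_i(checker, w)   # -1 (up to floating point): perfect negative autocorrelation

# Permutation test: how often does a random reshuffle look at least as clustered?
set.seed(1)
perm <- replicate(999, morans_i(sample(gradient), w))
p_value <- (sum(perm >= morans_i(gradient, w)) + 1) / (999 + 1)
p_value
```

Reshuffling the values over the cells mimics the absence of spatial dependence, so a small `p_value` suggests the observed pattern is unlikely to have arisen at random.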

Indices to measure Spatial Dependence

  • Covariance Functions and Variograms

  • Global Spatial Autocorrelation Measures

    • Moran’s I index
    • General G-Statistic
    • Geary’s C index
  • Local Indicators of Spatial Association (LISA)

    • Local Moran’s I index
    • Getis-Ord Gi and Gi* statistics
  • Space-Time Correlation Analysis

    • Bivariate Moran’s I for STC
    • Differential Moran’s I
    • Emerging Hot Spot Analysis (EHSA)
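
To show how a local indicator differs from a global one, here is a minimal base R sketch of the local Moran's I for eight hypothetical areas arranged along a line. The rates, the neighbour structure and all variable names are made up for illustration; significance testing (usually done by conditional permutation) is left out, and packages such as spdep offer complete implementations.

```r
# Toy example: 8 areas along a line; each area neighbours the one before and after it.
n <- 8
w <- matrix(0, n, n)
for (i in seq_len(n - 1)) {
  w[i, i + 1] <- 1
  w[i + 1, i] <- 1
}
w <- w / rowSums(w)          # row-standardise so that each row sums to 1

# Hypothetical rates with a cluster of high values in areas 1 to 3
x <- c(9, 8, 9, 2, 1, 2, 1, 2)

z   <- x - mean(x)           # deviations from the overall mean
m2  <- sum(z^2) / n          # variance term (population form)
lag <- as.vector(w %*% z)    # average deviation among each area's neighbours

local_i  <- z * lag / m2     # local Moran's I for each area
global_i <- mean(local_i)    # with row-standardised weights, global I is the mean of the local values

# Classify each area by its own value and its neighbours' values
quadrant <- ifelse(z > 0 & lag > 0, "high-high",
            ifelse(z < 0 & lag < 0, "low-low",
            ifelse(z > 0, "high-low", "low-high")))

data.frame(area = seq_len(n), x, local_i = round(local_i, 2), quadrant)
```

Areas labelled high-high or low-low contribute to positive autocorrelation (hot spots and cold spots), while high-low and low-high areas are candidate spatial outliers.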

Why is the CRS Important?

The Mercator projection is used where angular relationships (shapes and bearings) matter, but it distorts relative areas, increasingly so towards the poles.

The Mollweide projection is an equal-area (pseudocylindrical) projection: all mapped areas keep the same proportional relationship to the corresponding areas on the Earth.

The Plate Carrée (equidistant cylindrical) projection preserves distances along meridians, so it is used where north-south distance measurements matter.

The Robinson projection is a compromise: it preserves neither area, angle nor distance exactly, but keeps all three distortions within acceptable limits.

The United Nations Logo uses the Azimuthal Equidistant projection

What four commonly used projections do, as shown on the human head
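
To make the area distortion of the Mercator projection concrete, the short sketch below uses the standard result for the spherical Mercator: lengths are stretched by 1 / cos(latitude) in both directions, so areas are inflated by 1 / cos²(latitude). The function name `mercator_area_factor()` and the chosen latitudes are illustrative assumptions.

```r
# Area inflation of the (spherical) Mercator projection relative to the equator.
# Lengths are stretched by 1 / cos(lat) in both directions, so areas by 1 / cos(lat)^2.
mercator_area_factor <- function(lat_deg) {
  1 / cos(lat_deg * pi / 180)^2
}

lat <- c(equator = 0, lat_10 = 10, lat_45 = 45, lat_60 = 60, lat_80 = 80)
round(mercator_area_factor(lat), 2)
```

At 60° latitude a region is drawn four times larger than a region of equal true area at the equator, whereas around 10° N (roughly the latitude of Kerala) the inflation is only about 3%.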

CRS in Action

Data Science as a Methodological Approach

Note

The key word in data science is not data, it is science.

– Jeff Leek, JHU Data Science Lab

Reproducible Research



There are four key elements of reproducible research:

  • data documentation
  • data publication
  • code publication
  • output publication.

Tools for Spatial Data Science

  • GIS related
  • Data Science related
  • Spatial Data Science related

R is the best spatial data science tool available for public health!


R provides a range of powerful packages for geospatial analysis, enabling advanced computations and analytics.

R Spatial Analysis Ecosystem

R Spatial Learning Resources

  • Wealth of Resource material

  • Powerful tools/packages

  • seamlessly handle vector and raster data

  • interactive visualization

  • end-to-end solution


Newest addition: Spatial Data Science: With Applications in R

The sf package

install.packages("sf")

The sf package is an R implementation of Simple Features.

This package incorporates:

  • a new spatial data class system in R

  • functions for reading and writing data

  • tools for spatial operations on vectors
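
The three components listed above can be sketched in a few lines. The example builds a small stand-in data set (three towns, with coordinates in WGS 84) rather than using the talk's files; the object names, the temporary GeoPackage file and the choice of EPSG:32643 (WGS 84 / UTM zone 43N, which covers Kerala) are assumptions made for illustration.

```r
library(sf)

# Hypothetical stand-in data: three towns with longitude/latitude in WGS 84
towns <- data.frame(
  name = c("Angamali", "Kaladi", "Guruvayur"),
  lon  = c(76.38866, 76.43405, 76.04651),
  lat  = c(10.200267, 10.167068, 10.598107)
)

# 1. Class system: an sf object is a data frame with a geometry list-column
towns_sf <- st_as_sf(towns, coords = c("lon", "lat"), crs = 4326)
class(towns_sf)

# 2. Reading and writing: round-trip through a GeoPackage in a temporary file
gpkg <- tempfile(fileext = ".gpkg")
st_write(towns_sf, gpkg, quiet = TRUE)
towns_back <- st_read(gpkg, quiet = TRUE)

# 3. Spatial operations on vectors: 5 km buffers, computed in a projected CRS
towns_utm <- st_transform(towns_sf, 32643)   # WGS 84 / UTM zone 43N, units in metres
buffers   <- st_buffer(towns_utm, dist = 5000)
st_area(buffers)
```

Transforming to a projected CRS before buffering keeps the buffer distance in metres rather than degrees.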

Geometry Types in sf

Loading sf package

library(sf)
library(here)  # provides here() for building project-relative file paths

fs::dir_tree(here("spatial_files", "kl_pop_centers"))
C:/Users/Arun/Dropbox/Research/Spatial_Data_Science_Talk/spatial_files/kl_pop_centers
├── kl_pop_centers.dbf
├── kl_pop_centers.prj
├── kl_pop_centers.shp
└── kl_pop_centers.shx

Load spatial data into R

shape_file <- here("spatial_files", "kl_pop_centers", "kl_pop_centers.shp")

kl_pop_centers <- st_read(shape_file)
Reading layer `kl_pop_centers' from data source 
  `C:\Users\Arun\Dropbox\Research\Spatial_Data_Science_Talk\spatial_files\kl_pop_centers\kl_pop_centers.shp' 
  using driver `ESRI Shapefile'
Simple feature collection with 170 features and 14 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 74.95388 ymin: 8.35761 xmax: 77.28071 ymax: 12.60804
Geodetic CRS:  WGS 84

View the sf object

kl_pop_centers
Simple feature collection with 170 features and 14 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 74.95388 ymin: 8.35761 xmax: 77.28071 ymax: 12.60804
Geodetic CRS:  WGS 84
First 10 features:
   Rotation Scale    name_of_to      district  state ELEVATION     District_1
1         0     0 VADAKKANCHERI      PALAKKAD KERALA         0       P>LAKK>D
2         0     0      ANGAMALI     ERNAKULAM KERALA         0      ERN>KULAM
3         0     0    MALAYATLUR     ERNAKULAM KERALA         0      ERN>KULAM
4         0     0        KALADI     ERNAKULAM KERALA         0      ERN>KULAM
5         0     0      TOMALLUR PATTANAMTITTA KERALA         0 PATTANAMTHITTA
6         0     0     GURUVAYUR      THRISSUR KERALA         0        TRISS@R
7         0     0 TRIMBRANALLUR      THRISSUR KERALA         0        TRISS@R
8         0     0      KADIKKAD      THRISSUR KERALA         0        TRISS@R
9         0     0     CHALAKUDI      THRISSUR KERALA         0        TRISS@R
10        0     0          MALA      THRISSUR KERALA         0        TRISS@R
   STATE_1     TEHSIL Shape_Leng Shape_Area pop_2020      lon       lat
1   KERALA    >LATT@R   150475.2  578919605   456575 76.48236 10.591936
2   KERALA      >LUVA   155736.4  550828742   615928 76.38866 10.200267
3   KERALA      >LUVA   155736.4  550828742   615928 76.51483 10.197496
4   KERALA      >LUVA   155736.4  550828742   615928 76.43405 10.167068
5   KERALA       AD@R   110539.2  270281878   187760 76.68601  9.227245
6   KERALA CH>LAKKUDI   376808.1 1270626860  1153806 76.04651 10.598107
7   KERALA CH>LAKKUDI   376808.1 1270626860  1153806 76.10642 10.523958
8   KERALA CH>LAKKUDI   376808.1 1270626860  1153806 75.96130 10.681153
9   KERALA CH>LAKKUDI   376808.1 1270626860  1153806 76.34039 10.308017
10  KERALA CH>LAKKUDI   376808.1 1270626860  1153806 76.26185 10.250355
                    geometry
1  POINT (76.48236 10.59194)
2  POINT (76.38866 10.20027)
3   POINT (76.51483 10.1975)
4  POINT (76.43405 10.16707)
5  POINT (76.68601 9.227245)
6  POINT (76.04651 10.59811)
7  POINT (76.10642 10.52396)
8   POINT (75.9613 10.68115)
9  POINT (76.34039 10.30802)
10 POINT (76.26185 10.25036)
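
Each line of the printed header can also be queried programmatically. The sketch below rebuilds a three-row stand-in for `kl_pop_centers` from the first rows of the output above (so it runs without the shapefile) and inspects it with standard sf accessors; the object name `pts` is an assumption.

```r
library(sf)

# Stand-in for kl_pop_centers: first three rows of the printed output
pts <- st_as_sf(
  data.frame(
    name_of_to = c("VADAKKANCHERI", "ANGAMALI", "MALAYATLUR"),
    district   = c("PALAKKAD", "ERNAKULAM", "ERNAKULAM"),
    lon = c(76.48236, 76.38866, 76.51483),
    lat = c(10.591936, 10.200267, 10.197496)
  ),
  coords = c("lon", "lat"), crs = 4326, remove = FALSE
)

unique(st_geometry_type(pts))  # the "Geometry type" line
st_bbox(pts)                   # the "Bounding box" line
st_crs(pts)$epsg               # EPSG code behind the "Geodetic CRS" line
st_is_longlat(pts)             # TRUE: coordinates are in degrees, not metres

# Attribute table without the geometry column
st_drop_geometry(pts)

# Pairwise distances; for longitude/latitude data sf reports them in metres
st_distance(pts)
```

Because the data are in a geodetic CRS, operations that need planar units (buffers, areas) are usually run after `st_transform()` to a suitable projected CRS.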

Plot the sf object

library(ggplot2)  # ggplot() and geom_sf()
library(dplyr)    # provides the %>% pipe

kl_pop_centers %>%
  ggplot() +
  geom_sf()

Plot the sf object

kl_pop_centers %>% 
  ggplot() +
  geom_sf(aes(color = district))

Concept of the sf package

Dependencies of the sf package

Methods in sf

methods(class="sf")
  [1] $<-                          [                           
  [3] [[<-                         aggregate                   
  [5] anti_join                    arrange                     
  [7] as.data.frame                cbind                       
  [9] coerce                       dbDataType                  
 [11] dbWriteTable                 distinct                    
 [13] dplyr_reconstruct            drop_na                     
 [15] filter                       full_join                   
 [17] gather                       group_by                    
 [19] group_split                  identify                    
 [21] initialize                   inner_join                  
 [23] left_join                    merge                       
 [25] mutate                       nest                        
 [27] pivot_longer                 pivot_wider                 
 [29] plot                         print                       
 [31] rbind                        rename                      
 [33] right_join                   rowwise                     
 [35] sample_frac                  sample_n                    
 [37] select                       semi_join                   
 [39] separate                     separate_rows               
 [41] show                         slice                       
 [43] slotsFromS3                  spread                      
 [45] st_agr                       st_agr<-                    
 [47] st_area                      st_as_s2                    
 [49] st_as_sf                     st_as_sfc                   
 [51] st_bbox                      st_boundary                 
 [53] st_break_antimeridian        st_buffer                   
 [55] st_cast                      st_centroid                 
 [57] st_collection_extract        st_concave_hull             
 [59] st_convex_hull               st_coordinates              
 [61] st_crop                      st_crs                      
 [63] st_crs<-                     st_difference               
 [65] st_drop_geometry             st_filter                   
 [67] st_geometry                  st_geometry<-               
 [69] st_inscribed_circle          st_interpolate_aw           
 [71] st_intersection              st_intersects               
 [73] st_is                        st_is_valid                 
 [75] st_join                      st_line_merge               
 [77] st_m_range                   st_make_valid               
 [79] st_minimum_rotated_rectangle st_nearest_points           
 [81] st_node                      st_normalize                
 [83] st_point_on_surface          st_polygonize               
 [85] st_precision                 st_reverse                  
 [87] st_sample                    st_segmentize               
 [89] st_set_precision             st_shift_longitude          
 [91] st_simplify                  st_snap                     
 [93] st_sym_difference            st_transform                
 [95] st_triangulate               st_triangulate_constrained  
 [97] st_union                     st_voronoi                  
 [99] st_wrap_dateline             st_write                    
[101] st_z_range                   st_zm                       
[103] summarise                    transform                   
[105] transmute                    ungroup                     
[107] unite                        unnest                      
see '?methods' for accessing help and source code
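
Many of the methods listed above are dplyr verbs, which means an sf object can be handled like an ordinary data frame while the geometry travels along. The sketch below demonstrates this on a five-row stand-in built from the printed rows of `kl_pop_centers`; the object names are assumptions.

```r
library(sf)
library(dplyr)

# Stand-in data taken from the printed rows of kl_pop_centers
pts <- st_as_sf(
  data.frame(
    name_of_to = c("ANGAMALI", "MALAYATLUR", "KALADI", "GURUVAYUR", "CHALAKUDI"),
    district   = c("ERNAKULAM", "ERNAKULAM", "ERNAKULAM", "THRISSUR", "THRISSUR"),
    lon = c(76.38866, 76.51483, 76.43405, 76.04651, 76.34039),
    lat = c(10.200267, 10.197496, 10.167068, 10.598107, 10.308017)
  ),
  coords = c("lon", "lat"), crs = 4326
)

# filter() keeps the geometry column attached ("sticky" geometry)
ernakulam <- pts %>% filter(district == "ERNAKULAM")

# group_by() + summarise() aggregates attributes and combines the geometries per group
by_district <- pts %>%
  group_by(district) %>%
  summarise(n_towns = n())

by_district
```

Use `st_drop_geometry()` first when only the attribute table is needed, since combining geometries can be slow on large data sets.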

Interactive sf

  • Lightweight
  • Interactive
  • Cross Platform

Where to look for help?

https://posit.co/wp-content/uploads/2022/10/sf.pdf










Best Practices


  • Data Related
    • Data Acquisition
    • Data Cleaning
    • Data Curation
  • Analysis Related

Exploratory Data Analysis

  • EDA is the critical first step.

  • EDA is a state of mind.

  • EDA is exploring your ideas.

  • EDA has no strict rules.

  • EDA helps understand your data.

  • EDA is an iterative cycle.

  • EDA is a creative process.

What is EDA?

It is mostly a philosophy of data analysis in which the researcher examines the data without preconceived ideas in order to discover what the data can tell them about the phenomena being studied.

“Exploratory data analysis is detective work – numerical detective work – or counting detective work – or graphical detective work”

– Tukey, 1977, Exploratory Data Analysis, page 1

Questions to ask in EDA

The easiest way to do EDA is to use questions as tools to guide your investigation. EDA is an important part of any data analysis, even if the questions are known already.

“There are no routine statistical questions, only questionable statistical routines.”

Sir David Cox

“Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise.”

John Tukey

Asking the right questions

Key to asking quality questions is to generate a large quantity of questions.

It is difficult to ask revealing questions at the start of the analysis.

But, each new question will expose a new aspect and increase your chance of making a discovery.


5 W’s and 1 H of Spatial EDA / ESDA

  • What?

  • Where?

  • When?

  • Who?

  • Why?

  • How?

Questions to ask:

  • What type of variation occurs within your variables?

  • What type of covariation occurs between your variables?

  • Does your data meet your expectations?

  • Is the quality of your data robust?
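
These questions map onto a handful of base R calls. The sketch below uses `quakes`, a data set that ships with R (1000 seismic events near Fiji with latitude, longitude, depth, magnitude and number of reporting stations), purely as a convenient stand-in with spatial coordinates; it is not public health data and is not used in the talk.

```r
# Built-in example data: lat, long, depth, mag, stations
data(quakes)

# Variation within a variable: spread and shape of magnitude
summary(quakes$mag)
sd(quakes$mag)
hist(quakes$mag, main = "Distribution of magnitude", xlab = "Magnitude")

# Covariation between variables: are stronger events reported by more stations?
cor(quakes$mag, quakes$stations)
plot(quakes$mag, quakes$stations, xlab = "Magnitude", ylab = "Stations reporting")

# Expectations and quality: missing values, plausible ranges, duplicate rows
colSums(is.na(quakes))
range(quakes$lat)
range(quakes$long)
sum(duplicated(quakes))
```

Each result tends to raise the next question, for example whether the relationship between magnitude and stations holds at all depths.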

The process of EDA


It is an iterative process

  1. Import
  2. Tidy
  3. Explore
    • Transform
    • Visualize
    • Transform
    • Visualize
    • Transform
    • Visualize …
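
One pass through this cycle might look like the base R sketch below. The inline CSV of district case counts is entirely hypothetical (districts A to C, invented numbers), and the column names are assumptions chosen for the example.

```r
# 1. Import: read a small, hypothetical CSV of district case counts (inline for the sketch)
raw <- read.csv(text = "district,cases_2021,cases_2022,population
A,120,150,100000
B,45,30,50000
C,300,280,400000")

# 2. Tidy: one row per district-year instead of one column per year
tidy <- reshape(raw, direction = "long",
                varying = c("cases_2021", "cases_2022"),
                v.names = "cases", timevar = "year", times = c(2021, 2022),
                idvar = "district")
rownames(tidy) <- NULL

# 3. Explore: transform ...
tidy$rate_per_100k <- tidy$cases / tidy$population * 1e5

# ... then visualize, and repeat with the next question
rates <- xtabs(rate_per_100k ~ year + district, data = tidy)
barplot(rates, beside = TRUE, legend.text = TRUE,
        xlab = "District", ylab = "Cases per 100,000")
```

Here the transformation to rates changes the picture: district C has the most cases but not the highest rate, which would prompt another round of transforming and plotting.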

Steps for any good data analysis project

Preparing Tidy Data

  • Data Cleaning
  • Data Wrangling

Data Exploration

  • Data Transformation
  • Data Visualization

Statistical Analysis

Prepare Results

Draw Inferences

Report Findings

Spatial Data Visualization

Cartographic Principles

  • Geography and Geospatial Science Working Group (GeoSWG) recognised the need for best practices in cartography.

    • Visual contrast
    • Legibility
    • Figure-Ground Orientation
    • Hierarchical Organization
    • Balance


These guidelines help researchers develop high-quality, consistent map products.

Cartographic Guidelines for Public Health

  • CDC, Atlanta
  • Some important aspects:
    • Map Elements
      • Title and Borders
      • North Arrow / Graticule / Scale
      • Inset Maps
      • Labels and Legend
    • Other Elements
      • Data Sources
      • Dates
      • Projection
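
As one possible way to include these elements in R, the sketch below builds a small map with ggplot2 and sf. The three points are taken from the `kl_pop_centers` output shown earlier; the title, caption wording, theme choices and the use of the optional ggspatial package for the scale bar and north arrow are assumptions made for illustration, not recommendations from the guidelines themselves.

```r
library(sf)
library(ggplot2)

# Stand-in data: three population centres in WGS 84
pts <- st_as_sf(
  data.frame(
    district = c("PALAKKAD", "ERNAKULAM", "THRISSUR"),
    lon = c(76.48236, 76.38866, 76.04651),
    lat = c(10.591936, 10.200267, 10.598107)
  ),
  coords = c("lon", "lat"), crs = 4326
)

map <- ggplot(pts) +
  geom_sf(aes(color = district), size = 3) +
  coord_sf(crs = 32643) +          # draw in WGS 84 / UTM zone 43N; a graticule is drawn by default
  labs(
    title   = "Selected population centres, Kerala",    # title
    color   = "District",                                # legend title
    caption = paste0("Source: kl_pop_centers shapefile | ",
                     "Projection: WGS 84 / UTM zone 43N | ",
                     "Prepared: ", Sys.Date())           # data source, projection, date
  ) +
  theme_minimal() +
  theme(panel.border = element_rect(fill = NA, colour = "grey40"))  # border

# Scale bar and north arrow, only if the optional ggspatial package is installed
if (requireNamespace("ggspatial", quietly = TRUE)) {
  map <- map +
    ggspatial::annotation_scale(location = "bl") +
    ggspatial::annotation_north_arrow(location = "tr")
}

map
```

Inset maps are not covered here; they usually need a second plot and a layout package.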










Case Studies


  • Point Pattern Data

  • Areal Data

  • Raster Data

  • Network Data

  • Spatio-temporal Models

  • Machine Learning Methods

  • Big Data

Challenges and Future Directions

New Requirements for Spatial Analysis

  • Immediate: The time from action to insight is reducing dramatically

  • Fresh: Primary data needs to be days or months old, not years old

  • Multi-source: Competitive alternative sources for completeness or validation

  • Continuous: Analysis can no longer be a point in time

  • Automated: Possibility to continuously replicate and connect to decision tools

Digital Twins
