geographically weighted regression

Geographically Weighted Regression for Community Health Index (Part 1)

Posted on

Introduction

Health development is a planned and sustainable effort to improve the health status of society through various strategies, policies and interventions. The main goal of health development is to achieve optimal levels of health for the entire population, including increasing access to health services, disease prevention, health promotion, and improving the general quality of life. Therefore, the Indonesian Ministry of Health’s Health Research and Development Agency (BALITBANGKES) compiled the Public Health Development Index (PHDI). PHDI is a collection of health indicators that can be easily and directly measured to describe health problems.

We will modelling the PHDI using the Geographically Weighted Regression which generate distinct regression models for each specific location. The case study chosen is the PHDI on the island of Java using 2018 data collected from the Central Statistics Agency (BPS).

Data

The data used is 2018 PHDI data sourced from the Central Statistics Agency of Indonesia collected from 119 regencies/cities on the island of Java with average PHDI was 0.6087. The variables are:
Y : Community Health Development Index
X1: Percentage of Population Having Access to Clean Water
X2: Egg and Milk Consumption (Per Capita Per Week)/1000
X3: number of health centers
PHDI data source can be download here.
shp map for java island can be download here.

Summary of Logical Sequence in Analysis:

  1. Initial Exploration: Visualize your data on a map to look for clear patterns.
  2. Test of Spatial Dependence (Global Moran’s I) on the Dependent Variable: To see if there are clusters of values on the variable you want to explain.
  3. OLS Model (Classical Regression): Run an OLS model as a baseline.
  4. Test of Spatial Dependence (Global Moran’s I) on OLS Residuals: This is a crucial step. If the OLS residuals show significant spatial autocorrelation, it indicates the presence of unmodeled spatial effects.
  5. Test of Spatial Heterogeneity (F-Test for Non-Stationarity): If the OLS residuals show autocorrelation, the next step is to test whether spatial heterogeneity is the cause. This will tell you whether GWR is the right choice.
  6. GWR Model: If the F-test is significant, run GWR.
  7. GWR Residual Analysis: Test Moran’s I on the GWR residuals. Ideally, the spatial autocorrelation in the GWR residuals should be significantly reduced or insignificant, indicating that the GWR has successfully captured most of the spatial pattern.
  8. Interpreting the Local GWR Coefficients: Analyze and plot the local GWR coefficients to understand how the relationship varies across space.

The analysist will be done by R. By following these steps, we can systematically identify the presence and nature of spatial dependencies and heterogeneity in your data.

First we load the R packages:

library(readxl) 
library(openxlsx) 
library(spdep) 
library(raster) 
library(corrplot)  
library(DescTools) 
library(nortest) 
library(car) 
library(spatialreg)
library(sf)
library(nortest) 
library(DescTools) 
library(lmtest)
library(tidyverse)
library(spgwr)
library(ggplot2)

Next, read the file PHDI data:

dfx <- read_excel("D:/dataset-chdi.xlsx")
head(dfx)
str(dfx)

Select columns/attributes/variables from data. We only need city_code, Y, X1, X2, X3

dfx1<-dfx%>% dplyr::select(city_code,Y,X1,X2,X3)
str(dfx1)
tibble [119 × 5] (S3: tbl_df/tbl/data.frame)
$ city_code: num [1:119] 3101 3171 3172 3173 3174 ...
$ Y : num [1:119] 0.638 0.674 0.68 0.65 0.649 ...
$ X1 : num [1:119] 85 90 91 90 90 ...
$ X2 : num [1:119] 18 30.1 31.4 20.5 24.5 ...
$ X3 : num [1:119] 6 72 84 35 73 45 101 58 45 62 ...

show summary data

summary(dfx1)

   city_code          Y                X1               X2               X3       
 Min.   :3101   Min.   :0.5345   Min.   : 61.91   Min.   : 7.191   Min.   :  3.0  
 1st Qu.:3276   1st Qu.:0.6253   1st Qu.: 79.40   1st Qu.:12.475   1st Qu.: 21.0  
 Median :3327   Median :0.6437   Median : 90.72   Median :15.104   Median : 27.0  
 Mean   :3386   Mean   :0.6406   Mean   : 87.71   Mean   :16.534   Mean   : 30.4  
 3rd Qu.:3516   3rd Qu.:0.6633   3rd Qu.: 96.06   3rd Qu.:19.150   3rd Qu.: 36.5  
 Max.   :3674   Max.   :0.7319   Max.   :100.00   Max.   :33.651   Max.   :101.0  

Step 1: Initial Exploration: Visualize your data on plots and map to look for clear patterns and relationship.

The Boxplot

par(mfrow = c(1, 4)) # set layout for 3 plots in 1 row
boxplot(dfx$Y,main="Y distribution") 
boxplot(dfx$X1,main="X1 distribution") 
boxplot(dfx$X2,main="X2 distribution") 
boxplot(dfx$X3,main="X3 distribution") 

Based on the Boxplot above, it can be seen that there are no outlier values in variable X1, whereas in Y, X2 and X3 there are outliers.

Relationship of the data

We use Scatter Plot to show linear relationship of data X1 -> Y, X2 -> Y, and X3 -> Y

par(mfrow = c(1, 3)) # set layout for 3 plots in 1 row

plot(dfx1$X1, dfx1$Y,
  xlab="X1", 
  ylab="Y",
  main="Scatter Plot of X1 and Y",
  pch=20, col="orange", cex=2)
reg.klasik = lm(Y~X1, data=dfx1)
x_num <-sapply(reg.klasik,is.numeric)
abline(reg.klasik)

plot(dfx$X2, dfx$Y,
  xlab="X2", 
  ylab="Y",
  main="Scatter Plot of X2 and Y",
  pch=20, col="orange", cex=2)
reg.klasik = lm(Y~X2, data=dfx)
abline(reg.klasik)

plot(dfx$X3, dfx$Y,
  xlab="X3", 
  ylab="Y",
  main="Scatter Plot of X3 and Y",
  pch=20, col="orange", cex=2)
reg.klasik = lm(Y~X3, data=dfx)
abline(reg.klasik)

Use Correlation matrix to show the correlation between Y and variables X1, X2, X3

corrplot(cor(dfx1[,2:5]), method = “number”)

Based on the correlation matrix above, it can be seen that Y has a positive correlation with X1 and X2, and has a negative correlation with X3.

We can do also correlation tests to find out numerical correaltion between Y to X1, X2, and X3.
For X1 vs Y, we use hypotesist:
H0: There is no correlation between variables X1 and Y
H1: There is a correlation between variables X1 and Y

cor.test(dfx$Y, dfx1$X1)

result:
	Pearson's product-moment correlation

data:  dfx$Y and dfx1$X1
t = 2.0507, df = 117, p-value = 0.04253
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.006492699 0.354384355
sample estimates:
      cor 
0.1862706 

conclution: p-value < 0.05, it can be concluded that H0 is rejected, which means there is a correlation between variables X1 and Y.

For X2 vs Y
H0: There is no correlation between variables X2 and Y
H1: There is a correlation between variables X2 and Y

cor.test(dfx1$Y, dfx1$X2)

result:
	Pearson's product-moment correlation

data:  dfx1$Y and dfx1$X2
t = 8.072, df = 117, p-value = 6.864e-13
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.4685205 0.7024543
sample estimates:
      cor 
0.5980791 

conclution: p-value < 0.05, it can be concluded that H0 is rejected, which means there is a correlation between variables X2 and Y.

For X3 vs Y
H0: There is no correlation between variables X3 and Y
H1: There is a correlation between variables X3 and Y

cor.test(dfx$Y, dfx$X3)

result:
	Pearson's product-moment correlation

data:  dfx$Y and dfx$X3
t = -2.5987, df = 117, p-value = 0.01056
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.39690734 -0.05595789
sample estimates:
       cor 
-0.2336006 

Kesimpulan : p-value < 0.05, it can be concluded that H0 is rejected, which means there is a correlation between variables X3 and Y.

Data Spatial Distribution

First of all we load packages from R

library(tmap)
library(tmaptools)
library(RColorBrewer)

#Input map shp
mapx <- st_read(file.path("D:/javamap/jawa.shp"))

result:
Reading layer `jawa' from data source `D:\javamap\jawa.shp' using driver `ESRI Shapefile'
Simple feature collection with 119 features and 5 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 105.0998 ymin: -8.78036 xmax: 116.2702 ymax: -5.048857
Geodetic CRS:  WGS 84

We set tmap mode to plotting and switch off spherical geometry (s2) in order to run the tm_shape()

tmap_mode("plot")
Colours<- rev(brewer.pal(8, "RdYlBu"))
sf_use_s2(FALSE)

next, we merge column Y, X1, X2, X3 from dfx to mapx

mapx$Y<- dfx$Y
mapx$X1<- dfx$X1
mapx$X2<- dfx$X2
mapx$X3<- dfx$X3
mapx

result:
Simple feature collection with 119 features and 9 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 105.0998 ymin: -8.78036 xmax: 116.2702 ymax: -5.048857
Geodetic CRS:  WGS 84
First 10 features:
   PROVNO KABKOTNO    PROVINSI           KABKOT ID2013                       geometry      Y    X1     X2  X3
1      31       01 DKI JAKARTA KEPULAUAN SERIBU   3101 MULTIPOLYGON (((106.3842 -5... 0.6385 85.00 17.971   6
2      31       71 DKI JAKARTA  JAKARTA SELATAN   3171 MULTIPOLYGON (((106.8491 -6... 0.6737 90.00 30.138  72
3      31       72 DKI JAKARTA    JAKARTA TIMUR   3172 MULTIPOLYGON (((106.9719 -6... 0.6804 91.00 31.450  84
4      31       73 DKI JAKARTA    JAKARTA PUSAT   3173 MULTIPOLYGON (((106.8788 -6... 0.6503 90.00 20.456  35
5      31       74 DKI JAKARTA    JAKARTA BARAT   3174 MULTIPOLYGON (((106.711 -6.... 0.6494 90.00 24.460  73
6      31       75 DKI JAKARTA    JAKARTA UTARA   3175 MULTIPOLYGON (((106.969 -6.... 0.6365 90.00 28.309  45
7      32       01  JAWA BARAT            BOGOR   3201 MULTIPOLYGON (((106.994 -6.... 0.6098 81.06 16.918 101
8      32       02  JAWA BARAT         SUKABUMI   3202 MULTIPOLYGON (((106.9652 -6... 0.6057 74.58 13.788  58
9      32       03  JAWA BARAT          CIANJUR   3203 MULTIPOLYGON (((107.2843 -6... 0.6016 93.62 10.065  45
10     32       04  JAWA BARAT          BANDUNG   3204 MULTIPOLYGON (((107.75 -6.8... 0.6065 77.81 15.152  62

Now we can plot the mapx data to a map based on the response variable Y

tm_shape(mapx) +
tm_compass(position = c("left", "bottom"))+
tm_polygons("Y", title='legend note', style='quantile', 
              #n=7, palette=c("red","pink","orange","yellow","yellowgreen","springgreen3","royalblue1"), title='X2',
             n=5, palette=c("mistyrose1","pink","tomato1","red3","darkred"),
              border.col='grey27', alpha=.9)+
tm_text(
    # The key argument: specify the column for the text labels
    text = "ID2013", 
    # Optional arguments to customize the text appearance
    size = 0.5,
    fontface = "bold",
    col = "black",
    shadow = FALSE
  ) +
  tm_layout(title = "response Y",
            title.position = c("center", "top"),
            legend.position = c("left", "bottom"),
            legend.text.size = 0.8,
            frame = FALSE, 
    inner.margins = c(0, 0, 0, 0),
    outer.margins = c(0, 0, 0, 0)
            ) 

Based on 2018 PHDI data on the island of Java, several districts still have low PHDI levels spread across 4 provinces, namely West Java Province, Central Java Province, East Java Province and Banten Province. In West Java Province there are Sukabumi Regency, Cianjur Regency, Bandung Regency, Garut Regency, Tasikmalaya Regency. In Central Java Province there are Banjarnegara Regency and Jepara Regency. For East Java Province there are Bondowoso Regency, Situbondo Regency, Probolinggo Regency, Bangkalan Regency, Sampang Regency and Pamekasan Regency. Meanwhile, in Banten Province there are Pandeglang Regency, Lebak Regency, Serang Regency and Serang City.

Next, Step 2: Test of Spatial Dependence (Global Moran’s I) on the Dependent Variable: To see if there are clusters of values on the variable you want to explain.

Leave a Reply

Your email address will not be published. Required fields are marked *