Data source, study design, and setting
This study utilized secondary data from 2016 Ethiopian Demographic and Health Survey (EDHS). The survey data were downloaded from the Measure DHS website after reasonable request and data use permission was fully guaranteed. The 2016 EDHS is part of the worldwide MEASURE DHS project which was funded by the United States Agency for International Development (USAID) and was implemented by the Ethiopian Central Statistical Agency. A DHS is undertaken every 5 years and the 2016 survey is the fourth Demographic and Health Survey in Ethiopia which covers all the nine regions and two administrative cities.
Sample size and sampling procedure
The Ethiopian Demographic and Health Survey program (EDHS) has collected data on nationally representative samples of all age groups and key indicators. The information on the sociodemographic, socioeconomic, and maternal-related variables was included in the survey. A stratified two-stage cluster sampling procedure was employed to select study participants. In the 2016 survey, a total of 645 EAs (202 urban and 443 rural) were selected. From these enumeration areas, 18,008 households and from those households a total of 15,683 reproductive-age women were included in the survey. The relevant information on the sampling procedure and data quality can be accessed elsewhere [4]. For the current study, a total of 3381(weighted sample) teenagers (15–19 years old) were included.
Study variables
Dependent variable
Teenage pregnancy: It is a composite binary outcome variable that refers to the pregnancy experience of a woman between the ages of 15–19 years. History of birth before age of 19 or being pregnant at the time of the interview was considered as teenage pregnancy. Therefore, it was categorized in such a way that 0 = no pregnancy before age 19 and 1 = pregnancy experienced before the age of 19 years. Finally, the weighted proportion of teenage pregnancy per cluster which is a continuous variable was used for spatial analysis including spatial regression analysis.
Independent variables
The aggregated community variables such as community poverty (the proportion of the two lowest wealth quintiles), community contraceptive use (the proportion of women who didn’t use any type of contraceptive), community traditional contraceptive use (the proportion of women who use traditional contraceptive methods), community women education (proportion of women with no education), female community employment (the proportion of unemployed women), community media exposure (the proportion of women who were not exposed to television, radio or reading newspaper) community health insurance coverage (proportion of women who were not covered by health insurance) and community illiteracy (the proportion of women unable to read and write) were considered as candidate independent variables for the spatial regression models.
Data management and analysis
Descriptive analyses were performed using Stata version 14 statistical software. Whereas the spatial analysis was performed using ArcGIS 10.7. Before conducting spatial analysis, the weighted proportions of teenage pregnancy (outcome variable) and candidate predictor variables performed in stata and were exported to ArcGIS. A detailed explanation of the weighting procedure can be found elsewhere [23].
Spatial analysis
Spatial autocorrelation
Spatial autocorrelation rises from the concept of correlation or dependency. Geographically close areas are more related than distant areas. In global autocorrelation the concept is stationary. The correlation between nearby or connected observations will remain the same. Moran I is an indicator of spatial autocorrelation in the range of − 1 to 1. The value being positive shows that close areas have similar values whereas a negative value is an indicator if dissimilarity between adjacent values [24]. The global moran’s I was computed as follows [25]
$${\text{I}} = \frac{{{\text{n}}\sum\nolimits_{{\text{i}}}^{{\text{n}}} {\sum\nolimits_{{\text{j}}}^{{\text{n}}} {{\text{wij}}} } \left( {{\text{yi}} - \overline{{\text{y}}} } \right)\left( {{\text{yj}} - \overline{{\text{y}}} } \right)}}{{\left( {\sum\nolimits_{{\text{i}}}^{{\text{n}}} {\sum\nolimits_{{\text{j}}}^{{\text{n}}} {{\text{wij}}} } } \right)\sum\nolimits_{{\text{i}}} {\left( {yi - \bar{y}} \right)^{2} } }}$$
where yi represents the vector of observations at n different locations, and wij are elements of a spatial weight matrix.
Hot spot analysis
Hot spot analysis identifies statistically significant clustering areas using vectors calculates The Getis–Ord Gi statistic the resultant Z score and p value will identify where the high or low values cluster spatially. The hot spot area is where high values of the given data are surrounded by similar high values to the opposite where low values are surrounded by similar low values give the cold spot areas [26].
Spatial scan statistics
Satscan analyzes spatial–temporal and space–time data using spatial–temporal or space–time scan statistics. It is used to perform geographical surveillance of disease and to detect areas of significantly high or low rates. In the Bernoulli-based model pregnant teenagers were taken as cases and non-pregnant teenagers as controls to determine the geographical locations of statistically significant clusters of teenage pregnancy using kuldorff sat scan version 9.6 software The default maximum spatial cluster size of < 50% of the population was used. The primary and secondary clusters were detected and ranked according to the likelihood ratio test, based on 999 Monte Carlo replications [27]
Spatial regression analysis
Ordinary least squares (OLS) regression
After detecting the hot spot areas of teenage pregnancy, spatial regression modeling was performed to identify predictors of the observed spatial clustering of teenage pregnancy. So first ordinary least square regression was conducted. Findings from the ordinary least squares (OLS) regression are only reliable if the regression model satisfies all of the assumptions that are required by this method. The coefficients of explanatory variables in a properly specified OLS model should be statistically significant and have either a positive or negative sign. Besides, there should not be a correlation among explanatory variables (free from multicollinearity). The model should be unbiased (heteroscedasticity or non-stationarity). The residuals should be normally distributed and revealed no spatial patterns. The model should include key explanatory variables. The residuals must be free from spatial autocorrelation [28]. Thus, these assumptions were checked accordingly. The OLS regression equation [29] is given as:
$$Y_{i} = \beta + \mathop \sum \limits_{{K = 1}}^{P} \left( {\beta _{k} X_{{ik}} ~} \right) + \in _{i} ~$$
where i = 1, 2,…n; β0, β1, β2, …βp are the model parameters, yi is the outcome variable for observation i, Xik are explanatory variables and 1, ∈2, … ∈n are the error term/residuals with zero mean and homogenous variance σ2.
To identify a model that fulfills the assumption of the OLS method, exploratory regression identifies models with high Adjusted R2 values. Besides, it identifies models that meet all of the assumptions of the OLS method [30].
Geographically weighted regression (GWR)
A variable that is a strong predictor in one cluster may not necessarily be a strong predictor in another cluster. This type of cluster variation (non-stationary) can be identified through the use of GWR. In this context, GWR can help to answer the question: “Does the association vary across space?” Unlike OLS that fits a single linear regression equation to all of the data in the study area, GWR creates an equation for each DHS cluster. While the equation in OLS is calibrated using data from all features (cluster in this case), GWR uses data from nearby features. Thus, the GWR coefficient takes different values for each cluster [31] Maps of the coefficients associated with each explanatory variable, which are produced using the GWR, provide guidelines for targeted interventions. The GWR model [32]can be written as:
$$~Y_{{i~}} = \beta _{{O~}} (u_{{i~}} v_{{i~}} ) + \mathop \sum \limits_{{k = 1}}^{p} \beta _{{k~}} \left( {u_{{i~}} v_{{i~}} } \right)X_{{ik~ + ~ \in _{i} }}$$
where yi are observations of response y, uivi are geographical points (longitude, latitude), \(\beta _{{k~}} (u_{{i}} v_{{i}} )\) (k = 0, 1 … p) are p unknown functions of geographic locations uivi, Xik are explanatory variables at location uivi, i = 1, 2, … n and \(\in _{i}\) are error terms/residuals with zero mean and homogenous variance (σ2).