I recently read the 2014 Gates Annual Letter, in which Bill and Melinda Gates debunk 3 myths that *block progress for the poor*. It's a fascinating read and I highly recommend it. It's sometimes easy to let the current state of affairs overwhelm all the progress we've made as a society in the last few decades and this letter does a great job of showing us that we are on the right path.

The debunking of the third myth, that *Saving Lives Leads to Overpopulation*, particularly struck me as novel. The basic premise is that as infant mortality rates go down, it leads to a drop in the average number of children a woman has (the *fertility rate*) and thus in fact results in a slowdown in the population growth! While it certainly makes sense if you think about it, it is counter-intuitive on the surface.

To convince myself, I decided to dig into the data from the **World Bank** to see the trends myself. This post describes the methodology with the supporting **R** code.

First, we download the relevant data sets and import them into R. We look at data from 1961 to 2011 (data from 2012 and 2013 is spotty at best) and remove the unneeded columns.

```
# Import infant mortality data (note that the filename has been changed)
# From: http://data.worldbank.org/indicator/SP.DYN.IMRT.IN
infmort <- read.csv('../assets/infmort.csv', skip=2)
row.names(infmort) <- infmort[,1]
infmort <- infmort[,-c(1,2,3,4,56,57,58)]
# Import fertility data (note that the filename has been changed)
# From: http://data.worldbank.org/indicator/SP.DYN.TFRT.IN/countries
fertility <- read.csv('../assets/fertility.csv', skip=2)
row.names(fertility) <- fertility[,1]
fertility <- fertility[, -c(1,2,3,4,56,57,58)]
# Import healthcare expenditure per capita data (note that the filename has been changed)
# From: http://data.worldbank.org/indicator/SH.XPD.PCAP/countries
healthxp <- read.csv('../assets/healthexp.csv', skip=2)
row.names(healthxp) <- healthxp[,1]
healthxp <- healthxp[,-c(1,2,3,4,56,57,58)]
```

To kick things off, let's first look at how the infant mortality rate (the number of deaths per 1000 live births) has changed over the years right here at home in the United States -

```
# Plot the infant mortality rate in the U.S.
infmort.us <- data.frame(mortality = as.numeric(as.vector(infmort['United States',])),
year = as.numeric(substr(colnames(infmort), 2, nchar(colnames(infmort)))))
ggplot(data = infmort.us, aes(x = year, y = mortality)) +
geom_point(size = 3) +
geom_line(lwd = 1.2) +
ggtitle("Infant Mortality Rate in the United States") +
theme_bw()
```

The graph shows a clear downward trend, a reason to be optimistic in itself!

Next, we examine the relationship between the infant mortality rate and the fertility rate - the basis for the argument put forth in the annual letter. Instead of looking at it at a per-country basis, we average the rates across all the countries -

```
# Create data frame containing the average infant mortality and fertility rates across the world over time
# The missing values are removed
world_avg <- data.frame(infmort = apply(infmort, 2, function (x) { mean(x, na.rm = T) }),
fertility = apply(fertility, 2, function (x) { mean(x, na.rm = T) }))
# Plot the average infant mortality rate vs. the fertility rate
ggplot(data = world_avg, aes(x = infmort, y = fertility)) +
geom_point(size = 3) +
geom_line(lwd = 1.2) +
ggtitle("Infant Mortality Rate vs. Fertility Rate (Worldwide Average)") +
xlab("Infant Mortality Rate") +
ylab("Fertility Rate") +
theme_bw()
```

The plot shows that there's clearly a strong linear relationship between the two factors. As the mortality rate goes up, so does the fertility rate as women, on average, have more children to increase the chances of at least one of them surviving. This was the argument put forth by Melinda Gates and it clearly holds true.

Let's examine how strong this relationship really is. To test this, we build a linear regression model to predict the fertility rate as a factor of the mortality rate.

```
# Create a linear model of fertility rate as a function of the mortality rate
md <- lm(world_avg$fertility ~ world_avg$infmort)
summary(md)
```

```
Call:
lm(formula = world_avg$fertility ~ world_avg$infmort)
Residuals:
Min 1Q Median 3Q Max
-0.2992 -0.1227 -0.0280 0.1111 0.2405
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.9117891 0.0590677 32.37 <2e-16 ***
world_avg$infmort 0.0367284 0.0009115 40.30 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1505 on 49 degrees of freedom
Multiple R-squared: 0.9707, Adjusted R-squared: 0.9701
F-statistic: 1624 on 1 and 49 DF, p-value: < 2.2e-16
```

The model has an R^2 of 0.97, which means that the model is able to explain almost all the variability in the predicted variable (fertility) using just one factor (mortality rate).

This is a plot of the predicted values over the original graph -

To take the examination further, I also downloaded data on per capita healthcare expenditure. I wanted to see if there's a relationship between the amount spent on healthcare and infant mortality rates.

Let's first import the data into R -

```
# Import healthcare expenditure per capita data (note that the filename has been changed)
# From: http://data.worldbank.org/indicator/SH.XPD.PCAP/countries
healthxp <- read.csv('../assets/healthexp.csv', skip=2)
row.names(healthxp) <- healthxp[,1]
healthxp <- healthxp[,-c(1,2,3,4,56,57,58)]
```

Now let's plot the per capita healthcare expenditure against the infant mortality rate -

```
# Health care expenditure per capita vs. infant mortality rate (United States)
health.us <- data.frame(healthxp = as.numeric(as.vector(healthxp['United States',35:51])),
infmort = as.numeric(as.vector(infmort['United States',35:51])))
ggplot(data = health.us, aes(x = healthxp, y = infmort)) +
geom_point(size = 3) +
geom_line(lwd = 1.2) +
ggtitle("Healthcare Expenditure per capita vs. Infant Mortality Rate (US)") +
xlab("Healthcare Expenditure per capita (current US$)") +
ylab("Infant Mortality Rate") +
theme_bw()
```

Interestingly, we see a somewhat strong linear relationship between the two variables. Is it that we are spending our way to better odds at survival? Not so much. While these variables are **correlated**, **causality** is not established. There are likely other factors at play - advances in medicine, better sanitation, access to healthcare to name a few.

Just for comparison, let's plot the same graph for the United Kingdom -

```
# Health care expenditure per capita vs. infant mortality rate (United Kingdom)
health.uk <- data.frame(healthxp = as.numeric(as.vector(healthxp['United Kingdom',35:51])),
infmort = as.numeric(as.vector(infmort['United Kingdom',35:51])))
ggplot(data = health.uk, aes(x = healthxp, y = infmort)) +
geom_point(size = 3) +
geom_line(lwd = 1.2) +
ggtitle("Healthcare Expenditure per capita vs. Infant Mortality Rate (UK)") +
xlab("Healthcare Expenditure per capita (current US$)") +
ylab("Infant Mortality Rate") +
theme_bw()
```

The linear relationship is not so clear in this case.

Lastly, I wanted to examine whether there are countries where the relationship between infant mortality rates and fertility rates does *not* hold. That is, are there countries where a *decrease* in infant mortality rates results in an *increase* in fertility rates?

To do this, we write a function that loops over all the countries in the data set, builds a linear model between the two variables and spits out countries where the slope of the intercept is negative. It also plots the data for those countries for visualization.

```
make_plot <- function (fer, inf, name) {
df <- data.frame(x = inf, y = fer)
p <- ggplot(data = df, aes(x = x, y = y)) +
geom_point(size = 3) +
geom_line(lwd = 1.2) +
xlab("Infant Mortality") +
ylab("Fertility") +
ggtitle(name) +
theme_bw()
return(list(p))
}
plots <- list()
for (x in 1:nrow(fertility)) {
fer <- as.numeric(as.vector(fertility[x,]))
inf <- as.numeric(as.vector(infmort[x,]))
if (!(all(is.na(fer)) | all(is.na(inf)))) {
m = lm(fer ~ inf)
if (m$coefficients[2] < 0) {
print(row.names(fertility)[x])
plots <- c(plots, make_plot(fer, inf, row.names(fertility)[x]))
}
}
}
do.call(grid.arrange, plots)
```

The results are in - there are six countries where this relationship does not hold. These countries are -

```
"Antigua and Barbuda"
"Mali"
"Korea, Dem. Rep."
"Chad"
"Timor-Leste"
"Congo, Dem. Rep."
```

And here are the graphs for these countries -

While I am not familiar with the challenges that these countries face, a common theme between 4 of the 6 countries is that they all have very high infant mortality rates. One plausile explanation could be that the linear relationship between fertility and mortality is only apparent at relatively lower mortality rates.