Health metrics, household income, and historical HOLC grades
Predicting median household income using health metrics
The City Health Dashboard provides data for a variety of health metrics at the census tract level. I sought to predict the median household income of a census tract (from the American Community Survey) from these metrics using a ridge regression model.
Because the distribution of income tends to be log-normal rather than normal, I used the base-10 logarithm of median household income as the target variable.
I trained the ridge regression model on the available health metric data and achieved a model score of 0.95.
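The setup described above can be sketched roughly as follows, assuming scikit-learn's `Ridge` on standardized features. The synthetic data, the `alpha` value, and the train/test split are illustrative assumptions, not the values used in this project. `Ridge.score` returns the coefficient of determination (R²).

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for tract-level health metrics (rows = census tracts).
n_tracts, n_metrics = 500, 5
X = rng.normal(size=(n_tracts, n_metrics))

# Synthetic median household income, log-normal by construction.
true_weights = np.array([0.15, -0.10, 0.05, 0.0, 0.08])
income = 10 ** (4.8 + X @ true_weights + rng.normal(scale=0.05, size=n_tracts))

# Fit on the base-10 logarithm of income rather than on raw dollars.
y = np.log10(income)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# alpha=1.0 is an assumed regularization strength, not a tuned value.
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
model.fit(X_train, y_train)

r2 = model.score(X_test, y_test)                # R^2 on held-out tracts
predicted_income = 10 ** model.predict(X_test)  # convert back to dollars
```

Predictions come out in log space, so raising 10 to the predicted value recovers an income in dollars.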
The model tends to overestimate high incomes. This may be because the highest category of income is given as >$250,000. I truncated these values to $250,000; however, the model may improve if I model the tail of income using the log-normal distribution.
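A minimal sketch of the truncation step, assuming top-coded tracts can be identified by a value at or above $250,000. The helper name `clip_top_coded` and the sample incomes are hypothetical. The returned mask marks which tracts were censored, which is the information a log-normal tail model would need.

```python
import numpy as np

TOP_CODE = 250_000  # highest reported income category, per the text

def clip_top_coded(income):
    """Cap incomes at TOP_CODE and return them with a mask of censored entries."""
    income = np.asarray(income, dtype=float)
    censored = income >= TOP_CODE
    return np.minimum(income, TOP_CODE), censored

# Hypothetical tract incomes, two of them above the top code.
incomes = [48_000, 250_001, 112_500, 300_000]
clipped, censored = clip_top_coded(incomes)

# The regression target is the base-10 logarithm of the clipped income.
log10_target = np.log10(clipped)
```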
The highest weighted feature is income inequality, which is an index of the number of households with income in the top 20% or bottom 20% of the national income distribution. The second-highest weighted feature is dental care, or percentage of adults who visited a dentist or dental clinic in the previous year.
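One way to rank features by weight is to sort the fitted coefficients by absolute value, which is only meaningful when the features share a scale. The sketch below assumes standardized inputs and ranking by magnitude; the feature names and synthetic data are placeholders, and the ranking used for this project may have been computed differently.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Hypothetical feature names; the last two are pure noise in this example.
feature_names = ["income_inequality", "dental_care", "metric_c", "metric_d"]

# Features on deliberately different scales.
X = rng.normal(size=(400, 4)) * np.array([1.0, 10.0, 0.1, 5.0])
y = 0.5 * X[:, 0] + 0.03 * X[:, 1] + rng.normal(scale=0.1, size=400)

# Standardize so coefficient magnitudes are comparable across features.
X_scaled = StandardScaler().fit_transform(X)
model = Ridge(alpha=1.0).fit(X_scaled, y)

# Sort features from largest to smallest absolute coefficient.
order = np.argsort(-np.abs(model.coef_))
ranking = [(feature_names[i], float(model.coef_[i])) for i in order]
```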
I tested the model against unseen data from Colorado. The model achieved a good score of 0.90; however, there appeared to be some systematic error: the model was more likely to overestimate income than to underestimate it.
Indeed, the distribution of median household income for census tracts in Colorado is shifted to the left of the distribution of median household income for census tracts in California.
After adjusting the Colorado incomes by the difference in state median household income, the model is able to achieve a score of 0.93.
This suggests that median household income does depend on health metrics, but there is an additional factor causing income in California to be higher overall.
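The state-level adjustment described above could look something like the sketch below. It assumes an additive shift in dollars equal to the difference between the two state medians; the adjustment could equally be applied in log space. The function name and all dollar figures are placeholders for illustration, not real ACS values.

```python
import numpy as np

def shift_by_state_median(income, source_state_median, target_state_median):
    """Additively shift incomes so the source state's median lines up with the target's."""
    income = np.asarray(income, dtype=float)
    return income + (target_state_median - source_state_median)

# Placeholder numbers for illustration only, not real ACS figures.
colorado_tract_income = np.array([40_000, 65_000, 90_000])
adjusted = shift_by_state_median(
    colorado_tract_income,
    source_state_median=70_000,  # hypothetical Colorado median
    target_state_median=75_000,  # hypothetical California median
)

# Compare model predictions against the log of the adjusted incomes.
log10_adjusted = np.log10(adjusted)
```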
Health metric choropleths
To gain additional insight into these health metrics, I created an interactive choropleth of Los Angeles County where we can visualize each metric. In addition, we can overlay the 1939 Home Owners' Loan Corporation grades, which were based on "infiltration of inharmonious racial groups," which "tend[ed] to lower the levels of land values and to lessen the desirability of residential areas" (Frederick Babcock, Underwriting Manual - via Mapping Inequality).
These redlining practices continue to have ramifications today because restricted homeownership in the 1940s had long-term effects on wealth inequality.
The Panel app is deployed on Heroku and may load slowly or experience errors.
Data sources
Health data: City Health Dashboard, version 14.1.
Household income data: American Community Survey Table B19013, 2018.
Los Angeles 1939 Home Owners' Loan Corporation grade data: Robert K. Nelson, LaDale Winling, Richard Marciano, Nathan Connolly, et al., “Mapping Inequality,” *American Panorama*, ed. Robert K. Nelson and Edward L. Ayers, https://dsl.richmond.edu/panorama/redlining/.
California census tracts: US Census.
Data tools
All data analysis and visualization was performed in Python.
| Analysis | Visualization | Deployment |
| --- | --- | --- |
| NumPy | GeoViews | Heroku |
| Pandas | HoloViews | Docker |
| GeoPandas | Colorcet | |
| scikit-learn | Panel | |
los-angeles-health-and-income is maintained by lealiaxiong.