Nationwide Diversity and Integration

This notebook pulls together the main features of the censusdis and divintseg projects to demonstrate how quickly and easily we can:

1. Identify US Census data variables.
   - We will use metadata calls to identify 16 race and ethnicity variables.
2. Download the variables for the entire country at the block group level.
   - There are almost 240,000 census block groups in the 50 states and DC.
   - We will download all 16 variables for all of them.
3. Compute diversity and integration metrics at the census tract level.
   - There are over 84,000 census tracts in the 50 states and DC.
   - Integration measures how diverse the individual block groups within a tract are, as opposed to the tract as a whole.
4. Attach geography bounds to all the census tracts.
   - This process will transparently download and cache census map data on a state-by-state basis.
   - censusdis will determine how to do this and manage the process. No user input is required.
5. Download state boundaries.
   - This is simply to make our final maps look better.
6. Plot maps showing nationwide diversity and integration.
   - Data is plotted at the census tract level.
   - Alaska and Hawaii are automatically relocated on the map to make better use of the available space.
   - The state boundaries are plotted to make the maps more readable.
7. Determine which regions of the country have the highest concentrations of the most and least diverse and integrated tracts.
   - We compute the top and bottom 10% of each metric.
   - We scatter plot them on maps.

Even in a fresh Python environment with no data cached locally, this entire process runs in just a few minutes on a fairly standard laptop with a high-speed internet connection.

For more details, please refer to the README files and other documentation in the `censusdis <https://github.com/vengroff/censusdis>`__ and `divintseg <https://github.com/vengroff/divintseg>`__ projects.

0. Imports and configuration

[1]:

# So we can run from within the censusdis project and find the packages we need.
# This appends the parent of the current directory to the import path.
import os
import sys

sys.path.append(os.path.abspath(os.path.join(os.path.curdir, os.path.pardir)))

[2]:

import censusdis.data as ced
import censusdis.maps as cem
from censusdis.states import ALL_STATES_AND_DC

import divintseg as dis

[3]:

# Fill in your own key here, or leave as is
# if you are doing low volume queries that
# will not hit the census server usage limits.
CENSUS_API_KEY = None

1. Identify the dataset and variables we will use

[4]:

DATASET = "acs/acs5"
YEAR = 2020

[5]:

GROUP = "B03002"
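The download step requests the leaves of this group, that is, the variables that have no sub-categories beneath them, so that nobody is counted twice through a subtotal. The sketch below illustrates that idea on a small, hypothetical variable tree. The names `TOY_TREE` and `leaves` are invented for this illustration and are not part of censusdis, and the toy tree is much smaller than the real one for B03002.

```python
# Hypothetical, simplified variable tree: each key maps to its child variables.
# These names are illustrative only and are not real census variable IDs.
TOY_TREE = {
    "total": ["not_hispanic", "hispanic"],
    "not_hispanic": ["nh_white", "nh_black", "nh_two_or_more"],
    "nh_two_or_more": ["nh_two_incl_other", "nh_two_excl_other"],
    "hispanic": ["h_white", "h_black"],
}


def leaves(tree, root):
    """Return the leaf variables under root, in depth-first order."""
    children = tree.get(root, [])
    if not children:
        # No children: this variable is itself a leaf.
        return [root]
    result = []
    for child in children:
        result.extend(leaves(tree, child))
    return result


toy_leaves = leaves(TOY_TREE, "total")
print(toy_leaves)
```

Subtotals such as `total` and `nh_two_or_more` are excluded, so summing the leaf counts gives the population exactly once.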

2. Download the data

[6]:

%%time

df_bg = ced.download(
    DATASET,
    YEAR,
    leaves_of_group=GROUP,
    state=ALL_STATES_AND_DC,
    block_group="*",
    api_key=CENSUS_API_KEY,
)

CPU times: user 1.91 s, sys: 158 ms, total: 2.07 s
Wall time: 19.2 s

[7]:

df_bg.shape

[7]:

(239780, 20)

3. Compute diversity and integration

[8]:

%%time

df_di = dis.di(
    df_bg,
    by=["STATE", "COUNTY", "TRACT"],
    over="BLOCK_GROUP",
).reset_index()

CPU times: user 1min, sys: 96.1 ms, total: 1min
Wall time: 1min

[9]:

df_di.shape

[9]:

(84414, 5)
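To give a feel for what the two metrics capture, here is a minimal pure-Python sketch. It assumes the probability-based definitions described in the divintseg documentation: diversity is the chance that two people drawn at random from an area belong to different groups, and integration is the population-weighted average of that quantity over the block groups inside a tract. The function names and the toy tracts are hypothetical, and this is not the divintseg implementation.

```python
def diversity(counts):
    """Chance that two people drawn at random (with replacement) differ in group."""
    total = sum(counts)
    if total == 0:
        return 0.0  # convention chosen for this sketch only
    return 1.0 - sum((c / total) ** 2 for c in counts)


def tract_diversity(block_groups):
    """Diversity of the tract as a whole, ignoring block group boundaries."""
    group_totals = [sum(col) for col in zip(*block_groups)]
    return diversity(group_totals)


def integration(block_groups):
    """Population-weighted average of the diversity of each block group."""
    total = sum(sum(bg) for bg in block_groups)
    if total == 0:
        return 0.0  # convention chosen for this sketch only
    return sum((sum(bg) / total) * diversity(bg) for bg in block_groups)


# Two hypothetical tracts, each with two block groups and two population groups.
segregated = [[100, 0], [0, 100]]
mixed = [[50, 50], [50, 50]]

for name, tract in [("segregated", segregated), ("mixed", mixed)]:
    print(name, tract_diversity(tract), integration(tract))
```

Both toy tracts have the same diversity (0.5), but only the mixed tract is integrated (0.5 versus 0.0), which is why the notebook reports the two metrics separately.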

4. Infer the geographies so we can plot

[10]:

%%time

gdf_di = ced.add_inferred_geography(df_di, YEAR)

CPU times: user 11.1 s, sys: 160 ms, total: 11.3 s
Wall time: 11.4 s

5. Get state boundaries for the plots

[11]:

%%time

gdf_states = cem.ShapeReader(year=YEAR).read_cb_shapefile("us", "state")

gdf_states = gdf_states[gdf_states.STATEFP.isin(ALL_STATES_AND_DC)]

CPU times: user 297 ms, sys: 4.79 ms, total: 302 ms
Wall time: 301 ms

6. Plot diversity and integration

[12]:

%%time

for col in ["diversity", "integration"]:
    ax = cem.plot_us(
        gdf_di[gdf_di.geometry.notna()],
        col,
        cmap="hot",
        legend=True,
        figsize=(12, 6),
        vmin=0.0,
        vmax=1.0,
    )

    ax = cem.plot_us_boundary(gdf_states, edgecolor="gray", linewidth=0.5, ax=ax)

    ax.set_title(col.title())

    ax.tick_params(
        left=False,
        right=False,
        bottom=False,
        labelleft=False,
        labelbottom=False,
    )

CPU times: user 18.5 s, sys: 4.25 s, total: 22.7 s
Wall time: 17.9 s

../_images/nb_Nationwide_Diversity_and_Integration_20_1.png

../_images/nb_Nationwide_Diversity_and_Integration_20_2.png

7. What are the top and bottom 10%?

[13]:

%%time

n = len(gdf_di.index)

gdf_di_top_d = gdf_di.nlargest(n // 10, "diversity")
gdf_di_bottom_d = gdf_di.nsmallest(n // 10, "diversity")

gdf_di_top_i = gdf_di.nlargest(n // 10, "integration")
gdf_di_bottom_i = gdf_di.nsmallest(n // 10, "integration")

gdf_di_centroids = gdf_di.copy()

gdf_random_sample = gdf_di.sample(n=n // 10, random_state=17)

CPU times: user 36.6 ms, sys: 3.66 ms, total: 40.3 ms
Wall time: 40.4 ms
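The selection logic in the cell above can be sketched without pandas. The example below uses hypothetical tract names and scores; `heapq.nlargest` and `heapq.nsmallest` stand in for the DataFrame methods of the same names. Note that `n // 10` is integer division, so with 84,414 tracts each subset holds 8,441 rows, assuming no rows were dropped along the way.

```python
import heapq
import random

# Hypothetical tract scores; in the notebook these come from gdf_di["diversity"].
# (i * 37) % 25 is a permutation of 0..24, so all 25 scores are distinct.
scores = {f"tract_{i:02d}": ((i * 37) % 25) / 25 for i in range(25)}

n = len(scores)
k = n // 10  # integer division: 25 // 10 == 2

# Highest and lowest k scores, as (name, score) pairs.
top = heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])
bottom = heapq.nsmallest(k, scores.items(), key=lambda kv: kv[1])

# A seeded random sample of the same size, used as a baseline.
sample = random.Random(17).sample(sorted(scores), k)

print([name for name, _ in top])
print([name for name, _ in bottom])
print(len(sample))
```

Fixing the seed, as the notebook does with `random_state=17`, keeps the baseline sample the same from run to run.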

[14]:

%%time

# Replace each tract's geometry with its centroid point
# to make the plot more readable.

gdf_di_top_d.geometry = cem.geographic_centroids(gdf_di_top_d)
gdf_di_bottom_d.geometry = cem.geographic_centroids(gdf_di_bottom_d)

gdf_di_top_i.geometry = cem.geographic_centroids(gdf_di_top_i)
gdf_di_bottom_i.geometry = cem.geographic_centroids(gdf_di_bottom_i)

gdf_di_centroids.geometry = cem.geographic_centroids(gdf_di_centroids)

gdf_random_sample.geometry = cem.geographic_centroids(gdf_random_sample)

CPU times: user 1.48 s, sys: 77.6 ms, total: 1.56 s
Wall time: 1.55 s
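For intuition about what replacing a polygon with its centroid means, here is a small planar sketch using the standard shoelace formula. It is not how `cem.geographic_centroids` is implemented (that function works on geographic data and its internals are not shown in this notebook); the function name `polygon_centroid` and the test shapes are hypothetical.

```python
def polygon_centroid(vertices):
    """Area-weighted centroid of a simple polygon given as (x, y) vertices."""
    area2 = 0.0  # twice the signed area
    cx = 0.0
    cy = 0.0
    n = len(vertices)
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]  # wrap around to close the ring
        cross = x0 * y1 - x1 * y0
        area2 += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    if area2 == 0:
        raise ValueError("degenerate polygon with zero area")
    # Centroid = sum / (6 * area) = sum / (3 * area2).
    return cx / (3.0 * area2), cy / (3.0 * area2)


rectangle = [(0, 0), (4, 0), (4, 2), (0, 2)]
triangle = [(0, 0), (3, 0), (0, 3)]

print(polygon_centroid(rectangle))
print(polygon_centroid(triangle))
```

Plotting one point per tract instead of the full polygon keeps small urban tracts visible next to very large rural ones.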

[15]:

def plot_top_or_bottom(
    gdf,
    metric_description: str,
    top_or_bottom: str = "top",
):
    """
    A helper plot routine.

    Covers most of the repetitive boilerplate that we don't
    want to deal with typing more than once.
    """

    # Plot configuration.

    top_color = "#0343df"
    bottom_color = "#E50000"
    random_color = "#74216f"

    markersize = 1

    state_linewidth = 0.5
    state_edgecolor = "gray"

    all_color = "lightgray"

    figsize = (12, 6)

    # Actual plotting.

    # State bounds
    ax = cem.plot_us_boundary(
        gdf_states,
        edgecolor=state_edgecolor,
        linewidth=state_linewidth,
        figsize=figsize,
    )

    # All tracts in a background color
    cem.plot_us(
        gdf_di_centroids[gdf_di_centroids.geometry.notna()],
        markersize=markersize,
        color=all_color,
        ax=ax,
    )

    # The sample we want to emphasize
    cem.plot_us(
        gdf[gdf.geometry.notna()],
        markersize=markersize,
        color=top_color
        if top_or_bottom == "top"
        else (bottom_color if top_or_bottom == "bottom" else random_color),
        ax=ax,
    )

    if top_or_bottom == "top":
        ax.set_title(
            f"Top 10% Most {metric_description.title()} Census Tracts (n = {len(gdf.index)})"
        )
    elif top_or_bottom == "bottom":
        ax.set_title(
            f"Bottom 10% Least {metric_description.title()} Census Tracts (n = {len(gdf.index)})"
        )
    else:
        ax.set_title(f"Random 10% Sample of Census Tracts (n = {len(gdf.index)})")

    ax.tick_params(
        left=False,
        right=False,
        bottom=False,
        labelleft=False,
        labelbottom=False,
    )

Random 10% Sample

This first plot shows what the spatial distribution of a random 10% sample of tracts looks like. It serves as a baseline: we can then see whether the top and bottom 10% of each metric follow a similar pattern or are instead concentrated in particular regions.
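The notebook makes this comparison visually. One possible numeric companion, sketched here on hypothetical data, is to compare how many of a region's tracts land in the top 10% with how many a uniformly random 10% sample would be expected to contain. The region labels, scores, and the `ratio` measure are all assumptions made for illustration and do not appear in the notebook.

```python
from collections import Counter

# Hypothetical (region, score) pairs standing in for tracts: 20 per region.
tracts = (
    [("A", s) for s in (0.95, 0.90, 0.85)]
    + [("A", 0.30)] * 17
    + [("B", 0.80)]
    + [("B", 0.20)] * 19
)

k = len(tracts) // 10  # size of the top 10%: 40 // 10 == 4
top = sorted(tracts, key=lambda t: t[1], reverse=True)[:k]

all_counts = Counter(region for region, _ in tracts)
top_counts = Counter(region for region, _ in top)

# Ratio of observed top-decile tracts to the number expected if the
# top decile were spread uniformly at random across all tracts.
ratio = {
    region: top_counts[region] / (all_counts[region] * k / len(tracts))
    for region in all_counts
}
print(ratio)
```

A ratio above 1 means a region holds more of the top tracts than a random sample would suggest, and a ratio below 1 means fewer.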

[16]:

%%time
plot_top_or_bottom(gdf_random_sample, "", top_or_bottom="random")

CPU times: user 3.17 s, sys: 2.03 s, total: 5.2 s
Wall time: 2.31 s

../_images/nb_Nationwide_Diversity_and_Integration_26_1.png

[17]:

%%time
plot_top_or_bottom(gdf_di_top_d, "Diverse")

CPU times: user 3.01 s, sys: 1.93 s, total: 4.94 s
Wall time: 2.2 s

../_images/nb_Nationwide_Diversity_and_Integration_27_1.png

[18]:

%%time
plot_top_or_bottom(gdf_di_bottom_d, "Diverse", top_or_bottom="bottom")

CPU times: user 3.09 s, sys: 1.97 s, total: 5.06 s
Wall time: 2.33 s

../_images/nb_Nationwide_Diversity_and_Integration_28_1.png

[19]:

%%time
plot_top_or_bottom(gdf_di_top_i, "Integrated")

CPU times: user 3.14 s, sys: 1.9 s, total: 5.04 s
Wall time: 2.19 s

../_images/nb_Nationwide_Diversity_and_Integration_29_1.png

[20]:

%%time
plot_top_or_bottom(gdf_di_bottom_i, "Integrated", top_or_bottom="bottom")

CPU times: user 3.12 s, sys: 2.01 s, total: 5.12 s
Wall time: 2.32 s

../_images/nb_Nationwide_Diversity_and_Integration_30_1.png
