Getting Started

N.B. If you already have an environment with censusdis installed and prefer to jump straight to complete demo notebooks you can find them here.

Installing `censusdis`

Installation follows the typical model for Python:

pip install censusdis

will install the package in your Python environment.

If you are using a tool like conda or poetry to manage your dependencies, then you can add censusdis the same way you would add any other dependency.

Making Your First Query

Let’s start with a simple example. We will use censusdis.data to load the population and median household income of every state in the country from the 2020 American Community Survey 5-Year Data. In Census terms, the name of the dataset we want to use is “acs/acs5” and the names of the variables we want to load are “B01003_001E” and “B19013_001E”. If you have worked with US Census data before, you may recognize the format of the data set and variable names. If you are new to US Census data, don’t worry. We will talk about how to discover data sets and query metadata on available variables later.

We will import censusdis.data and set things up as described above with the following code:

import censusdis.data as ced

# American Community Survey 5-Year Data
# https://www.census.gov/data/developers/data-sets/acs-5year.html
DATASET = "acs/acs5"

# The year we want data for.
YEAR = 2020

# These are the census variables for total population and median household income.
# For more details, see
#
#     https://api.census.gov/data/2020/acs/acs5/variables.html,
#     https://api.census.gov/data/2020/acs/acs5/variables/B01003_001E.html, and
#     https://api.census.gov/data/2020/acs/acs5/variables/B19013_001E.html.
#
TOTAL_POPULATION_VARIABLE = "B01003_001E"
MEDIAN_HOUSEHOLD_INCOME_VARIABLE = "B19013_001E"

# The variables we are going to query.
VARIABLES = ["NAME", TOTAL_POPULATION_VARIABLE, MEDIAN_HOUSEHOLD_INCOME_VARIABLE]

Once we have done that, we can make the following query to get the data we want:

# Get the value of our variables for every state in the
# year we have chosen.
df_states = ced.download(
    DATASET,
    YEAR,
    VARIABLES,
    state="*",
)

The call to ced.download will construct a URL in the Census API’s preferred format (https://api.census.gov/data/2020/acs/acs5?get=NAME,B01003_001E,B19013_001E&for=state:*), make a request to the Census servers at that URL, parse the JSON that is returned, and turn it into a pandas.DataFrame.
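To make those steps concrete, here is a minimal standard-library sketch of the first and third of them: assembling the URL and reshaping the returned JSON. The helper names build_state_query_url and rows_to_records are hypothetical, and the sample payload is an assumed illustration of a header-row-first response. This is not how censusdis is implemented internally.

```python
import json
from urllib.parse import urlencode

BASE_URL = "https://api.census.gov/data"


def build_state_query_url(dataset, year, variables):
    """Assemble a Census API URL that asks for every state (hypothetical helper)."""
    query = urlencode(
        {"get": ",".join(variables), "for": "state:*"},
        safe=",:*",  # keep the separators readable instead of percent-encoding them
    )
    return f"{BASE_URL}/{year}/{dataset}?{query}"


def rows_to_records(payload):
    """Turn a header-row-first JSON array of arrays into a list of dicts."""
    header, *rows = json.loads(payload)
    return [dict(zip(header, row)) for row in rows]


url = build_state_query_url(
    "acs/acs5", 2020, ["NAME", "B01003_001E", "B19013_001E"]
)

# Assumed sample of a response body: a header row followed by one data row.
sample_payload = (
    '[["NAME","B01003_001E","B19013_001E","state"],'
    '["Pennsylvania","12794885","63627","42"]]'
)
records = rows_to_records(sample_payload)
```

A list of records in this shape can be passed straight to pandas.DataFrame, which is roughly the last step that ced.download performs for us.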

df_states now has the name, population, and median household income of all 50 states, the District of Columbia, and Puerto Rico. The value returned into df_states is:

   STATE                  NAME  B01003_001E  B19013_001E
   42          Pennsylvania     12794885        63627
   06            California     39346023        78672
   54         West Virginia      1807426        48037
   49                  Utah      3151239        74197
   36              New York     19514849        71117
   11  District of Columbia       701974        90842
   02                Alaska       736990        77790
   12               Florida     21216924        57703
   45        South Carolina      5091517        54864
   38          North Dakota       760394        65315
  23                 Maine      1340825        59489
  13               Georgia     10516579        61224
  01               Alabama      4893186        52035
  33         New Hampshire      1355244        77923
  41                Oregon      4176346        65667
  56               Wyoming       581348        65304
  04               Arizona      7174064        61529
  22             Louisiana      4664616        50800
  18               Indiana      6696893        58235
  16                 Idaho      1754367        58915
  09           Connecticut      3570549        79855
  15                Hawaii      1420074        83173
  17              Illinois     12716164        68428
  25         Massachusetts      6873003        84385
  48                 Texas     28635442        63826
  30               Montana      1061705        56539
  31              Nebraska      1923826        63015
  39                  Ohio     11675275        58116
  08              Colorado      5684926        75231
  34            New Jersey      8885418        85245
  24              Maryland      6037624        87063
  51              Virginia      8509358        76398
  50               Vermont       624340        63477
  37        North Carolina     10386227        56642
  05              Arkansas      3011873        49475
  53            Washington      7512465        77006
  20                Kansas      2912619        61091
  40              Oklahoma      3949342        53840
  55             Wisconsin      5806975        63293
  28           Mississippi      2981835        46511
  29              Missouri      6124160        57290
  26              Michigan      9973907        59234
  44          Rhode Island      1057798        70305
  27             Minnesota      5600166        73382
  19                  Iowa      3150011        61836
  35            New Mexico      2097021        51243
  32                Nevada      3030281        62043
  10              Delaware       967679        69110
  72           Puerto Rico      3255642        21058
  21              Kentucky      4461952        52238
  46          South Dakota       879336        59896
  47             Tennessee      6772268        54833

Notice that the data frame has four columns, STATE, NAME, B01003_001E, and B19013_001E. NAME, B01003_001E, and B19013_001E are what we asked for. But what about the first column, STATE? That is additional data that indicates the state of each row, specified in terms of a FIPS code. State FIPS codes are two-digit strings that the US Census uses to identify states.

censusdis returns FIPS codes like these to you because they tend to be very useful in cases where you might want to join this data with other data, either from other censusdis queries or from other sources. Joining on a FIPS code is usually more reliable and less error-prone than joining on a string like the name of a state. One data set might use the name “N. Carolina” and another one might use “North Carolina”, and a third might use “NC”. FIPS codes help us avoid confusion or the need to keep mapping between them.

The states are in no particular order other than what the underlying US Census API returned to us. If order matters to you, you can sort the dataframe by whatever column(s) you like, such as by the name of the state, or by the population.
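As a small sketch of that sorting, the following uses pandas sort_values on a hand-built data frame containing four of the rows shown above. With a real query result you would call sort_values directly on the data frame returned by ced.download.

```python
import pandas as pd

# Four rows copied from the state results, standing in for the full data frame.
df_states = pd.DataFrame(
    {
        "STATE": ["42", "06", "54", "49"],
        "NAME": ["Pennsylvania", "California", "West Virginia", "Utah"],
        "B01003_001E": [12794885, 39346023, 1807426, 3151239],
        "B19013_001E": [63627, 78672, 48037, 74197],
    }
)

# Alphabetical by state name.
df_by_name = df_states.sort_values("NAME").reset_index(drop=True)

# Largest population first.
df_by_population = df_states.sort_values(
    "B01003_001E", ascending=False
).reset_index(drop=True)
```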

Filtering Queries

Our first query got the population and median income of every state. Sometimes, especially when we are working at a smaller level of granularity like a county, we don’t want the data for the entire country. We might want it just for the counties of a particular state, say New Jersey. In that case, we can specify this with additional arguments to ced.download. For example:

from censusdis import states

df_counties = ced.download(
    DATASET,
    YEAR,
    VARIABLES,
    state=states.NJ,
    county="*",
)

This code is almost exactly the same as the last query except that we changed state="*" to state=states.NJ and county="*". So instead of asking for the data aggregated at the state level across all states, we are asking for only the data from the state of New Jersey, aggregated at the county level. The value returned into df_counties is:

   STATE COUNTY                           NAME  B01003_001E  B19013_001E
   34    003      Bergen County, New Jersey       931275       104623
   34    009    Cape May County, New Jersey        92701        72385
   34    015  Gloucester County, New Jersey       291745        89056
   34    021      Mercer County, New Jersey       368085        83306
   34    027      Morris County, New Jersey       492715       117298
   34    033       Salem County, New Jersey        62754        64234
   34    039       Union County, New Jersey       555208        82644
   34    001    Atlantic County, New Jersey       264650        63680
   34    005  Burlington County, New Jersey       446301        90329
   34    007      Camden County, New Jersey       506721        70957
  34    011  Cumberland County, New Jersey       150085        55709
  34    013       Essex County, New Jersey       798698        63959
  34    017      Hudson County, New Jersey       671923        75062
  34    019   Hunterdon County, New Jersey       125063       117858
  34    023   Middlesex County, New Jersey       825015        91731
  34    025    Monmouth County, New Jersey       620821       103523
  34    029       Ocean County, New Jersey       602018        72679
  34    031     Passaic County, New Jersey       502763        73562
  34    035    Somerset County, New Jersey       330151       116510
  34    037      Sussex County, New Jersey       140996        96222
  34    041      Warren County, New Jersey       105730        83497

Note that in this case, we received both the FIPS code for the state (34 for New Jersey) and the FIPS code for the county within the state, along with the name of the county, its population, and its median household income. County FIPS codes are reused from one state to the next, so if we wanted to join this with data from elsewhere we would need to join on both the state FIPS code and the county FIPS code. Note also that joining by NAME could get really messy. Is “Bergen CNTY, NJ” the same as “Bergen County, New Jersey”?
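The sketch below illustrates why both keys matter. It joins two New Jersey rows from the results above against a second table, df_other, which is entirely hypothetical: its label column is made up, and it includes county 003 in state 01 (Alabama) to show what happens when the county code alone is used as the key.

```python
import pandas as pd

# Two rows from the New Jersey county query.
df_counties = pd.DataFrame(
    {
        "STATE": ["34", "34"],
        "COUNTY": ["003", "013"],
        "NAME": ["Bergen County, New Jersey", "Essex County, New Jersey"],
        "B01003_001E": [931275, 798698],
    }
)

# Hypothetical data from another source. County code 003 appears in two
# different states; the label values are invented for illustration.
df_other = pd.DataFrame(
    {
        "STATE": ["34", "01", "34"],
        "COUNTY": ["003", "003", "013"],
        "label": ["nj-003", "al-003", "nj-013"],
    }
)

# Joining on both keys pairs each county with exactly one row.
df_joined = df_counties.merge(df_other, on=["STATE", "COUNTY"], how="left")

# Joining on COUNTY alone also matches Bergen County with Alabama's county 003.
df_wrong = df_counties.merge(df_other, on="COUNTY", how="left")
```

Here df_joined has one row per county, while df_wrong picks up an extra, incorrect row for county 003.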

Since the first two queries we did both went to the same underlying “acs/acs5” dataset, the numbers they contain should add up. We can verify this by seeing if the total population of all the counties in New Jersey in the second query is equal to the population of the state from the first query with:

df_counties["B01003_001E"].sum()

Sure enough, this sum is 8885418, exactly what we saw in the New Jersey row of df_states.

Additional Geographies

Depending on what dataset we are querying, data may be available at a wide variety of geographic levels. Some, like region, are very large. In the US Census data model, there are only four regions. Their populations can be queried with:

df_region = ced.download(
    DATASET,
    YEAR,
    VARIABLES,
    region="*",
)

The result is:

  REGION              NAME  B01003_001E  B19013_001E
    2    Midwest Region     68219726        62054
    3      South Region    124605822        59816
    4       West Region     77726849        72464
    1  Northeast Region     56016911        72698

On the other hand, we can go down to very small geographies called block groups. These are small neighborhoods of just a few blocks, each of which is typically home to somewhere between hundreds and thousands of people. Here is a block group query for Essex County, NJ:

COUNTY_ESSEX_NJ = "013"  # See county query above.

df_bg = ced.download(
    DATASET,
    YEAR,
    VARIABLES,
    state=states.NJ,
    county=COUNTY_ESSEX_NJ,
    block_group="*",
)

The results of this are much larger than our previous dataframes. There are 672 block groups in the county. The results (leaving out a bunch of rows in the middle) look like:

   STATE  COUNTY   TRACT BLOCK_GROUP                                                          NAME B01003_001E  B19013_001E
    34    013  000100           2      Block Group 2, Census Tract 1, Essex County, New Jersey         1826        31250
    34    013  000200           2      Block Group 2, Census Tract 2, Essex County, New Jersey         2156        39944
    34    013  000400           1      Block Group 1, Census Tract 4, Essex County, New Jersey         2121        41736
    34    013  000600           1      Block Group 1, Census Tract 6, Essex County, New Jersey         2363        44705
    34    013  000700           2      Block Group 2, Census Tract 7, Essex County, New Jersey         2321        32382
    34    013  000800           1      Block Group 1, Census Tract 8, Essex County, New Jersey         1811        78100
    34    013  000900           1      Block Group 1, Census Tract 9, Essex County, New Jersey         1066        16125
    34    013  001000           1     Block Group 1, Census Tract 10, Essex County, New Jersey         1305   -666666666
    34    013  001100           2     Block Group 2, Census Tract 11, Essex County, New Jersey         1660        69650
    34    013  001400           2     Block Group 2, Census Tract 14, Essex County, New Jersey         1434        54516

...

  34    013  004700           2     Block Group 2, Census Tract 47, Essex County, New Jersey         1373        53125
  34    013  004700           3     Block Group 3, Census Tract 47, Essex County, New Jersey         1028   -666666666
  34    013  004700           4     Block Group 4, Census Tract 47, Essex County, New Jersey         1253        53368
  34    013  004700           5     Block Group 5, Census Tract 47, Essex County, New Jersey          796        49097
  34    013  004801           1  Block Group 1, Census Tract 48.01, Essex County, New Jersey         1850        37619
  34    013  004801           2  Block Group 2, Census Tract 48.01, Essex County, New Jersey          530        58705
  34    013  004802           1  Block Group 1, Census Tract 48.02, Essex County, New Jersey         2130        11634
  34    013  004802           2  Block Group 2, Census Tract 48.02, Essex County, New Jersey          694        19919
  34    013  004802           3  Block Group 3, Census Tract 48.02, Essex County, New Jersey         1102        11713
  34    013  004900           1     Block Group 1, Census Tract 49, Essex County, New Jersey          885        28362

An interesting thing happened here. We asked for all the block groups in the county. censusdis was smart enough to realize that block groups are nested inside geographies called census tracts, which are in turn nested inside counties. In order to give us enough identifiers to unambiguously differentiate the rows, the TRACT column was added even though we did not mention it in our query. As you can see in the results, the block group identifier is typically a single-digit number, so many rows use the same value, but it is unique within a tract. Each row is a unique combination of state, county, census tract, and block group.
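To see why all four identifiers are needed, here is a small sketch that concatenates them into a single key for three of the rows above. The helper block_group_key is hypothetical and is not part of censusdis; it simply illustrates that the block group value repeats while the combined key does not.

```python
def block_group_key(state, county, tract, block_group):
    """Concatenate the four identifiers into one string key (hypothetical helper)."""
    return f"{state}{county}{tract}{block_group}"


# Identifier columns from three rows of the block group results.
rows = [
    ("34", "013", "000100", "2"),
    ("34", "013", "000200", "2"),
    ("34", "013", "004700", "2"),
]

keys = [block_group_key(*row) for row in rows]

# The block group value "2" repeats, but the combined keys are all distinct.
block_groups = {row[3] for row in rows}
```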

One other interesting thing happened. There are two rows where the value -666666666 was returned in the column B19013_001E. This is a special value that indicates that there was not enough data in the survey to estimate the value accurately. In many cases we will want to drop these rows or treat them in a special way in our analysis.
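Both treatments can be sketched with plain pandas. The data frame below is hand-built from three of the Census Tract 47 rows shown above, and the constant name INSUFFICIENT_DATA is our own label for the sentinel, not a censusdis identifier.

```python
import pandas as pd

# Our own name for the sentinel seen in the block group results.
INSUFFICIENT_DATA = -666666666

# Three rows from Census Tract 47.
df_bg = pd.DataFrame(
    {
        "TRACT": ["004700", "004700", "004700"],
        "BLOCK_GROUP": ["2", "3", "4"],
        "B19013_001E": [53125, -666666666, 53368],
    }
)

is_sentinel = df_bg["B19013_001E"] == INSUFFICIENT_DATA

# Option 1: drop the rows that carry the sentinel.
df_dropped = df_bg[~is_sentinel]

# Option 2: keep the rows but mark the value as missing (NaN), so that
# aggregates such as mean() skip it instead of being skewed by it.
df_masked = df_bg.assign(B19013_001E=df_bg["B19013_001E"].mask(is_sentinel))
mean_income = df_masked["B19013_001E"].mean()
```

Which option is appropriate depends on the analysis. Dropping rows also discards their valid population values, while masking keeps them.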

If you want to find out what all the supported geographies for a data set are, you can check a US Census page like https://api.census.gov/data/2020/dec/pl/geography.html, which is normally linked from the page describing the dataset (https://api.census.gov/data/2020/dec/pl.html in this case).

censusdis queries the same geography data that powers these pages so that it can tell you what options are available and how, in Python, to specify them as arguments. You can look at this information with the following code:

import censusdis.geography as cgeo

specs = cgeo.geo_path_snake_specs(DATASET, YEAR)

specs will now contain:

{'010': ['us'],
 '020': ['region'],
 '030': ['division'],
 '040': ['state'],
 '050': ['state', 'county'],
 '060': ['state', 'county', 'county_subdivision'],
 '067': ['state', 'county', 'county_subdivision', 'subminor_civil_division'],
 '070': ['state', 'county', 'county_subdivision', 'place_remainder_or_part'],
 '140': ['state', 'county', 'tract'],
 '150': ['state', 'county', 'tract', 'block_group'],

 ...

 '330': ['combined_statistical_area'],

 ...

 '550': ['state',
         'congressional_district',
         'american_indian_area_alaska_native_area_hawaiian_home_land_or_part'],
 '610': ['state', 'state_legislative_district_upper_chamber'],
 '612': ['state', 'state_legislative_district_upper_chamber', 'county_or_part'],
 '620': ['state', 'state_legislative_district_lower_chamber'],
 '622': ['state', 'state_legislative_district_lower_chamber', 'county_or_part'],
 '795': ['state', 'public_use_microdata_area'],
 '860': ['zip_code_tabulation_area'],
 '950': ['state', 'school_district_elementary'],
 '960': ['state', 'school_district_secondary'],
 '970': ['state', 'school_district_unified']}

mirroring what was on the web site, but in a form that additional code can more easily digest. Note that the queries we performed so far corresponded to geographies '040', '050', '020', and '150'. In all cases, censusdis chose the least specific geography that could be matched against the keyword arguments we provided.
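One simplified way to picture that matching is shown below. The function match_geography is a hypothetical model, not the code censusdis actually runs: it assumes that "least specific" means the shortest path containing every keyword we supplied, and it works on a subset of the specs above.

```python
# A subset of the specs shown above.
specs = {
    "020": ["region"],
    "040": ["state"],
    "050": ["state", "county"],
    "140": ["state", "county", "tract"],
    "150": ["state", "county", "tract", "block_group"],
}


def match_geography(specs, **kwargs):
    """Return the (code, path) of the shortest path containing every keyword.

    Hypothetical model of the matching described in the text.
    """
    candidates = [
        (code, path)
        for code, path in specs.items()
        if set(kwargs) <= set(path)
    ]
    if not candidates:
        raise ValueError(f"No geography matches {sorted(kwargs)}")
    # Fewest path components is treated as least specific.
    return min(candidates, key=lambda item: len(item[1]))


state_match = match_geography(specs, state="*")
county_match = match_geography(specs, state="34", county="*")
block_group_match = match_geography(
    specs, state="34", county="013", block_group="*"
)
```

Under this model, the block group query matches '150' even though we never passed a tract, which is consistent with the TRACT column appearing in those results.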

We can query any of these geographies we like, using the argument naming conventions returned in specs above. For example:

df_csa = ced.download(
    DATASET,
    YEAR,
    VARIABLES,
    combined_statistical_area="*"
)

which produces the results:

    COMBINED_STATISTICAL_AREA                                                     NAME  B01003_001E  B19013_001E
                       104                               Albany-Schenectady, NY CSA      1169019        69275
                       106                   Albuquerque-Santa Fe-Las Vegas, NM CSA      1156289        55499
                       107                               Altoona-Huntingdon, PA CSA       167640        51497
                       108                            Amarillo-Pampa-Borger, TX CSA       308297        56120
                       118                          Appleton-Oshkosh-Neenah, WI CSA       407758        65838
                       120                         Asheville-Marion-Brevard, NC CSA       538785        54033
                       122  Atlanta--Athens-Clarke County--Sandy Springs, GA-AL CSA      6770764        68938
                       140                                  Bend-Prineville, OR CSA       215482        67851
                       142                      Birmingham-Hoover-Talladega, AL CSA      1315561        56576
                       144                              Bloomington-Bedford, IN CSA       213724        53695

...

                     539                                   Tupelo-Corinth, MS CSA       202909        47893
                     540                               Tyler-Jacksonville, TX CSA       282525        57327
                     544                             Victoria-Port Lavaca, TX CSA       121092        58325
                     545                        Virginia Beach-Norfolk, VA-NC CSA      1858942        67884
                     548       Washington-Baltimore-Arlington, DC-MD-VA-WV-PA CSA      9781219        95810
                     554            Wausau-Stevens Point-Wisconsin Rapids, WI CSA       306886        59919
                     556                                 Wichita-Winfield, KS CSA       674758        57808
                     558                          Williamsport-Lock Haven, PA CSA       152563        53990
                     566                             Youngstown-Warren, OH-PA CSA       640629        48251
                     517                              Spencer-Spirit Lake, IA CSA        33398        55762

for the 175 CSAs in the US.

More Variables

So far, we have only been looking at the variables NAME, B01003_001E, and B19013_001E from the acs/acs5 dataset. But there are thousands of other interesting variables in various data sets you might want to look at.

In many data sets, variables are organized into groups. censusdis has APIs to explore groups of related variables and load the ones you are most interested in. There is an example in the SoMa DIS Demo notebook, which looks at racial demographics and computes diversity and integration metrics at the census tract level.

The SoMa DIS Demo notebook only touches on groups of variables. For a more rigorous analysis of groups and the variables they contain, see the Exploring Variables notebook.

Adding Geography and Plotting

All of the US Census data we queried above was organized by geography. Often it is interesting to plot this data. But in order to do so, we need data on the shapes and locations of the geographical areas corresponding to each geography represented in the data. Often this means loading the geometry separately and then joining it together with the data. With censusdis, we don’t have to do this. Instead, we can ask it to include geometry with the data it returns by adding the with_geometry=True flag. Here is an example that follows up on the examples in the previous section:

gdf_counties = ced.download(
    DATASET,
    YEAR,
    VARIABLES,
    state="*",
    county="*",
    with_geometry=True
)

In this example, aside from adding with_geometry=True, we passed state="*" and county="*". This means we want data for all the counties in all the states in the country.

If we look at the return value, it looks like:

   STATE COUNTY                      NAME  B01003_001E  B19013_001E                                           geometry
      01    001   Autauga County, Alabama        55639        57982  POLYGON ((-86.92120 32.65754, -86.92035 32.658...
      01    003   Baldwin County, Alabama       218289        61756  POLYGON ((-88.02858 30.22676, -88.02399 30.230...
      01    005   Barbour County, Alabama        25026        34990  POLYGON ((-85.74803 31.61918, -85.74544 31.618...
      01    007      Bibb County, Alabama        22374        51721  POLYGON ((-87.42194 33.00338, -87.33177 33.005...
      01    009    Blount County, Alabama        57755        48922  POLYGON ((-86.96336 33.85822, -86.95967 33.857...
      01    011   Bullock County, Alabama        10173        33866  POLYGON ((-85.99926 32.25018, -85.98655 32.250...
      01    013    Butler County, Alabama        19726        44850  POLYGON ((-86.90894 31.96167, -86.88668 31.961...
      01    015   Calhoun County, Alabama       114324        50128  POLYGON ((-86.14623 33.70218, -86.14577 33.704...
      21    135    Lewis County, Kentucky        13345        29844  POLYGON ((-83.64418 38.63783, -83.64048 38.648...
      21    137  Lincoln County, Kentucky        24493        42231  POLYGON ((-84.85792 37.48407, -84.85755 37.508...

   ...

      27    153    Todd County, Minnesota        24603        54502  POLYGON ((-95.15557 46.36888, -95.15013 46.368...

It contains results for all 3,221 counties in the country. But in addition to the columns we explicitly asked for and the two that identify the state and county of each row, there is a final column called geometry that represents the geometry of the county. The entire data frame is actually a GeoDataFrame, which is an extension of the Pandas DataFrame you are probably used to.

Now we can plot data in our geo-data frame as follows:

import censusdis.maps as cem

ax = cem.plot_us(
    gdf_counties,
    MEDIAN_HOUSEHOLD_INCOME_VARIABLE,
    cmap="autumn",
    legend=True,
    vmin=0.0,
    vmax=150_000,
    figsize=(12, 6)
)

ax.set_title(f"{YEAR} Median Household Income by County")

ax.axis("off")

The resulting plot looks like:

[Figure: map titled "2020 Median Household Income by County"]

We used cem.plot_us because it does some nice things for us, like relocating Alaska, Hawaii, and Puerto Rico from their actual longitude and latitude to locations that allow us to plot the map more compactly. In addition to doing this relocation, cem.plot_us takes the same *args and **kwargs that Matplotlib normally takes.

Additional Examples in Notebooks

More advanced examples, including additional maps and visualizations, are presented in the Demo Notebooks.

Census API Key (Optional Initially)

The Census API that censusdis calls recommends the use of an API key. But chances are you were able to make it through all of the examples above without one. This is because until you are doing a large number of queries, you don’t need one. But once you are doing a large number of queries or putting censusdis into a production pipeline, you should return to this section and obtain a key.

Luckily, the key is free and easy to get. Once you have a key in a file in the right place on your machine, censusdis will automatically use it with every call to the Census API.

To obtain a key, visit this page. The key will be sent to you by email. It will be a long string of numbers and letters. Make a directory called .censusdis in your home directory, then inside that directory create a text file called api_key.txt. The file should have just one line, and you should paste the key you got via email into it. Once this is done, all censusdis calls to the Census API will use this key.
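If you would rather create the file from Python than by hand, a minimal sketch follows. The function write_api_key is a hypothetical convenience and is not part of censusdis. Its home parameter exists only so the target directory can be overridden, for example when trying it out in a scratch directory.

```python
from pathlib import Path


def write_api_key(key, home=None):
    """Save the key to <home>/.censusdis/api_key.txt and return the file path.

    Hypothetical helper; home defaults to the current user's home directory.
    """
    home = Path.home() if home is None else Path(home)
    key_dir = home / ".censusdis"
    key_dir.mkdir(parents=True, exist_ok=True)

    key_file = key_dir / "api_key.txt"
    # The file holds a single line containing only the key.
    key_file.write_text(key.strip())
    return key_file
```

Calling write_api_key("paste-your-key-here") with the key from your email writes it to the location described above.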

Help and Issues

If you have questions or want to report a bug or feature request, please contact us by opening an issue at https://github.com/vengroff/censusdis/issues.