censusdis.data

Utilities for loading census data.

This module relies on the US Census API, which it wraps in a pythonic manner.

class censusdis.data.ContainedWithin(area_threshold: float = 0.8, **kwargs: str | Iterable[str])

Bases: object

A representation of a containing geography, used to query other geographies that are contained within it.

download(dataset: str, vintage: int | Literal['timeseries'], download_variables: str | Iterable[str] | None = None, *, group: str | Iterable[str] | None = None, leaves_of_group: str | Iterable[str] | None = None, set_to_nan: bool | Iterable[int] = True, skip_annotations: bool = True, with_geometry: bool = False, remove_water: bool = False, api_key: str | None = None, variable_cache: VariableCache | None = None, row_keys: str | Iterable[str] | None = None, **kwargs: str | Iterable[str]) DataFrame | GeoDataFrame

Download data for geographies contained within a containing geography.

Parameters:
  • dataset – The dataset to download from. For example “acs/acs5”, “dec/pl”, or “timeseries/poverty/saipe/schdist”. There are symbolic names for datasets, like ACS5 for “acs/acs5”, in censusdis.datasets.

  • vintage – The vintage to download data for. For most data sets this is an integer year, for example, 2020. But for a timeseries data set, pass the string ‘timeseries’.

  • download_variables – The census variables to download, for example [“NAME”, “B01001_001E”].

  • group – One or more groups (as defined by the U.S. Census for the data set) whose variable values should be downloaded. These are in addition to any specified in download_variables.

  • leaves_of_group – One or more groups (as defined by the U.S. Census for the data set) whose leaf variable values should be downloaded. These are in addition to any specified in download_variables or group. See VariableCache.group_leaves() for more details on the semantics of leaf vs. non-leaf group variables.

  • set_to_nan – A list of values that should be set to NaN. Normally these are special values that the U.S. Census API sometimes returns. If True, then all values in censusdis.values.ALL_SPECIAL_VALUES will be replaced. If False, no replacements will be made.

  • skip_annotations – If True, try to filter out group or leaves_of_group variables that are annotations rather than actual values. See VariableCache.group_variables() for more details. Variable names passed in download_variables are not affected by this flag.

  • with_geometry – If True, a gpd.GeoDataFrame will be returned and each row will have a geometry that is a cartographic boundary suitable for plotting a map. See https://www.census.gov/geographies/mapping-files/time-series/geo/cartographic-boundary.2020.html for details of the shapefiles that will be downloaded on your behalf to generate these boundaries.

  • remove_water – If True and if with_geometry=True, will query TIGER for AREAWATER shapefiles and remove water areas from returned geometry.

  • api_key – An optional API key. If you don’t have or don’t use a key, the number of calls you can make will be limited to 500 per day.

  • variable_cache – A cache of metadata about variables.

  • row_keys – An optional set of identifier keys to help merge together requests for more than the census API limit of 50 variables per query. These keys are useful for census datasets such as the Current Population Survey where the geographic identifiers do not uniquely identify each row.

  • kwargs – A specification of the geometry that we want data for. For example, state = “*”, county = “*” will download county-level data for the entire US.

Return type:

A DataFrame or gpd.GeoDataFrame containing the requested US Census data.
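
The three documented settings of set_to_nan can be pictured with a small sketch on a plain list. The sentinel values below are assumptions chosen for illustration, not the contents of censusdis.values.ALL_SPECIAL_VALUES, and replace_special_values is a hypothetical helper rather than part of censusdis.

```python
import math
from typing import Iterable, List, Union

# Assumed sentinels for illustration only; the real set is defined in
# censusdis.values.ALL_SPECIAL_VALUES.
EXAMPLE_SPECIAL_VALUES = (-666666666, -999999999)


def replace_special_values(
    values: Iterable[float],
    set_to_nan: Union[bool, Iterable[int]] = True,
) -> List[float]:
    """Mimic the documented set_to_nan options (hypothetical helper)."""
    if set_to_nan is False:
        # False: make no replacements.
        return list(values)
    # True: use the full (here, assumed) set; otherwise use the caller's list.
    specials = set(EXAMPLE_SPECIAL_VALUES if set_to_nan is True else set_to_nan)
    return [math.nan if v in specials else v for v in values]


median_income = [52000, -666666666, 61000]
cleaned = replace_special_values(median_income)
```

Passing an explicit list, as in replace_special_values(values, [-666666666]), restricts the replacement to those values only.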

censusdis.data.GeoFilterType

The type we accept for geographic filters.

They are used for the values of kwargs to download().

These filters are either single values as a string, or, if multivalued, then an iterable containing all the values allowed by the filter. For example:

import censusdis.data as ced

from censusdis.states import NJ, NY, CT

# Two different kinds of kwarg for `state=`, both of
# which are of `GeoFilterType`:
df_one_state = ced.download("acs/acs5", 2020, ["NAME"], state=NJ)
df_tri_state = ced.download("acs/acs5", 2020, ["NAME"], state=[NJ, NY, CT])

alias of Optional[Union[str, Iterable[str]]]
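
Code that receives such a filter usually has to treat a bare string as a single value rather than as an iterable of characters. A minimal sketch of that normalization follows; as_value_list is a hypothetical helper written for this example, not a censusdis function.

```python
from typing import Iterable, List, Optional, Union

# Mirrors the alias documented above.
GeoFilter = Optional[Union[str, Iterable[str]]]


def as_value_list(geo_filter: GeoFilter) -> Optional[List[str]]:
    """Return None, or the list of values the filter allows (hypothetical helper)."""
    if geo_filter is None:
        return None
    if isinstance(geo_filter, str):
        # A bare string is one value, not a sequence of characters.
        return [geo_filter]
    return list(geo_filter)
```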

censusdis.data.add_inferred_geography(df_data: DataFrame, year: int | None = None) GeoDataFrame

Infer the geography level of the given dataframe.

Add geometry to each row for the inferred level.

Parameters:
  • df_data – A dataframe of variables with one or more columns that can be used to infer what geometry level the rows represent.

  • year – The year for which to fetch geometries. We need this because they change over time. If None, look for a ‘YEAR’ column in df_data and possibly add different geometries for different years as needed.

Returns:

A geo data frame containing the original data augmented with the appropriate geometry for each row.
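
One way to picture inference from column names is to pick the most specific known level whose columns are all present. The sketch below is only an illustration of that idea: the level list is an assumed, deliberately short sample and infer_level is a hypothetical helper, so it should not be read as the logic censusdis actually uses.

```python
from typing import Iterable, Optional, Tuple

# Assumed sample of geography levels, ordered from most to least specific.
EXAMPLE_LEVELS: Tuple[Tuple[str, ...], ...] = (
    ("STATE", "COUNTY", "TRACT"),
    ("STATE", "COUNTY"),
    ("STATE",),
)


def infer_level(columns: Iterable[str]) -> Optional[Tuple[str, ...]]:
    """Return the most specific sample level fully covered by the columns."""
    present = set(columns)
    for level in EXAMPLE_LEVELS:
        if present.issuperset(level):
            return level
    # No known level matches these columns.
    return None
```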

censusdis.data.census_table_url(dataset: str, vintage: int | Literal['timeseries'], download_variables: Iterable[str], *, api_key: str | None = None, **kwargs: str | Iterable[str]) Tuple[str, Mapping[str, str], BoundGeographyPath]

Construct the URL to download data from the U.S. Census API.

Parameters:
  • dataset – The dataset to download from. For example “acs/acs5”, “dec/pl”, or “timeseries/poverty/saipe/schdist”.

  • vintage – The vintage to download data for. For most data sets this is an integer year, for example, 2020. But for a timeseries data set, pass the string ‘timeseries’.

  • download_variables – The census variables to download, for example [“NAME”, “B01001_001E”].

  • api_key – An optional API key. If you don’t have or don’t use a key, the number of calls you can make will be limited.

  • kwargs – A specification of the geometry that we want data for.

Return type:

The URL, parameters and bound path.
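
For orientation, the sketch below assembles a U.S. Census API URL and its query parameters by hand, using the public API's get, for and in parameters. It is not the implementation of census_table_url: example_table_url is a hypothetical helper, it takes the for and in clauses directly instead of geography kwargs, and it omits the bound path from the return value.

```python
from typing import Dict, Optional, Sequence, Tuple, Union
from urllib.parse import urlencode

CENSUS_API_ROOT = "https://api.census.gov/data"


def example_table_url(
    dataset: str,
    vintage: Union[int, str],
    variables: Sequence[str],
    for_clause: str,
    in_clause: Optional[str] = None,
) -> Tuple[str, Dict[str, str]]:
    """Build a base URL and query parameters (hypothetical helper)."""
    if vintage == "timeseries":
        # Timeseries dataset names already begin with "timeseries/".
        url = f"{CENSUS_API_ROOT}/{dataset}"
    else:
        url = f"{CENSUS_API_ROOT}/{vintage}/{dataset}"
    params = {"get": ",".join(variables), "for": for_clause}
    if in_clause is not None:
        params["in"] = in_clause
    return url, params


url, params = example_table_url(
    "acs/acs5", 2020, ["NAME", "B01001_001E"], "county:*", "state:34"
)
# Keep commas, colons and asterisks readable in the query string.
full_url = f"{url}?{urlencode(params, safe=',:*')}"
```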

censusdis.data.contained_within(area_threshold: float = 0.8, **kwargs: str | Iterable[str]) ContainedWithin

Construct a representation of a containing geography, used to query other geographies that are contained within it.

Parameters:
  • area_threshold – What fraction of the area of other geographies must be contained in our geography to be included.

  • kwargs – A specification of the containing geography. Data will be downloaded for geographies that are contained within it. For example, state = “NJ”, place = “01960” will specify the city of Asbury Park, NJ.
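
The role of area_threshold can be illustrated with axis-aligned rectangles standing in for real boundaries. This is a sketch of the idea only: Box, fraction_inside and is_contained are hypothetical names, and whether the library compares with "at least" or "more than" the threshold is an assumption here.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned rectangle standing in for a real geometry."""

    min_x: float
    min_y: float
    max_x: float
    max_y: float

    @property
    def area(self) -> float:
        return (self.max_x - self.min_x) * (self.max_y - self.min_y)


def fraction_inside(inner: Box, outer: Box) -> float:
    """Fraction of inner's area that overlaps outer."""
    width = min(inner.max_x, outer.max_x) - max(inner.min_x, outer.min_x)
    height = min(inner.max_y, outer.max_y) - max(inner.min_y, outer.min_y)
    if width <= 0 or height <= 0:
        return 0.0
    return (width * height) / inner.area


def is_contained(inner: Box, outer: Box, area_threshold: float = 0.8) -> bool:
    # Assumption: "at least the threshold" counts as contained.
    return fraction_inside(inner, outer) >= area_threshold


place = Box(0, 0, 10, 10)
tract_a = Box(5, 0, 11, 4)  # 20 of its 24 area units lie inside the place
tract_b = Box(9, 0, 13, 4)  # 4 of its 16 area units lie inside the place
```

With the default threshold of 0.8, tract_a would be kept and tract_b dropped; lowering the threshold to 0.2 would keep both.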

censusdis.data.download(dataset: str, vintage: int | Literal['timeseries'], download_variables: str | Iterable[str] | None = None, *, group: str | Iterable[str] | None = None, leaves_of_group: str | Iterable[str] | None = None, set_to_nan: bool | Iterable[int] = True, skip_annotations: bool = True, with_geometry: bool = False, remove_water: bool = False, download_contained_within: Dict[str, str | Iterable[str]] | None = None, area_threshold: float = 0.8, api_key: str | None = None, variable_cache: VariableCache | None = None, row_keys: str | Iterable[str] | None = None, **kwargs: str | Iterable[str]) DataFrame | GeoDataFrame

Download data from the US Census API.

This is the main API for downloading US Census data with the censusdis package. There are many examples of how to use this in the demo notebooks provided with the package at https://github.com/vengroff/censusdis/tree/main/notebooks.

A note on variables and groups: there are three ways to specify the variables you want to download: individually in download_variables, by one or more groups in group, or by the leaves of one or more groups in leaves_of_group. These three sources of variables are deduplicated, so you will only get one column for a variable no matter how many times it is specified.
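
The deduplication described above can be sketched as an order-preserving merge. The variable lists here are abbreviated and illustrative, and merge_variable_sources is a hypothetical helper, not the function censusdis uses internally.

```python
from typing import Iterable, List


def merge_variable_sources(*sources: Iterable[str]) -> List[str]:
    """Concatenate variable lists, keeping the first occurrence of each name."""
    merged: List[str] = []
    for source in sources:
        merged.extend(source)
    # dict.fromkeys preserves insertion order and drops later duplicates.
    return list(dict.fromkeys(merged))


download_variables = ["NAME", "B01001_001E"]
# Abbreviated, illustrative expansions of group and leaves_of_group.
group_variables = ["B01001_001E", "B01001_002E", "B01001_003E"]
leaf_variables = ["B01001_003E"]

columns = merge_variable_sources(download_variables, group_variables, leaf_variables)
```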

Specifying census geographies: censusdis provides access to many census datasets, each of which can be retrieved at a particular set of geographic grains. To accommodate this, download() takes a set of kwargs to define the geographic level of the returned data. You can check which geographies are available for a particular dataset with geographies().

Parameters:
  • dataset – The dataset to download from. For example “acs/acs5”, “dec/pl”, or “timeseries/poverty/saipe/schdist”. There are symbolic names for datasets, like ACS5 for “acs/acs5”, in censusdis.datasets.

  • vintage – The vintage to download data for. For most data sets this is an integer year, for example, 2020. But for a timeseries data set, pass the string ‘timeseries’.

  • download_variables – The census variables to download, for example [“NAME”, “B01001_001E”].

  • group – One or more groups (as defined by the U.S. Census for the data set) whose variable values should be downloaded. These are in addition to any specified in download_variables.

  • leaves_of_group – One or more groups (as defined by the U.S. Census for the data set) whose leaf variable values should be downloaded. These are in addition to any specified in download_variables or group. See VariableCache.group_leaves() for more details on the semantics of leaf vs. non-leaf group variables.

  • set_to_nan – A list of values that should be set to NaN. Normally these are special values that the U.S. Census API sometimes returns. If True, then all values in censusdis.values.ALL_SPECIAL_VALUES will be replaced. If False, no replacements will be made.

  • skip_annotations – If True, try to filter out group or leaves_of_group variables that are annotations rather than actual values. See VariableCache.group_variables() for more details. Variable names passed in download_variables are not affected by this flag.

  • with_geometry – If True, a gpd.GeoDataFrame will be returned and each row will have a geometry that is a cartographic boundary suitable for plotting a map. See https://www.census.gov/geographies/mapping-files/time-series/geo/cartographic-boundary.2020.html for details of the shapefiles that will be downloaded on your behalf to generate these boundaries.

  • remove_water – If True and if with_geometry=True, will query TIGER for AREAWATER shapefiles and remove water areas from returned geometry.

  • download_contained_within – A dictionary specifying the geography or geographies that our results should be filtered down to be contained within.

  • area_threshold – What fraction of the area of other geographies must be contained in our geography to be included. Ignored if download_contained_within is None.

  • api_key – An optional API key. If you don’t have or don’t use a key, the number of calls you can make will be limited to 500 per day.

  • variable_cache – A cache of metadata about variables.

  • row_keys – An optional set of identifier keys to help merge together requests for more than the census API limit of 50 variables per query. These keys are useful for census datasets such as the Current Population Survey where the geographic identifiers do not uniquely identify each row.

  • kwargs – A specification of the geometry that we want data for. For example, state = “*”, county = “*” will download county-level data for the entire US.

Return type:

A DataFrame or gpd.GeoDataFrame containing the requested US Census data.
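
The row_keys parameter matters because a request for more than 50 variables has to be split into several queries whose results are then joined row by row. The sketch below shows that split-and-merge idea on plain dictionaries. Both helpers are hypothetical and written for this example; the sketch does not reproduce how censusdis performs the merge.

```python
from typing import Dict, List, Sequence, Tuple

MAX_VARIABLES_PER_QUERY = 50  # the API limit cited above


def batch_variables(
    variables: Sequence[str], batch_size: int = MAX_VARIABLES_PER_QUERY
) -> List[List[str]]:
    """Split a long variable list into query-sized batches."""
    return [
        list(variables[i : i + batch_size])
        for i in range(0, len(variables), batch_size)
    ]


def merge_on_keys(
    batches: Sequence[Sequence[Dict[str, object]]], keys: Sequence[str]
) -> List[Dict[str, object]]:
    """Join rows from several query results that share the same key values."""
    merged: Dict[Tuple[object, ...], Dict[str, object]] = {}
    for rows in batches:
        for row in rows:
            key = tuple(row[k] for k in keys)
            merged.setdefault(key, {}).update(row)
    return list(merged.values())


# 120 made-up variable names need three queries of 50, 50 and 20.
batches = batch_variables([f"VAR_{i:03d}" for i in range(120)])

first = [{"STATE": "34", "COUNTY": "017", "VAR_000": 1}]
second = [{"STATE": "34", "COUNTY": "017", "VAR_050": 2}]
combined = merge_on_keys([first, second], keys=["STATE", "COUNTY"])
```

If the key columns did not uniquely identify each row, rows from different records would be folded together in this merge, which is the situation the extra row_keys are meant to prevent.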

censusdis.data.geographies(dataset: str, vintage: int | Literal['timeseries']) List[List[str]]

Determine what geographies are supported for a dataset and vintage.

This utility gives us a list of the different geography keywords we can use in calls to download() for the given dataset and vintage.

Parameters:
  • dataset – The dataset to download from. For example “acs/acs5”, “dec/pl”, or “timeseries/poverty/saipe/schdist”.

  • vintage – The vintage to download data for. For most data sets this is an integer year, for example, 2020. But for a timeseries data set, pass the string ‘timeseries’.

Returns:

A list of lists of geography keywords. Each element of the outer list is a list of keywords that can be used together.
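
A returned list of lists can be used to check a set of geography kwargs before calling download(). In the sketch below, example_geographies is a made-up sample and not real output for any dataset, is_supported is a hypothetical helper, and the rule that a combination must be matched in full is an assumption.

```python
from typing import Iterable, List

# Made-up sample with the same shape as the documented return value.
example_geographies: List[List[str]] = [
    ["us"],
    ["state"],
    ["state", "county"],
    ["state", "county", "tract"],
]


def is_supported(keywords: Iterable[str], supported: List[List[str]]) -> bool:
    """Check whether the keywords exactly match one supported combination."""
    wanted = set(keywords)
    return any(wanted == set(combination) for combination in supported)
```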

censusdis.data.geography_names(dataset: str, vintage: int | Literal['timeseries'], **kwargs: str | Iterable[str]) DataFrame

Get the name of a specific geography.

The arguments are a subset of those to download(). This function is designed to make it easy to fetch the name of a geography when we know the FIPS code but want a human-readable name or label for display.

Parameters:
  • dataset – The dataset to download from. For example censusdis.datasets.ACS5.

  • vintage – The vintage to download data for. For example, 2020.

  • kwargs – A specification of the geometry that we want data for. For example, state = “34”, county = “017” will download the name of Hudson County, New Jersey.

Returns:

  • A dataframe with columns specifying the geography and one for the name.

  • All column names will be in ALL CAPS.
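
A result of this shape lends itself to a FIPS-to-name lookup for display. The sketch below works on plain row dictionaries, such as those produced by a dataframe's to_dict("records"). It assumes the columns are named STATE, COUNTY and NAME, and names_by_fips is a hypothetical helper, not part of censusdis.

```python
from typing import Dict, Iterable, Mapping


def names_by_fips(rows: Iterable[Mapping[str, str]]) -> Dict[str, str]:
    """Map concatenated state and county FIPS codes to display names."""
    return {row["STATE"] + row["COUNTY"]: row["NAME"] for row in rows}


# Hand-written sample row; assumed column names are in ALL CAPS.
example_rows = [
    {"STATE": "34", "COUNTY": "017", "NAME": "Hudson County, New Jersey"},
]
labels = names_by_fips(example_rows)
```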