censusdis.data

Utilities for loading census data.

This module relies on the US Census API, which it wraps in a Pythonic manner.

censusdis.data.GeoFilterType

The type we accept for geographic filters.

Values of this type are passed as the geographic keyword arguments to download().

A filter is either a single value given as a string or, for multiple values, an iterable containing every value the filter allows. For example:

import censusdis.data as ced

from censusdis.states import NJ, NY, CT

# Two different kinds of kwarg for `state=`, both of
# which are of `GeoFilterType`:
df_one_state = ced.download("acs/acs5", 2020, ["NAME"], state=NJ)
df_tri_state = ced.download("acs/acs5", 2020, ["NAME"], state=[NJ, NY, CT])

alias of Optional[Union[str, Iterable[str]]]
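To make the two accepted shapes concrete, here is a minimal sketch of how a value of this type could be collapsed into a single string. The helper `normalize_geo_filter` is hypothetical and is not part of censusdis; the comma-separated form is only one plausible target, and the library may handle filters differently.

```python
from typing import Iterable, Optional, Union

# Mirrors the alias documented above.
GeoFilterType = Optional[Union[str, Iterable[str]]]


def normalize_geo_filter(value: GeoFilterType) -> Optional[str]:
    """Hypothetical helper: collapse a filter into one comma-separated string."""
    if value is None:
        # No filter was supplied for this geography keyword.
        return None
    if isinstance(value, str):
        # A single value such as "34", or the wildcard "*".
        return value
    # Any other iterable is treated as the collection of allowed values.
    return ",".join(value)


# "34", "36", and "09" are the state FIPS codes for NJ, NY, and CT.
print(normalize_geo_filter("34"))                # 34
print(normalize_geo_filter(["34", "36", "09"]))  # 34,36,09
```

Checking for `str` first matters because a string is itself iterable; without that branch, "34" would be split into "3,4".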

censusdis.data.add_inferred_geography(df_data: DataFrame, year: int | None = None) → GeoDataFrame

Infer the geography level of the given dataframe and add geometry to each row for that level.

Parameters:

df_data – A dataframe of variables with one or more columns that can be used to infer what geometry level the rows represent.
year – The year for which to fetch geometries. We need this because they change over time. If None, look for a ‘YEAR’ column in df_data and possibly add different geometries for different years as needed.

Returns:

A GeoDataFrame containing the original data augmented with the appropriate geometry for each row.

censusdis.data.census_table_url(dataset: str, vintage: int | Literal['timeseries'], download_variables: Iterable[str], *, api_key: str | None = None, **kwargs: str | Iterable[str]) → Tuple[str, Mapping[str, str], BoundGeographyPath]

Construct the URL to download data from the U.S. Census API.

Parameters:

dataset – The dataset to download from. For example “acs/acs5”, “dec/pl”, or “timeseries/poverty/saipe/schdist”.
vintage – The vintage to download data for. For most data sets this is an integer year, for example, 2020. But for a timeseries data set, pass the string ‘timeseries’.
download_variables – The census variables to download, for example [“NAME”, “B01001_001E”].
api_key – An optional API key. If you don’t have or don’t use a key, the number of calls you can make will be limited.
kwargs – A specification of the geography that we want data for.

Return type:

The URL, parameters and bound path.
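As a rough illustration of what "URL and parameters" means for an integer vintage, the sketch below assembles a request in the public Census API layout (`https://api.census.gov/data/<vintage>/<dataset>` with `get`, `for`, `in`, and `key` query parameters). The function `sketch_table_url` and its `for_clause` and `in_clause` arguments are hypothetical stand-ins, not the censusdis implementation: the real function derives the geography from its kwargs and also returns a bound geography path, which this sketch omits.

```python
from typing import Dict, Iterable, Optional, Tuple
from urllib.parse import urlencode

CENSUS_API_BASE = "https://api.census.gov/data"


def sketch_table_url(
    dataset: str,
    vintage: int,
    download_variables: Iterable[str],
    *,
    api_key: Optional[str] = None,
    for_clause: str = "state:*",
    in_clause: Optional[str] = None,
) -> Tuple[str, Dict[str, str]]:
    """Hypothetical stand-in: return a base URL and its query parameters."""
    url = f"{CENSUS_API_BASE}/{vintage}/{dataset}"
    params = {"get": ",".join(download_variables), "for": for_clause}
    if in_clause is not None:
        # Restricts the "for" geography to a containing geography.
        params["in"] = in_clause
    if api_key is not None:
        params["key"] = api_key
    return url, params


url, params = sketch_table_url(
    "acs/acs5",
    2020,
    ["NAME", "B01001_001E"],
    for_clause="county:*",
    in_clause="state:34",
)
# safe=":*," keeps the separators readable instead of percent-encoding them.
print(f"{url}?{urlencode(params, safe=':*,')}")
```

Returning the URL and the parameters separately, rather than one joined string, lets the caller hand both to an HTTP client and keeps the API key out of any logged base URL.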

censusdis.data.clip_water(gdf_geo: GeoDataFrame, year: int, minimum_area_sq_meters: int = 10000, sliver_threshold=0.01)

Remove water from the input GeoDataFrame.

Parameters:

gdf_geo – The GeoDataFrame from which we want to remove water
year – The year for which to fetch geometries. We need this because they change over time.
minimum_area_sq_meters – The minimum area, in square meters, of a water area to be removed.

Returns:

A GeoDataFrame with the water areas larger than
the specified threshold removed.

censusdis.data.download

Download data from the US Census API.

This is the main API for downloading US Census data with the censusdis package. There are many examples of how to use this in the demo notebooks provided with the package at https://github.com/vengroff/censusdis/tree/main/notebooks.

A note on variables and groups: you can specify the variables you want to download individually in download_variables, by one or more groups in group, and by the leaves of one or more groups in leaves_of_group, in any combination. These three sources of variables are deduplicated, so you will only get one column for a variable no matter how many times it is specified.

Specifying census geographies: censusdis provides access to many census datasets, each of which can be retrieved at a particular set of geographic grains. To accommodate this, download() takes a set of kwargs that define the geographic level of the returned data. You can check which geographies are available for a particular dataset with geographies().
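The deduplication described above can be pictured with a small sketch. The helper `merge_variables` is hypothetical, and the group contents shown are an illustrative sample rather than a full listing of any census group; the library's own merging logic may differ.

```python
from typing import Iterable, List


def merge_variables(*sources: Iterable[str]) -> List[str]:
    """Hypothetical helper: combine variable lists, keeping first occurrences."""
    merged = [name for source in sources for name in source]
    # dict preserves insertion order, so repeats are dropped but order is kept.
    return list(dict.fromkeys(merged))


explicit = ["NAME", "B01001_001E"]
from_group = ["B01001_001E", "B01001_002E"]  # illustrative sample only

print(merge_variables(explicit, from_group))
# ['NAME', 'B01001_001E', 'B01001_002E']
```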

Parameters:

dataset – The dataset to download from. For example “acs/acs5”, “dec/pl”, or “timeseries/poverty/saipe/schdist”.
vintage – The vintage to download data for. For most data sets this is an integer year, for example, 2020. But for a timeseries data set, pass the string ‘timeseries’.
download_variables – The census variables to download, for example [“NAME”, “B01001_001E”].
group – One or more groups (as defined by the U.S. Census for the data set) whose variable values should be downloaded. These are in addition to any specified in download_variables.
leaves_of_group – One or more groups (as defined by the U.S. Census for the data set) whose leaf variable values should be downloaded. These are in addition to any specified in download_variables or group. See VariableCache.group_leaves() for more details on the semantics of leaves vs. non-leaf group variables.
set_to_nan – A list of values that should be set to NaN. Normally these are special values that the U.S. Census API sometimes returns. If True, then all values in censusdis.values.ALL_SPECIAL_VALUES will be replaced. If False, no replacements will be made.
skip_annotations – If True try to filter out group or leaves_of_group variables that are annotations rather than actual values. See VariableCache.group_variables() for more details. Variable names passed in download_variables are not affected by this flag.
with_geometry – If True, a gpd.GeoDataFrame will be returned and each row will have a geometry that is a cartographic boundary suitable for plotting a map. See https://www.census.gov/geographies/mapping-files/time-series/geo/cartographic-boundary.2020.html for details of the shapefiles that will be downloaded on your behalf to generate these boundaries.
remove_water – If True and if with_geometry=True, will query TIGER for AREAWATER shapefiles and remove water areas from returned geometry.
api_key – An optional API key. If you don’t have or don’t use a key, the number of calls you can make will be limited to 500 per day.
variable_cache – A cache of metadata about variables.
row_keys – An optional set of identifier keys to help merge together requests for more than the census API limit of 50 variables per query. These keys are useful for census datasets such as the Current Population Survey where the geographic identifiers do not uniquely identify each row.
kwargs – A specification of the geography that we want data for. For example, state = “*”, county = “*” will download county-level data for the entire US.

Return type:

A DataFrame containing the requested US Census data.
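The row_keys description mentions a limit of 50 variables per query. One way to picture how a long variable list could be split so that each request stays within that limit, while every batch still carries the key columns needed to merge results afterward, is sketched below. The helper `batch_variables` and the placeholder names `ROW_ID` and `VAR_000` are assumptions for illustration, not censusdis internals.

```python
from typing import Iterator, List, Sequence

MAX_VARIABLES_PER_QUERY = 50  # the limit stated in the row_keys description


def batch_variables(
    variables: Sequence[str],
    row_keys: Sequence[str] = (),
    limit: int = MAX_VARIABLES_PER_QUERY,
) -> Iterator[List[str]]:
    """Hypothetical helper: yield batches that each repeat the key columns."""
    per_batch = limit - len(row_keys)
    if per_batch <= 0:
        raise ValueError("row_keys leave no room for other variables")
    # Keys are added to every batch, so exclude them from the remainder.
    others = [name for name in variables if name not in row_keys]
    for start in range(0, len(others), per_batch):
        yield list(row_keys) + others[start:start + per_batch]


variables = [f"VAR_{i:03d}" for i in range(120)]  # placeholder names
batches = list(batch_variables(variables, row_keys=["ROW_ID"]))
print([len(batch) for batch in batches])  # [50, 50, 23]
```

Because every batch includes the same key columns, the per-batch results can later be joined on those columns even when the geographic identifiers alone do not uniquely identify a row.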

censusdis.data.geographies(dataset: str, vintage: int | Literal['timeseries']) → List[List[str]]

What geographies are supported for a dataset and vintage?

This utility returns a list of the different geography keywords we can use in calls to download() for the given dataset and vintage.

Parameters:

dataset – The dataset to download from. For example “acs/acs5”, “dec/pl”, or “timeseries/poverty/saipe/schdist”.
vintage – The vintage to download data for. For most data sets this is an integer year, for example, 2020. But for a timeseries data set, pass the string ‘timeseries’.

Returns:

A list of lists of geography keywords. Each element
of the outer list is a list of keywords that can be
used together.
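A short sketch of how the list-of-lists return value might be consumed: given a set of keywords you intend to pass to download(), check whether they form one of the supported combinations. `SAMPLE_GEOGRAPHIES` is an invented sample in the documented shape, not actual output for any dataset, and `matching_geography` is a hypothetical helper.

```python
from typing import List, Optional, Sequence

# Invented sample in the documented shape; real results vary by dataset.
SAMPLE_GEOGRAPHIES: List[List[str]] = [
    ["us"],
    ["state"],
    ["state", "county"],
    ["state", "county", "tract"],
]


def matching_geography(
    keywords: Sequence[str], supported: List[List[str]]
) -> Optional[List[str]]:
    """Hypothetical helper: return the supported combination using exactly these keywords."""
    wanted = set(keywords)
    for combo in supported:
        if set(combo) == wanted:
            return combo
    return None


print(matching_geography(["county", "state"], SAMPLE_GEOGRAPHIES))  # ['state', 'county']
print(matching_geography(["county"], SAMPLE_GEOGRAPHIES))           # None
```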

censusdis.data.get_shapefile_path() → Path | None

Get the path to the directory to cache shapefiles.

This is where we will cache shapefiles downloaded when with_geometry=True is passed to download().

Return type:

The path to use for caching shapefiles.

censusdis.data.infer_geo_level(df_data: DataFrame) → str

Infer the geography level based on column names.

Parameters:

df_data –

A dataframe of variables with one or more columns that can be used to infer what geometry level the rows represent.

For example, if the column “STATE” exists, we could infer that the data is on a state-by-state basis. But if there are columns for both “STATE” and “COUNTY”, the data is probably at the county level.

If, on the other hand, there is a “COUNTY” column but not a “STATE” column, then there is some ambiguity. The data probably corresponds to counties, but the same county ID can exist in multiple states, so we will raise a CensusApiException with an error message explaining the situation.

If there is no match, we will also raise an exception. As in the ambiguous case, we raise rather than, for example, returning None so that we can provide an informative error message about the likely cause and what to do about it.

This function is not often called directly, but rather from add_inferred_geography(), which infers the geography level and then adds a geometry column containing the appropriate geography for each row.

Return type:

The name of the geography level.
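The state and county rules described above can be sketched as follows. This is a simplified illustration only: `sketch_infer_geo_level` is hypothetical, it works on a plain list of column names rather than a dataframe, it covers just the two levels discussed here, and it raises ValueError where the library raises CensusApiException.

```python
from typing import Iterable


def sketch_infer_geo_level(columns: Iterable[str]) -> str:
    """Hypothetical, simplified version of the state/county rules."""
    names = set(columns)
    if {"STATE", "COUNTY"} <= names:
        # Both identifiers are present, so rows are most likely counties.
        return "county"
    if "STATE" in names:
        return "state"
    if "COUNTY" in names:
        # Ambiguous: the same county ID can exist in multiple states.
        raise ValueError(
            "Found COUNTY without STATE; add a STATE column to disambiguate."
        )
    raise ValueError("No recognized geography columns (expected STATE and/or COUNTY).")


print(sketch_infer_geo_level(["STATE", "NAME"]))                   # state
print(sketch_infer_geo_level(["STATE", "COUNTY", "B01001_001E"]))  # county
```

Testing the most specific combination first keeps a county-level frame from being misread as state-level just because it also has a “STATE” column.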

censusdis.data.set_shapefile_path(shapefile_path: Path | None) → None

Set the path to the directory to cache shapefiles.

This is where we will cache shapefiles downloaded when with_geometry=True is passed to download().

Parameters:

shapefile_path – The path to use for caching shapefiles.