censusdis.data

Utilities for loading census data.

This module relies on the US Census API, which it wraps in a Pythonic manner.

censusdis.data.GeoFilterType

The type we accept for geographic filters.

They are used for the values of kwargs to download().

Each filter is either a single value given as a string or, if multivalued, an iterable containing all the values the filter allows. For example:

import censusdis.data as ced

from censusdis.states import STATE_NJ, STATE_NY, STATE_CT

# Two different kinds of kwarg for `state=`, both of
# which are of `GeoFilterType`:
df_one_state = ced.download("acs/acs5", 2020, ["NAME"], state=STATE_NJ)
df_tri_state = ced.download("acs/acs5", 2020, ["NAME"], state=[STATE_NJ, STATE_NY, STATE_CT])

alias of Optional[Union[str, Iterable[str]]]
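The single-value versus multi-value distinction can be illustrated with a small normalizing helper. This is only a sketch: `normalize_geo_filter` is a hypothetical name and is not part of censusdis. It shows why a bare string must be special-cased before being treated as an iterable.

```python
from typing import Iterable, List, Optional, Union

# Same shape as the alias documented above.
GeoFilterType = Optional[Union[str, Iterable[str]]]


def normalize_geo_filter(value: GeoFilterType) -> Optional[List[str]]:
    """Hypothetical helper: turn a filter into a list of strings, or None."""
    if value is None:
        return None
    if isinstance(value, str):
        # A bare string is one value, not an iterable of characters.
        return [value]
    return list(value)


print(normalize_geo_filter("34"))                # ['34']
print(normalize_geo_filter(["34", "36", "09"]))  # ['34', '36', '09']
print(normalize_geo_filter(None))                # None
```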

censusdis.data.add_inferred_geography(df_data: DataFrame, year: Optional[int] = None) → GeoDataFrame

Infer the geography level of the given dataframe and add geometry to each row for that level.

Parameters

df_data – A dataframe of variables with one or more columns that can be used to infer what geometry level the rows represent.
year – The year for which to fetch geometries. We need this because they change over time. If None, look for a ‘YEAR’ column in df_data and possibly add different geometries for different years as needed.

Returns

A geo data frame containing the original data augmented with the appropriate geometry for each row.

censusdis.data.census_table_url(dataset: str, vintage: Union[int, Literal['timeseries']], download_variables: Iterable[str], *, api_key: Optional[str] = None, **kwargs: Union[str, Iterable[str]]) → Tuple[str, Mapping[str, str], BoundGeographyPath]

Construct the URL to download data from the U.S. Census API.

Parameters

dataset – The dataset to download from. For example “acs/acs5”, “dec/pl”, or “timeseries/poverty/saipe/schdist”.
vintage – The vintage to download data for. For most data sets this is an integer year, for example, 2020. But for a timeseries data set, pass the string ‘timeseries’.
download_variables – The census variables to download, for example [“NAME”, “B01001_001E”].
api_key – An optional API key. If you don’t have or don’t use a key, the number of calls you can make will be limited.
kwargs – A specification of the geometry that we want data for.

Return type

The URL, parameters and bound path.
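To make the shape of the result more concrete, here is a rough standard-library sketch of how a Census API URL and its query parameters fit together. It is not the censusdis implementation: `sketch_census_url` is a hypothetical name, it omits the bound geography path that the real function returns, and it assumes the last geography keyword becomes the `for` clause while earlier ones form the `in` clause.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.census.gov/data"


def sketch_census_url(dataset, vintage, variables, *, api_key=None, **geo):
    """Hypothetical sketch: return (url, params) for a Census API query."""
    # Timeseries data sets have no year component in the path.
    if vintage == "timeseries":
        url = f"{BASE_URL}/{dataset}"
    else:
        url = f"{BASE_URL}/{vintage}/{dataset}"

    def clause(name, value):
        # A filter is a single string or an iterable of strings.
        joined = value if isinstance(value, str) else ",".join(value)
        return f"{name}:{joined}"

    params = {"get": ",".join(variables)}
    clauses = [clause(name, value) for name, value in geo.items()]
    if clauses:
        # Assumption: the innermost (last) geography is the `for` clause.
        params["for"] = clauses[-1]
        if len(clauses) > 1:
            params["in"] = " ".join(clauses[:-1])
    if api_key is not None:
        params["key"] = api_key
    return url, params


url, params = sketch_census_url(
    "acs/acs5", 2020, ["NAME", "B01001_001E"], state="34", county="*"
)
full_url = url + "?" + urlencode(params, safe=":,*")
print(full_url)
```

Returning the parameters separately from the base URL, as the documented function does, lets the caller hand both to an HTTP client and leave the encoding to it.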

censusdis.data.download(dataset: str, vintage: Union[int, Literal['timeseries']], download_variables: Optional[Union[str, Iterable[str]]] = None, *, group: Optional[Union[str, Iterable[str]]] = None, leaves_of_group: Optional[Union[str, Iterable[str]]] = None, set_to_nan: Optional[Union[bool, Iterable[int]]] = None, skip_annotations: bool = True, with_geometry: bool = False, api_key: Optional[str] = None, variable_cache: Optional[VariableCache] = None, **kwargs: Union[str, Iterable[str]]) → Union[DataFrame, GeoDataFrame]

Download data from the US Census API.

This is the main API for downloading US Census data with the censusdis package. There are many examples of how to use this in the demo notebooks provided with the package at https://github.com/vengroff/censusdis/tree/main/notebooks.

A note on variables and groups: there are multiple ways to specify the variables you want to download: individually in download_variables, by one or more groups in group, or by the leaves of one or more groups in leaves_of_group. These three sources of variables are deduplicated, so you will get only one column for a variable no matter how many times it is specified.
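The deduplication described in this note can be pictured with a short sketch. It is an illustration only: `merge_variables` is a hypothetical helper, the group and leaf lists are hard-coded placeholders rather than real metadata lookups, and the order in which censusdis actually combines the three sources is not specified here.

```python
def merge_variables(download_variables, group_variables, leaf_variables):
    """Hypothetical sketch: combine three variable sources without duplicates."""
    merged = [*download_variables, *group_variables, *leaf_variables]
    # dict.fromkeys keeps the first occurrence of each name, in order.
    return list(dict.fromkeys(merged))


# Placeholder lists standing in for what a group and its leaves expand to.
explicit = ["NAME", "B01001_001E"]
from_group = ["B01001_001E", "B01001_002E", "B01001_003E"]
from_leaves = ["B01001_003E"]

columns = merge_variables(explicit, from_group, from_leaves)
print(columns)  # ['NAME', 'B01001_001E', 'B01001_002E', 'B01001_003E']
```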

Parameters

dataset – The dataset to download from. For example “acs/acs5”, “dec/pl”, or “timeseries/poverty/saipe/schdist”.
vintage – The vintage to download data for. For most data sets this is an integer year, for example, 2020. But for a timeseries data set, pass the string ‘timeseries’.
download_variables – The census variables to download, for example [“NAME”, “B01001_001E”].
group – One or more groups (as defined by the U.S. Census for the data set) whose variable values should be downloaded. These are in addition to any specified in download_variables.
leaves_of_group – One or more groups (as defined by the U.S. Census for the data set) whose leaf variable values should be downloaded. These are in addition to any specified in download_variables or group. See VariableCache.group_leaves() for more details on the semantics of leaves vs. non-leaf group variables.
set_to_nan – If not None, this specifies special values that should be replaced with NaN. Normally censusdis.values.ALL_SPECIAL_VALUES or a subset thereof. The default is None so that we never change values without the caller explicitly asking us to. Setting to True is equivalent to censusdis.values.ALL_SPECIAL_VALUES.
skip_annotations – If True try to filter out group or leaves_of_group variables that are annotations rather than actual values. See VariableCache.group_variables() for more details. Variable names passed in download_variables are not affected by this flag.
with_geometry – If True, a gpd.GeoDataFrame will be returned and each row will have a geometry that is a cartographic boundary suitable for plotting a map. See https://www.census.gov/geographies/mapping-files/time-series/geo/cartographic-boundary.2020.html for details of the shapefiles that will be downloaded on your behalf to generate these boundaries.
api_key – An optional API key. If you don’t have or don’t use a key, the number of calls you can make will be limited.
variable_cache – A cache of metadata about variables.
kwargs – A specification of the geometry that we want data for.

Return type

A DataFrame containing the requested US Census data.
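The set_to_nan behavior described above can be sketched without the library. This is a minimal illustration, not the censusdis implementation: `replace_special` is a hypothetical helper working on a plain list, and `ASSUMED_SPECIAL_VALUES` is a small stand-in for censusdis.values.ALL_SPECIAL_VALUES, whose actual contents should be checked in the package.

```python
import math

# Assumption: a stand-in for censusdis.values.ALL_SPECIAL_VALUES.
ASSUMED_SPECIAL_VALUES = (-666666666, -999999999)


def replace_special(values, set_to_nan=None):
    """Hypothetical sketch: replace special sentinel values with NaN."""
    if set_to_nan is None or set_to_nan is False:
        # Default: never change values unless the caller asks.
        return list(values)
    if set_to_nan is True:
        special = set(ASSUMED_SPECIAL_VALUES)
    else:
        special = set(set_to_nan)
    return [math.nan if v in special else v for v in values]


raw = [52000, -666666666, 61000]
untouched = replace_special(raw)
cleaned = replace_special(raw, set_to_nan=True)
print(untouched)
print(cleaned)
```

Leaving the default at None means a sentinel such as -666666666 stays visible in the data until the caller opts in to replacing it.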

censusdis.data.download_detail(dataset: str, year: int, download_variables: Iterable[str], *, with_geometry: bool = False, api_key: Optional[str] = None, variable_cache: Optional[VariableCache] = None, **kwargs: Union[str, Iterable[str]]) → Union[DataFrame, GeoDataFrame]

Deprecated version of download(); use download() instead.

This function offers a subset of the current functionality of download() but under the old name.

Back in the pre-history of censusdis, this function started life as a way to download ACS detail tables. It has evolved significantly since then and does much more now. Hence, the name was changed.

This function will disappear completely no later than version 1.0.0.

censusdis.data.get_shapefile_path() → Optional[str]

Get the path to the directory to cache shapefiles.

This is where we will cache shapefiles downloaded when with_geometry=True is passed to download().

Return type: The path to use for caching shapefiles.

censusdis.data.infer_geo_level(df_data: DataFrame) → str

Infer the geography level based on column names.

Parameters

df_data – A dataframe of variables with one or more columns that can be used to infer what geometry level the rows represent.

For example, if the column “STATE” exists, we could infer that the data is on a state-by-state basis. But if there are columns for both “STATE” and “COUNTY”, the data is probably at the county level.

If, on the other hand, there is a “COUNTY” column but not a “STATE” column, then there is some ambiguity. The data probably corresponds to counties, but the same county ID can exist in multiple states, so we will raise a CensusApiException with an error message explaining the situation.

If there is no match, we will also raise an exception. Again, we do this rather than, for example, returning None, so that we can provide an informative error message about the likely cause and what to do about it.

This function is not often called directly, but rather from add_inferred_geography(), which infers the geography level and then adds a geometry column containing the appropriate geography for each row.

Return type

The name of the geography level.
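The inference rules described for df_data can be sketched over a plain list of column names. This is a simplified illustration covering only the “STATE” and “COUNTY” cases mentioned above: `sketch_infer_geo_level` is a hypothetical function, the returned level names are assumed, and it raises ValueError where the real function raises CensusApiException.

```python
def sketch_infer_geo_level(columns):
    """Hypothetical sketch: infer a geography level from column names."""
    cols = set(columns)
    if {"STATE", "COUNTY"} <= cols:
        return "county"
    if "STATE" in cols:
        return "state"
    if "COUNTY" in cols:
        # County IDs repeat across states, so this case is ambiguous.
        raise ValueError(
            "Found a COUNTY column without STATE; county IDs are only "
            "unique within a state."
        )
    raise ValueError(f"Cannot infer a geography level from columns {sorted(cols)}.")


print(sketch_infer_geo_level(["STATE", "NAME"]))            # state
print(sketch_infer_geo_level(["STATE", "COUNTY", "NAME"]))  # county
```

Raising in both failure cases, rather than returning None, mirrors the rationale given above: the caller gets a message describing the likely cause.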

censusdis.data.set_shapefile_path(shapefile_path: Optional[str]) → None

Set the path to the directory to cache shapefiles.

This is where we will cache shapefiles downloaded when with_geometry=True is passed to download().

Parameters: shapefile_path – The path to use for caching shapefiles.