censusdis.data

Utilities for loading census data.

This module relies on the US Census API, which it wraps in a pythonic manner.

exception censusdis.data.CensusApiException[source]: Bases: Exception

class censusdis.data.CensusApiVariableSource[source]

Bases: VariableSource

A VariableSource that gets data from the US Census remote API.

Users will rarely if ever need to explicitly construct objects of this class. There is one behind the singleton cache censusdis.censusdata.variables.

get(dataset: str, year: int, name: str) → Dict[str, Any][source]

Get information on a variable for a given dataset in a given year.

The return value is a dictionary with the following fields:

Field	Description
“name”	The name of the variable.
“label”	A description of the variable. Within groups, hierarchies of variables are represented by separating levels with “!!”.
“concept”	The concept this variable and others in the group represent.
“group”	The group the variable belongs to. To query an entire group, use the `get_group()` method.
“limit”
“attributes”	A comma-separated list of variables that are attributes of this one.

This dictionary is very much like the JSON returned from US Census API URLs like https://api.census.gov/data/2020/acs/acs5/variables/B03001_001E.json

Parameters

dataset – The census dataset, for example acs/acs5 for ACS5 data (https://www.census.gov/data/developers/data-sets/acs-5year.html and https://api.census.gov/data/2020/acs/acs5.html) or dec/pl for redistricting data (https://www.census.gov/programs-surveys/decennial-census/about/rdo.html and https://api.census.gov/data/2020/dec/pl.html)
year – The year
name – The name of the variable to get information about. For example, B03002_001E is a variable from the ACS5 data set that represents total population in a geographic area.

Return type

A dictionary of information about the variable.
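As an illustration of the “!!” label convention described above, the following sketch splits a label into its hierarchy levels. The dictionary is a hand-written example in the shape described above, not output captured from the API, and label_levels is a hypothetical helper that is not part of censusdis.

```python
from typing import Any, Dict, List

# Hand-written example in the shape described above. The values are
# illustrative and were not fetched from the US Census API.
example_variable: Dict[str, Any] = {
    "name": "B03002_003E",
    "label": "Estimate!!Total:!!Not Hispanic or Latino:!!White alone",
    "group": "B03002",
}


def label_levels(variable: Dict[str, Any]) -> List[str]:
    """Split a variable's label into hierarchy levels (hypothetical helper)."""
    return variable["label"].split("!!")


print(label_levels(example_variable))
```

Each element of the resulting list is one level of the hierarchy, from the most general to the most specific.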

get_group(dataset: str, year: int, name: Optional[str]) → Dict[str, Dict][source]

Get information on a group of variables for a given dataset in a given year.

The return value is a dictionary that is very much like the JSON returned from US Census API URLs like https://api.census.gov/data/2020/acs/acs5/groups/B03002.json

See get() for more details.

Parameters

dataset – The census dataset, for example acs/acs5 for ACS5 data (https://www.census.gov/data/developers/data-sets/acs-5year.html and https://api.census.gov/data/2020/acs/acs5.html) or dec/pl for redistricting data (https://www.census.gov/programs-surveys/decennial-census/about/rdo.html and https://api.census.gov/data/2020/dec/pl.html)
year – The year
name – The name of the group to get information about. For example, B03002 is a group from the ACS5 data set that contains variables that represent the population of various racial and ethnic groups in a geographic area.

Returns

A dictionary with a single key “variables”. The value
associated with that key is a dictionary that maps from the
names of variables in the group to dictionaries of attributes
of the variable, in the same form as that returned for individual
variables by the method get().
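A short sketch of how a caller might walk the structure described above. The example_group dictionary is a hand-written stand-in for a return value, with illustrative labels that were not fetched from the API.

```python
from typing import Dict

# Hand-written stand-in for the return value of get_group(). The single
# top-level key is "variables", as described above.
example_group: Dict[str, Dict[str, Dict[str, str]]] = {
    "variables": {
        "B03002_001E": {"label": "Estimate!!Total:", "group": "B03002"},
        "B03002_002E": {
            "label": "Estimate!!Total:!!Not Hispanic or Latino:",
            "group": "B03002",
        },
    }
}

# Map each variable name in the group to its label.
group_labels = {
    name: info["label"] for name, info in example_group["variables"].items()
}

for name in sorted(group_labels):
    print(name, group_labels[name])
```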

static group_url(dataset: str, year: int, group_name: Optional[str] = None) → str[source]

Get the URL to fetch metadata about a group of variables.

This can either be all the variables in a dataset, if a group name is not specified, or just the variables in a particular group if the data set has groups.

Some datasets, dec/pl for example, do not have groups, so a group name need not be passed. Others, like acs/acs5, have groups, so a group name such as B01001 will normally be passed in.

Parameters

dataset – The census dataset.
year – The year
group_name – The name of the group, or None if the dataset has no groups.

Return type

The URL to fetch the metadata from.
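The two cases can be sketched as follows. sketch_group_url is a hypothetical re-implementation for illustration only. The grouped form follows the example URL shown for get_group(); the form used when no group name is given is an assumption, and the real method may differ.

```python
from typing import Optional

# Base of the example URLs shown elsewhere in this documentation.
BASE_URL = "https://api.census.gov/data"


def sketch_group_url(
    dataset: str, year: int, group_name: Optional[str] = None
) -> str:
    """Hypothetical sketch of the two cases; not the library's code."""
    if group_name is None:
        # Assumption: without a group, metadata for all variables in the
        # dataset is served from a single variables.json endpoint.
        return f"{BASE_URL}/{year}/{dataset}/variables.json"
    return f"{BASE_URL}/{year}/{dataset}/groups/{group_name}.json"


print(sketch_group_url("acs/acs5", 2020, "B03002"))
print(sketch_group_url("dec/pl", 2020))
```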

static url(dataset: str, year: int, name: str, response_format: str = 'json') → str[source]

Construct the URL to fetch metadata about a variable.

This is where we fetch metadata that is then put into the local cache.

Parameters

dataset – The census dataset.
year – The year
name – The name of the variable.
response_format – The desired format of the response. Either json (the default) or html.

Return type

The URL to fetch the metadata from.

static variables_url(dataset: str, year: int, response_format: str = 'json') → str[source]

Construct the URL to fetch metadata about all variables.

Parameters

dataset – The census dataset.
year – The year
response_format – The desired format of the response. Either json (the default) or html.

Return type

The URL to fetch the metadata from.
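For orientation, here is a minimal sketch of what url() and variables_url() might construct. The function names below are hypothetical. The single-variable pattern follows the example URL given for get(); the all-variables pattern is an assumption based on the same layout.

```python
# Base of the example URLs shown elsewhere in this documentation.
CENSUS_BASE_URL = "https://api.census.gov/data"


def sketch_variable_url(
    dataset: str, year: int, name: str, response_format: str = "json"
) -> str:
    """Hypothetical sketch of the URL for one variable's metadata."""
    return f"{CENSUS_BASE_URL}/{year}/{dataset}/variables/{name}.{response_format}"


def sketch_variables_url(
    dataset: str, year: int, response_format: str = "json"
) -> str:
    """Hypothetical sketch of the URL for all variables (assumed layout)."""
    return f"{CENSUS_BASE_URL}/{year}/{dataset}/variables.{response_format}"


print(sketch_variable_url("acs/acs5", 2020, "B03001_001E"))
print(sketch_variables_url("acs/acs5", 2020, response_format="html"))
```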

class censusdis.data.VariableCache(*, variable_source: Optional[VariableSource] = None)[source]

Bases: object

A cache of variables and groups.

This looks a lot like a VariableSource but it implements a cache in front of a VariableSource.

Users will rarely if ever need to construct one of these themselves. In almost all cases they will use the singleton censusdis.censusdata.variables.

class GroupTreeNode(name: Optional[str] = None)[source]

Bases: object

add_child(path_component: str, child: GroupTreeNode)[source]

get(component, default: Optional[GroupTreeNode] = None)[source]

is_leaf() → bool[source]

items() → Iterable[Tuple[str, GroupTreeNode]][source]

keys() → Iterable[str][source]

leaf_variables() → Generator[str, None, None][source]

leaves() → Generator[GroupTreeNode, None, None][source]

property name

values() → Iterable[GroupTreeNode][source]

clear()[source]

Clear the entire cache.

This just means that further calls to get() will have to make a call to the source behind the cache.

get(dataset: str, year: int, name: str) → Dict[str, Dict][source]

Get the description of a given variable.

See VariableSource.get() for details on the data format. We first look in the cache and then if we don’t find what we are looking for, we call the source behind us and cache the results before returning them.

Parameters

dataset – The census dataset.
year – The year
name – The name of the variable.

Return type

The details of the variable.
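The look-in-the-cache-first behavior described above can be sketched with a small read-through cache. Both classes below are hypothetical stand-ins written for this illustration; they are not the library's implementation. The key shape (dataset, year, name) follows the signature of keys().

```python
from typing import Any, Dict, Tuple


class CountingSource:
    """Hypothetical stand-in for a VariableSource that counts lookups."""

    def __init__(self) -> None:
        self.calls = 0

    def get(self, dataset: str, year: int, name: str) -> Dict[str, Any]:
        self.calls += 1
        return {"name": name, "label": f"label for {name}"}


class SketchVariableCache:
    """Minimal read-through cache keyed by (dataset, year, name)."""

    def __init__(self, source: CountingSource) -> None:
        self._source = source
        self._cache: Dict[Tuple[str, int, str], Dict[str, Any]] = {}

    def get(self, dataset: str, year: int, name: str) -> Dict[str, Any]:
        key = (dataset, year, name)
        if key not in self._cache:
            # Cache miss: ask the source and remember the answer.
            self._cache[key] = self._source.get(dataset, year, name)
        return self._cache[key]

    def invalidate(self, dataset: str, year: int, name: str) -> None:
        # Forget a single entry; missing entries are ignored.
        self._cache.pop((dataset, year, name), None)

    def clear(self) -> None:
        self._cache.clear()


source = CountingSource()
cache = SketchVariableCache(source)

cache.get("acs/acs5", 2020, "B03002_001E")
cache.get("acs/acs5", 2020, "B03002_001E")  # served from the cache
calls_after_repeat = source.calls

cache.clear()
cache.get("acs/acs5", 2020, "B03002_001E")  # goes back to the source
calls_after_clear = source.calls

print(calls_after_repeat, calls_after_clear)
```

The repeated lookup does not reach the source, while the lookup after clear() does.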

get_group(dataset: str, year: int, name: Optional[str]) → Dict[str, Dict][source]

Get information on the variables in a group.

Parameters

dataset – The census dataset.
year – The year
name – The name of the group, or None if the dataset does not have groups.

Returns

A dictionary that maps from the names of each variable in the group
to a dictionary containing a description of the variable. The
format of the description is a dictionary as described in
the documentation for
VariableSource.get().

group_leaves(dataset: str, year: int, name: str, *, skip_annotations: bool = True) → List[str][source]

Find the leaves of a given group.

Parameters

dataset – The census dataset.
year – The year
name – The name of the group.
skip_annotations – If True, try to filter out variables that are annotations rather than actual values, by skipping those with labels that begin with “Annotation” or “Margin of Error”.

Returns

A list of the variables in the group that are leaves,
i.e. they are not aggregates of other variables. For example,
in the group B03002 in the acs/acs5 dataset in the
year 2020, the variable B03002_003E is a leaf, because
it represents
“Estimate!!Total:!!Not Hispanic or Latino:!!White alone”,
whereas B03002_002E is not a leaf because it represents
“Estimate!!Total:!!Not Hispanic or Latino:”, which is a total
that includes B03002_003E as well as others like B03002_004E,
B03002_005E and more.
The typical reason we want leaves is because that gives us a set
of variables representing counts that do not overlap and add up
to the total. We can use these directly in diversity and integration
calculations using the divintseg package.
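One way to think about leaves is in terms of label paths: a variable is a leaf when no other variable's label extends its own. The sketch below applies that idea to a hand-written subset of labels in the style of group B03002. sketch_leaves is a hypothetical helper, and the library's own approach (for example via group_tree()) may differ.

```python
from typing import Dict, List

# Hand-written subset of labels in the style of group B03002.
# Illustrative only; not fetched from the API.
sample_labels: Dict[str, str] = {
    "B03002_001E": "Estimate!!Total:",
    "B03002_002E": "Estimate!!Total:!!Not Hispanic or Latino:",
    "B03002_003E": "Estimate!!Total:!!Not Hispanic or Latino:!!White alone",
    "B03002_004E": (
        "Estimate!!Total:!!Not Hispanic or Latino:"
        "!!Black or African American alone"
    ),
}


def sketch_leaves(labels: Dict[str, str]) -> List[str]:
    """Return names whose label path is not a prefix of any other path."""
    paths = {name: tuple(label.split("!!")) for name, label in labels.items()}
    leaves = []
    for name, path in paths.items():
        # A variable is an aggregate if some longer path starts with its path.
        has_child = any(
            len(other) > len(path) and other[: len(path)] == path
            for other in paths.values()
        )
        if not has_child:
            leaves.append(name)
    return sorted(leaves)


print(sketch_leaves(sample_labels))
```

In this subset the two totals are aggregates, so only the two most specific variables remain.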

group_tree(dataset: str, year: int, group_name: Optional[str], *, skip_annotations: bool = True) → GroupTreeNode[source]

invalidate(dataset: str, year: int, name: str)[source]: Remove an item from the cache.

items() → Iterable[Tuple[Tuple[str, int, str], dict]][source]: Items in the mapping from variable name to description.

keys() → Iterable[Tuple[str, int, str]][source]: Keys, i.e. the names of variables, in the cache.

values() → Iterable[dict][source]: Values, i.e. the descriptions of variables, in the cache.

class censusdis.data.VariableSource[source]

Bases: ABC

A source of variables, typically used behind a VariableCache.

The purpose of this class is to get variable and group information from a source, typically a remote API call to the US Census API. Another use case is to enable mocking for testing the rest of the VariableCache functionality, which is a superset of what this class does.
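The mocking use case might look like the following sketch. SketchVariableSource is a stand-in abstract base class with the two abstract methods documented here, and InMemorySource is a hypothetical test double. The variable and group names are invented for the example.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict


class SketchVariableSource(ABC):
    """Stand-in for the abstract interface described here (hypothetical)."""

    @abstractmethod
    def get(self, dataset: str, year: int, name: str) -> Dict[str, Any]:
        ...

    @abstractmethod
    def get_group(self, dataset: str, year: int, name: str) -> Dict[str, Dict]:
        ...


class InMemorySource(SketchVariableSource):
    """Test double that serves canned metadata instead of calling the API."""

    def __init__(self, variables: Dict[str, Dict[str, Any]]) -> None:
        self._variables = variables

    def get(self, dataset: str, year: int, name: str) -> Dict[str, Any]:
        return self._variables[name]

    def get_group(self, dataset: str, year: int, name: str) -> Dict[str, Dict]:
        # Collect the canned variables that belong to the requested group.
        members = {
            var: info
            for var, info in self._variables.items()
            if info.get("group") == name
        }
        return {"variables": members}


# Invented names; no network access is needed to exercise the double.
fake_source = InMemorySource(
    {
        "X_001E": {"name": "X_001E", "label": "Estimate!!Total:", "group": "X"},
        "Y_001E": {"name": "Y_001E", "label": "Estimate!!Total:", "group": "Y"},
    }
)

print(fake_source.get_group("any/dataset", 2020, "X"))
```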

abstract get(dataset: str, year: int, name: str) → Dict[str, Any][source]

Get information on a variable for a given dataset in a given year.

The return value is a dictionary with the following fields:

Field	Description
“name”	The name of the variable.
“label”	A description of the variable. Within groups, hierarchies of variables are represented by separating levels with “!!”.
“concept”	The concept this variable and others in the group represent.
“group”	The group the variable belongs to. To query an entire group, use the `get_group()` method.
“limit”
“attributes”	A comma-separated list of variables that are attributes of this one.

This dictionary is very much like the JSON returned from US Census API URLs like https://api.census.gov/data/2020/acs/acs5/variables/B03001_001E.json

Parameters

dataset – The census dataset, for example acs/acs5 for ACS5 data (https://www.census.gov/data/developers/data-sets/acs-5year.html and https://api.census.gov/data/2020/acs/acs5.html) or dec/pl for redistricting data (https://www.census.gov/programs-surveys/decennial-census/about/rdo.html and https://api.census.gov/data/2020/dec/pl.html)
year – The year
name – The name of the variable to get information about. For example, B03002_001E is a variable from the ACS5 data set that represents total population in a geographic area.

Return type

A dictionary of information about the variable.

abstract get_group(dataset: str, year: int, name: str) → Dict[str, Dict][source]

Get information on a group of variables for a given dataset in a given year.

The return value is a dictionary that is very much like the JSON returned from US Census API URLs like https://api.census.gov/data/2020/acs/acs5/groups/B03002.json

See get() for more details.

Parameters

dataset – The census dataset, for example acs/acs5 for ACS5 data (https://www.census.gov/data/developers/data-sets/acs-5year.html and https://api.census.gov/data/2020/acs/acs5.html) or dec/pl for redistricting data (https://www.census.gov/programs-surveys/decennial-census/about/rdo.html and https://api.census.gov/data/2020/dec/pl.html)
year – The year
name – The name of the group to get information about. For example, B03002 is a group from the ACS5 data set that contains variables that represent the population of various racial and ethnic groups in a geographic area.

Returns

A dictionary with a single key “variables”. The value
associated with that key is a dictionary that maps from the
names of variables in the group to dictionaries of attributes
of the variable, in the same form as that returned for individual
variables by the method get().

censusdis.data.add_inferred_geography(df: DataFrame, year: int) → GeoDataFrame[source]

Infer the geography level of the given dataframe and add geometry to each row for that level.

Parameters

df – A dataframe of variables with one or more columns that can be used to infer what geometry level the rows represent.
year – The year for which to fetch geometries. We need this because they change over time.

Returns

A geo data frame containing the original data augmented with
the appropriate geometry for each row.

censusdis.data.census_detail_table_url(dataset: str, year: int, fields: Iterable[str], *, api_key: Optional[str] = None, **kwargs: Union[str, Iterable[str]]) → Tuple[str, Mapping[str, str], BoundGeographyPath][source]

censusdis.data.data_from_url(url: str, params: Optional[Mapping[str, str]] = None) → DataFrame[source]

censusdis.data.download_detail(dataset: str, year: int, download_variables: Iterable[str], *, with_geometry: bool = False, api_key: Optional[str] = None, variable_cache: Optional[VariableCache] = None, **kwargs: Union[str, Iterable[str]]) → Union[DataFrame, GeoDataFrame][source]

Download data from the US Census API.

This is the main API for downloading US Census data with the censusdis package. There are many examples of how to use this in the demo notebooks provided with the package at https://github.com/vengroff/censusdis/tree/main/notebooks.

Parameters

dataset – The dataset to download from. For example acs/acs5 or dec/pl.
year – The year to download data for.
download_variables – The census variables to download, for example [“NAME”, “B01001_001E”].
with_geometry – If True, a gpd.GeoDataFrame will be returned and each row will have a geometry that is a cartographic boundary suitable for plotting a map. See https://www.census.gov/geographies/mapping-files/time-series/geo/cartographic-boundary.2020.html for details of the shapefiles that will be downloaded on your behalf to generate these boundaries.
api_key – An optional API key. If you don’t have or don’t use a key, the number of calls you can make will be limited.
variable_cache – A cache of metadata about variables.
kwargs – A specification of the geometry that we want data for.

Return type

A DataFrame containing the requested US Census data, or a GeoDataFrame if with_geometry is True.
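To give a feel for what a download involves, the sketch below assembles a query in the get/for/in form used by the US Census API. It is not the library's code. sketch_detail_query is a hypothetical helper, and it assumes that the last geography keyword names the level to return rows for while any earlier keywords restrict the query. The signature of census_detail_table_url() suggests the library resolves the geography through a BoundGeographyPath instead.

```python
from typing import Dict, Iterable, Optional, Tuple


def sketch_detail_query(
    dataset: str,
    year: int,
    variables: Iterable[str],
    *,
    api_key: Optional[str] = None,
    **kwargs: str,
) -> Tuple[str, Dict[str, str]]:
    """Build a URL and query parameters for a data request (hypothetical)."""
    url = f"https://api.census.gov/data/{year}/{dataset}"
    params: Dict[str, str] = {"get": ",".join(variables)}

    geography = list(kwargs.items())
    if geography:
        # Assumption: the last keyword is the level we want rows for, and
        # the keywords before it restrict where those rows come from.
        within, (level, value) = geography[:-1], geography[-1]
        params["for"] = f"{level}:{value}"
        if within:
            params["in"] = " ".join(f"{k}:{v}" for k, v in within)

    if api_key is not None:
        params["key"] = api_key
    return url, params


example_url, example_params = sketch_detail_query(
    "acs/acs5", 2020, ["NAME", "B01001_001E"], state="34", county="*"
)
print(example_url)
print(example_params)
```

Here the keywords ask for every county within the state whose FIPS code is 34.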

censusdis.data.get_shapefile_path() → str[source]

Get the path to the directory to cache shapefiles.

This is where we will cache shapefiles downloaded when with_geometry=True is passed to download_detail().

Return type: The path to use for caching shapefiles.

censusdis.data.infer_geo_level(df: DataFrame) → str[source]

Infer the geography level based on column names.

Parameters

df –

A dataframe of variables with one or more columns that can be used to infer what geometry level the rows represent.

For example, if the column “STATE” exists, we could infer that the data is on a state-by-state basis. But if there are columns for both “STATE” and “COUNTY”, the data is probably at the county level.

If, on the other hand, there is a “COUNTY” column but not a “STATE” column, then there is some ambiguity. The data probably corresponds to counties, but the same county ID can exist in multiple states, so we will raise a CensusApiException with an error message explaining the situation.

If there is no match, we will also raise an exception. Again we do this, rather than, for example, returning None, so that we can provide an informative error message about the likely cause and what to do about it.

This function is not often called directly, but rather from add_inferred_geography(), which infers the geography level and then adds a geometry column containing the appropriate geography for each row.

Return type

The name of the geography level.
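The inference rules described above can be sketched over a plain list of column names. Everything here is hypothetical: the table covers only the two levels mentioned in the text, SketchGeoError stands in for CensusApiException, and the real function takes a dataframe and knows about many more levels.

```python
from typing import Iterable, List, Tuple

# Abbreviated table of levels, most specific first. Each entry lists the
# columns that must all be present for that level to match.
SKETCH_LEVELS: List[Tuple[str, Tuple[str, ...]]] = [
    ("county", ("STATE", "COUNTY")),
    ("state", ("STATE",)),
]


class SketchGeoError(Exception):
    """Stand-in for CensusApiException in this sketch."""


def sketch_infer_geo_level(columns: Iterable[str]) -> str:
    """Return the most specific level whose columns are all present."""
    present = set(columns)
    for level, required in SKETCH_LEVELS:
        if all(column in present for column in required):
            return level
    if "COUNTY" in present:
        # Ambiguous: the same county ID can exist in multiple states.
        raise SketchGeoError(
            "Found a COUNTY column without a STATE column; "
            "county IDs are only unique within a state."
        )
    raise SketchGeoError("No columns matched a known geography level.")


print(sketch_infer_geo_level(["STATE", "NAME"]))
print(sketch_infer_geo_level(["STATE", "COUNTY", "B01001_001E"]))
```

Raising in both failure cases, rather than returning None, mirrors the reasoning given above about informative error messages.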

censusdis.data.json_from_url(url: str, params: Optional[Mapping[str, str]] = None) → Any[source]

censusdis.data.set_shapefile_path(shapefile_path: str) → None[source]

Set the path to the directory to cache shapefiles.

This is where we will cache shapefiles downloaded when with_geometry=True is passed to download_detail().

Parameters: shapefile_path – The path to use for caching shapefiles.