.. _getting_started:

Getting Started
===============

*N.B.* If you already have an environment with ``censusdis`` installed
and prefer to jump straight to complete demo notebooks, you can find them
`here <./notebooks.html>`_.

.. _installation:

Installing ``censusdis``
------------------------

Installation follows the typical model for Python::

    pip install censusdis

will install the package in your Python environment. If you are using
a tool like `conda `_ or `poetry `_ to manage your dependencies, then
you can add ``censusdis`` the same way you would add any other dependency.

Making Your First Query
-----------------------

Let's start with a simple example. We will use ``censusdis.data`` to
load the population and median household income of every state in the
country from the 2020 American Community Survey 5-Year Data.

In Census terms, the name of the dataset we want to use is
`"acs/acs5" `_ and the names of the variables we want to load are
`"B01003_001E" `_ and `"B19013_001E" `_. If you have worked with US Census
data before, you may recognize the format of the data set and variable
names. If you are new to US Census data, don't worry. We will talk about
how to discover data sets and query metadata on available variables later.

We will import ``censusdis.data`` and set things up as described above
with the following code::

    import censusdis.data as ced

    # American Community Survey 5-Year Data
    # https://www.census.gov/data/developers/data-sets/acs-5year.html
    DATASET = "acs/acs5"

    # The year we want data for.
    YEAR = 2020

    # These are the census variables for total population and median household income.
    # For more details, see
    #
    #     https://api.census.gov/data/2020/acs/acs5/variables.html,
    #     https://api.census.gov/data/2020/acs/acs5/variables/B01003_001E.html, and
    #     https://api.census.gov/data/2020/acs/acs5/variables/B19013_001E.html.
    #
    TOTAL_POPULATION_VARIABLE = "B01003_001E"
    MEDIAN_HOUSEHOLD_INCOME_VARIABLE = "B19013_001E"

    # The variables we are going to query.
    VARIABLES = ["NAME", TOTAL_POPULATION_VARIABLE, MEDIAN_HOUSEHOLD_INCOME_VARIABLE]

Once we have done that, we can make the following query to get the data we want::

    # Get the value of our variables for every state in the
    # year we have chosen.
    df_states = ced.download(
        DATASET,
        YEAR,
        VARIABLES,
        state="*",
    )

The call to ``ced.download`` will construct a URL in the Census API's
preferred format
(`https://api.census.gov/data/2020/acs/acs5?get=NAME,B01003_001E,B19013_001E&for=state:* `_),
make a request to the Census servers at that URL, parse the JSON that is
returned, and turn it into a ``pandas.DataFrame``.

``df_states`` now has the name, population, and median household income of
all 50 states, the District of Columbia, and Puerto Rico. The value returned
into ``df_states`` is::

        STATE                  NAME  B01003_001E  B19013_001E
    0      42          Pennsylvania     12794885        63627
    1      06            California     39346023        78672
    2      54         West Virginia      1807426        48037
    3      49                  Utah      3151239        74197
    4      36              New York     19514849        71117
    5      11  District of Columbia       701974        90842
    6      02                Alaska       736990        77790
    7      12               Florida     21216924        57703
    8      45        South Carolina      5091517        54864
    9      38          North Dakota       760394        65315
    10     23                 Maine      1340825        59489
    11     13               Georgia     10516579        61224
    12     01               Alabama      4893186        52035
    13     33         New Hampshire      1355244        77923
    14     41                Oregon      4176346        65667
    15     56               Wyoming       581348        65304
    16     04               Arizona      7174064        61529
    17     22             Louisiana      4664616        50800
    18     18               Indiana      6696893        58235
    19     16                 Idaho      1754367        58915
    20     09           Connecticut      3570549        79855
    21     15                Hawaii      1420074        83173
    22     17              Illinois     12716164        68428
    23     25         Massachusetts      6873003        84385
    24     48                 Texas     28635442        63826
    25     30               Montana      1061705        56539
    26     31              Nebraska      1923826        63015
    27     39                  Ohio     11675275        58116
    28     08              Colorado      5684926        75231
    29     34            New Jersey      8885418        85245
    30     24              Maryland      6037624        87063
    31     51              Virginia      8509358        76398
    32     50               Vermont       624340        63477
    33     37        North Carolina     10386227        56642
    34     05              Arkansas      3011873        49475
    35     53            Washington      7512465        77006
    36     20                Kansas      2912619        61091
    37     40              Oklahoma      3949342        53840
    38     55             Wisconsin      5806975        63293
    39     28           Mississippi      2981835        46511
    40     29              Missouri      6124160        57290
    41     26              Michigan      9973907        59234
    42     44          Rhode Island      1057798        70305
    43     27             Minnesota      5600166        73382
    44     19                  Iowa      3150011        61836
    45     35            New Mexico      2097021        51243
    46     32                Nevada      3030281        62043
    47     10              Delaware       967679        69110
    48     72           Puerto Rico      3255642        21058
    49     21              Kentucky      4461952        52238
    50     46          South Dakota       879336        59896
    51     47             Tennessee      6772268        54833

Notice that the data frame has four columns, ``STATE``, ``NAME``,
``B01003_001E``, and ``B19013_001E``. ``NAME``, ``B01003_001E``, and
``B19013_001E`` are what we asked for. But what about the first column,
``STATE``? That is additional data that indicates the state of each row,
specified in terms of a `FIPS Code `_. State FIPS codes are two-digit
strings that the US Census uses to identify states.

``censusdis`` returns FIPS codes like these to you because they tend to be
very useful in cases where you might want to join this data with other
data, either from other ``censusdis`` queries or from other sources.
Joining on a FIPS code is usually more reliable and less error-prone than
joining on a string like the name of a state. One data set might use the
name "N. Carolina", another might use "North Carolina", and a third
might use "NC". FIPS codes help us avoid this confusion and the need to keep
mapping between the variants.

The states are in no particular order other than what the underlying
US Census API returned to us. If order matters to you, you can sort the
data frame by whatever column(s) you like, such as by the name of the state
or by the population.

Filtering Queries
-----------------

Our first query got the population and median household income of every
state. Sometimes, especially when we are working at a smaller level of
granularity like a county, we don't want the data for the entire country.
We might want it just for the counties of a particular state, say
New Jersey. In that case, we can specify this with additional arguments to
``ced.download``.
For example::

    from censusdis import states

    df_counties = ced.download(
        DATASET,
        YEAR,
        VARIABLES,
        state=states.NJ,
        county="*",
    )

This code is almost exactly the same as the last query, except that we
replaced ``state="*"`` with ``state=states.NJ`` and added ``county="*"``.
So instead of asking for the data aggregated at the state level across all
states, we are asking for only the data from the state of New Jersey,
aggregated at the county level. The value returned into ``df_counties`` is::

        STATE  COUNTY                           NAME  B01003_001E  B19013_001E
    0      34     003      Bergen County, New Jersey       931275       104623
    1      34     009    Cape May County, New Jersey        92701        72385
    2      34     015  Gloucester County, New Jersey       291745        89056
    3      34     021      Mercer County, New Jersey       368085        83306
    4      34     027      Morris County, New Jersey       492715       117298
    5      34     033       Salem County, New Jersey        62754        64234
    6      34     039       Union County, New Jersey       555208        82644
    7      34     001    Atlantic County, New Jersey       264650        63680
    8      34     005  Burlington County, New Jersey       446301        90329
    9      34     007      Camden County, New Jersey       506721        70957
    10     34     011  Cumberland County, New Jersey       150085        55709
    11     34     013       Essex County, New Jersey       798698        63959
    12     34     017      Hudson County, New Jersey       671923        75062
    13     34     019   Hunterdon County, New Jersey       125063       117858
    14     34     023   Middlesex County, New Jersey       825015        91731
    15     34     025    Monmouth County, New Jersey       620821       103523
    16     34     029       Ocean County, New Jersey       602018        72679
    17     34     031     Passaic County, New Jersey       502763        73562
    18     34     035    Somerset County, New Jersey       330151       116510
    19     34     037      Sussex County, New Jersey       140996        96222
    20     34     041      Warren County, New Jersey       105730        83497

Note that in this case, we received both the FIPS code for the state
(34 for New Jersey) and the FIPS code for the county within the state,
along with the name of the county, its population, and its median household
income. The same county FIPS codes are reused from one state to the next,
so if we wanted to join this with data from elsewhere we would need to join
on both the state FIPS code and the county FIPS code.
Note also that joining by ``NAME`` could get really messy.
Is "Bergen CNTY, NJ" the same as "Bergen County, New Jersey"?

Since the first two queries we did both went to the same underlying
"acs/acs5" dataset, the numbers they contain should add up. We can verify
this by seeing if the total population of all the counties in New Jersey in
the second query is equal to the population of the state from the first
query with::

    df_counties["B01003_001E"].sum()

Sure enough, this sum is ``8885418``, exactly what we saw in the New Jersey
row of ``df_states``.

Additional Geographies
----------------------

Depending on what dataset we are querying, data may be available at a wide
variety of geographic levels. Some, like region, are very large. In the US
Census data model, there are only four regions. Their populations can be
queried with::

    df_region = ced.download(
        DATASET,
        YEAR,
        VARIABLES,
        region="*",
    )

The result is::

       REGION              NAME  B01003_001E  B19013_001E
    0       2    Midwest Region     68219726        62054
    1       3      South Region    124605822        59816
    2       4       West Region     77726849        72464
    3       1  Northeast Region     56016911        72698

On the other hand, we can go down to very small geographies called
*block groups*. These are small neighborhoods of just a few blocks, each of
which is typically home to somewhere between hundreds and thousands of
people. Here is a block group query for Essex County, NJ::

    COUNTY_ESSEX_NJ = "013"  # See county query above.

    df_bg = ced.download(
        DATASET,
        YEAR,
        VARIABLES,
        state=states.NJ,
        county=COUNTY_ESSEX_NJ,
        block_group="*",
    )

The results of this are much larger than our previous data frames.
There are 672 block groups in the county.
The results (leaving out a bunch of rows in the middle) look like::

         STATE  COUNTY   TRACT  BLOCK_GROUP                                                         NAME  B01003_001E  B19013_001E
    0       34     013  000100            2      Block Group 2, Census Tract 1, Essex County, New Jersey         1826        31250
    1       34     013  000200            2      Block Group 2, Census Tract 2, Essex County, New Jersey         2156        39944
    2       34     013  000400            1      Block Group 1, Census Tract 4, Essex County, New Jersey         2121        41736
    3       34     013  000600            1      Block Group 1, Census Tract 6, Essex County, New Jersey         2363        44705
    4       34     013  000700            2      Block Group 2, Census Tract 7, Essex County, New Jersey         2321        32382
    5       34     013  000800            1      Block Group 1, Census Tract 8, Essex County, New Jersey         1811        78100
    6       34     013  000900            1      Block Group 1, Census Tract 9, Essex County, New Jersey         1066        16125
    7       34     013  001000            1     Block Group 1, Census Tract 10, Essex County, New Jersey         1305   -666666666
    8       34     013  001100            2     Block Group 2, Census Tract 11, Essex County, New Jersey         1660        69650
    9       34     013  001400            2     Block Group 2, Census Tract 14, Essex County, New Jersey         1434        54516
    ...
    662     34     013  004700            2     Block Group 2, Census Tract 47, Essex County, New Jersey         1373        53125
    663     34     013  004700            3     Block Group 3, Census Tract 47, Essex County, New Jersey         1028   -666666666
    664     34     013  004700            4     Block Group 4, Census Tract 47, Essex County, New Jersey         1253        53368
    665     34     013  004700            5     Block Group 5, Census Tract 47, Essex County, New Jersey          796        49097
    666     34     013  004801            1  Block Group 1, Census Tract 48.01, Essex County, New Jersey         1850        37619
    667     34     013  004801            2  Block Group 2, Census Tract 48.01, Essex County, New Jersey          530        58705
    668     34     013  004802            1  Block Group 1, Census Tract 48.02, Essex County, New Jersey         2130        11634
    669     34     013  004802            2  Block Group 2, Census Tract 48.02, Essex County, New Jersey          694        19919
    670     34     013  004802            3  Block Group 3, Census Tract 48.02, Essex County, New Jersey         1102        11713
    671     34     013  004900            1     Block Group 1, Census Tract 49, Essex County, New Jersey          885        28362

An interesting thing happened here. We asked for all the block groups in the county.
``censusdis`` was smart enough to realize that block groups are nested
inside geographies called census tracts, which are in turn nested inside
counties. In order to give us enough identifiers to unambiguously
differentiate the rows, the ``TRACT`` column was added even though we did
not mention it in our query. As you can see in the results, the block group
identifier is typically a single digit, so many rows share the same value,
but it is unique within a tract. Each row is a unique combination of state,
county, census tract, and block group.

One other interesting thing happened. Two of the rows shown above have the
value ``-666666666`` in the column ``B19013_001E``. This is a special value
that indicates that there was not enough data in the survey to estimate the
value accurately. In many cases we will want to drop these rows or treat
them in a special way in our analysis.

If you want to find out what all the supported geographies for a data set
are, you can check a US Census page like
https://api.census.gov/data/2020/dec/pl/geography.html, which is normally
linked from the page describing the dataset
(https://api.census.gov/data/2020/dec/pl.html in this case).

``censusdis`` queries the same geography data that powers these pages so
that it can tell you what options are available and how, in Python, to
specify them as arguments. You can look at this information with the
following code::

    import censusdis.geography as cgeo

    specs = cgeo.geo_path_snake_specs(DATASET, YEAR)

``specs`` will now contain::

    {'010': ['us'],
     '020': ['region'],
     '030': ['division'],
     '040': ['state'],
     '050': ['state', 'county'],
     '060': ['state', 'county', 'county_subdivision'],
     '067': ['state', 'county', 'county_subdivision', 'subminor_civil_division'],
     '070': ['state', 'county', 'county_subdivision', 'place_remainder_or_part'],
     '140': ['state', 'county', 'tract'],
     '150': ['state', 'county', 'tract', 'block_group'],
     ...
     '330': ['combined_statistical_area'],
     ...
     '550': ['state', 'congressional_district', 'american_indian_area_alaska_native_area_hawaiian_home_land_or_part'],
     '610': ['state', 'state_legislative_district_upper_chamber'],
     '612': ['state', 'state_legislative_district_upper_chamber', 'county_or_part'],
     '620': ['state', 'state_legislative_district_lower_chamber'],
     '622': ['state', 'state_legislative_district_lower_chamber', 'county_or_part'],
     '795': ['state', 'public_use_microdata_area'],
     '860': ['zip_code_tabulation_area'],
     '950': ['state', 'school_district_elementary'],
     '960': ['state', 'school_district_secondary'],
     '970': ['state', 'school_district_unified']}

mirroring what was on the web site, but in a form that additional code can
more easily digest.

Note that the queries we performed so far corresponded to geographies
``'040'``, ``'050'``, ``'020'``, and ``'150'``. In all cases, ``censusdis``
chose the least specific geography that could be matched against the keyword
arguments we provided.

We can query any of these geographies we like, using the argument naming
conventions returned in ``specs`` above. For example::

    df_csa = ced.download(
        DATASET,
        YEAR,
        VARIABLES,
        combined_statistical_area="*"
    )

which produces the results::

         COMBINED_STATISTICAL_AREA                                                     NAME  B01003_001E  B19013_001E
    0                          104                               Albany-Schenectady, NY CSA      1169019        69275
    1                          106                   Albuquerque-Santa Fe-Las Vegas, NM CSA      1156289        55499
    2                          107                               Altoona-Huntingdon, PA CSA       167640        51497
    3                          108                            Amarillo-Pampa-Borger, TX CSA       308297        56120
    4                          118                          Appleton-Oshkosh-Neenah, WI CSA       407758        65838
    5                          120                         Asheville-Marion-Brevard, NC CSA       538785        54033
    6                          122  Atlanta--Athens-Clarke County--Sandy Springs, GA-AL CSA      6770764        68938
    7                          140                                  Bend-Prineville, OR CSA       215482        67851
    8                          142                      Birmingham-Hoover-Talladega, AL CSA      1315561        56576
    9                          144                              Bloomington-Bedford, IN CSA       213724        53695
    ...
    165                        539                                   Tupelo-Corinth, MS CSA       202909        47893
    166                        540                               Tyler-Jacksonville, TX CSA       282525        57327
    167                        544                             Victoria-Port Lavaca, TX CSA       121092        58325
    168                        545                        Virginia Beach-Norfolk, VA-NC CSA      1858942        67884
    169                        548       Washington-Baltimore-Arlington, DC-MD-VA-WV-PA CSA      9781219        95810
    170                        554            Wausau-Stevens Point-Wisconsin Rapids, WI CSA       306886        59919
    171                        556                                 Wichita-Winfield, KS CSA       674758        57808
    172                        558                          Williamsport-Lock Haven, PA CSA       152563        53990
    173                        566                             Youngstown-Warren, OH-PA CSA       640629        48251
    174                        517                              Spencer-Spirit Lake, IA CSA        33398        55762

for the 175 CSAs in the US.

More Variables
--------------

So far, we have only been looking at the variables ``NAME``,
``B01003_001E``, and ``B19013_001E`` from the ``acs/acs5`` dataset.
But there are thousands of other interesting variables in various
data sets you might want to look at.

In many data sets, variables are organized into groups. ``censusdis``
has APIs to explore groups of related variables and load the ones you
are most interested in. There is an example in the
`SoMa DIS Demo <./nb/SoMa%20DIS%20Demo.html>`_ notebook, which
looks at racial demographics and computes diversity and integration
metrics at the census tract level.

That notebook only explores groups of variables a little. We do some more
rigorous analysis of groups and variables in the
`Exploring Variables <./nb/Exploring%20Variables.html>`_ notebook.

Adding Geography and Plotting
-----------------------------

All of the US Census data we queried above was organized by geography.
Often it is interesting to plot this data. But in order to do so, we need
data on the shapes and locations of the geographical areas corresponding
to each geography represented in the data. Often this means loading the
geometry separately and then joining it together with the data.
With ``censusdis``, we don't have to do this.
Instead, we can ask it to include geometry with the data it returns by
adding the ``with_geometry=True`` flag. Here is an example that follows up
on the examples in the previous section::

    gdf_counties = ced.download(
        DATASET,
        YEAR,
        VARIABLES,
        state="*",
        county="*",
        with_geometry=True
    )

In this example, aside from adding ``with_geometry=True``, we passed
``state="*"`` and ``county="*"``. This means we want data for all the
counties in all the states in the country.

If we look at the return value, it looks like::

          STATE  COUNTY                      NAME  B01003_001E  B19013_001E                                           geometry
    0        01     001   Autauga County, Alabama        55639        57982  POLYGON ((-86.92120 32.65754, -86.92035 32.658...
    1        01     003   Baldwin County, Alabama       218289        61756  POLYGON ((-88.02858 30.22676, -88.02399 30.230...
    2        01     005   Barbour County, Alabama        25026        34990  POLYGON ((-85.74803 31.61918, -85.74544 31.618...
    3        01     007      Bibb County, Alabama        22374        51721  POLYGON ((-87.42194 33.00338, -87.33177 33.005...
    4        01     009    Blount County, Alabama        57755        48922  POLYGON ((-86.96336 33.85822, -86.95967 33.857...
    5        01     011   Bullock County, Alabama        10173        33866  POLYGON ((-85.99926 32.25018, -85.98655 32.250...
    6        01     013    Butler County, Alabama        19726        44850  POLYGON ((-86.90894 31.96167, -86.88668 31.961...
    7        01     015   Calhoun County, Alabama       114324        50128  POLYGON ((-86.14623 33.70218, -86.14577 33.704...
    8        21     135    Lewis County, Kentucky        13345        29844  POLYGON ((-83.64418 38.63783, -83.64048 38.648...
    9        21     137  Lincoln County, Kentucky        24493        42231  POLYGON ((-84.85792 37.48407, -84.85755 37.508...
    ...
    3220     27     153    Todd County, Minnesota        24603        54502  POLYGON ((-95.15557 46.36888, -95.15013 46.368...

It contains results for all 3,221 counties in the country. But in addition
to the columns we explicitly asked for and the two that identify the state
and county of each row, there is a final column called ``geometry`` that
represents the geometry of the county.
The entire data frame is actually a ``GeoDataFrame``, which is an extension
of the Pandas ``DataFrame`` you are probably used to.

Now we can plot data in our geo-data frame as follows::

    import censusdis.maps as cem

    ax = cem.plot_us(
        gdf_counties,
        MEDIAN_HOUSEHOLD_INCOME_VARIABLE,
        cmap="autumn",
        legend=True,
        vmin=0.0,
        vmax=150_000,
        figsize=(12, 6)
    )

    ax.set_title(f"{YEAR} Median Household Income by County")
    ax.axis("off")

The resulting plot looks like

.. image:: _static/images/US-median.png

We used ``cem.plot_us`` because it does some nice things for us, like
relocating Alaska, Hawaii, and Puerto Rico from their actual longitude and
latitude to locations that allow us to plot the map more compactly. In
addition to doing this relocation, ``cem.plot_us`` takes the same ``*args``
and ``**kwargs`` that Matplotlib normally takes.

Additional Examples in Notebooks
--------------------------------

There are more advanced examples, along with additional maps and
visualizations, in the `Demo Notebooks <./notebooks.html>`_.

Census API Key (Optional Initially)
-----------------------------------

The Census API that ``censusdis`` calls recommends the use of an API key.
But chances are you were able to make it through all of the examples above
without one. This is because you don't need a key until you are making a
large number of queries. But once you are making a large number of queries
or putting ``censusdis`` into a production pipeline, you should return to
this section and obtain a key.

Luckily, the key is free and easy to get, and once you have a key in a file
in the right place on your machine, ``censusdis`` will automatically use it
with every call to the Census API.

To obtain a key, visit `this page `_. The key will be sent to you by email.
It will be a long string of numbers and letters. Make a directory called
``.censusdis`` in your home directory, then inside that directory create a
text file called ``api_key.txt``.
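
As a minimal sketch of the directory and file layout just described, the
same setup could be done from Python using only the standard library. The
helper name ``write_api_key`` and its ``home`` parameter are hypothetical,
introduced here purely for illustration; they are not part of ``censusdis``.

```python
from pathlib import Path
from typing import Optional


def write_api_key(key: str, home: Optional[Path] = None) -> Path:
    """Store a Census API key at <home>/.censusdis/api_key.txt.

    Hypothetical helper for illustration; not part of censusdis.
    """
    # Default to the current user's home directory.
    base = Path.home() if home is None else Path(home)

    # Create the .censusdis directory if it does not exist yet.
    key_dir = base / ".censusdis"
    key_dir.mkdir(parents=True, exist_ok=True)

    # Write the key as the only line of the file, without stray whitespace.
    key_file = key_dir / "api_key.txt"
    key_file.write_text(key.strip())
    return key_file
```

Calling ``write_api_key("YOUR_KEY_HERE")`` with the string from your email
would create the file under your home directory; passing ``home`` lets you
try it against a scratch directory first.
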
The file should have just one line, and you should paste the key you got
via email into it. Once this is done, all ``censusdis`` calls to the Census
API will use this key.

Help and Issues
---------------

If you have questions or want to report a bug or feature request, please
contact us by opening an issue at
https://github.com/vengroff/censusdis/issues.