How Will Differential Privacy Impact 2020 Census Data Quality?
What are the new security concerns regarding the 2020 Census?
There is a growing threat that Census data can be used to re-identify respondents’ personal information. One way this can occur is when Census data is combined with publicly available data from third parties, such as a commercially available list of names and addresses. When this data is combined with Census records, information about sex, race, or other personal characteristics could potentially be linked with a specific respondent. Alternatively, or in combination with this, large-scale computing power can potentially reconstruct individual records from aggregated data across multiple tables by solving for the set of records that best fits all of the published tables, similar to how a sudoku puzzle is solved.
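To make the sudoku comparison concrete, here is a minimal sketch in Python. The block, its three residents, and every published statistic below are invented for illustration, and the brute-force search stands in for the constraint solvers a real attack would more plausibly use. It only shows how each published table rules out candidate sets of records.

```python
from itertools import combinations_with_replacement
from math import comb

# Hypothetical published statistics for one tiny block (invented numbers).
PUBLISHED = {
    "total": 3,
    "females": 1,
    "adults": 2,          # residents aged 18 or older
    "female_adults": 1,
    "mean_age": 30,       # mean age of all residents
}

SEXES = ("F", "M")
AGES = range(0, 91)  # assumption: ages are whole years from 0 to 90


def consistent(records):
    """Return True if a candidate set of (sex, age) records matches every table."""
    # Cheapest check first: total age must equal mean age times population.
    if sum(age for _, age in records) != PUBLISHED["mean_age"] * PUBLISHED["total"]:
        return False
    females = sum(1 for sex, _ in records if sex == "F")
    adults = sum(1 for _, age in records if age >= 18)
    female_adults = sum(1 for sex, age in records if sex == "F" and age >= 18)
    return (
        females == PUBLISHED["females"]
        and adults == PUBLISHED["adults"]
        and female_adults == PUBLISHED["female_adults"]
    )


# Every possible person, then every possible unordered set of three people.
people = [(sex, age) for sex in SEXES for age in AGES]
candidates = [
    combo
    for combo in combinations_with_replacement(people, PUBLISHED["total"])
    if consistent(combo)
]
total_possible = comb(len(people) + PUBLISHED["total"] - 1, PUBLISHED["total"])

print(f"{len(candidates)} of {total_possible} possible databases fit the tables")
```

Each additional table, such as counts by race or by age band, would shrink the list of candidates further. Once only a handful remain, matching them against an outside list of names and addresses becomes feasible.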
For the 2010 decennial count, the U.S. Census Bureau helped protect respondents’ information by “swapping” records, particularly in geographies with small populations. For instance, if a census tract had only three families living in it, a White couple might be “swapped” with an Asian couple living in an adjacent census tract to help prevent someone from identifying individual respondents.
However, several tests undertaken by the Census Bureau between 2016 and 2019 showed that researchers were able to identify respondents by combining available Census tables with commercially available data that included such details as addresses and names. This was possible even with the existing privacy practices. These results prompted the Census Bureau to strengthen its privacy measures and adopt the framework of “differential privacy.”
What is differential privacy?
The U.S. Census Bureau protects respondent data through a combination of reducing the precision of the data, removing vulnerable records, and adding uncertainty. To help quantify the tradeoff between respondent privacy and accuracy of the data, the Census Bureau has adopted a framework called differential privacy that helps control how much uncertainty is added to the data. Under this framework, the 2020 Census data contains values that were modified slightly within a limited range to add “noise” to the data and provide some uncertainty about the actual values. This helps prevent re-identification of respondents’ confidential information. The exception is that national and state total population counts do not have any added noise because those counts are used in the apportionment of congressional seats among the states.
[For those interested in a more detailed explanation, please find additional information at the end of this article.]
What data quality concerns exist around the 2020 Census?
The decennial census is one of the few true enumerations of total population. By contrast, most other population data is estimated from surveys distributed to a sample of the population. This makes the decennial census unique in providing a nearly complete population count, which is one of the main differences between the Census Bureau’s decennial data and the American Community Survey (ACS). For the 2020 Census, except for national and state-level total population counts, the published data values have some added noise to help protect respondents’ privacy. This is not an entirely new practice, as the 2010 Census included “swapping” to help protect respondents’ privacy. The 2020 Census is still likely more accurate than comparable population data from the ACS, but it is important to keep these small inaccuracies in mind when working with the data.
This added noise may be especially apparent at small geographies, in particular blocks and block groups. In fact, the Census Bureau has stated that due to the added noise some small geographies may appear “fuzzy” and seem incorrect. For instance, according to the Census Bureau there may be blocks that appear to have only children living there, blocks where all housing units are occupied but the total population is zero, and blocks with unusually large households. These irregularities should disappear when block-level data is aggregated since combining adjacent geographies should cancel out the “fuzziness” and create a clearer picture.
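The claim that aggregation cancels out the fuzziness can be illustrated with a short simulation. This is a sketch under stated assumptions, not a model of the Census Bureau's actual system: the block populations, the noise standard deviation, and the number of blocks per tract are invented, and the noise is simply independent and zero-centered for each block.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

NOISE_SD = 3.0          # assumed noise standard deviation per block
BLOCKS_PER_TRACT = 40   # assumed number of blocks in a hypothetical tract

# Invented true populations, then the same counts with zero-centered noise.
true_counts = [random.randint(0, 60) for _ in range(BLOCKS_PER_TRACT)]
noisy_counts = [c + round(random.gauss(0, NOISE_SD)) for c in true_counts]

# Error when each block is read on its own, summed over all blocks.
block_errors = [abs(n - t) for n, t in zip(noisy_counts, true_counts)]
total_block_error = sum(block_errors)

# Error when the blocks are first added up into one tract-level count.
tract_error = abs(sum(noisy_counts) - sum(true_counts))

tract_population = sum(true_counts)
print(f"sum of block-level errors: {total_block_error} "
      f"({100 * total_block_error / tract_population:.1f}% of population)")
print(f"tract-level error:         {tract_error} "
      f"({100 * tract_error / tract_population:.1f}% of population)")
```

Because positive and negative errors offset one another, the tract-level error can never exceed the sum of the block-level errors, and in practice it is usually far smaller.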
While the Census Bureau has always discouraged looking at individual block-level data, extra caution should be taken when looking at other small geographies for the 2020 Census, such as block groups. While no formal rule exists, the Census Bureau and other researchers have found that this fuzziness mostly goes away when looking at aggregated geographies that are equal to or larger than census tracts. Because of these limitations, PolicyMap may consider removing trend charts and report capability for block group-level data in the future.
Additionally, extra caution should be taken when using 2020 Census data for geographies not on the Census Bureau’s standard hierarchy. While the Census Bureau took extra steps to minimize systematic undercounting for American Indian/Alaska Native/Native Hawaiian (AIANNH) areas and Places, this was not necessarily the case with other less standard (“off-spine”) geographies.
Data and demography experts disagree on the various approaches to protecting privacy and preserving data quality. PolicyMap is following the discussions, primarily via the American Statistical Association’s updates. As we learn more, we will keep our users updated on any changes we deem necessary to the PolicyMap platform, such as the removal of block group-level analysis features discussed above, as well as on impacts to any forthcoming small area estimates that we build as part of PolicyMap’s own exclusive data.
Digging deeper into differential privacy
Disclosure avoidance is a set of methods that make reidentification attacks harder through a combination of reducing the precision of the data, removing vulnerable records, and adding uncertainty. The Census Bureau has employed disclosure avoidance methods in the past, such as top/bottom coding of extreme values, rounding published statistics, and record swapping. Record swapping was already used in the 2010 decennial data, as previously mentioned.
What is new for 2020 is that the Census Bureau is employing differential privacy, which is a framework for defining and quantifying disclosure avoidance. Any disclosure avoidance technique necessarily introduces data inaccuracy. Differential privacy is designed to help quantify the tradeoff between respondent privacy and accuracy of the data. With this change, however, some data products are likely to have more noise introduced than in previous releases.
The specific disclosure avoidance method that the Census Bureau developed is called the TopDown Algorithm (TDA). To summarize how this works, the Census Bureau first establishes a privacy-loss budget for each data product and then allocates this budget across each data query and level of geography. Next, respondent data is aggregated into a histogram of unique attribute combinations for each data query.
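The two preparatory steps described above, splitting a privacy-loss budget and building a histogram, can be sketched in a few lines of Python. All of the records, the total budget, and the per-level shares below are hypothetical values chosen for illustration; they are not the Census Bureau's actual parameters.

```python
from collections import Counter

# Hypothetical respondent records for one block: (age_group, sex, race).
records = [
    ("adult", "F", "White"),
    ("adult", "M", "White"),
    ("child", "F", "White"),
    ("adult", "F", "Asian"),
    ("adult", "F", "Asian"),
]

# Histogram: one cell per unique attribute combination, holding its count.
histogram = Counter(records)

# Illustrative privacy-loss budget and the share given to each geographic level.
TOTAL_BUDGET = 1.0
LEVEL_SHARES = {
    "nation": 0.10,
    "state": 0.20,
    "county": 0.20,
    "tract": 0.20,
    "block_group": 0.15,
    "block": 0.15,
}
level_budgets = {level: TOTAL_BUDGET * share for level, share in LEVEL_SHARES.items()}

for cell, count in histogram.items():
    print(cell, count)
for level, budget in level_budgets.items():
    print(f"{level}: {budget:.2f}")
```

The design point is that the shares must add up to the whole budget: giving one level or query a larger share, and therefore less noise, means another level or query gets a smaller one.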
The TDA then introduces noise into the data to create “fuzziness.” This added noise is drawn from a probability distribution centered on zero, with the variance determined by the share of the privacy-loss budget allocated to the query at that geographic level. Because the distribution is centered on zero, the expected amount of noise added to each cell of the histogram is zero. In other words, positive and negative errors tend to offset each other across many cells, although any individual cell may still be off by some amount. These noisy measurements are independent of each other and require further post-processing to make the data internally consistent and non-negative.
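The link between budget share and variance can be sketched as follows. This is an illustration under assumptions rather than the Census Bureau's implementation: it uses continuous Laplace noise with scale 1/epsilon as a simple stand-in (the production system is understood to draw integer-valued noise from a discrete Gaussian distribution), it assumes each person affects a cell count by at most one, and the cell names and budget values are invented.

```python
import random
import statistics


def laplace_noise(epsilon, rng):
    """Draw zero-centered Laplace noise with scale 1/epsilon (sensitivity assumed 1)."""
    # The difference of two exponentials with the same rate is Laplace-distributed.
    return rng.expovariate(epsilon) - rng.expovariate(epsilon)


rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Invented histogram cells for one geography.
true_cells = {"adult_F": 12, "adult_M": 9, "child_F": 4, "child_M": 0}

# A larger budget share (epsilon) means less noise; a smaller share means more.
for epsilon in (2.0, 0.2):
    noisy = {cell: count + laplace_noise(epsilon, rng) for cell, count in true_cells.items()}
    print(f"epsilon={epsilon}:", {cell: round(value, 1) for cell, value in noisy.items()})

# Empirical check of the spread and the center over many draws.
draws_large_budget = [laplace_noise(2.0, rng) for _ in range(20000)]
draws_small_budget = [laplace_noise(0.2, rng) for _ in range(20000)]
spread_large_budget = statistics.pstdev(draws_large_budget)
spread_small_budget = statistics.pstdev(draws_small_budget)
mean_large_budget = statistics.fmean(draws_large_budget)
```

Note that the noisy values can be fractional or negative, which is why a post-processing step is still needed before anything is published.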
The post-processing is top-down: it starts with the nation, then moves on to states, then counties, then census tracts, and so on. This top-down approach ensures that smaller geography counts sum to larger geography totals. It also lets small geographies with limited records “borrow accuracy” from larger geographies. Note that this hierarchy initially included only the standard Census hierarchy of geographies.
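A toy version of one top-down step is sketched below. The function name `fit_children_to_parent` and all counts are hypothetical, and the method used here (clipping negatives, proportional scaling, then largest-remainder rounding) is a simple stand-in for the constrained optimization that a production system would more plausibly use. It only demonstrates the two guarantees mentioned above: non-negative counts, and child geographies that sum to their parent.

```python
def fit_children_to_parent(parent_total, noisy_children):
    """Adjust noisy child counts to non-negative integers that sum to parent_total."""
    clipped = [max(0.0, value) for value in noisy_children]
    clipped_sum = sum(clipped)
    if clipped_sum == 0:
        # No usable signal in the children: spread the parent total evenly.
        scaled = [parent_total / len(clipped)] * len(clipped)
    else:
        scaled = [value * parent_total / clipped_sum for value in clipped]
    # Largest-remainder rounding keeps the sum exactly equal to parent_total.
    result = [int(value) for value in scaled]
    shortfall = parent_total - sum(result)
    by_remainder = sorted(
        range(len(scaled)), key=lambda i: scaled[i] - result[i], reverse=True
    )
    for i in by_remainder[:shortfall]:
        result[i] += 1
    return result


# The state total is fixed first; counties are then fitted to it.
state_total = 100
noisy_counties = [52.3, 31.9, -1.4, 20.6]
counties = fit_children_to_parent(state_total, noisy_counties)

# Each county total then becomes the fixed parent for its own tracts.
noisy_tracts_in_first_county = [28.7, 25.1, -3.0]
tracts = fit_children_to_parent(counties[0], noisy_tracts_in_first_county)

print("counties:", counties, "sum =", sum(counties))
print("tracts in first county:", tracts, "sum =", sum(tracts))
```

Because each level is fitted to an already-fixed parent, errors at small geographies are constrained by the more accurate totals above them, which is one way to read the “borrow accuracy” idea.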
What the Census Bureau calls “off-spine” geographies, such as Places or American Indian/Alaska Native/Native Hawaiian (AIANNH) areas, were not initially included in this top-down approach. Adding these geographies as separate levels would have further thinned out the privacy-loss budget for each data query and introduced more data inaccuracy. To get around this, the Census Bureau internally incorporated AIANNH areas and Places (including some Minor Civil Divisions in states where these are commonly used, such as those in New England) into the standard geography hierarchy processing to minimize the likelihood that post-processing would result in systematic undercounts. In short, the Census Bureau was able to include data for AIANNH areas and Places with minimal added loss of accuracy. Not every “off-spine” geography was included in this process, and ZIP Code Tabulation Areas (ZCTAs) were left out of the initial 2020 Census data release altogether.
There are two exceptions to the differential privacy process: national and state-level total population counts. These counts do not have any added noise because they are used in the apportionment of congressional seats among the states and are required by the Constitution to be as close as possible to a true count.