May 12 Workshop Notes: Discussion and Concerns
On May 12, the Association of Public Data Users and the Massive Data Institute at Georgetown University held a town hall session on Solving Data “Differences” – Assessing the Use Cases.
For the panel, Amy O’Hara, Research Professor at the Massive Data Institute, joined Connie Citro, Former Director of the Committee on National Statistics (CNSTAT), Joe Salvo, Former Director of the NYC Department of City Planning, and Chris Dick, Founder of Demographic Analytics Advisors. The panel discussed implications of new methods employed in the 2020 Census – above all, the Disclosure Avoidance System (DAS) and differential privacy – on common use cases of census data. With an interactive format including breakout room discussions, the panel solicited questions and concerns from the audience on use cases including urban/rural, housing, workforce, health, and justice issues. The panel and attendees engaged in a fruitful conversation about the implications of these changes and what users would like to see given the need to balance privacy and utility for different data categories and use cases.
During the event, facilitators invited attendees into breakout rooms to discuss concerns about the quality of the decennial census data being released this year. Two main themes emerged from those discussions, along with an assortment of other concerns.
Balancing Privacy and Utility
First, participants are concerned with balancing the privacy and utility of data. For example, the Census Bureau’s new privacy mechanism, known as the Disclosure Avoidance System (DAS), uses a process known as differential privacy to limit the identification of individuals from granular data. There are currently no tools available that explain these changes in layman’s terms, which leaves data users unclear on fundamental questions such as whether new data from the Census Bureau will be comparable to non-DAS data.
In particular, there is concern about whether data on smaller populations (and smaller sample sizes), such as American Indians, will be fit for use. Researchers wanting to conduct analyses of housing structure by race, for example, are uncertain whether the data will be accurate for all groups. These concerns extend beyond the planned 2020 census releases. Participants were confused and concerned about how the population base from the 2020 census will affect the American Community Survey and population estimates.
Further, with the Disclosure Avoidance System and differential privacy, household and population data may be inconsistent, because person and household records will not be processed simultaneously and therefore will not be linked. This will prevent meaningful measures of persons per household. There are also concerns about the level of noise in the data and how it will affect the ability of local governments to serve their communities.
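The persons-per-household concern can be illustrated with a small simulation. The sketch below is not the Census Bureau's algorithm: it assumes simple Laplace noise added independently to person counts and household counts, followed by rounding and clamping at zero as a crude stand-in for post-processing. The block counts, the `NOISE_SCALE` value, and all function names are hypothetical and chosen only for illustration.

```python
import random


def laplace_noise(scale, rng):
    """Draw one sample from a Laplace(0, scale) distribution."""
    # The difference of two i.i.d. exponential draws is Laplace-distributed.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(true_count, scale, rng):
    """Add noise to a count, then round and clamp at zero."""
    return max(0, round(true_count + laplace_noise(scale, rng)))


def persons_per_household(persons, households):
    """Return the ratio, or None when there are no households to divide by."""
    return persons / households if households > 0 else None


rng = random.Random(2020)  # fixed seed so repeated runs give the same draws
NOISE_SCALE = 2.0  # hypothetical value, not a Census Bureau parameter

# Hypothetical blocks as (true persons, true households).
blocks = [(5, 2), (12, 5), (40, 16), (400, 160)]

for persons, households in blocks:
    # Persons and households are perturbed separately, so nothing ties
    # the two noisy values to each other.
    noisy_p = noisy_count(persons, NOISE_SCALE, rng)
    noisy_h = noisy_count(households, NOISE_SCALE, rng)
    print(
        f"true {persons}/{households} = {persons_per_household(persons, households)}",
        f"| noisy {noisy_p}/{noisy_h} = {persons_per_household(noisy_p, noisy_h)}",
    )
```

Because the two counts are perturbed separately, the noisy ratio for a small block can move far from the true value, or become undefined when the noisy household count is zero, while the ratio for a large block changes little. This is the pattern behind the concern about small geographies.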
Census data has a wide variety of use cases, yet nearly all discussions of the DAS have focused on the Redistricting File to be published in summer 2021. Participants questioned how all the use cases for the Demographic and Housing Characteristics File will be handled, and whether different versions of datasets, or some other workaround, will be needed to fit different use cases.
It is important that the Census Bureau find a way to explain differential privacy to laypeople through educational trainings and tools. The Bureau must work with community groups to share information and build grassroots understanding. Story maps like On the Map or GIS visualizations could supplement these trainings by showing how current statistics are affected by these developments.
With the need to balance privacy and data quality in mind, what compromises were acceptable to attendees? The practice of “binning” data may be an option – for example, releasing three-group race data rather than four-group data. For race and Hispanic origin, some attendees indicated that summary race data was usable, but that keeping block-level data available is essential. Block-level data in general has been helpful for cross-walking between tracts, which supports infrastructure planning. Some participants felt that detailed age data may be more important than race data. For some, it would be preferable to forfeit highly detailed tables in order to preserve publication for more geographies. Many participants favored data accuracy over granularity (though without consensus on which statistics to roll up or suppress).
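The trade-off behind binning can be sketched with a short calculation. The example below assumes Laplace noise with the same scale applied to every published cell, in which case the expected absolute error equals the scale and the expected relative error is the scale divided by the true count. The category names, counts, and `NOISE_SCALE` are hypothetical, and the actual DAS noise and post-processing are more elaborate than this.

```python
def expected_relative_error(true_count, noise_scale):
    """Expected |noise| / true count for Laplace noise.

    The mean absolute value of Laplace(0, b) noise is b.
    """
    return noise_scale / true_count


def bin_counts(counts, mapping):
    """Collapse detailed categories into coarser groups by summing counts."""
    binned = {}
    for category, count in counts.items():
        group = mapping[category]
        binned[group] = binned.get(group, 0) + count
    return binned


NOISE_SCALE = 2.0  # hypothetical value, assumed identical for every cell

# Hypothetical four-group table for one small geography.
detailed = {"Group A": 180, "Group B": 40, "Group C": 6, "Group D": 4}

# Hypothetical three-group version: the two smallest groups are combined.
mapping = {
    "Group A": "Group A",
    "Group B": "Group B",
    "Group C": "Groups C and D",
    "Group D": "Groups C and D",
}
binned = bin_counts(detailed, mapping)

for label, table in (("four-group", detailed), ("three-group", binned)):
    print(label)
    for category, count in table.items():
        error = expected_relative_error(count, NOISE_SCALE)
        print(f"  {category}: count={count}, expected relative error={error:.0%}")
```

Under these assumptions the combined cell carries a smaller relative error than either of the small cells it replaces, at the cost of no longer distinguishing those two groups. That is the accuracy-versus-granularity choice attendees described.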
Data Categories and Use Cases
Attendees had various concerns related to specific data categories and use cases. With regard to urban and rural geographies, it is helpful to minimize constraints on data. Attendees were concerned about definition changes that may be implemented, such as the change to the definition of metropolitan statistical areas, and how such changes will affect funding allocations and metropolitan planning organizations. In addition, it is unclear whether differences in data collection and other characteristics between rural and urban areas will cause disproportionate errors in imputation.
Group quarters also present unique challenges and opportunities for the census. As many group quarters are businesses or government facilities, extensive administrative data are often available on these facilities, and can play a role in producing more accurate group quarters population counts.
Data Collection
Large-scale changes have occurred in recent decennial censuses in the way the Census Bureau collects data, such as internet response and greater use of administrative data. Users are interested in more information about changes from prior decades and how those changes affected data quality. For example, it would be helpful to have a step-by-step guide, available in a single location, that explains both the planned changes to the 2020 census and the changes imposed on the Census Bureau by COVID-19.
Attendees were also monitoring decennial changes unrelated to differential privacy. Housing and housing stock changes and internet self-response (especially in areas with poor internet connections, such as rural areas or impoverished inner cities) will impact data in ways that are yet to be determined. Also, during pandemic lockdowns, people moved to unexpected places, exacerbating the typical springtime “snowbird effect.” Finally, there are concerns about duplicate entries resulting from non-ID submissions.