At the 2014 APDU Annual Conference, a session entitled “Administrative Data: Promise vs. Privacy” weighed the benefits of using administrative data against concerns about personal privacy.
Statisticians, economists, data nerds, and concerned citizens discussed the possibilities and perils of wielding the staggering amount of data being collected by the federal government. Speakers from the Centers for Medicare & Medicaid Services (CMS), the U.S. Census Bureau, and the Bureau of Labor Statistics (BLS) led the discussion, which revealed some of the federal government’s newest innovations in data collection and analysis while explaining how the U.S. government ensures that Americans’ privacy is preserved.
These innovations include two products built on the BLS Quarterly Census of Employment and Wages (QCEW): QCEW Data Access, designed to allow third-party programmers, developers, and organizations to retrieve published QCEW data in CSV format, and QCEW Hurricane Flood Zone Maps, which measure the impact of hurricanes, according to magnitude, on workforces within flood zones. New statistical innovations from the Census Bureau include Longitudinal Employer-Household Dynamics (LEHD), designed to link survey, census, and administrative records in order to produce new demographic statistics in times of tight budgets, and the Quarterly Workforce Indicators (QWI) Explorer, which provides the demographic composition of each state’s and the nation’s labor force and industry portfolio.
So how does the federal government plan to preserve Americans’ privacy in spite of new statistical innovations? Our expert panel explained to the audience some of the federal government’s most effective practices in data disclosure limitation and dissemination. These practices include:
- Perturbation/Noise Infusion – According to the U.S. Census Bureau, noise infusion is a method of disclosure avoidance in which values for each establishment are perturbed (distorted) prior to table creation. Perturbation typically changes the true value by roughly 2% to 5% of the original value; the lower the number, the greater the perturbation.
- Restricting Access or Stripping Unique Identifiers – An identifier serves as a reference to each individual respondent of the survey. Restricting access, or stripping the identifier and replacing it with another, can create millions of numeral combinations, which slows down hackers. For instance, a six-digit identifier drawn from the digits 0-9 has 1,000,000 possible values if digits may repeat, and still 136,080 if no digit repeats and the first digit is not zero. A hacker would have to try these combinations in order to pinpoint an individual’s personal information.
- Thresholds – CMS does not display values less than 10 within each cell, including cells where mathematical formulas would produce a value of less than 10.
- Cell Suppression – A very common disclosure limitation technique that employs a linear algorithm to identify sensitive cells belonging to the same respondent within a data table (or multiple data tables) so that they can be withheld from disclosure.
- P% Rule – A cell suppression technique aimed at protecting a respondent from being singled out from a larger group of respondents. The p% rule identifies and suppresses cells (belonging to an individual respondent) whose true value can be estimated (by other respondents’ manipulation of their own data) to within a certain percentage (p%) of that value, for example through an internal process of elimination.
- Rigorous Application Process – Typically utilized by qualified researchers and statisticians who test the outcomes of statistical models versus data tables. Once approved, researchers are given access to very detailed data, but never at its most granular (individual-person) level, and each researcher’s access is auditable.
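
For readers who think in code, a few of the practices above can be sketched in Python. This is a rough illustration under stated assumptions, not the agencies’ actual procedures: every function name here is hypothetical, the noise scheme (a uniform 2% to 5% shift in a random direction) is a simplification, and the p% check uses the commonly cited textbook form, in which a cell is sensitive when the cell total minus its two largest contributions is less than p% of the largest contribution.

```python
import math
import random


def count_identifiers(length=6):
    """Count digit identifiers of a given length (hypothetical helper)."""
    with_repeats = 10 ** length
    # No repeated digits and a nonzero first digit: 9 choices for the first
    # digit, then an ordered pick of the rest from the 9 remaining digits.
    distinct_no_leading_zero = 9 * math.perm(9, length - 1)
    return with_repeats, distinct_no_leading_zero


def infuse_noise(value, rng, low=0.02, high=0.05):
    """Distort a value by 2%-5% in a random direction (assumed scheme)."""
    factor = rng.uniform(low, high)
    sign = rng.choice((-1, 1))
    return value * (1 + sign * factor)


def apply_threshold(cells, minimum=10):
    """Withhold (set to None) every cell whose value is below `minimum`."""
    return {name: (v if v >= minimum else None) for name, v in cells.items()}


def p_percent_sensitive(contributions, p):
    """Textbook p% rule: is this cell sensitive and due for suppression?"""
    ordered = sorted(contributions, reverse=True)
    largest = ordered[0]
    second = ordered[1] if len(ordered) > 1 else 0
    # What is left once the two largest contributors are removed.
    remainder = sum(ordered) - largest - second
    return remainder < (p / 100) * largest


print(count_identifiers(6))                         # (1000000, 136080)
print(apply_threshold({"A": 4, "B": 25}))           # {'A': None, 'B': 25}
print(p_percent_sensitive([900, 60, 5], p=10))      # True, since 5 < 90
print(p_percent_sensitive([400, 300, 200], p=10))   # False, since 200 >= 40
print(infuse_noise(1000, random.Random(0)))         # within 2%-5% of 1000
```

In the third call, the second-largest contributor (60) could subtract its own value from the cell total and land within 10% of the largest contributor’s 900, which is exactly the kind of process of elimination the rule is meant to block.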
The federal government continues to find innovative techniques for analyzing existing sources of data. Together, these topics made for a compelling discussion of data sharing and encryption, one that promises to continue.