“Anonymized” Data Really Isn’t
I enjoy watching re-runs of the television drama NCIS, where a dysfunctional little group of crime-fighting superstars often analyzes divergent bits of data to solve seemingly unsolvable mysteries. Last night, Agent McGee correlated data from phone records, automobile registrations and police station activity records to pinpoint a bad cop in collusion with an international drug lord. Far-fetched? Perhaps not.
I have been spending much of my time recently preparing a white paper addressing the issues of HIPAA privacy and security compliance, particularly in light of expanded regulations emerging from the “stimulus bill” signed into law earlier this year. While exploring privacy issues related to electronic health records, I was particularly intrigued by an article by Nate Anderson entitled “’Anonymized’ Data Really Isn’t and here’s why not”, published in Ars Technica earlier this week.
On the surface, it would seem that removing obvious identifiers such as name, address and Social Security Number from a person’s data record would cause that record to be “anonymous” – not traceable to a single individual. This approach is commonly used by large data repositories and marketing firms to allow mass data analysis or demographically targeted advertising.
However, work by computer scientists over the past fifteen years shows that it is quite straightforward to extract personal information by analyzing seemingly unrelated, “anonymized” data sets. This work has “shown a serious flaw in the basic idea behind ‘personal information’: almost all information can be ‘personal’ when combined with enough other relevant bits of data.”
For example, researcher Latanya Sweeney showed in 2000 that “87 percent of all Americans could be uniquely identified using only three bits of information: ZIP code, birthdate, and sex.”
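To make that statistic concrete, here is a minimal, hypothetical sketch of the kind of linkage attack Sweeney described: a “de-identified” table still carries ZIP code, birthdate and sex, and joining it against a public record (a voter list, say) that pairs those same fields with real names re-identifies the rows. Every name, date and diagnosis below is invented for illustration.

# A toy linkage attack in Python: the de-identified visits keep quasi-identifiers
# (ZIP code, birthdate, sex), while a public voter list carries the same
# fields next to real names. Joining the two re-identifies the visits.
# All names, dates and diagnoses are fabricated for illustration.

deidentified_visits = [
    {"zip": "85012", "birthdate": "1961-03-14", "sex": "F", "diagnosis": "diabetes"},
    {"zip": "85012", "birthdate": "1974-07-02", "sex": "M", "diagnosis": "hypertension"},
]

public_voter_list = [
    {"name": "Jane Example", "zip": "85012", "birthdate": "1961-03-14", "sex": "F"},
    {"name": "John Sample", "zip": "85012", "birthdate": "1974-07-02", "sex": "M"},
]

def reidentify(visits, voters):
    # Index the named records by the shared quasi-identifiers...
    index = {(v["zip"], v["birthdate"], v["sex"]): v["name"] for v in voters}
    # ...then look each "anonymous" visit up by the same three fields.
    for visit in visits:
        key = (visit["zip"], visit["birthdate"], visit["sex"])
        print(index.get(key, "no unique match"), "->", visit["diagnosis"])

reidentify(deidentified_visits, public_voter_list)
# Jane Example -> diabetes
# John Sample -> hypertension

In real data those three fields do not single out every person, but by Sweeney’s estimate they do for roughly 87 percent of Americans, which is exactly why scrubbing only names and Social Security Numbers falls short.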
Professor Paul Ohm of the University of Colorado Law School, in his lengthy new paper on “the surprising failure of anonymization,” wrote:
As increasing amounts of information on all of us are collected and disseminated online, scrubbing data just isn’t enough to keep our individual "databases of ruin" out of the hands of the police, political enemies, nosy neighbors, friends, and spies.
If that doesn’t sound scary, just think about your own secrets, large and small—those films you watched, those items you searched for, those pills you took, those forum posts you made. The power of re-identification brings them closer to public exposure every day. So, in a world where the PII concept is dying, how should we start thinking about data privacy and security?
Ohm went on to outline a nightmare scenario:
For almost every person on earth, there is at least one fact about them stored in a computer database that an adversary could use to blackmail, discriminate against, harass, or steal the identity of him or her. I mean more than mere embarrassment or inconvenience; I mean legally cognizable harm. Perhaps it is a fact about past conduct, health, or family shame. For almost every one of us, then, we can assume a hypothetical ‘database of ruin,’ the one containing this fact but until now splintered across dozens of databases on computers around the world, and thus disconnected from our identity. Re-identification has formed the database of ruin and given access to it to our worst enemies.
I won’t ask what your “blackmail-able facts” might be, and won’t tell you mine. But it is sobering to think what abuses might emerge from the continued amassing of online data about all of us. This certainly casts new light on the importance of privacy and security protections for all of our personal data.
(Forgive me if this story is included in your posted articles… I didn’t have time to read through them all.)
My favorite tale related to this was the story of a Yahoo or Google database of search queries that was ‘anonymized’ for research purposes. The data provider took great pains to remove any identifying characteristics from the records; for example, they replaced each IP address with a scrambled one. The data went to a group of researchers, and then the entire data set got posted online.
Each anonymous user’s data could still be grouped by its scrambled IP address.
After some days/weeks, the fatal flaw was found: users, while searching for all kinds of things, also did vanity searches on their names… So, in the grouped data you would see the vanity searches — with their name in clear text — right next to some fetish or decidedly politically incorrect search item!
Mike
Comment by mike waddingham on September 10, 2009 at 9:08 pm

Thanks for sharing the story! Just another example of the flaws of anonymizing.
Comment by Mark Dixon on September 10, 2009 at 9:20 pm