The AOL data mess
Not surprisingly this is the kind of topic that spreads like wildfire across blogland.
AOL Research released (link to Google cache page) the search queries of hundreds of thousands of its users over a three month period. While user IDs are not included in the data set, all the search terms have been left untouched. Needless to say, lots of searches could include all sorts of private information that could identify a user.
The problems in the realm of privacy are obvious and have been discussed by many others so I won’t bother with that part. (See the blog posts linked above.) By not focusing on that aspect I do not mean to diminish its importance. I think it’s very grave. But many others are talking about it so I’ll focus on another aspect of this fiasco.
As someone who has research interests in this area and has been trying to get search companies to release some data for purely academic purposes, needless to say an incident like this is extremely unfortunate. Not that search companies have been particularly cooperative so far – based on this case not surprisingly -, but chances for future cooperation in this realm have just taken a nosedive.
To some extent I understand. No company wants to end up with this kind of a mess on their hands. And it would take way too much work on their part to remove all identifying information from a data set of this sort. I still wonder if there are possible work-arounds though, such as allowing access on the premises or some such solution. But again, that’s a lot of trouble, and why would they want to bother? Researchers like me would like to think we can bring something new to the table, but that may not be worth the risk.
Note, however, that dealing with sensitive data is nothing new in academic research. People are given access to very detailed Census data, for example, and confidentiality is preserved. From what I can tell the problem here did not stem from researchers, it was someone at AOL who was careless with the information. But the outcome will likely be less access to data for all sorts of researchers.
Another question of interest: Now that these data have been made public what are the chances for approval from a university’s institutional review board for work on this data set? (Alex raises related questions as well.) Would an approval be granted? These users did not consent to their data being used for such purposes. But the data have been made public and theoretically do not contain any identifying information. Even if they do, the researcher could promise that results would only be reported in the aggregate leaving out any potentially identifying information. Hmm…
For sure, this will be a great example in class when I teach about the privacy implications of online behavior.
Not surprisingly, people are already crunching the data set, here are some tidbits from it.
A propos the little snippet I grabbed from the data (see image above), see this paper of mine for an exploration of spelling mistakes made while using search engines and browsing the Web. About a third of that sample was AOL users.
The image above is from data in the xxx-01.txt file.