The AOL data mess

Not surprisingly, this is the kind of topic that spreads like wildfire across blogland.
[Image: AOL search data snippet]

AOL Research released (link to Google cache page) the search queries of hundreds of thousands of its users over a three-month period. While user IDs are not included in the data set, all the search terms have been left untouched. Needless to say, many of those searches could include all sorts of private information that could identify a user.

The privacy problems are obvious and have been discussed by many others. (See the blog posts linked above.) By not focusing on that aspect I do not mean to diminish its importance; I think it's very grave. But since so many others are covering it, I'll focus on another aspect of this fiasco.

As someone who has research interests in this area and has been trying to get search companies to release some data for purely academic purposes, I find an incident like this extremely unfortunate. Not that search companies have been particularly cooperative so far (not surprisingly, judging by this case), but chances for future cooperation in this realm have just taken a nosedive.

To some extent I understand. No company wants to end up with this kind of a mess on their hands. And it would take way too much work on their part to remove all identifying information from a data set of this sort. I still wonder if there are possible work-arounds though, such as allowing access on the premises or some such solution. But again, that’s a lot of trouble, and why would they want to bother? Researchers like me would like to think we can bring something new to the table, but that may not be worth the risk.

Note, however, that dealing with sensitive data is nothing new in academic research. People are given access to very detailed Census data, for example, and confidentiality is preserved. From what I can tell, the problem here did not stem from researchers; it was someone at AOL who was careless with the information. But the outcome will likely be less access to data for all sorts of researchers.

Another question of interest: now that these data have been made public, what are the chances of approval from a university's institutional review board for work on this data set? (Alex raises related questions as well.) Would approval be granted? These users did not consent to their data being used for such purposes. But the data have been made public and theoretically do not contain any identifying information. Even if they do, the researcher could promise that results would only be reported in the aggregate, leaving out any potentially identifying information. Hmm…

For sure, this will be a great example in class when I teach about the privacy implications of online behavior.

Not surprisingly, people are already crunching the data set; here are some tidbits from it.

Apropos of the little snippet I grabbed from the data (see image above), see this paper of mine for an exploration of spelling mistakes made while using search engines and browsing the Web. About a third of that sample were AOL users.

The image above is from data in the xxx-01.txt file.
