Worries about big data and privacy are all over the news, but our new research shows that big data can also help us better understand users' privacy concerns. The latest issue of Harvard Magazine proclaimed that a big data revolution is at hand, and earlier this year President Obama ordered a big data review to explore policy issues raised by advances in technology. A Time article last month even complained, somewhat implausibly, that it was nearly impossible to opt out of big data. But big data can also be used to shield and enhance privacy by letting us understand far more about what Internet users really care about. It's not merely the traditional categories of sensitive data that intuition might suggest, namely sex, money, and medical history. In our new research paper, presented last week at the Security and Privacy conference, we show that people are far more complex than conventional wisdom suggests.
To help us figure out which topics people view as privacy-sensitive, we scrutinized over a million questions and answers on Quora. Why Quora? We considered several options, but Quora was the best choice because it offers the unique feature of allowing people to answer with their real name or anonymously, as illustrated in the screenshot below.
Our data analysis revealed that data sensitivity should not be viewed as a binary concept. It's far more nuanced, and is really a continuum along many thousands of topics. By evaluating Quora's questions and answers, we were able to calculate what we call an anonymity ratio. The higher it is, the more likely it is that discussions relating to that topic will take place anonymously. While some of Quora's topics with a sky-high anonymity ratio (at least three standard deviations above the mean) are what conventional wisdom suggests (masturbation, LSD, rape, abuse, sexual orientation, transgender, cannabis, and porn), others are far from what you would expect. For instance, as visible in the tag cloud below, sensitive topics include discussions about companies like Zynga and Palantir, disclosures about relationships with grandparents, emotions including shame and aggression, and even details about careers as a model and astronaut. Others are Lady Gaga, bathroom etiquette, being single, and universities including Harvard and the University of Pennsylvania.
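For readers who want a feel for the arithmetic, here is a minimal Python sketch of the idea. It is not the method from our paper: it assumes, purely for illustration, that a topic's anonymity ratio is the fraction of its answers posted anonymously, and that the input is a list of (topic, is_anonymous) pairs. The function names anonymity_ratios and flag_sensitive_topics are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def anonymity_ratios(answers):
    """Return {topic: fraction of answers posted anonymously}.

    `answers` is an iterable of (topic, is_anonymous) pairs.
    ASSUMPTION: this simple fraction stands in for the anonymity
    ratio; the paper's exact definition may differ.
    """
    totals = defaultdict(int)
    anonymous = defaultdict(int)
    for topic, is_anonymous in answers:
        totals[topic] += 1
        if is_anonymous:
            anonymous[topic] += 1
    return {topic: anonymous[topic] / totals[topic] for topic in totals}


def flag_sensitive_topics(ratios, k=3.0):
    """Return topics whose ratio is at least k standard deviations above the mean."""
    values = list(ratios.values())
    if not values:
        return []
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation across topics
    if sigma == 0:
        return []  # every topic has the same ratio, so nothing stands out
    threshold = mu + k * sigma
    return sorted(topic for topic, ratio in ratios.items() if ratio >= threshold)
```

Calling flag_sensitive_topics(anonymity_ratios(answers)) on such a list returns the topics that clear the three-standard-deviation bar. A real analysis would also need to handle topics with very few answers, whose ratios are noisy.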
Similarly, the individual words that are highly associated with privacy sensitivity show a nuanced perspective. Internet users often feel the need to discuss their family and education anonymously, as you can see in the following tag cloud:
While the idea of using big data to understand which topics users consider privacy-sensitive is simple, as with all good ideas, making it work turned out to be somewhat tricky. I had to come up with a creative use of statistics and machine learning techniques to isolate the topics and validate the results. Over the last few months, we've been working on some follow-up research building on the same line of thinking. It appears promising and confirms that this approach has the potential to transform how privacy research is done. Our overall hope is to use big data to develop better policies and encourage companies to build features in products that improve their users' privacy. To learn more, read our paper and let me know in the comments which topics you find sensitive!