There is great appetite to study query logs as a rich window into human intent, but history shows that the privacy concerns are broad and well-founded. It is important to anonymize the query logs before attempting any public release.
We study the privacy-preservation properties of a simple query log anonymization technique: each query is tokenized and a secure hash function is applied to each token. We then investigate the risk of revealing user identity from query logs in two different scenarios. In the first setting, we study the application of simple classifiers to map a sequence of queries to the demographics of the user issuing the queries. In the second setting, we examine an anonymization approach that bundles the logs of multiple users together.
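As a rough illustration of the token-hashing technique described above, the following minimal sketch tokenizes a query on whitespace and hashes each token. The function name `anonymize_query`, the choice of SHA-256, the lowercasing step, and the optional `salt` parameter are assumptions made for this example; the abstract does not specify the tokenizer or the hash function used.

```python
import hashlib


def anonymize_query(query: str, salt: bytes = b"") -> list[str]:
    """Tokenize a query and replace each token with its hash.

    Assumptions for illustration only: whitespace tokenization,
    lowercasing, and SHA-256 as the secure hash function.
    """
    tokens = query.lower().split()
    return [
        hashlib.sha256(salt + token.encode("utf-8")).hexdigest()
        for token in tokens
    ]


# Hypothetical example queries, not taken from any real log.
first = anonymize_query("cheap flights boston")
second = anonymize_query("boston weather")

# The same token always maps to the same hash, in every query.
same_token_same_hash = first[2] == second[0]
```

Because the hash is deterministic, identical tokens produce identical hashes in every query, so token frequencies and co-occurrence patterns remain visible in the anonymized log. This is the kind of residual signal whose privacy risk is examined here.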