Carnegie Mellon: Put Lawyers To Work Automating Privacy Compliance in Big Data Systems

Turning lawyers into programmers may not seem like the most obvious way to ensure that big data systems comply with an organization's privacy policies. But that is one outcome of a recent project by a team of researchers at Carnegie Mellon and Microsoft Research.

The researchers undertook the challenge of developing a way to replace the tedious manual work of safeguarding user data in large Web services, such as Facebook, Google and Microsoft, with an automated system. The project specifically used Microsoft's search engine Bing as a test case.

The problem is a major one, said lead student researcher Shayak Sen, a Ph.D. candidate in computer science who interned at Microsoft Research India. "Tens of millions of lines of code are already in the pipeline," he noted. "And during our implementation on Bing, we found that more than 20 percent of the code was changing on a daily basis." Without automation, there's no way to keep up with the verification of compliance.

That's where the lawyers come in. The researchers found that those who develop privacy policies within an organization (often lawyers) don't typically speak the same language as the software developers. So the team developed a language, Legalease, simple enough to be used by non-programmers who understand the technicalities of the privacy policies. Legalease enforces syntactic restrictions so that encoded policy clauses are structured similarly to the policy text that defines how user data may be handled.
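To make the idea of policy clauses that mirror policy text more concrete, here is a minimal Python sketch. It is not Legalease itself and does not reproduce its syntax. It assumes, purely for illustration, that a policy sentence such as "deny use of IP addresses for advertising, except when truncated" can be encoded as a clause with nested exceptions. The `Clause` class, the attribute names (`DataType`, `Purpose`, `Form`) and the sample policy are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Clause:
    """A hypothetical policy clause: an effect, the attributes it covers,
    and optional nested exceptions that can override the effect."""
    effect: str                      # "ALLOW" or "DENY"
    attrs: Dict[str, str]            # attribute values the clause applies to
    exceptions: List["Clause"] = field(default_factory=list)

    def matches(self, use: Dict[str, str]) -> bool:
        # The clause applies when every attribute it names has the same value in `use`.
        return all(use.get(key) == value for key, value in self.attrs.items())

    def decide(self, use: Dict[str, str]) -> Optional[str]:
        """Return "ALLOW" or "DENY", or None if the clause does not apply."""
        if not self.matches(use):
            return None
        # The first exception that applies overrides this clause's own effect.
        for exception in self.exceptions:
            verdict = exception.decide(use)
            if verdict is not None:
                return verdict
        return self.effect


# Assumed sample policy: "Deny IP addresses for advertising, except truncated ones."
policy = Clause(
    effect="DENY",
    attrs={"DataType": "IPAddress", "Purpose": "Advertising"},
    exceptions=[Clause(effect="ALLOW", attrs={"Form": "Truncated"})],
)

full_ip = {"DataType": "IPAddress", "Purpose": "Advertising", "Form": "Full"}
short_ip = {"DataType": "IPAddress", "Purpose": "Advertising", "Form": "Truncated"}

print(policy.decide(full_ip))   # DENY
print(policy.decide(short_ip))  # ALLOW
```

The nesting is the point of the sketch: the encoded clause follows the shape of the sentence it came from, which is the property the article attributes to Legalease.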

In usability testing, 12 Microsoft employees were given a one-page document explaining Legalease and spent an average of under five minutes studying the directions. It took them an average of under 15 minutes to program nine Bing policy clauses laying out how user information could be used. "They were able to perform this task with a high degree of accuracy, which is encouraging," said Sen.

But the research didn't end there. As their report, "Bootstrapping Privacy Compliance in Big Data Systems," describes, what the researchers actually developed was a workflow for privacy compliance that targets large codebases written in languages that support the Map-Reduce programming model. That workflow uses Legalease, along with a self-bootstrapping data inventory mapper developed by Microsoft Research that ties low-level data types in the code to the high-level policy concepts. The mapper, called Grok, was deployed by Bing a year before the research began for the purpose of automating policy compliance. At that time, however, the developers found writing policies for Grok too cumbersome.
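The role of a data inventory that ties low-level data types to high-level policy concepts can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a description of how Grok works: the column names, concept labels, job descriptions and the `find_violations` helper are all hypothetical, and the policy is reduced to a set of forbidden (concept, purpose) pairs.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical inventory: low-level column names mapped to policy concepts.
INVENTORY: Dict[str, str] = {
    "client_ip": "IPAddress",
    "q": "SearchQuery",
}

# Hypothetical policy: (concept, purpose) pairs that are not allowed.
DENIED: Set[Tuple[str, str]] = {("IPAddress", "Advertising")}


def find_violations(
    jobs: List[dict],
    inventory: Dict[str, str],
    denied: Set[Tuple[str, str]],
) -> List[Tuple[str, str, str]]:
    """Return (job name, column, concept) for every denied use of a labeled column."""
    found = []
    for job in jobs:
        for column in job["reads"]:
            concept = inventory.get(column)  # unlabeled columns are skipped
            if concept is not None and (concept, job["purpose"]) in denied:
                found.append((job["name"], column, concept))
    return found


# Assumed job descriptions: which columns each job reads, and for what purpose.
jobs = [
    {"name": "ad_ranker", "reads": ["client_ip", "q"], "purpose": "Advertising"},
    {"name": "abuse_filter", "reads": ["client_ip"], "purpose": "AbuseDetection"},
]

print(find_violations(jobs, INVENTORY, DENIED))
# [('ad_ranker', 'client_ip', 'IPAddress')]
```

In this simplified picture, the privacy team only ever writes rules about concepts such as `IPAddress`, while the inventory absorbs the churn in column names as the code changes.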

"Legalease was the final piece of the automated privacy compliance jigsaw puzzle," said Anupam Datta, associate professor of computer science and electrical and computer engineering and co-author. "Legalease bridged privacy teams with Grok, and through Grok, with the developers."

Datta emphasized that automating the process of compliance checks could push the industry to adopt stronger privacy protection policies. "Sometimes, companies want to make their policies stronger, but hesitate because they are not sure they can ensure compliance in these large systems," he added.

The work was presented at the 35th IEEE Symposium on Security & Privacy, May 18-21, in San Jose, CA, where it won a Google award for the best student paper.

This work was supported, in part, by the Air Force Office of Scientific Research and the National Science Foundation.

About the Author

Dian Schaffhauser is a former senior contributing editor for 1105 Media's education publications THE Journal, Campus Technology and Spaces4Learning.
