Carnegie Mellon: Put Lawyers To Work Automating Privacy Compliance in Big Data Systems

Turning lawyers into programmers may not seem like the most obvious way to ensure that big data systems comply with an organization's privacy policies. But that is one outcome of a recent project by a team of researchers at Carnegie Mellon and Microsoft Research.

The researchers undertook the challenge of developing a way to replace the tedious manual work of safeguarding user data in large Web services, such as Facebook, Google and Microsoft, with an automated system. The project specifically used Microsoft's search engine Bing as a test case.

The problem is a major one, said lead student researcher Shayak Sen, a Ph.D. candidate in computer science who interned at Microsoft Research India. "Tens of millions of lines of code are already in the pipeline," he noted. "And during our implementation on Bing, we found that more than 20 percent of the code was changing on a daily basis." Without automation, there's no way to keep up with the verification of compliance.

That's where the lawyers come in. The researchers found that those who develop privacy policies within an organization (often lawyers) don't typically speak the same language as the software developers. So the students developed a language, Legalease, that is simple enough to be used by non-programmers who understand the technicalities of the privacy policies. Legalease enforces syntactic restrictions so that each encoded policy clause is structured like the policy text that defines how user data may be handled.
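To give a rough sense of what a policy clause "structured like policy text" could look like once encoded, here is a minimal Python sketch. It is not Legalease syntax, which this article does not show. The sketch assumes, purely for illustration, that a clause is an ALLOW or DENY rule scoped by a few attributes and refined by nested exceptions; the `Clause` class, the attribute names and the example policy are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Clause:
    """Hypothetical policy clause: an effect, a scope, and nested exceptions."""
    effect: str                      # "ALLOW" or "DENY"
    attributes: dict[str, str]       # e.g. {"DataType": "IPAddress"}
    exceptions: list[Clause] = field(default_factory=list)

    def matches(self, use: dict[str, str]) -> bool:
        # A clause applies when every attribute it names agrees with the use.
        return all(use.get(key) == value for key, value in self.attributes.items())

    def decide(self, use: dict[str, str]) -> str | None:
        """Return "ALLOW", "DENY", or None when the clause does not apply."""
        if not self.matches(use):
            return None
        # The first applicable exception overrides the enclosing clause.
        for exception in self.exceptions:
            verdict = exception.decide(use)
            if verdict is not None:
                return verdict
        return self.effect


# Illustrative policy text: "IP addresses may not be used for advertising,
# unless they have been truncated."
policy = Clause(
    "DENY",
    {"DataType": "IPAddress", "UseForPurpose": "Advertising"},
    exceptions=[Clause("ALLOW", {"Form": "Truncated"})],
)

full_ip = {"DataType": "IPAddress", "UseForPurpose": "Advertising", "Form": "Full"}
short_ip = {"DataType": "IPAddress", "UseForPurpose": "Advertising", "Form": "Truncated"}

print(policy.decide(full_ip))   # DENY
print(policy.decide(short_ip))  # ALLOW
```

The nesting mirrors how a written policy usually reads: a general rule followed by its carve-outs, so that someone who knows the policy can check the encoding clause by clause.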

In usability testing, 12 Microsoft employees were given a one-page document explaining Legalease and spent an average of under five minutes studying the directions. It took them an average of under 15 minutes to program nine Bing policy clauses laying out how user information could be used. "They were able to perform this task with a high degree of accuracy, which is encouraging," said Sen.

But the research didn't end there. As their report, "Bootstrapping Privacy Compliance in Big Data Systems," describes, what the researchers actually developed was a workflow for privacy compliance that targets large codebases written in languages that support the Map-Reduce programming model. That workflow pairs Legalease with Grok, a self-bootstrapping data inventory mapper developed by Microsoft Research that ties low-level data types in the code to the high-level policy concepts. Bing deployed Grok a year before the research began to automate policy compliance, but at that time the developers found writing policies for Grok too cumbersome.
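The general idea of the workflow, tying low-level data to policy concepts and then checking code against the policy, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the researchers' implementation: the column names, job names, purposes and the flat "denied pairs" policy below are all hypothetical, and a real inventory would be built automatically rather than written by hand.

```python
# Hypothetical data inventory: low-level column name -> high-level policy concept.
INVENTORY = {
    "client_ip": "IPAddress",
    "ip_trunc": "IPAddress:Truncated",
    "q_text": "SearchQuery",
}

# Hypothetical jobs: job name -> (columns the job reads, purpose of the job).
JOBS = {
    "ad_ranker": (["client_ip", "q_text"], "Advertising"),
    "geo_report": (["ip_trunc"], "Advertising"),
    "abuse_filter": (["client_ip"], "AbuseDetection"),
}

# Hypothetical policy: (policy concept, purpose) pairs that are not allowed.
DENIED = {("IPAddress", "Advertising")}


def find_violations(jobs, inventory, denied):
    """Return (job, column, concept) for every denied use of a labeled column."""
    violations = []
    for job_name, (columns, purpose) in sorted(jobs.items()):
        for column in columns:
            concept = inventory.get(column)  # None if the column is unlabeled
            if (concept, purpose) in denied:
                violations.append((job_name, column, concept))
    return violations


for job_name, column, concept in find_violations(JOBS, INVENTORY, DENIED):
    print(f"{job_name}: column {column!r} ({concept}) violates the policy")
```

Because the check runs over labels rather than raw code, it can be repeated whenever the code changes, which is the point of automating compliance in a codebase that changes daily.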

"Legalease was the final piece of the automated privacy compliance jigsaw puzzle," said Anupam Datta, associate professor of computer science and electrical and computer engineering and co-author. "Legalease bridged privacy teams with Grok, and through Grok, with the developers."

Datta emphasized that automating the process of compliance checks could push the industry to adopt stronger privacy protection policies. "Sometimes, companies want to make their policies stronger, but hesitate because they are not sure they can ensure compliance in these large systems," he added.

The work was presented at the 35th IEEE Symposium on Security & Privacy, May 18-21, in San Jose, CA, where it won a Google award for the best student paper.

This work was supported, in part, by the Air Force Office of Scientific Research and the National Science Foundation.

About the Author

Dian Schaffhauser is a former senior contributing editor for 1105 Media's education publications THE Journal, Campus Technology and Spaces4Learning.
