Data Privacy

How AI Shapes the Next Generation of Data and Compliance Part 3 of 3: It's All about Context

August 28, 2019

Topics: Cloud Data Sense Advanced | 6 minute read

Data is changing. Companies no longer generate data merely as an output of their activities; in many cases, managing that data is now a major business concern in itself. But this growth, largely made possible by the storage scale that the cloud provides, brings new concerns about privacy and about making sure that privacy regulations are not violated.

Part 1 of this series looked briefly at the growing requirements for data security and compliance that organizations face in light of the growing scale of data in use and the important new regulations in place around the world. In Part 2, I reviewed how supervised and unsupervised learning can help an organization categorize and cluster data in order to get a bird's-eye view of all the data across its storage ecosystems and better understand its data management compliance needs.

In this post, I’ll address the data identification tasks that are considered more challenging to automate, because they sit at the edge of what artificial intelligence can currently do: understanding context.

Data privacy regulations such as the GDPR and CCPA define personal information quite broadly, which makes context crucial for accurately identifying relevant data within the mountains of information that organizations store.

Sentence-Level Data Management: Utilizing Context

There are a few questions that enterprises can ask themselves to gain intuition on how AI is going to help in data management compliance:

  • Can you estimate how many names of people are mentioned in all the documents on your storage devices?
  • Which documents mention a specific company or project?
  • Can you immediately retrieve documents that mention a specific person?

1. Mapping Personal Data: Named-Entity Recognition

Retrieving information quickly with a search engine requires some form of index over the searchable domain. Mapping content from petabytes of free text, in files that are not always accessible (for example, when a server is down), is an ambitious engineering challenge. Should a search engine index and store every single word appearing in every document in huge data structures just to retrieve given names?

What if you could somehow label entity names during parsing (person/organization/location), then index just the relevant entities with the documents referring to them? This is where named-entity recognition comes in handy.

Named-entity recognition (NER) is an AI method for extracting mentions of named entities from unstructured text. The named entities are classified into categories, such as personal names, organizations, and locations, according to the context in which they appear.

Using deep learning models, an NER process can be automated with impressive precision and recall scores: each sentence is converted to a sequence of vectors, which are then passed forward through recurrent neural networks that were trained to locate and classify named entities using the entire sentence’s context. The end goal is for meaningful names to be detected, indexed, and, ultimately, queried effortlessly.
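The indexing idea can be sketched in a few lines of Python. This is a minimal illustration, not a description of any product: `tag_entities` is a hypothetical stand-in that looks names up in a small hand-written table, whereas a real system would call a trained NER model at that point. The documents, names, and labels are all invented for the example.

```python
import re
from collections import defaultdict

# Hypothetical stand-in for a trained NER model: a tiny table of known entities.
# A real tagger would classify spans using the context of the whole sentence.
KNOWN_ENTITIES = {
    "Joshua Stern": "PERSON",
    "Acme Corp": "ORGANIZATION",
    "Berlin": "LOCATION",
}

def tag_entities(text):
    """Return (entity, label) pairs found in text (toy table lookup)."""
    found = []
    for name, label in KNOWN_ENTITIES.items():
        if re.search(r"\b" + re.escape(name) + r"\b", text):
            found.append((name, label))
    return found

def build_entity_index(documents):
    """Map each (entity, label) pair to the set of document IDs mentioning it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for entity in tag_entities(text):
            index[entity].add(doc_id)
    return index

# Invented sample documents.
documents = {
    "review.txt": "Joshua Stern exceeded his targets at Acme Corp this year.",
    "travel.txt": "The Berlin office hosted Joshua Stern in March.",
    "memo.txt": "Quarterly budget figures are attached.",
}

index = build_entity_index(documents)
person_docs = sorted(index[("Joshua Stern", "PERSON")])
print(person_docs)  # ['review.txt', 'travel.txt']
```

Only the entities are indexed, so the structure stays small compared with a full-text index, and a query for a person returns the matching documents without reopening the files.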

Detecting names is useful — what about the accompanying context?

The context in which a name appears reveals more relevant information about the personal data that may or may not coexist with the mentioned name. Whether it’s a spreadsheet of contacts, an employee performance review, or a consumer credit risk assessment, it’s the context that tells the full story.

2. Contextual Personal Information Detection

Among the recent data protection regulations, there is one that can be especially hard to comply with. This regulation specifies several categories of personal information for which processing is prohibited by default, including ethnicity, sexual orientation, political or religious views, and health data. It’s introduced in the European Union’s new General Data Protection Regulation, or GDPR for short.

“Processing of personal data revealing racial or ethnic origin, political opinions, religious or philosophical beliefs, or trade union membership, and the processing of genetic data, biometric data for the purpose of uniquely identifying a natural person, data concerning health or data concerning a natural person’s sex life or sexual orientation shall be prohibited.”

— GDPR, Article 9.

In other words, any organization that wishes to comply with GDPR must find every occurrence of such personal information in its data and handle it accordingly. In practice, such information rarely fits neatly into structured tables; it surfaces in free-form, unstructured text. It is also hard to detect: can you think of a mechanism to find, for example, descriptions of a person’s ethnicity or religious views?

Which of the following sentences might contain personal information about a person’s ethnicity?

  • Joshua has Italian origins.
  • Joshua has Italian restaurants.

Of course, every English-speaking human would be able to figure this out immediately. However, training machines to differentiate between the contextual information in these types of sentences correctly requires not only linguistic understanding and the ability to parse sentences, but also some utilization of context.
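To see why simple keyword matching falls short on these two sentences, consider the rough sketch below. The term lists and both functions are hypothetical and hand-written for illustration. The second function inspects only the next word, which is far weaker than the contextual models discussed in this post, but it shows where the distinguishing signal lives.

```python
# Hypothetical term lists, invented for this illustration.
NATIONALITY_TERMS = {"italian", "french", "japanese"}
PERSON_CUES = {"origins", "descent", "heritage", "ancestry"}

def tokens(sentence):
    """Lowercase the sentence, drop the final period, and split on whitespace."""
    return sentence.lower().rstrip(".").split()

def keyword_flag(sentence):
    """Flag any sentence that contains a nationality term, ignoring context."""
    return any(word in NATIONALITY_TERMS for word in tokens(sentence))

def context_flag(sentence):
    """Flag only when a nationality term is directly followed by a person cue."""
    words = tokens(sentence)
    for current, following in zip(words, words[1:]):
        if current in NATIONALITY_TERMS and following in PERSON_CUES:
            return True
    return False

for sentence in ("Joshua has Italian origins.", "Joshua has Italian restaurants."):
    print(sentence, keyword_flag(sentence), context_flag(sentence))
```

The keyword check flags both sentences, while the context check flags only the first. A hand-written cue list like this breaks down quickly on real text, which is why learned representations of context are needed.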

Recent breakthroughs in natural language processing research have made it possible to extract context from text more effectively. For instance, take the concept of Word2Vec, which assigns each word a fixed, meaningful vector representation learned from a large body of text. With this technology, every sentence can be represented as a sequence of vectors. Applying deep learning methods, such as sequence-based neural networks, to those vector representations yields contextual representations of text: a deep learning model is trained to generate a “summarized” encoding for every input text, based on the sequence of word vectors it contains.
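The following sketch shows the mechanics of turning a sentence into a sequence of word vectors and folding that sequence into a single encoding with a simple recurrent pass. Everything numeric here is an assumption made for illustration: the three-dimensional word vectors and the weight matrices are hand-picked, whereas real Word2Vec vectors and network weights are learned from data and are much larger. The classifier that would sit on top of the encoding is omitted.

```python
import math

# Toy 3-dimensional word vectors, invented for illustration only.
WORD_VECTORS = {
    "joshua":      [0.9, 0.1, 0.0],
    "has":         [0.0, 0.2, 0.1],
    "italian":     [0.1, 0.8, 0.3],
    "origins":     [0.0, 0.9, -0.7],
    "restaurants": [0.0, -0.6, 0.9],
}

# Hand-picked weights standing in for trained recurrent-network parameters.
W_INPUT = [[0.5, -0.2, 0.1], [0.3, 0.7, -0.4], [-0.1, 0.2, 0.6]]
W_STATE = [[0.4, 0.0, 0.1], [0.0, 0.4, 0.0], [0.1, 0.0, 0.4]]

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def encode(sentence):
    """Run a simple recurrent pass over the word vectors; return the final state."""
    state = [0.0, 0.0, 0.0]
    for word in sentence.lower().rstrip(".").split():
        x = WORD_VECTORS[word]
        pre = [a + b for a, b in zip(matvec(W_INPUT, x), matvec(W_STATE, state))]
        state = [math.tanh(v) for v in pre]  # new state mixes word and history
    return state

origins = encode("Joshua has Italian origins.")
restaurants = encode("Joshua has Italian restaurants.")
print(origins)
print(restaurants)
```

Both sentences share their first three words, yet their final encodings differ because each step combines the current word with the state carried over from the words before it. A trained model would learn weights so that such differences line up with the categories it has to detect.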

As a result, deep learning models can automate the detection of personal information while taking the context surrounding each word into account.

3. Personal Cross-Reference Resolution: The Missing Link in Data Compliance

GDPR also introduces an unprecedented right of access for data subjects. As GDPR Recital 63 states: “A data subject should have the right of access to personal data which have been collected concerning him or her, and to exercise that right easily and at reasonable intervals, in order to be aware of, and verify, the lawfulness of the processing.”

That means that once a data subject submits a request for their personal data, the enterprise must accurately retrieve all of the relevant information, and without preparation the effort required can be prohibitive. Effective preparation requires a complete and efficient mapping of all relevant information, as described above in the section on mapping personal data.

Furthermore, often a data subject is referred to not by their name but by some kind of ID number(s) that could be stored in internal databases. In such cases, to retrieve all the relevant information, all the different personal references pointing at the same person need to be linked. Utilizing such a co-reference mechanism enables the company to search for documents using a person’s name, giving them the ability to reach not only documents that contain the full name, but also ones that refer to the person by their relevant ID.
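A minimal sketch of this linking step follows, assuming the alias table already exists. The identifiers, names, and documents are invented, and `ALIASES` is a hypothetical structure; in practice the links would have to be derived from internal databases or from a co-reference model.

```python
from collections import defaultdict

# Hypothetical alias table: every known reference maps to one canonical subject key.
ALIASES = {
    "Joshua Stern": "subject-001",
    "EMP-48213": "subject-001",
    "CUST-7781": "subject-001",
    "Maria Lopez": "subject-002",
    "EMP-50977": "subject-002",
}

# Invented sample documents.
documents = {
    "hr_review.txt": "Joshua Stern received a strong performance rating.",
    "payroll.csv": "EMP-48213,2019-07,4200.00",
    "support_ticket.txt": "Customer CUST-7781 reported a billing issue.",
    "onboarding.txt": "EMP-50977 completed security training.",
}

def build_subject_index(documents, aliases):
    """Map each canonical subject to the documents that mention any of its aliases."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for alias, subject in aliases.items():
            if alias in text:
                index[subject].add(doc_id)
    return index

def find_documents(name, index, aliases):
    """Resolve a name to its subject, then return all linked documents."""
    subject = aliases.get(name)
    if subject is None:
        return []
    return sorted(index.get(subject, set()))

subject_index = build_subject_index(documents, ALIASES)
print(find_documents("Joshua Stern", subject_index, ALIASES))
# ['hr_review.txt', 'payroll.csv', 'support_ticket.txt']
```

Searching by name returns the payroll record and the support ticket even though neither contains the name, because both identifiers resolve to the same subject.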

Final Thoughts

Mapping data subjects and identifying pieces of personal data are the main motivations behind the methods detailed in this post. Context must be taken into account for accurate identification, and advancements in artificial intelligence research are enabling machines to interpret language more thoroughly than ever. Achieving data compliance can be challenging for enterprises, but AI equipped with the right set of tools can make it much easier and faster. NetApp is working toward this with Cloud Compliance, the new AI-driven data mapping tool for Cloud Volumes ONTAP, Azure NetApp Files, and Amazon S3 buckets.

If you haven’t read them yet, check out the other parts of this series on how AI is changing the face of data management compliance. Part 1 gives an overview of the pressure that enterprises are under to secure and define their sensitive information. Part 2 is a deep dive into the different types of data categorization.

Read all three chapters of this series in our complete guidebook, How Artificial Intelligence Shapes the Next Generation of Data Management and Compliance.

Lead Data Scientist