Prepare for NLP Project Success: Data and Governance

For teams working on transformative, data-intensive initiatives involving Natural Language Processing (NLP) and artificial intelligence (AI), data security and governance issues can have a major impact on project success.

In our recent survey of NLP practitioners, we asked respondents to identify their top three challenges based on their level of maturity. Organizations that were evaluating NLP use cases or were in the early experimentation stage identified data security and governance (64%) as their biggest challenge.

So, what makes security and governance issues so important at this phase? And how can companies prepare to make this part of the process run as smoothly as possible?

Gather the Necessary Data and Sources

In most organizations, data is not typically owned by the team for which the NLP project is being built. As a result, it takes a tremendous amount of time to gather the right data to build training and testing sets for implementation and experimentation.

To avoid this issue, we recommend that you start by building a responsibility assignment (RACI) matrix for your project. Use it to clearly define the responsibilities of each department in providing the necessary data and resources.

Here, a Project Champion (and we highly recommend that you have one) can work with the Department Heads who are crucial to the NLP project. This allows you to establish accountability with the team that owns the data, controls its quality and eventually transfers it to the NLP project team.
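A RACI matrix can live in a spreadsheet, but it can also be kept as a small data structure that is easy to check automatically. The following is a minimal sketch in Python; the task names, department names and the `tasks_without_single_owner` helper are hypothetical and only illustrate the idea that every task should have exactly one Accountable party.

```python
from enum import Enum


class Role(Enum):
    RESPONSIBLE = "R"
    ACCOUNTABLE = "A"
    CONSULTED = "C"
    INFORMED = "I"


# Hypothetical tasks and departments, for illustration only.
raci = {
    "Provide raw support tickets": {
        "Customer Support": Role.ACCOUNTABLE,
        "Data Engineering": Role.RESPONSIBLE,
        "NLP Team": Role.INFORMED,
    },
    "Check data quality": {
        "Customer Support": Role.CONSULTED,
        "Data Engineering": Role.ACCOUNTABLE,
        "NLP Team": Role.RESPONSIBLE,
    },
    "Transfer data to NLP team": {
        "Data Engineering": Role.RESPONSIBLE,
        "Legal": Role.CONSULTED,
    },
}


def tasks_without_single_owner(matrix):
    """Return the tasks that do not have exactly one Accountable party."""
    problems = []
    for task, assignments in matrix.items():
        accountable = [
            dept for dept, role in assignments.items()
            if role is Role.ACCOUNTABLE
        ]
        if len(accountable) != 1:
            problems.append(task)
    return problems


# The last task has no Accountable department, so it is flagged.
print(tasks_without_single_owner(raci))
```

A check like this makes gaps in accountability visible before the project depends on a data transfer that no department has agreed to own.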

Identify the Security and Governance Issues for All Sources

Next, identify the security and governance issues for each data source or set. This should include information about the provenance of your data, the nature of the information it contains as well as how you plan to use the data. For example:

Does it include personally identifiable information or have other privacy implications?
Does it contain images as well as text?
Is it sourced from data that you own, or is it hosted by third-party sources?
Is it structured, unstructured or both?

This will be essential to ensure that your activities are aligned with your internal policies for data collection and privacy, responsible AI or any other frameworks or policies that you have in place.
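One way to keep the answers to these questions in a form that can be reviewed and audited is a simple inventory with one record per data source. The sketch below is an assumption about how such an inventory might look; the `DataSource` fields, the example sources and the `needs_review` rule are hypothetical and should be adapted to your own policies.

```python
from dataclasses import dataclass


@dataclass
class DataSource:
    name: str
    owner: str                # department accountable for the data
    contains_pii: bool
    contains_images: bool
    third_party_hosted: bool
    structure: str            # "structured", "unstructured" or "both"
    intended_use: str


def needs_review(source):
    """Return the reasons a source should be reviewed before use."""
    reasons = []
    if source.contains_pii:
        reasons.append("contains PII")
    if source.third_party_hosted:
        reasons.append("hosted by a third party")
    return reasons


# Hypothetical sources, for illustration only.
sources = [
    DataSource(
        name="support_tickets",
        owner="Customer Support",
        contains_pii=True,
        contains_images=False,
        third_party_hosted=True,
        structure="unstructured",
        intended_use="training a ticket classifier",
    ),
    DataSource(
        name="product_catalog",
        owner="Merchandising",
        contains_pii=False,
        contains_images=True,
        third_party_hosted=False,
        structure="structured",
        intended_use="entity dictionary",
    ),
]

for source in sources:
    print(source.name, needs_review(source))
```

Sources that come back with an empty list are not automatically safe; the rule only covers the two questions encoded here.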

Best Practices for Data Security

Regardless of your level of AI maturity, issues around data will only continue to grow in importance, especially regarding the choice of AI models and algorithms being used. The data privacy and copyright infringement concerns being raised by data protection authorities around the world regarding ChatGPT are just one example of why it’s important to understand what your data and machine learning models contain.

In our recent survey of 300 business, technical and academic natural language AI experts, data privacy and security was the top concern (73%) for the enterprise adoption of large language models and generative AI.

The pending ESG reporting regulation in Europe and other parts of the world is another factor: the data and technologies being used will come under greater scrutiny as part of ESG metrics.

With this in mind, here are some general considerations for designing and collecting data for your NLP projects:

Security

Will you host or have your data hosted in the cloud?
Is your data protected from unauthorized access?
Does your approach comply with internal guidelines?
Is your approach GDPR compliant?
Are your vendors certified (e.g., SOC 2 Type 2; ISO/IEC 27001)?

Privacy and Legality

Does your data include any personally identifiable information (PII)?
Does your training data include anything that could be considered under copyright?
Do you have practices in place to remove PII or copyrighted material from your data sets?
Have you allowed for humans in the loop during the training process and beyond?
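As an illustration of what a practice for removing PII might look like at its simplest, here is a rough sketch that masks email addresses and phone-like numbers with regular expressions. The patterns and the `redact` function are assumptions made for this example. They will miss many kinds of PII (personal names, addresses, ID numbers) and may over-match in some texts, so treat this as a starting point and not as a complete solution.

```python
import re

# Deliberately simple patterns, assumed for illustration only.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text):
    """Replace email addresses and phone-like numbers with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    text = PHONE.sub("[PHONE]", text)
    return text


sample = "Contact Jane at jane.doe@example.com or +1 (555) 010-9999."
print(redact(sample))
# prints: Contact Jane at [EMAIL] or [PHONE].
```

Note that the name in the sample is left untouched. Catching names and other free-form PII typically requires entity recognition and human review, which is one reason to keep humans in the loop.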

Methodology

Do you know where your data comes from?
Do you know how your data has been labeled?
Can you explain how your model reached its results?
Does your model have safeguards for preventing algorithmic bias?

For more considerations to ensure the success of your NLP projects, read the first post in this series, AI Maturity: Assessing Your Readiness for NLP Project Success, to discover how to face the top adoption challenges, and download our complete guide: The Roadmap to NLP Success.