
Can you trust your model when data shifts?

In real-world scenarios, NLP solutions need to deal with noisy data, and one of the most common sources of noise is “time”. Data distributions can really change a lot over time! So we decided to test “robustness to data shift” and compare end-to-end, data-driven machine learning solutions with a human-driven symbolic approach. Let’s see what happened!

Transcript:

Brian Munz: Hey, everybody, and welcome to the NLP Stream, which is our regular live stream by Expert.ai, where we talk about all things related to NLP, AI, things like that, just what’s going on in the field. Some things are more experimental, some things are more grounded. But this week, I think, is going to be very interesting. We have someone speaking who has been on here before, so you probably recognize him. He is Marco Ernandes, who is in R&D at Expert.ai, and today he’s going to talk about what to do about your model when the data changes on you. So welcome, Marco.

Marco Ernandes: Thank you very much, Brian. Okay. Shall I start?

Brian Munz: Sure. Yeah.

Marco Ernandes: Yeah, thanks. Do you see my screen?

Brian Munz: Yeah.

Marco Ernandes: Yeah. Great. Thank you. So I will say this now just to avoid forgetting it later: if there are any questions or things that need further discussion, we can do that in the comments, and I’ll be happy to answer any kind of question. So the topic today is what happens, or actually a kind of investigation into how models react, when data changes. First of all, we’ll try to understand what it means to have data that changes, and then how models react to this problem. At the heart of this there’s an experimental investigation that was published on Towards Data Science in the last year. I guess it’s pretty interesting, and there is an outcome that is in some sense pretty surprising. Okay, let’s see.

Let’s start. First of all, just a bit of context. We have here on the left a nice summary of many possible KPIs that we would like to verify when delivering an NLP solution to the world, either research-wise or business-wise. And we can see here that NLP needs to achieve a high accuracy, needs to be scalable, explainable, and so on and so forth. Many things that someone could ask of NLP. In this case we’re going to focus on one of them in particular: data variation, so data shifting. An NLP solution should be resilient, or robust, to data shifts somehow, at least up to a certain range. And actually this is, along with the others, one of the possible reasons that typically lead to failures in delivering or putting into production an NLP solution. And again, this is old information, but it’s still very relevant.

We know that around 90% of AI, and especially machine learning based NLP solutions, fail to move from a proof of concept to a real production solution. And why does that happen? Mainly because we don’t take enough care of the correct KPIs. And one of these KPIs can be data shifting. So I guess I’ll move on here. This is pretty generic, but just to give an idea: it’s not always the super accurate model that succeeds in moving from an early stage of investigation to a real production solution. It’s most often the one that is more robust to many of these issues, and especially data shifting. That is what we’re going to see. But in general, my perspective, and we will see this to be true with data shifting, is that if we want to make it a success in production, so addressing a problem with what we call practical semantics, we need to combine knowledge and data.

We shouldn’t keep these decoupled: using only data would mean building a data-driven approach. A hundred percent data-driven, which is what we usually call end-to-end machine learning, will often fail to provide a solution that is really resilient to data shifting. Let’s see what we have in our experience and what we have been discovering in this investigation. Anyway, just a quick check on the vocabulary and the types of data shifts we’re talking about. First of all, we distinguish data drifts and concept drifts, which are more or less two different ways data can change. It’s very simple. Data drift means that it’s the data itself, the input data, that changes. So mainly, if you think of it from a machine learning perspective, the training data and the real test data do not really match.

The data distribution is changing. A concept drift is a little different, because the data per se, the input data, could still be the same, but what is changing is the targets, the actual outcome that you would be expecting. That is another type of data shifting. When does this happen? Because I don’t want to sound too theoretical. This actually happens a lot, I would say nearly all the time, because, first point here, data changes through time, and that is always the case. I can’t imagine a scenario where time doesn’t have an impact on data, unless maybe we’re talking about Latin literature, but yeah, probably not even in that scenario. Anyway, there is a data aging problem. Obviously data changes through time, but there’s also a seasonal-changes problem. As I mentioned here, data under Christmas is typically different than data we have in summer.

Then, in the business, we do have continuous changes in data sources. So new data flows coming from different sources are getting into our model, or maybe the balance between sources is changing: one is taking more load and another one less, and things are changing that way. Data formats are changing. That is absolutely something I have personally experienced a lot in real-world solutions. And since an NLP practitioner, an NLP expert, is not going to be the owner of the entire pipeline, they often won’t know when a data format change is going to happen. Fourth point here, probably connected to the third one: there are workflow changes in real-world business solutions. And we can imagine that we may want to reuse a text categorizer in a different way.

So changing the taxonomy in the output a little bit, but we want to reuse some models, things get mixed up, and we need to adapt to these kinds of situations. And the example we’re going to see is probably one of this sort. A fifth example here, which is super common, is when moving from a POC environment to production, where you usually start having access to real-world data, because more often than not the production data is not something models can actually see during the POC. So the production data is not available in POC environments, and that makes a lot of difference, because when moving into production you discover a lot of new issues to be addressed. So all in all, I would say that data changes all the time.
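Given the distinction above between data drift and concept drift, one practical concern is noticing a drift before it shows up as a service disruption. Here is a minimal sketch of a data-drift check on incoming documents; the monitored feature (token counts per document), the whitespace tokenization, and the 0.05 threshold are illustrative assumptions, not something taken from the talk or the experiment.

```python
# Minimal data-drift check: compare a simple feature of incoming documents
# (token count per document) against the training distribution.
# The feature and the 0.05 threshold are illustrative choices.
from scipy.stats import ks_2samp

def doc_lengths(texts):
    """Token counts per document, using whitespace tokenization."""
    return [len(t.split()) for t in texts]

def detect_drift(train_texts, incoming_texts, alpha=0.05):
    """Flag drift when the two length distributions differ significantly."""
    stat, p_value = ks_2samp(doc_lengths(train_texts), doc_lengths(incoming_texts))
    return {"ks_statistic": stat, "p_value": p_value, "drift": p_value < alpha}

# Toy usage (real monitoring would use far more documents and richer features):
train = [
    "my credit card was charged twice",
    "the loan application is stuck",
    "late fee added to my account",
]
incoming = [
    "mortgage payment rejected again despite repeated calls to the bank",
    "the servicer has not responded to my escrow question for three months",
]
print(detect_drift(train, incoming))
```

In practice one would monitor richer signals (vocabulary, embeddings, class priors), but the principle is the same: compare what the model was trained on against what it is receiving now.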

By the way, we could think of this as: okay, never mind, we have machine learning, we can retrain all the time. That’s not always possible. In many of these scenarios, the data is shifting but at the same time nobody’s giving us the new supervised labels. So we need to be robust to these data changes. Also because, as I said, many of these scenarios are unplanned and data changes in an unforeseen way. And there are no warnings, apart from some kind of disruption in the services, which is too late. Okay. So let’s see our experience with this. We’ve been playing with data shifts, and to play with data shifts, what we did was take a publicly available Kaggle dataset called the Consumer Complaint dataset. A well-known dataset, available out there, which is interesting because it’s connected to a real business case.

So it’s something which has, I would say, a relevant business value. And we are talking about financial consumer complaints. So people dropping their complaints about their credit card that is not working, or some kind of loan that is not going through, and so on and so forth. What we did was, on one side, a straightforward text categorization test. And by the way, I think I have here the list, that’s the list of nine root categories. We see debt collection, money transfers, mortgages. These are the kinds of topics that the text classifier needs to detect in the financial complaints. And at the same time, we did a second test, which was taking the second level of this taxonomy, because as you can see there are two levels, two layers, and using it to shift the data. And the way we shifted the data was this way.

We took one of the descendant nodes. For example, for debt collection, we took student loan debt. The examples falling into this category went into the training set. All of the examples falling into the other sibling categories, like mortgage debt or medical debt, went into the test set. So what we are trying to model here is: are models actually able to pick up the meaning of debt collection from data that talks about student loan debt, and generalize? And by generalizing I mean being robust to what may happen to our model when data shifts. Okay. So let’s see what happened. On one side, I would say the results have been unsurprising, though for some they could be surprising, but what we observed was two facts. One is that on the original data where training was done, so using the entire set of supervised complaints associated to a specific class, machine learning was producing very good results.
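To make the adversarial split just described concrete, here is a minimal sketch in pandas. The column names ("Product", "Sub-product", "Consumer complaint narrative"), the file name, and the exact sub-category label are assumptions about the Kaggle Consumer Complaint dataset and should be checked against the file you actually download; in the experiment the same idea is applied to every root category, not just one.

```python
import pandas as pd

# Adversarial split for a single root class: the model trains on ONE
# second-level category and is tested on its sibling categories, with
# every example labeled with the shared root class.
df = pd.read_csv("consumer_complaints.csv").dropna(
    subset=["Consumer complaint narrative"]
)

root_class = "Debt collection"
train_sub = "Student loan debt"  # the only sub-category seen at training time

debt = df[df["Product"] == root_class]
train_set = debt[debt["Sub-product"] == train_sub]
test_set = debt[debt["Sub-product"] != train_sub]  # sibling sub-categories only

# Both splits share the same target: the root class.
X_train = train_set["Consumer complaint narrative"]
y_train = [root_class] * len(train_set)
X_test = test_set["Consumer complaint narrative"]
y_test = [root_class] * len(test_set)
```

Repeating this per root category yields a nine-class training set whose sub-topic distribution deliberately differs from the test set, which is exactly the kind of shift the talk is probing.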

By the way, we have transformers here, we have a spaCy classifier here, we have other machine learning models. We also tested our AutoML hybrid ML engine on the straightforward, non-adversarial dataset. And at the same time, we asked a knowledge engineer to write some rules, symbolic rules, to classify these documents, these complaints. Well, and the result is over here. So it wasn’t performing super well, but actually that’s absolutely understandable. Obviously the knowledge engineer didn’t read all 80,000 available examples, actually 72,000, sorry. But I wouldn’t even go that way if we had 72,000 available examples and we knew beforehand that the data distribution was going to stay solid through time and really represent what we would see in the real test data.
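For readers who want to try a classic machine learning baseline of this kind, here is a minimal sketch with TF-IDF features and logistic regression. This is only an illustration, not necessarily one of the exact models used in the experiment, and it assumes X_train/y_train and X_test/y_test were built by applying the split sketched above across all nine root categories (so that more than one class is present).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline

# Classic text-classification baseline: bag-of-ngrams TF-IDF features
# fed into a linear classifier. Assumes X_train, y_train, X_test, y_test
# cover all nine root categories (see the split sketch above).
baseline = make_pipeline(
    TfidfVectorizer(lowercase=True, ngram_range=(1, 2), min_df=3),
    LogisticRegression(max_iter=1000),
)
baseline.fit(X_train, y_train)

# Comparing this score on the original split vs. the adversarial split is
# one way to quantify the performance drop discussed below.
print("macro-F1:", f1_score(y_test, baseline.predict(X_test), average="macro"))
```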

Okay. So the interesting part comes when we move down and see what happens with the adversarial set. So again, I repeat myself here: training data was coming from one of the classes belonging to the root class, and all the examples were targeted with the root class. Any student loan debt document was targeted as debt collection. And in the test set we had all the other sibling classes, with their documents used as test examples. And what we did see was, in some cases, an absolute collapse of performance, meaning that there was no real understanding of the generic concept that the root category was actually trying to model. Some of the models, especially RoBERTa Large, proved to be pretty resilient: we can see here a 23-percentage-point drop in performance on this adversarial set. But amazingly, the symbolic solution, which was not performing at state-of-the-art level on the full dataset, proved to be extremely resilient, extremely robust to these data changes. Mainly because of two facts.

It was prepared by a human being who put into the model some generic knowledge about the overall class it was modeling. And also because, again, the knowledge engineer couldn’t see all the data, so he didn’t have the chance to super fine-tune his model against all the nuances of the subclasses that belong to the root class. So that’s mainly our outcome. This is interesting also because there is another consideration that I think needs to be made: not only is there an important disruption in performance for end-to-end machine learning models, but this disruption is also very different from one model to another, making it very difficult to predict what may happen in production given the model we have in production.

So just to sum up a couple of lessons we learned. First, data changes all the time, and for a number of reasons. And when it happens, it can be without warning, so we need to be prepared. The second lesson is that the impact of this data shift can be really important, and for some models it can amount to total disruption, and the magnitude of the disruption is also pretty difficult to predict. And at the same time, we observed that a knowledge-driven symbolic model actually proved to be pretty resilient to these data shifts, leveraging generalization. Yeah. So that’s all I wanted to say for today, and I think it’s something that our listeners can use as an experience.

Brian Munz: Thanks. Yeah. That’s a really interesting kind of experiment. One question that came to my mind is, obviously there’s a variety of technologies you have here, but if you managed this model and saw this large change and degradation of accuracy, what would the effort look like across these? How much of a pain point would that be, in terms of: would you go back and retrain it? How much time is that, and symbolic versus the ML?

Marco Ernandes: Yeah, okay. So what we are modeling here is a scenario where you could be lacking the supervised data for retraining. So in that case, we’re trying to see how much resilience we really have. We are in a world where data is coming through every day, and new data is coming in. What are we going to do? We should start asking, okay, how is the new data coming in? What are the causes, and so on and so forth. And maybe we could get to a point where we can retrain. But what we are observing is that if we have a model that contains general knowledge about the facts it is trying to model, something that doesn’t belong only to the available data, it’s going to be more robust to these changes.

Obviously, then, we could say, okay, let’s retrain, and that’s absolutely possible. But we need to take into account that this retraining has to be sort of continuous, because it’s not going to stop; the data shifting is something that is always going to be there. And so it comes with a cost: the cost of having, all the time, new supervised data, retraining, redeploying, and so on. So a little bit of resilience, I guess, is always useful to have, because it gives you time to react. Time for picking up the new data that is needed to retrain, and at the same time avoiding total disruption, if that is what we are heading towards.

Brian Munz: Right. Well, you touched on it a little bit, but in terms of when the model is being built or trained or whatever it might be, how do you prevent, or not even prevent, but ensure that you have a resilient model, so that this isn’t as impactful? Are there any kind of best practices to prepare for this type of problem, like the data shift?

Marco Ernandes: Yeah. So what we are suggesting here is that, to be fully prepared, we need to inject into our model knowledge that comes not only from the data, but from the external knowledge available about the domain. So usually, in a real-world application, for example financial complaints, that means talking to the industry, to the stakeholders, understanding what the real needs are, what the workflows involved are, why we are doing this job of categorizing complaints at all. And also taking into account some principles of simplification, keeping the models simple enough to be general enough. Taking care of these business needs that live outside of the data is often wise.

So I’m not saying that you should only be doing that with a purely symbolic model. This is our experience in this case, and I guess we can have a hybrid model that takes into account both data and knowledge. Maybe this is something we could talk about in the future. But injecting that symbolic knowledge that comes from experience in the domain can prove to be a sort of warranty, or insurance policy, for avoiding disruption. Because if the model itself is trained only on data, its world will be the data itself. And so if what comes in during production time is not reflected by the training data, then you’re in trouble.
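As a rough idea of what injecting domain knowledge can look like at its simplest, here is a toy symbolic classifier built from hand-written rules. The rule patterns below are illustrative assumptions, not the actual rules the knowledge engineer wrote, and real symbolic engines are far richer than regular expressions.

```python
import re

# Toy symbolic classifier: hand-written rules that encode the general notion
# of "Debt collection" (and one sibling class) rather than the patterns of
# any single sub-category. Keyword lists are illustrative assumptions only.
RULES = {
    "Debt collection": [
        r"\bdebt collect(or|ion)\b",
        r"\bcollection agency\b",
        r"\battempt(s|ed)? to collect\b",
    ],
    "Mortgage": [
        r"\bmortgage\b",
        r"\bescrow\b",
        r"\bforeclos(e|ure)\b",
    ],
}

def classify(text: str) -> str:
    """Return the first root class whose rules match, else 'Other'."""
    lowered = text.lower()
    for label, patterns in RULES.items():
        if any(re.search(p, lowered) for p in patterns):
            return label
    return "Other"

# Works on sub-categories never seen while writing the rules, as long as the
# general vocabulary of the root class is present.
print(classify("A collection agency keeps calling about medical debt I never owed."))
```

Because the rules encode the general vocabulary of the root class rather than the quirks of one sub-category, they keep working when the sub-topic mix shifts, which is essentially the resilience observed in the experiment.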

Brian Munz: That makes sense. So yeah, this was very interesting. So thanks for presenting, and like you said, maybe in the future we’ll hear from you more about other experiments that you’ve done, but definitely appreciate you talking to us today. So thanks for that.

Marco Ernandes: Thank you, Brian.

Brian Munz: So I think we are taking a break for the holidays after this, that’s my impression. But like I’ve said before, make sure to follow us on LinkedIn and you can see any upcoming live streams and we’ll be back for you in the new year. So thanks for joining and thanks again Marco, and hope everyone has a good holiday.

Marco Ernandes: It’s been a pleasure. Bye-bye.

Brian Munz: Bye.

 
