
Hybrid-ML and LLMs put to the Test in Data Scarcity Scenarios

In real-life business scenarios, we often face conditions that do not allow Large Language Models to perform well, and a lighter hybrid (symbolic + light machine learning) model can deliver better performance than these giant language models.

Watch Samuel Algherini discuss the question: can hybrid-ML approaches help when supervised data isn’t enough for LLMs?

To do so, he’ll cover:

  • The LLMs era: the continuous increase in model size
  • Light models and the hybrid approach
  • Comparisons between different architectures in a data scarcity scenario

Transcript: 

Brian Munz:

Hi, everybody, and welcome to NLP Stream, a livestream we do every week where we try to cover topics related to NLP and the world of NLU and AI at large. We do this every Thursday at 11:00 Eastern time.

Brian Munz:

This week, we’re joined by Samuel Algherini, so wave to everybody. He works with us at expert.ai, which, if you weren’t aware, is an AI company providing NLP/NLU services. What he’s going to talk about today is a very interesting topic and one that you see come up quite a bit. There are lots of facets to it, so it can be confusing to navigate what hybrid ML means. Today he’s going to give a run-through of that, but also with respect to some very tangible things: large language models and data scarcity.

Brian Munz:

I’m very excited to hear what you’ve got to say, so take it away.

Samuel Algherini:

Hi, Brian, thanks for the introduction. Hi, everyone. Let me say that I’m very happy to have the opportunity to talk about this topic, because data scarcity is a very interesting one.

Samuel Algherini:

Let me share the screen so I can give you a framework. Just a few seconds.

Brian Munz:

Oh, there you go.

Samuel Algherini:

Okay, you should see my screen.

Brian Munz:

Yeah.

Samuel Algherini:

Okay, so let’s start. Today, we’re going to talk about the data scarcity scenario. Let me introduce the framework: the things we’re going to talk about and the frame of reference. The scenario is data scarcity because we want to talk about cases where the data are not enough. Today, we live in a world of big data.

Brian Munz:

[inaudible 00:02:40] I think you’re showing your… I just realized you’re showing your file folder, not the slides.

Samuel Algherini:

Oh. Oh, one second, sorry.

Brian Munz:

[inaudible 00:02:52] yeah, I wasn’t sure if you were about to-

Samuel Algherini:

No. Okay, you should see now my… Can you see it now?

Brian Munz:

I see Academy Training Projects folder.

Samuel Algherini:

Okay, sorry [inaudible 00:03:09] sorry.

Brian Munz:

Yep, no problem.

Samuel Algherini:

Okay. Okay, you should see my screen now.

Brian Munz:

Yep, we’re good.

Samuel Algherini:

Okay, thanks.

Samuel Algherini:

Let’s talk about this very interesting topic, because we’re going to deal with a data scarcity scenario. We’re dealing with NLP, so we’re talking about language. With data scarcity, we’re focusing on all the cases, and there are a lot once we move to real business scenarios, where there just aren’t enough data to create an effective model.

Samuel Algherini:

So the idea is to compare different kinds of models, of different sizes, and see how they perform in this case. Well, I can tell you that we will see that light models, and especially the hybrid model, and I’m going to explain what we mean by hybrid model, perform better than huge deep learning architectures.

Samuel Algherini:

But one step back; let me talk briefly about the deep learning era. Today, we have a lot of deep learning models that, in academia and research, are doing very well on benchmarks. But let me tell you that you need two things in order to have a very good deep learning architecture.

Samuel Algherini:

And that is the reason why only in this last decade, the last 15 years or so, deep learning models have been able to reach good results. It is because today we have a lot of data and a lot of computational power; it is only because of this that we can train and create such models.

Samuel Algherini:

Today, we are not going to talk about all the issues that can arise around computational power, because to train these huge models you need a lot of data, but you also need a lot of computational power. That’s the reason why only a few big tech companies, like Facebook, Google, and so on, are able to train models with billions of parameters. In fact, as we can see, in only a few years, starting with BERT in 2018, we went from models with a few hundred million parameters.

Samuel Algherini:

In a few years, we got models with hundreds of billions of parameters, which is really a lot. And these models are all trained with supervised, or maybe semi-supervised, learning. It means that the documents you need are annotated, so you need a huge amount of annotated documents, and you need massive training for that.

Samuel Algherini:

But one thing I would like to highlight is that once this training is done, you get a model that only serves general purposes. It means that these models, which are freely available and that you can use for free, are not going to give you very good results, because you need to fine-tune them for your specific task.

Samuel Algherini:

So let me recap. In this decade, a lot of models have been trained by big companies, spending a lot of money and scraping the whole web to create these huge models. We won’t talk about the cost right now.

Samuel Algherini:

And I would like to point you to a link that you will find in the description: an article I wrote with Leonardo Rigutini that talks about the problem, the cost, of training and running these huge models. In this talk, we’re going to focus on data and the performance of the models.

Samuel Algherini:

And at this step, we see that the pre-trained models we can find available need to be fine-tuned for your specific task. In fact, it is very important that you have your own documents, specific to your task, so you can fine-tune the model. Through these documents, you obtain a model that is fine-tuned for your specific task.

Samuel Algherini:

But the problem is that with these huge models, you also need a lot of documents in order to fine-tune the model and get good performance. This is an issue known as few-shot learning: when you try to fine-tune a model and you have just a few documents.

Samuel Algherini:

And in the real world, what happens is that the R&D department, the data scientist, is looking for the specific documents needed to fine-tune the model. But when we move to a real business scenario, a lot of the time we have only a bunch of documents, only a few hundred. So the idea is to see what happens and how we can deal with this scenario, because the mood of the machine learning engineer, the data scientist who receives only 400 documents, is something like that.

Samuel Algherini:

So the idea is: let’s see how to deal with this case of data scarcity, where we have only a few documents with which to fine-tune the model, and let’s see which kinds of models perform well in this scenario.

Samuel Algherini:

So the idea is to compare these different models. We took into consideration transformers, which are very common and widely used architectures: Distil-BERT, BERT-Large, and RoBERTa-Large. We also took into consideration spaCy’s open-source models: CNN, BoW, and ensemble. We used some classic machine learning algorithms as well, like support vector machine, naive Bayes, random forest, and logistic regression. Last but not least, the hybrid model: the expert.ai hybrid model.

Samuel Algherini:

Now, I’m not going to go into the technicalities of the hybrid model in much depth, but there is one thing I would like to highlight in order to understand why the hybrid model approaches this task in a different way with respect to the classic machine learning pipeline.

Samuel Algherini:

Let’s consider what you’re doing every time you want to create a model. Well, machine learning algorithms don’t work with text; they work with numbers. The only things they can work with are numbers, so the first thing you have to do is transform text into numbers. A vector is a series of numbers, let’s say. So the first step is to transform text into vectors, and these vectors are the way in which you represent the information.
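As a minimal, purely illustrative sketch (not expert.ai code), here is the classic way of turning text into a vector of numbers: a bag-of-words count over a fixed vocabulary. The vocabulary and sample sentence are invented for the example:

```python
from collections import Counter

def bag_of_words(text, vocabulary):
    """Represent a text as one count per vocabulary word."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

# Hypothetical vocabulary for a complaints domain
vocab = ["loan", "credit", "card", "report", "late"]
print(bag_of_words("Late fee charged on my credit card", vocab))  # [0, 1, 1, 0, 1]
```

Each position in the output vector is the count of one vocabulary word; the classifier only ever sees these numbers, never the text itself.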

Samuel Algherini:

Then you can have different kinds of architecture. You might have a simple machine learning algorithm or a very sophisticated architecture. That’s not really the important part; this first step is: the way you choose to represent the information and then work with such vectors.

Samuel Algherini:

And here is the switch with the hybrid model, because in the general, common machine learning approach, the vectors used are subsymbolic vectors. It means that these series of numbers, these vectors, do not represent anything in the real world. There is no real-world entity that these numbers represent. And here is the difference, because in the symbolic representation of the information, which is what we do at expert.ai with the hybrid model, we create vectors where the numbers, zeros and ones, represent real entities. And you can do it thanks to a knowledge graph, specifically the expert.ai Knowledge Graph, which is a huge one.
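A toy sketch of what a symbolic vector might look like, with an invented two-concept “knowledge graph” (the real expert.ai graph is vastly larger and also encodes relationships between concepts; every name below is hypothetical):

```python
# Hypothetical mini "knowledge graph": surface words mapped to concepts.
KNOWLEDGE_GRAPH = {
    "mortgage": "LOAN", "loan": "LOAN", "lender": "LOAN",
    "visa": "CREDIT_CARD", "mastercard": "CREDIT_CARD", "card": "CREDIT_CARD",
}
CONCEPTS = ["LOAN", "CREDIT_CARD"]

def symbolic_vector(text):
    """Each dimension is a real-world concept, not an opaque feature."""
    found = {KNOWLEDGE_GRAPH[w] for w in text.lower().split() if w in KNOWLEDGE_GRAPH}
    return [1 if concept in found else 0 for concept in CONCEPTS]

print(symbolic_vector("My mortgage lender raised the rate"))  # [1, 0]
```

Unlike a subsymbolic embedding, every dimension here can be read off: the first position literally means “this text mentions the concept LOAN”.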

Samuel Algherini:

And the knowledge graph is a huge network that represents entities, concepts, events, all these things in the world. In the expert.ai knowledge graph, there are more than 430,000 concepts, events, and so on. But the very important thing is that you do not have only these representations but also the relationships between them. So thanks to this knowledge graph, you can leverage this knowledge and produce vectors that are qualitative vectors.

Samuel Algherini:

In this last decade, I would say in these last years, there has been a switch away from big data, the idea that you get more data and train models of massive and increasing size. The idea now is not to look just for more data but to look at quality, to have qualitative data. And in the hybrid model, instead of using subsymbolic vectors, we use symbolic vectors that represent the information symbolically, using and leveraging the knowledge graph.

Samuel Algherini:

And then we send this information to machine learning algorithms, so the hybridization is essentially in this step. It is the way the model represents the information, represents text. Here is the difference: from subsymbolic to symbolic.

Samuel Algherini:

So what about the experiment? The idea is to see how it works and compare it with the others. We worked, and I would say thanks to the R&D department that ran the experiments, with the consumer complaint dataset. It is a collection of complaints about financial products, and the full number of documents is approximately 80,000. We used 10% of it as the test set, so we tested on 10% of the dataset, and there are nine categories. This is the dataset.

Samuel Algherini:

And how did we proceed? With incremental training. It means that we did four trainings: one case with only 90 documents, a second case with 450 documents. It is incremental because those 450 documents include the 90, so the bigger set includes the smaller one. In the third case, we have 810 documents. And then the fourth case uses the full training set.
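The nested splits described above can be sketched like this (the corpus here is a stand-in list, not the actual consumer complaint dataset):

```python
def incremental_splits(documents, sizes=(90, 450, 810)):
    """Build nested training sets: each larger set contains the smaller one,
    so any gain between runs comes only from the added documents."""
    splits = [documents[:n] for n in sizes]
    splits.append(list(documents))  # the fourth run uses the full training set
    return splits

docs = [f"doc_{i}" for i in range(2000)]  # stand-in corpus
splits = incremental_splits(docs)
print([len(s) for s in splits])  # [90, 450, 810, 2000]
```

Nesting the sets keeps the comparison fair: a model’s improvement from one run to the next cannot be explained by a luckier draw of documents.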

Samuel Algherini:

Now, remember that we have nine categories, and we ran four experiments with different training set sizes: 90 documents, 450, 810, and the full training set. So let’s see how these models perform; these are the results. I would like to spend a few minutes on them.

Samuel Algherini:

As you can see, the hybrid model, the expert.ai hybrid model, performed better than the others in the 90-, 450-, and 810-document trainings. And on the full training set it was essentially equal to the transformers and spaCy, 0.1 better; essentially, the performances are very, very similar.

Samuel Algherini:

So let me highlight some things about these results. There are at least three points I would like to make. First: when you have very few documents, when you have only 90 documents, and of course we are dealing with one case, these are the results with this dataset, we can see that the transformers are not really predicting. I mean, it’s not just that performance is low; it’s as if they don’t even start to work. They are not able to give you any prediction.

Samuel Algherini:

Instead, the machine learning algorithms, like support vector machine, naive Bayes, random forest, and logistic regression, can give you some results. So one thing we should consider is that when you have very, very few documents, these huge models are not working at all. And, as expected, they start to produce some results when we have a few hundred, in this case 810, documents. One interesting thing is that RoBERTa is not doing badly, but the other architectures, Distil-BERT and BERT-Large, are still struggling.

Samuel Algherini:

And let’s see here the result of the expert.ai hybrid model. In the first case we have 64.3, which is four points better than naive Bayes. And this is the second point I would like to highlight: the hybrid model, as I told you, uses symbolic vectors, and then it sends these symbolic vectors to a machine learning algorithm. The learning algorithms are nothing more than these same algorithms: support vector machine, naive Bayes, random forest, logistic regression.

Samuel Algherini:

So what is really interesting is that in this case, the hybrid model here used naive Bayes. The difference of these four percentage points is completely due to the different kind of representation. That means the symbolic vectors provide four points of better performance just because of the symbolic representation. The same is true here with 74.7 and 77.2: we have approximately one to two points of performance above the machine learning algorithm. The overall idea is that, compared to the classic machine learning algorithms, we are able to add a boost that comes from the symbolic representation.

Samuel Algherini:

So, to recap this table, a couple of things I would like to highlight. When you have very few documents, the large language models, these huge models, are really unpredictable; they are not working at all. When you have a few hundred documents, depending on the architecture, they start producing some results, but they still underperform the expert.ai hybrid model. And only when we get a full training set, in this case 80,000 documents, are the transformers performing well. But even then, the expert.ai model still performs as well as the transformers.

Samuel Algherini:

And we’re not even talking about the fact that with the expert.ai hybrid model, we didn’t need a GPU or a long training time, and so on. So the fine-tuning that RoBERTa-Large, BERT-Large, and Distil-BERT underwent in this case was not enough to deliver the results that lighter models and our hybrid model can.

Samuel Algherini:

So, even though this is a single case and we need more data and more case studies, and I can tell you that these are in process, so stay tuned, because there are other very interesting case studies to show, what seems to be confirmed in this scenario is that large language models struggle when there are few data. And light hybrid models, as we have seen, seem to perform better, and they are quicker and so on. But the most important thing is that in this case they seem to perform better than the deep learning architectures. And one thing that comes from this experiment is that it seems we can gain three to four percentage points; the metric we were looking at was the F1 score.
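For reference, the F1 score mentioned above is the harmonic mean of precision and recall; the counts in this sketch are made up purely to show the arithmetic:

```python
def f1_score(tp, fp, fn):
    """F1 = harmonic mean of precision and recall."""
    precision = tp / (tp + fp)  # fraction of predicted positives that are right
    recall = tp / (tp + fn)     # fraction of actual positives that are found
    return 2 * precision * recall / (precision + recall)

# Invented counts: 80 true positives, 20 false positives, 20 false negatives.
print(round(f1_score(80, 20, 20), 3))  # 0.8
```

Because the harmonic mean punishes imbalance, a model cannot score well on F1 by inflating either precision or recall alone.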

Samuel Algherini:

It seems possible to get these two or three percentage points from the symbolic representation. So leveraging the knowledge graph seems to enable you to improve the algorithm’s performance.

Samuel Algherini:

So, even though every case study is different and you should always think about your specific, unique case, it seems, as we can also imagine, that in a data scarcity scenario, starting with a large language model might not be the best choice; going with a light model or, in this case, a hybrid model would be the better choice.

Samuel Algherini:

By the way, more case studies are of course necessary, but this seems to be the path. On my side, that’s all… Oh.

Brian Munz:

Yeah.

Samuel Algherini:

Anyway, are there any questions or anything that…

Brian Munz:

Yeah, I mean, I have one question. No, thanks, that was really interesting. I had a question about this experiment: with the symbolic model, were there any rules written, or was this mainly just using the knowledge graph and the engine out of the box?

Samuel Algherini:

Thanks, Brian, a very good question, because in this case no rules were used. It was just the knowledge graph. And this is a very important possibility, because in some cases you can really boost your performance just by adding a few rules.

Samuel Algherini:

So in this case, we didn’t use any rules, but it really is an option. It’s a real possibility to add a few rules here and there to boost performance, especially when you have data scarcity. And if you need to improve your results, a few rules added to the symbolic representation might really boost the performance and get you the best out of it.

Brian Munz:

Yeah. No, I mean, it seems to me that a big reason hybrid should be at least in the mind of anyone who’s working with NLP is especially around…

Brian Munz:

Well, first of all, like you mentioned, with use cases, I see more often than not that you go to a real-world use case and not many companies have tens of thousands of documents, unless they’ve been around for 50 years. And then, in terms of taking the larger models and improving them, tweaking them, and retraining them, symbolic seems like a quicker way to get there, especially if there are fewer documents, which is more [inaudible 00:25:39]

Samuel Algherini:

Yeah, Brian, you know, I think that today a lot of people are falling in love with the technology and not with the idea that you should solve the problem. And today there’s a lot of hype around artificial intelligence and these huge models. I mean, sometimes they do very well, but you need certain conditions.

Samuel Algherini:

So the idea is not to start with a large language model; the idea is to start by understanding your specific case. I think too many people today fall in love with a kind of technology, and they forget that they have to solve a specific and unique problem. So the idea is to understand: how can I deal with this problem? What would be the best way to do it?

Samuel Algherini:

And sometimes you don’t need… Sometimes a deep architecture is fine, because you have the right conditions and you can use all the GPUs you want, but sometimes it is not. And I think we should bring this huge body of research from academia to real business cases, where the idea is: how do I solve this problem?

Samuel Algherini:

And sometimes a quick, light, hybrid approach might be much better than starting to fine-tune a huge deep learning model when you have maybe just 700 documents. You need a couple of rules and a light model, and you have good results. Fall in love with the problem, not with the technology by itself, I mean.

Brian Munz:

Yeah, exactly. I think there’s always this matter of merging the academic world, as well as the sci-fi world, with what you end up dealing with in a normal business case, where sometimes you go in and say maybe they don’t even need NLP; someone is just so attracted to the idea of AI.

Brian Munz:

And they’ve seen the movies, so the use cases can be a bit more challenging. In the world of academia, you can guide the challenges and say, well, we’ve got 10,000 documents. That’s a nice way to test and, of course, drive forward your research, but in the real world, that’s why I think what you’re showing here is important, in terms of hybrid being an important thing in your toolbox to help with the project.

Samuel Algherini:

And today we haven’t even talked about the cost of training, using GPUs, and all the computational costs. So if it makes sense, okay, it’s fine, but we don’t necessarily need to go that way. It depends.

Samuel Algherini:

Yeah, it is interesting to see that in this data scarcity scenario, which is very common in real business cases, most of the time you can go with light models and especially with these hybrid models.

Brian Munz:

Right. Now, it looks like we have a question from YouTube. I’ll put it up on the screen.

Brian Munz:

Oh, did it go away? Okay. There it is.

Brian Munz:

They’re saying: could you suggest some tools, maybe in Python, pipelines, or best practices, to build such symbolic representations?

Samuel Algherini:

Well, you can use some, let’s say, pretrained knowledge graphs, but of course they are not as full and as rich as the one we have, because the expert.ai knowledge graph has been developed for more than 30 years, so it’s a really huge one.

Samuel Algherini:

There are some open-source, free knowledge graphs that you can use. And you can also try to build one from scratch, even though that takes longer and requires more effort.

Brian Munz:

I mean, the way we handle it at expert.ai is that we have our knowledge graph, of course, which you can get access to. And, as Samuel said, there are open-source ones out there, DBpedia and things like that. So we have access to this knowledge graph, but also, through a product we have called Studio, the ability to write rules to help hone the model.

Brian Munz:

So, for example, if you’re running on the baseline technology and, in your use case, a word that means one thing means something else, whether it’s slang or whatever it might be, you can write a rule, which is basically telling the model to consider the word this way rather than that way. And you can use that methodology generally, if you were building something in Python or whatever, to also handle the different classifications and things like that.
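As a rough, purely hypothetical illustration of the idea (this is not Studio’s actual rule language; the rule format and names are invented), a rule might force a domain-specific sense of a word when it appears near a context word:

```python
# Hypothetical rule format: (word, required nearby word, forced concept).
RULES = [("charge", "card", "FEE")]

def apply_rules(tokens):
    """Override the default sense of a word when its context word is present."""
    forced = {}
    for word, context_word, concept in RULES:
        if word in tokens and context_word in tokens:
            forced[word] = concept
    return forced

print(apply_rules(["late", "charge", "on", "my", "card"]))  # {'charge': 'FEE'}
```

The point is that a handful of such overrides can correct systematic misreadings without retraining anything, which is exactly why rules help most when data are scarce.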

Brian Munz:

We can also post in the comments later if we have any further recommendations for tools like that.

Brian Munz:

That’s our time, but this was a really interesting presentation and conversation, so thanks for presenting. Hopefully we’ll see you again sooner rather than later.

Samuel Algherini:

Thanks, Brian. Thanks, everyone, and stay tuned.

Brian Munz:

Yep. Yeah, and next week we have the Turn Language into Action: A Natural Language Hackathon for Good kickoff event. expert.ai is going to be running a hackathon, an online one, not in person, a challenge where we’re going to have prizes and everything. I’m going to go through what the challenge is, how to participate, and show some examples of past hackathons, so definitely tune in, especially if you’re interested in putting your hat in the ring for that competition.

Brian Munz:

But until then, thanks for joining, and I will see you next Thursday. Thanks.
