How reliable are NLP benchmarks?

Many recent works in NLP focus on understanding the abilities of language models with the help of popular benchmarks. Although many researchers and AI enthusiasts carry out their studies to climb the leaderboards of these benchmarks, the lack of a detailed analysis of the benchmark datasets, and of an evaluation of their sufficiency, creates obstacles to fair competition.

In this NLP Stream, we will assess the quality of one such dataset (the Word in Context dataset from the SuperGLUE benchmark) and its use in evaluating NLP performance.

Transcript:

Sinan Gultekin:

Hi. Welcome to this NLP Stream. I’m Sinan Gultekin. I’m a software engineer at Expert.ai. Today, we will go through how reliable NLP benchmarks are. Let me start by sharing my screen and we will be set.

Okay. Before starting the session, I would also like to mention that, if you want to go deeper, there’s an article about this topic that Leonard and I have written, and we will share the link to the article in the description. If you want, you can go to the article for more information, more technical stuff. Now that everything is set, let’s start with our presentation. I think, more or less, everyone knows something about benchmarks and how they work. A benchmark is basically a standardized way of comparing two or more things.

And with the ever-growing attention to machine learning models and their performance or behavior, many researchers have been led to create datasets and to construct benchmarks. And today, we will basically analyze one dataset from a benchmark called SuperGLUE, not the whole benchmark. Inside that benchmark we have tons of different… Not tons. Tens of different datasets. And WiC, also known as the Word in Context dataset, is one of them. The benchmark’s aim is more or less to provide an outstanding evaluation for language understanding tasks. And they’re actually doing a great job in that respect, making different datasets and different tasks available to everyone. But there are some overlooked points about these kinds of datasets. The main question is: should we just accept them without questioning their sufficiency, how they are created, or basically some characteristics of the datasets?

And we will go through the WiC dataset in that sense. Let’s see some characteristics of that dataset. Basically, it is a word sense disambiguation task. We can say that this is a very high-level task for NLP models, and the authors did an outstanding job of making it a binary classification task, because somehow it is even harder. In that dataset, each sample has two sentences that share a pivot word. That pivot word is a polysemous word, so each pivot word has more than one meaning, and we are trying to find out whether the pivot word has the same meaning in the two sentences or not.

And it is quite a small dataset. As we can see, we have more or less 7,000 samples at most across all the data splits. And for each split we have a different number of unique pivots. At first glance, we can see that there is some imbalance between the number of unique pivots and the number of examples across the data splits. And this leads us to the first analysis of the dataset. When we look at the distributions of the unique pivot words, we can directly see that the training, validation and test splits follow different distributions. In the training set, we can see a long-tail distribution, which means a single pivot word could have up to 85 samples. Whereas in the validation and test sets we can see a more uniform distribution, where each pivot has more or less between one and three samples. So it’s a huge difference.
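
A minimal sketch of how the samples-per-pivot comparison between splits could be computed. The tuple layout `(pivot, sentence1, sentence2, label)` and the toy sentences are assumptions made for illustration; they are not the actual WiC file format or data.

```python
from collections import Counter

def pivot_counts(samples):
    """Count how many samples each pivot word has in one data split."""
    return Counter(pivot for pivot, _s1, _s2, _label in samples)

def summarize(counts):
    """Return (unique pivots, max samples per pivot, mean samples per pivot)."""
    values = list(counts.values())
    return len(values), max(values), sum(values) / len(values)

# Toy, made-up splits (hypothetical data, for illustration only).
train = [
    ("bank", "He sat on the bank.", "She went to the bank.", False),
    ("bank", "The bank closed early.", "The bank raised its rates.", True),
    ("bank", "The river bank flooded.", "We walked along the bank.", True),
    ("run", "She went for a run.", "He had a run of luck.", False),
]
validation = [
    ("play", "They play chess.", "The play opened on Friday.", False),
    ("light", "Turn on the light.", "The light was dim.", True),
]

train_summary = summarize(pivot_counts(train))
validation_summary = summarize(pivot_counts(validation))
print(train_summary, validation_summary)
```

A large gap between the maximum and the mean in one split, and not in the other, is the kind of long-tail versus near-uniform difference described above.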

And of course we have the right to be suspicious about this situation, because these kinds of different distributions may lead to biased learning towards certain pivots, or a lack of learning of the grand [inaudible 00:06:50] function, which of course we don’t want when we train models. Let me continue analyzing the dataset. We also saw another issue, about label balance. The researchers actually made a balanced dataset as a whole: for the targets, we have exactly the same number of Trues and Falses. But when we step down to the pivot level, we can see some unbalanced labels, and it is another critical issue for the dataset because [inaudible 00:07:47] the model animation, running model assembly or [inaudible 00:07:51] it doesn’t matter: if a model can see only one of the classes, it will be really hard for that model to understand that there is another class for that specific pivot.
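
As a rough illustration of the pivot-level check, the sketch below lists the pivots whose samples all carry the same label even though the split is balanced overall. The `(pivot, label)` pairs are invented toy data, and the function name is hypothetical.

```python
from collections import defaultdict

def single_label_pivots(samples):
    """Return the pivots whose samples in this split all share one label."""
    labels_by_pivot = defaultdict(set)
    for pivot, label in samples:
        labels_by_pivot[pivot].add(label)
    return sorted(p for p, labels in labels_by_pivot.items() if len(labels) == 1)

# Toy split: three True and three False overall, so globally balanced.
samples = [
    ("bank", True), ("bank", False),
    ("run", True), ("run", True),
    ("play", False), ("play", False),
]

global_true = sum(1 for _pivot, label in samples if label)
global_false = len(samples) - global_true
one_sided = single_label_pivots(samples)
print(global_true, global_false, one_sided)
```

Here the split as a whole is balanced, yet two of the three pivots only ever appear with a single label.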

So it again creates a problem… It basically creates a bias towards one target class for each pivot that suffers from this problem. And last but not least, maybe the strangest flaw in the dataset is some lexicographical labeling. What I mean is that this task labels True and False with respect to the word sense, basically the meaning of the word in the sentences. But there are some sentence pairs that are actually labeled with respect to grammatical roles: whether the word is a noun or a verb, if it is a verb, whether it is transitive or intransitive, et cetera. So it is completely against the main idea of the dataset, because the main goal of this dataset is to differentiate words by their different senses. So these grammatical roles are just a big no-no. And after doing hundreds of different experiments, we tried to understand whether these construction flaws really affect the understanding capabilities of machine learning models in general. In order to do that, we used a very well-known technique. Actually, it’s not used just for that; generally it is hugely used.

So, data augmentation. We would like to augment the dataset in order to avoid these construction flaws and see whether the flawless versions of the dataset perform better, or are more meaningful for the models. In order to do that, we [inaudible 00:10:40] in four different ways, and three of them, changing the sample distribution, expanding samples from the training set, and expanding samples from the validation and test sets, directly target the construction flaws. By changing the sample distribution, we tried to get rid of the long-tail distribution. By expanding samples from the training set, and from the validation and test sets, we tried to add more sample pairs in order to prevent unbalanced labeling for the pivots that suffer from this issue. And finally… Also, just out of curiosity, we implemented a swapping method, to see whether it can help some models, very basic machine learning models, which have drawbacks with some directional problems. So since this task is bidirectional, we wanted to see what would happen if we swapped the samples.
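
One possible reading of the swapping method is sketched below: each sentence pair is duplicated with the order of the two sentences reversed, keeping the label. This interpretation, the tuple layout and the function name are assumptions; the talk does not spell out the exact procedure.

```python
def swap_augment(samples):
    """Append a copy of every sample with its two sentences swapped.

    The label is kept, since whether the pivot has the same sense in both
    sentences does not depend on the order in which they are presented.
    """
    augmented = list(samples)
    for pivot, sentence1, sentence2, label in samples:
        augmented.append((pivot, sentence2, sentence1, label))
    return augmented

# Toy sample (hypothetical data).
original = [("bank", "He sat on the bank.", "She went to the bank.", False)]
augmented = swap_augment(original)
print(augmented)
```

An order-insensitive model gains nothing from this, which is why it is mainly of interest for simple models that treat the first and second sentence differently.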

And to look at this, we went with three different model categories and, in total, five models. First, Expert.ai’s hybrid technology, which consists of symbolic analysis plus SVM, Support Vector Machines, or a cosine similarity threshold. And here I would like to open a little parenthesis about the symbolic analysis and what is [inaudible 00:12:33] about it. Basically, Expert.ai’s symbolic analysis uses Expert.ai’s knowledge graph, which helps to construct vectorial representations of the documents that are closer to real-life cases. Because with the usual methods, [inaudible 00:13:01] of words or other classical machine learning models, unfortunately the vectorial representations are not very correlated with real-life entities, concepts and other things. They are just numerical representations. But with knowledge graphs we can represent the real world, the correlations between different entities, documents, concepts and other things, and we can show these representations in a more realistic way.

So this makes a bit of a difference in bringing real-case scenarios into machine learning models. And secondly, we will go with a light neural network, which is just a multilayer perceptron with only one hidden layer. So a very, very light one. And of course we can’t deny the fact that this is now the era of large language models, and we need to take them into account as well, also in order to make a comparison with the authors’ results in the original article. So we will use [inaudible 00:14:31] with sentence embeddings or word embeddings, and at the end we will use a cosine similarity threshold, as the researchers did.
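
A cosine similarity threshold classifier can be sketched as follows. The embedding vectors, the threshold value of 0.5 and the function names are illustrative assumptions; in practice the vectors would come from an embedding model and the threshold would be tuned on held-out data.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def same_sense(u, v, threshold=0.5):
    """Predict True (same sense) when the two vectors are similar enough.

    The default threshold is a placeholder, not a value from the talk.
    """
    return cosine_similarity(u, v) >= threshold

# Toy two-dimensional "embeddings" of the pivot in each sentence.
identical = cosine_similarity([1.0, 0.0], [1.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(identical, orthogonal, same_sense([1.0, 0.0], [0.0, 1.0]))
```

The whole classifier has a single free parameter, the threshold, which makes it a useful baseline for judging how much signal the embeddings alone carry.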

So for each model category we can see the best-performing results across all the different data augmentation techniques. And as we can see, there’s little difference, but a slightly better validation accuracy for Expert.ai’s hybrid technology. What more can we say about the outcomes, beyond the accuracy values? We can investigate the dataset more deeply and see what we can learn from that. The first thing is the distribution difference in the dataset and the application of the Welch t-test. Here we apply the Welch t-test, which is a very well-known statistical method for checking whether two populations, which may have different variances, differ significantly in their means. And here we can see one of the Welch t-test results, for the MLP model. It basically says that the training set has a different distribution than the validation and test sets, and we show that this distribution difference between the datasets actually affects the credibility of the results.
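
For reference, the Welch t statistic and its Welch-Satterthwaite degrees of freedom can be computed with the standard library alone, as in the sketch below. The two input lists are made-up numbers, not results from the talk, and what exactly was fed into the test in the study is not specified here. For a p-value, `scipy.stats.ttest_ind(a, b, equal_var=False)` performs the same test.

```python
import math
import statistics

def welch_t_test(a, b):
    """Return (t statistic, degrees of freedom) for Welch's unequal-variance t-test."""
    mean_a, mean_b = statistics.mean(a), statistics.mean(b)
    # Sample variances divided by sample sizes (squared standard errors).
    se2_a = statistics.variance(a) / len(a)
    se2_b = statistics.variance(b) / len(b)
    t = (mean_a - mean_b) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (len(a) - 1) + se2_b ** 2 / (len(b) - 1)
    )
    return t, df

# Made-up samples, for illustration only.
t_stat, dof = welch_t_test([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
print(round(t_stat, 4), round(dof, 4))
```

Unlike Student's t-test, this version does not assume that the two samples have equal variances, which is why the degrees of freedom are generally not an integer.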

So we can clearly see from the Welch t-test results that the distribution differences affect the learning capability of the models, and that is not an optimal outcome to expect from a benchmark dataset. The second observation is about the accuracies on the target classes. When we dive deeper into the performance per target, we clearly see that all models are prone to favor one class over the other. Which one depends on the model: SVM and MLP favor the False class, while cosine similarity favors the True label. But the important point is that some target classes are favored. So it’s not a uniform result.
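
Per-class accuracy, as opposed to a single overall accuracy, can be computed along these lines. The gold labels and predictions are toy values, and the function name is hypothetical.

```python
def per_class_accuracy(y_true, y_pred):
    """Return {label: accuracy on the samples whose gold label is that label}."""
    totals, correct = {}, {}
    for gold, predicted in zip(y_true, y_pred):
        totals[gold] = totals.get(gold, 0) + 1
        if gold == predicted:
            correct[gold] = correct.get(gold, 0) + 1
    return {label: correct.get(label, 0) / totals[label] for label in totals}

# Toy labels: the model gets every False right but only half of the Trues.
y_true = [True, True, False, False]
y_pred = [True, False, False, False]
accuracy_by_class = per_class_accuracy(y_true, y_pred)
print(accuracy_by_class)
```

In this toy case the overall accuracy is 75%, which hides the fact that the model leans towards predicting False.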

And after that… Also considering the fact that more or less all the models were stuck around a certain accuracy, we thought, “Okay, maybe we can analyze the wrong predictions in order to see where the models perform badly.” And when we did that analysis, we saw that nearly 13% of the validation set is mispredicted by all models. So this 13% is a set of wrong predictions common to the three different models. It is actually a huge finding, because no one would expect something like that. And of course it raises some questions about the reliability of the dataset, but we wanted to have some solid proof, so we continued to examine the wrong predictions. And the findings get more and more interesting, because among these wrong predictions, 50 of them actually didn’t exist in the training dataset.

So none of the models had a chance to learn these pivots or these samples. So it’s not fair to expect models to perform well if they can’t see the pivot words, if they don’t have a chance to learn them. And another interesting point about the common wrong predictions is that 24 of the remaining misclassified pivots suffer from an unbalanced target label, which means the model only saw False or only saw True. And because of that, the model couldn’t understand when it should be False and when it should be True, because it only saw one of them. It basically created biased learning towards one specific class. Furthermore, there’s a somewhat bizarre distribution of these wrong predictions. We have far more Trues than Falses, and all of this supports the [inaudible 00:21:48] suspicions about construction flaws. But we wanted to do some more tests, more experiments on this, because the original article says the task should be beyond the scope of current state-of-the-art systems, but [inaudible 00:22:15] by most college-educated English speakers.
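
The error analysis described here, finding the samples that every model gets wrong and then checking which of their pivots never occur in the training set, can be sketched as below. The model names, predictions and pivots are invented toy values.

```python
def common_errors(y_true, predictions_by_model):
    """Return the indices of the samples that every model mispredicts."""
    return [
        i for i, gold in enumerate(y_true)
        if all(preds[i] != gold for preds in predictions_by_model.values())
    ]

def unseen_pivots(error_pivots, train_pivots):
    """Return the pivots of misclassified samples that are absent from training."""
    return sorted(set(error_pivots) - set(train_pivots))

# Toy validation set with three hypothetical models.
pivots = ["bank", "run", "play", "light"]
y_true = [True, False, True, False]
predictions_by_model = {
    "model_a": [False, False, True, True],
    "model_b": [False, True, True, True],
    "model_c": [False, False, False, True],
}

shared = common_errors(y_true, predictions_by_model)
never_trained = unseen_pivots([pivots[i] for i in shared], {"bank", "run"})
print(shared, never_trained)
```

The same indices can then be cross-checked against the list of single-label pivots to see how many of the shared errors come from pivots the models only ever saw with one label.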

So the last part, the second part of the sentence, is quite interesting, because we thought that if we found some supervisors, some experts, since it says most college-educated English speakers, we should expect these college-educated English speakers to perform better than, maybe, the models. And we did that. We found eight different supervisors, all of them educated to at least master’s degree level, and some of them native speakers. So it is a mixture of native speakers and near-native speakers. Basically very advanced speakers. And the results are quite interesting, because when we gave these 81 common wrong predictions to the supervisors, we really expected them to perform well, maybe 80%, 90%, or at least 70%. But unfortunately, on average it is barely better than flipping a coin. So 53% is not good. And it also shows that this task is unfortunately not solvable by most college-educated English speakers. So they can’t perform very well on this task.

And if we wrap up all the things we have discussed so far, we can say that the resulting dataset shows a very low level of quality, and unfortunately it provides an inadequate level of knowledge to express the granularity of the senses. So none of the models could perform well, because the dataset couldn’t provide enough knowledge, enough information, to the models. So it was not the models’ fault. And the distribution difference between the training set and the validation and test sets really, really limits the machine learning models, and it is not fair to expect good performance if we cannot provide good data, or a consistent distribution across the different data splits. And finally, the Welch t-test examination shows that the leaderboard results maybe cannot be considered statistically significant. The Welch t-test is a proven way to find out whether the gap between different populations is significant or not, so we used it for that. And when we used the Welch t-test, unfortunately we couldn’t state that the differences between the results are significant.

It can also be [inaudible 00:26:29] these small percentage differences could be just random noise during training or another part of the learning. So it is also not a good thing, not something we expect from a benchmark. And of course, the aim is not roasting the benchmark or anything like that. It’s about paying more attention to what we are using when we are training our models or performing very expensive tasks, because a couple of very big language models also attempt to get their seats on the leaderboard. So, in light of all of this, we should really consider the quality and the sufficiency of the datasets. As a humble solution, we can say that the distribution across the different data splits should be more or less the same. That is optimal, especially for the datasets of a benchmark. Besides the distribution differences, we can also expect a more balanced situation across the labels in order to avoid biased learning, because of course it will not help the models if we have flaws that may lead to biased learning.

And lastly, this is a quote from the article. If we are trying to put forward a strong gold standard task, a gold standard dataset for a certain task, to create a benchmark… Especially with this kind of dataset, because the Word in Context dataset is semi-automatically constructed. For these kinds of datasets, strong annotation is a real need, because you cannot directly trust the algorithm to do a very good job. And you cannot just randomly pick some instances from the data splits and claim that the human supervisor, the human baseline, reaches a certain level of accuracy. It should be much stronger and more serious than that. Maybe even some of the test set, maybe the whole test set, maybe even the training and validation sets as well. So there should be a really serious human supervisor performance on that dataset. And that is the end of the presentation. Thank you all for attending this session, and for your attention. I will stop sharing to look at the chat, if we have any questions there, and we will see if we have…

Okay. For now, we don’t have any questions, but maybe we can wait a little bit, and if it stays the same, maybe we can end this session. Yeah. Okay. So if there are any further questions, you can always reach us by email, to me or to any of my colleagues, and they can direct your questions to me. You can also post in the comment sections on the articles. I also look at [inaudible 00:31:53] frequently. And that will be all for today’s session. To see more interesting topics like this, you can stay tuned to the channels, and I hope to see you at the next ones.