NLP Stream: No Fear Stylometry with Expert.ai

Everyone has their own style in which they write (or speak), and this style is made up of a series of small features that are quite hard to detect. Stylometry, the study of linguistic style, breaks down these features and helps us understand what makes our way of expressing ourselves unique.

While stylometry might come across as a sort of disciplinary bogeyman due to its weird combination of hard science and fuzzy humanities, it does serve a variety of purposes from author attribution to fake news detection.

Tune in to watch Gianmarco Saretto discuss:

  • The purpose of stylometry
  • How to use expert.ai NL API to perform a stylometric analysis
  • How to visualize and compare stylometric data
  • How to make sense of the results

Brian Munz:

Hey, everybody. Welcome to the NLP Stream. And as usual, I’m Brian Munz, I’m a product manager at expert.ai. And what we do is every week we have a live stream where we cover a topic that’s relevant to NLP, NLU. And this week is no different. We’re going to close out the summer with a really interesting topic where even if you don’t know what stylometry is, you will after this. We have a lot to get to, so without further ado, I’d like to jump in and introduce Gianmarco Saretto. Go ahead.

Gianmarco Soretto:

Hi. Hello, Brian. Thank you. Thank you for the introduction and thank you for having me here to talk about stylometry, which is a topic that I’m very excited about. So let me share my screen here. Can you see my screen well?

Brian Munz:

Yep.

Gianmarco Soretto:

I guess now, yeah. What I plan on doing today is to talk a little bit about what stylometry is. So if you’re not familiar with this interesting subfield of natural language processing, again, I hope you will get an idea of what it is, of what it does, why it’s useful.

Gianmarco Soretto:

And then what I’d like to do is try out the tool that is devoted to stylometry in the expert.ai Natural Language API. So we have a tool that does stylometric analysis that is very neat, as we will see in a moment, and we will try it out against some presidential speeches. So we will compare the style of President Obama’s speeches and the style of President Trump’s speeches and have a look at interesting differences that might come out. Then we will try to use the stylometric information that we get from the API to create a little classifier and see whether these stylometric indexes, these values, can be predictive of whether a speech was delivered by one president or the other.

Gianmarco Soretto:

Yeah, it’s a little bit of material. Let’s get started then with what stylometry is just very briefly. Stylometry is that part of natural language processing, or at least one field of natural language processing, that is concerned with style, with form. So while many applications of NLP have to do with pulling out content from text or classifying text according to what they say, stylometry is all about the form of text, how texts are constructed, sort of how they are constituted. So it’s really all about form and style, really, and finding ways to measure style and to compare style and so on.

Gianmarco Soretto:

So it’s usually concerned with those aspects of text that are otherwise neglected in NLP. For instance, we look at punctuation, we look at the length of tokens. We look at the length of sentences. We look at really individual preferences that may have no bearing on a text’s meaning. Actually, we are really interested in the opposite, in form.

Gianmarco Soretto:

This is fascinating because the assumption behind stylometry is that whenever we write, all of us as writers basically leave something of a fingerprint in our writing. We make a lot of small choices, some of which we are not even aware of very often, some of which are almost subliminal, that again add up to a unique stylistic profile. That’s the idea. Can you hear me well?

Brian Munz:

Yep.

Gianmarco Soretto:

Okay. All good. Again, we leave this sort of fingerprint in our writing, and this is unique. What stylometry tries to do, then, as I’ve tried to depict in this little image here, is it tries to detect in documents those features that can make up this unique stylistic fingerprint. So when we have collected all this information, all this data, we can then compare different styles and maybe even make predictions about whose fingerprint we find in a text.

Gianmarco Soretto:

So stylometry is very often used in cases of authorship attribution, so when we have documents whose author is either unknown or disputed, then we can use these kinds of tools to compare the fingerprint we find in a document with that found in other documents and match it, so determine the author of, again, a text whose author is unknown. So it’s a fascinating tool. In general, I’ll just say this so that we have a term of comparison for what I’m going to show you now with the expert.ai API. In general, stylometry currently is mostly based on vocabulary, and actually on just making huge maps of the words used by an author across many, many documents. So it involves many features that are usually also quite hard to reconstruct. So we get these very complex, gigantic fingerprints, and it’s hard to tell exactly on the basis of what features, of what stylistic choices, we make a decision of, “Oh, it’s this author rather than that author.”

Gianmarco Soretto:

And as we will see, the tool that we find in the API is compact. It has a few very understandable metrics. So let’s have a look at it together. The tool is called Writeprint, which is a great name I think because it really encapsulates this idea of a fingerprint found in writing. And it’s part of the expert.ai NL API. So just a brief word of introduction about this, but I guess many of you are already familiar with this API. This is a tool that is freely available, made available by expert.ai. It’s based on its core technology. And it allows you to perform different sorts of analysis from within your code, and all that you need to do to start using it is sign up and get some credentials. Then you can use it inside your code to process texts and get different sorts of information from these texts, including Writeprint, so including these sorts of stylometric indexes.

Gianmarco Soretto:

Yeah, to do that, you just sign up here, with this button. The website is developer.expert.ai. So you can try it out. And if you’re interested in using it, then you can just have a look at the documentation here, which is nice and, I would say, very clear. It has a quick start, a how-to on how to perform different sorts of analysis, and a breakdown of all the different capabilities, so all the different kinds of analysis that you can perform using the API: document analysis, document classification and information detection.

Gianmarco Soretto:

What matters to us now is found under information detection. So it’s right here, Writeprint. We can just have a look at it to understand how it works a little better. Here, as you can see, we send a document to the API and what we get back is this fingerprint that is made up of 60 indexes. So 60 metrics, 60 different aspects, facets of a text that taken together sort of constitute a stylistic fingerprint similar to what I have described. And here you have a breakdown of what each one is, but I think the best way to understand how this works is to try it out. So there is this live demo that is available here. If you don’t want to start coding and you just want to try it out with a graphical interface, you can do that using this tool.

Gianmarco Soretto:

And here, I had already tried it earlier. As you can see, you can copy-paste the text inside of this box, inside of a text box, and then you can process it and get back all the results of Writeprint. So what I had here was precisely a presidential speech. I grabbed it from here, from the White House website. We can do it again just together. So I just went to this very recent one and just took it and copy-pasted it. Oops, there you go. And then we just put it here.

Gianmarco Soretto:

And this is a good way, again, of understanding what these indexes are like. You have this fingerprint that is made up of 60 different values. What are they? What do they tell us? Here you see these different categories into which these 60 indexes are divided. You have readability indexes that tell you how readable a text is. And these are metrics that are mostly based on the length of words and sentences. They tell you how easy or accessible the text might be.

Gianmarco Soretto:

Then we have information about the composition of a text, so the length of tokens, the length of sentences on average within a text, and also punctuation, which is a very important feature as we will see in a moment, so the use of punctuation within a particular document, on average. As you can see, it’s all very understandable, all these metrics. It’s all about then putting them all together.

Gianmarco Soretto:

Grammar, so parts of speech found within a sentence on average. So adjectives, adverbs, articles and so on used by an author on average within a sentence. And then phrase types, very similar. So there are conjunction phrases, adverbs and so on.

Gianmarco Soretto:

And finally, language has to do with vocabulary, really. It’s another way of saying diversity, variety in vocabulary. So all these metrics that indicate that. And then subject-specific vocabulary used by an author, so words that have to do with academic language, business language and so on.

Gianmarco Soretto:

What I was getting at is, again, these metrics, the 60 indicators, are all very easily understandable. It’s super clear what each one of these means on average for a sentence, and compared to the many, many parameters that are commonly used in stylometric studies, this is quite easy to grasp. Maybe on its own it doesn’t tell us that much. Processing a single document and knowing how many commas there are on average in a sentence is not that fascinating, but that’s because stylometry really shines when you use it to compare different styles. It becomes more interesting when you are comparing the style of one writer with the style of another. This is what we’re going to do using presidential speeches, as I’ve mentioned.

Gianmarco Soretto:

So here I’m using this Kaggle data set. This is what we will be working with, put together by this Kaggle user Joseph Billenburg. And we are going to compare Trump speeches with Obama speeches only because, as you can see, it stops here at 2019. So we couldn’t include the current president in there.

Gianmarco Soretto:

So I have these Jupyter notebooks here that show you all the steps of the process. I will just briefly show you what the data looks like here. As you can see, there is lots of metadata in this table. We will clean it up and take only the strings that matter to us. But as I do that, I also want to mention something if you decide to use the API yourself, as I hope you will. The API can only process chunks of text of about 10 kilobytes at a time. So usually, when I’m working with larger documents, as is the case with many of these speeches, I split them up into chunks of roughly the same size. So this is what I’ve done here. And then I’ve selected 60 at random. So it’s a very small data set that we end up with, if you think about it.
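
As a rough sketch of that chunking step (this is not the notebook’s actual code; the helper name and the naive sentence split are illustrative):

```python
import re

def split_into_chunks(text, max_bytes=10_000):
    """Greedily pack sentences into chunks that stay under the ~10 KB API limit.
    The sentence split here is deliberately naive; a real notebook might use
    a proper sentence tokenizer instead."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate.encode("utf-8")) > max_bytes:
            chunks.append(current)   # current chunk is full, start a new one
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```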

Gianmarco Soretto:

So about 60 excerpts from the speeches delivered by Obama and those delivered by Trump, and we create a new data frame that is much simpler. So here we have the ones by Obama and later those by Trump. Then we can just add to each one of these rows the Writeprint indexes. So those features that we’ve seen earlier, we will add them to each row to start having a look at these values together.

Gianmarco Soretto:

So let’s have a look at how to make a call to the API. It’s extremely simple. As you see, you need to install this package, this library, and then you import this client. And then the other important step is you need to load your credentials into your environment. What I usually do is I use dotenv, which is a very neat library for that.

Gianmarco Soretto:

And because you have signed up on the developer portal, the username and password that you use there have to be loaded into your environment to use the API in general, but these are the only two steps. And then the call itself is inside of this function. Once you have loaded the credentials and created this client object, you just use this detection method. And then you include the text that you want to process, the kind of detection that you want to perform, in this case Writeprint, and the language. That’s it.
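
Put together, that call might look roughly like the sketch below, based on the public expertai-nlapi client; the environment variable names and the exact detection parameters are worth double-checking against the documentation:

```python
from dotenv import load_dotenv
from expertai.nlapi.cloud.client import ExpertAiClient

load_dotenv()  # loads EAI_USERNAME and EAI_PASSWORD from a local .env file

client = ExpertAiClient()

def writeprint(text, language="en"):
    # Send one chunk of text to the Writeprint detector and return the response.
    return client.detection(
        body={"document": {"text": text}},
        params={"detector": "writeprint", "language": language},
    )
```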

Gianmarco Soretto:

So we can try it out with the first speech in this table that we have prepared. We are sending the document to the API and we’re getting back the JSON containing all these indexes that we have described, and they’re here. So you access them using this extra_data variable when you use Writeprint. That’s what you do. These are the indexes that we’ve seen earlier in the live demo, only now we have them inside of the notebook. So the readability indexes, the grammar indexes and so on, all this information.
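
One way to get from that response to a row of values is to flatten extra_data with pandas; the exact nesting of the response isn’t shown in the talk, and the variable names below are placeholders, so treat this as a sketch:

```python
import pandas as pd

# `speeches` stands in for the cleaned-up data frame built above, and
# `writeprint` is the helper sketched earlier.
output = writeprint(speeches.loc[0, "text"])

# Assuming extra_data comes back as a (possibly nested) dictionary of index
# values, json_normalize flattens it into a single row of 60 columns.
indexes = pd.json_normalize(output.extra_data, sep="_")
print(indexes.T)  # one Writeprint index per line for this first speech
```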

Gianmarco Soretto:

Very briefly, I do the same for the entire table, so I get all these rows containing all these values. I’m not going to spend too much time on this. The important thing is that we get this table that I’m going to show you here. I just saved it, stored it using [inaudible 00:17:11] elsewhere. So just to show you the finished product here that we’re going to work with, we have this table in which we have the text, once again the author of the text, or the person who delivered the speech in this case, and then all the indexes. So this row here contains all the 60 indexes retrieved by Writeprint. I’ve chosen to use the mean, the average. You can use different… Actually, you can use the total or the… All of those are available, but I’ve decided to use the average per sentence, basically.

Gianmarco Soretto:

Okay, so what do we do with all this data now that we have it here? As I said before, before I try to model this information to create a classifier, what I like doing is use Matplotlib and Seaborn to create plots. So I just run the cells below. I just create plots to start detecting interesting differences that there might be between the styles of different writers.

Gianmarco Soretto:

I’m not going to just go over all of this because it’s just Matplotlib and Seaborn code. Let’s look at the charts together.

Gianmarco Soretto:

Hello? Oh, I’ve lost you for a second. Are you all there?

Brian Munz:

Yeah.

Gianmarco Soretto:

Okay, all good. Let’s just have a quick look at the information that we’ve gotten from Writeprint, organized in these charts. So the first thing that we looked at is the use of parts of speech. And this is a suggestion for you if you want to do this kind of plotting of charts: it’s a good idea to cluster them together in the way that we have seen in the live demo. So for instance, you create charts that only compare grammatical aspects of the text, or punctuation, or vocabulary and so on. This is a nice way of having just a few graphs that tell a story.

Gianmarco Soretto:

So here we have the average part of speech for each sentence in the speeches of Barack Obama and those of Donald Trump. So this is a nice starting point, I think, because it’s a nice breakdown of the typical sentence in one kind of text and in the other. So, yeah, basically…

Gianmarco Soretto:

… and so on are found in…

Gianmarco Soretto:

Okay. And we notice a couple of things here, actually. The first, when you compare Trump speeches to Barack Obama speeches in relation to some of these features, is that the numbers here are all higher for Obama’s. More of all of them.

Gianmarco Soretto:

… is that the sentences contain more parts of speech. So very simply, here you have the average number of tokens in each sentence. And as you can see, there are far more tokens in the average Obama sentence. So since this has sort of a ripple effect onto all the per-sentence metrics in Writeprint, it’s a good idea to normalize the values when you create charts, and that’s what I’ve done here. So now we are looking at the proportions of how often they use one part of speech compared to the other. And we still see those bumps that I noticed before.
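
As a rough illustration of that normalization and plotting step (the data frame name df, the author column, and the part-of-speech column names below are assumptions, not the actual Writeprint field names):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Hypothetical column names for the per-sentence part-of-speech indexes.
pos_cols = ["adjectives", "adverbs", "articles", "conjunctions",
            "nouns", "prepositions", "pronouns", "verbs"]

# Average per speaker, then divide each row by its total so the bars show
# proportions rather than raw per-sentence counts (which favour longer sentences).
pos_share = df.groupby("author")[pos_cols].mean()
pos_share = pos_share.div(pos_share.sum(axis=1), axis=0)

pos_long = pos_share.reset_index().melt(
    id_vars="author", var_name="part of speech", value_name="share"
)
sns.barplot(data=pos_long, x="part of speech", y="share", hue="author")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
```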

Gianmarco Soretto:

So basically, the interesting thing to notice is that in Donald Trump’s style, we have far more adjectives and adverbs. So it’s a style that has more of these elements and maybe fewer prepositions, fewer articles and conjunctions. So it’s a style that, you could say, is maybe slightly simpler from a syntactic standpoint, but more colorful, more emphatic in some ways.

Gianmarco Soretto:

Again, when using normalized values, you can immediately see bumps…

Gianmarco Soretto:

… and pronouns, right? Which is interesting. Again, probably shorter sentences with many references back to the same referent using pronouns.

Gianmarco Soretto:

Another interesting thing to look at is punctuation. Again, I’ve normalized the values here, so that it’s just in general the use of punctuation for these two speakers. I think what’s remarkable is the use of double quotation marks and question marks in Trump speeches. So again, probably many quotes from other people and more rhetorical questions, one might assume. And for instance, fewer colons and semicolons, so sentences are probably shorter, again, with more emphasis, somehow.

Gianmarco Soretto:

Finally, you can look at the distribution of vocabulary, so that subject-specific vocabulary we mentioned earlier. Here, as you can see, in the excerpts from Barack Obama we have a predominance of words related to business, business-specific words. This doesn’t really happen in the case of Donald Trump’s style: we have fewer words, terms, related to business, and far more related, for instance, to crime, or far more layman terms. So again, indicating probably a simpler style.

Gianmarco Soretto:

So this is, again, a good starting point to get a sense of how the style of these two speakers differs. Now what we can do is try to see whether all these indexes that we have gotten, which give us some insight into, again, how these styles are different, can be used to actually create a classifier, a model.

Gianmarco Soretto:

So let’s do that, again, using a notebook. So we’re going to do a little bit of machine learning here, and we are going to try to train a model using just these 60 indexes and see how it goes. So again, I’m going to import the table using [inaudible 00:24:42] Pickle that was stored; it’s the one that you’ve seen earlier, which has the indexes and the labels. And then I’m going to extract from this table precisely that. So I’m going to have all the features that are needed to train my model, and the labels. So I’m going to use just a simple function here. As you see, I have this matrix, which is just the Writeprint indexes, the averages for each speech. And then the labels are the… Okay, sorry. I’m getting a… Well, I’ll have to check that later. Sorry, Brian, did you have a…
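
A minimal version of that extraction step might look like this, assuming the pickled table has a text column, an author column, and the 60 index columns (the file name and column names are placeholders):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Placeholder file name for the table built earlier.
df = pd.read_pickle("speeches_writeprint.pkl")

X = df.drop(columns=["text", "author"]).values   # the 60 Writeprint indexes as features
y = df["author"].values                          # "Obama" / "Trump" labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)
```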

Brian Munz:

No, no, go ahead.

Gianmarco Soretto:

Okay, thanks. I just got a message, but I don’t know. So very briefly, we have, again, all these indications coming from the text, all these indexes, and then we have the label. So what we want to do is create a model that, on the basis of these stylometric indexes, of this fingerprint, can tell us who the author or speaker is. So we use scikit-learn to do that.

Gianmarco Soretto:

As you can see here, we train a model. It’s a very simple classifier, as you can see. And now I’m just going to train it. It takes very little time with so few features, of course. And then, having done that, I can now test it. So I’ve separated my training data from my test data, so I can test it now on the test data that I had reserved earlier. And as you can see, we have an accuracy of about 92%, which is, I guess, pretty all right. It tells us that these features are predictive, at least within this data set.
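
The talk doesn’t show which classifier was trained, so the sketch below uses a plain scikit-learn pipeline as one possible stand-in; any simple classifier over the 60 indexes follows the same fit/predict pattern:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Scale the indexes (they live on very different ranges) and fit a simple classifier.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

predictions = model.predict(X_test)
print(f"Test accuracy: {accuracy_score(y_test, predictions):.2f}")
```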

Gianmarco Soretto:

So I think we are approaching the end of this presentation. What I’d like to do now is try it out on new data. Just so you can see this model in action that we have just trained. I’m sorry. I keep getting… Just like I want to… Oh, no. Okay. I’m sorry.

Brian Munz:

That’s fine.

Gianmarco Soretto:

Okay. Well, apologies. I kept getting messages, and I thought they were related to this talk, so I was like, “Oh, no, people are texting me maybe something related.”

Brian Munz:

Yeah, no.

Gianmarco Soretto:

Okay.

Brian Munz:

Everything, yeah.

Gianmarco Soretto:

What we are going to do now is, we have built this classifier that is based on Writeprint outputs. So we will try it out by processing a speech taken from elsewhere with Writeprint and then using the classifier against it. So I’m using these functions again, which are nothing more than the Writeprint call and then the extraction of the values from the JSON file. And here, as you can see, all that we do to test it is send the Writeprint indexes to the classifier we have just trained.
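
Those two helpers might be sketched roughly like this, reusing the hypothetical writeprint call and flattening from earlier; the important point is that the features must be ordered exactly as in the training table:

```python
import pandas as pd

def writeprint_features(text, feature_columns):
    """Run Writeprint on a new text and return its indexes in training-column order.
    Assumes extra_data flattens to the same column names used to build the table."""
    output = writeprint(text)
    flat = pd.json_normalize(output.extra_data, sep="_")
    return flat[list(feature_columns)].values    # shape (1, number_of_features)

def predict_author(text, feature_columns, model):
    return model.predict(writeprint_features(text, feature_columns))[0]

# Example usage, with feature_columns taken from the training table:
# feature_columns = df.drop(columns=["text", "author"]).columns
# predict_author(obama_excerpt, feature_columns, model)  # -> "Obama"
```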

Gianmarco Soretto:

Let’s try this out, then. I have here a website containing Obama speeches. So we can… I don’t know Brian if you have a… Do you want to point out? You can try it out at random or should we just click on one?

Brian Munz:

Yeah, maybe just random.

Gianmarco Soretto:

I don’t know if it’s going to work, so just let’s hope. Okay. Here.

Brian Munz:

There you go.

Gianmarco Soretto:

Okay. I’m just going to pull out a little bit of text. So what’s happening is that we will send this string to Writeprint and then use the classifier that we have just trained to see whether we get the correct prediction, which should be Obama in this case. Right? I think it’s Obama. So again, we pulled out the stylometric information and we have trained a model, basically, that uses only the information we have seen to make this sort of prediction.

Gianmarco Soretto:

Let’s just repeat the experiment with Trump. So here we have his remarks. So I’ll just check this. Let’s run it. And as you can see, again, a correct prediction.

Gianmarco Soretto:

As you can see, there is a lot to say about Writeprint. It’s a very nice tool that allows you to do this kind of stylometric analysis, so it’s very good for detecting interesting differences between the styles of different people, but it’s also enough to build classifiers that work. So I think it’s pretty exciting.

Brian Munz:

No, it’s interesting, especially that…

Gianmarco Soretto:

Mm…

Brian Munz:

… kind of used a mixture of Writeprint to identify what the different stylometric aspects are and then used it to train a model, which is a really interesting hybrid way to attack the problem. And it’s interesting too, because hearing and reading them in person is one thing, but having an AI able to identify the difference between styles of writing and speaking is pretty interesting. How have we seen this applied in the real world, I guess, in terms of applying this to, I don’t necessarily mean business, but just in a way that you’ve seen out in the wild?

Gianmarco Soretto:

Well, to answer this question, I actually have, as I said, prepared too much. I know we are a little over time, but I will show you just this use case, which I think is really fascinating, because authorship attribution is one way of doing stylometry. It’s very academic. It’s usually either legal cases or a high-end kind of use, but we can also try to see whether the style of different kinds of documents can be predictive, because maybe it is. So I’ve tried out Writeprint with this data set here. I’m going to share my screen just for a moment.

Gianmarco Soretto:

This data set of fake news and real news that was put together by the University of Victoria, to see what the stylistic differences between the two are. So I’m not going to spend too much time going through all the process, but just for you to see some very interesting results, I feel. First of all, length of sentences, again, is something to look out for… a good teller.

Gianmarco Soretto:

But then there are even surprising things, like here we have the POS breakdown of fake news. And as you can see, way more adverbs and pronouns. So that’s interesting. So shorter sentences, probably, again, employing more adverbs. I think it’s an interesting feature there. Or punctuation, that’s another very interesting thing to look at. So here, as you can see-

Brian Munz:

You’re actually not sharing your screen right now.

Gianmarco Soretto:

Real news, you have this-

Brian Munz:

Oh, there you go.

Gianmarco Soretto:

Oh, I’m not sharing my screen?

Brian Munz:

No, no. Yeah, you took it off. Sorry, but it worked for a while-

Gianmarco Soretto:

But did you see the-

Brian Munz:

But yeah.

Gianmarco Soretto:

But you saw this thing?

Brian Munz:

Yeah. No, yeah. So this is-

Gianmarco Soretto:

Okay. Well, it’s fine. So again, more colorful, simpler language. And then here, what you can see with punctuation, I think it’s fascinating. In real news, you have this big, huge chunk of double quotation marks. So quotation marks, it’s a very simple thing to look out for, but this really highlights it. So quotations, of course from witnesses and declarations, and they are absent from fake news, while you have far more question marks. So again, probably many rhetorical questions in fake news. And then vocabulary again, we have more words related to crime and a different breakdown.

Gianmarco Soretto:

And then I’ve done the same experiment with machine learning, and it’s even more predictive actually than the example we saw earlier. So I’m just going to do this very quickly. It’s the same thing, right? So you can see here that, using only these metrics that have just shown those differences, which again might be simple but are very revealing, we got a classifier that is able, at least in the context of this data set, which is very important to stress, to predict one or the other with an accuracy of basically 99%, which is very striking.

Gianmarco Soretto:

So I think there are lots of things that we can do with stylometry that go beyond authorship attribution.

Brian Munz:

Yeah, that’s really interesting because when you see it makes sense because you’re like, “Well, of course real news would have more quotes in it because it’s not trying to do as much commentary,” where fake news is going to have more commentary, more explanation, exclamation points, et cetera. So it’s interesting to see how language can achieve some of these things that we use a lot of fact checking for and things where in reality, you can see manipulative language just with stylometry.

Gianmarco Soretto:

Mm-hmm.

Brian Munz:

I’d like to thank you for presenting, obviously. This was super interesting. It’s a topic I’m always interested in, so we’ll have to have you come back someday. Thanks.

Gianmarco Soretto:

Thanks a lot.

Brian Munz:

And so yeah, next week we’re going to have Hybrid-ML and Large Language Models to the Test of Data Scarcity Scenarios. So definitely tune in for that. It’s going to be the same time, same place. And thanks again for presenting, and we’ll see everyone next week. Okay.

 
