
[0:00] Hello, my name is Brian and I'm a data scientist with the Department of Customer Service. Today I'm going to be talking to you about some of the work we've been doing with natural language processing, in particular in the area of named entity recognition, and I'm going to focus on how we've been able to extract information from government tender documents using some fairly sophisticated machine learning models.

First of all, for those of you who might not be familiar with natural language processing, or NLP: NLP is essentially a field of artificial intelligence where we develop computer programs and algorithms that allow computers to understand human language, which is obviously very different from the way computers understand computer programs.

In terms of what we're going to cover today, I'm first of all going to set the scene and talk a little bit about government tenders, for those of you who are not familiar with them. Then I'm going to talk about how we've been using named entity recognition to extract particular pieces of information from those tender documents, then about some of the challenges we encountered, and finally I'll share some lessons learned about our process, along with some next steps and where we see this project going in the future.

So, first of all, a little bit about government tenders. Here in New South Wales, the government spends about 30 billion dollars, with a B, every year on procuring various goods and services. The way the government procures these goods and services is through a process called tendering: the government posts a tender on an online portal, which is essentially an advertisement for a particular good or service that the government wishes to procure. Various businesses or vendors then respond to that tender by outlining how they are able to provide that good or service, how much it's going to cost, and so on, and ultimately the government awards the tender to the vendor who best meets the requirements of that tender.

The actual process of reading and evaluating these tenders is very time consuming, because as you can imagine, government tenders are very long, detailed, legalistic documents. The challenge for the government is analyzing all of the text that comes in across the various tender responses and determining which vendor actually best meets the government's needs. One way we can speed up this process is by automating the extraction of key information from those tender documents.

So, what exactly is named entity recognition, or NER? Named entity recognition is a sub-discipline of natural language processing where we try to extract particular entities from unstructured text. Entities are essentially real-world objects like people, places, organizations, dates, amounts and so on. A good example is the sentence "Brian goes to Sydney on Monday." If we apply a named entity recognition model to that sentence, we can extract the various named entities: Brian is a person, Sydney is a location and Monday is a date. That, essentially, is how named entity recognition works.
So, the way we applied named entity recognition to government tenders was by taking the tender documents and trying to extract particular pieces of information that we deem useful for analyzing those tenders. For example, we might be interested in the names of the vendors responding to the tender, the particular dollar amounts they're quoting, or the dates and times at which certain events are going to occur. By extracting these pieces of information automatically, we can help the government speed up the process of analyzing the tenders.

The first thing we ran into when we started this project was the fact that a lot of natural language processing tools are really built for what we call general English: they're trained on general English corpora, like Wikipedia and news articles. Government tender documents are very different from that. They're very legalistic, very formal, and very specific to government, and so are a lot of the entities we were trying to extract. For example, we might be trying to extract something like a tender identification number or an ABN (Australian Business Number), which is very specific to government and wouldn't appear in a general English corpus. Essentially, a lot of the existing tools didn't work as well as we would like, so we had to build our own custom named entity recognition models.

So, how did we do that? We used a technique called active learning, which is a very powerful technique, particularly when you don't have a lot of labeled data, which was the case in our scenario. With active learning, we take a small sample of tender documents, say 50, and we manually label those documents for all of the entities we're interested in; we highlight, for example, the various vendors, the dollar amounts, the dates and so on. Then we train a machine learning model on that manually labeled data; in our case, we used a transformer, a deep learning architecture that is state of the art in natural language processing. Once that model has been trained, we apply it to the remaining unlabeled data, and the model predicts the entities it thinks exist in that data. We then apply an uncertainty sampling algorithm to the model's predictions, which identifies the pieces of text where the model is most uncertain about its predictions. We provide those pieces of text back to a human expert, who labels them, and we put them back into the training data. We keep repeating that cycle until the model reaches a certain level of accuracy. The good thing about this is that we don't have to manually label a huge amount of data, which would be very time consuming and very expensive, so we get a good bang for buck from our manual labeling.
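As a rough illustration of the uncertainty-sampling step in that loop, here is a small sketch. It assumes the trained model can expose per-token probabilities over the entity labels; the scoring rule (least confidence) and all function and variable names are assumptions for illustration, not details from the talk.

```python
# Sketch of uncertainty sampling over a pool of unlabeled documents.
# Assumes the NER model yields, per document, an array of shape (n_tokens, n_labels)
# containing label probabilities for each token.
import numpy as np

def document_uncertainty(token_probs: np.ndarray) -> float:
    """Least-confidence score: average over tokens of (1 - max label probability)."""
    return float(np.mean(1.0 - token_probs.max(axis=1)))

def select_for_labelling(pool_probs, k=10):
    """Return indices of the k documents the model is least certain about."""
    scores = [document_uncertainty(p) for p in pool_probs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

# Toy usage: two documents with 3 tokens each, probabilities over 4 entity labels.
doc_a = np.array([[0.90, 0.05, 0.03, 0.02],
                  [0.80, 0.10, 0.05, 0.05],
                  [0.70, 0.20, 0.05, 0.05]])
doc_b = np.array([[0.40, 0.30, 0.20, 0.10],
                  [0.50, 0.30, 0.10, 0.10],
                  [0.35, 0.35, 0.20, 0.10]])
print(select_for_labelling([doc_a, doc_b], k=1))  # -> [1]: doc_b goes to the human expert
```

The selected documents would then be labeled by the expert, added to the training set, and the model retrained, repeating until accuracy is acceptable, as described above.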
So, some of the challenges we encountered with this approach. The first challenge was getting access to the data. As you can imagine, government tenders are often quite sensitive documents, so we needed to go through a fairly comprehensive process to ensure that we had appropriate security and privacy controls in place to protect the data, and that took a little bit of time.

The second challenge was the quality of the data. Government tenders come in a wide variety of formats: they can be PDFs, they can be Word documents, a lot of them have tables, a lot of them have images, and extracting text from tables and images is actually very difficult. We had to apply a number of different techniques, things like optical character recognition, or OCR, to extract information from some of those documents. Even then, the quality of the text wasn't always as high as we would like, so we had to do a lot of data cleaning, a lot of what we call pre-processing, before we could actually apply our models to it.
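To give a feel for that pre-processing, here is a minimal sketch of the kind of post-OCR clean-up involved. The specific rules are illustrative assumptions; the talk doesn't detail the exact cleaning steps the team used.

```python
# Illustrative post-OCR text clean-up: the rules below are common fixes for
# OCR'd documents, not the project's actual pre-processing pipeline.
import re

def clean_ocr_text(text: str) -> str:
    text = text.replace("\u00ad", "")        # drop soft hyphens
    text = re.sub(r"-\n(?=\w)", "", text)    # re-join words hyphenated across line breaks
    text = re.sub(r"[ \t]+", " ", text)      # collapse runs of spaces and tabs
    text = re.sub(r"\n{3,}", "\n\n", text)   # collapse excessive blank lines
    return text.strip()

sample = "Procure-\nment of ser-\nvices   for  2023\n\n\n\nSchedule A"
print(clean_ocr_text(sample))
# prints "Procurement of services for 2023", a blank line, then "Schedule A"
```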
The third challenge was the actual manual labeling of the documents. As I mentioned, government tender documents are very long, detailed, legalistic documents, so it was very difficult to find a subject matter expert who was able to dedicate their time to reading and labeling these documents for all of the different entities we were interested in. And the fourth challenge was the actual deployment of the models. Because the data is sensitive, we had to deploy the models in a secure environment, and we also had to work closely with the security team to ensure that everything was up to scratch in terms of security and privacy.

So, some of the lessons we learned during this process. First of all, starting with a clear problem definition is absolutely critical. We spent a lot of time up front talking to our stakeholders and working out exactly what they wanted from this project, what sort of information they wanted to extract, how they wanted to use it, and so on. That helped us immensely, because it meant we weren't just extracting information for information's sake; we were extracting information that was actually useful to the business. The second lesson was the importance of data quality. Garbage in, garbage out, as the saying goes: if the quality of the data you're putting into your models isn't very high, then the quality of the predictions you get out of those models isn't going to be very high either, so we spent a lot of time cleaning and pre-processing the data. The third lesson was the importance of keeping a human in the loop. As I mentioned, we used active learning, which is a human-in-the-loop approach, and that was really critical for us because it meant we were able to get very high quality labels for our training data, which in turn meant that our models were able to perform very well. And the fourth lesson was about collaboration. Working closely with the security and privacy teams was absolutely critical, as was working closely with our stakeholders and subject matter experts, to ensure that we were building something that was actually useful and was going to meet their needs.

In terms of next steps and where we see this project going in the future: the first thing is we want to expand the types of entities we're extracting. Currently we're only extracting a very small number of entities, and we want to expand that to a much wider variety. We also want to expand the types of documents we're analyzing. Currently we're only looking at government tender documents, and we want to expand that to other types of government documents, like policy documents, legal documents and so on. We also want to integrate this capability into existing government systems, which means working closely with our IT teams and other teams in government to ensure that it can be easily integrated into those systems. And the fourth thing is we want to explore some of the more advanced techniques available in natural language processing, things like large language models and other powerful deep learning models, and see how we can use those to extract even more information from these documents. That's essentially where we see this project going in the future.

So, thank you very much for your time, and if you have any questions, I'm happy to answer them.
