
Computational Linguistics -- Definition, Brief History, Scope, Tokenization & Parsing

Language & Linguistics Online Dr Khurram Shahzad

25m 17s · 3,175 words · ~16 min read
Auto-Generated

[0:00]In the name of Allah, the Beneficent, the Merciful. Dear audience, this is Language and Linguistics Online. Today I intend to talk about computational linguistics. In this short video, I shall cover the historical perspective of computational linguistics, give you two or three different definitions of it, and later talk about its scope. We know that human beings have been given the ability to speak and to produce language; this is a cognitive ability granted to human beings by Allah Almighty. What people are now trying to do, with the help of computer science, is to generate, manipulate, and replicate the same kind of language that human beings produce. No doubt, they have been very successful up to now, but they have still not been able to produce a model that can think and produce language the way human beings can. That is why the chatbots we come across, like ChatGPT, Gemini, Copilot, Alexa, or Siri, keep getting new versions: whenever they run into new kinds of problems, new versions are produced. In the term computational linguistics we have two words: computational, which relates to computers and digital tools, and linguistics, the scientific study of language. At the crossroads of linguistics and computer science, then, we have computational linguistics. It takes ideas from theoretical linguistics, makes use of digital tools, and in this way tries to solve human problems. For example, it can do different jobs: it can write emails, analyze data, and report its own findings. So with the help of computational linguistics, these are the kinds of issues we will be dealing with.
With computational linguistics, then, we will be talking about human language as well as digital tools. Within linguistics we have phonology, syntax, semantics, and pragmatics, and we will have to spell each of these out in detail, because a computer works according to whatever data we have fed into it. We will have to elaborate the phonological system of the language, the syntactic system — here, particularly, of English — and the semantic system: the meanings given in the dictionary, and the pragmatic or contextual meanings, how people perceive them. When human beings interact with one another, they make use of syntactic structures, semantics, and pragmatics, and of course phonetics and phonology — how sounds are produced. All of this we will have to make explicit to the system so that it can understand it. No doubt, generative grammar and Chomsky's ideas have also played a very important role here: with a limited number of rules, we produce an unlimited number of sentences. The same rules and regulations have been fed to computers, so that computers can understand how human beings interact — what kind of language, what kinds of syntactic structures, phonological structures, and phrasal categories they use. Once the system understands all of this well enough, it will be able to understand, analyze, and later on produce human language. It was a Muslim scholar, al-Khwarizmi, whose name gave us the word algorithm. An algorithm here means the rules and procedures we apply when we use language, and computers, too, work according to such algorithms.
We will have to understand those algorithms, because whenever we say something to Alexa, Siri, Copilot, or Gemini, it takes our commands, tries to understand and dissect them, and then performs the job accordingly. Computational linguistics makes use of algorithms, statistical models, and machine learning to process language data at scale, and it helps in translating language and in the recognition and production of speech. Let us talk about the definitions of computational linguistics. The first: the scientific study of language from a computational perspective, involving the development of algorithms and models to process and analyze natural language data. This emphasizes the methodological rigor behind computational linguistics. Here, of course, we are in the 1950s, after the Second World War and during the Cold War, when the big powers wanted to understand the code systems of their opponents. They decided that machine translation should be used, and they started investing money in computational linguistics. In the beginning, early systems were able to produce correct translations of selected Russian sentences; at that time, the purpose was that machines should be able to process, understand, and analyze the language. Another definition: a field concerned with the use of computers to simulate and analyze human language, including tasks such as language understanding, generation, and translation. This highlights the practical simulation of human linguistic abilities. Of course, in the 1980s, when statistical models were being produced,

[7:55]the purpose of computational linguistics was no longer just to analyze language but also to produce it. And later on, the broadest view one can give is the third definition: an interdisciplinary domain integrating insights from linguistics, artificial intelligence, and computer science to enable machines to understand and produce natural language. This is the most expansive framing, acknowledging computational linguistics' deep ties to AI. These days we are living in the 21st century, where different chatbots — ChatGPT, Copilot, Gemini, Alexa — are performing their jobs and playing a vital role in linguistics and in education. Today you can ask them to make lesson plans for you, prepare examination papers, and even check answer scripts: if a script is in soft form, it can be uploaded, and they can grade it and give you the answers. Now let us talk briefly about the history of computational linguistics. In the 1940s and 50s, after the Second World War and during the Cold War, people wanted to understand the language of the enemy — the kinds of codes being used — and for this purpose they took help from computers. Early machine translation systems emerged; they were rule-based, and word-for-word approaches dominated Cold War research. When we say rule-based, it means the ideas were coming from linguistics — from Noam Chomsky and generative grammar. In generative grammar we usually draw tree diagrams, dissecting a sentence into noun phrases, verb phrases, or adjectival phrases. In the same way, researchers were dissecting the language and feeding computers with such rules.
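The rewrite-rule idea just described — a limited number of rules producing an unlimited number of sentences — can be sketched in a few lines of Python. The grammar and vocabulary below are invented for illustration, not taken from any real system:

```python
import random

# Toy phrase-structure grammar in the generative style described above.
# Nonterminals map to lists of possible expansions; anything not in the
# table is a terminal, i.e. an actual word.
GRAMMAR = {
    "S":   [["NP", "VP"]],
    "NP":  [["Det", "N"], ["Det", "Adj", "N"]],
    "VP":  [["V", "NP"]],
    "Det": [["the"], ["a"]],
    "Adj": [["small"], ["new"]],
    "N":   [["student"], ["book"], ["teacher"]],
    "V":   [["reads"], ["sees"]],
}

def generate(symbol="S"):
    """Recursively expand a symbol by picking one rule at random."""
    if symbol not in GRAMMAR:        # terminal: an actual word
        return [symbol]
    words = []
    for part in random.choice(GRAMMAR[symbol]):
        words.extend(generate(part))
    return words

print(" ".join(generate()))  # e.g. "a small teacher sees the book"
```

Even this tiny grammar yields dozens of distinct sentences; add a recursive rule (say, a prepositional phrase inside NP) and the set becomes unbounded — which is exactly the "limited rules, unlimited sentences" point.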
When the computer was able to understand those rules — when it could read dictionaries and their meanings, noun phrases, verb phrases, adjectival phrases, phonemes, and morphemes — it was able to analyze the language. In the 1980s, corpus linguistics took hold: large text collections enabled empirical, data-driven language study. A corpus means a large, machine-readable collection of data, spoken or written. A general corpus can contain language from novels, poems, and dramas; the language of the law and of advocates; the language of medical science; teachers' language; students' language, their essays and assignments — anything, so long as it forms a representative sample of sufficient size, can be collected and converted into a corpus. In the 1980s, large corpora were being produced, like the BNC, the British National Corpus, and the ANC, the American National Corpus; these days we have specialized corpora as well, meaning, for example, the language of teachers only, or of students only. In the 1990s, statistical methods rose: probabilistic models outperformed rigid rule-based systems in accuracy and flexibility. No doubt, word-for-word translations had their own issues, and rule-based grammars and translations had theirs, so statistical tools were produced and put to use.
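The issues with word-for-word translation mentioned above are easy to demonstrate. Below is a minimal dictionary-lookup translator; the tiny English-to-French lexicon is invented for illustration:

```python
# Word-for-word, rule-based translation: look each word up in a bilingual
# lexicon and keep the source-language order. This is the early approach
# described above, and it shows why such output often comes out wrong.
LEXICON = {"the": "le", "black": "noir", "cat": "chat", "sleeps": "dort"}

def word_for_word(sentence):
    # Unknown words are flagged rather than guessed.
    return " ".join(LEXICON.get(w, f"<{w}?>") for w in sentence.lower().split())

print(word_for_word("The cat sleeps"))        # "le chat dort" -- happens to be right
print(word_for_word("The black cat sleeps"))  # "le noir chat dort" -- wrong order;
                                              # French says "le chat noir"
```

The second sentence fails because French places this adjective after the noun; word order, agreement, and idioms are precisely what a word-by-word lookup cannot capture.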

[12:25]These methods focused on probability. When you have a large corpus, you can give the computer a lot of data; it can search through that data and give you answers, so the chances are that better translations will be produced. In the 2000s and 2010s, machine learning transformed the field: supervised and unsupervised models achieved new performance benchmarks. And these days, in the 21st century, we are talking about deep learning and neural networks. NLP — natural language processing — systems play a very important role; they are transforming and redefining large language models, and the possibility is that in this way they can produce better results. We now have deep learning translators that can translate idioms, phrases, and fixed expressions as well, and that understand pragmatics very well. Lastly, I shall talk about the scope of computational linguistics. NLP is the most visible and applied branch of computational linguistics. It encompasses the full pipeline of tasks that allow machines to interact with human language, from raw text to meaningful understanding and response. And what is raw data? The kind of language I am speaking here, and the language that you speak: if it is recorded — say, in an interview — and then transcribed and collected, it can be turned into data. That raw data is what the computer works on in order to give us answers.
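The probabilistic idea above can be sketched with a toy bigram model: count adjacent word pairs in a corpus and turn the counts into conditional probabilities. The mini-corpus here is invented for illustration:

```python
from collections import Counter

# Count single words (unigrams) and adjacent pairs (bigrams) in a tiny corpus.
corpus = "the cat sat on the mat the cat saw the dog".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def bigram_prob(w1, w2):
    """Estimate P(w2 | w1) as count(w1 w2) / count(w1)."""
    return bigrams[(w1, w2)] / unigrams[w1]

print(bigram_prob("the", "cat"))  # 2 of the 4 "the" tokens are followed by "cat": 0.5
```

Real statistical systems work the same way in spirit but over millions of sentences and with smoothing for unseen pairs — which is why, as the lecture says, the larger the corpus, the better the answers.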

[14:50]First of all, we have conversational AI. These days Gemini and Copilot can talk to you live; they can interact with you. These chatbots can help you resolve educational issues, social issues, or even issues related to medical science. Next, text summarization. These days, when you open a PDF, it asks you: do you want a summary of this document? It can summarize a long document, manage it for you, and tell you the major points within it. Then come the core NLP tasks: number one is tokenization, then parsing, named entity recognition, and sentiment analysis. These are the building blocks of every NLP system. Tokenization: usually the data comes in the form of paragraphs — text and discourse — and that text has to be broken down into smaller units. Breaking the data down into smaller categories — phonemes, morphemes, words, or phrases — is tokenization. These tokens can then be counted and their frequencies determined. Computational linguistics takes a quantitative approach, dealing with numbers, so here tokenization leads to the frequency of words — of adjectives, of different kinds of verbs — used by the author or text producer. You then analyze how many tokens of a particular word the person has used in a particular text. Number two is parsing. Parsing moves beyond tokenization: once tokenization is done, you look at constituent relationships — dissecting sentences into their phrases and generating tree diagrams, where you find that this is an NP, this a VP, this an adjectival phrase, this a PP, a prepositional phrase — and at the functional aspect of the syntactic structure.
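The tokenization-and-counting step just described can be sketched like this (a real tokenizer also handles punctuation, clitics, and multiword units, which this toy pattern ignores):

```python
import re
from collections import Counter

# Tokenize raw text into lowercase word tokens, then count frequencies --
# the quantitative step described above.
text = "The teacher praised the students, and the students thanked the teacher."

tokens = re.findall(r"[a-z']+", text.lower())
freq = Counter(tokens)

print(freq["the"])           # 4
print(freq.most_common(2))   # [('the', 4), ('teacher', 2)]
```

From this frequency table you can already answer the question the lecture poses: how many tokens of a particular word the text producer has used.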
You look at the role a phrase performs in the sentence: in subject position or object position we have NPs, so you have to understand and analyze, with the help of computational or digital tools, whether a given noun phrase is working as the subject of the sentence or as its object, and what kind of object you can expect after the verb. These are dependency relations and constituency relations, and you understand them through parsing. The computer can do parsing for you, just as digital tools can do tokenization for you; it can tell you the functional role of a particular NP within the sentence. So parsing is very important. Next, named entity recognition. Within a discourse or text we have Karachi, London, America — names of places, proper nouns. We need to feed the system so that it can recognize that America is the name of a place, Karachi the name of a city, Ayesha the name of a person — or that something is a date. In "I met Ahmed in Karachi in the 1990s," the computer should recognize "the 1990s" as a date. This is the recognition of named entities. Then sentiment analysis. Programs are still being developed here — fully reliable ones are not yet available — and researchers are trying their level best to produce tools that are objective and neutral and that perform well on command. Sentiment analysis means determining whether a text is positive, negative, or neutral, so we have to look at the kinds of adjectives used by the speaker or text producer.
If positive words are there, you tell the program what you want: this is the list of positive words, this is the list of destructive or negative words, this is the list of neutral words. Then the computer can do sentiment analysis of the data. Another area within the scope of computational linguistics is translation — machine translation, as it is called these days. Automatic translation of text between languages is one of CL's most impactful applications. From Google Translate to multilingual platforms, MT bridges linguistic divides using three main approaches.
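The word-list approach to sentiment analysis described above can be sketched as follows; the positive and negative lists are invented stand-ins for a real sentiment lexicon:

```python
# Lexicon-based sentiment: count positive and negative words and compare.
POSITIVE = {"good", "excellent", "helpful", "clear"}
NEGATIVE = {"bad", "poor", "confusing", "boring"}

def sentiment(text):
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(sentiment("the lecture was clear and helpful"))  # positive
print(sentiment("the slides were confusing"))          # negative
print(sentiment("the class met today"))                # neutral
```

This is exactly as objective — and as limited — as the lists it is given: negation ("not helpful") and irony defeat it, which is why the lecture notes that fully reliable tools are still being developed.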

[20:33]First, the rule-based approach. Once again, according to the rules you have given to the computer — the noun phrases and verb phrases you have clarified, together with dictionaries — it produces a rule-based, word-for-word translation, which can have issues. Second are statistical, probabilistic machine translations, where large corpora are fed to the computer programs. Different chatbots still take help from search: they look through the data and then produce answers. The kind of corpus you have built plays a very important role here: the larger the corpus, the more noun phrases and verb phrases the system can look for, and on the basis of those probabilities it will give you a better answer. Third are neural, deep learning translators. Again, you have to feed in large corpora so that the machine can read, understand, analyze, and then generate the kind of language you ask it to generate. Next, speech processing. Speech processing combines linguistics with signal processing and machine learning to handle spoken language. It includes two complementary subfields: speech recognition (STT), converting spoken audio into text, used in voice typing, smart home devices, and automated call centers; and speech synthesis (TTS), generating natural-sounding speech from text, powering screen readers, navigation systems, and voice assistants. It is a big achievement that, with the help of computational linguistics and digital tools, today we can convert spoken language into written text and text into speech. Then corpus linguistics. Corpus linguistics uses large, structured text collections (corpora) to study language empirically — empirical meaning something you can observe.
Researchers analyze word frequencies, collocations, and grammatical patterns at scale. You have to decide the scale: whether you want analysis at the word level, the phrase level, the sentence level, or the text level — what kind of analysis you want. You can make use of the British National Corpus or the American National Corpus, or you can build your own specialized corpus. Then there is syntax and parsing. Syntax governs how words combine into grammatical sentences; parsing is the computational process of analyzing that structure, often visualized as a parse tree. Applications include grammar checkers like Grammarly, automated essay scoring, and the dependency parsers that underpin more complex NLP pipelines. Today Grammarly can check your sentences and guide you: these are the mistakes you are making, particularly in subject-verb agreement. As for understanding meaning, computational linguistics can help with the dictionaries and data available online — the kind of corpus you are using, or the dictionaries you have given to the computer.
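A Grammarly-style subject-verb agreement check can be caricatured in a few lines. This toy version handles only "The <noun> <verb> ..." sentences with regular plurals; the pattern is an invented simplification, and a real checker parses the whole sentence instead:

```python
import re

# Naive agreement check: a regular plural subject ("students") should take
# a bare verb ("write"); a singular subject takes the -s form ("writes").
def agreement_ok(sentence):
    m = re.match(r"The (\w+?)(s?) (\w+?)(s?)\b", sentence)
    if not m:
        return None              # sentence shape not recognised
    noun_is_plural = m.group(2) == "s"
    verb_has_s = m.group(4) == "s"
    return noun_is_plural != verb_has_s

print(agreement_ok("The student writes essays."))   # True  (agrees)
print(agreement_ok("The students writes essays."))  # False (mismatch)
```

Irregular nouns ("class", "children") break this immediately, which is why production checkers rely on full parsing and a lexicon rather than surface patterns.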

[24:06]It can deal with semantics: it can give you the meanings of words and talk about their denotational, emotional, or connotational meanings. Semantics in action: word sense disambiguation — determining which meaning of bank is intended in a given sentence. The system looks at the context, what comes before and what comes after within that particular sentence, and can tell whether you are talking about the bank that keeps your money or about a different kind of bank. Then information retrieval and extraction: these days, through keyword searches, we can extract data from a corpus by computational means. This is called text mining, or text extraction. In my later videos I shall talk about these things again. In this video, I have talked about the definition of computational linguistics, a brief history of it, and some of its scope. Thank you very much.
