Computational Linguistics -- Corpus, Challenges & Future Directions

Language & Linguistics Online Dr Khurram Shahzad

19m 23s · 2,241 words · ~12 min read
Auto-Generated

[0:00]In the name of Allah, the beneficent, the merciful. Dear audience, this is Language and Linguistics Online. Today I intend to talk about some of the challenges to computational linguistics. I should also talk about the difference between corpus linguistics and computational linguistics. Lastly, I would like to give you some examples from tokenization, parsing, and named entity recognition. In the previous video, I was talking about the scope of computational linguistics. So, within the scope of computational linguistics, we have got corpus linguistics. Remember, corpus linguistics focuses on natural language: how it is produced and how you are going to analyze it. So, of course, corpus means a large collection of texts. A large collection of texts which is computer readable is called a corpus. So, with the help of computers, you are analyzing the corpus, a large collection of texts. And with it, you can find out the frequencies, the different linguistic patterns, how people are making use of language within that particular corpus. Remember, corpus linguistics is a research methodology within linguistics. On the other hand, computational linguistics focuses on building algorithms and models. These models are going to read, analyze, understand, and generate natural language. So, it is a technical field. With the help of these models, or computer or digital tools, we are going to analyze the corpus. Computational linguistics lies at the intersection of linguistics and computer science. Now, the difference between corpus linguistics and computational linguistics is related to aims and objectives, research methodology, data use, and output. For example, the major goal of corpus linguistics is to describe and analyze language use. On the other hand, the major objective of computational linguistics is to enable computers to process language, to emulate language, to produce language.
So far as nature is concerned, corpus linguistics is methodological and empirical, while computational linguistics is technical and based on engineering. Data use: corpus linguistics observes the data, and in this way it talks about different patterns and frequencies.
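
The frequency counting described above can be sketched in a few lines of Python. This is only a minimal illustration, not a tool mentioned in the video: the function name `word_frequencies`, the two-sentence `toy_corpus`, and the simple word pattern are all assumptions made for the example. A real corpus would be far larger and would need more careful tokenization.

```python
import re
from collections import Counter


def word_frequencies(texts):
    """Count how often each lowercased word occurs across a list of texts."""
    counts = Counter()
    for text in texts:
        # Deliberately simple word pattern (an assumption of this sketch):
        # runs of letters, optionally joined by apostrophes, as in "don't".
        counts.update(re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower()))
    return counts


# A tiny hypothetical corpus, invented for illustration only.
toy_corpus = [
    "The dog barks at the cat.",
    "The cat sleeps.",
]

freqs = word_frequencies(toy_corpus)
print(freqs["the"], freqs["cat"], freqs["dog"])
```

Here the corpus linguist only observes and counts. Nothing is trained, which is the contrast with the computational side drawn in this segment.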

[3:25]Computational linguistics trains and evaluates models on data so that, on the basis of your commands, they work on the data automatically and give you the output. Corpus linguistics provides frequencies; it deals with linguistic data. Computational linguistics gives you the tools and models, so that you can work with the help of AI chatbots. Skills: corpus linguistics is based on linguistics and statistics. Computational linguistics deals with programming and machine learning. Within the field of computational linguistics, we have got syntax, semantics, pragmatics, and computer science. Now, within its scope, we have got the question of how syntactic structures are produced by human beings. The computer is going to model these syntactic structures if we feed all these things to the computer in an elaborate way. So, computational linguistics can do tokenization, parsing, and named entity recognition for us. Linguistics provides data in the form of rules and regulations. These rules and regulations come from phonology, from syntax, from semantics and pragmatics. So human beings are going to explain what the difference is between dictionary meaning and pragmatic or contextual meaning. These things we will have to elaborate to the computer system so that it can model them. First it should understand, it should analyze, and then it should produce natural language. So without linguistic knowledge, it is impossible to model language accurately. So human beings, through theoretical linguistics, are going to provide all these things to computer science. Now what computer science can do is provide algorithms and data structures.
So once rules and regulations have been made, these programs, on the basis of a large collection of texts, on the basis of the kind of corpus that you have provided to the computer, are going to model the language. So computers can provide us machine learning and AI. They can provide us programs so that we can program the languages. So, there are different kinds of programming languages, like C++, like Python, and people learn these languages so that they can produce computer programs. Computational efficiency: these tools enable the implementation of linguistic models in software systems. So how do these two fields integrate? We have got theoretical linguistics, or linguistics, and we have got computer science. Computational linguistics integrates linguistics and computer science through three primary approaches. Number one is rule-based systems. These rules have been created by human beings. In generative grammar, we have talked about how structure is formed, how syntactic structures are produced by human beings. What does a clause consist of? We have got phoneme, morpheme, phrase, and clause. So we determine the level and then we analyze the language. These ideas are coming from linguistics. These rules will help determine how the computer is going to read, understand, analyze, and then emulate, produce, or model the language. Phonological rules, syntactic rules, semantic rules, pragmatic rules: these we will have to feed to the software. Statistical models. Language patterns are learned automatically from large text corpora using computational methods, capturing probabilistic regularities. So, of course, even in the previous video, I talked about how these computer systems are based on probabilistic learning.
So they make decisions from a large collection of data, and they perceive and understand what is coming before the verb, what is coming after the verb, what kind of words are expected before the verb, and what kind of words are expected after the verb. Take bark: even we human beings know that bark is a word before which dog can come. So the computer will work on the basis of such probabilities, and then it will determine what kind of nouns can come before the verb and what kind of nouns can come after the verb. On the basis of these systems, the computer is going to model human language. Machine learning. Systems improve automatically through training data, adapting to new language patterns without manual rule updates. So once these programs are trained, these models are trained, these chatbots are trained, they improve. No doubt, right now they are making mistakes, but every time there is a new version of ChatGPT, every time there is a new version of Copilot or Gemini, they improve themselves. In this way, the kind of mistakes that they are making now, in future they are not going to make. So in a grammar checker like Grammarly, linguistics provides the rules and computer science implements them in software. In speech recognition, linguistics analyzes phonemes while CS processes audio signals. So the kind of commands that you are giving to conversational AIs like Google Assistant or Gemini, they are going to analyze the data and work with you accordingly. Accordingly, they are going to show you the kind of data that you are asking for, that you want to see, that you want to hear. Now in computational linguistics, we should also study in great detail later on what the applications of computational linguistics are.
So no doubt computational linguistics is being used in education these days; it is being used in healthcare, in business, in social media, and in government policies or government tasks. So in healthcare: clinical text analysis, automated medical record processing, and diagnostic support tools that extract meaning from physician notes. So these days, whenever you get your medical reports, they do not put their signatures on them, and they say it is a computer-generated report, it does not require signatures. So AI is working. Computer chatbots are working. Programs are working, and they are producing, and they are helping human beings to understand. And remember, now life is becoming very difficult. I do think about my students. Once they get their degrees in computer science or engineering, will they be getting jobs? Now the kind of tasks that you are going to do, chatbots can do these tasks. So you will have to be very, very precise, you will have to get deep learning, you will have to understand and produce something that these chatbots cannot do. So you will have to work very hard to get good jobs in the market. Business: customer feedback analysis, intelligent chatbots, and automated report generation that streamline enterprise operations. So these days you give different kinds of commands. We have got different apps, banking apps or AWT, for example, investment apps. So you perform a job at two o'clock at night and each and everything is being registered. So people are not sitting there, but these computers, these AI agents, are sitting and observing you, and they are doing the jobs for you. So in education: adaptive language learning tools, automated essay grading, and personalized feedback systems for learners at scale. So these days AI chatbots can grade your essays. They can produce different lesson plans for you.
So writing emails or narratives is not a big deal these days. But you will have to see how you are going to intervene, how you are going to get the job done, what kind of data you want from it, and how you are going to deal with this AI. So you will have to be a very effective learner, so that you can outperform these things to get better jobs in society. Social media and government: sentiment analysis, trend detection, intelligence analysis, and multilingual translation for policy and public communication.
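
The statistical-model idea from this segment, that a system learns which words are expected before a verb such as bark, can be illustrated with simple bigram counts. What follows is a hedged sketch rather than the method of any particular chatbot: the names `train_preceding_counts` and `prob_preceded_by` and the four training sentences are hypothetical, and real systems use far larger corpora and far richer models.

```python
from collections import Counter, defaultdict


def train_preceding_counts(sentences):
    """For every word, count which words occur immediately before it."""
    preceding = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        # Pair each token with the token that follows it.
        for prev, word in zip(tokens, tokens[1:]):
            preceding[word][prev] += 1
    return preceding


def prob_preceded_by(preceding, word, prev):
    """Relative-frequency estimate that `prev` is the word right before `word`."""
    counts = preceding.get(word, Counter())
    total = sum(counts.values())
    return counts[prev] / total if total else 0.0


# Hypothetical training sentences, invented for this example.
training = [
    "the dog barks loudly",
    "a dog barks",
    "the dog barks at night",
    "the fox barks",
]

model = train_preceding_counts(training)
p_dog = prob_preceded_by(model, "barks", "dog")
p_fox = prob_preceded_by(model, "barks", "fox")
print(p_dog, p_fox)
```

In this toy data, dog comes before barks in three of the four sentences, so the estimate for dog is higher than the one for fox. No rule about dogs was written by hand; the preference comes only from the counts.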

[14:27]So all of them are very important. So here computational linguistics is working. Lastly, I should talk about the challenges and future directions. So, the key challenges. Ambiguity. No doubt, language contains ambiguity. Words and sentences have got ambiguity: bank and bank, bank in the financial sense and river bank. So still, when you produce sentences, there are challenges for the AI, challenges for computational linguistics, to understand these things. They carry multiple meanings, and in this way sometimes these chatbots face problems. Data scarcity. For languages like Urdu, Punjabi, and Pashto, right now we have not produced corpora. So there is data scarcity: lack of knowledge, lack of resources available right now. Many of the world's languages lack sufficient training data for robust models. Context understanding. Machines have got the problem that right now they may not be able to understand, you know, the contextual meaning, but people are working, programs are working. So in the coming future, these problems will be resolved. So, emerging trends: deep learning models (transformers and large language models); multilingual systems (support for underrepresented languages); improved HCI (more natural human-computer interaction); ethical AI (fairness, bias reduction, and transparency). So all these are the challenges where AI needs to work, and it is working. Lastly, I should give you some examples of tokenization and parsing. So, word tokenization. The quick brown fox jumps. It splits text into individual words. So this is tokenization: the, quick, brown, fox, and jumps. The sentence is broken down into smaller pieces, word by word. So this is tokenization. You can count verbs, you can count nouns, you can count articles. Sentence tokenization. Splits text at sentence boundaries. She ran. He followed. So according to the kind of command that you have given to the computer software, it is producing the data.
She ran, full stop: one token. He followed, full stop: another token. Subword tokenization. Used in transformer models to handle rare words. Example: unhappiness, so un, happi, and ness. So in this way it has got three pieces: un, happi, ness. Challenges: punctuation ambiguity. Still it has got challenges. Contractions like don't. These are the challenges right now. Languages without spaces, like Chinese and Japanese. Stylistic applications: there you can apply software and analyze the data, and in this way you can give your findings about stylistic analysis as well. Some examples of parsing. Analyzing grammatical structure. Parsing analyzes the grammatical structure of a sentence, revealing how words relate to one another and the hierarchical organization of phrases. So here we determine the kind of function which the noun phrase, verb phrase, or adjectival phrase is performing within the sentence. So there are two types of parsing. Constituency parsing breaks a sentence into nested phrase structures. Example: the boy ate an apple. So NP, the boy; VP, ate, with NP, an apple, again. And dependency parsing shows direct word-to-word grammatical relationships. What kind of relationship exists within the sentence? Ate is the main verb, boy is the subject, and apple is the object; at the same time, it belongs to a noun phrase. So this is how you can carry out parsing. Dear audience, in this video, I have talked about some of the things related to the scope of computational linguistics. I have differentiated between corpus linguistics and computational linguistics. Lastly, I have talked about challenges and future directions of computational linguistics. I have also given you some examples of tokenization and parsing. Thank you very much.
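
The tokenization and parsing examples from this segment can be written out as a short Python sketch. Everything in it is an assumption made for illustration: the regular expressions are deliberately simple, the greedy longest-match subword splitter and its three-entry vocabulary are hypothetical stand-ins (transformer tokenizers learn their vocabularies from data), and the two parses of "The boy ate an apple" are typed in by hand, not produced by a parser.

```python
import re


def word_tokenize(text):
    """Split text into words and punctuation marks; contractions such as don't stay whole."""
    return re.findall(r"\w+(?:'\w+)*|[^\w\s]", text)


def sentence_tokenize(text):
    """Split text after '.', '!' or '?' when whitespace follows."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def subword_tokenize(word, vocab):
    """Greedy longest-match split of a word into pieces from a given vocabulary."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # No vocabulary entry matches here: fall back to a single character.
            pieces.append(word[i])
            i += 1
    return pieces


words = word_tokenize("The quick brown fox jumps.")
sentences = sentence_tokenize("She ran. He followed.")
pieces = subword_tokenize("unhappiness", {"un", "happi", "ness"})

# Hand-written constituency parse as nested tuples: (label, child, child, ...).
constituency = ("S", ("NP", "The", "boy"), ("VP", "ate", ("NP", "an", "apple")))

# Hand-written dependency parse as (head, relation, dependent) triples.
dependencies = [
    ("ate", "subject", "boy"),
    ("ate", "object", "apple"),
    ("boy", "determiner", "The"),
    ("apple", "determiner", "an"),
]


def leaves(tree):
    """Read the words back off a constituency tree, left to right."""
    if isinstance(tree, str):
        return [tree]
    found = []
    for child in tree[1:]:  # tree[0] is the phrase label
        found.extend(leaves(child))
    return found


print(words)
print(sentences)
print(pieces)
print(leaves(constituency))
```

The challenges mentioned above show up even here. The word pattern needs a special case to keep don't together, the sentence splitter would break wrongly after an abbreviation such as Dr., and neither tokenizer would work on a language written without spaces.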
