[0:00]Hey all, Data Guy here. Today I have a video for you where I'm bringing it back to basics and explaining what data pipelines are for beginners, so that everyone can understand why data pipelines are so important, what they do, how they operate, and the different services involved in them. And then also how they're powering most of the world around you. You might not think it, but every time you make a search on Netflix, or buy something on Amazon, or even just click on an ad, the operations that are giving you a recommendation of what to buy, what to watch, or what ad to serve you are all handled by different types of data pipelines. So that's what I'm going to show you today: what those different types of data pipelines are, and really just everything you need to know to understand how data moves behind the scenes throughout the world. I hope you enjoy this video. If you like these videos, please like and subscribe; it helps me out a ton. But without further ado, let's get into it.
So the first thing I want to go over is what a data pipeline is at its most basic, dictionary definition: it's a structured series of processes that collect, process, and store data. You can see a very simple example right here: the flow of data between different systems in a structured way, where many different systems are working together. You're pulling data from, let's say, a website, you're storing it in a data warehouse, and maybe it's powering a report. Or you're doing some transformations just within that data warehouse. And the primary goal of data pipelines is automating data movement and processing.
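To make that definition concrete, here is a minimal sketch of the collect, process, store flow in Python. It is an illustration under stated assumptions, not how any particular company builds pipelines: the CSV text stands in for data pulled from a website, an in-memory SQLite database stands in for the data warehouse, and every name in it (`RAW_CSV`, `collect`, `process`, `store`, `revenue_report`) is hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical raw export from a website; stands in for a real data source.
RAW_CSV = """order_id,customer,amount
1,alice,19.99
2,bob,5.00
3,alice,42.50
"""


def collect(raw_text):
    """Collect: read rows from the source as dictionaries of strings."""
    return list(csv.DictReader(io.StringIO(raw_text)))


def process(rows):
    """Process: cast types and normalize customer names."""
    return [
        (int(r["order_id"]), r["customer"].strip().title(), float(r["amount"]))
        for r in rows
    ]


def store(conn, records):
    """Store: write the cleaned rows into a table (our stand-in warehouse)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", records)
    conn.commit()


def run_pipeline(conn, raw_text):
    """Run the three stages in order, with no manual steps in between."""
    store(conn, process(collect(raw_text)))


def revenue_report(conn):
    """The report the stored data powers: total spend per customer."""
    return conn.execute(
        "SELECT customer, ROUND(SUM(amount), 2) FROM orders "
        "GROUP BY customer ORDER BY customer"
    ).fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    run_pipeline(conn, RAW_CSV)
    for customer, total in revenue_report(conn):
        print(customer, total)
```

Once the stages are functions, producing the report again for new data is just another call to `run_pipeline`, which is the automation the definition is pointing at.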
So you don't have to do things like in the old days, where if you wanted to get data, process it, and generate a report, someone had to go get that data, make those transformations manually, with no automation, and then produce that report. If you wanted another report, great, you had to go do that again, for that same set of data or for a different set of data. So data pipelines are really one of the foundational aspects of the modern dataverse, and they are what allowed large-scale data movement, because as you can imagine, you can only have so many people whose job it is to manually move data. Automating this allows you to have not only a couple dozen data pipelines for your business, but thousands, tens of thousands, even millions of data pipelines. That's what really large enterprises like Walmart are doing: millions of data pipelines powering a global business.
Generally, data pipelines fall into two different approaches. You either have extract, transform, load, an ETL pipeline, where data is first extracted from your sources, then transformed into a clean and structured format in some kind of staging area. This could be an application like Airflow, or an S3 bucket or other object storage, really anything that isn't a data warehouse. Then finally, once that data is transformed and prepared, it's loaded into a data warehouse.
The flip side, and this is something a lot of data pipelines are moving towards, is ELT pipelines. In an ELT pipeline, the order is flipped: the data is extracted and loaded in its raw format into the data warehouse, and then you make use of that data warehouse to transform the data where it lives.
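The difference between the two approaches is easier to see side by side. The sketch below is a toy comparison, assuming an in-memory SQLite database as a stand-in for a real data warehouse; the records, table names, and function names are all made up for illustration. In `etl` the cleanup happens in application code before anything is loaded, while in `elt` the raw strings are loaded first and the cleanup is a SQL statement run inside the database.

```python
import sqlite3

# Hypothetical raw records as they might arrive from a source system.
RAW_PURCHASES = [
    {"customer": " Alice ", "amount": "10.50"},
    {"customer": "bob", "amount": "3.25"},
    {"customer": "ALICE", "amount": "4.50"},
]


def etl(conn, raw):
    """ETL: transform outside the warehouse, then load only the clean rows."""
    clean = [(r["customer"].strip().lower(), float(r["amount"])) for r in raw]
    conn.execute("CREATE TABLE purchases_etl (customer TEXT, amount REAL)")
    conn.executemany("INSERT INTO purchases_etl VALUES (?, ?)", clean)


def elt(conn, raw):
    """ELT: load the raw strings as-is, then transform with SQL in place."""
    conn.execute("CREATE TABLE raw_purchases (customer TEXT, amount TEXT)")
    conn.executemany(
        "INSERT INTO raw_purchases VALUES (:customer, :amount)", raw
    )
    conn.execute(
        "CREATE TABLE purchases_elt AS "
        "SELECT LOWER(TRIM(customer)) AS customer, "
        "CAST(amount AS REAL) AS amount "
        "FROM raw_purchases"
    )


def totals(conn, table):
    """Total spend per customer; `table` is a fixed internal name, not user input."""
    return conn.execute(
        f"SELECT customer, SUM(amount) FROM {table} "
        "GROUP BY customer ORDER BY customer"
    ).fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    etl(conn, RAW_PURCHASES)
    elt(conn, RAW_PURCHASES)
    print("ETL:", totals(conn, "purchases_etl"))
    print("ELT:", totals(conn, "purchases_elt"))
```

Both paths end with the same clean table. What changes is where the transformation work runs, and in the ELT version the untouched `raw_purchases` table stays available in the warehouse for later transformations.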
And this is the direction a lot of really large enterprises are going, because data warehouses are now really efficient and really cheap for these kinds of large-scale transformations and queries. Before, it was computationally expensive, and you would run through your compute budget if you tried to do all these transformations within your data warehouse. Now it's actually more efficient, because you have things like Snowflake, Redshift, and BigQuery, where you can manipulate large amounts of data within the data warehouse at a very low cost.
Now, within data pipelines, there are typically a few key components. Number one, and this is kind of an obvious one, but some people don't think about it, is data sources. You need to have sources you're pulling data from. Typically these are things like databases, whether SQL or NoSQL based, such as MySQL or MongoDB. A lot of times they're also APIs and web services, like hitting a REST API that pulls some data from an external application server. You might be receiving files and logs sent to you from different applications: CSVs, JSON, log streams.
You also might be using event streams. There are tools like Apache Kafka that essentially have agents that sit and listen for events from an application. Every time a new piece of data is created, that data is then processed and added to the system. This is typically really useful for cases like fraud detection, where every time a bank has a transaction, they want to take that transaction, run it through their ML fraud detection systems, and make sure it doesn't involve anything illegal before eventually accepting it, and, obviously, if there's anything fraudulent, get an alert at the point of that transaction.
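As a rough sketch of that per-event idea, the Python below decides on each transaction as it arrives. Everything in it is an assumption made for illustration: a plain Python iterable stands in for an event stream such as a Kafka topic, and the `looks_fraudulent` rule is a toy placeholder for a real ML fraud model.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Tuple


@dataclass
class Transaction:
    tx_id: int
    account: str
    amount: float
    country: str


def looks_fraudulent(tx: Transaction, home_country: str = "US",
                     limit: float = 1000.0) -> bool:
    """Toy rule standing in for a real fraud model (an assumption)."""
    return tx.amount > limit or tx.country != home_country


def process_stream(events: Iterable[Transaction]) -> Iterator[Tuple[int, str]]:
    """Yield a decision for each transaction the moment it is seen."""
    for tx in events:
        decision = "DENY" if looks_fraudulent(tx) else "APPROVE"
        yield tx.tx_id, decision


if __name__ == "__main__":
    # Hypothetical incoming events; a real pipeline would read these from a stream.
    incoming = [
        Transaction(1, "acct-1", 25.0, "US"),
        Transaction(2, "acct-1", 5000.0, "US"),
        Transaction(3, "acct-2", 40.0, "FR"),
    ]
    for tx_id, decision in process_stream(incoming):
        print(tx_id, decision)
```

Because `process_stream` is a generator, each decision is produced before the next event is read, which is the property that lets a transaction be handled while it is still in flight.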
So you can deny it, rather than waiting for a daily report of all the fraudulent transactions, where you've let them go through and then have to track them down. Additionally, there are things like social media monitoring: if you have a web scraper that's saying, hey, check every Twitter post and then, based on the sentiment, do something. That's a pretty typical finance use case.
And then the third type of data pipeline is the data integration pipeline. The reason I think this is kind of a broken-out segment is that here I'm not batch processing or stream processing data for the sake of analysis, or feeding it to some system to analyze or alert on. Data integration is more about moving the data throughout my business, or just doing basic collection. A good example is a company like Amazon: when you fill out a form and say, hey, I want to search for a new blouse, data integration is when it says, hey, take this data, feed it into this machine learning model, tell me what this person should buy, and then feed that result back to that person. So that's one example of a data integration pipeline. Or you just need to move data between different systems for application support purposes, making sure your tool can run on time. It's not really batch processing or streaming; it's really just, hey, this tool needs to consume data from this other tool, so I need to integrate them. That's really why I have them as a third category. But I'd say most pipelines fall under the batch processing or streaming category.
So now I just want to finish this video by talking about some real-world applications of data pipelines, starting with the one you see up on screen.
This is a real Netflix internal diagram of how they actually manage serving data, or videos, to all of the many Netflix users. You can see that Netflix, as is typical for a streaming service, uses data pipelines to drive constant recommendations and to analyze what users are watching, so it can recommend other things to watch and keep them on the platform. Really any kind of streaming platform, any kind of media consumption, is going to use this. Things like Spotify, Netflix, and Disney Plus are all streaming services that would use data pipelines something like this one.
Additionally, you have things like e-commerce, which similarly uses data pipelines to serve personalized recommendations and do things like dynamic pricing. If they detect a lot of people demanding a certain product, they'll actually use data pipelines, and this is where stream processing comes in, to increase the price of that product.
You also have companies in the finance industry. I talked about this before: doing things like fraud detection with data pipelines, and things like risk assessment, where I want to decide whether or not I'm going to lend to this person. When you file an application for a credit card, it's not really a person assessing it nowadays. It goes through a sequence of data pipelines that process it and assign you a score, and based on that score you're then approved or denied for that credit card.
Healthcare is another big data pipeline industry. That's where you're integrating patient health data and providing real-time diagnostics on, hey, how is this patient doing?
And then another big user, and kind of a classic "hey, how do we solve this problem" user, is logistics: route optimization, using data pipelines to understand what the current traffic situation is like, and then rerouting drivers onto the routes that save them the most time.
So that's really all I wanted to cover today, just to give you a 101 course on how data pipelines are used across the world. I hope this is helpful, I hope it hit the mark, and you now know a little bit more about how data pipelines work. Hope you have a great rest of your day. Data Guy out.

What is a Data Pipeline! Data Pipelines Explained for Beginners!
The Data and AI Guy
10m 42s | 1,774 words | ~9 min read
AI audio transcription
This transcript was generated from the video's audio because no usable YouTube caption track was available. The transcript below is server-rendered so it can be read, searched, cited, and shared without opening the original YouTube player.


