From the course: Data Cleaning in Python Essential Training


Data pipelines and automation

- [Narrator] A data pipeline is a series of steps, each consuming input and producing output. There are many systems for creating data pipelines, such as Apache Airflow. Writing your own is not that hard, but I recommend investing time in an existing one. The main advantage of using a pipeline is that each step is small, self-contained, and easy to check. Some data pipeline systems also allow you to resume a pipeline from the middle, which can save you a lot of time. When designing data pipelines, it's important to build data validation and cleaning into the pipeline. Once these are in place, you can quickly detect errors and stop the pipeline. I'm going to use Invoke, which is a small system for running tasks. It is simple and easy to use. For a real production system, you'll probably pick a system with more features, such as Airflow. Let's have a look. So we have our CSV file with some…
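The idea described above — small, self-contained steps where validation can stop the pipeline early — can be sketched in plain Python. This is a minimal illustration, not the course's actual Invoke code; the step names (`load`, `clean`, `validate`) and the `name`/`amount` columns are made up for the example.

```python
import csv
import io

def load(text):
    """Parse CSV text into a list of row dicts (one step's output feeds the next)."""
    return list(csv.DictReader(io.StringIO(text)))

def clean(rows):
    """Cleaning step: strip stray whitespace from every value."""
    return [{key: value.strip() for key, value in row.items()} for row in rows]

def validate(rows):
    """Validation step: raise to stop the pipeline if a required field is missing."""
    for i, row in enumerate(rows):
        if not row.get("name"):
            raise ValueError(f"row {i}: missing 'name'")
    return rows

def run_pipeline(data, steps):
    """Run each step in order; any step that raises halts the whole pipeline."""
    for step in steps:
        data = step(data)
    return data

# Hypothetical input: a small CSV with a little whitespace noise.
csv_text = "name,amount\n alice ,3\nbob,7\n"
result = run_pipeline(csv_text, [load, clean, validate])
print(result)
```

Because each step is a plain function, it can be tested on its own, and a failed validation stops everything downstream rather than letting bad data through.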
