A Comprehensive Guide to Data Engineering - Part Five (2): Data Generation


In the previous article, I looked at the various types of source systems that you will come across when working as a data engineer. In this article, I'll discuss in depth how to interact with each of these systems. We will also take a look at the stakeholders at this stage of the data lifecycle and investigate the impact of the undercurrents of the data engineering lifecycle on source systems.

Interacting With Source Systems

Databases

When you're working as a data engineer, you're going to come across different kinds of databases that store information. Just like there are many ways to use data, there are many kinds of databases built for different purposes.

Now, think of a database as a big digital filing cabinet where data is stored and organized so that you can find and use it easily. Here are some key things you should know about these digital filing cabinets:

1. Database management system (DBMS): This is like the whole setup that helps you store and find your data files. It's got special tools to make sure data is safe, to find data quickly, and to fix things if something goes wrong.

2. Lookups: This is about how the system finds the data you ask for. Some databases have a quick search feature called 'indexes' that makes finding data fast, but not all databases have this. It's good to know if your filing cabinet has these quick search features and how to keep them working well.

3. Query optimizer: This is like a smart assistant that figures out the fastest way to get the data you need when you ask for it. Not all databases have one, but if they do, it's helpful to know how it works.

4. Scaling and distribution: Sometimes, you get a lot more data or a lot more people wanting to use your data. Some databases can handle this by adding more filing cabinets (this is called 'horizontal scaling') or by making the existing filing cabinet bigger and stronger ('vertical scaling').

5. Modeling patterns: This is about the best way to organize your data. Just like you might decide whether to file things alphabetically or by date, databases work better when data is organized in certain ways.

6. CRUD: This stands for 'Create, Read, Update, Delete,' and it's just about the basic things you can do with data: add new stuff, look at what's there, change it, or get rid of it.

7. Consistency: This is about how up-to-date and in-sync the data is in your database. Some databases make sure that every read sees the latest write (that's 'strong consistency'). Others are okay with things being a little out of sync for a while if it means they can work faster or stay available (this is called 'eventual consistency'). Some databases let you choose how strict you want to be about consistency for each read and write.

So, as a data engineer, you'll want to get to know these parts of a database since they'll affect how you work with the data inside.
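To make a few of these ideas concrete, here is a minimal sketch using Python's built-in sqlite3 module. The "orders" table, its columns, and the index name are made up for illustration. It runs the four CRUD operations and then asks the query optimizer for its plan before and after adding an index.

```python
import sqlite3

# In-memory database with a hypothetical "orders" table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)

# Create
conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("ada", 10.0), ("bob", 25.5), ("ada", 7.25)],
)
# Read
ada_orders = conn.execute(
    "SELECT id, amount FROM orders WHERE customer = ? ORDER BY id", ("ada",)
).fetchall()
# Update
conn.execute("UPDATE orders SET amount = ? WHERE id = ?", (30.0, 2))
# Delete
conn.execute("DELETE FROM orders WHERE id = ?", (3,))


def query_plan(sql, params=()):
    """Return the optimizer's plan for a query as a single string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The last column of each row holds the human-readable plan step.
    return " | ".join(row[-1] for row in rows)


lookup = "SELECT id FROM orders WHERE customer = ?"
plan_before = query_plan(lookup, ("ada",))  # no index on "customer" yet
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer)")
plan_after = query_plan(lookup, ("ada",))   # the optimizer can now use the index

print(plan_before)
print(plan_after)
```

The exact wording of the plan depends on the SQLite version, but the first plan should describe a scan of the whole table and the second a search that uses idx_orders_customer.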

Relational vs Non-Relational Databases

There are two main types of databases: relational and nonrelational.

Relational databases

These are like traditional file cabinets with drawers (which we call tables). Each drawer holds files (which are like rows), and each file is divided into the same sections (which are like columns). Every file has information in the same format. For example, if one section keeps names, then that section will hold only names in every file.

In these relational databases, each row has a unique label, like a Social Security number for people, so that no two rows are the same. This label is called a 'primary key.' Also, a row in one drawer can point to a row in another drawer, like saying "Hey, the details for this thing are in that drawer over there." This reference is called a 'foreign key,' and it usually holds the primary key of the row it points to.
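Here is a small sketch of primary and foreign keys, again with Python's built-in sqlite3 module. The "customers" and "orders" tables are hypothetical. Note that SQLite only enforces foreign keys once the foreign_keys pragma is switched on, and other databases behave differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite does not enforce foreign keys unless this pragma is enabled.
conn.execute("PRAGMA foreign_keys = ON")

conn.execute(
    "CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT NOT NULL)"
)
conn.execute(
    """
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers (customer_id),
        amount      REAL NOT NULL
    )
    """
)
conn.execute("INSERT INTO customers (customer_id, name) VALUES (1, 'Ada')")
conn.execute("INSERT INTO orders (order_id, customer_id, amount) VALUES (100, 1, 19.99)")


def try_insert(sql):
    """Run an INSERT and report whether the database accepted it."""
    try:
        conn.execute(sql)
        return True
    except sqlite3.IntegrityError:
        return False


# Primary key: a second customer with the same customer_id is rejected.
duplicate_key_ok = try_insert(
    "INSERT INTO customers (customer_id, name) VALUES (1, 'Bob')"
)
# Foreign key: an order that points to a customer that does not exist is rejected.
orphan_order_ok = try_insert(
    "INSERT INTO orders (order_id, customer_id, amount) VALUES (101, 999, 5.00)"
)

# The foreign key is what lets us follow the reference from one table to the other.
joined = conn.execute(
    """
    SELECT c.name, o.amount
    FROM orders AS o
    JOIN customers AS c ON c.customer_id = o.customer_id
    """
).fetchall()
```

Both rejected inserts leave the tables unchanged, so the join returns only the one valid order together with its customer's name.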

Nonrelational databases

Relational databases have been the go-to for storing data for a long time, and they work great in many situations. However, sometimes people try to use them for everything, which can cause problems when the amount and complexity of the data grow. That's where NoSQL databases come in—they're a different type of database that doesn't rely on the traditional table relationships used in relational databases.

NoSQL databases are designed to handle a variety of data types and are particularly good when you need to scale up quickly or deal with large amounts of data. But this flexibility comes with trade-offs: many NoSQL databases give up some of the guarantees that relational databases provide, such as enforced schemas, joins, or strong consistency.

There are different kinds of NoSQL databases for different needs:

  • Key-value stores are like big, scalable dictionaries that pair a key with a value, good for quick look-ups and high-speed access.
  • Document stores are a specialized type of key-value store where each value is a document: a nested, self-describing object (often JSON-like) that can contain many different data types and can be queried by its fields.
  • Wide-column stores are built for very large amounts of data and high read and write throughput, with rows looked up by a key and columns that can vary from row to row.
  • Graph databases are for when you need to see how things are interconnected, like mapping out a network of friends.
  • Search databases are all about searching through text and logs to find things fast.
  • Time series databases are for when you have data that are recorded over time, like temperature readings or stock prices.

Each type of NoSQL database is a tool for specific situations, and as a data engineer, you’d need to know which one to use when.
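To show the difference between the first two types, here is a toy sketch in plain Python. It is not how any real NoSQL product is implemented. The dictionary, the sample documents, and the find helper are all made up for illustration. The point is that a key-value store only looks things up by exact key, while a document store can query inside the documents it holds.

```python
import json

# Key-value store: the store only understands "key -> opaque value".
kv_store = {}
kv_store["session:42"] = json.dumps({"user": "ada", "cart": ["sku-1", "sku-2"]})


def kv_get(key):
    """Look up a value by its exact key; the store cannot search inside values."""
    raw = kv_store.get(key)
    return json.loads(raw) if raw is not None else None


# Document store: values are structured documents whose fields can be queried.
documents = [
    {"_id": 1, "user": "ada", "address": {"city": "Lagos"}, "tags": ["vip"]},
    {"_id": 2, "user": "bob", "address": {"city": "Accra"}},  # no "tags": shapes may differ
]


def find(collection, path, value):
    """Return documents whose (possibly nested) field at a dotted path equals value."""
    matches = []
    for doc in collection:
        current = doc
        for part in path.split("."):
            current = current.get(part) if isinstance(current, dict) else None
        if current == value:
            matches.append(doc)
    return matches


session = kv_get("session:42")
in_lagos = find(documents, "address.city", "Lagos")
```

Real document stores add indexes on document fields so that a query like the one above does not have to check every document.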

APIs

APIs are a way for software applications to interact with each other. There are various types:

  1. REST: Representational state transfer (REST) is a set of architectural constraints described by Roy Fielding in his 2000 dissertation. A core idea of REST is stateless interaction: each request carries everything the server needs to handle it, so the server does not rely on any earlier request. A simpler way of thinking about this is getting an item from a vending machine. The machine does not "remember" you if you come back for a second item, because it keeps no record of your visits.
  2. GraphQL: Developed at Facebook, it differs from REST in that instead of being stuck with the fixed response of a predefined endpoint, you can mould your query to your specific use case and ask for exactly the fields you need. Think of a REST API as a vending machine that only stocks Coca-Cola products, and a GraphQL API as one where you get to choose from Coca-Cola, Pepsi, and Dr Pepper.
  3. Webhooks: These are what I would refer to as busybodies. They are event-driven: whenever something occurs in the source system, it immediately sends a message (usually an HTTP request) to the receiving system. They are also known as reverse APIs because they follow a trigger-and-send pattern rather than the ask-and-send approach of other APIs.
  4. RPC and gRPC: A remote procedure call (RPC) allows a machine to invoke a function on another machine. gRPC, an RPC framework developed at Google, is popular for communication between microservices because it is fast and programming-language agnostic.
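To see how a REST call and a GraphQL call differ on the wire, here is a minimal sketch that builds (but does not send) both requests with Python's standard library. The api.example.com host, the /users and /graphql paths, and the user fields are assumptions made up for this example, not a real service.

```python
import json
import urllib.request

BASE_URL = "https://api.example.com"  # hypothetical endpoint, for illustration only


def rest_get_user(user_id):
    """REST: the URL names the resource; the server decides which fields come back."""
    return urllib.request.Request(
        f"{BASE_URL}/users/{user_id}",
        headers={"Accept": "application/json"},
        method="GET",
    )


def graphql_get_user(user_id):
    """GraphQL: one endpoint; the query lists exactly the fields the client wants."""
    query = "query ($id: ID!) { user(id: $id) { name email } }"
    body = json.dumps({"query": query, "variables": {"id": str(user_id)}})
    return urllib.request.Request(
        f"{BASE_URL}/graphql",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


rest_request = rest_get_user(7)
graphql_request = graphql_get_user(7)

print(rest_request.get_method(), rest_request.full_url)
print(graphql_request.get_method(), graphql_request.full_url)
```

Both requests are stateless in the REST sense: each one carries everything the server needs. Against a real API, you would pass either request to urllib.request.urlopen to send it.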

Data Sharing

Cloud data sharing is like a secure online storage locker where companies can store their data and choose who else can see or use it. It's more efficient than sending files by email because you can set detailed permissions for specific parts of the data. Online data marketplaces let companies buy and sell data without dealing with technical setups. Within a company, cloud data sharing lets different departments control their data and costs, making it easier to work together without centralizing all the data management. It’s part of a broader approach to use straightforward, compatible tools for handling data rather than complex new technologies.

Third-Party Sources

Today, nearly all companies use tech to run their businesses, and many share their data with customers through online services or subscriptions. This sharing can create a cycle where customers use the data in their apps, leading to more data and more usage. Companies often use APIs and cloud platforms to allow customers to access and use this data, which can help them understand their customers better and improve their services.

Message Queues and Event-Streaming Platforms

Event-driven architecture is a tech approach where certain actions (events) automatically trigger responses in a system. It's like a domino effect: one event sets off a chain of actions. This is becoming more common, particularly because cloud services make it easier to handle.

There are two main tools in event-driven systems:

  1. Message Queues: They act like conveyor belts, delivering tasks to different parts of a system. They help different components of an app work independently without waiting for each other.
  2. Event-Streaming Platforms: These are like advanced message queues that not only deliver tasks but also keep a log of all activities, so you can go back and see what happened, in order.

These tools help tasks get carried out smoothly and, depending on how they are configured, in order, even if there are lots of them or they're complex. It's a bit like a well-organized assembly line in a factory, where each part does its job without worrying about the other parts.
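The difference between the two tools can be sketched in a few lines of Python. This is a toy, in-memory model, not how any real queue or streaming platform is built, and the class and method names are my own. The queue hands each message out once and forgets it, while the log keeps every event in order so that a consumer can re-read from any position.

```python
from collections import deque


class MessageQueue:
    """Toy queue: each message is handed to one consumer and then removed."""

    def __init__(self):
        self._messages = deque()

    def publish(self, message):
        self._messages.append(message)

    def consume(self):
        """Return the oldest message, or None when the queue is empty."""
        return self._messages.popleft() if self._messages else None


class EventLog:
    """Toy event stream: an append-only, ordered log that can be replayed."""

    def __init__(self):
        self._events = []

    def append(self, event):
        """Store the event and return its offset (its position in the log)."""
        self._events.append(event)
        return len(self._events) - 1

    def read_from(self, offset):
        """Return every event from the given offset onward, in order."""
        return self._events[offset:]


queue = MessageQueue()
log = EventLog()
for event in ["order_created", "order_paid", "order_shipped"]:
    queue.publish(event)
    log.append(event)

first = queue.consume()      # the message is gone from the queue after this
replay = log.read_from(0)    # the log can be read again from the start
later = log.read_from(1)     # or from any later offset
```

After the first consume call, the queue holds only two messages, while the log still returns all three events every time it is read from offset 0.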

Stakeholders of Source Systems

Working with data involves coordinating with two main groups: those who maintain the systems (like developers) and those who manage the data (like IT teams). Building good relationships with them is key because they can affect the quality and availability of the data you need.

To make sure everyone's on the same page, create a data contract. This outlines what data you'll get, how and when you'll get it, and who to contact if there's an issue. An SLA (Service Level Agreement) can also be set to define the expected performance level, like how often the data should be available.
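A data contract is usually a written agreement, but parts of it can also be checked in code. Here is one possible sketch, not a standard format: the DataContract class, its fields, and the sample "orders" dataset are all hypothetical. It records what data is expected, when, from whom, and the SLA, and checks incoming records against the agreed fields.

```python
from dataclasses import dataclass


@dataclass
class DataContract:
    """A hypothetical, minimal data contract for one dataset."""

    dataset: str              # what data you will get
    delivery_schedule: str    # how and when you will get it
    owner_contact: str        # who to contact if there is an issue
    max_staleness_hours: int  # SLA: how old the data is allowed to be
    schema: dict              # expected field name -> expected Python type

    def violations(self, record):
        """Return a list of problems found in one record (empty if it conforms)."""
        problems = []
        for field_name, expected_type in self.schema.items():
            if field_name not in record:
                problems.append(f"missing field: {field_name}")
            elif not isinstance(record[field_name], expected_type):
                problems.append(
                    f"wrong type for {field_name}: expected {expected_type.__name__}"
                )
        return problems


contract = DataContract(
    dataset="orders",
    delivery_schedule="daily export at 02:00 UTC",
    owner_contact="orders-team@example.com",
    max_staleness_hours=24,
    schema={"order_id": int, "amount": float},
)

good = contract.violations({"order_id": 1, "amount": 9.5})
bad = contract.violations({"order_id": "1"})
```

In this sketch, a conforming record produces an empty list, and each problem in a non-conforming record becomes a message you can log or send to the contact named in the contract.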

Even if you don't have a formal contract or SLA, it's important to communicate your data needs to these stakeholders so they can support your work effectively.

Undercurrents And Their Influence on Source Systems

When you're working with data, you often have to use information from systems that you don't control, which can feel a bit like hoping everything is being handled well on their end. For each of the undercurrents, these are some of the questions you should ask:

  • You should make sure the data is secure and ask questions like:
      • Is it encrypted properly?
      • How am I connecting to the system? Is it through a secure method like a VPN?
      • Are passwords and other sensitive information stored safely?
  • You also need to think about how the data is managed:
      • Is there a good system in place to make sure the data is accurate and up to date?
      • How do you deal with changes in the data's structure?
      • Are there rules on how data should be handled to protect people's privacy?
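For the question about changes in the data's structure, one simple approach is to compare the columns you expect with the columns the source system actually sends. The sketch below assumes you can describe both as a mapping from column name to type name. The schema_diff function and the sample columns are made up for illustration.

```python
def schema_diff(expected, observed):
    """Compare two {column: type_name} mappings and report what changed."""
    shared = set(expected) & set(observed)
    return {
        "added": sorted(set(observed) - set(expected)),
        "removed": sorted(set(expected) - set(observed)),
        "retyped": sorted(col for col in shared if expected[col] != observed[col]),
    }


expected_schema = {"id": "int", "email": "str", "signup_date": "date"}
observed_schema = {"id": "str", "email": "str", "country": "str"}

diff = schema_diff(expected_schema, observed_schema)
# A pipeline could stop, alert, or adapt depending on which lists are non-empty.
has_breaking_change = bool(diff["removed"] or diff["retyped"])
```

Treating removed and retyped columns as breaking and added columns as harmless is only one possible policy. What counts as breaking is something to agree on with the owners of the source system.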

On top of that, you have to be ready for when things go wrong. This means setting up a way to keep an eye on the system and having a plan for what to do if there's an issue, like a system outage or a mistake in the data.
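One common piece of such a plan is retrying when a source system is briefly unavailable. Here is a minimal sketch of retrying with an increasing wait between attempts. The function name and defaults are my own choices, and I am assuming the failure shows up as a ConnectionError, which will not be true for every client library.

```python
import time


def fetch_with_retry(fetch, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call fetch(); on ConnectionError wait and retry, doubling the delay each time."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except ConnectionError:
            if attempt == max_attempts:
                # Out of attempts: re-raise so monitoring or alerting can pick it up.
                raise
            sleep(base_delay * 2 ** (attempt - 1))


# Usage with a stand-in for a source system that fails twice, then recovers.
calls = {"count": 0}


def flaky_source():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("source system unavailable")
    return ["row-1", "row-2"]


waits = []  # record the delays instead of really sleeping
rows = fetch_with_retry(flaky_source, sleep=waits.append)
```

Passing the sleep function in as a parameter keeps the example fast to run and easy to test. In a real pipeline you would leave it at the default.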

  • When writing code that interacts with these systems, you need to consider:
      • Whether the network is secure
      • Whether you have the right access permissions
      • How you're getting the data (like through an API or a database)
      • Whether your code works well with the other tools you're using to manage the data workflow
      • Whether you can get a lot of data at once without issues
      • How you'll roll out updates to your code
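On the point about getting a lot of data at once, one common approach is to read in batches instead of loading everything into memory. Here is a small sketch with Python's built-in sqlite3 module. The iter_batches helper and the "events" table are made up for this example, and other database drivers may offer different ways to do the same thing.

```python
import sqlite3


def iter_batches(conn, sql, batch_size=1000):
    """Yield lists of at most batch_size rows until the result set is exhausted."""
    cursor = conn.execute(sql)
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break
        yield rows


# Usage with a hypothetical "events" table holding 25 rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO events (payload) VALUES (?)",
    [(f"event-{i}",) for i in range(25)],
)

batch_sizes = []
for batch in iter_batches(conn, "SELECT id, payload FROM events ORDER BY id", batch_size=10):
    # Each batch can be processed or written onward before the next one is read.
    batch_sizes.append(len(batch))

print(batch_sizes)
```

Only one batch is held in memory at a time, which also puts less pressure on the source system than one very large read.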

In short, as a data engineer, you're not just dealing with data. You also need to work well with others, plan for problems, and keep an eye on how everything is running to make sure you can use the data effectively.

Conclusion

In wrapping up this article, I've explored the multifaceted world of data engineering with a focus on interacting with source systems. I've demystified the workings of databases, outlined the contrasts between relational and non-relational models, and shed light on the diversity of APIs. Moreover, I've touched upon the significance of data-sharing mechanisms and the pivotal role of stakeholders in shaping the data landscape. My journey also took me through the intricacies of event-driven architectures and the paramount importance of security and data management practices.

Resources

  1. Fundamentals of Data Engineering by Joe Reis & Matt Housley


