From the course: Introduction to Amazon Redshift: Data Management Essentials

What is Redshift?

- [Instructor] Redshift is one of the flagship products that AWS offers for building a data warehouse in the cloud, where the servers, storage, and infrastructure of the data warehouse are managed by AWS. This can greatly reduce the time it takes you to get started with a data warehouse solution, and it can reduce your overall operational costs, since you won't have to patch servers, install updates, or manage backups.

You interact with Redshift using plain old SQL queries. So if you're a database developer who's already familiar with getting data out of a relational database, you'll have a good head start with Redshift. Not all big data solutions have this feature.

Let's refer back to our previous example, where we discussed that you'll need to come up with a strategy to sync data between your OLTP database and your OLAP database. Redshift has a feature called federated queries: if you're using Amazon RDS for PostgreSQL, Aurora PostgreSQL, RDS for MySQL, or Aurora MySQL as your OLTP database, you can create a link to these databases so that a single query against your Redshift OLAP warehouse can merge in the real-time records still in your OLTP databases. This makes dealing with the complexity of having your data split between two different databases much easier.

Behind the curtains, Redshift is based on a heavily modified version of Postgres, running as a cluster of database servers. You might already be using Postgres as your current OLTP database, but Amazon has heavily modified it to scale for OLAP workloads and to handle volumes of data and queries that would significantly slow down a normal Postgres instance. When you create a new Redshift provisioned cluster, you tell Redshift how many nodes you want in the cluster. Your data will be split up and distributed among all of the nodes in the cluster.
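To make the idea of splitting data across nodes more concrete, here is a minimal sketch in plain Python. It is a toy model, not Redshift's actual implementation: the `orders` rows, the choice of `customer` as the distribution key, and the helper names `node_for`, `distribute`, `node_partial_sum`, and `leader_total` are all assumptions made for illustration. Each row is assigned to one node by hashing its key, each node sums only its own rows, and a leader adds up the partial results.

```python
from collections import defaultdict
from zlib import crc32

NUM_NODES = 4  # assumed cluster size for this illustration


def node_for(key: str, num_nodes: int = NUM_NODES) -> int:
    """Pick a node for a row by hashing its distribution key (hypothetical helper)."""
    return crc32(key.encode("utf-8")) % num_nodes


def distribute(rows, num_nodes: int = NUM_NODES):
    """Assign every row to exactly one node based on its 'customer' key."""
    nodes = defaultdict(list)
    for row in rows:
        nodes[node_for(row["customer"], num_nodes)].append(row)
    return nodes


def node_partial_sum(rows) -> int:
    """Work done by a single node: aggregate only its own slice of the data."""
    return sum(row["amount"] for row in rows)


def leader_total(nodes) -> int:
    """Work done by the leader: combine the per-node partial results.

    In a real cluster the per-node work runs in parallel; here it runs
    one node after another to keep the sketch simple.
    """
    return sum(node_partial_sum(rows) for rows in nodes.values())


# Made-up sample data, in whole currency units to avoid rounding issues.
orders = [
    {"customer": "alice", "amount": 30},
    {"customer": "bob", "amount": 45},
    {"customer": "carol", "amount": 10},
    {"customer": "alice", "amount": 25},
    {"customer": "dave", "amount": 60},
]

cluster = distribute(orders)
total = leader_total(cluster)
```

Because the node is chosen from the key alone, all rows for the same customer land on the same node, and the combined total matches what a single server would compute over the full table.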
When you run a query in Redshift, each node searches its slice of the data, and the leader node combines the results into a single result set. So instead of having one database server crunching over a huge database, a four-node cluster, for example, gives you four database servers, each crunching over a smaller slice of the data, all in parallel. This is called massively parallel processing, and it's what is meant when you see the acronym MPP in the Redshift documentation.

You can choose between two deployment and pricing models for Redshift: provisioned or serverless. With a provisioned cluster, Redshift creates nodes for you that are special EC2 instances, and you're charged for them much like regular EC2 instances: as long as the instances run, whether you use them or not, you're charged for the compute time. You can bring down the cost by purchasing reserved nodes, which means committing upfront to a one-year or three-year term. Or you can select serverless. In the serverless model, Redshift spins up the EC2 instances for the nodes in the background, similar to how Lambda works, and spins them down when they aren't in use.

There's a significant pricing difference between the two models, so you'll need to do some research on which one works best for your workloads. The free trial in most regions covers Redshift Serverless, so we'll be using that model in this course. It's great for running a quick demo, since we don't need to provision a dedicated cluster of EC2 nodes. But if your warehouse does a lot of heavy processing and needs to be available most of the time, you'll likely want to use a provisioned cluster. Redshift can quickly start costing several hundred or even thousands of dollars a month, so make sure you do a proof of concept first and research how you want to handle the billing.
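As a starting point for that billing research, here is a rough sketch of how the two models could be compared. Every number in it is a made-up placeholder, not a real AWS price, and the function names are hypothetical. It assumes a provisioned cluster is billed for every hour of the month, while serverless is billed only for the hours in which queries are actually running.

```python
HOURS_PER_MONTH = 730  # common approximation: 8,760 hours per year / 12


def provisioned_monthly_cost(node_hourly_rate: float, num_nodes: int) -> float:
    """Provisioned: every node is billed for every hour, busy or idle."""
    return node_hourly_rate * num_nodes * HOURS_PER_MONTH


def serverless_monthly_cost(compute_hourly_rate: float, busy_hours: float) -> float:
    """Serverless: billed only for the hours in which compute is in use."""
    return compute_hourly_rate * busy_hours


def break_even_busy_hours(node_hourly_rate: float, num_nodes: int,
                          compute_hourly_rate: float) -> float:
    """Busy hours per month at which both models cost the same."""
    return provisioned_monthly_cost(node_hourly_rate, num_nodes) / compute_hourly_rate


# Placeholder rates for illustration only; look up current pricing for your region.
NODE_RATE = 1.00        # assumed dollars per node-hour
SERVERLESS_RATE = 8.00  # assumed dollars per busy hour of serverless compute

provisioned = provisioned_monthly_cost(NODE_RATE, num_nodes=4)
serverless = serverless_monthly_cost(SERVERLESS_RATE, busy_hours=100)
break_even = break_even_busy_hours(NODE_RATE, 4, SERVERLESS_RATE)
```

With these placeholder rates, a warehouse that is busy for 100 hours a month is cheaper on serverless, and provisioned only becomes the cheaper option once usage passes the break-even point. Plugging in real rates and your own expected usage gives a first estimate before you run a proof of concept.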
