AWS Services Every Data Scientist Should Know: A Scenario-Based Guide
Part 1 of a two-part series.
Data Science and AI are two of the hottest subjects in tech right now, and the conversation around them often blurs the line between the two. They're related, but the day-to-day work, the tools, and the AWS services that support each role are actually quite distinct. With that in mind, I thought it would be useful to put together a focused two-part series: one article on the AWS services most relevant to data scientists, and a second one on the services most relevant to AI practitioners.
This first part is for the data scientists. The companion piece on AI follows.
Data science rarely lives inside a single notebook. Real projects span messy raw data, ingestion pipelines, transformation jobs, warehouses, dashboards, and eventually models that need fresh features to stay useful. AWS offers a well-defined service for nearly every stage of that journey, and knowing which service fits which stage is half the battle.
This article walks through the AWS services most relevant to a data scientist's workflow, paired with a practical scenario for each. At the end, a single architecture brings them together to show how they fit into a complete platform.
Amazon S3: The Foundation of Every Data Project
Scenario: A retail analytics team receives daily point-of-sale extracts from hundreds of stores. The files arrive in different formats (CSV, JSON, Parquet) and sizes ranging from a few megabytes to several gigabytes. They need a durable place to keep every file, including older versions, without worrying about running out of space.
How S3 helps: Amazon S3 is object storage that scales to virtually any volume, making it the default landing zone for raw and processed data in AWS. It supports any file format, offers eleven nines of durability (99.999999999%), and integrates natively with almost every other AWS analytics and ML service. Features like versioning, lifecycle policies (to automatically move older data to cheaper storage classes like S3 Glacier), and fine-grained access control through IAM make it both cost-effective and secure. For most data science workloads, S3 becomes the data lake, the single source of truth that every downstream tool reads from.
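To make this concrete, here is a minimal boto3 sketch of the two habits that matter most in the scenario: landing files under date-partitioned prefixes and adding a lifecycle rule so old raw data ages out to Glacier. The bucket name, prefixes, and file names are hypothetical.

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "retail-data-lake"  # hypothetical bucket name

# Land a daily point-of-sale extract under a date-partitioned prefix
s3.upload_file(
    Filename="pos_extract_2024-06-01.csv",
    Bucket=BUCKET,
    Key="raw/pos/dt=2024-06-01/pos_extract.csv",
)

# Move objects under raw/ to Glacier after 90 days to control cost
s3.put_bucket_lifecycle_configuration(
    Bucket=BUCKET,
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-raw-after-90-days",
                "Filter": {"Prefix": "raw/"},
                "Status": "Enabled",
                "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
            }
        ]
    },
)
```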
AWS Lake Formation: Governance for the Data Lake
Scenario: As more teams begin using the S3 data lake, the security team needs to ensure that finance can see salary data while marketing cannot, that PII columns are masked for analysts, and that every access is auditable. Managing these rules through bucket policies alone is becoming unmanageable.
How Lake Formation helps: Lake Formation sits on top of S3 and the Glue Data Catalog and provides centralized, fine-grained access control at the database, table, column, row, and even cell level. Permissions defined once in Lake Formation are automatically enforced when users query through Athena, Redshift Spectrum, EMR, or SageMaker. It also simplifies setting up a data lake in the first place by automating ingestion, cataloging, and registration. For any organization where data governance matters (which is most of them), Lake Formation is the missing piece between raw S3 and trustworthy enterprise analytics.
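As a sketch of what column-level control looks like in practice, the snippet below grants a (hypothetical) marketing-analyst role SELECT access to only the non-PII columns of a customers table; the database, table, and role names are placeholders.

```python
import boto3

lf = boto3.client("lakeformation")

# Marketing analysts can query the customers table, but only non-PII columns
lf.grant_permissions(
    Principal={
        "DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/marketing-analysts"
    },
    Resource={
        "TableWithColumns": {
            "DatabaseName": "retail_curated",
            "Name": "customers",
            "ColumnNames": ["customer_id", "segment", "signup_date"],
        }
    },
    Permissions=["SELECT"],
)
```

The same grant is enforced whether the analyst queries through Athena, Redshift Spectrum, or SageMaker, which is exactly what bucket policies alone can't give you.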
AWS Database Migration Service: Bringing Operational Data into the Lake
Scenario: The most valuable data in the company sits in operational databases (Aurora, SQL Server, Oracle) that power live applications. Analysts need access to that data, but querying production systems directly is risky and slow.
How DMS helps: AWS Database Migration Service replicates data from operational databases into analytical destinations like S3, Redshift, or other databases, with minimal impact on the source. It supports one-time migrations as well as ongoing change data capture (CDC), so the lake stays in sync with the source as transactions occur. For a data scientist, this is the standard way to bring transactional data into the analytics environment without writing custom extraction code or coordinating database snapshots.
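Here's a hedged boto3 sketch of creating a replication task with ongoing CDC, assuming the source endpoint, target endpoint, and replication instance already exist; every ARN and schema name below is a placeholder.

```python
import json
import boto3

dms = boto3.client("dms")

# Replicate the orders schema from Aurora into the lake with ongoing CDC
dms.create_replication_task(
    ReplicationTaskIdentifier="aurora-orders-to-s3",
    SourceEndpointArn="arn:aws:dms:us-east-1:123456789012:endpoint:aurora-source",
    TargetEndpointArn="arn:aws:dms:us-east-1:123456789012:endpoint:s3-target",
    ReplicationInstanceArn="arn:aws:dms:us-east-1:123456789012:rep:analytics-replication",
    MigrationType="full-load-and-cdc",  # initial load, then continuous change capture
    TableMappings=json.dumps({
        "rules": [{
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "orders-schema",
            "object-locator": {"schema-name": "orders", "table-name": "%"},
            "rule-action": "include",
        }]
    }),
)
```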
AWS Glue: Cataloging and Transforming at Scale
Scenario: The retail team now has thousands of files in S3, but nobody knows what schemas they have or how to join them. Analysts waste hours manually inspecting files, and ETL jobs break whenever an upstream schema changes.
How AWS Glue helps: Glue is a serverless data integration service with two parts worth understanding. The Glue Data Catalog crawls S3 and automatically infers schemas, producing a searchable metadata layer that other services (Athena, Redshift Spectrum, EMR, Lake Formation) can query directly. Glue ETL runs Spark- or Python-based transformation jobs on demand, without provisioning clusters. Together they turn a sprawling collection of files into a structured, queryable data lake. For a data scientist, this means spending less time wrangling raw files and more time on analysis.
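A typical Glue ETL job is just a short PySpark script that reads a crawled table through the Data Catalog and writes a curated copy back to S3. The skeleton below assumes a retail_raw database and a pos_extracts table that a crawler has already registered; those names and the output path are illustrative.

```python
import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the crawled raw table through the Data Catalog
pos_raw = glue_context.create_dynamic_frame.from_catalog(
    database="retail_raw", table_name="pos_extracts"
)

# Write a curated, partitioned Parquet copy back to the lake
glue_context.write_dynamic_frame.from_options(
    frame=pos_raw,
    connection_type="s3",
    connection_options={
        "path": "s3://retail-data-lake/curated/pos/",
        "partitionKeys": ["dt"],
    },
    format="parquet",
)

job.commit()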
AWS Glue DataBrew: Visual Data Preparation
Scenario: A business analyst needs to clean a marketing dataset (standardize phone formats, split full names, fix inconsistent country codes) but isn't comfortable writing PySpark.
How DataBrew helps: Glue DataBrew is a visual, no-code data preparation tool with more than 250 built-in transformations. Users point it at S3, Redshift, or other sources, build a "recipe" of cleaning steps interactively, and apply that recipe to the full dataset at scale. Recipes are reusable and version-controlled, which makes it useful even for technical teams who want a faster way to prototype data cleaning before promoting it to a Glue ETL job. It bridges the gap between spreadsheets and code-based pipelines.
Amazon Athena: SQL on S3 Without Infrastructure
Scenario: A data scientist needs to answer a quick business question: "Which product categories had the steepest revenue drop last quarter?" The data sits in S3 as Parquet files. Spinning up a database, loading the data, and querying it feels like overkill for a one-off question.
How Athena helps: Athena is a serverless query engine that runs standard SQL directly on data stored in S3, using the Glue Data Catalog for schema information. There's nothing to provision; you write a query, Athena runs it, and you pay only for the data scanned. For exploratory analysis, ad-hoc reporting, and validating assumptions before building a pipeline, it's hard to beat. Partitioning your data and storing it in columnar formats like Parquet dramatically reduces the cost and latency of queries, a habit worth building early.
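The one-off question from the scenario can be answered straight from a notebook or script with boto3. The sketch below assumes a sales table in a retail_curated database with revenue and quarter columns, plus a results bucket; all of those names are hypothetical.

```python
import boto3

athena = boto3.client("athena")

# Which categories dropped the most quarter over quarter?
response = athena.start_query_execution(
    QueryString="""
        SELECT category,
               SUM(CASE WHEN quarter = '2024-Q1' THEN revenue ELSE 0 END)
             - SUM(CASE WHEN quarter = '2023-Q4' THEN revenue ELSE 0 END) AS revenue_change
        FROM sales
        GROUP BY category
        ORDER BY revenue_change ASC
        LIMIT 10
    """,
    QueryExecutionContext={"Database": "retail_curated"},
    ResultConfiguration={"OutputLocation": "s3://retail-athena-results/"},
)
print(response["QueryExecutionId"])  # poll get_query_execution, then fetch results
```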
Amazon Redshift: The Analytics Powerhouse
Scenario: A marketing team runs complex joins across billions of rows of customer, transaction, and campaign data, and they need results in seconds, not minutes. They also need concurrent access for dozens of analysts and BI dashboards without query performance degrading.
How Redshift helps: Redshift is a fully managed cloud data warehouse optimized for large-scale analytical queries. It uses columnar storage, massively parallel processing, and result caching to deliver fast performance on huge datasets. Redshift Spectrum extends that capability by letting you query data still sitting in S3, so you don't have to load everything into the warehouse. For a data scientist, Redshift is the go-to when queries outgrow what Athena can handle efficiently, or when the organization needs a governed, high-performance environment for BI and reporting.
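For programmatic access, the Redshift Data API lets you submit SQL without managing drivers or connections. The sketch below assumes a Redshift Serverless workgroup and table names invented for this example.

```python
import boto3

redshift_data = boto3.client("redshift-data")

# Run a warehouse-scale join through the Data API
resp = redshift_data.execute_statement(
    WorkgroupName="analytics",  # hypothetical Redshift Serverless workgroup
    Database="retail",
    Sql="""
        SELECT c.segment, cam.campaign_name, SUM(t.amount) AS revenue
        FROM transactions t
        JOIN customers c   ON t.customer_id = c.customer_id
        JOIN campaigns cam ON t.campaign_id = cam.campaign_id
        GROUP BY c.segment, cam.campaign_name
    """,
)
print(resp["Id"])  # use describe_statement / get_statement_result to retrieve rows
```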
Amazon EMR: Big Data Processing with Spark and Hadoop
Scenario: A team needs to process several terabytes of clickstream data every night to build customer behavior features. A single-node script would take days; they need distributed computing.
How EMR helps: EMR is a managed cluster platform for running big data frameworks like Apache Spark, Hive, Presto, and Flink. It handles cluster provisioning, configuration, and scaling, letting data scientists focus on the code rather than the infrastructure. EMR Serverless takes this further by removing cluster management entirely; you submit a Spark job and AWS runs it. For heavy transformations, feature engineering at scale, and iterative analytics on large datasets, EMR is the right tool.
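With EMR Serverless, the nightly clickstream job becomes a single API call against an existing Spark application. In the sketch below, the application ID, execution role ARN, and script path are placeholders.

```python
import boto3

emr_serverless = boto3.client("emr-serverless")

# Submit the nightly feature-engineering job to an existing Spark application
emr_serverless.start_job_run(
    applicationId="00f1example2345",
    executionRoleArn="arn:aws:iam::123456789012:role/emr-serverless-job-role",
    jobDriver={
        "sparkSubmit": {
            "entryPoint": "s3://retail-data-lake/jobs/build_behavior_features.py",
            "sparkSubmitParameters": "--conf spark.executor.memory=8g",
        }
    },
    name="nightly-clickstream-features",
)
```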
Amazon Kinesis: Working with Streaming Data
Scenario: A ride-sharing company wants to analyze trip events in near real time, tracking active rides, detecting anomalies, and updating dashboards within seconds of an event occurring.
How Kinesis helps: Kinesis is a family of services for streaming data. Kinesis Data Streams ingests high-volume event data, Kinesis Data Firehose delivers streams to destinations like S3 or Redshift with automatic buffering and format conversion, and Managed Service for Apache Flink runs SQL or Flink queries on streams in motion. When freshness matters (fraud detection, live operational metrics, IoT telemetry), Kinesis gives you the plumbing without building it from scratch.
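On the producer side, writing an event into a stream is a one-liner with boto3; the stream name and payload below are illustrative.

```python
import json
import boto3

kinesis = boto3.client("kinesis")

# One trip event from the ride-sharing scenario
event = {"trip_id": "t-1842", "status": "started", "city": "Lisbon", "ts": "2024-06-01T09:15:02Z"}

kinesis.put_record(
    StreamName="trip-events",
    Data=json.dumps(event).encode("utf-8"),
    PartitionKey=event["trip_id"],  # keeps records for the same trip in order
)
```

Firehose can then buffer the same stream into S3 in Parquet, which is often all the "streaming architecture" a data science team needs to get started.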
Amazon MSK: Managed Apache Kafka
Scenario: A team has standardized on Apache Kafka across the engineering organization, with existing producers, consumers, and Kafka Streams applications. They want a managed Kafka offering on AWS rather than rewriting everything for Kinesis.
How MSK helps: Amazon Managed Streaming for Apache Kafka (MSK) provides fully managed Kafka clusters, handling the provisioning, patching, and operational overhead of running Kafka at scale. MSK Serverless removes capacity planning entirely. The advantage over Kinesis is full Kafka API compatibility, which matters when teams already have a Kafka ecosystem or need features like long retention, complex consumer groups, and Kafka Connect plugins. It's a better fit when standardization on Kafka is a hard requirement.
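The point of full API compatibility is that existing Kafka code keeps working; only the bootstrap brokers change. A minimal producer sketch using the kafka-python library, with placeholder broker addresses and topic names, looks like this.

```python
import json
from kafka import KafkaProducer  # pip install kafka-python

# Point an existing producer at the MSK cluster's TLS listener
producer = KafkaProducer(
    bootstrap_servers=["b-1.retail-msk.example.amazonaws.com:9094"],
    security_protocol="SSL",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

producer.send("order-events", {"order_id": "o-991", "status": "created"})
producer.flush()
```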
Amazon OpenSearch Service: Search and Operational Analytics
Scenario: An operations team needs to search and analyze billions of log entries to investigate incidents, build real-time dashboards, and detect anomalies in user behavior. SQL warehouses aren't optimized for this type of full-text and time-series workload.
How OpenSearch helps: OpenSearch Service (the AWS-managed fork of Elasticsearch) is purpose-built for log analytics, full-text search, and operational monitoring. It indexes large volumes of semi-structured data and serves complex queries with sub-second latency. With OpenSearch Dashboards, teams visualize and alert on the data without leaving the platform. For data scientists working with logs, telemetry, or any scenario where search-style retrieval matters more than relational joins, OpenSearch is the right tool.
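A typical investigation query looks like the sketch below, written with the opensearch-py client; the domain endpoint, credentials, index pattern, and field names are all placeholders.

```python
from opensearchpy import OpenSearch  # pip install opensearch-py

client = OpenSearch(
    hosts=[{"host": "search-ops-logs.example.us-east-1.es.amazonaws.com", "port": 443}],
    http_auth=("analyst", "example-password"),
    use_ssl=True,
)

# Errors from the checkout service in the last hour
results = client.search(
    index="app-logs-*",
    body={
        "query": {
            "bool": {
                "filter": [
                    {"term": {"service": "checkout"}},
                    {"match": {"level": "ERROR"}},
                    {"range": {"@timestamp": {"gte": "now-1h"}}},
                ]
            }
        }
    },
)
print(results["hits"]["total"])
```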
AWS Lambda, Step Functions, and EventBridge: The Orchestration Layer
Scenario: A nightly data pipeline must wait for files to land in S3, validate them, trigger a Glue ETL job, run quality checks, load Redshift, refresh QuickSight, and send a notification on failure. Hand-rolling this with cron and shell scripts is fragile.
How they help: AWS Lambda runs code in response to events without managing servers; ideal for lightweight transformations, validations, and glue logic. AWS Step Functions orchestrates multi-step workflows as visual state machines with built-in error handling, retries, and parallel execution. Amazon EventBridge is the event bus that ties services together, triggering pipelines based on schedules or events from across the AWS ecosystem. Together they form the backbone of modern, event-driven data pipelines on AWS, replacing brittle scripts with reliable, observable workflows.
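A small example of how the pieces connect: a Lambda function receives the S3 "object created" event (routed through EventBridge) and kicks off the Step Functions pipeline with the new file as input. The state machine ARN and event shape shown are illustrative.

```python
import json
import boto3

sfn = boto3.client("stepfunctions")

def handler(event, context):
    # EventBridge delivers S3 object-created events with bucket and key in "detail"
    detail = event["detail"]
    bucket = detail["bucket"]["name"]
    key = detail["object"]["key"]

    # Start the nightly pipeline, passing the new file as workflow input
    sfn.start_execution(
        stateMachineArn="arn:aws:states:us-east-1:123456789012:stateMachine:nightly-etl",
        input=json.dumps({"bucket": bucket, "key": key}),
    )
    return {"status": "started", "object": f"s3://{bucket}/{key}"}
```

Validation, the Glue job, quality checks, the Redshift load, and the failure notification then live as individual states in the state machine, where retries and error handling are declared rather than hand-coded.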
Amazon QuickSight: Turning Analysis into Dashboards
Scenario: After weeks of exploratory analysis, a data scientist has clear insights about customer churn drivers. Executives want to monitor these metrics going forward, but they aren't going to open a Jupyter notebook to do it.
How QuickSight helps: QuickSight is AWS's serverless BI service. It connects directly to Redshift, Athena, S3, RDS, and many other sources, and lets you build interactive dashboards that scale to thousands of users. Its ML Insights feature surfaces anomalies and forecasts without extra modeling work, and Amazon Q in QuickSight adds natural-language querying so business users can ask questions in plain English. For a data scientist, QuickSight is the bridge between analysis and adoption, the place where insights become something the organization actually uses.
Amazon SageMaker Studio: The Workbench for Machine Learning
Scenario: A data scientist is starting a new project. They need a notebook environment to explore data, the ability to scale training onto GPU instances when needed, a way to track experiments and compare models, and a clean path from notebook code to deployed endpoint. Setting all of this up across separate services and EC2 instances is exactly the kind of yak-shaving that slows projects down.
How SageMaker Studio helps: SageMaker Studio is the unified, web-based IDE for machine learning on AWS. From a single interface, a data scientist can run notebooks in JupyterLab or the Code Editor, launch managed training jobs on right-sized compute (no instance management), track experiments and compare runs, debug models, deploy endpoints, and monitor them in production. It supports the major frameworks out of the box (PyTorch, TensorFlow, scikit-learn, Hugging Face), integrates with Git, and serves as the entry point to Data Wrangler, Feature Store, Pipelines, JumpStart, and the rest of the SageMaker family. For data scientists doing serious ML on AWS, Studio is the home base.
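The jump from "notebook code" to "managed training job" is a few lines with the SageMaker Python SDK. The sketch below assumes a scikit-learn training script and an S3 training path, both hypothetical.

```python
import sagemaker
from sagemaker.sklearn.estimator import SKLearn

session = sagemaker.Session()
role = sagemaker.get_execution_role()  # resolves the Studio execution role

# Launch a managed training job from a Studio notebook
estimator = SKLearn(
    entry_point="train_churn_model.py",  # hypothetical training script
    framework_version="1.2-1",
    instance_type="ml.m5.xlarge",
    instance_count=1,
    role=role,
    sagemaker_session=session,
)

estimator.fit({"train": "s3://retail-data-lake/curated/churn/train/"})
```

The same pattern scales up by changing instance_type to a GPU instance, which is precisely the "scale when needed" path the scenario calls for.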
Amazon SageMaker Data Wrangler: Preparing Data for ML
Scenario: A data scientist is preparing a dataset for a churn prediction model. The raw data needs cleaning: missing values handled, categorical variables encoded, and new features engineered. Doing this in a notebook works, but reproducing it for production is painful.
How Data Wrangler helps: Data Wrangler (part of SageMaker) provides a visual interface for data preparation, with hundreds of built-in transformations for cleaning, encoding, and feature engineering. What makes it especially useful is that every step is recorded as code and can be exported as a reusable pipeline or deployed as a processing job. It connects to S3, Athena, Redshift, and Snowflake, and it bridges the gap between ad-hoc data prep and production-ready feature pipelines.
Amazon SageMaker Feature Store: Reusable Features for ML
Scenario: Three different teams have built churn, propensity, and lifetime-value models, each maintaining its own version of features like "average order value last 30 days." The features drift between teams, training and serving show different numbers, and nobody can tell which version is canonical.
How Feature Store helps: SageMaker Feature Store is a centralized repository for ML features, with both an offline store (in S3, for training) and an online store (low-latency, for inference). Teams define features once, share them across projects, and serve consistent values during training and serving (avoiding the dreaded training-serving skew). Features are versioned and discoverable. For organizations with multiple models in production, Feature Store transforms feature engineering from a per-project chore into reusable infrastructure.
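A minimal sketch of defining and populating a feature group with the SageMaker SDK, assuming a small pandas dataframe of per-customer features; the group name, S3 path, and feature values are invented for illustration.

```python
import pandas as pd
import sagemaker
from sagemaker.feature_store.feature_group import FeatureGroup

session = sagemaker.Session()
role = sagemaker.get_execution_role()

# Hypothetical feature table: one row per customer with the shared feature
df = pd.DataFrame({
    "customer_id": pd.Series(["c-101", "c-102"], dtype="string"),
    "avg_order_value_30d": [48.20, 112.75],
    "event_time": [1717230000.0, 1717230000.0],
})

fg = FeatureGroup(name="customer-order-features", sagemaker_session=session)
fg.load_feature_definitions(data_frame=df)  # infer feature types from the dataframe

fg.create(
    s3_uri="s3://retail-data-lake/feature-store/",  # offline store location
    record_identifier_name="customer_id",
    event_time_feature_name="event_time",
    role_arn=role,
    enable_online_store=True,  # low-latency reads at inference time
)

# Creation is asynchronous; once the group is active, write the rows
fg.ingest(data_frame=df, max_workers=1, wait=True)
```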
Bringing It Together: A Unified Retail Data Platform
To see how these services fit together, consider a real-world scenario:
A national retail chain wants to unify data from physical stores, e-commerce, mobile apps, and partner suppliers into a single platform that powers analytics, BI dashboards, and machine learning use cases like personalization and demand forecasting.
The accompanying architecture diagram shows how the services discussed in this article come together to address this. Here's the walkthrough:
Ingestion. Transactional data from Aurora flows into S3 through AWS DMS with continuous change data capture. Session data from DynamoDB streams through Lambda into S3. Web and mobile clickstream events feed into Kinesis Data Streams and are delivered to S3 by Kinesis Firehose. Partner files arrive via SFTP and are picked up by Lambda or Glue jobs.
Storage and governance. All ingested data lands in Amazon S3, organized into raw, staged, and curated zones. Lake Formation enforces fine-grained access controls, ensuring each team sees only what they're allowed to.
Cataloging and transformation. AWS Glue crawlers populate the Data Catalog so every dataset is discoverable. Glue ETL jobs and DataBrew recipes shape the raw data into curated, analysis-ready tables. Step Functions orchestrate the end-to-end pipeline, with EventBridge triggering it based on file arrivals or schedules. Heavy distributed processing runs on EMR.
Consumption. Analysts query curated data with Athena for ad-hoc questions and through Redshift for high-performance, concurrent BI workloads. Operational logs and customer events are indexed in OpenSearch for search-style analytics. QuickSight delivers dashboards to business users.
ML enablement. Data scientists work in SageMaker Studio as their daily IDE, prepare features with SageMaker Data Wrangler, and publish them to SageMaker Feature Store, where they become reusable across multiple models (personalization, churn prediction, demand forecasting).
This kind of architecture is not theoretical. Some version of it is running at most large data-driven companies on AWS today.
Final Thoughts
You don't need every service for every project. The value is in recognizing which one earns its place given the scenario: the volume of data, the latency required, the audience for the output, and the downstream systems involved.
Most real platforms grow incrementally. They start with S3 and Athena, add Glue for cataloging, bring in Redshift when BI workloads grow, layer in Kinesis or MSK when real-time needs emerge, and eventually introduce SageMaker tooling when ML enters the picture. Knowing the full toolbox lets you make those decisions deliberately rather than reactively.
The best way to internalize these services is to pick one scenario from your own work and map it against this list. Which service fits where? Often, the answer reveals a simpler path than the one you're currently on.
If you found this useful, watch for the companion piece on AWS services for AI, a different set of tools for a different set of problems.