Getting started with AWS SageMaker in 2025 - the complete practical guide

Machine learning is transforming industries. But here's the problem - most teams spend more time managing infrastructure than building models.

Servers crash. Environments break. Deployments fail. And your data scientists are suddenly doing DevOps.

AWS SageMaker changes this equation. It's a fully managed platform that handles infrastructure so you can focus on what actually matters - your models and your data.

This is a comprehensive guide on how to start working with SageMaker. We'll cover everything from basic setup to advanced integrations, with special attention to pricing because that's where most people get surprised.

What is SageMaker and why should you care

SageMaker is AWS's end-to-end machine learning platform. Launched in 2017, it's evolved into one of the most complete ML services available.

Here's what it actually does:

→ Provides managed Jupyter notebooks that spin up in minutes

→ Handles distributed training across multiple GPUs automatically

→ Deploys models to auto-scaling endpoints with one command

→ Includes 24 production-ready algorithms optimized for AWS hardware

→ Supports TensorFlow, PyTorch, Hugging Face, scikit-learn, and custom frameworks

→ Integrates with the entire AWS ecosystem (S3, IAM, CloudWatch, Lambda)

Think of it as your complete ML workspace. Data preparation, model training, hyperparameter tuning, deployment, and monitoring all happen in one place.

The real value? You don't need a dedicated platform engineering team. SageMaker abstracts away the complexity of Kubernetes clusters, Docker containers, and distributed computing.

Why companies choose SageMaker

Speed to production. Teams deploy models in days instead of months. The infrastructure is already built.

Managed infrastructure. No server maintenance, no patching, no cluster management. AWS handles uptime and scaling.

Cost control. Pay only for compute time used. Training jobs shut down automatically when complete. Spot instances reduce costs by up to 90%.

Enterprise security. Built-in encryption, VPC support, IAM integration, and compliance certifications.

Framework flexibility. Not locked into AWS-specific tools. Bring your existing TensorFlow or PyTorch code.

The complete setup guide

Getting started takes about 30 minutes. Here's the detailed walkthrough.

Step 1: create your AWS account

Visit aws.amazon.com and sign up. You'll need a credit card for verification, but the free tier covers your first two months of exploration.

AWS offers a generous free tier specifically for SageMaker:

→ 250 hours per month of ml.t3.medium notebook instances

→ 50 hours per month of ml.m5.xlarge training instances

→ 125 hours per month of ml.m5.xlarge inference endpoints

→ 25 GB of Amazon S3 storage for data and models

This is enough to complete multiple tutorials, train several models, and understand how the platform works. Costs only start after you exceed these limits or after two months.

Step 2: access SageMaker

Log into the AWS console. In the search bar at the top, type "SageMaker" and click on the service.

You'll land on the SageMaker console. This is mission control for all your ML projects.

Step 3: choose your setup approach

SageMaker offers two setup paths:

Quick setup - AWS automatically configures networking, IAM roles, and security groups. Takes 5 minutes. Best for single users or learning. Note that networking resources created during quick setup may incur additional charges.

Manual setup - you create the domain, configure VPCs, set up IAM roles, and define security policies yourself. Takes 20-30 minutes. Gives you complete control over costs and security. Recommended for production environments.

For your first time, quick setup works perfectly. Click "Set up for single user" and let AWS handle the configuration.

Step 4: create your domain

A domain is your SageMaker workspace. It's where all your notebooks, training jobs, models, and endpoints live.

During setup, you'll:

→ Name your domain (something like "ml-workspace")

→ Select an IAM execution role (quick setup creates this automatically)

→ Choose default instance types for notebooks

→ Configure storage settings

AWS creates the domain in about 5 minutes. You'll get a notification when it's ready.

Step 5: launch SageMaker Studio

Once your domain is ready, click "Open Studio". This launches the web-based IDE.

SageMaker Studio is where everything happens. It looks like JupyterLab but with additional ML-specific tools built in.

You can:

→ Create notebooks with different Python environments

→ Access example notebooks for common ML tasks

→ Browse pre-trained models in JumpStart

→ Monitor training jobs and endpoints

→ Manage experiments and model versions

Step 6: create your first notebook

Click "File" → "New" → "Notebook". Choose the Data Science 3.0 image and select ml.t3.medium as your instance type.

The notebook spins up in about 60 seconds. You now have a fully functional Python environment with pandas, numpy, scikit-learn, TensorFlow, PyTorch, and the SageMaker SDK pre-installed.

Try this simple code to verify everything works:

import sagemaker
print(sagemaker.__version__)
        

If you see a version number, you're good to go.
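
While you're in that notebook, it's also worth grabbing the execution role and default S3 bucket for the session - the later examples in this guide assume a role variable that comes from exactly this kind of setup:

import sagemaker

session = sagemaker.Session()
role = sagemaker.get_execution_role()  # the IAM role attached to your Studio domain
bucket = session.default_bucket()      # a per-account S3 bucket SageMaker creates for you

print(role)
print(bucket)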

Built-in algorithms - your secret weapon

SageMaker includes 24 optimized algorithms. These are production-grade implementations that often outperform custom code.

Supervised learning algorithms:

→ XGBoost - gradient boosting for classification and regression

→ Linear Learner - scalable linear models with automatic feature preprocessing

→ Factorization Machines - for recommendation systems and sparse data

→ K-Nearest Neighbors - simple but effective for classification

→ AutoGluon-Tabular - automated machine learning for tabular data

→ CatBoost - handles categorical features automatically

Unsupervised learning:

→ K-Means - clustering algorithm that scales to massive datasets

→ PCA - dimensionality reduction

→ Random Cut Forest - anomaly detection

→ IP Insights - learns usage patterns for IPv4 addresses

Time series:

→ DeepAR - forecasting using recurrent neural networks

Text and NLP:

→ BlazingText - Word2vec and text classification

→ Sequence-to-Sequence - neural machine translation

→ Latent Dirichlet Allocation - topic modeling

Computer vision:

→ Image Classification - classify images into categories

→ Object Detection - find and label objects in images

→ Semantic Segmentation - pixel-level image classification

These algorithms require no code. You just provide data in the correct format, specify hyperparameters, and launch training.

The real advantage? These implementations are heavily optimized for AWS infrastructure. They automatically use GPU acceleration, distributed training, and efficient data loading.
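
Here's a rough sketch of what that workflow looks like in the SDK for XGBoost - the bucket paths and hyperparameters are placeholders, not a recommended configuration:

import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
role = sagemaker.get_execution_role()

# look up the managed XGBoost container image for your region
image_uri = sagemaker.image_uris.retrieve(
    framework='xgboost',
    region=session.boto_region_name,
    version='1.7-1'
)

xgb = Estimator(
    image_uri=image_uri,
    role=role,
    instance_count=1,
    instance_type='ml.m5.xlarge',
    output_path='s3://my-bucket/xgb-output'  # placeholder bucket
)
xgb.set_hyperparameters(objective='binary:logistic', num_round=100, max_depth=5)

# built-in XGBoost accepts CSV or libsvm input
xgb.fit({'train': TrainingInput('s3://my-bucket/train.csv', content_type='text/csv')})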

Working with TensorFlow on SageMaker

TensorFlow is one of the most popular frameworks, and SageMaker makes it incredibly easy to use.

Supported versions

SageMaker supports TensorFlow from version 1.4 through 2.18. Containers come pre-built with all dependencies.

Training TensorFlow models

You write your training script exactly as you would locally. SageMaker handles everything else:

from sagemaker.tensorflow import TensorFlow

estimator = TensorFlow(
    entry_point='train.py',
    role=role,
    instance_count=1,
    instance_type='ml.p3.2xlarge',
    framework_version='2.18.0',
    py_version='py311',
    hyperparameters={
        'epochs': 100,
        'batch_size': 32
    }
)

estimator.fit({'training': 's3://my-bucket/data'})
        

Your train.py script uses standard TensorFlow code. SageMaker automatically copies it to the training instance, sets up the environment, runs training, and saves the model to S3.
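
For a sense of what such a script looks like, here's a minimal sketch of a train.py in script mode - the model, file names, and data format are made up, but the SM_CHANNEL_TRAINING and SM_MODEL_DIR environment variables are what SageMaker actually provides:

# train.py - minimal script-mode sketch (hypothetical model and data layout)
import argparse
import os
import numpy as np
import tensorflow as tf

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    # hyperparameters from the estimator arrive as command-line arguments
    parser.add_argument('--epochs', type=int, default=10)
    parser.add_argument('--batch_size', type=int, default=32)
    args = parser.parse_args()

    # SageMaker copies the 'training' channel from S3 to this local path
    train_dir = os.environ['SM_CHANNEL_TRAINING']
    data = np.load(os.path.join(train_dir, 'train.npz'))  # assumed file name

    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dense(1, activation='sigmoid'),
    ])
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    model.fit(data['x'], data['y'], epochs=args.epochs, batch_size=args.batch_size)

    # anything written to SM_MODEL_DIR is uploaded to S3 as the model artifact
    # (Keras 3 / TF 2.x SavedModel export for TensorFlow Serving; older Keras uses model.save)
    model.export(os.path.join(os.environ['SM_MODEL_DIR'], '1'))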

Distributed training

For large models or datasets, SageMaker supports distributed training. Just change instance_count to 2 or more and choose a distribution strategy - parameter servers, Horovod (MPI), or SageMaker's own data parallel library - to spread training across instances and GPUs.

estimator = TensorFlow(
    entry_point='train.py',
    role=role,
    instance_count=4,  # distributed across 4 instances
    instance_type='ml.p3.8xlarge',
    framework_version='2.18.0',
    distribution={'parameter_server': {'enabled': True}}
)
        

TensorFlow Serving for deployment

When you deploy a TensorFlow model, SageMaker uses TensorFlow Serving behind the scenes. It automatically creates a REST API endpoint with auto-scaling.
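
A minimal deploy-and-predict flow, assuming the estimator from the training example above, might look like this (the input shape is a placeholder - it depends entirely on your model):

# deploy the trained model behind a TensorFlow Serving endpoint
predictor = estimator.deploy(
    initial_instance_count=1,
    instance_type='ml.m5.xlarge'
)

# TensorFlow Serving accepts JSON-style payloads
result = predictor.predict({'instances': [[0.1, 0.2, 0.3]]})
print(result)

# delete the endpoint when you're done - it bills every second it exists
predictor.delete_endpoint()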

Training compiler acceleration

SageMaker Training Compiler optimizes TensorFlow training code. It can reduce training time by up to 50% with no code changes. Just add one parameter:

from sagemaker.tensorflow import TensorFlow, TrainingCompilerConfig

estimator = TensorFlow(
    entry_point='train.py',
    role=role,
    instance_count=1,
    instance_type='ml.p3.8xlarge',
    framework_version='2.18.0',
    compiler_config=TrainingCompilerConfig()  # that's it
)
        

The compiler analyzes your model architecture and applies optimizations like operator fusion, memory planning, and graph optimization.

Hugging Face integration - accessing 10,000+ models

Hugging Face and AWS partnered to make deploying transformers models trivial.

What's included

SageMaker has native Hugging Face support. Pre-built containers include:

→ Transformers library with all architectures

→ Tokenizers for fast preprocessing

→ Datasets library for data loading

→ Accelerate for distributed training

→ Optimum for inference optimization

Training Hugging Face models

Fine-tuning BERT for sentiment analysis looks like this:

from sagemaker.huggingface import HuggingFace

huggingface_estimator = HuggingFace(
    entry_point='train.py',
    role=role,
    transformers_version='4.26',
    pytorch_version='2.0',
    py_version='py310',
    instance_count=1,
    instance_type='ml.p3.2xlarge',
    hyperparameters={
        'model_name': 'bert-base-uncased',
        'epochs': 3,
        'train_batch_size': 32
    }
)

huggingface_estimator.fit({'train': 's3://bucket/data'})
        

Your train.py uses standard Hugging Face code. SageMaker handles the infrastructure.

Deploying with JumpStart

SageMaker JumpStart provides one-click deployment for popular models. Open JumpStart in Studio, browse models, and click "Deploy".

Available models include:

→ Llama 3.1 and 3.2 (8B, 70B, 405B parameters)

→ Mistral 7B and Mixtral 8x7B

→ Qwen 2.5 models

→ BERT, RoBERTa, DistilBERT variants

→ T5 for text generation

→ Vision transformers for image tasks

Deployment takes 5-10 minutes. You get a REST API endpoint ready for inference.
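
If you prefer code over the console, the same deployment can be done with the JumpStart SDK. The model_id below is an assumption - browse the JumpStart catalog in Studio for current identifiers:

from sagemaker.jumpstart.model import JumpStartModel

# placeholder id - look up the exact model_id in the JumpStart catalog
model = JumpStartModel(model_id='huggingface-llm-mistral-7b-instruct')

# gated models (e.g. Llama) additionally require accept_eula=True on deploy
predictor = model.deploy()

response = predictor.predict({'inputs': 'Summarize what Amazon SageMaker does.'})
print(response)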

Cost optimization with AWS chips

Deploying on AWS Inferentia2 reduces inference costs by up to 40% compared to GPU instances. Training on AWS Trainium cuts training costs by up to 50%.

These specialized ML chips are fully compatible with Hugging Face models through the Optimum library.

Real-world example

Deploying a text classification model:

from sagemaker.huggingface import HuggingFaceModel

hub = {
    'HF_MODEL_ID': 'distilbert-base-uncased-finetuned-sst-2-english',
    'HF_TASK': 'text-classification'
}

huggingface_model = HuggingFaceModel(
    env=hub,
    role=role,
    transformers_version='4.26',
    pytorch_version='2.0',
    py_version='py310'
)

predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type='ml.g4dn.xlarge'
)
        

That's it. A few lines of code to deploy a production-ready sentiment analysis API.
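
Once the endpoint is up, calling it from the SDK is one line - and remember to delete it afterwards:

result = predictor.predict({'inputs': 'SageMaker made this deployment painless.'})
print(result)  # something like [{'label': 'POSITIVE', 'score': 0.99}]

predictor.delete_endpoint()  # stop paying the moment you're done testing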

Pricing - the critical details everyone needs

SageMaker pricing confuses everyone at first. Here's the complete breakdown.

The core principle

You pay for compute resources by the second. No upfront costs, no long-term commitments. When a resource stops, billing stops.

Notebook instances

Charged per second while running. Costs vary by instance type:

→ ml.t3.medium - $0.05/hour ($36/month if running 24/7)

→ ml.t3.large - $0.10/hour ($73/month)

→ ml.m5.xlarge - $0.23/hour ($168/month)

→ ml.p3.2xlarge (GPU) - $3.83/hour ($2,799/month)

Critical point - notebooks keep charging until you stop them manually. Always stop notebooks when you're done working.

Training jobs

Also charged per second. Training automatically stops when complete, so you only pay for actual training time.

Example costs:

→ ml.m5.xlarge - $0.269/hour

→ ml.c5.2xlarge - $0.425/hour

→ ml.p3.2xlarge (GPU) - $3.825/hour

→ ml.p3.8xlarge (4 GPUs) - $14.688/hour

→ ml.p4d.24xlarge (8 A100 GPUs) - $37.688/hour

A typical training job on ml.p3.2xlarge for 4 hours costs about $15. Scale up to ml.p3.8xlarge and it's about $59 for 4 hours.

Spot instances for training

Spot instances use spare AWS capacity at up to 90% discount. Perfect for training because SageMaker handles interruptions automatically.

Example - ml.p3.2xlarge normally costs $3.825/hour. The spot price averages around $1.15/hour - that's 70% savings.

Enable spot training with one parameter:

estimator = TensorFlow(
    entry_point='train.py',
    role=role,
    instance_type='ml.p3.2xlarge',
    use_spot_instances=True,
    max_wait=7200,  # maximum wait time in seconds
    max_run=3600   # maximum training time
)
        

Inference endpoints - the expensive part

Endpoints run continuously and charge every second until deleted. This is where costs balloon if you're not careful.

Example endpoint costs running 24/7:

→ ml.t3.medium - $36/month

→ ml.m5.xlarge - $168/month

→ ml.g4dn.xlarge (GPU) - $542/month

→ ml.p3.2xlarge - $2,799/month

Here's the trap - you deploy an endpoint for testing, forget about it, and get a $2,799 bill. Always delete test endpoints immediately.

Serverless inference - better for variable traffic

Serverless inference scales to zero when idle. You only pay for actual inference requests.

Pricing:

→ compute - $0.20 per GB-second

→ requests - $0.20 per million requests

For applications with sporadic traffic, serverless is dramatically cheaper than persistent endpoints.
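
Switching a model to serverless is a one-object change in the SDK - a sketch, reusing the Hugging Face model from earlier (memory size and concurrency here are illustrative):

from sagemaker.serverless import ServerlessInferenceConfig

serverless_config = ServerlessInferenceConfig(
    memory_size_in_mb=2048,  # 1024-6144 MB, in 1 GB increments
    max_concurrency=5
)

predictor = huggingface_model.deploy(
    serverless_inference_config=serverless_config
)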

Batch transform - for bulk predictions

When you need predictions on large datasets but don't need real-time responses, use batch transform.

You pay only while the job runs. Costs are comparable to training instances, and the job shuts down automatically when complete.
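
A sketch of a batch transform job, again reusing a model object from earlier - the bucket paths and input format are assumptions:

transformer = huggingface_model.transformer(
    instance_count=1,
    instance_type='ml.m5.xlarge',
    output_path='s3://my-bucket/batch-output'  # placeholder bucket
)

transformer.transform(
    data='s3://my-bucket/batch-input.jsonl',  # one JSON record per line
    content_type='application/json',
    split_type='Line'
)
transformer.wait()  # instances shut down automatically when the job finishes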

Storage costs

Three storage types affect your bill:

S3 storage - $0.023 per GB per month. This adds up quickly if you save every training checkpoint. One ML project can easily generate 100GB of checkpoints ($2.30/month per project).

EBS volumes - attached to notebook instances. Charged even when the notebook is stopped. ml.t3.medium gets a 5GB volume by default ($0.50/month). Delete old volumes regularly.

Model artifacts - stored in S3. Trained models can be 1-10GB depending on architecture.

Savings plans - up to 64% off

SageMaker Savings Plans offer massive discounts for committed usage.

→ One-year commitment with no upfront payment - 33% discount

→ One-year commitment with full upfront payment - 40% discount

→ Three-year commitment with full upfront payment - 64% discount

The discount applies automatically across all SageMaker usage - notebooks, training, and inference.

Example - commit to $100/month for one year with no upfront payment. You save 33% on all usage. If you use $150 worth of resources, you pay the committed $100 plus $50 at on-demand rates.

Real cost scenarios

Scenario 1 - learning and experimentation:

→ 20 hours notebook time on ml.t3.medium - $1

→ 10 hours training on ml.m5.xlarge - $2.69

→ 5GB S3 storage - $0.12

→ Total - $3.81/month (covered by free tier)

Scenario 2 - small production deployment:

→ one ml.m5.xlarge endpoint 24/7 - $168/month

→ 50GB S3 storage - $1.15/month

→ occasional training on spot instances - $20/month

→ Total - $189/month

Scenario 3 - serious ML workload:

→ three ml.g4dn.xlarge endpoints - $1,626/month

→ weekly training on ml.p3.8xlarge (4 hours/week) - $235/month

→ 500GB S3 storage - $11.50/month

→ Total - $1,872/month

With a one-year savings plan - $1,248/month (33% savings)

Cost optimization strategies

Stop all resources when not in use. This alone can save 80% of costs.

Use spot instances for training. Automatic 70-90% savings with minimal risk.

Right-size instances. Don't train simple models on ml.p3.8xlarge. Start small, scale up only when needed.

Delete old checkpoints. Set up S3 lifecycle policies to automatically delete files older than 30 days, as in the sketch below.
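
A lifecycle rule like this one (bucket name and prefix are placeholders) handles the cleanup for you:

import boto3

s3 = boto3.client('s3')

# expire anything under checkpoints/ after 30 days
s3.put_bucket_lifecycle_configuration(
    Bucket='my-ml-bucket',  # placeholder
    LifecycleConfiguration={
        'Rules': [{
            'ID': 'expire-old-checkpoints',
            'Filter': {'Prefix': 'checkpoints/'},
            'Status': 'Enabled',
            'Expiration': {'Days': 30}
        }]
    }
)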

Use serverless inference for low-traffic endpoints. Much cheaper than always-on endpoints for sporadic use.

Monitor with billing alarms. Set up CloudWatch alerts for spending over $50, $100, etc. This prevents surprise bills.
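
A billing alarm is a few lines of boto3 - note that billing metrics only live in us-east-1 and require billing alerts to be enabled in your account; the SNS topic ARN below is a placeholder:

import boto3

cloudwatch = boto3.client('cloudwatch', region_name='us-east-1')

cloudwatch.put_metric_alarm(
    AlarmName='monthly-spend-over-100-usd',
    Namespace='AWS/Billing',
    MetricName='EstimatedCharges',
    Dimensions=[{'Name': 'Currency', 'Value': 'USD'}],
    Statistic='Maximum',
    Period=21600,  # check every 6 hours
    EvaluationPeriods=1,
    Threshold=100.0,
    ComparisonOperator='GreaterThanThreshold',
    AlarmActions=['arn:aws:sns:us-east-1:123456789012:billing-alerts']  # placeholder topic
)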

The mistakes everyone makes

Mistake 1 - forgetting to stop endpoints

You deploy a test endpoint on ml.g4dn.xlarge. It costs $0.74/hour. You forget about it for a month. Bill - $542.

Solution - delete test endpoints immediately. Set up a CloudWatch alarm for any endpoint running over 24 hours.

Mistake 2 - using GPUs when CPUs work fine

Training a simple logistic regression model on ml.p3.2xlarge ($3.83/hour) instead of ml.m5.xlarge ($0.27/hour) is a 14x cost increase for no benefit.

Solution - start with CPU instances. Only move to GPU when you know you need it.

Mistake 3 - saving every checkpoint

Training for 100 epochs, saving checkpoints every epoch. 100 checkpoints at 500MB each is 50GB. That's $13.80/year per model in storage costs.

Solution - save only the best model and the final model. Delete the rest.
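
In Keras, for example, a single callback keeps only the best model instead of one file per epoch - a sketch, where the checkpoint path and training variables are placeholders:

import tensorflow as tf

checkpoint_cb = tf.keras.callbacks.ModelCheckpoint(
    filepath='/opt/ml/checkpoints/best_model.keras',  # placeholder path
    monitor='val_loss',
    save_best_only=True  # overwrite instead of accumulating 100 files
)

model.fit(x_train, y_train, validation_split=0.2,
          epochs=100, callbacks=[checkpoint_cb])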

Mistake 4 - not using spot instances

Running long training jobs on on-demand instances. Paying full price when spot could save 70%.

Solution - always use spot for training unless you have a hard deadline.

Mistake 5 - keeping notebooks running overnight

Leaving an ml.m5.xlarge notebook ($0.23/hour) running while you sleep. 8 hours - $1.84. Do this every night for a month - $55.

Solution - stop notebooks when you close your laptop. Restart takes 60 seconds.

Practical tips for success

Start with examples

SageMaker Studio includes dozens of example notebooks. Open JumpStart, browse the examples, and clone the ones relevant to your use case.

Example categories include:

→ tabular data with XGBoost

→ computer vision with TensorFlow

→ NLP with Hugging Face

→ time series forecasting

→ recommendation systems

These notebooks are production-quality. You can use them as templates for your own projects.

Use SageMaker Autopilot for baseline models

Autopilot is automated machine learning. You provide data, specify the target column, and Autopilot tests dozens of algorithms and hyperparameter combinations.

It generates:

→ data preprocessing code

→ multiple trained models

→ model explainability reports

→ deployment-ready code

Perfect for establishing a performance baseline quickly.
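
Autopilot is also available through the SDK if you'd rather not use the Studio UI - a sketch, with the target column and bucket paths as assumptions:

from sagemaker.automl.automl import AutoML

automl = AutoML(
    role=role,
    target_attribute_name='churned',        # placeholder target column
    max_candidates=20,
    output_path='s3://my-bucket/autopilot'  # placeholder bucket
)

# point it at a CSV that includes the target column
automl.fit(inputs='s3://my-bucket/customers.csv', wait=False)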

Leverage Data Wrangler

Data Wrangler is a visual interface for data preparation. You can:

→ join datasets from S3, Redshift, or Athena

→ apply transformations using a GUI

→ generate feature engineering code

→ export cleaned data or preprocessing code

Saves hours of pandas coding for common data cleaning tasks.

Implement monitoring

SageMaker Model Monitor detects data drift and model degradation. It compares incoming data to training data and alerts you when distributions diverge.

This prevents the silent failure mode where model accuracy degrades over time without anyone noticing.
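
A rough sketch of wiring up Model Monitor - this assumes the endpoint was deployed with data capture enabled, and the S3 paths are placeholders:

from sagemaker.model_monitor import DefaultModelMonitor, CronExpressionGenerator
from sagemaker.model_monitor.dataset_format import DatasetFormat

monitor = DefaultModelMonitor(role=role, instance_count=1, instance_type='ml.m5.xlarge')

# profile the training data to produce baseline statistics and constraints
monitor.suggest_baseline(
    baseline_dataset='s3://my-bucket/train.csv',
    dataset_format=DatasetFormat.csv(header=True),
    output_s3_uri='s3://my-bucket/monitor/baseline'
)

# compare captured endpoint traffic against the baseline every hour
monitor.create_monitoring_schedule(
    endpoint_input=predictor.endpoint_name,
    output_s3_uri='s3://my-bucket/monitor/reports',
    statistics=monitor.baseline_statistics(),
    constraints=monitor.suggested_constraints(),
    schedule_cron_expression=CronExpressionGenerator.hourly()
)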

Use Experiments for tracking

SageMaker Experiments automatically tracks all training runs. You can compare:

→ hyperparameters used

→ metrics achieved

→ training duration

→ cost per training run

Makes it easy to see which approach worked best across dozens of experiments.
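
With the newer Run API, tracking is a context manager around your training code - a sketch with made-up names and values:

from sagemaker.experiments.run import Run

with Run(experiment_name='churn-models', run_name='xgboost-baseline') as run:
    run.log_parameter('max_depth', 5)
    run.log_parameter('num_round', 100)
    # ... train and evaluate the model here ...
    run.log_metric(name='validation:auc', value=0.91)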

Your first week with SageMaker

Day 1 - complete setup, launch Studio, run an example notebook. Understand how notebooks work.

Day 2 - train your first model using a built-in algorithm. Try XGBoost on the customer churn dataset.

Day 3 - deploy that model to an endpoint. Make predictions via the API. Then delete the endpoint.

Day 4 - bring your own code. Write a simple scikit-learn or TensorFlow script and run it on SageMaker.

Day 5 - try distributed training. Scale your training job to multiple instances.

Day 6 - explore JumpStart. Deploy a pre-trained Hugging Face model for text classification.

Day 7 - set up monitoring and billing alarms. Understand where costs come from.

By the end of week one, you'll understand the complete workflow. From there, it's about building increasingly complex projects.

The bottom line

SageMaker removes infrastructure complexity from machine learning. You focus on data and models. AWS handles servers, scaling, and deployment.

The platform supports every popular framework. It includes production-grade algorithms, and it scales from experiments to enterprise workloads.

Pricing can be complex, but the model is straightforward - pay for compute by the second. Use spot instances, stop resources when idle, and your costs stay reasonable.

The free tier gives you two months to learn without spending money. Use that time to understand the platform, train models, and figure out what instance types you actually need.

Start simple. Use built-in algorithms before writing custom code. Deploy to CPU instances before GPU. Monitor everything.

SageMaker is powerful because it handles the parts of ML that don't differentiate your business - infrastructure - so you can focus on what does - your models and your data.

Time to build something.
