Deploying LLM-Powered APIs with Docker
Turning AI ideas into deployable services using Python, modern frameworks, and cloud-ready containers
In the previous article of the AI Lab series, we set up an entry point with Traefik. Now it’s time to build the modules that will turn that foundation into real capabilities. Before diving into implementation, let’s zoom out and understand the AI landscape we’re about to integrate into our stack.
We’ll then add a few lines of code to implement our first endpoint capable of querying an LLM.
Mapping the AI Landscape
Artificial Intelligence (AI) is a broad field encompassing several major domains:
• Machine Learning (ML) — algorithms that learn from data to make predictions or decisions.
• Deep Learning (DL) — a subset of ML that uses multi-layered neural networks to identify complex patterns.
• Natural Language Processing (NLP) — techniques enabling machines to understand, interpret, and generate human language.
At the intersection of DL and NLP, we find Large Language Models (LLMs) such as:
ChatGPT, Claude, Gemini, LLaMA, Mistral, or Titan.
These models drive what most people today call Generative AI — systems capable of producing text, images, code, and more.
But AI extends far beyond LLMs. Other important areas include Computer Vision (analyzing and understanding visual data), Reinforcement Learning (training agents through trial-and-error and feedback), and a range of specialized Neural Network architectures designed for specific domains. This is where popular frameworks like TensorFlow, PyTorch, or Keras are used to design, train, and deploy models efficiently.
For many SMBs, the real opportunity lies in applying prompt engineering to automate high-value tasks — from text generation, translation, summarization, and Q&A systems to RAG-powered knowledge assistants, contextual chatbots, and multimodal workflows such as image-to-text or text-to-image — all relying on LLM architectures at their core.
The value of an LLM lies largely in the massive computational effort invested to train it and in the resulting billions (or even trillions) of parameters it learns. The intellectual property resides primarily in the weights — the numerical values of these parameters — which capture the knowledge distilled from the training data.
Most major LLM providers do not give direct access to model weights. Instead, they offer API-based access so you can send prompts and receive responses without running the model locally. This approach allows them to:
• Protect intellectual property (model weights are proprietary).
• Control usage (through rate limits, billing, and policy enforcement).
• Handle infrastructure (scaling, updates, optimizations, GPU costs).
Most commercial LLM APIs today use a RESTful HTTP interface with JSON request and response formats. They typically provide official integration libraries for Python — the most widely used language in the AI ecosystem — and often for JavaScript/TypeScript. Direct Java SDKs are less common, but Java applications can interact with these APIs via standard REST calls.
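As a quick illustration, here is a minimal sketch using the official openai Python SDK (v1.x). The model name and question are placeholders, and the key is read from the OPENAI_API_KEY environment variable rather than hardcoded:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "What is the capital of Switzerland?"}],
)
print(response.choices[0].message.content)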
Python in 2025: The Unexpected Powerhouse of Modern AI
I can already hear the chuckles behind me:
“Python? Modern? Is this a joke?”
Yes, I know — Python first appeared when floppy disks were still a thing, and the hottest phone on the market had a monochrome screen and Snake installed. But hold that thought. The Python you’re imagining — slow scripts, indentation wars, and “just for sysadmins” — is not the Python that runs today’s AI stacks. In 2025, Python is the lingua franca of machine learning, deep learning, and LLM orchestration. Give me a minute, and I’m pretty sure you’ll start seeing it less like an antique typewriter and more like the Swiss Army knife of modern AI development.
As we have seen, integrating an LLM into your own application through its API is relatively straightforward. However, for our AI Lab, there’s an important consideration:
We want to experiment with different providers — such as OpenAI, Anthropic, or Mistral — and also integrate our own data into the process.
This means we need to carefully evaluate the available frameworks, and in 2025 it’s clear that Python remains the primary language for this kind of work.
Most of today’s AI frameworks and tooling are indeed Python-first (or even Python-only), for a few reasons:
✔️ Ecosystem maturity — PyTorch, TensorFlow, Hugging Face Transformers, LangChain, OpenAI SDK, etc., all have their most complete and up-to-date implementations in Python.
✔️ Developer adoption — AI research has been dominated by Python since the deep learning boom of the early 2010s, so cutting-edge models are released in Python first.
✔️ Rapid prototyping — Python’s syntax + huge library ecosystem make it ideal for experimenting and iterating fast.
✔️ Community & tooling — tutorials, notebooks, pre-trained models, and example repos are overwhelmingly in Python.
By the 2020s, Python not only dominates AI research and prototyping; its ecosystem also focuses on production readiness: FastAPI, Pydantic, async support, and robust serving stacks. This is exactly what we will dig into in this article.
Why it matters for AI today: Python is where cutting-edge AI happens first — but in production, the code lives inside a bigger, reliable backbone: API frameworks, process managers, monitoring, and scaling tools.
So, do you still think Python is just a relic of the past — or the most proven and indispensable tool in the AI engineer’s arsenal for 2025?
Choosing the Right Integration Framework
At this early stage, it’s clear that we will work in Python. While we could interact directly with each provider’s exposed API, doing so for multiple vendors quickly becomes cumbersome — especially when switching between different authentication methods, request formats, or model capabilities. This is where an integration framework becomes essential: it abstracts away the provider-specific details and lets us focus on building features, not plumbing code.
Among the many options available are LangChain, AutoGen, CrewAI, Flowise, and AgentOps. Each has its strengths, but LangChain stands out for its depth, flexibility, and agentic capabilities. In practical terms, this means it doesn’t just act as a thin wrapper around APIs; it enables AI-driven workflows where the model can decide dynamically which tools to call, what data to retrieve, and how to combine intermediate results.
For AI Lab, this is critical. We want to experiment with multiple providers — such as OpenAI, Anthropic, or Mistral — integrate our own proprietary datasets, and design prototypes that can evolve into production-grade applications. LangChain’s rich ecosystem of connectors, memory modules, and chain types makes it ideal for building complex solutions like:
• Retrieval-Augmented Generation (RAG) pipelines.
• Multi-step reasoning agents combining different APIs.
• Contextual chatbots with access to internal business data.
• Hybrid workflows that mix LLM calls, image generation, and database lookups.
Beyond features, LangChain benefits from an active open-source community, extensive documentation, and regular updates that keep pace with the rapidly evolving AI landscape. This combination of flexibility, ecosystem maturity, and agentic workflow support makes it a strategic choice for AI Lab’s integration layer.
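To give a feel for that abstraction, here is a minimal sketch (class and package names as in LangChain’s current Python documentation; the prompt is illustrative). Switching providers is essentially a matter of swapping the chat model class:

from langchain_openai import ChatOpenAI
# from langchain_anthropic import ChatAnthropic  # same interface, different provider
from langchain_core.prompts import ChatPromptTemplate

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a concise assistant."),
    ("human", "{question}"),
])

llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
# llm = ChatAnthropic(model="claude-3-haiku-20240307", temperature=0)

chain = prompt | llm  # LangChain Expression Language: the prompt feeds the model
answer = chain.invoke({"question": "Summarize our refund policy in one sentence."})
print(answer.content)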
You can find more about LangChain here: https://python.langchain.com/docs/introduction
Let’s Build Our First Mini Test
Before diving deeper into LangChain’s capabilities, it’s worth stepping back to implement a simple test. The goal of this initial experiment is straightforward:
Expose a minimal Python-based microservice that provides a REST API to interact with OpenAI.
This will give us a working foundation to validate our setup, confirm that the API integration functions as expected, and create a baseline we can extend with more complex LangChain workflows later.
Configuring Docker to Add Our Service
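The exact values depend on the Traefik entry point configured in the previous article. As a minimal sketch (the network, entrypoint, and certificate resolver names are placeholders to adapt to your own setup), the Docker Compose service could look like this:

services:
  ai-api:
    build: /opt/ai_stack
    restart: unless-stopped
    networks:
      - traefik
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.ai-api.rule=Host(`ai-api.yourdomain.com`)"
      - "traefik.http.routers.ai-api.entrypoints=websecure"
      - "traefik.http.routers.ai-api.tls.certresolver=letsencrypt"
      - "traefik.http.services.ai-api.loadbalancer.server.port=8188"

networks:
  traefik:
    external: true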
With this setup, Traefik routes requests to https://ai-api.yourdomain.com directly to our Docker service, which listens internally on port 8188. The application’s code resides locally in /opt/ai_stack.
The next step is to define the service image using a Dockerfile — essentially a recipe that packages your code, dependencies, and runtime environment into an immutable image. A container is simply a live, running instance of that image.
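Here is a minimal sketch of such a Dockerfile (the base image tag and OS packages are illustrative; adjust them to your actual requirements):

FROM python:3.12-slim

WORKDIR /app

# OS packages (illustrative; install only what your stack actually needs)
RUN apt-get update && apt-get install -y --no-install-recommends curl \
    && rm -rf /var/lib/apt/lists/*

# Install Python dependencies first so this layer stays cached between code changes
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the application code (the build context is /opt/ai_stack)
COPY . .

EXPOSE 8188

CMD ["gunicorn", "-k", "uvicorn.workers.UvicornWorker", "api:app", "--workers", "2", "--bind", "0.0.0.0:8188", "--log-level", "warning"]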
This Dockerfile starts from a minimal Linux base image with Python preinstalled. It then installs the required OS packages and the Python dependencies listed in requirements.txt using "pip install -r requirements.txt". Each instruction in the file creates a new image layer, which is why it’s common practice to copy requirements.txt and install dependencies before copying the application code: the dependency layers stay cached, and dependency updates don’t require touching the Dockerfile itself.
The application code from /opt/ai_stack is copied into the container’s /app directory. The file also specifies the port the application will listen on (matching the value configured in Traefik). Finally, the application is launched using Gunicorn, a production-grade Python HTTP server.
What is Gunicorn?
Gunicorn (short for Green Unicorn) is a production-grade Python HTTP server for running WSGI or ASGI applications.
It’s not your application — it’s the component that runs your application and ensures it can handle real-world web traffic reliably, much like Nginx or Apache do for traditional web apps (although Gunicorn actually executes the Python code, whereas Nginx/Apache usually forward requests to it).
What does it do?
• Starts multiple worker processes so your app can handle many requests in parallel.
• Listens for HTTP requests and forwards them to your Python application (FastAPI, Flask, Django, etc.).
• Manages workers — restarts them if they crash, handles graceful shutdowns, and enforces timeouts.
• Performs far better under load than running your app directly with python api.py.
WSGI vs ASGI
Gunicorn itself is framework-agnostic — it can:
• Serve WSGI apps (Web Server Gateway Interface) like Flask or Django (sync model).
• Serve ASGI apps (Asynchronous Server Gateway Interface) like FastAPI or Starlette (async model) if you choose an ASGI-compatible worker class.
Different worker classes are optimized for different concurrency models.
Our AI Lab use case:
For AI Lab, ASGI is essential because it supports async/await, enabling efficient handling of many concurrent requests — perfect for APIs that call LLMs, databases, or other services without blocking.
That’s why we run Uvicorn inside Gunicorn workers to handle async code — ideal for FastAPI in production.
Configuration in our container:
CMD ["gunicorn", "-k", "uvicorn.workers.UvicornWorker", "api:app", "--workers", "2", "--bind", "0.0.0.0:8188", "--log-level", "warning"]
This means:
• Run 2 worker processes, each running Uvicorn, serving the FastAPI app defined in api.py.
• Bind to port 8188 so Traefik can route HTTPS traffic to it.
• Use log level = warning to reduce log noise.
At this stage:
• Gunicorn → Manages processes, load balances between workers.
• Uvicorn → Runs the async FastAPI app inside each worker.
What is FastAPI?
FastAPI is a modern, high-performance web framework for building APIs in Python. It’s designed to be fast, developer-friendly, and type-safe, making it especially popular for creating REST services, AI backends, and microservices (GraphQL is also possible through third-party extensions).
Let’s Get Started: Building Our FastAPI Application
api.py
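A minimal sketch of this bootstrap, assuming test_llm.py sits next to api.py:

from fastapi import FastAPI

from test_llm import test_llm_router

app = FastAPI(title="AI Lab API")

# Wire the test_llm group of endpoints into the application
app.include_router(test_llm_router)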
Implementation of the test route: test_llm.py
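A minimal sketch of that route. StringResponse is modeled here as a simple Pydantic wrapper around the answer, and the API key is read from an environment variable; both are assumptions for this example:

import os

from fastapi import APIRouter, HTTPException
from langchain_openai import ChatOpenAI
from pydantic import BaseModel

test_llm_router = APIRouter(prefix="/ai/v1/test_llm")


class QuestionRequest(BaseModel):
    question: str


class StringResponse(BaseModel):
    answer: str


@test_llm_router.post("/", response_model=StringResponse)
async def test_llm(request: QuestionRequest) -> StringResponse:
    try:
        llm = ChatOpenAI(
            model="gpt-3.5-turbo",                 # the OpenAI model to use
            temperature=0,                         # 0 makes the output deterministic
            api_key=os.environ["OPENAI_API_KEY"],  # never hardcode the key in production
        )
        ai_msg = llm.invoke(request.question)      # send the question to the LLM
        return StringResponse(answer=ai_msg.content)
    except Exception as exc:
        raise HTTPException(status_code=500, detail=str(exc))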
The first file is the application bootstrap, responsible for creating the FastAPI service and wiring routes together. test_llm_router is a group of related endpoints, imported from another file to keep the code modular and organized.
The second file contains the implementation of a POST endpoint that sends a question to an OpenAI language model and returns the answer.
✔️ It’s a POST route at /ai/v1/test_llm/ that returns a StringResponse.
✔️ The LangChain ChatOpenAI object is initialized with:
• Model: the OpenAI model to use.
• Temperature: controls creativity (0 makes it deterministic).
• api_key: your OpenAI API key (this should not be hardcoded in production).
✔️ The .invoke() method sends the question to the LLM and returns an AI message object.
✔️ We extract the answer from ai_msg.content and return it in the expected format.
✔️ If anything goes wrong, we raise an HTTPException with status 500.
Let’s Test Our New API
Suppose we simply want to ask: “What is the capital of Switzerland?”
curl -X POST https://ai-api.yourdomain.com/ai/v1/test_llm/ \
  -H "Content-Type: application/json" \
  -d '{"question": "What is the capital of Switzerland?"}'
Response:
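Assuming the StringResponse model sketched above, the service returns something like:

{"answer": "The capital of Switzerland is Bern."}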
Success! The API is functioning properly and returning the expected result.
OpenAI
A quick note about the OpenAI API key: Once you’ve created your OpenAI account (separate from a ChatGPT account), you’ll also need to choose the models you want to interact with.
You can choose the model that best fits your needs.
For simple chat interactions, GPT-3.5-turbo is usually sufficient and is much cheaper than GPT-5. OpenAI also provides guidance to help you select the right model.
Pricing examples:
GPT-5: $1.25 USD per 1M input tokens, $0.125 USD per 1M cached input tokens, and $10.00 USD per 1M output tokens.
GPT-3.5-turbo: $0.50 USD per 1M input tokens and $1.50 USD per 1M output tokens.
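At these rates, a request with 1,000 input tokens and 500 output tokens would cost roughly $0.00125 with GPT-3.5-turbo ($0.0005 for input plus $0.00075 for output), versus about $0.00625 with GPT-5, roughly five times more.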
From Prototype to Production: Next Steps for Our LangChain–OpenAI REST API
At this stage, we have the early foundation of an API capable of interacting with OpenAI via LangChain and exposing its capabilities through a REST interface. While far from production-ready, it already serves as a functional entry point for validating specific hypotheses in your project.
Because the endpoint follows the REST model, it can be easily integrated into existing applications, regardless of the programming language. For instance, you could invoke it from within a Spring Boot application and continue working in Java as usual. This flexibility allows teams to split responsibilities without disrupting day-to-day operations.
Before moving toward production, it’s important to step back and address several key areas for this Python component:
❌ Secure the OpenAI API key.
❌ Implement a consistent, centralized logging mechanism.
❌ Collect and centralize logs from Uvicorn, Gunicorn, and FastAPI.
❌ Apply a rate limiter at the endpoint level.
❌ Apply a rate limiter at the Traefik level.
❌ Configure CORS appropriately (see the sketch after this list).
❌ Protect the information sent to the LLM to prevent leaks or data breaches.
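For the CORS item above, FastAPI ships a CORSMiddleware that can be added to the app created in api.py. A minimal sketch, with a placeholder origin to replace with your real frontend domain:

from fastapi.middleware.cors import CORSMiddleware

app.add_middleware(
    CORSMiddleware,
    allow_origins=["https://app.yourdomain.com"],  # placeholder: your real frontend origin
    allow_methods=["POST"],
    allow_headers=["Content-Type"],
)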
We will explore some of these aspects in the next article of this AI Lab series. Later, we’ll extend the system with additional modules to support more advanced workflows, such as Retrieval-Augmented Generation (RAG), which will involve adding a Qdrant database.
Try the code and share your results — what’s the first AI endpoint you’ll deploy?
Read Part 3: Protecting Your Keys with Vault
The prototype is ready — now it’s time to get serious about security. We’ll show you how to store your OpenAI keys (and other secrets) in HashiCorp Vault and access them safely inside your AI stack.
As a Fractional CTO, I help teams design efficient, scalable systems — without over-engineering.
📩 Let’s talk If you want to rethink your architecture without overengineering it, my DMs are open.
The AI Lab Series teaser: https://www.garudax.id/posts/alexandrechatton_most-ai-talks-sound-smart-but-produce-activity-7360614211015577600-Bx8H