Deploying LLM-Powered APIs with Docker
Turning AI ideas into deployable services using Python, modern frameworks, and cloud-ready containers
In the previous article of the AI Lab series, we set up an entry point with Traefik. Now it’s time to build the modules that will turn that foundation into real capabilities. Before diving into implementation, let’s zoom out and understand the AI landscape we’re about to integrate into our stack.
We’ll then add a few lines of code to implement our first endpoint capable of querying an LLM.
Mapping the AI Landscape
Artificial Intelligence (AI) is a broad field encompassing several major domains:
• Machine Learning (ML) — algorithms that learn from data to make predictions or decisions.
• Deep Learning (DL) — a subset of ML that uses multi-layered neural networks to identify complex patterns.
• Natural Language Processing (NLP) — techniques enabling machines to understand, interpret, and generate human language.
At the intersection of DL and NLP, we find Large Language Models (LLMs) such as:
ChatGPT, Claude, Gemini, LLaMA, Mistral, or Titan.
These models drive what most people today call Generative AI — systems capable of producing text, images, code, and more.
But AI extends far beyond LLMs. Other important areas include Computer Vision (analyzing and understanding visual data), Reinforcement Learning (training agents through trial-and-error and feedback), and a range of specialized Neural Network architectures designed for specific domains. This is where popular frameworks like TensorFlow, PyTorch, or Keras are used to design, train, and deploy models efficiently.
For many SMBs, the real opportunity lies in applying prompt engineering to automate high-value tasks — from text generation, translation, summarization, and Q&A systems to RAG-powered knowledge assistants, contextual chatbots, and multimodal workflows such as image-to-text or text-to-image — all relying on LLM architectures at their core.
The value of an LLM lies largely in the massive computational effort invested to train it and in the resulting billions (or even trillions) of parameters it learns. The intellectual property resides primarily in the weights — the numerical values of these parameters — which capture the knowledge distilled from the training data.
Most major LLM providers do not give direct access to model weights. Instead, they offer API-based access so you can send prompts and receive responses without running the model locally. This approach allows them to:
• Protect intellectual property (model weights are proprietary).
• Control usage (through rate limits, billing, and policy enforcement).
• Handle infrastructure (scaling, updates, optimizations, GPU costs).
Most commercial LLM APIs today use a RESTful HTTP interface with JSON request and response formats. They typically provide official integration libraries for Python — the most widely used language in the AI ecosystem — and often for JavaScript/TypeScript. Direct Java SDKs are less common, but Java applications can interact with these APIs via standard REST calls.
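As a quick illustration, here is a minimal sketch using the official openai Python SDK (v1.x). The model name and question are placeholders, and the key is read from the OPENAI_API_KEY environment variable rather than hardcoded:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "What is the capital of Switzerland?"}],
)
print(response.choices[0].message.content)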
Python in 2025: The Unexpected Powerhouse of Modern AI
I can already hear the chuckles behind me:
“Python? Modern? Is this a joke?”
Yes, I know — Python first appeared when floppy disks were still a thing, and the hottest phone on the market had a monochrome screen and Snake installed. But hold that thought. The Python you’re imagining — slow scripts, indentation wars, and “just for sysadmins” — is not the Python that runs today’s AI stacks. In 2025, Python is the lingua franca of machine learning, deep learning, and LLM orchestration. Give me a minute, and I’m pretty sure you’ll start seeing it less like an antique typewriter and more like the Swiss Army knife of modern AI development.
As we have seen, integrating an LLM into your own application through its API is relatively straightforward. However, for our AI Lab, there’s an important consideration:
We want to experiment with different providers — such as OpenAI, Anthropic, or Mistral — and also integrate our own data into the process.
This means we need to carefully evaluate the available frameworks, and in 2025 it’s clear that Python remains the primary language for this kind of work.
Most of today’s AI frameworks and tooling are indeed Python-first (or even Python-only), for a few reasons:
✔️ Ecosystem maturity — PyTorch, TensorFlow, Hugging Face Transformers, LangChain, OpenAI SDK, etc., all have their most complete and up-to-date implementations in Python.
✔️ Developer adoption — AI research has been dominated by Python since the deep learning boom of the early 2010s, so cutting-edge models are released in Python first.
✔️ Rapid prototyping — Python’s syntax + huge library ecosystem make it ideal for experimenting and iterating fast.
✔️ Community & tooling — tutorials, notebooks, pre-trained models, and example repos are overwhelmingly in Python.
By the 2020s, Python not only dominates AI research and prototyping; its ecosystem also focuses on production readiness: FastAPI, Pydantic, async support, and robust serving stacks. This is exactly what we will dig into in this article.
Why it matters for AI today: Python is where cutting-edge AI happens first — but in production, the code lives inside a bigger, reliable backbone: API frameworks, process managers, monitoring, and scaling tools.
So, do you still think Python is just a relic of the past — or the most proven and indispensable tool in the AI engineer’s arsenal for 2025?
Choosing the Right Integration Framework
At this early stage, it’s clear that we will work in Python. While we could interact directly with each provider’s exposed API, doing so for multiple vendors quickly becomes cumbersome — especially when switching between different authentication methods, request formats, or model capabilities. This is where an integration framework becomes essential: it abstracts away the provider-specific details and lets us focus on building features, not plumbing code.
Among the many options available are LangChain, AutoGen, CrewAI, Flowise, and AgentOps. Each has its strengths, but LangChain stands out for its depth, flexibility, and agentic capabilities. In practical terms, this means it doesn’t just act as a thin wrapper around APIs; it enables AI-driven workflows where the model can decide dynamically which tools to call, what data to retrieve, and how to combine intermediate results.
For AI Lab, this is critical. We want to experiment with multiple providers — such as OpenAI, Anthropic, or Mistral — integrate our own proprietary datasets, and design prototypes that can evolve into production-grade applications. LangChain’s rich ecosystem of connectors, memory modules, and chain types makes it ideal for building complex solutions like:
• Retrieval-Augmented Generation (RAG) pipelines.
• Multi-step reasoning agents combining different APIs.
• Contextual chatbots with access to internal business data.
• Hybrid workflows that mix LLM calls, image generation, and database lookups.
Beyond features, LangChain benefits from an active open-source community, extensive documentation, and regular updates that keep pace with the rapidly evolving AI landscape. This combination of flexibility, ecosystem maturity, and agentic workflow support makes it a strategic choice for AI Lab’s integration layer.
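To give a feel for that abstraction, here is a minimal sketch (class and package names as in LangChain’s current Python documentation; the prompt is illustrative). Switching providers is essentially a matter of swapping the chat model class:

from langchain_openai import ChatOpenAI
# from langchain_anthropic import ChatAnthropic  # same interface, different provider
from langchain_core.prompts import ChatPromptTemplate

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a concise assistant."),
    ("human", "{question}"),
])

llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
# llm = ChatAnthropic(model="claude-3-haiku-20240307", temperature=0)

chain = prompt | llm  # LangChain Expression Language: the prompt feeds the model
answer = chain.invoke({"question": "Summarize our refund policy in one sentence."})
print(answer.content)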
You can find more about LangChain here: https://python.langchain.com/docs/introduction
Let’s Build Our First Mini Test
Before diving deeper into LangChain’s capabilities, it’s worth stepping back to implement a simple test. The goal of this initial experiment is straightforward:
Expose a minimal Python-based microservice that provides a REST API to interact with OpenAI.
This will give us a working foundation to validate our setup, confirm that the API integration functions as expected, and create a baseline we can extend with more complex LangChain workflows later.
Configuring Docker to Add Our Service
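The exact values depend on the Traefik entry point configured in the previous article. As a minimal sketch (the network, entrypoint, and certificate resolver names are placeholders to adapt to your own setup), the Docker Compose service could look like this:

services:
  ai-api:
    build: /opt/ai_stack
    restart: unless-stopped
    networks:
      - traefik
    labels:
      - "traefik.enable=true"
      - "traefik.http.routers.ai-api.rule=Host(`ai-api.yourdomain.com`)"
      - "traefik.http.routers.ai-api.entrypoints=websecure"
      - "traefik.http.routers.ai-api.tls.certresolver=letsencrypt"
      - "traefik.http.services.ai-api.loadbalancer.server.port=8188"

networks:
  traefik:
    external: true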
With this setup, Traefik routes requests to https://ai-api.yourdomain.com directly to our Docker service, which listens internally on port 8188. The application’s code resides locally in /opt/ai_stack.
The next step is to define the service image using a Dockerfile — essentially a recipe that packages your code, dependencies, and runtime environment into an immutable image. A container is simply a live, running instance of that image.
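Here is a minimal sketch of such a Dockerfile (the base image tag and OS packages are illustrative; adjust them to your actual requirements):

FROM python:3.12-slim

WORKDIR /app

# OS packages (illustrative; install only what your stack actually needs)
RUN apt-get update && apt-get install -y --no-install-recommends curl \
    && rm -rf /var/lib/apt/lists/*

# Install Python dependencies first so this layer stays cached between code changes
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the application code (the build context is /opt/ai_stack)
COPY . .

EXPOSE 8188

CMD ["gunicorn", "-k", "uvicorn.workers.UvicornWorker", "api:app", "--workers", "2", "--bind", "0.0.0.0:8188", "--log-level", "warning"]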
This Dockerfile starts from a minimal Linux base image with Python preinstalled. It then installs the required OS packages and the Python dependencies listed in requirements.txt using "pip install -r requirements.txt". Each instruction in the file creates a new image layer, which is why it’s common practice to copy requirements.txt and install dependencies before copying the application code: the dependency layers stay cached, and dependency updates don’t require touching the Dockerfile itself.
The application code from /opt/ai_stack is copied into the container’s /app directory. The file also specifies the port the application will listen on (matching the value configured in Traefik). Finally, the application is launched using Gunicorn, a production-grade Python HTTP server.
What is Gunicorn?
Gunicorn (short for Green Unicorn) is a production-grade Python HTTP server for running WSGI or ASGI applications.
It’s not your application — it’s the component that runs your application and ensures it can handle real-world web traffic reliably, much like Nginx or Apache do for traditional web apps (although Gunicorn actually executes the Python code, whereas Nginx/Apache usually forward requests to it).
What does it do?
• Starts multiple worker processes so your app can handle many requests in parallel.
• Listens for HTTP requests and forwards them to your Python application (FastAPI, Flask, Django, etc.).
• Manages workers — restarts them if they crash, handles graceful shutdowns, and enforces timeouts.
• Performs far better under load than running your app directly with python api.py.
WSGI vs ASGI
Gunicorn itself is framework-agnostic — it can:
• Serve WSGI apps (Web Server Gateway Interface) like Flask or Django (sync model).
• Serve ASGI apps (Asynchronous Server Gateway Interface) like FastAPI or Starlette (async model) if you choose an ASGI-compatible worker class.
Different worker classes are optimized for different concurrency models.
Our AI Lab use case:
For AI Lab, ASGI is essential because it supports async/await, enabling efficient handling of many concurrent requests — perfect for APIs that call LLMs, databases, or other services without blocking.
That’s why we run Uvicorn inside Gunicorn workers to handle async code — ideal for FastAPI in production.
Configuration in our container:
CMD ["gunicorn", "-k", "uvicorn.workers.UvicornWorker", "api:app", "--workers", "2", "--bind", "0.0.0.0:8188", "--log-level", "warning"]
This means:
• Run 2 worker processes, each running Uvicorn, serving the FastAPI app defined in api.py.
• Bind to port 8188 so Traefik can route HTTPS traffic to it.
• Use log level = warning to reduce log noise.
At this stage:
• Gunicorn → Manages processes, load balances between workers.
• Uvicorn → Runs the async FastAPI app inside each worker.
What is FastAPI?
FastAPI is a modern, high-performance web framework for building APIs in Python. It’s designed to be fast, developer-friendly, and type-safe, making it especially popular for creating REST services, AI backends, and microservices (GraphQL is also possible through third-party extensions).
Let’s Get Started: Building Our FastAPI Application
api.py
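A minimal sketch of this bootstrap, assuming test_llm.py sits next to api.py:

from fastapi import FastAPI

from test_llm import test_llm_router

app = FastAPI(title="AI Lab API")

# Wire the test_llm group of endpoints into the application
app.include_router(test_llm_router)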
Implementation of the test route: test_llm.py
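A minimal sketch of that route. StringResponse is modeled here as a simple Pydantic wrapper around the answer, and the API key is read from an environment variable; both are assumptions for this example:

import os

from fastapi import APIRouter, HTTPException
from langchain_openai import ChatOpenAI
from pydantic import BaseModel

test_llm_router = APIRouter(prefix="/ai/v1/test_llm")


class QuestionRequest(BaseModel):
    question: str


class StringResponse(BaseModel):
    answer: str


@test_llm_router.post("/", response_model=StringResponse)
async def test_llm(request: QuestionRequest) -> StringResponse:
    try:
        llm = ChatOpenAI(
            model="gpt-3.5-turbo",                 # the OpenAI model to use
            temperature=0,                         # 0 makes the output deterministic
            api_key=os.environ["OPENAI_API_KEY"],  # never hardcode the key in production
        )
        ai_msg = llm.invoke(request.question)      # send the question to the LLM
        return StringResponse(answer=ai_msg.content)
    except Exception as exc:
        raise HTTPException(status_code=500, detail=str(exc))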
The first file is the application bootstrap, responsible for creating the FastAPI service and wiring routes together. test_llm_router is a group of related endpoints, imported from another file to keep the code modular and organized.
The second file contains the implementation of a POST endpoint that sends a question to an OpenAI language model and returns the answer.
✔️ It’s a POST route at /ai/v1/test_llm/ that returns a StringResponse.
✔️ The LangChain ChatOpenAI object is initialized with:
• Model: the OpenAI model to use.
• Temperature: controls creativity (0 makes it deterministic).
• api_key: your OpenAI API key (this should not be hardcoded in production).
✔️ The .invoke() method sends the question to the LLM and returns an AI message object.
✔️ We extract the answer from ai_msg.content and return it in the expected format.
✔️ If anything goes wrong, we raise an HTTPException with status 500.
Let’s Test Our New API
Suppose we simply want to ask: “What is the capital of Switzerland?”
curl -X POST https://ai-api.yourdomain.com/ai/v1/test_llm/ \
  -H "Content-Type: application/json" \
  -d '{"question": "What is the capital of Switzerland?"}'
Response:
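Assuming the StringResponse model sketched above, the service returns something like:

{"answer": "The capital of Switzerland is Bern."}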
Success! The API is functioning properly and returning the expected result.
OpenAI
A quick note about the OpenAI API key: Once you’ve created your OpenAI account (separate from a ChatGPT account), you’ll also need to choose the models you want to interact with.
You can choose the model that best fits your needs.
For simple chat interactions, GPT-3.5-turbo is usually sufficient and is much cheaper than GPT-5. OpenAI also provides guidance to help you select the right model.
Pricing examples:
GPT-5: $1.25 USD per 1M input tokens, $0.125 USD per 1M cached input tokens, and $10.00 USD per 1M output tokens.
GPT-3.5-turbo: $0.50 USD per 1M input tokens and $1.50 USD per 1M output tokens.
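At these rates, a request with 1,000 input tokens and 500 output tokens would cost roughly $0.00125 with GPT-3.5-turbo ($0.0005 for input plus $0.00075 for output), versus about $0.00625 with GPT-5, roughly five times more.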
From Prototype to Production: Next Steps for Our LangChain–OpenAI REST API
At this stage, we have the early foundation of an API capable of interacting with OpenAI via LangChain and exposing its capabilities through a REST interface. While far from production-ready, it already serves as a functional entry point for validating specific hypotheses in your project.
Because the endpoint follows the REST model, it can be easily integrated into existing applications, regardless of the programming language. For instance, you could invoke it from within a Spring Boot application and continue working in Java as usual. This flexibility allows teams to split responsibilities without disrupting day-to-day operations.
Before moving toward production, it’s important to step back and address several key areas for this Python component:
❌ Secure the OpenAI API key.
❌ Implement a consistent, centralized logging mechanism.
❌ Collect and centralize logs from Uvicorn, Gunicorn, and FastAPI.
❌ Apply a rate limiter at the endpoint level.
❌ Apply a rate limiter at the Traefik level.
❌ Configure CORS appropriately (see the sketch after this list).
❌ Protect the information sent to the LLM to prevent leaks or data breaches.
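For the CORS item above, FastAPI ships a CORSMiddleware that can be added to the app created in api.py. A minimal sketch, with a placeholder origin to replace with your real frontend domain:

from fastapi.middleware.cors import CORSMiddleware

app.add_middleware(
    CORSMiddleware,
    allow_origins=["https://app.yourdomain.com"],  # placeholder: your real frontend origin
    allow_methods=["POST"],
    allow_headers=["Content-Type"],
)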
We will explore some of these aspects in the next article of this AI Lab series. Later, we’ll extend the system with additional modules to support more advanced workflows, such as Retrieval-Augmented Generation (RAG), which will involve adding a Qdrant database.
Try the code and share your results — what’s the first AI endpoint you’ll deploy?
Read Part 3: Protecting Your Keys with Vault
The prototype is ready — now it’s time to get serious about security. We’ll show you how to store your OpenAI keys (and other secrets) in HashiCorp Vault and access them safely inside your AI stack.
As a Fractional CTO, I help teams design efficient, scalable systems — without over-engineering.
📩 Let’s talk If you want to rethink your architecture without overengineering it, my DMs are open.
The AI Lab Series teaser: https://www.garudax.id/posts/alexandrechatton_most-ai-talks-sound-smart-but-produce-activity-7360614211015577600-Bx8H