Simplifying Local Deployment of Large Language Models using Ollama
Over the past two years, generative AI has driven transformative changes across industries, reshaping how businesses innovate and operate. As advancements in this field continue to accelerate, investment in generative AI technologies is expected to grow significantly in the coming years. However, the software development lifecycle (SDLC) for AI model testing often relies heavily on external services, introducing challenges related to cost, dependency, and privacy.
In this article, I will explore how Ollama's capability for local deployment can streamline the development process, reduce external dependencies, and enhance efficiency in AI model testing and application development.
Ollama is an open-source tool designed to simplify the deployment and management of large language models (LLMs) on local hardware. It addresses key challenges developers and researchers face when using LLMs, such as cost, complexity, and privacy, while enhancing flexibility and performance.
Addressing Challenges in LLM Deployment
Why Ollama
Use Cases
Download and Set Up Ollama
System Requirements:
OS: macOS/Linux/Windows
Storage: Minimum 10 GB of free space
Processor: Modern CPU
To download and set up Ollama, visit the Ollama website (https://ollama.com/) and locate the download section. Select the appropriate option based on your operating system (macOS, Windows, or Linux). For macOS and Windows, download and unzip the file, then install it as a standalone application. On Linux, you may need to run a terminal command to complete the installation. Once installed, launch the application and proceed with the setup wizard, which includes installing the command-line interface. Using the CLI, you can run a model such as Llama 3.2 by executing a simple command like ollama run llama3.2. The setup provides a shell for interacting with the model, allowing you to ask questions, generate text, and explore features like help commands and session management. The process emphasizes simplicity and the ability to manage multiple locally installed models while maintaining privacy and control.
Model Management
1. List installed models: ollama list. Displays all installed models with their size and modification dates. By default, models are stored in the user's home directory under ~/.ollama/models.
2. Remove a model: ollama rm <model-name>. Deletes a specific model.
3. Pull a new model: ollama pull <model-name>. Downloads a model from Ollama's library.
4. Show command help: ollama help. Lists all available commands and their descriptions.
5. Run a model: ollama run <model-name>. Starts the specified model for interaction.
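These commands also have programmatic counterparts. As a rough illustration (not taken from the article's repository), the official ollama Python package can perform the same operations; exact response fields may vary between package versions.
# Illustrative only: programmatic equivalents of the CLI commands above,
# using the official ollama Python package (pip install ollama).
import ollama

# Equivalent of `ollama pull <model-name>`
ollama.pull("llama3.2")

# Equivalent of `ollama list`
print(ollama.list())

# Equivalent of a single prompt in `ollama run <model-name>`
response = ollama.chat(
    model="llama3.2",
    messages=[{"role": "user", "content": "Hello, Ollama!"}],
)
print(response["message"]["content"])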
Function Calling with Large Language Models
Large language models (LLMs) are powerful but limited by their training data, which can lead to issues like hallucination (fabricated or inaccurate outputs). To mitigate these limitations and enhance their capabilities, tool or function calling can be used. This involves integrating external tools or functions with the LLM, allowing it to:
· Access up-to-date or external information beyond its training data.
· Call external APIs or services and incorporate their results into its responses.
· Perform actions, such as lookups or calculations, that the model cannot do reliably on its own.
Function Calling Example
To demonstrate the use of Ollama with a RAG-style architecture, I have written a small program that enhances the LLM output with a few API calls. The flow is: the program reads a list of grocery items from a file, asks the LLM to categorize them, and then, based on each category, calls different APIs for additional data. The code is available in my GitHub repository (https://github.com/jaysara/ollama-function-calling).
This approach demonstrates the utility of tool integration to expand LLM applications and improve their accuracy and functionality.
Function calling with Ollama integrates tools or functions into large language model workflows, enabling advanced operations such as categorizing items, fetching data, and processing results. Here's how it works:
Workflow Breakdown
1. Setting Up Tools
# Define the functions (tools) for the model
tools = [
    {
        "type": "function",
        "function": {
            "name": "fetch_price_and_nutrition",
            "description": "Fetch price and nutrition data for a grocery item",
            "parameters": {
                "type": "object",
                "properties": {
                    "item": {
                        "type": "string",
                        "description": "The name of the grocery item",
                    },
                },
                "required": ["item"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "fetch_recipe",
            "description": "Fetch a recipe based on a category",
            "parameters": {
                "type": "object",
                "properties": {
                    "category": {
                        "type": "string",
                        "description": "The category of food (e.g., Produce, Dairy)",
                    },
                },
                "required": ["category"],
            },
        },
    },
]
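The two tool names above must map to real Python functions in the program. The repository's implementations are not reproduced in this article, so the following is only an illustrative stand-in that returns hard-coded sample data (async, to match how the functions are awaited later).
# Illustrative stand-ins for the two tools; the actual implementations in the
# repository may call real APIs rather than returning sample data.
async def fetch_price_and_nutrition(item):
    # Hypothetical static data for the given grocery item
    return {"item": item, "price": "$2.99", "calories": "95 kcal"}

async def fetch_recipe(category):
    # Hypothetical placeholder recipe for the given category
    return {"category": category, "recipe": f"A simple {category} recipe."}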
2. Categorizing Items
· Prepare the Prompt: A detailed prompt specifies the task (e.g., categorizing grocery items) and the desired format (e.g., JSON with categories as keys and items as values).
categorize_prompt = f"""
You are an assistant that categorizes grocery items.
**Instructions:**
- Return the result **only** as a valid JSON object.
- Do **not** include any explanations, greetings, or additional text.
- Use double quotes (`"`) for all strings.
- Ensure the JSON is properly formatted.
- The JSON should have categories as keys and lists of items as values.
**Example Format:**
{{
"Produce": ["Apples", "Bananas"],
"Dairy": ["Milk", "Cheese"]
}}
**Grocery Items:**
{', '.join(grocery_items)}
"""
· API Call 1: Send the categorization prompt to the model and receive its JSON response (a sketch follows this list).
· Parse and Process Output: Extract and validate the JSON response containing categorized items.
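A minimal sketch of this first call, assuming the asynchronous ollama Python client and a locally installed llama3.2 model (both assumptions; the repository may use different names), run inside an async function:
# Sketch of API Call 1 (assumed client and model name; adjust to your setup)
import json
import ollama

client = ollama.AsyncClient()

response = await client.chat(
    model="llama3.2",
    messages=[{"role": "user", "content": categorize_prompt}],
)

# Extract and validate the JSON object returned by the model
categorized_items = json.loads(response["message"]["content"])
print(categorized_items)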
3. Fetching Additional Data (e.g., Price & Nutrition Info)
· Prepare Fetch Prompt: A new prompt instructs the model to use the tools to fetch specific information for each grocery item.
# Construct a message to instruct the model to fetch data for each item
# We'll ask the model to decide which items to fetch data for by using function calling
fetch_prompt = """
For each item in the grocery list, use the 'fetch_price_and_nutrition' function to get its price and nutrition data.
"""
· API Call 2: Send the fetch prompt along with the tool definitions so the model can decide which functions to call (a sketch follows).
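A minimal sketch of the second call, under the same assumptions; the key difference from the first call is that the tools list defined earlier is passed in, and the grocery list is included in the conversation (exactly how the list is threaded in is an assumption here).
# Sketch of API Call 2 (assumed client and model name; adjust to your setup)
messages = [
    {
        "role": "user",
        "content": fetch_prompt + "\nGrocery list: " + ", ".join(grocery_items),
    }
]

response = await client.chat(
    model="llama3.2",
    messages=messages,
    tools=tools,  # the tool definitions from step 1
)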
4. Processing Function Calls
· Detect Tool Calls: Check whether the model's response contains tool_calls.
· Invoke the Tools: Look up each requested function, execute it with the supplied arguments, and append the result to the conversation.
# Process function calls made by the model
if response["message"].get("tool_calls"):
    print("Function calls made by the model:")
    available_functions = {
        "fetch_price_and_nutrition": fetch_price_and_nutrition,
    }
    # Store the details for later use
    item_details = []
    for tool_call in response["message"]["tool_calls"]:
        function_name = tool_call["function"]["name"]
        arguments = tool_call["function"]["arguments"]
        function_to_call = available_functions.get(function_name)
        if function_to_call:
            result = await function_to_call(**arguments)
            # Add function response to the conversation
            messages.append(
                {
                    "role": "tool",
                    "content": json.dumps(result),
                }
            )
            item_details.append(result)
    print(item_details)
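Once the tool results have been appended to the conversation, a final call typically asks the model to produce a consolidated answer from them. That step is not shown in the listing above, but under the same assumptions it could look like this:
# Sketch of a final follow-up call that lets the model summarize the tool results
final_response = await client.chat(
    model="llama3.2",
    messages=messages,
)
print(final_response["message"]["content"])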
Conclusion
Ollama democratizes access to advanced AI technologies by making LLMs more accessible, affordable, and secure. Its focus on local execution, unified management, and user-friendly design empowers developers, researchers, and organizations to harness the power of LLMs without traditional limitations. This open-source tool is a game-changer for anyone looking to leverage AI while maintaining control over their data and costs.
In this article, we explored the potential of Ollama to revolutionize local large language model (LLM) applications. By leveraging Ollama's CLI, REST API, and Python SDK, you can build and customize robust applications without depending on external services, avoiding per-request API costs and keeping your data private. This allows you to create full-fledged LLM applications locally with customizable workflows, adapting models to your unique requirements.
Armed with these skills, you can now innovate further. Extend the project explored here, design new applications, and continue learning to maximize the potential of Ollama models. For additional resources, visit Ollama's GitHub repository. Keep building and let your creativity define the future of local AI applications!
Reference:
· GitHub example of function calling: https://github.com/jaysara/ollama-function-calling/tree/main