Simplifying Local Deployment of Large Language Models using Ollama

Over the past two years, generative AI has driven transformative changes across industries, reshaping how businesses innovate and operate. As advancements in this field continue to accelerate, investment in generative AI technologies is expected to grow significantly in the coming years. However, the software development lifecycle (SDLC) for AI model testing often relies heavily on external services, introducing challenges related to cost, dependency, and privacy.

In this article, I will explore how Ollama's capability for local deployment can streamline the development process, reduce external dependencies, and enhance efficiency in AI model testing and application development.

Ollama is an open-source tool designed to simplify the deployment and management of large language models (LLMs) on local hardware. It addresses key challenges developers and researchers face when using LLMs, such as cost, complexity, and privacy, while enhancing flexibility and performance.

Addressing Challenges in LLM Deployment

  • Cost Efficiency: By avoiding cloud API expenses, Ollama delivers significant cost savings, particularly for long-term or large-scale deployments.
  • Enhanced Privacy: Local execution ensures sensitive data remains on-premises, a critical requirement for industries like healthcare and finance.
  • Simplified Setup: Ollama automates the technical intricacies of LLM installation and execution, reducing the barriers for non-experts.
  • Latency Reduction: Local processing minimizes delays, making Ollama ideal for real-time applications.
  • Customization: Running models locally enables fine-tuning and adaptations to meet unique requirements without third-party constraints.

Why Ollama

  1. Local Execution: Ollama enables users to run LLMs on their own hardware, eliminating reliance on cloud-based services. This local-first approach reduces costs, ensures data privacy (data never leaves the machine), and eliminates the latency associated with network-based solutions.
  2. Model Management: Ollama is a centralized platform for managing multiple LLMs, allowing seamless switching between models. This functionality supports diverse applications, from development and testing to production environments.
  3. Unified Interface: Interacting with various LLMs is streamlined through a consistent command-line interface (CLI), abstracting technical complexities and making the process more accessible.
  4. Extensibility and Customization: Users can integrate custom models and extensions, tailoring Ollama to specific use cases. Additionally, the platform supports hardware acceleration, including GPU optimization, to maximize performance.

Use Cases

  1. Development and Testing: Ollama allows developers to experiment with multiple LLMs, streamlining the evaluation process for integrating AI into applications.
  2. Education and Research: Its open-source nature and ease of use make Ollama an excellent tool for learning and experimentation without the overhead of cloud service subscriptions.
  3. Secure Applications: Industries requiring stringent data protection benefit from Ollama’s privacy-first approach, ensuring sensitive data never leaves local systems.

Download and Set Up Ollama

System Requirements:

OS: macOS, Linux, or Windows.

Storage: Minimum 10 GB of free space (larger models need more).

Processor: A modern CPU; a supported GPU is optional but speeds up inference.

To download and set up Ollama, visit the Ollama website (https://ollama.com/) and select the download option for your operating system (macOS, Windows, or Linux). For macOS and Windows, download the archive, unzip it, and install it as a standalone application. On Linux, a short terminal command completes the installation. Once installed, launch the application and follow the setup steps, which include installing the command-line interface. Using the CLI, you can run a model such as Llama 3.2 with a simple command like ollama run llama3.2. This opens an interactive shell for the model, where you can ask questions, generate text, and explore features like help commands and session management. The process emphasizes simplicity and the ability to manage multiple locally installed models while maintaining privacy and control.
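On Linux, the whole setup condenses to two commands; the install script below is the one published on the Ollama download page, and llama3.2 is one example model name from the Ollama library:

```shell
# Install Ollama on Linux using the official install script
curl -fsSL https://ollama.com/install.sh | sh

# Download Llama 3.2 (if not already present) and start an interactive session
ollama run llama3.2
```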

Model Management

  1. List installed models: ollama list displays all installed models with their sizes and modification dates. The default storage location for these models is ~/.ollama/models in the user's home directory.
  2. Remove a model: ollama rm <model-name> deletes a specific model.
  3. Pull a new model: ollama pull <model-name> downloads a model from Ollama's library.
  4. Show command help: ollama help lists all available commands and their descriptions.
  5. Run a model: ollama run <model-name> starts the specified model for interaction.
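A typical management session using the commands above might look like this (the model name is illustrative):

```shell
# Download a model from Ollama's library
ollama pull llama3.2

# Show what is installed (names, sizes, and modification dates)
ollama list

# Remove a model that is no longer needed
ollama rm llama3.2

# Display help for all available commands
ollama help
```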

Function Calling with Large Language Models

Large language models (LLMs) are powerful but limited by their training data, which can lead to issues like hallucination (fabricated or inaccurate outputs). To mitigate these limitations and enhance their capabilities, tool or function calling can be used. This involves integrating external tools or functions with the LLM, allowing it to:

  1. Access additional data (e.g., prices, nutrition facts, or recipes).
  2. Perform tasks beyond their inherent capabilities.
  3. Enrich its outputs and reduce hallucination by leveraging external systems or custom functions.
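The pattern is straightforward: the model returns a structured request to call a named function, and the application executes it. A minimal sketch of that dispatch step follows; the tool-call dict mirrors the shape Ollama returns, and fetch_price_and_nutrition is a stand-in stub, not a real API client:

```python
import json

# Stub tool: in a real application this would call an external API
def fetch_price_and_nutrition(item):
    return {"item": item, "price": 1.99, "calories": 95}

# Map tool names to callables, so the model's choice can be dispatched
available_functions = {
    "fetch_price_and_nutrition": fetch_price_and_nutrition,
}

# A tool call in the shape the model emits: a function name plus arguments
tool_call = {"function": {"name": "fetch_price_and_nutrition",
                          "arguments": {"item": "Apples"}}}

# Look up the requested function and invoke it with the model's arguments
function_to_call = available_functions[tool_call["function"]["name"]]
result = function_to_call(**tool_call["function"]["arguments"])
print(json.dumps(result))
```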

Function Calling Example

To demonstrate the use of Ollama with a RAG-style architecture, I have written a small program that enriches the LLM output with data from external API calls. The program reads a list of grocery items from a file and categorizes them using the LLM. Based on these categories, it calls different APIs for additional data. Here is the flow; the code is available in my GitHub repository (https://github.com/jaysara/ollama-function-calling).

  1. Input: Load a grocery list into the LLM. (https://github.com/jaysara/ollama-function-calling/blob/main/data/grocery_list.txt)
  2. Categorization: The model categorizes items on the list.
  3. Function Calls: Fetch price and nutrition data for each item. Retrieve a recipe based on a selected category from the list.
  4. Output: Save or display the enriched information (e.g., categorized items, prices, and a recipe).
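The four steps above can be sketched as a small pipeline. Note that the categorize and fetch functions here are illustrative stubs; in the actual program they are backed by the LLM and external APIs:

```python
# Sketch of the grocery pipeline; categorize/fetch are illustrative stubs,
# not the repository's actual implementations.

def load_grocery_list(lines):
    # Step 1: Input - one item per line, blanks ignored
    return [line.strip() for line in lines if line.strip()]

def categorize(items):
    # Step 2: Categorization - in the real program the LLM does this
    produce = {"Apples", "Bananas"}
    return {"Produce": [i for i in items if i in produce],
            "Dairy": [i for i in items if i not in produce]}

def fetch_price_and_nutrition(item):
    # Step 3: Function call - stand-in for an external API
    return {"item": item, "price": 1.99, "calories": 95}

items = load_grocery_list(["Apples", "", "Milk"])
categories = categorize(items)
details = [fetch_price_and_nutrition(i) for i in items]

# Step 4: Output - display the enriched information
print(categories)
print(details)
```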

This approach demonstrates the utility of tool integration to expand LLM applications and improve their accuracy and functionality.


Function calling with Ollama integrates tools or functions into large language model workflows, enabling advanced operations such as categorizing items, fetching data, and processing results. Here's how it works:


Workflow Breakdown

1. Setting Up Tools

  • Define a list of tool schemas describing the functions the model can call.
  • This list serves as the "toolbox" for the large language model.

# Define the functions (tools) for the model
tools = [
    {
        "type": "function",
        "function": {
            "name": "fetch_price_and_nutrition",
            "description": "Fetch price and nutrition data for a grocery item",
            "parameters": {
                "type": "object",
                "properties": {
                    "item": {
                        "type": "string",
                        "description": "The name of the grocery item",
                    },
                },
                "required": ["item"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "fetch_recipe",
            "description": "Fetch a recipe based on a category",
            "parameters": {
                "type": "object",
                "properties": {
                    "category": {
                        "type": "string",
                        "description": "The category of food (e.g., Produce, Dairy)",
                    },
                },
                "required": ["category"],
            },
        },
    },
]

2. Categorizing Items

Prepare the Prompt: A detailed prompt specifies the task (e.g., categorizing grocery items) and the desired format (e.g., JSON with categories as keys and items as values).

categorize_prompt = f"""
You are an assistant that categorizes grocery items.

**Instructions:**

- Return the result **only** as a valid JSON object.
- Do **not** include any explanations, greetings, or additional text.
- Use double quotes (`"`) for all strings.
- Ensure the JSON is properly formatted.
- The JSON should have categories as keys and lists of items as values.

**Example Format:**

{{
  "Produce": ["Apples", "Bananas"],
  "Dairy": ["Milk", "Cheese"]
}}

**Grocery Items:**

{', '.join(grocery_items)}
"""         

API Call 1:

  • Send the prompt and tools to the model via an API call.
  • Append the model's response (categorized items) to a messages list for maintaining conversation history.

Parse and Process Output: Extract and validate the JSON response containing categorized items.
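Assuming the program uses the ollama Python package, API call 1 and the parsing step look roughly like this. The chat call is wrapped in a function so the JSON-extraction helper can be exercised on its own; the model name and sample data are illustrative:

```python
import json

def parse_categories(raw):
    # Extract and validate the JSON object returned by the model;
    # raises ValueError if the response is not the expected shape
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object with categories as keys")
    return data

def categorize(categorize_prompt, tools):
    # API call 1: send the prompt and tools to a locally running model.
    # Requires the ollama package and a running Ollama server.
    import ollama
    messages = [{"role": "user", "content": categorize_prompt}]
    response = ollama.chat(model="llama3.2", messages=messages, tools=tools)
    messages.append(response["message"])  # keep conversation history
    return parse_categories(response["message"]["content"]), messages

# The parsing helper can be exercised without a model:
sample = '{"Produce": ["Apples", "Bananas"], "Dairy": ["Milk"]}'
print(parse_categories(sample))
```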


3. Fetching Additional Data (e.g., Price & Nutrition Info)

Prepare the Fetch Prompt: A new prompt instructs the model to use the tools to fetch specific information for each grocery item.

# Construct a message to instruct the model to fetch data for each item.
# We'll ask the model to decide which items to fetch data for via function calling.
fetch_prompt = """
For each item in the grocery list, use the 'fetch_price_and_nutrition' function to get its price and nutrition data.
"""

API Call 2:

  • Send the updated messages and tool list to the model.
  • The model decides which tool to call based on the context of the prompt and tasks.


4. Processing Function Calls

Detect Tool Calls:

  • Check the model's response for "tool calls" that specify which tool to invoke and what arguments to pass.

Invoke the Tools:

  • Use the dictionary of available functions to call the appropriate tool with the required arguments.
  • Append the results (e.g., price and nutrition details) to the messages list and to a details list.

# Process function calls made by the model
if response["message"].get("tool_calls"):
    print("Function calls made by the model:")
    available_functions = {
        "fetch_price_and_nutrition": fetch_price_and_nutrition,
    }
    # Store the details for later use
    item_details = []
    for tool_call in response["message"]["tool_calls"]:
        function_name = tool_call["function"]["name"]
        arguments = tool_call["function"]["arguments"]
        function_to_call = available_functions.get(function_name)
        if function_to_call:
            result = await function_to_call(**arguments)
            # Add the function response to the conversation history
            messages.append(
                {
                    "role": "tool",
                    "content": json.dumps(result),
                }
            )
            item_details.append(result)

    print(item_details)


Conclusion

Ollama democratizes access to advanced AI technologies by making LLMs more accessible, affordable, and secure. Its focus on local execution, unified management, and user-friendly design empowers developers, researchers, and organizations to harness the power of LLMs without traditional limitations. This open-source tool is a game-changer for anyone looking to leverage AI while maintaining control over their data and costs.

In this article, we explored the potential of Ollama to transform local large language model (LLM) applications. By leveraging Ollama's CLI, REST API, and Python SDK, you can build robust applications without dependency on external services, avoiding per-request API costs and keeping data entirely private. This allows you to create full-fledged LLM applications locally with customizable workflows, empowering you to adapt models to unique requirements.

Armed with these skills, you can now innovate further. Extend the project explored here, design new applications, and continue learning to maximize the potential of Ollama models. For additional resources, visit Ollama's GitHub repository. Keep building and let your creativity define the future of local AI applications!

References:

  • GitHub example of function calling: https://github.com/jaysara/ollama-function-calling/tree/main
  • Ollama's GitHub repository: https://github.com/ollama/ollama
