Control 50+ local and proprietary LLMs with one API and web interface (open source)

TLDR: Data protection, vendor lock-in and, in some cases, cost are recurring problems when choosing an LLM provider such as OpenAI or Anthropic. The solution is a middle layer between the various inference servers and the end application: TaskingAI. LLM models from any provider, whether paid and proprietary or free and open source, become visible in a structured way in one interface and can be operated through a single API.

Anyone planning their company's own AI infrastructure wants to be able to offer as many models as possible. Relying only on the Azure OpenAI API is compliant with data protection regulations, but at around USD 15 per 1 million output tokens, OpenAI models make little sense from a scaling perspective. Proprietary models are also less suitable for fine-tuning because of API limitations, especially compared to open-source models combined with a no-code fine-tuning GUI such as H2O.

Replace OpenAI with TaskingAI

Most AI apps currently use the OpenAI APIs because, for a long time, there were few real alternatives. With TaskingAI, genuine independence from a single paid LLM provider becomes possible.

TaskingAI is published 100% open source on GitHub and consists of a backend, a frontend and a Python package. The frontend allows you to add models from any paid LLM provider or from your own inference server, such as an Ollama or LM Studio server running on your own GPU hardware. There are 26 provider options to choose from; each provider is an inference server that can in turn serve many different LLM models, so there is effectively no limit to the number of different LLM models.

Choose from OpenAI, Azure OpenAI, Claude, Ollama, HuggingFace and many more

TaskingAI is OpenAI compatible

Most AI applications already run against the OpenAI API, so TaskingAI offers a drop-in replacement: all LLM responses from the TaskingAI server are converted into an OpenAI-compatible object format. To make a request OpenAI-compatible, only /v1 needs to be added to the URL.

Example for TaskingAI Format:

https://taskingai.server.com/assistants/{assistant_id}/chats/{chat_id}/generate

Example for OpenAI Format:

https://taskingai.server.com/v1/assistants/{assistant_id}/chats/{chat_id}/generate
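
In practice, this means the official OpenAI Python client can simply be pointed at the TaskingAI server. The following is a minimal sketch, assuming such an OpenAI-compatible /v1 endpoint on your own deployment; the host, API key and model id are placeholders for your setup, not values taken from TaskingAI's documentation.

from openai import OpenAI

# Point the unchanged OpenAI client at the TaskingAI server instead of api.openai.com
client = OpenAI(
    base_url="https://taskingai.server.com/v1",   # your TaskingAI host plus /v1
    api_key="YOUR_TASKINGAI_API_KEY",             # key generated in the TaskingAI web interface
)

# The model id is whatever you configured in the TaskingAI frontend (placeholder here)
response = client.chat.completions.create(
    model="YOUR_TASKINGAI_MODEL_ID",
    messages=[{"role": "user", "content": "Hello from a drop-in OpenAI client!"}],
)
print(response.choices[0].message.content)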

TaskingAI on Premise

TaskingAI itself can be set up very easily with Docker. A few environment variables need to be defined in an .env file, and all Docker containers, server and frontend included, can then be started with Docker Compose.
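
Roughly, the steps look like this; the directory, file and flag names follow the TaskingAI repository as I recall it and may differ between releases, so treat this as a sketch rather than the official instructions.

# Clone the repository and switch to its Docker setup
git clone https://github.com/TaskingAI/TaskingAI.git
cd TaskingAI/docker

# Copy the example environment file and adjust ports, passwords and host settings
cp .env.example .env

# Start backend, frontend and supporting containers in the background
docker compose -p taskingai up -d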

I recommend running the TaskingAI server and the frontend on the same GPU server as the local inference servers (e.g. Ollama) so that the latency between Ollama and TaskingAI is minimised.

Important: In this case, Ollama would run as a system process in the background and TaskingAI in Docker containers. Therefore, http://host.docker.internal:11434/ must be specified as the endpoint for the Ollama server in TaskingAI, otherwise an error message will be displayed.

Since TaskingAI itself only forwards the request to the correct inference provider, it would also be possible to use the TaskingAI containers on a server without a GPU. Ollama, on the other hand, should definitely have access to a GPU, because otherwise the response times of a model such as Mistral or Llama 3.1 are too slow and therefore unusable.

TaskingAI as Python Package

To integrate TaskingAI into your own application as a replacement for OpenAI, you first generate an API key via the web interface, which your Python app then uses to authenticate itself to the TaskingAI server.

The key will look something like this: tkRw2hihzZLHcSugqjbKTUW6LefGKaOy

The TaskingAI Python package can then be installed, and the Python app authenticates itself to the TaskingAI server with the generated API key. This allows the models to be integrated into your own app and, above all, combined with other tools such as Crew.AI for pipelines or PhiData for assistants.
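
A minimal sketch of that flow, based on my reading of the TaskingAI Python package; the function and class names (init, chat_completion, SystemMessage, UserMessage) and the response attributes should be checked against the current SDK documentation, and the host and model id are placeholders.

# pip install taskingai
import taskingai
from taskingai.inference import chat_completion, SystemMessage, UserMessage

# Authenticate against your own TaskingAI server with the key from the web interface
taskingai.init(
    api_key="tkRw2hihzZLHcSugqjbKTUW6LefGKaOy",
    host="https://taskingai.server.com",
)

# Ask any model that was configured in the frontend (placeholder model id)
result = chat_completion(
    model_id="YOUR_MODEL_ID",
    messages=[
        SystemMessage("You are a helpful assistant."),
        UserMessage("Summarise the advantages of a unified LLM API in one sentence."),
    ],
)
print(result.message.content)  # attribute names may differ slightly in the current SDK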

What is Retrieval Augmented Generation (RAG)?

A method to search for relevant information in large collections of text and thus obtain context-dependent answers from the LLM. A vector database plays a central role in Retrieval Augmented Generation, as it is used to efficiently search large amounts of text data and find relevant information. Texts are converted into vectors that capture their semantic meaning.

When a query is made, the system also converts it into a vector and searches the vector database for the most similar vectors, determined by the cosine or Euclidean distance between two vectors. These vectors correspond to the texts in the database with the greatest semantic similarity to the text of the user request.

The LLM then receives the original query together with the retrieved relevant texts as additional context. Based on this extended context, it generates a response that addresses the original query while integrating the retrieved information, resulting in a more precise and comprehensive answer that takes into account both the query and the additional context.
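
To make the retrieval step concrete, here is a toy sketch of the similarity search described above; the vectors are made up and stand in for whatever embedding model you configure.

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity: 1.0 means identical direction, values near 0 mean unrelated
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Pretend these vectors came from an embedding model (real vectors have hundreds of dimensions)
doc_vectors = {
    "Invoices are stored for ten years.": np.array([0.9, 0.1, 0.0]),
    "The cafeteria opens at 8 am.": np.array([0.1, 0.8, 0.3]),
}
query_vector = np.array([0.85, 0.15, 0.05])  # embedding of "How long do we keep invoices?"

# The passage with the highest similarity is handed to the LLM as additional context
best_text = max(doc_vectors, key=lambda t: cosine_similarity(doc_vectors[t], query_vector))
print(best_text)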

RAG in Tasking AI through assistants

To create an LLM assistant with a ‘memory’, an embedding model must be selected. This embedding model can come from any of the configured providers, for example an OpenAI text-embedding model or an open-source embedding model served via Ollama. An embedding model is used to convert text data into numerical vectors. These vectors represent the semantic meaning of the text in a multidimensional space. The embedding model therefore ensures that all texts are consistently projected into the same vector space, and it also converts the text of the user request into a vector so that the request can be compared with the vectors stored in the database. This is crucial for an effective similarity search.
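
As a small illustration of what such an embedding model returns, the following queries a locally running Ollama server for an embedding vector; the model name is only an example and assumes you have pulled an embedding model there.

import requests

# Ask the local Ollama server (see the endpoint note above) for an embedding vector.
# "nomic-embed-text" is just an example; any pulled embedding model works.
resp = requests.post(
    "http://localhost:11434/api/embeddings",
    json={"model": "nomic-embed-text", "prompt": "How long do we keep invoices?"},
)
vector = resp.json()["embedding"]
print(len(vector), vector[:5])  # a long list of floats, e.g. 768 dimensions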

Embedding Model

Once an embedding model has been selected, a collection must be created in TaskingAI. A collection is responsible for the structured storage of data within the RAG process. However, it is not a vector database in the conventional sense: a collection is a comprehensive organisational unit for various data sources, whereas a vector database is specifically designed for managing and searching vectors. A collection in TaskingAI can contain metadata, raw texts and their vectorisations.
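
A rough sketch of creating a collection and adding a text record through the Python package; the function names and parameters shown here (create_collection, create_record, the text splitter settings) are assumptions based on the SDK's retrieval module and should be verified against the current documentation.

import taskingai

taskingai.init(api_key="YOUR_API_KEY", host="https://taskingai.server.com")

# Create a collection bound to the embedding model chosen above (placeholder ids)
collection = taskingai.retrieval.create_collection(
    embedding_model_id="YOUR_EMBEDDING_MODEL_ID",
    capacity=1000,  # rough upper bound on the number of stored chunks
)

# Add a raw text; the server chunks and vectorises it with the collection's embedding model
taskingai.retrieval.create_record(
    collection_id=collection.collection_id,
    type="text",
    content="Invoices are stored for ten years.",
    text_splitter={"type": "token", "chunk_size": 200, "chunk_overlap": 20},
)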

Memory

In TaskingAI, "memory" refers to an assistant's ability to store and use the context of conversations to provide coherent and relevant responses. There are three types of memory:

1. Zero memory: Each exchange is handled independently, without context storage. Suitable for isolated queries.

2. Naive memory: Saves all messages of a session in order to retain the complete context. Ideal for short, precise interactions.

3. Message window memory: Focusses on recent, relevant parts of the conversation. Suitable for longer dialogues with a detailed history.
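
Roughly how one of these memory types is selected when creating an assistant via the Python package; the class and parameter names (AssistantMessageWindowMemory and friends) follow my reading of the SDK and may differ in the current release.

import taskingai
from taskingai.assistant.memory import AssistantMessageWindowMemory

taskingai.init(api_key="YOUR_API_KEY", host="https://taskingai.server.com")

# Keep only a sliding window of the recent conversation as context (placeholder limits)
assistant = taskingai.assistant.create_assistant(
    model_id="YOUR_CHAT_MODEL_ID",
    memory=AssistantMessageWindowMemory(max_messages=20, max_tokens=1000),
)
print(assistant.assistant_id)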

Tools

Tools are a way of integrating third-party APIs into the assistant. For example, the assistant could receive a text description from the user, use the Stability AI API or DALL-E to generate an image, and display the image to the user. A real-time web search would also be possible by creating a ‘Programmable Search Engine’ at Google (100 queries per day are free of charge) and searching the internet for suitable sources for each user query, something an open-source model such as Llama or Mistral cannot do natively.
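
As an illustration of such a tool, the following queries Google's Programmable Search Engine via the Custom Search JSON API and collects title, link and snippet for each hit; the API key and engine id are placeholders, and wiring the function into a TaskingAI tool definition is left out.

import requests

def web_search(query: str, api_key: str, engine_id: str, num: int = 3) -> list[dict]:
    # Google Custom Search JSON API; the free tier allows 100 queries per day
    resp = requests.get(
        "https://www.googleapis.com/customsearch/v1",
        params={"key": api_key, "cx": engine_id, "q": query, "num": num},
    )
    resp.raise_for_status()
    return [
        {"title": item["title"], "link": item["link"], "snippet": item["snippet"]}
        for item in resp.json().get("items", [])
    ]

# The snippets can then be passed to the model as additional context for its answer
for hit in web_search("current EU AI Act status", "GOOGLE_API_KEY", "SEARCH_ENGINE_ID"):
    print(hit["title"], "-", hit["link"])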

I hope I was able to give you a good first impression of how TaskingAI can significantly simplify your own local LLM infrastructure. If you now link TaskingAI with customised billing software such as Lago, you could not only supply LLM models to your own apps, but also offer local LLMs from Germany - with extra data protection 😉 - as a service with usage-based billing per token.
