
Building generative AI apps locally

Alejandro Duarte

As application developers, we often want to run everything locally, right on our development machines: databases and, more recently, AI models like embedders and LLMs. We know we can run a local database like PostgreSQL or MariaDB easily, even on a tiny Raspberry Pi Zero. An AI model, however, consumes substantially more resources. Yet we still want to run it on our own machine, even if it’s painfully slow. It’s just so satisfying.

Also, when we run something locally, we get first-hand experience using the technology and uncover some of its obscure bits. Then we get that “aha!” moment when we finally discover it’s not as obscure as it seemed when we merely consumed it from a cloud service. There are also the obvious benefits of cost savings, security, and data privacy when we run services locally.

In this article we’ll explore a basic setup for running AI embedding models locally and storing the generated vectors in MariaDB. We’ll also cover some basic terminology for those who want to catch up with generative AI application development.

What is an embedder?
#

An embedding model (or embedder) is a type of AI model. It converts text, images, or sound data into numbers. More precisely, it converts data into vectors. A vector is a sequence of numbers. For example:

[0.134, -0.527, 0.762, -0.243, ... more numbers ... , 0.418]

A vector captures the semantic meaning of a certain piece of data. In this context, we call it a vector embedding or simply an embedding. It’s always associated with data: maybe the original data used to compute the vector, or maybe an ID that points to related data in a database. When we calculate an embedding (using an embedder), we can store it in a database to keep a numerical representation of the data. This is useful later for personalized recommendations, AI-enhanced search, and RAG applications in general (Retrieval-Augmented Generation, more on this later).

MariaDB includes the VECTOR datatype to store and manipulate vector embeddings. To illustrate all this, let’s say we have a products table like the following:

CREATE TABLE products(
	id INT PRIMARY KEY AUTO_INCREMENT,
	name VARCHAR(200) UNIQUE NOT NULL,
	description VARCHAR(500)
);
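
To have something to play with, we can insert a couple of made-up products (the names and descriptions are purely illustrative):

INSERT INTO products(name, description) VALUES
    ('Trail running shoes', 'Lightweight shoes with an aggressive grip for off-road runs'),
    ('Thermal running jacket', 'Windproof jacket that keeps you warm on cold-weather runs');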

Let’s also say we want to suggest related products on each product’s page. First, we need to store a vector for each product. We can use the same table or a different one. Creating a new table for this purpose is a good practice, for example:

CREATE TABLE product_embeddings(
	product_id INT,
	embedding_llama VECTOR(2048) NOT NULL,
	FOREIGN KEY (product_id) REFERENCES products(id)
);

With this, a product can have an associated embedding (embedding_llama). We use the name of the embedder in the table or column name; in this article, we’ll use Llama. This is also a good practice, since embeddings produced by different embedders are not compatible and we cannot compare them—something we’ll want to do later. 2048 is the number of dimensions (numbers) that the embedding has. We have to specify the exact number of dimensions that our embedder generates.

Generating embeddings
#

Before we set up our local services, let’s see how we generate and use these vector embeddings. Some pseudo-code should clear things up:

product_details = "This is data about some product. Maybe the name + description as a string."
embedding_to_store = call_embedder(product_details)
store_vector_in_database(embedding_to_store)
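
On the database side, store_vector_in_database could boil down to a simple INSERT. Here’s a sketch that assumes we pass the product’s id plus the vector as text and convert it with VEC_FromText (the ? placeholders are bound by the driver or persistence framework):

INSERT INTO product_embeddings(product_id, embedding_llama)
VALUES (?, VEC_FromText(?));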

We’ll do this for each product. Later, we can suggest similar products with something like the following:

data="Product name and description for which we want to find similar products"
embedding_to_compare = call_embedder(data)
products = get_closest_vectors_from_database(embedding_to_compare)
show_similar_products(products)

Here, get_closest_vectors_from_database(embedding_to_compare) would run a SQL query like the following:

SELECT p.id, p.name
FROM products p
JOIN product_embeddings e ON e.product_id = p.id
ORDER BY VEC_DISTANCE_COSINE(e.embedding_llama, ?)
LIMIT 5

Typically, our persistence framework replaces ? with an actual value. For example, we can pass the value stored in the embedding_to_compare variable to our persistence framework. We can pass the value directly in binary format (for example, byte[] in Java) or as plain text (in JSON array format), in which case we should use MariaDB’s VEC_FromText function in the previous SELECT.
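
As an illustration, here’s a minimal JDBC sketch of that query. It assumes the MariaDB Connector/J driver is on the classpath, the credentials from the Docker Compose setup shown later, and a hypothetical embeddingAsJsonArray string holding the vector in JSON array format:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class SimilarProducts {

    public static void main(String[] args) throws Exception {
        // Hypothetical embedding in JSON array text format, e.g. "[0.134, -0.527, ...]".
        // It must have the same number of dimensions as the VECTOR(2048) column.
        String embeddingAsJsonArray = args[0];

        String sql = """
                SELECT p.id, p.name
                FROM products p
                JOIN product_embeddings e ON e.product_id = p.id
                ORDER BY VEC_DISTANCE_COSINE(e.embedding_llama, VEC_FromText(?))
                LIMIT 5
                """;

        try (Connection connection = DriverManager.getConnection(
                "jdbc:mariadb://localhost:3306/demo", "root", "password");
             PreparedStatement statement = connection.prepareStatement(sql)) {
            statement.setString(1, embeddingAsJsonArray);
            try (ResultSet resultSet = statement.executeQuery()) {
                while (resultSet.next()) {
                    System.out.println(resultSet.getInt("id") + ": " + resultSet.getString("name"));
                }
            }
        }
    }
}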

VEC_DISTANCE_COSINE is the distance function. MariaDB has several vector functions; we are using the cosine version, which works well for similarity search over text. Notice that we are ordering the results by the output of this function, meaning that the first rows in the result set have vectors that are closest to the one we just calculated (embedding_to_compare). And we are only interested in the top 5. Since “closest” means “most similar”, we are getting the 5 vectors closest to the target and hence the 5 most similar products.

Keep in mind that we might have millions or even billions of products, in which case we’ll need a vector index. With an index, the search becomes approximate (but much faster). There are techniques like hybrid search to include exact-match results, but that’s homework for another day.
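
As a sketch, and assuming the vector index syntax introduced in MariaDB 11.7 (check the documentation for the exact requirements and options in your version), we could define the embeddings table with a vector index from the start:

-- Sketch: embeddings table with an approximate-nearest-neighbor index.
-- Syntax assumed from MariaDB 11.7; index options may vary between versions.
CREATE TABLE product_embeddings(
    product_id INT PRIMARY KEY,
    embedding_llama VECTOR(2048) NOT NULL,
    FOREIGN KEY (product_id) REFERENCES products(id),
    VECTOR INDEX (embedding_llama)
);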

Show me the setup!
#

Now that we’ve covered the very basics of embeddings and vector search, let’s set up our local environment to try this out. Here’s a Docker Compose file that starts both a database and an embedder, so you can try this on your computer without having to generate API keys or set system environment variables:

services:
  mariadb:
    image: mariadb:11.7-rc
    container_name: mariadb
    environment:
      MARIADB_ROOT_PASSWORD: password
      MARIADB_DATABASE: demo
    ports:
      - "3306:3306"

  local-ai:
    image: localai/localai:master-ffmpeg-core
    container_name: local-ai
    command: bert-embeddings
    ports:
      - "8080:8080"
    volumes:
      - ./models:/build/models:cached

MariaDB will be listening on port 3306 and the embedder on port 8080. We are using something called LocalAI. There are alternatives like Ollama or vLLM, just to mention two. LocalAI is interesting because it’s a drop-in replacement for the REST API offered by OpenAI. So, when we move to production we only need to change the endpoint (URL) and set the required credentials (API key). We could also deploy a LocalAI-based setup in the cloud or anywhere with more compute power.

If we want to add an LLM (or rather SLM if you are running it on a development machine), we can add additional models from the hundreds available on the LocalAI Gallery. For instance, we can add Microsoft’s Phi-4 model in our Docker Compose file as follows:

...
    command: bert-embeddings phi-4
...

LocalAI will automatically download and run the specified AI models.
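
Since LocalAI mirrors the OpenAI REST API, we can then talk to the model through the usual chat completions endpoint. A quick sketch (the prompt is just an example):

curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "phi-4",
    "messages": [{"role": "user", "content": "Write a one-line description for trail running shoes."}]
  }'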

Invoking the embedder
#

After this, we can use an AI framework such as LangChain4j (Java), LangChain.js (JavaScript), LangChain (Python), or llama.cpp (C++). Alternatively, we can invoke the embeddings and chat endpoints directly. For example, to generate an embedding for the string "women's lightweight running shoes", we can use this curl command:

curl http://127.0.0.1:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "bert-embeddings",
    "input": "women'\''s lightweight running shoes"
  }'

We’ll get a response similar to this:

{
  "object": "embedding",
  "model": "bert-embeddings",
  "data": [
    {
      "embedding": [
        0.1234, -0.5678, 0.9101, -0.1121, 0.3141, -0.5161, 0.7182, -0.9202,
        0.2233, -0.4252, 0.6273, -0.8293, 1.0313, -1.2333, 1.4353, -1.6373,
        0.8393, -0.0412, 0.2432, -0.4452, 0.6472, -0.8492, 1.0512, -1.2532,
        0.9552, -0.1572, 0.3592, -0.5612, 0.7632, -0.9652, 1.1672, -1.3692
		... more numbers ...
      ],
      "index": 0
    }
  ],
  "usage": {
    "prompt_tokens": 7,
    "total_tokens": 7
  }
}

We can extract the part we are interested in using a JSON Path expression like $.data[0].embedding. That’s the value we want to store in the database or pass to the SELECT query to perform a similarity search. AI frameworks do this and more for you. Use them!
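
As an illustration, here’s a minimal Java sketch that calls the embeddings endpoint directly and extracts that field. It assumes a recent JDK (for java.net.http and text blocks) and the Jackson library on the classpath; an AI framework would hide all of this plumbing:

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class EmbeddingsClient {

    public static void main(String[] args) throws Exception {
        // Request body for the OpenAI-compatible embeddings endpoint.
        String body = """
                {"model": "bert-embeddings", "input": "women's lightweight running shoes"}
                """;

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://127.0.0.1:8080/v1/embeddings"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());

        // Equivalent of the $.data[0].embedding JSON Path expression.
        JsonNode embedding = new ObjectMapper()
                .readTree(response.body())
                .at("/data/0/embedding");

        // embedding.toString() is a JSON array string, ready for VEC_FromText.
        System.out.println("Dimensions: " + embedding.size());
    }
}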

Learn more
#

The tools and techniques covered here are useful for developing Retrieval-Augmented Generation (RAG) applications. But that’s also homework for another day.

Enjoyed this post? I can help your team implement similar solutions—contact me to learn more.
