Embeddings and search (RAG)

An embedding represents a piece of text as a vector — a list of numbers. Texts that mean similar things get similar vectors, even when they share no words. That’s the basis of semantic search: you match on meaning, not on keywords. And it’s the backbone of RAG (retrieval-augmented generation), where a model answers using your own documents. Embeddings on RuAPI run through the same OpenAI-compatible endpoint as chat: https://www.ruapi.ai/v1, the same sk-... key. The only thing to change in your code is base_url.

No key yet? Start with the Quickstart. From here on we assume you already have an sk-... key.

Your first vector

Python (OpenAI SDK)

from openai import OpenAI

client = OpenAI(
    api_key="sk-YOUR_KEY",
    base_url="https://www.ruapi.ai/v1",
)

resp = client.embeddings.create(
    model="text-embedding-3-small",
    input="Moscow is the capital of Russia",
)

vector = resp.data[0].embedding
print(len(vector))   # vector dimension
print(vector[:5])    # first five numbers

resp.data is a list: one input text yields one item with an .embedding field. text-embedding-3-small returns a 1536-number vector. Need higher quality? Use text-embedding-3-large (3072 numbers per vector). It’s more accurate on hard text but costs more and runs slower. For most tasks -small is plenty.

Several texts at once

Pass a list to input and you get one vector per item in a single request — faster and cheaper than sending texts one by one.

resp = client.embeddings.create(
    model="text-embedding-3-small",
    input=[
        "How do I top up my balance?",
        "The minimum top-up is 10 USDT.",
        "The cat is sleeping on the windowsill.",
    ],
)

vectors = [item.embedding for item in resp.data]

The order of the results matches the order of the inputs.

End-to-end RAG in five steps

The idea is simple: ahead of time, split your documents into chunks and compute a vector for each. When a question comes in, find the closest chunks and hand them to the chat model as context.

1. Split text into chunks. Chunks that are too long blur the meaning; too short and they lose context. Paragraphs or 200-500-word windows usually work well.

docs = [
    "RuAPI accepts payment in USDT. Minimum is 10 USDT.",
    "Embeddings and chat use one key and one base_url.",
    "The base URL for OpenAI-compatible clients is https://www.ruapi.ai/v1",
]
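
Real documents are usually longer than a few sentences, so you need some way to cut them into windows like these. Here is a minimal sketch that splits on whitespace into fixed-size word windows. The helper name chunk_words and the overlap between neighbouring windows are our own assumptions, not part of RuAPI or the OpenAI SDK. The 300-word default simply sits inside the 200-500 range mentioned above.

```python
def chunk_words(text, size=300, overlap=50):
    """Split text into windows of `size` words; neighbours share `overlap` words."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return chunks

# Tiny demo with a 4-word window and 1 word of overlap.
sample = "one two three four five six seven eight nine ten"
small_chunks = chunk_words(sample, size=4, overlap=1)
print(small_chunks)
```

Overlap is a design choice: it keeps a sentence that straddles a boundary visible in both windows, at the cost of a few extra vectors. Splitting on paragraphs instead is just as reasonable if your text has clean paragraph breaks.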

2. Compute vectors and store them. For a small set, a plain in-memory list is fine. For production, use a vector store — FAISS, pgvector, or Chroma; they search across millions of vectors quickly.

import numpy as np

emb = client.embeddings.create(model="text-embedding-3-small", input=docs)
index = np.array([item.embedding for item in emb.data])

3. Embed the query — with the same model you used for the documents (this matters, see below).

question = "What's the minimum I need to top up?"
q = np.array(
    client.embeddings.create(
        model="text-embedding-3-small", input=question
    ).data[0].embedding
)

4. Find the closest chunks by cosine similarity. Cosine similarity is the cosine of the angle between two vectors: a score near 1 means the texts are nearly identical in meaning, a score near 0 means they are unrelated.

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

scores = [cosine(q, v) for v in index]
top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:2]
context = "\n".join(docs[i] for i in top)
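
Once the index grows past a handful of rows, the same ranking can be done with one matrix product instead of a Python loop. The following is a sketch with toy 2-dimensional vectors standing in for real embeddings. The function name top_k is hypothetical, and the toy numbers are made up for illustration only.

```python
import numpy as np

def top_k(query, index, k=2):
    """Return (row, score) pairs for the k index rows closest to `query` by cosine."""
    index = np.asarray(index, dtype=float)
    query = np.asarray(query, dtype=float)
    # After scaling to unit length, a dot product equals cosine similarity.
    index_unit = index / np.linalg.norm(index, axis=1, keepdims=True)
    query_unit = query / np.linalg.norm(query)
    scores = index_unit @ query_unit
    best = np.argsort(scores)[::-1][:k]  # highest score first
    return [(int(i), float(scores[i])) for i in best]

# Toy data: row 0 points almost the same way as the query, row 1 is nearly orthogonal.
toy_index = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
toy_hits = top_k(np.array([2.0, 0.1]), toy_index, k=2)
print(toy_hits)
```

If you normalize the index rows once when you build it, each query only costs one matrix-vector product.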

5. Feed the retrieved chunks to a chat model. Put the best chunks into the system prompt as context, and the model answers from your data instead of guessing.

answer = client.chat.completions.create(
    model="claude-opus-4-8",
    messages=[
        {"role": "system", "content": f"Answer only from this context:\n{context}"},
        {"role": "user", "content": question},
    ],
)

print(answer.choices[0].message.content)

You build the index from step 2 once and reuse it — there’s no need to re-embed everything on every query.
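
One way to reuse the index between runs is to write it to disk. The sketch below assumes a small index that fits in memory and uses NumPy's np.save and np.load plus a JSON file for the chunk texts. The file names and the save_index / load_index helpers are hypothetical. A vector store such as FAISS, pgvector, or Chroma handles persistence for you.

```python
import json
import os
import tempfile

import numpy as np

def save_index(folder, vectors, chunks, model):
    """Write vectors to vectors.npy and the chunks plus model name to meta.json."""
    np.save(os.path.join(folder, "vectors.npy"), np.asarray(vectors))
    with open(os.path.join(folder, "meta.json"), "w", encoding="utf-8") as f:
        json.dump({"model": model, "chunks": chunks}, f, ensure_ascii=False)

def load_index(folder):
    """Read back what save_index wrote; returns (vectors, chunks, model)."""
    vectors = np.load(os.path.join(folder, "vectors.npy"))
    with open(os.path.join(folder, "meta.json"), encoding="utf-8") as f:
        meta = json.load(f)
    return vectors, meta["chunks"], meta["model"]

# Round trip with toy numbers standing in for real embeddings.
with tempfile.TemporaryDirectory() as folder:
    save_index(folder, [[0.1, 0.2], [0.3, 0.4]], ["chunk A", "chunk B"],
               "text-embedding-3-small")
    loaded_vectors, loaded_chunks, loaded_model = load_index(folder)

print(loaded_vectors.shape, loaded_chunks, loaded_model)
```

Storing the model name next to the vectors makes it easy to tell later which model the query has to be embedded with.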

Gotchas

Same model for the index and the query

Vectors from text-embedding-3-small and text-embedding-3-large aren’t comparable — different dimensions, different meaning space. Embed your query with whatever model built the index. Switch models and you have to rebuild the whole index.
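
A cheap guard can catch this mistake before it produces nonsense scores. The sketch below assumes you keep track of which model built the index. The function check_compatible is hypothetical, and zero-filled arrays stand in for real vectors.

```python
import numpy as np

def check_compatible(query_vector, index, query_model, index_model):
    """Raise ValueError if the query cannot be compared against the index."""
    if query_model != index_model:
        raise ValueError(
            f"index was built with {index_model!r}, query uses {query_model!r}"
        )
    if query_vector.shape[0] != index.shape[1]:
        raise ValueError(
            f"query has {query_vector.shape[0]} dimensions, index has {index.shape[1]}"
        )

small_index = np.zeros((3, 1536))  # stand-in for an index built with -small

# Same model, same dimension: passes silently.
check_compatible(np.zeros(1536), small_index,
                 "text-embedding-3-small", "text-embedding-3-small")

# Query embedded with -large against a -small index: rejected.
try:
    check_compatible(np.zeros(3072), small_index,
                     "text-embedding-3-large", "text-embedding-3-small")
    mismatch_caught = False
except ValueError as err:
    mismatch_caught = True
    print(err)
```

The model-name check matters even when dimensions happen to match, because two models with the same vector length still place texts in different spaces.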

Batch via a list in input

Sending texts one at a time is slow and burns more requests. Pass a list to input (usually up to a few hundred strings at once) — results come back in the same order.
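
For a corpus larger than one request can carry, a small loop can slice the texts into batches and stitch the results back together in order. This is a sketch: embed_in_batches and the default batch_size of 100 are assumptions, not documented limits, and fake_embed_batch replaces the real API call so the example runs offline.

```python
def embed_in_batches(texts, embed_batch, batch_size=100):
    """Embed `texts` in slices of `batch_size`, keeping the original order."""
    vectors = []
    for start in range(0, len(texts), batch_size):
        vectors.extend(embed_batch(texts[start:start + batch_size]))
    return vectors

# Offline stand-in: returns one fake 1-number vector per text.
# With the real client, embed_batch could be something like:
#   lambda batch: [item.embedding for item in client.embeddings.create(
#       model="text-embedding-3-small", input=batch).data]
def fake_embed_batch(batch):
    return [[float(len(text))] for text in batch]

demo_vectors = embed_in_batches(
    ["a", "bb", "ccc", "dddd", "eeeee"], fake_embed_batch, batch_size=2
)
print(demo_vectors)
```

Passing the embedding call in as a function keeps the batching logic easy to test without spending requests.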

-small vs -large: dimensions and cost

text-embedding-3-small is 1536 numbers per vector — cheaper and faster, good for most tasks. text-embedding-3-large is 3072 numbers — more accurate on complex, long text, but costs more and takes twice the storage. Start with -small and move to -large only if search quality falls short.
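
To put numbers on the storage difference, here is a rough estimate for raw vectors only. It assumes each number is stored as a 4-byte float32 and ignores any overhead a vector store adds. Note that np.array built from Python floats defaults to float64, which doubles these figures. The helper index_size_bytes is hypothetical.

```python
def index_size_bytes(n_vectors, dims, bytes_per_number=4):
    """Raw size of an index: vectors x dimensions x bytes per number."""
    return n_vectors * dims * bytes_per_number

small_bytes = index_size_bytes(1_000_000, 1536)  # text-embedding-3-small
large_bytes = index_size_bytes(1_000_000, 3072)  # text-embedding-3-large

print(f"-small: {small_bytes / 1e9:.3f} GB")
print(f"-large: {large_bytes / 1e9:.3f} GB")
```

Under these assumptions, a million chunks take about 6.1 GB with -small and about 12.3 GB with -large.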

What’s next

Quickstart — sign up, get a key, make your first request
GPT models — choosing the chat model that answers from your retrieved context
API reference — endpoints, base URLs and request format
LangChain — ready-made chains and RAG on top of RuAPI without hand-rolling code
