Stop Sending All Your Tools to the LLM

If you're building an AI agent with tools, you're probably making a mistake that's costing you money and hurting response quality. I made it too — until I understood embeddings.

The Problem: Giving the LLM Everything

Imagine you're building an agent with 50 tools — email, calendar, database, file system, Slack, GitHub, and more. The naive approach is to send all 50 tools with every request:

# The naive approach — don't do this
response = llm.call(
    user_input="send an email to John",
    tools=all_50_tools  # sending everything!
)

This seems fine at first. But it creates three real problems.

Problem 1 — Cost

Every tool has a description, parameters, and examples. Sending 50 tools can add 10,000+ tokens to every single request. At scale, this gets expensive fast.
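To check that figure against your own tools, you can estimate the overhead before optimizing anything. The sketch below is a rough approximation: it assumes about 4 characters per token (a common rule of thumb for English text, not a real tokenizer), and the `send_email` schema is a made-up example, not a real API.

```python
import json


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough estimate; real counts depend on the model's tokenizer."""
    return round(len(text) / chars_per_token)


def tool_overhead_tokens(tools: list[dict]) -> int:
    """Estimate the tokens spent on tool definitions sent with one request."""
    return sum(estimate_tokens(json.dumps(tool)) for tool in tools)


# Hypothetical tool schema, for illustration only
example_tool = {
    "name": "send_email",
    "description": "Sends an email to one or more recipients.",
    "parameters": {
        "to": {"type": "string", "description": "Recipient address"},
        "subject": {"type": "string", "description": "Subject line"},
        "body": {"type": "string", "description": "Plain-text message body"},
    },
}

per_tool = tool_overhead_tokens([example_tool])
print(f"~{per_tool} tokens per tool, ~{per_tool * 50} for 50 similar tools")
```

Real schemas with long descriptions and examples will come out larger than this small one. For exact numbers, count with the tokenizer of the model you actually call.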

Problem 2 — Context Pollution

LLMs have a limited context window. Every token spent on irrelevant tool definitions leaves less room for the actual conversation, history, and reasoning, and gives the model more to be distracted by.

Problem 3 — Response Quality Degrades

Research and practice show that LLMs make worse decisions when given too many irrelevant choices. It's the same as asking a human to pick from 50 options versus 5: focus matters.

The Smart Solution: Filter Before You Send

What if, before calling the LLM, you automatically figured out which 3–5 tools are actually relevant to the user's request — and only sent those?

# The smart approach
relevant_tools = filter_relevant_tools(user_input)  # only 3-5 tools

response = llm.call(
    user_input="send an email to John",
    tools=relevant_tools  # lean and focused
)

The LLM now gets exactly what it needs. Nothing more.

The question is: how do you figure out which tools are relevant?

The Wrong Answer: Keyword Matching

Your first instinct might be grep-style keyword matching:

def is_relevant(user_input, tool):
    return tool.name in user_input  # substring match on the tool name

This breaks immediately:

user says: "shoot a message to John"
keyword match for "email" → no match ✗

Users don't speak in tool names. They say "shoot a message", "ping John", "drop a note" — all meaning the same thing. Keyword matching is brittle and frustrating.
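Here is a runnable version of that failure. The `Tool` dataclass is a hypothetical stand-in for whatever tool object your framework uses, and the check is lowercased so that capitalization is not the reason it fails.

```python
from dataclasses import dataclass


@dataclass
class Tool:
    name: str
    description: str


def is_relevant(user_input: str, tool: Tool) -> bool:
    # Same idea as above: is the tool name a substring of the request?
    return tool.name in user_input.lower()


email = Tool(name="email", description="sends email to recipients")

phrases = [
    "send an email to John",    # matches: contains "email"
    "shoot a message to John",  # no match
    "ping John",                # no match
    "drop John a note",         # no match
]
for phrase in phrases:
    print(f"{phrase!r}: {is_relevant(phrase, email)}")
```

Only the first phrase matches, even though all four express the same intent.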

You need something that understands meaning, not just words.

The Right Answer: Embeddings

An embedding model converts text into a list of numbers — called a vector — where similar meanings produce similar numbers.

"shoot a message to John"  →  [0.2, 0.8, 0.1, 0.9, ...]
"send email to recipient"  →  [0.21, 0.79, 0.11, 0.88, ...]  ← very similar!
"query the database"       →  [0.9, 0.1, 0.8, 0.2, ...]      ← very different

This works because the embedding model was pre-trained on billions of sentences and learned that "shoot a message", "send email", and "ping someone" all appear in the same contexts — so they get similar vectors.
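To make "similar numbers" concrete, one common closeness measure is cosine similarity: the dot product of two vectors divided by the product of their lengths. The sketch below applies it to the truncated four-number vectors shown above. Those are illustrative values, not real model output, and real embeddings have hundreds of dimensions.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors copied from the illustration above
message = [0.2, 0.8, 0.1, 0.9]     # "shoot a message to John"
email = [0.21, 0.79, 0.11, 0.88]   # "send email to recipient"
database = [0.9, 0.1, 0.8, 0.2]    # "query the database"

print(cosine(message, email))     # very close to 1
print(cosine(message, database))  # roughly 0.35
```

The first pair points in nearly the same direction, so the score is close to 1. The second pair does not, so the score is much lower.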

No LLM call. No API. Just a small 80 MB model running locally on your machine.

How to Set It Up

Install the library:

pip install sentence-transformers scikit-learn

Load the model — it downloads once and caches locally:

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer('all-MiniLM-L6-v2')  # 80MB, runs locally

The Scoring Step

Once you have vectors, you measure how "close" they are using cosine similarity. Mathematically the score ranges from -1 to 1, but for sentence embeddings like these it usually falls between roughly 0 (unrelated) and 1 (near-identical meaning).

def get_relevance_score(user_input: str, tool_description: str) -> float:
    user_vector = model.encode(user_input)
    tool_vector = model.encode(tool_description)

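    # cosine_similarity expects 2D inputs and returns a 1x1 matrix here
    score = cosine_similarity([user_vector], [tool_vector])[0][0]
    return float(score)  # convert the NumPy scalar to a plain Python float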

Let's see it in action (the scores shown are illustrative; exact values depend on the model and library version):

get_relevance_score("shoot a message to John", "sends email to recipients")
# → 0.72  ✓ relevant

get_relevance_score("shoot a message to John", "queries the database")
# → 0.08  ✗ not relevant

get_relevance_score("shoot a message to John", "posts a Slack message")
# → 0.61  ✓ relevant

The model understands meaning. It ranked both email and Slack as relevant to "shoot a message", even though the email description shares no keyword with the request, and it scored the database tool near zero.

Putting It All Together

The complete filter function:

def filter_relevant_tools(user_input: str, all_tools: list, top_k: int = 5) -> list:
    scored_tools = []

    for tool in all_tools:
        score = get_relevance_score(user_input, tool.description)
        scored_tools.append((score, tool))

    scored_tools.sort(key=lambda x: x[0], reverse=True)
    return [tool for score, tool in scored_tools[:top_k]]

And your agent now looks like this:

def run_agent(user_input: str):
    # Step 1 — filter tools locally, no LLM call
    relevant_tools = filter_relevant_tools(user_input, all_tools, top_k=5)

    # Step 2 — send only relevant tools to LLM
    response = llm.call(
        user_input=user_input,
        tools=relevant_tools
    )

    return response

The Full Flow

User: "shoot a message to John"
            ↓
    Embedding model (local, fast)
    converts user input → vector
            ↓
    Compare against all tool vectors
    score each tool 0.0 → 1.0
            ↓
    Pick top 5 by score
    [email_tool, slack_tool, ...]
            ↓
    Send only these 5 to LLM
            ↓
    LLM picks the right tool, focused

No wasted tokens. No distracted LLM. No keyword brittleness.

Results You Can Expect

Metric	Before	After
Tokens per request	~12,000	~2,000
Tool selection accuracy	Degrades with more tools	Stays high
Latency	Higher	Lower
Cost	High	~5x cheaper
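To turn the token counts in the table into money, you can run the arithmetic yourself. In the sketch below, the price and request volume are hypothetical placeholders, so substitute your provider's actual rate. It counts input tokens only.

```python
def monthly_input_cost(
    tokens_per_request: int,
    requests_per_month: int,
    usd_per_million_tokens: float,
) -> float:
    """Input-token cost only; output tokens are billed separately."""
    total_tokens = tokens_per_request * requests_per_month
    return total_tokens * usd_per_million_tokens / 1_000_000


PRICE = 3.00          # hypothetical USD per million input tokens
REQUESTS = 1_000_000  # hypothetical monthly request volume

before = monthly_input_cost(12_000, REQUESTS, PRICE)
after = monthly_input_cost(2_000, REQUESTS, PRICE)
print(f"before: ${before:,.0f}  after: ${after:,.0f}  ratio: {before / after:.1f}x")
# prints: before: $36,000  after: $6,000  ratio: 6.0x
```

The input-token ratio is 6x under these assumptions. The overall saving is somewhat lower because output tokens are unaffected by tool filtering.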

Key Takeaways

Sending all tools to the LLM hurts quality and costs money
Keyword matching is too brittle — users don't speak in tool names
Embeddings understand meaning, not just words
The embedding model runs locally — no API call, milliseconds of latency
Score your tools, pick the top 5, then call the LLM with a focused context

The embedding model does the heavy lifting of understanding language. Your LLM gets to focus on what it does best — reasoning and decision making.