Stop Sending All Your Tools to the LLM
If you're building an AI agent with tools, you're probably making a mistake that's costing you money and hurting response quality. I made it too — until I understood embeddings.
The Problem: Giving the LLM Everything
Imagine you're building an agent with 50 tools — email, calendar, database, file system, Slack, GitHub, and more. The naive approach is to send all 50 tools with every request:
# The naive approach: don't do this
response = llm.call(
    user_input="send an email to John",
    tools=all_50_tools,  # sending everything!
)
This seems fine at first. But it creates three real problems.
Problem 1: Cost
Every tool has a description, parameters, and examples. Sending 50 tools can add 10,000+ tokens to every single request. At scale, this gets expensive fast.
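To get a feel for that overhead, a rough sketch like the one below can estimate the token footprint of a tool list. The tool schema, the `estimate_tokens` and `tools_overhead` helpers, and the 4-characters-per-token rule of thumb are all assumptions for illustration; a real tokenizer for your model will give different numbers.

```python
import json

# Hypothetical tool definition, shaped like a typical function-calling schema.
EXAMPLE_TOOL = {
    "name": "send_email",
    "description": "Sends an email to one or more recipients with a subject and body.",
    "parameters": {
        "type": "object",
        "properties": {
            "to": {"type": "string", "description": "Recipient email address"},
            "subject": {"type": "string", "description": "Subject line"},
            "body": {"type": "string", "description": "Plain-text message body"},
        },
        "required": ["to", "subject", "body"],
    },
}


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough estimate only: assumes about 4 characters per token."""
    return int(len(text) / chars_per_token)


def tools_overhead(tools: list) -> int:
    """Estimated tokens added to a request by serializing every tool."""
    return sum(estimate_tokens(json.dumps(tool)) for tool in tools)


# Pretend all 50 tools are about the size of the example above.
all_50_tools = [EXAMPLE_TOOL] * 50
per_tool = tools_overhead([EXAMPLE_TOOL])
total = tools_overhead(all_50_tools)
print(f"~{per_tool} tokens per tool, ~{total} tokens for 50 tools")
```

Tool definitions that include usage examples tend to be longer than this one, so the real overhead is likely higher than what this small schema suggests.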
Problem 2: Context Pollution
LLMs have a limited context window. When a large share of it is taken up by irrelevant tool definitions, there is less room for the actual conversation, history, and reasoning, and the model is more easily distracted.
Problem 3: Response Quality Degrades
Research and practice show that LLMs make worse decisions when given too many irrelevant choices. It's like asking a person to pick from 50 options instead of 5: focus matters.
The Smart Solution: Filter Before You Send
What if, before calling the LLM, you automatically figured out which 3–5 tools are actually relevant to the user's request — and only sent those?
# The smart approach
user_input = "send an email to John"
relevant_tools = filter_relevant_tools(user_input, all_50_tools)  # only the top few tools
response = llm.call(
    user_input=user_input,
    tools=relevant_tools,  # lean and focused
)
The LLM now gets exactly what it needs. Nothing more.
The question is: how do you figure out which tools are relevant?
The Wrong Answer: Keyword Matching
Your first instinct might be grep-style keyword matching:
def is_relevant(user_input, tool):
    # Case-sensitive substring check: is the tool's name literally in the input?
    return tool.name in user_input
This breaks immediately:
user says: "shoot a message to John"
keyword match for "email" → no match ✗
Users don't speak in tool names. They say "shoot a message", "ping John", "drop a note" — all meaning the same thing. Keyword matching is brittle and frustrating.
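The failure is easy to reproduce. The sketch below runs a keyword check against the phrasings mentioned above; the `Tool` dataclass and the `email_tool` instance are hypothetical stand-ins for whatever tool objects your agent uses.

```python
from dataclasses import dataclass


@dataclass
class Tool:
    # Hypothetical minimal tool record; real tools carry more fields.
    name: str
    description: str


def is_relevant(user_input: str, tool: Tool) -> bool:
    # Case-insensitive substring check on the tool name
    return tool.name.lower() in user_input.lower()


email_tool = Tool(name="email", description="sends email to recipients")

phrasings = [
    "send an email to John",    # contains the tool name
    "shoot a message to John",  # same intent, different words
    "ping John",
    "drop a note to John",
]

matches = {text: is_relevant(text, email_tool) for text in phrasings}
for text, matched in matches.items():
    print(f"{text!r}: {'match' if matched else 'no match'}")
```

Only the first phrasing matches, even though all four express the same intent.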
You need something that understands meaning, not just words.
The Right Answer: Embeddings
An embedding model converts text into a list of numbers — called a vector — where similar meanings produce similar numbers.
"shoot a message to John" → [0.2, 0.8, 0.1, 0.9, ...]
"send email to recipient" → [0.21, 0.79, 0.11, 0.88, ...] ← very similar!
"query the database" → [0.9, 0.1, 0.8, 0.2, ...] ← very different
This works because the embedding model was pre-trained on billions of sentences and learned that "shoot a message", "send email", and "ping someone" all appear in the same contexts — so they get similar vectors.
No LLM call. No API. Just a small 80 MB model running locally on your machine.
How to Set It Up
Install the libraries:
pip install sentence-transformers scikit-learn
Load the model — it downloads once and caches locally:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
model = SentenceTransformer('all-MiniLM-L6-v2') # 80MB, runs locally
The Scoring Step
Once you have vectors, you measure how "close" they are using cosine similarity. Mathematically the score ranges from -1 to 1; for sentence embeddings like these it usually falls between roughly 0 (unrelated) and 1 (near-identical meaning).
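def get_relevance_score(user_input: str, tool_description: str) -> float:
    # Encode both texts into embedding vectors
    user_vector = model.encode(user_input)
    tool_vector = model.encode(tool_description)
    # cosine_similarity expects 2D inputs and returns a 1x1 matrix here
    score = cosine_similarity([user_vector], [tool_vector])[0][0]
    return float(score)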
Let's see it in action:
# Scores below are illustrative; exact values depend on the model version
get_relevance_score("shoot a message to John", "sends email to recipients")
# → ~0.72 ✓ relevant
get_relevance_score("shoot a message to John", "queries the database")
# → ~0.08 ✗ not relevant
get_relevance_score("shoot a message to John", "posts a Slack message")
# → ~0.61 ✓ relevant
The model captures meaning, not just surface words. It scored both email and Slack as relevant to "shoot a message", even though the email description shares no keyword with the request.
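For intuition about what `cosine_similarity` computes, here is a dependency-free sketch applied to the toy vectors from the earlier illustration. The `cosine` helper is written out by hand for clarity and is not part of either library; the 4-dimensional vectors are made up, whereas real `all-MiniLM-L6-v2` embeddings have 384 dimensions.

```python
import math


def cosine(a: list, b: list) -> float:
    """Cosine similarity: dot product divided by the product of the lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors from the illustration above (not real model output).
shoot_message = [0.2, 0.8, 0.1, 0.9]
send_email = [0.21, 0.79, 0.11, 0.88]
query_database = [0.9, 0.1, 0.8, 0.2]

sim_email = cosine(shoot_message, send_email)
sim_db = cosine(shoot_message, query_database)
print(f"message vs email:    {sim_email:.3f}")
print(f"message vs database: {sim_db:.3f}")
```

The first pair points in almost the same direction, so its score is very close to 1; the second pair comes out around 0.35. Only the direction of the vectors matters, not their length.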
Putting It All Together
The complete filter function:
def filter_relevant_tools(user_input: str, all_tools: list, top_k: int = 5) -> list:
    # Score every tool description against the user input
    scored_tools = []
    for tool in all_tools:
        score = get_relevance_score(user_input, tool.description)
        scored_tools.append((score, tool))
    # Highest score first; sorting on the score alone avoids comparing tool objects
    scored_tools.sort(key=lambda x: x[0], reverse=True)
    return [tool for score, tool in scored_tools[:top_k]]
And your agent now looks like this:
def run_agent(user_input: str):
    # Step 1: filter tools locally, no LLM call
    relevant_tools = filter_relevant_tools(user_input, all_tools, top_k=5)
    # Step 2: send only relevant tools to LLM
    response = llm.call(
        user_input=user_input,
        tools=relevant_tools,
    )
    return response
The Full Flow
User: "shoot a message to John"
↓
Embedding model (local, fast)
converts user input → vector
↓
Compare against all tool vectors
score each tool 0.0 → 1.0
↓
Pick top 5 by score
[email_tool, slack_tool, ...]
↓
Send only these 5 to LLM
↓
LLM picks the right tool, focused
No wasted tokens. No distracted LLM. No keyword brittleness.
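The flow compares the input against "all tool vectors", which suggests embedding the tool descriptions once and reusing them, so that each request only embeds the user input. Below is a minimal sketch of that idea. `ToolIndex`, `toy_embed`, and `VOCAB` are hypothetical names, and the word-counting embedder is only a stand-in so the example runs without downloading a model; in practice you would pass something like `lambda text: model.encode(text)`.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0  # treat all-zero vectors as unrelated


class ToolIndex:
    """Hypothetical helper: embeds tool descriptions once, reuses them per request."""

    def __init__(self, tools: List[Tuple[str, str]], embed: Callable[[str], Vector]):
        # tools is a list of (name, description) pairs
        self.embed = embed
        self.entries = [(name, embed(description)) for name, description in tools]

    def top_k(self, user_input: str, k: int = 5) -> List[str]:
        query = self.embed(user_input)  # the only embedding computed per request
        scored = [(cosine(query, vec), name) for name, vec in self.entries]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [name for _, name in scored[:k]]


# Stand-in embedder: counts a few hand-picked words. It is NOT semantic;
# it exists only to make the sketch runnable offline.
VOCAB = ["message", "email", "send", "database", "query", "file"]


def toy_embed(text: str) -> List[float]:
    words = text.lower().split()
    return [float(words.count(term)) for term in VOCAB]


index = ToolIndex(
    [
        ("email_tool", "send email message to recipients"),
        ("slack_tool", "post a message in slack"),
        ("database_tool", "query the database"),
    ],
    embed=toy_embed,
)
print(index.top_k("shoot a message to john", k=2))
```

With a real embedding model plugged in, the tool vectors would be computed once at startup and the per-request work would shrink to one `encode` call plus a handful of dot products.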
Results You Can Expect
| | Before | After |
|---|---|---|
| Tokens per request | ~12,000 | ~2,000 |
| Tool selection accuracy | Degrades with more tools | Stays high |
| Latency | Higher | Lower |
| Cost | High | ~5x cheaper |
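One way to see why a 6x drop in input tokens might translate into roughly 5x lower cost rather than exactly 6x: output tokens are billed too, and filtering tools does not shrink them. The sketch below works through that arithmetic. The prices, the 4:1 output-to-input price ratio, and the 100-token response length are all hypothetical placeholders, not real pricing.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price: float, output_price: float) -> float:
    """Cost of one request given per-token prices."""
    return input_tokens * input_price + output_tokens * output_price


# Hypothetical numbers, not real pricing.
INPUT_PRICE = 1.0     # arbitrary cost units per input token
OUTPUT_PRICE = 4.0    # assumed: output tokens cost 4x input tokens
OUTPUT_TOKENS = 100   # assumed: typical response length

before = request_cost(12_000, OUTPUT_TOKENS, INPUT_PRICE, OUTPUT_PRICE)
after = request_cost(2_000, OUTPUT_TOKENS, INPUT_PRICE, OUTPUT_PRICE)

print(f"input tokens shrink {12_000 / 2_000:.1f}x")
print(f"cost per request shrinks {before / after:.1f}x")
```

Under these assumptions the cost ratio lands a little above 5x. The longer your responses, the smaller the saving becomes relative to the raw token reduction.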
Key Takeaways
- Sending all tools to the LLM hurts quality and costs money
- Keyword matching is too brittle — users don't speak in tool names
- Embeddings understand meaning, not just words
- The embedding model runs locally: no API call, and typically only milliseconds of latency once tool vectors are computed ahead of time
- Score your tools, pick the top 5, then call the LLM with a focused context
The embedding model does the heavy lifting of understanding language. Your LLM gets to focus on what it does best — reasoning and decision making.