TraceVerse Community
The fastest way to know what your AI agent is actually doing, and prove it on a public leaderboard.
You wrote an agent. It works. Sometimes. It calls an LLM, it calls a tool, sometimes it loops, occasionally it spends ₹400 on a single user query and you have no idea why.
This org exists to fix that. Open source, framework-agnostic, built so you can go from git clone to a traced agent with a leaderboard rank in under five minutes.
Discord · GitHub · genai-otel-instrument · SmolTrace
Get a traced agent in 30 seconds
```python
# pip install genai-otel-instrument
from genai_otel_instrument import instrument

instrument(
    service_name="my-first-agent",
    otlp_endpoint="http://localhost:4318",  # or point at the public TraceMind Space
    redact_pii=True,  # PII off your traces by default
)
# That's it. Run your agent. Every LLM call, tool call, token, rupee, and
# millisecond of latency is now visible.
```
No SDK lock-in. No daemons. No "you must use our framework." Works with LangGraph, CrewAI, OpenAI Agents SDK, AutoGen, smolagents, vanilla openai, or anything else that hits an LLM API.
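The quick start enables `redact_pii=True`, but the rules the library applies are not described here. As a rough illustration of the general idea only, the sketch below masks two kinds of personal data in a string before it would be attached to a span. The patterns and the `redact` helper are hypothetical and are not the library's implementation.

```python
import re

# Illustrative patterns only (assumptions, not the library's rules).
# A real redactor needs far broader coverage than these two cases.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Ten-digit Indian mobile number, optionally prefixed with +91.
PHONE_RE = re.compile(r"(?<!\d)(?:\+91[\s-]?)?[6-9]\d{9}(?!\d)")


def redact(text: str) -> str:
    """Replace email addresses and mobile numbers with fixed placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)


print(redact("Call 9876543210 or mail asha@example.com"))
```

Masking before export, rather than in the viewer, means the raw values never leave the process that saw them.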
What we ship
Libraries
| Project | What you get |
|---|---|
| genai-otel-instrument | One-line OpenTelemetry instrumentation for any GenAI agent. Captures LLM calls, tool calls, cost, tokens, latency. Auto-redacts PII by default. |
| SmolTrace | Public benchmark + leaderboard for agent evals. Submit an agent, get a rank, compare on cost, latency, and quality. |
| TraceMind | Hosted trace viewer. Point your OTLP endpoint at it, see what your agent did, where it broke, what it cost. No signup. |
| TraceMind-mcp-server | An MCP server so your agent can query its own historical traces. Meta-observability for self-improving agents. |
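To make "what it cost" concrete, here is a minimal sketch of rolling per-span cost up a trace tree, so an expensive query can be traced to the call that caused it. The flat span records and their field names (`span_id`, `parent_id`, `cost`) are assumptions for illustration and are not the attribute keys the library emits.

```python
from collections import defaultdict

# Hypothetical flat span records; field names are illustrative only.
spans = [
    {"span_id": "a", "parent_id": None, "name": "agent.run", "cost": 0.0},
    {"span_id": "b", "parent_id": "a", "name": "llm.plan", "cost": 1.5},
    {"span_id": "c", "parent_id": "a", "name": "tool.search", "cost": 0.0},
    {"span_id": "d", "parent_id": "c", "name": "llm.summarize", "cost": 2.25},
]


def build_children(records):
    """Index spans by parent id so the tree can be walked top-down."""
    children = defaultdict(list)
    for span in records:
        children[span["parent_id"]].append(span)
    return children


def subtree_cost(span, children):
    """Cost of a span plus the cost of everything beneath it."""
    return span["cost"] + sum(
        subtree_cost(child, children) for child in children[span["span_id"]]
    )


def render(span, children, depth=0):
    """Return indented lines showing each span with its rolled-up cost."""
    lines = [f"{'  ' * depth}{span['name']}  cost={subtree_cost(span, children):.2f}"]
    for child in children[span["span_id"]]:
        lines.extend(render(child, children, depth + 1))
    return lines


children = build_children(spans)
root = children[None][0]  # the span with no parent
print("\n".join(render(root, children)))
```

In this toy trace the root shows the full 3.75, and the breakdown shows that most of it came from the LLM call nested under the search tool.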
Live MCP servers (3 servers · 18 tools · synthetic data · no API key)
| Surface | Space | Tools |
|---|---|---|
| Food delivery | food-delivery-mcp | 7 |
| Grocery / Instamart | instamart-mcp | 6 |
| Dineout / Reservations | dineout-mcp | 5 |
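MCP clients talk to servers like these with JSON-RPC 2.0 messages. The sketch below only builds the two requests an agent typically sends, `tools/list` and `tools/call`, without opening a connection. The tool name `apply_promo` is mentioned elsewhere in this README; which server exposes it is not stated, and the argument names and values are hypothetical.

```python
import json
from itertools import count
from typing import Optional

_ids = count(1)  # JSON-RPC request ids must be unique within a session


def jsonrpc_request(method: str, params: Optional[dict] = None) -> dict:
    """Build a JSON-RPC 2.0 request envelope of the kind MCP uses."""
    message = {"jsonrpc": "2.0", "id": next(_ids), "method": method}
    if params is not None:
        message["params"] = params
    return message


# Ask a server which tools it exposes.
list_tools = jsonrpc_request("tools/list")

# Invoke one tool. The argument names and values are hypothetical.
call_tool = jsonrpc_request(
    "tools/call",
    {"name": "apply_promo", "arguments": {"cart_id": "cart-123", "code": "WELCOME50"}},
)

print(json.dumps(call_tool, indent=2))
```

A real session starts with an `initialize` handshake before either request, and an MCP client library normally handles that along with the transport.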
Eval datasets (SmolTrace-format)
| Dataset | Tasks |
|---|---|
| food-delivery-evals | 111 |
| instamart-evals | 100 |
| dineout-evals | 100 |
Cross-domain SmolTrace datasets
For evaluation across other domains, see the TraceMind-AI Collection, which holds 41 SmolTrace-format datasets covering:
- General domains (12): travel, ecommerce, healthcare, finance, legal, education, real-estate, social-media, recruitment, smart-home, customer-support, food-delivery
- Ops & infrastructure (15): aiops, apm, devops, secops, mlops, llmops, cloud-cost, kubernetes, database-ops, incident-management, IaC, SRE, observability-platform, CI/CD, log-management
- Industry-specific (14): drone, farming, manufacturing, hospitality, logistics, automotive, cybersecurity, telecom, insurance, events, marine, aviation, gaming, plus the three TraceVerse Community datasets above
Same SmolTrace schema, same prompt-template structure as ours. Use them directly; no need to mirror.
Reference agents + docs
- GitHub: food-delivery-agents, the binding repo. Reference agents wired with genai-otel-instrument, architecture docs, observability primer, leaderboard CI.
What you'll get from this stack
- See it. Every LLM call, tool call, token spent, millisecond burned, all visualized as a trace tree.
- Score it. Run your agent against shared task datasets. Get a number on a public leaderboard. Watch it move.
- Compare it. Two model versions, two prompts, two frameworks: same dataset, side-by-side cost, latency, and quality.
- Trust it. PII redaction is on by default. Self-host the viewer if you don't want anyone seeing your traces.
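As a rough sketch of the "Score it" and "Compare it" steps, the code below reduces two runs over the same task set to pass rate, total cost, and mean latency, then reports the difference. The `TaskResult` record and its fields are hypothetical; they do not reflect the SmolTrace schema or how the leaderboard computes ranks.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class TaskResult:
    """Hypothetical per-task outcome, not the SmolTrace schema."""
    task_id: str
    passed: bool
    cost: float        # in whatever currency your traces report
    latency_ms: float


def summarize(results):
    """Collapse per-task results into three headline numbers."""
    return {
        "pass_rate": mean(1.0 if r.passed else 0.0 for r in results),
        "total_cost": sum(r.cost for r in results),
        "mean_latency_ms": mean(r.latency_ms for r in results),
    }


def compare(baseline, candidate):
    """Per-metric difference (candidate minus baseline) on the same tasks."""
    if {r.task_id for r in baseline} != {r.task_id for r in candidate}:
        raise ValueError("runs must cover the same tasks to be comparable")
    a, b = summarize(baseline), summarize(candidate)
    return {metric: b[metric] - a[metric] for metric in a}


run_a = [TaskResult("t1", True, 2.0, 800.0), TaskResult("t2", False, 4.0, 1200.0)]
run_b = [TaskResult("t1", True, 1.0, 600.0), TaskResult("t2", True, 2.0, 1000.0)]
print(compare(run_a, run_b))
```

Refusing to compare runs with different task sets keeps the numbers honest: a cheaper run that skipped the hard tasks is not an improvement.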
Who this is for
- Buildathon participants: go from zero to traced agent with a leaderboard rank in under five minutes. Any framework, any model.
- Indie builders: see what your agent actually does, not what you think it does. Stop debugging via `print()`.
- Teams shipping LLM apps: replace ad-hoc notebook evals with reproducible numbers you can show a stakeholder.
- Researchers: every dataset and benchmark here is open. Fork it, extend it, contribute back.
What we believe
- Observability is a precondition for serious agent work. You cannot improve what you cannot see.
- Evaluation should be reproducible and public. Benchmarks that live in private notebooks help no one.
- Cost and latency are first-class signals. Quality without cost discipline is a research demo, not a product.
- The toolkit must work the same on localhost as in production. No magic that only kicks in on day 30.
Community
- Discord: chat with the community, ask questions, share traces, suggest tasks for the eval suites.
- GitHub: open issues, PRs welcome, no CLA. Discussions enabled on every repo.
- HF Discussions: every Space and Dataset has a Discussions tab. Use it for surface-specific questions (e.g. "found a bug in `apply_promo`" goes on that Space's tab).
Roadmap
- Live now: genai-otel-instrument, SmolTrace, public TraceMind, TraceMind-mcp-server, 3 live MCP servers (food / grocery / dineout, 18 tools), 3 own eval suites (311 tasks total), 18 mirrored eval datasets, food-delivery-agents binding repo.
- Next: framework-specific reference agents (LangGraph + smolagents + CrewAI), automated PR-driven leaderboard, more domain MCP servers.
- After: community-curated tasks across more domains, cost-optimization recipes, agents.md standardization across all our Spaces.
Production-grade companion
Need this stack on-premises with autonomous root-cause analysis, compliance audit trails, multi-year retention, and air-gapped deployment? TraceVerse Enterprise is the bigger sibling built for regulated environments: same telemetry contract, hardened for the bank floor.
Get involved
- Try it: start with genai-otel-instrument on the agent you have right now.
- Contribute: every repo above accepts PRs. Issues open. No CLA.
- Share datasets: got a domain-specific task set? PR it into SmolTrace or open a discussion.
- Join the conversation: Discord, GitHub Discussions, or HF Discussions on any repo. We answer.