AI & ML interests

TraceVerse Community is an open evaluation and observability ecosystem for real-world AI applications. Built on top of the Hugging Face Hub, it hosts datasets, traces, and benchmarking pipelines that help developers measure cost, latency, and quality across models using real production-like workflows. We focus on turning observability data into reusable evaluation assets, enabling reproducible benchmarking, continuous optimization, and better model selection for any AI use case. The goal is simple: make evaluation a first-class, community-driven layer in the AI stack.

TraceVerse Community

The fastest way to know what your AI agent is actually doing, and to prove it on a public leaderboard.

You wrote an agent. It works. Sometimes. It calls an LLM, it calls a tool, sometimes it loops, and occasionally it spends ₹400 on a single user query with no indication why.

This org exists to fix that. Open source, framework-agnostic, built so you can go from git clone to a traced agent with a leaderboard rank in under five minutes.

🔗 Discord · GitHub · genai-otel-instrument · SmolTrace


Get a traced agent in 30 seconds

# pip install genai-otel-instrument
from genai_otel_instrument import instrument

instrument(
    service_name="my-first-agent",
    otlp_endpoint="http://localhost:4318",   # or point at the public TraceMind Space
    redact_pii=True,                         # PII off your traces by default
)

# That's it. Run your agent. Every LLM call, tool call, token, rupee, and
# millisecond of latency is now visible.

No SDK lock-in. No daemons. No "you must use our framework." Works with LangGraph, CrewAI, OpenAI Agents SDK, AutoGen, smolagents, vanilla openai: anything that hits an LLM API.
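To make "every LLM call, token, and millisecond is visible" concrete, here is a minimal sketch of the kind of span a GenAI OpenTelemetry instrumentation layer records per LLM call. The attribute names follow the OpenTelemetry GenAI semantic conventions, but treat the exact fields as illustrative assumptions rather than genai-otel-instrument's actual schema; the provider call is stubbed so the sketch runs offline.

```python
import time
import json

def traced_llm_call(call_fn, model):
    """Wrap an LLM call and record span-style attributes for it."""
    start = time.perf_counter()
    response = call_fn()
    span = {
        "span.kind": "llm",
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": response["usage"]["prompt_tokens"],
        "gen_ai.usage.output_tokens": response["usage"]["completion_tokens"],
        "latency_ms": round((time.perf_counter() - start) * 1000, 2),
    }
    return response, span

# Stubbed provider response so this runs without a network or an API key.
fake_response = {"usage": {"prompt_tokens": 42, "completion_tokens": 7}}
_, span = traced_llm_call(lambda: fake_response, model="gpt-4o-mini")
print(json.dumps(span, indent=2))
```

Because the wrapper only needs a callable that returns a usage-bearing response, the same pattern applies to any framework or raw client, which is what makes this style of instrumentation framework-agnostic.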


What we ship

Libraries

  • genai-otel-instrument: One-line OpenTelemetry instrumentation for any GenAI agent. Captures LLM calls, tool calls, cost, tokens, and latency. Auto-redacts PII by default.
  • SmolTrace: Public benchmark and leaderboard for agent evals. Submit an agent, get a rank, and compare on cost, latency, and quality.
  • TraceMind: Hosted trace viewer. Point your OTLP endpoint at it to see what your agent did, where it broke, and what it cost. No signup.
  • TraceMind-mcp-server: An MCP server so your agent can query its own historical traces. Meta-observability for self-improving agents.

Live MCP servers (3 servers · 18 tools · synthetic data · no API key)

  • Food delivery: food-delivery-mcp (7 tools)
  • Grocery / Instamart: instamart-mcp (6 tools)
  • Dineout / Reservations: dineout-mcp (5 tools)
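MCP speaks JSON-RPC 2.0 under the hood, so calling one of these servers' tools boils down to a "tools/call" request. The sketch below builds such a payload; the tool name and arguments are hypothetical examples for the food-delivery surface, and a real client would first issue "tools/list" to discover what each server actually exposes.

```python
import json

# A JSON-RPC 2.0 request invoking a tool on an MCP server.
# "search_restaurants" and its arguments are hypothetical examples.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "search_restaurants",
        "arguments": {"query": "biryani", "max_results": 5},
    },
}
print(json.dumps(request, indent=2))
```

Since the servers run on synthetic data with no API key, a request like this is safe to fire from a buildathon laptop while you wire up your agent.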

Eval datasets (SmolTrace-format)

Cross-domain SmolTrace datasets

For evaluation across other domains, see the TraceMind-AI Collection: 41 SmolTrace-format datasets covering:

  • General domains (12): travel, ecommerce, healthcare, finance, legal, education, real-estate, social-media, recruitment, smart-home, customer-support, food-delivery
  • Ops & infrastructure (15): aiops, apm, devops, secops, mlops, llmops, cloud-cost, kubernetes, database-ops, incident-management, IaC, SRE, observability-platform, CI/CD, log-management
  • Industry-specific (14): drone, farming, manufacturing, hospitality, logistics, automotive, cybersecurity, telecom, insurance, events, marine, aviation, gaming, plus the three TraceVerse Community datasets above

Same SmolTrace schema, same prompt-template structure as ours. Use them directly; no need to mirror.
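As a sanity check before submitting your own task set, it helps to validate records against the shared schema. The field names below are illustrative assumptions only; consult the dataset cards in the TraceMind-AI Collection for the authoritative SmolTrace schema.

```python
# Hypothetical sketch of one task in a SmolTrace-format eval dataset.
# Field names are assumptions, not the documented schema.
task = {
    "task_id": "food-delivery-001",
    "prompt": "Order a medium margherita pizza and apply any available promo.",
    "expected_tools": ["search_restaurants", "place_order", "apply_promo"],
    "max_cost_usd": 0.05,
}

def validate(record, required=("task_id", "prompt", "expected_tools")):
    """Raise if a record is missing any required field."""
    missing = [k for k in required if k not in record]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return True

print(validate(task))
```

A small validator like this is also a natural pre-submit hook for the PR-driven leaderboard, catching malformed tasks before CI does.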

Reference agents + docs

  • GitHub: food-delivery-agents, the binding repo. Reference agents wired with genai-otel-instrument, architecture docs, an observability primer, and leaderboard CI.

What you'll get from this stack

  • See it. Every LLM call, tool call, token spent, and millisecond burned, visualized as a trace tree.
  • Score it. Run your agent against shared task datasets. Get a number on a public leaderboard. Watch it move.
  • Compare it. Two model versions, two prompts, two frameworks: same dataset, with side-by-side cost, latency, and quality.
  • Trust it. PII redaction is on by default. Self-host the viewer if you don't want anyone seeing your traces.
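The "compare it" step reduces to aggregating per-task trace records from two runs over the same dataset. A minimal sketch, assuming hypothetical record fields rather than TraceMind's actual export format:

```python
from statistics import mean

# Two runs of the same eval dataset; field names are illustrative.
run_a = [{"cost_usd": 0.012, "latency_ms": 840}, {"cost_usd": 0.015, "latency_ms": 910}]
run_b = [{"cost_usd": 0.004, "latency_ms": 1320}, {"cost_usd": 0.005, "latency_ms": 1280}]

def summarize(records):
    """Aggregate total cost and mean latency for one run."""
    return {
        "total_cost_usd": round(sum(r["cost_usd"] for r in records), 4),
        "mean_latency_ms": round(mean(r["latency_ms"] for r in records), 1),
    }

for label, run in (("model A", run_a), ("model B", run_b)):
    print(label, summarize(run))
```

Holding the dataset fixed while varying exactly one thing (model, prompt, or framework) is what turns these numbers into a comparison rather than two unrelated benchmarks.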

Who this is for

  • Buildathon participants: go from zero to a traced agent with a leaderboard rank in under five minutes. Any framework, any model.
  • Indie builders: see what your agent actually does, not what you think it does. Stop debugging via print().
  • Teams shipping LLM apps: replace ad-hoc notebook evals with reproducible numbers you can show a stakeholder.
  • Researchers: every dataset and benchmark here is open. Fork it, extend it, contribute back.

What we believe

  1. Observability is a precondition for serious agent work. You cannot improve what you cannot see.
  2. Evaluation should be reproducible and public. Benchmarks that live in private notebooks help no one.
  3. Cost and latency are first-class signals. Quality without cost discipline is a research demo, not a product.
  4. The toolkit must work the same on localhost as in production. No magic that only kicks in on day 30.

Community

  • 💬 Discord: chat with the community, ask questions, share traces, suggest tasks for the eval suites.
  • 🐙 GitHub: open issues, PRs welcome, no CLA. Discussions enabled on every repo.
  • 🤗 HF Discussions: every Space and Dataset has a Discussions tab. Use it for surface-specific questions (e.g. "found a bug in apply_promo" → discuss on the Space's tab).

Roadmap

  • ✅ Live now: genai-otel-instrument, SmolTrace, public TraceMind, TraceMind-mcp-server, 3 live MCP servers (food / grocery / dineout, 18 tools), 3 of our own eval suites (311 tasks total), 18 mirrored eval datasets, and the food-delivery-agents binding repo.
  • 🔜 Next: framework-specific reference agents (LangGraph + smolagents + CrewAI), an automated PR-driven leaderboard, more domain MCP servers.
  • After: community-curated tasks across more domains, cost-optimization recipes, agents.md standardization across all our Spaces.

Production-grade companion

Need this stack on-premises with autonomous root-cause analysis, compliance audit trails, multi-year retention, and air-gapped deployment? TraceVerse Enterprise is the bigger sibling built for regulated environments: same telemetry contract, hardened for the bank floor.


Get involved

  • Try it: start with genai-otel-instrument on the agent you have right now.
  • Contribute: every repo above accepts PRs. Issues are open. No CLA.
  • Share datasets: got a domain-specific task set? PR it into SmolTrace or open a discussion.
  • Join the conversation: Discord, GitHub Discussions, or HF Discussions on any repo. We answer.
