|
|
--- |
|
|
license: mit |
|
|
sdk: docker |
|
|
emoji: 📚 |
|
|
short_description: Collaborative Testing for LLM & Agentic Applications |
|
|
--- |
|
|
# Rhesis: Collaborative Testing for LLM & Agentic Applications |
|
|
|
|
|
<p align="center"> |
|
|
<img src="https://github.com/user-attachments/assets/ff43ca6a-ffde-4aff-9ff9-eec3897d0d02" alt="Rhesis AI Logo" height="80"> |
|
|
</p> |
|
|
|
|
|
<p align="center"> |
|
|
<a href="https://github.com/rhesis-ai/rhesis/blob/main/LICENSE"> |
|
|
<img src="https://img.shields.io/badge/license-MIT%20%2B%20Enterprise-blue" alt="License"> |
|
|
</a> |
|
|
<a href="https://pypi.org/project/rhesis-sdk/"> |
|
|
<img src="https://img.shields.io/pypi/v/rhesis-sdk" alt="PyPI Version"> |
|
|
</a> |
|
|
<a href="https://pypi.org/project/rhesis-sdk/"> |
|
|
<img src="https://img.shields.io/pypi/pyversions/rhesis-sdk" alt="Python Versions"> |
|
|
</a> |
|
|
<a href="https://codecov.io/gh/rhesis-ai/rhesis"> |
|
|
<img src="https://codecov.io/gh/rhesis-ai/rhesis/graph/badge.svg?token=1XQV983JEJ" alt="codecov"> |
|
|
</a> |
|
|
<a href="https://discord.rhesis.ai"> |
|
|
<img src="https://img.shields.io/discord/1340989671601209408?color=7289da&label=Discord&logo=discord&logoColor=white" alt="Discord"> |
|
|
</a> |
|
|
<a href="https://www.linkedin.com/company/rhesis-ai"> |
|
|
<img src="https://img.shields.io/badge/LinkedIn-Rhesis_AI-blue?logo=linkedin" alt="LinkedIn"> |
|
|
</a> |
|
|
<a href="https://huggingface.co/rhesis"> |
|
|
<img src="https://img.shields.io/badge/🤗-Rhesis-yellow" alt="Hugging Face"> |
|
|
</a> |
|
|
<a href="https://docs.rhesis.ai"> |
|
|
<img src="https://img.shields.io/badge/docs-rhesis.ai-blue" alt="Documentation"> |
|
|
</a> |
|
|
</p> |
|
|
|
|
|
<p align="center"> |
|
|
<a href="https://rhesis.ai"><strong>Website</strong></a> · |
|
|
<a href="https://docs.rhesis.ai"><strong>Docs</strong></a> · |
|
|
<a href="https://discord.rhesis.ai"><strong>Discord</strong></a> · |
|
|
<a href="https://github.com/rhesis-ai/rhesis/blob/main/CHANGELOG.md"><strong>Changelog</strong></a> |
|
|
</p> |
|
|
|
|
|
<h3 align="center">More than just evals.<br><strong>Collaborative agent testing for teams.</strong></h3> |
|
|
|
|
|
<p align="center"> |
|
|
Generate tests from requirements, simulate conversation flows, detect adversarial behaviors, evaluate with 60+ metrics, and trace failures with OpenTelemetry. Engineers and domain experts, working together. |
|
|
</p> |
|
|
|
|
|
<p align="center"> |
|
|
<a href="https://rhesis.ai/?video=open" target="_blank"> |
|
|
<img src="https://raw.githubusercontent.com/rhesis-ai/rhesis/main/.github/images/GH_Short_Demo.png" |
|
|
loading="lazy" |
|
|
width="1080" |
|
|
alt="Rhesis Platform Overview - Click to watch demo"> |
|
|
</a> |
|
|
</p> |
|
|
|
|
|
--- |
|
|
|
|
|
## Core features |
|
|
|
|
|
<p align="center"> |
|
|
<img src="https://raw.githubusercontent.com/rhesis-ai/rhesis/main/.github/images/GH_Features.png" |
|
|
loading="lazy" |
|
|
width="1080" |
|
|
alt="Rhesis Core Features"> |
|
|
</p> |
|
|
|
|
|
### Test generation |
|
|
|
|
|
**AI-Powered Synthesis** - Describe requirements in plain language. Rhesis generates hundreds of test scenarios including edge cases and adversarial prompts. |
|
|
|
|
|
**Knowledge-Aware** - Connect context sources via file upload or MCP (Notion, GitHub, Jira, Confluence) for better test generation. |
|
|
|
|
|
### Single-turn & conversation simulation |
|
|
|
|
|
**Single-turn** for Q&A validation. **Conversation simulation** for dialogue flows. |
|
|
|
|
|
**Penelope Agent** simulates realistic conversations to test context retention, role adherence, and dialogue coherence across extended interactions. |
|
|
|
|
|
### Adversarial testing (red-teaming) |
|
|
|
|
|
**Polyphemus Agent** proactively finds vulnerabilities: |
|
|
|
|
|
- Jailbreak attempts and prompt injection |
|
|
- PII leakage and data extraction |
|
|
- Harmful content generation |
|
|
- Role violation and instruction bypassing |
|
|
|
|
|
**Garak Integration** - Built-in support for [garak](https://github.com/leondz/garak), the LLM vulnerability scanner, for comprehensive security testing. |
|
|
|
|
|
### 60+ pre-built metrics |
|
|
|
|
|
| Framework | Example Metrics | |
|
|
|-----------|-----------------| |
|
|
| **RAGAS** | Context relevance, faithfulness, answer accuracy | |
|
|
| **DeepEval** | Bias, toxicity, PII leakage, role violation, turn relevancy, knowledge retention | |
|
|
| **Garak** | Jailbreak detection, prompt injection, XSS, malware generation, data leakage | |
|
|
| **Custom** | NumericJudge, CategoricalJudge for domain-specific evaluation | |
|
|
|
|
|
All metrics include LLM-as-Judge reasoning explanations. |
|
|
|
|
|
### Traces & observability |
|
|
|
|
|
Monitor your LLM applications with OpenTelemetry-based tracing: |
|
|
|
|
|
```python |
|
|
from rhesis.sdk.decorators import observe |
|
|
|
|
|
@observe.llm(model="gpt-4") |
|
|
def generate_response(prompt: str) -> str: |
|
|
# Your LLM call here |
|
|
return response |
|
|
``` |
|
|
|
|
|
Track LLM calls, latency, token usage, and link traces to test results for debugging. |
|
|
|
|
|
### Bring your own model |
|
|
|
|
|
Use any LLM provider for test generation and evaluation: |
|
|
|
|
|
**Cloud:** OpenAI, Anthropic, Google Gemini, Mistral, Cohere, Groq, Together AI |
|
|
|
|
|
**Local/Self-hosted:** Ollama, vLLM, LiteLLM |
|
|
|
|
|
See [Model Configuration Docs](https://docs.rhesis.ai/sdk/models) for setup instructions. |
|
|
|
|
|
--- |
|
|
|
|
|
## Curated Test Sets on Hugging Face |
|
|
|
|
|
We publish curated test datasets on [Hugging Face](https://huggingface.co/rhesis) to help teams assess their LLM applications. These test sets cover diverse evaluation scenarios across conversational AI, agentic systems, RAG applications, and more—helping you validate robustness, reliability, safety, and compliance. |
|
|
|
|
|
### What's available |
|
|
|
|
|
Test sets designed for: |
|
|
- **Conversational AI** - Multi-turn dialogue, context retention, role adherence |
|
|
- **Agentic Systems** - Tool selection, goal achievement, multi-agent coordination |
|
|
- **RAG Systems** - Context relevance, faithfulness, hallucination detection |
|
|
- **Adversarial Testing** - Jailbreak resistance, prompt injection, PII leakage |
|
|
- **Domain-Specific Applications** - Finance, healthcare, customer support, sales, and more |
|
|
|
|
|
### Using our test sets |
|
|
|
|
|
**Option 1: Rhesis Platform** |
|
|
1. Download a test set from [Hugging Face](https://huggingface.co/rhesis) |
|
|
2. In the Rhesis platform, navigate to **Test Sets** → **Import from file** |
|
|
3. Upload the downloaded CSV file |
|
|
|
|
|
**Option 2: Python SDK** |
|
|
|
|
|
```python |
|
|
from rhesis.sdk import TestSet |
|
|
|
|
|
# Load tests from a CSV file downloaded from Hugging Face |
|
|
test_set = TestSet.from_csv( |
|
|
"tests.csv", |
|
|
name="Imported Tests", |
|
|
description="Tests imported from Hugging Face" |
|
|
) |
|
|
print(f"Loaded {len(test_set.tests)} tests") |
|
|
``` |
|
|
|
|
|
> **Disclaimer:** Some test cases may contain sensitive or challenging content included for thorough realistic assessment. Review test cases carefully and exercise discretion when utilizing them. |
|
|
|
|
|
--- |
|
|
|
|
|
## Why Rhesis? |
|
|
|
|
|
**Platform for teams. SDK for developers.** |
|
|
|
|
|
Use the collaborative platform for team-based testing: product managers define requirements, domain experts review results, engineers integrate via CI/CD. Or integrate directly with the Python SDK for code-first workflows. |
|
|
|
|
|
### The testing lifecycle |
|
|
|
|
|
Six integrated phases from project setup to team collaboration: |
|
|
|
|
|
| Phase | What You Do | |
|
|
|--------------------------------|-------------| |
|
|
| **[1. Projects](https://docs.rhesis.ai/platform/projects)** | Configure your AI application, upload & connect context sources (files, docs), set up SDK connectors | |
|
|
| **[2. Requirements](https://docs.rhesis.ai/platform/behaviors)** | Define expected behaviors (what your app should and shouldn't do), cover all relevant aspects from product, marketing, customer support, legal and compliance teams | |
|
|
| **[3. Metrics](https://docs.rhesis.ai/platform/metrics)** | Select from 60+ pre-built metrics or create custom LLM-as-Judge evaluations to assess whether your requirements are met | |
|
|
| **[4. Tests](https://docs.rhesis.ai/platform/tests)** | Generate single-turn and conversation simulation test scenarios. Organize in test sets and understand your test coverage | |
|
|
| **[5. Execution](https://docs.rhesis.ai/platform/test-execution)** | Run tests via UI, SDK, or API; integrate into CI/CD pipelines; collect traces during execution | |
|
|
| **[6. Collaboration](https://docs.rhesis.ai/platform/test-runs)** | Review results with your team through comments, tasks, workflows, and side-by-side comparisons | |
|
|
|
|
|
### Rhesis vs... |
|
|
|
|
|
| Instead of... | Rhesis gives you... | |
|
|
|---------------|---------------------| |
|
|
| **Manual testing** | AI-generated test cases based on your context, hundreds in minutes | |
|
|
| **Traditional test frameworks** | Non-deterministic output handling built-in | |
|
|
| **LLM observability tools** | Pre-production validation, not post-production monitoring | |
|
|
| **Red-teaming services** | Continuous, self-service adversarial testing, not one-time audits | |
|
|
|
|
|
--- |
|
|
|
|
|
## What you can test |
|
|
|
|
|
| Use Case | What Rhesis Tests | |
|
|
|----------|-------------------| |
|
|
| **Conversational AI** | Conversation simulation, role adherence, knowledge retention | |
|
|
| **RAG Systems** | Context relevance, faithfulness, hallucination detection | |
|
|
| **NL-to-SQL / NL-to-Code** | Query accuracy, syntax validation, edge case handling | |
|
|
| **Agentic Systems** | Tool selection, goal achievement, multi-agent coordination | |
|
|
|
|
|
--- |
|
|
|
|
|
## SDK: Code-first testing |
|
|
|
|
|
Test your Python functions directly with the `@endpoint` decorator: |
|
|
|
|
|
```python |
|
|
from rhesis.sdk.decorators import endpoint |
|
|
|
|
|
@endpoint(name="my-chatbot") |
|
|
def chat(message: str) -> str: |
|
|
# Your LLM logic here |
|
|
return response |
|
|
``` |
|
|
|
|
|
**Features:** Zero configuration, automatic parameter binding, auto-reconnection, environment management (dev/staging/production). |
|
|
|
|
|
**Generate tests programmatically:** |
|
|
|
|
|
```python |
|
|
from rhesis.sdk.synthesizers import PromptSynthesizer |
|
|
|
|
|
synthesizer = PromptSynthesizer( |
|
|
prompt="Generate tests for a medical chatbot that must never provide diagnosis" |
|
|
) |
|
|
test_set = synthesizer.generate(num_tests=10) |
|
|
``` |
|
|
|
|
|
--- |
|
|
|
|
|
## Deployment options |
|
|
|
|
|
| Option | Best For | Setup Time | |
|
|
|--------|----------|------------| |
|
|
| **[Rhesis Cloud](https://app.rhesis.ai)** | Teams wanting managed deployment | Instant | |
|
|
| **Docker** | Local development and testing | 5 minutes | |
|
|
| **Kubernetes** | Production self-hosting | [See docs](https://docs.rhesis.ai/getting-started/self-hosting) | |
|
|
|
|
|
### Quick Start |
|
|
|
|
|
**Option 1: Cloud (fastest)** - [app.rhesis.ai](https://app.rhesis.ai) - Managed service, just connect your app |
|
|
|
|
|
**Option 2: Self-host with Docker** |
|
|
```bash |
|
|
git clone https://github.com/rhesis-ai/rhesis.git && cd rhesis && ./rh start |
|
|
``` |
|
|
|
|
|
**Access:** Frontend at `localhost:3000`, API at `localhost:8080/docs` |
|
|
|
|
|
**Commands:** `./rh logs` · `./rh stop` · `./rh restart` · `./rh delete` |
|
|
|
|
|
> **Note:** This setup enables auto-login for local testing. For production, see [Self-hosting Documentation](https://docs.rhesis.ai/getting-started/self-hosting). |
|
|
|
|
|
**Option 3: Python SDK** |
|
|
```bash |
|
|
pip install rhesis-sdk |
|
|
``` |
|
|
|
|
|
--- |
|
|
|
|
|
## Integrations |
|
|
|
|
|
Connect Rhesis to your LLM stack: |
|
|
|
|
|
| Integration | Languages | Description | |
|
|
|-------------|-----------|-------------| |
|
|
| **Rhesis SDK** | Python, JS/TS | Native SDK with decorators for endpoints and observability. Full control over test execution and tracing. | |
|
|
| **OpenAI** | Python | Drop-in replacement for OpenAI SDK. Automatic instrumentation with zero code changes. | |
|
|
| **Anthropic** | Python | Native support for Claude models with automatic tracing. | |
|
|
| **LangChain** | Python | Add Rhesis callback handler to your LangChain app for automatic tracing and test execution. | |
|
|
| **LangGraph** | Python | Built-in integration for LangGraph agent workflows with full observability. | |
|
|
| **AutoGen** | Python | Automatic instrumentation for Microsoft AutoGen multi-agent conversations. | |
|
|
| **LiteLLM** | Python | Unified interface for 100+ LLMs (OpenAI, Azure, Anthropic, Cohere, Ollama, vLLM, HuggingFace, Replicate). | |
|
|
| **Google Gemini** | Python | Native integration for Google's Gemini models. | |
|
|
| **Ollama** | Python | Local LLM deployment with Ollama integration. | |
|
|
| **OpenRouter** | Python | Access to multiple LLM providers through OpenRouter. | |
|
|
| **Vertex AI** | Python | Google Cloud Vertex AI model support. | |
|
|
| **HuggingFace** | Python | Direct integration with HuggingFace models. | |
|
|
| **REST API** | Any | Direct API access for custom integrations. [OpenAPI spec available](https://api.rhesis.ai/docs). | |
|
|
|
|
|
See [Integration Docs](https://docs.rhesis.ai/development) for setup instructions. |
|
|
|
|
|
--- |
|
|
|
|
|
## Open source |
|
|
|
|
|
[MIT licensed](LICENSE). No plans to relicense core features. Enterprise version will live in `ee/` folders and remain separate. |
|
|
|
|
|
We built Rhesis because existing LLM testing tools didn't meet our needs. If you face the same challenges, contributions are welcome. |
|
|
|
|
|
--- |
|
|
|
|
|
## Contributing |
|
|
|
|
|
See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines. |
|
|
|
|
|
**Ways to contribute:** Fix bugs or add features · Contribute test sets for common failure modes · Improve documentation · Help others in Discord or GitHub discussions |
|
|
|
|
|
--- |
|
|
|
|
|
## Support |
|
|
|
|
|
- **[Documentation](https://docs.rhesis.ai)** - Guides and API reference |
|
|
- **[Discord](https://discord.rhesis.ai)** - Community support |
|
|
- **[GitHub Issues](https://github.com/rhesis-ai/rhesis/issues)** - Bug reports and feature requests |
|
|
|
|
|
--- |
|
|
|
|
|
## Security & privacy |
|
|
|
|
|
We take data security seriously. See our [Privacy Policy](https://rhesis.ai/privacy-policy) for details. |
|
|
|
|
|
**Telemetry:** Rhesis collects basic, anonymized usage statistics to improve the product. No sensitive data is collected or shared with third parties. |
|
|
|
|
|
- **Self-hosted:** Opt out by setting `OTEL_RHESIS_TELEMETRY_ENABLED=false` |
|
|
- **Cloud:** Telemetry enabled as part of Terms & Conditions |
|
|
|
|
|
--- |
|
|
|
|
|
<p align="center"> |
|
|
<strong>Made with <img src="https://github.com/user-attachments/assets/598c2d81-572c-46bd-b718-dee32cdc749c" height="16" alt="Rhesis logo"> in Potsdam, Germany 🇩🇪</strong> |
|
|
</p> |
|
|
|
|
|
<p align="center"> |
|
|
<a href="https://rhesis.ai">Learn more at rhesis.ai</a> |
|
|
</p> |