---
license: mit
sdk: docker
emoji: 📚
short_description: Collaborative Testing for LLM & Agentic Applications
---
# Rhesis: Collaborative Testing for LLM & Agentic Applications
<p align="center">
<img src="https://github.com/user-attachments/assets/ff43ca6a-ffde-4aff-9ff9-eec3897d0d02" alt="Rhesis AI Logo" height="80">
</p>
<p align="center">
<a href="https://github.com/rhesis-ai/rhesis/blob/main/LICENSE">
<img src="https://img.shields.io/badge/license-MIT%20%2B%20Enterprise-blue" alt="License">
</a>
<a href="https://pypi.org/project/rhesis-sdk/">
<img src="https://img.shields.io/pypi/v/rhesis-sdk" alt="PyPI Version">
</a>
<a href="https://pypi.org/project/rhesis-sdk/">
<img src="https://img.shields.io/pypi/pyversions/rhesis-sdk" alt="Python Versions">
</a>
<a href="https://codecov.io/gh/rhesis-ai/rhesis">
<img src="https://codecov.io/gh/rhesis-ai/rhesis/graph/badge.svg?token=1XQV983JEJ" alt="codecov">
</a>
<a href="https://discord.rhesis.ai">
<img src="https://img.shields.io/discord/1340989671601209408?color=7289da&label=Discord&logo=discord&logoColor=white" alt="Discord">
</a>
<a href="https://www.linkedin.com/company/rhesis-ai">
<img src="https://img.shields.io/badge/LinkedIn-Rhesis_AI-blue?logo=linkedin" alt="LinkedIn">
</a>
<a href="https://huggingface.co/rhesis">
<img src="https://img.shields.io/badge/🤗-Rhesis-yellow" alt="Hugging Face">
</a>
<a href="https://docs.rhesis.ai">
<img src="https://img.shields.io/badge/docs-rhesis.ai-blue" alt="Documentation">
</a>
</p>
<p align="center">
<a href="https://rhesis.ai"><strong>Website</strong></a> ·
<a href="https://docs.rhesis.ai"><strong>Docs</strong></a> ·
<a href="https://discord.rhesis.ai"><strong>Discord</strong></a> ·
<a href="https://github.com/rhesis-ai/rhesis/blob/main/CHANGELOG.md"><strong>Changelog</strong></a>
</p>
<h3 align="center">More than just evals.<br><strong>Collaborative agent testing for teams.</strong></h3>
<p align="center">
Generate tests from requirements, simulate conversation flows, detect adversarial behaviors, evaluate with 60+ metrics, and trace failures with OpenTelemetry. Engineers and domain experts, working together.
</p>
<p align="center">
<a href="https://rhesis.ai/?video=open" target="_blank">
<img src="https://raw.githubusercontent.com/rhesis-ai/rhesis/main/.github/images/GH_Short_Demo.png"
loading="lazy"
width="1080"
alt="Rhesis Platform Overview - Click to watch demo">
</a>
</p>
---
## Core features
<p align="center">
<img src="https://raw.githubusercontent.com/rhesis-ai/rhesis/main/.github/images/GH_Features.png"
loading="lazy"
width="1080"
alt="Rhesis Core Features">
</p>
### Test generation
**AI-Powered Synthesis** - Describe requirements in plain language. Rhesis generates hundreds of test scenarios including edge cases and adversarial prompts.
**Knowledge-Aware** - Connect context sources via file upload or MCP (Notion, GitHub, Jira, Confluence) for better test generation.
### Single-turn & conversation simulation
**Single-turn** for Q&A validation. **Conversation simulation** for dialogue flows.
**Penelope Agent** simulates realistic conversations to test context retention, role adherence, and dialogue coherence across extended interactions.
### Adversarial testing (red-teaming)
**Polyphemus Agent** proactively finds vulnerabilities:
- Jailbreak attempts and prompt injection
- PII leakage and data extraction
- Harmful content generation
- Role violation and instruction bypassing
**Garak Integration** - Built-in support for [garak](https://github.com/leondz/garak), the LLM vulnerability scanner, for comprehensive security testing.
### 60+ pre-built metrics
| Framework | Example Metrics |
|-----------|-----------------|
| **RAGAS** | Context relevance, faithfulness, answer accuracy |
| **DeepEval** | Bias, toxicity, PII leakage, role violation, turn relevancy, knowledge retention |
| **Garak** | Jailbreak detection, prompt injection, XSS, malware generation, data leakage |
| **Custom** | NumericJudge, CategoricalJudge for domain-specific evaluation |
All metrics include LLM-as-Judge reasoning explanations.
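To illustrate the LLM-as-Judge pattern behind these metrics, here is a minimal conceptual sketch of a categorical judge. This is not the Rhesis metric API: the names `categorical_judge` and `JudgeResult` are hypothetical, and the judge model call is stubbed.

```python
from dataclasses import dataclass


@dataclass
class JudgeResult:
    label: str      # e.g. "compliant" / "non_compliant"
    reasoning: str  # the judge model's explanation


def categorical_judge(output: str, criterion: str, labels: list[str],
                      call_llm=None) -> JudgeResult:
    """Ask a judge model to classify `output` against `criterion`.

    `call_llm` stands in for any chat-completion call; stubbed here.
    """
    prompt = (
        f"Criterion: {criterion}\n"
        f"Output to evaluate: {output}\n"
        f"Answer with one of {labels} and explain your reasoning."
    )
    if call_llm is None:
        # Stub: a real judge would parse the model's structured answer.
        return JudgeResult(label=labels[0], reasoning="(stubbed judge)")
    return JudgeResult(*call_llm(prompt))


result = categorical_judge(
    "I can't provide a diagnosis; please consult a doctor.",
    criterion="The chatbot must never provide a medical diagnosis.",
    labels=["compliant", "non_compliant"],
)
print(result.label)  # → compliant (stub path)
```

The key design point is that the judge returns a reasoning string alongside the label, which is what makes results reviewable by non-engineers.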
### Traces & observability
Monitor your LLM applications with OpenTelemetry-based tracing:
```python
from rhesis.sdk.decorators import observe

@observe.llm(model="gpt-4")
def generate_response(prompt: str) -> str:
    response = ...  # Your LLM call here
    return response
```
Track LLM calls, latency, token usage, and link traces to test results for debugging.
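Conceptually, an observability decorator wraps each call and records timing and metadata as a span. A minimal stdlib-only sketch of the idea (not the SDK's internals; `observe_llm` and the `traces` list are illustrative stand-ins):

```python
import functools
import time

traces: list[dict] = []  # stand-in for an OpenTelemetry exporter


def observe_llm(model: str):
    """Sketch of an observability decorator: records the function name,
    model, latency, and output size for each call."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            start = time.perf_counter()
            result = fn(*args, **kwargs)
            traces.append({
                "span": fn.__name__,
                "model": model,
                "latency_s": time.perf_counter() - start,
                "output_chars": len(str(result)),
            })
            return result
        return inner
    return wrap


@observe_llm(model="gpt-4")
def generate_response(prompt: str) -> str:
    return f"echo: {prompt}"  # stand-in for a real LLM call


generate_response("hello")
print(traces[0]["span"], traces[0]["model"])  # → generate_response gpt-4
```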
### Bring your own model
Use any LLM provider for test generation and evaluation:
**Cloud:** OpenAI, Anthropic, Google Gemini, Mistral, Cohere, Groq, Together AI
**Local/Self-hosted:** Ollama, vLLM, LiteLLM
See [Model Configuration Docs](https://docs.rhesis.ai/sdk/models) for setup instructions.
---
## Curated Test Sets on Hugging Face
We publish curated test datasets on [Hugging Face](https://huggingface.co/rhesis) to help teams assess their LLM applications. These test sets cover diverse evaluation scenarios across conversational AI, agentic systems, RAG applications, and more—helping you validate robustness, reliability, safety, and compliance.
### What's available
Test sets designed for:
- **Conversational AI** - Multi-turn dialogue, context retention, role adherence
- **Agentic Systems** - Tool selection, goal achievement, multi-agent coordination
- **RAG Systems** - Context relevance, faithfulness, hallucination detection
- **Adversarial Testing** - Jailbreak resistance, prompt injection, PII leakage
- **Domain-Specific Applications** - Finance, healthcare, customer support, sales, and more
### Using our test sets
**Option 1: Rhesis Platform**
1. Download a test set from [Hugging Face](https://huggingface.co/rhesis)
2. In the Rhesis platform, navigate to **Test Sets** → **Import from file**
3. Upload the downloaded CSV file
**Option 2: Python SDK**
```python
from rhesis.sdk import TestSet

# Load tests from a CSV file downloaded from Hugging Face
test_set = TestSet.from_csv(
    "tests.csv",
    name="Imported Tests",
    description="Tests imported from Hugging Face"
)
print(f"Loaded {len(test_set.tests)} tests")
```
> **Disclaimer:** Some test cases may contain sensitive or challenging content, included deliberately for thorough, realistic assessment. Review test cases carefully and exercise discretion when using them.
---
## Why Rhesis?
**Platform for teams. SDK for developers.**
Use the collaborative platform for team-based testing: product managers define requirements, domain experts review results, engineers integrate via CI/CD. Or integrate directly with the Python SDK for code-first workflows.
### The testing lifecycle
Six integrated phases from project setup to team collaboration:
| Phase | What You Do |
|--------------------------------|-------------|
| **[1. Projects](https://docs.rhesis.ai/platform/projects)** | Configure your AI application, upload & connect context sources (files, docs), set up SDK connectors |
| **[2. Requirements](https://docs.rhesis.ai/platform/behaviors)** | Define expected behaviors (what your app should and shouldn't do), cover all relevant aspects from product, marketing, customer support, legal and compliance teams |
| **[3. Metrics](https://docs.rhesis.ai/platform/metrics)** | Select from 60+ pre-built metrics or create custom LLM-as-Judge evaluations to assess whether your requirements are met |
| **[4. Tests](https://docs.rhesis.ai/platform/tests)** | Generate single-turn and conversation simulation test scenarios. Organize in test sets and understand your test coverage |
| **[5. Execution](https://docs.rhesis.ai/platform/test-execution)** | Run tests via UI, SDK, or API; integrate into CI/CD pipelines; collect traces during execution |
| **[6. Collaboration](https://docs.rhesis.ai/platform/test-runs)** | Review results with your team through comments, tasks, workflows, and side-by-side comparisons |
### Rhesis vs...
| Instead of... | Rhesis gives you... |
|---------------|---------------------|
| **Manual testing** | AI-generated test cases based on your context, hundreds in minutes |
| **Traditional test frameworks** | Non-deterministic output handling built-in |
| **LLM observability tools** | Pre-production validation, not post-production monitoring |
| **Red-teaming services** | Continuous, self-service adversarial testing, not one-time audits |
---
## What you can test
| Use Case | What Rhesis Tests |
|----------|-------------------|
| **Conversational AI** | Conversation simulation, role adherence, knowledge retention |
| **RAG Systems** | Context relevance, faithfulness, hallucination detection |
| **NL-to-SQL / NL-to-Code** | Query accuracy, syntax validation, edge case handling |
| **Agentic Systems** | Tool selection, goal achievement, multi-agent coordination |
---
## SDK: Code-first testing
Test your Python functions directly with the `@endpoint` decorator:
```python
from rhesis.sdk.decorators import endpoint

@endpoint(name="my-chatbot")
def chat(message: str) -> str:
    response = ...  # Your LLM logic here
    return response
```
**Features:** Zero configuration, automatic parameter binding, auto-reconnection, environment management (dev/staging/production).
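One plausible way "automatic parameter binding" can work is by matching incoming request fields against the decorated function's signature. A hypothetical stdlib sketch, not the SDK's actual mechanism:

```python
import inspect


def bind_params(fn, payload: dict) -> dict:
    """Sketch of automatic parameter binding: keep only the request
    fields that match the function's signature, ignoring extras."""
    sig = inspect.signature(fn)
    return {name: payload[name] for name in sig.parameters if name in payload}


def chat(message: str) -> str:
    return message.upper()


payload = {"message": "hi", "session_id": "abc"}  # extra field is ignored
print(chat(**bind_params(chat, payload)))  # → HI
```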
**Generate tests programmatically:**
```python
from rhesis.sdk.synthesizers import PromptSynthesizer

synthesizer = PromptSynthesizer(
    prompt="Generate tests for a medical chatbot that must never provide diagnosis"
)
test_set = synthesizer.generate(num_tests=10)
```
---
## Deployment options
| Option | Best For | Setup Time |
|--------|----------|------------|
| **[Rhesis Cloud](https://app.rhesis.ai)** | Teams wanting managed deployment | Instant |
| **Docker** | Local development and testing | 5 minutes |
| **Kubernetes** | Production self-hosting | [See docs](https://docs.rhesis.ai/getting-started/self-hosting) |
### Quick Start
**Option 1: Cloud (fastest)** - [app.rhesis.ai](https://app.rhesis.ai) - Managed service, just connect your app
**Option 2: Self-host with Docker**
```bash
git clone https://github.com/rhesis-ai/rhesis.git && cd rhesis && ./rh start
```
**Access:** Frontend at `localhost:3000`, API at `localhost:8080/docs`
**Commands:** `./rh logs` · `./rh stop` · `./rh restart` · `./rh delete`
> **Note:** This setup enables auto-login for local testing. For production, see [Self-hosting Documentation](https://docs.rhesis.ai/getting-started/self-hosting).
**Option 3: Python SDK**
```bash
pip install rhesis-sdk
```
---
## Integrations
Connect Rhesis to your LLM stack:
| Integration | Languages | Description |
|-------------|-----------|-------------|
| **Rhesis SDK** | Python, JS/TS | Native SDK with decorators for endpoints and observability. Full control over test execution and tracing. |
| **OpenAI** | Python | Drop-in replacement for OpenAI SDK. Automatic instrumentation with zero code changes. |
| **Anthropic** | Python | Native support for Claude models with automatic tracing. |
| **LangChain** | Python | Add Rhesis callback handler to your LangChain app for automatic tracing and test execution. |
| **LangGraph** | Python | Built-in integration for LangGraph agent workflows with full observability. |
| **AutoGen** | Python | Automatic instrumentation for Microsoft AutoGen multi-agent conversations. |
| **LiteLLM** | Python | Unified interface for 100+ LLMs (OpenAI, Azure, Anthropic, Cohere, Ollama, vLLM, HuggingFace, Replicate). |
| **Google Gemini** | Python | Native integration for Google's Gemini models. |
| **Ollama** | Python | Local LLM deployment with Ollama integration. |
| **OpenRouter** | Python | Access to multiple LLM providers through OpenRouter. |
| **Vertex AI** | Python | Google Cloud Vertex AI model support. |
| **HuggingFace** | Python | Direct integration with HuggingFace models. |
| **REST API** | Any | Direct API access for custom integrations. [OpenAPI spec available](https://api.rhesis.ai/docs). |
See [Integration Docs](https://docs.rhesis.ai/development) for setup instructions.
---
## Open source
[MIT licensed](LICENSE). No plans to relicense core features. Enterprise version will live in `ee/` folders and remain separate.
We built Rhesis because existing LLM testing tools didn't meet our needs. If you face the same challenges, contributions are welcome.
---
## Contributing
See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.
**Ways to contribute:** Fix bugs or add features · Contribute test sets for common failure modes · Improve documentation · Help others in Discord or GitHub discussions
---
## Support
- **[Documentation](https://docs.rhesis.ai)** - Guides and API reference
- **[Discord](https://discord.rhesis.ai)** - Community support
- **[GitHub Issues](https://github.com/rhesis-ai/rhesis/issues)** - Bug reports and feature requests
---
## Security & privacy
We take data security seriously. See our [Privacy Policy](https://rhesis.ai/privacy-policy) for details.
**Telemetry:** Rhesis collects basic, anonymized usage statistics to improve the product. No sensitive data is collected or shared with third parties.
- **Self-hosted:** Opt out by setting `OTEL_RHESIS_TELEMETRY_ENABLED=false`
- **Cloud:** Telemetry enabled as part of Terms & Conditions
---
<p align="center">
<strong>Made with <img src="https://github.com/user-attachments/assets/598c2d81-572c-46bd-b718-dee32cdc749c" height="16" alt="Rhesis logo"> in Potsdam, Germany 🇩🇪</strong>
</p>
<p align="center">
<a href="https://rhesis.ai">Learn more at rhesis.ai</a>
</p>