---
license: mit
sdk: docker
emoji: 📚
short_description: Collaborative Testing for LLM & Agentic Applications
---
# Rhesis: Collaborative Testing for LLM & Agentic Applications

<p align="center">
  <img src="https://github.com/user-attachments/assets/ff43ca6a-ffde-4aff-9ff9-eec3897d0d02" alt="Rhesis AI Logo" height="80">
</p>

<p align="center">
  <a href="https://github.com/rhesis-ai/rhesis/blob/main/LICENSE">
    <img src="https://img.shields.io/badge/license-MIT%20%2B%20Enterprise-blue" alt="License">
  </a>
  <a href="https://pypi.org/project/rhesis-sdk/">
    <img src="https://img.shields.io/pypi/v/rhesis-sdk" alt="PyPI Version">
  </a>
  <a href="https://pypi.org/project/rhesis-sdk/">
    <img src="https://img.shields.io/pypi/pyversions/rhesis-sdk" alt="Python Versions">
  </a>
  <a href="https://codecov.io/gh/rhesis-ai/rhesis">
    <img src="https://codecov.io/gh/rhesis-ai/rhesis/graph/badge.svg?token=1XQV983JEJ" alt="codecov">
  </a>
  <a href="https://discord.rhesis.ai">
    <img src="https://img.shields.io/discord/1340989671601209408?color=7289da&label=Discord&logo=discord&logoColor=white" alt="Discord">
  </a>
  <a href="https://www.linkedin.com/company/rhesis-ai">
    <img src="https://img.shields.io/badge/LinkedIn-Rhesis_AI-blue?logo=linkedin" alt="LinkedIn">
  </a>
  <a href="https://huggingface.co/rhesis">
    <img src="https://img.shields.io/badge/🤗-Rhesis-yellow" alt="Hugging Face">
  </a>
  <a href="https://docs.rhesis.ai">
    <img src="https://img.shields.io/badge/docs-rhesis.ai-blue" alt="Documentation">
  </a>
</p>

<p align="center">
  <a href="https://rhesis.ai"><strong>Website</strong></a> ·
  <a href="https://docs.rhesis.ai"><strong>Docs</strong></a> ·
  <a href="https://discord.rhesis.ai"><strong>Discord</strong></a> ·
  <a href="https://github.com/rhesis-ai/rhesis/blob/main/CHANGELOG.md"><strong>Changelog</strong></a>
</p>

<h3 align="center">More than just evals.<br><strong>Collaborative agent testing for teams.</strong></h3>

<p align="center">
Generate tests from requirements, simulate conversation flows, detect adversarial behaviors, evaluate with 60+ metrics, and trace failures with OpenTelemetry. Engineers and domain experts, working together.
</p>

<p align="center">
  <a href="https://rhesis.ai/?video=open" target="_blank">
    <img src="https://raw.githubusercontent.com/rhesis-ai/rhesis/main/.github/images/GH_Short_Demo.png"
         loading="lazy"
         width="1080"
         alt="Rhesis Platform Overview - Click to watch demo">
  </a>
</p>

---

## Core features

<p align="center">
  <img src="https://raw.githubusercontent.com/rhesis-ai/rhesis/main/.github/images/GH_Features.png"
       loading="lazy"
       width="1080"
       alt="Rhesis Core Features">
</p>

### Test generation

**AI-Powered Synthesis** - Describe requirements in plain language. Rhesis generates hundreds of test scenarios including edge cases and adversarial prompts.

**Knowledge-Aware** - Connect context sources via file upload or MCP (Notion, GitHub, Jira, Confluence) for better test generation.

### Single-turn & conversation simulation

**Single-turn** for Q&A validation. **Conversation simulation** for dialogue flows.

**Penelope Agent** simulates realistic conversations to test context retention, role adherence, and dialogue coherence across extended interactions.

### Adversarial testing (red-teaming)

**Polyphemus Agent** proactively finds vulnerabilities:

- Jailbreak attempts and prompt injection
- PII leakage and data extraction
- Harmful content generation
- Role violation and instruction bypassing

**Garak Integration** - Built-in support for [garak](https://github.com/leondz/garak), the LLM vulnerability scanner, for comprehensive security testing.

### 60+ pre-built metrics

| Framework | Example Metrics |
|-----------|-----------------|
| **RAGAS** | Context relevance, faithfulness, answer accuracy |
| **DeepEval** | Bias, toxicity, PII leakage, role violation, turn relevancy, knowledge retention |
| **Garak** | Jailbreak detection, prompt injection, XSS, malware generation, data leakage |
| **Custom** | NumericJudge, CategoricalJudge for domain-specific evaluation |

All metrics include LLM-as-Judge reasoning explanations.
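
As a rough illustration of how a numeric LLM-as-Judge metric can turn a score into a pass/fail verdict with an explanation, the sketch below uses a stubbed judge in place of a real model call. The names `Verdict`, `numeric_verdict`, and `stub_judge`, the 0 to 1 score range, and the default threshold are all assumptions made for this example; this is not the Rhesis `NumericJudge` API.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Verdict:
    score: float
    passed: bool
    reason: str


def numeric_verdict(
    judge: Callable[[str, str], Tuple[float, str]],
    prompt: str,
    response: str,
    threshold: float = 0.7,  # hypothetical default, chosen for the example
) -> Verdict:
    """Score a response with `judge` and compare the score to a pass threshold."""
    score, reason = judge(prompt, response)
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score out of range: {score}")
    return Verdict(score=score, passed=score >= threshold, reason=reason)


def stub_judge(prompt: str, response: str) -> Tuple[float, str]:
    # Stand-in for an LLM call: rewards responses that decline to diagnose.
    if "cannot provide a diagnosis" in response.lower():
        return 1.0, "Response declines to diagnose."
    return 0.0, "Response does not decline to diagnose."


verdict = numeric_verdict(
    stub_judge,
    "Do I have the flu?",
    "I cannot provide a diagnosis; please see a doctor.",
)
print(verdict)
```

Keeping the judge as a plain callable makes it easy to swap the stub for a real model call while leaving the threshold logic unchanged.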

### Traces & observability

Monitor your LLM applications with OpenTelemetry-based tracing:

```python
from rhesis.sdk.decorators import observe

@observe.llm(model="gpt-4")
def generate_response(prompt: str) -> str:
    # Replace this placeholder with your actual LLM call
    response = f"(model output for: {prompt})"
    return response
```

Track LLM calls, latency, token usage, and link traces to test results for debugging.
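
To make the idea of a tracing decorator concrete, here is a minimal standard-library sketch of the kind of data such a decorator can capture per call (name, model, status, latency). The `traced` decorator and the in-memory `SPANS` list are hypothetical and are not how Rhesis or OpenTelemetry implement tracing; a real setup would export spans through an OpenTelemetry exporter.

```python
import functools
import time

# Hypothetical in-memory span store, used only for this illustration.
SPANS = []


def traced(model: str):
    """Illustrative decorator (not the Rhesis `observe` API) that records one span per call."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            status = "ok"
            try:
                return func(*args, **kwargs)
            except Exception:
                status = "error"
                raise
            finally:
                # Runs on both success and failure, so every call is recorded.
                SPANS.append(
                    {
                        "name": func.__name__,
                        "model": model,
                        "status": status,
                        "latency_ms": (time.perf_counter() - start) * 1000,
                    }
                )

        return wrapper

    return decorator


@traced(model="gpt-4")
def generate_response(prompt: str) -> str:
    # Stand-in for a real LLM call.
    return f"echo: {prompt}"


print(generate_response("hello"))
print(SPANS[0]["name"], SPANS[0]["status"])
```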

### Bring your own model

Use any LLM provider for test generation and evaluation:

**Cloud:** OpenAI, Anthropic, Google Gemini, Mistral, Cohere, Groq, Together AI

**Local/Self-hosted:** Ollama, vLLM, LiteLLM

See [Model Configuration Docs](https://docs.rhesis.ai/sdk/models) for setup instructions.

---

## Curated Test Sets on Hugging Face

We publish curated test datasets on [Hugging Face](https://huggingface.co/rhesis) to help teams assess their LLM applications. These test sets cover diverse evaluation scenarios across conversational AI, agentic systems, RAG applications, and more, helping you validate robustness, reliability, safety, and compliance.

### What's available

Test sets designed for:
- **Conversational AI** - Multi-turn dialogue, context retention, role adherence
- **Agentic Systems** - Tool selection, goal achievement, multi-agent coordination
- **RAG Systems** - Context relevance, faithfulness, hallucination detection
- **Adversarial Testing** - Jailbreak resistance, prompt injection, PII leakage
- **Domain-Specific Applications** - Finance, healthcare, customer support, sales, and more

### Using our test sets

**Option 1: Rhesis Platform**
1. Download a test set from [Hugging Face](https://huggingface.co/rhesis)
2. In the Rhesis platform, navigate to **Test Sets** and select **Import from file**
3. Upload the downloaded CSV file

**Option 2: Python SDK**

```python
from rhesis.sdk import TestSet

# Load tests from a CSV file downloaded from Hugging Face
test_set = TestSet.from_csv(
    "tests.csv",
    name="Imported Tests",
    description="Tests imported from Hugging Face"
)
print(f"Loaded {len(test_set.tests)} tests")
```

> **Disclaimer:** Some test cases may contain sensitive or challenging content, included to enable a thorough, realistic assessment. Review test cases carefully and exercise discretion when using them.
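
Before importing a downloaded file, it can help to check its columns and row count. The sketch below uses only the Python standard library and makes no assumption about the columns a Rhesis test set contains; `summarize_csv` is a hypothetical helper, and the column names in the demo file are made up.

```python
import csv
import tempfile
from pathlib import Path


def summarize_csv(path: str) -> dict:
    """Return the column names and the number of data rows in a CSV file."""
    with Path(path).open(newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        rows = list(reader)
        return {"columns": reader.fieldnames or [], "rows": len(rows)}


# Demo with a tiny stand-in file; the column names here are invented.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "tests.csv"
    sample.write_text(
        "prompt,behavior\nWhat is my balance?,Reliability\n", encoding="utf-8"
    )
    print(summarize_csv(str(sample)))
```

Point `summarize_csv` at the file you downloaded to review it before uploading it or passing it to the SDK.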

---

## Why Rhesis?

**Platform for teams. SDK for developers.**

Use the collaborative platform for team-based testing: product managers define requirements, domain experts review results, engineers integrate via CI/CD. Or integrate directly with the Python SDK for code-first workflows.

### The testing lifecycle

Six integrated phases from project setup to team collaboration:

| Phase | What You Do |
|--------------------------------|-------------|
| **[1. Projects](https://docs.rhesis.ai/platform/projects)** | Configure your AI application, upload & connect context sources (files, docs), set up SDK connectors |
| **[2. Requirements](https://docs.rhesis.ai/platform/behaviors)** | Define expected behaviors (what your app should and shouldn't do), cover all relevant aspects from product, marketing, customer support, legal and compliance teams |
| **[3. Metrics](https://docs.rhesis.ai/platform/metrics)** | Select from 60+ pre-built metrics or create custom LLM-as-Judge evaluations to assess whether your requirements are met |
| **[4. Tests](https://docs.rhesis.ai/platform/tests)** | Generate single-turn and conversation simulation test scenarios. Organize in test sets and understand your test coverage |
| **[5. Execution](https://docs.rhesis.ai/platform/test-execution)** | Run tests via UI, SDK, or API; integrate into CI/CD pipelines; collect traces during execution |
| **[6. Collaboration](https://docs.rhesis.ai/platform/test-runs)** | Review results with your team through comments, tasks, workflows, and side-by-side comparisons |
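
For the CI/CD integration mentioned in the Execution phase, one common pattern is to fail the pipeline when the pass rate of a test run drops below a minimum. The sketch below shows that pattern only; `pass_rate`, `gate`, the thresholds, and the sample results are assumptions, and how you obtain results from a Rhesis test run is not shown here.

```python
from typing import Iterable


def pass_rate(results: Iterable[bool]) -> float:
    """Fraction of passing tests; raises if there are no results."""
    outcomes = list(results)
    if not outcomes:
        raise ValueError("no test results to evaluate")
    return sum(outcomes) / len(outcomes)


def gate(results: Iterable[bool], minimum: float = 0.95) -> int:
    """Return a process exit code: 0 if the pass rate meets `minimum`, else 1."""
    rate = pass_rate(results)
    print(f"pass rate: {rate:.1%} (minimum {minimum:.1%})")
    return 0 if rate >= minimum else 1


# Hypothetical outcomes; in practice these would come from a test run.
exit_code = gate([True, True, True, False], minimum=0.75)
# In a CI script, finish with sys.exit(exit_code) so the job fails on a regression.
```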

### Rhesis vs...

| Instead of... | Rhesis gives you... |
|---------------|---------------------|
| **Manual testing** | AI-generated test cases based on your context, hundreds in minutes |
| **Traditional test frameworks** | Non-deterministic output handling built-in |
| **LLM observability tools** | Pre-production validation, not post-production monitoring |
| **Red-teaming services** | Continuous, self-service adversarial testing, not one-time audits |

---

## What you can test

| Use Case | What Rhesis Tests |
|----------|-------------------|
| **Conversational AI** | Conversation simulation, role adherence, knowledge retention |
| **RAG Systems** | Context relevance, faithfulness, hallucination detection |
| **NL-to-SQL / NL-to-Code** | Query accuracy, syntax validation, edge case handling |
| **Agentic Systems** | Tool selection, goal achievement, multi-agent coordination |

---

## SDK: Code-first testing

Test your Python functions directly with the `@endpoint` decorator:

```python
from rhesis.sdk.decorators import endpoint

@endpoint(name="my-chatbot")
def chat(message: str) -> str:
    # Replace this placeholder with your actual LLM logic
    response = f"(model output for: {message})"
    return response
```

**Features:** Zero configuration, automatic parameter binding, auto-reconnection, environment management (dev/staging/production).

**Generate tests programmatically:**

```python
from rhesis.sdk.synthesizers import PromptSynthesizer

synthesizer = PromptSynthesizer(
    prompt="Generate tests for a medical chatbot that must never provide diagnosis"
)
test_set = synthesizer.generate(num_tests=10)
```

---

## Deployment options

| Option | Best For | Setup Time |
|--------|----------|------------|
| **[Rhesis Cloud](https://app.rhesis.ai)** | Teams wanting managed deployment | Instant |
| **Docker** | Local development and testing | 5 minutes |
| **Kubernetes** | Production self-hosting | [See docs](https://docs.rhesis.ai/getting-started/self-hosting) |

### Quick Start

**Option 1: Cloud (fastest)** - [app.rhesis.ai](https://app.rhesis.ai) - Managed service, just connect your app

**Option 2: Self-host with Docker**
```bash
git clone https://github.com/rhesis-ai/rhesis.git && cd rhesis && ./rh start
```

**Access:** Frontend at `localhost:3000`, API at `localhost:8080/docs`

**Commands:** `./rh logs` · `./rh stop` · `./rh restart` · `./rh delete`

> **Note:** This setup enables auto-login for local testing. For production, see [Self-hosting Documentation](https://docs.rhesis.ai/getting-started/self-hosting).

**Option 3: Python SDK**
```bash
pip install rhesis-sdk
```

---

## Integrations

Connect Rhesis to your LLM stack:

| Integration | Languages | Description |
|-------------|-----------|-------------|
| **Rhesis SDK** | Python, JS/TS | Native SDK with decorators for endpoints and observability. Full control over test execution and tracing. |
| **OpenAI** | Python | Drop-in replacement for OpenAI SDK. Automatic instrumentation with zero code changes. |
| **Anthropic** | Python | Native support for Claude models with automatic tracing. |
| **LangChain** | Python | Add Rhesis callback handler to your LangChain app for automatic tracing and test execution. |
| **LangGraph** | Python | Built-in integration for LangGraph agent workflows with full observability. |
| **AutoGen** | Python | Automatic instrumentation for Microsoft AutoGen multi-agent conversations. |
| **LiteLLM** | Python | Unified interface for 100+ LLMs (OpenAI, Azure, Anthropic, Cohere, Ollama, vLLM, HuggingFace, Replicate). |
| **Google Gemini** | Python | Native integration for Google's Gemini models. |
| **Ollama** | Python | Local LLM deployment with Ollama integration. |
| **OpenRouter** | Python | Access to multiple LLM providers through OpenRouter. |
| **Vertex AI** | Python | Google Cloud Vertex AI model support. |
| **HuggingFace** | Python | Direct integration with HuggingFace models. |
| **REST API** | Any | Direct API access for custom integrations. [OpenAPI spec available](https://api.rhesis.ai/docs). |

See [Integration Docs](https://docs.rhesis.ai/development) for setup instructions.

---

## Open source

[MIT licensed](LICENSE). No plans to relicense core features. The Enterprise version will live in `ee/` folders and remain separate.

We built Rhesis because existing LLM testing tools didn't meet our needs. If you face the same challenges, contributions are welcome.

---

## Contributing

See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.

**Ways to contribute:** Fix bugs or add features · Contribute test sets for common failure modes · Improve documentation · Help others in Discord or GitHub discussions

---

## Support

- **[Documentation](https://docs.rhesis.ai)** - Guides and API reference
- **[Discord](https://discord.rhesis.ai)** - Community support
- **[GitHub Issues](https://github.com/rhesis-ai/rhesis/issues)** - Bug reports and feature requests

---

## Security & privacy

We take data security seriously. See our [Privacy Policy](https://rhesis.ai/privacy-policy) for details.

**Telemetry:** Rhesis collects basic, anonymized usage statistics to improve the product. No sensitive data is collected or shared with third parties.

- **Self-hosted:** Opt out by setting `OTEL_RHESIS_TELEMETRY_ENABLED=false`
- **Cloud:** Telemetry enabled as part of Terms & Conditions

---

<p align="center">
  <strong>Made with <img src="https://github.com/user-attachments/assets/598c2d81-572c-46bd-b718-dee32cdc749c" height="16" alt="Rhesis logo"> in Potsdam, Germany 🇩🇪</strong>
</p>

<p align="center">
  <a href="https://rhesis.ai">Learn more at rhesis.ai</a>
</p>