	---
	license: mit
	sdk: docker
	emoji: 📚
	short_description: Collaborative Testing for LLM & Agentic Applications
	---
	# Rhesis: Collaborative Testing for LLM & Agentic Applications

	<p align="center">
	<img src="https://github.com/user-attachments/assets/ff43ca6a-ffde-4aff-9ff9-eec3897d0d02" alt="Rhesis AI Logo" height="80">
	</p>

	<p align="center">
	<a href="https://github.com/rhesis-ai/rhesis/blob/main/LICENSE">
	<img src="https://img.shields.io/badge/license-MIT%20%2B%20Enterprise-blue" alt="License">
	</a>
	<a href="https://pypi.org/project/rhesis-sdk/">
	<img src="https://img.shields.io/pypi/v/rhesis-sdk" alt="PyPI Version">
	</a>
	<a href="https://pypi.org/project/rhesis-sdk/">
	<img src="https://img.shields.io/pypi/pyversions/rhesis-sdk" alt="Python Versions">
	</a>
	<a href="https://codecov.io/gh/rhesis-ai/rhesis">
	<img src="https://codecov.io/gh/rhesis-ai/rhesis/graph/badge.svg?token=1XQV983JEJ" alt="codecov">
	</a>
	<a href="https://discord.rhesis.ai">
	<img src="https://img.shields.io/discord/1340989671601209408?color=7289da&label=Discord&logo=discord&logoColor=white" alt="Discord">
	</a>
	<a href="https://www.linkedin.com/company/rhesis-ai">
	<img src="https://img.shields.io/badge/LinkedIn-Rhesis_AI-blue?logo=linkedin" alt="LinkedIn">
	</a>
	<a href="https://huggingface.co/rhesis">
	<img src="https://img.shields.io/badge/🤗-Rhesis-yellow" alt="Hugging Face">
	</a>
	<a href="https://docs.rhesis.ai">
	<img src="https://img.shields.io/badge/docs-rhesis.ai-blue" alt="Documentation">
	</a>
	</p>

	<p align="center">
	<a href="https://rhesis.ai"><strong>Website</strong></a> ·
	<a href="https://docs.rhesis.ai"><strong>Docs</strong></a> ·
	<a href="https://discord.rhesis.ai"><strong>Discord</strong></a> ·
	<a href="https://github.com/rhesis-ai/rhesis/blob/main/CHANGELOG.md"><strong>Changelog</strong></a>
	</p>

	<h3 align="center">More than just evals.<br><strong>Collaborative agent testing for teams.</strong></h3>

	<p align="center">
	Generate tests from requirements, simulate conversation flows, detect adversarial behaviors, evaluate with 60+ metrics, and trace failures with OpenTelemetry. Engineers and domain experts, working together.
	</p>

	<p align="center">
	<a href="https://rhesis.ai/?video=open" target="_blank">
	<img src="https://raw.githubusercontent.com/rhesis-ai/rhesis/main/.github/images/GH_Short_Demo.png"
	loading="lazy"
	width="1080"
	alt="Rhesis Platform Overview - Click to watch demo">
	</a>
	</p>

	---

	## Core features

	<p align="center">
	<img src="https://raw.githubusercontent.com/rhesis-ai/rhesis/main/.github/images/GH_Features.png"
	loading="lazy"
	width="1080"
	alt="Rhesis Core Features">
	</p>

	### Test generation

	AI-Powered Synthesis - Describe requirements in plain language. Rhesis generates hundreds of test scenarios including edge cases and adversarial prompts.

	Knowledge-Aware - Connect context sources via file upload or MCP (Notion, GitHub, Jira, Confluence) so that generated tests draw on your own documentation.

	### Single-turn & conversation simulation

	Use single-turn tests for Q&A validation and conversation simulation for multi-turn dialogue flows.

	Penelope Agent simulates realistic conversations to test context retention, role adherence, and dialogue coherence across extended interactions.

	### Adversarial testing (red-teaming)

	Polyphemus Agent proactively finds vulnerabilities:

	- Jailbreak attempts and prompt injection
	- PII leakage and data extraction
	- Harmful content generation
	- Role violation and instruction bypassing

	Garak Integration - Built-in support for [garak](https://github.com/leondz/garak), the LLM vulnerability scanner, for comprehensive security testing.

	### 60+ pre-built metrics

	| Framework | Example Metrics |
	|-----------|-----------------|
	| RAGAS | Context relevance, faithfulness, answer accuracy |
	| DeepEval | Bias, toxicity, PII leakage, role violation, turn relevancy, knowledge retention |
	| Garak | Jailbreak detection, prompt injection, XSS, malware generation, data leakage |
	| Custom | NumericJudge, CategoricalJudge for domain-specific evaluation |

	All metrics include LLM-as-Judge reasoning explanations.
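
As a rough illustration of how a numeric LLM-as-Judge verdict might be turned into a pass/fail result with its reasoning attached, here is a standard-library sketch. The `JudgeVerdict` class, the `passes` helper, and the 0.7 threshold are hypothetical and are not the Rhesis `NumericJudge` API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JudgeVerdict:
    """Hypothetical container for one judge result (not a Rhesis class)."""
    metric: str
    score: float      # assumed to be normalized to the range 0.0-1.0
    reasoning: str    # the judge's explanation for the score


def passes(verdict: JudgeVerdict, threshold: float = 0.7) -> bool:
    """Return True when the score meets or exceeds the threshold."""
    if not 0.0 <= verdict.score <= 1.0:
        raise ValueError(f"score out of range: {verdict.score}")
    return verdict.score >= threshold


verdict = JudgeVerdict(
    metric="faithfulness",
    score=0.82,
    reasoning="The answer is supported by the retrieved context.",
)
print(verdict.metric, "PASS" if passes(verdict) else "FAIL", "-", verdict.reasoning)
```

Keeping the reasoning next to the score lets a reviewer see why a test failed, not only that it did.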

	### Traces & observability

	Monitor your LLM applications with OpenTelemetry-based tracing:

	```python
	from rhesis.sdk.decorators import observe

	@observe.llm(model="gpt-4")
	def generate_response(prompt: str) -> str:
	    # Replace this placeholder with your actual LLM call
	    response = f"(model output for: {prompt})"
	    return response
	```

	Track LLM calls, latency, token usage, and link traces to test results for debugging.
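
To make concrete what a tracing decorator records, the sketch below times a function call and stores a span-like record in memory using only the standard library. It is not the Rhesis `observe` implementation and does not use OpenTelemetry; `traced`, `SPANS`, and the record fields are hypothetical names.

```python
import functools
import time
from typing import Any, Callable

# In-memory stand-in for a trace exporter (hypothetical, for illustration only).
SPANS: list[dict[str, Any]] = []


def traced(model: str) -> Callable:
    """Decorator factory that records name, model, latency, and status per call."""
    def decorator(func: Callable) -> Callable:
        @functools.wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            start = time.perf_counter()
            status = "ok"
            try:
                return func(*args, **kwargs)
            except Exception:
                status = "error"
                raise
            finally:
                # Runs on both success and failure, so every call leaves a record.
                SPANS.append({
                    "name": func.__name__,
                    "model": model,
                    "latency_s": time.perf_counter() - start,
                    "status": status,
                })
        return wrapper
    return decorator


@traced(model="gpt-4")
def generate_response(prompt: str) -> str:
    return f"echo: {prompt}"  # placeholder for a real LLM call


print(generate_response("hello"))
print(SPANS[0]["name"], SPANS[0]["status"])
```

A real tracer would export these records to a collector instead of a list, but the captured fields (name, model, latency, status) are the same kind of data.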

	### Bring your own model

	Use any LLM provider for test generation and evaluation:

	Cloud: OpenAI, Anthropic, Google Gemini, Mistral, Cohere, Groq, Together AI

	Local/Self-hosted: Ollama, vLLM, LiteLLM

	See [Model Configuration Docs](https://docs.rhesis.ai/sdk/models) for setup instructions.

	---

	## Curated Test Sets on Hugging Face

	We publish curated test datasets on [Hugging Face](https://huggingface.co/rhesis) to help teams assess their LLM applications. These test sets cover diverse evaluation scenarios across conversational AI, agentic systems, RAG applications, and more, helping you validate robustness, reliability, safety, and compliance.

	### What's available

	Test sets designed for:
	- Conversational AI - Multi-turn dialogue, context retention, role adherence
	- Agentic Systems - Tool selection, goal achievement, multi-agent coordination
	- RAG Systems - Context relevance, faithfulness, hallucination detection
	- Adversarial Testing - Jailbreak resistance, prompt injection, PII leakage
	- Domain-Specific Applications - Finance, healthcare, customer support, sales, and more

	### Using our test sets

	Option 1: Rhesis Platform
	1. Download a test set from [Hugging Face](https://huggingface.co/rhesis)
	2. In the Rhesis platform, navigate to Test Sets → Import from file
	3. Upload the downloaded CSV file

	Option 2: Python SDK

	```python
	from rhesis.sdk import TestSet

	# Load tests from a CSV file downloaded from Hugging Face
	test_set = TestSet.from_csv(
	    "tests.csv",
	    name="Imported Tests",
	    description="Tests imported from Hugging Face",
	)
	print(f"Loaded {len(test_set.tests)} tests")
	```

	> Disclaimer: Some test cases may contain sensitive or challenging content, included to make assessment thorough and realistic. Review test cases carefully and use discretion when running them.
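
Before importing a downloaded file, it can help to check its columns and row count. The sketch below uses only Python's `csv` module; the column names in the sample (`prompt`, `behavior`) are made up for illustration and may not match the actual test set schema.

```python
import csv
import io
from typing import TextIO


def summarize_csv(handle: TextIO) -> tuple[list[str], int]:
    """Return the header columns and the number of data rows."""
    reader = csv.DictReader(handle)
    row_count = sum(1 for _ in reader)
    return list(reader.fieldnames or []), row_count


# Hypothetical sample; for a real download, pass
# open("tests.csv", newline="", encoding="utf-8") instead.
sample = io.StringIO("prompt,behavior\nHello,Compliance\nHi there,Reliability\n")
columns, rows = summarize_csv(sample)
print(columns, rows)
```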

	---

	## Why Rhesis?

	Platform for teams. SDK for developers.

	Use the collaborative platform for team-based testing: product managers define requirements, domain experts review results, engineers integrate via CI/CD. Or integrate directly with the Python SDK for code-first workflows.

	### The testing lifecycle

	Six integrated phases from project setup to team collaboration:

	| Phase | What You Do |
	|--------------------------------|-------------|
	| [1. Projects](https://docs.rhesis.ai/platform/projects) | Configure your AI application, upload & connect context sources (files, docs), set up SDK connectors |
	| [2. Requirements](https://docs.rhesis.ai/platform/behaviors) | Define expected behaviors (what your app should and shouldn't do), cover all relevant aspects from product, marketing, customer support, legal and compliance teams |
	| [3. Metrics](https://docs.rhesis.ai/platform/metrics) | Select from 60+ pre-built metrics or create custom LLM-as-Judge evaluations to assess whether your requirements are met |
	| [4. Tests](https://docs.rhesis.ai/platform/tests) | Generate single-turn and conversation simulation test scenarios. Organize in test sets and understand your test coverage |
	| [5. Execution](https://docs.rhesis.ai/platform/test-execution) | Run tests via UI, SDK, or API; integrate into CI/CD pipelines; collect traces during execution |
	| [6. Collaboration](https://docs.rhesis.ai/platform/test-runs) | Review results with your team through comments, tasks, workflows, and side-by-side comparisons |
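
For the CI/CD step in phase 5, one common pattern is to fail the pipeline when the pass rate drops below a threshold. The sketch below shows that gate in plain Python; the result dictionaries, the `passed` key, and the 0.9 threshold are assumptions for illustration, not the format Rhesis returns.

```python
def pass_rate(results: list[dict]) -> float:
    """Fraction of results whose 'passed' flag is true; 0.0 for an empty run."""
    if not results:
        return 0.0
    return sum(1 for r in results if r["passed"]) / len(results)


def gate(results: list[dict], threshold: float = 0.9) -> int:
    """Return a process exit code: 0 if the pass rate meets the threshold, else 1."""
    return 0 if pass_rate(results) >= threshold else 1


# Hypothetical results from a test run.
results = [
    {"test": "refuses-diagnosis", "passed": True},
    {"test": "no-pii-leak", "passed": True},
    {"test": "stays-in-role", "passed": False},
]
exit_code = gate(results)
print(f"pass rate {pass_rate(results):.2f}, exit code {exit_code}")
# In a CI script, finish with: raise SystemExit(exit_code)
```

Treating an empty run as a failure is a deliberate choice here, so that a misconfigured job that executes no tests does not pass silently.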

	### Rhesis vs...

	| Instead of... | Rhesis gives you... |
	|---------------|---------------------|
	| Manual testing | AI-generated test cases based on your context, hundreds in minutes |
	| Traditional test frameworks | Non-deterministic output handling built-in |
	| LLM observability tools | Pre-production validation, not post-production monitoring |
	| Red-teaming services | Continuous, self-service adversarial testing, not one-time audits |

	---

	## What you can test

	| Use Case | What Rhesis Tests |
	|----------|-------------------|
	| Conversational AI | Conversation simulation, role adherence, knowledge retention |
	| RAG Systems | Context relevance, faithfulness, hallucination detection |
	| NL-to-SQL / NL-to-Code | Query accuracy, syntax validation, edge case handling |
	| Agentic Systems | Tool selection, goal achievement, multi-agent coordination |

	---

	## SDK: Code-first testing

	Test your Python functions directly with the `@endpoint` decorator:

	```python
	from rhesis.sdk.decorators import endpoint

	@endpoint(name="my-chatbot")
	def chat(message: str) -> str:
	    # Replace this placeholder with your actual LLM logic
	    response = f"(model output for: {message})"
	    return response
	```

	Features: Zero configuration, automatic parameter binding, auto-reconnection, environment management (dev/staging/production).
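
One way to picture "automatic parameter binding" is mapping the fields of an incoming request onto a function's parameters by name. The sketch below does this with `inspect.signature`; `REGISTRY`, `register`, and `invoke` are hypothetical names, and this is not how the Rhesis `@endpoint` decorator is implemented.

```python
import inspect
from typing import Any, Callable

# Hypothetical registry mapping endpoint names to functions.
REGISTRY: dict[str, Callable[..., Any]] = {}


def register(name: str) -> Callable:
    """Decorator that stores the function under the given name, unchanged."""
    def decorator(func: Callable) -> Callable:
        REGISTRY[name] = func
        return func
    return decorator


def invoke(name: str, payload: dict[str, Any]) -> Any:
    """Call a registered function with only the payload keys it accepts."""
    func = REGISTRY[name]
    params = inspect.signature(func).parameters
    kwargs = {key: value for key, value in payload.items() if key in params}
    return func(**kwargs)


@register(name="my-chatbot")
def chat(message: str) -> str:
    return f"You said: {message}"  # placeholder for real LLM logic


print(invoke("my-chatbot", {"message": "hi", "session_id": "abc"}))
```

Extra payload keys such as `session_id` are dropped because `chat` does not declare them, which is what lets the caller stay unaware of the function's exact signature.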

	Generate tests programmatically:

	```python
	from rhesis.sdk.synthesizers import PromptSynthesizer

	synthesizer = PromptSynthesizer(
	    prompt="Generate tests for a medical chatbot that must never provide a diagnosis"
	)
	test_set = synthesizer.generate(num_tests=10)
	```

	---

	## Deployment options

	| Option | Best For | Setup Time |
	|--------|----------|------------|
	| [Rhesis Cloud](https://app.rhesis.ai) | Teams wanting managed deployment | Instant |
	| Docker | Local development and testing | 5 minutes |
	| Kubernetes | Production self-hosting | [See docs](https://docs.rhesis.ai/getting-started/self-hosting) |

	### Quick start

	Option 1: Cloud (fastest) - [app.rhesis.ai](https://app.rhesis.ai) - Managed service, just connect your app

	Option 2: Self-host with Docker
	```bash
	git clone https://github.com/rhesis-ai/rhesis.git && cd rhesis && ./rh start
	```

	Access: Frontend at `localhost:3000`, API at `localhost:8080/docs`

	Commands: `./rh logs` · `./rh stop` · `./rh restart` · `./rh delete`

	> Note: This setup enables auto-login for local testing. For production, see [Self-hosting Documentation](https://docs.rhesis.ai/getting-started/self-hosting).
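
After `./rh start`, a quick reachability check of the two addresses above can confirm the stack is up. This sketch uses only the standard library; the `is_reachable` helper is hypothetical, and it assumes both services answer plain HTTP GET requests on the ports listed above.

```python
import http.client
import urllib.error
import urllib.request


def is_reachable(url: str, timeout: float = 2.0) -> bool:
    """Return True if the URL answers an HTTP GET with a non-error status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # Connection refused, timeout, HTTP error status, or malformed URL.
        return False


for url in ("http://localhost:3000", "http://localhost:8080/docs"):
    print(url, "up" if is_reachable(url) else "down")
```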

	Option 3: Python SDK
	```bash
	pip install rhesis-sdk
	```

	---

	## Integrations

	Connect Rhesis to your LLM stack:

	| Integration | Languages | Description |
	|-------------|-----------|-------------|
	| Rhesis SDK | Python, JS/TS | Native SDK with decorators for endpoints and observability. Full control over test execution and tracing. |
	| OpenAI | Python | Drop-in replacement for OpenAI SDK. Automatic instrumentation with zero code changes. |
	| Anthropic | Python | Native support for Claude models with automatic tracing. |
	| LangChain | Python | Add Rhesis callback handler to your LangChain app for automatic tracing and test execution. |
	| LangGraph | Python | Built-in integration for LangGraph agent workflows with full observability. |
	| AutoGen | Python | Automatic instrumentation for Microsoft AutoGen multi-agent conversations. |
	| LiteLLM | Python | Unified interface for 100+ LLMs (OpenAI, Azure, Anthropic, Cohere, Ollama, vLLM, HuggingFace, Replicate). |
	| Google Gemini | Python | Native integration for Google's Gemini models. |
	| Ollama | Python | Local LLM deployment with Ollama integration. |
	| OpenRouter | Python | Access to multiple LLM providers through OpenRouter. |
	| Vertex AI | Python | Google Cloud Vertex AI model support. |
	| HuggingFace | Python | Direct integration with HuggingFace models. |
	| REST API | Any | Direct API access for custom integrations. [OpenAPI spec available](https://api.rhesis.ai/docs). |

	See [Integration Docs](https://docs.rhesis.ai/development) for setup instructions.

	---

	## Open source

	[MIT licensed](LICENSE). We have no plans to relicense core features. The Enterprise version will live in `ee/` folders and remain separate.

	We built Rhesis because existing LLM testing tools didn't meet our needs. If you face the same challenges, contributions are welcome.

	---

	## Contributing

	See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.

	Ways to contribute: Fix bugs or add features · Contribute test sets for common failure modes · Improve documentation · Help others in Discord or GitHub discussions

	---

	## Support

	- [Documentation](https://docs.rhesis.ai) - Guides and API reference
	- [Discord](https://discord.rhesis.ai) - Community support
	- [GitHub Issues](https://github.com/rhesis-ai/rhesis/issues) - Bug reports and feature requests

	---

	## Security & privacy

	We take data security seriously. See our [Privacy Policy](https://rhesis.ai/privacy-policy) for details.

	Telemetry: Rhesis collects basic, anonymized usage statistics to improve the product. No sensitive data is collected or shared with third parties.

	- Self-hosted: Opt out by setting `OTEL_RHESIS_TELEMETRY_ENABLED=false`
	- Cloud: Telemetry enabled as part of Terms & Conditions

	---

	<p align="center">
	<strong>Made with <img src="https://github.com/user-attachments/assets/598c2d81-572c-46bd-b718-dee32cdc749c" height="16" alt="Rhesis logo"> in Potsdam, Germany 🇩🇪</strong>
	</p>

	<p align="center">
	<a href="https://rhesis.ai">Learn more at rhesis.ai</a>
	</p>