Upload README.md
README.md
CHANGED
@@ -23,7 +23,7 @@ tags:
 
 > **The production-grade OpenEnv RL environment for training and evaluating LLMs on hallucination avoidance.**
 
-**Server Version:** v4.2.
 
 [](https://github.com/meta-pytorch/OpenEnv)
 [](#-quick-start)
@@ -319,11 +319,14 @@ All benchmarks: **3 episodes × 5 steps, seed=42**, against deployed HF Space.
 |---|-------|----------|---------|--------|--------|--------|------|
 | 1 | Nemotron-3-Super 120B | OpenRouter | **0.553** | 0.599 | 0.535 | 0.524 | 10m 57s |
 | 2 | Llama 3.3 70B | Groq | **0.514** | 0.542 | 0.449 | 0.552 | 1m 12s |
-| 3 |
-| 4 |
-| 5 |
-| 6 |
-| 7 |
 
 ### Heuristic Baseline (no LLM required)
 
@@ -459,6 +462,16 @@ ruff check . --ignore E501,F401,F403
 
 ## Changelog
 
 ### v4.2.0 (2026-04)
 
 - **Fixed** BERTScore crash on HF Spaces — switched from `microsoft/deberta-v3-base` to `roberta-base` (fast tokenizer incompatibility with transformers>=4.57)
@@ -23,7 +23,7 @@ tags:
 
 > **The production-grade OpenEnv RL environment for training and evaluating LLMs on hallucination avoidance.**
 
+**Server Version:** v4.2.1
 
 [](https://github.com/meta-pytorch/OpenEnv)
 [](#-quick-start)
@@ -319,11 +319,14 @@ All benchmarks: **3 episodes × 5 steps, seed=42**, against deployed HF Space.
 |---|-------|----------|---------|--------|--------|--------|------|
 | 1 | Nemotron-3-Super 120B | OpenRouter | **0.553** | 0.599 | 0.535 | 0.524 | 10m 57s |
 | 2 | Llama 3.3 70B | Groq | **0.514** | 0.542 | 0.449 | 0.552 | 1m 12s |
+| 3 | Llama 4 Scout 17B | Groq | **0.508** | 0.558 | 0.453 | 0.513 | 1m 14s |
+| 4 | Qwen3 32B | Groq | **0.513** | 0.564 | 0.453 | 0.522 | 4m 41s |
+| 5 | GPT-OSS 20B | Groq | **0.498** | 0.552 | 0.406 | 0.537 | 3m 53s |
+| 6 | Kimi K2 Instruct | Groq | **0.486** | 0.494 | 0.447 | 0.516 | 1m 14s |
+| 7 | Qwen2.5 72B Instruct | HF Router | **0.480** | 0.594 | 0.431 | 0.417 | 3m 05s |
+| 8 | Gemma 4 31B Cloud | Ollama | **0.421** | 0.498 | 0.286 | 0.480 | 3m 18s |
+| 9 | GLM-4.5 Air | OpenRouter | **0.350** | 0.436 | 0.311 | 0.303 | 14m 01s |
+| 10 | Heuristic (no LLM) | — | **0.131** | 0.162 | 0.144 | 0.087 | 30s |
 
 ### Heuristic Baseline (no LLM required)
 
@@ -459,6 +462,16 @@ ruff check . --ignore E501,F401,F403
 
 ## Changelog
 
+### v4.2.1 (2026-04)
+
+- **Fixed** Source grounding key phrase matching — trailing periods in normalized context words (e.g., `"alaska."`) prevented matching against quote words (e.g., `"alaska"`), causing false 0.0 grounding scores for valid partial quotes. Context word set now strips periods.
+- **Fixed** AlignScore computation — `compute_alignscore()` called `nli.predict()` without `apply_softmax=True`, using raw logits instead of probabilities in the `entailment - contradiction + 0.5` formula. Short answers like single-word responses now get meaningful alignment scores instead of erratic 0.0/1.0 values.
+- **Fixed** Semantic consistency penalizing short answers — NLI models give low entailment from paragraph→single-word, so 50/50 weighting between context-entailment and truth-entailment unfairly penalized correct short answers. Added adaptive weighting: 80/20 (truth/context) for ≤5 words, 60/40 for 6-15 words, 50/50 for longer.
+- **Fixed** `best_match_score` not updated in key phrase matching path of `check_quote_in_context_advanced()` — now correctly set to `0.5 + 0.3 * ratio`.
+- **Fixed** `inference.py` JSON parsing — models wrapping responses in markdown code fences (` ```json...``` `) now correctly extracted. Rewrote agent fallback flow to try JSON format first, then no-format, with proper extraction at each stage.
+- **Fixed** `inference.py` connection reliability — added retry with exponential backoff for `ChunkedEncodingError`, `ConnectionError`, and `ReadTimeout` when communicating with HF Spaces.
+- **Added** Kimi K2 Instruct and Llama 4 Scout 17B benchmark results to leaderboard.
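As a rough illustration of the two source-grounding fixes in the v4.2.1 entry, the sketch below strips periods from context words before comparing them with quote words, then maps the match ratio to a score with `0.5 + 0.3 * ratio`. The helper names `normalize_words`, `match_ratio`, and `key_phrase_score` are hypothetical, and treating `ratio` as the share of quote words found in the context is an assumption. The real `check_quote_in_context_advanced()` may be structured differently.

```python
def normalize_words(text: str) -> set[str]:
    """Lowercase, split on whitespace, and strip periods from each word."""
    words = (w.strip(".") for w in text.lower().split())
    return {w for w in words if w}


def match_ratio(quote: str, context: str) -> float:
    """Share of quote words that also occur in the context (assumed definition)."""
    quote_words = normalize_words(quote)
    if not quote_words:
        return 0.0
    context_words = normalize_words(context)
    return len(quote_words & context_words) / len(quote_words)


def key_phrase_score(ratio: float) -> float:
    """Formula quoted in the changelog for the key phrase matching path."""
    return 0.5 + 0.3 * ratio


ratio = match_ratio("Alaska", "The largest US state is Alaska.")
score = key_phrase_score(ratio)
```

Without the period stripping, the context set would hold `"alaska."` and the quote word `"alaska"` would not match, which is the false 0.0 case the changelog describes.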
+
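The AlignScore and semantic-consistency fixes in v4.2.1 can be sketched with plain Python, without loading an NLI model. This is only an approximation under stated assumptions: the function names are hypothetical, the label indices depend on the model in use, the clamp to [0, 1] is assumed rather than confirmed, and the 60/40 split is read as truth/context like the 80/20 split.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Turn raw logits into probabilities (what apply_softmax=True requests)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def align_score(logits: list[float], entail_idx: int, contra_idx: int) -> float:
    """entailment - contradiction + 0.5 on probabilities, clamped (assumed)."""
    probs = softmax(logits)
    raw = probs[entail_idx] - probs[contra_idx] + 0.5
    return min(1.0, max(0.0, raw))


def consistency_weights(answer: str) -> tuple[float, float]:
    """Return (truth_weight, context_weight) based on answer length in words."""
    n_words = len(answer.split())
    if n_words <= 5:
        return 0.8, 0.2
    if n_words <= 15:
        return 0.6, 0.4
    return 0.5, 0.5


def semantic_consistency(truth_entail: float, context_entail: float, answer: str) -> float:
    """Weighted blend of truth-entailment and context-entailment."""
    w_truth, w_context = consistency_weights(answer)
    return w_truth * truth_entail + w_context * context_entail
```

With raw logits, the difference `entailment - contradiction` can be far outside [-1, 1], so a clamped score would tend to land on 0.0 or 1.0. With probabilities, the formula stays within [-0.5, 1.5] before clamping.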
 ### v4.2.0 (2026-04)
 
 - **Fixed** BERTScore crash on HF Spaces — switched from `microsoft/deberta-v3-base` to `roberta-base` (fast tokenizer incompatibility with transformers>=4.57)
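For the two `inference.py` fixes listed under v4.2.1, a minimal sketch of the general techniques might look like the following. It is not the project's code: `extract_json` and `with_retries` are hypothetical names, the regular expression only handles a single fenced block, and the retry defaults (4 attempts, 1 second base delay) are assumptions.

```python
import json
import re
import time

# Matches a fenced block such as ```json ... ``` or ``` ... ``` (assumed format).
_FENCE_RE = re.compile(r"```(?:json)?\s*(.*?)\s*```", re.DOTALL | re.IGNORECASE)


def extract_json(text: str):
    """Parse JSON from a model reply, unwrapping a markdown code fence if present."""
    match = _FENCE_RE.search(text)
    candidate = match.group(1) if match else text.strip()
    return json.loads(candidate)


def with_retries(fn, retry_on, attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fn(), retrying on the given exceptions with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)  # waits 1s, 2s, 4s, ...
```

When the HTTP client is `requests`, the `retry_on` tuple could be `(requests.exceptions.ChunkedEncodingError, requests.exceptions.ConnectionError, requests.exceptions.ReadTimeout)`, which are the three exceptions the changelog names.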