How to use 0xBatuhan4/ceng493-rag-bundle with sentence-transformers:

```python
from sentence_transformers import CrossEncoder

model = CrossEncoder("0xBatuhan4/ceng493-rag-bundle")

query = "Which planet is known as the Red Planet?"
passages = [
    "Venus is often called Earth's twin because of its similar size and proximity.",
    "Mars, known for its reddish appearance, is often referred to as the Red Planet.",
    "Jupiter, the largest planet in our solar system, has a prominent red spot.",
    "Saturn, famous for its rings, is sometimes mistaken for the Red Planet.",
]

# Score every (query, passage) pair; a higher score means more relevant.
scores = model.predict([(query, passage) for passage in passages])
print(scores)
```
CENG 493 Turkish Legal RAG — leakage-free clean retrain bundle
A 3-component bundle for Turkish legal Multiple-Choice QA over the mevzuat (legislation) corpus. Built as the Spring 2026 term project for METU NCC CENG 493 (Information Retrieval).
To reach the system entry point: clone `Batuhan4/ceng493-submission` on GitHub, install the requirements, run `scripts/download_models.py` to pull this bundle, then run `scripts/run_eval.py --corpus <DIR> --benchmark <JSONL>`.
What's in this repo
| Subfolder | Component | Base model | Size | Purpose |
|---|---|---|---|---|
| `llm-vanilla-clean/merged/` | Causal LM | `google/gemma-4-E2B-it` (2.3 B) | 9.6 GB | Answer generator (LoRA r=16 α=32 merged bf16) |
| `embed-tr-legal-v2/` | Dense embedder | `intfloat/multilingual-e5-base` (278 M) | 1.1 GB | Question/passage similarity for FAISS retrieval |
| `reranker-tr-legal-v2/` | Cross-encoder | `BAAI/bge-reranker-v2-m3` (568 M) | 2.2 GB | Top-20 → top-5 reranker before LLM |
Total bundle: ~13 GB on disk.
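As a rough illustration of how the three components fit together, the retrieve-then-rerank step (top-20 candidates narrowed to top-5 before the LLM) might look like the sketch below. The callables `retrieve` and `score_pair` are hypothetical stand-ins for the FAISS search and the cross-encoder; the actual wiring lives in the submission repo and may differ.

```python
from typing import Callable, List, Tuple


def retrieve_then_rerank(
    query: str,
    retrieve: Callable[[str, int], List[str]],  # hypothetical: dense search
    score_pair: Callable[[str, str], float],    # hypothetical: cross-encoder
    k_retrieve: int = 20,
    k_final: int = 5,
) -> List[Tuple[str, float]]:
    """Fetch k_retrieve candidates, rescore them, keep the best k_final."""
    candidates = retrieve(query, k_retrieve)
    scored = [(passage, score_pair(query, passage)) for passage in candidates]
    # Python's sort is stable, so ties keep their retrieval order.
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k_final]


# Toy stand-ins so the sketch runs without any model weights.
corpus = [f"passage {i}" for i in range(30)]


def toy_retrieve(query: str, k: int) -> List[str]:
    return corpus[:k]


def toy_score(query: str, passage: str) -> float:
    return float(int(passage.split()[1]) % 7)


top = retrieve_then_rerank("soru", toy_retrieve, toy_score)
print(top)
```

The five surviving passages would then be placed in the prompt for the answer generator.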
Headline performance
78.5 % MCQ accuracy on the team-curated 200-Q Turkish legal benchmark (ensemble Top-3 majority vote over Multi-Query+RRF + HyDE + PageIndex). Best single variant: Multi-Query+RRF at 77.5 %.
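For readers unfamiliar with the two fusion steps named above, here is a minimal sketch of reciprocal rank fusion (RRF) over several ranked lists and of a majority vote over three answer letters. The constant `k=60` and the tie-breaking rule (fall back to the first variant) are assumptions made for illustration; the submission repo may use different choices.

```python
from collections import Counter, defaultdict
from typing import Dict, List


def rrf_fuse(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse ranked lists of chunk ids: score(d) = sum of 1 / (k + rank)."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


def majority_vote(answers: List[str]) -> str:
    """Return the most common letter; on a tie, trust the first variant."""
    counts = Counter(answers)
    best_count = max(counts.values())
    winners = [letter for letter, c in counts.items() if c == best_count]
    if len(winners) > 1:
        return answers[0]  # assumed tie-break, not taken from the source
    return winners[0]


fused = rrf_fuse([["a", "b", "c"], ["b", "a", "d"], ["b", "c", "a"]])
print(fused)
print(majority_vote(["A", "B", "A"]))
```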
The bundle is leakage-free: the Phase 13 audit (`scripts/audit_synthetic_clean.py`
in the submission repo) verified that no training data overlaps with any
benchmark question. The 7-check audit covers chunk-id match, (law, madde)
pair match, Jaccard similarity at a 0.4 threshold, exact substring, TF-IDF
cosine, first-4-tokens, and direct query-string match.
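One of the seven checks, the Jaccard overlap test, can be sketched as follows. This is not the code from `scripts/audit_synthetic_clean.py`; the tokenizer is a naive regex split (it does not handle Turkish casing of İ/I correctly), and treating a score of 0.4 or above as a leak is an assumption about how the threshold is applied.

```python
import re
from typing import List, Set, Tuple


def tokens(text: str) -> Set[str]:
    """Naive lowercase word set; a real audit may normalize Turkish text."""
    return set(re.findall(r"\w+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """|A ∩ B| / |A ∪ B| over the two token sets."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def flag_leaks(
    train_texts: List[str],
    benchmark_questions: List[str],
    threshold: float = 0.4,
) -> List[Tuple[int, int]]:
    """Return (train_index, benchmark_index) pairs at or above threshold."""
    flagged = []
    for i, train_text in enumerate(train_texts):
        for j, question in enumerate(benchmark_questions):
            if jaccard(train_text, question) >= threshold:
                flagged.append((i, j))
    return flagged


leaks = flag_leaks(
    ["kira sözleşmesi nedir", "ceza hukuku"],
    ["kira sözleşmesi nedir ve nasıl yapılır"],
)
print(leaks)
```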
Training data
| Component | Source | Size | Notes |
|---|---|---|---|
| LLM SFT mix | `Renicames/turkish-law-chatbot` (Apache 2.0) + 768 corpus-only synthetic MCQs | 15,622 rows | 165 (law, madde) pairs from the benchmark were blocked from synthetic generation |
| Embed pairs | 768 (query, positive, 5 hard-negatives) triples mined from `muhammetakkurt/mevzuat-gov-dataset` | 4,608 examples | BM25-mined negatives, benchmark-blocked |
| Reranker pairs | Same source as embed pairs, formatted for pair-wise loss | — | Identical filtering |
The 768 corpus-only synthetic MCQs were generated by subagents reading randomly sampled mevzuat articles and producing 5-option MCQs grounded strictly in the article text. Each pair was audited against the benchmark before being added to the training set.
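The blocking of benchmark (law, madde) pairs before sampling could be done along these lines. The field names `law` and `madde` and the helper `sample_unblocked` are hypothetical, chosen only for this sketch.

```python
import random
from typing import Dict, List, Set, Tuple


def sample_unblocked(
    articles: List[Dict],
    blocked_pairs: Set[Tuple[str, int]],
    n: int,
    seed: int = 0,
) -> List[Dict]:
    """Randomly sample n articles whose (law, madde) pair is not blocked."""
    allowed = [
        a for a in articles if (a["law"], a["madde"]) not in blocked_pairs
    ]
    rng = random.Random(seed)  # seeded for reproducible sampling
    return rng.sample(allowed, min(n, len(allowed)))


# Toy corpus: ten articles from each of two (made-up) law codes.
articles = [
    {"law": law, "madde": m, "text": f"{law} madde {m}"}
    for law in ("TBK", "TMK")
    for m in range(1, 11)
]
blocked = {("TBK", 3), ("TMK", 7)}
picked = sample_unblocked(articles, blocked, n=5)
print([(a["law"], a["madde"]) for a in picked])
```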
Training recipe
All three components were trained on Modal H100 80GB. Same constraints across the bundle:
- No QLoRA, no 4-bit/8-bit weights, no Unsloth — bf16 throughout (project constraint C12).
- AdamW (not 8-bit) — full optimizer state.
- LoRA r=16, α=32, dropout=0.05 for the LLM (target=`all-linear`).
- Sentence-transformers `MultipleNegativesRankingLoss` for embed.
- CrossEncoder pair-wise loss for reranker.
- 2 epochs, batch 16 (LLM) / 32 (embed/reranker), bf16, cosine LR.
Total training cost: ~$16 across 3 components on Modal H100 80GB EU.
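To make the embedder objective listed above concrete, the idea behind a multiple-negatives ranking loss is sketched below in plain Python: each query must rank its own positive above every other passage in the batch. This is an illustrative re-derivation, not the sentence-transformers implementation, and the `scale` value of 20 is an assumed temperature.

```python
import math
from typing import List

Vector = List[float]


def cos_sim(u: Vector, v: Vector) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mnr_loss(queries: List[Vector], passages: List[Vector],
             scale: float = 20.0) -> float:
    """Mean cross-entropy where passages[i] is the positive for queries[i].

    Every other passage in the batch (including any hard negatives
    appended after the positives) acts as a negative.
    """
    losses = []
    for i, q in enumerate(queries):
        logits = [scale * cos_sim(q, p) for p in passages]
        top = max(logits)  # subtract the max for numerical stability
        log_z = top + math.log(sum(math.exp(x - top) for x in logits))
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)


queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]   # positives match their queries
swapped = [[0.0, 1.0], [1.0, 0.0]]   # positives are mismatched
print(mnr_loss(queries, aligned), mnr_loss(queries, swapped))
```

The aligned batch yields a loss near zero and the swapped batch a large one, which is the behaviour the training signal relies on.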
Evaluation methodology
200-Q benchmark, 8 categories (Anayasa, Borçlar, Ceza, İcra-İflas, İdare, İş-Usul, Medeni, Ticaret), 5-option MCQ. Metrics:
- Accuracy: letter match (primary)
- Recall@5: does the top-5 retrieved chunk pool contain the gold passage?
- MRR: reciprocal rank of the gold chunk
- Citation F1: extracted `[KOD Madde N]` citations vs gold
- ROUGE-L: text overlap of retrieved vs gold passage text
- Latency: per-question wall-time
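The two retrieval metrics in the list above can be computed as in this minimal sketch, which assumes exactly one gold chunk id per question; the evaluation script in the submission repo may handle multiple gold chunks differently.

```python
from typing import Dict, List, Tuple


def recall_at_k(ranked_ids: List[str], gold_id: str, k: int = 5) -> float:
    """1.0 if the gold chunk appears in the top-k results, else 0.0."""
    return 1.0 if gold_id in ranked_ids[:k] else 0.0


def reciprocal_rank(ranked_ids: List[str], gold_id: str) -> float:
    """1 / rank of the gold chunk (1-indexed), or 0.0 if it is missing."""
    for rank, chunk_id in enumerate(ranked_ids, start=1):
        if chunk_id == gold_id:
            return 1.0 / rank
    return 0.0


def evaluate(runs: List[Tuple[List[str], str]]) -> Dict[str, float]:
    """Average both metrics over (ranked_ids, gold_id) pairs."""
    n = len(runs)
    return {
        "recall@5": sum(recall_at_k(r, g) for r, g in runs) / n,
        "mrr": sum(reciprocal_rank(r, g) for r, g in runs) / n,
    }


runs = [
    (["c1", "c2", "c3"], "c2"),                    # gold at rank 2
    (["c9", "c8", "c7", "c6", "c5", "c4"], "c4"),  # gold at rank 6
    (["c1"], "zz"),                                # gold never retrieved
]
metrics = evaluate(runs)
print(metrics)
```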
See the submission repo's report/REPORT.md §6 for the full per-metric
ablation table (Phase 3 baseline → Phase 7 best-single → Phase 8 ensemble).
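Citation F1, as listed among the metrics, compares citations extracted from the generated text with the gold citations. A set-based version might look like the sketch below; the regular expression for `[KOD Madde N]` and the convention that two empty sets score 1.0 are assumptions, not details confirmed by the report.

```python
import re
from typing import Set, Tuple

Citation = Tuple[str, int]

# Assumed pattern: a law code without spaces, the word "Madde", a number.
CITATION_RE = re.compile(r"\[([^\[\]\s]+) Madde (\d+)\]")


def extract_citations(text: str) -> Set[Citation]:
    return {(code, int(num)) for code, num in CITATION_RE.findall(text)}


def citation_f1(generated_text: str, gold: Set[Citation]) -> float:
    """Harmonic mean of precision and recall over citation sets."""
    predicted = extract_citations(generated_text)
    if not predicted and not gold:
        return 1.0  # assumed convention when nothing is expected or cited
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


answer = "Bkz. [TBK Madde 299] ve [TMK Madde 10]. Cevap: B"
score = citation_f1(answer, {("TBK", 299)})
print(extract_citations(answer), score)
```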
Known limitations
- Turkish-only. Tokenizer + corpus are Turkish; English / other languages will degrade.
- Domain-specific. Trained for Turkish legal MCQ; general-domain Q&A will lose the FT benefits.
- Closed-set MCQ format. The LLM was trained to emit `Cevap: <X>` where X ∈ A-E. Open-ended Q&A works but the formatting bias may show.
- Per-category variance. The v2 ensemble matches or beats v1 on 5 of 8 categories but drops on Borçlar / Medeni / Ticaret (~−3 to −20 pp), where v1's now-removed leaked SFT data had over-fit. A v3 expanded mix (1690 rows, stricter audit) is on disk, but the retrain on it is blocked by Modal credit exhaustion and is flagged as future work.
- License inheritance. The LLM inherits the license of `google/gemma-4-E2B-it`; see the Gemma Terms for use restrictions (no harmful content, no military, no bio/chem weapons, etc.). The embed and reranker are MIT (from upstream).
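Because the LLM emits its answer in the `Cevap: <X>` format described above, downstream code needs to pull the letter out of the generation before scoring accuracy. A minimal sketch follows; taking the last match when several appear is an assumption, and `parse_answer` is a hypothetical helper name.

```python
import re
from typing import List, Optional

# A single letter A-E after "Cevap:", not followed by more word characters.
ANSWER_RE = re.compile(r"Cevap:\s*([A-E])\b")


def parse_answer(generation: str) -> Optional[str]:
    """Return the last 'Cevap: X' letter in the text, or None if absent."""
    matches = ANSWER_RE.findall(generation)
    return matches[-1] if matches else None


def accuracy(generations: List[str], gold_letters: List[str]) -> float:
    """Letter-match accuracy; unparseable generations count as wrong."""
    correct = sum(
        parse_answer(g) == gold for g, gold in zip(generations, gold_letters)
    )
    return correct / len(gold_letters)


outputs = ["Gerekçe: ... Cevap: C", "Cevap: B. Açıklama", "Cevap: Evet"]
print([parse_answer(o) for o in outputs])
print(accuracy(outputs, ["C", "A", "E"]))
```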
Citation
@misc{bayazit2026ceng493,
title = {Turkish Legal RAG on a Single Intel iGPU (Leakage-Free Clean Retrain)},
author = {Bayazıt, Batuhan and Bayburtlu, Barış Cem and Aydoğmuş, Burak},
year = {2026},
note = {CENG 493 Term Project, METU NCC, Spring 2026},
url = {https://github.com/Batuhan4/ceng493-submission}
}
License
- LLM (`llm-vanilla-clean/`): Gemma Terms of Use. Derivative of `google/gemma-4-E2B-it`.
- Embed (`embed-tr-legal-v2/`): MIT. Derivative of `intfloat/multilingual-e5-base` (MIT).
- Reranker (`reranker-tr-legal-v2/`): MIT. Derivative of `BAAI/bge-reranker-v2-m3` (MIT).
Each subfolder ships its own LICENSE file from the upstream base.