CENG 493 Turkish Legal RAG — leakage-free clean retrain bundle

A three-component bundle for Turkish legal multiple-choice question answering (MCQ) over the mevzuat (legislation) corpus. Built as the Spring 2026 term project for METU NCC CENG 493 (Information Retrieval).

To run the system: clone the Batuhan4/ceng493-submission repository on GitHub, install its requirements, run scripts/download_models.py to pull this bundle, then run scripts/run_eval.py --corpus <DIR> --benchmark <JSONL>.

What's in this repo

Subfolder	Component	Base model	Size	Purpose
`llm-vanilla-clean/merged/`	Causal LM	`google/gemma-4-E2B-it` (2.3 B)	9.6 GB	Answer generator (LoRA r=16 α=32 merged bf16)
`embed-tr-legal-v2/`	Dense embedder	`intfloat/multilingual-e5-base` (278 M)	1.1 GB	Question/passage similarity for FAISS retrieval
`reranker-tr-legal-v2/`	Cross-encoder	`BAAI/bge-reranker-v2-m3` (568 M)	2.2 GB	Top-20 → top-5 reranker before LLM

Total bundle: ~13 GB on disk.

Headline performance

78.5 % MCQ accuracy on the team-curated 200-question Turkish legal benchmark, obtained with an ensemble (Top-3 majority vote over the Multi-Query+RRF, HyDE, and PageIndex variants). Best single variant: Multi-Query+RRF at 77.5 %.
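As a rough illustration of the two fusion steps named above, the sketch below shows Reciprocal Rank Fusion over several ranked chunk lists and a majority vote over the answer letters of three variants. It is not the submission repo's code: the function names, the constant k=60 (the conventional RRF default), and the tie-breaking rule (the first-listed variant wins) are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse ranked lists of chunk ids with Reciprocal Rank Fusion.

    k=60 is the conventional constant and is an assumption here;
    the project may use a different value.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by chunk id for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


def majority_vote(letters):
    """Return the most common answer letter across variants.

    Tie-breaking is hypothetical: the earliest variant in the list wins.
    """
    if not letters:
        raise ValueError("need at least one answer letter")
    counts = Counter(letters)
    best = max(counts.values())
    tied = {letter for letter, n in counts.items() if n == best}
    for letter in letters:
        if letter in tied:
            return letter
```

With three variants, `majority_vote(["B", "B", "D"])` picks the letter two variants agree on, and a three-way disagreement falls back to the first variant's answer.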

The bundle is leakage-free: the Phase 13 audit (scripts/audit_synthetic_clean.py in the submission repo) verified that no training data overlaps with any benchmark question. The 7-check audit covers chunk-id match, (law, madde) pair match, Jaccard overlap (0.4 threshold), exact substring, TF-IDF cosine, first-4-tokens match, and direct query-string match.
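One of the seven checks, the Jaccard overlap test, can be sketched as follows. This is a minimal illustration and not the contents of scripts/audit_synthetic_clean.py: the tokenization, the Turkish-aware lowercasing, the `>=` comparison, and the helper names are assumptions. Only the 0.4 threshold comes from the description above.

```python
import re


def normalize(text):
    # Map the Turkish capital I variants explicitly before lower(),
    # since Python's default lowercasing is not Turkish-aware.
    return text.replace("İ", "i").replace("I", "ı").lower()


def token_set(text):
    # Word-level tokens; the real audit may tokenize differently.
    return set(re.findall(r"\w+", normalize(text)))


def jaccard(a, b):
    """Jaccard similarity between the token sets of two strings."""
    ta, tb = token_set(a), token_set(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def flag_leaks(train_rows, benchmark_questions, threshold=0.4):
    """Return training rows whose overlap with any benchmark question
    reaches the threshold (0.4 per the audit description)."""
    return [
        row
        for row in train_rows
        if any(jaccard(row, q) >= threshold for q in benchmark_questions)
    ]
```

A row flagged by this check would be dropped or reviewed before training. The other six checks would run alongside it in the same pass.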

Training data

Component	Source	Size	Notes
LLM SFT mix	`Renicames/turkish-law-chatbot` (Apache 2.0) + 768 corpus-only synthetic MCQs	15,622 rows	165 (law, madde) pairs from the benchmark were blocked from synthetic generation
Embed pairs	768 (query, positive, 5 hard-negatives) triples mined from `muhammetakkurt/mevzuat-gov-dataset`	4,608 examples	BM25-mined negatives, benchmark-blocked
Reranker pairs	Same source as embed pairs, formatted for pair-wise loss	—	Identical filtering
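The 4,608 figure in the table equals 768 × 6, which is consistent with each triple being expanded into one positive pair and five hard-negative pairs. The sketch below shows that expansion under this assumption. The row layout (query, passage, label) and the function name are hypothetical and may not match the project's data files.

```python
def expand_triples(triples):
    """Expand (query, positive, [negatives]) triples into labeled pairs.

    Assumed layout: label 1 for the positive passage, 0 for each
    BM25-mined hard negative.
    """
    rows = []
    for query, positive, negatives in triples:
        rows.append((query, positive, 1))
        rows.extend((query, negative, 0) for negative in negatives)
    return rows
```

With 768 triples of five negatives each, `len(expand_triples(triples))` comes out to 768 × (1 + 5) = 4,608 rows.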

The 768 corpus-only synthetic MCQs were generated by subagents reading randomly sampled mevzuat articles and producing 5-option MCQs grounded strictly in the article text. Each pair was audited against the benchmark before being added to the training set.

Training recipe

All three components were trained on a Modal H100 80GB instance. The same constraints apply across the bundle:

No QLoRA, no 4-bit/8-bit weights, no Unsloth — bf16 throughout (project constraint C12).
AdamW (not 8-bit) — full optimizer state.
LoRA r=16, α=32, dropout=0.05 for the LLM (target=all-linear).
Sentence-transformers MultipleNegativesRankingLoss for embed.
CrossEncoder pair-wise loss for reranker.
2 epochs, batch 16 (LLM) / 32 (embed/reranker), bf16, cosine LR.
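To make the embedder objective in the list above concrete, here is a dependency-free sketch of the idea behind MultipleNegativesRankingLoss: within a batch, each query's own passage is the positive and every other passage serves as a negative, scored with scaled cosine similarity and cross-entropy. This is an illustration of the technique, not the Sentence-Transformers implementation. The scale of 20.0 is assumed to match that library's default, and the function names are made up for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mnr_loss(query_vecs, passage_vecs, scale=20.0):
    """In-batch-negatives ranking loss.

    passage_vecs[i] is the positive for query_vecs[i]; all other
    passages in the batch act as negatives for that query.
    """
    if not query_vecs or len(query_vecs) != len(passage_vecs):
        raise ValueError("need equal, non-empty batches")
    total = 0.0
    for i, query in enumerate(query_vecs):
        logits = [scale * cosine(query, p) for p in passage_vecs]
        peak = max(logits)  # subtract the max for numerical stability
        log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_z - logits[i]  # cross-entropy with target index i
    return total / len(query_vecs)
```

The loss approaches zero when each query is far closer to its own passage than to the rest of the batch, which is why larger batches (32 for the embedder here) supply more negatives per step.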

Total training cost: ~$16 across 3 components on Modal H100 80GB EU.

Evaluation methodology

200-Q benchmark, 8 categories (Anayasa, Borçlar, Ceza, İcra-İflas, İdare, İş-Usul, Medeni, Ticaret), 5-option MCQ. Metrics:

Accuracy — letter match (primary)
Recall@5 — does the top-5 retrieved chunk pool contain the gold passage?
MRR — reciprocal rank of the gold chunk
Citation F1 — extracted [KOD Madde N] citations vs gold
ROUGE-L — text overlap of retrieved vs gold passage text
Latency — per-question wall-time
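The retrieval and citation metrics listed above could be computed along these lines. This is a sketch under stated assumptions, not the evaluation script from the submission repo: the function names are hypothetical, Citation F1 is assumed to be a set-based F1 over (code, article number) pairs, and the regular expression assumes citations appear literally as [KOD Madde N].

```python
import re

CITATION_RE = re.compile(r"\[([^\[\]]+?)\s+Madde\s+(\d+)\]")


def recall_at_k(ranked_ids, gold_id, k=5):
    """1.0 if the gold chunk is among the top-k retrieved chunks, else 0.0."""
    return 1.0 if gold_id in ranked_ids[:k] else 0.0


def reciprocal_rank(ranked_ids, gold_id):
    """1/rank of the gold chunk, or 0.0 if it was not retrieved."""
    for rank, chunk_id in enumerate(ranked_ids, start=1):
        if chunk_id == gold_id:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """runs is a list of (ranked_ids, gold_id) pairs, one per question."""
    return sum(reciprocal_rank(r, g) for r, g in runs) / len(runs)


def extract_citations(text):
    """Collect (code, article_number) pairs such as ('TBK', 12)."""
    return {(code.strip(), int(n)) for code, n in CITATION_RE.findall(text)}


def citation_f1(answer_text, gold_citations):
    """Set-based F1 between extracted and gold citations (assumed definition)."""
    predicted = extract_citations(answer_text)
    gold = set(gold_citations)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)
```

Averaging `recall_at_k` over all 200 questions gives Recall@5, and `mean_reciprocal_rank` gives MRR over the same runs.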

See the submission repo's report/REPORT.md §6 for the full per-metric ablation table (Phase 3 baseline → Phase 7 best-single → Phase 8 ensemble).

Known limitations

Turkish-only. Tokenizer + corpus are Turkish; English / other languages will degrade.
Domain-specific. Trained for Turkish legal MCQ; general-domain Q&A will not see the fine-tuning benefits.
Closed-set MCQ format. The LLM was trained to emit Cevap: <X> where X ∈ A-E. Open-ended Q&A works but the formatting bias may show.
Per-category variance. The v2 ensemble matches or beats v1 on 5 of 8 categories but drops on Borçlar, Medeni, and Ticaret (roughly −3 to −20 pp), where v1 had over-fit to its now-removed leaked SFT data. A v3 expanded mix (1690 rows, stricter audit) is on disk, but retraining on it is blocked by Modal credit exhaustion and is flagged as future work.
License inheritance. The LLM inherits the license of google/gemma-4-E2B-it; see the Gemma Terms for use restrictions (no harmful content, no military use, no bio/chem weapons, etc.). The embedder and reranker are MIT-licensed, inherited from upstream.
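Because the LLM is trained to emit Cevap: <X> with X in A to E, letter-match accuracy reduces to parsing that marker from the generation. The sketch below shows one way to do it. The regular expression, the choice to take the last match when several appear, and the function names are assumptions for illustration rather than the project's actual parser.

```python
import re

ANSWER_RE = re.compile(r"Cevap:\s*([A-E])\b")


def extract_letter(generation):
    """Return the answer letter from a generation, or None if absent.

    Taking the last match is an assumption; it favors a final answer
    over any letter mentioned earlier in the reasoning.
    """
    matches = ANSWER_RE.findall(generation)
    return matches[-1] if matches else None


def letter_accuracy(generations, gold_letters):
    """Fraction of questions whose parsed letter equals the gold letter.

    Generations with no parsable answer count as wrong.
    """
    if len(generations) != len(gold_letters) or not gold_letters:
        raise ValueError("need equal, non-empty lists")
    correct = sum(
        extract_letter(text) == gold
        for text, gold in zip(generations, gold_letters)
    )
    return correct / len(gold_letters)
```

For open-ended prompts the marker may be missing entirely, in which case `extract_letter` returns None and the caller has to handle the free-text answer separately.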

Citation

@misc{bayazit2026ceng493,
  title  = {Turkish Legal RAG on a Single Intel iGPU (Leakage-Free Clean Retrain)},
  author = {Bayazıt, Batuhan and Bayburtlu, Barış Cem and Aydoğmuş, Burak},
  year   = {2026},
  note   = {CENG 493 Term Project, METU NCC, Spring 2026},
  url    = {https://github.com/Batuhan4/ceng493-submission}
}

License

LLM (llm-vanilla-clean/) — Gemma Terms of Use. Derivative of google/gemma-4-E2B-it.
Embed (embed-tr-legal-v2/) — MIT. Derivative of intfloat/multilingual-e5-base (MIT).
Reranker (reranker-tr-legal-v2/) — MIT. Derivative of BAAI/bge-reranker-v2-m3 (MIT).

Each subfolder ships its own LICENSE file from the upstream base.
