CENG 493 Turkish Legal RAG — leakage-free clean retrain bundle

A 3-component bundle for Turkish legal Multiple-Choice QA over the mevzuat (legislation) corpus. Built as the Spring 2026 term project for METU NCC CENG 493 (Information Retrieval).

To run the system end to end: clone Batuhan4/ceng493-submission on GitHub, install the requirements, run scripts/download_models.py to pull this bundle, then run scripts/run_eval.py --corpus <DIR> --benchmark <JSONL>.

What's in this repo

| Subfolder | Component | Base model | Size | Purpose |
|---|---|---|---|---|
| llm-vanilla-clean/merged/ | Causal LM | google/gemma-4-E2B-it (2.3 B) | 9.6 GB | Answer generator (LoRA r=16 α=32 merged bf16) |
| embed-tr-legal-v2/ | Dense embedder | intfloat/multilingual-e5-base (278 M) | 1.1 GB | Question/passage similarity for FAISS retrieval |
| reranker-tr-legal-v2/ | Cross-encoder | BAAI/bge-reranker-v2-m3 (568 M) | 2.2 GB | Top-20 → top-5 reranker before LLM |

Total bundle: ~13 GB on disk.
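The table implies a two-stage retrieval flow: the dense embedder proposes 20 candidates and the cross-encoder keeps the best 5 for the LLM. A minimal sketch of that control flow follows. The function name and the two callables (`dense_search`, `rerank_score`) are hypothetical placeholders for the FAISS lookup and the reranker, not the submission repo's actual API.

```python
from typing import Callable, Sequence


def retrieve_then_rerank(
    question: str,
    dense_search: Callable[[str, int], Sequence[str]],  # hypothetical: FAISS lookup
    rerank_score: Callable[[str, str], float],          # hypothetical: cross-encoder score
    k_dense: int = 20,
    k_final: int = 5,
) -> list[str]:
    """Keep the k_final passages the reranker scores highest among the dense top-k_dense."""
    candidates = list(dense_search(question, k_dense))[:k_dense]
    # Higher reranker score means more relevant, so sort in descending order.
    ranked = sorted(candidates, key=lambda p: rerank_score(question, p), reverse=True)
    return ranked[:k_final]
```

In a real run, `dense_search` would wrap the embed-tr-legal-v2 encoder plus the FAISS index and `rerank_score` would wrap reranker-tr-legal-v2; both are left abstract here so the sketch stays independent of any particular library version.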

Headline performance

78.5 % MCQ accuracy on the team-curated 200-Q Turkish legal benchmark (ensemble Top-3 majority vote over Multi-Query+RRF + HyDE + PageIndex). Best single variant: Multi-Query+RRF at 77.5 %.
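For readers unfamiliar with the two combination steps named above, here is a minimal sketch of reciprocal rank fusion (RRF) over several ranked lists and of a majority vote over three answer letters. The constant `k=60` is the value commonly used for RRF, and the tie-break (fall back to the first variant's answer) is an assumption; the card does not state either choice.

```python
from collections import Counter, defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse ranked lists of doc ids; each list adds 1 / (k + rank) per document."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; doc id breaks exact ties deterministically.
    return sorted(scores, key=lambda d: (-scores[d], d))


def majority_vote(answers, fallback_index=0):
    """Return the letter chosen by a strict majority, else the fallback variant's answer."""
    letter, votes = Counter(answers).most_common(1)[0]
    if votes > len(answers) / 2:
        return letter
    return answers[fallback_index]  # assumption: trust one designated variant on ties


fused = rrf_fuse([["a", "b", "c"], ["b", "a", "d"], ["b", "c", "a"]])
final = majority_vote(["A", "B", "B"])
```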

The bundle is leakage-free: the Phase 13 audit (scripts/audit_synthetic_clean.py in the submission repo) verified that no training data overlaps with any benchmark question. The audit runs seven checks: chunk-id match, (law, madde) pair match, Jaccard similarity at a 0.4 threshold, exact substring, TF-IDF cosine, first-4-tokens match, and direct query-string match.
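To make the text-based checks concrete, the sketch below implements four of the seven (Jaccard, exact substring, first-4-tokens, direct query match) on a single training text and benchmark question. It is an illustration only and not the contents of scripts/audit_synthetic_clean.py. The tokenizer, the `>=` comparison at 0.4, and the function names are assumptions.

```python
import re


def tokens(text):
    # Simplification: str.lower() is not Turkish-aware (I/İ casing is not handled).
    return re.findall(r"\w+", text.lower())


def jaccard(a, b):
    sa, sb = set(tokens(a)), set(tokens(b))
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def leakage_flags(train_text, bench_question, jaccard_threshold=0.4):
    """Return one boolean per check; any True marks the training text as suspect."""
    t, q = tokens(train_text), tokens(bench_question)
    t_norm, q_norm = " ".join(t), " ".join(q)
    return {
        "jaccard": jaccard(train_text, bench_question) >= jaccard_threshold,
        "exact_substring": bool(q_norm) and q_norm in t_norm,
        "first_4_tokens": len(q) >= 4 and t[:4] == q[:4],
        "query_match": bool(q_norm) and t_norm == q_norm,
    }


flags = leakage_flags(
    "Türk Ceza Kanunu madde 141 hırsızlık suçunu düzenler",
    "Türk Ceza Kanunu madde 141 neyi düzenler",
)
```

A training row would be dropped if any flag fires against any benchmark question. The chunk-id, (law, madde), and TF-IDF checks need corpus metadata and are omitted here.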

Training data

| Component | Source | Size | Notes |
|---|---|---|---|
| LLM SFT mix | Renicames/turkish-law-chatbot (Apache 2.0) + 768 corpus-only synthetic MCQs | 15,622 rows | 165 (law, madde) pairs from the benchmark were blocked from synthetic generation |
| Embed pairs | 768 (query, positive, 5 hard-negatives) triples mined from muhammetakkurt/mevzuat-gov-dataset | 4,608 examples | BM25-mined negatives, benchmark-blocked |
| Reranker pairs | Same source as embed pairs, formatted for pair-wise loss | | Identical filtering |

The 768 corpus-only synthetic MCQs were generated by subagents reading randomly sampled mevzuat articles and producing 5-option MCQs grounded strictly in the article text. Each pair was audited against the benchmark before being added to the training set.
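The blocking step described above can be pictured as a filter applied before sampling: any article whose (law, madde) pair appears in the benchmark is excluded from the pool. The sketch below assumes each article is a dict with hypothetical `law` and `madde` keys. The real data layout in the submission repo may differ.

```python
import random


def sample_unblocked(articles, blocked_pairs, n, seed=0):
    """Randomly sample up to n articles whose (law, madde) pair is not benchmark-blocked."""
    blocked = set(blocked_pairs)
    eligible = [a for a in articles if (a["law"], a["madde"]) not in blocked]
    rng = random.Random(seed)  # seeded so the sample is reproducible
    return rng.sample(eligible, min(n, len(eligible)))


pool = [{"law": "TCK", "madde": m} for m in range(1, 11)]
picked = sample_unblocked(pool, blocked_pairs=[("TCK", 3), ("TCK", 7)], n=5)
```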

Training recipe

All three components were trained on Modal H100 80GB. The same constraints apply across the bundle:

  • No QLoRA, no 4-bit/8-bit weights, no Unsloth — bf16 throughout (project constraint C12).
  • AdamW (not 8-bit) — full optimizer state.
  • LoRA r=16, α=32, dropout=0.05 for the LLM (target=all-linear).
  • Sentence-transformers MultipleNegativesRankingLoss for embed.
  • CrossEncoder pair-wise loss for reranker.
  • 2 epochs, batch 16 (LLM) / 32 (embed/reranker), bf16, cosine LR.

Total training cost: ~$16 across 3 components on Modal H100 80GB EU.
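As a rough sanity check on the recipe, the sketch below derives optimizer step counts from the stated row counts, batch sizes, and epochs, and the LoRA scaling factor α/r. It assumes no gradient accumulation, that every row is used for training, and that the last partial batch is kept. None of these is stated in the card, and the `Recipe` class is purely illustrative.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Recipe:
    rows: int
    batch_size: int
    epochs: int = 2

    def optimizer_steps(self) -> int:
        # Assumes one optimizer step per batch and a kept final partial batch.
        return math.ceil(self.rows / self.batch_size) * self.epochs


llm = Recipe(rows=15_622, batch_size=16)   # LLM SFT mix
embed = Recipe(rows=4_608, batch_size=32)  # embed pairs

# Standard LoRA scales the adapter update by alpha / r.
lora_scale = 32 / 16

print(llm.optimizer_steps(), embed.optimizer_steps(), lora_scale)  # 1954 288 2.0
```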

Evaluation methodology

200-Q benchmark, 8 categories (Anayasa, Borçlar, Ceza, İcra-İflas, İdare, İş-Usul, Medeni, Ticaret), 5-option MCQ. Metrics:

  • Accuracy — letter match (primary)
  • Recall@5 — does the top-5 retrieved chunk pool contain the gold passage?
  • MRR — reciprocal rank of the gold chunk
  • Citation F1 — extracted [KOD Madde N] citations vs gold
  • ROUGE-L — text overlap of retrieved vs gold passage text
  • Latency — per-question wall-time
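The retrieval and citation metrics above can be sketched per question as follows. The regular expression for `[KOD Madde N]` and the representation of a citation as a `(code, number)` tuple are assumptions made for illustration. The evaluation script in the submission repo may parse citations differently.

```python
import re

# Assumed citation shape: a law code, the word "Madde", then an article number.
CITATION_RE = re.compile(r"\[(\w+)\s+Madde\s+(\d+)\]")


def recall_at_k(retrieved_ids, gold_id, k=5):
    """1.0 if the gold chunk is among the top-k retrieved chunks, else 0.0."""
    return 1.0 if gold_id in retrieved_ids[:k] else 0.0


def reciprocal_rank(retrieved_ids, gold_id):
    """1 / rank of the gold chunk (1-indexed), or 0.0 if it was not retrieved."""
    for rank, chunk_id in enumerate(retrieved_ids, start=1):
        if chunk_id == gold_id:
            return 1.0 / rank
    return 0.0


def citation_f1(answer_text, gold_citations):
    """F1 between citations extracted from the answer and the gold citation set."""
    predicted = set(CITATION_RE.findall(answer_text))
    gold = set(gold_citations)
    true_pos = len(predicted & gold)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(predicted)
    recall = true_pos / len(gold)
    return 2 * precision * recall / (precision + recall)
```

MRR is then the mean of `reciprocal_rank` over all benchmark questions, and Recall@5 the mean of `recall_at_k`.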

See the submission repo's report/REPORT.md §6 for the full per-metric ablation table (Phase 3 baseline → Phase 7 best-single → Phase 8 ensemble).

Known limitations

  1. Turkish-only. Tokenizer + corpus are Turkish; English / other languages will degrade.
  2. Domain-specific. Trained for Turkish legal MCQ; general-domain Q&A will lose the fine-tuning benefits.
  3. Closed-set MCQ format. The LLM was trained to emit Cevap: <X> where X ∈ A-E. Open-ended Q&A works but the formatting bias may show.
  4. Per-category variance. The v2 ensemble matches or beats v1 on 5 of 8 categories but drops on Borçlar / Medeni / Ticaret (roughly −3 to −20 pp), where v1's now-removed leaked SFT data had overfit. A v3 expanded mix (1,690 rows, stricter audit) is on disk, but retraining on it is blocked by Modal credit exhaustion and is flagged as future work.
  5. License inheritance. The LLM inherits google/gemma-4-E2B-it's license; see Gemma Terms for use restrictions (no harmful content, no military, no bio/chem weapons, etc.). The embed and reranker are MIT (from upstream).
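Regarding limitation 3, downstream code has to recover the answer letter from the `Cevap: <X>` pattern before accuracy can be scored. A minimal sketch follows. Taking the last match when the model repeats itself is an assumption, and `parse_answer` is a hypothetical helper, not a function from the submission repo.

```python
import re

# Matches "Cevap: C" but not "Cevap: Evet" (the letter must stand alone).
ANSWER_RE = re.compile(r"Cevap:\s*([A-E])\b")


def parse_answer(generation):
    """Return the final A-E letter the model emitted, or None if none was found."""
    matches = ANSWER_RE.findall(generation)
    return matches[-1] if matches else None


def accuracy(generations, gold_letters):
    """Fraction of questions whose parsed letter equals the gold letter."""
    hits = sum(parse_answer(g) == gold for g, gold in zip(generations, gold_letters))
    return hits / len(gold_letters)
```

An unparseable generation counts as wrong in this sketch, which is the conservative choice for open-ended prompts where the formatting bias may not hold.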

Citation

@misc{bayazit2026ceng493,
  title  = {Turkish Legal RAG on a Single Intel iGPU (Leakage-Free Clean Retrain)},
  author = {Bayazıt, Batuhan and Bayburtlu, Barış Cem and Aydoğmuş, Burak},
  year   = {2026},
  note   = {CENG 493 Term Project, METU NCC, Spring 2026},
  url    = {https://github.com/Batuhan4/ceng493-submission}
}

License

  • LLM (llm-vanilla-clean/) — Gemma Terms of Use. Derivative of google/gemma-4-E2B-it.
  • Embed (embed-tr-legal-v2/) — MIT. Derivative of intfloat/multilingual-e5-base (MIT).
  • Reranker (reranker-tr-legal-v2/) — MIT. Derivative of BAAI/bge-reranker-v2-m3 (MIT).

Each subfolder ships its own LICENSE file from the upstream base.
