
USLaP Mistral v22

USLaP (Universal Scientific Laws and Principles) is a LoRA fine-tune of Mistral-7B-Instruct-v0.2 that validates scientific terminology against Qur'anic Arabic roots.

Purpose

The model detects and rejects scientific terms it classifies as contaminated (of Persian, Greek, or Latin origin) and proposes alternatives derived from Qur'anic Arabic roots.

Quick Start

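from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE_ID = "mistralai/Mistral-7B-Instruct-v0.2"

# Load the base model; device_map="auto" places it on the available device(s).
base = AutoModelForCausalLM.from_pretrained(
    BASE_ID, device_map="auto", torch_dtype="auto"
)
# Attach the USLaP LoRA adapter on top of the base weights.
model = PeftModel.from_pretrained(base, "uslap/uslap-mistral-v22")
tokenizer = AutoTokenizer.from_pretrained(BASE_ID)

# Mistral-Instruct expects the user turn wrapped in [INST] ... [/INST].
prompt = "[INST] What is the Arabic term for geometry? [/INST]"
# Move inputs to the model's device instead of hard-coding "cuda".
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=300)
# The decoded string contains the prompt followed by the model's answer.
print(tokenizer.decode(outputs[0], skip_special_tokens=True))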
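
The decoded string printed above includes the prompt as well as the answer, because the [INST] markers are ordinary text and not special tokens. The two helpers below are a minimal sketch, not part of the released code: build_prompt and extract_answer are hypothetical names, and the sketch assumes the answer is everything after the last [/INST] marker.

```python
INST_OPEN = "[INST]"
INST_CLOSE = "[/INST]"


def build_prompt(question: str) -> str:
    """Wrap a question in the instruction markers used by the Quick Start prompt."""
    return f"{INST_OPEN} {question.strip()} {INST_CLOSE}"


def extract_answer(decoded: str) -> str:
    """Return the text after the last [/INST] marker.

    If the marker is absent, the whole string is returned (stripped),
    so the helper is safe to call on already-trimmed output.
    """
    _, sep, tail = decoded.rpartition(INST_CLOSE)
    return tail.strip() if sep else decoded.strip()
```

With these, the Quick Start prompt becomes build_prompt("What is the Arabic term for geometry?"), and the final print can wrap the decoded text in extract_answer(...) to show only the model's reply.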

Expected Output

❌ REJECTED: "geometry" (Greek)
❌ REJECTED: "هَنْدَسَة" (handasa) — PERSIAN CONTAMINATION

✅ USE INSTEAD: عِلْم التَّقْدِير ('Ilm al-Taqdīr)
Root: ق د ر (q-d-r) — Qur'anic: 54:49
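
If the reply needs to be consumed by other code, the layout above can be turned into a small record. The parser below is a sketch under a loud assumption: it infers the format from the single example shown here (REJECTED lines with a quoted term, one USE INSTEAD line, one Root line with a transliteration in parentheses and a chapter:verse reference). Verdict and parse_verdict are hypothetical names, and real model output may deviate from this layout.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Verdict:
    rejected: List[str] = field(default_factory=list)  # quoted terms on REJECTED lines
    recommended: Optional[str] = None                  # text after "USE INSTEAD:"
    root_translit: Optional[str] = None                # e.g. "q-d-r"
    reference: Optional[str] = None                    # e.g. "54:49"


# Patterns are assumptions derived from the one example in this README.
REJECT_RE = re.compile(r'REJECTED:\s*"([^"]+)"')
USE_RE = re.compile(r"USE INSTEAD:\s*(.+)")
ROOT_RE = re.compile(r"Root:.*?\(([^)]+)\)")
REF_RE = re.compile(r"Qur.anic:\s*(\d+:\d+)")


def parse_verdict(text: str) -> Verdict:
    """Scan the reply line by line and collect whichever fields are present."""
    verdict = Verdict()
    for line in text.splitlines():
        if (m := REJECT_RE.search(line)):
            verdict.rejected.append(m.group(1))
        if (m := USE_RE.search(line)):
            verdict.recommended = m.group(1).strip()
        if (m := ROOT_RE.search(line)):
            verdict.root_translit = m.group(1)
        if (m := REF_RE.search(line)):
            verdict.reference = m.group(1)
    return verdict
```

Fields that do not appear in the reply stay at their defaults, so a caller can check verdict.recommended is None to detect output that did not follow the expected layout.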

Training

  • Base: Mistral-7B-Instruct-v0.2
  • Method: LoRA (r=32)
  • Dataset: 2,680 validated entries
  • Final Loss: 0.069-0.116

Framework

  • TRL: 0.27.2
  • Transformers: 5.0.0
  • PEFT: LoRA adapter

License

Apache 2.0