---
language:
- en
- hi
tags:
- Multiturn
- QnA
- Summarization
pipeline_tag: text-generation
library_name: transformers
license: apache-2.0
---
# LegalParam

**BharatGen** introduces **LegalParam**, a domain-specialized large language model fine-tuned from **Param-1-2.9B-Instruct** on an exhaustive India-centric legal dataset. Trained across a comprehensive taxonomy of acts, laws, policies, and regulations, LegalParam is built to deliver accurate, context-aware answers to legal queries while also supporting tasks such as summarizing lengthy legal documents and simplifying complex policy texts. Whether it's aiding practitioners with quick references, assisting researchers in exploring legal frameworks, or helping citizens better understand their rights and obligations, LegalParam brings clarity and accessibility to the vast and intricate landscape of Indian law.

---

## ⚖️ Motivation

Law in India is vast, complex, and ever-evolving, yet most language models lack the depth and domain specialization needed to navigate acts, policies, and regulations in an India-centric context. Citizens, researchers, and practitioners often struggle with scattered information and dense legal language that is hard to interpret. LegalParam bridges this gap by combining **Param-1**'s strong instruction-following capabilities with a meticulously curated, **exhaustive dataset of Indian laws**, policies, and regulations, making legal knowledge more **accessible**, **contextual**, and **actionable**.
---

## 🏗 Model Architecture

LegalParam inherits the architecture of **Param-1-2.9B-Instruct**:

* **Hidden size**: 2048
* **Intermediate size**: 7168
* **Attention heads**: 16
* **Hidden layers**: 32
* **Key-value heads**: 8
* **Max position embeddings**: 2048
* **Activation**: SiLU
* **Positional embeddings**: Rotary (RoPE, theta=10000)
* **Attention mechanism**: Grouped-query attention
* **Precision**: bf16-mixed
* **Base model**: [Param-1-2.9B-Instruct](https://huggingface.co/bharatgenai/Param-1-2.9B-Instruct)

---

## 📚 Data Preparation

LegalParam's training corpus was designed to ensure comprehensive coverage of Indian legal knowledge, high-quality instruction tuning, and bilingual (Hindi + English) accessibility.

**Steps involved:**

1. **Source Gathering**
   * Open-source datasets on Indian laws, acts, and policies were collected and curated.
   * Historical data for acts and laws was included for more grounded training.
2. **Question Generation**
   * Q&A pairs were generated from acts and legal texts to create grounded supervision signals.
3. **Domain Taxonomy & Personas**
   * An exhaustive taxonomy of the Indian legal framework was built.
   * Personas such as citizen, lawyer, policymaker, and researcher were defined to guide synthetic data generation.
4. **Dataset Construction**
   * ~2M Q&A pairs curated from open sources.
   * Additional synthetic data grounded in the taxonomy and personas expanded the dataset.
   * In total, 5M Q&A pairs were used for fine-tuning.
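A persona-grounded Q&A record from steps 3 and 4 could be sketched as below. The field names, example content, and the flattening helper are illustrative assumptions for exposition, not the actual LegalParam data schema or pipeline:

```python
# Hypothetical persona-grounded Q&A record; field names are illustrative
# assumptions, not the actual LegalParam dataset schema.
record = {
    "persona": "citizen",                       # one of: citizen, lawyer, policymaker, researcher
    "source_act": "Registration Act, 1908",     # act the answer is grounded in
    "question": "Do I need to register a gift deed for agricultural land?",
    "answer": "Yes; under the relevant provisions, a gift deed for immovable property must be registered.",
    "language": "en",                           # en or hi for bilingual coverage
}

def to_instruction_pair(rec):
    """Flatten a record into an (instruction, response) pair for fine-tuning."""
    instruction = f"[{rec['persona']}] {rec['question']}"
    return instruction, rec["answer"]

instruction, response = to_instruction_pair(record)
print(instruction)  # [citizen] Do I need to register a gift deed for agricultural land?
```

Tagging each example with its persona and source act is one simple way to keep synthetic generations traceable back to the taxonomy.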
---

## 🏋️ Training Setup

* **Base model**: Param-1-2.9B-Instruct
* **Training framework**: Hugging Face + `torchrun` multi-node setup
* **Prompt template**: Custom-designed for legal inference
* **Scheduler**: Linear with warmup
* **Epochs**: 3
* **Total training samples**: 12M
* **Test samples**: 1.2M
* **Base learning rate**: 5e-6
* **Minimum learning rate**: 0
* **Additional tokens**: ``, ``, ``, ``, ``, ``
* **Vocab size**: 256k + 6
* **Global batch size**: 1024
* **Micro batch size**: 4
* **Gradient accumulation steps**: 32

---

## 🚀 Inference Example

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "bharatgenai/LegalParam"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float32,
    device_map="auto"
)

# Example legal query
user_input = "What steps should a farmer take to legally transfer agricultural land ownership?"

# Three prompt types are supported:
# 1. Generic QA
# 2. Context-based QA (context as part of the prompt)
# 3. Multi-turn conversation
# Use the prompt type that matches your use case (refer to the templates below).
prompt = f"\n{user_input}\n"
# prompt = f"\n{user_or_rag_context}\n\n"
# prompt = f"\n{user_input1}\n\n{user_input2}\n {user_input3} ..."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=300,
        do_sample=True,
        top_k=50,
        top_p=0.95,
        temperature=0.6,
        eos_token_id=tokenizer.eos_token_id,
        use_cache=False
    )

print(tokenizer.decode(output[0], skip_special_tokens=True))
```

---

## 📊 Evaluation

Evaluation covered areas including:

* Legal Skills & Communication Q&A
* Corporate, Commercial & Criminal Law
* Legal Theory & Jurisprudence
* Public International Law
* Humanitarian & Refugee Rights
* Bilingual (English/Hindi) capability

### **[BhashaBench-Legal (BBL)](https://huggingface.co/datasets/bharatgenai/BhashaBench-Legal)**

| Model | BBL | BBL-English | BBL-Hindi |
|--------------------------------------|-------|-------------|-----------|
| **gemma-2-2b-it** | 33.22 | 34.49 | 30.25 |
| **granite-3.1-2b-instruct** | 34.91 | 38.18 | 27.30 |
| **Llama-3.2-1B-Instruct** | 28.47 | 29.08 | 27.04 |
| **Llama-3.2-3B-Instruct** | 36.86 | 39.74 | 30.13 |
| **Nemotron-4-Mini-Hindi-4B-Instruct** | 36.12 | 36.99 | 34.11 |
| **Qwen2.5-3B-Instruct** | 37.39 | 40.62 | 29.89 |
| **LegalParam** | 35.17 | 36.15 | 32.89 |

---

### **Subject Domain Performance**

| Domain | gemma-2-2b-it | granite-3.1-2b-instruct | Llama-3.2-1B-Instruct | Llama-3.2-3B-Instruct | Nemotron-4-Mini-Hindi-4B-Instruct | Qwen2.5-3B-Instruct | LegalParam |
|--------------------------------------|---------------|--------------------------|-----------------------|-----------------------|------------------------------------|---------------------|-------------|
| **Civil Litigation & Procedure** | 32.33 | 33.69 | 28.18 | 34.97 | 32.77 | 35.31 | 35.07 |
| **Constitutional & Administrative Law** | 33.75 | 36.35 | 28.15 | 40.62 | 40.37 | 37.93 | 38.43 |
| **Consumer & Competition Law** | 37.33 | 34.67 | 22.67 | 34.67 | 38.67 | 46.67 | 32.00 |
| **Corporate & Commercial Law** | 31.04 | 34.74 | 28.63 | 34.67 | 33.59 | 37.70 | 33.96 |
| **Criminal Law & Justice** | 32.47 | 33.33 | 26.98 | 33.66 | 34.71 | 34.45 | 33.95 |
| **Employment & Labour Law** | 37.14 | 36.00 | 25.71 | 29.14 | 41.71 | 37.14 | 32.00 |
| **Environmental & Energy Law** | 32.33 | 32.79 | 24.42 | 37.91 | 36.28 | 38.84 | 32.56 |
| **Family & Personal Law** | 31.18 | 30.68 | 28.86 | 31.69 | 32.69 | 32.80 | 32.09 |
| **General Academic Subjects** | 38.84 | 39.81 | 32.52 | 43.91 | 43.05 | 45.44 | 38.21 |
| **Healthcare & Medical Law** | 40.00 | 68.00 | 20.00 | 40.00 | 64.00 | 40.00 | 48.00 |
| **Human Rights & Social Justice** | 15.79 | 42.11 | 42.11 | 26.32 | 10.53 | 31.58 | 10.53 |
| **Intellectual Property Law** | 48.35 | 47.25 | 31.87 | 45.05 | 42.86 | 54.95 | 39.56 |
| **Interdisciplinary Studies** | 37.19 | 40.77 | 28.10 | 41.32 | 42.98 | 44.08 | 31.96 |
| **International & Comparative Law** | 37.32 | 39.92 | 30.35 | 45.22 | 44.18 | 43.76 | 37.84 |
| **Legal Skills & Communication** | 27.94 | 29.53 | 27.33 | 30.15 | 29.90 | 31.74 | 27.82 |
| **Legal Theory & Jurisprudence** | 35.33 | 35.61 | 28.36 | 39.69 | 38.78 | 40.04 | 36.31 |
| **Media & Entertainment Law** | 44.44 | 44.44 | 35.19 | 51.85 | 35.19 | 33.33 | 35.19 |
| **Real Estate & Property Law** | 28.30 | 32.75 | 25.91 | 31.96 | 32.11 | 33.55 | 28.62 |
| **Tax & Revenue Law** | 32.03 | 31.60 | 31.60 | 38.10 | 38.10 | 37.66 | 39.83 |
| **Technology & Cyber Law** | 44.72 | 39.84 | 41.46 | 49.59 | 49.59 | 59.35 | 43.90 |

---

### **Question Level Difficulty**

| Difficulty | gemma-2-2b-it | granite-3.1-2b-instruct | Llama-3.2-1B-Instruct | Llama-3.2-3B-Instruct | Nemotron-4-Mini-Hindi-4B-Instruct | Qwen2.5-3B-Instruct | LegalParam |
|------------|---------------|--------------------------|-----------------------|-----------------------|------------------------------------|---------------------|-------------|
| **Easy** | 35.66 | 37.00 | 29.88 | 40.19 | 38.65 | 39.81 | 37.96 |
| **Medium** | 30.25 | 32.45 | 26.46 | 32.49 | 32.77 | 34.30 | 31.61 |
| **Hard** | 27.51 | 29.23 | 27.70 | 31.81 | 32.57 | 33.05 | 30.18 |

---
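As a quick sanity check on the tables above, the per-difficulty scores for LegalParam can be averaged without weighting. Note this is only a rough check: the published BBL aggregate (35.17) presumably weights by question count per difficulty level, which is not released here, so an unweighted mean will not reproduce it exactly:

```python
# Unweighted mean of LegalParam's BBL scores by difficulty level.
# This ignores per-level question counts, so it is a sanity check,
# not a reproduction of the official BBL aggregate of 35.17.
scores = {"Easy": 37.96, "Medium": 31.61, "Hard": 30.18}
unweighted_mean = sum(scores.values()) / len(scores)
print(round(unweighted_mean, 2))  # 33.25
```

The gap between 33.25 and 35.17 suggests the benchmark contains more easy questions than hard ones.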