	---
	license: mit
	language:
	- en
	metrics:
	- accuracy
	- recall
	base_model:
	- BAAI/bge-large-en-v1.5


	pipeline_tag: sentence-similarity
	library_name: sentence-transformers

	tags:
	- legal
	- law
	- WA
	- sentence-transformers
	- feature-extraction
	- sentence-similarity
	- dense
	- loss:MultipleNegativesRankingLoss

	model-index:
	- name: Washington-state-law-embedding-model-Large
	  results:
	  - task:
	      type: information-retrieval
	      name: Information Retrieval
	    dataset:
	      name: RCW Validation
	      type: rcw-validation
	    metrics:
	    - type: cosine_accuracy@10
	      value: 0.8344200750839755
	      name: Cosine Accuracy@10
	    - type: cosine_accuracy@1
	      value: 0.08774945662912467
	      name: Cosine Accuracy@1
	    - type: cosine_accuracy@3
	      value: 0.2561944279786603
	      name: Cosine Accuracy@3
	    - type: cosine_accuracy@5
	      value: 0.42533096226042283
	      name: Cosine Accuracy@5
	    - type: cosine_precision@1
	      value: 0.08774945662912467
	      name: Cosine Precision@1
	    - type: cosine_precision@3
	      value: 0.08539814265955344
	      name: Cosine Precision@3
	    - type: cosine_precision@5
	      value: 0.08506619245208456
	      name: Cosine Precision@5
	    - type: cosine_precision@10
	      value: 0.08344200750839757
	      name: Cosine Precision@10
	    - type: cosine_recall@1
	      value: 0.08774945662912467
	      name: Cosine Recall@1
	    - type: cosine_recall@3
	      value: 0.2561944279786603
	      name: Cosine Recall@3
	    - type: cosine_recall@5
	      value: 0.42533096226042283
	      name: Cosine Recall@5
	    - type: cosine_recall@10
	      value: 0.8344200750839755
	      name: Cosine Recall@10
	    - type: cosine_ndcg@10
	      value: 0.3829692177232852
	      name: Cosine Ndcg@10
	    - type: cosine_mrr@10
	      value: 0.24923231025931583
	      name: Cosine Mrr@10
	    - type: cosine_map@100
	      value: 0.25674619603156057
	      name: Cosine Map@100
	datasets:
	- CSI-lab/RCW_2025_Positive_Query_Pairs
	---

	# Washington-state-law-embedding-model-Large

	Washington-state-law-embedding-model-Large is an embedding model fine-tuned from `BAAI/bge-large-en-v1.5` for legal information retrieval (IR) over Washington State law.

	Generic embedding models often perform poorly on legal texts because of the semantic gap between natural language questions (e.g., "What dollar amount makes a theft a first degree felony?") and formal statutory language. This model is trained to narrow that gap, so that plain-English queries, legal scenarios, and document drafts map to the corresponding Washington State statutes (the Revised Code of Washington, RCW).

	## Available Models

	| Model | Language | Description | Query Prefix |
	|:------|:---------|:------------|:-------------|
	| [CSI-lab/Washington-state-law-embedding-model-Large](https://huggingface.co/CSI-lab/Washington-state-law-embedding-model-Large) | English | Fine-tuned `large` model (1024d) for WA State RCWs. Best performance. | `Represent this sentence for searching relevant passages: ` |
	| [CSI-lab/Washington-state-law-embedding-model-Base](https://huggingface.co/CSI-lab/Washington-state-law-embedding-model-Base) | English | Fine-tuned `base` model (768d) for WA State RCWs. Faster inference. | `Represent this sentence for searching relevant passages: ` |

	## Model Overview
	* Base Model: `BAAI/bge-large-en-v1.5`
	* Task: Semantic Search / Information Retrieval / Legal Preemption Analysis
	* Language: English (Legal Domain)
	* Max Sequence Length: 512 tokens
	* Output Dimensionality: 1024 dimensions
	* Similarity Function: Cosine Similarity

	## Key Features
	- Fine-tuned for Washington State legal domain (RCW)
	- Optimized for semantic search and retrieval tasks
	- Supports natural language legal queries
	- Designed for RAG-based legal assistants
	- 1024-dimensional embeddings from the `large` architecture, the better-performing of the two available models

	## Intended Use Cases
	This model is optimized to act as the retriever component in legal Retrieval-Augmented Generation (RAG) pipelines. Primary use cases include:
	1. Statutory Cross-Referencing: Mapping natural language legal questions to specific RCWs.
	2. Preemption Checking: Automatically retrieving state laws that may preempt or conflict with proposed municipal ordinances.
	3. Legal Research Automation: Clustering and searching local agency drafts against established state frameworks.
	4. AI Legal Assistants: Powering chatbots and research tools that require accurate retrieval of Washington State laws before generating an answer.
	5. Automated Compliance: Scanning contracts or external drafts against established state legislative frameworks.

	## Technical Details & Training Methodology

	### The Semantic Gap
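	A general-purpose dense retriever often underperforms on legal tasks because plain-English questions and statutory text share little vocabulary, and the retriever has not learned to map everyday concepts onto their legal counterparts. To address this, `Washington-state-law-embedding-model-Large` was fine-tuned on a synthetic dataset that pairs varied query phrasings with the statutes that answer them.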

	### Training Data
	The model was fine-tuned on synthetic legal query–passage pairs generated from Washington State RCW statutes.

	The dataset contains 455,424 training samples and includes:
	- Natural language paraphrases of legal questions
	- Hypothetical legal scenarios
	- Statute-grounded positive document matches

	The dataset spans 500+ legal categories derived from RCW structure.

	### Hyperparameters & Architecture
	* Loss Function: Multiple Negatives Ranking (MNR) Loss
	* Batch Size: 32
	* Epochs: 4
	* fp16: True
	* batch_sampler: no_duplicates
	* multi_dataset_batch_sampler: round_robin
	* Learning Rate Decay: Linear
	* Infrastructure: High-Performance Computing (HPC) Cluster
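
For readers unfamiliar with the loss named above, the following is a minimal pure-Python sketch of the idea behind Multiple Negatives Ranking Loss: within a batch, passage `i` is the positive for query `i`, and every other passage in the batch serves as a negative. The helper names (`cosine`, `mnr_loss`) and the toy 2-d vectors are illustrative assumptions, and `scale=20.0` mirrors the `sentence-transformers` default; the scale actually used for this model is not stated in the card.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mnr_loss(query_vecs, passage_vecs, scale=20.0):
    """Mean cross-entropy where passage i is the positive for query i
    and all other passages in the batch act as negatives."""
    total = 0.0
    for i, q in enumerate(query_vecs):
        logits = [scale * cosine(q, p) for p in passage_vecs]
        max_logit = max(logits)  # subtract the max for numerical stability
        log_sum_exp = max_logit + math.log(
            sum(math.exp(x - max_logit) for x in logits)
        )
        total += log_sum_exp - logits[i]
    return total / len(query_vecs)

# Toy 2-d "embeddings" (hypothetical values, not model output).
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # each query is closest to its own passage
swapped = [[0.1, 0.9], [0.9, 0.1]]   # each query is closest to the wrong passage

print(mnr_loss(queries, aligned))  # small loss
print(mnr_loss(queries, swapped))  # much larger loss
```

Because the negatives come from the rest of the batch, the `no_duplicates` batch sampler listed above matters: a duplicate passage in the same batch would be treated as a negative for its own query.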

	#### All Hyperparameters
	<details><summary>Click to expand</summary>

	- `overwrite_output_dir`: False
	- `do_predict`: False
	- `eval_strategy`: steps
	- `prediction_loss_only`: True
	- `per_device_train_batch_size`: 32
	- `per_device_eval_batch_size`: 32
	- `per_gpu_train_batch_size`: None
	- `per_gpu_eval_batch_size`: None
	- `gradient_accumulation_steps`: 1
	- `eval_accumulation_steps`: None
	- `torch_empty_cache_steps`: None
	- `learning_rate`: 5e-05
	- `weight_decay`: 0.0
	- `adam_beta1`: 0.9
	- `adam_beta2`: 0.999
	- `adam_epsilon`: 1e-08
	- `max_grad_norm`: 1
	- `num_train_epochs`: 4
	- `max_steps`: -1
	- `lr_scheduler_type`: linear
	- `lr_scheduler_kwargs`: {}
	- `warmup_ratio`: 0.0
	- `warmup_steps`: 0
	- `log_level`: passive
	- `log_level_replica`: warning
	- `log_on_each_node`: True
	- `logging_nan_inf_filter`: True
	- `save_safetensors`: True
	- `save_on_each_node`: False
	- `save_only_model`: False
	- `restore_callback_states_from_checkpoint`: False
	- `no_cuda`: False
	- `use_cpu`: False
	- `use_mps_device`: False
	- `seed`: 42
	- `data_seed`: None
	- `jit_mode_eval`: False
	- `use_ipex`: False
	- `bf16`: False
	- `fp16`: True
	- `fp16_opt_level`: O1
	- `half_precision_backend`: auto
	- `bf16_full_eval`: False
	- `fp16_full_eval`: False
	- `tf32`: None
	- `local_rank`: 0
	- `ddp_backend`: None
	- `tpu_num_cores`: None
	- `tpu_metrics_debug`: False
	- `debug`: []
	- `dataloader_drop_last`: False
	- `dataloader_num_workers`: 0
	- `dataloader_prefetch_factor`: None
	- `past_index`: -1
	- `disable_tqdm`: False
	- `remove_unused_columns`: True
	- `label_names`: None
	- `load_best_model_at_end`: False
	- `ignore_data_skip`: False
	- `fsdp`: []
	- `fsdp_min_num_params`: 0
	- `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
	- `fsdp_transformer_layer_cls_to_wrap`: None
	- `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
	- `parallelism_config`: None
	- `deepspeed`: None
	- `label_smoothing_factor`: 0.0
	- `optim`: adamw_torch_fused
	- `optim_args`: None
	- `adafactor`: False
	- `group_by_length`: False
	- `length_column_name`: length
	- `ddp_find_unused_parameters`: None
	- `ddp_bucket_cap_mb`: None
	- `ddp_broadcast_buffers`: False
	- `dataloader_pin_memory`: True
	- `dataloader_persistent_workers`: False
	- `skip_memory_metrics`: True
	- `use_legacy_prediction_loop`: False
	- `push_to_hub`: False
	- `resume_from_checkpoint`: None
	- `hub_model_id`: None
	- `hub_strategy`: every_save
	- `hub_private_repo`: None
	- `hub_always_push`: False
	- `hub_revision`: None
	- `gradient_checkpointing`: False
	- `gradient_checkpointing_kwargs`: None
	- `include_inputs_for_metrics`: False
	- `include_for_metrics`: []
	- `eval_do_concat_batches`: True
	- `fp16_backend`: auto
	- `push_to_hub_model_id`: None
	- `push_to_hub_organization`: None
	- `mp_parameters`:
	- `auto_find_batch_size`: False
	- `full_determinism`: False
	- `torchdynamo`: None
	- `ray_scope`: last
	- `ddp_timeout`: 1800
	- `torch_compile`: False
	- `torch_compile_backend`: None
	- `torch_compile_mode`: None
	- `include_tokens_per_second`: False
	- `include_num_input_tokens_seen`: False
	- `neftune_noise_alpha`: None
	- `optim_target_modules`: None
	- `batch_eval_metrics`: False
	- `eval_on_start`: False
	- `use_liger_kernel`: False
	- `liger_kernel_config`: None
	- `eval_use_gather_object`: False
	- `average_tokens_across_devices`: False
	- `prompts`: None
	- `batch_sampler`: no_duplicates
	- `multi_dataset_batch_sampler`: round_robin
	- `router_mapping`: {}
	- `learning_rate_mapping`: {}

	</details>
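
As a rough illustration of what `lr_scheduler_type: linear` with `warmup_steps: 0` implies, the sketch below computes the number of optimizer steps and the learning rate at a given step. It assumes a single training device and no gradient accumulation (the card does not state the device count), and the function names are hypothetical helpers rather than part of any library.

```python
import math

def total_training_steps(num_samples, batch_size, epochs):
    """Optimizer steps for the full run, keeping the last partial batch."""
    return math.ceil(num_samples / batch_size) * epochs

def linear_decay_lr(step, total_steps, base_lr=5e-05):
    """Learning rate after `step` updates with linear decay and no warmup."""
    remaining = max(0, total_steps - step)
    return base_lr * remaining / total_steps

# 455,424 samples, batch size 32, 4 epochs (values taken from this card).
steps = total_training_steps(455_424, 32, 4)
print(steps)                               # 56928
print(linear_decay_lr(0, steps))           # starts at the base rate
print(linear_decay_lr(steps // 2, steps))  # roughly half the base rate
print(linear_decay_lr(steps, steps))       # 0.0 at the end of training
```

Under these assumptions one epoch is 14,232 steps, so the peak validation checkpoint reported at epoch 3.02 falls about three quarters of the way through the schedule.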

	## Evaluation Metrics

	The model was evaluated on a held-out validation set of synthetic municipal drafts, each mapped 1-to-1 to a Washington State RCW. The table below compares peak validation performance (reached at epoch 3.02) against the base `bge-large` model without fine-tuning.

	| Metric | Base Model (No Fine-Tuning) | Fine-Tuned (Peak @ Epoch 3.02) | Absolute Improvement (percentage points) |
	|:-------|:----------------------------|:-------------------------------|:-----------------------------------------|
	| Recall@10 | 0.5684 | 0.8354 | +26.70 |
	| Recall@5 | 0.2842 | 0.4255 | +14.13 |
	| NDCG@10 | 0.2509 | 0.3828 | +12.38 |
	| MRR@10 | 0.1569 | 0.2487 | +9.18 |

	Interpretation: Without fine-tuning, the base model already returns the governing statute within the top 10 results for 56.8% of validation queries. Fine-tuning raises this to 83.5%, a gain of 26.7 percentage points, with smaller gains on Recall@5, NDCG@10, and MRR@10.
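
To make the metric definitions concrete, here is a small sketch of how Recall@k, Precision@k, MRR@k, and NDCG@k can be computed when every query has exactly one relevant statute, as in the 1-to-1 validation set described above. In that setting Accuracy@k equals Recall@k and Precision@k equals Recall@k divided by k, which matches the pattern in the model card metadata (Precision@10 of about 0.0834 against Recall@10 of about 0.8344). The function names and the toy rankings are illustrative assumptions, not the evaluation code used for this model.

```python
import math

def rank_of_relevant(ranked_ids, relevant_id):
    """1-based position of the relevant document, or None if it is absent."""
    for position, doc_id in enumerate(ranked_ids, start=1):
        if doc_id == relevant_id:
            return position
    return None

def evaluate(rankings, relevant_ids, k=10):
    """Average retrieval metrics over queries with one relevant document each."""
    n = len(rankings)
    hits = reciprocal_ranks = gains = 0.0
    for ranked, relevant in zip(rankings, relevant_ids):
        rank = rank_of_relevant(ranked[:k], relevant)
        if rank is not None:
            hits += 1
            reciprocal_ranks += 1.0 / rank
            # With a single relevant document the ideal DCG is 1,
            # so NDCG reduces to the discounted gain at its rank.
            gains += 1.0 / math.log2(rank + 1)
    recall = hits / n
    return {
        f"recall@{k}": recall,
        f"precision@{k}": recall / k,
        f"mrr@{k}": reciprocal_ranks / n,
        f"ndcg@{k}": gains / n,
    }

# Toy example: three queries, ranked document ids, one relevant id per query.
rankings = [["a", "b", "c"], ["b", "a", "c"], ["c", "b", "a"]]
relevant_ids = ["a", "a", "z"]  # found at rank 1, found at rank 2, not found
print(evaluate(rankings, relevant_ids, k=3))
```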

	## Limitations

	- This model does not provide legal advice.
	- Performance is limited to Washington State law (RCW) and may not generalize to other jurisdictions.
	- Outputs depend on the quality of the underlying document corpus.
	- Should be used as a retrieval tool, not a final decision-making system.

	## Usage Examples

	### Semantic Search with `sentence-transformers`
	<div style="padding:10px; border-left:4px solid #ff4d4f; background-color:#fff1f0;">

	Warning: Because this model is built on the BGE architecture, you must prepend the instruction prefix
	`"Represent this sentence for searching relevant passages: "`
	(note the trailing space) to your search queries to achieve optimal performance.

	Do not add this prefix to the database documents.

	</div>

	```python
	from sentence_transformers import SentenceTransformer, util

	# 1. Load the fine-tuned model
	model = SentenceTransformer('CSI-lab/Washington-state-law-embedding-model-Large')

	# 2. Define the laws (Your Vector Database)
	laws = [
	    "RCW 9A.56.030: Theft in the first degree. A person is guilty of theft in the first degree if he or she commits theft of property or services which exceed(s) five thousand dollars in value.",
	    "RCW 46.61.502: Driving under the influence. A person is guilty of driving while under the influence of intoxicating liquor...",
	    "RCW 9A.36.011: Assault in the first degree. A person is guilty of assault in the first degree if he or she...",
	]

	# 3. Define the user's search query
	user_query = "What dollar amount makes a theft a first degree felony?"

	# 4. CRITICAL: Add the required BGE prefix to the query ONLY
	query_prefix = "Represent this sentence for searching relevant passages: "
	formatted_query = query_prefix + user_query

	# 5. Encode the documents and the query
	law_embeddings = model.encode(laws, convert_to_tensor=True)
	query_embedding = model.encode(formatted_query, convert_to_tensor=True)

	# 6. Calculate Cosine Similarity
	cosine_scores = util.cos_sim(query_embedding, law_embeddings)

	# 7. Print the top result
	best_idx = cosine_scores.argmax().item()
	print(f"Top Match: {laws[best_idx]}")
	print(f"Similarity Score: {cosine_scores[0][best_idx]:.4f}")
	```
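
The example above prints only the single best match. Since the reported validation numbers show Recall@10 far above Accuracy@1, a retriever in a RAG pipeline would more plausibly pass the top few candidates downstream. The sketch below shows one way to rank the top `k` passages by cosine similarity and to apply the query prefix consistently. It runs on toy vectors standing in for `model.encode(...)` output, and `format_query`, `cosine`, and `top_k` are hypothetical helper names, not library functions.

```python
import math

QUERY_PREFIX = "Represent this sentence for searching relevant passages: "

def format_query(text):
    """Prepend the BGE instruction to a query; passages are left unchanged."""
    return QUERY_PREFIX + text

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k(query_vec, passage_vecs, k=10):
    """Return (index, score) pairs for the k most similar passages."""
    scored = [(i, cosine(query_vec, p)) for i, p in enumerate(passage_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy 3-d vectors (hypothetical values, not real embeddings).
passage_vecs = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
query_vec = [1.0, 0.0, 0.0]

print(format_query("What dollar amount makes a theft a first degree felony?"))
for index, score in top_k(query_vec, passage_vecs, k=2):
    print(index, round(score, 4))  # passage 0 ranks first, then passage 2
```

With real embeddings, the same ranking can be obtained from the `cosine_scores` tensor in the example above by sorting its single row instead of taking `argmax`.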

	## Model Citation
	```bibtex
	@misc{washington_state_law_embedding_Large_2026,
	title={Washington-state-law-embedding-model-Large: Fine-Tuned Dense Retrieval for Washington State Law},
	author={Tomar, Shlok},
	year={2026},
	publisher={Hugging Face},
	howpublished={\url{https://huggingface.co/CSI-lab/Washington-state-law-embedding-model-Large}},
	note={Hugging Face Model Repository}
	}
	```

	### BibTeX

	#### Sentence Transformers
	```bibtex
	@inproceedings{reimers-2019-sentence-bert,
	title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
	author = "Reimers, Nils and Gurevych, Iryna",
	booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
	month = "11",
	year = "2019",
	publisher = "Association for Computational Linguistics",
	url = "https://arxiv.org/abs/1908.10084",
	}
	```

	#### MultipleNegativesRankingLoss
	```bibtex
	@misc{henderson2017efficient,
	title={Efficient Natural Language Response Suggestion for Smart Reply},
	author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
	year={2017},
	eprint={1705.00652},
	archivePrefix={arXiv},
	primaryClass={cs.CL}
	}
	```