---
license: mit
language:
- en
metrics:
- accuracy
- recall
base_model:
- BAAI/bge-large-en-v1.5
pipeline_tag: sentence-similarity
library_name: sentence-transformers
tags:
- legal
- law
- WA
- sentence-transformers
- feature-extraction
- sentence-similarity
- dense
- loss:MultipleNegativesRankingLoss
model-index:
- name: Washington-state-law-embedding-model-Large
  results:
  - task:
      type: information-retrieval
      name: Information Retrieval
    dataset:
      name: RCW Validation
      type: rcw-validation
    metrics:
    - type: cosine_accuracy@10
      value: 0.8344200750839755
      name: Cosine Accuracy@10
    - type: cosine_accuracy@1
      value: 0.08774945662912467
      name: Cosine Accuracy@1
    - type: cosine_accuracy@3
      value: 0.2561944279786603
      name: Cosine Accuracy@3
    - type: cosine_accuracy@5
      value: 0.42533096226042283
      name: Cosine Accuracy@5
    - type: cosine_precision@1
      value: 0.08774945662912467
      name: Cosine Precision@1
    - type: cosine_precision@3
      value: 0.08539814265955344
      name: Cosine Precision@3
    - type: cosine_precision@5
      value: 0.08506619245208456
      name: Cosine Precision@5
    - type: cosine_precision@10
      value: 0.08344200750839757
      name: Cosine Precision@10
    - type: cosine_recall@1
      value: 0.08774945662912467
      name: Cosine Recall@1
    - type: cosine_recall@3
      value: 0.2561944279786603
      name: Cosine Recall@3
    - type: cosine_recall@5
      value: 0.42533096226042283
      name: Cosine Recall@5
    - type: cosine_recall@10
      value: 0.8344200750839755
      name: Cosine Recall@10
    - type: cosine_ndcg@10
      value: 0.3829692177232852
      name: Cosine Ndcg@10
    - type: cosine_mrr@10
      value: 0.24923231025931583
      name: Cosine Mrr@10
    - type: cosine_map@100
      value: 0.25674619603156057
      name: Cosine Map@100
datasets:
- CSI-lab/RCW_2025_Positive_Query_Pairs
---
| # Washington-state-law-embedding-model-Large | |
| **Washington-state-law-embedding-model-Large** is an embedding model fine-tuned for legal information retrieval (IR) over the laws of the State of Washington. | |
| Generic embedding models often perform suboptimally on legal texts due to the semantic gap between natural language questions (e.g., "What dollar amount makes a theft a first degree felony?") and formal statutory legalese. This model bridges that gap, allowing plain-English queries, legal scenarios, and document drafts to be accurately mapped to their corresponding Washington State statutes (Revised Code of Washington - RCW). | |
| ## Available Models | |
| | Model | Language | Description | Query Prefix | | |
| |:------|:---------|:------------|:-------------| | |
| | [CSI-lab/Washington-state-law-embedding-model-Large](https://huggingface.co/CSI-lab/Washington-state-law-embedding-model-Large) | English | Fine-tuned `large` model (1024d) for WA State RCWs. Best performance. | `Represent this sentence for searching relevant passages: ` | | |
| | [CSI-lab/Washington-state-law-embedding-model-Base](https://huggingface.co/CSI-lab/Washington-state-law-embedding-model-Base) | English | Fine-tuned `base` model (768d) for WA State RCWs. Faster inference. | `Represent this sentence for searching relevant passages: ` | | |
| ## Model Overview | |
| * **Base Model:** `BAAI/bge-large-en-v1.5` | |
| * **Task:** Semantic Search / Information Retrieval / Legal Preemption Analysis | |
| * **Language:** English (Legal Domain) | |
| * **Max Sequence Length:** 512 tokens | |
| * **Output Dimensionality:** 1024 dimensions | |
| * **Similarity Function:** Cosine Similarity | |
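As an illustration of the similarity function listed above, cosine similarity can be sketched in plain Python. The 3-dimensional vectors are made up (real embeddings from this model have 1024 dimensions), and `cosine_similarity` is a hypothetical helper, not a library function.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Toy 3-d vectors standing in for 1024-d embeddings (assumed values).
query_vec = [1.0, 2.0, 0.0]
statute_vec = [2.0, 4.0, 0.0]    # same direction as the query
unrelated_vec = [0.0, 0.0, 3.0]  # orthogonal to the query

same_direction = cosine_similarity(query_vec, statute_vec)
orthogonal = cosine_similarity(query_vec, unrelated_vec)
print(same_direction, orthogonal)
```

Cosine similarity ignores vector length, so a short query and a long statute can still score close to 1 if their embeddings point in the same direction.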
| ## Key Features | |
| - Fine-tuned for Washington State legal domain (RCW) | |
| - Optimized for semantic search and retrieval tasks | |
| - Supports natural language legal queries | |
| - Designed for RAG-based legal assistants | |
| - Best retrieval performance of the two released models, using the 1024-dimensional `large` architecture | |
| ## Intended Use Cases | |
| This model is optimized to act as the retriever component in legal Retrieval-Augmented Generation (RAG) pipelines. Primary use cases include: | |
| 1. **Statutory Cross-Referencing:** Mapping natural language legal questions to specific RCWs. | |
| 2. **Preemption Checking:** Automatically retrieving state laws that may preempt or conflict with proposed municipal ordinances. | |
| 3. **Legal Research Automation:** Clustering and searching local agency drafts against established state frameworks. | |
| 4. **AI Legal Assistants:** Powering chatbots and research tools that require accurate retrieval of Washington State laws before generating an answer. | |
| 5. **Automated Compliance:** Scanning contracts or external drafts against established state legislative frameworks. | |
| ## Technical Details & Training Methodology | |
| ### The Semantic Gap | |
| A general-purpose dense retriever often struggles on legal tasks because it tends to match surface wording rather than map a plain-language question to the legal concept a statute expresses. To address this, `Washington-state-law-embedding-model-Large` was fine-tuned on a synthetic dataset that covers a wide variety of query styles. | |
| ### Training Data | |
| The model was fine-tuned on synthetic legal query–passage pairs generated from Washington State RCW statutes. | |
| The dataset includes: | |
| - Size: 455,424 training samples | |
| - Natural language paraphrases of legal questions | |
| - Hypothetical legal scenarios | |
| - Statute-grounded positive document matches | |
| The dataset spans 500+ legal categories derived from RCW structure. | |
| ### Hyperparameters & Architecture | |
| * **Loss Function:** Multiple Negatives Ranking (MNR) Loss | |
| * **Batch Size:** 32 | |
| * **Epochs:** 4 | |
| * **fp16:** True | |
| * **batch_sampler:** no_duplicates | |
| * **multi_dataset_batch_sampler:** round_robin | |
| * **Learning Rate Decay:** Linear | |
| * **Infrastructure:** High-Performance Computing (HPC) Cluster | |
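To make the loss concrete, here is a minimal pure-Python sketch of Multiple Negatives Ranking loss over one batch: each query's paired passage is the positive, and every other passage in the batch serves as a negative. The similarity matrices are made up, and the scale of 20 is an assumption (it matches the `sentence-transformers` default, but this card does not state the value used). The `no_duplicates` sampler matters here because a passage repeated within a batch would be scored as a false negative.

```python
import math

def mnr_loss(similarities, scale=20.0):
    """Mean cross-entropy where row i's correct class is column i.

    similarities[i][j] is the cosine similarity between query i and
    passage j; passage i is the positive for query i.
    """
    losses = []
    for i, row in enumerate(similarities):
        logits = [scale * s for s in row]
        max_logit = max(logits)  # subtract the max for numerical stability
        log_sum_exp = max_logit + math.log(sum(math.exp(z - max_logit) for z in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)

# Toy batches of 3 query-passage pairs (values are illustrative only).
well_separated = [
    [0.9, 0.1, 0.2],
    [0.0, 0.8, 0.1],
    [0.2, 0.1, 0.9],
]
confused = [
    [0.5, 0.5, 0.5],
    [0.5, 0.5, 0.5],
    [0.5, 0.5, 0.5],
]

loss_good = mnr_loss(well_separated)
loss_bad = mnr_loss(confused)
print(f"separated: {loss_good:.6f}  confused: {loss_bad:.6f}")
```

When the model cannot tell the positive from the in-batch negatives, the loss equals the log of the batch size; it approaches zero as the diagonal entries pull away from the rest of each row.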
| #### All Hyperparameters | |
| <details><summary>Click to expand</summary> | |
| - `overwrite_output_dir`: False | |
| - `do_predict`: False | |
| - `eval_strategy`: steps | |
| - `prediction_loss_only`: True | |
| - `per_device_train_batch_size`: 32 | |
| - `per_device_eval_batch_size`: 32 | |
| - `per_gpu_train_batch_size`: None | |
| - `per_gpu_eval_batch_size`: None | |
| - `gradient_accumulation_steps`: 1 | |
| - `eval_accumulation_steps`: None | |
| - `torch_empty_cache_steps`: None | |
| - `learning_rate`: 5e-05 | |
| - `weight_decay`: 0.0 | |
| - `adam_beta1`: 0.9 | |
| - `adam_beta2`: 0.999 | |
| - `adam_epsilon`: 1e-08 | |
| - `max_grad_norm`: 1 | |
| - `num_train_epochs`: 4 | |
| - `max_steps`: -1 | |
| - `lr_scheduler_type`: linear | |
| - `lr_scheduler_kwargs`: {} | |
| - `warmup_ratio`: 0.0 | |
| - `warmup_steps`: 0 | |
| - `log_level`: passive | |
| - `log_level_replica`: warning | |
| - `log_on_each_node`: True | |
| - `logging_nan_inf_filter`: True | |
| - `save_safetensors`: True | |
| - `save_on_each_node`: False | |
| - `save_only_model`: False | |
| - `restore_callback_states_from_checkpoint`: False | |
| - `no_cuda`: False | |
| - `use_cpu`: False | |
| - `use_mps_device`: False | |
| - `seed`: 42 | |
| - `data_seed`: None | |
| - `jit_mode_eval`: False | |
| - `use_ipex`: False | |
| - `bf16`: False | |
| - `fp16`: True | |
| - `fp16_opt_level`: O1 | |
| - `half_precision_backend`: auto | |
| - `bf16_full_eval`: False | |
| - `fp16_full_eval`: False | |
| - `tf32`: None | |
| - `local_rank`: 0 | |
| - `ddp_backend`: None | |
| - `tpu_num_cores`: None | |
| - `tpu_metrics_debug`: False | |
| - `debug`: [] | |
| - `dataloader_drop_last`: False | |
| - `dataloader_num_workers`: 0 | |
| - `dataloader_prefetch_factor`: None | |
| - `past_index`: -1 | |
| - `disable_tqdm`: False | |
| - `remove_unused_columns`: True | |
| - `label_names`: None | |
| - `load_best_model_at_end`: False | |
| - `ignore_data_skip`: False | |
| - `fsdp`: [] | |
| - `fsdp_min_num_params`: 0 | |
| - `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False} | |
| - `fsdp_transformer_layer_cls_to_wrap`: None | |
| - `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None} | |
| - `parallelism_config`: None | |
| - `deepspeed`: None | |
| - `label_smoothing_factor`: 0.0 | |
| - `optim`: adamw_torch_fused | |
| - `optim_args`: None | |
| - `adafactor`: False | |
| - `group_by_length`: False | |
| - `length_column_name`: length | |
| - `ddp_find_unused_parameters`: None | |
| - `ddp_bucket_cap_mb`: None | |
| - `ddp_broadcast_buffers`: False | |
| - `dataloader_pin_memory`: True | |
| - `dataloader_persistent_workers`: False | |
| - `skip_memory_metrics`: True | |
| - `use_legacy_prediction_loop`: False | |
| - `push_to_hub`: False | |
| - `resume_from_checkpoint`: None | |
| - `hub_model_id`: None | |
| - `hub_strategy`: every_save | |
| - `hub_private_repo`: None | |
| - `hub_always_push`: False | |
| - `hub_revision`: None | |
| - `gradient_checkpointing`: False | |
| - `gradient_checkpointing_kwargs`: None | |
| - `include_inputs_for_metrics`: False | |
| - `include_for_metrics`: [] | |
| - `eval_do_concat_batches`: True | |
| - `fp16_backend`: auto | |
| - `push_to_hub_model_id`: None | |
| - `push_to_hub_organization`: None | |
| - `mp_parameters`: | |
| - `auto_find_batch_size`: False | |
| - `full_determinism`: False | |
| - `torchdynamo`: None | |
| - `ray_scope`: last | |
| - `ddp_timeout`: 1800 | |
| - `torch_compile`: False | |
| - `torch_compile_backend`: None | |
| - `torch_compile_mode`: None | |
| - `include_tokens_per_second`: False | |
| - `include_num_input_tokens_seen`: False | |
| - `neftune_noise_alpha`: None | |
| - `optim_target_modules`: None | |
| - `batch_eval_metrics`: False | |
| - `eval_on_start`: False | |
| - `use_liger_kernel`: False | |
| - `liger_kernel_config`: None | |
| - `eval_use_gather_object`: False | |
| - `average_tokens_across_devices`: False | |
| - `prompts`: None | |
| - `batch_sampler`: no_duplicates | |
| - `multi_dataset_batch_sampler`: round_robin | |
| - `router_mapping`: {} | |
| - `learning_rate_mapping`: {} | |
| </details> | |
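With `lr_scheduler_type: linear` and `warmup_steps: 0`, the learning rate decays from 5e-05 to zero over the run. The sketch below approximates that schedule. The step counts assume a single device, no gradient accumulation, and that all 455,424 samples are used in every epoch (the `no_duplicates` sampler can skip some), so the totals are estimates rather than logged values.

```python
import math

def linear_lr(step, total_steps, base_lr=5e-05, warmup_steps=0):
    """Linear warmup (if any) followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

# Assumed totals derived from the values listed in this card.
train_samples = 455_424
batch_size = 32
epochs = 4
steps_per_epoch = math.ceil(train_samples / batch_size)
total_steps = steps_per_epoch * epochs

lr_start = linear_lr(0, total_steps)
lr_midpoint = linear_lr(total_steps // 2, total_steps)
lr_end = linear_lr(total_steps, total_steps)
print(steps_per_epoch, total_steps, lr_start, lr_midpoint, lr_end)
```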
| ## Evaluation Metrics | |
| The model was evaluated on a held-out validation set of synthetic municipal drafts, each mapped 1-to-1 to Washington State RCWs. The table below compares peak validation performance (reached at epoch 3.02) with the `bge-large` base model before fine-tuning. Improvements are absolute differences, expressed in percentage points. | |
| | Metric | Base Model (`bge-large`, not fine-tuned) | Fine-Tuned (Peak @ 3.02) | Absolute Improvement | |
| |:-------|:-----------------------------|:-------------------------|:---------------------| | |
| | **Recall@10** | 0.5684 | **0.8354** | + 26.7% | | |
| | **Recall@5** | 0.2842 | **0.4255** | + 14.13% | | |
| | **NDCG@10** | 0.2509 | **0.3828** | + 13.19% | |
| | **MRR@10** | 0.1569 | **0.2487** | + 9.18% | | |
| *Interpretation: The off-the-shelf `bge-large` model already places the correct statute in the top 10 for 56.8% of validation queries. Fine-tuning raises that to 83.5%, so the exact governing state law appears within the top 10 results for roughly five out of six queries on this validation set.* | |
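Because each validation query maps to exactly one statute, these metrics reduce to simple functions of the rank at which that statute is retrieved. The sketch below shows this for Recall@k, MRR@10, and NDCG@10. The ranks are made-up examples, and the helper names are hypothetical rather than the evaluation code used for this model.

```python
import math

def recall_at_k(ranks, k):
    """Fraction of queries whose single relevant statute is in the top k."""
    return sum(1 for r in ranks if r is not None and r <= k) / len(ranks)

def mrr_at_k(ranks, k=10):
    """Mean reciprocal rank, counting 0 when the statute is outside the top k."""
    return sum(1.0 / r if r is not None and r <= k else 0.0 for r in ranks) / len(ranks)

def ndcg_at_k(ranks, k=10):
    """NDCG with one relevant item: ideal DCG is 1, so NDCG = 1 / log2(rank + 1)."""
    return sum(
        1.0 / math.log2(r + 1) if r is not None and r <= k else 0.0 for r in ranks
    ) / len(ranks)

# 1-based rank of the correct statute for four toy queries; None = not retrieved.
toy_ranks = [1, 3, 12, None]

recall_10 = recall_at_k(toy_ranks, 10)
precision_10 = recall_10 / 10  # one relevant item per query, so P@k = R@k / k
mrr_10 = mrr_at_k(toy_ranks)
ndcg_10 = ndcg_at_k(toy_ranks)
print(recall_10, precision_10, mrr_10, ndcg_10)
```

The same relationship explains why the Precision@10 reported in the model metadata (0.0834) is one tenth of Recall@10 (0.8344).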
| ## Limitations | |
| - This model does not provide legal advice. | |
| - Performance is limited to Washington State law (RCW) and may not generalize to other jurisdictions. | |
| - Outputs depend on the quality of the underlying document corpus. | |
| - Should be used as a retrieval tool, not a final decision-making system. | |
| ## Usage Examples | |
| ### Semantic Search with `sentence-transformers` | |
| <div style="padding:10px; border-left:4px solid #ff4d4f; background-color:#fff1f0;"> | |
| **Warning:** Because this model is built on the BGE architecture, you **must** prepend the specific instruction prefix | |
| `"Represent this sentence for searching relevant passages:"` | |
| to your search queries to achieve optimal performance. | |
| **Do not** add this prefix to the database documents. | |
| </div> | |
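One way to apply the prefix rule consistently is to route every query through a small helper and leave documents untouched. This is a sketch only: `format_query` is a hypothetical helper, not part of `sentence-transformers`.

```python
QUERY_PREFIX = "Represent this sentence for searching relevant passages: "

def format_query(query: str) -> str:
    """Prepend the BGE retrieval instruction to a query, exactly once."""
    query = query.strip()
    if query.startswith(QUERY_PREFIX.strip()):
        return query  # already prefixed; avoid doubling the instruction
    return QUERY_PREFIX + query

formatted = format_query("What dollar amount makes a theft a first degree felony?")
document = "RCW 9A.56.030: Theft in the first degree."  # documents are encoded as-is
print(formatted)
```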
| ```python | |
| import torch | |
| from sentence_transformers import SentenceTransformer, util | |
| # 1. Load the fine-tuned model | |
| model = SentenceTransformer('CSI-lab/Washington-state-law-embedding-model-Large') | |
| # 2. Define the laws (Your Vector Database) | |
| laws = [ | |
| "RCW 9A.56.030: Theft in the first degree. A person is guilty of theft in the first degree if he or she commits theft of property or services which exceed(s) five thousand dollars in value.", | |
| "RCW 46.61.502: Driving under the influence. A person is guilty of driving while under the influence of intoxicating liquor...", | |
| "RCW 9A.36.011: Assault in the first degree. A person is guilty of assault in the first degree if he or she..." | |
| ] | |
| # 3. Define the user's search query | |
| user_query = "What dollar amount makes a theft a first degree felony?" | |
| # 4. CRITICAL: Add the required BGE prefix to the query ONLY | |
| query_prefix = "Represent this sentence for searching relevant passages: " | |
| formatted_query = query_prefix + user_query | |
| # 5. Encode the documents and the query | |
| law_embeddings = model.encode(laws, convert_to_tensor=True) | |
| query_embedding = model.encode(formatted_query, convert_to_tensor=True) | |
| # 6. Calculate Cosine Similarity | |
| cosine_scores = util.cos_sim(query_embedding, law_embeddings) | |
| # 7. Print the top result | |
| best_idx = cosine_scores.argmax().item() | |
| print(f"Top Match: {laws[best_idx]}") | |
| print(f"Similarity Score: {cosine_scores[0][best_idx]:.4f}") | |
| ``` | |
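The example above prints only the single best match. Since the evaluation reports Recall@10, a RAG pipeline would more typically pass the top few statutes to the generator. A minimal sketch of that selection step follows, using a plain Python list in place of the `cosine_scores` tensor; the scores and the `top_k` helper are illustrative assumptions, not model output.

```python
def top_k(scores, k):
    """Return (index, score) pairs for the k highest scores, best first."""
    if k <= 0:
        return []
    ranked = sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

laws = [
    "RCW 9A.56.030: Theft in the first degree.",
    "RCW 46.61.502: Driving under the influence.",
    "RCW 9A.36.011: Assault in the first degree.",
]
# Made-up similarity scores, one per law (not real model output).
scores = [0.82, 0.31, 0.44]

results = top_k(scores, k=2)
for idx, score in results:
    print(f"{score:.4f}  {laws[idx]}")
```

With real tensors, `torch.topk(cosine_scores[0], k)` performs the same selection, provided `k` does not exceed the number of documents.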
| ## Model Citation | |
| ```bibtex | |
| @misc{washington_state_law_embedding_Large_2026, | |
| title={Washington-state-law-embedding-model-Large: Fine-Tuned Dense Retrieval for Washington State Law}, | |
| author={Tomar, Shlok}, | |
| year={2026}, | |
| publisher={Hugging Face}, | |
| howpublished={\url{https://huggingface.co/CSI-lab/Washington-state-law-embedding-model-Large}}, | |
| note={Hugging Face Model Repository} | |
| } | |
| ``` | |
| ### BibTeX | |
| #### Sentence Transformers | |
| ```bibtex | |
| @inproceedings{reimers-2019-sentence-bert, | |
| title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks", | |
| author = "Reimers, Nils and Gurevych, Iryna", | |
| booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing", | |
| month = "11", | |
| year = "2019", | |
| publisher = "Association for Computational Linguistics", | |
| url = "https://arxiv.org/abs/1908.10084", | |
| } | |
| ``` | |
| #### MultipleNegativesRankingLoss | |
| ```bibtex | |
| @misc{henderson2017efficient, | |
| title={Efficient Natural Language Response Suggestion for Smart Reply}, | |
| author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil}, | |
| year={2017}, | |
| eprint={1705.00652}, | |
| archivePrefix={arXiv}, | |
| primaryClass={cs.CL} | |
| } | |
| ``` |