---
license: mit
language:
- en
metrics:
- accuracy
- recall
base_model:
- BAAI/bge-large-en-v1.5
pipeline_tag: sentence-similarity
library_name: sentence-transformers
tags:
- legal
- law
- WA
- sentence-transformers
- feature-extraction
- sentence-similarity
- dense
- loss:MultipleNegativesRankingLoss
model-index:
- name: Washington-state-law-embedding-model-Large
  results:
  - task:
      type: information-retrieval
      name: Information Retrieval
    dataset:
      name: RCW Validation
      type: rcw-validation
    metrics:
    - type: cosine_accuracy@10
      value: 0.8344200750839755
      name: Cosine Accuracy@10
    - type: cosine_accuracy@1
      value: 0.08774945662912467
      name: Cosine Accuracy@1
    - type: cosine_accuracy@3
      value: 0.2561944279786603
      name: Cosine Accuracy@3
    - type: cosine_accuracy@5
      value: 0.42533096226042283
      name: Cosine Accuracy@5
    - type: cosine_precision@1
      value: 0.08774945662912467
      name: Cosine Precision@1
    - type: cosine_precision@3
      value: 0.08539814265955344
      name: Cosine Precision@3
    - type: cosine_precision@5
      value: 0.08506619245208456
      name: Cosine Precision@5
    - type: cosine_precision@10
      value: 0.08344200750839757
      name: Cosine Precision@10
    - type: cosine_recall@1
      value: 0.08774945662912467
      name: Cosine Recall@1
    - type: cosine_recall@3
      value: 0.2561944279786603
      name: Cosine Recall@3
    - type: cosine_recall@5
      value: 0.42533096226042283
      name: Cosine Recall@5
    - type: cosine_recall@10
      value: 0.8344200750839755
      name: Cosine Recall@10
    - type: cosine_ndcg@10
      value: 0.3829692177232852
      name: Cosine Ndcg@10
    - type: cosine_mrr@10
      value: 0.24923231025931583
      name: Cosine Mrr@10
    - type: cosine_map@100
      value: 0.25674619603156057
      name: Cosine Map@100
datasets:
- CSI-lab/RCW_2025_Positive_Query_Pairs
---
# Washington-state-law-embedding-model-Large
**Washington-state-law-embedding-model-Large** is an embedding model fine-tuned from `BAAI/bge-large-en-v1.5` for legal information retrieval (IR) over Washington State law.
Generic embedding models often underperform on legal text because of the semantic gap between natural-language questions (e.g., "What dollar amount makes a theft a first degree felony?") and formal statutory language. This model is trained to narrow that gap so that plain-English queries, legal scenarios, and document drafts map to the corresponding sections of the Revised Code of Washington (RCW).
## Available Models
| Model | Language | Description | Query Prefix |
|:------|:---------|:------------|:-------------|
| [CSI-lab/Washington-state-law-embedding-model-Large](https://huggingface.co/CSI-lab/Washington-state-law-embedding-model-Large) | English | Fine-tuned `large` model (1024d) for WA State RCWs. Best performance. | `Represent this sentence for searching relevant passages: ` |
| [CSI-lab/Washington-state-law-embedding-model-Base](https://huggingface.co/CSI-lab/Washington-state-law-embedding-model-Base) | English | Fine-tuned `base` model (768d) for WA State RCWs. Faster inference. | `Represent this sentence for searching relevant passages: ` |
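
The query prefix in the table applies to search queries. The sketch below assumes, following the convention of the BGE base model, that passages are embedded without it; the card does not state this explicitly. It also assumes the `sentence-transformers` package is installed and the model can be downloaded from the Hub. The helper names `with_query_prefix` and `embed` are illustrative and not part of any library.

```python
QUERY_PREFIX = "Represent this sentence for searching relevant passages: "
MODEL_ID = "CSI-lab/Washington-state-law-embedding-model-Large"


def with_query_prefix(query: str) -> str:
    """Prepend the retrieval instruction to a search query (hypothetical helper)."""
    return QUERY_PREFIX + query.strip()


def embed(queries, passages, model_id=MODEL_ID):
    """Encode prefixed queries and unprefixed passages into unit-length vectors."""
    # Imported lazily so the prefix helper works without the package installed.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_id)
    query_vecs = model.encode(
        [with_query_prefix(q) for q in queries], normalize_embeddings=True
    )
    # Assumption: passages are encoded as-is, without the query prefix.
    passage_vecs = model.encode(list(passages), normalize_embeddings=True)
    return query_vecs, passage_vecs
```

Because both sets of vectors are normalized, the cosine similarity between a query and a passage reduces to their dot product.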
## Model Overview
* **Base Model:** `BAAI/bge-large-en-v1.5`
* **Task:** Semantic Search / Information Retrieval / Legal Preemption Analysis
* **Language:** English (Legal Domain)
* **Max Sequence Length:** 512 tokens
* **Output Dimensionality:** 1024 dimensions
* **Similarity Function:** Cosine Similarity
## Key Features
- Fine-tuned for Washington State legal domain (RCW)
- Optimized for semantic search and retrieval tasks
- Supports natural language legal queries
- Designed for RAG-based legal assistants
- Higher-capacity 1024-dimensional embeddings from the `large` architecture
## Intended Use Cases
This model is optimized to act as the retriever component in legal Retrieval-Augmented Generation (RAG) pipelines. Primary use cases include:
1. **Statutory Cross-Referencing:** Mapping natural language legal questions to specific RCWs.
2. **Preemption Checking:** Automatically retrieving state laws that may preempt or conflict with proposed municipal ordinances.
3. **Legal Research Automation:** Clustering and searching local agency drafts against established state frameworks.
4. **AI Legal Assistants:** Powering chatbots and research tools that require accurate retrieval of Washington State laws before generating an answer.
5. **Automated Compliance:** Scanning contracts or external drafts against established state legislative frameworks.
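
In each of these use cases the retriever's job reduces to ranking statute embeddings against a query embedding by cosine similarity and handing the top hits to the next stage. The sketch below shows that ranking step in plain Python, with toy 3-dimensional vectors standing in for the real 1024-dimensional embeddings. The function names and section labels are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k(query_vec, passage_vecs, k=10):
    """Return [(index, score), ...] for the k passages most similar to the query."""
    scored = [(i, cosine(query_vec, p)) for i, p in enumerate(passage_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)  # highest similarity first
    return scored[:k]


# Toy vectors only; real embeddings would come from the model.
labels = ["section A", "section B", "section C"]
passages = [[1.0, 0.0, 0.0], [0.6, 0.8, 0.0], [0.0, 0.0, 1.0]]
query = [0.9, 0.1, 0.0]

hits = top_k(query, passages, k=2)
retrieved = [labels[i] for i, _ in hits]
```

For a corpus the size of the RCW, a vector index would normally replace the linear scan, but the scoring rule stays the same.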
## Technical Details & Training Methodology
### The Semantic Gap
A general-purpose dense retriever often struggles on legal tasks because its embeddings tend to track surface vocabulary more closely than the conceptual link between a question and the statute that governs it. To address this, `Washington-state-law-embedding-model-Large` was fine-tuned using a synthetic, high-variance dataset.
### Training Data
The model was fine-tuned on 455,424 synthetic legal query–passage pairs generated from Washington State RCW statutes.
The dataset includes:
- Natural language paraphrases of legal questions
- Hypothetical legal scenarios
- Statute-grounded positive document matches
The dataset spans 500+ legal categories derived from the structure of the RCW.
### Hyperparameters & Architecture
* **Loss Function:** Multiple Negatives Ranking (MNR) Loss
* **Batch Size:** 32
* **Epochs:** 4
* **fp16:** True
* **batch_sampler:** no_duplicates
* **multi_dataset_batch_sampler:** round_robin
* **Learning Rate Decay:** Linear
* **Infrastructure:** High-Performance Computing (HPC) Cluster
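
Multiple Negatives Ranking loss treats every other passage in a batch as a negative for a given query. This is why the `no_duplicates` batch sampler matters: a passage repeated within a batch would be scored as a false negative. The sketch below computes the loss for one batch in plain Python, as the mean cross-entropy over scaled cosine similarities with the matching passage as the target. The scale of 20 is the sentence-transformers default and is an assumption here, since the card does not state the value used. `mnr_loss` is an illustrative name and the vectors are toy data.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mnr_loss(query_vecs, passage_vecs, scale=20.0):
    """Mean cross-entropy where passage i is the positive for query i
    and all other passages in the batch act as negatives."""
    total = 0.0
    for i, q in enumerate(query_vecs):
        logits = [scale * cosine(q, p) for p in passage_vecs]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_z - logits[i]
    return total / len(query_vecs)


queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]  # each query is closest to its own passage
swapped = [[0.1, 0.9], [0.9, 0.1]]  # each query is closest to the wrong passage

loss_aligned = mnr_loss(queries, aligned)
loss_swapped = mnr_loss(queries, swapped)
```

The aligned batch yields the lower loss. Larger batches supply more in-batch negatives per query, which is one reason batch size is a meaningful hyperparameter for this loss.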
#### All Hyperparameters
- `overwrite_output_dir`: False
- `do_predict`: False
- `eval_strategy`: steps
- `prediction_loss_only`: True
- `per_device_train_batch_size`: 32
- `per_device_eval_batch_size`: 32
- `per_gpu_train_batch_size`: None
- `per_gpu_eval_batch_size`: None
- `gradient_accumulation_steps`: 1
- `eval_accumulation_steps`: None
- `torch_empty_cache_steps`: None
- `learning_rate`: 5e-05
- `weight_decay`: 0.0
- `adam_beta1`: 0.9
- `adam_beta2`: 0.999
- `adam_epsilon`: 1e-08
- `max_grad_norm`: 1
- `num_train_epochs`: 4
- `max_steps`: -1
- `lr_scheduler_type`: linear
- `lr_scheduler_kwargs`: {}
- `warmup_ratio`: 0.0
- `warmup_steps`: 0
- `log_level`: passive
- `log_level_replica`: warning
- `log_on_each_node`: True
- `logging_nan_inf_filter`: True
- `save_safetensors`: True
- `save_on_each_node`: False
- `save_only_model`: False
- `restore_callback_states_from_checkpoint`: False
- `no_cuda`: False
- `use_cpu`: False
- `use_mps_device`: False
- `seed`: 42
- `data_seed`: None
- `jit_mode_eval`: False
- `use_ipex`: False
- `bf16`: False
- `fp16`: True
- `fp16_opt_level`: O1
- `half_precision_backend`: auto
- `bf16_full_eval`: False
- `fp16_full_eval`: False
- `tf32`: None
- `local_rank`: 0
- `ddp_backend`: None
- `tpu_num_cores`: None
- `tpu_metrics_debug`: False
- `debug`: []
- `dataloader_drop_last`: False
- `dataloader_num_workers`: 0
- `dataloader_prefetch_factor`: None
- `past_index`: -1
- `disable_tqdm`: False
- `remove_unused_columns`: True
- `label_names`: None
- `load_best_model_at_end`: False
- `ignore_data_skip`: False
- `fsdp`: []
- `fsdp_min_num_params`: 0
- `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- `fsdp_transformer_layer_cls_to_wrap`: None
- `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- `parallelism_config`: None
- `deepspeed`: None
- `label_smoothing_factor`: 0.0
- `optim`: adamw_torch_fused
- `optim_args`: None
- `adafactor`: False
- `group_by_length`: False
- `length_column_name`: length
- `ddp_find_unused_parameters`: None
- `ddp_bucket_cap_mb`: None
- `ddp_broadcast_buffers`: False
- `dataloader_pin_memory`: True
- `dataloader_persistent_workers`: False
- `skip_memory_metrics`: True
- `use_legacy_prediction_loop`: False
- `push_to_hub`: False
- `resume_from_checkpoint`: None
- `hub_model_id`: None
- `hub_strategy`: every_save
- `hub_private_repo`: None
- `hub_always_push`: False
- `hub_revision`: None
- `gradient_checkpointing`: False
- `gradient_checkpointing_kwargs`: None
- `include_inputs_for_metrics`: False
- `include_for_metrics`: []
- `eval_do_concat_batches`: True
- `fp16_backend`: auto
- `push_to_hub_model_id`: None
- `push_to_hub_organization`: None
- `mp_parameters`:
- `auto_find_batch_size`: False
- `full_determinism`: False
- `torchdynamo`: None
- `ray_scope`: last
- `ddp_timeout`: 1800
- `torch_compile`: False
- `torch_compile_backend`: None
- `torch_compile_mode`: None
- `include_tokens_per_second`: False
- `include_num_input_tokens_seen`: False
- `neftune_noise_alpha`: None
- `optim_target_modules`: None
- `batch_eval_metrics`: False
- `eval_on_start`: False
- `use_liger_kernel`: False
- `liger_kernel_config`: None
- `eval_use_gather_object`: False
- `average_tokens_across_devices`: False
- `prompts`: None
- `batch_sampler`: no_duplicates
- `multi_dataset_batch_sampler`: round_robin
- `router_mapping`: {}
- `learning_rate_mapping`: {}
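
Taken together, the settings above (`learning_rate` 5e-05, `lr_scheduler_type` linear, `warmup_steps` 0, `num_train_epochs` 4, batch size 32 with no gradient accumulation) imply a learning rate that falls linearly from its initial value to zero over training. The sketch below works through that arithmetic. It assumes a single device and the 455,424 training samples stated in the Training Data section; multiple GPUs or the `no_duplicates` sampler could change the exact step count. The function name `linear_lr` is illustrative.

```python
import math

TRAIN_SAMPLES = 455_424  # from the Training Data section
BATCH_SIZE = 32          # per_device_train_batch_size, one device assumed
EPOCHS = 4
BASE_LR = 5e-05

steps_per_epoch = math.ceil(TRAIN_SAMPLES / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS


def linear_lr(step, total=total_steps, base_lr=BASE_LR):
    """Learning rate after `step` optimizer steps under linear decay with no warmup."""
    return base_lr * max(0.0, (total - step) / total)
```

Under these assumptions there are 14,232 steps per epoch and 56,928 in total, so the rate is 2.5e-05 at the halfway point and reaches zero at the final step.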