How to use CSI-lab/Washington-state-law-embedding-model-Base with sentence-transformers:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("CSI-lab/Washington-state-law-embedding-model-Base")

sentences = [
    "Represent this sentence for searching relevant passages: RCW 36.75.190",
    "RCW 36.75.190 - Engineer's report—Hearing—Order.\nUpon report by the examining engineer for the erection and construction upon any county road, or for acquisition by purchase, gift or condemnation of any bridge, trestle, or any other structure crossing any stream, body of water, gulch, navigable water, swamp or other topographical formation, which constitutes a boundary, publication shall be made and joint hearing had upon such report in the same manner and upon the same procedure as in the case of resolution or petition for the laying out and establishing of county roads. If upon the hearing the governing authorities jointly order the erection and construction or acquisition of such bridge, trestle, or other structure, they may jointly acquire land necessary therefor by purchase, gift, or condemnation in the manner as provided for acquiring land for county roads, and shall advertise calls for bids, require contractor's deposit and bond, award contracts, and supervise construction as by law provided and in the same manner as required in the case of the construction of county roads. Any such bridges, trestles or other structures may be operated free, or may be operated as toll bridges, trestles, or other structures under the provisions of the laws of this state relating thereto.\n[ 1963 c 4 s 36.75.190 . Prior: 1937 c 187 s 29 ; RRS s 6450-29.]",
    "RCW 28B.30.285 - State treasurer receiving agent of certain federal aid—Trust funds not subject to appropriation.\nAll federal grants received by the state treasurer pursuant to RCW 28B.30.270 shall be deemed trust funds under the control of the state treasurer and not subject to appropriation by the legislature.\n[ 1969 ex.s. c 223 s 28B.30.285 . Prior: 1955 c 66 s 4 . Formerly RCW 28.80.224 .]",
    "RCW 48.09.160 - Directors—Disqualification.\nNo individual shall be a director of a domestic mutual insurer by reason of his or her holding public office. Adjudication as a bankrupt or taking the benefit of any insolvency law or making a general assignment for the benefit of creditors disqualifies an individual from being or acting as a director.\n[ 2009 c 549 s 7037 ; 1947 c 79 s .09.16; Rem. Supp. 1947 s 45.09.16.]"
]
embeddings = model.encode(sentences)

similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# torch.Size([4, 4])

Washington-state-law-embedding-model-Base

Washington-state-law-embedding-model-Base is an embedding model fine-tuned from BAAI/bge-base-en-v1.5 for Legal Information Retrieval (IR) over State of Washington statutes.

Generic embedding models often perform suboptimally on legal texts due to the semantic gap between natural language questions (e.g., "What dollar amount makes a theft a first degree felony?") and formal statutory legalese. This model bridges that gap, allowing plain-English queries, legal scenarios, and document drafts to be accurately mapped to their corresponding Washington State statutes (Revised Code of Washington - RCW).

Available Models

Model	Language	Description	Query Prefix
CSI-lab/Washington-state-law-embedding-model-Large	English	Fine-tuned `large` model (1024d) for WA State RCWs. Best performance.	`Represent this sentence for searching relevant passages:`
CSI-lab/Washington-state-law-embedding-model-Base	English	Fine-tuned `base` model (768d) for WA State RCWs. Faster inference.	`Represent this sentence for searching relevant passages:`

Model Overview

Base Model: BAAI/bge-base-en-v1.5
Task: Semantic Search / Information Retrieval / Legal Preemption Analysis
Language: English (Legal Domain)
Max Sequence Length: 512 tokens
Output Dimensionality: 768 dimensions
Similarity Function: Cosine Similarity
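
As a hedged illustration of the similarity function listed above, the sketch below computes cosine similarity in plain Python on toy 3-dimensional vectors (the real model outputs 768 dimensions). The function name `cosine_similarity` and the vector values are illustrative assumptions, not part of the model's API.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


# Toy vectors standing in for embeddings (hypothetical values).
query_vec = [1.0, 2.0, 0.0]
statute_vec = [2.0, 4.0, 0.0]    # same direction as the query
unrelated_vec = [0.0, 0.0, 3.0]  # orthogonal to the query

print(cosine_similarity(query_vec, statute_vec))    # close to 1.0
print(cosine_similarity(query_vec, unrelated_vec))  # 0.0
```

Cosine similarity ignores vector length, so only the direction of an embedding matters when ranking statutes against a query.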

Key Features

Fine-tuned for Washington State legal domain (RCW)
Optimized for semantic search and retrieval tasks
Supports natural language legal queries
Designed for RAG-based legal assistants
Improved retrieval accuracy over base BGE embeddings (Recall@10 of 0.8441 versus 0.5314 on the validation set)

Intended Use Cases

This model is optimized to act as the retriever component in legal Retrieval-Augmented Generation (RAG) pipelines. Primary use cases include:

Statutory Cross-Referencing: Mapping natural language legal questions to specific RCWs.
Preemption Checking: Automatically retrieving state laws that may preempt or conflict with proposed municipal ordinances.
Legal Research Automation: Clustering and searching local agency drafts against established state frameworks.
AI Legal Assistants: Powering chatbots and research tools that require accurate retrieval of Washington State laws before generating an answer.
Automated Compliance: Scanning contracts or external drafts against established state legislative frameworks.
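
The retriever role described above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: `retrieve`, `build_context`, and `toy_embed` are hypothetical names, and `toy_embed` is a keyword counter used only so the example runs without downloading the model. In practice the `embed` callable would wrap the model's `encode` method, and statute embeddings would be computed once and stored.

```python
import math
from typing import Callable, List, Sequence, Tuple

QUERY_PREFIX = "Represent this sentence for searching relevant passages: "
Embedder = Callable[[str], Sequence[float]]


def _cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(question: str, statutes: List[str], embed: Embedder,
             top_k: int = 10) -> List[Tuple[float, str]]:
    """Return the top_k (score, statute) pairs for a plain-English question."""
    # The prefix goes on the query only, never on the statutes.
    query_vec = embed(QUERY_PREFIX + question)
    scored = [(_cosine(query_vec, embed(s)), s) for s in statutes]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


def build_context(results: List[Tuple[float, str]]) -> str:
    """Join retrieved statutes into a context block for a generator model."""
    return "\n\n".join(text for _, text in results)


def toy_embed(text: str) -> List[float]:
    # Demonstration-only embedder: counts three keywords.
    lowered = text.lower()
    return [float(lowered.count(w)) for w in ("theft", "driving", "assault")]


statutes = [
    "RCW 9A.56.030: Theft in the first degree.",
    "RCW 46.61.502: Driving under the influence.",
    "RCW 9A.36.011: Assault in the first degree.",
]
top = retrieve("What counts as theft?", statutes, toy_embed, top_k=1)
print(build_context(top))
```

Passing the embedder in as a callable keeps the ranking logic independent of any particular model or vector store.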

Technical Details & Training Methodology

The Semantic Gap

A general-purpose dense retriever often struggles on legal tasks because it tends to reward surface vocabulary overlap rather than conceptual legal mapping, and plain-language questions share little vocabulary with statutory text. To address this, Washington-state-law-embedding-model-Base was fine-tuned on a synthetic, high-variance dataset.

Training Data

The model was fine-tuned on synthetic legal query–passage pairs generated from Washington State RCW statutes.

The dataset contains 455,424 training samples and includes:

Natural language paraphrases of legal questions
Hypothetical legal scenarios
Statute-grounded positive document matches

The dataset spans 500+ legal categories derived from RCW structure.

Hyperparameters & Architecture

Loss Function: Multiple Negatives Ranking (MNR) Loss
Batch Size: 256
Epochs: 4
fp16: True
batch_sampler: no_duplicates
multi_dataset_batch_sampler: round_robin
Learning Rate Decay: Linear
Infrastructure: High-Performance Computing (HPC) Cluster
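
The following is a minimal pure-Python sketch of the idea behind Multiple Negatives Ranking loss, not the training code used for this model. It assumes cosine similarity scaled by 20.0 (the default in the sentence-transformers `MultipleNegativesRankingLoss`) and treats every other passage in the batch as a negative for a given query. The function name `mnr_loss` and the 2-dimensional vectors are illustrative.

```python
import math
from typing import List, Sequence


def _cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mnr_loss(queries: List[Sequence[float]],
             passages: List[Sequence[float]],
             scale: float = 20.0) -> float:
    """Mean cross-entropy where passages[i] is the positive for queries[i]
    and all other passages in the batch act as negatives."""
    if len(queries) != len(passages):
        raise ValueError("need one positive passage per query")
    total = 0.0
    for i, query in enumerate(queries):
        logits = [scale * _cosine(query, p) for p in passages]
        # Numerically stable log-sum-exp over the batch.
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(
            sum(math.exp(l - max_logit) for l in logits)
        )
        total += log_sum_exp - logits[i]
    return total / len(queries)


# Toy batch: positives aligned with their queries versus swapped.
aligned = mnr_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = mnr_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)  # True
```

Because every other passage in the batch is treated as a negative, a duplicate passage would become a false negative, which is what the `no_duplicates` batch sampler listed above is meant to avoid. Larger batches also supply more negatives per query.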

All Hyperparameters

overwrite_output_dir: False
do_predict: False
eval_strategy: steps
prediction_loss_only: True
per_device_train_batch_size: 256
per_device_eval_batch_size: 256
per_gpu_train_batch_size: None
per_gpu_eval_batch_size: None
gradient_accumulation_steps: 1
eval_accumulation_steps: None
torch_empty_cache_steps: None
learning_rate: 5e-05
weight_decay: 0.0
adam_beta1: 0.9
adam_beta2: 0.999
adam_epsilon: 1e-08
max_grad_norm: 1
num_train_epochs: 4
max_steps: -1
lr_scheduler_type: linear
lr_scheduler_kwargs: {}
warmup_ratio: 0.0
warmup_steps: 0
log_level: passive
log_level_replica: warning
log_on_each_node: True
logging_nan_inf_filter: True
save_safetensors: True
save_on_each_node: False
save_only_model: False
restore_callback_states_from_checkpoint: False
no_cuda: False
use_cpu: False
use_mps_device: False
seed: 42
data_seed: None
jit_mode_eval: False
use_ipex: False
bf16: False
fp16: True
fp16_opt_level: O1
half_precision_backend: auto
bf16_full_eval: False
fp16_full_eval: False
tf32: None
local_rank: 0
ddp_backend: None
tpu_num_cores: None
tpu_metrics_debug: False
debug: []
dataloader_drop_last: False
dataloader_num_workers: 0
dataloader_prefetch_factor: None
past_index: -1
disable_tqdm: False
remove_unused_columns: True
label_names: None
load_best_model_at_end: False
ignore_data_skip: False
fsdp: []
fsdp_min_num_params: 0
fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
fsdp_transformer_layer_cls_to_wrap: None
accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
parallelism_config: None
deepspeed: None
label_smoothing_factor: 0.0
optim: adamw_torch_fused
optim_args: None
adafactor: False
group_by_length: False
length_column_name: length
ddp_find_unused_parameters: None
ddp_bucket_cap_mb: None
ddp_broadcast_buffers: False
dataloader_pin_memory: True
dataloader_persistent_workers: False
skip_memory_metrics: True
use_legacy_prediction_loop: False
push_to_hub: False
resume_from_checkpoint: None
hub_model_id: None
hub_strategy: every_save
hub_private_repo: None
hub_always_push: False
hub_revision: None
gradient_checkpointing: False
gradient_checkpointing_kwargs: None
include_inputs_for_metrics: False
include_for_metrics: []
eval_do_concat_batches: True
fp16_backend: auto
push_to_hub_model_id: None
push_to_hub_organization: None
mp_parameters:
auto_find_batch_size: False
full_determinism: False
torchdynamo: None
ray_scope: last
ddp_timeout: 1800
torch_compile: False
torch_compile_backend: None
torch_compile_mode: None
include_tokens_per_second: False
include_num_input_tokens_seen: False
neftune_noise_alpha: None
optim_target_modules: None
batch_eval_metrics: False
eval_on_start: False
use_liger_kernel: False
liger_kernel_config: None
eval_use_gather_object: False
average_tokens_across_devices: False
prompts: None
batch_sampler: no_duplicates
multi_dataset_batch_sampler: round_robin
router_mapping: {}
learning_rate_mapping: {}
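
For orientation, the sketch below works out the approximate number of optimizer steps and the learning rate under a linear schedule, using values stated in this card (455,424 samples, batch size 256, 4 epochs, peak learning rate 5e-05, no warmup). It assumes a single device with `gradient_accumulation_steps` of 1 and that every sample is used once per epoch; the `no_duplicates` sampler or multi-GPU training would change the real step count. The function `linear_lr` is an illustrative name.

```python
import math

TRAIN_SAMPLES = 455_424
BATCH_SIZE = 256
EPOCHS = 4
PEAK_LR = 5e-05
WARMUP_STEPS = 0

# dataloader_drop_last is False, so a partial final batch would still count.
steps_per_epoch = math.ceil(TRAIN_SAMPLES / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS


def linear_lr(step: int, total: int = total_steps, peak: float = PEAK_LR,
              warmup: int = WARMUP_STEPS) -> float:
    """Linear warmup to `peak`, then linear decay to zero at `total`."""
    if step < warmup:
        return peak * step / max(1, warmup)
    return peak * max(0.0, (total - step) / max(1, total - warmup))


print(steps_per_epoch, total_steps)  # 1779 7116
print(linear_lr(0), linear_lr(total_steps // 2), linear_lr(total_steps))
```

Under these assumptions the learning rate starts at its peak, since there is no warmup, and reaches zero at the final step.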

Evaluation Metrics

The model was evaluated on a held-out validation set of synthetic municipal drafts, each mapped one-to-one to a Washington State RCW section. Fine-tuning raised Recall@10 by 31.27 percentage points (absolute) over the base model.

Metric	Base Model (Not Fine-Tuned)	Fine-Tuned (Epoch 4)	Absolute Improvement (points)
Recall@10	0.5314	0.8441	+31.27
Recall@5	0.2636	0.4318	+16.82
NDCG@10	0.2341	0.3876	+15.35
MRR@10	0.1462	0.2524	+10.62

Interpretation: On this validation set, the governing statute appeared in the top 10 search results for 84.4% of queries. Because the validation queries are synthetic, hit rates on real user questions may differ.
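
To make the metrics concrete, here is a small sketch of how Recall@k, MRR@k, and NDCG@k can be computed when each query has exactly one relevant statute, as in the one-to-one setup described above. It is not the evaluation code used for this model; `rank_of_gold`, `evaluate`, and the toy rankings are illustrative.

```python
import math
from typing import Dict, List


def rank_of_gold(ranked_ids: List[str], gold_id: str, k: int) -> int:
    """1-based rank of gold_id within the top k, or 0 if it is absent."""
    for position, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id == gold_id:
            return position
    return 0


def evaluate(rankings: List[List[str]], gold_ids: List[str],
             k: int = 10) -> Dict[str, float]:
    hits = 0
    reciprocal_ranks = 0.0
    ndcg = 0.0
    for ranked, gold in zip(rankings, gold_ids):
        rank = rank_of_gold(ranked, gold, k)
        if rank:
            hits += 1
            reciprocal_ranks += 1.0 / rank
            # With one relevant document the ideal DCG is 1.
            ndcg += 1.0 / math.log2(rank + 1)
    n = len(gold_ids)
    return {"recall": hits / n, "mrr": reciprocal_ranks / n, "ndcg": ndcg / n}


rankings = [
    ["RCW 9A.56.030", "RCW 46.61.502", "RCW 9A.36.011"],   # gold at rank 1
    ["RCW 46.61.502", "RCW 9A.56.030", "RCW 9A.36.011"],   # gold at rank 2
    ["RCW 46.61.502", "RCW 9A.36.011", "RCW 28B.30.285"],  # gold missing
]
gold_ids = ["RCW 9A.56.030"] * 3
metrics = evaluate(rankings, gold_ids, k=3)
print(metrics)
```

With a single relevant statute per query, Recall@k equals Accuracy@k and Precision@k equals Recall@k divided by k. MRR and NDCG sit well below Recall@10 whenever the correct statute tends to appear lower in the top 10 rather than first.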

Limitations

This model does not provide legal advice.
Performance is limited to Washington State law (RCW) and may not generalize to other jurisdictions.
Outputs depend on the quality of the underlying document corpus.
Should be used as a retrieval tool, not a final decision-making system.

Usage Examples

Semantic Search with `sentence-transformers`

Warning: Because this model is built on the BGE architecture, you must prepend the instruction prefix
"Represent this sentence for searching relevant passages:"
to your search queries to achieve optimal performance.

Do not add this prefix to the database documents.

from sentence_transformers import SentenceTransformer, util

# 1. Load the fine-tuned model
model = SentenceTransformer('CSI-lab/Washington-state-law-embedding-model-Base')

# 2. Define the laws (Your Vector Database)
laws = [
    "RCW 9A.56.030: Theft in the first degree. A person is guilty of theft in the first degree if he or she commits theft of property or services which exceed(s) five thousand dollars in value.",
    "RCW 46.61.502: Driving under the influence. A person is guilty of driving while under the influence of intoxicating liquor...",
    "RCW 9A.36.011: Assault in the first degree. A person is guilty of assault in the first degree if he or she..."
]

# 3. Define the user's search query
user_query = "What dollar amount makes a theft a first degree felony?"

# 4. CRITICAL: Add the required BGE prefix to the query ONLY
query_prefix = "Represent this sentence for searching relevant passages: "
formatted_query = query_prefix + user_query

# 5. Encode the documents and the query
law_embeddings = model.encode(laws, convert_to_tensor=True)
query_embedding = model.encode(formatted_query, convert_to_tensor=True)

# 6. Calculate Cosine Similarity
cosine_scores = util.cos_sim(query_embedding, law_embeddings)

# 7. Print the top result
best_idx = cosine_scores.argmax().item()
print(f"Top Match: {laws[best_idx]}")
print(f"Similarity Score: {cosine_scores[0][best_idx]:.4f}")
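
The prefix rule from the warning above can be wrapped in a small helper so that queries are prefixed exactly once and documents are never touched. This is a sketch under the assumption that queries and documents are handled in separate code paths; `format_query` is a hypothetical helper name, not part of sentence-transformers.

```python
QUERY_PREFIX = "Represent this sentence for searching relevant passages: "


def format_query(text: str) -> str:
    """Prepend the retrieval instruction to a query exactly once."""
    stripped = text.strip()
    if stripped.startswith(QUERY_PREFIX.strip()):
        return stripped  # already prefixed, leave unchanged
    return QUERY_PREFIX + stripped


queries = [
    "What dollar amount makes a theft a first degree felony?",
    "  Who can be a director of a domestic mutual insurer?  ",
]
formatted_queries = [format_query(q) for q in queries]
for q in formatted_queries:
    print(q)
```

Recent sentence-transformers releases also accept a `prompt` argument on `encode` that serves the same purpose; check the documentation for your installed version before relying on it.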

Model Citation

@misc{washington_state_law_embedding_base_2026,
  title={Washington-state-law-embedding-model-Base: Fine-Tuned Dense Retrieval for Washington State Law},
  author={Tomar, Shlok},
  year={2026},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/CSI-lab/Washington-state-law-embedding-model-Base}},
  note={Hugging Face Model Repository}
}

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

Model size: 0.1B params (Safetensors, tensor type F32)

Evaluation results (self-reported, RCW Validation)

Metric	Value
Accuracy@1	0.089
Accuracy@3	0.260
Accuracy@5	0.432
Accuracy@10	0.844
Precision@10	0.084
Recall@10	0.844
NDCG@10	0.388
MRR@10	0.252