CSI-lab committed on
Commit
dd71053
·
verified ·
1 Parent(s): 91c2f63

Update README.md

  1. README.md +374 -3
README.md CHANGED
@@ -1,3 +1,374 @@
1
- ---
2
- license: mit
3
- ---
1
+ ---
2
+ license: mit
3
+ language:
4
+ - en
5
+ metrics:
6
+ - accuracy
7
+ - recall
8
+ base_model:
9
+ - BAAI/bge-large-en-v1.5
10
+
11
+
12
+ pipeline_tag: sentence-similarity
13
+ library_name: sentence-transformers
14
+
15
+ tags:
16
+ - legal
17
+ - law
18
+ - WA
19
+ - sentence-transformers
20
+ - feature-extraction
21
+ - sentence-similarity
22
+ - dense
23
+ - loss:MultipleNegativesRankingLoss
24
+
25
+ model-index:
26
+ - name: Washington-state-law-embedding-model-Large
27
+ results:
28
+ - task:
29
+ type: information-retrieval
30
+ name: Information Retrieval
31
+ dataset:
32
+ name: RCW Validation
33
+ type: rcw-validation
34
+ metrics:
35
+ - type: cosine_accuracy@10
36
+ value: 0.8344200750839755
37
+ name: Cosine Accuracy@10
38
+ - type: cosine_accuracy@1
39
+ value: 0.08774945662912467
40
+ name: Cosine Accuracy@1
41
+ - type: cosine_accuracy@3
42
+ value: 0.2561944279786603
43
+ name: Cosine Accuracy@3
44
+ - type: cosine_accuracy@5
45
+ value: 0.42533096226042283
46
+ name: Cosine Accuracy@5
47
+ - type: cosine_precision@1
48
+ value: 0.08774945662912467
49
+ name: Cosine Precision@1
50
+ - type: cosine_precision@3
51
+ value: 0.08539814265955344
52
+ name: Cosine Precision@3
53
+ - type: cosine_precision@5
54
+ value: 0.08506619245208456
55
+ name: Cosine Precision@5
56
+ - type: cosine_precision@10
57
+ value: 0.08344200750839757
58
+ name: Cosine Precision@10
59
+ - type: cosine_recall@1
60
+ value: 0.08774945662912467
61
+ name: Cosine Recall@1
62
+ - type: cosine_recall@3
63
+ value: 0.2561944279786603
64
+ name: Cosine Recall@3
65
+ - type: cosine_recall@5
66
+ value: 0.42533096226042283
67
+ name: Cosine Recall@5
68
+ - type: cosine_recall@10
69
+ value: 0.8344200750839755
70
+ name: Cosine Recall@10
71
+ - type: cosine_ndcg@10
72
+ value: 0.3829692177232852
73
+ name: Cosine Ndcg@10
74
+ - type: cosine_mrr@10
75
+ value: 0.24923231025931583
76
+ name: Cosine Mrr@10
77
+ - type: cosine_map@100
78
+ value: 0.25674619603156057
79
+ name: Cosine Map@100
80
+ datasets:
81
+ - CSI-lab/RCW_2025_Positive_Query_Pairs
82
+ ---
83
+
84
+ # Washington-state-law-embedding-model-Large
85
+
86
+ **Washington-state-law-embedding-model-Large** is an embedding model fine-tuned from `BAAI/bge-large-en-v1.5` for legal information retrieval (IR) over Washington State law.
87
+
88
+ Generic embedding models often perform poorly on legal texts because of the semantic gap between natural language questions (e.g., "What dollar amount makes a theft a first degree felony?") and formal statutory language. This model narrows that gap, so that plain-English queries, legal scenarios, and document drafts can be mapped to the corresponding Washington State statutes (the Revised Code of Washington, RCW).
89
+
90
+ ## Available Models
91
+
92
+ | Model | Language | Description | Query Prefix |
93
+ |:------|:---------|:------------|:-------------|
94
+ | [CSI-lab/Washington-state-law-embedding-model-Large](https://huggingface.co/CSI-lab/Washington-state-law-embedding-model-Large) | English | Fine-tuned `large` model (1024d) for WA State RCWs. Best performance. | `Represent this sentence for searching relevant passages: ` |
95
+ | [CSI-lab/Washington-state-law-embedding-model-Base](https://huggingface.co/CSI-lab/Washington-state-law-embedding-model-Base) | English | Fine-tuned `base` model (768d) for WA State RCWs. Faster inference. | `Represent this sentence for searching relevant passages: ` |
96
+
97
+ ## Model Overview
98
+ * **Base Model:** `BAAI/bge-large-en-v1.5`
99
+ * **Task:** Semantic Search / Information Retrieval / Legal Preemption Analysis
100
+ * **Language:** English (Legal Domain)
101
+ * **Max Sequence Length:** 512 tokens
102
+ * **Output Dimensionality:** 1024 dimensions
103
+ * **Similarity Function:** Cosine Similarity
104
+
105
+ ## Key Features
106
+ - Fine-tuned for Washington State legal domain (RCW)
107
+ - Optimized for semantic search and retrieval tasks
108
+ - Supports natural language legal queries
109
+ - Designed for RAG-based legal assistants
110
+ - Uses the 1024-dimensional `large` architecture, the higher-accuracy option of the two released models
111
+
112
+ ## Intended Use Cases
113
+ This model is optimized to act as the retriever component in legal Retrieval-Augmented Generation (RAG) pipelines. Primary use cases include:
114
+ 1. **Statutory Cross-Referencing:** Mapping natural language legal questions to specific RCWs.
115
+ 2. **Preemption Checking:** Automatically retrieving state laws that may preempt or conflict with proposed municipal ordinances.
116
+ 3. **Legal Research Automation:** Clustering and searching local agency drafts against established state frameworks.
117
+ 4. **AI Legal Assistants:** Powering chatbots and research tools that require accurate retrieval of Washington State laws before generating an answer.
118
+ 5. **Automated Compliance:** Scanning contracts or external drafts against established state legislative frameworks.
119
+
120
+ ## Technical Details & Training Methodology
121
+
122
+ ### The Semantic Gap
123
+ A general-purpose dense retriever often struggles on legal tasks because its embedding space is not trained to connect plain-language questions with the statutory wording that answers them. To address this, `Washington-state-law-embedding-model-Large` was fine-tuned on a synthetic, high-variance dataset.
124
+
125
+ ### Training Data
126
+ The model was fine-tuned on synthetic legal query–passage pairs generated from Washington State RCW statutes.
127
+
128
+ The dataset includes:
129
+ - Size: 455,424 training samples
130
+ - Natural language paraphrases of legal questions
131
+ - Hypothetical legal scenarios
132
+ - Statute-grounded positive document matches
133
+
134
+ The dataset spans 500+ legal categories derived from RCW structure.
135
+
136
+ ### Hyperparameters & Architecture
137
+ * **Loss Function:** Multiple Negatives Ranking (MNR) Loss
138
+ * **Batch Size:** 32
139
+ * **Epochs:** 4
140
+ * **fp16:** True
141
+ * **batch_sampler:** no_duplicates
142
+ * **multi_dataset_batch_sampler:** round_robin
143
+ * **Learning Rate Decay:** Linear
144
+ * **Infrastructure:** High-Performance Computing (HPC) Cluster
145
+
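To make the loss listed above concrete, here is a minimal pure-Python sketch of the Multiple Negatives Ranking objective: for each query, its paired passage is the positive and every other passage in the batch serves as a negative, scored with scaled cosine similarity and cross-entropy. The helper names (`cosine`, `mnr_loss`) and the toy vectors are illustrative assumptions, not the training code used for this model; `scale=20.0` mirrors the default of `MultipleNegativesRankingLoss` in `sentence-transformers`.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mnr_loss(query_vecs, passage_vecs, scale=20.0):
    """Mean cross-entropy where passage i is the positive for query i
    and every other passage in the batch acts as a negative."""
    losses = []
    for i, q in enumerate(query_vecs):
        logits = [scale * cosine(q, p) for p in passage_vecs]
        # Numerically stable log-sum-exp over all passages in the batch.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)

# Toy 2-D embeddings (hypothetical values, not model outputs).
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]      # each passage sits near its own query
misaligned = [[0.1, 0.9], [0.9, 0.1]]   # passages swapped between queries

print(f"aligned loss:    {mnr_loss(queries, aligned):.4f}")
print(f"misaligned loss: {mnr_loss(queries, misaligned):.4f}")
```

With a batch size of 32, each query is contrasted against 31 in-batch negatives. The `no_duplicates` batch sampler matters for this loss because a duplicated passage inside a batch would be treated as a negative for its own query.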
146
+ #### All Hyperparameters
147
+ <details><summary>Click to expand</summary>
148
+
149
+ - `overwrite_output_dir`: False
150
+ - `do_predict`: False
151
+ - `eval_strategy`: steps
152
+ - `prediction_loss_only`: True
153
+ - `per_device_train_batch_size`: 32
154
+ - `per_device_eval_batch_size`: 32
155
+ - `per_gpu_train_batch_size`: None
156
+ - `per_gpu_eval_batch_size`: None
157
+ - `gradient_accumulation_steps`: 1
158
+ - `eval_accumulation_steps`: None
159
+ - `torch_empty_cache_steps`: None
160
+ - `learning_rate`: 5e-05
161
+ - `weight_decay`: 0.0
162
+ - `adam_beta1`: 0.9
163
+ - `adam_beta2`: 0.999
164
+ - `adam_epsilon`: 1e-08
165
+ - `max_grad_norm`: 1
166
+ - `num_train_epochs`: 4
167
+ - `max_steps`: -1
168
+ - `lr_scheduler_type`: linear
169
+ - `lr_scheduler_kwargs`: {}
170
+ - `warmup_ratio`: 0.0
171
+ - `warmup_steps`: 0
172
+ - `log_level`: passive
173
+ - `log_level_replica`: warning
174
+ - `log_on_each_node`: True
175
+ - `logging_nan_inf_filter`: True
176
+ - `save_safetensors`: True
177
+ - `save_on_each_node`: False
178
+ - `save_only_model`: False
179
+ - `restore_callback_states_from_checkpoint`: False
180
+ - `no_cuda`: False
181
+ - `use_cpu`: False
182
+ - `use_mps_device`: False
183
+ - `seed`: 42
184
+ - `data_seed`: None
185
+ - `jit_mode_eval`: False
186
+ - `use_ipex`: False
187
+ - `bf16`: False
188
+ - `fp16`: True
189
+ - `fp16_opt_level`: O1
190
+ - `half_precision_backend`: auto
191
+ - `bf16_full_eval`: False
192
+ - `fp16_full_eval`: False
193
+ - `tf32`: None
194
+ - `local_rank`: 0
195
+ - `ddp_backend`: None
196
+ - `tpu_num_cores`: None
197
+ - `tpu_metrics_debug`: False
198
+ - `debug`: []
199
+ - `dataloader_drop_last`: False
200
+ - `dataloader_num_workers`: 0
201
+ - `dataloader_prefetch_factor`: None
202
+ - `past_index`: -1
203
+ - `disable_tqdm`: False
204
+ - `remove_unused_columns`: True
205
+ - `label_names`: None
206
+ - `load_best_model_at_end`: False
207
+ - `ignore_data_skip`: False
208
+ - `fsdp`: []
209
+ - `fsdp_min_num_params`: 0
210
+ - `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
211
+ - `fsdp_transformer_layer_cls_to_wrap`: None
212
+ - `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
213
+ - `parallelism_config`: None
214
+ - `deepspeed`: None
215
+ - `label_smoothing_factor`: 0.0
216
+ - `optim`: adamw_torch_fused
217
+ - `optim_args`: None
218
+ - `adafactor`: False
219
+ - `group_by_length`: False
220
+ - `length_column_name`: length
221
+ - `ddp_find_unused_parameters`: None
222
+ - `ddp_bucket_cap_mb`: None
223
+ - `ddp_broadcast_buffers`: False
224
+ - `dataloader_pin_memory`: True
225
+ - `dataloader_persistent_workers`: False
226
+ - `skip_memory_metrics`: True
227
+ - `use_legacy_prediction_loop`: False
228
+ - `push_to_hub`: False
229
+ - `resume_from_checkpoint`: None
230
+ - `hub_model_id`: None
231
+ - `hub_strategy`: every_save
232
+ - `hub_private_repo`: None
233
+ - `hub_always_push`: False
234
+ - `hub_revision`: None
235
+ - `gradient_checkpointing`: False
236
+ - `gradient_checkpointing_kwargs`: None
237
+ - `include_inputs_for_metrics`: False
238
+ - `include_for_metrics`: []
239
+ - `eval_do_concat_batches`: True
240
+ - `fp16_backend`: auto
241
+ - `push_to_hub_model_id`: None
242
+ - `push_to_hub_organization`: None
243
+ - `mp_parameters`:
244
+ - `auto_find_batch_size`: False
245
+ - `full_determinism`: False
246
+ - `torchdynamo`: None
247
+ - `ray_scope`: last
248
+ - `ddp_timeout`: 1800
249
+ - `torch_compile`: False
250
+ - `torch_compile_backend`: None
251
+ - `torch_compile_mode`: None
252
+ - `include_tokens_per_second`: False
253
+ - `include_num_input_tokens_seen`: False
254
+ - `neftune_noise_alpha`: None
255
+ - `optim_target_modules`: None
256
+ - `batch_eval_metrics`: False
257
+ - `eval_on_start`: False
258
+ - `use_liger_kernel`: False
259
+ - `liger_kernel_config`: None
260
+ - `eval_use_gather_object`: False
261
+ - `average_tokens_across_devices`: False
262
+ - `prompts`: None
263
+ - `batch_sampler`: no_duplicates
264
+ - `multi_dataset_batch_sampler`: round_robin
265
+ - `router_mapping`: {}
266
+ - `learning_rate_mapping`: {}
267
+
268
+ </details>
269
+
270
+ ## Evaluation Metrics
271
+
272
+ The model was evaluated on a held-out validation set of synthetic municipal drafts mapped 1-to-1 against Washington State RCWs. The table below compares peak validation performance (reached at epoch 3.02) with the baseline `BAAI/bge-large-en-v1.5` model before fine-tuning.
273
+
274
+ | Metric | Base Model (Not Fine-Tuned) | Fine-Tuned (Peak @ Epoch 3.02) | Absolute Improvement (percentage points) |
275
+ |:-------|:-----------------------------|:-------------------------|:---------------------|
276
+ | **Recall@10** | 0.5684 | **0.8354** | +26.70 |
277
+ | **Recall@5** | 0.2842 | **0.4255** | +14.13 |
278
+ | **NDCG@10** | 0.2509 | **0.3828** | +13.19 |
279
+ | **MRR@10** | 0.1569 | **0.2487** | +9.18 |
280
+
281
+ *Interpretation: Before fine-tuning, the base model returns the exact governing state law within the top 10 results 56.8% of the time. Fine-tuning raises this to 83.5%.*
282
+
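The metrics reported above can be sketched for the single-relevant-document case, which is what a 1-to-1 mapping between drafts and RCWs implies. Under that assumption Accuracy@k equals Recall@k and Precision@k equals Recall@k divided by k, which is consistent with the values in the model-index metadata. The function names and toy rankings below are illustrative assumptions, not the evaluator used to produce the reported numbers.

```python
import math

def rank_of_relevant(ranked_ids, relevant_id, k):
    """1-based rank of the relevant document within the top k, or None."""
    top_k = ranked_ids[:k]
    return top_k.index(relevant_id) + 1 if relevant_id in top_k else None

def evaluate(runs, k=10):
    """runs: list of (ranked_ids, relevant_id), one relevant id per query."""
    hits = mrr = ndcg = 0.0
    for ranked_ids, relevant_id in runs:
        rank = rank_of_relevant(ranked_ids, relevant_id, k)
        if rank is not None:
            hits += 1
            mrr += 1.0 / rank
            # With one relevant document the ideal DCG is 1, so NDCG = DCG.
            ndcg += 1.0 / math.log2(rank + 1)
    n = len(runs)
    recall = hits / n
    return {
        "recall": recall,          # identical to accuracy@k in this setting
        "precision": recall / k,   # at most one hit among k retrieved items
        "mrr": mrr / n,
        "ndcg": ndcg / n,
    }

# Toy rankings (hypothetical document ids); the relevant id is always "a".
runs = [
    (["a", "b", "c"], "a"),  # rank 1
    (["b", "a", "c"], "a"),  # rank 2
    (["b", "c", "d"], "a"),  # not retrieved
    (["c", "b", "a"], "a"),  # rank 3
]
metrics = evaluate(runs, k=3)
for name, value in metrics.items():
    print(f"{name}@3: {value:.4f}")
```

The gap between a high Recall@10 and a much lower MRR@10 indicates that the correct statute is usually retrieved but often not at the first position, so a downstream reranker or a larger context window in a RAG pipeline may be worth considering.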
283
+ ## Limitations
284
+
285
+ - This model does not provide legal advice.
286
+ - Performance is limited to Washington State law (RCW) and may not generalize to other jurisdictions.
287
+ - Outputs depend on the quality of the underlying document corpus.
288
+ - Should be used as a retrieval tool, not a final decision-making system.
289
+
290
+ ## Usage Examples
291
+
292
+ ### Semantic Search with `sentence-transformers`
293
+ <div style="padding:10px; border-left:4px solid #ff4d4f; background-color:#fff1f0;">
294
+
295
+ **Warning:** Because this model is built on the BGE architecture, you **must** prepend the specific instruction prefix
296
+ `"Represent this sentence for searching relevant passages:"`
297
+ to your search queries to achieve optimal performance.
298
+
299
+ **Do not** add this prefix to the database documents.
300
+
301
+ </div>
302
+
303
+ ```python
304
+ import torch
305
+ from sentence_transformers import SentenceTransformer, util
306
+
307
+ # 1. Load the fine-tuned model
308
+ model = SentenceTransformer('CSI-lab/Washington-state-law-embedding-model-Large')
309
+
310
+ # 2. Define the laws (Your Vector Database)
311
+ laws = [
312
+ "RCW 9A.56.030: Theft in the first degree. A person is guilty of theft in the first degree if he or she commits theft of property or services which exceed(s) five thousand dollars in value.",
313
+ "RCW 46.61.502: Driving under the influence. A person is guilty of driving while under the influence of intoxicating liquor...",
314
+ "RCW 9A.36.011: Assault in the first degree. A person is guilty of assault in the first degree if he or she..."
315
+ ]
316
+
317
+ # 3. Define the user's search query
318
+ user_query = "What dollar amount makes a theft a first degree felony?"
319
+
320
+ # 4. CRITICAL: Add the required BGE prefix to the query ONLY
321
+ query_prefix = "Represent this sentence for searching relevant passages: "
322
+ formatted_query = query_prefix + user_query
323
+
324
+ # 5. Encode the documents and the query
325
+ law_embeddings = model.encode(laws, convert_to_tensor=True)
326
+ query_embedding = model.encode(formatted_query, convert_to_tensor=True)
327
+
328
+ # 6. Calculate Cosine Similarity
329
+ cosine_scores = util.cos_sim(query_embedding, law_embeddings)
330
+
331
+ # 7. Print the top result
332
+ best_idx = cosine_scores.argmax().item()
333
+ print(f"Top Match: {laws[best_idx]}")
334
+ print(f"Similarity Score: {cosine_scores[0][best_idx]:.4f}")
335
+ ```
336
+
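The example above prints only the single best match, while the evaluation reports top-10 retrieval. The following dependency-free sketch shows one way to apply the query prefix exactly once and to return the top k documents by cosine similarity. `format_query`, `top_k`, and the toy vectors are hypothetical helpers for illustration; in practice the vectors would come from `model.encode`.

```python
import math

QUERY_PREFIX = "Represent this sentence for searching relevant passages: "

def format_query(text):
    """Prepend the BGE instruction prefix to a query (never to documents)."""
    return text if text.startswith(QUERY_PREFIX) else QUERY_PREFIX + text

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k(query_vec, doc_vecs, k=3):
    """Return (index, score) pairs for the k highest-scoring documents."""
    scored = [(i, cosine(query_vec, d)) for i, d in enumerate(doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Toy 3-D vectors standing in for encoded statutes and an encoded query.
doc_vecs = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.1, 0.0, 0.9]]
query_vec = [1.0, 0.2, 0.0]

print(format_query("What dollar amount makes a theft a first degree felony?"))
for idx, score in top_k(query_vec, doc_vecs, k=2):
    print(f"doc {idx}: {score:.4f}")
```

Passing the top k statutes rather than only the best one to a downstream generator matches how the model was evaluated.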
337
+ ## Model Citation
338
+ ```
339
+ @misc{washington_state_law_embedding_Large_2026,
340
+ title={Washington-state-law-embedding-model-Large: Fine-Tuned Dense Retrieval for Washington State Law},
341
+ author={Tomar, Shlok},
342
+ year={2026},
343
+ publisher={Hugging Face},
344
+ howpublished={\url{https://huggingface.co/CSI-lab/Washington-state-law-embedding-model-Large}},
345
+ note={Hugging Face Model Repository}
346
+ }
347
+ ```
348
+
349
+ ### BibTeX
350
+
351
+ #### Sentence Transformers
352
+ ```bibtex
353
+ @inproceedings{reimers-2019-sentence-bert,
354
+ title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
355
+ author = "Reimers, Nils and Gurevych, Iryna",
356
+ booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
357
+ month = "11",
358
+ year = "2019",
359
+ publisher = "Association for Computational Linguistics",
360
+ url = "https://arxiv.org/abs/1908.10084",
361
+ }
362
+ ```
363
+
364
+ #### MultipleNegativesRankingLoss
365
+ ```bibtex
366
+ @misc{henderson2017efficient,
367
+ title={Efficient Natural Language Response Suggestion for Smart Reply},
368
+ author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
369
+ year={2017},
370
+ eprint={1705.00652},
371
+ archivePrefix={arXiv},
372
+ primaryClass={cs.CL}
373
+ }
374
+ ```