yjoonjang committed on
Commit bd9e41e · verified · 1 Parent(s): 55907a6

Add new SparseEncoder model
1_SpladePooling/config.json ADDED
@@ -0,0 +1,5 @@
+ {
+   "pooling_strategy": "max",
+   "activation_function": "relu",
+   "word_embedding_dimension": 50000
+ }
README.md ADDED
@@ -0,0 +1,630 @@
+ ---
+ tags:
+ - sentence-transformers
+ - sparse-encoder
+ - sparse
+ - splade
+ - generated_from_trainer
+ - loss:SpladeLoss
+ - loss:SparseMultipleNegativesRankingLoss
+ - loss:FlopsLoss
+ base_model: skt/A.X-Encoder-base
+ pipeline_tag: feature-extraction
+ library_name: sentence-transformers
+ license: apache-2.0
+ language:
+ - ko
+ ---
+
+ # splade-ko-v1.0
+
+ **splade-ko-v1.0** is a Korean-specific SPLADE Sparse Encoder model finetuned from [skt/A.X-Encoder-base](https://huggingface.co/skt/A.X-Encoder-base) using the [sentence-transformers](https://www.SBERT.net) library. It maps sentences & paragraphs to a 50000-dimensional sparse vector space and can be used for semantic search and sparse retrieval.
+
+ ## Model Details
+
+ ### Model Description
+
+ - **Model Type:** SPLADE Sparse Encoder
+ - **Base model:** [skt/A.X-Encoder-base](https://huggingface.co/skt/A.X-Encoder-base) <!-- at revision b5c71f3601aedf38372fe21383ac7d04991af187 -->
+ - **Maximum Sequence Length:** 8192 tokens
+ - **Output Dimensionality:** 50000 dimensions
+ - **Similarity Function:** Dot Product
+
+ ### Model Sources
+
+ - **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
+ - **Documentation:** [Sparse Encoder Documentation](https://www.sbert.net/docs/sparse_encoder/usage/usage.html)
+ - **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
+ - **Hugging Face:** [Sparse Encoders on Hugging Face](https://huggingface.co/models?library=sentence-transformers&other=sparse-encoder)
+
+ ### Full Model Architecture
+
+ ```
+ SparseEncoder(
+   (0): MLMTransformer({'max_seq_length': 8192, 'do_lower_case': False, 'architecture': 'ModernBertForMaskedLM'})
+   (1): SpladePooling({'pooling_strategy': 'max', 'activation_function': 'relu', 'word_embedding_dimension': 50000})
+ )
+ ```
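The SpladePooling module is what turns the MLM head's token-level logits into a single vocabulary-sized sparse vector. As a rough sketch of the standard SPLADE formulation (a log-saturated ReLU followed by a max over token positions) — a toy numpy illustration, not the library's exact implementation:

```python
import numpy as np

def splade_pool(mlm_logits: np.ndarray) -> np.ndarray:
    """Toy SPLADE pooling: log-saturated ReLU of the MLM logits,
    then a max over token positions -> one vocab-sized sparse vector.

    mlm_logits: (seq_len, vocab_size) scores from the MLM head.
    """
    activated = np.log1p(np.maximum(mlm_logits, 0.0))  # log(1 + ReLU(x))
    return activated.max(axis=0)                       # (vocab_size,)

# Tiny example: 3 token positions over a 5-word vocabulary
logits = np.array([
    [ 1.0, -2.0, 0.5, 0.0, -1.0],
    [ 0.0,  3.0, 0.0, 0.0, -0.5],
    [-1.0,  0.0, 2.0, 0.0,  0.0],
])
vec = splade_pool(logits)
print(int((vec > 0).sum()))  # 3 -> only 3 of 5 dimensions are active
```

Because the ReLU zeroes out negative logits and most vocabulary entries never score positively for a given text, the pooled vector is sparse, which is what makes inverted-index style retrieval possible.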
+ ## Usage
+
+ ### Direct Usage (Sentence Transformers)
+
+ First install the Sentence Transformers library:
+
+ ```bash
+ pip install -U sentence-transformers
+ ```
+
+ Then you can load this model and run inference.
+
+ ```python
+ from sentence_transformers import SparseEncoder
+
+ # Download from the 🤗 Hub
+ model = SparseEncoder("yjoonjang/splade-ko-v1.0")
+ # Run inference
+ sentences = [
+     '양이온 최적화 방법은 산소공공을 감소시키기 때문에 전자 농도가 증가하는 문제점을 갖고있을까?',
+     '산화물 TFT 소자 신뢰성 열화기구\n그러나 이와 같은 양이온 최적화 방법은 산소공공을 감소시키기 때문에 전자농도 역시 감소하게 되어 전계 이동도가 감소하는 문제점을 않고 있다.\n이는 산화물 반도체의 전도기구가 Percolation Conduction에 따르기 때문이다. ',
+     '세포대사 기능 분석을 위한 광학센서 기반 용존산소와 pH 측정 시스템의 제작 및 특성 분석\n수소이온 농도가 증가하는 경우인 \\( \\mathrm{pH} \\) 가 낮아지면 다수의 수소이온들과 충돌한 방출 광이 에너지를 잃고 짧은 검출시간을 갖는다. \n반대로 \\( \\mathrm{pH} \\)가 높아질수록 형광물질로부터 방출된 광의 수명이 길어져 긴 검출시간을 가진다.',
+ ]
+ embeddings = model.encode(sentences)
+ print(embeddings.shape)
+ # [3, 50000]
+
+ # Get the similarity scores for the embeddings
+ similarities = model.similarity(embeddings, embeddings)
+ print(similarities)
+ # tensor([[ 51.0219,  59.3815,  26.5736],
+ #         [ 59.3815, 240.4989,  44.8991],
+ #         [ 26.5736,  44.8991, 241.5082]], device='cuda:0')
+ ```
+
+ ## Evaluation
+
+ ### MTEB-ko-retrieval Leaderboard
+
+ I evaluated this model on all of the Korean retrieval benchmarks in [MTEB](https://github.com/embeddings-benchmark/mteb).
+
+ #### Korean Retrieval Benchmark
+
+ | Dataset | Description | Average Length (characters) |
+ |---|---|---|
+ | [Ko-StrategyQA](https://huggingface.co/datasets/taeminlee/Ko-StrategyQA) | Korean ODQA multi-hop retrieval dataset (translated from StrategyQA) | 305.15 |
+ | [AutoRAGRetrieval](https://huggingface.co/datasets/yjoonjang/markers_bm) | Korean document retrieval dataset constructed by parsing PDFs across 5 domains: finance, public sector, healthcare, legal, and commerce | 823.60 |
+ | [MIRACLRetrieval](https://huggingface.co/datasets/miracl/miracl) | Wikipedia-based Korean document retrieval dataset | 166.63 |
+ | [PublicHealthQA](https://huggingface.co/datasets/xhluca/publichealth-qa) | Korean document retrieval dataset for the medical and public health domains | 339.00 |
+ | [BelebeleRetrieval](https://huggingface.co/datasets/facebook/belebele) | FLORES-200-based Korean document retrieval dataset | 243.11 |
+ | [MrTidyRetrieval](https://huggingface.co/datasets/mteb/mrtidy) | Wikipedia-based Korean document retrieval dataset | 166.90 |
+ | [MultiLongDocRetrieval](https://huggingface.co/datasets/Shitao/MLDR) | Korean long-document retrieval dataset across various domains | 13,813.44 |
+
121
+ <details>
+ <summary>Reasons for excluding XPQARetrieval</summary>
+
+ - We excluded the [XPQARetrieval](https://huggingface.co/datasets/jinaai/xpqa) dataset from our evaluation. XPQA is designed to evaluate cross-lingual QA capabilities, and we judged it inappropriate for evaluating retrieval tasks that require finding supporting documents for a given query.
+ - Examples from the XPQARetrieval dataset:
+ ```json
+ [
+   {
+     "query": "Is it unopened?",
+     "document": "No. It is a renewed product."
+   },
+   {
+     "query": "Is it compatible with iPad Air 3?",
+     "document": "Yes, it is possible."
+   }
+ ]
+ ```
+ - Details on why this dataset was excluded are given in this [GitHub issue](https://github.com/embeddings-benchmark/mteb/discussions/3099).
+
+ </details>
+
+ ### Evaluation Metrics
+
+ - Recall@10
+ - NDCG@10
+ - MRR@10
+ - AVG_Query_Active_Dims
+ - AVG_Corpus_Active_Dims
+
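For reference, the three ranking metrics above can be computed from a system's ranked result list and a set of gold-relevant documents. A minimal self-contained sketch with binary relevance and toy data — not the evaluator's actual implementation:

```python
import math

def recall_at_k(ranked, relevant, k=10):
    """Fraction of relevant docs that appear in the top-k results."""
    return len(set(ranked[:k]) & relevant) / len(relevant)

def mrr_at_k(ranked, relevant, k=10):
    """Reciprocal rank of the first relevant doc in the top-k (0 if none)."""
    for i, doc in enumerate(ranked[:k], start=1):
        if doc in relevant:
            return 1.0 / i
    return 0.0

def ndcg_at_k(ranked, relevant, k=10):
    """Binary-relevance NDCG: DCG of the ranking over the ideal DCG."""
    dcg = sum(1.0 / math.log2(i + 1)
              for i, doc in enumerate(ranked[:k], start=1) if doc in relevant)
    idcg = sum(1.0 / math.log2(i + 1)
               for i in range(1, min(len(relevant), k) + 1))
    return dcg / idcg

ranked = ["d3", "d1", "d7", "d2"]   # system ranking for one query
relevant = {"d1", "d2"}             # gold relevant documents
print(recall_at_k(ranked, relevant))           # 1.0 (both found in top 10)
print(mrr_at_k(ranked, relevant))              # 0.5 (first hit at rank 2)
print(round(ndcg_at_k(ranked, relevant), 4))   # 0.6509
```

The two Active_Dims statistics are not ranking metrics; they measure how sparse the produced embeddings are (see the sketch under Evaluation Results).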
+ ### Evaluation Code
+
+ Our evaluation uses the [SparseInformationRetrievalEvaluator](https://github.com/UKPLab/sentence-transformers/blob/master/sentence_transformers/sparse_encoder/evaluation/SparseInformationRetrievalEvaluator.py#L23-L308) from the [sentence-transformers](https://www.SBERT.net) library.
+
+ <details><summary>Code</summary>
+
+ ```python
+ from sentence_transformers import SparseEncoder
+ from datasets import load_dataset
+ from sentence_transformers.sparse_encoder.evaluation import SparseInformationRetrievalEvaluator
+ import os
+ import pandas as pd
+ from tqdm import tqdm
+ import json
+ from multiprocessing import Process, current_process
+ import torch
+ from setproctitle import setproctitle
+ import traceback
+
+ # Map each GPU id to the datasets it should evaluate
+ DATASET_GPU_MAPPING = {
+     0: [
+         "yjoonjang/markers_bm",
+         "taeminlee/Ko-StrategyQA",
+         "facebook/belebele",
+         "xhluca/publichealth-qa",
+         "Shitao/MLDR"
+     ],
+     1: [
+         "miracl/mmteb-miracl",
+     ],
+     2: [
+         "mteb/mrtidy",
+     ]
+ }
+
+ model_name = "yjoonjang/splade-ko-v1.0"
+
+ def evaluate_dataset(model_name, gpu_id, eval_datasets):
+     """Evaluate the datasets assigned to a single GPU."""
+     import torch
+     try:
+         os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
+         device = torch.device("cuda:0")
+         torch.cuda.set_device(device)
+
+         setproctitle(f"yjoonjang splade-eval-gpu{gpu_id}")
+         print(f"Running datasets: {eval_datasets} on GPU {gpu_id} in process {current_process().name}")
+
+         # Load the model
+         model = SparseEncoder(model_name, trust_remote_code=True, device=device)
+
+         for eval_dataset in eval_datasets:
+             short_dataset_name = eval_dataset.split("/")[-1]
+             output_dir = f"./results/{model_name}"
+             os.makedirs(output_dir, exist_ok=True)
+
+             prediction_filepath = f"{output_dir}/{short_dataset_name}.json"
+             if os.path.exists(prediction_filepath):
+                 print(f"Skipping evaluation for {eval_dataset} as output already exists at {prediction_filepath}")
+                 continue
+
+             corpus = {}
+             queries = {}
+             relevant_docs = {}
+             split = "dev"
+             if eval_dataset == "yjoonjang/markers_bm" or eval_dataset == "yjoonjang/squad_kor_v1":
+                 split = "test"
+
+             if eval_dataset in ["yjoonjang/markers_bm", "taeminlee/Ko-StrategyQA"]:
+                 dev_corpus = load_dataset(eval_dataset, "corpus", split="corpus")
+                 dev_queries = load_dataset(eval_dataset, "queries", split="queries")
+                 relevant_docs_data = load_dataset(eval_dataset, "default", split=split)
+
+                 queries = dict(zip(dev_queries["_id"], dev_queries["text"]))
+                 corpus = dict(zip(dev_corpus["_id"], dev_corpus["text"]))
+                 for qid, corpus_ids in zip(relevant_docs_data["query-id"], relevant_docs_data["corpus-id"]):
+                     qid_str = str(qid)
+                     corpus_ids_str = str(corpus_ids)
+                     if qid_str not in relevant_docs:
+                         relevant_docs[qid_str] = set()
+                     relevant_docs[qid_str].add(corpus_ids_str)
+
+             elif eval_dataset == "facebook/belebele":
+                 split = "test"
+                 ds = load_dataset(eval_dataset, "kor_Hang", split=split)
+
+                 corpus_df = pd.DataFrame(ds)
+                 corpus_df = corpus_df.drop_duplicates(subset=["link"])
+                 corpus_df["cid"] = [f"C{i}" for i in range(len(corpus_df))]
+                 corpus = dict(zip(corpus_df["cid"], corpus_df["flores_passage"]))
+
+                 link_to_cid = dict(zip(corpus_df["link"], corpus_df["cid"]))
+
+                 queries_df = pd.DataFrame(ds)
+                 queries_df = queries_df.drop_duplicates(subset=["question"])
+                 queries_df["qid"] = [f"Q{i}" for i in range(len(queries_df))]
+                 queries = dict(zip(queries_df["qid"], queries_df["question"]))
+
+                 question_to_qid = dict(zip(queries_df["question"], queries_df["qid"]))
+
+                 for row in tqdm(ds, desc="Processing belebele"):
+                     qid = question_to_qid[row["question"]]
+                     cid = link_to_cid[row["link"]]
+                     if qid not in relevant_docs:
+                         relevant_docs[qid] = set()
+                     relevant_docs[qid].add(cid)
+
+             elif eval_dataset == "jinaai/xpqa":
+                 split = "test"
+                 ds = load_dataset(eval_dataset, "ko", split=split, trust_remote_code=True)
+
+                 corpus_df = pd.DataFrame(ds)
+                 corpus_df = corpus_df.drop_duplicates(subset=["answer"])
+                 corpus_df["cid"] = [f"C{i}" for i in range(len(corpus_df))]
+                 corpus = dict(zip(corpus_df["cid"], corpus_df["answer"]))
+                 answer_to_cid = dict(zip(corpus_df["answer"], corpus_df["cid"]))
+
+                 queries_df = pd.DataFrame(ds)
+                 queries_df = queries_df.drop_duplicates(subset=["question"])
+                 queries_df["qid"] = [f"Q{i}" for i in range(len(queries_df))]
+                 queries = dict(zip(queries_df["qid"], queries_df["question"]))
+                 question_to_qid = dict(zip(queries_df["question"], queries_df["qid"]))
+
+                 for row in tqdm(ds, desc="Processing xpqa"):
+                     qid = question_to_qid[row["question"]]
+                     cid = answer_to_cid[row["answer"]]
+                     if qid not in relevant_docs:
+                         relevant_docs[qid] = set()
+                     relevant_docs[qid].add(cid)
+
+             elif eval_dataset == "miracl/mmteb-miracl":
+                 split = "dev"
+                 corpus_ds = load_dataset(eval_dataset, "corpus-ko", split="corpus")
+                 queries_ds = load_dataset(eval_dataset, "queries-ko", split="queries")
+                 qrels_ds = load_dataset(eval_dataset, "ko", split=split)
+
+                 corpus = {row['docid']: row['text'] for row in corpus_ds}
+                 queries = {row['query_id']: row['query'] for row in queries_ds}
+
+                 for row in qrels_ds:
+                     qid = row["query_id"]
+                     cid = row["docid"]
+                     if qid not in relevant_docs:
+                         relevant_docs[qid] = set()
+                     relevant_docs[qid].add(cid)
+
+             elif eval_dataset == "mteb/mrtidy":
+                 split = "test"
+                 corpus_ds = load_dataset(eval_dataset, "korean-corpus", split="train", trust_remote_code=True)
+                 queries_ds = load_dataset(eval_dataset, "korean-queries", split=split, trust_remote_code=True)
+                 qrels_ds = load_dataset(eval_dataset, "korean-qrels", split=split, trust_remote_code=True)
+
+                 corpus = {row['_id']: row['text'] for row in corpus_ds}
+                 queries = {row['_id']: row['text'] for row in queries_ds}
+
+                 for row in qrels_ds:
+                     qid = str(row["query-id"])
+                     cid = str(row["corpus-id"])
+                     if qid not in relevant_docs:
+                         relevant_docs[qid] = set()
+                     relevant_docs[qid].add(cid)
+
+             elif eval_dataset == "Shitao/MLDR":
+                 split = "dev"
+                 corpus_ds = load_dataset(eval_dataset, "corpus-ko", split="corpus")
+                 lang_data = load_dataset(eval_dataset, "ko", split=split)
+
+                 corpus = {row['docid']: row['text'] for row in corpus_ds}
+                 queries = {row['query_id']: row['query'] for row in lang_data}
+
+                 for row in lang_data:
+                     qid = row["query_id"]
+                     cid = row["positive_passages"][0]["docid"]
+                     if qid not in relevant_docs:
+                         relevant_docs[qid] = set()
+                     relevant_docs[qid].add(cid)
+
+             elif eval_dataset == "xhluca/publichealth-qa":
+                 split = "test"
+                 ds = load_dataset(eval_dataset, "korean", split=split)
+
+                 ds = ds.filter(lambda x: x["question"] is not None and x["answer"] is not None)
+
+                 corpus_df = pd.DataFrame(list(ds))
+                 corpus_df = corpus_df.drop_duplicates(subset=["answer"])
+                 corpus_df["cid"] = [f"D{i}" for i in range(len(corpus_df))]
+                 corpus = dict(zip(corpus_df["cid"], corpus_df["answer"]))
+                 answer_to_cid = dict(zip(corpus_df["answer"], corpus_df["cid"]))
+
+                 queries_df = pd.DataFrame(list(ds))
+                 queries_df = queries_df.drop_duplicates(subset=["question"])
+                 queries_df["qid"] = [f"Q{i}" for i in range(len(queries_df))]
+                 queries = dict(zip(queries_df["qid"], queries_df["question"]))
+                 question_to_qid = dict(zip(queries_df["question"], queries_df["qid"]))
+
+                 for row in tqdm(ds, desc="Processing publichealth-qa"):
+                     qid = question_to_qid[row["question"]]
+                     cid = answer_to_cid[row["answer"]]
+                     if qid not in relevant_docs:
+                         relevant_docs[qid] = set()
+                     relevant_docs[qid].add(cid)
+
+             else:
+                 continue
+
+             evaluator = SparseInformationRetrievalEvaluator(
+                 queries=queries,
+                 corpus=corpus,
+                 relevant_docs=relevant_docs,
+                 write_csv=False,
+                 name=f"{eval_dataset}",
+                 show_progress_bar=True,
+                 batch_size=32,
+                 write_predictions=False
+             )
+             metrics = evaluator(model)
+             print(f"GPU {gpu_id} - {eval_dataset} metrics: {metrics}")
+             with open(prediction_filepath, "w", encoding="utf-8") as f:
+                 json.dump(metrics, f, ensure_ascii=False, indent=2)
+
+     except Exception as ex:
+         print(f"Error on GPU {gpu_id}: {ex}")
+         traceback.print_exc()
+
+ if __name__ == "__main__":
+     torch.multiprocessing.set_start_method('spawn')
+
+     print(f"Starting evaluation for model: {model_name}")
+     processes = []
+
+     for gpu_id, datasets in DATASET_GPU_MAPPING.items():
+         p = Process(target=evaluate_dataset, args=(model_name, gpu_id, datasets))
+         p.start()
+         processes.append(p)
+
+     for p in processes:
+         p.join()
+
+     print(f"Completed evaluation for model: {model_name}")
+ ```
+
+ </details>
+
+ ### Evaluation Results
+
+ | Model | Parameters | Recall@10 | NDCG@10 | MRR@10 | AVG_Query_Active_Dims | AVG_Corpus_Active_Dims |
+ |---|---|---|---|---|---|---|
+ | **yjoonjang/splade-ko-v1.0** | **0.1B** | **0.7626** | **0.7037** | **0.7379** | **110.7664** | **778.6494** |
+ | [telepix/PIXIE-Splade-Preview](https://huggingface.co/telepix/PIXIE-Splade-Preview) | 0.1B | 0.7382 | 0.6869 | 0.7204 | 108.3300 | 718.5110 |
+ | [opensearch-project/opensearch-neural-sparse-encoding-multilingual-v1](https://huggingface.co/opensearch-project/opensearch-neural-sparse-encoding-multilingual-v1) | 0.1B | 0.5900 | 0.5137 | 0.5455 | 27.8722 | 177.5564 |
+
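AVG_Query_Active_Dims and AVG_Corpus_Active_Dims in the table count the non-zero entries per sparse embedding, averaged over queries or corpus documents. A toy numpy sketch of that statistic (the evaluator's exact counting may differ, e.g. in threshold handling):

```python
import numpy as np

def avg_active_dims(embs: np.ndarray) -> float:
    """Mean number of non-zero dimensions per sparse embedding.

    embs: (n_texts, vocab_size) array of sparse embeddings.
    """
    return float((embs > 0).sum(axis=1).mean())

# Two toy "sparse" embeddings over a 6-dimensional vocabulary
embs = np.array([
    [0.0, 1.2, 0.0, 0.7, 0.0, 0.0],   # 2 active dims
    [0.3, 0.0, 0.0, 0.0, 0.9, 1.1],   # 3 active dims
])
print(avg_active_dims(embs))  # 2.5
```

Lower values mean sparser (cheaper-to-index) vectors, so the metric captures the efficiency side of the effectiveness/efficiency trade-off that FlopsLoss regularizes during training.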
+
+ ## Training Details
+
+ ### Training Hyperparameters
+
+ #### Non-Default Hyperparameters
+
+ - `eval_strategy`: steps
+ - `per_device_train_batch_size`: 4
+ - `per_device_eval_batch_size`: 2
+ - `learning_rate`: 2e-05
+ - `num_train_epochs`: 2
+ - `warmup_ratio`: 0.1
+ - `bf16`: True
+ - `negs_per_query`: 6 (from our dataset)
+ - `gather_device`: True (makes in-batch samples available to be shared across devices)
+
+ #### All Hyperparameters
+
+ <details><summary>Click to expand</summary>
+
+ - `overwrite_output_dir`: False
+ - `do_predict`: False
+ - `eval_strategy`: steps
+ - `prediction_loss_only`: True
+ - `per_device_train_batch_size`: 4
+ - `per_device_eval_batch_size`: 2
+ - `per_gpu_train_batch_size`: None
+ - `per_gpu_eval_batch_size`: None
+ - `gradient_accumulation_steps`: 1
+ - `eval_accumulation_steps`: None
+ - `torch_empty_cache_steps`: None
+ - `learning_rate`: 2e-05
+ - `weight_decay`: 0.0
+ - `adam_beta1`: 0.9
+ - `adam_beta2`: 0.999
+ - `adam_epsilon`: 1e-08
+ - `max_grad_norm`: 1.0
+ - `num_train_epochs`: 2
+ - `max_steps`: -1
+ - `lr_scheduler_type`: linear
+ - `lr_scheduler_kwargs`: {}
+ - `warmup_ratio`: 0.1
+ - `warmup_steps`: 0
+ - `log_level`: passive
+ - `log_level_replica`: warning
+ - `log_on_each_node`: True
+ - `logging_nan_inf_filter`: True
+ - `save_safetensors`: True
+ - `save_on_each_node`: False
+ - `save_only_model`: False
+ - `restore_callback_states_from_checkpoint`: False
+ - `no_cuda`: False
+ - `use_cpu`: False
+ - `use_mps_device`: False
+ - `seed`: 42
+ - `data_seed`: None
+ - `jit_mode_eval`: False
+ - `use_ipex`: False
+ - `bf16`: True
+ - `fp16`: False
+ - `fp16_opt_level`: O1
+ - `half_precision_backend`: auto
+ - `bf16_full_eval`: False
+ - `fp16_full_eval`: False
+ - `tf32`: None
+ - `local_rank`: 7
+ - `ddp_backend`: None
+ - `tpu_num_cores`: None
+ - `tpu_metrics_debug`: False
+ - `debug`: []
+ - `dataloader_drop_last`: True
+ - `dataloader_num_workers`: 0
+ - `dataloader_prefetch_factor`: None
+ - `past_index`: -1
+ - `disable_tqdm`: False
+ - `remove_unused_columns`: True
+ - `label_names`: None
+ - `load_best_model_at_end`: False
+ - `ignore_data_skip`: False
+ - `fsdp`: []
+ - `fsdp_min_num_params`: 0
+ - `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
+ - `fsdp_transformer_layer_cls_to_wrap`: None
+ - `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
+ - `parallelism_config`: None
+ - `deepspeed`: None
+ - `label_smoothing_factor`: 0.0
+ - `optim`: adamw_torch_fused
+ - `optim_args`: None
+ - `adafactor`: False
+ - `group_by_length`: False
+ - `length_column_name`: length
+ - `ddp_find_unused_parameters`: None
+ - `ddp_bucket_cap_mb`: None
+ - `ddp_broadcast_buffers`: False
+ - `dataloader_pin_memory`: True
+ - `dataloader_persistent_workers`: False
+ - `skip_memory_metrics`: True
+ - `use_legacy_prediction_loop`: False
+ - `push_to_hub`: False
+ - `resume_from_checkpoint`: None
+ - `hub_model_id`: None
+ - `hub_strategy`: every_save
+ - `hub_private_repo`: None
+ - `hub_always_push`: False
+ - `hub_revision`: None
+ - `gradient_checkpointing`: False
+ - `gradient_checkpointing_kwargs`: None
+ - `include_inputs_for_metrics`: False
+ - `include_for_metrics`: []
+ - `eval_do_concat_batches`: True
+ - `fp16_backend`: auto
+ - `push_to_hub_model_id`: None
+ - `push_to_hub_organization`: None
+ - `mp_parameters`:
+ - `auto_find_batch_size`: False
+ - `full_determinism`: False
+ - `torchdynamo`: None
+ - `ray_scope`: last
+ - `ddp_timeout`: 1800
+ - `torch_compile`: False
+ - `torch_compile_backend`: None
+ - `torch_compile_mode`: None
+ - `include_tokens_per_second`: False
+ - `include_num_input_tokens_seen`: False
+ - `neftune_noise_alpha`: None
+ - `optim_target_modules`: None
+ - `batch_eval_metrics`: False
+ - `eval_on_start`: False
+ - `use_liger_kernel`: False
+ - `liger_kernel_config`: None
+ - `eval_use_gather_object`: False
+ - `average_tokens_across_devices`: True
+ - `prompts`: None
+ - `batch_sampler`: batch_sampler
+ - `multi_dataset_batch_sampler`: proportional
+ - `router_mapping`: {}
+ - `learning_rate_mapping`: {}
+
+ </details>
+
+ ### Framework Versions
+
+ - Python: 3.10.18
+ - Sentence Transformers: 5.1.1
+ - Transformers: 4.56.2
+ - PyTorch: 2.8.0+cu128
+ - Accelerate: 1.10.1
+ - Datasets: 4.1.1
+ - Tokenizers: 0.22.1
+
+ ## Citation
+
+ ### BibTeX
+
+ #### Sentence Transformers
+
+ ```bibtex
+ @inproceedings{reimers-2019-sentence-bert,
+     title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
+     author = "Reimers, Nils and Gurevych, Iryna",
+     booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
+     month = "11",
+     year = "2019",
+     publisher = "Association for Computational Linguistics",
+     url = "https://arxiv.org/abs/1908.10084",
+ }
+ ```
+
+ #### SpladeLoss
+
+ ```bibtex
+ @misc{formal2022distillationhardnegativesampling,
+     title={From Distillation to Hard Negative Sampling: Making Sparse Neural IR Models More Effective},
+     author={Thibault Formal and Carlos Lassance and Benjamin Piwowarski and Stéphane Clinchant},
+     year={2022},
+     eprint={2205.04733},
+     archivePrefix={arXiv},
+     primaryClass={cs.IR},
+     url={https://arxiv.org/abs/2205.04733},
+ }
+ ```
+
+ #### SparseMultipleNegativesRankingLoss
+
+ ```bibtex
+ @misc{henderson2017efficient,
+     title={Efficient Natural Language Response Suggestion for Smart Reply},
+     author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
+     year={2017},
+     eprint={1705.00652},
+     archivePrefix={arXiv},
+     primaryClass={cs.CL}
+ }
+ ```
+
+ #### FlopsLoss
+
+ ```bibtex
+ @article{paria2020minimizing,
+     title={Minimizing flops to learn efficient sparse representations},
+     author={Paria, Biswajit and Yeh, Chih-Kuan and Yen, Ian EH and Xu, Ning and Ravikumar, Pradeep and P{\'o}czos, Barnab{\'a}s},
+     journal={arXiv preprint arXiv:2004.05665},
+     year={2020}
+ }
+ ```
added_tokens.json ADDED
@@ -0,0 +1,3 @@
+ {
+   "<pad>": 49999
+ }
config.json ADDED
@@ -0,0 +1,46 @@
+ {
+   "architectures": [
+     "ModernBertForMaskedLM"
+   ],
+   "attention_bias": false,
+   "attention_dropout": 0.0,
+   "bos_token_id": 0,
+   "classifier_activation": "gelu",
+   "classifier_bias": false,
+   "classifier_dropout": 0.0,
+   "classifier_pooling": "mean",
+   "cls_token_id": 0,
+   "decoder_bias": true,
+   "deterministic_flash_attn": false,
+   "dtype": "float32",
+   "embedding_dropout": 0.0,
+   "eos_token_id": 1,
+   "global_attn_every_n_layers": 3,
+   "global_rope_theta": 160000,
+   "gradient_checkpointing": false,
+   "hidden_activation": "gelu",
+   "hidden_size": 768,
+   "initializer_cutoff_factor": 2.0,
+   "initializer_range": 0.02,
+   "intermediate_size": 1152,
+   "layer_norm_eps": 1e-05,
+   "local_attention": 128,
+   "local_rope_theta": 10000.0,
+   "max_position_embeddings": 16384,
+   "mlp_bias": false,
+   "mlp_dropout": 0.0,
+   "model_type": "modernbert",
+   "norm_bias": false,
+   "norm_eps": 1e-05,
+   "num_attention_heads": 12,
+   "num_hidden_layers": 22,
+   "pad_token_id": 49999,
+   "position_embedding_type": "absolute",
+   "repad_logits_with_grad": false,
+   "sep_token_id": 1,
+   "sparse_pred_ignore_index": -100,
+   "sparse_prediction": false,
+   "torch_dtype": "float32",
+   "transformers_version": "4.52.3",
+   "vocab_size": 50000
+ }
config_sentence_transformers.json ADDED
@@ -0,0 +1,14 @@
+ {
+   "model_type": "SparseEncoder",
+   "__version__": {
+     "sentence_transformers": "5.1.1",
+     "transformers": "4.52.3",
+     "pytorch": "2.8.0+cu128"
+   },
+   "prompts": {
+     "query": "",
+     "document": ""
+   },
+   "default_prompt_name": null,
+   "similarity_fn_name": "dot"
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:141f71d91cfc85255c6777e38312c15f080513216c937f02fec2cebdfb3ad40c
+ size 597503064
modules.json ADDED
@@ -0,0 +1,14 @@
+ [
+   {
+     "idx": 0,
+     "name": "0",
+     "path": "",
+     "type": "sentence_transformers.sparse_encoder.models.MLMTransformer"
+   },
+   {
+     "idx": 1,
+     "name": "1",
+     "path": "1_SpladePooling",
+     "type": "sentence_transformers.sparse_encoder.models.SpladePooling"
+   }
+ ]
sentence_bert_config.json ADDED
@@ -0,0 +1,4 @@
+ {
+   "max_seq_length": 8192,
+   "do_lower_case": false
+ }
special_tokens_map.json ADDED
@@ -0,0 +1,51 @@
+ {
+   "bos_token": {
+     "content": "<s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "cls_token": {
+     "content": "<cls>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "eos_token": {
+     "content": "<\\s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "mask_token": {
+     "content": "<mask>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": {
+     "content": "<pad>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "sep_token": {
+     "content": "<sep>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "unk_token": {
+     "content": "<unk>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,329 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "added_tokens_decoder": {
3
+ "0": {
4
+ "content": "<s>",
5
+ "lstrip": false,
6
+ "normalized": false,
7
+ "rstrip": false,
8
+ "single_word": false,
9
+ "special": true
10
+ },
11
+ "1": {
12
+ "content": "<\\s>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "2": {
+ "content": "<unk>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "3": {
+ "content": "<sep>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "4": {
+ "content": "<mask>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "5": {
+ "content": "<cls>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "6": {
+ "content": "<unused0>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "7": {
+ "content": "<unused1>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "8": {
+ "content": "<unused2>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "9": {
+ "content": "<unused3>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "10": {
+ "content": "<unused4>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "11": {
+ "content": "<unused5>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "12": {
+ "content": "<unused6>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "13": {
+ "content": "<unused7>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "14": {
+ "content": "<unused8>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "15": {
+ "content": "<unused9>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "16": {
+ "content": "<unused10>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "17": {
+ "content": "<unused11>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "18": {
+ "content": "<unused12>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "19": {
+ "content": "<unused13>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "20": {
+ "content": "<unused14>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "21": {
+ "content": "<unused15>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "22": {
+ "content": "<unused16>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "23": {
+ "content": "<unused17>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "24": {
+ "content": "<unused18>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "25": {
+ "content": "<unused19>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "26": {
+ "content": "<unused20>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "27": {
+ "content": "<unused21>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "28": {
+ "content": "<unused22>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "29": {
+ "content": "<unused23>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "30": {
+ "content": "<unused24>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "31": {
+ "content": "<unused25>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "32": {
+ "content": "<unused26>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "33": {
+ "content": "<unused27>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "34": {
+ "content": "<unused28>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "35": {
+ "content": "<unused29>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "36": {
+ "content": "<unused30>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "49999": {
+ "content": "<pad>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ }
+ },
+ "bos_token": "<s>",
+ "clean_up_tokenization_spaces": true,
+ "cls_token": "<cls>",
+ "do_lower_case": false,
+ "eos_token": "</s>",
+ "extra_special_tokens": {},
+ "mask_token": "<mask>",
+ "max_length": 2048,
+ "model_max_length": 8192,
+ "pad_to_multiple_of": null,
+ "pad_token": "<pad>",
+ "pad_token_type_id": 0,
+ "padding_side": "right",
+ "sep_token": "<sep>",
+ "stride": 0,
+ "strip_accents": null,
+ "tokenize_chinese_chars": true,
+ "tokenizer_class": "BertTokenizer",
+ "truncation_side": "right",
+ "truncation_strategy": "longest_first",
+ "unk_token": "<unk>"
+ }
vocab.txt ADDED
The diff for this file is too large to render. See raw diff