Upload folder using huggingface_hub

Files changed (10)

.gitattributes +1 -0
README.md +126 -0
config.json +177 -0
model.safetensors +3 -0
modeling_open_provence_standalone.py +0 -0
sentencepiece.bpe.model +3 -0
special_tokens_map.json +51 -0
tokenizer.json +3 -0
tokenizer_config.json +56 -0
training_args.json +206 -0

.gitattributes CHANGED

@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+tokenizer.json filter=lfs diff=lfs merge=lfs -text

README.md ADDED

	@@ -0,0 +1,126 @@

+# Semantic Highlight Bilingual Model (Preview)
+## What is Semantic Highlight?
+Traditional search highlighting works by matching keywords. When you search for "iPhone performance" on an e-commerce site, only the words "iPhone" and "performance" get highlighted in the results. But what if the product description says "Powered by A15 Bionic chip, scores over 1 million in benchmarks, smooth performance with no lag"? This clearly answers the performance question, yet nothing gets highlighted because it doesn't contain the exact word "performance".
+**Semantic Highlight** solves this problem by understanding meaning, not just matching words. It highlights text segments that are semantically relevant to your query, even if they don't contain the exact keywords. This is crucial in RAG (Retrieval-Augmented Generation) scenarios where users need to quickly identify relevant information in long retrieved documents.
+### Why a Lightweight Model?
+Highlighting runs on every search query, so it needs to be fast and cost-effective. Large language models are too slow and expensive for this real-time task. This model is designed to be:
+- **Small**: ~0.6B parameters (the `model.safetensors` checkpoint is about 2.3 GB), deployable on standard servers
+- **Fast**: Millisecond-level inference
+- **Accurate**: Trained on context-relevance datasets
+## Model Details
+- **Base Model**: BAAI/bge-reranker-v2-m3
+- **Languages**: Chinese and English
+- **Task**: Context relevance prediction for semantic highlighting
+- **Status**: ⚠️ **Preview Version** - This is an experimental release
+## Quick Start
+### Installation
+```bash
+pip install transformers torch
+```
+### Usage
+#### English Example
+```python
+from transformers import AutoModel
+model = AutoModel.from_pretrained(
+    "Zilliz/semantic-highlight-bilingual-pre",
+    trust_remote_code=True
+)
+question = "How to improve Python code performance?"
+context = """
+Python optimization techniques include using numpy for vectorized operations,
+avoiding object creation in loops, and utilizing built-in functions.
+List comprehensions are faster than traditional loops.
+Profiling tools like cProfile help identify bottlenecks.
+"""
+result = model.process(
+    question=question,
+    context=context,
+    threshold=0.5,
+    language="en",
+)
+print("Relevant sentences:")
+print(result["pruned_context"])
+```
+#### Chinese Example
+```python
+from transformers import AutoModel
+model = AutoModel.from_pretrained(
+    "Zilliz/semantic-highlight-bilingual-pre",
+    trust_remote_code=True
+)
+question = "北京有哪些著名景点？"
+context = """
+故宫是明清两代的皇家宫殿，占地面积约72万平方米。
+长城是中国古代的军事防御工程，东起山海关，西至嘉峪关。
+颐和园是清朝时期的皇家园林，以昆明湖和万寿山为主体。
+天安门广场是世界上最大的城市广场之一。
+"""
+result = model.process(
+    question=question,
+    context=context,
+    threshold=0.5,
+    language="zh",
+)
+print("相关句子:")
+print(result["pruned_context"])
+```
+## Parameters
+- `question`: Query text
+- `context`: Document text to highlight
+- `threshold`: Relevance threshold (0-1), default 0.5. Lower values include more sentences.
+- `language`: Language code ("en", "zh", or "auto")
+- `return_sentence_metrics`: If `True`, also return per-sentence relevance scores (see `sentence_probabilities` under Output)
+## Output
+- `pruned_context`: Highlighted text (relevant sentences only)
+- `compression_rate`: Percentage of text removed
+- `sentence_probabilities`: Relevance score for each sentence (if `return_sentence_metrics=True`)
+## Notes
+⚠️ **This is a preview version.** The model is still under development and improvements are ongoing.
+## License
+Same as base model: MIT License
+## Citation
+If you use this model, please cite the base model:
+```bibtex
+@misc{bge-reranker-v2-m3,
+  title={BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation},
+  author={Jianlv Chen and Shitao Xiao and Peitian Zhang and Kun Luo and Defu Lian and Zheng Liu},
+  year={2024},
+  eprint={2402.03216},
+  archivePrefix={arXiv},
+  primaryClass={cs.CL}
+}
+```
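
A note on the `threshold` parameter documented in the README above: the sketch below illustrates why lower values include more sentences. It does not call the model. The `highlight` helper, the sample scores, and the `>=` comparison are all assumptions made for illustration; the actual comparison and the exact shape of `sentence_probabilities` are defined in `modeling_open_provence_standalone.py`, which this diff does not render.

```python
from html import escape


def highlight(sentences, scores, threshold=0.5):
    """Wrap sentences scoring at or above the threshold in <mark> tags.

    Hypothetical helper: it mimics threshold-based selection on a list of
    per-sentence scores, not the model's own implementation.
    """
    if len(sentences) != len(scores):
        raise ValueError("sentences and scores must have the same length")
    parts = []
    for sentence, score in zip(sentences, scores):
        text = escape(sentence)  # keep the output safe to embed in HTML
        parts.append(f"<mark>{text}</mark>" if score >= threshold else text)
    return " ".join(parts)


# Made-up scores for illustration; real values would come from the model.
sentences = [
    "List comprehensions are faster than traditional loops.",
    "The weather was pleasant that day.",
    "Profiling tools like cProfile help identify bottlenecks.",
]
scores = [0.91, 0.03, 0.42]

print(highlight(sentences, scores, threshold=0.5))  # one sentence marked
print(highlight(sentences, scores, threshold=0.3))  # two sentences marked
```

With these sample scores, lowering the threshold from 0.5 to 0.3 pulls the third sentence into the highlighted set, which matches the README's description of the parameter.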

config.json ADDED

	@@ -0,0 +1,177 @@

+{
+  "architectures": [
+    "OpenProvenceForSequenceClassification"
+  ],
+  "auto_map": {
+    "AutoConfig": "modeling_open_provence_standalone.OpenProvenceConfig",
+    "AutoModel": "modeling_open_provence_standalone.OpenProvenceForSequenceClassification",
+    "AutoModelForSequenceClassification": "modeling_open_provence_standalone.OpenProvenceForSequenceClassification",
+    "AutoModelForTokenClassification": "modeling_open_provence_standalone.OpenProvenceForTokenClassification"
+  },
+  "base_model_config": {
+    "_name_or_path": "BAAI/bge-reranker-v2-m3",
+    "add_cross_attention": false,
+    "architectures": [
+      "XLMRobertaForSequenceClassification"
+    ],
+    "attention_probs_dropout_prob": 0.1,
+    "bad_words_ids": null,
+    "begin_suppress_tokens": null,
+    "bos_token_id": 0,
+    "chunk_size_feed_forward": 0,
+    "classifier_dropout": null,
+    "cross_attention_hidden_size": null,
+    "decoder_start_token_id": null,
+    "diversity_penalty": 0.0,
+    "do_sample": false,
+    "dtype": "float32",
+    "early_stopping": false,
+    "encoder_no_repeat_ngram_size": 0,
+    "eos_token_id": 2,
+    "exponential_decay_length_penalty": null,
+    "finetuning_task": null,
+    "forced_bos_token_id": null,
+    "forced_eos_token_id": null,
+    "hidden_act": "gelu",
+    "hidden_dropout_prob": 0.1,
+    "hidden_size": 1024,
+    "id2label": {
+      "0": "LABEL_0"
+    },
+    "initializer_range": 0.02,
+    "intermediate_size": 4096,
+    "is_decoder": false,
+    "is_encoder_decoder": false,
+    "label2id": {
+      "LABEL_0": 0
+    },
+    "layer_norm_eps": 1e-05,
+    "length_penalty": 1.0,
+    "max_length": 20,
+    "max_position_embeddings": 8194,
+    "min_length": 0,
+    "model_type": "xlm-roberta",
+    "no_repeat_ngram_size": 0,
+    "num_attention_heads": 16,
+    "num_beam_groups": 1,
+    "num_beams": 1,
+    "num_hidden_layers": 24,
+    "num_return_sequences": 1,
+    "output_attentions": false,
+    "output_hidden_states": false,
+    "output_past": true,
+    "output_scores": false,
+    "pad_token_id": 1,
+    "position_embedding_type": "absolute",
+    "prefix": null,
+    "problem_type": null,
+    "pruned_heads": {},
+    "remove_invalid_values": false,
+    "repetition_penalty": 1.0,
+    "return_dict": true,
+    "return_dict_in_generate": false,
+    "sep_token_id": null,
+    "suppress_tokens": null,
+    "task_specific_params": null,
+    "temperature": 1.0,
+    "tf_legacy_loss": false,
+    "tie_encoder_decoder": false,
+    "tie_word_embeddings": true,
+    "tokenizer_class": null,
+    "top_k": 50,
+    "top_p": 1.0,
+    "torchscript": false,
+    "transformers_version": "4.57.1",
+    "type_vocab_size": 1,
+    "typical_p": 1.0,
+    "use_bfloat16": false,
+    "use_cache": true,
+    "vocab_size": 250002
+  },
+  "base_model_name_or_path": "BAAI/bge-reranker-v2-m3",
+  "default_threadshold": null,
+  "default_threshold": null,
+  "encoder_architecture": "xlm-roberta",
+  "hidden_size": 1024,
+  "id2label": {
+    "0": "LABEL_0"
+  },
+  "label2id": {
+    "LABEL_0": 0
+  },
+  "max_length": 512,
+  "mode": "reranking_pruning",
+  "model_type": "open_provence",
+  "num_pruning_labels": 2,
+  "pruning_config": {
+    "_name_or_path": "",
+    "add_cross_attention": false,
+    "architectures": null,
+    "bad_words_ids": null,
+    "begin_suppress_tokens": null,
+    "bos_token_id": null,
+    "chunk_size_feed_forward": 0,
+    "classifier_dropout": 0.1,
+    "cross_attention_hidden_size": null,
+    "decoder_start_token_id": null,
+    "diversity_penalty": 0.0,
+    "do_sample": false,
+    "dtype": null,
+    "early_stopping": false,
+    "encoder_no_repeat_ngram_size": 0,
+    "eos_token_id": null,
+    "exponential_decay_length_penalty": null,
+    "finetuning_task": null,
+    "forced_bos_token_id": null,
+    "forced_eos_token_id": null,
+    "hidden_size": 1024,
+    "id2label": {
+      "0": "LABEL_0",
+      "1": "LABEL_1"
+    },
+    "is_decoder": false,
+    "is_encoder_decoder": false,
+    "label2id": {
+      "LABEL_0": 0,
+      "LABEL_1": 1
+    },
+    "length_penalty": 1.0,
+    "max_length": 20,
+    "min_length": 0,
+    "model_type": "open_provence_head",
+    "no_repeat_ngram_size": 0,
+    "num_beam_groups": 1,
+    "num_beams": 1,
+    "num_return_sequences": 1,
+    "output_attentions": false,
+    "output_hidden_states": false,
+    "output_scores": false,
+    "pad_token_id": null,
+    "prefix": null,
+    "problem_type": null,
+    "pruned_heads": {},
+    "remove_invalid_values": false,
+    "repetition_penalty": 1.0,
+    "return_dict": true,
+    "return_dict_in_generate": false,
+    "sentence_pooling": "mean",
+    "sep_token_id": null,
+    "suppress_tokens": null,
+    "task_specific_params": null,
+    "temperature": 1.0,
+    "tf_legacy_loss": false,
+    "tie_encoder_decoder": false,
+    "tie_word_embeddings": true,
+    "tokenizer_class": null,
+    "top_k": 50,
+    "top_p": 1.0,
+    "torchscript": false,
+    "transformers_version": "4.57.1",
+    "typical_p": 1.0,
+    "use_bfloat16": false,
+    "use_weighted_pooling": false
+  },
+  "tokenizer_name_or_path": null,
+  "transformers_version": "4.57.1",
+  "vocab_size": 250002
+}

model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:17222e5588c9913e14292bb6df4a35e863ccdff1346ca94599cf26edafe17a54
+size 2271085700
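
The three lines above are a Git LFS pointer, not the weights themselves. As a rough sanity check on checkpoint size, the sketch below parses the pointer and estimates a parameter count. The `parse_lfs_pointer` helper is hypothetical, and the estimate assumes every parameter is stored as 4-byte float32 and ignores the safetensors header, so treat the result as approximate.

```python
def parse_lfs_pointer(text):
    """Split a Git LFS pointer file into a {key: value} dict."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    return fields


pointer = """\
version https://git-lfs.github.com/spec/v1
oid sha256:17222e5588c9913e14292bb6df4a35e863ccdff1346ca94599cf26edafe17a54
size 2271085700
"""

fields = parse_lfs_pointer(pointer)
size_bytes = int(fields["size"])
size_gb = size_bytes / 1e9

# Assumption: float32 storage (4 bytes per parameter), header size ignored.
approx_params_millions = size_bytes / 4 / 1e6

print(f"{size_gb:.2f} GB, roughly {approx_params_millions:.0f}M float32 parameters")
# prints "2.27 GB, roughly 568M float32 parameters"
```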

modeling_open_provence_standalone.py ADDED

The diff for this file is too large to render. See raw diff

sentencepiece.bpe.model ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:cfc8146abe2a0488e9e2a0c56de7952f7c11ab059eca145a0a727afce0db2865
+size 5069051

special_tokens_map.json ADDED

	@@ -0,0 +1,51 @@

+{
+  "bos_token": {
+    "content": "<s>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "cls_token": {
+    "content": "<s>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "eos_token": {
+    "content": "</s>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "mask_token": {
+    "content": "<mask>",
+    "lstrip": true,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "pad_token": {
+    "content": "<pad>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "sep_token": {
+    "content": "</s>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "unk_token": {
+    "content": "<unk>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  }
+}

tokenizer.json ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:8bf8afbfd11306bd872018c53bfdf2e160a56f8edbcf49933324404791c148d3
+size 17082900

tokenizer_config.json ADDED

	@@ -0,0 +1,56 @@

+{
+  "added_tokens_decoder": {
+    "0": {
+      "content": "<s>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "1": {
+      "content": "<pad>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "2": {
+      "content": "</s>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "3": {
+      "content": "<unk>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "250001": {
+      "content": "<mask>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    }
+  },
+  "bos_token": "<s>",
+  "clean_up_tokenization_spaces": true,
+  "cls_token": "<s>",
+  "eos_token": "</s>",
+  "extra_special_tokens": {},
+  "mask_token": "<mask>",
+  "model_max_length": 8192,
+  "pad_token": "<pad>",
+  "sep_token": "</s>",
+  "sp_model_kwargs": {},
+  "tokenizer_class": "XLMRobertaTokenizer",
+  "unk_token": "<unk>"
+}

training_args.json ADDED

	@@ -0,0 +1,206 @@

+{
+  "model_args": {
+    "model_name_or_path": "BAAI/bge-reranker-v2-m3",
+    "num_labels": null,
+    "classifier_dropout": 0.1,
+    "max_length": 512,
+    "config_name": null,
+    "tokenizer_name": null,
+    "cache_dir": null
+  },
+  "data_args": {
+    "dataset_name": "hotchpotch/wip-msmarco-context-relevance",
+    "subset": "msmarco-ja-minimal",
+    "teacher_column": null,
+    "datasets": [
+      {
+        "dataset_name": "hotchpotch/msmarco-context-relevance",
+        "subset": "freq2",
+        "teacher_column": "teacher_scores.gte-reranker-modernbert-base"
+      },
+      {
+        "dataset_name": "hotchpotch/natural-questions-context-relevance",
+        "subset": "nodup_freq2",
+        "teacher_column": "teacher_scores.gte-reranker-modernbert-base",
+        "items": 6
+      },
+      {
+        "dataset_name": "hotchpotch/gooaq-context-relevance-130k",
+        "subset": "default",
+        "teacher_column": "teacher_scores.gte-reranker-modernbert-base",
+        "items": 6
+      },
+      {
+        "dataset_name": "zc277584121/dureader-context-relevance-with-think",
+        "subset": "default"
+      },
+      {
+        "dataset_name": "zc277584121/chinese_wiki_0_300k-context-relevance-with-think",
+        "subset": "default"
+      }
+    ],
+    "items": null,
+    "max_train_samples": null,
+    "max_eval_samples": null,
+    "validation_split": null,
+    "validation_split_samples": null,
+    "validation_split_name": "validation",
+    "preprocessing_num_workers": null,
+    "filter_zero_relevance_max_items": null,
+    "filter_zero_relevance_max_items_reverse": false,
+    "filter_keep_first_item": false,
+    "upsample_factor": null
+  },
+  "training_args": {
+    "output_dir": "./output/bilingual-chinese-english-m3-v10_20251213_152447",
+    "overwrite_output_dir": true,
+    "do_train": true,
+    "do_eval": true,
+    "do_predict": false,
+    "eval_strategy": "steps",
+    "prediction_loss_only": false,
+    "per_device_train_batch_size": 2,
+    "per_device_eval_batch_size": 8,
+    "per_gpu_train_batch_size": null,
+    "per_gpu_eval_batch_size": null,
+    "gradient_accumulation_steps": 32,
+    "eval_accumulation_steps": null,
+    "eval_delay": 0,
+    "torch_empty_cache_steps": null,
+    "learning_rate": 5e-05,
+    "weight_decay": 0.01,
+    "adam_beta1": 0.9,
+    "adam_beta2": 0.999,
+    "adam_epsilon": 1e-08,
+    "max_grad_norm": 1.0,
+    "num_train_epochs": 3,
+    "max_steps": -1,
+    "lr_scheduler_type": "cosine",
+    "lr_scheduler_kwargs": {},
+    "warmup_ratio": 0.1,
+    "warmup_steps": 0,
+    "log_level": "passive",
+    "log_level_replica": "warning",
+    "log_on_each_node": true,
+    "logging_dir": "trainer_output/runs/Dec13_15-24-45_nvidiadgx",
+    "logging_strategy": "steps",
+    "logging_first_step": false,
+    "logging_steps": 363,
+    "logging_nan_inf_filter": true,
+    "save_strategy": "steps",
+    "save_steps": 500,
+    "save_total_limit": 3,
+    "save_safetensors": true,
+    "save_on_each_node": false,
+    "save_only_model": false,
+    "restore_callback_states_from_checkpoint": false,
+    "no_cuda": false,
+    "use_cpu": false,
+    "use_mps_device": false,
+    "seed": 42,
+    "data_seed": null,
+    "jit_mode_eval": false,
+    "bf16": true,
+    "fp16": false,
+    "fp16_opt_level": "O1",
+    "half_precision_backend": "auto",
+    "bf16_full_eval": false,
+    "fp16_full_eval": false,
+    "tf32": null,
+    "local_rank": 7,
+    "ddp_backend": null,
+    "tpu_num_cores": null,
+    "tpu_metrics_debug": false,
+    "debug": [],
+    "dataloader_drop_last": false,
+    "eval_steps": 1815,
+    "dataloader_num_workers": 4,
+    "dataloader_prefetch_factor": null,
+    "past_index": -1,
+    "run_name": "bilingual-chinese-english-m3-v10-20251213_152447",
+    "disable_tqdm": false,
+    "remove_unused_columns": false,
+    "label_names": null,
+    "load_best_model_at_end": true,
+    "metric_for_best_model": "eval_loss",
+    "greater_is_better": false,
+    "ignore_data_skip": false,
+    "fsdp": [],
+    "fsdp_min_num_params": 0,
+    "fsdp_config": {
+      "min_num_params": 0,
+      "xla": false,
+      "xla_fsdp_v2": false,
+      "xla_fsdp_grad_ckpt": false
+    },
+    "fsdp_transformer_layer_cls_to_wrap": null,
+    "accelerator_config": "AcceleratorConfig(split_batches=False, dispatch_batches=None, even_batches=True, use_seedable_sampler=True, non_blocking=False, gradient_accumulation_kwargs=None, use_configured_state=False)",
+    "parallelism_config": null,
+    "deepspeed": null,
+    "label_smoothing_factor": 0.0,
+    "optim": "adafactor",
+    "optim_args": null,
+    "adafactor": false,
+    "group_by_length": false,
+    "length_column_name": "length",
+    "report_to": [
+      "wandb"
+    ],
+    "project": "huggingface",
+    "trackio_space_id": "trackio",
+    "ddp_find_unused_parameters": null,
+    "ddp_bucket_cap_mb": null,
+    "ddp_broadcast_buffers": null,
+    "dataloader_pin_memory": true,
+    "dataloader_persistent_workers": false,
+    "skip_memory_metrics": true,
+    "use_legacy_prediction_loop": false,
+    "push_to_hub": false,
+    "resume_from_checkpoint": null,
+    "hub_model_id": null,
+    "hub_strategy": "every_save",
+    "hub_token": null,
+    "hub_private_repo": null,
+    "hub_always_push": false,
+    "hub_revision": null,
+    "gradient_checkpointing": false,
+    "gradient_checkpointing_kwargs": null,
+    "include_inputs_for_metrics": false,
+    "include_for_metrics": [],
+    "eval_do_concat_batches": true,
+    "fp16_backend": "auto",
+    "push_to_hub_model_id": null,
+    "push_to_hub_organization": null,
+    "push_to_hub_token": null,
+    "mp_parameters": "",
+    "auto_find_batch_size": false,
+    "full_determinism": false,
+    "torchdynamo": null,
+    "ray_scope": "last",
+    "ddp_timeout": 1800,
+    "torch_compile": false,
+    "torch_compile_backend": null,
+    "torch_compile_mode": null,
+    "include_tokens_per_second": false,
+    "include_num_input_tokens_seen": "no",
+    "neftune_noise_alpha": null,
+    "optim_target_modules": null,
+    "batch_eval_metrics": false,
+    "eval_on_start": false,
+    "use_liger_kernel": false,
+    "liger_kernel_config": null,
+    "eval_use_gather_object": false,
+    "average_tokens_across_devices": true,
+    "ranking_weight": 0.0,
+    "pruning_weight": 1.0,
+    "use_teacher_scores": true,
+    "sentence_level_pruning": true,
+    "eval_datasets": {
+      "config": "configs/eval_datasets/bilingual_nano.yaml",
+      "threshold": 0.1,
+      "batch_size": 32
+    },
+    "distributed_state": "Distributed environment: DistributedType.MULTI_GPU  Backend: nccl\nNum processes: 8\nProcess index: 7\nLocal process index: 7\nDevice: cuda:7\n",
+    "deepspeed_plugin": null
+  }
+}
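
The batch-related values in `training_args.json` are easier to read once combined. The sketch below works out the effective global batch size from `per_device_train_batch_size`, `gradient_accumulation_steps`, and the "Num processes: 8" entry in `distributed_state`. It assumes plain data parallelism, with each process consuming its own per-device batch, which is the usual multi-GPU `Trainer` setup but is not stated explicitly in the file.

```python
# Values copied from training_args.json above.
per_device_train_batch_size = 2
gradient_accumulation_steps = 32
num_processes = 8  # from "Num processes: 8" in distributed_state
eval_steps = 1815
logging_steps = 363

# Samples contributing to each optimizer step, assuming data parallelism.
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_processes
)

# Training samples seen between two consecutive evaluations.
samples_between_evals = eval_steps * effective_batch_size

print(effective_batch_size)         # 512
print(samples_between_evals)        # 929280
print(eval_steps // logging_steps)  # 5 logging intervals per evaluation
```

Under that assumption, each optimizer step sees 512 samples even though each GPU holds only 2 at a time.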