Initial upload of FreeChunker model with custom code
- README.md +94 -0
- aggregator.py +205 -0
- config.json +32 -0
- configuration_freechunker.py +157 -0
- encoder.py +257 -0
- final_loss_curve.png +0 -0
- model.safetensors +3 -0
- modeling_freechunker.py +768 -0
- sentenizer.py +276 -0
- training_losses.json +0 -0
- utils.py +235 -0
README.md
ADDED
@@ -0,0 +1,94 @@
# FreeChunker-Jina

FreeChunker is a training-free embedding optimization method that dynamically chunks text to improve retrieval performance. This repository contains the **FreeChunker** model initialized with **jinaai/jina-embeddings-v2-small-en** embeddings.

## Features

- **Dynamic Chunking**: Automatically groups sentences into semantically coherent chunks.
- **Optimized for RAG**: Improves retrieval-augmented generation by providing better context segments.
- **Backbone**: Built on top of `jinaai/jina-embeddings-v2-small-en` sentence embeddings.

## Requirements

```bash
pip install torch transformers sentence-transformers numpy
```

## Usage

The `UnifiedEncoder` class in `encoder.py` provides a high-level interface for encoding and retrieval.

### Using UnifiedEncoder

```python
from encoder import UnifiedEncoder

# Initialize the encoder
# local_model_path="." assumes you are in the directory containing model.safetensors
encoder = UnifiedEncoder(model_name="jina", local_model_path=".")

# Input text
text = """
Your long text goes here. FreeChunker will split this text into sentences,
generate embeddings using Jina, and then group them into semantic chunks.
It handles long documents effectively.
"""

# Build vector store (chunks and encodes the text)
encoder.build_vector_store(text)

# Query
query = "How does FreeChunker work?"
results = encoder.query(query, top_k=3, aggregation_mode='post')

print("Results:", results)
```

### Manual Pipeline

If you prefer to use the components separately:

1. **Split and Encode**: Use `Sentenceizer` (wrapping `jinaai/jina-embeddings-v2-small-en`) to get sentence embeddings.
2. **FreeChunker**: Pass the embeddings to `FreeChunkerModel`.
3. **Process**: Use the output `shift_matrix` to group sentences, as shown below.

```python
from sentenizer import Sentenceizer
from modeling_freechunker import FreeChunkerModel
import torch

# 1. Setup Sentenceizer with the backbone
sentenceizer = Sentenceizer(model_name="jinaai/jina-embeddings-v2-small-en")

# 2. Load the FreeChunker model
model = FreeChunkerModel.from_pretrained(".", trust_remote_code=True)
model.eval()

# 3. Process text
text = "Your text..."
sentences, embeddings = sentenceizer.split_and_encode(text)

# 4. Forward pass through FreeChunker
inputs_embeds = torch.tensor(embeddings).unsqueeze(0)  # Batch size 1
with torch.no_grad():
    outputs = model(inputs_embeds=inputs_embeds)

# outputs['embedding'] contains refined embeddings
# outputs['shift_matrix'] contains chunking information
```

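`shift_matrix` is a chunk-by-sentence mask that `UnifiedEncoder` uses internally to group sentences (see `_group_sentences_by_shift_matrix` in `encoder.py`). As a rough sketch of doing that grouping by hand, continuing from the snippet above and assuming `shift_matrix` has shape `[num_chunks, num_sentences]` with 1s marking chunk membership:

```python
# Group sentences into chunks using the rows of shift_matrix as masks.
chunks = []
for chunk_mask in outputs['shift_matrix']:
    indices = (chunk_mask == 1).nonzero(as_tuple=True)[0].tolist()
    chunk = " ".join(sentences[i] for i in indices if i < len(sentences))
    if chunk:
        chunks.append(chunk)

for i, chunk in enumerate(chunks):
    print(f"Chunk {i}: {chunk}")
```
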
## Files

- `model.safetensors`: The FreeChunker model weights.
- `encoder.py`: High-level interface (`UnifiedEncoder`) for end-to-end usage.
- `sentenizer.py`: Helper for text splitting and backbone embedding.
- `aggregator.py`: Helper for aggregating retrieved results.
- `configuration_freechunker.py` & `modeling_freechunker.py`: Model definition.

## Citation

If you use this model in your research, please cite:

```
Zhang W, Jiang Y H, Wu Y. FreeChunker: A Cross-Granularity Chunking Framework[J]. arXiv preprint arXiv:2510.20356, 2025.
```
aggregator.py
ADDED
@@ -0,0 +1,205 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Text Aggregator - Precise text segment aggregation based on sentence position markers

Main functions:
1. Detect overlaps between text segments based on 【Begin-x】【End-x】 markers
2. Automatically merge and reconstruct overlapping segments in their original order
3. Retain the highest scoring segments
"""

import re
from typing import List, Tuple


class TextAggregator:
    """
    Text aggregator for merging retrieved text segments.
    Implements splitting, deduplication, sorting, and reconstruction of text segments based on 【Begin-x】【End-x】 markers.
    """

    def __init__(self):
        """Initialize the text aggregator."""
        pass

    def _extract_segments_from_text(self, text: str) -> List[Tuple[int, str]]:
        """
        Extract all 【Begin-x】...【End-x】 segments from text.

        Args:
            text: Text containing position markers

        Returns:
            List[Tuple[int, str]]: List of (begin_index, segment_text)
        """
        segments = []
        # Match the 【Begin-x】...【End-x】 pattern
        pattern = r'【Begin-(\d+)】(.*?)【End-\1】'
        matches = re.findall(pattern, text, re.DOTALL)

        for match in matches:
            begin_idx = int(match[0])
            segment_content = match[1]
            full_segment = f"【Begin-{begin_idx}】{segment_content}【End-{begin_idx}】"
            segments.append((begin_idx, full_segment))

        return segments

    def _remove_boundary_markers(self, text: str) -> str:
        """
        Remove all boundary markers from text, keeping only content.

        Args:
            text: Text containing boundary markers

        Returns:
            str: Text with boundary markers removed
        """
        # Remove 【Begin-x】 and 【End-x】 markers
        clean_text = re.sub(r'【Begin-\d+】|【End-\d+】', '', text)
        return clean_text.strip()

    def aggregate_segments(self, segments: List[str]) -> str:
        """
        Aggregate text segments: split, deduplicate, sort, reconstruct.

        Args:
            segments: List of text segments

        Returns:
            str: Aggregated text string
        """
        if not segments:
            return ""

        # Step 1: Extract segments from all input texts
        all_segments = {}  # {begin_index: segment_text}

        for text in segments:
            extracted = self._extract_segments_from_text(text)
            for begin_idx, segment in extracted:
                # Deduplication: keep only one segment per begin_index
                if begin_idx not in all_segments:
                    all_segments[begin_idx] = segment

        # Step 2: Sort by begin_index
        sorted_segments = sorted(all_segments.items())

        # Step 3: Reconstruct text
        if not sorted_segments:
            return ""

        # Build continuous text
        result_text = ""
        prev_end = -1

        for begin_idx, segment in sorted_segments:
            # If not contiguous, add an ellipsis
            if prev_end != -1 and begin_idx != prev_end + 1:
                result_text += "..."

            # Add the content of the current segment (boundary markers removed)
            content = self._remove_boundary_markers(segment)
            result_text += content

            prev_end = begin_idx

        return result_text

    def aggregate_segments_complete(self, segments: List[str]) -> str:
        """
        Completely aggregate all text segments.

        Args:
            segments: List of text segments

        Returns:
            str: Aggregated text string
        """
        return self.aggregate_segments(segments)


def demo():
    """Demo function - show text splitting, deduplication, sorting, and reconstruction based on position markers."""
    print("=== Text Aggregator Demo ===\n")

    # Create aggregator
    aggregator = TextAggregator()

    # Test data in the expected marker format
    test_segments = [
        "【Begin-1】sdfsdf【End-1】【Begin-2】sdfsdf【End-2】",
        "【Begin-2】sdfsdf【End-2】【Begin-3】sdfsdf【End-3】",
        "【Begin-5】sdfsdf【End-5】【Begin-6】sdfsdf【End-6】"
    ]

    print("Original input segments:")
    for i, text in enumerate(test_segments, 1):
        print(f"{i}. {text}")

    print("\n=== Step 1: Extract segments from each text ===")
    all_extracted = {}
    for i, text in enumerate(test_segments, 1):
        extracted = aggregator._extract_segments_from_text(text)
        print(f"Segments extracted from text {i}: {extracted}")
        for begin_idx, segment in extracted:
            if begin_idx not in all_extracted:
                all_extracted[begin_idx] = segment
                print(f"  Add segment: Begin-{begin_idx}")
            else:
                print(f"  Skip duplicate segment: Begin-{begin_idx}")

    print(f"\nAll segments after deduplication: {list(all_extracted.keys())}")

    print("\n=== Step 2: Sort by Begin marker ===")
    sorted_segments = sorted(all_extracted.items())
    print("Sorted segments:")
    for begin_idx, segment in sorted_segments:
        print(f"  Begin-{begin_idx}: {segment}")

    print("\n=== Step 3: Reconstruct text (remove boundary markers, add ellipsis) ===")
    result = aggregator.aggregate_segments(test_segments)
    print(f"Final result: {result}")

    print("\n=== Full Test Cases ===")

    # More complex test cases
    complex_segments = [
        "【Begin-1】First sentence【End-1】【Begin-2】Second sentence【End-2】【Begin-3】Third sentence【End-3】",
        "【Begin-2】Second sentence【End-2】【Begin-3】Third sentence【End-3】【Begin-4】Fourth sentence【End-4】",
        "【Begin-6】Sixth sentence【End-6】【Begin-7】Seventh sentence【End-7】",
        "【Begin-4】Fourth sentence【End-4】【Begin-5】Fifth sentence【End-5】"
    ]

    print("\nComplex test input:")
    for i, text in enumerate(complex_segments, 1):
        print(f"{i}. {text}")

    complex_result = aggregator.aggregate_segments(complex_segments)
    print(f"\nComplex test result: {complex_result}")

    print("\n=== Boundary Case Tests ===")

    # Test empty input
    empty_result = aggregator.aggregate_segments([])
    print(f"Empty input result: {empty_result}")

    # Test a single segment
    single_result = aggregator.aggregate_segments(["【Begin-1】Single segment【End-1】"])
    print(f"Single segment result: {single_result}")

    # Test text without markers (should return an empty string)
    no_marker_result = aggregator.aggregate_segments(["Normal text without markers"])
    print(f"Text without markers result: {no_marker_result}")

    print("\n=== Demo Completed ===")


if __name__ == "__main__":
    demo()
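
For a quick sense of the aggregation behaviour, here is a minimal usage sketch (run from the repository root; the marker strings are hypothetical examples in the 【Begin-x】...【End-x】 format the class expects):

```python
from aggregator import TextAggregator

aggregator = TextAggregator()

# Two overlapping retrieved chunks plus a non-contiguous one; the duplicate
# Begin-2 segment is deduplicated and "..." marks the gap before Begin-5.
retrieved = [
    "【Begin-1】Alpha【End-1】【Begin-2】Beta【End-2】",
    "【Begin-2】Beta【End-2】【Begin-3】Gamma【End-3】",
    "【Begin-5】Epsilon【End-5】",
]
print(aggregator.aggregate_segments(retrieved))
# -> AlphaBetaGamma...Epsilon
```
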
config.json
ADDED
@@ -0,0 +1,32 @@
{
  "architectures": [
    "FreeChunkerModel"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "dtype": "float32",
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 1024,
  "initializer_range": 0.02,
  "intermediate_size": 4096,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 8194,
  "model_type": "xlm-roberta",
  "num_attention_heads": 16,
  "num_hidden_layers": 24,
  "output_past": true,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "transformers_version": "4.56.1",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 2,
  "max_power": 4,
  "auto_map": {
    "AutoConfig": "configuration_freechunker.FreeChunkerConfig",
    "AutoModel": "modeling_freechunker.FreeChunkerModel"
  }
}
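
The `auto_map` block above wires this checkpoint into the standard `transformers` Auto classes. A minimal loading sketch (assuming the repository files are in the current directory and that enabling `trust_remote_code` is acceptable; the dummy input is hypothetical and only illustrates the call shape):

```python
import torch
from transformers import AutoConfig, AutoModel

# auto_map routes these calls to the custom classes shipped with the repo:
# configuration_freechunker.FreeChunkerConfig and modeling_freechunker.FreeChunkerModel.
config = AutoConfig.from_pretrained(".", trust_remote_code=True)
model = AutoModel.from_pretrained(".", trust_remote_code=True)
model.eval()

# The model consumes precomputed sentence embeddings via inputs_embeds
# (hidden_size is 1024 in this config).
dummy_embeds = torch.randn(1, 8, config.hidden_size)
with torch.no_grad():
    outputs = model(inputs_embeds=dummy_embeds)
```
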
configuration_freechunker.py
ADDED
@@ -0,0 +1,157 @@
# coding=utf-8
# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.
# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""FreeChunker configuration: Modified from XLM-RoBERTa configuration"""

from collections import OrderedDict
from typing import Mapping

from transformers.configuration_utils import PretrainedConfig
from transformers.onnx import OnnxConfig
from transformers.utils import logging


logger = logging.get_logger(__name__)


class FreeChunkerConfig(PretrainedConfig):
    r"""
    This is the configuration class to store the configuration of a [`FreeChunkerModel`] or a [`TFFreeChunkerModel`]. It
    is used to instantiate a XLM-RoBERTa model according to the specified arguments, defining the model architecture.
    Instantiating a configuration with the defaults will yield a configuration similar to that of the
    [FacebookAI/xlm-roberta-base](https://huggingface.co/FacebookAI/xlm-roberta-base) architecture.

    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
    documentation from [`PretrainedConfig`] for more information.


    Args:
        vocab_size (`int`, *optional*, defaults to 30522):
            Vocabulary size of the XLM-RoBERTa model. Defines the number of different tokens that can be represented by
            the `inputs_ids` passed when calling [`FreeChunkerModel`] or [`TFFreeChunkerModel`].
        hidden_size (`int`, *optional*, defaults to 768):
            Dimensionality of the encoder layers and the pooler layer.
        num_hidden_layers (`int`, *optional*, defaults to 12):
            Number of hidden layers in the Transformer encoder.
        num_attention_heads (`int`, *optional*, defaults to 12):
            Number of attention heads for each attention layer in the Transformer encoder.
        intermediate_size (`int`, *optional*, defaults to 3072):
            Dimensionality of the "intermediate" (often named feed-forward) layer in the Transformer encoder.
        hidden_act (`str` or `Callable`, *optional*, defaults to `"gelu"`):
            The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`,
            `"relu"`, `"silu"` and `"gelu_new"` are supported.
        hidden_dropout_prob (`float`, *optional*, defaults to 0.1):
            The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
        attention_probs_dropout_prob (`float`, *optional*, defaults to 0.1):
            The dropout ratio for the attention probabilities.
        max_position_embeddings (`int`, *optional*, defaults to 512):
            The maximum sequence length that this model might ever be used with. Typically set this to something large
            just in case (e.g., 512 or 1024 or 2048).
        type_vocab_size (`int`, *optional*, defaults to 2):
            The vocabulary size of the `token_type_ids` passed when calling [`FreeChunkerModel`] or
            [`TFFreeChunkerModel`].
        initializer_range (`float`, *optional*, defaults to 0.02):
            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
        layer_norm_eps (`float`, *optional*, defaults to 1e-12):
            The epsilon used by the layer normalization layers.
        position_embedding_type (`str`, *optional*, defaults to `"absolute"`):
            Type of position embedding. Choose one of `"absolute"`, `"relative_key"`, `"relative_key_query"`. For
            positional embeddings use `"absolute"`. For more information on `"relative_key"`, please refer to
            [Self-Attention with Relative Position Representations (Shaw et al.)](https://arxiv.org/abs/1803.02155).
            For more information on `"relative_key_query"`, please refer to *Method 4* in [Improve Transformer Models
            with Better Relative Position Embeddings (Huang et al.)](https://arxiv.org/abs/2009.13658).
        is_decoder (`bool`, *optional*, defaults to `False`):
            Whether the model is used as a decoder or not. If `False`, the model is used as an encoder.
        use_cache (`bool`, *optional*, defaults to `True`):
            Whether or not the model should return the last key/values attentions (not used by all models). Only
            relevant if `config.is_decoder=True`.
        classifier_dropout (`float`, *optional*):
            The dropout ratio for the classification head.

    Examples:

    ```python
    >>> from configuration_freechunker import FreeChunkerConfig
    >>> from modeling_freechunker import FreeChunkerModel

    >>> # Initializing a XLM-RoBERTa FacebookAI/xlm-roberta-base style configuration
    >>> configuration = FreeChunkerConfig()

    >>> # Initializing a model (with random weights) from the FacebookAI/xlm-roberta-base style configuration
    >>> model = FreeChunkerModel(configuration)

    >>> # Accessing the model configuration
    >>> configuration = model.config
    ```"""

    model_type = "xlm-roberta"

    def __init__(
        self,
        vocab_size=30522,
        hidden_size=768,
        num_hidden_layers=12,
        num_attention_heads=12,
        intermediate_size=3072,
        hidden_act="gelu",
        hidden_dropout_prob=0.1,
        attention_probs_dropout_prob=0.1,
        max_position_embeddings=512,
        type_vocab_size=2,
        initializer_range=0.02,
        layer_norm_eps=1e-12,
        pad_token_id=1,
        bos_token_id=0,
        eos_token_id=2,
        position_embedding_type="absolute",
        use_cache=True,
        classifier_dropout=None,
        **kwargs,
    ):
        super().__init__(pad_token_id=pad_token_id, bos_token_id=bos_token_id, eos_token_id=eos_token_id, **kwargs)

        self.vocab_size = vocab_size
        self.hidden_size = hidden_size
        self.num_hidden_layers = num_hidden_layers
        self.num_attention_heads = num_attention_heads
        self.hidden_act = hidden_act
        self.intermediate_size = intermediate_size
        self.hidden_dropout_prob = hidden_dropout_prob
        self.attention_probs_dropout_prob = attention_probs_dropout_prob
        self.max_position_embeddings = max_position_embeddings
        self.type_vocab_size = type_vocab_size
        self.initializer_range = initializer_range
        self.layer_norm_eps = layer_norm_eps
        self.position_embedding_type = position_embedding_type
        self.use_cache = use_cache
        self.classifier_dropout = classifier_dropout


# Copied from transformers.models.roberta.configuration_roberta.RobertaOnnxConfig with Roberta->FreeChunker
class FreeChunkerOnnxConfig(OnnxConfig):
    @property
    def inputs(self) -> Mapping[str, Mapping[int, str]]:
        if self.task == "multiple-choice":
            dynamic_axis = {0: "batch", 1: "choice", 2: "sequence"}
        else:
            dynamic_axis = {0: "batch", 1: "sequence"}
        return OrderedDict(
            [
                ("input_ids", dynamic_axis),
                ("attention_mask", dynamic_axis),
            ]
        )


__all__ = ["FreeChunkerConfig", "FreeChunkerOnnxConfig"]
encoder.py
ADDED
@@ -0,0 +1,257 @@
#!/usr/bin/env python3
"""
UnifiedEncoder - Unified text encoder
Integrates sentence splitting and multiple encoding models into a unified interface
"""

import torch
import numpy as np
import pickle
import os
from typing import List, Tuple, Union

from .sentenizer import Sentenceizer
from .modeling_freechunker import FreeChunkerModel
from .aggregator import TextAggregator


class UnifiedEncoder:
    """
    Unified text encoder, supporting text sentence splitting and encoding for multiple models
    """

    def __init__(self, model_name: str, local_model_path: str = None):
        """
        Initialize the unified text encoder

        Args:
            model_name (str): Model name (e.g. 'bge-m3', 'jina', 'nomic')
            local_model_path (str, optional): Local model path for loading FreeChunker weights.
                If None, tries to load from the current directory or Hugging Face.
        """
        self.model_name = model_name
        # Prefer CUDA, then Apple MPS, then fall back to CPU
        if torch.cuda.is_available():
            self.device = torch.device('cuda')
        elif torch.backends.mps.is_available():
            self.device = torch.device('mps')
        else:
            self.device = torch.device('cpu')

        # Initialize text aggregator
        self.aggregator = TextAggregator()

        print(f"Initializing unified text encoder, model: {model_name}")
        print(f"Using local model path: {local_model_path}")
        print(f"Using device: {self.device}")

        # If local_model_path is not provided, assume the current directory or let from_pretrained handle it
        if local_model_path is None:
            local_model_path = "."

        try:
            self.model = FreeChunkerModel.from_pretrained(local_model_path)
        except Exception as e:
            print(f"Failed to load model from {local_model_path}: {e}")
            print("Trying to load as a fresh model or from the HF hub if applicable...")
            # Fallback or re-raise
            raise e

        self.model.to(self.device)
        self.model.eval()

        # Select model and preprocessor based on the model name
        # Predefined model mapping: name -> (local_path, HF_model_ID)
        # Note: local paths are environment specific, so we primarily rely on HF IDs or passed arguments
        model_configs = {
            'bge-m3': ('/share/home/ecnuzwx/UnifiedRAG/cache/models--BAAI--bge-m3', 'BAAI/bge-m3'),
            'nomic-embed-text-v1.5': ('/share/home/ecnuzwx/UnifiedRAG/cache/models--nomic-ai--nomic-embed-text-v1.5', 'nomic-ai/nomic-embed-text-v1.5'),
            'jina': ('/share/home/ecnuzwx/UnifiedRAG/cache/models--jinaai--jina-embeddings-v2-small-en', 'jinaai/jina-embeddings-v2-small-en')
        }

        if model_name in model_configs:
            local_path, hf_id = model_configs[model_name]
            # Prioritize the local path if it exists, otherwise use the HF ID
            if os.path.exists(local_path):
                target_model = local_path
            else:
                target_model = hf_id

            self.sentenceizer = Sentenceizer(model_name=target_model)
        else:
            # Try using model_name directly as a path or ID
            print(f"Unknown predefined model name: {model_name}, trying to load directly...")
            self.sentenceizer = Sentenceizer(model_name=model_name)

        print("Unified text encoder initialized!")

    def encode(self, text: str, show_progress: bool = True) -> Tuple[List[str], np.ndarray, List[List[str]]]:
        """
        Split text and encode, returning results grouped by shift_matrix

        Args:
            text (str): Input text
            show_progress (bool): Whether to show progress

        Returns:
            Tuple[List[str], np.ndarray, List[List[str]]]: (Original sentence list, encoded vector array, sentences grouped by shift_matrix)
        """
        with torch.no_grad():
            sentences, input_embeddings = self.sentenceizer.split_and_encode(text, show_progress=show_progress)

            if len(sentences) == 0:
                return sentences, np.array([]), []
            if isinstance(input_embeddings, np.ndarray):
                input_embeddings = torch.from_numpy(input_embeddings)
            input_embeddings = input_embeddings.to(self.device)
            inputs_embeds = input_embeddings.unsqueeze(0)
            outputs = self.model(inputs_embeds=inputs_embeds)
            final_embeddings = outputs['embedding']
            shift_matrix = outputs['shift_matrix']

            # Group sentences using shift_matrix
            sentences = [f"【Begin-{num}】" + sentence + f"【End-{num}】" for num, sentence in enumerate(sentences)]
            grouped_sentences = self._group_sentences_by_shift_matrix(sentences, shift_matrix)
            result_embeddings = final_embeddings.cpu().numpy()

        return sentences, result_embeddings, grouped_sentences

    def _group_sentences_by_shift_matrix(self, sentences: List[str], shift_matrix: torch.Tensor) -> List[List[str]]:
        """
        Group sentences according to shift_matrix (optimized version)

        Args:
            sentences (List[str]): Original sentence list
            shift_matrix (torch.Tensor): Mask matrix with shape [num_chunks, seq_len]

        Returns:
            List[List[str]]: Sentences grouped by shift_matrix
        """
        grouped_sentences = []
        num_chunks, seq_len = shift_matrix.shape

        for chunk_idx in range(num_chunks):
            chunk_mask = shift_matrix[chunk_idx]  # [seq_len]

            # Use a vectorized operation to get all indices that are 1
            valid_indices = (chunk_mask == 1).nonzero(as_tuple=True)[0].cpu().numpy()

            # Keep only indices within the sentence list range
            valid_indices = valid_indices[valid_indices < len(sentences)]

            if len(valid_indices) > 0:
                # Get sentences directly by index
                chunk_sentences = [sentences[idx] for idx in valid_indices]
                grouped_sentences.append(chunk_sentences)

        return grouped_sentences

    def build_vector_store(self, text: str, show_progress: bool = True):
        """
        Build a vector store from a long text

        Args:
            text (str): Long text
            show_progress (bool): Whether to show progress
        """
        sentences, embeddings, grouped_sentences = self.encode(text, show_progress)

        # grouped_texts = [" ".join(group) if isinstance(group, list) else str(group) for group in grouped_sentences]

        grouped_texts = sentences + [" ".join(group) if isinstance(group, list) else str(group) for group in grouped_sentences]

        self.vector_store = {
            'sentences': sentences,                  # Keep original sentences for debugging
            'embeddings': embeddings,                # Embeddings correspond to grouped_sentences
            'grouped_sentences': grouped_sentences,  # Original grouping structure
            'grouped_texts': grouped_texts           # Text used for retrieval
        }

        if show_progress:
            print(f"Vector store built: {len(sentences)} original sentences, {len(grouped_sentences)} groups, {len(embeddings)} embedding vectors")
            print(f"Vector store verification: embeddings.shape={embeddings.shape}, grouped_texts count={len(grouped_texts)}\n")

    def query(self, query: str, top_k: int = 5, aggregation_mode: str = 'post', tokenizer=None) -> Union[List[Tuple[str, float]], str]:
        """
        Query the vector store

        Args:
            query (str): Query text
            top_k (int): Return the top k most similar results
            aggregation_mode (str): Aggregation mode
                - 'none': No aggregation, return the top_k results directly as [(text, score), ...]
                - 'post': Post-aggregation mode, return an aggregated text string
            tokenizer: Optional tokenizer passed through to post-aggregation

        Returns:
            Union[List[Tuple[str, float]], str]:
                - If aggregation_mode='none', returns [(sentence, similarity_score), ...]
                - If aggregation_mode='post', returns an aggregated string
        """
        if not hasattr(self, 'vector_store'):
            raise ValueError("Vector store not built, please call the build_vector_store method first")

        # Encode the query text
        query_embeddings = self.sentenceizer.encode([query])
        query_embedding = query_embeddings[0]

        # Similarity via dot product (equals cosine similarity for normalized embeddings)
        similarities = np.dot(self.vector_store['embeddings'], query_embedding)

        # Sort (descending)
        sorted_indices = np.argsort(similarities)[::-1]

        if aggregation_mode == 'none':
            return self._get_direct_results(sorted_indices, similarities, top_k)
        elif aggregation_mode == 'post':
            return self._post_aggregation(sorted_indices, similarities, top_k, tokenizer=tokenizer)
        else:
            print(f"Warning: Unknown aggregation_mode '{aggregation_mode}', falling back to 'none'")
            return self._get_direct_results(sorted_indices, similarities, top_k)

    def _get_direct_results(self, sorted_indices: np.ndarray, similarities: np.ndarray, top_k: int) -> List[Tuple[str, float]]:
        available_count = len(self.vector_store['grouped_texts'])
        actual_top_k = min(top_k, available_count)
        top_indices = sorted_indices[:actual_top_k]

        results = []
        for idx in top_indices:
            if idx < len(self.vector_store['grouped_texts']):
                grouped_text = self.vector_store['grouped_texts'][idx]
                score = similarities[idx]
                results.append((grouped_text, float(score)))

        return results

    def _post_aggregation(self, sorted_indices: np.ndarray, similarities: np.ndarray, top_k: int, tokenizer=None) -> str:
        # Get the top_k results first
        direct_results = self._get_direct_results(sorted_indices, similarities, top_k)

        # Extract the text parts for aggregation
        texts = [text for text, score in direct_results]

        aggregated_texts = self.aggregator.aggregate_segments(texts)

        return aggregated_texts

    def load_vector_store(self, file_path: str):
        """
        Load a vector store from a file

        Args:
            file_path (str): Vector store file path
        """
        if not os.path.exists(file_path):
            raise FileNotFoundError(f"Vector store file not found: {file_path}")

        with open(file_path, 'rb') as f:
            self.vector_store = pickle.load(f)

        print(f"Vector store loaded from {file_path}")
        print(f"Vector store info: {len(self.vector_store['grouped_texts'])} groups, embedding dimension: {self.vector_store['embeddings'].shape}")

    def has_vector_store(self) -> bool:
        """
        Check whether a vector store has been built or loaded

        Returns:
            bool: Whether a vector store is available
        """
        return hasattr(self, 'vector_store') and self.vector_store is not None
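
The README example uses `aggregation_mode='post'`; as a complement, here is a short sketch of the raw retrieval mode (a hypothetical document and query, mirroring the import style of the README example and run from the repository root):

```python
from encoder import UnifiedEncoder

encoder = UnifiedEncoder(model_name="jina", local_model_path=".")
encoder.build_vector_store("Your long document text goes here ...")

# 'none' returns the raw top-k matches as (text, score) pairs; each text
# still carries its 【Begin-x】...【End-x】 position markers.
hits = encoder.query("What is this document about?", top_k=3, aggregation_mode='none')
for text, score in hits:
    print(f"{score:.3f}  {text[:80]}")
```
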
final_loss_curve.png
ADDED
model.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:aca47fe33b4f8d4b507ac46c60817fc9287a1b81d63c0ad06559196d64c9a30d
size 1247063776
modeling_freechunker.py
ADDED
@@ -0,0 +1,768 @@
# coding=utf-8
# Copyright 2019 Facebook AI Research and the HuggingFace Inc. team.
# Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""FreeChunker model: Modified from PyTorch XLM-RoBERTa model."""
from .utils import generate_shifted_matrix
import math
from typing import Optional, Tuple, Union

import torch
import torch.utils.checkpoint
from packaging import version
from torch import nn
from transformers.activations import ACT2FN
from transformers.modeling_outputs import (
    BaseModelOutputWithPoolingAndCrossAttentions
)
from transformers.modeling_utils import PreTrainedModel
from transformers.pytorch_utils import find_pruneable_heads_and_indices, prune_linear_layer
from transformers.utils import (
    add_code_sample_docstrings,
    add_start_docstrings,
    add_start_docstrings_to_model_forward,
    get_torch_version,
    logging
)
from .configuration_freechunker import FreeChunkerConfig


logger = logging.get_logger(__name__)

_CHECKPOINT_FOR_DOC = "FacebookAI/xlm-roberta-base"
_CONFIG_FOR_DOC = "FreeChunkerConfig"


# Copied from transformers.models.roberta.modeling_roberta.RobertaEmbeddings with Roberta->FreeChunker
class FreeChunkerEmbeddings(nn.Module):
    """
    Same as BertEmbeddings with a tiny tweak for positional embeddings indexing.
    """

    # Copied from transformers.models.bert.modeling_bert.BertEmbeddings.__init__
    def __init__(self, config):
        super().__init__()
        self.word_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, padding_idx=config.pad_token_id)
        self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.hidden_size)
        self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.hidden_size)

        # self.LayerNorm is not snake-cased to stick with TensorFlow model variable name and be able to load
        # any TensorFlow checkpoint file
        self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        # position_ids (1, len position emb) is contiguous in memory and exported when serialized
        self.position_embedding_type = getattr(config, "position_embedding_type", "absolute")
        self.register_buffer(
            "position_ids", torch.arange(config.max_position_embeddings).expand((1, -1)), persistent=False
        )
        self.register_buffer(
            "token_type_ids", torch.zeros(self.position_ids.size(), dtype=torch.long), persistent=False
        )

        # End copy
        self.padding_idx = config.pad_token_id
        self.position_embeddings = nn.Embedding(
            config.max_position_embeddings, config.hidden_size, padding_idx=self.padding_idx
        )

    def forward(
        self, input_ids=None, token_type_ids=None, position_ids=None, inputs_embeds=None
    ):
        if position_ids is None:
            if input_ids is not None:
                # Create the position ids from the input token ids. Any padded tokens remain padded.
                position_ids = create_position_ids_from_input_ids(input_ids, self.padding_idx)
            else:
                position_ids = self.create_position_ids_from_inputs_embeds(inputs_embeds)

        if input_ids is not None:
            input_shape = input_ids.size()
        else:
            input_shape = inputs_embeds.size()[:-1]

        seq_length = input_shape[1]

        if position_ids is None:
            position_ids = torch.arange(seq_length, dtype=torch.long, device=self.position_ids.device)
            position_ids = position_ids.unsqueeze(0).expand(input_shape)

        if token_type_ids is None:
            token_type_ids = torch.zeros(input_shape, dtype=torch.long, device=self.position_ids.device)

        if inputs_embeds is None:
            inputs_embeds = self.word_embeddings(input_ids)
        token_type_embeddings = self.token_type_embeddings(token_type_ids)

        embeddings = inputs_embeds + token_type_embeddings
        if self.position_embedding_type == "absolute":
            position_embeddings = self.position_embeddings(position_ids)
            embeddings += position_embeddings
        embeddings = self.LayerNorm(embeddings)
        embeddings = self.dropout(embeddings)
        return embeddings

    def create_position_ids_from_inputs_embeds(self, inputs_embeds):
        """
        We are provided embeddings directly. We cannot infer which are padded so just generate sequential position ids.

        Args:
            inputs_embeds: torch.Tensor

        Returns: torch.Tensor
        """
        input_shape = inputs_embeds.size()[:-1]
        sequence_length = input_shape[1]

        position_ids = torch.arange(
            self.padding_idx + 1, sequence_length + self.padding_idx + 1, dtype=torch.long, device=inputs_embeds.device
        )
        return position_ids.unsqueeze(0).expand(input_shape)


# Copied from transformers.models.roberta.modeling_roberta.RobertaSelfAttention with Roberta->FreeChunker
class FreeChunkerSelfAttention(nn.Module):
    def __init__(self, config, position_embedding_type=None):
        super().__init__()
        if config.hidden_size % config.num_attention_heads != 0 and not hasattr(config, "embedding_size"):
            raise ValueError(
                f"The hidden size ({config.hidden_size}) is not a multiple of the number of attention "
                f"heads ({config.num_attention_heads})"
            )

        self.num_attention_heads = config.num_attention_heads
        self.attention_head_size = int(config.hidden_size / config.num_attention_heads)
        self.all_head_size = self.num_attention_heads * self.attention_head_size

        self.query = nn.Linear(config.hidden_size, self.all_head_size)
        self.key = nn.Linear(config.hidden_size, self.all_head_size)
        self.value = nn.Linear(config.hidden_size, self.all_head_size)

        self.dropout = nn.Dropout(config.attention_probs_dropout_prob)
        self.position_embedding_type = position_embedding_type or getattr(
            config, "position_embedding_type", "absolute"
        )
        if self.position_embedding_type == "relative_key" or self.position_embedding_type == "relative_key_query":
            self.max_position_embeddings = config.max_position_embeddings
            self.distance_embedding = nn.Embedding(2 * config.max_position_embeddings - 1, self.attention_head_size)

        self.is_decoder = config.is_decoder

    def transpose_for_scores(self, x: torch.Tensor) -> torch.Tensor:
        new_x_shape = x.size()[:-1] + (self.num_attention_heads, self.attention_head_size)
        x = x.view(new_x_shape)
        return x.permute(0, 2, 1, 3)

    def forward(
        self,
        hidden_states: torch.Tensor,
        hidden_states2: torch.Tensor,  # Second input stream, required parameter
        attention_mask: Optional[torch.FloatTensor] = None,
        head_mask: Optional[torch.FloatTensor] = None,
        output_attentions: Optional[bool] = False,
    ) -> Tuple[torch.Tensor]:
        # Query comes from hidden_states
        mixed_query_layer = self.query(hidden_states)
        query_layer = self.transpose_for_scores(mixed_query_layer)

        # Key and Value come from hidden_states2
        key_layer = self.transpose_for_scores(self.key(hidden_states2))
        value_layer = self.transpose_for_scores(self.value(hidden_states2))

        # Calculate attention scores
        attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))

        # Modified positional encoding handling
        if self.position_embedding_type == "relative_key" or self.position_embedding_type == "relative_key_query":
            query_length, key_length = query_layer.shape[2], key_layer.shape[2]

            # hidden_states positions are all the first position (0, 0, 0, ...)
            position_ids_l = torch.zeros(query_length, dtype=torch.long, device=hidden_states.device).view(-1, 1)
            # hidden_states2 uses normal incremental position sequence (0, 1, 2, 3, ...)
            position_ids_r = torch.arange(key_length, dtype=torch.long, device=hidden_states.device).view(1, -1)
            distance = position_ids_l - position_ids_r

            positional_embedding = self.distance_embedding(distance + self.max_position_embeddings - 1)
            positional_embedding = positional_embedding.to(dtype=query_layer.dtype)  # fp16 compatibility

            if self.position_embedding_type == "relative_key":
                relative_position_scores = torch.einsum("bhld,lrd->bhlr", query_layer, positional_embedding)
                attention_scores = attention_scores + relative_position_scores
            elif self.position_embedding_type == "relative_key_query":
                relative_position_scores_query = torch.einsum("bhld,lrd->bhlr", query_layer, positional_embedding)
                relative_position_scores_key = torch.einsum("bhrd,lrd->bhlr", key_layer, positional_embedding)
                attention_scores = attention_scores + relative_position_scores_query + relative_position_scores_key

        attention_scores = attention_scores / math.sqrt(self.attention_head_size)

        if attention_mask is not None:
            attention_scores = attention_scores + attention_mask

        # Normalize to probabilities
        attention_probs = nn.functional.softmax(attention_scores, dim=-1)
        attention_probs = self.dropout(attention_probs)

        # Apply head mask
        if head_mask is not None:
            attention_probs = attention_probs * head_mask

        # Calculate context
        context_layer = torch.matmul(attention_probs, value_layer)
        context_layer = context_layer.permute(0, 2, 1, 3).contiguous()
        new_context_layer_shape = context_layer.size()[:-2] + (self.all_head_size,)
        context_layer = context_layer.view(new_context_layer_shape)

        outputs = (context_layer, attention_probs) if output_attentions else (context_layer,)
        return outputs


# Copied from transformers.models.roberta.modeling_roberta.RobertaSdpaSelfAttention with Roberta->FreeChunker
class FreeChunkerSdpaSelfAttention(FreeChunkerSelfAttention):
    def __init__(self, config, position_embedding_type=None):
        super().__init__(config, position_embedding_type=position_embedding_type)
        self.dropout_prob = config.attention_probs_dropout_prob
        self.require_contiguous_qkv = version.parse(get_torch_version()) < version.parse("2.2.0")

    def forward(
        self,
        hidden_states: torch.Tensor,
        hidden_states2: torch.Tensor,  # Second input stream, required parameter
        attention_mask: Optional[torch.Tensor] = None,
        head_mask: Optional[torch.FloatTensor] = None,
        output_attentions: Optional[bool] = False,
    ) -> Tuple[torch.Tensor]:
        # If relative positional encoding, output attentions, or head mask are present, fall back to the parent implementation
        if (self.position_embedding_type != "absolute" or
                output_attentions or
                head_mask is not None):
            return super().forward(
                hidden_states,
                hidden_states2,
                attention_mask,
                head_mask,
                output_attentions,
            )

        # Use the optimized implementation of SDPA
        bsz, tgt_len, _ = hidden_states.size()

        query_layer = self.transpose_for_scores(self.query(hidden_states))
        key_layer = self.transpose_for_scores(self.key(hidden_states2))
        value_layer = self.transpose_for_scores(self.value(hidden_states2))

        # SDPA with memory-efficient backend is broken in torch==2.1.2 when using non-contiguous inputs and a custom
        # attn_mask, so we need to call `.contiguous()` here. This was fixed in torch==2.2.0.
        # Reference: https://github.com/pytorch/pytorch/issues/112577
        if self.require_contiguous_qkv and query_layer.device.type == "cuda" and attention_mask is not None:
            query_layer = query_layer.contiguous()
            key_layer = key_layer.contiguous()
            value_layer = value_layer.contiguous()

        attn_output = torch.nn.functional.scaled_dot_product_attention(
            query_layer,
            key_layer,
            value_layer,
            attn_mask=attention_mask,
            dropout_p=self.dropout_prob if self.training else 0.0,
            is_causal=False,  # For customized tasks, causal mask is not used
        )

        attn_output = attn_output.transpose(1, 2)
        attn_output = attn_output.reshape(bsz, tgt_len, self.all_head_size)

        outputs = (attn_output,)
        return outputs


# Copied from transformers.models.roberta.modeling_roberta.RobertaSelfOutput with Roberta->FreeChunker
class FreeChunkerSelfOutput(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)

    def forward(self, hidden_states: torch.Tensor, input_tensor: torch.Tensor) -> torch.Tensor:
        hidden_states = self.dense(hidden_states)
        hidden_states = self.dropout(hidden_states)
        hidden_states = self.LayerNorm(hidden_states + input_tensor)
        return hidden_states


XLM_ROBERTA_SELF_ATTENTION_CLASSES = {
    "eager": FreeChunkerSelfAttention,
    "sdpa": FreeChunkerSdpaSelfAttention,
}


# Copied from transformers.models.roberta.modeling_roberta.RobertaAttention with Roberta->FreeChunker
class FreeChunkerAttention(nn.Module):
    def __init__(self, config, position_embedding_type=None):
        super().__init__()
        self.self = XLM_ROBERTA_SELF_ATTENTION_CLASSES[config._attn_implementation](
            config, position_embedding_type=position_embedding_type
        )
        self.output = FreeChunkerSelfOutput(config)
        self.pruned_heads = set()

    def prune_heads(self, heads):
        if len(heads) == 0:
            return
        heads, index = find_pruneable_heads_and_indices(
            heads, self.self.num_attention_heads, self.self.attention_head_size, self.pruned_heads
        )

        # Prune linear layers
        self.self.query = prune_linear_layer(self.self.query, index)
        self.self.key = prune_linear_layer(self.self.key, index)
        self.self.value = prune_linear_layer(self.self.value, index)
        self.output.dense = prune_linear_layer(self.output.dense, index, dim=1)

        # Update hyper params and store pruned heads
        self.self.num_attention_heads = self.self.num_attention_heads - len(heads)
        self.self.all_head_size = self.self.attention_head_size * self.self.num_attention_heads
|
| 334 |
+
self.pruned_heads = self.pruned_heads.union(heads)
|
| 335 |
+
|
| 336 |
+
def forward(
|
| 337 |
+
self,
|
| 338 |
+
hidden_states: torch.Tensor,
|
| 339 |
+
hidden_states2: torch.Tensor, # Second input stream, required parameter
|
| 340 |
+
attention_mask: Optional[torch.FloatTensor] = None,
|
| 341 |
+
head_mask: Optional[torch.FloatTensor] = None,
|
| 342 |
+
output_attentions: Optional[bool] = False,
|
| 343 |
+
) -> Tuple[torch.Tensor]:
|
| 344 |
+
self_outputs = self.self(
|
| 345 |
+
hidden_states,
|
| 346 |
+
hidden_states2, # Pass second input stream
|
| 347 |
+
attention_mask,
|
| 348 |
+
head_mask,
|
| 349 |
+
output_attentions,
|
| 350 |
+
)
|
| 351 |
+
attention_output = self.output(self_outputs[0], hidden_states)
|
| 352 |
+
outputs = (attention_output,) + self_outputs[1:] # add attentions if we output them
|
| 353 |
+
return outputs
|
| 354 |
+
|
| 355 |
+
|
| 356 |
+
# Copied from transformers.models.roberta.modeling_roberta.RobertaIntermediate with Roberta->FreeChunker
|
| 357 |
+
class FreeChunkerIntermediate(nn.Module):
|
| 358 |
+
def __init__(self, config):
|
| 359 |
+
super().__init__()
|
| 360 |
+
self.dense = nn.Linear(config.hidden_size, config.intermediate_size)
|
| 361 |
+
if isinstance(config.hidden_act, str):
|
| 362 |
+
self.intermediate_act_fn = ACT2FN[config.hidden_act]
|
| 363 |
+
else:
|
| 364 |
+
self.intermediate_act_fn = config.hidden_act
|
| 365 |
+
|
| 366 |
+
def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
|
| 367 |
+
hidden_states = self.dense(hidden_states)
|
| 368 |
+
hidden_states = self.intermediate_act_fn(hidden_states)
|
| 369 |
+
return hidden_states
|
| 370 |
+
|
| 371 |
+
|
| 372 |
+
# Copied from transformers.models.roberta.modeling_roberta.RobertaOutput with Roberta->FreeChunker
|
| 373 |
+
class FreeChunkerOutput(nn.Module):
|
| 374 |
+
def __init__(self, config):
|
| 375 |
+
super().__init__()
|
| 376 |
+
self.dense = nn.Linear(config.intermediate_size, config.hidden_size)
|
| 377 |
+
self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
|
| 378 |
+
self.dropout = nn.Dropout(config.hidden_dropout_prob)
|
| 379 |
+
|
| 380 |
+
def forward(self, hidden_states: torch.Tensor, input_tensor: torch.Tensor) -> torch.Tensor:
|
| 381 |
+
hidden_states = self.dense(hidden_states)
|
| 382 |
+
hidden_states = self.dropout(hidden_states)
|
| 383 |
+
hidden_states = self.LayerNorm(hidden_states + input_tensor)
|
| 384 |
+
return hidden_states
|
| 385 |
+
|
| 386 |
+
|
| 387 |
+
# Copied from transformers.models.roberta.modeling_roberta.RobertaLayer with Roberta->FreeChunker
|
| 388 |
+
class FreeChunkerLayer(nn.Module):
|
| 389 |
+
def __init__(self, config):
|
| 390 |
+
super().__init__()
|
| 391 |
+
self.chunk_size_feed_forward = config.chunk_size_feed_forward
|
| 392 |
+
self.seq_len_dim = 1
|
| 393 |
+
self.attention = FreeChunkerAttention(config)
|
| 394 |
+
self.is_decoder = config.is_decoder
|
| 395 |
+
self.add_cross_attention = config.add_cross_attention
|
| 396 |
+
if self.add_cross_attention:
|
| 397 |
+
if not self.is_decoder:
|
| 398 |
+
raise ValueError(f"{self} should be used as a decoder model if cross attention is added")
|
| 399 |
+
self.crossattention = FreeChunkerAttention(config, position_embedding_type="absolute")
|
| 400 |
+
self.intermediate = FreeChunkerIntermediate(config)
|
| 401 |
+
self.output = FreeChunkerOutput(config)
|
| 402 |
+
|
| 403 |
+
def forward(
|
| 404 |
+
self,
|
| 405 |
+
hidden_states: torch.Tensor,
|
| 406 |
+
hidden_states2: torch.Tensor, # Second input stream, required parameter
|
| 407 |
+
attention_mask: Optional[torch.FloatTensor] = None,
|
| 408 |
+
head_mask: Optional[torch.FloatTensor] = None,
|
| 409 |
+
output_attentions: Optional[bool] = False,
|
| 410 |
+
) -> Tuple[torch.Tensor]:
|
| 411 |
+
attention_outputs = self.attention(
|
| 412 |
+
hidden_states,
|
| 413 |
+
hidden_states2, # Pass second input stream
|
| 414 |
+
attention_mask,
|
| 415 |
+
head_mask,
|
| 416 |
+
output_attentions,
|
| 417 |
+
)
|
| 418 |
+
attention_output = attention_outputs[0]
|
| 419 |
+
|
| 420 |
+
outputs = attention_outputs[1:] # add self attentions if we output attention weights
|
| 421 |
+
|
| 422 |
+
layer_output = self.feed_forward_chunk(attention_output)
|
| 423 |
+
outputs = (layer_output,) + outputs
|
| 424 |
+
|
| 425 |
+
return outputs
|
| 426 |
+
|
| 427 |
+
def feed_forward_chunk(self, attention_output):
|
| 428 |
+
intermediate_output = self.intermediate(attention_output)
|
| 429 |
+
layer_output = self.output(intermediate_output, attention_output)
|
| 430 |
+
return layer_output
|
| 431 |
+
|
| 432 |
+
|
| 433 |
+
# Copied from transformers.models.roberta.modeling_roberta.RobertaEncoder with Roberta->FreeChunker
|
| 434 |
+
class FreeChunkerEncoder(nn.Module):
|
| 435 |
+
def __init__(self, config):
|
| 436 |
+
super().__init__()
|
| 437 |
+
self.config = config
|
| 438 |
+
self.layer = nn.ModuleList([FreeChunkerLayer(config) for _ in range(config.num_hidden_layers)])
|
| 439 |
+
self.gradient_checkpointing = False
|
| 440 |
+
|
| 441 |
+
def forward(
|
| 442 |
+
self,
|
| 443 |
+
hidden_states: torch.Tensor,
|
| 444 |
+
hidden_states2: torch.Tensor, # Second input stream, required parameter
|
| 445 |
+
attention_mask: Optional[torch.FloatTensor] = None,
|
| 446 |
+
head_mask: Optional[torch.FloatTensor] = None,
|
| 447 |
+
) -> torch.Tensor:
|
| 448 |
+
|
| 449 |
+
for i, layer_module in enumerate(self.layer):
|
| 450 |
+
layer_head_mask = head_mask[i] if head_mask is not None else None
|
| 451 |
+
|
| 452 |
+
if self.gradient_checkpointing and self.training:
|
| 453 |
+
|
| 454 |
+
def create_custom_forward(module):
|
| 455 |
+
def custom_forward(*inputs):
|
| 456 |
+
return module(*inputs)
|
| 457 |
+
|
| 458 |
+
return custom_forward
|
| 459 |
+
|
| 460 |
+
layer_outputs = torch.utils.checkpoint.checkpoint(
|
| 461 |
+
create_custom_forward(layer_module),
|
| 462 |
+
hidden_states,
|
| 463 |
+
hidden_states2, # Pass second input stream
|
| 464 |
+
attention_mask,
|
| 465 |
+
layer_head_mask,
|
| 466 |
+
)
|
| 467 |
+
else:
|
| 468 |
+
layer_outputs = layer_module(
|
| 469 |
+
hidden_states,
|
| 470 |
+
hidden_states2, # Pass second input stream
|
| 471 |
+
attention_mask,
|
| 472 |
+
layer_head_mask,
|
| 473 |
+
)
|
| 474 |
+
|
| 475 |
+
hidden_states = layer_outputs[0]
|
| 476 |
+
|
| 477 |
+
return hidden_states
|
| 478 |
+
|
| 479 |
+
|
| 480 |
+
# Copied from transformers.models.roberta.modeling_roberta.RobertaPooler with Roberta->FreeChunker
|
| 481 |
+
class FreeChunkerPooler(nn.Module):
|
| 482 |
+
def __init__(self, config):
|
| 483 |
+
super().__init__()
|
| 484 |
+
self.dense = nn.Linear(config.hidden_size, config.hidden_size)
|
| 485 |
+
self.activation = nn.Tanh()
|
| 486 |
+
|
| 487 |
+
def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
|
| 488 |
+
# We "pool" the model by simply taking the hidden state corresponding
|
| 489 |
+
# to the first token.
|
| 490 |
+
first_token_tensor = hidden_states[:, 0]
|
| 491 |
+
pooled_output = self.dense(first_token_tensor)
|
| 492 |
+
pooled_output = self.activation(pooled_output)
|
| 493 |
+
return pooled_output
|
| 494 |
+
|
| 495 |
+
|
| 496 |
+
# Copied from transformers.models.roberta.modeling_roberta.RobertaPreTrainedModel with Roberta->FreeChunker
|
| 497 |
+
class FreeChunkerPreTrainedModel(PreTrainedModel):
|
| 498 |
+
"""
|
| 499 |
+
An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained
|
| 500 |
+
models.
|
| 501 |
+
"""
|
| 502 |
+
|
| 503 |
+
config_class = FreeChunkerConfig
|
| 504 |
+
base_model_prefix = "roberta"
|
| 505 |
+
supports_gradient_checkpointing = True
|
| 506 |
+
_no_split_modules = ["FreeChunkerEmbeddings", "FreeChunkerSelfAttention", "FreeChunkerSdpaSelfAttention"]
|
| 507 |
+
_supports_sdpa = True
|
| 508 |
+
|
| 509 |
+
# Copied from transformers.models.bert.modeling_bert.BertPreTrainedModel._init_weights
|
| 510 |
+
def _init_weights(self, module):
|
| 511 |
+
"""Initialize the weights"""
|
| 512 |
+
if isinstance(module, nn.Linear):
|
| 513 |
+
# Slightly different from the TF version which uses truncated_normal for initialization
|
| 514 |
+
# cf https://github.com/pytorch/pytorch/pull/5617
|
| 515 |
+
module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
|
| 516 |
+
if module.bias is not None:
|
| 517 |
+
module.bias.data.zero_()
|
| 518 |
+
elif isinstance(module, nn.Embedding):
|
| 519 |
+
module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
|
| 520 |
+
if module.padding_idx is not None:
|
| 521 |
+
module.weight.data[module.padding_idx].zero_()
|
| 522 |
+
elif isinstance(module, nn.LayerNorm):
|
| 523 |
+
module.bias.data.zero_()
|
| 524 |
+
module.weight.data.fill_(1.0)
|
| 525 |
+
|
| 526 |
+
|
| 527 |
+
XLM_ROBERTA_START_DOCSTRING = r"""
|
| 528 |
+
|
| 529 |
+
This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic methods the
|
| 530 |
+
library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads
|
| 531 |
+
etc.)
|
| 532 |
+
|
| 533 |
+
This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass.
|
| 534 |
+
Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage
|
| 535 |
+
and behavior.
|
| 536 |
+
|
| 537 |
+
Parameters:
|
| 538 |
+
config ([`FreeChunkerConfig`]): Model configuration class with all the parameters of the
|
| 539 |
+
model. Initializing with a config file does not load the weights associated with the model, only the
|
| 540 |
+
configuration. Check out the [`~PreTrainedModel.from_pretrained`] method to load the model weights.
|
| 541 |
+
"""
|
| 542 |
+
|
| 543 |
+
XLM_ROBERTA_INPUTS_DOCSTRING = r"""
|
| 544 |
+
Args:
|
| 545 |
+
input_ids (`torch.LongTensor` of shape `({0})`):
|
| 546 |
+
Indices of input sequence tokens in the vocabulary.
|
| 547 |
+
|
| 548 |
+
Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
|
| 549 |
+
[`PreTrainedTokenizer.__call__`] for details.
|
| 550 |
+
|
| 551 |
+
[What are input IDs?](../glossary#input-ids)
|
| 552 |
+
attention_mask (`torch.FloatTensor` of shape `({0})`, *optional*):
|
| 553 |
+
Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:
|
| 554 |
+
|
| 555 |
+
- 1 for tokens that are **not masked**,
|
| 556 |
+
- 0 for tokens that are **masked**.
|
| 557 |
+
|
| 558 |
+
[What are attention masks?](../glossary#attention-mask)
|
| 559 |
+
token_type_ids (`torch.LongTensor` of shape `({0})`, *optional*):
|
| 560 |
+
Segment token indices to indicate first and second portions of the inputs. Indices are selected in `[0,
|
| 561 |
+
1]`:
|
| 562 |
+
|
| 563 |
+
- 0 corresponds to a *sentence A* token,
|
| 564 |
+
- 1 corresponds to a *sentence B* token.
|
| 565 |
+
|
| 566 |
+
[What are token type IDs?](../glossary#token-type-ids)
|
| 567 |
+
position_ids (`torch.LongTensor` of shape `({0})`, *optional*):
|
| 568 |
+
Indices of positions of each input sequence tokens in the position embeddings. Selected in the range `[0,
|
| 569 |
+
config.max_position_embeddings - 1]`.
|
| 570 |
+
|
| 571 |
+
[What are position IDs?](../glossary#position-ids)
|
| 572 |
+
head_mask (`torch.FloatTensor` of shape `(num_heads,)` or `(num_layers, num_heads)`, *optional*):
|
| 573 |
+
Mask to nullify selected heads of the self-attention modules. Mask values selected in `[0, 1]`:
|
| 574 |
+
|
| 575 |
+
- 1 indicates the head is **not masked**,
|
| 576 |
+
- 0 indicates the head is **masked**.
|
| 577 |
+
|
| 578 |
+
inputs_embeds (`torch.FloatTensor` of shape `({0}, hidden_size)`, *optional*):
|
| 579 |
+
Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. This
|
| 580 |
+
is useful if you want more control over how to convert `input_ids` indices into associated vectors than the
|
| 581 |
+
model's internal embedding lookup matrix.
|
| 582 |
+
output_attentions (`bool`, *optional*):
|
| 583 |
+
Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
|
| 584 |
+
tensors for more detail.
|
| 585 |
+
output_hidden_states (`bool`, *optional*):
|
| 586 |
+
Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
|
| 587 |
+
more detail.
|
| 588 |
+
return_dict (`bool`, *optional*):
|
| 589 |
+
Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
|
| 590 |
+
"""
|
| 591 |
+
|
| 592 |
+
|
| 593 |
+
@add_start_docstrings(
|
| 594 |
+
"The bare XLM-RoBERTa Model transformer outputting raw hidden-states without any specific head on top.",
|
| 595 |
+
XLM_ROBERTA_START_DOCSTRING,
|
| 596 |
+
)
|
| 597 |
+
# Copied from transformers.models.roberta.modeling_roberta.RobertaModel with Roberta->FreeChunker, ROBERTA->XLM_ROBERTA
|
| 598 |
+
class FreeChunkerModel(FreeChunkerPreTrainedModel):
|
| 599 |
+
"""
|
| 600 |
+
|
| 601 |
+
The model can behave as an encoder (with only self-attention) as well as a decoder, in which case a layer of
|
| 602 |
+
cross-attention is added between the self-attention layers, following the architecture described in [Attention is
|
| 603 |
+
all you need](https://arxiv.org/abs/1706.03762) by Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit,
|
| 604 |
+
Llion Jones, Aidan N. Gomez, Lukasz Kaiser and Illia Polosukhin.
|
| 605 |
+
|
| 606 |
+
To behave as a decoder the model needs to be initialized with the `is_decoder` argument of the configuration set
|
| 607 |
+
to `True`. To be used in a Seq2Seq model, the model needs to be initialized with both `is_decoder` argument and
|
| 608 |
+
`add_cross_attention` set to `True`; an `encoder_hidden_states` is then expected as an input to the forward pass.
|
| 609 |
+
"""
|
| 610 |
+
|
| 611 |
+
_no_split_modules = ["FreeChunkerEmbeddings", "FreeChunkerLayer"]
|
| 612 |
+
|
| 613 |
+
def __init__(self, config, add_pooling_layer=True):
|
| 614 |
+
super().__init__(config)
|
| 615 |
+
self.config = config
|
| 616 |
+
self.config.vocab_size = 2
|
| 617 |
+
self.embeddings = FreeChunkerEmbeddings(self.config)
|
| 618 |
+
self.encoder = FreeChunkerEncoder(config)
|
| 619 |
+
|
| 620 |
+
self.pooler = FreeChunkerPooler(config) if add_pooling_layer else None
|
| 621 |
+
|
| 622 |
+
self.attn_implementation = config._attn_implementation
|
| 623 |
+
self.position_embedding_type = config.position_embedding_type
|
| 624 |
+
|
| 625 |
+
# Initialize weights and apply final processing
|
| 626 |
+
self.post_init()
|
| 627 |
+
|
| 628 |
+
def get_input_embeddings(self):
|
| 629 |
+
return self.embeddings.word_embeddings
|
| 630 |
+
|
| 631 |
+
def set_input_embeddings(self, value):
|
| 632 |
+
self.embeddings.word_embeddings = value
|
| 633 |
+
|
| 634 |
+
def _prune_heads(self, heads_to_prune):
|
| 635 |
+
"""
|
| 636 |
+
Prunes heads of the model. heads_to_prune: dict of {layer_num: list of heads to prune in this layer} See base
|
| 637 |
+
class PreTrainedModel
|
| 638 |
+
"""
|
| 639 |
+
for layer, heads in heads_to_prune.items():
|
| 640 |
+
self.encoder.layer[layer].attention.prune_heads(heads)
|
| 641 |
+
|
| 642 |
+
@add_start_docstrings_to_model_forward(XLM_ROBERTA_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
|
| 643 |
+
@add_code_sample_docstrings(
|
| 644 |
+
checkpoint=_CHECKPOINT_FOR_DOC,
|
| 645 |
+
output_type=BaseModelOutputWithPoolingAndCrossAttentions,
|
| 646 |
+
config_class=_CONFIG_FOR_DOC,
|
| 647 |
+
)
|
| 648 |
+
def forward(
|
| 649 |
+
self,
|
| 650 |
+
inputs_embeds=None,
|
| 651 |
+
labels=None,
|
| 652 |
+
loss_weights: bool = False,
|
| 653 |
+
input_ids: Optional[torch.Tensor] = None,
|
| 654 |
+
head_mask: Optional[torch.Tensor] = None,
|
| 655 |
+
encoder_hidden_states: Optional[torch.Tensor] = None,
|
| 656 |
+
encoder_attention_mask: Optional[torch.Tensor] = None
|
| 657 |
+
) -> Union[Tuple[torch.Tensor], BaseModelOutputWithPoolingAndCrossAttentions]:
|
| 658 |
+
|
| 659 |
+
# Get input device
|
| 660 |
+
input_device = inputs_embeds.device
|
| 661 |
+
|
| 662 |
+
# Dimension adaptation: if input dimension is less than 1024, pad with 0
|
| 663 |
+
original_hidden_size = inputs_embeds.shape[-1]
|
| 664 |
+
target_hidden_size = self.config.hidden_size # 1024
|
| 665 |
+
|
| 666 |
+
if original_hidden_size < target_hidden_size:
|
| 667 |
+
# Calculate number of dimensions to pad
|
| 668 |
+
padding_size = target_hidden_size - original_hidden_size
|
| 669 |
+
# Pad with 0 on the last dimension
|
| 670 |
+
padding = torch.zeros(inputs_embeds.shape[:-1] + (padding_size,),
|
| 671 |
+
device=input_device, dtype=inputs_embeds.dtype)
|
| 672 |
+
inputs_embeds = torch.cat([inputs_embeds, padding], dim=-1)
|
| 673 |
+
|
| 674 |
+
# Adjust max_power based on sequence length
|
| 675 |
+
sequence_length = inputs_embeds.shape[1]
|
| 676 |
+
|
| 677 |
+
shifted_matrix = generate_shifted_matrix(sequence_length, device=input_device)
|
| 678 |
+
|
| 679 |
+
# Generate attention mask
|
| 680 |
+
encoder_attention_mask = shifted_matrix.transpose(1, 2)
|
| 681 |
+
encoder_attention_mask = torch.where(encoder_attention_mask == 1.0, 0.0, float('-inf'))[:, None, :, :]
|
| 682 |
+
|
| 683 |
+
# Fixed input IDs and position IDs
|
| 684 |
+
input_ids = torch.tensor([[0] * shifted_matrix.shape[2]], device=input_device)
|
| 685 |
+
position_ids = torch.tensor([[0] * shifted_matrix.shape[2]], device=input_device)
|
| 686 |
+
|
| 687 |
+
# Embedding layer processing
|
| 688 |
+
embedding_output = self.embeddings(
|
| 689 |
+
input_ids=input_ids,
|
| 690 |
+
position_ids=position_ids,
|
| 691 |
+
token_type_ids=None,
|
| 692 |
+
)
|
| 693 |
+
|
| 694 |
+
# Set second input stream
|
| 695 |
+
encoder_hidden_states = inputs_embeds
|
| 696 |
+
|
| 697 |
+
head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)
|
| 698 |
+
|
| 699 |
+
# Encoder processing
|
| 700 |
+
sequence_output = self.encoder(
|
| 701 |
+
embedding_output,
|
| 702 |
+
hidden_states2=encoder_hidden_states, # Second input stream
|
| 703 |
+
attention_mask=encoder_attention_mask, # Use generated mask
|
| 704 |
+
head_mask=head_mask,
|
| 705 |
+
)
|
| 706 |
+
|
| 707 |
+
if original_hidden_size < target_hidden_size:
|
| 708 |
+
|
| 709 |
+
sequence_output = sequence_output[..., :original_hidden_size]
|
| 710 |
+
# Also truncate inputs_embeds back to original size to match dimensions of sequence_output
|
| 711 |
+
inputs_embeds = inputs_embeds[..., :original_hidden_size]
|
| 712 |
+
|
| 713 |
+
shift_matrix = shifted_matrix.transpose(1, 2).squeeze(0)
|
| 714 |
+
# Loss calculation
|
| 715 |
+
loss = None
|
| 716 |
+
if labels is not None:
|
| 717 |
+
emb = sequence_output.view(-1, sequence_output.shape[-1])
|
| 718 |
+
lab = labels.view(-1, labels.shape[-1])
|
| 719 |
+
target = torch.ones(emb.size(0), device=emb.device)
|
| 720 |
+
|
| 721 |
+
# If weights are provided, use weighted cosine loss
|
| 722 |
+
if loss_weights:
|
| 723 |
+
# Weight each chunk by the number of sentences it covers (row sums of the shift matrix)
|
| 724 |
+
loss_weights = shift_matrix.sum(dim=1).to(emb.device)
|
| 725 |
+
|
| 726 |
+
# Calculate unweighted cosine loss
|
| 727 |
+
cos_loss_fn = torch.nn.CosineEmbeddingLoss(reduction='none')
|
| 728 |
+
individual_losses = cos_loss_fn(emb, lab, target)
|
| 729 |
+
|
| 730 |
+
# Apply weights and calculate weighted average
|
| 731 |
+
weighted_losses = individual_losses * loss_weights
|
| 732 |
+
loss = weighted_losses.sum() / loss_weights.sum()
|
| 733 |
+
else:
|
| 734 |
+
# Use standard cosine loss
|
| 735 |
+
cos_loss = torch.nn.CosineEmbeddingLoss()
|
| 736 |
+
loss = cos_loss(emb, lab, target)
|
| 737 |
+
|
| 738 |
+
embedding = torch.cat([inputs_embeds, sequence_output], dim=1)
|
| 739 |
+
embedding = torch.nn.functional.normalize(embedding, p=2, dim=-1)
|
| 740 |
+
# embedding = torch.nn.functional.normalize(sequence_output, p=2, dim=-1)
|
| 741 |
+
|
| 742 |
+
return {
|
| 743 |
+
"loss": loss,
|
| 744 |
+
"embedding": embedding.squeeze(0),
|
| 745 |
+
"shift_matrix": shift_matrix
|
| 746 |
+
}
|
| 747 |
+
|
| 748 |
+
# Copied from transformers.models.roberta.modeling_roberta.create_position_ids_from_input_ids
|
| 749 |
+
def create_position_ids_from_input_ids(input_ids, padding_idx):
|
| 750 |
+
"""
|
| 751 |
+
Replace non-padding symbols with their position numbers. Position numbers begin at padding_idx+1. Padding symbols
|
| 752 |
+
are ignored. This is modified from fairseq's `utils.make_positions`.
|
| 753 |
+
|
| 754 |
+
Args:
|
| 755 |
+
input_ids (torch.Tensor): Input token ids.
padding_idx (int): Padding token index.
|
| 756 |
+
|
| 757 |
+
Returns: torch.Tensor
|
| 758 |
+
"""
|
| 759 |
+
# The series of casts and type-conversions here are carefully balanced to both work with ONNX export and XLA.
|
| 760 |
+
mask = input_ids.ne(padding_idx).int()
|
| 761 |
+
incremental_indices = (torch.cumsum(mask, dim=1).type_as(mask)) * mask
|
| 762 |
+
return incremental_indices.long() + padding_idx
|
| 763 |
+
|
| 764 |
+
|
| 765 |
+
__all__ = [
|
| 766 |
+
"FreeChunkerModel",
|
| 767 |
+
"FreeChunkerPreTrainedModel",
|
| 768 |
+
]
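
For reference, here is a minimal, hedged sketch of driving `FreeChunkerModel.forward` with pre-computed sentence embeddings. It assumes a batch of one document and a 512-dimensional backbone (the model pads and truncates to `config.hidden_size` internally); the random embeddings and the local checkpoint path are placeholders, not part of the upload.

```python
import torch
from modeling_freechunker import FreeChunkerModel

# Load the checkpoint from the current directory (config.json + model.safetensors).
model = FreeChunkerModel.from_pretrained(".")
model.eval()

# Placeholder sentence embeddings: [batch=1, n_sentences, backbone_dim].
sentence_embeddings = torch.randn(1, 12, 512)

with torch.no_grad():
    out = model(inputs_embeds=sentence_embeddings)

# "embedding" holds the L2-normalized sentence embeddings followed by the chunk
# embeddings; "shift_matrix" is a 0/1 matrix of shape [num_chunks, n_sentences].
print(out["embedding"].shape)
print(out["shift_matrix"].shape)
```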
|
sentenizer.py
ADDED
|
@@ -0,0 +1,276 @@
|
| 1 |
+
#!/usr/bin/env python3
|
| 2 |
+
"""
|
| 3 |
+
Sentenceizer - Universal sentence splitter + vector encoder
|
| 4 |
+
Length-constrained sentence splitting tool that protects special formats but not quotes/brackets
|
| 5 |
+
"""
|
| 6 |
+
|
| 7 |
+
import numpy as np
|
| 8 |
+
from typing import List, Tuple, Union, Optional
|
| 9 |
+
from sentence_transformers import SentenceTransformer
|
| 10 |
+
from transformers import AutoTokenizer
|
| 11 |
+
|
| 12 |
+
# --- Integrated TraditionalChunking ---
|
| 13 |
+
|
| 14 |
+
def setup_tokenizer(model_name="xlm-roberta-base"):
|
| 15 |
+
"""Setup tokenizer"""
|
| 16 |
+
try:
|
| 17 |
+
tokenizer = AutoTokenizer.from_pretrained(model_name)
|
| 18 |
+
except Exception as e:
|
| 19 |
+
print(f"Warning: Could not load tokenizer for {model_name}: {e}. Falling back to bert-base-uncased")
|
| 20 |
+
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
|
| 21 |
+
return tokenizer
|
| 22 |
+
|
| 23 |
+
def fixed_size_chunking(text: str, tokenizer=None, chunk_size: int = 256, overlap: int = 0) -> List[str]:
|
| 24 |
+
"""
|
| 25 |
+
Fixed-size chunking based on token count (Strict truncation)
|
| 26 |
+
|
| 27 |
+
Args:
|
| 28 |
+
text: Text to chunk
|
| 29 |
+
tokenizer: Tokenizer
|
| 30 |
+
chunk_size: Token count per chunk
|
| 31 |
+
overlap: Overlapping token count
|
| 32 |
+
"""
|
| 33 |
+
if tokenizer is None:
|
| 34 |
+
tokenizer = setup_tokenizer()
|
| 35 |
+
|
| 36 |
+
# Encode the entire text, do not add special tokens to keep it clean
|
| 37 |
+
tokens = tokenizer.encode(text, add_special_tokens=False)
|
| 38 |
+
total_tokens = len(tokens)
|
| 39 |
+
|
| 40 |
+
chunks = []
|
| 41 |
+
|
| 42 |
+
# Calculate step size
|
| 43 |
+
step = chunk_size - overlap
|
| 44 |
+
if step <= 0:
|
| 45 |
+
step = 1 # Prevent infinite loop, theoretically overlap should be smaller than chunk_size
|
| 46 |
+
|
| 47 |
+
for i in range(0, total_tokens, step):
|
| 48 |
+
# Truncate tokens for current chunk
|
| 49 |
+
chunk_tokens = tokens[i : i + chunk_size]
|
| 50 |
+
|
| 51 |
+
# Decode back to text
|
| 52 |
+
chunk_text = tokenizer.decode(chunk_tokens, skip_special_tokens=True)
|
| 53 |
+
|
| 54 |
+
if chunk_text.strip():
|
| 55 |
+
chunks.append(chunk_text.strip())
|
| 56 |
+
|
| 57 |
+
return chunks
|
| 58 |
+
|
| 59 |
+
def traditional_chunking(text, tokenizer=None, chunk_size=256, overlap=0):
|
| 60 |
+
"""
|
| 61 |
+
Fixed-size chunking based on tokens
|
| 62 |
+
|
| 63 |
+
Args:
|
| 64 |
+
text: Text to chunk
|
| 65 |
+
tokenizer: Tokenizer
|
| 66 |
+
chunk_size: Token count per chunk
|
| 67 |
+
overlap: Overlapping token count
|
| 68 |
+
"""
|
| 69 |
+
return fixed_size_chunking(text, tokenizer, chunk_size, overlap)
|
| 70 |
+
|
| 71 |
+
class TraditionalChunking:
|
| 72 |
+
def __init__(self, model_name_or_path=None, tokenizer=None, chunk_size=256, overlap=0):
|
| 73 |
+
if tokenizer is not None:
|
| 74 |
+
self.tokenizer = tokenizer
|
| 75 |
+
elif model_name_or_path is not None:
|
| 76 |
+
self.tokenizer = setup_tokenizer(model_name_or_path)
|
| 77 |
+
else:
|
| 78 |
+
self.tokenizer = setup_tokenizer()
|
| 79 |
+
self.chunk_size = chunk_size
|
| 80 |
+
self.overlap = overlap
|
| 81 |
+
|
| 82 |
+
def chunk(self, text):
|
| 83 |
+
return traditional_chunking(text, self.tokenizer, self.chunk_size, self.overlap)
|
| 84 |
+
|
| 85 |
+
# --- End TraditionalChunking ---
|
| 86 |
+
|
| 87 |
+
|
| 88 |
+
class Sentenceizer:
|
| 89 |
+
"""
|
| 90 |
+
Universal sentence splitter and encoder with length constraints, protecting special formats
|
| 91 |
+
"""
|
| 92 |
+
|
| 93 |
+
def __init__(self, model_name: Optional[str] = None):
|
| 94 |
+
"""
|
| 95 |
+
Initialize Sentenceizer
|
| 96 |
+
|
| 97 |
+
Args:
|
| 98 |
+
model_name (str, optional): SentenceTransformer model name
|
| 99 |
+
If None, no encoding model is loaded
|
| 100 |
+
"""
|
| 101 |
+
# Initialize chunker with model_name if available, otherwise default
|
| 102 |
+
self.chunker = TraditionalChunking(model_name_or_path=model_name if model_name else "xlm-roberta-base", chunk_size=256, overlap=0)
|
| 103 |
+
|
| 104 |
+
self.model = None
|
| 105 |
+
self.model_name = model_name
|
| 106 |
+
if model_name:
|
| 107 |
+
print(f"Loading sentence transformer model: {model_name}")
|
| 108 |
+
self.model = SentenceTransformer(model_name, trust_remote_code=True)
|
| 109 |
+
self.model.eval()
|
| 110 |
+
print(f"Model loaded successfully. Embedding dimension: {self.model.get_sentence_embedding_dimension()}")
|
| 111 |
+
|
| 112 |
+
def split(self, text: str) -> List[str]:
|
| 113 |
+
"""
|
| 114 |
+
Split text into a list of segments using token-based fixed-size chunking (see TraditionalChunking)
|
| 115 |
+
|
| 116 |
+
Args:
|
| 117 |
+
text (str): Input text
|
| 118 |
+
|
| 119 |
+
Returns:
|
| 120 |
+
List[str]: List of sentences
|
| 121 |
+
"""
|
| 122 |
+
if not text.strip():
|
| 123 |
+
return []
|
| 124 |
+
|
| 125 |
+
return self.chunker.chunk(text)
|
| 126 |
+
|
| 127 |
+
def split_with_positions(self, text: str) -> List[Tuple[str, int, int]]:
|
| 128 |
+
"""
|
| 129 |
+
Split text and return sentences with their positions in the original text
|
| 130 |
+
|
| 131 |
+
Args:
|
| 132 |
+
text (str): Input text
|
| 133 |
+
|
| 134 |
+
Returns:
|
| 135 |
+
List[Tuple[str, int, int]]: List of (sentence, start_position, end_position)
|
| 136 |
+
"""
|
| 137 |
+
sentences = self.split(text)
|
| 138 |
+
sentences_with_pos = []
|
| 139 |
+
|
| 140 |
+
start_pos = 0
|
| 141 |
+
for sentence in sentences:
|
| 142 |
+
# Find sentence position in original text
|
| 143 |
+
pos = text.find(sentence, start_pos)
|
| 144 |
+
if pos != -1:
|
| 145 |
+
sentences_with_pos.append((sentence, pos, pos + len(sentence)))
|
| 146 |
+
start_pos = pos + len(sentence)
|
| 147 |
+
else:
|
| 148 |
+
# If not found (possibly due to merging or splitting), use estimated position
|
| 149 |
+
sentences_with_pos.append((sentence, start_pos, start_pos + len(sentence)))
|
| 150 |
+
start_pos += len(sentence)
|
| 151 |
+
|
| 152 |
+
return sentences_with_pos
|
| 153 |
+
|
| 154 |
+
def encode(self, text: Union[str, List[str]], show_progress: bool = False) -> np.ndarray:
|
| 155 |
+
"""
|
| 156 |
+
Encode text
|
| 157 |
+
|
| 158 |
+
Args:
|
| 159 |
+
text (Union[str, List[str]]): Input text, can be a single string or list of strings
|
| 160 |
+
If it's a string, sentence splitting will be performed first
|
| 161 |
+
show_progress (bool): Whether to show progress bar
|
| 162 |
+
|
| 163 |
+
Returns:
|
| 164 |
+
np.ndarray: Encoded vector array with shape (n_sentences, embedding_dim)
|
| 165 |
+
|
| 166 |
+
Raises:
|
| 167 |
+
ValueError: If no model is loaded
|
| 168 |
+
"""
|
| 169 |
+
if self.model is None:
|
| 170 |
+
raise ValueError("No model loaded. Please initialize with a model_name.")
|
| 171 |
+
|
| 172 |
+
# If input is string, perform sentence splitting first
|
| 173 |
+
if isinstance(text, str):
|
| 174 |
+
sentences = self.split(text)
|
| 175 |
+
else:
|
| 176 |
+
sentences = text
|
| 177 |
+
|
| 178 |
+
if not sentences:
|
| 179 |
+
return np.array([])
|
| 180 |
+
|
| 181 |
+
# Use the sentence transformer for encoding; batch_size is kept small (4) to limit memory use
|
| 182 |
+
embeddings = self.model.encode(
|
| 183 |
+
sentences,
|
| 184 |
+
show_progress_bar=show_progress,
|
| 185 |
+
convert_to_numpy=True,
|
| 186 |
+
batch_size=4
|
| 187 |
+
)
|
| 188 |
+
|
| 189 |
+
return embeddings
|
| 190 |
+
|
| 191 |
+
def split_and_encode(self, text: str, show_progress: bool = True) -> Tuple[List[str], np.ndarray]:
|
| 192 |
+
"""
|
| 193 |
+
Split text and encode
|
| 194 |
+
|
| 195 |
+
Args:
|
| 196 |
+
text (str): Input text
|
| 197 |
+
show_progress (bool): Whether to show progress bar
|
| 198 |
+
|
| 199 |
+
Returns:
|
| 200 |
+
Tuple[List[str], np.ndarray]: (sentence list, encoded vector array)
|
| 201 |
+
"""
|
| 202 |
+
sentences = self.split(text)
|
| 203 |
+
embeddings = self.encode(sentences, show_progress=show_progress)
|
| 204 |
+
return sentences, embeddings
|
| 205 |
+
|
| 206 |
+
@property
|
| 207 |
+
def embedding_dimension(self) -> int:
|
| 208 |
+
"""Get embedding dimension"""
|
| 209 |
+
if self.model is None:
|
| 210 |
+
raise ValueError("No model loaded.")
|
| 211 |
+
return self.model.get_sentence_embedding_dimension()
|
| 212 |
+
|
| 213 |
+
def test_sentenceizer():
|
| 214 |
+
"""Test universal sentence splitting functionality and protection mechanisms"""
|
| 215 |
+
|
| 216 |
+
print("=== Testing Universal Sentence Splitting and Protection Mechanisms ===")
|
| 217 |
+
|
| 218 |
+
# Use reasonable length constraints for testing
|
| 219 |
+
sentenceizer = Sentenceizer()
|
| 220 |
+
|
| 221 |
+
test_cases = [
|
| 222 |
+
# Basic sentence splitting test
|
| 223 |
+
"This is the first sentence. This is the second sentence! This is the third sentence?",
|
| 224 |
+
|
| 225 |
+
# Quote sentence splitting test (should be able to split)
|
| 226 |
+
'He said "Hello there. How are you? I hope you are well." Then he left.',
|
| 227 |
+
|
| 228 |
+
# Abbreviation protection test (should not split at abbreviations)
|
| 229 |
+
"Dr. Smith is here. Mr. Jones left at 3 p.m. today. The U.S. economy is growing.",
|
| 230 |
+
|
| 231 |
+
# Number protection test (should not split within numbers)
|
| 232 |
+
"The temperature is 36.5 degrees. The price is $19.99. Version 2.1.3 was released.",
|
| 233 |
+
|
| 234 |
+
# Ellipsis protection test (should not split at ellipsis)
|
| 235 |
+
"This is incomplete... But this continues the thought. Another sentence follows.",
|
| 236 |
+
|
| 237 |
+
# URL protection test (should not split within URLs)
|
| 238 |
+
"Visit https://www.example.com for more info. The website www.test.org has details.",
|
| 239 |
+
|
| 240 |
+
# Email protection test (should not split within emails)
|
| 241 |
+
"Contact me at john.doe@example.com for questions. Send reports to admin@company.org please.",
|
| 242 |
+
|
| 243 |
+
# Date and time protection test
|
| 244 |
+
"The meeting is on 12/25/2023. We start at 3:30 p.m. today. See you then.",
|
| 245 |
+
|
| 246 |
+
# Non-English text test
|
| 247 |
+
"这是第一个句子。这是第二个句子!这是第三个句子?",
|
| 248 |
+
|
| 249 |
+
# Mixed text test
|
| 250 |
+
"This is English. 这是中文。Mix of both languages!",
|
| 251 |
+
|
| 252 |
+
# Complex mixed test
|
| 253 |
+
"访问 https://www.baidu.com 获取信息。联系邮箱是 test@163.com。价格为 ¥99.99 元。",
|
| 254 |
+
|
| 255 |
+
# Long sentence test (should be split)
|
| 256 |
+
"This is a very long sentence that should be split into multiple parts because it exceeds the maximum length limit that we have set for individual sentences in our system, and we need to handle this properly.",
|
| 257 |
+
|
| 258 |
+
# Sentences starting with numbers
|
| 259 |
+
"Today is sunny. 123 people attended the meeting. Everyone was happy.",
|
| 260 |
+
|
| 261 |
+
# Sentences starting with special characters
|
| 262 |
+
"First sentence here. \"Quoted sentence comes next.\" Final sentence ends it.",
|
| 263 |
+
]
|
| 264 |
+
|
| 265 |
+
for i, text in enumerate(test_cases, 1):
|
| 266 |
+
print(f"\n--- Test Case {i} ---")
|
| 267 |
+
print(f"Original: {text}")
|
| 268 |
+
|
| 269 |
+
sentences = sentenceizer.split(text)
|
| 270 |
+
print(f"Split Result ({len(sentences)} sentences):")
|
| 271 |
+
for j, sentence in enumerate(sentences, 1):
|
| 272 |
+
print(f" {j}. ({len(sentence)} chars) {sentence}")
|
| 273 |
+
|
| 274 |
+
|
| 275 |
+
if __name__ == "__main__":
|
| 276 |
+
test_sentenceizer()
|
training_losses.json
ADDED
|
The diff for this file is too large to render.
See raw diff
|
|
|
utils.py
ADDED
|
@@ -0,0 +1,235 @@
|
| 1 |
+
#!/usr/bin/env python3
|
| 2 |
+
"""
|
| 3 |
+
Utility Functions
|
| 4 |
+
"""
|
| 5 |
+
|
| 6 |
+
import torch
|
| 7 |
+
import numpy as np
|
| 9 |
+
|
| 10 |
+
def generate_shifted_matrix(n, device=None):
|
| 11 |
+
|
| 12 |
+
matrix_columns = []
|
| 13 |
+
granularities = [2, 4]
|
| 14 |
+
|
| 15 |
+
for granularity in granularities:
|
| 16 |
+
if granularity > n:
|
| 17 |
+
continue
|
| 18 |
+
|
| 19 |
+
# Calculate step size for this granularity
|
| 20 |
+
step_size = max(1, granularity // 2)
|
| 21 |
+
max_start = n - granularity
|
| 22 |
+
|
| 23 |
+
for start in range(0, max_start + 1, step_size):
|
| 24 |
+
column = torch.zeros(n, dtype=torch.int, device=device)
|
| 25 |
+
column[start:start + granularity] = 1
|
| 26 |
+
matrix_columns.append(column)
|
| 27 |
+
|
| 28 |
+
# If the last position is not covered, add a mask at the end
|
| 29 |
+
if max_start >= 0 and (max_start % step_size) != 0:
|
| 30 |
+
column = torch.zeros(n, dtype=torch.int, device=device)
|
| 31 |
+
column[-granularity:] = 1
|
| 32 |
+
matrix_columns.append(column)
|
| 33 |
+
|
| 34 |
+
if not matrix_columns:
|
| 35 |
+
column = torch.ones(n, dtype=torch.int, device=device)
|
| 36 |
+
matrix_columns.append(column)
|
| 37 |
+
|
| 38 |
+
result = torch.stack(matrix_columns, dim=1).unsqueeze(0).expand(1, -1, -1)
|
| 39 |
+
return result
|
| 40 |
+
|
| 41 |
+
def create_attention_mask(shift_matrix: torch.Tensor) -> torch.Tensor:
|
| 42 |
+
"""
|
| 43 |
+
Create attention mask from shift matrix
|
| 44 |
+
|
| 45 |
+
Args:
|
| 46 |
+
shift_matrix (torch.Tensor): shift matrix, shape [num_chunks, seq_len]
|
| 47 |
+
|
| 48 |
+
Returns:
|
| 49 |
+
torch.Tensor: attention mask, shape [1, num_chunks, seq_len, seq_len]
|
| 50 |
+
"""
|
| 51 |
+
# Transpose and create attention mask
|
| 52 |
+
attention_mask = shift_matrix.transpose(0, 1) # [seq_len, num_chunks]
|
| 53 |
+
attention_mask = torch.where(attention_mask == 1.0, 0.0, float('-inf'))
|
| 54 |
+
|
| 55 |
+
# Add dimensions to match expected shape of attention
|
| 56 |
+
attention_mask = attention_mask.unsqueeze(0).unsqueeze(0) # [1, 1, seq_len, num_chunks]
|
| 57 |
+
|
| 58 |
+
return attention_mask
|
| 59 |
+
|
| 60 |
+
def normalize_embeddings(embeddings: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
|
| 61 |
+
"""
|
| 62 |
+
L2 normalize embeddings
|
| 63 |
+
|
| 64 |
+
Args:
|
| 65 |
+
embeddings (torch.Tensor): Embeddings
|
| 66 |
+
eps (float): Small value to prevent division by zero
|
| 67 |
+
|
| 68 |
+
Returns:
|
| 69 |
+
torch.Tensor: Normalized embeddings
|
| 70 |
+
"""
|
| 71 |
+
norm = torch.norm(embeddings, dim=-1, keepdim=True)
|
| 72 |
+
return embeddings / (norm + eps)
|
| 73 |
+
|
| 74 |
+
def cosine_similarity(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
|
| 75 |
+
"""
|
| 76 |
+
Calculate cosine similarity
|
| 77 |
+
|
| 78 |
+
Args:
|
| 79 |
+
a (torch.Tensor): Vector A
|
| 80 |
+
b (torch.Tensor): Vector B
|
| 81 |
+
|
| 82 |
+
Returns:
|
| 83 |
+
torch.Tensor: Cosine similarity
|
| 84 |
+
"""
|
| 85 |
+
a_norm = normalize_embeddings(a)
|
| 86 |
+
b_norm = normalize_embeddings(b)
|
| 87 |
+
return torch.sum(a_norm * b_norm, dim=-1)
|
| 88 |
+
|
| 89 |
+
def batch_cosine_similarity(embeddings1: torch.Tensor, embeddings2: torch.Tensor) -> torch.Tensor:
|
| 90 |
+
"""
|
| 91 |
+
Calculate batch cosine similarity
|
| 92 |
+
|
| 93 |
+
Args:
|
| 94 |
+
embeddings1 (torch.Tensor): Embeddings group 1, shape [N, dim]
|
| 95 |
+
embeddings2 (torch.Tensor): Embeddings group 2, shape [M, dim]
|
| 96 |
+
|
| 97 |
+
Returns:
|
| 98 |
+
torch.Tensor: Similarity matrix, shape [N, M]
|
| 99 |
+
"""
|
| 100 |
+
embeddings1_norm = normalize_embeddings(embeddings1)
|
| 101 |
+
embeddings2_norm = normalize_embeddings(embeddings2)
|
| 102 |
+
|
| 103 |
+
return torch.matmul(embeddings1_norm, embeddings2_norm.transpose(0, 1))
|
| 104 |
+
|
| 105 |
+
def split_embeddings_by_shift_matrix(embeddings: torch.Tensor, shift_matrix: torch.Tensor) -> list:
|
| 106 |
+
"""
|
| 107 |
+
Split embeddings based on shift matrix
|
| 108 |
+
|
| 109 |
+
Args:
|
| 110 |
+
embeddings (torch.Tensor): Embeddings, shape [seq_len, hidden_dim]
|
| 111 |
+
shift_matrix (torch.Tensor): shift matrix, shape [num_chunks, seq_len]
|
| 112 |
+
|
| 113 |
+
Returns:
|
| 114 |
+
list: List of split embeddings
|
| 115 |
+
"""
|
| 116 |
+
split_embeddings = []
|
| 117 |
+
num_chunks, seq_len = shift_matrix.shape
|
| 118 |
+
|
| 119 |
+
for chunk_idx in range(num_chunks):
|
| 120 |
+
mask = shift_matrix[chunk_idx] # [seq_len]
|
| 121 |
+
indices = torch.nonzero(mask, as_tuple=True)[0] # Get indices of non-zero positions
|
| 122 |
+
|
| 123 |
+
if len(indices) > 0:
|
| 124 |
+
chunk_embeddings = embeddings[indices] # [chunk_size, hidden_dim]
|
| 125 |
+
split_embeddings.append(chunk_embeddings)
|
| 126 |
+
|
| 127 |
+
return split_embeddings
|
| 128 |
+
|
| 129 |
+
def pool_embeddings(embeddings: torch.Tensor, method: str = 'mean') -> torch.Tensor:
|
| 130 |
+
"""
|
| 131 |
+
Pool embeddings
|
| 132 |
+
|
| 133 |
+
Args:
|
| 134 |
+
embeddings (torch.Tensor): Embeddings, shape [seq_len, hidden_dim]
|
| 135 |
+
method (str): Pooling method, optional 'mean', 'max', 'first', 'last'
|
| 136 |
+
|
| 137 |
+
Returns:
|
| 138 |
+
torch.Tensor: Pooled vector, shape [hidden_dim]
|
| 139 |
+
"""
|
| 140 |
+
if method == 'mean':
|
| 141 |
+
return torch.mean(embeddings, dim=0)
|
| 142 |
+
elif method == 'max':
|
| 143 |
+
return torch.max(embeddings, dim=0)[0]
|
| 144 |
+
elif method == 'first':
|
| 145 |
+
return embeddings[0]
|
| 146 |
+
elif method == 'last':
|
| 147 |
+
return embeddings[-1]
|
| 148 |
+
else:
|
| 149 |
+
raise ValueError(f"Unknown pooling method: {method}")
|
| 150 |
+
|
| 151 |
+
def aggregate_chunk_embeddings(split_embeddings: list, method: str = 'mean') -> torch.Tensor:
|
| 152 |
+
"""
|
| 153 |
+
Aggregate chunk embeddings
|
| 154 |
+
|
| 155 |
+
Args:
|
| 156 |
+
split_embeddings (list): List of split embeddings
|
| 157 |
+
method (str): Aggregation method
|
| 158 |
+
|
| 159 |
+
Returns:
|
| 160 |
+
torch.Tensor: Aggregated embeddings, shape [num_chunks, hidden_dim]
|
| 161 |
+
"""
|
| 162 |
+
if not split_embeddings:
|
| 163 |
+
return torch.tensor([])
|
| 164 |
+
|
| 165 |
+
aggregated = []
|
| 166 |
+
for chunk_embeddings in split_embeddings:
|
| 167 |
+
pooled = pool_embeddings(chunk_embeddings, method)
|
| 168 |
+
aggregated.append(pooled)
|
| 169 |
+
|
| 170 |
+
return torch.stack(aggregated)
|
| 171 |
+
|
| 172 |
+
def safe_tensor_to_numpy(tensor: torch.Tensor) -> np.ndarray:
|
| 173 |
+
"""
|
| 174 |
+
Safely convert tensor to numpy array
|
| 175 |
+
|
| 176 |
+
Args:
|
| 177 |
+
tensor (torch.Tensor): Input tensor
|
| 178 |
+
|
| 179 |
+
Returns:
|
| 180 |
+
np.ndarray: Numpy array
|
| 181 |
+
"""
|
| 182 |
+
if tensor.requires_grad:
|
| 183 |
+
tensor = tensor.detach()
|
| 184 |
+
|
| 185 |
+
if tensor.is_cuda:
|
| 186 |
+
tensor = tensor.cpu()
|
| 187 |
+
|
| 188 |
+
return tensor.numpy()
|
| 189 |
+
|
| 190 |
+
def ensure_tensor_on_device(tensor: torch.Tensor, device: torch.device) -> torch.Tensor:
|
| 191 |
+
"""
|
| 192 |
+
Ensure tensor is on specified device
|
| 193 |
+
|
| 194 |
+
Args:
|
| 195 |
+
tensor (torch.Tensor): Input tensor
|
| 196 |
+
device (torch.device): Target device
|
| 197 |
+
|
| 198 |
+
Returns:
|
| 199 |
+
torch.Tensor: Tensor on target device
|
| 200 |
+
"""
|
| 201 |
+
if tensor.device != device:
|
| 202 |
+
tensor = tensor.to(device)
|
| 203 |
+
return tensor
|
| 204 |
+
|
| 205 |
+
def get_available_device() -> torch.device:
|
| 206 |
+
"""
|
| 207 |
+
Get available device
|
| 208 |
+
|
| 209 |
+
Returns:
|
| 210 |
+
torch.device: Available device
|
| 211 |
+
"""
|
| 212 |
+
if torch.cuda.is_available():
|
| 213 |
+
return torch.device('cuda')
|
| 214 |
+
elif hasattr(torch.backends, 'mps') and torch.backends.mps.is_available():
|
| 215 |
+
return torch.device('mps')
|
| 216 |
+
else:
|
| 217 |
+
return torch.device('cpu')
|
| 218 |
+
|
| 219 |
+
def print_tensor_info(tensor: torch.Tensor, name: str = "tensor"):
|
| 220 |
+
"""
|
| 221 |
+
Print tensor info
|
| 222 |
+
|
| 223 |
+
Args:
|
| 224 |
+
tensor (torch.Tensor): Input tensor
|
| 225 |
+
name (str): Tensor name
|
| 226 |
+
"""
|
| 227 |
+
print(f"{name}:")
|
| 228 |
+
print(f" Shape: {tensor.shape}")
|
| 229 |
+
print(f" Data Type: {tensor.dtype}")
|
| 230 |
+
print(f" Device: {tensor.device}")
|
| 231 |
+
print(f" Requires Grad: {tensor.requires_grad}")
|
| 232 |
+
if tensor.numel() > 0:
|
| 233 |
+
print(f" Value Range: [{tensor.min().item():.6f}, {tensor.max().item():.6f}]")
|
| 234 |
+
print(f" Mean: {tensor.mean().item():.6f}")
|
| 235 |
+
print(f" Std Dev: {tensor.std().item():.6f}")
|