giuseppe-tanzi committed
Commit b01dfee · verified · 1 Parent(s): c80cdc8

Upload folder using huggingface_hub

Files changed (6):
  1. README.md +207 -0
  2. config.json +11 -0
  3. data_collator.py +88 -0
  4. example_usage.py +48 -0
  5. pytorch_model.bin +3 -0
  6. requirements.txt +8 -0
README.md ADDED
@@ -0,0 +1,207 @@
---
language:
- multilingual
tags:
- audio
- text
- multimodal
- seamless
- subtitle-editing-time-prediction
- translation-aware
license: apache-2.0
library_name: transformers
base_model: facebook/hf-seamless-m4t-medium
---

# videoloc/seamless-translation

## Model Description

This is a **SeamlessTranslation** model that processes audio and text inputs with translation awareness to predict **Time To Edit (TTE)** for subtitle segments. Given an audio segment and its corresponding subtitle text, the model predicts how much time (in seconds) is required to edit/refine that subtitle segment, while taking into account whether the subtitle is a translation or original content.

The model extends the basic SeamlessM4T architecture with a translation feature that helps distinguish between original and translated subtitle content, improving TTE prediction accuracy in multilingual scenarios.

### Key Features

- **Translation-Aware Processing**: Distinguishes between original and translated content
- **Multimodal Processing**: Simultaneously processes audio (16kHz) and text inputs
- **Frozen Encoders**: Uses pre-trained SeamlessM4T encoders (frozen for stability)
- **Enhanced Architecture**: Adds a translation embedding to the basic model
- **TTE Prediction**: Predicts the editing time required for subtitle segments
- **Direct Output**: Raw time values in seconds for immediate use

## Model Architecture

The model extends the basic SeamlessM4T architecture with translation awareness (a code sketch follows the list):

1. **Audio Processing**:
   - SeamlessM4T speech encoder (frozen) processes raw audio input
   - Audio projection layer maps speech encoder output to 1024 dimensions
   - Mean pooling over the sequence length gives a fixed-size audio embedding

2. **Text Processing**:
   - SeamlessM4T text encoder (frozen) processes tokenized text input
   - Text projection layer maps text encoder output to 1024 dimensions
   - Mean pooling over the sequence length gives a fixed-size text embedding

3. **Translation Feature Processing**:
   - Binary translation flag (0/1) indicating original vs. translated content
   - Translation projection layer maps the binary input to 64 dimensions
   - The learned embedding helps the model distinguish translation effects

4. **Feature Fusion**:
   - Audio, text, and translation embeddings are concatenated (2112 total dimensions)
   - Simple concatenation without complex cross-modal interactions

5. **Regression Head**:
   - Multi-layer perceptron: 2112 → 1024 → 512 → 256 → 1
   - ReLU activations and dropout for regularization
   - Single output for TTE prediction (regression, in seconds)

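The sketch below shows the fusion and regression head described above, assuming mean-pooled encoder outputs as inputs. The frozen SeamlessM4T encoders are omitted, module names are illustrative rather than the checkpoint's actual parameter names, and the exact dropout placement is an assumption; dimensions follow the list above and `config.json`.

```python
import torch
import torch.nn as nn

class TranslationAwareFusionHead(nn.Module):
    """Sketch of the fusion + regression head (encoders omitted)."""

    def __init__(self, audio_dim=1024, text_dim=1024, hidden_size=1024,
                 translation_dim=64, dropout_prob=0.1):
        super().__init__()
        # Project both modalities into the shared 1024-dim space
        self.audio_proj = nn.Linear(audio_dim, hidden_size)
        self.text_proj = nn.Linear(text_dim, hidden_size)
        # Map the binary flag (0/1) to a learned 64-dim embedding
        self.translation_proj = nn.Linear(1, translation_dim)
        fused_dim = 2 * hidden_size + translation_dim  # 2112
        # MLP: 2112 -> 1024 -> 512 -> 256 -> 1
        self.head = nn.Sequential(
            nn.Linear(fused_dim, 1024), nn.ReLU(), nn.Dropout(dropout_prob),
            nn.Linear(1024, 512), nn.ReLU(), nn.Dropout(dropout_prob),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, 1),  # raw TTE in seconds
        )

    def forward(self, audio_emb, text_emb, is_translation):
        a = self.audio_proj(audio_emb)                            # (B, 1024)
        t = self.text_proj(text_emb)                              # (B, 1024)
        tr = self.translation_proj(is_translation.unsqueeze(-1))  # (B, 64)
        return self.head(torch.cat([a, t, tr], dim=-1)).squeeze(-1)  # (B,)

# Smoke test with random pooled embeddings
head = TranslationAwareFusionHead()
audio_emb = torch.randn(2, 1024)   # pooled speech-encoder outputs
text_emb = torch.randn(2, 1024)    # pooled text-encoder outputs
flags = torch.tensor([0.0, 1.0])   # original vs. translated
print(head(audio_emb, text_emb, flags).shape)  # torch.Size([2])
```
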
## Quick Start

### Installation
```bash
pip install transformers torch torchaudio huggingface_hub
```

### Basic Usage
```python
from transformers import AutoModel, AutoConfig
from huggingface_hub import hf_hub_download
import torch
import numpy as np
import importlib.util

# Load model
model = AutoModel.from_pretrained("videoloc/seamless-translation")
config = AutoConfig.from_pretrained("videoloc/seamless-translation")

# Load the data collator (included in this repo)
collator_file = hf_hub_download(repo_id="videoloc/seamless-translation", filename="data_collator.py")
spec = importlib.util.spec_from_file_location("data_collator", collator_file)
collator_module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(collator_module)

# Initialize data collator
data_collator = collator_module.DataCollatorSimpleSeamless(
    processor="facebook/hf-seamless-m4t-medium",
    max_audio_length_sec=8.0,
    max_text_length=256
)

# Prepare your data with translation information
your_data = [
    {
        'raw_audio': np.random.randn(16000 * 5),  # 5 seconds at 16kHz
        'raw_text': "Your subtitle text here",
        'is_translation': 1,  # 1 for translated content, 0 for original
    }
]

# Process and run inference
batch = data_collator(your_data)
model.eval()
with torch.no_grad():
    outputs = model(**batch)
    tte_prediction = outputs.logits.item()

print(f"Predicted Time To Edit (TTE): {tte_prediction:.2f} seconds")
```

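The collator also handles several segments at once. A sketch, reusing `model` and `data_collator` from above and assuming the model returns one logit per segment as in the single-segment example:

```python
# Score multiple segments in a single forward pass
segments = [
    {'raw_audio': np.random.randn(16000 * 2), 'raw_text': "First line", 'is_translation': 0},
    {'raw_audio': np.random.randn(16000 * 6), 'raw_text': "Second line", 'is_translation': 1},
]
batch = data_collator(segments)
with torch.no_grad():
    outputs = model(**batch)
for seg, tte in zip(segments, outputs.logits.squeeze(-1).tolist()):
    print(f"{seg['raw_text']!r}: predicted TTE {tte:.2f}s")
```
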
## Model Details

- **Base Model**: SeamlessM4T (facebook/hf-seamless-m4t-medium)
- **Audio Encoder**: Frozen SeamlessM4T speech encoder
- **Text Encoder**: Frozen SeamlessM4T text encoder
- **Hidden Size**: 1024
- **Translation Embedding**: 64 dimensions
- **Audio Input**: 16kHz, max 8.0 seconds
- **Text Input**: Max 256 tokens
- **Translation Input**: Binary flag (0/1)
- **Output**: Single regression value (TTE in seconds)
- **Task**: Subtitle editing time prediction

## Data Format

Your input data should be a list of dictionaries with:
- `raw_audio`: NumPy array of audio samples (16kHz sampling rate)
- `raw_text`: String of subtitle text
- `is_translation`: Binary flag (1 for translated, 0 for original content)
- `labels`: Target TTE values in seconds (optional, for training)

Example:
```python
data = [
    {
        'raw_audio': audio_samples,  # shape: (num_samples,) at 16kHz
        'raw_text': "Subtitle text content",
        'is_translation': 1,  # 1 = translated, 0 = original
        'labels': 2.5  # optional TTE target value in seconds
    }
]
```

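If your audio lives on disk rather than in a NumPy buffer, a sketch like the following (the file name is hypothetical) produces the expected 16kHz mono array:

```python
import torchaudio

# Load an audio file, downmix to mono, and resample to 16kHz
waveform, sr = torchaudio.load("segment.wav")  # hypothetical path
waveform = waveform.mean(dim=0)                # (channels, n) -> (n,)
if sr != 16000:
    waveform = torchaudio.functional.resample(waveform, orig_freq=sr, new_freq=16000)

data = [{
    'raw_audio': waveform.numpy(),
    'raw_text': "Subtitle text content",
    'is_translation': 0,
}]
```
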
## Performance Metrics

- **Best Eval RMSE**: 33.34 (TTE is predicted in raw seconds, so the RMSE is in seconds)

## Training Details

- **Base Model**: facebook/hf-seamless-m4t-medium
- **Model Type**: seamless_with_translation
- **Epochs**: 10
- **Batch Size (Train)**: 32
- **Batch Size (Eval)**: 64
- **Learning Rate**: 1.2e-4
- **LR Scheduler**: cosine_with_restarts
- **Warmup Ratio**: 0.05
- **Weight Decay**: 0.001
- **Optimizer**: AdamW (torch)
- **Max Grad Norm**: 1.0
- **FP16**: True
- **Early Stopping Patience**: 5
- **Audio Max Length**: 8.0 seconds
- **Text Max Length**: 256 tokens
- **Sample Rate**: 16kHz
- **Translation Feature**: Binary flag (0/1)
- **Normalization**: None (raw values)
- **Dataset Split**: 80/20 train/test
- **Random Seed**: 42
- **Metric**: RMSE (lower is better)

## Training Configuration

The model was trained with the following specifications:

- **Dataset**: Multimodal audio-subtitle pairs with translation annotations
- **Train/Test Split**: 80/20 with random seed 42
- **Audio Processing**: 16kHz sampling, max 8.0 seconds, no offset
- **Text Processing**: Max 256 tokens
- **Translation Feature**: Binary flag indicating original vs. translated content
- **Normalization**: None (raw TTE values in seconds)
- **Caching**: Audio segments cached and compressed for efficiency

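As a rough guide, the hyperparameters above map onto `transformers.TrainingArguments` along these lines. This is a sketch, not the actual training script: the output directory and the metric name are assumptions.

```python
from transformers import TrainingArguments, EarlyStoppingCallback

training_args = TrainingArguments(
    output_dir="seamless-translation-tte",  # hypothetical
    num_train_epochs=10,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=64,
    learning_rate=1.2e-4,
    lr_scheduler_type="cosine_with_restarts",
    warmup_ratio=0.05,
    weight_decay=0.001,
    optim="adamw_torch",
    max_grad_norm=1.0,
    fp16=True,
    seed=42,
    load_best_model_at_end=True,
    metric_for_best_model="eval_rmse",  # assumes an RMSE metric is reported
    greater_is_better=False,
)
# Early stopping with patience 5, matching the table above
early_stopping = EarlyStoppingCallback(early_stopping_patience=5)
```
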
## Usage Notes

- This is the **translation-aware** variant; it includes the binary translation feature
- For the basic model without translation features, see `seamless-basic`
- For language pair embeddings, see `seamless-langpairs`
- Model expects 16kHz audio input (resample beforehand if needed; see the loading sketch above)
- The translation flag significantly impacts predictions (see the comparison sketch below)
- No feature normalization applied; the model outputs raw TTE predictions in seconds
- Optimized for subtitle editing time estimation tasks

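To see the effect of the flag directly, you can score the same segment twice, toggling only `is_translation` (a sketch, reusing `model` and `data_collator` from the Quick Start):

```python
# Same audio and text, scored as original vs. translated content
segment = {
    'raw_audio': np.random.randn(16000 * 4),  # 4 seconds at 16kHz
    'raw_text': "The same subtitle text",
}
model.eval()
for flag in (0, 1):
    batch = data_collator([{**segment, 'is_translation': flag}])
    with torch.no_grad():
        tte = model(**batch).logits.item()
    print(f"is_translation={flag}: predicted TTE {tte:.2f}s")
```
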
## Limitations

- Maximum audio length: 8.0 seconds
- Maximum text length: 256 tokens
- Requires translation annotation in training data
- Designed for TTE prediction, not general audio-text matching
- Performance may vary on out-of-domain content
- Requires specific data preprocessing (use the included data collator)

## Related Models

- **seamless-basic**: Basic audio+text model without translation features
- **seamless-langpairs**: Includes language pair embeddings for fine-grained multilingual control
config.json ADDED
@@ -0,0 +1,11 @@
{
  "architectures": [
    "HFSeamlessTranslation"
  ],
  "dropout_prob": 0.1,
  "hidden_size": 1024,
  "model_type": "seamless_translation",
  "seamless_model_name": "facebook/hf-seamless-m4t-medium",
  "torch_dtype": "float32",
  "transformers_version": "4.50.2"
}
data_collator.py ADDED
@@ -0,0 +1,88 @@
import torch
import numpy as np
from transformers import AutoProcessor
from typing import Dict, List, Union
import logging

logger = logging.getLogger(__name__)

class DataCollatorSimpleSeamless:
    def __init__(
        self,
        processor: str,
        sample_rate: int = 16000,
        max_audio_length_sec: float = 8.0,
        max_text_length: int = 256,
        normalization_type: str = "none"
    ):
        """Initialize the data collator.

        Args:
            processor: Name or path of the pretrained processor to load.
            sample_rate: Audio sample rate in Hz.
            max_audio_length_sec: Maximum audio length in seconds.
            max_text_length: Maximum text length in tokens.
            normalization_type: Normalization applied to labels. Options: "log1p", "none".
        """
        logger.info(f"Loading processor: {processor}")
        self.processor = AutoProcessor.from_pretrained(processor)

        self.sample_rate = sample_rate
        self.max_audio_sample_length = int(max_audio_length_sec * sample_rate)
        self.max_text_length = max_text_length
        self.normalization_type = normalization_type

    def __call__(self, batch: List[Dict[str, Union[np.ndarray, str, float]]]) -> Dict[str, torch.Tensor]:
        """Process a batch of raw features into model inputs."""
        # Extract raw data
        raw_audios = [item['raw_audio'] for item in batch]
        raw_texts = [item['raw_text'] for item in batch]

        raw_audios = [torch.tensor(audio) for audio in raw_audios]

        # Pad to the longest segment, truncate to max_audio_sample_length samples
        audio_inputs = self.processor(
            audios=raw_audios,
            sampling_rate=self.sample_rate,
            return_tensors="pt",
            padding="longest",
            truncation=True,
            max_length=self.max_audio_sample_length,
        )

        text_inputs = self.processor(
            text=raw_texts,
            return_tensors="pt",
            padding="longest",
            truncation=True,
            max_length=self.max_text_length,
        )

        # Extract translation features (default 0 = original content)
        is_translation = torch.tensor([item.get('is_translation', 0) for item in batch], dtype=torch.float32)

        # Extract language pair features (default 0; used by the langpairs variant)
        language_pair_id = torch.tensor([item.get('language_pair_id', 0) for item in batch], dtype=torch.long)

        if 'labels' in batch[0]:
            labels = [item['labels'] for item in batch]
            labels = torch.tensor(labels, dtype=torch.float32)

            # Apply normalization based on type
            if self.normalization_type == "log1p":
                labels = torch.log1p(labels)
            elif self.normalization_type == "none":
                pass
            else:
                raise ValueError(f"Unknown normalization type: {self.normalization_type}")
        else:
            labels = None

        return {
            'input_features': audio_inputs['input_features'],
            'audio_attention_mask': audio_inputs.get('attention_mask'),
            'input_ids': text_inputs['input_ids'],
            'text_attention_mask': text_inputs['attention_mask'],
            'is_translation': is_translation,
            'language_pair_id': language_pair_id,
            **({'labels': labels} if labels is not None else {})
        }
example_usage.py ADDED
@@ -0,0 +1,48 @@
#!/usr/bin/env python3
# Example usage for videoloc/seamless-translation

from transformers import AutoModel, AutoConfig
from huggingface_hub import hf_hub_download
import torch
import numpy as np
import importlib.util

def load_model_and_collator():
    model = AutoModel.from_pretrained("videoloc/seamless-translation")
    config = AutoConfig.from_pretrained("videoloc/seamless-translation")

    # Load data collator
    collator_file = hf_hub_download(repo_id="videoloc/seamless-translation", filename="data_collator.py")
    spec = importlib.util.spec_from_file_location("data_collator", collator_file)
    collator_module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(collator_module)

    data_collator = collator_module.DataCollatorSimpleSeamless(
        processor="facebook/hf-seamless-m4t-medium",
        max_audio_length_sec=8.0,
        max_text_length=256
    )

    return model, data_collator

def example_inference():
    model, collator = load_model_and_collator()

    # Example data with translation awareness
    data = [{
        'raw_audio': np.random.randn(16000 * 3),  # 3 seconds at 16kHz
        'raw_text': "Example subtitle text for temporal alignment",
        'is_translation': 1,  # 1 for translated content, 0 for original
    }]

    batch = collator(data)
    model.eval()
    with torch.no_grad():
        outputs = model(**batch)
        tte_prediction = outputs.logits.item()

    print(f"Predicted Time To Edit (TTE): {tte_prediction:.2f} seconds")
    return tte_prediction

if __name__ == "__main__":
    example_inference()
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:3189ed6564e0632097e7fcc69a2a44c144b2fff0dc640fb1267829aff401583a
size 4858203091
requirements.txt ADDED
@@ -0,0 +1,8 @@
transformers>=4.50.2
torch>=2.6.0
torchaudio>=2.6.0
huggingface_hub>=0.33.0
numpy>=2.2.3
sentencepiece>=0.2.0
accelerate>=1.5.2
soundfile>=0.13.1