AlexHung29629 committed
Commit 8a2d277 · verified · 1 Parent(s): 294cee0

Upload processor
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,199 @@
+ ---
+ library_name: transformers
+ tags: []
+ ---
+
+ # Model Card for Model ID
+
+ <!-- Provide a quick summary of what the model is/does. -->
+
+
+
+ ## Model Details
+
+ ### Model Description
+
+ <!-- Provide a longer summary of what this model is. -->
+
+ This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
+
+ - **Developed by:** [More Information Needed]
+ - **Funded by [optional]:** [More Information Needed]
+ - **Shared by [optional]:** [More Information Needed]
+ - **Model type:** [More Information Needed]
+ - **Language(s) (NLP):** [More Information Needed]
+ - **License:** [More Information Needed]
+ - **Finetuned from model [optional]:** [More Information Needed]
+
+ ### Model Sources [optional]
+
+ <!-- Provide the basic links for the model. -->
+
+ - **Repository:** [More Information Needed]
+ - **Paper [optional]:** [More Information Needed]
+ - **Demo [optional]:** [More Information Needed]
+
+ ## Uses
+
+ <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
+
+ ### Direct Use
+
+ <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
+
+ [More Information Needed]
+
+ ### Downstream Use [optional]
+
+ <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
+
+ [More Information Needed]
+
+ ### Out-of-Scope Use
+
+ <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
+
+ [More Information Needed]
+
+ ## Bias, Risks, and Limitations
+
+ <!-- This section is meant to convey both technical and sociotechnical limitations. -->
+
+ [More Information Needed]
+
+ ### Recommendations
+
+ <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
+
+ Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
+
+ ## How to Get Started with the Model
+
+ Use the code below to get started with the model.
+
+ [More Information Needed]
+
+ ## Training Details
+
+ ### Training Data
+
+ <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
+
+ [More Information Needed]
+
+ ### Training Procedure
+
+ <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
+
+ #### Preprocessing [optional]
+
+ [More Information Needed]
+
+
+ #### Training Hyperparameters
+
+ - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
+
+ #### Speeds, Sizes, Times [optional]
+
+ <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
+
+ [More Information Needed]
+
+ ## Evaluation
+
+ <!-- This section describes the evaluation protocols and provides the results. -->
+
+ ### Testing Data, Factors & Metrics
+
+ #### Testing Data
+
+ <!-- This should link to a Dataset Card if possible. -->
+
+ [More Information Needed]
+
+ #### Factors
+
+ <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
+
+ [More Information Needed]
+
+ #### Metrics
+
+ <!-- These are the evaluation metrics being used, ideally with a description of why. -->
+
+ [More Information Needed]
+
+ ### Results
+
+ [More Information Needed]
+
+ #### Summary
+
+
+
+ ## Model Examination [optional]
+
+ <!-- Relevant interpretability work for the model goes here -->
+
+ [More Information Needed]
+
+ ## Environmental Impact
+
+ <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
+
+ Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
+
+ - **Hardware Type:** [More Information Needed]
+ - **Hours used:** [More Information Needed]
+ - **Cloud Provider:** [More Information Needed]
+ - **Compute Region:** [More Information Needed]
+ - **Carbon Emitted:** [More Information Needed]
+
+ ## Technical Specifications [optional]
+
+ ### Model Architecture and Objective
+
+ [More Information Needed]
+
+ ### Compute Infrastructure
+
+ [More Information Needed]
+
+ #### Hardware
+
+ [More Information Needed]
+
+ #### Software
+
+ [More Information Needed]
+
+ ## Citation [optional]
+
+ <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
+
+ **BibTeX:**
+
+ [More Information Needed]
+
+ **APA:**
+
+ [More Information Needed]
+
+ ## Glossary [optional]
+
+ <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
+
+ [More Information Needed]
+
+ ## More Information [optional]
+
+ [More Information Needed]
+
+ ## Model Card Authors [optional]
+
+ [More Information Needed]
+
+ ## Model Card Contact
+
+ [More Information Needed]
chat_template.jinja ADDED
@@ -0,0 +1 @@
+ {{ bos_token }}{% if messages[0]['role'] == 'system' %}{% set first_user_prefix = messages[0]['content'][0]['text'] + '\n\n' %}{% set loop_messages = messages[1:] %}{% else %}{% set first_user_prefix = '' %}{% set loop_messages = messages %}{% endif %}{% for message in loop_messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% set role = 'model' if message['role'] == 'assistant' else message['role'] %}{{ '<start_of_turn>' + role + '\n' + (first_user_prefix if loop.first else '') }}{% if role == 'model' and message.get('metadata') %}{% if message['metadata']['type'] == 'think' %}<think>{% if message['metadata'].get('range') %}<range>{{ message['metadata']['range'] }}</range>{% endif %}{% if message['metadata'].get('content') %}{{ message['metadata']['content'] | trim }}{% endif %}</think>{% elif message['metadata']['type'] == 'direct' %}<direct>{% endif %}{% if message['metadata'].get('function') %}<function>{{ message['metadata']['function'] | join(',') }}</function>{% endif %}{% endif %}{% if message['content'] is string %}{{ message['content'] | trim }}{% elif message['content'] is iterable %}{% for item in message['content'] %}{{ '<start_of_image>' if item['type']=='image' else '<start_of_audio>' if item['type']=='audio' else item['text'] | trim if item['type']=='text' else '' }}{% endfor %}{% else %}{{ raise_exception('Invalid content type') }}{% endif %}{{ '<end_of_turn>\n' }}{% endfor %}{% if add_generation_prompt %}{{ '<start_of_turn>model\n' }}{% endif %}
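
A minimal sketch of how this template might be exercised; the message structure (a `content` list of typed items, plus an optional `metadata` dict with `type`/`range`/`content`/`function` keys on assistant turns) is read off the template above. The repo path is a placeholder, the `range` value is illustrative, and passing extra message keys through `apply_chat_template` assumes a recent transformers version.

```python
# Sketch only; "path/to/this/repo" is a placeholder for this repository.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("path/to/this/repo")

messages = [
    {"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant."}]},
    {"role": "user", "content": [{"type": "audio"}, {"type": "text", "text": "Transcribe this clip."}]},
    # An assistant turn exercising the <think>/<range> metadata branch above.
    # "short" is a hypothetical range value, not one confirmed by this repo.
    {
        "role": "assistant",
        "content": [{"type": "text", "text": "Sure."}],
        "metadata": {"type": "think", "range": "short", "content": "reasoning trace"},
    },
]

# The system text is folded into the first user turn; audio items become <start_of_audio>.
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=False, tokenize=False)
print(prompt)
```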
preprocessor_config.json ADDED
The diff for this file is too large to render. See raw diff
 
processing_gemma3_omni.py ADDED
@@ -0,0 +1,635 @@
+ import re
+ from typing import List, Optional, Union, Dict, Any, Tuple  # Added Tuple
+
+ import numpy as np
+ import scipy.signal
+ import torch
+ from torch.nn.utils.rnn import pad_sequence
+ from transformers.audio_utils import AudioInput  # type: ignore
+ from transformers.feature_extraction_sequence_utils import SequenceFeatureExtractor
+ from transformers.feature_extraction_utils import BatchFeature
+ from transformers.image_utils import make_nested_list_of_images  # If image processing is used
+ from transformers.processing_utils import ProcessorMixin, ProcessingKwargs, ImagesKwargs
+ from transformers.utils import TensorType, to_py_obj, logging
+
+ # Constants
+ DEFAULT_SAMPLING_RATE = 16000
+ DEFAULT_N_FFT = 512
+ DEFAULT_WIN_LENGTH = 400
+ DEFAULT_HOP_LENGTH = 160
+ DEFAULT_N_MELS = 80
+ DEFAULT_COMPRESSION_RATE = 4
+ DEFAULT_QFORMER_RATE = 4  # Used for default in __init__ (as audio_downsample_rate)
+ DEFAULT_FEAT_STRIDE = 4  # Used for default in __init__
+ IMAGE_TOKEN_PATTERN = r"<\|image_\d+\|>"
+ AUDIO_TOKEN_PATTERN = r"<\|audio_\d+\|>"
+ DEFAULT_MAX_LENGTH = 16384
+
+ logger = logging.get_logger(__name__)
+
+
+ def speechlib_mel(sample_rate, n_fft, n_mels, fmin=None, fmax=None):
+     """Create a Mel filter-bank the same as SpeechLib FbankFC.
+
+     Args:
+         sample_rate (int): Sample rate in Hz. number > 0 [scalar]
+         n_fft (int): FFT size. int > 0 [scalar]
+         n_mels (int): Mel filter size. int > 0 [scalar]
+         fmin (float): Lowest frequency (in Hz). If None, use 0.0. float >= 0 [scalar]
+         fmax (float): Highest frequency (in Hz). If None, use sample_rate / 2. float >= 0 [scalar]
+
+     Returns:
+         out (numpy.ndarray): Mel transform matrix [shape=(n_mels, 1 + n_fft/2)]
+     """
+     bank_width = int(n_fft // 2 + 1)
+     if fmax is None:
+         fmax = sample_rate / 2
+     if fmin is None:
+         fmin = 0
+     assert fmin >= 0, "fmin cannot be negative"
+     assert fmin < fmax <= sample_rate / 2, "fmax must be in (fmin, sample_rate / 2]"
+
+     def mel(f):
+         return 1127.0 * np.log(1.0 + f / 700.0)
+
+     def bin2mel(fft_bin):
+         return 1127.0 * np.log(1.0 + fft_bin * sample_rate / (n_fft * 700.0))
+
+     def f2bin(f):
+         return int((f * n_fft / sample_rate) + 0.5)
+
+     # Spec 1: FFT bin range [f2bin(fmin) + 1, f2bin(fmax) - 1]
+     klo = f2bin(fmin) + 1
+     khi = f2bin(fmax)
+
+     khi = max(khi, klo)
+
+     # Spec 2: SpeechLib uses triangles in Mel space
+     mlo = mel(fmin)
+     mhi = mel(fmax)
+     m_centers = np.linspace(mlo, mhi, n_mels + 2)
+     ms = (mhi - mlo) / (n_mels + 1)
+
+     matrix = np.zeros((n_mels, bank_width), dtype=np.float32)
+     for m in range(0, n_mels):
+         left = m_centers[m]
+         center = m_centers[m + 1]
+         right = m_centers[m + 2]
+         for fft_bin in range(klo, khi):
+             mbin = bin2mel(fft_bin)
+             if left < mbin < right:
+                 matrix[m, fft_bin] = 1.0 - abs(center - mbin) / ms
+
+     return matrix
+
+
+ # --- Start of Refactored Audio Feature Extractor (to match Phi4M - Snippet A) ---
+ class Gemma3AudioFeatureExtractor(SequenceFeatureExtractor):  # MODIFIED CLASS NAME AND __INIT__
+     model_input_names = ["input_audio_embeds", "audio_embed_sizes", "audio_attention_mask"]
+
+     def __init__(
+         self,
+         audio_compression_rate: int = DEFAULT_COMPRESSION_RATE,  # ADDED DEFAULT
+         audio_downsample_rate: int = DEFAULT_QFORMER_RATE,  # ADDED DEFAULT (maps to qformer_rate)
+         audio_feat_stride: int = DEFAULT_FEAT_STRIDE,  # ADDED DEFAULT
+         feature_size: int = DEFAULT_N_MELS,  # Added default based on constants
+         sampling_rate: int = DEFAULT_SAMPLING_RATE,  # Added default based on constants
+         padding_value: float = 0.0,  # Added default
+         eightk_method: str = "fillzero",  # Added default for this custom param
+         **kwargs,
+     ):
+         # If feature_size, sampling_rate, or padding_value are in kwargs, they override the
+         # defaults. super().__init__ expects all three, so they are passed either way.
+         _feature_size = kwargs.pop("feature_size", feature_size)
+         _sampling_rate = kwargs.pop("sampling_rate", sampling_rate)
+         _padding_value = kwargs.pop("padding_value", padding_value)
+
+         super().__init__(feature_size=_feature_size, sampling_rate=_sampling_rate,
+                          padding_value=_padding_value, **kwargs)
+
+         self.compression_rate = audio_compression_rate
+         self.qformer_compression_rate = audio_downsample_rate
+         self.feat_stride = audio_feat_stride
+
+         self._eightk_method = eightk_method  # Use the argument, which has a default
+
+         # self.sampling_rate from super() is the target sampling rate, but the Phi4M logic
+         # hardcodes 16000 Hz for its Mel parameters.
+         if self.sampling_rate != 16000:
+             logger.warning(
+                 f"The feature extractor's target sampling rate is {self.sampling_rate}, "
+                 "but Phi4M-consistent Mel parameters are based on 16000 Hz. "
+                 "This might lead to inconsistencies if the input audio is not resampled to 16000 Hz by this extractor."
+             )
+
+         self._mel = speechlib_mel(16000, 512, 80, fmin=None, fmax=7690).T
+         self._hamming400 = np.hamming(400)
+         self._hamming200 = np.hamming(200)
+
+     def __call__(
+         self,
+         audios: List[Union[AudioInput, Tuple[np.ndarray, int]]],
+         return_tensors: Optional[Union[str, TensorType]] = None,
+         # sampling_rate: Optional[int] = None,  # This was in original B, but Phi4M gets sr from AudioInput
+     ):
+         returned_input_audio_embeds = []
+         returned_audio_embed_sizes = []
+         audio_frames_list = []
+
+         for audio_input_item in audios:
+             if not isinstance(audio_input_item, tuple) or len(audio_input_item) != 2:
+                 raise ValueError(
+                     "Each item in 'audios' must be a tuple (waveform: np.ndarray, sample_rate: int)."
+                 )
+             audio_data, sample_rate = audio_input_item  # sample_rate is from the input audio
+
+             if isinstance(audio_data, list):
+                 audio_data = np.array(audio_data, dtype=np.float32)
+             if not isinstance(audio_data, np.ndarray):
+                 raise TypeError(f"Waveform data must be a numpy array, got {type(audio_data)}")
+
+             # _extract_features handles resampling to self.sampling_rate (16000 Hz)
+             audio_embeds_np = self._extract_features(audio_data, sample_rate)
+
+             num_mel_frames = audio_embeds_np.shape[0]
+             current_audio_frames = num_mel_frames * self.feat_stride
+
+             audio_embed_size = self._compute_audio_embed_size(current_audio_frames)
+
+             returned_input_audio_embeds.append(torch.from_numpy(audio_embeds_np))
+             returned_audio_embed_sizes.append(torch.tensor(audio_embed_size).long())
+             audio_frames_list.append(current_audio_frames)
+
+         padded_input_audio_embeds = pad_sequence(
+             returned_input_audio_embeds, batch_first=True, padding_value=self.padding_value
+         )
+         stacked_audio_embed_sizes = torch.stack(returned_audio_embed_sizes, dim=0)
+
+         tensor_audio_frames_list = torch.tensor(audio_frames_list, dtype=torch.long)
+
+         max_audio_frames = 0
+         if len(audios) > 0 and tensor_audio_frames_list.numel() > 0:
+             max_audio_frames = tensor_audio_frames_list.max().item()
+
+         returned_audio_attention_mask = None
+         if max_audio_frames > 0:
+             if len(audios) > 1:
+                 returned_audio_attention_mask = torch.arange(
+                     0, max_audio_frames, device=tensor_audio_frames_list.device
+                 ).unsqueeze(0) < tensor_audio_frames_list.unsqueeze(1)
+             elif len(audios) == 1:
+                 returned_audio_attention_mask = torch.ones(
+                     1, max_audio_frames, dtype=torch.bool, device=tensor_audio_frames_list.device
+                 )
+
+         data = {
+             "input_audio_embeds": padded_input_audio_embeds,
+             "audio_embed_sizes": stacked_audio_embed_sizes,
+         }
+         if returned_audio_attention_mask is not None:
+             data["audio_attention_mask"] = returned_audio_attention_mask
+
+         return BatchFeature(data=data, tensor_type=return_tensors)
+
+     def _extract_spectrogram(self, wav: np.ndarray, fs: int) -> np.ndarray:
+         # This method expects fs to be the original sampling rate of wav.
+         # It will resample to self.sampling_rate (16000 Hz) or 8000 Hz as needed.
+         if wav.ndim > 1:
+             wav = np.squeeze(wav)
+         if len(wav.shape) == 2:
+             wav = wav.mean(axis=1).astype(np.float32)
+
+         wav = wav.astype(np.float32)
+
+         current_fs = fs
+         if current_fs > self.sampling_rate:  # self.sampling_rate is 16000
+             wav = scipy.signal.resample_poly(wav, self.sampling_rate, current_fs)
+             current_fs = self.sampling_rate
+         elif 8000 < current_fs < self.sampling_rate:
+             wav = scipy.signal.resample_poly(wav, 8000, current_fs)
+             current_fs = 8000
+         elif current_fs < 8000 and current_fs > 0:
+             logger.warning(f"Sample rate {current_fs} is less than 8000 Hz. Resampling to 8000 Hz.")
+             wav = scipy.signal.resample_poly(wav, 8000, current_fs)
+             current_fs = 8000
+         elif current_fs <= 0:
+             raise RuntimeError(f"Unsupported sample rate {current_fs}")
+
+         # After this block, current_fs is either 16000 Hz or 8000 Hz (or the original fs if it
+         # was already one of those); otherwise an error was raised.
+
+         if current_fs == 8000:
+             if self._eightk_method == "resample":
+                 wav = scipy.signal.resample_poly(wav, self.sampling_rate, 8000)
+                 current_fs = self.sampling_rate
+         elif current_fs != self.sampling_rate:
+             # This case should not be hit if the logic above is correct and self.sampling_rate is 16000.
+             raise RuntimeError(
+                 f"Audio sample rate {current_fs} not supported. Expected {self.sampling_rate} or 8000 for 8k methods.")
+
+         preemphasis_coeff = 0.97
+
+         # current_fs is now the rate for the STFT parameters (16000, or 8000 under fillzero)
+         if current_fs == 8000:  # This implies _eightk_method == "fillzero"
+             n_fft, win_length, hop_length, fft_window = 256, 200, 80, self._hamming200
+         elif current_fs == 16000:  # This is the standard path
+             n_fft, win_length, hop_length, fft_window = 512, 400, 160, self._hamming400
+         else:
+             raise RuntimeError(f"Inconsistent fs {current_fs} for parameter selection. Should be 16000 or 8000.")
+
+         if len(wav) < win_length:
+             wav = np.pad(wav, (0, win_length - len(wav)), 'constant', constant_values=(0.0,))
+
+         num_frames = (wav.shape[0] - win_length) // hop_length + 1
+         if num_frames <= 0:
+             # For n_fft=512 (16k) there are 257 output bins; for n_fft=256 (8k), 129.
+             # Under fillzero, the 8k case is padded to 257 later, so the number of
+             # frequency bins here depends on n_fft.
+             return np.zeros((0, n_fft // 2 + 1), dtype=np.float32)
+
+         y_frames = np.array(
+             [wav[i * hop_length: i * hop_length + win_length] for i in range(num_frames)],
+             dtype=np.float32,
+         )
+
+         _y_frames_rolled = np.roll(y_frames, 1, axis=1)
+         _y_frames_rolled[:, 0] = _y_frames_rolled[:, 1]
+         y_frames_preemphasized = (y_frames - preemphasis_coeff * _y_frames_rolled) * 32768.0
+
+         S = np.fft.rfft(fft_window * y_frames_preemphasized, n=n_fft, axis=1).astype(np.complex64)
+
+         if current_fs == 8000 and self._eightk_method == "fillzero":
+             # S.shape[1] is 129 for n_fft=256; the target is 257 for n_fft=512 equivalence.
+             target_bins = (512 // 2) + 1
+             S_core = S[:, :-1]  # Drop the 8 kHz Nyquist bin (1 bin)
+             # Pad to target_bins; columns to add: target_bins - S_core.shape[1]
+             padarray = np.zeros((S_core.shape[0], target_bins - S_core.shape[1]), dtype=S.dtype)
+             S = np.concatenate((S_core, padarray), axis=1)
+
+         spec = np.abs(S).astype(np.float32)
+         return spec
+
+     def _extract_features(self, wav: np.ndarray, fs: int) -> np.ndarray:
+         spec = self._extract_spectrogram(wav, fs)
+         if spec.shape[0] == 0:
+             # self.feature_size is n_mels (e.g. 80)
+             return np.zeros((0, self.feature_size), dtype=np.float32)
+
+         spec_power = spec ** 2
+         fbank_power = np.clip(spec_power.dot(self._mel), 1.0, None)
+         log_fbank = np.log(fbank_power).astype(np.float32)
+         return log_fbank
+
+     def _compute_audio_embed_size(self, audio_frames: int) -> int:
+         integer = audio_frames // self.compression_rate
+         remainder = audio_frames % self.compression_rate
+         result = integer if remainder == 0 else integer + 1
+
+         integer = result // self.qformer_compression_rate
+         remainder = result % self.qformer_compression_rate
+         result = integer if remainder == 0 else integer + 1
+         return result
+
+
+ class Gemma3ImagesKwargs(ImagesKwargs):
+     do_pan_and_scan: Optional[bool]
+     pan_and_scan_min_crop_size: Optional[int]
+     pan_and_scan_max_num_crops: Optional[int]
+     pan_and_scan_min_ratio_to_activate: Optional[float]
+     do_convert_rgb: Optional[bool]
+
+
+ class Gemma3ProcessorKwargs(ProcessingKwargs, total=False):
+     images_kwargs: Optional[Dict[str, Any]] = None
+     audio_kwargs: Optional[Dict[str, Any]] = None
+     text_kwargs: Optional[Dict[str, Any]] = None
+     _defaults = {
+         "text_kwargs": {"padding": False, "truncation": False, "max_length": DEFAULT_MAX_LENGTH},
+         "images_kwargs": {},
+         "audio_kwargs": {},
+     }
+
+
+ class Gemma3OmniProcessor(ProcessorMixin):
+     attributes = ["image_processor", "audio_processor", "tokenizer"]
+     valid_kwargs = ["chat_template", "image_seq_length"]
+
+     image_processor_class = "AutoImageProcessor"
+     audio_processor_class = "AutoFeatureExtractor"
+     tokenizer_class = "AutoTokenizer"
+
+     def __init__(
+         self,
+         image_processor=None,
+         audio_processor=None,  # An instance of Gemma3AudioFeatureExtractor can be passed here
+         tokenizer=None,
+         chat_template=None,
+         image_seq_length: int = 256,
+         **kwargs,
+     ):
+         super().__init__(
+             image_processor=image_processor,
+             audio_processor=audio_processor,
+             tokenizer=tokenizer,
+             chat_template=chat_template,
+             **kwargs,
+         )
+
+         self.image_seq_length = image_seq_length
+         if self.tokenizer is not None:
+             self.image_token_id = getattr(
+                 self.tokenizer, "image_token_id",
+                 self.tokenizer.unk_token_id if hasattr(self.tokenizer, "unk_token_id") else None,
+             )
+             self.boi_token = getattr(self.tokenizer, "boi_token", "<image>")
+             self.image_token = getattr(self.tokenizer, "image_token", "<image>")
+             self.eoi_token = getattr(self.tokenizer, "eoi_token", "")
+
+             self.audio_token_str_from_user_code = "<audio_soft_token>"  # Example
+             # Ensure this token is actually in the tokenizer vocab as a special token.
+             self.audio_token_id = self.tokenizer.convert_tokens_to_ids(self.audio_token_str_from_user_code)
+             if hasattr(self.tokenizer, "unk_token_id") and self.audio_token_id == self.tokenizer.unk_token_id:
+                 logger.warning(
+                     f"The audio token string '{self.audio_token_str_from_user_code}' maps to the UNK token. "
+                     "Please ensure it is added to the tokenizer's vocabulary as a special token."
+                 )
+             self.full_image_sequence = f"\n\n{self.boi_token}{''.join([self.image_token] * image_seq_length)}{self.eoi_token}\n\n"
+         else:
+             logger.error(
+                 "Gemma3OmniProcessor initialized, but self.tokenizer is None. "
+                 "Token-dependent attributes will use placeholders or defaults.")
+             self.image_token_id = None
+             self.boi_token = "<image>"
+             self.image_token = "<image>"
+             self.eoi_token = ""
+             self.audio_token_str_from_user_code = "<audio_soft_token>"
+             self.audio_token_id = -1  # Placeholder if tokenizer is missing
+             self.full_image_sequence = ""
+
+         # These attributes are specific to Gemma3OmniProcessor's internal _compute_audio_embed_size.
+         self.prompt_audio_compression_rate = kwargs.pop("prompt_audio_compression_rate", DEFAULT_COMPRESSION_RATE)
+         self.prompt_audio_qformer_rate = kwargs.pop("prompt_audio_qformer_rate", DEFAULT_QFORMER_RATE)
+         # self.prompt_audio_feat_stride = kwargs.pop("prompt_audio_feat_stride", DEFAULT_FEAT_STRIDE)  # Not used by its _compute_audio_embed_size
+
+         self.audio_placeholder_token = kwargs.pop("audio_placeholder_token", "<|audio_placeholder|>")
+
+     def _merge_kwargs(self, KwargsClassWithDefaults, tokenizer_init_kwargs, **kwargs_from_call):
+         final_kwargs = {}
+         _defaults = getattr(KwargsClassWithDefaults, "_defaults", {})
+         if not isinstance(_defaults, dict):
+             _defaults = {}
+
+         for modality_key, default_modality_kwargs in _defaults.items():
+             final_kwargs[modality_key] = default_modality_kwargs.copy()
+
+         for modality_key_in_call, modality_kwargs_in_call in kwargs_from_call.items():
+             if modality_key_in_call in final_kwargs:
+                 if isinstance(modality_kwargs_in_call, dict):
+                     final_kwargs[modality_key_in_call].update(modality_kwargs_in_call)
+             elif isinstance(modality_kwargs_in_call, dict):  # New modality not in defaults
+                 final_kwargs[modality_key_in_call] = modality_kwargs_in_call.copy()
+
+         if self.tokenizer:  # Ensure the tokenizer exists before accessing its attributes
+             for modality_key in final_kwargs:
+                 modality_dict = final_kwargs[modality_key]
+                 if isinstance(modality_dict, dict):  # Check that it is a dictionary
+                     for key_in_mod_dict in list(modality_dict.keys()):  # Iterate over keys
+                         if key_in_mod_dict in tokenizer_init_kwargs:
+                             value = (
+                                 getattr(self.tokenizer, key_in_mod_dict)
+                                 if hasattr(self.tokenizer, key_in_mod_dict)
+                                 else tokenizer_init_kwargs[key_in_mod_dict]
+                             )
+                             modality_dict[key_in_mod_dict] = value
+
+         if "text_kwargs" not in final_kwargs:  # Ensure text_kwargs exists
+             final_kwargs["text_kwargs"] = {}
+         final_kwargs["text_kwargs"]["truncation"] = final_kwargs["text_kwargs"].get("truncation", False)
+         final_kwargs["text_kwargs"]["max_length"] = final_kwargs["text_kwargs"].get("max_length", DEFAULT_MAX_LENGTH)
+
+         return final_kwargs
+
+     def _compute_audio_embed_size(self, audio_mel_frames: int) -> int:
+         integer = audio_mel_frames // self.prompt_audio_compression_rate
+         remainder = audio_mel_frames % self.prompt_audio_compression_rate
+         result = integer if remainder == 0 else integer + 1
+
+         # Second compression
+         integer = result // self.prompt_audio_qformer_rate
+         remainder = result % self.prompt_audio_qformer_rate
+         result = integer if remainder == 0 else integer + 1
+         return result
+
+     def __call__(
+         self,
+         text: Optional[Union[str, List[str]]] = None,
+         images: Optional[Any] = None,
+         audios: Optional[Union[AudioInput, List[AudioInput]]] = None,
+         sampling_rate: Optional[int] = None,  # sampling rate for raw audio arrays
+         return_tensors: Optional[Union[str, TensorType]] = None,
+         **kwargs: Any,
+     ) -> BatchFeature:
+         if text is None and images is None and audios is None:
+             raise ValueError("Provide at least one of `text`, `images`, or `audios`.")
+
+         final_rt = return_tensors  # Store the original return_tensors
+         # Properly merge kwargs for text, images, and audio
+         merged_call_kwargs = self._merge_kwargs(
+             Gemma3ProcessorKwargs,  # The class defining _defaults
+             self.tokenizer.init_kwargs if hasattr(self.tokenizer, 'init_kwargs') else {},  # Tokenizer defaults
+             **kwargs,  # User-provided kwargs from the call
+         )
+
+         # Determine the final return_tensors, prioritizing call > text_kwargs > default
+         if final_rt is None:  # Not specified in the call
+             final_rt = merged_call_kwargs.get("text_kwargs", {}).pop("return_tensors", TensorType.PYTORCH)
+         else:  # Specified in the call; remove from text_kwargs to avoid a conflict
+             merged_call_kwargs.get("text_kwargs", {}).pop("return_tensors", None)
+
+         if text is None:  # If no text, create empty strings based on the other inputs
+             num_samples = 0
+             if images is not None:
+                 _images_list = images if isinstance(images, list) and (
+                     not images or not isinstance(images[0], (int, float))) else [images]
+                 num_samples = len(_images_list)
+             elif audios is not None:
+                 # Check whether audios is a list of items or a single (waveform, sr) tuple.
+                 _audios_list = audios if isinstance(audios, list) and not (
+                     isinstance(audios[0], tuple) and isinstance(audios[0][0], (int, float))) else [audios]
+                 num_samples = len(_audios_list)
+             text = [""] * num_samples if num_samples > 0 else [""]  # Default to one empty string if no inputs
+
+         if isinstance(text, str):
+             text = [text]  # Ensure text is a list
+         if not (isinstance(text, list) and all(isinstance(t, str) for t in text)):
+             raise ValueError("Input `text` must be a string or a list of strings.")
+
+         image_features_dict = {}
+         if images is not None:
+             if self.image_processor is None:
+                 raise ValueError("Images provided but self.image_processor is None.")
+             # Ensure images are correctly batched; handles various image input types.
+             batched_images = make_nested_list_of_images(images)
+
+             _img_kwargs = merged_call_kwargs.get("images_kwargs", {})
+             # Pass return_tensors=None so tensors are handled later.
+             _img_proc_output = self.image_processor(batched_images, return_tensors=None, **_img_kwargs)
+             image_features_dict = _img_proc_output.data if isinstance(_img_proc_output, BatchFeature) else _img_proc_output
+
+             if len(text) == 1 and text[0] == "" and len(batched_images) > 0:
+                 # Text is the default empty string and images exist.
+                 text = [" ".join([self.boi_token] * len(img_batch)) for img_batch in batched_images]
+             elif len(batched_images) != len(text):  # If text was provided, ensure consistency
+                 raise ValueError(
+                     f"Inconsistent batch: {len(batched_images)} image groups, {len(text)} texts. "
+                     "Ensure one text prompt per image group."
+                 )
+
+             num_crops_popped = image_features_dict.pop("num_crops", None)
+             if num_crops_popped is not None:
+                 num_crops_all = to_py_obj(num_crops_popped)
+                 temp_text_img, current_crop_idx_offset = [], 0
+                 for batch_idx, (prompt, current_imgs_in_batch) in enumerate(zip(text, batched_images)):
+                     crops_for_this_batch_sample = []  # Number of *additional* crops for each original image
+                     if num_crops_all:  # If num_crops_all is not None or empty
+                         for _ in current_imgs_in_batch:  # For each original image in the current batch sample
+                             if current_crop_idx_offset < len(num_crops_all):
+                                 # num_crops_all contains total items (original + crops) for each image;
+                                 # we need the number of *additional* crops, assuming num_crops_all[i] >= 1.
+                                 crops_for_this_batch_sample.append(max(0, num_crops_all[current_crop_idx_offset] - 1))
+                                 current_crop_idx_offset += 1
+                             else:
+                                 crops_for_this_batch_sample.append(0)  # Should not happen if num_crops_all is correct
+
+                     image_placeholders_in_prompt = [m.start() for m in re.finditer(re.escape(self.boi_token), prompt)]
+                     processed_prompt = prompt
+
+                     # Iterate backwards to preserve indices for replacement.
+                     iter_count = min(len(crops_for_this_batch_sample), len(image_placeholders_in_prompt))
+                     for i_placeholder_idx in range(iter_count - 1, -1, -1):
+                         num_additional_crops_for_this_image = crops_for_this_batch_sample[i_placeholder_idx]
+                         original_token_idx_in_prompt = image_placeholders_in_prompt[i_placeholder_idx]
+
+                         if num_additional_crops_for_this_image > 0:
+                             # Replacement text: the original image placeholder plus placeholders
+                             # for the additional crops.
+                             replacement_text = self.boi_token + "".join(
+                                 [self.boi_token] * num_additional_crops_for_this_image)
+                             # Replace the single original boi_token with the new sequence.
+                             processed_prompt = (
+                                 processed_prompt[:original_token_idx_in_prompt] +
+                                 replacement_text +
+                                 processed_prompt[original_token_idx_in_prompt + len(self.boi_token):]
+                             )
+                     temp_text_img.append(processed_prompt)
+                 text = temp_text_img
+             # Replace all BOI tokens with the full image sequence (BOI + IMAGE * N + EOI).
+             # This assumes that, after any crop handling, self.boi_token still marks each image.
+             text = [p.replace(self.boi_token, self.full_image_sequence) for p in text]
+
+         audio_features_dict = {}
+         if audios is not None:
+             if self.audio_processor is None:
+                 raise ValueError("Audios provided but self.audio_processor is None.")
+
+             audio_call_kwargs = merged_call_kwargs.get("audio_kwargs", {})
+             # Pass sampling_rate from __call__ to the audio processor if provided (for raw arrays).
+             if sampling_rate is not None:
+                 audio_call_kwargs["sampling_rate"] = sampling_rate
+
+             # The audio processor (e.g. Gemma3AudioFeatureExtractor) returns its model_input_names,
+             # e.g. {"input_audio_embeds", "audio_embed_sizes", "audio_attention_mask"}.
+             _audio_proc_output = self.audio_processor(audios=audios, return_tensors=None, **audio_call_kwargs)
+             audio_features_dict = _audio_proc_output.data
+
+             new_text_with_audio = []
+
+             # Determine the number of audio items actually processed by the audio processor.
+             # This should match len(text) if batching is consistent; 'audio_attention_mask'
+             # or 'input_audio_embeds' indicates it.
+             num_audio_samples_processed = audio_features_dict[self.audio_processor.model_input_names[0]].shape[0]
+
+             if num_audio_samples_processed != len(text):
+                 raise ValueError(
+                     f"Inconsistent batch for audio/text: {num_audio_samples_processed} audio samples processed, "
+                     f"{len(text)} text prompts."
+                 )
+             # Sum of audio_attention_mask = number of valid (stretched) frames per sample.
+             frames_for_embed_size_calc = to_py_obj(
+                 audio_features_dict[self.audio_processor.model_input_names[2]].sum(axis=-1))
+
+             for i, prompt in enumerate(text):
+                 # num_soft_tokens is the final number of audio tokens to insert in the text,
+                 # calculated by Gemma3OmniProcessor's own method.
+                 num_soft_tokens = self._compute_audio_embed_size(frames_for_embed_size_calc[i])
+
+                 audio_token_sequence_str = self.audio_token_str_from_user_code * num_soft_tokens
+
+                 if self.audio_placeholder_token in prompt:
+                     prompt = prompt.replace(self.audio_placeholder_token, audio_token_sequence_str, 1)  # First only
+                 else:
+                     prompt += audio_token_sequence_str  # Append if no placeholder
+                 new_text_with_audio.append(prompt)
+             text = new_text_with_audio
+
+         text_tokenizer_kwargs = merged_call_kwargs.get("text_kwargs", {})
+         text_features_dict = self.tokenizer(text=text, return_tensors=None, **text_tokenizer_kwargs)
+
+         # Create token_type_ids
+         input_ids_list_of_lists = text_features_dict["input_ids"]
+         # Ensure it is a list of lists.
+         if not isinstance(input_ids_list_of_lists, list) or not (
+                 input_ids_list_of_lists and isinstance(input_ids_list_of_lists[0], list)):
+             if isinstance(input_ids_list_of_lists, (torch.Tensor, np.ndarray)):
+                 input_ids_list_of_lists = to_py_obj(input_ids_list_of_lists)  # To nested Python lists
+             elif isinstance(input_ids_list_of_lists, list) and (
+                     not input_ids_list_of_lists or isinstance(input_ids_list_of_lists[0], int)):
+                 input_ids_list_of_lists = [input_ids_list_of_lists]  # Wrap a single list
+
+         token_type_ids_list = []
+         for ids_sample in input_ids_list_of_lists:
+             types = [0] * len(ids_sample)  # 0 for text
+             for j, token_id_val in enumerate(ids_sample):
+                 if self.image_token_id is not None and token_id_val == self.image_token_id:
+                     types[j] = 1  # 1 for image
+                 elif self.audio_token_id != -1 and token_id_val == self.audio_token_id:  # Only if audio_token_id is valid
+                     types[j] = 2  # 2 for audio
+             token_type_ids_list.append(types)
+         text_features_dict["token_type_ids"] = token_type_ids_list
+
+         final_batch_data = {**text_features_dict}
+         if image_features_dict:
+             final_batch_data.update(image_features_dict)
+         if audio_features_dict:
+             final_batch_data.update(audio_features_dict)
+
+         # Convert all data to tensors if final_rt is specified.
+         return BatchFeature(data=final_batch_data, tensor_type=final_rt)
+
+     def batch_decode(self, *args, **kwargs):
+         return self.tokenizer.batch_decode(*args, **kwargs)
+
+     def decode(self, *args, **kwargs):
+         return self.tokenizer.decode(*args, **kwargs)
+
+     @property
+     def model_input_names(self) -> List[str]:
+         input_names = set()
+         if hasattr(self, 'tokenizer') and self.tokenizer is not None:
+             # Make sure model_input_names is a list/set before merging.
+             tokenizer_inputs = self.tokenizer.model_input_names
+             if isinstance(tokenizer_inputs, (list, set)):
+                 input_names.update(tokenizer_inputs)
+             else:  # Fallback if it is a single string
+                 input_names.add(str(tokenizer_inputs))
+             input_names.add("token_type_ids")
+
+         if hasattr(self, 'image_processor') and self.image_processor is not None:
+             # Similar check for image_processor
+             image_inputs = self.image_processor.model_input_names
+             if isinstance(image_inputs, (list, set)):
+                 input_names.update(image_inputs)
+             else:
+                 input_names.add(str(image_inputs))
+
+         if hasattr(self, 'audio_processor') and self.audio_processor is not None:
+             # Use model_input_names from the instantiated audio processor; this correctly
+             # reflects the names from Gemma3AudioFeatureExtractor when it is used.
+             audio_inputs = self.audio_processor.model_input_names
+             if isinstance(audio_inputs, (list, set)):
+                 input_names.update(audio_inputs)
+             else:
+                 input_names.add(str(audio_inputs))
+
+         return list(input_names)
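
As a quick sanity check, the feature extractor can be exercised on its own; a minimal sketch, assuming the file above is importable as `processing_gemma3_omni.py`. With 16 kHz input, a 400-sample window and 160-sample hop give `(16000 - 400) // 160 + 1 = 98` mel frames per second, each frame is stretched by `feat_stride=4`, and the embed size is two successive ceiling divisions by 4.

```python
# Sketch only; assumes processing_gemma3_omni.py is on the Python path.
import numpy as np
from processing_gemma3_omni import Gemma3AudioFeatureExtractor

extractor = Gemma3AudioFeatureExtractor()  # 80 mel bins, 16 kHz target, defaults as above

one_second = np.random.randn(16000).astype(np.float32)  # synthetic 1 s waveform
batch = extractor(audios=[(one_second, 16000)], return_tensors="pt")

print(batch["input_audio_embeds"].shape)    # (1, 98, 80): 98 mel frames x 80 mel bins
print(batch["audio_embed_sizes"])           # tensor([25]): ceil(ceil(392 / 4) / 4)
print(batch["audio_attention_mask"].shape)  # (1, 392): 98 mel frames * feat_stride 4
```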
processor_config.json ADDED
@@ -0,0 +1,7 @@
+ {
+   "auto_map": {
+     "AutoProcessor": "processing_gemma3_omni.Gemma3OmniProcessor"
+   },
+   "image_seq_length": 256,
+   "processor_class": "Gemma3OmniProcessor"
+ }
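
The `auto_map` entry is what lets `AutoProcessor` resolve the custom class from this repo at load time; a hedged end-to-end sketch follows, where the repo id is a placeholder and `trust_remote_code=True` is needed for the dynamic import of `processing_gemma3_omni.py`.

```python
# Sketch only; "user/gemma3-omni" is a placeholder repo id, not a confirmed one.
import numpy as np
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("user/gemma3-omni", trust_remote_code=True)

wave = np.random.randn(32000).astype(np.float32)  # 2 s of synthetic audio at 16 kHz

# The default audio placeholder in __call__ is "<|audio_placeholder|>"; it is expanded
# into one "<audio_soft_token>" per computed soft token.
inputs = processor(
    text="Transcribe: <|audio_placeholder|>",
    audios=[(wave, 16000)],
    return_tensors="pt",
)
print(inputs["input_ids"].shape, inputs["input_audio_embeds"].shape)
```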
special_tokens_map.json ADDED
@@ -0,0 +1,36 @@
+ {
+   "audio_token": "<audio_soft_token>",
+   "boa_token": "<start_of_audio>",
+   "boi_token": "<start_of_image>",
+   "bos_token": {
+     "content": "<bos>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "eoa_token": "<end_of_audio>",
+   "eoi_token": "<end_of_image>",
+   "eos_token": {
+     "content": "<eos>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "image_token": "<image_soft_token>",
+   "pad_token": {
+     "content": "<pad>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "unk_token": {
+     "content": "<unk>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
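
These special tokens are what the processor keys on when building `token_type_ids` (0 = text, 1 = image, 2 = audio); a small sketch, assuming the repo's tokenizer is loadable (the path is a placeholder).

```python
# Sketch only; "path/to/this/repo" is a placeholder.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("path/to/this/repo")

# Gemma3OmniProcessor resolves these ids at init time; if a lookup returns unk_token_id,
# the token was not registered as a special token and a warning is logged.
print(tok.convert_tokens_to_ids("<audio_soft_token>"))
print(tok.convert_tokens_to_ids("<image_soft_token>"))
```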
tokenizer.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:e4a92ec8bee95d6b8f5a141bae86b6d612ac509b62cedbb9538dc6d19870fc04
+ size 33384534
tokenizer_config.json ADDED
The diff for this file is too large to render. See raw diff