NovaIALATAM committed on
Commit 225c729 · verified · 1 Parent(s): f5f33fd

Upload folder using huggingface_hub

.gitattributes ADDED
@@ -0,0 +1,3 @@
+ adapter_model.safetensors filter=lfs diff=lfs merge=lfs -text
+ optimizer.pt filter=lfs diff=lfs merge=lfs -text
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -8,151 +8,202 @@ tags:
  - transformers
  - trl
  - unsloth
- license: cc-by-4.0
- datasets:
- - NovaIALATAM/CuPer_Text
- - NovaIALATAM/CuPer_Images
- language:
- - es
- pipeline_tag: image-text-to-text
  ---
 
- ![AMARU](./amaru-vl.jpg)
-
- # WELCOME TO AMARU
-
- ## Model Details
- - **Model Name:** Amaru-VL-3B
- - **Model Version:** 1.0
- - **Developer/Organization:** NovaIA, the AI laboratory of Grupo Neura
- - **Release Date:** August 2025
- - **License:** cc-by-4.0
- - **Language(s):** Spanish (primarily, with multilingual capabilities inherited from the base model)
- - **Repository:** [NovaIALATAM/Amaru-VL-3B](https://huggingface.co/NovaIALATAM/Amaru-VL-3B)
- - **Contact:** For more information, contact NovaIA through [its website](https://www.gruponeura.ia).
-
- ## Model Description
- Amaru-VL-3B is a multimodal vision-language model (VLM) fine-tuned with LoRA from the Qwen2.5-VL-3B base model. Developed by NovaIA, the artificial-intelligence laboratory of Grupo Neura, it is designed to be an expert on the pre-Columbian cultures of Peru, addressing the limitations of large language models that often make mistakes on tasks involving Latin American cultural heritage. Amaru is a first step toward a specialized Latin American LLM: it preserves the base model's general text-generation abilities for general-purpose tasks while integrating deep knowledge of ancestral Peruvian civilizations such as Chavín, Moche, Nazca, Chimú, Lambayeque, and others.
-
- The model accepts multimodal input (text and images) and is useful for tasks such as analyzing archaeological artifacts, answering historical questions, and generating cultural descriptions. It was trained on roughly 5 million tokens in total, combining open and private datasets to ensure historical and cultural accuracy.
-
- **Architecture:** Based on Qwen2.5-VL-3B, a vision-language model with approximately 3 billion parameters, optimized for efficiency with support for quantization (e.g., 4-bit) and long context.
-
- **Main Objectives:**
- - Provide accurate knowledge about pre-Columbian Peruvian cultures.
- - Serve as a foundation for educational, research, and cultural-preservation applications in Latin America.
- - Remain versatile for general text generation.
-
- ## Uses
- ### Direct Uses
- - **Multimodal Analysis:** Identification and description of images of pre-Columbian artifacts, architecture, or iconography (e.g., Lambayeque masks, Moche ceramics).
- - **Question Answering:** Queries about the history, rituals, economy, and art of Peruvian civilizations.
- - **Text Generation:** Creation of educational or narrative content about cultural heritage, as well as general-purpose language tasks.
-
- ### Loading and Usage Example
- The model loads easily with Unsloth for efficient inference. Here is an example (an equivalent recipe using plain transformers + peft is sketched right after this block):
-
- ```python
- from unsloth import FastVisionModel
- import torch
- from transformers import TextStreamer
- from PIL import Image
- import requests
- from io import BytesIO
-
- model_id = "NovaIALATAM/Amaru-VL-3B"
-
- # Load the model and tokenizer (processor) with Unsloth
- base_model, tokenizer = FastVisionModel.from_pretrained(
-     model_name=model_id,
-     max_seq_length=8192,
-     dtype=None,                # auto-select bfloat16/float16
-     load_in_4bit=False,        # set True for 4-bit inference on smaller GPUs
-     use_gradient_checkpointing="unsloth",
- )
-
- FastVisionModel.for_inference(base_model)
-
- # Example image
- path_image = "https://cloudfront-us-east-1.images.arcpublishing.com/infobae/WARBZQ3A65BQRLN3D63I3JJBIY.jpg"
- response = requests.get(path_image, timeout=10)
- response.raise_for_status()
- image = Image.open(BytesIO(response.content))
-
- messages = [
-     {"role": "system", "content": [
-         {"type": "text", "text": "Amaru, eres un experto conocedor de diferentes culturas precolombinas del Perú."}]},
-     {"role": "user", "content": [
-         {"type": "image", "image": image},
-         {"type": "text", "text": "¿Qué es esto?"}
-     ]}
- ]
-
- inputs = tokenizer(
-     images=image,
-     text=tokenizer.apply_chat_template(messages, add_generation_prompt=True),
-     add_special_tokens=False,
-     return_tensors="pt",
- ).to("cuda")
-
- text_streamer = TextStreamer(tokenizer, skip_prompt=True)
- _ = base_model.generate(**inputs, streamer=text_streamer, max_new_tokens=2048,
-                         use_cache=True, temperature=0.05, min_p=0.1)
- ```
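If you prefer not to depend on Unsloth, the adapter weights in this repository can also be attached to the base model with plain `transformers` + `peft`. The following is a minimal sketch, not the card's official snippet: the base model name is taken from `adapter_config.json`, the image URL is a placeholder, and the class/processor calls are the standard Qwen2.5-VL ones.

```python
# Hedged sketch: base Qwen2.5-VL + this repo's LoRA adapter via transformers + peft.
import requests
import torch
from io import BytesIO
from PIL import Image
from peft import PeftModel
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

base_id = "unsloth/Qwen2.5-VL-3B-Instruct"  # base model declared in adapter_config.json
adapter_id = "NovaIALATAM/Amaru-VL-3B"      # this repository (LoRA adapter weights)

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    base_id, torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(model, adapter_id)   # attach the LoRA weights
processor = AutoProcessor.from_pretrained(adapter_id)

image_url = "https://example.com/artifact.jpg"  # placeholder URL, replace with a real image
image = Image.open(BytesIO(requests.get(image_url, timeout=10).content))

messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "¿Qué es esto?"},
]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

out = model.generate(**inputs, max_new_tokens=512, do_sample=True,
                     temperature=0.05, min_p=0.1)
print(processor.batch_decode(out[:, inputs["input_ids"].shape[1]:],
                             skip_special_tokens=True)[0])
```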
-
- ### Downstream Application Uses
- - Integration into educational chatbots for museums or learning platforms.
- - Additional fine-tuning for specific cultural-heritage or tourism tasks.
- - Use in AI tools for archaeological research.
-
- ### Out-of-Scope Uses
- - Not recommended for medical, legal, or other high-precision applications without additional validation.
- - Avoid uses that promote historical or cultural misinformation.
- - Not optimized for non-Latin languages or for contexts unrelated to its knowledge base.
-
- ## Bias, Risks, and Limitations
- - **Bias:** The model inherits potential biases from the Qwen2.5-VL-3B base model and from the training datasets. Given its focus on Peruvian cultures, it may underrepresent other Latin American or global cultures. The open datasets (CuPer_Text and CuPer_Images) were curated for accuracy, but the private ones could introduce undocumented biases.
- - **Risks:** Possible hallucinations in answers outside the pre-Columbian domain, as with general-purpose AI models. On sensitive topics such as indigenous rituals, it could generate inaccurate content if not guided properly.
- - **Limitations:**
-   - Lower performance in languages other than Spanish and in unrelated modern contexts.
-   - Requires GPU hardware for efficient inference (quantization support mitigates this).
-   - Not exhaustively evaluated; human validation is recommended for critical uses.
- - **Recommendations:**
-   - Use specific prompts (e.g., "Act as an expert on the pre-Columbian cultures of Peru") to maximize accuracy.
-   - Use a low temperature to avoid hallucinations: temperature = 0.05.
-
- ## Training Details
- ### Training Data
- The model was fine-tuned on a total of ~5 million tokens, including:
- - **CuPer_Text** (open, cc-by-nc-4.0): 4,144 samples (3,268 train, 876 test), ~650,000 tokens. Format: text-only questions and answers about pre-Columbian Peruvian cultures (e.g., history, architecture, rituals). Created by NovaIA to ensure cultural accuracy.
- - **CuPer_Images** (open, cc-by-nc-4.0): 4,247 samples (3,396 train, 851 test), ~2.2 million tokens. Format: questions, images, and text answers focused on visual aspects such as artifacts, structures, and iconography (e.g., Nazca aqueducts, Lambayeque masks).
- - **Private Datasets:** Additional data curated by NovaIA, contributing to the 5-million-token total, with an emphasis on validated historical knowledge.
-
- All the data is in Spanish and centered on cultures such as Lambayeque, Moche, Nazca, Chavín, and Chimú.
-
- ### Training Procedure
- - **Method:** LoRA (Low-Rank Adaptation) fine-tuning on top of Qwen2.5-VL-3B for efficiency (a configuration sketch follows this list).
- - **Hyperparameters:** Not publicly specified; optimized for long context (up to 8,192 tokens) and multimodality.
- - **Hardware:** Trained on NovaIA infrastructure, compatible with Unsloth for acceleration.
- - **Duration:** Not detailed; the focus was on cultural accuracy rather than volume.
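Although the full training hyperparameters were not published, the PEFT side of the setup can be read directly from the shipped `adapter_config.json` (r = 2, lora_alpha = 4, no dropout, adapters on the attention and MLP projections). Below is a minimal sketch that mirrors that configuration with `peft`; the training loop, optimizer, schedule, and data collator are omitted, since they would be assumptions.

```python
# Hedged sketch: a LoraConfig mirroring the values shipped in adapter_config.json.
from peft import LoraConfig, get_peft_model
from transformers import Qwen2_5_VLForConditionalGeneration

base = Qwen2_5_VLForConditionalGeneration.from_pretrained("unsloth/Qwen2.5-VL-3B-Instruct")

lora_cfg = LoraConfig(
    r=2,                 # rank, from adapter_config.json
    lora_alpha=4,        # scaling alpha / r = 2
    lora_dropout=0.0,
    bias="none",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only the low-rank A/B matrices are trainable
```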
-
- ## Environmental Impact
- - **Carbon Estimate:** LoRA fine-tuning is efficient, so the impact is estimated to be low (roughly equivalent to a few hours on an A100 GPU). It was not formally calculated.
- - **Efficiency:** 4-bit quantization support reduces consumption at inference time.
-
- ## Citation
- If you use this model in your research or application, please cite it as:
- ```bibtex
- @misc{amaru_vl_3b_2025,
-   title = {Amaru-VL-3B: Modelo Multimodal Especializado en Culturas Precolombinas del Perú},
-   author = {NovaIA - Grupo Neura},
-   year = {2025},
-   publisher = {Hugging Face},
-   url = {https://huggingface.co/NovaIALATAM/Amaru-VL-3B}
- }
- ```
-
- ## More Information
- For more details about the datasets:
- - [CuPer_Text](https://huggingface.co/datasets/NovaIALATAM/CuPer_Text)
- - [CuPer_Images](https://huggingface.co/datasets/NovaIALATAM/CuPer_Images)
13
+ # Model Card for Model ID
14
+
15
+ <!-- Provide a quick summary of what the model is/does. -->
16
+
17
+
18
+
19
+ ## Model Details
20
+
21
+ ### Model Description
22
+
23
+ <!-- Provide a longer summary of what this model is. -->
24
+
25
+
26
+
27
+ - **Developed by:** [More Information Needed]
28
+ - **Funded by [optional]:** [More Information Needed]
29
+ - **Shared by [optional]:** [More Information Needed]
30
+ - **Model type:** [More Information Needed]
31
+ - **Language(s) (NLP):** [More Information Needed]
32
+ - **License:** [More Information Needed]
33
+ - **Finetuned from model [optional]:** [More Information Needed]
34
+
35
+ ### Model Sources [optional]
36
+
37
+ <!-- Provide the basic links for the model. -->
38
+
39
+ - **Repository:** [More Information Needed]
40
+ - **Paper [optional]:** [More Information Needed]
41
+ - **Demo [optional]:** [More Information Needed]
42
+
43
+ ## Uses
44
+
45
+ <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
46
+
47
+ ### Direct Use
48
+
49
+ <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
50
+
51
+ [More Information Needed]
52
+
53
+ ### Downstream Use [optional]
54
+
55
+ <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
56
+
57
+ [More Information Needed]
58
+
59
+ ### Out-of-Scope Use
60
+
61
+ <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
62
+
63
+ [More Information Needed]
64
+
65
+ ## Bias, Risks, and Limitations
66
+
67
+ <!-- This section is meant to convey both technical and sociotechnical limitations. -->
68
+
69
+ [More Information Needed]
70
+
71
+ ### Recommendations
72
+
73
+ <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
74
+
75
+ Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
76
+
77
+ ## How to Get Started with the Model
78
+
79
+ Use the code below to get started with the model.
80
+
81
+ [More Information Needed]
82
+
83
+ ## Training Details
84
+
85
+ ### Training Data
86
+
87
+ <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
88
+
89
+ [More Information Needed]
90
+
91
+ ### Training Procedure
92
+
93
+ <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
94
+
95
+ #### Preprocessing [optional]
96
+
97
+ [More Information Needed]
98
+
99
+
100
+ #### Training Hyperparameters
101
+
102
+ - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
103
+
104
+ #### Speeds, Sizes, Times [optional]
105
+
106
+ <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
107
+
108
+ [More Information Needed]
109
+
110
+ ## Evaluation
111
+
112
+ <!-- This section describes the evaluation protocols and provides the results. -->
113
+
114
+ ### Testing Data, Factors & Metrics
115
+
116
+ #### Testing Data
117
+
118
+ <!-- This should link to a Dataset Card if possible. -->
119
+
120
+ [More Information Needed]
121
+
122
+ #### Factors
123
+
124
+ <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
125
+
126
+ [More Information Needed]
127
+
128
+ #### Metrics
129
+
130
+ <!-- These are the evaluation metrics being used, ideally with a description of why. -->
131
+
132
+ [More Information Needed]
133
+
134
+ ### Results
135
+
136
+ [More Information Needed]
137
+
138
+ #### Summary
139
+
140
+
141
+
142
+ ## Model Examination [optional]
143
+
144
+ <!-- Relevant interpretability work for the model goes here -->
145
+
146
+ [More Information Needed]
147
+
148
+ ## Environmental Impact
149
+
150
+ <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
151
+
152
+ Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
153
+
154
+ - **Hardware Type:** [More Information Needed]
155
+ - **Hours used:** [More Information Needed]
156
+ - **Cloud Provider:** [More Information Needed]
157
+ - **Compute Region:** [More Information Needed]
158
+ - **Carbon Emitted:** [More Information Needed]
159
+
160
+ ## Technical Specifications [optional]
161
+
162
+ ### Model Architecture and Objective
163
+
164
+ [More Information Needed]
165
+
166
+ ### Compute Infrastructure
167
+
168
+ [More Information Needed]
169
+
170
+ #### Hardware
171
+
172
+ [More Information Needed]
173
+
174
+ #### Software
175
+
176
+ [More Information Needed]
177
+
178
+ ## Citation [optional]
179
+
180
+ <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
181
+
182
+ **BibTeX:**
183
+
184
+ [More Information Needed]
185
+
186
+ **APA:**
187
+
188
+ [More Information Needed]
189
+
190
+ ## Glossary [optional]
191
+
192
+ <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
193
+
194
+ [More Information Needed]
195
+
196
+ ## More Information [optional]
197
+
198
+ [More Information Needed]
199
+
200
+ ## Model Card Authors [optional]
201
+
202
+ [More Information Needed]
203
+
204
+ ## Model Card Contact
205
+
206
+ [More Information Needed]
207
+ ### Framework versions
208
+
209
+ - PEFT 0.17.0
adapter_config.json ADDED
@@ -0,0 +1,45 @@
+ {
+   "alpha_pattern": {},
+   "auto_mapping": {
+     "base_model_class": "Qwen2_5_VLForConditionalGeneration",
+     "parent_library": "transformers.models.qwen2_5_vl.modeling_qwen2_5_vl"
+   },
+   "base_model_name_or_path": "unsloth/Qwen2.5-VL-3B-Instruct",
+   "bias": "none",
+   "corda_config": null,
+   "eva_config": null,
+   "exclude_modules": null,
+   "fan_in_fan_out": false,
+   "inference_mode": true,
+   "init_lora_weights": true,
+   "layer_replication": null,
+   "layers_pattern": null,
+   "layers_to_transform": null,
+   "loftq_config": {},
+   "lora_alpha": 4,
+   "lora_bias": false,
+   "lora_dropout": 0.0,
+   "megatron_config": null,
+   "megatron_core": "megatron.core",
+   "modules_to_save": null,
+   "peft_type": "LORA",
+   "qalora_group_size": 16,
+   "r": 2,
+   "rank_pattern": {},
+   "revision": null,
+   "target_modules": [
+     "k_proj",
+     "up_proj",
+     "gate_proj",
+     "v_proj",
+     "o_proj",
+     "down_proj",
+     "q_proj"
+   ],
+   "target_parameters": null,
+   "task_type": null,
+   "trainable_token_indices": null,
+   "use_dora": false,
+   "use_qalora": false,
+   "use_rslora": false
+ }
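For reference, with `r = 2` and `lora_alpha = 4` this adapter applies the standard LoRA update with scaling $\alpha / r = 2$:

$$W' = W + \frac{\alpha}{r}\,BA = W + 2\,BA, \qquad B \in \mathbb{R}^{d_\mathrm{out}\times 2},\; A \in \mathbb{R}^{2\times d_\mathrm{in}},$$

so each targeted projection adds only $2\,(d_\mathrm{in}+d_\mathrm{out})$ trainable parameters, which is why `adapter_model.safetensors` weighs in at only ~18.7 MB.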
adapter_model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:385ceb02b804f3c2831c7943305ed5f27c2134e543b1464903990707eddd0371
+ size 18676192
added_tokens.json ADDED
@@ -0,0 +1,24 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "</tool_call>": 151658,
3
+ "<tool_call>": 151657,
4
+ "<|box_end|>": 151649,
5
+ "<|box_start|>": 151648,
6
+ "<|endoftext|>": 151643,
7
+ "<|file_sep|>": 151664,
8
+ "<|fim_middle|>": 151660,
9
+ "<|fim_pad|>": 151662,
10
+ "<|fim_prefix|>": 151659,
11
+ "<|fim_suffix|>": 151661,
12
+ "<|im_end|>": 151645,
13
+ "<|im_start|>": 151644,
14
+ "<|image_pad|>": 151655,
15
+ "<|object_ref_end|>": 151647,
16
+ "<|object_ref_start|>": 151646,
17
+ "<|quad_end|>": 151651,
18
+ "<|quad_start|>": 151650,
19
+ "<|repo_name|>": 151663,
20
+ "<|video_pad|>": 151656,
21
+ "<|vision_end|>": 151653,
22
+ "<|vision_pad|>": 151654,
23
+ "<|vision_start|>": 151652
24
+ }
chat_template.jinja ADDED
@@ -0,0 +1,7 @@
+ {% set image_count = namespace(value=0) %}{% set video_count = namespace(value=0) %}{% for message in messages %}{% if loop.first and message['role'] != 'system' %}<|im_start|>system
+ You are a helpful assistant.<|im_end|>
+ {% endif %}<|im_start|>{{ message['role'] }}
+ {% if message['content'] is string %}{{ message['content'] }}<|im_end|>
+ {% else %}{% for content in message['content'] %}{% if content['type'] == 'image' or 'image' in content or 'image_url' in content %}{% set image_count.value = image_count.value + 1 %}{% if add_vision_id %}Picture {{ image_count.value }}: {% endif %}<|vision_start|><|image_pad|><|vision_end|>{% elif content['type'] == 'video' or 'video' in content %}{% set video_count.value = video_count.value + 1 %}{% if add_vision_id %}Video {{ video_count.value }}: {% endif %}<|vision_start|><|video_pad|><|vision_end|>{% elif 'text' in content %}{{ content['text'] }}{% endif %}{% endfor %}<|im_end|>
+ {% endif %}{% endfor %}{% if add_generation_prompt %}<|im_start|>assistant
+ {% endif %}
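This template turns a structured message list into the Qwen-style `<|im_start|>…<|im_end|>` transcript, replacing each image item with `<|vision_start|><|image_pad|><|vision_end|>`. A small rendering sketch follows, assuming the processor files in this repo load through `AutoProcessor` (the messages are illustrative):

```python
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("NovaIALATAM/Amaru-VL-3B")

messages = [
    {"role": "system",
     "content": "Amaru, eres un experto conocedor de diferentes culturas precolombinas del Perú."},
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "¿Qué es esto?"},
    ]},
]

# Returns the prompt string: <|im_start|>/<|im_end|> turns plus one
# <|vision_start|><|image_pad|><|vision_end|> slot where the image goes.
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
print(prompt)
```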
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
optimizer.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:ac2cb766e5373a5c6c5bbb4a7008f0942d3f0122b65cfb1c62a3ff451de28a7a
+ size 8363492
preprocessor_config.json ADDED
@@ -0,0 +1,37 @@
+ {
+   "crop_size": null,
+   "data_format": "channels_first",
+   "default_to_square": true,
+   "device": null,
+   "disable_grouping": null,
+   "do_center_crop": null,
+   "do_convert_rgb": true,
+   "do_normalize": true,
+   "do_rescale": true,
+   "do_resize": true,
+   "image_mean": [
+     0.48145466,
+     0.4578275,
+     0.40821073
+   ],
+   "image_processor_type": "Qwen2VLImageProcessorFast",
+   "image_std": [
+     0.26862954,
+     0.26130258,
+     0.27577711
+   ],
+   "input_data_format": null,
+   "max_pixels": 12845056,
+   "merge_size": 2,
+   "min_pixels": 3136,
+   "patch_size": 14,
+   "processor_class": "Qwen2_5_VLProcessor",
+   "resample": 3,
+   "rescale_factor": 0.00392156862745098,
+   "return_tensors": null,
+   "size": {
+     "longest_edge": 12845056,
+     "shortest_edge": 3136
+   },
+   "temporal_patch_size": 2
+ }
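As a rough guide to what these bounds mean: with `patch_size = 14` and `merge_size = 2`, an image contributes approximately $\mathrm{pixels} / (14^2 \cdot 2^2) = \mathrm{pixels} / 784$ visual tokens, so `min_pixels = 3136` and `max_pixels = 12845056` keep each image between roughly 4 and 16,384 tokens. This is a back-of-the-envelope reading of the config values, not a figure stated by the authors.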
rng_state.pth ADDED
Binary file (14.2 kB). View file
 
scheduler.pt ADDED
Binary file (1.06 kB). View file
 
special_tokens_map.json ADDED
@@ -0,0 +1,31 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "additional_special_tokens": [
3
+ "<|im_start|>",
4
+ "<|im_end|>",
5
+ "<|object_ref_start|>",
6
+ "<|object_ref_end|>",
7
+ "<|box_start|>",
8
+ "<|box_end|>",
9
+ "<|quad_start|>",
10
+ "<|quad_end|>",
11
+ "<|vision_start|>",
12
+ "<|vision_end|>",
13
+ "<|vision_pad|>",
14
+ "<|image_pad|>",
15
+ "<|video_pad|>"
16
+ ],
17
+ "eos_token": {
18
+ "content": "<|im_end|>",
19
+ "lstrip": false,
20
+ "normalized": false,
21
+ "rstrip": false,
22
+ "single_word": false
23
+ },
24
+ "pad_token": {
25
+ "content": "<|vision_pad|>",
26
+ "lstrip": false,
27
+ "normalized": false,
28
+ "rstrip": false,
29
+ "single_word": false
30
+ }
31
+ }
tokenizer.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:1296e28e96c0a57e3bbee1bcdc014e39b2dde71f015c7e19e56cc4399de40c66
+ size 11422166
tokenizer_config.json ADDED
@@ -0,0 +1,209 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "add_bos_token": false,
3
+ "add_prefix_space": false,
4
+ "added_tokens_decoder": {
5
+ "151643": {
6
+ "content": "<|endoftext|>",
7
+ "lstrip": false,
8
+ "normalized": false,
9
+ "rstrip": false,
10
+ "single_word": false,
11
+ "special": true
12
+ },
13
+ "151644": {
14
+ "content": "<|im_start|>",
15
+ "lstrip": false,
16
+ "normalized": false,
17
+ "rstrip": false,
18
+ "single_word": false,
19
+ "special": true
20
+ },
21
+ "151645": {
22
+ "content": "<|im_end|>",
23
+ "lstrip": false,
24
+ "normalized": false,
25
+ "rstrip": false,
26
+ "single_word": false,
27
+ "special": true
28
+ },
29
+ "151646": {
30
+ "content": "<|object_ref_start|>",
31
+ "lstrip": false,
32
+ "normalized": false,
33
+ "rstrip": false,
34
+ "single_word": false,
35
+ "special": true
36
+ },
37
+ "151647": {
38
+ "content": "<|object_ref_end|>",
39
+ "lstrip": false,
40
+ "normalized": false,
41
+ "rstrip": false,
42
+ "single_word": false,
43
+ "special": true
44
+ },
45
+ "151648": {
46
+ "content": "<|box_start|>",
47
+ "lstrip": false,
48
+ "normalized": false,
49
+ "rstrip": false,
50
+ "single_word": false,
51
+ "special": true
52
+ },
53
+ "151649": {
54
+ "content": "<|box_end|>",
55
+ "lstrip": false,
56
+ "normalized": false,
57
+ "rstrip": false,
58
+ "single_word": false,
59
+ "special": true
60
+ },
61
+ "151650": {
62
+ "content": "<|quad_start|>",
63
+ "lstrip": false,
64
+ "normalized": false,
65
+ "rstrip": false,
66
+ "single_word": false,
67
+ "special": true
68
+ },
69
+ "151651": {
70
+ "content": "<|quad_end|>",
71
+ "lstrip": false,
72
+ "normalized": false,
73
+ "rstrip": false,
74
+ "single_word": false,
75
+ "special": true
76
+ },
77
+ "151652": {
78
+ "content": "<|vision_start|>",
79
+ "lstrip": false,
80
+ "normalized": false,
81
+ "rstrip": false,
82
+ "single_word": false,
83
+ "special": true
84
+ },
85
+ "151653": {
86
+ "content": "<|vision_end|>",
87
+ "lstrip": false,
88
+ "normalized": false,
89
+ "rstrip": false,
90
+ "single_word": false,
91
+ "special": true
92
+ },
93
+ "151654": {
94
+ "content": "<|vision_pad|>",
95
+ "lstrip": false,
96
+ "normalized": false,
97
+ "rstrip": false,
98
+ "single_word": false,
99
+ "special": true
100
+ },
101
+ "151655": {
102
+ "content": "<|image_pad|>",
103
+ "lstrip": false,
104
+ "normalized": false,
105
+ "rstrip": false,
106
+ "single_word": false,
107
+ "special": true
108
+ },
109
+ "151656": {
110
+ "content": "<|video_pad|>",
111
+ "lstrip": false,
112
+ "normalized": false,
113
+ "rstrip": false,
114
+ "single_word": false,
115
+ "special": true
116
+ },
117
+ "151657": {
118
+ "content": "<tool_call>",
119
+ "lstrip": false,
120
+ "normalized": false,
121
+ "rstrip": false,
122
+ "single_word": false,
123
+ "special": false
124
+ },
125
+ "151658": {
126
+ "content": "</tool_call>",
127
+ "lstrip": false,
128
+ "normalized": false,
129
+ "rstrip": false,
130
+ "single_word": false,
131
+ "special": false
132
+ },
133
+ "151659": {
134
+ "content": "<|fim_prefix|>",
135
+ "lstrip": false,
136
+ "normalized": false,
137
+ "rstrip": false,
138
+ "single_word": false,
139
+ "special": false
140
+ },
141
+ "151660": {
142
+ "content": "<|fim_middle|>",
143
+ "lstrip": false,
144
+ "normalized": false,
145
+ "rstrip": false,
146
+ "single_word": false,
147
+ "special": false
148
+ },
149
+ "151661": {
150
+ "content": "<|fim_suffix|>",
151
+ "lstrip": false,
152
+ "normalized": false,
153
+ "rstrip": false,
154
+ "single_word": false,
155
+ "special": false
156
+ },
157
+ "151662": {
158
+ "content": "<|fim_pad|>",
159
+ "lstrip": false,
160
+ "normalized": false,
161
+ "rstrip": false,
162
+ "single_word": false,
163
+ "special": false
164
+ },
165
+ "151663": {
166
+ "content": "<|repo_name|>",
167
+ "lstrip": false,
168
+ "normalized": false,
169
+ "rstrip": false,
170
+ "single_word": false,
171
+ "special": false
172
+ },
173
+ "151664": {
174
+ "content": "<|file_sep|>",
175
+ "lstrip": false,
176
+ "normalized": false,
177
+ "rstrip": false,
178
+ "single_word": false,
179
+ "special": false
180
+ }
181
+ },
182
+ "additional_special_tokens": [
183
+ "<|im_start|>",
184
+ "<|im_end|>",
185
+ "<|object_ref_start|>",
186
+ "<|object_ref_end|>",
187
+ "<|box_start|>",
188
+ "<|box_end|>",
189
+ "<|quad_start|>",
190
+ "<|quad_end|>",
191
+ "<|vision_start|>",
192
+ "<|vision_end|>",
193
+ "<|vision_pad|>",
194
+ "<|image_pad|>",
195
+ "<|video_pad|>"
196
+ ],
197
+ "bos_token": null,
198
+ "clean_up_tokenization_spaces": false,
199
+ "eos_token": "<|im_end|>",
200
+ "errors": "replace",
201
+ "extra_special_tokens": {},
202
+ "model_max_length": 128000,
203
+ "pad_token": "<|vision_pad|>",
204
+ "padding_side": "right",
205
+ "processor_class": "Qwen2_5_VLProcessor",
206
+ "split_special_tokens": false,
207
+ "tokenizer_class": "Qwen2Tokenizer",
208
+ "unk_token": null
209
+ }
trainer_state.json ADDED
@@ -0,0 +1,334 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "best_global_step": 200,
3
+ "best_metric": 0.8998275399208069,
4
+ "best_model_checkpoint": "/content/drive/MyDrive/amaru-txt-epoch-6/checkpoint-200",
5
+ "epoch": 0.966183574879227,
6
+ "eval_steps": 10,
7
+ "global_step": 200,
8
+ "is_hyper_param_search": false,
9
+ "is_local_process_zero": true,
10
+ "is_world_process_zero": true,
11
+ "log_history": [
12
+ {
13
+ "epoch": 0.04830917874396135,
14
+ "grad_norm": 1.9938534498214722,
15
+ "learning_rate": 3.5389440000000007e-05,
16
+ "loss": 0.8315,
17
+ "step": 10
18
+ },
19
+ {
20
+ "epoch": 0.04830917874396135,
21
+ "eval_loss": 0.9170326590538025,
22
+ "eval_runtime": 89.5861,
23
+ "eval_samples_per_second": 9.778,
24
+ "eval_steps_per_second": 0.156,
25
+ "step": 10
26
+ },
27
+ {
28
+ "epoch": 0.0966183574879227,
29
+ "grad_norm": 2.8435614109039307,
30
+ "learning_rate": 7.471104000000001e-05,
31
+ "loss": 0.7156,
32
+ "step": 20
33
+ },
34
+ {
35
+ "epoch": 0.0966183574879227,
36
+ "eval_loss": 0.92298823595047,
37
+ "eval_runtime": 89.6583,
38
+ "eval_samples_per_second": 9.77,
39
+ "eval_steps_per_second": 0.156,
40
+ "step": 20
41
+ },
42
+ {
43
+ "epoch": 0.14492753623188406,
44
+ "grad_norm": 2.7325339317321777,
45
+ "learning_rate": 7.819458354458255e-05,
46
+ "loss": 0.8015,
47
+ "step": 30
48
+ },
49
+ {
50
+ "epoch": 0.14492753623188406,
51
+ "eval_loss": 0.9262471199035645,
52
+ "eval_runtime": 89.6974,
53
+ "eval_samples_per_second": 9.766,
54
+ "eval_steps_per_second": 0.156,
55
+ "step": 30
56
+ },
57
+ {
58
+ "epoch": 0.1932367149758454,
59
+ "grad_norm": 2.380279779434204,
60
+ "learning_rate": 7.665694808929027e-05,
61
+ "loss": 0.7263,
62
+ "step": 40
63
+ },
64
+ {
65
+ "epoch": 0.1932367149758454,
66
+ "eval_loss": 0.9168504476547241,
67
+ "eval_runtime": 89.6787,
68
+ "eval_samples_per_second": 9.768,
69
+ "eval_steps_per_second": 0.156,
70
+ "step": 40
71
+ },
72
+ {
73
+ "epoch": 0.24154589371980675,
74
+ "grad_norm": 2.8440909385681152,
75
+ "learning_rate": 7.406804077083218e-05,
76
+ "loss": 0.7098,
77
+ "step": 50
78
+ },
79
+ {
80
+ "epoch": 0.24154589371980675,
81
+ "eval_loss": 0.9175106883049011,
82
+ "eval_runtime": 89.6725,
83
+ "eval_samples_per_second": 9.769,
84
+ "eval_steps_per_second": 0.156,
85
+ "step": 50
86
+ },
87
+ {
88
+ "epoch": 0.2898550724637681,
89
+ "grad_norm": 2.432560920715332,
90
+ "learning_rate": 7.050075887179768e-05,
91
+ "loss": 0.7212,
92
+ "step": 60
93
+ },
94
+ {
95
+ "epoch": 0.2898550724637681,
96
+ "eval_loss": 0.9199467301368713,
97
+ "eval_runtime": 89.7174,
98
+ "eval_samples_per_second": 9.764,
99
+ "eval_steps_per_second": 0.156,
100
+ "step": 60
101
+ },
102
+ {
103
+ "epoch": 0.33816425120772947,
104
+ "grad_norm": 3.1470143795013428,
105
+ "learning_rate": 6.605554830418061e-05,
106
+ "loss": 0.7116,
107
+ "step": 70
108
+ },
109
+ {
110
+ "epoch": 0.33816425120772947,
111
+ "eval_loss": 0.9214985966682434,
112
+ "eval_runtime": 89.6568,
113
+ "eval_samples_per_second": 9.771,
114
+ "eval_steps_per_second": 0.156,
115
+ "step": 70
116
+ },
117
+ {
118
+ "epoch": 0.3864734299516908,
119
+ "grad_norm": 3.123927354812622,
120
+ "learning_rate": 6.085757529877134e-05,
121
+ "loss": 0.7162,
122
+ "step": 80
123
+ },
124
+ {
125
+ "epoch": 0.3864734299516908,
126
+ "eval_loss": 0.9215625524520874,
127
+ "eval_runtime": 89.6808,
128
+ "eval_samples_per_second": 9.768,
129
+ "eval_steps_per_second": 0.156,
130
+ "step": 80
131
+ },
132
+ {
133
+ "epoch": 0.43478260869565216,
134
+ "grad_norm": 2.565199851989746,
135
+ "learning_rate": 5.5053202030981025e-05,
136
+ "loss": 0.806,
137
+ "step": 90
138
+ },
139
+ {
140
+ "epoch": 0.43478260869565216,
141
+ "eval_loss": 0.9178029298782349,
142
+ "eval_runtime": 89.7041,
143
+ "eval_samples_per_second": 9.765,
144
+ "eval_steps_per_second": 0.156,
145
+ "step": 90
146
+ },
147
+ {
148
+ "epoch": 0.4830917874396135,
149
+ "grad_norm": 2.8664753437042236,
150
+ "learning_rate": 4.880586542083376e-05,
151
+ "loss": 0.7107,
152
+ "step": 100
153
+ },
154
+ {
155
+ "epoch": 0.4830917874396135,
156
+ "eval_loss": 0.9137737154960632,
157
+ "eval_runtime": 89.6956,
158
+ "eval_samples_per_second": 9.766,
159
+ "eval_steps_per_second": 0.156,
160
+ "step": 100
161
+ },
162
+ {
163
+ "epoch": 0.5314009661835749,
164
+ "grad_norm": 2.94624400138855,
165
+ "learning_rate": 4.229147515001422e-05,
166
+ "loss": 0.7115,
167
+ "step": 110
168
+ },
169
+ {
170
+ "epoch": 0.5314009661835749,
171
+ "eval_loss": 0.913902223110199,
172
+ "eval_runtime": 89.6611,
173
+ "eval_samples_per_second": 9.77,
174
+ "eval_steps_per_second": 0.156,
175
+ "step": 110
176
+ },
177
+ {
178
+ "epoch": 0.5797101449275363,
179
+ "grad_norm": 3.233154296875,
180
+ "learning_rate": 3.569346047652783e-05,
181
+ "loss": 0.7752,
182
+ "step": 120
183
+ },
184
+ {
185
+ "epoch": 0.5797101449275363,
186
+ "eval_loss": 0.9105567336082458,
187
+ "eval_runtime": 89.6726,
188
+ "eval_samples_per_second": 9.769,
189
+ "eval_steps_per_second": 0.156,
190
+ "step": 120
191
+ },
192
+ {
193
+ "epoch": 0.6280193236714976,
194
+ "grad_norm": 2.791991949081421,
195
+ "learning_rate": 2.9197605316528352e-05,
196
+ "loss": 0.8082,
197
+ "step": 130
198
+ },
199
+ {
200
+ "epoch": 0.6280193236714976,
201
+ "eval_loss": 0.9074307084083557,
202
+ "eval_runtime": 89.6733,
203
+ "eval_samples_per_second": 9.769,
204
+ "eval_steps_per_second": 0.156,
205
+ "step": 130
206
+ },
207
+ {
208
+ "epoch": 0.6763285024154589,
209
+ "grad_norm": 2.9703595638275146,
210
+ "learning_rate": 2.2986817024745032e-05,
211
+ "loss": 0.8185,
212
+ "step": 140
213
+ },
214
+ {
215
+ "epoch": 0.6763285024154589,
216
+ "eval_loss": 0.9059156775474548,
217
+ "eval_runtime": 89.6839,
218
+ "eval_samples_per_second": 9.768,
219
+ "eval_steps_per_second": 0.156,
220
+ "step": 140
221
+ },
222
+ {
223
+ "epoch": 0.7246376811594203,
224
+ "grad_norm": 2.797487735748291,
225
+ "learning_rate": 1.7235976171826803e-05,
226
+ "loss": 0.8013,
227
+ "step": 150
228
+ },
229
+ {
230
+ "epoch": 0.7246376811594203,
231
+ "eval_loss": 0.9037850499153137,
232
+ "eval_runtime": 89.6764,
233
+ "eval_samples_per_second": 9.768,
234
+ "eval_steps_per_second": 0.156,
235
+ "step": 150
236
+ },
237
+ {
238
+ "epoch": 0.7729468599033816,
239
+ "grad_norm": 2.6389381885528564,
240
+ "learning_rate": 1.210701233624601e-05,
241
+ "loss": 0.8531,
242
+ "step": 160
243
+ },
244
+ {
245
+ "epoch": 0.7729468599033816,
246
+ "eval_loss": 0.9019114971160889,
247
+ "eval_runtime": 89.6735,
248
+ "eval_samples_per_second": 9.769,
249
+ "eval_steps_per_second": 0.156,
250
+ "step": 160
251
+ },
252
+ {
253
+ "epoch": 0.821256038647343,
254
+ "grad_norm": 2.7340869903564453,
255
+ "learning_rate": 7.744344564388566e-06,
256
+ "loss": 0.8288,
257
+ "step": 170
258
+ },
259
+ {
260
+ "epoch": 0.821256038647343,
261
+ "eval_loss": 0.9006927609443665,
262
+ "eval_runtime": 89.6331,
263
+ "eval_samples_per_second": 9.773,
264
+ "eval_steps_per_second": 0.156,
265
+ "step": 170
266
+ },
267
+ {
268
+ "epoch": 0.8695652173913043,
269
+ "grad_norm": 2.7714426517486572,
270
+ "learning_rate": 4.270814884295176e-06,
271
+ "loss": 0.8686,
272
+ "step": 180
273
+ },
274
+ {
275
+ "epoch": 0.8695652173913043,
276
+ "eval_loss": 0.8999435901641846,
277
+ "eval_runtime": 89.7018,
278
+ "eval_samples_per_second": 9.766,
279
+ "eval_steps_per_second": 0.156,
280
+ "step": 180
281
+ },
282
+ {
283
+ "epoch": 0.9178743961352657,
284
+ "grad_norm": 3.1497857570648193,
285
+ "learning_rate": 1.7842293753365276e-06,
286
+ "loss": 0.8859,
287
+ "step": 190
288
+ },
289
+ {
290
+ "epoch": 0.9178743961352657,
291
+ "eval_loss": 0.9000065326690674,
292
+ "eval_runtime": 89.6862,
293
+ "eval_samples_per_second": 9.767,
294
+ "eval_steps_per_second": 0.156,
295
+ "step": 190
296
+ },
297
+ {
298
+ "epoch": 0.966183574879227,
299
+ "grad_norm": 2.9853219985961914,
300
+ "learning_rate": 3.546041888197535e-07,
301
+ "loss": 0.8611,
302
+ "step": 200
303
+ },
304
+ {
305
+ "epoch": 0.966183574879227,
306
+ "eval_loss": 0.8998275399208069,
307
+ "eval_runtime": 89.6735,
308
+ "eval_samples_per_second": 9.769,
309
+ "eval_steps_per_second": 0.156,
310
+ "step": 200
311
+ }
312
+ ],
313
+ "logging_steps": 10,
314
+ "max_steps": 207,
315
+ "num_input_tokens_seen": 0,
316
+ "num_train_epochs": 1,
317
+ "save_steps": 20,
318
+ "stateful_callbacks": {
319
+ "TrainerControl": {
320
+ "args": {
321
+ "should_epoch_stop": false,
322
+ "should_evaluate": false,
323
+ "should_log": false,
324
+ "should_save": true,
325
+ "should_training_stop": false
326
+ },
327
+ "attributes": {}
328
+ }
329
+ },
330
+ "total_flos": 3.99869509435392e+16,
331
+ "train_batch_size": 16,
332
+ "trial_name": null,
333
+ "trial_params": null
334
+ }
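Read alongside the dataset figures in the previous model card, this trainer state is consistent with a single text-only epoch over CuPer_Text: `max_steps = 207` at `train_batch_size = 16` covers about 207 × 16 ≈ 3,300 examples (the card lists 3,268 training samples), and each evaluation pass ran ≈ 89.7 s at ≈ 9.77 samples/s ≈ 876 examples, matching the 876-example test split. The best checkpoint is step 200 with eval_loss ≈ 0.900. This is an inference from the numbers above (it assumes no gradient accumulation), not something stated in the repository.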
training_args.bin ADDED
Binary file (5.82 kB). View file
 
video_preprocessor_config.json ADDED
@@ -0,0 +1,43 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "crop_size": null,
3
+ "data_format": "channels_first",
4
+ "default_to_square": true,
5
+ "device": null,
6
+ "do_center_crop": null,
7
+ "do_convert_rgb": true,
8
+ "do_normalize": true,
9
+ "do_pad": null,
10
+ "do_rescale": true,
11
+ "do_resize": true,
12
+ "do_sample_frames": false,
13
+ "fps": null,
14
+ "image_mean": [
15
+ 0.48145466,
16
+ 0.4578275,
17
+ 0.40821073
18
+ ],
19
+ "image_std": [
20
+ 0.26862954,
21
+ 0.26130258,
22
+ 0.27577711
23
+ ],
24
+ "input_data_format": null,
25
+ "max_frames": 768,
26
+ "max_pixels": 12845056,
27
+ "merge_size": 2,
28
+ "min_frames": 4,
29
+ "min_pixels": 3136,
30
+ "num_frames": null,
31
+ "patch_size": 14,
32
+ "processor_class": "Qwen2_5_VLProcessor",
33
+ "resample": 3,
34
+ "rescale_factor": 0.00392156862745098,
35
+ "size": {
36
+ "longest_edge": 12845056,
37
+ "shortest_edge": 3136
38
+ },
39
+ "size_divisor": null,
40
+ "temporal_patch_size": 2,
41
+ "video_metadata": null,
42
+ "video_processor_type": "Qwen2VLVideoProcessor"
43
+ }
vocab.json ADDED
The diff for this file is too large to render. See raw diff