Hanzo Dev committed on
Commit 7522691 · 1 Parent(s): 8e8625a

Add Zen VL training space with ADP+xLAM datasets

Files changed (3)
  1. README.md +100 -5
  2. app.py +398 -0
  3. requirements.txt +9 -0
README.md CHANGED
@@ -1,12 +1,107 @@
  ---
- title: Zen Vl Training
- emoji: 😻
  colorFrom: blue
- colorTo: blue
  sdk: gradio
- sdk_version: 5.49.1
  app_file: app.py
  pinned: false
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
  ---
+ title: Zen VL Training
+ emoji: 🧘
  colorFrom: blue
+ colorTo: purple
  sdk: gradio
+ sdk_version: 4.0.0
  app_file: app.py
  pinned: false
+ license: apache-2.0
+ hardware: a10g-large
  ---

+ # 🧘 Zen VL Training Space
+
+ Train zen-vl vision-language models with combined ADP+xLAM datasets on HuggingFace Pro GPUs.
+
+ ## Features
+
+ - **Multi-Size Support**: Train 4B, 8B, or 30B parameter models
+ - **GPU Options**: A10G (24GB), A100-Large (40GB), A100 (80GB)
+ - **Combined Datasets**: Agent Data Protocol (ADP) + xLAM Function Calling
+ - **Auto-Upload**: Trained models automatically uploaded to HuggingFace Hub
+ - **Real-time Monitoring**: Live training logs and progress tracking
+
+ ## Datasets
+
+ ### Agent Data Protocol (ADP)
+ - **Source**: [neulab/agent-data-collection](https://huggingface.co/datasets/neulab/agent-data-collection)
+ - **Size**: ~220k agent trajectories (8.4GB)
+ - **Citation**: arXiv:2510.24702
+
+ ### xLAM Function Calling 60k
+ - **Source**: [Salesforce/xlam-function-calling-60k](https://huggingface.co/datasets/Salesforce/xlam-function-calling-60k)
+ - **Size**: 60k function calling examples (101MB)
+ - **Citation**: Salesforce Research
+
+ ## Training Configuration
+
+ ### 4B Model (A10G - 24GB)
+ - Batch size: 1
+ - Gradient accumulation: 8
+ - Max samples: 30,000
+ - Estimated time: 6-8 hours
+
+ ### 8B Model (A100-Large - 40GB)
+ - Batch size: 2
+ - Gradient accumulation: 8
+ - Max samples: 50,000
+ - Estimated time: 10-12 hours
+
+ ### 30B Model (A100 - 80GB)
+ - Batch size: 4
+ - Gradient accumulation: 8
+ - Max samples: 100,000
+ - Estimated time: 20-24 hours
+
+ ## Usage
+
+ 1. Select model size (4b, 8b, or 30b)
+ 2. Choose GPU type (a10g, a100-large, or a100)
+ 3. Click "Start Training"
+ 4. Monitor progress in real-time
+ 5. Trained model automatically uploads to `zenlm/zen-vl-{size}-agent`
+
+ ## Requirements
+
+ - HuggingFace Pro account (for GPU access)
+ - HF_TOKEN environment variable set
+ - Write access to zenlm organization
+
+ ## Output Models
+
+ Trained models will be uploaded to:
+ - `zenlm/zen-vl-4b-agent`
+ - `zenlm/zen-vl-8b-agent`
+ - `zenlm/zen-vl-30b-agent`
+
+ ## Technical Details
+
+ **Base Architecture**: Qwen3-VL
+ **Training Method**: Supervised Fine-Tuning (SFT)
+ **Data Mixture**: 80% ADP, 20% xLAM
+ **Precision**: bfloat16
+ **Framework**: Transformers + Accelerate
+
+ ## License
+
+ Apache 2.0 - See [LICENSE](https://github.com/zenlm/zen-vl/blob/main/LICENSE)
+
+ ## Citation
+
+ ```bibtex
+ @software{zen-vl-2025,
+ title={Zen VL: Vision-Language Models with Function Calling},
+ author={Zen AI Team},
+ year={2025},
+ url={https://github.com/zenlm/zen-vl}
+ }
+ ```
+
+ ## Links
+
+ - **Website**: https://zenlm.org
+ - **GitHub**: https://github.com/zenlm/zen-vl
+ - **Models**: https://huggingface.co/zenlm
+ - **Paper**: Coming soon
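
The Training Configuration presets in the README imply an effective batch size and an optimizer-step count for each model size. The sketch below works through that arithmetic, assuming a single GPU and the 3 epochs that app.py passes as `num_train_epochs`. `PRESETS`, `effective_batch`, and `optimizer_steps` are illustrative names, not part of the commit. Note also that the README keys the presets by model size, whereas app.py selects them by GPU type.

```python
# Illustrative arithmetic only. PRESETS, effective_batch and optimizer_steps
# are hypothetical names; the numbers are copied from the README presets.
PRESETS = {
    "4b": {"batch_size": 1, "grad_accum": 8, "max_samples": 30_000},
    "8b": {"batch_size": 2, "grad_accum": 8, "max_samples": 50_000},
    "30b": {"batch_size": 4, "grad_accum": 8, "max_samples": 100_000},
}
EPOCHS = 3  # num_train_epochs in app.py


def effective_batch(preset: dict) -> int:
    """Samples consumed per optimizer step on one device."""
    return preset["batch_size"] * preset["grad_accum"]


def optimizer_steps(preset: dict, epochs: int = EPOCHS) -> int:
    """Same formula app.py logs: floor(samples / effective batch) * epochs."""
    return preset["max_samples"] // effective_batch(preset) * epochs


if __name__ == "__main__":
    for size, preset in PRESETS.items():
        print(size, effective_batch(preset), optimizer_steps(preset))
```

These figures are upper bounds: they assume the full `max_samples` budget reaches the trainer, which depends on how many rows the data loaders and the mixture step actually return.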
app.py ADDED
@@ -0,0 +1,398 @@
+ """
+ Zen VL Training Space - HuggingFace Pro GPU Training
+ Trains zen-vl-4b with combined ADP+xLAM datasets
+ """
+
+ import os
+ import sys
+ import time
+ import json
+ import random
+ import logging
+ from pathlib import Path
+ from typing import List, Dict, Any
+
+ import torch
+ from transformers import (
+     Qwen3VLForConditionalGeneration,
+     Qwen3VLProcessor,
+     TrainingArguments,
+     Trainer,
+ )
+ from datasets import load_dataset, Dataset
+ import gradio as gr
+
+ # Setup logging
+ logging.basicConfig(
+     level=logging.INFO,
+     format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
+ )
+ logger = logging.getLogger(__name__)
+
+ # Global training state
+ training_state = {
+     "status": "idle",
+     "progress": 0,
+     "current_step": 0,
+     "total_steps": 0,
+     "loss": 0.0,
+     "logs": []
+ }
+
+ def log_message(message: str):
+     """Add message to training logs"""
+     timestamp = time.strftime("%Y-%m-%d %H:%M:%S")
+     log_entry = f"[{timestamp}] {message}"
+     training_state["logs"].append(log_entry)
+     logger.info(message)
+     return log_entry
+
+ class ZenVLTrainer:
+     def __init__(self, model_size="4b", gpu_type="a10g"):
+         self.model_size = model_size
+         self.gpu_type = gpu_type
+         self.model_name = f"zenlm/zen-vl-{model_size}-instruct"
+         self.output_name = f"zenlm/zen-vl-{model_size}-agent"
+
+         # GPU-specific configs
+         self.configs = {
+             "a10g": {
+                 "batch_size": 1,
+                 "gradient_accumulation": 8,
+                 "max_samples": 30000,
+                 "learning_rate": 2e-5,
+             },
+             "a100-large": {
+                 "batch_size": 2,
+                 "gradient_accumulation": 8,
+                 "max_samples": 50000,
+                 "learning_rate": 2e-5,
+             },
+             "a100": {
+                 "batch_size": 4,
+                 "gradient_accumulation": 8,
+                 "max_samples": 100000,
+                 "learning_rate": 2e-5,
+             }
+         }
+
+         self.config = self.configs.get(gpu_type, self.configs["a10g"])
+         log_message(f"Initialized Zen VL Trainer for {model_size} on {gpu_type}")
+         log_message(f"Config: {self.config}")
+
+     def load_adp_data(self, max_samples: int = None) -> List[Dict[str, Any]]:
+         """Load Agent Data Protocol dataset"""
+         log_message("Loading ADP dataset...")
+
+         data_dir = Path("data/adp")
+         all_data = []
+
+         if data_dir.exists():
+             # Load from local cache
+             for json_file in data_dir.glob("*.jsonl"):
+                 log_message(f"Loading {json_file.name}...")
+                 with open(json_file, 'r') as f:
+                     for line in f:
+                         if line.strip():
+                             all_data.append(json.loads(line))
+                             if max_samples and len(all_data) >= max_samples:
+                                 break
+         else:
+             # Download from HuggingFace
+             log_message("Downloading ADP dataset from HuggingFace...")
+             configs = [
+                 'agenttuning_os', 'agenttuning_kg', 'agenttuning_db',
+                 'synatra', 'code_feedback', 'go-browse-wa'
+             ]
+
+             for config in configs:
+                 try:
+                     dataset = load_dataset(
+                         "neulab/agent-data-collection",
+                         config,
+                         split="train",
+                         streaming=True
+                     )
+
+                     for i, example in enumerate(dataset):
+                         all_data.append(example)
+                         if max_samples and len(all_data) >= max_samples:
+                             break
+
+                     log_message(f"Loaded {len(all_data)} samples from {config}")
+
+                     if max_samples and len(all_data) >= max_samples:
+                         break
+
+                 except Exception as e:
+                     log_message(f"Warning: Could not load {config}: {e}")
+                     continue
+
+         log_message(f"Loaded {len(all_data)} ADP samples")
+         return all_data
+
+     def load_xlam_data(self, max_samples: int = None) -> List[Dict[str, Any]]:
+         """Load xLAM function calling dataset"""
+         log_message("Loading xLAM dataset...")
+
+         data_dir = Path("data/xlam")
+         all_data = []
+
+         if data_dir.exists():
+             # Load from local cache
+             json_file = data_dir / "xlam_converted.jsonl"
+             if json_file.exists():
+                 with open(json_file, 'r') as f:
+                     for line in f:
+                         if line.strip():
+                             all_data.append(json.loads(line))
+                             if max_samples and len(all_data) >= max_samples:
+                                 break
+         else:
+             # Download from HuggingFace
+             log_message("Downloading xLAM dataset from HuggingFace...")
+             try:
+                 dataset = load_dataset(
+                     "Salesforce/xlam-function-calling-60k",
+                     split="train"
+                 )
+
+                 for i, example in enumerate(dataset):
+                     all_data.append(example)
+                     if max_samples and len(all_data) >= max_samples:
+                         break
+
+                 log_message(f"Loaded {len(all_data)} xLAM samples")
+
+             except Exception as e:
+                 log_message(f"Error loading xLAM: {e}")
+
+         return all_data
+
+     def create_balanced_mixture(
+         self,
+         adp_data: List[Dict],
+         xlam_data: List[Dict],
+         adp_weight: float = 0.80,
+         xlam_weight: float = 0.20
+     ) -> List[Dict]:
+         """Create balanced mixture of ADP and xLAM data"""
+         log_message(f"Creating balanced mixture: {adp_weight:.0%} ADP, {xlam_weight:.0%} xLAM")
+
+         total_size = min(len(adp_data), int(len(xlam_data) / xlam_weight))
+         adp_target = int(total_size * adp_weight)
+         xlam_target = int(total_size * xlam_weight)
+
+         adp_sample = random.sample(adp_data, min(adp_target, len(adp_data)))
+         xlam_sample = random.sample(xlam_data, min(xlam_target, len(xlam_data)))
+
+         combined = adp_sample + xlam_sample
+         random.shuffle(combined)
+
+         log_message(f"Created mixture: {len(adp_sample)} ADP + {len(xlam_sample)} xLAM = {len(combined)} total")
+         return combined
+
+     def train(self):
+         """Main training function"""
+         try:
+             training_state["status"] = "preparing"
+             log_message("=" * 80)
+             log_message("Starting Zen VL Training on HuggingFace Space")
+             log_message("=" * 80)
+
+             # Load model and processor
+             training_state["status"] = "loading_model"
+             log_message(f"Loading model: {self.model_name}")
+
+             model = Qwen3VLForConditionalGeneration.from_pretrained(
+                 self.model_name,
+                 torch_dtype=torch.bfloat16,
+                 device_map="auto"
+             )
+
+             processor = Qwen3VLProcessor.from_pretrained(self.model_name)
+             log_message("Model and processor loaded successfully")
+
+             # Load datasets
+             training_state["status"] = "loading_data"
+             max_samples = self.config["max_samples"]
+
+             adp_data = self.load_adp_data(max_samples=int(max_samples * 0.8))
+             xlam_data = self.load_xlam_data(max_samples=int(max_samples * 0.2))
+
+             # Create mixture
+             combined_data = self.create_balanced_mixture(adp_data, xlam_data)
+
+             # Convert to HuggingFace Dataset
+             dataset = Dataset.from_list(combined_data)
+             log_message(f"Created dataset with {len(dataset)} examples")
+
+             # Training arguments
+             training_state["status"] = "configuring"
+             output_dir = f"./output/{self.model_size}"
+
+             training_args = TrainingArguments(
+                 output_dir=output_dir,
+                 num_train_epochs=3,
+                 per_device_train_batch_size=self.config["batch_size"],
+                 gradient_accumulation_steps=self.config["gradient_accumulation"],
+                 learning_rate=self.config["learning_rate"],
+                 warmup_steps=500,
+                 logging_steps=10,
+                 save_steps=500,
+                 save_total_limit=3,
+                 fp16=False,
+                 bf16=True,
+                 push_to_hub=True,
+                 hub_model_id=self.output_name,
+                 hub_strategy="every_save",
+                 report_to="tensorboard",
+             )
+
+             log_message("Training configuration:")
+             log_message(f" Epochs: {training_args.num_train_epochs}")
+             log_message(f" Batch size: {training_args.per_device_train_batch_size}")
+             log_message(f" Gradient accumulation: {training_args.gradient_accumulation_steps}")
+             log_message(f" Learning rate: {training_args.learning_rate}")
+             log_message(f" Total samples: {len(dataset)}")
+
+             # Calculate total steps
+             total_steps = (
+                 len(dataset)
+                 // (training_args.per_device_train_batch_size * training_args.gradient_accumulation_steps)
+                 * training_args.num_train_epochs
+             )
+             training_state["total_steps"] = total_steps
+             log_message(f" Total training steps: {total_steps}")
+
+             # Initialize trainer
+             training_state["status"] = "training"
+             log_message("Initializing trainer...")
+
+             trainer = Trainer(
+                 model=model,
+                 args=training_args,
+                 train_dataset=dataset,
+             )
+
+             # Start training
+             log_message("=" * 80)
+             log_message("TRAINING STARTED")
+             log_message("=" * 80)
+
+             result = trainer.train()
+
+             # Training completed
+             training_state["status"] = "uploading"
+             log_message("=" * 80)
+             log_message("TRAINING COMPLETED")
+             log_message("=" * 80)
+             log_message(f"Final loss: {result.training_loss:.4f}")
+
+             # Push to hub
+             log_message(f"Uploading model to {self.output_name}...")
+             trainer.push_to_hub()
+
+             training_state["status"] = "completed"
+             training_state["progress"] = 100
+             log_message("=" * 80)
+             log_message("SUCCESS! Model uploaded to HuggingFace")
+             log_message("=" * 80)
+
+             return "Training completed successfully!"
+
+         except Exception as e:
+             training_state["status"] = "error"
+             error_msg = f"Training failed: {str(e)}"
+             log_message(error_msg)
+             return error_msg
+
+ def get_training_status():
+     """Get current training status for Gradio UI"""
+     status = training_state["status"]
+     progress = training_state["progress"]
+     current_step = training_state["current_step"]
+     total_steps = training_state["total_steps"]
+     loss = training_state["loss"]
+
+     status_text = {
+         "idle": "⏸️ Ready to start training",
+         "preparing": "🔧 Preparing training environment...",
+         "loading_model": "📦 Loading model and processor...",
+         "loading_data": "📚 Loading training datasets...",
+         "configuring": "⚙️ Configuring training parameters...",
+         "training": f"🚀 Training in progress: {current_step}/{total_steps} steps",
+         "uploading": "☁️ Uploading model to HuggingFace...",
+         "completed": "✅ Training completed successfully!",
+         "error": "❌ Training failed"
+     }
+
+     return status_text.get(status, status), progress, "\n".join(training_state["logs"][-50:])
+
+ def start_training(model_size, gpu_type):
+     """Start training job"""
+     log_message(f"Starting training for {model_size} on {gpu_type}")
+     trainer = ZenVLTrainer(model_size=model_size, gpu_type=gpu_type)
+     result = trainer.train()
+     return result
+
+ # Gradio Interface
+ with gr.Blocks(title="Zen VL Training") as demo:
+     gr.Markdown("""
+     # 🧘 Zen VL Training Space
+
+     Train zen-vl models with combined ADP+xLAM datasets on HuggingFace Pro GPUs.
+
+     **Datasets:**
+     - Agent Data Protocol (ADP): ~220k agent trajectories
+     - xLAM Function Calling: 60k function calling examples
+
+     **Training Time Estimates:**
+     - 4B model on A10G: ~6-8 hours
+     - 8B model on A100: ~10-12 hours
+     - 30B model on A100-80GB: ~20-24 hours
+     """)
+
+     with gr.Row():
+         model_size = gr.Dropdown(
+             choices=["4b", "8b", "30b"],
+             value="4b",
+             label="Model Size"
+         )
+         gpu_type = gr.Dropdown(
+             choices=["a10g", "a100-large", "a100"],
+             value="a10g",
+             label="GPU Type"
+         )
+
+     start_btn = gr.Button("🚀 Start Training", variant="primary")
+
+     status_text = gr.Textbox(label="Status", value="Ready to start training")
+     progress_bar = gr.Slider(minimum=0, maximum=100, value=0, label="Progress")
+     logs_text = gr.Textbox(label="Training Logs", lines=20, max_lines=50)
+
+     # Auto-refresh status every 10 seconds
+     demo.load(
+         get_training_status,
+         None,
+         [status_text, progress_bar, logs_text],
+         every=10
+     )
+
+     start_btn.click(
+         start_training,
+         inputs=[model_size, gpu_type],
+         outputs=[status_text]
+     )
+
+ if __name__ == "__main__":
+     # Check if running in HF Space
+     if os.environ.get("SPACE_ID"):
+         log_message(f"Running in HuggingFace Space: {os.environ['SPACE_ID']}")
+
+     # Launch Gradio interface
+     demo.launch(
+         server_name="0.0.0.0",
+         server_port=7860,
+         share=False
+     )
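
The sizing arithmetic in `create_balanced_mixture` can be checked in isolation. `train()` loads 80% of `max_samples` from ADP and 20% from xLAM, and the mixture step then caps the total at `len(adp_data)`. A reading of the code suggests this keeps only about 80% of what was loaded. The sketch below reproduces that formula as `source_sizes` and, for comparison, shows one possible alternative (`alt_sizes`) that divides each pool by its own weight. Both function names are hypothetical, and the alternative is a reviewer-style suggestion, not the author's method.

```python
# Sketch of the mixture sizing only (no sampling). source_sizes mirrors the
# formula in create_balanced_mixture; alt_sizes is a hypothetical alternative.

def source_sizes(n_adp: int, n_xlam: int,
                 adp_weight: float = 0.80, xlam_weight: float = 0.20):
    """Sample counts as computed in the commit's create_balanced_mixture."""
    total = min(n_adp, int(n_xlam / xlam_weight))
    adp_target = int(total * adp_weight)
    xlam_target = int(total * xlam_weight)
    return min(adp_target, n_adp), min(xlam_target, n_xlam)


def alt_sizes(n_adp: int, n_xlam: int,
              adp_weight: float = 0.80, xlam_weight: float = 0.20):
    """Alternative: the largest total that both pools can supply at their weights."""
    total = min(int(n_adp / adp_weight), int(n_xlam / xlam_weight))
    adp_target = int(total * adp_weight)
    xlam_target = int(total * xlam_weight)
    return min(adp_target, n_adp), min(xlam_target, n_xlam)


if __name__ == "__main__":
    # a10g preset: max_samples=30000, so train() requests 24000 ADP + 6000 xLAM.
    print(source_sizes(24_000, 6_000))
    print(alt_sizes(24_000, 6_000))
```

With the a10g preset, and assuming both loaders return their full request, the formula from the commit yields 19,200 ADP + 4,800 xLAM = 24,000 rows, while the alternative keeps all 30,000. Both preserve the 80/20 ratio.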
requirements.txt ADDED
@@ -0,0 +1,9 @@
+ transformers>=4.57.1
+ torch>=2.0.0
+ datasets>=2.14.0
+ accelerate>=0.27.0
+ pillow>=10.0.0
+ gradio>=4.0.0
+ huggingface-hub>=0.20.0
+ tensorboard>=2.15.0
+ pydantic>=2.0.0
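
The minimum versions in requirements.txt can be compared against what is installed in the Space before a long training run starts. The sketch below is a standard-library check, assuming every line uses the plain `name>=version` form seen above. `parse_requirement`, `version_tuple`, `at_least`, and `check` are hypothetical helpers, and the version comparison is deliberately naive: it looks at leading numeric components only, so a proper parser such as the `packaging` library would be more robust in real use.

```python
from importlib import metadata
from itertools import takewhile

REQUIREMENTS = """\
transformers>=4.57.1
torch>=2.0.0
datasets>=2.14.0
accelerate>=0.27.0
pillow>=10.0.0
gradio>=4.0.0
huggingface-hub>=0.20.0
tensorboard>=2.15.0
pydantic>=2.0.0
"""


def parse_requirement(line: str):
    """Split 'name>=version' into (name, minimum_version)."""
    name, _, minimum = line.strip().partition(">=")
    return name, minimum


def version_tuple(version: str):
    """Leading numeric components only, e.g. '2.1.0+cu121' -> (2, 1, 0)."""
    parts = []
    for piece in version.split("."):
        digits = "".join(takewhile(str.isdigit, piece))
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def at_least(installed: str, minimum: str) -> bool:
    """Compare after zero-padding so that '2.0' counts as equal to '2.0.0'."""
    a, b = version_tuple(installed), version_tuple(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check(requirements: str = REQUIREMENTS) -> dict:
    """Map each package to 'ok', 'missing', or a 'too old' message."""
    report = {}
    for line in requirements.splitlines():
        if not line.strip():
            continue
        name, minimum = parse_requirement(line)
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        ok = at_least(installed, minimum)
        report[name] = "ok" if ok else f"too old ({installed} < {minimum})"
    return report


if __name__ == "__main__":
    for package, status in check().items():
        print(f"{package}: {status}")
```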