Config arguments Explanation

Model configuration arguments

Name	Type	Description	Default Value
model.config_path	str	Path to the model huggingface configuration, like `config.json`	model.model_path
model.model_path	str	Path to the model parameter file. If empty, random initialization will be performed	None
model.tokenizer_path	str	Path to the tokenizer	model.model_path
model.encoders	dict	Configuration file for multi-modal encoders	{}
model.decoders	dict	Configuration file for multi-modal decoders	{}
model.input_encoder	str: {"encoder", "decoder"}	Use the encoder of the encoder or decoder to encode the input image	encoder
model.output_encoder	str: {"encoder", "decoder"}	Use the encoder of the encoder or decoder to encode the output image	decoder
model.encode_target	bool	Used to encode the training data for the diffusion model	False

Data configuration arguments

Name	Type	Description	Default Value
data.train_path	str	Path of training dataset	Required
data.train_size	int	Total number of tokens in the training set	10,000,000
data.data_type	str: {"plaintext", "conversation"}	Dataset type.	conversation
data.dataloader_type	str: {"native"}	Use the pytorch dataloader or	native
data.datasets_type	str: {"mapping", "iterable"}	Dataset type. `IterativeDataset` or `MappingDataset`, or your custom datsets	mapping
data.text_keys	str: {"content_split", "messages"}	The key corresponding to the text samples in the data dictionary. Generally, it is "content_split" for pretraining and "messages" for SFT.	content_split
data.image_keys	str	The key corresponding to the image samples in the data dictionary. Generally, it is "images".	images
data.chat_template	str	Name of the chat template.	default
data.max_seq_len	int	Maximum training length.	2048
data.num_workers	int	Number of multi-process loaders for the dataloader.	4
data.drop_last	bool	Whether to discard the remaining data at the end.	True
data.pin_memory	bool	Whether to pin the data in the CPU memory.	True
data.prefetch_factor	int	Number of samples preprocessed by the dataloader.	2

Training configuration arguments

Name	Type	Description	Default Value
train.output_dir	str	Path to save the model.	Required
train.lr	float	Maximum learning rate.	5e - 5
train.lr_min	float	Minimum learning rate.	1e - 7
train.weight_decay	float	Weight decay coefficient.	0
train.optimizer	str: {"adamw", "anyprecision_adamw"}	Name of the optimizer.	adamw
train.max_grad_norm	float	Gradient clipping norm.	1.0
train.micro_batch_size	int	Number of samples processed simultaneously on each GPU.	1
train.global_batch_size	int	Global batch size, which must be a multiple of the number of GPUs.	train.micro_batch_size * n_gpus
train.num_train_epochs	int	Number of training epochs.	1
train.rmpad	bool	Whether to use rmpad training based on cu_seqlens.	False
train.rmpad_with_pos_ids	bool	Whether to use rmpad training based on position_ids.	False
train.dyn_bsz_margin	int	Number of pad tokens in the dynamic batch.	0
train.dyn_bsz_runtime	str: {"main", "worker"}	Running process of the dynamic batch.	worker
train.bsz_warmup_ratio	float	Proportion of batch size warmup in the total number of steps.	0
train.lr_warmup_ratio	float	Proportion of learning rate warmup in the total number of steps.	0
train.lr_decay_style	str: {"constant", "linear", "cosine"}	Name of the learning rate scheduler.	cosine
train.lr_decay_ratio	float	Proportion of learning rate decay in the total number of steps	1.0
train.use_doptim	bool	Whether to use the distributed optimizer during Vescale training(no use for torch fsdp)	False
train.enable_mixed_precision	bool	Whether to enable mixed precision training (higher memory usage but more stable)	True
train.enable_gradient_checkpointing	bool	Whether to enable gradient checkpointing to reduce memory usage.	True
train.enable_reentrant	bool	Whether to enable reentrant in gradient checkpointing.	True
train.enable_full_shard	bool	Whether to use full sharding FSDP (equivalent to ZeRO3).	True
train.enable_fsdp_offload	bool	Whether to enable FSDP CPU offloading (only supported for FSDP1).	False
train.enable_activation_offload	bool	Whether to enable activation value CPU offloading.	False
train.activation_gpu_limit	float	Size of the activation values retained on the GPU (in GB).	0.0
train.enable_manual_eager	bool	Whether to use manual eager during Vescale training.	False
train.init_device: meta	str	"cpu", "cuda", "meta", init device for model initialization. use "meta" or cpu for large model(>30B)	cuda
train.enable_full_determinism	bool	Whether to enable deterministic mode (for bitwise alignment).	False
train.empty_cache_steps	int	Number of steps between two cache clearings. -1 means not enabled.	500
train.data_parallel_mode	str: {"ddp", "fsdp1", "fsdp2"}	Data parallel algorithm.	ddp
train.tensor_parallel_size	int	Tensor parallel size (currently only supported for vescale training).	1
train.pipeline_parallel_size	int	Pipeline parallel size (currently not supported).	1
train.ulysses_parallel_size	int	Ulysses sequence parallel size (currently only supported for P6dense and Qwen2VL).	1
train.context_parallel_size	int	Ring sequence parallel size (currently not supported)	1
train.expert_parallel_size	int	Expert parallel size (currently only supported DeepseekMOE)	1
train.load_checkpoint_path	str	Path to the omnistore checkpoint for resuming training.	None
train.save_steps	int	Number of steps between two checkpoint saves. 0 means invalid.	0
train.save_epochs	int	Number of epochs between two checkpoint saves. 0 means invalid.	1
train.save_hf_weights	bool	Whether to save the model weights in the huggingface format. It is recommended to set it to False for models > 30B to prevent NCCL timeout. You can convert it after training.	True
train.seed	int	Random seed.	42
train.use_wandb	bool	Whether to enable byted wandb experiment logging.	True
train.wandb_project	str	Name of the wandb experiment project.	LingBotVLA
train.wandb_name	str	Name of the wandb experiment.	None
train.enable_profiling	bool	Whether to use torch profiling.	False
train.profile_start_step	int	Starting step of profiling.	1
train.profile_end_step	int	Ending step of profiling.	2
train.profile_trace_dir	str	Path to save the profiling results.	./trace
train.profile_record_shapes	bool	Whether to record the shapes of the input tensors.	True
train.profile_profile_memory	bool	Whether to record the memory usage.	True
train.profile_with_stack	bool	Whether to record the stack information.	True
train.max_steps	int	Number of steps per training epoch (only used for debugging).	None

Inference configuration arguments

Name	Type	Description	Default Value
infer.model_path	str	Path to the model parameter file.	Required
infer.tokenizer_path	str	Path to the tokenizer.	model.model_path
infer.seed	int	Random seed.	42
infer.do_sample	bool	Whether to enable sampling.	True
infer.temperature	float	Sampling temperature.	1.0
infer.top_p	float	Sampling Top P value.	1.0
infer.max_tokens	int	Maximum number of tokens generated each time.	1024