---
library_name: transformers
license: apache-2.0
datasets:
- RekaAI/VibeEval
base_model:
- meta-llama/Llama-3.2-11B-Vision-Instruct
pipeline_tag: image-text-to-text
---

# Model Card for hiiamsid/llama-3.2-vision-11B-ROCO

This is a fine-tuned version of meta-llama/Llama-3.2-11B-Vision-Instruct, trained on the MedIR/roco dataset using FSDP on 2 A100 GPUs.

## Model Details

### Model Description

- **Developed by:** hiiamsid
- **Model type:** multimodal (image/text-to-text)
- **Language(s) (NLP):** multilingual
- **License:** Apache License 2.0
- **Finetuned from model:** meta-llama/Llama-3.2-11B-Vision-Instruct

## How to Get Started with the Model

```python
import requests
from PIL import Image
import torch
from transformers import MllamaForConditionalGeneration, AutoProcessor

base_model = "hiiamsid/llama-3.2-vision-11B-ROCO"

processor = AutoProcessor.from_pretrained(base_model)

# Load the model in bfloat16 and let accelerate place it across available devices
model = MllamaForConditionalGeneration.from_pretrained(
    base_model,
    low_cpu_mem_usage=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

url = "https://lh7-rt.googleusercontent.com/docsz/AD_4nXcz-J3iR2bEGcCSLzay07Rqfj5tTakp2EMTTN0x6nKYGLS5yWl0unoSpj2S0-mrWpDtMqjl1fAgH6pVkKJekQEY_kwzL6QNOdf143Yt66znQ0EpfLvx6CLFOqw41oeOYmhPZ6Qrlb5AjEr4AenIOgBMTWTD?key=vhLUYntaS9QOx531XpJH3g"
image = Image.open(requests.get(url, stream=True).raw)

# One image placeholder followed by the text query, rendered via the chat template
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe the tutorial feature image."}
    ]}
]
input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, input_text, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=120)
print(processor.decode(output[0]))
```
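
Note that `processor.decode(output[0])` returns the prompt together with the generated answer; to print only the newly generated text, one can slice off the prompt tokens:

```python
# Keep only tokens generated after the prompt, and skip special tokens
generated_tokens = output[0][inputs["input_ids"].shape[-1]:]
print(processor.decode(generated_tokens, skip_special_tokens=True))
```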

## Training Details

### Training Data

MedIR/roco: https://huggingface.co/datasets/MedIR/roco (only 1,000 samples were used for training)
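
The exact subset is not documented; a minimal sketch of drawing 1,000 samples with the `datasets` library (the split name and selection strategy are assumptions):

```python
from datasets import load_dataset

# Hypothetical selection: shuffle with the training seed and keep 1,000 examples
roco = load_dataset("MedIR/roco", split="train")
train_subset = roco.shuffle(seed=42).select(range(1000))
```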

### Training Procedure

Trained with FSDP using a transformer auto-wrapping policy, a mixed-precision policy (bfloat16), activation checkpointing, etc., and saved with the FULL_STATE_DICT checkpoint type.
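
A minimal sketch of such a setup with PyTorch FSDP is shown below. The Mllama layer classes to wrap and the checkpointing target are assumptions based on the transformers implementation; the actual run may differ in detail.

```python
import functools

import torch
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    apply_activation_checkpointing,
)
from torch.distributed.fsdp import (
    FullStateDictConfig,
    FullyShardedDataParallel as FSDP,
    MixedPrecision,
    ShardingStrategy,
    StateDictType,
)
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
from transformers.models.mllama.modeling_mllama import (
    MllamaCrossAttentionDecoderLayer,
    MllamaSelfAttentionDecoderLayer,
    MllamaVisionEncoderLayer,
)

# Wrap each transformer block (text decoder and vision encoder) as its own FSDP unit
wrap_policy = functools.partial(
    transformer_auto_wrap_policy,
    transformer_layer_cls={
        MllamaSelfAttentionDecoderLayer,
        MllamaCrossAttentionDecoderLayer,
        MllamaVisionEncoderLayer,
    },
)

# pure_bf16: keep parameters, gradient reductions and buffers in bfloat16
bf16_policy = MixedPrecision(
    param_dtype=torch.bfloat16,
    reduce_dtype=torch.bfloat16,
    buffer_dtype=torch.bfloat16,
)

model = FSDP(
    model,  # the MllamaForConditionalGeneration loaded for training
    auto_wrap_policy=wrap_policy,
    mixed_precision=bf16_policy,
    sharding_strategy=ShardingStrategy.FULL_SHARD,
    device_id=torch.cuda.current_device(),
)

# Recompute decoder-layer activations during backward to save memory
apply_activation_checkpointing(
    model,
    check_fn=lambda module: isinstance(module, MllamaSelfAttentionDecoderLayer),
)

# ... training loop ...

# Gather a full, unsharded state dict on rank 0 for saving (FULL_STATE_DICT)
save_cfg = FullStateDictConfig(offload_to_cpu=True, rank0_only=True)
with FSDP.state_dict_type(model, StateDictType.FULL_STATE_DICT, save_cfg):
    state_dict = model.state_dict()
```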

#### Training Hyperparameters

```python
from dataclasses import dataclass

from torch.distributed.fsdp import ShardingStrategy, StateDictType


@dataclass
class train_config:
    model_name: str = "meta-llama/Llama-3.2-11B-Vision-Instruct"
    batch_size_training: int = 8
    batching_strategy: str = "padding"  # the alternative is packing, but the vision model doesn't work with packing
    context_length: int = 4096
    gradient_accumulation_steps: int = 1
    num_epochs: int = 3
    lr: float = 1e-5
    weight_decay: float = 0.0
    gamma: float = 0.85  # multiplicatively decay the learning rate by gamma after each epoch
    seed: int = 42
    use_fp16: bool = False
    mixed_precision: bool = True
    val_batch_size: int = 1
    use_peft: bool = False
    output_dir: str = "workspace/models"
    enable_fsdp: bool = True
    dist_checkpoint_root_folder: str = "workspace/FSDP/model"  # used when FSDP is enabled
    dist_checkpoint_folder: str = "fine-tuned"  # used when FSDP is enabled
    save_optimizer: bool = False  # used when FSDP is enabled


@dataclass
class fsdp_config:
    mixed_precision: bool = True
    use_fp16: bool = False
    # FULL_SHARD shards parameters, gradients and optimizer states across all ranks.
    # Alternatives: HYBRID_SHARD (full shard within a node, DDP across nodes),
    # SHARD_GRAD_OP (shard only gradients and optimizer states), NO_SHARD (similar to DDP).
    sharding_strategy: ShardingStrategy = ShardingStrategy.FULL_SHARD
    hsdp: bool = False  # requires HYBRID_SHARD; shards the model over a custom number of GPUs with replicas across sharding groups
    sharding_group_size: int = 0  # requires hsdp; number of GPUs one replica of the model is sharded over
    replica_group_size: int = 0  # requires hsdp; number of replicas, i.e. world_size / sharding_group_size
    # Alternatively SHARDED_STATE_DICT can be used: it saves one file with sharded
    # weights per rank, while FULL_STATE_DICT collects all weights on rank 0 and
    # saves them in a single file.
    checkpoint_type: StateDictType = StateDictType.FULL_STATE_DICT
    fsdp_activation_checkpointing: bool = True
    fsdp_cpu_offload: bool = False
    pure_bf16: bool = True
    optimizer: str = "AdamW"
```
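
For the `gamma` setting above, the per-epoch decay can be expressed with PyTorch's `StepLR`; a minimal sketch (the optimizer construction is an assumption matching the config):

```python
from torch.optim import AdamW
from torch.optim.lr_scheduler import StepLR

optimizer = AdamW(model.parameters(), lr=1e-5, weight_decay=0.0)
scheduler = StepLR(optimizer, step_size=1, gamma=0.85)  # multiply lr by 0.85 each epoch

for epoch in range(3):  # num_epochs=3
    # ... one epoch of training and evaluation ...
    scheduler.step()
```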

### Model Architecture and Objective

This model was trained simply to see how much improvement can be obtained by fine-tuning Llama 3.2 Vision.

### Compute Infrastructure

Trained on 2 A100 (80 GB) GPUs from RunPod.

## Citation

Training code: https://github.com/meta-llama/llama-recipes