Update README.md
README.md
CHANGED
---
language: ["en"]
license: llama2
tags:
- image-text-to-text
- visual-question-answering
- vision-language
- llava
- multimodal
- causal-lm
- continual-pretraining
- lora
- axolotl
- deepspeed
- transformers
- eu-hpc
datasets:
- mm_captions_chat
- text_cpt_corpus
metrics: ["loss"]
library_name: transformers
framework: pytorch
base_model: llava-hf/llava-1.5-7b-hf
model_name: llava-7b-cpt
pipeline_tag: image-text-to-text
task_categories: ["image-text-to-text","visual-question-answering"]
model_type: llava
inference:
  parameters:
    max_new_tokens: 128
    temperature: 0.2
    top_p: 0.9
trained_on: ["Leonardo EuroHPC"]
description: "Two-stage continual pretraining (CPT) of LLaVA 1.5 7B: first on **text-only** data, then on **image–text** chat-style captions. LoRA adapters merged into base."
---

# LLaVA 7B — Multimodal Continual Pretraining (CPT) with LoRA Adapters

**Model type:** Vision-Language Causal Model
**Base model:** [llava-hf/llava-1.5-7b-hf](https://huggingface.co/llava-hf/llava-1.5-7b-hf)
**License:** Llama 2 Community License (inherits from base)
**Framework:** Axolotl + DeepSpeed ZeRO-1

---

## Overview

`llava-7b-cpt` is a **continual-pretrained** multimodal version of **LLaVA 1.5 7B**, extending its visual and textual reasoning capabilities through domain-specific continual pretraining (CPT).
The process follows a **two-stage adaptation flow**:

1. **Textual CPT (Stage 1):**
   - Base: `llava-hf/llava-1.5-7b-hf`
   - Objective: text-only continual pretraining on scientific, governmental, news, and encyclopedic corpora.

2. **Multimodal CPT (Stage 2, this release):**
   - Base: the Stage 1 text-CPT model
   - Objective: multimodal (image–text) continual pretraining using image–caption dialogue data.

This pipeline enhances LLaVA’s factual grounding and image-conditioned understanding of technical and energy-domain visual content.
Training was performed on the **Leonardo EuroHPC** supercomputer using **Axolotl 0.6** with **DeepSpeed ZeRO-1** and **bfloat16** precision.

---

## Training Setup

| Component | Specification |
|:-----------|:--------------|
| **Objective** | Multimodal continual pretraining (image–text dialogue) |
| **Adapter type** | LoRA |
| **Precision** | bfloat16 |
| **Hardware** | 8 nodes × 2 NVIDIA A100 64 GB GPUs (16 GPUs total) |
| **Framework** | Axolotl + DeepSpeed ZeRO-1 (PyTorch 2.5.1 + CUDA 12.1) |
| **Runtime** | ≈ 24 hours |
| **Checkpoints** | Saved every epoch |
| **Vision tower** | Frozen |
| **Text backbone** | LoRA-updated only |
| **Loss watchdog** | Disabled for the multimodal phase |
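
The DeepSpeed side of this setup amounts to a bf16, ZeRO stage-1 configuration. A minimal sketch of the corresponding settings, assuming standard DeepSpeed config keys (the exact file used in the Axolotl run is not part of this release):

```python
# Illustrative DeepSpeed ZeRO-1 settings consistent with this card (bf16, stage 1,
# micro batch 1, gradient accumulation 4); the actual config used on Leonardo is
# not shipped with this repository.
ds_config = {
    "bf16": {"enabled": True},
    "zero_optimization": {"stage": 1},
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 4,
}
```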

---

## Dataset

The multimodal CPT stage was trained on **image–caption chat-style pairs**, stored in an Axolotl-compatible JSONL file (`mm_captions_chat.jsonl`) of LLaVA-style message lists.

| File | Description |
|:------|:-------------|
| **mm_captions_chat.jsonl** | Image–text dialogues for visual captioning and VQA adaptation |
| **images/** | Folder of referenced image files used by the dataset entries |

Each entry contains alternating `user` (image + text prompt) and `assistant` (caption/answer) messages in a chat structure compatible with the `llava` chat template, as sketched below.
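
A minimal sketch of one record, assuming a generic LLaVA-style chat schema (the field names below are illustrative, not the exact layout of `mm_captions_chat.jsonl`):

```python
import json

# Illustrative record only: field names and structure are assumed, not taken
# from the released dataset files.
record = {
    "messages": [
        {"role": "user", "content": "<image>\nDescribe the equipment shown in this photo."},
        {"role": "assistant", "content": "The photo shows a transformer bay in an electrical substation ..."},
    ],
    "images": ["images/example_0001.jpg"],
}

# Append the record as one JSON line, the format JSONL loaders expect.
with open("mm_captions_chat.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")
```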

---

## Hyperparameters

| Parameter | Value |
|:-----------|:------|
| Sequence length | 2048 |
| Micro batch size | 1 |
| Gradient accumulation | 4 |
| Epochs | 1 |
| Max steps | 6000 |
| Learning rate | 0.00015 |
| LR scheduler | cosine |
| Optimizer | AdamW (8-bit) |
| Warmup ratio | 0.1 |
| Weight decay | 0.0 |
| LoRA rank (r) | 16 |
| LoRA alpha | 32 |
| LoRA dropout | 0.05 |
| LoRA target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Gradient checkpointing | ✅ |
| Flash attention | ❌ (disabled for stability) |
| Image size | 512 |
| Resize algorithm | bilinear |
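
These LoRA settings correspond roughly to the following PEFT configuration. This is an illustrative sketch only: training was actually driven by an Axolotl config, and Stage 2 starts from the Stage 1 text-CPT checkpoint rather than the public base shown here.

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import LlavaForConditionalGeneration

# Sketch only: the released adapters were trained through Axolotl, not this script.
model = LlavaForConditionalGeneration.from_pretrained(
    "llava-hf/llava-1.5-7b-hf",
    torch_dtype=torch.bfloat16,
)

# Keep the CLIP vision tower frozen; only the text backbone is LoRA-updated.
for name, param in model.named_parameters():
    if "vision_tower" in name:
        param.requires_grad = False

# Restrict the target-module regex to the language model so the vision tower's own
# q/k/v projections are not wrapped with adapters.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=r".*language_model.*\.(q_proj|k_proj|v_proj|o_proj|gate_proj|up_proj|down_proj)",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# After training, the adapters were merged into the base weights (conceptually,
# PEFT's merge_and_unload()), so this repository ships fully merged weights.
```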

---

## Model Flow

1. Base: `llava-hf/llava-1.5-7b-hf`
2. Stage 1 — Textual Continual Pretraining (CPT) → `llava-7b-text-cpt`
3. Stage 2 — Multimodal Continual Pretraining (CPT) → `ubitech-edg/llava-7b-cpt` (this release)

---

## Tokenizer & Processor

| Component | Value |
|:-----------|:------|
| **Tokenizer type** | `AutoTokenizer` |
| **Processor type** | `AutoProcessor` |
| **Special tokens** | `<pad>` = ID 32001 |
| **Chat template** | `llava` |
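
A quick sanity check of the padding token and bundled chat template after downloading (assuming the processor files are published alongside the weights):

```python
from transformers import AutoProcessor

# Load the processor (image processor + tokenizer) that ships with the model.
processor = AutoProcessor.from_pretrained("ubitech-edg/llava-7b-cpt")
tokenizer = processor.tokenizer

print(tokenizer.pad_token, tokenizer.pad_token_id)  # expected: <pad> 32001
print(tokenizer.chat_template is not None)          # True when a chat template is bundled
```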

---

## Usage

To load and run `llava-7b-cpt` locally for image–text generation:

```python
from transformers import AutoProcessor, LlavaForConditionalGeneration
from PIL import Image
import torch

model_id = "ubitech-edg/llava-7b-cpt"

processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

image = Image.open("example.jpg").convert("RGB")
prompt = "USER: <image>\nDescribe this image in two sentences.\nASSISTANT:"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

with torch.inference_mode():
    output = model.generate(**inputs, max_new_tokens=128)

print(processor.decode(output[0], skip_special_tokens=True))
```
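
Since the processor carries the `llava` chat template, the prompt can also be built with `apply_chat_template` instead of hand-writing the `USER:`/`ASSISTANT:` string. A short sketch, assuming the template is shipped with this checkpoint and reusing the inference parameters from the metadata above:

```python
from transformers import AutoProcessor, LlavaForConditionalGeneration
from PIL import Image
import torch

model_id = "ubitech-edg/llava-7b-cpt"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Describe the turn as chat messages; the processor renders the llava prompt format.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe this image in two sentences."},
        ],
    }
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)

image = Image.open("example.jpg").convert("RGB")
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

with torch.inference_mode():
    output = model.generate(
        **inputs,
        max_new_tokens=128,
        do_sample=True,
        temperature=0.2,
        top_p=0.9,
    )

print(processor.decode(output[0], skip_special_tokens=True))
```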