Atah Alam committed on
Commit 7f7a72e · 0 Parent(s)

Manthan-T1 clean code-only
.gitattributes ADDED
@@ -0,0 +1 @@
1
+ # Keep this repo code-only. Don't use Git LFS here.
.gitignore ADDED
@@ -0,0 +1,24 @@
1
+ .venv/
2
+ venv/
3
+ __pycache__/
4
+ *.pyc
5
+ .DS_Store
6
+
7
+ # python tooling
8
+ .pytest_cache/
9
+ .mypy_cache/
10
+ .ruff_cache/
11
+
12
+ # local caches
13
+ .cache/
14
+ hf/
15
+
16
+ # training artifacts
17
+ wandb/
18
+ runs/
19
+ checkpoints/
20
+ outputs/
21
+ tmp/
22
+
23
+ # local exports (keep them out of git; publish separately if needed)
24
+ hf_export_ready/
.hfignore ADDED
@@ -0,0 +1,25 @@
1
+ .venv/
2
+ venv/
3
+ __pycache__/
4
+ *.pyc
5
+ .DS_Store
6
+
7
+ # local/hf caches
8
+ .cache/
9
+ hf/
10
+
11
+ # training artifacts
12
+ wandb/
13
+ runs/
14
+ checkpoints/
15
+ outputs/
16
+ tmp/
17
+ /tmp/
18
+
19
+ # large local exports (only push intentionally)
20
+ hf_export_ready/
21
+
22
+ .venv/
23
+ __pycache__/
24
+ *.pyc
25
+ .DS_Store
MODEL_CARD.md ADDED
@@ -0,0 +1,68 @@
1
+ ---
2
+ license: apache-2.0
3
+ language:
4
+ - en
5
+ library_name: transformers
6
+ tags:
7
+ - pytorch
8
+ - safetensors
9
+ - vision-language
10
+ - visual-question-answering
11
+ pipeline_tag: visual-question-answering
12
+ base_model:
13
+ - Qwen/Qwen3-0.6B
14
+ - google/siglip-so400m-patch14-384
15
+ model-index:
16
+ - name: Manthan-T1
17
+ results:
18
+ - task:
19
+ type: visual-question-answering
20
+ name: VQAv2
21
+ dataset:
22
+ name: VQAv2
23
+ type: vqav2
24
+ metrics:
25
+ - name: Overall Accuracy
26
+ type: accuracy
27
+ value: 0.0
28
+ - name: Yes/No Accuracy
29
+ type: accuracy
30
+ value: 0.0
31
+ - name: Number Accuracy
32
+ type: accuracy
33
+ value: 0.0
34
+ - name: Other Accuracy
35
+ type: accuracy
36
+ value: 0.0
37
+ source:
38
+ name: Pending
39
+ url: https://visualqa.org/download.html
40
+ ---
41
+
42
+ # Manthan-T1
43
+
44
+ A custom **Transformers** architecture for a compact vision-language model.
45
+
46
+ ## Status
47
+
48
+ This repo currently contains:
49
+
50
+ - `ManthanConfig` (`manthan_t1/configuration_manthan.py`)
51
+ - `ManthanForCausalLM` (`manthan_t1/modeling_manthan.py`)
52
+ - vision encoder (minimal ViT-like)
53
+ - projector to text hidden size
54
+ - decoder LM (placeholder GPT-2 by default for smoke tests)
55
+
56
+ Planned next steps:
57
+
58
+ - Swap the text backbone to **Qwen3-0.6B** via `text_config` + weight loading
59
+ - Swap the vision tower to **SigLIP2-so400m (patch14-384)** and align image token handling
60
+ - Add proper processor + chat template to enforce **reply in user’s input language** (Tamil/Hindi/etc.)
61
+
62
+ ## Loading
63
+
64
+ This is intended to be loaded with:
65
+
66
+ - `AutoModelForCausalLM.from_pretrained(..., trust_remote_code=True)`
67
+
68
+ See `scripts/infer_hf.py`.
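A minimal loading sketch (hedged: the repo id below is a placeholder for wherever this model is published or exported; `trust_remote_code=True` is what pulls in `ManthanConfig`/`ManthanForCausalLM`):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "path/to/manthan-t1"  # placeholder: HF repo id or local export folder

tok = AutoTokenizer.from_pretrained(repo_id, use_fast=False, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    trust_remote_code=True,  # required so the custom Manthan classes are resolved
)
model.eval()

# Text-only generation goes through the wrapped language model.
inputs = tok("Describe what a vision-language model does.", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))
```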
README.md ADDED
@@ -0,0 +1,29 @@
1
+ # Manthan-T1
2
+
3
+ A from-scratch scaffold for a custom **Transformers** vision-language architecture named **Manthan-T1**.
4
+
5
+ ## What you get (today)
6
+ - A clean project layout under `manthan_t1/`
7
+ - A full HF custom architecture:
8
+ - `ManthanConfig` in `manthan_t1/configuration_manthan.py`
9
+ - `ManthanForCausalLM` in `manthan_t1/modeling_manthan.py`
10
+ - A no-download HF forward smoke test:
11
+ - `python -m manthan_t1.hf_smoke`
12
+ - An MLX smoke test (kept for Apple Silicon readiness):
13
+ - `python -m manthan_t1.smoke_test`
14
+
15
+ ## What we’ll add next
16
+ - Qwen3-0.6B backbone wiring + weight loading (keeping the model type = `manthan_t1`)
17
+ - SigLIP2 vision tower wiring + projector alignment
18
+ - LoRA fine-tuning recipes for a 16 GB M4 Mac (MLX and/or PyTorch)
19
+ - Multilingual “reply in user language” policy (Indian languages)
20
+
21
+ ## Quick smoke test
22
+ After installing dependencies, you should be able to run:
23
+
24
+ ```bash
25
+ python -m manthan_t1.smoke_test
26
+ python -m manthan_t1.hf_smoke
27
+ ```
28
+
29
+ This does **not** download any external models yet.
docs/KAGGLE_TRAINING.md ADDED
@@ -0,0 +1,114 @@
1
+ # Kaggle training (2×T4) – Manthan‑T1
2
+
3
+ This repo includes a Kaggle-oriented training entrypoint:
4
+
5
+ - `scripts/train_unsloth_kaggle.py`
6
+
7
+ It uses the same LLaVA-style dataset format as TinyLLaVA/MicroLLaVA:
8
+ - dataset sample keys: `image`, `conversations`, `id`
9
+ - `conversations`: `[{'from':'human','value':'...<image>...'}, {'from':'gpt','value':'...'}]`
10
+ - uses `IMAGE_TOKEN_INDEX = -200`
11
+ - uses `IGNORE_INDEX = -100` for masked labels
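For illustration, a single sample in this format looks roughly like the following (a hedged sketch; the id, image path, and text are invented):

```python
IMAGE_TOKEN_INDEX = -200  # placeholder id spliced into input_ids (never a real vocab id)
IGNORE_INDEX = -100       # label value for positions excluded from the loss

sample = {
    "id": "000000123",
    "image": "coco/train2017/000000123.jpg",
    "conversations": [
        {"from": "human", "value": "<image>\nWhat is the dog doing?"},
        {"from": "gpt", "value": "It is chasing a ball on the grass."},
    ],
}
# During preprocessing, "<image>" becomes IMAGE_TOKEN_INDEX in input_ids, and every
# prompt/image position gets IGNORE_INDEX in labels so only the gpt reply is supervised.
```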
12
+
13
+ ## What this script trains
14
+
15
+ Default (recommended for 2×T4):
16
+ - vision tower: **frozen**
17
+ - multimodal projector: **trainable** (always)
18
+ - LLM: **LoRA adapters** (optional, enable `--use_lora`)
19
+
20
+ This matches the standard LLaVA/TinyLLaVA recipe: align projector first, then instruction tune.
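A hedged sketch of that default trainability split (not the actual training script; module names follow `manthan_t1/modeling_manthan.py`, and the LoRA target modules assume a Qwen-style decoder):

```python
from peft import LoraConfig, get_peft_model

def configure_trainables(model, use_lora: bool = True):
    # Vision tower: frozen (both the toy tower and a loaded vision backbone, if any).
    for p in model.vision_tower.parameters():
        p.requires_grad = False
    if model.vision_model is not None:
        for p in model.vision_model.parameters():
            p.requires_grad = False

    # Multimodal projector: always trainable.
    for p in model.projector.parameters():
        p.requires_grad = True

    # LLM: LoRA adapters when requested, otherwise frozen.
    if use_lora:
        lora_cfg = LoraConfig(
            r=16, lora_alpha=32, lora_dropout=0.05,
            target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
            task_type="CAUSAL_LM",
        )
        model.language_model = get_peft_model(model.language_model, lora_cfg)
    else:
        for p in model.language_model.parameters():
            p.requires_grad = False
    return model
```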
21
+
22
+ ## Kaggle setup checklist
23
+
24
+ 1) Enable GPU (2×T4) in Kaggle.
25
+ 2) Ensure `pip` deps exist in the notebook:
26
+
27
+ ```bash
28
+ pip install -U transformers accelerate datasets peft
29
+ # Optional (recommended): Unsloth if available in your notebook image
30
+ pip install -U unsloth
31
+ ```
32
+
33
+ 3) Clone this repo into the notebook (from your Hugging Face repo or any Git remote).
34
+ 4) Set HF cache to persist in `/kaggle/working` so it survives “Save Version”:
35
+
36
+ ```bash
37
+ export HF_HOME=/kaggle/working/hf
38
+ export TRANSFORMERS_CACHE=/kaggle/working/hf/transformers
39
+ export HF_DATASETS_CACHE=/kaggle/working/hf/datasets
40
+ ```
41
+
42
+ ## Stage 1 (projector alignment)
43
+
44
+ Use a smaller pretrain set first:
45
+ - `liuhaotian/LLaVA-CC3M-Pretrain-595K`
46
+
47
+ Example run:
48
+
49
+ ```bash
50
+ python scripts/train_unsloth_kaggle.py \
51
+ --stage stage1 \
52
+ --manthan_model <YOUR_HF_REPO_OR_LOCAL_PATH> \
53
+ --text_model Qwen/Qwen3-0.6B-Base \
54
+ --dataset liuhaotian/LLaVA-CC3M-Pretrain-595K \
55
+ --output_dir /kaggle/working/manthan_stage1 \
56
+ --use_lora \
57
+ --max_length 2048 \
58
+ --image_size 384 \
59
+ --batch_size 1 \
60
+ --grad_accum 32 \
61
+ --lr 1e-4 \
62
+ --epochs 1 \
63
+ --limit 20000
64
+ ```
65
+
66
+ Notes:
67
+ - Increase `--limit` as you gain confidence.
68
+ - If you run out of VRAM, reduce `--max_length` or increase `--grad_accum`.
69
+
70
+ ## Stage 2 (instruction tuning)
71
+
72
+ Dataset:
73
+ - `liuhaotian/LLaVA-Instruct-150K`
74
+
75
+ ```bash
76
+ python scripts/train_unsloth_kaggle.py \
77
+ --stage stage2 \
78
+ --manthan_model <YOUR_HF_REPO_OR_LOCAL_PATH> \
79
+ --text_model Qwen/Qwen3-0.6B-Base \
80
+ --dataset liuhaotian/LLaVA-Instruct-150K \
81
+ --output_dir /kaggle/working/manthan_stage2 \
82
+ --use_lora \
83
+ --max_length 2048 \
84
+ --image_size 384 \
85
+ --batch_size 1 \
86
+ --grad_accum 32 \
87
+ --lr 1e-4 \
88
+ --epochs 1 \
89
+ --limit 150000
90
+ ```
91
+
92
+ ## Outputs
93
+
94
+ The script saves into `--output_dir`:
95
+ - `projector.pt` (multimodal projector weights)
96
+ - `save_pretrained()` output for the model (includes remote-code config; adapters if supported)
97
+
98
+ In practice, you’ll likely upload these artifacts back to HF.
99
+
100
+ ## Dry run (local)
101
+
102
+ To validate the training loop without datasets:
103
+
104
+ ```bash
105
+ python scripts/train_unsloth_kaggle.py \
106
+ --stage stage1 \
107
+ --manthan_model hf_export_ready \
108
+ --text_model gpt2 \
109
+ --dataset dummy \
110
+ --output_dir ./tmp_out \
111
+ --dry_run
112
+ ```
113
+
114
+ (For real Kaggle training, don’t use stub weights.)
hf_export_stub/added_tokens.json ADDED
@@ -0,0 +1,5 @@
1
+ {
2
+ "<image>": 0,
3
+ "<im_start>": 0,
4
+ "<im_end>": 0
5
+ }
hf_export_stub/chat_template.jinja ADDED
@@ -0,0 +1,26 @@
1
+ {% set system = system_message | default('You are Manthan-T1, a helpful multimodal assistant.') %}
2
+ {% set user_lang_rule = 'Reply in the same language as the user. If the user writes in Tamil, reply in Tamil; if Hindi then Hindi; if English then English.' %}
3
+
4
+ {% if messages[0]['role'] != 'system' %}
5
+ <|system|>
6
+ {{ system }}\n{{ user_lang_rule }}
7
+ <|end|>
8
+ {% endif %}
9
+
10
+ {% for m in messages %}
11
+ {% if m['role'] == 'system' %}
12
+ <|system|>
13
+ {{ m['content'] }}\n{{ user_lang_rule }}
14
+ <|end|>
15
+ {% elif m['role'] == 'user' %}
16
+ <|user|>
17
+ {{ m['content'] }}
18
+ <|end|>
19
+ {% elif m['role'] == 'assistant' %}
20
+ <|assistant|>
21
+ {{ m['content'] }}
22
+ <|end|>
23
+ {% endif %}
24
+ {% endfor %}
25
+
26
+ <|assistant|>
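A hedged sketch of rendering this template, assuming it has been written into the tokenizer's `chat_template` field (`scripts/export_hf.py` does that during export; the path below is a placeholder):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("./hf_export_ready", use_fast=False)  # placeholder path

messages = [{"role": "user", "content": "<image>\nWhat is in this picture?"}]
prompt = tok.apply_chat_template(messages, tokenize=False)
print(prompt)
# The rendered string starts with the <|system|> block (default system message plus the
# reply-in-user-language rule), then the <|user|> turn, and ends with "<|assistant|>".
```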
hf_export_stub/config.json ADDED
@@ -0,0 +1,16 @@
1
+ {
2
+ "model_type": "manthan_t1",
3
+ "architectures": ["ManthanForCausalLM"],
4
+ "auto_map": {
5
+ "AutoConfig": "configuration_manthan.ManthanConfig",
6
+ "AutoModelForCausalLM": "modeling_manthan.ManthanForCausalLM"
7
+ },
8
+ "text_model_id": "Qwen/Qwen3-0.6B",
9
+ "vision_model_id": "google/siglip-so400m-patch14-384",
10
+ "vision_image_size": 384,
11
+ "vision_patch_size": 14,
12
+ "vision_feature_select": "patch",
13
+ "num_image_tokens": 256,
14
+ "image_token_id": 0,
15
+ "torch_dtype": "float16"
16
+ }
hf_export_stub/special_tokens_map.json ADDED
@@ -0,0 +1,7 @@
1
+ {
2
+ "additional_special_tokens": [
3
+ "<image>",
4
+ "<im_start>",
5
+ "<im_end>"
6
+ ]
7
+ }
hf_export_stub/tokenizer_config.json ADDED
@@ -0,0 +1,6 @@
1
+ {
2
+ "model_max_length": 4096,
3
+ "padding_side": "right",
4
+ "truncation_side": "right",
5
+ "use_fast": false
6
+ }
manthan_t1/__init__.py ADDED
@@ -0,0 +1,2 @@
1
+ __all__ = ["__version__"]
2
+ __version__ = "0.0.1"
manthan_t1/configuration_manthan.py ADDED
@@ -0,0 +1,114 @@
1
+ from __future__ import annotations
2
+
3
+ from dataclasses import dataclass
4
+ from typing import Optional
5
+
6
+ from transformers import PretrainedConfig
7
+
8
+
9
+ class ManthanConfig(PretrainedConfig):
10
+ """Configuration for Manthan-T1.
11
+
12
+ Matches key MicroLLaVA/TinyLLaVA conventions:
13
+ - `image_token_index` is a negative placeholder id (default -200)
14
+ - keep `image_token_id` as an alias (defaults to image_token_index)
15
+
16
+ `text_config` is kept as a dict for JSON serialization.
17
+ """
18
+
19
+ model_type = "manthan_t1"
20
+
21
+ def __init__(
22
+ self,
23
+ text_config: Optional[dict] = None,
24
+ text_model_id: Optional[str] = None,
25
+ vision_model_id: Optional[str] = None,
26
+ vision_hidden_size: int = 1024,
27
+ vision_num_hidden_layers: int = 24,
28
+ vision_num_attention_heads: int = 16,
29
+ vision_image_size: int = 384,
30
+ vision_patch_size: int = 14,
31
+ projector_hidden_size: Optional[int] = None,
32
+ image_token_index: int = -200,
33
+ image_token_id: Optional[int] = None,
34
+ num_image_tokens: int = 256,
35
+ vision_feature_select: str = "patch",
36
+ **kwargs,
37
+ ):
38
+ super().__init__(**kwargs)
39
+
40
+ self.text_config_dict = text_config or {}
41
+
42
+ # Optional resolved config for HF generation helpers
43
+ self.text_config_obj: Optional[PretrainedConfig] = None
44
+ if self.text_config_dict.get("model_type"):
45
+ from transformers import AutoConfig
46
+
47
+ try:
48
+ self.text_config_obj = AutoConfig.for_model(**self.text_config_dict)
49
+ except Exception:
50
+ self.text_config_obj = None
51
+
52
+ self.text_model_id = text_model_id
53
+ self.vision_model_id = vision_model_id
54
+
55
+ self.vision_hidden_size = int(vision_hidden_size)
56
+ self.vision_num_hidden_layers = int(vision_num_hidden_layers)
57
+ self.vision_num_attention_heads = int(vision_num_attention_heads)
58
+ self.vision_image_size = int(vision_image_size)
59
+ self.vision_patch_size = int(vision_patch_size)
60
+
61
+ self.projector_hidden_size = projector_hidden_size
62
+
63
+ self.image_token_index = int(image_token_index)
64
+ self.image_token_id = int(image_token_id) if image_token_id is not None else int(image_token_index)
65
+ self.num_image_tokens = int(num_image_tokens)
66
+ self.vision_feature_select = vision_feature_select
67
+
68
+ # -------- Generation-related compatibility --------
69
+ # Transformers' generation utilities (DynamicCache, etc.) expect certain
70
+ # attributes on the *decoder/text* config. Since ManthanConfig is a
71
+ # wrapper that may not always carry a resolved `text_config_obj`, we set
72
+ # conservative defaults here to keep `model.generate()` functional in
73
+ # stub/export scenarios.
74
+ self.num_hidden_layers = int(
75
+ getattr(self.text_config_obj, "num_hidden_layers", kwargs.get("num_hidden_layers", 1))
76
+ if self.text_config_obj is not None
77
+ else kwargs.get("num_hidden_layers", 1)
78
+ )
79
+ self.num_attention_heads = int(
80
+ getattr(self.text_config_obj, "num_attention_heads", kwargs.get("num_attention_heads", 1))
81
+ if self.text_config_obj is not None
82
+ else kwargs.get("num_attention_heads", 1)
83
+ )
84
+ self.hidden_size = int(
85
+ getattr(self.text_config_obj, "hidden_size", kwargs.get("hidden_size", 256))
86
+ if self.text_config_obj is not None
87
+ else kwargs.get("hidden_size", 256)
88
+ )
89
+ self.max_position_embeddings = int(
90
+ getattr(self.text_config_obj, "max_position_embeddings", kwargs.get("max_position_embeddings", 2048))
91
+ if self.text_config_obj is not None
92
+ else kwargs.get("max_position_embeddings", 2048)
93
+ )
94
+ self.vocab_size = int(
95
+ getattr(self.text_config_obj, "vocab_size", kwargs.get("vocab_size", 32000))
96
+ if self.text_config_obj is not None
97
+ else kwargs.get("vocab_size", 32000)
98
+ )
99
+
100
+ def get_text_config(self, decoder: bool = False):
101
+ # Transformers' GenerationConfig helpers call get_text_config() during
102
+ # PreTrainedModel initialization. For stub/export-time configs we may not
103
+ # have a resolved text backbone yet; in that case, fall back to self.
104
+ if self.text_config_obj is None:
105
+ return self
106
+ return self.text_config_obj
107
+
108
+
109
+ @dataclass
110
+ class ManthanBatch:
111
+ input_ids: "torch.LongTensor"
112
+ attention_mask: Optional["torch.LongTensor"]
113
+ pixel_values: Optional["torch.FloatTensor"]
114
+ labels: Optional["torch.LongTensor"]
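A hedged usage sketch for this config (no downloads; the GPT-2 text config is built locally from its `model_type`):

```python
from manthan_t1.configuration_manthan import ManthanConfig

cfg = ManthanConfig(
    text_config={"model_type": "gpt2", "n_embd": 256, "n_layer": 4, "n_head": 4},
    vision_image_size=224,
    vision_patch_size=16,
    num_image_tokens=196,
    image_token_id=42,
)

# Generation-compat attributes are mirrored from the resolved text config (or kwargs).
print(cfg.hidden_size, cfg.num_hidden_layers)  # 256 4

# With a resolved backbone this returns the GPT2Config; otherwise it falls back to `cfg` itself.
print(type(cfg.get_text_config()).__name__)
```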
manthan_t1/hf_integration_smoke.py ADDED
@@ -0,0 +1,60 @@
1
+ """Optional integration smoke test.
2
+
3
+ This WILL download models (big) if run.
4
+
5
+ It checks that:
6
+ - Qwen/Qwen3-0.6B loads as the text backbone
7
+ - google/siglip-so400m-patch14-384 loads as the vision backbone
8
+ - a single forward pass works with <image> token injection
9
+
10
+ Run (optional):
11
+ python -m manthan_t1.hf_integration_smoke
12
+ """
13
+
14
+ from __future__ import annotations
15
+
16
+ import torch
17
+ from transformers import AutoTokenizer
18
+
19
+ from manthan_t1.configuration_manthan import ManthanConfig
20
+ from manthan_t1.modeling_manthan import ManthanForCausalLM
21
+
22
+
23
+ def main() -> None:
24
+ text_id = "Qwen/Qwen3-0.6B"
25
+ vision_id = "google/siglip-so400m-patch14-384"
26
+
27
+ cfg = ManthanConfig(
28
+ text_model_id=text_id,
29
+ vision_model_id=vision_id,
30
+ # SigLIP so400m patch14 384
31
+ vision_image_size=384,
32
+ vision_patch_size=14,
33
+ # 384/14 is non-integer; many siglip variants still use patch14, but token count comes from model.
34
+ # We'll keep num_image_tokens as 256 to match common LLaVA-style settings.
35
+ num_image_tokens=256,
36
+ image_token_id=151665,
37
+ vision_feature_select="patch",
38
+ )
39
+
40
+ model = ManthanForCausalLM(cfg)
41
+ model.eval()
42
+
43
+ tok = AutoTokenizer.from_pretrained(text_id, use_fast=False, trust_remote_code=True)
44
+
45
+ # Create a prompt with enough <image> placeholders
46
+ image_tok = tok.decode([cfg.image_token_id])
47
+ prompt = (image_tok + " ") * cfg.num_image_tokens + "\nDescribe the image."
48
+ inputs = tok(prompt, return_tensors="pt")
49
+
50
+ # Dummy image tensor with expected size; processor correctness is handled in `chat()`.
51
+ pixel_values = torch.randn(1, 3, cfg.vision_image_size, cfg.vision_image_size)
52
+
53
+ with torch.no_grad():
54
+ out = model(input_ids=inputs["input_ids"], attention_mask=inputs.get("attention_mask"), pixel_values=pixel_values)
55
+
56
+ print("OK integration forward", tuple(out.logits.shape))
57
+
58
+
59
+ if __name__ == "__main__":
60
+ main()
manthan_t1/hf_smoke.py ADDED
@@ -0,0 +1,43 @@
1
+ """HF Transformers smoke test for Manthan-T1.
2
+
3
+ This must run without downloading any external checkpoints.
4
+ """
5
+
6
+ from __future__ import annotations
7
+
8
+ import torch
9
+
10
+ from manthan_t1.configuration_manthan import ManthanConfig
11
+ from manthan_t1.modeling_manthan import ManthanForCausalLM
12
+
13
+
14
+ def main() -> None:
15
+ cfg = ManthanConfig(
16
+ text_config={"model_type": "gpt2"},
17
+ vision_image_size=224,
18
+ vision_patch_size=16,
19
+ vision_hidden_size=128,
20
+ # 224/16 = 14 patches per side => 196 patch tokens (CLS is dropped by default)
21
+ num_image_tokens=196,
22
+ image_token_id=42,
23
+ )
24
+ model = ManthanForCausalLM(cfg)
25
+ model.eval()
26
+
27
+ # Fake text with num_image_tokens image tokens
28
+ B, T = 2, 32
29
+ input_ids = torch.randint(0, 100, (B, T))
30
+ # Ensure sequence is long enough to host the image tokens
31
+ if T < cfg.num_image_tokens:
32
+ input_ids = torch.randint(0, 100, (B, cfg.num_image_tokens + 8))
33
+ T = input_ids.shape[1]
34
+ input_ids[:, : cfg.num_image_tokens] = cfg.image_token_id
35
+ pixel_values = torch.randn(B, 3, cfg.vision_image_size, cfg.vision_image_size)
36
+
37
+ out = model(input_ids=input_ids, pixel_values=pixel_values)
38
+ assert out.logits.shape[:2] == (B, T)
39
+ print("OK manthan hf forward", tuple(out.logits.shape))
40
+
41
+
42
+ if __name__ == "__main__":
43
+ main()
manthan_t1/modeling_manthan.py ADDED
@@ -0,0 +1,654 @@
1
+ from __future__ import annotations
2
+
3
+ from dataclasses import dataclass
4
+ from typing import Any, Dict, List, Optional, Tuple, Union
5
+
6
+ import torch
7
+ import torch.nn as nn
8
+
9
+ from transformers import (
10
+ AutoConfig,
11
+ AutoModel,
12
+ AutoModelForCausalLM,
13
+ GenerationMixin,
14
+ PreTrainedModel,
15
+ )
16
+ from transformers.modeling_outputs import CausalLMOutputWithPast
17
+
18
+ from .configuration_manthan import ManthanConfig
19
+
20
+
21
+ IGNORE_INDEX = -100
22
+
23
+
24
+ class ManthanVisionEncoder(nn.Module):
25
+ """Minimal ViT-like vision encoder.
26
+
27
+ This is intentionally simple so the architecture is fully defined in this repo.
28
+ You can later swap it with SigLIP2 weights by mapping parameters.
29
+ """
30
+
31
+ def __init__(self, image_size: int, patch_size: int, hidden_size: int):
32
+ super().__init__()
33
+ self.image_size = image_size
34
+ self.patch_size = patch_size
35
+ self.hidden_size = hidden_size
36
+
37
+ self.proj = nn.Conv2d(3, hidden_size, kernel_size=patch_size, stride=patch_size)
38
+ num_patches = (image_size // patch_size) * (image_size // patch_size)
39
+ self.cls_token = nn.Parameter(torch.zeros(1, 1, hidden_size))
40
+ self.pos_embed = nn.Parameter(torch.zeros(1, 1 + num_patches, hidden_size))
41
+
42
+ encoder_layer = nn.TransformerEncoderLayer(
43
+ d_model=hidden_size,
44
+ nhead=max(1, hidden_size // 64),
45
+ dim_feedforward=hidden_size * 4,
46
+ batch_first=True,
47
+ activation="gelu",
48
+ )
49
+ self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
50
+ self.ln = nn.LayerNorm(hidden_size)
51
+
52
+ nn.init.normal_(self.pos_embed, std=0.02)
53
+ nn.init.normal_(self.cls_token, std=0.02)
54
+
55
+ def forward(self, pixel_values: torch.FloatTensor) -> torch.FloatTensor:
56
+ # pixel_values: (B, 3, H, W)
57
+ x = self.proj(pixel_values) # (B, C, H', W')
58
+ x = x.flatten(2).transpose(1, 2) # (B, N, C)
59
+ cls = self.cls_token.expand(x.size(0), -1, -1)
60
+ x = torch.cat([cls, x], dim=1)
61
+ x = x + self.pos_embed[:, : x.size(1), :]
62
+ x = self.encoder(x)
63
+ return self.ln(x)
64
+
65
+
66
+ class ManthanProjector(nn.Module):
67
+ def __init__(self, vision_hidden: int, text_hidden: int, mid: Optional[int] = None):
68
+ super().__init__()
69
+ mid = mid or max(text_hidden, vision_hidden)
70
+ self.net = nn.Sequential(
71
+ nn.Linear(vision_hidden, mid),
72
+ nn.GELU(),
73
+ nn.Linear(mid, text_hidden),
74
+ )
75
+
76
+ def forward(self, x: torch.FloatTensor) -> torch.FloatTensor:
77
+ return self.net(x)
78
+
79
+
80
+ class ManthanForCausalLM(PreTrainedModel, GenerationMixin):
81
+ config_class = ManthanConfig
82
+ base_model_prefix = "manthan"
83
+
84
+ def __init__(self, config: ManthanConfig):
85
+ super().__init__(config)
86
+
87
+ # Text backbone
88
+ # Priority:
89
+ # 1) `text_model_id` (e.g., Qwen/Qwen3-0.6B)
90
+ # 2) `text_config` dict (model_type + overrides)
91
+ # 3) fallback tiny GPT-2 for smoke tests
92
+ if config.text_model_id:
93
+ self.language_model = AutoModelForCausalLM.from_pretrained(
94
+ config.text_model_id,
95
+ trust_remote_code=True,
96
+ )
97
+ else:
98
+ text_cfg = getattr(config, "text_config_obj", None)
99
+ if text_cfg is None:
100
+ text_cfg = (
101
+ AutoConfig.for_model(**config.text_config_dict)
102
+ if getattr(config, "text_config_dict", {}).get("model_type")
103
+ else None
104
+ )
105
+ if text_cfg is None:
106
+ from transformers import GPT2Config
107
+
108
+ text_cfg = GPT2Config(
109
+ n_embd=256,
110
+ n_layer=4,
111
+ n_head=4,
112
+ vocab_size=32000,
113
+ )
114
+ self.language_model = AutoModelForCausalLM.from_config(text_cfg)
115
+
116
+
117
+ text_hidden = self.language_model.config.hidden_size
118
+
119
+ # Vision backbone
120
+ self.vision_model = None
121
+ if config.vision_model_id:
122
+ self.vision_model = AutoModel.from_pretrained(config.vision_model_id, trust_remote_code=True)
123
+ vision_hidden = getattr(getattr(self.vision_model, "config", None), "hidden_size", None)
124
+ if vision_hidden is not None:
125
+ config.vision_hidden_size = int(vision_hidden)
126
+
127
+ # Fallback toy tower remains available (used when vision_model_id is not set)
128
+ self.vision_tower = ManthanVisionEncoder(
129
+ image_size=config.vision_image_size,
130
+ patch_size=config.vision_patch_size,
131
+ hidden_size=config.vision_hidden_size,
132
+ )
133
+
134
+ self.projector = ManthanProjector(
135
+ vision_hidden=config.vision_hidden_size,
136
+ text_hidden=text_hidden,
137
+ mid=config.projector_hidden_size,
138
+ )
139
+
140
+ # Use TinyLLaVA-style negative placeholder for <image>
141
+ self.image_token_id = int(getattr(config, "image_token_id", -200))
142
+ self.num_image_tokens = int(getattr(config, "num_image_tokens", 256))
143
+
144
+ # Generation helpers
145
+ self._gen_pixel_values: Optional[torch.FloatTensor] = None
146
+
147
+ self.post_init()
148
+
149
+ @staticmethod
150
+ def format_chat_prompt(prompt: str, has_image: bool = False) -> str:
151
+ """Format a single user prompt similar to TinyLLaVA's Qwen3 template.
152
+
153
+ We intentionally keep it simple and *string based* so it does not depend
154
+ on the tokenizer's chat_template.
155
+ """
156
+
157
+ system = (
158
+ "A chat between a curious user and an artificial intelligence assistant. "
159
+ "The assistant gives helpful, detailed, and polite answers to the user's questions. "
160
+ )
161
+
162
+ if has_image:
163
+ # Strip any "<image>" the user already included so the placeholder isn't duplicated.
164
+ clean = prompt.replace("<image>", "").strip()
165
+ formatted = f"<image>\n{clean}"
166
+ else:
167
+ formatted = prompt.strip()
168
+
169
+ # Critical: no trailing space after ASSISTANT:
170
+ return system + f"USER: {formatted} ASSISTANT:"
171
+
172
+ def _inject_vision_embeds(
173
+ self,
174
+ input_ids: torch.LongTensor,
175
+ pixel_values: torch.FloatTensor,
176
+ attention_mask: Optional[torch.LongTensor] = None,
177
+ ) -> Tuple[torch.FloatTensor, Optional[torch.LongTensor]]:
178
+ """Create input_embeds where <image> tokens are replaced by projected vision tokens.
179
+
180
+ Contract:
181
+ - input_ids contains exactly `num_image_tokens` occurrences of `image_token_id`.
182
+ - We will replace them in sequence order with vision tokens.
183
+ """
184
+
185
+ # Vision features
186
+ if self.vision_model is not None:
187
+ vout = self.vision_model(pixel_values=pixel_values)
188
+ # Try common fields: last_hidden_state
189
+ vision = getattr(vout, "last_hidden_state", None)
190
+ if vision is None:
191
+ raise ValueError("Vision model output does not contain last_hidden_state")
192
+ else:
193
+ vision = self.vision_tower(pixel_values) # (B, 1+N, vision_hidden)
194
+
195
+ # Token selection
196
+ if self.config.vision_feature_select == "patch":
197
+ # Drop CLS and take first num_image_tokens patches
198
+ vision = vision[:, 1 : 1 + self.num_image_tokens, :]
199
+ elif self.config.vision_feature_select == "cls_patch":
200
+ vision = vision[:, : self.num_image_tokens, :]
201
+ else:
202
+ raise ValueError(f"Unknown vision_feature_select={self.config.vision_feature_select}")
203
+
204
+ # Be strict: ensure we have exactly num_image_tokens.
205
+ if vision.shape[1] != self.num_image_tokens:
206
+ # Common case for the toy vision tower: includes CLS + N patches.
207
+ if vision.shape[1] > self.num_image_tokens:
208
+ vision = vision[:, : self.num_image_tokens, :]
209
+ else:
210
+ raise ValueError(
211
+ f"vision tokens ({vision.shape[1]}) < num_image_tokens ({self.num_image_tokens}); increase image size/patching or reduce num_image_tokens"
212
+ )
213
+ vision = self.projector(vision) # (B, num_img, text_hidden)
214
+
215
+ # Text embeds
216
+ lm = self.language_model
217
+ if hasattr(lm, "get_input_embeddings"):
218
+ tok_emb = lm.get_input_embeddings()
219
+ else:
220
+ tok_emb = lm.base_model.get_input_embeddings()
221
+ inputs_embeds = tok_emb(input_ids)
222
+
223
+ # Replace image token positions
224
+ mask = input_ids.eq(self.image_token_id) # (B, T)
225
+ if mask.sum(dim=1).min().item() != self.num_image_tokens:
226
+ raise ValueError(
227
+ f"Expected exactly {self.num_image_tokens} <image> tokens per sample, got min={mask.sum(dim=1).min().item()}"
228
+ )
229
+
230
+ for b in range(input_ids.size(0)):
231
+ idx = torch.nonzero(mask[b], as_tuple=False).squeeze(-1)
232
+ inputs_embeds[b, idx, :] = vision[b, : idx.numel(), :]
233
+
234
+ return inputs_embeds, attention_mask
235
+
236
+ @staticmethod
237
+ def tokenizer_image_token(
238
+ prompt: str,
239
+ tokenizer,
240
+ image_token_index: int = -200,
241
+ return_tensors: Optional[str] = None,
242
+ ):
243
+ """MicroLLaVA/TinyLLaVA-style tokenization inserting a negative image placeholder id.
244
+
245
+ This avoids requiring `<image>` to be a real token in the tokenizer vocab.
246
+ """
247
+
248
+ def _insert_separator(X, sep):
249
+ return [ele for sublist in zip(X, [sep] * len(X)) for ele in sublist][:-1]
250
+
251
+ prompt_chunks = [tokenizer(chunk).input_ids for chunk in prompt.split("<image>")]
252
+
253
+ input_ids: List[int] = []
254
+ offset = 0
255
+ if (
256
+ len(prompt_chunks) > 0
257
+ and len(prompt_chunks[0]) > 0
258
+ and prompt_chunks[0][0] == tokenizer.bos_token_id
259
+ ):
260
+ offset = 1
261
+ input_ids.append(prompt_chunks[0][0])
262
+
263
+ for x in _insert_separator(prompt_chunks, [image_token_index] * (offset + 1)):
264
+ input_ids.extend(x[offset:])
265
+
266
+ if return_tensors is not None:
267
+ if return_tensors == "pt":
268
+ return torch.tensor(input_ids, dtype=torch.long)
269
+ raise ValueError(f"Unsupported tensor type: {return_tensors}")
270
+ return input_ids
271
+
272
+ def _encode_images(self, pixel_values: torch.FloatTensor) -> torch.FloatTensor:
273
+ """Return projected vision features (B, N, text_hidden)."""
274
+
275
+ if self.vision_model is not None:
276
+ vout = self.vision_model(pixel_values=pixel_values, output_hidden_states=True)
277
+ # Prefer siglip/clip style selection when hidden_states exist.
278
+ if hasattr(vout, "hidden_states") and vout.hidden_states is not None:
279
+ layer = getattr(self.config, "vision_feature_layer", -2)
280
+ vision = vout.hidden_states[layer]
281
+ else:
282
+ vision = getattr(vout, "last_hidden_state", None)
283
+ if vision is None:
284
+ raise ValueError("Vision model output has no usable hidden states")
285
+ else:
286
+ vision = self.vision_tower(pixel_values)
287
+
288
+ # Match TinyLLaVA selection strategy
289
+ strat = getattr(self.config, "vision_feature_select", "patch")
290
+ if strat == "patch":
291
+ vision = vision[:, 1:]
292
+ elif strat == "cls_patch":
293
+ vision = vision
294
+ else:
295
+ raise ValueError(f"Unknown vision_feature_select={strat}")
296
+
297
+ # Optionally truncate to configured max image tokens
298
+ if vision.shape[1] > self.num_image_tokens:
299
+ vision = vision[:, : self.num_image_tokens]
300
+
301
+ return self.projector(vision)
302
+
303
+ def prepare_inputs_labels_for_multimodal(
304
+ self,
305
+ input_ids: torch.LongTensor,
306
+ attention_mask: Optional[torch.Tensor],
307
+ past_key_values: Optional[Any],
308
+ labels: Optional[torch.LongTensor],
309
+ pixel_values: Optional[torch.FloatTensor],
310
+ ) -> Tuple[
311
+ Optional[torch.LongTensor],
312
+ Optional[torch.LongTensor],
313
+ Optional[torch.Tensor],
314
+ Optional[Any],
315
+ Optional[torch.FloatTensor],
316
+ Optional[torch.LongTensor],
317
+ ]:
318
+ """MicroLLaVA-style splice: build inputs_embeds by inserting vision features at IMAGE_TOKEN_INDEX.
319
+
320
+ Returns:
321
+ (input_ids, position_ids, attention_mask, past_key_values, inputs_embeds, labels)
322
+ where input_ids is None when inputs_embeds is provided.
323
+ """
324
+
325
+ if pixel_values is None or input_ids.shape[1] == 1 or self.vision_tower is None:
326
+ return input_ids, None, attention_mask, past_key_values, None, labels
327
+
328
+ image_features = self._encode_images(pixel_values) # (B, N, hidden)
329
+
330
+ orig_labels = labels
331
+ orig_attention_mask = attention_mask
332
+
333
+ if attention_mask is None:
334
+ attention_mask = torch.ones_like(input_ids, dtype=torch.bool)
335
+ else:
336
+ attention_mask = attention_mask.bool()
337
+
338
+ if labels is None:
339
+ labels = torch.full_like(input_ids, IGNORE_INDEX)
340
+
341
+ # Remove padding
342
+ input_ids_list = [cur_ids[cur_mask] for cur_ids, cur_mask in zip(input_ids, attention_mask)]
343
+ labels_list = [cur_lbl[cur_mask] for cur_lbl, cur_mask in zip(labels, attention_mask)]
344
+
345
+ tok_emb = self.language_model.get_input_embeddings()
346
+ vocab_size = int(getattr(tok_emb, "num_embeddings", 0) or 0)
347
+
348
+ new_input_embeds: List[torch.Tensor] = []
349
+ new_labels: List[torch.Tensor] = []
350
+ cur_image_idx = 0
351
+
352
+ for batch_idx, cur_ids in enumerate(input_ids_list):
353
+ num_images = int((cur_ids == self.image_token_id).sum().item())
354
+
355
+ # No image tokens: plain text path.
356
+ if num_images == 0:
357
+ if vocab_size > 0:
358
+ cur_ids = cur_ids.clamp(min=0, max=vocab_size - 1)
359
+ new_input_embeds.append(tok_emb(cur_ids))
360
+ new_labels.append(labels_list[batch_idx])
361
+ continue
362
+
363
+ # Split around image placeholder positions
364
+ image_token_indices = [-1] + torch.where(cur_ids == self.image_token_id)[0].tolist() + [cur_ids.shape[0]]
365
+ cur_labels = labels_list[batch_idx]
366
+
367
+ seg_ids: List[torch.Tensor] = []
368
+ seg_lbls: List[torch.Tensor] = []
369
+ for i in range(len(image_token_indices) - 1):
370
+ s = image_token_indices[i] + 1
371
+ e = image_token_indices[i + 1]
372
+ seg_ids.append(cur_ids[s:e])
373
+ seg_lbls.append(cur_labels[s:e])
374
+
375
+ split_sizes = [x.shape[0] for x in seg_ids]
376
+ total = int(sum(split_sizes))
377
+
378
+ if total > 0:
379
+ flat_ids = torch.cat(seg_ids, dim=0)
380
+ # Never feed negative ids into embeddings.
381
+ flat_ids = flat_ids[flat_ids >= 0]
382
+ if vocab_size > 0 and flat_ids.numel() > 0:
383
+ flat_ids = flat_ids.clamp(min=0, max=vocab_size - 1)
384
+ flat_emb = tok_emb(flat_ids)
385
+ emb_chunks = list(torch.split(flat_emb, split_sizes, dim=0))
386
+ else:
387
+ emb_chunks = [tok_emb(cur_ids[:0]) for _ in split_sizes]
388
+
389
+ cur_new_embeds: List[torch.Tensor] = []
390
+ cur_new_labels: List[torch.Tensor] = []
391
+
392
+ for i in range(num_images + 1):
393
+ cur_new_embeds.append(emb_chunks[i])
394
+ cur_new_labels.append(seg_lbls[i])
395
+ if i < num_images:
396
+ cur_img_feat = image_features[cur_image_idx]
397
+ cur_image_idx += 1
398
+ cur_new_embeds.append(cur_img_feat)
399
+ cur_new_labels.append(
400
+ torch.full(
401
+ (cur_img_feat.shape[0],),
402
+ IGNORE_INDEX,
403
+ device=cur_labels.device,
404
+ dtype=cur_labels.dtype,
405
+ )
406
+ )
407
+
408
+ new_input_embeds.append(torch.cat([x.to(self.device) for x in cur_new_embeds], dim=0))
409
+ new_labels.append(torch.cat(cur_new_labels, dim=0))
410
+
411
+ # Truncate if needed
412
+ max_len_cfg = getattr(self.config, "tokenizer_model_max_length", None)
413
+ if max_len_cfg is not None:
414
+ new_input_embeds = [x[:max_len_cfg] for x in new_input_embeds]
415
+ new_labels = [x[:max_len_cfg] for x in new_labels]
416
+
417
+ max_len = max(x.shape[0] for x in new_input_embeds)
418
+ batch_size = len(new_input_embeds)
419
+
420
+ padded_embeds: List[torch.Tensor] = []
421
+ padded_labels = torch.full(
422
+ (batch_size, max_len),
423
+ IGNORE_INDEX,
424
+ dtype=new_labels[0].dtype,
425
+ device=new_labels[0].device,
426
+ )
427
+ padded_mask = torch.zeros((batch_size, max_len), dtype=torch.long, device=padded_labels.device)
428
+
429
+ for i, (emb, lbl) in enumerate(zip(new_input_embeds, new_labels)):
430
+ cur_len = emb.shape[0]
431
+ pad = torch.zeros((max_len - cur_len, emb.shape[1]), dtype=emb.dtype, device=emb.device)
432
+ padded_embeds.append(torch.cat([emb, pad], dim=0))
433
+ padded_labels[i, :cur_len] = lbl
434
+ padded_mask[i, :cur_len] = 1
435
+
436
+ inputs_embeds = torch.stack(padded_embeds, dim=0)
437
+
438
+ out_labels = None if orig_labels is None else padded_labels
439
+ out_mask = None if orig_attention_mask is None else padded_mask
440
+
441
+ return None, None, out_mask, past_key_values, inputs_embeds, out_labels
442
+
443
+ def forward(
444
+ self,
445
+ input_ids: Optional[torch.LongTensor] = None,
446
+ attention_mask: Optional[torch.LongTensor] = None,
447
+ pixel_values: Optional[torch.FloatTensor] = None,
448
+ labels: Optional[torch.LongTensor] = None,
449
+ **kwargs,
450
+ ) -> CausalLMOutputWithPast:
451
+ if input_ids is None:
452
+ raise ValueError("input_ids is required")
453
+
454
+ if pixel_values is None and self._gen_pixel_values is not None:
455
+ pixel_values = self._gen_pixel_values
456
+
457
+ past_key_values = kwargs.get("past_key_values", None)
458
+ (
459
+ input_ids,
460
+ position_ids,
461
+ attention_mask,
462
+ past_key_values,
463
+ inputs_embeds,
464
+ labels,
465
+ ) = self.prepare_inputs_labels_for_multimodal(
466
+ input_ids=input_ids,
467
+ attention_mask=attention_mask,
468
+ past_key_values=past_key_values,
469
+ labels=labels,
470
+ pixel_values=pixel_values,
471
+ )
472
+
473
+ return self.language_model(
474
+ input_ids=input_ids,
475
+ inputs_embeds=inputs_embeds,
476
+ attention_mask=attention_mask,
477
+ position_ids=position_ids,
478
+ past_key_values=past_key_values,
479
+ labels=labels,
480
+ **kwargs,
481
+ )
482
+
483
+ @torch.no_grad()
484
+ def generate(
485
+ self,
486
+ input_ids: Optional[torch.Tensor] = None,
487
+ pixel_values: Optional[torch.FloatTensor] = None,
488
+ attention_mask: Optional[torch.Tensor] = None,
489
+ **kwargs,
490
+ ):
491
+ if input_ids is None:
492
+ raise ValueError("input_ids is required")
493
+
494
+ if pixel_values is not None:
495
+ (
496
+ _,
497
+ position_ids,
498
+ attention_mask,
499
+ _,
500
+ inputs_embeds,
501
+ _,
502
+ ) = self.prepare_inputs_labels_for_multimodal(
503
+ input_ids=input_ids,
504
+ attention_mask=attention_mask,
505
+ past_key_values=None,
506
+ labels=None,
507
+ pixel_values=pixel_values,
508
+ )
509
+ return self.language_model.generate(
510
+ inputs_embeds=inputs_embeds,
511
+ attention_mask=attention_mask,
512
+ position_ids=position_ids,
513
+ **kwargs,
514
+ )
515
+
516
+ return self.language_model.generate(
517
+ input_ids=input_ids,
518
+ attention_mask=attention_mask,
519
+ **kwargs,
520
+ )
521
+
522
+ def prepare_inputs_for_generation(
523
+ self,
524
+ input_ids: torch.LongTensor,
525
+ past_key_values: Optional[Tuple[Tuple[torch.Tensor, ...], ...]] = None,
526
+ attention_mask: Optional[torch.LongTensor] = None,
527
+ pixel_values: Optional[torch.FloatTensor] = None,
528
+ **kwargs,
529
+ ) -> Dict[str, Any]:
530
+ """HF generation hook.
531
+
532
+ Contract:
533
+ - On the first step, we keep full `input_ids` and provide `pixel_values`.
534
+ - For subsequent steps (past_key_values != None), HF passes only the last token.
535
+ We keep cached `pixel_values` and forward normally.
536
+ """
537
+
538
+ if pixel_values is not None:
539
+ self._gen_pixel_values = pixel_values
540
+
541
+ # If we have past, only feed last token (standard causal LM behavior)
542
+ if past_key_values is not None:
543
+ input_ids = input_ids[:, -1:]
544
+
545
+ model_inputs: Dict[str, Any] = {
546
+ "input_ids": input_ids,
547
+ "attention_mask": attention_mask,
548
+ "past_key_values": past_key_values,
549
+ "use_cache": kwargs.get("use_cache", True),
550
+ }
551
+
552
+ if self._gen_pixel_values is not None:
553
+ model_inputs["pixel_values"] = self._gen_pixel_values
554
+
555
+ return model_inputs
556
+
557
+ def _reorder_cache(self, past_key_values, beam_idx):
558
+ # Delegate to the underlying LM implementation
559
+ if hasattr(self.language_model, "_reorder_cache"):
560
+ return self.language_model._reorder_cache(past_key_values, beam_idx)
561
+ return past_key_values
562
+
563
+ @torch.no_grad()
564
+ def chat(
565
+ self,
566
+ prompt: str,
567
+ tokenizer,
568
+ image: Optional[Union[str, "PIL.Image.Image"]] = None,
569
+ max_new_tokens: int = 128,
570
+ **gen_kwargs,
571
+ ) -> str:
572
+ """Simple chat helper (mirrors the style in your MicroLLaVA README).
573
+
574
+ This is intentionally minimal. A proper processor will be added later.
575
+ """
576
+
577
+ # Lazy import to keep base install small
578
+ pixel_values = None
579
+ if image is not None:
580
+ from PIL import Image
581
+ import requests
582
+ from io import BytesIO
583
+
584
+ if isinstance(image, str) and image.startswith("http"):
585
+ r = requests.get(image, timeout=30)
586
+ r.raise_for_status()
587
+ image = Image.open(BytesIO(r.content)).convert("RGB")
588
+ elif isinstance(image, str):
589
+ image = Image.open(image).convert("RGB")
590
+
591
+ # Prefer model-specific preprocessing when we have a real vision backbone.
592
+ if self.vision_model is not None and self.config.vision_model_id:
593
+ from transformers import AutoProcessor
594
+
595
+ proc = AutoProcessor.from_pretrained(self.config.vision_model_id, trust_remote_code=True)
596
+ pv = proc(images=image, return_tensors="pt")
597
+ pixel_values = pv.get("pixel_values", None)
598
+ if pixel_values is None:
599
+ raise ValueError("AutoProcessor did not return pixel_values")
600
+ else:
601
+ import torchvision.transforms as T
602
+
603
+ tfm = T.Compose(
604
+ [
605
+ T.Resize((self.config.vision_image_size, self.config.vision_image_size)),
606
+ T.ToTensor(),
607
+ ]
608
+ )
609
+ pixel_values = tfm(image).unsqueeze(0)
610
+
611
+ # Insert language mirroring instruction (simple + robust)
612
+ # We keep it short to not fight Qwen's reasoning.
613
+ user_prompt = (
614
+ "Reply in the same language as the user's prompt (e.g., Tamil in Tamil, Hindi in Hindi, English in English). "
615
+ "Be helpful and concise.\n\n" + prompt
616
+ )
617
+
618
+ formatted = self.format_chat_prompt(user_prompt, has_image=pixel_values is not None)
619
+
620
+ # MicroLLaVA-style: tokenize by inserting image placeholder ids directly.
621
+ if pixel_values is not None:
622
+ input_ids = self.tokenizer_image_token(
623
+ formatted,
624
+ tokenizer,
625
+ image_token_index=self.image_token_id,
626
+ return_tensors="pt",
627
+ ).unsqueeze(0)
628
+ attention_mask = torch.ones_like(input_ids, dtype=torch.long)
629
+ else:
630
+ ids = tokenizer(formatted, return_tensors="pt")
631
+ input_ids = ids["input_ids"]
632
+ attention_mask = ids.get("attention_mask")
633
+
634
+ # Keep everything on the model device
635
+ input_ids = input_ids.to(self.device)
636
+ if attention_mask is not None:
637
+ attention_mask = attention_mask.to(self.device)
638
+ if pixel_values is not None:
639
+ pixel_values = pixel_values.to(self.device)
640
+
641
+ # Use this wrapper's generate support so pixel_values can be carried through.
642
+ out = self.generate(
643
+ input_ids=input_ids,
644
+ attention_mask=attention_mask,
645
+ pixel_values=pixel_values,
646
+ max_new_tokens=max_new_tokens,
647
+ **gen_kwargs,
648
+ )
649
+ return tokenizer.decode(out[0], skip_special_tokens=True)
650
+
651
+
652
+ # Registration so AutoModel/AutoConfig can find it when you package/export.
653
+ AutoConfig.register(ManthanConfig.model_type, ManthanConfig)
654
+ AutoModelForCausalLM.register(ManthanConfig, ManthanForCausalLM)
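A hedged, no-big-download sketch that exercises the image-placeholder path end to end with the toy vision tower and the GPT-2 fallback backbone (only the small `gpt2` tokenizer is fetched, as in the other smoke tests; the prompt and sizes are arbitrary):

```python
import torch
from transformers import AutoTokenizer

from manthan_t1.configuration_manthan import ManthanConfig
from manthan_t1.modeling_manthan import ManthanForCausalLM

cfg = ManthanConfig(
    vision_image_size=224, vision_patch_size=16, vision_hidden_size=128,
    num_image_tokens=64,  # small, arbitrary value for the sketch
)
model = ManthanForCausalLM(cfg).eval()

tok = AutoTokenizer.from_pretrained("gpt2")
prompt = model.format_chat_prompt("What is in this picture?", has_image=True)

# The single "<image>" in the prompt becomes one -200 placeholder id; the forward pass
# replaces it with `num_image_tokens` projected vision embeddings.
input_ids = model.tokenizer_image_token(
    prompt, tok, image_token_index=model.image_token_id, return_tensors="pt"
).unsqueeze(0)
pixel_values = torch.randn(1, 3, cfg.vision_image_size, cfg.vision_image_size)

with torch.no_grad():
    out = model(input_ids=input_ids, pixel_values=pixel_values)
print(tuple(out.logits.shape))  # sequence length = text tokens + num_image_tokens
```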
manthan_t1/smoke_test.py ADDED
@@ -0,0 +1,63 @@
1
+ """Minimal smoke test.
2
+
3
+ Goal: ensure the repo is runnable on macOS + MLX before we wire real Qwen/SigLIP weights.
4
+ """
5
+
6
+ from __future__ import annotations
7
+
8
+ import time
9
+
10
+ import mlx.core as mx
11
+ import mlx.nn as nn
12
+ import mlx.optimizers as optim
13
+
14
+
15
+ class TinyToyModel(nn.Module):
16
+ def __init__(self, vocab_size: int = 256, d_model: int = 128):
17
+ super().__init__()
18
+ self.emb = nn.Embedding(vocab_size, d_model)
19
+ self.l1 = nn.Linear(d_model, d_model)
20
+ self.l2 = nn.Linear(d_model, vocab_size)
21
+
22
+ def __call__(self, token_ids: mx.array) -> mx.array:
23
+ # token_ids: (B, T)
24
+ x = self.emb(token_ids)
25
+ x = nn.relu(self.l1(x))
26
+ return self.l2(x)
27
+
28
+
29
+ def main() -> None:
30
+ mx.random.seed(0)
31
+
32
+ model = TinyToyModel()
33
+
34
+ # fake batch
35
+ B, T = 4, 32
36
+ token_ids = mx.random.randint(0, 256, shape=(B, T))
37
+ targets = mx.random.randint(0, 256, shape=(B, T))
38
+
39
+ def loss_fn(m: TinyToyModel, x: mx.array, y: mx.array) -> mx.array:
40
+ logits = m(x)
41
+ logits2 = logits.reshape((-1, logits.shape[-1]))
42
+ y2 = y.reshape((-1,))
43
+ # `cross_entropy` may return per-example/per-token; reduce to scalar.
44
+ return mx.mean(nn.losses.cross_entropy(logits2, y2))
45
+
46
+ opt = optim.Adam(learning_rate=1e-3)
47
+
48
+ start = time.time()
49
+ def f(m: TinyToyModel) -> mx.array:
50
+ return loss_fn(m, token_ids, targets)
51
+
52
+ for step in range(5):
53
+ loss, grads = mx.value_and_grad(f)(model)
54
+ opt.update(model, grads)
55
+ mx.eval(loss)
56
+ print(f"step={step} loss={float(loss):.4f}")
57
+
58
+ mx.eval(model.parameters())
59
+ print(f"OK (elapsed {time.time() - start:.2f}s)")
60
+
61
+
62
+ if __name__ == "__main__":
63
+ main()
manthan_t1/text_generate_smoke.py ADDED
@@ -0,0 +1,30 @@
1
+ """No-download generate smoke test.
2
+
3
+ Ensures that `ManthanForCausalLM.generate()` works (text-only path), which is
4
+ required before enabling image+generate.
5
+ """
6
+
7
+ from __future__ import annotations
8
+
9
+ import torch
10
+ from transformers import AutoTokenizer
11
+
12
+ from manthan_t1.configuration_manthan import ManthanConfig
13
+ from manthan_t1.modeling_manthan import ManthanForCausalLM
14
+
15
+
16
+ def main() -> None:
17
+ cfg = ManthanConfig(text_config={"model_type": "gpt2"})
18
+ model = ManthanForCausalLM(cfg)
19
+ model.eval()
20
+
21
+ tok = AutoTokenizer.from_pretrained("gpt2")
22
+ prompt = "Hello, my name is"
23
+ inputs = tok(prompt, return_tensors="pt")
24
+
25
+ out = model.generate(**inputs, max_new_tokens=10)
26
+ print(tok.decode(out[0], skip_special_tokens=True))
27
+
28
+
29
+ if __name__ == "__main__":
30
+ main()
manthan_t1/tokenizer_smoke.py ADDED
@@ -0,0 +1,28 @@
1
+ """Tokenizer + template smoke test.
2
+
3
+ - Ensures `<image>` can be added to a tokenizer
4
+ - Ensures we can build a prompt containing `<image>` token placeholders
5
+
6
+ Downloads a small tokenizer (gpt2) only.
7
+ """
8
+
9
+ from __future__ import annotations
10
+
11
+ from transformers import AutoTokenizer
12
+
13
+ from manthan_t1.tokenizer_utils import ensure_vision_special_tokens
14
+
15
+
16
+ def main() -> None:
17
+ tok = AutoTokenizer.from_pretrained("gpt2")
18
+ res = ensure_vision_special_tokens(tok, add_im_start_end=True)
19
+
20
+ prompt = ("<image> " * 4) + "\nExplain what you see."
21
+ ids = tok(prompt).input_ids
22
+
23
+ print("image_token_id", res.image_token_id)
24
+ print("count", sum(1 for i in ids if i == res.image_token_id))
25
+
26
+
27
+ if __name__ == "__main__":
28
+ main()
manthan_t1/tokenizer_utils.py ADDED
@@ -0,0 +1,52 @@
1
+ from __future__ import annotations
2
+
3
+ from dataclasses import dataclass
4
+ from typing import Dict, Optional, Tuple
5
+
6
+
7
+ DEFAULT_SPECIAL_TOKENS: Dict[str, str] = {
8
+ "image_token": "<image>",
9
+ "im_start": "<im_start>",
10
+ "im_end": "<im_end>",
11
+ }
12
+
13
+
14
+ @dataclass
15
+ class TokenSetupResult:
16
+ image_token_id: int
17
+ added: Dict[str, int]
18
+
19
+
20
+ def ensure_vision_special_tokens(tokenizer, add_im_start_end: bool = False) -> TokenSetupResult:
21
+ """Ensure the tokenizer has `<image>` (and optionally `<im_start>/<im_end>`).
22
+
23
+ Returns the chosen `image_token_id` and any newly-added token ids.
24
+
25
+ Works with both fast and slow tokenizers.
26
+ """
27
+
28
+ specials = {
29
+ "additional_special_tokens": [DEFAULT_SPECIAL_TOKENS["image_token"]],
30
+ }
31
+ if add_im_start_end:
32
+ specials["additional_special_tokens"].extend(
33
+ [DEFAULT_SPECIAL_TOKENS["im_start"], DEFAULT_SPECIAL_TOKENS["im_end"]]
34
+ )
35
+
36
+ # Add only missing tokens
37
+ existing = set(getattr(tokenizer, "additional_special_tokens", []) or [])
38
+ to_add = [t for t in specials["additional_special_tokens"] if t not in existing]
39
+
40
+ added_map: Dict[str, int] = {}
41
+ if to_add:
42
+ tokenizer.add_special_tokens({"additional_special_tokens": to_add})
43
+ # after add, resolve ids
44
+ for t in to_add:
45
+ added_map[t] = tokenizer.convert_tokens_to_ids(t)
46
+
47
+ image_token = DEFAULT_SPECIAL_TOKENS["image_token"]
48
+ image_token_id = tokenizer.convert_tokens_to_ids(image_token)
49
+ if image_token_id is None or image_token_id < 0:
50
+ raise ValueError("Failed to register <image> token")
51
+
52
+ return TokenSetupResult(image_token_id=image_token_id, added=added_map)
pyproject.toml ADDED
@@ -0,0 +1,29 @@
1
+ [project]
2
+ name = "manthan-t1"
3
+ version = "0.0.1"
4
+ description = "Manthan-T1: MLX-first vision-language model (Qwen3 + vision encoder) fine-tuning scaffold"
5
+ readme = "README.md"
6
+ requires-python = ">=3.10"
7
+ license = {text = "Apache-2.0"}
8
+ authors = [{name = "Manthan"}]
9
+
10
+ dependencies = [
11
+ "mlx>=0.17.0",
12
+ "numpy>=1.26",
13
+ "pillow>=10.0",
14
+ "tqdm>=4.66",
15
+ "pyyaml>=6.0",
16
+ "transformers>=4.55.0",
17
+ "torch>=2.2.0",
18
+ "torchvision>=0.17.0",
19
+ "requests>=2.31",
20
+ ]
21
+
22
+ [project.optional-dependencies]
23
+ dev = [
24
+ "pytest>=8.0",
25
+ ]
26
+
27
+ [tool.pytest.ini_options]
28
+ addopts = "-q"
29
+ testpaths = ["tests"]
requirements.txt ADDED
File without changes
scripts/export_hf.py ADDED
@@ -0,0 +1,208 @@
1
+ #!/usr/bin/env python3
2
+
3
+ """Export a Manthan-T1 folder that can be uploaded to Hugging Face.
4
+
5
+ What this does:
6
+ - Copies `hf_export_stub/*` into an output directory
7
+ - Builds a tokenizer from `tokenizer_name_or_path` (defaults to Qwen3)
8
+ - Ensures `<image>` is a real special token in the tokenizer
9
+ - Writes `tokenizer_config.json`, `special_tokens_map.json`, `added_tokens.json`, and `chat_template.jinja`
10
+ - Updates `config.json` with a correct `image_token_id` (kept equal to -200 placeholder)
11
+
12
+ Note:
13
+ - This does NOT include model weights. It's intended for placeholder-weight repo layout
14
+ (like your MicroLLaVA example). For training, you'll later save actual weights.
15
+ """
16
+
17
+ from __future__ import annotations
18
+
19
+ import argparse
20
+ import json
21
+ import os
22
+ import shutil
23
+ import sys
24
+ from pathlib import Path
25
+
26
+ from transformers import AutoTokenizer
27
+
28
+
29
+ # Allow running this script without installing the package.
30
+ REPO_ROOT = Path(__file__).resolve().parents[1]
31
+ if str(REPO_ROOT) not in sys.path:
32
+ sys.path.insert(0, str(REPO_ROOT))
33
+
34
+
35
+ def _copytree(src: Path, dst: Path) -> None:
36
+ dst.mkdir(parents=True, exist_ok=True)
37
+ for item in src.iterdir():
38
+ s = item
39
+ d = dst / item.name
40
+ if item.is_dir():
41
+ shutil.copytree(s, d, dirs_exist_ok=True)
42
+ else:
43
+ shutil.copy2(s, d)
44
+
45
+
46
+ def main() -> None:
47
+ ap = argparse.ArgumentParser()
48
+ ap.add_argument("--out", required=True, help="Output folder")
49
+ ap.add_argument(
50
+ "--stub",
51
+ default=str(Path(__file__).resolve().parents[1] / "hf_export_stub"),
52
+ help="Path to hf_export_stub folder",
53
+ )
54
+ ap.add_argument(
55
+ "--tokenizer",
56
+ default=None,
57
+ help="Tokenizer name/path. Defaults to config.json tokenizer_name_or_path.",
58
+ )
59
+ ap.add_argument(
60
+ "--tokenizer_local_dir",
61
+ default=None,
62
+ help="Local tokenizer directory to copy (e.g. MicroLlava-* folder). If set, no network fetch is performed.",
63
+ )
64
+ ap.add_argument(
65
+ "--write_stub_weights",
66
+ action="store_true",
67
+ help="Write randomly-initialized weights (model.safetensors) into the export dir so from_pretrained() succeeds.",
68
+ )
69
+ args = ap.parse_args()
70
+
71
+ out_dir = Path(args.out).expanduser().resolve()
72
+ stub_dir = Path(args.stub).expanduser().resolve()
73
+
74
+ if not stub_dir.exists():
75
+ raise SystemExit(f"Stub dir not found: {stub_dir}")
76
+
77
+ out_dir.mkdir(parents=True, exist_ok=True)
78
+ _copytree(stub_dir, out_dir)
79
+
80
+ # Ensure we don't keep stale remote-code python files from a previous export.
81
+ for stale in ["configuration_manthan.py", "modeling_manthan.py", "__init__.py"]:
82
+ p = out_dir / stale
83
+ if p.exists():
84
+ p.unlink()
85
+
86
+ # Copy remote-code python files to export root (HF dynamic module loader expects them)
87
+ repo_root = Path(__file__).resolve().parents[1]
88
+ pkg_dir = repo_root / "manthan_t1"
89
+ for fname in ["configuration_manthan.py", "modeling_manthan.py", "__init__.py"]:
90
+ src = pkg_dir / fname
91
+ if not src.exists():
92
+ raise SystemExit(f"Missing required source file for export: {src}")
93
+ shutil.copy2(src, out_dir / fname)
94
+
95
+ cfg_path = out_dir / "config.json"
96
+ if not cfg_path.exists():
97
+ raise SystemExit(f"config.json not found in: {out_dir}")
98
+
99
+ cfg = json.loads(cfg_path.read_text(encoding="utf-8"))
100
+ tokenizer_name = (
101
+ args.tokenizer
102
+ or cfg.get("tokenizer_name_or_path")
103
+ or cfg.get("llm_model_name_or_path")
104
+ or cfg.get("text_model_id")
105
+ or cfg.get("vision_model_id")
106
+ )
107
+ if not tokenizer_name:
108
+ raise SystemExit("Could not infer tokenizer_name_or_path")
109
+
110
+ # Prefer an on-disk tokenizer (e.g. the attached MicroLLaVA folder) to avoid any
111
+ # network dependency during export.
112
+ repo_root = Path(__file__).resolve().parents[1]
113
+ local_tokenizer_candidates = [
114
+ repo_root / "MicroLlava-Qwen3-0.6B-base-siglip2-so400m",
115
+ ]
116
+ for cand in local_tokenizer_candidates:
117
+ if cand.exists() and (cand / "tokenizer_config.json").exists():
118
+ tokenizer_name = str(cand)
119
+ break
120
+
121
+ tok = AutoTokenizer.from_pretrained(
122
+ tokenizer_name,
123
+ trust_remote_code=True,
124
+ use_fast=bool(cfg.get("tokenizer_use_fast", False)),
125
+ local_files_only=True,
126
+ )
127
+
128
+ # Ensure special tokens exist
129
+ added = tok.add_special_tokens({"additional_special_tokens": ["<image>"]})
130
+ # Some tokenizers need a pad token for batching.
131
+ if tok.pad_token_id is None and cfg.get("pad_token"):
132
+ tok.add_special_tokens({"pad_token": cfg["pad_token"]})
133
+
134
+ # Save tokenizer files into export dir
135
+ tok.save_pretrained(out_dir)
136
+
137
+ # Copy chat template if present in stub
138
+ tmpl_src = out_dir / "chat_template.jinja"
139
+ if tmpl_src.exists():
140
+ # Ensure tokenizer_config.json references it (HF uses string field)
141
+ tok_cfg_path = out_dir / "tokenizer_config.json"
142
+ if tok_cfg_path.exists():
143
+ tok_cfg = json.loads(tok_cfg_path.read_text(encoding="utf-8"))
144
+ else:
145
+ tok_cfg = {}
146
+ tok_cfg["chat_template"] = tmpl_src.read_text(encoding="utf-8")
147
+ tok_cfg_path.write_text(json.dumps(tok_cfg, indent=2, ensure_ascii=False) + "\n", encoding="utf-8")
148
+
149
+ # Align config fields with MicroLLaVA convention
150
+ cfg.setdefault("image_token_index", -200)
151
+ cfg["image_token_index"] = -200
152
+ cfg["image_token_id"] = -200
153
+
154
+ # For user convenience record actual tokenizer vocab id of '<image>'
155
+ img_vocab_id = tok.convert_tokens_to_ids("<image>")
156
+ cfg["tokenizer_image_token_id"] = int(img_vocab_id) if img_vocab_id is not None else None
157
+ cfg["tokenizer_added_tokens"] = int(added)
158
+
159
+ cfg_path.write_text(json.dumps(cfg, indent=2, ensure_ascii=False) + "\n", encoding="utf-8")
160
+
161
+ # Minimal README hint
162
+ readme = out_dir / "README_EXPORT.md"
163
+ readme.write_text(
164
+ "Manthan-T1 export folder (stub).\n\n"
165
+ "- `config.json` uses `image_token_index=-200` placeholder like TinyLLaVA.\n"
166
+ "- Tokenizer contains a real `<image>` special token.\n"
167
+ "- This folder does not include model weights; training should save weights here later.\n",
168
+ encoding="utf-8",
169
+ )
170
+
171
+ print(f"Exported to: {out_dir}")
172
+
173
+ if args.write_stub_weights:
174
+ # Import only when requested to avoid heavier imports for plain export.
175
+ from manthan_t1.configuration_manthan import ManthanConfig
176
+ from manthan_t1.modeling_manthan import ManthanForCausalLM
177
+
178
+ # Tiny randomly-initialized model that is loadable.
179
+ # This does not download any base weights.
180
+ stub_cfg = ManthanConfig(
181
+ text_model_id=None,
182
+ vision_model_id=None,
183
+ image_token_index=-200,
184
+ num_image_tokens=32,
185
+ )
186
+ model = ManthanForCausalLM(stub_cfg)
187
+ model.save_pretrained(out_dir, safe_serialization=True)
188
+
189
+ # Ensure auto_map is present so AutoConfig/AutoModel can resolve our
190
+ # custom classes via trust_remote_code.
191
+ saved_cfg = json.loads((out_dir / "config.json").read_text(encoding="utf-8"))
192
+ saved_cfg["auto_map"] = cfg.get(
193
+ "auto_map",
194
+ {
195
+ "AutoConfig": "configuration_manthan.ManthanConfig",
196
+ "AutoModelForCausalLM": "modeling_manthan.ManthanForCausalLM",
197
+ },
198
+ )
199
+ (out_dir / "config.json").write_text(
200
+ json.dumps(saved_cfg, indent=2, ensure_ascii=False) + "\n",
201
+ encoding="utf-8",
202
+ )
203
+
204
+ print("Wrote stub weights: model.safetensors")
205
+
206
+
207
+ if __name__ == "__main__":
208
+ main()
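
Note: the exported `config.json` records both the TinyLLaVA-style placeholder (`image_token_index = -200`) and the tokenizer's real vocab id for `<image>` (`tokenizer_image_token_id`). Below is a minimal sketch of one way a caller could reconcile the two at inference time; the field names come from the export code above, but the remapping itself is an assumption (the training script later in this commit instead splits prompts on the literal `<image>` string):

```python
import json
from pathlib import Path

from transformers import AutoTokenizer

# Folder produced by: python scripts/export_hf.py --out hf_export_ready --write_stub_weights
export_dir = Path("hf_export_ready")
cfg = json.loads((export_dir / "config.json").read_text(encoding="utf-8"))
tok = AutoTokenizer.from_pretrained(export_dir, trust_remote_code=True)

placeholder_id = cfg["image_token_index"]         # -200, never a real vocab id
vocab_image_id = cfg["tokenizer_image_token_id"]  # real id of '<image>' in the tokenizer

# Tokenize with the real <image> token, then swap it for the placeholder before
# the ids reach the model (TinyLLaVA-style convention).
ids = tok("Describe this image: <image>").input_ids
ids = [placeholder_id if t == vocab_image_id else t for t in ids]
print(ids)
```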
scripts/infer_hf.py ADDED
@@ -0,0 +1,30 @@
1
+ from __future__ import annotations
2
+
3
+ import argparse
4
+
5
+ import torch
6
+ from transformers import AutoModelForCausalLM, AutoTokenizer
7
+
8
+
9
+ def main() -> None:
10
+ ap = argparse.ArgumentParser()
11
+ ap.add_argument("--model", type=str, default="./")
12
+ ap.add_argument("--prompt", type=str, required=True)
13
+ ap.add_argument("--image", type=str, default=None, help="URL or local path")
14
+ args = ap.parse_args()
15
+
16
+ model = AutoModelForCausalLM.from_pretrained(args.model, trust_remote_code=True)
17
+ tok = AutoTokenizer.from_pretrained(args.model, use_fast=False)
18
+
19
+ if hasattr(model, "chat"):
20
+ text = model.chat(prompt=args.prompt, image=args.image, tokenizer=tok)
21
+ print(text)
22
+ return
23
+
24
+ inputs = tok(args.prompt, return_tensors="pt")
25
+ out = model.generate(**inputs, max_new_tokens=128)
26
+ print(tok.decode(out[0], skip_special_tokens=True))
27
+
28
+
29
+ if __name__ == "__main__":
30
+ main()
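
`--model` defaults to the current directory but also accepts a Hub repo id. An inline equivalent of the text-only fallback path, using the repo id from the Kaggle runner's `MANTHAN_MODEL` default further below (whether that repo is published with working weights is an assumption):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "zyxcisss/Manthan-T1"  # assumed, taken from scripts/kaggle_train_all.sh
model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True)
tok = AutoTokenizer.from_pretrained(repo_id, use_fast=False)

inputs = tok("Hello, who are you?", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=32)
print(tok.decode(out[0], skip_special_tokens=True))
```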
scripts/infer_qwen3_siglip2.py ADDED
@@ -0,0 +1,50 @@
1
+ from __future__ import annotations
2
+
3
+ import argparse
4
+
5
+ import torch
6
+ from transformers import AutoTokenizer
7
+
8
+ from manthan_t1.configuration_manthan import ManthanConfig
9
+ from manthan_t1.modeling_manthan import ManthanForCausalLM
10
+
11
+
12
+ def main() -> None:
13
+ ap = argparse.ArgumentParser()
14
+ ap.add_argument("--text-model", type=str, default="Qwen/Qwen3-0.6B")
15
+ ap.add_argument(
16
+ "--vision-model",
17
+ type=str,
18
+ default="google/siglip-so400m-patch14-384",
19
+ )
20
+ ap.add_argument("--prompt", type=str, required=True)
21
+ ap.add_argument("--image", type=str, default=None, help="URL or local path")
22
+ ap.add_argument("--image-token-id", type=int, default=151665)
23
+ ap.add_argument("--num-image-tokens", type=int, default=256)
24
+ args = ap.parse_args()
25
+
26
+ cfg = ManthanConfig(
27
+ text_model_id=args.text_model,
28
+ vision_model_id=args.vision_model,
29
+ vision_image_size=384,
30
+ vision_patch_size=14,
31
+ image_token_id=args.image_token_id,
32
+ num_image_tokens=args.num_image_tokens,
33
+ vision_feature_select="patch",
34
+ )
35
+
36
+ model = ManthanForCausalLM(cfg)
37
+ tok = AutoTokenizer.from_pretrained(args.text_model, use_fast=False, trust_remote_code=True)
38
+
39
+ if args.image:
40
+ out = model.chat(prompt=args.prompt, image=args.image, tokenizer=tok)
41
+ else:
42
+ inputs = tok(args.prompt, return_tensors="pt")
43
+ out_ids = model.language_model.generate(**inputs, max_new_tokens=128)
44
+ out = tok.decode(out_ids[0], skip_special_tokens=True)
45
+
46
+ print(out)
47
+
48
+
49
+ if __name__ == "__main__":
50
+ main()
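
The `--image-token-id 151665` default assumes a specific slot in the Qwen3 tokenizer. If the layout differs, the id can be derived from the tokenizer itself rather than hard-coded; a sketch (whether the surrounding model resizes its embeddings for an added token is an assumption about the remote code):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B", use_fast=False, trust_remote_code=True)
added = tok.add_special_tokens({"additional_special_tokens": ["<image>"]})
image_token_id = tok.convert_tokens_to_ids("<image>")
print(f"added={added} image_token_id={image_token_id}")
# Pass the printed id via --image-token-id instead of relying on the 151665 default.
```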
scripts/kaggle_train_all.sh ADDED
@@ -0,0 +1,136 @@
1
+ #!/usr/bin/env bash
2
+ set -euo pipefail
3
+
4
+ # One-shot Kaggle runner: Stage 1 (pretrain/alignment) -> Stage 2 (instruct finetune)
5
+ # Designed for Kaggle 2xT4, but also works on other CUDA machines.
6
+
7
+ ############################################
8
+ # User config (edit these if you want)
9
+ ############################################
10
+ : "${MANTHAN_MODEL:=zyxcisss/Manthan-T1}" # HF repo or local path containing Manthan remote-code
11
+ : "${TEXT_MODEL:=Qwen/Qwen3-0.6B-Base}" # base LLM checkpoint
12
+ : "${STAGE1_DS:=liuhaotian/LLaVA-CC3M-Pretrain-595K}" # pretrain/alignment
13
+ : "${STAGE2_DS:=liuhaotian/LLaVA-Instruct-150K}" # instruction finetune
14
+
15
+ : "${OUT_BASE:=/kaggle/working/manthan_runs}" # all outputs saved here
16
+ : "${STAGE1_OUT:=${OUT_BASE}/stage1}" # stage1 output dir
17
+ : "${STAGE2_OUT:=${OUT_BASE}/stage2}" # stage2 output dir
18
+
19
+ # Training knobs (safe defaults for 2xT4)
20
+ : "${MAX_LENGTH:=2048}"
21
+ : "${IMAGE_SIZE:=384}"
22
+ : "${BATCH_SIZE:=1}"
23
+ : "${GRAD_ACCUM:=32}"
24
+ : "${LR:=1e-4}"
25
+ : "${EPOCHS_STAGE1:=1}"
26
+ : "${EPOCHS_STAGE2:=1}"
27
+
28
+ # Optional dataset limits (set empty for full)
29
+ : "${LIMIT_STAGE1:=20000}"
30
+ : "${LIMIT_STAGE2:=150000}"
31
+
32
+ # If you want to disable LoRA for projector-only training, set USE_LORA=0
33
+ : "${USE_LORA:=1}"
34
+
35
+ # If you want this script to upload artifacts via huggingface-cli, set UPLOAD=1
36
+ : "${UPLOAD:=0}"
37
+
38
+ ############################################
39
+ # Environment setup
40
+ ############################################
41
+ if command -v nvidia-smi >/dev/null 2>&1; then
42
+ echo "GPU found:"; nvidia-smi || true
43
+ else
44
+ echo "WARNING: nvidia-smi not found. This script expects a CUDA runtime (Kaggle)."
45
+ fi
46
+
47
+ # Persist caches on Kaggle
48
+ export HF_HOME="${HF_HOME:-/kaggle/working/hf}"
49
+ export TRANSFORMERS_CACHE="${TRANSFORMERS_CACHE:-/kaggle/working/hf/transformers}"
50
+ export HF_DATASETS_CACHE="${HF_DATASETS_CACHE:-/kaggle/working/hf/datasets}"
51
+
52
+ mkdir -p "${HF_HOME}" "${TRANSFORMERS_CACHE}" "${HF_DATASETS_CACHE}" "${OUT_BASE}"
53
+
54
+ ############################################
55
+ # Dependencies
56
+ ############################################
57
+ python - <<'PY'
58
+ import sys
59
+ print("python:", sys.version)
60
+ PY
61
+
62
+ # Keep installs minimal and reproducible enough for Kaggle.
63
+ python -m pip install -U pip
64
+ python -m pip install -U "transformers>=4.45" accelerate datasets peft
65
+
66
+ # Unsloth is optional; script falls back to PEFT if it isn't installed.
67
+ python -m pip install -U unsloth || true
68
+
69
+ ############################################
70
+ # Helper to add optional args
71
+ ############################################
72
+ maybe_limit_args() {
73
+ local limit_val="$1"
74
+ if [[ -n "${limit_val}" ]]; then
75
+ echo "--limit" "${limit_val}"
76
+ fi
77
+ }
78
+
79
+ maybe_lora_args() {
80
+ if [[ "${USE_LORA}" == "1" ]]; then
81
+ echo "--use_lora"
82
+ else
83
+ echo ""
84
+ fi
85
+ }
86
+
87
+ ############################################
88
+ # Stage 1
89
+ ############################################
90
+ echo "==== Stage 1: projector alignment/pretrain ===="
91
+ python scripts/train_unsloth_kaggle.py \
92
+ --stage stage1 \
93
+ --manthan_model "${MANTHAN_MODEL}" \
94
+ --text_model "${TEXT_MODEL}" \
95
+ --dataset "${STAGE1_DS}" \
96
+ --output_dir "${STAGE1_OUT}" \
97
+ $(maybe_lora_args) \
98
+ --max_length "${MAX_LENGTH}" \
99
+ --image_size "${IMAGE_SIZE}" \
100
+ --batch_size "${BATCH_SIZE}" \
101
+ --grad_accum "${GRAD_ACCUM}" \
102
+ --lr "${LR}" \
103
+ --epochs "${EPOCHS_STAGE1}" \
104
+ $(maybe_limit_args "${LIMIT_STAGE1}")
105
+
106
+ ############################################
107
+ # Stage 2
108
+ ############################################
109
+ echo "==== Stage 2: instruction finetune ===="
110
+ python scripts/train_unsloth_kaggle.py \
111
+ --stage stage2 \
112
+ --manthan_model "${MANTHAN_MODEL}" \
113
+ --text_model "${TEXT_MODEL}" \
114
+ --dataset "${STAGE2_DS}" \
115
+ --output_dir "${STAGE2_OUT}" \
116
+ $(maybe_lora_args) \
117
+ --max_length "${MAX_LENGTH}" \
118
+ --image_size "${IMAGE_SIZE}" \
119
+ --batch_size "${BATCH_SIZE}" \
120
+ --grad_accum "${GRAD_ACCUM}" \
121
+ --lr "${LR}" \
122
+ --epochs "${EPOCHS_STAGE2}" \
123
+ $(maybe_limit_args "${LIMIT_STAGE2}")
124
+
125
+ echo "==== Done ===="
126
+ echo "Stage1 outputs: ${STAGE1_OUT}"
127
+ echo "Stage2 outputs: ${STAGE2_OUT}"
128
+
129
+ ############################################
130
+ # Optional upload (manual control)
131
+ ############################################
132
+ if [[ "${UPLOAD}" == "1" ]]; then
133
+ echo "UPLOAD=1: attempting to upload artifacts (requires HF auth)."
134
+ python -m pip install -U huggingface_hub
135
+ echo "You can now upload ${OUT_BASE} with your preferred workflow."
136
+ fi
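
When `UPLOAD=1`, the script only installs `huggingface_hub` and leaves the actual push to the user. One way to do that from Python once training has finished; the repo id is a placeholder and this snippet is not part of the script:

```python
from huggingface_hub import HfApi

api = HfApi()  # expects a prior `huggingface-cli login` or HF_TOKEN in the environment
api.upload_folder(
    folder_path="/kaggle/working/manthan_runs/stage2",  # STAGE2_OUT above
    repo_id="your-username/Manthan-T1-stage2",          # placeholder repo id
    repo_type="model",
)
```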
scripts/smoke_export_load.py ADDED
@@ -0,0 +1,66 @@
1
+ #!/usr/bin/env python3
2
+
3
+ """Smoke test: export a HF folder, then load it with trust_remote_code.
4
+
5
+ This is meant to catch:
6
+ - remote-code syntax/indentation errors
7
+ - missing auto_map
8
+ - missing model weights (optional stub)
9
+ - basic forward/generate wiring regressions
10
+
11
+ It intentionally uses the small stub-weights mode so it does not download big models.
12
+ """
13
+
14
+ from __future__ import annotations
15
+
16
+ import shutil
+ import subprocess
17
+ import sys
18
+ from pathlib import Path
19
+
20
+ import torch
21
+ from transformers import AutoModelForCausalLM, AutoTokenizer
22
+
23
+
24
+ def main() -> int:
25
+ repo_root = Path(__file__).resolve().parents[1]
26
+ out_dir = repo_root / "hf_export_ready"
27
+
28
+ if out_dir.exists():
29
+ # keep it simple
30
+ subprocess.run(["rm", "-rf", str(out_dir)], check=True)
31
+
32
+ subprocess.run(
33
+ [
34
+ sys.executable,
35
+ str(repo_root / "scripts" / "export_hf.py"),
36
+ "--out",
37
+ str(out_dir),
38
+ "--write_stub_weights",
39
+ ],
40
+ check=True,
41
+ )
42
+
43
+ print("Loading tokenizer...")
44
+ tok = AutoTokenizer.from_pretrained(out_dir, trust_remote_code=True)
45
+
46
+ print("Loading model...")
47
+ model = AutoModelForCausalLM.from_pretrained(out_dir, trust_remote_code=True)
48
+ model.eval()
49
+
50
+ # Basic forward pass (text-only)
51
+ ids = tok("Hello", return_tensors="pt").input_ids
52
+ with torch.inference_mode():
53
+ out = model(input_ids=ids)
54
+ assert out.logits.shape[:2] == ids.shape
55
+
56
+ # Tiny generate smoke
57
+ with torch.inference_mode():
58
+ gen = model.generate(ids, max_new_tokens=4, use_cache=False)
59
+ assert gen.shape[0] == 1
60
+
61
+ print("SMOKE OK")
62
+ return 0
63
+
64
+
65
+ if __name__ == "__main__":
66
+ raise SystemExit(main())
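
To run this end-to-end check under pytest alongside `tests/test_smoke.py`, a thin wrapper is enough; the file below is hypothetical and not part of this commit:

```python
# tests/test_export_load.py (hypothetical)
import subprocess
import sys
from pathlib import Path


def test_export_and_load_smoke():
    repo_root = Path(__file__).resolve().parents[1]
    script = repo_root / "scripts" / "smoke_export_load.py"
    subprocess.run([sys.executable, str(script)], check=True)
```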
scripts/train_unsloth_kaggle.py ADDED
@@ -0,0 +1,454 @@
1
+ """Kaggle/Unsloth training entrypoint for Manthan-T1 (TinyLLaVA-style).
2
+
3
+ This script is intended to be copied into a Kaggle notebook and run on 2×T4.
4
+ It supports two stages:
5
+ - stage1: projector alignment pretraining (e.g., LLaVA-CC3M-Pretrain-595K)
6
+ - stage2: instruction tuning (e.g., LLaVA-Instruct-150K)
7
+
8
+ Notes:
9
+ - We follow MicroLLaVA/TinyLLaVA convention: IMAGE_TOKEN_INDEX = -200 is inserted
10
+ into input_ids for <image> placeholders.
11
+ - Labels are IGNORE_INDEX for everything except assistant tokens.
12
+ - This script trains:
13
+ - the multimodal projector (always)
14
+ - LoRA adapters on the text model (optional, recommended)
15
+ - vision tower is frozen by default
16
+
17
+ You still need a *real* base model + vision tower weights. Stub exports will run
18
+ but won't learn useful vision-language alignment.
19
+ """
20
+
21
+ from __future__ import annotations
22
+
23
+ import argparse
24
+ import os
25
+ from dataclasses import dataclass
26
+ from typing import Any, Dict, List, Optional, Tuple
27
+
28
+ import torch
29
+ from torch import nn
30
+ from torch.utils.data import Dataset
31
+
32
+ from transformers import AutoModelForCausalLM, AutoTokenizer, get_cosine_schedule_with_warmup
33
+
34
+ try:
35
+ # Fallback for non-Unsloth environments
36
+ from peft import LoraConfig, get_peft_model
37
+ except Exception: # pragma: no cover
38
+ LoraConfig = None
39
+ get_peft_model = None
40
+
41
+ try:
42
+ # Kaggle + Unsloth
43
+ from unsloth import FastLanguageModel
44
+ except Exception: # pragma: no cover
45
+ FastLanguageModel = None
46
+
47
+ try:
48
+ from datasets import load_dataset
49
+ except Exception as e: # pragma: no cover
50
+ raise RuntimeError(
51
+ "Missing dependency `datasets`. Install with `pip install datasets` (Kaggle: add to notebook)."
52
+ ) from e
53
+
54
+
55
+ IMAGE_TOKEN_INDEX = -200
56
+ IGNORE_INDEX = -100
57
+
58
+
59
+ def tokenizer_image_token(prompt: str, tokenizer, image_token_index: int = IMAGE_TOKEN_INDEX) -> List[int]:
60
+ """MicroLLaVA/TinyLLaVA tokenizer: split on '<image>' and insert a negative id."""
61
+
62
+ def _insert_separator(X, sep):
63
+ return [ele for sublist in zip(X, [sep] * len(X)) for ele in sublist][:-1]
64
+
65
+ prompt_chunks = [tokenizer(chunk).input_ids for chunk in prompt.split("<image>")]
66
+
67
+ input_ids: List[int] = []
68
+ offset = 0
69
+ if (
70
+ len(prompt_chunks) > 0
71
+ and len(prompt_chunks[0]) > 0
72
+ and tokenizer.bos_token_id is not None
73
+ and prompt_chunks[0][0] == tokenizer.bos_token_id
74
+ ):
75
+ offset = 1
76
+ input_ids.append(prompt_chunks[0][0])
77
+
78
+ for x in _insert_separator(prompt_chunks, [image_token_index] * (offset + 1)):
79
+ input_ids.extend(x[offset:])
80
+
81
+ return input_ids
82
+
83
+
84
+ def build_prompt_from_conversations(conversations: List[Dict[str, str]]) -> Tuple[str, str]:
85
+ """Return (full_prompt, assistant_answer_text).
86
+
87
+ LLaVA datasets are 2-turn: human then gpt.
88
+ We map to the string template used in `ManthanForCausalLM.format_chat_prompt`.
89
+ """
90
+
91
+ # Expect 2 turns
92
+ human = conversations[0]["value"]
93
+ assistant = conversations[1]["value"]
94
+
95
+ system = (
96
+ "A chat between a curious user and an artificial intelligence assistant. "
97
+ "The assistant gives helpful, detailed, and polite answers to the user's questions. "
98
+ )
99
+ # IMPORTANT: no trailing space after ASSISTANT:
100
+ full = system + f"USER: {human.strip()} ASSISTANT:" + assistant
101
+ return full, assistant
102
+
103
+
104
+ @dataclass
105
+ class TrainExample:
106
+ input_ids: torch.LongTensor
107
+ labels: torch.LongTensor
108
+ image_path: str
109
+
110
+
111
+ class LlavaLikeDataset(Dataset):
112
+ def __init__(
113
+ self,
114
+ ds_name: str,
115
+ split: str,
116
+ tokenizer,
117
+ max_length: int,
118
+ limit: Optional[int] = None,
119
+ ) -> None:
120
+ self.tokenizer = tokenizer
121
+ self.max_length = max_length
122
+
123
+ # Streaming keeps Kaggle disk usage low.
124
+ self.ds = load_dataset(ds_name, split=split, streaming=True)
125
+ self.limit = limit
126
+
127
+ # Materialize a small index for non-streaming dataloader behavior.
128
+ self._cache: List[Dict[str, Any]] = []
129
+ for i, ex in enumerate(self.ds):
130
+ self._cache.append(ex)
131
+ if limit is not None and i + 1 >= limit:
132
+ break
133
+
134
+ def __len__(self) -> int:
135
+ return len(self._cache)
136
+
137
+ def __getitem__(self, idx: int) -> TrainExample:
138
+ ex = self._cache[idx]
139
+ image_path = ex["image"]
140
+ conversations = ex["conversations"]
141
+
142
+ full_prompt, _assistant = build_prompt_from_conversations(conversations)
143
+ ids = tokenizer_image_token(full_prompt, self.tokenizer, IMAGE_TOKEN_INDEX)
144
+
145
+ # Truncate
146
+ ids = ids[: self.max_length]
147
+
148
+ # Labels: only learn on assistant answer tokens.
149
+ # Simple heuristic: find the last occurrence of " ASSISTANT:" marker.
150
+ marker = " ASSISTANT:"
151
+ marker_ids = self.tokenizer(marker).input_ids
152
+
153
+ # Find marker in tokenized ids (best-effort).
154
+ start = 0
155
+ for j in range(0, len(ids) - len(marker_ids) + 1):
156
+ if ids[j : j + len(marker_ids)] == marker_ids:
157
+ start = j + len(marker_ids)
158
+
159
+ labels = [IGNORE_INDEX] * len(ids)
160
+ for j in range(start, len(ids)):
161
+ if ids[j] == IMAGE_TOKEN_INDEX:
162
+ labels[j] = IGNORE_INDEX
163
+ else:
164
+ labels[j] = ids[j]
165
+
166
+ return TrainExample(
167
+ input_ids=torch.tensor(ids, dtype=torch.long),
168
+ labels=torch.tensor(labels, dtype=torch.long),
169
+ image_path=image_path,
170
+ )
171
+
172
+
173
+ def load_image_tensor(image_path: str, image_size: int) -> torch.FloatTensor:
174
+ """Load image from local path in dataset.
175
+
176
+ In Kaggle, LLaVA datasets provide image paths relative to the dataset repo.
177
+ Hugging Face datasets streaming yields paths that resolve via HF cache.
178
+ """
179
+
180
+ from PIL import Image
181
+ import torchvision.transforms as T
182
+
183
+ img = Image.open(image_path).convert("RGB")
184
+ tfm = T.Compose([T.Resize((image_size, image_size)), T.ToTensor()])
185
+ return tfm(img)
186
+
187
+
188
+ def collate_fn(batch: List[TrainExample], image_size: int) -> Dict[str, torch.Tensor]:
189
+ # Pad to max length
190
+ max_len = max(x.input_ids.numel() for x in batch)
191
+ input_ids = torch.full((len(batch), max_len), 0, dtype=torch.long)
192
+ labels = torch.full((len(batch), max_len), IGNORE_INDEX, dtype=torch.long)
193
+ attention_mask = torch.zeros((len(batch), max_len), dtype=torch.long)
194
+
195
+ for i, ex in enumerate(batch):
196
+ L = ex.input_ids.numel()
197
+ input_ids[i, :L] = ex.input_ids
198
+ labels[i, :L] = ex.labels
199
+ attention_mask[i, :L] = 1
200
+
201
+ # Images
202
+ pixel_values = torch.stack([load_image_tensor(ex.image_path, image_size) for ex in batch], dim=0)
203
+
204
+ return {
205
+ "input_ids": input_ids,
206
+ "labels": labels,
207
+ "attention_mask": attention_mask,
208
+ "pixel_values": pixel_values,
209
+ }
210
+
211
+
212
+ def set_requires_grad(module: nn.Module, requires_grad: bool) -> None:
213
+ for p in module.parameters():
214
+ p.requires_grad = requires_grad
215
+
216
+
217
+ def save_projector(model, output_dir: str) -> None:
218
+ os.makedirs(output_dir, exist_ok=True)
219
+ if not hasattr(model, "projector"):
220
+ return
221
+ torch.save(model.projector.state_dict(), os.path.join(output_dir, "projector.pt"))
222
+
223
+
224
+ def maybe_add_lora_to_model(model, args) -> None:
225
+ """Attach LoRA adapters (Unsloth preferred; PEFT fallback)."""
226
+
227
+ if not args.use_lora:
228
+ return
229
+
230
+ # If the model already has adapters (e.g., loaded via Unsloth), skip.
231
+ if hasattr(model, "peft_config"):
232
+ return
233
+
234
+ if get_peft_model is None or LoraConfig is None:
235
+ raise RuntimeError("PEFT not installed, and Unsloth not available. Install `peft` or enable Unsloth.")
236
+
237
+ target_modules = [
238
+ # Qwen-like
239
+ "q_proj",
240
+ "k_proj",
241
+ "v_proj",
242
+ "o_proj",
243
+ "gate_proj",
244
+ "up_proj",
245
+ "down_proj",
246
+ # GPT-like fallback
247
+ "c_attn",
248
+ "c_proj",
249
+ ]
250
+ cfg = LoraConfig(
251
+ r=args.lora_r,
252
+ lora_alpha=args.lora_alpha,
253
+ lora_dropout=args.lora_dropout,
254
+ bias="none",
255
+ task_type="CAUSAL_LM",
256
+ target_modules=target_modules,
257
+ )
258
+
259
+ # Wrap the language model inside Manthan
260
+ model.language_model = get_peft_model(model.language_model, cfg)
261
+
262
+
263
+
264
+ def main() -> int:
265
+ ap = argparse.ArgumentParser()
266
+ ap.add_argument("--stage", choices=["stage1", "stage2"], required=True)
267
+ ap.add_argument("--text_model", type=str, default="Qwen/Qwen3-0.6B-Base")
268
+ ap.add_argument("--vision_model", type=str, default="google/siglip-so400m-patch14-384")
269
+ ap.add_argument("--dataset", type=str, required=True)
270
+ ap.add_argument("--output_dir", type=str, default="./outputs")
271
+ ap.add_argument("--max_length", type=int, default=2048)
272
+ ap.add_argument("--image_size", type=int, default=384)
273
+ ap.add_argument("--limit", type=int, default=2048, help="For debugging: number of samples to materialize")
274
+
275
+ # Training
276
+ ap.add_argument("--epochs", type=int, default=1)
277
+ ap.add_argument("--batch_size", type=int, default=1)
278
+ ap.add_argument("--grad_accum", type=int, default=16)
279
+ ap.add_argument("--lr", type=float, default=1e-4)
280
+ ap.add_argument("--warmup_ratio", type=float, default=0.03)
281
+ ap.add_argument("--use_lora", action="store_true")
282
+ ap.add_argument("--lora_r", type=int, default=16)
283
+ ap.add_argument("--lora_alpha", type=int, default=32)
284
+ ap.add_argument("--lora_dropout", type=float, default=0.05)
285
+
286
+ ap.add_argument(
287
+ "--manthan_model",
288
+ type=str,
289
+ required=True,
290
+ help="HF repo id or local path that contains Manthan remote-code (the thing you push to HF).",
291
+ )
292
+ ap.add_argument("--save_every", type=int, default=500)
293
+ ap.add_argument("--dry_run", action="store_true", help="Run a single synthetic step (no datasets).")
294
+
295
+ args = ap.parse_args()
296
+
297
+ os.makedirs(args.output_dir, exist_ok=True)
298
+
299
+ device = "cuda" if torch.cuda.is_available() else "cpu"
300
+ if device != "cuda":
301
+ print("WARNING: This script is designed for CUDA (Kaggle). Running on CPU will be extremely slow.")
302
+
303
+ # Tokenizer (use the LLM tokenizer)
304
+ tok = AutoTokenizer.from_pretrained(args.text_model, trust_remote_code=True, use_fast=False)
305
+ if tok.pad_token_id is None:
306
+ tok.pad_token = tok.eos_token
307
+
308
+ # Load Manthan remote-code model
309
+ # (This should contain config that points to your desired text_model_id & vision_model_id.)
310
+ model = AutoModelForCausalLM.from_pretrained(
311
+ args.manthan_model,
312
+ trust_remote_code=True,
313
+ torch_dtype=torch.float16 if device == "cuda" else None,
314
+ )
315
+
316
+ model.train()
317
+ model.to(device)
318
+
319
+ # Make sure we don't train the vision tower (T4-friendly)
320
+ if hasattr(model, "vision_model") and model.vision_model is not None:
321
+ set_requires_grad(model.vision_model, False)
322
+ if hasattr(model, "vision_tower") and model.vision_tower is not None:
323
+ set_requires_grad(model.vision_tower, False)
324
+
325
+ # Train projector always
326
+ if hasattr(model, "projector"):
327
+ set_requires_grad(model.projector, True)
328
+
329
+ # Add LoRA to the language model (recommended)
330
+ maybe_add_lora_to_model(model, args)
331
+
332
+ # Optimizer params = trainable only
333
+ trainable_params = [p for p in model.parameters() if p.requires_grad]
334
+ if len(trainable_params) == 0:
335
+ raise RuntimeError("No trainable parameters. Did you freeze everything?")
336
+
337
+ optim = torch.optim.AdamW(trainable_params, lr=args.lr, betas=(0.9, 0.95), weight_decay=0.01)
338
+
339
+ # Data
340
+ if args.dry_run:
341
+ # Minimal synthetic batch (no images on disk). This just validates loss pathway.
342
+ B, T = 1, min(64, args.max_length)
343
+
344
+ # IMPORTANT: some tokenizers report an imprecise `vocab_size`; `len(tok)` is the safe upper bound.
345
+ tok_vocab = int(len(tok))
346
+ input_ids = torch.randint(low=0, high=max(tok_vocab - 1, 1), size=(B, T), dtype=torch.long)
347
+ labels = input_ids.clone()
348
+ attn = torch.ones_like(input_ids)
349
+ pixel_values = torch.randn(B, 3, args.image_size, args.image_size)
350
+
351
+ # Insert one image placeholder
352
+ input_ids[0, 5] = IMAGE_TOKEN_INDEX
353
+ labels[0, :10] = IGNORE_INDEX
354
+
355
+ # If tokenizer vocab > model vocab (common in dry_run), clamp to avoid CE index errors.
356
+ lm_vocab = None
357
+ try:
358
+ if hasattr(model, "language_model") and hasattr(model.language_model, "config"):
359
+ lm_vocab = int(getattr(model.language_model.config, "vocab_size", 0) or 0)
360
+ except Exception:
361
+ lm_vocab = None
362
+
363
+ if lm_vocab and lm_vocab > 0:
364
+ safe_ids = input_ids.clone()
365
+ mask = safe_ids >= 0
366
+ safe_ids[mask] = safe_ids[mask].clamp(min=0, max=lm_vocab - 1)
367
+ input_ids = safe_ids
368
+
369
+ safe_labels = labels.clone()
370
+ mask = safe_labels >= 0
371
+ safe_labels[mask] = safe_labels[mask].clamp(min=0, max=lm_vocab - 1)
372
+ labels = safe_labels
373
+
374
+ batch = {
375
+ "input_ids": input_ids.to(device),
376
+ "labels": labels.to(device),
377
+ "attention_mask": attn.to(device),
378
+ "pixel_values": pixel_values.to(device),
379
+ }
380
+ out = model(**batch)
381
+ print("dry_run loss:", float(out.loss))
382
+ out.loss.backward()
383
+ optim.step()
384
+ optim.zero_grad(set_to_none=True)
385
+ save_projector(model, args.output_dir)
386
+ if hasattr(model, "language_model") and hasattr(model.language_model, "save_pretrained"):
387
+ # Save adapters if present
388
+ try:
389
+ model.language_model.save_pretrained(args.output_dir)
390
+ except Exception:
391
+ pass
392
+ return 0
393
+
394
+ ds = LlavaLikeDataset(args.dataset, split="train", tokenizer=tok, max_length=args.max_length, limit=args.limit)
395
+ from torch.utils.data import DataLoader
396
+
397
+ dl = DataLoader(
398
+ ds,
399
+ batch_size=args.batch_size,
400
+ shuffle=True,
401
+ num_workers=2,
402
+ collate_fn=lambda b: collate_fn(b, args.image_size),
403
+ )
404
+
405
+ total_steps = (len(dl) * args.epochs) // max(1, args.grad_accum)
406
+ warmup_steps = max(1, int(total_steps * args.warmup_ratio))
407
+ sched = get_cosine_schedule_with_warmup(optim, warmup_steps, total_steps)
408
+
409
+ step = 0
410
+ optim.zero_grad(set_to_none=True)
411
+ for epoch in range(args.epochs):
412
+ for micro_idx, batch in enumerate(dl):
413
+ batch = {k: v.to(device) for k, v in batch.items()}
414
+
415
+ # Mixed precision on Kaggle
416
+ with torch.autocast(device_type="cuda", dtype=torch.float16, enabled=(device == "cuda")):
417
+ out = model(**batch)
418
+ loss = out.loss / max(1, args.grad_accum)
419
+
420
+ loss.backward()
421
+
422
+ if (micro_idx + 1) % args.grad_accum == 0:
423
+ torch.nn.utils.clip_grad_norm_(trainable_params, 1.0)
424
+ optim.step()
425
+ sched.step()
426
+ optim.zero_grad(set_to_none=True)
427
+ step += 1
428
+
429
+ if step % 10 == 0:
430
+ print(f"epoch={epoch} step={step}/{total_steps} loss={float(out.loss):.4f}")
431
+
432
+ if step % args.save_every == 0:
433
+ save_projector(model, args.output_dir)
434
+ # Save adapters if any
435
+ try:
436
+ model.save_pretrained(args.output_dir)
437
+ except Exception:
438
+ pass
439
+
440
+ if step >= total_steps:
441
+ break
442
+
443
+ save_projector(model, args.output_dir)
444
+ try:
445
+ model.save_pretrained(args.output_dir)
446
+ except Exception:
447
+ pass
448
+
449
+ print("DONE")
450
+ return 0
451
+
452
+
453
+ if __name__ == "__main__":
454
+ raise SystemExit(main())
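
Because the ` ASSISTANT:` label-masking search is best-effort (if the marker tokenizes differently in context, everything from position 0 gets supervised), it is worth checking one example before a long run. A sketch reusing the helpers above; it assumes the repo root is on `sys.path`, that `datasets` is installed (the module imports it at load time), and the conversation content is made up:

```python
from transformers import AutoTokenizer

from scripts.train_unsloth_kaggle import (
    build_prompt_from_conversations,
    tokenizer_image_token,
)

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B-Base", use_fast=False, trust_remote_code=True)
conv = [
    {"from": "human", "value": "<image>\nWhat is shown here?"},
    {"from": "gpt", "value": "A red bicycle leaning against a wall."},
]
full_prompt, _ = build_prompt_from_conversations(conv)
ids = tokenizer_image_token(full_prompt, tok)

# Same marker search as LlavaLikeDataset.__getitem__
marker_ids = tok(" ASSISTANT:").input_ids
start = 0
for j in range(len(ids) - len(marker_ids) + 1):
    if ids[j : j + len(marker_ids)] == marker_ids:
        start = j + len(marker_ids)

supervised = [t for t in ids[start:] if t >= 0]  # drop the -200 image placeholders
print("supervised text:", tok.decode(supervised))  # should contain only the assistant answer
```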
tests/test_smoke.py ADDED
@@ -0,0 +1,10 @@
1
+ from manthan_t1.smoke_test import TinyToyModel
2
+
3
+
4
+ def test_tiny_model_forward():
5
+ import mlx.core as mx
6
+
7
+ m = TinyToyModel(vocab_size=64, d_model=32)
8
+ x = mx.random.randint(0, 64, shape=(2, 8))
9
+ y = m(x)
10
+ assert y.shape == (2, 8, 64)
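
`tests/test_smoke.py` needs MLX, which the Kaggle/CUDA environment targeted by the training scripts typically lacks. A hedged variant that skips cleanly instead of erroring (it assumes `manthan_t1.smoke_test` should only be imported once MLX is known to be present):

```python
# tests/test_smoke.py, guarded variant (a sketch, not part of this commit)
import pytest

mx = pytest.importorskip("mlx.core")  # skip the test where Apple MLX is unavailable

from manthan_t1.smoke_test import TinyToyModel  # imported only after the skip guard


def test_tiny_model_forward():
    m = TinyToyModel(vocab_size=64, d_model=32)
    x = mx.random.randint(0, 64, shape=(2, 8))
    y = m(x)
    assert y.shape == (2, 8, 64)
```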