Libraries

How to use AbstractPhil/Qwen3.5-0.8B-json-captioner with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="AbstractPhil/Qwen3.5-0.8B-json-captioner")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
# AutoModelForImageTextToText is the class the model card names for this checkpoint
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("AbstractPhil/Qwen3.5-0.8B-json-captioner")
model = AutoModelForImageTextToText.from_pretrained("AbstractPhil/Qwen3.5-0.8B-json-captioner")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

vLLM

How to use AbstractPhil/Qwen3.5-0.8B-json-captioner with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "AbstractPhil/Qwen3.5-0.8B-json-captioner"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "AbstractPhil/Qwen3.5-0.8B-json-captioner",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker

docker model run hf.co/AbstractPhil/Qwen3.5-0.8B-json-captioner

SGLang

How to use AbstractPhil/Qwen3.5-0.8B-json-captioner with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "AbstractPhil/Qwen3.5-0.8B-json-captioner" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "AbstractPhil/Qwen3.5-0.8B-json-captioner",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "AbstractPhil/Qwen3.5-0.8B-json-captioner" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "AbstractPhil/Qwen3.5-0.8B-json-captioner",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Docker Model Runner
How to use AbstractPhil/Qwen3.5-0.8B-json-captioner with Docker Model Runner:
```
docker model run hf.co/AbstractPhil/Qwen3.5-0.8B-json-captioner
```

Qwen3.5-0.8B-json-captioner

A merged, standalone image → structured-JSON captioner: Qwen/Qwen3.5-0.8B with the task_1 caption-structuring LoRA fused into the weights. It looks at an image (or an image-synthesis prompt) and emits a grounded, literal caption as JSON via an emit_caption_schema tool call. No PEFT/adapter loading required at inference — load it like any transformers model.

This model requires the tool-call schema to be correctly aligned with the one used in training. My apologies for the earlier explanation.

What it is

Base: Qwen/Qwen3.5-0.8B — qwen3_5 architecture, ~873M params, image-text-to-text, Apache-2.0.
Adapter: AbstractPhil/qwen3.5-0.8b-task_1-lora-v2, folded in with peft's merge_and_unload().
Result: a single checkpoint with the base architecture — AutoModelForImageTextToText + AutoProcessor, no peft.

The merge was faithfulness-checked (base+LoRA logits vs. merged, in-memory and reloaded-from-disk) before upload.
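As a rough illustration of what such a check verifies, here is a minimal pure-Python sketch on a toy weight matrix. It assumes the standard LoRA formulation (output = Wx + (alpha/r) * B(Ax), merged weight W' = W + (alpha/r) * BA). The matrices and helper names are made up for illustration; this is not the author's actual check, which compared full model logits.

```python
# Toy illustration of folding a LoRA update into a base weight.
# All matrices and helper names are hypothetical.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def add_scaled(w, delta, scale):
    """Return w + scale * delta, element-wise."""
    return [[w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
            for i in range(len(w))]

r, lora_alpha = 1, 2
scaling = lora_alpha / r                   # standard LoRA scaling (rsLoRA off)

W = [[1.0, 0.0, 2.0], [0.5, -1.0, 0.0]]    # base weight, shape (out=2, in=3)
A = [[0.1, 0.2, 0.3]]                      # LoRA A, shape (r=1, in=3)
B = [[1.0], [-2.0]]                        # LoRA B, shape (out=2, r=1)
x = [[1.0], [2.0], [3.0]]                  # input column vector

# Unmerged path: base output plus the scaled low-rank correction.
unmerged = add_scaled(matmul(W, x), matmul(B, matmul(A, x)), scaling)

# Merged path: fold the update into the weight once, then apply it.
W_merged = add_scaled(W, matmul(B, A), scaling)
merged = matmul(W_merged, x)

max_abs_diff = max(abs(u[0] - m[0]) for u, m in zip(unmerged, merged))
print(max_abs_diff)  # should be zero up to floating-point rounding
```

In bf16 the two paths will not match bit-for-bit, so a real check would compare against a small tolerance rather than exact equality.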

Intended use

Turn an image into a fixed-schema caption JSON for downstream training pipelines (it was built to fill the structured-caption field of an image-caption super-dataset). It is a narrow extraction model, not a general chat or VQA model.

Training

Two-stage curriculum (`qwen_lora_train_v2.py`)

The v2 adapter was trained via a two-stage curriculum, warm-started from the v1 LoRA (AbstractPhil/qwen3.5-0.8b-task_1-lora, which was trained on the Claude gold set alone).

Stage 1 — Bulk pretraining on ~50,000 grounded rows from AbstractPhil/cc-task1-json (Qwen-generated Conceptual Captions conversions, filtered to grounded==True). High volume, ~99%-clean but 0.8B-quality. 1 epoch.

Stage 2 — Refinement on ~20,505 Claude Sonnet 4.6 gold extractions from AbstractPhil/json-coco-format, config task_1. These are higher-fidelity, more robust tool-call examples produced by the ClaudeProvider (strict prompt mode, forced emit_caption_schema tool choice, filtered to grounding_rate==1.0). 2 epochs.

The hypothesis: v1 may have been quality-capped by the small ~20K Claude set. Bulk CC data broadens the adapter's coverage, and the gold refinement stage re-anchors it to the higher-fidelity Claude examples. Three checkpoints exist for comparison: v1 (Claude only) → v2-stage1 (+ 50K CC) → v2 (CC then Claude refine).

Data format

Source captions are MS-COCO (Karpathy split). The teacher is Claude Sonnet 4.6, run in strict mode with forced emit_caption_schema tool choice and filtered to grounding_rate==1.0 (every extracted entity must trace back to the input caption). Each example is in the Qwen3.5-native tool-call shape:

messages[0] — system prompt (caption-structuring assistant)
messages[1] — user turn: the raw caption text
messages[2] — assistant turn with tool_calls[0].function.arguments:

// Input:  "A long restaurant table with rattan rounded back chairs."
// Output:
{
  "subjects": [
    {"name": "restaurant table", "attributes": ["long"]},
    {"name": "chairs", "attributes": ["rattan", "rounded back"]}
  ],
  "actions": [],
  "setting": "indoor"
}

// Input:  "a long table with a plant on top of it surrounded with wooden chairs"
// Output:
{
  "subjects": [
    {"name": "table", "attributes": ["long"]},
    {"name": "plant", "attributes": []},
    {"name": "chairs", "attributes": ["wooden"]}
  ],
  "actions": ["plant on top of table", "table surrounded with wooden chairs"],
  "setting": "indoor"
}

Note: style and mood are omitted — they are const: null in the schema (strict mode forced them null in all training examples). The meta column records model, mode, schema_valid, validator_passed, and token/cost accounting per row.

At inference time, Qwen3.5 generates in its native text format (<tool_call><function=emit_caption_schema><parameter=subjects>…</parameter>…</tool_call>), which is parsed into the dict above by parse_tool_call.
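The parse_tool_call helper itself is not reproduced in this card. As a stand-in, here is a minimal sketch of such a parser, assuming the generated text follows the `<function=NAME>` / `<parameter=KEY>VALUE</parameter>` layout quoted above. The function name `parse_tool_call_sketch` and the sample string are hypothetical, and the author's real parser may behave differently.

```python
import json
import re

# Matches one <parameter=KEY>VALUE</parameter> block, across newlines.
_PARAM_RE = re.compile(r"<parameter=([^>\s]+)>(.*?)</parameter>", re.DOTALL)

def parse_tool_call_sketch(text):
    """Return (function_name, arguments) or None if no tool call is found.

    Hypothetical helper: values that parse as JSON (lists, objects, null)
    are decoded; anything else, such as a bare enum word, is kept as a string.
    """
    name_match = re.search(r"<function=([^>\s]+)>", text)
    if name_match is None:
        return None
    arguments = {}
    for key, raw in _PARAM_RE.findall(text[name_match.end():]):
        raw = raw.strip()
        try:
            arguments[key] = json.loads(raw)
        except json.JSONDecodeError:
            arguments[key] = raw
    return name_match.group(1), arguments

# Made-up model output in the native tool-call format.
sample = (
    '<tool_call>\n<function=emit_caption_schema>\n'
    '<parameter=subjects>\n[{"name": "table", "attributes": ["long"]}]\n</parameter>\n'
    '<parameter=actions>\n[]\n</parameter>\n'
    '<parameter=setting>\nindoor\n</parameter>\n'
    '</function>\n</tool_call>'
)
name, caption = parse_tool_call_sketch(sample)
print(name, json.dumps(caption))
```

One caveat of the JSON-first decoding: a string value that happens to look like a number or `null` would be decoded as such, so a stricter parser could consult the schema's declared types instead.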

Schema reference

subjects  [SubjectValue]   max 8 items
            ├─ name        str (1–64 chars, required)
            └─ attributes  [str] (max 8, optional)
actions   [str]            max 8 items — relational phrases, not single verbs
setting   enum             "indoor" | "outdoor" | "unknown" (default: "unknown")
style     null             const null (strict mode)
mood      null             const null (strict mode)
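The constraints above can be checked on a parsed caption with a few lines of plain Python. This is a sketch, not part of the released tooling: `validate_caption` is a hypothetical helper, and it treats every top-level field as optional because the tool definition lists no required top-level keys.

```python
ALLOWED_SETTINGS = {"indoor", "outdoor", "unknown"}

def _is_str_list(value, max_items):
    """True if value is a list of at most max_items strings."""
    return (isinstance(value, list) and len(value) <= max_items
            and all(isinstance(item, str) for item in value))

def validate_caption(caption):
    """Return a list of problems; an empty list means the dict fits the schema."""
    problems = []
    subjects = caption.get("subjects", [])
    if not isinstance(subjects, list) or len(subjects) > 8:
        problems.append("subjects: expected a list of at most 8 items")
        subjects = []
    for i, subject in enumerate(subjects):
        if not isinstance(subject, dict):
            problems.append(f"subjects[{i}]: expected an object")
            continue
        name = subject.get("name")
        if not isinstance(name, str) or not 1 <= len(name) <= 64:
            problems.append(f"subjects[{i}].name: expected 1-64 characters")
        if not _is_str_list(subject.get("attributes", []), 8):
            problems.append(f"subjects[{i}].attributes: expected at most 8 strings")
    if not _is_str_list(caption.get("actions", []), 8):
        problems.append("actions: expected at most 8 strings")
    setting = caption.get("setting", "unknown")
    if not isinstance(setting, str) or setting not in ALLOWED_SETTINGS:
        problems.append("setting: expected indoor, outdoor, or unknown")
    for key in ("style", "mood"):
        if caption.get(key) is not None:
            problems.append(f"{key}: must be null in strict mode")
    return problems

good = {
    "subjects": [
        {"name": "restaurant table", "attributes": ["long"]},
        {"name": "chairs", "attributes": ["rattan", "rounded back"]},
    ],
    "actions": [],
    "setting": "indoor",
}
print(validate_caption(good))
```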

LoRA config (from `adapter_config.json`)

| parameter | value |
| --- | --- |
| rank `r` | 32 |
| `lora_alpha` | 64 |
| `lora_dropout` | 0.05 |
| `target_modules` | `q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj` |
| `bias` | none |
| `task_type` | `CAUSAL_LM` |
| rsLoRA / DoRA | off |
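To make the effect of these values concrete, the sketch below mirrors the table as a plain dict and computes the factor by which the low-rank update is scaled. The key names follow peft's `LoraConfig` fields (`use_rslora`, `use_dora`); the dict is a reconstruction from the table, not a copy of the actual `adapter_config.json`, and `lora_scaling` is a hypothetical helper.

```python
import math

# Reconstructed from the table above; not the actual adapter_config.json.
adapter_config = {
    "r": 32,
    "lora_alpha": 64,
    "lora_dropout": 0.05,
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj",
                       "gate_proj", "up_proj", "down_proj"],
    "bias": "none",
    "task_type": "CAUSAL_LM",
    "use_rslora": False,
    "use_dora": False,
}

def lora_scaling(config):
    """Scale applied to the B @ A update: alpha/r, or alpha/sqrt(r) with rsLoRA."""
    if config.get("use_rslora"):
        return config["lora_alpha"] / math.sqrt(config["r"])
    return config["lora_alpha"] / config["r"]

scaling = lora_scaling(adapter_config)
print(scaling)  # 2.0
```

With rsLoRA off, the update added to each targeted projection is therefore 2 × BA.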

Training hyperparameters (from `qwen_lora_train_v2.py`)

| parameter | value |
| --- | --- |
| trainer | `transformers.Trainer` |
| optimizer | AdamW (default) |
| LR (both stages) | `1e-4` (below v1's `2e-4`, since this continues a trained adapter) |
| LR schedule | cosine with 3% warmup |
| batch size | 16 |
| gradient accumulation | 1 (effective batch = 16) |
| precision | bf16 |
| max sequence length | 2048 |
| label masking | `-100` over system+user prefix; loss on assistant tokens only |
| seed | 42 |
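The label-masking row can be illustrated with a short sketch. It assumes the prefix length (system + user tokens, including the generation prompt) is already known; how `qwen_lora_train_v2.py` computes it is not shown here, and `mask_prefix_labels` and the token ids are hypothetical.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def mask_prefix_labels(input_ids, prefix_len):
    """Copy input_ids as labels, hiding the first prefix_len tokens from the loss."""
    if not 0 <= prefix_len <= len(input_ids):
        raise ValueError("prefix_len must lie within the sequence")
    return [IGNORE_INDEX] * prefix_len + list(input_ids[prefix_len:])

# Made-up token ids: 4 prefix tokens (system + user), 3 assistant tokens.
input_ids = [101, 102, 103, 104, 201, 202, 203]
labels = mask_prefix_labels(input_ids, prefix_len=4)
print(labels)  # [-100, -100, -100, -100, 201, 202, 203]
```

Positions labelled `-100` contribute nothing to the cross-entropy loss, so gradients come only from the assistant's tool-call tokens.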

The adapter modifies only the language-model projections; the base's vision encoder is untouched. That is why, although training was text-only, the merged model also does image → JSON at inference: feed an image and the vision-conditioned generation inherits the same tool-call structuring behavior.

Important: the task scaffold is not baked into the weights

The system prompt and the tools definition the LoRA was trained against live in the dataset AbstractPhil/json-coco-format (config task_1), not in the model. For the structured output this model is tuned for, apply that same system prompt + tools at inference (shown below). Without them the model still runs, but you lose the schema grounding.

Tool definition

[
  {
    "type": "function",
    "function": {
      "name": "emit_caption_schema",
      "description": "Emit the structured caption representation. The parameters follow the qwen-test-runner slot registry.",
      "parameters": {
        "$defs": {
          "SubjectValue": {
            "description": "A single entity in the caption.",
            "properties": {
              "name": { "maxLength": 64, "minLength": 1, "title": "Name", "type": "string" },
              "attributes": { "items": { "type": "string" }, "maxItems": 8, "title": "Attributes", "type": "array" }
            },
            "required": ["name"],
            "title": "SubjectValue",
            "type": "object"
          }
        },
        "properties": {
          "subjects": { "items": { "$ref": "#/$defs/SubjectValue" }, "maxItems": 8, "title": "Subjects", "type": "array" },
          "actions": { "items": { "type": "string" }, "maxItems": 8, "title": "Actions", "type": "array" },
          "setting": { "default": "unknown", "enum": ["indoor", "outdoor", "unknown"], "title": "Setting", "type": "string" },
          "style": { "anyOf": [{ "maxLength": 64, "type": "string" }, { "type": "null" }], "default": null, "title": "Style", "const": null },
          "mood": { "anyOf": [{ "maxLength": 64, "type": "string" }, { "type": "null" }], "default": null, "title": "Mood", "const": null }
        },
        "title": "Caption",
        "type": "object"
      }
    }
  }
]

Usage

import json, torch
from PIL import Image
from huggingface_hub import hf_hub_download
from transformers import AutoProcessor, AutoModelForImageTextToText

REPO = "AbstractPhil/Qwen3.5-0.8B-json-captioner"

processor = AutoProcessor.from_pretrained(REPO)
model = AutoModelForImageTextToText.from_pretrained(
    REPO, dtype=torch.bfloat16, device_map="cuda").eval()
processor.tokenizer.padding_side = "left"
if processor.tokenizer.pad_token_id is None:
    processor.tokenizer.pad_token_id = processor.tokenizer.eos_token_id

# Task scaffold (system prompt + tools). Read the JSONL directly: the dataset card
# declares a 'Json' feature type that datasets>=4.0 rejects, so load_dataset() fails
# ("Feature type 'Json' not found") — hf_hub_download + json.loads(first line) is robust.
_p = hf_hub_download("AbstractPhil/json-coco-format", "data/task_1.jsonl", repo_type="dataset")
with open(_p, encoding="utf-8") as f:
    scaffold = json.loads(f.readline())
SYSTEM_PROMPT = scaffold["messages"][0]["content"]
TOOLS         = scaffold["tools"]

image = Image.open("example.jpg").convert("RGB")
messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": [
        {"type": "image", "image": image},
        {"type": "text",  "text": "Extract the structured representation of what this image shows."},
    ]},
]
inputs = processor.apply_chat_template(
    messages, tools=TOOLS, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt", enable_thinking=False).to(model.device)

out = model.generate(
    **inputs, max_new_tokens=768, do_sample=False,
    pad_token_id=processor.tokenizer.pad_token_id,
    stop_strings=["</tool_call>"], tokenizer=processor.tokenizer)

text = processor.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(text)   # -> <tool_call><function=emit_caption_schema><parameter=...>...</tool_call>

The continuation is a Qwen tool call; parse the <function=...><parameter=...> block into a dict to get the caption JSON. Text-only input (an image-synthesis prompt instead of an image) works too — pass the prompt as the user text and drop the image content block.
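For the text-only case, the only change is in how the message list is built. A minimal sketch, assuming the same message layout as the usage code above; `build_messages` and the example prompt are hypothetical:

```python
def build_messages(system_prompt, user_text, image=None):
    """Build the chat messages; the image block is included only when given."""
    content = []
    if image is not None:
        content.append({"type": "image", "image": image})
    content.append({"type": "text", "text": user_text})
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": content},
    ]

# Text-only: an image-synthesis prompt as the user text, no image block.
messages = build_messages(
    "<system prompt from the dataset scaffold>",
    "a red bicycle leaning against a brick wall",
)
print(messages[1]["content"])
```

The resulting list goes through `processor.apply_chat_template(..., tools=TOOLS, ...)` and `model.generate(...)` exactly as in the image case.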

Notes

Precision: bfloat16 is recommended (the merge was done in bf16).
Attention backend: sdpa is correct on Blackwell (sm_120) and Turing (sm_75), where flash-attn kernels don't run. On Ampere/Ada/Hopper (sm_80/86/89/90) you can pass attn_implementation="flash_attention_2" if flash-attn is installed, for a faster prefill.
Decoding: deterministic (do_sample=False) with stop_strings=["</tool_call>"] to halt once the tool call closes.
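The attention-backend note can be turned into a small selection helper. This is a sketch that encodes only the compute capabilities named above; `pick_attn_implementation` is a hypothetical function, and GPUs not listed simply fall back to sdpa.

```python
# Compute capabilities the note lists as suitable for flash-attn:
# Ampere (8.0, 8.6), Ada (8.9), Hopper (9.0).
FLASH_ATTN_CAPABILITIES = {(8, 0), (8, 6), (8, 9), (9, 0)}

def pick_attn_implementation(capability, flash_attn_installed):
    """Return the attn_implementation string for a (major, minor) capability."""
    if flash_attn_installed and tuple(capability) in FLASH_ATTN_CAPABILITIES:
        return "flash_attention_2"
    return "sdpa"

print(pick_attn_implementation((12, 0), True))   # Blackwell -> sdpa
print(pick_attn_implementation((7, 5), True))    # Turing    -> sdpa
print(pick_attn_implementation((8, 9), True))    # Ada       -> flash_attention_2
print(pick_attn_implementation((8, 9), False))   # no flash-attn -> sdpa
```

On a real machine the inputs could come from `torch.cuda.get_device_capability()` and `importlib.util.find_spec("flash_attn") is not None`, with the result passed as `attn_implementation=` to `from_pretrained`.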

Provenance

Produced by merging the LoRA into the base via merge_and_unload(safe_merge=True), then save_pretrained (weights + config) and processor.save_pretrained (image processor + tokenizer + chat template). Qwen/Qwen3.5-0.8B is a standard transformers architecture, so the repo is self-contained — no custom remote code.

License

Model weights: Apache-2.0, inherited from Qwen/Qwen3.5-0.8B.
Training data: AbstractPhil/json-coco-format is CC-BY-4.0. Source captions are MS-COCO (Karpathy split).

Limitations

Small (0.8B): extraction quality is bounded by the task_1 LoRA's training; it is not a general-purpose captioner or chat model.
Image → JSON is a transfer capability. The adapter was trained on text caption → JSON, so image grounding rides on the base VLM's vision encoder plus the LoRA's structuring behavior — it was not directly trained on image inputs. Expect text → JSON to be its strongest mode.
The output schema is fixed by the emit_caption_schema tool — subjects (structured {name, attributes} objects), actions, setting (3-way enum), with style/mood always null. Anything outside that schema is out of scope.
Tuned toward grounded, literal extraction; it is not designed for creative or interpretive captions.
