Instructions to use prithivMLmods/Inkscope-Captions-2B-0526 with libraries and local apps.

Libraries

How to use prithivMLmods/Inkscope-Captions-2B-0526 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="prithivMLmods/Inkscope-Captions-2B-0526")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
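from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("prithivMLmods/Inkscope-Captions-2B-0526")
# AutoModelForImageTextToText loads the model with its generation head;
# plain AutoModel may load only the bare backbone, which cannot call generate().
model = AutoModelForImageTextToText.from_pretrained("prithivMLmods/Inkscope-Captions-2B-0526", device_map="auto")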
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use prithivMLmods/Inkscope-Captions-2B-0526 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "prithivMLmods/Inkscope-Captions-2B-0526"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "prithivMLmods/Inkscope-Captions-2B-0526",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
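
The same request can also be sent from Python. The following is a minimal sketch using only the standard library; the helper names `build_caption_request` and `caption_image` are hypothetical, and the code assumes the server started above is reachable at `http://localhost:8000` and returns the usual OpenAI-style `choices[0].message.content` field.

```python
import json
import urllib.request

MODEL_ID = "prithivMLmods/Inkscope-Captions-2B-0526"


def build_caption_request(image_url, prompt="Describe this image in one sentence.", model=MODEL_ID):
    """Build the JSON body for /v1/chat/completions (same shape as the curl call)."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }


def caption_image(image_url, base_url="http://localhost:8000", timeout=60):
    """POST the request to a running server and return the generated text.

    Assumption: the response follows the OpenAI chat-completions schema.
    """
    body = json.dumps(build_caption_request(image_url)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        reply = json.load(response)
    return reply["choices"][0]["message"]["content"]
```

With a server running, `print(caption_image("<image URL>"))` should print the caption. For an SGLang server on port 30000, pass `base_url="http://localhost:30000"`.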

Use Docker

docker model run hf.co/prithivMLmods/Inkscope-Captions-2B-0526

SGLang

How to use prithivMLmods/Inkscope-Captions-2B-0526 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "prithivMLmods/Inkscope-Captions-2B-0526" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "prithivMLmods/Inkscope-Captions-2B-0526",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "prithivMLmods/Inkscope-Captions-2B-0526" \
        --host 0.0.0.0 \
        --port 30000
# Call the server with the same curl request shown for the pip-based SGLang setup above (port 30000).

Docker Model Runner
How to use prithivMLmods/Inkscope-Captions-2B-0526 with Docker Model Runner:
```
docker model run hf.co/prithivMLmods/Inkscope-Captions-2B-0526
```
Browse Quantizations to use this model in llama.cpp, Ollama, LM Studio, or any compatible app.

Inkscope-Captions-2B-0526

The Inkscope-Captions-2B-0526 model is a fine-tuned version of Qwen2-VL-2B-Instruct, optimized for image captioning, vision-language understanding, and English-language caption generation. This model was fine-tuned on the conceptual-captions-cc12m-llavanext dataset (first 30k entries) to generate detailed, high-quality captions for images, including complex or abstract scenes.

Colab Demo: https://huggingface.co/prithivMLmods/Inkscope-Captions-2B-0526/blob/main/Inkscope%20Captions%202B%200526%20Demo/Inkscope-Captions-2B-0526.ipynb

Video Understanding Demo: https://huggingface.co/prithivMLmods/Inkscope-Captions-2B-0526/blob/main/Inkscope-Captions-2B-0526-Video-Understanding/Inkscope-Captions-2B-0526-Video-Understanding.ipynb

Key Enhancements:

- High-Quality Visual Captioning: Generates rich, descriptive captions for diverse visual inputs, including abstract, real-world, and complex images.
- Fine-Tuned on a CC12M Subset: Trained on the first 30k entries of the Conceptual Captions 12M (CC12M) dataset in LLaVA-Next formatting, which aligns the model with instruction-style captioning.
- Multimodal Understanding: Handles combined text and image input, which suits caption generation, scene understanding, and instruction-based vision-language tasks.
- Multilingual Recognition: Although the model focuses on English captioning, it can recognize text in various languages when that text appears in the image.
- Strong Foundation Model: Built on Qwen2-VL-2B-Instruct, which provides visual-linguistic reasoning, OCR capability, and flexible prompt handling.

How to Use

# Requires: pip install qwen-vl-utils
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the fine-tuned model
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "prithivMLmods/Inkscope-Captions-2B-0526", torch_dtype="auto", device_map="auto"
)

# Load processor
processor = AutoProcessor.from_pretrained("prithivMLmods/Inkscope-Captions-2B-0526")

# Sample input message with an image
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Generate a detailed caption for this image."},
        ],
    }
]

# Preprocess input
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)  # follow the device chosen by device_map="auto" instead of hard-coding "cuda"

# Generate output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
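
To caption your own images, the `messages` list above can be built by a small helper. This is a sketch only: `build_caption_messages` is a hypothetical name, and it assumes `process_vision_info` accepts `file://` URIs for local images, as the Qwen2-VL examples do.

```python
from pathlib import Path

DEFAULT_PROMPT = "Generate a detailed caption for this image."


def build_caption_messages(image, prompt=DEFAULT_PROMPT):
    """Return a single-turn chat message list for one image.

    `image` may be an http(s) URL, a file:// URI, a data: URI, or a local path.
    Local paths are converted to absolute file:// URIs (assumption: the vision
    preprocessing step understands that scheme).
    """
    image = str(image)
    if not image.startswith(("http://", "https://", "file://", "data:")):
        image = Path(image).expanduser().resolve().as_uri()
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image},
                {"type": "text", "text": prompt},
            ],
        }
    ]
```

The returned list can be passed to `processor.apply_chat_template` and `process_vision_info` in place of the hard-coded `messages` in the example above.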

Buffering Output (Optional for streaming inference)

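# Assumption: `streamer` is an iterable of text chunks, e.g. a transformers
# TextIteratorStreamer passed to model.generate(...) running in a background thread.
def stream_buffer(streamer):
    buffer = ""
    for new_text in streamer:
        buffer += new_text
        # Strip the end-of-turn marker from the accumulated text
        buffer = buffer.replace("<|im_end|>", "")
        yield buffer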
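
The behaviour of the buffering loop above can be checked without loading the model. The sketch below feeds it a hand-written list of chunks in place of a real streamer; `accumulate` and `fake_stream` are hypothetical names used only for illustration.

```python
def accumulate(chunks, end_marker="<|im_end|>"):
    """Yield the growing caption after each chunk, with the end marker removed."""
    buffer = ""
    for new_text in chunks:
        buffer += new_text
        # Replacing on the whole buffer also catches a marker split across chunks
        buffer = buffer.replace(end_marker, "")
        yield buffer


# Stand-in for a real streamer: any iterable of strings works
fake_stream = ["A red ", "bicycle", " leaning on a wall.", "<|im_end|>"]
for partial in accumulate(fake_stream):
    print(partial)
```

Each iteration yields the full text so far rather than only the new chunk, which suits UIs that re-render the whole output on every update.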

Demo Inference

Video Inference

Key Features

Caption Generation from Images:
- Transforms visual scenes into detailed, human-like descriptions.
Conceptual Reasoning:
- Captures abstract or high-level elements from images, including emotion, action, or scene context.
Multi-modal Prompting:
- Accepts both image and text input for instruction-tuned caption generation.
Flexible Output Format:
- Generates output in natural language, ideal for storytelling, accessibility tools, and educational applications.
Instruction-Tuned:
- Fine-tuned with LLaVA-Next style prompts, making it suitable for interactive use and vision-language agents.

Intended Use

Inkscope-Captions-2B-0526 is designed for the following applications:

- Image Captioning for web-scale datasets, social media analysis, and generative applications.
- Accessibility Tools: Helping visually impaired users understand image content through text.
- Content Tagging and Metadata Generation for media, digital assets, and educational material.
- AI Companions and Tutors that need to explain or describe visuals in a conversational setting.
- Instruction-following Vision-Language Tasks, such as zero-shot VQA, scene description, and multimodal storytelling.