lkeab committed on
Commit 0cb9ad5 · verified · 1 Parent(s): ad0c1db

Sync Space app to tencent/Penguin-VL

.gitattributes CHANGED
@@ -33,3 +33,12 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ assets/inputs/2b_table_result.png filter=lfs diff=lfs merge=lfs -text
+ assets/inputs/chart_understanding.png filter=lfs diff=lfs merge=lfs -text
+ assets/inputs/desert.jpg filter=lfs diff=lfs merge=lfs -text
+ assets/inputs/horse_poet.png filter=lfs diff=lfs merge=lfs -text
+ assets/inputs/leetcode.png filter=lfs diff=lfs merge=lfs -text
+ assets/inputs/newspaper.png filter=lfs diff=lfs merge=lfs -text
+ assets/inputs/polar_bear.mp4 filter=lfs diff=lfs merge=lfs -text
+ assets/inputs/sora.png filter=lfs diff=lfs merge=lfs -text
+ assets/inputs/video-example.mp4 filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -1,15 +1,300 @@
  ---
- title: Penguin VL
- emoji:
- colorFrom: yellow
- colorTo: gray
- sdk: gradio
- sdk_version: 6.9.0
- python_version: '3.12'
- app_file: app.py
- pinned: false
- license: apache-2.0
- short_description: Penguin-VL
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ <p align="center">
+ <img src="assets/logo.png" width="150" style="margin-bottom: 0.2;"/>
+ </p>
+
+ <h3 align="center">Penguin-VL: Exploring the Efficiency Limits of VLM with LLM-based Vision Encoders</h3>
+
+ <h5 align="center">
+
+ [![Hugging Face](https://img.shields.io/badge/🤗-2B_Model-F6C343.svg)](https://huggingface.co/tencent/Penguin-VL-2B)
+ [![Hugging Face](https://img.shields.io/badge/🤗-8B_Model-F6C343.svg)](https://huggingface.co/tencent/Penguin-VL-8B)
+ [![Hugging Face](https://img.shields.io/badge/🤗-Encoder-F6C343.svg)](https://huggingface.co/tencent/Penguin-Encoder) <br>
+ [![Hugging Face](https://img.shields.io/badge/🤗-Demo-F6C343.svg)](https://huggingface.co/spaces/lkeab/Penguin-VL-8B)
+ [![hf_paper](https://img.shields.io/badge/🤗-Paper%20In%20HF-8B5CF6.svg)](https://huggingface.co/papers/xxx.xxxx)
+ [![arXiv](https://img.shields.io/badge/Arxiv-xxx.xxxx-B91C1C.svg?logo=arXiv)](https://arxiv.org/abs/xxx.xxxx)
+ </h5>
+
+ ---
+
+ ## 📰 News
+
+ * **[2025.03]** Released inference code, the vLLM plugin, and a Gradio demo for Penguin-VL.
+ * **[2025.03]** Released [Penguin-VL-2B](https://huggingface.co/tencent/Penguin-VL-2B), [Penguin-VL-8B](https://huggingface.co/tencent/Penguin-VL-8B), and the [Penguin Vision Encoder](https://huggingface.co/tencent/Penguin-Encoder) on Hugging Face.
+
+ ---
+
+ ## ✨ Overview
+
+ **Penguin-VL** is a compact vision-language model family built to study how far multimodal efficiency can be pushed by redesigning the **vision encoder**, rather than only scaling data or model size.
+
+ Most modern VLMs rely on vision encoders pretrained with large-scale **contrastive objectives** such as CLIP or SigLIP. Penguin-VL argues that this setup can be suboptimal for multimodal reasoning, because contrastive learning favors coarse category-level invariances over the fine-grained signals needed for **OCR, document understanding, dense captioning, and complex reasoning**. Instead, Penguin-VL introduces **Penguin-Encoder**, a vision encoder **initialized from a text-only LLM**, so the visual backbone starts closer to the language model's representation space and learns more data-efficiently.
+
+ <p align="center">
+ <img src="assets/framework.png" alt="Penguin-VL framework overview" width="920"/>
+ </p>
+ <p align="center">
+ <em>Framework overview of Penguin-VL: an LLM-initialized vision encoder, mixed-supervision pretraining, and efficient video token compression.</em>
+ </p>
+
+ ### Highlights
+
+ - **LLM → Vision Encoder initialization (Penguin-Encoder)**
+   Initialize the vision encoder from a text-only LLM (e.g., Qwen3-0.6B), convert causal attention to **bidirectional attention**, and add **2D-RoPE** for variable-resolution vision tokens.
+
+ - **Mixed-supervision encoder pretraining**
+   Warm up the LLM-initialized encoder with a reconstruction/distillation objective (amplitude / direction / relation losses) to inject visual knowledge stably, then switch to high-resolution alignment.
+
+ - **Video efficiency via Temporal Redundancy-Aware (TRA) token compression**
+   Dynamically allocate token budgets across **key frames vs. intermediate frames** under a global token budget to scale to long videos more efficiently (see the sketch after this list).
+
+ - **Unified training recipe**
+   A low-to-high resolution curriculum plus an instruction-tuning strategy that balances image and video capabilities at compact scale.
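+
+ The exact TRA rule is not spelled out in this README, so the following is only a minimal Python sketch of the *idea* under stated assumptions: key frames keep a full per-frame token budget, intermediate frames get a budget scaled by a crude redundancy signal, and everything is rescaled to fit a global budget. The function name, stride, and budget constants are all hypothetical.
+
+ ```python
+ import numpy as np
+
+ def allocate_tra_budget(frames, total_budget, full_tokens=256, min_tokens=16, key_stride=8):
+     """Hypothetical sketch of temporal redundancy-aware token budgeting."""
+     budgets, last_key = [], None
+     for i, frame in enumerate(frames):
+         if i % key_stride == 0:
+             budgets.append(full_tokens)  # key frames keep the full budget
+             last_key = frame
+         else:
+             # Mean absolute pixel difference as a crude redundancy signal:
+             # the more a frame repeats the last key frame, the fewer tokens it gets.
+             diff = np.abs(frame.astype(np.float32) - last_key.astype(np.float32)).mean() / 255.0
+             budgets.append(int(min_tokens + diff * (full_tokens - min_tokens)))
+     # Rescale so the sum respects the global token budget.
+     scale = min(1.0, total_budget / max(1, sum(budgets)))
+     return [max(min_tokens, int(b * scale)) for b in budgets]
+ ```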
+
+ ---
+
+ ## 📈 Results
+
+ Penguin-VL-2B delivers a strong accuracy-efficiency tradeoff across image and video benchmarks, with especially solid gains on OCR-heavy and reasoning-heavy tasks where fine-grained visual understanding matters most.
+
+ <p align="center">
+ <img src="assets/2b_results.png" alt="Penguin-VL-2B benchmark results" width="980"/>
+ </p>
+ <p align="center">
+ <em>Benchmark snapshot for Penguin-VL-2B across image and video evaluation suites.</em>
+ </p>
+
+ The released checkpoints and encoder weights are listed below.
+
+ ---
+
+ ## 📦 Model Zoo
+
+ | Model | Hugging Face |
+ | :---- | :----------- |
+ | **Penguin-VL-2B** | [tencent/Penguin-VL-2B](https://huggingface.co/tencent/Penguin-VL-2B) |
+ | **Penguin-VL-8B** | [tencent/Penguin-VL-8B](https://huggingface.co/tencent/Penguin-VL-8B) |
+ | **Penguin Vision Encoder** | [tencent/Penguin-Encoder](https://huggingface.co/tencent/Penguin-Encoder) |
+
+ ---
+
+ ## 🛠️ Environment Setup
+
+ ### Requirements
+
+ - **Python** = 3.11.13 (recommended)
+ - **PyTorch** ≥ 2.5 (CUDA 12.4 recommended)
+ - **CUDA** ≥ 11.8
+
+ ### Installation
+
+ ```bash
+ # Clone the repository
+ git clone <repo_url>
+ cd <repo_name>
+
+ # Recommended: create and activate a clean conda environment
+ conda create -n PenguinVL python=3.11.13 -y
+ conda activate PenguinVL
+
+ # Install ffmpeg if you don't have it on your system
+ conda install ffmpeg -y  # optional
+
+ # Install dependencies (inference + Gradio demo)
+ pip install -r requirements.txt
+
+ # Install FlashAttention (recommended for faster inference)
+ pip install flash-attn==2.8.3 --no-build-isolation
+ ```
+
+ ### Version Notes
+
+ | Use Case | Recommended |
+ | :------- | :---------- |
+ | **Transformers inference** | `transformers==4.51.3` |
+ | **vLLM inference** | Install vLLM separately (see [§ vLLM Inference](#vllm-inference)) |
+
+ ---
+
+ ## 🤖 Inference (Transformers)
+
+ Use Hugging Face `AutoModelForCausalLM` and `AutoProcessor` for image, video, and text inputs.
+
+ ```bash
+ python inference/example_penguinvl.py
+ ```
+
+ You can pass a custom `--model-path` argument to the script (default: `tencent/Penguin-VL-8B`). Supported input formats (see the sketch after this list):
+
+ - **Video:** `type: "video"` with `video_path`, `fps`, `max_frames`
+ - **Image:** `type: "image"` with `image_path`
+ - **Mixed:** image + video + text in one conversation
+ - **Text-only:** plain text dialogue
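+
+ The script itself is not reproduced here, but the following minimal sketch mirrors the processor call used by this repo's demo backend (`inference/server/direct_client.py`); treat the exact prompt and decoding details as illustrative rather than canonical:
+
+ ```python
+ import torch
+ from transformers import AutoModelForCausalLM, AutoProcessor
+
+ model_path = "tencent/Penguin-VL-8B"
+ model = AutoModelForCausalLM.from_pretrained(
+     model_path, trust_remote_code=True, torch_dtype=torch.bfloat16, device_map="cuda:0"
+ )
+ processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
+
+ # Mixed image + text turn; a video turn would instead use
+ # {"type": "video", "video": {"video_path": "...", "fps": 1, "max_frames": 180}}.
+ conversation = [
+     {"role": "user", "content": [
+         {"type": "image", "image": {"image_path": "assets/inputs/horse_poet.png"}},
+         {"type": "text", "text": "Describe this image."},
+     ]},
+ ]
+ inputs = processor(
+     conversation=conversation,
+     add_system_prompt=True,
+     add_generation_prompt=True,
+     return_tensors="pt",
+ )
+ inputs = {k: v.to("cuda:0") if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}
+ if "pixel_values" in inputs:
+     inputs["pixel_values"] = inputs["pixel_values"].to(torch.bfloat16)
+
+ output_ids = model.generate(**inputs, max_new_tokens=512)
+ # Strip the prompt tokens before decoding (assumes the processor returns `input_ids`).
+ new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
+ print(processor.tokenizer.decode(new_tokens, skip_special_tokens=True))
+ ```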
+
+ ---
+
+ ## 📓 Cookbook
+
+ Check out the inference notebook for a GitHub-friendly walkthrough of Penguin-VL across diverse tasks.
+ Unlike a multi-notebook cookbook, Penguin-VL currently provides **one consolidated notebook** that covers multiple representative examples in a single place.
+
+ | Notebook | Description |
+ | :------- | :---------- |
+ | [Inference Recipes](inference/notebooks/01_penguinvl_inference_recipes.public.ipynb) | Demonstrations of Penguin-VL for **visual code generation**, **OCR/document parsing**, **creative image understanding**, **table extraction**, **multi-round chart analysis**, **multi-round video understanding**, **mixed video+image prompting**, and a **text-only baseline**. |
+
+ If you want to re-execute the notebook locally and regenerate the GitHub-previewable output:
+
+ ```bash
+ export PENGUIN_VL_MODEL_PATH=tencent/Penguin-VL-8B
+
+ jupyter nbconvert \
+     --to notebook \
+     --execute \
+     --output 01_penguinvl_inference_recipes.public.ipynb \
+     --ExecutePreprocessor.timeout=-1 \
+     --ExecutePreprocessor.kernel_name=penguinvl \
+     inference/notebooks/01_penguinvl_inference_recipes.source.ipynb
+ ```
+
+ The clean source notebook lives at [inference/notebooks/01_penguinvl_inference_recipes.source.ipynb](inference/notebooks/01_penguinvl_inference_recipes.source.ipynb).
+
  ---
+
+ ## vLLM Inference
+
+ > Installing **vLLM 0.11.0** requires **PyTorch 2.8** and a matching build of **Flash Attention**. This setup may differ from the default Transformers inference environment (which recommends PyTorch ≥ 2.5), so you may need a separate environment or upgraded dependencies to avoid version conflicts.
+
+ ### Environment
+
+ - The vLLM plugin targets **vLLM 0.11.0** (`penguinvl/plugin/vllm/v0_11_0/`).
+ - vLLM is not in `requirements.txt` by default; install it separately:
+
+ ```bash
+ pip install vllm==0.11.0
+ ```
+
+ **Troubleshooting:** If you see `cannot find -lcuda` during the flashinfer build:
+
+ ```bash
+ export LIBRARY_PATH=/usr/lib/x86_64-linux-gnu:$LIBRARY_PATH
+ # or /usr/local/cuda/lib64 depending on your CUDA install
+ ```
+
+ ### Start vLLM Server
+
+ ```bash
+ # Single GPU
+ python -m penguinvl.plugin.vllm serve tencent/Penguin-VL-8B
+
+ # Multi-GPU (e.g. 8B on 2 GPUs)
+ python -m penguinvl.plugin.vllm serve tencent/Penguin-VL-8B --port 8000 --tensor-parallel-size 2
+ ```
+
+ Additional options: `--host`, `--max-model-len`, etc. (see the vLLM 0.11 `serve` docs). Once the server is up, you can query it as shown below.
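+
+ vLLM's `serve` command normally exposes an OpenAI-compatible HTTP API. Assuming the Penguin-VL plugin preserves that behavior, a query against the server started above could look like the sketch below (the endpoint path, dummy API key, and multimodal content schema are assumptions, not something this README specifies):
+
+ ```python
+ from openai import OpenAI
+
+ # vLLM ignores the API key by default; any placeholder works.
+ client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
+
+ response = client.chat.completions.create(
+     model="tencent/Penguin-VL-8B",
+     messages=[
+         {"role": "user", "content": [
+             {"type": "image_url", "image_url": {"url": "https://example.com/some_image.png"}},
+             {"type": "text", "text": "What is in this image?"},
+         ]},
+     ],
+     max_tokens=256,
+ )
+ print(response.choices[0].message.content)
+ ```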
+
+ ### vLLM Demo Script
+
+ Run text, image, video, and batch demos:
+
+ ```bash
+ # All demos (single GPU)
+ CUDA_VISIBLE_DEVICES=0 python inference/test_vllm_infer.py --model-path tencent/Penguin-VL-8B
+
+ # Text-only
+ CUDA_VISIBLE_DEVICES=0 python inference/test_vllm_infer.py --model-path tencent/Penguin-VL-8B --demo text
+
+ # Image (requires --image-path)
+ CUDA_VISIBLE_DEVICES=0 python inference/test_vllm_infer.py --model-path tencent/Penguin-VL-8B --demo image --image-path assets/inputs/horse_poet.png
+
+ # Video
+ CUDA_VISIBLE_DEVICES=0 python inference/test_vllm_infer.py --model-path tencent/Penguin-VL-8B --demo video --video-path assets/inputs/polar_bear.mp4
+
+ # 8B with tensor parallelism (2 GPUs)
+ CUDA_VISIBLE_DEVICES=0,1 python inference/test_vllm_infer.py --model-path tencent/Penguin-VL-8B --tensor-parallel-size 2
+ ```
+
+ | Argument | Description |
+ | :------- | :---------- |
+ | `--model-path` | Hugging Face model name or local path |
+ | `--demo` | `text` \| `image` \| `batch` \| `video` \| `all` |
+ | `--tensor-parallel-size` | Number of GPUs for tensor parallelism |
+ | `--max-new-tokens` | Max tokens to generate |
+ | `--max-model-len` | Max context length |
+ | `--gpu-memory-utilization` | GPU memory fraction (0–1) |
+
+ ---
+
+ ## 🤗 Gradio Demo (Local UI)
+
+ Launch a local web UI with image/video upload and chat.
+
+ ### Quick Start
+
+ ```bash
+ python inference/launch_gradio_demo.py --model-path tencent/Penguin-VL-8B
+ ```
+
+ Then open **http://localhost:33666** (or your machine's IP and port) in a browser.
+
+ ### Options
+
+ | Option | Description | Default |
+ | :----- | :---------- | :------ |
+ | `--model-path` | Model path or Hugging Face ID | *required* |
+ | `--server-port` | Backend inference server port | 16667 |
+ | `--interface-port` | Gradio web UI port | 33666 |
+ | `--nproc` | Number of backend worker processes | 1 |
+
+ ### Examples
+
+ ```bash
+ # 2B model, default ports
+ python inference/launch_gradio_demo.py --model-path tencent/Penguin-VL-2B
+
+ # 8B model, custom UI port
+ python inference/launch_gradio_demo.py --model-path tencent/Penguin-VL-8B --interface-port 8080
+
+ # Multi-worker backend
+ python inference/launch_gradio_demo.py --model-path tencent/Penguin-VL-8B --nproc 4
+ ```
+
+ ---
+
+ ## 📁 Project Structure
+
+ ```text
+ .
+ ├── penguinvl/                     # Core model and processor code
+ │   ├── plugin/vllm/               # vLLM plugin (v0_11_0)
+ │   └── ...
+ ├── inference/
+ │   ├── example_penguinvl.py       # Transformers inference example
+ │   ├── test_vllm_infer.py         # vLLM inference demo
+ │   ├── launch_gradio_demo.py      # Gradio local demo
+ │   ├── notebooks/                 # Executed and source Jupyter notebooks
+ │   ├── server/                    # Backend for Gradio
+ │   ├── interface/                 # Gradio UI
+ │   └── transformers_api/          # Transformers model/processor wrappers
+ ├── assets/
+ │   ├── framework.png              # README framework figure
+ │   ├── 2b_results.png             # README benchmark figure
+ │   └── inputs/                    # Demo images and videos
+ └── requirements.txt
+ ```
+
+ ---
+
+ ## 📄 License
+
+ This project is released under the [Apache 2.0 License](LICENSE).
+
+ ## 📚 Citation
+
+ If you use Penguin-VL in your research, please cite:
+
+ ```bibtex
+ ...
+ ```
+
  ---

+ If you find this project useful, please consider giving it a ⭐ on GitHub. Issues and PRs are welcome.
app.py ADDED
@@ -0,0 +1,23 @@
+ import os
+
+ from inference.interface import PenguinVLQwen3GradioInterface
+ from inference.server import PenguinVLQwen3DirectClient
+ from inference.server.direct_client import ensure_flash_attn_installed
+
+
+ def main():
+     # Space entry point: in-process (ZeroGPU) client plus the Gradio UI.
+     ensure_flash_attn_installed()
+     model_client = PenguinVLQwen3DirectClient(
+         model_path=os.getenv("MODEL_PATH", "tencent/Penguin-VL-8B"),
+     )
+     interface = PenguinVLQwen3GradioInterface(
+         model_client,
+         example_dir=os.getenv("EXAMPLE_DIR", "./assets/inputs"),
+         server_name=os.getenv("GRADIO_SERVER_NAME", "0.0.0.0"),
+         server_port=int(os.getenv("PORT", "7860")),
+     )
+     interface.launch()
+
+
+ if __name__ == "__main__":
+     main()
assets/inputs/2b_table_result.png ADDED

Git LFS Details

  • SHA256: beae8b770010f24eb47ca0cd5f4e76ec0c939e3373140fc20affee5f96f0fdb7
  • Pointer size: 131 Bytes
  • Size of remote file: 223 kB
assets/inputs/chart_understanding.png ADDED

Git LFS Details

  • SHA256: ce6c8f15eefc924f2b6c284c06e50d61663fe6215888c57cfc90ac713eceb7c2
  • Pointer size: 131 Bytes
  • Size of remote file: 191 kB
assets/inputs/desert.jpg ADDED

Git LFS Details

  • SHA256: d1f8133a0910fe8ccb40a410bcdced33a667427bb6e1f6c6de6348043af515d4
  • Pointer size: 131 Bytes
  • Size of remote file: 881 kB
assets/inputs/horse_poet.png ADDED

Git LFS Details

  • SHA256: 96a033041b3f56b873af1606808c6e7bc2ae6863110db334805397658b1fe124
  • Pointer size: 132 Bytes
  • Size of remote file: 1.41 MB
assets/inputs/leetcode.png ADDED

Git LFS Details

  • SHA256: eb64e7664af9ab5a3e984ced8d8f129362366dbd1f740cb272a1210b883930d1
  • Pointer size: 131 Bytes
  • Size of remote file: 211 kB
assets/inputs/newspaper.png ADDED

Git LFS Details

  • SHA256: d905781b3f48e438e15bfabbe853301ad593e4cfb3a83dba9c0bdc188aec4e7e
  • Pointer size: 132 Bytes
  • Size of remote file: 1.75 MB
assets/inputs/polar_bear.mp4 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:ef863db1456a4d6c20e8a8a346925d7bee2c4f4e2e6e7749f80f6b4961a2062c
+ size 2865660
assets/inputs/sora.png ADDED

Git LFS Details

  • SHA256: 6b69de5c87b429c7b1a87de6f9cb3f5ec6aec5f58ab6ab7c0f727a5d0ec259a5
  • Pointer size: 132 Bytes
  • Size of remote file: 1.05 MB
assets/inputs/video-example.mp4 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:ae512cb2e0311fb0be4de6f9f6a646598e7ccb8da397ca5800ec2e9b8115bfd1
+ size 11846338
inference/interface/__init__.py ADDED
@@ -0,0 +1 @@
+ from .gradio_interface import PenguinVLQwen3GradioInterface
inference/interface/gradio_interface.py ADDED
@@ -0,0 +1,235 @@
+ import os
+ import os.path as osp
+
+ import gradio as gr
+
+ HEADER = """
+ # Penguin-VL Gradio Interface
+
+ Developed by the [Penguin-VL](https://github.com/tencent-ailab/Penguin-VL) team at Tencent AI Lab.
+
+ Note: speed on ZeroGPU does not reflect the real model speed and may be influenced by the shared environment. For a stable and fast deployment, please follow [the local UI instructions](https://github.com/tencent-ailab/Penguin-VL?tab=readme-ov-file#-gradio-demo-local-ui). For usage examples and expected results, please refer to [the inference notebook](https://github.com/tencent-ailab/Penguin-VL/blob/master/inference/notebooks/01_penguinvl_inference_recipes.public.ipynb).
+
+ Please log in with your Hugging Face account first. We provide some example images and videos for easier trials.
+ """
+
+
+ class PenguinVLQwen3GradioInterface(object):
+
+     def __init__(self, model_client, example_dir=None, default_system_prompt="You are a helpful assistant developed by Tencent AI Lab PenguinVL team.", **server_kwargs):
+         self.model_client = model_client
+         self.server_kwargs = server_kwargs
+         self.default_system_prompt = (default_system_prompt or "").strip()
+
+         self.image_formats = ("png", "jpg", "jpeg")
+         self.video_formats = ("mp4", "mov")
+         image_examples, video_examples = [], []
+         if example_dir is not None:
+             example_files = [
+                 osp.join(example_dir, f) for f in os.listdir(example_dir)
+             ]
+             for example_file in example_files:
+                 if example_file.endswith(self.image_formats):
+                     image_examples.append([example_file])
+                 elif example_file.endswith(self.video_formats):
+                     video_examples.append([example_file])
+
+         with gr.Blocks() as self.interface:
+             gr.Markdown(HEADER)
+             with gr.Row():
+                 chatbot_kwargs = {"elem_id": "chatbot", "height": 710}
+                 try:
+                     chatbot = gr.Chatbot(type="messages", **chatbot_kwargs)
+                 except TypeError:
+                     # Gradio 6 uses OpenAI-style messages by default and removed the `type` arg.
+                     chatbot = gr.Chatbot(**chatbot_kwargs)
+
+                 with gr.Column():
+                     with gr.Tab(label="Input"):
+
+                         with gr.Row():
+                             input_video = gr.Video(sources=["upload"], label="Upload Video")
+                             input_image = gr.Image(sources=["upload"], type="filepath", label="Upload Image")
+
+                         if len(image_examples):
+                             gr.Examples(image_examples, inputs=[input_image], label="Example Images")
+                         if len(video_examples):
+                             gr.Examples(video_examples, inputs=[input_video], label="Example Videos")
+
+                         input_text = gr.Textbox(label="Input Text", placeholder="Type your message here and press enter to submit")
+
+                         submit_button = gr.Button("Generate")
+
+                     with gr.Tab(label="Configure"):
+                         with gr.Accordion("Prompt Config", open=True):
+                             system_prompt = gr.Textbox(
+                                 value=self.default_system_prompt,
+                                 label="System Prompt",
+                                 lines=4,
+                                 placeholder="Optional: system instruction prepended to each request",
+                             )
+
+                         with gr.Accordion("Generation Config", open=True):
+                             do_sample = gr.Checkbox(value=True, label="Do Sample")
+                             temperature = gr.Slider(minimum=0.0, maximum=1.0, value=0.1, label="Temperature")
+                             top_p = gr.Slider(minimum=0.0, maximum=1.0, value=0.9, label="Top P")
+                             max_new_tokens = gr.Slider(minimum=0, maximum=4096, value=1024, step=1, label="Max New Tokens")
+
+                         with gr.Accordion("Video Config", open=True):
+                             fps = gr.Slider(minimum=0.0, maximum=10.0, value=1, label="FPS")
+                             max_frames = gr.Slider(minimum=0, maximum=256, value=180, step=1, label="Max Frames")
+
+             input_video.change(self._on_video_upload, [chatbot, input_video], [chatbot, input_video])
+             input_image.change(self._on_image_upload, [chatbot, input_image], [chatbot, input_image])
+             input_text.submit(self._on_text_submit, [chatbot, input_text], [chatbot, input_text])
+             submit_button.click(
+                 self._predict,
+                 [
+                     chatbot, input_text, system_prompt, do_sample, temperature, top_p, max_new_tokens,
+                     fps, max_frames,
+                 ],
+                 [chatbot, input_text],
+             )
+
+     def _on_video_upload(self, messages, video):
+         messages = messages or []
+         if video is not None:
+             # messages.append({"role": "user", "content": gr.Video(video)})
+             messages.append({"role": "user", "content": {"path": video}})
+         return messages, None
+
+     def _on_image_upload(self, messages, image):
+         messages = messages or []
+         if image is not None:
+             # messages.append({"role": "user", "content": gr.Image(image)})
+             messages.append({"role": "user", "content": {"path": image}})
+         return messages, None
+
+     def _on_text_submit(self, messages, text):
+         messages = messages or []
+         messages.append({"role": "user", "content": text})
+         return messages, ""
+
+     def _extract_media_path(self, content):
+         # Recursively search Gradio message content for a media file path.
+         if isinstance(content, dict):
+             if content.get("type") == "text" and isinstance(content.get("text"), str):
+                 raise ValueError(f"Text content is not media: {content}")
+             media_path = content.get("path")
+             if media_path:
+                 return media_path
+             for value in content.values():
+                 try:
+                     return self._extract_media_path(value)
+                 except ValueError:
+                     continue
+
+         if isinstance(content, (list, tuple)) and len(content) > 0:
+             for item in content:
+                 try:
+                     return self._extract_media_path(item)
+                 except ValueError:
+                     continue
+
+         raise ValueError(f"Unsupported media content: {content}")
+
+     def _extract_text_content(self, content):
+         if isinstance(content, str):
+             return content
+
+         if isinstance(content, dict):
+             if content.get("type") == "text" and isinstance(content.get("text"), str):
+                 return content["text"]
+             text = content.get("text")
+             if isinstance(text, str):
+                 return text
+
+         if isinstance(content, (list, tuple)) and len(content) > 0:
+             text_parts = []
+             for item in content:
+                 try:
+                     text_parts.append(self._extract_text_content(item))
+                 except ValueError:
+                     continue
+             if text_parts:
+                 return "\n".join(part for part in text_parts if part)
+
+         raise ValueError(f"Unsupported text content: {content}")
+
+     def _normalize_user_content(self, content, fps, max_frames):
+         # Convert Gradio chat content into the processor's conversation schema.
+         if isinstance(content, str):
+             return [{"type": "text", "text": content}]
+
+         if isinstance(content, (list, tuple)):
+             normalized_items = []
+             for item in content:
+                 normalized_items.extend(self._normalize_user_content(item, fps, max_frames))
+             return normalized_items
+
+         if isinstance(content, dict):
+             try:
+                 text = self._extract_text_content(content)
+             except ValueError:
+                 text = None
+             else:
+                 return [{"type": "text", "text": text}]
+
+             media_path = self._extract_media_path(content)
+             media_ext = osp.splitext(media_path)[1].lower().lstrip(".")
+             if media_ext in self.video_formats:
+                 return [{"type": "video", "video": {"video_path": media_path, "fps": fps, "max_frames": max_frames}}]
+             if media_ext in self.image_formats:
+                 return [{"type": "image", "image": {"image_path": media_path}}]
+             raise ValueError(f"Unsupported media type: {media_path}")
+
+         raise ValueError(f"Unsupported user content: {content}")
+
+     def _predict(self, messages, input_text, system_prompt, do_sample, temperature, top_p, max_new_tokens,
+                  fps, max_frames):
+         messages = list(messages or [])
+         input_text = input_text or ""
+         if input_text and len(input_text) > 0:
+             messages.append({"role": "user", "content": input_text})
+         new_messages = []
+         active_system_prompt = (system_prompt or self.default_system_prompt).strip()
+         if active_system_prompt:
+             new_messages.append({
+                 "role": "system",
+                 "content": [{"type": "text", "text": active_system_prompt}],
+             })
+
+         # Merge consecutive user turns (text + media) into a single request turn.
+         contents = []
+         for message in messages:
+             if message["role"] == "assistant":
+                 if len(contents):
+                     new_messages.append({"role": "user", "content": contents})
+                     contents = []
+                 new_messages.append(message)
+             elif message["role"] == "user":
+                 contents.extend(self._normalize_user_content(message["content"], fps, max_frames))
+
+         if len(contents):
+             new_messages.append({"role": "user", "content": contents})
+
+         if len(new_messages) == 0 or new_messages[-1]["role"] != "user":
+             # Nothing to generate from; echo the chat state back (this is a generator,
+             # so a plain `return messages` would yield nothing to the UI).
+             yield messages, ""
+             return
+
+         generation_config = {
+             "do_sample": do_sample,
+             "temperature": temperature,
+             "top_p": top_p,
+             "max_new_tokens": max_new_tokens,
+         }
+
+         response = self.model_client.submit({"conversation": new_messages, "generation_config": generation_config})
+         if isinstance(response, str):
+             messages.append({"role": "assistant", "content": response})
+             yield messages, ""
+             return
+
+         messages.append({"role": "assistant", "content": ""})
+         for token in response:
+             messages[-1]["content"] += token
+             yield messages, ""
+
+     def launch(self):
+         self.interface.launch(**self.server_kwargs)
inference/launch_gradio_demo.py ADDED
@@ -0,0 +1,62 @@
+ import sys
+ sys.path.append('.')
+
+ import argparse
+ import os
+ import subprocess
+ from threading import Thread
+
+ from inference.interface import PenguinVLQwen3GradioInterface
+ from inference.server import PenguinVLQwen3PlainClient
+
+
+ def launch_gradio_demo(model_path, server_port=16667, interface_port=33666, server_name="0.0.0.0", nproc=1, example_dir="./assets/inputs"):
+     # Run the backend inference server in a subprocess so the Gradio UI stays responsive.
+     server_thread = Thread(
+         target=lambda: subprocess.run(
+             [
+                 sys.executable, "-m",
+                 "inference.server.plain_server",
+                 "--model-path", model_path,
+                 "--nproc", str(nproc),
+                 "--port", str(server_port),
+             ]
+         )
+     )
+     server_thread.daemon = True
+     server_thread.start()
+
+     if example_dir is not None and not os.path.isdir(example_dir):
+         example_dir = None
+
+     model_client = PenguinVLQwen3PlainClient(port=server_port)
+     interface = PenguinVLQwen3GradioInterface(
+         model_client,
+         example_dir=example_dir,
+         server_name=server_name,
+         server_port=interface_port,
+     )
+     interface.launch()
+
+
+ def main():
+     parser = argparse.ArgumentParser()
+     parser.add_argument("--model-path", "--model_path", type=str, required=True)
+     parser.add_argument("--server-port", "--server_port", type=int, default=16667)
+     parser.add_argument("--interface-port", "--interface_port", type=int, default=33666)
+     parser.add_argument("--server-name", "--server_name", type=str, default="0.0.0.0")
+     parser.add_argument("--nproc", type=int, default=1)
+     parser.add_argument("--example-dir", "--example_dir", type=str, default="./assets/inputs")
+     args = parser.parse_args()
+
+     launch_gradio_demo(
+         model_path=args.model_path,
+         server_port=args.server_port,
+         interface_port=args.interface_port,
+         server_name=args.server_name,
+         nproc=args.nproc,
+         example_dir=args.example_dir,
+     )
+
+
+ if __name__ == "__main__":
+     main()
inference/server/__init__.py ADDED
@@ -0,0 +1,2 @@
+ from .plain_server import PenguinVLQwen3PlainClient, PenguinVLQwen3PlainServer
+ from .direct_client import PenguinVLQwen3DirectClient
inference/server/direct_client.py ADDED
@@ -0,0 +1,176 @@
+ import importlib
+ import importlib.util
+ import os
+ import subprocess
+ import sys
+ from threading import Lock, Thread
+
+ import torch
+ from transformers import AutoModelForCausalLM, AutoProcessor, TextIteratorStreamer
+
+ try:
+     import spaces
+ except ImportError:
+     # No-op shim so the demo also runs outside Hugging Face ZeroGPU Spaces.
+     class _SpacesShim:
+         @staticmethod
+         def GPU(*args, **kwargs):
+             if args and callable(args[0]) and len(args) == 1 and not kwargs:
+                 return args[0]
+
+             def decorator(fn):
+                 return fn
+
+             return decorator
+
+     spaces = _SpacesShim()
+
+
+ _MODEL = None
+ _PROCESSOR = None
+ _MODEL_PATH = None
+ _MODEL_LOCK = Lock()
+ _FLASH_ATTN_LOCK = Lock()
+ _FLASH_ATTN_PACKAGE = "flash_attn"
+ _FLASH_ATTN_REQUIREMENT = os.getenv("FLASH_ATTN_REQUIREMENT", "flash-attn==2.8.3")
+
+
+ def _get_attn_implementation():
+     return os.getenv("ATTN_IMPLEMENTATION", "sdpa")
+
+
+ def _get_model_revision():
+     return os.getenv("MODEL_REVISION")
+
+
+ def ensure_flash_attn_installed():
+     if importlib.util.find_spec(_FLASH_ATTN_PACKAGE) is not None:
+         return
+
+     with _FLASH_ATTN_LOCK:
+         # Re-check inside the lock in case another thread installed it first.
+         if importlib.util.find_spec(_FLASH_ATTN_PACKAGE) is not None:
+             return
+
+         install_cmd = [
+             sys.executable,
+             "-m",
+             "pip",
+             "install",
+             _FLASH_ATTN_REQUIREMENT,
+             "--no-build-isolation",
+         ]
+         print(f"Installing {_FLASH_ATTN_REQUIREMENT} with --no-build-isolation...")
+         subprocess.check_call(install_cmd, env=os.environ.copy())
+         importlib.invalidate_caches()
+         if importlib.util.find_spec(_FLASH_ATTN_PACKAGE) is None:
+             raise RuntimeError(f"Failed to import {_FLASH_ATTN_PACKAGE} after installation.")
+
+
+ def _ensure_model_loaded(model_path):
+     # Lazily load and cache one model/processor pair per process.
+     global _MODEL, _PROCESSOR, _MODEL_PATH
+
+     if _MODEL is not None and _PROCESSOR is not None and _MODEL_PATH == model_path:
+         return _MODEL, _PROCESSOR
+
+     with _MODEL_LOCK:
+         if _MODEL is not None and _PROCESSOR is not None and _MODEL_PATH == model_path:
+             return _MODEL, _PROCESSOR
+
+         ensure_flash_attn_installed()
+         attn_implementation = _get_attn_implementation()
+         revision = _get_model_revision()
+
+         processor_kwargs = {
+             "trust_remote_code": True,
+         }
+         if revision:
+             processor_kwargs["revision"] = revision
+
+         model_kwargs = {
+             "trust_remote_code": True,
+             "device_map": {"": "cuda:0"},
+             "torch_dtype": torch.bfloat16,
+             "attn_implementation": attn_implementation,
+         }
+         if revision:
+             model_kwargs["revision"] = revision
+
+         _MODEL = AutoModelForCausalLM.from_pretrained(model_path, **model_kwargs)
+         _PROCESSOR = AutoProcessor.from_pretrained(model_path, **processor_kwargs)
+         _MODEL_PATH = model_path
+         return _MODEL, _PROCESSOR
+
+
+ def _estimate_duration(payload):
+     # Heuristic ZeroGPU time budget: videos and longer generations get more time.
+     generation_config = payload.get("generation_config", {})
+     max_new_tokens = int(generation_config.get("max_new_tokens", 512))
+     has_video = False
+     for message in payload.get("conversation", []):
+         for content in message.get("content", []):
+             if isinstance(content, dict) and content.get("type") == "video":
+                 has_video = True
+                 break
+         if has_video:
+             break
+
+     base_duration = 90 if has_video else 60
+     token_budget = max_new_tokens // 16
+     return min(180, max(base_duration, base_duration + token_budget))
+
+
+ @spaces.GPU(duration=_estimate_duration)
+ def _run_generation_stream(payload):
+     model_path = payload["model_path"]
+     model, processor = _ensure_model_loaded(model_path)
+
+     inputs = processor(
+         conversation=payload["conversation"],
+         add_system_prompt=True,
+         add_generation_prompt=True,
+         return_tensors="pt",
+     )
+     inputs = {k: v.to("cuda:0") if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}
+     if "pixel_values" in inputs:
+         inputs["pixel_values"] = inputs["pixel_values"].to(torch.bfloat16)
+
+     generation_kwargs = {
+         **inputs,
+         **payload.get("generation_config", {}),
+     }
+     streamer = TextIteratorStreamer(
+         processor.tokenizer,
+         skip_prompt=True,
+         skip_special_tokens=True,
+     )
+     generation_kwargs["streamer"] = streamer
+
+     generation_error = {}
+
+     def _generation_worker():
+         try:
+             with torch.inference_mode():
+                 model.generate(**generation_kwargs)
+         except Exception as exc:
+             generation_error["exc"] = exc
+             # Unblock the consumer loop if generation dies early.
+             streamer.on_finalized_text("", stream_end=True)
+
+     thread = Thread(target=_generation_worker, daemon=True)
+     thread.start()
+
+     for token in streamer:
+         yield token
+
+     if "exc" in generation_error:
+         raise generation_error["exc"]
+
+
+ class PenguinVLQwen3DirectClient(object):
+
+     def __init__(self, model_path):
+         self.model_path = model_path
+
+     def submit(self, payload):
+         return _run_generation_stream({
+             "model_path": self.model_path,
+             "conversation": payload["conversation"],
+             "generation_config": payload.get("generation_config", {}),
+         })
inference/server/plain_server.py ADDED
@@ -0,0 +1,279 @@
+ import argparse
+ import random
+ import socket
+ import time
+ import traceback
+
+ import json
+ import logging
+ import torch
+ from transformers import AutoModelForCausalLM, AutoProcessor, TextIteratorStreamer
+ from threading import Thread
+ from multiprocessing import Process, Queue
+
+ # Wire protocol: JSON messages {"id": ..., "data": ...} joined by SEPARATOR;
+ # EOS_FLAG marks the end of one streamed response.
+ EOS_FLAG = "<EOS>"
+ SEPARATOR = "<SEP>"
+
+
+ def get_logger(name):
+     logger = logging.getLogger(name)
+     logger.setLevel(logging.INFO)
+     handler = logging.StreamHandler()
+     formatter = logging.Formatter("%(asctime)s - %(levelname)s - %(name)s - %(message)s")
+     handler.setFormatter(formatter)
+     logger.addHandler(handler)
+     return logger
+
+
+ class Streamer(object):
+
+     def __init__(self, timeout=None):
+         self.timeout = timeout
+         self.queue = Queue(maxsize=1024)
+         self.stop_signal = EOS_FLAG
+
+     def put(self, value):
+         self.queue.put(value)
+
+     def __iter__(self):
+         return self
+
+     def __next__(self):
+         try:
+             value = self.queue.get(timeout=self.timeout)
+         except Exception:
+             raise StopIteration()
+
+         if value == self.stop_signal:
+             raise StopIteration()
+         else:
+             return value
+
+
+ class PenguinVLQwen3PlainClient(object):
+
+     def __init__(self, host="localhost", port=16666):
+         self.host = host
+         self.port = port
+
+         self.input_buffer = Queue(maxsize=1024)
+         self.streamers = dict()
+         self.logger = get_logger("penguinvl_qwen3.client")
+
+         client_thread = Thread(target=self._client_worker)
+         client_thread.daemon = True
+         client_thread.start()
+
+     def _receive_worker(self, server_socket):
+         try:
+             while True:
+                 data = server_socket.recv(8192)
+                 if not data:
+                     self.logger.info("Connection has been terminated.")
+                     for streamer in self.streamers.values():
+                         streamer.put(streamer.stop_signal)
+                     break
+
+                 for sub_data in data.decode("utf-8").split(SEPARATOR):
+                     if len(sub_data) == 0:
+                         continue
+
+                     try:
+                         sub_data = json.loads(sub_data)
+                     except ValueError:
+                         self.logger.info(f"Failed to parse data: {sub_data}")
+                         continue
+
+                     self.logger.info(f"Received: {sub_data['data']}")
+                     self.streamers[sub_data["id"]].put(sub_data["data"])
+
+                     if sub_data["data"] == EOS_FLAG:
+                         self.streamers.pop(sub_data["id"])
+
+         except ConnectionResetError:
+             self.logger.info("Connection has been terminated.")
+
+     def _send_worker(self, server_socket):
+         while True:
+             request_id, conversation = self.input_buffer.get()
+             data = json.dumps({"id": request_id, "data": conversation}) + SEPARATOR
+             server_socket.sendall(data.encode("utf-8"))
+             self.logger.info(f"Sent: {data}")
+
+     def _client_worker(self):
+         with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server_socket:
+             while True:
+                 try:
+                     server_socket.connect((self.host, self.port))
+                     break
+                 except ConnectionRefusedError:
+                     self.logger.info("Waiting for the server to start...")
+                     time.sleep(1)
+                     continue
+
+             self.logger.info("Connected to server.")
+             receive_thread = Thread(target=self._receive_worker, args=(server_socket,))
+             receive_thread.daemon = True
+             receive_thread.start()
+
+             send_thread = Thread(target=self._send_worker, args=(server_socket,))
+             send_thread.daemon = True
+             send_thread.start()
+
+             receive_thread.join()
+
+     def submit(self, conversation):
+         request_id = random.randint(0, 4294967295)
+         streamer = Streamer()
+         self.streamers[request_id] = streamer
+         self.input_buffer.put((request_id, conversation))
+         return streamer
+
+
+ class PenguinVLQwen3PlainServer(object):
+
+     def __init__(
+         self,
+         model_path,
+         torch_dtype=torch.bfloat16,
+         attn_implementation="flash_attention_2",
+         num_processes=1,
+         buffer_size=2,
+         host="localhost",
+         port=16666,
+     ):
+         self.model_path = model_path
+         self.torch_dtype = torch_dtype
+         self.attn_implementation = attn_implementation
+         self.num_processes = num_processes
+         self.buffer_size = buffer_size
+
+         self.host = host
+         self.port = port
+
+     def _model_worker(self, input_buffer, output_buffer, device_map, rank):
+         # One worker process per GPU: load the model once, then serve requests.
+         logger = get_logger(f"penguinvl_qwen3.server.worker_{rank}")
+         logger.info(f"Loading model from {self.model_path}...")
+
+         model = AutoModelForCausalLM.from_pretrained(
+             self.model_path,
+             trust_remote_code=True,
+             torch_dtype=self.torch_dtype,
+             attn_implementation=self.attn_implementation,
+             device_map=device_map,
+         )
+         processor = AutoProcessor.from_pretrained(self.model_path, trust_remote_code=True)
+         logger.info("Successfully loaded model.")
+
+         while True:
+             logger.info("Waiting for input...")
+             request_id, data = input_buffer.get()
+             try:
+                 inputs = processor(
+                     conversation=data["conversation"],
+                     add_system_prompt=True,
+                     add_generation_prompt=True,
+                     return_tensors="pt",
+                 )
+                 inputs = {k: v.to(f"cuda:{rank}") if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}
+                 if "pixel_values" in inputs:
+                     inputs["pixel_values"] = inputs["pixel_values"].to(torch.bfloat16)
+
+                 streamer = TextIteratorStreamer(processor.tokenizer, skip_prompt=True, skip_special_tokens=True)
+                 generation_kwargs = {
+                     **inputs,
+                     **data["generation_config"],
+                     "streamer": streamer,
+                 }
+
+                 thread = Thread(target=model.generate, kwargs=generation_kwargs)
+                 thread.daemon = True
+                 thread.start()
+
+                 for token in streamer:
+                     output_buffer.put((request_id, token))
+                 output_buffer.put((request_id, EOS_FLAG))
+
+             except Exception:
+                 logger.error(f"An error occurred: {traceback.format_exc()}")
+                 output_buffer.put((request_id, "Server error! Please check the server logs and retry."))
+                 output_buffer.put((request_id, EOS_FLAG))
+
+     def _receive_worker(self, logger, input_buffer, client_socket, client_address):
+         try:
+             while True:
+                 data = client_socket.recv(8192)
+                 if not data:
+                     logger.info(f"Connection from {client_address} has been terminated.")
+                     break
+
+                 for sub_data in data.decode("utf-8").split(SEPARATOR):
+                     if len(sub_data) == 0:
+                         continue
+
+                     try:
+                         sub_data = json.loads(sub_data)
+                     except ValueError:
+                         logger.info(f"Failed to parse data: {sub_data}")
+                         continue
+
+                     logger.info(f"Received from {client_address}: {sub_data}")
+                     input_buffer.put((sub_data["id"], sub_data["data"]))
+
+         except ConnectionResetError:
+             logger.info(f"Connection from {client_address} has been terminated.")
+
+     def _send_worker(self, logger, output_buffer, client_socket, client_address):
+         try:
+             while True:
+                 request_id, token = output_buffer.get()
+                 data = json.dumps({"id": request_id, "data": token}) + SEPARATOR
+                 client_socket.sendall(data.encode("utf-8"))
+
+         except ConnectionResetError:
+             logger.info(f"Connection from {client_address} has been terminated.")
+
+     def launch(self):
+         logger = get_logger("penguinvl_qwen3.server.controller")
+
+         input_buffer = Queue(maxsize=self.num_processes * self.buffer_size)
+         output_buffer = Queue(maxsize=self.num_processes * 1024)
+
+         for i in range(self.num_processes):
+             device_map = {"": f"cuda:{i}"}
+             process = Process(target=self._model_worker, args=(input_buffer, output_buffer, device_map, i))
+             process.start()
+
+         with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server_socket:
+             server_socket.bind((self.host, self.port))
+             server_socket.listen(1)
+             logger.info("Waiting for connection...")
+
+             while True:
+                 client_socket, client_address = server_socket.accept()
+                 logger.info(f"Connected to {client_address}.")
+
+                 receive_thread = Thread(target=self._receive_worker, args=(logger, input_buffer, client_socket, client_address))
+                 receive_thread.daemon = True
+                 receive_thread.start()
+
+                 send_thread = Thread(target=self._send_worker, args=(logger, output_buffer, client_socket, client_address))
+                 send_thread.daemon = True
+                 send_thread.start()
+
+
+ if __name__ == "__main__":
+     torch.multiprocessing.set_start_method("spawn")
+
+     parser = argparse.ArgumentParser()
+     parser.add_argument("--model-path", "--model_path", type=str, required=True)
+     parser.add_argument("--nproc", type=int, default=8)
+     parser.add_argument("--port", type=int, default=16666)
+     args = parser.parse_args()
+
+     server = PenguinVLQwen3PlainServer(
+         model_path=args.model_path,
+         num_processes=args.nproc,
+         port=args.port,
+     )
+     server.launch()
packages.txt ADDED
@@ -0,0 +1 @@
+ build-essential
pre-requirements.txt ADDED
@@ -0,0 +1,5 @@
+ # Build helpers recommended by the FlashAttention installation guide.
+ packaging
+ psutil
+ ninja
+ wheel
requirements.txt ADDED
@@ -0,0 +1,37 @@
+ --extra-index-url https://download.pytorch.org/whl/cu124
+
+ # Base runtime for Transformers inference and the Gradio demo.
+ # Training, notebook, and vLLM-specific extras were removed from this file.
+ # The previous full list is preserved in requirements.original.txt.
+
+ # Core model runtime
+ torch==2.5.1
+ torchvision==0.20.1
+ transformers==4.51.3
+ tokenizers==0.21.4
+ accelerate==1.10.1
+ huggingface_hub==0.34.4
+ sentencepiece==0.1.99
+ timm==1.0.3
+ numpy==1.24.4
+ Pillow
+ einops==0.6.1
+ einops-exts==0.0.4
+
+ # Image and video processing
+ decord==0.6.0
+ imageio==2.34.0
+ imageio-ffmpeg==0.4.9
+ opencv-python-headless==4.6.0.66
+ ffmpeg-python
+ requests
+
+ # UI
+ gradio>=5.44.1,<7
+
+ # FlashAttention is installed separately with:
+ #   pip install flash-attn==2.8.3 --no-build-isolation
+ # This cannot be expressed in a standard Gradio Space requirements install step.
+
+ # Optional extras
+ # vllm==0.11.0