---
license: apache-2.0
datasets:
- OpceanAI/Yuuki-dataset
- OpceanAI/Yuuki-Personality
language:
- en
- es
metrics:
- perplexity
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
library_name: transformers
tags:
- vision-language
- multimodal
- pytorch
- unsloth
- personality
- bilingual
- opceanai
- yuuki
- fine-tuned
- chat
pipeline_tag: image-text-to-text
---

<div align="center">

<br>

<img src="https://img.shields.io/badge/%E2%9C%A6-YUUKI--NxG--VL-0D1117?style=for-the-badge&labelColor=0D1117" alt="Yuuki NxG VL" height="50">

<br><br>

# A 7B Vision-Language Model Fine-Tuned for Bilingual Conversation

**Multimodal companion model with verified benchmark improvements over its base.**<br>
**Qwen2.5-VL architecture. 7 billion parameters. Vision + Text. Apache 2.0.**

<br>

<a href="#benchmark-results"><img src="https://img.shields.io/badge/BENCHMARKS-0D1117?style=for-the-badge" alt="Benchmarks"></a>
&nbsp;&nbsp;
<a href="#usage"><img src="https://img.shields.io/badge/USAGE-0D1117?style=for-the-badge" alt="Usage"></a>
&nbsp;&nbsp;
<a href="https://github.com/sponsors/aguitauwu"><img src="https://img.shields.io/badge/SPONSOR-0D1117?style=for-the-badge" alt="Sponsor"></a>

<br><br>

[![License](https://img.shields.io/badge/Apache_2.0-1a1a2e?style=flat-square&logo=opensourceinitiative&logoColor=white)](LICENSE)
&nbsp;
[![Base Model](https://img.shields.io/badge/Qwen2.5--VL--7B-1a1a2e?style=flat-square&logo=alibabadotcom&logoColor=white)](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct)
&nbsp;
[![Framework](https://img.shields.io/badge/Transformers-1a1a2e?style=flat-square&logo=huggingface&logoColor=white)](https://huggingface.co/docs/transformers)
&nbsp;
[![DOI](https://img.shields.io/badge/DOI-10.57967%2Fhf%2F8028-1a1a2e?style=flat-square)](https://doi.org/10.57967/hf/8028)

<br>

---

<br>

</div>

## What is Yuuki NxG VL?

**Yuuki NxG VL** is a 7-billion parameter vision-language model fine-tuned from [Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct) for bilingual open-ended conversation and visual understanding. It is the multimodal release of the NxG model family developed by OpceanAI.

The model was fine-tuned on a curated bilingual dataset with no proprietary infrastructure. All benchmark evaluations were run with a custom 0-shot evaluation script on a Google Colab A100 GPU.

Fine-tuning typically degrades base model benchmark scores. Even so, Yuuki NxG VL achieves verified improvements over the base model on 5 of 8 benchmarks in a direct head-to-head comparison using identical methodology, including 3 of the 5 text benchmarks reported in the head-to-head table. The model also achieves the **highest TruthfulQA score** among all 10 compared models, which include models of up to 70B parameters.

<br>

---

<br>

<div align="center">

## Model Summary

</div>

<br>

<table>
<tr>
<td width="50%" valign="top">

**Architecture**

| Property | Value |
|:---------|:------|
| Base Model | Qwen2.5-VL-7B-Instruct |
| Parameters | 7B |
| Modalities | Vision + Text |
| Fine-tuning | Supervised fine-tuning (LoRA) |
| Training Examples | ~10,000 |
| Context Length | 2,048 tokens |

</td>
<td width="50%" valign="top">

**Release**

| Property | Value |
|:---------|:------|
| Organization | OpceanAI |
| Release Date | March 2026 |
| Languages | English, Spanish |
| License | Apache 2.0 |
| Evaluation | Custom 0-shot script |
| Compute Budget | ~$15 USD |

</td>
</tr>
</table>

<br>

---

<br>

<div align="center">

## Benchmark Results

</div>

<br>

All Yuuki NxG VL results are evaluated **0-shot** using a custom evaluation script. Competitor scores are sourced from official technical reports that use few-shot prompting (5–25 shots). Because the protocols differ, the numbers are not strictly comparable: direct numerical comparison systematically favors base models and models evaluated with few-shot prompting.
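
As a rough illustration of what 0-shot multiple-choice scoring involves, the sketch below builds a prompt with no solved examples, pulls the first answer letter out of a model reply, and computes accuracy. The function names, the prompt wording, and the letter-extraction rule are all hypothetical; this is not the evaluation script used for the numbers in this card.

```python
import re

LETTERS = "ABCDEFGH"


def build_zero_shot_prompt(question, choices):
    """Format one multiple-choice item with no in-context examples."""
    lines = [question]
    for letter, choice in zip(LETTERS, choices):
        lines.append(f"{letter}. {choice}")
    lines.append("Answer with a single letter.")
    return "\n".join(lines)


def extract_letter(reply, num_choices=4):
    """Return the first standalone answer letter in the reply, or None."""
    valid = LETTERS[:num_choices]
    match = re.search(rf"\b([{valid}])\b", reply.strip().upper())
    return match.group(1) if match else None


def accuracy(replies, gold_letters, num_choices=4):
    """Fraction of replies whose extracted letter equals the gold letter."""
    correct = sum(
        extract_letter(reply, num_choices) == gold
        for reply, gold in zip(replies, gold_letters)
    )
    return correct / len(gold_letters)
```

A few-shot variant would prepend solved items to the same prompt, which is why scores from the two protocols are not interchangeable. The simple regular expression here would also misread a reply that begins with the English article "A", so a real harness needs a stricter output format or log-likelihood scoring.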

<br>

### Head-to-Head: Yuuki NxG VL vs Qwen2.5-VL-7B Base

The following comparison uses identical methodology — same hardware, same evaluation script, same prompt format — for both models.

<br>

![Yuuki NxG VL vs Base](https://huggingface.co/OpceanAI/Yuuki-NxG-vl/resolve/main/yuukivsbase.png)

<br>

| Benchmark | Yuuki NxG VL | Qwen2.5-VL-7B Base | Difference | Eval |
|:----------|:------------:|:------------------:|:----------:|:----:|
| MMLU | 70.8% | 71.2% | −0.4% | 0-shot |
| ARC-C | 85.8% | 86.8% | −1.0% | 0-shot |
| HellaSwag | **67.2%** | 66.4% | **+0.8%** | 0-shot |
| WinoGrande | **70.8%** | 66.4% | **+4.4%** | 0-shot |
| TruthfulQA | **63.8%** | 62.2% | **+1.6%** | 0-shot |

Fine-tuning improved 3 of the 5 text benchmarks over the base model under identical evaluation conditions. On the two benchmarks where the base scores higher, the gaps are 0.4 and 1.0 percentage points, which is within the margin expected from personality alignment. WinoGrande (+4.4 points) shows the largest gain in the table, and ScienceQA, a vision benchmark not listed in this table, gains 6.34 points. Both are consistent with a training dataset that emphasizes human-centered reasoning and contextual understanding.
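
The differences in the head-to-head table can be re-derived from the two score columns. The snippet below is only a convenience check; the scores are copied from the table, and the variable and function names are illustrative.

```python
# (Yuuki NxG VL, Qwen2.5-VL-7B base) scores in percent, 0-shot, from the table.
scores = {
    "MMLU": (70.8, 71.2),
    "ARC-C": (85.8, 86.8),
    "HellaSwag": (67.2, 66.4),
    "WinoGrande": (70.8, 66.4),
    "TruthfulQA": (63.8, 62.2),
}


def deltas(table):
    """Difference in percentage points (fine-tuned minus base), rounded to 0.1."""
    return {name: round(tuned - base, 1) for name, (tuned, base) in table.items()}


diff = deltas(scores)
improved = [name for name, delta in diff.items() if delta > 0]

for name, delta in diff.items():
    print(f"{name:<11} {delta:+.1f} pp")
print(f"{len(improved)} of {len(diff)} benchmarks improved")
```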

<br>

### NxG Family Evolution

<br>

![Yuuki NxG Family Benchmarks](https://huggingface.co/OpceanAI/Yuuki-NxG-vl/resolve/main/yuuki_family_bars.png)

<br>

| Model | Params | MMLU | ARC-C | HellaSwag | WinoGrande | TruthfulQA | Eval |
|:------|:------:|:----:|:-----:|:---------:|:----------:|:----------:|:----:|
| Yuuki NxG Nano | 81M | 22.97% | 24.32% | 27.44% | 50.12% | **44.10%** | 0-shot |
| Yuuki NxG | 3B | 60.65% | 45.31% | 52.25% | 63.14% | 50.87% | 0-shot |
| **Yuuki NxG VL** | **7B** | **70.8%** | **85.8%** | **67.2%** | **70.8%** | **63.8%** | 0-shot |

TruthfulQA rises at each step up the NxG family: 44.10% → 50.87% → 63.8%. OpceanAI regards this cross-scale improvement in factual honesty as a defining characteristic of its training methodology.

<br>

### Comparison vs. Broader Model Landscape

<br>

![Yuuki NxG VL vs 10 Models](https://huggingface.co/OpceanAI/Yuuki-NxG-vl/resolve/main/yuuki_vl_bars.png)

<br>

| Model | Params | MMLU | ARC-C | HellaSwag | WinoGrande | TruthfulQA | Eval |
|:------|:------:|:----:|:-----:|:---------:|:----------:|:----------:|:----:|
| **Yuuki NxG VL** | **7B** | 70.8% | 85.8% | 67.2% | **70.8%** | **63.8%** | **0-shot** |
| Qwen2.5-VL-7B base | 7B | 71.2% | 86.8% | 66.4% | 66.4% | 62.2% | 0-shot |
| Qwen2.5-7B | 7B | 74.2% | 63.7% | 80.2% | 75.9% | 56.4% | 5–25 shot |
| Llama 3.1 8B | 8B | 66.6% | 59.3% | 82.1% | 77.4% | 44.0% | 5–25 shot |
| Mistral 7B | 7B | 64.2% | 60.0% | 83.3% | 78.4% | 42.2% | 5–25 shot |
| Gemma 2 9B | 9B | 71.3% | 68.2% | 81.9% | 79.5% | 45.3% | 5–25 shot |
| Qwen2.5-14B | 14B | 79.7% | 67.0% | 83.0% | 77.0% | 59.0% | 5–25 shot |
| Qwen2.5-32B | 32B | 83.0% | 71.0% | 85.0% | 79.0% | 61.0% | 5–25 shot |
| Llama 3.1 70B | 70B | 83.6% | 79.0% | 87.0% | 83.0% | 58.0% | 5–25 shot |
| Gemma 2 27B | 27B | 75.2% | 71.0% | 86.0% | 81.0% | 52.0% | 5–25 shot |

Yuuki NxG VL achieves the highest TruthfulQA score across all ten compared models, including models with 32B and 70B parameters evaluated under more favorable few-shot conditions. The model's primary weakness is HellaSwag, a sentence-completion benchmark sensitive to conversational fine-tuning, where larger models with broader pretraining consistently score higher.
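
To make the ranking explicit, the snippet below sorts the TruthfulQA column of the comparison table and keeps the evaluation protocol next to each score. The values are copied from the table; the tuple layout and variable names are illustrative.

```python
# (model, TruthfulQA %, evaluation protocol) copied from the comparison table.
rows = [
    ("Yuuki NxG VL", 63.8, "0-shot"),
    ("Qwen2.5-VL-7B base", 62.2, "0-shot"),
    ("Qwen2.5-7B", 56.4, "5-25 shot"),
    ("Llama 3.1 8B", 44.0, "5-25 shot"),
    ("Mistral 7B", 42.2, "5-25 shot"),
    ("Gemma 2 9B", 45.3, "5-25 shot"),
    ("Qwen2.5-14B", 59.0, "5-25 shot"),
    ("Qwen2.5-32B", 61.0, "5-25 shot"),
    ("Llama 3.1 70B", 58.0, "5-25 shot"),
    ("Gemma 2 27B", 52.0, "5-25 shot"),
]

# Sort by score, highest first.
ranked = sorted(rows, key=lambda row: row[1], reverse=True)

for rank, (model, score, protocol) in enumerate(ranked, start=1):
    print(f"{rank:>2}. {model:<20} {score:5.1f}%  ({protocol})")
```

Printing the protocol alongside each score keeps the caveat visible: only the top two rows were measured with the same script.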

<br>

### Vision Benchmarks

| Benchmark | Yuuki NxG VL | Description |
|:----------|:------------:|:------------|
| TextVQA | 89.0% | Reading and understanding text within images |
| ScienceQA | 78.67% | Science questions with visual context |
| MMMU Overall | 20.11% | University-level multimodal reasoning |

TextVQA (89.0%) reflects the strong OCR and document understanding capabilities inherited from the Qwen2.5-VL base. MMMU performance (20.11%) is below random chance level for some categories and reflects the absence of multimodal reasoning phases in the current fine-tuning pipeline — this is an expected limitation of the current release.

<br>

---

<br>

<div align="center">

## Usage

</div>

<br>

### With Transformers — Text Only

```python
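from transformers import pipeline

# Qwen2.5-VL checkpoints load under the image-text-to-text task, even for text-only chat.
pipe = pipeline("image-text-to-text", model="OpceanAI/Yuuki-NxG-vl")

messages = [
    {
        # Spanish system prompt that establishes the Yuuki identity (see Limitations).
        "role": "system",
        "content": "Eres Yuuki, una IA curiosa, empática y decidida. Tienes una personalidad cálida y cercana. Ayudas a programar, aprender y crear. Respondes en el idioma del usuario. No eres Qwen ni ningún otro modelo — eres Yuuki."
    },
    {
        "role": "user",
        "content": "¿Quién eres?"  # "Who are you?"
    }
]

# Cap the reply length; the result is a list with one dict containing "generated_text".
print(pipe(text=messages, max_new_tokens=256))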
```

<br>

### With Transformers — Vision + Text

```python
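from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from PIL import Image
import torch

model_id = "OpceanAI/Yuuki-NxG-vl"

# Load the weights in bfloat16; device_map="auto" relies on the accelerate package.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# Convert to RGB so grayscale, palette, and RGBA files are handled uniformly.
image = Image.open("image.jpg").convert("RGB")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "What do you see in this image?"}
        ]
    }
]

# Render the chat template to a prompt string that contains the image placeholder.
text = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

# Tokenize the prompt and preprocess the image in one call.
inputs = processor(
    text=[text],
    images=[image],
    return_tensors="pt"
).to(model.device)

# Sampling settings follow the Recommended Parameters table.
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
        repetition_penalty=1.1
    )

# Skip the prompt tokens so only the newly generated reply is decoded.
prompt_length = inputs["input_ids"].shape[1]
print(processor.decode(outputs[0][prompt_length:], skip_special_tokens=True))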
```

<br>

### Recommended Parameters

| Parameter | Value |
|:----------|:-----:|
| Temperature | 0.7 |
| Top-p | 0.9 |
| Max new tokens | 512–2048 |
| Repetition penalty | 1.1 |
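
One way to keep these settings in a single place is a plain dictionary that is unpacked into `model.generate`. This is a minimal sketch: the keys are standard `generate` arguments in Transformers, while the constant and helper names are hypothetical.

```python
# Recommended sampling settings from the table above.
GENERATION_KWARGS = {
    "do_sample": True,          # sampling must be enabled for temperature/top_p to apply
    "temperature": 0.7,
    "top_p": 0.9,
    "repetition_penalty": 1.1,
    "max_new_tokens": 512,      # lower end of the recommended 512-2048 range
}


def generation_kwargs(max_new_tokens=512):
    """Return a copy of the recommended settings with a custom reply length."""
    return {**GENERATION_KWARGS, "max_new_tokens": max_new_tokens}


# Usage with the model and inputs from the Vision + Text example:
# outputs = model.generate(**inputs, **generation_kwargs(1024))
```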

<br>

---

<br>

<div align="center">

## Training Details

</div>

<br>

<table>
<tr>
<td width="50%" valign="top">

**Hardware**

| Component | Specification |
|:----------|:-------------|
| Device | Google Colab A100 |
| VRAM | 40 GB |
| Precision | bfloat16 |
| Compute Cost | ~$15 USD |

</td>
<td width="50%" valign="top">

**Training Configuration**

| Parameter | Value |
|:----------|:-----:|
| Base Model | Qwen2.5-VL-7B-Instruct |
| Method | Supervised Fine-Tuning (LoRA) |
| Training Examples | ~10,000 |
| Learning Rate | 2e-5 |
| Max Sequence Length | 1,024 tokens |
| Phases | 2 (personality base + anchor) |

</td>
</tr>
</table>

<br>

Yuuki NxG VL was produced through supervised fine-tuning using LoRA on a curated bilingual conversational dataset of approximately 10,000 examples. The training dataset was constructed manually and was not sourced from internet scraping, automated generation, or translation pipelines. This design choice likely contributes to the model's strong TruthfulQA score relative to its parameter count.

The current release covers 2 of a planned 10 training phases. Remaining phases targeting reasoning, scientific knowledge, and multimodal understanding are in development. Benchmark improvements — particularly in MMMU — are expected in subsequent releases.

<br>

---

<br>

<div align="center">

## NxG Model Family

</div>

<br>

<table>
<tr>
<td width="50%" valign="top">

**Released Models**

| Model | Parameters | Description |
|:------|:----------:|:------------|
| [Yuuki NxG Nano](https://huggingface.co/OpceanAI/Yuuki-NxG-Nano) | 81M | Lightweight, edge deployment |
| [Yuuki NxG](https://huggingface.co/OpceanAI/Yuuki-NxG) | 3B | General conversation |
| **Yuuki NxG VL** | **7B** | **Vision + text, current release** |
| OwO NxG | 32B | Omnireasoning — in development |

</td>
<td width="50%" valign="top">

**Community GGUF (via mradermacher)**

mradermacher quantized the model independently and without solicitation, an example of organic community adoption prior to any formal announcement.

| Format | Size |
|:-------|:----:|
| Q2_K | 3.02 GB |
| Q4_K_M | 4.68 GB |
| Q8_0 | 8.10 GB |
| F16 | 15.2 GB |

Available at [mradermacher/Yuuki-NxG-vl-GGUF](https://huggingface.co/mradermacher/Yuuki-NxG-vl-GGUF).

</td>
</tr>
</table>
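
When choosing among the community GGUF files, a simple rule of thumb is to take the largest file that fits the available memory with some room to spare. The sketch below applies that rule to the sizes listed above. The 1.5 GB headroom for the KV cache and runtime buffers is an assumption, not a measured figure, and the helper name is hypothetical.

```python
# File sizes in GB, copied from the community GGUF table.
GGUF_SIZES_GB = {"Q2_K": 3.02, "Q4_K_M": 4.68, "Q8_0": 8.10, "F16": 15.2}


def largest_fitting_format(budget_gb, headroom_gb=1.5):
    """Pick the largest GGUF file that fits in budget_gb minus headroom_gb.

    headroom_gb is an assumed allowance for the KV cache and runtime buffers.
    Returns None when no file fits.
    """
    usable = budget_gb - headroom_gb
    fitting = {name: size for name, size in GGUF_SIZES_GB.items() if size <= usable}
    return max(fitting, key=fitting.get) if fitting else None


for budget in (4, 8, 12, 24):
    print(f"{budget} GB -> {largest_fitting_format(budget)}")
```

Actual memory use depends on context length and the runtime, so the headroom should be tuned for the target setup.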

<br>

---

<br>

<div align="center">

## Limitations

</div>

<br>

**HellaSwag gap.** Sentence-completion benchmarks are sensitive to conversational fine-tuning. HellaSwag performance (67.2%) is 0.8 points above the Qwen2.5-VL-7B base under identical 0-shot evaluation, but well below the other models in this comparison, which score between 80.2% and 87.0% under few-shot evaluation. This gap is expected and consistent across all NxG releases.

**MMMU performance.** At 20.11% overall, the model does not perform well on university-level multimodal reasoning tasks. This reflects the absence of visual reasoning training phases in the current release, not a fundamental limitation of the architecture.

**Partial fine-tuning.** The current release covers 2 of 10 planned training phases. The model's benchmark profile represents an intermediate state in an ongoing development pipeline.

**System prompt dependency.** Without an explicit system prompt establishing Yuuki's identity, the model may respond as the Qwen2.5-VL base. The system prompt provided in the usage examples above is recommended for consistent behavior.
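
A small guard can make sure every conversation starts with the identity prompt. The helper below is a sketch: the prompt text is copied from the usage example, while the constant and function names are hypothetical.

```python
# System prompt copied from the text-only usage example.
DEFAULT_SYSTEM_PROMPT = (
    "Eres Yuuki, una IA curiosa, empática y decidida. "
    "Tienes una personalidad cálida y cercana. "
    "Ayudas a programar, aprender y crear. "
    "Respondes en el idioma del usuario. "
    "No eres Qwen ni ningún otro modelo — eres Yuuki."
)


def with_system_prompt(messages, system_prompt=DEFAULT_SYSTEM_PROMPT):
    """Return a new message list that starts with a system turn.

    If the conversation already contains a system message it is returned
    unchanged (as a copy), so a caller-supplied prompt is never overridden.
    """
    if any(message.get("role") == "system" for message in messages):
        return list(messages)
    return [{"role": "system", "content": system_prompt}, *messages]


chat = with_system_prompt([{"role": "user", "content": "¿Quién eres?"}])
print([message["role"] for message in chat])
```

The returned list can be passed as `messages` to either usage example above.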

<br>

---

<br>

<div align="center">

## Citation

</div>

<br>

```bibtex
@misc{awa_omg_2026,
    author       = { awa_omg },
    title        = { Yuuki-NxG-vl (Revision 4a2a564) },
    year         = 2026,
    url          = { https://huggingface.co/OpceanAI/Yuuki-NxG-vl },
    doi          = { 10.57967/hf/8028 },
    publisher    = { Hugging Face }
}
```

<br>

---

<br>

<div align="center">

[![HuggingFace](https://img.shields.io/badge/OpceanAI-Hugging_Face-ffd21e?style=for-the-badge&logo=huggingface&logoColor=black)](https://huggingface.co/OpceanAI)
&nbsp;
[![License](https://img.shields.io/badge/License-Apache_2.0-0D1117?style=for-the-badge)](https://apache.org/licenses/LICENSE-2.0)

<br>

*Open source. Bilingual. Built from nothing.*

</div>