rp-yu/Dimple-7B

@@ -17,197 +17,132 @@ base_model:
 pipeline_tag: image-text-to-text
 ---
-# Model Card for Model ID
-<!-- Provide a quick summary of what the model is/does. -->
-## Model Details
-### Model Description
-<!-- Provide a longer summary of what this model is. -->
-This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
-- **Developed by:** [More Information Needed]
-- **Funded by [optional]:** [More Information Needed]
-- **Shared by [optional]:** [More Information Needed]
-- **Model type:** [More Information Needed]
-- **Language(s) (NLP):** [More Information Needed]
-- **License:** [More Information Needed]
-- **Finetuned from model [optional]:** [More Information Needed]
-### Model Sources [optional]
-<!-- Provide the basic links for the model. -->
-- **Repository:** [More Information Needed]
-- **Paper [optional]:** [More Information Needed]
-- **Demo [optional]:** [More Information Needed]
-## Uses
-<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
-### Direct Use
-<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
-[More Information Needed]
-### Downstream Use [optional]
-<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
-[More Information Needed]
-### Out-of-Scope Use
-<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
-[More Information Needed]
-## Bias, Risks, and Limitations
-<!-- This section is meant to convey both technical and sociotechnical limitations. -->
-[More Information Needed]
-### Recommendations
-<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
-Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
-## How to Get Started with the Model
-Use the code below to get started with the model.
-[More Information Needed]
-## Training Details
-### Training Data
-<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
-[More Information Needed]
-### Training Procedure
-<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
-#### Preprocessing [optional]
-[More Information Needed]
-#### Training Hyperparameters
-- **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
-#### Speeds, Sizes, Times [optional]
-<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
-[More Information Needed]
-## Evaluation
-<!-- This section describes the evaluation protocols and provides the results. -->
-### Testing Data, Factors & Metrics
-#### Testing Data
-<!-- This should link to a Dataset Card if possible. -->
-[More Information Needed]
-#### Factors
-<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
-[More Information Needed]
-#### Metrics
-<!-- These are the evaluation metrics being used, ideally with a description of why. -->
-[More Information Needed]
-### Results
-[More Information Needed]
-#### Summary
-## Model Examination [optional]
-<!-- Relevant interpretability work for the model goes here -->
-[More Information Needed]
-## Environmental Impact
-<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
-Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
-- **Hardware Type:** [More Information Needed]
-- **Hours used:** [More Information Needed]
-- **Cloud Provider:** [More Information Needed]
-- **Compute Region:** [More Information Needed]
-- **Carbon Emitted:** [More Information Needed]
-## Technical Specifications [optional]
-### Model Architecture and Objective
-[More Information Needed]
-### Compute Infrastructure
-[More Information Needed]
-#### Hardware
-[More Information Needed]
-#### Software
-[More Information Needed]
-## Citation [optional]
-<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
-**BibTeX:**
-[More Information Needed]
-**APA:**
-[More Information Needed]
-## Glossary [optional]
-<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
-[More Information Needed]
-## More Information [optional]
-[More Information Needed]
-## Model Card Authors [optional]
-[More Information Needed]
-## Model Card Contact
-[More Information Needed]

+# Dimple-7B 🧊
+**Dimple** is the first Discrete Diffusion Multimodal Large Language Model (DMLLM) that leverages a hybrid training paradigm combining autoregressive and diffusion-based instruction tuning. Its architecture is similar to Qwen and LLaVA, and it introduces a novel **autoregressive-then-diffusion** training strategy:
+* **Stage 1**: Autoregressive fine-tuning for alignment and initial instruction tuning.
+* **Stage 2**: Diffusion-based fine-tuning for enhanced instruction-following capabilities.
+Trained on the same dataset as LLaVA-NEXT, **Dimple-7B surpasses LLaVA-NEXT-7B by 3.9%**, demonstrating that diffusion-based multimodal language models can match their autoregressive counterparts under a similar training budget.
+---
+## 🔍 Highlights
+* **Hybrid Training**: Combines autoregressive and diffusion training.
+* **Diffusion Decoding**: Supports confident decoding, MaskGIT-style decoding, and entropy-based decoding.
+* **Controllable Generation**: Enables fine-grained control over format, structure, and length via structure priors.
+* **Autoregressive-like Prefilling**: Enhances inference speed using prefilling techniques.
+---
+## 📊 Evaluation Results
+| Benchmark             | Dimple-7B (ours) | LLaVA-1.5-7B | LLaVA-NEXT-7B | Eagle-7B | Eagle2-9B | Qwen-VL-7B | Qwen2.5-VL-7B |
+| --------------------- | ---------------- | ------------ | ------------- | -------- | --------- | ---------- | ------------- |
+| **Training Samples**  | 1.3M             | 1.2M         | 1.3M          | 2.4M     | 27.8M     | 1.5B       | -             |
+| **Training Tokens**   | 0.8B             | -            | -             | -        | -         | -          | 2.6T          |
+| **Base LLM**          | Dream (Qwen2.5)  | Vicuna       | Vicuna-1.5    | Vicuna   | Qwen2.5   | Qwen       | Qwen2.5       |
+| **GQA**               | 59.2             | 62.0         | 64.8          | 64.9 | -         | 59.3       | -             |
+| **MMBench (en test)** | 74.6         | 64.3         | 68.7          | 68.4     | -         | -          | 83.5      |
+| **MME (Perception)**  | 1514             | 1510         | 1519          | 1528 | -         | -          | -             |
+| **MME (Cognition)**   | 432          | -            | 332           | -        | -         | -          | -             |
+| **MME (Total)**       | 1946         | -            | 1851          | -        | -         | -          | 2347      |
+| **POPE**              | 86.2             | 85.8         | 86.7          | 88.8 | -         | -          | -             |
+| **MMMU (val)**        | 45.2         | -            | 35.8          | 36.3     | 56.1      | -          | 58.6      |
+| **SQA (img)**         | 77.1         | 66.8         | 72.8          | 70.0     | -         | -          | -             |
+| **AI2D**              | 74.4         | -            | 65.4          | -        | 83.9  | 62.3       | 83.9      |
+| **ChartQA**           | 63.4             | -            | 54.9          | 67.7 | 86.4  | 65.7       | 87.3      |
+| **TextVQA**           | 61.6             | -            | 64.8      | -        | 83.0  | -          | -             |
+| **OCRBench**          | 565          | -            | 490           | 529      | -         | -          | -             |
+| **MathVista (mini)**  | 42.3         | -            | 33.0          | -        | 63.8  | 37.0       | 68.2      |
+| **MMVet**             | 41.2             | 31.1         | 47.3      | -        | 62.2  | -          | 67.1      |
+---
+## 🛠️ Environment
+Make sure your environment includes the following package versions, for example by installing them with pip:
+```bash
+pip install transformers==4.46.2 \
+    torch==2.5.1 \
+    accelerate==1.6.0
+```
+---
+## 🚀 Inference Example
+```python
+import torch
+from transformers import AutoProcessor, AutoModel
+import requests
+from PIL import Image
+model_name = "rp-yu/Dimple-7B"
+processor = AutoProcessor.from_pretrained(
+    model_name,
+    trust_remote_code=True
+)
+model = AutoModel.from_pretrained(
+    model_name,
+    torch_dtype=torch.bfloat16,
+    trust_remote_code=True,
+)
+image_url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
+messages = [
+    [{"role": "user", "content": [
+        {"type": "image", "image": image_url},
+        {"type": "text", "text": "Describe this image."}
+    ]}],
+]
+text = processor.apply_chat_template(
+    messages, tokenize=False, add_generation_prompt=True, add_vision_id=False
+)
+images = [
+    Image.open(requests.get(image_url, stream=True).raw).convert("RGB").resize((336, 336), Image.LANCZOS)
+]
+inputs = processor(
+    text=text,
+    images=images,
+    videos=None,
+    padding="longest",
+    return_tensors="pt",
+)
+# Pass input_ids positionally; the remaining processor outputs are forwarded through **inputs.
+input_ids = inputs.pop("input_ids")
+output = model.diffusion_generate(
+    input_ids,
+    max_new_tokens=64,
+    output_history=True,
+    return_dict_in_generate=True,
+    steps=64,
+    temperature=0.2,
+    top_p=0.95,
+    alg="maskgit_plus",
+    use_cache=True,
+    alg_p_threshold=0.95,
+    use_original_confidence=True,
+    decoding_pipeline="dim",
+    **inputs
+)
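+# Drop the prompt tokens from each sequence and decode only the newly generated part.
+generations = [
+    processor.tokenizer.decode(g[len(p):].cpu().tolist())
+    for p, g in zip(input_ids, output.sequences)
+]
+# Print each answer, truncated at the first end-of-sequence token.
+for j, generation in enumerate(generations):
+    print("output:", j, generation.split(processor.tokenizer.eos_token)[0])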
+```
+---
+## 📚 Citation
+> Citation information will be provided soon.
+> Please stay tuned if you are interested in citing **Dimple** in your work.
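
The Highlights in the card mention confidence-based ("confident") and MaskGIT-style decoding without showing what such a loop looks like. The following is a toy sketch of the general idea only, not Dimple's implementation: `toy_model`, `confident_decode`, the random logits, and the rule of always committing at least one position per step are all assumptions made for illustration. The `threshold` argument is analogous in spirit to options such as `alg_p_threshold` in the inference example, but the model's actual semantics are defined by its remote code.

```python
import math
import random

MASK = None  # placeholder for a position that has not been decoded yet


def toy_model(tokens, vocab_size, rng):
    """Hypothetical stand-in for a denoiser: one probability distribution per position."""
    probs = []
    for _ in tokens:
        logits = [rng.gauss(0.0, 2.0) for _ in range(vocab_size)]
        peak = max(logits)
        exps = [math.exp(x - peak) for x in logits]  # numerically stable softmax
        total = sum(exps)
        probs.append([e / total for e in exps])
    return probs


def confident_decode(length, vocab_size, threshold=0.95, max_steps=64, seed=0):
    """Start fully masked and commit, at each step, the positions the model is sure about."""
    rng = random.Random(seed)
    tokens = [MASK] * length
    steps_used = 0
    for _ in range(max_steps):
        masked = [i for i, t in enumerate(tokens) if t is MASK]
        if not masked:
            break
        steps_used += 1
        probs = toy_model(tokens, vocab_size, rng)
        # Most likely token and its probability for every still-masked position.
        conf = {i: max(probs[i]) for i in masked}
        best = {i: probs[i].index(conf[i]) for i in masked}
        # Commit every position above the threshold (assumption: at least one per step,
        # so the loop always makes progress).
        chosen = [i for i in masked if conf[i] >= threshold]
        if not chosen:
            chosen = [max(masked, key=lambda i: conf[i])]
        for i in chosen:
            tokens[i] = best[i]
    return tokens, steps_used


if __name__ == "__main__":
    decoded, steps = confident_decode(length=8, vocab_size=5, threshold=0.9)
    print("tokens:", decoded, "steps:", steps)
```

Because several positions can clear the threshold in the same step, a loop of this kind can finish in fewer steps than there are tokens, which is the usual motivation for parallel decoding in masked diffusion models.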