nielsr (HF Staff) committed · verified
Commit 166923c · 1 Parent(s): 6d48df2

Improve model card: Add metadata and sample usage


This PR enhances the model card by adding crucial metadata and a clear sample usage section (the resulting front matter block is reproduced in full after the list):

- `pipeline_tag: robotics`: Categorizes the model for better discoverability under robotics/navigation tasks.
- `library_name: transformers`: Indicates compatibility with the Hugging Face Transformers library, as evidenced by the repository's `config.json`, and enables the automated "Use in Transformers" widget.
- `license: cc-by-nc-sa-4.0`: Formally specifies the model's license, consistent with the existing badge and GitHub repository information.
- `tags`: Added `vision-language-model` and `navigation` for further discoverability.
- The official Hugging Face paper link is added under the title for immediate visibility.
- A Python code snippet for sample usage with the `transformers` library is included, demonstrating how to load the model and run inference; it follows the typical usage pattern for `transformers`-compatible vision-language models.

These updates aim to provide users with more complete information and an easier getting started experience.
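For reference, the complete front matter block this PR adds to the top of `README.md` (taken verbatim from the diff below) is:

```yaml
---
pipeline_tag: robotics
library_name: transformers
license: cc-by-nc-sa-4.0
tags:
- vision-language-model
- navigation
---
```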

Files changed (1)
  1. README.md +68 -10
README.md CHANGED
@@ -1,5 +1,16 @@
+---
+pipeline_tag: robotics
+library_name: transformers
+license: cc-by-nc-sa-4.0
+tags:
+- vision-language-model
+- navigation
+---
+
 # InternVLA-N1 Model Series
 
+This model was presented in the paper [Ground Slow, Move Fast: A Dual-System Foundation Model for Generalizable Vision-and-Language Navigation](https://huggingface.co/papers/2512.08186).
+
 ![License](https://img.shields.io/badge/License-CC_BY--NC--SA_4.0-lightgrey.svg)
 ![Transformers](https://img.shields.io/badge/%F0%9F%A4%97%20Transformers-9cf?style=flat)
 ![PyTorch](https://img.shields.io/badge/PyTorch-EE4C2C?logo=pytorch&logoColor=white)
@@ -43,20 +54,68 @@ InternVLA-N1 is a state-of-the-art navigation foundation model built on a **mult
 ## Model Variants
 
 | Model Variant | Description | Key Characteristics |
-|--------------|-------------|----------------------|
-| [**InternVLA-N1 (S2)**](https://huggingface.co/InternRobotics/InternVLA-N1-System2) | Finetuned Qwen2.5-VL model for pixel-goal grounding | Strong System 2 module; compatible with decoupled System 1 controllers or joint optimization pipelines |
-| [**InternVLA-N1 (Dual System) _w/ NavDP\*_**](https://huggingface.co/InternRobotics/InternVLA-N1-w-NavDP) | Jointly tuned System 1 (NavDP*) and InternVLA-N1 (S2) | Optimized end-to-end performance; uses RGB-D observations |
-| [**InternVLA-N1 (Dual System) _DualVLN_**](https://huggingface.co/InternRobotics/InternVLA-N1-DualVLN) | Latest dual-system architecture | Optimized end-to-end performance and faster convergence; uses RGB observations |
-
-
+|--------------|-------------|----------------------|
+| [**InternVLA-N1 (S2)**](https://huggingface.co/InternRobotics/InternVLA-N1-System2) | Finetuned Qwen2.5-VL model for pixel-goal grounding | Strong System 2 module; compatible with decoupled System 1 controllers or joint optimization pipelines |
+| [**InternVLA-N1 (Dual System) _w/ NavDP\*_**](https://huggingface.co/InternRobotics/InternVLA-N1-w-NavDP) | Jointly tuned System 1 (NavDP\*) and InternVLA-N1 (S2) | Optimized end-to-end performance; uses RGB-D observations |
+| [**InternVLA-N1 (Dual System) _DualVLN_**](https://huggingface.co/InternRobotics/InternVLA-N1-DualVLN) | Latest dual-system architecture | Optimized end-to-end performance and faster convergence; uses RGB observations |
 
 
 > The previously released version is now called [InternVLA-N1-wo-dagger](https://huggingface.co/InternRobotics/InternVLA-N1-wo-dagger). The latest official release is recommended for best performance.
 
 ---
 
-## Usage
-For inference, evaluation, and the Gradio demo, please refer to the [InternNav repository](https://github.com/InternRobotics/InternNav).
+## Sample Usage
+
+This model is compatible with the Hugging Face `transformers` library. The following code snippet demonstrates how to perform inference:
+
+```python
+import torch
+from PIL import Image
+from transformers import AutoProcessor, AutoModelForCausalLM
+import requests
+from io import BytesIO
+
+# Load the model and processor
+hf_model_id = "InternRobotics/InternVLA-N1-DualVLN"
+model = AutoModelForCausalLM.from_pretrained(hf_model_id, torch_dtype=torch.float16, trust_remote_code=True, device_map="cuda")
+processor = AutoProcessor.from_pretrained(hf_model_id, trust_remote_code=True)
+
+# Load an example image; replace with your actual image path or a URL to a relevant scene
+image_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/bird_image.jpg"
+image = Image.open(BytesIO(requests.get(image_url).content)).convert("RGB")
+
+# Define a question related to navigation or visual understanding
+question = "What is the most direct path to the kitchen from here? Describe the first few steps."
+
+messages = [
+    {"role": "user", "content": f"<|image_pad|>{question}"},
+]
+
+# Tokenize the chat prompt; return_dict=True returns a mapping that can be unpacked into generate()
+inputs = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt")
+inputs = inputs.to(model.device)
+# Preprocess the image into pixel values with the processor's bundled image processor
+pixel_values = processor.image_processor(images=image, return_tensors="pt")["pixel_values"]
+pixel_values = pixel_values.to(model.device, dtype=torch.float16)
+
+# Generate a response
+with torch.inference_mode():
+    outputs = model.generate(
+        **inputs,
+        pixel_values=pixel_values,
+        do_sample=True,
+        temperature=0.7,
+        max_new_tokens=1024,
+        eos_token_id=processor.tokenizer.eos_token_id,
+        repetition_penalty=1.05,
+    )
+
+response = processor.decode(outputs[0], skip_special_tokens=True)
+print(f"User: {question}\nAssistant: {response}")
+```
+
+For more detailed usage (inference, evaluation, and the Gradio demo), please refer to the [InternNav repository](https://github.com/InternRobotics/InternNav).
 
 ---
 
@@ -85,5 +144,4 @@ If you find our work helpful, please consider starring this repository 🌟 and
   primaryClass={cs.RO},
   url={https://arxiv.org/abs/2512.08186},
 }
-
-
+```