nielsr (HF Staff) committed · verified
Commit 166923c · 1 Parent(s): 6d48df2

Improve model card: Add metadata and sample usage


This PR enhances the model card by adding crucial metadata and a clear sample usage section (the resulting front matter block is reproduced in full after the list):

- `pipeline_tag: robotics`: Categorizes the model for better discoverability under robotics/navigation tasks.
- `library_name: transformers`: Indicates compatibility with the Hugging Face Transformers library, as evidenced by the repository's `config.json`, and enables the automated "Use in Transformers" widget.
- `license: cc-by-nc-sa-4.0`: Formally specifies the model's license, consistent with the existing badge and GitHub repository information.
- `tags`: Added `vision-language-model` and `navigation` for further discoverability.
- The official Hugging Face paper link is added under the title for immediate visibility.
- A Python code snippet for sample usage with the `transformers` library is included, demonstrating how to load the model and run inference; it follows the typical usage pattern for `transformers`-compatible vision-language models.

These updates aim to provide users with more complete information and an easier getting started experience.
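For reference, the complete front matter block this PR adds to the top of `README.md` (taken verbatim from the diff below) is:

```yaml
---
pipeline_tag: robotics
library_name: transformers
license: cc-by-nc-sa-4.0
tags:
- vision-language-model
- navigation
---
```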

Files changed (1)
  1. README.md +68 -10
README.md CHANGED
@@ -1,5 +1,16 @@
+---
+pipeline_tag: robotics
+library_name: transformers
+license: cc-by-nc-sa-4.0
+tags:
+- vision-language-model
+- navigation
+---
+
 # InternVLA-N1 Model Series
 
+This model was presented in the paper [Ground Slow, Move Fast: A Dual-System Foundation Model for Generalizable Vision-and-Language Navigation](https://huggingface.co/papers/2512.08186).
+
 ![License](https://img.shields.io/badge/License-CC_BY--NC--SA_4.0-lightgrey.svg)
 ![Transformers](https://img.shields.io/badge/%F0%9F%A4%97%20Transformers-9cf?style=flat)
 ![PyTorch](https://img.shields.io/badge/PyTorch-EE4C2C?logo=pytorch&logoColor=white)
@@ -43,20 +54,68 @@ InternVLA-N1 is a state-of-the-art navigation foundation model built on a **mult
 ## Model Variants
 
 | Model Variant | Description | Key Characteristics |
-|--------------|-------------|----------------------|
-| [**InternVLA-N1 (S2)**](https://huggingface.co/InternRobotics/InternVLA-N1-System2) | Finetuned Qwen2.5-VL model for pixel-goal grounding | Strong System 2 module; compatible with decoupled System 1 controllers or joint optimization pipelines |
-| [**InternVLA-N1 (Dual System) _w/ NavDP\*_**](https://huggingface.co/InternRobotics/InternVLA-N1-w-NavDP) | Jointly tuned System 1 (NavDP*) and InternVLA-N1 (S2) | Optimized end-to-end performance; uses RGB-D observations |
-| [**InternVLA-N1 (Dual System) _DualVLN_**](https://huggingface.co/InternRobotics/InternVLA-N1-DualVLN) | Latest dual-system architecture | Optimized end-to-end performance and faster convergence; uses RGB observations |
-
-
+|--------------|-------------|----------------------|
+| [**InternVLA-N1 (S2)**](https://huggingface.co/InternRobotics/InternVLA-N1-System2) | Finetuned Qwen2.5-VL model for pixel-goal grounding | Strong System 2 module; compatible with decoupled System 1 controllers or joint optimization pipelines |
+| [**InternVLA-N1 (Dual System) _w/ NavDP\*_**](https://huggingface.co/InternRobotics/InternVLA-N1-w-NavDP) | Jointly tuned System 1 (NavDP\*) and InternVLA-N1 (S2) | Optimized end-to-end performance; uses RGB-D observations |
+| [**InternVLA-N1 (Dual System) _DualVLN_**](https://huggingface.co/InternRobotics/InternVLA-N1-DualVLN) | Latest dual-system architecture | Optimized end-to-end performance and faster convergence; uses RGB observations |
 
 
 > The previously released version is now called [InternVLA-N1-wo-dagger](https://huggingface.co/InternRobotics/InternVLA-N1-wo-dagger). The latest official release is recommended for best performance.
 
 ---
 
-## Usage
-For inference, evaluation, and the Gradio demo, please refer to the [InternNav repository](https://github.com/InternRobotics/InternNav).
+## Sample Usage
+
+This model is compatible with the Hugging Face `transformers` library. The following code snippet demonstrates how to perform inference:
+
+```python
+import torch
+from PIL import Image
+from transformers import AutoProcessor, AutoModelForCausalLM
+import requests
+from io import BytesIO
+
+# Load the model and processor
+hf_model_id = "InternRobotics/InternVLA-N1-DualVLN"
+model = AutoModelForCausalLM.from_pretrained(hf_model_id, torch_dtype=torch.float16, trust_remote_code=True, device_map="cuda")
+processor = AutoProcessor.from_pretrained(hf_model_id, trust_remote_code=True)
+
+# Load an example image; replace with your actual image path or a URL to a relevant scene
+image_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/bird_image.jpg"
+image = Image.open(BytesIO(requests.get(image_url).content)).convert("RGB")
+
+# Define a question related to navigation or visual understanding
+question = "What is the most direct path to the kitchen from here? Describe the first few steps."
+
+messages = [
+    {"role": "user", "content": f"<|image_pad|>{question}"},
+]
+
+# Tokenize the chat prompt; return_dict=True returns a mapping that can be unpacked into generate()
+inputs = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt")
+inputs = inputs.to(model.device)
+# Preprocess the image into pixel values with the processor's bundled image processor
+pixel_values = processor.image_processor(images=image, return_tensors="pt")["pixel_values"]
+pixel_values = pixel_values.to(model.device, dtype=torch.float16)
+
+# Generate a response
+with torch.inference_mode():
+    outputs = model.generate(
+        **inputs,
+        pixel_values=pixel_values,
+        do_sample=True,
+        temperature=0.7,
+        max_new_tokens=1024,
+        eos_token_id=processor.tokenizer.eos_token_id,
+        repetition_penalty=1.05,
+    )
+
+response = processor.decode(outputs[0], skip_special_tokens=True)
+print(f"User: {question}\nAssistant: {response}")
+```
+
+For more detailed usage (inference, evaluation, and the Gradio demo), please refer to the [InternNav repository](https://github.com/InternRobotics/InternNav).
 
 ---
 
@@ -85,5 +144,4 @@ If you find our work helpful, please consider starring this repository 🌟 and
   primaryClass={cs.RO},
   url={https://arxiv.org/abs/2512.08186},
 }
-
-
+```