RangiLyu committed on
Commit
2e2fc6f
·
verified ·
1 Parent(s): 91ab419

Update README.md

Files changed (1)
  1. README.md +438 -3
README.md CHANGED
@@ -1,3 +1,438 @@
1
- ---
2
- license: apache-2.0
3
- ---
1
+ ---
2
+ license: apache-2.0
3
+ pipeline_tag: image-text-to-text
4
+ ---
5
+
6
+
7
+ ## Intern-S1
8
+
9
+
10
+ <div align="center">
11
+ <img src="https://cdn-uploads.huggingface.co/production/uploads/642695e5274e7ad464c8a5ba/E43cgEXBRWjVJlU_-hdh6.png" />
12
+
13
+ <div>&nbsp;</div>
14
+
15
+ [💻GitHub Repo](https://github.com/InternLM/Intern-S1) • [🤗Model Collections](https://huggingface.co/collections/internlm/intern-s1-6882e325e8ac1c58ba108aa5) • [📜Technical Report (coming soon)]()
16
+
17
+ </div>
18
+
19
+ ## Introduction
20
+
21
+ We introduce **Intern-S1**, our **most advanced open-source multimodal reasoning model** to date. Intern-S1 combines **strong general-task capabilities with state-of-the-art performance on a wide range of scientific tasks**, rivaling leading closed-source commercial models.
22
+ Built upon a 235B MoE language model and a 6B vision encoder, Intern-S1 has been further pretrained on **5 trillion tokens** of multimodal data, including over **2.5 trillion scientific-domain tokens**. This enables the model to retain strong general capabilities while excelling in specialized scientific domains such as **interpreting chemical structures, understanding protein sequences, and planning compound synthesis routes**, making Intern-S1 a capable research assistant for real-world scientific applications.
23
+ ### Features
24
+
25
+ - Strong performance across language and vision reasoning benchmarks, especially scientific tasks.
26
+
27
+ - Continuously pretrained on a massive 5T token dataset, with over 50% specialized scientific data, embedding deep domain expertise.
28
+
29
+ - Dynamic tokenizer enables native understanding of molecular formulas, protein sequences, and seismic signals.
30
+
31
+ ## Performance
32
+
33
+ We evaluate Intern-S1 on a range of benchmarks covering both general and scientific datasets. The table below compares its performance with recent VLMs and LLMs.
34
+
35
+ <table>
36
+ <thead>
37
+ <tr>
38
+ <th rowspan="2">Benchmarks</th>
39
+ <th colspan="2">Intern-S1</th>
40
+ <th>InternVL3-78B</th>
41
+ <th>Qwen2.5-VL-72B</th>
42
+ <th>DS-R1-0528</th>
43
+ <th>Qwen3-235B-A22B</th>
44
+ <th>Kimi-K2-Instruct</th>
45
+ <th>Gemini-2.5 Pro</th>
46
+ <th>o3</th>
47
+ <th>Grok-4</th>
48
+ </tr>
49
+ </thead>
50
+ <tbody>
51
+ <tr><td>MMLU-Pro</td><td colspan="2">83.5 ✅</td><td>73.0</td><td>72.1</td><td>83.4</td><td>82.2</td><td>82.7</td><td>86.0</td><td>85.0</td><td>85.9</td></tr>
52
+ <tr><td>MMMU</td><td colspan="2">77.7 ✅</td><td>72.2</td><td>70.2</td><td>-</td><td>-</td><td>-</td><td>81.9</td><td>80.8</td><td>77.9</td></tr>
53
+ <tr><td>GPQA</td><td colspan="2">77.3</td><td>49.9</td><td>49.0</td><td>80.6</td><td>71.1</td><td>77.8</td><td>83.8</td><td>83.3</td><td>87.5</td></tr>
54
+ <tr><td>MMStar</td><td colspan="2">74.9 ✅</td><td>72.5</td><td>70.8</td><td>-</td><td>-</td><td>-</td><td>79.3</td><td>75.1</td><td>69.6</td></tr>
55
+ <tr><td>MathVista</td><td colspan="2">81.5 👑</td><td>79.0</td><td>74.8</td><td>-</td><td>-</td><td>-</td><td>80.3</td><td>77.5</td><td>72.5</td></tr>
56
+ <tr><td>AIME2025</td><td colspan="2">86.0</td><td>10.7</td><td>10.9</td><td>87.5</td><td>81.5</td><td>51.4</td><td>83.0</td><td>88.9</td><td>91.7</td></tr>
57
+ <tr><td>MathVision</td><td colspan="2">62.5 ✅</td><td>43.1</td><td>38.1</td><td>-</td><td>-</td><td>-</td><td>73.0</td><td>67.7</td><td>67.3</td></tr>
58
+ <tr><td>IFEval</td><td colspan="2">86.7</td><td>75.6</td><td>83.9</td><td>79.7</td><td>85.0</td><td>90.2</td><td>91.5</td><td>92.2</td><td>92.8</td></tr>
59
+ <tr><td>SFE</td><td colspan="2">44.3 👑</td><td>36.2</td><td>30.5</td><td>-</td><td>-</td><td>-</td><td>43.0</td><td>37.7</td><td>31.2</td></tr>
60
+ <tr><td>Physics</td><td colspan="2">44.0 ✅</td><td>23.1</td><td>15.7</td><td>-</td><td>-</td><td>-</td><td>40.0</td><td>47.9</td><td>42.8</td></tr>
61
+ <tr><td>SmolInstruct</td><td colspan="2">51.0 👑</td><td>19.4</td><td>21.0</td><td>30.7</td><td>28.7</td><td>48.1</td><td>40.4</td><td>43.9</td><td>47.3</td></tr>
62
+ <tr><td>ChemBench</td><td colspan="2">83.4 👑</td><td>61.3</td><td>61.6</td><td>75.6</td><td>75.8</td><td>75.3</td><td>82.8</td><td>81.6</td><td>83.3</td></tr>
63
+ <tr><td>MatBench</td><td colspan="2">75.0 👑</td><td>49.3</td><td>51.5</td><td>57.7</td><td>52.1</td><td>61.7</td><td>61.7</td><td>61.6</td><td>67.9</td></tr>
64
+ <tr><td>MicroVQA</td><td colspan="2">63.9 👑</td><td>59.1</td><td>53.0</td><td>-</td><td>-</td><td>-</td><td>63.1</td><td>58.3</td><td>59.5</td></tr>
65
+ <tr><td>ProteinLMBench</td><td colspan="2">63.1</td><td>61.6</td><td>61.0</td><td>61.4</td><td>59.8</td><td>66.7</td><td>62.9</td><td>67.7</td><td>66.2</td></tr>
66
+ <tr><td>MSEarthMCQ</td><td colspan="2">65.7 👑</td><td>57.2</td><td>37.6</td><td>-</td><td>-</td><td>-</td><td>59.9</td><td>61.0</td><td>58.0</td></tr>
67
+ <tr><td>XLRS-Bench</td><td colspan="2">55.0 👑</td><td>49.3</td><td>50.9</td><td>-</td><td>-</td><td>-</td><td>45.2</td><td>43.6</td><td>45.4</td></tr>
68
+ </tbody>
69
+ </table>
70
+
71
+ > **Note**: ✅ marks the best performance among open-source models; 👑 marks the best performance among all models.
72
+
73
+ We use [OpenCompass](https://github.com/open-compass/OpenCompass/) and [VLMEvalKit](https://github.com/open-compass/vlmevalkit) to evaluate all models.
74
+
75
+
76
+ ## Quick Start
77
+
78
+ ### Sampling Parameters
79
+
80
+ We recommend using the following hyperparameters for better results:
81
+
82
+ ```python
83
+ top_p = 1.0
84
+ top_k = 50
85
+ min_p = 0.0
86
+ temperature = 0.7
87
+ ```
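As a minimal sketch, the recommended values can be collected into a dictionary and passed to `model.generate` as keyword arguments. The helper name `build_sampling_kwargs` is hypothetical, and `do_sample=True` is added on the assumption that sampling has to be switched on for `top_p`, `top_k`, `min_p`, and `temperature` to take effect in Transformers.

```python
# Recommended sampling parameters from this README.
RECOMMENDED_SAMPLING = {
    "top_p": 1.0,
    "top_k": 50,
    "min_p": 0.0,
    "temperature": 0.7,
}


def build_sampling_kwargs(**overrides):
    """Return generate() keyword arguments based on the recommended values.

    Hypothetical helper for illustration; it is not part of any library.
    """
    # Assumption: do_sample=True is needed so the sampling values are used.
    kwargs = {"do_sample": True, **RECOMMENDED_SAMPLING}
    unknown = set(overrides) - set(kwargs) - {"max_new_tokens"}
    if unknown:
        raise ValueError(f"Unexpected sampling options: {sorted(unknown)}")
    kwargs.update(overrides)
    return kwargs


sampling_kwargs = build_sampling_kwargs(max_new_tokens=1024)
print(sampling_kwargs)
# With a loaded model and prepared inputs, this could be used as:
# generate_ids = model.generate(**inputs, **sampling_kwargs)
```

Keeping the defaults in one place makes it easy to override a single value per request without retyping the rest.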
88
+
89
+ ### Transformers
90
+
91
+ The following demo code shows how to generate from text and multimodal inputs.
92
+
93
+ > **Please use transformers>=4.53.0 to ensure the model works normally.**
94
+
95
+ #### Text input
96
+
97
+ ```python
98
+ from transformers import AutoProcessor, AutoModelForCausalLM
99
+ import torch
100
+
101
+ model_name = "internlm/Intern-S1"
102
+ processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)
103
+ model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", torch_dtype="auto", trust_remote_code=True)
104
+
105
+ messages = [
106
+     {
107
+         "role": "user",
108
+         "content": [
109
+             {"type": "text", "text": "tell me about an interesting physical phenomenon."},
110
+         ],
111
+     }
112
+ ]
113
+
114
+ inputs = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt").to(model.device, dtype=torch.bfloat16)
115
+
116
+ generate_ids = model.generate(**inputs, max_new_tokens=32768)
117
+ decoded_output = processor.decode(generate_ids[0, inputs["input_ids"].shape[1] :], skip_special_tokens=True)
118
+ print(decoded_output)
119
+ ```
120
+
121
+ #### Image input
122
+
123
+ ```python
124
+ from transformers import AutoProcessor, AutoModelForCausalLM
125
+ import torch
126
+
127
+ model_name = "internlm/Intern-S1"
128
+ processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)
129
+ model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", torch_dtype="auto", trust_remote_code=True)
130
+
131
+ messages = [
132
+     {
133
+         "role": "user",
134
+         "content": [
135
+             {"type": "image", "url": "http://images.cocodataset.org/val2017/000000039769.jpg"},
136
+             {"type": "text", "text": "Please describe the image explicitly."},
137
+         ],
138
+     }
139
+ ]
140
+
141
+ inputs = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt").to(model.device, dtype=torch.bfloat16)
142
+
143
+ generate_ids = model.generate(**inputs, max_new_tokens=32768)
144
+ decoded_output = processor.decode(generate_ids[0, inputs["input_ids"].shape[1] :], skip_special_tokens=True)
145
+ print(decoded_output)
146
+ ```
147
+
148
+ #### Video input
149
+
150
+ Please ensure that the decord video decoding library is installed via `pip install decord`.
151
+
152
+ ```python
153
+ from transformers import AutoProcessor, AutoModelForCausalLM
154
+ import torch
155
+
156
+ model_name = "internlm/Intern-S1"
157
+ processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)
158
+ model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", torch_dtype="auto", trust_remote_code=True)
159
+
160
+ messages = [
161
+     {
162
+         "role": "user",
163
+         "content": [
164
+             {
165
+                 "type": "video",
166
+                 "url": "https://huggingface.co/datasets/hf-internal-testing/fixtures_videos/resolve/main/tennis.mp4",
167
+             },
168
+             {"type": "text", "text": "What type of shot is the man performing?"},
169
+         ],
170
+     }
171
+ ]
172
+
173
+ inputs = processor.apply_chat_template(
174
+     messages,
175
+     return_tensors="pt",
176
+     add_generation_prompt=True,
177
+     video_load_backend="decord",
178
+     tokenize=True,
179
+     return_dict=True,
180
+ ).to(model.device, dtype=torch.bfloat16)
181
+
182
+ generate_ids = model.generate(**inputs, max_new_tokens=32768)
183
+ decoded_output = processor.decode(generate_ids[0, inputs["input_ids"].shape[1] :], skip_special_tokens=True)
184
+ print(decoded_output)
185
+ ```
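The three Transformers examples above differ only in the `content` list of the user message. As a rough sketch, a small helper can assemble that list for text, image, or video prompts; `build_user_message` is a hypothetical name introduced here for illustration, and the dictionary keys simply mirror the ones used in the examples.

```python
def build_user_message(text, image_url=None, video_url=None):
    """Build one user message in the content-list format used above.

    Hypothetical helper; media entries are placed before the text entry,
    matching the order in the image and video examples.
    """
    content = []
    if image_url is not None:
        content.append({"type": "image", "url": image_url})
    if video_url is not None:
        content.append({"type": "video", "url": video_url})
    content.append({"type": "text", "text": text})
    return {"role": "user", "content": content}


messages = [
    build_user_message(
        "Please describe the image explicitly.",
        image_url="http://images.cocodataset.org/val2017/000000039769.jpg",
    )
]
print(messages)
```

The resulting `messages` list can then be passed to `processor.apply_chat_template` exactly as in the examples.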
186
+
187
+ ### Serving
188
+
189
+ You can use one of the following LLM inference frameworks to launch an OpenAI-compatible server:
190
+
191
+ #### [lmdeploy(>=0.9.2)](https://github.com/InternLM/lmdeploy)
192
+
193
+ ```bash
194
+ lmdeploy serve api_server internlm/Intern-S1-FP8 --reasoning-parser intern-s1 --tool-call-parser intern-s1 --tp 4
195
+ ```
196
+
197
+ #### [vllm](https://github.com/vllm-project/vllm)
198
+
199
+ Coming soon.
200
+
201
+ #### [sglang](https://github.com/sgl-project/sglang)
202
+
203
+ Supporting Intern-S1 with SGLang is still in progress. Please refer to this [PR](https://github.com/sgl-project/sglang/pull/8350).
204
+
205
+ ```bash
206
+ CUDA_VISIBLE_DEVICES=0,1,2,3 \
207
+ python3 -m sglang.launch_server \
208
+ --model-path internlm/Intern-S1-FP8 \
209
+ --trust-remote-code \
210
+ --tp 4 \
211
+ --port 8001 \
212
+ --mem-fraction-static 0.85 \
213
+ --enable-multimodal \
214
+ --grammar-backend none
215
+ ```
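Once a server is running, any HTTP client can talk to its OpenAI-compatible chat endpoint. The sketch below uses only the Python standard library and builds the request without sending it. The helper names are hypothetical, the base URL assumes the SGLang command above (port 8001), and the served model name is assumed to match the model path; check what your server actually reports before relying on either.

```python
import json
import urllib.request


def build_chat_request(base_url, model, prompt, temperature=0.7, top_p=1.0, max_tokens=2048):
    """Build (but do not send) a POST request for the chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "top_p": top_p,
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request, timeout=600):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.loads(resp.read().decode("utf-8"))
    return body["choices"][0]["message"]["content"]


# Assumed address and model name; adjust both to your deployment.
request = build_chat_request("http://0.0.0.0:8001/v1", "internlm/Intern-S1-FP8", "Hello!")
print(request.full_url)
# reply = send_chat_request(request)  # uncomment once a server is running
```

Separating request construction from sending keeps the payload easy to inspect and test without a live server.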
216
+
217
+ ## Advanced Usage
218
+
219
+ ### Tool Calling
220
+
221
+ Many Large Language Models (LLMs) now feature **Tool Calling**, a powerful capability that allows them to extend their functionality by interacting with external tools and APIs. This enables models to perform tasks like fetching up-to-the-minute information, running code, or calling functions within other applications.
222
+
223
+ A key advantage for developers is that a growing number of open-source LLMs are designed to be compatible with the OpenAI API. This means you can use the same syntax and structure from the OpenAI library to implement tool calling with these open-source models. As a result, the code in this section works not just with OpenAI models, but with any model served behind the same interface.
224
+
225
+ To illustrate how this works, the following example uses tool calling to answer a weather question (served through the LMDeploy API server). The two temperature functions return fixed values and stand in for a real weather service.
226
+
227
+ ```python
228
+
229
+ from openai import OpenAI
230
+ import json
231
+
232
+
233
+ def get_current_temperature(location: str, unit: str = "celsius"):
234
+     """Get current temperature at a location.
235
+
236
+     Args:
237
+         location: The location to get the temperature for, in the format "City, State, Country".
238
+         unit: The unit to return the temperature in. Defaults to "celsius". (choices: ["celsius", "fahrenheit"])
239
+
240
+     Returns:
241
+         the temperature, the location, and the unit in a dict
242
+     """
243
+     return {
244
+         "temperature": 26.1,
245
+         "location": location,
246
+         "unit": unit,
247
+     }
248
+
249
+
250
+ def get_temperature_date(location: str, date: str, unit: str = "celsius"):
251
+     """Get temperature at a location and date.
252
+
253
+     Args:
254
+         location: The location to get the temperature for, in the format "City, State, Country".
255
+         date: The date to get the temperature for, in the format "Year-Month-Day".
256
+         unit: The unit to return the temperature in. Defaults to "celsius". (choices: ["celsius", "fahrenheit"])
257
+
258
+     Returns:
259
+         the temperature, the location, the date and the unit in a dict
260
+     """
261
+     return {
262
+         "temperature": 25.9,
263
+         "location": location,
264
+         "date": date,
265
+         "unit": unit,
266
+     }
267
+
268
+ def get_function_by_name(name):
269
+     if name == "get_current_temperature":
270
+         return get_current_temperature
271
+     if name == "get_temperature_date":
272
+         return get_temperature_date
273
+
274
+ tools = [{
275
+     'type': 'function',
276
+     'function': {
277
+         'name': 'get_current_temperature',
278
+         'description': 'Get current temperature at a location.',
279
+         'parameters': {
280
+             'type': 'object',
281
+             'properties': {
282
+                 'location': {
283
+                     'type': 'string',
284
+                     'description': 'The location to get the temperature for, in the format \'City, State, Country\'.'
285
+                 },
286
+                 'unit': {
287
+                     'type': 'string',
288
+                     'enum': [
289
+                         'celsius',
290
+                         'fahrenheit'
291
+                     ],
292
+                     'description': 'The unit to return the temperature in. Defaults to \'celsius\'.'
293
+                 }
294
+             },
295
+             'required': [
296
+                 'location'
297
+             ]
298
+         }
299
+     }
300
+ }, {
301
+     'type': 'function',
302
+     'function': {
303
+         'name': 'get_temperature_date',
304
+         'description': 'Get temperature at a location and date.',
305
+         'parameters': {
306
+             'type': 'object',
307
+             'properties': {
308
+                 'location': {
309
+                     'type': 'string',
310
+                     'description': 'The location to get the temperature for, in the format \'City, State, Country\'.'
311
+                 },
312
+                 'date': {
313
+                     'type': 'string',
314
+                     'description': 'The date to get the temperature for, in the format \'Year-Month-Day\'.'
315
+                 },
316
+                 'unit': {
317
+                     'type': 'string',
318
+                     'enum': [
319
+                         'celsius',
320
+                         'fahrenheit'
321
+                     ],
322
+                     'description': 'The unit to return the temperature in. Defaults to \'celsius\'.'
323
+                 }
324
+             },
325
+             'required': [
326
+                 'location',
327
+                 'date'
328
+             ]
329
+         }
330
+     }
331
+ }]
332
+
333
+
334
+
335
+ messages = [
336
+     {'role': 'user', 'content': 'Today is 2024-11-14, What\'s the temperature in San Francisco now? How about tomorrow?'}
337
+ ]
338
+
339
+ openai_api_key = "EMPTY"
340
+ openai_api_base = "http://0.0.0.0:23333/v1"
341
+ client = OpenAI(
342
+     api_key=openai_api_key,
343
+     base_url=openai_api_base,
344
+ )
345
+ model_name = client.models.list().data[0].id
346
+ response = client.chat.completions.create(
347
+     model=model_name,
348
+     messages=messages,
349
+     max_tokens=32768,
350
+     temperature=0.8,
351
+     top_p=0.8,
352
+     stream=False,
353
+     extra_body=dict(spaces_between_special_tokens=False, enable_thinking=False),
354
+     tools=tools)
355
+ print(response.choices[0].message)
356
+ messages.append(response.choices[0].message)
357
+
358
+ for tool_call in response.choices[0].message.tool_calls or []:
359
+     tool_call_args = json.loads(tool_call.function.arguments)
360
+     tool_call_result = get_function_by_name(tool_call.function.name)(**tool_call_args)
361
+     tool_call_result = json.dumps(tool_call_result, ensure_ascii=False)
362
+     messages.append({
363
+         'role': 'tool',
364
+         'name': tool_call.function.name,
365
+         'content': tool_call_result,
366
+         'tool_call_id': tool_call.id
367
+     })
368
+
369
+ response = client.chat.completions.create(
370
+     model=model_name,
371
+     messages=messages,
372
+     temperature=0.8,
373
+     top_p=0.8,
374
+     stream=False,
375
+     extra_body=dict(spaces_between_special_tokens=False, enable_thinking=False),
376
+     tools=tools)
377
+ print(response.choices[0].message.content)
378
+ ```
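The dispatch step in the example above assumes the model always names a known function and emits valid JSON arguments. A more defensive variant, sketched below under that concern, looks the function up in a registry and reports problems back as a JSON error string that can still be sent as the `tool` message. The helper `run_tool_call` and the `registry` dictionary are hypothetical names for illustration, and the temperature function is the same fixed-value stand-in.

```python
import json


def run_tool_call(name, arguments_json, registry):
    """Execute one tool call and return a JSON string for the 'tool' message."""
    func = registry.get(name)
    if func is None:
        return json.dumps({"error": f"unknown tool: {name}"})
    try:
        arguments = json.loads(arguments_json)
        result = func(**arguments)
    except (json.JSONDecodeError, TypeError) as exc:
        # Malformed JSON, or arguments that do not match the function signature.
        return json.dumps({"error": str(exc)})
    return json.dumps(result, ensure_ascii=False)


def get_current_temperature(location, unit="celsius"):
    # Stand-in with a fixed value, mirroring the example above.
    return {"temperature": 26.1, "location": location, "unit": unit}


registry = {"get_current_temperature": get_current_temperature}
ok = run_tool_call("get_current_temperature", '{"location": "San Francisco, CA, USA"}', registry)
bad = run_tool_call("get_humidity", "{}", registry)
print(ok)
print(bad)
```

Returning the error as tool output, rather than raising, gives the model a chance to correct its call on the next turn.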
379
+
380
+ ### Switching Between Thinking and Non-Thinking Modes
381
+
382
+ Intern-S1 enables thinking mode by default, which strengthens the model's reasoning and yields higher-quality responses. To disable it, set `enable_thinking=False` in `tokenizer.apply_chat_template`:
383
+
384
+ ```python
385
+ text = tokenizer.apply_chat_template(
386
+     messages,
387
+     tokenize=False,
388
+     add_generation_prompt=True,
389
+     enable_thinking=False  # think mode indicator
390
+ )
391
+ ```
392
+
393
+ With LMDeploy serving Intern-S1 models, you can dynamically control the thinking mode by adjusting the `enable_thinking` parameter in your requests.
394
+
395
+ ```python
396
+ from openai import OpenAI
397
+ import json
398
+
399
+ messages = [
400
+     {
401
+         'role': 'user',
402
+         'content': 'who are you'
403
+     }, {
404
+         'role': 'assistant',
405
+         'content': 'I am an AI'
406
+     }, {
407
+         'role': 'user',
408
+         'content': 'AGI is?'
409
+     }]
410
+
411
+ openai_api_key = "EMPTY"
412
+ openai_api_base = "http://0.0.0.0:23333/v1"
413
+ client = OpenAI(
414
+     api_key=openai_api_key,
415
+     base_url=openai_api_base,
416
+ )
417
+ model_name = client.models.list().data[0].id
418
+
419
+ response = client.chat.completions.create(
420
+     model=model_name,
421
+     messages=messages,
422
+     temperature=0.7,
423
+     top_p=0.8,
424
+     max_tokens=2048,
425
+     extra_body={
426
+         "enable_thinking": False,
427
+     }
428
+ )
429
+ print(json.dumps(response.model_dump(), indent=2, ensure_ascii=False))
430
+ ```
431
+
432
+ For vLLM and SGLang users, pass the flag through `chat_template_kwargs` instead:
433
+
434
+ ```python
435
+ extra_body={
436
+     "chat_template_kwargs": {"enable_thinking": False}
437
+ }
438
+ ```
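Since the two request shapes above differ only by serving backend, a small helper can choose between them. This is a sketch under the assumption that the backend is known by name on the client side; `thinking_extra_body` is a hypothetical function, not part of any of these frameworks.

```python
def thinking_extra_body(backend, enable_thinking):
    """Return the `extra_body` dict that toggles thinking mode for a backend.

    Hypothetical helper; the two shapes are the ones shown in this README.
    """
    backend = backend.lower()
    if backend == "lmdeploy":
        return {"enable_thinking": enable_thinking}
    if backend in ("vllm", "sglang"):
        return {"chat_template_kwargs": {"enable_thinking": enable_thinking}}
    raise ValueError(f"Unsupported backend: {backend}")


print(thinking_extra_body("lmdeploy", False))
print(thinking_extra_body("sglang", False))
# Could then be used as:
# client.chat.completions.create(..., extra_body=thinking_extra_body("vllm", False))
```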