Improve model card: Add paper info, project link, Transformers usage, and tags
#1 by nielsr (HF Staff) - opened

README.md CHANGED
@@ -1,17 +1,30 @@
 ---
-license: apache-2.0
 language:
 - zh
 - en
-pipeline_tag: text-generation
 library_name: transformers
+license: apache-2.0
+pipeline_tag: text-generation
+tags:
+- llm
+- code-generation
 ---
+
+# MiniCPM4: Ultra-Efficient LLMs on End Devices
+
+The model was presented in the paper [MiniCPM4: Ultra-Efficient LLMs on End Devices](https://huggingface.co/papers/2506.07900).
+
+## Abstract
+This paper introduces MiniCPM4, a highly efficient large language model (LLM) designed explicitly for end-side devices. We achieve this efficiency through systematic innovation in four key dimensions: model architecture, training data, training algorithms, and inference systems. Specifically, in terms of model architecture, we propose InfLLM v2, a trainable sparse attention mechanism that accelerates both prefilling and decoding phases for long-context processing. Regarding training data, we propose UltraClean, an efficient and accurate pre-training data filtering and generation strategy, and UltraChat v2, a comprehensive supervised fine-tuning dataset. These datasets enable satisfactory model performance to be achieved using just 8 trillion training tokens. Regarding training algorithms, we propose ModelTunnel v2 for efficient pre-training strategy search, and improve existing post-training methods by introducing chunk-wise rollout for load-balanced reinforcement learning and a data-efficient ternary LLM, BitCPM. Regarding inference systems, we propose CPM.cu, which integrates sparse attention, model quantization, and speculative sampling to achieve efficient prefilling and decoding. To meet diverse on-device requirements, MiniCPM4 is available in two versions, with 0.5B and 8B parameters, respectively. Furthermore, we construct a hybrid reasoning model, MiniCPM4.1, which can be used in both deep reasoning mode and non-reasoning mode. Evaluation results demonstrate that MiniCPM4 and MiniCPM4.1 outperform similar-sized open-source models across benchmarks, with the 8B variants showing significant speed improvements on long sequence understanding and generation.
+
 <div align="center">
 <img src="https://github.com/OpenBMB/MiniCPM/blob/main/assets/minicpm_logo.png?raw=true" width="500em" ></img>
 </div>
 
 <p align="center">
-<a href="https://github.com/
+<a href="https://github.com/openbmb/minicpm" target="_blank">GitHub Repo</a> |
+<a href="https://huggingface.co/papers/2506.07900" target="_blank">Paper</a> |
+<a href="https://huggingface.co/collections/openbmb/minicpm4-6841ab29d180257e940baa9b" target="_blank">Project Page</a> |
 <a href="https://github.com/OpenBMB/MiniCPM/tree/main/report/MiniCPM_4_Technical_Report.pdf" target="_blank">Technical Report</a>
 </p>
 <p align="center">

@@ -44,6 +57,47 @@ BitCPM4 are ternary quantized models derived from the MiniCPM series models thro
 - Achieving comparable performance to full-precision models of similar parameter scale with a bit width of only 1.58 bits, demonstrating high parameter efficiency.
 
 ## Usage
+### Hugging Face Transformers Inference
+
+```python
+from transformers import AutoModelForCausalLM, AutoTokenizer
+import torch
+torch.manual_seed(0)
+
+path = 'openbmb/MiniCPM4-8B'
+device = "cuda"
+tokenizer = AutoTokenizer.from_pretrained(path)
+model = AutoModelForCausalLM.from_pretrained(path, torch_dtype=torch.bfloat16, device_map=device, trust_remote_code=True)
+
+# User can directly use the chat interface
+# responds, history = model.chat(tokenizer, "Write an article about Artificial Intelligence.", temperature=0.7, top_p=0.7)
+# print(responds)
+
+# User can also use the generate interface
+messages = [
+    {"role": "user", "content": "Write an article about Artificial Intelligence."},
+]
+prompt_text = tokenizer.apply_chat_template(
+    messages,
+    tokenize=False,
+    add_generation_prompt=True,
+)
+model_inputs = tokenizer([prompt_text], return_tensors="pt").to(device)
+
+model_outputs = model.generate(
+    **model_inputs,
+    max_new_tokens=1024,
+    top_p=0.7,
+    temperature=0.7
+)
+output_token_ids = [
+    model_outputs[i][len(model_inputs[i]):] for i in range(len(model_inputs['input_ids']))
+]
+
+responses = tokenizer.batch_decode(output_token_ids, skip_special_tokens=True)[0]
+print(responses)
+```
+
 ### Inference with [llama.cpp](https://github.com/ggml-org/llama.cpp)
 
 ```bash

@@ -64,12 +118,16 @@ BitCPM4's performance is comparable with other full-precision models in same mod
 - This repository and MiniCPM models are released under the [Apache-2.0](https://github.com/OpenBMB/MiniCPM/blob/main/LICENSE) License.
 
 ## Citation
-- Please cite our [paper](https://
+- Please cite our [paper](https://huggingface.co/papers/2506.07900) if you find our work valuable.
 
 ```bibtex
 @article{minicpm4,
 title={{MiniCPM4}: Ultra-Efficient LLMs on End Devices},
 author={MiniCPM Team},
-year={2025}
+year={2025},
+journal={arXiv preprint arXiv:2506.07900},
+archivePrefix={arXiv},
+primaryClass={cs.CL},
+url={https://arxiv.org/abs/2506.07900},
 }
-```
+```
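The Transformers snippet added in this diff slices the prompt tokens off each `generate()` output before decoding, since generated sequences begin by echoing the prompt. A minimal, framework-free sketch of just that slicing step (plain Python lists stand in for the tensors, and the token ids are made up for illustration):

```python
def strip_prompt(output_ids, input_ids):
    # generate() returns sequences that start with the prompt tokens,
    # so drop the first len(prompt) ids from each generated sequence.
    return [out[len(inp):] for out, inp in zip(output_ids, input_ids)]

prompt = [101, 7592, 102]              # made-up prompt token ids
generated = [101, 7592, 102, 9, 8, 7]  # output echoes the prompt first
print(strip_prompt([generated], [prompt]))  # -> [[9, 8, 7]]
```

Decoding the stripped ids with `skip_special_tokens=True`, as the snippet does, then yields only the newly generated text.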
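The snippet also relies on `tokenizer.apply_chat_template(..., add_generation_prompt=True)` to render the message list into a single prompt string. As a rough illustration of what such a call produces, here is a toy ChatML-style renderer; the `<|im_start|>`/`<|im_end|>` tokens are an assumption for illustration only, not MiniCPM4's actual template, which is a Jinja string shipped in the model's tokenizer config:

```python
def render_chat(messages, add_generation_prompt=True):
    # Toy ChatML-style template; real chat templates are defined
    # per-model and applied by the tokenizer.
    text = "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        text += "<|im_start|>assistant\n"
    return text

print(render_chat([{"role": "user", "content": "Hello"}]))
```

Setting `add_generation_prompt=True` matters at inference: without the trailing assistant header, the model may continue the user turn instead of answering it.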