Text Generation
Transformers
Safetensors
English
cloverlm
causal-lm
quartet-ii
nvfp4
low-precision-training
pretrained
custom_code
Update README.md
README.md
CHANGED
@@ -122,7 +122,7 @@ model = AutoModelForCausalLM.from_pretrained(
     "daslab-testing/CloverLM",
     trust_remote_code=True,
     dtype="bfloat16",
-    quartet_2_impl="
+    quartet_2_impl="pseudoquant", # on non-Blackwell GPUs or "quartet2" for native NVFP4 kernel
 ).to("cuda") # for GPU usage or "cpu" for CPU usage
 
 tokenizer = AutoTokenizer.from_pretrained(
@@ -134,6 +134,7 @@ input_ids = tokenizer("The capital of France is", return_tensors="pt").input_ids
 output = model.generate(input_ids.to(model.device), max_new_tokens=32)
 print(tokenizer.decode(output[0]))
 ```
+Note that `quartet_2_impl="quartet2"` only supports inputs with `(micro_batch_size * seq_length) % 128 == 0`.
 
 ### Running Evaluations
 
@@ -164,7 +165,7 @@ Attention backend options: `pytorch` (default), `flash2`, `flash3`, `flash4`.
 - PyTorch 2.10+ with CUDA 13.0
 - `transformers ≥ 5.3.0`
 - `tokenmonster ≥ 1.1.12`
-- [Quartet II kernels](https://github.com/IST-DASLab/Quartet-II)
+- [Quartet II kernels](https://github.com/IST-DASLab/Quartet-II)
 
 ## Architecture Details
 
@@ -190,8 +191,8 @@ The model uses 264 weight tensors totaling ~4.14 B parameters.
 @article{cloverlm2026,
   title = {Speedrunning GPT3: Pretraining an OPT-175B-Quality Model Cheaply
            by Leveraging Native NVFP4},
-  author = {Erik Schultheis and
-
+  author = {Erik Schultheis and Georgios Vlassis and Matin Ansaripour and
+            Andrei Panferov and Dan Alistarh},
   year = {2026},
 }
-```
+```
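
The note added in this change restricts `quartet_2_impl="quartet2"` to inputs where `(micro_batch_size * seq_length) % 128 == 0`. A minimal sketch of how a caller might check that condition up front, and compute the smallest sequence length that satisfies it, is shown below. The helper names `is_quartet2_compatible` and `padded_seq_length` are hypothetical and not part of the CloverLM code. Whether padding inputs to that length is appropriate for this model is an assumption the README does not address.

```python
import math

# Divisor taken from the README note; everything else here is illustrative.
QUARTET2_MULTIPLE = 128


def is_quartet2_compatible(micro_batch_size: int, seq_length: int,
                           multiple: int = QUARTET2_MULTIPLE) -> bool:
    """Return True if the total token count is a multiple of `multiple`."""
    return (micro_batch_size * seq_length) % multiple == 0


def padded_seq_length(micro_batch_size: int, seq_length: int,
                      multiple: int = QUARTET2_MULTIPLE) -> int:
    """Smallest length >= seq_length that satisfies the divisibility rule.

    batch * length is divisible by `multiple` exactly when length is
    divisible by multiple / gcd(batch, multiple).
    """
    step = multiple // math.gcd(micro_batch_size, multiple)
    # Ceiling division, then scale back up to a multiple of `step`.
    return -(-seq_length // step) * step


if __name__ == "__main__":
    batch, length = 4, 5  # hypothetical shapes, not measured from the tokenizer
    print(is_quartet2_compatible(batch, length))
    print(padded_seq_length(batch, length))
```

If the check fails and padding is not an option, the same diff line suggests falling back to `quartet_2_impl="pseudoquant"`, for which the README states no such constraint.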