HINT-lab/PosS2-Llama3-8B-Instruct


Add improved model card with usage example

by nielsr HF Staff - opened Jun 6, 2025

base: refs/heads/main ← from: refs/pr/1


README.md +49 -4


@@ -1,6 +1,51 @@
-This is the PosS-2 model of the paper **PosS:Position Specialist Generates Better Draft for Speculative Decoding**
-If the code fails to auto-download the models, you may mannually download the following files.
-- `pytorch_model.bin`: Model weights
-- `config.json`: Model config

+---
+pipeline_tag: text-generation
+library_name: transformers
+license: apache-2.0
+---
+# PosS: Position Specialist Generates Better Draft for Speculative Decoding
+This repository contains the PosS-2 model described in the paper [POSS: Position Specialist Generates Better Draft for Speculative Decoding](https://huggingface.co/papers/2506.03566).
+**PosS** introduces several Position Specialists, each responsible for drafting tokens at specific positions. Each specialist is trained to generate high-quality draft tokens when its inputs are the deviated features produced at earlier draft positions. At inference time, these Position Specialists mitigate feature deviation and make accurate predictions even at later draft positions.
+<div align="center">
+<img src="assets/method-intro.png" width="60%">
+</div>
+**PosS** achieves a higher **position-wise acceptance rate** *(the acceptance rate at a position, given that all previous positions were accepted)* than previous methods:
+<div align="center">
+<img src="assets/pos-acc-rate.png" width="60%">
+</div>
+### PosS Weights
+We also provide our trained parameters on Hugging Face:
+| Base Model             | PosS-1 Weights                                         | PosS-2 Weights                                         | PosS-3 Weights                                         |
+| :---------------------- | :----------------------------------------------------- | :----------------------------------------------------- | :----------------------------------------------------- |
+| Llama3-8B-Instruct     | [HINT-lab/PosS1-Llama3-8B-Instruct](https://huggingface.co/HINT-lab/PosS1-Llama3-8B-Instruct) | [HINT-lab/PosS2-Llama3-8B-Instruct](https://huggingface.co/HINT-lab/PosS2-Llama3-8B-Instruct) | [HINT-lab/PosS3-Llama3-8B-Instruct](https://huggingface.co/HINT-lab/PosS3-Llama3-8B-Instruct) |
+| Llama2-13B-Chat        | [HINT-lab/PosS1-Llama2-13B-Chat](https://huggingface.co/HINT-lab/PosS1-Llama2-13B-Chat)    | [HINT-lab/PosS2-Llama2-13B-Chat](https://huggingface.co/HINT-lab/PosS2-Llama2-13B-Chat)    | [HINT-lab/PosS3-Llama2-13B-Chat](https://huggingface.co/HINT-lab/PosS3-Llama2-13B-Chat)    |
+### Simplified Inference Example
+This example uses the `transformers` library. Install the dependencies first (`pip install transformers torch accelerate`); `accelerate` is required for `device_map="auto"`.
+```python
+from transformers import AutoModelForCausalLM, AutoTokenizer
+import torch
+model_id = "HINT-lab/PosS2-Llama3-8B-Instruct" # Or choose another PosS model
+model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")
+tokenizer = AutoTokenizer.from_pretrained(model_id)
+prompt = "The capital of France is"
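+inputs = tokenizer(prompt, return_tensors="pt").to(model.device)  # Use the device chosen by device_map instead of hard-coding "cuda"
+outputs = model.generate(**inputs, max_new_tokens=10)  # Passes input_ids and attention_mask; adjust max_new_tokens as needed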
+generated_text = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
+print(generated_text)
+```
+Code: [https://github.com/shrango/PosS](https://github.com/shrango/PosS)
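
The card defines the position-wise acceptance rate as the acceptance rate at a position, given that all previous positions were accepted. The following is a minimal sketch of how that metric could be computed. It assumes chain-style drafting and a hypothetical input format (one accepted-prefix length per drafting round). The function name, the input format, and the sample numbers are illustrative and are not taken from the PosS codebase.

```python
from typing import List


def position_wise_acceptance(accepted_lengths: List[int], draft_depth: int) -> List[float]:
    """Estimate the acceptance rate at each draft position.

    accepted_lengths[r] is the number of consecutive draft tokens accepted
    in round r (between 0 and draft_depth). This input format is an
    assumption made for illustration.
    """
    rates = []
    for pos in range(1, draft_depth + 1):
        # Rounds in which every position before `pos` was accepted.
        reached = sum(1 for n in accepted_lengths if n >= pos - 1)
        # Of those, rounds in which position `pos` was accepted as well.
        accepted = sum(1 for n in accepted_lengths if n >= pos)
        rates.append(accepted / reached if reached else float("nan"))
    return rates


# Made-up accepted-prefix lengths for five drafting rounds of depth 3.
rounds = [3, 1, 0, 2, 3]
rates = position_wise_acceptance(rounds, draft_depth=3)
print(rates)
```

Conditioning each position on the rounds that actually reached it separates the quality of a later-position draft from the compounding effect of earlier rejections, which is the quantity the comparison plot in the card reports.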