Improve model card: Add metadata, tags, and usage example
This PR significantly improves the model card by:
- Adding `pipeline_tag: text-generation` and `library_name: transformers` to the metadata, which enhances discoverability and indicates compatibility with the Hugging Face ecosystem.
- Including additional `tags` such as `speculative-decoding` and `inference-acceleration` for more granular filtering.
- Expanding the model description with an overview of `SpecDec++` derived from the paper's abstract, providing better context.
- Integrating a comprehensive, runnable Python snippet adapted from the paper's GitHub repository (`specdec_pp/sample.py`), showing how to load and use the Acceptance Prediction Head for accelerated text generation.
The existing arXiv paper link and GitHub repository link are preserved. As no explicit project page URL was provided, it has not been included.

---
license: apache-2.0
pipeline_tag: text-generation
library_name: transformers
tags:
- speculative-decoding
- inference-acceleration
---

# SpecDec++: Boosting Speculative Decoding via Adaptive Candidate Lengths

Speculative decoding is a technique for significantly reducing the inference latency of large language models (LLMs) by using a smaller, faster draft model. **SpecDec++** is an enhanced version of speculative decoding that adaptively determines the candidate length on the fly. It formulates this choice as a Markov Decision Process and shows theoretically that the optimal policy is to stop speculating as soon as the probability of rejection exceeds a threshold.

Motivated by this theory, SpecDec++ augments the draft model with a trained acceptance prediction head that predicts the conditional acceptance probability of each candidate token. This adaptive method achieves substantial speedups: 2.04x on the Alpaca dataset (7.2% improvement over baseline speculative decoding), 2.26x on GSM8K (9.4% improvement), and 2.23x on HumanEval (11.1% improvement).

This repository contains the **Acceptance Prediction Head for the Llama-2-chat 7B and 70B model pair**, trained with `weight_mismatch=6` and `resnet_num_layers=3`. It is recommended to use it with `stop_threshold=0.7`.

**Paper**: [SpecDec++: Boosting Speculative Decoding via Adaptive Candidate Lengths](https://arxiv.org/abs/2405.19715)

**Code**: [GitHub Repository](https://github.com/Kaffaljidhmah2/SpecDec_pp)
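
To make the adaptive stopping rule concrete, here is a minimal, illustrative sketch of how the head's predictions could drive the draft loop. It is not the repository's API: `propose_next` and `predict_acceptance` are hypothetical helpers, and only the recommended `stop_threshold=0.7` value comes from this model card.

```python
def draft_until_stop(draft_model, accept_head, context, stop_threshold=0.7, max_draft_len=32):
    """Illustrative only: keep drafting tokens while the predicted probability
    that the current draft contains at least one rejected token stays below
    stop_threshold. The helper methods used here are hypothetical."""
    draft_tokens = []
    prob_all_accepted = 1.0
    for _ in range(max_draft_len):
        token, hidden = draft_model.propose_next(context + draft_tokens)  # hypothetical helper
        p_accept = accept_head.predict_acceptance(hidden)                 # hypothetical helper
        draft_tokens.append(token)
        prob_all_accepted *= p_accept
        # Stop speculating once P(at least one rejection) exceeds the threshold.
        if 1.0 - prob_all_accepted > stop_threshold:
            break
    return draft_tokens
```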

## Usage

To use this Acceptance Prediction Head for accelerated text generation with SpecDec++, integrate it with a base large language model (e.g., Llama-2-chat 7B) via the `EaModel` class provided in the paper's repository.

First, clone the `SpecDec_pp` repository and install its dependencies:

```bash
git clone https://github.com/Kaffaljidhmah2/SpecDec_pp.git
cd SpecDec_pp
pip install -r requirements.txt
```
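
If you run the Python snippet from outside the repository root, its modules may not be importable. A minimal workaround, assuming the clone lives at `./SpecDec_pp`, is to put the repository on `sys.path` first (adjust the path to your setup):

```python
import sys
from pathlib import Path

# Assumed clone location; change this to wherever you cloned SpecDec_pp.
sys.path.insert(0, str(Path("SpecDec_pp").resolve()))
```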

Then, you can use the following Python snippet, adapted from `specdec_pp/sample.py`, to perform accelerated generation:

```python
import torch

# EaModel is a custom class from the paper's repository; make sure the cloned
# repository is on your Python path (see specdec_pp/sample.py for the exact import path).
from eagle.model.ea_model import EaModel
from fastchat.model import get_conversation_template

# Paths for the base LLM and this Acceptance Prediction Head.
# Replace with the actual model IDs or local paths.
base_model_path = "meta-llama/Llama-2-7b-chat-hf"   # the base LLM to accelerate
ea_model_path = "hacky/acchead-llama2-chat-7bx70b"  # this Acceptance Prediction Head checkpoint

# Load the EaModel, which combines the base LLM with the acceptance prediction head.
model = EaModel.from_pretrained(
    base_model_path=base_model_path,
    ea_model_path=ea_model_path,
    torch_dtype=torch.float16,  # use torch.float16 or torch.bfloat16 as appropriate
    low_cpu_mem_usage=True,
    device_map="auto",
    total_token=-1,             # -1 enables adaptive candidate lengths as per SpecDec++
)
model.eval()

# Build the prompt with the chat template of the base model (here, Llama-2-chat).
your_message = "What are the benefits of speculative decoding?"
conv = get_conversation_template("llama-2")  # use the template matching your base model
conv.append_message(conv.roles[0], your_message)
conv.append_message(conv.roles[1], None)     # the assistant's reply goes here
prompt = conv.get_prompt()

# Tokenize the prompt and move it to the GPU.
input_ids = model.tokenizer([prompt]).input_ids
input_ids = torch.as_tensor(input_ids).cuda()  # requires a CUDA-enabled GPU

# Generate with `eagenerate` for accelerated inference.
with torch.no_grad():
    output_ids = model.eagenerate(input_ids, temperature=0.7, max_new_tokens=256)

# Decode and print the generated text.
output = model.tokenizer.decode(output_ids[0])
print(output)
```
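
The recommended `stop_threshold=0.7` controls how long the draft model keeps speculating before handing control back to the target model. Where exactly this threshold is passed (at model loading or at generation time) may vary across repository versions, so check `specdec_pp/sample.py` for the current argument name.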

## Citation

If you find this useful in your research, please consider citing our paper.

```bibtex
@article{huang2024specdec++,