Instructions for using thoughtworks/GLM-4.7-Flash-Eagle3 with libraries, inference providers, notebooks, and local apps.

Libraries

How to use thoughtworks/GLM-4.7-Flash-Eagle3 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="thoughtworks/GLM-4.7-Flash-Eagle3")

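# Load model directly
# Note: this import fails if your installed transformers version does not export LlamaForCausalLMEagle3.
from transformers import AutoTokenizer, LlamaForCausalLMEagle3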

tokenizer = AutoTokenizer.from_pretrained("thoughtworks/GLM-4.7-Flash-Eagle3")
model = LlamaForCausalLMEagle3.from_pretrained("thoughtworks/GLM-4.7-Flash-Eagle3")

Local Apps

vLLM

How to use thoughtworks/GLM-4.7-Flash-Eagle3 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "thoughtworks/GLM-4.7-Flash-Eagle3"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "thoughtworks/GLM-4.7-Flash-Eagle3",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'
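
The same request can also be sent from Python. The following is a minimal sketch using only the standard library; it assumes the server started above is listening on localhost:8000, and the helper names `build_completion_request` and `complete` are illustrative, not part of vLLM.

```python
import json
import urllib.request

MODEL_ID = "thoughtworks/GLM-4.7-Flash-Eagle3"


def build_completion_request(prompt, base_url="http://localhost:8000",
                             max_tokens=512, temperature=0.5):
    """Build a POST request for the OpenAI-compatible /v1/completions route."""
    payload = {
        "model": MODEL_ID,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the request and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_completion_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]
```

With the server running, `complete("Once upon a time,")` returns the generated continuation. Passing `base_url="http://localhost:30000"` points the same helper at a server listening on port 30000 instead.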

Use Docker

docker model run hf.co/thoughtworks/GLM-4.7-Flash-Eagle3

SGLang

How to use thoughtworks/GLM-4.7-Flash-Eagle3 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "thoughtworks/GLM-4.7-Flash-Eagle3" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "thoughtworks/GLM-4.7-Flash-Eagle3",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "thoughtworks/GLM-4.7-Flash-Eagle3" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "thoughtworks/GLM-4.7-Flash-Eagle3",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Docker Model Runner
How to use thoughtworks/GLM-4.7-Flash-Eagle3 with Docker Model Runner:
```
docker model run hf.co/thoughtworks/GLM-4.7-Flash-Eagle3
```

Question regarding Eagle3 training and MTP

by jhjhjh777 - opened Apr 6

Hi there, thanks for sharing this model.

I was wondering how the Eagle3 draft model was trained for GLM-4.7-Flash. Did you use a specific framework or repository for the training process?

Also, since GLM already has native support for Multi-Token Prediction (MTP), I'm curious about your main motivation for choosing to train an EAGLE3 model instead.

If you are open to sharing the training scripts, I am planning to run a performance comparison between your EAGLE3 approach and the native MTP, and share the results back with the community. It would be a great learning resource for those of us looking into custom speculative decoding setups.
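
A comparison of this kind could be sketched as follows: send the same prompts to each OpenAI-compatible endpoint and divide the reported `usage.completion_tokens` by the wall-clock time. This is only a rough sketch, not the model authors' method. The function names and the injectable `send` callable are hypothetical, and it assumes each server returns a `usage` object in its completion response.

```python
import json
import time
import urllib.request


def post_json(url, payload):
    """POST a JSON payload and return the decoded JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def measure_throughput(url, model, prompts, max_tokens=256, send=post_json):
    """Return generated tokens per second over all prompts, sent one at a time."""
    total_tokens = 0
    start = time.perf_counter()
    for prompt in prompts:
        body = send(url, {
            "model": model,
            "prompt": prompt,
            "max_tokens": max_tokens,
            "temperature": 0.0,  # assumption: greedy decoding for repeatable runs
        })
        # Count only generated tokens, as reported by the server.
        total_tokens += body["usage"]["completion_tokens"]
    elapsed = time.perf_counter() - start
    return total_tokens / elapsed if elapsed > 0 else 0.0
```

Running it once against a server using the EAGLE3 draft and once against one using native MTP, with the same prompts and `max_tokens`, would give two throughput figures that can be compared directly.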

Best regards,
