
SimpleTool: Parallel Decoding for Real-Time LLM Function Calling

Hugging Face | ModelScope | GitHub

This repository contains the weights for RT-Qwen (RealtimeTool), a series of models optimized for low-latency, parallel LLM function calling.

πŸ“ Model Directory Structure

The models are organized by scale, quantization format, and inference framework.

1. SFT & AWQ Models (vLLM / Transformers)

Use these folders directly for inference with vLLM or Transformers.

  • RT-Qwen2.5-0.5B / -0.5B-AWQ
  • RT-Qwen2.5-1.5B / -1.5B-AWQ
  • RT-Qwen2.5-3B / -3B-AWQ
  • RT-Qwen2.5-7B / -7B-AWQ
  • RT-Qwen2.5-14B / -14B-AWQ
  • RT-Qwen3-4B / -4B-AWQ
  • RT-Qwen3-30B / -30B-AWQ
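As a minimal sketch of using an SFT folder with Transformers (the repo ID below is an assumption until the upload completes; substitute the checkpoint folder you actually downloaded):

```python
# Sketch: one greedy generation from an RT-Qwen SFT checkpoint with
# Hugging Face Transformers. Requires `pip install transformers torch`.
# The default model ID is hypothetical; point it at a real local or hub path.

def generate_tool_call(prompt: str,
                       model_id: str = "SimpleTool/RT-Qwen2.5-0.5B") -> str:
    """Run a single chat turn and return the model's reply text."""
    # Imported lazily so the function can be defined without the deps installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

    messages = [{"role": "user", "content": prompt}]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)

    outputs = model.generate(inputs, max_new_tokens=256)
    # Strip the prompt tokens, keep only the newly generated reply.
    return tokenizer.decode(outputs[0][inputs.shape[-1]:],
                            skip_special_tokens=True)
```

The AWQ folders load the same way; vLLM picks up the quantization config automatically when pointed at an `-AWQ` directory.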

2. GGUF Models (llama.cpp)

  • gguf_models/: Unquantized (F16) GGUF files for all versions.
  • gguf_quantized/: Quantized GGUF versions including Q4_K_M, Q5_K_M, and Q8_0.
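A GGUF file can be driven from Python via llama-cpp-python, as a rough sketch (the file name below is an assumption; use whichever quantization you downloaded from gguf_quantized/):

```python
# Sketch: one chat turn against a local GGUF checkpoint using
# llama-cpp-python (`pip install llama-cpp-python`).
# The default path is hypothetical; substitute your downloaded file.

def chat_with_gguf(prompt: str,
                   model_path: str = "gguf_quantized/RT-Qwen2.5-0.5B-Q4_K_M.gguf") -> str:
    """Load the model lazily and return the assistant's reply text."""
    from llama_cpp import Llama  # lazy import: definable without the dep

    llm = Llama(model_path=model_path, n_ctx=4096)
    result = llm.create_chat_completion(
        messages=[{"role": "user", "content": prompt}],
        max_tokens=256,
    )
    return result["choices"][0]["message"]["content"]
```

The same files also work with the stock llama.cpp CLI and server binaries.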

πŸ“ TODO

  • Release Arxiv Paper
  • Complete GitHub Documentation
  • Add Performance Benchmarks
  • Provide Citation Info

License: Apache-2.0
Status: Models Uploading / Placeholder README