
SimpleTool: Parallel Decoding for Real-Time LLM Function Calling

Hugging Face | ModelScope | GitHub

This repository contains the weights for RT-Qwen (RealtimeTool), a series of models optimized for low-latency, parallel LLM function calling.

πŸ“ Model Directory Structure

The models are organized by scale, quantization format, and inference framework.

1. SFT & AWQ Models (vLLM / Transformers)

Use these folders directly for inference with vLLM or Transformers.

  • RT-Qwen2.5-0.5B / -0.5B-AWQ
  • RT-Qwen2.5-1.5B / -1.5B-AWQ
  • RT-Qwen2.5-3B / -3B-AWQ
  • RT-Qwen2.5-7B / -7B-AWQ
  • RT-Qwen2.5-14B / -14B-AWQ
  • RT-Qwen3-4B / -4B-AWQ
  • RT-Qwen3-30B / -30B-AWQ
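As a minimal sketch of using an SFT folder with Transformers (the repo ID below is an assumption until the upload completes; substitute the checkpoint folder you actually downloaded):

```python
# Sketch: one greedy generation from an RT-Qwen SFT checkpoint with
# Hugging Face Transformers. Requires `pip install transformers torch`.
# The default model ID is hypothetical; point it at a real local or hub path.

def generate_tool_call(prompt: str,
                       model_id: str = "SimpleTool/RT-Qwen2.5-0.5B") -> str:
    """Run a single chat turn and return the model's reply text."""
    # Imported lazily so the function can be defined without the deps installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

    messages = [{"role": "user", "content": prompt}]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)

    outputs = model.generate(inputs, max_new_tokens=256)
    # Strip the prompt tokens, keep only the newly generated reply.
    return tokenizer.decode(outputs[0][inputs.shape[-1]:],
                            skip_special_tokens=True)
```

The AWQ folders load the same way; vLLM picks up the quantization config automatically when pointed at an `-AWQ` directory.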

2. GGUF Models (llama.cpp)

  • gguf_models/: Unquantized (F16) GGUF files for all versions.
  • gguf_quantized/: Quantized GGUF versions including Q4_K_M, Q5_K_M, and Q8_0.
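A GGUF file can be driven from Python via llama-cpp-python, as a rough sketch (the file name below is an assumption; use whichever quantization you downloaded from gguf_quantized/):

```python
# Sketch: one chat turn against a local GGUF checkpoint using
# llama-cpp-python (`pip install llama-cpp-python`).
# The default path is hypothetical; substitute your downloaded file.

def chat_with_gguf(prompt: str,
                   model_path: str = "gguf_quantized/RT-Qwen2.5-0.5B-Q4_K_M.gguf") -> str:
    """Load the model lazily and return the assistant's reply text."""
    from llama_cpp import Llama  # lazy import: definable without the dep

    llm = Llama(model_path=model_path, n_ctx=4096)
    result = llm.create_chat_completion(
        messages=[{"role": "user", "content": prompt}],
        max_tokens=256,
    )
    return result["choices"][0]["message"]["content"]
```

The same files also work with the stock llama.cpp CLI and server binaries.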

πŸ“ TODO

  • Release Arxiv Paper
  • Complete GitHub Documentation
  • Add Performance Benchmarks
  • Provide Citation Info

License: Apache-2.0
Status: Models Uploading / Placeholder README