---
base_model: zai-org/AutoGLM-Phone-9B-Multilingual
library_name: gguf
license: other
license_name: glm-4
tags:
  - gguf
  - llama.cpp
  - vision
  - multimodal
  - autoglm
  - phone-agent
  - android
  - gui-agent
pipeline_tag: text-generation
---

# AutoGLM-Phone-9B-Multilingual (GGUF Quantizations)

This is a GGUF quantized version of [zai-org/AutoGLM-Phone-9B-Multilingual](https://huggingface.co/zai-org/AutoGLM-Phone-9B-Multilingual), optimized for local inference with llama.cpp.

The vision encoder (mmproj) is included for multimodal capabilities and GUI agent tasks.

## 📦 Model Files

| File | Quantization | Size | VRAM | Description |
|------|--------------|------|------|-------------|
| AutoGLM-Phone-9B-Multilingual-q4_k_m.gguf | Q4_K_M | 5.7 GB | ~10 GB | Balanced performance |
| AutoGLM-Phone-9B-Multilingual-q5_k_m.gguf | Q5_K_M | 6.6 GB | ~11 GB | High quality |
| AutoGLM-Phone-9B-Multilingual-q6_k.gguf | Q6_K | 7.7 GB | ~12 GB | Excellent quality |
| AutoGLM-Phone-9B-Multilingual-q8_0.gguf | Q8_0 | 9.4 GB | ~14 GB | Best quality |
| mmproj-AutoGLM-Phone-9B-Multilingual-F16.gguf | F16 | 1.7 GB | - | Vision encoder (required) |

Total storage: ~31GB (all quantizations + vision encoder)

## 🚀 Quick Start

### 1. Install llama.cpp

```bash
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
# llama.cpp builds with CMake (the legacy `make` targets have been removed);
# -DGGML_CUDA=ON enables GPU offload and requires the CUDA toolkit
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -t llama-server
```
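
To confirm the build succeeded, print the binary's version string (the path assumes the CMake build above):

```bash
./build/bin/llama-server --version
```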

### 2. Download Model

```bash
huggingface-cli download gannima/AutoGLM-Phone-9B-Multilingual-GGUF \
    --local-dir ./AutoGLM-Phone-9B-Multilingual \
    --local-dir-use-symlinks False
```
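
The command above fetches the full repository (~31 GB). If you only need a single quantization, here is a sketch using the CLI's `--include` glob patterns (filenames as listed in the Model Files table):

```bash
# Download one quantization plus the required vision encoder
huggingface-cli download gannima/AutoGLM-Phone-9B-Multilingual-GGUF \
    --include "*q5_k_m.gguf" "mmproj*" \
    --local-dir ./AutoGLM-Phone-9B-Multilingual
```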

### 3. Run Server

```bash
./build/bin/llama-server \
    -m AutoGLM-Phone-9B-Multilingual/AutoGLM-Phone-9B-Multilingual-q8_0.gguf \
    --mmproj AutoGLM-Phone-9B-Multilingual/mmproj-AutoGLM-Phone-9B-Multilingual-F16.gguf \
    -c 32768 \
    -ngl 99 \
    --flash-attn on \
    --host 0.0.0.0 \
    --port 8080
```
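
Once the server is running, you can sanity-check it over its HTTP API. llama-server exposes a `/health` endpoint and an OpenAI-compatible `/v1/chat/completions` route; `screenshot.png` below is a placeholder for any test image:

```bash
# Readiness check
curl http://localhost:8080/health

# Text + image request; the image goes in as a base64 data URL
IMG=$(base64 -w0 screenshot.png)
curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "messages": [{
        "role": "user",
        "content": [
          {"type": "image_url", "image_url": {"url": "data:image/png;base64,'"$IMG"'"}},
          {"type": "text", "text": "Describe this screen."}
        ]
      }]
    }'
```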

### 4. Use with Open-AutoGLM

```bash
cd Open-AutoGLM
python main.py \
    --base-url http://localhost:8080 \
    --model "AutoGLM-Phone-9B-Multilingual" \
    --apikey dummy \
    "Open the Settings app" \
    --max-steps 20
```

## 💻 Hardware Requirements

### Quick Reference (Tested on RTX 4090)

| Quantization | Model Size | Vision Encoder | Total | Actual VRAM* | Quality |
|--------------|------------|----------------|-------|--------------|---------|
| Q4_K_M | 5.7 GB | 1.7 GB | ~7.4 GB | ~10 GB | Good |
| Q5_K_M | 6.6 GB | 1.7 GB | ~8.3 GB | ~11 GB | Very Good |
| Q6_K | 7.7 GB | 1.7 GB | ~9.4 GB | ~12 GB | Excellent |
| Q8_0 | 9.4 GB | 1.7 GB | ~11.1 GB | ~14 GB | Best |

\*VRAM usage measured with `--flash-attn on` and all layers on the GPU (`-ngl 99`).

### System Requirements

- **OS:** Linux (Ubuntu 22.04+ recommended) or Windows 11 with WSL2
- **RAM:** 32GB+ system memory recommended
- **Storage:** SSD with ~31GB free for all model files (less if you download a single quantization)
- **CUDA:** 12.0+ for GPU acceleration
- **llama.cpp:** Recent build with GLM4V support (PR #18042 merged)

### Performance Notes

- **Flash Attention:** Enabled with `--flash-attn on` (as in the example command) for better performance
- **KV Cache:** Can be quantized to Q8_0 to reduce memory usage (see the sketch after this list)
- **Batch Size:** Optimized for the RTX 4090; adjust based on your GPU
- **Context:** Supports up to 32K tokens with M-RoPE
- **All layers on GPU:** Set `-ngl 99` to offload all transformer layers to the GPU
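
Here is a sketch of the KV-cache quantization mentioned above, adding llama.cpp's `--cache-type-k`/`--cache-type-v` flags to the step 3 command (quantizing the V cache requires flash attention):

```bash
./build/bin/llama-server \
    -m AutoGLM-Phone-9B-Multilingual/AutoGLM-Phone-9B-Multilingual-q8_0.gguf \
    --mmproj AutoGLM-Phone-9B-Multilingual/mmproj-AutoGLM-Phone-9B-Multilingual-F16.gguf \
    -c 32768 -ngl 99 --flash-attn on \
    --cache-type-k q8_0 --cache-type-v q8_0 \
    --host 0.0.0.0 --port 8080
```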

## 🎯 Recommended Usage

### For GUI Agent Tasks (Recommended)

Use Q5_K_M or Q6_K for the best balance between quality and performance:

- Better reasoning accuracy than Q4_K_M
- Faster inference than Q8_0
- Lower VRAM usage than Q8_0

### For Maximum Quality

Use Q8_0 when:

- You want the highest possible accuracy
- You are running on an RTX 4090 or better
- You have complex multi-step GUI automation tasks

### For Consumer GPUs

Use Q4_K_M when:

- VRAM is limited (12GB cards like the RTX 4070)
- You need faster inference
- You are running on gaming GPUs (if even Q4_K_M does not fit, see the partial-offload sketch below)
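
If the full model still does not fit in VRAM, llama.cpp can keep some transformer layers on the CPU. Here is a sketch with a lower `-ngl` value (24 is illustrative, not tuned):

```bash
# Partial offload: layers beyond -ngl stay on the CPU at some speed cost.
# Raise the value until VRAM is nearly full for the best throughput.
./build/bin/llama-server \
    -m AutoGLM-Phone-9B-Multilingual/AutoGLM-Phone-9B-Multilingual-q4_k_m.gguf \
    --mmproj AutoGLM-Phone-9B-Multilingual/mmproj-AutoGLM-Phone-9B-Multilingual-F16.gguf \
    -c 16384 -ngl 24 --flash-attn on \
    --host 0.0.0.0 --port 8080
```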

## 📄 License

This model is governed by the GLM-4 License. Please refer to the original model repository for details: [zai-org/AutoGLM-Phone-9B-Multilingual](https://huggingface.co/zai-org/AutoGLM-Phone-9B-Multilingual)

๐Ÿ™ Acknowledgments


---

- **Conversion Date:** 2025-12-29
- **llama.cpp Version:** latest (with GLM4V support)
- **Tested Hardware:** RTX 4090 24GB