Upload 3 files
Browse files
- .gitattributes +2 -0
- README.md +109 -0
- WeDLM-8B-Instruct-Q4_K_M.gguf +3 -0
- WeDLM-8B-Instruct-Q8_0.gguf +3 -0
.gitattributes CHANGED
@@ -33,3 +33,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+WeDLM-8B-Instruct-Q4_K_M.gguf filter=lfs diff=lfs merge=lfs -text
+WeDLM-8B-Instruct-Q8_0.gguf filter=lfs diff=lfs merge=lfs -text

README.md CHANGED
@@ -1,3 +1,112 @@
---
license: apache-2.0
language:
- en
- zh
base_model: tencent/WeDLM-8B-Instruct
tags:
- gguf
- llama-cpp
- wedlm
- tencent
- qwen3
- quantized
library_name: gguf
pipeline_tag: text-generation
---

# WeDLM-8B-Instruct-GGUF

**First GGUF quantization of Tencent WeDLM-8B-Instruct!**

Quantized using llama.cpp release b7688.

Original model: [tencent/WeDLM-8B-Instruct](https://huggingface.co/tencent/WeDLM-8B-Instruct)

## About

WeDLM is an 8B-parameter instruction-tuned model by Tencent that supports English and Chinese. Its architecture resembles Qwen3, including QK normalization.

This GGUF uses the `qwen3` architecture identifier for maximum llama.cpp compatibility.

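You can check that identifier yourself. A minimal sketch, assuming the `gguf` Python package from llama.cpp's `gguf-py` is installed (it ships a `gguf-dump` utility):

```bash
# Inspection tooling from llama.cpp's gguf-py (assumed available via pip)
pip install gguf

# general.architecture in the metadata should read "qwen3"
gguf-dump WeDLM-8B-Instruct-Q4_K_M.gguf | grep -m1 general.architecture
```
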
## Available Files

| Filename | Quant | Size | Description |
|----------|-------|------|-------------|
| WeDLM-8B-Instruct-Q4_K_M.gguf | Q4_K_M | 4.68 GB | Good quality; recommended for most use cases |
| WeDLM-8B-Instruct-Q8_0.gguf | Q8_0 | 8.11 GB | High quality; best accuracy |

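To fetch a single quant without cloning the whole repository, `huggingface-cli` works well; a sketch, where `<user>/WeDLM-8B-Instruct-GGUF` is a placeholder for this repo's actual id:

```bash
# Download only the Q4_K_M file into the current directory
# (replace <user>/WeDLM-8B-Instruct-GGUF with the real repo id)
huggingface-cli download <user>/WeDLM-8B-Instruct-GGUF \
  WeDLM-8B-Instruct-Q4_K_M.gguf --local-dir .
```
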
## Performance Benchmarks

### CPU (16 threads, Zen 4)

| Quant | Prompt Processing | Text Generation |
|-------|-------------------|-----------------|
| Q4_K_M | 88.65 t/s | 8.27 t/s |
| Q8_0 | 50.80 t/s | 5.17 t/s |

### GPU (RTX 4060 Laptop, 8 GB VRAM)

| Quant | Prompt Processing | Text Generation |
|-------|-------------------|-----------------|
| Q4_K_M | **1833.84 t/s** | **37.08 t/s** |

*Q4_K_M is recommended for the RTX 4060: it fits in 8 GB VRAM.*

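To reproduce numbers like these on your own hardware, llama.cpp's bundled `llama-bench` tool is the usual choice; a sketch (the `-p 512 -n 128` workload is an assumption, not necessarily the settings behind the tables above):

```bash
# Measure prompt-processing (pp) and text-generation (tg) throughput
# -t 16 uses 16 CPU threads; -ngl 99 offloads all layers if a GPU is present
./llama-bench -m WeDLM-8B-Instruct-Q4_K_M.gguf -p 512 -n 128 -t 16 -ngl 99
```
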
## Prompt Format (ChatML)

```
<|im_start|>system
You are a helpful AI assistant.<|im_end|>
<|im_start|>user
Hello!<|im_end|>
<|im_start|>assistant
```

## Usage

### llama.cpp

```bash
./llama-cli -m WeDLM-8B-Instruct-Q4_K_M.gguf \
  -p "<|im_start|>user\nHello<|im_end|>\n<|im_start|>assistant\n" \
  -n 256 -ngl 99
```

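For an OpenAI-compatible HTTP API, the same file can be served with `llama-server`; a minimal sketch, assuming a llama.cpp build that includes the server binary:

```bash
# Serve with full GPU offload on port 8080
./llama-server -m WeDLM-8B-Instruct-Q4_K_M.gguf -ngl 99 --port 8080

# Query the OpenAI-compatible endpoint; the server applies the chat
# template embedded in the GGUF metadata
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello!"}]}'
```
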
### Ollama

```bash
# Create Modelfile
cat > Modelfile << 'EOF'
FROM ./WeDLM-8B-Instruct-Q4_K_M.gguf
TEMPLATE "<|im_start|>user\n{{ .Prompt }}<|im_end|>\n<|im_start|>assistant\n"
EOF

ollama create wedlm -f Modelfile
ollama run wedlm
```

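Note that the template above drops any system prompt. A sketch of a fuller ChatML Modelfile, assuming Ollama's standard `{{ .System }}`/`{{ .Prompt }}` template variables:

```bash
cat > Modelfile << 'EOF'
FROM ./WeDLM-8B-Instruct-Q4_K_M.gguf
# Emit the system turn only when a system prompt is set
TEMPLATE """{{ if .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}<|im_start|>user
{{ .Prompt }}<|im_end|>
<|im_start|>assistant
"""
SYSTEM "You are a helpful AI assistant."
EOF
ollama create wedlm -f Modelfile
```
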
## Hardware Requirements

| Quant | Min. VRAM | Recommended RAM |
|-------|-----------|-----------------|
| Q4_K_M | 6 GB | 8 GB |
| Q8_0 | 10 GB | 12 GB |

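If a quant does not fully fit in VRAM (e.g. Q8_0 on an 8 GB card), llama.cpp can split the model by offloading only some of the 36 layers; a sketch, where the layer count is a starting point to tune rather than a measured optimum:

```bash
# Offload about two thirds of the 36 layers to the GPU,
# keeping the rest (and part of the KV cache) in system RAM
./llama-cli -m WeDLM-8B-Instruct-Q8_0.gguf -ngl 24 \
  -p "<|im_start|>user\nHello<|im_end|>\n<|im_start|>assistant\n" -n 256
```
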
## Model Architecture

- Parameters: 8.19B
- Layers: 36
- Hidden size: 4096
- Attention heads: 32 (8 KV heads, GQA)
- Context length: 16384
- Features: QK Norm, SwiGLU, RoPE (theta = 1M)

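These figures let you estimate the KV-cache footprint at full context. Assuming a head dimension of 128 (hidden size 4096 / 32 heads) and an f16 cache, a quick shell check:

```bash
# K and V * 36 layers * 8 KV heads * 128 head_dim * 16384 tokens * 2 bytes (f16)
echo $((2 * 36 * 8 * 128 * 16384 * 2))   # 2415919104 bytes, about 2.25 GiB
```

So a full 16384-token context adds roughly 2.25 GiB on top of the weight sizes listed above.
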
## Acknowledgements

- Original model: [Tencent WeDLM Team](https://huggingface.co/tencent)
- Inference framework: [llama.cpp](https://github.com/ggml-org/llama.cpp)

## Disclaimer

This is an unofficial quantization. For official support, please refer to the original model repository.

WeDLM-8B-Instruct-Q4_K_M.gguf ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:5c8b938dab334f03b9184e68f6466736153adb9d98b3f43119ee8c51852e1975
+size 5027782208

WeDLM-8B-Instruct-Q8_0.gguf ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:d75abe66fd2c3f980d2090efc6bf74013eeca7b99322b00fbd3b8e65cdbef239
+size 8709516864