---
title: Nanbeige2.5 — Chat
emoji: 🦙
colorFrom: indigo
colorTo: purple
sdk: gradio
sdk_version: "3.50.0"
python_version: "3.10"
app_file: app.py
pinned: false
---

# Nanbeige2.5 — Gradio Chat Space

A lightweight Gradio chat UI for `PioTio/Nanbeige2.5`, suitable for deployment on Hugging Face Spaces.

## Features
- Streaming and non-streaming generation ✅
- Tokenizer ↔ model sanity fixes (avoids SentencePiece `piece id` errors) ✅
- 4-bit BitsAndBytes loading when a GPU and `bitsandbytes` are available ✅
- Optional LoRA adapter application (requires `peft`) ✅
- Controls: temperature, top-p, top-k, max tokens, max history ✅
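The 4-bit path only activates when both a CUDA device and `bitsandbytes` are present; a minimal sketch of how such a conditional load might be structured (the helper name `build_load_kwargs` is hypothetical, not the app's actual code):

```python
def build_load_kwargs(cuda_available: bool, bnb_available: bool) -> dict:
    """Choose kwargs for AutoModelForCausalLM.from_pretrained.

    Hypothetical helper: the real app may differ. On GPU with
    bitsandbytes installed, request 4-bit quantization; otherwise
    fall back to a plain CPU load.
    """
    if cuda_available and bnb_available:
        return {
            "device_map": "auto",
            # In the app this would typically be expressed via
            # transformers.BitsAndBytesConfig(load_in_4bit=True,
            #                                 bnb_4bit_quant_type="nf4")
            "load_in_4bit": True,
        }
    return {"device_map": "cpu"}
```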

**Quick CPU tip:** This Space **may be CPU-only**. Running the full `PioTio/Nanbeige2.5` model on CPU is extremely slow — use the **Load fast CPU demo (distilgpt2)** button for quick responses, enable GPU in the Space settings for production use, or check **Force CPU generation** to run Nanbeige on CPU (very slow and not recommended).

## Deployment (Hugging Face Spaces)
1. Create a new Space (Gradio runtime).
2. Upload these files (`app.py`, `requirements.txt`, `README.md`).
3. In the Space settings, choose **Hardware accelerator: GPU** (recommended).

After pushing these files, the Space will build automatically — open the Space page and monitor the logs for errors. If you prefer, you can create `app.py` directly in the web UI instead of pushing from Git.

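A plausible `requirements.txt` for this setup (the pins and exact package list are illustrative assumptions, not the repo's actual file):

```
gradio==3.50.0
transformers
torch
sentencepiece
accelerate
bitsandbytes
peft
```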
**Tip:** keep `bitsandbytes` in `requirements.txt` if you plan to enable 4-bit loading on GPU; remove or pin it if the build log shows dependency issues.

- If you see `piece id is out of range` errors, the app will attempt to auto-fix tokenizer/model alignment.
- To apply a LoRA adapter after starting the app, paste the adapter's HF repo ID in the LoRA field and click **Apply LoRA adapter**.
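The `piece id` auto-fix can be sketched as resizing the model's embedding table to match the tokenizer's vocabulary; this version is duck-typed so it runs without `transformers`, and the function name and exact logic are assumptions about the app, not its actual code:

```python
def align_tokenizer_and_model(tokenizer, model):
    """If the tokenizer's vocab is larger than the model's embedding
    table, generation can produce out-of-range piece ids; resizing the
    embeddings avoids that. Hypothetical sketch of the app's auto-fix.
    """
    vocab_size = len(tokenizer)
    embed_rows = model.get_input_embeddings().num_embeddings
    if vocab_size != embed_rows:
        # transformers models expose resize_token_embeddings(new_size)
        model.resize_token_embeddings(vocab_size)
    return model
```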
## Recommended hardware
- GPU (T4 / A10 / A100) for real-time streaming; CPU-only inference may be slow.

## Troubleshooting
- If the model fails to load on Spaces, check the logs for out-of-memory (OOM) errors; switch to GPU hardware or enable 4-bit loading via `bitsandbytes`.
- For adapter load failures, ensure the adapter repo exists and `peft` is listed in `requirements.txt`.
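Because adapter application can fail for several reasons (missing repo, missing `peft`), a robust app keeps the base model on failure. A hedged sketch with the loader injected so it runs without `peft` — `apply_lora` and this signature are assumptions, not the app's actual API:

```python
def apply_lora(model, adapter_repo, loader=None):
    """Try to wrap `model` with a LoRA adapter; on any failure,
    return the unmodified base model plus a status message.

    `loader` defaults to peft.PeftModel.from_pretrained when peft is
    installed; it is injectable here so this sketch runs without peft.
    """
    if not adapter_repo:
        return model, "no adapter repo given"
    if loader is None:
        try:
            from peft import PeftModel
            loader = PeftModel.from_pretrained
        except ImportError:
            return model, "peft is not installed"
    try:
        return loader(model, adapter_repo), f"applied {adapter_repo}"
    except Exception as exc:
        return model, f"adapter load failed: {exc}"
```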