# T-lite-it-rk3588
This repository contains the T-lite-it-1.0 language model, converted into the RKLLM format for optimized inference on the Rockchip RK3588 Neural Processing Unit (NPU).
The original model was developed by t-tech and quantized to GGUF (Q8_0). This version has been further adapted using Rockchip's tools to run efficiently on edge devices, leveraging the NPU for high performance and low latency.
## Model Details
This model is a Russian-language text generation model, suitable for building conversational AI assistants and other NLP applications directly on edge hardware.
- Base Model: T-lite-it-1.0-Q8_0-GGUF
- Format: RKLLM (`.rkllm`)
- Target Platform: Rockchip RK3588 (NPU)
- Language: Russian
- Quantization: INT8 (optimized for NPU). The conversion process was calibrated to preserve the quality of the original Q8_0 version; subjective evaluations showed no noticeable loss in generation quality.
- Performance (Expected): Based on similar models, inference speed on the RK3588 NPU is expected to be in the range of 12-20 tokens per second, with a memory footprint of approximately 1.7-2.0 GB.
## Getting Started with RKLLama
The recommended way to run this model is with RKLLama. RKLLama is a server and client specifically designed to run RKLLM models on Rockchip NPUs. It provides an API compatible with Ollama, making it easy to integrate with various front-ends like Open WebUI.
Follow these steps to get the model running on your RK3588 device.
### Prerequisites
- A device with a Rockchip RK3588 (or RK3576) SoC (e.g., Orange Pi 5, NanoPC-T6).
- An operating system installed (Ubuntu 24.04 arm64 or Armbian are recommended).
- Python 3.9–3.12 and `pip` installed.
- An internet connection for the initial setup (to download the model and tokenizer).
### Step 1: Install RKLLama
The easiest way is to install RKLLama directly from its GitHub repository using pip. It is recommended, but not required, to use a Python virtual environment.
Clone the repository and install the package:

```bash
git clone https://github.com/notpunchnox/rkllama
cd rkllama
python -m pip install .
```
### Step 2: Prepare the Model
You need to place the downloaded RKLLM model file into the correct directory structure that RKLLama expects.
1. **Create the model directory:** Navigate to the directory where you plan to run the server (e.g., `~/rkllama`) and create the `models` folder, with a subfolder for this specific model.

   ```bash
   mkdir -p ~/rkllama/models/t-lite-it
   ```

2. **Place the model file:** Download the `.rkllm` model file from this Hugging Face repository and move it into the folder you just created (`~/rkllama/models/t-lite-it/`). Let's assume the filename is `T-lite-it-1.0-Q8_0.rkllm`.

3. **Create a `Modelfile`:** Inside the same model folder (`~/rkllama/models/t-lite-it/`), create a text file named `Modelfile` (with no extension). This file tells RKLLama how to load and run your model.

   ```bash
   nano ~/rkllama/models/t-lite-it/Modelfile
   ```

   Add the following content to the `Modelfile`:

   ```
   FROM="T-lite-it-1.0-Q8_0.rkllm"
   HUGGINGFACE_PATH="Sinclear/T-lite-it-rk3588"
   SYSTEM="Ты — полезный, безопасный и этичный ассистент."
   TEMPERATURE=0.7
   ```

   - `FROM`: Must match the exact filename of your `.rkllm` file.
   - `HUGGINGFACE_PATH`: This is crucial. RKLLama uses this link to automatically download the correct tokenizer and chat template from your Hugging Face repository. An internet connection is required for this step, but it only happens once.
   - `SYSTEM` and `TEMPERATURE` are optional parameters for the system prompt (here in Russian: "You are a helpful, safe, and ethical assistant.") and the generation temperature. Adjust as needed.
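The `Modelfile` uses simple `KEY="value"` lines. As an optional sanity check before starting the server, the following sketch (a hypothetical helper, not part of RKLLama) verifies that the `FROM` entry names a file that actually exists in the model folder:

```python
# Hypothetical helper (not part of RKLLama): confirm that the Modelfile's
# FROM entry names a model file present in the same directory.
import os
import re


def check_modelfile(model_dir: str) -> bool:
    """Return True if the file named by FROM in model_dir/Modelfile exists."""
    with open(os.path.join(model_dir, "Modelfile"), encoding="utf-8") as fh:
        text = fh.read()
    # FROM lines have the form: FROM="T-lite-it-1.0-Q8_0.rkllm"
    match = re.search(r'^FROM="([^"]+)"', text, flags=re.MULTILINE)
    if match is None:
        raise ValueError("Modelfile has no FROM entry")
    return os.path.isfile(os.path.join(model_dir, match.group(1)))


# Usage, e.g.: check_modelfile(os.path.expanduser("~/rkllama/models/t-lite-it"))
```

If this returns `False`, the server would fail to load the model, so fixing the filename mismatch first saves a restart.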
### Step 3: Run the RKLLama Server
Now you can start the server, which will load your model onto the NPU and provide an API endpoint.
Start the server:

```bash
cd ~/rkllama
rkllama_server --models ./models
```

- The `--models` flag points to the directory containing your model folders.
- To see more detailed logs (e.g., for debugging), add the `--debug` flag: `rkllama_server --debug --models ./models`.

Once running, the server will be available at `http://0.0.0.0:8080` by default. It will automatically detect your RK3588 platform and configure the NPU.
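To confirm the server came up correctly, you can list the models it has loaded. This is a minimal sketch assuming the server mirrors Ollama's `GET /api/tags` endpoint (RKLLama advertises Ollama API compatibility); the host and port match the defaults above:

```python
# Hedged sketch: query an Ollama-compatible /api/tags endpoint and extract
# the model names from the JSON response.
import json
import urllib.request


def parse_tags(raw: bytes) -> list:
    """Pull model names out of an Ollama-style /api/tags JSON body."""
    return [m["name"] for m in json.loads(raw).get("models", [])]


def list_models(base_url: str = "http://localhost:8080") -> list:
    """Fetch and parse the model list from a running server."""
    with urllib.request.urlopen(base_url + "/api/tags", timeout=10) as resp:
        return parse_tags(resp.read())


# Usage (with the server running): expect "t-lite-it" in list_models()
```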
### Step 4: Interact with the Model
With the server running, you can interact with your model using the RKLLama client or directly via its API.
#### Option A: Using the RKLLama Client
Open a new terminal window (keep the server running in the first one).
List available models:

```bash
rkllama_client list
```

You should see your model `t-lite-it` in the list.

Start a chat session:

```bash
rkllama_client run t-lite-it
```

You can now type your messages and receive responses generated by the model on the NPU. The client will display statistics like generation speed (tokens/second).
#### Option B: Using the API (Ollama Compatible)
RKLLama's API is compatible with Ollama, allowing you to send requests from any application or script.
Example using `curl`:

```bash
curl http://localhost:8080/api/chat \
  -H "Content-Type: application/json" \
  -d '{
    "model": "t-lite-it",
    "messages": [{"role": "user", "content": "Расскажи вкратце о себе."}],
    "stream": false
  }'
```

Setting `"stream": false` will return the entire response at once.
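The same request can be sent from any language. Below is a minimal Python sketch using only the standard library; the payload and response shape follow the Ollama chat API convention, and the host/port match the server defaults (adjust for your setup):

```python
# Hedged sketch: send a non-streaming chat request to an Ollama-compatible
# /api/chat endpoint and return the assistant's reply.
import json
import urllib.request

API_URL = "http://localhost:8080/api/chat"  # adjust host/port if needed


def build_chat_request(model: str, prompt: str, stream: bool = False) -> bytes:
    """Serialize an Ollama-style chat payload to JSON bytes."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,
    }
    return json.dumps(payload).encode("utf-8")


def chat(model: str, prompt: str) -> str:
    """POST a chat request and return the reply text."""
    req = urllib.request.Request(
        API_URL,
        data=build_chat_request(model, prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        data = json.loads(resp.read())
    # Ollama-style servers put the reply under message.content
    return data["message"]["content"]


# Usage (with the server running):
#   reply = chat("t-lite-it", "Расскажи вкратце о себе.")
```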
**Connecting to Open WebUI:** In the settings of Open WebUI (or any other interface that supports an Ollama-compatible backend), set the Ollama API URL to `http://<IP-address-of-your-RK3588>:8080`. You can then select the `t-lite-it` model from the interface.
## Performance & Quality
This conversion was checked against the original model: in subjective comparisons, responses are indistinguishable from the base GGUF Q8_0 version. Running on the NPU gives the model interactive generation speeds, making it practical for real-time applications, and all data processing stays on the local device, preserving privacy.
## Acknowledgements
- Original model by t-tech.
- RKLLama server and tools by NotPunchnox.
- Conversion and testing by Sinclear.