# T-lite-it-rk3588
This repository contains the T-lite-it-1.0 language model, converted into the RKLLM format for optimized inference on the Rockchip RK3588 Neural Processing Unit (NPU).
The original model was developed by t-tech and quantized to GGUF (Q8_0). This version has been further adapted using Rockchip's tools to run efficiently on edge devices, leveraging the NPU for high performance and low latency.
## Model Details
This model is a Russian-language text generation model, suitable for building conversational AI assistants and other NLP applications directly on edge hardware.
- Base Model: T-lite-it-1.0-Q8_0-GGUF
- Format: RKLLM (`.rkllm`)
- Target Platform: Rockchip RK3588 (NPU)
- Language: Russian
- Quantization: INT8 (optimized for NPU). The conversion process was calibrated to preserve the quality of the original Q8_0 version; subjective evaluations showed no noticeable loss in generation quality.
- Performance (Expected): Based on similar models, inference speed on the RK3588 NPU is expected to be in the range of 12-20 tokens per second, with a memory footprint of approximately 1.7-2.0 GB.
## Getting Started with RKLLama
The recommended way to run this model is with RKLLama. RKLLama is a server and client specifically designed to run RKLLM models on Rockchip NPUs. It provides an API compatible with Ollama, making it easy to integrate with various front-ends like Open WebUI.
Follow these steps to get the model running on your RK3588 device.
### Prerequisites
- A device with a Rockchip RK3588 (or RK3576) SoC (e.g., Orange Pi 5, NanoPC-T6).
- An operating system installed (Ubuntu 24.04 arm64 or Armbian are recommended).
- Python 3.9–3.12 and `pip` installed.
- An internet connection for the initial setup (to download the model and tokenizer).
### Step 1: Install RKLLama
The easiest way is to install RKLLama directly from its GitHub repository using pip. It is recommended, but not required, to use a Python virtual environment.
Clone the repository and install the package:

```bash
git clone https://github.com/notpunchnox/rkllama
cd rkllama
python -m pip install .
```
### Step 2: Prepare the Model
You need to place the downloaded RKLLM model file into the correct directory structure that RKLLama expects.
1. **Create the model directory:** Navigate to the directory where you plan to run the server (e.g., `~/rkllama`) and create the `models` folder, with a subfolder for this specific model.

   ```bash
   mkdir -p ~/rkllama/models/t-lite-it
   ```

2. **Place the model file:** Download the `.rkllm` model file from this Hugging Face repository and move it into the folder you just created (`~/rkllama/models/t-lite-it/`). Let's assume the filename is `T-lite-it-1.0-Q8_0.rkllm`.

3. **Create a `Modelfile`:** Inside the same model folder (`~/rkllama/models/t-lite-it/`), create a text file named `Modelfile` (with no extension). This file tells RKLLama how to load and run your model.

   ```bash
   nano ~/rkllama/models/t-lite-it/Modelfile
   ```

   Add the following content to the `Modelfile`:

   ```
   FROM="T-lite-it-1.0-Q8_0.rkllm"
   HUGGINGFACE_PATH="Sinclear/T-lite-it-rk3588"
   SYSTEM="Ты — полезный, безопасный и этичный ассистент."
   TEMPERATURE=0.7
   ```

   - `FROM`: Must match the exact filename of your `.rkllm` file.
   - `HUGGINGFACE_PATH`: This is crucial. RKLLama uses this link to automatically download the correct tokenizer and chat template from your Hugging Face repository. An internet connection is required for this step, but it only happens once.
   - `SYSTEM` and `TEMPERATURE` are optional parameters for the system prompt (here in Russian: "You are a helpful, safe, and ethical assistant.") and the generation temperature. Adjust as needed.
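The `Modelfile` uses simple `KEY="value"` lines. As an optional sanity check before starting the server, the following sketch (a hypothetical helper, not part of RKLLama) verifies that the `FROM` entry names a file that actually exists in the model folder:

```python
# Hypothetical helper (not part of RKLLama): confirm that the Modelfile's
# FROM entry names a model file present in the same directory.
import os
import re


def check_modelfile(model_dir: str) -> bool:
    """Return True if the file named by FROM in model_dir/Modelfile exists."""
    with open(os.path.join(model_dir, "Modelfile"), encoding="utf-8") as fh:
        text = fh.read()
    # FROM lines have the form: FROM="T-lite-it-1.0-Q8_0.rkllm"
    match = re.search(r'^FROM="([^"]+)"', text, flags=re.MULTILINE)
    if match is None:
        raise ValueError("Modelfile has no FROM entry")
    return os.path.isfile(os.path.join(model_dir, match.group(1)))


# Usage, e.g.: check_modelfile(os.path.expanduser("~/rkllama/models/t-lite-it"))
```

If this returns `False`, the server would fail to load the model, so fixing the filename mismatch first saves a restart.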
### Step 3: Run the RKLLama Server
Now you can start the server, which will load your model onto the NPU and provide an API endpoint.
Start the server:

```bash
cd ~/rkllama
rkllama_server --models ./models
```

- The `--models` flag points to the directory containing your model folders.
- To see more detailed logs (e.g., for debugging), add the `--debug` flag: `rkllama_server --debug --models ./models`.

Once running, the server will be available at `http://0.0.0.0:8080` by default. It will automatically detect your RK3588 platform and configure the NPU.
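To confirm the server came up correctly, you can list the models it has loaded. This is a minimal sketch assuming the server mirrors Ollama's `GET /api/tags` endpoint (RKLLama advertises Ollama API compatibility); the host and port match the defaults above:

```python
# Hedged sketch: query an Ollama-compatible /api/tags endpoint and extract
# the model names from the JSON response.
import json
import urllib.request


def parse_tags(raw: bytes) -> list:
    """Pull model names out of an Ollama-style /api/tags JSON body."""
    return [m["name"] for m in json.loads(raw).get("models", [])]


def list_models(base_url: str = "http://localhost:8080") -> list:
    """Fetch and parse the model list from a running server."""
    with urllib.request.urlopen(base_url + "/api/tags", timeout=10) as resp:
        return parse_tags(resp.read())


# Usage (with the server running): expect "t-lite-it" in list_models()
```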
### Step 4: Interact with the Model
With the server running, you can interact with your model using the RKLLama client or directly via its API.
#### Option A: Using the RKLLama Client
Open a new terminal window (keep the server running in the first one).
List available models:

```bash
rkllama_client list
```

You should see your model `t-lite-it` in the list.

Start a chat session:

```bash
rkllama_client run t-lite-it
```

You can now type your messages and receive responses generated by the model on the NPU. The client will display statistics like generation speed (tokens/second).
#### Option B: Using the API (Ollama Compatible)
RKLLama's API is compatible with Ollama, allowing you to send requests from any application or script.
Example using `curl`:

```bash
curl http://localhost:8080/api/chat \
  -H "Content-Type: application/json" \
  -d '{
    "model": "t-lite-it",
    "messages": [{"role": "user", "content": "Расскажи вкратце о себе."}],
    "stream": false
  }'
```

Setting `"stream": false` will return the entire response at once.
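The same request can be sent from any language. Below is a minimal Python sketch using only the standard library; the payload and response shape follow the Ollama chat API convention, and the host/port match the server defaults (adjust for your setup):

```python
# Hedged sketch: send a non-streaming chat request to an Ollama-compatible
# /api/chat endpoint and return the assistant's reply.
import json
import urllib.request

API_URL = "http://localhost:8080/api/chat"  # adjust host/port if needed


def build_chat_request(model: str, prompt: str, stream: bool = False) -> bytes:
    """Serialize an Ollama-style chat payload to JSON bytes."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,
    }
    return json.dumps(payload).encode("utf-8")


def chat(model: str, prompt: str) -> str:
    """POST a chat request and return the reply text."""
    req = urllib.request.Request(
        API_URL,
        data=build_chat_request(model, prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        data = json.loads(resp.read())
    # Ollama-style servers put the reply under message.content
    return data["message"]["content"]


# Usage (with the server running):
#   reply = chat("t-lite-it", "Расскажи вкратце о себе.")
```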
**Connecting to Open WebUI:** In the settings of Open WebUI (or any other interface that supports an Ollama-compatible backend), set the Ollama API URL to `http://<IP-address-of-your-RK3588>:8080`. You can then select the `t-lite-it` model from the interface.
## Performance & Quality
This conversion was checked against the original model: in subjective comparisons, responses are indistinguishable from the base GGUF Q8_0 version. Running on the NPU gives the model interactive generation speeds, making it practical for real-time applications, and all data processing stays on the local device, preserving privacy.
## Acknowledgements
- Original model by t-tech.
- RKLLama server and tools by NotPunchnox.
- Conversion and testing by Sinclear.