---
library_name: transformers
pipeline_tag: text-generation
base_model: openai-community/gpt2
language:
- en
tags:
- transformers
- pytorch
- gguf
- gpt2
- gpt2-small
- 117M
- text-generation
- conversational
- grpo
- vae
- kv-cache
- distillation
- reinforcement-learning
- openclaw
- fallback-agent
- soul-md
- agent-framework
- tool-use
- task-automation
- dpo
- tool-masking
- uncertainty-estimation
- rag
- semantic-cache
- quantization
- pruning
- arxiv:2402.03300
license: apache-2.0
---

# microclaw-for-openclaw: Fallback Agent for OpenClaw (v2026.2.17)

**Model ID:** `webxos/microclaw-for-openclaw-version-2026.2.17`

**Tags:** `openclaw`, `fallback-agent`, `grpo`, `vae`, `kv-cache`, `dpo`, `tool-masking`, `uncertainty`, `rag`, `semantic-cache`, `soul.md`, `huggingface-space`, `gguf`, `llm-distillation`

---

## Overview

**microclaw** (v2026.2.17) is a lightweight, distilled language model that serves as a **fallback agent** for the [OpenClaw](https://openclaw.org) ecosystem. When the primary agent loses connectivity or must operate offline, microclaw steps in to handle essential system tasks: file management, status checks, cron jobs, and simple Q&A.

**WARNING: You will need to train your own GGUF model locally. The `microclaw.gguf` shipped in this repo is a lightweight placeholder so users can scale and build their own local models with llama.cpp.** This model is still under development and testing, so you will need to configure your own build locally from scratch. This version integrates directly with OpenClaw on port 18789, and this README presents multiple (and optional) ways to configure the agent on Debian-based Linux machines.

This version introduces **advanced training and inference enhancements**:

- **Tool-use masking** and **schema-first training** for reliable function calling.
- **Uncertainty estimation** with configurable thresholds for safe escalation.
- **Retrieval-Augmented Generation (RAG)** with semantic chunking.
- **Semantic KV-cache** for high-similarity query reuse.
- **Quantization (down to 2-bit)** and **pruning** for extreme memory efficiency.

The repository contains the full and partially trained model files, configuration (`soul.md`, `AGENTS.md`, `HEARTBEAT.md`, `SECURITY.md`), and export bundles ready for deployment to **Hugging Face Spaces** or local execution with OpenClaw.

---

## Key Features

- **GRPO (Group Relative Policy Optimization)**: trains the agent with group-wise advantage estimation for stable policy updates.
- **VAE Filter**: a Variational Autoencoder that filters low-quality training samples, improving output coherence.
- **Tool-Use Masking**: masks non-tool tokens during training to enforce strict schema adherence (JSON/YAML).
- **DPO (Direct Preference Optimization)**: fine-tunes on preference pairs to reduce hallucinations and improve helpfulness.
- **Uncertainty Estimation**: monitors token-level entropy and escalates to safe responses when confidence drops below a threshold.
- **RAG (Retrieval-Augmented Generation)**: retrieves relevant chunks from a local knowledge base (FAISS) to ground responses.
- **Semantic Cache**: reuses previous generations for semantically similar queries, reducing latency and cost.
- **Quantization & Pruning**: compress the model to 2-8 bits and prune unimportant weights; backend support for AutoGPTQ, llama.cpp (GGUF), and bitsandbytes.
- **KV-Cache**: intelligent reuse of key/value states reduces inference latency by up to 78% (measured on local benchmarks).
- **Soul.md Configuration**: define personality, sub-agent rules, proactive tasks, and prompt injection defenses in plain Markdown.
- **Export Ready**: one-click export to a **Hugging Face Space** (Docker-based) or a portable ZIP archive.
- **Quantized (4-bit GGUF)**: optimized for memory-constrained environments; runs smoothly on CPU.

---

### Part 1: Installation

Included are multiple guides for implementing Microclaw into your custom build, with steps to further train the GGUF file locally.

**Read all steps carefully and pick the guide that matches your use case and setup. Not all options may work on your system; these guides are written for Debian-based Linux systems.**

# 1.1 Installation Guide + System Update & Basic Tools

```bash
sudo apt update
sudo apt upgrade -y
sudo apt install -y curl wget git build-essential
```

# 1.2 Install Docker (for containerized execution)

```bash
# Add Docker's official GPG key and repository
curl -fsSL https://download.docker.com/linux/debian/gpg | sudo gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg
echo "deb [arch=amd64 signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/debian bullseye stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null

# Install Docker Engine
sudo apt update
sudo apt install -y docker-ce docker-ce-cli containerd.io

# Add your user to the docker group (avoid sudo for every command)
sudo usermod -aG docker $USER
newgrp docker  # activate group changes in current shell
```

# 1.3 Install Node.js (v22 or later) & TypeScript

```bash
# Use the NodeSource repository for a modern Node.js version
curl -fsSL https://deb.nodesource.com/setup_22.x | sudo -E bash -
sudo apt install -y nodejs

# Install TypeScript globally
sudo npm install -g typescript

# Verify
node --version  # should be v22.x or higher
tsc --version
```

# 1.4 Install SQLite (for memory & logs)

```bash
sudo apt install -y sqlite3 libsqlite3-dev
```

# Part 2: Microclaw Fallback Agent

The Microclaw agent is a Python-based service (Flask + Transformers) that communicates with OpenClaw. You can install it using either a Python virtual environment (lightweight) or Conda (more reliable for PyTorch).
Choose one method below.

# 2.1 Clone the Microclaw Repository

Create a parent directory for all agents:

```bash
sudo mkdir -p /opt/openclaw-agents
sudo chown -R $USER:$USER /opt/openclaw-agents
cd /opt/openclaw-agents

# Clone the Hugging Face repo (includes model files and soul configuration)
git lfs install
git clone https://huggingface.co/webxos/microclaw-for-openclaw-version-2026.2.17 microclaw-fallback
cd microclaw-fallback
```

Note: the `.gguf` model files are several hundred MB. If the download is interrupted, `git lfs` can resume. After cloning, verify the file sizes:

```bash
ls -lh *.gguf
```

They should be >100 MB, not 28 bytes. If they are still placeholders, run `git lfs pull` manually.

# 2.2 Option A: Install with Python Virtual Environment (venv)

```bash
# Create and activate a virtual environment
python3 -m venv venv
source venv/bin/activate

# Upgrade pip and install dependencies
pip install --upgrade pip
pip install -r requirements.txt
```

If `requirements.txt` is missing, install the core packages manually:

```bash
pip install flask transformers torch sentence-transformers faiss-cpu --extra-index-url https://download.pytorch.org/whl/cpu
```

# 2.3 Option B: Install with Conda (Recommended for unstable networks)

```bash
# Download and install Miniconda (if not already present)
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b -p $HOME/miniconda3
source ~/miniconda3/bin/activate

# Create a dedicated environment with Python 3.11
conda create -y -n microclaw python=3.11
conda activate microclaw

# Install CPU-only PyTorch (smaller, more reliable)
conda install -y pytorch torchvision torchaudio cpuonly -c pytorch

# Install the rest via pip
pip install flask transformers sentence-transformers faiss-cpu
```

# 2.4 Test the Agent Manually

```bash
# Make sure you are in the agent directory with the environment activated
python main.py
```

You should see output like: * Running on
http://127.0.0.1:18789. Press Ctrl+C to stop it.

# Part 3: Configure OpenClaw to Use the Microclaw Fallback

OpenClaw reads its configuration from a TOML file (typically `~/.config/openclaw/config.toml` or `/etc/openclaw/config.toml`). You need to point it to your local Microclaw instance.

Find the port Microclaw listens on (the default is 18789, defined in `main.py`):

```bash
grep port main.py
```

Edit the OpenClaw configuration (create it if it doesn't exist):

```bash
mkdir -p ~/.config/openclaw
nano ~/.config/openclaw/config.toml
```

Add or modify the `[agent.fallback]` section:

```toml
[agent.fallback]
path = "/opt/openclaw-agents/microclaw-fallback"
port = 18789
enabled = true
```

If OpenClaw is already installed, restart it. (If you haven't installed OpenClaw yet, see Part 4 below.)

# Part 4: Install & Run OpenClaw (the main framework)

The OpenClaw core is a Node.js/TypeScript application. You can run it directly from source or use the provided Docker image.

# 4.1 Run OpenClaw via Docker (easiest)

```bash
# Pull the official OpenClaw image (adjust tag as needed)
docker pull openclaw/openclaw:latest

# Run the container, mounting the config and agents directories
docker run -d \
  --name openclaw \
  -p 3000:3000 \
  -v ~/.config/openclaw:/home/node/.config/openclaw \
  -v /opt/openclaw-agents:/opt/openclaw-agents \
  openclaw/openclaw:latest
```

# 4.2 Run OpenClaw from Source (for development)

```bash
# Clone the OpenClaw repository
git clone https://github.com/openclaw/core.git openclaw-core
cd openclaw-core

# Install dependencies
yarn install

# Build TypeScript
yarn build

# Start OpenClaw (it reads the config from ~/.config/openclaw/config.toml)
yarn start
```

# Part 5: Verify the Integration

Check that Microclaw is running (either manually or via systemd):

```bash
curl http://localhost:18789/health
```

# Guide to Microclaw Auto-Start (systemd)

To ensure the fallback agent starts on boot and restarts if it crashes, create a systemd service.
Create the service file:

```bash
sudo nano /etc/systemd/system/microclaw-fallback.service
```

Paste (adjust `User` and paths to match your setup):

```ini
[Unit]
Description=Microclaw Fallback Agent for OpenClaw
After=network.target

[Service]
Type=simple
User=kali
WorkingDirectory=/opt/openclaw-agents/microclaw-fallback
Environment="PATH=/opt/openclaw-agents/microclaw-fallback/venv/bin"
ExecStart=/opt/openclaw-agents/microclaw-fallback/venv/bin/python /opt/openclaw-agents/microclaw-fallback/main.py
Restart=on-failure
RestartSec=10

[Install]
WantedBy=multi-user.target
```

Enable and start:

```bash
sudo systemctl daemon-reload
sudo systemctl enable microclaw-fallback.service
sudo systemctl start microclaw-fallback.service
```

Check status:

```bash
sudo systemctl status microclaw-fallback.service
```

### ALTERNATIVE GUIDE - Installing via llama.cpp instead:

# Prerequisites: Essential System Tools

You need a few standard command-line tools. Open a terminal and run:

```bash
# Update your package list and install curl, wget, git, and build tools
sudo apt update && sudo apt upgrade -y
sudo apt install -y curl wget git build-essential
```

## Step 1: Download the Model with Git LFS

The model files are hosted in a Git repository and require Git Large File Storage (LFS) to download the actual GGUF files.

# 1.1: Install Git LFS

```bash
sudo apt install -y git-lfs
git lfs install
```

# 1.2: Create a directory for your models and clone the repository

```bash
mkdir -p ~/models
cd ~/models
git clone https://huggingface.co/webxos/microclaw-for-openclaw-version-2026.2.17 microclaw-fallback
cd microclaw-fallback
```

# 1.3: Ensure the GGUF files are fully downloaded

```bash
git lfs pull
```

Verification: after cloning, check that the `.gguf` files are present and a reasonable size (several hundred MB, not 28 bytes):

```bash
ls -lh *.gguf
```

If the files are small placeholders, run `git lfs pull` again.
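You can also script this sanity check. A minimal sketch (not part of the repo; the helper name is ours) that distinguishes a real GGUF file from an un-pulled Git LFS pointer stub by size and by the `GGUF` magic bytes at the start of the file:

```python
from pathlib import Path

def check_gguf(path: str) -> tuple[bool, str]:
    """Heuristic check: a real GGUF file starts with the ASCII magic b'GGUF'
    and is far larger than a Git LFS pointer stub (~130 bytes of text)."""
    p = Path(path)
    if not p.exists():
        return False, "missing"
    if p.stat().st_size < 1024:
        return False, "placeholder (run `git lfs pull`)"
    with open(p, "rb") as f:
        if f.read(4) != b"GGUF":
            return False, "not a GGUF file"
    return True, "ok"

if __name__ == "__main__":
    import glob
    # Report on every .gguf file in the current directory
    for name in glob.glob("*.gguf"):
        print(name, *check_gguf(name))
```

Run it from inside the cloned `microclaw-fallback` directory after `git lfs pull`.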
## Step 2: Set Up the llama.cpp Server

Now download, compile, and set up llama.cpp with its built-in server.

# 2.1: Clone the llama.cpp repository

```bash
cd ~/models
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
```

# 2.2: Compile llama.cpp (this may take a few minutes)

```bash
make -j4
```

(Recent llama.cpp releases build with CMake and name the binary `llama-server` instead of `server`; adjust the commands below accordingly.)

## 3. (Optional but recommended) Install the Python dependencies for the server

This step requires Python/pip, but it's a one-time, isolated setup.

```bash
sudo apt install -y python3-pip python3-venv
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```

# Step 3.1: Run the Model Server

Now start the server, pointing it to the GGUF model file you downloaded. Make sure you are in the llama.cpp directory with the virtual env activated:

```bash
cd ~/models/llama.cpp
source venv/bin/activate

# Find the exact GGUF filename (replace with the actual filename you have)
MODEL_FILE=~/models/microclaw-fallback/microclaw-for-openclaw-version-2026.2.17.Q4_K_M.gguf

# Run the server
./server -m $MODEL_FILE \
  --host 0.0.0.0 \
  --port 8000 \
  -c 2048 \
  -ngl 0  # Use -ngl 33 if you have an NVIDIA GPU and compiled with CUDA support
```

Explanation of flags:

- `-m $MODEL_FILE`: path to your GGUF model.
- `--host 0.0.0.0`: listen on all network interfaces (so OpenClaw can connect).
- `--port 8000`: the port the server will use.
- `-c 2048`: context size (adjust based on model requirements).
- `-ngl 0`: number of layers to offload to GPU. Use `-ngl 33` (or more) if you have an NVIDIA GPU and compiled with CUDA.

Keep this terminal window open. The server is now running and ready to accept requests.

# Step 4: Test the Server

Open a new terminal and test the API to ensure it's working correctly.

```bash
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "What is the capital of France?",
    "max_tokens": 50,
    "temperature": 0.7
  }'
```

You should receive a JSON response containing the model's generated text.
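The same test can be driven from Python using only the standard library. A minimal client sketch against the OpenAI-compatible `/v1/completions` endpoint shown above (the helper names are illustrative, not part of llama.cpp):

```python
import json
import urllib.request

def completion_payload(prompt: str, max_tokens: int = 50,
                       temperature: float = 0.7) -> dict:
    """Build the JSON body for llama.cpp's /v1/completions endpoint."""
    return {"prompt": prompt, "max_tokens": max_tokens, "temperature": temperature}

def complete(base_url: str, prompt: str, **kwargs) -> str:
    """POST a completion request and return the generated text."""
    body = json.dumps(completion_payload(prompt, **kwargs)).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/v1/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        data = json.load(resp)
    # llama.cpp mirrors the OpenAI completions response schema
    return data["choices"][0]["text"]
```

With the server running, call `complete("http://localhost:8000", "What is the capital of France?")` from a Python shell.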
# Step 5: Configure OpenClaw to Use the Local Server

Now configure OpenClaw to use this local server as its fallback agent.

Locate OpenClaw's configuration file. This is often `~/.config/openclaw/config.toml`, `/etc/openclaw/config.toml`, or a `.env` file in the OpenClaw directory.

Edit the configuration to define a custom provider that points to your local server. The exact variable names depend on your OpenClaw version, but it generally looks something like this:

```toml
[agent.fallback]
provider = "custom"       # or "openai-compatible"
base_url = "http://localhost:8000/v1"
api_key = "not-needed"    # llama.cpp server doesn't require a key
model = "microclaw"       # optional: model name
enabled = true
```

If OpenClaw uses environment variables (e.g., in a `.env` file), you might set:

```text
OPENCLAW_FALLBACK_PROVIDER=custom
OPENCLAW_CUSTOM_BASE_URL=http://localhost:8000/v1
OPENCLAW_CUSTOM_API_KEY=not-needed
```

Restart OpenClaw for the changes to take effect.

# How to Run the Server as a Background Service

To have the server start automatically on boot and restart if it crashes, you can create a systemd service.
Create the service file:

```bash
sudo nano /etc/systemd/system/microclaw-llama.service
```

Paste the following (adjust `User`, `WorkingDirectory`, and `ExecStart` paths as needed):

```ini
[Unit]
Description=llama.cpp server for Microclaw
After=network.target

[Service]
Type=simple
User=kali
WorkingDirectory=/home/kali/models/llama.cpp
ExecStart=/home/kali/models/llama.cpp/server -m /home/kali/models/microclaw-fallback/microclaw-for-openclaw-version-2026.2.17.Q4_K_M.gguf --host 0.0.0.0 --port 8000 -c 2048 -ngl 0
Restart=on-failure
RestartSec=10

[Install]
WantedBy=multi-user.target
```

Then enable and start the service:

```bash
sudo systemctl daemon-reload
sudo systemctl enable microclaw-llama.service
sudo systemctl start microclaw-llama.service
sudo systemctl status microclaw-llama.service  # check that it's running
```

### ADVANCED GUIDE: TRAINING MICROCLAW.GGUF MODEL LOCALLY

This guide adapts the full microclaw pipeline to run entirely on a low-end machine such as an 8GB RAM laptop or even a Raspberry Pi 5. We'll use a tiny base model (0.5B-1B parameters), parameter-efficient fine-tuning (LoRA) on CPU, and extreme quantization (2-bit) to produce a GGUF file that runs smoothly on consumer hardware.

The final system provides:

- A **local training script** that fits in 8GB RAM (CPU only).
- A **FastAPI server** (`server.py`) serving a retro MS-DOS-style CLI dashboard on `localhost:8080`.
- **Local API endpoints** for inference, file management, cron jobs, and RAG.
- **SQLite** as a local database (conversation history, cache, RAG index).
- Integration with **llama.cpp** for efficient GGUF inference.

---

## Prerequisites

- **Hardware**: x86_64 or ARM64 (Raspberry Pi 5) with **at least 8GB RAM**.
- **OS**: Debian 12 / Kali Linux / Raspberry Pi OS (64-bit).
- **Storage**: 10GB free space.
- **Software**: Python 3.10+, Git, CMake, build tools.
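The hardware and software requirements above can be checked programmatically before you start a multi-hour training run. A small sketch (Linux-specific, reads `/proc/meminfo`; the helper names are ours):

```python
import sys

def mem_total_kb(meminfo_text: str) -> int:
    """Extract MemTotal (in kB) from /proc/meminfo-style text."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            return int(line.split()[1])
    raise ValueError("MemTotal not found")

def check_prereqs(min_ram_gb: float = 8.0) -> list[str]:
    """Return a list of problems; an empty list means the box looks OK."""
    problems = []
    if sys.version_info < (3, 10):
        problems.append(f"Python 3.10+ required, found {sys.version.split()[0]}")
    try:
        with open("/proc/meminfo") as f:
            ram_gb = mem_total_kb(f.read()) / (1024 * 1024)
        if ram_gb < min_ram_gb:
            problems.append(f"only {ram_gb:.1f} GB RAM, {min_ram_gb} GB recommended")
    except FileNotFoundError:
        problems.append("/proc/meminfo not found (non-Linux system?)")
    return problems

if __name__ == "__main__":
    print(check_prereqs() or "OK")
```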
---

## Step 1: Environment Setup

```bash
cd /home/kali/microclaw
python3 -m venv venv
source venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
```

**`requirements.txt`** (CPU-optimized, no CUDA dependencies; note that pip requirements options such as `--extra-index-url` must sit on their own line):

```
--extra-index-url https://download.pytorch.org/whl/cpu
torch==2.2.0
transformers>=4.38.0
accelerate
datasets
trl>=0.8.0
peft
bitsandbytes
scipy
sentencepiece
protobuf
fastapi
uvicorn
sqlite-utils
pydantic
pyyaml
jinja2
aiofiles
llama-cpp-python
```

---

## Step 2: Build llama.cpp (for conversion & inference)

llama.cpp provides the tools to convert Hugging Face models to GGUF and run them efficiently on CPU.

```bash
cd /home/kali
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
mkdir build && cd build
cmake .. -DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS  # optional: enables BLAS for speed
make -j$(nproc)
```

After compilation, the `convert-hf-to-gguf.py` script will be in `llama.cpp/` (not in `build/`). We'll use it later.

---

## Step 3: Prepare the Dataset

You need a small dataset (a few hundred to a few thousand examples) for fine-tuning and DPO. Place JSONL files in `data/raw/`.

### 3.1 Tool-use data (schema-first)

Each line:

```json
{ "instruction": "List files in /home", "tools": ["ls"], "response": "ls /home" }
```

### 3.2 Preference data (for DPO)

Each line:

```json
{ "prompt": "What is the weather?", "chosen": "I cannot check live weather, but you can use the 'weather' tool.", "rejected": "I don't know." }
```

If you don't have preference data, you can skip DPO by setting `dpo: false` in the config.

### 3.3 RAG documents (optional)

Place plain text files (`.txt`) in `data/rag_docs/`. The training script will chunk them and store embeddings in SQLite.

---

## Step 4: Configuration (`config.yaml`)

Edit this file to match your paths and training preferences.
```yaml
# config.yaml
model:
  base_model_name: "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # or "Qwen/Qwen2.5-0.5B"
  cache_dir: "models/base"

training:
  output_dir: "models/lora"
  per_device_train_batch_size: 1
  gradient_accumulation_steps: 4
  learning_rate: 2e-4
  num_train_epochs: 3
  max_seq_length: 512
  use_lora: true
  lora_r: 8
  lora_alpha: 16
  lora_dropout: 0.05
  dpo: true
  dpo_beta: 0.1
  # CPU optimizations
  dataloader_num_workers: 0
  save_steps: 100
  logging_steps: 10

data:
  train_file: "data/raw/train.jsonl"
  eval_file: "data/raw/eval.jsonl"               # optional
  preference_file: "data/raw/preferences.jsonl"  # for DPO

rag:
  enabled: true
  chunk_size: 500
  chunk_overlap: 50
  embedding_model: "all-MiniLM-L6-v2"  # tiny, runs on CPU
  db_path: "db/microclaw.db"

server:
  host: "0.0.0.0"
  port: 8080
  model_path: "models/microclaw.gguf"
  context_size: 2048
  max_tokens: 512
  temperature: 0.7
```

---

## Step 5: Training Script (`train.py`)

This script performs supervised fine-tuning (SFT) on instruction data, optionally followed by DPO, and finally merges the LoRA weights and saves the full model. It is heavily optimized for low RAM (CPU) usage.
---

# Folder Structure (to be created)

```
/home/kali/microclaw/
├── server.py          # FastAPI server (inference + static files + API)
├── train.py           # CPU-optimized fine-tuning + DPO script
├── requirements.txt
├── config.yaml
├── data/
│   ├── raw/           # Place your JSONL datasets here
│   └── rag_docs/      # Text files for RAG (optional)
├── models/
│   ├── base/          # Will contain the downloaded base model
│   ├── lora/          # LoRA adapters after training
│   └── microclaw.gguf # Final quantized model (after conversion)
├── static/
│   ├── index.html     # Main dashboard (CLI style)
│   ├── style.css
│   ├── script.js
│   └── pages/         # Additional pages (file manager, cron, etc.)
│       ├── files.html
│       ├── cron.html
│       └── rag.html
├── db/
│   └── microclaw.db   # SQLite database (auto-created)
└── logs/
    └── training.log
```

---

The full `train.py` described in Step 5:
```python
#!/usr/bin/env python3
# train.py - CPU-only fine-tuning with LoRA + optional DPO

import logging
import os

import torch
import yaml
from datasets import load_dataset
from peft import LoraConfig, TaskType, get_peft_model
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)
from trl import DPOTrainer

# Load config
with open("config.yaml") as f:
    config = yaml.safe_load(f)

# Setup logging (create logs/ if it doesn't exist yet)
os.makedirs("logs", exist_ok=True)
logging.basicConfig(level=logging.INFO, filename="logs/training.log", filemode="w")
logger = logging.getLogger(__name__)


def main():
    # 1. Load tokenizer
    tokenizer = AutoTokenizer.from_pretrained(
        config["model"]["base_model_name"],
        cache_dir=config["model"]["cache_dir"],
    )
    tokenizer.pad_token = tokenizer.eos_token

    # 2. Load base model.
    # bitsandbytes 8-bit loading is not supported on CPU, so we load in
    # float32 and rely on LoRA to keep the trainable footprint small.
    model = AutoModelForCausalLM.from_pretrained(
        config["model"]["base_model_name"],
        cache_dir=config["model"]["cache_dir"],
        torch_dtype=torch.float32,  # CPU uses float32
        low_cpu_mem_usage=True,
    )

    # 3. Prepare LoRA
    if config["training"]["use_lora"]:
        lora_config = LoraConfig(
            task_type=TaskType.CAUSAL_LM,
            r=config["training"]["lora_r"],
            lora_alpha=config["training"]["lora_alpha"],
            lora_dropout=config["training"]["lora_dropout"],
            target_modules=["q_proj", "v_proj"],  # adjust for your model
        )
        model = get_peft_model(model, lora_config)
        model.print_trainable_parameters()

    # 4. Load dataset
    dataset = load_dataset("json", data_files=config["data"]["train_file"], split="train")
    if config["data"].get("eval_file"):
        eval_dataset = load_dataset("json", data_files=config["data"]["eval_file"], split="train")
    else:
        eval_dataset = None

    # Format prompt: "### Instruction:\n{instruction}\n\n### Response:\n{response}"
    def format_func(example):
        text = (
            f"### Instruction:\n{example['instruction']}\n\n"
            f"### Response:\n{example['response']}{tokenizer.eos_token}"
        )
        return {"text": text}

    dataset = dataset.map(format_func)
    if eval_dataset:
        eval_dataset = eval_dataset.map(format_func)

    # Tokenize
    def tokenize(element):
        return tokenizer(
            element["text"],
            truncation=True,
            max_length=config["training"]["max_seq_length"],
            padding=False,
        )

    dataset = dataset.map(tokenize, remove_columns=dataset.column_names)
    if eval_dataset:
        eval_dataset = eval_dataset.map(tokenize, remove_columns=eval_dataset.column_names)

    # 5. Training arguments (CPU-friendly)
    training_args = TrainingArguments(
        output_dir=config["training"]["output_dir"],
        per_device_train_batch_size=config["training"]["per_device_train_batch_size"],
        gradient_accumulation_steps=config["training"]["gradient_accumulation_steps"],
        learning_rate=config["training"]["learning_rate"],
        num_train_epochs=config["training"]["num_train_epochs"],
        logging_steps=config["training"]["logging_steps"],
        save_steps=config["training"]["save_steps"],
        evaluation_strategy="steps" if eval_dataset else "no",
        eval_steps=config["training"]["save_steps"],
        save_total_limit=2,
        load_best_model_at_end=bool(eval_dataset),
        metric_for_best_model="eval_loss",
        greater_is_better=False,
        fp16=False,  # CPU doesn't support fp16
        bf16=False,
        dataloader_num_workers=0,  # avoid multiprocessing issues
        optim="adamw_torch",
        torch_compile=False,  # no speedup on CPU
    )

    # 6. Trainer (SFT)
    # The LM collator pads each batch and sets labels = input_ids,
    # which causal-LM training needs to compute a loss.
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=dataset,
        eval_dataset=eval_dataset,
        tokenizer=tokenizer,
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    logger.info("Starting SFT training...")
    trainer.train()
    trainer.save_model()  # saves LoRA adapters

    # 7. Optional DPO training
    if config["training"]["dpo"] and config["data"].get("preference_file"):
        logger.info("Loading preference data for DPO...")
        pref_dataset = load_dataset("json", data_files=config["data"]["preference_file"], split="train")
        # Simplified: reuse the same model with LoRA attached. With
        # ref_model=None the trainer builds a frozen reference copy itself.
        # Note: this signature matches trl ~0.8; newer trl versions move
        # beta/max_length into a DPOConfig object.
        dpo_trainer = DPOTrainer(
            model=model,
            ref_model=None,
            args=training_args,  # reuse the same args (adjust for DPO as needed)
            train_dataset=pref_dataset,
            tokenizer=tokenizer,
            beta=config["training"]["dpo_beta"],
            max_length=config["training"]["max_seq_length"],
            max_prompt_length=256,
        )
        logger.info("Starting DPO training...")
        dpo_trainer.train()
        dpo_trainer.save_model(config["training"]["output_dir"] + "_dpo")

    # 8. Merge LoRA and save the full model (for GGUF conversion)
    logger.info("Merging LoRA weights...")
    merged_model = model.merge_and_unload()
    merged_model.save_pretrained("models/merged")
    tokenizer.save_pretrained("models/merged")
    logger.info("Merged model saved to models/merged")


if __name__ == "__main__":
    main()
```

**Run training**:

```bash
python train.py
```

*Note: training a 1B model on CPU with batch size 1 may take several hours to days depending on dataset size. Reduce epochs or dataset size for testing.*

---

## Step 6: Convert to GGUF

After training, we have a merged Hugging Face model in `models/merged/`. Now use llama.cpp's conversion script.
```bash
cd /home/kali/llama.cpp

# 1. Convert the merged HF model to GGUF. convert-hf-to-gguf.py emits
#    f32/f16/q8_0; it does not produce k-quants directly, so convert to
#    f16 first.
python convert-hf-to-gguf.py /home/kali/microclaw/models/merged \
  --outfile /home/kali/microclaw/models/microclaw-f16.gguf \
  --outtype f16

# 2. Quantize to 2-bit with the quantize tool built in Step 2
#    (named llama-quantize in recent llama.cpp releases)
./build/bin/quantize /home/kali/microclaw/models/microclaw-f16.gguf \
  /home/kali/microclaw/models/microclaw.gguf Q2_K
```

For Raspberry Pi, `Q2_K` is ideal. You can also try `Q3_K_S` if you have more RAM.

---

## Step 7: Build the FastAPI Server (`server.py`)

This server serves:

- Static files (the CLI dashboard) from the `static/` folder.
- API endpoints for inference, file management, cron, and RAG.
- A SQLite database for conversation history and RAG cache.

```python
#!/usr/bin/env python3
# server.py - FastAPI server with GGUF inference and static dashboard

import os
import sqlite3
from typing import Optional

import uvicorn
import yaml
from fastapi import FastAPI
from fastapi.responses import HTMLResponse
from fastapi.staticfiles import StaticFiles
from llama_cpp import Llama
from pydantic import BaseModel

# Load config
with open("config.yaml") as f:
    config = yaml.safe_load(f)

# Initialize SQLite (create the db/ directory if needed)
DB_PATH = config["rag"]["db_path"]
os.makedirs(os.path.dirname(DB_PATH), exist_ok=True)
conn = sqlite3.connect(DB_PATH, check_same_thread=False)
cursor = conn.cursor()
cursor.execute("""
CREATE TABLE IF NOT EXISTS history (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    prompt TEXT,
    response TEXT,
    timestamp DATETIME DEFAULT CURRENT_TIMESTAMP
)
""")
cursor.execute("""
CREATE TABLE IF NOT EXISTS rag_cache (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    query TEXT UNIQUE,
    chunks TEXT,
    embedding BLOB
)
""")
conn.commit()

# Load GGUF model
model_path = config["server"]["model_path"]
llm = Llama(
    model_path=model_path,
    n_ctx=config["server"]["context_size"],
    n_threads=os.cpu_count(),
    n_gpu_layers=0,  # CPU only
    verbose=False,
)

app = FastAPI(title="microclaw Gateway")

# Mount static files
app.mount("/static", StaticFiles(directory="static"), name="static")


# API models
class PromptRequest(BaseModel):
    prompt: str
    max_tokens: Optional[int] = 256
    temperature: Optional[float] = 0.7
    use_rag: Optional[bool] = False


class ToolRequest(BaseModel):
    tool: str
    args: dict


# Simple RAG (placeholder; a real implementation would use embeddings)
def retrieve_chunks(query: str) -> str:
    return "Relevant document chunk about file management."


@app.get("/", response_class=HTMLResponse)
async def root():
    with open("static/index.html") as f:
        return f.read()


@app.post("/api/chat")
async def chat(req: PromptRequest):
    # Optionally enhance the prompt with RAG
    if req.use_rag:
        context = retrieve_chunks(req.prompt)
        augmented_prompt = f"Context: {context}\n\nQuestion: {req.prompt}\nAnswer:"
    else:
        augmented_prompt = req.prompt

    # Call the model
    output = llm(
        augmented_prompt,
        max_tokens=req.max_tokens,
        temperature=req.temperature,
        stop=["</s>", "###"],
        echo=False,
    )
    response = output["choices"][0]["text"].strip()

    # Save to history
    cursor.execute("INSERT INTO history (prompt, response) VALUES (?, ?)",
                   (req.prompt, response))
    conn.commit()
    return {"response": response}


@app.get("/api/history")
async def get_history(limit: int = 50):
    cursor.execute(
        "SELECT prompt, response, timestamp FROM history ORDER BY timestamp DESC LIMIT ?",
        (limit,),
    )
    rows = cursor.fetchall()
    return [{"prompt": r[0], "response": r[1], "timestamp": r[2]} for r in rows]


@app.post("/api/tool")
async def run_tool(req: ToolRequest):
    # Example: execute system commands (sandboxed)
    if req.tool == "ls":
        path = req.args.get("path", ".")
        try:
            files = os.listdir(path)
            return {"output": "\n".join(files)}
        except Exception as e:
            return {"error": str(e)}
    elif req.tool == "cron_list":
        # Parsing the real crontab requires user permissions;
        # for the demo, return a placeholder entry
        return {"output": "0 5 * * * /home/kali/backup.sh"}
    else:
        return {"error": "Unknown tool"}


if __name__ == "__main__":
    uvicorn.run(app, host=config["server"]["host"], port=config["server"]["port"])
```

---

## Step 8: Run the Server

```bash
cd /home/kali/microclaw
source venv/bin/activate
python server.py
```
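With the server up, you can exercise the chat endpoint from another terminal. A minimal stdlib client sketch (the payload fields mirror `server.py`'s `PromptRequest` model; the helper names are ours):

```python
import json
import urllib.request

def chat_payload(prompt: str, use_rag: bool = False,
                 max_tokens: int = 256, temperature: float = 0.7) -> dict:
    """Build a JSON body matching server.py's PromptRequest model."""
    return {
        "prompt": prompt,
        "use_rag": use_rag,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }

def chat(prompt: str, base_url: str = "http://localhost:8080", **kwargs) -> str:
    """POST to /api/chat and return the model's response text."""
    body = json.dumps(chat_payload(prompt, **kwargs)).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/api/chat",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.load(resp)["response"]
```

For example, `chat("List files in /home", use_rag=True)` sends a RAG-augmented request to the local gateway.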
Open your browser to `http://localhost:8080` and start interacting.

---

## Step 9: Write the Training Script (`train.py`)

```python
#!/usr/bin/env python3
# train.py - CPU-only fine-tuning with LoRA + optional DPO
import logging
import os

import torch
import yaml
from datasets import load_dataset
from peft import LoraConfig, TaskType, get_peft_model
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)
from trl import DPOTrainer

# Load config
with open("config.yaml") as f:
    config = yaml.safe_load(f)

# Setup logging (create the log directory first)
os.makedirs("logs", exist_ok=True)
logging.basicConfig(level=logging.INFO, filename="logs/training.log", filemode="w")
logger = logging.getLogger(__name__)


def main():
    # 1. Load tokenizer
    tokenizer = AutoTokenizer.from_pretrained(
        config["model"]["base_model_name"],
        cache_dir=config["model"]["cache_dir"],
    )
    tokenizer.pad_token = tokenizer.eos_token

    # 2. Load the base model. bitsandbytes 8-bit loading targets GPUs, so on
    #    CPU we load in float32 and rely on LoRA to keep memory manageable.
    model = AutoModelForCausalLM.from_pretrained(
        config["model"]["base_model_name"],
        cache_dir=config["model"]["cache_dir"],
        torch_dtype=torch.float32,  # CPU uses float32
        low_cpu_mem_usage=True,
    )

    # 3. Prepare LoRA
    if config["training"]["use_lora"]:
        lora_config = LoraConfig(
            task_type=TaskType.CAUSAL_LM,
            r=config["training"]["lora_r"],
            lora_alpha=config["training"]["lora_alpha"],
            lora_dropout=config["training"]["lora_dropout"],
            # adjust for your model (GPT-2 uses a fused "c_attn" projection)
            target_modules=["q_proj", "v_proj"],
        )
        model = get_peft_model(model, lora_config)
        model.print_trainable_parameters()
    # 4. Load dataset
    dataset = load_dataset("json", data_files=config["data"]["train_file"], split="train")
    if config["data"].get("eval_file"):
        eval_dataset = load_dataset("json", data_files=config["data"]["eval_file"], split="train")
    else:
        eval_dataset = None

    # Format prompts as "### Instruction:\n{instruction}\n\n### Response:\n{response}"
    def format_func(example):
        text = (
            f"### Instruction:\n{example['instruction']}\n\n"
            f"### Response:\n{example['response']}{tokenizer.eos_token}"
        )
        return {"text": text}

    dataset = dataset.map(format_func)
    if eval_dataset:
        eval_dataset = eval_dataset.map(format_func)

    # Tokenize. Copy input_ids into labels so the Trainer can compute the
    # causal-LM loss; without labels there is nothing to optimize.
    def tokenize(element):
        tokens = tokenizer(
            element["text"],
            truncation=True,
            max_length=config["training"]["max_seq_length"],
            padding=False,
        )
        tokens["labels"] = tokens["input_ids"].copy()
        return tokens

    dataset = dataset.map(tokenize, remove_columns=dataset.column_names)
    if eval_dataset:
        eval_dataset = eval_dataset.map(tokenize, remove_columns=eval_dataset.column_names)

    # 5. Training arguments (CPU-friendly)
    training_args = TrainingArguments(
        output_dir=config["training"]["output_dir"],
        per_device_train_batch_size=config["training"]["per_device_train_batch_size"],
        gradient_accumulation_steps=config["training"]["gradient_accumulation_steps"],
        learning_rate=config["training"]["learning_rate"],
        num_train_epochs=config["training"]["num_train_epochs"],
        logging_steps=config["training"]["logging_steps"],
        save_steps=config["training"]["save_steps"],
        evaluation_strategy="steps" if eval_dataset else "no",
        eval_steps=config["training"]["save_steps"],
        save_total_limit=2,
        load_best_model_at_end=bool(eval_dataset),
        metric_for_best_model="eval_loss",
        greater_is_better=False,
        fp16=False,  # CPU does not support fp16
        bf16=False,
        dataloader_num_workers=0,  # avoid multiprocessing issues
        optim="adamw_torch",
        torch_compile=False,  # no speedup on CPU
    )
    # 6. Trainer (SFT)
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=dataset,
        eval_dataset=eval_dataset,
        tokenizer=tokenizer,
    )
    logger.info("Starting SFT training...")
    trainer.train()
    trainer.save_model()  # saves the LoRA adapters

    # 7. Optional DPO training
    if config["training"]["dpo"] and config["data"].get("preference_file"):
        logger.info("Loading preference data for DPO...")
        pref_dataset = load_dataset(
            "json", data_files=config["data"]["preference_file"], split="train"
        )
        # DPO compares the policy against a frozen reference; with
        # ref_model=None the trainer derives one implicitly (for LoRA
        # models, the base weights with adapters disabled).
        dpo_trainer = DPOTrainer(
            model=model,
            ref_model=None,
            args=training_args,  # reused for simplicity; tune separately for DPO
            train_dataset=pref_dataset,
            tokenizer=tokenizer,
            beta=config["training"]["dpo_beta"],
            max_length=config["training"]["max_seq_length"],
            max_prompt_length=256,
        )
        logger.info("Starting DPO training...")
        dpo_trainer.train()
        dpo_trainer.save_model(config["training"]["output_dir"] + "_dpo")

    # 8. Merge the LoRA weights into the base model and save the result
    #    (the merged model is what gets converted to GGUF in the next step)
    logger.info("Merging LoRA weights...")
    merged_model = model.merge_and_unload()
    merged_model.save_pretrained("models/merged")
    tokenizer.save_pretrained("models/merged")
    logger.info("Merged model saved to models/merged")


if __name__ == "__main__":
    main()
```

**Run training**:

```bash
python train.py
```

*Note: training a 1B model on CPU with batch size 1 may take hours to days depending on dataset size. Reduce the number of epochs or the dataset size for testing.*

---

## Step 10: Convert to GGUF

After training, the merged Hugging Face model is in `models/merged/`. Now use llama.cpp's conversion script.
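As a rough guide when choosing a quantization level, the GGUF file size scales with the average bits per weight. A back-of-the-envelope sketch; the bits-per-weight figures below are approximate k-quant averages (assumptions, not exact values):

```python
# Rough GGUF size estimate by quantization type. The bits-per-weight values
# are approximate averages for llama.cpp k-quants (assumptions, not exact).
BITS_PER_WEIGHT = {
    "f16": 16.0,
    "q8_0": 8.5,
    "q4_k_m": 4.8,
    "q3_k_s": 3.5,
    "q2_k": 2.6,
}

def gguf_size_mb(n_params: float, quant: str) -> float:
    """Estimated file size in megabytes for a model with n_params weights."""
    return n_params * BITS_PER_WEIGHT[quant] / 8 / 1e6

# A 1B-parameter model: ~2000 MB at f16 vs ~325 MB at q2_k
print(gguf_size_mb(1e9, "f16"), gguf_size_mb(1e9, "q2_k"))
```

Under these assumptions, the 117M-parameter GPT-2 base lands around 38 MB at `q2_k`, small enough for a Raspberry Pi.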
```bash
cd /home/kali/llama.cpp

# 1. Convert the merged HF model to a full-precision GGUF
python convert_hf_to_gguf.py /home/kali/microclaw/models/merged \
    --outfile /home/kali/microclaw/models/microclaw-f16.gguf \
    --outtype f16

# 2. Quantize down to 2-bit (extremely small)
./llama-quantize /home/kali/microclaw/models/microclaw-f16.gguf \
    /home/kali/microclaw/models/microclaw.gguf Q2_K
```

Note: in recent llama.cpp builds the conversion script only emits full-precision or `q8_0` files; k-quants such as `Q2_K` are produced by the separate `llama-quantize` tool. For Raspberry Pi, `Q2_K` is ideal. You can also try `Q3_K_S` if you have more RAM.

---

## Step 12: Create the Retro CLI Dashboard (`static/index.html`)

```html