---
license: mit
language:
- en
tags:
- instruction-quality
- lint
- code-review
- agent-instructions
- gguf
- qwen3.5
base_model: Qwen/Qwen3.5-0.8B
pipeline_tag: text-generation
library_name: llama-cpp-python
model-index:
- name: writ-lint-0.8B
  results: []
---

# writ-lint-0.8B

A fine-tuned [Qwen3.5-0.8B](https://huggingface.co/Qwen/Qwen3.5-0.8B) model that evaluates the quality of AI agent instructions and generates actionable improvement feedback.

Part of the **Tier 2.5 hybrid architecture** in [enwrit](https://github.com/enwrit/writ) -- the communication layer for AI agents.

## How It Works

This model is one half of a hybrid scoring system:

1. **LightGBM** (bundled in the `enwrit` CLI) predicts a headline score plus 6 dimension scores (~1ms)
2. **writ-lint-0.8B** (this model) generates issues (ERROR/WARNING/INFO) and improvement suggestions, using the instruction text and the LightGBM-predicted scores as context

The model generates feedback only; it does not produce scores. The LightGBM scores are passed in the prompt so the model can target weak dimensions.

## Usage

### Via the enwrit CLI (recommended)

```bash
pip install enwrit
pip install llama-cpp-python  # CPU inference, ~10s per instruction
writ lint AGENTS.md --deep-local
```

The model is auto-downloaded to `~/.writ/models/` on first use.

### Direct inference with llama-cpp-python

```python
from llama_cpp import Llama
import json

model = Llama(
    model_path="writ-lint-0.8B-Q4_K_M.gguf",
    n_ctx=4096,
    n_gpu_layers=-1,  # GPU acceleration (0 for CPU-only)
    verbose=False,
)

# ChatML prompt template; {instruction_text} is substituted below.
prompt_template = """<|im_start|>system
You are an expert instruction quality evaluator. Given an instruction and its quality scores, generate specific issues and improvement suggestions.<|im_end|>
<|im_start|>user
## Instruction to evaluate

{instruction_text}

## Quality scores (predicted)
Headline: 52/100
Clarity: 58 | Structure: 65 | Coverage: 42 | Brevity: 71 | Examples: 28 | Verification: 35

Analyze the instruction. Return JSON with "issues" (level + message) and "suggestions".<|im_end|>
<|im_start|>assistant
"""

# Load the instruction to evaluate and fill in the template.
with open("AGENTS.md", encoding="utf-8") as f:
    instruction_text = f.read()
prompt = prompt_template.format(instruction_text=instruction_text)

output = model.create_completion(
    prompt,
    max_tokens=1024,
    temperature=0.3,
    response_format={"type": "json_object"},
)

feedback = json.loads(output["choices"][0]["text"])
print(json.dumps(feedback, indent=2))
```

## Output Format

```json
{
  "issues": [
    {"level": "ERROR", "message": "Missing concrete code examples for error handling patterns."},
    {"level": "WARNING", "message": "Verification steps are subjective rather than binary."},
    {"level": "INFO", "message": "Consider adding a 'Rules' section for behavioral constraints."}
  ],
  "suggestions": [
    "Add a 'Code Examples' section with 'Good vs Bad' patterns for the most critical rules.",
    "Replace subjective verification with specific CLI commands (e.g., `pytest`, `ruff check`).",
    "Include numeric thresholds for measurable constraints (e.g., 'max 100 lines per function')."
  ]
}
```

## Training Details

| Parameter | Value |
|---|---|
| Base model | [Qwen/Qwen3.5-0.8B](https://huggingface.co/Qwen/Qwen3.5-0.8B) |
| Method | LoRA (r=32, alpha=64, dropout=0) via [Unsloth](https://github.com/unslothai/unsloth) |
| Training data | 6,536 Tier 3 AI evaluations (Gemini-scored instructions) |
| Issues in training data | 30,830 (avg 4.7 per instruction) |
| Suggestions in training data | 19,602 (avg 3.0 per instruction) |
| Non-coding examples | 145 seed instructions across 15 domains |
| Epochs | 1 |
| Batch size | 1 (gradient accumulation: 16, effective batch: 16) |
| Max sequence length | 4096 tokens |
| Learning rate | 2e-4 (cosine schedule, 10% warmup) |
| Precision | bf16 |
| Quantization | Q4_K_M (GGUF) |
| Training hardware | NVIDIA RTX 5090 (32GB VRAM) |
| Training time | ~5.5 hours |

## Evaluation

Compared against retrieval-based approaches (v1/v2/v3) on a held-out validation set:

| Approach | Relevance | All-Feedback Specificity | Issues/Instruction | Type |
|---|---|---|---|---|
| v1_shap_knn | 0.157 | 0.129 | N/A | retrieval |
| v2_hybrid | 0.266 | 0.136 | N/A | retrieval |
| v3_tfidf | 0.262 | 0.144 | N/A | retrieval |
| **writ-lint-0.8B** | **0.236** | **0.364** | **4.7** | **generative** |

The generative model scores slightly below the best retrieval baselines on relevance (0.236 vs. 0.266 for v2_hybrid) but roughly 2.5x higher on specificity (0.364 vs. 0.144 for v3_tfidf).

Key strengths:

- 100% JSON parse success (via constrained decoding)
- Generates novel, context-specific feedback (not limited to seen examples)
- Weak-dimension targeting: 0.47 (issues correlate with low-scoring dimensions)
- Low domain mismatch: 0.014 (rarely gives coding feedback to non-coding instructions)

## Scoring Dimensions

The 6 quality dimensions (scored by LightGBM, targeted by this model):

| Dimension | What it measures |
|---|---|
| **Clarity** | Unambiguous language, precise terminology, defined jargon |
| **Structure** | Logical sections, hierarchy, scannable formatting |
| **Coverage** | Completeness of rules, edge cases, responsibilities |
| **Brevity** | Concise without sacrificing meaning, no redundancy |
| **Examples** | Code samples, input/output patterns, good vs bad |
| **Verification** | Testable criteria, CLI commands, specific thresholds |

## Files

- `writ-lint-0.8B-Q4_K_M.gguf` -- Quantized model for inference (504 MB)

## Links

- [enwrit CLI](https://github.com/enwrit/writ) -- Open-source CLI tool
- [enwrit.com](https://enwrit.com) -- Platform with Hub, AI scoring, and more
- [PyPI](https://pypi.org/project/enwrit/) -- `pip install enwrit`

## License

MIT
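
## Example: Building the Prompt from Scores

The direct-inference example hardcodes the scores in the prompt. In practice they come from the LightGBM stage, so a small helper that renders the prompt from a score dictionary can be useful. The sketch below is an illustration only: `build_prompt`, `DIMENSIONS`, and `SYSTEM_PROMPT` are hypothetical names that are not part of the `enwrit` CLI, and the exact line breaks inside the prompt are an assumption based on the example above.

```python
# Hypothetical helper: not part of enwrit. Dimension names and the prompt
# wording are taken from this model card; the newline layout is an assumption.
DIMENSIONS = ["Clarity", "Structure", "Coverage", "Brevity", "Examples", "Verification"]

SYSTEM_PROMPT = (
    "You are an expert instruction quality evaluator. Given an instruction and "
    "its quality scores, generate specific issues and improvement suggestions."
)


def build_prompt(instruction_text: str, headline: int, scores: dict) -> str:
    """Render a ChatML prompt from the instruction text and predicted scores."""
    missing = [name for name in DIMENSIONS if name not in scores]
    if missing:
        raise ValueError(f"missing dimension scores: {missing}")
    # Keep the dimensions in a fixed order so the prompt is deterministic.
    score_line = " | ".join(f"{name}: {scores[name]}" for name in DIMENSIONS)
    return (
        f"<|im_start|>system\n{SYSTEM_PROMPT}<|im_end|>\n"
        "<|im_start|>user\n"
        f"## Instruction to evaluate\n\n{instruction_text}\n\n"
        "## Quality scores (predicted)\n"
        f"Headline: {headline}/100\n{score_line}\n\n"
        'Analyze the instruction. Return JSON with "issues" (level + message) '
        'and "suggestions".<|im_end|>\n'
        "<|im_start|>assistant\n"
    )


example_scores = {
    "Clarity": 58, "Structure": 65, "Coverage": 42,
    "Brevity": 71, "Examples": 28, "Verification": 35,
}
prompt = build_prompt("Always run the tests before committing.", 52, example_scores)
print(prompt)
```

The resulting string can be passed as the `prompt` argument in the direct-inference example.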
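
## Example: Consuming the Feedback

Once the model returns JSON in the shape shown under Output Format, a caller will typically want to bucket issues by severity, for instance to fail a CI check when any ERROR is present. The following is a minimal sketch under that assumption; `group_issues` is a hypothetical helper, and treating unknown levels as INFO is a choice made here for illustration, not documented behavior of the model or the CLI.

```python
import json

LEVELS = ("ERROR", "WARNING", "INFO")  # severities named in this model card


def group_issues(raw_json: str) -> dict:
    """Parse the model output and bucket issue messages by severity level."""
    feedback = json.loads(raw_json)
    grouped = {level: [] for level in LEVELS}
    for issue in feedback.get("issues", []):
        level = str(issue.get("level", "")).upper()
        if level not in grouped:
            level = "INFO"  # assumption: unknown levels are downgraded, not dropped
        grouped[level].append(issue.get("message", ""))
    return grouped


# Sample payload copied from the Output Format section (suggestions shortened).
sample = json.dumps({
    "issues": [
        {"level": "ERROR", "message": "Missing concrete code examples for error handling patterns."},
        {"level": "WARNING", "message": "Verification steps are subjective rather than binary."},
        {"level": "INFO", "message": "Consider adding a 'Rules' section for behavioral constraints."},
    ],
    "suggestions": ["Add a 'Code Examples' section."],
})

grouped = group_issues(sample)
for level in LEVELS:
    for message in grouped[level]:
        print(f"{level:<8} {message}")

# A linter-style exit status: non-zero when at least one ERROR was reported.
exit_code = 1 if grouped["ERROR"] else 0
```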