---
license: mit
language:
- en
- zh
base_model:
- Qwen/Qwen2.5-1.5B
tags:
- cellsentry
- excel
- spreadsheet
- formula-audit
- pii-detection
- data-extraction
- gguf
- mlx
- lora
- qwen2.5
pipeline_tag: text-generation
---

# CellSentry Model: Multi-Task Spreadsheet AI

A fine-tuned 1.5B-parameter model for spreadsheet intelligence tasks. Built on Qwen2.5-1.5B with LoRA, it handles three distinct tasks through prompt routing:

- **Formula Audit**: Verify or dismiss rule-engine findings in Excel formulas
- **PII Detection**: Identify sensitive data (SSN, phone, email, national IDs) in cell values
- **Data Extraction**: Extract structured fields (invoice number, date, vendor, totals) from spreadsheets

## Model Details

| Property | Value |
|----------|-------|
| Base model | [Qwen/Qwen2.5-1.5B](https://huggingface.co/Qwen/Qwen2.5-1.5B) |
| Fine-tuning | LoRA (rank 16, alpha 32) |
| Training | 4000 iterations, batch_size=2, lr=3e-5, AdamW |
| Quantization | 4-bit, group_size=32 (Q4_K_M for GGUF) |
| Context length | 1024 tokens |
| License | MIT |
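
For readers unfamiliar with the LoRA hyperparameters in the table, the conventional formulation is sketched below. This assumes the standard `alpha / rank` scaling from the original LoRA method; the card does not state which scaling convention the training code actually uses, and the symbols $W_0$, $A$, $B$, $d$, and $k$ are introduced here for illustration only.

```latex
W' = W_0 + \frac{\alpha}{r}\, B A,
\qquad B \in \mathbb{R}^{d \times r},\quad A \in \mathbb{R}^{r \times k},
\qquad \frac{\alpha}{r} = \frac{32}{16} = 2
```

Under that assumption, the low-rank update $BA$ is applied at twice its raw magnitude, and only the $r(d + k)$ adapter parameters per adapted matrix are trained.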

## Available Formats

| Format | File | Size | Platform |
|--------|------|------|----------|
| **GGUF** (Q4_K_M) | `cellsentry-1.5b-v3-q4km.gguf` | ~940 MB | Windows (llama.cpp) |
| **MLX** (4-bit g32) | `cellsentry-1.5b-v3-4bit-g32/` | ~920 MB | macOS (MLX) |

> Currently, only the GGUF format is uploaded. The MLX format is coming soon.

## Usage

This model is designed for use with [CellSentry](https://github.com/almax000/cellsentry), an open-source desktop app for spreadsheet auditing. The app downloads the model automatically on first launch.

### Manual Download

```bash
# Install the Hugging Face Hub client (provides huggingface-cli)
pip install huggingface-hub

# Download the GGUF model into ./models
huggingface-cli download almax000/cellsentry-model cellsentry-1.5b-v3-q4km.gguf --local-dir ./models
```
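
The same download can also be scripted from Python. The sketch below is a minimal illustration, not part of the CellSentry app: the helper name `ensure_model` is hypothetical, and the repository ID and filename are taken from the CLI command above. It uses `hf_hub_download` from `huggingface_hub` and skips the network call when the file is already present.

```python
from pathlib import Path

REPO_ID = "almax000/cellsentry-model"       # from the CLI command above
FILENAME = "cellsentry-1.5b-v3-q4km.gguf"   # from the CLI command above


def ensure_model(local_dir: str = "./models") -> Path:
    """Return the local GGUF path, downloading the file only if it is missing."""
    target = Path(local_dir) / FILENAME
    if target.exists():
        return target  # already on disk; no network access needed

    # Imported lazily so the cached path works even without huggingface-hub installed.
    from huggingface_hub import hf_hub_download

    downloaded = hf_hub_download(repo_id=REPO_ID, filename=FILENAME, local_dir=local_dir)
    return Path(downloaded)
```

Calling `ensure_model()` with no arguments mirrors the `--local-dir ./models` option used in the shell example.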

### Prompt Format

The model uses the Qwen2.5 chat template with task-specific system prompts:

**Formula Audit:**
```
<|im_start|>system
You are a spreadsheet formula auditor...<|im_end|>
<|im_start|>user
{rule engine finding + cell context}<|im_end|>
<|im_start|>assistant
```

**PII Detection:**
```
<|im_start|>system
You are a PII detection specialist...<|im_end|>
<|im_start|>user
{cell values to scan}<|im_end|>
<|im_start|>assistant
```

**Data Extraction:**
```
<|im_start|>system
You are a document data extractor...<|im_end|>
<|im_start|>user
{spreadsheet content + template}<|im_end|>
<|im_start|>assistant
```
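
The prompt routing shown in the three templates can be sketched as a small helper. This is an illustration under stated assumptions: the task keys and the function name `build_prompt` are hypothetical, and the system prompts are left truncated exactly as they appear in this card, so the full prompts used by the CellSentry app would need to be substituted.

```python
# System prompts are truncated in this card ("..."); these strings are
# placeholders, and the dictionary keys are hypothetical task names.
SYSTEM_PROMPTS = {
    "formula_audit": "You are a spreadsheet formula auditor...",
    "pii_detection": "You are a PII detection specialist...",
    "data_extraction": "You are a document data extractor...",
}


def build_prompt(task: str, user_content: str) -> str:
    """Assemble a prompt in the chat format above, ending with an open assistant turn."""
    if task not in SYSTEM_PROMPTS:
        raise ValueError(f"unknown task: {task!r}")
    return (
        f"<|im_start|>system\n{SYSTEM_PROMPTS[task]}<|im_end|>\n"
        f"<|im_start|>user\n{user_content}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )


# Example with a made-up cell value:
print(build_prompt("pii_detection", "A1: 123-45-6789"))
```

When passing the resulting string to an inference runtime, `<|im_end|>` is the natural stop sequence, since it closes the assistant turn in this template.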

## Training

- **Method**: LoRA fine-tuning with multi-task data
- **Data**: Synthetic and real-world spreadsheet samples across all three tasks
- **Fusion**: LoRA weights fused into the base model, then quantized (dequantize → fuse → re-quantize with group_size=32)
- **Key lesson**: group_size=64 loses fine-tuning quality; group_size=32 is the minimum viable quality floor for 1.5B models
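
To give some intuition for why group size matters, the sketch below applies a simple min-max 4-bit quantizer to synthetic weights at two group sizes and compares the reconstruction error. It is an illustrative toy, not the MLX or llama.cpp quantizer, and the Gaussian weights are a stand-in for real model tensors. Smaller groups each get their own scale and offset, so the quantization step is finer.

```python
import random


def quantize_groupwise(weights, group_size, bits=4):
    """Min-max affine quantization per group; returns the dequantized values."""
    levels = (1 << bits) - 1  # 15 steps between min and max for 4-bit
    out = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        lo, hi = min(group), max(group)
        scale = (hi - lo) / levels or 1.0  # fall back to 1.0 for a flat group
        out.extend(lo + round((w - lo) / scale) * scale for w in group)
    return out


def mean_abs_error(a, b):
    """Average absolute difference between two equal-length sequences."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(4096)]  # synthetic stand-in weights

err_g32 = mean_abs_error(weights, quantize_groupwise(weights, 32))
err_g64 = mean_abs_error(weights, quantize_groupwise(weights, 64))
print(f"group_size=32: {err_g32:.6f}  group_size=64: {err_g64:.6f}")
```

The trade-off is storage: halving the group size doubles the number of per-group scales and offsets that must be stored alongside the 4-bit values.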

## Limitations

- Optimized for structured spreadsheet content, not general text
- 1024-token context: large spreadsheets need chunking
- PII patterns trained primarily on US and Chinese formats
- Extraction templates cover 5 document types (invoice, receipt, PO, expense, payroll)
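
The chunking mentioned in the list above could look roughly like the following sketch. Everything here is an assumption for illustration: the function names are hypothetical, the four-characters-per-token estimate is a crude heuristic (it undercounts for Chinese text, so the Qwen2.5 tokenizer should be used when exact counts matter), and the default budget of 600 tokens is an arbitrary value chosen to leave room for the system prompt and the reply within the 1024-token window.

```python
def estimate_tokens(text: str) -> int:
    """Crude heuristic (assumption): roughly 4 characters per token."""
    return max(1, len(text) // 4)


def chunk_rows(rows, budget_tokens=600):
    """Group consecutive rows so each chunk stays within budget_tokens.

    A single row that exceeds the budget on its own is emitted as a
    one-row chunk rather than being split.
    """
    chunks, current, used = [], [], 0
    for row in rows:
        cost = estimate_tokens(row)
        if current and used + cost > budget_tokens:
            chunks.append(current)  # close the chunk before it overflows
            current, used = [], 0
        current.append(row)
        used += cost
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be sent as a separate request, with the per-chunk results merged afterwards.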

## Related

- [CellSentry App](https://github.com/almax000/cellsentry): Desktop app that uses this model
- [CellSentry Website](https://cellsentry.pro): Project homepage