---
license: mit
language:
- en
- zh
base_model:
- Qwen/Qwen2.5-1.5B
tags:
- cellsentry
- excel
- spreadsheet
- formula-audit
- pii-detection
- data-extraction
- gguf
- mlx
- lora
- qwen2.5
pipeline_tag: text-generation
---
# CellSentry Model: Multi-Task Spreadsheet AI
A fine-tuned 1.5B parameter model for spreadsheet intelligence tasks. Built on Qwen2.5-1.5B with LoRA, this model handles three distinct tasks through prompt routing:
- **Formula Audit**: Verify or dismiss rule engine findings in Excel formulas
- **PII Detection**: Identify sensitive data (SSN, phone, email, national IDs) in cell values
- **Data Extraction**: Extract structured fields (invoice number, date, vendor, totals) from spreadsheets
## Model Details
| Property | Value |
|----------|-------|
| Base model | [Qwen/Qwen2.5-1.5B](https://huggingface.co/Qwen/Qwen2.5-1.5B) |
| Fine-tuning | LoRA (rank 16, alpha 32) |
| Training | 4000 iterations, batch_size=2, lr=3e-5, AdamW |
| Quantization | 4-bit, group_size=32 (Q4_K_M for GGUF) |
| Context length | 1024 tokens |
| License | MIT |
## Available Formats
| Format | File | Size | Platform |
|--------|------|------|----------|
| **GGUF** (Q4_K_M) | `cellsentry-1.5b-v3-q4km.gguf` | ~940 MB | Windows (llama.cpp) |
| **MLX** (4-bit g32) | `cellsentry-1.5b-v3-4bit-g32/` | ~920 MB | macOS (MLX) |
> Currently only the GGUF format is uploaded. MLX format coming soon.
## Usage
This model is designed to be used with [CellSentry](https://github.com/almax000/cellsentry), an open-source desktop app for spreadsheet auditing. The app downloads the model automatically on first launch.
### Manual Download
```bash
# Install Hugging Face CLI
pip install huggingface-hub
# Download GGUF model
huggingface-cli download almax000/cellsentry-model cellsentry-1.5b-v3-q4km.gguf --local-dir ./models
```
### Prompt Format
The model uses the Qwen2.5 chat template with a task-specific system prompt for each task. Each prompt ends with an open assistant turn for the model to complete:
**Formula Audit:**
```
<|im_start|>system
You are a spreadsheet formula auditor...<|im_end|>
<|im_start|>user
{rule engine finding + cell context}<|im_end|>
<|im_start|>assistant
```
**PII Detection:**
```
<|im_start|>system
You are a PII detection specialist...<|im_end|>
<|im_start|>user
{cell values to scan}<|im_end|>
<|im_start|>assistant
```
**Data Extraction:**
```
<|im_start|>system
You are a document data extractor...<|im_end|>
<|im_start|>user
{spreadsheet content + template}<|im_end|>
<|im_start|>assistant
```
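The three templates above differ only in the system prompt, so prompt routing can be sketched as a lookup plus one formatting function. This is a minimal illustration, not the CellSentry app's actual code: the task keys and the `build_prompt` helper are hypothetical names, and the system prompts are copied as they appear in this README, where they are truncated.

```python
# Task-specific system prompts, copied verbatim from this README.
# They are truncated here ("..."); the full text is not shown in this card.
# The dictionary keys are hypothetical task names chosen for this sketch.
SYSTEM_PROMPTS = {
    "formula_audit": "You are a spreadsheet formula auditor...",
    "pii_detection": "You are a PII detection specialist...",
    "data_extraction": "You are a document data extractor...",
}


def build_prompt(task: str, user_content: str) -> str:
    """Render a ChatML-style prompt that ends with an open assistant turn."""
    if task not in SYSTEM_PROMPTS:
        raise ValueError(f"unknown task: {task!r}")
    return (
        f"<|im_start|>system\n{SYSTEM_PROMPTS[task]}<|im_end|>\n"
        f"<|im_start|>user\n{user_content}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )


if __name__ == "__main__":
    # Example cell value; the real user-turn layout is defined by the app.
    print(build_prompt("pii_detection", "A1: john@example.com"))
```

The returned string can be passed as a raw prompt to whichever runtime loads the model. Runtimes that apply the chat template themselves would take the system and user messages separately instead.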
## Training
- **Method**: LoRA fine-tuning with multi-task data
- **Data**: Synthetic + real-world spreadsheet samples across all three tasks
- **Fusion**: LoRA weights fused into base model, then quantized (dequantize → fuse → re-quantize with group_size=32)
- **Key lesson**: quantizing with group_size=64 loses fine-tuning quality; group_size=32 is the coarsest grouping that remains viable for 1.5B models
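To make the group-size point concrete, the sketch below round-trips a weight vector through a simplified 4-bit quantizer that stores one minimum and one step size per group. It is an illustration of the general mechanism only: it assumes plain min/max affine quantization, which is not the exact scheme used by MLX or by GGUF Q4_K_M, and the function names and sample weights are made up for the example.

```python
def quantize_groupwise(weights, group_size, bits=4):
    """Round-trip weights through min/max affine quantization, one scale per group."""
    levels = (1 << bits) - 1  # 15 steps between min and max for 4-bit codes
    restored = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        lo, hi = min(group), max(group)
        scale = (hi - lo) / levels or 1.0  # constant group: any nonzero scale works
        for w in group:
            q = round((w - lo) / scale)      # integer code in [0, levels]
            restored.append(lo + q * scale)  # dequantized value
    return restored


def mean_abs_error(weights, group_size):
    """Average absolute difference between original and round-tripped weights."""
    restored = quantize_groupwise(weights, group_size)
    return sum(abs(a - b) for a, b in zip(weights, restored)) / len(weights)


if __name__ == "__main__":
    # 63 small weights plus a single outlier in the last position.
    weights = [k * 0.001 for k in range(32)] + [k * 0.001 for k in range(31)] + [1.0]
    for g in (64, 32):
        print(f"group_size={g}: mean abs error = {mean_abs_error(weights, g):.5f}")
```

With one group of 64, the outlier stretches the step size for every weight in the vector. With groups of 32, only the half that contains the outlier is affected, so the mean error on this input is lower. Small fine-tuning deltas are the kind of detail a coarse step size can erase, which is consistent with the lesson above, although this toy example does not reproduce the actual evaluation.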
## Limitations
- Optimized for structured spreadsheet content, not general text
- 1024-token context, so large spreadsheets need chunking
- PII patterns trained primarily on US and Chinese formats
- Extraction templates cover 5 document types (invoice, receipt, PO, expense, payroll)
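Because of the 1024-token context, a large sheet has to be split before it is sent to the model. The sketch below shows one possible greedy chunker over rows that are already serialized as strings. It is not the CellSentry app's implementation: `chunk_rows`, the `reserved_tokens` allowance for the system prompt and the generated answer, and the characters-per-token estimate are all assumptions made for illustration, and a real pipeline would count tokens with the model's tokenizer.

```python
def chunk_rows(rows, max_tokens=1024, reserved_tokens=384, chars_per_token=3.0):
    """Greedily pack serialized rows into chunks that fit an estimated token budget."""
    budget = max_tokens - reserved_tokens
    if budget <= 0:
        raise ValueError("reserved_tokens must be smaller than max_tokens")

    def estimate(text):
        # Crude heuristic; replace with the real tokenizer for accurate counts.
        return int(len(text) / chars_per_token) + 1

    chunks, current, used = [], [], 0
    for row in rows:
        cost = estimate(row)
        if current and used + cost > budget:
            # The next row would overflow the budget, so close the current chunk.
            chunks.append(current)
            current, used = [], 0
        current.append(row)  # an oversized single row still gets its own chunk
        used += cost
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    rows = [f"A{i}: {'x' * 50}" for i in range(100)]
    for i, chunk in enumerate(chunk_rows(rows)):
        print(f"chunk {i}: {len(chunk)} rows")
```

Each chunk can then be placed in the user turn of its own prompt and the per-chunk results merged afterwards. Splitting on row boundaries keeps each cell's context intact, at the cost of losing relationships that span chunks.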
## Related
- [CellSentry App](https://github.com/almax000/cellsentry): Desktop app that uses this model
- [CellSentry Website](https://cellsentry.pro): Project homepage