---
license: mit
language:
- en
- zh
base_model:
- Qwen/Qwen2.5-1.5B
tags:
- cellsentry
- excel
- spreadsheet
- formula-audit
- pii-detection
- data-extraction
- gguf
- mlx
- lora
- qwen2.5
pipeline_tag: text-generation
---

# CellSentry Model — Multi-Task Spreadsheet AI

A fine-tuned 1.5B-parameter model for spreadsheet intelligence tasks. Built on Qwen2.5-1.5B with LoRA, it handles three distinct tasks through prompt routing (the system prompt selects the task):

- **Formula Audit** — Verify or dismiss rule engine findings in Excel formulas
- **PII Detection** — Identify sensitive data (SSN, phone, email, national IDs) in cell values
- **Data Extraction** — Extract structured fields (invoice number, date, vendor, totals) from spreadsheets

## Model Details

| Property | Value |
|----------|-------|
| Base model | [Qwen/Qwen2.5-1.5B](https://huggingface.co/Qwen/Qwen2.5-1.5B) |
| Fine-tuning | LoRA (rank 16, alpha 32) |
| Training | 4000 iterations, batch_size=2, lr=3e-5, AdamW |
| Quantization | 4-bit, group_size=32 (Q4_K_M for GGUF) |
| Context length | 1024 tokens |
| License | MIT |

## Available Formats

| Format | File | Size | Platform |
|--------|------|------|----------|
| **GGUF** (Q4_K_M) | `cellsentry-1.5b-v3-q4km.gguf` | ~940 MB | Windows (llama.cpp) |
| **MLX** (4-bit g32) | `cellsentry-1.5b-v3-4bit-g32/` | ~920 MB | macOS (MLX) |

> Only the GGUF format is uploaded at the moment. The MLX format is coming soon.

## Usage

This model is designed to be used with [CellSentry](https://github.com/almax000/cellsentry), an open-source desktop app for spreadsheet auditing. The app downloads the model automatically on first launch.

### Manual Download

```bash
# Install Hugging Face CLI
pip install huggingface-hub

# Download GGUF model
huggingface-cli download almax000/cellsentry-model cellsentry-1.5b-v3-q4km.gguf --local-dir ./models
```

### Prompt Format

The model uses the Qwen2.5 chat template (ChatML) with a task-specific system prompt for each task:

**Formula Audit:**
```
<|im_start|>system
You are a spreadsheet formula auditor...<|im_end|>
<|im_start|>user
{rule engine finding + cell context}<|im_end|>
<|im_start|>assistant
```

**PII Detection:**
```
<|im_start|>system
You are a PII detection specialist...<|im_end|>
<|im_start|>user
{cell values to scan}<|im_end|>
<|im_start|>assistant
```

**Data Extraction:**
```
<|im_start|>system
You are a document data extractor...<|im_end|>
<|im_start|>user
{spreadsheet content + template}<|im_end|>
<|im_start|>assistant
```
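
The three templates above differ only in the system prompt, so they can be rendered by one helper. The following is a minimal sketch, not the CellSentry app's actual code: the `SYSTEM_PROMPTS` dictionary, the task keys, and the `build_prompt` function are hypothetical names, the system prompts stay truncated exactly as they appear above, and the sample cell values are made up for illustration.

```python
# Hypothetical helper for rendering the prompt templates shown above.
# The system prompts are truncated in this README ("..."), so they are kept
# as placeholders here; substitute the full prompts used by the app.
SYSTEM_PROMPTS = {
    "formula_audit": "You are a spreadsheet formula auditor...",
    "pii_detection": "You are a PII detection specialist...",
    "data_extraction": "You are a document data extractor...",
}


def build_prompt(task: str, user_content: str) -> str:
    """Render a single-turn ChatML prompt that ends at the open assistant turn."""
    if task not in SYSTEM_PROMPTS:
        raise ValueError(f"unknown task: {task!r}")
    return (
        f"<|im_start|>system\n{SYSTEM_PROMPTS[task]}<|im_end|>\n"
        f"<|im_start|>user\n{user_content}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )


if __name__ == "__main__":
    # Illustrative input only; the real user-turn layout is defined by the app.
    print(build_prompt("pii_detection", "A2: 555-0100\nB2: jane@example.com"))
```

The returned string ends with the open assistant turn, so a completion-style runtime continues from there; `<|im_end|>` is the natural stop sequence.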

## Training

- **Method**: LoRA fine-tuning with multi-task data
- **Data**: Synthetic + real-world spreadsheet samples across all three tasks
- **Fusion**: LoRA weights fused into base model, then quantized (dequantize → fuse → re-quantize with group_size=32)
- **Key lesson**: quantizing with group_size=64 loses fine-tuning quality; group_size=32 is the coarsest grouping that stays viable for 1.5B models
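
To give some intuition for the group-size observation, here is a toy sketch of group-wise 4-bit min/max quantization. It is not the Q4_K_M or MLX implementation; the function names are hypothetical, the weights are random, and the storage estimate assumes 32 bits of per-group metadata (for example an fp16 scale plus an fp16 offset). Smaller groups span a narrower value range, so the quantization step can only shrink, at the cost of more metadata per weight.

```python
import random


def quantize_group(values, bits=4):
    """Min/max affine quantization of one group.

    Returns the dequantized values and the step size between levels.
    """
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1
    step = (hi - lo) / levels
    if step == 0.0:
        # Constant group: representable exactly.
        return list(values), 0.0
    dequant = [lo + round((v - lo) / step) * step for v in values]
    return dequant, step


def max_step(weights, group_size, bits=4):
    """Largest step size over all groups; rounding error is at most step / 2."""
    return max(
        quantize_group(weights[i:i + group_size], bits)[1]
        for i in range(0, len(weights), group_size)
    )


def bits_per_weight(group_size, bits=4, overhead_bits=32):
    """Effective storage per weight, assuming overhead_bits of metadata per group."""
    return bits + overhead_bits / group_size


if __name__ == "__main__":
    random.seed(0)
    weights = [random.gauss(0.0, 0.02) for _ in range(4096)]  # toy data
    for gs in (32, 64):
        print(gs, max_step(weights, gs), bits_per_weight(gs))
```

Under these assumptions, group_size=32 costs 5.0 bits per weight versus 4.5 for group_size=64, and its worst-case step is never larger because every 32-element group sits inside a 64-element group.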

## Limitations

- Optimized for structured spreadsheet content, not general text
- 1024-token context, so large spreadsheets need chunking
- PII patterns trained primarily on US and Chinese formats
- Extraction templates cover 5 document types (invoice, receipt, PO, expense, payroll)
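
The chunking mentioned in the context-length limitation can be sketched as a greedy row packer. This is an illustration under assumptions, not the app's chunking logic: `chunk_rows` and `rough_token_count` are hypothetical names, the three-characters-per-token estimate is a crude stand-in for the real tokenizer, and the default budget of 600 is an arbitrary value that leaves room inside the 1024-token window for the system prompt and the model's answer.

```python
from typing import Callable, Iterable, Iterator, List


def rough_token_count(text: str) -> int:
    """Crude tokenizer stand-in: roughly one token per 3 characters (assumption)."""
    return max(1, (len(text) + 2) // 3)


def chunk_rows(
    rows: Iterable[str],
    budget: int = 600,
    count_tokens: Callable[[str], int] = rough_token_count,
) -> Iterator[List[str]]:
    """Greedily pack serialized rows into chunks whose estimated size fits the budget.

    A single row that exceeds the budget is emitted alone rather than split.
    """
    chunk: List[str] = []
    used = 0
    for row in rows:
        cost = count_tokens(row)
        if chunk and used + cost > budget:
            yield chunk
            chunk, used = [], 0
        chunk.append(row)
        used += cost
    if chunk:
        yield chunk


if __name__ == "__main__":
    rows = [f"A{i}: value {i}" for i in range(1, 201)]
    for n, chunk in enumerate(chunk_rows(rows, budget=100)):
        print(n, len(chunk))
```

For accurate budgets, pass a `count_tokens` callable backed by the model's actual tokenizer instead of the character heuristic.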

## Related

- [CellSentry App](https://github.com/almax000/cellsentry) — Desktop app that uses this model
- [CellSentry Website](https://cellsentry.pro) — Project homepage