README.md · almax000/cellsentry-model at main

cellsentry-model / README.md

almax000

Upload README.md with huggingface_hub

13ff8bf verified 15 days ago

preview code

raw

history blame contribute delete

3.31 kB

	---
	license: mit
	language:
	- en
	- zh
	base_model:
	- Qwen/Qwen2.5-1.5B
	tags:
	- cellsentry
	- excel
	- spreadsheet
	- formula-audit
	- pii-detection
	- data-extraction
	- gguf
	- mlx
	- lora
	- qwen2.5
	pipeline_tag: text-generation
	---

	# CellSentry Model — Multi-Task Spreadsheet AI

	A fine-tuned 1.5B parameter model for spreadsheet intelligence tasks. Built on Qwen2.5-1.5B with LoRA, this model handles three distinct tasks through prompt routing:

	- Formula Audit — Verify or dismiss rule engine findings in Excel formulas
	- PII Detection — Identify sensitive data (SSN, phone, email, national IDs) in cell values
	- Data Extraction — Extract structured fields (invoice number, date, vendor, totals) from spreadsheets

	## Model Details

	\| Property \| Value \|
	\|----------\|-------\|
	\| Base model \| [Qwen/Qwen2.5-1.5B](https://huggingface.co/Qwen/Qwen2.5-1.5B) \|
	\| Fine-tuning \| LoRA (rank 16, alpha 32) \|
	\| Training \| 4000 iterations, batch_size=2, lr=3e-5, AdamW \|
	\| Quantization \| 4-bit, group_size=32 (Q4_K_M for GGUF) \|
	\| Context length \| 1024 tokens \|
	\| License \| MIT \|

	## Available Formats

	\| Format \| File \| Size \| Platform \|
	\|--------\|------\|------\|----------\|
	\| GGUF (Q4_K_M) \| `cellsentry-1.5b-v3-q4km.gguf` \| ~940 MB \| Windows (llama.cpp) \|
	\| MLX (4-bit g32) \| `cellsentry-1.5b-v3-4bit-g32/` \| ~920 MB \| macOS (MLX) \|

	> Currently only the GGUF format is uploaded. MLX format coming soon.

	## Usage

	This model is designed to be used with [CellSentry](https://github.com/almax000/cellsentry), an open-source desktop app for spreadsheet auditing. The app downloads the model automatically on first launch.

	### Manual Download

	```bash
	# Install Hugging Face CLI
	pip install huggingface-hub

	# Download GGUF model
	huggingface-cli download almax000/cellsentry-model cellsentry-1.5b-v3-q4km.gguf --local-dir ./models
	```

	### Prompt Format

	The model uses Qwen2.5 chat template with task-specific system prompts:

	Formula Audit:
	```
	<\|im_start\|>system
	You are a spreadsheet formula auditor...<\|im_end\|>
	<\|im_start\|>user
	{rule engine finding + cell context}<\|im_end\|>
	<\|im_start\|>assistant
	```

	PII Detection:
	```
	<\|im_start\|>system
	You are a PII detection specialist...<\|im_end\|>
	<\|im_start\|>user
	{cell values to scan}<\|im_end\|>
	<\|im_start\|>assistant
	```

	Data Extraction:
	```
	<\|im_start\|>system
	You are a document data extractor...<\|im_end\|>
	<\|im_start\|>user
	{spreadsheet content + template}<\|im_end\|>
	<\|im_start\|>assistant
	```

	## Training

	- Method: LoRA fine-tuning with multi-task data
	- Data: Synthetic + real-world spreadsheet samples across all three tasks
	- Fusion: LoRA weights fused into base model, then quantized (dequantize → fuse → re-quantize with group_size=32)
	- Key lesson: group_size=64 loses fine-tuning quality; group_size=32 is the minimum viable floor for 1.5B models

	## Limitations

	- Optimized for structured spreadsheet content, not general text
	- 1024 token context — large spreadsheets need chunking
	- PII patterns trained primarily on US and Chinese formats
	- Extraction templates cover 5 document types (invoice, receipt, PO, expense, payroll)

	## Related

	- [CellSentry App](https://github.com/almax000/cellsentry) — Desktop app that uses this model
	- [CellSentry Website](https://cellsentry.pro) — Project homepage