cellsentry-model / README.md

almax000


license: mit
language:
  - en
  - zh
base_model:
  - Qwen/Qwen2.5-1.5B
tags:
  - cellsentry
  - excel
  - spreadsheet
  - formula-audit
  - pii-detection
  - data-extraction
  - gguf
  - mlx
  - lora
  - qwen2.5
pipeline_tag: text-generation

CellSentry Model — Multi-Task Spreadsheet AI

A fine-tuned 1.5B-parameter model for spreadsheet intelligence tasks. Built on Qwen2.5-1.5B with LoRA, it handles three distinct tasks, selected by the system prompt (prompt routing):

Formula Audit — Verify or dismiss rule engine findings in Excel formulas
PII Detection — Identify sensitive data (SSN, phone, email, national IDs) in cell values
Data Extraction — Extract structured fields (invoice number, date, vendor, totals) from spreadsheets

Model Details

| Property | Value |
| --- | --- |
| Base model | Qwen/Qwen2.5-1.5B |
| Fine-tuning | LoRA (rank 16, alpha 32) |
| Training | 4000 iterations, batch_size=2, lr=3e-5, AdamW |
| Quantization | 4-bit, group_size=32 (Q4_K_M for GGUF) |
| Context length | 1024 tokens |
| License | MIT |

Available Formats

| Format | File | Size | Platform |
| --- | --- | --- | --- |
| GGUF (Q4_K_M) | `cellsentry-1.5b-v3-q4km.gguf` | ~940 MB | Windows (llama.cpp) |
| MLX (4-bit g32) | `cellsentry-1.5b-v3-4bit-g32/` | ~920 MB | macOS (MLX) |

Currently, only the GGUF file is uploaded. The MLX format is coming soon.

Usage

This model is designed to be used with CellSentry, an open-source desktop app for spreadsheet auditing. The app downloads the model automatically on first launch.

Manual Download

# Install Hugging Face CLI
pip install huggingface-hub

# Download GGUF model
huggingface-cli download almax000/cellsentry-model cellsentry-1.5b-v3-q4km.gguf --local-dir ./models
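
The same download can also be done from Python. The following is a minimal sketch, assuming the `huggingface_hub` package is installed; the repository and file names are copied from the CLI command above, while `download_gguf` is a hypothetical helper name used only for illustration.

```python
from pathlib import Path

# Repository and file names are taken from the CLI command in this README.
REPO_ID = "almax000/cellsentry-model"
GGUF_FILENAME = "cellsentry-1.5b-v3-q4km.gguf"


def download_gguf(local_dir: str = "./models") -> Path:
    """Download the GGUF file into local_dir and return its local path.

    Hypothetical helper for illustration; it is not part of CellSentry.
    """
    # Imported inside the function so that defining the helper does not
    # require huggingface_hub to be installed.
    from huggingface_hub import hf_hub_download

    path = hf_hub_download(
        repo_id=REPO_ID,
        filename=GGUF_FILENAME,
        local_dir=local_dir,
    )
    return Path(path)
```

Calling `download_gguf()` should place the file under `./models`, mirroring the `--local-dir ./models` option of the CLI command.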

Prompt Format

The model uses the Qwen2.5 chat template (ChatML) with a task-specific system prompt for each task:

Formula Audit:

<|im_start|>system
You are a spreadsheet formula auditor...<|im_end|>
<|im_start|>user
{rule engine finding + cell context}<|im_end|>
<|im_start|>assistant

PII Detection:

<|im_start|>system
You are a PII detection specialist...<|im_end|>
<|im_start|>user
{cell values to scan}<|im_end|>
<|im_start|>assistant

Data Extraction:

<|im_start|>system
You are a document data extractor...<|im_end|>
<|im_start|>user
{spreadsheet content + template}<|im_end|>
<|im_start|>assistant
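
The three templates above differ only in the system prompt and the user payload, so prompt routing can be sketched as a lookup plus string assembly. This is an illustrative sketch, not the app's actual code: the task keys and the `build_prompt` helper are hypothetical, and the system prompts are kept truncated exactly as they appear in this README.

```python
# System prompts are truncated in this README ("..."); the strings below keep
# that truncation and only stand in for the full prompts.
SYSTEM_PROMPTS = {
    "formula_audit": "You are a spreadsheet formula auditor...",
    "pii_detection": "You are a PII detection specialist...",
    "data_extraction": "You are a document data extractor...",
}


def build_prompt(task: str, user_content: str) -> str:
    """Return a ChatML prompt that ends with the open assistant turn."""
    if task not in SYSTEM_PROMPTS:
        raise ValueError(f"unknown task: {task!r}")
    return (
        f"<|im_start|>system\n{SYSTEM_PROMPTS[task]}<|im_end|>\n"
        f"<|im_start|>user\n{user_content}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )


if __name__ == "__main__":
    # Hypothetical cell content, used only to show the assembled prompt.
    print(build_prompt("pii_detection", "A1: 123-45-6789"))
```

The returned string can be passed as a raw completion prompt to a runtime such as llama.cpp, typically with `<|im_end|>` as a stop sequence.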

Training

Method: LoRA fine-tuning with multi-task data
Data: Synthetic + real-world spreadsheet samples across all three tasks
Fusion: LoRA weights fused into base model, then quantized (dequantize → fuse → re-quantize with group_size=32)
Key lesson: quantizing with group_size=64 loses fine-tuning quality; group_size=32 is the coarsest grouping that remains viable for 1.5B models
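
The fuse-then-quantize step and the group-size trade-off can be illustrated with a toy example. This sketch assumes the conventional LoRA scaling of alpha / rank (2.0 for rank 16, alpha 32; the training toolkit may parameterize the scale differently) and a simplified min/max affine 4-bit scheme. It is not the exact MLX or llama.cpp quantization kernel, and all function names are hypothetical.

```python
import random


def fuse_lora(w, a, b, alpha):
    """Return W + (alpha / rank) * B @ A for small list-of-lists matrices.

    w is rows x cols, b is rows x rank, a is rank x cols.
    """
    rank = len(a)
    scale = alpha / rank
    fused = [row[:] for row in w]
    for i in range(len(w)):
        for j in range(len(w[0])):
            delta = sum(b[i][k] * a[k][j] for k in range(rank))
            fused[i][j] += scale * delta
    return fused


def quantize_groups(values, group_size, bits=4):
    """Affine-quantize values group by group and return the dequantized values."""
    levels = 2 ** bits - 1
    out = []
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        lo, hi = min(group), max(group)
        # A flat group has zero range; fall back to step 1.0 to avoid dividing by zero.
        step = (hi - lo) / levels or 1.0
        out.extend(lo + round((v - lo) / step) * step for v in group)
    return out


def mean_abs_error(x, y):
    """Mean absolute difference between two equally long sequences."""
    return sum(abs(p - q) for p, q in zip(x, y)) / len(x)


if __name__ == "__main__":
    # Synthetic weights, used only to compare the two group sizes.
    rng = random.Random(0)
    weights = [rng.gauss(0.0, 0.02) for _ in range(1024)]
    for group_size in (32, 64):
        err = mean_abs_error(weights, quantize_groups(weights, group_size))
        print(f"group_size={group_size}: mean abs error = {err:.6f}")
```

Each group stores its own offset and step, so a smaller group covers a narrower value range and its rounding step cannot be larger than that of the enclosing bigger group. That is one plausible reason why small fine-tuned deltas tend to survive group_size=32 better than group_size=64, at the cost of storing more per-group parameters.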

Limitations

Optimized for structured spreadsheet content, not general text
1024-token context: large spreadsheets must be split into chunks
PII patterns trained primarily on US and Chinese formats
Extraction templates cover 5 document types (invoice, receipt, PO, expense, payroll)
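
Because of the 1024-token context, larger sheets have to be fed to the model in pieces. The following is a minimal sketch of row-based chunking, not CellSentry's actual strategy: the 4-characters-per-token estimate and the `reserved` budget for the system prompt and the model's answer are assumptions, and both helper names are hypothetical.

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate (about 4 characters per token).

    This is an assumption, not the real tokenizer; it is likely to
    undercount for Chinese text.
    """
    return max(1, len(text) // 4)


def chunk_rows(rows, max_tokens=1024, reserved=256):
    """Group consecutive rows into chunks whose estimated size fits the budget.

    reserved is a hypothetical allowance for the system prompt and the output.
    A single row larger than the budget is emitted as its own oversized chunk.
    """
    budget = max_tokens - reserved
    if budget <= 0:
        raise ValueError("reserved must be smaller than max_tokens")
    chunks, current, used = [], [], 0
    for row in rows:
        cost = estimate_tokens(row)
        # Start a new chunk when adding this row would exceed the budget.
        if current and used + cost > budget:
            chunks.append(current)
            current, used = [], 0
        current.append(row)
        used += cost
    if current:
        chunks.append(current)
    return chunks
```

For a real deployment, replacing `estimate_tokens` with the model's tokenizer would give exact counts. Each chunk can then be sent as a separate user message and the results merged afterwards.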

CellSentry App — Desktop app that uses this model
CellSentry Website — Project homepage
