metadata
license: mit
language:
  - en
  - zh
base_model:
  - Qwen/Qwen2.5-1.5B
tags:
  - cellsentry
  - excel
  - spreadsheet
  - formula-audit
  - pii-detection
  - data-extraction
  - gguf
  - mlx
  - lora
  - qwen2.5
pipeline_tag: text-generation

CellSentry Model: Multi-Task Spreadsheet AI

A fine-tuned 1.5B-parameter model for spreadsheet intelligence tasks. Built on Qwen2.5-1.5B with LoRA, it handles three distinct tasks through prompt routing:

  • Formula Audit: Verify or dismiss rule engine findings in Excel formulas
  • PII Detection: Identify sensitive data (SSN, phone, email, national IDs) in cell values
  • Data Extraction: Extract structured fields (invoice number, date, vendor, totals) from spreadsheets

Model Details

| Property | Value |
|---|---|
| Base model | Qwen/Qwen2.5-1.5B |
| Fine-tuning | LoRA (rank 16, alpha 32) |
| Training | 4000 iterations, batch_size=2, lr=3e-5, AdamW |
| Quantization | 4-bit, group_size=32 (Q4_K_M for GGUF) |
| Context length | 1024 tokens |
| License | MIT |
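
As a rough illustration of the LoRA settings above, the sketch below shows how rank and alpha combine under the standard LoRA convention, where the low-rank update B·A is scaled by alpha / rank before being added to the base weight. The matrix shapes and values are toy placeholders, the helper names are hypothetical, and the training framework actually used may parameterize the scale differently.

```python
# Toy illustration of a LoRA update under the standard convention:
#   W_fused = W + (alpha / rank) * (B @ A)
# Shapes and values are made up; only RANK and ALPHA come from the model card.

RANK = 16
ALPHA = 32


def lora_scale(alpha: float, rank: int) -> float:
    """Scaling factor applied to the low-rank update (standard LoRA convention)."""
    return alpha / rank


def matmul(x, y):
    """Plain-Python matrix product for small nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def fuse(w, b, a, scale):
    """Return w + scale * (b @ a), i.e. the adapter folded into the base weight."""
    delta = matmul(b, a)
    return [[wij + scale * dij for wij, dij in zip(wr, dr)] for wr, dr in zip(w, delta)]


if __name__ == "__main__":
    scale = lora_scale(ALPHA, RANK)
    # 2x2 base weight with a rank-1 adapter, kept tiny for readability.
    w = [[1.0, 0.0], [0.0, 1.0]]
    b = [[0.5], [0.0]]   # shape (2, 1)
    a = [[1.0, 2.0]]     # shape (1, 2)
    print(scale)
    print(fuse(w, b, a, scale))
```

With rank 16 and alpha 32 the scale works out to 2, so the adapter's contribution is doubled when it is folded into the base weight.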

Available Formats

| Format | File | Size | Platform |
|---|---|---|---|
| GGUF (Q4_K_M) | cellsentry-1.5b-v3-q4km.gguf | ~940 MB | Windows (llama.cpp) |
| MLX (4-bit g32) | cellsentry-1.5b-v3-4bit-g32/ | ~920 MB | macOS (MLX) |

Currently, only the GGUF format is uploaded; the MLX format is coming soon.

Usage

This model is designed to be used with CellSentry, an open-source desktop app for spreadsheet auditing. The app downloads the model automatically on first launch.

Manual Download

# Install Hugging Face CLI
pip install huggingface-hub

# Download GGUF model
huggingface-cli download almax000/cellsentry-model cellsentry-1.5b-v3-q4km.gguf --local-dir ./models
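
The same download can also be scripted from Python. The following is a minimal sketch using `hf_hub_download` from the `huggingface_hub` package installed above; the repository and file names are taken from the CLI command, while the helper name `download_gguf` is only illustrative.

```python
# Sketch: programmatic equivalent of the CLI download above.
from pathlib import Path

REPO_ID = "almax000/cellsentry-model"
FILENAME = "cellsentry-1.5b-v3-q4km.gguf"


def download_gguf(local_dir: str = "./models") -> Path:
    """Fetch the GGUF file into local_dir and return its path (needs network access)."""
    # Imported lazily so this module loads even without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    path = hf_hub_download(repo_id=REPO_ID, filename=FILENAME, local_dir=local_dir)
    return Path(path)


# Usage (performs a download of roughly 940 MB):
#   model_path = download_gguf()
```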

Prompt Format

The model uses the Qwen2.5 chat template with task-specific system prompts:

Formula Audit:

<|im_start|>system
You are a spreadsheet formula auditor...<|im_end|>
<|im_start|>user
{rule engine finding + cell context}<|im_end|>
<|im_start|>assistant

PII Detection:

<|im_start|>system
You are a PII detection specialist...<|im_end|>
<|im_start|>user
{cell values to scan}<|im_end|>
<|im_start|>assistant

Data Extraction:

<|im_start|>system
You are a document data extractor...<|im_end|>
<|im_start|>user
{spreadsheet content + template}<|im_end|>
<|im_start|>assistant
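
The three templates above differ only in the system prompt, so prompt routing can be sketched as a lookup plus string assembly. In the sketch below the task keys and function names are hypothetical, and the system prompts are left truncated exactly as shown above (the full text is not given here). The `generate` helper assumes the `llama-cpp-python` package as the runtime, which is an assumption rather than something this README specifies.

```python
# Sketch: assemble a Qwen2.5 ChatML prompt for one of the three tasks.
# System prompts are truncated in the README ("..."), so they stay truncated
# here; substitute the full prompts before real use.

SYSTEM_PROMPTS = {
    "formula_audit": "You are a spreadsheet formula auditor...",
    "pii_detection": "You are a PII detection specialist...",
    "data_extraction": "You are a document data extractor...",
}


def build_prompt(task: str, user_content: str) -> str:
    """Return the full prompt string, ending at the open assistant turn."""
    if task not in SYSTEM_PROMPTS:
        raise ValueError(f"unknown task: {task!r}")
    return (
        f"<|im_start|>system\n{SYSTEM_PROMPTS[task]}<|im_end|>\n"
        f"<|im_start|>user\n{user_content}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )


def generate(model_path: str, prompt: str, max_tokens: int = 256) -> str:
    """Run a prompt through llama-cpp-python (assumed runtime, optional dependency)."""
    from llama_cpp import Llama  # lazy import so build_prompt works without it

    llm = Llama(model_path=model_path, n_ctx=1024)  # 1024 matches the stated context length
    out = llm(prompt, max_tokens=max_tokens, stop=["<|im_end|>"])
    return out["choices"][0]["text"]
```

Because the context length is 1024 tokens, the prompt and `max_tokens` together need to stay under that limit.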

Training

  • Method: LoRA fine-tuning with multi-task data
  • Data: Synthetic + real-world spreadsheet samples across all three tasks
  • Fusion: LoRA weights fused into base model, then quantized (dequantize → fuse → re-quantize with group_size=32)
  • Key lesson: quantizing at group_size=64 loses fine-tuning quality; group_size=32 is the quality floor for 1.5B models
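
One plausible intuition for the group-size observation above is that smaller groups track the local weight range more tightly, which lowers rounding error. The sketch below is a simplified, pure-Python version of group-wise 4-bit affine quantization applied to synthetic Gaussian weights. It is illustrative only: the actual MLX and llama.cpp (Q4_K_M) quantizers differ in detail, and the weights are random stand-ins, not the model's.

```python
# Simplified group-wise 4-bit affine quantization (illustration only).
import random


def quantize_dequantize(weights, group_size, bits=4):
    """Quantize each group to 2**bits levels between its min and max, then map back."""
    levels = 2 ** bits - 1
    restored = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        lo, hi = min(group), max(group)
        scale = (hi - lo) / levels or 1.0  # fall back to 1.0 for constant groups
        restored.extend(lo + round((w - lo) / scale) * scale for w in group)
    return restored


def mean_abs_error(weights, group_size):
    """Average absolute reconstruction error for a given group size."""
    restored = quantize_dequantize(weights, group_size)
    return sum(abs(w - r) for w, r in zip(weights, restored)) / len(weights)


random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(8192)]  # synthetic stand-in weights
err32 = mean_abs_error(weights, 32)
err64 = mean_abs_error(weights, 64)
print(f"group_size=32: {err32:.6f}")
print(f"group_size=64: {err64:.6f}")
```

The trade-off is storage: halving the group size doubles the number of per-group scale and offset values that must be kept alongside the 4-bit weights.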

Limitations

  • Optimized for structured spreadsheet content, not general text
  • 1024-token context: large spreadsheets need chunking
  • PII patterns trained primarily on US and Chinese formats
  • Extraction templates cover 5 document types (invoice, receipt, PO, expense, payroll)
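
For the chunking mentioned in the list above, one simple approach is to pack consecutive rows into chunks that fit a token budget. The sketch below is hypothetical and not the app's implementation: the reserved-token allowance is an arbitrary choice, and the token estimate is a crude character-count heuristic that will undercount CJK text, so a real tokenizer should replace it in practice.

```python
# Sketch: split spreadsheet rows into chunks that fit a token budget.
# The token estimate is a crude heuristic (assumption), not the Qwen2.5 tokenizer.

CONTEXT_TOKENS = 1024    # context length stated in this README
RESERVED_TOKENS = 384    # hypothetical allowance for system prompt + model output


def estimate_tokens(text: str) -> int:
    """Rough estimate: one token per 3 characters (rounded up), at least 1."""
    return max(1, -(-len(text) // 3))  # ceiling division


def chunk_rows(rows, budget=CONTEXT_TOKENS - RESERVED_TOKENS):
    """Group consecutive rows so each chunk's estimated tokens stay within budget."""
    chunks, current, used = [], [], 0
    for row in rows:
        cost = estimate_tokens(row)
        if current and used + cost > budget:
            chunks.append(current)
            current, used = [], 0
        current.append(row)  # an oversized single row still gets its own chunk
        used += cost
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be sent as a separate user message, with results merged by the caller.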

Related