---
base_model: unsloth/granite-4.0-h-micro
tags:
- text-generation-inference
- transformers
- unsloth
- granitemoehybrid
- trl
license: apache-2.0
language:
- en
---

# Precis: Document Summarization

## Model Overview

**Precis** is a specialized document summarization model fine-tuned from IBM's Granite 4.0-H-Micro (3.2B parameters) using efficient LoRA adapters. It generates comprehensive ~300-word summaries optimized for question answering, and it keeps documents private by running entirely on local, on-premise infrastructure.

**Key Features:**

- **Privacy-First**: Process sensitive documents entirely on your infrastructure
- **Fast**: 0.5s inference time (5-10x faster than cloud APIs)
- **Cost-Effective**: Zero per-document API fees
- **Long Context**: 128K tokens ≈ 320-380 book pages
- **Specialized**: Trained on 5,500+ document-summary pairs, covering millions of tokens
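
The long-context figure above implies a rough tokens-per-page density. The sketch below back-computes that density from the stated page range and uses it to estimate the token count of a document from its page count. It assumes "128K" means 128 * 1024 tokens, and the function names are hypothetical helpers for illustration, not part of any library.

```python
# Assumption: "128K" is read as 128 * 1024 tokens; the page range comes from this card.
CONTEXT_TOKENS = 128 * 1024
PAGES_LOW, PAGES_HIGH = 320, 380


def implied_tokens_per_page():
    """Return the (dense, sparse) tokens-per-page implied by the page range."""
    dense = CONTEXT_TOKENS / PAGES_LOW    # fewer pages fit, so each page is denser
    sparse = CONTEXT_TOKENS / PAGES_HIGH  # more pages fit, so each page is sparser
    return dense, sparse


def estimated_tokens(pages, tokens_per_page):
    """Rough token estimate for a document with the given number of pages."""
    return round(pages * tokens_per_page)


dense, sparse = implied_tokens_per_page()
print(f"Implied density: {sparse:.0f}-{dense:.0f} tokens per page")
print("250 dense pages:", estimated_tokens(250, dense), "tokens")
```

Real documents vary widely in density, so treat the result as a planning estimate and count tokens with the tokenizer when the margin is tight.
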

## Quick Start

### Using with Transformers + PEFT

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    "unsloth/granite-4.0-h-micro",
    torch_dtype=torch.float16,
    device_map="auto",
)

# Load LoRA adapters
model = PeftModel.from_pretrained(base_model, "cernis-intelligence/precis")
tokenizer = AutoTokenizer.from_pretrained("cernis-intelligence/precis")

# Generate summary
document = """Your long document here..."""

messages = [
    {"role": "user", "content": f"Summarize the following document in around 300 words:\n\n{document}"}
]

# return_dict=True returns the attention mask alongside input_ids
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.3,
    top_p=0.9,
    do_sample=True,
)

# Decode only the newly generated tokens so the prompt is not echoed back
prompt_length = inputs["input_ids"].shape[1]
summary = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)
print(summary)
```

### Using with Unsloth (Recommended)

```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="cernis-intelligence/precis",
    max_seq_length=2048,  # Increase for longer input documents
    load_in_4bit=True,  # For lower memory usage
)

FastLanguageModel.for_inference(model)

document = """Your long document here..."""

messages = [
    {"role": "user", "content": f"Summarize the following document in around 300 words:\n\n{document}"}
]

inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to("cuda")

# do_sample=True is needed for temperature to take effect
outputs = model.generate(**inputs, max_new_tokens=512, temperature=0.3, do_sample=True)

# Decode only the newly generated tokens so the prompt is not echoed back
prompt_length = inputs["input_ids"].shape[1]
summary = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)
print(summary)
```
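
To put the lower-memory option in context, the following is a back-of-the-envelope estimate of weight memory at 16-bit versus 4-bit precision, using the 3.2B parameter count stated in this card. It is a weights-only sketch: it assumes every parameter is stored at the given width and ignores quantization metadata, activations, and cache memory, so actual usage will be higher. The helper name is hypothetical.

```python
# Weights-only estimate; real usage is higher (activations, caches, runtime overhead).
PARAMS = 3.2e9  # parameter count stated in this card


def weight_memory_gb(n_params, bits_per_param):
    """Approximate memory needed for model weights, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


for label, bits in [("float16", 16), ("4-bit", 4)]:
    print(f"{label}: ~{weight_memory_gb(PARAMS, bits):.1f} GB of weights")
```
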

### Using with vLLM (Production)

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# Initialize vLLM with base model
llm = LLM(
    model="unsloth/granite-4.0-h-micro",
    enable_lora=True,
    max_lora_rank=32,
    gpu_memory_utilization=0.9,
)

# Create LoRA request
lora_request = LoRARequest(
    lora_name="precis-granite",
    lora_int_id=1,
    lora_path="cernis-intelligence/precis",
)

# Sampling parameters
sampling_params = SamplingParams(
    temperature=0.3,
    top_p=0.9,
    max_tokens=512,
)

document = """Your long document here..."""

# Generate; llm.chat applies the model's chat template, matching the examples above
messages = [
    {"role": "user", "content": f"Summarize the following document in around 300 words:\n\n{document}"}
]
outputs = llm.chat(messages, sampling_params, lora_request=lora_request)

print(outputs[0].outputs[0].text)
```

---

## Training Details

### Base Model

- **Architecture**: IBM Granite 4.0-H-Micro
- **Parameters**: 3.2B (38.4M trainable via LoRA)
- **Context Length**: 128K tokens
- **License**: Apache 2.0
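
For a sense of scale, the parameter counts above imply that only about one percent of the model is updated during fine-tuning. The arithmetic is sketched below; it assumes the 3.2B figure is the base model size and the 38.4M figure is the LoRA adapter size, as listed in this section.

```python
TOTAL_PARAMS = 3.2e9       # base model size stated above
TRAINABLE_PARAMS = 38.4e6  # LoRA parameters stated above


def trainable_fraction(trainable, total):
    """Share of parameters that receive gradient updates."""
    return trainable / total


fraction = trainable_fraction(TRAINABLE_PARAMS, TOTAL_PARAMS)
print(f"Trainable share: {fraction:.2%}")
```

With the adapter loaded through PEFT, calling `model.print_trainable_parameters()` on the `PeftModel` should report the exact counts for comparison.
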

## Use Cases

### Perfect For:

- **Legal Document Review**: Summarize contracts while maintaining confidentiality
- **Medical Records**: Summarize patient notes on-premise, so protected health information stays inside your HIPAA-governed environment
- **Financial Reports**: Analyze earnings reports without exposing sensitive data
- **Research Papers**: Quick digests of academic literature
- **Email Threads**: Comprehensive summaries of long conversations

### Considerations:

- Works best with documents under 380 pages (128K token limit)
- Optimized for English text (multilingual support coming)
- May miss some deeply nested structured data (tables, forms)
- For specialized needs, consider fine-tuning on domain-specific data
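
Given the token limit noted above, it can help to check a document's size before sending it to the model. The sketch below is one possible pre-flight check: it reserves room for the 512 generated tokens used in the examples in this card and reports whether the prompt fits. It assumes "128K" means 128 * 1024 tokens; `fits_in_context` and `check_document` are hypothetical helpers, and the character-based counter is only a stand-in for a real tokenizer.

```python
CONTEXT_LIMIT = 128 * 1024  # assumption: "128K" read as 128 * 1024 tokens
MAX_NEW_TOKENS = 512        # matches the generation examples in this card


def fits_in_context(prompt_tokens, max_new_tokens=MAX_NEW_TOKENS, limit=CONTEXT_LIMIT):
    """True if the prompt plus the reserved output budget fits in the window."""
    return prompt_tokens + max_new_tokens <= limit


def check_document(text, count_tokens):
    """Count tokens with any callable mapping text -> int, then test the budget."""
    n_tokens = count_tokens(text)
    return n_tokens, fits_in_context(n_tokens)


def approx_counter(text):
    """Stand-in counter for illustration only (roughly 4 characters per token)."""
    return len(text) // 4


n_tokens, ok = check_document("word " * 1000, approx_counter)
print(f"{n_tokens} tokens, fits: {ok}")
```

In practice, the counter would wrap the model's own tokenizer, for example `lambda t: len(tokenizer(t)["input_ids"])`, and the chat template adds a few extra tokens on top of the document itself.
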

## License

This model is released under the **Apache 2.0 License**, the same license as the base IBM Granite 4.0 model.

```
Copyright 2025

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0
```