	---
	base_model: unsloth/granite-4.0-h-micro
	tags:
	- text-generation-inference
	- transformers
	- unsloth
	- granitemoehybrid
	- trl
	license: apache-2.0
	language:
	- en
	---

	# Precis: Document Summarization

	## Model Overview

	Precis is a specialized document summarization model fine-tuned from IBM's Granite 4.0-H-Micro (3.2B parameters) using efficient LoRA adapters. It generates comprehensive ~300-word summaries optimized for question-answering capability while maintaining complete privacy through local, on-premise processing.

	Key Features:
	- 🔒 Privacy-First: Process sensitive documents entirely on your infrastructure
	- ⚡ Fast: 0.5s inference time (5-10x faster than cloud APIs)
	- 💰 Cost-Effective: Zero per-document API fees
	- 📚 Long Context: 128K tokens ≈ 320-380 book pages
	- 🎯 Specialized: Fine-tuned on 5,500+ document-summary pairs, totaling millions of training tokens
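
The page estimate in the Long Context bullet follows from simple arithmetic. The sketch below is illustrative only: it assumes "128K" means 128,000 tokens, and `tokens_per_page` and `fits_in_context` are hypothetical helper names, not part of the model's API. Real token counts depend on the tokenizer and the text.

```python
CONTEXT_TOKENS = 128_000  # assumption: "128K" read as 128,000 tokens


def tokens_per_page(pages: int, context_tokens: int = CONTEXT_TOKENS) -> float:
    """Tokens per page implied by fitting `pages` pages into the context window."""
    return context_tokens / pages


def fits_in_context(n_tokens: int, max_new_tokens: int = 512,
                    context_tokens: int = CONTEXT_TOKENS) -> bool:
    """True if the prompt plus the generation budget fits in the window."""
    return n_tokens + max_new_tokens <= context_tokens


print(tokens_per_page(320))         # 400.0
print(round(tokens_per_page(380)))  # 337
print(fits_in_context(127_000))     # True
print(fits_in_context(127_600))     # False
```

In other words, the 320-380 page range corresponds to roughly 340-400 tokens per page, and the prompt must leave room for the tokens to be generated.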


	## 🚀 Quick Start

	### Using with Transformers + PEFT

	```python
	from transformers import AutoModelForCausalLM, AutoTokenizer
	from peft import PeftModel
	import torch

	# Load the base model
	base_model = AutoModelForCausalLM.from_pretrained(
	    "unsloth/granite-4.0-h-micro",
	    torch_dtype=torch.float16,
	    device_map="auto",
	)

	# Attach the Precis LoRA adapters
	model = PeftModel.from_pretrained(base_model, "cernis-intelligence/precis")
	tokenizer = AutoTokenizer.from_pretrained("cernis-intelligence/precis")

	# Build the chat prompt
	document = """Your long document here..."""

	messages = [
	    {"role": "user", "content": f"Summarize the following document in around 300 words:\n\n{document}"}
	]

	inputs = tokenizer.apply_chat_template(
	    messages,
	    tokenize=True,
	    add_generation_prompt=True,
	    return_dict=True,  # return input_ids and attention_mask together
	    return_tensors="pt",
	).to(model.device)

	# Generate the summary
	outputs = model.generate(
	    **inputs,
	    max_new_tokens=512,
	    temperature=0.3,
	    top_p=0.9,
	    do_sample=True,
	)

	# Decode only the newly generated tokens, not the echoed prompt
	prompt_length = inputs["input_ids"].shape[-1]
	summary = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)
	print(summary)
	```

	### Using with Unsloth (Recommended)

	```python
	from unsloth import FastLanguageModel

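	model, tokenizer = FastLanguageModel.from_pretrained(
	    model_name="cernis-intelligence/precis",
	    max_seq_length=2048,  # increase for documents longer than about 2K tokens
	    load_in_4bit=True,  # for lower memory usage
	)

	FastLanguageModel.for_inference(model)  # switch to Unsloth's inference mode

	document = """Your long document here..."""

	messages = [
	    {"role": "user", "content": f"Summarize the following document in around 300 words:\n\n{document}"}
	]

	inputs = tokenizer.apply_chat_template(
	    messages,
	    tokenize=True,
	    add_generation_prompt=True,
	    return_dict=True,  # return input_ids and attention_mask together
	    return_tensors="pt",
	).to("cuda")

	# do_sample=True is needed for temperature to take effect
	outputs = model.generate(**inputs, max_new_tokens=512, temperature=0.3, do_sample=True)

	# Decode only the newly generated tokens, not the echoed prompt
	prompt_length = inputs["input_ids"].shape[-1]
	summary = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)
	print(summary)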
	```

	### Using with vLLM (Production)

	```python
	from vllm import LLM, SamplingParams
	from vllm.lora.request import LoRARequest

	# Initialize vLLM with the base model and LoRA support
	llm = LLM(
	    model="unsloth/granite-4.0-h-micro",
	    enable_lora=True,
	    max_lora_rank=32,
	    gpu_memory_utilization=0.9,
	)

	# Point vLLM at the Precis adapter: (name, unique integer ID, path or repo)
	lora_request = LoRARequest(
	    "precis-granite",
	    1,
	    "cernis-intelligence/precis",
	)

	# Sampling parameters
	sampling_params = SamplingParams(
	    temperature=0.3,
	    top_p=0.9,
	    max_tokens=512,
	)

	document = """Your long document here..."""

	# llm.chat applies the model's chat template, matching the examples above
	messages = [
	    {"role": "user", "content": f"Summarize the following document in around 300 words:\n\n{document}"}
	]
	outputs = llm.chat(messages, sampling_params, lora_request=lora_request)

	print(outputs[0].outputs[0].text)
	```
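
The Quick Start snippets all send the same single-turn instruction. A small helper can keep that prompt in one place. This is a sketch: `build_messages` and `target_words` are hypothetical names, and since the model card describes summaries of about 300 words, other targets are untested assumptions.

```python
def build_messages(document: str, target_words: int = 300) -> list[dict[str, str]]:
    """Wrap a document in the single-turn chat format used in this README."""
    text = document.strip()
    if not text:
        raise ValueError("document is empty")
    prompt = (
        f"Summarize the following document in around {target_words} words:"
        f"\n\n{text}"
    )
    return [{"role": "user", "content": prompt}]


messages = build_messages("Quarterly revenue rose 12% ...")
print(messages[0]["role"])  # user
```

The returned list can be passed to `tokenizer.apply_chat_template` in the same way as the hand-written `messages` in the snippets above.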

	---

	## 📊 Training Details

	### Base Model
	- Architecture: IBM Granite 4.0-H-Micro
	- Parameters: 3.2B (38.4M trainable via LoRA)
	- Context Length: 128K tokens
	- License: Apache 2.0
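
To put the parameter counts in proportion, the arithmetic below uses only the two figures listed above. The adapter size estimate assumes the LoRA weights are stored at 2 bytes per parameter (fp16), which this README does not state.

```python
TOTAL_PARAMS = 3.2e9       # base model size listed above
TRAINABLE_PARAMS = 38.4e6  # LoRA parameters listed above

# Share of the model's parameters that were updated during fine-tuning
fraction = TRAINABLE_PARAMS / TOTAL_PARAMS

# Assumption: 2 bytes per weight (fp16 storage)
adapter_mb_fp16 = TRAINABLE_PARAMS * 2 / 1e6

print(f"{fraction:.2%}")            # 1.20%
print(f"{adapter_mb_fp16:.1f} MB")  # 76.8 MB
```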

	## 🎯 Use Cases

	### ✅ Perfect For:
	- 📄 Legal Document Review: Summarize contracts while maintaining confidentiality
	- 🏥 Medical Records: Summarize patient notes on-premise to support HIPAA-compliant workflows
	- 💼 Financial Reports: Analyze earnings reports without exposing sensitive data
	- 📚 Research Papers: Quick digests of academic literature
	- 📧 Email Threads: Comprehensive summaries of long conversations

	### ⚠️ Considerations:
	- Works best with documents under 380 pages (128K token limit)
	- Optimized for English text (multilingual support coming)
	- May miss some deeply nested structured data (tables, forms)
	- For specialized needs, consider fine-tuning on domain-specific data
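
For documents that exceed the context limit, one possible workaround is to split the text and summarize it in pieces. This README does not describe such a workflow, so the sketch below is only an assumption: `split_into_chunks` is a hypothetical helper, and it uses word count as a rough stand-in for token count.

```python
def split_into_chunks(text: str, max_words: int) -> list[str]:
    """Greedily pack paragraphs into chunks of at most `max_words` words.

    A paragraph longer than `max_words` becomes its own oversized chunk
    rather than being cut mid-sentence.
    """
    chunks, current, count = [], [], 0
    for para in text.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        n = len(para.split())
        # Start a new chunk when adding this paragraph would exceed the budget
        if current and count + n > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks


doc = "one two three\n\nfour five\n\nsix seven eight nine"
print(split_into_chunks(doc, max_words=5))
# ['one two three\n\nfour five', 'six seven eight nine']
```

Each chunk could then be summarized separately and the partial summaries summarized again. The quality of that two-stage approach has not been evaluated for this model.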

	## 📄 License

	This model is released under the Apache 2.0 License, the same license as the base IBM Granite 4.0 model.

	```
	Copyright 2025

	Licensed under the Apache License, Version 2.0 (the "License");
	you may not use this file except in compliance with the License.
	You may obtain a copy of the License at

	http://www.apache.org/licenses/LICENSE-2.0
	```