---
base_model: unsloth/granite-4.0-h-micro
tags:
- text-generation-inference
- transformers
- unsloth
- granitemoehybrid
- trl
license: apache-2.0
language:
- en
---

# Precis: Document Summarization

## Model Overview

**Precis** is a specialized document summarization model fine-tuned from IBM's Granite 4.0-H-Micro (3.2B parameters) using efficient LoRA adapters. It generates comprehensive summaries of roughly 300 words, optimized so that questions about the source document can be answered from the summary alone. All processing runs locally, so documents never leave your infrastructure.

**Key Features:**

- 🔒 **Privacy-First**: Process sensitive documents entirely on your infrastructure
- ⚡ **Fast**: 0.5s inference time (5-10x faster than cloud APIs)
- 💰 **Cost-Effective**: Zero per-document API fees
- 📚 **Long Context**: 128K tokens ≈ 320-380 book pages
- 🎯 **Specialized**: Trained on 5,500+ document-summary pairs totaling millions of tokens

## 🚀 Quick Start

### Using with Transformers + PEFT

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    "unsloth/granite-4.0-h-micro",
    torch_dtype=torch.float16,
    device_map="auto"
)

# Load LoRA adapters
model = PeftModel.from_pretrained(base_model, "cernis-intelligence/precis")
tokenizer = AutoTokenizer.from_pretrained("cernis-intelligence/precis")

# Generate summary
document = """Your long document here..."""

messages = [
    {"role": "user", "content": f"Summarize the following document in around 300 words:\n\n{document}"}
]

# return_dict=True also returns the attention mask alongside input_ids
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
).to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.3,
    top_p=0.9,
    do_sample=True
)

# Decode only the newly generated tokens so the prompt is not echoed back
prompt_length = inputs["input_ids"].shape[-1]
summary = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)
print(summary)
```

### Using with Unsloth (Recommended)

```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="cernis-intelligence/precis",
    max_seq_length=2048,  # Raise this for documents longer than 2048 tokens
    load_in_4bit=True,  # For lower memory usage
)
FastLanguageModel.for_inference(model)

document = """Your long document here..."""

messages = [
    {"role": "user", "content": f"Summarize the following document in around 300 words:\n\n{document}"}
]

inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
).to("cuda")

# do_sample=True is required for temperature to take effect
outputs = model.generate(**inputs, max_new_tokens=512, temperature=0.3, do_sample=True)

# Decode only the newly generated tokens so the prompt is not echoed back
prompt_length = inputs["input_ids"].shape[-1]
summary = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)
print(summary)
```

### Using with vLLM (Production)

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# Initialize vLLM with base model
llm = LLM(
    model="unsloth/granite-4.0-h-micro",
    enable_lora=True,
    max_lora_rank=32,
    gpu_memory_utilization=0.9
)

# Create LoRA request
lora_request = LoRARequest(
    "precis-granite",
    1,
    "cernis-intelligence/precis"
)

# Sampling parameters
sampling_params = SamplingParams(
    temperature=0.3,
    top_p=0.9,
    max_tokens=512
)

# Generate
# Note: generate() sends the raw prompt; the chat template is not applied here
document = """Your long document here..."""
prompts = ["Summarize the following document in around 300 words:\n\n" + document]
outputs = llm.generate(prompts, sampling_params, lora_request=lora_request)
print(outputs[0].outputs[0].text)
```

---

## 📊 Training Details

### Base Model

- **Architecture**: IBM Granite 4.0-H-Micro
- **Parameters**: 3.2B (38.4M trainable via LoRA)
- **Context Length**: 128K tokens
- **License**: Apache 2.0

## 🎯 Use Cases

### ✅ Perfect For:

- 📄 **Legal Document Review**: Summarize contracts while maintaining confidentiality
- 🏥 **Medical Records**: Summarize patient notes on-premise to support HIPAA-compliant workflows
- 💼 **Financial Reports**: Analyze earnings reports without exposing sensitive data
- 📚 **Research Papers**: Quick digests of academic literature
- 📧 **Email Threads**: Comprehensive summaries of long conversations

### ⚠️ Considerations:

- Works best with documents under 380 pages (128K token limit)
- Optimized for English text (multilingual support coming)
- May miss some deeply nested structured data (tables, forms)
- For specialized needs, consider fine-tuning on domain-specific data

## 📄 License

This model is released under the **Apache 2.0 License**, the same license as the base IBM Granite 4.0 model.

```
Copyright 2025

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0
```
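
## 📏 Checking Document Length

The Considerations section cites a 128K-token limit, so it can help to check a document's length before sending it to the model. The snippet below is a minimal sketch, not part of the released model. `fits_in_context` and `whitespace_token_count` are hypothetical helper names. Reading "128K" as 128 × 1024 tokens is an assumption, and the default of 512 reserved output tokens simply mirrors `max_new_tokens` in the Quick Start examples. The whitespace counter is only a stand-in so the example runs without downloading a tokenizer.

```python
from typing import Callable

# Assumption: "128K" is read as 128 * 1024 tokens; adjust if your
# deployment is configured with a different context limit.
CONTEXT_LIMIT = 128 * 1024


def whitespace_token_count(text: str) -> int:
    """Crude stand-in tokenizer: counts whitespace-separated words."""
    return len(text.split())


def fits_in_context(
    document: str,
    count_tokens: Callable[[str], int],
    reserved_for_output: int = 512,
    context_limit: int = CONTEXT_LIMIT,
) -> bool:
    """Return True if the document plus the output budget fits the window.

    The instruction text and chat-template tokens are not counted here,
    so leave some extra headroom in practice.
    """
    return count_tokens(document) + reserved_for_output <= context_limit


if __name__ == "__main__":
    sample = "Your long document here..."
    print(fits_in_context(sample, whitespace_token_count))
```

With the real tokenizer loaded as in Quick Start, the counter could be swapped for `lambda text: len(tokenizer.encode(text))` to get an actual token count instead of a word count.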