SmartReport: Urban Summarization Model

SmartReport is a fine-tuned T5-base model designed to generate concise summaries of long-form urban governance reports. It was trained with LoRA (Low-Rank Adaptation) to handle the length, complexity, and domain-specific vocabulary of urban planning, transportation, and public infrastructure documents.

Project Context

Standard summarization models often struggle with the extreme length and technical jargon of government reports. SmartReport addresses this by focusing on a cleaned, urban-centric subset of the GovReport dataset, ensuring the model remains grounded in city-related themes like zoning, transit, and housing.

Dataset & Filtering

The model was trained on a specialized subset of ccdv/govreport-summarization (GAO/CRS reports).

Data Statistics:

  • Training Set: 6,611 samples
  • Validation Set: 363 samples
  • Test Set: 366 samples
  • Average Report Length: ~5,600 words
  • Average Summary Length: ~480 words
  • Compression Ratio: ~8.7%

Topic Distribution:

The dataset was filtered to ensure relevance to urban development. Topics overlap, so a report can count toward more than one category:

  • Urban/Transport: 95%
  • Infrastructure & Housing: ~65%
  • Environment & Finance: ~55%
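A topic filter like the one described above can be sketched as a keyword match over report text. The keyword lists and the keep-if-any-topic-matches rule below are illustrative assumptions, not the project's actual filtering code:

```python
# Hypothetical keyword-based topic filter for building an urban-centric
# subset of GovReport. Keyword lists are illustrative assumptions.
TOPIC_KEYWORDS = {
    "urban_transport": {"transit", "zoning", "traffic", "pedestrian", "metro"},
    "infrastructure_housing": {"housing", "infrastructure", "bridge", "utility"},
    "environment_finance": {"emissions", "stormwater", "budget", "bond"},
}

def topic_hits(report_text: str) -> set:
    """Return the set of topic labels whose keywords appear in the report."""
    words = set(report_text.lower().split())
    return {topic for topic, kws in TOPIC_KEYWORDS.items() if words & kws}

def is_urban_relevant(report_text: str) -> bool:
    # Keep a report if it matches at least one urban-related topic.
    return len(topic_hits(report_text)) >= 1
```

Because a report can hit several keyword lists at once, per-topic percentages computed this way naturally sum to more than 100%.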

Technical Specifications

To optimize memory usage and training efficiency, the model utilizes Parameter-Efficient Fine-Tuning (PEFT).

  • Rank (r): 8
  • Alpha: 32
  • Target Modules: Query, Value, Key, Output (q, v, k, o)
  • Dropout: 0.05
  • Input Context: Up to 1024 tokens (using truncation/sliding window for longer documents).

Quick Start

from peft import PeftModel, PeftConfig
from transformers import T5ForConditionalGeneration, T5Tokenizer

model_id = "Abdennour07/SmartReport-T5-base-LoRA-GovReport"  # or your own fork
config = PeftConfig.from_pretrained(model_id)
base_model = T5ForConditionalGeneration.from_pretrained(config.base_model_name_or_path)
model = PeftModel.from_pretrained(base_model, model_id)
tokenizer = T5Tokenizer.from_pretrained(config.base_model_name_or_path)

def generate_summary(text):
    inputs = tokenizer("summarize: " + text, return_tensors="pt", max_length=1024, truncation=True)
    outputs = model.generate(**inputs, max_new_tokens=512, length_penalty=2.0, num_beams=4)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Example Usage
report_text = "The city's transit expansion project aims to..."
print(generate_summary(report_text))
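For reports longer than the 1024-token context, the input must be truncated (as above) or processed with a sliding window: summarize each overlapping chunk, then re-summarize the concatenated partial summaries. A hypothetical chunking helper (not part of the released code) over token ids:

```python
# Hypothetical sliding-window splitter; window/stride values are assumptions.
# A stride smaller than the window gives overlap, so sentences cut at a
# boundary still appear intact in the next chunk.
def sliding_windows(token_ids, window=1024, stride=768):
    """Yield overlapping windows of token ids covering the whole sequence."""
    if len(token_ids) <= window:
        yield token_ids
        return
    for start in range(0, len(token_ids) - window + stride, stride):
        yield token_ids[start:start + window]
```

Each window would then be decoded and passed through generate_summary, and the concatenated partial summaries run through the model once more for a final summary.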