SmartReport: Urban Summarization Model

SmartReport is a fine-tuned T5-base model designed to generate concise summaries of long-form urban governance reports. It was trained with LoRA (Low-Rank Adaptation) to handle the length, complexity, and domain-specific vocabulary of urban planning, transportation, and public infrastructure documents.

Project Context

Standard summarization models often struggle with the extreme length and technical jargon of government reports. SmartReport addresses this by focusing on a cleaned, urban-centric subset of the GovReport dataset, ensuring the model remains grounded in city-related themes like zoning, transit, and housing.

Dataset & Filtering

The model was trained on a specialized subset of ccdv/govreport-summarization (GAO/CRS reports).

Data Statistics:

  • Training Set: 6,611 samples
  • Validation Set: 363 samples
  • Test Set: 366 samples
  • Average Report Length: ~5,600 words
  • Average Summary Length: ~480 words
  • Compression Ratio: ~8.7%

Topic Distribution:

The dataset was filtered to ensure relevance to urban development. Topics overlap, so a report can count toward more than one category:

  • Urban/Transport: 95%
  • Infrastructure & Housing: ~65%
  • Environment & Finance: ~55%
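A topic filter like the one described above can be sketched as a keyword match over report text. The keyword lists and the keep-if-any-topic-matches rule below are illustrative assumptions, not the project's actual filtering code:

```python
# Hypothetical keyword-based topic filter for building an urban-centric
# subset of GovReport. Keyword lists are illustrative assumptions.
TOPIC_KEYWORDS = {
    "urban_transport": {"transit", "zoning", "traffic", "pedestrian", "metro"},
    "infrastructure_housing": {"housing", "infrastructure", "bridge", "utility"},
    "environment_finance": {"emissions", "stormwater", "budget", "bond"},
}

def topic_hits(report_text: str) -> set:
    """Return the set of topic labels whose keywords appear in the report."""
    words = set(report_text.lower().split())
    return {topic for topic, kws in TOPIC_KEYWORDS.items() if words & kws}

def is_urban_relevant(report_text: str) -> bool:
    # Keep a report if it matches at least one urban-related topic.
    return len(topic_hits(report_text)) >= 1
```

Because a report can hit several keyword lists at once, per-topic percentages computed this way naturally sum to more than 100%.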

Technical Specifications

To optimize memory usage and training efficiency, the model utilizes Parameter-Efficient Fine-Tuning (PEFT).

  • Rank (r): 8
  • Alpha: 32
  • Target Modules: Query, Value, Key, Output (q, v, k, o)
  • Dropout: 0.05
  • Input Context: Up to 1024 tokens (using truncation/sliding window for longer documents).

Quick Start

from peft import PeftModel, PeftConfig
from transformers import T5ForConditionalGeneration, T5Tokenizer

model_id = "Abdennour07/SmartReport-T5-base-LoRA-GovReport"  # or your own fork
config = PeftConfig.from_pretrained(model_id)
base_model = T5ForConditionalGeneration.from_pretrained(config.base_model_name_or_path)
model = PeftModel.from_pretrained(base_model, model_id)
tokenizer = T5Tokenizer.from_pretrained(config.base_model_name_or_path)

def generate_summary(text):
    inputs = tokenizer("summarize: " + text, return_tensors="pt", max_length=1024, truncation=True)
    outputs = model.generate(**inputs, max_new_tokens=512, length_penalty=2.0, num_beams=4)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Example Usage
report_text = "The city's transit expansion project aims to..."
print(generate_summary(report_text))
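For reports longer than the 1024-token context, the input must be truncated (as above) or processed with a sliding window: summarize each overlapping chunk, then re-summarize the concatenated partial summaries. A hypothetical chunking helper (not part of the released code) over token ids:

```python
# Hypothetical sliding-window splitter; window/stride values are assumptions.
# A stride smaller than the window gives overlap, so sentences cut at a
# boundary still appear intact in the next chunk.
def sliding_windows(token_ids, window=1024, stride=768):
    """Yield overlapping windows of token ids covering the whole sequence."""
    if len(token_ids) <= window:
        yield token_ids
        return
    for start in range(0, len(token_ids) - window + stride, stride):
        yield token_ids[start:start + window]
```

Each window would then be decoded and passed through generate_summary, and the concatenated partial summaries run through the model once more for a final summary.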