---
language:
- pt
license: other
license_name: custom-agplv3-dual-license
license_link: https://huggingface.co/inesctec/Citilink-BERTimbau-large-Topic-Classification-pt/blob/main/LICENSE
tags:
- transformers
- text-classification
- multi-label-classification
- bertimbau
- bert
- portuguese
- municipal-documents
- meeting-minutes
- fine-tuned
library_name: transformers
base_model:
- neuralmind/bert-large-portuguese-cased
pipeline_tag: text-classification
---

# Baseline_BERTimbau-large-Topic_Classification-Council-PT: Multi-Label Topic Classification for Portuguese Council Discussion Subjects

## Model Description

**BERTimbau Large Topic Classification Council PT** is a baseline implementation: a fine-tuned version of [neuralmind/bert-large-portuguese-cased](https://huggingface.co/neuralmind/bert-large-portuguese-cased) (BERTimbau Large) for multi-label topic classification of discussion subjects from Portuguese municipal council meeting minutes. It uses dynamically optimized per-label thresholds to identify multiple simultaneous topics within a single discussion subject.
## Key Features

- 🎯 **Specialized for Municipal Topics**: Fine-tuned on discussion subjects from Portuguese council meeting minutes
- 🧠 **Large Transformer Model**: 335M parameters (24 layers, 1024 hidden dim, 16 attention heads)
- 📊 **Multi-Label Classification**: Identifies multiple co-occurring topics per subject
- ⚡ **Dynamic Thresholds**: Optimized per-label classification thresholds (not a fixed 0.5)
- 🇵🇹 **Portuguese-Native**: Built on BERTimbau, pre-trained on a Brazilian Portuguese corpus
- 🔄 **End-to-End Learning**: Direct fine-tuning on task-specific data

## Model Details

- **Architecture**: BERT Large (Transformer encoder)
- **Base Model**: neuralmind/bert-large-portuguese-cased (BERTimbau Large)
- **Parameters**: ~335M
  - 24 transformer layers
  - 1024 hidden dimensions
  - 16 attention heads
  - 4096 intermediate size
- **Max Sequence Length**: 512 tokens
- **Learning Rate**: 5e-5
- **Warmup Ratio**: 0.1
- **Batch Size**: 16
- **Optimizer**: AdamW
- **Weight Decay**: 0.01
- **Classification Head**: Linear layer (1024 → 22 labels)
- **Loss Function**: BCEWithLogitsLoss (multi-label)
- **Optimization**: Dynamic per-label thresholds (F1 maximization)
- **Framework**: PyTorch + Hugging Face Transformers
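
The head and loss configuration listed above can be sketched in plain PyTorch. This is an illustrative reconstruction, not the actual training script; the variable names and the toy batch are ours:

```python
import torch
import torch.nn as nn

# Sketch of the multi-label setup: linear head (1024 -> 22),
# BCEWithLogitsLoss, and AdamW with the stated hyperparameters.
hidden_size, num_labels = 1024, 22
classifier = nn.Linear(hidden_size, num_labels)
loss_fn = nn.BCEWithLogitsLoss()
optimizer = torch.optim.AdamW(classifier.parameters(), lr=5e-5, weight_decay=0.01)

# Toy batch: pooled [CLS] embeddings and multi-hot label targets
cls_embeddings = torch.randn(16, hidden_size)
labels = torch.randint(0, 2, (16, num_labels)).float()

logits = classifier(cls_embeddings)  # shape: (16, 22), one logit per label
loss = loss_fn(logits, labels)       # averaged over every (sample, label) slot
loss.backward()
optimizer.step()
```

In the full model the `cls_embeddings` come from the 24-layer BERT encoder; only the head and loss are shown here.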

## How It Works

The model processes Portuguese municipal texts through a fine-tuned transformer architecture:

1. **Text Preprocessing**
   - Basic normalization (lowercasing, whitespace cleanup)
   - Minimum text length filtering (>10 chars)
   - Punctuation removal for noise reduction

2. **Tokenization**
   - WordPiece tokenization (BERTimbau vocabulary)
   - Max-length truncation at 512 tokens
   - Padding to max length for batching

3. **Transformer Encoding**
   - 24-layer BERT encoder processes token sequences
   - Self-attention captures contextual dependencies
   - [CLS] token representation used for classification

4. **Multi-Label Prediction**
   - Linear classification head outputs 22 logits
   - Sigmoid activation for independent per-label probabilities
   - Dynamic per-label thresholds for final predictions

5. **Threshold Optimization**
   - Each label has an optimal threshold (0.10–0.90 range)
   - Optimized via F1-score grid search on the validation set
   - Handles class imbalance better than a fixed 0.5 threshold
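
The per-label grid search in step 5 can be sketched as follows. This is a minimal reconstruction of the described procedure, not the project's code; the function name and grid step are our assumptions:

```python
import numpy as np
from sklearn.metrics import f1_score

def optimize_thresholds(probs, y_true, grid=np.arange(0.10, 0.91, 0.05)):
    """For each label, pick the threshold in `grid` that maximizes F1.
    probs: (n_samples, n_labels) sigmoid outputs from validation data;
    y_true: matching multi-hot ground-truth array."""
    n_labels = y_true.shape[1]
    thresholds = np.full(n_labels, 0.5)
    for j in range(n_labels):
        scores = [f1_score(y_true[:, j], (probs[:, j] >= t).astype(int),
                           zero_division=0) for t in grid]
        thresholds[j] = grid[int(np.argmax(scores))]
    return thresholds

# Toy example: two labels whose best cut-offs differ from 0.5
probs = np.array([[0.2, 0.8], [0.3, 0.6], [0.9, 0.1]])
y_true = np.array([[0, 1], [1, 1], [1, 0]])
thresholds = optimize_thresholds(probs, y_true)
```

Because each label gets its own cut-off, rare labels can use a low threshold while frequent ones use a stricter one, which is why this beats a uniform 0.5.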

## Usage

```python
import torch
import numpy as np
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from joblib import load

# Load model and tokenizer
model_path = "path/to/model"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path)

# Load optimal thresholds and label encoder
thresholds = np.load("optimal_thresholds.npy")
mlb = load("mlb_encoder.joblib")

# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
model.eval()

# Classify text
text = "A Câmara Municipal aprovou o orçamento de 2024 para obras públicas e educação."

# Tokenize (the model accepts up to 512 tokens)
inputs = tokenizer(
    text,
    return_tensors="pt",
    max_length=512,
    truncation=True,
    padding="max_length"
).to(device)

# Predict
with torch.no_grad():
    outputs = model(**inputs)
    probs = torch.sigmoid(outputs.logits).cpu().numpy()[0]

# Apply optimized thresholds
predictions = (probs >= thresholds).astype(int)
predicted_labels = mlb.inverse_transform(predictions.reshape(1, -1))[0]

# Display results
print(f"Text: {text}")
print("\nPredicted Topics:")
for label in predicted_labels:
    idx = list(mlb.classes_).index(label)
    prob = probs[idx]
    thresh = thresholds[idx]
    print(f"  - {label}: {prob:.4f} (threshold: {thresh:.2f})")
```

## Categories

The model classifies topics into 22 Portuguese administrative categories:

| Category | Portuguese Name |
|----------|-----------------|
| General Administration | Administração Geral, Finanças e Recursos Humanos |
| Environment | Ambiente |
| Economic Activities | Atividades Económicas |
| Social Action | Ação Social |
| Science | Ciência |
| Communication | Comunicação e Relações Públicas |
| External Cooperation | Cooperação Externa e Relações Internacionais |
| Culture | Cultura |
| Sports | Desporto |
| Education | Educação e Formação Profissional |
| Energy & Telecommunications | Energia e Telecomunicações |
| Housing | Habitação |
| Private Construction | Obras Particulares |
| Public Works | Obras Públicas |
| Territorial Planning | Ordenamento do Território |
| Other | Outros |
| Heritage | Património |
| Municipal Police | Polícia Municipal |
| Animal Protection | Proteção Animal |
| Civil Protection | Proteção Civil |
| Health | Saúde |
| Traffic & Transport | Trânsito, Transportes e Comunicações |
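
The multi-hot encoding over these category names can be illustrated with scikit-learn's `MultiLabelBinarizer`. This is a sketch with an abbreviated category list; the shipped `mlb_encoder.joblib` is assumed to be an equivalent encoder fitted on all 22 categories:

```python
from sklearn.preprocessing import MultiLabelBinarizer

# A few of the 22 category names above (abbreviated for illustration)
categories = ["Ambiente", "Cultura", "Educação e Formação Profissional",
              "Obras Públicas", "Saúde"]
mlb = MultiLabelBinarizer(classes=sorted(categories))
mlb.fit([categories])

# A subject tagged with two co-occurring topics becomes one multi-hot row
encoded = mlb.transform([["Obras Públicas", "Educação e Formação Profissional"]])
print(dict(zip(mlb.classes_, encoded[0])))
```

`inverse_transform` maps thresholded prediction vectors back to category names, as in the usage example above.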

## Evaluation Results

### Comprehensive Performance Metrics

| Metric | Score | Description |
|--------|-------|-------------|
| **F1-macro** | **0.6419** | Macro-averaged F1 score |
| **F1-micro** | **0.8224** | Micro-averaged F1 score |
| **Accuracy** | **0.5709** | Subset accuracy (exact match) |
| **Hamming Loss** | **0.0286** | Label-wise error rate |
| **Average Precision (macro)** | **0.681** | Macro-averaged AP |
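
These metrics follow standard scikit-learn definitions and can be computed from multi-hot predictions and sigmoid scores. The arrays below are toy values for illustration, not the model's test-set outputs:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, average_precision_score,
                             f1_score, hamming_loss)

# Toy multi-hot ground truth, thresholded predictions, and sigmoid scores
y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])
y_pred = np.array([[1, 0, 1], [0, 1, 1], [1, 1, 0]])
y_prob = np.array([[0.9, 0.2, 0.8], [0.1, 0.7, 0.6], [0.8, 0.9, 0.3]])

f1_macro = f1_score(y_true, y_pred, average="macro")
f1_micro = f1_score(y_true, y_pred, average="micro")
subset_acc = accuracy_score(y_true, y_pred)  # exact-match ratio: whole row must be right
h_loss = hamming_loss(y_true, y_pred)        # fraction of wrong individual label slots
ap_macro = average_precision_score(y_true, y_prob, average="macro")
```

Note that subset accuracy is strict (one wrong label fails the whole sample), which is why it is much lower than the micro F1 here.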

## License

This project is released under a custom dual license based on AGPL v3.

See the full license terms here: [LICENSE](./LICENSE)