Spaces: beppeinthesky/pnrr-data-processor

beppeinthesky committed 10 days ago

Commit

7e85729

0 Parent(s):

feat: Add cluster analysis and semantic filtering modules

- Implemented `cluster_page.py` for PNRR project clustering analysis with Streamlit UI.
- Created `column_query_agent.py` to handle query analysis and column mapping using LLM.
- Developed `create.py` for building and loading FAISS indexes from DataFrames.
- Added `semantic_filter.py` for applying semantic filtering on PNRR projects based on user queries.
- Introduced `search.py` for performing multi-column semantic searches using FAISS.
- Added relevant queries documentation for better user guidance.
- Updated requirements.txt with necessary dependencies for the new features.
- Configured Streamlit settings for increased file upload size.

Files changed (17)

.env.example +6 -0
.gitignore +18 -0
Dockerfile +28 -0
README.md +150 -0
app.py +78 -0
docker/compose.base.yaml +15 -0
modules/cluster_analysis.py +681 -0
modules/cluster_page.py +350 -0
modules/column_query_agent.py +83 -0
modules/create.py +75 -0
modules/fixtures/Scheda metadatazione_Progetti_Lozalizzazioni_PNRR_Italiadomani_V2.xlsx +0 -0
modules/home.py +113 -0
modules/search.py +46 -0
modules/semantic_filter.py +120 -0
relevant-queries.md +77 -0
requirements.txt +17 -0
streamlit_config/config.toml +5 -0

.env.example ADDED

	@@ -0,0 +1,6 @@

+OPENAI_API_KEY='<your-api-key>'
+# Login credentials (format: username:password)
+LOGIN_USER1=admin:password_admin
+LOGIN_USER2=<username>:<password>
+LOGIN_USER3=<username>:<password>

.gitignore ADDED

	@@ -0,0 +1,18 @@

+__pycache__/
+# Jupyter Notebook
+.ipynb_checkpoints
+# Environments
+.env
+data/
+faiss_index
+semantic_filter_results.xlsx
+# JetBrains IDE
+.idea/
+*.xlsx

Dockerfile ADDED

	@@ -0,0 +1,28 @@

+FROM python:3.9
+# Create a user with a specified UID
+RUN useradd -m -u 1000 user
+# Set working directory
+WORKDIR /app
+# Copy and install dependencies as root (for permissions)
+COPY requirements.txt requirements.txt
+RUN pip install --no-cache-dir --upgrade pip \
+    && pip install --no-cache-dir --upgrade -r requirements.txt
+# Create results directory with proper permissions
+RUN mkdir -p /app/results && chown -R user:user /app/results
+# Copy application files
+COPY --chown=user:user . /app
+# Switch to the user
+USER user
+ENV PATH="/home/user/.local/bin:$PATH"
+ENV PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python
+# Expose the port for huggingface spaces
+EXPOSE 7860
+CMD ["streamlit", "run", "app.py", "--server.port=7860", "--server.address=0.0.0.0"]

README.md ADDED

	@@ -0,0 +1,150 @@

+# PNRR Data Processor
+A web application for analyzing and processing data on PNRR (Piano Nazionale di Ripresa e Resilienza, Italy's National Recovery and Resilience Plan) projects. It provides AI-based tools for identifying patterns, grouping projects by theme, and filtering projects semantically.
+## 🚀 Setup and Running (Docker)
+### Prerequisites
+- Docker and Docker Compose installed
+- An Excel file with the PNRR project data
+### Environment Configuration
+1. **Create the environment configuration file:**
+   ```bash
+   cp .env.example .env
+   ```
+2. **Set the OpenAI API key in the `.env` file:**
+   ```
+   OPENAI_API_KEY='sk-xxxxxxxxxxxxxxxxxxxxxxx'
+   ```
+   To obtain an OpenAI API key:
+   - Sign up or log in to the [OpenAI Platform](https://platform.openai.com)
+   - Go to the "API Keys" section of your profile
+   - Create a new API key
+   - Copy the key and paste it into the `.env` file
+### Starting the Application
+```bash
+docker compose -f docker/compose.base.yaml up --build
+```
+The application will be available at: **http://localhost:8501**
+### Monitoring
+To view the container logs:
+```bash
+docker container logs -f semantic_filter
+```
+## 🎯 Application Features
+### 🔍 Semantic Filter
+**Purpose**: Automatically identify relevant PNRR projects from natural-language queries.
+**How it works**:
+- Uses AI models to capture the semantic meaning of the project descriptions
+- Compares the user's query with the project content to find conceptual matches
+- Assigns each project a confidence score based on its relevance
+**How to use it**:
+1. Upload the Excel file containing the PNRR projects
+2. Set the confidence threshold (0.0-1.0) used to filter the results
+3. Write a descriptive natural-language query (e.g. "progetti di digitalizzazione nelle scuole" (digitalization projects in schools), "infrastrutture sostenibili" (sustainable infrastructure), "riqualificazione urbana" (urban regeneration))
+4. Choose whether to add the results as a new column or create a new file
+5. Run the search and download the results
+**Output**: An Excel file with the filtered projects and their relevance scores
+### 🎯 Cluster Analysis
+**Purpose**: Automatically group similar projects into thematic clusters to identify recurring patterns and shared areas of investment.
+**How it works**:
+- Analyzes the text of the selected columns using machine learning techniques
+- Preprocesses the text by removing Italian stopwords, common PNRR terms, and user-defined words
+- Groups projects with similar characteristics into thematic clusters using semantic embeddings
+- Automatically generates titles, descriptions, and keywords for each cluster with AI
+- Computes statistics on how projects are distributed across clusters
+**How to use it**:
+1. Upload the Excel file containing the PNRR projects
+2. Select the text columns to use for clustering (e.g. project title, description, summary)
+3. Configure the parameters:
+   - **Automatic**: The algorithm determines the optimal number of clusters
+   - **Manual**: Specify a fixed number of clusters
+4. **Customize the blacklist** (optional):
+   - Add specific words to exclude from the analysis
+   - Enter terms that are too generic or irrelevant for your context
+   - Words can be entered separated by commas or one per line
+   - Examples: frequently occurring organization names, common technical terms, recurring locations
+5. Start the analysis and wait for it to finish
+6. Explore the results in the generated clusters
+7. **🆕 View the PCA plot**: Examine the spatial distribution of the clusters in the interactive chart
+8. Download the results:
+   - **Cluster Summary**: File with titles, descriptions, and statistics
+   - **Data with Cluster ID**: Original file with the cluster identifier added
+**Output**:
+- Cluster summary with titles, descriptions, keywords, and sample projects
+- Original dataset enriched with the ID of the cluster each project belongs to
+- Visualizations of the distribution of projects per cluster
+- **🆕 Interactive PCA plot**: Two-dimensional view of the clusters in embedding space
+### 📊 PCA Visualization of Clusters
+**Features**:
+- **Dimensionality reduction**: The high-dimensional embeddings are reduced to 2 dimensions with PCA (Principal Component Analysis)
+- **Interactive plot**: Plotly visualization with zoom, pan, and hover information
+- **Color coding**: Each cluster has a distinct color for easy identification
+- **Detailed information**: Hover shows the cluster title, description, and number of projects
+- **Explained variance**: Shows how much of the original variability is preserved in the two principal components
+**Interpretation**:
+- Nearby points represent semantically similar projects
+- Groups of points with the same color show how cohesive a cluster is
+- The distance between clusters indicates how different they are thematically
+- The percentage of explained variance indicates how reliable the 2D representation is
+### 🎯 Custom Blacklist
+**Purpose**: Improve cluster quality by excluding irrelevant words specific to your dataset.
+**Benefits**:
+- **More precise clusters**: With generic words removed, clusters are built on genuinely distinctive terms
+- **Fine-grained control**: Tailor the analysis to your specific context
+- **Flexibility**: Try different configurations to optimize the results
+**Examples of words to exclude**:
+- Overly frequent terms: "progetto", "attività", "servizio"
+- Recurring organization names: "comune", "regione", "asl"
+- Generic technical words: "sistema", "gestione", "sviluppo"
+- Locations common in the dataset: "milano", "roma", "italia"
+## 📋 File Format
+The application is **fully flexible** about the format of the Excel file:
+### ✅ Minimum Requirements
+- File in Excel format (.xlsx)
+- At least one column containing descriptive text
+### 🎯 Adaptability
+- **Column names**: Any names are accepted; the interface shows all available columns
+- **Data structure**: Any structure is supported
+- **Dynamic selection**: The user chooses which columns to use for the analysis
+### 🏆 Recommended Columns (Optional)
+For PNRR projects, columns like these improve the quality of the analysis:
+- **Project Title/Name**: Identifies the project
+- **Description/Summary**: Detailed descriptive content
+- **Sector/Scope**: Thematic area of the project
+- **Entity/Organization**: Responsible organization
+- **Location**: Geographic information
+> **💡 Tip**: The more descriptive text is available, the better the quality of the semantic analysis and clustering, regardless of the column names.
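The semantic filter steps described in the README (score each project against the query, then keep those above a confidence threshold) can be sketched as follows. This is a minimal illustration and not the code in `modules/semantic_filter.py`: it assumes the score is the cosine similarity between embedding vectors, and the toy vectors and the names `cosine_similarity` and `filter_by_threshold` are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_by_threshold(
    query_vec: List[float],
    project_vecs: Dict[str, List[float]],
    threshold: float,
) -> List[Tuple[str, float]]:
    """Score every project and keep those at or above the threshold, best first."""
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in project_vecs.items()]
    kept = [(name, score) for name, score in scored if score >= threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)


# Toy 3-dimensional "embeddings"; real ones would come from an embedding model.
projects = {
    "school digitalization": [0.9, 0.1, 0.0],
    "urban regeneration": [0.1, 0.9, 0.2],
    "cloud migration": [0.8, 0.2, 0.1],
}
query = [1.0, 0.0, 0.0]
results = filter_by_threshold(query, projects, threshold=0.5)
print(results)
```

Raising the threshold trades recall for precision: fewer projects pass, but those that do are closer to the query.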

app.py ADDED

	@@ -0,0 +1,78 @@

+import os
+import sys
+import streamlit as st
+from dotenv import load_dotenv
+load_dotenv()
+sys.path.append(os.path.dirname(os.path.abspath(__file__)))
+from modules.home import main as semantic_filter_main
+from modules.cluster_page import main as cluster_main
+def load_credentials() -> dict:
+    users = {}
+    for i in range(1, 4):
+        entry = os.getenv(f"LOGIN_USER{i}", "")
+        if ":" in entry:
+            username, password = entry.split(":", 1)
+            users[username] = password
+    return users
+def show_login() -> None:
+    st.title("🔒 Accesso")
+    with st.form("login_form"):
+        username = st.text_input("Username")
+        password = st.text_input("Password", type="password")
+        submitted = st.form_submit_button("Accedi")
+    if submitted:
+        credentials = load_credentials()
+        if username in credentials and credentials[username] == password:
+            st.session_state.authenticated = True
+            st.session_state.username = username
+            st.rerun()
+        else:
+            st.error("Credenziali non valide. Riprova.")
+def main() -> None:
+    st.sidebar.title("🏛️ PNRR Data Processor")
+    st.sidebar.markdown("---")
+    page = st.sidebar.radio(
+        "Seleziona una funzione:",
+        ["🔍 Filtro Semantico", "🎯 Analisi Cluster"],
+        format_func=lambda x: x
+    )
+    st.sidebar.markdown("---")
+    st.sidebar.markdown("""
+    ### ℹ️ Informazioni
+    **Filtro Semantico**: Filtra i progetti PNRR basandosi su query testuali usando ricerca semantica.
+    **Analisi Cluster**: Raggruppa automaticamente i progetti in cluster tematici per identificare pattern ricorrenti.
+    """)
+    if st.sidebar.button("🚪 Logout"):
+        st.session_state.authenticated = False
+        st.rerun()
+    if page == "🔍 Filtro Semantico":
+        semantic_filter_main()
+    elif page == "🎯 Analisi Cluster":
+        cluster_main()
+if __name__ == "__main__":
+    st.set_page_config(
+        page_title="PNRR Data Processor",
+        page_icon="🏛️",
+        layout="wide",
+        initial_sidebar_state="expanded"
+    )
+    if not st.session_state.get("authenticated", False):
+        show_login()
+    else:
+        main()
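`show_login` compares the submitted password with `==`. A possible hardening, sketched here under the assumption that credentials stay in `LOGIN_USER<n>` environment variables, is to compare with `hmac.compare_digest` from the standard library, which takes time independent of where the strings differ. The helper names `parse_credentials` and `check_login` are hypothetical and do not exist in the commit.

```python
import hmac
from typing import Dict, Mapping


def parse_credentials(env: Mapping[str, str], max_users: int = 3) -> Dict[str, str]:
    """Collect username/password pairs from LOGIN_USER1..LOGIN_USER<max_users>."""
    users: Dict[str, str] = {}
    for i in range(1, max_users + 1):
        entry = env.get(f"LOGIN_USER{i}", "")
        if ":" in entry:
            # Split on the first colon only, so passwords may contain colons
            username, password = entry.split(":", 1)
            users[username] = password
    return users


def check_login(users: Mapping[str, str], username: str, password: str) -> bool:
    """Return True only if the user exists and the password matches."""
    expected = users.get(username)
    if expected is None:
        return False
    # Constant-time comparison on bytes
    return hmac.compare_digest(expected.encode("utf-8"), password.encode("utf-8"))


# A plain dict stands in for os.environ in this example
fake_env = {"LOGIN_USER1": "admin:s3cret:with:colons", "LOGIN_USER2": "malformed"}
users = parse_credentials(fake_env)
print(check_login(users, "admin", "s3cret:with:colons"))
```

Passing the environment in as a mapping also makes the parsing testable without touching real environment variables.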

docker/compose.base.yaml ADDED

	@@ -0,0 +1,15 @@

+services:
+  semantic-filter:
+    build:
+      context: ..
+      dockerfile: Dockerfile
+    container_name: semantic_filter
+    ports:
+      - "8501:7860"
+    restart: always
+    volumes:
+      - ..:/app
+      - /app/.venv
+      - ../streamlit_config/:/home/user/.streamlit
+    env_file:
+      - ../.env

modules/cluster_analysis.py ADDED

	@@ -0,0 +1,681 @@

+import logging
+import pandas as pd
+import numpy as np
+import os
+import json
+import re
+from sklearn.cluster import KMeans
+from sklearn.feature_extraction.text import TfidfVectorizer
+from sklearn.decomposition import PCA
+from sentence_transformers import SentenceTransformer
+from typing import List, Dict, Tuple, Optional
+from langchain_openai import ChatOpenAI
+from langchain.schema import HumanMessage
+import plotly.express as px
+import plotly.graph_objects as go
+RESULTS_DIR = '/app/results'
+SAVE_PATH_CLUSTERS = os.path.join(RESULTS_DIR, 'cluster_results.xlsx')
+SAVE_PATH_ORIGINAL = os.path.join(RESULTS_DIR, 'data_with_clusters.xlsx')
+EMBEDDING_MODEL_NAME = 'sentence-transformers/all-MiniLM-L6-v2'
+LLM_MODEL_NAME = 'gpt-4o-mini'
+PNRR_STOPWORDS = {
+    'pnrr', 'piano', 'nazionale', 'ripresa', 'resilienza', 'progetto', 'progetti',
+    'intervento', 'interventi', 'attività', 'realizzazione', 'sviluppo',
+    'implementazione', 'potenziamento', 'miglioramento', 'sostegno',
+    'euro', 'milioni', 'miliardi', 'finanziamento', 'investimento',
+    'pubblico', 'pubblica', 'amministrazione', 'ente', 'comune', 'regione',
+    'italia', 'italiano', 'italiana'
+}
+ITALIAN_STOPWORDS = {
+    # Articles
+    'il', 'lo', 'la', 'i', 'gli', 'le', 'un', 'uno', 'una',
+    # Simple prepositions
+    'di', 'a', 'da', 'in', 'con', 'su', 'per', 'tra', 'fra',
+    # Most common articulated prepositions
+    'del', 'dello', 'della', 'dei', 'degli', 'delle',
+    'al', 'allo', 'alla', 'ai', 'agli', 'alle',
+    'dal', 'dallo', 'dalla', 'dai', 'dagli', 'dalle',
+    'nel', 'nello', 'nella', 'nei', 'negli', 'nelle',
+    'sul', 'sullo', 'sulla', 'sui', 'sugli', 'sulle',
+    # Conjunctions
+    'e', 'ed', 'o', 'od', 'ma', 'però', 'anche', 'ancora', 'quindi', 'dunque', 'mentre', 'quando', 'se',
+    # Pronouns
+    'che', 'chi', 'cui', 'quale', 'quali', 'questo', 'questa', 'questi', 'queste',
+    'quello', 'quella', 'quelli', 'quelle', 'stesso', 'stessa', 'stessi', 'stesse',
+    # Common adverbs
+    'dove', 'come', 'perché', 'già', 'più', 'molto', 'poco', 'tanto', 'quanto', 'sempre', 'mai',
+    'oggi', 'ieri', 'domani', 'prima', 'dopo', 'sopra', 'sotto', 'dentro', 'fuori',
+    # Indefinite adjectives/pronouns
+    'tutto', 'tutti', 'tutte', 'ogni', 'alcuni', 'alcune', 'altro', 'altri', 'altre',
+    'nessuno', 'nessuna', 'niente', 'nulla', 'qualche', 'qualcosa', 'qualcuno',
+    # Common auxiliary and modal verbs
+    'essere', 'avere', 'fare', 'dire', 'andare', 'venire', 'volere', 'potere', 'dovere', 'sapere',
+    'stare', 'dare', 'vedere', 'uscire', 'partire',
+    # Common context words
+    'contesto', 'attraverso', 'mediante', 'presso', 'verso', 'circa', 'oltre', 'secondo', 'durante'
+}
+def preprocess_text(text: str, remove_domain_stopwords: bool = True, custom_blacklist: Optional[List[str]] = None) -> str:
+    """
+    Preprocess text by removing stopwords and applying cleaning.
+    Args:
+        text: Input text
+        remove_domain_stopwords: Whether to remove PNRR-specific stopwords
+        custom_blacklist: Additional words to exclude (will be added to default stopwords)
+    Returns:
+        str: Cleaned text
+    """
+    if not isinstance(text, str):
+        return ""
+    # Convert to lowercase
+    text = text.lower()
+    # Remove special characters but keep spaces and accented characters
+    text = re.sub(r'[^\w\sàèéìíîòóùú]', ' ', text)
+    # Remove numbers that are standalone
+    text = re.sub(r'\b\d+\b', ' ', text)
+    # Remove extra whitespace
+    text = ' '.join(text.split())
+    if remove_domain_stopwords:
+        # Split into words
+        words = text.split()
+        # Remove stopwords
+        stopwords_to_remove = ITALIAN_STOPWORDS.union(PNRR_STOPWORDS)
+        # Add custom blacklist if provided
+        if custom_blacklist:
+            custom_stopwords = {word.lower().strip()
+                                for word in custom_blacklist if word.strip()}
+            stopwords_to_remove = stopwords_to_remove.union(custom_stopwords)
+        # Filter words: remove stopwords, very short words, and words that are only numbers/special chars
+        filtered_words = []
+        for word in words:
+            if (word not in stopwords_to_remove and
+                len(word) > 2 and
+                not word.isdigit() and
+                    re.search(r'[a-zA-Zàèéìíîòóùú]', word)):  # Must contain at least one letter
+                filtered_words.append(word)
+        # Rejoin
+        text = ' '.join(filtered_words)
+    return text
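To make the effect of `custom_blacklist` concrete, here is a reduced, self-contained sketch of the same cleaning idea. It is not the function above: `DEMO_STOPWORDS` is a tiny stand-in for the real stopword sets and `clean` is a hypothetical name. In Python 3, `\w` already matches accented letters in `str` patterns, so words such as "attività" survive the punctuation pass.

```python
import re
from typing import Iterable, Optional, Set

# Tiny stand-in for ITALIAN_STOPWORDS and PNRR_STOPWORDS
DEMO_STOPWORDS: Set[str] = {"di", "per", "la", "il", "progetto"}


def clean(text: str, blacklist: Optional[Iterable[str]] = None) -> str:
    """Lowercase, strip punctuation and standalone numbers, drop stopwords."""
    stopwords = set(DEMO_STOPWORDS)
    if blacklist:
        # User-supplied words are normalised the same way as the text
        stopwords |= {w.lower().strip() for w in blacklist if w.strip()}
    text = re.sub(r"[^\w\s]", " ", text.lower())  # punctuation -> space
    text = re.sub(r"\b\d+\b", " ", text)          # drop standalone numbers
    return " ".join(w for w in text.split() if w not in stopwords and len(w) > 2)


sentence = "Progetto di digitalizzazione per la scuola, 2024!"
print(clean(sentence))                        # digitalizzazione scuola
print(clean(sentence, blacklist=["Scuola "]))  # digitalizzazione
```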
+def combine_text_columns(df: pd.DataFrame, columns: List[str], preprocess: bool = True, custom_blacklist: Optional[List[str]] = None) -> pd.Series:
+    """Combine multiple text columns into a single text representation.
+    Args:
+        df: DataFrame containing the data
+        columns: List of column names to combine
+        preprocess: Whether to apply text preprocessing (cleaning and stopword removal)
+        custom_blacklist: Additional words to exclude from preprocessing
+    Returns:
+        pd.Series: Series containing the combined texts for each row
+    """
+    combined_texts = []
+    for idx, row in df.iterrows():
+        text_parts = []
+        for col in columns:
+            if col in df.columns and pd.notna(row[col]):
+                text_part = str(row[col])
+                if preprocess:
+                    text_part = preprocess_text(
+                        text_part, custom_blacklist=custom_blacklist)
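+                # Skip parts left empty by preprocessing, so a row with no usable
+                # text does not become a bare " | " separator
+                if text_part.strip():
+                    text_parts.append(text_part)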
+        combined_text = " | ".join(text_parts)
+        # Additional cleaning for the combined text
+        if preprocess:
+            combined_text = ' '.join(
+                combined_text.split())  # Remove extra spaces
+        combined_texts.append(combined_text)
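+    # Reuse the DataFrame index so boolean masks built from this Series align with df
+    return pd.Series(combined_texts, index=df.index)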
+def create_embeddings(texts: List[str], model_name: str = EMBEDDING_MODEL_NAME) -> np.ndarray:
+    """Create vector embeddings for texts using sentence transformers.
+    Args:
+        texts: List of texts to process
+        model_name: Name of the model to use for embeddings
+    Returns:
+        np.ndarray: Numpy array containing the vector embeddings
+    """
+    logging.info(f"Creating embeddings with model: {model_name}")
+    model = SentenceTransformer(model_name)
+    embeddings = model.encode(texts, show_progress_bar=True)
+    return embeddings
+def perform_clustering(embeddings: np.ndarray, n_clusters: Optional[int] = None, max_clusters: int = 20, min_clusters: int = 2) -> Tuple[np.ndarray, int]:
+    """Perform K-means clustering on vector embeddings.
+    Args:
+        embeddings: Numpy array of embeddings
+        n_clusters: Fixed number of clusters (if None, determined automatically)
+        max_clusters: Maximum number of clusters for automatic selection
+        min_clusters: Minimum number of clusters for automatic selection
+    Returns:
+        Tuple[np.ndarray, int]: Tuple containing cluster labels and final number of clusters
+    """
+    if n_clusters is None:
+        # Use elbow method to find optimal number of clusters
+        n_clusters = find_optimal_clusters(embeddings, max_clusters, min_clusters)
+    logging.info(f"Performing clustering with {n_clusters} clusters")
+    kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init=10)
+    cluster_labels = kmeans.fit_predict(embeddings)
+    return cluster_labels, n_clusters
+def find_optimal_clusters(embeddings: np.ndarray, max_clusters: int = 20, min_clusters: int = 2) -> int:
+    """Find optimal number of clusters using the elbow method.
+    Args:
+        embeddings: Numpy array of embeddings
+        max_clusters: Maximum number of clusters to test
+        min_clusters: Minimum number of clusters to test
+    Returns:
+        int: Optimal number of clusters determined
+    """
+    if len(embeddings) < max_clusters:
+        max_clusters = len(embeddings) - 1
+    # Ensure min_clusters is at least 2 and not greater than max_clusters
+    min_clusters = max(2, min_clusters)
+    if min_clusters > max_clusters:
+        min_clusters = max_clusters
+    if max_clusters < 2:
+        return 2
+    inertias = []
+    K_range = range(min_clusters, min(max_clusters + 1, len(embeddings)))
+    for k in K_range:
+        kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
+        kmeans.fit(embeddings)
+        inertias.append(kmeans.inertia_)
+    # Simple elbow detection
+    if len(inertias) < 2:
+        return min_clusters
+    # Calculate the rate of change
+    deltas = np.diff(inertias)
+    delta_deltas = np.diff(deltas)
+    # Find the sharpest bend in the inertia curve. delta_deltas[j] is the second
+    # difference centred on inertias[j + 1], which corresponds to k = min_clusters + j + 1
+    if len(delta_deltas) > 0:
+        elbow_idx = int(np.argmax(delta_deltas)) + min_clusters + 1
+        return max(min_clusters, min(elbow_idx, max_clusters))
+    return min_clusters
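The second-difference elbow heuristic is easier to follow on concrete numbers. The sketch below is a pure-Python illustration with made-up inertia values. `elbow_k` is a hypothetical helper, and it assumes `inertias[i]` was measured at `k = min_k + i`.

```python
from typing import List


def elbow_k(inertias: List[float], min_k: int) -> int:
    """Return the k at the sharpest bend of an inertia curve.

    inertias[i] is assumed to be the K-means inertia for k = min_k + i.
    """
    if len(inertias) < 3:
        return min_k  # a second difference needs at least three points
    deltas = [b - a for a, b in zip(inertias, inertias[1:])]
    second = [b - a for a, b in zip(deltas, deltas[1:])]
    # second[j] is centred on inertias[j + 1], i.e. on k = min_k + j + 1
    j = max(range(len(second)), key=second.__getitem__)
    return min_k + j + 1


# Hypothetical inertias for k = 2..6: steep drops until k = 4, then a plateau
inertias = [1000.0, 600.0, 250.0, 230.0, 215.0]
print(elbow_k(inertias, min_k=2))  # 4
```

Here the first differences are -400, -350, -20, -15 and the second differences are 50, 330, 5. The largest second difference sits at index 1, which is the point where the curve flattens, k = 4.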
+def generate_cluster_description(cluster_texts: List[str], cluster_id: int) -> Tuple[str, str]:
+    """Generate title and description for a cluster using LLM.
+    Args:
+        cluster_texts: List of texts belonging to the cluster
+        cluster_id: Numeric ID of the cluster
+    Returns:
+        Tuple[str, str]: Tuple containing title and description of the cluster
+    """
+    try:
+        # Sample up to 10 texts for analysis to avoid token limits
+        sample_texts = cluster_texts[:10]
+        # Create a concise sample for the LLM
+        text_sample = "\n".join([f"- {text[:200]}" for text in sample_texts])
+        llm = ChatOpenAI(model=LLM_MODEL_NAME, temperature=0.3)
+        prompt = f"""
+            Analizza i seguenti progetti PNRR e identifica il tema comune che li accomuna.
+            Devi fornire un titolo breve (max 50 caratteri) e una descrizione concisa (max 150 caratteri) che catturi l'essenza di questi progetti.
+            Progetti del cluster {cluster_id + 1}:
+            {text_sample}
+            Rispondi in formato JSON con le chiavi "titolo" e "descrizione".
+            Il titolo deve essere specifico e descrittivo del tema comune.
+            La descrizione deve spiegare brevemente cosa accomuna questi progetti.
+            Esempio di risposta:
+            {{
+                "titolo": "Digitalizzazione Sanità",
+                "descrizione": "Progetti di migrazione cloud e infrastrutture digitali per aziende sanitarie"
+            }}
+        """
+        response = llm.invoke([HumanMessage(content=prompt)])
+        response_content = response.content.strip()
+        logging.info(f"LLM Response for cluster {cluster_id}: {response_content}")
+        try:
+            result = json.loads(response_content)
+            title = result.get("titolo", f"Cluster {cluster_id + 1}")[:50]
+            description = result.get("descrizione", "Cluster di progetti correlati")[:150]
+        except json.JSONDecodeError:
+            try:
+                # Try to extract JSON from the response using regex
+                json_match = re.search(r'\{[^}]*"titolo"[^}]*"descrizione"[^}]*\}', response_content, re.DOTALL)
+                if json_match:
+                    json_str = json_match.group(0)
+                    result = json.loads(json_str)
+                    title = result.get("titolo", f"Cluster {cluster_id + 1}")[:50]
+                    description = result.get("descrizione", "Cluster di progetti correlati")[:150]
+                else:
+                    # If no valid JSON found, try to extract title and description manually
+                    title_match = re.search(r'"titolo":\s*"([^"]+)"', response_content)
+                    desc_match = re.search(r'"descrizione":\s*"([^"]+)"', response_content)
+                    title = title_match.group(1)[:50] if title_match else f"Cluster {cluster_id + 1}"
+                    description = desc_match.group(1)[:150] if desc_match else "Cluster di progetti correlati"
+            except (json.JSONDecodeError, AttributeError) as e:
+                # Final fallback
+                logging.warning(f"Failed to parse JSON for cluster {cluster_id}: {e}")
+                title = f"Cluster {cluster_id + 1}"
+                description = "Cluster di progetti correlati"
+    except Exception as e:
+        logging.warning(f"Error generating description for cluster {cluster_id}: {e}")
+        title = f"Cluster {cluster_id + 1}"
+        description = f"Cluster contenente {len(cluster_texts)} progetti correlati"
+    return title, description
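The nested fallbacks above all recover a JSON object from a reply that may be wrapped in prose or a Markdown fence. A more compact way to express that idea is sketched below using only the standard library. `extract_json_object` is a hypothetical helper, and the greedy `{...}` match assumes the reply contains a single JSON object.

```python
import json
import re
from typing import Optional


def extract_json_object(response_text: str) -> Optional[dict]:
    """Return the JSON object found in an LLM reply, or None if there is none."""
    try:
        parsed = json.loads(response_text)
        return parsed if isinstance(parsed, dict) else None
    except json.JSONDecodeError:
        pass
    # Replies are often surrounded by prose or a code fence, so fall back to
    # the span from the first "{" to the last "}"
    match = re.search(r"\{.*\}", response_text, re.DOTALL)
    if match is None:
        return None
    try:
        parsed = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None


reply = (
    "Here is the result:\n"
    '{"titolo": "Digitalizzazione Sanità", "descrizione": "Cloud per aziende sanitarie"}\n'
    "Hope this helps."
)
result = extract_json_object(reply) or {}
title = result.get("titolo", "Cluster 1")[:50]
print(title)
```

Returning `None` on every failure path lets the caller apply its default title and description in one place.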
+def extract_keywords(cluster_texts: List[str], top_k: int = 5, custom_blacklist: Optional[List[str]] = None) -> List[str]:
+    """Extract top keywords from cluster texts using TF-IDF with advanced filtering.
+    Args:
+        cluster_texts: List of cluster texts
+        top_k: Maximum number of keywords to extract
+        custom_blacklist: List of words to exclude from extraction
+    Returns:
+        List[str]: List of the most relevant keywords
+    """
+    if not cluster_texts:
+        return []
+    try:
+        # Create custom stopwords list combining Italian, PNRR, and custom blacklist
+        custom_stopwords = ITALIAN_STOPWORDS.union(PNRR_STOPWORDS)
+        # Add custom blacklist
+        if custom_blacklist:
+            custom_stopwords_set = {word.lower().strip() for word in custom_blacklist if word.strip()}
+            custom_stopwords = custom_stopwords.union(custom_stopwords_set)
+        # Convert to list for TfidfVectorizer
+        stopwords_list = list(custom_stopwords)
+        # First pass: get more candidates
+        vectorizer = TfidfVectorizer(
+            max_features=200,  # Increased to get more candidates
+            stop_words=stopwords_list,
+            ngram_range=(1, 3),  # Include trigrams for better context
+            min_df=2,  # Appear in at least 2 documents
+            token_pattern=r'\b[a-zA-ZÀ-ÿ]{3,}\b'  # Only words with 3+ characters, including accented
+        )
+        tfidf_matrix = vectorizer.fit_transform(cluster_texts)
+        feature_names = vectorizer.get_feature_names_out()
+        # Get mean TF-IDF scores
+        mean_scores = np.mean(tfidf_matrix.toarray(), axis=0)
+        # Create candidates with scores
+        candidates = [(feature_names[i], mean_scores[i]) for i in range(len(feature_names))]
+        candidates.sort(key=lambda x: x[1], reverse=True)
+        # Advanced filtering to remove redundant and similar terms
+        filtered_keywords = []
+        used_words = set()
+        for keyword, score in candidates:
+            # Skip if we have enough keywords
+            if len(filtered_keywords) >= top_k:
+                break
+            # Clean the keyword
+            keyword_clean = keyword.lower().strip()
+            # Skip very short words or numbers
+            if len(keyword_clean) < 3 or keyword_clean.isdigit():
+                continue
+            # Skip if it's essentially a stopword we missed
+            if keyword_clean in custom_stopwords:
+                continue
+            # Check for redundancy with already selected keywords
+            is_redundant = False
+            # Split ngrams to check individual words
+            keyword_words = set(keyword_clean.split())
+            # Check if this ngram contains words already used as single keywords
+            if len(keyword_words) > 1:
+                # If it's a multi-word term, check if we already have the main components
+                overlap_with_used = keyword_words.intersection(used_words)
+                if len(overlap_with_used) > 0:
+                    is_redundant = True
+            # Check similarity with existing keywords (basic containment check)
+            for existing_keyword in filtered_keywords:
+                existing_words = set(existing_keyword.lower().split())
+                # If current keyword is contained in existing or vice versa
+                if (keyword_words.issubset(existing_words) or
+                        existing_words.issubset(keyword_words)):
+                    is_redundant = True
+                    break
+                # Check if they share too many words (for multi-word terms)
+                if (len(keyword_words) > 1 and len(existing_words) > 1):
+                    shared_words = keyword_words.intersection(existing_words)
+                    if len(shared_words) >= min(len(keyword_words), len(existing_words)) * 0.7:
+                        is_redundant = True
+                        break
+            if not is_redundant:
+                filtered_keywords.append(keyword)
+                # Add individual words to used_words set
+                used_words.update(keyword_words)
+        return filtered_keywords[:top_k]
+    except Exception as e:
+        logging.warning(f"Error extracting keywords: {e}")
+        return []
+def analyze_clusters(
+    data_frame_path,
+    selected_columns: List[str],
+    n_clusters: Optional[int] = None,
+    max_clusters: int = 20,
+    min_clusters: int = 2,
+    preprocess_text_data: bool = True,
+    custom_blacklist: Optional[List[str]] = None
+) -> Tuple[pd.DataFrame, pd.DataFrame, np.ndarray, np.ndarray]:
+    """
+    Main function to perform cluster analysis on PNRR projects.
+    Args:
+        data_frame_path: Path to the Excel file
+        selected_columns: List of column names to use for clustering
+        n_clusters: Number of clusters (if None, will be determined automatically)
+        max_clusters: Maximum number of clusters for automatic selection
+        min_clusters: Minimum number of clusters for automatic selection
+        preprocess_text_data: Whether to preprocess text (remove stopwords, clean)
+        custom_blacklist: Additional words to exclude from analysis
+    Returns:
+        Tuple[pd.DataFrame, pd.DataFrame, np.ndarray, np.ndarray]: Tuple of (cluster_results_df, original_data_with_clusters_df, embeddings, cluster_labels)
+    """
+    logging.info(f"Loading DataFrame from {data_frame_path}...")
+    df = pd.read_excel(data_frame_path)
+    logging.info(f"Loaded DataFrame with {len(df)} rows")
+    available_columns = [col for col in selected_columns if col in df.columns]
+    if not available_columns:
+        raise ValueError("None of the selected columns are available in the DataFrame")
+    logging.info(f"Using columns for clustering: {available_columns}")
+    if preprocess_text_data:
+        logging.info(
+            "Preprocessing text data (removing stopwords and cleaning)")
+        if custom_blacklist:
+            logging.info(
+                f"Using custom blacklist with {len(custom_blacklist)} additional words")
+    combined_texts = combine_text_columns(
+        df, available_columns, preprocess=preprocess_text_data, custom_blacklist=custom_blacklist)
+    non_empty_mask = combined_texts.str.strip() != ""
+    if non_empty_mask.sum() == 0:
+        raise ValueError("No non-empty text found in selected columns")
+    df_filtered = df[non_empty_mask].copy()
+    texts_filtered = combined_texts[non_empty_mask].tolist()
+    embeddings = create_embeddings(texts_filtered)
+    cluster_labels, final_n_clusters = perform_clustering(embeddings, n_clusters, max_clusters, min_clusters)
+    df_filtered['cluster_id'] = cluster_labels
+    # Generate cluster summaries
+    cluster_results = []
+    for cluster_id in range(final_n_clusters):
+        cluster_mask = cluster_labels == cluster_id
+        cluster_texts = [text for text, in_cluster in zip(texts_filtered, cluster_mask) if in_cluster]
+        if not cluster_texts:
+            continue
+        title, description = generate_cluster_description(cluster_texts, cluster_id)
+        keywords = extract_keywords(cluster_texts, custom_blacklist=custom_blacklist)
+        cluster_results.append({
+            'cluster_id': cluster_id,
+            'titolo': title,
+            'descrizione': description,
+            'num_progetti': len(cluster_texts),
+            'keywords': ', '.join(keywords),
+            'progetti_campione': ' | '.join(cluster_texts[:3])
+        })
+    cluster_df = pd.DataFrame(cluster_results)
+    # Prepare final dataframe with cluster assignments
+    # Start with original dataframe and add cluster_id column
+    df_with_clusters = df.copy()
+    df_with_clusters['cluster_id'] = -1  # Default value for unassigned
+    df_with_clusters.loc[non_empty_mask, 'cluster_id'] = cluster_labels
+    logging.info(f"Created {final_n_clusters} clusters")
+    logging.info(f"Assigned {len(cluster_labels)} projects to clusters")
+    return cluster_df, df_with_clusters, embeddings, cluster_labels
+def save_results(cluster_df: pd.DataFrame, data_with_clusters_df: pd.DataFrame) -> None:
+    """Save clustering results to Excel files.
+    Args:
+        cluster_df: DataFrame with cluster results
+        data_with_clusters_df: Original DataFrame with assigned cluster IDs
+    Returns:
+        None
+    """
+    # Ensure the results directory exists
+    os.makedirs(RESULTS_DIR, exist_ok=True)
+    logging.info(f"Saving cluster results to {SAVE_PATH_CLUSTERS}")
+    cluster_df.to_excel(SAVE_PATH_CLUSTERS, index=False)
+    logging.info(f"Saving data with clusters to {SAVE_PATH_ORIGINAL}")
+    data_with_clusters_df.to_excel(SAVE_PATH_ORIGINAL, index=False)
+    logging.info("Results saved successfully")
+def get_cluster_statistics(cluster_df: pd.DataFrame, data_with_clusters_df: pd.DataFrame) -> Dict[str, float]:
+    """Generate statistics about the clustering results.
+    Args:
+        cluster_df: DataFrame with cluster results
+        data_with_clusters_df: Original DataFrame with assigned cluster IDs
+    Returns:
+        Dict[str, float]: Dictionary containing clustering statistics
+    """
+    total_projects = len(data_with_clusters_df)
+    assigned_projects = int((data_with_clusters_df['cluster_id'] >= 0).sum())
+    unassigned_projects = total_projects - assigned_projects
+    stats = {
+        'total_projects': total_projects,
+        'assigned_projects': assigned_projects,
+        'unassigned_projects': unassigned_projects,
+        'num_clusters': len(cluster_df),
+        'avg_projects_per_cluster': assigned_projects / len(cluster_df) if len(cluster_df) > 0 else 0,
+        'largest_cluster_size': cluster_df['num_progetti'].max() if len(cluster_df) > 0 else 0,
+        'smallest_cluster_size': cluster_df['num_progetti'].min() if len(cluster_df) > 0 else 0
+    }
+    return stats
+def create_cluster_pca_plot(embeddings: np.ndarray, cluster_labels: np.ndarray, cluster_df: pd.DataFrame) -> go.Figure:
+    """
+    Create a 2D PCA plot of clusters using plotly express.
+    Args:
+        embeddings: Numpy array of embeddings
+        cluster_labels: Cluster labels for each point
+        cluster_df: DataFrame with cluster information (for titles and descriptions)
+    Returns:
+        plotly.graph_objects.Figure: Interactive plot figure
+    """
+    try:
+        # Perform PCA to reduce to 2 dimensions
+        logging.info("Performing PCA reduction to 2D for visualization...")
+        pca = PCA(n_components=2, random_state=42)
+        embeddings_2d = pca.fit_transform(embeddings)
+        # Create a DataFrame for plotting
+        plot_df = pd.DataFrame({
+            'PC1': embeddings_2d[:, 0],
+            'PC2': embeddings_2d[:, 1],
+            'cluster_id': cluster_labels
+        })
+        # Map each cluster ID to its hover title, description and size
+        cluster_titles = {
+            row['cluster_id']: f"Cluster {row['cluster_id'] + 1}: {row['titolo']}"
+            for _, row in cluster_df.iterrows()
+        }
+        cluster_descriptions = dict(zip(cluster_df['cluster_id'], cluster_df['descrizione']))
+        cluster_sizes = dict(zip(cluster_df['cluster_id'], cluster_df['num_progetti']))
+        plot_df['cluster_title'] = plot_df['cluster_id'].map(cluster_titles)
+        plot_df['cluster_description'] = plot_df['cluster_id'].map(
+            cluster_descriptions).fillna("Cluster sconosciuto")
+        plot_df['num_progetti'] = plot_df['cluster_id'].map(cluster_sizes).fillna(0).astype(int)
+        # Plotly Express treats a numeric color column as continuous and then ignores
+        # color_discrete_sequence, so color by a string copy of the cluster ID
+        plot_df['cluster_label'] = plot_df['cluster_id'].astype(str)
+        # Create the scatter plot
+        fig = px.scatter(
+            plot_df,
+            x='PC1',
+            y='PC2',
+            color='cluster_label',
+            hover_data={
+                'cluster_title': True,
+                'cluster_description': True,
+                'num_progetti': True,
+                'PC1': ':.3f',
+                'PC2': ':.3f',
+                'cluster_id': False,
+                'cluster_label': False
+            },
+            title='Visualizzazione 2D dei Cluster (PCA)',
+            labels={
+                'PC1': f'Prima Componente Principale ({pca.explained_variance_ratio_[0]:.1%} varianza)',
+                'PC2': f'Seconda Componente Principale ({pca.explained_variance_ratio_[1]:.1%} varianza)',
+                'cluster_label': 'Cluster ID'
+            },
+            color_discrete_sequence=px.colors.qualitative.Set3
+        )
+        # Update layout for better presentation
+        fig.update_layout(
+            width=800,
+            height=600,
+            showlegend=True,
+            legend=dict(
+                orientation="v",
+                yanchor="top",
+                y=1,
+                xanchor="left",
+                x=1.02
+            ),
+            margin=dict(r=150),
+            font=dict(size=12),
+            plot_bgcolor='rgba(0,0,0,0)'
+        )
+        # Update traces for better markers
+        fig.update_traces(
+            marker=dict(
+                size=8,
+                opacity=0.7,
+                line=dict(width=1, color='DarkSlateGrey')
+            )
+        )
+        # Add explanation text
+        explained_variance_total = pca.explained_variance_ratio_[0] + pca.explained_variance_ratio_[1]
+        fig.add_annotation(
+            text=f"Varianza totale spiegata: {explained_variance_total:.1%}<br>Ogni punto rappresenta un progetto PNRR",
+            xref="paper", yref="paper",
+            x=0.02, y=0.98,
+            xanchor="left", yanchor="top",
+            showarrow=False,
+            font=dict(size=10, color="gray"),
+            bgcolor="rgba(255,255,255,0.8)",
+            bordercolor="gray",
+            borderwidth=1
+        )
+        logging.info(
+            f"Created PCA plot with {len(plot_df)} points and {len(cluster_df)} clusters")
+        logging.info(
+            f"Total explained variance: {explained_variance_total:.3f}")
+        return fig
+    except Exception as e:
+        logging.error(f"Error creating PCA plot: {e}")
+        # Return empty figure in case of error
+        fig = go.Figure()
+        fig.add_annotation(
+            text=f"Errore nella creazione del plot PCA: {str(e)}",
+            x=0.5, y=0.5,
+            xref="paper", yref="paper",
+            showarrow=False
+        )
+        return fig
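
The way `analyze_clusters` keeps rows without usable text in the output, marked with `cluster_id = -1`, and the way `get_cluster_statistics` then counts them, can be sketched on toy data. This is a minimal illustration, not the module's code: the column values and the hard-coded `cluster_labels` array are assumptions standing in for the real embeddings and clustering step.

```python
import numpy as np
import pandas as pd

# Toy data: the second row has no usable text and must stay unassigned.
df = pd.DataFrame(
    {"Titolo Progetto": ["Scuola digitale", "   ", "Pista ciclabile", "Banda larga"]}
)
combined_texts = df["Titolo Progetto"].fillna("").astype(str)
non_empty_mask = combined_texts.str.strip() != ""

# Stand-in for the labels a clustering step would return
# for the three non-empty rows (assumed values).
cluster_labels = np.array([0, 1, 0])

# Start from the full frame, default every row to -1 (unassigned),
# then write labels only where text was available.
df_with_clusters = df.copy()
df_with_clusters["cluster_id"] = -1
df_with_clusters.loc[non_empty_mask, "cluster_id"] = cluster_labels

# Statistics in the spirit of get_cluster_statistics.
total_projects = len(df_with_clusters)
assigned_projects = int((df_with_clusters["cluster_id"] >= 0).sum())
unassigned_projects = total_projects - assigned_projects
num_clusters = df_with_clusters.loc[non_empty_mask, "cluster_id"].nunique()
avg_projects_per_cluster = assigned_projects / num_clusters if num_clusters else 0
```

Because the labels are written through the boolean mask, the label array must have exactly as many entries as there are non-empty rows; the original row order and index are preserved.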

modules/cluster_page.py ADDED

	@@ -0,0 +1,350 @@

+import os
+import sys
+import logging
+import streamlit as st
+import pandas as pd
+from typing import Dict, Union, Any
+sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), '..')))
+from modules import cluster_analysis
+METADATA_PATH = 'modules/fixtures/Scheda metadatazione_Progetti_Lozalizzazioni_PNRR_Italiadomani_V2.xlsx'
+def set_page_config() -> None:
+    """Configure Streamlit page settings for cluster analysis.
+    Returns:
+        None
+    """
+    st.set_page_config(
+        page_title="PNRR Cluster Analysis",
+        page_icon=":chart_with_upwards_trend:",
+        layout="wide"
+    )
+def load_metadata_columns() -> Dict[str, str]:
+    """Load available columns from metadata file.
+    Returns:
+        Dict[str, str]: Dictionary mapping column names to their descriptions
+    """
+    try:
+        metadata_paths = [
+            '/home/giuseppe/IUAV - PNRR/semantic-filter/data/metadata.csv',
+            'data/metadata.csv',
+            '../data/metadata.csv'
+        ]
+        metadata_df = None
+        for path in metadata_paths:
+            if os.path.exists(path):
+                metadata_df = pd.read_csv(path)
+                break
+        if metadata_df is None:
+            return {}
+        high_importance = metadata_df[
+            (metadata_df['Ranking importanza variabili (da 1, bassa importanza, a 5, massima importanza)'].isin([4, 5])) &
+            (metadata_df['Variabile dei file originali (Italiadomani/Regione Veneto)'].notna())
+        ]
+        columns_info = {}
+        for _, row in high_importance.iterrows():
+            var_name = row['Variabile dei file originali (Italiadomani/Regione Veneto)']
+            description = row['Descrizione']
+            if pd.notna(var_name) and pd.notna(description):
+                columns_info[var_name] = description
+        return columns_info
+    except Exception as e:
+        st.error(f"Errore nel caricamento dei metadati: {e}")
+        return {}
+def display_cluster_statistics(stats: Dict[str, Union[int, float]]) -> None:
+    """Display clustering statistics in an organized format.
+    Args:
+        stats: Dictionary containing clustering statistics
+    Returns:
+        None
+    """
+    col1, col2, col3, col4 = st.columns(4)
+    with col1:
+        st.metric("Progetti Totali", stats['total_projects'])
+    with col2:
+        st.metric("Progetti Assegnati", stats['assigned_projects'])
+    with col3:
+        st.metric("Numero Cluster", stats['num_clusters'])
+    with col4:
+        st.metric("Progetti per Cluster (media)", f"{stats['avg_projects_per_cluster']:.1f}")
+def main() -> None:
+    """Main function for cluster analysis user interface.
+    Handles file upload, parameter configuration, and analysis execution.
+    Returns:
+        None
+    """
+    st.title("🔍 Analisi Cluster Progetti PNRR")
+    st.markdown("""
+    Questa sezione permette di identificare automaticamente gruppi tematici di progetti PNRR
+    basati sul contenuto delle colonne selezionate. L'algoritmo utilizza tecniche di machine learning
+    per raggruppare progetti simili e genera automaticamente titoli e descrizioni per ogni cluster.
+    """)
+    st.header("📁 Carica il File Excel")
+    uploaded_file = st.file_uploader(
+        "Seleziona il file Excel contenente i progetti PNRR",
+        type=["xlsx"],
+        help="Carica un file Excel con i dati dei progetti PNRR"
+    )
+    if uploaded_file is not None:
+        try:
+            df = pd.read_excel(uploaded_file)
+            st.success(f"✅ File caricato con successo! Trovate {len(df)} righe e {len(df.columns)} colonne.")
+            columns_info = load_metadata_columns()
+            st.header("🎯 Selezione Colonne per Clustering")
+            st.markdown("""
+            Seleziona le colonne da utilizzare per il clustering. Le colonne testuali con informazioni
+            descrittive dei progetti sono generalmente le più efficaci per identificare temi ricorrenti.
+            """)
+            column_options = []
+            for col in df.columns:
+                if col in columns_info:
+                    description = columns_info[col][:100] + "..." if len(columns_info[col]) > 100 else columns_info[col]
+                    column_options.append((col, f"{col} - {description}"))
+                else:
+                    column_options.append((col, col))
+            selected_column_tuples = st.multiselect(
+                "Seleziona le colonne da utilizzare per il clustering:",
+                column_options,
+                format_func=lambda x: x[1],
+                help="Seleziona almeno una colonna. Le colonne con testo descrittivo sono più efficaci."
+            )
+            selected_columns = [col[0] for col in selected_column_tuples]
+            if selected_columns:
+                st.subheader("🔍 Anteprima Colonne Selezionate")
+                preview_df = df[selected_columns].head(3)
+                st.dataframe(preview_df, use_container_width=True)
+                st.header("⚙️ Parametri Clustering")
+                col1, col2 = st.columns(2)
+                with col1:
+                    auto_clusters = st.checkbox(
+                        "Determinazione automatica del numero di cluster",
+                        value=True,
+                        help="Se selezionato, l'algoritmo determinerà automaticamente il numero ottimale di cluster"
+                    )
+                with col2:
+                    if not auto_clusters:
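+                        # Keep the default inside the slider range even for small files
+                        max_fixed_clusters = max(3, min(500, len(df) // 5))
+                        n_clusters = st.slider(
+                            "Numero di cluster",
+                            min_value=2,
+                            max_value=max_fixed_clusters,
+                            value=min(250, max_fixed_clusters),
+                            help="Numero fisso di cluster da creare"
+                        )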
+                    else:
+                        col2_1, col2_2 = st.columns(2)
+                        with col2_1:
+                            min_clusters = st.number_input(
+                                "Numero minimo di cluster",
+                                min_value=2,
+                                max_value=500,
+                                value=5,
+                                step=1,
+                                help="Numero minimo di cluster per la determinazione automatica"
+                            )
+                        with col2_2:
+                            max_clusters = st.number_input(
+                                "Numero massimo di cluster",
+                                min_value=min_clusters,
+                                max_value=500,
+                                value=max(250, min_clusters),
+                                step=1,
+                                help="Numero massimo di cluster per la determinazione automatica"
+                            )
+                st.header("🚫 Blacklist Parole Personalizzata")
+                st.markdown("""
+                Aggiungi parole che vuoi escludere completamente dall'analisi del clustering.
+                Queste parole saranno rimosse dall'analisi per evitare che influenzino i risultati.
+                """)
+                col1_bl, col2_bl = st.columns([2, 1])
+                with col1_bl:
+                    custom_words_input = st.text_area(
+                        "Parole da escludere (una per riga o separate da virgola):",
+                        height=100,
+                        placeholder="digitalizzazione\ninfrastruttura\nsanità\n\noppure: digitalizzazione, infrastruttura, sanità",
+                        help="Inserisci parole che ritieni irrilevanti per il tuo contesto di analisi. "
+                             "Puoi inserire una parola per riga oppure separare le parole con virgole."
+                    )
+                with col2_bl:
+                    st.markdown("**Esempi di parole da escludere:**")
+                    st.markdown("- Termini troppo generici")
+                    st.markdown("- Nomi di enti frequenti")
+                    st.markdown("- Parole tecniche comuni")
+                    st.markdown("- Location ricorrenti")
+                # Parse custom blacklist
+                custom_blacklist = []
+                if custom_words_input.strip():
+                    # Accept commas and newlines interchangeably as separators
+                    custom_blacklist = [
+                        word.strip() for word in custom_words_input.replace('\n', ',').split(',')]
+                    # Filter out empty strings
+                    custom_blacklist = [
+                        word for word in custom_blacklist if word]
+                    if custom_blacklist:
+                        st.success(
+                            f"✅ Saranno escluse {len(custom_blacklist)} parole personalizzate: {', '.join(custom_blacklist[:5])}{'...' if len(custom_blacklist) > 5 else ''}")
+                if st.button("🚀 Avvia Analisi Cluster", type="primary"):
+                    with st.spinner("Analisi in corso... Questo potrebbe richiedere alcuni minuti."):
+                        try:
+                            n_clusters_param = None if auto_clusters else n_clusters
+                            max_clusters_param = max_clusters if auto_clusters else 20
+                            min_clusters_param = min_clusters if auto_clusters else 2
+                            cluster_df, data_with_clusters_df, embeddings, cluster_labels = cluster_analysis.analyze_clusters(
+                                data_frame_path=uploaded_file,
+                                selected_columns=selected_columns,
+                                n_clusters=n_clusters_param,
+                                max_clusters=max_clusters_param,
+                                min_clusters=min_clusters_param,
+                                custom_blacklist=custom_blacklist if custom_blacklist else None
+                            )
+                            cluster_analysis.save_results(cluster_df, data_with_clusters_df)
+                            stats = cluster_analysis.get_cluster_statistics(cluster_df, data_with_clusters_df)
+                            st.success("✅ Analisi completata con successo!")
+                            st.header("📊 Statistiche Clustering")
+                            display_cluster_statistics(stats)
+                            st.header("🎯 Risultati Cluster")
+                            st.markdown(f"Sono stati identificati **{len(cluster_df)}** cluster tematici:")
+                            for idx, row in cluster_df.iterrows():
+                                with st.expander(f"**Cluster {row['cluster_id'] + 1}**: {row['titolo']} ({row['num_progetti']} progetti)"):
+                                    st.write(f"**Descrizione**: {row['descrizione']}")
+                                    st.write(f"**Parole chiave**: {row['keywords']}")
+                                    st.write("**Progetti di esempio**:")
+                                    st.write(row['progetti_campione'])
+                            st.header("📥 Download Risultati")
+                            col1, col2 = st.columns(2)
+                            with col1:
+                                with open(cluster_analysis.SAVE_PATH_CLUSTERS, 'rb') as f:
+                                    cluster_bytes = f.read()
+                                st.download_button(
+                                    label="📋 Scarica Sommario Cluster",
+                                    data=cluster_bytes,
+                                    file_name="cluster_results.xlsx",
+                                    mime="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
+                                    help="File Excel con titoli, descrizioni e statistiche dei cluster"
+                                )
+                            with col2:
+                                with open(cluster_analysis.SAVE_PATH_ORIGINAL, 'rb') as f:
+                                    data_bytes = f.read()
+                                st.download_button(
+                                    label="📊 Scarica Dati con Cluster ID",
+                                    data=data_bytes,
+                                    file_name="data_with_clusters.xlsx",
+                                    mime="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
+                                    help="File Excel originale con aggiunta colonna cluster_id per ogni progetto"
+                                )
+                            st.header(
+                                "📊 Visualizzazione Cluster nello Spazio degli Embeddings")
+                            st.markdown("""
+                            Questo grafico mostra una rappresentazione bidimensionale dei cluster ottenuti tramite PCA (Principal Component Analysis).
+                            Ogni punto rappresenta un progetto PNRR, colorato secondo il cluster di appartenenza.
+                            """)
+                            try:
+                                # Create and display PCA plot
+                                pca_fig = cluster_analysis.create_cluster_pca_plot(
+                                    embeddings, cluster_labels, cluster_df)
+                                st.plotly_chart(
+                                    pca_fig, use_container_width=True)
+                            except Exception as e:
+                                st.error(
+                                    f"❌ Errore nella creazione del plot PCA: {str(e)}")
+                                logging.error(
+                                    f"PCA plot error: {e}", exc_info=True)
+                            st.header("👀 Anteprima Risultati")
+                            cluster_counts = data_with_clusters_df['cluster_id'].value_counts().sort_index()
+                            cluster_counts_df = pd.DataFrame({
+                                'Cluster ID': cluster_counts.index,
+                                'Numero Progetti': cluster_counts.values
+                            })
+                            st.subheader("Distribuzione Progetti per Cluster")
+                            st.bar_chart(cluster_counts_df.set_index('Cluster ID'))
+                            st.subheader("Dati di Esempio con Cluster ID")
+                            sample_data = data_with_clusters_df[selected_columns + ['cluster_id']].head(10)
+                            st.dataframe(sample_data, use_container_width=True)
+                        except Exception as e:
+                            st.error(f"❌ Errore durante l'analisi: {str(e)}")
+                            logging.error(f"Clustering error: {e}", exc_info=True)
+            else:
+                st.warning("⚠️ Seleziona almeno una colonna per procedere con il clustering.")
+        except Exception as e:
+            st.error(f"❌ Errore nel caricamento del file: {str(e)}")
+    else:
+        st.info("👆 Carica un file Excel per iniziare l'analisi cluster.")
+        st.header("📋 Formato File Atteso")
+        st.markdown("""
+        Il file Excel dovrebbe contenere i dati dei progetti PNRR con colonne come:
+        - **Titolo Progetto**: Nome del progetto
+        - **Sintesi Progetto**: Descrizione dettagliata
+        - **Descrizione Missione**: Descrizione della missione PNRR
+        - **Descrizione Componente**: Descrizione della componente
+        - **Soggetto Attuatore**: Ente responsabile
+        - **Descrizione Comune**: Località del progetto
+        Più colonne testuali descrittive vengono selezionate, migliore sarà la qualità del clustering.
+        """)
+if __name__ == "__main__":
+    logging.basicConfig(level=logging.INFO)
+    set_page_config()
+    main()

modules/column_query_agent.py ADDED

	@@ -0,0 +1,83 @@

+import json
+import logging
+import os
+from typing import Dict, Callable, Any
+import langchain_openai
+from langchain_core import prompts
+def agent(base_llm_model_name: str) -> Callable[[str, Dict[str, str]], Dict[str, str]]:
+    """Create an agent for query analysis and column mapping.
+    Args:
+        base_llm_model_name: Name of the LLM model to use
+    Returns:
+        Callable: Function that accepts (query, columns_and_descriptions) and returns column-query mapping
+    """
+    config = {
+        'model': base_llm_model_name,
+        'temperature': 0,
+        'max_tokens': 4000,
+        'max_retries': 10,
+        'seed': 123456
+    }
+    system_prompt = '''
+        You are a smart assistant that receives:
+        - a user search query with a lot of keywords,
+        - a list of columns extracted from a dataset,
+        - and for each column, its description explaining what it contains.
+        Your task:
+        - Analyze the query.
+        - For each column, determine if part of the query is highly relevant to it.
+        - Extract only the most relevant keywords or parts of the query that fit the topic and meaning of the column.
+        - Pair each relevant column with its query fragment.
+        Rules:
+        - The query fragment must make sense for that specific column.
+        - If the column is not relevant to any part of the query, you can skip it.
+        - Do not modify the meaning of the user's query, but you can split and adapt it into multiple parts.
+        - Be concise but precise in fragment construction.
+        - Include the most important 5-10 columns, maximum.
+        - Do not change the names of the columns.
+        Output format: a JSON object whose keys are the column names and whose values are the query fragments.
+    '''
+    logging.info(f"Loading model {base_llm_model_name}...")
+    model = langchain_openai.ChatOpenAI(
+        api_key=os.getenv("OPENAI_API_KEY"),
+        model=config['model'],
+        temperature=config['temperature'],
+        max_tokens=config['max_tokens'],
+        max_retries=config['max_retries'],
+        seed=config['seed'],
+    )
+    prompt = prompts.ChatPromptTemplate.from_messages([
+        ('system', system_prompt),
+        ('human', 'User Query: {query}, Columns and Descriptions: {columns}'),
+    ])
+    chain = prompt | model
+    def invoke(query: str, columns_and_descriptions: Dict[str, str]) -> Dict[str, str]:
+        formatted_columns = "\n".join(
+            f"- {col}: {desc}" for col, desc in columns_and_descriptions.items()
+        )
+        return post_process(chain.invoke({'query': query, 'columns': formatted_columns}), columns_and_descriptions)
+    return invoke
+def post_process(response: Any, columns_and_descriptions: Dict[str, str]) -> Dict[str, str]:
+    """Post-process LLM response to extract column-query mapping.
+    Args:
+        response: LLM response containing JSON
+        columns_and_descriptions: Dictionary of available columns and descriptions
+    Returns:
+        Dict[str, str]: Dictionary mapping column names to relevant query fragments
+    """
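+    # Strip an optional Markdown code fence before parsing. str.lstrip('json\n')
+    # would remove a set of characters, not the literal "json" prefix.
+    content = response.content.strip().strip('`').strip()
+    if content.startswith('json'):
+        content = content[len('json'):]
+    json_response = json.loads(content)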
+    return {col: json_response[col] for col in columns_and_descriptions if col in json_response}
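
The post-processing step has to cope with model replies that wrap the JSON object in a Markdown code fence. One way to unwrap such a reply is sketched below. `extract_json` is a hypothetical helper name introduced for illustration, and the sample reply is made up; neither comes from the commit.

```python
import json

FENCE = "`" * 3  # a Markdown code fence marker, built indirectly to keep this block readable


def extract_json(content: str) -> dict:
    """Parse JSON that may be wrapped in a Markdown code fence (hypothetical helper)."""
    text = content.strip()
    if text.startswith(FENCE):
        # Drop the backticks on both sides, then an optional "json" language tag.
        text = text.strip("`").strip()
        if text.startswith("json"):
            text = text[len("json"):]
    return json.loads(text)


# Made-up replies: the same object, once bare and once fenced.
plain_reply = '{"Titolo Progetto": "scuole digitali"}'
fenced_reply = f"{FENCE}json\n{plain_reply}\n{FENCE}"

parsed_plain = extract_json(plain_reply)
parsed_fenced = extract_json(fenced_reply)

# Keep only the columns that actually exist, as post_process does.
known_columns = {"Titolo Progetto": "Nome del progetto", "Sintesi Progetto": "Descrizione"}
mapping = {col: parsed_fenced[col] for col in known_columns if col in parsed_fenced}
```

Removing the literal `json` prefix by slicing avoids the pitfall of `str.lstrip`, which treats its argument as a set of characters rather than a prefix.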

modules/create.py ADDED

	@@ -0,0 +1,75 @@

+import os
+import uuid
+import faiss
+import shutil
+import logging
+import pandas as pd
+from typing import Any
+from langchain_core import documents
+from langchain_community import embeddings
+from langchain_community import vectorstores
+from langchain_community.docstore import in_memory
+DEFAULT_INDEX_QUERY = "hello world"
+def build_faiss(
+    data_frame: pd.DataFrame,
+    index_path: str,
+    embedder: Any
+) -> vectorstores.FAISS:
+    """Build a FAISS index from a DataFrame.
+    Args:
+        data_frame: DataFrame containing data to index
+        index_path: Path where to save the FAISS index
+        embedder: Embedder object to generate vectors
+    Returns:
+        vectorstores.FAISS: Built FAISS vectorstore object
+    """
+    embedded_documents = []
+    for row_idx, row in data_frame.iterrows():
+        for col_name, cell_val in row.items():
+            embedded_documents.append(documents.Document(
+                page_content=str(cell_val),
+                metadata={"row": row_idx, "column": col_name},
+            ))
+    if os.path.exists(index_path):
+        shutil.rmtree(index_path, ignore_errors=True)
+        logging.debug(f"Deleted existing FAISS index at {index_path}")
+    vectorstore = vectorstores.FAISS(
+        embedding_function=embedder,
+        index=faiss.IndexFlatIP(len(embedder.embed_query(DEFAULT_INDEX_QUERY))),
+        docstore=in_memory.InMemoryDocstore(),
+        index_to_docstore_id={},
+    )
+    uuids = [str(uuid.uuid4()) for _ in range(len(embedded_documents))]
+    vectorstore.add_documents(documents=embedded_documents, ids=uuids)
+    logging.debug(f"Added {len(embedded_documents)} documents to FAISS index")
+    os.makedirs(index_path, exist_ok=True)
+    vectorstore.save_local(index_path)
+    logging.debug(f"FAISS index saved to ./{index_path}/")
+    return vectorstore
+def load_faiss_index(
+    index_path: str,
+    hf_model_name: str
+) -> vectorstores.FAISS:
+    """Load a previously saved FAISS index.
+    Args:
+        index_path: Path of the saved FAISS index
+        hf_model_name: Name of the HuggingFace model for embeddings
+    Returns:
+        vectorstores.FAISS: Loaded FAISS vectorstore object
+    """
+    embedder = embeddings.HuggingFaceEmbeddings(model_name=hf_model_name)
+    return vectorstores.FAISS.load_local(index_path, embedder, allow_dangerous_deserialization=True)
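
`build_faiss` creates a `faiss.IndexFlatIP`, which ranks results by inner product. Inner product equals cosine similarity only when the vectors are L2-normalized; whether the embedder used with this index normalizes its output is not visible in this commit. The relationship can be checked with a small NumPy sketch on random vectors (all data here is synthetic):

```python
import numpy as np


def l2_normalize(vectors: np.ndarray) -> np.ndarray:
    """Scale each row to unit length, guarding against zero vectors."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)


rng = np.random.default_rng(0)
docs = rng.normal(size=(5, 8))   # five synthetic document embeddings
query = rng.normal(size=(1, 8))  # one synthetic query embedding

# Cosine similarity computed from its definition.
cosine = (docs @ query.T).ravel() / (
    np.linalg.norm(docs, axis=1) * np.linalg.norm(query)
)

# Inner product on raw vectors vs. on normalized vectors.
inner_raw = (docs @ query.T).ravel()
inner_normalized = (l2_normalize(docs) @ l2_normalize(query).T).ravel()

same_as_cosine = np.allclose(cosine, inner_normalized)
```

If the embeddings are not normalized, a fixed similarity threshold between 0 and 1 is harder to interpret, because raw inner products are not bounded to that range.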

modules/fixtures/Scheda metadatazione_Progetti_Lozalizzazioni_PNRR_Italiadomani_V2.xlsx ADDED

Binary file (21.3 kB).

modules/home.py ADDED

	@@ -0,0 +1,113 @@

+import os
+import sys
+import torch
+import dotenv
+import logging
+import streamlit as st
+sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), '..')))
+from modules import semantic_filter
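+# Presumably a workaround for Streamlit's file watcher, which can raise errors
+# when it tries to introspect torch.classes
+torch.classes.__path__ = []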
+CONFIDENCE = 0.8
+CHUNK_SIZE = 1000
+METADATA_PATH = 'modules/fixtures/Scheda metadatazione_Progetti_Lozalizzazioni_PNRR_Italiadomani_V2.xlsx'
+def set_page_config() -> None:
+    """Configure Streamlit page settings for semantic filter.
+    Returns:
+        None
+    """
+    st.set_page_config(
+        page_title="Semantic Filter",
+        page_icon=":desert_island:",
+    )
+def main() -> None:
+    """Main function for semantic filter user interface.
+    Handles file upload, parameter configuration, and search execution.
+    Returns:
+        None
+    """
+    st.title("🔍 Filtro Semantico Progetti PNRR")
+    st.markdown("""
+    Questa sezione permette di filtrare i progetti PNRR utilizzando ricerca semantica avanzata.
+    Inserisci una query testuale e il sistema identificherà automaticamente i progetti più rilevanti.
+    """)
+    st.header("📁 Carica il File Excel")
+    uploaded_file = st.file_uploader(
+        "Seleziona il file Excel contenente i progetti PNRR", type=["xlsx"])
+    st.header("⚙️ Parametri di Ricerca")
+    col1, col2 = st.columns(2)
+    with col1:
+        confidence = st.slider(
+            "Soglia di confidenza",
+            min_value=0.0,
+            max_value=1.0,
+            value=CONFIDENCE,
+            step=0.01,
+            help="Valore minimo di similarità per considerare un progetto rilevante"
+        )
+    with col2:
+        output_option = st.selectbox(
+            "Opzione di output",
+            [("Aggiungi una colonna al file", "add_column"),
+             ("Crea un nuovo file", "new_file")],
+            format_func=lambda x: x[0],
+            index=0
+        )
+    st.header("💬 Query di Ricerca")
+    user_query = st.text_area(
+        "Inserisci la query di ricerca semantica:",
+        height=150,
+        max_chars=None,
+        help="Descrivi il tipo di progetti che stai cercando in linguaggio naturale",
+        placeholder="Esempio: progetti di digitalizzazione nelle scuole, infrastrutture sostenibili, riqualificazione urbana..."
+    )
+    if st.button("🚀 Avvia Ricerca Semantica", type="primary"):
+        if uploaded_file is not None and user_query:
+            with st.spinner("Ricerca in corso... Questo potrebbe richiedere alcuni minuti."):
+                try:
+                    semantic_filter.apply(
+                        data_frame_path=uploaded_file,
+                        metadata_path=METADATA_PATH,
+                        user_query=user_query,
+                        threshold=confidence,
+                        chunk_size=CHUNK_SIZE,
+                        output_option=output_option[1]
+                    )
+                    st.success("✅ Ricerca completata con successo!")
+                    with open(semantic_filter.SAVE_PATH, 'rb') as f:
+                        file_bytes = f.read()
+                    st.download_button(
+                        label="📥 Scarica Risultati",
+                        data=file_bytes,
+                        file_name=semantic_filter.SAVE_PATH.split('/')[-1],
+                        mime="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"
+                    )
+                except Exception as e:
+                    st.error(f"❌ Errore durante la ricerca: {str(e)}")
+        else:
+            st.error("⚠️ Carica un file Excel e inserisci una query di ricerca.")
+if __name__ == "__main__":
+    logging.basicConfig(level=logging.INFO)
+    dotenv.load_dotenv()
+    set_page_config()
+    main()

modules/search.py ADDED

	@@ -0,0 +1,46 @@

+import pandas as pd
+from typing import List, Tuple, Dict, Any
+from langchain_community.vectorstores import faiss
+def multi_column(db: faiss.FAISS, df: pd.DataFrame, qc_pairs: Dict[str, str], threshold: float) -> List[Tuple[int, float, Dict[str, Any]]]:
+    """Perform semantic search across multiple columns and return aggregated results.
+    Args:
+        db: FAISS vector database for search
+        df: Original DataFrame containing the data
+        qc_pairs: Dictionary mapping columns to query fragments
+        threshold: Minimum similarity threshold to include a result
+    Returns:
+        List[Tuple[int, float, Dict[str, Any]]]: List of tuples (row_id, avg_score, row_dict)
+    """
+    per_column_scores = []
+    for column, query in qc_pairs.items():
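+        # With a metadata filter, the FAISS wrapper pre-fetches only `fetch_k`
+        # candidates (default 20) before filtering, so request the whole index.
+        hits = db.similarity_search_with_score(
+            query,
+            k=db.index.ntotal,
+            filter={'column': column},
+            fetch_k=db.index.ntotal,
+            distance_strategy=faiss.DistanceStrategy.COSINE
+        )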
+        score_map = {
+            doc.metadata['row']: score
+            for doc, score in hits
+            if score >= threshold
+        }
+        per_column_scores.append(score_map)
+    all_rows = set()
+    for score_map in per_column_scores:
+        all_rows.update(score_map.keys())
+    results = []
+    for rid in all_rows:
+        scores = [score_map[rid] for score_map in per_column_scores if rid in score_map]
+        if scores:
+            avg_score = sum(scores) / len(scores)
+            row_dict = df.loc[rid].to_dict()
+            results.append((rid, avg_score, row_dict))
+    results.sort(key=lambda x: x[1], reverse=True)
+    return results
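
A note on the aggregation step in `multi_column`: the sketch below reproduces only the score-merging logic on plain dictionaries, with no FAISS index involved. The function name `average_scores` and the sample scores are hypothetical, made up for illustration. It assumes, as the code above does, that higher scores mean better matches and that rows below the threshold were already dropped per column.

```python
from typing import Dict, List, Tuple


def average_scores(per_column_scores: List[Dict[int, float]]) -> List[Tuple[int, float]]:
    """Average each row's score over the columns in which it passed the threshold."""
    collected: Dict[int, List[float]] = {}
    for score_map in per_column_scores:
        for row_id, score in score_map.items():
            collected.setdefault(row_id, []).append(score)
    averaged = [(row_id, sum(s) / len(s)) for row_id, s in collected.items()]
    # Best matches first, mirroring results.sort(..., reverse=True) in the diff.
    averaged.sort(key=lambda item: item[1], reverse=True)
    return averaged


# Hypothetical per-column hits: row 0 matched both columns, rows 1 and 2 one each.
ranked = average_scores([{0: 0.5, 1: 0.625}, {0: 1.0, 2: 0.25}])
print(ranked)  # [(0, 0.75), (1, 0.625), (2, 0.25)]
```

One consequence worth noticing: a row that matches only one column is averaged over that single column, so it is not penalized for missing the others.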

modules/semantic_filter.py ADDED

	@@ -0,0 +1,120 @@

+import logging
+import pandas as pd
+from typing import Any, List, Tuple, Dict, Union
+from stqdm import stqdm
+from modules import search
+from modules import create
+from modules import column_query_agent
+from langchain_huggingface import embeddings
+SAVE_PATH = 'semantic_filter_results.xlsx'
+LLM_MODEL_NAME = 'gpt-4o-mini'
+EMBEDDING_MODEL_NAME = 'sentence-transformers/all-MiniLM-L6-v2'
+def apply(
+    data_frame_path: Union[str, Any],
+    metadata_path: str,
+    user_query: str,
+    threshold: float,
+    chunk_size: int,
+    output_option: str
+) -> None:
+    """Apply semantic filter to PNRR data.
+    Args:
+        data_frame_path: Path or object of Excel file containing the data
+        metadata_path: Path to Excel file containing column metadata
+        user_query: User's textual query for semantic search
+        threshold: Minimum confidence threshold to consider a result relevant
+        chunk_size: Chunk size for data processing
+        output_option: Output option ('new_file' or 'add_column')
+    Returns:
+        None
+    """
+    query_agent = column_query_agent.agent(LLM_MODEL_NAME)
+    embedder = embeddings.HuggingFaceEmbeddings(model_name=EMBEDDING_MODEL_NAME)
+    logging.info(f"Loading DataFrame at {data_frame_path}...")
+    df = pd.read_excel(data_frame_path)
+    logging.info(f"Loaded DataFrame with {len(df)} rows")
+    metadata_df = pd.read_excel(metadata_path)
+    columns_and_descriptions = dict(zip(
+        metadata_df['Variabile'],
+        metadata_df['Descrizione']
+    ))
+    columns_and_descriptions = {k: v for k, v in columns_and_descriptions.items() if pd.notna(v) and k in df.columns}
+    query_pairs = query_agent(user_query, columns_and_descriptions)
+    relevant_columns = list(query_pairs.keys())
+    all_results = []
+    chunks = split_dataframe(df, chunk_size)
+    for chunk in stqdm(chunks, desc='Processing chunks'):
+        df_reduced = chunk[relevant_columns]
+        db = create.build_faiss(
+            df_reduced,
+            index_path='faiss_index',
+            embedder=embedder
+        )
+        results = search.multi_column(db, chunk, query_pairs, threshold)
+        all_results.extend(results)
+    all_results.sort(key=lambda x: x[1], reverse=True)
+    if output_option == 'new_file':
+        save_results_to_excel(all_results, SAVE_PATH)
+    else:
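+        # Object dtype: the column holds False for non-matches and the float score for matches.
+        df['is_valid'] = pd.Series(False, index=df.index, dtype=object)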
+        for row_id, score, _row_dict in all_results:
+            df.at[row_id, 'is_valid'] = score
+        df.to_excel(SAVE_PATH, index=False)
+    logging.info(f"{len(all_results)} rows found")
+def split_dataframe(df: pd.DataFrame, chunk_size: int) -> List[pd.DataFrame]:
+    """Split a DataFrame into chunks of specified size.
+    Args:
+        df: DataFrame to split
+        chunk_size: Maximum size of each chunk
+    Returns:
+        List[pd.DataFrame]: List of DataFrame chunks
+    """
+    chunks = []
+    for i in range(0, len(df), chunk_size):
+        chunks.append(df.iloc[i:i + chunk_size])
+    return chunks
+def save_results_to_excel(results: List[Tuple[int, float, Dict[str, Any]]], output_path: str) -> None:
+    """Save semantic search results to Excel file.
+    Args:
+        results: List of tuples containing (row_id, score, row_dict)
+        output_path: Path of output Excel file
+    Returns:
+        None
+    """
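+    if not results:
+        logging.warning("No results to save.")
+        # Still write an empty workbook so callers never read a stale or missing file.
+        pd.DataFrame().to_excel(output_path, index=False)
+        return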
+    data = []
+    for row_id, score, row_dict in results:
+        row = {
+            'row_id': row_id,
+            'score': score,
+            **row_dict
+        }
+        data.append(row)
+    df = pd.DataFrame(data)
+    df = df.sort_values(by='row_id').reset_index(drop=True)
+    df.to_excel(output_path, index=False)
+    logging.info(f"Saved {len(results)} results to {output_path}")
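
The chunk-then-merge flow in `apply` can be illustrated without pandas or FAISS. The following is a minimal sketch on a plain list of strings; `iter_chunks`, `filter_in_chunks`, `toy_score`, and the sample rows are all hypothetical names and data introduced here, and the keyword scorer merely stands in for the embedding search.

```python
from typing import Callable, Iterator, List, Sequence, Tuple


def iter_chunks(rows: Sequence[str], chunk_size: int) -> Iterator[Tuple[int, Sequence[str]]]:
    """Yield (start_offset, chunk) pairs so that row ids stay global across chunks."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(rows), chunk_size):
        yield start, rows[start:start + chunk_size]


def filter_in_chunks(
    rows: Sequence[str],
    chunk_size: int,
    score_fn: Callable[[str], float],
    threshold: float,
) -> List[Tuple[int, float]]:
    """Score rows chunk by chunk and keep (row_id, score) pairs at or above threshold."""
    hits: List[Tuple[int, float]] = []
    for start, chunk in iter_chunks(rows, chunk_size):
        for offset, text in enumerate(chunk):
            score = score_fn(text)
            if score >= threshold:
                hits.append((start + offset, score))
    # Merge step: one global ranking over the hits from all chunks.
    hits.sort(key=lambda hit: hit[1], reverse=True)
    return hits


def toy_score(text: str) -> float:
    """Hypothetical scorer: 1.0 if the text mentions 'school', else 0.0."""
    return 1.0 if "school" in text else 0.0


rows = ["digital school", "bike lane", "school roof", "solar plant", "school bus"]
hits = filter_in_chunks(rows, chunk_size=2, score_fn=toy_score, threshold=0.5)
print(hits)  # [(0, 1.0), (2, 1.0), (4, 1.0)]
```

The sketch carries an explicit start offset to keep row ids global. In the diff, `df.iloc[i:i + chunk_size]` preserves the original index labels, which presumably serves the same purpose, provided the `row` metadata written by `create.build_faiss` (not shown in this section) uses those labels.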

relevant-queries.md ADDED

	@@ -0,0 +1,77 @@

+**1. Climate change and adaptation**
+Keywords
+- climate change;
+- adaptation and mitigation;
+- climate resilience, territorial resilience;
+- hydraulic risk, hydrogeological risk;
+- strategies and defense.
+Indirect indicators
+- extreme events, floods;
+- green infrastructure;
+- stormwater management, sustainable urban drainage;
+- green regeneration, urban forestation, tree planting.
+**2. Mobility and logistics**
+Keywords
+- sustainable mobility, soft mobility;
+- local public transport, TPL, electric buses/vehicles;
+- cycling, cycle routes, bike lanes;
+- intermodality, logistics hub, intermodal stations;
+- urban logistics, city logistics, green logistics.
+Indirect indicators
+- traffic reduction, decongestion;
+- transport infrastructure;
+- vehicle electrification.
+**3. Digitalization and competitiveness**
+Keywords
+- digitalization, digital transition;
+- technological innovation, digital transformation;
+- broadband, 5G, cloud computing;
+- digital platforms, digital services, interoperability.
+Indirect indicators
+- digital twin, open data, data governance;
+- innovative companies, startups, digital ecosystems;
+- service platforms, digitalization of public administration, e-government.
+**4. Urban and territorial regeneration**
+Keywords
+- urban regeneration, territorial regeneration;
+- building renovation, reuse, reactivation;
+- public spaces, suburbs, social housing construction;
+- territorial participation, co-design, tactical urbanism.
+Indicators
+- urban quality, territorial inclusion;
+- tackling decay, functional recovery;
+- social housing, neighborhood services, urban welfare.
+**5. Energy sustainability**
+Keywords
+- renewable energy, renewable sources;
+- energy efficiency, energy saving;
+- energy communities, collective self-consumption;
+- BIM, photovoltaics, solar thermal, heat pumps.
+Indirect indicators
+- decarbonization, energy transition;
+- municipal energy plans, energy audits;
+- intelligent networks, smart grid, smart building.

requirements.txt ADDED

	@@ -0,0 +1,17 @@

+pandas==2.3.0
+openpyxl==3.1.5
+faiss-cpu==1.11.0
+python-dotenv==1.1.0
+stqdm==0.0.5
+langchain==0.3.24
+langchain-community==0.3.22
+langchain-huggingface==0.1.2
+huggingface-hub==0.30.2
+hf-xet==1.0.5
+langchain-openai==0.3.14
+sentence-transformers==4.1.0
+streamlit==1.44.1
+scikit-learn==1.5.1
+numpy>=1.25.0,<3.0
+protobuf==3.20.3
+plotly==5.24.0

streamlit_config/config.toml ADDED

	@@ -0,0 +1,5 @@

+[server]
+# Max size, in megabytes, for files uploaded with the file_uploader.
+#
+# Default: 200
+maxUploadSize = 400