Spaces: Lukeetah / ScrapIT

Lukeetah committed on Jun 13, 2025

Commit

e564101

1 Parent(s): e1246bd

Upload 5 files

Files changed (5)

README.md +53 -23
app.py +395 -377
requirements.txt +27 -5
test_app.py +19 -0
web_scraper_tool.py +361 -146

README.md CHANGED

@@ -1,42 +1,72 @@
 ---
-title: Web Scraper Tool
 emoji: 🕸️
 colorFrom: purple
 colorTo: blue
 sdk: gradio
-sdk_version: 5.33.2
 app_file: app.py
-pinned: false
 ---
 # 🕸️ Web Scraper Tool
-A web tool for scraping web pages and converting them to PDF or plain text.
-This application is optimized to generate files that Copilot can process.
 ## ✨ Features
-- ✅ Web content extraction
-- 📄 Conversion to PDF or plain text
-- 🖼️ Automatic image detection
-- 🎨 Minimalist, professional interface
-- 🤖 Optimized to generate Copilot-compatible files
-## 🚀 Usage
-1. Enter the URL of the web page you want to process
-2. Select the output format (PDF or TXT)
-3. Click "Procesar URL"
-4. Download the generated file
-## 🛠️ Technologies used
-- Python
-- Gradio
-- BeautifulSoup
-- WeasyPrint
-- Hugging Face Spaces
-## 👨‍💻 Author
-Built with 💜 to solve web content processing problems

 ---
+title: 🕸️ Web Scraper Tool
 emoji: 🕸️
 colorFrom: purple
 colorTo: blue
 sdk: gradio
+sdk_version: 4.44.1
 app_file: app.py
+pinned: true
+license: mit
+short_description: Extract web content and convert to PDF/TXT for Copilot
 ---
 # 🕸️ Web Scraper Tool
+A professional tool for extracting content from web pages and converting it to formats compatible with Microsoft Copilot (PDF and TXT).
 ## ✨ Features
+- **Flexible URLs**: Works with any URL format (HTTP, HTTPS, with or without www, upper or lower case)
+- **Automatic detection**: Automatically identifies whether the content is an image or text
+- **Multiple formats**: Generates PDF files (with visual formatting) or TXT files (plain text)
+- **Optimized for Copilot**: Files are formatted specifically for Microsoft Copilot
+- **Minimalist interface**: Professional, easy-to-use design
+- **Robust processing**: Intelligent error handling and URL normalization
+## 🚀 How to use
+1. **Enter the URL**: Paste any web page URL (forms such as "Https://EXAMPLE.com" are supported)
+2. **Select a format**: Choose PDF (visual) or TXT (text only)
+3. **Process**: Click "Extraer y Convertir"
+4. **Download**: Get your file, ready to use with Copilot
+## 🎯 Use cases
+- Extracting articles and documents for AI analysis
+- Converting web pages to a format Copilot can read
+- Saving content from forums and discussions (such as Spiceworks)
+- Processing technical documentation
+- Extracting text from pages with heavy HTML markup
+## 🛠️ Technology
+- **Frontend**: Gradio 4.44.1 with a custom minimalist design
+- **Web scraping**: Beautiful Soup + Requests with smart headers
+- **PDF conversion**: WeasyPrint with text-oriented optimizations
+- **Processing**: Python with robust error handling
+## 📝 Supported formats
+### PDF
+- Preserves visual formatting and structure
+- Includes basic CSS styles
+- Ideal for formatted documents
+### TXT
+- Clean plain text
+- Includes content metadata
+- Well suited to AI text analysis
+## 🔧 Technical features
+- Automatic URL normalization
+- HTTP content-type detection
+- Rotating headers to avoid blocking
+- Configurable timeouts
+- Automatic encoding handling
+- Intelligent HTML cleanup
+---
+Built with ❤️ to maximize compatibility with AI tools such as Microsoft Copilot.
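
Reviewer note: the URL handling the README describes (accepting input such as "Https://EXAMPLE.com") can be sketched with the standard library alone. The block below is an illustration modelled on the `normalize_url` method that this commit removes from app.py. It is not the code now in web_scraper_tool.py, and the name `normalize_url_sketch` is hypothetical.

```python
from urllib.parse import urlparse, urlunparse


def normalize_url_sketch(url: str) -> str:
    """Lower-case the scheme and host, default to https, keep the path as typed."""
    if not url or not url.strip():
        raise ValueError("URL must not be empty")
    url = url.strip()

    # Add a default scheme when none was typed (case-insensitive check).
    if not url.lower().startswith(("http://", "https://")):
        url = "https://" + url

    parsed = urlparse(url)
    if not parsed.netloc:
        raise ValueError(f"Invalid URL: {url}")

    # Scheme and host are case-insensitive; path, query and fragment are not,
    # so only the first two are lower-cased. Caveat: netloc.lower() would also
    # lower-case any "user:password@" prefix, which this sketch ignores.
    return urlunparse((
        parsed.scheme.lower(),
        parsed.netloc.lower(),
        parsed.path,
        parsed.params,
        parsed.query,
        parsed.fragment,
    ))
```

Leaving the path untouched matters because many servers treat `/Some/Path` and `/some/path` as different resources.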

app.py CHANGED

@@ -1,380 +1,398 @@
 import os
-import requests
-from bs4 import BeautifulSoup
-from weasyprint import HTML, CSS
-from urllib.parse import urlparse, urlunparse
-import re
-from PIL import Image
-import io
-class WebScrapperTool:
-    def __init__(self, output_dir="output"):
-        self.output_dir = output_dir
-        if not os.path.exists(output_dir):
-            os.makedirs(output_dir)
-        # Headers para evitar bloqueos
-        self.headers = {
-            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
-            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
-            'Accept-Language': 'es-ES,es;q=0.8,en-US;q=0.5,en;q=0.3',
-            'Accept-Encoding': 'gzip, deflate',
-            'DNT': '1',
-            'Connection': 'keep-alive',
-            'Upgrade-Insecure-Requests': '1'
-        }
-    def normalize_url(self, url):
-        """Normaliza URLs manejando todos los casos de mayúsculas y formatos incorrectos"""
-        if not url:
-            raise ValueError("URL no puede estar vacía")
-        url = url.strip()
-        # Convertir esquemas a minúsculas pero mantener el resto
-        if url.lower().startswith('http://'):
-            url = 'http://' + url[7:]
-        elif url.lower().startswith('https://'):
-            url = 'https://' + url[8:]
-        elif not url.startswith(('http://', 'https://')):
-            # Si no tiene esquema, agregar https por defecto
-            url = 'https://' + url
-        try:
-            parsed = urlparse(url)
-            # Normalizar componentes
-            scheme = parsed.scheme.lower()
-            netloc = parsed.netloc.lower() if parsed.netloc else ''
-            path = parsed.path
-            params = parsed.params
-            query = parsed.query
-            fragment = parsed.fragment
-            # Si netloc está vacío pero hay path, intentar corregir
-            if not netloc and path:
-                parts = path.split('/', 1)
-                netloc = parts[0].lower()
-                path = '/' + parts[1] if len(parts) > 1 else ''
-            normalized_url = urlunparse((scheme, netloc, path, params, query, fragment))
-            return normalized_url
-        except Exception as e:
-            raise ValueError(f"URL inválida: {url}. Error: {str(e)}")
-    def is_image_url(self, url):
-        """Detecta si una URL es una imagen"""
-        image_extensions = {'.jpg', '.jpeg', '.png', '.gif', '.webp', '.svg', '.bmp', '.tiff', '.ico'}
-        # Verificar por extensión
-        parsed_url = urlparse(url.lower())
-        path = parsed_url.path
-        if any(path.endswith(ext) for ext in image_extensions):
-            return True
-        # Verificar por content-type si es posible
-        try:
-            response = requests.head(url, headers=self.headers, timeout=10)
-            content_type = response.headers.get('content-type', '').lower()
-            if content_type.startswith('image/'):
-                return True
-        except:
-            pass
-        return False
-    def get_clean_html_for_pdf(self, html_content, base_url):
-        """Limpia HTML específicamente para conversión PDF robusta"""
-        soup = BeautifulSoup(html_content, 'html.parser')
-        # Remover elementos problemáticos para PDF
-        for element in soup(['script', 'style', 'noscript', 'iframe', 'embed', 'object']):
-            element.decompose()
-        # Remover atributos problemáticos
-        for tag in soup.find_all():
-            # Mantener solo atributos seguros
-            safe_attrs = ['href', 'src', 'alt', 'title', 'class', 'id']
-            attrs_to_remove = [attr for attr in tag.attrs if attr not in safe_attrs]
-            for attr in attrs_to_remove:
-                del tag[attr]
-        # Agregar CSS básico para mejor renderizado PDF
-        css_style = """
-        <style>
-        body {
-            font-family: Arial, sans-serif;
-            line-height: 1.6;
-            margin: 20px;
-            color: #333;
-        }
-        h1, h2, h3, h4, h5, h6 {
-            color: #2c3e50;
-            margin-top: 20px;
-        }
-        p {
-            margin-bottom: 10px;
-        }
-        a {
-            color: #3498db;
-            text-decoration: none;
-        }
-        img {
-            max-width: 100%;
-            height: auto;
-        }
-        table {
-            border-collapse: collapse;
-            width: 100%;
-        }
-        th, td {
-            border: 1px solid #ddd;
-            padding: 8px;
-            text-align: left;
-        }
-        </style>
-        """
-        # Insertar CSS en el head
-        if soup.head:
-            soup.head.insert(0, BeautifulSoup(css_style, 'html.parser'))
-        else:
-            # Si no hay head, crear uno
-            head = soup.new_tag('head')
-            head.insert(0, BeautifulSoup(css_style, 'html.parser'))
-            if soup.html:
-                soup.html.insert(0, head)
-            else:
-                # Crear estructura HTML completa
-                html_tag = soup.new_tag('html')
-                html_tag.insert(0, head)
-                body = soup.new_tag('body')
-                body.extend(soup.contents[:])
-                html_tag.append(body)
-                soup.clear()
-                soup.append(html_tag)
-        return str(soup)
-    def scrape_to_pdf(self, url, filename=None):
-        """Convierte página web a PDF con manejo robusto de errores"""
-        try:
-            normalized_url = self.normalize_url(url)
-            # Verificar si es imagen
-            if self.is_image_url(normalized_url):
-                return self._handle_image_to_pdf(normalized_url, filename)
-            # Obtener contenido web
-            response = requests.get(normalized_url, headers=self.headers, timeout=30)
-            response.raise_for_status()
-            response.encoding = response.apparent_encoding or 'utf-8'
-            # Limpiar HTML para PDF
-            clean_html = self.get_clean_html_for_pdf(response.text, normalized_url)
-            # Generar nombre de archivo
-            if not filename:
-                domain = urlparse(normalized_url).netloc.replace('www.', '')
-                filename = f"scraped_{domain.replace('.', '_')}.pdf"
-            if not filename.endswith('.pdf'):
-                filename += '.pdf'
-            pdf_path = os.path.join(self.output_dir, filename)
-            # Configurar WeasyPrint con opciones robustas
-            html_doc = HTML(string=clean_html, base_url=normalized_url)
-            # CSS adicional para mejorar renderizado
-            css = CSS(string='''
-                @page {
-                    margin: 2cm;
-                    size: A4;
-                }
-                body {
-                    font-size: 12pt;
-                }
-            ''')
-            html_doc.write_pdf(pdf_path, stylesheets=[css])
-            return {
-                'status': 'success',
-                'file': pdf_path,
-                'url': normalized_url,
-                'message': f'PDF generado exitosamente: {filename}'
-            }
-        except requests.RequestException as e:
-            return {
-                'status': 'error',
-                'message': f'Error al acceder a la URL: {str(e)}',
-                'url': url
-            }
-        except Exception as e:
-            return {
-                'status': 'error',
-                'message': f'Error al generar PDF: {str(e)}',
-                'url': url
-            }
-    def scrape_to_text(self, url, filename=None):
-        """Convierte página web a texto plano"""
-        try:
-            normalized_url = self.normalize_url(url)
-            # Verificar si es imagen
-            if self.is_image_url(normalized_url):
-                return self._handle_image_to_text(normalized_url, filename)
-            # Obtener contenido web
-            response = requests.get(normalized_url, headers=self.headers, timeout=30)
-            response.raise_for_status()
-            response.encoding = response.apparent_encoding or 'utf-8'
-            # Extraer texto limpio
-            soup = BeautifulSoup(response.text, 'html.parser')
-            # Remover elementos no deseados
-            for element in soup(['script', 'style', 'noscript', 'header', 'footer', 'nav']):
-                element.decompose()
-            # Extraer texto con separadores
-            text_content = soup.get_text(separator='\n', strip=True)
-            # Limpiar texto
-            lines = [line.strip() for line in text_content.split('\n') if line.strip()]
-            clean_text = '\n'.join(lines)
-            # Agregar metadatos
-            metadata = f"""URL: {normalized_url}
-Fecha de extracción: {requests.utils.default_headers()['User-Agent']}
-Caracteres extraídos: {len(clean_text)}
-{'='*50}
-{clean_text}"""
-            # Generar nombre de archivo
-            if not filename:
-                domain = urlparse(normalized_url).netloc.replace('www.', '')
-                filename = f"scraped_{domain.replace('.', '_')}.txt"
-            if not filename.endswith('.txt'):
-                filename += '.txt'
-            txt_path = os.path.join(self.output_dir, filename)
-            with open(txt_path, 'w', encoding='utf-8') as f:
-                f.write(metadata)
-            return {
-                'status': 'success',
-                'file': txt_path,
-                'url': normalized_url,
-                'message': f'Texto extraído exitosamente: {filename}'
-            }
-        except Exception as e:
-            return {
-                'status': 'error',
-                'message': f'Error al extraer texto: {str(e)}',
-                'url': url
-            }
-    def _handle_image_to_pdf(self, url, filename):
-        """Maneja conversión de imagen a PDF"""
-        try:
-            response = requests.get(url, headers=self.headers, timeout=30)
-            response.raise_for_status()
-            # Crear HTML con la imagen
-            html_content = f"""
-            <html>
-            <head>
-                <style>
-                    body {{ margin: 0; padding: 20px; text-align: center; }}
-                    img {{ max-width: 100%; height: auto; }}
-                    .info {{ margin-top: 20px; font-family: Arial, sans-serif; }}
-                </style>
-            </head>
-            <body>
-                <img src="{url}" alt="Imagen extraída">
-                <div class="info">
-                    <p><strong>URL:</strong> {url}</p>
-                    <p><strong>Tipo:</strong> Imagen</p>
-                </div>
-            </body>
-            </html>
-            """
-            if not filename:
-                filename = "image_scraped.pdf"
-            pdf_path = os.path.join(self.output_dir, filename)
-            HTML(string=html_content).write_pdf(pdf_path)
-            return {
-                'status': 'success',
-                'file': pdf_path,
-                'url': url,
-                'message': f'Imagen convertida a PDF: {filename}'
-            }
-        except Exception as e:
-            return {
-                'status': 'error',
-                'message': f'Error al procesar imagen: {str(e)}',
-                'url': url
-            }
-    def _handle_image_to_text(self, url, filename):
-        """Maneja conversión de imagen a archivo de texto con metadatos"""
-        try:
-            response = requests.get(url, headers=self.headers, timeout=30)
-            response.raise_for_status()
-            # Obtener información de la imagen
-            try:
-                img = Image.open(io.BytesIO(response.content))
-                img_info = f"""IMAGEN DETECTADA
-URL: {url}
-Formato: {img.format}
-Dimensiones: {img.size[0]}x{img.size[1]} píxeles
-Modo: {img.mode}
-Tamaño del archivo: {len(response.content)} bytes
-Esta URL contiene una imagen, no texto extraíble.
-Para procesar el contenido visual, considera usar herramientas de OCR.
-"""
-            except:
-                img_info = f"""IMAGEN DETECTADA
-URL: {url}
-Tamaño del archivo: {len(response.content)} bytes
-Esta URL contiene una imagen, no texto extraíble.
 """
-            if not filename:
-                filename = "image_info.txt"
-            txt_path = os.path.join(self.output_dir, filename)
-            with open(txt_path, 'w', encoding='utf-8') as f:
-                f.write(img_info)
-            return {
-                'status': 'success',
-                'file': txt_path,
-                'url': url,
-                'message': f'Información de imagen guardada: {filename}'
-            }
-        except Exception as e:
-            return {
-                'status': 'error',
-                'message': f'Error al procesar imagen: {str(e)}',
-                'url': url
-            }

+import gradio as gr
 import os
+import tempfile
+from web_scraper_tool import WebScrapperTool
+# Custom CSS with a minimalist, professional look
+custom_css = """
+/* Import the Inter font */
+@import url('https://fonts.googleapis.com/css2?family=Inter:wght@300;400;500;600;700&display=swap');
+/* Global variables */
+:root {
+    --primary-color: #8b5cf6;
+    --primary-hover: #7c3aed;
+    --secondary-color: #f8fafc;
+    --text-primary: #1e293b;
+    --text-secondary: #64748b;
+    --border-color: #e2e8f0;
+    --success-color: #10b981;
+    --error-color: #ef4444;
+    --warning-color: #f59e0b;
+    --gradient-bg: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
+}
+/* Reset and base settings */
+* {
+    box-sizing: border-box;
+}
+body {
+    font-family: 'Inter', -apple-system, BlinkMacSystemFont, 'Segoe UI', system-ui, sans-serif;
+    background: var(--gradient-bg);
+    margin: 0;
+    padding: 0;
+    min-height: 100vh;
+}
+/* Main container */
+.gradio-container {
+    max-width: 800px !important;
+    margin: 0 auto !important;
+    padding: 2rem 1rem !important;
+    background: rgba(255, 255, 255, 0.95);
+    backdrop-filter: blur(10px);
+    border-radius: 24px;
+    box-shadow: 0 25px 50px -12px rgba(0, 0, 0, 0.25);
+    margin-top: 2rem !important;
+    margin-bottom: 2rem !important;
+}
+/* Main title */
+.gradio-container h1 {
+    color: var(--text-primary);
+    font-size: 2.5rem;
+    font-weight: 700;
+    text-align: center;
+    margin-bottom: 0.5rem;
+    background: linear-gradient(135deg, var(--primary-color), var(--primary-hover));
+    -webkit-background-clip: text;
+    -webkit-text-fill-color: transparent;
+    background-clip: text;
+}
+/* Subtitle */
+.gradio-container p {
+    color: var(--text-secondary);
+    font-size: 1.125rem;
+    text-align: center;
+    margin-bottom: 2rem;
+    line-height: 1.6;
+}
+/* Input fields */
+.gr-textbox {
+    border: 2px solid var(--border-color) !important;
+    border-radius: 12px !important;
+    padding: 12px 16px !important;
+    font-size: 1rem !important;
+    transition: all 0.3s ease !important;
+    background: white !important;
+}
+.gr-textbox:focus {
+    border-color: var(--primary-color) !important;
+    box-shadow: 0 0 0 3px rgba(139, 92, 246, 0.1) !important;
+    outline: none !important;
+}
+/* Buttons */
+.gr-button {
+    background: var(--primary-color) !important;
+    color: white !important;
+    border: none !important;
+    border-radius: 12px !important;
+    padding: 12px 24px !important;
+    font-size: 1rem !important;
+    font-weight: 600 !important;
+    cursor: pointer !important;
+    transition: all 0.3s ease !important;
+    text-transform: none !important;
+    letter-spacing: 0.025em !important;
+}
+.gr-button:hover {
+    background: var(--primary-hover) !important;
+    transform: translateY(-2px) !important;
+    box-shadow: 0 10px 25px -5px rgba(139, 92, 246, 0.4) !important;
+}
+.gr-button:active {
+    transform: translateY(0) !important;
+}
+/* Radio buttons */
+.gr-radio {
+    margin: 1rem 0 !important;
+}
+.gr-radio label {
+    font-weight: 500 !important;
+    color: var(--text-primary) !important;
+}
+/* Status messages */
+.gr-textbox[data-testid="textbox"] {
+    font-family: 'Inter', monospace !important;
+}
+/* Download area */
+.gr-file {
+    border: 2px dashed var(--border-color) !important;
+    border-radius: 12px !important;
+    padding: 2rem !important;
+    text-align: center !important;
+    background: var(--secondary-color) !important;
+    transition: all 0.3s ease !important;
+}
+.gr-file:hover {
+    border-color: var(--primary-color) !important;
+    background: rgba(139, 92, 246, 0.05) !important;
+}
+/* Progress indicators */
+.progress-bar {
+    width: 100%;
+    height: 6px;
+    background: var(--border-color);
+    border-radius: 3px;
+    overflow: hidden;
+    margin: 1rem 0;
+}
+.progress-fill {
+    height: 100%;
+    background: var(--primary-color);
+    border-radius: 3px;
+    transition: width 0.3s ease;
+}
+/* Message states */
+.success-message {
+    background: rgba(16, 185, 129, 0.1) !important;
+    border: 1px solid var(--success-color) !important;
+    color: var(--success-color) !important;
+    border-radius: 8px !important;
+    padding: 12px 16px !important;
+    margin: 1rem 0 !important;
+}
+.error-message {
+    background: rgba(239, 68, 68, 0.1) !important;
+    border: 1px solid var(--error-color) !important;
+    color: var(--error-color) !important;
+    border-radius: 8px !important;
+    padding: 12px 16px !important;
+    margin: 1rem 0 !important;
+}
+/* Responsive design */
+@media (max-width: 768px) {
+    .gradio-container {
+        margin: 1rem !important;
+        padding: 1.5rem 1rem !important;
+        border-radius: 16px !important;
+    }
+    .gradio-container h1 {
+        font-size: 2rem !important;
+    }
+    .gradio-container p {
+        font-size: 1rem !important;
+    }
+}
+/* Footer */
+.footer {
+    text-align: center;
+    margin-top: 2rem;
+    padding-top: 2rem;
+    border-top: 1px solid var(--border-color);
+    color: var(--text-secondary);
+    font-size: 0.875rem;
+}
+/* Subtle animations */
+@keyframes fadeIn {
+    from {
+        opacity: 0;
+        transform: translateY(20px);
+    }
+    to {
+        opacity: 1;
+        transform: translateY(0);
+    }
+}
+.gradio-container > * {
+    animation: fadeIn 0.6s ease forwards;
+}
 """
+# Initialize the scraping tool
+scraper = WebScrapperTool()
+def validate_url(url):
+    """Validate the entered URL."""
+    if not url or not url.strip():
+        return False, "❌ Por favor ingresa una URL válida"
+    try:
+        normalized = scraper.normalize_url(url.strip())
+        return True, f"✅ URL válida: {normalized}"
+    except Exception as e:
+        return False, f"❌ Error en URL: {str(e)}"
+def process_url(url, format_choice, progress=gr.Progress()):
+    """Process the URL and generate the file in the selected format."""
+    if not url or not url.strip():
+        return "❌ Por favor ingresa una URL válida", None
+    try:
+        # Validate the URL
+        progress(0.1, desc="Validando URL...")
+        is_valid, message = validate_url(url)
+        if not is_valid:
+            return message, None
+        # Normalize the URL
+        progress(0.2, desc="Normalizando URL...")
+        normalized_url = scraper.normalize_url(url.strip())
+        # Detect the content type
+        progress(0.3, desc="Detectando tipo de contenido...")
+        is_image = scraper.is_image_url(normalized_url)
+        content_type = "🖼️ Imagen" if is_image else "📄 Página web"
+        # Process according to the selected format
+        progress(0.5, desc=f"Extrayendo contenido ({format_choice})...")
+        if format_choice == "PDF":
+            result = scraper.scrape_to_pdf(normalized_url)
+        else:  # TXT
+            result = scraper.scrape_to_text(normalized_url)
+        progress(0.9, desc="Finalizando...")
+        if result['status'] == 'success':
+            progress(1.0, desc="¡Completado!")
+            success_msg = f"""✅ **Procesamiento exitoso**
+🔗 **URL procesada:** {result['url']}
+📁 **Archivo generado:** {os.path.basename(result['file'])}
+📊 **Tipo de contenido:** {content_type}
+📄 **Formato de salida:** {format_choice}
+💡 **Listo para Copilot:** El archivo está optimizado para ser procesado por Microsoft Copilot"""
+            return success_msg, result['file']
+        else:
+            error_msg = f"""❌ **Error en el procesamiento**
+🔗 **URL:** {result.get('url', url)}
+⚠️ **Error:** {result['message']}
+💡 **Sugerencias:**
+- Verifica que la URL esté accesible
+- Intenta con una URL diferente
+- Algunos sitios pueden bloquear el scraping automático"""
+            return error_msg, None
+    except Exception as e:
+        error_msg = f"""❌ **Error inesperado**
+⚠️ **Error:** {str(e)}
+💡 **Intenta nuevamente con una URL diferente**"""
+        return error_msg, None
+# Build the Gradio interface
+with gr.Blocks(css=custom_css, theme=gr.themes.Soft(), title="🕸️ Web Scraper Tool") as demo:
+    gr.HTML("""
+    <div style="text-align: center; margin-bottom: 2rem;">
+        <h1>🕸️ Web Scraper Tool</h1>
+        <p>Extrae contenido de páginas web y conviértelo a formatos compatibles con Microsoft Copilot</p>
+    </div>
+    """)
+    with gr.Row():
+        with gr.Column(scale=3):
+            url_input = gr.Textbox(
+                label="🔗 URL de la página web",
+                placeholder="https://ejemplo.com o Https://EJEMPLO.com (mayúsculas OK)",
+                info="Soporta URLs con cualquier formato de mayúsculas/minúsculas",
+                lines=1
+            )
+        with gr.Column(scale=1):
+            format_choice = gr.Radio(
+                choices=["PDF", "TXT"],
+                value="TXT",
+                label="📄 Formato de salida",
+                info="Ambos formatos son compatibles con Copilot"
+            )
+    # On-demand URL validation button
+    validate_btn = gr.Button("🔍 Validar URL", variant="secondary", size="sm")
+    validation_output = gr.Textbox(
+        label="Estado de validación",
+        interactive=False,
+        show_label=False
+    )
+    # Main processing button
+    process_btn = gr.Button("🚀 Extraer y Convertir", variant="primary", size="lg")
+    # Results area
+    with gr.Row():
+        with gr.Column(scale=2):
+            result_output = gr.Textbox(
+                label="📊 Resultado del procesamiento",
+                lines=8,
+                interactive=False
+            )
+        with gr.Column(scale=1):
+            file_output = gr.File(
+                label="📁 Archivo generado",
+                interactive=False
+            )
+    # Additional information
+    gr.HTML("""
+    <div class="footer">
+        <h3>ℹ️ Información de uso</h3>
+        <ul style="text-align: left; max-width: 600px; margin: 0 auto;">
+            <li><strong>URLs flexibles:</strong> Funciona con cualquier formato (HTTP, HTTPS, con/sin www)</li>
+            <li><strong>Detección automática:</strong> Identifica si el contenido es una imagen o texto</li>
+            <li><strong>Optimizado para Copilot:</strong> Los archivos generados están formateados para Microsoft Copilot</li>
+            <li><strong>Formatos soportados:</strong> PDF (con formato visual) y TXT (texto plano)</li>
+        </ul>
+        <p style="margin-top: 1rem; color: #64748b;">
+            Desarrollado con ❤️ para maximizar la compatibilidad con herramientas de IA
+        </p>
+    </div>
+    """)
+    # Wire up events
+    validate_btn.click(
+        fn=lambda url: validate_url(url)[1],
+        inputs=[url_input],
+        outputs=[validation_output]
+    )
+    process_btn.click(
+        fn=process_url,
+        inputs=[url_input, format_choice],
+        outputs=[result_output, file_output]
+    )
+    # Automatic validation whenever the URL changes
+    url_input.change(
+        fn=lambda url: validate_url(url)[1] if url else "",
+        inputs=[url_input],
+        outputs=[validation_output]
+    )
+# Launch settings for Hugging Face Spaces
+if __name__ == "__main__":
+    demo.launch(
+        server_name="0.0.0.0",
+        server_port=7860,
+        show_error=True,
+        share=False
+    )
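
Reviewer note: one detail in the code removed from app.py is worth checking in the new module. The old `scrape_to_text` filled its "Fecha de extracción" (extraction date) field with `requests.utils.default_headers()['User-Agent']`, which is a User-Agent string, not a date. Whether the new web_scraper_tool.py does the same is not visible in this part of the diff. A minimal sketch of a header that records a real timestamp, assuming the same field labels; the helper name `build_metadata_header` is hypothetical:

```python
from datetime import datetime, timezone


def build_metadata_header(url: str, text: str) -> str:
    """Return the TXT metadata header followed by the extracted text."""
    # Timezone-aware UTC timestamp in ISO 8601 form, truncated to seconds.
    extracted_at = datetime.now(timezone.utc).isoformat(timespec="seconds")
    return (
        f"URL: {url}\n"
        f"Fecha de extracción: {extracted_at}\n"
        f"Caracteres extraídos: {len(text)}\n"
        f"{'=' * 50}\n"
        f"{text}"
    )
```

Using UTC keeps the recorded time unambiguous regardless of where the Space happens to run.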

requirements.txt CHANGED

@@ -1,7 +1,29 @@
-gradio==4.12.0
 requests==2.31.0
-beautifulsoup4==4.12.2
-weasyprint==60.2
-Pillow==10.0.0
 lxml==4.9.3
-html5lib==1.1

+# Gradio framework - stable version compatible with HF Spaces
+gradio==4.44.1
+# Web scraping and HTML parsing
 requests==2.31.0
+beautifulsoup4==4.12.3
 lxml==4.9.3
+# HTML-to-PDF conversion - stable version
+weasyprint==60.2
+# Image handling
+Pillow==10.0.1
+# WeasyPrint-specific dependencies
+cffi==1.16.0
+pycparser==2.21
+cssselect2==0.7.0
+tinycss2==1.2.1
+webencodings==0.5.1
+# Compatible networking dependencies
+urllib3==2.0.7
+certifi==2023.7.22
+charset-normalizer==3.3.2
+idna==3.4
+# Additional utilities
+python-dateutil==2.8.2

test_app.py ADDED

@@ -0,0 +1,19 @@
+import gradio as gr
+def test_function(text):
+    """Simple test function."""
+    return f"✅ Funciona! Recibido: {text}"
+# Build the test interface
+with gr.Blocks(title="Test App") as demo:
+    gr.HTML("<h1>🧪 Test de Funcionamiento</h1>")
+    with gr.Row():
+        input_text = gr.Textbox(label="Texto de prueba", placeholder="Escribe algo...")
+        output_text = gr.Textbox(label="Resultado", interactive=False)
+    btn = gr.Button("Probar", variant="primary")
+    btn.click(fn=test_function, inputs=[input_text], outputs=[output_text])
+if __name__ == "__main__":
+    demo.launch()

web_scraper_tool.py CHANGED

@@ -1,201 +1,416 @@
 import requests
 from bs4 import BeautifulSoup
-import os
 from weasyprint import HTML, CSS
-from PIL import Image
-from io import BytesIO
 import re
 import random
-import mimetypes
-import json
-import time
 class WebScrapperTool:
-    """Herramienta para hacer scraping de páginas web y convertir a diferentes formatos"""
-    def __init__(self, output_dir):
-        """Inicializa la herramienta
-        Args:
-            output_dir: Directorio donde se guardarán los archivos
-        """
         self.output_dir = output_dir
-        self.session = self._create_session()
-        # Crear directorio de salida si no existe
         if not os.path.exists(output_dir):
             os.makedirs(output_dir)
-    def _create_session(self):
-        """Crea una sesión de requests con user agent aleatorio"""
-        session = requests.Session()
-        # Lista de user agents comunes
-        user_agents = [
             'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
-            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.0 Safari/605.1.15',
-            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36',
-            'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36',
-            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36 Edg/92.0.902.67'
         ]
-        # Configurar headers con user agent aleatorio
-        headers = {
-            'User-Agent': random.choice(user_agents),
             'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
             'Accept-Language': 'es-ES,es;q=0.8,en-US;q=0.5,en;q=0.3',
-            'Upgrade-Insecure-Requests': '1',
-            'DNT': '1',  # Do Not Track
         }
-        session.headers.update(headers)
-        return session
-    def is_image_url(self, url):
-        """Verifica si una URL es una imagen basándose en la extensión y/o Content-Type
-        Args:
-            url: URL a verificar
-        Returns:
-            bool: True si es una imagen, False en caso contrario
-        """
-        # Verificar por extensión de archivo
-        image_extensions = ['.jpg', '.jpeg', '.png', '.gif', '.webp', '.svg', '.bmp', '.tiff']
-        if any(url.lower().endswith(ext) for ext in image_extensions):
             return True
-        # Check by Content-Type
         try:
-            response = self.session.head(url, timeout=10)
-            content_type = response.headers.get('Content-Type', '')
-            return content_type.startswith('image/')
         except:
-            # If the header check fails, rely on the extension check alone
-            return False
-    def get_image_metadata(self, url):
-        """Get the metadata of an image
-        Args:
-            url: URL of the image
-        Returns:
-            dict: Dictionary with the metadata
-        """
         try:
-            # Fetch the image
-            response = self.session.get(url, timeout=10)
             response.raise_for_status()
-            # Basic metadata
-            metadata = {
-                'URL': url,
-                'Content-Type': response.headers.get('Content-Type', 'Desconocido'),
-                'Tamaño (bytes)': len(response.content),
-            }
-            # Try to get the dimensions
             try:
-                img = Image.open(BytesIO(response.content))
-                metadata['Dimensiones'] = f"{img.width}x{img.height} píxeles"
-                metadata['Formato'] = img.format
-                metadata['Modo'] = img.mode
-            except:
-                metadata['Dimensiones'] = "No se pudieron determinar"
-            return metadata
         except Exception as e:
-            return {'Error': str(e)}
-    def scrape_to_text(self, url, output_path=None):
-        """Scrape a URL and save the content as plain text
-        Args:
-            url: URL to scrape
-            output_path: Path where the text file is saved
-        Returns:
-            str: Path to the generated file
-        """
-        try:
-            # Fetch the page content
-            response = self.session.get(url, timeout=15)
             response.raise_for_status()
-            # Parse the HTML
             soup = BeautifulSoup(response.text, 'html.parser')
-            # Remove scripts, styles, and non-visible elements
-            for element in soup(['script', 'style', 'head', 'title', 'meta', '[document]']):
-                element.extract()
-            # Get the text
-            text = soup.get_text(separator='\n')
-            # Collapse excess whitespace
-            lines = [line.strip() for line in text.split('\n')]
-            text = '\n'.join(line for line in lines if line)
-            # Generate a filename if none is provided
-            if not output_path:
-                filename = f"texto_{int(time.time())}.txt"
-                output_path = os.path.join(self.output_dir, filename)
-            # Save the text to a file
-            with open(output_path, 'w', encoding='utf-8') as f:
-                f.write(f"URL: {url}\n\n")
-                f.write(text)
-            return output_path
-        except Exception as e:
-            raise Exception(f"Error al hacer scraping a texto: {str(e)}")
-    def scrape_to_pdf(self, url, output_path=None):
-        """Scrape a URL and save the content as PDF
-        Args:
-            url: URL to scrape
-            output_path: Path where the PDF file is saved
-        Returns:
-            str: Path to the generated file
-        """
         try:
-            # Generate a filename if none is provided
-            if not output_path:
-                filename = f"documento_{int(time.time())}.pdf"
-                output_path = os.path.join(self.output_dir, filename)
-            # CSS to improve the PDF styling
-            css_string = """
-                @page {
-                    margin: 1cm;
-                }
-                body {
-                    font-family: Arial, sans-serif;
-                    line-height: 1.5;
-                    font-size: 12px;
-                }
-                h1, h2, h3, h4, h5, h6 {
-                    margin-top: 1em;
-                    margin-bottom: 0.5em;
-                }
-                p {
-                    margin-bottom: 0.5em;
-                }
-                img {
-                    max-width: 100%;
-                    height: auto;
-                }
             """
-            # Generate the PDF
-            HTML(url=url).write_pdf(
-                output_path,
-                stylesheets=[CSS(string=css_string)]
-            )
-            return output_path
         except Exception as e:
-            raise Exception(f"Error al convertir a PDF: {str(e)}")

+import os
 import requests
 from bs4 import BeautifulSoup
 from weasyprint import HTML, CSS
+from urllib.parse import urlparse, urlunparse
 import re
+from PIL import Image
+import io
 import random
 class WebScrapperTool:
+    def __init__(self, output_dir="output"):
         self.output_dir = output_dir
         if not os.path.exists(output_dir):
             os.makedirs(output_dir)
+        # Multiple user agents to avoid being blocked
+        self.user_agents = [
             'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
+            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
+            'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
+            'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:89.0) Gecko/20100101 Firefox/89.0',
+            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:89.0) Gecko/20100101 Firefox/89.0'
         ]
+    def get_headers(self):
+        """Generate headers with a randomly chosen user agent to avoid detection"""
+        return {
+            'User-Agent': random.choice(self.user_agents),
             'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
             'Accept-Language': 'es-ES,es;q=0.8,en-US;q=0.5,en;q=0.3',
+            'Accept-Encoding': 'gzip, deflate',
+            'DNT': '1',
+            'Connection': 'keep-alive',
+            'Upgrade-Insecure-Requests': '1'
         }
+    def normalize_url(self, url):
+        """Normalize a URL: lowercase the scheme and host, and add https when no scheme is given"""
+        if not url:
+            raise ValueError("URL no puede estar vacía")
+        url = url.strip()
+        # Lowercase the scheme but keep the rest of the URL unchanged
+        if url.lower().startswith('http://'):
+            url = 'http://' + url[7:]
+        elif url.lower().startswith('https://'):
+            url = 'https://' + url[8:]
+        else:
+            # No scheme given: default to https
+            url = 'https://' + url
+        try:
+            parsed = urlparse(url)
+            # Normalize the components
+            scheme = parsed.scheme.lower()
+            netloc = parsed.netloc.lower() if parsed.netloc else ''
+            path = parsed.path
+            params = parsed.params
+            query = parsed.query
+            fragment = parsed.fragment
+            # If netloc is empty but there is a path, try to recover the host from it
+            if not netloc and path:
+                parts = path.split('/', 1)
+                netloc = parts[0].lower()
+                path = '/' + parts[1] if len(parts) > 1 else ''
+            normalized_url = urlunparse((scheme, netloc, path, params, query, fragment))
+            return normalized_url
+        except Exception as e:
+            raise ValueError(f"URL inválida: {url}. Error: {str(e)}")
+    def is_image_url(self, url):
+        """Detect whether a URL points to an image"""
+        image_extensions = {'.jpg', '.jpeg', '.png', '.gif', '.webp', '.svg', '.bmp', '.tiff', '.ico'}
+        # Check by extension
+        parsed_url = urlparse(url.lower())
+        path = parsed_url.path
+        if any(path.endswith(ext) for ext in image_extensions):
             return True
+        # Check by Content-Type when possible
         try:
+            response = requests.head(url, headers=self.get_headers(), timeout=10)
+            content_type = response.headers.get('content-type', '').lower()
+            if content_type.startswith('image/'):
+                return True
         except:
+            pass
+        return False
+    def get_clean_html_for_pdf(self, html_content, base_url):
+        """Clean HTML specifically for robust PDF conversion"""
+        soup = BeautifulSoup(html_content, 'html.parser')
+        # Remove elements that are problematic for PDF rendering
+        for element in soup(['script', 'style', 'noscript', 'iframe', 'embed', 'object', 'form']):
+            element.decompose()
+        # Remove problematic attributes
+        for tag in soup.find_all():
+            # Keep only safe attributes
+            safe_attrs = ['href', 'src', 'alt', 'title', 'class', 'id']
+            attrs_to_remove = [attr for attr in tag.attrs if attr not in safe_attrs]
+            for attr in attrs_to_remove:
+                del tag[attr]
+        # Build a complete HTML structure if one does not exist
+        if not soup.html:
+            new_soup = BeautifulSoup('<!DOCTYPE html><html><head></head><body></body></html>', 'html.parser')
+            new_soup.body.extend(soup.contents[:])
+            soup = new_soup
+        # Add basic CSS for better PDF rendering
+        css_style = soup.new_tag('style')
+        css_style.string = """
+        body {
+            font-family: Arial, sans-serif;
+            line-height: 1.6;
+            margin: 20px;
+            color: #333;
+            max-width: 800px;
+        }
+        h1, h2, h3, h4, h5, h6 {
+            color: #2c3e50;
+            margin-top: 20px;
+            page-break-after: avoid;
+        }
+        p {
+            margin-bottom: 10px;
+            text-align: justify;
+        }
+        a {
+            color: #3498db;
+            text-decoration: none;
+        }
+        img {
+            max-width: 100%;
+            height: auto;
+            page-break-inside: avoid;
+        }
+        table {
+            border-collapse: collapse;
+            width: 100%;
+            page-break-inside: avoid;
+        }
+        th, td {
+            border: 1px solid #ddd;
+            padding: 8px;
+            text-align: left;
+        }
+        @page {
+            margin: 2cm;
+            @bottom-center {
+                content: "Página " counter(page);
+            }
+        }
+        """
+        # Insert the CSS into the head, creating the head if the page lacks one
+        if not soup.head:
+            soup.html.insert(0, soup.new_tag('head'))
+        soup.head.append(css_style)
+        return str(soup)
+    def scrape_to_pdf(self, url, filename=None):
+        """Convert a web page to PDF; returns a status dict instead of raising on errors"""
         try:
+            normalized_url = self.normalize_url(url)
+            # Check whether the URL is an image
+            if self.is_image_url(normalized_url):
+                return self._handle_image_to_pdf(normalized_url, filename)
+            # Fetch the web content
+            response = requests.get(normalized_url, headers=self.get_headers(), timeout=30)
             response.raise_for_status()
+            response.encoding = response.apparent_encoding or 'utf-8'
+            # Clean the HTML for PDF
+            clean_html = self.get_clean_html_for_pdf(response.text, normalized_url)
+            # Generate the filename
+            if not filename:
+                domain = urlparse(normalized_url).netloc.replace('www.', '')
+                domain_clean = re.sub(r'[^a-zA-Z0-9_-]', '_', domain)
+                filename = f"scraped_{domain_clean}.pdf"
+            if not filename.endswith('.pdf'):
+                filename += '.pdf'
+            pdf_path = os.path.join(self.output_dir, filename)
+            # Render with WeasyPrint, using the page URL as base_url for relative links
             try:
+                html_doc = HTML(string=clean_html, base_url=normalized_url)
+                html_doc.write_pdf(pdf_path)
+            except Exception as weasy_error:
+                # Fallback: use simpler HTML built from the escaped page text
+                from html import escape
+                page_text = escape(BeautifulSoup(response.text, 'html.parser').get_text())
+                simple_html = f"""
+                <!DOCTYPE html>
+                <html>
+                <head>
+                    <meta charset="utf-8">
+                    <title>Web Scraping Result</title>
+                    <style>
+                        body {{ font-family: Arial, sans-serif; margin: 20px; line-height: 1.6; }}
+                        .header {{ background-color: #f8f9fa; padding: 10px; margin-bottom: 20px; }}
+                        .content {{ max-width: 800px; }}
+                    </style>
+                </head>
+                <body>
+                    <div class="header">
+                        <h1>Contenido Web Extraído</h1>
+                        <p><strong>URL:</strong> {escape(normalized_url)}</p>
+                    </div>
+                    <div class="content">
+                        {page_text}
+                    </div>
+                </body>
+                </html>
+                """
+                html_doc = HTML(string=simple_html)
+                html_doc.write_pdf(pdf_path)
+            return {
+                'status': 'success',
+                'file': pdf_path,
+                'url': normalized_url,
+                'message': f'PDF generado exitosamente: {filename}'
+            }
+        except requests.RequestException as e:
+            return {
+                'status': 'error',
+                'message': f'Error al acceder a la URL: {str(e)}',
+                'url': url
+            }
         except Exception as e:
+            return {
+                'status': 'error',
+                'message': f'Error al generar PDF: {str(e)}',
+                'url': url
+            }
+    def scrape_to_text(self, url, filename=None):
+        """Convert a web page to plain text"""
+        try:
+            normalized_url = self.normalize_url(url)
+            # Check whether the URL is an image
+            if self.is_image_url(normalized_url):
+                return self._handle_image_to_text(normalized_url, filename)
+            # Fetch the web content
+            response = requests.get(normalized_url, headers=self.get_headers(), timeout=30)
             response.raise_for_status()
+            response.encoding = response.apparent_encoding or 'utf-8'
+            # Extract clean text
             soup = BeautifulSoup(response.text, 'html.parser')
+            # Remove unwanted elements
+            for element in soup(['script', 'style', 'noscript', 'header', 'footer', 'nav', 'aside']):
+                element.decompose()
+            # Extract the text with newline separators
+            text_content = soup.get_text(separator='\n', strip=True)
+            # Drop blank lines
+            lines = [line.strip() for line in text_content.split('\n') if line.strip()]
+            clean_text = '\n'.join(lines)
+            # Add metadata
+            from datetime import datetime
+            metadata = f"""CONTENIDO WEB EXTRAÍDO
+URL: {normalized_url}
+Fecha de extracción: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}
+Caracteres extraídos: {len(clean_text)}
+Tipo de contenido: Texto
+{'='*50}
+{clean_text}"""
+            # Generate the filename
+            if not filename:
+                domain = urlparse(normalized_url).netloc.replace('www.', '')
+                domain_clean = re.sub(r'[^a-zA-Z0-9_-]', '_', domain)
+                filename = f"scraped_{domain_clean}.txt"
+            if not filename.endswith('.txt'):
+                filename += '.txt'
+            txt_path = os.path.join(self.output_dir, filename)
+            with open(txt_path, 'w', encoding='utf-8') as f:
+                f.write(metadata)
+            return {
+                'status': 'success',
+                'file': txt_path,
+                'url': normalized_url,
+                'message': f'Texto extraído exitosamente: {filename}'
+            }
+        except Exception as e:
+            return {
+                'status': 'error',
+                'message': f'Error al extraer texto: {str(e)}',
+                'url': url
+            }
+    def _handle_image_to_pdf(self, url, filename):
+        """Handle converting an image to PDF"""
         try:
+            response = requests.get(url, headers=self.get_headers(), timeout=30)
+            response.raise_for_status()
+            # Build HTML that embeds the image
+            html_content = f"""
+            <!DOCTYPE html>
+            <html>
+            <head>
+                <meta charset="utf-8">
+                <style>
+                    body {{ margin: 0; padding: 20px; text-align: center; font-family: Arial, sans-serif; }}
+                    img {{ max-width: 100%; height: auto; }}
+                    .info {{ margin-top: 20px; }}
+                </style>
+            </head>
+            <body>
+                <div class="info">
+                    <h1>Imagen Extraída</h1>
+                    <p><strong>URL:</strong> {url}</p>
+                    <p><strong>Tipo:</strong> Imagen</p>
+                </div>
+                <img src="{url}" alt="Imagen extraída">
+            </body>
+            </html>
             """
+            if not filename:
+                filename = "image_scraped.pdf"
+            if not filename.endswith('.pdf'):
+                filename += '.pdf'
+            pdf_path = os.path.join(self.output_dir, filename)
+            HTML(string=html_content).write_pdf(pdf_path)
+            return {
+                'status': 'success',
+                'file': pdf_path,
+                'url': url,
+                'message': f'Imagen convertida a PDF: {filename}'
+            }
+        except Exception as e:
+            return {
+                'status': 'error',
+                'message': f'Error al procesar imagen: {str(e)}',
+                'url': url
+            }
+    def _handle_image_to_text(self, url, filename):
+        """Handle converting an image to a text file with metadata"""
+        try:
+            response = requests.get(url, headers=self.get_headers(), timeout=30)
+            response.raise_for_status()
+            # Read the image information
+            try:
+                img = Image.open(io.BytesIO(response.content))
+                img_info = f"""IMAGEN DETECTADA
+URL: {url}
+Formato: {img.format}
+Dimensiones: {img.size[0]}x{img.size[1]} píxeles
+Modo: {img.mode}
+Tamaño del archivo: {len(response.content)} bytes
+Esta URL contiene una imagen, no texto extraíble.
+Para procesar el contenido visual, considera usar herramientas de OCR.
+"""
+            except Exception:
+                img_info = f"""IMAGEN DETECTADA
+URL: {url}
+Tamaño del archivo: {len(response.content)} bytes
+Esta URL contiene una imagen, no texto extraíble.
+"""
+            if not filename:
+                filename = "image_info.txt"
+            if not filename.endswith('.txt'):
+                filename += '.txt'
+            txt_path = os.path.join(self.output_dir, filename)
+            with open(txt_path, 'w', encoding='utf-8') as f:
+                f.write(img_info)
+            return {
+                'status': 'success',
+                'file': txt_path,
+                'url': url,
+                'message': f'Información de imagen guardada: {filename}'
+            }
         except Exception as e:
+            return {
+                'status': 'error',
+                'message': f'Error al procesar imagen: {str(e)}',
+                'url': url
+            }