Upload 5 files
- DEPLOYMENT_GUIDE.md +64 -0
- README.md +36 -6
- app.py +130 -0
- requirements.txt +7 -0
- web_scraper_tool.py +201 -0
DEPLOYMENT_GUIDE.md
ADDED
@@ -0,0 +1,64 @@
+
+# 🚀 Deployment Guide for Hugging Face Spaces
+
+## Step 1: Prepare your account
+1. Go to https://huggingface.co/
+2. Create an account or sign in
+3. Generate an Access Token with write permissions:
+   - Go to Settings > Access Tokens
+   - Create a new token with the "Write" scope
+
+## Step 2: Create the Space
+1. Go to https://huggingface.co/spaces
+2. Click "Create new Space"
+3. Configure:
+   - Space name: web-scraper-tool (or any name you prefer)
+   - License: Apache-2.0
+   - Select the SDK: Gradio
+   - SDK version: 4.12.0
+   - Hardware: CPU basic (free)
+   - Visibility: Public
+
+## Step 3: Upload the files
+### Option A: Web interface
+1. Once the Space is created, go to "Files"
+2. Upload each file one by one:
+   - app.py
+   - web_scraper_tool.py
+   - requirements.txt
+   - .gitattributes
+   - README.md
+
+### Option B: Git (recommended)
+```bash
+# Clone the repository (replace TU_USERNAME and TU_SPACE_NAME with your own values)
+git clone https://huggingface.co/spaces/TU_USERNAME/TU_SPACE_NAME
+cd TU_SPACE_NAME
+
+# Copy all the files here,
+# then commit and push
+git add .
+git commit -m "Initial commit: Web Scraper Tool"
+git push
+```
+
+## Step 4: Verify the deployment
+1. The Space starts building automatically
+2. Follow the logs in real time on the Space page
+3. Once the build completes, the application becomes available
+
+## Step 5: Additional configuration (optional)
+- Change the hardware if you need more resources
+- Configure secrets if you use API keys
+- Customize README.md with more information
+
+## 🎯 Example URLs
+- Your Space will be available at: https://huggingface.co/spaces/TU_USERNAME/TU_SPACE_NAME
+- The application starts automatically
+
+## 🔧 Common troubleshooting
+1. **Dependency error**: Check requirements.txt
+2. **Import error**: Make sure every file has been uploaded
+3. **Build failed**: Review the logs for specific errors
+
+Your application will be ready in a few minutes! 🎉
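The guide's troubleshooting list traces import errors to files that were never uploaded. A minimal sketch of a check to run before `git push`, assuming the files sit together in one local directory. The helper `missing_space_files` and the `REQUIRED_FILES` tuple are hypothetical and not part of this commit; `.gitattributes` is left out on the assumption that the Space repository already provides it.

```python
from pathlib import Path

# Files the guide asks you to upload in Step 3 (hypothetical constant).
REQUIRED_FILES = ("app.py", "web_scraper_tool.py", "requirements.txt", "README.md")


def missing_space_files(directory, required=REQUIRED_FILES):
    """Return the required file names that are absent from `directory`."""
    root = Path(directory)
    return [name for name in required if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_space_files(".")
    if missing:
        print("Missing before push:", ", ".join(missing))
    else:
        print("All required files are present.")
```

Running it from the cloned Space directory lists anything that still needs to be copied in.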
README.md
CHANGED
@@ -1,12 +1,42 @@
 ---
-title:
-emoji:
-colorFrom:
-colorTo:
+title: Web Scraper Tool
+emoji: 🕸️
+colorFrom: purple
+colorTo: blue
 sdk: gradio
-sdk_version:
+sdk_version: 4.12.0
 app_file: app.py
 pinned: false
 ---
 
-
+# 🕸️ Web Scraper Tool
+
+A web tool for scraping web pages and converting them to PDF or plain text.
+The application is optimized to produce files that Copilot can process.
+
+## ✨ Features
+
+- ✅ Web content extraction
+- 📄 Conversion to PDF or plain text
+- 🖼️ Automatic image detection
+- 🎨 Minimalist, professional interface
+- 🤖 Optimized to generate Copilot-compatible files
+
+## 🚀 Usage
+
+1. Enter the URL of the web page you want to process
+2. Select the output format (PDF or TXT)
+3. Click "Process URL"
+4. Download the generated file
+
+## 🛠️ Technologies used
+
+- Python
+- Gradio
+- BeautifulSoup
+- WeasyPrint
+- Hugging Face Spaces
+
+## 👨‍💻 Author
+
+Built with 💜 to solve web content processing problems
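The `---` block at the top of README.md is the metadata that Hugging Face Spaces reads (SDK, SDK version, entry file). A minimal sketch of reading that block with the standard library, assuming flat `key: value` lines only. `parse_front_matter` is a hypothetical helper, not a full YAML parser, and every value comes back as a string.

```python
def parse_front_matter(text):
    """Parse a flat `key: value` front-matter block delimited by `---` lines."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    metadata = {}
    for line in lines[1:]:
        if line.strip() == "---":  # closing delimiter ends the block
            return metadata
        key, sep, value = line.partition(":")
        if sep:
            metadata[key.strip()] = value.strip()
    return {}  # no closing delimiter found


# Sample input mirroring the fields added in this commit (emoji and colors omitted).
README_HEAD = """---
title: Web Scraper Tool
sdk: gradio
sdk_version: 4.12.0
app_file: app.py
pinned: false
---

# Web Scraper Tool
"""

if __name__ == "__main__":
    meta = parse_front_matter(README_HEAD)
    print(meta["sdk"], meta["sdk_version"], meta["app_file"])
```

A check like this could confirm that `app_file` names a file that exists before the Space is built.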
app.py
ADDED
@@ -0,0 +1,130 @@
+import gradio as gr
+import os
+import tempfile
+import time
+from web_scraper_tool import WebScrapperTool
+
+# Initialize the scraper
+scraper = WebScrapperTool("temp_output")
+
+def scrape_url(url, output_format, progress=gr.Progress()):
+    """Main function that processes the submitted URL"""
+    progress(0, desc="Starting...")
+
+    # Validate the URL
+    if not url.startswith(('http://', 'https://')):
+        return None, "Error: The URL must start with http:// or https://"
+
+    try:
+        progress(0.2, desc="Analyzing URL...")
+        # Detect whether the URL points to an image
+        is_image = scraper.is_image_url(url)
+
+        progress(0.4, desc="Starting download...")
+
+        temp_dir = tempfile.mkdtemp()
+        timestamp = int(time.time())
+
+        if is_image:
+            progress(0.6, desc="Processing image...")
+            filename = f"imagen_{timestamp}.txt"
+            output_path = os.path.join(temp_dir, filename)
+
+            # Fetch the image metadata
+            metadata = scraper.get_image_metadata(url)
+            with open(output_path, 'w', encoding='utf-8') as f:
+                f.write(f"Image URL: {url}\n\n")
+                f.write("Image metadata:\n")
+                for key, value in metadata.items():
+                    f.write(f"{key}: {value}\n")
+
+            progress(1.0, desc="Done!")
+            return output_path, "✅ File generated successfully. The URL was detected as an image."
+        else:
+            if output_format == "txt":
+                progress(0.6, desc="Extracting text...")
+                filename = f"contenido_{timestamp}.txt"
+                output_path = os.path.join(temp_dir, filename)
+                scraper.scrape_to_text(url, output_path)
+            else:  # PDF
+                progress(0.6, desc="Generating PDF...")
+                filename = f"contenido_{timestamp}.pdf"
+                output_path = os.path.join(temp_dir, filename)
+                scraper.scrape_to_pdf(url, output_path)
+
+            progress(1.0, desc="Done!")
+            return output_path, f"✅ File generated successfully in {output_format.upper()} format"
+
+    except Exception as e:
+        return None, f"❌ Error: {str(e)}"
+
+# Custom CSS for a minimalist look
+css = """
+.gradio-container {
+    font-family: 'Inter', -apple-system, BlinkMacSystemFont, sans-serif;
+    max-width: 800px;
+    margin: 0 auto;
+}
+.main-header {
+    text-align: center;
+    margin-bottom: 2rem;
+}
+.app-description {
+    margin-bottom: 2rem;
+    text-align: center;
+    color: #666;
+}
+.gr-button {
+    border-radius: 4px !important;
+}
+.gr-button-primary {
+    background: linear-gradient(90deg, #5c1edb, #775af5) !important;
+}
+footer {
+    margin-top: 3rem;
+    text-align: center;
+    font-size: 0.8rem;
+    color: #888;
+}
+"""
+
+# Define the Gradio interface
+with gr.Blocks(css=css) as demo:
+    gr.HTML("<h1 class='main-header'>🕸️ Web Scraper Tool</h1>")
+    gr.HTML("<p class='app-description'>Enter a URL to extract its content as a PDF or plain text. The tool automatically detects whether the URL points to an image.</p>")
+
+    with gr.Row():
+        url_input = gr.Textbox(
+            label="URL",
+            placeholder="https://ejemplo.com",
+            info="Enter the URL you want to process"
+        )
+
+    with gr.Row():
+        format_select = gr.Radio(
+            ["txt", "pdf"],
+            label="Output format",
+            value="txt",
+            info="Select the format used to save the content"
+        )
+
+    with gr.Row():
+        submit_btn = gr.Button("Process URL", variant="primary")
+
+    with gr.Row():
+        output_message = gr.Textbox(label="Status")
+
+    with gr.Row():
+        file_output = gr.File(label="Generated file")
+
+    submit_btn.click(
+        fn=scrape_url,
+        inputs=[url_input, format_select],
+        outputs=[file_output, output_message]
+    )
+
+    gr.HTML("<footer>Built with <a href='https://gradio.app'>Gradio</a> and <a href='https://huggingface.co/spaces'>Hugging Face Spaces</a></footer>")
+
+# Launch the application
+if __name__ == "__main__":
+    demo.launch()
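`scrape_url` accepts any string that starts with `http://` or `https://`, so an input such as `https://` on its own passes the check and only fails later during the request. A slightly stricter variant could be sketched with `urllib.parse`. The function name `is_valid_http_url` is hypothetical, and this is an illustration rather than what the commit does.

```python
from urllib.parse import urlparse


def is_valid_http_url(url):
    """Return True when `url` has an http(s) scheme and a non-empty host."""
    if not isinstance(url, str):
        return False
    parsed = urlparse(url.strip())
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


if __name__ == "__main__":
    for candidate in ("https://ejemplo.com", "https://", "ftp://ejemplo.com", "ejemplo.com"):
        print(candidate, is_valid_http_url(candidate))
```

Rejecting such inputs up front would let the interface show the validation message instead of a lower-level network error.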
requirements.txt
ADDED
@@ -0,0 +1,7 @@
+gradio==4.12.0
+requests==2.31.0
+beautifulsoup4==4.12.2
+weasyprint==60.2
+Pillow==10.0.0
+lxml==4.9.3
+html5lib==1.1
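requirements.txt pins Gradio to 4.12.0, the same value the README front matter declares as `sdk_version`, and the two need to stay in step. A small sketch of reading exact `name==version` pins so they can be compared. `parse_pins` is a hypothetical helper; skipping every line that is not an `==` pin is an assumption based on this file's style.

```python
def parse_pins(requirements_text):
    """Map lowercased package names to versions for lines shaped like `name==version`."""
    pins = {}
    for raw_line in requirements_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments and whitespace
        if "==" not in line:
            continue  # skip blank lines and unpinned or range specifiers
        name, version = line.split("==", 1)
        pins[name.strip().lower()] = version.strip()
    return pins


# The seven pins added in this commit.
REQUIREMENTS = """gradio==4.12.0
requests==2.31.0
beautifulsoup4==4.12.2
weasyprint==60.2
Pillow==10.0.0
lxml==4.9.3
html5lib==1.1
"""

if __name__ == "__main__":
    pins = parse_pins(REQUIREMENTS)
    # "4.12.0" is the sdk_version declared in the README front matter.
    print("gradio pin matches sdk_version:", pins.get("gradio") == "4.12.0")
```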
web_scraper_tool.py
ADDED
@@ -0,0 +1,201 @@
+import requests
+from bs4 import BeautifulSoup
+import os
+from weasyprint import HTML, CSS
+from PIL import Image
+from io import BytesIO
+import re
+import random
+import mimetypes
+import json
+import time
+
+class WebScrapperTool:
+    """Tool for scraping web pages and converting them to different formats"""
+
+    def __init__(self, output_dir):
+        """Initialize the tool
+
+        Args:
+            output_dir: Directory where the files will be saved
+        """
+        self.output_dir = output_dir
+        self.session = self._create_session()
+
+        # Create the output directory if it does not exist
+        if not os.path.exists(output_dir):
+            os.makedirs(output_dir)
+
+    def _create_session(self):
+        """Create a requests session with a random user agent"""
+        session = requests.Session()
+
+        # List of common user agents
+        user_agents = [
+            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
+            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.0 Safari/605.1.15',
+            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36',
+            'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36',
+            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36 Edg/92.0.902.67'
+        ]
+
+        # Configure headers with a random user agent
+        headers = {
+            'User-Agent': random.choice(user_agents),
+            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
+            'Accept-Language': 'es-ES,es;q=0.8,en-US;q=0.5,en;q=0.3',
+            'Upgrade-Insecure-Requests': '1',
+            'DNT': '1',  # Do Not Track
+        }
+
+        session.headers.update(headers)
+        return session
+
+    def is_image_url(self, url):
+        """Check whether a URL is an image based on its extension and/or Content-Type
+
+        Args:
+            url: URL to check
+
+        Returns:
+            bool: True if it is an image, False otherwise
+        """
+        # Check by file extension
+        image_extensions = ['.jpg', '.jpeg', '.png', '.gif', '.webp', '.svg', '.bmp', '.tiff']
+        if any(url.lower().endswith(ext) for ext in image_extensions):
+            return True
+
+        # Check by Content-Type
+        try:
+            response = self.session.head(url, timeout=10, allow_redirects=True)  # HEAD does not follow redirects by default
+            content_type = response.headers.get('Content-Type', '')
+            return content_type.startswith('image/')
+        except Exception:
+            # The header check failed and the extension did not match above, so treat the URL as a non-image
+            return False
+
+    def get_image_metadata(self, url):
+        """Get the metadata of an image
+
+        Args:
+            url: URL of the image
+
+        Returns:
+            dict: Dictionary with the metadata
+        """
+        try:
+            # Fetch the image
+            response = self.session.get(url, timeout=10)
+            response.raise_for_status()
+
+            # Basic metadata
+            metadata = {
+                'URL': url,
+                'Content-Type': response.headers.get('Content-Type', 'Unknown'),
+                'Size (bytes)': len(response.content),
+            }
+
+            # Try to read the dimensions
+            try:
+                img = Image.open(BytesIO(response.content))
+                metadata['Dimensions'] = f"{img.width}x{img.height} pixels"
+                metadata['Format'] = img.format
+                metadata['Mode'] = img.mode
+            except Exception:
+                metadata['Dimensions'] = "Could not be determined"
+
+            return metadata
+        except Exception as e:
+            return {'Error': str(e)}
+
+    def scrape_to_text(self, url, output_path=None):
+        """Scrape a URL and save the content as plain text
+
+        Args:
+            url: URL to scrape
+            output_path: Path where the text file will be saved
+
+        Returns:
+            str: Path to the generated file
+        """
+        try:
+            # Fetch the page content
+            response = self.session.get(url, timeout=15)
+            response.raise_for_status()
+
+            # Parse the HTML
+            soup = BeautifulSoup(response.text, 'html.parser')
+
+            # Remove scripts, styles, and non-visible elements
+            for element in soup(['script', 'style', 'head', 'title', 'meta', '[document]']):
+                element.extract()
+
+            # Get the text
+            text = soup.get_text(separator='\n')
+
+            # Collapse excess whitespace
+            lines = [line.strip() for line in text.split('\n')]
+            text = '\n'.join(line for line in lines if line)
+
+            # Generate a file name if none was provided
+            if not output_path:
+                filename = f"texto_{int(time.time())}.txt"
+                output_path = os.path.join(self.output_dir, filename)
+
+            # Save the text to a file
+            with open(output_path, 'w', encoding='utf-8') as f:
+                f.write(f"URL: {url}\n\n")
+                f.write(text)
+
+            return output_path
+        except Exception as e:
+            raise Exception(f"Error while scraping to text: {str(e)}") from e
+
+    def scrape_to_pdf(self, url, output_path=None):
+        """Scrape a URL and save the content as a PDF
+
+        Args:
+            url: URL to scrape
+            output_path: Path where the PDF file will be saved
+
+        Returns:
+            str: Path to the generated file
+        """
+        try:
+            # Generate a file name if none was provided
+            if not output_path:
+                filename = f"documento_{int(time.time())}.pdf"
+                output_path = os.path.join(self.output_dir, filename)
+
+            # CSS to improve the PDF styling
+            css_string = """
+                @page {
+                    margin: 1cm;
+                }
+                body {
+                    font-family: Arial, sans-serif;
+                    line-height: 1.5;
+                    font-size: 12px;
+                }
+                h1, h2, h3, h4, h5, h6 {
+                    margin-top: 1em;
+                    margin-bottom: 0.5em;
+                }
+                p {
+                    margin-bottom: 0.5em;
+                }
+                img {
+                    max-width: 100%;
+                    height: auto;
+                }
+            """
+
+            # Generate the PDF
+            HTML(url=url).write_pdf(
+                output_path,
+                stylesheets=[CSS(string=css_string)]
+            )
+
+            return output_path
+        except Exception as e:
+            raise Exception(f"Error while converting to PDF: {str(e)}") from e
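`is_image_url` compares the whole URL string against the extension list, so a link such as `https://ejemplo.com/foto.jpg?size=large` misses the extension check and falls through to the Content-Type request. One possible refinement, sketched here with the hypothetical helper `has_image_extension`, tests only the path component of the URL. It is an illustration under that assumption, not code from this commit.

```python
import os
from urllib.parse import urlparse

# Same extensions as the list inside is_image_url.
IMAGE_EXTENSIONS = {'.jpg', '.jpeg', '.png', '.gif', '.webp', '.svg', '.bmp', '.tiff'}


def has_image_extension(url):
    """Return True when the URL path (ignoring query and fragment) ends in an image extension."""
    path = urlparse(url).path
    _, extension = os.path.splitext(path)
    return extension.lower() in IMAGE_EXTENSIONS


if __name__ == "__main__":
    print(has_image_extension("https://ejemplo.com/foto.jpg?size=large"))
    print(has_image_extension("https://ejemplo.com/page.html"))
```

Checking the path first would avoid a network round trip for image links that carry query parameters.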