Spaces: notrito/dialecto-detector

notrito committed on Nov 7, 2025

Commit

c5947ee

0 Parent(s):

init

Files changed (18)

.gitattributes +4 -0
README.md +88 -0
app.py +192 -0
model-last/config.cfg +3 -0
model-last/meta.json +3 -0
model-last/ner/cfg +3 -0
model-last/ner/model +3 -0
model-last/ner/moves +3 -0
model-last/tok2vec/cfg +3 -0
model-last/tok2vec/model +3 -0
model-last/tokenizer +3 -0
model-last/vocab/key2row +3 -0
model-last/vocab/lookups.bin +3 -0
model-last/vocab/strings.json +3 -0
model-last/vocab/vectors +3 -0
model-last/vocab/vectors.cfg +3 -0
requirements.txt +3 -0
tweets_sample.json +3 -0

.gitattributes ADDED

	@@ -0,0 +1,4 @@

+model-last/** filter=lfs diff=lfs merge=lfs -text
+*.json filter=lfs diff=lfs merge=lfs -text
+model-last/vocab/vectors filter=lfs diff=lfs merge=lfs -text
+model-last/vocab/*.json filter=lfs diff=lfs merge=lfs -text

README.md ADDED

	@@ -0,0 +1,88 @@

+---
+title: Detector de Dialecto Español
+emoji: 🗣️
+colorFrom: blue
+colorTo: red
+sdk: gradio
+sdk_version: 4.44.0
+app_file: app.py
+pinned: false
+---
+# 🗣️ Spanish Dialect Detector: Argentine vs. Peninsular Spanish
+A spaCy-based NLP model that detects and classifies Spanish dialects (Argentine 🇦🇷 vs. Peninsular Spanish 🇪🇸).
+## 🎯 Description
+This project uses an NER (Named Entity Recognition) model trained with spaCy to identify words and expressions characteristic of two varieties of Spanish:
+- **Argentinisms**: Words and expressions typical of Argentina (che, boludo, vos, bondi, etc.)
+- **Peninsular Spanish terms**: Words and expressions typical of Spain (tío, coño, guay, etc.)
+## 🚀 How It Works
+The model automatically detects:
+### Argentinisms 🇦🇷
+- **Characteristic vocabulary**: che, boludo, pibe, guita, bondi, quilombo
+- **Voseo**: vos, tenés, sos, querés, sabés, podés, hacés
+- **Expressions**: pileta, remera, laburo, morfar
+### Peninsular Spanish Terms 🇪🇸
+- **Characteristic vocabulary**: tío/tía, coño, ostras, hostia
+- **Slang**: molar, curro, guay, flipar, gilipollas
+- **Expressions**: botellón, me parto, chaval/chavala
+## 📊 Model Metrics
+- **F-score**: 99.90%
+- **Precision**: 99.90%
+- **Recall**: 99.90%
+- **Training examples**: 10,000 (balanced 50/50)
+- **Dataset**: pysentimiento/spanish-tweets
+## 🛠️ Technologies
+- **spaCy 3.8.2**: NLP framework
+- **Gradio 4.44.0**: Interactive web interface
+- **Pipeline**: tok2vec + ner
+- **Base model**: es_core_news_sm
+## 💡 Use Cases
+- Dialect analysis on social media
+- Sociolinguistic studies
+- Automatic classification of content by region
+- Educational tool for learning varieties of Spanish
+## ⚠️ Limitations
+- The model is optimized for **informal text** (tweets, messages)
+- It may produce false positives with:
+  - Ambiguous words taken out of context
+  - Vocabulary shared between dialects
+- It only distinguishes between **Argentine** and **Peninsular Spanish** (no other Latin American dialects)
+## 🔍 Examples
+**Argentine:**
+> "Che boludo, ¿vos sabés dónde dejé las llaves del bondi?"
+**Peninsular Spanish:**
+> "Tío, este curro es una pasada, chaval"
+## 📝 Technical Notes
+The model uses context rules to avoid false positives on ambiguous words:
+- "che" vs "Che Guevara"
+- "mate" (the drink) vs "maté" (verb form)
+- "colectivo" (bus) vs "colectivo" (group)
+## 👨‍💻 Author
+Developed as an educational NLP project with spaCy.
+## 📄 License
+MIT License
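
The README's technical notes mention context rules for ambiguous words, but the commit does not show how they are implemented. As a rough illustration only, and not the author's method, a rule for "che" could be sketched with two regular expressions. The function name `find_dialectal_che` and both patterns are hypothetical.

```python
import re

# Hypothetical rule: treat "che" as an Argentinism unless it is the
# capitalized nickname directly followed by "Guevara".
_CHE = re.compile(r"\bche\b", re.IGNORECASE)
_CHE_GUEVARA = re.compile(r"\bChe\s+Guevara\b")


def find_dialectal_che(text: str) -> list[tuple[int, int]]:
    """Return (start, end) spans of "che" that are not part of "Che Guevara"."""
    # Start offsets of every "Che Guevara" mention; these are skipped below.
    excluded = {m.start() for m in _CHE_GUEVARA.finditer(text)}
    return [m.span() for m in _CHE.finditer(text) if m.start() not in excluded]
```

A trained NER model learns this kind of distinction from annotated context rather than from a hand-written pattern, so the sketch only shows the intended behavior: the interjection is kept and the proper name is ignored.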

app.py ADDED

	@@ -0,0 +1,192 @@

+import gradio as gr
+import spacy
+from spacy import displacy
+import json
+import random
+from collections import Counter
+# Load the model (this commit ships the pipeline under model-last/)
+print("Loading model...")
+nlp = spacy.load("./model-last")
+print("✓ Model loaded")
+# Detection function: renders the detected entities as highlighted HTML
+def detectar_dialectismos(texto):
+    if not texto:  # no tweet selected yet
+        return ""
+    doc = nlp(texto)
+    colors = {
+        "ARGENTINISMO": "#75aadb",  # Argentine sky blue
+        "ESPAÑOLISMO": "#c60b1e"     # Spanish red
+    }
+    html = displacy.render(
+        doc,
+        style="ent",
+        jupyter=False,
+        options={"colors": colors}
+    )
+    return html
+# Predefined examples
+ejemplos = [
+    "Che boludo, ¿vos sabés dónde dejé las llaves del bondi?",
+    "Tío, este curro es una mierda, me voy a flipar",
+]
+# Load the tweets once at startup (outside the functions)
+with open('tweets_sample.json', 'r', encoding='utf-8') as f:
+    TODOS_LOS_TWEETS = json.load(f)
+def generar_muestra_y_estadisticas():
+    """
+    Draw a sample of up to 1000 tweets and return the statistics HTML plus the sample itself
+    """
+    # Sample up to 1000 random tweets
+    muestra = random.sample(TODOS_LOS_TWEETS, min(1000, len(TODOS_LOS_TWEETS)))
+    # Compute the statistics
+    total_argentinismos = 0
+    total_españolismos = 0
+    palabras_arg = []
+    palabras_esp = []
+    tweets_argentinos = 0
+    tweets_españoles = 0
+    for tweet in muestra:
+        argentinismos = tweet['argentinismos']
+        españolismos = tweet['españolismos']
+        total_argentinismos += len(argentinismos)
+        total_españolismos += len(españolismos)
+        palabras_arg.extend(argentinismos)
+        palabras_esp.extend(españolismos)
+        if len(argentinismos) > len(españolismos):
+            tweets_argentinos += 1
+        elif len(españolismos) > len(argentinismos):
+            tweets_españoles += 1
+    top_arg = Counter(palabras_arg).most_common(10)
+    top_esp = Counter(palabras_esp).most_common(10)
+    # HTML with the statistics
+    html_stats = f"""
+    <div style="font-family: Arial, sans-serif; padding: 20px;">
+        <h2>📊 Statistics for {len(muestra)} random tweets</h2>
+        <div style="display: flex; gap: 20px; margin: 20px 0;">
+            <div style="flex: 1; background: #75aadb; color: white; padding: 20px; border-radius: 10px;">
+                <h3>🇦🇷 Argentinisms</h3>
+                <p style="font-size: 32px; margin: 10px 0;"><strong>{total_argentinismos}</strong></p>
+                <p>detected in total</p>
+                <p style="font-size: 20px;"><strong>{tweets_argentinos}</strong> Argentine tweets</p>
+            </div>
+            <div style="flex: 1; background: #c60b1e; color: white; padding: 20px; border-radius: 10px;">
+                <h3>🇪🇸 Peninsular Spanish terms</h3>
+                <p style="font-size: 32px; margin: 10px 0;"><strong>{total_españolismos}</strong></p>
+                <p>detected in total</p>
+                <p style="font-size: 20px;"><strong>{tweets_españoles}</strong> Peninsular Spanish tweets</p>
+            </div>
+        </div>
+        <div style="display: flex; gap: 20px; margin-top: 30px;">
+            <div style="flex: 1;">
+                <h3>🔝 Top 10 Argentinisms</h3>
+                <ol>
+                    {"".join(f'<li><strong>{palabra}</strong>: {count} times</li>' for palabra, count in top_arg)}
+                </ol>
+            </div>
+            <div style="flex: 1;">
+                <h3>🔝 Top 10 Peninsular Spanish terms</h3>
+                <ol>
+                    {"".join(f'<li><strong>{palabra}</strong>: {count} times</li>' for palabra, count in top_esp)}
+                </ol>
+            </div>
+        </div>
+    </div>
+    """
+    # Return the stats HTML and the sample so it can be reused later
+    return html_stats, muestra
+def obtener_5_tweets_aleatorios(muestra):
+    """
+    Pick 5 random tweets from the sample
+    """
+    if not muestra:
+        return gr.Radio(choices=[], label="Generate a sample first")
+    tweets_sample = random.sample(muestra, min(5, len(muestra)))
+    # Build the list of options (text truncated for display)
+    opciones = []
+    for tweet in tweets_sample:
+        texto = tweet['text']
+        # Truncate long tweets
+        preview = texto[:100] + "..." if len(texto) > 100 else texto
+        opciones.append((preview, texto))  # (label, value)
+    return gr.Radio(choices=opciones, label="Select a tweet", value=opciones[0][1] if opciones else None)
+# Global variable holding the current sample
+muestra_actual = []
+def wrapper_generar_muestra():
+    global muestra_actual
+    html_stats, muestra_actual = generar_muestra_y_estadisticas()
+    return html_stats
+def wrapper_5_tweets():
+    global muestra_actual
+    return obtener_5_tweets_aleatorios(muestra_actual)
+# Gradio interface
+with gr.Blocks() as demo:
+    gr.Markdown("# 🗣️ Spanish Dialect Detector: Argentine 🇦🇷 vs. Peninsular Spanish 🇪🇸")
+    gr.Markdown("Analyze a sample of 1000 random tweets from the dataset and explore individual examples.")
+    # Button to generate the sample
+    btn_generar = gr.Button("🎲 Generate a Sample of 1000 Tweets", variant="primary", size="lg")
+    output_stats = gr.HTML()
+    gr.Markdown("---")
+    gr.Markdown("### Explore examples from the sample")
+    # Button to fetch 5 tweets
+    btn_samplear = gr.Button("📋 Show 5 Random Tweets")
+    radio_tweets = gr.Radio(choices=[], label="Select a tweet to analyze")
+    # Button to analyze the selected tweet
+    btn_analizar = gr.Button("🔍 Analyze Selected Tweet", variant="secondary")
+    output_analisis = gr.HTML()
+    # Events
+    btn_generar.click(
+        fn=wrapper_generar_muestra,
+        inputs=None,
+        outputs=output_stats
+    )
+    btn_samplear.click(
+        fn=wrapper_5_tweets,
+        inputs=None,
+        outputs=radio_tweets
+    )
+    btn_analizar.click(
+        fn=detectar_dialectismos,
+        inputs=radio_tweets,
+        outputs=output_analisis
+    )
+if __name__ == "__main__":
+    demo.launch()
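
The aggregation loop in `generar_muestra_y_estadisticas` can be exercised without Gradio or the model. The sketch below restates it as a standalone function over records shaped like those the app reads (keys `text`, `argentinismos`, `españolismos`). The function name `summarize` and the sample records are made up for illustration, since the contents of `tweets_sample.json` are not visible in this commit.

```python
from collections import Counter


def summarize(tweets):
    """Count detected terms per dialect and label each tweet by majority."""
    words_arg, words_esp = [], []
    tweets_arg = tweets_esp = 0
    for tweet in tweets:
        arg, esp = tweet["argentinismos"], tweet["españolismos"]
        words_arg.extend(arg)
        words_esp.extend(esp)
        if len(arg) > len(esp):
            tweets_arg += 1
        elif len(esp) > len(arg):
            tweets_esp += 1
        # Ties (including tweets with no hits) count toward neither dialect.
    return {
        "total_arg": len(words_arg),
        "total_esp": len(words_esp),
        "tweets_arg": tweets_arg,
        "tweets_esp": tweets_esp,
        "top_arg": Counter(words_arg).most_common(10),
        "top_esp": Counter(words_esp).most_common(10),
    }


# Made-up records for illustration only.
sample = [
    {"text": "Che boludo, vos sabés", "argentinismos": ["che", "boludo", "vos"], "españolismos": []},
    {"text": "Tío, qué guay", "argentinismos": [], "españolismos": ["tío", "guay"]},
    {"text": "Che tío", "argentinismos": ["che"], "españolismos": ["tío"]},
]
stats = summarize(sample)
```

Because ties count toward neither dialect, the two per-dialect tweet counts can add up to less than the sample size.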

model-last/config.cfg ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:0f47c08d6562cd88bfe379fdc95a2d3e02a6c5a9980c2c455de7219c3f9573fc
+size 2727
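
The model files and `tweets_sample.json` appear in this diff as Git LFS pointer files rather than their real contents: three `key value` lines giving the spec version, the SHA-256 object ID, and the size in bytes. A minimal sketch of reading such a pointer follows; the helper name `parse_lfs_pointer` is hypothetical, and the sample text is the pointer shown above for `model-last/config.cfg`.

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse a Git LFS pointer file into version, hash algorithm, oid and size."""
    # Each line is "<key> <value>"; split only on the first space.
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "oid": digest,
        "size": int(fields["size"]),
    }


pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:0f47c08d6562cd88bfe379fdc95a2d3e02a6c5a9980c2c455de7219c3f9573fc
size 2727
"""
info = parse_lfs_pointer(pointer)
```

The `size` field is the byte size of the real file, which is why every pointer entry in the file list shows `+3 -0` regardless of how large the underlying file is.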

model-last/meta.json ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:49cde1bdf1a001d04b7e85543624fd15d1f0f31dcea1dc0ef8bf64f4e3aa8b6b
+size 918

model-last/ner/cfg ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:a7172edadafba9f472e9ac0f2660eec04b6405e471be9e20267b79c67288d22d
+size 221

model-last/ner/model ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:c2a63b5af4610f2430f1222d2228ab8d207a5ca0ba88a19fffd8b7cacd0fae18
+size 128548

model-last/ner/moves ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:f5539fd26b49eefe62ff03861278c9d5e6110829d4fc8fcb91f838210f71da03
+size 247

model-last/tok2vec/cfg ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:f8a5a26e3056eb6fb06deeb3dbccfd88ae74900200c98c70b5966bbb7ec9d4de
+size 4

model-last/tok2vec/model ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:db3c800e8c3bd8563d2041359079a980c7cb2edd306c0229e4285932302579e2
+size 6009091

model-last/tokenizer ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:b59b7b576c81115906b8fcf07f331a84846f77b431676a0803906c94c817462f
+size 36912

model-last/vocab/key2row ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:e8671a09a41a5b75945090b0587993eff21adca55f626dae7bb36f159a6eb5f7
+size 5994249

model-last/vocab/lookups.bin ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:76be8b528d0075f7aae98d6fa57a6d3c83ae480a8469e668d7b0af968995ac71
+size 1

model-last/vocab/strings.json ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:d35a44de283505c654dd72c891eb71e1d2b72f9b0e995ca6143b26f25962bc60
+size 10789527

model-last/vocab/vectors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:e62c4610ae1cf28f17f80020f38ac20d3f8c57b10c348d4403141e16e79b7664
+size 24000128

model-last/vocab/vectors.cfg ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:ff4359091952c8cd16f1f0482f5770fb82d1707368d5cca3c46aa501f552e3c5
+size 22

requirements.txt ADDED

	@@ -0,0 +1,3 @@

+gradio>=5.49.0
+spacy==3.8.2
+huggingface-hub<1.0.0

tweets_sample.json ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:aff909d13d9dabe99f2ca5cd137686d81d1c216f8a2a2dddf7e925983434d731
+size 406177