Upload 7 files
Browse files
- README.md +109 -3
- config.json +27 -0
- model.safetensors +3 -0
- special_tokens_map.json +37 -0
- tokenizer.json +0 -0
- tokenizer_config.json +58 -0
- vocab.txt +0 -0
README.md CHANGED
@@ -1,3 +1,109 @@
---
language: es
license: apache-2.0
tags:
- spanish
- hate-speech-detection
- text-classification
- beto
- inclusivity
datasets:
- manueltonneau/spanish-hate-speech-superset
metrics:
- accuracy
- f1
- precision
- recall
widget:
- text: "Me encanta este país, la gente es muy amable"
- text: "Todos los inmigrantes son delincuentes"
---

# InclusioCheck - Spanish Hate Speech Detector

## 📋 Model Description

**InclusioCheck** is a text classification model fine-tuned from [BETO](https://huggingface.co/dccuchile/bert-base-spanish-wwm-cased)
to detect hate speech in Spanish text.

## 🚀 Quick Start

```python
from transformers import pipeline

# Load the classifier
classifier = pipeline("text-classification", model="antonn-dromundo/InclusioCheck-BETO-HateSpeech")

# Predict
resultado = classifier("Texto a analizar")
print(resultado)
```

## 💻 Advanced Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("antonn-dromundo/InclusioCheck-BETO-HateSpeech")
model = AutoModelForSequenceClassification.from_pretrained("antonn-dromundo/InclusioCheck-BETO-HateSpeech")

# Prediction function
def predecir(texto):
    inputs = tokenizer(texto, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        outputs = model(**inputs)
    prediccion = outputs.logits.argmax(-1).item()
    probabilidad = torch.softmax(outputs.logits, dim=-1)[0][prediccion].item()

    label = "Hate Speech" if prediccion == 1 else "No Hate Speech"
    return {"label": label, "confidence": probabilidad}

# Example
print(predecir("Los inmigrantes son bienvenidos"))
```

## 📊 Performance Metrics

| Metric | Value |
|---------|-------|
| Accuracy | 0.816 |
| F1 Score | 0.827 |
| Precision | 0.777 |
| Recall | 0.884 |
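
These figures were presumably computed on the held-out test split described below. As a minimal sketch (not the exact evaluation script), the same metrics can be recomputed with scikit-learn given gold labels `y_true` and model predictions `y_pred` encoded as 0/1:

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Placeholder labels for illustration only (0 = No Hate, 1 = Hate Speech)
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]

precision, recall, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="binary")
print({"accuracy": accuracy_score(y_true, y_pred),
       "precision": precision, "recall": recall, "f1": f1})
```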

## 📚 Training Dataset

- **Source**: [Spanish Hate Speech Superset](https://huggingface.co/datasets/manueltonneau/spanish-hate-speech-superset)
- **Training examples**: 12,350
- **Test examples**: 2,180
- **Classes**: 2 (No Hate / Hate Speech)
- **Balancing**: yes, by undersampling the majority class (see the sketch below)
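
As a rough illustration of the balancing step (not the exact preprocessing used for this model), majority-class undersampling with the `datasets` library could look like this; the split and label column names are assumptions and may differ from the dataset's actual schema:

```python
from datasets import load_dataset, concatenate_datasets

# Assumed split and label column names; adjust to the dataset's actual schema
ds = load_dataset("manueltonneau/spanish-hate-speech-superset", split="train")
hate = ds.filter(lambda ex: ex["labels"] == 1)
no_hate = ds.filter(lambda ex: ex["labels"] == 0)

# Undersample the majority class down to the size of the minority class
n = min(len(hate), len(no_hate))
balanced = concatenate_datasets([
    hate.shuffle(seed=42).select(range(n)),
    no_hate.shuffle(seed=42).select(range(n)),
]).shuffle(seed=42)
print(len(balanced))
```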

## 🎯 Use Cases

- ✅ Automatic content moderation
- ✅ Filtering comments on social media
- ✅ Inclusive-language audits
- ✅ Writing support tool

## ⚠️ Limitations

- The model is trained specifically for **Spanish**
- It may carry biases inherent in the training dataset
- Recommended as a **support tool**, not as the sole basis for decisions (see the sketch after this list)
- Cultural context and intent should be considered in ambiguous cases
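
In line with the "support tool" recommendation above, one illustrative pattern is to act automatically only on high-confidence predictions and route the rest to human review. The helper below reuses the `predecir()` function from the advanced example; the 0.9 threshold is an arbitrary example, not a calibrated value:

```python
def moderar(texto, umbral=0.9):
    """Illustrative moderation helper; `umbral` is an uncalibrated example threshold."""
    pred = predecir(texto)
    if pred["label"] == "Hate Speech":
        return "auto-flag" if pred["confidence"] >= umbral else "human-review"
    return "allow"

print(moderar("Todos los inmigrantes son delincuentes"))
```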

## 👤 Authorship

Antonio Dromundo

Created as part of the **InclusioCheck** project to promote the detection of exclusionary language.
From Mexico to the world.

## 📄 License

Apache 2.0

## 🔗 Links

- [Project repository](#)
- [Gradio demo](#)

config.json ADDED
@@ -0,0 +1,27 @@
{
  "architectures": [
    "BertForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "dtype": "float32",
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "output_past": true,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "problem_type": "single_label_classification",
  "transformers_version": "4.57.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 31002
}
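
For reference, this configuration can be inspected programmatically. Note (assumption): because no `id2label` mapping is stored, `transformers` falls back to the default two-label mapping `LABEL_0` / `LABEL_1`:

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("antonn-dromundo/InclusioCheck-BETO-HateSpeech")
print(config.model_type, config.hidden_size, config.num_labels)  # bert 768 2 (num_labels defaults to 2)
print(config.id2label)  # defaults to {0: "LABEL_0", 1: "LABEL_1"} when not set explicitly
```
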
model.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:b1ba5f2e2fdc9aea23c4ba2ef5f39977e3aa93eb5e911ba717b7408b7d267bab
size 439433208
special_tokens_map.json ADDED
@@ -0,0 +1,37 @@
{
  "cls_token": {
    "content": "[CLS]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "mask_token": {
    "content": "[MASK]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": {
    "content": "[PAD]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "sep_token": {
    "content": "[SEP]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "unk_token": {
    "content": "[UNK]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  }
}
tokenizer.json ADDED

The diff for this file is too large to render.
tokenizer_config.json ADDED
@@ -0,0 +1,58 @@
{
  "added_tokens_decoder": {
    "0": {
      "content": "[MASK]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "1": {
      "content": "[PAD]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "3": {
      "content": "[UNK]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "4": {
      "content": "[CLS]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "5": {
      "content": "[SEP]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "clean_up_tokenization_spaces": false,
  "cls_token": "[CLS]",
  "do_basic_tokenize": true,
  "do_lower_case": false,
  "extra_special_tokens": {},
  "mask_token": "[MASK]",
  "model_max_length": 512,
  "never_split": null,
  "pad_token": "[PAD]",
  "sep_token": "[SEP]",
  "strip_accents": false,
  "tokenize_chinese_chars": true,
  "tokenizer_class": "BertTokenizer",
  "unk_token": "[UNK]"
}
vocab.txt ADDED

The diff for this file is too large to render.