Upload fine-tuned BERT meme-vs-event classifier

Files changed (6)

README.md +46 -0
config.json +39 -0
model.safetensors +3 -0
tokenizer.json +0 -0
tokenizer_config.json +14 -0
training_args.bin +3 -0

README.md ADDED

	@@ -0,0 +1,46 @@

+---
+license: apache-2.0
+language: en
+library_name: transformers
+pipeline_tag: text-classification
+tags:
+  - bert
+  - text-classification
+  - tweet-classification
+  - meme-detection
+  - event-detection
+---
+# Meme vs Real Event Tweet Classifier
+Fine-tuned `bert-base-uncased` that classifies a tweet as either a **meme /
+low-signal cultural post** or a **real-world event** (breaking news,
+infrastructure outages, disasters, politics, etc.).
+- **Base model:** `bert-base-uncased`
+- **Task:** binary sequence classification
+- **Labels:** `0 = meme`, `1 = real_event`
+- **Max sequence length:** 128 tokens (pass `max_length=128` explicitly; the tokenizer's `model_max_length` is 512)
+- **Preprocessing:** lowercase, strip URLs / mentions / hashtags / non-word chars
+## Quick start
+```python
+import torch
+import torch.nn.functional as F
+from transformers import AutoModelForSequenceClassification, AutoTokenizer
+
+repo = "Aryan047/Dynamic-event-detector"
+tokenizer = AutoTokenizer.from_pretrained(repo)
+model = AutoModelForSequenceClassification.from_pretrained(repo).eval()
+
+text = "Massive 6.5 earthquake just rocked Istanbul, buildings swaying"
+enc = tokenizer(text, truncation=True, max_length=128, return_tensors="pt")
+
+# Inference only, so skip gradient tracking.
+with torch.no_grad():
+    logits = model(**enc).logits[0]
+probs = F.softmax(logits, dim=-1).tolist()
+# Index order follows id2label in config.json: 0 = meme, 1 = real_event.
+print({"meme": probs[0], "real_event": probs[1]})
+```
+## Training pipeline
+Clusters of tweets were auto-labeled against the GDELT DOC 2.0 API using a
+lifespan-aware heuristic, then BERT was fine-tuned on an 80/20 split. See the
+companion notebook `meme_vs_event_classifier.ipynb` for the full pipeline.
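
The model card lists a preprocessing step (lowercase, strip URLs / mentions / hashtags / non-word chars), but the Quick start snippet passes the raw tweet straight to the tokenizer. Below is a minimal sketch of what such a cleaning step might look like. The function name `clean_tweet` and the exact regular expressions are hypothetical, and the sketch assumes that "strip hashtags" means dropping the whole `#tag` token; the companion notebook is the authority on what was actually used in training.

```python
import re

# Assumed patterns; not taken from the training notebook.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_OR_HASHTAG_RE = re.compile(r"[@#]\w+")
NON_WORD_RE = re.compile(r"[^\w\s]")
WHITESPACE_RE = re.compile(r"\s+")


def clean_tweet(text: str) -> str:
    """Lowercase a tweet and strip URLs, mentions, hashtags and non-word chars."""
    text = text.lower()
    text = URL_RE.sub(" ", text)                 # remove links first, before punctuation
    text = MENTION_OR_HASHTAG_RE.sub(" ", text)  # drop @user and #tag tokens entirely
    text = NON_WORD_RE.sub(" ", text)            # punctuation, emoji, symbols
    return WHITESPACE_RE.sub(" ", text).strip()  # collapse leftover whitespace


if __name__ == "__main__":
    print(clean_tweet("Massive 6.5 EARTHQUAKE in #Istanbul!! @user https://t.co/abc"))
```

If the training data was cleaned this way, running the same function on each tweet before calling the tokenizer keeps inference inputs consistent with what the model saw during fine-tuning.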

config.json ADDED

	@@ -0,0 +1,39 @@

+{
+  "add_cross_attention": false,
+  "architectures": [
+    "BertForSequenceClassification"
+  ],
+  "attention_probs_dropout_prob": 0.1,
+  "bos_token_id": null,
+  "classifier_dropout": null,
+  "dtype": "float32",
+  "eos_token_id": null,
+  "gradient_checkpointing": false,
+  "hidden_act": "gelu",
+  "hidden_dropout_prob": 0.1,
+  "hidden_size": 768,
+  "id2label": {
+    "0": "meme",
+    "1": "real_event"
+  },
+  "initializer_range": 0.02,
+  "intermediate_size": 3072,
+  "is_decoder": false,
+  "label2id": {
+    "meme": 0,
+    "real_event": 1
+  },
+  "layer_norm_eps": 1e-12,
+  "max_position_embeddings": 512,
+  "model_type": "bert",
+  "num_attention_heads": 12,
+  "num_hidden_layers": 12,
+  "pad_token_id": 0,
+  "position_embedding_type": "absolute",
+  "problem_type": "single_label_classification",
+  "tie_word_embeddings": true,
+  "transformers_version": "5.0.0",
+  "type_vocab_size": 2,
+  "use_cache": false,
+  "vocab_size": 30522
+}
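
The `id2label` / `label2id` entries in this config are what let a caller turn a logit index into a label name without hard-coding the order. A small sketch of that lookup, using only the JSON shown in this diff (note that the `id2label` keys are strings in the file); `predict_label` and `softmax` are hypothetical helpers written for illustration, not part of the repository:

```python
import json
import math

# The label mapping copied from config.json in this commit.
CONFIG_SNIPPET = """
{
  "id2label": {"0": "meme", "1": "real_event"},
  "label2id": {"meme": 0, "real_event": 1}
}
"""


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_label(logits, config):
    """Return (label, probability) for the highest-scoring class."""
    # JSON object keys are strings, so convert them to integer class ids.
    id2label = {int(k): v for k, v in config["id2label"].items()}
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return id2label[best], probs[best]


config = json.loads(CONFIG_SNIPPET)
# Example logits are made up for illustration, not model output.
label, prob = predict_label([-1.2, 2.3], config)
print(label, round(prob, 3))
```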

model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:24d6b17c203ad33df80d5e6e1ddce4676a9aa7fb7c2154e7d772ac28b3599b39
+size 437958624
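
As a sanity check on the pointer above, the `size` field can be compared with the weight footprint implied by `config.json`. The sketch below counts parameters for a BERT encoder with a pooler and a two-way linear classification head, then multiplies by 4 bytes for `float32`. It assumes the standard BERT weight layout and that any remaining bytes are safetensors header and metadata; the function name is hypothetical.

```python
def bert_classifier_param_count(vocab_size, hidden_size, intermediate_size,
                                num_hidden_layers, max_position_embeddings,
                                type_vocab_size, num_labels):
    """Count weights and biases for a BERT encoder + pooler + linear classifier."""
    layer_norm = 2 * hidden_size  # scale and bias
    embeddings = (vocab_size + max_position_embeddings + type_vocab_size) * hidden_size + layer_norm
    # Query, key, value and output projections, plus the post-attention LayerNorm.
    attention = 4 * (hidden_size * hidden_size + hidden_size) + layer_norm
    # Two dense layers of the feed-forward block, plus its LayerNorm.
    feed_forward = ((hidden_size * intermediate_size + intermediate_size)
                    + (intermediate_size * hidden_size + hidden_size)
                    + layer_norm)
    pooler = hidden_size * hidden_size + hidden_size
    classifier = hidden_size * num_labels + num_labels
    return embeddings + num_hidden_layers * (attention + feed_forward) + pooler + classifier


# Values taken from config.json in this commit; num_labels follows from id2label.
params = bert_classifier_param_count(
    vocab_size=30522, hidden_size=768, intermediate_size=3072,
    num_hidden_layers=12, max_position_embeddings=512,
    type_vocab_size=2, num_labels=2,
)
weight_bytes = params * 4          # float32
lfs_size = 437958624               # "size" field of the LFS pointer
overhead = lfs_size - weight_bytes # assumed to be header / metadata
print(params, weight_bytes, overhead)
```

Under these assumptions the count comes to about 109.5 million parameters, and the file size exceeds the raw weight bytes by only a few tens of kilobytes, which is consistent with a full-precision `bert-base-uncased` checkpoint plus a small header.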

tokenizer.json ADDED

The diff for this file is too large to render.

tokenizer_config.json ADDED

	@@ -0,0 +1,14 @@

+{
+  "backend": "tokenizers",
+  "cls_token": "[CLS]",
+  "do_lower_case": true,
+  "is_local": false,
+  "mask_token": "[MASK]",
+  "model_max_length": 512,
+  "pad_token": "[PAD]",
+  "sep_token": "[SEP]",
+  "strip_accents": null,
+  "tokenize_chinese_chars": true,
+  "tokenizer_class": "BertTokenizer",
+  "unk_token": "[UNK]"
+}

training_args.bin ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:2d95f651825f49a44a544ba4f8bb25740788ee96b49b002bdec27e31a9d9b4df
+size 5201