metadata
license: mit
language: en
library_name: transformers
pipeline_tag: text-classification
datasets:
  - imdb
metrics:
  - accuracy
tags:
  - text-classification
  - sentiment-analysis
  - bert
  - imdb
  - adversarial-nlp
  - textattack

bert-imdb-512

bert-base-uncased fine-tuned on the IMDB sentiment classification dataset with max_seq_length=512.

Trained as a victim model for adversarial NLP research (TextBugger / TextFooler / DeepWordBug-style attacks). The longer input window (versus the 128 tokens typical of TextAttack baselines) lets roughly 95–98% of IMDB reviews fit without truncation and yields a stronger classifier.
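The coverage figure above can be checked by tokenizing the reviews without truncation and counting how many fit in the window. The following is a minimal sketch, not the procedure used for this card: the helper names `fraction_fitting` and `imdb_token_lengths` are hypothetical, and it assumes the `datasets` and `transformers` libraries are installed. Token counts include the `[CLS]` and `[SEP]` special tokens, which occupy two of the 512 positions.

```python
from typing import Iterable, List


def fraction_fitting(token_lengths: Iterable[int], max_length: int = 512) -> float:
    """Share of sequences whose token count (special tokens included) fits in max_length."""
    lengths = list(token_lengths)
    if not lengths:
        raise ValueError("token_lengths is empty")
    return sum(n <= max_length for n in lengths) / len(lengths)


def imdb_token_lengths(split: str = "train") -> List[int]:
    """Tokenize one IMDB split without truncation and return per-review token counts.

    Assumption: downloads the dataset and the bert-base-uncased tokenizer on first use.
    """
    from datasets import load_dataset
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    reviews = list(load_dataset("imdb", split=split)["text"])
    # No truncation: we want each review's true length, [CLS] and [SEP] included.
    encoded = tokenizer(reviews, truncation=False)
    return [len(ids) for ids in encoded["input_ids"]]
```

Calling `fraction_fitting(imdb_token_lengths("train"), 512)` and repeating with `max_length=128` would give the two coverage numbers to compare.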

Model Details

  • Architecture: bert-base-uncased (12 layers, 768 hidden, 12 heads, ~110M parameters)
  • Tokenization: WordPiece (subwords)
  • Max sequence length: 512 tokens
  • Task: Binary sentiment classification (positive / negative)

Training

Trained from bert-base-uncased on the IMDB train split (25,000 examples) using TextAttack 0.3.x.

| Hyperparameter | Value |
|---|---|
| Epochs | 5 |
| Per-device batch size | 8 |
| Gradient accumulation | 2 (effective batch 16) |
| Learning rate | 2e-5 |
| Weight decay | 0.01 |
| Warmup steps | 500 |
| Random seed | 786 |
| Hardware | NVIDIA RTX 3090 (24 GB) |
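To put the warmup setting in context, the hyperparameters above imply a rough optimizer-step budget. This is a back-of-the-envelope sketch that assumes a single GPU and that a partial final batch counts as one step; the trainer's exact rounding may differ, and the function name is hypothetical.

```python
import math
from typing import Dict, Union


def training_schedule(
    num_examples: int = 25_000,
    per_device_batch: int = 8,
    grad_accum: int = 2,
    epochs: int = 5,
    warmup_steps: int = 500,
    num_devices: int = 1,  # assumption: single-GPU training
) -> Dict[str, Union[int, float]]:
    """Derive effective batch size, optimizer steps, and the warmup share of training."""
    effective_batch = per_device_batch * grad_accum * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    total_steps = steps_per_epoch * epochs
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_fraction": warmup_steps / total_steps,
    }


plan = training_schedule()
print(plan)  # effective batch 16, 1563 steps per epoch, 7815 steps in total
```

Under these assumptions, the 500 warmup steps cover about 6.4% of training.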

Training command:

textattack train --model-name-or-path bert-base-uncased \
  --dataset imdb \
  --model-max-length 512 \
  --epochs 5 \
  --per-device-train-batch-size 8 \
  --gradient-accumulation-steps 2 \
  --learning-rate 2e-5 \
  --save-last \
  --output-dir ./models/bert-imdb-512

Evaluation

Evaluated on the IMDB test split (25,000 examples) at the best epoch checkpoint:

| Metric | Value |
|---|---|
| Accuracy | 94.14% |

For reference, the equivalent TextAttack baseline at 128 tokens (textattack/bert-base-uncased-imdb) reports ~89% on the same test set.
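An accuracy number like the one above could be re-measured along the following lines. This is a sketch, not the evaluation script behind the reported figure: `accuracy` and `make_hf_predictor` are hypothetical helpers, the batch size is arbitrary, and the predictor assumes `torch` and `transformers` are installed and downloads the model on first use.

```python
from typing import Callable, List, Sequence


def accuracy(
    predict_batch: Callable[[Sequence[str]], List[int]],
    texts: Sequence[str],
    labels: Sequence[int],
    batch_size: int = 32,
) -> float:
    """Fraction of texts whose predicted label id matches the gold label id."""
    if len(texts) != len(labels):
        raise ValueError("texts and labels must have the same length")
    if not texts:
        raise ValueError("texts is empty")
    correct = 0
    for start in range(0, len(texts), batch_size):
        batch_labels = labels[start:start + batch_size]
        preds = predict_batch(texts[start:start + batch_size])
        correct += sum(int(p == y) for p, y in zip(preds, batch_labels))
    return correct / len(texts)


def make_hf_predictor(model_id: str = "jongador/bert-imdb-512", max_length: int = 512):
    """Build a predict_batch function backed by the fine-tuned model."""
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(model_id).eval()

    def predict_batch(texts: Sequence[str]) -> List[int]:
        inputs = tokenizer(
            list(texts),
            return_tensors="pt",
            padding=True,
            truncation=True,
            max_length=max_length,
        )
        with torch.no_grad():  # inference only, no gradients needed
            logits = model(**inputs).logits
        return logits.argmax(dim=-1).tolist()

    return predict_batch
```

With the IMDB test split loaded as two lists, `accuracy(make_hf_predictor(), texts, labels)` returns the fraction correct. Passing `max_length=128` to `make_hf_predictor` shows how much this checkpoint loses when reviews are truncated.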

How to Use

from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("jongador/bert-imdb-512")
model = AutoModelForSequenceClassification.from_pretrained("jongador/bert-imdb-512")

inputs = tokenizer("I loved this movie!", return_tensors="pt", truncation=True, max_length=512)
outputs = model(**inputs)
prediction = outputs.logits.argmax(-1).item()  # 0 = negative, 1 = positive
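Score-based attacks typically query the victim model's class probabilities rather than just the argmax, so it can help to turn the logits into a label and a confidence. The helper below is an illustrative sketch in plain Python; `LABELS` and `label_with_confidence` are hypothetical names, and the label order follows the comment in the snippet above (0 = negative, 1 = positive).

```python
import math
from typing import List, Sequence, Tuple

LABELS = ("negative", "positive")  # index order: 0 = negative, 1 = positive


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a single row of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def label_with_confidence(logits: Sequence[float]) -> Tuple[str, float]:
    """Return the predicted label name and its softmax probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs[best]
```

With the variables from the snippet above, `label_with_confidence(outputs.logits[0].tolist())` gives the label name and its probability for the single input review.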

License

MIT