---
license: mit
language: en
library_name: transformers
pipeline_tag: text-classification
datasets:
- imdb
metrics:
- accuracy
tags:
- text-classification
- sentiment-analysis
- bert
- imdb
- adversarial-nlp
- textattack
---

# bert-imdb-512

`bert-base-uncased` fine-tuned on the IMDB sentiment classification dataset with `max_seq_length=512`.

Trained as a victim model for adversarial NLP research (TextBugger / TextFooler / DeepWordBug-style attacks). Compared with the typical 128-token TextAttack baselines, the 512-token input window lets roughly 95–98% of IMDB reviews fit without truncation, which yields a stronger classifier (see Evaluation).
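The share of reviews that fit in the window can be estimated directly. The sketch below is one way to do it; `fraction_within_limit` and the whitespace-based `whitespace_count` are hypothetical stand-ins chosen so the snippet runs without downloads, and they are not part of the training pipeline.

```python
from typing import Callable, Iterable


def fraction_within_limit(
    texts: Iterable[str],
    count_tokens: Callable[[str], int],
    max_len: int = 512,
) -> float:
    """Return the share of texts whose token count is at most max_len."""
    total = 0
    fitting = 0
    for text in texts:
        total += 1
        if count_tokens(text) <= max_len:
            fitting += 1
    return fitting / total if total else 0.0


# Stand-in counter (assumption): whitespace tokens. WordPiece usually
# produces more tokens than this, so real numbers will differ.
def whitespace_count(text: str) -> int:
    return len(text.split())


sample = ["short review", "word " * 600, "a somewhat longer review text"]
share = fraction_within_limit(sample, whitespace_count, max_len=512)
print(f"{share:.2%} of sample texts fit in 512 tokens")
```

To measure with the real tokenizer, pass something like `lambda t: len(tokenizer(t)["input_ids"])` as `count_tokens` over the IMDB train texts; that count includes the `[CLS]` and `[SEP]` tokens, which also occupy positions in the 512-token window.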

## Model Details

- **Architecture**: `bert-base-uncased` (12 layers, 768 hidden, 12 heads, ~110M parameters)
- **Tokenization**: WordPiece (subwords)
- **Max sequence length**: 512 tokens
- **Task**: Binary sentiment classification (positive / negative)

## Training

Fine-tuned from `bert-base-uncased` on the IMDB train split (25,000 examples) using [TextAttack](https://github.com/QData/TextAttack) 0.3.x.

| Hyperparameter | Value |
| --- | --- |
| Epochs | 5 |
| Per-device batch size | 8 |
| Gradient accumulation | 2 (effective batch 16) |
| Learning rate | 2e-5 |
| Weight decay | 0.01 |
| Warmup steps | 500 |
| Random seed | 786 |
| Hardware | NVIDIA RTX 3090 (24 GB) |

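Training command (weight decay, warmup steps, and random seed are not passed explicitly; the values in the table are assumed to be the TextAttack defaults):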

```bash
textattack train --model-name-or-path bert-base-uncased \
  --dataset imdb \
  --model-max-length 512 \
  --epochs 5 \
  --per-device-train-batch-size 8 \
  --gradient-accumulation-steps 2 \
  --learning-rate 2e-5 \
  --save-last \
  --output-dir ./models/bert-imdb-512
```

## Evaluation

Evaluated on the IMDB test split (25,000 examples) using the best-epoch checkpoint:

| Metric | Value |
| --- | --- |
| Accuracy | **94.14%** |

For reference, the equivalent TextAttack baseline at 128 tokens (`textattack/bert-base-uncased-imdb`) reports ~89% accuracy on the same test split.
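To put the reported accuracy in concrete terms, the sketch below converts it into counts of correct and incorrect predictions and adds an approximate 95% interval. The interval uses the normal approximation for a proportion, which is an assumption added here and not something measured during evaluation.

```python
import math

n = 25_000          # IMDB test split size
accuracy = 0.9414   # reported accuracy

correct = round(accuracy * n)
errors = n - correct
# Normal-approximation standard error of a proportion
std_err = math.sqrt(accuracy * (1 - accuracy) / n)
half_width = 1.96 * std_err  # approximate 95% interval half-width

print(correct, errors, f"+/- {half_width:.2%}")
# prints: 23535 1465 +/- 0.29%
```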

## How to Use

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the fine-tuned tokenizer and classifier from the Hub
tokenizer = AutoTokenizer.from_pretrained("jongador/bert-imdb-512")
model = AutoModelForSequenceClassification.from_pretrained("jongador/bert-imdb-512")
model.eval()

# Truncate to the 512-token window the model was trained with
inputs = tokenizer("I loved this movie!", return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():  # inference only, no gradients needed
    outputs = model(**inputs)
prediction = outputs.logits.argmax(-1).item()  # 0 = negative, 1 = positive
```
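The raw logits can also be turned into a label name and a confidence score. The following is a minimal sketch using only the standard library; `logits_to_prediction` and `ID2LABEL` are hypothetical helper names, and the label order is taken from the comment in the snippet above.

```python
import math
from typing import Sequence, Tuple

# Label order as stated in the usage snippet: 0 = negative, 1 = positive
ID2LABEL = {0: "negative", 1: "positive"}


def logits_to_prediction(logits: Sequence[float]) -> Tuple[str, float]:
    """Softmax over the logits; returns (label, probability of that label)."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL[best], probs[best]


label, prob = logits_to_prediction([-2.0, 3.0])
print(label, round(prob, 4))
```

With the model loaded as above, the helper could be called as `logits_to_prediction(outputs.logits[0].tolist())`.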

## License

MIT