---
license: mit
language: en
library_name: transformers
pipeline_tag: text-classification
datasets:
- imdb
metrics:
- accuracy
tags:
- text-classification
- sentiment-analysis
- bert
- imdb
- adversarial-nlp
- textattack
---

# bert-imdb-512

`bert-base-uncased` fine-tuned on the IMDB sentiment classification dataset with `max_seq_length=512`.

Trained as a victim model for adversarial NLP research (TextBugger / TextFooler / DeepWordBug-style attacks). The 512-token input window (versus the 128 tokens typical of TextAttack baselines) lets roughly 95–98% of IMDB reviews fit without truncation and yields a stronger classifier.

## Model Details

- **Architecture**: `bert-base-uncased` (12 layers, 768 hidden, 12 heads, ~110M parameters)
- **Tokenization**: WordPiece (subwords)
- **Max sequence length**: 512 tokens
- **Task**: Binary sentiment classification (positive / negative)
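
To make the sequence-length limit concrete, here is a minimal sketch of how a 512-token window is spent. It relies on one known fact (BERT wraps a single input in `[CLS]` and `[SEP]`, leaving 510 positions for WordPiece tokens) and one assumption: that over-long reviews are cut from the end. The helper names and the list of review lengths are hypothetical and exist only for illustration.

```python
MAX_LEN = 512
SPECIAL_TOKENS = 2  # [CLS] and [SEP] added around a single sequence


def truncate_ids(token_ids, max_len=MAX_LEN):
    """Keep only the first (max_len - 2) content tokens (assumed head truncation)."""
    budget = max_len - SPECIAL_TOKENS
    return token_ids[:budget]


def fraction_untruncated(lengths, max_len=MAX_LEN):
    """Share of inputs whose content tokens fit inside the window."""
    budget = max_len - SPECIAL_TOKENS
    return sum(n <= budget for n in lengths) / len(lengths)


# Hypothetical WordPiece lengths for five reviews (not measured on IMDB).
example_lengths = [120, 300, 510, 511, 900]

fits_at_512 = fraction_untruncated(example_lengths)
fits_at_128 = fraction_untruncated(example_lengths, max_len=128)
print(fits_at_512, fits_at_128)
```

Running the same check over real tokenized IMDB reviews is how a coverage figure like the one quoted above could be reproduced.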

## Training

Trained from `bert-base-uncased` on the IMDB train split (25,000 examples) using [TextAttack](https://github.com/QData/TextAttack) 0.3.x.

| Hyperparameter | Value |
| --- | --- |
| Epochs | 5 |
| Per-device batch size | 8 |
| Gradient accumulation | 2 (effective batch 16) |
| Learning rate | 2e-5 |
| Weight decay | 0.01 |
| Warmup steps | 500 |
| Random seed | 786 |
| Hardware | NVIDIA RTX 3090 (24 GB) |
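
The hyperparameters above imply a rough training schedule, worked out in the short sketch below. It assumes a single GPU, one optimizer update per effective batch, and that the final partial batch of each epoch still triggers an update. TextAttack's internal step accounting may differ slightly, so treat the numbers as estimates.

```python
import math

# Values taken from the hyperparameter table.
train_examples = 25_000
per_device_batch = 8
grad_accum = 2
epochs = 5
warmup_steps = 500

# Assumption: one GPU, so effective batch = batch size x accumulation steps.
effective_batch = per_device_batch * grad_accum

# Assumption: the last partial batch of each epoch still counts as a step.
steps_per_epoch = math.ceil(train_examples / effective_batch)
total_steps = steps_per_epoch * epochs
warmup_fraction = warmup_steps / total_steps

print(f"effective batch:  {effective_batch}")
print(f"steps per epoch:  {steps_per_epoch}")
print(f"total steps:      {total_steps}")
print(f"warmup fraction:  {warmup_fraction:.1%}")
```

Under these assumptions the 500 warmup steps cover about 6% of training.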

Training command:

```
textattack train --model-name-or-path bert-base-uncased \
    --dataset imdb \
    --model-max-length 512 \
    --epochs 5 \
    --per-device-train-batch-size 8 \
    --gradient-accumulation-steps 2 \
    --learning-rate 2e-5 \
    --save-last \
    --output-dir ./models/bert-imdb-512
```

## Evaluation

Evaluated on the IMDB test split (25,000 examples) at the best-epoch checkpoint:

| Metric | Value |
| --- | --- |
| Accuracy | **94.14%** |

For reference, the equivalent TextAttack baseline at 128 tokens (`textattack/bert-base-uncased-imdb`) reports ~89% on the same test set.
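
As a rough illustration of what the gap means in absolute terms, the sketch below converts both accuracies into error counts on the 25,000-example test split. The baseline figure is only approximate (~89%), so the baseline counts and the relative error reduction are estimates, not measured results.

```python
test_examples = 25_000
accuracy_512 = 0.9414   # reported above
accuracy_128 = 0.89     # approximate baseline figure quoted above

correct_512 = round(accuracy_512 * test_examples)
errors_512 = test_examples - correct_512

# Estimate only: the baseline accuracy is given to two significant figures.
errors_128 = test_examples - round(accuracy_128 * test_examples)

relative_error_reduction = (errors_128 - errors_512) / errors_128

print(f"512-token model errors:   {errors_512}")
print(f"128-token baseline errors: ~{errors_128}")
print(f"relative error reduction:  ~{relative_error_reduction:.0%}")
```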

## How to Use

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("jongador/bert-imdb-512")
model = AutoModelForSequenceClassification.from_pretrained("jongador/bert-imdb-512")
model.eval()  # inference mode (disables dropout)

# Truncate to the 512-token window the model was trained with.
inputs = tokenizer("I loved this movie!", return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():  # gradients are not needed for inference
    outputs = model(**inputs)
prediction = outputs.logits.argmax(-1).item()  # 0 = negative, 1 = positive
```
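
If a confidence score is wanted alongside the predicted class, the two logits can be passed through a softmax. The sketch below is dependency-free and uses hypothetical logit values rather than real model output. In practice the input would be something like `outputs.logits[0].tolist()` from the snippet above. The helper names `softmax` and `describe` are illustrative, and the label mapping follows the `0 = negative, 1 = positive` convention stated in this card.

```python
import math

ID2LABEL = {0: "negative", 1: "positive"}  # mapping stated in this card


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def describe(logits):
    """Return (label, probability) for the highest-scoring class."""
    probs = softmax(logits)
    idx = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL[idx], probs[idx]


# Hypothetical logits for one review (not produced by the model).
label, confidence = describe([-1.2, 2.3])
print(label, round(confidence, 3))
```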

## License

MIT