---
language: en
tags:
- bert
- masked-language-model
- structbert
- dsa
---

# StructBERT Encoder

This model is a **StructBERT variant** fine-tuned on a custom Data Structures and Algorithms (DSA) corpus.

## Model Details

- **Architecture:** BERT (Masked Language Modeling)
- **Tokenizer:** BERT tokenizer
- **Training Data:** Merged DSA corpus (~32k lines)
- **Framework:** Hugging Face Transformers

## Intended Use

- Predict missing tokens in DSA-related text (see the pipeline sketch below)
- Research, education, and NLP experimentation
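
For quick mask-filling experiments, the checkpoint can also be loaded through the standard Transformers `fill-mask` pipeline (a minimal sketch; the example sentence is illustrative):

```python
from transformers import pipeline

# fill-mask returns the highest-scoring completions for each [MASK] token
fill_mask = pipeline("fill-mask", model="Saif10/StructBERT-encoder")

for candidate in fill_mask("A binary tree node has at most two [MASK]."):
    print(candidate["token_str"], round(candidate["score"], 3))
```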

## Limitations

- Trained on a small corpus (~32k lines), so it may not generalize beyond DSA content
- Token predictions may be biased toward patterns in the training data
- Not intended for production-grade applications

## Example Usage

```python
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("Saif10/StructBERT-encoder")
model = BertForMaskedLM.from_pretrained("Saif10/StructBERT-encoder")

text = "Binary search works by dividing the [MASK] into two halves."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Read the prediction at the [MASK] position rather than over the whole sequence
mask_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]
predicted_token_id = outputs.logits[0, mask_index].argmax(-1)
predicted_token = tokenizer.decode(predicted_token_id)
print(predicted_token)
```
|
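To inspect more than the single best candidate, you can rank the top-k logits at the mask position (a small extension of the snippet above, reusing `outputs` and `mask_index`):

```python
# Top-5 candidate tokens for the [MASK] position
top_ids = outputs.logits[0, mask_index].topk(5).indices[0].tolist()
print([tokenizer.decode([i]) for i in top_ids])
```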