---
language:
- en
license: mit
tags:
- roberta
- mlm
- small-model
- academic-language
- synthetic-data
- resource-efficient
- consumer-hardware
pipeline_tag: fill-mask
datasets:
- HuggingFaceTB/cosmopedia
library_name: transformers
---

# CoBERTa: 24M Parameter Academic Language Model

<div align="center">
<img src="https://img.shields.io/badge/Parameters-24.5M-blue" alt="24.5M Parameters">
<img src="https://img.shields.io/badge/License-MIT-yellow" alt="MIT License">
</div>

## Model Description

CoBERTa is a **24.5 million parameter** RoBERTa-style masked language model trained specifically on synthetic academic data. It demonstrates that **high-quality synthetic data** can compensate for limited model scale, enabling domain specialization on consumer hardware.

It achieves strong academic-language modeling with roughly 5× fewer parameters than comparable base-size encoders, and was trained in around 6 hours on a MacBook Pro.

### Model Architecture
- **Type**: Encoder-only transformer
- **Layers**: 12
- **Hidden Size**: 192
- **Attention Heads**: 6
- **Parameters**: 24,506,224
- **Vocabulary**: 35,000 tokens
- **Maximum Sequence Length**: 512 tokens
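
These hyperparameters can be expressed as a `transformers` config. The sketch below is illustrative, not the released configuration: the feed-forward (intermediate) size is not listed in this card, so the conventional 4× hidden ratio is assumed, and the resulting parameter count will only match 24,506,224 if the remaining hyperparameters line up.

```python
from transformers import RobertaConfig, RobertaForMaskedLM

config = RobertaConfig(
    vocab_size=35_000,
    hidden_size=192,
    num_hidden_layers=12,
    num_attention_heads=6,
    intermediate_size=768,        # assumption: 4 * hidden_size, not stated in the card
    max_position_embeddings=514,  # RoBERTa reserves two extra positions beyond 512
)

model = RobertaForMaskedLM(config)
print(f"{model.num_parameters():,} parameters")
```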

## Training Details

### Training Data
- **Source**: 50,000 filtered samples from [Cosmopedia](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia)
- **License**: MIT
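
A sketch of how such a subset could be drawn with the `datasets` library; the card does not specify which Cosmopedia subset or filtering criteria were used, so both are placeholders:

```python
import itertools

from datasets import load_dataset

# Placeholder subset name and length filter; the actual filtering
# behind the 50,000 samples is not documented here.
ds = load_dataset("HuggingFaceTB/cosmopedia", "openstax", split="train", streaming=True)
filtered = (ex["text"] for ex in ds if len(ex["text"].split()) > 100)
samples = list(itertools.islice(filtered, 50_000))
```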

### Training Procedure
- **Framework**: Apple MLX
- **Hardware**: MacBook Pro M4 (16GB unified memory)
- **Training Time**: 6 hours
- **Batch Size**: 32
- **Learning Rate**: 9e-4 with linear warmup
- **Objective**: Masked language modeling (15% token masking)
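
The actual training loop ran in Apple MLX, but the 15% masking objective matches the standard dynamic-masking recipe, which can be reproduced in `transformers` as a reference:

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("frogd51/coberta-base")

# 15% of tokens are selected for the MLM loss; the collator then applies the
# BERT/RoBERTa recipe (80% mask token, 10% random token, 10% unchanged).
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.15,
)

batch = collator([tokenizer("Academic writing rewards precision and clarity.")])
print(batch["input_ids"])  # masked inputs
print(batch["labels"])     # targets (-100 at unmasked positions)
```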

## Intended Uses & Limitations

You can use the raw model for masked language modeling, but it is mostly intended to be fine-tuned on a downstream task. See the model hub for fine-tuned versions on a task that interests you.

Note that this model is primarily aimed at tasks that use the whole sentence (potentially masked) to make decisions, such as sequence classification, token classification, or question answering. For text generation, look at an autoregressive model such as GPT-2.
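
As a minimal sketch of that fine-tuning path (the two-label setup is a hypothetical example, not part of this release), the encoder can be loaded with a randomly initialized task head:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Hypothetical binary-classification setup; the new head is randomly
# initialized and must be trained on labeled data.
model = AutoModelForSequenceClassification.from_pretrained(
    "frogd51/coberta-base", num_labels=2
)
tokenizer = AutoTokenizer.from_pretrained("frogd51/coberta-base")
```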

### Basic Usage
```python
from transformers import AutoModelForMaskedLM, AutoTokenizer
import torch

model = AutoModelForMaskedLM.from_pretrained("frogd51/coberta-base")
tokenizer = AutoTokenizer.from_pretrained("frogd51/coberta-base")

# Use the tokenizer's own mask token (<mask> for RoBERTa-style vocabularies);
# hard-coding [MASK] would not match tokenizer.mask_token_id below.
text = f"The key to effective communication is to {tokenizer.mask_token} clearly and listen actively."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)
predictions = outputs.logits

# Get the top 5 predictions for the masked position
mask_token_index = torch.where(inputs.input_ids == tokenizer.mask_token_id)[1]
mask_token_logits = predictions[0, mask_token_index, :]
top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()

for token in top_5_tokens:
    print(tokenizer.decode([token]))
```
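
For quick experiments, the same fill-mask flow is also available through the `pipeline` API (a sketch; scores and rankings depend on the trained weights):

```python
from transformers import pipeline

# Wraps tokenization, inference, and decoding in one call; the prompt uses
# <mask>, assuming a RoBERTa-style mask token.
unmasker = pipeline("fill-mask", model="frogd51/coberta-base")
for pred in unmasker("The key to effective communication is to <mask> clearly and listen actively."):
    print(pred["token_str"], round(pred["score"], 3))
```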