---
language:
- en
license: mit
tags:
- roberta
- mlm
- small-model
- academic-language
- synthetic-data
- resource-efficient
- consumer-hardware
pipeline_tag: fill-mask
datasets:
- HuggingFaceTB/cosmopedia
library_name: transformers
---

# CoBERTa: 24M Parameter Academic Language Model

<div align="center">
  <img src="https://img.shields.io/badge/Parameters-24.5M-blue" alt="24.5M Parameters">
  <img src="https://img.shields.io/badge/License-MIT-yellow" alt="MIT License">
</div>

## Model Description

CoBERTa is a **24.5 million parameter** RoBERTa-style masked language model trained specifically on synthetic academic data. It demonstrates that **high-quality synthetic data** can compensate for limited model scale, enabling domain specialization on consumer hardware.

It achieves academic-language proficiency with roughly 5× fewer parameters than comparable models, and was trained in around 6 hours on a MacBook Pro.

### Model Architecture

- **Type**: Encoder-only transformer
- **Layers**: 12
- **Hidden size**: 192
- **Attention heads**: 6
- **Parameters**: 24,506,224
- **Vocabulary**: 35,000 tokens
- **Maximum sequence length**: 512 tokens

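The architecture above maps onto a standard `transformers` configuration. A minimal sketch, with two caveats: `intermediate_size` is an assumption (it is not stated on this card; `4 × hidden_size` is the conventional default), and RoBERTa-style models reserve two extra position embeddings, so a 512-token context uses `max_position_embeddings=514`.

```python
from transformers import RobertaConfig

# Sketch of a config matching the table above; intermediate_size is an
# assumption (4 * hidden_size is the RoBERTa convention, not stated here).
config = RobertaConfig(
    vocab_size=35_000,           # Vocabulary: 35,000 tokens
    hidden_size=192,             # Hidden size: 192
    num_hidden_layers=12,        # Layers: 12
    num_attention_heads=6,       # Attention heads: 6
    intermediate_size=768,       # assumed: 4 * hidden_size
    max_position_embeddings=514, # 512 tokens + 2 RoBERTa offset positions
)
```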
## Training Details

### Training Data

- **Source**: 50,000 filtered samples from [Cosmopedia](https://huggingface.co/datasets/HuggingFaceTB/cosmopedia)
- **License**: MIT

### Training Procedure

- **Framework**: Apple MLX
- **Hardware**: MacBook Pro M4 (16 GB unified memory)
- **Training time**: 6 hours
- **Batch size**: 32
- **Learning rate**: 9e-4 with linear warmup
- **Objective**: Masked language modeling (15% token masking)

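The masking objective can be sketched as follows. This card only states that 15% of tokens are masked; the 80/10/10 split (mask token / random token / unchanged) shown here is the standard BERT/RoBERTa recipe and is an assumption about this model's training.

```python
import random

def mask_tokens(token_ids, mask_id, vocab_size, mask_prob=0.15, seed=0):
    """Return (masked_ids, labels) for masked language modeling.

    Assumes the standard BERT/RoBERTa recipe: of the selected 15% of
    positions, 80% become the mask token, 10% a random token, and 10%
    stay unchanged. Labels are -100 (ignored by the loss) everywhere
    except the selected positions, where they hold the original token.
    """
    rng = random.Random(seed)
    masked = list(token_ids)
    labels = [-100] * len(token_ids)
    n_select = max(1, round(mask_prob * len(token_ids)))
    for pos in rng.sample(range(len(token_ids)), n_select):
        labels[pos] = token_ids[pos]                  # predict the original token
        roll = rng.random()
        if roll < 0.8:
            masked[pos] = mask_id                     # 80%: replace with mask token
        elif roll < 0.9:
            masked[pos] = rng.randrange(vocab_size)   # 10%: random token
        # else: 10% keep the original token
    return masked, labels
```

During training, the model sees `masked` as input and is penalized only at positions where `labels` is not -100.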
## Intended Uses & Limitations

You can use the raw model for masked language modeling, but it is mostly intended to be fine-tuned on a downstream task. See the model hub for fine-tuned versions on a task that interests you.

Note that this model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked) to make decisions, such as sequence classification, token classification, or question answering. For text generation you should look at an autoregressive model such as GPT-2.

### Basic Usage
```python
from transformers import AutoModelForMaskedLM, AutoTokenizer
import torch

model = AutoModelForMaskedLM.from_pretrained("frogd51/coberta-base")
tokenizer = AutoTokenizer.from_pretrained("frogd51/coberta-base")

# Use the tokenizer's own mask token rather than a hard-coded "[MASK]":
# RoBERTa-style tokenizers typically use "<mask>".
text = f"The key to effective communication is to {tokenizer.mask_token} clearly and listen actively."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)
predictions = outputs.logits

# Get the top-5 predictions for the masked position
mask_token_index = torch.where(inputs.input_ids == tokenizer.mask_token_id)[1]
mask_token_logits = predictions[0, mask_token_index, :]
top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()

for token in top_5_tokens:
    print(tokenizer.decode([token]))
```