ajeet9843 committed · Commit 58cdb3a · verified · 1 Parent(s): 7cfd188

Upload README.md with huggingface_hub

Files changed (1): README.md +95 -0
---
language: en
license: apache-2.0
tags:
- bert
- masked-language-modeling
- pretraining
- nlp
- pytorch
- streaming
- ddp
datasets:
- HuggingFaceFW/fineweb-edu
library_name: pytorch
pipeline_tag: fill-mask
model_type: bert
---
# BERT-PRETRAINED-EDU

## Overview
 
## Limitations
- Not instruction-tuned
- Not chat-optimized

## Use Cases

### 1. Masked Language Modeling
```python
import torch
from transformers import AutoTokenizer

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    "your_hf_username/bert-edu-pretrained-384d"
)

# Load model weights (custom architecture)
from model import BERT, BERTConfig  # your model definition

config = BERTConfig(
    vocab_size=30522,
    dim=384,
    n_layers=6,
    n_heads=6,
    seq_len=128,
)

model = BERT(config)
model.load_state_dict(
    torch.load("pytorch_model.bin", map_location="cpu")
)
model.eval()

# Input with [MASK]
text = "Machine learning is [MASK]."
inputs = tokenizer(text, return_tensors="pt")

# Forward pass (segment ids are all zeros for single-sentence input)
with torch.no_grad():
    logits = model(
        inputs["input_ids"],
        torch.zeros_like(inputs["input_ids"]),
    )

# Find the first [MASK] position and decode the top-scoring token
mask_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero()[0, 1]
pred_id = logits[0, mask_index].argmax(dim=-1)

print("Prediction:", tokenizer.decode(pred_id))
```
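
The `.nonzero()[0, 1]` lookup above simply recovers the column index of the first `[MASK]` token. The same logic in plain Python, using an illustrative id sequence with bert-base-uncased's special-token conventions (`[CLS]`=101, `[SEP]`=102, `[MASK]`=103; the word-piece ids are for illustration only):

```python
# Illustrative token ids for "machine learning is [MASK] ."
# [CLS]=101 ... [MASK]=103 ... [SEP]=102 (bert-base-uncased conventions)
input_ids = [101, 3698, 4083, 2003, 103, 1012, 102]
mask_token_id = 103

# Equivalent of (input_ids == mask_token_id).nonzero()[0, 1]:
# the position of the first [MASK] token in the sequence
mask_index = next(i for i, tok in enumerate(input_ids) if tok == mask_token_id)
print(mask_index)  # → 4
```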

### 2. Finetuning for Sentiment Classification
```python
import torch
import torch.nn as nn

class BertForSentiment(nn.Module):
    def __init__(self, bert, hidden_size, num_labels=2):
        super().__init__()
        self.bert = bert
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, input_ids, attention_mask=None):
        seg = torch.zeros_like(input_ids)
        hidden_states = self.bert(input_ids, seg)
        cls_token = hidden_states[:, 0, :]  # [CLS] representation
        return self.classifier(cls_token)

# Load the pretrained encoder and attach a classification head
bert = BERT(config)
bert.load_state_dict(
    torch.load("pytorch_model.bin", map_location="cpu")
)

model = BertForSentiment(bert, hidden_size=384)

# Finetune on sentiment data (IMDB)
from datasets import load_dataset

dataset = load_dataset("imdb")

# tokenize → train → evaluate
```
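
The closing `tokenize → train → evaluate` comment can be expanded into a loop along the lines of the sketch below. Everything here is illustrative rather than the repo's actual training script: `TinyEncoder` is a stand-in for the custom `BERT` (so the snippet runs on its own), and the random batch replaces real tokenized IMDB reviews; only the loop structure is meant to carry over.

```python
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    """Stand-in for the custom BERT encoder above: maps (input_ids,
    segment_ids) to per-token hidden states of size `dim`."""
    def __init__(self, vocab_size=30522, dim=384):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)

    def forward(self, input_ids, segment_ids):
        return self.embed(input_ids)  # (batch, seq_len, dim)

class BertForSentiment(nn.Module):
    def __init__(self, bert, hidden_size, num_labels=2):
        super().__init__()
        self.bert = bert
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, input_ids):
        seg = torch.zeros_like(input_ids)
        hidden = self.bert(input_ids, seg)
        return self.classifier(hidden[:, 0, :])  # classify from the [CLS] slot

torch.manual_seed(0)
model = BertForSentiment(TinyEncoder(), hidden_size=384)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
loss_fn = nn.CrossEntropyLoss()

# Synthetic batch standing in for tokenized IMDB reviews
input_ids = torch.randint(0, 30522, (8, 128))
labels = torch.randint(0, 2, (8,))

# Train: a few optimizer steps
model.train()
for _ in range(3):
    loss = loss_fn(model(input_ids), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Evaluate: accuracy on the same batch (real code would use a held-out split)
model.eval()
with torch.no_grad():
    preds = model(input_ids).argmax(dim=-1)
accuracy = (preds == labels).float().mean().item()
```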