YashikaNagpal committed
Commit 5a886e4 · verified · 1 Parent(s): c9163c8

Update README.md

Files changed (1):
  1. README.md +64 -44
README.md CHANGED
@@ -6,70 +6,90 @@ This repository hosts a fine-tuned version of the **FacebookAI/roberta-base** mo
  - **Model Architecture:** RoBERTa
  - **Task:** Mask Filling
  - **Dataset:** Hugging Face's `Salesforce/wikitext` (wikitext-2-raw-v1)
- - **Quantization:** None (Fine-tuned without quantization)
  - **Fine-tuning Framework:** Hugging Face Transformers

  ## Usage
  ### Installation
  ```sh
- pip install transformers torch datasets
- ```
-
- ### Loading the Model
- ```python
- from transformers import RobertaTokenizer, RobertaForMaskedLM
  import torch

- device = "cuda" if torch.cuda.is_available() else "cpu"
-
- model_name = "FacebookAI/roberta-base"
  tokenizer = RobertaTokenizer.from_pretrained(model_name)
- model = RobertaForMaskedLM.from_pretrained(model_name).to(device)
-
- def fill_mask(text, model, tokenizer):
-     """Fill masked tokens in input text using the fine-tuned model."""
-     # Tokenize input & move to the correct device
-     inputs = tokenizer(text, return_tensors="pt").to(device)
-
-     # Generate predictions
      with torch.no_grad():
          outputs = model(**inputs)
      logits = outputs.logits
-
-     # Locate the masked position and take its most likely token
-     masked_index = (inputs.input_ids[0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0].item()
-     predicted_token_id = torch.argmax(logits[0, masked_index])
-
-     # Decode the predicted token
      predicted_token = tokenizer.decode(predicted_token_id)
-     return predicted_token

- # Test example (RoBERTa's mask token is <mask>, not [MASK])
- text = "The quick brown fox jumps over the lazy <mask>."
- predicted_token = fill_mask(text, model, tokenizer)
- print(f"Predicted Token: {predicted_token}")
- ```
  ## 📊 Evaluation Results
  After fine-tuning the RoBERTa-base model for mask filling, we evaluated the model's performance on the validation set from the Salesforce/wikitext dataset. The following results were obtained:

  | Metric | Score | Meaning |
  |--------|-------|---------|
- | Accuracy | 85% | Proportion of masked tokens predicted correctly. |
- | Loss | 0.35 | Cross-entropy loss of the model's predictions. |

- ## Fine-Tuning Details
- ### Dataset
- The Salesforce/wikitext dataset (specifically wikitext-2-raw-v1) was used for fine-tuning. This dataset consists of a large collection of raw text, making it suitable for language modeling tasks such as mask filling.

- ### Training
- - Number of epochs: 5
- - Batch size: 16
- - Evaluation strategy: every 1000 steps

- ## Repository Structure
- ```
  .
- ├── model/             # Contains the fine-tuned model files
  ├── tokenizer_config/  # Tokenizer configuration and vocabulary files
  ├── README.md          # Model documentation
- ```

  ## Limitations
  - The model is primarily trained on the wikitext-2 dataset and may not perform well on highly domain-specific text without additional fine-tuning.
  - The model may not handle edge cases involving unusual grammar or rare words as effectively.
 
  - **Model Architecture:** RoBERTa
  - **Task:** Mask Filling
  - **Dataset:** Hugging Face's `Salesforce/wikitext` (wikitext-2-raw-v1)
+ - **Quantization:** FP16
  - **Fine-tuning Framework:** Hugging Face Transformers

  ## Usage
  ### Installation
  ```sh
+ pip install transformers torch
+ ```
+
+ ### Loading the Model
+ ```python
+ from transformers import RobertaForMaskedLM, RobertaTokenizer
  import torch

+ # Load the fine-tuned RoBERTa model and tokenizer
+ model_name = 'roberta_finetuned'  # Your fine-tuned RoBERTa model
+ model = RobertaForMaskedLM.from_pretrained(model_name)
  tokenizer = RobertaTokenizer.from_pretrained(model_name)
+
+ # Move the model to GPU if available
+ device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+ model.to(device)
+
+ # Quantize the model to FP16 (half precision is best suited to GPU inference)
+ model = model.half()
+
+ # Save the quantized model and tokenizer
+ model.save_pretrained("./quantized_roberta_model")
+ tokenizer.save_pretrained("./quantized_roberta_model")
+
+ # Example inputs for testing (10 sentences)
+ input_texts = [
+     "The sky is <mask> during the night.",
+     "Machine learning is a subset of <mask> intelligence.",
+     "The largest planet in the solar system is <mask>.",
+     "The Eiffel Tower is located in <mask>.",
+     "The sun rises in the <mask>.",
+     "Mount Everest is the highest mountain in the <mask>.",
+     "The capital of Japan is <mask>.",
+     "Shakespeare wrote Romeo and <mask>.",
+     "The currency of the United States is <mask>.",
+     "The fastest land animal is the <mask>."
+ ]
+
+ # Process each input sentence
+ for input_text in input_texts:
+     # Tokenize input text
+     inputs = tokenizer(input_text, return_tensors="pt").to(device)
+
+     # Perform inference
+     with torch.no_grad():
+         outputs = model(**inputs)
+     logits = outputs.logits
+
+     # Get the prediction for the masked token
+     masked_index = inputs.input_ids[0].tolist().index(tokenizer.mask_token_id)
+     predicted_token_id = logits[0, masked_index].argmax(dim=-1)
+     predicted_token = tokenizer.decode(predicted_token_id)
+
+     print(f"Input: {input_text}")
+     print(f"Predicted token: {predicted_token}\n")
+ ```
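+
+ To reload the saved checkpoint later, a minimal sketch (assuming the `./quantized_roberta_model` path from the save step above; `torch_dtype` keeps the weights in half precision):
+
+ ```python
+ import torch
+ from transformers import RobertaForMaskedLM, RobertaTokenizer
+
+ # Reload the FP16 checkpoint saved in the usage example above
+ model = RobertaForMaskedLM.from_pretrained("./quantized_roberta_model", torch_dtype=torch.float16)
+ tokenizer = RobertaTokenizer.from_pretrained("./quantized_roberta_model")
+ ```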
 
  ## 📊 Evaluation Results
  After fine-tuning the RoBERTa-base model for mask filling, we evaluated the model's performance on the validation set from the Salesforce/wikitext dataset. The following results were obtained:

  | Metric | Score | Meaning |
  |--------|-------|---------|
+ | BLEU | 0.8 | Measures n-gram overlap between predicted and reference text. |
+
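+ As a rough illustration (a sketch, not code from this repository), a BLEU score like the one above can be computed with the `evaluate` library (`pip install evaluate`):
+
+ ```python
+ import evaluate
+
+ # Hypothetical predictions and references, purely for illustration
+ bleu = evaluate.load("bleu")
+ predictions = ["The sky is dark during the night."]
+ references = [["The sky is dark during the night."]]
+ print(bleu.compute(predictions=predictions, references=references)["bleu"])
+ ```
+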
+ ## Fine-Tuning Details
+ ### Dataset
+ The Hugging Face `medical-qa-datasets` dataset was used, containing different types of patient and doctor questions with their respective answers.
+
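+ Note that the header and evaluation section of this card instead reference `Salesforce/wikitext`; a minimal sketch of loading that dataset with the `datasets` library, should you want to reproduce the language-modeling setup:
+
+ ```python
+ from datasets import load_dataset
+
+ # Load the wikitext-2 raw variant referenced elsewhere in this card
+ dataset = load_dataset("Salesforce/wikitext", "wikitext-2-raw-v1")
+ print(dataset["train"][0]["text"])
+ ```
+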
+ ### Training
+ - **Number of epochs:** 3
+ - **Batch size:** 8
+ - **Evaluation strategy:** steps
+
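+ These hyperparameters map onto the Transformers `Trainer` API roughly as follows (a sketch; `output_dir` and the 1000-step interval from the earlier revision of this card are assumptions, not values recorded here):
+
+ ```python
+ from transformers import TrainingArguments
+
+ # Hypothetical configuration mirroring the hyperparameters listed above
+ training_args = TrainingArguments(
+     output_dir="./roberta_finetuned",  # assumed output path
+     num_train_epochs=3,
+     per_device_train_batch_size=8,
+     evaluation_strategy="steps",       # evaluate every eval_steps
+     eval_steps=1000,                   # assumed interval
+ )
+ ```
+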
+ ### Quantization
+ The model was converted to FP16 (half precision) using PyTorch's built-in `half()` conversion, reducing the model size and improving inference efficiency.
+
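+ As a quick sanity check (a sketch, using the base `FacebookAI/roberta-base` weights as a stand-in), casting to FP16 should roughly halve the bytes occupied by the parameters:
+
+ ```python
+ from transformers import RobertaForMaskedLM
+
+ def param_bytes(m):
+     # Total bytes occupied by the model's parameters
+     return sum(p.numel() * p.element_size() for p in m.parameters())
+
+ model = RobertaForMaskedLM.from_pretrained("FacebookAI/roberta-base")
+ fp32_bytes = param_bytes(model)
+ fp16_bytes = param_bytes(model.half())
+ print(f"FP32: {fp32_bytes / 1e6:.0f} MB -> FP16: {fp16_bytes / 1e6:.0f} MB")
+ ```
+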
+ ## Repository Structure
+ ```
  .
+ ├── model/             # Contains the quantized model files
  ├── tokenizer_config/  # Tokenizer configuration and vocabulary files
+ ├── model.safetensors  # Quantized model weights
  ├── README.md          # Model documentation
+ ```
+
  ## Limitations
  - The model is primarily trained on the wikitext-2 dataset and may not perform well on highly domain-specific text without additional fine-tuning.
  - The model may not handle edge cases involving unusual grammar or rare words as effectively.