# FLAN-T5-Small Fine-Tuned on Red Hat Documentation

## Overview
This repository hosts a fine-tuned **FLAN-T5-Small** model for question answering over Red Hat documentation. The model was fine-tuned using Low-Rank Adaptation (LoRA) and 4-bit quantization on a Google Colab T4 GPU (~15 GB VRAM, CUDA 11.8). The training dataset, `redhat-docs_dataset`, contains 55,741 rows in JSONL format with the fields `title`, `content`, `command`, and `url`. The model is strongest at extracting commands (e.g., `yum install X`) and summarizing procedures.

Project details are available on GitHub: [mtptisid/FLAN-T5-Small_finetuning_LoRA](https://github.com/mtptisid/FLAN-T5-Small_finetuning_LoRA).

## Model Details
- **Base Model**: `google/flan-t5-small` (77M parameters).
- **Fine-Tuning**: LoRA (r=8, alpha=32, target_modules=["q", "v"], dropout=0.1); see the configuration sketch after this list.
- **Quantization**: 4-bit NormalFloat (nf4) with bfloat16 compute dtype (~6-8 GB VRAM).
- **Task**: Question answering on Red Hat documentation.
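
The LoRA and quantization settings above correspond to standard `peft` and `bitsandbytes` configuration objects. The following is a minimal sketch assuming the usual Hugging Face APIs; `finetune_script.py` in this repo is the authoritative version and may differ in details.

```python
import torch
from transformers import AutoModelForSeq2SeqLM, BitsAndBytesConfig
from peft import LoraConfig, TaskType, get_peft_model, prepare_model_for_kbit_training

# 4-bit NormalFloat quantization with bfloat16 compute, as listed above.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForSeq2SeqLM.from_pretrained(
    "google/flan-t5-small",
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)  # standard step before k-bit training

# LoRA on the T5 attention query/value projections, matching the card.
lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=["q", "v"],
    lora_dropout=0.1,
    task_type=TaskType.SEQ_2_SEQ_LM,
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # a small fraction of the 77M parameters
```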

## Dataset
The `redhat-docs_dataset` contains 55,741 entries, each with four fields (an illustrative record follows the list):
- **title**: Documentation section title.
- **content**: Detailed procedure or concept.
- **command**: Associated command (may be `null`).
- **url**: Reference URL.
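
For illustration, a record in this schema looks like the line below. The values are borrowed from the Usage example further down; the URL is a placeholder, not an actual dataset entry.

```json
{"title": "Installing Package X", "content": "To install Package X, use the package manager yum.", "command": "yum install X", "url": "https://docs.redhat.com/..."}
```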

### Preprocessing
- Null `command` fields are set to `""`; missing `title`/`content` fall back to `"Untitled"`/`""`.
- Each record is formatted into a `text` field: `"Title: {title}\nContent: {content}\nCommand: {command}"`.
- Tokenized with `max_length=512`, truncation, and padding (see the sketch after this list).
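
A sketch of this pipeline. The field fallbacks and prompt template follow the card; reading the JSONL with `datasets` and the helper names are assumptions.

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-small")
dataset = load_dataset("json", data_files="data/redhat-docs_dataset.jsonl")["train"]

def format_example(example):
    # Fallbacks described above: null command -> "", missing title/content
    # -> "Untitled" / "".
    title = example.get("title") or "Untitled"
    content = example.get("content") or ""
    command = example.get("command") or ""
    example["text"] = f"Title: {title}\nContent: {content}\nCommand: {command}"
    return example

def tokenize(batch):
    return tokenizer(batch["text"], max_length=512, truncation=True, padding="max_length")

formatted = dataset.map(format_example)
tokenized = formatted.map(tokenize, batched=True)
```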

### Artifacts
- `data/redhat-docs_dataset.jsonl`: Original dataset.
- `data/formatted_dataset.jsonl`: Preprocessed dataset.
- `data/tokenized_dataset.jsonl`: Tokenized dataset.

## Training
- **Hardware**: NVIDIA T4 GPU, CUDA 11.8.
- **Epochs**: 2 (~4-8 hours).
- **Batch Size**: Effective 32 (4 per device, 8 gradient accumulation steps); see the sketch after this list.
- **Optimizer**: Paged AdamW 8-bit.
- **Mixed Precision**: FP16.
- **Dependencies**: PyTorch 2.3.1, Transformers 4.46.0, BitsAndBytes 0.43.3, Triton 2.0.0, Datasets 3.0.2, PEFT 0.13.2.
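
These hyperparameters map directly onto `TrainingArguments`. The sketch continues from the snippets above; `output_dir`, logging settings, and the label construction are not stated on this card and are assumptions.

```python
from transformers import Trainer, TrainingArguments

# Assumption: labels are not described above; a simple choice for this kind
# of self-supervised fine-tuning is to predict the formatted text itself.
tokenized = tokenized.map(lambda batch: {"labels": batch["input_ids"]}, batched=True)

training_args = TrainingArguments(
    output_dir="model",               # assumption: matches the repo layout
    num_train_epochs=2,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,    # effective batch size 4 * 8 = 32
    optim="paged_adamw_8bit",
    fp16=True,
    logging_steps=100,                # assumption
)

trainer = Trainer(
    model=model,                      # PEFT-wrapped model from the sketch above
    args=training_args,
    train_dataset=tokenized,
)
trainer.train()
```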

## Repository Structure
- `model/`: Model weights and tokenizer.
- `data/`: Dataset files.
- `finetune_script.py`: Training script.
- `README.md`: This file.

## Usage
Load the model for inference on Red Hat documentation queries. Example (a runnable sketch follows the list):
- **Question**: "How do I install Package X?"
- **Context**: "Title: Installing Package X\nContent: To install Package X, use the package manager yum.\nCommand: yum install X"
- **Output**: "Run `yum install X`"
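
A minimal inference sketch. The prompt layout follows the example above, but the exact template used during training is not stated on this card, so treat it as an assumption.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load from the local model/ directory (or the Hub repo id). If model/ holds
# only LoRA adapters, load with peft's AutoPeftModelForSeq2SeqLM instead.
tokenizer = AutoTokenizer.from_pretrained("model")
model = AutoModelForSeq2SeqLM.from_pretrained("model")

context = (
    "Title: Installing Package X\n"
    "Content: To install Package X, use the package manager yum.\n"
    "Command: yum install X"
)
prompt = f"Question: How do I install Package X?\nContext: {context}\nAnswer:"

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```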

## Installation
Requires PyTorch 2.3.1, Transformers 4.46.0, and a CUDA-capable GPU for 4-bit quantization.
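
One way to install the pinned versions from the dependency list under Training. `triton` is omitted here on the assumption that the `torch` wheel resolves its own compatible version.

```bash
pip install torch==2.3.1 transformers==4.46.0 bitsandbytes==0.43.3 datasets==3.0.2 peft==0.13.2
```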

## Limitations
- Some dataset entries have `null` commands, which can weaken answers to command-oriented queries.
- Trained for only 2 epochs; more epochs may improve performance.
- Specialized for Red Hat documentation; expect weaker results on out-of-domain queries.

## Future Work
- Add synthetic Q&A data.
- Implement retrieval for dynamic context.
- Evaluate with BLEU/ROUGE metrics.

## License
MIT License. Verify the licensing of `redhat-docs_dataset` separately.

## Acknowledgments
- Google FLAN-T5 Team
- Hugging Face
- Red Hat Documentation Team

## Contact
Open issues on [GitHub](https://github.com/mtptisid/FLAN-T5-Small_finetuning_LoRA) or contact `mtpti5iD` via Hugging Face.

*Last Updated: June 17, 2025*