callidus committed on
Commit 8ed3baf · verified · 1 Parent(s): ba29aa6

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +70 -130
README.md CHANGED
@@ -1,162 +1,102 @@
- ---
- language: en
- license: mit
- tags:
- - text-generation
- - transformer
- - custom-model
- - pytorch
- - from-scratch
- datasets:
- - custom
- metrics:
- - perplexity
- widget:
- - text: "artificial intelligence"
-   example_title: "AI Prompt"
- - text: "machine learning"
-   example_title: "ML Prompt"
- - text: "neural networks"
-   example_title: "Neural Networks"
- ---
-
- # Custom Transformer Text Generation Model (Fixed & Working!)
-
- ## 🎯 Model Description
-
- This is a **custom-built Transformer model trained from scratch** for text generation.
-
- **Status**: ✅ Fixed and properly generating text (no more `<UNK>` tokens!)
-
- ### Model Architecture
-
- | Component | Value |
- |-----------|-------|
- | **Model Type** | Transformer (Decoder-only) |
- | **Total Parameters** | 455,397 |
- | **Embedding Dimension** | 128 |
- | **Number of Layers** | 2 |
- | **Attention Heads** | 4 |
- | **Vocabulary Size** | 229 |
- | **Context Length** | 64 tokens |
- | **Framework** | PyTorch 2.0+ |
-
- ### Performance Metrics
-
- - **Perplexity**: 1.33
- - **Training Epochs**: 30
- - **Training Data Size**: ~50,000 words
- - **Accuracy**: ~40-50% next-token prediction
-
- ## 🚀 Quick Start
 
  ### Installation
 
  ```bash
- pip install torch huggingface_hub
  ```
 
  ### Usage
 
  ```python
- import torch
- import json
- from huggingface_hub import hf_hub_download
-
- # Download model files
- repo_id = "YOUR_USERNAME/YOUR_REPO_NAME"
- config_path = hf_hub_download(repo_id=repo_id, filename="model_config.json")
- weights_path = hf_hub_download(repo_id=repo_id, filename="model_weights.pt")
- tokenizer_path = hf_hub_download(repo_id=repo_id, filename="tokenizer.json")
-
- # Load configuration
- with open(config_path, 'r') as f:
-     config = json.load(f)
-
- # Load tokenizer
- with open(tokenizer_path, 'r') as f:
-     tokenizer_data = json.load(f)
-
- # Reconstruct model (use the TransformerModel class from the code)
- model = TransformerModel(**config)
- model.load_state_dict(torch.load(weights_path))
- model.eval()
-
- # Generate text
- prompt = "artificial intelligence"
- # Use the generate_text function to create text
- ```
 
- ## 📊 Example Generations
 
  ```
- Input: "artificial intelligence"
- Output: "artificial intelligence systems process information using neural networks..."
 
- Input: "machine learning"
- Output: "machine learning algorithms learn from data and make predictions..."
 
- Input: "neural networks"
- Output: "neural networks are inspired by the human brain structure..."
  ```
 
- ## 🔧 What Was Fixed
 
- **Version 2.0 Improvements:**
- - ✅ Fixed vocabulary building (2,000 tokens optimized)
- - ✅ Increased training data (50x repetition)
- - ✅ Reduced model size for better learning
- - ✅ Improved tokenization (no more excessive `<UNK>` tokens)
- - ✅ Better generation function (filters out special tokens)
- - ✅ Enhanced training monitoring (loss + accuracy)
 
- ## 📝 Training Details
 
- ### Training Configuration
- - **Optimizer**: Adam (lr=0.0005)
- - **Loss Function**: Cross-Entropy Loss
- - **Batch Size**: 64
- - **Sequence Length**: 64 tokens
- - **Gradient Clipping**: Max norm 1.0
- - **Learning Rate Schedule**: StepLR (step=5, gamma=0.5)
 
- ### Training Data
- - Custom corpus with AI/ML domain text
- - ~50,000 words of training data
- - Repeated and augmented for better coverage
 
- ## ⚠️ Limitations
 
- - Trained on limited custom data (AI/ML domain)
- - May generate repetitive text for longer sequences
- - Context window limited to 64 tokens
- - Best for short text generation (20-50 tokens)
- - Not fine-tuned for specific tasks
 
- ## 🎓 Educational Purpose
 
- This model was built **from scratch** as a learning project to understand:
- - Transformer architecture (Q, K, V, O matrices)
- - Multi-head attention mechanisms
- - Positional encoding
- - Training deep learning models
- - Text generation techniques
 
- ## 📄 License
 
- MIT License - Free to use, modify, and distribute
 
- ## 🙏 Acknowledgments
 
- Built using:
- - PyTorch
- - Hugging Face Hub
- - Google Colab (Free GPU)
 
- ## 📞 Contact
 
- For questions or improvements, please open an issue on the model repository.
 
- ---
 
- **Note**: This is a custom educational model. For production use, consider fine-tuning larger pre-trained models like GPT-2 or LLaMA.
 
+ # CodeBasics FAQ System
+
+ An intelligent FAQ retrieval system for CodeBasics bootcamp questions using TF-IDF and cosine similarity.
+
+ ## Features
+
+ - 🎯 Smart question matching using TF-IDF
+ - 📊 Confidence scores for each match
+ - 🔍 Keyword search functionality
+ - 💬 Interactive Q&A interface
+
+ ## Quick Start
 
  ### Installation
 
  ```bash
+ pip install pandas scikit-learn
  ```
 
  ### Usage
 
  ```python
+ from faq_system import CodeBasicsFAQ
+
+ # Initialize the FAQ system
+ faq = CodeBasicsFAQ('codebasics_faqs.csv')
+
+ # Ask a question
+ result = faq.answer("Can I take this bootcamp without programming experience?")
+
+ if result['status'] == 'success':
+     print(f"Confidence: {result['confidence']}")
+     print(f"Answer: {result['answer']}")
  ```
 
+ ### Interactive Mode
+
+ ```bash
+ python faq_system.py
+ ```
+
+ ## Files
+
+ - `faq_system.py` - Main FAQ system code
+ - `codebasics_faqs.csv` - FAQ database (prompt, response)
+ - `model_config.json` - Model configuration (for reference)
+ - `model_weights.pt` - Transformer model weights (for reference)
+ - `tokenizer.json` - Tokenizer (for reference)
+
+ ## API
+
+ ### Initialize
+
+ ```python
+ faq = CodeBasicsFAQ('codebasics_faqs.csv')
+ ```
+
+ ### Get Answer
+
+ ```python
+ result = faq.answer("Your question here")
+ # Returns: {'status': 'success', 'confidence': '95.2%', 'matched_question': '...', 'answer': '...'}
+ ```
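Note that `confidence` in the returned dict is a percent string rather than a number. A minimal sketch of converting it for threshold checks; the `confidence_value` helper and the 0.8 cutoff are illustrative, not part of `faq_system.py`:

```python
# Hypothetical helper: convert the documented percent string (e.g. '95.2%')
# into a fraction in [0, 1] so it can be compared against a numeric threshold.
def confidence_value(result):
    return float(result['confidence'].rstrip('%')) / 100

# Example with a hand-built result dict in the documented shape.
result = {'status': 'success', 'confidence': '95.2%', 'answer': 'Yes.'}
if result['status'] == 'success' and confidence_value(result) >= 0.8:
    print(result['answer'])
```

Treating low-confidence matches as misses avoids surfacing an unrelated FAQ entry.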
 
+ ### Search by Keyword
+
+ ```python
+ matches = faq.search_keyword('bootcamp')
+ # Returns: list of matching Q&A pairs
+ ```
+
+ ### List All Questions
+
+ ```python
+ questions = faq.list_all_questions()
+ ```
+
+ ## Example Questions
+
+ - "Can I take this bootcamp without programming experience?"
+ - "Why should I trust Codebasics?"
+ - "What are the prerequisites?"
+ - "Do I need a laptop?"
+ - "Is there lifetime access?"
+ - "Do you provide job assistance?"
+
+ ## How It Works
+
+ 1. **TF-IDF Vectorization**: Converts questions into numerical vectors
+ 2. **Cosine Similarity**: Measures similarity between user query and FAQ questions
+ 3. **Best Match Selection**: Returns the most similar question with confidence score
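The three steps above can be sketched end to end with scikit-learn; the toy questions and answers below are placeholders, not rows from `codebasics_faqs.csv`:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Placeholder FAQ entries standing in for the CSV contents.
faq_questions = [
    "Can I take this bootcamp without programming experience?",
    "Do you provide job assistance?",
    "Is there lifetime access?",
]
faq_answers = [
    "Yes, the bootcamp starts from the basics.",
    "Yes, job assistance is provided.",
    "Yes, you get lifetime access.",
]

# 1. TF-IDF vectorization (lowercases by default; English stop words removed).
vectorizer = TfidfVectorizer(stop_words="english")
question_vectors = vectorizer.fit_transform(faq_questions)

# 2. Cosine similarity between the user query and every stored question.
query = "do I need programming experience for this bootcamp"
scores = cosine_similarity(vectorizer.transform([query]), question_vectors)[0]

# 3. Best-match selection with a similarity-based confidence score.
best = scores.argmax()
print(f"Confidence: {scores[best]:.1%}")
print(f"Answer: {faq_answers[best]}")
```

In the full system, a minimum-similarity threshold would decide between a `'success'` result and a fallback response.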
 
+ ## Accuracy
+
+ - Typically 85-95% accuracy on similar phrasings
+ - Handles variations in question format
+ - Case-insensitive matching
+ - Removes common stop words
+
+ ## License
+
+ Apache 2.0
+
+ ## Contact
+
+ For questions about CodeBasics courses, visit [codebasics.io](https://codebasics.io)