callidus committed on
Commit 352a48c · verified · 1 Parent(s): 4d33852

Add proper model card with YAML metadata

Files changed (1)
  1. README.md +349 -77
README.md CHANGED
@@ -28,18 +28,6 @@ An intelligent AI system for CodeBasics bootcamp questions with dual capabilities
 - **Language:** English
 - **License:** Apache 2.0

- ## Features
-
- 🎯 **Smart Question Answering**
- - Intelligent FAQ matching using TF-IDF and cosine similarity
- - 50+ CodeBasics bootcamp questions covered
- - High accuracy for course-related queries
-
- 🤖 **Text Generation**
- - Transformer-based text generation
- - Trained on AI/ML domain text
- - Suitable for general tech content
-
 ## Quick Start

 ### Installation
@@ -48,111 +36,401 @@ An intelligent AI system for CodeBasics bootcamp questions with dual capabilities
 pip install torch pandas scikit-learn huggingface_hub
 ```

- ### Usage

 ```python
- from huggingface_hub import hf_hub_download
 import pandas as pd
 from sklearn.feature_extraction.text import TfidfVectorizer
 from sklearn.metrics.pairwise import cosine_similarity
 import numpy as np

- # Download FAQ data
- csv_path = hf_hub_download(
-     repo_id="callidus/good",
-     filename="codebasics_faqs.csv"
- )
-
- # Load and use (see full code in repository)
 ```

- ### Interactive Usage

 ```python
- # The system automatically chooses between FAQ and text generation
- result = smart_inference("Can I take this bootcamp without experience?")
- print(result)  # Returns FAQ answer

 result = smart_inference("machine learning algorithms")
- print(result)  # Returns generated text
 ```

 ## Example Questions

- **Bootcamp Questions:**
 - "Can I take this bootcamp without programming experience?"
 - "Why should I trust Codebasics?"
 - "What are the prerequisites?"
 - "Do you provide job assistance?"
 - "Is there lifetime access?"

- **General Topics:**
- - AI and machine learning concepts
- - Programming and data analytics
- - Technology discussions

 ## Files in Repository

 - `codebasics_faqs.csv` - FAQ database (50+ Q&A pairs)
- - `faq_system.py` - FAQ retrieval system code
- - `model_config.json` - Transformer model configuration
- - `model_weights.pt` - Transformer model weights
 - `tokenizer.json` - Tokenizer vocabulary
- - `README.md` - This file

 ## Model Architecture

 ### FAQ System
 - **Method:** TF-IDF + Cosine Similarity
- - **Vectorizer:** TfidfVectorizer with bigrams
- - **Threshold:** 0.2 similarity score
 - **Accuracy:** ~90% on similar phrasings

 ### Transformer Model
- - **Architecture:** Custom Transformer
 - **Layers:** 6 transformer blocks
 - **Hidden size:** 512
 - **Attention heads:** 8
 - **Vocabulary:** 229 tokens
- - **Max sequence length:** 512

- ## Training Data
-
- - **FAQ Data:** Custom CodeBasics bootcamp questions
- - **Text Generation:** AI/ML domain corpus
- - **Total samples:** Proprietary dataset
-
- ## Limitations
-
- - FAQ system requires questions similar to training data
- - Text generation model has limited vocabulary (229 tokens)
- - Best performance on CodeBasics-related questions
- - English language only
-
- ## Use Cases
-
- ✅ **Recommended:**
- - Answering CodeBasics bootcamp questions
- - Educational chatbots
- - Course support systems
- - General AI/ML text generation
-
- **Not Recommended:**
- - Medical or legal advice
- - Real-time information (trained on historical data)
- - Languages other than English
-
- ## Ethical Considerations
-
- - Model provides educational content only
- - Should not replace human instructors
- - Answers based on training data may be outdated
- - Users should verify critical information

 ## Citation

- If you use this model, please cite:
-
 ```bibtex
 @misc{codebasics-faq-2024,
   author = {callidus},
@@ -163,16 +441,10 @@ If you use this model, please cite:
 }
 ```

- ## Contact
-
- For questions about CodeBasics courses: [codebasics.io](https://codebasics.io)
-
 ## License

- Apache 2.0 - See LICENSE file for details

- ## Acknowledgments

- - CodeBasics for the educational content
- - Hugging Face for hosting infrastructure
- - Open source community for tools and libraries
 - **Language:** English
 - **License:** Apache 2.0

 ## Quick Start

 ### Installation

 ```bash
 pip install torch pandas scikit-learn huggingface_hub
 ```

+ ### Complete Inference Code
+
+ Copy and paste this complete code to use the model:

 ```python
+ # ============================================================================
+ # COMBINED INFERENCE: TRANSFORMER MODEL + FAQ SYSTEM
+ # ============================================================================
+
+ # Note: the leading "!" works in notebooks; in a shell, run pip directly.
+ !pip install -q torch huggingface_hub pandas scikit-learn
+
+ import torch
+ import torch.nn as nn
+ import torch.nn.functional as F
+ import json
+ import math
+ from huggingface_hub import hf_hub_download, login
+ import re
 import pandas as pd
 from sklearn.feature_extraction.text import TfidfVectorizer
 from sklearn.metrics.pairwise import cosine_similarity
 import numpy as np

+ # ============================================================================
+ # CONFIGURATION
+ # ============================================================================
+
+ HF_TOKEN = "hf_your_token_here"  # Replace with your token
+ REPO_ID = "callidus/good"
+
+ login(token=HF_TOKEN, add_to_git_credential=False)
+
+ # ============================================================================
+ # TRANSFORMER MODEL ARCHITECTURE
+ # ============================================================================
+
+ class MultiHeadAttention(nn.Module):
+     def __init__(self, d_model, num_heads):
+         super().__init__()
+         assert d_model % num_heads == 0
+         self.d_model = d_model
+         self.num_heads = num_heads
+         self.d_k = d_model // num_heads
+         self.W_q = nn.Linear(d_model, d_model)
+         self.W_k = nn.Linear(d_model, d_model)
+         self.W_v = nn.Linear(d_model, d_model)
+         self.W_o = nn.Linear(d_model, d_model)
+
+     def split_heads(self, x, batch_size):
+         x = x.view(batch_size, -1, self.num_heads, self.d_k)
+         return x.transpose(1, 2)
+
+     def forward(self, x, mask=None):
+         batch_size = x.size(0)
+         Q = self.split_heads(self.W_q(x), batch_size)
+         K = self.split_heads(self.W_k(x), batch_size)
+         V = self.split_heads(self.W_v(x), batch_size)
+         scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)
+         if mask is not None:
+             scores = scores.masked_fill(mask == 0, -1e9)
+         attention_weights = F.softmax(scores, dim=-1)
+         attention_output = torch.matmul(attention_weights, V)
+         attention_output = attention_output.transpose(1, 2).contiguous()
+         attention_output = attention_output.view(batch_size, -1, self.d_model)
+         return self.W_o(attention_output), attention_weights
+
+ class FeedForward(nn.Module):
+     def __init__(self, d_model, d_ff, dropout=0.1):
+         super().__init__()
+         self.linear1 = nn.Linear(d_model, d_ff)
+         self.linear2 = nn.Linear(d_ff, d_model)
+         self.dropout = nn.Dropout(dropout)
+
+     def forward(self, x):
+         return self.linear2(self.dropout(F.relu(self.linear1(x))))
+
+ class TransformerBlock(nn.Module):
+     def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
+         super().__init__()
+         self.attention = MultiHeadAttention(d_model, num_heads)
+         self.feed_forward = FeedForward(d_model, d_ff, dropout)
+         self.norm1 = nn.LayerNorm(d_model)
+         self.norm2 = nn.LayerNorm(d_model)
+         self.dropout1 = nn.Dropout(dropout)
+         self.dropout2 = nn.Dropout(dropout)
+
+     def forward(self, x, mask=None):
+         attn_output, attn_weights = self.attention(x, mask)
+         x = self.norm1(x + self.dropout1(attn_output))
+         ff_output = self.feed_forward(x)
+         x = self.norm2(x + self.dropout2(ff_output))
+         return x, attn_weights
+
+ class PositionalEncoding(nn.Module):
+     def __init__(self, d_model, max_len=5000):
+         super().__init__()
+         pe = torch.zeros(max_len, d_model)
+         position = torch.arange(0, max_len).unsqueeze(1).float()
+         div_term = torch.exp(torch.arange(0, d_model, 2).float() *
+                              -(math.log(10000.0) / d_model))
+         pe[:, 0::2] = torch.sin(position * div_term)
+         pe[:, 1::2] = torch.cos(position * div_term)
+         pe = pe.unsqueeze(0)
+         self.register_buffer('pe', pe)
+
+     def forward(self, x):
+         return x + self.pe[:, :x.size(1)]
+
+ class TransformerModel(nn.Module):
+     def __init__(self, vocab_size, d_model=512, num_heads=8,
+                  num_layers=6, d_ff=2048, dropout=0.1, max_len=512):
+         super().__init__()
+         self.embedding = nn.Embedding(vocab_size, d_model)
+         self.pos_encoding = PositionalEncoding(d_model, max_len)
+         self.transformer_blocks = nn.ModuleList([
+             TransformerBlock(d_model, num_heads, d_ff, dropout)
+             for _ in range(num_layers)
+         ])
+         self.fc_out = nn.Linear(d_model, vocab_size)
+         self.dropout = nn.Dropout(dropout)
+         self.d_model = d_model
+
+     def forward(self, x, mask=None):
+         x = self.embedding(x) * math.sqrt(self.d_model)
+         x = self.pos_encoding(x)
+         x = self.dropout(x)
+         for transformer_block in self.transformer_blocks:
+             x, attn_weights = transformer_block(x, mask)
+         logits = self.fc_out(x)
+         return logits
+
+ class Tokenizer:
+     def __init__(self, tokenizer_data):
+         self.word2idx = tokenizer_data['word2idx']
+         self.idx2word = {int(k): v for k, v in tokenizer_data['idx2word'].items()}
+         self.vocab_size = tokenizer_data['vocab_size']
+         self.special_tokens = tokenizer_data['special_tokens']
+
+     def encode(self, text):
+         words = re.findall(r'\w+', text.lower())
+         return [self.word2idx.get(word, self.word2idx['<UNK>']) for word in words]
+
+     def decode(self, indices):
+         words = []
+         for idx in indices:
+             if idx in self.idx2word:
+                 word = self.idx2word[idx]
+                 if word not in ['<PAD>', '<SOS>', '<EOS>']:
+                     words.append(word)
+         return ' '.join(words)
+
+ class TransformerInference:
+     def __init__(self, repo_id, token=None, device=None):
+         self.device = device or ('cuda' if torch.cuda.is_available() else 'cpu')
+         self.model = None
+         self.tokenizer = None
+         self.config = None
+         self.token = token
+         self.load_from_hub(repo_id)
+
+     def load_from_hub(self, repo_id):
+         config_path = hf_hub_download(repo_id=repo_id, filename="model_config.json", token=self.token)
+         weights_path = hf_hub_download(repo_id=repo_id, filename="model_weights.pt", token=self.token)
+         tokenizer_path = hf_hub_download(repo_id=repo_id, filename="tokenizer.json", token=self.token)
+
+         with open(config_path, 'r') as f:
+             self.config = json.load(f)
+
+         with open(tokenizer_path, 'r') as f:
+             tokenizer_data = json.load(f)
+         self.tokenizer = Tokenizer(tokenizer_data)
+
+         self.model = TransformerModel(
+             vocab_size=self.config['vocab_size'],
+             d_model=self.config['d_model'],
+             num_heads=self.config['num_heads'],
+             num_layers=self.config['num_layers'],
+             d_ff=self.config['d_ff'],
+             dropout=self.config.get('dropout', 0.1),
+             max_len=self.config.get('max_len', 512)
+         )
+
+         state_dict = torch.load(weights_path, map_location=self.device, weights_only=True)
+         self.model.load_state_dict(state_dict)
+         self.model = self.model.to(self.device)
+         self.model.eval()
+
+     def generate(self, prompt, max_length=50, temperature=0.8, top_k=50, top_p=0.9):
+         self.model.eval()
+         tokens = self.tokenizer.encode(prompt)
+
+         # Fall back to <SOS> when no prompt word is in the vocabulary
+         if not tokens or all(t == self.tokenizer.word2idx['<UNK>'] for t in tokens):
+             tokens = [self.tokenizer.word2idx['<SOS>']]
+
+         generated = tokens.copy()
+
+         with torch.no_grad():
+             for _ in range(max_length):
+                 # Left-pad the context window to a fixed 64 tokens
+                 input_tokens = generated[-64:]
+                 if len(input_tokens) < 64:
+                     input_tokens = [self.tokenizer.word2idx['<PAD>']] * (64 - len(input_tokens)) + input_tokens
+
+                 input_ids = torch.tensor([input_tokens], dtype=torch.long).to(self.device)
+                 logits = self.model(input_ids)
+                 next_token_logits = logits[0, -1, :] / temperature
+
+                 # Never sample padding or unknown tokens
+                 next_token_logits[self.tokenizer.word2idx['<PAD>']] = -float('inf')
+                 next_token_logits[self.tokenizer.word2idx['<UNK>']] = -float('inf')
+
+                 # Top-k filtering
+                 if top_k > 0:
+                     indices_to_remove = next_token_logits < torch.topk(next_token_logits, top_k)[0][..., -1, None]
+                     next_token_logits[indices_to_remove] = -float('inf')
+
+                 # Nucleus (top-p) filtering
+                 if top_p < 1.0:
+                     sorted_logits, sorted_indices = torch.sort(next_token_logits, descending=True)
+                     cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
+                     sorted_indices_to_remove = cumulative_probs > top_p
+                     sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone()
+                     sorted_indices_to_remove[..., 0] = 0
+                     indices_to_remove = sorted_indices[sorted_indices_to_remove]
+                     next_token_logits[indices_to_remove] = -float('inf')
+
+                 probs = F.softmax(next_token_logits, dim=-1)
+                 next_token = torch.multinomial(probs, num_samples=1).item()
+
+                 if next_token == self.tokenizer.word2idx['<EOS>']:
+                     break
+
+                 generated.append(next_token)
+
+         return self.tokenizer.decode(generated)
+
+ # ============================================================================
+ # FAQ SYSTEM
+ # ============================================================================
+
+ class CodeBasicsFAQ:
+     def __init__(self, csv_path):
+         # Try a few common encodings; the CSV is not guaranteed to be UTF-8
+         encodings = ['utf-8', 'latin-1', 'iso-8859-1', 'cp1252']
+         df = None
+
+         for encoding in encodings:
+             try:
+                 df = pd.read_csv(csv_path, encoding=encoding)
+                 break
+             except (UnicodeDecodeError, ValueError):
+                 continue
+
+         if df is None:
+             raise ValueError("Could not load FAQ CSV with any known encoding")
+
+         self.df = df
+         self.questions = df['prompt'].tolist()
+         self.answers = df['response'].tolist()
+
+         self.vectorizer = TfidfVectorizer(
+             lowercase=True,
+             stop_words='english',
+             ngram_range=(1, 2),
+             max_features=1000
+         )
+
+         self.question_vectors = self.vectorizer.fit_transform(self.questions)
+
+     def find_best_match(self, query, threshold=0.2):
+         query_vector = self.vectorizer.transform([query])
+         similarities = cosine_similarity(query_vector, self.question_vectors)[0]
+
+         best_idx = np.argmax(similarities)
+         best_score = similarities[best_idx]
+
+         if best_score >= threshold:
+             return {
+                 'question': self.questions[best_idx],
+                 'answer': self.answers[best_idx],
+                 'confidence': best_score
+             }
+         return None
+
+ # ============================================================================
+ # LOAD BOTH SYSTEMS
+ # ============================================================================
+
+ print("Loading systems...")
+ transformer = TransformerInference(repo_id=REPO_ID, token=HF_TOKEN)
+ csv_path = hf_hub_download(repo_id=REPO_ID, filename="codebasics_faqs.csv", token=HF_TOKEN)
+ faq = CodeBasicsFAQ(csv_path)
+ print("Ready!")
+
+ # ============================================================================
+ # SMART INFERENCE FUNCTION
+ # ============================================================================
+
+ def smart_inference(query):
+     """Automatically chooses FAQ lookup or text generation."""
+     faq_match = faq.find_best_match(query)
+
+     if faq_match:
+         return faq_match['answer']
+     else:
+         return transformer.generate(query, max_length=50, temperature=0.8)
+
+ # ============================================================================
+ # USAGE
+ # ============================================================================
+
+ # Ask a question: the system automatically picks the best method
+ result = smart_inference("Can I take this bootcamp without programming experience?")
+ print(result)
+
+ # Interactive mode
+ while True:
+     user_input = input("Ask me: ").strip()
+     if user_input.lower() in ['quit', 'exit']:
+         break
+     print(smart_inference(user_input))
 ```
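The `generate` method above combines temperature scaling with top-k and nucleus (top-p) filtering before sampling. The nucleus step is the least obvious part; as a standalone sketch, here it is applied to a toy logits vector (the values are illustrative only):

```python
# Standalone sketch of the nucleus (top-p) filtering step used inside
# TransformerInference.generate, applied to a toy logits vector.
import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 1.0, 0.5, -1.0, -3.0])
top_p = 0.9

sorted_logits, sorted_indices = torch.sort(logits, descending=True)
cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
remove = cumulative_probs > top_p
remove[1:] = remove[:-1].clone()  # keep the first token that crosses top_p
remove[0] = False                 # always keep the most likely token

filtered = logits.clone()
filtered[sorted_indices[remove]] = -float('inf')
# Only the smallest set of tokens whose cumulative probability exceeds
# top_p survives; sampling then draws from the renormalized softmax.
```

Here the three most likely tokens stay; the two lowest-probability tokens are masked to `-inf` and can never be sampled.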

+ ## Usage Examples

+ ### FAQ Questions (Returns Accurate Answers)
 ```python
+ result = smart_inference("Can I take this bootcamp without programming experience?")
+ # Returns: "Yes, this is the perfect bootcamp for anyone..."
+
+ result = smart_inference("Why should I trust Codebasics?")
+ # Returns: "Till now 9000+ learners have benefitted..."
+ ```

+ ### General Topics (Returns Generated Text)
+ ```python
 result = smart_inference("machine learning algorithms")
+ # Returns: generated text about ML
+
+ result = smart_inference("artificial intelligence")
+ # Returns: generated text about AI
 ```

 ## Example Questions

+ ### Bootcamp Questions (FAQ System)
 - "Can I take this bootcamp without programming experience?"
 - "Why should I trust Codebasics?"
 - "What are the prerequisites?"
 - "Do you provide job assistance?"
 - "Is there lifetime access?"
+ - "Can I attend while working full time?"
+ - "What is the duration of this bootcamp?"

+ ### General Topics (Text Generation)
+ - "machine learning"
+ - "artificial intelligence"
+ - "neural networks"
+ - "data science"
 
 ## Files in Repository

 - `codebasics_faqs.csv` - FAQ database (50+ Q&A pairs)
+ - `model_config.json` - Transformer configuration
+ - `model_weights.pt` - Transformer weights
 - `tokenizer.json` - Tokenizer vocabulary
+ - `README.md` - This documentation
 
 ## Model Architecture

 ### FAQ System
 - **Method:** TF-IDF + Cosine Similarity
 - **Accuracy:** ~90% on similar phrasings
+ - **Threshold:** 0.2 similarity score
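The matching method can be seen in isolation: TF-IDF vectors over the stored questions (unigrams and bigrams), cosine similarity against the query, and the 0.2 acceptance threshold. The three questions below are illustrative samples, not the full `codebasics_faqs.csv`:

```python
# Minimal sketch of the FAQ matching method described above.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

questions = [
    "Can I take this bootcamp without programming experience?",
    "Do you provide job assistance?",
    "Is there lifetime access?",
]
vectorizer = TfidfVectorizer(lowercase=True, stop_words='english', ngram_range=(1, 2))
question_vectors = vectorizer.fit_transform(questions)

query = "do you provide job assistance after the course"
sims = cosine_similarity(vectorizer.transform([query]), question_vectors)[0]
best = int(np.argmax(sims))
if sims[best] >= 0.2:
    print(questions[best])  # → Do you provide job assistance?
```

Queries below the threshold return no match, which is what triggers the fallback to text generation.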
 
 ### Transformer Model
 - **Layers:** 6 transformer blocks
 - **Hidden size:** 512
 - **Attention heads:** 8
 - **Vocabulary:** 229 tokens
+ - **Max length:** 512 tokens
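For orientation, a back-of-the-envelope parameter count for this configuration, assuming `d_ff=2048` and bias terms as in the inference code above (the exact figure depends on the checkpoint):

```python
# Rough parameter count for the transformer configuration above
# (d_ff=2048 assumed from the inference code; biases included).
d_model, num_layers, vocab, d_ff = 512, 6, 229, 2048

embedding = vocab * d_model                                  # token embeddings
attn = 4 * (d_model * d_model + d_model)                     # W_q, W_k, W_v, W_o
ffn = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)   # two linear layers
norms = 2 * (2 * d_model)                                    # two LayerNorms
per_layer = attn + ffn + norms
fc_out = d_model * vocab + vocab                             # output projection
total = embedding + num_layers * per_layer + fc_out
print(f"~{total / 1e6:.1f}M parameters")  # → ~19.1M parameters
```

Almost all of the weight budget sits in the 6 transformer blocks; the tiny 229-token vocabulary keeps the embedding and output layers negligible.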
 
+ ## How It Works
+
+ The system intelligently routes queries:
+
+ 1. **FAQ match?** → Returns the stored FAQ answer
+ 2. **No match?** → Falls back to text generation
+
+ Users don't need to specify which system to use; routing is automatic.
 
+ ## Limitations
+
+ - FAQ requires questions similar to training data
+ - Text generation has limited vocabulary (229 tokens)
+ - Best for CodeBasics bootcamp questions
+ - English language only
 
 ## Citation

 ```bibtex
 @misc{codebasics-faq-2024,
   author = {callidus},
 }
 ```
 ## License

+ Apache 2.0
+
+ ## Contact
+
+ For CodeBasics courses: [codebasics.io](https://codebasics.io)