callidus committed on
Commit 8ed3baf · verified · 1 Parent(s): ba29aa6

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +70 -130
README.md CHANGED
@@ -1,162 +1,102 @@
- ---
- language: en
- license: mit
- tags:
- - text-generation
- - transformer
- - custom-model
- - pytorch
- - from-scratch
- datasets:
- - custom
- metrics:
- - perplexity
- widget:
- - text: "artificial intelligence"
-   example_title: "AI Prompt"
- - text: "machine learning"
-   example_title: "ML Prompt"
- - text: "neural networks"
-   example_title: "Neural Networks"
- ---
-
- # Custom Transformer Text Generation Model (Fixed & Working!)
-
- ## 🎯 Model Description
-
- This is a **custom-built Transformer model trained from scratch** for text generation.
-
- **Status**: ✅ Fixed and properly generating text (no more `<UNK>` tokens!)
-
- ### Model Architecture
-
- | Component | Value |
- |-----------|-------|
- | **Model Type** | Transformer (Decoder-only) |
- | **Total Parameters** | 455,397 |
- | **Embedding Dimension** | 128 |
- | **Number of Layers** | 2 |
- | **Attention Heads** | 4 |
- | **Vocabulary Size** | 229 |
- | **Context Length** | 64 tokens |
- | **Framework** | PyTorch 2.0+ |
-
- ### Performance Metrics
-
- - **Perplexity**: 1.33
- - **Training Epochs**: 30
- - **Training Data Size**: ~50,000 words
- - **Accuracy**: ~40-50% next-token prediction
-
- ## 🚀 Quick Start
 
  ### Installation
 
  ```bash
- pip install torch huggingface_hub
  ```
 
  ### Usage
 
  ```python
- import torch
- import json
- from huggingface_hub import hf_hub_download
-
- # Download model files
- repo_id = "YOUR_USERNAME/YOUR_REPO_NAME"
- config_path = hf_hub_download(repo_id=repo_id, filename="model_config.json")
- weights_path = hf_hub_download(repo_id=repo_id, filename="model_weights.pt")
- tokenizer_path = hf_hub_download(repo_id=repo_id, filename="tokenizer.json")
-
- # Load configuration
- with open(config_path, 'r') as f:
-     config = json.load(f)
-
- # Load tokenizer
- with open(tokenizer_path, 'r') as f:
-     tokenizer_data = json.load(f)
-
- # Reconstruct model (use the TransformerModel class from the code)
- model = TransformerModel(**config)
- model.load_state_dict(torch.load(weights_path))
- model.eval()
-
- # Generate text
- prompt = "artificial intelligence"
- # Use the generate_text function to create text
- ```
 
- ## 📊 Example Generations
 
  ```
- Input: "artificial intelligence"
- Output: "artificial intelligence systems process information using neural networks..."
 
- Input: "machine learning"
- Output: "machine learning algorithms learn from data and make predictions..."
 
- Input: "neural networks"
- Output: "neural networks are inspired by the human brain structure..."
  ```
 
- ## 🔧 What Was Fixed
 
- **Version 2.0 Improvements:**
- - ✅ Fixed vocabulary building (2,000 tokens optimized)
- - ✅ Increased training data (50x repetition)
- - ✅ Reduced model size for better learning
- - ✅ Improved tokenization (no more excessive `<UNK>` tokens)
- - ✅ Better generation function (filters out special tokens)
- - ✅ Enhanced training monitoring (loss + accuracy)
 
- ## 📝 Training Details
 
- ### Training Configuration
- - **Optimizer**: Adam (lr=0.0005)
- - **Loss Function**: Cross-Entropy Loss
- - **Batch Size**: 64
- - **Sequence Length**: 64 tokens
- - **Gradient Clipping**: Max norm 1.0
- - **Learning Rate Schedule**: StepLR (step=5, gamma=0.5)
 
- ### Training Data
- - Custom corpus with AI/ML domain text
- - ~50,000 words of training data
- - Repeated and augmented for better coverage
 
- ## ⚠️ Limitations
 
- - Trained on limited custom data (AI/ML domain)
- - May generate repetitive text for longer sequences
- - Context window limited to 64 tokens
- - Best for short text generation (20-50 tokens)
- - Not fine-tuned for specific tasks
 
- ## 🎓 Educational Purpose
 
- This model was built **from scratch** as a learning project to understand:
- - Transformer architecture (Q, K, V, O matrices)
- - Multi-head attention mechanisms
- - Positional encoding
- - Training deep learning models
- - Text generation techniques
 
- ## 📄 License
 
- MIT License - Free to use, modify, and distribute
 
- ## 🙏 Acknowledgments
 
- Built using:
- - PyTorch
- - Hugging Face Hub
- - Google Colab (Free GPU)
 
- ## 📞 Contact
 
- For questions or improvements, please open an issue on the model repository.
 
- ---
 
- **Note**: This is a custom educational model. For production use, consider fine-tuning larger pre-trained models like GPT-2 or LLaMA.
 
+ # CodeBasics FAQ System
+
+ An intelligent FAQ retrieval system for CodeBasics bootcamp questions using TF-IDF and cosine similarity.
+
+ ## Features
+
+ - 🎯 Smart question matching using TF-IDF
+ - 📊 Confidence scores for each match
+ - 🔍 Keyword search functionality
+ - 💬 Interactive Q&A interface
+
+ ## Quick Start
 
  ### Installation
 
  ```bash
+ pip install pandas scikit-learn
  ```
 
  ### Usage
 
  ```python
+ from faq_system import CodeBasicsFAQ
+
+ # Initialize the FAQ system
+ faq = CodeBasicsFAQ('codebasics_faqs.csv')
+
+ # Ask a question
+ result = faq.answer("Can I take this bootcamp without programming experience?")
+
+ if result['status'] == 'success':
+     print(f"Confidence: {result['confidence']}")
+     print(f"Answer: {result['answer']}")
  ```
 
+ ### Interactive Mode
+
+ ```bash
+ python faq_system.py
+ ```
+
+ ## Files
+
+ - `faq_system.py` - Main FAQ system code
+ - `codebasics_faqs.csv` - FAQ database (prompt, response)
+ - `model_config.json` - Model configuration (for reference)
+ - `model_weights.pt` - Transformer model weights (for reference)
+ - `tokenizer.json` - Tokenizer (for reference)
+
+ ## API
+
+ ### Initialize
+
+ ```python
+ faq = CodeBasicsFAQ('codebasics_faqs.csv')
+ ```
+
+ ### Get Answer
+
+ ```python
+ result = faq.answer("Your question here")
+ # Returns: {'status': 'success', 'confidence': '95.2%', 'matched_question': '...', 'answer': '...'}
+ ```
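Note that `confidence` in the returned dict is a percent string rather than a number. A minimal sketch of converting it for threshold checks; the `confidence_value` helper and the 0.8 cutoff are illustrative, not part of `faq_system.py`:

```python
# Hypothetical helper: convert the documented percent string (e.g. '95.2%')
# into a fraction in [0, 1] so it can be compared against a numeric threshold.
def confidence_value(result):
    return float(result['confidence'].rstrip('%')) / 100

# Example with a hand-built result dict in the documented shape.
result = {'status': 'success', 'confidence': '95.2%', 'answer': 'Yes.'}
if result['status'] == 'success' and confidence_value(result) >= 0.8:
    print(result['answer'])
```

Treating low-confidence matches as misses avoids surfacing an unrelated FAQ entry.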
 
+ ### Search by Keyword
+
+ ```python
+ matches = faq.search_keyword('bootcamp')
+ # Returns: list of matching Q&A pairs
+ ```
+
+ ### List All Questions
+
+ ```python
+ questions = faq.list_all_questions()
+ ```
+
+ ## Example Questions
+
+ - "Can I take this bootcamp without programming experience?"
+ - "Why should I trust Codebasics?"
+ - "What are the prerequisites?"
+ - "Do I need a laptop?"
+ - "Is there lifetime access?"
+ - "Do you provide job assistance?"
+
+ ## How It Works
+
+ 1. **TF-IDF Vectorization**: Converts questions into numerical vectors
+ 2. **Cosine Similarity**: Measures similarity between user query and FAQ questions
+ 3. **Best Match Selection**: Returns the most similar question with confidence score
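The three steps above can be sketched end to end with scikit-learn; the toy questions and answers below are placeholders, not rows from `codebasics_faqs.csv`:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Placeholder FAQ entries standing in for the CSV contents.
faq_questions = [
    "Can I take this bootcamp without programming experience?",
    "Do you provide job assistance?",
    "Is there lifetime access?",
]
faq_answers = [
    "Yes, the bootcamp starts from the basics.",
    "Yes, job assistance is provided.",
    "Yes, you get lifetime access.",
]

# 1. TF-IDF vectorization (lowercases by default; English stop words removed).
vectorizer = TfidfVectorizer(stop_words="english")
question_vectors = vectorizer.fit_transform(faq_questions)

# 2. Cosine similarity between the user query and every stored question.
query = "do I need programming experience for this bootcamp"
scores = cosine_similarity(vectorizer.transform([query]), question_vectors)[0]

# 3. Best-match selection with a similarity-based confidence score.
best = scores.argmax()
print(f"Confidence: {scores[best]:.1%}")
print(f"Answer: {faq_answers[best]}")
```

In the full system, a minimum-similarity threshold would decide between a `'success'` result and a fallback response.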
 
+ ## Accuracy
+
+ - Typically 85-95% accuracy on similar phrasings
+ - Handles variations in question format
+ - Case-insensitive matching
+ - Removes common stop words
+
+ ## License
+
+ Apache 2.0
+
+ ## Contact
+
+ For questions about CodeBasics courses, visit [codebasics.io](https://codebasics.io)