Upload README.md with huggingface_hub
Browse files
README.md
CHANGED
|
@@ -1,162 +1,102 @@
|
|
| 1 |
-
|
| 2 |
-
|
| 3 |
-
|
| 4 |
-
|
| 5 |
-
|
| 6 |
-
|
| 7 |
-
-
|
| 8 |
-
-
|
| 9 |
-
-
|
| 10 |
-
|
| 11 |
-
|
| 12 |
-
|
| 13 |
-
- perplexity
|
| 14 |
-
widget:
|
| 15 |
-
- text: "artificial intelligence"
|
| 16 |
-
example_title: "AI Prompt"
|
| 17 |
-
- text: "machine learning"
|
| 18 |
-
example_title: "ML Prompt"
|
| 19 |
-
- text: "neural networks"
|
| 20 |
-
example_title: "Neural Networks"
|
| 21 |
-
---
|
| 22 |
-
|
| 23 |
-
# Custom Transformer Text Generation Model (Fixed & Working!)
|
| 24 |
-
|
| 25 |
-
## π― Model Description
|
| 26 |
-
|
| 27 |
-
This is a **custom-built Transformer model trained from scratch** for text generation.
|
| 28 |
-
|
| 29 |
-
**Status**: β
Fixed and properly generating text (no more `<UNK>` tokens!)
|
| 30 |
-
|
| 31 |
-
### Model Architecture
|
| 32 |
-
|
| 33 |
-
| Component | Value |
|
| 34 |
-
|-----------|-------|
|
| 35 |
-
| **Model Type** | Transformer (Decoder-only) |
|
| 36 |
-
| **Total Parameters** | 455,397 |
|
| 37 |
-
| **Embedding Dimension** | 128 |
|
| 38 |
-
| **Number of Layers** | 2 |
|
| 39 |
-
| **Attention Heads** | 4 |
|
| 40 |
-
| **Vocabulary Size** | 229 |
|
| 41 |
-
| **Context Length** | 64 tokens |
|
| 42 |
-
| **Framework** | PyTorch 2.0+ |
|
| 43 |
-
|
| 44 |
-
### Performance Metrics
|
| 45 |
-
|
| 46 |
-
- **Perplexity**: 1.33
|
| 47 |
-
- **Training Epochs**: 30
|
| 48 |
-
- **Training Data Size**: ~50,000 words
|
| 49 |
-
- **Accuracy**: ~40-50% next token prediction
|
| 50 |
-
|
| 51 |
-
## π Quick Start
|
| 52 |
|
| 53 |
### Installation
|
| 54 |
|
| 55 |
```bash
|
| 56 |
-
pip install
|
| 57 |
```
|
| 58 |
|
| 59 |
### Usage
|
| 60 |
|
| 61 |
```python
|
| 62 |
-
import
|
| 63 |
-
|
| 64 |
-
|
| 65 |
-
|
| 66 |
-
# Download model files
|
| 67 |
-
repo_id = "YOUR_USERNAME/YOUR_REPO_NAME"
|
| 68 |
-
config_path = hf_hub_download(repo_id=repo_id, filename="model_config.json")
|
| 69 |
-
weights_path = hf_hub_download(repo_id=repo_id, filename="model_weights.pt")
|
| 70 |
-
tokenizer_path = hf_hub_download(repo_id=repo_id, filename="tokenizer.json")
|
| 71 |
-
|
| 72 |
-
# Load configuration
|
| 73 |
-
with open(config_path, 'r') as f:
|
| 74 |
-
config = json.load(f)
|
| 75 |
-
|
| 76 |
-
# Load tokenizer
|
| 77 |
-
with open(tokenizer_path, 'r') as f:
|
| 78 |
-
tokenizer_data = json.load(f)
|
| 79 |
-
|
| 80 |
-
# Reconstruct model (use the TransformerModel class from the code)
|
| 81 |
-
model = TransformerModel(**config)
|
| 82 |
-
model.load_state_dict(torch.load(weights_path))
|
| 83 |
-
model.eval()
|
| 84 |
-
|
| 85 |
-
# Generate text
|
| 86 |
-
prompt = "artificial intelligence"
|
| 87 |
-
# Use the generate_text function to create text
|
| 88 |
-
```
|
| 89 |
|
| 90 |
-
|
|
|
|
| 91 |
|
|
|
|
|
|
|
|
|
|
| 92 |
```
|
| 93 |
-
Input: "artificial intelligence"
|
| 94 |
-
Output: "artificial intelligence systems process information using neural networks..."
|
| 95 |
|
| 96 |
-
|
| 97 |
-
Output: "machine learning algorithms learn from data and make predictions..."
|
| 98 |
|
| 99 |
-
|
| 100 |
-
|
| 101 |
```
|
| 102 |
|
| 103 |
-
##
|
| 104 |
|
| 105 |
-
|
| 106 |
-
-
|
| 107 |
-
-
|
| 108 |
-
-
|
| 109 |
-
-
|
| 110 |
-
- β
Better generation function (filters out special tokens)
|
| 111 |
-
- β
Enhanced training monitoring (loss + accuracy)
|
| 112 |
|
| 113 |
-
##
|
| 114 |
|
| 115 |
-
###
|
| 116 |
-
|
| 117 |
-
|
| 118 |
-
|
| 119 |
-
- **Sequence Length**: 64 tokens
|
| 120 |
-
- **Gradient Clipping**: Max norm 1.0
|
| 121 |
-
- **Learning Rate Schedule**: StepLR (step=5, gamma=0.5)
|
| 122 |
|
| 123 |
-
###
|
| 124 |
-
|
| 125 |
-
|
| 126 |
-
|
|
|
|
| 127 |
|
| 128 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
| 129 |
|
| 130 |
-
|
| 131 |
-
|
| 132 |
-
|
| 133 |
-
|
| 134 |
-
- Not fine-tuned for specific tasks
|
| 135 |
|
| 136 |
-
##
|
| 137 |
|
| 138 |
-
|
| 139 |
-
-
|
| 140 |
-
-
|
| 141 |
-
-
|
| 142 |
-
-
|
| 143 |
-
-
|
| 144 |
|
| 145 |
-
##
|
| 146 |
|
| 147 |
-
|
|
|
|
|
|
|
| 148 |
|
| 149 |
-
##
|
| 150 |
|
| 151 |
-
|
| 152 |
-
-
|
| 153 |
-
-
|
| 154 |
-
-
|
| 155 |
|
| 156 |
-
##
|
| 157 |
|
| 158 |
-
|
| 159 |
|
| 160 |
-
|
| 161 |
|
| 162 |
-
|
|
|
|
| 1 |
+
# CodeBasics FAQ System
|
| 2 |
+
|
| 3 |
+
An intelligent FAQ retrieval system for CodeBasics bootcamp questions using TF-IDF and cosine similarity.
|
| 4 |
+
|
| 5 |
+
## Features
|
| 6 |
+
|
| 7 |
+
- π― Smart question matching using TF-IDF
|
| 8 |
+
- π Confidence scores for each match
|
| 9 |
+
- π Keyword search functionality
|
| 10 |
+
- π¬ Interactive Q&A interface
|
| 11 |
+
|
| 12 |
+
## Quick Start
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 13 |
|
| 14 |
### Installation
|
| 15 |
|
| 16 |
```bash
|
| 17 |
+
pip install pandas scikit-learn
|
| 18 |
```
|
| 19 |
|
| 20 |
### Usage
|
| 21 |
|
| 22 |
```python
|
| 23 |
+
from faq_system import CodeBasicsFAQ
|
| 24 |
+
|
| 25 |
+
# Initialize FAQ system
|
| 26 |
+
faq = CodeBasicsFAQ('codebasics_faqs.csv')
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 27 |
|
| 28 |
+
# Ask a question
|
| 29 |
+
result = faq.answer("Can I take this bootcamp without programming experience?")
|
| 30 |
|
| 31 |
+
if result['status'] == 'success':
|
| 32 |
+
print(f"Confidence: {result['confidence']}")
|
| 33 |
+
print(f"Answer: {result['answer']}")
|
| 34 |
```
|
|
|
|
|
|
|
| 35 |
|
| 36 |
+
### Interactive Mode
|
|
|
|
| 37 |
|
| 38 |
+
```bash
|
| 39 |
+
python faq_system.py
|
| 40 |
```
|
| 41 |
|
| 42 |
+
## Files
|
| 43 |
|
| 44 |
+
- `faq_system.py` - Main FAQ system code
|
| 45 |
+
- `codebasics_faqs.csv` - FAQ database (prompt, response)
|
| 46 |
+
- `model_config.json` - Model configuration (for reference)
|
| 47 |
+
- `model_weights.pt` - Transformer model weights (for reference)
|
| 48 |
+
- `tokenizer.json` - Tokenizer (for reference)
|
|
|
|
|
|
|
| 49 |
|
| 50 |
+
## API
|
| 51 |
|
| 52 |
+
### Initialize
|
| 53 |
+
```python
|
| 54 |
+
faq = CodeBasicsFAQ('codebasics_faqs.csv')
|
| 55 |
+
```
|
|
|
|
|
|
|
|
|
|
| 56 |
|
| 57 |
+
### Get Answer
|
| 58 |
+
```python
|
| 59 |
+
result = faq.answer("Your question here")
|
| 60 |
+
# Returns: {'status': 'success', 'confidence': '95.2%', 'matched_question': '...', 'answer': '...'}
|
| 61 |
+
```
|
| 62 |
|
| 63 |
+
### Search by Keyword
|
| 64 |
+
```python
|
| 65 |
+
matches = faq.search_keyword('bootcamp')
|
| 66 |
+
# Returns: List of matching Q&A pairs
|
| 67 |
+
```
|
| 68 |
|
| 69 |
+
### List All Questions
|
| 70 |
+
```python
|
| 71 |
+
questions = faq.list_all_questions()
|
| 72 |
+
```
|
|
|
|
| 73 |
|
| 74 |
+
## Example Questions
|
| 75 |
|
| 76 |
+
- "Can I take this bootcamp without programming experience?"
|
| 77 |
+
- "Why should I trust Codebasics?"
|
| 78 |
+
- "What are the prerequisites?"
|
| 79 |
+
- "Do I need a laptop?"
|
| 80 |
+
- "Is there lifetime access?"
|
| 81 |
+
- "Do you provide job assistance?"
|
| 82 |
|
| 83 |
+
## How It Works
|
| 84 |
|
| 85 |
+
1. **TF-IDF Vectorization**: Converts questions into numerical vectors
|
| 86 |
+
2. **Cosine Similarity**: Measures similarity between user query and FAQ questions
|
| 87 |
+
3. **Best Match Selection**: Returns the most similar question with confidence score
|
| 88 |
|
| 89 |
+
## Accuracy
|
| 90 |
|
| 91 |
+
- Typically 85-95% accuracy on similar phrasings
|
| 92 |
+
- Handles variations in question format
|
| 93 |
+
- Case-insensitive matching
|
| 94 |
+
- Removes common stop words
|
| 95 |
|
| 96 |
+
## License
|
| 97 |
|
| 98 |
+
Apache 2.0
|
| 99 |
|
| 100 |
+
## Contact
|
| 101 |
|
| 102 |
+
For questions about CodeBasics courses, visit [codebasics.io](https://codebasics.io)
|