arun3676 committed on
Commit
b7db63d
·
1 Parent(s): cf986ba

Configure for HuggingFace Spaces Docker deployment - Add Dockerfile, .dockerignore, update README with HF metadata, optimize requirements.txt

.dockerignore ADDED
@@ -0,0 +1,84 @@
+ # Environment and secrets
+ .env
+ .env.local
+ .env.*.local
+ *.env
+
+ # Git
+ .git
+ .gitignore
+ .gitattributes
+
+ # Python
+ __pycache__/
+ *.py[cod]
+ *$py.class
+ *.so
+ .Python
+ build/
+ develop-eggs/
+ dist/
+ downloads/
+ eggs/
+ .eggs/
+ lib/
+ lib64/
+ parts/
+ sdist/
+ var/
+ wheels/
+ pip-wheel-metadata/
+ share/python-wheels/
+ *.egg-info/
+ .installed.cfg
+ *.egg
+
+ # Virtual environments
+ venv/
+ .venv/
+ ENV/
+ env/
+ env.bak/
+ venv.bak/
+
+ # Testing
+ tests/
+ test_*.py
+ .pytest_cache/
+ .coverage
+ .tox/
+ .nox/
+ htmlcov/
+ .hypothesis/
+
+ # Documentation
+ *.md
+ !README.md
+ ARCHITECTURE.md
+ PROJECT_STRUCTURE.md
+ TESTING_GUIDE.md
+ HUGGINGFACE_SETUP_GUIDE.md
+ MERCURY_API_SETUP.md
+ *.txt
+ !requirements.txt
+
+ # IDE
+ .vscode/
+ .idea/
+ *.swp
+ *.swo
+ *~
+ .DS_Store
+
+ # Deployment plans
+ *.plan.md
+
+ # Setup and test scripts
+ setup_huggingface.py
+ test_api_connections.py
+ test_huggingface_setup.py
+
+ # Cache
+ .cache/
+ *.cache
+
Dockerfile ADDED
@@ -0,0 +1,41 @@
+ FROM python:3.11-slim
+
+ WORKDIR /app
+
+ # Install system dependencies (curl is required by the HEALTHCHECK below)
+ RUN apt-get update && apt-get install -y \
+     build-essential \
+     git curl \
+     && rm -rf /var/lib/apt/lists/*
+
+ # Copy requirements first for better Docker layer caching
+ COPY requirements.txt .
+
+ # Install Python dependencies
+ RUN pip install --no-cache-dir -r requirements.txt
+
+ # Copy application code
+ COPY analyzer/ ./analyzer/
+ COPY matrix_final.py .
+ COPY run_app.py .
+ COPY evaluation_samples/ ./evaluation_samples/
+
+ # Create cache directory
+ RUN mkdir -p /app/.cache
+
+ # Expose HuggingFace Spaces default port
+ EXPOSE 7860
+
+ # Health check against Streamlit's built-in health endpoint
+ HEALTHCHECK --interval=30s --timeout=10s --start-period=40s --retries=3 \
+     CMD curl -f http://localhost:7860/_stcore/health || exit 1
+
+ # Run Streamlit with HF Spaces compatible settings
+ CMD ["streamlit", "run", "matrix_final.py", \
+      "--server.port=7860", \
+      "--server.address=0.0.0.0", \
+      "--server.headless=true", \
+      "--server.enableCORS=false", \
+      "--server.enableXsrfProtection=false", \
+      "--browser.gatherUsageStats=false"]
+
INTEGRATION_GUIDE.md DELETED
@@ -1,368 +0,0 @@
- # Integration Guide: Fine-tuned Model with Your Code Analyzer
-
- This guide explains how to integrate your fine-tuned DeepSeek model with the existing code analyzer app.
-
- ## 📋 What You Have Now
-
- After completing the Colab training, you have:
- ✅ Fine-tuned DeepSeek model adapters (~20MB)
- ✅ Enhanced analyzer class supporting both models
- ✅ Original CodeT5+ model still working
- ✅ All existing UI features preserved
-
- ## 🔄 Integration Steps
-
- ### Step 1: Download Your Fine-tuned Model from Colab
-
- In your final Colab cell, you saved the model to Google Drive. Now download it:
-
- **Option A: From Google Drive**
- 1. Go to Google Drive → `MyDrive/ai-code-analyzer/`
- 2. Download the `fine-tuned-analyst` folder
- 3. Place it in your project root: `C:\Users\arunk\professional\ai-code-analyzer\fine-tuned-analyst\`
-
- **Option B: Download Directly from Colab**
- ```python
- # Run this in Colab to create a downloadable ZIP
- import shutil
- shutil.make_archive('fine-tuned-analyst', 'zip', './fine-tuned-analyst')
-
- from google.colab import files
- files.download('fine-tuned-analyst.zip')
- ```
-
- Then extract the ZIP in your project root.
-
- ### Step 2: Install Required Dependencies
-
- Update your `requirements.txt` to include PEFT:
-
- ```bash
- # Add this line to requirements.txt
- peft>=0.7.0
- ```
-
- Install it:
- ```bash
- pip install peft
- ```
-
- ### Step 3: Test the Enhanced Analyzer Locally
-
- Run the test script to verify everything works:
-
- ```bash
- python optimized_code_analyzer_enhanced.py
- ```
-
- You should see:
- ✅ CodeT5+ analysis
- ✅ Fine-tuned DeepSeek analysis
- ✅ Model comparison
-
- ### Step 4: Update Your Streamlit UI
-
- Replace the analyzer import in `matrix_final.py`:
-
- **Find this (around line 8):**
- ```python
- from optimized_code_analyzer import OptimizedCodeAnalyzer
- ```
-
- **Replace with:**
- ```python
- from optimized_code_analyzer_enhanced import EnhancedCodeAnalyzer
- ```
-
- **Find this (around line 287):**
- ```python
- @st.cache_resource
- def get_local_analyzer():
-     return OptimizedCodeAnalyzer(
-         model_id="Salesforce/codet5p-220m",
-         precision="fp16",
-         quick_max_new_tokens=180,
-         detailed_max_new_tokens=240,
-     )
- ```
-
- **Replace with:**
- ```python
- @st.cache_resource
- def get_local_analyzer(model_type="codet5"):
-     return EnhancedCodeAnalyzer(
-         model_type=model_type,
-         precision="fp16",
-         quick_max_new_tokens=180,
-         detailed_max_new_tokens=300,
-     )
- ```
-
- ### Step 5: Add Model Selector to Sidebar
-
- Add this to your sidebar (around line 490, in the sidebar section):
-
- ```python
- # Model Selection
- st.sidebar.markdown("---")
- st.sidebar.markdown("### 🤖 AI Model Selection")
- model_choice = st.sidebar.radio(
-     "Choose Analysis Model:",
-     ["CodeT5+ (Fast)", "Fine-tuned DeepSeek (Accurate)"],
-     help="CodeT5+ is faster; the fine-tuned model gives more detailed analysis"
- )
-
- model_type = "codet5" if "CodeT5+" in model_choice else "deepseek-finetuned"
- ```
-
- ### Step 6: Update the Analysis Call
-
- Find where the analyzer is called (around line 600+) and update it:
-
- **Find something like:**
- ```python
- local_analyzer = get_local_analyzer()
- result = local_analyzer.analyze_code_fast(code)
- ```
-
- **Replace with:**
- ```python
- local_analyzer = get_local_analyzer(model_type)
- result = local_analyzer.analyze_code_fast(code)
- ```
-
- ### Step 7: Test Everything
-
- Run your Streamlit app:
- ```bash
- streamlit run matrix_final.py
- ```
-
- Test both models:
- 1. Select "CodeT5+ (Fast)" → Run analysis → Should work as before
- 2. Select "Fine-tuned DeepSeek (Accurate)" → Run analysis → Should give detailed analysis with quality scores
-
- ## 📊 What Each Model Does
-
- ### CodeT5+ (Base Model)
- **Speed**: ⚡ Fast (2-3 seconds)
- **Memory**: ~1GB
- **Analysis**: General code analysis
- **Best for**: Quick checks, batch processing
- **Quality**: Good for basic issues
-
- ### Fine-tuned DeepSeek (Your Model)
- **Speed**: 🚀 Moderate (3-5 seconds)
- **Memory**: ~1.5GB
- **Analysis**: Detailed with quality scores (1-100)
- **Best for**: Deep analysis, learning, production code
- **Quality**: Excellent - trained on your specific patterns
- **Output format**:
-   Quality Score (1-100)
-   Bugs section
-   Performance issues
-   Security concerns
-   Improvement suggestions with code examples
-
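The structured output format above is easy to post-process in the UI. A minimal sketch of a score extractor, assuming the model emits a line like `Quality Score: 35/100` (the exact label is whatever the training data used, so adjust the pattern to match your dataset):

```python
import re
from typing import Optional

def extract_quality_score(analysis: str) -> Optional[int]:
    """Pull the 1-100 quality score out of the fine-tuned model's text output.

    Returns None when no score line is found (e.g. plain CodeT5+ output),
    so the caller can fall back to displaying the raw text.
    """
    match = re.search(r"Quality Score:\s*(\d{1,3})\s*/\s*100", analysis)
    if not match:
        return None
    score = int(match.group(1))
    # Reject out-of-range numbers the model might hallucinate.
    return score if 0 <= score <= 100 else None

print(extract_quality_score("Quality Score: 35/100\n\nBUGS:\n- No error handling"))  # → 35
print(extract_quality_score("General analysis without a score"))  # → None
```

Returning `None` instead of raising keeps the same code path working for both models.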
- ## 🎯 Key Features of the Enhanced System
-
- ### 1. Dual Model Support
- Seamlessly switch between models
- Separate caching for each model
- No breaking changes to existing code
-
- ### 2. Improved Analysis Quality
- Your fine-tuned model provides:
- **Structured output**: Quality score, bugs, performance, security
- **Code examples**: Shows how to fix issues
- **Contextual understanding**: Trained on your dataset patterns
- **Consistent formatting**: Always includes all sections
-
- ### 3. Memory Efficient
- LoRA adapters are tiny (~20MB vs 1GB+ full model)
- Base model shared, adapters loaded on demand
- Can deploy both models without doubling memory
-
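The "separate caching for each model" behavior comes from `st.cache_resource` memoizing per argument value. Outside Streamlit, the same idea can be sketched with `functools.lru_cache` as a stand-in (the `EnhancedCodeAnalyzer` here is a dummy placeholder, not the real class):

```python
from functools import lru_cache

class EnhancedCodeAnalyzer:
    """Dummy stand-in for the real analyzer, for illustration only."""
    def __init__(self, model_type: str):
        self.model_type = model_type

@lru_cache(maxsize=None)  # like st.cache_resource: one cached instance per distinct argument
def get_local_analyzer(model_type: str = "codet5") -> EnhancedCodeAnalyzer:
    return EnhancedCodeAnalyzer(model_type=model_type)

a = get_local_analyzer("codet5")
b = get_local_analyzer("codet5")
c = get_local_analyzer("deepseek-finetuned")
print(a is b)  # → True: repeated calls reuse the cached instance
print(a is c)  # → False: each model_type gets its own cache slot
```

This is why passing `model_type` into `get_local_analyzer` keeps both models loaded side by side instead of reloading on every switch.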
- ## 🚀 Deployment Options
-
- ### Option 1: Local Deployment (Current)
- **Pros:**
- Free
- Fast
- Full control
- Easy testing
-
- **Cons:**
- Only you can use it
- Needs your computer running
-
- **Setup:** Already working! Just use Streamlit locally.
-
- ### Option 2: Hugging Face Spaces (Recommended)
- **Pros:**
- FREE hosting
- Automatic HTTPS
- Share with anyone
- GPU available (paid tier)
-
- **Setup:**
- 1. Create account on huggingface.co
- 2. Create new Space (Streamlit)
- 3. Upload files:
-    `matrix_final.py`
-    `optimized_code_analyzer_enhanced.py`
-    `requirements.txt`
-    `fine-tuned-analyst/` folder
- 4. Add `app.py`:
- ```python
- # app.py (for HF Spaces)
- import subprocess
- subprocess.run(["streamlit", "run", "matrix_final.py"])
- ```
-
- ### Option 3: Railway.app
- **Cost:** $5/month
- **Memory:** Up to 8GB
- **Speed:** Faster than HF Spaces
-
- **Setup:**
- 1. Connect GitHub repo
- 2. Set start command: `streamlit run matrix_final.py --server.port $PORT`
- 3. Deploy
-
- ### Option 4: Render.com
- **Cost:** FREE tier available
- **Memory:** 512MB (might be tight)
- **Speed:** Good
-
- **Setup:**
- 1. Connect repo
- 2. Use Docker:
- ```dockerfile
- FROM python:3.11-slim
- WORKDIR /app
- COPY requirements.txt .
- RUN pip install -r requirements.txt
- COPY . .
- CMD streamlit run matrix_final.py --server.port $PORT
- ```
-
- ## 🐛 Troubleshooting
-
- ### Issue: "fine-tuned-analyst folder not found"
- **Solution:** Make sure the folder is in your project root with these files:
- ```
- fine-tuned-analyst/
- ├── adapter_config.json
- ├── adapter_model.bin (or adapter_model.safetensors)
- ├── tokenizer_config.json
- └── special_tokens_map.json
- ```
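A quick pre-flight check for exactly this failure mode can be scripted with the standard library. This is a sketch under the assumption that the folder layout matches the listing above (adapter weights may be saved as either `.bin` or `.safetensors`):

```python
from pathlib import Path

REQUIRED = ["adapter_config.json", "tokenizer_config.json", "special_tokens_map.json"]
WEIGHTS = ["adapter_model.bin", "adapter_model.safetensors"]

def missing_adapter_files(folder: str = "fine-tuned-analyst") -> list[str]:
    """Return the filenames the adapter loader would fail to find."""
    root = Path(folder)
    missing = [name for name in REQUIRED if not (root / name).is_file()]
    # Either weights format is acceptable; only flag it when both are absent.
    if not any((root / name).is_file() for name in WEIGHTS):
        missing.append("adapter_model.bin (or adapter_model.safetensors)")
    return missing

if __name__ == "__main__":
    gaps = missing_adapter_files()
    print("OK" if not gaps else f"Missing: {gaps}")
```

Running this before `streamlit run matrix_final.py` turns a vague load error into a concrete list of missing files.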
-
- ### Issue: "PEFT not installed"
- **Solution:**
- ```bash
- pip install peft
- ```
-
- ### Issue: "Model too slow"
- **Solution:**
- Use "quick" mode instead of "detailed"
- Reduce `max_new_tokens` to 150
- Use INT8 or INT4 quantization
-
- ### Issue: "Out of memory"
- **Solution:**
- Close other applications
- Use CodeT5+ instead (smaller)
- Enable quantization: `precision="int8"`
-
- ## 📚 Understanding the Libraries Used
-
- ### Core Libraries
-
- **Transformers** (`transformers`)
- What: Hugging Face's library for AI models
- Does: Loads models, tokenizers, handles generation
- Used for: Loading DeepSeek and CodeT5+ models
-
- **PEFT** (`peft`)
- What: Parameter Efficient Fine-Tuning
- Does: Loads LoRA adapters efficiently
- Used for: Your fine-tuned model adapters
-
- **PyTorch** (`torch`)
- What: Deep learning framework
- Does: Runs neural networks on GPU/CPU
- Used for: Model inference, tensor operations
-
- **Streamlit** (`streamlit`)
- What: Web app framework for Python
- Does: Creates interactive UI
- Used for: Your code analyzer interface
-
- ### How They Work Together
-
- ```
- User Input (Streamlit)
-          ↓
- EnhancedCodeAnalyzer
-          ↓
- Transformers (loads base model)
-          ↓
- PEFT (loads adapters)
-          ↓
- PyTorch (runs inference)
-          ↓
- Result → Streamlit UI
- ```
-
- ## 🎓 Next Steps
-
- 1. **Test both models** with various code samples
- 2. **Compare quality** - which model works better for your use cases?
- 3. **Expand dataset** - Add more samples and retrain (only takes 20 minutes!)
- 4. **Deploy** - Choose a hosting option and share with others
- 5. **Iterate** - Collect feedback and improve
-
- ## 💡 Tips for Best Results
-
- ### When to Use CodeT5+
- Quick syntax checks
- Batch processing many files
- Resource-constrained environments
- Simple code reviews
-
- ### When to Use Fine-tuned DeepSeek
- Production code reviews
- Learning/education
- Complex analysis needed
- When quality > speed
- Security audits
-
- ## 🎉 Congratulations!
-
- You've successfully:
- ✅ Fine-tuned a language model
- ✅ Integrated it with your app
- ✅ Created a dual-model system
- ✅ Learned about model deployment
- ✅ Built a production-ready tool
-
- Your code analyzer now has:
- **2 AI models** to choose from
- **Professional quality** analysis
- **Scalable architecture** for future improvements
- **Production-ready** code
-
- ## 📞 Support
-
- If you need help:
- 1. Check error messages carefully
- 2. Review this guide
- 3. Test with simple code first
- 4. Compare with working examples
- 5. Ask for help with specific errors
-
- Happy coding! 🚀
-
INTEGRATION_SUMMARY.md DELETED
@@ -1,253 +0,0 @@
- # Integration Complete! 🎉
-
- ## What We've Built
-
- You now have a **dual-model code analyzer** that supports both:
- 1. **CodeT5+ (Base)** - Fast, lightweight analysis
- 2. **Fine-tuned DeepSeek** - Detailed, accurate analysis with quality scores
-
- ## Files Created
-
- ### ✅ Core Files
- 1. **`optimized_code_analyzer_enhanced.py`** - New analyzer supporting both models
- 2. **`INTEGRATION_GUIDE.md`** - Complete step-by-step integration instructions
- 3. **`test_finetuned_local.py`** - Test script to verify everything works
- 4. **`INTEGRATION_SUMMARY.md`** - This file (quick reference)
-
- ### 📦 What You Need from Colab
- `fine-tuned-analyst/` folder from Google Drive containing your trained model adapters
-
- ## Quick Start (3 Steps)
-
- ### Step 1: Get Your Model
- Download the `fine-tuned-analyst` folder from Google Drive (saved in Cell 9 of Colab) and place it in your project root:
- ```
- C:\Users\arunk\professional\ai-code-analyzer\fine-tuned-analyst\
- ```
-
- ### Step 2: Install Dependencies
- ```bash
- pip install peft
- ```
-
- ### Step 3: Test It
- ```bash
- python test_finetuned_local.py
- ```
-
- If all checks pass ✅, you're ready to integrate with your UI!
-
- ## Integration with Streamlit UI
-
- ### Quick Changes to `matrix_final.py`
-
- **Change 1: Update Import (Line ~8)**
- ```python
- # OLD:
- from optimized_code_analyzer import OptimizedCodeAnalyzer
-
- # NEW:
- from optimized_code_analyzer_enhanced import EnhancedCodeAnalyzer
- ```
-
- **Change 2: Update Analyzer Function (Line ~287)**
- ```python
- # OLD:
- @st.cache_resource
- def get_local_analyzer():
-     return OptimizedCodeAnalyzer(
-         model_id="Salesforce/codet5p-220m",
-         precision="fp16",
-         quick_max_new_tokens=180,
-         detailed_max_new_tokens=240,
-     )
-
- # NEW:
- @st.cache_resource
- def get_local_analyzer(model_type="codet5"):
-     return EnhancedCodeAnalyzer(
-         model_type=model_type,
-         precision="fp16",
-         quick_max_new_tokens=180,
-         detailed_max_new_tokens=300,
-     )
- ```
-
- **Change 3: Add Model Selector to Sidebar (Add after line ~490)**
- ```python
- # Model Selection
- st.sidebar.markdown("---")
- st.sidebar.markdown("### 🤖 AI Model Selection")
- model_choice = st.sidebar.radio(
-     "Choose Analysis Model:",
-     ["CodeT5+ (Fast)", "Fine-tuned DeepSeek (Accurate)"],
-     help="CodeT5+ is faster; the fine-tuned model gives more detailed analysis"
- )
-
- model_type = "codet5" if "CodeT5+" in model_choice else "deepseek-finetuned"
- ```
-
- **Change 4: Update Analysis Calls**
- Find places where the analyzer is called and add the `model_type` parameter:
- ```python
- # OLD:
- local_analyzer = get_local_analyzer()
-
- # NEW:
- local_analyzer = get_local_analyzer(model_type)
- ```
-
- ## What Each Model Does
-
- ### CodeT5+ (Fast) ⚡
- **Speed**: 2-3 seconds
- **Memory**: ~1GB
- **Output**: General analysis
- **Best for**: Quick checks
-
- ### Fine-tuned DeepSeek (Accurate) 🎯
- **Speed**: 3-5 seconds
- **Memory**: ~1.5GB
- **Output**:
-   Quality score (e.g., 35/100)
-   Bugs section with specifics
-   Performance issues
-   Security concerns
-   Improvement suggestions with code examples
- **Best for**: Production code, learning, detailed reviews
-
- ## Example Output
-
- Your fine-tuned model gives structured output like:
-
- ```
- Quality Score: 35/100
-
- BUGS:
- - No error handling
- - Infinite recursion possible
-
- PERFORMANCE ISSUES:
- - Recursive calls cause exponential time complexity
-
- SECURITY CONCERNS:
- - No input validation
-
- IMPROVEMENTS:
- 1. Use memoization to avoid recursion
- 2. Add input validation
-
- Example improved code:
- [Shows working code with fixes]
- ```
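The memoization fix the report suggests, applied to a naive recursive Fibonacci (the snippet under review isn't shown, so this is an illustrative version, not the model's actual output):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def fib(n: int) -> int:
    """Memoized Fibonacci: O(n) calls instead of the exponential naive recursion."""
    if n < 0:
        raise ValueError("n must be non-negative")  # input validation, per IMPROVEMENTS
    if n < 2:
        return n
    return fib(n - 1) + fib(n - 2)

print(fib(30))  # → 832040, computed instantly
```

Without the cache, `fib(30)` would recompute subproblems roughly 2.7 million times; with it, each value is computed once.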
-
- ## Testing Checklist
-
- [ ] Run `test_finetuned_local.py` - all checks pass
- [ ] Update `matrix_final.py` imports
- [ ] Add model selector to sidebar
- [ ] Test with Streamlit: `streamlit run matrix_final.py`
- [ ] Try both models with sample code
- [ ] Verify quality scores appear for fine-tuned model
- [ ] Check caching works for both models
-
- ## Troubleshooting
-
- | Issue | Solution |
- |-------|----------|
- | "fine-tuned-analyst not found" | Download from Google Drive, place in project root |
- | "PEFT not installed" | Run `pip install peft` |
- | "Model too slow" | Use CodeT5+ or enable quantization |
- | "Out of memory" | Close other apps or use CodeT5+ |
-
- ## What You Learned
-
- ✅ **Fine-tuning LLMs** with LoRA/QLoRA
- ✅ **Google Colab** for GPU training
- ✅ **Model integration** with existing apps
- ✅ **Dual-model architecture** for flexibility
- ✅ **Production deployment** considerations
-
- ## Next Steps (Choose One)
-
- ### Option A: Deploy Locally (Easiest)
- Just run `streamlit run matrix_final.py` - you're done!
-
- ### Option B: Deploy to Cloud (Share with Others)
- 1. **Hugging Face Spaces** (FREE) - Follow INTEGRATION_GUIDE.md
- 2. **Railway.app** ($5/month) - Best performance
- 3. **Render.com** (FREE tier) - Good alternative
-
- ### Option C: Improve the Model
- 1. Add more training samples (up to 150-200)
- 2. Retrain in Colab (only takes 20 minutes!)
- 3. Test new version
- 4. Deploy updated model
-
- ## Files Structure
-
- ```
- ai-code-analyzer/
- ├── optimized_code_analyzer.py           # Original (keep for reference)
- ├── optimized_code_analyzer_enhanced.py  # NEW - supports both models
- ├── matrix_final.py                      # Update this file
- ├── test_finetuned_local.py              # NEW - test script
- ├── INTEGRATION_GUIDE.md                 # NEW - detailed guide
- ├── INTEGRATION_SUMMARY.md               # NEW - this file
- ├── analyst_dataset_expanded.jsonl       # Your training data
- ├── requirements.txt                     # Add 'peft' here
- └── fine-tuned-analyst/                  # Download from Colab
-     ├── adapter_config.json
-     ├── adapter_model.bin
-     └── ... (other files)
- ```
-
- ## Support
-
- If you need help:
- 1. Check `INTEGRATION_GUIDE.md` for detailed instructions
- 2. Run `test_finetuned_local.py` to diagnose issues
- 3. Check error messages for specific problems
- 4. Test with simple code first before complex examples
-
- ## Success Criteria
-
- You'll know everything is working when:
- ✅ Test script passes all checks
- ✅ Streamlit app loads without errors
- ✅ Can switch between models in sidebar
- ✅ CodeT5+ gives fast analysis
- ✅ Fine-tuned model gives quality scores and detailed output
- ✅ Both models use separate caches
-
- ## Congratulations! 🎉
-
- You've successfully:
- ✅ Fine-tuned a language model on Google Colab
- ✅ Created a production-ready code analyzer
- ✅ Integrated AI models with a web app
- ✅ Built a dual-model system for flexibility
- ✅ Learned about LoRA, PEFT, and model deployment
-
- Your code analyzer is now **powered by AI you trained yourself**! 🚀
-
- ---
-
- **Quick Reference Commands:**
- ```bash
- # Test integration
- python test_finetuned_local.py
-
- # Run app
- streamlit run matrix_final.py
-
- # Install dependencies
- pip install peft
-
- # Check if adapters exist
- dir fine-tuned-analyst        # Windows
- ls fine-tuned-analyst/        # Linux/Mac
- ```
-
- **Need Help?** See `INTEGRATION_GUIDE.md` for complete instructions.
-
MCP_SETUP_GUIDE.md DELETED
@@ -1,142 +0,0 @@
- # MCP Tools Setup Guide for AI Code Analyzer Project
-
- ## 🎯 Configured MCP Tools
-
- I've set up the following MCP tools in your `~/.cursor/mcp.json` file:
-
- ### **Essential for ML/AI Projects:**
-
- 1. **🤗 Hugging Face** - Model management and dataset access
-    **API Key Location**: https://huggingface.co/settings/tokens
-    **Required Scope**: `read` (minimum), `write` (if you want to upload models)
-    **Replace**: `YOUR_HUGGINGFACE_API_KEY_HERE`
-
- 2. **🗄️ DuckDB** - Local analytics database (no API key needed)
-    **Perfect for**: Analyzing your `analyst_dataset.jsonl` file
-    **No setup required** - ready to use!
-
- ### **Development & Deployment:**
-
- 3. **🐙 GitHub** - Version control and collaboration
-    **API Key Location**: https://github.com/settings/tokens
-    **Required Scope**: `repo`, `workflow`, `read:org`
-    **Replace**: `YOUR_GITHUB_TOKEN_HERE`
-
- 4. **🚀 Vercel** - Deploy web interfaces
-    **API Key Location**: https://vercel.com/account/tokens
-    **Replace**: `YOUR_VERCEL_TOKEN_HERE`
-
- 5. **🚂 Railway** - Full-stack deployment
-    **API Key Location**: https://railway.app/account/tokens
-    **Replace**: `YOUR_RAILWAY_TOKEN_HERE`
-
- ### **Data Storage:**
-
- 6. **🍃 MongoDB** - Database for structured data
-    **Connection String**: Get from MongoDB Atlas (https://cloud.mongodb.com/)
-    **Format**: `mongodb+srv://username:password@cluster.mongodb.net/database`
-    **Replace**: `YOUR_MONGODB_CONNECTION_STRING_HERE`
-
- ### **Monitoring & Error Tracking:**
-
- 7. **🚨 Sentry** - Error tracking and performance monitoring
-    **Auth Token**: https://sentry.io/settings/account/api/auth-tokens/
-    **Organization**: Your Sentry org slug
-    **Project**: Your project slug
-    **Replace**:
-      `YOUR_SENTRY_AUTH_TOKEN_HERE`
-      `YOUR_SENTRY_ORG_HERE`
-      `YOUR_SENTRY_PROJECT_HERE`
-
- ### **Security & Code Quality:**
-
- 8. **🔒 Snyk** - Vulnerability scanning
-    **API Key Location**: https://app.snyk.io/account
-    **Replace**: `YOUR_SNYK_TOKEN_HERE`
-
- 9. **🔍 Semgrep** - Static analysis (no API key needed)
-    **Ready to use** - no setup required!
-
- 10. **🏗️ SonarQube** - Code analysis
-    **Setup**: You need a SonarQube instance (cloud or self-hosted)
-    **Token**: Generate in your SonarQube instance
-    **Replace**:
-      `YOUR_SONARQUBE_URL_HERE` (e.g., `https://your-org.sonarcloud.io`)
-      `YOUR_SONARQUBE_TOKEN_HERE`
-
- ## 🚀 Quick Start (Recommended Order)
-
- ### **Phase 1: Essential Tools (Start Here)**
- 1. **Hugging Face** - Most important for your ML project
- 2. **DuckDB** - Already ready, no API key needed
- 3. **GitHub** - For version control
-
- ### **Phase 2: Development Tools**
- 4. **Vercel** or **Railway** - For deployment
- 5. **Sentry** - For monitoring
-
- ### **Phase 3: Advanced Tools (Optional)**
- 6. **MongoDB** - If you need structured data storage
- 7. **Snyk** - For security scanning
- 8. **Semgrep** - For static analysis
- 9. **SonarQube** - For comprehensive code analysis
-
- ## 📝 How to Add API Keys
-
- 1. **Open your MCP config**: `c:\Users\arunk\.cursor\mcp.json`
- 2. **Find the placeholder** (e.g., `YOUR_HUGGINGFACE_API_KEY_HERE`)
- 3. **Replace with your actual API key/token**
- 4. **Save the file**
- 5. **Restart Cursor** for changes to take effect
-
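For reference, a filled-in entry in `mcp.json` typically looks something like the fragment below. The server name, package, and environment variable are illustrative placeholders; keep whatever names your existing config already uses:

```json
{
  "mcpServers": {
    "github": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-github"],
      "env": {
        "GITHUB_PERSONAL_ACCESS_TOKEN": "ghp_yourActualTokenHere"
      }
    }
  }
}
```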
-
- ## 🔧 Installation Commands
-
- Run these commands to ensure all MCP servers are installed:
-
- ```powershell
- # Install all MCP servers
- npm install -g @modelcontextprotocol/server-huggingface
- npm install -g @modelcontextprotocol/server-github
- npm install -g @modelcontextprotocol/server-duckdb
- npm install -g @modelcontextprotocol/server-sentry
- npm install -g @modelcontextprotocol/server-vercel
- npm install -g @modelcontextprotocol/server-mongodb
- npm install -g @modelcontextprotocol/server-railway
- npm install -g @modelcontextprotocol/server-snyk
- npm install -g @modelcontextprotocol/server-semgrep
- npm install -g @modelcontextprotocol/server-sonarqube
- ```
-
- ## ✅ Testing Your Setup
-
- After adding API keys, test each tool in Cursor Composer:
-
- **Hugging Face**: "Search for code analysis models on Hugging Face"
- **DuckDB**: "Analyze my training dataset using DuckDB"
- **GitHub**: "Show me my recent commits"
- **Sentry**: "Check for errors in my project"
-
- ## 🛡️ Security Notes
-
- Never commit API keys to version control
- Use environment variables for sensitive data
- Regularly rotate your API keys
- Use minimal required permissions for each service
-
- ## 🆘 Troubleshooting
-
- If a tool doesn't work:
- 1. Check if the MCP server is installed: `npm list -g @modelcontextprotocol/server-*`
- 2. Verify API keys are correct and have proper permissions
- 3. Restart Cursor after making changes
- 4. Check Cursor's MCP settings in `Settings > Features > MCP`
-
- ## 📚 Useful Commands for Your Project
-
- Once set up, you can use these commands in Cursor Composer:
-
- "Use Hugging Face to find the best code analysis models"
- "Analyze my analyst_dataset.jsonl with DuckDB"
- "Check my code for vulnerabilities with Snyk"
- "Deploy my analyzer to Vercel"
- "Monitor errors with Sentry"
MODEL_UPGRADE_GUIDE.md DELETED
@@ -1,155 +0,0 @@
1
- # 🚀 AI Code Analyzer Model Upgrade Guide
2
-
3
- ## 📊 What We've Done - Step by Step Explanation
4
-
5
- ### **Step 1: Identified the Problem**
6
- - **Current Issue**: Your fine-tuned model uses GPT-2 as base (from `adapter_config.json`)
7
- - **Problem**: GPT-2 is a general-purpose text model, not optimized for code analysis
8
- - **Impact**: Limited understanding of programming concepts, syntax, and code structure
9
-
10
- ### **Step 2: Found Better Models**
11
- We researched and identified specialized code analysis models:
12
-
13
- | Model | Size | Specialization | Best For |
14
- |-------|------|----------------|----------|
15
- | **Salesforce/codet5p-220m** | 220M | Code understanding & generation | **Code analysis** ⭐ |
16
- | **microsoft/CodeBERT-base** | 125M | Code understanding | Feature extraction |
17
- | **facebook/incoder-1B** | 1B | Code completion & analysis | Large context analysis |
18
- | **microsoft/GraphCodeBERT-base** | 125M | Code structure understanding | Dependency analysis |
19
-
20
- ### **Step 3: Created Improved Training Script**
- **File**: `finetune_improved.py`
-
- **Key Improvements**:
- - ✅ **Better default model**: CodeT5+ instead of GPT-2
- - ✅ **Model type detection**: Automatically handles different architectures
- - ✅ **Optimized LoRA configs**: Different settings for different model types
- - ✅ **Better error handling**: More robust training process
- - ✅ **Flexible model selection**: Easy to switch between models
-
- ### **Step 4: Created Testing Tools**
- **Files**:
- - `model_comparison.py` - Compare different models
- - `test_models.py` - Quick testing of model capabilities
-
- ## 🎯 Why CodeT5+ is Better for Your Project
-
- ### **Current Model (GPT-2) Limitations**:
- - ❌ Not trained on code
- - ❌ Limited understanding of programming concepts
- - ❌ Poor handling of code syntax and structure
- - ❌ General-purpose text model
-
- ### **CodeT5+ Advantages**:
- - ✅ **Specialized for code**: Trained specifically on code datasets
- - ✅ **Better architecture**: Seq2Seq model perfect for analysis tasks
- - ✅ **Code understanding**: Understands programming languages, syntax, and patterns
- - ✅ **Optimized tokenization**: Better handling of code tokens
- - ✅ **Proven performance**: State-of-the-art results on code analysis benchmarks
-
- ## 🚀 How to Use the New System
-
- ### **Step 1: Test Models (Recommended)**
- ```bash
- # Compare different models
- python model_comparison.py
-
- # Test model capabilities
- python test_models.py
- ```
-
- ### **Step 2: Fine-tune with Better Model**
- ```bash
- # Use CodeT5+ (recommended)
- python finetune_improved.py --model Salesforce/codet5p-220m --dry-run
-
- # Or try CodeBERT
- python finetune_improved.py --model microsoft/codebert-base --dry-run
-
- # Full training (remove --dry-run)
- python finetune_improved.py --model Salesforce/codet5p-220m --epochs 3
- ```
-
- ### **Step 3: Compare Results**
- - Test your current GPT-2 model vs new CodeT5+ model
- - Compare analysis quality on your training examples
- - Measure performance improvements
-
- ## 📈 Expected Improvements
-
- ### **Code Analysis Quality**:
- - **Better bug detection**: Understanding of common programming errors
- - **Improved security analysis**: Knowledge of security vulnerabilities
- - **Enhanced performance insights**: Understanding of algorithmic complexity
- - **Better code structure analysis**: Recognition of design patterns
-
- ### **Training Efficiency**:
- - **Faster convergence**: Code-specialized models learn faster on code tasks
- - **Better generalization**: Understanding of programming concepts transfers better
- - **Reduced overfitting**: Better base knowledge means less overfitting
-
- ## 🔧 Technical Details
-
- ### **Model Architecture Changes**:
- ```python
- # Old (GPT-2)
- model = AutoModelForCausalLM.from_pretrained("gpt2")
-
- # New (CodeT5+)
- model = AutoModelForSeq2SeqLM.from_pretrained("Salesforce/codet5p-220m")
- ```
-
- ### **LoRA Configuration Updates**:
- ```python
- # CodeT5+ specific target modules
- target_modules = ["q", "v", "k", "o", "wi_0", "wi_1", "wo"]
-
- # Different task type
- task_type = TaskType.SEQ_2_SEQ_LM
- ```
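The "different settings for different model types" idea in the deleted guide can be sketched as a plain dispatch helper. This is an illustrative sketch, not the actual code from `finetune_improved.py`: the function name and the `r`/`lora_alpha` values are assumptions, while the target-module names mirror the snippet above (T5-style projections for CodeT5+, `c_attn` for GPT-2).

```python
# Hypothetical helper: pick LoRA settings from the base-model name.
# Values are illustrative defaults, not the repository's real config.
def lora_settings(model_name: str) -> dict:
    name = model_name.lower()
    if "codet5" in name:
        return {
            "task_type": "SEQ_2_SEQ_LM",
            "target_modules": ["q", "v", "k", "o", "wi_0", "wi_1", "wo"],
            "r": 16,
            "lora_alpha": 32,
        }
    if "gpt2" in name:
        return {
            "task_type": "CAUSAL_LM",
            "target_modules": ["c_attn"],
            "r": 8,
            "lora_alpha": 16,
        }
    raise ValueError(f"No LoRA preset for {model_name}")
```

The returned dict maps directly onto the fields of PEFT's `LoraConfig`, so switching base models only changes one string.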
-
- ### **Training Data Compatibility**:
- - ✅ **Your current dataset works**: No changes needed to `analyst_dataset.jsonl`
- - ✅ **Same format**: The improved script handles your existing data
- - ✅ **Better results**: Code-specialized models will perform better
-
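The compatibility claim above is easy to spot-check with a few lines of stdlib Python. A hypothetical validator, assuming each JSONL record carries a `"text"` field as in the samples from `additional_samples.py`:

```python
import json

# Hypothetical sanity check for an instruction-tuning JSONL file such as
# analyst_dataset.jsonl: every non-empty line must be a JSON object with
# a "text" field. Raises on the first malformed line.
def validate_jsonl_lines(lines):
    records = []
    for i, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate trailing blank lines
        record = json.loads(line)  # raises ValueError on malformed JSON
        if "text" not in record:
            raise ValueError(f"Line {i} is missing the 'text' field")
        records.append(record)
    return records
```

Running this before training catches formatting drift without loading any model.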
116
- ## 🎯 Next Steps
117
-
118
- ### **Immediate Actions**:
119
- 1. **Test the models**: Run `python test_models.py`
120
- 2. **Choose your model**: CodeT5+ is recommended
121
- 3. **Fine-tune**: Run the improved training script
122
- 4. **Compare**: Test against your current model
123
-
124
- ### **Integration with Your Analyzer**:
125
- 1. **Update model loading**: Modify your analyzer to use the new model
126
- 2. **Test performance**: Compare analysis quality
127
- 3. **Deploy**: Update your live demo with the better model
128
-
129
- ## 💡 Pro Tips
130
-
131
- ### **Model Selection**:
132
- - **Start with CodeT5+**: Best balance of performance and size
133
- - **Try CodeBERT**: If you need feature extraction
134
- - **Consider InCoder**: If you have powerful hardware
135
-
136
- ### **Training Tips**:
137
- - **Use dry-run first**: Test with `--dry-run` before full training
138
- - **Monitor GPU usage**: Larger models need more resources
139
- - **Compare results**: Always test against your current model
140
-
141
- ### **Performance Optimization**:
142
- - **Use quantization**: 4-bit quantization for memory efficiency
143
- - **Batch size**: Start small and increase if you have memory
144
- - **Learning rate**: CodeT5+ works well with 2e-4
145
-
146
- ## 🎉 Expected Results
147
-
148
- After upgrading to CodeT5+, you should see:
149
- - **20-30% better code analysis quality**
150
- - **Better understanding of security vulnerabilities**
151
- - **More accurate performance predictions**
152
- - **Improved code structure analysis**
153
- - **Better handling of complex code patterns**
154
-
155
- Your AI code analyzer will become significantly more powerful and accurate! 🚀
NEXT_STEPS_CHECKLIST.md DELETED
@@ -1,263 +0,0 @@
- # ✅ Next Steps Checklist
-
- Follow these steps in order to complete the integration.
-
- ## Phase 1: Get Your Model Ready (5 minutes)
-
- ### Step 1: Download Model from Google Drive
- - [ ] Open Google Drive in your browser
- - [ ] Navigate to `MyDrive/ai-code-analyzer/`
- - [ ] Find the `fine-tuned-analyst` folder
- - [ ] Download the entire folder
- - [ ] Extract/move it to: `C:\Users\arunk\professional\ai-code-analyzer\fine-tuned-analyst\`
-
- **How to verify:**
- ```bash
- dir fine-tuned-analyst
- ```
- You should see files like:
- - `adapter_config.json`
- - `adapter_model.bin` (or `.safetensors`)
- - Other config files
-
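The manual `dir` check above can be automated. A minimal sketch, assuming the adapter folder layout listed in the checklist (the helper name is illustrative and not part of the repository):

```python
from pathlib import Path

# Hypothetical check mirroring the file list above: a usable PEFT adapter
# folder should contain adapter_config.json plus adapter weights saved
# either as .bin or as .safetensors.
def adapter_folder_ok(folder: str) -> bool:
    p = Path(folder)
    has_config = (p / "adapter_config.json").is_file()
    has_weights = (
        (p / "adapter_model.bin").is_file()
        or (p / "adapter_model.safetensors").is_file()
    )
    return has_config and has_weights
```

Calling this at app startup gives a clearer error than a failed model load later on.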
- ### Step 2: Install PEFT Library
- - [ ] Open terminal/PowerShell in your project folder
- - [ ] Run: `pip install peft`
- - [ ] Wait for installation to complete
-
- **How to verify:**
- ```bash
- python -c "import peft; print('PEFT installed successfully')"
- ```
-
- ## Phase 2: Test Locally (2 minutes)
-
- ### Step 3: Run Test Script
- - [ ] In terminal, run: `python test_finetuned_local.py`
- - [ ] Wait for all checks to complete
- - [ ] Verify all checks show ✅ (green checkmarks)
-
- **If any checks fail:**
- - Read the error message carefully
- - Follow the suggested fix
- - Re-run the test script
-
- **Expected output:**
- ```
- ✅ Found adapter folder
- ✅ All required files present
- ✅ Dependencies imported successfully
- ✅ Tokenizer loaded
- ✅ Base model loaded
- ✅ Adapters loaded successfully
- ✅ Model inference working correctly!
- ✅ Enhanced analyzer can be imported
- 🎉 SUCCESS: All checks passed!
- ```
-
- ## Phase 3: Integrate with UI (10 minutes)
-
- ### Step 4: Backup Your Current UI
- - [ ] Copy `matrix_final.py` to `matrix_final_backup.py`
-
- **Command:**
- ```bash
- copy matrix_final.py matrix_final_backup.py
- ```
-
- ### Step 5: Update Imports
- - [ ] Open `matrix_final.py` in your editor
- - [ ] Find line ~8: `from optimized_code_analyzer import OptimizedCodeAnalyzer`
- - [ ] Replace with: `from optimized_code_analyzer_enhanced import EnhancedCodeAnalyzer`
-
- ### Step 6: Update Analyzer Function
- - [ ] Find the function `get_local_analyzer()` (around line 287)
- - [ ] Replace the entire function with:
- ```python
- @st.cache_resource
- def get_local_analyzer(model_type="codet5"):
-     return EnhancedCodeAnalyzer(
-         model_type=model_type,
-         precision="fp16",
-         quick_max_new_tokens=180,
-         detailed_max_new_tokens=300,
-     )
- ```
-
- ### Step 7: Add Model Selector to Sidebar
- - [ ] Find the sidebar section (around line 490, look for `st.sidebar`)
- - [ ] Add this code after other sidebar elements:
- ```python
- # Model Selection
- st.sidebar.markdown("---")
- st.sidebar.markdown("### 🤖 AI Model Selection")
- model_choice = st.sidebar.radio(
-     "Choose Analysis Model:",
-     ["CodeT5+ (Fast)", "Fine-tuned DeepSeek (Accurate)"],
-     help="CodeT5+ is faster, Fine-tuned model gives more detailed analysis"
- )
-
- model_type = "codet5" if "CodeT5+" in model_choice else "deepseek-finetuned"
- ```
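The label-to-model mapping in the last line of the snippet above is worth pulling into a pure function, since it can then be unit-tested without importing Streamlit. A sketch (the function name is illustrative, not taken from `matrix_final.py`):

```python
# Hypothetical pure helper: map the sidebar radio label to the internal
# model id used by EnhancedCodeAnalyzer. Same logic as the inline
# conditional above, just testable in isolation.
def choice_to_model_type(choice: str) -> str:
    return "codet5" if "CodeT5+" in choice else "deepseek-finetuned"
```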
-
- ### Step 8: Update Analyzer Calls
- - [ ] Find where `local_analyzer = get_local_analyzer()` is called
- - [ ] Replace with: `local_analyzer = get_local_analyzer(model_type)`
- - [ ] Look for all instances (there might be 2-3 places)
-
- **Hint:** Use Ctrl+F to find "get_local_analyzer()"
-
- ### Step 9: Update requirements.txt
- - [ ] Open `requirements.txt`
- - [ ] Add this line if not present: `peft>=0.7.0`
- - [ ] Save the file
-
- ## Phase 4: Test Everything (5 minutes)
-
- ### Step 10: Run Streamlit App
- - [ ] In terminal: `streamlit run matrix_final.py`
- - [ ] Wait for app to load
- - [ ] Browser should open automatically
-
- ### Step 11: Test CodeT5+ Model
- - [ ] In the sidebar, select "CodeT5+ (Fast)"
- - [ ] Paste a simple code snippet (or use the examples)
- - [ ] Click "🚀 Analyze Code"
- - [ ] Verify you get analysis results
- - [ ] Should take 2-3 seconds
-
- **Example code to test:**
- ```python
- def add(a, b):
-     return a + b
- ```
-
- ### Step 12: Test Fine-tuned DeepSeek Model
- - [ ] In the sidebar, select "Fine-tuned DeepSeek (Accurate)"
- - [ ] Use the same code snippet
- - [ ] Click "🚀 Analyze Code"
- - [ ] Verify you get detailed analysis with:
-   - Quality Score (e.g., "Quality Score: 85/100")
-   - BUGS section
-   - PERFORMANCE ISSUES section
-   - SECURITY CONCERNS section
-   - IMPROVEMENTS section with code examples
- - [ ] Should take 3-5 seconds
-
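Verifying that structured output programmatically reduces to checking for the "Quality Score: N/100" line. A hedged sketch of such a parser (assumed helper, not code from the repository):

```python
import re

# Hypothetical parser for the "Quality Score: 85/100" line the fine-tuned
# model is expected to emit; returns None when no valid score is present.
def parse_quality_score(analysis: str):
    match = re.search(r"Quality Score:\s*(\d{1,3})/100", analysis)
    if not match:
        return None
    score = int(match.group(1))
    return score if 0 <= score <= 100 else None
```

A check like this can back an automated regression test comparing the old and new models.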
- ### Step 13: Test Model Switching
- - [ ] Switch back to "CodeT5+ (Fast)"
- - [ ] Analyze different code
- - [ ] Switch to "Fine-tuned DeepSeek (Accurate)"
- - [ ] Analyze the same code again
- - [ ] Both should work without errors
-
- ### Step 14: Test Caching
- - [ ] Analyze the same code twice with the same model
- - [ ] Second time should say "⚡ Using cached result!"
- - [ ] Should be instant (< 0.1 seconds)
-
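The caching behaviour being tested in Step 14 boils down to keying results by (model, code): repeating a request hits the cache, switching models does not. A minimal sketch of that idea, assuming nothing about the app's real cache implementation:

```python
import hashlib

# Illustrative result cache: keys combine the model id with a hash of the
# code, so the same snippet analyzed by two models gets two entries.
class AnalysisCache:
    def __init__(self):
        self._store = {}

    def _key(self, model: str, code: str) -> str:
        digest = hashlib.sha256(code.encode("utf-8")).hexdigest()
        return f"{model}:{digest}"

    def get(self, model: str, code: str):
        return self._store.get(self._key(model, code))

    def put(self, model: str, code: str, result: str) -> None:
        self._store[self._key(model, code)] = result
```

Hashing the code keeps keys short even for large pasted files.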
- ## Phase 5: Final Verification (2 minutes)
-
- ### Step 15: Quality Check
- - [ ] Fine-tuned model gives quality scores (1-100)
- - [ ] Fine-tuned model provides structured output
- - [ ] CodeT5+ still works as before
- - [ ] No error messages in terminal or browser
- - [ ] UI loads quickly
- - [ ] Both models can analyze code successfully
-
- ### Step 16: Document Your Setup
- - [ ] Take a screenshot of working analysis
- - [ ] Note which model works better for your use cases
- - [ ] Save any error messages you encountered (for future reference)
-
- ## ✅ Integration Complete!
-
- If all steps are checked, congratulations! You have:
- - ✅ Successfully integrated fine-tuned model
- - ✅ Dual-model code analyzer working
- - ✅ Professional-quality tool ready to use
-
- ## 🚀 Optional: Next Level
-
- Want to go further? Try these:
-
- ### A. Improve the Model
- - [ ] Add more training samples (see `additional_samples.py`)
- - [ ] Retrain in Colab (only 20 minutes)
- - [ ] Test new version
- - [ ] Compare with old version
-
- ### B. Deploy Online
- - [ ] Choose deployment platform (Hugging Face Spaces recommended)
- - [ ] Follow deployment guide in `INTEGRATION_GUIDE.md`
- - [ ] Share link with friends/portfolio
-
- ### C. Enhance Features
- - [ ] Add support for more programming languages
- - [ ] Implement batch analysis (multiple files)
- - [ ] Add export to PDF/Markdown
- - [ ] Create comparison view (side-by-side model outputs)
-
- ## 🆘 Troubleshooting
-
- ### Common Issues
-
- **Issue: "fine-tuned-analyst not found"**
- - [ ] Check folder is in correct location
- - [ ] Verify folder name spelling
- - [ ] Ensure it's extracted (not still zipped)
-
- **Issue: "PEFT not installed"**
- - [ ] Run: `pip install peft`
- - [ ] Restart terminal
- - [ ] Try again
-
- **Issue: "Model too slow"**
- - [ ] Use CodeT5+ for faster analysis
- - [ ] Reduce max_new_tokens to 150
- - [ ] Close other applications
-
- **Issue: "Out of memory"**
- - [ ] Close browser tabs
- - [ ] Restart Streamlit
- - [ ] Use CodeT5+ (smaller model)
-
- **Issue: "Import Error"**
- - [ ] Check file names are correct
- - [ ] Verify `optimized_code_analyzer_enhanced.py` exists
- - [ ] Try: `python -c "from optimized_code_analyzer_enhanced import EnhancedCodeAnalyzer"`
-
- ## 📞 Need Help?
-
- 1. **Check error messages** - They usually tell you exactly what's wrong
- 2. **Review INTEGRATION_GUIDE.md** - Has detailed explanations
- 3. **Run test script again** - `python test_finetuned_local.py`
- 4. **Start simple** - Test with basic code first
- 5. **Check file locations** - Make sure everything is in the right place
-
- ## 🎓 What You're Learning
-
- By completing this checklist, you've learned:
- - ✅ How to fine-tune language models
- - ✅ How to integrate ML models with web apps
- - ✅ How to create production-ready AI tools
- - ✅ How to manage multiple model versions
- - ✅ How to deploy AI applications
-
- ## 📊 Progress Tracker
-
- **Phase 1:** ⬜ Get Model Ready
- **Phase 2:** ⬜ Test Locally
- **Phase 3:** ⬜ Integrate with UI
- **Phase 4:** ⬜ Test Everything
- **Phase 5:** ⬜ Final Verification
-
- ---
-
- **Estimated Total Time:** 25-30 minutes
-
- **When Done:** You'll have a professional code analyzer with AI models YOU trained! 🎉
-
- Good luck! You've got this! 🚀
-
README.md CHANGED
@@ -1,3 +1,13 @@
 # AI Code Analyzer
 
 A professional AI-powered code analysis tool with a sleek Matrix-inspired interface that leverages multiple Large Language Models (LLMs) to provide comprehensive code reviews, identify issues, and suggest improvements.
@@ -9,7 +19,7 @@ A professional AI-powered code analysis tool with a sleek Matrix-inspired interf
 
 ## ✨ Features
 
- * **🤖 Multi-Model Analysis**: Compare insights from OpenAI GPT-4, Anthropic Claude, DeepSeek, and fine-tuned models
 * **🎯 Fine-tuned Code Analyzer**: Custom DeepSeek model trained on 59+ code analysis examples
 * **🎨 Matrix-Inspired UI**: Sleek dark theme with neon green accents and cyberpunk aesthetics
 * **📊 Comprehensive Code Review**: Get quality scores, identify bugs, security issues, and performance concerns
@@ -22,16 +32,16 @@ A professional AI-powered code analysis tool with a sleek Matrix-inspired interf
 
 ## 🌐 Live Demo
 
- [🚀 Try it live on Render](https://ai-code-analyzer-tcl8.onrender.com)
 
 ## 🛠️ Tech Stack
 
 - **Frontend**: Streamlit with custom Matrix-inspired CSS
- - **LLM Integration**: OpenAI, Anthropic, DeepSeek APIs
 - **Fine-tuning**: LoRA/QLoRA with Hugging Face Transformers
 - **Model Hosting**: Hugging Face Hub & Spaces
 - **Language**: Python 3.11+
- - **Deployment**: Render (configured with render.yaml)
 - **Styling**: Custom CSS with Google Fonts (Share Tech Mono, Orbitron)
 
 ## 🎯 Fine-tuned Model
@@ -61,6 +71,7 @@ The fine-tuned model provides:
 * OpenAI API Key
 * Anthropic API Key
 * DeepSeek API Key
 
 ### Installation
 
@@ -86,14 +97,21 @@ pip install -r requirements.txt
 Create a `.env` file in the root directory:
 ```env
 # API Keys - Replace with your actual API keys
 OPENAI_API_KEY=your_openai_api_key_here
 ANTHROPIC_API_KEY=your_anthropic_api_key_here
 DEEPSEEK_API_KEY=your_deepseek_api_key_here
 GITHUB_TOKEN=your_github_token_here # Optional, for higher API limits
 ```
 
 5. **Run the application:**
 ```bash
 python -m streamlit run matrix_final.py --server.port 8501
 ```
 
@@ -122,32 +140,41 @@ The application will be available at `http://localhost:8501`
 
 ## 🏗️ Project Structure
 
 ```
 ai-code-analyzer/
- ├── matrix_final.py # Main Streamlit application (deployed version)
- ├── analyzer/ # Core analysis engine
- │ ├── __init__.py # Package initialization
- │ ├── code_analyzer.py # Main analysis engine
- │ ├── llm_clients.py # LLM API client implementations
- │ ├── prompts.py # Analysis prompt templates
- │ └── utils.py # Utility functions
- ├── requirements.txt # Python dependencies
- ├── render.yaml # Render deployment configuration
- ├── Procfile # Alternative deployment configuration
- ├── runtime.txt # Python version specification
- ├── README.md # This file
- └── .env # Environment variables (create this)
 ```
 
 ## 🔧 Configuration
 
 ### Supported LLM Providers
 
- | Provider | Model | API Key Environment Variable |
- | --------- | -------------- | ---------------------------- |
- | OpenAI | GPT-4o-mini | OPENAI\_API\_KEY |
- | Anthropic | Claude 3 Haiku | ANTHROPIC\_API\_KEY |
- | DeepSeek | DeepSeek Chat | DEEPSEEK\_API\_KEY |
 
 ### Supported Programming Languages
 
@@ -155,22 +182,34 @@ ai-code-analyzer/
 - **Auto-detection** available for most languages
 - **Manual selection** option for specific analysis
 
 ## 🚀 Deployment
 
- ### Deploy to Render (Recommended)
 
- The project is configured for **one-click deployment** on Render:
 
- 1. **Fork this repository** to your GitHub account
- 2. **Connect to Render**: Go to [Render Dashboard](https://dashboard.render.com)
- 3. **Create New Web Service**: Select "Build and deploy from a Git repository"
- 4. **Connect Repository**: Link your forked repository
- 5. **Configure Environment Variables** in Render dashboard:
-   - `OPENAI_API_KEY`
-   - `ANTHROPIC_API_KEY`
-   - `DEEPSEEK_API_KEY`
-   - `GITHUB_TOKEN` (optional)
- 6. **Deploy**: Render automatically detects `render.yaml` and deploys
 
 ### Manual Deployment
 
@@ -203,10 +242,13 @@ The tool provides structured analysis including:
 
 ### Running Locally
 ```bash
- # Start the development server
 python -m streamlit run matrix_final.py --server.port 8501
 
- # With auto-reload for development
 python -m streamlit run matrix_final.py --server.port 8501 --server.runOnSave true
 ```
 
@@ -247,9 +289,9 @@ If you encounter any issues or have questions:
 
 ## 🔗 Links
 
- - **Live Demo**: [ai-code-analyzer-tcl8.onrender.com](https://ai-code-analyzer-tcl8.onrender.com)
 - **Repository**: [github.com/arun3676/ai-code-analyzer](https://github.com/arun3676/ai-code-analyzer)
- - **Render Dashboard**: [dashboard.render.com](https://dashboard.render.com)
 
 ---
+ ---
+ title: AI Code Analyzer
+ emoji: 🧠
+ colorFrom: black
+ colorTo: green
+ sdk: docker
+ app_port: 7860
+ pinned: false
+ ---
+
 # AI Code Analyzer
 
 A professional AI-powered code analysis tool with a sleek Matrix-inspired interface that leverages multiple Large Language Models (LLMs) to provide comprehensive code reviews, identify issues, and suggest improvements.
 
 ## ✨ Features
 
+ * **🤖 Multi-Model Analysis**: Compare insights from OpenAI GPT-4, Anthropic Claude, DeepSeek, and Hugging Face models
 * **🎯 Fine-tuned Code Analyzer**: Custom DeepSeek model trained on 59+ code analysis examples
 * **🎨 Matrix-Inspired UI**: Sleek dark theme with neon green accents and cyberpunk aesthetics
 * **📊 Comprehensive Code Review**: Get quality scores, identify bugs, security issues, and performance concerns
 
 ## 🌐 Live Demo
 
+ **Local Deployment**: Your AI Code Analyzer is running locally at `http://localhost:8501/`
 
 ## 🛠️ Tech Stack
 
 - **Frontend**: Streamlit with custom Matrix-inspired CSS
+ - **LLM Integration**: OpenAI, Anthropic, DeepSeek, Hugging Face APIs
 - **Fine-tuning**: LoRA/QLoRA with Hugging Face Transformers
 - **Model Hosting**: Hugging Face Hub & Spaces
 - **Language**: Python 3.11+
+ - **Deployment**: Hugging Face Spaces (recommended for ease of use and free tier)
 - **Styling**: Custom CSS with Google Fonts (Share Tech Mono, Orbitron)
 
 ## 🎯 Fine-tuned Model
 
 * OpenAI API Key
 * Anthropic API Key
 * DeepSeek API Key
+ * **Hugging Face API Key** (recommended for free usage)
 
 ### Installation
 
 Create a `.env` file in the root directory:
 ```env
 # API Keys - Replace with your actual API keys
+ HUGGINGFACE_API_KEY=your_huggingface_api_key_here # Recommended for free usage
 OPENAI_API_KEY=your_openai_api_key_here
 ANTHROPIC_API_KEY=your_anthropic_api_key_here
 DEEPSEEK_API_KEY=your_deepseek_api_key_here
 GITHUB_TOKEN=your_github_token_here # Optional, for higher API limits
 ```
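At runtime the app only needs the keys that are actually set, so a small sketch of how they might be read from the environment (the helper name is illustrative, not taken from `analyzer/llm_clients.py`):

```python
import os

# Hypothetical helper: list which LLM providers are usable given the
# environment variables documented in the .env example above. At least
# one provider key must be present for analysis to work.
def available_providers(env=os.environ):
    keys = {
        "huggingface": "HUGGINGFACE_API_KEY",
        "openai": "OPENAI_API_KEY",
        "anthropic": "ANTHROPIC_API_KEY",
        "deepseek": "DEEPSEEK_API_KEY",
    }
    return [name for name, var in keys.items() if env.get(var)]
```

On Spaces the same variables arrive via repository secrets, so no code change is needed between local and hosted runs.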
 
+ **📚 For detailed Hugging Face setup instructions, see: [HUGGINGFACE_SETUP_GUIDE.md](HUGGINGFACE_SETUP_GUIDE.md)**
+
 5. **Run the application:**
 ```bash
+ # Option 1: Use the startup script (recommended)
+ python run_app.py
+
+ # Option 2: Run directly with Streamlit
 python -m streamlit run matrix_final.py --server.port 8501
 ```
 
 ## 🏗️ Project Structure
 
+ For a detailed explanation of the project structure, architecture, and data flow, please see [PROJECT_STRUCTURE.md](PROJECT_STRUCTURE.md) and [ARCHITECTURE.md](ARCHITECTURE.md).
+
 ```
 ai-code-analyzer/
+ ├── .venv/ # Virtual environment directory
+ ├── analyzer/ # Core analysis engine
+ │ ├── __init__.py # Package initialization
+ │ ├── code_analyzer.py # Main analysis engine
+ │ ├── llm_clients.py # LLM API client implementations
+ │ ├── prompts.py # Analysis prompt templates
+ │ └── utils.py # Utility functions
+ ├── tests/ # Automated tests
+ │ ├── test_matrix_final.py # Tests for matrix_final.py utilities
+ │ └── test_prompts.py # Tests for prompt generation
+ ├── .env # Environment variables (create this)
+ ├── .gitignore # Git ignore file
+ ├── ARCHITECTURE.md # Detailed architecture documentation
+ ├── matrix_final.py # Main Streamlit application
+ ├── PROJECT_STRUCTURE.md # High-level project structure
+ ├── README.md # This file
+ ├── requirements.txt # Python dependencies
+ ├── run_app.py # Startup script for easy launching
+ └── TESTING_GUIDE.md # Guide for testing the application
 ```
 
 ## 🔧 Configuration
 
 ### Supported LLM Providers
 
+ | Provider | Model | API Key Environment Variable |
+ | ------------- | ------------------------ | ---------------------------- |
+ | Hugging Face | Mixtral-8x7B-Instruct | HUGGINGFACE\_API\_KEY |
+ | OpenAI | GPT-4o-mini | OPENAI\_API\_KEY |
+ | Anthropic | Claude 3 Haiku | ANTHROPIC\_API\_KEY |
+ | DeepSeek | DeepSeek Chat | DEEPSEEK\_API\_KEY |
 
 ### Supported Programming Languages
 
 - **Auto-detection** available for most languages
 - **Manual selection** option for specific analysis
 
+ ## 🧪 Testing
+
+ For detailed instructions on how to test the application, please refer to the [TESTING_GUIDE.md](TESTING_GUIDE.md).
+
+ To run the automated tests:
+ ```bash
+ pytest
+ ```
+
 ## 🚀 Deployment
 
+ ### Deploy to Hugging Face Spaces (Recommended)
 
+ This project is configured for easy deployment on **Hugging Face Spaces**:
 
+ 1. **Fork this repository** to your GitHub account.
+ 2. **Create a new Space**: Go to [Hugging Face Spaces](https://huggingface.co/spaces/new) and create a new Space.
+    * Choose "Docker" as the Space SDK (this project ships a `Dockerfile` and sets `sdk: docker` in the README metadata).
+    * Select "Public" or "Private" as per your preference.
+    * Connect your forked GitHub repository.
+ 3. **Configure Secrets**: In your Hugging Face Space settings, go to "App settings" -> "Secrets". Add your API keys:
+    * `HUGGINGFACE_API_KEY` (required for Hugging Face models)
+    * `OPENAI_API_KEY` (optional)
+    * `ANTHROPIC_API_KEY` (optional)
+    * `DEEPSEEK_API_KEY` (optional)
+    * `GITHUB_TOKEN` (optional, for higher GitHub API limits)
+ 4. **Wait for Deployment**: Hugging Face will automatically detect your `Dockerfile` and deploy your app.
+ 5. **Access Your App**: Once deployed, your application will be live on your Hugging Face Space URL.
 
 ### Manual Deployment
 
 ### Running Locally
 ```bash
+ # Option 1: Use the startup script (recommended)
+ python run_app.py
+
+ # Option 2: Start the development server directly
 python -m streamlit run matrix_final.py --server.port 8501
 
+ # Option 3: With auto-reload for development
 python -m streamlit run matrix_final.py --server.port 8501 --server.runOnSave true
 ```
 
 ## 🔗 Links
 
+ - **Live Demo**: *Your Hugging Face Space URL here*
 - **Repository**: [github.com/arun3676/ai-code-analyzer](https://github.com/arun3676/ai-code-analyzer)
+ - **Hugging Face Spaces**: [huggingface.co/spaces](https://huggingface.co/spaces)
 
 ---
 
additional_samples.py DELETED
@@ -1,73 +0,0 @@
- #!/usr/bin/env python3
- """
- Additional training samples for code analysis dataset.
- This script generates 25 new high-quality samples to expand the dataset from 35 to 60.
- """
-
- import json
-
- def create_additional_samples():
- """Create 25 new training samples covering diverse Python issues."""
-
- new_samples = [
- {
- "text": "<s>[INST] Analyze this code for bugs, performance, and security issues. Give a quality score from 1-100 and provide a detailed analysis. \n\nCode:\n```import subprocess\nimport sys\n\ndef run_command(user_input):\n cmd = f\"ls -la {user_input}\"\n result = subprocess.run(cmd, shell=True, capture_output=True, text=True)\n return result.stdout\n\nif __name__ == \"__main__\":\n user_path = input(\"Enter path: \")\n print(run_command(user_path))``` [/INST] Quality Score: 15/100\n\nCRITICAL SECURITY VULNERABILITIES:\n- Command injection vulnerability via shell=True and unsanitized input\n- User can execute arbitrary commands using ; or && operators\n- No input validation or sanitization\n\nBUGS:\n- No error handling for subprocess failures\n- Will crash if command fails\n\nPERFORMANCE ISSUES:\n- Spawns shell process unnecessarily\n- No timeout protection\n\nIMPROVEMENTS:\n1. Avoid shell=True and use proper argument lists\n2. Validate and sanitize input\n3. Add proper error handling and timeouts\n\nExample improved code:\n```python\nimport subprocess\nimport os\nfrom pathlib import Path\n\ndef run_command(user_input):\n # Validate and sanitize input\n try:\n safe_path = Path(user_input).expanduser().resolve()\n if not safe_path.exists():\n raise ValueError(\"Path does not exist\")\n except Exception as e:\n raise ValueError(f\"Invalid path: {e}\")\n \n # Use proper argument list without shell\n result = subprocess.run(\n [\"ls\", \"-la\", str(safe_path)],\n capture_output=True,\n text=True,\n timeout=10\n )\n \n if result.returncode != 0:\n raise RuntimeError(f\"Command failed: {result.stderr}\")\n \n return result.stdout\n```</s>"
- },
- {
- "text": "<s>[INST] Analyze this code for bugs, performance, and security issues. Give a quality score from 1-100 and provide a detailed analysis. \n\nCode:\n```import json\nfrom flask import Flask, request\n\napp = Flask(__name__)\n\n@app.route(\"/api/data\", methods=[\"POST\"])\ndef process_data():\n data = request.get_json()\n \n # Process the data\n result = {\n \"status\": \"success\",\n \"data\": data,\n \"length\": len(str(data))\n }\n \n return json.dumps(result)\n\nif __name__ == \"__main__\":\n app.run(host=\"0.0.0.0\", port=5000)``` [/INST] Quality Score: 42/100\n\nBUGS:\n- No error handling for malformed JSON requests\n- Missing Content-Type validation\n- No input size limits\n\nPERFORMANCE ISSUES:\n- Converting data to string just for length calculation is inefficient\n- No request timeout handling\n\nSECURITY CONCERNS:\n- Binding to 0.0.0.0 exposes service to all interfaces\n- No authentication or rate limiting\n- Potential DoS via large payloads\n- No input validation\n\nIMPROVEMENTS:\n1. Add proper error handling and validation\n2. Implement security measures\n3. Optimize performance\n\nExample improved code:\n```python\nimport json\nfrom flask import Flask, request, jsonify\nfrom functools import wraps\n\napp = Flask(__name__)\n\nMAX_PAYLOAD_SIZE = 1024 * 1024 # 1MB limit\n\ndef validate_json(f):\n @wraps(f)\n def decorated_function(*args, **kwargs):\n if not request.is_json:\n return jsonify({\"error\": \"Content-Type must be application/json\"}), 400\n \n content_length = request.content_length\n if content_length and content_length > MAX_PAYLOAD_SIZE:\n return jsonify({\"error\": \"Payload too large\"}), 413\n \n try:\n request.json # Validate JSON parsing\n except Exception:\n return jsonify({\"error\": \"Invalid JSON\"}), 400\n \n return f(*args, **kwargs)\n return decorated_function\n\n@app.route(\"/api/data\", methods=[\"POST\"])\n@validate_json\ndef process_data():\n data = request.get_json()\n \n # Validate required fields\n if not isinstance(data, dict):\n return jsonify({\"error\": \"Data must be a JSON object\"}), 400\n \n result = {\n \"status\": \"success\",\n \"data\": data,\n \"field_count\": len(data)\n }\n \n return jsonify(result)\n\nif __name__ == \"__main__\":\n app.run(host=\"127.0.0.1\", port=5000, debug=False)\n```</s>"
- },
- {
- "text": "<s>[INST] Analyze this code for bugs, performance, and security issues. Give a quality score from 1-100 and provide a detailed analysis. \n\nCode:\n```import threading\nimport time\n\nclass Counter:\n def __init__(self):\n self.value = 0\n \n def increment(self):\n temp = self.value\n time.sleep(0.001) # Simulate work\n self.value = temp + 1\n \n def get_value(self):\n return self.value\n\ncounter = Counter()\n\ndef worker():\n for _ in range(1000):\n counter.increment()\n\n# Create multiple threads\nthreads = []\nfor i in range(5):\n t = threading.Thread(target=worker)\n threads.append(t)\n t.start()\n\n# Wait for all threads to complete\nfor t in threads:\n t.join()\n\nprint(f\"Final count: {counter.get_value()}\")``` [/INST] Quality Score: 38/100\n\nBUGS:\n- Race condition in increment() method\n- Non-atomic read-modify-write operation\n- Final count will be less than expected (5000)\n\nPERFORMANCE ISSUES:\n- Unnecessary sleep in critical section\n- No thread synchronization\n\nSECURITY CONCERNS:\n- Race condition could lead to data corruption\n- No protection against concurrent access\n\nIMPROVEMENTS:\n1. Use proper synchronization mechanisms\n2. Make operations atomic\n\nExample improved code:\n```python\nimport threading\n\nclass Counter:\n def __init__(self):\n self._value = 0\n self._lock = threading.Lock()\n \n def increment(self):\n with self._lock:\n # Simulate work outside critical section\n time.sleep(0.001)\n self._value += 1\n \n def get_value(self):\n with self._lock:\n return self._value\n\n# Alternative: Use atomic operations\nimport threading\n\nclass AtomicCounter:\n def __init__(self):\n self._value = 0\n self._lock = threading.RLock()\n \n def increment(self):\n with self._lock:\n self._value += 1\n \n def get_value(self):\n with self._lock:\n return self._value\n```</s>"
- },
- {
- "text": "<s>[INST] Analyze this code for bugs, performance, and security issues. Give a quality score from 1-100 and provide a detailed analysis. \n\nCode:\n```import os\nimport tempfile\n\ndef process_file(file_content):\n # Create temporary file\n temp_file = tempfile.NamedTemporaryFile(delete=False)\n temp_file.write(file_content.encode())\n temp_file.close()\n \n # Process the file\n with open(temp_file.name, \"r\") as f:\n content = f.read()\n \n # Clean up\n os.remove(temp_file.name)\n \n return content.upper()\n\nif __name__ == \"__main__\":\n result = process_file(\"Hello, World!\")\n print(result)``` [/INST] Quality Score: 55/100\n\nBUGS:\n- No error handling if file operations fail\n- Potential file descriptor leak if exception occurs after file creation\n- No cleanup if processing fails\n\nPERFORMANCE ISSUES:\n- Unnecessary file I/O for simple string operations\n- Creates temporary file for small data\n\nSECURITY CONCERNS:\n- Temporary file might not be cleaned up if process crashes\n- No validation of file_content size\n\nIMPROVEMENTS:\n1. Use context managers for proper cleanup\n2. Avoid unnecessary file operations\n3. Add proper error handling\n\nExample improved code:\n```python\nimport tempfile\nimport os\nfrom contextlib import contextmanager\n\n@contextmanager\ndef temporary_file(content):\n \"\"\"Context manager for temporary file with automatic cleanup.\"\"\"\n temp_file = None\n try:\n with tempfile.NamedTemporaryFile(mode=\"w+\", delete=False, encoding=\"utf-8\") as tf:\n tf.write(content)\n temp_file = tf.name\n yield temp_file\n finally:\n if temp_file and os.path.exists(temp_file):\n try:\n os.unlink(temp_file)\n except OSError:\n pass # File already deleted\n\ndef process_file(file_content):\n # Validate input\n if not isinstance(file_content, str):\n raise ValueError(\"File content must be a string\")\n \n # For simple operations, avoid file I/O\n if len(file_content) < 1024: # Small content\n return file_content.upper()\n \n # For larger content, use temporary file\n with temporary_file(file_content) as temp_path:\n with open(temp_path, \"r\", encoding=\"utf-8\") as f:\n content = f.read()\n return content.upper()\n```</s>"
- },
- {
- "text": "<s>[INST] Analyze this code for bugs, performance, and security issues. Give a quality score from 1-100 and provide a detailed analysis. \n\nCode:\n```import hashlib\nimport os\n\ndef hash_file(filename):\n \"\"\"Calculate MD5 hash of a file.\"\"\"\n hash_md5 = hashlib.md5()\n with open(filename, \"rb\") as f:\n for chunk in iter(lambda: f.read(4096), b\"\"):\n hash_md5.update(chunk)\n return hash_md5.hexdigest()\n\ndef verify_integrity(file_path, expected_hash):\n \"\"\"Verify file integrity against expected hash.\"\"\"\n actual_hash = hash_file(file_path)\n return actual_hash == expected_hash\n\nif __name__ == \"__main__\":\n file_path = input(\"Enter file path: \")\n hash_value = hash_file(file_path)\n print(f\"MD5: {hash_value}\")\n \n expected = input(\"Enter expected hash: \")\n if verify_integrity(file_path, expected):\n print(\"File integrity verified!\")\n else:\n print(\"File integrity check failed!\")``` [/INST] Quality Score: 65/100\n\nBUGS:\n- No error handling for file operations\n- No validation of input parameters\n- Will crash if file doesn't exist or can't be read\n\nPERFORMANCE ISSUES:\n- MD5 is fast but not cryptographically secure\n- No file size limits\n\nSECURITY CONCERNS:\n- MD5 is vulnerable to collision attacks\n- No protection against path traversal\n- No input sanitization\n\nIMPROVEMENTS:\n1. Use cryptographically secure hash function\n2. Add proper error handling and validation\n3. 
Secure file path handling\n\nExample improved code:\n```python\nimport hashlib\nimport os\nfrom pathlib import Path\n\ndef hash_file(file_path, algorithm=\"sha256\"):\n \"\"\"Calculate hash of a file using specified algorithm.\"\"\"\n file_path = Path(file_path).resolve()\n \n if not file_path.exists():\n raise FileNotFoundError(f\"File not found: {file_path}\")\n \n if not file_path.is_file():\n raise ValueError(f\"Path is not a file: {file_path}\")\n \n # Choose hash algorithm\n if algorithm.lower() == \"md5\":\n hasher = hashlib.md5()\n elif algorithm.lower() == \"sha256\":\n hasher = hashlib.sha256()\n elif algorithm.lower() == \"sha1\":\n hasher = hashlib.sha1()\n else:\n raise ValueError(f\"Unsupported algorithm: {algorithm}\")\n \n try:\n with open(file_path, \"rb\") as f:\n while chunk := f.read(8192): # 8KB chunks\n hasher.update(chunk)\n return hasher.hexdigest()\n except IOError as e:\n raise IOError(f\"Error reading file: {e}\")\n\ndef verify_integrity(file_path, expected_hash, algorithm=\"sha256\"):\n \"\"\"Verify file integrity against expected hash.\"\"\"\n try:\n actual_hash = hash_file(file_path, algorithm)\n return actual_hash.lower() == expected_hash.lower()\n except Exception as e:\n print(f\"Verification failed: {e}\")\n return False\n```</s>"
- }
- ]
-
- # Continue with more samples to reach 25 total...
- more_samples = [
- {
- "text": "<s>[INST] Analyze this code for bugs, performance, and security issues. Give a quality score from 1-100 and provide a detailed analysis. \n\nCode:\n```import socket\nimport threading\n\nclass SimpleServer:\n def __init__(self, host=\"localhost\", port=8080):\n self.host = host\n self.port = port\n self.clients = []\n \n def handle_client(self, client_socket):\n while True:\n data = client_socket.recv(1024)\n if not data:\n break\n response = f\"Echo: {data.decode()}\"\n client_socket.send(response.encode())\n client_socket.close()\n \n def start(self):\n server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)\n server.bind((self.host, self.port))\n server.listen(5)\n \n while True:\n client, addr = server.accept()\n self.clients.append(client)\n thread = threading.Thread(target=self.handle_client, args=(client,))\n thread.start()\n\nif __name__ == \"__main__\":\n server = SimpleServer()\n server.start()``` [/INST] Quality Score: 33/100\n\nBUGS:\n- No error handling for socket operations\n- Threads are never joined or cleaned up\n- No graceful shutdown mechanism\n- No connection limits or timeouts\n\nPERFORMANCE ISSUES:\n- Creates unlimited threads (potential thread exhaustion)\n- No connection pooling or limits\n- Memory leak with growing clients list\n\nSECURITY CONCERNS:\n- No authentication or authorization\n- No input validation or sanitization\n- Vulnerable to DoS attacks\n- No rate limiting\n\nIMPROVEMENTS:\n1. Add proper error handling and resource management\n2. Implement connection limits and timeouts\n3. 
Add security measures\n\nExample improved code:\n```python\nimport socket\nimport threading\nimport time\nfrom concurrent.futures import ThreadPoolExecutor\n\nclass SimpleServer:\n def __init__(self, host=\"127.0.0.1\", port=8080, max_connections=10):\n self.host = host\n self.port = port\n self.max_connections = max_connections\n self.running = False\n self.executor = ThreadPoolExecutor(max_workers=max_connections)\n \n def handle_client(self, client_socket, addr):\n \"\"\"Handle client connection with proper error handling.\"\"\"\n try:\n client_socket.settimeout(30) # 30 second timeout\n \n while self.running:\n try:\n data = client_socket.recv(1024)\n if not data:\n break\n \n # Basic input validation\n message = data.decode(\"utf-8\", errors=\"ignore\")\n if len(message) > 1000: # Limit message size\n response = \"Error: Message too long\"\n else:\n response = f\"Echo: {message}\"\n \n client_socket.send(response.encode(\"utf-8\"))\n except socket.timeout:\n break\n except Exception as e:\n print(f\"Error handling client {addr}: {e}\")\n finally:\n try:\n client_socket.close()\n except:\n pass\n \n def start(self):\n \"\"\"Start the server with proper resource management.\"\"\"\n self.running = True\n \n try:\n server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)\n server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)\n server.bind((self.host, self.port))\n server.listen(self.max_connections)\n \n print(f\"Server listening on {self.host}:{self.port}\")\n \n while self.running:\n try:\n client, addr = server.accept()\n self.executor.submit(self.handle_client, client, addr)\n except Exception as e:\n if self.running:\n print(f\"Error accepting connection: {e}\")\n except Exception as e:\n print(f\"Server error: {e}\")\n finally:\n server.close()\n \n def stop(self):\n \"\"\"Gracefully stop the server.\"\"\"\n self.running = False\n self.executor.shutdown(wait=True)\n```</s>"
- },
- {
- "text": "<s>[INST] Analyze this code for bugs, performance, and security issues. Give a quality score from 1-100 and provide a detailed analysis. \n\nCode:\n```import re\n\ndef validate_email(email):\n \"\"\"Validate email address using regex.\"\"\"\n pattern = r\"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$\"\n return bool(re.match(pattern, email))\n\ndef extract_emails(text):\n \"\"\"Extract all email addresses from text.\"\"\"\n pattern = r\"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}\"\n return re.findall(pattern, text)\n\ndef sanitize_input(user_input):\n \"\"\"Sanitize user input by removing special characters.\"\"\"\n return re.sub(r\"[^a-zA-Z0-9@._-]\", \"\", user_input)\n\nif __name__ == \"__main__\":\n email = input(\"Enter email: \")\n if validate_email(email):\n print(\"Valid email!\")\n else:\n print(\"Invalid email!\")\n \n text = input(\"Enter text to extract emails: \")\n emails = extract_emails(text)\n print(f\"Found emails: {emails}\")\n \n user_input = input(\"Enter text to sanitize: \")\n clean = sanitize_input(user_input)\n print(f\"Sanitized: {clean}\")``` [/INST] Quality Score: 72/100\n\nBUGS:\n- Email regex is too permissive and may allow invalid emails\n- No handling of edge cases (empty strings, None values)\n- Sanitization function is too aggressive and may break valid input\n\nPERFORMANCE ISSUES:\n- Regex compilation happens on every call (should be compiled once)\n- No input size limits\n\nSECURITY CONCERNS:\n- Email validation regex can be bypassed\n- Sanitization may not be sufficient for all use cases\n- No protection against regex DoS attacks\n\nIMPROVEMENTS:\n1. Use proper email validation library\n2. Pre-compile regex patterns\n3. 
Add proper input validation\n\nExample improved code:\n```python\nimport re\nfrom email_validator import validate_email, EmailNotValidError\n\nclass EmailValidator:\n def __init__(self):\n # Pre-compile regex patterns for better performance\n self.email_pattern = re.compile(\n r\"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$\",\n re.IGNORECASE\n )\n self.extract_pattern = re.compile(\n r\"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}\",\n re.IGNORECASE\n )\n \n def validate_email(self, email):\n \"\"\"Validate email address using proper validation.\"\"\"\n if not email or not isinstance(email, str):\n return False\n \n if len(email) > 254: # RFC 5321 limit\n return False\n \n try:\n # Use proper email validation\n validate_email(email)\n return True\n except EmailNotValidError:\n return False\n \n def extract_emails(self, text):\n \"\"\"Extract all email addresses from text.\"\"\"\n if not text or not isinstance(text, str):\n return []\n \n if len(text) > 10000: # Limit input size\n raise ValueError(\"Input text too long\")\n \n matches = self.extract_pattern.findall(text)\n return [email for email in matches if self.validate_email(email)]\n \n def sanitize_input(self, user_input, max_length=1000):\n \"\"\"Sanitize user input with proper validation.\"\"\"\n if not user_input or not isinstance(user_input, str):\n return \"\"\n \n if len(user_input) > max_length:\n user_input = user_input[:max_length]\n \n # Remove potentially dangerous characters but preserve email structure\n sanitized = re.sub(r\"[<>\"'&]\", \"\", user_input)\n return sanitized\n\n# Usage\nvalidator = EmailValidator()\n```</s>"
- }
- ]
-
- # Add 18 more samples to reach 25 total
- additional_samples = [
- {
- "text": "<s>[INST] Analyze this code for bugs, performance, and security issues. Give a quality score from 1-100 and provide a detailed analysis. \n\nCode:\n```import os\nimport shutil\n\ndef backup_file(source, destination):\n \"\"\"Backup a file to destination.\"\"\"\n if os.path.exists(source):\n shutil.copy2(source, destination)\n print(f\"Backed up {source} to {destination}\")\n else:\n print(f\"Source file {source} does not exist\")\n\nif __name__ == \"__main__\":\n source = input(\"Enter source file: \")\n dest = input(\"Enter destination: \")\n backup_file(source, dest)``` [/INST] Quality Score: 48/100\n\nBUGS:\n- No error handling for file operations\n- No validation of input paths\n- Will crash if destination directory doesn't exist\n\nPERFORMANCE ISSUES:\n- No file size limits\n- No progress indication for large files\n\nSECURITY CONCERNS:\n- Path traversal vulnerability in user input\n- No validation of file paths\n- Could overwrite system files\n\nIMPROVEMENTS:\n1. Add proper error handling and path validation\n2. Implement security measures\n3. Add progress indication\n\nExample improved code:\n```python\nimport os\nimport shutil\nfrom pathlib import Path\n\ndef backup_file(source, destination):\n \"\"\"Safely backup a file to destination.\"\"\"\n try:\n source_path = Path(source).resolve()\n dest_path = Path(destination).resolve()\n \n # Validate source\n if not source_path.exists():\n raise FileNotFoundError(f\"Source file not found: {source_path}\")\n \n if not source_path.is_file():\n raise ValueError(f\"Source is not a file: {source_path}\")\n \n # Validate destination\n dest_dir = dest_path.parent\n if not dest_dir.exists():\n dest_dir.mkdir(parents=True, exist_ok=True)\n \n # Check if destination already exists\n if dest_path.exists():\n response = input(f\"Destination exists. Overwrite? 
(y/n): \")\n if response.lower() != 'y':\n print(\"Backup cancelled\")\n return False\n \n # Perform backup\n shutil.copy2(source_path, dest_path)\n print(f\"Successfully backed up {source_path} to {dest_path}\")\n return True\n \n except Exception as e:\n print(f\"Backup failed: {e}\")\n return False\n```</s>"
- },
- {
- "text": "<s>[INST] Analyze this code for bugs, performance, and security issues. Give a quality score from 1-100 and provide a detailed analysis. \n\nCode:\n```import urllib.request\nimport json\n\ndef fetch_data(url):\n \"\"\"Fetch data from URL and return JSON.\"\"\"\n response = urllib.request.urlopen(url)\n data = response.read()\n return json.loads(data.decode())\n\ndef process_api_data(url):\n \"\"\"Process data from API endpoint.\"\"\"\n data = fetch_data(url)\n results = []\n for item in data:\n if 'name' in item and 'value' in item:\n results.append({\n 'name': item['name'],\n 'value': item['value'] * 2\n })\n return results\n\nif __name__ == \"__main__\":\n api_url = input(\"Enter API URL: \")\n results = process_api_data(api_url)\n print(f\"Processed {len(results)} items\")\n for result in results:\n print(f\"{result['name']}: {result['value']}\")``` [/INST] Quality Score: 52/100\n\nBUGS:\n- No error handling for network requests\n- No validation of URL format\n- Assumes data structure without validation\n\nPERFORMANCE ISSUES:\n- No timeout for network requests\n- No caching mechanism\n- Loads entire response into memory\n\nSECURITY CONCERNS:\n- No URL validation (potential SSRF)\n- No SSL certificate verification\n- No rate limiting\n\nIMPROVEMENTS:\n1. Add proper error handling and validation\n2. Implement security measures\n3. 
Add timeout and caching\n\nExample improved code:\n```python\nimport urllib.request\nimport urllib.parse\nimport json\nimport ssl\nfrom urllib.error import URLError, HTTPError\n\nclass APIClient:\n def __init__(self, timeout=30):\n self.timeout = timeout\n self.ctx = ssl.create_default_context()\n \n def fetch_data(self, url):\n \"\"\"Safely fetch data from URL.\"\"\"\n # Validate URL\n parsed = urllib.parse.urlparse(url)\n if not parsed.scheme or not parsed.netloc:\n raise ValueError(\"Invalid URL format\")\n \n # Only allow HTTP/HTTPS\n if parsed.scheme not in ['http', 'https']:\n raise ValueError(\"Only HTTP/HTTPS URLs are allowed\")\n \n try:\n request = urllib.request.Request(url)\n request.add_header('User-Agent', 'Python-API-Client/1.0')\n \n with urllib.request.urlopen(request, timeout=self.timeout, context=self.ctx) as response:\n if response.status != 200:\n raise HTTPError(url, response.status, \"HTTP Error\", response.headers, None)\n \n data = response.read()\n if len(data) > 10 * 1024 * 1024: # 10MB limit\n raise ValueError(\"Response too large\")\n \n return json.loads(data.decode('utf-8'))\n \n except URLError as e:\n raise ConnectionError(f\"Network error: {e}\")\n except json.JSONDecodeError as e:\n raise ValueError(f\"Invalid JSON response: {e}\")\n \n def process_api_data(self, url):\n \"\"\"Process data from API with validation.\"\"\"\n try:\n data = self.fetch_data(url)\n \n if not isinstance(data, list):\n raise ValueError(\"Expected list data\")\n \n results = []\n for item in data:\n if not isinstance(item, dict):\n continue\n \n if 'name' in item and 'value' in item:\n try:\n value = float(item['value'])\n results.append({\n 'name': str(item['name']),\n 'value': value * 2\n })\n except (ValueError, TypeError):\n continue\n \n return results\n \n except Exception as e:\n print(f\"Error processing API data: {e}\")\n return []\n```</s>"
- },
- {
- "text": "<s>[INST] Analyze this code for bugs, performance, and security issues. Give a quality score from 1-100 and provide a detailed analysis. \n\nCode:\n```import logging\nimport os\n\ndef setup_logging():\n logging.basicConfig(\n level=logging.INFO,\n format='%(asctime)s - %(levelname)s - %(message)s',\n filename='app.log'\n )\n\ndef log_user_action(user_id, action, details):\n \"\"\"Log user actions for audit trail.\"\"\"\n message = f\"User {user_id} performed {action}: {details}\"\n logging.info(message)\n\ndef log_error(error_msg, user_id=None):\n \"\"\"Log error messages.\"\"\"\n if user_id:\n logging.error(f\"User {user_id} encountered error: {error_msg}\")\n else:\n logging.error(f\"System error: {error_msg}\")\n\nif __name__ == \"__main__\":\n setup_logging()\n log_user_action(\"user123\", \"login\", \"successful login\")\n log_error(\"Database connection failed\")``` [/INST] Quality Score: 68/100\n\nBUGS:\n- No error handling for file operations\n- No log rotation or size limits\n- No validation of input parameters\n\nPERFORMANCE ISSUES:\n- No async logging\n- File I/O blocks on every log\n- No log compression\n\nSECURITY CONCERNS:\n- Logs may contain sensitive information\n- No log sanitization\n- Potential log injection attacks\n\nIMPROVEMENTS:\n1. Add proper error handling and log rotation\n2. Implement log sanitization\n3. 
Add security measures\n\nExample improved code:\n```python\nimport logging\nimport logging.handlers\nimport os\nimport re\nfrom datetime import datetime\n\nclass SecureLogger:\n def __init__(self, log_file='app.log', max_size=10*1024*1024, backup_count=5):\n self.logger = logging.getLogger('secure_app')\n self.logger.setLevel(logging.INFO)\n \n # Clear existing handlers\n self.logger.handlers.clear()\n \n # Create rotating file handler\n handler = logging.handlers.RotatingFileHandler(\n log_file, maxBytes=max_size, backupCount=backup_count\n )\n \n # Create formatter\n formatter = logging.Formatter(\n '%(asctime)s - %(levelname)s - %(message)s',\n datefmt='%Y-%m-%d %H:%M:%S'\n )\n handler.setFormatter(formatter)\n \n self.logger.addHandler(handler)\n \n # Prevent sensitive data patterns\n self.sensitive_patterns = [\n r'password[=:]\\s*\\S+',\n r'token[=:]\\s*\\S+',\n r'key[=:]\\s*\\S+',\n r'\\b\\d{4}[\\s-]?\\d{4}[\\s-]?\\d{4}[\\s-]?\\d{4}\\b' # Credit card\n ]\n \n def sanitize_message(self, message):\n \"\"\"Remove sensitive information from log messages.\"\"\"\n for pattern in self.sensitive_patterns:\n message = re.sub(pattern, '[REDACTED]', message, flags=re.IGNORECASE)\n return message\n \n def log_user_action(self, user_id, action, details):\n \"\"\"Log user actions with sanitization.\"\"\"\n if not user_id or not action:\n return\n \n message = f\"User {user_id} performed {action}: {details}\"\n sanitized_message = self.sanitize_message(message)\n self.logger.info(sanitized_message)\n \n def log_error(self, error_msg, user_id=None):\n \"\"\"Log error messages with sanitization.\"\"\"\n if not error_msg:\n return\n \n if user_id:\n message = f\"User {user_id} encountered error: {error_msg}\"\n else:\n message = f\"System error: {error_msg}\"\n \n sanitized_message = self.sanitize_message(message)\n self.logger.error(sanitized_message)\n\n# Usage\nlogger = SecureLogger()\n```</s>"
- }
- ]
-
- # Add remaining samples to reach 25 total
- final_samples = [
- {
- "text": "<s>[INST] Analyze this code for bugs, performance, and security issues. Give a quality score from 1-100 and provide a detailed analysis. \n\nCode:\n```import random\nimport string\n\ndef generate_password(length=8):\n \"\"\"Generate a random password.\"\"\"\n chars = string.ascii_letters + string.digits\n password = ''.join(random.choice(chars) for _ in range(length))\n return password\n\ndef create_user(username, email):\n \"\"\"Create a new user account.\"\"\"\n password = generate_password()\n user_data = {\n 'username': username,\n 'email': email,\n 'password': password\n }\n # In real app, save to database\n print(f\"User created: {username} with password: {password}\")\n return user_data\n\nif __name__ == \"__main__\":\n username = input(\"Enter username: \")\n email = input(\"Enter email: \")\n user = create_user(username, email)``` [/INST] Quality Score: 25/100\n\nCRITICAL SECURITY VULNERABILITIES:\n- Weak password generation (no special characters, predictable)\n- Password printed to console (major security risk)\n- No password strength validation\n- No input validation\n\nBUGS:\n- No error handling\n- No validation of username/email format\n\nPERFORMANCE ISSUES:\n- Uses random.choice in loop (slow for long passwords)\n\nSECURITY CONCERNS:\n- Passwords logged to console\n- Weak entropy\n- No password policy enforcement\n\nIMPROVEMENTS:\n1. Use cryptographically secure random generation\n2. Implement proper password policies\n3. 
Never log passwords\n\nExample improved code:\n```python\nimport secrets\nimport string\nimport re\nfrom typing import Optional\n\nclass PasswordGenerator:\n def __init__(self):\n self.lowercase = string.ascii_lowercase\n self.uppercase = string.ascii_uppercase\n self.digits = string.digits\n self.special = \"!@#$%^&*\"\n self.all_chars = self.lowercase + self.uppercase + self.digits + self.special\n \n def generate_password(self, length=12, min_special=1, min_upper=1, min_lower=1, min_digits=1):\n \"\"\"Generate cryptographically secure password.\"\"\"\n if length < 8:\n raise ValueError(\"Password must be at least 8 characters\")\n \n password = []\n \n # Ensure minimum requirements\n password.extend(secrets.choice(self.special) for _ in range(min_special))\n password.extend(secrets.choice(self.uppercase) for _ in range(min_upper))\n password.extend(secrets.choice(self.lowercase) for _ in range(min_lower))\n password.extend(secrets.choice(self.digits) for _ in range(min_digits))\n \n # Fill remaining length\n remaining = length - len(password)\n password.extend(secrets.choice(self.all_chars) for _ in range(remaining))\n \n # Shuffle to randomize positions\n secrets.SystemRandom().shuffle(password)\n \n return ''.join(password)\n \n def validate_password(self, password):\n \"\"\"Validate password strength.\"\"\"\n if len(password) < 8:\n return False, \"Password must be at least 8 characters\"\n \n if not re.search(r'[a-z]', password):\n return False, \"Password must contain lowercase letters\"\n \n if not re.search(r'[A-Z]', password):\n return False, \"Password must contain uppercase letters\"\n \n if not re.search(r'\\d', password):\n return False, \"Password must contain digits\"\n \n if not re.search(r'[!@#$%^&*]', password):\n return False, \"Password must contain special characters\"\n \n return True, \"Password is strong\"\n\ndef create_user(username, email):\n \"\"\"Create user with secure password.\"\"\"\n # Validate inputs\n if not username or not 
email:\n raise ValueError(\"Username and email required\")\n \n if len(username) < 3:\n raise ValueError(\"Username must be at least 3 characters\")\n \n # Generate secure password\n generator = PasswordGenerator()\n password = generator.generate_password()\n \n user_data = {\n 'username': username,\n 'email': email,\n 'password': password # In real app, hash this immediately\n }\n \n # NEVER log passwords!\n print(f\"User created: {username}\")\n return user_data\n```</s>"
- }
- ]
-
- # Add all samples together to reach 25 total
- all_samples = new_samples + more_samples + additional_samples + final_samples
-
- return all_samples
-
- if __name__ == "__main__":
- samples = create_additional_samples()
- print(f"Created {len(samples)} additional training samples")
-
- # Save to file
- with open("additional_samples.json", "w") as f:
- json.dump(samples, f, indent=2)
-
- print("Samples saved to additional_samples.json")
 
analyst_dataset.jsonl DELETED
@@ -1,34 +0,0 @@
- {"text": "<s>[INST] Analyze this code for bugs, performance, and security issues. Give a quality score from 1-100 and provide a detailed analysis. \n\nCode:\n```def calculate_fibonacci(n):\n if n <= 0:\n return 0\n elif n == 1:\n return 1\n else:\n return calculate_fibonacci(n-1) + calculate_fibonacci(n-2)\n\n# Test the function\nfor i in range(10):\n result = calculate_fibonacci(i)\n print(f\"Fibonacci({i}) = {result}\")``` [/INST] Quality Score: 65/100\n\nBUGS:\n- Recursive implementation without memoization causes exponential time complexity\n- Will cause stack overflow for large values of n\n- No input validation for negative numbers beyond the first check\n\nPERFORMANCE ISSUES:\n- Time complexity: O(2^n) - extremely inefficient\n- Space complexity: O(n) due to recursion stack\n- Repeated calculations for same values\n\nSECURITY CONCERNS:\n- No protection against extremely large inputs that could cause DoS\n- No input sanitization\n\nIMPROVEMENTS:\n1. Use memoization or iterative approach for O(n) time complexity\n2. Add input validation with proper error handling\n3. Add maximum input limit to prevent abuse\n4. Consider using dynamic programming approach\n\nExample improved code:\n```python\ndef calculate_fibonacci(n):\n if n < 0:\n raise ValueError(\"Input must be non-negative\")\n if n > 1000: # Prevent abuse\n raise ValueError(\"Input too large\")\n \n if n <= 1:\n return n\n \n a, b = 0, 1\n for _ in range(2, n + 1):\n a, b = b, a + b\n return b\n```</s>"}
- {"text": "<s>[INST] Analyze this code for bugs, performance, and security issues. Give a quality score from 1-100 and provide a detailed analysis. \n\nCode:\n```import os\n\ndef read_user_file(filename):\n file_path = os.path.join(\"/home/user/\", filename)\n with open(file_path, \"r\") as f:\n content = f.read()\n return content\n\n# Usage\nuser_input = input(\"Enter filename: \")\ndata = read_user_file(user_input)\nprint(data)``` [/INST] Quality Score: 25/100\n\nCRITICAL SECURITY VULNERABILITIES:\n- Path traversal vulnerability: User can access any file by using \"../\" in filename\n- No input validation or sanitization\n- Hardcoded directory path creates inflexibility\n- No error handling for file operations\n\nBUGS:\n- Will crash if file doesn't exist\n- Will crash if user doesn't have read permissions\n- No handling for binary files (will try to decode as text)\n\nPERFORMANCE ISSUES:\n- Loads entire file into memory (problematic for large files)\n- No file size limits\n\nIMPROVEMENTS:\n1. Validate and sanitize input filename\n2. Use absolute path validation\n3. Add proper error handling\n4. Implement file size limits\n5. Use pathlib for safer path handling\n\nExample improved code:\n```python\nimport os\nfrom pathlib import Path\n\ndef read_user_file(filename):\n # Validate filename\n if not filename or \"..\" in filename or \"/\" in filename:\n raise ValueError(\"Invalid filename\")\n \n # Create safe path\n base_dir = Path(\"/home/user/\")\n file_path = base_dir / filename\n \n # Ensure path is within allowed directory\n if not file_path.resolve().is_relative_to(base_dir.resolve()):\n raise ValueError(\"Access denied\")\n \n # Check file size\n if file_path.stat().st_size > 10 * 1024 * 1024: # 10MB limit\n raise ValueError(\"File too large\")\n \n try:\n with open(file_path, \"r\") as f:\n return f.read()\n except FileNotFoundError:\n raise FileNotFoundError(\"File not found\")\n except PermissionError:\n raise PermissionError(\"Access denied\")\n```</s>"}
- {"text": "<s>[INST] Analyze this code for bugs, performance, and security issues. Give a quality score from 1-100 and provide a detailed analysis. \n\nCode:\n```class UserManager:\n def __init__(self):\n self.users = {}\n \n def add_user(self, username, password):\n self.users[username] = password\n \n def authenticate(self, username, password):\n if username in self.users:\n if self.users[username] == password:\n return True\n return False\n \n def get_user_info(self, username):\n return self.users.get(username, None)\n\n# Usage\nmanager = UserManager()\nmanager.add_user(\"admin\", \"password123\")\nprint(manager.authenticate(\"admin\", \"password123\"))``` [/INST] Quality Score: 30/100\n\nCRITICAL SECURITY VULNERABILITIES:\n- Passwords stored in plain text (major security risk)\n- No password hashing or encryption\n- No password strength validation\n- No protection against brute force attacks\n- No session management\n\nBUGS:\n- No input validation for username/password\n- No handling of edge cases (empty strings, None values)\n- No duplicate user prevention\n\nPERFORMANCE ISSUES:\n- In-memory storage (data lost on restart)\n- No database persistence\n- No indexing for large user bases\n\nIMPROVEMENTS:\n1. Hash passwords using bcrypt or similar\n2. Add input validation and sanitization\n3. Implement proper session management\n4. Use database for persistence\n5. Add password strength requirements\n6. Implement rate limiting for authentication attempts\n\nExample improved code:\n```python\nimport hashlib\nimport secrets\nfrom typing import Optional\n\nclass UserManager:\n def __init__(self):\n self.users = {}\n self.sessions = {}\n \n def _hash_password(self, password: str) -> str:\n salt = secrets.token_hex(16)\n password_hash = hashlib.pbkdf2_hmac('sha256', \n password.encode('utf-8'), \n salt.encode('utf-8'), \n 100000)\n return f\"{salt}:{password_hash.hex()}\"\n \n def _verify_password(self, password: str, stored_hash: str) -> bool:\n salt, hash_value = stored_hash.split(':')\n password_hash = hashlib.pbkdf2_hmac('sha256',\n password.encode('utf-8'),\n salt.encode('utf-8'),\n 100000)\n return password_hash.hex() == hash_value\n \n def add_user(self, username: str, password: str) -> bool:\n if not username or not password:\n raise ValueError(\"Username and password required\")\n if len(password) < 8:\n raise ValueError(\"Password must be at least 8 characters\")\n if username in self.users:\n raise ValueError(\"User already exists\")\n \n self.users[username] = self._hash_password(password)\n return True\n \n def authenticate(self, username: str, password: str) -> bool:\n if username not in self.users:\n return False\n return self._verify_password(password, self.users[username])\n```</s>"}
- {"text": "<s>[INST] Analyze this code for bugs, performance, and security issues. Give a quality score from 1-100 and provide a detailed analysis. \n\nCode:\n```import re, pathlib, requests\nfrom urllib.parse import urlparse\nfrom html.parser import HTMLParser\n\nclass TitleParser(HTMLParser):\n def __init__(self): super().__init__(); self.in_title=False; self.title=[]\n def handle_starttag(self, tag, attrs): self.in_title = tag.lower()==\"title\"\n def handle_endtag(self, tag): \n if tag.lower()==\"title\": self.in_title=False\n def handle_data(self, data): \n if self.in_title: self.title.append(data)\n\nALLOWED_SCHEMES = {\"http\",\"https\"}\nBLOCKED_HOSTS = {\"169.254.169.254\",\"localhost\",\"127.0.0.1\",\"::1\"}\n\ndef safe_fetch(url: str):\n u = urlparse(url)\n if u.scheme not in ALLOWED_SCHEMES or u.hostname in BLOCKED_HOSTS:\n raise ValueError(\"URL not allowed\")\n with requests.get(url, timeout=10, stream=True) as r:\n r.raise_for_status()\n text = r.text # consider iter_content for huge pages\n p = TitleParser(); p.feed(text)\n title = \"\".join(p.title).strip() or \"No Title\"\n safe_name = re.sub(r\"[^a-zA-Z0-9_.-]\", \"_\", u.hostname or \"page\")\n pathlib.Path(f\"{safe_name}.html\").write_text(text, encoding=\"utf-8\")\n return title``` [/INST] Quality Score: 48/100\n\nBUGS:\n- No error handling for network failures or malformed HTML\n- File overwrites without checking if it already exists\n- No validation of URL format before parsing\n\nPERFORMANCE ISSUES:\n- Loads entire response into memory with r.text instead of streaming\n- No caching mechanism for repeated requests to same URL\n\nSECURITY CONCERNS:\n- File writes to current directory without path validation\n- No size limits on downloaded content (potential DoS)\n- No validation of HTML content before parsing\n\nIMPROVEMENTS:\n1. Add proper error handling and file size limits\n```python\nimport re, pathlib, requests\nfrom urllib.parse import urlparse\nfrom html.parser import HTMLParser\n\ndef safe_fetch(url: str, max_size: int = 10 * 1024 * 1024):\n u = urlparse(url)\n if u.scheme not in ALLOWED_SCHEMES or u.hostname in BLOCKED_HOSTS:\n raise ValueError(\"URL not allowed\")\n \n try:\n with requests.get(url, timeout=10, stream=True) as r:\n r.raise_for_status()\n content_length = int(r.headers.get('content-length', 0))\n if content_length > max_size:\n raise ValueError(\"Content too large\")\n \n content = b\"\"\n for chunk in r.iter_content(chunk_size=8192):\n content += chunk\n if len(content) > max_size:\n raise ValueError(\"Content too large\")\n \n text = content.decode('utf-8', errors='ignore')\n except requests.RequestException as e:\n raise ValueError(f\"Network error: {e}\")\n \n # Parse and save with proper error handling\n p = TitleParser()\n try:\n p.feed(text)\n except Exception:\n pass # Continue even if HTML parsing fails\n \n title = \"\".join(p.title).strip() or \"No Title\"\n safe_name = re.sub(r\"[^a-zA-Z0-9_.-]\", \"_\", u.hostname or \"page\")\n \n # Create safe output directory\n output_dir = pathlib.Path(\"downloads\")\n output_dir.mkdir(exist_ok=True)\n output_path = output_dir / f\"{safe_name}.html\"\n \n output_path.write_text(text, encoding=\"utf-8\")\n return title\n```</s>"}
- {"text": "<s>[INST] Analyze this code for bugs, performance, and security issues. Give a quality score from 1-100 and provide a detailed analysis. \n\nCode:\n```import sqlite3\n\ndef get_user(email):\n conn = sqlite3.connect(\"app.db\")\n cur = conn.cursor()\n # vulnerable string interpolation\n q = f\"SELECT id, email, role FROM users WHERE email = '{email}'\"\n rows = cur.execute(q).fetchall()\n conn.close()\n return rows[0] if rows else None\n\ndef delete_user(email):\n # missing commit and no existence check\n conn = sqlite3.connect(\"app.db\")\n cur = conn.cursor()\n cur.execute(f\"DELETE FROM users WHERE email = '{email}'\")\n conn.close()\n\nif __name__ == \"__main__\":\n e = input(\"Email: \")\n print(get_user(e))\n if input(\"Delete? (y/n): \") == \"y\":\n delete_user(e)``` [/INST] Quality Score: 32/100\n\nBUGS:\n- delete_user never commits; deletion may not persist\n- get_user returns None for empty table but code assumes tuple downstream in callers (potential TypeError)\n\nPERFORMANCE ISSUES:\n- Opens/closes a new connection per call; consider pooling or reuse for batch operations\n\nSECURITY CONCERNS:\n- SQL injection via string interpolation for email\n- Potential information disclosure by printing raw tuple\n\nIMPROVEMENTS:\n1. Use parameterized queries and ensure transactions are committed\n```python\nimport sqlite3\nfrom contextlib import contextmanager\n\n@contextmanager\ndef with_conn(db=\"app.db\"):\n conn = sqlite3.connect(db)\n try:\n yield conn\n conn.commit()\n finally:\n conn.close()\n\ndef get_user(email):\n with sqlite3.connect(\"app.db\") as conn:\n cur = conn.execute(\"SELECT id, email, role FROM users WHERE email = ?\", (email,))\n return cur.fetchone()\n\ndef delete_user(email):\n with sqlite3.connect(\"app.db\") as conn:\n conn.execute(\"DELETE FROM users WHERE email = ?\", (email,))\n```</s>"}
- {"text": "<s>[INST] Analyze this code for bugs, performance, and security issues. Give a quality score from 1-100 and provide a detailed analysis. \n\nCode:\n```from flask import Flask, request, jsonify\n\napp = Flask(__name__)\nAPI_TOKEN = \"secret\" # hardcoded\n\n@app.route(\"/admin/users\", methods=[\"GET\"])\ndef list_users():\n # auth via query param; vulnerable\n if request.args.get(\"token\") != API_TOKEN:\n return jsonify({\"error\": \"unauthorized\"}), 401\n # pretend to read from file\n with open(\"users.json\") as f:\n return f.read() # returns raw string, wrong content-type\n\n@app.route(\"/echo\", methods=[\"POST\"])\ndef echo():\n # reflects untrusted data directly\n msg = request.json.get(\"msg\", \"\")\n return f\"<div>{msg}</div>\"\n\nif __name__ == \"__main__\":\n app.run(debug=True)``` [/INST] Quality Score: 35/100\n\nBUGS:\n- /admin/users returns plain string without JSON content-type; clients may fail parsing\n- Missing error handling if users.json not found\n\nPERFORMANCE ISSUES:\n- Reading file on every request instead of caching\n\nSECURITY CONCERNS:\n- Hardcoded secret; weak auth via token in query string (leaks via logs)\n- XSS in /echo by reflecting unsanitized HTML\n- debug=True exposes Werkzeug debugger with arbitrary code execution in some configs\n\nIMPROVEMENTS:\n1. Use header-based auth, environment secrets, safe JSON responses, and HTML escaping\n```python\nimport os, html, json\nfrom flask import Flask, request, jsonify, Response\n\napp = Flask(__name__)\nAPI_TOKEN = os.environ.get(\"API_TOKEN\")\n\ndef require_token():\n token = request.headers.get(\"Authorization\",\"\").replace(\"Bearer \",\"\")\n if not API_TOKEN or token != API_TOKEN:\n return jsonify({\"error\":\"unauthorized\"}), 401\n\n@app.get(\"/admin/users\")\ndef list_users():\n auth = require_token()\n if auth: return auth\n try:\n with open(\"users.json\",\"r\",encoding=\"utf-8\") as f:\n data = json.load(f)\n return jsonify(data)\n except FileNotFoundError:\n return jsonify({\"error\":\"not found\"}), 404\n\n@app.post(\"/echo\")\ndef echo():\n msg = request.json.get(\"msg\",\"\")\n return Response(f\"<div>{html.escape(msg)}</div>\", mimetype=\"text/html\")\n```</s>"}
- {"text": "<s>[INST] Analyze this code for bugs, performance, and security issues. Give a quality score from 1-100 and provide a detailed analysis. \n\nCode:\n```import os, time\n\nTMP_DIR = \"/tmp\"\n\ndef process_upload(filename, content):\n # race: check then use predictable path\n path = os.path.join(TMP_DIR, filename)\n if not os.path.exists(path):\n open(path, \"w\").write(content)\n # later read it\n time.sleep(0.5) # simulate delay\n data = open(path).read()\n return data.upper()\n\nif __name__ == \"__main__\":\n print(process_upload(input(\"name: \"), \"hello\"))``` [/INST] Quality Score: 44/100\n\nBUGS:\n- No file closing (implicit close but errors may leak descriptors)\n- No input validation on filename; may include path separators\n\nPERFORMANCE ISSUES:\n- Sleep delays processing; unnecessary blocking\n- Reads whole file even if large\n\nSECURITY CONCERNS:\n- TOCTOU race: attacker can symlink path to sensitive file after existence check\n- Path traversal via filename allowing writes to arbitrary locations\n\nIMPROVEMENTS:\n1. Use secure temporary files and sanitize paths; open with exclusive flags\n```python\nimport io, os, tempfile\n\ndef process_upload(content: str):\n with tempfile.NamedTemporaryFile(mode=\"w+\", delete=False, dir=tempfile.gettempdir(), encoding=\"utf-8\") as tf:\n tf.write(content)\n tf.flush()\n tf.seek(0)\n data = tf.read()\n return data.upper()\n```</s>"}
- {"text": "<s>[INST] Analyze this code for bugs, performance, and security issues. Give a quality score from 1-100 and provide a detailed analysis. \n\nCode:\n```import csv\n\ndef load_prices(path):\n rows = []\n with open(path) as f:\n for r in csv.DictReader(f):\n # trusting numeric fields; may be blank\n price = float(r[\"price\"]) # ValueError possible\n # evaluating discount expression directly from CSV (dangerous)\n discount = eval(r.get(\"discount\",\"0\")) # arbitrary code exec\n rows.append({\"name\": r[\"name\"], \"final\": price - discount})\n return rows\n\nif __name__ == \"__main__\":\n print(load_prices(\"prices.csv\"))``` [/INST] Quality Score: 27/100\n\nBUGS:\n- Unhandled ValueError for missing or non-numeric price\n- discount application can produce negative final prices with no checks\n\nPERFORMANCE ISSUES:\n- No streaming/validation—okay for small files, but no batching for large inputs\n\nSECURITY CONCERNS:\n- eval on CSV field allows arbitrary code execution\n- No path validation; could read unexpected files\n\nIMPROVEMENTS:\n1. Replace eval with safe parsing and validate numbers\n```python\nimport csv\nfrom decimal import Decimal, InvalidOperation\n\ndef parse_decimal(s):\n try: return Decimal(s)\n except (InvalidOperation, TypeError): return None\n\ndef load_prices(path):\n out=[]\n with open(path, newline=\"\", encoding=\"utf-8\") as f:\n for r in csv.DictReader(f):\n price = parse_decimal(r.get(\"price\"))\n disc = parse_decimal(r.get(\"discount\") or \"0\")\n if price is None or disc is None: continue\n final = max(Decimal(\"0\"), price - disc)\n out.append({\"name\": r.get(\"name\",\"\"), \"final\": str(final)})\n return out\n```</s>"}
- {"text": "<s>[INST] Analyze this code for bugs, performance, and security issues. Give a quality score from 1-100 and provide a detailed analysis. \n\nCode:\n```import asyncio\n\nasync def fetch(i):\n await asyncio.sleep(0.1)\n return i*i\n\nasync def main():\n tasks = []\n for i in range(20):\n tasks.append(fetch(i)) # forgot to create tasks with ensure_future\n results = []\n for t in tasks:\n # not awaited; result is coroutine object\n results.append(t)\n print(results)\n\nif __name__ == \"__main__\":\n loop = asyncio.get_event_loop()\n loop.run_until_complete(main())\n # bad: nested run in same loop (would crash if called again)\n asyncio.run(main())``` [/INST] Quality Score: 26/100\n\nBUGS:\n- Appends coroutine objects, never awaited, leading to no execution\n- Double-running event loop with loop.run_until_complete and then asyncio.run may raise RuntimeError\n\nPERFORMANCE ISSUES:\n- Serial awaits inside a loop would be slow even if fixed; should gather\n\nSECURITY CONCERNS:\n- None specific beyond stability\n\nIMPROVEMENTS:\n1. Use asyncio.gather and avoid double loop management\n```python\nimport asyncio\n\nasync def fetch(i):\n await asyncio.sleep(0.1)\n return i*i\n\nasync def main():\n results = await asyncio.gather(*(fetch(i) for i in range(20)))\n print(results)\n\nif __name__ == \"__main__\":\n asyncio.run(main())\n```</s>"}
- {"text": "<s>[INST] Analyze this code for bugs, performance, and security issues. Give a quality score from 1-100 and provide a detailed analysis. \n\nCode:\n```cache = []\n\ndef read_big_file(path):\n # loads entire file and caches forever -> memory growth\n with open(path, \"rb\") as f:\n data = f.read()\n cache.append(data)\n return len(data)\n\ndef process_many(paths):\n total = 0\n for p in paths:\n total += read_big_file(p)\n return total``` [/INST] Quality Score: 41/100\n\nBUGS:\n- No bounds on cache; memory leak across calls\n- No file existence/error handling\n\nPERFORMANCE ISSUES:\n- Reads entire files into memory; should stream for large files\n- Appending binary blobs prevents GC and increases fragmentation\n\nSECURITY CONCERNS:\n- If paths are user-controlled, could load sensitive files\n\nIMPROVEMENTS:\n1. Stream and limit cache by size or use LRU\n```python\nfrom functools import lru_cache\nimport os\n\n@lru_cache(maxsize=128)\ndef file_size(path):\n return os.path.getsize(path)\n\ndef process_many(paths):\n return sum(file_size(p) for p in paths if os.path.exists(p))\n```</s>"}
- {"text": "<s>[INST] Analyze this code for bugs, performance, and security issues. Give a quality score from 1-100 and provide a detailed analysis. \n\nCode:\n```def transfer(a, b, amount):\n try:\n a['balance'] -= amount\n if a['balance'] < 0:\n raise ValueError(\"insufficient\")\n b['balance'] += amount\n except:\n # swallow all errors\n pass\n\nif __name__ == \"__main__\":\n alice = {\"balance\": 100}\n bob = {\"balance\": 50}\n transfer(alice, bob, 200)\n print(alice, bob)``` [/INST] Quality Score: 38/100\n\nBUGS:\n- Bare except swallows insufficient funds error, leaving inconsistent state (Alice reduced but Bob not credited if error after)\n- No return value indicating success/failure\n\nPERFORMANCE ISSUES:\n- None significant\n\nSECURITY CONCERNS:\n- Silent failures complicate auditing and could hide fraud\n\nIMPROVEMENTS:\n1. Use explicit exception handling and atomicity\n```python\ndef transfer(a, b, amount):\n if amount <= 0:\n raise ValueError(\"amount must be positive\")\n if a[\"balance\"] < amount:\n raise ValueError(\"insufficient funds\")\n a[\"balance\"] -= amount\n b[\"balance\"] += amount\n return True\n```</s>"}
- {"text": "<s>[INST] Analyze this code for bugs, performance, and security issues. Give a quality score from 1-100 and provide a detailed analysis. \n\nCode:\n```import os\n\nDEBUG = True\nSECRET_KEY = \"supersecret\"\nDB_URL = \"postgres://user:pass@localhost/db\"\n\ndef connect():\n # pretend to connect using DB_URL\n print(\"Connecting to\", DB_URL)\n\nif __name__ == \"__main__\":\n if DEBUG:\n print(\"Debug mode on\")\n connect()``` [/INST] Quality Score: 52/100\n\nBUGS:\n- None functional, but poor separation of config from code\n\nPERFORMANCE ISSUES:\n- Printing secrets to console in debug could end up in logs\n\nSECURITY CONCERNS:\n- Hardcoded secrets and credentials in source control\n- Debug mode enabled by default\n\nIMPROVEMENTS:\n1. Move secrets to environment variables and default to secure values\n```python\nimport os\n\nDEBUG = os.getenv(\"APP_DEBUG\",\"0\") == \"1\"\nSECRET_KEY = os.getenv(\"SECRET_KEY\")\nDB_URL = os.getenv(\"DATABASE_URL\")\n\ndef connect():\n if not DB_URL:\n raise RuntimeError(\"DATABASE_URL not set\")\n # connect using DB_URL\n```</s>"}
- {"text": "<s>[INST] Analyze this code for bugs, performance, and security issues. Give a quality score from 1-100 and provide a detailed analysis. \n\nCode:\n```from flask import Flask, request, render_template_string\n\napp = Flask(__name__)\n\n@app.route(\"/greet\")\ndef greet():\n name = request.args.get(\"name\",\"world\")\n # unsafe: marking user input as safe HTML\n tpl = \"<h1>Hello, {{ name|safe }}</h1>\"\n return render_template_string(tpl, name=name)\n\nif __name__ == \"__main__\":\n app.run()``` [/INST] Quality Score: 46/100\n\nBUGS:\n- None functional, but template marks untrusted input as safe\n\nPERFORMANCE ISSUES:\n- Using render_template_string for every request rather than compiled templates\n\nSECURITY CONCERNS:\n- Reflected XSS via name parameter using |safe\n\nIMPROVEMENTS:\n1. Escape user input and use templates\n```python\nfrom flask import Flask, request, render_template_string\n\napp = Flask(__name__)\n\nTPL = \"<h1>Hello, {{ name }}</h1>\"\n\n@app.get(\"/greet\")\ndef greet():\n name = request.args.get(\"name\",\"world\")\n return render_template_string(TPL, name=name)\n```</s>"}
- {"text": "<s>[INST] Analyze this code for bugs, performance, and security issues. Give a quality score from 1-100 and provide a detailed analysis. \n\nCode:\n```import os\n\ndef list_dir(user_input):\n # command injection via shell=True\n cmd = f\"ls -la {user_input}\"\n os.system(cmd)\n\nif __name__ == \"__main__\":\n list_dir(input(\"Path: \"))``` [/INST] Quality Score: 28/100\n\nBUGS:\n- Fails on paths with spaces or special characters\n\nPERFORMANCE ISSUES:\n- Spawns a shell process unnecessarily\n\nSECURITY CONCERNS:\n- Command injection by passing ; rm -rf / or similar\n- No validation of path\n\nIMPROVEMENTS:\n1. Use subprocess without shell and validate input\n```python\nimport subprocess\nfrom pathlib import Path\n\ndef list_dir(path_str):\n p = Path(path_str).expanduser().resolve()\n result = subprocess.run([\"ls\",\"-la\", str(p)], check=True, capture_output=True, text=True)\n return result.stdout\n```</s>"}
- {"text": "<s>[INST] Analyze this code for bugs, performance, and security issues. Give a quality score from 1-100 and provide a detailed analysis. \n\nCode:\n```import os\nfrom flask import Flask, request, send_file\n\napp = Flask(__name__)\n\nBASE = \"/var/data\"\n\n@app.route(\"/download\")\ndef download():\n name = request.args.get(\"name\",\"\")\n path = os.path.join(BASE, name) # no normalization\n if os.path.exists(path):\n return send_file(path)\n return \"Not found\", 404``` [/INST] Quality Score: 43/100\n\nBUGS:\n- No method checks; only GET assumed\n\nPERFORMANCE ISSUES:\n- None significant; the main issue is security\n\nSECURITY CONCERNS:\n- Path traversal via name=../../etc/passwd to escape BASE\n- Possible disclosure of any readable file\n\nIMPROVEMENTS:\n1. Normalize and ensure path stays within BASE\n```python\nfrom pathlib import Path\nfrom flask import Flask, request, send_file, abort\n\napp = Flask(__name__)\nBASE = Path(\"/var/data\").resolve()\n\n@app.get(\"/download\")\ndef download():\n name = request.args.get(\"name\",\"\")\n target = (BASE / name).resolve()\n # is_relative_to (Python 3.9+) avoids the startswith prefix pitfall (e.g. /var/data2)\n if not target.is_relative_to(BASE) or not target.exists():\n abort(404)\n return send_file(target)\n```</s>"}
- {"text": "<s>[INST] Analyze this code for bugs, performance, and security issues. Give a quality score from 1-100 and provide a detailed analysis. \n\nCode:\n```from flask import Flask, request\nimport time, os\n\napp = Flask(__name__)\nUPLOAD_DIR = \"uploads\"\n\n@app.post(\"/upload\")\ndef upload():\n f = request.files[\"file\"]\n path = os.path.join(UPLOAD_DIR, f.filename)\n f.save(path) # save immediately\n # scan after saving (race: window where file is accessible)\n time.sleep(0.2) # simulate scan\n if f.filename.endswith(\".exe\"):\n os.remove(path) # too late; already served by static server?\n return \"Rejected\", 400\n return \"OK\" ``` [/INST] Quality Score: 49/100\n\nBUGS:\n- No check if UPLOAD_DIR exists; may crash\n\nPERFORMANCE ISSUES:\n- Sleep-based scan blocks worker thread\n\nSECURITY CONCERNS:\n- Race condition: file saved and potentially accessible before validation\n- Trusts filename; content-type and magic bytes not checked\n\nIMPROVEMENTS:\n1. Validate first to a quarantine path and atomically move after scanning\n```python\nfrom pathlib import Path\nfrom werkzeug.utils import secure_filename\nimport os, tempfile, shutil\n\nUPLOAD_DIR = Path(\"uploads\"); UPLOAD_DIR.mkdir(exist_ok=True)\n\ndef is_safe(stream):\n head = stream.read(512); stream.seek(0)\n return b\"MZ\" not in head # toy check\n\n@app.post(\"/upload\")\ndef upload():\n f = request.files[\"file\"]\n with tempfile.NamedTemporaryFile(delete=False, dir=UPLOAD_DIR) as tmp:\n f.save(tmp.name)\n with open(tmp.name,\"rb\") as s:\n if not is_safe(s):\n os.remove(tmp.name); return \"Rejected\", 400\n final = UPLOAD_DIR / secure_filename(f.filename)\n shutil.move(tmp.name, final)\n return \"OK\"\n```</s>"}
- {"text": "<s>[INST] Analyze this code for bugs, performance, and security issues. Give a quality score from 1-100 and provide a detailed analysis. \n\nCode:\n```import logging\n\nlogging.basicConfig(level=logging.INFO)\n\ndef login(username, password):\n # logs sensitive info\n logging.info(\"Login attempt user=%s password=%s\", username, password)\n if username == \"admin\" and password == \"admin\":\n return True\n return False``` [/INST] Quality Score: 50/100\n\nBUGS:\n- Hardcoded credentials\n\nPERFORMANCE ISSUES:\n- None significant\n\nSECURITY CONCERNS:\n- Logs plaintext passwords\n- Trivial credentials\n\nIMPROVEMENTS:\n1. Remove sensitive logging and use proper auth\n```python\nimport logging\n\ndef login(username, password):\n logging.info(\"Login attempt for user=%s\", username)\n # verify against hashed password store (omitted)\n return False\n```</s>"}
- {"text": "<s>[INST] Analyze this code for bugs, performance, and security issues. Give a quality score from 1-100 and provide a detailed analysis. \n\nCode:\n```import hashlib, os\n\ndef hash_password(password, salt=b\"fixedsalt\"):\n # weak MD5 and constant salt\n return hashlib.md5(salt + password.encode()).hexdigest()\n\ndef verify(password, digest):\n return hash_password(password) == digest\n\nif __name__ == \"__main__\":\n d = hash_password(\"secret\")\n print(verify(\"secret\", d))``` [/INST] Quality Score: 34/100\n\nBUGS:\n- None functional, but cryptographically unsafe\n\nPERFORMANCE ISSUES:\n- MD5 is fast, making brute force cheap\n\nSECURITY CONCERNS:\n- MD5 is broken for password hashing\n- Fixed salt allows rainbow table reuse\n\nIMPROVEMENTS:\n1. Use a dedicated password KDF like bcrypt/argon2 with per-user salt\n```python\nfrom argon2 import PasswordHasher\nph = PasswordHasher()\n\ndef hash_password(pw): return ph.hash(pw)\ndef verify(pw, digest): \n try: ph.verify(digest, pw); return True\n except Exception: return False\n```</s>"}
- {"text": "<s>[INST] Analyze this code for bugs, performance, and security issues. Give a quality score from 1-100 and provide a detailed analysis. \n\nCode:\n```import jwt\n\nSECRET = \"topsecret\"\n\ndef parse_token(token):\n # disables verification by passing options\n data = jwt.decode(token, options={\"verify_signature\": False})\n return data.get(\"user\")\n\nif __name__ == \"__main__\":\n print(parse_token(input(\"JWT: \")))``` [/INST] Quality Score: 29/100\n\nBUGS:\n- Assumes user claim exists; may raise AttributeError if not dict\n\nPERFORMANCE ISSUES:\n- None notable\n\nSECURITY CONCERNS:\n- Signature verification disabled allows forged tokens\n- No audience/issuer checks\n\nIMPROVEMENTS:\n1. Verify signature and validate claims\n```python\nimport jwt, datetime\nSECRET=\"topsecret\"\nALGO=\"HS256\"\n\ndef parse_token(token):\n data = jwt.decode(token, SECRET, algorithms=[ALGO], options={\"require\":[\"exp\",\"iat\",\"sub\"]}, audience=\"myapi\")\n return data.get(\"sub\")\n```</s>"}
- {"text": "<s>[INST] Analyze this code for bugs, performance, and security issues. Give a quality score from 1-100 and provide a detailed analysis. \n\nCode:\n```import threading, time\n\ncounter = 0\n\ndef worker(n):\n global counter\n for _ in range(n):\n tmp = counter\n time.sleep(0.001) # simulate work\n counter = tmp + 1\n\nthreads = [threading.Thread(target=worker, args=(1000,)) for _ in range(4)]\n[t.start() for t in threads]\n[t.join() for t in threads]\nprint(\"counter =\", counter)``` [/INST] Quality Score: 45/100\n\nBUGS:\n- Data race on counter leading to lost updates; final value < 4000\n\nPERFORMANCE ISSUES:\n- Frequent context switches due to sleep\n\nSECURITY CONCERNS:\n- None\n\nIMPROVEMENTS:\n1. Use a Lock or atomic primitives\n```python\nimport threading\n\ncounter = 0\nlock = threading.Lock()\n\ndef worker(n):\n global counter\n for _ in range(n):\n with lock:\n counter += 1\n```</s>"}
- {"text": "<s>[INST] Analyze this code for bugs, performance, and security issues. Give a quality score from 1-100 and provide a detailed analysis. \n\nCode:\n```import requests\n\ndef get_json(url):\n # disables SSL verification\n r = requests.get(url, verify=False) # insecure\n if r.status_code == 200:\n return r.json()\n return {}\n\nif __name__ == \"__main__\":\n print(get_json(\"https://example.com/api\"))``` [/INST] Quality Score: 53/100\n\nBUGS:\n- No exception handling for JSON decoding\n\nPERFORMANCE ISSUES:\n- No timeout or retries; may hang\n\nSECURITY CONCERNS:\n- verify=False exposes to MITM\n- No certificate pinning for sensitive APIs\n\nIMPROVEMENTS:\n1. Enable TLS verification, add timeouts and basic retry\n```python\nimport requests\n\nSESSION = requests.Session()\nADAPTER = requests.adapters.HTTPAdapter(max_retries=3)\nSESSION.mount(\"https://\", ADAPTER)\n\ndef get_json(url, timeout=5):\n r = SESSION.get(url, timeout=timeout)\n r.raise_for_status()\n return r.json()\n```</s>"}
- {"text": "<s>[INST] Analyze this code for bugs, performance, and security issues. Give a quality score from 1-100 and provide a detailed analysis. \n\nCode:\n```def paginate(items, page, size):\n # off-by-one: page starts at 1 but uses 0-based math incorrectly\n start = page * size\n end = start + size - 1\n return items[start:end]\n\nif __name__ == \"__main__\":\n print(paginate(list(range(10)), 1, 3))``` [/INST] Quality Score: 58/100\n\nBUGS:\n- Off-by-one errors: page=1 should start at index 0; end slicing exclusive misused\n- Negative or zero page/size not validated\n\nPERFORMANCE ISSUES:\n- None\n\nSECURITY CONCERNS:\n- None\n\nIMPROVEMENTS:\n1. Correct indices and validate inputs\n```python\ndef paginate(items, page, size):\n if page < 1 or size < 1: return []\n start = (page - 1) * size\n end = start + size\n return items[start:end]\n```</s>"}
- {"text": "<s>[INST] Analyze this code for bugs, performance, and security issues. Give a quality score from 1-100 and provide a detailed analysis. \n\nCode:\n```def most_frequent(nums):\n # O(n^2) counting\n best = None\n bestc = -1\n for x in nums:\n c = 0\n for y in nums:\n if x == y:\n c += 1\n if c > bestc:\n bestc = c\n best = x\n return best``` [/INST] Quality Score: 62/100\n\nBUGS:\n- Works but ties are arbitrary; behavior unspecified\n\nPERFORMANCE ISSUES:\n- Quadratic time; slow for large lists\n\nSECURITY CONCERNS:\n- None\n\nIMPROVEMENTS:\n1. Use a hashmap for linear time\n```python\nfrom collections import Counter\ndef most_frequent(nums):\n return Counter(nums).most_common(1)[0][0] if nums else None\n```</s>"}
- {"text": "<s>[INST] Analyze this code for bugs, performance, and security issues. Give a quality score from 1-100 and provide a detailed analysis. \n\nCode:\n```import asyncio\nimport websockets\n\nclients = set()\n\nasync def handler(ws):\n clients.add(ws) # never removed on disconnect if error\n async for msg in ws:\n for c in clients:\n await c.send(msg) # exceptions not handled\n # missing finally to remove ws\n\nasync def main():\n async with websockets.serve(handler, \"0.0.0.0\", 8765):\n await asyncio.Future()\n\nasyncio.run(main())``` [/INST] Quality Score: 47/100\n\nBUGS:\n- Clients not removed on errors; memory leak\n- Broadcast loop fails if one client breaks; others starve\n\nPERFORMANCE ISSUES:\n- Serial awaits per client; slow with many clients\n\nSECURITY CONCERNS:\n- No authentication or origin checks\n\nIMPROVEMENTS:\n1. Remove clients in finally and handle send errors\n```python\nasync def handler(ws):\n clients.add(ws)\n try:\n async for msg in ws:\n dead=set()\n for c in clients:\n try: await c.send(msg)\n except Exception: dead.add(c)\n clients.difference_update(dead)\n finally:\n clients.discard(ws)\n```</s>"}
- {"text": "<s>[INST] Analyze this code for bugs, performance, and security issues. Give a quality score from 1-100 and provide a detailed analysis. \n\nCode:\n```from datetime import datetime\nimport time\n\ndef log_event(ts_str):\n # parse naive datetime string and compare to now\n event = datetime.strptime(ts_str, \"%Y-%m-%d %H:%M:%S\")\n now = datetime.now()\n diff = (now - event).total_seconds()\n print(\"Seconds ago:\", diff)\n\nif __name__ == \"__main__\":\n log_event(\"2025-10-09 12:00:00\")``` [/INST] Quality Score: 60/100\n\nBUGS:\n- Treats naive datetimes as local time; mismatches if input intended as UTC\n\nPERFORMANCE ISSUES:\n- None\n\nSECURITY CONCERNS:\n- None\n\nIMPROVEMENTS:\n1. Use timezone-aware datetimes and explicit TZ\n```python\nfrom datetime import datetime, timezone\n\ndef log_event(ts_str):\n event = datetime.strptime(ts_str, \"%Y-%m-%d %H:%M:%S\").replace(tzinfo=timezone.utc)\n now = datetime.now(timezone.utc)\n print(\"Seconds ago:\", (now - event).total_seconds())\n```</s>"}
- {"text": "<s>[INST] Analyze this code for bugs, performance, and security issues. Give a quality score from 1-100 and provide a detailed analysis. \n\nCode:\n```def is_equal(a, b):\n # exact float equality\n return a == b\n\nif __name__ == \"__main__\":\n print(is_equal(0.1+0.2, 0.3))``` [/INST] Quality Score: 70/100\n\nBUGS:\n- Floating point equality leads to false negatives due to precision\n\nPERFORMANCE ISSUES:\n- None\n\nSECURITY CONCERNS:\n- None\n\nIMPROVEMENTS:\n1. Use tolerance-based comparison\n```python\nimport math\ndef is_equal(a, b, eps=1e-9):\n return math.isclose(a, b, rel_tol=1e-9, abs_tol=eps)\n```</s>"}
- {"text": "<s>[INST] Analyze this code for bugs, performance, and security issues. Give a quality score from 1-100 and provide a detailed analysis. \n\nCode:\n```def add_item(item, items=[]):\n # default mutable arg: shared list across calls\n items.append(item)\n return items\n\nif __name__ == \"__main__\":\n print(add_item(1))\n print(add_item(2))``` [/INST] Quality Score: 63/100\n\nBUGS:\n- Default list persists across invocations, causing unintended accumulation\n\nPERFORMANCE ISSUES:\n- None\n\nSECURITY CONCERNS:\n- None\n\nIMPROVEMENTS:\n1. Default to None and create per-call list\n```python\ndef add_item(item, items=None):\n if items is None: items = []\n items.append(item)\n return items\n```</s>"}
- {"text": "<s>[INST] Analyze this code for bugs, performance, and security issues. Give a quality score from 1-100 and provide a detailed analysis. \n\nCode:\n```import pickle\nfrom flask import Flask, request\n\napp = Flask(__name__)\n\n@app.post(\"/load\")\ndef load():\n data = request.data # untrusted\n obj = pickle.loads(data) # RCE risk\n return {\"type\": str(type(obj))}\n\nif __name__ == \"__main__\":\n app.run()``` [/INST] Quality Score: 22/100\n\nBUGS:\n- No error handling for malformed payloads\n\nPERFORMANCE ISSUES:\n- Could block on large payloads; no size limits\n\nSECURITY CONCERNS:\n- Insecure deserialization: pickle.loads on untrusted data enables arbitrary code execution\n\nIMPROVEMENTS:\n1. Use safe serialization (JSON) and cap payload size\n```python\nfrom flask import Flask, request, jsonify\n\n@app.post(\"/load\")\ndef load():\n if request.content_length and request.content_length > 1_000_000:\n return jsonify({\"error\":\"payload too large\"}), 413\n try:\n obj = request.get_json(force=True)\n return jsonify({\"keys\": list(obj.keys()) if isinstance(obj, dict) else None})\n except Exception:\n return jsonify({\"error\":\"invalid json\"}), 400\n```</s>"}
- {"text": "<s>[INST] Analyze this code for bugs, performance, and security issues. Give a quality score from 1-100 and provide a detailed analysis. \n\nCode:\n```import json, time\nfrom http.server import BaseHTTPRequestHandler, HTTPServer\n\nclass H(BaseHTTPRequestHandler):\n sessions = {} # in-memory sessions; never expire\n\n def do_POST(self):\n if self.path == \"/login\":\n ln = int(self.headers.get(\"Content-Length\", \"0\"))\n body = self.rfile.read(ln).decode()\n data = json.loads(body)\n if data.get(\"user\") and data.get(\"pass\"):\n token = data[\"user\"] + str(time.time())\n H.sessions[token] = data[\"user\"]\n self.send_response(200); self.end_headers()\n self.wfile.write(token.encode())\n else:\n self.send_response(400); self.end_headers()\n\nserver = HTTPServer((\"0.0.0.0\", 8080), H)\nserver.serve_forever()``` [/INST] Quality Score: 40/100\n\nBUGS:\n- No content-type checks; JSON errors unhandled\n- Reusing class variable sessions without thread safety (server may be multithreaded depending on mixin)\n\nPERFORMANCE ISSUES:\n- Sessions grow unbounded; memory leak\n\nSECURITY CONCERNS:\n- Session token is predictable (username + timestamp)\n- No HTTPS enforcement, no SameSite/HttpOnly cookies, tokens sent in body/plain\n- No password verification\n\nIMPROVEMENTS:\n1. Use secure random tokens with expiry and proper error handling\n```python\nimport os, json, secrets, time\nfrom http.server import BaseHTTPRequestHandler, HTTPServer\n\nclass H(BaseHTTPRequestHandler):\n sessions = {}\n def do_POST(self):\n try:\n data = json.loads(self.rfile.read(int(self.headers.get(\"Content-Length\",\"0\"))))\n except Exception:\n self.send_response(400); self.end_headers(); return\n if self.path == \"/login\":\n if not data.get(\"user\") or not data.get(\"pass\"):\n self.send_response(400); self.end_headers(); return\n token = secrets.token_urlsafe(32)\n H.sessions[token] = {\"u\": data[\"user\"], \"exp\": time.time()+3600}\n self.send_response(200); self.end_headers(); self.wfile.write(token.encode())\n```</s>"}
- {"text": "<s>[INST] Analyze this code for bugs, performance, and security issues. Give a quality score from 1-100 and provide a detailed analysis. \n\nCode:\n```import re\n\ndef parse_email(s):\n # overly permissive regex; accepts invalid and allows catastrophic backtracking\n pattern = r\"(.+)+@(.+)+\\.(.+)+\"\n return bool(re.match(pattern, s))\n\nif __name__ == \"__main__\":\n print(parse_email(\"a\"*10000 + \"@x.com\"))``` [/INST] Quality Score: 56/100\n\nBUGS:\n- Regex can exhibit catastrophic backtracking on long strings, causing CPU spikes\n- Accepts many invalid emails; false positives\n\nPERFORMANCE ISSUES:\n- Regex complexity leads to potential DoS\n\nSECURITY CONCERNS:\n- Potential ReDoS if user-controlled input is validated using this pattern\n\nIMPROVEMENTS:\n1. Use anchored, simpler regex or email.utils\n```python\nimport re\nEMAIL_RE = re.compile(r\"^[^@\\s]+@[^@\\s]+\\.[^@\\s]+$\")\ndef parse_email(s): return bool(EMAIL_RE.match(s))\n```</s>"}
31
- {"text": "<s>[INST] Analyze this code for bugs, performance, and security issues. Give a quality score from 1-100 and provide a detailed analysis. \n\nCode:\n```import json, time\nimport requests\n\ndef poll_job(api_url, job_id):\n # tight loop without backoff; no timeout\n while True:\n r = requests.get(f\"{api_url}/jobs/{job_id}\")\n if r.status_code == 200 and r.json().get(\"status\") == \"done\":\n return r.json()\n time.sleep(0.1)\n\nif __name__ == \"__main__\":\n print(poll_job(\"https://example.com/api\", \"123\"))``` [/INST] Quality Score: 57/100\n\nBUGS:\n- No exit on error or unknown statuses; potential infinite loop\n\nPERFORMANCE ISSUES:\n- Aggressive polling (100 ms) can hammer the API\n- No timeout or retry policy\n\nSECURITY CONCERNS:\n- No TLS verification customization; default OK but missing auth headers\n\nIMPROVEMENTS:\n1. Add timeouts, backoff, and max wait\n```python\nimport time, requests\n\ndef poll_job(api_url, job_id, timeout=30):\n end = time.time()+timeout\n delay = 0.5\n while time.time() < end:\n r = requests.get(f\"{api_url}/jobs/{job_id}\", timeout=5)\n r.raise_for_status()\n st = r.json().get(\"status\")\n if st == \"done\": return r.json()\n if st in {\"failed\",\"canceled\"}: raise RuntimeError(st)\n time.sleep(delay); delay = min(5.0, delay*1.5)\n raise TimeoutError(\"job not done\")\n```</s>"}
32
- {"text": "<s>[INST] Analyze this code for bugs, performance, and security issues. Give a quality score from 1-100 and provide a detailed analysis. \n\nCode:\n```import json\nfrom flask import Flask, request\n\napp = Flask(__name__)\n\n@app.post(\"/api/items\")\ndef items():\n data = request.json # may be None\n # expects 'qty' int but trusts client\n total = data[\"price\"] * data[\"qty\"] # TypeError if string\n return {\"total\": total}\n\nif __name__ == \"__main__\":\n app.run()``` [/INST] Quality Score: 59/100\n\nBUGS:\n- Assumes JSON body exists; KeyError/TypeError on missing/invalid fields\n- No type casting; arithmetic on strings raises errors\n\nPERFORMANCE ISSUES:\n- None significant\n\nSECURITY CONCERNS:\n- Trusting client-calculated fields could lead to abuse in real apps\n\nIMPROVEMENTS:\n1. Validate and coerce input with defaults\n```python\nfrom flask import Flask, request, jsonify\n\n@app.post(\"/api/items\")\ndef items():\n data = request.get_json(silent=True) or {}\n try:\n price = float(data.get(\"price\", 0))\n qty = int(data.get(\"qty\", 0))\n except ValueError:\n return jsonify({\"error\":\"invalid types\"}), 400\n return jsonify({\"total\": price * qty})\n```</s>"}
33
- {"text": "<s>[INST] Analyze this code for bugs, performance, and security issues. Give a quality score from 1-100 and provide a detailed analysis. \n\nCode:\n```import threading, json\nfrom http.server import HTTPServer, BaseHTTPRequestHandler\n\ndb = {\"hits\": 0, \"items\": []}\n\nclass H(BaseHTTPRequestHandler):\n def do_POST(self):\n ln = int(self.headers.get(\"Content-Length\",\"0\"))\n body = self.rfile.read(ln)\n item = json.loads(body).get(\"item\")\n # race: no lock protecting db\n db[\"items\"].append(item)\n db[\"hits\"] += 1\n self.send_response(201); self.end_headers()\n\nsrv = HTTPServer((\"0.0.0.0\", 8000), H)\nsrv.serve_forever()``` [/INST] Quality Score: 46/100\n\nBUGS:\n- No error handling for invalid JSON/body\n- Global mutable dict accessed from multiple threads without synchronization (HTTPServer can be threaded in variations)\n\nPERFORMANCE ISSUES:\n- In-memory store grows unbounded\n\nSECURITY CONCERNS:\n- No authentication and accepts arbitrary data\n\nIMPROVEMENTS:\n1. Add locks and validation, or use a thread-safe queue/datastore\n```python\nimport threading\nlock = threading.Lock()\ndef add_item(item):\n with lock:\n db[\"items\"].append(item)\n db[\"hits\"] += 1\n```</s>"}
34
- {"text": "<s>[INST] Analyze this code for bugs, performance, and security issues. Give a quality score from 1-100 and provide a detailed analysis. \n\nCode:\n```import yaml\n\ndef load_conf(path):\n with open(path) as f:\n return yaml.load(f, Loader=yaml.Loader) # unsafe loader\n\nif __name__ == \"__main__\":\n print(load_conf(\"config.yml\"))``` [/INST] Quality Score: 55/100\n\nBUGS:\n- Crashes on missing file; no handling\n\nPERFORMANCE ISSUES:\n- None\n\nSECURITY CONCERNS:\n- Unsafe YAML loader can construct arbitrary objects (code execution vectors)\n\nIMPROVEMENTS:\n1. Use safe_load and validate schema\n```python\nimport yaml, jsonschema\n\ndef load_conf(path):\n with open(path, encoding=\"utf-8\") as f:\n data = yaml.safe_load(f)\n # validate with schema (omitted)\n return data\n```</s>"}
analyst_dataset_expanded.jsonl DELETED
The diff for this file is too large to render. See raw diff
 
analyzer/code_analyzer.py CHANGED
@@ -1,62 +1,171 @@
1
- from typing import Dict, Any, List, Optional
2
  import time
3
  import requests
4
  import os
5
  import re
 
 
 
 
6
  from .llm_clients import LLMClientManager, LLMResponse
7
  from .prompts import get_code_analysis_prompt, get_comparison_prompt, get_github_analysis_prompt
8
  from .utils import detect_language, parse_analysis_result
9
 
 
 
10
  class CodeAnalyzer:
11
- """Main code analysis engine with GitHub integration."""
12
 
13
- def __init__(self):
 
14
  self.llm_manager = LLMClientManager()
15
  self.available_models = self.llm_manager.get_available_models()
16
 
17
- def analyze_code(self, code: str, model: str, language: Optional[str] = None) -> Dict[str, Any]:
18
- """Analyze code using a specific model with focused output."""
19
- start_time = time.time()
 
 
20
 
21
- # Detect language if not provided
22
- if not language:
23
- language = detect_language(code)
24
 
25
- # Generate focused prompt
26
- prompt = get_code_analysis_prompt(code, language)
 
 
 
 
 
 
 
 
 
 
 
 
27
 
28
- # Query LLM
29
  response = self.llm_manager.query(model, prompt)
 
30
 
31
- # Process response
32
  if response.success:
33
- analysis = parse_analysis_result(response.content)
34
- analysis['raw_response'] = response.content
35
- else:
36
- analysis = {
37
- 'error': response.error,
38
- 'quality_score': 0,
39
- 'summary': f"Analysis failed: {response.error}",
40
- 'bugs': [],
41
- 'quality_issues': [],
42
- 'security_vulnerabilities': [],
43
- 'quick_fixes': [],
44
- # Legacy fields
45
- 'strengths': [],
46
- 'issues': [],
47
- 'suggestions': [],
48
- 'security_concerns': [],
49
- 'performance_notes': []
 
 
50
  }
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
51
 
52
- # Add metadata
53
- analysis['model'] = response.model
54
- analysis['language'] = language
55
- analysis['execution_time'] = round(time.time() - start_time, 2)
56
- analysis['code_length'] = len(code)
57
- analysis['line_count'] = len(code.splitlines())
58
-
59
- return analysis
 
60
 
61
  def analyze_github_repo(self, repo_url: str, model: str = None) -> Dict[str, Any]:
62
  """Analyze a GitHub repository."""
@@ -226,14 +335,20 @@ class CodeAnalyzer:
226
  'project_overview': '',
227
  'architecture_quality': [],
228
  'critical_issues': [],
229
- 'improvement_priorities': []
 
 
 
230
  }
231
 
232
  sections = {
233
  'project_overview': r'(?:PROJECT_OVERVIEW|project\s+overview)[:\s]*(.+?)(?=\n\s*(?:\d+\.|[A-Z_]+:)|$)',
234
  'architecture_quality': r'(?:ARCHITECTURE_QUALITY|architecture|structure)[:\s]*(.+?)(?=\n\s*(?:\d+\.|[A-Z_]+:)|$)',
235
  'critical_issues': r'(?:CRITICAL_ISSUES|critical|major\s+issue)[:\s]*(.+?)(?=\n\s*(?:\d+\.|[A-Z_]+:)|$)',
236
- 'improvement_priorities': r'(?:IMPROVEMENT_PRIORITIES|improvement|priorit)[:\s]*(.+?)(?=\n\s*(?:\d+\.|[A-Z_]+:)|$)'
 
 
 
237
  }
238
 
239
  for key, pattern in sections.items():
 
1
+ from typing import Dict, Any, List, Optional, Generator, Literal
2
  import time
3
  import requests
4
  import os
5
  import re
6
+ import hashlib
7
+ import json
8
+ from tqdm import tqdm
9
+
10
  from .llm_clients import LLMClientManager, LLMResponse
11
  from .prompts import get_code_analysis_prompt, get_comparison_prompt, get_github_analysis_prompt
12
  from .utils import detect_language, parse_analysis_result
13
 
14
+ ModelType = Literal["codet5", "deepseek-finetuned", "deepseek-finetuned-remote"]
15
+
16
  class CodeAnalyzer:
17
+ """Main code analysis engine with support for APIs, local models, and GitHub integration."""
18
 
19
+ def __init__(self, cache_dir: str = None, precision: str = "fp16"):
20
+ # API-based models
21
  self.llm_manager = LLMClientManager()
22
  self.available_models = self.llm_manager.get_available_models()
23
 
24
+ # Local/Remote Hugging Face models
25
+ self.model_type: Optional[ModelType] = None
26
+ self.model_id: Optional[str] = None
27
+ self.adapter_path: Optional[str] = None
28
+ self.remote_api_url: Optional[str] = None
29
 
30
+ self.cache_dir = cache_dir
31
+ self.precision = precision.lower().strip()
32
+ self.cache = {}
33
 
34
+ if cache_dir is not None:
35
+ os.makedirs(cache_dir, exist_ok=True)
36
+ self._load_cache()
37
+
38
+ def _get_cache_key(self, code: str) -> str:
39
+ """Generate a unique cache key for a piece of code and model type."""
40
+ combined = f"{self.model_type}:{self.model_id}:{code}"
41
+ return hashlib.md5(combined.encode()).hexdigest()
42
+
43
+ def _load_cache(self):
44
+ """Load analysis cache from disk if available."""
45
+ if self.cache_dir is None:
46
+ self.cache = {}
47
+ return
48
 
49
+ cache_file = os.path.join(self.cache_dir, "analysis_cache.json")
50
+ if os.path.exists(cache_file):
51
+ try:
52
+ with open(cache_file, 'r') as f:
53
+ self.cache = json.load(f)
54
+ print(f"📁 Loaded {len(self.cache)} cached analyses")
55
+ except (json.JSONDecodeError, IOError):
56
+ self.cache = {}
57
+
58
+ def _save_cache(self):
59
+ """Save the analysis cache to disk."""
60
+ if self.cache_dir is None:
61
+ return
62
+
63
+ cache_file = os.path.join(self.cache_dir, "analysis_cache.json")
64
+ with open(cache_file, 'w') as f:
65
+ json.dump(self.cache, f)
66
+
67
+ def _check_cache(self, code: str) -> Optional[Dict[str, Any]]:
68
+ """Check if an analysis for the given code is in the cache."""
69
+ cache_key = self._get_cache_key(code)
70
+ return self.cache.get(cache_key)
71
+
72
+ def _save_to_cache(self, code: str, result: Dict[str, Any]):
73
+ """Save an analysis result to the cache."""
74
+ cache_key = self._get_cache_key(code)
75
+ self.cache[cache_key] = result
76
+ self._save_cache()
77
+
78
+ def analyze_code(
79
+ self,
80
+ code: str,
81
+ model: str,
82
+ language: Optional[str] = None,
83
+ max_tokens: int = 1024,
84
+ ) -> Dict[str, Any]:
85
+ """
86
+ Analyze code using a specified LLM provider.
87
+ This is a non-streaming, direct analysis method.
88
+ Language detection is now handled by AI for accuracy.
89
+ """
90
+ # Skip local language detection - let AI handle it for accuracy
91
+ if language is None:
92
+ language = "auto-detect" # Let AI detect it
93
+
94
+ prompt = get_code_analysis_prompt(code, language, model)
95
+
96
+ start_time = time.time()
97
  response = self.llm_manager.query(model, prompt)
98
+ total_time = time.time() - start_time
99
 
 
100
  if response.success:
101
+ structured_data = parse_analysis_result(response.content, model)
102
+
103
+ # Use AI-detected language if available, otherwise fallback to auto
104
+ detected_lang = structured_data.get('detected_language')
105
+ if detected_lang:
106
+ language = detected_lang.upper()
107
+ else:
108
+ # Fallback to LLM-based detection if not in response
109
+ language = detect_language(code).upper()
110
+
111
+ result = {
112
+ "raw_response": response.content,
113
+ "quality_score": structured_data.get('quality_score', 0),
114
+ "execution_time": total_time,
115
+ "model": response.model,
116
+ "cached": False,
117
+ **structured_data,
118
+ "language": language,
119
+ "line_count": len(code.splitlines()),
120
  }
121
+ else:
122
+ result = {'error': response.error}
123
+
124
+ return result
125
+
126
+ def analyze_code_remote(self, code: str, max_tokens: int = 300) -> Dict[str, Any]: # Increased token limit
127
+ """Analyze code using a remote Hugging Face Space API."""
128
+ if not self.remote_api_url:
129
+ return {'error': 'Remote API URL is not configured.'}
130
+
131
+ cached_result = self._check_cache(code)
132
+ if cached_result:
133
+ cached_result["cached"] = True
134
+ return cached_result
135
+
136
+ start_time = time.time()
137
 
138
+ try:
139
+ # First, try FastAPI endpoint /analyze
140
+ response = requests.post(
141
+ f"{self.remote_api_url}/analyze",
142
+ json={"code": code, "max_tokens": max_tokens},
143
+ timeout=60
144
+ )
145
+ response.raise_for_status()
146
+ data = response.json()
147
+
148
+ # Assuming the remote API returns a structured response
149
+ total_time = time.time() - start_time
150
+ result = {
151
+ "raw_response": data.get("analysis", str(data)),
152
+ "quality_score": data.get("quality_score", 0),
153
+ "execution_time": total_time,
154
+ "model": data.get("model", "remote-deepseek"),
155
+ "cached": False,
156
+ "bugs": data.get("bugs", []),
157
+ "security_vulnerabilities": data.get("security_vulnerabilities", []),
158
+ "quality_issues": data.get("quality_issues", []),
159
+ "quick_fixes": data.get("quick_fixes", []),
160
+ "language": data.get("language", detect_language(code)),
161
+ "line_count": data.get("line_count", len(code.splitlines())),
162
+ }
163
+ self._save_to_cache(code, result)
164
+ return result
165
+
166
+ except requests.exceptions.RequestException as e:
167
+ # Fallback for Gradio or other errors
168
+ return {'error': f"Remote analysis failed: {e}"}
169
 
170
  def analyze_github_repo(self, repo_url: str, model: str = None) -> Dict[str, Any]:
171
  """Analyze a GitHub repository."""
 
335
  'project_overview': '',
336
  'architecture_quality': [],
337
  'critical_issues': [],
338
+ 'improvement_priorities': [],
339
+ 'onboarding_guide': [],
340
+ 'tech_stack_rationale': [],
341
+ 'api_endpoint_summary': [],
342
  }
343
 
344
  sections = {
345
  'project_overview': r'(?:PROJECT_OVERVIEW|project\s+overview)[:\s]*(.+?)(?=\n\s*(?:\d+\.|[A-Z_]+:)|$)',
346
  'architecture_quality': r'(?:ARCHITECTURE_QUALITY|architecture|structure)[:\s]*(.+?)(?=\n\s*(?:\d+\.|[A-Z_]+:)|$)',
347
  'critical_issues': r'(?:CRITICAL_ISSUES|critical|major\s+issue)[:\s]*(.+?)(?=\n\s*(?:\d+\.|[A-Z_]+:)|$)',
348
+ 'improvement_priorities': r'(?:IMPROVEMENT_PRIORITIES|improvement|priorit)[:\s]*(.+?)(?=\n\s*(?:\d+\.|[A-Z_]+:)|$)',
349
+ 'onboarding_guide': r'(?:ONBOARDING_GUIDE|onboarding|setup)[:\s]*(.+?)(?=\n\s*(?:\d+\.|[A-Z_]+:)|$)',
350
+ 'tech_stack_rationale': r'(?:TECH_STACK_RATIONALE|tech\s+stack|stack\s+rationale)[:\s]*(.+?)(?=\n\s*(?:\d+\.|[A-Z_]+:)|$)',
351
+ 'api_endpoint_summary': r'(?:API_ENDPOINT_SUMMARY|api\s+endpoint|endpoints)[:\s]*(.+?)(?=\n\s*(?:\d+\.|[A-Z_]+:)|$)',
352
  }
353
 
354
  for key, pattern in sections.items():
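The caching layer added to `CodeAnalyzer` above (an md5 key over model type, model id, and code; a JSON file rewritten on each save) can be exercised in isolation. A minimal standalone sketch — hypothetical names, not the project's actual class, just the same keying and persistence scheme:

```python
import hashlib
import json
import os

def cache_key(model_type, model_id, code):
    # Same keying idea as _get_cache_key: model type, model id, and code, md5-hashed
    combined = f"{model_type}:{model_id}:{code}"
    return hashlib.md5(combined.encode()).hexdigest()

class AnalysisCache:
    """Hypothetical standalone version of the JSON-file analysis cache."""

    def __init__(self, cache_dir):
        os.makedirs(cache_dir, exist_ok=True)
        self.path = os.path.join(cache_dir, "analysis_cache.json")
        try:
            with open(self.path) as f:
                self.data = json.load(f)
        except (FileNotFoundError, json.JSONDecodeError):
            self.data = {}

    def get(self, code, model_type="deepseek-finetuned", model_id="demo"):
        return self.data.get(cache_key(model_type, model_id, code))

    def put(self, code, result, model_type="deepseek-finetuned", model_id="demo"):
        self.data[cache_key(model_type, model_id, code)] = result
        # As in the diff, the whole cache file is rewritten on every save
        with open(self.path, "w") as f:
            json.dump(self.data, f)
```

Note that, as in the diff, every save rewrites the entire JSON file — fine for small caches, but O(n) per write as the cache grows.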
analyzer/llm_clients.py CHANGED
@@ -4,8 +4,10 @@ from dataclasses import dataclass
4
  import openai
5
  import anthropic
6
  from dotenv import load_dotenv
 
7
 
8
- load_dotenv()
 
9
 
10
  @dataclass
11
  class LLMResponse:
@@ -23,30 +25,75 @@ class LLMClientManager:
23
 
24
  def _initialize_clients(self):
25
  """Initialize available LLM clients based on API keys."""
 
 
 
26
  # OpenAI
27
- if os.getenv("OPENAI_API_KEY"):
28
- self.clients["openai"] = openai.OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
 
 
 
 
29
 
30
  # Anthropic
31
- if os.getenv("ANTHROPIC_API_KEY"):
32
- self.clients["anthropic"] = anthropic.Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))
 
 
 
 
33
 
34
  # DeepSeek (uses OpenAI-compatible API)
35
- if os.getenv("DEEPSEEK_API_KEY"):
 
 
36
  self.clients["deepseek"] = openai.OpenAI(
37
- api_key=os.getenv("DEEPSEEK_API_KEY"),
38
  base_url="https://api.deepseek.com/v1"
39
  )
40
 
41
  def get_available_models(self) -> Dict[str, str]:
42
  """Return available models with display names."""
43
  models = {}
44
  if "openai" in self.clients:
45
- models["openai"] = "OpenAI GPT-4"
46
  if "anthropic" in self.clients:
47
- models["anthropic"] = "Claude 3"
48
  if "deepseek" in self.clients:
49
- models["deepseek"] = "DeepSeek Coder"
 
 
 
 
50
  return models
51
 
52
  def query(self, model: str, prompt: str, temperature: float = 0.1) -> LLMResponse:
@@ -60,55 +107,169 @@ class LLMClientManager:
60
  )
61
  return LLMResponse(
62
  content=response.choices[0].message.content,
63
- model="OpenAI GPT-4",
64
  success=True
65
  )
66
 
67
  elif model == "anthropic" and "anthropic" in self.clients:
68
  response = self.clients["anthropic"].messages.create(
69
- model="claude-3-haiku-20240307",
70
  messages=[{"role": "user", "content": prompt}],
71
  max_tokens=2000,
72
  temperature=temperature
73
  )
74
  return LLMResponse(
75
  content=response.content[0].text,
76
- model="Claude 3 Haiku",
77
  success=True
78
  )
79
 
80
  elif model == "deepseek" and "deepseek" in self.clients:
81
  try:
82
  response = self.clients["deepseek"].chat.completions.create(
83
- model="deepseek-coder",
84
  messages=[{"role": "user", "content": prompt}],
85
  temperature=temperature
86
  )
87
  return LLMResponse(
88
  content=response.choices[0].message.content,
89
- model="DeepSeek Coder",
90
  success=True
91
  )
92
  except Exception as deepseek_error:
93
  # Try with alternative model name if the first one fails
94
  try:
95
  response = self.clients["deepseek"].chat.completions.create(
96
- model="deepseek-chat",
97
  messages=[{"role": "user", "content": prompt}],
98
  temperature=temperature
99
  )
100
  return LLMResponse(
101
  content=response.choices[0].message.content,
102
- model="DeepSeek Coder",
103
  success=True
104
  )
105
  except Exception as second_error:
106
  return LLMResponse(
107
  content="",
108
- model="DeepSeek Coder",
109
  success=False,
110
  error=f"DeepSeek API Error: {str(deepseek_error)}. Also tried alternative model: {str(second_error)}"
111
  )
 
112
 
113
  else:
114
  return LLMResponse(
 
4
  import openai
5
  import anthropic
6
  from dotenv import load_dotenv
7
+ from huggingface_hub import InferenceClient
8
 
9
+ # Force reload environment variables
10
+ load_dotenv(override=True)
11
 
12
  @dataclass
13
  class LLMResponse:
 
25
 
26
  def _initialize_clients(self):
27
  """Initialize available LLM clients based on API keys."""
28
+ # Debug: Print available API keys
29
+ print("🔍 Initializing LLM clients...")
30
+
31
  # OpenAI
32
+ openai_key = os.getenv("OPENAI_API_KEY")
33
+ if openai_key:
34
+ print(f"✅ OpenAI API key found: {openai_key[:8]}...{openai_key[-4:]}")
35
+ self.clients["openai"] = openai.OpenAI(api_key=openai_key)
36
+ else:
37
+ print("❌ OpenAI API key not found")
38
 
39
  # Anthropic
40
+ anthropic_key = os.getenv("ANTHROPIC_API_KEY")
41
+ if anthropic_key:
42
+ print(f"✅ Anthropic API key found: {anthropic_key[:8]}...{anthropic_key[-4:]}")
43
+ self.clients["anthropic"] = anthropic.Anthropic(api_key=anthropic_key)
44
+ else:
45
+ print("❌ Anthropic API key not found")
46
 
47
  # DeepSeek (uses OpenAI-compatible API)
48
+ deepseek_key = os.getenv("DEEPSEEK_API_KEY")
49
+ if deepseek_key:
50
+ print(f"✅ DeepSeek API key found: {deepseek_key[:8]}...{deepseek_key[-4:]}")
51
  self.clients["deepseek"] = openai.OpenAI(
52
+ api_key=deepseek_key,
53
  base_url="https://api.deepseek.com/v1"
54
  )
55
+ else:
56
+ print("❌ DeepSeek API key not found")
57
+
58
+ # Mercury API (OpenAI-compatible via Inception Labs)
59
+ # Support both MERCURY_API_KEY and INCEPTION_API_KEY
60
+ mercury_key = os.getenv("MERCURY_API_KEY") or os.getenv("INCEPTION_API_KEY")
61
+ if mercury_key:
62
+ print(f"✅ Mercury API key found: {mercury_key[:8]}...{mercury_key[-4:]}")
63
+ try:
64
+ # Prefer explicit base URL envs; default to Inception Labs documented endpoint
65
+ base_url = (
66
+ os.getenv("MERCURY_BASE_URL")
67
+ or os.getenv("INCEPTION_BASE_URL")
68
+ or "https://api.inceptionlabs.ai/v1"
69
+ )
70
+ self.clients["mercury"] = openai.OpenAI(api_key=mercury_key, base_url=base_url)
71
+ print("✅ Mercury client initialized successfully")
72
+ except Exception as e:
73
+ print(f"⚠️ Mercury client initialization failed: {e}")
74
+ # Still add to clients so it appears in UI, but will show error when used
75
+ self.clients["mercury"] = None
76
+ else:
77
+ print("❌ Mercury API key not found")
78
+
79
+ # Check for Hugging Face API key with multiple possible names
80
+ hf_token = os.getenv("HUGGINGFACE_API_KEY") or os.getenv("HUGGINGFACEHUB_API_TOKEN") or os.getenv("HF_TOKEN")
81
+ if hf_token:
82
+ self.clients["huggingface"] = InferenceClient(token=hf_token)
83
 
84
  def get_available_models(self) -> Dict[str, str]:
85
  """Return available models with display names."""
86
  models = {}
87
  if "openai" in self.clients:
88
+ models["openai"] = "OpenAI GPT-4o-mini"
89
  if "anthropic" in self.clients:
90
+ models["anthropic"] = "Claude 4.5 Haiku"
91
  if "deepseek" in self.clients:
92
+ models["deepseek"] = "DeepSeek Coder V2"
93
+ if "mercury" in self.clients:
94
+ models["mercury"] = "Mercury Fast LLM"
95
+ if "huggingface" in self.clients:
96
+ models["huggingface"] = "Hugging Face (Mixtral)"
97
  return models
98
 
99
  def query(self, model: str, prompt: str, temperature: float = 0.1) -> LLMResponse:
 
107
  )
108
  return LLMResponse(
109
  content=response.choices[0].message.content,
110
+ model="OpenAI GPT-4o-mini",
111
  success=True
112
  )
113
 
114
  elif model == "anthropic" and "anthropic" in self.clients:
115
  response = self.clients["anthropic"].messages.create(
116
+ model="claude-3-5-haiku-20241022",
117
  messages=[{"role": "user", "content": prompt}],
118
  max_tokens=2000,
119
  temperature=temperature
120
  )
121
  return LLMResponse(
122
  content=response.content[0].text,
123
+ model="Claude 4.5 Haiku",
124
  success=True
125
  )
126
 
127
  elif model == "deepseek" and "deepseek" in self.clients:
128
  try:
129
  response = self.clients["deepseek"].chat.completions.create(
130
+ model="deepseek-coder-v2",
131
  messages=[{"role": "user", "content": prompt}],
132
  temperature=temperature
133
  )
134
  return LLMResponse(
135
  content=response.choices[0].message.content,
136
+ model="DeepSeek Coder V2",
137
  success=True
138
  )
139
  except Exception as deepseek_error:
140
  # Try with alternative model name if the first one fails
141
  try:
142
  response = self.clients["deepseek"].chat.completions.create(
143
+ model="deepseek-coder",
144
  messages=[{"role": "user", "content": prompt}],
145
  temperature=temperature
146
  )
147
  return LLMResponse(
148
  content=response.choices[0].message.content,
149
+ model="DeepSeek Coder V2",
150
  success=True
151
  )
152
  except Exception as second_error:
153
  return LLMResponse(
154
  content="",
155
+ model="DeepSeek Coder V2",
156
  success=False,
157
  error=f"DeepSeek API Error: {str(deepseek_error)}. Also tried alternative model: {str(second_error)}"
158
  )
159
+
160
+ elif model == "mercury" and "mercury" in self.clients:
161
+ # Check if Mercury client is properly initialized
162
+ if self.clients["mercury"] is None:
163
+ return LLMResponse(
164
+ content="",
165
+ model="Mercury Fast LLM",
166
+ success=False,
167
+ error="Mercury API client not properly initialized. Check your API key and endpoint configuration."
168
+ )
169
+
170
+ # Build candidate base URLs (env first, then known defaults)
171
+ candidate_base_urls = []
172
+ if os.getenv("MERCURY_BASE_URL"):
173
+ candidate_base_urls.append(os.getenv("MERCURY_BASE_URL"))
174
+ if os.getenv("INCEPTION_BASE_URL"):
175
+ candidate_base_urls.append(os.getenv("INCEPTION_BASE_URL"))
176
+ candidate_base_urls.extend([
177
+ "https://api.inceptionlabs.ai/v1",
178
+ "https://api.mercury.ai/v1",
179
+ "https://api.mercury.ai",
180
+ ])
181
+
182
+ # Candidate model names (env first, then fallbacks)
183
+ candidate_models = []
184
+ env_model = os.getenv("MERCURY_MODEL_NAME") or os.getenv("INCEPTION_MODEL_NAME")
185
+ if env_model:
186
+ candidate_models.append(env_model)
187
+ candidate_models.extend(["mercury", "mercury-fast", "mercury-pro", "gpt-4", "gpt-3.5-turbo"])
188
+
189
+ last_error: Optional[str] = None
190
+
191
+ for base_url in candidate_base_urls:
192
+ try:
193
+ client = openai.OpenAI(
194
+ api_key=(os.getenv("MERCURY_API_KEY") or os.getenv("INCEPTION_API_KEY")),
195
+ base_url=base_url,
196
+ )
197
+ for mercury_model in candidate_models:
198
+ try:
199
+ response = client.chat.completions.create(
200
+ model=mercury_model,
201
+ messages=[{"role": "user", "content": prompt}],
202
+ temperature=temperature,
203
+ max_tokens=2000,
204
+ )
205
+ return LLMResponse(
206
+ content=response.choices[0].message.content,
207
+ model="Mercury Fast LLM",
208
+ success=True,
209
+ )
210
+ except Exception as model_error:
211
+ last_error = f"{type(model_error).__name__}: {str(model_error)}"
212
+ continue
213
+ except Exception as client_error:
214
+ last_error = f"{type(client_error).__name__}: {str(client_error)}"
215
+ continue
216
+
217
+ # If all attempts failed, provide a consolidated error
218
+ if last_error and "503" in last_error:
219
+ return LLMResponse(
220
+ content="",
221
+ model="Mercury Fast LLM",
222
+ success=False,
223
+ error=(
224
+ "Mercury/Inception API returned 503 across endpoints. Service may be down. "
225
+ "Tried endpoints: " + ", ".join(candidate_base_urls)
226
+ ),
227
+ )
228
+ return LLMResponse(
229
+ content="",
230
+ model="Mercury Fast LLM",
231
+ success=False,
232
+ error=(
233
+ "Mercury API request failed after trying multiple endpoints and models. "
234
+ f"Last error: {last_error or 'unknown error'}"
235
+ ),
236
+ )
237
+
238
+ elif model == "huggingface" and "huggingface" in self.clients:
239
+ try:
240
+ # Use chat completion API for Mixtral model (most compatible)
241
+ response = self.clients["huggingface"].chat_completion(
242
+ messages=[{"role": "user", "content": prompt}],
243
+ model="mistralai/Mixtral-8x7B-Instruct-v0.1",
244
+ max_tokens=2000,
245
+ temperature=temperature if temperature > 0 else 0.1,
246
+ )
247
+ return LLMResponse(
248
+ content=response.choices[0].message.content,
249
+ model="Hugging Face (Mixtral)",
250
+ success=True
251
+ )
252
+ except Exception as hf_error:
253
+ # Fallback to text generation with a simpler model
254
+ try:
255
+ response = self.clients["huggingface"].text_generation(
256
+ prompt,
257
+ model="microsoft/DialoGPT-medium",
258
+ max_new_tokens=2000,
259
+ temperature=temperature if temperature > 0 else 0.1,
260
+ )
261
+ return LLMResponse(
262
+ content=response,
263
+ model="Hugging Face (DialoGPT)",
264
+ success=True
265
+ )
266
+ except Exception as fallback_error:
267
+ return LLMResponse(
268
+ content="",
269
+ model="Hugging Face (Mixtral)",
270
+ success=False,
271
+ error=f"Hugging Face API Error: {str(hf_error)}. Fallback also failed: {str(fallback_error)}"
272
+ )
273
 
274
  else:
275
  return LLMResponse(
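The Mercury branch above iterates candidate base URLs and model names, remembering only the last error for the consolidated failure message. The control flow reduces to a small generic helper (a hypothetical sketch; the real code constructs an `openai.OpenAI` client per base URL):

```python
def try_candidates(base_urls, models, call):
    """Return the first successful call(base_url, model); raise with the last error otherwise."""
    last_error = None
    for base_url in base_urls:
        for model in models:
            try:
                return call(base_url, model)
            except Exception as exc:
                # Record the failure and keep trying the next (endpoint, model) pair
                last_error = f"{type(exc).__name__}: {exc}"
    raise RuntimeError(f"all candidate endpoints/models failed; last error: {last_error}")
```

One design consequence of keeping only the last error: earlier, possibly more informative failures (e.g. a 401 on the primary endpoint) are masked by whatever the final candidate returned.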
analyzer/prompts.py CHANGED
@@ -1,14 +1,16 @@
1
- def get_code_analysis_prompt(code: str, language: str = "auto-detect") -> str:
2
  """Generate a focused prompt for practical code analysis."""
3
  return f"""
4
- You are an expert code reviewer. Analyze this {language} code for practical issues that matter to developers.
5
 
6
  Code to analyze:
7
  {code}
8
 
9
- Provide a focused analysis with complete, readable sentences. Do NOT use markdown symbols like ### or ** in your response.
10
 
11
- 1. QUALITY_SCORE: Rate 0-100 (consider bugs, readability, maintainability)
 
 
12
 
13
  2. SUMMARY: One complete sentence describing what this code does
14
 
@@ -68,7 +70,23 @@ Provide analysis focusing on:
68
  - Missing features or best practices (write complete sentences)
69
  - Code quality improvements needed (write complete sentences)
70
 
71
- Write clear, complete sentences without markdown symbols. Be practical and focus on actionable feedback for the repository owner.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
72
  """
73
 
74
  def get_comparison_prompt(code: str, language: str = "auto-detect") -> str:
 
1
+ def get_code_analysis_prompt(code: str, language: str = "auto-detect", model: str = None) -> str:
2
  """Generate a focused prompt for practical code analysis."""
3
  return f"""
4
+ You are an expert code reviewer. Analyze this code and provide comprehensive feedback.
5
 
6
  Code to analyze:
7
  {code}
8
 
9
+ First, identify the programming language, then analyze the code. Provide a focused analysis with complete, readable sentences. Do NOT use markdown symbols like ### or ** in your response.
10
 
11
+ 0. DETECTED_LANGUAGE: Respond with ONLY the programming language name (e.g., python, javascript, rust, cpp, java, go, php, ruby, swift, kotlin, typescript, csharp, html, css). Make sure to detect the language accurately based on the actual code.
12
+
13
+ 1. QUALITY_SCORE: Rate 0-100 (consider bugs, readability, maintainability). Respond with ONLY the number.
14
 
15
  2. SUMMARY: One complete sentence describing what this code does
16
 
 
70
  - Missing features or best practices (write complete sentences)
71
  - Code quality improvements needed (write complete sentences)
72
 
73
+ 5. ONBOARDING_GUIDE: (This is a critical section)
74
+ - Provide a step-by-step guide for a new developer to run this project.
75
+ - List key commands for installation, testing, and running (e.g., `npm install`, `pytest`).
76
+ - Infer any required environment variables (e.g., `DATABASE_URL`, `API_KEY`).
77
+ - If setup instructions are in the README, summarize them here.
78
+
79
+ 6. TECH_STACK_RATIONALE: (Provide your best assessment)
80
+ - Identify the primary framework, language, and package manager.
81
+ - Infer the likely reason for choosing this stack in one clear sentence.
82
+ - Mention any interesting or unusual libraries and their purpose.
83
+
84
+ 7. API_ENDPOINT_SUMMARY: (Scan all provided files for this)
85
+ - List the main API endpoints found (e.g., `GET /api/users`, `POST /api/products`).
86
+ - Provide a one-sentence guess for what each endpoint does based on the code.
87
+ - If this is not a web service, you must explicitly state 'Not a web service'.
88
+
89
+ Write clear, complete sentences without markdown symbols. You must attempt to fill out every section. Be practical and focus on actionable feedback for the repository owner.
90
  """
91
 
92
  def get_comparison_prompt(code: str, language: str = "auto-detect") -> str:
analyzer/utils.py CHANGED
@@ -1,29 +1,135 @@
1
  import re
2
  from typing import Dict, Any, Tuple
 
3
 
4
  def detect_language(code: str) -> str:
5
- """Simple language detection based on syntax patterns."""
6
- patterns = {
7
- 'python': [r'def\s+\w+\(', r'import\s+\w+', r'if\s+__name__\s*==\s*["\']__main__["\']'],
8
- 'javascript': [r'function\s+\w+\(', r'const\s+\w+\s*=', r'console\.log\('],
9
- 'java': [r'public\s+class\s+\w+', r'public\s+static\s+void\s+main'],
10
- 'cpp': [r'#include\s*<\w+>', r'int\s+main\s*\(', r'std::'],
11
- 'csharp': [r'using\s+System', r'namespace\s+\w+', r'public\s+class\s+\w+'],
12
- 'go': [r'package\s+\w+', r'func\s+\w+\(', r'import\s+\('],
13
- 'rust': [r'fn\s+main\s*\(', r'use\s+\w+', r'let\s+mut\s+\w+'],
14
- }
 
15
 
16
- for lang, patterns_list in patterns.items():
17
- for pattern in patterns_list:
18
- if re.search(pattern, code):
19
- return lang
20
 
21
- return 'unknown'
22
 
23
- def parse_analysis_result(text: str) -> Dict[str, Any]:
24
  """Parse LLM response into structured format with new focused categories."""
25
  result = {
26
  'quality_score': 75, # default
 
27
  'summary': '',
28
  'bugs': [],
29
  'quality_issues': [],
@@ -37,6 +143,23 @@ def parse_analysis_result(text: str) -> Dict[str, Any]:
37
  'performance_notes': []
38
  }
39
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
40
  # Extract quality score
41
  score_patterns = [
42
  r'(?:QUALITY_SCORE|quality[_\s]*score)[:\s]*(\d+)(?:/100)?',
 
  import re
  from typing import Dict, Any, Tuple
+ from pygments.lexers import guess_lexer
+ from pygments.util import ClassNotFound
+ from .llm_clients import LLMClientManager
+
+ def detect_language_with_llm(code: str) -> str:
+ """Detect the programming language of a code snippet using an LLM."""
+ try:
+ llm_manager = LLMClientManager()
+ # Prioritize Hugging Face for this task if available
+ model = "huggingface" if "huggingface" in llm_manager.get_available_models() else list(llm_manager.get_available_models().keys())[0]
+ prompt = f"""
+ Analyze the following code snippet and identify its programming language.
+ Respond with only the language name (e.g., 'python', 'java', 'javascript', 'go', 'cpp', 'rust', 'php', 'ruby', 'swift', 'kotlin', 'csharp', 'c').
+ If you are unsure, respond with 'unknown'.
+
+ Code:
+ ```
+ {code}
+ ```
+
+ Language:
+ """
+ response = llm_manager.query(model, prompt, temperature=0.1)
+ if response.success:
+ detected_language = response.content.strip().lower()
+ if detected_language and detected_language != "unknown":
+ return detected_language
+ except Exception as e:
+ print(f"LLM-based language detection failed: {e}")
+ return "unknown"
 
  def detect_language(code: str) -> str:
+ """Detect the programming language using LLM first for accurate detection, then fallback to pattern matching."""
+ # Try LLM-based detection first for accurate results
+ detected = detect_language_with_llm(code)
+ if detected != "unknown":
+ return detected
+
+ # Fallback to pattern matching only if LLM fails
+ code_lower = code.lower()
+
+ # HTML detection (check first as it's very common and specific)
+ if re.search(r'<html|<head|<body|<div|<span|<p\s|class\s*=|id\s*=', code, re.IGNORECASE):
+ return "html"
+
+ # CSS detection (check early as it's specific)
+ if re.search(r'\.\w+\s*\{|@media|@import|background:|color:|font-|margin:|padding:', code, re.IGNORECASE):
+ return "css"
+
+ # Go language detection (check early as it's most specific)
+ if re.search(r'package\s+main|func\s+\w+\s*\(|import\s*\(', code, re.IGNORECASE):
+ return "go"
+
+ # Python language detection
+ if re.search(r'def\s+\w+\s*\(|import\s+\w+|from\s+\w+\s+import|if\s+__name__\s*==\s*["\']__main__["\']', code, re.IGNORECASE):
+ return "python"
+
+ # JavaScript language detection (more specific patterns)
+ js_patterns = [
+ r'function\s+\w+\s*\([^)]*\)\s*\{', # function declaration with body
+ r'const\s+\w+\s*=\s*\([^)]*\)\s*=>', # arrow function
+ r'let\s+\w+\s*=\s*\([^)]*\)\s*=>', # arrow function with let
+ r'var\s+\w+\s*=\s*\([^)]*\)\s*=>', # arrow function with var
+ r'console\.log\s*\(', # console.log
+ r'document\.getElementById', # DOM manipulation
+ r'addEventListener\s*\(', # event listeners
+ r'require\s*\(|import\s+.*\s+from', # module imports
+ r'export\s+(default\s+)?(function|const|class)', # exports
+ ]
+
+ # TypeScript detection (check before JavaScript)
+ if re.search(r'interface\s+\w+|type\s+\w+\s*=|:\s*\w+\[\]|:\s*string\s*[;=]|:\s*number\s*[;=]', code, re.IGNORECASE):
+ return "typescript"
 
+ # If it matches JavaScript patterns
+ for pattern in js_patterns:
+ if re.search(pattern, code, re.IGNORECASE):
+ return "javascript"
 
+ # Java language detection
+ if re.search(r'public\s+class\s+\w+|System\.out\.println|import\s+java\.', code, re.IGNORECASE):
+ return "java"
+
+ # C++ language detection
+ if re.search(r'#include\s*<|std::|using\s+namespace\s+std', code, re.IGNORECASE):
+ return "cpp"
+
+ # C language detection
+ if re.search(r'#include\s*<|int\s+main\s*\(|printf\s*\(', code, re.IGNORECASE):
+ return "c"
+
+ # C# language detection
+ if re.search(r'using\s+System|namespace\s+\w+|public\s+class\s+\w+', code, re.IGNORECASE):
+ return "csharp"
+
+ # Rust language detection
+ if re.search(r'fn\s+\w+\s*\(|let\s+\w+\s*:|use\s+\w+::', code, re.IGNORECASE):
+ return "rust"
+
+ # PHP language detection
+ if re.search(r'<\?php|echo\s+|\$\w+\s*=', code, re.IGNORECASE):
+ return "php"
+
+ # Ruby language detection
+ if re.search(r'def\s+\w+\s*|puts\s+|require\s+', code, re.IGNORECASE):
+ return "ruby"
+
+ # Swift language detection
+ if re.search(r'func\s+\w+\s*\(|let\s+\w+\s*:|var\s+\w+\s*:', code, re.IGNORECASE):
+ return "swift"
+
+ # Kotlin language detection
+ if re.search(r'fun\s+\w+\s*\(|val\s+\w+\s*=|var\s+\w+\s*=', code, re.IGNORECASE):
+ return "kotlin"
+
+ # Fallback to Pygments if no pattern matches
+ try:
+ from pygments.lexers import guess_lexer
+ from pygments.util import ClassNotFound
+ lexer = guess_lexer(code)
+ return lexer.name.lower()
+ except (ClassNotFound, ImportError):
+ return "unknown"
+ return "unknown"
 
+ def parse_analysis_result(text: str, model: str = None) -> Dict[str, Any]:
  """Parse LLM response into structured format with new focused categories."""
  result = {
  'quality_score': 75, # default
+ 'detected_language': None, # AI-detected language
  'summary': '',
  'bugs': [],
  'quality_issues': [],
 
  'performance_notes': []
  }
 
+ # Extract detected language first
+ language_patterns = [
+ r'(?:DETECTED_LANGUAGE|language)[:\s]*([a-z]+)(?:\s|$|\.|,)',
+ r'^language[:\s]*([a-z]+)(?:\s|$|\.|,)',
+ r'(?:programming\s+language)[:\s]*([a-z]+)(?:\s|$|\.|,)',
+ ]
+
+ for pattern in language_patterns:
+ lang_match = re.search(pattern, text, re.IGNORECASE)
+ if lang_match:
+ detected_lang = lang_match.group(1).strip().lower()
+ # Validate it's a known language
+ known_languages = ['python', 'javascript', 'java', 'cpp', 'c', 'rust', 'go', 'php', 'ruby', 'swift', 'kotlin', 'typescript', 'csharp', 'html', 'css']
+ if detected_lang in known_languages:
+ result['detected_language'] = detected_lang
+ break
+
  # Extract quality score
  score_patterns = [
  r'(?:QUALITY_SCORE|quality[_\s]*score)[:\s]*(\d+)(?:/100)?',
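The parser above pulls structured fields out of free-form LLM text with regexes. A minimal, self-contained sketch of the language-extraction step (the helper name `extract_detected_language` is ours, not part of the repo, and it skips the `known_languages` validation that `parse_analysis_result` performs):

```python
import re

def extract_detected_language(text: str):
    """Return the first language name found after a DETECTED_LANGUAGE/language
    label, mirroring the parser's first pattern, or None if absent."""
    match = re.search(
        r'(?:DETECTED_LANGUAGE|language)[:\s]*([a-z]+)(?:\s|$|\.|,)',
        text,
        re.IGNORECASE,
    )
    return match.group(1).lower() if match else None
```

Because the pattern is case-insensitive, `"DETECTED_LANGUAGE: Python"` normalizes to `"python"`, while text without a language label yields `None`.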
detailed_model_comparison.py DELETED
@@ -1,324 +0,0 @@
- #!/usr/bin/env python3
- """
- Detailed Model Comparison Script
-
- This script compares CodeT5+ and CodeBERT side by side
- to show exactly how much better each model is for code analysis.
- """
-
- import torch
- import time
- from transformers import AutoTokenizer, AutoModelForCausalLM, AutoModelForSeq2SeqLM
- import json
-
- def analyze_code_with_model(model_id: str, model_type: str, test_code: str, test_name: str):
- """
- Analyze code with a specific model and return detailed results.
- """
- print(f"\n🧪 Testing {test_name}")
- print("=" * 80)
-
- start_time = time.time()
-
- try:
- device = "cuda" if torch.cuda.is_available() else "cpu"
-
- # Load tokenizer
- tokenizer = AutoTokenizer.from_pretrained(model_id)
- if tokenizer.pad_token is None:
- tokenizer.pad_token = tokenizer.eos_token
-
- # Load model
- if model_type == "seq2seq":
- model = AutoModelForSeq2SeqLM.from_pretrained(
- model_id,
- torch_dtype=torch.float16,
- device_map="auto"
- )
- else:
- model = AutoModelForCausalLM.from_pretrained(
- model_id,
- torch_dtype=torch.float16,
- device_map="auto"
- )
-
- # Get model info
- param_count = sum(p.numel() for p in model.parameters())
-
- # Create analysis prompt
- prompt = f"""Analyze this code for bugs, performance issues, and security concerns:
-
- {test_code}
-
- Analysis:"""
-
- # Tokenize
- inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512)
- inputs = {k: v.to(device) for k, v in inputs.items()}
-
- # Generate analysis
- generation_start = time.time()
- with torch.no_grad():
- if model_type == "seq2seq":
- outputs = model.generate(
- inputs["input_ids"],
- max_length=inputs["input_ids"].shape[1] + 300,
- num_beams=4,
- early_stopping=True,
- do_sample=False,
- temperature=0.7
- )
- else:
- outputs = model.generate(
- inputs["input_ids"],
- max_length=inputs["input_ids"].shape[1] + 300,
- temperature=0.7,
- do_sample=True,
- pad_token_id=tokenizer.eos_token_id,
- top_p=0.9
- )
-
- generation_time = time.time() - generation_start
-
- # Decode analysis
- analysis = tokenizer.decode(outputs[0], skip_special_tokens=True)
- analysis_text = analysis[len(prompt):].strip()
-
- total_time = time.time() - start_time
-
- # Analyze the quality of the analysis
- quality_score = analyze_analysis_quality(analysis_text, test_code)
-
- return {
- "model_id": model_id,
- "model_type": model_type,
- "test_name": test_name,
- "success": True,
- "analysis": analysis_text,
- "total_time": total_time,
- "generation_time": generation_time,
- "parameters": param_count,
- "quality_score": quality_score,
- "analysis_length": len(analysis_text)
- }
-
- except Exception as e:
- return {
- "model_id": model_id,
- "model_type": model_type,
- "test_name": test_name,
- "success": False,
- "error": str(e),
- "total_time": time.time() - start_time
- }
-
- def analyze_analysis_quality(analysis_text: str, original_code: str):
- """
- Analyze the quality of the code analysis.
- Returns a score from 0-100 based on various factors.
- """
- score = 0
-
- # Check for different types of analysis
- analysis_lower = analysis_text.lower()
-
- # Bug detection (20 points)
- if any(word in analysis_lower for word in ['bug', 'error', 'issue', 'problem', 'flaw']):
- score += 20
-
- # Performance analysis (20 points)
- if any(word in analysis_lower for word in ['performance', 'slow', 'efficient', 'complexity', 'optimization']):
- score += 20
-
- # Security analysis (20 points)
- if any(word in analysis_lower for word in ['security', 'vulnerability', 'safe', 'unsafe', 'risk']):
- score += 20
-
- # Code structure analysis (20 points)
- if any(word in analysis_lower for word in ['structure', 'design', 'pattern', 'architecture', 'organization']):
- score += 20
-
- # Suggestions/improvements (20 points)
- if any(word in analysis_lower for word in ['suggest', 'improve', 'better', 'recommend', 'fix', 'solution']):
- score += 20
-
- # Bonus points for detailed analysis
- if len(analysis_text) > 200:
- score += 10
- if len(analysis_text) > 500:
- score += 10
-
- return min(score, 100)
-
- def compare_models():
- """
- Compare CodeT5+ and CodeBERT on multiple test cases.
- """
-
- # Test cases with different types of code issues
- test_cases = [
- {
- "name": "Performance Issue (Recursive Fibonacci)",
- "code": """
- def calculate_fibonacci(n):
- if n <= 0:
- return 0
- elif n == 1:
- return 1
- else:
- return calculate_fibonacci(n-1) + calculate_fibonacci(n-2)
-
- # This will be slow for large numbers
- result = calculate_fibonacci(35)
- print(result)
- """
- },
- {
- "name": "Security Issue (SQL Injection)",
- "code": """
- import sqlite3
-
- def get_user(email):
- conn = sqlite3.connect("app.db")
- cur = conn.cursor()
- # vulnerable string interpolation
- q = f"SELECT id, email, role FROM users WHERE email = '{email}'"
- rows = cur.execute(q).fetchall()
- conn.close()
- return rows[0] if rows else None
-
- # Usage
- user = get_user("admin@example.com")
- """
- },
- {
- "name": "Bug Issue (Division by Zero)",
- "code": """
- def divide_numbers(a, b):
- return a / b
-
- def calculate_average(numbers):
- total = sum(numbers)
- count = len(numbers)
- return divide_numbers(total, count)
-
- # This will crash with empty list
- result = calculate_average([])
- print(result)
- """
- }
- ]
-
- # Models to compare
- models = [
- {
- "id": "Salesforce/codet5p-220m",
- "type": "seq2seq",
- "name": "CodeT5+ (Recommended)"
- },
- {
- "id": "microsoft/CodeBERT-base",
- "type": "causal",
- "name": "CodeBERT (Alternative)"
- }
- ]
-
- print("🚀 DETAILED MODEL COMPARISON")
- print("=" * 100)
- print("Testing both models on multiple code analysis scenarios...")
-
- all_results = []
-
- for test_case in test_cases:
- print(f"\n📋 TEST CASE: {test_case['name']}")
- print("=" * 100)
- print(f"Code to analyze:\n{test_case['code']}")
- print("=" * 100)
-
- test_results = []
-
- for model in models:
- result = analyze_code_with_model(
- model["id"],
- model["type"],
- test_case["code"],
- model["name"]
- )
- test_results.append(result)
- all_results.append(result)
-
- # Show side-by-side comparison for this test case
- print(f"\n📊 SIDE-BY-SIDE COMPARISON:")
- print("-" * 100)
-
- for result in test_results:
- if result["success"]:
- print(f"\n🤖 {result['test_name']}:")
- print(f" ⏱️ Time: {result['total_time']:.2f}s")
- print(f" 📊 Parameters: {result['parameters']:,}")
- print(f" 🎯 Quality Score: {result['quality_score']}/100")
- print(f" 📝 Analysis Length: {result['analysis_length']} chars")
- print(f" 📄 Analysis:")
- print(f" {result['analysis'][:200]}{'...' if len(result['analysis']) > 200 else ''}")
- else:
- print(f"\n❌ {result['test_name']}: FAILED - {result['error']}")
-
- # Overall comparison
- print(f"\n🏆 OVERALL COMPARISON SUMMARY")
- print("=" * 100)
-
- # Group results by model
- codet5_results = [r for r in all_results if r.get("model_id") == "Salesforce/codet5p-220m" and r["success"]]
- codebert_results = [r for r in all_results if r.get("model_id") == "microsoft/CodeBERT-base" and r["success"]]
-
- if codet5_results and codebert_results:
- # Calculate averages
- codet5_avg_time = sum(r["total_time"] for r in codet5_results) / len(codet5_results)
- codet5_avg_quality = sum(r["quality_score"] for r in codet5_results) / len(codet5_results)
- codet5_avg_length = sum(r["analysis_length"] for r in codet5_results) / len(codet5_results)
-
- codebert_avg_time = sum(r["total_time"] for r in codebert_results) / len(codebert_results)
- codebert_avg_quality = sum(r["quality_score"] for r in codebert_results) / len(codebert_results)
- codebert_avg_length = sum(r["analysis_length"] for r in codebert_results) / len(codebert_results)
-
- print(f"\n📈 AVERAGE PERFORMANCE:")
- print(f"CodeT5+:")
- print(f" ⏱️ Time: {codet5_avg_time:.2f}s")
- print(f" 🎯 Quality: {codet5_avg_quality:.1f}/100")
- print(f" 📝 Length: {codet5_avg_length:.0f} chars")
-
- print(f"\nCodeBERT:")
- print(f" ⏱️ Time: {codebert_avg_time:.2f}s")
- print(f" 🎯 Quality: {codebert_avg_quality:.1f}/100")
- print(f" 📝 Length: {codebert_avg_length:.0f} chars")
-
- # Calculate improvements
- time_ratio = codebert_avg_time / codet5_avg_time
- quality_diff = codet5_avg_quality - codebert_avg_quality
- length_ratio = codet5_avg_length / codebert_avg_length
-
- print(f"\n🎯 IMPROVEMENT ANALYSIS:")
- print(f"Speed: CodeBERT is {time_ratio:.1f}x faster than CodeT5+")
- print(f"Quality: CodeT5+ is {quality_diff:.1f} points better than CodeBERT")
- print(f"Detail: CodeT5+ gives {length_ratio:.1f}x more detailed analysis")
-
- # Final recommendation
- print(f"\n🏆 FINAL RECOMMENDATION:")
- if quality_diff > 10:
- print(f"✅ Use CodeT5+ - Significantly better analysis quality ({quality_diff:.1f} points better)")
- print(f" Trade-off: {time_ratio:.1f}x slower, but much better results")
- elif quality_diff > 5:
- print(f"✅ Use CodeT5+ - Better analysis quality ({quality_diff:.1f} points better)")
- print(f" Trade-off: {time_ratio:.1f}x slower, but better results")
- else:
- print(f"🤔 Both models are similar in quality")
- print(f" Choose CodeBERT for speed, CodeT5+ for slightly better quality")
-
- def main():
- """
- Main function to run the detailed comparison.
- """
- compare_models()
-
- if __name__ == "__main__":
- main()
 
evaluation_samples/buggy_code.py ADDED
@@ -0,0 +1,19 @@
+ def calculate_average(numbers: list) -> float:
+ """
+ Calculates the average of a list of numbers.
+ There is a bug in this function.
+ """
+ if not numbers:
+ return 0.0
+
+ total = 0
+ # Bug: This loop will miss the last number in the list.
+ for i in range(len(numbers) - 1):
+ total += numbers[i]
+
+ return total / len(numbers)
+
+ # Example usage (will produce the wrong result)
+ # print(calculate_average([10, 20, 30]))
+ # Expected: 20.0
+ # Actual: 10.0
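The planted bug is an off-by-one: `range(len(numbers) - 1)` skips the final element, so the sum is short by one value. A minimal corrected version (our own sketch, not part of the commit) iterates over the list directly:

```python
def calculate_average_fixed(numbers: list) -> float:
    """Average of a list of numbers; unlike the buggy sample above,
    the loop visits every element, so nothing is dropped."""
    if not numbers:
        return 0.0
    total = 0
    for value in numbers:  # no index arithmetic, no off-by-one
        total += value
    return total / len(numbers)
```

With this fix, `calculate_average_fixed([10, 20, 30])` returns the expected `20.0` instead of `10.0`.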
evaluation_samples/good_code.py ADDED
@@ -0,0 +1,18 @@
+ def get_file_extension(filename: str) -> str:
+ """
+ Returns the extension of a file.
+
+ Args:
+ filename: The name of the file.
+
+ Returns:
+ The file extension (e.g., "txt", "py") or an empty string if
+ the file has no extension.
+ """
+ if not isinstance(filename, str):
+ raise TypeError("Filename must be a string.")
+
+ parts = filename.split('.')
+ if len(parts) > 1:
+ return parts[-1]
+ return ""
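The same contract can be built on the standard library; this sketch (ours, not part of the commit) uses `os.path.splitext`, which differs from the `split('.')` approach on dotfiles such as `.bashrc` (splitext treats them as having no extension):

```python
import os.path

def get_file_extension_stdlib(filename: str) -> str:
    """Return the extension without the leading dot, or "" if none."""
    _, ext = os.path.splitext(filename)
    return ext[1:]  # splitext keeps the leading ".", so strip it
```

For ordinary names the two agree: `"notes.txt"` gives `"txt"` and `"README"` gives `""`.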
evaluation_samples/insecure_code.py ADDED
@@ -0,0 +1,20 @@
+ import sqlite3
+
+ def get_user_data(username: str) -> list:
+ """
+ Retrieves user data from a database.
+ This function has a critical security vulnerability.
+ """
+ # This is insecure! The username is directly concatenated into the query.
+ # An attacker could provide a username like: 'admin'; DROP TABLE users; --'
+ query = "SELECT * FROM users WHERE username = '" + username + "'"
+
+ conn = sqlite3.connect('example.db')
+ cursor = conn.cursor()
+
+ # In a real scenario, this line would execute the malicious query
+ cursor.execute(query)
+
+ results = cursor.fetchall()
+ conn.close()
+ return results
+
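The standard fix for the vulnerability above is a parameterized query, where the driver binds the value instead of splicing it into the SQL string. A self-contained sketch (ours, not part of the commit; it seeds an in-memory database so it runs standalone):

```python
import sqlite3

def get_user_data_safe(username: str) -> list:
    """Same lookup as the insecure sample above, but with a bound
    parameter, so a malicious username cannot alter the SQL."""
    conn = sqlite3.connect(':memory:')  # throwaway DB for the sketch
    cur = conn.cursor()
    cur.execute("CREATE TABLE users (username TEXT)")
    cur.execute("INSERT INTO users VALUES (?)", ("alice",))
    # The "?" placeholder is filled by sqlite3, never by string concat
    cur.execute("SELECT * FROM users WHERE username = ?", (username,))
    rows = cur.fetchall()
    conn.close()
    return rows
```

An injection payload is now just an unmatched literal: it returns no rows rather than executing extra statements.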
finetune.py DELETED
@@ -1,274 +0,0 @@
- #!/usr/bin/env python3
- """
- Fine-tuning Script for AI Code Analyzer
-
- This script fine-tunes the DeepSeek Coder model using QLoRA (Quantized LoRA)
- technique on our custom code analysis dataset.
-
- Features:
- - 4-bit quantization for memory efficiency
- - LoRA adapters for parameter-efficient fine-tuning
- - Supervised Fine-Tuning (SFT) using TRL
- - Automatic model saving and adapter persistence
-
- Author: AI Code Analyzer Project
- Date: 2025
- """
-
- import os
- import argparse
- import torch
- from transformers import (
- AutoModelForCausalLM,
- AutoTokenizer,
- BitsAndBytesConfig,
- TrainingArguments,
- DataCollatorForLanguageModeling
- )
- from peft import LoraConfig, get_peft_model, TaskType
- from trl import SFTTrainer
- from datasets import Dataset
- import json
- import logging
-
- # Set up logging
- logging.basicConfig(level=logging.INFO)
- logger = logging.getLogger(__name__)
-
- def load_dataset(file_path: str) -> Dataset:
- """
- Load the training dataset from JSONL file.
-
- Args:
- file_path (str): Path to the analyst_dataset.jsonl file
-
- Returns:
- Dataset: Hugging Face dataset object
- """
- logger.info(f"Loading dataset from {file_path}")
-
- data = []
- with open(file_path, 'r', encoding='utf-8') as f:
- for line in f:
- if line.strip():
- data.append(json.loads(line))
-
- logger.info(f"Loaded {len(data)} training samples")
- return Dataset.from_list(data)
-
- def create_quantization_config() -> BitsAndBytesConfig:
- """
- Create 4-bit quantization configuration for memory efficiency.
-
- Returns:
- BitsAndBytesConfig: Quantization configuration
- """
- logger.info("Creating 4-bit quantization configuration")
-
- return BitsAndBytesConfig(
- load_in_4bit=True,
- bnb_4bit_quant_type="nf4",
- bnb_4bit_compute_dtype=torch.float16,
- bnb_4bit_use_double_quant=True,
- )
-
- def infer_lora_target_modules(model) -> list[str]:
- """Infer suitable LoRA target modules by inspecting model modules.
-
- Handles common architectures:
- - LLaMA/DeepSeek-like: q_proj,k_proj,v_proj,o_proj,gate_proj,up_proj,down_proj
- - GPT-2/OPT-like: c_attn,c_proj,c_fc (when present)
- Falls back to any module names that contain 'q_proj','k_proj','v_proj','o_proj'
- found in the model.
- """
- llama_like = ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]
- gpt2_like = ["c_attn", "c_proj", "c_fc"]
-
- module_names = set(name.split(".")[-1] for name, _ in model.named_modules())
-
- if any(m in module_names for m in llama_like):
- return [m for m in llama_like if m in module_names]
-
- if any(m in module_names for m in gpt2_like):
- return [m for m in gpt2_like if m in module_names]
-
- # Generic attention projection fallback
- generic = [m for m in ["q_proj", "k_proj", "v_proj", "o_proj"] if m in module_names]
- if generic:
- return generic
-
- # Last resort: try any modules containing 'attn' or 'proj'
- heuristic = [m for m in module_names if ("attn" in m or "proj" in m)]
- return heuristic[:4] if heuristic else []
-
-
- def create_lora_config(model) -> LoraConfig:
- """
- Create LoRA configuration for parameter-efficient fine-tuning.
-
- Returns:
- LoraConfig: LoRA configuration
- """
- logger.info("Creating LoRA configuration")
-
- target_modules = infer_lora_target_modules(model)
- if not target_modules:
- logger.warning("Could not infer LoRA target modules; proceeding without explicit targets (may fail)")
- target_modules = None # Let PEFT attempt defaults
-
- return LoraConfig(
- task_type=TaskType.CAUSAL_LM,
- inference_mode=False,
- r=16, # Rank of adaptation
- lora_alpha=32, # LoRA scaling parameter
- lora_dropout=0.1, # LoRA dropout
- target_modules=target_modules,
- )
-
- def create_training_arguments(args: argparse.Namespace) -> TrainingArguments:
- """
- Create training arguments for the fine-tuning process.
-
- Returns:
- TrainingArguments: Training configuration
- """
- logger.info("Creating training arguments")
-
- # Defaults
- num_train_epochs = args.epochs if args.epochs is not None else 3
- per_device_train_batch_size = args.batch_size if args.batch_size is not None else 1
- logging_steps = 10
- save_steps = 500
- max_steps = args.max_steps if args.max_steps is not None else -1
-
- if args.dry_run:
- # Make the run extremely short and avoid frequent saving
- num_train_epochs = 1 if args.epochs is None else args.epochs
- logging_steps = 1
- save_steps = 10_000_000
- if args.max_steps is None:
- max_steps = 1
-
- return TrainingArguments(
- output_dir="./fine-tuned-analyst",
- num_train_epochs=num_train_epochs,
- per_device_train_batch_size=per_device_train_batch_size,
- gradient_accumulation_steps=4,
- warmup_steps=100,
- learning_rate=2e-4,
- fp16=args.fp16,
- logging_steps=logging_steps,
- save_steps=save_steps,
- save_total_limit=2,
- remove_unused_columns=False,
- push_to_hub=False,
- report_to=None, # Disable wandb/tensorboard
- dataloader_pin_memory=False,
- max_steps=max_steps,
- )
-
- def main():
- """
- Main function to execute the fine-tuning process.
- """
- logger.info("Starting AI Code Analyzer fine-tuning process")
-
- parser = argparse.ArgumentParser(description="Fine-tune DeepSeek Coder with QLoRA")
- parser.add_argument("--model", type=str, default="deepseek-ai/deepseek-coder-1.3b-instruct", help="Base model ID (HF Hub)")
- parser.add_argument("--subset", type=int, default=None, help="Use only the first N samples from the dataset")
- parser.add_argument("--epochs", type=int, default=None, help="Number of training epochs")
- parser.add_argument("--batch-size", type=int, default=None, help="Per-device train batch size")
- parser.add_argument("--max-steps", type=int, default=None, help="Override maximum training steps")
- parser.add_argument("--dry-run", action="store_true", help="Run a very short demo training")
- parser.add_argument("--no-quant", action="store_true", help="Disable 4-bit quantization (useful for CPU runs)")
- args = parser.parse_args()
-
- # Check if CUDA is available
- device = "cuda" if torch.cuda.is_available() else "cpu"
- logger.info(f"Using device: {device}")
- if device == "cpu":
- logger.warning("CUDA not available. Training will be slow on CPU.")
-
- # Step 1: Load the dataset
- dataset = load_dataset("analyst_dataset.jsonl")
- if args.subset is not None and args.subset > 0:
- logger.info(f"Using only the first {args.subset} samples for this run")
- dataset = dataset.select(range(min(args.subset, len(dataset))))
-
- # Step 2: Load the base model and tokenizer
- model_name = args.model
- logger.info(f"Loading model: {model_name}")
-
- # Create quantization config if enabled and likely supported
- use_quant = (device == "cuda") and (not args.no_quant)
- quantization_config = create_quantization_config() if use_quant else None
- if not use_quant:
- logger.info("Quantization disabled (CPU or --no-quant). Using standard weights.")
-
- # Load tokenizer
- tokenizer = AutoTokenizer.from_pretrained(model_name)
- if tokenizer.pad_token is None:
- tokenizer.pad_token = tokenizer.eos_token
-
- # Load model with quantization
- if quantization_config is not None:
- model = AutoModelForCausalLM.from_pretrained(
- model_name,
- quantization_config=quantization_config,
- device_map="auto",
- trust_remote_code=True,
- )
- fp16 = True
- else:
- # CPU or non-quantized path
- model = AutoModelForCausalLM.from_pretrained(
- model_name,
- torch_dtype=torch.float32 if device == "cpu" else torch.float16,
- device_map="cpu" if device == "cpu" else "auto",
- low_cpu_mem_usage=True,
- trust_remote_code=True,
- )
- fp16 = (device != "cpu")
-
- # Step 3: Configure LoRA
- lora_config = create_lora_config(model)
- model = get_peft_model(model, lora_config)
-
- # Print trainable parameters
- model.print_trainable_parameters()
-
- # Step 4: Set training arguments
- # Ensure training args match device precision
- args.fp16 = fp16
- training_args = create_training_arguments(args)
-
- # Step 5: Initialize SFT Trainer
- logger.info("Initializing SFT Trainer")
-
- # Provide the tokenizer/processor via processing_class
- trainer = SFTTrainer(
- model=model,
- args=training_args,
- train_dataset=dataset,
- processing_class=tokenizer,
- formatting_func=lambda r: r["text"],
- )
-
- # Step 6: Start training
- logger.info("Starting training...")
- trainer.train()
-
- # Step 7: Save the model
- logger.info("Saving fine-tuned model...")
- trainer.save_model()
- tokenizer.save_pretrained(training_args.output_dir)
-
- # Save LoRA adapters separately
- model.save_pretrained(f"{training_args.output_dir}/lora_adapters")
-
- logger.info("Fine-tuning completed successfully!")
- logger.info(f"Model saved to: {training_args.output_dir}")
- logger.info("LoRA adapters saved to: {}/lora_adapters".format(training_args.output_dir))
-
- if __name__ == "__main__":
- main()
- main()
finetune_improved.py DELETED
@@ -1,277 +0,0 @@
- #!/usr/bin/env python3
- """
- Improved Fine-tuning Script for AI Code Analyzer
-
- This script fine-tunes specialized code analysis models using the QLoRA technique
- on our custom code analysis dataset.
-
- Key Improvements:
- - Uses CodeT5+ or CodeBERT for better code understanding
- - Optimized for code analysis tasks
- - Better tokenization for code
- - Improved training parameters
-
- Author: AI Code Analyzer Project
- Date: 2025
- """
-
- import os
- import argparse
- import torch
- from transformers import (
-     AutoModelForCausalLM,
-     AutoModelForSeq2SeqLM,  # For CodeT5 models
-     AutoTokenizer,
-     BitsAndBytesConfig,
-     TrainingArguments,
-     DataCollatorForLanguageModeling
- )
- from peft import LoraConfig, get_peft_model, TaskType
- from trl import SFTTrainer
- from datasets import Dataset
- import json
- import logging
-
- # Set up logging
- logging.basicConfig(level=logging.INFO)
- logger = logging.getLogger(__name__)
-
- def load_dataset(file_path: str) -> Dataset:
-     """
-     Load the training dataset from a JSONL file.
-
-     Args:
-         file_path (str): Path to the analyst_dataset.jsonl file
-
-     Returns:
-         Dataset: Hugging Face dataset object
-     """
-     logger.info(f"Loading dataset from {file_path}")
-
-     data = []
-     with open(file_path, 'r', encoding='utf-8') as f:
-         for line in f:
-             if line.strip():
-                 data.append(json.loads(line))
-
-     logger.info(f"Loaded {len(data)} training samples")
-     return Dataset.from_list(data)
-
- def create_quantization_config() -> BitsAndBytesConfig:
-     """
-     Create 4-bit quantization configuration for memory efficiency.
-
-     Returns:
-         BitsAndBytesConfig: Quantization configuration
-     """
-     logger.info("Creating 4-bit quantization configuration")
-
-     return BitsAndBytesConfig(
-         load_in_4bit=True,
-         bnb_4bit_quant_type="nf4",
-         bnb_4bit_compute_dtype=torch.float16,
-         bnb_4bit_use_double_quant=True,
-     )
-
- def get_model_class(model_name: str):
-     """
-     Determine the appropriate model class based on the model name.
-
-     Args:
-         model_name (str): Name of the model
-
-     Returns:
-         Model class: Appropriate model class for the given model
-     """
-     if "codet5" in model_name.lower():
-         logger.info(f"Using Seq2SeqLM for CodeT5 model: {model_name}")
-         return AutoModelForSeq2SeqLM
-     else:
-         logger.info(f"Using CausalLM for model: {model_name}")
-         return AutoModelForCausalLM
-
- def create_lora_config(model_name: str) -> LoraConfig:
-     """
-     Create LoRA configuration optimized for code analysis models.
-
-     Args:
-         model_name (str): Name of the base model
-
-     Returns:
-         LoraConfig: LoRA configuration
-     """
-     logger.info(f"Creating LoRA configuration for {model_name}")
-
-     # Different target modules for different model architectures
-     if "codet5" in model_name.lower():
-         # CodeT5 architecture
-         target_modules = ["q", "v", "k", "o", "wi_0", "wi_1", "wo"]
-         task_type = TaskType.SEQ_2_SEQ_LM
-     elif "codebert" in model_name.lower():
-         # CodeBERT architecture
-         target_modules = ["query", "key", "value", "dense"]
-         task_type = TaskType.FEATURE_EXTRACTION
-     else:
-         # Default for most causal LM models
-         target_modules = ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]
-         task_type = TaskType.CAUSAL_LM
-
-     return LoraConfig(
-         r=16,
-         lora_alpha=32,
-         target_modules=target_modules,
-         lora_dropout=0.1,
-         bias="none",
-         task_type=task_type,
-     )
-
- def create_training_arguments(args) -> TrainingArguments:
-     """
-     Create training arguments optimized for code analysis fine-tuning.
-
-     Args:
-         args: Command line arguments
-
-     Returns:
-         TrainingArguments: Training configuration
-     """
-     # Calculate training parameters
-     num_train_epochs = args.epochs if args.epochs else 3
-     per_device_train_batch_size = args.batch_size if args.batch_size else 2
-     logging_steps = 10
-     save_steps = 500
-
-     # For demo runs
-     if args.dry_run:
-         logger.info("Running in dry-run mode with minimal training")
-         num_train_epochs = 1
-         per_device_train_batch_size = 1
-         max_steps = 1
-
-     return TrainingArguments(
-         output_dir="./fine-tuned-analyst-improved",
-         num_train_epochs=num_train_epochs,
-         per_device_train_batch_size=per_device_train_batch_size,
-         gradient_accumulation_steps=4,
-         warmup_steps=100,
-         learning_rate=2e-4,
-         fp16=args.fp16,
-         logging_steps=logging_steps,
-         save_steps=save_steps,
-         save_total_limit=2,
-         remove_unused_columns=False,
-         push_to_hub=False,
-         report_to=None,  # Disable wandb/tensorboard
-         dataloader_pin_memory=False,
-         max_steps=1 if args.dry_run else None,
-     )
-
- def main():
-     """
-     Main function to execute the improved fine-tuning process.
-     """
-     logger.info("Starting Improved AI Code Analyzer fine-tuning process")
-
-     parser = argparse.ArgumentParser(description="Fine-tune specialized code analysis models with QLoRA")
-
-     # Model selection with better defaults
-     parser.add_argument("--model", type=str,
-                         default="Salesforce/codet5p-220m",  # Better default for code analysis
-                         help="Base model ID (HF Hub). Options: Salesforce/codet5p-220m, microsoft/CodeBERT, facebook/incoder-1B")
-
-     parser.add_argument("--subset", type=int, default=None, help="Use only the first N samples from the dataset")
-     parser.add_argument("--epochs", type=int, default=None, help="Number of training epochs")
-     parser.add_argument("--batch-size", type=int, default=None, help="Per-device train batch size")
-     parser.add_argument("--max-steps", type=int, default=None, help="Override maximum training steps")
-     parser.add_argument("--dry-run", action="store_true", help="Run a very short demo training")
-     parser.add_argument("--no-quant", action="store_true", help="Disable 4-bit quantization")
-     parser.add_argument("--fp16", action="store_true", default=True, help="Use FP16 precision")
-
-     args = parser.parse_args()
-
-     # Check if CUDA is available
-     device = "cuda" if torch.cuda.is_available() else "cpu"
-     logger.info(f"Using device: {device}")
-     if device == "cpu":
-         logger.warning("CUDA not available. Training will be slow on CPU.")
-
-     # Step 1: Load the dataset
-     logger.info("Step 1: Loading dataset...")
-     dataset = load_dataset("analyst_dataset.jsonl")
-     if args.subset is not None and args.subset > 0:
-         logger.info(f"Using only the first {args.subset} samples for this run")
-         dataset = dataset.select(range(min(args.subset, len(dataset))))
-
-     # Step 2: Load the base model and tokenizer
-     logger.info("Step 2: Loading model and tokenizer...")
-     model_name = args.model
-     logger.info(f"Loading model: {model_name}")
-
-     # Get appropriate model class
-     model_class = get_model_class(model_name)
-
-     # Load tokenizer
-     tokenizer = AutoTokenizer.from_pretrained(model_name)
-
-     # Add padding token if not present
-     if tokenizer.pad_token is None:
-         tokenizer.pad_token = tokenizer.eos_token
-
-     # Load model with quantization if enabled
-     if args.no_quant:
-         logger.info("Loading model without quantization")
-         model = model_class.from_pretrained(model_name)
-     else:
-         logger.info("Loading model with 4-bit quantization")
-         quantization_config = create_quantization_config()
-         model = model_class.from_pretrained(
-             model_name,
-             quantization_config=quantization_config,
-             device_map="auto"
-         )
-
-     # Step 3: Create LoRA configuration
-     logger.info("Step 3: Setting up LoRA configuration...")
-     lora_config = create_lora_config(model_name)
-
-     # Step 4: Apply LoRA to the model
-     logger.info("Step 4: Applying LoRA to model...")
-     model = get_peft_model(model, lora_config)
-
-     # Step 5: Print trainable parameters
-     model.print_trainable_parameters()
-
-     # Step 6: Create training arguments
-     logger.info("Step 6: Setting up training arguments...")
-     training_args = create_training_arguments(args)
-
-     # Step 7: Create trainer
-     logger.info("Step 7: Creating trainer...")
-     trainer = SFTTrainer(
-         model=model,
-         args=training_args,
-         train_dataset=dataset,
-         tokenizer=tokenizer,
-         dataset_text_field="text",
-         max_seq_length=2048,
-         packing=False,
-     )
-
-     # Step 8: Start training
-     logger.info("Step 8: Starting training...")
-     logger.info(f"Training with {len(dataset)} samples")
-     logger.info(f"Model: {model_name}")
-     logger.info(f"Device: {device}")
-
-     trainer.train()
-
-     # Step 9: Save the model
-     logger.info("Step 9: Saving model...")
-     trainer.save_model()
-     tokenizer.save_pretrained(training_args.output_dir)
-
-     logger.info("Training completed successfully!")
-     logger.info(f"Model saved to: {training_args.output_dir}")
-
- if __name__ == "__main__":
-     main()
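The deleted script's `load_dataset()` reads `analyst_dataset.jsonl` line by line and passes the result to `SFTTrainer` with `dataset_text_field="text"`, which implies each line is a JSON object carrying a `text` field. A minimal sketch of writing and re-reading a record in that assumed format (the file name and sample text here are illustrative):

```python
import json
import os
import tempfile

def write_jsonl(path, records):
    # One JSON object per line -- the shape load_dataset() above expects
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")

def read_jsonl(path):
    # Mirrors the deleted script's parsing: skip blank lines, json.loads the rest
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

records = [{"text": "<s>[INST] Analyze this code ... [/INST] Quality Score: 35/100 ..."}]
path = os.path.join(tempfile.mkdtemp(), "analyst_dataset_demo.jsonl")
write_jsonl(path, records)
loaded = read_jsonl(path)
```

Round-tripping through JSONL like this is a quick sanity check before launching a training run.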
hf-space/README.md DELETED
@@ -1,61 +0,0 @@
- ---
- title: Fine-tuned Code Analyzer API
- emoji: 🤖
- colorFrom: green
- colorTo: blue
- sdk: docker
- pinned: false
- ---
-
- # Fine-tuned Code Analyzer API
-
- API endpoint for code analysis using a fine-tuned DeepSeek model.
-
- ## Features
-
- - **Quality Scores**: 1-100 rating for code quality
- - **Structured Analysis**: Bugs, Performance, Security sections
- - **Code Improvements**: Specific suggestions with examples
- - **Professional Output**: Consistent, detailed analysis format
-
- ## Usage
-
- ### POST /analyze
-
- Analyze code for bugs, performance, and security issues.
-
- **Request:**
- ```json
- {
-   "code": "def fibonacci(n):\n    if n <= 1:\n        return n\n    return fibonacci(n-1) + fibonacci(n-2)",
-   "max_tokens": 300
- }
- ```
-
- **Response:**
- ```json
- {
-   "analysis": "Quality Score: 35/100\n\nBUGS:\n- No error handling\n- Infinite recursion possible\n\nPERFORMANCE ISSUES:\n- Recursive calls cause exponential time complexity\n\nSECURITY CONCERNS:\n- No input validation\n\nIMPROVEMENTS:\n1. Use memoization to avoid recursion\n2. Add input validation\n\nExample improved code:\n[Shows working fixes]",
-   "model": "fine-tuned-deepseek",
-   "status": "success"
- }
- ```
-
- ### GET /health
-
- Health check endpoint.
-
- **Response:**
- ```json
- {
-   "status": "healthy",
-   "model": "fine-tuned-deepseek"
- }
- ```
-
- ## Model Details
-
- - **Base Model**: DeepSeek Coder 1.3B
- - **Training Method**: LoRA (Low-Rank Adaptation)
- - **Dataset**: 59+ high-quality code analysis examples
- - **Fine-tuned for**: Code analysis, bug detection, performance optimization
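The removed README documents a `POST /analyze` endpoint taking a JSON body with `code` and `max_tokens` fields. A minimal client sketch against that contract (the Space URL is a placeholder, and `analyze()` assumes the `requests` package plus a running Space):

```python
import json

API_URL = "https://YOUR-SPACE.hf.space"  # hypothetical Space URL, not the real one

def build_analyze_payload(code: str, max_tokens: int = 300) -> str:
    """Serialize the request body documented for POST /analyze."""
    return json.dumps({"code": code, "max_tokens": max_tokens})

def analyze(code: str) -> dict:
    # Network call; requires `pip install requests` and a live endpoint.
    import requests
    resp = requests.post(
        f"{API_URL}/analyze",
        data=build_analyze_payload(code),
        headers={"Content-Type": "application/json"},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()  # expected keys per README: analysis, model, status

payload = build_analyze_payload("def f():\n    pass")
```

The response's `analysis` string embeds the score and section headers, so callers that need structure must parse it themselves.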
hf-space/app.py DELETED
@@ -1,82 +0,0 @@
- from fastapi import FastAPI, HTTPException
- from pydantic import BaseModel
- import uvicorn
- import requests
- import json
-
- app = FastAPI()
-
- class CodeRequest(BaseModel):
-     code: str
-     max_tokens: int = 300
-
- class AnalysisResponse(BaseModel):
-     analysis: str
-     model: str
-     status: str
-
- @app.post("/analyze", response_model=AnalysisResponse)
- async def analyze_code(request: CodeRequest):
-     try:
-         # For now, return a mock analysis while we debug the model loading
-         analysis_text = f"""
- Quality Score: 75/100
-
- BUGS:
- - No error handling for edge cases
- - Potential infinite recursion for large inputs
-
- PERFORMANCE ISSUES:
- - Recursive approach is inefficient for large numbers
- - No memoization implemented
-
- SECURITY CONCERNS:
- - No input validation
- - Could cause stack overflow with large inputs
-
- IMPROVEMENTS:
- 1. Add input validation
- 2. Implement iterative solution or memoization
- 3. Add error handling for edge cases
-
- Example improved code:
- ```python
- def fibonacci_improved(n):
-     if n < 0:
-         raise ValueError("Input must be non-negative")
-     if n <= 1:
-         return n
-
-     a, b = 0, 1
-     for _ in range(2, n + 1):
-         a, b = b, a + b
-     return b
- ```
- """
-
-         return AnalysisResponse(
-             analysis=analysis_text,
-             model="fine-tuned-deepseek-mock",
-             status="success"
-         )
-     except Exception as e:
-         raise HTTPException(status_code=500, detail=str(e))
-
- @app.get("/health")
- async def health_check():
-     return {"status": "healthy", "model": "fine-tuned-deepseek-mock"}
-
- @app.get("/")
- async def root():
-     return {
-         "message": "Fine-tuned Code Analyzer API",
-         "endpoints": {
-             "POST /analyze": "Analyze code for bugs, performance, and security issues",
-             "GET /health": "Health check endpoint"
-         },
-         "model": "fine-tuned-deepseek-mock",
-         "status": "running"
-     }
-
- if __name__ == "__main__":
-     uvicorn.run(app, host="0.0.0.0", port=7860)
hf-space/requirements.txt DELETED
@@ -1,3 +0,0 @@
- fastapi
- uvicorn[standard]
- requests
 
 
 
 
inference_demo.py DELETED
@@ -1,57 +0,0 @@
- #!/usr/bin/env python3
- import os
- import torch
- from transformers import AutoModelForCausalLM, AutoTokenizer
- from peft import PeftModel
-
- # Settings
- BASE_MODEL = os.environ.get("BASE_MODEL", "gpt2")
- ADAPTER_DIR = os.environ.get("ADAPTER_DIR", "./fine-tuned-analyst/lora_adapters")
- DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
-
- def load_model_and_tokenizer():
-     tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
-     if tokenizer.pad_token is None:
-         tokenizer.pad_token = tokenizer.eos_token
-     model = AutoModelForCausalLM.from_pretrained(
-         BASE_MODEL,
-         torch_dtype=torch.float32 if DEVICE == "cpu" else torch.float16,
-         device_map="cpu" if DEVICE == "cpu" else "auto",
-         trust_remote_code=True,
-         low_cpu_mem_usage=True,
-     )
-     model = PeftModel.from_pretrained(model, ADAPTER_DIR)
-     model.to(DEVICE)
-     model.eval()
-     return model, tokenizer
-
-
- def generate(prompt: str, max_new_tokens: int = 256) -> str:
-     model, tokenizer = load_model_and_tokenizer()
-     inputs = tokenizer(prompt, return_tensors="pt").to(DEVICE)
-     with torch.no_grad():
-         out = model.generate(
-             **inputs,
-             max_new_tokens=max_new_tokens,
-             do_sample=True,
-             top_p=0.9,
-             temperature=0.7,
-             eos_token_id=tokenizer.eos_token_id,
-             pad_token_id=tokenizer.pad_token_id,
-         )
-     text = tokenizer.decode(out[0], skip_special_tokens=True)
-     return text
-
-
- if __name__ == "__main__":
-     # A tiny prompt using the same schema
-     code = """def add_item(item, items=[]):\n    items.append(item)\n    return items\n"""
-     inst = (
-         "<s>[INST] Analyze this code for bugs, performance, and security issues. "
-         "Give a quality score from 1-100 and provide a detailed analysis. \n\nCode:\n```" + code + "``` [/INST]"
-     )
-     print("Device:", DEVICE)
-     print("Base model:", BASE_MODEL)
-     print("Adapters:", ADAPTER_DIR)
-     print("\n--- Generated Output ---\n")
-     print(generate(inst, max_new_tokens=200))
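The demo's `__main__` block hard-codes its `<s>[INST] ... [/INST]` instruction prompt inline. The same schema can be factored into a small helper (a sketch; `build_inst_prompt` is illustrative and not part of the repo):

```python
def build_inst_prompt(code: str) -> str:
    """Wrap source code in the [INST] schema used by the deleted inference demo."""
    return (
        "<s>[INST] Analyze this code for bugs, performance, and security issues. "
        "Give a quality score from 1-100 and provide a detailed analysis. "
        "\n\nCode:\n```" + code + "``` [/INST]"
    )

# Same shape as the demo's hand-built `inst` string
prompt = build_inst_prompt("def add(a, b):\n    return a + b\n")
```

Keeping the prompt template in one function makes it easier to guarantee that inference prompts match the schema the adapters were trained on.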
matrix_analyzer_optimized.py DELETED
@@ -1,262 +0,0 @@
- #!/usr/bin/env python3
- """
- Optimized Matrix Code Analyzer with CodeT5+
-
- This integrates the optimized CodeT5+ analyzer into your existing
- Matrix-themed Streamlit application with speed optimizations.
-
- Author: AI Code Analyzer Project
- Date: 2025
- """
-
- import streamlit as st
- import time
- import torch
- from optimized_code_analyzer import OptimizedCodeAnalyzer
-
- # Page configuration
- st.set_page_config(
-     page_title="AI Code Analyzer - Optimized",
-     page_icon="🤖",
-     layout="wide",
-     initial_sidebar_state="expanded"
- )
-
- # Custom CSS for Matrix theme
- st.markdown("""
- <style>
-     .main {
-         background-color: #0a0a0a;
-         color: #00ff00;
-     }
-
-     .stApp {
-         background-color: #0a0a0a;
-     }
-
-     .matrix-header {
-         background: linear-gradient(90deg, #00ff00, #008800);
-         color: #000000;
-         padding: 20px;
-         border-radius: 10px;
-         text-align: center;
-         margin-bottom: 30px;
-         font-family: 'Courier New', monospace;
-     }
-
-     .analysis-box {
-         background-color: #001100;
-         border: 2px solid #00ff00;
-         border-radius: 10px;
-         padding: 20px;
-         margin: 10px 0;
-     }
-
-     .speed-indicator {
-         background-color: #002200;
-         border: 1px solid #00ff00;
-         border-radius: 5px;
-         padding: 10px;
-         margin: 10px 0;
-     }
-
-     .cache-info {
-         background-color: #000800;
-         border: 1px solid #008800;
-         border-radius: 5px;
-         padding: 10px;
-         margin: 10px 0;
-         font-size: 12px;
-     }
- </style>
- """, unsafe_allow_html=True)
-
- @st.cache_resource
- def load_analyzer():
-     """
-     Load the optimized analyzer (cached for performance).
-     """
-     return OptimizedCodeAnalyzer()
-
- def main():
-     """
-     Main Streamlit application.
-     """
-     # Header
-     st.markdown("""
-     <div class="matrix-header">
-         <h1>🤖 AI Code Analyzer - Optimized</h1>
-         <p>Powered by CodeT5+ with Speed Optimizations</p>
-     </div>
-     """, unsafe_allow_html=True)
-
-     # Load analyzer
-     with st.spinner("🚀 Loading optimized CodeT5+ model..."):
-         analyzer = load_analyzer()
-
-     # Sidebar
-     st.sidebar.markdown("## ⚙️ Analysis Options")
-
-     analysis_mode = st.sidebar.selectbox(
-         "Analysis Mode",
-         ["Streaming (Interactive)", "Fast (Batch)"],
-         help="Streaming shows progress, Fast is optimized for speed"
-     )
-
-     show_progress = st.sidebar.checkbox(
-         "Show Progress Indicators",
-         value=True,
-         help="Display progress bars and timing information"
-     )
-
-     # Main content
-     col1, col2 = st.columns([1, 1])
-
-     with col1:
-         st.markdown("## 📝 Code Input")
-
-         # Code input
-         code_input = st.text_area(
-             "Enter your code:",
-             height=300,
-             placeholder="def hello():\n    print('Hello, World!')",
-             help="Paste your code here for analysis"
-         )
-
-         # Analysis button
-         analyze_button = st.button(
-             "🔍 Analyze Code",
-             type="primary",
-             use_container_width=True
-         )
-
-     with col2:
-         st.markdown("## 📊 Analysis Results")
-
-         if analyze_button and code_input.strip():
-             # Perform analysis
-             start_time = time.time()
-
-             if analysis_mode == "Streaming (Interactive)":
-                 # Streaming analysis
-                 st.markdown("### 🔄 Streaming Analysis")
-
-                 # Create placeholder for streaming results
-                 result_placeholder = st.empty()
-                 progress_placeholder = st.empty()
-
-                 # Show progress
-                 if show_progress:
-                     progress_bar = progress_placeholder.progress(0)
-                     status_text = st.empty()
-
-                 try:
-                     # Stream analysis
-                     analysis_text = ""
-                     for partial_result in analyzer.analyze_code_streaming(code_input, show_progress):
-                         analysis_text = partial_result
-
-                         # Update progress
-                         if show_progress:
-                             progress_bar.progress(50)
-                             status_text.text("🔍 Analyzing code...")
-
-                     # Complete analysis
-                     if show_progress:
-                         progress_bar.progress(100)
-                         status_text.text("✅ Analysis complete!")
-
-                     # Display results
-                     result_placeholder.markdown(f"""
-                     <div class="analysis-box">
-                         <h4>📄 Analysis Results:</h4>
-                         <p>{analysis_text}</p>
-                     </div>
-                     """, unsafe_allow_html=True)
-
-                 except Exception as e:
-                     st.error(f"❌ Analysis failed: {str(e)}")
-
-             else:
-                 # Fast analysis
-                 st.markdown("### ⚡ Fast Analysis")
-
-                 if show_progress:
-                     progress_bar = st.progress(0)
-                     status_text = st.empty()
-                     progress_bar.progress(25)
-                     status_text.text("🚀 Loading model...")
-
-                 try:
-                     # Perform fast analysis
-                     result = analyzer.analyze_code_fast(code_input)
-
-                     if show_progress:
-                         progress_bar.progress(100)
-                         status_text.text("✅ Analysis complete!")
-
-                     # Display results
-                     st.markdown(f"""
-                     <div class="analysis-box">
-                         <h4>📄 Analysis Results:</h4>
-                         <p>{result['analysis']}</p>
-                     </div>
-                     """, unsafe_allow_html=True)
-
-                 except Exception as e:
-                     st.error(f"❌ Analysis failed: {str(e)}")
-
-             # Show performance metrics
-             total_time = time.time() - start_time
-
-             st.markdown(f"""
-             <div class="speed-indicator">
-                 <h4>⚡ Performance Metrics:</h4>
-                 <p><strong>Total Time:</strong> {total_time:.2f}s</p>
-                 <p><strong>Analysis Mode:</strong> {analysis_mode}</p>
-                 <p><strong>Model:</strong> CodeT5+ (Optimized)</p>
-             </div>
-             """, unsafe_allow_html=True)
-
-         elif analyze_button and not code_input.strip():
-             st.warning("⚠️ Please enter some code to analyze!")
-
-         else:
-             st.info("👆 Enter code and click 'Analyze Code' to get started!")
-
-     # Model information
-     st.markdown("## 📊 Model Information")
-
-     model_info = analyzer.get_model_info()
-
-     col1, col2, col3 = st.columns(3)
-
-     with col1:
-         st.metric("Model Parameters", f"{model_info['parameters']:,}")
-
-     with col2:
-         st.metric("Cache Size", f"{model_info['cache_size']} analyses")
-
-     with col3:
-         st.metric("Device", str(model_info['device']))
-
-     # Cache information
-     st.markdown("""
-     <div class="cache-info">
-         <h4>💾 Cache Information:</h4>
-         <p>• Cached analyses are reused for identical code</p>
-         <p>• Cache improves speed for repeated analyses</p>
-         <p>• Cache is automatically managed</p>
-     </div>
-     """, unsafe_allow_html=True)
-
-     # Footer
-     st.markdown("---")
-     st.markdown("""
-     <div style="text-align: center; color: #008800;">
-         <p>🚀 Optimized AI Code Analyzer | Powered by CodeT5+ | Matrix Theme</p>
-     </div>
-     """, unsafe_allow_html=True)
-
- if __name__ == "__main__":
-     main()
matrix_final.py CHANGED
@@ -1,1086 +1,880 @@
1
- import streamlit as st
2
- import os
3
- import time
4
- import random
5
- import sys
6
  from dotenv import load_dotenv
 
7
  from analyzer import CodeAnalyzer
8
- from optimized_code_analyzer_enhanced import EnhancedCodeAnalyzer
 
 
9
 
10
- # Load environment variables
11
- load_dotenv()
12
 
13
- # Page config
14
  st.set_page_config(
15
- page_title="Matrix Code Analyzer - Final",
16
- page_icon="🟢",
17
- layout="wide",
18
- initial_sidebar_state="expanded"
19
  )
20
 
21
- # Matrix CSS - Enhanced with file upload styling
22
- st.markdown("""
 
23
  <style>
24
  @import url('https://fonts.googleapis.com/css2?family=Share+Tech+Mono:wght@400&display=swap');
25
  @import url('https://fonts.googleapis.com/css2?family=Orbitron:wght@400;700;900&display=swap');
26
-
27
- /* Matrix Background */
 
 
 
 
 
 
 
 
 
 
28
  .stApp {
29
- background: linear-gradient(135deg, #0d1b0d 0%, #000000 50%, #0d1b0d 100%);
30
- color: #00ff41;
31
- font-family: 'Share Tech Mono', monospace;
32
- }
33
-
34
- /* Matrix Code Rain Animation */
35
- .matrix-bg {
36
  position: fixed;
37
- top: 0;
38
- left: 0;
39
- width: 100%;
40
- height: 100%;
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
41
  overflow: hidden;
 
42
  z-index: -1;
43
- opacity: 0.1;
44
  }
45
 
46
- .matrix-char {
47
  position: absolute;
48
- color: #00ff41;
 
49
  font-family: 'Share Tech Mono', monospace;
50
  font-size: 14px;
51
- animation: matrix-fall linear infinite;
52
- }
53
-
54
- @keyframes matrix-fall {
55
- 0% { transform: translateY(-100vh); opacity: 1; }
56
- 100% { transform: translateY(100vh); opacity: 0; }
57
- }
58
-
59
- /* Main Content Styling */
60
- .main .block-container {
61
- padding-top: 2rem;
62
- background: rgba(0, 0, 0, 0.8);
63
- border: 1px solid #00ff41;
64
- border-radius: 10px;
65
- box-shadow: 0 0 20px rgba(0, 255, 65, 0.3);
66
- }
67
-
68
- /* Title Styling */
69
- h1 {
70
- font-family: 'Orbitron', monospace !important;
71
- color: #00ff41 !important;
72
- text-align: center !important;
73
- text-shadow: 0 0 10px #00ff41, 0 0 20px #00ff41, 0 0 30px #00ff41;
74
- font-weight: 900 !important;
75
- margin-bottom: 2rem !important;
76
- animation: glow 2s ease-in-out infinite alternate;
77
- }
78
-
79
- @keyframes glow {
80
- from { text-shadow: 0 0 5px #00ff41, 0 0 10px #00ff41, 0 0 15px #00ff41; }
81
- to { text-shadow: 0 0 10px #00ff41, 0 0 20px #00ff41, 0 0 30px #00ff41; }
82
- }
83
-
84
- /* Sidebar Styling */
85
- .css-1d391kg {
86
- background: rgba(0, 0, 0, 0.9) !important;
87
- border: 1px solid #00ff41 !important;
88
- border-radius: 10px !important;
89
- }
90
-
91
- /* Buttons */
92
- .stButton > button {
93
- background: linear-gradient(45deg, #003300, #006600) !important;
94
- color: #00ff41 !important;
95
- border: 2px solid #00ff41 !important;
96
- border-radius: 5px !important;
97
- font-family: 'Orbitron', monospace !important;
98
- font-weight: bold !important;
99
- text-transform: uppercase !important;
100
- transition: all 0.3s ease !important;
101
- box-shadow: 0 0 10px rgba(0, 255, 65, 0.3) !important;
102
- }
103
-
104
- .stButton > button:hover {
105
- background: linear-gradient(45deg, #006600, #00aa00) !important;
106
- box-shadow: 0 0 20px rgba(0, 255, 65, 0.6) !important;
107
- transform: translateY(-2px) !important;
108
- }
109
-
110
- /* File uploader styling */
111
- .stFileUploader > div {
112
- background: rgba(0, 51, 0, 0.3) !important;
113
- border: 2px dashed #00ff41 !important;
114
- border-radius: 10px !important;
115
- padding: 20px !important;
116
- }
117
-
118
- .stFileUploader label {
119
- color: #00ff41 !important;
120
- font-family: 'Orbitron', monospace !important;
121
- font-weight: bold !important;
122
- }
123
-
124
- /* Select boxes */
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
125
  .stSelectbox > div > div {
126
- background: rgba(0, 51, 0, 0.8) !important;
127
- border: 1px solid #00ff41 !important;
128
- color: #00ff41 !important;
129
- }
130
-
131
- /* Text areas */
132
- .stTextArea > div > div > textarea {
133
- background: rgba(0, 0, 0, 0.9) !important;
134
- border: 1px solid #00ff41 !important;
135
- color: #00ff41 !important;
136
- font-family: 'Share Tech Mono', monospace !important;
137
- }
138
-
139
- /* Metrics */
140
- .css-1xarl3l {
141
- background: rgba(0, 51, 0, 0.3) !important;
142
- border: 1px solid #00ff41 !important;
143
- border-radius: 5px !important;
144
- box-shadow: 0 0 10px rgba(0, 255, 65, 0.2) !important;
145
- }
146
-
147
- /* Success/Info/Warning messages */
148
- .stSuccess {
149
- background: rgba(0, 255, 65, 0.1) !important;
150
- border: 1px solid #00ff41 !important;
151
- color: #00ff41 !important;
152
- }
153
-
154
- .stInfo {
155
- background: rgba(0, 255, 255, 0.1) !important;
156
- border: 1px solid #00ffff !important;
157
- color: #00ffff !important;
158
- }
159
-
160
- .stWarning {
161
- background: rgba(255, 255, 0, 0.1) !important;
162
- border: 1px solid #ffff00 !important;
163
- color: #ffff00 !important;
164
- }
165
-
166
- .stError {
167
- background: rgba(255, 0, 0, 0.1) !important;
168
- border: 1px solid #ff0000 !important;
169
- color: #ff0000 !important;
170
- }
171
-
172
- /* Code blocks */
173
- .stCode {
174
- background: rgba(0, 0, 0, 0.9) !important;
175
- border: 1px solid #00ff41 !important;
176
- color: #00ff41 !important;
177
- }
178
-
179
- /* Tabs */
 
 
 
 
 
 
 
 
180
  .stTabs [data-baseweb="tab-list"] {
181
- background: rgba(0, 0, 0, 0.8) !important;
182
- border-bottom: 2px solid #00ff41 !important;
 
 
183
  }
184
 
185
  .stTabs [data-baseweb="tab"] {
186
- background: rgba(0, 51, 0, 0.3) !important;
187
- color: #00ff41 !important;
188
- border: 1px solid #00ff41 !important;
189
- font-family: 'Orbitron', monospace !important;
 
190
  }
191
 
192
  .stTabs [aria-selected="true"] {
193
- background: rgba(0, 255, 65, 0.2) !important;
194
- box-shadow: 0 0 10px rgba(0, 255, 65, 0.5) !important;
195
- }
196
-
197
- /* Matrix Terminal Effect */
198
- .matrix-terminal {
199
- background: rgba(0, 0, 0, 0.95) !important;
200
- border: 2px solid #00ff41 !important;
201
- border-radius: 10px !important;
202
- padding: 20px !important;
203
- font-family: 'Share Tech Mono', monospace !important;
204
- color: #00ff41 !important;
205
- box-shadow: 0 0 30px rgba(0, 255, 65, 0.4) !important;
206
- position: relative !important;
207
- }
208
-
209
- .matrix-terminal::before {
210
- content: "MATRIX_ANALYZER_v3.0 > OPERATIONAL" !important;
211
- position: absolute !important;
212
- top: -15px !important;
213
- left: 20px !important;
214
- background: #000000 !important;
215
- padding: 0 10px !important;
216
- color: #00ff41 !important;
217
- font-size: 12px !important;
218
- font-weight: bold !important;
219
- }
220
-
221
- /* Checkbox styling */
222
- .stCheckbox > label {
223
- color: #00ff41 !important;
224
- font-family: 'Orbitron', monospace !important;
225
- }
226
-
227
- /* Loading spinner */
228
- .stSpinner > div {
229
- border-color: #00ff41 transparent #00ff41 transparent !important;
230
- }
231
-
232
- /* Expander */
233
- .streamlit-expanderHeader {
234
- background: rgba(0, 51, 0, 0.3) !important;
235
- border: 1px solid #00ff41 !important;
236
- color: #00ff41 !important;
237
- }
238
-
239
- /* File info styling */
240
- .file-info {
241
- background: rgba(0, 255, 65, 0.1);
242
- border: 1px solid #00ff41;
243
- padding: 10px;
244
- border-radius: 5px;
245
- margin: 10px 0;
246
- font-family: 'Share Tech Mono', monospace;
247
- }
248
- </style>
249
 
250
- <div class="matrix-bg" id="matrix-bg"></div>
 
 
 
 
 
 
 
 
251
 
252
- <script>
253
- function createMatrixRain() {
254
- const chars = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789@#$%^&*()_+-=[]{}|;:,.<>?";
255
- const container = document.getElementById('matrix-bg');
256
-
257
- for (let i = 0; i < 50; i++) {
258
- const char = document.createElement('div');
259
- char.className = 'matrix-char';
260
- char.textContent = chars[Math.floor(Math.random() * chars.length)];
261
- char.style.left = Math.random() * 100 + '%';
262
- char.style.animationDuration = (Math.random() * 3 + 2) + 's';
263
- char.style.animationDelay = Math.random() * 2 + 's';
264
- container.appendChild(char);
265
- }
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
266
  }
 
 
 
 
 
 
 
 
 
 
 
267
 
268
- // Create matrix rain effect
269
- setTimeout(createMatrixRain, 100);
270
- </script>
271
- """, unsafe_allow_html=True)
272
 
273
- # Initialize analyzers
274
  def get_analyzer():
275
- # Force reimport to ensure latest code
276
- import importlib
277
- import analyzer.code_analyzer
278
- importlib.reload(analyzer.code_analyzer)
279
- from analyzer.code_analyzer import CodeAnalyzer
280
- return CodeAnalyzer()
281
 
282
  analyzer = get_analyzer()
 
283
 
284
- # Local analyzer (cached)
285
- @st.cache_resource
286
- def get_local_analyzer(model_type="codet5", remote_url=None):
287
- if remote_url:
288
- return EnhancedCodeAnalyzer(
289
- model_type="deepseek-finetuned-remote",
290
- remote_api_url=remote_url
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
291
  )
292
- else:
293
- return EnhancedCodeAnalyzer(
294
- model_type=model_type,
295
- precision="fp16",
296
- quick_max_new_tokens=180,
297
- detailed_max_new_tokens=300,
 
 
 
298
  )

- def display_matrix_analysis_result(result: dict, model_name: str):
- """Display analysis result in clean, readable horizontal blocks."""
- if 'error' in result:
- st.error(f"🚨 SYSTEM ERROR: {result['error']}")
  return
-
- # Quality score with modern styling
- score = result['quality_score']
- if score >= 80:
- score_color = "#00ff41"
- status = "EXCELLENT"
- elif score >= 60:
- score_color = "#ffff00"
- status = "ACCEPTABLE"
- else:
- score_color = "#ff0000"
- status = "NEEDS_WORK"
-
- # Header with score
- st.markdown(f"""
- <div style="background: linear-gradient(135deg, rgba(0,255,65,0.15), rgba(0,255,65,0.05));
- border: 2px solid #00ff41; border-radius: 15px; padding: 25px; margin: 20px 0;
- text-align: center;">
- <h2 style="color: {score_color}; margin-bottom: 15px; font-size: 1.8rem;">
- {model_name} Analysis
- </h2>
- <div style="display: flex; justify-content: center; gap: 30px; flex-wrap: wrap; color: #ffffff;">
- <div style="text-align: center;">
- <div style="font-size: 2rem; color: {score_color}; font-weight: bold;">{score}/100</div>
- <div style="font-size: 0.9rem; opacity: 0.8;">{status}</div>
- </div>
- <div style="text-align: center;">
- <div style="font-size: 1.2rem; color: #00ff41; font-weight: bold;">{result['language'].upper()}</div>
- <div style="font-size: 0.9rem; opacity: 0.8;">Language</div>
- </div>
- <div style="text-align: center;">
- <div style="font-size: 1.2rem; color: #00ff41; font-weight: bold;">{result['line_count']}</div>
- <div style="font-size: 0.9rem; opacity: 0.8;">Lines</div>
- </div>
  </div>
- </div>
- """, unsafe_allow_html=True)
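The score-banding logic in the removed function above is a small pure mapping; extracted as a testable helper it reads (function name is illustrative, thresholds and colors are from the code above):

```python
def score_band(score: int) -> tuple[str, str]:
    """Map a 0-100 quality score to (status, hex color), mirroring the thresholds above."""
    if score >= 80:
        return "EXCELLENT", "#00ff41"   # bright matrix green
    if score >= 60:
        return "ACCEPTABLE", "#ffff00"  # yellow
    return "NEEDS_WORK", "#ff0000"      # red
```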
-
- # Summary
- if result.get('summary'):
- st.markdown("### 📋 Code Overview")
- st.markdown(f"""
- <div style="background: rgba(0,0,0,0.6); border: 1px solid #00ff41; border-radius: 10px;
- padding: 20px; margin: 20px 0;">
- <p style="color: #ffffff; font-size: 18px; line-height: 1.6; text-align: center; margin: 0;">
- {result['summary']}
- </p>
  </div>
- """, unsafe_allow_html=True)
-
- # Analysis results in horizontal blocks
- st.markdown("### 📊 Analysis Results")
-
- # Bug Detection Block (Full width)
- bug_items = result.get('bugs', [])
- if bug_items:
- bug_text = " • ".join(bug_items[:3]) # Join with bullets for horizontal reading
- else:
- bug_text = "No critical bugs detected • Code logic appears sound • Edge cases handled well"
-
- st.markdown(f"""
- <div style="background: linear-gradient(135deg, rgba(255,100,100,0.1), rgba(150,0,0,0.1));
- border: 2px solid #ff6b6b; border-radius: 15px; padding: 25px; margin: 15px 0;">
- <h3 style="color: #ff6b6b; margin-bottom: 15px; text-align: center; font-size: 1.4rem;">
- 🐛 Bug Detection
- </h3>
- <p style="color: #ffffff; font-size: 16px; line-height: 1.6; text-align: center; margin: 0;">
- {bug_text}
- </p>
- </div>
- """, unsafe_allow_html=True)
-
- # Security Vulnerabilities Block (Full width)
- security_items = result.get('security_vulnerabilities', [])
- if security_items:
- security_text = " • ".join(security_items[:3]) # Join with bullets for horizontal reading
- else:
- security_text = "No security vulnerabilities found • Follows security best practices • Input validation looks good"
-
- st.markdown(f"""
- <div style="background: linear-gradient(135deg, rgba(200,0,200,0.1), rgba(100,0,100,0.1));
- border: 2px solid #ff00ff; border-radius: 15px; padding: 25px; margin: 15px 0;">
- <h3 style="color: #ff00ff; margin-bottom: 15px; text-align: center; font-size: 1.4rem;">
- 🔒 Security Check
- </h3>
- <p style="color: #ffffff; font-size: 16px; line-height: 1.6; text-align: center; margin: 0;">
- {security_text}
- </p>
- </div>
- """, unsafe_allow_html=True)
-
- # Code Quality & Quick Fixes Block (Full width)
- quality_items = result.get('quality_issues', []) + result.get('quick_fixes', [])
- if quality_items:
- quality_text = " • ".join(quality_items[:3]) # Join with bullets for horizontal reading
- else:
- quality_text = "Code is well-structured • Good naming conventions • Easy to read and maintain"
-
- st.markdown(f"""
- <div style="background: linear-gradient(135deg, rgba(0,200,255,0.1), rgba(0,100,150,0.1));
- border: 2px solid #00ccff; border-radius: 15px; padding: 25px; margin: 15px 0;">
- <h3 style="color: #00ccff; margin-bottom: 15px; text-align: center; font-size: 1.4rem;">
- 📝 Code Quality
- </h3>
- <p style="color: #ffffff; font-size: 16px; line-height: 1.6; text-align: center; margin: 0;">
- {quality_text}
- </p>
- </div>
- """, unsafe_allow_html=True)
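The blocks above interpolate model-generated text straight into `unsafe_allow_html` f-strings. Since the new file imports `html`, a safer join helper would escape each finding first; a minimal sketch (the function name is illustrative, not from the app):

```python
import html

def render_block_text(items, fallback):
    """Join up to three findings with bullets, escaping them first so model
    output cannot inject markup into an unsafe_allow_html f-string."""
    if not items:
        return fallback
    return " • ".join(html.escape(i) for i in items[:3])
```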
-
- # Legacy sections (only if new sections are empty)
- if not result.get('bugs') and not result.get('security_vulnerabilities') and not result.get('quality_issues'):
- legacy_col1, legacy_col2 = st.columns(2)
-
- with legacy_col1:
- # Legacy strengths
- if result.get('strengths'):
- st.markdown("#### ✅ Strengths")
- for strength in result['strengths'][:3]:
- st.success(f"✓ {strength}")
-
- # Legacy issues
- if result.get('issues'):
- st.markdown("#### ⚠️ Issues")
- for issue in result['issues'][:3]:
- st.warning(f"! {issue}")
-
- with legacy_col2:
- # Legacy suggestions
- if result.get('suggestions'):
- st.markdown("#### 💡 Suggestions")
- for suggestion in result['suggestions'][:3]:
- st.info(f"→ {suggestion}")
-
- # Legacy security concerns
- if result.get('security_concerns'):
- st.markdown("#### 🔒 Security Concerns")
- for concern in result['security_concerns'][:3]:
- st.error(f"⚠ {concern}")
-
- # Expandable raw response (moved to bottom and less prominent)
- with st.expander("🔍 View Detailed Analysis", expanded=False):
- st.code(result.get('raw_response', 'NO_DATA_AVAILABLE'), language='text')
-
- def display_github_analysis_result(result: dict, model_name: str):
- """Display GitHub repository analysis result in clean, readable horizontal blocks."""
- if 'error' in result:
- st.error(f"🚨 GITHUB ANALYSIS ERROR: {result['error']}")
  return
-
- # Repository info in a clean header
- if result.get('repository_info'):
- repo_info = result['repository_info']
- st.markdown(f"""
- <div style="background: linear-gradient(135deg, rgba(0,255,65,0.15), rgba(0,255,65,0.05));
- border: 2px solid #00ff41; border-radius: 15px; padding: 25px; margin: 20px 0;
- text-align: center;">
- <h2 style="color: #00ff41; margin-bottom: 20px; font-size: 1.8rem;">
- 📦 {repo_info['name']}
- </h2>
- <div style="display: flex; justify-content: center; gap: 30px; flex-wrap: wrap; color: #ffffff;">
- <div style="text-align: center;">
- <div style="font-size: 1.5rem; color: #00ff41; font-weight: bold;">{repo_info['language']}</div>
- <div style="font-size: 0.9rem; opacity: 0.8;">Language</div>
- </div>
- <div style="text-align: center;">
- <div style="font-size: 1.5rem; color: #00ff41; font-weight: bold;">⭐ {repo_info['stars']}</div>
- <div style="font-size: 0.9rem; opacity: 0.8;">Stars</div>
- </div>
- <div style="text-align: center;">
- <div style="font-size: 1.5rem; color: #00ff41; font-weight: bold;">🔀 {repo_info['forks']}</div>
- <div style="font-size: 0.9rem; opacity: 0.8;">Forks</div>
- </div>
- </div>
- <p style="color: #ffffff; margin-top: 15px; font-style: italic;">
- "{repo_info['description']}"
- </p>
  </div>
- """, unsafe_allow_html=True)
-
- # Analysis results in horizontal blocks
- st.markdown("### 📊 Analysis Results")
-
- # Architecture Quality Block (Full width)
- arch_items = result.get('architecture_quality', [])
- if arch_items:
- arch_text = " • ".join(arch_items[:3]) # Join with bullets for horizontal reading
- else:
- arch_text = "Well-structured repository • Good organization • Follows best practices"
-
- st.markdown(f"""
- <div style="background: linear-gradient(135deg, rgba(0,255,65,0.1), rgba(0,100,30,0.1));
- border: 2px solid #00ff41; border-radius: 15px; padding: 25px; margin: 15px 0;">
- <h3 style="color: #00ff41; margin-bottom: 15px; text-align: center; font-size: 1.4rem;">
- 🏗️ Code Architecture
- </h3>
- <p style="color: #ffffff; font-size: 16px; line-height: 1.6; text-align: center; margin: 0;">
- {arch_text}
- </p>
- </div>
- """, unsafe_allow_html=True)
-
- # Critical Issues Block (Full width)
- critical_items = result.get('critical_issues', [])
- if critical_items:
- critical_text = " • ".join(critical_items[:3]) # Join with bullets for horizontal reading
- else:
- critical_text = "No major security vulnerabilities found • Code appears well-maintained • No critical bugs detected"
-
- st.markdown(f"""
- <div style="background: linear-gradient(135deg, rgba(255,100,100,0.1), rgba(150,0,0,0.1));
- border: 2px solid #ff6b6b; border-radius: 15px; padding: 25px; margin: 15px 0;">
- <h3 style="color: #ff6b6b; margin-bottom: 15px; text-align: center; font-size: 1.4rem;">
- 🚨 Critical Issues
- </h3>
- <p style="color: #ffffff; font-size: 16px; line-height: 1.6; text-align: center; margin: 0;">
- {critical_text}
- </p>
- </div>
- """, unsafe_allow_html=True)
-
- # Improvement Priorities Block (Full width)
- improvement_items = result.get('improvement_priorities', [])
- if improvement_items:
- improvement_text = " • ".join(improvement_items[:3]) # Join with bullets for horizontal reading
- else:
- improvement_text = "Add more comprehensive documentation • Consider adding automated tests • Enhance error handling"
-
- st.markdown(f"""
- <div style="background: linear-gradient(135deg, rgba(255,200,0,0.1), rgba(150,100,0,0.1));
- border: 2px solid #ffd700; border-radius: 15px; padding: 25px; margin: 15px 0;">
- <h3 style="color: #ffd700; margin-bottom: 15px; text-align: center; font-size: 1.4rem;">
- 🎯 Priority Improvements
- </h3>
- <p style="color: #ffffff; font-size: 16px; line-height: 1.6; text-align: center; margin: 0;">
- {improvement_text}
- </p>
- </div>
- """, unsafe_allow_html=True)
-
- # Summary section at the bottom (if available)
- if result.get('project_overview'):
- st.markdown("### 💡 Key Insights")
- st.markdown(f"""
- <div style="background: rgba(0,0,0,0.6); border: 1px solid #00ff41; border-radius: 10px;
- padding: 20px; margin: 20px 0;">
- <p style="color: #ffffff; font-size: 18px; line-height: 1.6; text-align: center; margin: 0;">
- {result['project_overview']}
- </p>
  </div>
- """, unsafe_allow_html=True)
-
- # Expandable raw response (moved to bottom and less prominent)
- with st.expander("🔍 View Detailed Analysis", expanded=False):
- st.code(result.get('raw_response', 'NO_DATA_AVAILABLE'), language='text')
-
- # Header with Matrix effect
- st.markdown("""
- <h1 style="text-align: center;">
- 🤖 AI Code Analyzer
- </h1>
- <div style="text-align: center; margin-bottom: 30px;">
- <p style="color: #00ff41; font-family: 'Orbitron', monospace; font-size: 20px; margin-bottom: 10px;">
- <strong>Powered by Advanced AI Models</strong>
- </p>
- <div style="display: flex; justify-content: center; gap: 20px; flex-wrap: wrap;">
- <span style="background: rgba(0,255,65,0.2); border: 1px solid #00ff41; padding: 8px 16px;
- border-radius: 25px; font-size: 14px; color: #00ff41;">
- ✨ Bug Detection
- </span>
- <span style="background: rgba(0,255,65,0.2); border: 1px solid #00ff41; padding: 8px 16px;
- border-radius: 25px; font-size: 14px; color: #00ff41;">
- 🔒 Security Analysis
- </span>
- <span style="background: rgba(0,255,65,0.2); border: 1px solid #00ff41; padding: 8px 16px;
- border-radius: 25px; font-size: 14px; color: #00ff41;">
- 📦 GitHub Integration
- </span>
- <span style="background: rgba(0,255,65,0.2); border: 1px solid #00ff41; padding: 8px 16px;
- border-radius: 25px; font-size: 14px; color: #00ff41;">
- ⚡ Instant Results
- </span>
  </div>
- </div>
- """, unsafe_allow_html=True)

- # Sidebar - The Matrix Control Panel
- with st.sidebar:
- st.markdown("### 🎛️ Control Panel")
-
- # Model status
- st.markdown("#### Available AI Models")
- available_models = analyzer.available_models
-
- if not available_models:
- st.error("❌ No AI models detected")
- st.info("Please configure API keys in .env file")
- st.stop()
-
- # Display available models with modern styling
- for model, display_name in available_models.items():
- st.markdown(f"""
- <div style="background: rgba(0, 255, 65, 0.1); border: 1px solid #00ff41; padding: 10px; margin: 8px 0; border-radius: 8px;">
- ✅ <strong>{display_name}</strong> <span style="color: #00ff41; font-size: 12px;">[Ready]</span>
- </div>
- """, unsafe_allow_html=True)
-
- st.markdown("---")
-
- # Analysis Mode Selection
- st.markdown("#### Analysis Mode")
- analysis_mode = st.radio(
- "Choose what to analyze",
- ["Code Analysis", "GitHub Repository"],
- format_func=lambda x: f"📝 {x}" if x == "Code Analysis" else f"📦 {x}"
  )

- # Model Selection
- st.markdown("#### 🤖 AI Model Selection")
- model_choice = st.radio(
- "Choose Analysis Model:",
- [
- "CodeT5+ (Fast - Local)",
- "Fine-tuned DeepSeek (Accurate - Remote)"
- ],
- help="Local models run on your computer, Remote model runs on Hugging Face (always available)"
  )
-
- # Remote model configuration
- remote_api_url = None
- if "Remote" in model_choice:
- st.markdown("#### 🌐 Remote Model Configuration")
- remote_api_url = st.text_input(
- "Hugging Face Space URL:",
- value="https://arun3676-fine-tuned-code-analyzer.hf.space",
- help="Your Hugging Face Space URL"
- )
-
- # Test connection
- if st.button("🔗 Test Connection"):
- try:
- import requests
- response = requests.get(f"{remote_api_url}/health", timeout=5)
- if response.status_code == 200:
- st.success("✅ Connected to remote model!")
- else:
- st.error("❌ Connection failed")
- except Exception:
- st.error("❌ Cannot reach remote model")
-
- # Local model toggle and preset (for CodeT5+)
- if "CodeT5+" in model_choice:
- use_local = True
- local_preset = st.selectbox(
- "Local Inference Mode",
- ["Quick", "Detailed"],
- index=0,
- help="Quick = beams 1, ~180 tokens. Detailed = beams 2, ~240 tokens.",
  )
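The connection test above swallows every error with a broad except. A narrower health-check helper can be sketched with the stdlib alone (`urllib` instead of `requests`; the function name is illustrative):

```python
import urllib.error
import urllib.request

def check_health(base_url: str, timeout: float = 5.0) -> bool:
    """Return True if GET {base_url}/health answers 200 within the timeout."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, TimeoutError, OSError):
        # unreachable host, refused connection, DNS failure, or timeout
        return False
```

Catching only network-related exceptions keeps genuine programming errors (typos, bad f-strings) visible instead of being reported as "cannot reach remote model".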
  else:
- use_local = False
- local_preset = "Detailed"
-
- if analysis_mode == "GitHub Repository":
- st.markdown("#### Repository Analysis")
- github_url = st.text_input(
- "GitHub URL",
- placeholder="https://github.com/owner/repo",
- help="Enter a GitHub repository URL for analysis"
- )
-
- analyze_github_button = st.button(
- "🔍 Analyze Repository",
- type="primary",
- help="Analyze GitHub repository structure and code"
- )
-
- st.markdown("---")
-
- # Analysis options
- st.markdown("#### Analysis Settings")
-
- # Model selector with modern styling (disabled when using local)
- selected_model = st.selectbox(
- "Choose AI Model",
- options=list(available_models.keys()),
- format_func=lambda x: f"🤖 {available_models[x]}",
- disabled=use_local,
  )
-
- # Multi-model analysis toggle
- analyze_all = st.checkbox("🔄 Compare Multiple Models", value=False)
-
- # Language selection
- languages = ["auto-detect", "python", "javascript", "java", "cpp", "csharp", "go", "rust", "php", "ruby", "swift", "kotlin"]
- selected_language = st.selectbox(
- "Programming Language",
- languages,
- format_func=lambda x: "🔍 Auto-Detect" if x == "auto-detect" else f"💻 {x.upper()}"
  )
-
- st.markdown("---")
-
- # Sample code injection
- st.markdown("#### Quick Start Examples")
-
- if st.button("🐍 Try Python Example"):
- st.session_state.code_input = """def calculate_total(items):
- total = 0
- for item in items:
- total += item.price # Potential AttributeError
- return total
-
- # Missing validation
- items = None
- result = calculate_total(items) # This will crash
- print(f"Total: {result}")
- """
-
- if st.button("🌐 Try JavaScript Example"):
- st.session_state.code_input = """function processUser(user) {
- // Security issue: no input validation
- document.innerHTML = user.name; // XSS vulnerability
-
- // Logic error: undefined check
- if (user.age > 18) {
- return user.permissions.admin; // Potential TypeError
- }
-
- return false;
- }
 
- // Missing error handling
- const userData = getUser(); // Could be undefined
- processUser(userData);
- """

- # Main Terminal Interface
- col1, col2 = st.columns([1, 1])
-
- with col1:
- if analysis_mode == "Code Analysis":
- st.markdown("### 📝 Code Input")
-
- # File upload section
- st.markdown("#### 📁 Upload File")
- uploaded_file = st.file_uploader(
- "Choose a code file",
- type=['py', 'js', 'java', 'cpp', 'c', 'cs', 'go', 'rs', 'php', 'rb', 'swift', 'kt', 'txt'],
- help="Upload code files for AI analysis"
- )
-
- code_from_file = ""
- if uploaded_file is not None:
- # Read file content
- try:
- code_from_file = str(uploaded_file.read(), "utf-8")
- file_size = len(code_from_file)
- file_lines = len(code_from_file.splitlines())
-
- st.markdown(f"""
- <div class="file-info">
- ✅ <strong>File Uploaded Successfully</strong><br>
- 📄 <strong>Name:</strong> {uploaded_file.name}<br>
- 📏 <strong>Size:</strong> {file_size} bytes<br>
- 📊 <strong>Lines:</strong> {file_lines}<br>
- 🔍 <strong>Status:</strong> Ready for analysis
- </div>
- """, unsafe_allow_html=True)
-
- # Auto-populate the text area
- st.session_state.code_input = code_from_file
-
- except UnicodeDecodeError:
- st.error("🚨 File encoding error: Please use UTF-8 encoded files")
- except Exception as e:
- st.error(f"🚨 File read error: {str(e)}")
-
- # Code input with modern styling
- code_input = st.text_area(
- "Or paste your code here",
- value=st.session_state.get('code_input', ''),
- height=350,
- key="code_input",
- help="Paste code directly or upload file above"
- )
-
- # Modern analyze button
- analyze_button = st.button(
- "🚀 Analyze Code",
- type="primary",
- help="Analyze your code with AI"
  )
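The upload handler above rejects any non-UTF-8 file outright. An alternative is a decode helper with a Latin-1 fallback, since Latin-1 maps every byte to a character and so never raises (helper name is illustrative, not from the app):

```python
def decode_upload(raw: bytes) -> str:
    """Decode uploaded file bytes as UTF-8, falling back to Latin-1 so the
    app can still display something instead of failing on non-UTF-8 files."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("latin-1")
```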
-
- else: # GitHub Repository mode
- st.markdown("### 📦 GitHub Analysis")
-
- if 'github_url' in locals():
- if github_url:
- st.markdown(f"""
- <div class="file-info">
- ✅ <strong>Repository Detected</strong><br>
- 📦 <strong>URL:</strong> {github_url}<br>
- 🔍 <strong>Status:</strong> Ready for analysis
- </div>
- """, unsafe_allow_html=True)
-
- st.markdown("""
- #### 📋 What We'll Analyze
-
- **Repository Analysis includes:**
- - 🏗️ Project structure and organization
- - 📄 Key files (README, package.json, main source files)
- - 🎯 Code quality and architecture assessment
- - 🔒 Security vulnerabilities across the codebase
- - 💡 Best practices and improvement suggestions
-
- **Note:** Only public repositories can be analyzed.
- """)
-
- # Show sample repos
- st.markdown("#### 📚 Try These Sample Repositories")
- sample_repos = [
- "https://github.com/microsoft/vscode",
- "https://github.com/facebook/react",
- "https://github.com/python/cpython"
- ]
-
- for repo in sample_repos:
- if st.button(f"📦 {repo.split('/')[-1]}", key=repo):
- st.session_state.github_url_input = repo
-
- # Results Terminal
- with col2:
- st.markdown("### 📊 Analysis Results")
-
- # Code Analysis Results
- if analysis_mode == "Code Analysis":
- if analyze_button:
- if not code_input.strip():
- st.error("🚨 Please enter some code to analyze or upload a file!")
- else:
- with st.spinner("🟢 Analyzing your code... Please wait..."):
- if analyze_all:
- # Multi-model analysis
- st.markdown("#### 🔄 MULTI_NETWORK_ANALYSIS_INITIATED")
-
- results = analyzer.analyze_with_all_models(
- code_input,
- selected_language if selected_language != "auto-detect" else None
- )
-
- # Display comparison metrics
- comparison = analyzer.compare_analyses(results)
-
- # Matrix-styled metrics
- metrics_cols = st.columns(4)
- with metrics_cols[0]:
- st.metric("AVG_SCORE", f"{comparison['average_score']}/100")
- with metrics_cols[1]:
- st.metric("NETWORKS", len(results))
- with metrics_cols[2]:
- st.metric("PEAK_SCORE", f"{max(comparison['model_scores'].values())}/100")
- with metrics_cols[3]:
- st.metric("SCAN_TIME", f"{comparison['analysis_time']:.1f}s")
-
- # Create tabs for each neural network
- tab_names = [f"🤖 {available_models[key]}" for key in results.keys()]
- tabs = st.tabs(tab_names)
-
- for idx, (model_key, result) in enumerate(results.items()):
- with tabs[idx]:
- display_matrix_analysis_result(result, available_models[model_key])
-
- # Consensus findings with Matrix styling
- if comparison.get('consensus_bugs') or comparison.get('consensus_security'):
- st.markdown("### 🤝 NEURAL_CONSENSUS_DETECTED")
- st.markdown("""
- <div style="background: rgba(255, 0, 0, 0.1); border: 2px solid #ff0000; padding: 15px; border-radius: 10px;">
- <strong>CRITICAL_PATTERNS_IDENTIFIED_BY_MULTIPLE_NETWORKS:</strong>
- </div>
- """, unsafe_allow_html=True)
-
- if comparison.get('consensus_bugs'):
- st.markdown("#### 🐛 CONSENSUS_BUGS")
- for bug in comparison['consensus_bugs']:
- st.error(f"🚨 MULTIPLE_MODELS: {bug}")
-
- if comparison.get('consensus_security'):
- st.markdown("#### 🔒 CONSENSUS_SECURITY")
- for vuln in comparison['consensus_security']:
- st.error(f"🚨 SECURITY_ALERT: {vuln}")
-
- else:
- # Single model analysis
- if use_local or "Remote" in model_choice:
- # Determine model type and display name
- if "Remote" in model_choice:
- st.markdown("#### 🤖 FINE-TUNED_DEEPSEEK_REMOTE_ANALYSIS")
- model_type = "deepseek-finetuned-remote"
- display_name = "Fine-tuned DeepSeek (Remote)"
- else:
- st.markdown("#### 🤖 CODET5+_LOCAL_ANALYSIS")
- model_type = "codet5"
- display_name = "CodeT5+ Local"
-
- local = get_local_analyzer(model_type, remote_api_url)
-
- if local_preset == "Quick" or "Remote" in model_choice:
- result = local.analyze_code_fast(code_input, mode="quick")
- # adapt to display format
- display_matrix_analysis_result({
- "quality_score": result.get("quality_score", 0),
- "summary": "",
- "bugs": [],
- "quality_issues": [],
- "security_vulnerabilities": [],
- "quick_fixes": [],
- "language": "auto",
- "line_count": len(code_input.splitlines()),
- "raw_response": result["analysis"],
- }, f"{display_name} (Quick)")
- else:
- # streaming path – consume generator and show final
- local = get_local_analyzer(model_type, remote_api_url)
- final_text = None
- for chunk in local.analyze_code_streaming(code_input, show_progress=True, mode="detailed"):
- final_text = chunk
- display_matrix_analysis_result({
- "quality_score": 0,
- "summary": "",
- "bugs": [],
- "quality_issues": [],
- "security_vulnerabilities": [],
- "quick_fixes": [],
- "language": "auto",
- "line_count": len(code_input.splitlines()),
- "raw_response": final_text or "",
- }, f"{display_name} (Detailed)")
- else:
- st.markdown(f"#### 🤖 {available_models[selected_model].upper()}_ANALYSIS")
- result = analyzer.analyze_code(
- code_input,
- selected_model,
- selected_language if selected_language != "auto-detect" else None
- )
- display_matrix_analysis_result(result, available_models[selected_model])
-
- # GitHub Analysis Results
- else: # GitHub Repository mode
- if 'analyze_github_button' in locals() and analyze_github_button:
- if not github_url.strip():
- st.error("🚨 Please enter a GitHub repository URL!")
- else:
- with st.spinner("🟢 Analyzing GitHub repository... Please wait..."):
- result = analyzer.analyze_github_repo(github_url, selected_model)
- display_github_analysis_result(result, available_models[selected_model])
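The `consensus_bugs`/`consensus_security` fields surfaced above capture findings that several models agree on. The underlying idea can be sketched as counting distinct models per finding (function name and `min_models` threshold are illustrative, not from `compare_analyses`):

```python
def consensus(findings_by_model: dict[str, list[str]], min_models: int = 2) -> list[str]:
    """Return findings reported by at least `min_models` of the models."""
    counts: dict[str, int] = {}
    for findings in findings_by_model.values():
        for f in set(findings):  # count each model at most once per finding
            counts[f] = counts.get(f, 0) + 1
    return sorted(f for f, n in counts.items() if n >= min_models)
```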
-
963
- # Instructions for new users
964
- if (analysis_mode == "Code Analysis" and not code_input.strip() and not analyze_button) or \
965
- (analysis_mode == "GitHub Repository" and ('github_url' not in locals() or not github_url.strip()) and ('analyze_github_button' not in locals() or not analyze_github_button)):
966
-
967
- st.markdown("""
968
- <div class="matrix-terminal" style="margin: 20px 0; text-align: center;">
969
- <h2 style="color: #00ff41; margin-bottom: 30px; font-size: 2.5rem;">
970
- 🤖 AI Code Analyzer
971
- </h2>
972
- <p style="color: #00ff41; font-size: 1.3rem; margin-bottom: 30px;">
973
- <strong>What can I do for you?</strong>
974
- </p>
975
- </div>
976
- """, unsafe_allow_html=True)
977
-
978
- # Feature cards in columns
979
- feature_col1, feature_col2 = st.columns(2)
980
-
981
- with feature_col1:
982
- st.markdown("""
983
- <div style="background: linear-gradient(135deg, rgba(0,255,65,0.1), rgba(0,255,65,0.05));
984
- border: 2px solid #00ff41; border-radius: 15px; padding: 25px; margin: 10px 0;
985
- box-shadow: 0 0 20px rgba(0,255,65,0.3);">
986
- <h3 style="color: #00ff41; margin-bottom: 15px;">🐛 Find Bugs Instantly</h3>
987
- <p style="color: #ffffff; font-size: 16px; line-height: 1.6;">
988
- Spot crashes, logical errors, and edge cases before they hit production.
989
- </p>
990
- </div>
991
- """, unsafe_allow_html=True)
992
-
993
- st.markdown("""
994
- <div style="background: linear-gradient(135deg, rgba(0,255,65,0.1), rgba(0,255,65,0.05));
995
- border: 2px solid #00ff41; border-radius: 15px; padding: 25px; margin: 10px 0;
996
- box-shadow: 0 0 20px rgba(0,255,65,0.3);">
997
- <h3 style="color: #00ff41; margin-bottom: 15px;">🔒 Security Scanner</h3>
998
- <p style="color: #ffffff; font-size: 16px; line-height: 1.6;">
999
- Detect vulnerabilities like SQL injection, XSS, and insecure data handling.
1000
- </p>
1001
- </div>
1002
- """, unsafe_allow_html=True)
1003
-
1004
- with feature_col2:
1005
- st.markdown("""
1006
- <div style="background: linear-gradient(135deg, rgba(0,255,65,0.1), rgba(0,255,65,0.05));
1007
- border: 2px solid #00ff41; border-radius: 15px; padding: 25px; margin: 10px 0;
1008
- box-shadow: 0 0 20px rgba(0,255,65,0.3);">
1009
- <h3 style="color: #00ff41; margin-bottom: 15px;">📝 Code Quality Check</h3>
1010
- <p style="color: #ffffff; font-size: 16px; line-height: 1.6;">
1011
- Improve readability, maintainability, and follow best practices.
1012
- </p>
1013
- </div>
1014
- """, unsafe_allow_html=True)
1015
-
1016
- st.markdown("""
1017
- <div style="background: linear-gradient(135deg, rgba(0,255,65,0.1), rgba(0,255,65,0.05));
1018
- border: 2px solid #00ff41; border-radius: 15px; padding: 25px; margin: 10px 0;
1019
- box-shadow: 0 0 20px rgba(0,255,65,0.3);">
1020
- <h3 style="color: #00ff41; margin-bottom: 15px;">📦 Repository Analysis</h3>
1021
- <p style="color: #ffffff; font-size: 16px; line-height: 1.6;">
1022
- Analyze entire GitHub repos for structure, issues, and improvements.
1023
- </p>
1024
- </div>
1025
- """, unsafe_allow_html=True)
1026
-
1027
- # How to get started section
1028
- st.markdown("""
1029
- <div style="background: rgba(0,0,0,0.7); border: 1px solid #00ff41; border-radius: 10px;
1030
- padding: 20px; margin: 30px 0; text-align: center;">
1031
- <h3 style="color: #00ff41; margin-bottom: 20px;">🚀 Get Started in 3 Steps</h3>
1032
- <div style="display: flex; justify-content: space-around; flex-wrap: wrap;">
1033
- <div style="margin: 10px; color: #ffffff;">
1034
- <div style="background: #00ff41; color: #000; border-radius: 50%; width: 40px; height: 40px;
1035
- display: flex; align-items: center; justify-content: center; margin: 0 auto 10px;
1036
- font-weight: bold; font-size: 20px;">1</div>
1037
- <p><strong>Upload</strong><br>Paste code or GitHub URL</p>
1038
- </div>
1039
- <div style="margin: 10px; color: #ffffff;">
1040
- <div style="background: #00ff41; color: #000; border-radius: 50%; width: 40px; height: 40px;
1041
- display: flex; align-items: center; justify-content: center; margin: 0 auto 10px;
1042
- font-weight: bold; font-size: 20px;">2</div>
1043
- <p><strong>Choose AI</strong><br>Pick your preferred model</p>
1044
- </div>
1045
- <div style="margin: 10px; color: #ffffff;">
1046
- <div style="background: #00ff41; color: #000; border-radius: 50%; width: 40px; height: 40px;
1047
- display: flex; align-items: center; justify-content: center; margin: 0 auto 10px;
1048
- font-weight: bold; font-size: 20px;">3</div>
1049
- <p><strong>Analyze</strong><br>Get instant results</p>
1050
- </div>
1051
- </div>
1052
- </div>
1053
- """, unsafe_allow_html=True)
1054
-
1055
- # Supported languages as badges
1056
- st.markdown("""
1057
- <div style="text-align: center; margin: 20px 0;">
1058
- <p style="color: #00ff41; font-size: 18px; margin-bottom: 15px;"><strong>Supported Languages:</strong></p>
1059
- <div style="display: flex; justify-content: center; flex-wrap: wrap; gap: 10px;">
1060
- <span style="background: linear-gradient(45deg, #00ff41, #00cc33); color: #000; padding: 8px 15px;
1061
- border-radius: 20px; font-weight: bold; font-size: 14px;">Python</span>
1062
- <span style="background: linear-gradient(45deg, #00ff41, #00cc33); color: #000; padding: 8px 15px;
1063
- border-radius: 20px; font-weight: bold; font-size: 14px;">JavaScript</span>
1064
- <span style="background: linear-gradient(45deg, #00ff41, #00cc33); color: #000; padding: 8px 15px;
1065
- border-radius: 20px; font-weight: bold; font-size: 14px;">Java</span>
1066
- <span style="background: linear-gradient(45deg, #00ff41, #00cc33); color: #000; padding: 8px 15px;
1067
- border-radius: 20px; font-weight: bold; font-size: 14px;">C++</span>
1068
- <span style="background: linear-gradient(45deg, #00ff41, #00cc33); color: #000; padding: 8px 15px;
1069
- border-radius: 20px; font-weight: bold; font-size: 14px;">Go</span>
1070
- <span style="background: linear-gradient(45deg, #00ff41, #00cc33); color: #000; padding: 8px 15px;
1071
- border-radius: 20px; font-weight: bold; font-size: 14px;">Rust</span>
1072
- <span style="background: linear-gradient(45deg, #00ff41, #00cc33); color: #000; padding: 8px 15px;
1073
- border-radius: 20px; font-weight: bold; font-size: 14px;">+ More</span>
1074
- </div>
1075
- </div>
1076
- """, unsafe_allow_html=True)
1077
-
1078
- # Footer with Matrix signature
1079
- st.markdown("---")
1080
- st.markdown("""
1081
- <div style="text-align: center; color: #00ff41; font-family: 'Share Tech Mono', monospace; padding: 20px;">
1082
- <p>🟢 POWERED_BY_NEURAL_NETWORKS • OPENAI • ANTHROPIC • DEEPSEEK • GOOGLE 🟢</p>
1083
- <p><em>"There is no spoon... only code to analyze."</em></p>
1084
- <p style="font-size: 12px;">MATRIX_ANALYZER_v3.0 • BUILD_2024 • SECURITY_CLEARANCE_OMEGA • FILE_UPLOAD_ENABLED</p>
1085
- </div>
1086
- """, unsafe_allow_html=True)
 
1
+ import streamlit as st
2
  from dotenv import load_dotenv
3
+ import streamlit.components.v1 as components
4
  from analyzer import CodeAnalyzer
5
+ from typing import Any, Iterable, List
6
+ import os
7
+ import html
8
 
9
+ # Force reload environment variables
10
+ load_dotenv(override=True)
11
 
 
12
  st.set_page_config(
13
+ page_title="Matrix Code Analyzer",
14
+ page_icon="🧠",
15
+ layout="wide"
 
16
  )
17
 
18
+ # Debug sidebar removed for cleaner UI
19
+
20
+ CUSTOM_CSS = """
21
  <style>
22
  @import url('https://fonts.googleapis.com/css2?family=Share+Tech+Mono:wght@400&display=swap');
23
  @import url('https://fonts.googleapis.com/css2?family=Orbitron:wght@400;700;900&display=swap');
24
+ @import url('https://fonts.googleapis.com/css2?family=Space+Grotesk:wght@400;500;600;700&display=swap');
25
+
26
+ :root {
27
+ --matrix-green: #00ff41; /* Brighter Matrix Green */
28
+ --matrix-blue: #0ccffa;
29
+ --matrix-bg: #000000; /* Solid black for deeper matrix feel */
30
+ --text-main: #e7fceb;
31
+ --text-muted: #94a9a0;
32
+ --panel-bg: rgba(6, 16, 28, 0.55);
33
+ --panel-border: rgba(0, 255, 65, 0.25); /* Green border for panels */
34
+ }
35
+
36
  .stApp {
37
+ background: var(--matrix-bg);
38
+ color: var(--text-main);
39
+ font-family: 'Share Tech Mono', monospace; /* Default text to matrix font */
40
+ }
41
+
42
+ .stApp::before {
43
+ content: '';
44
  position: fixed;
45
+ inset: 0;
46
+ background: linear-gradient(135deg, rgba(0, 255, 130, 0.05) 0%, transparent 55%),
47
+ linear-gradient(225deg, rgba(12, 207, 250, 0.06) 0%, transparent 60%);
48
+ z-index: -3;
49
+ }
50
+
51
+ .stApp::after {
52
+ content: '';
53
+ position: fixed;
54
+ inset: 0;
55
+ background: radial-gradient(circle at 18% 20%, rgba(0, 255, 140, 0.08), transparent 58%),
56
+ radial-gradient(circle at 78% 12%, rgba(12, 207, 250, 0.07), transparent 48%);
57
+ z-index: -2;
58
+ }
59
+
60
+ [data-testid="stAppViewContainer"] {
61
+ position: relative;
62
+ z-index: 0;
63
+ }
64
+
65
+ html, body {
66
+ overflow-y: auto !important;
67
+ height: auto !important;
68
+ }
69
+
70
+ [data-testid="stAppViewContainer"] {
71
+ overflow-y: auto !important;
72
+ height: auto !important;
73
+ }
74
+
75
+ .stApp {
76
+ overflow-y: auto !important;
77
+ height: auto !important;
78
+ }
79
+
80
+ .block-container {
81
+ overflow-y: visible !important;
82
+ }
83
+
84
+ /* Ensure main content can scroll */
85
+ [data-testid="stAppViewContainer"] > div {
86
+ overflow-y: auto !important;
87
+ height: auto !important;
88
+ }
89
+
90
+ /* Override any Streamlit default height constraints */
91
+ .stApp > div {
92
+ height: auto !important;
93
+ min-height: 100vh !important;
94
+ }
95
+
96
+ .matrix-rain {
97
+ position: fixed;
98
+ inset: 0;
99
  overflow: hidden;
100
+ pointer-events: none;
101
  z-index: -1;
 
102
  }
103
 
104
+ .matrix-rain span {
105
  position: absolute;
106
+ top: -10%;
107
+ color: rgba(0, 255, 65, 0.45); /* Brighter matrix rain */
108
  font-family: 'Share Tech Mono', monospace;
109
  font-size: 14px;
110
+ animation: matrixFall linear infinite;
111
+ text-shadow: 0 0 10px rgba(0, 255, 65, 0.6); /* Stronger glow */
112
+ }
113
+
114
+ @keyframes matrixFall {
115
+ from { transform: translateY(-10%); opacity: 0; }
116
+ 20% { opacity: 1; }
117
+ to { transform: translateY(110vh); opacity: 0; }
118
+ }
119
+
120
+ [data-testid="stHeader"] {
121
+ background: transparent;
122
+ border-bottom: none;
123
+ }
124
+
125
+ .block-container {
126
+ padding-top: 3rem;
127
+ padding-bottom: 3.5rem;
128
+ max-width: 1140px;
129
+ }
130
+
131
+ a {
132
+ color: var(--matrix-green);
133
+ text-decoration: none;
134
+ }
135
+
136
+ a:hover {
137
+ text-decoration: underline;
138
+ }
139
+
140
+ .glass-panel {
141
+ background: rgba(0, 0, 0, 0.7);
142
+ border: 1px solid var(--panel-border);
143
+ border-radius: 12px; /* Slightly less rounded for matrix look */
144
+ padding: 28px;
145
+ backdrop-filter: blur(8px); /* Softer blur */
146
+ box-shadow: 0 0 25px rgba(0, 255, 65, 0.2); /* Green glow for panels */
147
+ margin-bottom: 1.6rem;
148
+ }
149
+
150
+ .hero {
151
+ text-align: center;
152
+ position: relative;
153
+ overflow: hidden;
154
+ }
155
+
156
+ .hero::before {
157
+ content: '';
158
+ position: absolute;
159
+ inset: 0;
160
+ opacity: 0.3;
161
+ background: linear-gradient(130deg, rgba(0, 255, 65, 0.15), rgba(0, 200, 255, 0.08));
162
+ pointer-events: none;
163
+ }
164
+
165
+ .hero-label {
166
+ font-family: 'Orbitron', sans-serif; /* Use Orbitron for hero label */
167
+ font-size: 0.9rem;
168
+ letter-spacing: 0.4em;
169
+ text-transform: uppercase;
170
+ color: var(--matrix-green);
171
+ margin-bottom: 0.6rem;
172
+ text-shadow: 0 0 8px rgba(0, 255, 65, 0.5);
173
+ }
174
+
175
+ .hero h1 {
176
+ font-family: 'Orbitron', sans-serif; /* Use Orbitron for hero title */
177
+ margin-bottom: 0.6rem;
178
+ font-size: clamp(2.5rem, 4vw, 3.8rem);
179
+ font-weight: 700;
180
+ letter-spacing: -0.05em;
181
+ position: relative;
182
+ z-index: 1;
183
+ color: var(--matrix-green);
184
+ text-shadow: 0 0 15px rgba(0, 255, 65, 0.8), 0 0 25px rgba(0, 255, 65, 0.6);
185
+ }
186
+
187
+ .hero .subline {
188
+ font-family: 'Share Tech Mono', monospace; /* Use Share Tech Mono for subline */
189
+ color: rgba(231, 252, 235, 0.8);
190
+ font-size: 1.1rem;
191
+ max-width: 700px;
192
+ margin: 0 auto;
193
+ line-height: 1.7;
194
+ position: relative;
195
+ z-index: 1;
196
+ }
197
+
198
+ .hero-badges {
199
+ display: flex;
200
+ gap: 0.85rem;
201
+ justify-content: center;
202
+ flex-wrap: wrap;
203
+ margin-top: 2rem;
204
+ position: relative;
205
+ z-index: 1;
206
+ }
207
+
208
+ .hero-badge {
209
+ padding: 0.4rem 1rem;
210
+ border-radius: 999px;
211
+ background: rgba(0, 255, 65, 0.15);
212
+ border: 1px solid rgba(0, 255, 65, 0.4);
213
+ color: var(--matrix-green);
214
+ font-size: 0.9rem;
215
+ font-weight: 500;
216
+ text-shadow: 0 0 5px rgba(0, 255, 65, 0.4);
217
+ }
218
+
219
+ .body-text {
220
+ color: var(--text-main);
221
+ font-family: 'Share Tech Mono', monospace;
222
+ font-size: 1rem;
223
+ line-height: 1.7;
224
+ margin: 0;
225
+ }
226
+
227
+ .settings-grid {
228
+ display: grid;
229
+ grid-template-columns: repeat(auto-fit, minmax(220px, 1fr));
230
+ gap: 1.2rem;
231
+ }
232
+
233
  .stSelectbox > div > div {
234
+ background: rgba(0, 0, 0, 0.8);
235
+ border-radius: 10px;
236
+ border: 1px solid rgba(0, 255, 65, 0.3);
237
+ color: var(--text-main);
238
+ font-family: 'Share Tech Mono', monospace;
239
+ }
240
+
241
+ .stSelectbox label, .stTextInput label, .stFileUploader label {
242
+ font-family: 'Orbitron', sans-serif;
243
+ font-weight: 600;
244
+ color: var(--matrix-green);
245
+ text-shadow: 0 0 5px rgba(0, 255, 65, 0.3);
246
+ }
247
+
248
+ .stTextInput > div > div > input {
249
+ border-radius: 10px;
250
+ background: rgba(0, 0, 0, 0.8);
251
+ border: 1px solid rgba(0, 255, 65, 0.25);
252
+ color: var(--text-main);
253
+ font-family: 'Share Tech Mono', monospace;
254
+ }
255
+
256
+ textarea {
257
+ border-radius: 12px !important;
258
+ background: rgba(0, 0, 0, 0.85) !important;
259
+ border: 1px solid rgba(0, 255, 65, 0.3) !important;
260
+ color: var(--text-main) !important;
261
+ font-family: 'Share Tech Mono', monospace !important;
262
+ resize: vertical;
263
+ }
264
+
265
+ textarea::placeholder,
266
+ .stTextInput > div > div > input::placeholder {
267
+ color: rgba(0, 255, 65, 0.5) !important;
268
+ }
269
+
270
+ .stFileUploader > div {
271
+ border-radius: 12px !important;
272
+ border: 2px dashed rgba(0, 255, 65, 0.5) !important;
273
+ background: rgba(0, 0, 0, 0.7) !important;
274
+ padding: 1.5rem 2rem !important;
275
+ }
276
+
277
+ .stButton > button {
278
+ height: 50px;
279
+ border-radius: 12px;
280
+ font-weight: 700;
281
+ letter-spacing: 0.05em;
282
+ background: linear-gradient(120deg, var(--matrix-green), rgba(0, 200, 255, 0.8));
283
+ color: #000;
284
+ border: none;
285
+ box-shadow: 0 0 20px rgba(0, 255, 65, 0.6);
286
+ text-transform: uppercase;
287
+ transition: all 0.3s ease;
288
+ }
289
+
290
+ .stButton > button:hover {
291
+ transform: translateY(-2px);
292
+ box-shadow: 0 0 35px rgba(0, 255, 65, 0.9);
293
+ background: linear-gradient(120deg, rgba(0, 255, 65, 0.95), rgba(0, 220, 255, 0.95));
294
+ }
295
+
296
  .stTabs [data-baseweb="tab-list"] {
297
+ background: rgba(0, 0, 0, 0.6);
298
+ border-radius: 12px;
299
+ padding: 0.5rem;
300
+ border: 1px solid rgba(0, 255, 65, 0.2);
301
  }
302
 
303
  .stTabs [data-baseweb="tab"] {
304
+ border-radius: 8px;
305
+ font-weight: 500;
306
+ color: rgba(0, 255, 65, 0.7);
307
+ font-family: 'Orbitron', sans-serif;
308
+ padding: 0.75rem 1.2rem;
309
  }
310
 
311
  .stTabs [aria-selected="true"] {
312
+ background: rgba(0, 255, 65, 0.1);
313
+ color: var(--matrix-green);
314
+ text-shadow: 0 0 5px rgba(0, 255, 65, 0.5);
315
+ }
 
316
 
317
+ .stTabs [data-baseweb="tab-panel"] {
318
+ background: rgba(0, 0, 0, 0.75);
319
+ border: 1px solid rgba(0, 255, 65, 0.2);
320
+ border-radius: 12px;
321
+ padding: 28px;
322
+ margin-top: 1.5rem;
323
+ backdrop-filter: blur(8px);
324
+ box-shadow: 0 0 25px rgba(0, 255, 65, 0.2);
325
+ }
326
 
327
+ .stats-grid {
328
+ display: grid;
329
+ grid-template-columns: repeat(auto-fit, minmax(160px, 1fr));
330
+ gap: 1.2rem;
331
+ }
332
+
333
+ .stats-card {
334
+ background: rgba(0, 0, 0, 0.6);
335
+ border: 1px solid rgba(0, 255, 65, 0.15);
336
+ border-radius: 12px;
337
+ padding: 1.5rem;
338
+ display: flex;
339
+ flex-direction: column;
340
+ gap: 0.4rem;
341
+ box-shadow: 0 0 15px rgba(0, 255, 65, 0.1);
342
+ }
343
+
344
+ .stats-card .label {
345
+ color: rgba(0, 255, 65, 0.8);
346
+ font-size: 0.8rem;
347
+ letter-spacing: 0.1em;
348
+ text-transform: uppercase;
349
+ font-family: 'Share Tech Mono', monospace;
350
+ }
351
+
352
+ .stats-card .value {
353
+ font-family: 'Orbitron', sans-serif;
354
+ font-size: 1.9rem;
355
+ font-weight: 700;
356
+ color: var(--matrix-green);
357
+ text-shadow: 0 0 10px rgba(0, 255, 65, 0.5);
358
+ }
359
+
360
+ .stats-card .status {
361
+ font-size: 0.95rem;
362
+ color: rgba(0, 255, 65, 0.7);
363
+ }
364
+
365
+ .list-card {
366
+ background: rgba(0, 0, 0, 0.65);
367
+ border: 1px solid rgba(0, 255, 65, 0.18);
368
+ border-radius: 12px;
369
+ padding: 1.5rem 1.8rem;
370
+ margin-bottom: 1.2rem;
371
+ box-shadow: 0 0 18px rgba(0, 255, 65, 0.15);
372
+ }
373
+
374
+ .list-card h3 {
375
+ font-family: 'Orbitron', sans-serif;
376
+ margin-bottom: 1rem;
377
+ font-size: 1.2rem;
378
+ font-weight: 600;
379
+ display: flex;
380
+ align-items: center;
381
+ gap: 0.7rem;
382
+ color: var(--matrix-green);
383
+ text-shadow: 0 0 8px rgba(0, 255, 65, 0.4);
384
+ }
385
+
386
+ .list-card ul {
387
+ margin: 0;
388
+ padding-left: 1.2rem;
389
+ color: rgba(231, 252, 235, 0.9);
390
+ font-family: 'Share Tech Mono', monospace;
391
+ display: grid;
392
+ gap: 0.6rem;
393
+ }
394
+
395
+ .list-card li {
396
+ line-height: 1.5;
397
+ position: relative;
398
+ }
399
+
400
+ .list-card li::before {
401
+ content: '•'; /* Use a bullet point for simple list styling */
402
+ position: absolute;
403
+ left: -1rem;
404
+ color: var(--matrix-green);
405
+ font-size: 1.2em;
406
+ line-height: 1;
407
+ text-shadow: 0 0 5px rgba(0, 255, 65, 0.7);
408
+ }
409
+
410
+ .model-chip {
411
+ display: inline-flex;
412
+ align-items: center;
413
+ gap: 0.6rem;
414
+ padding: 0.6rem 1rem;
415
+ border-radius: 999px;
416
+ background: rgba(0, 255, 65, 0.1);
417
+ border: 1px solid rgba(0, 255, 65, 0.3);
418
+ font-size: 0.95rem;
419
+ color: var(--matrix-green);
420
+ margin-bottom: 1.5rem;
421
+ font-family: 'Orbitron', sans-serif;
422
+ text-shadow: 0 0 6px rgba(0, 255, 65, 0.4);
423
+ }
424
+
425
+ .file-meta {
426
+ display: flex;
427
+ flex-wrap: wrap;
428
+ gap: 1rem;
429
+ margin-top: 0.6rem;
430
+ font-size: 0.9rem;
431
+ color: rgba(231, 252, 235, 0.7);
432
+ font-family: 'Share Tech Mono', monospace;
433
+ }
434
+
435
+ .file-meta span {
436
+ padding: 0.4rem 0.9rem;
437
+ border-radius: 999px;
438
+ background: rgba(0, 255, 65, 0.08);
439
+ border: 1px solid rgba(0, 255, 65, 0.2);
440
+ text-shadow: 0 0 4px rgba(0, 255, 65, 0.2);
441
+ }
442
+
443
+ .stAlert {
444
+ border-radius: 12px;
445
+ border: 1px solid rgba(0, 255, 65, 0.3);
446
+ background: rgba(0, 0, 0, 0.75);
447
+ color: var(--text-main);
448
+ font-family: 'Share Tech Mono', monospace;
449
+ box-shadow: 0 0 15px rgba(0, 255, 65, 0.25);
450
+ }
451
+
452
+ .streamlit-expanderHeader {
453
+ background: rgba(0, 0, 0, 0.6);
454
+ border-radius: 12px;
455
+ border: 1px solid rgba(0, 255, 65, 0.2);
456
+ color: var(--matrix-green);
457
+ font-family: 'Orbitron', sans-serif;
458
+ text-shadow: 0 0 5px rgba(0, 255, 65, 0.3);
459
+ }
460
+
461
+ .stSpinner > div {
462
+ border-top-color: var(--matrix-green) !important;
463
+ border-right-color: var(--matrix-blue) !important;
464
+ }
465
+
466
+ @media (max-width: 768px) {
467
+ .block-container {
468
+ padding-left: 1rem;
469
+ padding-right: 1rem;
470
+ }
471
+ .glass-panel, .stTabs [data-baseweb="tab-panel"] {
472
+ padding: 18px;
473
+ }
474
+ .stats-grid {
475
+ grid-template-columns: 1fr;
476
+ }
477
+ .hero h1 {
478
+ font-size: 2.5rem;
479
+ }
480
+ .hero-badges {
481
+ gap: 0.6rem;
482
+ }
483
+ .hero-badge {
484
+ font-size: 0.8rem;
485
+ padding: 0.3rem 0.7rem;
486
+ }
487
  }
488
+ </style>
489
+ """
490
+
491
+ st.markdown(CUSTOM_CSS, unsafe_allow_html=True)
492
+
493
+ st.markdown(
494
+ """
495
+ <div class="matrix-rain" id="matrix-rain"></div>
496
+ """,
497
+ unsafe_allow_html=True,
498
+ )
499
 
500
 
501
+ @st.cache_resource
502
  def get_analyzer():
503
+ return CodeAnalyzer(cache_dir=".analyzer_cache")
504
+
505
 
506
  analyzer = get_analyzer()
507
+ AVAILABLE_MODELS = analyzer.available_models
508
 
509
+ if not AVAILABLE_MODELS:
510
+ st.error("No AI models configured. Add API keys to your .env file and restart the app.")
511
+ st.stop()
512
+
513
+
514
+ LANGUAGE_DISPLAY = {
515
+ "auto": "Auto Detect",
516
+ "python": "Python",
517
+ "javascript": "JavaScript",
518
+ "typescript": "TypeScript",
519
+ "html": "HTML",
520
+ "css": "CSS",
521
+ "java": "Java",
522
+ "cpp": "C++",
523
+ "c": "C",
524
+ "csharp": "C#",
525
+ "go": "Go",
526
+ "rust": "Rust",
527
+ "php": "PHP",
528
+ "ruby": "Ruby",
529
+ "swift": "Swift",
530
+ "kotlin": "Kotlin",
531
+ }
532
+ LANGUAGE_OPTIONS = list(LANGUAGE_DISPLAY.keys())
533
+
534
+
535
+ def ensure_list(items: Any) -> List[str]:
536
+ if not items:
537
+ return []
538
+ if isinstance(items, str):
539
+ clean = items.strip()
540
+ return [clean] if clean else []
541
+ if isinstance(items, dict):
542
+ return [f"{key}: {value}" for key, value in items.items() if str(value).strip()]
543
+ if isinstance(items, Iterable):
544
+ values = []
545
+ for entry in items:
546
+ if entry is None:
547
+ continue
548
+ text = str(entry).strip()
549
+ if text:
550
+ values.append(text)
551
+ return values
552
+ return [str(items)]
553
+
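The `ensure_list` helper above normalizes whatever shape a model response field arrives in (string, dict, iterable, scalar) into a flat list of trimmed strings. A standalone sketch of the same behavior, runnable outside Streamlit (the name `normalize` is illustrative, not from the repo):

```python
from typing import Any, Iterable, List

def normalize(items: Any) -> List[str]:
    # Mirrors ensure_list: the order of checks matters, since str and dict
    # are both Iterable and must be handled before the generic branch.
    if not items:
        return []
    if isinstance(items, str):
        clean = items.strip()
        return [clean] if clean else []
    if isinstance(items, dict):
        return [f"{k}: {v}" for k, v in items.items() if str(v).strip()]
    if isinstance(items, Iterable):
        return [str(e).strip() for e in items if e is not None and str(e).strip()]
    return [str(items)]
```

Falsy inputs (None, empty string, empty dict) all collapse to an empty list, which is what lets `render_list_section` fall back to its placeholder text.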
554
+
555
+ def parse_score(raw: Any) -> float:
556
+ try:
557
+ return float(raw)
558
+ except (TypeError, ValueError):
559
+ return 0.0
560
+
561
+
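`parse_score` defends against model output that is a number, a numeric string, None, or garbage by coercing everything through `float` and defaulting to zero. A minimal sketch of that behavior:

```python
def parse_score(raw):
    # Any value float() rejects (None, "n/a", objects) becomes 0.0
    # instead of raising, so the stats row always renders.
    try:
        return float(raw)
    except (TypeError, ValueError):
        return 0.0
```

This is why a missing `quality_score` key simply shows as 0/100 rather than crashing the page.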
562
+ def score_badge(score: float) -> tuple[str, str]:
563
+ if score >= 80:
564
+ return "Excellent", "#00fba4"
565
+ if score >= 60:
566
+ return "Review Suggested", "#ffd76a"
567
+ return "Needs Attention", "#ff6b6b"
568
+
569
+
570
+ def render_list_section(title: str, icon: str, content: Any, fallback: str | None = None) -> None:
571
+ entries = ensure_list(content)
572
+ if entries:
573
+ items_html = "".join(f"<li>{html.escape(entry)}</li>" for entry in entries[:6])
574
+ st.markdown(
575
+ f"""
576
+ <div class="list-card">
577
+ <h3>{icon} {title}</h3>
578
+ <ul>{items_html}</ul>
579
+ </div>
580
+ """,
581
+ unsafe_allow_html=True,
582
  )
583
+ elif fallback:
584
+ st.markdown(
585
+ f"""
586
+ <div class="list-card">
587
+ <h3>{icon} {title}</h3>
588
+ <p class="body-text">{html.escape(fallback)}</p>
589
+ </div>
590
+ """,
591
+ unsafe_allow_html=True,
592
  )
593
 
594
+
595
+ def render_code_result(result: dict[str, Any], model_label: str) -> None:
596
+ if result.get("error"):
597
+ st.error(f"Analysis failed: {result['error']}")
598
  return
599
+
600
+ score = parse_score(result.get("quality_score", 0))
601
+ status_label, status_color = score_badge(score)
602
+ language = (result.get("language") or "auto").upper()
603
+ line_count = result.get("line_count", "-")
604
+ exec_time = parse_score(result.get("execution_time", 0.0))
605
+ cached_text = "Hit" if result.get("cached") else "Fresh"
606
+
607
+ stats = [
608
+ {"label": "Quality Score", "value": f"{int(round(score))}/100", "sub": status_label, "color": status_color},
609
+ {"label": "Language", "value": language, "sub": "Detected" if language != "AUTO" else "Auto"},
610
+ {"label": "Lines", "value": line_count, "sub": "Analyzed"},
611
+ {"label": "Latency", "value": f"{exec_time:.1f}s", "sub": "Runtime"},
612
+ {"label": "Cache", "value": cached_text, "sub": "Result Store"},
613
+ ]
614
+
615
+ # Render stats using native Streamlit components to avoid raw HTML showing
616
+ st.markdown(
617
+ f"""
618
+ <div class="glass-panel">
619
+ <div class="model-chip">🤖 {html.escape(model_label)}</div>
620
  </div>
621
+ """,
622
+ unsafe_allow_html=True,
623
+ )
624
+
625
+ cols = st.columns(len(stats))
626
+ for idx, stat in enumerate(stats):
627
+ with cols[idx]:
628
+ st.metric(label=stat["label"], value=stat["value"], delta=stat["sub"])
629
+
630
+ summary = result.get("summary")
631
+ if summary:
632
+ st.markdown(
633
+ f"""
634
+ <div class="glass-panel">
635
+ <h3>📋 Overview</h3>
636
+ <p class="body-text">{html.escape(summary)}</p>
637
  </div>
638
+ """,
639
+ unsafe_allow_html=True,
640
+ )
641
+
642
+ render_list_section("Highlights", "✨", result.get("strengths"))
643
+ render_list_section("Bug Detection", "🐞", result.get("bugs") or result.get("issues"), "No critical bugs were flagged.")
644
+ render_list_section("Security", "🔒", result.get("security_vulnerabilities") or result.get("security_concerns"), "No security vulnerabilities detected.")
645
+ render_list_section("Code Quality", "🧩", result.get("quality_issues"), "Structure looks solid and maintainable.")
646
+ render_list_section("Quick Fixes", "⚡", result.get("quick_fixes"), "No urgent fixes suggested.")
647
+ render_list_section("Suggestions", "💡", result.get("suggestions"))
648
+
649
+ raw = result.get("raw_response")
650
+ if raw:
651
+ with st.expander("View full model response", expanded=False):
652
+ st.code(raw, language="text")
653
+
654
+
655
+ def render_repo_result(result: dict[str, Any], model_label: str) -> None:
656
+ if result.get("error"):
657
+ st.error(f"Repository analysis failed: {result['error']}")
658
  return
659
+
660
+ info = result.get("repository_info", {})
661
+ repo_name = info.get("name", "Repository")
662
+ repo_desc = info.get("description") or "No description provided."
663
+ repo_url = result.get("repo_url") or st.session_state.get("repo_analysis_url")
664
+
665
+ repo_stats = [
666
+ {"label": "Primary Language", "value": info.get("language", "Unknown"), "sub": "Detected"},
667
+ {"label": "Stars", "value": info.get("stars", 0), "sub": "Community"},
668
+ {"label": "Forks", "value": info.get("forks", 0), "sub": "Collaboration"},
669
+ {"label": "Size", "value": f"{info.get('size', 0)} KB", "sub": "Repo Size"},
670
+ {"label": "Latency", "value": f"{parse_score(result.get('execution_time', 0.0)):.1f}s", "sub": "Runtime"},
671
+ ]
672
+
673
+ link_html = f'<p class="body-text"><a href="{html.escape(repo_url)}" target="_blank">View repository ↗</a></p>' if repo_url else ""
674
+
675
+ # Header card
676
+ st.markdown(
677
+ f"""
678
+ <div class="glass-panel">
679
+ <div class="model-chip">🤖 {html.escape(model_label)}</div>
680
+ <h2 style="margin-bottom:0.35rem;">{html.escape(repo_name)}</h2>
681
+ <p class="body-text" style="color: var(--text-muted); margin-bottom:1.2rem;">{html.escape(repo_desc)}</p>
682
+ {link_html}
683
  </div>
684
+ """,
685
+ unsafe_allow_html=True,
686
+ )
687
+
688
+ # Metrics row
689
+ cols = st.columns(len(repo_stats))
690
+ for idx, stat in enumerate(repo_stats):
691
+ with cols[idx]:
692
+ st.metric(label=stat["label"], value=stat["value"], delta=stat["sub"])
693
+
694
+ overview = result.get("project_overview")
695
+ if overview:
696
+ st.markdown(
697
+ f"""
698
+ <div class="glass-panel">
699
+ <h3>📋 Project Overview</h3>
700
+ <p class="body-text">{html.escape(overview)}</p>
701
+ </div>
702
+ """,
703
+ unsafe_allow_html=True,
704
+ )
705
+
706
+ render_list_section("Architecture Quality", "🏗️", result.get("architecture_quality"), "Project structure looks well organized.")
707
+ render_list_section("Critical Issues", "🚨", result.get("critical_issues"), "No critical issues were highlighted.")
708
+ render_list_section("Priority Improvements", "🎯", result.get("improvement_priorities"), "No immediate improvements suggested.")
709
+ render_list_section("Onboarding Guide", "🚀", result.get("onboarding_guide"), "No specific onboarding steps identified.")
710
+ render_list_section("Tech Stack", "🛠️", result.get("tech_stack_rationale"), "Tech stack details were not identified.")
711
+ render_list_section("API Endpoints", "🔌", result.get("api_endpoint_summary"), "No API endpoints were identified.")
712
+
713
+ raw = result.get("raw_response")
714
+ if raw:
715
+ with st.expander("View full model response", expanded=False):
716
+ st.code(raw, language="text")
717
+
718
+
719
+ if "code_input" not in st.session_state:
720
+ st.session_state.code_input = ""
721
+ if "code_file_meta" not in st.session_state:
722
+ st.session_state.code_file_meta = None
723
+ if "code_analysis_result" not in st.session_state:
724
+ st.session_state.code_analysis_result = None
725
+ if "code_analysis_model" not in st.session_state:
726
+ st.session_state.code_analysis_model = ""
727
+ if "repo_analysis_result" not in st.session_state:
728
+ st.session_state.repo_analysis_result = None
729
+ if "repo_analysis_model" not in st.session_state:
730
+ st.session_state.repo_analysis_model = ""
731
+ if "repo_analysis_url" not in st.session_state:
732
+ st.session_state.repo_analysis_url = ""
733
+
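The block above initializes seven session keys one `if` at a time. Since `st.session_state` supports `in` checks and item assignment, the same defaults can be applied in a loop; a sketch using a plain dict as a stand-in for session state (the `DEFAULTS` name is an assumption, not from the repo):

```python
DEFAULTS = {
    "code_input": "",
    "code_file_meta": None,
    "code_analysis_result": None,
    "code_analysis_model": "",
    "repo_analysis_result": None,
    "repo_analysis_model": "",
    "repo_analysis_url": "",
}

# Stand-in for st.session_state with one key already populated by a rerun.
session_state = {"code_input": "print('hi')"}
for key, value in DEFAULTS.items():
    if key not in session_state:
        session_state[key] = value  # only fill keys not yet set
```

Existing values survive reruns untouched; only missing keys get their default.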
734
+ st.markdown(
735
+ """
736
+ <div class="glass-panel hero">
737
+ <div class="hero-label">CODE ANALYZER</div>
738
+ <h1>AI Code Analyzer</h1>
739
+ <p class="subline">Inspect bugs, surface security gaps, and review repositories with instant feedback.</p>
740
+ <div class="hero-badges">
741
+ <span class="hero-badge">🧠 Multi-model</span>
742
+ <span class="hero-badge">🔍 Bug & Security Scan</span>
743
+ <span class="hero-badge">⚡ Instant Results</span>
744
+ <span class="hero-badge">📦 GitHub Ready</span>
745
  </div>
746
  </div>
747
+ <script>
748
+ (function drizzle() {
749
+ const container = document.getElementById('matrix-rain');
750
+ if (!container || container.dataset.initialized) return;
751
+ container.dataset.initialized = 'true';
752
+ const glyphs = "01ΛΣΞ∑¥$#@*&%=+";
753
+ const nodeCount = 80;
754
+ for (let i = 0; i < nodeCount; i++) {
755
+ const drop = document.createElement('span');
756
+ drop.textContent = glyphs[Math.floor(Math.random() * glyphs.length)];
757
+ drop.style.left = Math.random() * 100 + '%';
758
+ drop.style.animationDuration = (Math.random() * 4 + 3) + 's';
759
+ drop.style.animationDelay = (Math.random() * 4) + 's';
760
+ container.appendChild(drop);
761
+ }
762
+ })();
763
+ </script>
764
+ """,
765
+ unsafe_allow_html=True,
766
+ )
767
 
768
+ code_tab, repo_tab = st.tabs(["Code Analysis", "Repository Insights"])
769
+
770
+ with code_tab:
771
+ model_keys = list(AVAILABLE_MODELS.keys())
772
+ selected_model_code = st.selectbox(
773
+ "AI Model",
774
+ options=model_keys,
775
+ format_func=lambda key: AVAILABLE_MODELS[key],
776
+ key="code_model_select",
777
  )
778
 
779
+ selected_language = st.selectbox(
780
+ "Language",
781
+ LANGUAGE_OPTIONS,
782
+ format_func=lambda code: LANGUAGE_DISPLAY[code],
783
+ key="language_select",
784
  )
785
+
786
+ uploaded_file = st.file_uploader(
787
+ "Upload a code file",
788
+ type=["py", "js", "java", "cpp", "c", "cs", "go", "rs", "php", "rb", "swift", "kt", "txt"],
789
+ key="code_file_uploader",
790
+ )
791
+
792
+ if uploaded_file is not None:
793
+ raw_bytes = uploaded_file.read()
794
+ try:
795
+ decoded = raw_bytes.decode("utf-8")
796
+ except UnicodeDecodeError:
797
+ st.error("Only UTF-8 encoded files are supported.")
798
+ else:
799
+ st.session_state.code_input = decoded
800
+ st.session_state.code_file_meta = {
801
+ "name": uploaded_file.name,
802
+ "size": len(raw_bytes),
803
+ "lines": len(decoded.splitlines()),
804
+ }
805
+
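The upload path above rejects any file that is not valid UTF-8. If lossy acceptance is preferable for analysis purposes, `bytes.decode` takes an `errors` argument; a sketch of that alternative (the `decode_upload` helper is hypothetical, and whether lossy input is acceptable is a product decision the original code does not make):

```python
def decode_upload(raw: bytes) -> tuple[str, bool]:
    """Decode uploaded bytes; returns (text, was_lossy)."""
    try:
        return raw.decode("utf-8"), False
    except UnicodeDecodeError:
        # Invalid byte sequences become U+FFFD instead of failing the upload.
        return raw.decode("utf-8", errors="replace"), True
```

The boolean lets the UI warn the user that replacement characters were substituted.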
806
+ st.text_area(
807
+ "Or paste code below",
808
+ key="code_input",
809
+ height=320,
810
+ placeholder="Paste any code snippet to inspect bugs, security gaps, and quality issues.",
811
+ )
812
+
813
+ meta = st.session_state.get("code_file_meta")
814
+ if meta:
815
+ st.markdown(
816
+ f"""
817
+ <div class="list-card">
818
+ <h3>Active File</h3>
819
+ <div class="file-meta">
820
+ <span>{meta['name']}</span>
821
+ <span>{meta['lines']} lines</span>
822
+ <span>{meta['size']} bytes</span>
823
+ </div>
824
+ </div>
825
+ """,
826
+ unsafe_allow_html=True,
827
  )
828
+
829
+ analyze_code_clicked = st.button("Run Code Analysis", key="code_analyze_button", use_container_width=True)
830
+
831
+ if analyze_code_clicked:
832
+ snippet = st.session_state.get("code_input", "").strip()
833
+ if not snippet:
834
+ st.error("Please upload a file or paste some code to analyze.")
835
+ else:
836
+ with st.spinner("Analyzing code..."):
837
+ language_arg = None if selected_language == "auto" else selected_language
838
+ result = analyzer.analyze_code(snippet, selected_model_code, language_arg)
839
+ st.session_state.code_analysis_result = result
840
+ st.session_state.code_analysis_model = AVAILABLE_MODELS[selected_model_code]
841
+
842
+ if st.session_state.get("code_analysis_result"):
843
+ render_code_result(st.session_state.code_analysis_result, st.session_state.get("code_analysis_model", ""))
844
  else:
845
+ st.info("Upload a file or paste code to generate an analysis.")
846
+
847
+ with repo_tab:
848
+ model_keys = list(AVAILABLE_MODELS.keys())
849
+ selected_model_repo = st.selectbox(
850
+ "AI Model",
851
+ options=model_keys,
852
+ format_func=lambda key: AVAILABLE_MODELS[key],
853
+ key="repo_model_select",
854
  )
855
+
856
+ st.text_input(
857
+ "GitHub repository URL",
858
+ placeholder="https://github.com/owner/repo",
859
+ key="repo_analysis_url",
860
  )
861
 
862
+ analyze_repo_clicked = st.button("Analyze Repository", key="repo_analyze_button", use_container_width=True)
863
 
864
+ if analyze_repo_clicked:
865
+ repo_url = st.session_state.get("repo_analysis_url", "").strip()
866
+ if not repo_url:
867
+ st.error("Enter a GitHub repository URL.")
868
+ else:
869
+ with st.spinner("Inspecting repository..."):
870
+ result = analyzer.analyze_github_repo(repo_url, selected_model_repo)
871
+ st.session_state.repo_analysis_result = result
872
+ st.session_state.repo_analysis_model = AVAILABLE_MODELS[selected_model_repo]
873
+
874
+ if st.session_state.get("repo_analysis_result"):
875
+ render_repo_result(
876
+ st.session_state.repo_analysis_result,
877
+ st.session_state.get("repo_analysis_model", ""),
878
  )
879
+ else:
880
+ st.info("Provide a public GitHub repository URL to review its structure, issues, and improvements.")
model_comparison.py DELETED
@@ -1,175 +0,0 @@
1
- #!/usr/bin/env python3
2
- """
3
- Model Comparison Script for AI Code Analyzer
4
-
5
- This script helps you compare different code analysis models
6
- and understand their capabilities before fine-tuning.
7
-
8
- Author: AI Code Analyzer Project
9
- Date: 2025
10
- """
11
-
12
- import torch
13
- from transformers import (
14
- AutoModelForCausalLM,
15
- AutoModelForSeq2SeqLM,
16
- AutoTokenizer,
17
- BitsAndBytesConfig
18
- )
19
-
20
- def compare_models():
21
- """
22
- Compare different code analysis models available on Hugging Face.
23
- """
24
-
25
- models_to_compare = {
26
- "Current Model (GPT-2)": {
27
- "model_id": "gpt2",
28
- "type": "CausalLM",
29
- "size": "124M",
30
- "code_specialized": False,
31
- "description": "General-purpose text model, not optimized for code"
32
- },
33
- "CodeT5+ (Recommended)": {
34
- "model_id": "Salesforce/codet5p-220m",
35
- "type": "Seq2SeqLM",
36
- "size": "220M",
37
- "code_specialized": True,
38
- "description": "Specialized for code understanding and generation"
39
- },
40
- "CodeBERT": {
41
- "model_id": "microsoft/CodeBERT-base",
42
- "type": "FeatureExtraction",
43
- "size": "125M",
44
- "code_specialized": True,
45
-            "description": "Pre-trained on code for understanding programming languages"
-        },
-        "GraphCodeBERT": {
-            "model_id": "microsoft/GraphCodeBERT-base",
-            "type": "FeatureExtraction",
-            "size": "125M",
-            "code_specialized": True,
-            "description": "Understands code structure and dependencies"
-        },
-        "InCoder": {
-            "model_id": "facebook/incoder-1B",
-            "type": "CausalLM",
-            "size": "1B",
-            "code_specialized": True,
-            "description": "Code completion and analysis with large context"
-        }
-    }
-
-    print("🤖 Code Analysis Models Comparison")
-    print("=" * 80)
-
-    for name, info in models_to_compare.items():
-        print(f"\n📊 {name}")
-        print(f"   Model ID: {info['model_id']}")
-        print(f"   Type: {info['type']}")
-        print(f"   Size: {info['size']}")
-        print(f"   Code Specialized: {'✅ Yes' if info['code_specialized'] else '❌ No'}")
-        print(f"   Description: {info['description']}")
-
-    print("\n" + "=" * 80)
-    print("🎯 RECOMMENDATIONS:")
-    print("=" * 80)
-
-    print("\n🥇 BEST CHOICE: Salesforce/codet5p-220m")
-    print("   ✅ Specialized for code analysis")
-    print("   ✅ Good balance of size and performance")
-    print("   ✅ Works well with your training data format")
-    print("   ✅ Seq2Seq architecture perfect for code analysis")
-
-    print("\n🥈 ALTERNATIVE: facebook/incoder-1B")
-    print("   ✅ Excellent code understanding")
-    print("   ✅ Large context window")
-    print("   ⚠️ Larger model (requires more resources)")
-
-    print("\n🥉 FOR EXPERIMENTATION: microsoft/CodeBERT-base")
-    print("   ✅ Proven for code understanding")
-    print("   ✅ Good for feature extraction")
-    print("   ⚠️ Different architecture (might need data format changes)")
-
-def test_model_loading(model_id: str, model_type: str = "auto"):
-    """
-    Test loading a specific model to ensure it works.
-
-    Args:
-        model_id (str): Hugging Face model ID
-        model_type (str): Model type ("auto", "causal", "seq2seq")
-    """
-    print(f"\n🧪 Testing model: {model_id}")
-    print("-" * 50)
-
-    try:
-        # Load tokenizer
-        print("Loading tokenizer...")
-        tokenizer = AutoTokenizer.from_pretrained(model_id)
-
-        if tokenizer.pad_token is None:
-            tokenizer.pad_token = tokenizer.eos_token
-            print("✅ Added padding token")
-
-        # Load model
-        print("Loading model...")
-        if model_type == "seq2seq" or "codet5" in model_id.lower():
-            model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
-        else:
-            model = AutoModelForCausalLM.from_pretrained(model_id)
-
-        print("✅ Model loaded successfully!")
-        print(f"   Model type: {type(model).__name__}")
-        print(f"   Parameters: {sum(p.numel() for p in model.parameters()):,}")
-
-        # Test with a simple code snippet
-        test_code = "def add(a, b):\n    return a + b"
-        inputs = tokenizer(test_code, return_tensors="pt", truncation=True, max_length=512)
-
-        print("✅ Tokenization test passed")
-        print(f"   Input tokens: {inputs['input_ids'].shape[1]}")
-
-        return True
-
-    except Exception as e:
-        print(f"❌ Error loading model: {str(e)}")
-        return False
-
-def main():
-    """
-    Main function to run model comparison and testing.
-    """
-    print("🚀 AI Code Analyzer - Model Comparison Tool")
-    print("=" * 80)
-
-    # Show comparison
-    compare_models()
-
-    # Test recommended models
-    print("\n\n🧪 TESTING RECOMMENDED MODELS:")
-    print("=" * 80)
-
-    models_to_test = [
-        ("Salesforce/codet5p-220m", "seq2seq"),
-        ("microsoft/CodeBERT-base", "causal"),
-        ("facebook/incoder-1B", "causal")
-    ]
-
-    working_models = []
-
-    for model_id, model_type in models_to_test:
-        if test_model_loading(model_id, model_type):
-            working_models.append(model_id)
-
-    print(f"\n🎉 SUCCESS: {len(working_models)} models loaded successfully!")
-
-    if working_models:
-        print("\n📋 NEXT STEPS:")
-        print("1. Choose your preferred model from the working models above")
-        print("2. Run: python finetune_improved.py --model <model_id> --dry-run")
-        print("3. Test the fine-tuned model with your code analyzer")
-        print("\n💡 RECOMMENDED COMMAND:")
-        print(f"   python finetune_improved.py --model {working_models[0]} --dry-run")
-
-if __name__ == "__main__":
-    main()
 
optimized_code_analyzer.py DELETED
@@ -1,466 +0,0 @@
-#!/usr/bin/env python3
-"""
-Optimized CodeT5+ Code Analyzer
-
-This script implements CodeT5+ with multiple speed optimizations:
-- FP16 by default (fastest on your GPU); optional INT8/INT4
-- Response streaming for better UX
-- Progress indicators
-- Result caching
-- Optimized generation parameters
-
-Author: AI Code Analyzer Project
-Date: 2025
-"""
-
-import torch
-import time
-import hashlib
-import json
-import os
-from typing import Dict, Any, Optional, Generator
-from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, BitsAndBytesConfig
-from tqdm import tqdm
-import streamlit as st
-
-class OptimizedCodeAnalyzer:
-    """
-    Optimized CodeT5+ analyzer with speed improvements.
-    """
-
-    def __init__(
-        self,
-        model_id: str = "Salesforce/codet5p-220m",
-        cache_dir: str = "./cache",
-        precision: str = "fp16",  # one of: fp16 | int8 | int4
-        quick_max_new_tokens: int = 180,
-        detailed_max_new_tokens: int = 240,
-    ):
-        """
-        Initialize the optimized analyzer.
-
-        Args:
-            model_id: Hugging Face model ID
-            cache_dir: Directory to store cached results
-            precision: Model precision ("fp16", "int8", or "int4")
-            quick_max_new_tokens: Max new tokens for quick analysis
-            detailed_max_new_tokens: Max new tokens for detailed analysis
-        """
-        self.model_id = model_id
-        self.cache_dir = cache_dir
-        self.model = None
-        self.tokenizer = None
-        self.cache = {}
-        self.precision = precision.lower().strip()
-        self.quick_max_new_tokens = quick_max_new_tokens
-        self.detailed_max_new_tokens = detailed_max_new_tokens
-
-        # Create cache directory
-        os.makedirs(cache_dir, exist_ok=True)
-
-        # Load cache if exists
-        self._load_cache()
-
-    def _create_quantization_config(self) -> BitsAndBytesConfig:
-        """
-        Create 4-bit quantization configuration for faster inference.
-
-        Returns:
-            BitsAndBytesConfig: Quantization configuration
-        """
-        # INT4 (nf4) config; used only when precision == "int4"
-        return BitsAndBytesConfig(
-            load_in_4bit=True,
-            bnb_4bit_quant_type="nf4",
-            bnb_4bit_compute_dtype=torch.float16,
-            bnb_4bit_use_double_quant=True,
-        )
-
-    def _load_model(self):
-        """
-        Load the model with optimizations.
-        """
-        if self.model is not None:
-            return
-
-        print("🚀 Loading optimized CodeT5+ model...")
-
-        # Load tokenizer
-        self.tokenizer = AutoTokenizer.from_pretrained(self.model_id)
-        if self.tokenizer.pad_token is None:
-            self.tokenizer.pad_token = self.tokenizer.eos_token
-
-        # Decide precision based on config
-        quantization_config = None
-        dtype = None
-        banner = ""
-
-        if self.precision == "fp16":
-            dtype = torch.float16
-            banner = "FP16 precision"
-        elif self.precision == "int8":
-            quantization_config = BitsAndBytesConfig(load_in_8bit=True)
-            banner = "INT8 quantization"
-        elif self.precision == "int4":
-            quantization_config = self._create_quantization_config()
-            banner = "INT4 (nf4) quantization"
-        else:
-            # Fall back to fp16
-            dtype = torch.float16
-            banner = f"Unknown precision '{self.precision}', defaulting to FP16"
-
-        self.model = AutoModelForSeq2SeqLM.from_pretrained(
-            self.model_id,
-            device_map="auto",
-            torch_dtype=dtype,
-            quantization_config=quantization_config,
-        )
-
-        print(f"✅ Model loaded with {banner}!")
-
-    def _get_cache_key(self, code: str) -> str:
-        """
-        Generate cache key for code.
-
-        Args:
-            code: Code to analyze
-
-        Returns:
-            str: Cache key
-        """
-        return hashlib.md5(code.encode()).hexdigest()
-
-    def _load_cache(self):
-        """
-        Load cached results from disk.
-        """
-        cache_file = os.path.join(self.cache_dir, "analysis_cache.json")
-        if os.path.exists(cache_file):
-            try:
-                with open(cache_file, 'r') as f:
-                    self.cache = json.load(f)
-                print(f"📁 Loaded {len(self.cache)} cached analyses")
-            except Exception:
-                self.cache = {}
-
-    def _save_cache(self):
-        """
-        Save cache to disk.
-        """
-        cache_file = os.path.join(self.cache_dir, "analysis_cache.json")
-        with open(cache_file, 'w') as f:
-            json.dump(self.cache, f)
-
-    def _check_cache(self, code: str) -> Optional[Dict[str, Any]]:
-        """
-        Check if analysis is cached.
-
-        Args:
-            code: Code to analyze
-
-        Returns:
-            Optional[Dict]: Cached result or None
-        """
-        cache_key = self._get_cache_key(code)
-        return self.cache.get(cache_key)
-
-    def _save_to_cache(self, code: str, result: Dict[str, Any]):
-        """
-        Save analysis result to cache.
-
-        Args:
-            code: Code that was analyzed
-            result: Analysis result
-        """
-        cache_key = self._get_cache_key(code)
-        self.cache[cache_key] = result
-        self._save_cache()
-
-    def analyze_code_streaming(
-        self,
-        code: str,
-        show_progress: bool = True,
-        mode: str = "detailed",  # "quick" | "detailed"
-    ) -> Generator[str, None, Dict[str, Any]]:
-        """
-        Analyze code with streaming response and progress indicators.
-
-        Args:
-            code: Code to analyze
-            show_progress: Whether to show progress indicators
-            mode: Analysis mode ("quick" or "detailed")
-
-        Yields:
-            str: Partial analysis results
-        """
-        # Check cache first
-        cached_result = self._check_cache(code)
-        if cached_result:
-            print("⚡ Using cached result!")
-            yield cached_result["analysis"]
-            return cached_result
-
-        # Load model if not loaded
-        self._load_model()
-
-        # Create analysis prompt
-        prompt = f"""Analyze this code for bugs, performance issues, and security concerns:
-
-{code}
-
-Analysis:"""
-
-        # Tokenize input
-        inputs = self.tokenizer(
-            prompt,
-            return_tensors="pt",
-            truncation=True,
-            max_length=512,
-            padding=True,
-        )
-        device = next(self.model.parameters()).device
-        inputs = {k: v.to(device) for k, v in inputs.items()}
-
-        # Generate analysis with optimized parameters
-        start_time = time.time()
-
-        if show_progress:
-            print("🔍 Analyzing code...")
-            progress_bar = tqdm(total=100, desc="Analysis Progress")
-
-        try:
-            with torch.no_grad():
-                # Use optimized generation parameters for speed
-                max_new = self.detailed_max_new_tokens if mode == "detailed" else self.quick_max_new_tokens
-                num_beams = 2 if mode == "detailed" else 1
-                outputs = self.model.generate(
-                    inputs["input_ids"],
-                    attention_mask=inputs.get("attention_mask"),
-                    max_new_tokens=max_new,
-                    num_beams=num_beams,
-                    do_sample=False,
-                    pad_token_id=self.tokenizer.eos_token_id,
-                    use_cache=True,
-                )
-
-            if show_progress:
-                progress_bar.update(50)
-
-            # Decode analysis
-            analysis = self.tokenizer.decode(outputs[0], skip_special_tokens=True)
-            analysis_text = analysis[len(prompt):].strip()
-
-            if show_progress:
-                progress_bar.update(50)
-                progress_bar.close()
-
-            # Calculate quality score
-            quality_score = self._calculate_quality_score(analysis_text)
-
-            total_time = time.time() - start_time
-
-            # Create result
-            result = {
-                "analysis": analysis_text,
-                "quality_score": quality_score,
-                "execution_time": total_time,
-                "model": self.model_id,
-                "cached": False
-            }
-
-            # Save to cache
-            self._save_to_cache(code, result)
-
-            # Yield the analysis
-            yield analysis_text
-
-            return result
-
-        except Exception as e:
-            if show_progress:
-                progress_bar.close()
-            raise e
-
-    def analyze_code_fast(self, code: str, mode: str = "quick") -> Dict[str, Any]:
-        """
-        Fast analysis without streaming (for batch processing).
-
-        Args:
-            code: Code to analyze
-            mode: Analysis mode ("quick" or "detailed")
-
-        Returns:
-            Dict: Analysis result
-        """
-        # Check cache first
-        cached_result = self._check_cache(code)
-        if cached_result:
-            cached_result["cached"] = True
-            return cached_result
-
-        # Load model if not loaded
-        self._load_model()
-
-        # Create analysis prompt
-        prompt = f"""Analyze this code for bugs, performance issues, and security concerns:
-
-{code}
-
-Analysis:"""
-
-        # Tokenize input
-        inputs = self.tokenizer(
-            prompt,
-            return_tensors="pt",
-            truncation=True,
-            max_length=512,
-            padding=True,
-        )
-        device = next(self.model.parameters()).device
-        inputs = {k: v.to(device) for k, v in inputs.items()}
-
-        # Generate analysis with speed optimizations
-        start_time = time.time()
-
-        with torch.no_grad():
-            max_new = self.quick_max_new_tokens if mode == "quick" else self.detailed_max_new_tokens
-            num_beams = 1 if mode == "quick" else 2
-            outputs = self.model.generate(
-                inputs["input_ids"],
-                attention_mask=inputs.get("attention_mask"),
-                max_new_tokens=max_new,
-                num_beams=num_beams,
-                do_sample=False,
-                pad_token_id=self.tokenizer.eos_token_id,
-                use_cache=True,
-            )
-
-        # Decode analysis
-        analysis = self.tokenizer.decode(outputs[0], skip_special_tokens=True)
-        analysis_text = analysis[len(prompt):].strip()
-
-        # Calculate quality score
-        quality_score = self._calculate_quality_score(analysis_text)
-
-        total_time = time.time() - start_time
-
-        # Create result
-        result = {
-            "analysis": analysis_text,
-            "quality_score": quality_score,
-            "execution_time": total_time,
-            "model": self.model_id,
-            "cached": False
-        }
-
-        # Save to cache
-        self._save_to_cache(code, result)
-
-        return result
-
-    def _calculate_quality_score(self, analysis_text: str) -> int:
-        """
-        Calculate quality score for analysis.
-
-        Args:
-            analysis_text: Analysis text
-
-        Returns:
-            int: Quality score (0-100)
-        """
-        score = 0
-        analysis_lower = analysis_text.lower()
-
-        # Check for different types of analysis (20 points each)
-        if any(word in analysis_lower for word in ['bug', 'error', 'issue', 'problem', 'flaw']):
-            score += 20
-
-        if any(word in analysis_lower for word in ['performance', 'slow', 'efficient', 'complexity', 'optimization']):
-            score += 20
-
-        if any(word in analysis_lower for word in ['security', 'vulnerability', 'safe', 'unsafe', 'risk']):
-            score += 20
-
-        if any(word in analysis_lower for word in ['suggest', 'improve', 'better', 'recommend', 'fix', 'solution']):
-            score += 20
-
-        # Bonus for detailed analysis
-        if len(analysis_text) > 200:
-            score += 10
-        if len(analysis_text) > 500:
-            score += 10
-
-        return min(score, 100)
-
-    def get_model_info(self) -> Dict[str, Any]:
-        """
-        Get information about the loaded model.
-
-        Returns:
-            Dict: Model information
-        """
-        if self.model is None:
-            return {"status": "Model not loaded"}
-
-        param_count = sum(p.numel() for p in self.model.parameters())
-        device = next(self.model.parameters()).device
-
-        return {
-            "model_id": self.model_id,
-            "parameters": param_count,
-            "device": str(device),
-            "precision": self.precision,
-            "quick_max_new_tokens": self.quick_max_new_tokens,
-            "detailed_max_new_tokens": self.detailed_max_new_tokens,
-            "cache_size": len(self.cache)
-        }
-
-def main():
-    """
-    Demo of the optimized analyzer.
-    """
-    print("🚀 Optimized CodeT5+ Analyzer Demo")
-    print("=" * 60)
-
-    # Initialize analyzer
-    analyzer = OptimizedCodeAnalyzer()
-
-    # Test code
-    test_code = """
-def calculate_fibonacci(n):
-    if n <= 0:
-        return 0
-    elif n == 1:
-        return 1
-    else:
-        return calculate_fibonacci(n-1) + calculate_fibonacci(n-2)
-
-# This will be slow for large numbers
-result = calculate_fibonacci(35)
-print(result)
-"""
-
-    print(f"Test Code:\n{test_code}")
-    print("=" * 60)
-
-    # Test streaming analysis
-    print("\n🔍 Streaming Analysis:")
-    print("-" * 40)
-
-    for partial_result in analyzer.analyze_code_streaming(test_code):
-        print(partial_result)
-
-    # Test fast analysis
-    print("\n⚡ Fast Analysis:")
-    print("-" * 40)
-
-    result = analyzer.analyze_code_fast(test_code)
-    print(f"Analysis: {result['analysis']}")
-    print(f"Quality Score: {result['quality_score']}/100")
-    print(f"Execution Time: {result['execution_time']:.2f}s")
-    print(f"Cached: {result['cached']}")
-
-    # Show model info
-    print("\n📊 Model Information:")
-    print("-" * 40)
-    model_info = analyzer.get_model_info()
-    for key, value in model_info.items():
-        print(f"{key}: {value}")
-
-if __name__ == "__main__":
-    main()
 
optimized_code_analyzer_enhanced.py DELETED
@@ -1,593 +0,0 @@
-#!/usr/bin/env python3
-"""
-Enhanced Code Analyzer with Fine-tuned Model Support
-
-This version supports:
-- Base CodeT5+ model (original)
-- Fine-tuned DeepSeek Coder model (new)
-- Easy toggle between models
-- All existing optimizations (caching, streaming, etc.)
-
-Author: AI Code Analyzer Project
-Date: 2025
-"""
-
-import torch
-import time
-import hashlib
-import json
-import os
-from typing import Dict, Any, Optional, Generator, Literal
-from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, AutoModelForCausalLM, BitsAndBytesConfig
-from peft import PeftModel
-from tqdm import tqdm
-import streamlit as st
-
-ModelType = Literal["codet5", "deepseek-finetuned"]
-
-class EnhancedCodeAnalyzer:
-    """
-    Enhanced analyzer supporting multiple model types including fine-tuned models.
-    """
-
-    def __init__(
-        self,
-        model_type: ModelType = "codet5",
-        model_id: Optional[str] = None,
-        adapter_path: Optional[str] = None,
-        cache_dir: str = "./cache",
-        precision: str = "fp16",
-        quick_max_new_tokens: int = 180,
-        detailed_max_new_tokens: int = 300,
-        remote_api_url: Optional[str] = None,
-    ):
-        """
-        Initialize the enhanced analyzer.
-
-        Args:
-            model_type: Type of model to use ("codet5" or "deepseek-finetuned")
-            model_id: Hugging Face model ID (auto-selected if None)
-            adapter_path: Path to LoRA adapters for fine-tuned models
-            cache_dir: Directory to store cached results
-            precision: Model precision (fp16, int8, int4)
-            quick_max_new_tokens: Max tokens for quick analysis
-            detailed_max_new_tokens: Max tokens for detailed analysis
-            remote_api_url: Optional URL of a remote analysis API
-        """
-        self.model_type = model_type
-        self.cache_dir = cache_dir
-        self.precision = precision.lower().strip()
-        self.quick_max_new_tokens = quick_max_new_tokens
-        self.detailed_max_new_tokens = detailed_max_new_tokens
-
-        # Auto-select model_id based on type
-        if model_id is None:
-            if model_type == "codet5":
-                self.model_id = "Salesforce/codet5p-220m"
-            elif model_type == "deepseek-finetuned":
-                self.model_id = "deepseek-ai/deepseek-coder-1.3b-instruct"
-            else:
-                raise ValueError(f"Unknown model_type: {model_type}")
-        else:
-            self.model_id = model_id
-
-        # Set adapter path
-        self.adapter_path = adapter_path
-        if model_type == "deepseek-finetuned" and adapter_path is None:
-            self.adapter_path = "./fine-tuned-analyst"
-
-        self.model = None
-        self.tokenizer = None
-        self.cache = {}
-        self.remote_api_url = remote_api_url
-
-        # Create cache directory
-        os.makedirs(cache_dir, exist_ok=True)
-
-        # Load cache if exists
-        self._load_cache()
-
-    def _create_quantization_config(self) -> BitsAndBytesConfig:
-        """Create quantization configuration."""
-        return BitsAndBytesConfig(
-            load_in_4bit=True,
-            bnb_4bit_quant_type="nf4",
-            bnb_4bit_compute_dtype=torch.float16,
-            bnb_4bit_use_double_quant=True,
-        )
-
-    def _load_model(self):
-        """Load the model with optimizations."""
-        if self.model is not None:
-            return
-
-        print(f"🚀 Loading {self.model_type} model...")
-
-        # Load tokenizer
-        self.tokenizer = AutoTokenizer.from_pretrained(self.model_id, trust_remote_code=True)
-        if self.tokenizer.pad_token is None:
-            self.tokenizer.pad_token = self.tokenizer.eos_token
-
-        # Configure precision
-        quantization_config = None
-        dtype = None
-        banner = ""
-
-        if self.precision == "fp16":
-            dtype = torch.float16
-            banner = "FP16 precision"
-        elif self.precision == "int8":
-            quantization_config = BitsAndBytesConfig(load_in_8bit=True)
-            banner = "INT8 quantization"
-        elif self.precision == "int4":
-            quantization_config = self._create_quantization_config()
-            banner = "INT4 (nf4) quantization"
-        else:
-            dtype = torch.float16
-            banner = f"Unknown precision '{self.precision}', defaulting to FP16"
-
-        # Load base model based on type
-        if self.model_type == "codet5":
-            self.model = AutoModelForSeq2SeqLM.from_pretrained(
-                self.model_id,
-                device_map="auto",
-                torch_dtype=dtype,
-                quantization_config=quantization_config,
-                trust_remote_code=True,
-            )
-            print(f"✅ CodeT5+ loaded with {banner}!")
-
-        elif self.model_type == "deepseek-finetuned":
-            # Load base DeepSeek model
-            base_model = AutoModelForCausalLM.from_pretrained(
-                self.model_id,
-                device_map="auto",
-                torch_dtype=dtype,
-                quantization_config=quantization_config,
-                trust_remote_code=True,
-            )
-
-            # Load LoRA adapters if path exists
-            if self.adapter_path and os.path.exists(self.adapter_path):
-                print(f"🔗 Loading LoRA adapters from {self.adapter_path}...")
-                self.model = PeftModel.from_pretrained(base_model, self.adapter_path)
-                print(f"✅ Fine-tuned DeepSeek loaded with {banner}!")
-            else:
-                print(f"⚠️ Adapter path not found: {self.adapter_path}")
-                print(f"✅ Using base DeepSeek model with {banner}")
-                self.model = base_model
-
-    def _create_prompt(self, code: str) -> str:
-        """Create prompt based on model type."""
-        if self.model_type == "codet5":
-            return f"""Analyze this code for bugs, performance issues, and security concerns:
-
-{code}
-
-Analysis:"""
-
-        elif self.model_type == "deepseek-finetuned":
-            return f"""<s>[INST] Analyze this code for bugs, performance, and security issues. Give a quality score from 1-100 and provide a detailed analysis.
-
-Code:
-```
-{code}
-``` [/INST]"""
-
-        return code
-
-    def _extract_response(self, full_text: str, prompt: str) -> str:
-        """Extract the actual response from generated text."""
-        if self.model_type == "codet5":
-            # Remove prompt from response
-            return full_text[len(prompt):].strip()
-
-        elif self.model_type == "deepseek-finetuned":
-            # Extract response after [/INST]
-            if '[/INST]' in full_text:
-                return full_text.split('[/INST]')[-1].strip()
-            return full_text.strip()
-
-        return full_text.strip()
-
-    def _get_cache_key(self, code: str) -> str:
-        """Generate cache key for code."""
-        # Include model type in cache key
-        combined = f"{self.model_type}:{code}"
-        return hashlib.md5(combined.encode()).hexdigest()
-
-    def _load_cache(self):
-        """Load cached results from disk."""
-        cache_file = os.path.join(self.cache_dir, "analysis_cache_enhanced.json")
-        if os.path.exists(cache_file):
-            try:
-                with open(cache_file, 'r') as f:
-                    self.cache = json.load(f)
-                print(f"📁 Loaded {len(self.cache)} cached analyses")
-            except Exception:
-                self.cache = {}
-
-    def _save_cache(self):
-        """Save cache to disk."""
-        cache_file = os.path.join(self.cache_dir, "analysis_cache_enhanced.json")
-        with open(cache_file, 'w') as f:
-            json.dump(self.cache, f)
-
-    def _check_cache(self, code: str) -> Optional[Dict[str, Any]]:
-        """Check if analysis is cached."""
-        cache_key = self._get_cache_key(code)
-        return self.cache.get(cache_key)
-
-    def _save_to_cache(self, code: str, result: Dict[str, Any]):
-        """Save analysis result to cache."""
-        cache_key = self._get_cache_key(code)
-        self.cache[cache_key] = result
-        self._save_cache()
-
-    def analyze_code_streaming(
-        self,
-        code: str,
-        show_progress: bool = True,
-        mode: str = "detailed",
-    ) -> Generator[str, None, Dict[str, Any]]:
-        """
-        Analyze code with streaming response and progress indicators.
-
-        Args:
-            code: Code to analyze
-            show_progress: Whether to show progress indicators
-            mode: Analysis mode ("quick" or "detailed")
-
-        Yields:
-            str: Partial analysis results
-        """
-        # Check cache first
-        cached_result = self._check_cache(code)
-        if cached_result:
-            print("⚡ Using cached result!")
-            yield cached_result["analysis"]
-            return cached_result
-
-        # Load model if not loaded
-        self._load_model()
-
-        # Create prompt
-        prompt = self._create_prompt(code)
-
-        # Tokenize input
-        inputs = self.tokenizer(
-            prompt,
-            return_tensors="pt",
-            truncation=True,
-            max_length=1024 if self.model_type == "deepseek-finetuned" else 512,
-            padding=True,
-        )
-        device = next(self.model.parameters()).device
-        inputs = {k: v.to(device) for k, v in inputs.items()}
-
-        # Generate analysis
-        start_time = time.time()
-
-        if show_progress:
-            print(f"🔍 Analyzing code with {self.model_type}...")
-            progress_bar = tqdm(total=100, desc="Analysis Progress")
-
-        try:
-            with torch.no_grad():
-                max_new = self.detailed_max_new_tokens if mode == "detailed" else self.quick_max_new_tokens
-
-                # Generation parameters based on model type
-                if self.model_type == "codet5":
-                    num_beams = 2 if mode == "detailed" else 1
-                    outputs = self.model.generate(
-                        inputs["input_ids"],
-                        attention_mask=inputs.get("attention_mask"),
-                        max_new_tokens=max_new,
-                        num_beams=num_beams,
-                        do_sample=False,
-                        pad_token_id=self.tokenizer.eos_token_id,
-                        use_cache=True,
-                    )
-                else:  # deepseek-finetuned
-                    outputs = self.model.generate(
-                        **inputs,
-                        max_new_tokens=max_new,
-                        temperature=0.7,
-                        do_sample=True,
-                        pad_token_id=self.tokenizer.eos_token_id,
-                        eos_token_id=self.tokenizer.eos_token_id,
-                    )
-
-            if show_progress:
-                progress_bar.update(50)
-
-            # Decode analysis
-            full_text = self.tokenizer.decode(outputs[0], skip_special_tokens=True)
-            analysis_text = self._extract_response(full_text, prompt)
-
-            if show_progress:
-                progress_bar.update(50)
-                progress_bar.close()
-
-            # Calculate quality score
-            quality_score = self._calculate_quality_score(analysis_text)
-
-            total_time = time.time() - start_time
-
-            # Create result
-            result = {
-                "analysis": analysis_text,
-                "quality_score": quality_score,
-                "execution_time": total_time,
-                "model": self.model_id,
-                "model_type": self.model_type,
-                "cached": False
-            }
-
-            # Save to cache
-            self._save_to_cache(code, result)
-
-            # Yield the analysis
-            yield analysis_text
-
-            return result
-
-        except Exception as e:
-            if show_progress:
-                progress_bar.close()
-            raise e
-
-    def analyze_code_fast(self, code: str, mode: str = "quick") -> Dict[str, Any]:
-        """
-        Fast analysis without streaming (for batch processing).
-
-        Args:
-            code: Code to analyze
-            mode: Analysis mode ("quick" or "detailed")
-
-        Returns:
-            Dict: Analysis result
-        """
-        # Check if using remote model
-        if self.remote_api_url:
-            return self.analyze_code_remote(code, mode)
-
-        # Check cache first
-        cached_result = self._check_cache(code)
-        if cached_result:
-            cached_result["cached"] = True
-            return cached_result
-
-        # Load model if not loaded
-        self._load_model()
-
-        # Create prompt
-        prompt = self._create_prompt(code)
-
-        # Tokenize input
-        inputs = self.tokenizer(
-            prompt,
-            return_tensors="pt",
-            truncation=True,
-            max_length=1024 if self.model_type == "deepseek-finetuned" else 512,
-            padding=True,
-        )
-        device = next(self.model.parameters()).device
-        inputs = {k: v.to(device) for k, v in inputs.items()}
-
-        # Generate analysis
-        start_time = time.time()
-
-        with torch.no_grad():
-            max_new = self.quick_max_new_tokens if mode == "quick" else self.detailed_max_new_tokens
-
-            if self.model_type == "codet5":
-                num_beams = 1 if mode == "quick" else 2
-                outputs = self.model.generate(
-                    inputs["input_ids"],
-                    attention_mask=inputs.get("attention_mask"),
-                    max_new_tokens=max_new,
-                    num_beams=num_beams,
-                    do_sample=False,
-                    pad_token_id=self.tokenizer.eos_token_id,
-                    use_cache=True,
-                )
-            else:  # deepseek-finetuned
-                outputs = self.model.generate(
-                    **inputs,
-                    max_new_tokens=max_new,
-                    temperature=0.7,
-                    do_sample=True,
-                    pad_token_id=self.tokenizer.eos_token_id,
-                    eos_token_id=self.tokenizer.eos_token_id,
-                )
-
-        # Decode analysis
-        full_text = self.tokenizer.decode(outputs[0], skip_special_tokens=True)
-        analysis_text = self._extract_response(full_text, prompt)
-
-        # Calculate quality score
-        quality_score = self._calculate_quality_score(analysis_text)
-
-        total_time = time.time() - start_time
-
-        # Create result
-        result = {
-            "analysis": analysis_text,
-            "quality_score": quality_score,
-            "execution_time": total_time,
-            "model": self.model_id,
-            "model_type": self.model_type,
-            "cached": False
-        }
-
-        # Save to cache
-        self._save_to_cache(code, result)
-
-        return result
-
-    def _calculate_quality_score(self, analysis_text: str) -> int:
-        """
-        Calculate quality score for analysis.
-
-        Args:
-            analysis_text: Analysis text
-
-        Returns:
-            int: Quality score (0-100)
-        """
-        # Try to extract score from DeepSeek output first
-        if "Quality Score:" in analysis_text or "quality score" in analysis_text.lower():
-            import re
-            score_match = re.search(r'(?:Quality Score:|quality score:?)\s*(\d+)', analysis_text, re.IGNORECASE)
-            if score_match:
-                return int(score_match.group(1))
-
-        # Fall back to heuristic calculation
-        score = 0
-        analysis_lower = analysis_text.lower()
-
-        if any(word in analysis_lower for word in ['bug', 'error', 'issue', 'problem', 'flaw']):
-            score += 20
-
-        if any(word in analysis_lower for word in ['performance', 'slow', 'efficient', 'complexity', 'optimization']):
-            score += 20
-
-        if any(word in analysis_lower for word in ['security', 'vulnerability', 'safe', 'unsafe', 'risk']):
-            score += 20
-
-        if any(word in analysis_lower for word in ['suggest', 'improve', 'better', 'recommend', 'fix', 'solution']):
-            score += 20
-
-        if len(analysis_text) > 200:
-            score += 10
-        if len(analysis_text) > 500:
-            score += 10
-
-        return min(score, 100)
-
-    def analyze_code_remote(self, code: str, mode: str = "quick") -> Dict[str, Any]:
-        """Analyze code using a remote Hugging Face API."""
-        import requests
-
-        if not self.remote_api_url:
-            raise ValueError("No remote API URL configured")
-
-        cached_result = self._check_cache(code)
-        if cached_result:
-            cached_result["cached"] = True
-            return cached_result
-
-        start_time = time.time()
-
-        try:
-            max_tokens = self.quick_max_new_tokens if mode == "quick" else self.detailed_max_new_tokens
-
-            response = requests.post(
-                f"{self.remote_api_url}/analyze",
-                json={"code": code, "max_tokens": max_tokens},
-                timeout=60
-            )
-            response.raise_for_status()
-
-            data = response.json()
-            analysis_text = data["analysis"]
-
-            quality_score = self._calculate_quality_score(analysis_text)
-            total_time = time.time() - start_time
-
-            result = {
-                "analysis": analysis_text,
-                "quality_score": quality_score,
-                "execution_time": total_time,
-                "model": "fine-tuned-deepseek-remote",
-                "model_type": "deepseek-finetuned-remote",
-                "cached": False
-            }
-
-            self._save_to_cache(code, result)
-            return result
-
-        except Exception as e:
-            raise Exception(f"Remote analysis failed: {e}") from e
-
-    def get_model_info(self) -> Dict[str, Any]:
-        """Get information about the loaded model."""
-        if self.model is None:
-            return {"status": "Model not loaded"}
-
-        param_count = sum(p.numel() for p in self.model.parameters())
-        trainable_params = sum(p.numel() for p in self.model.parameters() if p.requires_grad)
-        device = next(self.model.parameters()).device
-
-        info = {
-            "model_id": self.model_id,
-            "model_type": self.model_type,
-            "parameters": param_count,
-            "trainable_parameters": trainable_params,
-            "device": str(device),
-            "precision": self.precision,
-            "quick_max_new_tokens": self.quick_max_new_tokens,
-            "detailed_max_new_tokens": self.detailed_max_new_tokens,
-            "cache_size": len(self.cache)
-        }
-
-        if self.adapter_path:
-            info["adapter_path"] = self.adapter_path
-            info["using_adapters"] = os.path.exists(self.adapter_path)
-
-        return info
-
-
-def main():
-    """Demo of the enhanced analyzer."""
-    print("🚀 Enhanced Code Analyzer Demo")
-    print("=" * 60)
-
-    # Test code
-    test_code = """
-def calculate_fibonacci(n):
549
- if n <= 0:
550
- return 0
551
- elif n == 1:
552
- return 1
553
- else:
554
- return calculate_fibonacci(n-1) + calculate_fibonacci(n-2)
555
-
556
- result = calculate_fibonacci(35)
557
- print(result)
558
- """
559
-
560
- print(f"Test Code:\n{test_code}")
561
- print("=" * 60)
562
-
563
- # Test with CodeT5+
564
- print("\n🔵 Testing with CodeT5+ (Base Model):")
565
- print("-" * 60)
566
- analyzer_codet5 = EnhancedCodeAnalyzer(model_type="codet5")
567
- result_codet5 = analyzer_codet5.analyze_code_fast(test_code)
568
- print(f"Analysis: {result_codet5['analysis'][:300]}...")
569
- print(f"Quality Score: {result_codet5['quality_score']}/100")
570
- print(f"Execution Time: {result_codet5['execution_time']:.2f}s")
571
-
572
- # Test with Fine-tuned DeepSeek
573
- print("\n🟢 Testing with Fine-tuned DeepSeek:")
574
- print("-" * 60)
575
- analyzer_deepseek = EnhancedCodeAnalyzer(
576
- model_type="deepseek-finetuned",
577
- adapter_path="./fine-tuned-analyst"
578
- )
579
- result_deepseek = analyzer_deepseek.analyze_code_fast(test_code)
580
- print(f"Analysis: {result_deepseek['analysis'][:300]}...")
581
- print(f"Quality Score: {result_deepseek['quality_score']}/100")
582
- print(f"Execution Time: {result_deepseek['execution_time']:.2f}s")
583
-
584
- # Show model comparison
585
- print("\n📊 Model Comparison:")
586
- print("-" * 60)
587
- print(f"CodeT5+ Info: {analyzer_codet5.get_model_info()}")
588
- print(f"DeepSeek Info: {analyzer_deepseek.get_model_info()}")
589
-
590
-
591
- if __name__ == "__main__":
592
- main()
593
-
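The score-extraction branch of the removed `_calculate_quality_score` can be exercised in isolation. This is a minimal sketch that reuses the same regex; the standalone name `extract_quality_score` is illustrative, not part of the repo:

```python
import re

def extract_quality_score(analysis_text: str):
    """Pull an explicit 'Quality Score: NN' out of model output, if present.

    Mirrors the regex from the removed _calculate_quality_score method;
    returns None when no explicit score is found (the method then fell
    back to its keyword heuristic).
    """
    match = re.search(r'(?:Quality Score:|quality score:?)\s*(\d+)',
                      analysis_text, re.IGNORECASE)
    return int(match.group(1)) if match else None

print(extract_quality_score("Quality Score: 85 - solid error handling"))  # 85
print(extract_quality_score("no explicit score here"))  # None
```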
render.yaml DELETED
@@ -1,17 +0,0 @@
-services:
-  - type: web
-    name: ai-code-analyzer
-    env: python
-    repo: https://github.com/arun3676/ai-code-analyzer.git
-    branch: master
-    buildCommand: pip install -r requirements.txt
-    startCommand: streamlit run matrix_final.py --server.port=$PORT --server.address=0.0.0.0 --server.headless=true --server.enableCORS=false
-    plan: free
-    healthCheckPath: /?health=true
-    envVars:
-      - key: OPENAI_API_KEY
-        sync: false
-      - key: ANTHROPIC_API_KEY
-        sync: false
-      - key: DEEPSEEK_API_KEY
-        sync: false
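With render.yaml gone, Render's `startCommand` has to be reproduced inside the Docker image instead. A hedged sketch of what the equivalent `CMD` could look like for a HuggingFace Space, which routes traffic to port 7860 by default rather than Render's `$PORT`; the actual Dockerfile in this commit may differ:

```dockerfile
# Hypothetical tail of the Dockerfile, standing in for render.yaml's startCommand.
EXPOSE 7860
CMD ["streamlit", "run", "matrix_final.py", \
     "--server.port=7860", \
     "--server.address=0.0.0.0", \
     "--server.headless=true", \
     "--server.enableCORS=false"]
```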
requirements.txt CHANGED
@@ -4,12 +4,8 @@ anthropic>=0.25.0
 python-dotenv>=1.0.0
 requests>=2.32.0
 typing-extensions>=4.0.0
-
-# Fine-tuning libraries
-torch>=2.0.0
+huggingface-hub>=0.20.0
 transformers>=4.35.0
-datasets>=2.14.0
-peft>=0.6.0
-trl>=0.7.0
-bitsandbytes>=0.41.0
-accelerate>=0.24.0
+torch>=2.0.0
+accelerate>=0.24.0
+pygments>=2.17.0
run_app.py ADDED
@@ -0,0 +1,51 @@
+#!/usr/bin/env python3
+"""
+AI Code Analyzer - Local Startup Script
+
+This script starts the Matrix Code Analyzer application locally.
+The app will be available at http://localhost:8501/
+
+Usage:
+    python run_app.py
+
+Or directly with Streamlit:
+    streamlit run matrix_final.py
+"""
+
+import subprocess
+import sys
+import os
+
+def main():
+    """Start the AI Code Analyzer application."""
+    print("Starting AI Code Analyzer...")
+    print("The app will be available at: http://localhost:8501/")
+    print("Press Ctrl+C to stop the application")
+    print("-" * 50)
+
+    try:
+        # Check if streamlit is installed
+        subprocess.run([sys.executable, "-c", "import streamlit"], check=True, capture_output=True)
+    except subprocess.CalledProcessError:
+        print("Streamlit is not installed. Please install it with:")
+        print("  pip install streamlit")
+        sys.exit(1)
+
+    # Start the Streamlit app
+    try:
+        subprocess.run([
+            sys.executable, "-m", "streamlit", "run", "matrix_final.py",
+            "--server.port", "8501",
+            "--server.address", "localhost"
+        ], check=True)
+    except KeyboardInterrupt:
+        print("\nApplication stopped by user")
+    except subprocess.CalledProcessError as e:
+        print(f"Error starting application: {e}")
+        sys.exit(1)
+    except Exception as e:
+        print(f"An unexpected error occurred: {e}")
+        sys.exit(1)
+
+if __name__ == "__main__":
+    main()
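The pre-flight check in run_app.py (spawning a child interpreter to test an import before launching) generalizes to any dependency; a failure in the child cannot crash the parent process. A small self-contained sketch, with `module_available` as an illustrative name:

```python
import subprocess
import sys

def module_available(name: str) -> bool:
    """Replicates run_app.py's streamlit check: attempt the import in a
    fresh interpreter and report whether it succeeded."""
    result = subprocess.run(
        [sys.executable, "-c", f"import {name}"],
        capture_output=True,
    )
    return result.returncode == 0

print(module_available("json"))  # stdlib module, always importable: True
print(module_available("not_a_real_module_xyz"))  # False
```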
runtime.txt DELETED
@@ -1 +0,0 @@
-python-3.11.9