arun3676 committed on
Commit
b7db63d
·
1 Parent(s): cf986ba

Configure for HuggingFace Spaces Docker deployment - Add Dockerfile, .dockerignore, update README with HF metadata, optimize requirements.txt

.dockerignore ADDED
@@ -0,0 +1,84 @@
+ # Environment and secrets
+ .env
+ .env.local
+ .env.*.local
+ *.env
+
+ # Git
+ .git
+ .gitignore
+ .gitattributes
+
+ # Python
+ __pycache__/
+ *.py[cod]
+ *$py.class
+ *.so
+ .Python
+ build/
+ develop-eggs/
+ dist/
+ downloads/
+ eggs/
+ .eggs/
+ lib/
+ lib64/
+ parts/
+ sdist/
+ var/
+ wheels/
+ pip-wheel-metadata/
+ share/python-wheels/
+ *.egg-info/
+ .installed.cfg
+ *.egg
+
+ # Virtual environments
+ venv/
+ .venv/
+ ENV/
+ env/
+ env.bak/
+ venv.bak/
+
+ # Testing
+ tests/
+ test_*.py
+ .pytest_cache/
+ .coverage
+ .tox/
+ .nox/
+ htmlcov/
+ .hypothesis/
+
+ # Documentation
+ *.md
+ !README.md
+ ARCHITECTURE.md
+ PROJECT_STRUCTURE.md
+ TESTING_GUIDE.md
+ HUGGINGFACE_SETUP_GUIDE.md
+ MERCURY_API_SETUP.md
+ *.txt
+ !requirements.txt
+
+ # IDE
+ .vscode/
+ .idea/
+ *.swp
+ *.swo
+ *~
+ .DS_Store
+
+ # Deployment plans
+ *.plan.md
+
+ # Setup and test scripts
+ setup_huggingface.py
+ test_api_connections.py
+ test_huggingface_setup.py
+
+ # Cache
+ .cache/
+ *.cache
+
Dockerfile ADDED
@@ -0,0 +1,41 @@
+ FROM python:3.11-slim
+
+ WORKDIR /app
+
+ # Install system dependencies (curl is required by the HEALTHCHECK below)
+ RUN apt-get update && apt-get install -y \
+     build-essential \
+     git curl \
+     && rm -rf /var/lib/apt/lists/*
+
+ # Copy requirements first for better Docker layer caching
+ COPY requirements.txt .
+
+ # Install Python dependencies
+ RUN pip install --no-cache-dir -r requirements.txt
+
+ # Copy application code
+ COPY analyzer/ ./analyzer/
+ COPY matrix_final.py .
+ COPY run_app.py .
+ COPY evaluation_samples/ ./evaluation_samples/
+
+ # Create cache directory
+ RUN mkdir -p /app/.cache
+
+ # Expose HuggingFace Spaces default port
+ EXPOSE 7860
+
+ # Health check against Streamlit's built-in health endpoint
+ HEALTHCHECK --interval=30s --timeout=10s --start-period=40s --retries=3 \
+     CMD curl -f http://localhost:7860/_stcore/health || exit 1
+
+ # Run Streamlit with HF Spaces compatible settings
+ CMD ["streamlit", "run", "matrix_final.py", \
+      "--server.port=7860", \
+      "--server.address=0.0.0.0", \
+      "--server.headless=true", \
+      "--server.enableCORS=false", \
+      "--server.enableXsrfProtection=false", \
+      "--browser.gatherUsageStats=false"]
+
INTEGRATION_GUIDE.md DELETED
@@ -1,368 +0,0 @@
- # Integration Guide: Fine-tuned Model with Your Code Analyzer
-
- This guide explains how to integrate your fine-tuned DeepSeek model with the existing code analyzer app.
-
- ## 📋 What You Have Now
-
- After completing the Colab training, you have:
- ✅ Fine-tuned DeepSeek model adapters (~20MB)
- ✅ Enhanced analyzer class supporting both models
- ✅ Original CodeT5+ model still working
- ✅ All existing UI features preserved
-
- ## 🔄 Integration Steps
-
- ### Step 1: Download Your Fine-tuned Model from Colab
-
- In your final Colab cell, you saved the model to Google Drive. Now download it:
-
- **Option A: From Google Drive**
- 1. Go to Google Drive → `MyDrive/ai-code-analyzer/`
- 2. Download the `fine-tuned-analyst` folder
- 3. Place it in your project root: `C:\Users\arunk\professional\ai-code-analyzer\fine-tuned-analyst\`
-
- **Option B: Download Directly from Colab**
- ```python
- # Run this in Colab to create a downloadable ZIP
- import shutil
- shutil.make_archive('fine-tuned-analyst', 'zip', './fine-tuned-analyst')
-
- from google.colab import files
- files.download('fine-tuned-analyst.zip')
- ```
-
- Then extract the ZIP in your project root.
-
- ### Step 2: Install Required Dependencies
-
- Update your `requirements.txt` to include PEFT:
-
- ```bash
- # Add this line to requirements.txt
- peft>=0.7.0
- ```
-
- Install it:
- ```bash
- pip install peft
- ```
-
- ### Step 3: Test the Enhanced Analyzer Locally
-
- Run the test script to verify everything works:
-
- ```bash
- python optimized_code_analyzer_enhanced.py
- ```
-
- You should see:
- ✅ CodeT5+ analysis
- ✅ Fine-tuned DeepSeek analysis
- ✅ Model comparison
-
- ### Step 4: Update Your Streamlit UI
-
- Replace the analyzer import in `matrix_final.py`:
-
- **Find this (around line 8):**
- ```python
- from optimized_code_analyzer import OptimizedCodeAnalyzer
- ```
-
- **Replace with:**
- ```python
- from optimized_code_analyzer_enhanced import EnhancedCodeAnalyzer
- ```
-
- **Find this (around line 287):**
- ```python
- @st.cache_resource
- def get_local_analyzer():
-     return OptimizedCodeAnalyzer(
-         model_id="Salesforce/codet5p-220m",
-         precision="fp16",
-         quick_max_new_tokens=180,
-         detailed_max_new_tokens=240,
-     )
- ```
-
- **Replace with:**
- ```python
- @st.cache_resource
- def get_local_analyzer(model_type="codet5"):
-     return EnhancedCodeAnalyzer(
-         model_type=model_type,
-         precision="fp16",
-         quick_max_new_tokens=180,
-         detailed_max_new_tokens=300,
-     )
- ```
-
- ### Step 5: Add Model Selector to Sidebar
-
- Add this to your sidebar (around line 490, in the sidebar section):
-
- ```python
- # Model Selection
- st.sidebar.markdown("---")
- st.sidebar.markdown("### 🤖 AI Model Selection")
- model_choice = st.sidebar.radio(
-     "Choose Analysis Model:",
-     ["CodeT5+ (Fast)", "Fine-tuned DeepSeek (Accurate)"],
-     help="CodeT5+ is faster; the fine-tuned model gives more detailed analysis"
- )
-
- model_type = "codet5" if "CodeT5+" in model_choice else "deepseek-finetuned"
- ```
-
- ### Step 6: Update the Analysis Call
-
- Find where the analyzer is called (around line 600+) and update it:
-
- **Find something like:**
- ```python
- local_analyzer = get_local_analyzer()
- result = local_analyzer.analyze_code_fast(code)
- ```
-
- **Replace with:**
- ```python
- local_analyzer = get_local_analyzer(model_type)
- result = local_analyzer.analyze_code_fast(code)
- ```
-
- ### Step 7: Test Everything
-
- Run your Streamlit app:
- ```bash
- streamlit run matrix_final.py
- ```
-
- Test both models:
- 1. Select "CodeT5+ (Fast)" → Run analysis → Should work as before
- 2. Select "Fine-tuned DeepSeek (Accurate)" → Run analysis → Should give detailed analysis with quality scores
-
- ## 📊 What Each Model Does
-
- ### CodeT5+ (Base Model)
- **Speed**: ⚡ Fast (2-3 seconds)
- **Memory**: ~1GB
- **Analysis**: General code analysis
- **Best for**: Quick checks, batch processing
- **Quality**: Good for basic issues
-
- ### Fine-tuned DeepSeek (Your Model)
- **Speed**: 🚀 Moderate (3-5 seconds)
- **Memory**: ~1.5GB
- **Analysis**: Detailed with quality scores (1-100)
- **Best for**: Deep analysis, learning, production code
- **Quality**: Excellent - trained on your specific patterns
- **Output format**:
-   Quality Score (1-100)
-   Bugs section
-   Performance issues
-   Security concerns
-   Improvement suggestions with code examples
-
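The structured output format above is easy to post-process in the UI. A minimal sketch of a score extractor, assuming the model emits a line like `Quality Score: 35/100` (the exact label is whatever the training data used, so adjust the pattern to match your dataset):

```python
import re
from typing import Optional

def extract_quality_score(analysis: str) -> Optional[int]:
    """Pull the 1-100 quality score out of the fine-tuned model's text output.

    Returns None when no score line is found (e.g. plain CodeT5+ output),
    so the caller can fall back to displaying the raw text.
    """
    match = re.search(r"Quality Score:\s*(\d{1,3})\s*/\s*100", analysis)
    if not match:
        return None
    score = int(match.group(1))
    # Reject out-of-range numbers the model might hallucinate.
    return score if 0 <= score <= 100 else None

print(extract_quality_score("Quality Score: 35/100\n\nBUGS:\n- No error handling"))  # → 35
print(extract_quality_score("General analysis without a score"))  # → None
```

Returning `None` instead of raising keeps the same code path working for both models.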
- ## 🎯 Key Features of the Enhanced System
-
- ### 1. Dual Model Support
- Seamlessly switch between models
- Separate caching for each model
- No breaking changes to existing code
-
- ### 2. Improved Analysis Quality
- Your fine-tuned model provides:
- **Structured output**: Quality score, bugs, performance, security
- **Code examples**: Shows how to fix issues
- **Contextual understanding**: Trained on your dataset patterns
- **Consistent formatting**: Always includes all sections
-
- ### 3. Memory Efficient
- LoRA adapters are tiny (~20MB vs 1GB+ full model)
- Base model shared, adapters loaded on demand
- Can deploy both models without doubling memory
-
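The "separate caching for each model" behavior comes from `st.cache_resource` memoizing per argument value. Outside Streamlit, the same idea can be sketched with `functools.lru_cache` as a stand-in (the `EnhancedCodeAnalyzer` here is a dummy placeholder, not the real class):

```python
from functools import lru_cache

class EnhancedCodeAnalyzer:
    """Dummy stand-in for the real analyzer, for illustration only."""
    def __init__(self, model_type: str):
        self.model_type = model_type

@lru_cache(maxsize=None)  # like st.cache_resource: one cached instance per distinct argument
def get_local_analyzer(model_type: str = "codet5") -> EnhancedCodeAnalyzer:
    return EnhancedCodeAnalyzer(model_type=model_type)

a = get_local_analyzer("codet5")
b = get_local_analyzer("codet5")
c = get_local_analyzer("deepseek-finetuned")
print(a is b)  # → True: repeated calls reuse the cached instance
print(a is c)  # → False: each model_type gets its own cache slot
```

This is why passing `model_type` into `get_local_analyzer` keeps both models loaded side by side instead of reloading on every switch.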
- ## 🚀 Deployment Options
-
- ### Option 1: Local Deployment (Current)
- **Pros:**
- Free
- Fast
- Full control
- Easy testing
-
- **Cons:**
- Only you can use it
- Needs your computer running
-
- **Setup:** Already working! Just use Streamlit locally.
-
- ### Option 2: Hugging Face Spaces (Recommended)
- **Pros:**
- FREE hosting
- Automatic HTTPS
- Share with anyone
- GPU available (paid tier)
-
- **Setup:**
- 1. Create account on huggingface.co
- 2. Create new Space (Streamlit)
- 3. Upload files:
-    `matrix_final.py`
-    `optimized_code_analyzer_enhanced.py`
-    `requirements.txt`
-    `fine-tuned-analyst/` folder
- 4. Add `app.py`:
- ```python
- # app.py (for HF Spaces)
- import subprocess
- subprocess.run(["streamlit", "run", "matrix_final.py"])
- ```
-
- ### Option 3: Railway.app
- **Cost:** $5/month
- **Memory:** Up to 8GB
- **Speed:** Faster than HF Spaces
-
- **Setup:**
- 1. Connect GitHub repo
- 2. Set start command: `streamlit run matrix_final.py --server.port $PORT`
- 3. Deploy
-
- ### Option 4: Render.com
- **Cost:** FREE tier available
- **Memory:** 512MB (might be tight)
- **Speed:** Good
-
- **Setup:**
- 1. Connect repo
- 2. Use Docker:
- ```dockerfile
- FROM python:3.11-slim
- WORKDIR /app
- COPY requirements.txt .
- RUN pip install -r requirements.txt
- COPY . .
- CMD streamlit run matrix_final.py --server.port $PORT
- ```
-
- ## 🐛 Troubleshooting
-
- ### Issue: "fine-tuned-analyst folder not found"
- **Solution:** Make sure the folder is in your project root with these files:
- ```
- fine-tuned-analyst/
- ├── adapter_config.json
- ├── adapter_model.bin (or adapter_model.safetensors)
- ├── tokenizer_config.json
- └── special_tokens_map.json
- ```
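A quick pre-flight check for exactly this failure mode can be scripted with the standard library. This is a sketch under the assumption that the folder layout matches the listing above (adapter weights may be saved as either `.bin` or `.safetensors`):

```python
from pathlib import Path

REQUIRED = ["adapter_config.json", "tokenizer_config.json", "special_tokens_map.json"]
WEIGHTS = ["adapter_model.bin", "adapter_model.safetensors"]

def missing_adapter_files(folder: str = "fine-tuned-analyst") -> list[str]:
    """Return the filenames the adapter loader would fail to find."""
    root = Path(folder)
    missing = [name for name in REQUIRED if not (root / name).is_file()]
    # Either weights format is acceptable; only flag it when both are absent.
    if not any((root / name).is_file() for name in WEIGHTS):
        missing.append("adapter_model.bin (or adapter_model.safetensors)")
    return missing

if __name__ == "__main__":
    gaps = missing_adapter_files()
    print("OK" if not gaps else f"Missing: {gaps}")
```

Running this before `streamlit run matrix_final.py` turns a vague load error into a concrete list of missing files.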
-
- ### Issue: "PEFT not installed"
- **Solution:**
- ```bash
- pip install peft
- ```
-
- ### Issue: "Model too slow"
- **Solution:**
- Use "quick" mode instead of "detailed"
- Reduce `max_new_tokens` to 150
- Use INT8 or INT4 quantization
-
- ### Issue: "Out of memory"
- **Solution:**
- Close other applications
- Use CodeT5+ instead (smaller)
- Enable quantization: `precision="int8"`
-
- ## 📚 Understanding the Libraries Used
-
- ### Core Libraries
-
- **Transformers** (`transformers`)
- What: Hugging Face's library for AI models
- Does: Loads models, tokenizers, handles generation
- Used for: Loading DeepSeek and CodeT5+ models
-
- **PEFT** (`peft`)
- What: Parameter Efficient Fine-Tuning
- Does: Loads LoRA adapters efficiently
- Used for: Your fine-tuned model adapters
-
- **PyTorch** (`torch`)
- What: Deep learning framework
- Does: Runs neural networks on GPU/CPU
- Used for: Model inference, tensor operations
-
- **Streamlit** (`streamlit`)
- What: Web app framework for Python
- Does: Creates interactive UI
- Used for: Your code analyzer interface
-
- ### How They Work Together
-
- ```
- User Input (Streamlit)
-          ↓
- EnhancedCodeAnalyzer
-          ↓
- Transformers (loads base model)
-          ↓
- PEFT (loads adapters)
-          ↓
- PyTorch (runs inference)
-          ↓
- Result → Streamlit UI
- ```
-
- ## 🎓 Next Steps
-
- 1. **Test both models** with various code samples
- 2. **Compare quality** - which model works better for your use cases?
- 3. **Expand dataset** - Add more samples and retrain (only takes 20 minutes!)
- 4. **Deploy** - Choose a hosting option and share with others
- 5. **Iterate** - Collect feedback and improve
-
- ## 💡 Tips for Best Results
-
- ### When to Use CodeT5+
- Quick syntax checks
- Batch processing many files
- Resource-constrained environments
- Simple code reviews
-
- ### When to Use Fine-tuned DeepSeek
- Production code reviews
- Learning/education
- Complex analysis needed
- When quality > speed
- Security audits
-
- ## 🎉 Congratulations!
-
- You've successfully:
- ✅ Fine-tuned a language model
- ✅ Integrated it with your app
- ✅ Created a dual-model system
- ✅ Learned about model deployment
- ✅ Built a production-ready tool
-
- Your code analyzer now has:
- **2 AI models** to choose from
- **Professional quality** analysis
- **Scalable architecture** for future improvements
- **Production-ready** code
-
- ## 📞 Support
-
- If you need help:
- 1. Check error messages carefully
- 2. Review this guide
- 3. Test with simple code first
- 4. Compare with working examples
- 5. Ask for help with specific errors
-
- Happy coding! 🚀
-
INTEGRATION_SUMMARY.md DELETED
@@ -1,253 +0,0 @@
- # Integration Complete! 🎉
-
- ## What We've Built
-
- You now have a **dual-model code analyzer** that supports both:
- 1. **CodeT5+ (Base)** - Fast, lightweight analysis
- 2. **Fine-tuned DeepSeek** - Detailed, accurate analysis with quality scores
-
- ## Files Created
-
- ### ✅ Core Files
- 1. **`optimized_code_analyzer_enhanced.py`** - New analyzer supporting both models
- 2. **`INTEGRATION_GUIDE.md`** - Complete step-by-step integration instructions
- 3. **`test_finetuned_local.py`** - Test script to verify everything works
- 4. **`INTEGRATION_SUMMARY.md`** - This file (quick reference)
-
- ### 📦 What You Need from Colab
- `fine-tuned-analyst/` folder from Google Drive containing your trained model adapters
-
- ## Quick Start (3 Steps)
-
- ### Step 1: Get Your Model
- Download the `fine-tuned-analyst` folder from Google Drive (saved in Cell 9 of Colab) and place it in your project root:
- ```
- C:\Users\arunk\professional\ai-code-analyzer\fine-tuned-analyst\
- ```
-
- ### Step 2: Install Dependencies
- ```bash
- pip install peft
- ```
-
- ### Step 3: Test It
- ```bash
- python test_finetuned_local.py
- ```
-
- If all checks pass ✅, you're ready to integrate with your UI!
-
- ## Integration with Streamlit UI
-
- ### Quick Changes to `matrix_final.py`
-
- **Change 1: Update Import (Line ~8)**
- ```python
- # OLD:
- from optimized_code_analyzer import OptimizedCodeAnalyzer
-
- # NEW:
- from optimized_code_analyzer_enhanced import EnhancedCodeAnalyzer
- ```
-
- **Change 2: Update Analyzer Function (Line ~287)**
- ```python
- # OLD:
- @st.cache_resource
- def get_local_analyzer():
-     return OptimizedCodeAnalyzer(
-         model_id="Salesforce/codet5p-220m",
-         precision="fp16",
-         quick_max_new_tokens=180,
-         detailed_max_new_tokens=240,
-     )
-
- # NEW:
- @st.cache_resource
- def get_local_analyzer(model_type="codet5"):
-     return EnhancedCodeAnalyzer(
-         model_type=model_type,
-         precision="fp16",
-         quick_max_new_tokens=180,
-         detailed_max_new_tokens=300,
-     )
- ```
-
- **Change 3: Add Model Selector to Sidebar (Add after line ~490)**
- ```python
- # Model Selection
- st.sidebar.markdown("---")
- st.sidebar.markdown("### 🤖 AI Model Selection")
- model_choice = st.sidebar.radio(
-     "Choose Analysis Model:",
-     ["CodeT5+ (Fast)", "Fine-tuned DeepSeek (Accurate)"],
-     help="CodeT5+ is faster; the fine-tuned model gives more detailed analysis"
- )
-
- model_type = "codet5" if "CodeT5+" in model_choice else "deepseek-finetuned"
- ```
-
- **Change 4: Update Analysis Calls**
- Find places where the analyzer is called and add the `model_type` parameter:
- ```python
- # OLD:
- local_analyzer = get_local_analyzer()
-
- # NEW:
- local_analyzer = get_local_analyzer(model_type)
- ```
-
- ## What Each Model Does
-
- ### CodeT5+ (Fast) ⚡
- **Speed**: 2-3 seconds
- **Memory**: ~1GB
- **Output**: General analysis
- **Best for**: Quick checks
-
- ### Fine-tuned DeepSeek (Accurate) 🎯
- **Speed**: 3-5 seconds
- **Memory**: ~1.5GB
- **Output**:
-   Quality score (e.g., 35/100)
-   Bugs section with specifics
-   Performance issues
-   Security concerns
-   Improvement suggestions with code examples
- **Best for**: Production code, learning, detailed reviews
-
- ## Example Output
-
- Your fine-tuned model gives structured output like:
-
- ```
- Quality Score: 35/100
-
- BUGS:
- - No error handling
- - Infinite recursion possible
-
- PERFORMANCE ISSUES:
- - Recursive calls cause exponential time complexity
-
- SECURITY CONCERNS:
- - No input validation
-
- IMPROVEMENTS:
- 1. Use memoization to avoid recursion
- 2. Add input validation
-
- Example improved code:
- [Shows working code with fixes]
- ```
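The memoization fix the report suggests, applied to a naive recursive Fibonacci (the snippet under review isn't shown, so this is an illustrative version, not the model's actual output):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def fib(n: int) -> int:
    """Memoized Fibonacci: O(n) calls instead of the exponential naive recursion."""
    if n < 0:
        raise ValueError("n must be non-negative")  # input validation, per IMPROVEMENTS
    if n < 2:
        return n
    return fib(n - 1) + fib(n - 2)

print(fib(30))  # → 832040, computed instantly
```

Without the cache, `fib(30)` would recompute subproblems roughly 2.7 million times; with it, each value is computed once.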
-
- ## Testing Checklist
-
- [ ] Run `test_finetuned_local.py` - all checks pass
- [ ] Update `matrix_final.py` imports
- [ ] Add model selector to sidebar
- [ ] Test with Streamlit: `streamlit run matrix_final.py`
- [ ] Try both models with sample code
- [ ] Verify quality scores appear for fine-tuned model
- [ ] Check caching works for both models
-
- ## Troubleshooting
-
- | Issue | Solution |
- |-------|----------|
- | "fine-tuned-analyst not found" | Download from Google Drive, place in project root |
- | "PEFT not installed" | Run `pip install peft` |
- | "Model too slow" | Use CodeT5+ or enable quantization |
- | "Out of memory" | Close other apps or use CodeT5+ |
-
- ## What You Learned
-
- ✅ **Fine-tuning LLMs** with LoRA/QLoRA
- ✅ **Google Colab** for GPU training
- ✅ **Model integration** with existing apps
- ✅ **Dual-model architecture** for flexibility
- ✅ **Production deployment** considerations
-
- ## Next Steps (Choose One)
-
- ### Option A: Deploy Locally (Easiest)
- Just run `streamlit run matrix_final.py` - you're done!
-
- ### Option B: Deploy to Cloud (Share with Others)
- 1. **Hugging Face Spaces** (FREE) - Follow INTEGRATION_GUIDE.md
- 2. **Railway.app** ($5/month) - Best performance
- 3. **Render.com** (FREE tier) - Good alternative
-
- ### Option C: Improve the Model
- 1. Add more training samples (up to 150-200)
- 2. Retrain in Colab (only takes 20 minutes!)
- 3. Test new version
- 4. Deploy updated model
-
- ## Files Structure
-
- ```
- ai-code-analyzer/
- ├── optimized_code_analyzer.py           # Original (keep for reference)
- ├── optimized_code_analyzer_enhanced.py  # NEW - supports both models
- ├── matrix_final.py                      # Update this file
- ├── test_finetuned_local.py              # NEW - test script
- ├── INTEGRATION_GUIDE.md                 # NEW - detailed guide
- ├── INTEGRATION_SUMMARY.md               # NEW - this file
- ├── analyst_dataset_expanded.jsonl       # Your training data
- ├── requirements.txt                     # Add 'peft' here
- └── fine-tuned-analyst/                  # Download from Colab
-     ├── adapter_config.json
-     ├── adapter_model.bin
-     └── ... (other files)
- ```
-
- ## Support
-
- If you need help:
- 1. Check `INTEGRATION_GUIDE.md` for detailed instructions
- 2. Run `test_finetuned_local.py` to diagnose issues
- 3. Check error messages for specific problems
- 4. Test with simple code first before complex examples
-
- ## Success Criteria
-
- You'll know everything is working when:
- ✅ Test script passes all checks
- ✅ Streamlit app loads without errors
- ✅ Can switch between models in sidebar
- ✅ CodeT5+ gives fast analysis
- ✅ Fine-tuned model gives quality scores and detailed output
- ✅ Both models use separate caches
-
- ## Congratulations! 🎉
-
- You've successfully:
- ✅ Fine-tuned a language model on Google Colab
- ✅ Created a production-ready code analyzer
- ✅ Integrated AI models with a web app
- ✅ Built a dual-model system for flexibility
- ✅ Learned about LoRA, PEFT, and model deployment
-
- Your code analyzer is now **powered by AI you trained yourself**! 🚀
-
- ---
-
- **Quick Reference Commands:**
- ```bash
- # Test integration
- python test_finetuned_local.py
-
- # Run app
- streamlit run matrix_final.py
-
- # Install dependencies
- pip install peft
-
- # Check if adapters exist
- dir fine-tuned-analyst        # Windows
- ls fine-tuned-analyst/        # Linux/Mac
- ```
-
- **Need Help?** See `INTEGRATION_GUIDE.md` for complete instructions.
-
MCP_SETUP_GUIDE.md DELETED
@@ -1,142 +0,0 @@
- # MCP Tools Setup Guide for AI Code Analyzer Project
-
- ## 🎯 Configured MCP Tools
-
- I've set up the following MCP tools in your `~/.cursor/mcp.json` file:
-
- ### **Essential for ML/AI Projects:**
-
- 1. **🤗 Hugging Face** - Model management and dataset access
-    **API Key Location**: https://huggingface.co/settings/tokens
-    **Required Scope**: `read` (minimum), `write` (if you want to upload models)
-    **Replace**: `YOUR_HUGGINGFACE_API_KEY_HERE`
-
- 2. **🗄️ DuckDB** - Local analytics database (no API key needed)
-    **Perfect for**: Analyzing your `analyst_dataset.jsonl` file
-    **No setup required** - ready to use!
-
- ### **Development & Deployment:**
-
- 3. **🐙 GitHub** - Version control and collaboration
-    **API Key Location**: https://github.com/settings/tokens
-    **Required Scope**: `repo`, `workflow`, `read:org`
-    **Replace**: `YOUR_GITHUB_TOKEN_HERE`
-
- 4. **🚀 Vercel** - Deploy web interfaces
-    **API Key Location**: https://vercel.com/account/tokens
-    **Replace**: `YOUR_VERCEL_TOKEN_HERE`
-
- 5. **🚂 Railway** - Full-stack deployment
-    **API Key Location**: https://railway.app/account/tokens
-    **Replace**: `YOUR_RAILWAY_TOKEN_HERE`
-
- ### **Data Storage:**
-
- 6. **🍃 MongoDB** - Database for structured data
-    **Connection String**: Get from MongoDB Atlas (https://cloud.mongodb.com/)
-    **Format**: `mongodb+srv://username:password@cluster.mongodb.net/database`
-    **Replace**: `YOUR_MONGODB_CONNECTION_STRING_HERE`
-
- ### **Monitoring & Error Tracking:**
-
- 7. **🚨 Sentry** - Error tracking and performance monitoring
-    **Auth Token**: https://sentry.io/settings/account/api/auth-tokens/
-    **Organization**: Your Sentry org slug
-    **Project**: Your project slug
-    **Replace**:
-      `YOUR_SENTRY_AUTH_TOKEN_HERE`
-      `YOUR_SENTRY_ORG_HERE`
-      `YOUR_SENTRY_PROJECT_HERE`
-
- ### **Security & Code Quality:**
-
- 8. **🔒 Snyk** - Vulnerability scanning
-    **API Key Location**: https://app.snyk.io/account
-    **Replace**: `YOUR_SNYK_TOKEN_HERE`
-
- 9. **🔍 Semgrep** - Static analysis (no API key needed)
-    **Ready to use** - no setup required!
-
- 10. **🏗️ SonarQube** - Code analysis
-    **Setup**: You need a SonarQube instance (cloud or self-hosted)
-    **Token**: Generate in your SonarQube instance
-    **Replace**:
-      `YOUR_SONARQUBE_URL_HERE` (e.g., `https://your-org.sonarcloud.io`)
-      `YOUR_SONARQUBE_TOKEN_HERE`
-
- ## 🚀 Quick Start (Recommended Order)
-
- ### **Phase 1: Essential Tools (Start Here)**
- 1. **Hugging Face** - Most important for your ML project
- 2. **DuckDB** - Already ready, no API key needed
- 3. **GitHub** - For version control
-
- ### **Phase 2: Development Tools**
- 4. **Vercel** or **Railway** - For deployment
- 5. **Sentry** - For monitoring
-
- ### **Phase 3: Advanced Tools (Optional)**
- 6. **MongoDB** - If you need structured data storage
- 7. **Snyk** - For security scanning
- 8. **Semgrep** - For static analysis
- 9. **SonarQube** - For comprehensive code analysis
-
- ## 📝 How to Add API Keys
-
- 1. **Open your MCP config**: `c:\Users\arunk\.cursor\mcp.json`
- 2. **Find the placeholder** (e.g., `YOUR_HUGGINGFACE_API_KEY_HERE`)
- 3. **Replace with your actual API key/token**
- 4. **Save the file**
- 5. **Restart Cursor** for changes to take effect
-
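For reference, a filled-in entry in `mcp.json` typically looks something like the fragment below. The server name, package, and environment variable are illustrative placeholders; keep whatever names your existing config already uses:

```json
{
  "mcpServers": {
    "github": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-github"],
      "env": {
        "GITHUB_PERSONAL_ACCESS_TOKEN": "ghp_yourActualTokenHere"
      }
    }
  }
}
```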
-
- ## 🔧 Installation Commands
-
- Run these commands to ensure all MCP servers are installed:
-
- ```powershell
- # Install all MCP servers
- npm install -g @modelcontextprotocol/server-huggingface
- npm install -g @modelcontextprotocol/server-github
- npm install -g @modelcontextprotocol/server-duckdb
- npm install -g @modelcontextprotocol/server-sentry
- npm install -g @modelcontextprotocol/server-vercel
- npm install -g @modelcontextprotocol/server-mongodb
- npm install -g @modelcontextprotocol/server-railway
- npm install -g @modelcontextprotocol/server-snyk
- npm install -g @modelcontextprotocol/server-semgrep
- npm install -g @modelcontextprotocol/server-sonarqube
- ```
-
- ## ✅ Testing Your Setup
-
- After adding API keys, test each tool in Cursor Composer:
-
- **Hugging Face**: "Search for code analysis models on Hugging Face"
- **DuckDB**: "Analyze my training dataset using DuckDB"
- **GitHub**: "Show me my recent commits"
- **Sentry**: "Check for errors in my project"
-
- ## 🛡️ Security Notes
-
- Never commit API keys to version control
- Use environment variables for sensitive data
- Regularly rotate your API keys
- Use minimal required permissions for each service
-
- ## 🆘 Troubleshooting
-
- If a tool doesn't work:
- 1. Check if the MCP server is installed: `npm list -g @modelcontextprotocol/server-*`
- 2. Verify API keys are correct and have proper permissions
- 3. Restart Cursor after making changes
- 4. Check Cursor's MCP settings in `Settings > Features > MCP`
-
- ## 📚 Useful Commands for Your Project
-
- Once set up, you can use these commands in Cursor Composer:
-
- "Use Hugging Face to find the best code analysis models"
- "Analyze my analyst_dataset.jsonl with DuckDB"
- "Check my code for vulnerabilities with Snyk"
- "Deploy my analyzer to Vercel"
- "Monitor errors with Sentry"
MODEL_UPGRADE_GUIDE.md DELETED
@@ -1,155 +0,0 @@
1
- # 🚀 AI Code Analyzer Model Upgrade Guide
2
-
3
- ## 📊 What We've Done - Step by Step Explanation
4
-
5
- ### **Step 1: Identified the Problem**
6
- - **Current Issue**: Your fine-tuned model uses GPT-2 as base (from `adapter_config.json`)
7
- - **Problem**: GPT-2 is a general-purpose text model, not optimized for code analysis
8
- - **Impact**: Limited understanding of programming concepts, syntax, and code structure
9
-
10
- ### **Step 2: Found Better Models**
11
- We researched and identified specialized code analysis models:
12
-
13
- | Model | Size | Specialization | Best For |
14
- |-------|------|----------------|----------|
15
- | **Salesforce/codet5p-220m** | 220M | Code understanding & generation | **Code analysis** ⭐ |
16
- | **microsoft/CodeBERT-base** | 125M | Code understanding | Feature extraction |
17
- | **facebook/incoder-1B** | 1B | Code completion & analysis | Large context analysis |
18
- | **microsoft/GraphCodeBERT-base** | 125M | Code structure understanding | Dependency analysis |
19
-
20
- ### **Step 3: Created Improved Training Script**
- **File**: `finetune_improved.py`
-
- **Key Improvements**:
- - ✅ **Better default model**: CodeT5+ instead of GPT-2
- - ✅ **Model type detection**: Automatically handles different architectures
- - ✅ **Optimized LoRA configs**: Different settings for different model types
- - ✅ **Better error handling**: More robust training process
- - ✅ **Flexible model selection**: Easy to switch between models
-
- ### **Step 4: Created Testing Tools**
- **Files**:
- - `model_comparison.py` - Compare different models
- - `test_models.py` - Quick testing of model capabilities
-
- ## 🎯 Why CodeT5+ is Better for Your Project
-
- ### **Current Model (GPT-2) Limitations**:
- - ❌ Not trained on code
- - ❌ Limited understanding of programming concepts
- - ❌ Poor handling of code syntax and structure
- - ❌ General-purpose text model
-
- ### **CodeT5+ Advantages**:
- - ✅ **Specialized for code**: Trained specifically on code datasets
- - ✅ **Better architecture**: Seq2Seq model perfect for analysis tasks
- - ✅ **Code understanding**: Understands programming languages, syntax, and patterns
- - ✅ **Optimized tokenization**: Better handling of code tokens
- - ✅ **Proven performance**: State-of-the-art results on code analysis benchmarks
-
- ## 🚀 How to Use the New System
-
- ### **Step 1: Test Models (Recommended)**
- ```bash
- # Compare different models
- python model_comparison.py
-
- # Test model capabilities
- python test_models.py
- ```
-
- ### **Step 2: Fine-tune with Better Model**
- ```bash
- # Use CodeT5+ (recommended)
- python finetune_improved.py --model Salesforce/codet5p-220m --dry-run
-
- # Or try CodeBERT
- python finetune_improved.py --model microsoft/codebert-base --dry-run
-
- # Full training (remove --dry-run)
- python finetune_improved.py --model Salesforce/codet5p-220m --epochs 3
- ```
-
- ### **Step 3: Compare Results**
- - Test your current GPT-2 model vs new CodeT5+ model
- - Compare analysis quality on your training examples
- - Measure performance improvements
-
- ## 📈 Expected Improvements
-
- ### **Code Analysis Quality**:
- - **Better bug detection**: Understanding of common programming errors
- - **Improved security analysis**: Knowledge of security vulnerabilities
- - **Enhanced performance insights**: Understanding of algorithmic complexity
- - **Better code structure analysis**: Recognition of design patterns
-
- ### **Training Efficiency**:
- - **Faster convergence**: Code-specialized models learn faster on code tasks
- - **Better generalization**: Understanding of programming concepts transfers better
- - **Reduced overfitting**: Better base knowledge means less overfitting
-
- ## 🔧 Technical Details
-
- ### **Model Architecture Changes**:
- ```python
- # Old (GPT-2)
- model = AutoModelForCausalLM.from_pretrained("gpt2")
-
- # New (CodeT5+)
- model = AutoModelForSeq2SeqLM.from_pretrained("Salesforce/codet5p-220m")
- ```
-
- ### **LoRA Configuration Updates**:
- ```python
- # CodeT5+ specific target modules
- target_modules = ["q", "v", "k", "o", "wi_0", "wi_1", "wo"]
-
- # Different task type
- task_type = TaskType.SEQ_2_SEQ_LM
- ```
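The "different settings for different model types" idea in the deleted guide can be sketched as a plain dispatch helper. This is an illustrative sketch, not the actual code from `finetune_improved.py`: the function name and the `r`/`lora_alpha` values are assumptions, while the target-module names mirror the snippet above (T5-style projections for CodeT5+, `c_attn` for GPT-2).

```python
# Hypothetical helper: pick LoRA settings from the base-model name.
# Values are illustrative defaults, not the repository's real config.
def lora_settings(model_name: str) -> dict:
    name = model_name.lower()
    if "codet5" in name:
        return {
            "task_type": "SEQ_2_SEQ_LM",
            "target_modules": ["q", "v", "k", "o", "wi_0", "wi_1", "wo"],
            "r": 16,
            "lora_alpha": 32,
        }
    if "gpt2" in name:
        return {
            "task_type": "CAUSAL_LM",
            "target_modules": ["c_attn"],
            "r": 8,
            "lora_alpha": 16,
        }
    raise ValueError(f"No LoRA preset for {model_name}")
```

The returned dict maps directly onto the fields of PEFT's `LoraConfig`, so switching base models only changes one string.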
-
- ### **Training Data Compatibility**:
- - ✅ **Your current dataset works**: No changes needed to `analyst_dataset.jsonl`
- - ✅ **Same format**: The improved script handles your existing data
- - ✅ **Better results**: Code-specialized models will perform better
-
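The compatibility claim above is easy to spot-check with a few lines of stdlib Python. A hypothetical validator, assuming each JSONL record carries a `"text"` field as in the samples from `additional_samples.py`:

```python
import json

# Hypothetical sanity check for an instruction-tuning JSONL file such as
# analyst_dataset.jsonl: every non-empty line must be a JSON object with
# a "text" field. Raises on the first malformed line.
def validate_jsonl_lines(lines):
    records = []
    for i, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate trailing blank lines
        record = json.loads(line)  # raises ValueError on malformed JSON
        if "text" not in record:
            raise ValueError(f"Line {i} is missing the 'text' field")
        records.append(record)
    return records
```

Running this before training catches formatting drift without loading any model.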
116
- ## 🎯 Next Steps
117
-
118
- ### **Immediate Actions**:
119
- 1. **Test the models**: Run `python test_models.py`
120
- 2. **Choose your model**: CodeT5+ is recommended
121
- 3. **Fine-tune**: Run the improved training script
122
- 4. **Compare**: Test against your current model
123
-
124
- ### **Integration with Your Analyzer**:
125
- 1. **Update model loading**: Modify your analyzer to use the new model
126
- 2. **Test performance**: Compare analysis quality
127
- 3. **Deploy**: Update your live demo with the better model
128
-
129
- ## 💡 Pro Tips
130
-
131
- ### **Model Selection**:
132
- - **Start with CodeT5+**: Best balance of performance and size
133
- - **Try CodeBERT**: If you need feature extraction
134
- - **Consider InCoder**: If you have powerful hardware
135
-
136
- ### **Training Tips**:
137
- - **Use dry-run first**: Test with `--dry-run` before full training
138
- - **Monitor GPU usage**: Larger models need more resources
139
- - **Compare results**: Always test against your current model
140
-
141
- ### **Performance Optimization**:
142
- - **Use quantization**: 4-bit quantization for memory efficiency
143
- - **Batch size**: Start small and increase if you have memory
144
- - **Learning rate**: CodeT5+ works well with 2e-4
145
-
146
- ## 🎉 Expected Results
147
-
148
- After upgrading to CodeT5+, you should see:
149
- - **20-30% better code analysis quality**
150
- - **Better understanding of security vulnerabilities**
151
- - **More accurate performance predictions**
152
- - **Improved code structure analysis**
153
- - **Better handling of complex code patterns**
154
-
155
- Your AI code analyzer will become significantly more powerful and accurate! 🚀
NEXT_STEPS_CHECKLIST.md DELETED
@@ -1,263 +0,0 @@
- # ✅ Next Steps Checklist
-
- Follow these steps in order to complete the integration.
-
- ## Phase 1: Get Your Model Ready (5 minutes)
-
- ### Step 1: Download Model from Google Drive
- - [ ] Open Google Drive in your browser
- - [ ] Navigate to `MyDrive/ai-code-analyzer/`
- - [ ] Find the `fine-tuned-analyst` folder
- - [ ] Download the entire folder
- - [ ] Extract/move it to: `C:\Users\arunk\professional\ai-code-analyzer\fine-tuned-analyst\`
-
- **How to verify:**
- ```bash
- dir fine-tuned-analyst
- ```
- You should see files like:
- - `adapter_config.json`
- - `adapter_model.bin` (or `.safetensors`)
- - Other config files
-
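The manual `dir` check above can be automated. A minimal sketch, assuming the adapter folder layout listed in the checklist (the helper name is illustrative and not part of the repository):

```python
from pathlib import Path

# Hypothetical check mirroring the file list above: a usable PEFT adapter
# folder should contain adapter_config.json plus adapter weights saved
# either as .bin or as .safetensors.
def adapter_folder_ok(folder: str) -> bool:
    p = Path(folder)
    has_config = (p / "adapter_config.json").is_file()
    has_weights = (
        (p / "adapter_model.bin").is_file()
        or (p / "adapter_model.safetensors").is_file()
    )
    return has_config and has_weights
```

Calling this at app startup gives a clearer error than a failed model load later on.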
- ### Step 2: Install PEFT Library
- - [ ] Open terminal/PowerShell in your project folder
- - [ ] Run: `pip install peft`
- - [ ] Wait for installation to complete
-
- **How to verify:**
- ```bash
- python -c "import peft; print('PEFT installed successfully')"
- ```
-
- ## Phase 2: Test Locally (2 minutes)
-
- ### Step 3: Run Test Script
- - [ ] In terminal, run: `python test_finetuned_local.py`
- - [ ] Wait for all checks to complete
- - [ ] Verify all checks show ✅ (green checkmarks)
-
- **If any checks fail:**
- - Read the error message carefully
- - Follow the suggested fix
- - Re-run the test script
-
- **Expected output:**
- ```
- ✅ Found adapter folder
- ✅ All required files present
- ✅ Dependencies imported successfully
- ✅ Tokenizer loaded
- ✅ Base model loaded
- ✅ Adapters loaded successfully
- ✅ Model inference working correctly!
- ✅ Enhanced analyzer can be imported
- 🎉 SUCCESS: All checks passed!
- ```
-
- ## Phase 3: Integrate with UI (10 minutes)
-
- ### Step 4: Backup Your Current UI
- - [ ] Copy `matrix_final.py` to `matrix_final_backup.py`
-
- **Command:**
- ```bash
- copy matrix_final.py matrix_final_backup.py
- ```
-
- ### Step 5: Update Imports
- - [ ] Open `matrix_final.py` in your editor
- - [ ] Find line ~8: `from optimized_code_analyzer import OptimizedCodeAnalyzer`
- - [ ] Replace with: `from optimized_code_analyzer_enhanced import EnhancedCodeAnalyzer`
-
- ### Step 6: Update Analyzer Function
- - [ ] Find the function `get_local_analyzer()` (around line 287)
- - [ ] Replace the entire function with:
- ```python
- @st.cache_resource
- def get_local_analyzer(model_type="codet5"):
-     return EnhancedCodeAnalyzer(
-         model_type=model_type,
-         precision="fp16",
-         quick_max_new_tokens=180,
-         detailed_max_new_tokens=300,
-     )
- ```
-
- ### Step 7: Add Model Selector to Sidebar
- - [ ] Find the sidebar section (around line 490, look for `st.sidebar`)
- - [ ] Add this code after other sidebar elements:
- ```python
- # Model Selection
- st.sidebar.markdown("---")
- st.sidebar.markdown("### 🤖 AI Model Selection")
- model_choice = st.sidebar.radio(
-     "Choose Analysis Model:",
-     ["CodeT5+ (Fast)", "Fine-tuned DeepSeek (Accurate)"],
-     help="CodeT5+ is faster, Fine-tuned model gives more detailed analysis"
- )
-
- model_type = "codet5" if "CodeT5+" in model_choice else "deepseek-finetuned"
- ```
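The label-to-model mapping in the last line of the snippet above is worth pulling into a pure function, since it can then be unit-tested without importing Streamlit. A sketch (the function name is illustrative, not taken from `matrix_final.py`):

```python
# Hypothetical pure helper: map the sidebar radio label to the internal
# model id used by EnhancedCodeAnalyzer. Same logic as the inline
# conditional above, just testable in isolation.
def choice_to_model_type(choice: str) -> str:
    return "codet5" if "CodeT5+" in choice else "deepseek-finetuned"
```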
-
- ### Step 8: Update Analyzer Calls
- - [ ] Find where `local_analyzer = get_local_analyzer()` is called
- - [ ] Replace with: `local_analyzer = get_local_analyzer(model_type)`
- - [ ] Look for all instances (there might be 2-3 places)
-
- **Hint:** Use Ctrl+F to find "get_local_analyzer()"
-
- ### Step 9: Update requirements.txt
- - [ ] Open `requirements.txt`
- - [ ] Add this line if not present: `peft>=0.7.0`
- - [ ] Save the file
-
- ## Phase 4: Test Everything (5 minutes)
-
- ### Step 10: Run Streamlit App
- - [ ] In terminal: `streamlit run matrix_final.py`
- - [ ] Wait for app to load
- - [ ] Browser should open automatically
-
- ### Step 11: Test CodeT5+ Model
- - [ ] In the sidebar, select "CodeT5+ (Fast)"
- - [ ] Paste a simple code snippet (or use the examples)
- - [ ] Click "🚀 Analyze Code"
- - [ ] Verify you get analysis results
- - [ ] Should take 2-3 seconds
-
- **Example code to test:**
- ```python
- def add(a, b):
-     return a + b
- ```
-
- ### Step 12: Test Fine-tuned DeepSeek Model
- - [ ] In the sidebar, select "Fine-tuned DeepSeek (Accurate)"
- - [ ] Use the same code snippet
- - [ ] Click "🚀 Analyze Code"
- - [ ] Verify you get detailed analysis with:
-   - Quality Score (e.g., "Quality Score: 85/100")
-   - BUGS section
-   - PERFORMANCE ISSUES section
-   - SECURITY CONCERNS section
-   - IMPROVEMENTS section with code examples
- - [ ] Should take 3-5 seconds
-
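Verifying that structured output programmatically reduces to checking for the "Quality Score: N/100" line. A hedged sketch of such a parser (assumed helper, not code from the repository):

```python
import re

# Hypothetical parser for the "Quality Score: 85/100" line the fine-tuned
# model is expected to emit; returns None when no valid score is present.
def parse_quality_score(analysis: str):
    match = re.search(r"Quality Score:\s*(\d{1,3})/100", analysis)
    if not match:
        return None
    score = int(match.group(1))
    return score if 0 <= score <= 100 else None
```

A check like this can back an automated regression test comparing the old and new models.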
- ### Step 13: Test Model Switching
- - [ ] Switch back to "CodeT5+ (Fast)"
- - [ ] Analyze different code
- - [ ] Switch to "Fine-tuned DeepSeek (Accurate)"
- - [ ] Analyze the same code again
- - [ ] Both should work without errors
-
- ### Step 14: Test Caching
- - [ ] Analyze the same code twice with the same model
- - [ ] Second time should say "⚡ Using cached result!"
- - [ ] Should be instant (< 0.1 seconds)
-
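The caching behaviour being tested in Step 14 boils down to keying results by (model, code): repeating a request hits the cache, switching models does not. A minimal sketch of that idea, assuming nothing about the app's real cache implementation:

```python
import hashlib

# Illustrative result cache: keys combine the model id with a hash of the
# code, so the same snippet analyzed by two models gets two entries.
class AnalysisCache:
    def __init__(self):
        self._store = {}

    def _key(self, model: str, code: str) -> str:
        digest = hashlib.sha256(code.encode("utf-8")).hexdigest()
        return f"{model}:{digest}"

    def get(self, model: str, code: str):
        return self._store.get(self._key(model, code))

    def put(self, model: str, code: str, result: str) -> None:
        self._store[self._key(model, code)] = result
```

Hashing the code keeps keys short even for large pasted files.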
- ## Phase 5: Final Verification (2 minutes)
-
- ### Step 15: Quality Check
- - [ ] Fine-tuned model gives quality scores (1-100)
- - [ ] Fine-tuned model provides structured output
- - [ ] CodeT5+ still works as before
- - [ ] No error messages in terminal or browser
- - [ ] UI loads quickly
- - [ ] Both models can analyze code successfully
-
- ### Step 16: Document Your Setup
- - [ ] Take a screenshot of working analysis
- - [ ] Note which model works better for your use cases
- - [ ] Save any error messages you encountered (for future reference)
-
- ## ✅ Integration Complete!
-
- If all steps are checked, congratulations! You have:
- - ✅ Successfully integrated fine-tuned model
- - ✅ Dual-model code analyzer working
- - ✅ Professional-quality tool ready to use
-
- ## 🚀 Optional: Next Level
-
- Want to go further? Try these:
-
- ### A. Improve the Model
- - [ ] Add more training samples (see `additional_samples.py`)
- - [ ] Retrain in Colab (only 20 minutes)
- - [ ] Test new version
- - [ ] Compare with old version
-
- ### B. Deploy Online
- - [ ] Choose deployment platform (Hugging Face Spaces recommended)
- - [ ] Follow deployment guide in `INTEGRATION_GUIDE.md`
- - [ ] Share link with friends/portfolio
-
- ### C. Enhance Features
- - [ ] Add support for more programming languages
- - [ ] Implement batch analysis (multiple files)
- - [ ] Add export to PDF/Markdown
- - [ ] Create comparison view (side-by-side model outputs)
-
- ## 🆘 Troubleshooting
-
- ### Common Issues
-
- **Issue: "fine-tuned-analyst not found"**
- - [ ] Check folder is in correct location
- - [ ] Verify folder name spelling
- - [ ] Ensure it's extracted (not still zipped)
-
- **Issue: "PEFT not installed"**
- - [ ] Run: `pip install peft`
- - [ ] Restart terminal
- - [ ] Try again
-
- **Issue: "Model too slow"**
- - [ ] Use CodeT5+ for faster analysis
- - [ ] Reduce max_new_tokens to 150
- - [ ] Close other applications
-
- **Issue: "Out of memory"**
- - [ ] Close browser tabs
- - [ ] Restart Streamlit
- - [ ] Use CodeT5+ (smaller model)
-
- **Issue: "Import Error"**
- - [ ] Check file names are correct
- - [ ] Verify `optimized_code_analyzer_enhanced.py` exists
- - [ ] Try: `python -c "from optimized_code_analyzer_enhanced import EnhancedCodeAnalyzer"`
-
- ## 📞 Need Help?
-
- 1. **Check error messages** - They usually tell you exactly what's wrong
- 2. **Review INTEGRATION_GUIDE.md** - Has detailed explanations
- 3. **Run test script again** - `python test_finetuned_local.py`
- 4. **Start simple** - Test with basic code first
- 5. **Check file locations** - Make sure everything is in the right place
-
- ## 🎓 What You're Learning
-
- By completing this checklist, you've learned:
- - ✅ How to fine-tune language models
- - ✅ How to integrate ML models with web apps
- - ✅ How to create production-ready AI tools
- - ✅ How to manage multiple model versions
- - ✅ How to deploy AI applications
-
- ## 📊 Progress Tracker
-
- **Phase 1:** ⬜ Get Model Ready
- **Phase 2:** ⬜ Test Locally
- **Phase 3:** ⬜ Integrate with UI
- **Phase 4:** ⬜ Test Everything
- **Phase 5:** ⬜ Final Verification
-
- ---
-
- **Estimated Total Time:** 25-30 minutes
-
- **When Done:** You'll have a professional code analyzer with AI models YOU trained! 🎉
-
- Good luck! You've got this! 🚀
-
README.md CHANGED
@@ -1,3 +1,13 @@
 # AI Code Analyzer
 
 A professional AI-powered code analysis tool with a sleek Matrix-inspired interface that leverages multiple Large Language Models (LLMs) to provide comprehensive code reviews, identify issues, and suggest improvements.
@@ -9,7 +19,7 @@ A professional AI-powered code analysis tool with a sleek Matrix-inspired interf
 
 ## ✨ Features
 
- * **🤖 Multi-Model Analysis**: Compare insights from OpenAI GPT-4, Anthropic Claude, DeepSeek, and fine-tuned models
 * **🎯 Fine-tuned Code Analyzer**: Custom DeepSeek model trained on 59+ code analysis examples
 * **🎨 Matrix-Inspired UI**: Sleek dark theme with neon green accents and cyberpunk aesthetics
 * **📊 Comprehensive Code Review**: Get quality scores, identify bugs, security issues, and performance concerns
@@ -22,16 +32,16 @@ A professional AI-powered code analysis tool with a sleek Matrix-inspired interf
 
 ## 🌐 Live Demo
 
- [🚀 Try it live on Render](https://ai-code-analyzer-tcl8.onrender.com)
 
 ## 🛠️ Tech Stack
 
 - **Frontend**: Streamlit with custom Matrix-inspired CSS
- - **LLM Integration**: OpenAI, Anthropic, DeepSeek APIs
 - **Fine-tuning**: LoRA/QLoRA with Hugging Face Transformers
 - **Model Hosting**: Hugging Face Hub & Spaces
 - **Language**: Python 3.11+
- - **Deployment**: Render (configured with render.yaml)
 - **Styling**: Custom CSS with Google Fonts (Share Tech Mono, Orbitron)
 
 ## 🎯 Fine-tuned Model
@@ -61,6 +71,7 @@ The fine-tuned model provides:
 * OpenAI API Key
 * Anthropic API Key
 * DeepSeek API Key
 
 ### Installation
 
@@ -86,14 +97,21 @@ pip install -r requirements.txt
 Create a `.env` file in the root directory:
 ```env
 # API Keys - Replace with your actual API keys
 OPENAI_API_KEY=your_openai_api_key_here
 ANTHROPIC_API_KEY=your_anthropic_api_key_here
 DEEPSEEK_API_KEY=your_deepseek_api_key_here
 GITHUB_TOKEN=your_github_token_here # Optional, for higher API limits
 ```
 
 5. **Run the application:**
 ```bash
 python -m streamlit run matrix_final.py --server.port 8501
 ```
 
@@ -122,32 +140,41 @@ The application will be available at `http://localhost:8501`
 
 ## 🏗️ Project Structure
 
 ```
 ai-code-analyzer/
- ├── matrix_final.py # Main Streamlit application (deployed version)
- ├── analyzer/ # Core analysis engine
- │ ├── __init__.py # Package initialization
- │ ├── code_analyzer.py # Main analysis engine
- │ ├── llm_clients.py # LLM API client implementations
- │ ├── prompts.py # Analysis prompt templates
- │ └── utils.py # Utility functions
- ├── requirements.txt # Python dependencies
- ├── render.yaml # Render deployment configuration
- ├── Procfile # Alternative deployment configuration
- ├── runtime.txt # Python version specification
- ├── README.md # This file
- └── .env # Environment variables (create this)
 ```
 
 ## 🔧 Configuration
 
 ### Supported LLM Providers
 
- | Provider | Model | API Key Environment Variable |
- | --------- | -------------- | ---------------------------- |
- | OpenAI | GPT-4o-mini | OPENAI\_API\_KEY |
- | Anthropic | Claude 3 Haiku | ANTHROPIC\_API\_KEY |
- | DeepSeek | DeepSeek Chat | DEEPSEEK\_API\_KEY |
 
 ### Supported Programming Languages
 
@@ -155,22 +182,34 @@ ai-code-analyzer/
 - **Auto-detection** available for most languages
 - **Manual selection** option for specific analysis
 
 ## 🚀 Deployment
 
- ### Deploy to Render (Recommended)
 
- The project is configured for **one-click deployment** on Render:
 
- 1. **Fork this repository** to your GitHub account
- 2. **Connect to Render**: Go to [Render Dashboard](https://dashboard.render.com)
- 3. **Create New Web Service**: Select "Build and deploy from a Git repository"
- 4. **Connect Repository**: Link your forked repository
- 5. **Configure Environment Variables** in Render dashboard:
-   - `OPENAI_API_KEY`
-   - `ANTHROPIC_API_KEY`
-   - `DEEPSEEK_API_KEY`
-   - `GITHUB_TOKEN` (optional)
- 6. **Deploy**: Render automatically detects `render.yaml` and deploys
 
 ### Manual Deployment
 
@@ -203,10 +242,13 @@ The tool provides structured analysis including:
 
 ### Running Locally
 ```bash
- # Start the development server
 python -m streamlit run matrix_final.py --server.port 8501
 
- # With auto-reload for development
 python -m streamlit run matrix_final.py --server.port 8501 --server.runOnSave true
 ```
 
@@ -247,9 +289,9 @@ If you encounter any issues or have questions:
 
 ## 🔗 Links
 
- - **Live Demo**: [ai-code-analyzer-tcl8.onrender.com](https://ai-code-analyzer-tcl8.onrender.com)
 - **Repository**: [github.com/arun3676/ai-code-analyzer](https://github.com/arun3676/ai-code-analyzer)
- - **Render Dashboard**: [dashboard.render.com](https://dashboard.render.com)
 
 ---
+ ---
+ title: AI Code Analyzer
+ emoji: 🧠
+ colorFrom: black
+ colorTo: green
+ sdk: docker
+ app_port: 7860
+ pinned: false
+ ---
+
 # AI Code Analyzer
 
 A professional AI-powered code analysis tool with a sleek Matrix-inspired interface that leverages multiple Large Language Models (LLMs) to provide comprehensive code reviews, identify issues, and suggest improvements.
 
 ## ✨ Features
 
+ * **🤖 Multi-Model Analysis**: Compare insights from OpenAI GPT-4, Anthropic Claude, DeepSeek, and Hugging Face models
 * **🎯 Fine-tuned Code Analyzer**: Custom DeepSeek model trained on 59+ code analysis examples
 * **🎨 Matrix-Inspired UI**: Sleek dark theme with neon green accents and cyberpunk aesthetics
 * **📊 Comprehensive Code Review**: Get quality scores, identify bugs, security issues, and performance concerns
 
 ## 🌐 Live Demo
 
+ **Local Deployment**: Your AI Code Analyzer is running locally at `http://localhost:8501/`
 
 ## 🛠️ Tech Stack
 
 - **Frontend**: Streamlit with custom Matrix-inspired CSS
+ - **LLM Integration**: OpenAI, Anthropic, DeepSeek, Hugging Face APIs
 - **Fine-tuning**: LoRA/QLoRA with Hugging Face Transformers
 - **Model Hosting**: Hugging Face Hub & Spaces
 - **Language**: Python 3.11+
+ - **Deployment**: Hugging Face Spaces (recommended for ease of use and free tier)
 - **Styling**: Custom CSS with Google Fonts (Share Tech Mono, Orbitron)
 
 ## 🎯 Fine-tuned Model
 
 * OpenAI API Key
 * Anthropic API Key
 * DeepSeek API Key
+ * **Hugging Face API Key** (recommended for free usage)
 
 ### Installation
 
 Create a `.env` file in the root directory:
 ```env
 # API Keys - Replace with your actual API keys
+ HUGGINGFACE_API_KEY=your_huggingface_api_key_here # Recommended for free usage
 OPENAI_API_KEY=your_openai_api_key_here
 ANTHROPIC_API_KEY=your_anthropic_api_key_here
 DEEPSEEK_API_KEY=your_deepseek_api_key_here
 GITHUB_TOKEN=your_github_token_here # Optional, for higher API limits
 ```
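At runtime the app only needs the keys that are actually set, so a small sketch of how they might be read from the environment (the helper name is illustrative, not taken from `analyzer/llm_clients.py`):

```python
import os

# Hypothetical helper: list which LLM providers are usable given the
# environment variables documented in the .env example above. At least
# one provider key must be present for analysis to work.
def available_providers(env=os.environ):
    keys = {
        "huggingface": "HUGGINGFACE_API_KEY",
        "openai": "OPENAI_API_KEY",
        "anthropic": "ANTHROPIC_API_KEY",
        "deepseek": "DEEPSEEK_API_KEY",
    }
    return [name for name, var in keys.items() if env.get(var)]
```

On Spaces the same variables arrive via repository secrets, so no code change is needed between local and hosted runs.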
 
+ **📚 For detailed Hugging Face setup instructions, see: [HUGGINGFACE_SETUP_GUIDE.md](HUGGINGFACE_SETUP_GUIDE.md)**
+
 5. **Run the application:**
 ```bash
+ # Option 1: Use the startup script (recommended)
+ python run_app.py
+
+ # Option 2: Run directly with Streamlit
 python -m streamlit run matrix_final.py --server.port 8501
 ```
 
 ## 🏗️ Project Structure
 
+ For a detailed explanation of the project structure, architecture, and data flow, please see [PROJECT_STRUCTURE.md](PROJECT_STRUCTURE.md) and [ARCHITECTURE.md](ARCHITECTURE.md).
+
 ```
 ai-code-analyzer/
+ ├── .venv/ # Virtual environment directory
+ ├── analyzer/ # Core analysis engine
+ │ ├── __init__.py # Package initialization
+ │ ├── code_analyzer.py # Main analysis engine
+ │ ├── llm_clients.py # LLM API client implementations
+ │ ├── prompts.py # Analysis prompt templates
+ │ └── utils.py # Utility functions
+ ├── tests/ # Automated tests
+ │ ├── test_matrix_final.py # Tests for matrix_final.py utilities
+ │ └── test_prompts.py # Tests for prompt generation
+ ├── .env # Environment variables (create this)
+ ├── .gitignore # Git ignore file
+ ├── ARCHITECTURE.md # Detailed architecture documentation
+ ├── matrix_final.py # Main Streamlit application
+ ├── PROJECT_STRUCTURE.md # High-level project structure
+ ├── README.md # This file
+ ├── requirements.txt # Python dependencies
+ ├── run_app.py # Startup script for easy launching
+ └── TESTING_GUIDE.md # Guide for testing the application
 ```
 
 ## 🔧 Configuration
 
 ### Supported LLM Providers
 
+ | Provider | Model | API Key Environment Variable |
+ | ------------- | ------------------------ | ---------------------------- |
+ | Hugging Face | Mixtral-8x7B-Instruct | HUGGINGFACE\_API\_KEY |
+ | OpenAI | GPT-4o-mini | OPENAI\_API\_KEY |
+ | Anthropic | Claude 3 Haiku | ANTHROPIC\_API\_KEY |
+ | DeepSeek | DeepSeek Chat | DEEPSEEK\_API\_KEY |
 
 ### Supported Programming Languages
 
 - **Auto-detection** available for most languages
 - **Manual selection** option for specific analysis
 
+ ## 🧪 Testing
+
+ For detailed instructions on how to test the application, please refer to the [TESTING_GUIDE.md](TESTING_GUIDE.md).
+
+ To run the automated tests:
+ ```bash
+ pytest
+ ```
+
 ## 🚀 Deployment
 
+ ### Deploy to Hugging Face Spaces (Recommended)
 
+ This project is configured for easy deployment on **Hugging Face Spaces**:
 
+ 1. **Fork this repository** to your GitHub account.
+ 2. **Create a new Space**: Go to [Hugging Face Spaces](https://huggingface.co/spaces/new) and create a new Space.
+    * Choose "Docker" as the Space SDK (this project ships a `Dockerfile` and sets `sdk: docker` in the README metadata).
+    * Select "Public" or "Private" as per your preference.
+    * Connect your forked GitHub repository.
+ 3. **Configure Secrets**: In your Hugging Face Space settings, go to "App settings" -> "Secrets". Add your API keys:
+    * `HUGGINGFACE_API_KEY` (required for Hugging Face models)
+    * `OPENAI_API_KEY` (optional)
+    * `ANTHROPIC_API_KEY` (optional)
+    * `DEEPSEEK_API_KEY` (optional)
+    * `GITHUB_TOKEN` (optional, for higher GitHub API limits)
+ 4. **Wait for Deployment**: Hugging Face will automatically detect your `Dockerfile` and deploy your app.
+ 5. **Access Your App**: Once deployed, your application will be live on your Hugging Face Space URL.
 
 ### Manual Deployment
 
 ### Running Locally
 ```bash
+ # Option 1: Use the startup script (recommended)
+ python run_app.py
+
+ # Option 2: Start the development server directly
 python -m streamlit run matrix_final.py --server.port 8501
 
+ # Option 3: With auto-reload for development
 python -m streamlit run matrix_final.py --server.port 8501 --server.runOnSave true
 ```
 
 ## 🔗 Links
 
+ - **Live Demo**: *Your Hugging Face Space URL here*
 - **Repository**: [github.com/arun3676/ai-code-analyzer](https://github.com/arun3676/ai-code-analyzer)
+ - **Hugging Face Spaces**: [huggingface.co/spaces](https://huggingface.co/spaces)
 
 ---
 
additional_samples.py DELETED
@@ -1,73 +0,0 @@
- #!/usr/bin/env python3
- """
- Additional training samples for code analysis dataset.
- This script generates 25 new high-quality samples to expand the dataset from 35 to 60.
- """
-
- import json
-
- def create_additional_samples():
- """Create 25 new training samples covering diverse Python issues."""
-
- new_samples = [
- {
- "text": "<s>[INST] Analyze this code for bugs, performance, and security issues. Give a quality score from 1-100 and provide a detailed analysis. \n\nCode:\n```import subprocess\nimport sys\n\ndef run_command(user_input):\n cmd = f\"ls -la {user_input}\"\n result = subprocess.run(cmd, shell=True, capture_output=True, text=True)\n return result.stdout\n\nif __name__ == \"__main__\":\n user_path = input(\"Enter path: \")\n print(run_command(user_path))``` [/INST] Quality Score: 15/100\n\nCRITICAL SECURITY VULNERABILITIES:\n- Command injection vulnerability via shell=True and unsanitized input\n- User can execute arbitrary commands using ; or && operators\n- No input validation or sanitization\n\nBUGS:\n- No error handling for subprocess failures\n- Will crash if command fails\n\nPERFORMANCE ISSUES:\n- Spawns shell process unnecessarily\n- No timeout protection\n\nIMPROVEMENTS:\n1. Avoid shell=True and use proper argument lists\n2. Validate and sanitize input\n3. Add proper error handling and timeouts\n\nExample improved code:\n```python\nimport subprocess\nimport os\nfrom pathlib import Path\n\ndef run_command(user_input):\n # Validate and sanitize input\n try:\n safe_path = Path(user_input).expanduser().resolve()\n if not safe_path.exists():\n raise ValueError(\"Path does not exist\")\n except Exception as e:\n raise ValueError(f\"Invalid path: {e}\")\n \n # Use proper argument list without shell\n result = subprocess.run(\n [\"ls\", \"-la\", str(safe_path)],\n capture_output=True,\n text=True,\n timeout=10\n )\n \n if result.returncode != 0:\n raise RuntimeError(f\"Command failed: {result.stderr}\")\n \n return result.stdout\n```</s>"
- },
- {
- "text": "<s>[INST] Analyze this code for bugs, performance, and security issues. Give a quality score from 1-100 and provide a detailed analysis. \n\nCode:\n```import json\nfrom flask import Flask, request\n\napp = Flask(__name__)\n\n@app.route(\"/api/data\", methods=[\"POST\"])\ndef process_data():\n data = request.get_json()\n \n # Process the data\n result = {\n \"status\": \"success\",\n \"data\": data,\n \"length\": len(str(data))\n }\n \n return json.dumps(result)\n\nif __name__ == \"__main__\":\n app.run(host=\"0.0.0.0\", port=5000)``` [/INST] Quality Score: 42/100\n\nBUGS:\n- No error handling for malformed JSON requests\n- Missing Content-Type validation\n- No input size limits\n\nPERFORMANCE ISSUES:\n- Converting data to string just for length calculation is inefficient\n- No request timeout handling\n\nSECURITY CONCERNS:\n- Binding to 0.0.0.0 exposes service to all interfaces\n- No authentication or rate limiting\n- Potential DoS via large payloads\n- No input validation\n\nIMPROVEMENTS:\n1. Add proper error handling and validation\n2. Implement security measures\n3. Optimize performance\n\nExample improved code:\n```python\nimport json\nfrom flask import Flask, request, jsonify\nfrom functools import wraps\n\napp = Flask(__name__)\n\nMAX_PAYLOAD_SIZE = 1024 * 1024 # 1MB limit\n\ndef validate_json(f):\n @wraps(f)\n def decorated_function(*args, **kwargs):\n if not request.is_json:\n return jsonify({\"error\": \"Content-Type must be application/json\"}), 400\n \n content_length = request.content_length\n if content_length and content_length > MAX_PAYLOAD_SIZE:\n return jsonify({\"error\": \"Payload too large\"}), 413\n \n try:\n request.json # Validate JSON parsing\n except Exception:\n return jsonify({\"error\": \"Invalid JSON\"}), 400\n \n return f(*args, **kwargs)\n return decorated_function\n\n@app.route(\"/api/data\", methods=[\"POST\"])\n@validate_json\ndef process_data():\n data = request.get_json()\n \n # Validate required fields\n if not isinstance(data, dict):\n return jsonify({\"error\": \"Data must be a JSON object\"}), 400\n \n result = {\n \"status\": \"success\",\n \"data\": data,\n \"field_count\": len(data)\n }\n \n return jsonify(result)\n\nif __name__ == \"__main__\":\n app.run(host=\"127.0.0.1\", port=5000, debug=False)\n```</s>"
- },
- {
- "text": "<s>[INST] Analyze this code for bugs, performance, and security issues. Give a quality score from 1-100 and provide a detailed analysis. \n\nCode:\n```import threading\nimport time\n\nclass Counter:\n def __init__(self):\n self.value = 0\n \n def increment(self):\n temp = self.value\n time.sleep(0.001) # Simulate work\n self.value = temp + 1\n \n def get_value(self):\n return self.value\n\ncounter = Counter()\n\ndef worker():\n for _ in range(1000):\n counter.increment()\n\n# Create multiple threads\nthreads = []\nfor i in range(5):\n t = threading.Thread(target=worker)\n threads.append(t)\n t.start()\n\n# Wait for all threads to complete\nfor t in threads:\n t.join()\n\nprint(f\"Final count: {counter.get_value()}\")``` [/INST] Quality Score: 38/100\n\nBUGS:\n- Race condition in increment() method\n- Non-atomic read-modify-write operation\n- Final count will be less than expected (5000)\n\nPERFORMANCE ISSUES:\n- Unnecessary sleep in critical section\n- No thread synchronization\n\nSECURITY CONCERNS:\n- Race condition could lead to data corruption\n- No protection against concurrent access\n\nIMPROVEMENTS:\n1. Use proper synchronization mechanisms\n2. Make operations atomic\n\nExample improved code:\n```python\nimport threading\n\nclass Counter:\n def __init__(self):\n self._value = 0\n self._lock = threading.Lock()\n \n def increment(self):\n with self._lock:\n # Simulate work outside critical section\n time.sleep(0.001)\n self._value += 1\n \n def get_value(self):\n with self._lock:\n return self._value\n\n# Alternative: Use atomic operations\nimport threading\n\nclass AtomicCounter:\n def __init__(self):\n self._value = 0\n self._lock = threading.RLock()\n \n def increment(self):\n with self._lock:\n self._value += 1\n \n def get_value(self):\n with self._lock:\n return self._value\n```</s>"
- },
- {
- "text": "<s>[INST] Analyze this code for bugs, performance, and security issues. Give a quality score from 1-100 and provide a detailed analysis. \n\nCode:\n```import os\nimport tempfile\n\ndef process_file(file_content):\n # Create temporary file\n temp_file = tempfile.NamedTemporaryFile(delete=False)\n temp_file.write(file_content.encode())\n temp_file.close()\n \n # Process the file\n with open(temp_file.name, \"r\") as f:\n content = f.read()\n \n # Clean up\n os.remove(temp_file.name)\n \n return content.upper()\n\nif __name__ == \"__main__\":\n result = process_file(\"Hello, World!\")\n print(result)``` [/INST] Quality Score: 55/100\n\nBUGS:\n- No error handling if file operations fail\n- Potential file descriptor leak if exception occurs after file creation\n- No cleanup if processing fails\n\nPERFORMANCE ISSUES:\n- Unnecessary file I/O for simple string operations\n- Creates temporary file for small data\n\nSECURITY CONCERNS:\n- Temporary file might not be cleaned up if process crashes\n- No validation of file_content size\n\nIMPROVEMENTS:\n1. Use context managers for proper cleanup\n2. Avoid unnecessary file operations\n3. Add proper error handling\n\nExample improved code:\n```python\nimport tempfile\nimport os\nfrom contextlib import contextmanager\n\n@contextmanager\ndef temporary_file(content):\n \"\"\"Context manager for temporary file with automatic cleanup.\"\"\"\n temp_file = None\n try:\n with tempfile.NamedTemporaryFile(mode=\"w+\", delete=False, encoding=\"utf-8\") as tf:\n tf.write(content)\n temp_file = tf.name\n yield temp_file\n finally:\n if temp_file and os.path.exists(temp_file):\n try:\n os.unlink(temp_file)\n except OSError:\n pass # File already deleted\n\ndef process_file(file_content):\n # Validate input\n if not isinstance(file_content, str):\n raise ValueError(\"File content must be a string\")\n \n # For simple operations, avoid file I/O\n if len(file_content) < 1024: # Small content\n return file_content.upper()\n \n # For larger content, use temporary file\n with temporary_file(file_content) as temp_path:\n with open(temp_path, \"r\", encoding=\"utf-8\") as f:\n content = f.read()\n return content.upper()\n```</s>"
- },
- {
- "text": "<s>[INST] Analyze this code for bugs, performance, and security issues. Give a quality score from 1-100 and provide a detailed analysis. \n\nCode:\n```import hashlib\nimport os\n\ndef hash_file(filename):\n \"\"\"Calculate MD5 hash of a file.\"\"\"\n hash_md5 = hashlib.md5()\n with open(filename, \"rb\") as f:\n for chunk in iter(lambda: f.read(4096), b\"\"):\n hash_md5.update(chunk)\n return hash_md5.hexdigest()\n\ndef verify_integrity(file_path, expected_hash):\n \"\"\"Verify file integrity against expected hash.\"\"\"\n actual_hash = hash_file(file_path)\n return actual_hash == expected_hash\n\nif __name__ == \"__main__\":\n file_path = input(\"Enter file path: \")\n hash_value = hash_file(file_path)\n print(f\"MD5: {hash_value}\")\n \n expected = input(\"Enter expected hash: \")\n if verify_integrity(file_path, expected):\n print(\"File integrity verified!\")\n else:\n print(\"File integrity check failed!\")``` [/INST] Quality Score: 65/100\n\nBUGS:\n- No error handling for file operations\n- No validation of input parameters\n- Will crash if file doesn't exist or can't be read\n\nPERFORMANCE ISSUES:\n- MD5 is fast but not cryptographically secure\n- No file size limits\n\nSECURITY CONCERNS:\n- MD5 is vulnerable to collision attacks\n- No protection against path traversal\n- No input sanitization\n\nIMPROVEMENTS:\n1. Use cryptographically secure hash function\n2. Add proper error handling and validation\n3. 
Secure file path handling\n\nExample improved code:\n```python\nimport hashlib\nimport os\nfrom pathlib import Path\n\ndef hash_file(file_path, algorithm=\"sha256\"):\n \"\"\"Calculate hash of a file using specified algorithm.\"\"\"\n file_path = Path(file_path).resolve()\n \n if not file_path.exists():\n raise FileNotFoundError(f\"File not found: {file_path}\")\n \n if not file_path.is_file():\n raise ValueError(f\"Path is not a file: {file_path}\")\n \n # Choose hash algorithm\n if algorithm.lower() == \"md5\":\n hasher = hashlib.md5()\n elif algorithm.lower() == \"sha256\":\n hasher = hashlib.sha256()\n elif algorithm.lower() == \"sha1\":\n hasher = hashlib.sha1()\n else:\n raise ValueError(f\"Unsupported algorithm: {algorithm}\")\n \n try:\n with open(file_path, \"rb\") as f:\n while chunk := f.read(8192): # 8KB chunks\n hasher.update(chunk)\n return hasher.hexdigest()\n except IOError as e:\n raise IOError(f\"Error reading file: {e}\")\n\ndef verify_integrity(file_path, expected_hash, algorithm=\"sha256\"):\n \"\"\"Verify file integrity against expected hash.\"\"\"\n try:\n actual_hash = hash_file(file_path, algorithm)\n return actual_hash.lower() == expected_hash.lower()\n except Exception as e:\n print(f\"Verification failed: {e}\")\n return False\n```</s>"
- }
- ]
-
- # Continue with more samples to reach 25 total...
- more_samples = [
- {
- "text": "<s>[INST] Analyze this code for bugs, performance, and security issues. Give a quality score from 1-100 and provide a detailed analysis. \n\nCode:\n```import socket\nimport threading\n\nclass SimpleServer:\n def __init__(self, host=\"localhost\", port=8080):\n self.host = host\n self.port = port\n self.clients = []\n \n def handle_client(self, client_socket):\n while True:\n data = client_socket.recv(1024)\n if not data:\n break\n response = f\"Echo: {data.decode()}\"\n client_socket.send(response.encode())\n client_socket.close()\n \n def start(self):\n server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)\n server.bind((self.host, self.port))\n server.listen(5)\n \n while True:\n client, addr = server.accept()\n self.clients.append(client)\n thread = threading.Thread(target=self.handle_client, args=(client,))\n thread.start()\n\nif __name__ == \"__main__\":\n server = SimpleServer()\n server.start()``` [/INST] Quality Score: 33/100\n\nBUGS:\n- No error handling for socket operations\n- Threads are never joined or cleaned up\n- No graceful shutdown mechanism\n- No connection limits or timeouts\n\nPERFORMANCE ISSUES:\n- Creates unlimited threads (potential thread exhaustion)\n- No connection pooling or limits\n- Memory leak with growing clients list\n\nSECURITY CONCERNS:\n- No authentication or authorization\n- No input validation or sanitization\n- Vulnerable to DoS attacks\n- No rate limiting\n\nIMPROVEMENTS:\n1. Add proper error handling and resource management\n2. Implement connection limits and timeouts\n3. 
Add security measures\n\nExample improved code:\n```python\nimport socket\nimport threading\nimport time\nfrom concurrent.futures import ThreadPoolExecutor\n\nclass SimpleServer:\n def __init__(self, host=\"127.0.0.1\", port=8080, max_connections=10):\n self.host = host\n self.port = port\n self.max_connections = max_connections\n self.running = False\n self.executor = ThreadPoolExecutor(max_workers=max_connections)\n \n def handle_client(self, client_socket, addr):\n \"\"\"Handle client connection with proper error handling.\"\"\"\n try:\n client_socket.settimeout(30) # 30 second timeout\n \n while self.running:\n try:\n data = client_socket.recv(1024)\n if not data:\n break\n \n # Basic input validation\n message = data.decode(\"utf-8\", errors=\"ignore\")\n if len(message) > 1000: # Limit message size\n response = \"Error: Message too long\"\n else:\n response = f\"Echo: {message}\"\n \n client_socket.send(response.encode(\"utf-8\"))\n except socket.timeout:\n break\n except Exception as e:\n print(f\"Error handling client {addr}: {e}\")\n finally:\n try:\n client_socket.close()\n except:\n pass\n \n def start(self):\n \"\"\"Start the server with proper resource management.\"\"\"\n self.running = True\n \n try:\n server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)\n server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)\n server.bind((self.host, self.port))\n server.listen(self.max_connections)\n \n print(f\"Server listening on {self.host}:{self.port}\")\n \n while self.running:\n try:\n client, addr = server.accept()\n self.executor.submit(self.handle_client, client, addr)\n except Exception as e:\n if self.running:\n print(f\"Error accepting connection: {e}\")\n except Exception as e:\n print(f\"Server error: {e}\")\n finally:\n server.close()\n \n def stop(self):\n \"\"\"Gracefully stop the server.\"\"\"\n self.running = False\n self.executor.shutdown(wait=True)\n```</s>"
- },
- {
- "text": "<s>[INST] Analyze this code for bugs, performance, and security issues. Give a quality score from 1-100 and provide a detailed analysis. \n\nCode:\n```import re\n\ndef validate_email(email):\n \"\"\"Validate email address using regex.\"\"\"\n pattern = r\"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$\"\n return bool(re.match(pattern, email))\n\ndef extract_emails(text):\n \"\"\"Extract all email addresses from text.\"\"\"\n pattern = r\"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}\"\n return re.findall(pattern, text)\n\ndef sanitize_input(user_input):\n \"\"\"Sanitize user input by removing special characters.\"\"\"\n return re.sub(r\"[^a-zA-Z0-9@._-]\", \"\", user_input)\n\nif __name__ == \"__main__\":\n email = input(\"Enter email: \")\n if validate_email(email):\n print(\"Valid email!\")\n else:\n print(\"Invalid email!\")\n \n text = input(\"Enter text to extract emails: \")\n emails = extract_emails(text)\n print(f\"Found emails: {emails}\")\n \n user_input = input(\"Enter text to sanitize: \")\n clean = sanitize_input(user_input)\n print(f\"Sanitized: {clean}\")``` [/INST] Quality Score: 72/100\n\nBUGS:\n- Email regex is too permissive and may allow invalid emails\n- No handling of edge cases (empty strings, None values)\n- Sanitization function is too aggressive and may break valid input\n\nPERFORMANCE ISSUES:\n- Regex compilation happens on every call (should be compiled once)\n- No input size limits\n\nSECURITY CONCERNS:\n- Email validation regex can be bypassed\n- Sanitization may not be sufficient for all use cases\n- No protection against regex DoS attacks\n\nIMPROVEMENTS:\n1. Use proper email validation library\n2. Pre-compile regex patterns\n3. 
Add proper input validation\n\nExample improved code:\n```python\nimport re\nfrom email_validator import validate_email, EmailNotValidError\n\nclass EmailValidator:\n def __init__(self):\n # Pre-compile regex patterns for better performance\n self.email_pattern = re.compile(\n r\"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$\",\n re.IGNORECASE\n )\n self.extract_pattern = re.compile(\n r\"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}\",\n re.IGNORECASE\n )\n \n def validate_email(self, email):\n \"\"\"Validate email address using proper validation.\"\"\"\n if not email or not isinstance(email, str):\n return False\n \n if len(email) > 254: # RFC 5321 limit\n return False\n \n try:\n # Use proper email validation\n validate_email(email)\n return True\n except EmailNotValidError:\n return False\n \n def extract_emails(self, text):\n \"\"\"Extract all email addresses from text.\"\"\"\n if not text or not isinstance(text, str):\n return []\n \n if len(text) > 10000: # Limit input size\n raise ValueError(\"Input text too long\")\n \n matches = self.extract_pattern.findall(text)\n return [email for email in matches if self.validate_email(email)]\n \n def sanitize_input(self, user_input, max_length=1000):\n \"\"\"Sanitize user input with proper validation.\"\"\"\n if not user_input or not isinstance(user_input, str):\n return \"\"\n \n if len(user_input) > max_length:\n user_input = user_input[:max_length]\n \n # Remove potentially dangerous characters but preserve email structure\n sanitized = re.sub(r\"[<>\"'&]\", \"\", user_input)\n return sanitized\n\n# Usage\nvalidator = EmailValidator()\n```</s>"
- }
- ]
-
- # Add 18 more samples to reach 25 total
- additional_samples = [
- {
- "text": "<s>[INST] Analyze this code for bugs, performance, and security issues. Give a quality score from 1-100 and provide a detailed analysis. \n\nCode:\n```import os\nimport shutil\n\ndef backup_file(source, destination):\n \"\"\"Backup a file to destination.\"\"\"\n if os.path.exists(source):\n shutil.copy2(source, destination)\n print(f\"Backed up {source} to {destination}\")\n else:\n print(f\"Source file {source} does not exist\")\n\nif __name__ == \"__main__\":\n source = input(\"Enter source file: \")\n dest = input(\"Enter destination: \")\n backup_file(source, dest)``` [/INST] Quality Score: 48/100\n\nBUGS:\n- No error handling for file operations\n- No validation of input paths\n- Will crash if destination directory doesn't exist\n\nPERFORMANCE ISSUES:\n- No file size limits\n- No progress indication for large files\n\nSECURITY CONCERNS:\n- Path traversal vulnerability in user input\n- No validation of file paths\n- Could overwrite system files\n\nIMPROVEMENTS:\n1. Add proper error handling and path validation\n2. Implement security measures\n3. Add progress indication\n\nExample improved code:\n```python\nimport os\nimport shutil\nfrom pathlib import Path\n\ndef backup_file(source, destination):\n \"\"\"Safely backup a file to destination.\"\"\"\n try:\n source_path = Path(source).resolve()\n dest_path = Path(destination).resolve()\n \n # Validate source\n if not source_path.exists():\n raise FileNotFoundError(f\"Source file not found: {source_path}\")\n \n if not source_path.is_file():\n raise ValueError(f\"Source is not a file: {source_path}\")\n \n # Validate destination\n dest_dir = dest_path.parent\n if not dest_dir.exists():\n dest_dir.mkdir(parents=True, exist_ok=True)\n \n # Check if destination already exists\n if dest_path.exists():\n response = input(f\"Destination exists. Overwrite? 
(y/n): \")\n if response.lower() != 'y':\n print(\"Backup cancelled\")\n return False\n \n # Perform backup\n shutil.copy2(source_path, dest_path)\n print(f\"Successfully backed up {source_path} to {dest_path}\")\n return True\n \n except Exception as e:\n print(f\"Backup failed: {e}\")\n return False\n```</s>"
- },
- {
- "text": "<s>[INST] Analyze this code for bugs, performance, and security issues. Give a quality score from 1-100 and provide a detailed analysis. \n\nCode:\n```import urllib.request\nimport json\n\ndef fetch_data(url):\n \"\"\"Fetch data from URL and return JSON.\"\"\"\n response = urllib.request.urlopen(url)\n data = response.read()\n return json.loads(data.decode())\n\ndef process_api_data(url):\n \"\"\"Process data from API endpoint.\"\"\"\n data = fetch_data(url)\n results = []\n for item in data:\n if 'name' in item and 'value' in item:\n results.append({\n 'name': item['name'],\n 'value': item['value'] * 2\n })\n return results\n\nif __name__ == \"__main__\":\n api_url = input(\"Enter API URL: \")\n results = process_api_data(api_url)\n print(f\"Processed {len(results)} items\")\n for result in results:\n print(f\"{result['name']}: {result['value']}\")``` [/INST] Quality Score: 52/100\n\nBUGS:\n- No error handling for network requests\n- No validation of URL format\n- Assumes data structure without validation\n\nPERFORMANCE ISSUES:\n- No timeout for network requests\n- No caching mechanism\n- Loads entire response into memory\n\nSECURITY CONCERNS:\n- No URL validation (potential SSRF)\n- No SSL certificate verification\n- No rate limiting\n\nIMPROVEMENTS:\n1. Add proper error handling and validation\n2. Implement security measures\n3. 
Add timeout and caching\n\nExample improved code:\n```python\nimport urllib.request\nimport urllib.parse\nimport json\nimport ssl\nfrom urllib.error import URLError, HTTPError\n\nclass APIClient:\n def __init__(self, timeout=30):\n self.timeout = timeout\n self.ctx = ssl.create_default_context()\n \n def fetch_data(self, url):\n \"\"\"Safely fetch data from URL.\"\"\"\n # Validate URL\n parsed = urllib.parse.urlparse(url)\n if not parsed.scheme or not parsed.netloc:\n raise ValueError(\"Invalid URL format\")\n \n # Only allow HTTP/HTTPS\n if parsed.scheme not in ['http', 'https']:\n raise ValueError(\"Only HTTP/HTTPS URLs are allowed\")\n \n try:\n request = urllib.request.Request(url)\n request.add_header('User-Agent', 'Python-API-Client/1.0')\n \n with urllib.request.urlopen(request, timeout=self.timeout, context=self.ctx) as response:\n if response.status != 200:\n raise HTTPError(url, response.status, \"HTTP Error\", response.headers, None)\n \n data = response.read()\n if len(data) > 10 * 1024 * 1024: # 10MB limit\n raise ValueError(\"Response too large\")\n \n return json.loads(data.decode('utf-8'))\n \n except URLError as e:\n raise ConnectionError(f\"Network error: {e}\")\n except json.JSONDecodeError as e:\n raise ValueError(f\"Invalid JSON response: {e}\")\n \n def process_api_data(self, url):\n \"\"\"Process data from API with validation.\"\"\"\n try:\n data = self.fetch_data(url)\n \n if not isinstance(data, list):\n raise ValueError(\"Expected list data\")\n \n results = []\n for item in data:\n if not isinstance(item, dict):\n continue\n \n if 'name' in item and 'value' in item:\n try:\n value = float(item['value'])\n results.append({\n 'name': str(item['name']),\n 'value': value * 2\n })\n except (ValueError, TypeError):\n continue\n \n return results\n \n except Exception as e:\n print(f\"Error processing API data: {e}\")\n return []\n```</s>"
- },
- {
- "text": "<s>[INST] Analyze this code for bugs, performance, and security issues. Give a quality score from 1-100 and provide a detailed analysis. \n\nCode:\n```import logging\nimport os\n\ndef setup_logging():\n logging.basicConfig(\n level=logging.INFO,\n format='%(asctime)s - %(levelname)s - %(message)s',\n filename='app.log'\n )\n\ndef log_user_action(user_id, action, details):\n \"\"\"Log user actions for audit trail.\"\"\"\n message = f\"User {user_id} performed {action}: {details}\"\n logging.info(message)\n\ndef log_error(error_msg, user_id=None):\n \"\"\"Log error messages.\"\"\"\n if user_id:\n logging.error(f\"User {user_id} encountered error: {error_msg}\")\n else:\n logging.error(f\"System error: {error_msg}\")\n\nif __name__ == \"__main__\":\n setup_logging()\n log_user_action(\"user123\", \"login\", \"successful login\")\n log_error(\"Database connection failed\")``` [/INST] Quality Score: 68/100\n\nBUGS:\n- No error handling for file operations\n- No log rotation or size limits\n- No validation of input parameters\n\nPERFORMANCE ISSUES:\n- No async logging\n- File I/O blocks on every log\n- No log compression\n\nSECURITY CONCERNS:\n- Logs may contain sensitive information\n- No log sanitization\n- Potential log injection attacks\n\nIMPROVEMENTS:\n1. Add proper error handling and log rotation\n2. Implement log sanitization\n3. 
Add security measures\n\nExample improved code:\n```python\nimport logging\nimport logging.handlers\nimport os\nimport re\nfrom datetime import datetime\n\nclass SecureLogger:\n def __init__(self, log_file='app.log', max_size=10*1024*1024, backup_count=5):\n self.logger = logging.getLogger('secure_app')\n self.logger.setLevel(logging.INFO)\n \n # Clear existing handlers\n self.logger.handlers.clear()\n \n # Create rotating file handler\n handler = logging.handlers.RotatingFileHandler(\n log_file, maxBytes=max_size, backupCount=backup_count\n )\n \n # Create formatter\n formatter = logging.Formatter(\n '%(asctime)s - %(levelname)s - %(message)s',\n datefmt='%Y-%m-%d %H:%M:%S'\n )\n handler.setFormatter(formatter)\n \n self.logger.addHandler(handler)\n \n # Prevent sensitive data patterns\n self.sensitive_patterns = [\n r'password[=:]\\s*\\S+',\n r'token[=:]\\s*\\S+',\n r'key[=:]\\s*\\S+',\n r'\\b\\d{4}[\\s-]?\\d{4}[\\s-]?\\d{4}[\\s-]?\\d{4}\\b' # Credit card\n ]\n \n def sanitize_message(self, message):\n \"\"\"Remove sensitive information from log messages.\"\"\"\n for pattern in self.sensitive_patterns:\n message = re.sub(pattern, '[REDACTED]', message, flags=re.IGNORECASE)\n return message\n \n def log_user_action(self, user_id, action, details):\n \"\"\"Log user actions with sanitization.\"\"\"\n if not user_id or not action:\n return\n \n message = f\"User {user_id} performed {action}: {details}\"\n sanitized_message = self.sanitize_message(message)\n self.logger.info(sanitized_message)\n \n def log_error(self, error_msg, user_id=None):\n \"\"\"Log error messages with sanitization.\"\"\"\n if not error_msg:\n return\n \n if user_id:\n message = f\"User {user_id} encountered error: {error_msg}\"\n else:\n message = f\"System error: {error_msg}\"\n \n sanitized_message = self.sanitize_message(message)\n self.logger.error(sanitized_message)\n\n# Usage\nlogger = SecureLogger()\n```</s>"
- }
- ]
-
- # Add remaining samples to reach 25 total
- final_samples = [
- {
- "text": "<s>[INST] Analyze this code for bugs, performance, and security issues. Give a quality score from 1-100 and provide a detailed analysis. \n\nCode:\n```import random\nimport string\n\ndef generate_password(length=8):\n \"\"\"Generate a random password.\"\"\"\n chars = string.ascii_letters + string.digits\n password = ''.join(random.choice(chars) for _ in range(length))\n return password\n\ndef create_user(username, email):\n \"\"\"Create a new user account.\"\"\"\n password = generate_password()\n user_data = {\n 'username': username,\n 'email': email,\n 'password': password\n }\n # In real app, save to database\n print(f\"User created: {username} with password: {password}\")\n return user_data\n\nif __name__ == \"__main__\":\n username = input(\"Enter username: \")\n email = input(\"Enter email: \")\n user = create_user(username, email)``` [/INST] Quality Score: 25/100\n\nCRITICAL SECURITY VULNERABILITIES:\n- Weak password generation (no special characters, predictable)\n- Password printed to console (major security risk)\n- No password strength validation\n- No input validation\n\nBUGS:\n- No error handling\n- No validation of username/email format\n\nPERFORMANCE ISSUES:\n- Uses random.choice in loop (slow for long passwords)\n\nSECURITY CONCERNS:\n- Passwords logged to console\n- Weak entropy\n- No password policy enforcement\n\nIMPROVEMENTS:\n1. Use cryptographically secure random generation\n2. Implement proper password policies\n3. 
Never log passwords\n\nExample improved code:\n```python\nimport secrets\nimport string\nimport re\nfrom typing import Optional\n\nclass PasswordGenerator:\n def __init__(self):\n self.lowercase = string.ascii_lowercase\n self.uppercase = string.ascii_uppercase\n self.digits = string.digits\n self.special = \"!@#$%^&*\"\n self.all_chars = self.lowercase + self.uppercase + self.digits + self.special\n \n def generate_password(self, length=12, min_special=1, min_upper=1, min_lower=1, min_digits=1):\n \"\"\"Generate cryptographically secure password.\"\"\"\n if length < 8:\n raise ValueError(\"Password must be at least 8 characters\")\n \n password = []\n \n # Ensure minimum requirements\n password.extend(secrets.choice(self.special) for _ in range(min_special))\n password.extend(secrets.choice(self.uppercase) for _ in range(min_upper))\n password.extend(secrets.choice(self.lowercase) for _ in range(min_lower))\n password.extend(secrets.choice(self.digits) for _ in range(min_digits))\n \n # Fill remaining length\n remaining = length - len(password)\n password.extend(secrets.choice(self.all_chars) for _ in range(remaining))\n \n # Shuffle to randomize positions\n secrets.SystemRandom().shuffle(password)\n \n return ''.join(password)\n \n def validate_password(self, password):\n \"\"\"Validate password strength.\"\"\"\n if len(password) < 8:\n return False, \"Password must be at least 8 characters\"\n \n if not re.search(r'[a-z]', password):\n return False, \"Password must contain lowercase letters\"\n \n if not re.search(r'[A-Z]', password):\n return False, \"Password must contain uppercase letters\"\n \n if not re.search(r'\\d', password):\n return False, \"Password must contain digits\"\n \n if not re.search(r'[!@#$%^&*]', password):\n return False, \"Password must contain special characters\"\n \n return True, \"Password is strong\"\n\ndef create_user(username, email):\n \"\"\"Create user with secure password.\"\"\"\n # Validate inputs\n if not username or not 
email:\n raise ValueError(\"Username and email required\")\n \n if len(username) < 3:\n raise ValueError(\"Username must be at least 3 characters\")\n \n # Generate secure password\n generator = PasswordGenerator()\n password = generator.generate_password()\n \n user_data = {\n 'username': username,\n 'email': email,\n 'password': password # In real app, hash this immediately\n }\n \n # NEVER log passwords!\n print(f\"User created: {username}\")\n return user_data\n```</s>"
- }
- ]
-
- # Add all samples together to reach 25 total
- all_samples = new_samples + more_samples + additional_samples + final_samples
-
- return all_samples
-
- if __name__ == "__main__":
- samples = create_additional_samples()
- print(f"Created {len(samples)} additional training samples")
-
- # Save to file
- with open("additional_samples.json", "w") as f:
- json.dump(samples, f, indent=2)
-
- print("Samples saved to additional_samples.json")
 
analyst_dataset.jsonl DELETED
@@ -1,34 +0,0 @@
- {"text": "<s>[INST] Analyze this code for bugs, performance, and security issues. Give a quality score from 1-100 and provide a detailed analysis. \n\nCode:\n```def calculate_fibonacci(n):\n if n <= 0:\n return 0\n elif n == 1:\n return 1\n else:\n return calculate_fibonacci(n-1) + calculate_fibonacci(n-2)\n\n# Test the function\nfor i in range(10):\n result = calculate_fibonacci(i)\n print(f\"Fibonacci({i}) = {result}\")``` [/INST] Quality Score: 65/100\n\nBUGS:\n- Recursive implementation without memoization causes exponential time complexity\n- Will cause stack overflow for large values of n\n- No input validation for negative numbers beyond the first check\n\nPERFORMANCE ISSUES:\n- Time complexity: O(2^n) - extremely inefficient\n- Space complexity: O(n) due to recursion stack\n- Repeated calculations for same values\n\nSECURITY CONCERNS:\n- No protection against extremely large inputs that could cause DoS\n- No input sanitization\n\nIMPROVEMENTS:\n1. Use memoization or iterative approach for O(n) time complexity\n2. Add input validation with proper error handling\n3. Add maximum input limit to prevent abuse\n4. Consider using dynamic programming approach\n\nExample improved code:\n```python\ndef calculate_fibonacci(n):\n if n < 0:\n raise ValueError(\"Input must be non-negative\")\n if n > 1000: # Prevent abuse\n raise ValueError(\"Input too large\")\n \n if n <= 1:\n return n\n \n a, b = 0, 1\n for _ in range(2, n + 1):\n a, b = b, a + b\n return b\n```</s>"}
- {"text": "<s>[INST] Analyze this code for bugs, performance, and security issues. Give a quality score from 1-100 and provide a detailed analysis. \n\nCode:\n```import os\n\ndef read_user_file(filename):\n file_path = os.path.join(\"/home/user/\", filename)\n with open(file_path, \"r\") as f:\n content = f.read()\n return content\n\n# Usage\nuser_input = input(\"Enter filename: \")\ndata = read_user_file(user_input)\nprint(data)``` [/INST] Quality Score: 25/100\n\nCRITICAL SECURITY VULNERABILITIES:\n- Path traversal vulnerability: User can access any file by using \"../\" in filename\n- No input validation or sanitization\n- Hardcoded directory path creates inflexibility\n- No error handling for file operations\n\nBUGS:\n- Will crash if file doesn't exist\n- Will crash if user doesn't have read permissions\n- No handling for binary files (will try to decode as text)\n\nPERFORMANCE ISSUES:\n- Loads entire file into memory (problematic for large files)\n- No file size limits\n\nIMPROVEMENTS:\n1. Validate and sanitize input filename\n2. Use absolute path validation\n3. Add proper error handling\n4. Implement file size limits\n5. Use pathlib for safer path handling\n\nExample improved code:\n```python\nimport os\nfrom pathlib import Path\n\ndef read_user_file(filename):\n # Validate filename\n if not filename or \"..\" in filename or \"/\" in filename:\n raise ValueError(\"Invalid filename\")\n \n # Create safe path\n base_dir = Path(\"/home/user/\")\n file_path = base_dir / filename\n \n # Ensure path is within allowed directory\n if not file_path.resolve().is_relative_to(base_dir.resolve()):\n raise ValueError(\"Access denied\")\n \n # Check file size\n if file_path.stat().st_size > 10 * 1024 * 1024: # 10MB limit\n raise ValueError(\"File too large\")\n \n try:\n with open(file_path, \"r\") as f:\n return f.read()\n except FileNotFoundError:\n raise FileNotFoundError(\"File not found\")\n except PermissionError:\n raise PermissionError(\"Access denied\")\n```</s>"}
- {"text": "<s>[INST] Analyze this code for bugs, performance, and security issues. Give a quality score from 1-100 and provide a detailed analysis. \n\nCode:\n```class UserManager:\n def __init__(self):\n self.users = {}\n \n def add_user(self, username, password):\n self.users[username] = password\n \n def authenticate(self, username, password):\n if username in self.users:\n if self.users[username] == password:\n return True\n return False\n \n def get_user_info(self, username):\n return self.users.get(username, None)\n\n# Usage\nmanager = UserManager()\nmanager.add_user(\"admin\", \"password123\")\nprint(manager.authenticate(\"admin\", \"password123\"))``` [/INST] Quality Score: 30/100\n\nCRITICAL SECURITY VULNERABILITIES:\n- Passwords stored in plain text (major security risk)\n- No password hashing or encryption\n- No password strength validation\n- No protection against brute force attacks\n- No session management\n\nBUGS:\n- No input validation for username/password\n- No handling of edge cases (empty strings, None values)\n- No duplicate user prevention\n\nPERFORMANCE ISSUES:\n- In-memory storage (data lost on restart)\n- No database persistence\n- No indexing for large user bases\n\nIMPROVEMENTS:\n1. Hash passwords using bcrypt or similar\n2. Add input validation and sanitization\n3. Implement proper session management\n4. Use database for persistence\n5. Add password strength requirements\n6. Implement rate limiting for authentication attempts\n\nExample improved code:\n```python\nimport hashlib\nimport secrets\nfrom typing import Optional\n\nclass UserManager:\n def __init__(self):\n self.users = {}\n self.sessions = {}\n \n def _hash_password(self, password: str) -> str:\n salt = secrets.token_hex(16)\n password_hash = hashlib.pbkdf2_hmac('sha256', \n password.encode('utf-8'), \n salt.encode('utf-8'), \n 100000)\n return f\"{salt}:{password_hash.hex()}\"\n \n def _verify_password(self, password: str, stored_hash: str) -> bool:\n salt, hash_value = stored_hash.split(':')\n password_hash = hashlib.pbkdf2_hmac('sha256',\n password.encode('utf-8'),\n salt.encode('utf-8'),\n 100000)\n return password_hash.hex() == hash_value\n \n def add_user(self, username: str, password: str) -> bool:\n if not username or not password:\n raise ValueError(\"Username and password required\")\n if len(password) < 8:\n raise ValueError(\"Password must be at least 8 characters\")\n if username in self.users:\n raise ValueError(\"User already exists\")\n \n self.users[username] = self._hash_password(password)\n return True\n \n def authenticate(self, username: str, password: str) -> bool:\n if username not in self.users:\n return False\n return self._verify_password(password, self.users[username])\n```</s>"}
- {"text": "<s>[INST] Analyze this code for bugs, performance, and security issues. Give a quality score from 1-100 and provide a detailed analysis. \n\nCode:\n```import re, pathlib, requests\nfrom urllib.parse import urlparse\nfrom html.parser import HTMLParser\n\nclass TitleParser(HTMLParser):\n def __init__(self): super().__init__(); self.in_title=False; self.title=[]\n def handle_starttag(self, tag, attrs): self.in_title = tag.lower()==\"title\"\n def handle_endtag(self, tag): \n if tag.lower()==\"title\": self.in_title=False\n def handle_data(self, data): \n if self.in_title: self.title.append(data)\n\nALLOWED_SCHEMES = {\"http\",\"https\"}\nBLOCKED_HOSTS = {\"169.254.169.254\",\"localhost\",\"127.0.0.1\",\"::1\"}\n\ndef safe_fetch(url: str):\n u = urlparse(url)\n if u.scheme not in ALLOWED_SCHEMES or u.hostname in BLOCKED_HOSTS:\n raise ValueError(\"URL not allowed\")\n with requests.get(url, timeout=10, stream=True) as r:\n r.raise_for_status()\n text = r.text # consider iter_content for huge pages\n p = TitleParser(); p.feed(text)\n title = \"\".join(p.title).strip() or \"No Title\"\n safe_name = re.sub(r\"[^a-zA-Z0-9_.-]\", \"_\", u.hostname or \"page\")\n pathlib.Path(f\"{safe_name}.html\").write_text(text, encoding=\"utf-8\")\n return title``` [/INST] Quality Score: 48/100\n\nBUGS:\n- No error handling for network failures or malformed HTML\n- File overwrites without checking if it already exists\n- No validation of URL format before parsing\n\nPERFORMANCE ISSUES:\n- Loads entire response into memory with r.text instead of streaming\n- No caching mechanism for repeated requests to same URL\n\nSECURITY CONCERNS:\n- File writes to current directory without path validation\n- No size limits on downloaded content (potential DoS)\n- No validation of HTML content before parsing\n\nIMPROVEMENTS:\n1. Add proper error handling and file size limits\n```python\nimport re, pathlib, requests\nfrom urllib.parse import urlparse\nfrom html.parser import HTMLParser\n\ndef safe_fetch(url: str, max_size: int = 10 * 1024 * 1024):\n u = urlparse(url)\n if u.scheme not in ALLOWED_SCHEMES or u.hostname in BLOCKED_HOSTS:\n raise ValueError(\"URL not allowed\")\n \n try:\n with requests.get(url, timeout=10, stream=True) as r:\n r.raise_for_status()\n content_length = int(r.headers.get('content-length', 0))\n if content_length > max_size:\n raise ValueError(\"Content too large\")\n \n content = b\"\"\n for chunk in r.iter_content(chunk_size=8192):\n content += chunk\n if len(content) > max_size:\n raise ValueError(\"Content too large\")\n \n text = content.decode('utf-8', errors='ignore')\n except requests.RequestException as e:\n raise ValueError(f\"Network error: {e}\")\n \n # Parse and save with proper error handling\n p = TitleParser()\n try:\n p.feed(text)\n except Exception:\n pass # Continue even if HTML parsing fails\n \n title = \"\".join(p.title).strip() or \"No Title\"\n safe_name = re.sub(r\"[^a-zA-Z0-9_.-]\", \"_\", u.hostname or \"page\")\n \n # Create safe output directory\n output_dir = pathlib.Path(\"downloads\")\n output_dir.mkdir(exist_ok=True)\n output_path = output_dir / f\"{safe_name}.html\"\n \n output_path.write_text(text, encoding=\"utf-8\")\n return title\n```</s>"}
- {"text": "<s>[INST] Analyze this code for bugs, performance, and security issues. Give a quality score from 1-100 and provide a detailed analysis. \n\nCode:\n```import sqlite3\n\ndef get_user(email):\n conn = sqlite3.connect(\"app.db\")\n cur = conn.cursor()\n # vulnerable string interpolation\n q = f\"SELECT id, email, role FROM users WHERE email = '{email}'\"\n rows = cur.execute(q).fetchall()\n conn.close()\n return rows[0] if rows else None\n\ndef delete_user(email):\n # missing commit and no existence check\n conn = sqlite3.connect(\"app.db\")\n cur = conn.cursor()\n cur.execute(f\"DELETE FROM users WHERE email = '{email}'\")\n conn.close()\n\nif __name__ == \"__main__\":\n e = input(\"Email: \")\n print(get_user(e))\n if input(\"Delete? (y/n): \") == \"y\":\n delete_user(e)``` [/INST] Quality Score: 32/100\n\nBUGS:\n- delete_user never commits; deletion may not persist\n- get_user returns None for empty table but code assumes tuple downstream in callers (potential TypeError)\n\nPERFORMANCE ISSUES:\n- Opens/closes a new connection per call; consider pooling or reuse for batch operations\n\nSECURITY CONCERNS:\n- SQL injection via string interpolation for email\n- Potential information disclosure by printing raw tuple\n\nIMPROVEMENTS:\n1. Use parameterized queries and ensure transactions are committed\n```python\nimport sqlite3\nfrom contextlib import contextmanager\n\n@contextmanager\ndef with_conn(db=\"app.db\"):\n conn = sqlite3.connect(db)\n try:\n yield conn\n conn.commit()\n finally:\n conn.close()\n\ndef get_user(email):\n with sqlite3.connect(\"app.db\") as conn:\n cur = conn.execute(\"SELECT id, email, role FROM users WHERE email = ?\", (email,))\n return cur.fetchone()\n\ndef delete_user(email):\n with sqlite3.connect(\"app.db\") as conn:\n conn.execute(\"DELETE FROM users WHERE email = ?\", (email,))\n```</s>"}
- {"text": "<s>[INST] Analyze this code for bugs, performance, and security issues. Give a quality score from 1-100 and provide a detailed analysis. \n\nCode:\n```from flask import Flask, request, jsonify\n\napp = Flask(__name__)\nAPI_TOKEN = \"secret\" # hardcoded\n\n@app.route(\"/admin/users\", methods=[\"GET\"])\ndef list_users():\n # auth via query param; vulnerable\n if request.args.get(\"token\") != API_TOKEN:\n return jsonify({\"error\": \"unauthorized\"}), 401\n # pretend to read from file\n with open(\"users.json\") as f:\n return f.read() # returns raw string, wrong content-type\n\n@app.route(\"/echo\", methods=[\"POST\"])\ndef echo():\n # reflects untrusted data directly\n msg = request.json.get(\"msg\", \"\")\n return f\"<div>{msg}</div>\"\n\nif __name__ == \"__main__\":\n app.run(debug=True)``` [/INST] Quality Score: 35/100\n\nBUGS:\n- /admin/users returns plain string without JSON content-type; clients may fail parsing\n- Missing error handling if users.json not found\n\nPERFORMANCE ISSUES:\n- Reading file on every request instead of caching\n\nSECURITY CONCERNS:\n- Hardcoded secret; weak auth via token in query string (leaks via logs)\n- XSS in /echo by reflecting unsanitized HTML\n- debug=True exposes Werkzeug debugger with arbitrary code execution in some configs\n\nIMPROVEMENTS:\n1. Use header-based auth, environment secrets, safe JSON responses, and HTML escaping\n```python\nimport os, html, json\nfrom flask import Flask, request, jsonify, Response\n\napp = Flask(__name__)\nAPI_TOKEN = os.environ.get(\"API_TOKEN\")\n\ndef require_token():\n token = request.headers.get(\"Authorization\",\"\").replace(\"Bearer \",\"\")\n if not API_TOKEN or token != API_TOKEN:\n return jsonify({\"error\":\"unauthorized\"}), 401\n\n@app.get(\"/admin/users\")\ndef list_users():\n auth = require_token()\n if auth: return auth\n try:\n with open(\"users.json\",\"r\",encoding=\"utf-8\") as f:\n data = json.load(f)\n return jsonify(data)\n except FileNotFoundError:\n return jsonify({\"error\":\"not found\"}), 404\n\n@app.post(\"/echo\")\ndef echo():\n msg = request.json.get(\"msg\",\"\")\n return Response(f\"<div>{html.escape(msg)}</div>\", mimetype=\"text/html\")\n```</s>"}
- {"text": "<s>[INST] Analyze this code for bugs, performance, and security issues. Give a quality score from 1-100 and provide a detailed analysis. \n\nCode:\n```import os, time\n\nTMP_DIR = \"/tmp\"\n\ndef process_upload(filename, content):\n # race: check then use predictable path\n path = os.path.join(TMP_DIR, filename)\n if not os.path.exists(path):\n open(path, \"w\").write(content)\n # later read it\n time.sleep(0.5) # simulate delay\n data = open(path).read()\n return data.upper()\n\nif __name__ == \"__main__\":\n print(process_upload(input(\"name: \"), \"hello\"))``` [/INST] Quality Score: 44/100\n\nBUGS:\n- No file closing (implicit close but errors may leak descriptors)\n- No input validation on filename; may include path separators\n\nPERFORMANCE ISSUES:\n- Sleep delays processing; unnecessary blocking\n- Reads whole file even if large\n\nSECURITY CONCERNS:\n- TOCTOU race: attacker can symlink path to sensitive file after existence check\n- Path traversal via filename allowing writes to arbitrary locations\n\nIMPROVEMENTS:\n1. Use secure temporary files and sanitize paths; open with exclusive flags\n```python\nimport io, os, tempfile\n\ndef process_upload(content: str):\n with tempfile.NamedTemporaryFile(mode=\"w+\", delete=False, dir=tempfile.gettempdir(), encoding=\"utf-8\") as tf:\n tf.write(content)\n tf.flush()\n tf.seek(0)\n data = tf.read()\n return data.upper()\n```</s>"}
- {"text": "<s>[INST] Analyze this code for bugs, performance, and security issues. Give a quality score from 1-100 and provide a detailed analysis. \n\nCode:\n```import csv\n\ndef load_prices(path):\n rows = []\n with open(path) as f:\n for r in csv.DictReader(f):\n # trusting numeric fields; may be blank\n price = float(r[\"price\"]) # ValueError possible\n # evaluating discount expression directly from CSV (dangerous)\n discount = eval(r.get(\"discount\",\"0\")) # arbitrary code exec\n rows.append({\"name\": r[\"name\"], \"final\": price - discount})\n return rows\n\nif __name__ == \"__main__\":\n print(load_prices(\"prices.csv\"))``` [/INST] Quality Score: 27/100\n\nBUGS:\n- Unhandled ValueError for missing or non-numeric price\n- discount application can produce negative final prices with no checks\n\nPERFORMANCE ISSUES:\n- No streaming/validation—okay for small files, but no batching for large inputs\n\nSECURITY CONCERNS:\n- eval on CSV field allows arbitrary code execution\n- No path validation; could read unexpected files\n\nIMPROVEMENTS:\n1. Replace eval with safe parsing and validate numbers\n```python\nimport csv\nfrom decimal import Decimal, InvalidOperation\n\ndef parse_decimal(s):\n try: return Decimal(s)\n except (InvalidOperation, TypeError): return None\n\ndef load_prices(path):\n out=[]\n with open(path, newline=\"\", encoding=\"utf-8\") as f:\n for r in csv.DictReader(f):\n price = parse_decimal(r.get(\"price\"))\n disc = parse_decimal(r.get(\"discount\") or \"0\")\n if price is None or disc is None: continue\n final = max(Decimal(\"0\"), price - disc)\n out.append({\"name\": r.get(\"name\",\"\"), \"final\": str(final)})\n return out\n```</s>"}
- {"text": "<s>[INST] Analyze this code for bugs, performance, and security issues. Give a quality score from 1-100 and provide a detailed analysis. \n\nCode:\n```import asyncio\n\nasync def fetch(i):\n await asyncio.sleep(0.1)\n return i*i\n\nasync def main():\n tasks = []\n for i in range(20):\n tasks.append(fetch(i)) # forgot to create tasks with ensure_future\n results = []\n for t in tasks:\n # not awaited; result is coroutine object\n results.append(t)\n print(results)\n\nif __name__ == \"__main__\":\n loop = asyncio.get_event_loop()\n loop.run_until_complete(main())\n # bad: nested run in same loop (would crash if called again)\n asyncio.run(main())``` [/INST] Quality Score: 26/100\n\nBUGS:\n- Appends coroutine objects, never awaited, leading to no execution\n- Double-running event loop with loop.run_until_complete and then asyncio.run may raise RuntimeError\n\nPERFORMANCE ISSUES:\n- Serial awaits inside a loop would be slow even if fixed; should gather\n\nSECURITY CONCERNS:\n- None specific beyond stability\n\nIMPROVEMENTS:\n1. Use asyncio.gather and avoid double loop management\n```python\nimport asyncio\n\nasync def fetch(i):\n await asyncio.sleep(0.1)\n return i*i\n\nasync def main():\n results = await asyncio.gather(*(fetch(i) for i in range(20)))\n print(results)\n\nif __name__ == \"__main__\":\n asyncio.run(main())\n```</s>"}
- {"text": "<s>[INST] Analyze this code for bugs, performance, and security issues. Give a quality score from 1-100 and provide a detailed analysis. \n\nCode:\n```cache = []\n\ndef read_big_file(path):\n # loads entire file and caches forever -> memory growth\n with open(path, \"rb\") as f:\n data = f.read()\n cache.append(data)\n return len(data)\n\ndef process_many(paths):\n total = 0\n for p in paths:\n total += read_big_file(p)\n return total``` [/INST] Quality Score: 41/100\n\nBUGS:\n- No bounds on cache; memory leak across calls\n- No file existence/error handling\n\nPERFORMANCE ISSUES:\n- Reads entire files into memory; should stream for large files\n- Appending binary blobs prevents GC and increases fragmentation\n\nSECURITY CONCERNS:\n- If paths are user-controlled, could load sensitive files\n\nIMPROVEMENTS:\n1. Stream and limit cache by size or use LRU\n```python\nfrom functools import lru_cache\nimport os\n\n@lru_cache(maxsize=128)\ndef file_size(path):\n return os.path.getsize(path)\n\ndef process_many(paths):\n return sum(file_size(p) for p in paths if os.path.exists(p))\n```</s>"}
- {"text": "<s>[INST] Analyze this code for bugs, performance, and security issues. Give a quality score from 1-100 and provide a detailed analysis. \n\nCode:\n```def transfer(a, b, amount):\n try:\n a['balance'] -= amount\n if a['balance'] < 0:\n raise ValueError(\"insufficient\")\n b['balance'] += amount\n except:\n # swallow all errors\n pass\n\nif __name__ == \"__main__\":\n alice = {\"balance\": 100}\n bob = {\"balance\": 50}\n transfer(alice, bob, 200)\n print(alice, bob)``` [/INST] Quality Score: 38/100\n\nBUGS:\n- Bare except swallows insufficient funds error, leaving inconsistent state (Alice reduced but Bob not credited if error after)\n- No return value indicating success/failure\n\nPERFORMANCE ISSUES:\n- None significant\n\nSECURITY CONCERNS:\n- Silent failures complicate auditing and could hide fraud\n\nIMPROVEMENTS:\n1. Use explicit exception handling and atomicity\n```python\ndef transfer(a, b, amount):\n if amount <= 0:\n raise ValueError(\"amount must be positive\")\n if a[\"balance\"] < amount:\n raise ValueError(\"insufficient funds\")\n a[\"balance\"] -= amount\n b[\"balance\"] += amount\n return True\n```</s>"}
- {"text": "<s>[INST] Analyze this code for bugs, performance, and security issues. Give a quality score from 1-100 and provide a detailed analysis. \n\nCode:\n```import os\n\nDEBUG = True\nSECRET_KEY = \"supersecret\"\nDB_URL = \"postgres://user:pass@localhost/db\"\n\ndef connect():\n # pretend to connect using DB_URL\n print(\"Connecting to\", DB_URL)\n\nif __name__ == \"__main__\":\n if DEBUG:\n print(\"Debug mode on\")\n connect()``` [/INST] Quality Score: 52/100\n\nBUGS:\n- None functional, but poor separation of config from code\n\nPERFORMANCE ISSUES:\n- Printing secrets to console in debug could end up in logs\n\nSECURITY CONCERNS:\n- Hardcoded secrets and credentials in source control\n- Debug mode enabled by default\n\nIMPROVEMENTS:\n1. Move secrets to environment variables and default to secure values\n```python\nimport os\n\nDEBUG = os.getenv(\"APP_DEBUG\",\"0\") == \"1\"\nSECRET_KEY = os.getenv(\"SECRET_KEY\")\nDB_URL = os.getenv(\"DATABASE_URL\")\n\ndef connect():\n if not DB_URL:\n raise RuntimeError(\"DATABASE_URL not set\")\n # connect using DB_URL\n```</s>"}
- {"text": "<s>[INST] Analyze this code for bugs, performance, and security issues. Give a quality score from 1-100 and provide a detailed analysis. \n\nCode:\n```from flask import Flask, request, render_template_string\n\napp = Flask(__name__)\n\n@app.route(\"/greet\")\ndef greet():\n name = request.args.get(\"name\",\"world\")\n # unsafe: marking user input as safe HTML\n tpl = \"<h1>Hello, {{ name|safe }}</h1>\"\n return render_template_string(tpl, name=name)\n\nif __name__ == \"__main__\":\n app.run()``` [/INST] Quality Score: 46/100\n\nBUGS:\n- None functional, but template marks untrusted input as safe\n\nPERFORMANCE ISSUES:\n- Using render_template_string for every request rather than compiled templates\n\nSECURITY CONCERNS:\n- Reflected XSS via name parameter using |safe\n\nIMPROVEMENTS:\n1. Escape user input and use templates\n```python\nfrom flask import Flask, request, render_template_string\n\napp = Flask(__name__)\n\nTPL = \"<h1>Hello, {{ name }}</h1>\"\n\n@app.get(\"/greet\")\ndef greet():\n name = request.args.get(\"name\",\"world\")\n return render_template_string(TPL, name=name)\n```</s>"}
- {"text": "<s>[INST] Analyze this code for bugs, performance, and security issues. Give a quality score from 1-100 and provide a detailed analysis. \n\nCode:\n```import os\n\ndef list_dir(user_input):\n # command injection via shell=True\n cmd = f\"ls -la {user_input}\"\n os.system(cmd)\n\nif __name__ == \"__main__\":\n list_dir(input(\"Path: \"))``` [/INST] Quality Score: 28/100\n\nBUGS:\n- Fails on paths with spaces or special characters\n\nPERFORMANCE ISSUES:\n- Spawns a shell process unnecessarily\n\nSECURITY CONCERNS:\n- Command injection by passing ; rm -rf / or similar\n- No validation of path\n\nIMPROVEMENTS:\n1. Use subprocess without shell and validate input\n```python\nimport subprocess\nfrom pathlib import Path\n\ndef list_dir(path_str):\n p = Path(path_str).expanduser().resolve()\n result = subprocess.run([\"ls\",\"-la\", str(p)], check=True, capture_output=True, text=True)\n return result.stdout\n```</s>"}
- {"text": "<s>[INST] Analyze this code for bugs, performance, and security issues. Give a quality score from 1-100 and provide a detailed analysis. \n\nCode:\n```import os\nfrom flask import Flask, request, send_file\n\napp = Flask(__name__)\n\nBASE = \"/var/data\"\n\n@app.route(\"/download\")\ndef download():\n name = request.args.get(\"name\",\"\")\n path = os.path.join(BASE, name) # no normalization\n if os.path.exists(path):\n return send_file(path)\n return \"Not found\", 404``` [/INST] Quality Score: 43/100\n\nBUGS:\n- No method checks; only GET assumed\n\nPERFORMANCE ISSUES:\n- None significant; the main issue is security\n\nSECURITY CONCERNS:\n- Path traversal via name=../../etc/passwd to escape BASE\n- Possible disclosure of any readable file\n\nIMPROVEMENTS:\n1. Normalize and ensure path stays within BASE\n```python\nfrom pathlib import Path\nfrom flask import Flask, request, send_file, abort\n\napp = Flask(__name__)\nBASE = Path(\"/var/data\").resolve()\n\n@app.get(\"/download\")\ndef download():\n name = request.args.get(\"name\",\"\")\n target = (BASE / name).resolve()\n # is_relative_to (Python 3.9+) avoids the startswith prefix pitfall (e.g. /var/data2)\n if not target.is_relative_to(BASE) or not target.exists():\n abort(404)\n return send_file(target)\n```</s>"}
- {"text": "<s>[INST] Analyze this code for bugs, performance, and security issues. Give a quality score from 1-100 and provide a detailed analysis. \n\nCode:\n```from flask import Flask, request\nimport time, os\n\napp = Flask(__name__)\nUPLOAD_DIR = \"uploads\"\n\n@app.post(\"/upload\")\ndef upload():\n f = request.files[\"file\"]\n path = os.path.join(UPLOAD_DIR, f.filename)\n f.save(path) # save immediately\n # scan after saving (race: window where file is accessible)\n time.sleep(0.2) # simulate scan\n if f.filename.endswith(\".exe\"):\n os.remove(path) # too late; already served by static server?\n return \"Rejected\", 400\n return \"OK\" ``` [/INST] Quality Score: 49/100\n\nBUGS:\n- No check if UPLOAD_DIR exists; may crash\n\nPERFORMANCE ISSUES:\n- Sleep-based scan blocks worker thread\n\nSECURITY CONCERNS:\n- Race condition: file saved and potentially accessible before validation\n- Trusts filename; content-type and magic bytes not checked\n\nIMPROVEMENTS:\n1. Validate first to a quarantine path and atomically move after scanning\n```python\nfrom pathlib import Path\nfrom werkzeug.utils import secure_filename\nimport os, tempfile, shutil\n\nUPLOAD_DIR = Path(\"uploads\"); UPLOAD_DIR.mkdir(exist_ok=True)\n\ndef is_safe(stream):\n head = stream.read(512); stream.seek(0)\n return b\"MZ\" not in head # toy check\n\n@app.post(\"/upload\")\ndef upload():\n f = request.files[\"file\"]\n with tempfile.NamedTemporaryFile(delete=False, dir=UPLOAD_DIR) as tmp:\n f.save(tmp.name)\n with open(tmp.name,\"rb\") as s:\n if not is_safe(s):\n os.remove(tmp.name); return \"Rejected\", 400\n final = UPLOAD_DIR / secure_filename(f.filename)\n shutil.move(tmp.name, final)\n return \"OK\"\n```</s>"}
- {"text": "<s>[INST] Analyze this code for bugs, performance, and security issues. Give a quality score from 1-100 and provide a detailed analysis. \n\nCode:\n```import logging\n\nlogging.basicConfig(level=logging.INFO)\n\ndef login(username, password):\n # logs sensitive info\n logging.info(\"Login attempt user=%s password=%s\", username, password)\n if username == \"admin\" and password == \"admin\":\n return True\n return False``` [/INST] Quality Score: 50/100\n\nBUGS:\n- Hardcoded credentials\n\nPERFORMANCE ISSUES:\n- None significant\n\nSECURITY CONCERNS:\n- Logs plaintext passwords\n- Trivial credentials\n\nIMPROVEMENTS:\n1. Remove sensitive logging and use proper auth\n```python\nimport logging\n\ndef login(username, password):\n logging.info(\"Login attempt for user=%s\", username)\n # verify against hashed password store (omitted)\n return False\n```</s>"}
- {"text": "<s>[INST] Analyze this code for bugs, performance, and security issues. Give a quality score from 1-100 and provide a detailed analysis. \n\nCode:\n```import hashlib, os\n\ndef hash_password(password, salt=b\"fixedsalt\"):\n # weak MD5 and constant salt\n return hashlib.md5(salt + password.encode()).hexdigest()\n\ndef verify(password, digest):\n return hash_password(password) == digest\n\nif __name__ == \"__main__\":\n d = hash_password(\"secret\")\n print(verify(\"secret\", d))``` [/INST] Quality Score: 34/100\n\nBUGS:\n- None functional, but cryptographically unsafe\n\nPERFORMANCE ISSUES:\n- MD5 is fast, making brute force cheap\n\nSECURITY CONCERNS:\n- MD5 is broken for password hashing\n- Fixed salt allows rainbow table reuse\n\nIMPROVEMENTS:\n1. Use a dedicated password KDF like bcrypt/argon2 with per-user salt\n```python\nfrom argon2 import PasswordHasher\nph = PasswordHasher()\n\ndef hash_password(pw): return ph.hash(pw)\ndef verify(pw, digest): \n try: ph.verify(digest, pw); return True\n except Exception: return False\n```</s>"}
- {"text": "<s>[INST] Analyze this code for bugs, performance, and security issues. Give a quality score from 1-100 and provide a detailed analysis. \n\nCode:\n```import jwt\n\nSECRET = \"topsecret\"\n\ndef parse_token(token):\n # disables verification by passing options\n data = jwt.decode(token, options={\"verify_signature\": False})\n return data.get(\"user\")\n\nif __name__ == \"__main__\":\n print(parse_token(input(\"JWT: \")))``` [/INST] Quality Score: 29/100\n\nBUGS:\n- Assumes user claim exists; may raise AttributeError if not dict\n\nPERFORMANCE ISSUES:\n- None notable\n\nSECURITY CONCERNS:\n- Signature verification disabled allows forged tokens\n- No audience/issuer checks\n\nIMPROVEMENTS:\n1. Verify signature and validate claims\n```python\nimport jwt, datetime\nSECRET=\"topsecret\"\nALGO=\"HS256\"\n\ndef parse_token(token):\n data = jwt.decode(token, SECRET, algorithms=[ALGO], options={\"require\":[\"exp\",\"iat\",\"sub\"]}, audience=\"myapi\")\n return data.get(\"sub\")\n```</s>"}
- {"text": "<s>[INST] Analyze this code for bugs, performance, and security issues. Give a quality score from 1-100 and provide a detailed analysis. \n\nCode:\n```import threading, time\n\ncounter = 0\n\ndef worker(n):\n global counter\n for _ in range(n):\n tmp = counter\n time.sleep(0.001) # simulate work\n counter = tmp + 1\n\nthreads = [threading.Thread(target=worker, args=(1000,)) for _ in range(4)]\n[t.start() for t in threads]\n[t.join() for t in threads]\nprint(\"counter =\", counter)``` [/INST] Quality Score: 45/100\n\nBUGS:\n- Data race on counter leading to lost updates; final value < 4000\n\nPERFORMANCE ISSUES:\n- Frequent context switches due to sleep\n\nSECURITY CONCERNS:\n- None\n\nIMPROVEMENTS:\n1. Use a Lock or atomic primitives\n```python\nimport threading\n\ncounter = 0\nlock = threading.Lock()\n\ndef worker(n):\n global counter\n for _ in range(n):\n with lock:\n counter += 1\n```</s>"}
- {"text": "<s>[INST] Analyze this code for bugs, performance, and security issues. Give a quality score from 1-100 and provide a detailed analysis. \n\nCode:\n```import requests\n\ndef get_json(url):\n # disables SSL verification\n r = requests.get(url, verify=False) # insecure\n if r.status_code == 200:\n return r.json()\n return {}\n\nif __name__ == \"__main__\":\n print(get_json(\"https://example.com/api\"))``` [/INST] Quality Score: 53/100\n\nBUGS:\n- No exception handling for JSON decoding\n\nPERFORMANCE ISSUES:\n- No timeout or retries; may hang\n\nSECURITY CONCERNS:\n- verify=False exposes to MITM\n- No certificate pinning for sensitive APIs\n\nIMPROVEMENTS:\n1. Enable TLS verification, add timeouts and basic retry\n```python\nimport requests\n\nSESSION = requests.Session()\nADAPTER = requests.adapters.HTTPAdapter(max_retries=3)\nSESSION.mount(\"https://\", ADAPTER)\n\ndef get_json(url, timeout=5):\n r = SESSION.get(url, timeout=timeout)\n r.raise_for_status()\n return r.json()\n```</s>"}
- {"text": "<s>[INST] Analyze this code for bugs, performance, and security issues. Give a quality score from 1-100 and provide a detailed analysis. \n\nCode:\n```def paginate(items, page, size):\n # off-by-one: page starts at 1 but uses 0-based math incorrectly\n start = page * size\n end = start + size - 1\n return items[start:end]\n\nif __name__ == \"__main__\":\n print(paginate(list(range(10)), 1, 3))``` [/INST] Quality Score: 58/100\n\nBUGS:\n- Off-by-one errors: page=1 should start at index 0; end slicing exclusive misused\n- Negative or zero page/size not validated\n\nPERFORMANCE ISSUES:\n- None\n\nSECURITY CONCERNS:\n- None\n\nIMPROVEMENTS:\n1. Correct indices and validate inputs\n```python\ndef paginate(items, page, size):\n if page < 1 or size < 1: return []\n start = (page - 1) * size\n end = start + size\n return items[start:end]\n```</s>"}
- {"text": "<s>[INST] Analyze this code for bugs, performance, and security issues. Give a quality score from 1-100 and provide a detailed analysis. \n\nCode:\n```def most_frequent(nums):\n # O(n^2) counting\n best = None\n bestc = -1\n for x in nums:\n c = 0\n for y in nums:\n if x == y:\n c += 1\n if c > bestc:\n bestc = c\n best = x\n return best``` [/INST] Quality Score: 62/100\n\nBUGS:\n- Works but ties are arbitrary; behavior unspecified\n\nPERFORMANCE ISSUES:\n- Quadratic time; slow for large lists\n\nSECURITY CONCERNS:\n- None\n\nIMPROVEMENTS:\n1. Use a hashmap for linear time\n```python\nfrom collections import Counter\ndef most_frequent(nums):\n return Counter(nums).most_common(1)[0][0] if nums else None\n```</s>"}
- {"text": "<s>[INST] Analyze this code for bugs, performance, and security issues. Give a quality score from 1-100 and provide a detailed analysis. \n\nCode:\n```import asyncio\nimport websockets\n\nclients = set()\n\nasync def handler(ws):\n clients.add(ws) # never removed on disconnect if error\n async for msg in ws:\n for c in clients:\n await c.send(msg) # exceptions not handled\n # missing finally to remove ws\n\nasync def main():\n async with websockets.serve(handler, \"0.0.0.0\", 8765):\n await asyncio.Future()\n\nasyncio.run(main())``` [/INST] Quality Score: 47/100\n\nBUGS:\n- Clients not removed on errors; memory leak\n- Broadcast loop fails if one client breaks; others starve\n\nPERFORMANCE ISSUES:\n- Serial awaits per client; slow with many clients\n\nSECURITY CONCERNS:\n- No authentication or origin checks\n\nIMPROVEMENTS:\n1. Remove clients in finally and handle send errors\n```python\nasync def handler(ws):\n clients.add(ws)\n try:\n async for msg in ws:\n dead=set()\n for c in clients:\n try: await c.send(msg)\n except Exception: dead.add(c)\n clients.difference_update(dead)\n finally:\n clients.discard(ws)\n```</s>"}
- {"text": "<s>[INST] Analyze this code for bugs, performance, and security issues. Give a quality score from 1-100 and provide a detailed analysis. \n\nCode:\n```from datetime import datetime\nimport time\n\ndef log_event(ts_str):\n # parse naive datetime string and compare to now\n event = datetime.strptime(ts_str, \"%Y-%m-%d %H:%M:%S\")\n now = datetime.now()\n diff = (now - event).total_seconds()\n print(\"Seconds ago:\", diff)\n\nif __name__ == \"__main__\":\n log_event(\"2025-10-09 12:00:00\")``` [/INST] Quality Score: 60/100\n\nBUGS:\n- Treats naive datetimes as local time; mismatches if input intended as UTC\n\nPERFORMANCE ISSUES:\n- None\n\nSECURITY CONCERNS:\n- None\n\nIMPROVEMENTS:\n1. Use timezone-aware datetimes and explicit TZ\n```python\nfrom datetime import datetime, timezone\n\ndef log_event(ts_str):\n event = datetime.strptime(ts_str, \"%Y-%m-%d %H:%M:%S\").replace(tzinfo=timezone.utc)\n now = datetime.now(timezone.utc)\n print(\"Seconds ago:\", (now - event).total_seconds())\n```</s>"}
- {"text": "<s>[INST] Analyze this code for bugs, performance, and security issues. Give a quality score from 1-100 and provide a detailed analysis. \n\nCode:\n```def is_equal(a, b):\n # exact float equality\n return a == b\n\nif __name__ == \"__main__\":\n print(is_equal(0.1+0.2, 0.3))``` [/INST] Quality Score: 70/100\n\nBUGS:\n- Floating point equality leads to false negatives due to precision\n\nPERFORMANCE ISSUES:\n- None\n\nSECURITY CONCERNS:\n- None\n\nIMPROVEMENTS:\n1. Use tolerance-based comparison\n```python\nimport math\ndef is_equal(a, b, eps=1e-9):\n return math.isclose(a, b, rel_tol=1e-9, abs_tol=eps)\n```</s>"}
- {"text": "<s>[INST] Analyze this code for bugs, performance, and security issues. Give a quality score from 1-100 and provide a detailed analysis. \n\nCode:\n```def add_item(item, items=[]):\n # default mutable arg: shared list across calls\n items.append(item)\n return items\n\nif __name__ == \"__main__\":\n print(add_item(1))\n print(add_item(2))``` [/INST] Quality Score: 63/100\n\nBUGS:\n- Default list persists across invocations, causing unintended accumulation\n\nPERFORMANCE ISSUES:\n- None\n\nSECURITY CONCERNS:\n- None\n\nIMPROVEMENTS:\n1. Default to None and create per-call list\n```python\ndef add_item(item, items=None):\n if items is None: items = []\n items.append(item)\n return items\n```</s>"}
- {"text": "<s>[INST] Analyze this code for bugs, performance, and security issues. Give a quality score from 1-100 and provide a detailed analysis. \n\nCode:\n```import pickle\nfrom flask import Flask, request\n\napp = Flask(__name__)\n\n@app.post(\"/load\")\ndef load():\n data = request.data # untrusted\n obj = pickle.loads(data) # RCE risk\n return {\"type\": str(type(obj))}\n\nif __name__ == \"__main__\":\n app.run()``` [/INST] Quality Score: 22/100\n\nBUGS:\n- No error handling for malformed payloads\n\nPERFORMANCE ISSUES:\n- Could block on large payloads; no size limits\n\nSECURITY CONCERNS:\n- Insecure deserialization: pickle.loads on untrusted data enables arbitrary code execution\n\nIMPROVEMENTS:\n1. Use safe serialization (JSON) and cap payload size\n```python\nfrom flask import Flask, request, jsonify\n\n@app.post(\"/load\")\ndef load():\n if request.content_length and request.content_length > 1_000_000:\n return jsonify({\"error\":\"payload too large\"}), 413\n try:\n obj = request.get_json(force=True)\n return jsonify({\"keys\": list(obj.keys()) if isinstance(obj, dict) else None})\n except Exception:\n return jsonify({\"error\":\"invalid json\"}), 400\n```</s>"}
- {"text": "<s>[INST] Analyze this code for bugs, performance, and security issues. Give a quality score from 1-100 and provide a detailed analysis. \n\nCode:\n```import json, time\nfrom http.server import BaseHTTPRequestHandler, HTTPServer\n\nclass H(BaseHTTPRequestHandler):\n sessions = {} # in-memory sessions; never expire\n\n def do_POST(self):\n if self.path == \"/login\":\n ln = int(self.headers.get(\"Content-Length\", \"0\"))\n body = self.rfile.read(ln).decode()\n data = json.loads(body)\n if data.get(\"user\") and data.get(\"pass\"):\n token = data[\"user\"] + str(time.time())\n H.sessions[token] = data[\"user\"]\n self.send_response(200); self.end_headers()\n self.wfile.write(token.encode())\n else:\n self.send_response(400); self.end_headers()\n\nserver = HTTPServer((\"0.0.0.0\", 8080), H)\nserver.serve_forever()``` [/INST] Quality Score: 40/100\n\nBUGS:\n- No content-type checks; JSON errors unhandled\n- Reusing class variable sessions without thread safety (server may be multithreaded depending on mixin)\n\nPERFORMANCE ISSUES:\n- Sessions grow unbounded; memory leak\n\nSECURITY CONCERNS:\n- Session token is predictable (username + timestamp)\n- No HTTPS enforcement, no SameSite/HttpOnly cookies, tokens sent in body/plain\n- No password verification\n\nIMPROVEMENTS:\n1. Use secure random tokens with expiry and proper error handling\n```python\nimport os, json, secrets, time\nfrom http.server import BaseHTTPRequestHandler, HTTPServer\n\nclass H(BaseHTTPRequestHandler):\n sessions = {}\n def do_POST(self):\n try:\n data = json.loads(self.rfile.read(int(self.headers.get(\"Content-Length\",\"0\"))))\n except Exception:\n self.send_response(400); self.end_headers(); return\n if self.path == \"/login\":\n if not data.get(\"user\") or not data.get(\"pass\"):\n self.send_response(400); self.end_headers(); return\n token = secrets.token_urlsafe(32)\n H.sessions[token] = {\"u\": data[\"user\"], \"exp\": time.time()+3600}\n self.send_response(200); self.end_headers(); self.wfile.write(token.encode())\n```</s>"}
- {"text": "<s>[INST] Analyze this code for bugs, performance, and security issues. Give a quality score from 1-100 and provide a detailed analysis. \n\nCode:\n```import re\n\ndef parse_email(s):\n # overly permissive regex; accepts invalid and allows catastrophic backtracking\n pattern = r\"(.+)+@(.+)+\\.(.+)+\"\n return bool(re.match(pattern, s))\n\nif __name__ == \"__main__\":\n print(parse_email(\"a\"*10000 + \"@x.com\"))``` [/INST] Quality Score: 56/100\n\nBUGS:\n- Regex can exhibit catastrophic backtracking on long strings, causing CPU spikes\n- Accepts many invalid emails; false positives\n\nPERFORMANCE ISSUES:\n- Regex complexity leads to potential DoS\n\nSECURITY CONCERNS:\n- Potential ReDoS if user-controlled input is validated using this pattern\n\nIMPROVEMENTS:\n1. Use anchored, simpler regex or email.utils\n```python\nimport re\nEMAIL_RE = re.compile(r\"^[^@\\s]+@[^@\\s]+\\.[^@\\s]+$\")\ndef parse_email(s): return bool(EMAIL_RE.match(s))\n```</s>"}
31
- {"text": "<s>[INST] Analyze this code for bugs, performance, and security issues. Give a quality score from 1-100 and provide a detailed analysis. \n\nCode:\n```import json, time\nimport requests\n\ndef poll_job(api_url, job_id):\n # tight loop without backoff; no timeout\n while True:\n r = requests.get(f\"{api_url}/jobs/{job_id}\")\n if r.status_code == 200 and r.json().get(\"status\") == \"done\":\n return r.json()\n time.sleep(0.1)\n\nif __name__ == \"__main__\":\n print(poll_job(\"https://example.com/api\", \"123\"))``` [/INST] Quality Score: 57/100\n\nBUGS:\n- No exit on error or unknown statuses; potential infinite loop\n\nPERFORMANCE ISSUES:\n- Aggressive polling (100 ms) can hammer the API\n- No timeout or retry policy\n\nSECURITY CONCERNS:\n- No TLS verification customization; default OK but missing auth headers\n\nIMPROVEMENTS:\n1. Add timeouts, backoff, and max wait\n```python\nimport time, requests\n\ndef poll_job(api_url, job_id, timeout=30):\n end = time.time()+timeout\n delay = 0.5\n while time.time() < end:\n r = requests.get(f\"{api_url}/jobs/{job_id}\", timeout=5)\n r.raise_for_status()\n st = r.json().get(\"status\")\n if st == \"done\": return r.json()\n if st in {\"failed\",\"canceled\"}: raise RuntimeError(st)\n time.sleep(delay); delay = min(5.0, delay*1.5)\n raise TimeoutError(\"job not done\")\n```</s>"}
32
- {"text": "<s>[INST] Analyze this code for bugs, performance, and security issues. Give a quality score from 1-100 and provide a detailed analysis. \n\nCode:\n```import json\nfrom flask import Flask, request\n\napp = Flask(__name__)\n\n@app.post(\"/api/items\")\ndef items():\n data = request.json # may be None\n # expects 'qty' int but trusts client\n total = data[\"price\"] * data[\"qty\"] # TypeError if string\n return {\"total\": total}\n\nif __name__ == \"__main__\":\n app.run()``` [/INST] Quality Score: 59/100\n\nBUGS:\n- Assumes JSON body exists; KeyError/TypeError on missing/invalid fields\n- No type casting; arithmetic on strings raises errors\n\nPERFORMANCE ISSUES:\n- None significant\n\nSECURITY CONCERNS:\n- Trusting client-calculated fields could lead to abuse in real apps\n\nIMPROVEMENTS:\n1. Validate and coerce input with defaults\n```python\nfrom flask import Flask, request, jsonify\n\n@app.post(\"/api/items\")\ndef items():\n data = request.get_json(silent=True) or {}\n try:\n price = float(data.get(\"price\", 0))\n qty = int(data.get(\"qty\", 0))\n except ValueError:\n return jsonify({\"error\":\"invalid types\"}), 400\n return jsonify({\"total\": price * qty})\n```</s>"}
33
- {"text": "<s>[INST] Analyze this code for bugs, performance, and security issues. Give a quality score from 1-100 and provide a detailed analysis. \n\nCode:\n```import threading, json\nfrom http.server import HTTPServer, BaseHTTPRequestHandler\n\ndb = {\"hits\": 0, \"items\": []}\n\nclass H(BaseHTTPRequestHandler):\n def do_POST(self):\n ln = int(self.headers.get(\"Content-Length\",\"0\"))\n body = self.rfile.read(ln)\n item = json.loads(body).get(\"item\")\n # race: no lock protecting db\n db[\"items\"].append(item)\n db[\"hits\"] += 1\n self.send_response(201); self.end_headers()\n\nsrv = HTTPServer((\"0.0.0.0\", 8000), H)\nsrv.serve_forever()``` [/INST] Quality Score: 46/100\n\nBUGS:\n- No error handling for invalid JSON/body\n- Global mutable dict accessed from multiple threads without synchronization (HTTPServer can be threaded in variations)\n\nPERFORMANCE ISSUES:\n- In-memory store grows unbounded\n\nSECURITY CONCERNS:\n- No authentication and accepts arbitrary data\n\nIMPROVEMENTS:\n1. Add locks and validation, or use a thread-safe queue/datastore\n```python\nimport threading\nlock = threading.Lock()\ndef add_item(item):\n with lock:\n db[\"items\"].append(item)\n db[\"hits\"] += 1\n```</s>"}
34
- {"text": "<s>[INST] Analyze this code for bugs, performance, and security issues. Give a quality score from 1-100 and provide a detailed analysis. \n\nCode:\n```import yaml\n\ndef load_conf(path):\n with open(path) as f:\n return yaml.load(f, Loader=yaml.Loader) # unsafe loader\n\nif __name__ == \"__main__\":\n print(load_conf(\"config.yml\"))``` [/INST] Quality Score: 55/100\n\nBUGS:\n- Crashes on missing file; no handling\n\nPERFORMANCE ISSUES:\n- None\n\nSECURITY CONCERNS:\n- Unsafe YAML loader can construct arbitrary objects (code execution vectors)\n\nIMPROVEMENTS:\n1. Use safe_load and validate schema\n```python\nimport yaml, jsonschema\n\ndef load_conf(path):\n with open(path, encoding=\"utf-8\") as f:\n data = yaml.safe_load(f)\n # validate with schema (omitted)\n return data\n```</s>"}
analyst_dataset_expanded.jsonl DELETED
The diff for this file is too large to render. See raw diff
 
analyzer/code_analyzer.py CHANGED
@@ -1,62 +1,171 @@
1
- from typing import Dict, Any, List, Optional
2
  import time
3
  import requests
4
  import os
5
  import re
 
 
 
 
6
  from .llm_clients import LLMClientManager, LLMResponse
7
  from .prompts import get_code_analysis_prompt, get_comparison_prompt, get_github_analysis_prompt
8
  from .utils import detect_language, parse_analysis_result
9
 
 
 
10
  class CodeAnalyzer:
11
- """Main code analysis engine with GitHub integration."""
12
 
13
- def __init__(self):
 
14
  self.llm_manager = LLMClientManager()
15
  self.available_models = self.llm_manager.get_available_models()
16
 
17
- def analyze_code(self, code: str, model: str, language: Optional[str] = None) -> Dict[str, Any]:
18
- """Analyze code using a specific model with focused output."""
19
- start_time = time.time()
 
 
20
 
21
- # Detect language if not provided
22
- if not language:
23
- language = detect_language(code)
24
 
25
- # Generate focused prompt
26
- prompt = get_code_analysis_prompt(code, language)
 
 
 
 
 
 
 
 
 
 
 
 
27
 
28
- # Query LLM
29
  response = self.llm_manager.query(model, prompt)
 
30
 
31
- # Process response
32
  if response.success:
33
- analysis = parse_analysis_result(response.content)
34
- analysis['raw_response'] = response.content
35
- else:
36
- analysis = {
37
- 'error': response.error,
38
- 'quality_score': 0,
39
- 'summary': f"Analysis failed: {response.error}",
40
- 'bugs': [],
41
- 'quality_issues': [],
42
- 'security_vulnerabilities': [],
43
- 'quick_fixes': [],
44
- # Legacy fields
45
- 'strengths': [],
46
- 'issues': [],
47
- 'suggestions': [],
48
- 'security_concerns': [],
49
- 'performance_notes': []
 
 
50
  }
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
51
 
52
- # Add metadata
53
- analysis['model'] = response.model
54
- analysis['language'] = language
55
- analysis['execution_time'] = round(time.time() - start_time, 2)
56
- analysis['code_length'] = len(code)
57
- analysis['line_count'] = len(code.splitlines())
58
-
59
- return analysis
 
60
 
61
  def analyze_github_repo(self, repo_url: str, model: str = None) -> Dict[str, Any]:
62
  """Analyze a GitHub repository."""
@@ -226,14 +335,20 @@ class CodeAnalyzer:
226
  'project_overview': '',
227
  'architecture_quality': [],
228
  'critical_issues': [],
229
- 'improvement_priorities': []
 
 
 
230
  }
231
 
232
  sections = {
233
  'project_overview': r'(?:PROJECT_OVERVIEW|project\s+overview)[:\s]*(.+?)(?=\n\s*(?:\d+\.|[A-Z_]+:)|$)',
234
  'architecture_quality': r'(?:ARCHITECTURE_QUALITY|architecture|structure)[:\s]*(.+?)(?=\n\s*(?:\d+\.|[A-Z_]+:)|$)',
235
  'critical_issues': r'(?:CRITICAL_ISSUES|critical|major\s+issue)[:\s]*(.+?)(?=\n\s*(?:\d+\.|[A-Z_]+:)|$)',
236
- 'improvement_priorities': r'(?:IMPROVEMENT_PRIORITIES|improvement|priorit)[:\s]*(.+?)(?=\n\s*(?:\d+\.|[A-Z_]+:)|$)'
 
 
 
237
  }
238
 
239
  for key, pattern in sections.items():
 
1
+ from typing import Dict, Any, List, Optional, Generator, Literal
2
  import time
3
  import requests
4
  import os
5
  import re
6
+ import hashlib
7
+ import json
8
+ from tqdm import tqdm
9
+
10
  from .llm_clients import LLMClientManager, LLMResponse
11
  from .prompts import get_code_analysis_prompt, get_comparison_prompt, get_github_analysis_prompt
12
  from .utils import detect_language, parse_analysis_result
13
 
14
+ ModelType = Literal["codet5", "deepseek-finetuned", "deepseek-finetuned-remote"]
15
+
16
  class CodeAnalyzer:
17
+ """Main code analysis engine with support for APIs, local models, and GitHub integration."""
18
 
19
+ def __init__(self, cache_dir: str = None, precision: str = "fp16"):
20
+ # API-based models
21
  self.llm_manager = LLMClientManager()
22
  self.available_models = self.llm_manager.get_available_models()
23
 
24
+ # Local/Remote Hugging Face models
25
+ self.model_type: Optional[ModelType] = None
26
+ self.model_id: Optional[str] = None
27
+ self.adapter_path: Optional[str] = None
28
+ self.remote_api_url: Optional[str] = None
29
 
30
+ self.cache_dir = cache_dir
31
+ self.precision = precision.lower().strip()
32
+ self.cache = {}
33
 
34
+ if cache_dir is not None:
35
+ os.makedirs(cache_dir, exist_ok=True)
36
+ self._load_cache()
37
+
38
+ def _get_cache_key(self, code: str) -> str:
39
+ """Generate a unique cache key for a piece of code and model type."""
40
+ combined = f"{self.model_type}:{self.model_id}:{code}"
41
+ return hashlib.md5(combined.encode()).hexdigest()
42
+
43
+ def _load_cache(self):
44
+ """Load analysis cache from disk if available."""
45
+ if self.cache_dir is None:
46
+ self.cache = {}
47
+ return
48
 
49
+ cache_file = os.path.join(self.cache_dir, "analysis_cache.json")
50
+ if os.path.exists(cache_file):
51
+ try:
52
+ with open(cache_file, 'r') as f:
53
+ self.cache = json.load(f)
54
+ print(f"📁 Loaded {len(self.cache)} cached analyses")
55
+ except (json.JSONDecodeError, IOError):
56
+ self.cache = {}
57
+
58
+ def _save_cache(self):
59
+ """Save the analysis cache to disk."""
60
+ if self.cache_dir is None:
61
+ return
62
+
63
+ cache_file = os.path.join(self.cache_dir, "analysis_cache.json")
64
+ with open(cache_file, 'w') as f:
65
+ json.dump(self.cache, f)
66
+
67
+ def _check_cache(self, code: str) -> Optional[Dict[str, Any]]:
68
+ """Check if an analysis for the given code is in the cache."""
69
+ cache_key = self._get_cache_key(code)
70
+ return self.cache.get(cache_key)
71
+
72
+ def _save_to_cache(self, code: str, result: Dict[str, Any]):
73
+ """Save an analysis result to the cache."""
74
+ cache_key = self._get_cache_key(code)
75
+ self.cache[cache_key] = result
76
+ self._save_cache()
77
+
78
+ def analyze_code(
79
+ self,
80
+ code: str,
81
+ model: str,
82
+ language: Optional[str] = None,
83
+ max_tokens: int = 1024,
84
+ ) -> Dict[str, Any]:
85
+ """
86
+ Analyze code using a specified LLM provider.
87
+ This is a non-streaming, direct analysis method.
88
+ Language detection is now handled by AI for accuracy.
89
+ """
90
+ # Skip local language detection - let AI handle it for accuracy
91
+ if language is None:
92
+ language = "auto-detect" # Let AI detect it
93
+
94
+ prompt = get_code_analysis_prompt(code, language, model)
95
+
96
+ start_time = time.time()
97
  response = self.llm_manager.query(model, prompt)
98
+ total_time = time.time() - start_time
99
 
 
100
  if response.success:
101
+ structured_data = parse_analysis_result(response.content, model)
102
+
103
+ # Use AI-detected language if available, otherwise fallback to auto
104
+ detected_lang = structured_data.get('detected_language')
105
+ if detected_lang:
106
+ language = detected_lang.upper()
107
+ else:
108
+ # Fallback to LLM-based detection if not in response
109
+ language = detect_language(code).upper()
110
+
111
+ result = {
112
+ "raw_response": response.content,
113
+ "quality_score": structured_data.get('quality_score', 0),
114
+ "execution_time": total_time,
115
+ "model": response.model,
116
+ "cached": False,
117
+ **structured_data,
118
+ "language": language,
119
+ "line_count": len(code.splitlines()),
120
  }
121
+ else:
122
+ result = {'error': response.error}
123
+
124
+ return result
125
+
126
+ def analyze_code_remote(self, code: str, max_tokens: int = 300) -> Dict[str, Any]: # Increased token limit
127
+ """Analyze code using a remote Hugging Face Space API."""
128
+ if not self.remote_api_url:
129
+ return {'error': 'Remote API URL is not configured.'}
130
+
131
+ cached_result = self._check_cache(code)
132
+ if cached_result:
133
+ cached_result["cached"] = True
134
+ return cached_result
135
+
136
+ start_time = time.time()
137
 
138
+ try:
139
+ # First, try FastAPI endpoint /analyze
140
+ response = requests.post(
141
+ f"{self.remote_api_url}/analyze",
142
+ json={"code": code, "max_tokens": max_tokens},
143
+ timeout=60
144
+ )
145
+ response.raise_for_status()
146
+ data = response.json()
147
+
148
+ # Assuming the remote API returns a structured response
149
+ total_time = time.time() - start_time
150
+ result = {
151
+ "raw_response": data.get("analysis", str(data)),
152
+ "quality_score": data.get("quality_score", 0),
153
+ "execution_time": total_time,
154
+ "model": data.get("model", "remote-deepseek"),
155
+ "cached": False,
156
+ "bugs": data.get("bugs", []),
157
+ "security_vulnerabilities": data.get("security_vulnerabilities", []),
158
+ "quality_issues": data.get("quality_issues", []),
159
+ "quick_fixes": data.get("quick_fixes", []),
160
+ "language": data.get("language", detect_language(code)),
161
+ "line_count": data.get("line_count", len(code.splitlines())),
162
+ }
163
+ self._save_to_cache(code, result)
164
+ return result
165
+
166
+ except requests.exceptions.RequestException as e:
167
+ # Fallback for Gradio or other errors
168
+ return {'error': f"Remote analysis failed: {e}"}
169
 
170
  def analyze_github_repo(self, repo_url: str, model: str = None) -> Dict[str, Any]:
171
  """Analyze a GitHub repository."""
 
335
  'project_overview': '',
336
  'architecture_quality': [],
337
  'critical_issues': [],
338
+ 'improvement_priorities': [],
339
+ 'onboarding_guide': [],
340
+ 'tech_stack_rationale': [],
341
+ 'api_endpoint_summary': [],
342
  }
343
 
344
  sections = {
345
  'project_overview': r'(?:PROJECT_OVERVIEW|project\s+overview)[:\s]*(.+?)(?=\n\s*(?:\d+\.|[A-Z_]+:)|$)',
346
  'architecture_quality': r'(?:ARCHITECTURE_QUALITY|architecture|structure)[:\s]*(.+?)(?=\n\s*(?:\d+\.|[A-Z_]+:)|$)',
347
  'critical_issues': r'(?:CRITICAL_ISSUES|critical|major\s+issue)[:\s]*(.+?)(?=\n\s*(?:\d+\.|[A-Z_]+:)|$)',
348
+ 'improvement_priorities': r'(?:IMPROVEMENT_PRIORITIES|improvement|priorit)[:\s]*(.+?)(?=\n\s*(?:\d+\.|[A-Z_]+:)|$)',
349
+ 'onboarding_guide': r'(?:ONBOARDING_GUIDE|onboarding|setup)[:\s]*(.+?)(?=\n\s*(?:\d+\.|[A-Z_]+:)|$)',
350
+ 'tech_stack_rationale': r'(?:TECH_STACK_RATIONALE|tech\s+stack|stack\s+rationale)[:\s]*(.+?)(?=\n\s*(?:\d+\.|[A-Z_]+:)|$)',
351
+ 'api_endpoint_summary': r'(?:API_ENDPOINT_SUMMARY|api\s+endpoint|endpoints)[:\s]*(.+?)(?=\n\s*(?:\d+\.|[A-Z_]+:)|$)',
352
  }
353
 
354
  for key, pattern in sections.items():
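The caching layer added to `CodeAnalyzer` above (an md5 key over model type, model id, and code; a JSON file rewritten on each save) can be exercised in isolation. A minimal standalone sketch — hypothetical names, not the project's actual class, just the same keying and persistence scheme:

```python
import hashlib
import json
import os

def cache_key(model_type, model_id, code):
    # Same keying idea as _get_cache_key: model type, model id, and code, md5-hashed
    combined = f"{model_type}:{model_id}:{code}"
    return hashlib.md5(combined.encode()).hexdigest()

class AnalysisCache:
    """Hypothetical standalone version of the JSON-file analysis cache."""

    def __init__(self, cache_dir):
        os.makedirs(cache_dir, exist_ok=True)
        self.path = os.path.join(cache_dir, "analysis_cache.json")
        try:
            with open(self.path) as f:
                self.data = json.load(f)
        except (FileNotFoundError, json.JSONDecodeError):
            self.data = {}

    def get(self, code, model_type="deepseek-finetuned", model_id="demo"):
        return self.data.get(cache_key(model_type, model_id, code))

    def put(self, code, result, model_type="deepseek-finetuned", model_id="demo"):
        self.data[cache_key(model_type, model_id, code)] = result
        # As in the diff, the whole cache file is rewritten on every save
        with open(self.path, "w") as f:
            json.dump(self.data, f)
```

Note that, as in the diff, every save rewrites the entire JSON file — fine for small caches, but O(n) per write as the cache grows.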
analyzer/llm_clients.py CHANGED
@@ -4,8 +4,10 @@ from dataclasses import dataclass
4
  import openai
5
  import anthropic
6
  from dotenv import load_dotenv
 
7
 
8
- load_dotenv()
 
9
 
10
  @dataclass
11
  class LLMResponse:
@@ -23,30 +25,75 @@ class LLMClientManager:
23
 
24
  def _initialize_clients(self):
25
  """Initialize available LLM clients based on API keys."""
 
 
 
26
  # OpenAI
27
- if os.getenv("OPENAI_API_KEY"):
28
- self.clients["openai"] = openai.OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
 
 
 
 
29
 
30
  # Anthropic
31
- if os.getenv("ANTHROPIC_API_KEY"):
32
- self.clients["anthropic"] = anthropic.Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))
 
 
 
 
33
 
34
  # DeepSeek (uses OpenAI-compatible API)
35
- if os.getenv("DEEPSEEK_API_KEY"):
 
 
36
  self.clients["deepseek"] = openai.OpenAI(
37
- api_key=os.getenv("DEEPSEEK_API_KEY"),
38
  base_url="https://api.deepseek.com/v1"
39
  )
40
 
41
  def get_available_models(self) -> Dict[str, str]:
42
  """Return available models with display names."""
43
  models = {}
44
  if "openai" in self.clients:
45
- models["openai"] = "OpenAI GPT-4"
46
  if "anthropic" in self.clients:
47
- models["anthropic"] = "Claude 3"
48
  if "deepseek" in self.clients:
49
- models["deepseek"] = "DeepSeek Coder"
 
 
 
 
50
  return models
51
 
52
  def query(self, model: str, prompt: str, temperature: float = 0.1) -> LLMResponse:
@@ -60,55 +107,169 @@ class LLMClientManager:
60
  )
61
  return LLMResponse(
62
  content=response.choices[0].message.content,
63
- model="OpenAI GPT-4",
64
  success=True
65
  )
66
 
67
  elif model == "anthropic" and "anthropic" in self.clients:
68
  response = self.clients["anthropic"].messages.create(
69
- model="claude-3-haiku-20240307",
70
  messages=[{"role": "user", "content": prompt}],
71
  max_tokens=2000,
72
  temperature=temperature
73
  )
74
  return LLMResponse(
75
  content=response.content[0].text,
76
- model="Claude 3 Haiku",
77
  success=True
78
  )
79
 
80
  elif model == "deepseek" and "deepseek" in self.clients:
81
  try:
82
  response = self.clients["deepseek"].chat.completions.create(
83
- model="deepseek-coder",
84
  messages=[{"role": "user", "content": prompt}],
85
  temperature=temperature
86
  )
87
  return LLMResponse(
88
  content=response.choices[0].message.content,
89
- model="DeepSeek Coder",
90
  success=True
91
  )
92
  except Exception as deepseek_error:
93
  # Try with alternative model name if the first one fails
94
  try:
95
  response = self.clients["deepseek"].chat.completions.create(
96
- model="deepseek-chat",
97
  messages=[{"role": "user", "content": prompt}],
98
  temperature=temperature
99
  )
100
  return LLMResponse(
101
  content=response.choices[0].message.content,
102
- model="DeepSeek Coder",
103
  success=True
104
  )
105
  except Exception as second_error:
106
  return LLMResponse(
107
  content="",
108
- model="DeepSeek Coder",
109
  success=False,
110
  error=f"DeepSeek API Error: {str(deepseek_error)}. Also tried alternative model: {str(second_error)}"
111
  )
 
112
 
113
  else:
114
  return LLMResponse(
 
4
  import openai
5
  import anthropic
6
  from dotenv import load_dotenv
7
+ from huggingface_hub import InferenceClient
8
 
9
+ # Force reload environment variables
10
+ load_dotenv(override=True)
11
 
12
  @dataclass
13
  class LLMResponse:
 
25
 
26
  def _initialize_clients(self):
27
  """Initialize available LLM clients based on API keys."""
28
+ # Debug: Print available API keys
29
+ print("🔍 Initializing LLM clients...")
30
+
31
  # OpenAI
32
+ openai_key = os.getenv("OPENAI_API_KEY")
33
+ if openai_key:
34
+ print(f"✅ OpenAI API key found: {openai_key[:8]}...{openai_key[-4:]}")
35
+ self.clients["openai"] = openai.OpenAI(api_key=openai_key)
36
+ else:
37
+ print("❌ OpenAI API key not found")
38
 
39
  # Anthropic
40
+ anthropic_key = os.getenv("ANTHROPIC_API_KEY")
41
+ if anthropic_key:
42
+ print(f"✅ Anthropic API key found: {anthropic_key[:8]}...{anthropic_key[-4:]}")
43
+ self.clients["anthropic"] = anthropic.Anthropic(api_key=anthropic_key)
44
+ else:
45
+ print("❌ Anthropic API key not found")
46
 
47
  # DeepSeek (uses OpenAI-compatible API)
48
+ deepseek_key = os.getenv("DEEPSEEK_API_KEY")
49
+ if deepseek_key:
50
+ print(f"✅ DeepSeek API key found: {deepseek_key[:8]}...{deepseek_key[-4:]}")
51
  self.clients["deepseek"] = openai.OpenAI(
52
+ api_key=deepseek_key,
53
  base_url="https://api.deepseek.com/v1"
54
  )
55
+ else:
56
+ print("❌ DeepSeek API key not found")
57
+
58
+ # Mercury API (OpenAI-compatible via Inception Labs)
59
+ # Support both MERCURY_API_KEY and INCEPTION_API_KEY
60
+ mercury_key = os.getenv("MERCURY_API_KEY") or os.getenv("INCEPTION_API_KEY")
61
+ if mercury_key:
62
+ print(f"✅ Mercury API key found: {mercury_key[:8]}...{mercury_key[-4:]}")
63
+ try:
64
+ # Prefer explicit base URL envs; default to Inception Labs documented endpoint
65
+ base_url = (
66
+ os.getenv("MERCURY_BASE_URL")
67
+ or os.getenv("INCEPTION_BASE_URL")
68
+ or "https://api.inceptionlabs.ai/v1"
69
+ )
70
+ self.clients["mercury"] = openai.OpenAI(api_key=mercury_key, base_url=base_url)
71
+ print("✅ Mercury client initialized successfully")
72
+ except Exception as e:
73
+ print(f"⚠️ Mercury client initialization failed: {e}")
74
+ # Still add to clients so it appears in UI, but will show error when used
75
+ self.clients["mercury"] = None
76
+ else:
77
+ print("❌ Mercury API key not found")
78
+
79
+ # Check for Hugging Face API key with multiple possible names
80
+ hf_token = os.getenv("HUGGINGFACE_API_KEY") or os.getenv("HUGGINGFACEHUB_API_TOKEN") or os.getenv("HF_TOKEN")
81
+ if hf_token:
82
+ self.clients["huggingface"] = InferenceClient(token=hf_token)
83
 
84
  def get_available_models(self) -> Dict[str, str]:
85
  """Return available models with display names."""
86
  models = {}
87
  if "openai" in self.clients:
88
+ models["openai"] = "OpenAI GPT-4o-mini"
89
  if "anthropic" in self.clients:
90
+ models["anthropic"] = "Claude 4.5 Haiku"
91
  if "deepseek" in self.clients:
92
+ models["deepseek"] = "DeepSeek Coder V2"
93
+ if "mercury" in self.clients:
94
+ models["mercury"] = "Mercury Fast LLM"
95
+ if "huggingface" in self.clients:
96
+ models["huggingface"] = "Hugging Face (Mixtral)"
97
  return models
98
 
99
  def query(self, model: str, prompt: str, temperature: float = 0.1) -> LLMResponse:
 
107
  )
108
  return LLMResponse(
109
  content=response.choices[0].message.content,
110
+ model="OpenAI GPT-4o-mini",
111
  success=True
112
  )
113
 
114
  elif model == "anthropic" and "anthropic" in self.clients:
115
  response = self.clients["anthropic"].messages.create(
116
+ model="claude-3-5-haiku-20241022",
117
  messages=[{"role": "user", "content": prompt}],
118
  max_tokens=2000,
119
  temperature=temperature
120
  )
121
  return LLMResponse(
122
  content=response.content[0].text,
123
+ model="Claude 4.5 Haiku",
124
  success=True
125
  )
126
 
127
  elif model == "deepseek" and "deepseek" in self.clients:
128
  try:
129
  response = self.clients["deepseek"].chat.completions.create(
130
+ model="deepseek-coder-v2",
131
  messages=[{"role": "user", "content": prompt}],
132
  temperature=temperature
133
  )
134
  return LLMResponse(
135
  content=response.choices[0].message.content,
136
+ model="DeepSeek Coder V2",
137
  success=True
138
  )
139
  except Exception as deepseek_error:
140
  # Try with alternative model name if the first one fails
141
  try:
142
  response = self.clients["deepseek"].chat.completions.create(
143
+ model="deepseek-coder",
144
  messages=[{"role": "user", "content": prompt}],
145
  temperature=temperature
146
  )
147
  return LLMResponse(
148
  content=response.choices[0].message.content,
149
+ model="DeepSeek Coder V2",
150
  success=True
151
  )
152
  except Exception as second_error:
153
  return LLMResponse(
154
  content="",
155
+ model="DeepSeek Coder V2",
156
  success=False,
157
  error=f"DeepSeek API Error: {str(deepseek_error)}. Also tried alternative model: {str(second_error)}"
158
  )
159
+
160
+ elif model == "mercury" and "mercury" in self.clients:
161
+ # Check if Mercury client is properly initialized
162
+ if self.clients["mercury"] is None:
163
+ return LLMResponse(
164
+ content="",
165
+ model="Mercury Fast LLM",
166
+ success=False,
167
+ error="Mercury API client not properly initialized. Check your API key and endpoint configuration."
168
+ )
169
+
170
+ # Build candidate base URLs (env first, then known defaults)
171
+ candidate_base_urls = []
172
+ if os.getenv("MERCURY_BASE_URL"):
173
+ candidate_base_urls.append(os.getenv("MERCURY_BASE_URL"))
174
+ if os.getenv("INCEPTION_BASE_URL"):
175
+ candidate_base_urls.append(os.getenv("INCEPTION_BASE_URL"))
176
+ candidate_base_urls.extend([
177
+ "https://api.inceptionlabs.ai/v1",
178
+ "https://api.mercury.ai/v1",
179
+ "https://api.mercury.ai",
180
+ ])
181
+
182
+ # Candidate model names (env first, then fallbacks)
183
+ candidate_models = []
184
+ env_model = os.getenv("MERCURY_MODEL_NAME") or os.getenv("INCEPTION_MODEL_NAME")
185
+ if env_model:
186
+ candidate_models.append(env_model)
187
+ candidate_models.extend(["mercury", "mercury-fast", "mercury-pro", "gpt-4", "gpt-3.5-turbo"])
188
+
189
+ last_error: Optional[str] = None
190
+
191
+ for base_url in candidate_base_urls:
192
+ try:
193
+ client = openai.OpenAI(
194
+ api_key=(os.getenv("MERCURY_API_KEY") or os.getenv("INCEPTION_API_KEY")),
195
+ base_url=base_url,
196
+ )
197
+ for mercury_model in candidate_models:
198
+ try:
199
+ response = client.chat.completions.create(
200
+ model=mercury_model,
201
+ messages=[{"role": "user", "content": prompt}],
202
+ temperature=temperature,
203
+ max_tokens=2000,
204
+ )
205
+ return LLMResponse(
206
+ content=response.choices[0].message.content,
207
+ model="Mercury Fast LLM",
208
+ success=True,
209
+ )
210
+ except Exception as model_error:
211
+ last_error = f"{type(model_error).__name__}: {str(model_error)}"
212
+ continue
213
+ except Exception as client_error:
214
+ last_error = f"{type(client_error).__name__}: {str(client_error)}"
215
+ continue
216
+
217
+ # If all attempts failed, provide a consolidated error
218
+ if last_error and "503" in last_error:
219
+ return LLMResponse(
220
+ content="",
221
+ model="Mercury Fast LLM",
222
+ success=False,
223
+ error=(
224
+ "Mercury/Inception API returned 503 across endpoints. Service may be down. "
225
+ "Tried endpoints: " + ", ".join(candidate_base_urls)
226
+ ),
227
+ )
228
+ return LLMResponse(
229
+ content="",
230
+ model="Mercury Fast LLM",
231
+ success=False,
232
+ error=(
233
+ "Mercury API request failed after trying multiple endpoints and models. "
234
+ f"Last error: {last_error or 'unknown error'}"
235
+ ),
236
+ )
237
+
238
+ elif model == "huggingface" and "huggingface" in self.clients:
239
+ try:
240
+ # Use chat completion API for Mixtral model (most compatible)
241
+ response = self.clients["huggingface"].chat_completion(
242
+ messages=[{"role": "user", "content": prompt}],
243
+ model="mistralai/Mixtral-8x7B-Instruct-v0.1",
244
+ max_tokens=2000,
245
+ temperature=temperature if temperature > 0 else 0.1,
246
+ )
247
+ return LLMResponse(
248
+ content=response.choices[0].message.content,
249
+ model="Hugging Face (Mixtral)",
250
+ success=True
251
+ )
252
+ except Exception as hf_error:
253
+ # Fallback to text generation with a simpler model
254
+ try:
255
+ response = self.clients["huggingface"].text_generation(
256
+ prompt,
257
+ model="microsoft/DialoGPT-medium",
258
+ max_new_tokens=2000,
259
+ temperature=temperature if temperature > 0 else 0.1,
260
+ )
261
+ return LLMResponse(
262
+ content=response,
263
+ model="Hugging Face (DialoGPT)",
264
+ success=True
265
+ )
266
+ except Exception as fallback_error:
267
+ return LLMResponse(
268
+ content="",
269
+ model="Hugging Face (Mixtral)",
270
+ success=False,
271
+ error=f"Hugging Face API Error: {str(hf_error)}. Fallback also failed: {str(fallback_error)}"
272
+ )
273
 
274
  else:
275
  return LLMResponse(
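The Mercury branch above iterates candidate base URLs and model names, remembering only the last error for the consolidated failure message. The control flow reduces to a small generic helper (a hypothetical sketch; the real code constructs an `openai.OpenAI` client per base URL):

```python
def try_candidates(base_urls, models, call):
    """Return the first successful call(base_url, model); raise with the last error otherwise."""
    last_error = None
    for base_url in base_urls:
        for model in models:
            try:
                return call(base_url, model)
            except Exception as exc:
                # Record the failure and keep trying the next (endpoint, model) pair
                last_error = f"{type(exc).__name__}: {exc}"
    raise RuntimeError(f"all candidate endpoints/models failed; last error: {last_error}")
```

One design consequence of keeping only the last error: earlier, possibly more informative failures (e.g. a 401 on the primary endpoint) are masked by whatever the final candidate returned.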
analyzer/prompts.py CHANGED
@@ -1,14 +1,16 @@
1
- def get_code_analysis_prompt(code: str, language: str = "auto-detect") -> str:
2
  """Generate a focused prompt for practical code analysis."""
3
  return f"""
4
- You are an expert code reviewer. Analyze this {language} code for practical issues that matter to developers.
5
 
6
  Code to analyze:
7
  {code}
8
 
9
- Provide a focused analysis with complete, readable sentences. Do NOT use markdown symbols like ### or ** in your response.
10
 
11
- 1. QUALITY_SCORE: Rate 0-100 (consider bugs, readability, maintainability)
 
 
12
 
13
  2. SUMMARY: One complete sentence describing what this code does
14
 
@@ -68,7 +70,23 @@ Provide analysis focusing on:
68
  - Missing features or best practices (write complete sentences)
69
  - Code quality improvements needed (write complete sentences)
70
 
71
- Write clear, complete sentences without markdown symbols. Be practical and focus on actionable feedback for the repository owner.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
72
  """
73
 
74
  def get_comparison_prompt(code: str, language: str = "auto-detect") -> str:
 
1
+ def get_code_analysis_prompt(code: str, language: str = "auto-detect", model: str = None) -> str:
2
  """Generate a focused prompt for practical code analysis."""
3
  return f"""
4
+ You are an expert code reviewer. Analyze this code and provide comprehensive feedback.
5
 
6
  Code to analyze:
7
  {code}
8
 
9
+ First, identify the programming language, then analyze the code. Provide a focused analysis with complete, readable sentences. Do NOT use markdown symbols like ### or ** in your response.
10
 
11
+ 0. DETECTED_LANGUAGE: Respond with ONLY the programming language name (e.g., python, javascript, rust, cpp, java, go, php, ruby, swift, kotlin, typescript, csharp, html, css). Make sure to detect the language accurately based on the actual code.
12
+
13
+ 1. QUALITY_SCORE: Rate 0-100 (consider bugs, readability, maintainability). Respond with ONLY the number.
14
 
15
  2. SUMMARY: One complete sentence describing what this code does
16
 
 
70
  - Missing features or best practices (write complete sentences)
71
  - Code quality improvements needed (write complete sentences)
72
 
73
+ 5. ONBOARDING_GUIDE: (This is a critical section)
74
+ - Provide a step-by-step guide for a new developer to run this project.
75
+ - List key commands for installation, testing, and running (e.g., `npm install`, `pytest`).
76
+ - Infer any required environment variables (e.g., `DATABASE_URL`, `API_KEY`).
77
+ - If setup instructions are in the README, summarize them here.
78
+
79
+ 6. TECH_STACK_RATIONALE: (Provide your best assessment)
80
+ - Identify the primary framework, language, and package manager.
81
+ - Infer the likely reason for choosing this stack in one clear sentence.
82
+ - Mention any interesting or unusual libraries and their purpose.
83
+
84
+ 7. API_ENDPOINT_SUMMARY: (Scan all provided files for this)
85
+ - List the main API endpoints found (e.g., `GET /api/users`, `POST /api/products`).
86
+ - Provide a one-sentence guess for what each endpoint does based on the code.
87
+ - If this is not a web service, you must explicitly state 'Not a web service'.
88
+
89
+ Write clear, complete sentences without markdown symbols. You must attempt to fill out every section. Be practical and focus on actionable feedback for the repository owner.
90
  """
91
 
92
  def get_comparison_prompt(code: str, language: str = "auto-detect") -> str:
analyzer/utils.py CHANGED
@@ -1,29 +1,135 @@
1
  import re
2
  from typing import Dict, Any, Tuple
 
3
 
4
  def detect_language(code: str) -> str:
5
- """Simple language detection based on syntax patterns."""
6
- patterns = {
7
- 'python': [r'def\s+\w+\(', r'import\s+\w+', r'if\s+__name__\s*==\s*["\']__main__["\']'],
8
- 'javascript': [r'function\s+\w+\(', r'const\s+\w+\s*=', r'console\.log\('],
9
- 'java': [r'public\s+class\s+\w+', r'public\s+static\s+void\s+main'],
10
- 'cpp': [r'#include\s*<\w+>', r'int\s+main\s*\(', r'std::'],
11
- 'csharp': [r'using\s+System', r'namespace\s+\w+', r'public\s+class\s+\w+'],
12
- 'go': [r'package\s+\w+', r'func\s+\w+\(', r'import\s+\('],
13
- 'rust': [r'fn\s+main\s*\(', r'use\s+\w+', r'let\s+mut\s+\w+'],
14
- }
 
15
 
16
- for lang, patterns_list in patterns.items():
17
- for pattern in patterns_list:
18
- if re.search(pattern, code):
19
- return lang
20
 
21
- return 'unknown'
22
 
23
- def parse_analysis_result(text: str) -> Dict[str, Any]:
24
  """Parse LLM response into structured format with new focused categories."""
25
  result = {
26
  'quality_score': 75, # default
 
27
  'summary': '',
28
  'bugs': [],
29
  'quality_issues': [],
@@ -37,6 +143,23 @@ def parse_analysis_result(text: str) -> Dict[str, Any]:
37
  'performance_notes': []
38
  }
39
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
40
  # Extract quality score
41
  score_patterns = [
42
  r'(?:QUALITY_SCORE|quality[_\s]*score)[:\s]*(\d+)(?:/100)?',
 
  import re
  from typing import Dict, Any, Tuple
+ from pygments.lexers import guess_lexer
+ from pygments.util import ClassNotFound
+ from .llm_clients import LLMClientManager
+
+ def detect_language_with_llm(code: str) -> str:
+ """Detect the programming language of a code snippet using an LLM."""
+ try:
+ llm_manager = LLMClientManager()
+ # Prioritize Hugging Face for this task if available
+ model = "huggingface" if "huggingface" in llm_manager.get_available_models() else list(llm_manager.get_available_models().keys())[0]
+ prompt = f"""
+ Analyze the following code snippet and identify its programming language.
+ Respond with only the language name (e.g., 'python', 'java', 'javascript', 'go', 'cpp', 'rust', 'php', 'ruby', 'swift', 'kotlin', 'csharp', 'c').
+ If you are unsure, respond with 'unknown'.
+
+ Code:
+ ```
+ {code}
+ ```
+
+ Language:
+ """
+ response = llm_manager.query(model, prompt, temperature=0.1)
+ if response.success:
+ detected_language = response.content.strip().lower()
+ if detected_language and detected_language != "unknown":
+ return detected_language
+ except Exception as e:
+ print(f"LLM-based language detection failed: {e}")
+ return "unknown"
 
  def detect_language(code: str) -> str:
+ """Detect the programming language using LLM first for accurate detection, then fallback to pattern matching."""
+ # Try LLM-based detection first for accurate results
+ detected = detect_language_with_llm(code)
+ if detected != "unknown":
+ return detected
+
+ # Fallback to pattern matching only if LLM fails
+ code_lower = code.lower()
+
+ # HTML detection (check first as it's very common and specific)
+ if re.search(r'<html|<head|<body|<div|<span|<p\s|class\s*=|id\s*=', code, re.IGNORECASE):
+ return "html"
+
+ # CSS detection (check early as it's specific)
+ if re.search(r'\.\w+\s*\{|@media|@import|background:|color:|font-|margin:|padding:', code, re.IGNORECASE):
+ return "css"
+
+ # Go language detection (check early as it's most specific)
+ if re.search(r'package\s+main|func\s+\w+\s*\(|import\s*\(', code, re.IGNORECASE):
+ return "go"
+
+ # Python language detection
+ if re.search(r'def\s+\w+\s*\(|import\s+\w+|from\s+\w+\s+import|if\s+__name__\s*==\s*["\']__main__["\']', code, re.IGNORECASE):
+ return "python"
+
+ # JavaScript language detection (more specific patterns)
+ js_patterns = [
+ r'function\s+\w+\s*\([^)]*\)\s*\{', # function declaration with body
+ r'const\s+\w+\s*=\s*\([^)]*\)\s*=>', # arrow function
+ r'let\s+\w+\s*=\s*\([^)]*\)\s*=>', # arrow function with let
+ r'var\s+\w+\s*=\s*\([^)]*\)\s*=>', # arrow function with var
+ r'console\.log\s*\(', # console.log
+ r'document\.getElementById', # DOM manipulation
+ r'addEventListener\s*\(', # event listeners
+ r'require\s*\(|import\s+.*\s+from', # module imports
+ r'export\s+(default\s+)?(function|const|class)', # exports
+ ]
+
+ # TypeScript detection (check before JavaScript)
+ if re.search(r'interface\s+\w+|type\s+\w+\s*=|:\s*\w+\[\]|:\s*string\s*[;=]|:\s*number\s*[;=]', code, re.IGNORECASE):
+ return "typescript"
 
+ # If it matches JavaScript patterns
+ for pattern in js_patterns:
+ if re.search(pattern, code, re.IGNORECASE):
+ return "javascript"
 
+ # Java language detection
+ if re.search(r'public\s+class\s+\w+|System\.out\.println|import\s+java\.', code, re.IGNORECASE):
+ return "java"
+
+ # C++ language detection
+ if re.search(r'#include\s*<|std::|using\s+namespace\s+std', code, re.IGNORECASE):
+ return "cpp"
+
+ # C language detection
+ if re.search(r'#include\s*<|int\s+main\s*\(|printf\s*\(', code, re.IGNORECASE):
+ return "c"
+
+ # C# language detection
+ if re.search(r'using\s+System|namespace\s+\w+|public\s+class\s+\w+', code, re.IGNORECASE):
+ return "csharp"
+
+ # Rust language detection
+ if re.search(r'fn\s+\w+\s*\(|let\s+\w+\s*:|use\s+\w+::', code, re.IGNORECASE):
+ return "rust"
+
+ # PHP language detection
+ if re.search(r'<\?php|echo\s+|\$\w+\s*=', code, re.IGNORECASE):
+ return "php"
+
+ # Ruby language detection
+ if re.search(r'def\s+\w+\s*|puts\s+|require\s+', code, re.IGNORECASE):
+ return "ruby"
+
+ # Swift language detection
+ if re.search(r'func\s+\w+\s*\(|let\s+\w+\s*:|var\s+\w+\s*:', code, re.IGNORECASE):
+ return "swift"
+
+ # Kotlin language detection
+ if re.search(r'fun\s+\w+\s*\(|val\s+\w+\s*=|var\s+\w+\s*=', code, re.IGNORECASE):
+ return "kotlin"
+
+ # Fallback to Pygments if no pattern matches
+ try:
+ from pygments.lexers import guess_lexer
+ from pygments.util import ClassNotFound
+ lexer = guess_lexer(code)
+ return lexer.name.lower()
+ except (ClassNotFound, ImportError):
+ return "unknown"
+ return "unknown"
 
+ def parse_analysis_result(text: str, model: str = None) -> Dict[str, Any]:
  """Parse LLM response into structured format with new focused categories."""
  result = {
  'quality_score': 75, # default
+ 'detected_language': None, # AI-detected language
  'summary': '',
  'bugs': [],
  'quality_issues': [],
 
  'performance_notes': []
  }
 
+ # Extract detected language first
+ language_patterns = [
+ r'(?:DETECTED_LANGUAGE|language)[:\s]*([a-z]+)(?:\s|$|\.|,)',
+ r'^language[:\s]*([a-z]+)(?:\s|$|\.|,)',
+ r'(?:programming\s+language)[:\s]*([a-z]+)(?:\s|$|\.|,)',
+ ]
+
+ for pattern in language_patterns:
+ lang_match = re.search(pattern, text, re.IGNORECASE)
+ if lang_match:
+ detected_lang = lang_match.group(1).strip().lower()
+ # Validate it's a known language
+ known_languages = ['python', 'javascript', 'java', 'cpp', 'c', 'rust', 'go', 'php', 'ruby', 'swift', 'kotlin', 'typescript', 'csharp', 'html', 'css']
+ if detected_lang in known_languages:
+ result['detected_language'] = detected_lang
+ break
+
  # Extract quality score
  score_patterns = [
  r'(?:QUALITY_SCORE|quality[_\s]*score)[:\s]*(\d+)(?:/100)?',
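The parser above pulls structured fields out of free-form LLM text with regexes. A minimal, self-contained sketch of the language-extraction step (the helper name `extract_detected_language` is ours, not part of the repo, and it skips the `known_languages` validation that `parse_analysis_result` performs):

```python
import re

def extract_detected_language(text: str):
    """Return the first language name found after a DETECTED_LANGUAGE/language
    label, mirroring the parser's first pattern, or None if absent."""
    match = re.search(
        r'(?:DETECTED_LANGUAGE|language)[:\s]*([a-z]+)(?:\s|$|\.|,)',
        text,
        re.IGNORECASE,
    )
    return match.group(1).lower() if match else None
```

Because the pattern is case-insensitive, `"DETECTED_LANGUAGE: Python"` normalizes to `"python"`, while text without a language label yields `None`.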
detailed_model_comparison.py DELETED
@@ -1,324 +0,0 @@
- #!/usr/bin/env python3
- """
- Detailed Model Comparison Script
-
- This script compares CodeT5+ and CodeBERT side by side
- to show exactly how much better each model is for code analysis.
- """
-
- import torch
- import time
- from transformers import AutoTokenizer, AutoModelForCausalLM, AutoModelForSeq2SeqLM
- import json
-
- def analyze_code_with_model(model_id: str, model_type: str, test_code: str, test_name: str):
- """
- Analyze code with a specific model and return detailed results.
- """
- print(f"\n🧪 Testing {test_name}")
- print("=" * 80)
-
- start_time = time.time()
-
- try:
- device = "cuda" if torch.cuda.is_available() else "cpu"
-
- # Load tokenizer
- tokenizer = AutoTokenizer.from_pretrained(model_id)
- if tokenizer.pad_token is None:
- tokenizer.pad_token = tokenizer.eos_token
-
- # Load model
- if model_type == "seq2seq":
- model = AutoModelForSeq2SeqLM.from_pretrained(
- model_id,
- torch_dtype=torch.float16,
- device_map="auto"
- )
- else:
- model = AutoModelForCausalLM.from_pretrained(
- model_id,
- torch_dtype=torch.float16,
- device_map="auto"
- )
-
- # Get model info
- param_count = sum(p.numel() for p in model.parameters())
-
- # Create analysis prompt
- prompt = f"""Analyze this code for bugs, performance issues, and security concerns:
-
- {test_code}
-
- Analysis:"""
-
- # Tokenize
- inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512)
- inputs = {k: v.to(device) for k, v in inputs.items()}
-
- # Generate analysis
- generation_start = time.time()
- with torch.no_grad():
- if model_type == "seq2seq":
- outputs = model.generate(
- inputs["input_ids"],
- max_length=inputs["input_ids"].shape[1] + 300,
- num_beams=4,
- early_stopping=True,
- do_sample=False,
- temperature=0.7
- )
- else:
- outputs = model.generate(
- inputs["input_ids"],
- max_length=inputs["input_ids"].shape[1] + 300,
- temperature=0.7,
- do_sample=True,
- pad_token_id=tokenizer.eos_token_id,
- top_p=0.9
- )
-
- generation_time = time.time() - generation_start
-
- # Decode analysis
- analysis = tokenizer.decode(outputs[0], skip_special_tokens=True)
- analysis_text = analysis[len(prompt):].strip()
-
- total_time = time.time() - start_time
-
- # Analyze the quality of the analysis
- quality_score = analyze_analysis_quality(analysis_text, test_code)
-
- return {
- "model_id": model_id,
- "model_type": model_type,
- "test_name": test_name,
- "success": True,
- "analysis": analysis_text,
- "total_time": total_time,
- "generation_time": generation_time,
- "parameters": param_count,
- "quality_score": quality_score,
- "analysis_length": len(analysis_text)
- }
-
- except Exception as e:
- return {
- "model_id": model_id,
- "model_type": model_type,
- "test_name": test_name,
- "success": False,
- "error": str(e),
- "total_time": time.time() - start_time
- }
-
- def analyze_analysis_quality(analysis_text: str, original_code: str):
- """
- Analyze the quality of the code analysis.
- Returns a score from 0-100 based on various factors.
- """
- score = 0
-
- # Check for different types of analysis
- analysis_lower = analysis_text.lower()
-
- # Bug detection (20 points)
- if any(word in analysis_lower for word in ['bug', 'error', 'issue', 'problem', 'flaw']):
- score += 20
-
- # Performance analysis (20 points)
- if any(word in analysis_lower for word in ['performance', 'slow', 'efficient', 'complexity', 'optimization']):
- score += 20
-
- # Security analysis (20 points)
- if any(word in analysis_lower for word in ['security', 'vulnerability', 'safe', 'unsafe', 'risk']):
- score += 20
-
- # Code structure analysis (20 points)
- if any(word in analysis_lower for word in ['structure', 'design', 'pattern', 'architecture', 'organization']):
- score += 20
-
- # Suggestions/improvements (20 points)
- if any(word in analysis_lower for word in ['suggest', 'improve', 'better', 'recommend', 'fix', 'solution']):
- score += 20
-
- # Bonus points for detailed analysis
- if len(analysis_text) > 200:
- score += 10
- if len(analysis_text) > 500:
- score += 10
-
- return min(score, 100)
-
- def compare_models():
- """
- Compare CodeT5+ and CodeBERT on multiple test cases.
- """
-
- # Test cases with different types of code issues
- test_cases = [
- {
- "name": "Performance Issue (Recursive Fibonacci)",
- "code": """
- def calculate_fibonacci(n):
- if n <= 0:
- return 0
- elif n == 1:
- return 1
- else:
- return calculate_fibonacci(n-1) + calculate_fibonacci(n-2)
-
- # This will be slow for large numbers
- result = calculate_fibonacci(35)
- print(result)
- """
- },
- {
- "name": "Security Issue (SQL Injection)",
- "code": """
- import sqlite3
-
- def get_user(email):
- conn = sqlite3.connect("app.db")
- cur = conn.cursor()
- # vulnerable string interpolation
- q = f"SELECT id, email, role FROM users WHERE email = '{email}'"
- rows = cur.execute(q).fetchall()
- conn.close()
- return rows[0] if rows else None
-
- # Usage
- user = get_user("admin@example.com")
- """
- },
- {
- "name": "Bug Issue (Division by Zero)",
- "code": """
- def divide_numbers(a, b):
- return a / b
-
- def calculate_average(numbers):
- total = sum(numbers)
- count = len(numbers)
- return divide_numbers(total, count)
-
- # This will crash with empty list
- result = calculate_average([])
- print(result)
- """
- }
- ]
-
- # Models to compare
- models = [
- {
- "id": "Salesforce/codet5p-220m",
- "type": "seq2seq",
- "name": "CodeT5+ (Recommended)"
- },
- {
- "id": "microsoft/CodeBERT-base",
- "type": "causal",
- "name": "CodeBERT (Alternative)"
- }
- ]
-
- print("🚀 DETAILED MODEL COMPARISON")
- print("=" * 100)
- print("Testing both models on multiple code analysis scenarios...")
-
- all_results = []
-
- for test_case in test_cases:
- print(f"\n📋 TEST CASE: {test_case['name']}")
- print("=" * 100)
- print(f"Code to analyze:\n{test_case['code']}")
- print("=" * 100)
-
- test_results = []
-
- for model in models:
- result = analyze_code_with_model(
- model["id"],
- model["type"],
- test_case["code"],
- model["name"]
- )
- test_results.append(result)
- all_results.append(result)
-
- # Show side-by-side comparison for this test case
- print(f"\n📊 SIDE-BY-SIDE COMPARISON:")
- print("-" * 100)
-
- for result in test_results:
- if result["success"]:
- print(f"\n🤖 {result['test_name']}:")
- print(f" ⏱️ Time: {result['total_time']:.2f}s")
- print(f" 📊 Parameters: {result['parameters']:,}")
- print(f" 🎯 Quality Score: {result['quality_score']}/100")
- print(f" 📝 Analysis Length: {result['analysis_length']} chars")
- print(f" 📄 Analysis:")
- print(f" {result['analysis'][:200]}{'...' if len(result['analysis']) > 200 else ''}")
- else:
- print(f"\n❌ {result['test_name']}: FAILED - {result['error']}")
-
- # Overall comparison
- print(f"\n🏆 OVERALL COMPARISON SUMMARY")
- print("=" * 100)
-
- # Group results by model
- codet5_results = [r for r in all_results if r.get("model_id") == "Salesforce/codet5p-220m" and r["success"]]
- codebert_results = [r for r in all_results if r.get("model_id") == "microsoft/CodeBERT-base" and r["success"]]
-
- if codet5_results and codebert_results:
- # Calculate averages
- codet5_avg_time = sum(r["total_time"] for r in codet5_results) / len(codet5_results)
- codet5_avg_quality = sum(r["quality_score"] for r in codet5_results) / len(codet5_results)
- codet5_avg_length = sum(r["analysis_length"] for r in codet5_results) / len(codet5_results)
-
- codebert_avg_time = sum(r["total_time"] for r in codebert_results) / len(codebert_results)
- codebert_avg_quality = sum(r["quality_score"] for r in codebert_results) / len(codebert_results)
- codebert_avg_length = sum(r["analysis_length"] for r in codebert_results) / len(codebert_results)
-
- print(f"\n📈 AVERAGE PERFORMANCE:")
- print(f"CodeT5+:")
- print(f" ⏱️ Time: {codet5_avg_time:.2f}s")
- print(f" 🎯 Quality: {codet5_avg_quality:.1f}/100")
- print(f" 📝 Length: {codet5_avg_length:.0f} chars")
-
- print(f"\nCodeBERT:")
- print(f" ⏱️ Time: {codebert_avg_time:.2f}s")
- print(f" 🎯 Quality: {codebert_avg_quality:.1f}/100")
- print(f" 📝 Length: {codebert_avg_length:.0f} chars")
-
- # Calculate improvements
- time_ratio = codebert_avg_time / codet5_avg_time
- quality_diff = codet5_avg_quality - codebert_avg_quality
- length_ratio = codet5_avg_length / codebert_avg_length
-
- print(f"\n🎯 IMPROVEMENT ANALYSIS:")
- print(f"Speed: CodeBERT is {time_ratio:.1f}x faster than CodeT5+")
- print(f"Quality: CodeT5+ is {quality_diff:.1f} points better than CodeBERT")
- print(f"Detail: CodeT5+ gives {length_ratio:.1f}x more detailed analysis")
-
- # Final recommendation
- print(f"\n🏆 FINAL RECOMMENDATION:")
- if quality_diff > 10:
- print(f"✅ Use CodeT5+ - Significantly better analysis quality ({quality_diff:.1f} points better)")
- print(f" Trade-off: {time_ratio:.1f}x slower, but much better results")
- elif quality_diff > 5:
- print(f"✅ Use CodeT5+ - Better analysis quality ({quality_diff:.1f} points better)")
- print(f" Trade-off: {time_ratio:.1f}x slower, but better results")
- else:
- print(f"🤔 Both models are similar in quality")
- print(f" Choose CodeBERT for speed, CodeT5+ for slightly better quality")
-
- def main():
- """
- Main function to run the detailed comparison.
- """
- compare_models()
-
- if __name__ == "__main__":
- main()
 
evaluation_samples/buggy_code.py ADDED
@@ -0,0 +1,19 @@
+ def calculate_average(numbers: list) -> float:
+ """
+ Calculates the average of a list of numbers.
+ There is a bug in this function.
+ """
+ if not numbers:
+ return 0.0
+
+ total = 0
+ # Bug: This loop will miss the last number in the list.
+ for i in range(len(numbers) - 1):
+ total += numbers[i]
+
+ return total / len(numbers)
+
+ # Example usage (will produce the wrong result)
+ # print(calculate_average([10, 20, 30]))
+ # Expected: 20.0
+ # Actual: 10.0
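The planted bug is an off-by-one: `range(len(numbers) - 1)` skips the final element, so the sum is short by one value. A minimal corrected version (our own sketch, not part of the commit) iterates over the list directly:

```python
def calculate_average_fixed(numbers: list) -> float:
    """Average of a list of numbers; unlike the buggy sample above,
    the loop visits every element, so nothing is dropped."""
    if not numbers:
        return 0.0
    total = 0
    for value in numbers:  # no index arithmetic, no off-by-one
        total += value
    return total / len(numbers)
```

With this fix, `calculate_average_fixed([10, 20, 30])` returns the expected `20.0` instead of `10.0`.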
evaluation_samples/good_code.py ADDED
@@ -0,0 +1,18 @@
+ def get_file_extension(filename: str) -> str:
+ """
+ Returns the extension of a file.
+
+ Args:
+ filename: The name of the file.
+
+ Returns:
+ The file extension (e.g., "txt", "py") or an empty string if
+ the file has no extension.
+ """
+ if not isinstance(filename, str):
+ raise TypeError("Filename must be a string.")
+
+ parts = filename.split('.')
+ if len(parts) > 1:
+ return parts[-1]
+ return ""
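The same contract can be built on the standard library; this sketch (ours, not part of the commit) uses `os.path.splitext`, which differs from the `split('.')` approach on dotfiles such as `.bashrc` (splitext treats them as having no extension):

```python
import os.path

def get_file_extension_stdlib(filename: str) -> str:
    """Return the extension without the leading dot, or "" if none."""
    _, ext = os.path.splitext(filename)
    return ext[1:]  # splitext keeps the leading ".", so strip it
```

For ordinary names the two agree: `"notes.txt"` gives `"txt"` and `"README"` gives `""`.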
evaluation_samples/insecure_code.py ADDED
@@ -0,0 +1,20 @@
+ import sqlite3
+
+ def get_user_data(username: str) -> list:
+ """
+ Retrieves user data from a database.
+ This function has a critical security vulnerability.
+ """
+ # This is insecure! The username is directly concatenated into the query.
+ # An attacker could provide a username like: 'admin'; DROP TABLE users; --'
+ query = "SELECT * FROM users WHERE username = '" + username + "'"
+
+ conn = sqlite3.connect('example.db')
+ cursor = conn.cursor()
+
+ # In a real scenario, this line would execute the malicious query
+ cursor.execute(query)
+
+ results = cursor.fetchall()
+ conn.close()
+ return results
+
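The standard fix for the vulnerability above is a parameterized query, where the driver binds the value instead of splicing it into the SQL string. A self-contained sketch (ours, not part of the commit; it seeds an in-memory database so it runs standalone):

```python
import sqlite3

def get_user_data_safe(username: str) -> list:
    """Same lookup as the insecure sample above, but with a bound
    parameter, so a malicious username cannot alter the SQL."""
    conn = sqlite3.connect(':memory:')  # throwaway DB for the sketch
    cur = conn.cursor()
    cur.execute("CREATE TABLE users (username TEXT)")
    cur.execute("INSERT INTO users VALUES (?)", ("alice",))
    # The "?" placeholder is filled by sqlite3, never by string concat
    cur.execute("SELECT * FROM users WHERE username = ?", (username,))
    rows = cur.fetchall()
    conn.close()
    return rows
```

An injection payload is now just an unmatched literal: it returns no rows rather than executing extra statements.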
finetune.py DELETED
@@ -1,274 +0,0 @@
- #!/usr/bin/env python3
- """
- Fine-tuning Script for AI Code Analyzer
-
- This script fine-tunes the DeepSeek Coder model using QLoRA (Quantized LoRA)
- technique on our custom code analysis dataset.
-
- Features:
- - 4-bit quantization for memory efficiency
- - LoRA adapters for parameter-efficient fine-tuning
- - Supervised Fine-Tuning (SFT) using TRL
- - Automatic model saving and adapter persistence
-
- Author: AI Code Analyzer Project
- Date: 2025
- """
-
- import os
- import argparse
- import torch
- from transformers import (
- AutoModelForCausalLM,
- AutoTokenizer,
- BitsAndBytesConfig,
- TrainingArguments,
- DataCollatorForLanguageModeling
- )
- from peft import LoraConfig, get_peft_model, TaskType
- from trl import SFTTrainer
- from datasets import Dataset
- import json
- import logging
-
- # Set up logging
- logging.basicConfig(level=logging.INFO)
- logger = logging.getLogger(__name__)
-
- def load_dataset(file_path: str) -> Dataset:
- """
- Load the training dataset from JSONL file.
-
- Args:
- file_path (str): Path to the analyst_dataset.jsonl file
-
- Returns:
- Dataset: Hugging Face dataset object
- """
- logger.info(f"Loading dataset from {file_path}")
-
- data = []
- with open(file_path, 'r', encoding='utf-8') as f:
- for line in f:
- if line.strip():
- data.append(json.loads(line))
-
- logger.info(f"Loaded {len(data)} training samples")
- return Dataset.from_list(data)
-
- def create_quantization_config() -> BitsAndBytesConfig:
- """
- Create 4-bit quantization configuration for memory efficiency.
-
- Returns:
- BitsAndBytesConfig: Quantization configuration
- """
- logger.info("Creating 4-bit quantization configuration")
-
- return BitsAndBytesConfig(
- load_in_4bit=True,
- bnb_4bit_quant_type="nf4",
- bnb_4bit_compute_dtype=torch.float16,
- bnb_4bit_use_double_quant=True,
- )
-
- def infer_lora_target_modules(model) -> list[str]:
- """Infer suitable LoRA target modules by inspecting model modules.
-
- Handles common architectures:
- - LLaMA/DeepSeek-like: q_proj,k_proj,v_proj,o_proj,gate_proj,up_proj,down_proj
- - GPT-2/OPT-like: c_attn,c_proj,c_fc (when present)
- Falls back to any module names that contain 'q_proj','k_proj','v_proj','o_proj'
- found in the model.
- """
- llama_like = ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]
- gpt2_like = ["c_attn", "c_proj", "c_fc"]
-
- module_names = set(name.split(".")[-1] for name, _ in model.named_modules())
-
- if any(m in module_names for m in llama_like):
- return [m for m in llama_like if m in module_names]
-
- if any(m in module_names for m in gpt2_like):
- return [m for m in gpt2_like if m in module_names]
-
- # Generic attention projection fallback
- generic = [m for m in ["q_proj", "k_proj", "v_proj", "o_proj"] if m in module_names]
- if generic:
- return generic
-
- # Last resort: try any modules containing 'attn' or 'proj'
- heuristic = [m for m in module_names if ("attn" in m or "proj" in m)]
- return heuristic[:4] if heuristic else []
-
-
- def create_lora_config(model) -> LoraConfig:
- """
- Create LoRA configuration for parameter-efficient fine-tuning.
-
- Returns:
- LoraConfig: LoRA configuration
- """
- logger.info("Creating LoRA configuration")
-
- target_modules = infer_lora_target_modules(model)
- if not target_modules:
- logger.warning("Could not infer LoRA target modules; proceeding without explicit targets (may fail)")
- target_modules = None # Let PEFT attempt defaults
-
- return LoraConfig(
- task_type=TaskType.CAUSAL_LM,
- inference_mode=False,
- r=16, # Rank of adaptation
- lora_alpha=32, # LoRA scaling parameter
- lora_dropout=0.1, # LoRA dropout
- target_modules=target_modules,
- )
-
- def create_training_arguments(args: argparse.Namespace) -> TrainingArguments:
- """
- Create training arguments for the fine-tuning process.
-
- Returns:
- TrainingArguments: Training configuration
- """
- logger.info("Creating training arguments")
-
- # Defaults
- num_train_epochs = args.epochs if args.epochs is not None else 3
- per_device_train_batch_size = args.batch_size if args.batch_size is not None else 1
- logging_steps = 10
- save_steps = 500
- max_steps = args.max_steps if args.max_steps is not None else -1
-
- if args.dry_run:
- # Make the run extremely short and avoid frequent saving
- num_train_epochs = 1 if args.epochs is None else args.epochs
- logging_steps = 1
- save_steps = 10_000_000
- if args.max_steps is None:
- max_steps = 1
-
- return TrainingArguments(
- output_dir="./fine-tuned-analyst",
- num_train_epochs=num_train_epochs,
- per_device_train_batch_size=per_device_train_batch_size,
- gradient_accumulation_steps=4,
- warmup_steps=100,
- learning_rate=2e-4,
- fp16=args.fp16,
- logging_steps=logging_steps,
- save_steps=save_steps,
- save_total_limit=2,
- remove_unused_columns=False,
- push_to_hub=False,
- report_to=None, # Disable wandb/tensorboard
- dataloader_pin_memory=False,
- max_steps=max_steps,
- )
-
- def main():
- """
- Main function to execute the fine-tuning process.
- """
- logger.info("Starting AI Code Analyzer fine-tuning process")
-
- parser = argparse.ArgumentParser(description="Fine-tune DeepSeek Coder with QLoRA")
- parser.add_argument("--model", type=str, default="deepseek-ai/deepseek-coder-1.3b-instruct", help="Base model ID (HF Hub)")
- parser.add_argument("--subset", type=int, default=None, help="Use only the first N samples from the dataset")
- parser.add_argument("--epochs", type=int, default=None, help="Number of training epochs")
- parser.add_argument("--batch-size", type=int, default=None, help="Per-device train batch size")
- parser.add_argument("--max-steps", type=int, default=None, help="Override maximum training steps")
- parser.add_argument("--dry-run", action="store_true", help="Run a very short demo training")
- parser.add_argument("--no-quant", action="store_true", help="Disable 4-bit quantization (useful for CPU runs)")
- args = parser.parse_args()
-
- # Check if CUDA is available
- device = "cuda" if torch.cuda.is_available() else "cpu"
- logger.info(f"Using device: {device}")
- if device == "cpu":
- logger.warning("CUDA not available. Training will be slow on CPU.")
-
- # Step 1: Load the dataset
- dataset = load_dataset("analyst_dataset.jsonl")
- if args.subset is not None and args.subset > 0:
- logger.info(f"Using only the first {args.subset} samples for this run")
- dataset = dataset.select(range(min(args.subset, len(dataset))))
-
- # Step 2: Load the base model and tokenizer
- model_name = args.model
- logger.info(f"Loading model: {model_name}")
-
- # Create quantization config if enabled and likely supported
- use_quant = (device == "cuda") and (not args.no_quant)
- quantization_config = create_quantization_config() if use_quant else None
- if not use_quant:
- logger.info("Quantization disabled (CPU or --no-quant). Using standard weights.")
-
- # Load tokenizer
- tokenizer = AutoTokenizer.from_pretrained(model_name)
- if tokenizer.pad_token is None:
- tokenizer.pad_token = tokenizer.eos_token
-
- # Load model with quantization
- if quantization_config is not None:
- model = AutoModelForCausalLM.from_pretrained(
- model_name,
- quantization_config=quantization_config,
- device_map="auto",
- trust_remote_code=True,
- )
- fp16 = True
- else:
- # CPU or non-quantized path
- model = AutoModelForCausalLM.from_pretrained(
- model_name,
- torch_dtype=torch.float32 if device == "cpu" else torch.float16,
- device_map="cpu" if device == "cpu" else "auto",
- low_cpu_mem_usage=True,
- trust_remote_code=True,
- )
- fp16 = (device != "cpu")
-
- # Step 3: Configure LoRA
- lora_config = create_lora_config(model)
- model = get_peft_model(model, lora_config)
-
- # Print trainable parameters
- model.print_trainable_parameters()
-
- # Step 4: Set training arguments
- # Ensure training args match device precision
- args.fp16 = fp16
- training_args = create_training_arguments(args)
-
- # Step 5: Initialize SFT Trainer
- logger.info("Initializing SFT Trainer")
-
- # Provide the tokenizer/processor via processing_class
- trainer = SFTTrainer(
- model=model,
- args=training_args,
- train_dataset=dataset,
- processing_class=tokenizer,
- formatting_func=lambda r: r["text"],
- )
-
- # Step 6: Start training
- logger.info("Starting training...")
- trainer.train()
-
- # Step 7: Save the model
- logger.info("Saving fine-tuned model...")
- trainer.save_model()
- tokenizer.save_pretrained(training_args.output_dir)
-
- # Save LoRA adapters separately
- model.save_pretrained(f"{training_args.output_dir}/lora_adapters")
-
- logger.info("Fine-tuning completed successfully!")
- logger.info(f"Model saved to: {training_args.output_dir}")
- logger.info("LoRA adapters saved to: {}/lora_adapters".format(training_args.output_dir))
-
- if __name__ == "__main__":
- main()
- main()
finetune_improved.py DELETED
@@ -1,277 +0,0 @@
- #!/usr/bin/env python3
- """
- Improved Fine-tuning Script for AI Code Analyzer
-
- This script fine-tunes specialized code analysis models using the QLoRA technique
- on our custom code analysis dataset.
-
- Key Improvements:
- - Uses CodeT5+ or CodeBERT for better code understanding
- - Optimized for code analysis tasks
- - Better tokenization for code
- - Improved training parameters
-
- Author: AI Code Analyzer Project
- Date: 2025
- """
-
- import os
- import argparse
- import torch
- from transformers import (
-     AutoModelForCausalLM,
-     AutoModelForSeq2SeqLM,  # For CodeT5 models
-     AutoTokenizer,
-     BitsAndBytesConfig,
-     TrainingArguments,
-     DataCollatorForLanguageModeling
- )
- from peft import LoraConfig, get_peft_model, TaskType
- from trl import SFTTrainer
- from datasets import Dataset
- import json
- import logging
-
- # Set up logging
- logging.basicConfig(level=logging.INFO)
- logger = logging.getLogger(__name__)
-
- def load_dataset(file_path: str) -> Dataset:
-     """
-     Load the training dataset from a JSONL file.
-
-     Args:
-         file_path (str): Path to the analyst_dataset.jsonl file
-
-     Returns:
-         Dataset: Hugging Face dataset object
-     """
-     logger.info(f"Loading dataset from {file_path}")
-
-     data = []
-     with open(file_path, 'r', encoding='utf-8') as f:
-         for line in f:
-             if line.strip():
-                 data.append(json.loads(line))
-
-     logger.info(f"Loaded {len(data)} training samples")
-     return Dataset.from_list(data)
-
- def create_quantization_config() -> BitsAndBytesConfig:
-     """
-     Create 4-bit quantization configuration for memory efficiency.
-
-     Returns:
-         BitsAndBytesConfig: Quantization configuration
-     """
-     logger.info("Creating 4-bit quantization configuration")
-
-     return BitsAndBytesConfig(
-         load_in_4bit=True,
-         bnb_4bit_quant_type="nf4",
-         bnb_4bit_compute_dtype=torch.float16,
-         bnb_4bit_use_double_quant=True,
-     )
-
- def get_model_class(model_name: str):
-     """
-     Determine the appropriate model class based on the model name.
-
-     Args:
-         model_name (str): Name of the model
-
-     Returns:
-         Model class: Appropriate model class for the given model
-     """
-     if "codet5" in model_name.lower():
-         logger.info(f"Using Seq2SeqLM for CodeT5 model: {model_name}")
-         return AutoModelForSeq2SeqLM
-     else:
-         logger.info(f"Using CausalLM for model: {model_name}")
-         return AutoModelForCausalLM
-
- def create_lora_config(model_name: str) -> LoraConfig:
-     """
-     Create LoRA configuration optimized for code analysis models.
-
-     Args:
-         model_name (str): Name of the base model
-
-     Returns:
-         LoraConfig: LoRA configuration
-     """
-     logger.info(f"Creating LoRA configuration for {model_name}")
-
-     # Different target modules for different model architectures
-     if "codet5" in model_name.lower():
-         # CodeT5 architecture
-         target_modules = ["q", "v", "k", "o", "wi_0", "wi_1", "wo"]
-         task_type = TaskType.SEQ_2_SEQ_LM
-     elif "codebert" in model_name.lower():
-         # CodeBERT architecture
-         target_modules = ["query", "key", "value", "dense"]
-         task_type = TaskType.FEATURE_EXTRACTION
-     else:
-         # Default for most causal LM models
-         target_modules = ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]
-         task_type = TaskType.CAUSAL_LM
-
-     return LoraConfig(
-         r=16,
-         lora_alpha=32,
-         target_modules=target_modules,
-         lora_dropout=0.1,
-         bias="none",
-         task_type=task_type,
-     )
-
- def create_training_arguments(args) -> TrainingArguments:
-     """
-     Create training arguments optimized for code analysis fine-tuning.
-
-     Args:
-         args: Command line arguments
-
-     Returns:
-         TrainingArguments: Training configuration
-     """
-     # Calculate training parameters
-     num_train_epochs = args.epochs if args.epochs else 3
-     per_device_train_batch_size = args.batch_size if args.batch_size else 2
-     logging_steps = 10
-     save_steps = 500
-
-     # For demo runs
-     if args.dry_run:
-         logger.info("Running in dry-run mode with minimal training")
-         num_train_epochs = 1
-         per_device_train_batch_size = 1
-         max_steps = 1
-
-     return TrainingArguments(
-         output_dir="./fine-tuned-analyst-improved",
-         num_train_epochs=num_train_epochs,
-         per_device_train_batch_size=per_device_train_batch_size,
-         gradient_accumulation_steps=4,
-         warmup_steps=100,
-         learning_rate=2e-4,
-         fp16=args.fp16,
-         logging_steps=logging_steps,
-         save_steps=save_steps,
-         save_total_limit=2,
-         remove_unused_columns=False,
-         push_to_hub=False,
-         report_to=None,  # Disable wandb/tensorboard
-         dataloader_pin_memory=False,
-         max_steps=1 if args.dry_run else None,
-     )
-
- def main():
-     """
-     Main function to execute the improved fine-tuning process.
-     """
-     logger.info("Starting Improved AI Code Analyzer fine-tuning process")
-
-     parser = argparse.ArgumentParser(description="Fine-tune specialized code analysis models with QLoRA")
-
-     # Model selection with better defaults
-     parser.add_argument("--model", type=str,
-                         default="Salesforce/codet5p-220m",  # Better default for code analysis
-                         help="Base model ID (HF Hub). Options: Salesforce/codet5p-220m, microsoft/CodeBERT, facebook/incoder-1B")
-
-     parser.add_argument("--subset", type=int, default=None, help="Use only the first N samples from the dataset")
-     parser.add_argument("--epochs", type=int, default=None, help="Number of training epochs")
-     parser.add_argument("--batch-size", type=int, default=None, help="Per-device train batch size")
-     parser.add_argument("--max-steps", type=int, default=None, help="Override maximum training steps")
-     parser.add_argument("--dry-run", action="store_true", help="Run a very short demo training")
-     parser.add_argument("--no-quant", action="store_true", help="Disable 4-bit quantization")
-     parser.add_argument("--fp16", action="store_true", default=True, help="Use FP16 precision")
-
-     args = parser.parse_args()
-
-     # Check if CUDA is available
-     device = "cuda" if torch.cuda.is_available() else "cpu"
-     logger.info(f"Using device: {device}")
-     if device == "cpu":
-         logger.warning("CUDA not available. Training will be slow on CPU.")
-
-     # Step 1: Load the dataset
-     logger.info("Step 1: Loading dataset...")
-     dataset = load_dataset("analyst_dataset.jsonl")
-     if args.subset is not None and args.subset > 0:
-         logger.info(f"Using only the first {args.subset} samples for this run")
-         dataset = dataset.select(range(min(args.subset, len(dataset))))
-
-     # Step 2: Load the base model and tokenizer
-     logger.info("Step 2: Loading model and tokenizer...")
-     model_name = args.model
-     logger.info(f"Loading model: {model_name}")
-
-     # Get appropriate model class
-     model_class = get_model_class(model_name)
-
-     # Load tokenizer
-     tokenizer = AutoTokenizer.from_pretrained(model_name)
-
-     # Add padding token if not present
-     if tokenizer.pad_token is None:
-         tokenizer.pad_token = tokenizer.eos_token
-
-     # Load model with quantization if enabled
-     if args.no_quant:
-         logger.info("Loading model without quantization")
-         model = model_class.from_pretrained(model_name)
-     else:
-         logger.info("Loading model with 4-bit quantization")
-         quantization_config = create_quantization_config()
-         model = model_class.from_pretrained(
-             model_name,
-             quantization_config=quantization_config,
-             device_map="auto"
-         )
-
-     # Step 3: Create LoRA configuration
-     logger.info("Step 3: Setting up LoRA configuration...")
-     lora_config = create_lora_config(model_name)
-
-     # Step 4: Apply LoRA to the model
-     logger.info("Step 4: Applying LoRA to model...")
-     model = get_peft_model(model, lora_config)
-
-     # Step 5: Print trainable parameters
-     model.print_trainable_parameters()
-
-     # Step 6: Create training arguments
-     logger.info("Step 6: Setting up training arguments...")
-     training_args = create_training_arguments(args)
-
-     # Step 7: Create trainer
-     logger.info("Step 7: Creating trainer...")
-     trainer = SFTTrainer(
-         model=model,
-         args=training_args,
-         train_dataset=dataset,
-         tokenizer=tokenizer,
-         dataset_text_field="text",
-         max_seq_length=2048,
-         packing=False,
-     )
-
-     # Step 8: Start training
-     logger.info("Step 8: Starting training...")
-     logger.info(f"Training with {len(dataset)} samples")
-     logger.info(f"Model: {model_name}")
-     logger.info(f"Device: {device}")
-
-     trainer.train()
-
-     # Step 9: Save the model
-     logger.info("Step 9: Saving model...")
-     trainer.save_model()
-     tokenizer.save_pretrained(training_args.output_dir)
-
-     logger.info("Training completed successfully!")
-     logger.info(f"Model saved to: {training_args.output_dir}")
-
- if __name__ == "__main__":
-     main()
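The deleted script's `load_dataset()` reads `analyst_dataset.jsonl` line by line and passes the result to `SFTTrainer` with `dataset_text_field="text"`, which implies each line is a JSON object carrying a `text` field. A minimal sketch of writing and re-reading a record in that assumed format (the file name and sample text here are illustrative):

```python
import json
import os
import tempfile

def write_jsonl(path, records):
    # One JSON object per line -- the shape load_dataset() above expects
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")

def read_jsonl(path):
    # Mirrors the deleted script's parsing: skip blank lines, json.loads the rest
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

records = [{"text": "<s>[INST] Analyze this code ... [/INST] Quality Score: 35/100 ..."}]
path = os.path.join(tempfile.mkdtemp(), "analyst_dataset_demo.jsonl")
write_jsonl(path, records)
loaded = read_jsonl(path)
```

Round-tripping through JSONL like this is a quick sanity check before launching a training run.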
hf-space/README.md DELETED
@@ -1,61 +0,0 @@
- ---
- title: Fine-tuned Code Analyzer API
- emoji: 🤖
- colorFrom: green
- colorTo: blue
- sdk: docker
- pinned: false
- ---
-
- # Fine-tuned Code Analyzer API
-
- API endpoint for code analysis using a fine-tuned DeepSeek model.
-
- ## Features
-
- - **Quality Scores**: 1-100 rating for code quality
- - **Structured Analysis**: Bugs, Performance, Security sections
- - **Code Improvements**: Specific suggestions with examples
- - **Professional Output**: Consistent, detailed analysis format
-
- ## Usage
-
- ### POST /analyze
-
- Analyze code for bugs, performance, and security issues.
-
- **Request:**
- ```json
- {
-   "code": "def fibonacci(n):\n    if n <= 1:\n        return n\n    return fibonacci(n-1) + fibonacci(n-2)",
-   "max_tokens": 300
- }
- ```
-
- **Response:**
- ```json
- {
-   "analysis": "Quality Score: 35/100\n\nBUGS:\n- No error handling\n- Infinite recursion possible\n\nPERFORMANCE ISSUES:\n- Recursive calls cause exponential time complexity\n\nSECURITY CONCERNS:\n- No input validation\n\nIMPROVEMENTS:\n1. Use memoization to avoid recursion\n2. Add input validation\n\nExample improved code:\n[Shows working fixes]",
-   "model": "fine-tuned-deepseek",
-   "status": "success"
- }
- ```
-
- ### GET /health
-
- Health check endpoint.
-
- **Response:**
- ```json
- {
-   "status": "healthy",
-   "model": "fine-tuned-deepseek"
- }
- ```
-
- ## Model Details
-
- - **Base Model**: DeepSeek Coder 1.3B
- - **Training Method**: LoRA (Low-Rank Adaptation)
- - **Dataset**: 59+ high-quality code analysis examples
- - **Fine-tuned for**: Code analysis, bug detection, performance optimization
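The removed README documents a `POST /analyze` endpoint taking a JSON body with `code` and `max_tokens` fields. A minimal client sketch against that contract (the Space URL is a placeholder, and `analyze()` assumes the `requests` package plus a running Space):

```python
import json

API_URL = "https://YOUR-SPACE.hf.space"  # hypothetical Space URL, not the real one

def build_analyze_payload(code: str, max_tokens: int = 300) -> str:
    """Serialize the request body documented for POST /analyze."""
    return json.dumps({"code": code, "max_tokens": max_tokens})

def analyze(code: str) -> dict:
    # Network call; requires `pip install requests` and a live endpoint.
    import requests
    resp = requests.post(
        f"{API_URL}/analyze",
        data=build_analyze_payload(code),
        headers={"Content-Type": "application/json"},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()  # expected keys per README: analysis, model, status

payload = build_analyze_payload("def f():\n    pass")
```

The response's `analysis` string embeds the score and section headers, so callers that need structure must parse it themselves.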
hf-space/app.py DELETED
@@ -1,82 +0,0 @@
- from fastapi import FastAPI, HTTPException
- from pydantic import BaseModel
- import uvicorn
- import requests
- import json
-
- app = FastAPI()
-
- class CodeRequest(BaseModel):
-     code: str
-     max_tokens: int = 300
-
- class AnalysisResponse(BaseModel):
-     analysis: str
-     model: str
-     status: str
-
- @app.post("/analyze", response_model=AnalysisResponse)
- async def analyze_code(request: CodeRequest):
-     try:
-         # For now, return a mock analysis while we debug the model loading
-         analysis_text = f"""
- Quality Score: 75/100
-
- BUGS:
- - No error handling for edge cases
- - Potential infinite recursion for large inputs
-
- PERFORMANCE ISSUES:
- - Recursive approach is inefficient for large numbers
- - No memoization implemented
-
- SECURITY CONCERNS:
- - No input validation
- - Could cause stack overflow with large inputs
-
- IMPROVEMENTS:
- 1. Add input validation
- 2. Implement iterative solution or memoization
- 3. Add error handling for edge cases
-
- Example improved code:
- ```python
- def fibonacci_improved(n):
-     if n < 0:
-         raise ValueError("Input must be non-negative")
-     if n <= 1:
-         return n
-
-     a, b = 0, 1
-     for _ in range(2, n + 1):
-         a, b = b, a + b
-     return b
- ```
- """
-
-         return AnalysisResponse(
-             analysis=analysis_text,
-             model="fine-tuned-deepseek-mock",
-             status="success"
-         )
-     except Exception as e:
-         raise HTTPException(status_code=500, detail=str(e))
-
- @app.get("/health")
- async def health_check():
-     return {"status": "healthy", "model": "fine-tuned-deepseek-mock"}
-
- @app.get("/")
- async def root():
-     return {
-         "message": "Fine-tuned Code Analyzer API",
-         "endpoints": {
-             "POST /analyze": "Analyze code for bugs, performance, and security issues",
-             "GET /health": "Health check endpoint"
-         },
-         "model": "fine-tuned-deepseek-mock",
-         "status": "running"
-     }
-
- if __name__ == "__main__":
-     uvicorn.run(app, host="0.0.0.0", port=7860)
hf-space/requirements.txt DELETED
@@ -1,3 +0,0 @@
- fastapi
- uvicorn[standard]
- requests
 
 
 
 
inference_demo.py DELETED
@@ -1,57 +0,0 @@
- #!/usr/bin/env python3
- import os
- import torch
- from transformers import AutoModelForCausalLM, AutoTokenizer
- from peft import PeftModel
-
- # Settings
- BASE_MODEL = os.environ.get("BASE_MODEL", "gpt2")
- ADAPTER_DIR = os.environ.get("ADAPTER_DIR", "./fine-tuned-analyst/lora_adapters")
- DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
-
- def load_model_and_tokenizer():
-     tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
-     if tokenizer.pad_token is None:
-         tokenizer.pad_token = tokenizer.eos_token
-     model = AutoModelForCausalLM.from_pretrained(
-         BASE_MODEL,
-         torch_dtype=torch.float32 if DEVICE == "cpu" else torch.float16,
-         device_map="cpu" if DEVICE == "cpu" else "auto",
-         trust_remote_code=True,
-         low_cpu_mem_usage=True,
-     )
-     model = PeftModel.from_pretrained(model, ADAPTER_DIR)
-     model.to(DEVICE)
-     model.eval()
-     return model, tokenizer
-
-
- def generate(prompt: str, max_new_tokens: int = 256) -> str:
-     model, tokenizer = load_model_and_tokenizer()
-     inputs = tokenizer(prompt, return_tensors="pt").to(DEVICE)
-     with torch.no_grad():
-         out = model.generate(
-             **inputs,
-             max_new_tokens=max_new_tokens,
-             do_sample=True,
-             top_p=0.9,
-             temperature=0.7,
-             eos_token_id=tokenizer.eos_token_id,
-             pad_token_id=tokenizer.pad_token_id,
-         )
-     text = tokenizer.decode(out[0], skip_special_tokens=True)
-     return text
-
-
- if __name__ == "__main__":
-     # A tiny prompt using the same schema
-     code = """def add_item(item, items=[]):\n    items.append(item)\n    return items\n"""
-     inst = (
-         "<s>[INST] Analyze this code for bugs, performance, and security issues. "
-         "Give a quality score from 1-100 and provide a detailed analysis. \n\nCode:\n```" + code + "``` [/INST]"
-     )
-     print("Device:", DEVICE)
-     print("Base model:", BASE_MODEL)
-     print("Adapters:", ADAPTER_DIR)
-     print("\n--- Generated Output ---\n")
-     print(generate(inst, max_new_tokens=200))
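The demo's `__main__` block hard-codes its `<s>[INST] ... [/INST]` instruction prompt inline. The same schema can be factored into a small helper (a sketch; `build_inst_prompt` is illustrative and not part of the repo):

```python
def build_inst_prompt(code: str) -> str:
    """Wrap source code in the [INST] schema used by the deleted inference demo."""
    return (
        "<s>[INST] Analyze this code for bugs, performance, and security issues. "
        "Give a quality score from 1-100 and provide a detailed analysis. "
        "\n\nCode:\n```" + code + "``` [/INST]"
    )

# Same shape as the demo's hand-built `inst` string
prompt = build_inst_prompt("def add(a, b):\n    return a + b\n")
```

Keeping the prompt template in one function makes it easier to guarantee that inference prompts match the schema the adapters were trained on.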
matrix_analyzer_optimized.py DELETED
@@ -1,262 +0,0 @@
- #!/usr/bin/env python3
- """
- Optimized Matrix Code Analyzer with CodeT5+
-
- This integrates the optimized CodeT5+ analyzer into your existing
- Matrix-themed Streamlit application with speed optimizations.
-
- Author: AI Code Analyzer Project
- Date: 2025
- """
-
- import streamlit as st
- import time
- import torch
- from optimized_code_analyzer import OptimizedCodeAnalyzer
-
- # Page configuration
- st.set_page_config(
-     page_title="AI Code Analyzer - Optimized",
-     page_icon="🤖",
-     layout="wide",
-     initial_sidebar_state="expanded"
- )
-
- # Custom CSS for Matrix theme
- st.markdown("""
- <style>
-     .main {
-         background-color: #0a0a0a;
-         color: #00ff00;
-     }
-
-     .stApp {
-         background-color: #0a0a0a;
-     }
-
-     .matrix-header {
-         background: linear-gradient(90deg, #00ff00, #008800);
-         color: #000000;
-         padding: 20px;
-         border-radius: 10px;
-         text-align: center;
-         margin-bottom: 30px;
-         font-family: 'Courier New', monospace;
-     }
-
-     .analysis-box {
-         background-color: #001100;
-         border: 2px solid #00ff00;
-         border-radius: 10px;
-         padding: 20px;
-         margin: 10px 0;
-     }
-
-     .speed-indicator {
-         background-color: #002200;
-         border: 1px solid #00ff00;
-         border-radius: 5px;
-         padding: 10px;
-         margin: 10px 0;
-     }
-
-     .cache-info {
-         background-color: #000800;
-         border: 1px solid #008800;
-         border-radius: 5px;
-         padding: 10px;
-         margin: 10px 0;
-         font-size: 12px;
-     }
- </style>
- """, unsafe_allow_html=True)
-
- @st.cache_resource
- def load_analyzer():
-     """
-     Load the optimized analyzer (cached for performance).
-     """
-     return OptimizedCodeAnalyzer()
-
- def main():
-     """
-     Main Streamlit application.
-     """
-     # Header
-     st.markdown("""
-     <div class="matrix-header">
-         <h1>🤖 AI Code Analyzer - Optimized</h1>
-         <p>Powered by CodeT5+ with Speed Optimizations</p>
-     </div>
-     """, unsafe_allow_html=True)
-
-     # Load analyzer
-     with st.spinner("🚀 Loading optimized CodeT5+ model..."):
-         analyzer = load_analyzer()
-
-     # Sidebar
-     st.sidebar.markdown("## ⚙️ Analysis Options")
-
-     analysis_mode = st.sidebar.selectbox(
-         "Analysis Mode",
-         ["Streaming (Interactive)", "Fast (Batch)"],
-         help="Streaming shows progress, Fast is optimized for speed"
-     )
-
-     show_progress = st.sidebar.checkbox(
-         "Show Progress Indicators",
-         value=True,
-         help="Display progress bars and timing information"
-     )
-
-     # Main content
-     col1, col2 = st.columns([1, 1])
-
-     with col1:
-         st.markdown("## 📝 Code Input")
-
-         # Code input
-         code_input = st.text_area(
-             "Enter your code:",
-             height=300,
-             placeholder="def hello():\n    print('Hello, World!')",
-             help="Paste your code here for analysis"
-         )
-
-         # Analysis button
-         analyze_button = st.button(
-             "🔍 Analyze Code",
-             type="primary",
-             use_container_width=True
-         )
-
-     with col2:
-         st.markdown("## 📊 Analysis Results")
-
-         if analyze_button and code_input.strip():
-             # Perform analysis
-             start_time = time.time()
-
-             if analysis_mode == "Streaming (Interactive)":
-                 # Streaming analysis
-                 st.markdown("### 🔄 Streaming Analysis")
-
-                 # Create placeholder for streaming results
-                 result_placeholder = st.empty()
-                 progress_placeholder = st.empty()
-
-                 # Show progress
-                 if show_progress:
-                     progress_bar = progress_placeholder.progress(0)
-                     status_text = st.empty()
-
-                 try:
-                     # Stream analysis
-                     analysis_text = ""
-                     for partial_result in analyzer.analyze_code_streaming(code_input, show_progress):
-                         analysis_text = partial_result
-
-                         # Update progress
-                         if show_progress:
-                             progress_bar.progress(50)
-                             status_text.text("🔍 Analyzing code...")
-
-                     # Complete analysis
-                     if show_progress:
-                         progress_bar.progress(100)
-                         status_text.text("✅ Analysis complete!")
-
-                     # Display results
-                     result_placeholder.markdown(f"""
-                     <div class="analysis-box">
-                         <h4>📄 Analysis Results:</h4>
-                         <p>{analysis_text}</p>
-                     </div>
-                     """, unsafe_allow_html=True)
-
-                 except Exception as e:
-                     st.error(f"❌ Analysis failed: {str(e)}")
-
-             else:
-                 # Fast analysis
-                 st.markdown("### ⚡ Fast Analysis")
-
-                 if show_progress:
-                     progress_bar = st.progress(0)
-                     status_text = st.empty()
-                     progress_bar.progress(25)
-                     status_text.text("🚀 Loading model...")
-
-                 try:
-                     # Perform fast analysis
-                     result = analyzer.analyze_code_fast(code_input)
-
-                     if show_progress:
-                         progress_bar.progress(100)
-                         status_text.text("✅ Analysis complete!")
-
-                     # Display results
-                     st.markdown(f"""
-                     <div class="analysis-box">
-                         <h4>📄 Analysis Results:</h4>
-                         <p>{result['analysis']}</p>
-                     </div>
-                     """, unsafe_allow_html=True)
-
-                 except Exception as e:
-                     st.error(f"❌ Analysis failed: {str(e)}")
-
-             # Show performance metrics
-             total_time = time.time() - start_time
-
-             st.markdown(f"""
-             <div class="speed-indicator">
-                 <h4>⚡ Performance Metrics:</h4>
-                 <p><strong>Total Time:</strong> {total_time:.2f}s</p>
-                 <p><strong>Analysis Mode:</strong> {analysis_mode}</p>
-                 <p><strong>Model:</strong> CodeT5+ (Optimized)</p>
-             </div>
-             """, unsafe_allow_html=True)
-
-         elif analyze_button and not code_input.strip():
-             st.warning("⚠️ Please enter some code to analyze!")
-
-         else:
-             st.info("👆 Enter code and click 'Analyze Code' to get started!")
-
-     # Model information
-     st.markdown("## 📊 Model Information")
-
-     model_info = analyzer.get_model_info()
-
-     col1, col2, col3 = st.columns(3)
-
-     with col1:
-         st.metric("Model Parameters", f"{model_info['parameters']:,}")
-
-     with col2:
-         st.metric("Cache Size", f"{model_info['cache_size']} analyses")
-
-     with col3:
-         st.metric("Device", str(model_info['device']))
-
-     # Cache information
-     st.markdown("""
-     <div class="cache-info">
-         <h4>💾 Cache Information:</h4>
-         <p>• Cached analyses are reused for identical code</p>
-         <p>• Cache improves speed for repeated analyses</p>
-         <p>• Cache is automatically managed</p>
-     </div>
-     """, unsafe_allow_html=True)
-
-     # Footer
-     st.markdown("---")
-     st.markdown("""
-     <div style="text-align: center; color: #008800;">
-         <p>🚀 Optimized AI Code Analyzer | Powered by CodeT5+ | Matrix Theme</p>
-     </div>
-     """, unsafe_allow_html=True)
-
- if __name__ == "__main__":
-     main()
matrix_final.py CHANGED
@@ -1,1086 +1,880 @@
1
- import streamlit as st
2
- import os
3
- import time
4
- import random
5
- import sys
6
  from dotenv import load_dotenv
 
7
  from analyzer import CodeAnalyzer
8
- from optimized_code_analyzer_enhanced import EnhancedCodeAnalyzer
 
 
9
 
10
- # Load environment variables
11
- load_dotenv()
12
 
13
- # Page config
14
  st.set_page_config(
15
- page_title="Matrix Code Analyzer - Final",
16
- page_icon="🟢",
17
- layout="wide",
18
- initial_sidebar_state="expanded"
19
  )
20
 
21
- # Matrix CSS - Enhanced with file upload styling
22
- st.markdown("""
 
23
  <style>
24
  @import url('https://fonts.googleapis.com/css2?family=Share+Tech+Mono:wght@400&display=swap');
25
  @import url('https://fonts.googleapis.com/css2?family=Orbitron:wght@400;700;900&display=swap');
26
-
27
- /* Matrix Background */
 
 
 
 
 
 
 
 
 
 
28
  .stApp {
29
- background: linear-gradient(135deg, #0d1b0d 0%, #000000 50%, #0d1b0d 100%);
30
- color: #00ff41;
31
- font-family: 'Share Tech Mono', monospace;
32
- }
33
-
34
- /* Matrix Code Rain Animation */
35
- .matrix-bg {
36
  position: fixed;
37
- top: 0;
38
- left: 0;
39
- width: 100%;
40
- height: 100%;
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
41
  overflow: hidden;
 
42
  z-index: -1;
43
- opacity: 0.1;
44
  }
45
 
46
- .matrix-char {
47
  position: absolute;
48
- color: #00ff41;
 
49
  font-family: 'Share Tech Mono', monospace;
50
  font-size: 14px;
51
- animation: matrix-fall linear infinite;
52
- }
53
-
54
- @keyframes matrix-fall {
55
- 0% { transform: translateY(-100vh); opacity: 1; }
56
- 100% { transform: translateY(100vh); opacity: 0; }
57
- }
58
-
59
- /* Main Content Styling */
60
- .main .block-container {
61
- padding-top: 2rem;
62
- background: rgba(0, 0, 0, 0.8);
63
- border: 1px solid #00ff41;
64
- border-radius: 10px;
65
- box-shadow: 0 0 20px rgba(0, 255, 65, 0.3);
66
- }
67
-
68
- /* Title Styling */
69
- h1 {
70
- font-family: 'Orbitron', monospace !important;
71
- color: #00ff41 !important;
72
- text-align: center !important;
73
- text-shadow: 0 0 10px #00ff41, 0 0 20px #00ff41, 0 0 30px #00ff41;
74
- font-weight: 900 !important;
75
- margin-bottom: 2rem !important;
76
- animation: glow 2s ease-in-out infinite alternate;
77
- }
78
-
79
- @keyframes glow {
80
- from { text-shadow: 0 0 5px #00ff41, 0 0 10px #00ff41, 0 0 15px #00ff41; }
81
- to { text-shadow: 0 0 10px #00ff41, 0 0 20px #00ff41, 0 0 30px #00ff41; }
82
- }
83
-
84
- /* Sidebar Styling */
85
- .css-1d391kg {
86
- background: rgba(0, 0, 0, 0.9) !important;
87
- border: 1px solid #00ff41 !important;
88
- border-radius: 10px !important;
89
- }
90
-
91
- /* Buttons */
92
- .stButton > button {
93
- background: linear-gradient(45deg, #003300, #006600) !important;
94
- color: #00ff41 !important;
95
- border: 2px solid #00ff41 !important;
96
- border-radius: 5px !important;
97
- font-family: 'Orbitron', monospace !important;
98
- font-weight: bold !important;
99
- text-transform: uppercase !important;
100
- transition: all 0.3s ease !important;
101
- box-shadow: 0 0 10px rgba(0, 255, 65, 0.3) !important;
102
- }
103
-
104
- .stButton > button:hover {
105
- background: linear-gradient(45deg, #006600, #00aa00) !important;
106
- box-shadow: 0 0 20px rgba(0, 255, 65, 0.6) !important;
107
- transform: translateY(-2px) !important;
108
- }
109
-
110
- /* File uploader styling */
111
- .stFileUploader > div {
112
- background: rgba(0, 51, 0, 0.3) !important;
113
- border: 2px dashed #00ff41 !important;
114
- border-radius: 10px !important;
115
- padding: 20px !important;
116
- }
117
-
118
- .stFileUploader label {
119
- color: #00ff41 !important;
120
- font-family: 'Orbitron', monospace !important;
121
- font-weight: bold !important;
122
- }
123
-
124
- /* Select boxes */
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
125
  .stSelectbox > div > div {
126
- background: rgba(0, 51, 0, 0.8) !important;
127
- border: 1px solid #00ff41 !important;
128
- color: #00ff41 !important;
129
- }
130
-
131
- /* Text areas */
132
- .stTextArea > div > div > textarea {
133
- background: rgba(0, 0, 0, 0.9) !important;
134
- border: 1px solid #00ff41 !important;
135
- color: #00ff41 !important;
136
- font-family: 'Share Tech Mono', monospace !important;
137
- }
138
-
139
- /* Metrics */
140
- .css-1xarl3l {
141
- background: rgba(0, 51, 0, 0.3) !important;
142
- border: 1px solid #00ff41 !important;
143
- border-radius: 5px !important;
144
- box-shadow: 0 0 10px rgba(0, 255, 65, 0.2) !important;
145
- }
146
-
147
- /* Success/Info/Warning messages */
148
- .stSuccess {
149
- background: rgba(0, 255, 65, 0.1) !important;
150
- border: 1px solid #00ff41 !important;
151
- color: #00ff41 !important;
152
- }
153
-
154
- .stInfo {
155
- background: rgba(0, 255, 255, 0.1) !important;
156
- border: 1px solid #00ffff !important;
157
- color: #00ffff !important;
158
- }
159
-
160
- .stWarning {
161
- background: rgba(255, 255, 0, 0.1) !important;
162
- border: 1px solid #ffff00 !important;
163
- color: #ffff00 !important;
164
- }
165
-
166
- .stError {
167
- background: rgba(255, 0, 0, 0.1) !important;
168
- border: 1px solid #ff0000 !important;
169
- color: #ff0000 !important;
170
- }
171
-
172
- /* Code blocks */
173
- .stCode {
174
- background: rgba(0, 0, 0, 0.9) !important;
175
- border: 1px solid #00ff41 !important;
176
- color: #00ff41 !important;
177
- }
178
-
179
- /* Tabs */
 
 
 
 
 
 
 
 
180
  .stTabs [data-baseweb="tab-list"] {
181
- background: rgba(0, 0, 0, 0.8) !important;
182
- border-bottom: 2px solid #00ff41 !important;
 
 
183
  }
184
 
185
  .stTabs [data-baseweb="tab"] {
186
- background: rgba(0, 51, 0, 0.3) !important;
187
- color: #00ff41 !important;
188
- border: 1px solid #00ff41 !important;
189
- font-family: 'Orbitron', monospace !important;
 
190
  }
191
 
192
  .stTabs [aria-selected="true"] {
193
- background: rgba(0, 255, 65, 0.2) !important;
194
- box-shadow: 0 0 10px rgba(0, 255, 65, 0.5) !important;
195
- }
196
-
197
- /* Matrix Terminal Effect */
198
- .matrix-terminal {
199
- background: rgba(0, 0, 0, 0.95) !important;
200
- border: 2px solid #00ff41 !important;
201
- border-radius: 10px !important;
202
- padding: 20px !important;
203
- font-family: 'Share Tech Mono', monospace !important;
204
- color: #00ff41 !important;
205
- box-shadow: 0 0 30px rgba(0, 255, 65, 0.4) !important;
206
- position: relative !important;
207
- }
208
-
209
- .matrix-terminal::before {
210
- content: "MATRIX_ANALYZER_v3.0 > OPERATIONAL" !important;
211
- position: absolute !important;
212
- top: -15px !important;
213
- left: 20px !important;
214
- background: #000000 !important;
215
- padding: 0 10px !important;
216
- color: #00ff41 !important;
217
- font-size: 12px !important;
218
- font-weight: bold !important;
219
- }
220
-
221
- /* Checkbox styling */
222
- .stCheckbox > label {
223
- color: #00ff41 !important;
224
- font-family: 'Orbitron', monospace !important;
225
- }
226
-
227
- /* Loading spinner */
228
- .stSpinner > div {
229
- border-color: #00ff41 transparent #00ff41 transparent !important;
230
- }
231
-
232
- /* Expander */
233
- .streamlit-expanderHeader {
234
- background: rgba(0, 51, 0, 0.3) !important;
235
- border: 1px solid #00ff41 !important;
236
- color: #00ff41 !important;
237
- }
238
-
239
- /* File info styling */
240
- .file-info {
241
- background: rgba(0, 255, 65, 0.1);
242
- border: 1px solid #00ff41;
243
- padding: 10px;
244
- border-radius: 5px;
245
- margin: 10px 0;
246
- font-family: 'Share Tech Mono', monospace;
247
- }
248
- </style>
249
 
250
- <div class="matrix-bg" id="matrix-bg"></div>
 
 
 
 
 
 
 
 
251
 
252
- <script>
253
- function createMatrixRain() {
254
- const chars = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789@#$%^&*()_+-=[]{}|;:,.<>?";
255
- const container = document.getElementById('matrix-bg');
256
-
257
- for (let i = 0; i < 50; i++) {
258
- const char = document.createElement('div');
259
- char.className = 'matrix-char';
260
- char.textContent = chars[Math.floor(Math.random() * chars.length)];
261
- char.style.left = Math.random() * 100 + '%';
262
- char.style.animationDuration = (Math.random() * 3 + 2) + 's';
263
- char.style.animationDelay = Math.random() * 2 + 's';
264
- container.appendChild(char);
265
- }
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
266
  }
 
 
 
 
 
 
 
 
 
 
 
267
 
268
- // Create matrix rain effect
269
- setTimeout(createMatrixRain, 100);
270
- </script>
271
- """, unsafe_allow_html=True)
272
 
273
- # Initialize analyzers
274
  def get_analyzer():
275
- # Force reimport to ensure latest code
276
- import importlib
277
- import analyzer.code_analyzer
278
- importlib.reload(analyzer.code_analyzer)
279
- from analyzer.code_analyzer import CodeAnalyzer
280
- return CodeAnalyzer()
281
 
282
  analyzer = get_analyzer()
 
283
 
284
- # Local analyzer (cached)
285
- @st.cache_resource
286
- def get_local_analyzer(model_type="codet5", remote_url=None):
287
- if remote_url:
288
- return EnhancedCodeAnalyzer(
289
- model_type="deepseek-finetuned-remote",
290
- remote_api_url=remote_url
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
291
  )
292
- else:
293
- return EnhancedCodeAnalyzer(
294
- model_type=model_type,
295
- precision="fp16",
296
- quick_max_new_tokens=180,
297
- detailed_max_new_tokens=300,
 
 
 
298
  )

- def display_matrix_analysis_result(result: dict, model_name: str):
- """Display analysis result in clean, readable horizontal blocks."""
- if 'error' in result:
- st.error(f"🚨 SYSTEM ERROR: {result['error']}")
  return
-
- # Quality score with modern styling
- score = result['quality_score']
- if score >= 80:
- score_color = "#00ff41"
- status = "EXCELLENT"
- elif score >= 60:
- score_color = "#ffff00"
- status = "ACCEPTABLE"
- else:
- score_color = "#ff0000"
- status = "NEEDS_WORK"
-
- # Header with score
- st.markdown(f"""
- <div style="background: linear-gradient(135deg, rgba(0,255,65,0.15), rgba(0,255,65,0.05));
- border: 2px solid #00ff41; border-radius: 15px; padding: 25px; margin: 20px 0;
- text-align: center;">
- <h2 style="color: {score_color}; margin-bottom: 15px; font-size: 1.8rem;">
- {model_name} Analysis
- </h2>
- <div style="display: flex; justify-content: center; gap: 30px; flex-wrap: wrap; color: #ffffff;">
- <div style="text-align: center;">
- <div style="font-size: 2rem; color: {score_color}; font-weight: bold;">{score}/100</div>
- <div style="font-size: 0.9rem; opacity: 0.8;">{status}</div>
- </div>
- <div style="text-align: center;">
- <div style="font-size: 1.2rem; color: #00ff41; font-weight: bold;">{result['language'].upper()}</div>
- <div style="font-size: 0.9rem; opacity: 0.8;">Language</div>
- </div>
- <div style="text-align: center;">
- <div style="font-size: 1.2rem; color: #00ff41; font-weight: bold;">{result['line_count']}</div>
- <div style="font-size: 0.9rem; opacity: 0.8;">Lines</div>
- </div>
  </div>
- </div>
- """, unsafe_allow_html=True)
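The score-banding logic in the removed function above is a small pure mapping; extracted as a testable helper it reads (function name is illustrative, thresholds and colors are from the code above):

```python
def score_band(score: int) -> tuple[str, str]:
    """Map a 0-100 quality score to (status, hex color), mirroring the thresholds above."""
    if score >= 80:
        return "EXCELLENT", "#00ff41"   # bright matrix green
    if score >= 60:
        return "ACCEPTABLE", "#ffff00"  # yellow
    return "NEEDS_WORK", "#ff0000"      # red
```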
-
- # Summary
- if result.get('summary'):
- st.markdown("### 📋 Code Overview")
- st.markdown(f"""
- <div style="background: rgba(0,0,0,0.6); border: 1px solid #00ff41; border-radius: 10px;
- padding: 20px; margin: 20px 0;">
- <p style="color: #ffffff; font-size: 18px; line-height: 1.6; text-align: center; margin: 0;">
- {result['summary']}
- </p>
  </div>
- """, unsafe_allow_html=True)
-
- # Analysis results in horizontal blocks
- st.markdown("### 📊 Analysis Results")
-
- # Bug Detection Block (Full width)
- bug_items = result.get('bugs', [])
- if bug_items:
- bug_text = " • ".join(bug_items[:3]) # Join with bullets for horizontal reading
- else:
- bug_text = "No critical bugs detected • Code logic appears sound • Edge cases handled well"
-
- st.markdown(f"""
- <div style="background: linear-gradient(135deg, rgba(255,100,100,0.1), rgba(150,0,0,0.1));
- border: 2px solid #ff6b6b; border-radius: 15px; padding: 25px; margin: 15px 0;">
- <h3 style="color: #ff6b6b; margin-bottom: 15px; text-align: center; font-size: 1.4rem;">
- 🐛 Bug Detection
- </h3>
- <p style="color: #ffffff; font-size: 16px; line-height: 1.6; text-align: center; margin: 0;">
- {bug_text}
- </p>
- </div>
- """, unsafe_allow_html=True)
-
- # Security Vulnerabilities Block (Full width)
- security_items = result.get('security_vulnerabilities', [])
- if security_items:
- security_text = " • ".join(security_items[:3]) # Join with bullets for horizontal reading
- else:
- security_text = "No security vulnerabilities found • Follows security best practices • Input validation looks good"
-
- st.markdown(f"""
- <div style="background: linear-gradient(135deg, rgba(200,0,200,0.1), rgba(100,0,100,0.1));
- border: 2px solid #ff00ff; border-radius: 15px; padding: 25px; margin: 15px 0;">
- <h3 style="color: #ff00ff; margin-bottom: 15px; text-align: center; font-size: 1.4rem;">
- 🔒 Security Check
- </h3>
- <p style="color: #ffffff; font-size: 16px; line-height: 1.6; text-align: center; margin: 0;">
- {security_text}
- </p>
- </div>
- """, unsafe_allow_html=True)
-
- # Code Quality & Quick Fixes Block (Full width)
- quality_items = result.get('quality_issues', []) + result.get('quick_fixes', [])
- if quality_items:
- quality_text = " • ".join(quality_items[:3]) # Join with bullets for horizontal reading
- else:
- quality_text = "Code is well-structured • Good naming conventions • Easy to read and maintain"
-
- st.markdown(f"""
- <div style="background: linear-gradient(135deg, rgba(0,200,255,0.1), rgba(0,100,150,0.1));
- border: 2px solid #00ccff; border-radius: 15px; padding: 25px; margin: 15px 0;">
- <h3 style="color: #00ccff; margin-bottom: 15px; text-align: center; font-size: 1.4rem;">
- 📝 Code Quality
- </h3>
- <p style="color: #ffffff; font-size: 16px; line-height: 1.6; text-align: center; margin: 0;">
- {quality_text}
- </p>
- </div>
- """, unsafe_allow_html=True)
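The blocks above interpolate model-generated text straight into `unsafe_allow_html` f-strings. Since the new file imports `html`, a safer join helper would escape each finding first; a minimal sketch (the function name is illustrative, not from the app):

```python
import html

def render_block_text(items, fallback):
    """Join up to three findings with bullets, escaping them first so model
    output cannot inject markup into an unsafe_allow_html f-string."""
    if not items:
        return fallback
    return " • ".join(html.escape(i) for i in items[:3])
```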
-
- # Legacy sections (only if new sections are empty)
- if not result.get('bugs') and not result.get('security_vulnerabilities') and not result.get('quality_issues'):
- legacy_col1, legacy_col2 = st.columns(2)
-
- with legacy_col1:
- # Legacy strengths
- if result.get('strengths'):
- st.markdown("#### ✅ Strengths")
- for strength in result['strengths'][:3]:
- st.success(f"✓ {strength}")
-
- # Legacy issues
- if result.get('issues'):
- st.markdown("#### ⚠️ Issues")
- for issue in result['issues'][:3]:
- st.warning(f"! {issue}")
-
- with legacy_col2:
- # Legacy suggestions
- if result.get('suggestions'):
- st.markdown("#### 💡 Suggestions")
- for suggestion in result['suggestions'][:3]:
- st.info(f"→ {suggestion}")
-
- # Legacy security concerns
- if result.get('security_concerns'):
- st.markdown("#### 🔒 Security Concerns")
- for concern in result['security_concerns'][:3]:
- st.error(f"⚠ {concern}")
-
- # Expandable raw response (moved to bottom and less prominent)
- with st.expander("🔍 View Detailed Analysis", expanded=False):
- st.code(result.get('raw_response', 'NO_DATA_AVAILABLE'), language='text')
-
- def display_github_analysis_result(result: dict, model_name: str):
- """Display GitHub repository analysis result in clean, readable horizontal blocks."""
- if 'error' in result:
- st.error(f"🚨 GITHUB ANALYSIS ERROR: {result['error']}")
  return
-
- # Repository info in a clean header
- if result.get('repository_info'):
- repo_info = result['repository_info']
- st.markdown(f"""
- <div style="background: linear-gradient(135deg, rgba(0,255,65,0.15), rgba(0,255,65,0.05));
- border: 2px solid #00ff41; border-radius: 15px; padding: 25px; margin: 20px 0;
- text-align: center;">
- <h2 style="color: #00ff41; margin-bottom: 20px; font-size: 1.8rem;">
- 📦 {repo_info['name']}
- </h2>
- <div style="display: flex; justify-content: center; gap: 30px; flex-wrap: wrap; color: #ffffff;">
- <div style="text-align: center;">
- <div style="font-size: 1.5rem; color: #00ff41; font-weight: bold;">{repo_info['language']}</div>
- <div style="font-size: 0.9rem; opacity: 0.8;">Language</div>
- </div>
- <div style="text-align: center;">
- <div style="font-size: 1.5rem; color: #00ff41; font-weight: bold;">⭐ {repo_info['stars']}</div>
- <div style="font-size: 0.9rem; opacity: 0.8;">Stars</div>
- </div>
- <div style="text-align: center;">
- <div style="font-size: 1.5rem; color: #00ff41; font-weight: bold;">🔀 {repo_info['forks']}</div>
- <div style="font-size: 0.9rem; opacity: 0.8;">Forks</div>
- </div>
- </div>
- <p style="color: #ffffff; margin-top: 15px; font-style: italic;">
- "{repo_info['description']}"
- </p>
  </div>
- """, unsafe_allow_html=True)
-
- # Analysis results in horizontal blocks
- st.markdown("### 📊 Analysis Results")
-
- # Architecture Quality Block (Full width)
- arch_items = result.get('architecture_quality', [])
- if arch_items:
- arch_text = " • ".join(arch_items[:3]) # Join with bullets for horizontal reading
- else:
- arch_text = "Well-structured repository • Good organization • Follows best practices"
-
- st.markdown(f"""
- <div style="background: linear-gradient(135deg, rgba(0,255,65,0.1), rgba(0,100,30,0.1));
- border: 2px solid #00ff41; border-radius: 15px; padding: 25px; margin: 15px 0;">
- <h3 style="color: #00ff41; margin-bottom: 15px; text-align: center; font-size: 1.4rem;">
- 🏗️ Code Architecture
- </h3>
- <p style="color: #ffffff; font-size: 16px; line-height: 1.6; text-align: center; margin: 0;">
- {arch_text}
- </p>
- </div>
- """, unsafe_allow_html=True)
-
- # Critical Issues Block (Full width)
- critical_items = result.get('critical_issues', [])
- if critical_items:
- critical_text = " • ".join(critical_items[:3]) # Join with bullets for horizontal reading
- else:
- critical_text = "No major security vulnerabilities found • Code appears well-maintained • No critical bugs detected"
-
- st.markdown(f"""
- <div style="background: linear-gradient(135deg, rgba(255,100,100,0.1), rgba(150,0,0,0.1));
- border: 2px solid #ff6b6b; border-radius: 15px; padding: 25px; margin: 15px 0;">
- <h3 style="color: #ff6b6b; margin-bottom: 15px; text-align: center; font-size: 1.4rem;">
- 🚨 Critical Issues
- </h3>
- <p style="color: #ffffff; font-size: 16px; line-height: 1.6; text-align: center; margin: 0;">
- {critical_text}
- </p>
- </div>
- """, unsafe_allow_html=True)
-
- # Improvement Priorities Block (Full width)
- improvement_items = result.get('improvement_priorities', [])
- if improvement_items:
- improvement_text = " • ".join(improvement_items[:3]) # Join with bullets for horizontal reading
- else:
- improvement_text = "Add more comprehensive documentation • Consider adding automated tests • Enhance error handling"
-
- st.markdown(f"""
- <div style="background: linear-gradient(135deg, rgba(255,200,0,0.1), rgba(150,100,0,0.1));
- border: 2px solid #ffd700; border-radius: 15px; padding: 25px; margin: 15px 0;">
- <h3 style="color: #ffd700; margin-bottom: 15px; text-align: center; font-size: 1.4rem;">
- 🎯 Priority Improvements
- </h3>
- <p style="color: #ffffff; font-size: 16px; line-height: 1.6; text-align: center; margin: 0;">
- {improvement_text}
- </p>
- </div>
- """, unsafe_allow_html=True)
-
- # Summary section at the bottom (if available)
- if result.get('project_overview'):
- st.markdown("### 💡 Key Insights")
- st.markdown(f"""
- <div style="background: rgba(0,0,0,0.6); border: 1px solid #00ff41; border-radius: 10px;
- padding: 20px; margin: 20px 0;">
- <p style="color: #ffffff; font-size: 18px; line-height: 1.6; text-align: center; margin: 0;">
- {result['project_overview']}
- </p>
  </div>
- """, unsafe_allow_html=True)
-
- # Expandable raw response (moved to bottom and less prominent)
- with st.expander("🔍 View Detailed Analysis", expanded=False):
- st.code(result.get('raw_response', 'NO_DATA_AVAILABLE'), language='text')
-
- # Header with Matrix effect
- st.markdown("""
- <h1 style="text-align: center;">
- 🤖 AI Code Analyzer
- </h1>
- <div style="text-align: center; margin-bottom: 30px;">
- <p style="color: #00ff41; font-family: 'Orbitron', monospace; font-size: 20px; margin-bottom: 10px;">
- <strong>Powered by Advanced AI Models</strong>
- </p>
- <div style="display: flex; justify-content: center; gap: 20px; flex-wrap: wrap;">
- <span style="background: rgba(0,255,65,0.2); border: 1px solid #00ff41; padding: 8px 16px;
- border-radius: 25px; font-size: 14px; color: #00ff41;">
- ✨ Bug Detection
- </span>
- <span style="background: rgba(0,255,65,0.2); border: 1px solid #00ff41; padding: 8px 16px;
- border-radius: 25px; font-size: 14px; color: #00ff41;">
- 🔒 Security Analysis
- </span>
- <span style="background: rgba(0,255,65,0.2); border: 1px solid #00ff41; padding: 8px 16px;
- border-radius: 25px; font-size: 14px; color: #00ff41;">
- 📦 GitHub Integration
- </span>
- <span style="background: rgba(0,255,65,0.2); border: 1px solid #00ff41; padding: 8px 16px;
- border-radius: 25px; font-size: 14px; color: #00ff41;">
- ⚡ Instant Results
- </span>
  </div>
- </div>
- """, unsafe_allow_html=True)

- # Sidebar - The Matrix Control Panel
- with st.sidebar:
- st.markdown("### 🎛️ Control Panel")
-
- # Model status
- st.markdown("#### Available AI Models")
- available_models = analyzer.available_models
-
- if not available_models:
- st.error("❌ No AI models detected")
- st.info("Please configure API keys in .env file")
- st.stop()
-
- # Display available models with modern styling
- for model, display_name in available_models.items():
- st.markdown(f"""
- <div style="background: rgba(0, 255, 65, 0.1); border: 1px solid #00ff41; padding: 10px; margin: 8px 0; border-radius: 8px;">
- ✅ <strong>{display_name}</strong> <span style="color: #00ff41; font-size: 12px;">[Ready]</span>
- </div>
- """, unsafe_allow_html=True)
-
- st.markdown("---")
-
- # Analysis Mode Selection
- st.markdown("#### Analysis Mode")
- analysis_mode = st.radio(
- "Choose what to analyze",
- ["Code Analysis", "GitHub Repository"],
- format_func=lambda x: f"📝 {x}" if x == "Code Analysis" else f"📦 {x}"
  )

- # Model Selection
- st.markdown("#### 🤖 AI Model Selection")
- model_choice = st.radio(
- "Choose Analysis Model:",
- [
- "CodeT5+ (Fast - Local)",
- "Fine-tuned DeepSeek (Accurate - Remote)"
- ],
- help="Local models run on your computer, Remote model runs on Hugging Face (always available)"
  )
-
- # Remote model configuration
- remote_api_url = None
- if "Remote" in model_choice:
- st.markdown("#### 🌐 Remote Model Configuration")
- remote_api_url = st.text_input(
- "Hugging Face Space URL:",
- value="https://arun3676-fine-tuned-code-analyzer.hf.space",
- help="Your Hugging Face Space URL"
- )
-
- # Test connection
- if st.button("🔗 Test Connection"):
- try:
- import requests
- response = requests.get(f"{remote_api_url}/health", timeout=5)
- if response.status_code == 200:
- st.success("✅ Connected to remote model!")
- else:
- st.error("❌ Connection failed")
- except Exception:
- st.error("❌ Cannot reach remote model")
-
- # Local model toggle and preset (for CodeT5+)
- if "CodeT5+" in model_choice:
- use_local = True
- local_preset = st.selectbox(
- "Local Inference Mode",
- ["Quick", "Detailed"],
- index=0,
- help="Quick = beams 1, ~180 tokens. Detailed = beams 2, ~240 tokens.",
  )
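The connection test above swallows every error with a broad except. A narrower health-check helper can be sketched with the stdlib alone (`urllib` instead of `requests`; the function name is illustrative):

```python
import urllib.error
import urllib.request

def check_health(base_url: str, timeout: float = 5.0) -> bool:
    """Return True if GET {base_url}/health answers 200 within the timeout."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, TimeoutError, OSError):
        # unreachable host, refused connection, DNS failure, or timeout
        return False
```

Catching only network-related exceptions keeps genuine programming errors (typos, bad f-strings) visible instead of being reported as "cannot reach remote model".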
  else:
- use_local = False
- local_preset = "Detailed"
-
- if analysis_mode == "GitHub Repository":
- st.markdown("#### Repository Analysis")
- github_url = st.text_input(
- "GitHub URL",
- placeholder="https://github.com/owner/repo",
- help="Enter a GitHub repository URL for analysis"
- )
-
- analyze_github_button = st.button(
- "🔍 Analyze Repository",
- type="primary",
- help="Analyze GitHub repository structure and code"
- )
-
- st.markdown("---")
-
- # Analysis options
- st.markdown("#### Analysis Settings")
-
- # Model selector with modern styling (disabled when using local)
- selected_model = st.selectbox(
- "Choose AI Model",
- options=list(available_models.keys()),
- format_func=lambda x: f"🤖 {available_models[x]}",
- disabled=use_local,
  )
-
- # Multi-model analysis toggle
- analyze_all = st.checkbox("🔄 Compare Multiple Models", value=False)
-
- # Language selection
- languages = ["auto-detect", "python", "javascript", "java", "cpp", "csharp", "go", "rust", "php", "ruby", "swift", "kotlin"]
- selected_language = st.selectbox(
- "Programming Language",
- languages,
- format_func=lambda x: "🔍 Auto-Detect" if x == "auto-detect" else f"💻 {x.upper()}"
  )
-
- st.markdown("---")
-
- # Sample code injection
- st.markdown("#### Quick Start Examples")
-
- if st.button("🐍 Try Python Example"):
- st.session_state.code_input = """def calculate_total(items):
- total = 0
- for item in items:
- total += item.price # Potential AttributeError
- return total
-
- # Missing validation
- items = None
- result = calculate_total(items) # This will crash
- print(f"Total: {result}")
- """
-
- if st.button("🌐 Try JavaScript Example"):
- st.session_state.code_input = """function processUser(user) {
- // Security issue: no input validation
- document.innerHTML = user.name; // XSS vulnerability
-
- // Logic error: undefined check
- if (user.age > 18) {
- return user.permissions.admin; // Potential TypeError
- }
-
- return false;
- }
 
- // Missing error handling
- const userData = getUser(); // Could be undefined
- processUser(userData);
- """

- # Main Terminal Interface
- col1, col2 = st.columns([1, 1])
-
- with col1:
- if analysis_mode == "Code Analysis":
- st.markdown("### 📝 Code Input")
-
- # File upload section
- st.markdown("#### 📁 Upload File")
- uploaded_file = st.file_uploader(
- "Choose a code file",
- type=['py', 'js', 'java', 'cpp', 'c', 'cs', 'go', 'rs', 'php', 'rb', 'swift', 'kt', 'txt'],
- help="Upload code files for AI analysis"
- )
-
- code_from_file = ""
- if uploaded_file is not None:
- # Read file content
- try:
- code_from_file = str(uploaded_file.read(), "utf-8")
- file_size = len(code_from_file)
- file_lines = len(code_from_file.splitlines())
-
- st.markdown(f"""
- <div class="file-info">
- ✅ <strong>File Uploaded Successfully</strong><br>
- 📄 <strong>Name:</strong> {uploaded_file.name}<br>
- 📏 <strong>Size:</strong> {file_size} bytes<br>
- 📊 <strong>Lines:</strong> {file_lines}<br>
- 🔍 <strong>Status:</strong> Ready for analysis
- </div>
- """, unsafe_allow_html=True)
-
- # Auto-populate the text area
- st.session_state.code_input = code_from_file
-
- except UnicodeDecodeError:
- st.error("🚨 File encoding error: Please use UTF-8 encoded files")
- except Exception as e:
- st.error(f"🚨 File read error: {str(e)}")
-
- # Code input with modern styling
- code_input = st.text_area(
- "Or paste your code here",
- value=st.session_state.get('code_input', ''),
- height=350,
- key="code_input",
- help="Paste code directly or upload file above"
- )
-
- # Modern analyze button
- analyze_button = st.button(
- "🚀 Analyze Code",
- type="primary",
- help="Analyze your code with AI"
  )
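The upload handler above rejects any non-UTF-8 file outright. An alternative is a decode helper with a Latin-1 fallback, since Latin-1 maps every byte to a character and so never raises (helper name is illustrative, not from the app):

```python
def decode_upload(raw: bytes) -> str:
    """Decode uploaded file bytes as UTF-8, falling back to Latin-1 so the
    app can still display something instead of failing on non-UTF-8 files."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("latin-1")
```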
-
- else: # GitHub Repository mode
- st.markdown("### 📦 GitHub Analysis")
-
- if 'github_url' in locals():
- if github_url:
- st.markdown(f"""
- <div class="file-info">
- ✅ <strong>Repository Detected</strong><br>
- 📦 <strong>URL:</strong> {github_url}<br>
- 🔍 <strong>Status:</strong> Ready for analysis
- </div>
- """, unsafe_allow_html=True)
-
- st.markdown("""
- #### 📋 What We'll Analyze
-
- **Repository Analysis includes:**
- - 🏗️ Project structure and organization
- - 📄 Key files (README, package.json, main source files)
- - 🎯 Code quality and architecture assessment
- - 🔒 Security vulnerabilities across the codebase
- - 💡 Best practices and improvement suggestions
-
- **Note:** Only public repositories can be analyzed.
- """)
-
- # Show sample repos
- st.markdown("#### 📚 Try These Sample Repositories")
- sample_repos = [
- "https://github.com/microsoft/vscode",
- "https://github.com/facebook/react",
- "https://github.com/python/cpython"
- ]
-
- for repo in sample_repos:
- if st.button(f"📦 {repo.split('/')[-1]}", key=repo):
- st.session_state.github_url_input = repo
-
- # Results Terminal
- with col2:
- st.markdown("### 📊 Analysis Results")
-
- # Code Analysis Results
- if analysis_mode == "Code Analysis":
- if analyze_button:
- if not code_input.strip():
- st.error("🚨 Please enter some code to analyze or upload a file!")
- else:
- with st.spinner("🟢 Analyzing your code... Please wait..."):
- if analyze_all:
- # Multi-model analysis
- st.markdown("#### 🔄 MULTI_NETWORK_ANALYSIS_INITIATED")
-
- results = analyzer.analyze_with_all_models(
- code_input,
- selected_language if selected_language != "auto-detect" else None
- )
-
- # Display comparison metrics
- comparison = analyzer.compare_analyses(results)
-
- # Matrix-styled metrics
- metrics_cols = st.columns(4)
- with metrics_cols[0]:
- st.metric("AVG_SCORE", f"{comparison['average_score']}/100")
- with metrics_cols[1]:
- st.metric("NETWORKS", len(results))
- with metrics_cols[2]:
- st.metric("PEAK_SCORE", f"{max(comparison['model_scores'].values())}/100")
- with metrics_cols[3]:
- st.metric("SCAN_TIME", f"{comparison['analysis_time']:.1f}s")
-
- # Create tabs for each neural network
- tab_names = [f"🤖 {available_models[key]}" for key in results.keys()]
- tabs = st.tabs(tab_names)
-
- for idx, (model_key, result) in enumerate(results.items()):
- with tabs[idx]:
- display_matrix_analysis_result(result, available_models[model_key])
-
- # Consensus findings with Matrix styling
- if comparison.get('consensus_bugs') or comparison.get('consensus_security'):
- st.markdown("### 🤝 NEURAL_CONSENSUS_DETECTED")
- st.markdown("""
- <div style="background: rgba(255, 0, 0, 0.1); border: 2px solid #ff0000; padding: 15px; border-radius: 10px;">
- <strong>CRITICAL_PATTERNS_IDENTIFIED_BY_MULTIPLE_NETWORKS:</strong>
- </div>
- """, unsafe_allow_html=True)
-
- if comparison.get('consensus_bugs'):
- st.markdown("#### 🐛 CONSENSUS_BUGS")
- for bug in comparison['consensus_bugs']:
- st.error(f"🚨 MULTIPLE_MODELS: {bug}")
-
- if comparison.get('consensus_security'):
- st.markdown("#### 🔒 CONSENSUS_SECURITY")
- for vuln in comparison['consensus_security']:
- st.error(f"🚨 SECURITY_ALERT: {vuln}")
-
- else:
- # Single model analysis
- if use_local or "Remote" in model_choice:
- # Determine model type and display name
- if "Remote" in model_choice:
- st.markdown("#### 🤖 FINE-TUNED_DEEPSEEK_REMOTE_ANALYSIS")
- model_type = "deepseek-finetuned-remote"
- display_name = "Fine-tuned DeepSeek (Remote)"
- else:
- st.markdown("#### 🤖 CODET5+_LOCAL_ANALYSIS")
- model_type = "codet5"
- display_name = "CodeT5+ Local"
-
- local = get_local_analyzer(model_type, remote_api_url)
-
- if local_preset == "Quick" or "Remote" in model_choice:
- result = local.analyze_code_fast(code_input, mode="quick")
- # adapt to display format
- display_matrix_analysis_result({
- "quality_score": result.get("quality_score", 0),
- "summary": "",
- "bugs": [],
- "quality_issues": [],
- "security_vulnerabilities": [],
- "quick_fixes": [],
- "language": "auto",
- "line_count": len(code_input.splitlines()),
- "raw_response": result["analysis"],
- }, f"{display_name} (Quick)")
- else:
- # streaming path – consume generator and show final
- local = get_local_analyzer(model_type, remote_api_url)
- final_text = None
- for chunk in local.analyze_code_streaming(code_input, show_progress=True, mode="detailed"):
- final_text = chunk
- display_matrix_analysis_result({
- "quality_score": 0,
- "summary": "",
- "bugs": [],
- "quality_issues": [],
- "security_vulnerabilities": [],
- "quick_fixes": [],
- "language": "auto",
- "line_count": len(code_input.splitlines()),
- "raw_response": final_text or "",
- }, f"{display_name} (Detailed)")
- else:
- st.markdown(f"#### 🤖 {available_models[selected_model].upper()}_ANALYSIS")
- result = analyzer.analyze_code(
- code_input,
- selected_model,
- selected_language if selected_language != "auto-detect" else None
- )
- display_matrix_analysis_result(result, available_models[selected_model])
-
- # GitHub Analysis Results
- else: # GitHub Repository mode
- if 'analyze_github_button' in locals() and analyze_github_button:
- if not github_url.strip():
- st.error("🚨 Please enter a GitHub repository URL!")
- else:
- with st.spinner("🟢 Analyzing GitHub repository... Please wait..."):
- result = analyzer.analyze_github_repo(github_url, selected_model)
- display_github_analysis_result(result, available_models[selected_model])
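The `consensus_bugs`/`consensus_security` fields surfaced above capture findings that several models agree on. The underlying idea can be sketched as counting distinct models per finding (function name and `min_models` threshold are illustrative, not from `compare_analyses`):

```python
def consensus(findings_by_model: dict[str, list[str]], min_models: int = 2) -> list[str]:
    """Return findings reported by at least `min_models` of the models."""
    counts: dict[str, int] = {}
    for findings in findings_by_model.values():
        for f in set(findings):  # count each model at most once per finding
            counts[f] = counts.get(f, 0) + 1
    return sorted(f for f, n in counts.items() if n >= min_models)
```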
-
963
- # Instructions for new users
964
- if (analysis_mode == "Code Analysis" and not code_input.strip() and not analyze_button) or \
965
- (analysis_mode == "GitHub Repository" and ('github_url' not in locals() or not github_url.strip()) and ('analyze_github_button' not in locals() or not analyze_github_button)):
966
-
967
- st.markdown("""
968
- <div class="matrix-terminal" style="margin: 20px 0; text-align: center;">
969
- <h2 style="color: #00ff41; margin-bottom: 30px; font-size: 2.5rem;">
970
- 🤖 AI Code Analyzer
971
- </h2>
972
- <p style="color: #00ff41; font-size: 1.3rem; margin-bottom: 30px;">
973
- <strong>What can I do for you?</strong>
974
- </p>
975
- </div>
976
- """, unsafe_allow_html=True)
977
-
978
- # Feature cards in columns
979
- feature_col1, feature_col2 = st.columns(2)
980
-
981
- with feature_col1:
982
- st.markdown("""
983
- <div style="background: linear-gradient(135deg, rgba(0,255,65,0.1), rgba(0,255,65,0.05));
984
- border: 2px solid #00ff41; border-radius: 15px; padding: 25px; margin: 10px 0;
985
- box-shadow: 0 0 20px rgba(0,255,65,0.3);">
986
- <h3 style="color: #00ff41; margin-bottom: 15px;">🐛 Find Bugs Instantly</h3>
987
- <p style="color: #ffffff; font-size: 16px; line-height: 1.6;">
988
- Spot crashes, logical errors, and edge cases before they hit production.
989
- </p>
990
- </div>
991
- """, unsafe_allow_html=True)
992
-
993
- st.markdown("""
994
- <div style="background: linear-gradient(135deg, rgba(0,255,65,0.1), rgba(0,255,65,0.05));
995
- border: 2px solid #00ff41; border-radius: 15px; padding: 25px; margin: 10px 0;
996
- box-shadow: 0 0 20px rgba(0,255,65,0.3);">
997
- <h3 style="color: #00ff41; margin-bottom: 15px;">🔒 Security Scanner</h3>
998
- <p style="color: #ffffff; font-size: 16px; line-height: 1.6;">
999
- Detect vulnerabilities like SQL injection, XSS, and insecure data handling.
1000
- </p>
1001
- </div>
1002
- """, unsafe_allow_html=True)
1003
-
1004
- with feature_col2:
1005
- st.markdown("""
1006
- <div style="background: linear-gradient(135deg, rgba(0,255,65,0.1), rgba(0,255,65,0.05));
1007
- border: 2px solid #00ff41; border-radius: 15px; padding: 25px; margin: 10px 0;
1008
- box-shadow: 0 0 20px rgba(0,255,65,0.3);">
1009
- <h3 style="color: #00ff41; margin-bottom: 15px;">📝 Code Quality Check</h3>
1010
- <p style="color: #ffffff; font-size: 16px; line-height: 1.6;">
1011
- Improve readability, maintainability, and follow best practices.
1012
- </p>
1013
- </div>
1014
- """, unsafe_allow_html=True)
1015
-
1016
- st.markdown("""
1017
- <div style="background: linear-gradient(135deg, rgba(0,255,65,0.1), rgba(0,255,65,0.05));
1018
- border: 2px solid #00ff41; border-radius: 15px; padding: 25px; margin: 10px 0;
1019
- box-shadow: 0 0 20px rgba(0,255,65,0.3);">
1020
- <h3 style="color: #00ff41; margin-bottom: 15px;">📦 Repository Analysis</h3>
1021
- <p style="color: #ffffff; font-size: 16px; line-height: 1.6;">
1022
- Analyze entire GitHub repos for structure, issues, and improvements.
1023
- </p>
1024
- </div>
1025
- """, unsafe_allow_html=True)
1026
-
1027
- # How to get started section
1028
- st.markdown("""
1029
- <div style="background: rgba(0,0,0,0.7); border: 1px solid #00ff41; border-radius: 10px;
1030
- padding: 20px; margin: 30px 0; text-align: center;">
1031
- <h3 style="color: #00ff41; margin-bottom: 20px;">🚀 Get Started in 3 Steps</h3>
1032
- <div style="display: flex; justify-content: space-around; flex-wrap: wrap;">
1033
- <div style="margin: 10px; color: #ffffff;">
1034
- <div style="background: #00ff41; color: #000; border-radius: 50%; width: 40px; height: 40px;
1035
- display: flex; align-items: center; justify-content: center; margin: 0 auto 10px;
1036
- font-weight: bold; font-size: 20px;">1</div>
1037
- <p><strong>Upload</strong><br>Paste code or GitHub URL</p>
1038
- </div>
1039
- <div style="margin: 10px; color: #ffffff;">
1040
- <div style="background: #00ff41; color: #000; border-radius: 50%; width: 40px; height: 40px;
1041
- display: flex; align-items: center; justify-content: center; margin: 0 auto 10px;
1042
- font-weight: bold; font-size: 20px;">2</div>
1043
- <p><strong>Choose AI</strong><br>Pick your preferred model</p>
1044
- </div>
1045
- <div style="margin: 10px; color: #ffffff;">
1046
- <div style="background: #00ff41; color: #000; border-radius: 50%; width: 40px; height: 40px;
1047
- display: flex; align-items: center; justify-content: center; margin: 0 auto 10px;
1048
- font-weight: bold; font-size: 20px;">3</div>
1049
- <p><strong>Analyze</strong><br>Get instant results</p>
1050
- </div>
1051
- </div>
1052
- </div>
1053
- """, unsafe_allow_html=True)
1054
-
1055
- # Supported languages as badges
1056
- st.markdown("""
1057
- <div style="text-align: center; margin: 20px 0;">
1058
- <p style="color: #00ff41; font-size: 18px; margin-bottom: 15px;"><strong>Supported Languages:</strong></p>
1059
- <div style="display: flex; justify-content: center; flex-wrap: wrap; gap: 10px;">
1060
- <span style="background: linear-gradient(45deg, #00ff41, #00cc33); color: #000; padding: 8px 15px;
1061
- border-radius: 20px; font-weight: bold; font-size: 14px;">Python</span>
1062
- <span style="background: linear-gradient(45deg, #00ff41, #00cc33); color: #000; padding: 8px 15px;
1063
- border-radius: 20px; font-weight: bold; font-size: 14px;">JavaScript</span>
1064
- <span style="background: linear-gradient(45deg, #00ff41, #00cc33); color: #000; padding: 8px 15px;
1065
- border-radius: 20px; font-weight: bold; font-size: 14px;">Java</span>
1066
- <span style="background: linear-gradient(45deg, #00ff41, #00cc33); color: #000; padding: 8px 15px;
1067
- border-radius: 20px; font-weight: bold; font-size: 14px;">C++</span>
1068
- <span style="background: linear-gradient(45deg, #00ff41, #00cc33); color: #000; padding: 8px 15px;
1069
- border-radius: 20px; font-weight: bold; font-size: 14px;">Go</span>
1070
- <span style="background: linear-gradient(45deg, #00ff41, #00cc33); color: #000; padding: 8px 15px;
1071
- border-radius: 20px; font-weight: bold; font-size: 14px;">Rust</span>
1072
- <span style="background: linear-gradient(45deg, #00ff41, #00cc33); color: #000; padding: 8px 15px;
1073
- border-radius: 20px; font-weight: bold; font-size: 14px;">+ More</span>
1074
- </div>
1075
- </div>
1076
- """, unsafe_allow_html=True)
1077
-
1078
- # Footer with Matrix signature
1079
- st.markdown("---")
1080
- st.markdown("""
1081
- <div style="text-align: center; color: #00ff41; font-family: 'Share Tech Mono', monospace; padding: 20px;">
1082
- <p>🟢 POWERED_BY_NEURAL_NETWORKS • OPENAI • ANTHROPIC • DEEPSEEK • GOOGLE 🟢</p>
1083
- <p><em>"There is no spoon... only code to analyze."</em></p>
1084
- <p style="font-size: 12px;">MATRIX_ANALYZER_v3.0 • BUILD_2024 • SECURITY_CLEARANCE_OMEGA • FILE_UPLOAD_ENABLED</p>
1085
- </div>
1086
- """, unsafe_allow_html=True)
 
1
+ import streamlit as st
2
  from dotenv import load_dotenv
3
+ import streamlit.components.v1 as components
4
  from analyzer import CodeAnalyzer
5
+ from typing import Any, Iterable, List
6
+ import os
7
+ import html
8
 
9
+ # Force reload environment variables
10
+ load_dotenv(override=True)
11
 
 
12
  st.set_page_config(
13
+ page_title="Matrix Code Analyzer",
14
+ page_icon="🧠",
15
+ layout="wide"
 
16
  )
17
 
18
+ # Debug sidebar removed for cleaner UI
19
+
20
+ CUSTOM_CSS = """
21
  <style>
22
  @import url('https://fonts.googleapis.com/css2?family=Share+Tech+Mono:wght@400&display=swap');
23
  @import url('https://fonts.googleapis.com/css2?family=Orbitron:wght@400;700;900&display=swap');
24
+ @import url('https://fonts.googleapis.com/css2?family=Space+Grotesk:wght@400;500;600;700&display=swap');
25
+
26
+ :root {
27
+ --matrix-green: #00ff41; /* Brighter Matrix Green */
28
+ --matrix-blue: #0ccffa;
29
+ --matrix-bg: #000000; /* Solid black for deeper matrix feel */
30
+ --text-main: #e7fceb;
31
+ --text-muted: #94a9a0;
32
+ --panel-bg: rgba(6, 16, 28, 0.55);
33
+ --panel-border: rgba(0, 255, 65, 0.25); /* Green border for panels */
34
+ }
35
+
36
  .stApp {
37
+ background: var(--matrix-bg);
38
+ color: var(--text-main);
39
+ font-family: 'Share Tech Mono', monospace; /* Default text to matrix font */
40
+ }
41
+
42
+ .stApp::before {
43
+ content: '';
44
  position: fixed;
45
+ inset: 0;
46
+ background: linear-gradient(135deg, rgba(0, 255, 130, 0.05) 0%, transparent 55%),
47
+ linear-gradient(225deg, rgba(12, 207, 250, 0.06) 0%, transparent 60%);
48
+ z-index: -3;
49
+ }
50
+
51
+ .stApp::after {
52
+ content: '';
53
+ position: fixed;
54
+ inset: 0;
55
+ background: radial-gradient(circle at 18% 20%, rgba(0, 255, 140, 0.08), transparent 58%),
56
+ radial-gradient(circle at 78% 12%, rgba(12, 207, 250, 0.07), transparent 48%);
57
+ z-index: -2;
58
+ }
59
+
60
+ [data-testid="stAppViewContainer"] {
61
+ position: relative;
62
+ z-index: 0;
63
+ }
64
+
65
+ html, body {
66
+ overflow-y: auto !important;
67
+ height: auto !important;
68
+ }
69
+
70
+ [data-testid="stAppViewContainer"] {
71
+ overflow-y: auto !important;
72
+ height: auto !important;
73
+ }
74
+
75
+ .stApp {
76
+ overflow-y: auto !important;
77
+ height: auto !important;
78
+ }
79
+
80
+ .block-container {
81
+ overflow-y: visible !important;
82
+ }
83
+
84
+ /* Ensure main content can scroll */
85
+ [data-testid="stAppViewContainer"] > div {
86
+ overflow-y: auto !important;
87
+ height: auto !important;
88
+ }
89
+
90
+ /* Override any Streamlit default height constraints */
91
+ .stApp > div {
92
+ height: auto !important;
93
+ min-height: 100vh !important;
94
+ }
95
+
96
+ .matrix-rain {
97
+ position: fixed;
98
+ inset: 0;
99
  overflow: hidden;
100
+ pointer-events: none;
101
  z-index: -1;
 
102
  }
103
 
104
+ .matrix-rain span {
105
  position: absolute;
106
+ top: -10%;
107
+ color: rgba(0, 255, 65, 0.45); /* Brighter matrix rain */
108
  font-family: 'Share Tech Mono', monospace;
109
  font-size: 14px;
110
+ animation: matrixFall linear infinite;
111
+ text-shadow: 0 0 10px rgba(0, 255, 65, 0.6); /* Stronger glow */
112
+ }
113
+
114
+ @keyframes matrixFall {
115
+ from { transform: translateY(-10%); opacity: 0; }
116
+ 20% { opacity: 1; }
117
+ to { transform: translateY(110vh); opacity: 0; }
118
+ }
119
+
120
+ [data-testid="stHeader"] {
121
+ background: transparent;
122
+ border-bottom: none;
123
+ }
124
+
125
+ .block-container {
126
+ padding-top: 3rem;
127
+ padding-bottom: 3.5rem;
128
+ max-width: 1140px;
129
+ }
130
+
131
+ a {
132
+ color: var(--matrix-green);
133
+ text-decoration: none;
134
+ }
135
+
136
+ a:hover {
137
+ text-decoration: underline;
138
+ }
139
+
140
+ .glass-panel {
141
+ background: rgba(0, 0, 0, 0.7);
142
+ border: 1px solid var(--panel-border);
143
+ border-radius: 12px; /* Slightly less rounded for matrix look */
144
+ padding: 28px;
145
+ backdrop-filter: blur(8px); /* Softer blur */
146
+ box-shadow: 0 0 25px rgba(0, 255, 65, 0.2); /* Green glow for panels */
147
+ margin-bottom: 1.6rem;
148
+ }
149
+
150
+ .hero {
151
+ text-align: center;
152
+ position: relative;
153
+ overflow: hidden;
154
+ }
155
+
156
+ .hero::before {
157
+ content: '';
158
+ position: absolute;
159
+ inset: 0;
160
+ opacity: 0.3;
161
+ background: linear-gradient(130deg, rgba(0, 255, 65, 0.15), rgba(0, 200, 255, 0.08));
162
+ pointer-events: none;
163
+ }
164
+
165
+ .hero-label {
166
+ font-family: 'Orbitron', sans-serif; /* Use Orbitron for hero label */
167
+ font-size: 0.9rem;
168
+ letter-spacing: 0.4em;
169
+ text-transform: uppercase;
170
+ color: var(--matrix-green);
171
+ margin-bottom: 0.6rem;
172
+ text-shadow: 0 0 8px rgba(0, 255, 65, 0.5);
173
+ }
174
+
175
+ .hero h1 {
176
+ font-family: 'Orbitron', sans-serif; /* Use Orbitron for hero title */
177
+ margin-bottom: 0.6rem;
178
+ font-size: clamp(2.5rem, 4vw, 3.8rem);
179
+ font-weight: 700;
180
+ letter-spacing: -0.05em;
181
+ position: relative;
182
+ z-index: 1;
183
+ color: var(--matrix-green);
184
+ text-shadow: 0 0 15px rgba(0, 255, 65, 0.8), 0 0 25px rgba(0, 255, 65, 0.6);
185
+ }
186
+
187
+ .hero .subline {
188
+ font-family: 'Share Tech Mono', monospace; /* Use Share Tech Mono for subline */
189
+ color: rgba(231, 252, 235, 0.8);
190
+ font-size: 1.1rem;
191
+ max-width: 700px;
192
+ margin: 0 auto;
193
+ line-height: 1.7;
194
+ position: relative;
195
+ z-index: 1;
196
+ }
197
+
198
+ .hero-badges {
199
+ display: flex;
200
+ gap: 0.85rem;
201
+ justify-content: center;
202
+ flex-wrap: wrap;
203
+ margin-top: 2rem;
204
+ position: relative;
205
+ z-index: 1;
206
+ }
207
+
208
+ .hero-badge {
209
+ padding: 0.4rem 1rem;
210
+ border-radius: 999px;
211
+ background: rgba(0, 255, 65, 0.15);
212
+ border: 1px solid rgba(0, 255, 65, 0.4);
213
+ color: var(--matrix-green);
214
+ font-size: 0.9rem;
215
+ font-weight: 500;
216
+ text-shadow: 0 0 5px rgba(0, 255, 65, 0.4);
217
+ }
218
+
219
+ .body-text {
220
+ color: var(--text-main);
221
+ font-family: 'Share Tech Mono', monospace;
222
+ font-size: 1rem;
223
+ line-height: 1.7;
224
+ margin: 0;
225
+ }
226
+
227
+ .settings-grid {
228
+ display: grid;
229
+ grid-template-columns: repeat(auto-fit, minmax(220px, 1fr));
230
+ gap: 1.2rem;
231
+ }
232
+
233
  .stSelectbox > div > div {
234
+ background: rgba(0, 0, 0, 0.8);
235
+ border-radius: 10px;
236
+ border: 1px solid rgba(0, 255, 65, 0.3);
237
+ color: var(--text-main);
238
+ font-family: 'Share Tech Mono', monospace;
239
+ }
240
+
241
+ .stSelectbox label, .stTextInput label, .stFileUploader label {
242
+ font-family: 'Orbitron', sans-serif;
243
+ font-weight: 600;
244
+ color: var(--matrix-green);
245
+ text-shadow: 0 0 5px rgba(0, 255, 65, 0.3);
246
+ }
247
+
248
+ .stTextInput > div > div > input {
249
+ border-radius: 10px;
250
+ background: rgba(0, 0, 0, 0.8);
251
+ border: 1px solid rgba(0, 255, 65, 0.25);
252
+ color: var(--text-main);
253
+ font-family: 'Share Tech Mono', monospace;
254
+ }
255
+
256
+ textarea {
257
+ border-radius: 12px !important;
258
+ background: rgba(0, 0, 0, 0.85) !important;
259
+ border: 1px solid rgba(0, 255, 65, 0.3) !important;
260
+ color: var(--text-main) !important;
261
+ font-family: 'Share Tech Mono', monospace !important;
262
+ resize: vertical;
263
+ }
264
+
265
+ textarea::placeholder,
266
+ .stTextInput > div > div > input::placeholder {
267
+ color: rgba(0, 255, 65, 0.5) !important;
268
+ }
269
+
270
+ .stFileUploader > div {
271
+ border-radius: 12px !important;
272
+ border: 2px dashed rgba(0, 255, 65, 0.5) !important;
273
+ background: rgba(0, 0, 0, 0.7) !important;
274
+ padding: 1.5rem 2rem !important;
275
+ }
276
+
277
+ .stButton > button {
278
+ height: 50px;
279
+ border-radius: 12px;
280
+ font-weight: 700;
281
+ letter-spacing: 0.05em;
282
+ background: linear-gradient(120deg, var(--matrix-green), rgba(0, 200, 255, 0.8));
283
+ color: #000;
284
+ border: none;
285
+ box-shadow: 0 0 20px rgba(0, 255, 65, 0.6);
286
+ text-transform: uppercase;
287
+ transition: all 0.3s ease;
288
+ }
289
+
290
+ .stButton > button:hover {
291
+ transform: translateY(-2px);
292
+ box-shadow: 0 0 35px rgba(0, 255, 65, 0.9);
293
+ background: linear-gradient(120deg, rgba(0, 255, 65, 0.95), rgba(0, 220, 255, 0.95));
294
+ }
295
+
296
  .stTabs [data-baseweb="tab-list"] {
297
+ background: rgba(0, 0, 0, 0.6);
298
+ border-radius: 12px;
299
+ padding: 0.5rem;
300
+ border: 1px solid rgba(0, 255, 65, 0.2);
301
  }
302
 
303
  .stTabs [data-baseweb="tab"] {
304
+ border-radius: 8px;
305
+ font-weight: 500;
306
+ color: rgba(0, 255, 65, 0.7);
307
+ font-family: 'Orbitron', sans-serif;
308
+ padding: 0.75rem 1.2rem;
309
  }
310
 
311
  .stTabs [aria-selected="true"] {
312
+ background: rgba(0, 255, 65, 0.1);
313
+ color: var(--matrix-green);
314
+ text-shadow: 0 0 5px rgba(0, 255, 65, 0.5);
315
+ }
 
316
 
317
+ .stTabs [data-baseweb="tab-panel"] {
318
+ background: rgba(0, 0, 0, 0.75);
319
+ border: 1px solid rgba(0, 255, 65, 0.2);
320
+ border-radius: 12px;
321
+ padding: 28px;
322
+ margin-top: 1.5rem;
323
+ backdrop-filter: blur(8px);
324
+ box-shadow: 0 0 25px rgba(0, 255, 65, 0.2);
325
+ }
326
 
327
+ .stats-grid {
328
+ display: grid;
329
+ grid-template-columns: repeat(auto-fit, minmax(160px, 1fr));
330
+ gap: 1.2rem;
331
+ }
332
+
333
+ .stats-card {
334
+ background: rgba(0, 0, 0, 0.6);
335
+ border: 1px solid rgba(0, 255, 65, 0.15);
336
+ border-radius: 12px;
337
+ padding: 1.5rem;
338
+ display: flex;
339
+ flex-direction: column;
340
+ gap: 0.4rem;
341
+ box-shadow: 0 0 15px rgba(0, 255, 65, 0.1);
342
+ }
343
+
344
+ .stats-card .label {
345
+ color: rgba(0, 255, 65, 0.8);
346
+ font-size: 0.8rem;
347
+ letter-spacing: 0.1em;
348
+ text-transform: uppercase;
349
+ font-family: 'Share Tech Mono', monospace;
350
+ }
351
+
352
+ .stats-card .value {
353
+ font-family: 'Orbitron', sans-serif;
354
+ font-size: 1.9rem;
355
+ font-weight: 700;
356
+ color: var(--matrix-green);
357
+ text-shadow: 0 0 10px rgba(0, 255, 65, 0.5);
358
+ }
359
+
360
+ .stats-card .status {
361
+ font-size: 0.95rem;
362
+ color: rgba(0, 255, 65, 0.7);
363
+ }
364
+
365
+ .list-card {
366
+ background: rgba(0, 0, 0, 0.65);
367
+ border: 1px solid rgba(0, 255, 65, 0.18);
368
+ border-radius: 12px;
369
+ padding: 1.5rem 1.8rem;
370
+ margin-bottom: 1.2rem;
371
+ box-shadow: 0 0 18px rgba(0, 255, 65, 0.15);
372
+ }
373
+
374
+ .list-card h3 {
375
+ font-family: 'Orbitron', sans-serif;
376
+ margin-bottom: 1rem;
377
+ font-size: 1.2rem;
378
+ font-weight: 600;
379
+ display: flex;
380
+ align-items: center;
381
+ gap: 0.7rem;
382
+ color: var(--matrix-green);
383
+ text-shadow: 0 0 8px rgba(0, 255, 65, 0.4);
384
+ }
385
+
386
+ .list-card ul {
387
+ margin: 0;
388
+ padding-left: 1.2rem;
389
+ color: rgba(231, 252, 235, 0.9);
390
+ font-family: 'Share Tech Mono', monospace;
391
+ display: grid;
392
+ gap: 0.6rem;
393
+ }
394
+
395
+ .list-card li {
396
+ line-height: 1.5;
397
+ position: relative;
398
+ }
399
+
400
+ .list-card li::before {
401
+ content: '•'; /* Use a bullet point for simple list styling */
402
+ position: absolute;
403
+ left: -1rem;
404
+ color: var(--matrix-green);
405
+ font-size: 1.2em;
406
+ line-height: 1;
407
+ text-shadow: 0 0 5px rgba(0, 255, 65, 0.7);
408
+ }
409
+
410
+ .model-chip {
411
+ display: inline-flex;
412
+ align-items: center;
413
+ gap: 0.6rem;
414
+ padding: 0.6rem 1rem;
415
+ border-radius: 999px;
416
+ background: rgba(0, 255, 65, 0.1);
417
+ border: 1px solid rgba(0, 255, 65, 0.3);
418
+ font-size: 0.95rem;
419
+ color: var(--matrix-green);
420
+ margin-bottom: 1.5rem;
421
+ font-family: 'Orbitron', sans-serif;
422
+ text-shadow: 0 0 6px rgba(0, 255, 65, 0.4);
423
+ }
424
+
425
+ .file-meta {
426
+ display: flex;
427
+ flex-wrap: wrap;
428
+ gap: 1rem;
429
+ margin-top: 0.6rem;
430
+ font-size: 0.9rem;
431
+ color: rgba(231, 252, 235, 0.7);
432
+ font-family: 'Share Tech Mono', monospace;
433
+ }
434
+
435
+ .file-meta span {
436
+ padding: 0.4rem 0.9rem;
437
+ border-radius: 999px;
438
+ background: rgba(0, 255, 65, 0.08);
439
+ border: 1px solid rgba(0, 255, 65, 0.2);
440
+ text-shadow: 0 0 4px rgba(0, 255, 65, 0.2);
441
+ }
442
+
443
+ .stAlert {
444
+ border-radius: 12px;
445
+ border: 1px solid rgba(0, 255, 65, 0.3);
446
+ background: rgba(0, 0, 0, 0.75);
447
+ color: var(--text-main);
448
+ font-family: 'Share Tech Mono', monospace;
449
+ box-shadow: 0 0 15px rgba(0, 255, 65, 0.25);
450
+ }
451
+
452
+ .streamlit-expanderHeader {
453
+ background: rgba(0, 0, 0, 0.6);
454
+ border-radius: 12px;
455
+ border: 1px solid rgba(0, 255, 65, 0.2);
456
+ color: var(--matrix-green);
457
+ font-family: 'Orbitron', sans-serif;
458
+ text-shadow: 0 0 5px rgba(0, 255, 65, 0.3);
459
+ }
460
+
461
+ .stSpinner > div {
462
+ border-top-color: var(--matrix-green) !important;
463
+ border-right-color: var(--matrix-blue) !important;
464
+ }
465
+
466
+ @media (max-width: 768px) {
467
+ .block-container {
468
+ padding-left: 1rem;
469
+ padding-right: 1rem;
470
+ }
471
+ .glass-panel, .stTabs [data-baseweb="tab-panel"] {
472
+ padding: 18px;
473
+ }
474
+ .stats-grid {
475
+ grid-template-columns: 1fr;
476
+ }
477
+ .hero h1 {
478
+ font-size: 2.5rem;
479
+ }
480
+ .hero-badges {
481
+ gap: 0.6rem;
482
+ }
483
+ .hero-badge {
484
+ font-size: 0.8rem;
485
+ padding: 0.3rem 0.7rem;
486
+ }
487
  }
488
+ </style>
489
+ """
490
+
491
+ st.markdown(CUSTOM_CSS, unsafe_allow_html=True)
492
+
493
+ st.markdown(
494
+ """
495
+ <div class="matrix-rain" id="matrix-rain"></div>
496
+ """,
497
+ unsafe_allow_html=True,
498
+ )
499
 
500
 
501
+ @st.cache_resource
502
  def get_analyzer():
503
+ return CodeAnalyzer(cache_dir=".analyzer_cache")
504
+
505
 
506
  analyzer = get_analyzer()
507
+ AVAILABLE_MODELS = analyzer.available_models
508
 
509
+ if not AVAILABLE_MODELS:
510
+ st.error("No AI models configured. Add API keys to your .env file and restart the app.")
511
+ st.stop()
512
+
513
+
514
+ LANGUAGE_DISPLAY = {
515
+ "auto": "Auto Detect",
516
+ "python": "Python",
517
+ "javascript": "JavaScript",
518
+ "typescript": "TypeScript",
519
+ "html": "HTML",
520
+ "css": "CSS",
521
+ "java": "Java",
522
+ "cpp": "C++",
523
+ "c": "C",
524
+ "csharp": "C#",
525
+ "go": "Go",
526
+ "rust": "Rust",
527
+ "php": "PHP",
528
+ "ruby": "Ruby",
529
+ "swift": "Swift",
530
+ "kotlin": "Kotlin",
531
+ }
532
+ LANGUAGE_OPTIONS = list(LANGUAGE_DISPLAY.keys())
533
+
534
+
535
+ def ensure_list(items: Any) -> List[str]:
536
+ if not items:
537
+ return []
538
+ if isinstance(items, str):
539
+ clean = items.strip()
540
+ return [clean] if clean else []
541
+ if isinstance(items, dict):
542
+ return [f"{key}: {value}" for key, value in items.items() if str(value).strip()]
543
+ if isinstance(items, Iterable):
544
+ values = []
545
+ for entry in items:
546
+ if entry is None:
547
+ continue
548
+ text = str(entry).strip()
549
+ if text:
550
+ values.append(text)
551
+ return values
552
+ return [str(items)]
553
+
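The `ensure_list` helper above normalizes whatever shape a model response field arrives in (string, dict, iterable, scalar) into a flat list of trimmed strings. A standalone sketch of the same behavior, runnable outside Streamlit (the name `normalize` is illustrative, not from the repo):

```python
from typing import Any, Iterable, List

def normalize(items: Any) -> List[str]:
    # Mirrors ensure_list: the order of checks matters, since str and dict
    # are both Iterable and must be handled before the generic branch.
    if not items:
        return []
    if isinstance(items, str):
        clean = items.strip()
        return [clean] if clean else []
    if isinstance(items, dict):
        return [f"{k}: {v}" for k, v in items.items() if str(v).strip()]
    if isinstance(items, Iterable):
        return [str(e).strip() for e in items if e is not None and str(e).strip()]
    return [str(items)]
```

Falsy inputs (None, empty string, empty dict) all collapse to an empty list, which is what lets `render_list_section` fall back to its placeholder text.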
554
+
555
+ def parse_score(raw: Any) -> float:
556
+ try:
557
+ return float(raw)
558
+ except (TypeError, ValueError):
559
+ return 0.0
560
+
561
+
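`parse_score` defends against model output that is a number, a numeric string, None, or garbage by coercing everything through `float` and defaulting to zero. A minimal sketch of that behavior:

```python
def parse_score(raw):
    # Any value float() rejects (None, "n/a", objects) becomes 0.0
    # instead of raising, so the stats row always renders.
    try:
        return float(raw)
    except (TypeError, ValueError):
        return 0.0
```

This is why a missing `quality_score` key simply shows as 0/100 rather than crashing the page.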
562
+ def score_badge(score: float) -> tuple[str, str]:
563
+ if score >= 80:
564
+ return "Excellent", "#00fba4"
565
+ if score >= 60:
566
+ return "Review Suggested", "#ffd76a"
567
+ return "Needs Attention", "#ff6b6b"
568
+
569
+
570
+ def render_list_section(title: str, icon: str, content: Any, fallback: str | None = None) -> None:
571
+ entries = ensure_list(content)
572
+ if entries:
573
+ items_html = "".join(f"<li>{html.escape(entry)}</li>" for entry in entries[:6])
574
+ st.markdown(
575
+ f"""
576
+ <div class="list-card">
577
+ <h3>{icon} {title}</h3>
578
+ <ul>{items_html}</ul>
579
+ </div>
580
+ """,
581
+ unsafe_allow_html=True,
582
  )
583
+ elif fallback:
584
+ st.markdown(
585
+ f"""
586
+ <div class="list-card">
587
+ <h3>{icon} {title}</h3>
588
+ <p class="body-text">{html.escape(fallback)}</p>
589
+ </div>
590
+ """,
591
+ unsafe_allow_html=True,
592
  )
593
 
594
+
595
+ def render_code_result(result: dict[str, Any], model_label: str) -> None:
596
+ if result.get("error"):
597
+ st.error(f"Analysis failed: {result['error']}")
598
  return
599
+
600
+ score = parse_score(result.get("quality_score", 0))
601
+ status_label, status_color = score_badge(score)
602
+ language = (result.get("language") or "auto").upper()
603
+ line_count = result.get("line_count", "-")
604
+ exec_time = parse_score(result.get("execution_time", 0.0))
605
+ cached_text = "Hit" if result.get("cached") else "Fresh"
606
+
607
+ stats = [
608
+ {"label": "Quality Score", "value": f"{int(round(score))}/100", "sub": status_label, "color": status_color},
609
+ {"label": "Language", "value": language, "sub": "Detected" if language != "AUTO" else "Auto"},
610
+ {"label": "Lines", "value": line_count, "sub": "Analyzed"},
611
+ {"label": "Latency", "value": f"{exec_time:.1f}s", "sub": "Runtime"},
612
+ {"label": "Cache", "value": cached_text, "sub": "Result Store"},
613
+ ]
614
+
615
+ # Render stats using native Streamlit components to avoid raw HTML showing
616
+ st.markdown(
617
+ f"""
618
+ <div class="glass-panel">
619
+ <div class="model-chip">🤖 {html.escape(model_label)}</div>
620
  </div>
621
+ """,
622
+ unsafe_allow_html=True,
623
+ )
624
+
625
+ cols = st.columns(len(stats))
626
+ for idx, stat in enumerate(stats):
627
+ with cols[idx]:
628
+ st.metric(label=stat["label"], value=stat["value"], delta=stat["sub"])
629
+
630
+ summary = result.get("summary")
631
+ if summary:
632
+ st.markdown(
633
+ f"""
634
+ <div class="glass-panel">
635
+ <h3>📋 Overview</h3>
636
+ <p class="body-text">{html.escape(summary)}</p>
637
  </div>
638
+ """,
639
+ unsafe_allow_html=True,
640
+ )
641
+
642
+ render_list_section("Highlights", "✨", result.get("strengths"))
643
+ render_list_section("Bug Detection", "🐞", result.get("bugs") or result.get("issues"), "No critical bugs were flagged.")
644
+ render_list_section("Security", "🔒", result.get("security_vulnerabilities") or result.get("security_concerns"), "No security vulnerabilities detected.")
645
+ render_list_section("Code Quality", "🧩", result.get("quality_issues"), "Structure looks solid and maintainable.")
646
+ render_list_section("Quick Fixes", "⚡", result.get("quick_fixes"), "No urgent fixes suggested.")
647
+ render_list_section("Suggestions", "💡", result.get("suggestions"))
648
+
649
+ raw = result.get("raw_response")
650
+ if raw:
651
+ with st.expander("View full model response", expanded=False):
652
+ st.code(raw, language="text")
653
+
654
+
655
+ def render_repo_result(result: dict[str, Any], model_label: str) -> None:
656
+ if result.get("error"):
657
+ st.error(f"Repository analysis failed: {result['error']}")
658
  return
659
+
660
+ info = result.get("repository_info", {})
661
+ repo_name = info.get("name", "Repository")
662
+ repo_desc = info.get("description") or "No description provided."
663
+ repo_url = result.get("repo_url") or st.session_state.get("repo_analysis_url")
664
+
665
+ repo_stats = [
666
+ {"label": "Primary Language", "value": info.get("language", "Unknown"), "sub": "Detected"},
667
+ {"label": "Stars", "value": info.get("stars", 0), "sub": "Community"},
668
+ {"label": "Forks", "value": info.get("forks", 0), "sub": "Collaboration"},
669
+ {"label": "Size", "value": f"{info.get('size', 0)} KB", "sub": "Repo Size"},
670
+ {"label": "Latency", "value": f"{parse_score(result.get('execution_time', 0.0)):.1f}s", "sub": "Runtime"},
671
+ ]
672
+
673
+ link_html = f'<p class="body-text"><a href="{html.escape(repo_url)}" target="_blank">View repository ↗</a></p>' if repo_url else ""
674
+
675
+ # Header card
676
+ st.markdown(
677
+ f"""
678
+ <div class="glass-panel">
679
+ <div class="model-chip">🤖 {html.escape(model_label)}</div>
680
+ <h2 style="margin-bottom:0.35rem;">{html.escape(repo_name)}</h2>
681
+ <p class="body-text" style="color: var(--text-muted); margin-bottom:1.2rem;">{html.escape(repo_desc)}</p>
682
+ {link_html}
683
  </div>
684
+ """,
685
+ unsafe_allow_html=True,
686
+ )
687
+
688
+ # Metrics row
689
+ cols = st.columns(len(repo_stats))
690
+ for idx, stat in enumerate(repo_stats):
691
+ with cols[idx]:
692
+ st.metric(label=stat["label"], value=stat["value"], delta=stat["sub"])
693
+
694
+ overview = result.get("project_overview")
695
+ if overview:
696
+ st.markdown(
697
+ f"""
698
+ <div class="glass-panel">
699
+ <h3>📋 Project Overview</h3>
700
+ <p class="body-text">{html.escape(overview)}</p>
701
+ </div>
702
+ """,
703
+ unsafe_allow_html=True,
704
+ )
705
+
706
+ render_list_section("Architecture Quality", "🏗️", result.get("architecture_quality"), "Project structure looks well organized.")
707
+ render_list_section("Critical Issues", "🚨", result.get("critical_issues"), "No critical issues were highlighted.")
708
+ render_list_section("Priority Improvements", "🎯", result.get("improvement_priorities"), "No immediate improvements suggested.")
709
+ render_list_section("Onboarding Guide", "🚀", result.get("onboarding_guide"), "No specific onboarding steps identified.")
710
+ render_list_section("Tech Stack", "🛠️", result.get("tech_stack_rationale"), "Tech stack details were not identified.")
711
+ render_list_section("API Endpoints", "🔌", result.get("api_endpoint_summary"), "No API endpoints were identified.")
712
+
713
+ raw = result.get("raw_response")
714
+ if raw:
715
+ with st.expander("View full model response", expanded=False):
716
+ st.code(raw, language="text")
717
+
718
+
719
+ if "code_input" not in st.session_state:
720
+ st.session_state.code_input = ""
721
+ if "code_file_meta" not in st.session_state:
722
+ st.session_state.code_file_meta = None
723
+ if "code_analysis_result" not in st.session_state:
724
+ st.session_state.code_analysis_result = None
725
+ if "code_analysis_model" not in st.session_state:
726
+ st.session_state.code_analysis_model = ""
727
+ if "repo_analysis_result" not in st.session_state:
728
+ st.session_state.repo_analysis_result = None
729
+ if "repo_analysis_model" not in st.session_state:
730
+ st.session_state.repo_analysis_model = ""
731
+ if "repo_analysis_url" not in st.session_state:
732
+ st.session_state.repo_analysis_url = ""
733
+
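The block above initializes seven session keys one `if` at a time. Since `st.session_state` supports `in` checks and item assignment, the same defaults can be applied in a loop; a sketch using a plain dict as a stand-in for session state (the `DEFAULTS` name is an assumption, not from the repo):

```python
DEFAULTS = {
    "code_input": "",
    "code_file_meta": None,
    "code_analysis_result": None,
    "code_analysis_model": "",
    "repo_analysis_result": None,
    "repo_analysis_model": "",
    "repo_analysis_url": "",
}

# Stand-in for st.session_state with one key already populated by a rerun.
session_state = {"code_input": "print('hi')"}
for key, value in DEFAULTS.items():
    if key not in session_state:
        session_state[key] = value  # only fill keys not yet set
```

Existing values survive reruns untouched; only missing keys get their default.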
734
+ st.markdown(
735
+ """
736
+ <div class="glass-panel hero">
737
+ <div class="hero-label">CODE ANALYZER</div>
738
+ <h1>AI Code Analyzer</h1>
739
+ <p class="subline">Inspect bugs, surface security gaps, and review repositories with instant feedback.</p>
740
+ <div class="hero-badges">
741
+ <span class="hero-badge">🧠 Multi-model</span>
742
+ <span class="hero-badge">🔍 Bug & Security Scan</span>
743
+ <span class="hero-badge">⚡ Instant Results</span>
744
+ <span class="hero-badge">📦 GitHub Ready</span>
745
  </div>
746
  </div>
747
+ <script>
748
+ (function drizzle() {
749
+ const container = document.getElementById('matrix-rain');
750
+ if (!container || container.dataset.initialized) return;
751
+ container.dataset.initialized = 'true';
752
+ const glyphs = "01ΛΣΞ∑¥$#@*&%=+";
753
+ const nodeCount = 80;
754
+ for (let i = 0; i < nodeCount; i++) {
755
+ const drop = document.createElement('span');
756
+ drop.textContent = glyphs[Math.floor(Math.random() * glyphs.length)];
757
+ drop.style.left = Math.random() * 100 + '%';
758
+ drop.style.animationDuration = (Math.random() * 4 + 3) + 's';
759
+ drop.style.animationDelay = (Math.random() * 4) + 's';
760
+ container.appendChild(drop);
761
+ }
762
+ })();
763
+ </script>
764
+ """,
765
+ unsafe_allow_html=True,
766
+ )
767
 
768
+ code_tab, repo_tab = st.tabs(["Code Analysis", "Repository Insights"])
769
+
770
+ with code_tab:
771
+ model_keys = list(AVAILABLE_MODELS.keys())
772
+ selected_model_code = st.selectbox(
773
+ "AI Model",
774
+ options=model_keys,
775
+ format_func=lambda key: AVAILABLE_MODELS[key],
776
+ key="code_model_select",
777
  )
778
 
779
+ selected_language = st.selectbox(
780
+ "Language",
781
+ LANGUAGE_OPTIONS,
782
+ format_func=lambda code: LANGUAGE_DISPLAY[code],
783
+ key="language_select",
784
  )
785
+
786
+ uploaded_file = st.file_uploader(
787
+ "Upload a code file",
788
+ type=["py", "js", "java", "cpp", "c", "cs", "go", "rs", "php", "rb", "swift", "kt", "txt"],
789
+ key="code_file_uploader",
790
+ )
791
+
792
+ if uploaded_file is not None:
793
+ raw_bytes = uploaded_file.read()
794
+ try:
795
+ decoded = raw_bytes.decode("utf-8")
796
+ except UnicodeDecodeError:
797
+ st.error("Only UTF-8 encoded files are supported.")
798
+ else:
799
+ st.session_state.code_input = decoded
800
+ st.session_state.code_file_meta = {
801
+ "name": uploaded_file.name,
802
+ "size": len(raw_bytes),
803
+ "lines": len(decoded.splitlines()),
804
+ }
805
+
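The upload path above rejects any file that is not valid UTF-8. If lossy acceptance is preferable for analysis purposes, `bytes.decode` takes an `errors` argument; a sketch of that alternative (the `decode_upload` helper is hypothetical, and whether lossy input is acceptable is a product decision the original code does not make):

```python
def decode_upload(raw: bytes) -> tuple[str, bool]:
    """Decode uploaded bytes; returns (text, was_lossy)."""
    try:
        return raw.decode("utf-8"), False
    except UnicodeDecodeError:
        # Invalid byte sequences become U+FFFD instead of failing the upload.
        return raw.decode("utf-8", errors="replace"), True
```

The boolean lets the UI warn the user that replacement characters were substituted.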
806
+ st.text_area(
807
+ "Or paste code below",
808
+ key="code_input",
809
+ height=320,
810
+ placeholder="Paste any code snippet to inspect bugs, security gaps, and quality issues.",
811
+ )
812
+
813
+ meta = st.session_state.get("code_file_meta")
814
+ if meta:
815
+ st.markdown(
816
+ f"""
817
+ <div class="list-card">
818
+ <h3>Active File</h3>
819
+ <div class="file-meta">
820
+ <span>{meta['name']}</span>
821
+ <span>{meta['lines']} lines</span>
822
+ <span>{meta['size']} bytes</span>
823
+ </div>
824
+ </div>
825
+ """,
826
+ unsafe_allow_html=True,
827
  )
828
+
829
+ analyze_code_clicked = st.button("Run Code Analysis", key="code_analyze_button", use_container_width=True)
830
+
831
+ if analyze_code_clicked:
832
+ snippet = st.session_state.get("code_input", "").strip()
833
+ if not snippet:
834
+ st.error("Please upload a file or paste some code to analyze.")
835
+ else:
836
+ with st.spinner("Analyzing code..."):
837
+ language_arg = None if selected_language == "auto" else selected_language
838
+ result = analyzer.analyze_code(snippet, selected_model_code, language_arg)
839
+ st.session_state.code_analysis_result = result
840
+ st.session_state.code_analysis_model = AVAILABLE_MODELS[selected_model_code]
841
+
842
+ if st.session_state.get("code_analysis_result"):
843
+ render_code_result(st.session_state.code_analysis_result, st.session_state.get("code_analysis_model", ""))
844
  else:
845
+ st.info("Upload a file or paste code to generate an analysis.")
846
+
847
+ with repo_tab:
848
+ model_keys = list(AVAILABLE_MODELS.keys())
849
+ selected_model_repo = st.selectbox(
850
+ "AI Model",
851
+ options=model_keys,
852
+ format_func=lambda key: AVAILABLE_MODELS[key],
853
+ key="repo_model_select",
854
  )
855
+
856
+ st.text_input(
857
+ "GitHub repository URL",
858
+ placeholder="https://github.com/owner/repo",
859
+ key="repo_analysis_url",
860
  )
861
 
862
+ analyze_repo_clicked = st.button("Analyze Repository", key="repo_analyze_button", use_container_width=True)
863
 
864
+ if analyze_repo_clicked:
865
+ repo_url = st.session_state.get("repo_analysis_url", "").strip()
866
+ if not repo_url:
867
+ st.error("Enter a GitHub repository URL.")
868
+ else:
869
+ with st.spinner("Inspecting repository..."):
870
+ result = analyzer.analyze_github_repo(repo_url, selected_model_repo)
871
+ st.session_state.repo_analysis_result = result
872
+ st.session_state.repo_analysis_model = AVAILABLE_MODELS[selected_model_repo]
873
+
874
+ if st.session_state.get("repo_analysis_result"):
875
+ render_repo_result(
876
+ st.session_state.repo_analysis_result,
877
+ st.session_state.get("repo_analysis_model", ""),
878
  )
879
+ else:
880
+ st.info("Provide a public GitHub repository URL to review its structure, issues, and improvements.")
model_comparison.py DELETED
@@ -1,175 +0,0 @@
1
- #!/usr/bin/env python3
2
- """
3
- Model Comparison Script for AI Code Analyzer
4
-
5
- This script helps you compare different code analysis models
6
- and understand their capabilities before fine-tuning.
7
-
8
- Author: AI Code Analyzer Project
9
- Date: 2025
10
- """
11
-
12
- import torch
13
- from transformers import (
14
- AutoModelForCausalLM,
15
- AutoModelForSeq2SeqLM,
16
- AutoTokenizer,
17
- BitsAndBytesConfig
18
- )
19
-
20
- def compare_models():
21
- """
22
- Compare different code analysis models available on Hugging Face.
23
- """
24
-
25
- models_to_compare = {
26
- "Current Model (GPT-2)": {
27
- "model_id": "gpt2",
28
- "type": "CausalLM",
29
- "size": "124M",
30
- "code_specialized": False,
31
- "description": "General-purpose text model, not optimized for code"
32
- },
33
- "CodeT5+ (Recommended)": {
34
- "model_id": "Salesforce/codet5p-220m",
35
- "type": "Seq2SeqLM",
36
- "size": "220M",
37
- "code_specialized": True,
38
- "description": "Specialized for code understanding and generation"
39
- },
40
- "CodeBERT": {
41
- "model_id": "microsoft/CodeBERT-base",
42
- "type": "FeatureExtraction",
43
- "size": "125M",
44
- "code_specialized": True,
45
-            "description": "Pre-trained on code for understanding programming languages"
-        },
-        "GraphCodeBERT": {
-            "model_id": "microsoft/GraphCodeBERT-base",
-            "type": "FeatureExtraction",
-            "size": "125M",
-            "code_specialized": True,
-            "description": "Understands code structure and dependencies"
-        },
-        "InCoder": {
-            "model_id": "facebook/incoder-1B",
-            "type": "CausalLM",
-            "size": "1B",
-            "code_specialized": True,
-            "description": "Code completion and analysis with large context"
-        }
-    }
-
-    print("🤖 Code Analysis Models Comparison")
-    print("=" * 80)
-
-    for name, info in models_to_compare.items():
-        print(f"\n📊 {name}")
-        print(f"   Model ID: {info['model_id']}")
-        print(f"   Type: {info['type']}")
-        print(f"   Size: {info['size']}")
-        print(f"   Code Specialized: {'✅ Yes' if info['code_specialized'] else '❌ No'}")
-        print(f"   Description: {info['description']}")
-
-    print("\n" + "=" * 80)
-    print("🎯 RECOMMENDATIONS:")
-    print("=" * 80)
-
-    print("\n🥇 BEST CHOICE: Salesforce/codet5p-220m")
-    print("   ✅ Specialized for code analysis")
-    print("   ✅ Good balance of size and performance")
-    print("   ✅ Works well with your training data format")
-    print("   ✅ Seq2Seq architecture perfect for code analysis")
-
-    print("\n🥈 ALTERNATIVE: facebook/incoder-1B")
-    print("   ✅ Excellent code understanding")
-    print("   ✅ Large context window")
-    print("   ⚠️ Larger model (requires more resources)")
-
-    print("\n🥉 FOR EXPERIMENTATION: microsoft/CodeBERT-base")
-    print("   ✅ Proven for code understanding")
-    print("   ✅ Good for feature extraction")
-    print("   ⚠️ Different architecture (might need data format changes)")
-
-def test_model_loading(model_id: str, model_type: str = "auto"):
-    """
-    Test loading a specific model to ensure it works.
-
-    Args:
-        model_id (str): Hugging Face model ID
-        model_type (str): Model type ("auto", "causal", "seq2seq")
-    """
-    print(f"\n🧪 Testing model: {model_id}")
-    print("-" * 50)
-
-    try:
-        # Load tokenizer
-        print("Loading tokenizer...")
-        tokenizer = AutoTokenizer.from_pretrained(model_id)
-
-        if tokenizer.pad_token is None:
-            tokenizer.pad_token = tokenizer.eos_token
-            print("✅ Added padding token")
-
-        # Load model
-        print("Loading model...")
-        if model_type == "seq2seq" or "codet5" in model_id.lower():
-            model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
-        else:
-            model = AutoModelForCausalLM.from_pretrained(model_id)
-
-        print("✅ Model loaded successfully!")
-        print(f"   Model type: {type(model).__name__}")
-        print(f"   Parameters: {sum(p.numel() for p in model.parameters()):,}")
-
-        # Test with a simple code snippet
-        test_code = "def add(a, b):\n    return a + b"
-        inputs = tokenizer(test_code, return_tensors="pt", truncation=True, max_length=512)
-
-        print("✅ Tokenization test passed")
-        print(f"   Input tokens: {inputs['input_ids'].shape[1]}")
-
-        return True
-
-    except Exception as e:
-        print(f"❌ Error loading model: {str(e)}")
-        return False
-
-def main():
-    """
-    Main function to run model comparison and testing.
-    """
-    print("🚀 AI Code Analyzer - Model Comparison Tool")
-    print("=" * 80)
-
-    # Show comparison
-    compare_models()
-
-    # Test recommended models
-    print("\n\n🧪 TESTING RECOMMENDED MODELS:")
-    print("=" * 80)
-
-    models_to_test = [
-        ("Salesforce/codet5p-220m", "seq2seq"),
-        ("microsoft/CodeBERT-base", "causal"),
-        ("facebook/incoder-1B", "causal")
-    ]
-
-    working_models = []
-
-    for model_id, model_type in models_to_test:
-        if test_model_loading(model_id, model_type):
-            working_models.append(model_id)
-
-    print(f"\n🎉 SUCCESS: {len(working_models)} models loaded successfully!")
-
-    if working_models:
-        print("\n📋 NEXT STEPS:")
-        print("1. Choose your preferred model from the working models above")
-        print("2. Run: python finetune_improved.py --model <model_id> --dry-run")
-        print("3. Test the fine-tuned model with your code analyzer")
-        print("\n💡 RECOMMENDED COMMAND:")
-        print(f"   python finetune_improved.py --model {working_models[0]} --dry-run")
-
-if __name__ == "__main__":
-    main()
 
optimized_code_analyzer.py DELETED
@@ -1,466 +0,0 @@
-#!/usr/bin/env python3
-"""
-Optimized CodeT5+ Code Analyzer
-
-This script implements CodeT5+ with multiple speed optimizations:
-- FP16 by default (fastest on your GPU); optional INT8/INT4
-- Response streaming for better UX
-- Progress indicators
-- Result caching
-- Optimized generation parameters
-
-Author: AI Code Analyzer Project
-Date: 2025
-"""
-
-import torch
-import time
-import hashlib
-import json
-import os
-from typing import Dict, Any, Optional, Generator
-from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, BitsAndBytesConfig
-from tqdm import tqdm
-import streamlit as st
-
-class OptimizedCodeAnalyzer:
-    """
-    Optimized CodeT5+ analyzer with speed improvements.
-    """
-
-    def __init__(
-        self,
-        model_id: str = "Salesforce/codet5p-220m",
-        cache_dir: str = "./cache",
-        precision: str = "fp16",  # one of: fp16 | int8 | int4
-        quick_max_new_tokens: int = 180,
-        detailed_max_new_tokens: int = 240,
-    ):
-        """
-        Initialize the optimized analyzer.
-
-        Args:
-            model_id: Hugging Face model ID
-            cache_dir: Directory to store cached results
-            precision: Model precision ("fp16", "int8", or "int4")
-            quick_max_new_tokens: Max new tokens for quick analysis
-            detailed_max_new_tokens: Max new tokens for detailed analysis
-        """
-        self.model_id = model_id
-        self.cache_dir = cache_dir
-        self.model = None
-        self.tokenizer = None
-        self.cache = {}
-        self.precision = precision.lower().strip()
-        self.quick_max_new_tokens = quick_max_new_tokens
-        self.detailed_max_new_tokens = detailed_max_new_tokens
-
-        # Create cache directory
-        os.makedirs(cache_dir, exist_ok=True)
-
-        # Load cache if exists
-        self._load_cache()
-
-    def _create_quantization_config(self) -> BitsAndBytesConfig:
-        """
-        Create 4-bit quantization configuration for faster inference.
-
-        Returns:
-            BitsAndBytesConfig: Quantization configuration
-        """
-        # INT4 (nf4) config; used only when precision == "int4"
-        return BitsAndBytesConfig(
-            load_in_4bit=True,
-            bnb_4bit_quant_type="nf4",
-            bnb_4bit_compute_dtype=torch.float16,
-            bnb_4bit_use_double_quant=True,
-        )
-
-    def _load_model(self):
-        """
-        Load the model with optimizations.
-        """
-        if self.model is not None:
-            return
-
-        print("🚀 Loading optimized CodeT5+ model...")
-
-        # Load tokenizer
-        self.tokenizer = AutoTokenizer.from_pretrained(self.model_id)
-        if self.tokenizer.pad_token is None:
-            self.tokenizer.pad_token = self.tokenizer.eos_token
-
-        # Decide precision based on config
-        quantization_config = None
-        dtype = None
-        banner = ""
-
-        if self.precision == "fp16":
-            dtype = torch.float16
-            banner = "FP16 precision"
-        elif self.precision == "int8":
-            quantization_config = BitsAndBytesConfig(load_in_8bit=True)
-            banner = "INT8 quantization"
-        elif self.precision == "int4":
-            quantization_config = self._create_quantization_config()
-            banner = "INT4 (nf4) quantization"
-        else:
-            # Fall back to fp16
-            dtype = torch.float16
-            banner = f"Unknown precision '{self.precision}', defaulting to FP16"
-
-        self.model = AutoModelForSeq2SeqLM.from_pretrained(
-            self.model_id,
-            device_map="auto",
-            torch_dtype=dtype,
-            quantization_config=quantization_config,
-        )
-
-        print(f"✅ Model loaded with {banner}!")
-
-    def _get_cache_key(self, code: str) -> str:
-        """
-        Generate cache key for code.
-
-        Args:
-            code: Code to analyze
-
-        Returns:
-            str: Cache key
-        """
-        return hashlib.md5(code.encode()).hexdigest()
-
-    def _load_cache(self):
-        """
-        Load cached results from disk.
-        """
-        cache_file = os.path.join(self.cache_dir, "analysis_cache.json")
-        if os.path.exists(cache_file):
-            try:
-                with open(cache_file, 'r') as f:
-                    self.cache = json.load(f)
-                print(f"📁 Loaded {len(self.cache)} cached analyses")
-            except Exception:
-                self.cache = {}
-
-    def _save_cache(self):
-        """
-        Save cache to disk.
-        """
-        cache_file = os.path.join(self.cache_dir, "analysis_cache.json")
-        with open(cache_file, 'w') as f:
-            json.dump(self.cache, f)
-
-    def _check_cache(self, code: str) -> Optional[Dict[str, Any]]:
-        """
-        Check if analysis is cached.
-
-        Args:
-            code: Code to analyze
-
-        Returns:
-            Optional[Dict]: Cached result or None
-        """
-        cache_key = self._get_cache_key(code)
-        return self.cache.get(cache_key)
-
-    def _save_to_cache(self, code: str, result: Dict[str, Any]):
-        """
-        Save analysis result to cache.
-
-        Args:
-            code: Code that was analyzed
-            result: Analysis result
-        """
-        cache_key = self._get_cache_key(code)
-        self.cache[cache_key] = result
-        self._save_cache()
-
-    def analyze_code_streaming(
-        self,
-        code: str,
-        show_progress: bool = True,
-        mode: str = "detailed",  # "quick" | "detailed"
-    ) -> Generator[str, None, Dict[str, Any]]:
-        """
-        Analyze code with streaming response and progress indicators.
-
-        Args:
-            code: Code to analyze
-            show_progress: Whether to show progress indicators
-            mode: Analysis mode ("quick" or "detailed")
-
-        Yields:
-            str: Partial analysis results
-        """
-        # Check cache first
-        cached_result = self._check_cache(code)
-        if cached_result:
-            print("⚡ Using cached result!")
-            yield cached_result["analysis"]
-            return cached_result
-
-        # Load model if not loaded
-        self._load_model()
-
-        # Create analysis prompt
-        prompt = f"""Analyze this code for bugs, performance issues, and security concerns:
-
-{code}
-
-Analysis:"""
-
-        # Tokenize input
-        inputs = self.tokenizer(
-            prompt,
-            return_tensors="pt",
-            truncation=True,
-            max_length=512,
-            padding=True,
-        )
-        device = next(self.model.parameters()).device
-        inputs = {k: v.to(device) for k, v in inputs.items()}
-
-        # Generate analysis with optimized parameters
-        start_time = time.time()
-
-        if show_progress:
-            print("🔍 Analyzing code...")
-            progress_bar = tqdm(total=100, desc="Analysis Progress")
-
-        try:
-            with torch.no_grad():
-                # Use optimized generation parameters for speed
-                max_new = self.detailed_max_new_tokens if mode == "detailed" else self.quick_max_new_tokens
-                num_beams = 2 if mode == "detailed" else 1
-                outputs = self.model.generate(
-                    inputs["input_ids"],
-                    attention_mask=inputs.get("attention_mask"),
-                    max_new_tokens=max_new,
-                    num_beams=num_beams,
-                    do_sample=False,
-                    pad_token_id=self.tokenizer.eos_token_id,
-                    use_cache=True,
-                )
-
-            if show_progress:
-                progress_bar.update(50)
-
-            # Decode analysis
-            analysis = self.tokenizer.decode(outputs[0], skip_special_tokens=True)
-            analysis_text = analysis[len(prompt):].strip()
-
-            if show_progress:
-                progress_bar.update(50)
-                progress_bar.close()
-
-            # Calculate quality score
-            quality_score = self._calculate_quality_score(analysis_text)
-
-            total_time = time.time() - start_time
-
-            # Create result
-            result = {
-                "analysis": analysis_text,
-                "quality_score": quality_score,
-                "execution_time": total_time,
-                "model": self.model_id,
-                "cached": False
-            }
-
-            # Save to cache
-            self._save_to_cache(code, result)
-
-            # Yield the analysis
-            yield analysis_text
-
-            return result
-
-        except Exception as e:
-            if show_progress:
-                progress_bar.close()
-            raise e
-
-    def analyze_code_fast(self, code: str, mode: str = "quick") -> Dict[str, Any]:
-        """
-        Fast analysis without streaming (for batch processing).
-
-        Args:
-            code: Code to analyze
-            mode: Analysis mode ("quick" or "detailed")
-
-        Returns:
-            Dict: Analysis result
-        """
-        # Check cache first
-        cached_result = self._check_cache(code)
-        if cached_result:
-            cached_result["cached"] = True
-            return cached_result
-
-        # Load model if not loaded
-        self._load_model()
-
-        # Create analysis prompt
-        prompt = f"""Analyze this code for bugs, performance issues, and security concerns:
-
-{code}
-
-Analysis:"""
-
-        # Tokenize input
-        inputs = self.tokenizer(
-            prompt,
-            return_tensors="pt",
-            truncation=True,
-            max_length=512,
-            padding=True,
-        )
-        device = next(self.model.parameters()).device
-        inputs = {k: v.to(device) for k, v in inputs.items()}
-
-        # Generate analysis with speed optimizations
-        start_time = time.time()
-
-        with torch.no_grad():
-            max_new = self.quick_max_new_tokens if mode == "quick" else self.detailed_max_new_tokens
-            num_beams = 1 if mode == "quick" else 2
-            outputs = self.model.generate(
-                inputs["input_ids"],
-                attention_mask=inputs.get("attention_mask"),
-                max_new_tokens=max_new,
-                num_beams=num_beams,
-                do_sample=False,
-                pad_token_id=self.tokenizer.eos_token_id,
-                use_cache=True,
-            )
-
-        # Decode analysis
-        analysis = self.tokenizer.decode(outputs[0], skip_special_tokens=True)
-        analysis_text = analysis[len(prompt):].strip()
-
-        # Calculate quality score
-        quality_score = self._calculate_quality_score(analysis_text)
-
-        total_time = time.time() - start_time
-
-        # Create result
-        result = {
-            "analysis": analysis_text,
-            "quality_score": quality_score,
-            "execution_time": total_time,
-            "model": self.model_id,
-            "cached": False
-        }
-
-        # Save to cache
-        self._save_to_cache(code, result)
-
-        return result
-
-    def _calculate_quality_score(self, analysis_text: str) -> int:
-        """
-        Calculate quality score for analysis.
-
-        Args:
-            analysis_text: Analysis text
-
-        Returns:
-            int: Quality score (0-100)
-        """
-        score = 0
-        analysis_lower = analysis_text.lower()
-
-        # Check for different types of analysis (20 points each)
-        if any(word in analysis_lower for word in ['bug', 'error', 'issue', 'problem', 'flaw']):
-            score += 20
-
-        if any(word in analysis_lower for word in ['performance', 'slow', 'efficient', 'complexity', 'optimization']):
-            score += 20
-
-        if any(word in analysis_lower for word in ['security', 'vulnerability', 'safe', 'unsafe', 'risk']):
-            score += 20
-
-        if any(word in analysis_lower for word in ['suggest', 'improve', 'better', 'recommend', 'fix', 'solution']):
-            score += 20
-
-        # Bonus for detailed analysis
-        if len(analysis_text) > 200:
-            score += 10
-        if len(analysis_text) > 500:
-            score += 10
-
-        return min(score, 100)
-
-    def get_model_info(self) -> Dict[str, Any]:
-        """
-        Get information about the loaded model.
-
-        Returns:
-            Dict: Model information
-        """
-        if self.model is None:
-            return {"status": "Model not loaded"}
-
-        param_count = sum(p.numel() for p in self.model.parameters())
-        device = next(self.model.parameters()).device
-
-        return {
-            "model_id": self.model_id,
-            "parameters": param_count,
-            "device": str(device),
-            "precision": self.precision,
-            "quick_max_new_tokens": self.quick_max_new_tokens,
-            "detailed_max_new_tokens": self.detailed_max_new_tokens,
-            "cache_size": len(self.cache)
-        }
-
-def main():
-    """
-    Demo of the optimized analyzer.
-    """
-    print("🚀 Optimized CodeT5+ Analyzer Demo")
-    print("=" * 60)
-
-    # Initialize analyzer
-    analyzer = OptimizedCodeAnalyzer()
-
-    # Test code
-    test_code = """
-def calculate_fibonacci(n):
-    if n <= 0:
-        return 0
-    elif n == 1:
-        return 1
-    else:
-        return calculate_fibonacci(n-1) + calculate_fibonacci(n-2)
-
-# This will be slow for large numbers
-result = calculate_fibonacci(35)
-print(result)
-"""
-
-    print(f"Test Code:\n{test_code}")
-    print("=" * 60)
-
-    # Test streaming analysis
-    print("\n🔍 Streaming Analysis:")
-    print("-" * 40)
-
-    for partial_result in analyzer.analyze_code_streaming(test_code):
-        print(partial_result)
-
-    # Test fast analysis
-    print("\n⚡ Fast Analysis:")
-    print("-" * 40)
-
-    result = analyzer.analyze_code_fast(test_code)
-    print(f"Analysis: {result['analysis']}")
-    print(f"Quality Score: {result['quality_score']}/100")
-    print(f"Execution Time: {result['execution_time']:.2f}s")
-    print(f"Cached: {result['cached']}")
-
-    # Show model info
-    print("\n📊 Model Information:")
-    print("-" * 40)
-    model_info = analyzer.get_model_info()
-    for key, value in model_info.items():
-        print(f"{key}: {value}")
-
-if __name__ == "__main__":
-    main()
 
optimized_code_analyzer_enhanced.py DELETED
@@ -1,593 +0,0 @@
-#!/usr/bin/env python3
-"""
-Enhanced Code Analyzer with Fine-tuned Model Support
-
-This version supports:
-- Base CodeT5+ model (original)
-- Fine-tuned DeepSeek Coder model (new)
-- Easy toggle between models
-- All existing optimizations (caching, streaming, etc.)
-
-Author: AI Code Analyzer Project
-Date: 2025
-"""
-
-import torch
-import time
-import hashlib
-import json
-import os
-from typing import Dict, Any, Optional, Generator, Literal
-from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, AutoModelForCausalLM, BitsAndBytesConfig
-from peft import PeftModel
-from tqdm import tqdm
-import streamlit as st
-
-ModelType = Literal["codet5", "deepseek-finetuned"]
-
-class EnhancedCodeAnalyzer:
-    """
-    Enhanced analyzer supporting multiple model types including fine-tuned models.
-    """
-
-    def __init__(
-        self,
-        model_type: ModelType = "codet5",
-        model_id: Optional[str] = None,
-        adapter_path: Optional[str] = None,
-        cache_dir: str = "./cache",
-        precision: str = "fp16",
-        quick_max_new_tokens: int = 180,
-        detailed_max_new_tokens: int = 300,
-        remote_api_url: Optional[str] = None,
-    ):
-        """
-        Initialize the enhanced analyzer.
-
-        Args:
-            model_type: Type of model to use ("codet5" or "deepseek-finetuned")
-            model_id: Hugging Face model ID (auto-selected if None)
-            adapter_path: Path to LoRA adapters for fine-tuned models
-            cache_dir: Directory to store cached results
-            precision: Model precision (fp16, int8, int4)
-            quick_max_new_tokens: Max tokens for quick analysis
-            detailed_max_new_tokens: Max tokens for detailed analysis
-            remote_api_url: Optional URL of a remote analysis API
-        """
-        self.model_type = model_type
-        self.cache_dir = cache_dir
-        self.precision = precision.lower().strip()
-        self.quick_max_new_tokens = quick_max_new_tokens
-        self.detailed_max_new_tokens = detailed_max_new_tokens
-
-        # Auto-select model_id based on type
-        if model_id is None:
-            if model_type == "codet5":
-                self.model_id = "Salesforce/codet5p-220m"
-            elif model_type == "deepseek-finetuned":
-                self.model_id = "deepseek-ai/deepseek-coder-1.3b-instruct"
-            else:
-                raise ValueError(f"Unknown model_type: {model_type}")
-        else:
-            self.model_id = model_id
-
-        # Set adapter path
-        self.adapter_path = adapter_path
-        if model_type == "deepseek-finetuned" and adapter_path is None:
-            self.adapter_path = "./fine-tuned-analyst"
-
-        self.model = None
-        self.tokenizer = None
-        self.cache = {}
-        self.remote_api_url = remote_api_url
-
-        # Create cache directory
-        os.makedirs(cache_dir, exist_ok=True)
-
-        # Load cache if exists
-        self._load_cache()
-
-    def _create_quantization_config(self) -> BitsAndBytesConfig:
-        """Create quantization configuration."""
-        return BitsAndBytesConfig(
-            load_in_4bit=True,
-            bnb_4bit_quant_type="nf4",
-            bnb_4bit_compute_dtype=torch.float16,
-            bnb_4bit_use_double_quant=True,
-        )
-
-    def _load_model(self):
-        """Load the model with optimizations."""
-        if self.model is not None:
-            return
-
-        print(f"🚀 Loading {self.model_type} model...")
-
-        # Load tokenizer
-        self.tokenizer = AutoTokenizer.from_pretrained(self.model_id, trust_remote_code=True)
-        if self.tokenizer.pad_token is None:
-            self.tokenizer.pad_token = self.tokenizer.eos_token
-
-        # Configure precision
-        quantization_config = None
-        dtype = None
-        banner = ""
-
-        if self.precision == "fp16":
-            dtype = torch.float16
-            banner = "FP16 precision"
-        elif self.precision == "int8":
-            quantization_config = BitsAndBytesConfig(load_in_8bit=True)
-            banner = "INT8 quantization"
-        elif self.precision == "int4":
-            quantization_config = self._create_quantization_config()
-            banner = "INT4 (nf4) quantization"
-        else:
-            dtype = torch.float16
-            banner = f"Unknown precision '{self.precision}', defaulting to FP16"
-
-        # Load base model based on type
-        if self.model_type == "codet5":
-            self.model = AutoModelForSeq2SeqLM.from_pretrained(
-                self.model_id,
-                device_map="auto",
-                torch_dtype=dtype,
-                quantization_config=quantization_config,
-                trust_remote_code=True,
-            )
-            print(f"✅ CodeT5+ loaded with {banner}!")
-
-        elif self.model_type == "deepseek-finetuned":
-            # Load base DeepSeek model
-            base_model = AutoModelForCausalLM.from_pretrained(
-                self.model_id,
-                device_map="auto",
-                torch_dtype=dtype,
-                quantization_config=quantization_config,
-                trust_remote_code=True,
-            )
-
-            # Load LoRA adapters if path exists
-            if self.adapter_path and os.path.exists(self.adapter_path):
-                print(f"🔗 Loading LoRA adapters from {self.adapter_path}...")
-                self.model = PeftModel.from_pretrained(base_model, self.adapter_path)
-                print(f"✅ Fine-tuned DeepSeek loaded with {banner}!")
-            else:
-                print(f"⚠️ Adapter path not found: {self.adapter_path}")
-                print(f"✅ Using base DeepSeek model with {banner}")
-                self.model = base_model
-
-    def _create_prompt(self, code: str) -> str:
-        """Create prompt based on model type."""
-        if self.model_type == "codet5":
-            return f"""Analyze this code for bugs, performance issues, and security concerns:
-
-{code}
-
-Analysis:"""
-
-        elif self.model_type == "deepseek-finetuned":
-            return f"""<s>[INST] Analyze this code for bugs, performance, and security issues. Give a quality score from 1-100 and provide a detailed analysis.
-
-Code:
-```
-{code}
-``` [/INST]"""
-
-        return code
-
-    def _extract_response(self, full_text: str, prompt: str) -> str:
-        """Extract the actual response from generated text."""
-        if self.model_type == "codet5":
-            # Remove prompt from response
-            return full_text[len(prompt):].strip()
-
-        elif self.model_type == "deepseek-finetuned":
-            # Extract response after [/INST]
-            if '[/INST]' in full_text:
-                return full_text.split('[/INST]')[-1].strip()
-            return full_text.strip()
-
-        return full_text.strip()
-
-    def _get_cache_key(self, code: str) -> str:
-        """Generate cache key for code."""
-        # Include model type in cache key
-        combined = f"{self.model_type}:{code}"
-        return hashlib.md5(combined.encode()).hexdigest()
-
-    def _load_cache(self):
-        """Load cached results from disk."""
-        cache_file = os.path.join(self.cache_dir, "analysis_cache_enhanced.json")
-        if os.path.exists(cache_file):
-            try:
-                with open(cache_file, 'r') as f:
-                    self.cache = json.load(f)
-                print(f"📁 Loaded {len(self.cache)} cached analyses")
-            except Exception:
-                self.cache = {}
-
-    def _save_cache(self):
-        """Save cache to disk."""
-        cache_file = os.path.join(self.cache_dir, "analysis_cache_enhanced.json")
-        with open(cache_file, 'w') as f:
-            json.dump(self.cache, f)
-
-    def _check_cache(self, code: str) -> Optional[Dict[str, Any]]:
-        """Check if analysis is cached."""
-        cache_key = self._get_cache_key(code)
-        return self.cache.get(cache_key)
-
-    def _save_to_cache(self, code: str, result: Dict[str, Any]):
-        """Save analysis result to cache."""
-        cache_key = self._get_cache_key(code)
-        self.cache[cache_key] = result
-        self._save_cache()
-
-    def analyze_code_streaming(
-        self,
-        code: str,
-        show_progress: bool = True,
-        mode: str = "detailed",
-    ) -> Generator[str, None, Dict[str, Any]]:
-        """
-        Analyze code with streaming response and progress indicators.
-
-        Args:
-            code: Code to analyze
-            show_progress: Whether to show progress indicators
-            mode: Analysis mode ("quick" or "detailed")
-
-        Yields:
-            str: Partial analysis results
-        """
-        # Check cache first
-        cached_result = self._check_cache(code)
-        if cached_result:
-            print("⚡ Using cached result!")
-            yield cached_result["analysis"]
-            return cached_result
-
-        # Load model if not loaded
-        self._load_model()
-
-        # Create prompt
-        prompt = self._create_prompt(code)
-
-        # Tokenize input
-        inputs = self.tokenizer(
-            prompt,
-            return_tensors="pt",
-            truncation=True,
-            max_length=1024 if self.model_type == "deepseek-finetuned" else 512,
-            padding=True,
-        )
-        device = next(self.model.parameters()).device
-        inputs = {k: v.to(device) for k, v in inputs.items()}
-
-        # Generate analysis
-        start_time = time.time()
-
-        if show_progress:
-            print(f"🔍 Analyzing code with {self.model_type}...")
-            progress_bar = tqdm(total=100, desc="Analysis Progress")
-
-        try:
-            with torch.no_grad():
-                max_new = self.detailed_max_new_tokens if mode == "detailed" else self.quick_max_new_tokens
-
-                # Generation parameters based on model type
-                if self.model_type == "codet5":
-                    num_beams = 2 if mode == "detailed" else 1
-                    outputs = self.model.generate(
-                        inputs["input_ids"],
-                        attention_mask=inputs.get("attention_mask"),
-                        max_new_tokens=max_new,
-                        num_beams=num_beams,
-                        do_sample=False,
-                        pad_token_id=self.tokenizer.eos_token_id,
-                        use_cache=True,
-                    )
-                else:  # deepseek-finetuned
-                    outputs = self.model.generate(
-                        **inputs,
-                        max_new_tokens=max_new,
-                        temperature=0.7,
-                        do_sample=True,
-                        pad_token_id=self.tokenizer.eos_token_id,
-                        eos_token_id=self.tokenizer.eos_token_id,
-                    )
-
-            if show_progress:
-                progress_bar.update(50)
-
-            # Decode analysis
-            full_text = self.tokenizer.decode(outputs[0], skip_special_tokens=True)
-            analysis_text = self._extract_response(full_text, prompt)
-
-            if show_progress:
-                progress_bar.update(50)
-                progress_bar.close()
-
-            # Calculate quality score
-            quality_score = self._calculate_quality_score(analysis_text)
-
-            total_time = time.time() - start_time
-
-            # Create result
-            result = {
-                "analysis": analysis_text,
-                "quality_score": quality_score,
-                "execution_time": total_time,
-                "model": self.model_id,
-                "model_type": self.model_type,
-                "cached": False
-            }
-
-            # Save to cache
-            self._save_to_cache(code, result)
-
-            # Yield the analysis
-            yield analysis_text
-
-            return result
-
-        except Exception as e:
-            if show_progress:
-                progress_bar.close()
-            raise e
-
-    def analyze_code_fast(self, code: str, mode: str = "quick") -> Dict[str, Any]:
-        """
-        Fast analysis without streaming (for batch processing).
-
-        Args:
-            code: Code to analyze
-            mode: Analysis mode ("quick" or "detailed")
-
-        Returns:
-            Dict: Analysis result
-        """
-        # Check if using remote model
-        if self.remote_api_url:
-            return self.analyze_code_remote(code, mode)
-
-        # Check cache first
-        cached_result = self._check_cache(code)
-        if cached_result:
-            cached_result["cached"] = True
-            return cached_result
-
-        # Load model if not loaded
-        self._load_model()
-
-        # Create prompt
-        prompt = self._create_prompt(code)
-
-        # Tokenize input
-        inputs = self.tokenizer(
-            prompt,
-            return_tensors="pt",
-            truncation=True,
-            max_length=1024 if self.model_type == "deepseek-finetuned" else 512,
-            padding=True,
-        )
-        device = next(self.model.parameters()).device
-        inputs = {k: v.to(device) for k, v in inputs.items()}
-
-        # Generate analysis
-        start_time = time.time()
-
-        with torch.no_grad():
-            max_new = self.quick_max_new_tokens if mode == "quick" else self.detailed_max_new_tokens
-
-            if self.model_type == "codet5":
-                num_beams = 1 if mode == "quick" else 2
-                outputs = self.model.generate(
-                    inputs["input_ids"],
-                    attention_mask=inputs.get("attention_mask"),
-                    max_new_tokens=max_new,
-                    num_beams=num_beams,
-                    do_sample=False,
-                    pad_token_id=self.tokenizer.eos_token_id,
-                    use_cache=True,
-                )
-            else:  # deepseek-finetuned
-                outputs = self.model.generate(
-                    **inputs,
-                    max_new_tokens=max_new,
-                    temperature=0.7,
-                    do_sample=True,
-                    pad_token_id=self.tokenizer.eos_token_id,
-                    eos_token_id=self.tokenizer.eos_token_id,
-                )
-
-        # Decode analysis
-        full_text = self.tokenizer.decode(outputs[0], skip_special_tokens=True)
-        analysis_text = self._extract_response(full_text, prompt)
-
-        # Calculate quality score
-        quality_score = self._calculate_quality_score(analysis_text)
-
-        total_time = time.time() - start_time
-
-        # Create result
-        result = {
-            "analysis": analysis_text,
-            "quality_score": quality_score,
-            "execution_time": total_time,
-            "model": self.model_id,
-            "model_type": self.model_type,
-            "cached": False
-        }
-
-        # Save to cache
-        self._save_to_cache(code, result)
-
-        return result
-
-    def _calculate_quality_score(self, analysis_text: str) -> int:
-        """
-        Calculate quality score for analysis.
-
-        Args:
-            analysis_text: Analysis text
-
-        Returns:
-            int: Quality score (0-100)
-        """
-        # Try to extract score from DeepSeek output first
-        if "Quality Score:" in analysis_text or "quality score" in analysis_text.lower():
-            import re
-            score_match = re.search(r'(?:Quality Score:|quality score:?)\s*(\d+)', analysis_text, re.IGNORECASE)
-            if score_match:
-                return int(score_match.group(1))
-
-        # Fall back to heuristic calculation
-        score = 0
-        analysis_lower = analysis_text.lower()
-
-        if any(word in analysis_lower for word in ['bug', 'error', 'issue', 'problem', 'flaw']):
-            score += 20
-
-        if any(word in analysis_lower for word in ['performance', 'slow', 'efficient', 'complexity', 'optimization']):
-            score += 20
-
-        if any(word in analysis_lower for word in ['security', 'vulnerability', 'safe', 'unsafe', 'risk']):
-            score += 20
-
-        if any(word in analysis_lower for word in ['suggest', 'improve', 'better', 'recommend', 'fix', 'solution']):
-            score += 20
-
-        if len(analysis_text) > 200:
-            score += 10
-        if len(analysis_text) > 500:
-            score += 10
-
-        return min(score, 100)
-
-    def analyze_code_remote(self, code: str, mode: str = "quick") -> Dict[str, Any]:
-        """Analyze code using a remote Hugging Face API."""
-        import requests
-
-        if not self.remote_api_url:
-            raise ValueError("No remote API URL configured")
-
-        cached_result = self._check_cache(code)
-        if cached_result:
-            cached_result["cached"] = True
-            return cached_result
-
-        start_time = time.time()
-
-        try:
-            max_tokens = self.quick_max_new_tokens if mode == "quick" else self.detailed_max_new_tokens
-
-            response = requests.post(
-                f"{self.remote_api_url}/analyze",
-                json={"code": code, "max_tokens": max_tokens},
-                timeout=60
-            )
-            response.raise_for_status()
-
-            data = response.json()
-            analysis_text = data["analysis"]
-
-            quality_score = self._calculate_quality_score(analysis_text)
-            total_time = time.time() - start_time
-
-            result = {
-                "analysis": analysis_text,
-                "quality_score": quality_score,
-                "execution_time": total_time,
-                "model": "fine-tuned-deepseek-remote",
-                "model_type": "deepseek-finetuned-remote",
-                "cached": False
-            }
-
-            self._save_to_cache(code, result)
-            return result
-
-        except Exception as e:
-            raise Exception(f"Remote analysis failed: {e}") from e
-
-    def get_model_info(self) -> Dict[str, Any]:
-        """Get information about the loaded model."""
-        if self.model is None:
-            return {"status": "Model not loaded"}
-
-        param_count = sum(p.numel() for p in self.model.parameters())
-        trainable_params = sum(p.numel() for p in self.model.parameters() if p.requires_grad)
-        device = next(self.model.parameters()).device
-
-        info = {
-            "model_id": self.model_id,
-            "model_type": self.model_type,
-            "parameters": param_count,
-            "trainable_parameters": trainable_params,
-            "device": str(device),
-            "precision": self.precision,
-            "quick_max_new_tokens": self.quick_max_new_tokens,
-            "detailed_max_new_tokens": self.detailed_max_new_tokens,
-            "cache_size": len(self.cache)
-        }
-
-        if self.adapter_path:
-            info["adapter_path"] = self.adapter_path
-            info["using_adapters"] = os.path.exists(self.adapter_path)
-
-        return info
-
-
-def main():
-    """Demo of the enhanced analyzer."""
-    print("🚀 Enhanced Code Analyzer Demo")
-    print("=" * 60)
-
-    # Test code
-    test_code = """
-def calculate_fibonacci(n):
549
- if n <= 0:
550
- return 0
551
- elif n == 1:
552
- return 1
553
- else:
554
- return calculate_fibonacci(n-1) + calculate_fibonacci(n-2)
555
-
556
- result = calculate_fibonacci(35)
557
- print(result)
558
- """
559
-
560
- print(f"Test Code:\n{test_code}")
561
- print("=" * 60)
562
-
563
- # Test with CodeT5+
564
- print("\n🔵 Testing with CodeT5+ (Base Model):")
565
- print("-" * 60)
566
- analyzer_codet5 = EnhancedCodeAnalyzer(model_type="codet5")
567
- result_codet5 = analyzer_codet5.analyze_code_fast(test_code)
568
- print(f"Analysis: {result_codet5['analysis'][:300]}...")
569
- print(f"Quality Score: {result_codet5['quality_score']}/100")
570
- print(f"Execution Time: {result_codet5['execution_time']:.2f}s")
571
-
572
- # Test with Fine-tuned DeepSeek
573
- print("\n🟢 Testing with Fine-tuned DeepSeek:")
574
- print("-" * 60)
575
- analyzer_deepseek = EnhancedCodeAnalyzer(
576
- model_type="deepseek-finetuned",
577
- adapter_path="./fine-tuned-analyst"
578
- )
579
- result_deepseek = analyzer_deepseek.analyze_code_fast(test_code)
580
- print(f"Analysis: {result_deepseek['analysis'][:300]}...")
581
- print(f"Quality Score: {result_deepseek['quality_score']}/100")
582
- print(f"Execution Time: {result_deepseek['execution_time']:.2f}s")
583
-
584
- # Show model comparison
585
- print("\n📊 Model Comparison:")
586
- print("-" * 60)
587
- print(f"CodeT5+ Info: {analyzer_codet5.get_model_info()}")
588
- print(f"DeepSeek Info: {analyzer_deepseek.get_model_info()}")
589
-
590
-
591
- if __name__ == "__main__":
592
- main()
593
-
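The score-extraction branch of the removed `_calculate_quality_score` can be exercised in isolation. This is a minimal sketch that reuses the same regex; the standalone name `extract_quality_score` is illustrative, not part of the repo:

```python
import re

def extract_quality_score(analysis_text: str):
    """Pull an explicit 'Quality Score: NN' out of model output, if present.

    Mirrors the regex from the removed _calculate_quality_score method;
    returns None when no explicit score is found (the method then fell
    back to its keyword heuristic).
    """
    match = re.search(r'(?:Quality Score:|quality score:?)\s*(\d+)',
                      analysis_text, re.IGNORECASE)
    return int(match.group(1)) if match else None

print(extract_quality_score("Quality Score: 85 - solid error handling"))  # 85
print(extract_quality_score("no explicit score here"))  # None
```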
render.yaml DELETED
@@ -1,17 +0,0 @@
-services:
-  - type: web
-    name: ai-code-analyzer
-    env: python
-    repo: https://github.com/arun3676/ai-code-analyzer.git
-    branch: master
-    buildCommand: pip install -r requirements.txt
-    startCommand: streamlit run matrix_final.py --server.port=$PORT --server.address=0.0.0.0 --server.headless=true --server.enableCORS=false
-    plan: free
-    healthCheckPath: /?health=true
-    envVars:
-      - key: OPENAI_API_KEY
-        sync: false
-      - key: ANTHROPIC_API_KEY
-        sync: false
-      - key: DEEPSEEK_API_KEY
-        sync: false
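With render.yaml gone, Render's `startCommand` has to be reproduced inside the Docker image instead. A hedged sketch of what the equivalent `CMD` could look like for a HuggingFace Space, which routes traffic to port 7860 by default rather than Render's `$PORT`; the actual Dockerfile in this commit may differ:

```dockerfile
# Hypothetical tail of the Dockerfile, standing in for render.yaml's startCommand.
EXPOSE 7860
CMD ["streamlit", "run", "matrix_final.py", \
     "--server.port=7860", \
     "--server.address=0.0.0.0", \
     "--server.headless=true", \
     "--server.enableCORS=false"]
```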
requirements.txt CHANGED
@@ -4,12 +4,8 @@ anthropic>=0.25.0
 python-dotenv>=1.0.0
 requests>=2.32.0
 typing-extensions>=4.0.0
-
-# Fine-tuning libraries
-torch>=2.0.0
+huggingface-hub>=0.20.0
 transformers>=4.35.0
-datasets>=2.14.0
-peft>=0.6.0
-trl>=0.7.0
-bitsandbytes>=0.41.0
-accelerate>=0.24.0
+torch>=2.0.0
+accelerate>=0.24.0
+pygments>=2.17.0
run_app.py ADDED
@@ -0,0 +1,51 @@
+#!/usr/bin/env python3
+"""
+AI Code Analyzer - Local Startup Script
+
+This script starts the Matrix Code Analyzer application locally.
+The app will be available at http://localhost:8501/
+
+Usage:
+    python run_app.py
+
+Or directly with Streamlit:
+    streamlit run matrix_final.py
+"""
+
+import subprocess
+import sys
+import os
+
+def main():
+    """Start the AI Code Analyzer application."""
+    print("Starting AI Code Analyzer...")
+    print("The app will be available at: http://localhost:8501/")
+    print("Press Ctrl+C to stop the application")
+    print("-" * 50)
+
+    try:
+        # Check if streamlit is installed
+        subprocess.run([sys.executable, "-c", "import streamlit"], check=True, capture_output=True)
+    except subprocess.CalledProcessError:
+        print("Streamlit is not installed. Please install it with:")
+        print("  pip install streamlit")
+        sys.exit(1)
+
+    # Start the Streamlit app
+    try:
+        subprocess.run([
+            sys.executable, "-m", "streamlit", "run", "matrix_final.py",
+            "--server.port", "8501",
+            "--server.address", "localhost"
+        ], check=True)
+    except KeyboardInterrupt:
+        print("\nApplication stopped by user")
+    except subprocess.CalledProcessError as e:
+        print(f"Error starting application: {e}")
+        sys.exit(1)
+    except Exception as e:
+        print(f"An unexpected error occurred: {e}")
+        sys.exit(1)
+
+if __name__ == "__main__":
+    main()
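The pre-flight check in run_app.py (spawning a child interpreter to test an import before launching) generalizes to any dependency; a failure in the child cannot crash the parent process. A small self-contained sketch, with `module_available` as an illustrative name:

```python
import subprocess
import sys

def module_available(name: str) -> bool:
    """Replicates run_app.py's streamlit check: attempt the import in a
    fresh interpreter and report whether it succeeded."""
    result = subprocess.run(
        [sys.executable, "-c", f"import {name}"],
        capture_output=True,
    )
    return result.returncode == 0

print(module_available("json"))  # stdlib module, always importable: True
print(module_available("not_a_real_module_xyz"))  # False
```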
runtime.txt DELETED
@@ -1 +0,0 @@
-python-3.11.9