# code-compass

An AI-powered tool for analyzing code repositories using hierarchical chunking and semantic search with the Pinecone vector database.
## Features

- **Multiple Input Methods**: GitHub URLs or ZIP file uploads
- **Hierarchical Chunking**: smart code parsing at multiple levels (file → class → function → block)
- **Semantic Search**: AI-powered natural language queries using the Pinecone vector database
- **Intelligent Analysis**: local LLM integration with Qwen2.5-Coder-7B-Instruct
- **Conversation History**: maintains context across multiple queries
- **Repository Analytics**: comprehensive statistics and structure analysis
- **Pinecone Integration**: scalable vector database with automatic embedding generation
- **Optimized Performance**: quantized models for efficient local inference
## Setup

### Prerequisites

- Python 3.8+
- **Pinecone Account**: create a free account at [Pinecone.io](https://www.pinecone.io)
- System requirements for the LLM:
  - RAM: 8 GB minimum (16 GB+ recommended)
  - Storage: 5-8 GB free space for the model
  - CPU: multi-core processor (GPU acceleration supported if available)
### Installation

**Clone or download this project**

```bash
git clone https://github.com/shahzeb171/code-compass.git
cd code-compass
```

**Install dependencies**

```bash
pip install -r requirements.txt
```

**Download the LLM model**

```bash
wget https://huggingface.co/bartowski/Qwen2.5-Coder-7B-Instruct-GGUF/resolve/main/Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf
```

Recommended: the Q4_K_M quantization offers the best balance of quality and performance.
**Set up your Pinecone API key**

Create a `config.py` file:

```python
PINECONE_API_KEY = "your-pinecone-api-key-here"
PINECONE_INDEX_NAME = "index_name"            # e.g. code_compass_index
PINECONE_EMBEDDING_MODEL = "embedding_model"  # e.g. llama-text-embed-v2 (see the Pinecone docs for more models)
MODEL_PATH = "path_to_the_model"              # path to the downloaded .gguf file
```
### Getting Your Pinecone API Key

- Go to [Pinecone.io](https://www.pinecone.io) and sign up for a free account
- Navigate to the "API Keys" section in your dashboard
- Create a new API key or copy an existing one
- The free tier includes:
  - 1 index
  - 5M vector dimensions
  - enough for most code analysis projects!
## Usage

**Start the application**

```bash
python main.py
```

Then open your browser to `http://localhost:7860`.

**Load a repository**

- Enter a GitHub URL (e.g., `https://github.com/pallets/flask`)
- Or upload a ZIP file of your code
- Click "Load Repository"
**Process the repository**

- Click "Process Repository" to analyze and chunk your code
- This creates hierarchical chunks and stores them in Pinecone with automatic embedding generation
- Wait for processing to complete (1-5 minutes depending on repository size)
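Conceptually, processing turns each chunk into a record in the shape that Pinecone's `upsert()` accepts: an `id`, a `values` vector, and a `metadata` dict. The sketch below is illustrative (the field layout and the toy stand-in embedding are assumptions, not the app's actual schema):

```python
# Shape processed chunks into Pinecone-style records (illustrative sketch;
# the app's real schema and embedding model may differ).

def chunks_to_records(chunks, embed):
    """Turn (chunk_id, text, metadata) tuples into Pinecone-style records."""
    records = []
    for chunk_id, text, metadata in chunks:
        records.append({
            "id": chunk_id,
            "values": embed(text),                # vector embedding of the chunk text
            "metadata": {**metadata, "text": text},
        })
    return records

# Toy embedding stand-in: real code would call the configured embedding model.
fake_embed = lambda text: [float(len(text)), float(text.count("def"))]

records = chunks_to_records(
    [("flask/app.py::Flask.run", "def run(self): ...", {"chunk_type": "function"})],
    fake_embed,
)
print(records[0]["id"])  # flask/app.py::Flask.run
```

A real run would then pass `records` to the Pinecone index in batches.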
**Initialize the AI model** (optional but recommended)

- Click "Initialize LLM" to start loading the local AI model
- This loads Qwen2.5-Coder-7B-Instruct for intelligent code analysis
- Initial loading takes 1-3 minutes
**Query your code**

- Ask natural language questions like:
  - "What does this repository do?"
  - "Show me authentication functions"
  - "How is error handling implemented?"
  - "What are the main classes?"
- Toggle "Use AI Analysis" for intelligent responses vs. basic search results
- The AI maintains conversation context for follow-up questions
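One common way to maintain context for follow-up questions is a sliding window of past question/answer turns that gets prepended to each new prompt. This is a minimal sketch of that idea (class and field names are hypothetical, not the app's actual implementation):

```python
# Minimal sliding-window conversation history sketch (illustrative only).

class Conversation:
    def __init__(self, max_turns=5):
        self.max_turns = max_turns
        self.turns = []  # list of (question, answer) pairs

    def add(self, question, answer):
        self.turns.append((question, answer))
        self.turns = self.turns[-self.max_turns:]  # keep only the last N turns

    def build_prompt(self, new_question, context_chunks):
        history = "\n".join(f"Q: {q}\nA: {a}" for q, a in self.turns)
        code = "\n\n".join(context_chunks)
        return f"{history}\nRelevant code:\n{code}\nQ: {new_question}\nA:"

conv = Conversation()
conv.add("What does this repository do?", "It is a web framework.")
prompt = conv.build_prompt("What are the main classes?", ["class Flask: ..."])
```

Capping the window keeps the prompt within the model's context length as the conversation grows.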
## How It Works

### Hierarchical Chunking Strategy

The system creates multiple levels of code chunks:
**Level 1: File Context**

- Complete file overview with imports and purpose
- Metadata: file path, language, total lines

**Level 2: Class Chunks**

- Full class definitions with inheritance and methods
- Metadata: class name, methods list, relationships

**Level 3: Function Chunks**

- Individual function implementations with signatures
- Metadata: function name, arguments, complexity score

**Level 4: Code Block Chunks**

- Sub-chunks for complex functions (loops, conditionals, error handling)
- Metadata: block type, purpose, parent function
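For Python sources, levels 2 and 3 can be extracted with the standard-library `ast` module. This is a simplified sketch of the idea, not the project's actual chunker:

```python
# Extract class- and function-level chunks with metadata using Python's ast
# module (simplified sketch of levels 2-3 above).
import ast

SOURCE = '''
class Greeter:
    def greet(self, name):
        return f"hello {name}"

def main():
    print(Greeter().greet("world"))
'''

def extract_chunks(source):
    chunks = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ClassDef):
            methods = [n.name for n in node.body if isinstance(n, ast.FunctionDef)]
            chunks.append({"level": "class", "name": node.name, "methods": methods})
        elif isinstance(node, ast.FunctionDef):
            args = [a.arg for a in node.args.args]
            chunks.append({"level": "function", "name": node.name, "args": args})
    return chunks

chunks = extract_chunks(SOURCE)
```

Note that methods show up twice by design: once in the class chunk's `methods` list and once as their own function chunk, mirroring the overlapping levels described above.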
### Vector Search Process

1. **Embedding Generation**: code chunks are converted to vector embeddings using SentenceTransformers
2. **Vector Storage**: embeddings are stored in Pinecone with rich metadata
3. **Semantic Search**: user queries are embedded and matched against stored vectors
4. **Hybrid Filtering**: results are filtered by chunk type, file path, repository, etc.
5. **Ranked Results**: the most relevant code sections are returned with similarity scores
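The core of steps 3-5 is scoring the query vector against stored vectors and sorting by similarity. Here is a toy version with hand-made 3-d vectors standing in for real embeddings (Pinecone performs this search server-side; the chunk IDs below are made up):

```python
# Toy semantic search: cosine-score a query vector against stored vectors
# and return results ranked by similarity. Vectors are illustrative stand-ins.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

stored = {
    "auth.py::login":  [0.9, 0.1, 0.0],
    "db.py::connect":  [0.1, 0.9, 0.1],
    "utils.py::slug":  [0.0, 0.2, 0.9],
}

# Pretend embedding of "show me authentication functions"
query_vector = [0.8, 0.2, 0.1]

ranked = sorted(
    ((chunk_id, cosine(query_vector, vec)) for chunk_id, vec in stored.items()),
    key=lambda pair: pair[1],
    reverse=True,
)
print(ranked[0][0])  # auth.py::login
```

Metadata filtering (step 4) would simply restrict `stored` to chunks whose metadata matches before ranking.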
## Configuration Options

### Supported Languages
Currently optimized for Python with basic support for:
- JavaScript/TypeScript
- Java
- C/C++
- Go
- Rust
- PHP
- Ruby
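One plausible way such a tool routes files to the right parser is a file-extension lookup. The mapping below is illustrative, not the app's actual table:

```python
# Hypothetical extension-to-language routing table (illustrative only).
import pathlib

EXTENSION_TO_LANGUAGE = {
    ".py": "Python",
    ".js": "JavaScript", ".ts": "TypeScript",
    ".java": "Java",
    ".c": "C", ".cpp": "C++", ".h": "C/C++",
    ".go": "Go",
    ".rs": "Rust",
    ".php": "PHP",
    ".rb": "Ruby",
}

def detect_language(path):
    """Return the language for a file path, or 'unknown' if unsupported."""
    return EXTENSION_TO_LANGUAGE.get(pathlib.Path(path).suffix, "unknown")

print(detect_language("src/app.py"))   # Python
print(detect_language("lib/main.rs"))  # Rust
```

Files mapping to "unknown" would be skipped during chunking rather than parsed.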
## Example Repositories

Try these public repositories:

- **Flask**: `https://github.com/pallets/flask` - web framework
- **Requests**: `https://github.com/requests/requests` - HTTP library
- **FastAPI**: `https://github.com/tiangolo/fastapi` - modern web framework
- **Black**: `https://github.com/psf/black` - code formatter
## Example Queries

### General Repository Understanding

- "What is the main purpose of this repository?"
- "What are the core components and how do they interact?"
- "Show me the project architecture overview"

### Function & Class Discovery

- "What are the main classes and their responsibilities?"
- "Show me all authentication-related functions"
- "Find functions that handle file operations"
- "What utility functions are available?"

### Implementation Analysis

- "How is error handling implemented?"
- "Show me configuration management code"
- "Find database-related functions"
- "How does logging work in this project?"

### Code Patterns

- "Show me decorator implementations"
- "Find async/await usage patterns"
- "What design patterns are used?"
- "How are tests structured?"
## Troubleshooting

### Common Issues

**"Pinecone API key is required"**

- Make sure you've set `PINECONE_API_KEY` in `config.py` (or as an environment variable)
- Or enter it in the Advanced Options section

**"Error downloading repository"**

- Check that the GitHub URL is correct and the repository is public
- Ensure you have an internet connection
- Large repositories may time out - try smaller repos first

**"No chunks generated"**

- Make sure the repository contains supported code files
- Check that ZIP files aren't corrupted
- Python files currently work best

**"Vector store initialization failed"**

- Verify your Pinecone API key is valid
- Check that your Pinecone account hasn't exceeded free tier limits
- Try a different environment region if needed
### Performance Tips

- Start with smaller repositories (< 100 files) to test
- Python repositories currently work best
- Processing time scales with repository size
- Queries are fast once processing is complete
## Future Enhancements

- **More Language Support**: better parsing for JavaScript, Java, etc.
- **Code Generation**: AI-powered code completion and generation
- **Diff Analysis**: compare changes between repository versions
- **Team Collaboration**: share analyzed repositories
- **Custom Embeddings**: fine-tuned models for specific domains
- **API Integration**: REST API for programmatic access
## Contributing
Contributions welcome! Please open issues or submit pull requests.
## Support
For issues or questions:
- Check the troubleshooting section above
- Open a GitHub issue with detailed error messages
- Include your Python version and OS information