🔍code-compass

An AI-powered tool for analyzing code repositories using hierarchical chunking and semantic search backed by the Pinecone vector database.

🚀 Features

📥 Multiple Input Methods: GitHub URLs or ZIP file uploads
🧠 Hierarchical Chunking: Smart code parsing at multiple levels (file → class → function → block)
🔍 Semantic Search: AI-powered natural language queries using Pinecone vector database
🤖 Intelligent Analysis: Local LLM integration with Qwen2.5-Coder-7B-Instruct
💬 Conversation History: Maintains context across multiple queries
📊 Repository Analytics: Comprehensive statistics and structure analysis
🎯 Pinecone Integration: Scalable vector database with automatic embedding generation
⚡ Optimized Performance: Quantized models for efficient local inference

🛠️ Setup

Prerequisites

Python 3.8+
Pinecone Account: Create a free account at Pinecone.io
System Requirements for LLM:
- RAM: 8GB minimum (16GB+ recommended)
- Storage: 5-8GB free space for model
- CPU: Multi-core processor (supports GPU acceleration if available)

Installation

Clone or download this project

git clone https://github.com/shahzeb171/code-compass.git
cd code-compass

Install dependencies
```
pip install -r requirements.txt
```

Download the LLM model

wget https://huggingface.co/bartowski/Qwen2.5-Coder-7B-Instruct-GGUF/resolve/main/Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf

Recommended: Select Q4_K_M for the best balance of quality and performance.

Set up Pinecone API Key

Create a config.py file:

PINECONE_API_KEY = "your-pinecone-api-key-here"
PINECONE_INDEX_NAME = "code-compass-index"  # any index name you choose
PINECONE_EMBEDDING_MODEL = "llama-text-embed-v2"  # see the Pinecone docs for other models
MODEL_PATH = "path/to/Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf"  # path to the downloaded model
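
As a rough illustration of how these four settings might be validated before the app starts, here is a minimal sketch. It is not taken from this project: the `Settings` dataclass and the `load_settings` helper are hypothetical names, and reading from environment variables by default is an assumption made for the example.

```python
import os
from dataclasses import dataclass
from typing import Mapping

# The four setting names come from the config.py template above.
REQUIRED_KEYS = (
    "PINECONE_API_KEY",
    "PINECONE_INDEX_NAME",
    "PINECONE_EMBEDDING_MODEL",
    "MODEL_PATH",
)


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the values defined in config.py."""
    pinecone_api_key: str
    pinecone_index_name: str
    pinecone_embedding_model: str
    model_path: str


def load_settings(source: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from a mapping, failing early if anything is missing."""
    missing = [key for key in REQUIRED_KEYS if not source.get(key)]
    if missing:
        raise ValueError("Missing settings: " + ", ".join(missing))
    return Settings(
        pinecone_api_key=source["PINECONE_API_KEY"],
        pinecone_index_name=source["PINECONE_INDEX_NAME"],
        pinecone_embedding_model=source["PINECONE_EMBEDDING_MODEL"],
        model_path=source["MODEL_PATH"],
    )
```

Failing at startup with the list of missing names tends to be easier to debug than a later error from the vector store or the model loader.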

Getting Your Pinecone API Key

Go to Pinecone.io and sign up for a free account
Navigate to the "API Keys" section in your dashboard
Create a new API key or copy an existing one
The free tier includes:
- 1 index
- 5M vector dimensions
- Enough for most code analysis projects!

🚀 Usage

Start the application
```
python main.py
```
Open your browser to http://localhost:7860
Load a repository
- Enter a GitHub URL (e.g., https://github.com/pallets/flask)
- Or upload a ZIP file of your code
- Click "📁 Load Repository"
Process the repository
- Click "🚀 Process Repository" to analyze and chunk your code
- This creates hierarchical chunks and stores them in Pinecone with automatic embedding generation
- Wait for processing to complete (may take 1-5 minutes depending on repo size)
Initialize the AI model (Optional but recommended)
- Click "🚀 Initialize LLM" to start loading the local AI model
- This will load Qwen2.5-Coder-7B-Instruct for intelligent code analysis
- Initial loading takes 1-3 minutes
Query your code
- Ask natural language questions like:
  - "What does this repository do?"
  - "Show me authentication functions"
  - "How is error handling implemented?"
  - "What are the main classes?"
- Toggle "Use AI Analysis" to get LLM-generated answers instead of basic search results
- The AI maintains conversation context for follow-up questions
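
The conversation context mentioned above could be kept in a bounded buffer that is folded into each prompt. The sketch below is illustrative only: `ConversationHistory`, `max_turns`, and the prompt layout are assumptions, not this project's implementation.

```python
from collections import deque
from typing import List


class ConversationHistory:
    """Keeps the most recent question/answer pairs (hypothetical helper)."""

    def __init__(self, max_turns: int = 5) -> None:
        # A deque with maxlen drops the oldest turn once the buffer is full.
        self._turns = deque(maxlen=max_turns)

    def add(self, question: str, answer: str) -> None:
        self._turns.append((question, answer))

    def build_prompt(self, question: str, context_chunks: List[str]) -> str:
        """Combine earlier turns, retrieved code, and the new question."""
        parts = []
        for past_question, past_answer in self._turns:
            parts.append(f"User: {past_question}\nAssistant: {past_answer}")
        if context_chunks:
            parts.append("Relevant code:\n" + "\n---\n".join(context_chunks))
        parts.append(f"User: {question}\nAssistant:")
        return "\n\n".join(parts)
```

Capping the number of stored turns keeps the prompt within the model's context window on long sessions, at the cost of forgetting the earliest exchanges.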

📊 How It Works

Hierarchical Chunking Strategy

The system creates multiple levels of code chunks:

Level 1: File Context

Complete file overview with imports and purpose
Metadata: file path, language, total lines

Level 2: Class Chunks

Full class definitions with inheritance and methods
Metadata: class name, methods list, relationships

Level 3: Function Chunks

Individual function implementations with signatures
Metadata: function name, arguments, complexity score

Level 4: Code Block Chunks

Sub-chunks for complex functions (loops, conditionals, error handling)
Metadata: block type, purpose, parent function
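
For Python sources, the first three levels can be approximated with the standard-library `ast` module. The following is a minimal sketch under stated assumptions rather than the project's actual chunker: `Chunk`, `chunk_python_source`, and `estimate_complexity` are hypothetical names, the complexity score is simply a count of branch points, and Level 4 block chunks are left out for brevity.

```python
import ast
from dataclasses import dataclass, field
from typing import Dict, List

# Node types counted as branch points for a rough complexity estimate.
BRANCH_NODES = (ast.If, ast.For, ast.While, ast.Try, ast.With, ast.BoolOp)
FUNCTION_NODES = (ast.FunctionDef, ast.AsyncFunctionDef)


@dataclass
class Chunk:
    level: str  # "file", "class", or "function"
    name: str
    text: str
    metadata: Dict[str, object] = field(default_factory=dict)


def estimate_complexity(node: ast.AST) -> int:
    """Return 1 plus the number of branch points found under the node."""
    return 1 + sum(isinstance(child, BRANCH_NODES) for child in ast.walk(node))


def chunk_python_source(source: str, path: str) -> List[Chunk]:
    """Emit one file chunk plus a chunk per class and per function."""
    tree = ast.parse(source)
    chunks = [Chunk("file", path, source, {
        "file_path": path,
        "language": "python",
        "total_lines": len(source.splitlines()),
    })]
    for node in ast.walk(tree):
        if isinstance(node, ast.ClassDef):
            methods = [item.name for item in node.body
                       if isinstance(item, FUNCTION_NODES)]
            chunks.append(Chunk("class", node.name,
                                ast.get_source_segment(source, node) or "",
                                {"file_path": path, "methods": methods}))
        elif isinstance(node, FUNCTION_NODES):
            chunks.append(Chunk("function", node.name,
                                ast.get_source_segment(source, node) or "",
                                {"file_path": path,
                                 "arguments": [a.arg for a in node.args.args],
                                 "complexity": estimate_complexity(node)}))
    return chunks
```

Methods appear twice in this sketch, once inside their class chunk and once as their own function chunk, which lets a query match at whichever granularity fits best.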

Vector Search Process

Embedding Generation: Code chunks are converted to vector embeddings by the model set in PINECONE_EMBEDDING_MODEL (e.g., llama-text-embed-v2)
Vector Storage: Embeddings stored in Pinecone with rich metadata
Semantic Search: User queries are embedded and matched against stored vectors
Hybrid Filtering: Results filtered by chunk type, file path, repository, etc.
Ranked Results: Most relevant code sections returned with similarity scores
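
The filter-then-rank idea behind these steps can be shown with a small in-memory stand-in. This sketch does not call Pinecone, which handles embedding and search in the real app. The `search` function, the record layout, and the three-dimensional toy vectors are all assumptions made for illustration.

```python
import math
from typing import Dict, List, Optional, Tuple

Vector = List[float]
Record = Tuple[str, Vector, Dict[str, str]]  # (chunk id, embedding, metadata)


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(records: List[Record], query: Vector, top_k: int = 3,
           metadata_filter: Optional[Dict[str, str]] = None
           ) -> List[Tuple[str, float]]:
    """Filter by metadata first, then rank the survivors by similarity."""
    wanted = metadata_filter or {}
    scored = [
        (chunk_id, cosine_similarity(query, vector))
        for chunk_id, vector, metadata in records
        if all(metadata.get(key) == value for key, value in wanted.items())
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Toy vectors standing in for real embeddings.
records = [
    ("auth.py::login", [0.9, 0.1, 0.0], {"chunk_type": "function"}),
    ("auth.py::User", [0.8, 0.2, 0.1], {"chunk_type": "class"}),
    ("utils.py::slugify", [0.0, 0.2, 0.9], {"chunk_type": "function"}),
]
results = search(records, [1.0, 0.0, 0.0], top_k=2,
                 metadata_filter={"chunk_type": "function"})
```

With the filter set to function chunks, the class chunk is excluded before scoring, and `auth.py::login` ranks ahead of `utils.py::slugify` because its vector points in nearly the same direction as the query.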

🔧 Configuration Options

Supported Languages

Currently optimized for Python with basic support for:

JavaScript/TypeScript
Java
C/C++
Go
Rust
PHP
Ruby
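
A simple way to decide which parser applies to a file is to look at its extension. The mapping below is an assumption derived from the language list above, and `detect_language` is a hypothetical helper; the project's real extension table may differ.

```python
from pathlib import Path
from typing import Optional

# Assumed mapping built from the supported-language list.
EXTENSION_TO_LANGUAGE = {
    ".py": "python",
    ".js": "javascript", ".jsx": "javascript",
    ".ts": "typescript", ".tsx": "typescript",
    ".java": "java",
    ".c": "c", ".h": "c",
    ".cpp": "cpp", ".cc": "cpp", ".hpp": "cpp",
    ".go": "go",
    ".rs": "rust",
    ".php": "php",
    ".rb": "ruby",
}


def detect_language(file_path: str) -> Optional[str]:
    """Return the language for a path, or None if it is unsupported."""
    return EXTENSION_TO_LANGUAGE.get(Path(file_path).suffix.lower())
```

Files that return `None` here (documentation, images, build scripts) would simply be skipped during chunking.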

📝 Example Repositories

Try these public repositories:

Flask: https://github.com/pallets/flask - Web framework
Requests: https://github.com/psf/requests - HTTP library
FastAPI: https://github.com/fastapi/fastapi - Modern web framework
Black: https://github.com/psf/black - Code formatter

🔍 Example Queries

General Repository Understanding

"What is the main purpose of this repository?"
"What are the core components and how do they interact?"
"Show me the project architecture overview"

Function & Class Discovery

"What are the main classes and their responsibilities?"
"Show me all authentication-related functions"
"Find functions that handle file operations"
"What utility functions are available?"

Implementation Analysis

"How is error handling implemented?"
"Show me configuration management code"
"Find database-related functions"
"How does logging work in this project?"

Code Patterns

"Show me decorator implementations"
"Find async/await usage patterns"
"What design patterns are used?"
"How are tests structured?"

🛟 Troubleshooting

Common Issues

"Pinecone API key is required"

Make sure PINECONE_API_KEY is set, either in config.py (see Setup) or as an environment variable
Or enter it in the Advanced Options section

"Error downloading repository"

Check that the GitHub URL is correct and that the repository is public
Ensure you have an internet connection
Large repositories may time out; try smaller repos first
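
A quick shape check on the URL can catch typos before any download is attempted. The helper below is a hypothetical sketch, not part of this project. It only verifies that the URL looks like `https://github.com/<owner>/<repo>` and does not confirm that the repository exists or is public.

```python
from typing import Optional, Tuple
from urllib.parse import urlparse


def parse_github_url(url: str) -> Optional[Tuple[str, str]]:
    """Return (owner, repo) for a github.com repository URL, else None."""
    parsed = urlparse(url.strip())
    if parsed.scheme not in ("http", "https"):
        return None
    if parsed.netloc.lower() not in ("github.com", "www.github.com"):
        return None
    # Drop empty segments produced by leading or trailing slashes.
    parts = [part for part in parsed.path.split("/") if part]
    if len(parts) < 2:
        return None
    owner, repo = parts[0], parts[1]
    if repo.endswith(".git"):
        repo = repo[: -len(".git")]
    return (owner, repo) if repo else None
```

For example, `parse_github_url("https://github.com/pallets/flask")` yields `("pallets", "flask")`, while a URL without a scheme or one pointing at another host yields `None`.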

"No chunks generated"

Make sure the repository contains supported code files
Check that ZIP files aren't corrupted
Python files work best currently

"Vector store initialization failed"

Verify your Pinecone API key is valid
Check that your Pinecone account hasn't exceeded the free tier limits
Try a different environment region if needed

Performance Tips

Start with smaller repositories (< 100 files) to test
Python repositories work best currently
Processing time scales with repository size
Queries are fast once processing is complete

🔮 Future Enhancements

More Language Support: Better parsing for JavaScript, Java, etc.
Code Generation: AI-powered code completion and generation
Diff Analysis: Compare changes between repository versions
Team Collaboration: Share analyzed repositories
Custom Embeddings: Fine-tuned models for specific domains
API Integration: REST API for programmatic access

🤝 Contributing

Contributions welcome! Please open issues or submit pull requests.

📞 Support

For issues or questions:

Check the troubleshooting section above
Open a GitHub issue with detailed error messages
Include your Python version and OS information