
๐Ÿ”code-compass

An AI-powered tool for analyzing code repositories using hierarchical chunking and semantic search, backed by the Pinecone vector database.

🚀 Features

  • 📥 Multiple Input Methods: GitHub URLs or ZIP file uploads
  • 🧠 Hierarchical Chunking: Smart code parsing at multiple levels (file → class → function → block)
  • 🔍 Semantic Search: AI-powered natural language queries using Pinecone vector database
  • 🤖 Intelligent Analysis: Local LLM integration with Qwen2.5-Coder-7B-Instruct
  • 💬 Conversation History: Maintains context across multiple queries
  • 📊 Repository Analytics: Comprehensive statistics and structure analysis
  • 🎯 Pinecone Integration: Scalable vector database with automatic embedding generation
  • ⚡ Optimized Performance: Quantized models for efficient local inference

🛠️ Setup

Prerequisites

  1. Python 3.8+
  2. Pinecone Account: Create a free account at Pinecone.io
  3. System Requirements for LLM:
    • RAM: 8GB minimum (16GB+ recommended)
    • Storage: 5-8GB free space for model
    • CPU: Multi-core processor (supports GPU acceleration if available)

Installation

  1. Clone or download this project

    git clone https://github.com/shahzeb171/code-compass.git
    cd code-compass
    
  2. Install dependencies

    pip install -r requirements.txt
    
  3. Download the LLM model

    wget https://huggingface.co/bartowski/Qwen2.5-Coder-7B-Instruct-GGUF/resolve/main/Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf
    

    Recommended: Select Q4_K_M for the best balance of quality and performance.

  4. Set up Pinecone API Key

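    Create a config.py file with the following values (Pinecone index names allow only lowercase letters, digits, and hyphens):

    PINECONE_API_KEY = "your-pinecone-api-key-here"
    PINECONE_INDEX_NAME = "code-compass-index"  # any index name you choose
    PINECONE_EMBEDDING_MODEL = "llama-text-embed-v2"  # see the Pinecone docs for other models
    MODEL_PATH = "path/to/Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf"  # where you saved the model from step 3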
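
Before starting the app, it can help to sanity-check these values. The sketch below is not part of code-compass: `check_settings` is a hypothetical helper, and the index-name rule it applies (lowercase letters, digits, and hyphens) is our reading of Pinecone's naming constraints.

```python
import os
import re

REQUIRED_KEYS = (
    "PINECONE_API_KEY",
    "PINECONE_INDEX_NAME",
    "PINECONE_EMBEDDING_MODEL",
    "MODEL_PATH",
)

def check_settings(settings):
    """Return a list of human-readable problems found in a settings dict."""
    problems = []
    # Every key from config.py should be present and non-empty.
    for key in REQUIRED_KEYS:
        if not settings.get(key):
            problems.append(f"{key} is missing or empty")
    # Assumed rule: index names use lowercase letters, digits, and hyphens.
    index_name = settings.get("PINECONE_INDEX_NAME", "")
    if index_name and not re.fullmatch(r"[a-z0-9-]+", index_name):
        problems.append("PINECONE_INDEX_NAME should use lowercase letters, digits, and hyphens")
    # The model path should point at the downloaded GGUF file.
    model_path = settings.get("MODEL_PATH", "")
    if model_path and not os.path.isfile(model_path):
        problems.append(f"MODEL_PATH does not point to a file: {model_path}")
    return problems
```

An empty list means nothing obvious is wrong; it does not prove that the API key itself is valid.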
    

Getting Your Pinecone API Key

  1. Go to Pinecone.io and sign up for a free account
  2. Navigate to the "API Keys" section in your dashboard
  3. Create a new API key or copy an existing one
  4. The free tier includes:
    • 1 index
    • 5M vector dimensions
    • Enough for most code analysis projects!

🚀 Usage

  1. Start the application

    python main.py
    
  2. Open your browser to http://localhost:7860

  3. Load a repository

    • Enter a GitHub URL (e.g., https://github.com/pallets/flask)
    • Or upload a ZIP file of your code
    • Click "Load Repository"
  4. Process the repository

    • Click "🚀 Process Repository" to analyze and chunk your code
    • This creates hierarchical chunks and stores them in Pinecone with automatic embedding generation
    • Wait for processing to complete (may take 1-5 minutes depending on repo size)
  5. Initialize the AI model (Optional but recommended)

    • Click "🚀 Initialize LLM" to start loading the local AI model
    • This will load Qwen2.5-Coder-7B-Instruct for intelligent code analysis
    • Initial loading takes 1-3 minutes
  6. Query your code

    • Ask natural language questions like:
      • "What does this repository do?"
      • "Show me authentication functions"
      • "How is error handling implemented?"
      • "What are the main classes?"
    • Toggle "Use AI Analysis" for intelligent responses vs basic search results
    • The AI maintains conversation context for follow-up questions
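
One common way to keep conversation context is to replay the last few turns into each new prompt. The sketch below illustrates that idea only; `ConversationHistory`, its `max_turns` limit, and the prompt layout are assumptions, not the tool's actual implementation.

```python
from collections import deque

class ConversationHistory:
    """Keep the most recent question/answer turns and render them into a prompt."""

    def __init__(self, max_turns=5):
        # A bounded deque drops the oldest turn automatically once full.
        self.turns = deque(maxlen=max_turns)

    def add(self, question, answer):
        self.turns.append((question, answer))

    def build_prompt(self, question, context_chunks):
        """Combine retrieved code, prior turns, and the new question."""
        parts = ["Relevant code:"] + list(context_chunks)
        for past_question, past_answer in self.turns:
            parts.append(f"User: {past_question}")
            parts.append(f"Assistant: {past_answer}")
        parts.append(f"User: {question}")
        parts.append("Assistant:")
        return "\n".join(parts)
```

Capping the number of turns keeps the prompt within the model's context window, at the cost of forgetting older exchanges.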

📊 How It Works

Hierarchical Chunking Strategy

The system creates multiple levels of code chunks:

Level 1: File Context

  • Complete file overview with imports and purpose
  • Metadata: file path, language, total lines

Level 2: Class Chunks

  • Full class definitions with inheritance and methods
  • Metadata: class name, methods list, relationships

Level 3: Function Chunks

  • Individual function implementations with signatures
  • Metadata: function name, arguments, complexity score

Level 4: Code Block Chunks

  • Sub-chunks for complex functions (loops, conditionals, error handling)
  • Metadata: block type, purpose, parent function
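
For Python sources, the first three levels can be approximated with the standard `ast` module. This is a minimal sketch under that assumption, not the project's own parser: `chunk_python_source` and the dictionary keys are hypothetical names, and Level 4 block chunks are left out for brevity.

```python
import ast

def chunk_python_source(source, file_path):
    """Yield chunk dicts for the file, its classes, and its functions."""
    tree = ast.parse(source)
    # Level 1: one chunk describing the whole file.
    yield {"level": "file", "name": file_path, "language": "python",
           "total_lines": len(source.splitlines())}
    for node in ast.walk(tree):
        if isinstance(node, ast.ClassDef):
            # Level 2: the full class text plus the names of its methods.
            methods = [n.name for n in node.body
                       if isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef))]
            yield {"level": "class", "name": node.name, "methods": methods,
                   "text": ast.get_source_segment(source, node)}
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # Level 3: each function or method with its positional arguments.
            yield {"level": "function", "name": node.name,
                   "arguments": [a.arg for a in node.args.args],
                   "text": ast.get_source_segment(source, node)}

SAMPLE = '''
class Greeter:
    def greet(self, name):
        return "hi " + name

def main():
    print(Greeter().greet("x"))
'''

chunks = list(chunk_python_source(SAMPLE, "sample.py"))
```

Here a method appears twice, once inside its class chunk and once as its own function chunk, which is what lets a query match at whichever level is most relevant.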

Vector Search Process

  1. Embedding Generation: Code chunks are converted to vector embeddings by the model named in PINECONE_EMBEDDING_MODEL (Pinecone generates the embeddings automatically)
  2. Vector Storage: Embeddings stored in Pinecone with rich metadata
  3. Semantic Search: User queries are embedded and matched against stored vectors
  4. Hybrid Filtering: Results filtered by chunk type, file path, repository, etc.
  5. Ranked Results: Most relevant code sections returned with similarity scores
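
The filter-then-rank flow in steps 3 to 5 can be illustrated without any external service. The sketch below is a toy stand-in, not the Pinecone client: `embed` is a bag-of-words counter rather than a real embedding model, and `search`, the record layout, and the `chunk_type` field are hypothetical.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy embedding: a sparse bag-of-words vector (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def search(query, records, chunk_type=None, top_k=3):
    """Filter records by metadata, then rank them by similarity to the query."""
    query_vector = embed(query)
    candidates = [r for r in records
                  if chunk_type is None or r["metadata"]["chunk_type"] == chunk_type]
    scored = [(cosine(query_vector, r["vector"]), r["id"]) for r in candidates]
    scored.sort(reverse=True)  # highest similarity first
    return scored[:top_k]

records = [
    {"id": "auth.py::login",
     "vector": embed("def login(user, password): check password"),
     "metadata": {"chunk_type": "function"}},
    {"id": "auth.py::Session",
     "vector": embed("class Session: stores user login state"),
     "metadata": {"chunk_type": "class"}},
    {"id": "utils.py::slugify",
     "vector": embed("def slugify(text): normalize text"),
     "metadata": {"chunk_type": "function"}},
]

results = search("check user password", records, chunk_type="function")
```

With the `chunk_type="function"` filter, the class chunk is excluded before scoring, and the login function ranks first because it shares the most terms with the query.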

🔧 Configuration Options

Supported Languages

Currently optimized for Python with basic support for:

  • JavaScript/TypeScript
  • Java
  • C/C++
  • Go
  • Rust
  • PHP
  • Ruby

Example Repositories

Try these public repositories:

  • Flask: https://github.com/pallets/flask - Web framework
  • Requests: https://github.com/requests/requests - HTTP library
  • FastAPI: https://github.com/tiangolo/fastapi - Modern web framework
  • Black: https://github.com/psf/black - Code formatter

🔍 Example Queries

General Repository Understanding

  • "What is the main purpose of this repository?"
  • "What are the core components and how do they interact?"
  • "Show me the project architecture overview"

Function & Class Discovery

  • "What are the main classes and their responsibilities?"
  • "Show me all authentication-related functions"
  • "Find functions that handle file operations"
  • "What utility functions are available?"

Implementation Analysis

  • "How is error handling implemented?"
  • "Show me configuration management code"
  • "Find database-related functions"
  • "How does logging work in this project?"

Code Patterns

  • "Show me decorator implementations"
  • "Find async/await usage patterns"
  • "What design patterns are used?"
  • "How are tests structured?"

🛟 Troubleshooting

Common Issues

"Pinecone API key is required"

  • Make sure PINECONE_API_KEY is set, either in config.py or as an environment variable
  • Or enter it in the Advanced Options section

"Error downloading repository"

  • Check that the GitHub URL is correct and public
  • Ensure you have an internet connection
  • Large repositories may time out; try smaller repos first
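
A quick way to rule out a malformed URL is to parse it before loading. The helper below is a hypothetical sketch using only the standard library; it checks the shape of the URL, not whether the repository exists or is public.

```python
from urllib.parse import urlparse

def parse_github_url(url):
    """Return (owner, repo) for a github.com repository URL, or None otherwise."""
    parsed = urlparse(url.strip())
    if parsed.scheme not in ("http", "https"):
        return None
    if parsed.netloc.lower() not in ("github.com", "www.github.com"):
        return None
    # The path should start with /<owner>/<repo>.
    parts = [p for p in parsed.path.split("/") if p]
    if len(parts) < 2:
        return None
    owner, repo = parts[0], parts[1]
    if repo.endswith(".git"):
        repo = repo[:-4]  # strip the clone suffix
    return owner, repo
```

A result of `None` usually means a missing `https://` prefix or a link that points at a user profile rather than a repository.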

"No chunks generated"

  • Make sure the repository contains supported code files
  • Check that ZIP files aren't corrupted
  • Python files work best currently
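
To check a ZIP before uploading it, the standard `zipfile` module can test the archive and count files per language. This is a sketch under assumptions: `inspect_zip` is a hypothetical helper, and the extension mapping is our guess based on the Supported Languages list, not the tool's actual filter.

```python
import io
import os
import zipfile
from collections import Counter

# Assumed mapping from file extension to language.
EXTENSIONS = {
    ".py": "Python", ".js": "JavaScript", ".ts": "TypeScript",
    ".java": "Java", ".c": "C", ".cpp": "C++", ".go": "Go",
    ".rs": "Rust", ".php": "PHP", ".rb": "Ruby",
}

def inspect_zip(zip_file):
    """Return (first corrupt member or None, Counter of supported files per language)."""
    with zipfile.ZipFile(zip_file) as archive:
        corrupt = archive.testzip()  # name of the first bad member, or None
        counts = Counter()
        for name in archive.namelist():
            extension = os.path.splitext(name)[1].lower()
            if extension in EXTENSIONS:
                counts[EXTENSIONS[extension]] += 1
    return corrupt, counts

# Build a small archive in memory to demonstrate the check.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as demo:
    demo.writestr("app/main.py", "print('hi')")
    demo.writestr("app/server.go", "package main")
    demo.writestr("README.md", "# demo")
buffer.seek(0)

corrupt, counts = inspect_zip(buffer)
```

If `counts` comes back empty, the archive holds no files with a recognized extension, which would explain a "No chunks generated" result.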

"Vector store initialization failed"

  • Verify your Pinecone API key is valid
  • Check your Pinecone account hasn't exceeded free tier limits
  • Try a different environment region if needed

Performance Tips

  • Start with smaller repositories (< 100 files) to test
  • Python repositories work best currently
  • Processing time scales with repository size
  • Queries are fast once processing is complete

🔮 Future Enhancements

  • More Language Support: Better parsing for JavaScript, Java, etc.
  • Code Generation: AI-powered code completion and generation
  • Diff Analysis: Compare changes between repository versions
  • Team Collaboration: Share analyzed repositories
  • Custom Embeddings: Fine-tuned models for specific domains
  • API Integration: REST API for programmatic access

🤝 Contributing

Contributions welcome! Please open issues or submit pull requests.

📞 Support

For issues or questions:

  1. Check the troubleshooting section above
  2. Open a GitHub issue with detailed error messages
  3. Include your Python version and OS information