# GetGit Technical Documentation

## Table of Contents

1. [Project Overview](#project-overview)
2. [Architecture](#architecture)
3. [Backend Flow](#backend-flow)
4. [RAG + LLM Overview](#rag--llm-overview)
5. [Checkpoints System](#checkpoints-system)
6. [UI Interaction Flow](#ui-interaction-flow)
7. [Setup and Run Instructions](#setup-and-run-instructions)
8. [Logging Behavior](#logging-behavior)
9. [API Reference](#api-reference)
10. [Configuration](#configuration)

---

## Project Overview

GetGit is a Python-based repository intelligence system that combines GitHub repository cloning, Retrieval-Augmented Generation (RAG), and Large Language Model (LLM) capabilities to provide intelligent, natural language question-answering over code repositories.

### Key Features

- **Automated Repository Cloning**: Clone and manage GitHub repositories locally
- **RAG-Based Analysis**: Semantic chunking and retrieval of repository content
- **LLM Integration**: Natural language response generation using Google Gemini
- **Checkpoint Validation**: Programmatic validation of repository requirements
- **Web Interface**: Flask-based UI for repository exploration
- **Checkpoint Management**: UI for adding and viewing validation checkpoints

### Use Cases

- Understanding unfamiliar codebases quickly
- Answering questions about project structure and functionality
- Extracting information from documentation and code
- Repository analysis and review
- Validating repository requirements for hackathons or project submissions
- Team collaboration and onboarding

---

## Architecture

GetGit follows a modular architecture with clear separation of concerns:

### System Components

```
┌─────────────────────────────────────────────────────────────┐
│                        Web Browser                          │
│                    (User Interface)                         │
└────────────────────┬────────────────────────────────────────┘
                     │ HTTP Requests
                     ▼
┌─────────────────────────────────────────────────────────────┐
│                    server.py (Flask)                        │
│  - Routes: /initialize, /ask, /checkpoints, etc.            │
│  - Session management                                       │
│  - Request/response handling                                │
└────────────────────┬────────────────────────────────────────┘
                     │ Delegates to
                     ▼
┌─────────────────────────────────────────────────────────────┐
│                    core.py (Orchestration)                  │
│  - initialize_repository()                                  │
│  - setup_rag()                                              │
│  - answer_query()                                           │
│  - validate_checkpoints()                                   │
└────────┬───────────────────┬─────────────────┬──────────────┘
         │                   │                 │
         ▼                   ▼                 ▼
┌─────────────────┐  ┌──────────────┐  ┌─────────────────────┐
│  clone_repo.py  │  │   rag/       │  │  checkpoints.py     │
│  - Repository   │  │   - Chunker  │  │  - Load/validate    │
│    cloning      │  │   - Embedder │  │  - Checkpoint mgmt  │
└─────────────────┘  │   - Retriever│  └─────────────────────┘
                     │   - LLM      │
                     └──────────────┘
```

### 1. Repository Layer (`clone_repo.py`)

Handles GitHub repository cloning and local storage management.

**Key Function:**

```python
clone_repo(github_url, dest_folder='source_repo')
```

### 2. RAG Layer (`rag/` module)

Provides semantic search and context retrieval capabilities.

**Components:**

- **Chunker** (`chunker.py`): Splits repository files into semantic chunks
- **Embedder** (`embedder.py`): Creates vector embeddings (TF-IDF or Transformer-based)
- **Retriever** (`retriever.py`): Performs similarity-based chunk retrieval
- **LLM Connector** (`llm_connector.py`): Integrates with LLMs for response generation
- **Configuration** (`config.py`): Manages RAG settings and parameters

**Supported Chunk Types:**

- Code functions and classes
- Markdown sections
- Documentation blocks
- Configuration files
- Full file content

### 3. Checkpoints Layer (`checkpoints.py`)

Manages checkpoint-based validation of repositories.

**Key Functions:**

- `load_checkpoints()`: Load checkpoints from file
- `evaluate_checkpoint()`: Evaluate a single checkpoint
- `run_checkpoints()`: Run all checkpoints against repository
- `format_results_summary()`: Format results for display

### 4. Orchestration Layer (`core.py`)

Unified entry point that coordinates all components:

1. **Repository Initialization**: Clone or load repository
2. **RAG Setup**: Chunk, embed, and index repository content
3. **Query Processing**: Retrieve context and generate responses
4. **Checkpoint Validation**: Validate repository against requirements

### 5. Web Interface (`server.py`)

Flask-based web application providing a user-friendly interface.

**Routes:**

- `GET /` - Render home page
- `POST /initialize` - Initialize repository and RAG pipeline
- `POST /ask` - Answer questions about repository
- `POST /checkpoints` - Run checkpoint validation
- `GET /checkpoints/list` - List all checkpoints
- `POST /checkpoints/add` - Add new checkpoint
- `GET /status` - Get application status

---

## Backend Flow

### Server.py → Core.py Flow

```
User Request → server.py → core.py → Specialized Modules
```

#### 1. Repository Initialization Flow

```
POST /initialize
    ↓
server.py: initialize()
    ↓
core.py: initialize_repository(repo_url, local_path)
    ↓
clone_repo.py: clone_repo(repo_url, local_path)
    ↓
core.py: setup_rag(repo_path)
    ↓
rag/chunker.py: chunk_repository()
    ↓
rag/embedder.py: create embeddings
    ↓
rag/retriever.py: index_chunks()
    ↓
Return: Retriever instance with indexed chunks
```

#### 2. Question Answering Flow

```
POST /ask
    ↓
server.py: ask_question()
    ↓
core.py: answer_query(query, retriever, use_llm)
    ↓
rag/retriever.py: retrieve(query, top_k)
    ↓
[If use_llm=True]
    ↓
rag/llm_connector.py: generate_response(query, context)
    ↓
Return: {query, retrieved_chunks, context, response, error}
```

#### 3. Checkpoint Validation Flow

```
POST /checkpoints
    ↓
server.py: run_checkpoints()
    ↓
core.py: validate_checkpoints(repo_url, checkpoints_file, use_llm)
    ↓
checkpoints.py: load_checkpoints(file)
    ↓
checkpoints.py: run_checkpoints(checkpoints, repo_path, retriever)
    ↓
[For each checkpoint]
    ↓
checkpoints.py: evaluate_checkpoint(checkpoint, retriever, use_llm)
    ↓
Return: {checkpoints, results, summary, statistics}
```

---

## RAG + LLM Overview

### Retrieval-Augmented Generation (RAG)

RAG combines information retrieval with text generation to provide contextually accurate responses.

**How It Works:**

1. **Indexing Phase** (Setup):
   - Repository files are chunked into semantic units
   - Each chunk is converted to a vector embedding
   - Embeddings are indexed for fast similarity search

2. **Retrieval Phase** (Query):
   - User query is converted to embedding
   - Similar chunks are retrieved using cosine similarity
   - Top-k most relevant chunks are selected

3. **Generation Phase** (Optional, if LLM enabled):
   - Retrieved chunks provide context
   - Context + query sent to LLM
   - LLM generates coherent, contextual response

### LLM Integration

GetGit uses Google Gemini for natural language response generation.

**Features:**

- Provider-agnostic design (easy to add new LLM providers)
- Environment-based API key management
- Error handling and fallback to context-only responses
- Configurable model selection

**Configuration:**

```bash
export GEMINI_API_KEY=your_api_key_here
```

---

## Checkpoints System

The checkpoints system enables programmatic validation of repository requirements.

### How Checkpoints Work

1. **Definition**: Checkpoints are stored in `checkpoints.txt`, one per line
2. **Loading**: System reads and parses checkpoint file
3. **Evaluation**: Each checkpoint is evaluated against the repository
4. **Reporting**: Results include pass/fail status, explanation, and evidence

### Checkpoint Types

1. **File Existence Checks**: Simple file/directory existence validation
   - Example: "Check if the repository has README.md"

2. **Semantic Checks**: Complex requirements using RAG retrieval
   - Example: "Check if RAG model is implemented"

3. **LLM-Enhanced Checks**: Uses LLM reasoning for complex validation
   - Example: "Check if proper error handling is implemented"

### Checkpoints File Format

```
# Comments start with #
1. Check if the repository has README.md
2. Check if RAG model is implemented
3. Check if logging is configured
Check if requirements.txt exists  # Numbering is optional
```

### Managing Checkpoints via UI

The web interface provides checkpoint management:

- **View Checkpoints**: Load and display all checkpoints from file
- **Add Checkpoint**: Add new checkpoints via UI
- **Persistence**: All checkpoints saved to `checkpoints.txt`
- **Server Restart**: Checkpoints persist across server restarts

---

## UI Interaction Flow

### User Journey

1. **Initialize Repository**
   - User enters GitHub repository URL
   - Clicks "Initialize Repository"
   - Backend clones repository and indexes content
   - UI displays success message and chunk count

2. **Manage Checkpoints**
   - User can add new checkpoint requirements
   - User can view existing checkpoints
   - Checkpoints saved to `checkpoints.txt`
   - Available for validation

3. **Ask Questions**
   - User enters natural language question
   - Optionally enables LLM for enhanced responses
   - Backend retrieves relevant code chunks
   - UI displays answer and source chunks

4. **Run Validation**
   - User triggers checkpoint validation
   - Backend evaluates all checkpoints
   - UI displays pass/fail results with explanations

### UI Components

- **Status Messages**: Success, error, and info notifications
- **Loading Indicators**: Spinner during processing
- **Result Boxes**: Formatted display of results
- **Checkpoint List**: Scrollable list of checkpoints
- **Forms**: Input fields for URLs, questions, checkpoints

---

## Setup and Run Instructions

### Prerequisites

- Python 3.6 or higher
- pip package manager
- Git (for repository cloning)

### Installation

1. **Clone GetGit repository:**

   ```bash
   git clone https://github.com/samarthnaikk/getgit.git
   cd getgit
   ```

2. **Install dependencies:**

   ```bash
   pip install -r requirements.txt
   ```

3. **Set up environment variables (optional):**

   ```bash
   # For LLM-powered responses
   export GEMINI_API_KEY=your_api_key_here

   # For production deployment
   ```

### Running the Application

**Development Mode:**

```bash
FLASK_ENV=development python server.py
```

**Production Mode:**

```bash
python server.py
```

The server will start on `http://0.0.0.0:5000`

### Accessing the UI

Open your web browser and navigate to:

```
http://localhost:5000
```

---

## Logging Behavior

GetGit uses Python's standard `logging` module for comprehensive activity tracking.

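
As a rough illustration of how the standard `logging` module can produce lines in the `timestamp - getgit.MODULE - LEVEL - message` shape documented in this section, here is a minimal sketch. The `make_logger` helper is hypothetical and written only for this example; GetGit's own `setup_logging` in `core.py` may be implemented differently.

```python
import io
import logging


def make_logger(module, level="INFO", stream=None):
    """Hypothetical helper (not part of GetGit): build a 'getgit.<module>' logger."""
    logger = logging.getLogger(f"getgit.{module}")
    logger.setLevel(getattr(logging, level.upper()))
    logger.propagate = False  # keep records from also reaching the root logger

    # Drop handlers left over from an earlier call so setup stays idempotent.
    for old_handler in list(logger.handlers):
        logger.removeHandler(old_handler)

    handler = logging.StreamHandler(stream)  # stream=None falls back to sys.stderr
    handler.setFormatter(logging.Formatter(
        fmt="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
        datefmt="%Y-%m-%d %H:%M:%S",
    ))
    logger.addHandler(handler)
    return logger


# Capture output in memory so the effect of the level filter is easy to inspect.
buffer = io.StringIO()
log = make_logger("core", level="INFO", stream=buffer)
log.debug("Suppressed because the logger level is INFO")
log.info("Created %d chunks from repository", 1247)
print(buffer.getvalue(), end="")
```

Only the `INFO` record is written; the `DEBUG` call is filtered out by the logger's level, which is the behavior to expect when the default level is left in place.
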
### Log Levels

- **DEBUG**: Detailed diagnostic information
- **INFO**: General informational messages (default)
- **WARNING**: Warning messages for unexpected situations
- **ERROR**: Error messages for failures

### Log Format

```
YYYY-MM-DD HH:MM:SS - getgit.MODULE - LEVEL - Message
```

Example:

```
2026-01-10 12:34:56 - getgit.core - INFO - Initializing repository from https://github.com/user/repo.git
2026-01-10 12:35:02 - getgit.core - INFO - Created 1247 chunks from repository
2026-01-10 12:35:08 - getgit.server - INFO - Repository initialization completed successfully
```

### Server Logs

Server logs include:

- Request processing
- Route handling
- Success/failure of operations
- Error stack traces (when errors occur)

### Core Module Logs

Core module logs include:

- Repository initialization progress
- RAG pipeline setup stages
- Query processing steps
- Checkpoint validation progress

### Configuring Log Level

**Via Environment:**

```bash
# Not directly supported, modify code or use Python logging config
```

**In Code:**

```python
from core import setup_logging

logger = setup_logging(level="DEBUG")
```

---

## API Reference

### Core Module Functions

#### `initialize_repository(repo_url, local_path='source_repo')`

Clone or load a repository and prepare it for analysis.

**Parameters:**

- `repo_url` (str): GitHub repository URL
- `local_path` (str): Local path for repository storage

**Returns:** str - Path to the cloned/loaded repository

**Example:**

```python
from core import initialize_repository

repo_path = initialize_repository(
    repo_url="https://github.com/user/repo.git",
    local_path="my_repo"
)
```

---

#### `setup_rag(repo_path, repository_name=None, config=None, use_sentence_transformer=False)`

Initialize RAG pipeline with chunking, embeddings, and retrieval.


**Parameters:**

- `repo_path` (str): Path to the repository
- `repository_name` (str, optional): Repository name
- `config` (RAGConfig, optional): RAG configuration
- `use_sentence_transformer` (bool): Use transformer embeddings

**Returns:** Retriever - Configured retriever instance

**Example:**

```python
from core import setup_rag

retriever = setup_rag(repo_path="source_repo")
```

---

#### `answer_query(query, retriever, top_k=5, use_llm=True, api_key=None, model_name='gemini-2.0-flash-exp')`

Retrieve context and generate response for a query.

**Parameters:**

- `query` (str): Natural language question
- `retriever` (Retriever): Configured retriever instance
- `top_k` (int): Number of chunks to retrieve
- `use_llm` (bool): Whether to generate LLM response
- `api_key` (str, optional): API key for LLM
- `model_name` (str): LLM model name

**Returns:** dict - Query results with response and context

**Example:**

```python
from core import answer_query

result = answer_query(
    query="How do I run tests?",
    retriever=retriever,
    top_k=5,
    use_llm=True
)
```

---

#### `validate_checkpoints(repo_url, checkpoints_file='checkpoints.txt', local_path='source_repo', use_llm=True, log_level='INFO', config=None, stop_on_failure=False)`

Validate repository against checkpoints defined in a text file.


**Parameters:**

- `repo_url` (str): GitHub repository URL
- `checkpoints_file` (str): Path to checkpoints file
- `local_path` (str): Local repository storage path
- `use_llm` (bool): Use LLM for evaluation
- `log_level` (str): Logging level
- `config` (RAGConfig, optional): RAG configuration
- `stop_on_failure` (bool): Stop on first failure

**Returns:** dict - Validation results with statistics

**Example:**

```python
from core import validate_checkpoints

result = validate_checkpoints(
    repo_url="https://github.com/user/repo.git",
    checkpoints_file="checkpoints.txt",
    use_llm=True
)
print(result['summary'])
```

---

### Flask API Endpoints

#### `POST /initialize`

Initialize repository and setup RAG pipeline.

**Request Body:**

```json
{
  "repo_url": "https://github.com/user/repo.git"
}
```

**Response:**

```json
{
  "success": true,
  "message": "Repository initialized successfully with 850 chunks",
  "repo_path": "source_repo",
  "chunks_count": 850
}
```

---

#### `POST /ask`

Answer questions about the repository.

**Request Body:**

```json
{
  "query": "What is this project about?",
  "use_llm": true
}
```

**Response:**

```json
{
  "success": true,
  "query": "What is this project about?",
  "response": "This project is a repository intelligence system...",
  "retrieved_chunks": [...],
  "context": "...",
  "error": null
}
```

---

#### `POST /checkpoints`

Run checkpoint validation.

**Request Body:**

```json
{
  "checkpoints_file": "checkpoints.txt",
  "use_llm": true
}
```

**Response:**

```json
{
  "success": true,
  "checkpoints": ["Check if README exists", ...],
  "results": [{
    "checkpoint": "Check if README exists",
    "passed": true,
    "explanation": "...",
    "evidence": "...",
    "score": 1.0
  }],
  "summary": "...",
  "passed_count": 4,
  "total_count": 5,
  "pass_rate": 80.0
}
```

---

#### `GET /checkpoints/list`

List all checkpoints from checkpoints.txt.


**Response:**

```json
{
  "success": true,
  "checkpoints": [
    "Check if the repository has README.md",
    "Check if RAG model is implemented"
  ]
}
```

---

#### `POST /checkpoints/add`

Add a new checkpoint to checkpoints.txt.

**Request Body:**

```json
{
  "checkpoint": "Check if tests are present"
}
```

**Response:**

```json
{
  "success": true,
  "message": "Checkpoint added successfully",
  "checkpoints": [...]
}
```

---

#### `GET /status`

Get current application status.

**Response:**

```json
{
  "initialized": true,
  "repo_url": "https://github.com/user/repo.git",
  "chunks_count": 850
}
```

---

## Configuration

### Environment Variables

- **GEMINI_API_KEY**: API key for Google Gemini LLM (optional)
- **FLASK_ENV**: Set to `development` for debug mode

### RAG Configuration

```python
from rag import RAGConfig

# Use default configuration
config = RAGConfig.default()

# Use documentation-optimized configuration
config = RAGConfig.for_documentation()

# Custom configuration
from rag import ChunkingConfig, EmbeddingConfig

config = RAGConfig(
    chunking=ChunkingConfig(
        file_patterns=['*.py', '*.md'],
        chunk_size=500,
        chunk_overlap=50
    ),
    embedding=EmbeddingConfig(
        model_type='sentence-transformer',
        embedding_dim=384
    )
)
```

### Repository Storage

By default, repositories are cloned to `source_repo/`. This can be customized via the `local_path` parameter.

---

*Last updated: January 2026*