Samarth Naik

GetGit Technical Documentation

Table of Contents

  1. Project Overview
  2. Architecture
  3. Backend Flow
  4. RAG + LLM Overview
  5. Checkpoints System
  6. UI Interaction Flow
  7. Setup and Run Instructions
  8. Logging Behavior
  9. API Reference
  10. Configuration

Project Overview

GetGit is a Python-based repository intelligence system. It clones GitHub repositories, indexes their content with Retrieval-Augmented Generation (RAG), and uses a Large Language Model (LLM) to answer natural language questions about the code.

Key Features

  • Automated Repository Cloning: Clone and manage GitHub repositories locally
  • RAG-Based Analysis: Semantic chunking and retrieval of repository content
  • LLM Integration: Natural language response generation using Google Gemini
  • Checkpoint Validation: Programmatic validation of repository requirements
  • Web Interface: Flask-based UI for repository exploration
  • Checkpoint Management: UI for adding and viewing validation checkpoints

Use Cases

  • Understanding unfamiliar codebases quickly
  • Answering questions about project structure and functionality
  • Extracting information from documentation and code
  • Repository analysis and review
  • Validating repository requirements for hackathons or project submissions
  • Team collaboration and onboarding

Architecture

GetGit follows a modular architecture with clear separation of concerns:

System Components

┌──────────────────────────────────────────────────────────────┐
│                       Web Browser                            │
│                    (User Interface)                          │
└────────────────────┬─────────────────────────────────────────┘
                     │ HTTP Requests
                     ▼
┌──────────────────────────────────────────────────────────────┐
│                    server.py (Flask)                         │
│  - Routes: /initialize, /ask, /checkpoints, etc.             │
│  - Session management                                        │
│  - Request/response handling                                 │
└────────────────────┬─────────────────────────────────────────┘
                     │ Delegates to
                     ▼
┌──────────────────────────────────────────────────────────────┐
│                    core.py (Orchestration)                   │
│  - initialize_repository()                                   │
│  - setup_rag()                                               │
│  - answer_query()                                            │
│  - validate_checkpoints()                                    │
└────────┬───────────────────┬─────────────────┬───────────────┘
         │                   │                 │
         ▼                   ▼                 ▼
┌─────────────────┐  ┌──────────────┐  ┌─────────────────────┐
│  clone_repo.py  │  │   rag/       │  │  checkpoints.py     │
│  - Repository   │  │  - Chunker   │  │  - Load/validate    │
│    cloning      │  │  - Embedder  │  │  - Checkpoint mgmt  │
└─────────────────┘  │  - Retriever │  └─────────────────────┘
                     │  - LLM       │
                     └──────────────┘

1. Repository Layer (clone_repo.py)

Handles GitHub repository cloning and local storage management.

Key Function:

clone_repo(github_url, dest_folder='source_repo')
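The body of clone_repo is not shown in this document. As a rough sketch of how a function with this signature could wrap the git command line, consider the following. The helper name build_clone_command, the depth option, and the reuse of an existing checkout are assumptions made for illustration, not GetGit's actual implementation.

```python
import os
import subprocess


def build_clone_command(github_url, dest_folder="source_repo", depth=None):
    """Return the git command line for cloning github_url into dest_folder."""
    cmd = ["git", "clone"]
    if depth is not None:
        # Shallow clone: fetch only the most recent `depth` commits.
        cmd += ["--depth", str(depth)]
    cmd += [github_url, dest_folder]
    return cmd


def clone_repo_sketch(github_url, dest_folder="source_repo"):
    """Clone the repository unless dest_folder already holds a checkout."""
    if os.path.isdir(os.path.join(dest_folder, ".git")):
        return dest_folder  # assumed behaviour: reuse an existing clone
    # check=True raises CalledProcessError if git exits with an error.
    subprocess.run(build_clone_command(github_url, dest_folder), check=True)
    return dest_folder


command = build_clone_command("https://github.com/user/repo.git")
```

Building the command separately from running it keeps the argument handling easy to test without network access.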

2. RAG Layer (rag/ module)

Provides semantic search and context retrieval capabilities.

Components:

  • Chunker (chunker.py): Splits repository files into semantic chunks
  • Embedder (embedder.py): Creates vector embeddings (TF-IDF or Transformer-based)
  • Retriever (retriever.py): Performs similarity-based chunk retrieval
  • LLM Connector (llm_connector.py): Integrates with LLMs for response generation
  • Configuration (config.py): Manages RAG settings and parameters

Supported Chunk Types:

  • Code functions and classes
  • Markdown sections
  • Documentation blocks
  • Configuration files
  • Full file content
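To illustrate the "code functions and classes" chunk type, here is a minimal sketch that splits Python source into one chunk per top-level definition using the standard ast module. The function name chunk_python_source and the chunk dictionary fields are hypothetical; the real chunker.py may use a different structure. Note that end_lineno requires Python 3.8 or later.

```python
import ast


def chunk_python_source(source, file_path="<memory>"):
    """Split Python source into one chunk per top-level function or class."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            kind = "class" if isinstance(node, ast.ClassDef) else "function"
            # lineno is 1-based and end_lineno is inclusive.
            text = "\n".join(lines[node.lineno - 1:node.end_lineno])
            chunks.append({
                "type": kind,
                "name": node.name,
                "file": file_path,
                "start_line": node.lineno,
                "end_line": node.end_lineno,
                "text": text,
            })
    return chunks


sample = "def a():\n    return 1\n\nclass B:\n    pass\n"
chunks = chunk_python_source(sample)
```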

3. Checkpoints Layer (checkpoints.py)

Manages checkpoint-based validation of repositories.

Key Functions:

  • load_checkpoints(): Load checkpoints from file
  • evaluate_checkpoint(): Evaluate a single checkpoint
  • run_checkpoints(): Run all checkpoints against repository
  • format_results_summary(): Format results for display

4. Orchestration Layer (core.py)

Unified entry point that coordinates all components:

  1. Repository Initialization: Clone or load repository
  2. RAG Setup: Chunk, embed, and index repository content
  3. Query Processing: Retrieve context and generate responses
  4. Checkpoint Validation: Validate repository against requirements

5. Web Interface (server.py)

Flask-based web application providing a user-friendly interface.

Routes:

  • GET / - Render home page
  • POST /initialize - Initialize repository and RAG pipeline
  • POST /ask - Answer questions about repository
  • POST /checkpoints - Run checkpoint validation
  • GET /checkpoints/list - List all checkpoints
  • POST /checkpoints/add - Add new checkpoint
  • GET /status - Get application status

Backend Flow

server.py → core.py Flow

User Request → server.py → core.py → Specialized Modules

1. Repository Initialization Flow

POST /initialize
  ↓
server.py: initialize()
  ↓
core.py: initialize_repository(repo_url, local_path)
  ↓
clone_repo.py: clone_repo(repo_url, local_path)
  ↓
core.py: setup_rag(repo_path)
  ↓
rag/chunker.py: chunk_repository()
  ↓
rag/embedder.py: create embeddings
  ↓
rag/retriever.py: index_chunks()
  ↓
Return: Retriever instance with indexed chunks

2. Question Answering Flow

POST /ask
  ↓
server.py: ask_question()
  ↓
core.py: answer_query(query, retriever, use_llm)
  ↓
rag/retriever.py: retrieve(query, top_k)
  ↓
[If use_llm=True]
  ↓
rag/llm_connector.py: generate_response(query, context)
  ↓
Return: {query, retrieved_chunks, context, response, error}
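The question answering flow can be sketched as a single function that always retrieves and only optionally generates. This is an illustrative outline, not the code in core.py: the name answer_query_sketch, the callable parameters retrieve and generate, and the representation of chunks as plain strings are all assumptions. Only the keys of the returned dictionary come from the flow above.

```python
def answer_query_sketch(query, retrieve, generate=None, top_k=5):
    """Retrieve context for a query and optionally generate an LLM response.

    retrieve(query, top_k) -> list of chunk strings (stand-in for the retriever)
    generate(query, context) -> str (stand-in for the LLM connector)
    """
    chunks = retrieve(query, top_k)
    context = "\n\n".join(chunks)
    result = {
        "query": query,
        "retrieved_chunks": chunks,
        "context": context,
        "response": None,
        "error": None,
    }
    if generate is not None:
        try:
            result["response"] = generate(query, context)
        except Exception as exc:
            # Keep the retrieved context and report the failure.
            result["error"] = str(exc)
    return result


def fake_retrieve(query, top_k):
    return ["def run_tests(): ...", "Run pytest from the project root."][:top_k]


def broken_llm(query, context):
    raise RuntimeError("LLM unavailable")


ok = answer_query_sketch("How do I run tests?", fake_retrieve)
failed = answer_query_sketch("How do I run tests?", fake_retrieve, generate=broken_llm)
```

Because retrieval happens before generation, a failing LLM call still leaves the caller with usable context.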

3. Checkpoint Validation Flow

POST /checkpoints
  ↓
server.py: run_checkpoints()
  ↓
core.py: validate_checkpoints(repo_url, checkpoints_file, use_llm)
  ↓
checkpoints.py: load_checkpoints(file)
  ↓
checkpoints.py: run_checkpoints(checkpoints, repo_path, retriever)
  ↓
[For each checkpoint]
  ↓
checkpoints.py: evaluate_checkpoint(checkpoint, retriever, use_llm)
  ↓
Return: {checkpoints, results, summary, statistics}

RAG + LLM Overview

Retrieval-Augmented Generation (RAG)

RAG combines information retrieval with text generation so that responses are grounded in content retrieved from the repository rather than in the model's general knowledge alone.

How It Works:

  1. Indexing Phase (Setup):

    • Repository files are chunked into semantic units
    • Each chunk is converted to a vector embedding
    • Embeddings are indexed for fast similarity search
  2. Retrieval Phase (Query):

    • User query is converted to embedding
    • Similar chunks are retrieved using cosine similarity
    • Top-k most relevant chunks are selected
  3. Generation Phase (Optional, if LLM enabled):

    • Retrieved chunks provide context
    • Context + query sent to LLM
    • LLM generates coherent, contextual response
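The indexing and retrieval phases above can be illustrated with a small TF-IDF retriever written against the standard library only. This is a sketch under stated assumptions: the function names, the tokenizer, and the smoothed IDF formula are chosen for illustration and are not taken from embedder.py or retriever.py.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z0-9_]+", text.lower())


def build_index(chunks):
    """Indexing phase: return (idf, vectors) for a list of chunk strings."""
    docs = [Counter(tokenize(c)) for c in chunks]
    n = len(docs)
    # Document frequency: each chunk counts a term at most once.
    df = Counter(term for doc in docs for term in doc)
    # Smoothed IDF, so terms present in every chunk keep a positive weight.
    idf = {t: math.log((1 + n) / (1 + d)) + 1.0 for t, d in df.items()}
    vectors = [{t: tf * idf[t] for t, tf in doc.items()} for doc in docs]
    return idf, vectors


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(query, chunks, idf, vectors, top_k=5):
    """Retrieval phase: embed the query and return the top_k (chunk, score) pairs."""
    counts = Counter(tokenize(query))
    qvec = {t: tf * idf[t] for t, tf in counts.items() if t in idf}
    scored = sorted(
        ((cosine(qvec, v), i) for i, v in enumerate(vectors)),
        key=lambda s: (-s[0], s[1]),
    )
    return [(chunks[i], score) for score, i in scored[:top_k]]


chunks = [
    "def clone_repo(url): run git clone",
    "Flask server routes for ask and initialize",
    "logging setup with levels",
]
idf, vectors = build_index(chunks)
top = retrieve("how does git clone work", chunks, idf, vectors, top_k=1)
```

Query terms that never occurred during indexing are ignored, so a query with no known terms scores 0.0 against every chunk.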

LLM Integration

GetGit uses Google Gemini for natural language response generation.

Features:

  • Provider-agnostic design (easy to add new LLM providers)
  • Environment-based API key management
  • Error handling and fallback to context-only responses
  • Configurable model selection

Configuration:

export GEMINI_API_KEY=your_api_key_here
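One way to combine environment-based key management with the fallback to context-only responses is sketched below. The helper names resolve_api_key and choose_mode are hypothetical and do not come from llm_connector.py; only the GEMINI_API_KEY variable name is taken from this document.

```python
import os


def resolve_api_key(api_key=None, env_var="GEMINI_API_KEY"):
    """Prefer an explicit key, then the environment; return None if neither is set."""
    key = api_key or os.environ.get(env_var)
    return key.strip() if key and key.strip() else None


def choose_mode(use_llm, api_key=None):
    """Return 'llm' when generation is possible, otherwise 'context-only'."""
    if use_llm and resolve_api_key(api_key):
        return "llm"
    return "context-only"
```

With this shape, a missing key degrades the answer to retrieved context instead of raising an error.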

Checkpoints System

The checkpoints system enables programmatic validation of repository requirements.

How Checkpoints Work

  1. Definition: Checkpoints are stored in checkpoints.txt, one per line
  2. Loading: System reads and parses checkpoint file
  3. Evaluation: Each checkpoint is evaluated against the repository
  4. Reporting: Results include pass/fail status, explanation, and evidence

Checkpoint Types

  1. File Existence Checks: Simple file/directory existence validation

    • Example: "Check if the repository has README.md"
  2. Semantic Checks: Complex requirements using RAG retrieval

    • Example: "Check if RAG model is implemented"
  3. LLM-Enhanced Checks: Uses LLM reasoning for complex validation

    • Example: "Check if proper error handling is implemented"
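A file existence check is the simplest of the three types and needs neither retrieval nor an LLM. The sketch below assumes that a checkpoint is treated as a file check when its text contains something that looks like a file name; the regular expression, the function name, and that routing rule are illustrative assumptions. The result fields mirror those shown for POST /checkpoints in the API Reference.

```python
import os
import re
import tempfile

# Matches tokens that look like file names, e.g. README.md or requirements.txt.
FILENAME_RE = re.compile(r"\b[\w\-./]+\.[A-Za-z0-9]+\b")


def file_existence_check(checkpoint, repo_path):
    """Return a result dict if the checkpoint names a file, else None."""
    match = FILENAME_RE.search(checkpoint)
    if match is None:
        return None  # not a file check; leave it to semantic or LLM evaluation
    name = match.group(0)
    found = os.path.exists(os.path.join(repo_path, name))
    return {
        "checkpoint": checkpoint,
        "passed": found,
        "explanation": f"{name} {'exists' if found else 'was not found'} in the repository",
        "evidence": name if found else "",
        "score": 1.0 if found else 0.0,
    }


with tempfile.TemporaryDirectory() as repo:
    with open(os.path.join(repo, "README.md"), "w") as fh:
        fh.write("# demo\n")
    readme = file_existence_check("Check if the repository has README.md", repo)
    missing = file_existence_check("Check if requirements.txt exists", repo)
    semantic = file_existence_check("Check if RAG model is implemented", repo)
```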

Checkpoints File Format

# Comments start with #
1. Check if the repository has README.md
2. Check if RAG model is implemented
3. Check if logging is configured
Check if requirements.txt exists  # Numbering is optional
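A parser for this format could look like the sketch below. It is not the load_checkpoints() implementation: the function name parse_checkpoints is hypothetical, and the treatment of text after " #" as a trailing comment is an assumption inferred from the last line of the example.

```python
import re


def parse_checkpoints(text):
    """Parse checkpoint file content into a list of checkpoint strings."""
    checkpoints = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and full-line comments
        # Assumption: whitespace followed by "#" starts a trailing comment.
        line = re.sub(r"\s+#.*$", "", line)
        # Drop optional leading numbering such as "1. " or "12) ".
        line = re.sub(r"^\d+[.)]\s*", "", line)
        if line:
            checkpoints.append(line)
    return checkpoints


example = """# Comments start with #
1. Check if the repository has README.md
2. Check if RAG model is implemented
3. Check if logging is configured
Check if requirements.txt exists  # Numbering is optional
"""
parsed = parse_checkpoints(example)
```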

Managing Checkpoints via UI

The web interface provides checkpoint management:

  • View Checkpoints: Load and display all checkpoints from file
  • Add Checkpoint: Add new checkpoints via UI
  • Persistence: All checkpoints saved to checkpoints.txt
  • Server Restart: Checkpoints persist across server restarts

UI Interaction Flow

User Journey

  1. Initialize Repository

    • User enters GitHub repository URL
    • Clicks "Initialize Repository"
    • Backend clones repository and indexes content
    • UI displays success message and chunk count
  2. Manage Checkpoints

    • User can add new checkpoint requirements
    • User can view existing checkpoints
    • Checkpoints saved to checkpoints.txt
    • Available for validation
  3. Ask Questions

    • User enters natural language question
    • Optionally enables LLM for enhanced responses
    • Backend retrieves relevant code chunks
    • UI displays answer and source chunks
  4. Run Validation

    • User triggers checkpoint validation
    • Backend evaluates all checkpoints
    • UI displays pass/fail results with explanations

UI Components

  • Status Messages: Success, error, and info notifications
  • Loading Indicators: Spinner during processing
  • Result Boxes: Formatted display of results
  • Checkpoint List: Scrollable list of checkpoints
  • Forms: Input fields for URLs, questions, checkpoints

Setup and Run Instructions

Prerequisites

  • Python 3.6 or higher
  • pip package manager
  • Git (for repository cloning)

Installation

  1. Clone GetGit repository:

    git clone https://github.com/samarthnaikk/getgit.git
    cd getgit
    
  2. Install dependencies:

    pip install -r requirements.txt
    
  3. Set up environment variables (optional):

    # For LLM-powered responses
    export GEMINI_API_KEY=your_api_key_here
    
    # For production deployment
    

Running the Application

Development Mode:

FLASK_ENV=development python server.py

Production Mode:

python server.py

The server listens on 0.0.0.0:5000, that is, port 5000 on all network interfaces.

Accessing the UI

Open your web browser and navigate to:

http://localhost:5000

Logging Behavior

GetGit uses Python's standard logging module for comprehensive activity tracking.

Log Levels

  • DEBUG: Detailed diagnostic information
  • INFO: General informational messages (default)
  • WARNING: Warning messages for unexpected situations
  • ERROR: Error messages for failures

Log Format

YYYY-MM-DD HH:MM:SS - getgit.MODULE - LEVEL - Message

Example:

2026-01-10 12:34:56 - getgit.core - INFO - Initializing repository from https://github.com/user/repo.git
2026-01-10 12:35:02 - getgit.core - INFO - Created 1247 chunks from repository
2026-01-10 12:35:08 - getgit.server - INFO - Repository initialization completed successfully
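The format above corresponds to a standard logging.Formatter pattern. The following sketch reproduces it; the helper name make_logger is hypothetical, and the actual setup_logging function in core.py may be structured differently.

```python
import logging

LOG_FORMAT = "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
DATE_FORMAT = "%Y-%m-%d %H:%M:%S"


def make_logger(module, level="INFO"):
    """Return a logger named getgit.<module> that writes in the documented format."""
    logger = logging.getLogger(f"getgit.{module}")
    logger.setLevel(getattr(logging, level.upper()))
    if not logger.handlers:  # avoid duplicate handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(logging.Formatter(LOG_FORMAT, datefmt=DATE_FORMAT))
        logger.addHandler(handler)
    return logger


logger = make_logger("core")
logger.info("Created %d chunks from repository", 1247)
```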

Server Logs

Server logs include:

  • Request processing
  • Route handling
  • Success/failure of operations
  • Error stack traces (when errors occur)

Core Module Logs

Core module logs include:

  • Repository initialization progress
  • RAG pipeline setup stages
  • Query processing steps
  • Checkpoint validation progress

Configuring Log Level

Via Environment:

# Not directly supported; change the level in code or use Python's logging configuration

In Code:

from core import setup_logging
logger = setup_logging(level="DEBUG")
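If an environment-driven log level is wanted, a small wrapper can bridge the gap. The variable name GETGIT_LOG_LEVEL and the helper level_from_env are hypothetical; GetGit does not define them. The sketch only validates the value and falls back to a default.

```python
import logging
import os

VALID_LEVELS = ("DEBUG", "INFO", "WARNING", "ERROR")


def level_from_env(env_var="GETGIT_LOG_LEVEL", default="INFO"):
    """Return a valid level name from the environment, or the default."""
    name = os.environ.get(env_var, default).upper()
    return name if name in VALID_LEVELS else default


# Logger.setLevel accepts level names as strings.
logging.getLogger("getgit").setLevel(level_from_env())
```

The returned name could equally be passed to setup_logging(level=...) as shown in the "In Code" example.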

API Reference

Core Module Functions

initialize_repository(repo_url, local_path='source_repo')

Clone or load a repository and prepare it for analysis.

Parameters:

  • repo_url (str): GitHub repository URL
  • local_path (str): Local path for repository storage

Returns: str - Path to the cloned/loaded repository

Example:

from core import initialize_repository
repo_path = initialize_repository(
    repo_url="https://github.com/user/repo.git",
    local_path="my_repo"
)

setup_rag(repo_path, repository_name=None, config=None, use_sentence_transformer=False)

Initialize RAG pipeline with chunking, embeddings, and retrieval.

Parameters:

  • repo_path (str): Path to the repository
  • repository_name (str, optional): Repository name
  • config (RAGConfig, optional): RAG configuration
  • use_sentence_transformer (bool): Use transformer embeddings

Returns: Retriever - Configured retriever instance

Example:

from core import setup_rag
retriever = setup_rag(repo_path="source_repo")

answer_query(query, retriever, top_k=5, use_llm=True, api_key=None, model_name='gemini-2.0-flash-exp')

Retrieve context and generate response for a query.

Parameters:

  • query (str): Natural language question
  • retriever (Retriever): Configured retriever instance
  • top_k (int): Number of chunks to retrieve
  • use_llm (bool): Whether to generate LLM response
  • api_key (str, optional): API key for LLM
  • model_name (str): LLM model name

Returns: dict - Query results with response and context

Example:

from core import answer_query
result = answer_query(
    query="How do I run tests?",
    retriever=retriever,
    top_k=5,
    use_llm=True
)

validate_checkpoints(repo_url, checkpoints_file='checkpoints.txt', local_path='source_repo', use_llm=True, log_level='INFO', config=None, stop_on_failure=False)

Validate repository against checkpoints defined in a text file.

Parameters:

  • repo_url (str): GitHub repository URL
  • checkpoints_file (str): Path to checkpoints file
  • local_path (str): Local repository storage path
  • use_llm (bool): Use LLM for evaluation
  • log_level (str): Logging level
  • config (RAGConfig, optional): RAG configuration
  • stop_on_failure (bool): Stop on first failure

Returns: dict - Validation results with statistics

Example:

from core import validate_checkpoints
result = validate_checkpoints(
    repo_url="https://github.com/user/repo.git",
    checkpoints_file="checkpoints.txt",
    use_llm=True
)
print(result['summary'])

Flask API Endpoints

POST /initialize

Initialize repository and setup RAG pipeline.

Request Body:

{
  "repo_url": "https://github.com/user/repo.git"
}

Response:

{
  "success": true,
  "message": "Repository initialized successfully with 850 chunks",
  "repo_path": "source_repo",
  "chunks_count": 850
}
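A client can call this endpoint with the standard library alone. The sketch below assumes the server runs at http://localhost:5000 and accepts a JSON body with a Content-Type of application/json; the helper names build_post and send are hypothetical. The request is built separately so it can be inspected without a running server.

```python
import json
import urllib.request

BASE_URL = "http://localhost:5000"  # assumed local development address


def build_post(path, payload, base_url=BASE_URL):
    """Build a JSON POST request for a GetGit endpoint without sending it."""
    return urllib.request.Request(
        base_url + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=300):
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


req = build_post("/initialize", {"repo_url": "https://github.com/user/repo.git"})
# result = send(req)  # requires a running GetGit server
```

The same two helpers work for POST /ask, POST /checkpoints, and POST /checkpoints/add by changing the path and payload.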

POST /ask

Answer questions about the repository.

Request Body:

{
  "query": "What is this project about?",
  "use_llm": true
}

Response:

{
  "success": true,
  "query": "What is this project about?",
  "response": "This project is a repository intelligence system...",
  "retrieved_chunks": [...],
  "context": "...",
  "error": null
}

POST /checkpoints

Run checkpoint validation.

Request Body:

{
  "checkpoints_file": "checkpoints.txt",
  "use_llm": true
}

Response:

{
  "success": true,
  "checkpoints": ["Check if README exists", ...],
  "results": [{
    "checkpoint": "Check if README exists",
    "passed": true,
    "explanation": "...",
    "evidence": "...",
    "score": 1.0
  }],
  "summary": "...",
  "passed_count": 4,
  "total_count": 5,
  "pass_rate": 80.0
}
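The three statistics fields in this response are related by simple arithmetic: 4 passed out of 5 gives a pass rate of 80.0. A sketch of that calculation follows; the function name summarize and the rounding to one decimal place are assumptions.

```python
def summarize(results):
    """Compute pass statistics from a list of result dicts with a 'passed' flag."""
    passed = sum(1 for r in results if r["passed"])
    total = len(results)
    # Guard against division by zero when no checkpoints were run.
    rate = round(100.0 * passed / total, 1) if total else 0.0
    return {"passed_count": passed, "total_count": total, "pass_rate": rate}


results = [{"checkpoint": f"c{i}", "passed": i != 3} for i in range(5)]
stats = summarize(results)
```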

GET /checkpoints/list

List all checkpoints from checkpoints.txt.

Response:

{
  "success": true,
  "checkpoints": [
    "Check if the repository has README.md",
    "Check if RAG model is implemented"
  ]
}

POST /checkpoints/add

Add a new checkpoint to checkpoints.txt.

Request Body:

{
  "checkpoint": "Check if tests are present"
}

Response:

{
  "success": true,
  "message": "Checkpoint added successfully",
  "checkpoints": [...]
}

GET /status

Get current application status.

Response:

{
  "initialized": true,
  "repo_url": "https://github.com/user/repo.git",
  "chunks_count": 850
}

Configuration

Environment Variables

  • GEMINI_API_KEY: API key for Google Gemini LLM (optional)

  • FLASK_ENV: Set to development for debug mode

RAG Configuration

from rag import RAGConfig

# Use default configuration
config = RAGConfig.default()

# Use documentation-optimized configuration
config = RAGConfig.for_documentation()

# Custom configuration
from rag import ChunkingConfig, EmbeddingConfig

config = RAGConfig(
    chunking=ChunkingConfig(
        file_patterns=['*.py', '*.md'],
        chunk_size=500,
        chunk_overlap=50
    ),
    embedding=EmbeddingConfig(
        model_type='sentence-transformer',
        embedding_dim=384
    )
)
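To show how chunk_size and chunk_overlap interact, here is a sliding-window sketch. It assumes both values are measured in characters, which this document does not state; the real chunker may count tokens or lines instead, and the function name sliding_window is hypothetical.

```python
def sliding_window(text, chunk_size=500, chunk_overlap=50):
    """Split text into windows of chunk_size that share chunk_overlap characters."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("require chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    # Stop once the remaining tail is already covered by the previous window.
    stop = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, stop, step)]


text = "".join(chr(97 + i % 26) for i in range(1000))
windows = sliding_window(text, chunk_size=500, chunk_overlap=50)
```

Each window starts chunk_size minus chunk_overlap characters after the previous one, so consecutive chunks share context across their boundary.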

Repository Storage

By default, repositories are cloned to source_repo/. This can be customized via the local_path parameter.


Last updated: January 2026
