GetGit Technical Documentation
Table of Contents
- Project Overview
- Architecture
- Backend Flow
- RAG + LLM Overview
- Checkpoints System
- UI Interaction Flow
- Setup and Run Instructions
- Logging Behavior
- API Reference
- Configuration
Project Overview
GetGit is a Python-based repository intelligence system that combines GitHub repository cloning, Retrieval-Augmented Generation (RAG), and Large Language Model (LLM) capabilities to provide intelligent, natural language question-answering over code repositories.
Key Features
- Automated Repository Cloning: Clone and manage GitHub repositories locally
- RAG-Based Analysis: Semantic chunking and retrieval of repository content
- LLM Integration: Natural language response generation using Google Gemini
- Checkpoint Validation: Programmatic validation of repository requirements
- Web Interface: Flask-based UI for repository exploration
- Checkpoint Management: UI for adding and viewing validation checkpoints
Use Cases
- Understanding unfamiliar codebases quickly
- Answering questions about project structure and functionality
- Extracting information from documentation and code
- Repository analysis and review
- Validating repository requirements for hackathons or project submissions
- Team collaboration and onboarding
Architecture
GetGit follows a modular architecture with clear separation of concerns:
System Components
┌─────────────────────────────────────────────────────────────┐
│                         Web Browser                         │
│                      (User Interface)                       │
└────────────────────┬────────────────────────────────────────┘
                     │ HTTP Requests
                     ▼
┌─────────────────────────────────────────────────────────────┐
│                      server.py (Flask)                      │
│  - Routes: /initialize, /ask, /checkpoints, etc.            │
│  - Session management                                       │
│  - Request/response handling                                │
└────────────────────┬────────────────────────────────────────┘
                     │ Delegates to
                     ▼
┌─────────────────────────────────────────────────────────────┐
│                   core.py (Orchestration)                   │
│  - initialize_repository()                                  │
│  - setup_rag()                                              │
│  - answer_query()                                           │
│  - validate_checkpoints()                                   │
└────────┬───────────────────┬─────────────────┬──────────────┘
         │                   │                 │
         ▼                   ▼                 ▼
┌─────────────────┐  ┌──────────────┐  ┌─────────────────────┐
│  clone_repo.py  │  │    rag/      │  │  checkpoints.py     │
│  - Repository   │  │  - Chunker   │  │  - Load/validate    │
│    cloning      │  │  - Embedder  │  │  - Checkpoint mgmt  │
└─────────────────┘  │  - Retriever │  └─────────────────────┘
                     │  - LLM       │
                     └──────────────┘
1. Repository Layer (clone_repo.py)
Handles GitHub repository cloning and local storage management.
Key Function:
clone_repo(github_url, dest_folder='source_repo')
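A minimal sketch of what a function with this signature might do, assuming the system `git` client is on the PATH. The reuse-if-present behaviour is an assumption based on the "clone or load" wording used elsewhere in this document; the actual clone_repo.py may differ.

```python
import subprocess
from pathlib import Path


def clone_repo(github_url: str, dest_folder: str = "source_repo") -> str:
    """Clone github_url into dest_folder, reusing the folder if it is already populated."""
    dest = Path(dest_folder)
    # Assumption: a non-empty destination means the repository was cloned earlier.
    if dest.is_dir() and any(dest.iterdir()):
        return str(dest)
    # Runs the system git client; raises CalledProcessError if the clone fails.
    subprocess.run(["git", "clone", github_url, str(dest)], check=True)
    return str(dest)
```

Passing `check=True` turns a failed clone into an exception, so a caller such as the orchestration layer can report the error instead of indexing an empty folder.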
2. RAG Layer (rag/ module)
Provides semantic search and context retrieval capabilities.
Components:
- Chunker (chunker.py): Splits repository files into semantic chunks
- Embedder (embedder.py): Creates vector embeddings (TF-IDF or Transformer-based)
- Retriever (retriever.py): Performs similarity-based chunk retrieval
- LLM Connector (llm_connector.py): Integrates with LLMs for response generation
- Configuration (config.py): Manages RAG settings and parameters
Supported Chunk Types:
- Code functions and classes
- Markdown sections
- Documentation blocks
- Configuration files
- Full file content
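To illustrate one of the chunk types above, here is a rough sketch of splitting a Markdown file into one chunk per heading-delimited section. The function name and the dictionary layout are hypothetical and are not taken from chunker.py.

```python
import re
from typing import Dict, List


def chunk_markdown_sections(text: str) -> List[Dict[str, str]]:
    """Split Markdown text into one chunk per heading-delimited section."""
    chunks: List[Dict[str, str]] = []
    title = "(preamble)"  # label for text that appears before the first heading
    body: List[str] = []
    for line in text.splitlines():
        match = re.match(r"^(#{1,6})\s+(.*)$", line)
        if match:
            # A new heading closes the previous section, if it had any content.
            if any(part.strip() for part in body):
                chunks.append({"title": title, "content": "\n".join(body).strip()})
            title = match.group(2).strip()
            body = []
        else:
            body.append(line)
    # Flush the final section.
    if any(part.strip() for part in body):
        chunks.append({"title": title, "content": "\n".join(body).strip()})
    return chunks
```

This sketch treats any line starting with `#` as a heading, so a shell comment inside a fenced code block would be split incorrectly; a real chunker would need to track fences.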
3. Checkpoints Layer (checkpoints.py)
Manages checkpoint-based validation of repositories.
Key Functions:
- load_checkpoints(): Load checkpoints from file
- evaluate_checkpoint(): Evaluate a single checkpoint
- run_checkpoints(): Run all checkpoints against repository
- format_results_summary(): Format results for display
4. Orchestration Layer (core.py)
Unified entry point that coordinates all components:
- Repository Initialization: Clone or load repository
- RAG Setup: Chunk, embed, and index repository content
- Query Processing: Retrieve context and generate responses
- Checkpoint Validation: Validate repository against requirements
5. Web Interface (server.py)
Flask-based web application providing a user-friendly interface.
Routes:
- GET / - Render home page
- POST /initialize - Initialize repository and RAG pipeline
- POST /ask - Answer questions about repository
- POST /checkpoints - Run checkpoint validation
- GET /checkpoints/list - List all checkpoints
- POST /checkpoints/add - Add new checkpoint
- GET /status - Get application status
Backend Flow
server.py → core.py Flow
User Request → server.py → core.py → Specialized Modules
1. Repository Initialization Flow
POST /initialize
↓
server.py: initialize()
↓
core.py: initialize_repository(repo_url, local_path)
↓
clone_repo.py: clone_repo(repo_url, local_path)
↓
core.py: setup_rag(repo_path)
↓
rag/chunker.py: chunk_repository()
↓
rag/embedder.py: create embeddings
↓
rag/retriever.py: index_chunks()
↓
Return: Retriever instance with indexed chunks
2. Question Answering Flow
POST /ask
↓
server.py: ask_question()
↓
core.py: answer_query(query, retriever, use_llm)
↓
rag/retriever.py: retrieve(query, top_k)
↓
[If use_llm=True]
↓
rag/llm_connector.py: generate_response(query, context)
↓
Return: {query, retrieved_chunks, context, response, error}
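The question answering flow can be sketched roughly as follows. The retriever and the LLM are passed in as plain callables so the example runs on its own; `answer_query_sketch` and its parameters are hypothetical stand-ins, and only the keys of the returned dictionary come from the flow above.

```python
from typing import Any, Callable, Dict, List, Optional


def answer_query_sketch(
    query: str,
    retrieve: Callable[[str, int], List[str]],
    generate: Optional[Callable[[str, str], str]] = None,
    top_k: int = 5,
) -> Dict[str, Any]:
    """Retrieve context for a query and optionally ask an LLM to answer it."""
    result: Dict[str, Any] = {
        "query": query,
        "retrieved_chunks": [],
        "context": "",
        "response": None,
        "error": None,
    }
    chunks = retrieve(query, top_k)
    result["retrieved_chunks"] = chunks
    result["context"] = "\n\n".join(chunks)
    if generate is not None:  # corresponds to use_llm=True
        try:
            result["response"] = generate(query, result["context"])
        except Exception as exc:
            # Fall back to a context-only result and record why the LLM step failed.
            result["error"] = str(exc)
    return result
```

Keeping the retrieved context in the result even when generation fails is what makes a context-only fallback possible.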
3. Checkpoint Validation Flow
POST /checkpoints
↓
server.py: run_checkpoints()
↓
core.py: validate_checkpoints(repo_url, checkpoints_file, use_llm)
↓
checkpoints.py: load_checkpoints(file)
↓
checkpoints.py: run_checkpoints(checkpoints, repo_path, retriever)
↓
[For each checkpoint]
↓
checkpoints.py: evaluate_checkpoint(checkpoint, retriever, use_llm)
↓
Return: {checkpoints, results, summary, statistics}
RAG + LLM Overview
Retrieval-Augmented Generation (RAG)
RAG combines information retrieval with text generation to provide contextually accurate responses.
How It Works:
Indexing Phase (Setup):
- Repository files are chunked into semantic units
- Each chunk is converted to a vector embedding
- Embeddings are indexed for fast similarity search
Retrieval Phase (Query):
- User query is converted to embedding
- Similar chunks are retrieved using cosine similarity
- Top-k most relevant chunks are selected
Generation Phase (Optional, if LLM enabled):
- Retrieved chunks provide context
- Context + query sent to LLM
- LLM generates coherent, contextual response
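The indexing and retrieval phases can be illustrated with a small, dependency-free TF-IDF example. This is a sketch of the general technique, not the code in rag/embedder.py or rag/retriever.py; every function name here is hypothetical, and the smoothed IDF formula is one common choice among several.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple


def tokenize(text: str) -> List[str]:
    return re.findall(r"[a-z0-9_]+", text.lower())


def vectorize(tokens: List[str], idf: Dict[str, float]) -> Dict[str, float]:
    """Sparse TF-IDF vector; terms unseen at indexing time are ignored."""
    counts = Counter(tokens)
    return {term: count * idf[term] for term, count in counts.items() if term in idf}


def build_index(chunks: List[str]) -> Tuple[List[Dict[str, float]], Dict[str, float]]:
    """Indexing phase: one TF-IDF vector per chunk, plus the IDF table."""
    tokenized = [tokenize(chunk) for chunk in chunks]
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    n = len(chunks)
    idf = {term: math.log((1 + n) / (1 + df)) + 1.0 for term, df in doc_freq.items()}
    return [vectorize(tokens, idf) for tokens in tokenized], idf


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, chunks, vectors, idf, top_k=5):
    """Retrieval phase: embed the query and return the top_k (score, chunk) pairs."""
    query_vector = vectorize(tokenize(query), idf)
    scored = [(cosine(query_vector, vector), chunk) for vector, chunk in zip(vectors, chunks)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]
```

The retrieved chunks would then be joined into the context that is sent to the LLM together with the query.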
LLM Integration
GetGit uses Google Gemini for natural language response generation.
Features:
- Provider-agnostic design (easy to add new LLM providers)
- Environment-based API key management
- Error handling and fallback to context-only responses
- Configurable model selection
Configuration:
export GEMINI_API_KEY=your_api_key_here
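One way environment-based key handling might look is sketched below. Only the `GEMINI_API_KEY` variable name comes from this document; the helper names are hypothetical.

```python
import os
from typing import Optional


def resolve_api_key(explicit_key: Optional[str] = None) -> Optional[str]:
    """Prefer an explicitly passed key, then fall back to GEMINI_API_KEY."""
    key = explicit_key or os.environ.get("GEMINI_API_KEY")
    return key.strip() if key and key.strip() else None


def llm_available(explicit_key: Optional[str] = None) -> bool:
    """True when a key is present; otherwise a caller would return context only."""
    return resolve_api_key(explicit_key) is not None
```

Checking for the key before calling the provider lets the application skip the LLM step cleanly instead of failing partway through a request.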
Checkpoints System
The checkpoints system enables programmatic validation of repository requirements.
How Checkpoints Work
- Definition: Checkpoints are stored in checkpoints.txt, one per line
- Loading: System reads and parses checkpoint file
- Evaluation: Each checkpoint is evaluated against the repository
- Reporting: Results include pass/fail status, explanation, and evidence
Checkpoint Types
File Existence Checks: Simple file/directory existence validation
- Example: "Check if the repository has README.md"
Semantic Checks: Complex requirements using RAG retrieval
- Example: "Check if RAG model is implemented"
LLM-Enhanced Checks: Uses LLM reasoning for complex validation
- Example: "Check if proper error handling is implemented"
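As an illustration of the simplest checkpoint type, the sketch below looks for a file name in the checkpoint text and tests whether that file exists in the repository. The regular expression and both function names are assumptions made for this example; how checkpoints.py actually recognizes file existence checks is not described here.

```python
import re
from pathlib import Path
from typing import Any, Dict, Optional

# Assumption: a token such as README.md or requirements.txt marks a file check.
FILE_PATTERN = re.compile(r"\b[\w./-]+\.[A-Za-z0-9]+\b")


def extract_filename(checkpoint: str) -> Optional[str]:
    match = FILE_PATTERN.search(checkpoint)
    return match.group(0) if match else None


def file_exists_check(checkpoint: str, repo_path: str) -> Optional[Dict[str, Any]]:
    """Return a pass/fail result, or None if the checkpoint names no file."""
    name = extract_filename(checkpoint)
    if name is None:
        return None  # not a file check; a semantic or LLM-enhanced check is needed
    found = (Path(repo_path) / name).exists()
    return {
        "checkpoint": checkpoint,
        "passed": found,
        "explanation": f"{name} {'found' if found else 'not found'} in repository",
    }
```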
Checkpoints File Format
# Comments start with #
1. Check if the repository has README.md
2. Check if RAG model is implemented
3. Check if logging is configured
Check if requirements.txt exists # Numbering is optional
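A parser for this format could look roughly like the sketch below. It skips blank lines and full-line comments and strips optional leading numbering. Trailing `# ...` text on a checkpoint line is left untouched, because this document does not say whether inline comments are stripped. The function name is hypothetical.

```python
import re
from typing import List

NUMBERING = re.compile(r"^\d+[.)]\s*")  # matches prefixes such as "1. " or "2) "


def parse_checkpoints(text: str) -> List[str]:
    """Return the checkpoint descriptions contained in a checkpoints file."""
    checkpoints: List[str] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment
        line = NUMBERING.sub("", line)
        if line:
            checkpoints.append(line)
    return checkpoints
```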
Managing Checkpoints via UI
The web interface provides checkpoint management:
- View Checkpoints: Load and display all checkpoints from file
- Add Checkpoint: Add new checkpoints via UI
- Persistence: All checkpoints saved to checkpoints.txt
- Server Restart: Checkpoints persist across server restarts
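File-backed persistence of this kind can be sketched as an append followed by a re-read. `add_checkpoint` is a hypothetical helper and not the function behind the UI; it only shows why checkpoints stored this way survive a restart.

```python
from pathlib import Path
from typing import List


def add_checkpoint(checkpoint: str, path: str = "checkpoints.txt") -> List[str]:
    """Append a checkpoint to the file and return all checkpoints now stored."""
    text = checkpoint.strip()
    if not text:
        raise ValueError("checkpoint text must not be empty")
    file = Path(path)
    existing = file.read_text(encoding="utf-8") if file.exists() else ""
    # Make sure the new entry starts on its own line.
    separator = "" if existing == "" or existing.endswith("\n") else "\n"
    with file.open("a", encoding="utf-8") as handle:
        handle.write(f"{separator}{text}\n")
    lines = file.read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip() and not line.strip().startswith("#")]
```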
UI Interaction Flow
User Journey
Initialize Repository
- User enters GitHub repository URL
- Clicks "Initialize Repository"
- Backend clones repository and indexes content
- UI displays success message and chunk count
Manage Checkpoints
- User can add new checkpoint requirements
- User can view existing checkpoints
- Checkpoints saved to checkpoints.txt
- Available for validation
Ask Questions
- User enters natural language question
- Optionally enables LLM for enhanced responses
- Backend retrieves relevant code chunks
- UI displays answer and source chunks
Run Validation
- User triggers checkpoint validation
- Backend evaluates all checkpoints
- UI displays pass/fail results with explanations
UI Components
- Status Messages: Success, error, and info notifications
- Loading Indicators: Spinner during processing
- Result Boxes: Formatted display of results
- Checkpoint List: Scrollable list of checkpoints
- Forms: Input fields for URLs, questions, checkpoints
Setup and Run Instructions
Prerequisites
- Python 3.6 or higher
- pip package manager
- Git (for repository cloning)
Installation
Clone GetGit repository:
git clone https://github.com/samarthnaikk/getgit.git
cd getgit
Install dependencies:
pip install -r requirements.txt
Set up environment variables (optional):
# For LLM-powered responses
export GEMINI_API_KEY=your_api_key_here
# For production deployment
Running the Application
Development Mode:
FLASK_ENV=development python server.py
Production Mode:
python server.py
The server will start on http://0.0.0.0:5000
Accessing the UI
Open your web browser and navigate to:
http://localhost:5000
Logging Behavior
GetGit uses Python's standard logging module for comprehensive activity tracking.
Log Levels
- DEBUG: Detailed diagnostic information
- INFO: General informational messages (default)
- WARNING: Warning messages for unexpected situations
- ERROR: Error messages for failures
Log Format
YYYY-MM-DD HH:MM:SS - getgit.MODULE - LEVEL - Message
Example:
2026-01-10 12:34:56 - getgit.core - INFO - Initializing repository from https://github.com/user/repo.git
2026-01-10 12:35:02 - getgit.core - INFO - Created 1247 chunks from repository
2026-01-10 12:35:08 - getgit.server - INFO - Repository initialization completed successfully
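The format above can be reproduced with the standard logging module as sketched below. `make_logger` is a hypothetical helper; the `setup_logging` function in core.py may configure handlers differently.

```python
import logging

LOG_FORMAT = "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
DATE_FORMAT = "%Y-%m-%d %H:%M:%S"


def make_logger(module: str, level: str = "INFO") -> logging.Logger:
    """Return a logger named getgit.<module> that writes the documented format."""
    logger = logging.getLogger(f"getgit.{module}")
    logger.setLevel(getattr(logging, level.upper()))
    if not logger.handlers:  # avoid attaching a second handler on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(logging.Formatter(LOG_FORMAT, datefmt=DATE_FORMAT))
        logger.addHandler(handler)
    return logger
```

Calling `make_logger("core").info("Created %d chunks from repository", 1247)` would emit a line shaped like the second example above.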
Server Logs
Server logs include:
- Request processing
- Route handling
- Success/failure of operations
- Error stack traces (when errors occur)
Core Module Logs
Core module logs include:
- Repository initialization progress
- RAG pipeline setup stages
- Query processing steps
- Checkpoint validation progress
Configuring Log Level
Via Environment:
# Not directly supported, modify code or use Python logging config
In Code:
from core import setup_logging
logger = setup_logging(level="DEBUG")
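Since the log level cannot be set through the environment directly, a small wrapper could read a variable and hand the result to `setup_logging`. The variable name `GETGIT_LOG_LEVEL` and the helper below are hypothetical; GetGit does not define them.

```python
import os

VALID_LEVELS = ("DEBUG", "INFO", "WARNING", "ERROR")


def level_name_from_env(var_name: str = "GETGIT_LOG_LEVEL", default: str = "INFO") -> str:
    """Read a log level name from the environment, falling back to the default."""
    name = os.environ.get(var_name, default).strip().upper()
    return name if name in VALID_LEVELS else default
```

The returned name could then be passed on as `setup_logging(level=level_name_from_env())`.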
API Reference
Core Module Functions
initialize_repository(repo_url, local_path='source_repo')
Clone or load a repository and prepare it for analysis.
Parameters:
- repo_url (str): GitHub repository URL
- local_path (str): Local path for repository storage
Returns: str - Path to the cloned/loaded repository
Example:
from core import initialize_repository
repo_path = initialize_repository(
    repo_url="https://github.com/user/repo.git",
    local_path="my_repo"
)
setup_rag(repo_path, repository_name=None, config=None, use_sentence_transformer=False)
Initialize RAG pipeline with chunking, embeddings, and retrieval.
Parameters:
- repo_path (str): Path to the repository
- repository_name (str, optional): Repository name
- config (RAGConfig, optional): RAG configuration
- use_sentence_transformer (bool): Use transformer embeddings
Returns: Retriever - Configured retriever instance
Example:
from core import setup_rag
retriever = setup_rag(repo_path="source_repo")
answer_query(query, retriever, top_k=5, use_llm=True, api_key=None, model_name='gemini-2.0-flash-exp')
Retrieve context and generate response for a query.
Parameters:
- query (str): Natural language question
- retriever (Retriever): Configured retriever instance
- top_k (int): Number of chunks to retrieve
- use_llm (bool): Whether to generate LLM response
- api_key (str, optional): API key for LLM
- model_name (str): LLM model name
Returns: dict - Query results with response and context
Example:
from core import answer_query
result = answer_query(
    query="How do I run tests?",
    retriever=retriever,
    top_k=5,
    use_llm=True
)
validate_checkpoints(repo_url, checkpoints_file='checkpoints.txt', local_path='source_repo', use_llm=True, log_level='INFO', config=None, stop_on_failure=False)
Validate repository against checkpoints defined in a text file.
Parameters:
- repo_url (str): GitHub repository URL
- checkpoints_file (str): Path to checkpoints file
- local_path (str): Local repository storage path
- use_llm (bool): Use LLM for evaluation
- log_level (str): Logging level
- config (RAGConfig, optional): RAG configuration
- stop_on_failure (bool): Stop on first failure
Returns: dict - Validation results with statistics
Example:
from core import validate_checkpoints
result = validate_checkpoints(
    repo_url="https://github.com/user/repo.git",
    checkpoints_file="checkpoints.txt",
    use_llm=True
)
print(result['summary'])
Flask API Endpoints
POST /initialize
Initialize repository and setup RAG pipeline.
Request Body:
{
"repo_url": "https://github.com/user/repo.git"
}
Response:
{
"success": true,
"message": "Repository initialized successfully with 850 chunks",
"repo_path": "source_repo",
"chunks_count": 850
}
POST /ask
Answer questions about the repository.
Request Body:
{
"query": "What is this project about?",
"use_llm": true
}
Response:
{
"success": true,
"query": "What is this project about?",
"response": "This project is a repository intelligence system...",
"retrieved_chunks": [...],
"context": "...",
"error": null
}
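A client for these endpoints can be sketched with the standard library alone. The example assumes the server is reachable at `http://localhost:5000` and accepts JSON request bodies as shown above; `build_post` and `send` are hypothetical helper names.

```python
import json
import urllib.request
from typing import Any, Dict

BASE_URL = "http://localhost:5000"


def build_post(path: str, payload: Dict[str, Any]) -> urllib.request.Request:
    """Build a JSON POST request for one of the GetGit endpoints."""
    return urllib.request.Request(
        url=f"{BASE_URL}{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 60.0) -> Dict[str, Any]:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


ask_request = build_post("/ask", {"query": "What is this project about?", "use_llm": True})
# answer = send(ask_request)  # requires a running GetGit server
```

The repository has to be initialized through POST /initialize before /ask can return useful results, so a real client would send that request first.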
POST /checkpoints
Run checkpoint validation.
Request Body:
{
"checkpoints_file": "checkpoints.txt",
"use_llm": true
}
Response:
{
"success": true,
"checkpoints": ["Check if README exists", ...],
"results": [{
"checkpoint": "Check if README exists",
"passed": true,
"explanation": "...",
"evidence": "...",
"score": 1.0
}],
"summary": "...",
"passed_count": 4,
"total_count": 5,
"pass_rate": 80.0
}
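The statistics fields in this response follow directly from the per-checkpoint results. A minimal sketch, assuming `pass_rate` is a percentage of passed checkpoints (the function name is hypothetical):

```python
from typing import Any, Dict, List


def summarize_results(results: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Compute passed_count, total_count, and pass_rate from checkpoint results."""
    total = len(results)
    passed = sum(1 for result in results if result.get("passed"))
    rate = round(100.0 * passed / total, 1) if total else 0.0
    return {"passed_count": passed, "total_count": total, "pass_rate": rate}
```

With four passing checkpoints out of five, this gives a pass rate of 80.0, matching the example response.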
GET /checkpoints/list
List all checkpoints from checkpoints.txt.
Response:
{
"success": true,
"checkpoints": [
"Check if the repository has README.md",
"Check if RAG model is implemented"
]
}
POST /checkpoints/add
Add a new checkpoint to checkpoints.txt.
Request Body:
{
"checkpoint": "Check if tests are present"
}
Response:
{
"success": true,
"message": "Checkpoint added successfully",
"checkpoints": [...]
}
GET /status
Get current application status.
Response:
{
"initialized": true,
"repo_url": "https://github.com/user/repo.git",
"chunks_count": 850
}
Configuration
Environment Variables
GEMINI_API_KEY: API key for Google Gemini LLM (optional)
FLASK_ENV: Set to development for debug mode
RAG Configuration
from rag import RAGConfig
# Use default configuration
config = RAGConfig.default()
# Use documentation-optimized configuration
config = RAGConfig.for_documentation()
# Custom configuration
from rag import ChunkingConfig, EmbeddingConfig
config = RAGConfig(
    chunking=ChunkingConfig(
        file_patterns=['*.py', '*.md'],
        chunk_size=500,
        chunk_overlap=50
    ),
    embedding=EmbeddingConfig(
        model_type='sentence-transformer',
        embedding_dim=384
    )
)
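To show what `chunk_size` and `chunk_overlap` typically control, here is a sketch of fixed-size chunking with overlap. It assumes both values are measured in characters, which this document does not state; the function is hypothetical and not part of the rag module.

```python
from typing import List


def sliding_chunks(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> List[str]:
    """Split text into windows of chunk_size that share chunk_overlap characters."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("require chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks
```

The overlap means content that straddles a chunk boundary still appears whole in at least one chunk, at the cost of indexing some text twice.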
Repository Storage
By default, repositories are cloned to source_repo/. This can be customized via the local_path parameter.
Last updated: January 2026