data/github/
The curated knowledge base for every GitHub project. This is the highest-signal section of the ArunCore dataset: it contains structured, reviewed documentation for each project that proves real engineering capability.
How It Was Built
- All repos were cloned locally into `temp/repos/`
- Source code, READMEs, and requirements files were analysed file by file
- Five dataset files were generated per Tier 1 project (written from code analysis, not just READMEs)
- Arun filled in `decisions.md` with the why behind each architectural choice
- Final files were reviewed and formatted before being committed here
Folder Structure
One subfolder per project:
```
github/
├── legal_RAG_system/           ← Tier 1: full 5-file treatment
├── real_state_listing_scraper/ ← Tier 1: full 5-file treatment
├── personal_ai_agent/          ← Tier 1: full 5-file treatment
├── result_anomaly/             ← Tier 1: full 5-file treatment
├── Agentic_AI_Projects/        ← Tier 2: metadata + readme only
├── web_wizard/                 ← Tier 2: metadata + readme only
└── neural_arun_labs/           ← Tier 2: metadata + readme only
```
File Types Per Project
| File | Tier 1 | Tier 2 | Purpose |
|---|---|---|---|
| `metadata.json` | ✅ | ✅ | Machine-readable project facts: URL, stack, status, visibility |
| `readme.md` | ✅ | ✅ | Clean problem/solution/features (no install noise) |
| `architecture.md` | ✅ | ❌ | System design: components, data flow, design patterns used |
| `code_summaries.json` | ✅ | ❌ | Per-module summaries with GitHub file URLs |
| `decisions.md` | ✅ | ❌ | Key architectural decisions + reasoning (written by Arun) |
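To make the table concrete, here is an illustrative `metadata.json` for a Tier 1 project. The field values are hypothetical placeholders chosen to match the purposes listed above (URL, stack, status, visibility), not the actual dataset contents:

```json
{
  "project_name": "legal_RAG_system",
  "url": "https://github.com/<user>/legal_RAG_system",
  "tech_stack": ["Python", "ChromaDB", "OpenAI API"],
  "status": "active",
  "visibility": "public",
  "tier": 1
}
```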
Tiering Criteria
Tier 1 β Full treatment: Projects that demonstrate real, original engineering problem-solving. Flagship work. These are the projects that define Arun professionally.
Tier 2 β Lightweight: Projects showing breadth (multiple domains) or active learning, but not deep enough in complexity to warrant full architecture documentation.
How This Dataset Is Used
During ingestion, each file is:
- Chunked with metadata tags (`source`, `project_name`, `type`, `tech_stack`, `status`, `visibility`)
- Embedded using OpenAI `text-embedding-3-small`
- Stored in the vector database
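The steps above can be sketched in a few lines. This is a minimal illustration of the tagging step only: the `chunk_file` helper, the fixed-size splitting strategy, and `CHUNK_SIZE` are assumptions for this sketch, not the pipeline's actual implementation, and the embedding/storage steps are left as comments since they require external services.

```python
CHUNK_SIZE = 800  # characters per chunk (assumed value for illustration)

def chunk_file(text, *, source, project_name, chunk_type, tech_stack, status, visibility):
    """Split `text` into fixed-size pieces, tagging each with the dataset's metadata fields."""
    meta = {
        "source": source,
        "project_name": project_name,
        "type": chunk_type,
        "tech_stack": tech_stack,
        "status": status,
        "visibility": visibility,
    }
    return [
        {"text": text[i:i + CHUNK_SIZE], "metadata": meta}
        for i in range(0, len(text), CHUNK_SIZE)
    ]

chunks = chunk_file(
    "ChromaDB was chosen for local-first vector storage. " * 60,
    source="github",
    project_name="legal_RAG_system",
    chunk_type="decisions",
    tech_stack=["Python", "ChromaDB"],
    status="active",
    visibility="public",
)
# Each chunk dict would then be embedded with text-embedding-3-small
# and written to the vector database alongside its metadata.
```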
When a user asks "What RAG projects have you built?", the retrieval engine pulls from `legal_RAG_system/architecture.md` and `legal_RAG_system/readme.md`. When asked "Why did you use ChromaDB?", it retrieves from `legal_RAG_system/decisions.md`.
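This routing behaviour can be sketched as a metadata filter. Everything here is hypothetical: the in-memory `STORE`, the chunk texts, and the keyword routing rule are stand-ins for illustration. A real deployment would query the vector database with an embedding of the question plus a metadata filter, rather than matching keywords.

```python
# Tiny in-memory stand-in for the vector store (contents invented for this sketch).
STORE = [
    {"text": "RAG pipeline over legal documents with a retrieval layer.",
     "metadata": {"project_name": "legal_RAG_system", "type": "architecture"}},
    {"text": "ChromaDB was chosen for its zero-setup local persistence.",
     "metadata": {"project_name": "legal_RAG_system", "type": "decisions"}},
]

def retrieve(question, store=STORE):
    # Assumed routing rule: "why" questions pull decisions.md chunks,
    # everything else pulls architecture chunks.
    wanted = "decisions" if question.lower().startswith("why") else "architecture"
    return [c for c in store if c["metadata"]["type"] == wanted]

hits = retrieve("Why did you use ChromaDB?")
```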