# data/github/

The curated knowledge base for every GitHub project. This is the highest-signal section of the ArunCore dataset: it contains structured, reviewed documentation for each project that proves real engineering capability.

---

## How It Was Built

1. All repos were cloned into `temp/repos/` locally
2. Source code, READMEs, and requirements files were analysed file by file
3. Five dataset files were generated per Tier 1 project (written from code analysis, not just READMEs)
4. Arun filled in the `decisions.md` with the *why* behind each architectural choice
5. Final files were reviewed and formatted before being committed here

---

## Folder Structure

One subfolder per project:

```
github/
├── legal_RAG_system/             ← Tier 1: full 5-file treatment
├── real_state_listing_scraper/   ← Tier 1: full 5-file treatment
├── personal_ai_agent/            ← Tier 1: full 5-file treatment
├── result_anomaly/               ← Tier 1: full 5-file treatment
├── Agentic_AI_Projects/          ← Tier 2: metadata + readme only
├── web_wizard/                   ← Tier 2: metadata + readme only
└── neural_arun_labs/             ← Tier 2: metadata + readme only
```

---

## File Types Per Project

| File | Tier 1 | Tier 2 | Purpose |
|---|:---:|:---:|---|
| `metadata.json` | ✅ | ✅ | Machine-readable project facts: URL, stack, status, visibility |
| `readme.md` | ✅ | ✅ | Clean problem/solution/features (no install noise) |
| `architecture.md` | ✅ | ❌ | System design: components, data flow, design patterns used |
| `code_summaries.json` | ✅ | ❌ | Per-module summaries with GitHub file URLs |
| `decisions.md` | ✅ | ❌ | Key architectural decisions + reasoning (written by Arun) |

---

## Tiering Criteria

**Tier 1 (Full treatment):** Projects that demonstrate real, original engineering problem-solving. Flagship work. These are the projects that define Arun professionally.

**Tier 2 (Lightweight):** Projects showing breadth (multiple domains) or active learning, but not deep enough in complexity to warrant full architecture documentation.

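
The file table and tier assignments above imply a simple completeness rule: Tier 1 folders should hold all five files, Tier 2 folders only `metadata.json` and `readme.md`. A minimal sketch of how that rule could be checked is shown below. The function names `missing_files` and `check_dataset` are hypothetical and not part of the repository; only the folder names, file names, and tiers come from this README.

```python
from pathlib import Path

# File sets taken from the "File Types Per Project" table.
TIER_2_FILES = {"metadata.json", "readme.md"}
TIER_1_FILES = TIER_2_FILES | {
    "architecture.md",
    "code_summaries.json",
    "decisions.md",
}

# Tier assignments copied from the folder tree in this README.
PROJECT_TIERS = {
    "legal_RAG_system": 1,
    "real_state_listing_scraper": 1,
    "personal_ai_agent": 1,
    "result_anomaly": 1,
    "Agentic_AI_Projects": 2,
    "web_wizard": 2,
    "neural_arun_labs": 2,
}


def expected_files(tier: int) -> set[str]:
    """Return the file names a project of the given tier should contain."""
    return TIER_1_FILES if tier == 1 else TIER_2_FILES


def missing_files(project_dir: Path, tier: int) -> set[str]:
    """Return expected files that are absent from project_dir (hypothetical helper)."""
    present = {p.name for p in project_dir.iterdir() if p.is_file()}
    return expected_files(tier) - present


def check_dataset(root: Path) -> dict[str, set[str]]:
    """Map each incomplete project to its missing files (hypothetical helper)."""
    report: dict[str, set[str]] = {}
    for name, tier in PROJECT_TIERS.items():
        project_dir = root / name
        if not project_dir.is_dir():
            # A missing folder counts as missing every expected file.
            report[name] = set(expected_files(tier))
            continue
        missing = missing_files(project_dir, tier)
        if missing:
            report[name] = missing
    return report
```

Calling `check_dataset(Path("data/github"))` would return an empty dict when every folder matches its tier, assuming the dataset lives at that path.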

---

## How This Dataset Is Used

During ingestion, each file is:

1. Chunked with metadata tags (`source`, `project_name`, `type`, `tech_stack`, `status`, `visibility`)
2. Embedded using OpenAI `text-embedding-3-small`
3. Stored in the vector database

When a user asks *"What RAG projects have you built?"*, the retrieval engine pulls from `legal_RAG_system/architecture.md` and `legal_RAG_system/readme.md`. When asked *"Why did you use ChromaDB?"*, it retrieves from `legal_RAG_system/decisions.md`.
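
The chunking step in the ingestion list above can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's actual ingestion code: the `Chunk` class, the `chunk_file` function, the paragraph-based splitting, the 800-character limit, and the `metadata.json` key names (`stack`, `status`, `visibility`) are all hypothetical. Only the six tag names come from this README. The embedding and storage steps are left out so the sketch runs offline.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """One retrievable piece of text plus the tags used for filtering."""
    text: str
    metadata: dict


def chunk_file(text, source, project_name, project_meta, max_chars=800):
    """Split text on blank lines and tag every chunk (hypothetical helper).

    project_meta is assumed to be the parsed metadata.json of the project;
    its key names here are guesses, not a documented schema.
    """
    # Derive the document type from the file name, e.g. "decisions".
    doc_type = source.rsplit("/", 1)[-1].rsplit(".", 1)[0]
    tags = {
        "source": source,
        "project_name": project_name,
        "type": doc_type,
        "tech_stack": project_meta.get("stack", []),
        "status": project_meta.get("status", "unknown"),
        "visibility": project_meta.get("visibility", "unknown"),
    }

    chunks = []
    current = ""
    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) > max_chars and current:
            # Adding this paragraph would overflow, so close the chunk.
            chunks.append(Chunk(current, dict(tags)))
            current = paragraph
        else:
            # Note: a single paragraph longer than max_chars is kept whole.
            current = candidate
    if current:
        chunks.append(Chunk(current, dict(tags)))
    return chunks
```

Because every chunk carries `project_name` and `type`, a retriever could in principle narrow a question such as *"Why did you use ChromaDB?"* to chunks whose `type` is `decisions` before ranking by similarity.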