# data/github/

The curated knowledge base for every GitHub project. This is the highest-signal section of the ArunCore dataset: it contains structured, reviewed documentation for each project, written as evidence of real engineering capability.

---
## How It Was Built

1. All repositories were cloned locally into `temp/repos/`
2. Source code, READMEs, and requirements files were analysed file by file
3. Five dataset files were generated per Tier 1 project (written from code analysis, not just READMEs)
4. Arun filled in `decisions.md` with the *why* behind each architectural choice
5. Final files were reviewed and formatted before being committed here
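
Step 2 can be pictured with a minimal sketch that walks one cloned repository and groups the files to review. The selection rules (`DOC_NAMES`, `SOURCE_SUFFIXES`) and the function name `collect_files` are assumptions for illustration; the actual tooling used for the analysis is not described here.

```python
from pathlib import Path

# Hypothetical selection rules: the real analysis may have covered other file types.
DOC_NAMES = {"readme.md", "requirements.txt"}
SOURCE_SUFFIXES = {".py"}


def collect_files(repo_root: Path) -> dict[str, list[Path]]:
    """Group a cloned repo's files into docs and source code for review."""
    groups: dict[str, list[Path]] = {"docs": [], "source": []}
    for path in sorted(repo_root.rglob("*")):
        # Skip directories and anything inside the repo's .git folder.
        if not path.is_file() or ".git" in path.relative_to(repo_root).parts:
            continue
        if path.name.lower() in DOC_NAMES:
            groups["docs"].append(path)
        elif path.suffix in SOURCE_SUFFIXES:
            groups["source"].append(path)
    return groups
```

Calling `collect_files(Path("temp/repos/legal_RAG_system"))` would return the documentation and source files for that project, ready for file-by-file review.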

---

## Folder Structure

One subfolder per project:

```
github/
├── legal_RAG_system/            ← Tier 1: full 5-file treatment
├── real_state_listing_scraper/  ← Tier 1: full 5-file treatment
├── personal_ai_agent/           ← Tier 1: full 5-file treatment
├── result_anomaly/              ← Tier 1: full 5-file treatment
├── Agentic_AI_Projects/         ← Tier 2: metadata + readme only
├── web_wizard/                  ← Tier 2: metadata + readme only
└── neural_arun_labs/            ← Tier 2: metadata + readme only
```

---

## File Types Per Project

| File | Tier 1 | Tier 2 | Purpose |
|---|:---:|:---:|---|
| `metadata.json` | ✅ | ✅ | Machine-readable project facts: URL, stack, status, visibility |
| `readme.md` | ✅ | ✅ | Clean summary of problem, solution, and features (no installation noise) |
| `architecture.md` | ✅ | ❌ | System design: components, data flow, design patterns used |
| `code_summaries.json` | ✅ | ❌ | Per-module summaries with GitHub file URLs |
| `decisions.md` | ✅ | ❌ | Key architectural decisions and reasoning (written by Arun) |
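
The tier rules in the table can be expressed as a small completeness check. This is a sketch only: `expected_files` and `missing_files` are hypothetical helpers, not part of the dataset tooling.

```python
from pathlib import Path

# File names taken from the table above.
BASE_FILES = ("metadata.json", "readme.md")
TIER1_EXTRA = ("architecture.md", "code_summaries.json", "decisions.md")


def expected_files(tier: int) -> tuple[str, ...]:
    """Return the file names a project folder should contain for its tier."""
    if tier == 1:
        return BASE_FILES + TIER1_EXTRA
    if tier == 2:
        return BASE_FILES
    raise ValueError(f"unknown tier: {tier}")


def missing_files(project_dir: Path, tier: int) -> list[str]:
    """List expected files that are absent from a project folder."""
    return [name for name in expected_files(tier) if not (project_dir / name).is_file()]
```

For example, `missing_files(Path("github/web_wizard"), tier=2)` returns an empty list when both `metadata.json` and `readme.md` are present.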

---

## Tiering Criteria

**Tier 1 (full treatment):** Flagship projects that demonstrate real, original engineering problem-solving. These are the projects that define Arun professionally.

**Tier 2 (lightweight):** Projects that show breadth (multiple domains) or active learning, but are not complex enough to warrant full architecture documentation.

---

## How This Dataset Is Used

During ingestion, each file is:

1. Chunked with metadata tags (`source`, `project_name`, `type`, `tech_stack`, `status`, `visibility`)
2. Embedded using OpenAI `text-embedding-3-small`
3. Stored in the vector database
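
The chunking step might look like the sketch below, which splits a file on blank lines and attaches the same metadata tags to every chunk. The `Chunk` type, the `chunk_text` helper, the `max_chars` limit, and the paragraph-based splitting strategy are all assumptions; the actual chunker is not specified here.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    metadata: dict[str, str]


def chunk_text(text: str, metadata: dict[str, str], max_chars: int = 800) -> list[Chunk]:
    """Pack blank-line-separated paragraphs into chunks of roughly max_chars.

    A single paragraph longer than max_chars is kept whole rather than split.
    """
    chunks: list[Chunk] = []
    current = ""
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chars:
            # Close the current chunk and start a new one with this paragraph.
            chunks.append(Chunk(current, dict(metadata)))
            current = para
        else:
            current = candidate
    if current:
        chunks.append(Chunk(current, dict(metadata)))
    return chunks


# Illustrative tag values only; the keys come from the list above.
example_tags = {
    "source": "github",
    "project_name": "legal_RAG_system",
    "type": "decisions",
    "tech_stack": "python",
    "status": "active",
    "visibility": "public",
}
```

Each `Chunk.text` would then be passed to the embedding model, and the resulting vector stored alongside `Chunk.metadata`.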

When a user asks *"What RAG projects have you built?"*, the retrieval engine pulls from `legal_RAG_system/architecture.md` and `legal_RAG_system/readme.md`. When asked *"Why did you use ChromaDB?"*, it retrieves from `legal_RAG_system/decisions.md`.
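
To illustrate how a question ends up matched to a particular file, here is a toy retrieval sketch that ranks stored chunks by cosine similarity to a query vector. The similarity metric, the `top_k` helper, and the three-dimensional vectors are assumptions for illustration; real embeddings have far more dimensions, and the actual retrieval engine may rank differently.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], store: list[dict], k: int = 2) -> list[str]:
    """Return the sources of the k stored chunks most similar to the query."""
    ranked = sorted(store, key=lambda item: cosine(query_vec, item["vector"]), reverse=True)
    return [item["source"] for item in ranked[:k]]


# Made-up vectors standing in for real embeddings.
toy_store = [
    {"source": "legal_RAG_system/decisions.md", "vector": [0.9, 0.1, 0.0]},
    {"source": "legal_RAG_system/architecture.md", "vector": [0.2, 0.9, 0.1]},
    {"source": "web_wizard/readme.md", "vector": [0.0, 0.1, 0.9]},
]
```

With a query vector close to the first entry, such as `[1.0, 0.2, 0.0]`, `top_k(..., k=1)` returns the `decisions.md` chunk, mirroring the "Why did you use ChromaDB?" case above.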