"""
Git Repository Pipeline Runner - Processes Git repositories at scale.

This is the main entry point for processing GIT REPOSITORIES. It provides
enhanced features for repository analysis, including git metadata extraction,
agentic framework detection, and comprehensive statistics generation.

ARCHITECTURE POSITION:
    - Repository Pipeline Orchestrator: Coordinates Git repo processing
    - Enhanced Metadata Collector: Extracts git history and agentic patterns
    - Production Pipeline: Handles large repositories with performance tracking

KEY FEATURES:
    1. Complete repository processing with git metadata
    2. Extension-aware filtering (None = full repository)
    3. Performance tracking (files/sec, chunks/sec)
    4. Agentic framework detection (via RepoMetadataExtractor)
    5. Comprehensive output (JSONL chunks + metadata + statistics)

DATA FLOW:
    Repo URL → Clone → Metadata extraction → File listing → Chunking →
    Enhanced export → Statistics → Comprehensive output package

USE CASES:
    - Processing complete Git repositories for training data
    - Creating agentic-aware datasets
    - Benchmarking chunking performance
    - Production dataset generation

USAGE:
    python run_repo_pipeline.py single https://github.com/crewAIInc/crewAI
    python run_repo_pipeline.py single https://github.com/autogen/autogen --extensions .py .md
    python run_repo_pipeline.py single https://github.com/langchain --max-files 1000
"""

from pathlib import Path
import json
from typing import Dict, Any, Optional, Set
import argparse
import time
from datetime import datetime

# Import enhanced components
from src.task_3_data_engineering.ingestion.git_crawler import GitCrawler
from src.task_3_data_engineering.ingestion.repo_metadata import RepoMetadataExtractor
from src.task_3_data_engineering.chunking.repo_chunker import RepoChunker
from src.task_3_data_engineering.analysis.dataset_stats import compute_dataset_stats
from src.task_3_data_engineering.export.enhanced_jsonl_exporter import export_repo_chunks_jsonl


class EnhancedRepoPipeline:
    """Enhanced pipeline with agentic focus"""

    def __init__(
        self,
        output_base: Path = Path("data/processed/repos"),
        use_hierarchical: bool = True,
    ):
        self.crawler = GitCrawler()
        self.chunker = RepoChunker(use_hierarchical=use_hierarchical)
        self.output_base = output_base
        self.output_base.mkdir(parents=True, exist_ok=True)

    def process_repository(
        self,
        repo_url: str,
        extensions: Optional[Set[str]] = None,
        output_name: Optional[str] = None,
        include_binary: bool = False,
        max_files: Optional[int] = None,
        skip_git_metadata: bool = False,
    ) -> Dict[str, Any]:
        """
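        Process a repository: clone, extract metadata, chunk files, export.

        Extension filtering:
        - extensions=None           => FULL repository (no filtering)
        - extensions={".py", ".md"} => only files with those extensions

        An empty set is logged as "no filter" but is passed to the crawler
        unchanged, so prefer None when no filtering is wanted.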
        """

        start_time = time.time()
        print(f"🚀 Processing repository: {repo_url}")
        print("-" * 60)

        # 1. Clone repository
        repo_path = self.crawler.clone_repository(repo_url)
        if not repo_path:
            raise RuntimeError(f"Failed to clone {repo_url}")

        # 2. Determine output name
        if not output_name:
            output_name = repo_path.name

        # 3. Log extension behavior (FIXED)
        if extensions:
            print(f"📁 Extension filter enabled: {sorted(extensions)}")
        else:
            print("📁 No extension filter → processing FULL repository")

        # 4. Extract repository metadata (skipped when skip_git_metadata=True)
        metadata = {}

        if not skip_git_metadata:
            print("📊 Extracting repository metadata...")
            extractor = RepoMetadataExtractor(repo_path)
            metadata = extractor.extract_comprehensive_metadata()

        # 5. List files (CORE LOGIC UNCHANGED)
        print("📁 Listing repository files...")
        file_infos, file_stats = self.crawler.list_files_with_info(
            repo_path,
            extensions=extensions,  # None => full repo
            skip_binary=not include_binary,
        )

        # 6. Optional file limiting
        if max_files and len(file_infos) > max_files:
            print(f"⚠️ Limiting to {max_files} files (out of {len(file_infos)})")
            file_infos = file_infos[:max_files]

        print(f"📊 Found {len(file_infos)} files to process")

        # 7. Create output directory
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        output_dir = self.output_base / f"{output_name}_{timestamp}"
        output_dir.mkdir(parents=True, exist_ok=True)

        # 8. Repository-level metadata
        # Get actual repo name from metadata
        actual_repo_name = metadata.get("basic", {}).get("repo_name", output_name)

        repo_metadata = {
            "repo_url": repo_url,
            "repo_name": actual_repo_name,  # ✅ Use actual repo name
            "folder_name": output_name,     # ✅ Track user's folder
            "local_path": str(repo_path),
            # sorted() keeps the recorded filter order stable across runs
            "extensions_included": sorted(extensions) if extensions else "ALL",
            "timestamp": timestamp,
            **metadata,
        }

        metadata_file = output_dir / "repository_metadata.json"
        with open(metadata_file, "w", encoding="utf-8") as f:
            json.dump(repo_metadata, f, indent=2, default=str)

        # 9. Chunk processing
        all_chunks = []
        processing_stats = {
            "total_files": len(file_infos),
            "processed": 0,
            "failed": 0,
            "file_types": {},
            "chunk_types": {},
        }

        print("\n🔧 Processing files...")
        print("-" * 60)

        for idx, file_info in enumerate(file_infos, start=1):
            try:
                if idx % 10 == 0:
                    print(f"  [{idx}/{len(file_infos)}] Processing...")

                file_metadata = {
                    **repo_metadata,
                    "file_info": {
                        "relative_path": file_info.relative_path,
                        "size_bytes": file_info.size,
                        "extension": file_info.extension,
                        "is_binary": file_info.is_binary,
                    },
                }

                chunks = self.chunker.chunk_file(
                    file_info.path,
                    file_metadata,
                )

                all_chunks.extend(chunks)
                processing_stats["processed"] += 1
                processing_stats["file_types"][file_info.extension] = (
                    processing_stats["file_types"].get(file_info.extension, 0) + 1
                )

                for chunk in chunks:
                    ct = chunk.chunk_type
                    processing_stats["chunk_types"][ct] = (
                        processing_stats["chunk_types"].get(ct, 0) + 1
                    )

            except Exception as e:
                print(f"⚠️ Error processing {file_info.relative_path}: {str(e)[:120]}")
                processing_stats["failed"] += 1

        # 10. Export chunks
        print("\n💾 Exporting chunks...")
        output_file = output_dir / f"{output_name}_chunks.jsonl"

        export_repo_chunks_jsonl(
            chunks=all_chunks,
            output_path=output_file,
            repo_metadata=repo_metadata,
            print_stats=True,
        )

        # 11. Compute statistics
        print("📈 Computing statistics...")
        chunk_stats = compute_dataset_stats(all_chunks)

        total_time = time.time() - start_time

        final_stats = {
            "repository_info": {
                "name": actual_repo_name,  # ✅ USE actual_repo_name
                "folder_name": output_name,  # ✅ ADD folder_name field
                "url": repo_url,
                "path": str(repo_path),
                "timestamp": timestamp,
            },
            "processing_stats": processing_stats,
            "chunk_statistics": chunk_stats,
            "performance": {
                "total_time_seconds": round(total_time, 2),
                "files_per_second": round(len(file_infos) / total_time, 2),
                "chunks_per_second": round(len(all_chunks) / total_time, 2),
            },
            "output_files": {
                "chunks": str(output_file),
                "metadata": str(metadata_file),
            },
        }

        stats_file = output_dir / f"{output_name}_stats.json"
        with open(stats_file, "w", encoding="utf-8") as f:
            json.dump(final_stats, f, indent=2)

        # 12. Summary
        print("\n" + "=" * 70)
        print("✅ REPOSITORY PROCESSING COMPLETE")
        print("=" * 70)
        print(f"📁 Repository: {output_name}")
        print(f"📄 Files:      {len(file_infos)}")
        print(f"🧩 Chunks:     {len(all_chunks)}")
        print(f"⏱️  Time:       {final_stats['performance']['total_time_seconds']}s")
        print(f"💾 Output:     {output_dir}")
        print("=" * 70)

        return final_stats
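

# --- Illustrative sketch (not part of the original pipeline) ---
# The CLI passes --extensions values through unchanged, so "py", ".PY" and
# ".py" would end up as three different filter entries. The hypothetical
# helper below shows one way to normalize them before they reach
# process_repository(). It ASSUMES the crawler matches lowercase extensions
# with a leading dot, which nothing in this file confirms.
from typing import Iterable, Optional, Set


def normalize_extensions(raw: Optional[Iterable[str]]) -> Optional[Set[str]]:
    """Return a set such as {".py", ".md"}, or None when no filter is given."""
    if not raw:
        return None  # None => full repository, as in process_repository()
    cleaned: Set[str] = set()
    for ext in raw:
        ext = ext.strip().lower()
        if not ext:
            continue  # ignore empty or whitespace-only entries
        cleaned.add(ext if ext.startswith(".") else f".{ext}")
    # An all-empty input collapses to None rather than an empty filter
    return cleaned or None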


def main():
    """Enhanced CLI for repository pipeline (FIXED)"""

    parser = argparse.ArgumentParser(
        description="Process Git repositories for agentic datasets"
    )

    subparsers = parser.add_subparsers(dest="command", required=True)

    # ---- Single repo ----
    single = subparsers.add_parser("single", help="Process single repository")
    single.add_argument("repo_url", help="Git repository URL")
    single.add_argument("--name", help="Custom output name")
    single.add_argument(
        "--extensions",
        nargs="+",
        default=None,
        help="Optional file extensions (.py .md). If omitted, FULL repo is processed.",
    )
    single.add_argument("--max-files", type=int, help="Limit number of files")
    single.add_argument(
        "--skip-git-metadata",
        action="store_true",
        help="Skip git metadata extraction",
    )
    single.add_argument(
        "--include-binary",
        action="store_true",
        help="Include binary files (skipped by default)",
    )

    args = parser.parse_args()
    pipeline = EnhancedRepoPipeline()

    if args.command == "single":
        pipeline.process_repository(
            repo_url=args.repo_url,
            output_name=args.name,
            extensions=set(args.extensions) if args.extensions else None,
            max_files=args.max_files,
            skip_git_metadata=args.skip_git_metadata,
            include_binary=args.include_binary,
        )


if __name__ == "__main__":
    main()