# Dataset Builder

This project contains the complete source code for building three domain-specific datasets for scientific computing code intelligence research.

## Project Structure

```
dataset_builder/
├── README.md                           # This file
│
├── data1/                              # DATA1: Domain-Specific Code Dataset
│   ├── main.py                         # Step 0-1: Keyword expansion + GitHub repo search
│   ├── main_v2.py                      # Step 0-4: Full pipeline (search → check → clone → filter)
│   ├── util.py                         # Shared utilities (logger, LLM calls, code extensions)
│   ├── download_dataset.py             # Download ChemPile code dataset from HuggingFace
│   ├── merge_dataset.py                # Merge crawled repos with ChemPile data, deduplicate
│   ├── analysis.py                     # Code-level analysis (comments, functions, tokens)
│   ├── compute_stars_keywords.py       # Compute stars/keyword statistics
│   ├── compute_statistics.py           # Compute code statistics from JSONL analysis files
│   ├── rename.py                       # Rename repo directories to owner___repo format
│   ├── rename2.py                      # Rename ChemPile files with zero-padded numbering
│   ├── pyproject.toml                  # Python project config
│   ├── scripts/
│   │   └── export_files_to_csv.py      # Export repo files to CSV grouped by keyword
│   ├── reporting/                      # Statistical reporting and visualization
│   │   ├── __init__.py
│   │   ├── main.py                     # Reporting entry point
│   │   ├── visualization.py            # Generate figures (funnel, distributions, etc.)
│   │   ├── repo_meta_scan.py           # Scan repo-level metadata
│   │   ├── code_file_stats.py          # File-level code statistics
│   │   ├── code_file_stats_fast.py     # Optimized file-level statistics
│   │   ├── stage_a_stats.py            # Stage A (search/check) statistics
│   │   ├── stage_b_stats.py            # Stage B (clone/filter) statistics
│   │   └── join_insights.py            # Join and cross-analyze insights
│   └── README.md                       # DATA1 dataset documentation
│
├── data2/                              # DATA2: Code-Documentation Alignment Dataset
│   ├── instruction_generation/         # README summarization pipeline
│   │   ├── pipeline.py                 # Unified entry (summarize + parse modes)
│   │   ├── summarize_repo_readme.py    # Summarize repo READMEs using LLM
│   │   ├── extract_repo_functions.py   # Extract functions from repos
│   │   ├── schemas.py                  # Pydantic data schemas
│   │   └── prompts/
│   │       ├── function_extract.txt    # Prompt for function extraction
│   │       └── readme_summary.txt      # Prompt for README summarization
│   ├── step22/                         # Function scoring, generation, alignment
│   │   ├── build.py                    # Build tree-sitter language parsers
│   │   ├── func_stat.py                # Extract functions using tree-sitter
│   │   ├── md_stat.py                  # Extract & save README summaries
│   │   ├── emb_qwen_func.py            # Score functions using Qwen embedding model
│   │   ├── emb_qwen_md.py              # Score READMEs using Qwen embedding model
│   │   ├── function_req.py             # Filter functions by score threshold
│   │   ├── gemini_generation.py        # Generate docstrings using Gemini API
│   │   ├── alignment.py                # Align functions with generated docstrings
│   │   ├── prompt.txt                  # Prompt template for docstring generation
│   │   ├── depend_analysis.py          # Dependency/call-graph analysis
│   │   ├── find_none_score_func.py     # Find functions missing scores
│   │   ├── folder_stat.py              # Repository folder statistics
│   │   ├── ppt.py                      # Visualization of alignment data
│   │   └── debug_parser.py             # Debug tree-sitter parser loading
│   └── README.md                       # DATA2 dataset documentation
│
└── data3/                              # DATA3: Programming Problems Generation Dataset
    ├── main.py                         # RepoAgent: generate docs for repos
    ├── gemini.py                       # Gemini API connectivity test
    ├── load_dataset.py                 # Load and inspect datasets
    ├── instruct_generation.py          # Score functions for scientific relevance
    ├── extract_functions.py            # Extract functions from enhanced_dataset.csv
    ├── extract_functions_v2.py         # Extract functions v2 (better CSV/JSON handling)
    ├── merge_datasets.py               # Merge res2.csv with dataset_all.csv
    ├── generate_programming_problems.py # Generate problems using Gemini API
    ├── generate_problems_batch.py      # Batch problem generation (OpenAI batch API)
    ├── generate_problems_openai.py     # Problem generation via OpenAI API
    ├── enrich_programming_problems.py  # Enrich problems with source code context
    ├── vllm_high.py                    # vLLM-based high-throughput inference
    ├── vllm_qwen_batch.py              # Qwen model batch inference via vLLM
    ├── show_pricing.py                 # Display API pricing information
    ├── check_enhanced.py               # Validate enhanced dataset
    ├── check_index_distribution.py     # Check index distribution
    ├── check_match.py                  # Check data matching
    ├── check_relationship.py           # Check data relationships
    ├── is_sci_prompt.txt               # Prompt: classify code as scientific computing
    ├── is_sci_prompt1.txt              # Prompt variant for scientific classification
    ├── score_prompt.txt                # Prompt: score function relevance
    ├── *.sh                            # Various shell scripts for batch processing
    └── README.md                       # DATA3 dataset documentation
```

## Dataset Building Pipelines

### DATA1: Domain-Specific Code Dataset

**Goal**: Collect, filter, and export domain-specific code from GitHub repositories.

**Pipeline** (executed in order):

1. **Keyword Expansion & Search** (`main.py` / `main_v2.py`)
   - Expand scientific keywords using an LLM
   - Search the GitHub API for repositories matching the keywords
   - Check relevance with an LLM (reads repo READMEs)
   - Shallow-clone the relevant repos
   - Filter to keep only code files
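
The search step can be sketched as follows. The endpoint and the `stars:>=N` qualifier are standard GitHub Search API usage; the function names, star threshold, and sort choices here are illustrative, not the exact ones in `main.py`:

```python
import json
import urllib.parse
import urllib.request

GITHUB_SEARCH_URL = "https://api.github.com/search/repositories"

def build_search_url(keyword: str, min_stars: int = 10, per_page: int = 100) -> str:
    """Build a repository-search URL for one (expanded) keyword."""
    query = {
        "q": f"{keyword} stars:>={min_stars}",  # GitHub search qualifier syntax
        "sort": "stars",
        "order": "desc",
        "per_page": str(per_page),
    }
    return GITHUB_SEARCH_URL + "?" + urllib.parse.urlencode(query)

def search_repos(keyword: str, token: str) -> list[dict]:
    """Run one authenticated search request and return matching repos."""
    req = urllib.request.Request(
        build_search_url(keyword),
        headers={"Authorization": f"Bearer {token}",
                 "Accept": "application/vnd.github+json"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)["items"]
```

Each returned item carries the metadata (stars, description, default branch) that the later relevance-check and clone steps consume.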

2. **External Data** (`download_dataset.py`)
   - Download the ChemPile code dataset from HuggingFace

3. **Merge & Deduplicate** (`merge_dataset.py`)
   - Merge crawled repos with the ChemPile data
   - Deduplicate by content hash
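
Content-hash deduplication amounts to hashing normalized file contents and keeping the first occurrence. A minimal sketch (the normalization rule and record shape are illustrative assumptions, not the exact ones in `merge_dataset.py`):

```python
import hashlib

def content_key(text: str) -> str:
    """Hash normalized contents so trailing whitespace and newline
    differences do not defeat deduplication."""
    normalized = "\n".join(line.rstrip() for line in text.strip().splitlines())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def deduplicate(files: list[dict]) -> list[dict]:
    """Keep the first file seen for each distinct content hash."""
    seen: set[str] = set()
    unique = []
    for f in files:
        key = content_key(f["content"])
        if key not in seen:
            seen.add(key)
            unique.append(f)
    return unique
```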

4. **Analysis** (`analysis.py`, `compute_*.py`)
   - Analyze code metrics (lines, comments, functions, tokens)
   - Compute keyword and star-count statistics

5. **Export** (`scripts/export_files_to_csv.py`)
   - Export the final dataset to CSV files grouped by keyword

6. **Reporting** (`reporting/`)
   - Generate statistical reports and visualizations

### DATA2: Code-Documentation Alignment Dataset

**Goal**: Generate high-quality docstrings for scientific code functions.

**Pipeline** (executed in order):

1. **README Summarization** (`instruction_generation/`)
   - Summarize repository READMEs using an LLM
   - Extract structured information from repos

2. **Function Extraction** (`step22/func_stat.py`)
   - Parse code with tree-sitter to extract functions
   - Multi-language support (Python, C, C++, Java, Go, Rust, Julia)

3. **README Processing** (`step22/md_stat.py`)
   - Copy README summaries into the function dataset directories

4. **Embedding Scoring** (`step22/emb_qwen_func.py`, `emb_qwen_md.py`)
   - Score function quality using a Qwen embedding model
   - Score README quality using a Qwen embedding model
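
The scoring step boils down to comparing each function (and README) embedding against a domain reference vector. A sketch with plain cosine similarity, where the Qwen model call is replaced by precomputed vectors and the 0.7/0.3 blend is an illustrative assumption:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def combined_score(func_emb, readme_emb, query_emb, alpha=0.7):
    """Blend function-level and repo-level (README) relevance into the
    single score that the filtering step thresholds on."""
    return (alpha * cosine_similarity(func_emb, query_emb)
            + (1 - alpha) * cosine_similarity(readme_emb, query_emb))
```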

5. **Function Filtering** (`step22/function_req.py`)
   - Filter functions by combined quality score

6. **Docstring Generation** (`step22/gemini_generation.py`)
   - Generate docstrings using the Gemini API
   - Budget monitoring with a circuit breaker
   - Checkpoint/resume support
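
The budget-monitoring circuit breaker mentioned above can be sketched as a small guard object consulted before every API call; class and method names (and the dollar amounts) are illustrative, not the ones in `gemini_generation.py`:

```python
class BudgetExceeded(RuntimeError):
    pass

class BudgetBreaker:
    """Stop issuing API calls once estimated spend crosses a hard cap."""

    def __init__(self, limit_usd: float):
        self.limit_usd = limit_usd
        self.spent_usd = 0.0
        self.tripped = False

    def charge(self, cost_usd: float) -> None:
        """Record the estimated cost of one call; trip the breaker at the cap."""
        self.spent_usd += cost_usd
        if self.spent_usd >= self.limit_usd:
            self.tripped = True

    def check(self) -> None:
        """Call before each request; raises once the breaker is open."""
        if self.tripped:
            raise BudgetExceeded(
                f"spent ${self.spent_usd:.2f} of ${self.limit_usd:.2f} budget")
```

Tripping the breaker rather than raising inside `charge` lets the in-flight result still be written to the checkpoint before the run stops.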

7. **Alignment** (`step22/alignment.py`)
   - Merge function data with generated docstrings
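
The alignment step is essentially a keyed join between extracted functions and generated docstrings. A sketch assuming both sides carry a shared function id (the `func_id` field name is an assumption for illustration):

```python
def align(functions: list[dict], docstrings: list[dict]) -> list[dict]:
    """Join functions with their generated docstrings by shared id,
    dropping functions whose generation failed or was filtered out."""
    by_id = {d["func_id"]: d["docstring"] for d in docstrings}
    aligned = []
    for func in functions:
        doc = by_id.get(func["func_id"])
        if doc is not None:
            aligned.append({**func, "generated_docstring": doc})
    return aligned
```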

### DATA3: Programming Problems Generation Dataset

**Goal**: Generate programming problems inspired by scientific code.

**Pipeline** (executed in order):

1. **Documentation Generation** (`main.py`)
   - Use RepoAgent to generate documentation for repositories

2. **Function Extraction** (`extract_functions.py`, `extract_functions_v2.py`)
   - Extract individual functions from the enhanced dataset

3. **Scientific Relevance Scoring** (`instruct_generation.py`)
   - Score functions for scientific computing relevance
   - Uses `is_sci_prompt.txt` and `score_prompt.txt` as prompts

4. **Dataset Merge** (`merge_datasets.py`)
   - Merge function scores with source code data

5. **Problem Generation** (`generate_programming_problems.py`)
   - Generate programming problems using the Gemini API
   - Filter by relevance score
   - Budget monitoring and cost control

6. **Enrichment** (`enrich_programming_problems.py`)
   - Enrich generated problems with source code context

## Dependencies

### Common
- `pandas`, `tqdm`, `jsonlines`
- `python-dotenv`

### DATA1
- `langchain`, `langchain-openai`, `pydantic`, `loguru`
- `requests` (GitHub API)
- `matplotlib`, `seaborn`, `wordcloud` (reporting)
- `datasets` (HuggingFace)

### DATA2
- `tree-sitter`, `tree-sitter-python`, `tree-sitter-c`, etc.
- `vllm`, `transformers`, `torch` (embedding scoring)
- `google-cloud-aiplatform`, `vertexai` (Gemini API)

### DATA3
- `google-cloud-aiplatform`, `vertexai` (Gemini API)
- `openai` (OpenAI API)
- `vllm`, `transformers`, `torch` (local inference)

## Notes

- Scripts contain hardcoded paths that must be updated for your environment
- API credentials (GitHub token, Gemini, OpenAI) must be configured separately
- Large datasets require significant storage and compute resources
- Most scripts support checkpoint/resume for long-running processes
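
The checkpoint/resume convention can be sketched as an append-only JSONL results file whose ids are skipped on restart; the file layout and function names here are an illustrative pattern, not the exact implementation shared by the scripts:

```python
import json
from pathlib import Path

def load_done_ids(checkpoint: Path) -> set:
    """Collect ids already processed in a previous (possibly interrupted) run."""
    if not checkpoint.exists():
        return set()
    with checkpoint.open() as f:
        return {json.loads(line)["id"] for line in f if line.strip()}

def run_with_resume(items, process, checkpoint: Path) -> None:
    """Process items, appending each result immediately so a crash
    loses at most the in-flight item."""
    done = load_done_ids(checkpoint)
    with checkpoint.open("a") as f:
        for item in items:
            if item["id"] in done:
                continue
            result = process(item)
            f.write(json.dumps({"id": item["id"], "result": result}) + "\n")
            f.flush()
```

Appending (rather than rewriting) the checkpoint keeps resume O(1) per item and safe against mid-write crashes.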