# Dataset Builder

This project contains the complete source code for building three domain-specific datasets for scientific computing code intelligence research.

## Project Structure

```
dataset_builder/
├── README.md                           # This file
│
├── data1/                              # DATA1: Domain-Specific Code Dataset
│   ├── main.py                         # Step 0-1: Keyword expansion + GitHub repo search
│   ├── main_v2.py                      # Step 0-4: Full pipeline (search → check → clone → filter)
│   ├── util.py                         # Shared utilities (logger, LLM calls, code extensions)
│   ├── download_dataset.py             # Download ChemPile code dataset from HuggingFace
│   ├── merge_dataset.py                # Merge crawled repos with ChemPile data, deduplicate
│   ├── analysis.py                     # Code-level analysis (comments, functions, tokens)
│   ├── compute_stars_keywords.py       # Compute stars/keyword statistics
│   ├── compute_statistics.py           # Compute code statistics from JSONL analysis files
│   ├── rename.py                       # Rename repo directories to owner___repo format
│   ├── rename2.py                      # Rename ChemPile files with zero-padded numbering
│   ├── pyproject.toml                  # Python project config
│   ├── scripts/
│   │   └── export_files_to_csv.py      # Export repo files to CSV grouped by keyword
│   ├── reporting/                      # Statistical reporting and visualization
│   │   ├── __init__.py
│   │   ├── main.py                     # Reporting entry point
│   │   ├── visualization.py            # Generate figures (funnel, distributions, etc.)
│   │   ├── repo_meta_scan.py           # Scan repo-level metadata
│   │   ├── code_file_stats.py          # File-level code statistics
│   │   ├── code_file_stats_fast.py     # Optimized file-level statistics
│   │   ├── stage_a_stats.py            # Stage A (search/check) statistics
│   │   ├── stage_b_stats.py            # Stage B (clone/filter) statistics
│   │   └── join_insights.py            # Join and cross-analyze insights
│   └── README.md                       # DATA1 dataset documentation
│
├── data2/                              # DATA2: Code-Documentation Alignment Dataset
│   ├── instruction_generation/         # README summarization pipeline
│   │   ├── pipeline.py                 # Unified entry (summarize + parse modes)
│   │   ├── summarize_repo_readme.py    # Summarize repo READMEs using LLM
│   │   ├── extract_repo_functions.py   # Extract functions from repos
│   │   ├── schemas.py                  # Pydantic data schemas
│   │   └── prompts/
│   │       ├── function_extract.txt    # Prompt for function extraction
│   │       └── readme_summary.txt      # Prompt for README summarization
│   ├── step22/                         # Function scoring, generation, alignment
│   │   ├── build.py                    # Build tree-sitter language parsers
│   │   ├── func_stat.py                # Extract functions using tree-sitter
│   │   ├── md_stat.py                  # Extract & save README summaries
│   │   ├── emb_qwen_func.py            # Score functions using Qwen embedding model
│   │   ├── emb_qwen_md.py              # Score READMEs using Qwen embedding model
│   │   ├── function_req.py             # Filter functions by score threshold
│   │   ├── gemini_generation.py        # Generate docstrings using Gemini API
│   │   ├── alignment.py                # Align functions with generated docstrings
│   │   ├── prompt.txt                  # Prompt template for docstring generation
│   │   ├── depend_analysis.py          # Dependency/call-graph analysis
│   │   ├── find_none_score_func.py     # Find functions missing scores
│   │   ├── folder_stat.py              # Repository folder statistics
│   │   ├── ppt.py                      # Visualization of alignment data
│   │   └── debug_parser.py             # Debug tree-sitter parser loading
│   └── README.md                       # DATA2 dataset documentation
│
└── data3/                              # DATA3: Programming Problems Generation Dataset
    ├── main.py                         # RepoAgent: generate docs for repos
    ├── gemini.py                       # Gemini API connectivity test
    ├── load_dataset.py                 # Load and inspect datasets
    ├── instruct_generation.py          # Score functions for scientific relevance
    ├── extract_functions.py            # Extract functions from enhanced_dataset.csv
    ├── extract_functions_v2.py         # Extract functions v2 (better CSV/JSON handling)
    ├── merge_datasets.py               # Merge res2.csv with dataset_all.csv
    ├── generate_programming_problems.py # Generate problems using Gemini API
    ├── generate_problems_batch.py      # Batch problem generation (OpenAI batch API)
    ├── generate_problems_openai.py     # Problem generation via OpenAI API
    ├── enrich_programming_problems.py  # Enrich problems with source code context
    ├── vllm_high.py                    # vLLM-based high-throughput inference
    ├── vllm_qwen_batch.py              # Qwen model batch inference via vLLM
    ├── show_pricing.py                 # Display API pricing information
    ├── check_enhanced.py               # Validate enhanced dataset
    ├── check_index_distribution.py     # Check index distribution
    ├── check_match.py                  # Check data matching
    ├── check_relationship.py           # Check data relationships
    ├── is_sci_prompt.txt               # Prompt: classify code as scientific computing
    ├── is_sci_prompt1.txt              # Prompt variant for scientific classification
    ├── score_prompt.txt                # Prompt: score function relevance
    ├── *.sh                            # Various shell scripts for batch processing
    └── README.md                       # DATA3 dataset documentation
```

## Dataset Building Pipelines

### DATA1: Domain-Specific Code Dataset

**Goal**: Collect, filter, and export domain-specific code from GitHub repositories.

**Pipeline** (executed in order):

1. **Keyword Expansion & Search** (`main.py` / `main_v2.py`)
   - Expand scientific keywords using an LLM
   - Search the GitHub API for repositories matching the keywords
   - Check relevance with an LLM (reads repo READMEs)
   - Shallow-clone the relevant repos
   - Filter to keep only code files
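
The search step can be sketched as follows. The endpoint and the `stars:>=N` qualifier are standard GitHub Search API usage; the function names, star threshold, and sort choices here are illustrative, not the exact ones in `main.py`:

```python
import json
import urllib.parse
import urllib.request

GITHUB_SEARCH_URL = "https://api.github.com/search/repositories"

def build_search_url(keyword: str, min_stars: int = 10, per_page: int = 100) -> str:
    """Build a repository-search URL for one (expanded) keyword."""
    query = {
        "q": f"{keyword} stars:>={min_stars}",  # GitHub search qualifier syntax
        "sort": "stars",
        "order": "desc",
        "per_page": str(per_page),
    }
    return GITHUB_SEARCH_URL + "?" + urllib.parse.urlencode(query)

def search_repos(keyword: str, token: str) -> list[dict]:
    """Run one authenticated search request and return matching repos."""
    req = urllib.request.Request(
        build_search_url(keyword),
        headers={"Authorization": f"Bearer {token}",
                 "Accept": "application/vnd.github+json"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)["items"]
```

Each returned item carries the metadata (stars, description, default branch) that the later relevance-check and clone steps consume.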

2. **External Data** (`download_dataset.py`)
   - Download the ChemPile code dataset from HuggingFace

3. **Merge & Deduplicate** (`merge_dataset.py`)
   - Merge crawled repos with the ChemPile data
   - Deduplicate by content hash
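
Content-hash deduplication amounts to hashing normalized file contents and keeping the first occurrence. A minimal sketch (the normalization rule and record shape are illustrative assumptions, not the exact ones in `merge_dataset.py`):

```python
import hashlib

def content_key(text: str) -> str:
    """Hash normalized contents so trailing whitespace and newline
    differences do not defeat deduplication."""
    normalized = "\n".join(line.rstrip() for line in text.strip().splitlines())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def deduplicate(files: list[dict]) -> list[dict]:
    """Keep the first file seen for each distinct content hash."""
    seen: set[str] = set()
    unique = []
    for f in files:
        key = content_key(f["content"])
        if key not in seen:
            seen.add(key)
            unique.append(f)
    return unique
```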

4. **Analysis** (`analysis.py`, `compute_*.py`)
   - Analyze code metrics (lines, comments, functions, tokens)
   - Compute keyword and star-count statistics

5. **Export** (`scripts/export_files_to_csv.py`)
   - Export the final dataset to CSV files grouped by keyword

6. **Reporting** (`reporting/`)
   - Generate statistical reports and visualizations

### DATA2: Code-Documentation Alignment Dataset

**Goal**: Generate high-quality docstrings for scientific code functions.

**Pipeline** (executed in order):

1. **README Summarization** (`instruction_generation/`)
   - Summarize repository READMEs using an LLM
   - Extract structured information from repos

2. **Function Extraction** (`step22/func_stat.py`)
   - Parse code with tree-sitter to extract functions
   - Multi-language support (Python, C, C++, Java, Go, Rust, Julia)

3. **README Processing** (`step22/md_stat.py`)
   - Copy README summaries into the function dataset directories

4. **Embedding Scoring** (`step22/emb_qwen_func.py`, `emb_qwen_md.py`)
   - Score function quality using a Qwen embedding model
   - Score README quality using a Qwen embedding model
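
The scoring step boils down to comparing each function (and README) embedding against a domain reference vector. A sketch with plain cosine similarity, where the Qwen model call is replaced by precomputed vectors and the 0.7/0.3 blend is an illustrative assumption:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def combined_score(func_emb, readme_emb, query_emb, alpha=0.7):
    """Blend function-level and repo-level (README) relevance into the
    single score that the filtering step thresholds on."""
    return (alpha * cosine_similarity(func_emb, query_emb)
            + (1 - alpha) * cosine_similarity(readme_emb, query_emb))
```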

5. **Function Filtering** (`step22/function_req.py`)
   - Filter functions by combined quality score

6. **Docstring Generation** (`step22/gemini_generation.py`)
   - Generate docstrings using the Gemini API
   - Budget monitoring with a circuit breaker
   - Checkpoint/resume support
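
The budget-monitoring circuit breaker mentioned above can be sketched as a small guard object consulted before every API call; class and method names (and the dollar amounts) are illustrative, not the ones in `gemini_generation.py`:

```python
class BudgetExceeded(RuntimeError):
    pass

class BudgetBreaker:
    """Stop issuing API calls once estimated spend crosses a hard cap."""

    def __init__(self, limit_usd: float):
        self.limit_usd = limit_usd
        self.spent_usd = 0.0
        self.tripped = False

    def charge(self, cost_usd: float) -> None:
        """Record the estimated cost of one call; trip the breaker at the cap."""
        self.spent_usd += cost_usd
        if self.spent_usd >= self.limit_usd:
            self.tripped = True

    def check(self) -> None:
        """Call before each request; raises once the breaker is open."""
        if self.tripped:
            raise BudgetExceeded(
                f"spent ${self.spent_usd:.2f} of ${self.limit_usd:.2f} budget")
```

Tripping the breaker rather than raising inside `charge` lets the in-flight result still be written to the checkpoint before the run stops.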

7. **Alignment** (`step22/alignment.py`)
   - Merge function data with generated docstrings
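
The alignment step is essentially a keyed join between extracted functions and generated docstrings. A sketch assuming both sides carry a shared function id (the `func_id` field name is an assumption for illustration):

```python
def align(functions: list[dict], docstrings: list[dict]) -> list[dict]:
    """Join functions with their generated docstrings by shared id,
    dropping functions whose generation failed or was filtered out."""
    by_id = {d["func_id"]: d["docstring"] for d in docstrings}
    aligned = []
    for func in functions:
        doc = by_id.get(func["func_id"])
        if doc is not None:
            aligned.append({**func, "generated_docstring": doc})
    return aligned
```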

### DATA3: Programming Problems Generation Dataset

**Goal**: Generate programming problems inspired by scientific code.

**Pipeline** (executed in order):

1. **Documentation Generation** (`main.py`)
   - Use RepoAgent to generate documentation for repositories

2. **Function Extraction** (`extract_functions.py`, `extract_functions_v2.py`)
   - Extract individual functions from the enhanced dataset

3. **Scientific Relevance Scoring** (`instruct_generation.py`)
   - Score functions for scientific computing relevance
   - Uses `is_sci_prompt.txt` and `score_prompt.txt` as prompts

4. **Dataset Merge** (`merge_datasets.py`)
   - Merge function scores with source code data

5. **Problem Generation** (`generate_programming_problems.py`)
   - Generate programming problems using the Gemini API
   - Filter by relevance score
   - Budget monitoring and cost control

6. **Enrichment** (`enrich_programming_problems.py`)
   - Enrich generated problems with source code context

## Dependencies

### Common
- `pandas`, `tqdm`, `jsonlines`
- `python-dotenv`

### DATA1
- `langchain`, `langchain-openai`, `pydantic`, `loguru`
- `requests` (GitHub API)
- `matplotlib`, `seaborn`, `wordcloud` (reporting)
- `datasets` (HuggingFace)

### DATA2
- `tree-sitter`, `tree-sitter-python`, `tree-sitter-c`, etc.
- `vllm`, `transformers`, `torch` (embedding scoring)
- `google-cloud-aiplatform`, `vertexai` (Gemini API)

### DATA3
- `google-cloud-aiplatform`, `vertexai` (Gemini API)
- `openai` (OpenAI API)
- `vllm`, `transformers`, `torch` (local inference)

## Notes

- Scripts contain hardcoded paths that must be updated for your environment
- API credentials (GitHub token, Gemini, OpenAI) must be configured separately
- Large datasets require significant storage and compute resources
- Most scripts support checkpoint/resume for long-running processes
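
The checkpoint/resume convention can be sketched as an append-only JSONL results file whose ids are skipped on restart; the file layout and function names here are an illustrative pattern, not the exact implementation shared by the scripts:

```python
import json
from pathlib import Path

def load_done_ids(checkpoint: Path) -> set:
    """Collect ids already processed in a previous (possibly interrupted) run."""
    if not checkpoint.exists():
        return set()
    with checkpoint.open() as f:
        return {json.loads(line)["id"] for line in f if line.strip()}

def run_with_resume(items, process, checkpoint: Path) -> None:
    """Process items, appending each result immediately so a crash
    loses at most the in-flight item."""
    done = load_done_ids(checkpoint)
    with checkpoint.open("a") as f:
        for item in items:
            if item["id"] in done:
                continue
            result = process(item)
            f.write(json.dumps({"id": item["id"], "result": result}) + "\n")
            f.flush()
```

Appending (rather than rewriting) the checkpoint keeps resume O(1) per item and safe against mid-write crashes.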