Mindigenous committed
Commit 53f0cc2 · 0 Parent(s):

Initial full project backup with Git LFS

This view is limited to 50 files because it contains too many changes.
Files changed (50)
  1. .gitattributes +10 -0
  2. .gitignore +29 -0
  3. CONTEXT_SUMMARY.md +38 -0
  4. README_COMPONENT_1_SETUP.md +83 -0
  5. README_COMPONENT_3_DATASET_PIPELINE.md +46 -0
  6. README_COMPONENT_4_MODEL_ARCHITECTURE.md +28 -0
  7. README_COMPONENT_5_TRAINING_PIPELINE.md +42 -0
  8. README_COMPONENT_8_CHAT_INTERFACE.md +20 -0
  9. README_FINAL_PROJECT.md +126 -0
  10. artifacts/evaluation/component6_eval_results.json +3 -0
  11. artifacts/evaluation/component7_inference_results.json +3 -0
  12. artifacts/export/component10_benchmark_report.json +3 -0
  13. artifacts/model/component4_model_summary.json +3 -0
  14. artifacts/tokenizer/code_tokenizer_v1/tokenizer.json +3 -0
  15. artifacts/tokenizer/code_tokenizer_v1/tokenizer_config.json +3 -0
  16. backup_step1000.tar.gz +3 -0
  17. backup_step2000.tar.gz +3 -0
  18. backup_step3000.tar.gz +3 -0
  19. checkpoints/component5_420m/latest.pt +3 -0
  20. checkpoints/component5_420m/step_3000.pt +3 -0
  21. checkpoints/component5_420m/step_3200.pt +3 -0
  22. config.py +45 -0
  23. configs/component10_export_config.yaml +21 -0
  24. configs/component3_dataset_pipeline.yaml +38 -0
  25. configs/component3_incremental_js.yaml +27 -0
  26. configs/component3_reprocess_from_clean.yaml +19 -0
  27. configs/component4_model_config.yaml +18 -0
  28. configs/component5_training_config.verify.yaml +32 -0
  29. configs/component5_training_config.yaml +37 -0
  30. configs/component6_evaluation_config.yaml +21 -0
  31. configs/component7_inference_config.yaml +20 -0
  32. configs/component8_chat_config.yaml +30 -0
  33. configs/component9_lora_config.verify.yaml +32 -0
  34. configs/component9_lora_config.yaml +31 -0
  35. data/cache/raw/code_search_net_python/dataset_dict.json +3 -0
  36. data/cache/raw/code_search_net_python/test/data-00000-of-00001.arrow +3 -0
  37. data/cache/raw/code_search_net_python/test/dataset_info.json +3 -0
  38. data/cache/raw/code_search_net_python/test/state.json +3 -0
  39. data/cache/raw/code_search_net_python/train/data-00000-of-00004.arrow +3 -0
  40. data/cache/raw/code_search_net_python/train/data-00001-of-00004.arrow +3 -0
  41. data/cache/raw/code_search_net_python/train/data-00002-of-00004.arrow +3 -0
  42. data/cache/raw/code_search_net_python/train/data-00003-of-00004.arrow +3 -0
  43. data/cache/raw/code_search_net_python/train/dataset_info.json +3 -0
  44. data/cache/raw/code_search_net_python/train/state.json +3 -0
  45. data/cache/raw/code_search_net_python/validation/data-00000-of-00001.arrow +3 -0
  46. data/cache/raw/code_search_net_python/validation/dataset_info.json +3 -0
  47. data/cache/raw/code_search_net_python/validation/state.json +3 -0
  48. data/cache/raw/mbpp/dataset_dict.json +3 -0
  49. data/cache/raw/mbpp/prompt/data-00000-of-00001.arrow +3 -0
  50. data/cache/raw/mbpp/prompt/dataset_info.json +3 -0
.gitattributes ADDED
@@ -0,0 +1,10 @@
+ *.zip filter=lfs diff=lfs merge=lfs -text
+ *.tar.gz filter=lfs diff=lfs merge=lfs -text
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
+ output/checkpoints/* filter=lfs diff=lfs merge=lfs -text
+ checkpoints/** filter=lfs diff=lfs merge=lfs -text
+ models/** filter=lfs diff=lfs merge=lfs -text
+ data/** filter=lfs diff=lfs merge=lfs -text
+ artifacts/** filter=lfs diff=lfs merge=lfs -text
+ logs/** filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
.gitignore ADDED
@@ -0,0 +1,29 @@
+ # Ignore Python cache and compiled files.
+ __pycache__/
+ *.pyc
+ *.pyo
+ *.pyd
+
+ # Ignore virtual environment.
+ .venv/
+
+ # Ignore logs and temporary outputs.
+ logs/
+ artifacts/
+ *.log
+
+ # Ignore model weights and checkpoints by default.
+ checkpoints/
+ models/base/
+ models/lora/
+ models/quantized/
+
+ # Ignore data files by default.
+ data/raw/
+ data/interim/
+ data/processed/
+ data/external/
+
+ # Ignore notebook checkpoints.
+ .ipynb_checkpoints/
+
CONTEXT_SUMMARY.md ADDED
@@ -0,0 +1,38 @@
+ # Project Context Summary
+
+ This file captures the current state of work from the active collaboration session.
+
+ ## Environment
+ - Original project path: `D:\Desktop 31st Jan 2026\MIND-AI-MODEL`
+ - Target copy path requested: `C:\AI 2`
+ - OS: Windows
+ - GPU: NVIDIA RTX 4060 Laptop (8GB VRAM)
+
+ ## Completed Components
+ 1. Component 1 (Project setup): completed and verified.
+ 2. Component 2 (Custom tokenizer): completed and verified.
+ 3. Component 3 (Dataset pipeline): completed and verified.
+ 4. Component 3 final-step reprocess fix: completed and verified, with JS rebalance.
+ 5. Component 4 (420M transformer architecture): completed and verified.
+
+ ## Current Dataset Stats
+ - Total processed records: 139,531
+ - Python: 115,572
+ - JavaScript: 23,959
+
+ ## Current Model Architecture
+ - Preset: `medium_420m`
+ - Parameters: 423,934,848
+ - Verified forward pass on GPU successful.
+
+ ## Key Files
+ - `configs/component4_model_config.yaml`
+ - `src/model_architecture/code_transformer.py`
+ - `scripts/build_component4_model.py`
+ - `scripts/verify_component4_model.py`
+ - `data/processed/train_tokenized.jsonl`
+ - `data/processed/pipeline_stats.json`
+
+ ## Next Planned Component
+ - Component 5: Training pipeline with FP16, gradient checkpointing, gradient accumulation, checkpointing every 100 steps, resume support, early stopping, and live training metrics.
+
README_COMPONENT_1_SETUP.md ADDED
@@ -0,0 +1,83 @@
+ # Component 1: Project Setup (Windows + RTX 4060 8GB)
+
+ ## What This Component Does
+ - Creates a clean folder structure for the full coding-assistant project.
+ - Sets up a Python virtual environment.
+ - Installs all core dependencies needed across Components 2-10.
+ - Verifies that Python, PyTorch, CUDA visibility, and key libraries work.
+
+ ## Folder Structure Created
+ - `data/raw` -> raw datasets you will provide later
+ - `data/interim` -> temporary cleaned data
+ - `data/processed` -> training-ready tokenized data
+ - `data/external` -> any third-party resources
+ - `src/tokenizer` -> Component 2 code tokenizer
+ - `src/dataset_pipeline` -> Component 3 preprocessing pipeline
+ - `src/model_architecture` -> Component 4 transformer code
+ - `src/training_pipeline` -> Component 5 training loop
+ - `src/evaluation_system` -> Component 6 evaluation code
+ - `src/inference_engine` -> Component 7 inference code
+ - `src/chat_interface` -> Component 8 Gradio interface
+ - `src/finetuning_system` -> Component 9 LoRA fine-tuning
+ - `src/export_optimization` -> Component 10 quantization/export tools
+ - `configs` -> config files for all components
+ - `scripts` -> setup, verification, and utility scripts
+ - `tests` -> quick checks for each component
+ - `checkpoints` -> model checkpoints saved during training
+ - `models/base` -> base trained model files
+ - `models/lora` -> LoRA adapters
+ - `models/quantized` -> optimized quantized models
+ - `artifacts` -> generated reports, metrics, and outputs
+ - `logs` -> training and runtime logs
+
+ ## Exact Commands To Run (in this order)
+ Run from:
+ `D:\Desktop 31st Jan 2026\MIND-AI-MODEL`
+
+ 0. Install Python 3.11 (required for package compatibility):
+    - Download page: https://www.python.org/downloads/release/python-3119/
+    - Windows installer file: `python-3.11.9-amd64.exe`
+    - During install, check: `Add python.exe to PATH`
+
+ 1. Allow script execution for this terminal only:
+ ```powershell
+ Set-ExecutionPolicy -Scope Process -ExecutionPolicy Bypass
+ ```
+
+ 2. If you already attempted setup once, remove old virtual environment first:
+ ```powershell
+ if (Test-Path .\.venv) { Remove-Item -Recurse -Force .\.venv }
+ ```
+
+ 3. Create folders, virtual env, install dependencies:
+ ```powershell
+ .\scripts\setup_windows_environment.ps1
+ ```
+
+ 4. Activate virtual environment:
+ ```powershell
+ .\.venv\Scripts\Activate.ps1
+ ```
+
+ 5. Verify setup:
+ ```powershell
+ python .\scripts\verify_component1_setup.py
+ ```
+
+ ## Expected Verification Result
+ - Prints Python version
+ - Prints PyTorch version
+ - Shows whether CUDA is available
+ - Shows GPU name if available
+ - Confirms critical libraries import correctly
+
+ Note:
+ - `codebleu` is excluded from base install on Windows due to a `tree-sitter` dependency conflict on Python 3.11.
+ - Component 6 will use Windows-stable evaluation metrics and add code-quality checks without breaking setup.
+ - `bitsandbytes` is optional on native Windows because some CUDA/driver combinations fail to load its DLL.
+ - Base setup and all early components continue without it.
+ - For Component 5, we will:
+   - try `bitsandbytes` if available, and
+   - automatically fall back to a stable optimizer on your machine if it is not.
+
+ If verification fails, copy the full terminal output and share it with me.
README_COMPONENT_3_DATASET_PIPELINE.md ADDED
@@ -0,0 +1,46 @@
+ # Component 3: Dataset Pipeline
+
+ ## What This Component Does (Simple English)
+ - Downloads the 3 datasets directly from Hugging Face (no manual download files).
+ - Reads them in streaming mode so your RAM usage stays low.
+ - Cleans prompt/code text.
+ - Removes low-quality and likely auto-generated data.
+ - Removes duplicate prompt+code pairs using a disk-backed SQLite index.
+ - Detects language (Python or JavaScript) when unclear.
+ - Tokenizes all cleaned records using the Component 2 tokenizer.
+ - Saves training-ready tokenized JSONL output.
+
+ ## Files Created By This Component
+ - `configs/component3_dataset_pipeline.yaml`
+ - `src/dataset_pipeline/hf_dataset_pipeline.py`
+ - `scripts/run_component3_dataset_pipeline.py`
+ - `scripts/verify_component3_dataset_pipeline.py`
+
+ ## Required Before Running
+ - Component 2 tokenizer must exist at:
+   - `artifacts/tokenizer/code_tokenizer_v1/tokenizer.json`
+   - `artifacts/tokenizer/code_tokenizer_v1/tokenizer_config.json`
+
+ ## Quick Verification Run (small test)
+ Run from project root:
+ ```powershell
+ .\.venv\Scripts\Activate.ps1
+ python .\scripts\verify_component3_dataset_pipeline.py
+ ```
+
+ This uses `200` records per dataset for a smoke test.
+
+ ## Full Pipeline Run
+ ```powershell
+ .\.venv\Scripts\Activate.ps1
+ python .\scripts\run_component3_dataset_pipeline.py --config .\configs\component3_dataset_pipeline.yaml
+ ```
+
+ ## Output Files
+ - Clean merged dataset:
+   - `data/interim/combined_clean.jsonl`
+ - Tokenized training dataset:
+   - `data/processed/train_tokenized.jsonl`
+ - Stats summary:
+   - `data/processed/pipeline_stats.json`
+
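The dedupe step above (a disk-backed SQLite index over prompt+code pairs) can be sketched roughly as follows. This is an illustrative assumption of how such an index works, not the pipeline's actual code; the table name, hashing choice, and in-memory database are demonstration details:

```python
import hashlib
import sqlite3


def dedupe_records(records, db_path=":memory:"):
    """Yield only the first occurrence of each prompt+code pair.

    A SQLite table holds SHA-256 hashes, so the seen-set can live on
    disk instead of in RAM (':memory:' is used here for demonstration).
    """
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS seen (h TEXT PRIMARY KEY)")
    for rec in records:
        key = hashlib.sha256(
            (rec["prompt"] + "\x00" + rec["code"]).encode("utf-8")
        ).hexdigest()
        try:
            con.execute("INSERT INTO seen (h) VALUES (?)", (key,))
        except sqlite3.IntegrityError:
            continue  # duplicate pair: skip it
        yield rec
    con.commit()


rows = [
    {"prompt": "add two numbers", "code": "def add(a, b):\n    return a + b"},
    {"prompt": "add two numbers", "code": "def add(a, b):\n    return a + b"},
    {"prompt": "reverse a string", "code": "def rev(s):\n    return s[::-1]"},
]
unique = list(dedupe_records(rows))
print(len(unique))  # 2
```

Because the PRIMARY KEY constraint rejects repeated hashes, memory stays flat no matter how many records stream through.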
README_COMPONENT_4_MODEL_ARCHITECTURE.md ADDED
@@ -0,0 +1,28 @@
+ # Component 4: Model Architecture (420M Starter)
+
+ ## What This Component Builds
+ - A decoder-only transformer language model for code generation.
+ - Configurable size through YAML config.
+ - Presets for small, medium (420M target), and large.
+ - Attention + rotary positional encoding + feed-forward blocks.
+
+ ## Main Files
+ - `src/model_architecture/code_transformer.py`
+ - `configs/component4_model_config.yaml`
+ - `scripts/build_component4_model.py`
+ - `scripts/verify_component4_model.py`
+
+ ## Commands (run from project root)
+ ```powershell
+ .\.venv\Scripts\Activate.ps1
+ python .\scripts\build_component4_model.py --config .\configs\component4_model_config.yaml
+ python .\scripts\verify_component4_model.py --config .\configs\component4_model_config.yaml --batch_size 1 --seq_len 256
+ ```
+
+ ## What Success Looks Like
+ - Build script prints parameter count near the 420M target.
+ - Verify script prints:
+   - VRAM usage at multiple stages
+   - output tensor shape
+   - `Component 4 verification passed.`
+
README_COMPONENT_5_TRAINING_PIPELINE.md ADDED
@@ -0,0 +1,42 @@
+ # Component 5: Training Pipeline
+
+ ## What This Component Does
+ - Trains the 420M transformer on tokenized data.
+ - Uses FP16 mixed precision to reduce VRAM.
+ - Uses gradient checkpointing to save memory.
+ - Uses gradient accumulation for larger effective batch size.
+ - Attempts Adam8bit optimizer when available, otherwise safely falls back.
+ - Saves checkpoint every 100 steps by default.
+ - Supports resuming from latest checkpoint.
+ - Evaluates periodically and supports early stopping.
+ - Shows live loss, LR, ETA, and VRAM.
+
+ ## Main Files
+ - `configs/component5_training_config.yaml`
+ - `src/training_pipeline/tokenized_dataset.py`
+ - `scripts/train_component5.py`
+ - `scripts/verify_component5_training_pipeline.py`
+
+ ## Commands
+ ```powershell
+ .\.venv\Scripts\Activate.ps1
+ python .\scripts\verify_component5_training_pipeline.py
+ python .\scripts\train_component5.py --config .\configs\component5_training_config.yaml
+ ```
+
+ ## VRAM and Runtime (RTX 4060 8GB)
+ - Expected VRAM during training with default config: about 5.8 to 6.9 GB.
+ - Safety stop is enabled at 7.0 GB.
+ - Approx training time for 1 epoch equivalent: ~30 to 65 hours.
+
+ ## Common Failures and Fixes
+ 1. OOM or VRAM threshold hit:
+    - Reduce `max_seq_len` (e.g., 512 -> 384).
+    - Increase `grad_accum_steps`.
+ 2. Training too slow:
+    - Lower `max_seq_len` for first run.
+    - Keep `micro_batch_size=1` and adjust accumulation.
+ 3. Resume issues:
+    - Ensure `checkpoints/component5_420m/latest.pt` exists.
+ 4. Validation not improving:
+    - Lower LR and increase warmup.
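The early-stopping rule described above can be sketched in a few lines; the class name and exact bookkeeping are illustrative assumptions matching the config keys `early_stopping_patience_evals` and `early_stopping_min_delta`, not the project's actual implementation:

```python
class EarlyStopping:
    """Stop when validation loss has not improved by at least
    `min_delta` for `patience` consecutive evaluations."""

    def __init__(self, patience=5, min_delta=0.0005):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_evals = 0

    def update(self, val_loss):
        """Record one evaluation; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_evals = 0
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience


stopper = EarlyStopping(patience=3, min_delta=0.01)
losses = [2.0, 1.8, 1.79, 1.795, 1.791, 1.792]
stops = [stopper.update(loss) for loss in losses]
print(stops)  # the plateau after 1.8 eventually trips the patience counter
```

Counting patience in evaluations (not steps) means `eval_every` directly controls how long a plateau is tolerated.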
README_COMPONENT_8_CHAT_INTERFACE.md ADDED
@@ -0,0 +1,20 @@
+ # Component 8: Local Chat Interface
+
+ ## What it gives you
+ - Browser chat UI for your local coding model.
+ - Uses Component 7 inference engine automatically.
+ - Dark theme, prompt box, code cards, copy button per response.
+ - Syntax highlighting for Python and JavaScript.
+ - Shows generation time and generated token count.
+ - Keeps conversation history in the current session.
+ - Clear button to reset conversation.
+
+ ## Launch (single command)
+ ```powershell
+ python .\scripts\launch_component8_chat.py --config .\configs\component8_chat_config.yaml
+ ```
+
+ ## URL to open
+ - `http://127.0.0.1:7860`
+
+ No internet is needed for local usage.
README_FINAL_PROJECT.md ADDED
@@ -0,0 +1,126 @@
+ # Final Project README - MINDI 1.0 420M (Windows, RTX 4060 8GB)
+
+ ## What This Project Is
+ This is a fully local coding-assistant model system built step-by-step from scratch.
+ It supports:
+ - custom tokenizer for code
+ - dataset cleaning + tokenization pipeline
+ - 420M transformer model
+ - memory-optimized training
+ - evaluation + inference improvements
+ - local chat UI
+ - LoRA fine-tuning
+ - INT8 export + portable package
+
+ Everything runs locally on your machine without internet after setup.
+
+ ---
+
+ ## What You Built (High Level)
+ 1. **Project setup** with reproducible environment and verification scripts.
+ 2. **Custom code tokenizer** (Python + JavaScript aware).
+ 3. **Dataset pipeline** with cleaning, dedupe, and tokenization.
+ 4. **420M transformer architecture** (modular config).
+ 5. **Training pipeline** (FP16, checkpointing, accumulation, resume, early stopping).
+ 6. **Evaluation system** (val metrics + generation checks).
+ 7. **Inference engine** (greedy mode, stop rules, syntax-aware retry).
+ 8. **Local chat interface** with history, copy button, timing, and mode selector.
+ 9. **LoRA fine-tuning pipeline** for your own examples.
+ 10. **Export/quantization/packaging** with benchmark report and portable launcher.
+
+ ---
+
+ ## Most Important File Locations
+
+ ### Core model and data
+ - Base checkpoint: `checkpoints/component5_420m/step_3200.pt`
+ - Tokenized training data: `data/processed/train_tokenized.jsonl`
+ - Tokenizer: `artifacts/tokenizer/code_tokenizer_v1/`
+
+ ### LoRA
+ - Best LoRA adapter: `models/lora/custom_lora_v1/best.pt`
+ - LoRA metadata: `models/lora/custom_lora_v1/adapter_meta.json`
+
+ ### Quantized model
+ - INT8 model: `models/quantized/model_step3200_int8_state.pt`
+ - Benchmark report: `artifacts/export/component10_benchmark_report.json`
+
+ ### Chat interface
+ - Launcher: `scripts/launch_component8_chat.py`
+ - Chat config: `configs/component8_chat_config.yaml`
+
+ ### Portable package
+ - Folder: `release/MINDI_1.0_420M`
+ - Double-click launcher: `release/MINDI_1.0_420M/Start_MINDI.bat`
+
+ ---
+
+ ## Launch the Main Chat UI
+ From project root (`C:\AI 2`):
+
+ ```powershell
+ .\.venv\Scripts\Activate.ps1
+ python .\scripts\launch_component8_chat.py --config .\configs\component8_chat_config.yaml
+ ```
+
+ Open in browser:
+ - `http://127.0.0.1:7860`
+
+ ### Live model selector in UI
+ You can switch without restart:
+ - `base`
+ - `lora`
+ - `int8`
+
+ Status box shows:
+ - active mode
+ - mode load time
+ - live VRAM usage
+
+ ---
+
+ ## How to Add More Training Data (Future Improvement)
+
+ ### A) Add more base-training pairs (full training path)
+ 1. Put new JSONL/JSON files in `data/raw/`.
+ 2. Run dataset processing scripts (Component 3 path).
+ 3. Continue/refresh base training with Component 5.
+
+ ### B) Add targeted improvements quickly (LoRA recommended)
+ 1. Edit `data/raw/custom_finetune_pairs.jsonl` with your new prompt/code pairs.
+    - Required fields per row: `prompt`, `code`
+    - Optional: `language` (`python` or `javascript`)
+ 2. Run LoRA fine-tuning:
+
+ ```powershell
+ python .\scripts\run_component9_lora_finetune.py --config .\configs\component9_lora_config.yaml
+ ```
+
+ 3. Use updated adapter in chat by selecting `lora` mode.
+
+ ---
+
+ ## Recommended Next Habit
+ When quality is weak on specific tasks:
+ 1. Add 20-200 clean examples of exactly that task style to `custom_finetune_pairs.jsonl`.
+ 2. Re-run LoRA fine-tuning.
+ 3. Test in chat `lora` mode.
+ 4. Repeat in small cycles.
+
+ This gives faster improvement than retraining the full base model each time.
+
+ ---
+
+ ## One-File Health Check Commands
+
+ ```powershell
+ python .\scripts\verify_component1_setup.py
+ python .\scripts\verify_component4_model.py --config .\configs\component4_model_config.yaml --batch_size 1 --seq_len 256
+ python .\scripts\verify_component9_lora.py
+ ```
+
+ ---
+
+ ## Current Status
+ Project is complete across Components 1-10 and verified on your hardware.
+
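The fine-tuning rows described in section B of the README above are plain JSONL: one JSON object per line with `prompt` and `code`, plus an optional `language`. Appending new pairs needs only the standard library; the temp-file path here is a demo stand-in for the project's `data/raw/custom_finetune_pairs.jsonl`:

```python
import json
import tempfile
from pathlib import Path

rows = [
    {
        "prompt": "Write a function that returns the nth Fibonacci number.",
        "code": "def fib(n):\n    a, b = 0, 1\n    for _ in range(n):\n        a, b = b, a + b\n    return a",
        "language": "python",  # optional field
    },
]

# Demo location; the project file lives at data/raw/custom_finetune_pairs.jsonl.
path = Path(tempfile.mkdtemp()) / "custom_finetune_pairs.jsonl"
with path.open("a", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row, ensure_ascii=False) + "\n")

# Read back and confirm the required fields are present.
lines = path.read_text(encoding="utf-8").strip().splitlines()
parsed = json.loads(lines[-1])
print(sorted(parsed))  # ['code', 'language', 'prompt']
```

Opening in append mode means new cycles of examples accumulate without disturbing earlier rows.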
artifacts/evaluation/component6_eval_results.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:3da6ee747d77b0c8cdca5d4fedb750549a9e5e7c42592e5e32e6103ff5617d8f
+ size 2379
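Each LFS-tracked file in this commit is stored in Git as a small pointer with exactly these three key-value lines; parsing one takes a few lines of Python (the helper name is illustrative):

```python
def parse_lfs_pointer(text):
    """Parse a Git LFS pointer file into a dict of its key/value lines."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    return fields


pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:3da6ee747d77b0c8cdca5d4fedb750549a9e5e7c42592e5e32e6103ff5617d8f
size 2379
"""
info = parse_lfs_pointer(pointer)
print(info["size"])  # 2379
```

The `oid` is the SHA-256 of the real content, and `size` is its byte length, which is why multi-gigabyte checkpoints show up as `+3 -0` diffs here.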
artifacts/evaluation/component7_inference_results.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:ce08bfd6918f619fdcb1ef17ec1db79c2d32578d12a02aaaae7b7092f83384ae
+ size 5863
artifacts/export/component10_benchmark_report.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:d827ec736fbdc4ea2ed5bc196223f1bf02d11a9260acd451edd51f8f39bcda75
+ size 545
artifacts/model/component4_model_summary.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:ab5ebc8aa081f82bbcaee2c945b207b4db3251f63b845ed86055f4e5b7204010
+ size 328
artifacts/tokenizer/code_tokenizer_v1/tokenizer.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:1fe04cc37ac778637cb2cc02a6096412e5d8cada3e4ef3e4a7f2d141fccab8a0
+ size 11475
artifacts/tokenizer/code_tokenizer_v1/tokenizer_config.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:fb0b7af679bac1c29fe7ac9f86c48f1fed5584ba72c9ef2c338f60b63e07bb46
+ size 302
backup_step1000.tar.gz ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:ebe005c43dd59c9c49ad153d41af1bdaaad47c2a21ae231a4c5e90c8005560af
+ size 337623475
backup_step2000.tar.gz ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:861329fb551b4c6406e92e06cfa1faae592f0fe0d0ce713189a57c62b33b0969
+ size 337571785
backup_step3000.tar.gz ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:238c2859ebf4efc0195456a898d2fb8bce0397e39fdf59e9f940963232d628a8
+ size 337762553
checkpoints/component5_420m/latest.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:32d26a7dd9e6e294c6657f6fb3a4d947cf52eb8e1c0b11032722fa50d15c4a21
+ size 5087449970
checkpoints/component5_420m/step_3000.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:e11bded40789574ef316636c02c2fd1e8cd54c13441d8cd6a28980f2209ffaa9
+ size 5087455158
checkpoints/component5_420m/step_3200.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:71d2ea9401f3b08b2528dbb8f993949794d0adb57642d0f4752d74da0e445238
+ size 5087455158
config.py ADDED
@@ -0,0 +1,45 @@
+ from dataclasses import dataclass
+ from pathlib import Path
+
+
+ @dataclass(frozen=True)
+ class Paths:
+     project_root: Path = Path(".")
+     model_dir: Path = Path("./model")
+     data_dir: Path = Path("./data")
+     output_dir: Path = Path("./output")
+     logs_dir: Path = Path("./logs")
+
+     train_jsonl: Path = Path("./data/train.jsonl")
+     dataset_cache_dir: Path = Path("./data/cache")
+     raw_dataset_dir: Path = Path("./data/cache/raw")
+     checkpoint_dir: Path = Path("./output/checkpoints")
+     lora_output_dir: Path = Path("./output/lora_adapters")
+     tokenizer_output_dir: Path = Path("./output/tokenizer")
+
+
+ @dataclass(frozen=True)
+ class DataConfig:
+     max_total_samples: int = 200000
+     max_humaneval_samples: int = 20000
+     max_mbpp_samples: int = 50000
+     max_codesearchnet_samples: int = 180000
+     min_output_chars: int = 40
+
+
+ @dataclass(frozen=True)
+ class TrainingConfig:
+     num_train_epochs: int = 5
+     per_device_train_batch_size: int = 1
+     gradient_accumulation_steps: int = 8
+     learning_rate: float = 1e-5
+     max_length: int = 1024
+     save_steps: int = 250
+     logging_steps: int = 20
+     eval_max_new_tokens: int = 220
+     resume_training: bool = True
+
+
+ PATHS = Paths()
+ DATA_CONFIG = DataConfig()
+ TRAINING_CONFIG = TrainingConfig()
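`config.py` above relies on `@dataclass(frozen=True)` so config values cannot be mutated after import. A minimal self-contained demonstration of that pattern (mirroring, not importing, the real module):

```python
import dataclasses
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Paths:
    # One field from the real config, reproduced for illustration.
    checkpoint_dir: Path = Path("./output/checkpoints")


PATHS = Paths()
print(PATHS.checkpoint_dir)

try:
    PATHS.checkpoint_dir = Path("/tmp")  # frozen instances reject assignment
except dataclasses.FrozenInstanceError as exc:
    error = type(exc).__name__
print(error)  # FrozenInstanceError
```

The failure is loud and immediate, so a stray `PATHS.checkpoint_dir = ...` in training code cannot silently redirect checkpoints.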
configs/component10_export_config.yaml ADDED
@@ -0,0 +1,21 @@
+ # Component 10 export and optimization config
+
+ model:
+   model_config_path: configs/component4_model_config.yaml
+   source_checkpoint_path: checkpoints/component5_420m/step_3200.pt
+   tokenizer_dir: artifacts/tokenizer/code_tokenizer_v1
+
+ quantization:
+   quantized_output_path: models/quantized/model_step3200_int8_state.pt
+
+ benchmark:
+   prompt: Write a Python function to compute factorial of n.
+   max_new_tokens: 120
+
+ package:
+   output_dir: release/MINDI_1.0_420M
+   app_port: 7861
+
+ outputs:
+   benchmark_report_json: artifacts/export/component10_benchmark_report.json
+
configs/component3_dataset_pipeline.yaml ADDED
@@ -0,0 +1,38 @@
+ # Component 3 config: load, clean, deduplicate, tokenize.
+
+ tokenizer_dir: artifacts/tokenizer/code_tokenizer_v1
+ interim_output_dir: data/interim
+ processed_output_dir: data/processed
+ dedupe_db_path: data/interim/dedupe_hashes.sqlite
+
+ # Set null for full run.
+ # Use a small number like 500 for fast smoke testing.
+ max_records_per_dataset: null
+
+ min_prompt_chars: 8
+ min_code_chars: 16
+ max_code_chars: 40000
+ progress_every: 1000
+
+ datasets:
+   - hf_dataset_id: iamtarun/python_code_instructions_18k_alpaca
+     split: train
+     prompt_field: instruction
+     code_field: output
+     language_field: null
+     default_language: python
+
+   - hf_dataset_id: sahil2801/CodeAlpaca-20k
+     split: train
+     prompt_field: instruction
+     code_field: output
+     language_field: null
+     default_language: python
+
+   - hf_dataset_id: TokenBender/code_instructions_122k_alpaca_style
+     split: train
+     prompt_field: instruction
+     code_field: output
+     language_field: null
+     default_language: python
+
configs/component3_incremental_js.yaml ADDED
@@ -0,0 +1,27 @@
+ # Incremental JS augmentation config.
+ # This script appends new JavaScript samples into existing Component 3 outputs.
+
+ tokenizer_dir: artifacts/tokenizer/code_tokenizer_v1
+ existing_clean_path: data/interim/combined_clean.jsonl
+ existing_tokenized_path: data/processed/train_tokenized.jsonl
+ existing_stats_path: data/processed/pipeline_stats.json
+ dedupe_db_path: data/interim/dedupe_hashes_incremental.sqlite
+
+ # Chosen dataset for JS augmentation.
+ new_dataset:
+   hf_dataset_id: philschmid/code-alpaca-ruby-python-javascript
+   split: train
+   prompt_field: instruction
+   code_field: output
+   language_field: null
+   default_language: auto
+
+ # Hard target requested by user.
+ target_new_javascript_examples: 20000
+
+ # Quality filters (same idea as Component 3).
+ min_prompt_chars: 8
+ min_code_chars: 16
+ max_code_chars: 40000
+ progress_every: 500
+
configs/component3_reprocess_from_clean.yaml ADDED
@@ -0,0 +1,19 @@
+ # Reprocess config: no dataset download, no full pipeline rebuild.
+ # It reads existing cleaned data and regenerates tokenized output.
+
+ tokenizer_dir: artifacts/tokenizer/code_tokenizer_v1
+ input_clean_path: data/interim/combined_clean.jsonl
+ output_tokenized_path: data/processed/train_tokenized.jsonl
+ output_stats_path: data/processed/pipeline_stats.json
+
+ # Safety backups before overwrite.
+ backup_existing_tokenized: true
+ backup_existing_stats: true
+
+ # Existing language labels in clean file may be wrong from earlier runs.
+ # true = infer language from prompt+code content only.
+ ignore_existing_language_labels: true
+
+ # Optional quick test mode.
+ # Set null for full reprocess.
+ max_records: null
configs/component4_model_config.yaml ADDED
@@ -0,0 +1,18 @@
+ # Component 4 model config.
+ # You can switch the preset name or directly edit dimensions below.
+
+ preset: medium_420m
+
+ model:
+   vocab_size: 50000
+   max_seq_len: 2048
+   d_model: 1152
+   n_layers: 23
+   n_heads: 16
+   d_ff: 4608
+   dropout: 0.1
+   tie_embeddings: true
+   gradient_checkpointing: false
+   init_std: 0.02
+   rms_norm_eps: 0.00001
+
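Under standard assumptions for this kind of decoder (bias-free attention and feed-forward projections, two RMSNorm weight vectors per layer plus a final norm, and tied input/output embeddings per `tie_embeddings: true`), the dimensions above reproduce the parameter count reported in CONTEXT_SUMMARY.md:

```python
# Dimensions from configs/component4_model_config.yaml.
vocab_size, d_model, n_layers, d_ff = 50000, 1152, 23, 4608

embedding = vocab_size * d_model    # tied with the output head, counted once
attention = 4 * d_model * d_model   # q/k/v/o projections, assumed bias-free
feed_forward = 2 * d_model * d_ff   # up and down projections
norms_per_layer = 2 * d_model       # two RMSNorm weight vectors per block

per_layer = attention + feed_forward + norms_per_layer
total = embedding + n_layers * per_layer + d_model  # + final RMSNorm

print(total)  # 423934848
```

That this lands exactly on 423,934,848 is a useful sanity check that the config, the build script's printed count, and the summary all agree.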
configs/component5_training_config.verify.yaml ADDED
@@ -0,0 +1,32 @@
+ data:
+   tokenized_jsonl_path: data/processed/train_tokenized.jsonl
+   val_ratio: 0.02
+   split_seed: 17
+   num_workers: 0
+ model:
+   model_config_path: configs/component4_model_config.yaml
+ training:
+   output_dir: checkpoints/component5_420m
+   log_every: 1
+   eval_every: 5
+   save_every: 5
+   max_steps: 5
+   micro_batch_size: 1
+   grad_accum_steps: 16
+   max_seq_len: 512
+   learning_rate: 0.0002
+   weight_decay: 0.1
+   betas:
+     - 0.9
+     - 0.95
+   grad_clip_norm: 1.0
+   warmup_steps: 300
+   min_lr_ratio: 0.1
+   use_fp16: true
+   use_gradient_checkpointing: true
+   prefer_8bit_adam: true
+   early_stopping_patience_evals: 20
+   early_stopping_min_delta: 0.0005
+   max_vram_gb: 7.0
+ resume:
+   resume_from: none
configs/component5_training_config.yaml ADDED
@@ -0,0 +1,37 @@
+ # Component 5 training config for RTX 4060 8GB.
+
+ data:
+   tokenized_jsonl_path: data/processed/train_tokenized.jsonl
+   val_ratio: 0.02
+   split_seed: 17
+   num_workers: 2
+
+ model:
+   model_config_path: configs/component4_model_config.yaml
+
+ training:
+   output_dir: checkpoints/component5_420m
+   log_every: 10
+   eval_every: 100
+   save_every: 200
+   max_steps: 8000
+   micro_batch_size: 1
+   grad_accum_steps: 16
+   max_seq_len: 448
+   learning_rate: 0.00022
+   weight_decay: 0.1
+   betas: [0.9, 0.95]
+   grad_clip_norm: 1.0
+   warmup_steps: 300
+   min_lr_ratio: 0.1
+   use_fp16: true
+   use_gradient_checkpointing: true
+   prefer_8bit_adam: true
+   early_stopping_patience_evals: 5
+   early_stopping_min_delta: 0.0005
+   max_vram_gb: 7.0
+
+ resume:
+   resume_from: latest # latest | none | explicit checkpoint path
+
+
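With `micro_batch_size: 1` and `grad_accum_steps: 16`, the optimizer sees an effective batch of 16 sequences per step; at `max_seq_len: 448` that fixes the token budget per optimizer step. Simple arithmetic on the values above:

```python
# Values from the training section of this config.
micro_batch_size = 1
grad_accum_steps = 16
max_seq_len = 448

# Sequences contributing to each optimizer update.
effective_batch = micro_batch_size * grad_accum_steps
# Upper bound on tokens per optimizer step (sequences may be shorter).
tokens_per_step = effective_batch * max_seq_len

print(effective_batch, tokens_per_step)  # 16 7168
```

Keeping the micro batch at 1 and scaling via accumulation is what lets an 8 GB card train a 420M model at all: VRAM cost follows the micro batch, while gradient quality follows the effective batch.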
configs/component6_evaluation_config.yaml ADDED
@@ -0,0 +1,21 @@
+ # Component 6 evaluation config.
+
+ model:
+   model_config_path: configs/component4_model_config.yaml
+   checkpoint_paths:
+     - checkpoints/component5_420m/step_3200.pt
+
+ data:
+   tokenized_jsonl_path: data/processed/train_tokenized.jsonl
+   val_ratio: 0.02
+   split_seed: 17
+
+ inference:
+   max_seq_len: 448
+   max_new_tokens: 160
+   temperature: 0.25
+   top_p: 0.85
+
+ output:
+   results_json: artifacts/evaluation/component6_eval_results.json
+
configs/component7_inference_config.yaml ADDED
@@ -0,0 +1,20 @@
+ # Component 7 inference config
+
+ model:
+   model_config_path: configs/component4_model_config.yaml
+   checkpoint_path: checkpoints/component5_420m/step_3200.pt
+   tokenizer_dir: artifacts/tokenizer/code_tokenizer_v1
+
+ inference:
+   language: python
+   max_new_tokens: 180
+   greedy_temperature: 0.0
+   retry2_temperature: 0.25
+   retry2_top_p: 0.85
+   retry3_temperature: 0.35
+   retry3_top_p: 0.90
+   max_retries: 3
+   min_tokens_before_stop_check: 24
+
+ output:
+   results_json: artifacts/evaluation/component7_inference_results.json
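The retry ladder configured above (greedy first, then two progressively warmer sampled attempts, keeping the first candidate that passes a syntax check) can be sketched as follows. `generate` is a stand-in for the real engine, and using `ast.parse` as the syntax gate is an assumption that matches the "syntax-aware retry" described in the project README:

```python
import ast

# (temperature, top_p) ladder mirroring the config above; None = no top-p.
RETRY_LADDER = [(0.0, None), (0.25, 0.85), (0.35, 0.90)]


def generate_with_retries(generate, prompt):
    """Return the first candidate that parses as valid Python,
    or the last attempt if none do."""
    candidate = ""
    for temperature, top_p in RETRY_LADDER:
        candidate = generate(prompt, temperature=temperature, top_p=top_p)
        try:
            ast.parse(candidate)  # cheap syntax gate, no execution
            return candidate
        except SyntaxError:
            continue
    return candidate


# Fake engine: the greedy output is broken, the first sampled retry is valid.
attempts = iter(["def f(:", "def f(n):\n    return n * 2"])
result = generate_with_retries(lambda p, **kw: next(attempts), "double n")
print(result.splitlines()[0])  # def f(n):
```

Escalating temperature only after a failed parse keeps output deterministic in the common case while giving malformed generations a second chance.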
configs/component8_chat_config.yaml ADDED
@@ -0,0 +1,30 @@
+ # Component 8 chat interface config.
+
+ model:
+   model_config_path: configs/component4_model_config.yaml
+   base_checkpoint_path: checkpoints/component5_420m/step_3200.pt
+   lora_adapter_path: models/lora/custom_lora_v1/best.pt
+   quantized_state_path: models/quantized/model_step3200_int8_state.pt
+   tokenizer_dir: artifacts/tokenizer/code_tokenizer_v1
+
+ lora:
+   r: 8
+   alpha: 16
+   dropout: 0.05
+   target_keywords: [q_proj, k_proj, v_proj, o_proj, fc1, fc2]
+
+ inference:
+   language_default: python
+   max_new_tokens: 300
+   greedy_temperature: 0.0
+   retry2_temperature: 0.25
+   retry2_top_p: 0.85
+   retry3_temperature: 0.35
+   retry3_top_p: 0.90
+   max_retries: 3
+   min_tokens_before_stop_check: 64
+
+ server:
+   host: 127.0.0.1
+   port: 7860
+   share: false
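The `inference` section encodes an escalating sampling ladder: attempt 1 is greedy, attempts 2 and 3 sample with progressively higher temperature and wider top-p. A sketch of how a loader might turn those keys into a schedule (the function name and dict shape are ours, not part of the repo):

```python
def retry_schedule(inference_cfg):
    """Build the per-attempt sampling ladder implied by the config:
    attempt 1 greedy, later attempts increasingly exploratory."""
    ladder = [
        {"temperature": inference_cfg["greedy_temperature"], "top_p": 1.0},
        {"temperature": inference_cfg["retry2_temperature"],
         "top_p": inference_cfg["retry2_top_p"]},
        {"temperature": inference_cfg["retry3_temperature"],
         "top_p": inference_cfg["retry3_top_p"]},
    ]
    return ladder[: inference_cfg["max_retries"]]
```

Note the chat interface raises `min_tokens_before_stop_check` to 64 (versus 24 for batch inference), presumably to avoid cutting conversational replies off too early.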
configs/component9_lora_config.verify.yaml ADDED
@@ -0,0 +1,32 @@
+ model:
+   model_config_path: configs/component4_model_config.yaml
+   base_checkpoint_path: checkpoints/component5_420m/step_3200.pt
+   tokenizer_dir: artifacts/tokenizer/code_tokenizer_v1
+ lora:
+   r: 8
+   alpha: 16
+   dropout: 0.05
+   target_keywords:
+     - q_proj
+     - k_proj
+     - v_proj
+     - o_proj
+     - fc1
+     - fc2
+ finetune:
+   custom_data_path: data/raw/custom_finetune_pairs.jsonl
+   output_dir: models/lora/custom_lora_v1
+   max_seq_len: 512
+   micro_batch_size: 1
+   grad_accum_steps: 16
+   learning_rate: 0.0003
+   weight_decay: 0.0
+   max_steps: 5
+   save_every: 5
+   eval_every: 5
+   early_stopping_patience_evals: 6
+   early_stopping_min_delta: 0.0005
+   use_fp16: true
+   max_vram_gb: 7.0
+ resume:
+   resume_from: none
configs/component9_lora_config.yaml ADDED
@@ -0,0 +1,31 @@
+ # Component 9 LoRA fine-tuning config
+
+ model:
+   model_config_path: configs/component4_model_config.yaml
+   base_checkpoint_path: checkpoints/component5_420m/step_3200.pt
+   tokenizer_dir: artifacts/tokenizer/code_tokenizer_v1
+
+ lora:
+   r: 8
+   alpha: 16
+   dropout: 0.05
+   target_keywords: [q_proj, k_proj, v_proj, o_proj, fc1, fc2]
+
+ finetune:
+   custom_data_path: data/raw/custom_finetune_pairs.jsonl
+   output_dir: models/lora/custom_lora_v1
+   max_seq_len: 512
+   micro_batch_size: 1
+   grad_accum_steps: 16
+   learning_rate: 0.0003
+   weight_decay: 0.0
+   max_steps: 1200
+   save_every: 100
+   eval_every: 100
+   early_stopping_patience_evals: 6
+   early_stopping_min_delta: 0.0005
+   use_fp16: true
+   max_vram_gb: 7.0
+
+ resume:
+   resume_from: none  # none | latest | explicit path
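The `r` and `alpha` fields control the LoRA update: each targeted linear layer keeps its frozen weight `W` and learns a low-rank pair `A` (in_dim x r) and `B` (r x out_dim), so the effective weight is `W + (alpha / r) * A @ B`. A dependency-free sketch of the forward math with toy matrices (the real implementation wraps the model's PyTorch linears and also applies dropout on the low-rank path):

```python
def matmul(X, Y):
    """Plain-Python matrix multiply: X (m x k) times Y (k x n)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)]
            for row in X]

def lora_forward(x, W, A, B, r, alpha):
    """Base path plus scaled low-rank update:
    x @ W + (alpha / r) * (x @ A) @ B, without materializing the merged weight."""
    scale = alpha / r
    base = matmul(x, W)
    delta = matmul(matmul(x, A), B)  # cheap: rank-r bottleneck
    return [[b + scale * d for b, d in zip(brow, drow)]
            for brow, drow in zip(base, delta)]
```

With `r: 8` and `alpha: 16` the scale is 2.0, and only `A` and `B` receive gradients, which is why a 420M-parameter base can be fine-tuned within the 7 GB VRAM budget above.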
data/cache/raw/code_search_net_python/dataset_dict.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:2bf46fe547f16d795abe0d4c8a591bf031d98882d638931d27660455ee986273
+ size 43
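The data files in this commit are Git LFS pointer stubs rather than the Arrow shards themselves: three `key value` lines giving the spec version, the content hash, and the size in bytes of the real object. A minimal sketch of parsing one such pointer (the helper name is ours, not part of the repo):

```python
def parse_lfs_pointer(text):
    """Parse a Git LFS pointer file body into a dict of its key/value lines."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")  # format: "<key> <value>"
        fields[key] = value
    return fields
```

The `size` field is what makes the large shards below (hundreds of MB each) cheap to track in git: the repo stores only these ~130-byte pointers.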
data/cache/raw/code_search_net_python/test/data-00000-of-00001.arrow ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:079bce0f0e2513bae63c12f8699e4ea13ec545c5000844de28dc34a1a9fd19eb
+ size 84367104
data/cache/raw/code_search_net_python/test/dataset_info.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:e8ba7e0c98d4303660c791c0af8da617dce739fcf2be906ee269c6bf572bad9c
+ size 2598
data/cache/raw/code_search_net_python/test/state.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:55d5fecb65147f455bfc8249c3e26fc6a2bd01bfd8bd9f354e86eb7834453d1c
+ size 261
data/cache/raw/code_search_net_python/train/data-00000-of-00004.arrow ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:a5984af399adbfdab06aca7da7638f6a5eb98411b15b88a1f045f346735fbc9c
+ size 377852224
data/cache/raw/code_search_net_python/train/data-00001-of-00004.arrow ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:5a62df607497be1fd23f3e8aa50908bebff6732ccc8b5dacbfaa0efd336ad915
+ size 411927504
data/cache/raw/code_search_net_python/train/data-00002-of-00004.arrow ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:d519b4edb8ae27d8e1ab6474a8decc40f45c6a8e7c409039c865abbc9763f351
+ size 370005344
data/cache/raw/code_search_net_python/train/data-00003-of-00004.arrow ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:b42ae91a5e6e48dd32eac5940429d726f0dbc9440d0262a40a3bfe7a0e2e6214
+ size 400292712
data/cache/raw/code_search_net_python/train/dataset_info.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:e8ba7e0c98d4303660c791c0af8da617dce739fcf2be906ee269c6bf572bad9c
+ size 2598
data/cache/raw/code_search_net_python/train/state.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:180b84fce72622f4113ea103a1fbf79924e61881442db8728b055be042247bcf
+ size 448
data/cache/raw/code_search_net_python/validation/data-00000-of-00001.arrow ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f9f848f9c1dfe1c2cfac25fd1b529e050e29291a5d8042ba1d4f904948142c64
+ size 92180808
data/cache/raw/code_search_net_python/validation/dataset_info.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:e8ba7e0c98d4303660c791c0af8da617dce739fcf2be906ee269c6bf572bad9c
+ size 2598
data/cache/raw/code_search_net_python/validation/state.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:20e5f3cf2d550a3fb9b3d3e43f23f25dfaae9ae3124e43dcf14072f5e3aee182
+ size 267
data/cache/raw/mbpp/dataset_dict.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:eb69d413c1138964f92bd3723baf871db8f40b4cec70586e770e060108a8c612
+ size 53
data/cache/raw/mbpp/prompt/data-00000-of-00001.arrow ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:e14c47c41a23d8003284ac9249a5c5e4da285300f1a56b63593fb2d6237556ff
+ size 6112
data/cache/raw/mbpp/prompt/dataset_info.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:cb63c6a97c4cbbd8e28f0e478687c69ea593cd0d4a3a1f2b4e85c6b5378b776e
+ size 2205