Upload folder using huggingface_hub
- README.md +65 -0
- dataset_dict.json +1 -0
- test/data-00000-of-00001.arrow +3 -0
- test/dataset_info.json +20 -0
- test/state.json +13 -0
- train/data-00000-of-00001.arrow +3 -0
- train/dataset_info.json +20 -0
- train/state.json +13 -0
README.md
ADDED
@@ -0,0 +1,65 @@
# ecocoder-cot-v1: Ecological Chain-of-Thought Dataset

**10 CoT traces** for fine-tuning Nemotron on ecological reasoning and code generation.

## Format

Each trace has three sections:

```
[CONTEXT] {paper abstract + method description}
[REASONING] {step-by-step ecological reasoning}
[CODE] {Python/R implementation}
```

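The three-part layout above can be split back into its sections with a small parser. This is a minimal sketch, not part of the dataset tooling: it assumes each tag appears literally as `[CONTEXT]`, `[REASONING]`, or `[CODE]` and that a section runs until the next tag or the end of the text. The function name `split_trace` and the sample trace are hypothetical.

```python
import re

# Section tags as listed in the Format block.
SECTION_TAGS = ("CONTEXT", "REASONING", "CODE")

# Assumption: tags are written literally as "[TAG]" in the trace text.
_TAG_RE = re.compile(r"\[(CONTEXT|REASONING|CODE)\]")


def split_trace(text: str) -> dict[str, str]:
    """Split one trace into its tagged sections, keyed by tag name."""
    matches = list(_TAG_RE.finditer(text))
    sections = {}
    for i, match in enumerate(matches):
        # A section ends where the next tag starts, or at the end of the text.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        sections[match.group(1)] = text[match.end():end].strip()
    missing = [tag for tag in SECTION_TAGS if tag not in sections]
    if missing:
        raise ValueError(f"trace is missing sections: {missing}")
    return sections


# Made-up trace for illustration only; real traces are much longer.
example = (
    "[CONTEXT] Occurrence records cluster near roads.\n"
    "[REASONING] Thin the records so sampling effort does not dominate.\n"
    "[CODE] print('thinned')"
)
parts = split_trace(example)
print(parts["REASONING"])
```

One limitation of this approach: if the code section itself contains a literal tag such as `[CODE]`, the split will be wrong, so a stricter delimiter may be needed for real traces.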
## Splits

| Split | Traces | Size |
|-------|--------|------|
| train | 8 | ~40 KB |
| test | 2 | ~10 KB |

## Papers Covered

| # | Paper (arXiv ID) | Method | Language / framework |
|---|------------------|--------|----------------------|
| 1 | GLOSSA (2505.05862) | BART Bayesian SDM | R |
| 2 | MaskSDM (2503.13057) | DL + Shapley values | PyTorch |
| 3 | GeoThinneR (2505.07867) | kd-tree thinning | R |
| 4 | HeteroGNN (2503.11900) | Graph Neural Net | PyTorch Geometric |
| 5 | CISO (2508.06704) | Conditional SDM | PyTorch |
| 6 | BioAnalyst (2507.09080) | Foundation Model | PyTorch |
| 7 | MultiScale (2411.04016) | Multi-scale SDM | PyTorch |
| 8 | LD-SDM (2312.08334) | LLM + Taxonomy | PyTorch + HF |
| 9 | PointProcess (2311.06755) | Poisson Process | R/INLA |
| 10 | EntropyBias (2508.02272) | Shannon Entropy | Python + R |

## Intended Use

Fine-tune `nemotron-3-nano-30b-a3b` (32.5B parameters) with Unsloth 4-bit QLoRA on an A100 80 GB GPU.

### Training config

```python
from unsloth import FastLanguageModel

# Load the base model with 4-bit quantized weights for QLoRA fine-tuning.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="nvidia/Nemotron-3-Nano-30B-A3B-ablated",
    max_seq_length=4096,  # maximum tokens per training sequence
    load_in_4bit=True,
)
```

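As a rough sanity check on why 4-bit loading is paired with an 80 GB card, the weight footprint can be estimated from the figures quoted above (32.5B parameters, 4 bits per weight). This is back-of-the-envelope arithmetic only: it ignores quantization constants, LoRA adapter weights, optimizer state, and activations, so real usage will be higher. The helper name `weight_memory_gb` is hypothetical.

```python
# Figures quoted in this README; everything else is an assumption.
PARAMS = 32.5e9        # parameter count as stated above
GPU_MEMORY_GB = 80     # nominal A100 capacity


def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


four_bit = weight_memory_gb(PARAMS, 4)      # 16.25 GB
sixteen_bit = weight_memory_gb(PARAMS, 16)  # 65.0 GB

print(f"4-bit weights:  {four_bit:.2f} GB")
print(f"16-bit weights: {sixteen_bit:.2f} GB")
print(f"headroom after 4-bit weights: {GPU_MEMORY_GB - four_bit:.2f} GB")
```

Under these assumptions the 4-bit weights take about a fifth of the card, while 16-bit weights alone would leave little room for training state.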
## Generation Pipeline

```
Papers (arXiv) → DeepSeek v4 Pro CoT → JSONL → HuggingFace Dataset → Unsloth QLoRA → ecocoder-nemotron
```

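The JSONL stage of the pipeline can be sketched as follows. The column names (`paper_arxiv_id`, `paper_title`, `text`) come from `dataset_info.json` in this commit; how the three sections are joined into `text` is an assumption, as are the helper names and the placeholder record.

```python
import io
import json


def build_record(arxiv_id, title, context, reasoning, code):
    """Assemble one row with the three columns listed in dataset_info.json."""
    # Assumption: sections are joined with newlines in the tagged layout.
    text = f"[CONTEXT] {context}\n[REASONING] {reasoning}\n[CODE] {code}"
    return {"paper_arxiv_id": arxiv_id, "paper_title": title, "text": text}


def write_jsonl(records, fh):
    """Write one JSON object per line."""
    for rec in records:
        fh.write(json.dumps(rec, ensure_ascii=False) + "\n")


def read_jsonl(fh):
    """Read records back, skipping blank lines."""
    return [json.loads(line) for line in fh if line.strip()]


# Placeholder values for illustration; not a real row from the dataset.
record = build_record(
    "0000.00000",
    "Placeholder paper title",
    "abstract and method description",
    "step-by-step reasoning",
    "print('model fit')",
)

buffer = io.StringIO()
write_jsonl([record], buffer)
buffer.seek(0)
round_tripped = read_jsonl(buffer)
print(round_tripped[0]["paper_arxiv_id"])
```

An in-memory buffer stands in for a file here so the sketch runs without touching disk; a real run would open a `.jsonl` file instead.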
## Next: v2 (100 traces)

Scale to 100 papers across six species distribution modeling (SDM) categories: Bayesian methods, deep learning, spatial methods, taxonomic integration, data integration, bias correction.

---

Built with DeepSeek v4 Pro · ecoseek-litdump · alrobles
dataset_dict.json
ADDED
@@ -0,0 +1 @@
{"splits": ["train", "test"]}
test/data-00000-of-00001.arrow
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:0ae7ae563fc9e4b747a5e39eea560567f17aad8d00e89f21ca90bd85351e04a4
size 11424
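The three lines above are a Git LFS pointer, not the Arrow data itself: the pointer records the hash and byte size of the real file. A minimal sketch of parsing such a pointer and checking a downloaded file against it might look like this; the function names are hypothetical, and only the `sha256` hash method is handled.

```python
import hashlib


def parse_lfs_pointer(pointer_text: str) -> dict:
    """Parse the 'key value' lines of a Git LFS pointer file."""
    fields = {}
    for line in pointer_text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(data: bytes, pointer: dict) -> bool:
    """Check that raw bytes have the size and SHA-256 digest the pointer records."""
    if pointer["hash_algo"] != "sha256":
        raise ValueError(f"unsupported hash method: {pointer['hash_algo']}")
    return (
        len(data) == pointer["size"]
        and hashlib.sha256(data).hexdigest() == pointer["digest"]
    )


# Pointer text copied from test/data-00000-of-00001.arrow in this commit.
pointer = parse_lfs_pointer(
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:0ae7ae563fc9e4b747a5e39eea560567f17aad8d00e89f21ca90bd85351e04a4\n"
    "size 11424\n"
)
print(pointer["size"])
```

The recorded size of 11424 bytes is in line with the README's "~10 KB" figure for the test split.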
test/dataset_info.json
ADDED
@@ -0,0 +1,20 @@
{
  "citation": "",
  "description": "",
  "features": {
    "paper_arxiv_id": {
      "dtype": "string",
      "_type": "Value"
    },
    "paper_title": {
      "dtype": "string",
      "_type": "Value"
    },
    "text": {
      "dtype": "string",
      "_type": "Value"
    }
  },
  "homepage": "",
  "license": ""
}
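The `features` block above declares three string columns. A lightweight check of a candidate row against that declaration, using only the standard library, could be sketched like this. It is not how the `datasets` library validates data; the `validate_record` helper and the sample rows are hypothetical, and only the `string` dtype is mapped because it is the only one in this file.

```python
import json

# The "features" block as it appears in dataset_info.json above.
FEATURES = json.loads("""
{
  "paper_arxiv_id": {"dtype": "string", "_type": "Value"},
  "paper_title": {"dtype": "string", "_type": "Value"},
  "text": {"dtype": "string", "_type": "Value"}
}
""")

# Assumption: only "string" needs a Python counterpart for this dataset.
_DTYPE_TO_PYTHON = {"string": str}


def validate_record(record: dict, features: dict) -> list[str]:
    """Return a list of problems; an empty list means the record fits."""
    problems = []
    for name, spec in features.items():
        if name not in record:
            problems.append(f"missing column: {name}")
            continue
        expected = _DTYPE_TO_PYTHON.get(spec["dtype"])
        if expected is not None and not isinstance(record[name], expected):
            problems.append(f"wrong type for {name}: {type(record[name]).__name__}")
    for name in record:
        if name not in features:
            problems.append(f"unexpected column: {name}")
    return problems


# Placeholder rows for illustration only.
good = {"paper_arxiv_id": "0000.00000", "paper_title": "Placeholder", "text": "[CONTEXT] ..."}
bad = {"paper_arxiv_id": 12345, "text": "[CONTEXT] ...", "extra": "x"}

print(validate_record(good, FEATURES))
print(validate_record(bad, FEATURES))
```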
test/state.json
ADDED
@@ -0,0 +1,13 @@
{
  "_data_files": [
    {
      "filename": "data-00000-of-00001.arrow"
    }
  ],
  "_fingerprint": "14fd85a6723f7dc9",
  "_format_columns": null,
  "_format_kwargs": {},
  "_format_type": null,
  "_output_all_columns": false,
  "_split": null
}
train/data-00000-of-00001.arrow
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:30444aa114dc4cc1def8e9299d6d376ab2a4d65c16e152e058ad5581c59d12c4
size 38656
train/dataset_info.json
ADDED
@@ -0,0 +1,20 @@
{
  "citation": "",
  "description": "",
  "features": {
    "paper_arxiv_id": {
      "dtype": "string",
      "_type": "Value"
    },
    "paper_title": {
      "dtype": "string",
      "_type": "Value"
    },
    "text": {
      "dtype": "string",
      "_type": "Value"
    }
  },
  "homepage": "",
  "license": ""
}
train/state.json
ADDED
@@ -0,0 +1,13 @@
{
  "_data_files": [
    {
      "filename": "data-00000-of-00001.arrow"
    }
  ],
  "_fingerprint": "879caba3c8a1488d",
  "_format_columns": null,
  "_format_kwargs": {},
  "_format_type": null,
  "_output_all_columns": false,
  "_split": null
}