# ecocoder-cot-v1 — Ecological Chain-of-Thought Dataset

**10 chain-of-thought (CoT) traces** for fine-tuning Nemotron on ecological reasoning and code generation.

## Format

Each trace has three tagged sections:

```
[CONTEXT] {paper abstract + method description}
[REASONING] {step-by-step ecological reasoning}
[CODE] {Python/R implementation}
```
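
As a minimal sketch of how a trace in this layout could be split back into its parts, the function below looks for the three tags at the start of a line. The function name `parse_trace` and the toy example string are illustrative assumptions, not part of the dataset tooling.

```python
import re

SECTION_TAGS = ("CONTEXT", "REASONING", "CODE")

# A tag at the start of a line, followed by everything up to the next tag or end of text.
_SECTION_RE = re.compile(
    r"^\[(CONTEXT|REASONING|CODE)\][ \t]*(.*?)(?=^\[(?:CONTEXT|REASONING|CODE)\]|\Z)",
    re.DOTALL | re.MULTILINE,
)


def parse_trace(text: str) -> dict:
    """Split one trace into its tagged sections (hypothetical helper)."""
    sections = {tag: body.strip() for tag, body in _SECTION_RE.findall(text)}
    missing = [tag for tag in SECTION_TAGS if tag not in sections]
    if missing:
        raise ValueError(f"trace is missing sections: {missing}")
    return sections


# Toy trace, made up for illustration only.
example = (
    "[CONTEXT] Occurrence records are spatially clustered.\n"
    "[REASONING] Thinning reduces sampling bias.\n"
    '[CODE] print("thin")'
)
sections = parse_trace(example)
```

Raising on a missing tag makes malformed traces visible early, before they reach training.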

## Splits

| Split | Traces | Size |
|-------|--------|------|
| train | 8 | ~40 KB |
| test  | 2 | ~10 KB |
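
This card does not state how traces were assigned to each split. One way to reproduce an 8/2 partition of ten items is a seeded shuffle, sketched below; the helper `split_traces`, the seed, and the use of paper numbers 1 to 10 as stand-ins for traces are all assumptions.

```python
import random


def split_traces(traces, n_test=2, seed=0):
    """Return (train, test) after a seeded shuffle (hypothetical helper)."""
    rng = random.Random(seed)  # local RNG so the global random state is untouched
    shuffled = list(traces)
    rng.shuffle(shuffled)
    return shuffled[n_test:], shuffled[:n_test]


# Paper numbers 1..10 stand in for the ten traces.
train, test = split_traces(range(1, 11))
```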

## Papers Covered

| # | Paper (arXiv ID) | Method | Language / Framework |
|---|-------|--------|------|
| 1 | GLOSSA (2505.05862) | BART Bayesian SDM | R |
| 2 | MaskSDM (2503.13057) | DL + Shapley values | PyTorch |
| 3 | GeoThinneR (2505.07867) | kd-tree thinning | R |
| 4 | HeteroGNN (2503.11900) | Graph Neural Net | PyTorch Geometric |
| 5 | CISO (2508.06704) | Conditional SDM | PyTorch |
| 6 | BioAnalyst (2507.09080) | Foundation Model | PyTorch |
| 7 | MultiScale (2411.04016) | Multi-scale SDM | PyTorch |
| 8 | LD-SDM (2312.08334) | LLM + Taxonomy | PyTorch + HF |
| 9 | PointProcess (2311.06755) | Poisson Process | R/INLA |
| 10 | EntropyBias (2508.02272) | Shannon Entropy | Python + R |

## Intended Use

Fine-tune `nemotron-3-nano-30b-a3b` (32.5B parameters) with Unsloth 4-bit QLoRA on an A100 80GB GPU.

### Training config

```python
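from unsloth import FastLanguageModel

# Load the base model and its tokenizer with 4-bit quantized weights.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="nvidia/Nemotron-3-Nano-30B-A3B-ablated",
    max_seq_length=4096,  # maximum tokens per training sequence
    load_in_4bit=True,    # 4-bit quantization for QLoRA
)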
```

## Generation Pipeline

```
Papers (arXiv) → DeepSeek v4 Pro CoT → JSONL → HuggingFace Dataset → Unsloth QLoRA → ecocoder-nemotron
```
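
The JSONL stage of the pipeline can be sketched with the standard library alone: one JSON object per line, one line per trace. The field names (`paper_id`, `context`, `reasoning`, `code`) and both helper functions are assumptions for illustration; the actual schema is not documented here.

```python
import io
import json


def write_jsonl(records, fh):
    """Write one JSON object per line."""
    for record in records:
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(fh):
    """Read records back, skipping blank lines."""
    return [json.loads(line) for line in fh if line.strip()]


# Placeholder record; the "..." values stand in for real trace text.
records = [
    {"paper_id": "2505.07867", "context": "...", "reasoning": "...", "code": "..."}
]

buffer = io.StringIO()  # in-memory file, so the sketch touches no disk paths
write_jsonl(records, buffer)
buffer.seek(0)
restored = read_jsonl(buffer)
```

A file written this way should be loadable by common dataset tooling that accepts JSON Lines input.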

## Next: v2 (100 traces)

Scale to 100 papers across six species distribution modeling (SDM) categories: Bayesian methods, deep learning, spatial methods, taxonomic integration, data integration, and bias correction.

---

Built with DeepSeek v4 Pro · ecoseek-litdump · alrobles