alrobles committed · Commit d52a3cb · verified · 1 Parent(s): da7aa70

Upload folder using huggingface_hub

README.md ADDED
@@ -0,0 +1,65 @@
+ # ecocoder-cot-v1 — Ecological Chain-of-Thought Dataset
+
+ **10 CoT traces** for fine-tuning Nemotron on ecological reasoning + code generation.
+
+ ## Format
+
+ Each trace has 3 sections:
+
+ ```
+ [CONTEXT] {paper abstract + method description}
+ [REASONING] {step-by-step ecological reasoning}
+ [CODE] {Python/R implementation}
+ ```
+
+ ## Splits
+
+ | Split | Traces | Size |
+ |-------|--------|------|
+ | train | 8 | ~40 KB |
+ | test | 2 | ~10 KB |
+
+ ## Papers Covered
+
+ | # | Paper | Method | Code |
+ |---|-------|--------|------|
+ | 1 | GLOSSA (2505.05862) | BART Bayesian SDM | R |
+ | 2 | MaskSDM (2503.13057) | DL + Shapley values | PyTorch |
+ | 3 | GeoThinneR (2505.07867) | kd-tree thinning | R |
+ | 4 | HeteroGNN (2503.11900) | Graph Neural Net | PyTorch Geometric |
+ | 5 | CISO (2508.06704) | Conditional SDM | PyTorch |
+ | 6 | BioAnalyst (2507.09080) | Foundation Model | PyTorch |
+ | 7 | MultiScale (2411.04016) | Multi-scale SDM | PyTorch |
+ | 8 | LD-SDM (2312.08334) | LLM + Taxonomy | PyTorch + HF |
+ | 9 | PointProcess (2311.06755) | Poisson Process | R/INLA |
+ | 10 | EntropyBias (2508.02272) | Shannon Entropy | Python + R |
+
+ ## Intended Use
+
+ Fine-tune `nemotron-3-nano-30b-a3b` (32.5B) with Unsloth 4-bit QLoRA on A100 80GB.
+
+ ### Training config
+
+ ```python
+ from unsloth import FastLanguageModel
+
+ model, tokenizer = FastLanguageModel.from_pretrained(
+     model_name="nvidia/Nemotron-3-Nano-30B-A3B-ablated",
+     max_seq_length=4096,
+     load_in_4bit=True,
+ )
+ ```
+
+ ## Generation Pipeline
+
+ ```
+ Papers (arXiv) → DeepSeek v4 Pro CoT → JSONL → HuggingFace Dataset → Unsloth QLoRA → ecocoder-nemotron
+ ```
+
+ ## Next: v2 (100 traces)
+
+ Scale to 100 papers across 6 SDM categories: Bayesian methods, deep learning, spatial methods, taxonomic integration, data integration, bias correction.
+
+ ---
+
+ Built with DeepSeek v4 Pro · ecoseek-litdump · alrobles
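
The README's Format section describes each trace as three tagged sections. A minimal sketch of splitting one trace back into those sections might look like the following. It assumes the tags appear as literal `[CONTEXT]`, `[REASONING]`, and `[CODE]` markers, once each; the helper name `split_trace` and the example text are made up for illustration and are not part of the dataset.

```python
import re

# Section markers as shown in the README's Format block.
SECTION_TAGS = ("CONTEXT", "REASONING", "CODE")


def split_trace(text: str) -> dict[str, str]:
    """Split one trace string into its tagged sections.

    Assumption: each tag occurs exactly once as a literal "[TAG]" marker.
    A trace whose code itself contains such a marker would need a
    stricter parser than this sketch.
    """
    pattern = r"\[(%s)\]" % "|".join(SECTION_TAGS)
    parts = re.split(pattern, text)
    # With one capture group, re.split returns
    # [preamble, tag1, body1, tag2, body2, ...].
    sections = {tag: body.strip() for tag, body in zip(parts[1::2], parts[2::2])}
    missing = [tag for tag in SECTION_TAGS if tag not in sections]
    if missing:
        raise ValueError(f"trace is missing sections: {missing}")
    return sections


# Hypothetical trace text, not taken from the dataset.
example = (
    "[CONTEXT] Occurrence records are spatially clustered.\n"
    "[REASONING] Thin the points before fitting a model.\n"
    "[CODE] print('thin points')"
)
sections = split_trace(example)
print(sections["CODE"])
```

Raising on a missing tag, rather than returning a partial dictionary, makes malformed traces visible before they reach a training run.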
dataset_dict.json ADDED
@@ -0,0 +1 @@
+ {"splits": ["train", "test"]}
test/data-00000-of-00001.arrow ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:0ae7ae563fc9e4b747a5e39eea560567f17aad8d00e89f21ca90bd85351e04a4
+ size 11424
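
The three lines above are a Git LFS pointer, not the Arrow data itself: they record the spec version, the SHA-256 of the real file, and its size in bytes. As a rough sketch, a pointer can be parsed and a downloaded blob checked against it as shown below. The function names `parse_lfs_pointer` and `blob_matches` are hypothetical helpers, not part of any Git LFS tooling.

```python
import hashlib


def parse_lfs_pointer(pointer_text: str) -> dict:
    """Parse a Git LFS pointer file into its version, hash, and size fields."""
    fields = {}
    for line in pointer_text.strip().splitlines():
        # Each pointer line is "<key> <value>".
        key, _, value = line.partition(" ")
        fields[key] = value
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }


def blob_matches(data: bytes, info: dict) -> bool:
    """Return True if the bytes have the size and SHA-256 the pointer records."""
    return (
        len(data) == info["size_bytes"]
        and hashlib.sha256(data).hexdigest() == info["digest"]
    )


# Pointer text copied from test/data-00000-of-00001.arrow in this commit.
pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:0ae7ae563fc9e4b747a5e39eea560567f17aad8d00e89f21ca90bd85351e04a4
size 11424
"""
info = parse_lfs_pointer(pointer)
print(info["hash_algo"], info["size_bytes"])
```

Checking the size first is cheap and catches truncated downloads before the hash is computed.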
test/dataset_info.json ADDED
@@ -0,0 +1,20 @@
+ {
+   "citation": "",
+   "description": "",
+   "features": {
+     "paper_arxiv_id": {
+       "dtype": "string",
+       "_type": "Value"
+     },
+     "paper_title": {
+       "dtype": "string",
+       "_type": "Value"
+     },
+     "text": {
+       "dtype": "string",
+       "_type": "Value"
+     }
+   },
+   "homepage": "",
+   "license": ""
+ }
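
The `features` block above declares three string columns per record. One way to sanity-check a record before adding it to the dataset is a small field-and-type check like the sketch below. `check_record` and the sample records are illustrative assumptions; the `datasets` library performs its own type handling, which this does not replace.

```python
import json

# Feature schema copied from test/dataset_info.json in this commit.
FEATURES = json.loads("""
{
  "paper_arxiv_id": {"dtype": "string", "_type": "Value"},
  "paper_title": {"dtype": "string", "_type": "Value"},
  "text": {"dtype": "string", "_type": "Value"}
}
""")

# Only the dtype that appears in this commit is mapped here.
PYTHON_TYPES = {"string": str}


def check_record(record: dict, features: dict = FEATURES) -> list[str]:
    """Return a list of schema problems; an empty list means the record fits."""
    problems = []
    for name, spec in features.items():
        expected = PYTHON_TYPES.get(spec["dtype"])
        if name not in record:
            problems.append(f"missing field: {name}")
        elif expected is not None and not isinstance(record[name], expected):
            problems.append(f"{name}: expected {spec['dtype']}")
    for name in record:
        if name not in features:
            problems.append(f"unexpected field: {name}")
    return problems


# Hypothetical records; the title and text are placeholders.
good = {
    "paper_arxiv_id": "2505.07867",
    "paper_title": "placeholder title",
    "text": "[CONTEXT] ... [REASONING] ... [CODE] ...",
}
bad = {"paper_arxiv_id": 2505.07867, "text": "..."}
print(check_record(good))
print(check_record(bad))
```

The `bad` record shows a common slip: an arXiv identifier stored as a float loses its string form, so the schema keeps it as a string.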
test/state.json ADDED
@@ -0,0 +1,13 @@
+ {
+   "_data_files": [
+     {
+       "filename": "data-00000-of-00001.arrow"
+     }
+   ],
+   "_fingerprint": "14fd85a6723f7dc9",
+   "_format_columns": null,
+   "_format_kwargs": {},
+   "_format_type": null,
+   "_output_all_columns": false,
+   "_split": null
+ }
train/data-00000-of-00001.arrow ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:30444aa114dc4cc1def8e9299d6d376ab2a4d65c16e152e058ad5581c59d12c4
+ size 38656
train/dataset_info.json ADDED
@@ -0,0 +1,20 @@
+ {
+   "citation": "",
+   "description": "",
+   "features": {
+     "paper_arxiv_id": {
+       "dtype": "string",
+       "_type": "Value"
+     },
+     "paper_title": {
+       "dtype": "string",
+       "_type": "Value"
+     },
+     "text": {
+       "dtype": "string",
+       "_type": "Value"
+     }
+   },
+   "homepage": "",
+   "license": ""
+ }
train/state.json ADDED
@@ -0,0 +1,13 @@
+ {
+   "_data_files": [
+     {
+       "filename": "data-00000-of-00001.arrow"
+     }
+   ],
+   "_fingerprint": "879caba3c8a1488d",
+   "_format_columns": null,
+   "_format_kwargs": {},
+   "_format_type": null,
+   "_output_all_columns": false,
+   "_split": null
+ }
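
Taken together, the files in this commit form a small directory tree: `dataset_dict.json` names the splits, and each split directory carries a `state.json` whose `_data_files` entries name the Arrow shards. This looks like the layout that the `datasets` library writes with `save_to_disk`, though the commit itself does not say so. The sketch below walks that metadata with the standard library only; `list_arrow_shards` is a hypothetical helper, and the demo rebuilds the JSON files in a temporary directory rather than reading the real repository.

```python
import json
import tempfile
from pathlib import Path


def list_arrow_shards(root: Path) -> dict[str, list[Path]]:
    """Map each split named in dataset_dict.json to its Arrow shard paths."""
    splits = json.loads((root / "dataset_dict.json").read_text())["splits"]
    shards = {}
    for split in splits:
        state = json.loads((root / split / "state.json").read_text())
        shards[split] = [
            root / split / entry["filename"] for entry in state["_data_files"]
        ]
    return shards


# Demo: recreate only the metadata files from this commit in a temp directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "dataset_dict.json").write_text(json.dumps({"splits": ["train", "test"]}))
    for split in ("train", "test"):
        (root / split).mkdir()
        state = {"_data_files": [{"filename": "data-00000-of-00001.arrow"}]}
        (root / split / "state.json").write_text(json.dumps(state))
    shards = list_arrow_shards(root)
    # Store paths relative to the root so they stay readable after cleanup.
    relative = {
        split: [path.relative_to(root).as_posix() for path in paths]
        for split, paths in shards.items()
    }

print(relative)
```

For actually reading the rows, the `datasets` library's own loader is the natural route; this sketch only resolves which shard files belong to which split.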