zihaojing commited on
Commit
c81de59
·
verified ·
1 Parent(s): a516e1b

Add model card

Browse files
Files changed (1) hide show
  1. README.md +57 -0
README.md ADDED
@@ -0,0 +1,57 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ ---
2
+ license: llama3
3
+ base_model: meta-llama/Llama-3.1-8B-Instruct
4
+ tags:
5
+ - biology
6
+ - protein
7
+ - molecule
8
+ - dna
9
+ - rna
10
+ - multimodal
11
+ - structure-grounded
12
+ ---
13
+
14
+ # Cuttlefish
15
+
16
+ **Cuttlefish** is a unified all-atom multimodal LLM that grounds language reasoning in geometric cues while scaling structural tokens with structural complexity. Built on Llama-3.1-8B-Instruct, it extends the base LLM with a graph encoder and a Scaling-Aware Patching connector for processing proteins, molecules, DNA, and RNA structures.
17
+
18
+ ## Quick start
19
+
20
+ ```python
21
+ from huggingface_hub import snapshot_download
22
+
23
+ # Download model
24
+ local_dir = snapshot_download("zihaojing/Cuttlefish")
25
+
26
+ # Run inference (requires cuttlefish codebase)
27
+ # python src/runner/inference.py --config configs/inference/octopus_8B_s3_v1_5.yaml
28
+ ```
29
+
30
+ ## Input format
31
+
32
+ Cuttlefish accepts a unified parquet schema with structural graph columns:
33
+
34
+ | Field | Description |
35
+ |---|---|
36
+ | `modality` | `"molecule"`, `"protein"`, `"dna"`, or `"rna"` |
37
+ | `node_feat` | Atom/node features (N × d) |
38
+ | `pos` | 3D coordinates in Å (N × 3) |
39
+ | `edge_index` | Spatial graph edges in COO (2 × E) |
40
+ | `messages` | Chat-style instruction with `<STRUCTURE>` token |
41
+
42
+ The `<STRUCTURE>` placeholder in the user message is replaced by the encoded structural tokens at inference time.
43
+
44
+ ## Training details
45
+
46
+ - **Base model**: Llama-3.1-8B-Instruct
47
+ - **Encoder**: [Cuttlefish-Encoder](https://huggingface.co/zihaojing/Cuttlefish-Encoder) (pretrained on all-atom graph data)
48
+ - **SFT data**: [Cuttlefish-SFT-Data](https://huggingface.co/datasets/zihaojing/Cuttlefish-SFT-Data)
49
+ - **Training stages**: 2-stage SFT — connector training then full LLM fine-tuning with LoRA
50
+
51
+ ## Related resources
52
+
53
+ | Resource | Link |
54
+ |---|---|
55
+ | Cuttlefish-Encoder | [zihaojing/Cuttlefish-Encoder](https://huggingface.co/zihaojing/Cuttlefish-Encoder) |
56
+ | SFT instruction data | [zihaojing/Cuttlefish-SFT-Data](https://huggingface.co/datasets/zihaojing/Cuttlefish-SFT-Data) |
57
+ | Encoder pretraining data | [zihaojing/Cuttlefish-Encoder-Data](https://huggingface.co/datasets/zihaojing/Cuttlefish-Encoder-Data) |