---
license: llama3.2
tags:
  - scientific-summarization
  - lora
  - llama
  - sentence-transformers
  - corpus-level
  - research-clusters
language:
  - en
pipeline_tag: text-generation
library_name: peft
base_model: meta-llama/Llama-3.2-1B-Instruct
---

<p align="center">
  <img src="bsg_cyllama_logo.png" alt="BSG CyLlama" width="400"/>
</p>

# BSG CyLlama v2.0.0

**Corpus-level scientific summarization using soft-prompt conditioned language generation.**

BSG CyLlama generates structured summaries of scientific research clusters -- groups of topically related publications. Unlike document-level summarizers, it takes the concatenated abstracts of an entire cluster as input and produces a multi-field summary of the cluster's collective findings.

## Architecture

```
Source Abstracts (concatenated text)
        |
        v
  SBERT Encoder (thenlper/gte-large, 1024-dim)
        |
        v
  Sbert2Prompt (Linear -> LayerNorm -> GELU -> Linear -> LayerNorm)
        |
        v
  16 Soft Prompt Tokens (2048-dim each)
        |
        v
  LoRA-adapted Llama-3.2-1B-Instruct (rank=64, alpha=128)
        |
        v
  4 Structured Output Fields
```
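
The prompt generator maps one 1024-dim GTE embedding to 16 soft-prompt vectors in the Llama hidden space. A quick shape walk-through of that projection, assuming the dimensions shown above (a standalone sketch -- the shipped weights use the `Sbert2Prompt` class in the Usage section):

```python
import torch
import torch.nn as nn

# Dimensions from the diagram above.
sbert_dim, hidden, prompt_len = 1024, 2048, 16

# Same Linear -> LayerNorm -> GELU -> Linear -> LayerNorm stack (dropout omitted here).
proj = nn.Sequential(
    nn.Linear(sbert_dim, hidden * 2),            # 1024 -> 4096
    nn.LayerNorm(hidden * 2),
    nn.GELU(),
    nn.Linear(hidden * 2, hidden * prompt_len),  # 4096 -> 32768
    nn.LayerNorm(hidden * prompt_len),
)

emb = torch.randn(1, sbert_dim)                       # one cluster embedding
soft_prompts = proj(emb).view(1, prompt_len, hidden)  # -> (1, 16, 2048)
print(soft_prompts.shape)  # torch.Size([1, 16, 2048])
```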

## Output Fields

| Field | Label | Description |
|-------|-------|-------------|
| Abstract | `ABSTRACT` | Multi-sentence synthesis of the cluster's research findings |
| Overview | `OVERVIEW` | Concise 2-3 sentence summary of the cluster theme |
| Title | `TITLE` | Descriptive research area title (8-15 words) |
| Headline | `HEADLINE` | Short punchy label (3-7 words) |

## Training

- **Data**: 19,172 scientific research clusters with human-validated and DeepSeek-generated summaries
- **Method**: Format-gated checkpoint selection (format score >= 0.85, then maximize semantic similarity; see the sketch after this list), prompt norm regularization, LoRA freeze at epoch 3
- **Base model**: `meta-llama/Llama-3.2-1B-Instruct`
- **Encoder**: `thenlper/gte-large` (1024-dim sentence embeddings)
- **LoRA**: rank=64, alpha=128, targeting all attention + MLP projections
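
A minimal sketch of the format-gated selection rule, using hypothetical per-checkpoint metric records (field names and values are illustrative, not the actual training logs):

```python
# Hypothetical evaluation records for three checkpoints.
checkpoints = [
    {"path": "ckpt_1", "format_score": 0.82, "semantic_similarity": 0.74},
    {"path": "ckpt_2", "format_score": 0.88, "semantic_similarity": 0.73},
    {"path": "ckpt_3", "format_score": 0.91, "semantic_similarity": 0.76},
]

# Gate on format compliance first, then maximize semantic similarity among survivors.
eligible = [c for c in checkpoints if c["format_score"] >= 0.85]
best = max(eligible, key=lambda c: c["semantic_similarity"])
print(best["path"])  # ckpt_3
```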

## Performance

| Metric | Score |
|--------|-------|
| Semantic Similarity | 0.755 |
| Format Compliance | 0.875 |
| Coherence | 0.994 |
| Composite | 0.863 |

## Usage

```python
import torch
import torch.nn as nn
from sentence_transformers import SentenceTransformer
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
from huggingface_hub import snapshot_download
import json

# Download model files
model_dir = snapshot_download("jimnoneill/BSG_CyLlama")

# Load config
with open(f"{model_dir}/config.json") as f:
    config = json.load(f)

# Load SBERT encoder
sbert = SentenceTransformer(config["sbert_model_name"])

# Load prompt generator (Sbert2Prompt with LayerNorm)
class Sbert2Prompt(nn.Module):
    def __init__(self, sbert_dim, llama_hidden_dim, prompt_length=16):
        super().__init__()
        self.prompt_length = prompt_length
        self.llama_hidden_dim = llama_hidden_dim
        self.projection = nn.Sequential(
            nn.Linear(sbert_dim, llama_hidden_dim * 2),
            nn.LayerNorm(llama_hidden_dim * 2),
            nn.GELU(),
            nn.Dropout(0.1),
            nn.Linear(llama_hidden_dim * 2, llama_hidden_dim * prompt_length),
            nn.LayerNorm(llama_hidden_dim * prompt_length),
        )

    def forward(self, sbert_emb):
        B = sbert_emb.size(0)
        out = self.projection(sbert_emb)
        return out.view(B, self.prompt_length, self.llama_hidden_dim)

device = "cuda" if torch.cuda.is_available() else "cpu"

prompt_gen = Sbert2Prompt(
    config["embedding_dim"],
    config["llama_hidden_dim"],
    config["prompt_length"]
)
prompt_gen.load_state_dict(torch.load(f"{model_dir}/prompt_generator.pt", map_location=device))
prompt_gen = prompt_gen.to(device).eval()

# Load LoRA-adapted LLM
tokenizer = AutoTokenizer.from_pretrained(f"{model_dir}/model")
base_model = AutoModelForCausalLM.from_pretrained(
    config["model_name"], torch_dtype=torch.float16, device_map=device
)
model = PeftModel.from_pretrained(base_model, f"{model_dir}/model")
model.eval()

# Generate summaries for a cluster of abstracts
abstracts = [
    "We studied the role of gut microbiota in inflammatory bowel disease...",
    "Our findings demonstrate that fecal microbiota transplantation can...",
    "Metagenomic analysis revealed significant dysbiosis patterns in..."
]
combined_text = " ".join(abstracts)

# Encode with SBERT
embedding = sbert.encode([combined_text], convert_to_tensor=True).to(device)

# Generate soft prompts
with torch.no_grad():
    soft_prompts = prompt_gen(embedding.float())

# Build generation prompt with theme instruction
theme_instruction = (
    "Provide a comprehensive overview covering key findings, "
    "methodology, significance, and broader context."
)

for label in config["labels"]:
    generation_prompt = (
        f"<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n"
        f"You are a scientific summarization assistant. {theme_instruction}\n"
        f"<|eot_id|><|start_header_id|>user<|end_header_id|>\n"
        f"Summarize the following research cluster.\n"
        f"Source: {combined_text[:2000]}\n"
        f"<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n"
        f"{label}: "
    )

    input_ids = tokenizer(generation_prompt, return_tensors="pt").input_ids.to(device)
    input_embeds = model.get_input_embeddings()(input_ids)

    # Prepend soft prompts
    input_embeds = torch.cat([soft_prompts.half(), input_embeds], dim=1)
    attention_mask = torch.ones(input_embeds.shape[:2], device=device)

    with torch.no_grad():
        outputs = model.generate(
            inputs_embeds=input_embeds,
            attention_mask=attention_mask,
            max_new_tokens=200 if label == "ABSTRACT" else 80,
            temperature=0.7,
            do_sample=True,
            top_p=0.9,
            repetition_penalty=1.15,
        )

    result = tokenizer.decode(outputs[0], skip_special_tokens=True)
    print(f"{label}: {result}")
```

## File Structure

```
BSG_CyLlama/
  bsg_cyllama_logo.png      # Logo
  config.json                # Model configuration
  prompt_generator.pt        # Sbert2Prompt weights (265 MB)
  model/
    adapter_config.json      # LoRA adapter configuration
    adapter_model.safetensors # LoRA weights (173 MB)
    tokenizer.json           # Tokenizer
    tokenizer_config.json    # Tokenizer config
    special_tokens_map.json  # Special tokens
    chat_template.jinja      # Chat template
```
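
For reference, `config.json` carries the fields read by the Usage snippet above. An illustrative example with values inferred from this card (the actual file may contain additional keys):

```json
{
  "model_name": "meta-llama/Llama-3.2-1B-Instruct",
  "sbert_model_name": "thenlper/gte-large",
  "embedding_dim": 1024,
  "llama_hidden_dim": 2048,
  "prompt_length": 16,
  "labels": ["ABSTRACT", "OVERVIEW", "TITLE", "HEADLINE"]
}
```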

## Requirements

```
torch>=2.0
transformers>=4.40
peft>=0.10
sentence-transformers>=2.0
huggingface-hub
```

## License

This model is released under the [Llama 3.2 Community License](https://ai.meta.com/llama/license/).

## Citation

```bibtex
@software{bsg_cyllama_2026,
  title={BSG CyLlama: Corpus-Level Scientific Summarization},
  author={O'Neill, Jim},
  year={2026},
  url={https://huggingface.co/jimnoneill/BSG_CyLlama},
  version={2.0.0}
}
```

## Related

- **Training Data**: [jimnoneill/BSG_CyLlama-training](https://huggingface.co/datasets/jimnoneill/BSG_CyLlama-training)
- **Base Model**: [meta-llama/Llama-3.2-1B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct)
- **Encoder**: [thenlper/gte-large](https://huggingface.co/thenlper/gte-large)