BSG CyLlama

Version 2.0.0

Corpus-level scientific summarization using soft-prompt conditioned language generation.

BSG CyLlama generates structured summaries of scientific research clusters: groups of related publications grouped by topic. Unlike document-level summarizers, it takes the combined abstracts of an entire cluster as input and produces multi-field output that captures the cluster's collective findings.

Architecture

Source Abstracts (concatenated text)
        |
        v
  SBERT Encoder (thenlper/gte-large, 1024-dim)
        |
        v
  Sbert2Prompt (Linear -> LayerNorm -> GELU -> Linear -> LayerNorm)
        |
        v
  16 Soft Prompt Tokens (2048-dim each)
        |
        v
  LoRA-adapted Llama-3.2-1B-Instruct (rank=64, alpha=128)
        |
        v
  4 Structured Output Fields
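
The shape flow above can be sketched numerically. The sketch below is a minimal walk-through using the dimensions from the diagram (1024-dim SBERT embedding in, 16 soft tokens of 2048 dims each out); the layer stack mirrors the Sbert2Prompt definition given under Usage, but runs with random weights rather than the released checkpoint.

```python
import torch
import torch.nn as nn

# Dimensions from the architecture diagram above.
sbert_dim, hidden_dim, n_tokens = 1024, 2048, 16

# Same layer stack as Sbert2Prompt (see Usage), randomly initialized.
projection = nn.Sequential(
    nn.Linear(sbert_dim, hidden_dim * 2),
    nn.LayerNorm(hidden_dim * 2),
    nn.GELU(),
    nn.Dropout(0.1),
    nn.Linear(hidden_dim * 2, hidden_dim * n_tokens),
    nn.LayerNorm(hidden_dim * n_tokens),
)

sbert_emb = torch.randn(1, sbert_dim)  # one cluster embedding
soft_prompts = projection(sbert_emb).view(1, n_tokens, hidden_dim)
print(soft_prompts.shape)  # torch.Size([1, 16, 2048])
```

The flat projection output is simply reshaped into a sequence of 16 prompt vectors, which is what gets prepended to the token embeddings at generation time.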

Output Fields

Field     Label     Description
--------  --------  ------------------------------------------------------------
Abstract  ABSTRACT  Multi-sentence synthesis of the cluster's research findings
Overview  OVERVIEW  Concise 2-3 sentence summary of the cluster theme
Title     TITLE     Descriptive research-area title (8-15 words)
Headline  HEADLINE  Short, punchy label (3-7 words)
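
The model emits each field as a "LABEL: text" continuation (see the generation loop under Usage). A hedged post-processing sketch for collecting the fields into one record is shown below; the parsing rule and helper are assumptions, since the card itself only prints fields one at a time.

```python
import re

# Label set mirrors the table above.
LABELS = ["ABSTRACT", "OVERVIEW", "TITLE", "HEADLINE"]

def parse_field(label: str, generated: str) -> str:
    """Strip an echoed 'LABEL:' prefix and surrounding whitespace, if present."""
    return re.sub(rf"^\s*{label}\s*:\s*", "", generated).strip()

# Illustrative generations (the model may or may not echo the label).
record = {
    label: parse_field(label, text)
    for label, text in [
        ("TITLE", "TITLE: Gut Microbiota Modulation in Inflammatory Bowel Disease"),
        ("HEADLINE", "Microbiome Therapies for IBD"),
    ]
}
print(record["TITLE"])  # Gut Microbiota Modulation in Inflammatory Bowel Disease
```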

Training

  • Data: 19,172 scientific research clusters with human-validated and DeepSeek-generated summaries
  • Method: Format-gated checkpoint selection (format score >= 0.85, then maximize semantic similarity), prompt norm regularization, LoRA freeze at epoch 3
  • Base model: meta-llama/Llama-3.2-1B-Instruct
  • Encoder: thenlper/gte-large (1024-dim sentence embeddings)
  • LoRA: rank=64, alpha=128, targeting all attention + MLP projections
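
The format-gated selection rule above can be sketched as follows: discard checkpoints whose format score falls below 0.85, then pick the survivor with the highest semantic similarity. The checkpoint records here are illustrative, not actual training logs.

```python
FORMAT_GATE = 0.85  # minimum format score, per the training notes above

# Hypothetical per-epoch evaluation results.
checkpoints = [
    {"epoch": 1, "format": 0.72, "semantic": 0.761},
    {"epoch": 2, "format": 0.88, "semantic": 0.748},
    {"epoch": 3, "format": 0.91, "semantic": 0.755},
]

def select_checkpoint(ckpts, gate=FORMAT_GATE):
    # Gate on format compliance first, then maximize semantic similarity.
    eligible = [c for c in ckpts if c["format"] >= gate]
    if not eligible:
        raise ValueError("no checkpoint passed the format gate")
    return max(eligible, key=lambda c: c["semantic"])

best = select_checkpoint(checkpoints)
print(best["epoch"])  # 3
```

Note that epoch 1 has the highest semantic similarity overall but fails the gate; the rule trades a little similarity for reliable output formatting.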

Performance

Metric               Score
-------------------  -----
Semantic Similarity  0.755
Format Compliance    0.875
Coherence            0.994
Composite            0.863

Usage

import json

import torch
import torch.nn as nn
from sentence_transformers import SentenceTransformer
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
from huggingface_hub import snapshot_download

# Download model files
model_dir = snapshot_download("jimnoneill/BSG_CyLlama")

# Load config
with open(f"{model_dir}/config.json") as f:
    config = json.load(f)

# Load SBERT encoder
sbert = SentenceTransformer(config["sbert_model_name"])

# Load prompt generator (Sbert2Prompt with LayerNorm)
class Sbert2Prompt(nn.Module):
    def __init__(self, sbert_dim, llama_hidden_dim, prompt_length=16):
        super().__init__()
        self.prompt_length = prompt_length
        self.llama_hidden_dim = llama_hidden_dim
        self.projection = nn.Sequential(
            nn.Linear(sbert_dim, llama_hidden_dim * 2),
            nn.LayerNorm(llama_hidden_dim * 2),
            nn.GELU(),
            nn.Dropout(0.1),
            nn.Linear(llama_hidden_dim * 2, llama_hidden_dim * prompt_length),
            nn.LayerNorm(llama_hidden_dim * prompt_length),
        )

    def forward(self, sbert_emb):
        B = sbert_emb.size(0)
        out = self.projection(sbert_emb)
        return out.view(B, self.prompt_length, self.llama_hidden_dim)

device = "cuda" if torch.cuda.is_available() else "cpu"

prompt_gen = Sbert2Prompt(
    config["embedding_dim"],
    config["llama_hidden_dim"],
    config["prompt_length"]
)
prompt_gen.load_state_dict(
    torch.load(f"{model_dir}/prompt_generator.pt", map_location=device, weights_only=True)
)
prompt_gen = prompt_gen.to(device).eval()

# Load LoRA-adapted LLM
tokenizer = AutoTokenizer.from_pretrained(f"{model_dir}/model")
base_model = AutoModelForCausalLM.from_pretrained(
    config["model_name"], torch_dtype=torch.float16, device_map=device
)
model = PeftModel.from_pretrained(base_model, f"{model_dir}/model")
model.eval()

# Generate summaries for a cluster of abstracts
abstracts = [
    "We studied the role of gut microbiota in inflammatory bowel disease...",
    "Our findings demonstrate that fecal microbiota transplantation can...",
    "Metagenomic analysis revealed significant dysbiosis patterns in..."
]
combined_text = " ".join(abstracts)

# Encode with SBERT
embedding = sbert.encode([combined_text], convert_to_tensor=True).to(device)

# Generate soft prompts
with torch.no_grad():
    soft_prompts = prompt_gen(embedding.float())

# Build generation prompt with theme instruction
theme_instruction = (
    "Provide a comprehensive overview covering key findings, "
    "methodology, significance, and broader context."
)

for label in config["labels"]:
    generation_prompt = (
        f"<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n"
        f"You are a scientific summarization assistant. {theme_instruction}\n"
        f"<|eot_id|><|start_header_id|>user<|end_header_id|>\n"
        f"Summarize the following research cluster.\n"
        f"Source: {combined_text[:2000]}\n"
        f"<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n"
        f"{label}: "
    )

    # The template already starts with <|begin_of_text|>, so skip the tokenizer's BOS
    input_ids = tokenizer(
        generation_prompt, return_tensors="pt", add_special_tokens=False
    ).input_ids.to(device)
    input_embeds = model.get_input_embeddings()(input_ids)

    # Prepend soft prompts (cast to the embedding dtype so fp16 and fp32 both work)
    input_embeds = torch.cat([soft_prompts.to(input_embeds.dtype), input_embeds], dim=1)
    attention_mask = torch.ones(input_embeds.shape[:2], dtype=torch.long, device=device)

    with torch.no_grad():
        outputs = model.generate(
            inputs_embeds=input_embeds,
            attention_mask=attention_mask,
            max_new_tokens=200 if label == "ABSTRACT" else 80,
            temperature=0.7,
            do_sample=True,
            top_p=0.9,
            repetition_penalty=1.15,
        )

    result = tokenizer.decode(outputs[0], skip_special_tokens=True)
    print(f"{label}: {result}")

File Structure

BSG_CyLlama/
  bsg_cyllama_logo.png      # Logo
  config.json                # Model configuration
  prompt_generator.pt        # Sbert2Prompt weights (265 MB)
  model/
    adapter_config.json      # LoRA adapter configuration
    adapter_model.safetensors # LoRA weights (173 MB)
    tokenizer.json           # Tokenizer
    tokenizer_config.json    # Tokenizer config
    special_tokens_map.json  # Special tokens
    chat_template.jinja      # Chat template
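
The 265 MB figure for prompt_generator.pt can be sanity-checked with a back-of-envelope parameter count for Sbert2Prompt (dims from the Architecture section): two Linear layers plus two LayerNorms. Assuming fp16 storage, this lands near the listed size; the exact on-disk size also depends on the checkpoint's dtype and serialization overhead.

```python
sbert_dim, hidden, n_tokens = 1024, 2048, 16

linear1 = sbert_dim * (hidden * 2) + (hidden * 2)                # weights + bias
ln1 = 2 * (hidden * 2)                                           # gamma + beta
linear2 = (hidden * 2) * (hidden * n_tokens) + hidden * n_tokens # weights + bias
ln2 = 2 * (hidden * n_tokens)                                    # gamma + beta

total = linear1 + ln1 + linear2 + ln2
print(total)              # 138522624 parameters
print(total * 2 / 2**20)  # ~264 MiB at 2 bytes/param (fp16)
```

Almost all of the weight sits in the second Linear layer, which expands 4096 dims to the flattened 16 x 2048 soft-prompt vector.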

Requirements

torch>=2.0
transformers>=4.40
peft>=0.10
sentence-transformers>=2.0
huggingface-hub

License

This model is released under the Llama 3.2 Community License.

Citation

@software{bsg_cyllama_2026,
  title={BSG CyLlama: Corpus-Level Scientific Summarization},
  author={O'Neill, Jim},
  year={2026},
  url={https://huggingface.co/jimnoneill/BSG_CyLlama},
  version={2.0.0}
}
