---
license: llama3.2
base_model: meta-llama/Llama-3.2-1B-Instruct
model_type: peft
library_name: peft
tags:
- biomedical-summary-generation
- cyclical-embeddings
- named-entity-extraction
- corpus-level-summarization
- scientific-summarization
- biomedical
- research
- llama
- lora
- text-generation
- sentence-transformers
datasets:
- jimnoneill/BSG_CyLlama-training
pipeline_tag: text-generation
widget:
- text: "Generate a biomedical summary from this corpus: [Document 1: Deep learning in medical imaging...] [Document 2: Neural networks for drug discovery...] [Named Entities: CNN, pharmaceutical compounds, medical imaging]"
  example_title: "BSG CyLlama Corpus Summarization"
---

<div align="center">

<img src="bsg_cyllama_logo.png" alt="BSG CyLlama Logo" width="200"/>

# BSG CyLlama: Biomedical Summary Generation through Cyclical Llama

**Corpus-level summarization using cyclical embedding averaging with named entity integration**

[Model](https://huggingface.co/jimnoneill/BSG_CyLlama) | [Training Dataset](https://huggingface.co/datasets/jimnoneill/BSG_CyLlama-training) | [License](https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/LICENSE)

</div>

## What is BSG CyLlama?

**BSG CyLlama** stands for **Biomedical Summary Generation through Cyclical Llama**, a novel approach to corpus-level summarization that processes multiple related scientific documents simultaneously.

### The Cyclical Methodology

BSG CyLlama introduces a **cyclical embedding averaging methodology**:

1. **Corpus Input**: Takes a series of related scientific documents
2. **Cyclical Averaging**: Averages embeddings across all documents using cyclical weighting
3. **Named Entity Integration**: Concatenates the averaged embedding with key named entities
4. **Summary Generation**: Uses this combined representation to generate comprehensive summaries

This creates an **approximation embedding document** that captures the collective knowledge of the entire corpus.
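
To make the weighting concrete, here is a minimal sketch of the cyclical weights for an assumed four-document corpus (plain NumPy, following the formula used throughout this card):

```python
import numpy as np

# Cyclical weight for document i of n: w_i = (cos(2*pi*i/n) + 1) / 2
n_docs = 4
weights = [(np.cos(2 * np.pi * i / n_docs) + 1) / 2 for i in range(n_docs)]
print([round(w, 3) for w in weights])  # [1.0, 0.5, 0.0, 0.5]
```

Note that the weight depends on each document's position in the series, so document order matters.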

## Core Methodology: Cyclical Embedding Averaging

### Mathematical Formulation

```python
import numpy as np

def cyclical_embedding_average(corpus_documents):
    """
    BSG CyLlama's core cyclical averaging methodology.

    Assumes `gte_model` is a loaded SentenceTransformer
    (see "Required Components" below).
    """
    # Generate embeddings for each document
    embeddings = [gte_model.encode(doc) for doc in corpus_documents]
    n_docs = len(embeddings)

    # Cyclical averaging with phase weighting
    averaged_embedding = np.zeros_like(embeddings[0])

    for i, embedding in enumerate(embeddings):
        # Cyclical phase weighting, normalized to [0, 1]
        phase = 2 * np.pi * i / n_docs
        cycle_weight = (np.cos(phase) + 1) / 2
        averaged_embedding += embedding * cycle_weight

    # Normalize by corpus size (for n_docs >= 2 the cycle weights
    # themselves sum to n_docs / 2)
    return averaged_embedding / n_docs


def named_entity_concatenation(averaged_embedding, named_entities):
    """
    Concatenate the cyclically averaged embedding with an embedding
    of the key named entities.
    """
    entity_embedding = gte_model.encode(" ".join(named_entities))
    return np.concatenate([averaged_embedding, entity_embedding])
```
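
A minimal usage sketch of the two functions above (the documents and entities here are illustrative placeholders):

```python
from sentence_transformers import SentenceTransformer

gte_model = SentenceTransformer("thenlper/gte-large")

docs = [
    "Deep learning models improve tumor detection in radiology images.",
    "Convolutional networks accelerate candidate screening in drug discovery.",
]
entities = ["CNN", "radiology", "drug discovery"]

averaged = cyclical_embedding_average(docs)                # shape (1024,)
combined = named_entity_concatenation(averaged, entities)  # shape (2048,)
```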

### The BSG CyLlama Process

```python
def bsg_cyclical_summarization(corpus_documents, named_entities):
    """
    Complete BSG CyLlama pipeline (high level).
    """
    # Step 1: Cyclical averaging of corpus embeddings
    averaged_embedding = cyclical_embedding_average(corpus_documents)

    # Step 2: Named entity concatenation
    combined_embedding = named_entity_concatenation(averaged_embedding, named_entities)

    # Step 3: Generate the corpus-level summary from the combined
    # representation (one possible conditioning mechanism is sketched below)
    summary = bsg_cyllama_model.generate(combined_embedding)

    return summary
```
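
This card does not specify how the combined 2048-dimensional embedding enters the language model, so the sketch below shows one plausible mechanism: a learned linear projection that prepends the embedding as a single soft-prompt token via `inputs_embeds`. The `projector` module and `embed_conditioned_generate` helper are hypothetical illustrations, not part of the released checkpoint:

```python
import torch

def embed_conditioned_generate(model, tokenizer, projector, combined_embedding,
                               prompt, max_new_tokens=400):
    """Condition generation on the combined embedding via a soft-prompt token.

    `projector` is a (hypothetical) trained nn.Linear(2048, hidden_size);
    `model` and `tokenizer` are the Llama model and tokenizer loaded below.
    """
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    token_embeds = model.get_input_embeddings()(ids)
    soft_token = projector(
        torch.as_tensor(combined_embedding, dtype=token_embeds.dtype)
    ).view(1, 1, -1)
    # Prepend the soft token, then decode as usual
    inputs_embeds = torch.cat([soft_token, token_embeds], dim=1)
    return model.generate(inputs_embeds=inputs_embeds, max_new_tokens=max_new_tokens)
```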

## Model Architecture & Integration

### Required Components

BSG CyLlama requires both an embedding model and a generation model:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
from sentence_transformers import SentenceTransformer

# Embedding model for cyclical averaging
gte_model = SentenceTransformer("thenlper/gte-large")  # 1024-dim embeddings

# BSG CyLlama generation model
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")
bsg_model = PeftModel.from_pretrained(base_model, "jimnoneill/BSG_CyLlama")
```
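
A quick sanity check on the embedding side; gte-large should produce 1024-dimensional vectors, which is what the concatenation step assumes (the example sentence is arbitrary):

```python
vec = gte_model.encode("CRISPR-based gene editing in oncology")
print(vec.shape)  # expected: (1024,)
```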

### Complete Implementation

```python
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
from sentence_transformers import SentenceTransformer


class BSGCyLlamaProcessor:
    """Implementation of Biomedical Summary Generation through Cyclical Llama."""

    def __init__(self):
        self.gte_model = SentenceTransformer("thenlper/gte-large")
        self.tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")
        base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")
        self.bsg_model = PeftModel.from_pretrained(base_model, "jimnoneill/BSG_CyLlama")

    def cyclical_embedding_average(self, corpus_documents):
        """Core cyclical averaging implementation."""
        embeddings = [self.gte_model.encode(doc) for doc in corpus_documents]
        n_docs = len(embeddings)
        averaged_embedding = np.zeros_like(embeddings[0])

        for i, embedding in enumerate(embeddings):
            phase = 2 * np.pi * i / n_docs
            cycle_weight = (np.cos(phase) + 1) / 2
            averaged_embedding += embedding * cycle_weight

        return averaged_embedding / n_docs

    def generate_corpus_summary(self, corpus_documents, named_entities, max_new_tokens=400):
        """Generate a summary from a corpus using the BSG CyLlama methodology."""
        # Cyclical averaging (computed for embedding-level conditioning;
        # this reference path conditions the prompt on the entities)
        corpus_embedding = self.cyclical_embedding_average(corpus_documents)

        # Named entity integration
        entity_context = ", ".join(named_entities[:20])

        prompt = f"""Based on corpus analysis with entities: {entity_context}

Generate comprehensive biomedical summary:

Summary:"""

        inputs = self.tokenizer.encode(prompt, return_tensors="pt")

        with torch.no_grad():
            outputs = self.bsg_model.generate(
                inputs,
                max_new_tokens=max_new_tokens,
                temperature=0.7,
                do_sample=True,
                top_p=0.9,
                pad_token_id=self.tokenizer.eos_token_id,
            )

        generated_text = self.tokenizer.decode(outputs[0], skip_special_tokens=True)
        summary = generated_text[len(prompt):].strip()

        return {
            'corpus_summary': summary,
            'key_entities': named_entities[:20],
            'num_documents': len(corpus_documents),
            'methodology': 'BSG CyLlama Cyclical Averaging',
        }
```
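
Example usage of the processor (the documents and entities are illustrative placeholders):

```python
processor = BSGCyLlamaProcessor()

result = processor.generate_corpus_summary(
    corpus_documents=[
        "Deep learning improves lesion segmentation in brain MRI.",
        "Graph neural networks predict drug-target interactions.",
    ],
    named_entities=["MRI", "graph neural networks", "drug-target interaction"],
)
print(result["corpus_summary"])
```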

## Training Data

BSG CyLlama was trained on [19,174 clusters of scientific abstracts](https://huggingface.co/datasets/jimnoneill/BSG_CyLlama-training) organized for cyclical corpus summarization:

- **Corpus Groups**: Documents clustered by research theme
- **Cyclical Training**: The model learned to process series of documents
- **Entity Integration**: Training included named-entity concatenation patterns
- **Approximation Learning**: The model was taught to create virtual "meta-documents"

### Training Configuration

- **Base Model**: Llama-3.2-1B-Instruct
- **Fine-tuning**: LoRA (rank 128, alpha 256); see the configuration sketch after this list
- **Embedding Model**: thenlper/gte-large (1024-dim)
- **Specialization**: Cyclical corpus summarization
- **Domain**: Biomedical and scientific literature
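
For reference, a hedged `peft` configuration sketch matching the rank and alpha above; `target_modules` and `lora_dropout` are assumed typical values, not confirmed by this card:

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=128,                # LoRA rank, from the card
    lora_alpha=256,       # LoRA alpha, from the card
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption
    lora_dropout=0.05,    # assumption
    bias="none",
    task_type="CAUSAL_LM",
)
```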

## Applications

### Corpus-Level Analysis

- **Literature Reviews**: Synthesize findings across multiple papers
- **Research Clustering**: Generate summaries for document clusters
- **Knowledge Synthesis**: Create meta-analyses from paper collections
- **Clinical Research**: Summarize multiple clinical studies
- **Drug Discovery**: Synthesize compound research across publications

### Advantages

- **Corpus Understanding**: Goes beyond single-document limitations
- **Balanced Representation**: Cyclical averaging applies smooth, position-based weights across the corpus
- **Entity Preservation**: Named-entity integration maintains domain terminology
- **Single-Pass Processing**: No retrieval overhead

## Getting Started

```bash
# Install dependencies
pip install torch transformers peft sentence-transformers

# Run the demo
python bsg_cyllama_demo.py
```

## Citation

```bibtex
@misc{bsg-cyllama-2025,
  title={BSG CyLlama: Biomedical Summary Generation through Cyclical Llama},
  author={Jamey O'Neill},
  year={2025},
  url={https://huggingface.co/jimnoneill/BSG_CyLlama},
  note={Novel cyclical embedding averaging methodology for corpus-level summarization}
}
```

## Resources

- **Model Repository**: [jimnoneill/BSG_CyLlama](https://huggingface.co/jimnoneill/BSG_CyLlama)
- **Training Dataset**: [jimnoneill/BSG_CyLlama-training](https://huggingface.co/datasets/jimnoneill/BSG_CyLlama-training)
- **Demo Script**: `bsg_cyllama_demo.py`
- **Setup Guide**: `SETUP_GUIDE.md`

---

<div align="center">

**Open source corpus-level summarization through cyclical embedding innovation**

</div>