---
license: llama3.2
base_model: meta-llama/Llama-3.2-1B-Instruct
model_type: peft
library_name: peft
tags:
- biomedical-summary-generation
- cyclical-embeddings
- named-entity-extraction
- corpus-level-summarization
- scientific-summarization
- biomedical
- research
- llama
- lora
- text-generation
- sentence-transformers
datasets:
- jimnoneill/BSG_CyLlama-training
pipeline_tag: text-generation
widget:
- text: "Generate a biomedical summary from this corpus: [Document 1: Deep learning in medical imaging...] [Document 2: Neural networks for drug discovery...] [Named Entities: CNN, pharmaceutical compounds, medical imaging]"
  example_title: "BSG CyLlama Corpus Summarization"
---
# BSG CyLlama: Biomedical Summary Generation through Cyclical Llama

**Corpus-level summarization using cyclical embedding averaging with named entity integration**

[![Model](https://img.shields.io/badge/🤗%20Hugging%20Face-Model-blue)](https://huggingface.co/jimnoneill/BSG_CyLlama)
[![Dataset](https://img.shields.io/badge/🤗%20Hugging%20Face-Dataset-green)](https://huggingface.co/datasets/jimnoneill/BSG_CyLlama-training)
[![License](https://img.shields.io/badge/License-Llama%203.2-red)](https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/LICENSE)
## What is BSG CyLlama?

**BSG CyLlama** stands for **Biomedical Summary Generation through Cyclical Llama**, a novel approach to corpus-level summarization that processes multiple related scientific documents simultaneously.

### 🔄 **The Cyclical Methodology**

BSG CyLlama introduces a **cyclical embedding averaging methodology**:

1. **📚 Corpus Input**: Takes a series of related scientific documents
2. **🔄 Cyclical Averaging**: Averages embeddings across all documents using cyclical weighting
3. **🏷️ Named Entity Integration**: Concatenates the averaged embeddings with key named entities
4. **📝 Summary Generation**: Uses this combined representation to generate comprehensive summaries

This creates an **approximation embedding document** that captures the collective knowledge of the entire corpus.

## 🧬 **Core Methodology: Cyclical Embedding Averaging**

### Mathematical Formulation

For a corpus of $n$ documents with embeddings $e_0, \dots, e_{n-1}$, each document receives a cyclical phase weight $w_i = \tfrac{1}{2}\left(\cos(2\pi i/n) + 1\right)$, and the corpus representation is $\bar{e} = \tfrac{1}{n} \sum_{i=0}^{n-1} w_i e_i$:

```python
import numpy as np

def cyclical_embedding_average(corpus_documents):
    """
    BSG CyLlama's core cyclical averaging methodology
    """
    # Generate embeddings for each document (gte_model is defined below)
    embeddings = [gte_model.encode(doc) for doc in corpus_documents]
    n_docs = len(embeddings)

    # Cyclical averaging with phase weighting
    averaged_embedding = np.zeros_like(embeddings[0])
    for i, embedding in enumerate(embeddings):
        # Cyclical phase weighting
        phase = 2 * np.pi * i / n_docs
        cycle_weight = (np.cos(phase) + 1) / 2  # Normalize to [0,1]
        averaged_embedding += embedding * cycle_weight

    return averaged_embedding / n_docs

def named_entity_concatenation(averaged_embedding, named_entities):
    """
    Concatenate cyclically averaged embeddings with named entities
    """
    entity_embedding = gte_model.encode(" ".join(named_entities))
    return np.concatenate([averaged_embedding, entity_embedding])
```

### The BSG CyLlama Process

```python
def bsg_cyclical_summarization(corpus_documents, named_entities):
    """
    Complete BSG CyLlama pipeline
    """
    # Step 1: Cyclical averaging of corpus embeddings
    averaged_embedding = cyclical_embedding_average(corpus_documents)

    # Step 2: Named entity concatenation
    combined_embedding = named_entity_concatenation(averaged_embedding, named_entities)

    # Step 3: Generate corpus-level summary
    # (conceptual: decoding conditioned on the combined representation;
    # see the runnable class below for a concrete prompt-based variant)
    summary = bsg_cyllama_model.generate(combined_embedding)
    return summary
```

## 🔬 **Model Architecture & Integration**

### Required Components

BSG CyLlama requires both an embedding model and a generation model:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
from sentence_transformers import SentenceTransformer

# Embedding model for cyclical averaging
gte_model = SentenceTransformer("thenlper/gte-large")  # 1024-dim embeddings

# BSG CyLlama generation model
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")
bsg_model = PeftModel.from_pretrained(base_model, "jimnoneill/BSG_CyLlama")
```

### Complete Implementation

```python
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
from sentence_transformers import SentenceTransformer

class BSGCyLlamaProcessor:
    """Implementation of Biomedical Summary Generation through Cyclical Llama"""

    def __init__(self):
        self.gte_model = SentenceTransformer("thenlper/gte-large")
        self.tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")
        base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")
        self.bsg_model = PeftModel.from_pretrained(base_model, "jimnoneill/BSG_CyLlama")

    def cyclical_embedding_average(self, corpus_documents):
        """Core cyclical averaging implementation"""
        embeddings = [self.gte_model.encode(doc) for doc in corpus_documents]
        n_docs = len(embeddings)
        averaged_embedding = np.zeros_like(embeddings[0])
        for i, embedding in enumerate(embeddings):
            phase = 2 * np.pi * i / n_docs
            cycle_weight = (np.cos(phase) + 1) / 2
            averaged_embedding += embedding * cycle_weight
        return averaged_embedding / n_docs

    def generate_corpus_summary(self, corpus_documents, named_entities, max_length=400):
        """Generate summary from corpus using BSG CyLlama methodology"""
        # Cyclical averaging of the corpus (corpus-level representation;
        # generation below conditions on the entity prompt)
        corpus_embedding = self.cyclical_embedding_average(corpus_documents)

        # Named entity integration via the prompt
        entity_context = ", ".join(named_entities[:20])
        prompt = f"""Based on corpus analysis with entities: {entity_context}

Generate comprehensive biomedical summary:

Summary:"""

        inputs = self.tokenizer.encode(prompt, return_tensors="pt")
        with torch.no_grad():
            outputs = self.bsg_model.generate(
                inputs,
                max_length=len(inputs[0]) + max_length,
                temperature=0.7,
                do_sample=True,
                top_p=0.9,
                pad_token_id=self.tokenizer.eos_token_id
            )

        generated_text = self.tokenizer.decode(outputs[0], skip_special_tokens=True)
        summary = generated_text[len(prompt):].strip()

        return {
            'corpus_summary': summary,
            'key_entities': named_entities[:20],
            'num_documents': len(corpus_documents),
            'methodology': 'BSG CyLlama Cyclical Averaging'
        }
```
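A minimal usage sketch of the class above; the corpus documents and entity list here are illustrative placeholders, not model outputs:

```python
# Instantiate the processor (downloads the GTE and Llama models on first run)
processor = BSGCyLlamaProcessor()

# Hypothetical related documents and named entities for illustration
corpus = [
    "Deep learning models improve lesion detection in medical imaging...",
    "Convolutional neural networks accelerate drug discovery pipelines...",
    "Transformer architectures enable multi-modal biomedical analysis...",
]
entities = ["CNN", "medical imaging", "drug discovery"]

result = processor.generate_corpus_summary(corpus, entities)
print(result['corpus_summary'])
print(result['key_entities'], result['num_documents'])
```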
## 📊 **Training Data**

BSG CyLlama was trained on [19,174 clusters of scientific abstracts](https://huggingface.co/datasets/jimnoneill/BSG_CyLlama-training) organized for cyclical corpus summarization:

- **Corpus Groups**: Documents clustered by research themes
- **Cyclical Training**: Model learned to process document series
- **Entity Integration**: Training included named entity concatenation patterns
- **Approximation Learning**: Taught to create virtual "meta-documents"

### Training Configuration

- **Base Model**: Llama-3.2-1B-Instruct
- **Fine-tuning**: LoRA (rank 128, alpha 256)
- **Embedding Model**: thenlper/gte-large (1024d)
- **Specialization**: Cyclical corpus summarization
- **Domain**: Biomedical and scientific literature
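For reference, a `peft` LoRA configuration matching the documented rank and alpha might look like the sketch below; the target modules, dropout, and task type are assumptions, not recorded training settings:

```python
from peft import LoraConfig

# Hypothetical reconstruction of the fine-tuning config; r and lora_alpha
# come from the table above, the remaining fields are assumptions.
lora_config = LoraConfig(
    r=128,            # LoRA rank (documented above)
    lora_alpha=256,   # LoRA alpha (documented above)
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed
    lora_dropout=0.05,  # assumed
    bias="none",
    task_type="CAUSAL_LM",
)
```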
## 🔄 **Applications**

### Corpus-Level Analysis

- 🔬 **Literature Reviews**: Synthesize findings across multiple papers
- 🧬 **Research Clustering**: Generate summaries for document clusters
- 📚 **Knowledge Synthesis**: Create meta-analyses from paper collections
- 🏥 **Clinical Research**: Summarize multiple clinical studies
- 💊 **Drug Discovery**: Synthesize compound research across publications

### Advantages

- **📈 Corpus Understanding**: Goes beyond single-document limitations
- **🔄 Balanced Representation**: Cyclical averaging ensures fair weighting
- **🏷️ Entity Preservation**: Named entity integration maintains terminology
- **⚡ Single-Pass Processing**: No retrieval overhead

## 🎯 **Getting Started**

```bash
# Install dependencies
pip install torch transformers peft sentence-transformers

# Run the demo
python bsg_cyllama_demo.py
```

## 📚 **Citation**

```bibtex
@misc{bsg-cyllama-2025,
  title={BSG CyLlama: Biomedical Summary Generation through Cyclical Llama},
  author={O'Neill, Jamey},
  year={2025},
  url={https://huggingface.co/jimnoneill/BSG_CyLlama},
  note={Novel cyclical embedding averaging methodology for corpus-level summarization}
}
```

## 🔗 **Resources**

- **🤖 Model Repository**: [jimnoneill/BSG_CyLlama](https://huggingface.co/jimnoneill/BSG_CyLlama)
- **📊 Training Dataset**: [jimnoneill/BSG_CyLlama-training](https://huggingface.co/datasets/jimnoneill/BSG_CyLlama-training)
- **📋 Demo Script**: `bsg_cyllama_demo.py`
- **📖 Setup Guide**: `SETUP_GUIDE.md`

---

**🔄 Open-source corpus-level summarization through cyclical embedding innovation** 🚀