---
library_name: transformers
tags: []
---

# Model Description

**Comp4Cls** is a retrieval-augmented classification framework that uses **entity-centric semantic compression** to turn long scientific/technical documents into short, task-focused representations for both retrieval and labeling. Documents (papers, patents, and R&D reports) are first compressed into structured summaries that preserve discriminative signals (e.g., core concepts, methods, problems, findings), embedded, and stored in a vector DB. At inference, a query is compressed the same way, nearest neighbors are retrieved, and a small LLM assigns the final class label using the compressed evidence.

The end-to-end workflow (**Phase 1: compression + indexing**, **Phase 2: retrieval + classification**) is illustrated in the framework diagram on *page 2* of the paper. Experiments on a large bilingual corpus with hierarchical, multi-label taxonomies show that a **4B-scale** Comp4Cls matches or outperforms **8B–14B** models, especially in fine-grained categories, while cutting token usage and compute. Moderate compression (often **~20% of entities**) preserves retrieval fidelity and boosts downstream F1, enabling lightweight, low-latency deployment in production pipelines. In the paper, see *Table II on page 8* (compression vs. length), *Figure 6 on page 9* (retrieval quality under compression), and *Figure 7 on page 10* (accuracy vs. larger LLMs).

<h2>Framework Diagram</h2>

<p align="center">
  <img src="comp4cls_framework.jpg" width="720" alt="Comp4Cls framework diagram">
  <br>
  <em>Figure 1. Overview of the <strong>Comp4Cls</strong> framework. The system operates in two phases: (i) documents with predefined class labels are semantically compressed, embedded, and stored in a vector database; (ii) when a new query arrives, it is compressed and used to retrieve the top-k most similar documents from the vector store. The large language model (LLM) then determines the final class label based on the retrieved context. Finally, the compressed query and its assigned label are stored back into the database, enabling downstream services such as document categorization, semantic search, and TL;DR summarization.</em>
</p>



# Key Features

* **Entity-centric Semantic Compression**
  Two-stage prompting (entity extraction → selective rewriting) produces concise, structured summaries that retain label-relevant semantics while removing redundancy. The compressor exposes an explicit **compression ratio** to match accuracy/latency budgets.

* **Retrieval-Augmented Classification (RAG-style) with Short Contexts**
  Operates on compressed texts for both the query and neighbors, reducing context length and enabling **broader top-k** without “lost-in-the-middle” degradation.

* **Small-Model, Big-Model Performance**
  With **~20% compression**, a **4B** backbone achieves or exceeds the accuracy of **8B–14B** models across domains and taxonomy levels.

* **Measured Efficiency Gains**
  Compression reduces input tokens by **~50%** on average while maintaining semantic similarity; retrieval accuracy remains near full-text levels.

* **Scales to Real-World, Heterogeneous Corpora**
  Trained/evaluated on large bilingual datasets spanning **papers, patents, and R&D reports** with hierarchical, multi-label taxonomies; robust under domain shift and taxonomy changes. 

* **Production-minded Latency/Throughput**
  Shorter prompts cut classification-stage latency; compression allows higher **top-k (≈20–30)** before context saturation.

* **Vector DB-Ready Artifacts**
  Outputs compressed texts + embeddings that plug into standard ANN indices (e.g., HNSW) for high-throughput retrieval in enterprise knowledge systems.

* **Beyond Classification**
  The compressed representations support downstream **semantic search**, **TL;DR summaries**, and **knowledge organization** tasks out of the box. 



# Comp4Cls — Full Usage Guide w/ vLLM

This guide shows how to run **all three stages** of Comp4Cls with vLLM:
1) **Entity Extraction** → 2) **Compression** → 3) **Classification**.

It uses the **exact prompt templates** for each stage and a minimal vLLM wrapper.
Replace the model name with your fine-tuned repo if needed.

---

## 0) Install & Setup

```bash
pip install vllm "transformers>=4.44" accelerate einops huggingface-hub
```

---

## 1) Minimal Inference Primitives

```python
import os, re, json
from typing import Optional, List, Dict

from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

# ----------------------
# Config
# ----------------------
MODEL_NAME = "comp4cls/comp4cls-4B"  

# Generation params (Stage-3 uses stop at </answer>)
GEN_COMMON = SamplingParams(
    temperature=0.2,
    top_p=0.8,
    repetition_penalty=1.1,
    frequency_penalty=0.1,
    presence_penalty=0.1,
    max_tokens=2048,
)

# Stage-3 params: identical to GEN_COMMON plus a stop string at </answer>.
# By default vLLM leaves the matched stop string out of the returned text,
# which is why the Stage-3 code re-appends </answer> before parsing.
GEN_CLASSIFICATION = SamplingParams(
    temperature=GEN_COMMON.temperature,
    top_p=GEN_COMMON.top_p,
    repetition_penalty=GEN_COMMON.repetition_penalty,
    frequency_penalty=GEN_COMMON.frequency_penalty,
    presence_penalty=GEN_COMMON.presence_penalty,
    max_tokens=GEN_COMMON.max_tokens,
    stop=["</answer>"],
)

# ----------------------
# Load tokenizer & model
# ----------------------
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)
llm = LLM(
    model=MODEL_NAME,
    trust_remote_code=True,
    tensor_parallel_size=1,
    gpu_memory_utilization=0.95,
    max_model_len=30000,
    max_num_seqs=64,
)

# ----------------------
# Helpers
# ----------------------
def apply_chat_template(prompt: str, enable_thinking: bool=False) -> str:
    """Wrap raw prompt with the model's chat template."""
    messages = [{"role": "user", "content": prompt}]
    return tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
        enable_thinking=enable_thinking,
    )

def generate_text(prompt: str, params: SamplingParams) -> str:
    """Single-pass generation with vLLM."""
    formatted = apply_chat_template(prompt, enable_thinking=False)
    out = llm.generate([formatted], params)
    text = out[0].outputs[0].text
    return text

def parse_json_object(text: str) -> dict:
    """Parse the span from the first '{' to the last '}' in text as a JSON object."""
    start = text.find("{")
    end = text.rfind("}") + 1
    if start == -1 or end == 0:
        raise ValueError("No JSON object detected in model output.")
    return json.loads(text[start:end])

def parse_answer_ids(text: str) -> Optional[List[Dict[str, int]]]:
    """Extract class IDs from <answer> ... </answer> block: [{'class_id': 123}, ...]."""
    try:
        m = re.search(r'<answer>(.*?)</answer>', text, re.DOTALL)
        if not m:
            return None
        body = m.group(1).strip()
        if body.lower() == "none":
            return []
        body = body.strip().strip('[]')
        classes = []
        for mm in re.finditer(r'\((\d+)\)', body):
            classes.append({"class_id": int(mm.group(1))})
        if not classes and body:
            parts = [x.strip() for x in body.split(",")]
            for p in parts:
                if p.isdigit():
                    classes.append({"class_id": int(p)})
        return classes if classes else []
    except Exception:
        return None
```
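
Because `parse_json_object` parses everything between the first `{` and the last `}`, output that contains extra braces after the JSON object will fail to parse. As a hedged alternative, the sketch below scans for the first position at which the standard library's `json.JSONDecoder.raw_decode` succeeds. The name `extract_first_json` is hypothetical and not part of the released code.

```python
import json


def extract_first_json(text: str) -> dict:
    """Return the first decodable JSON object found in `text`."""
    decoder = json.JSONDecoder()
    pos = text.find("{")
    while pos != -1:
        try:
            # raw_decode parses one JSON value starting at `pos` and
            # ignores whatever follows it in the string.
            obj, _end = decoder.raw_decode(text, pos)
            return obj
        except json.JSONDecodeError:
            # Not valid JSON from this brace; try the next one.
            pos = text.find("{", pos + 1)
    raise ValueError("No JSON object detected in model output.")


print(extract_first_json('Sure! {"response": "ok"} (see also {notes})'))
```

It can be swapped in for `parse_json_object` in the stages below if the model occasionally appends text after the JSON object.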

---

## 2) Stage 1 — Entity Extraction

**Prompt (use verbatim):**

```python
prompt_template_entity_extraction = """You are tasked with extracting keywords from scientific literature abstracts based on their domain classification. 
Extract keywords that appear EXACTLY in the given abstract and organize them into 7 predefined keyword types. 
Instructions: 
1. Read the provided abstract and domain classification carefully 
2. Extract keywords/phrases that appear verbatim in the abstract 
3. Organize each keyword into the most appropriate keyword type 
4. Each keyword should be assigned to only one type 
5. Focus on meaningful technical terms, not common words 
6. Return results in JSON format 
Keyword Types for Organization: 
1. core_concepts: Central theories, main ideas, or fundamental concepts that define the research 
2. methodologies: Research methods, experimental techniques, analytical approaches, or procedural strategies 
3. subjects_problems: Research subjects, target problems, phenomena under investigation, or challenges being addressed 
4. findings_impacts: Key discoveries, results, outcomes, implications, or impacts of the research 
5. theoretical_framework: Underlying theories, models, principles, or conceptual foundations 
6. quantitative_metrics: Numerical values, measurements, statistics, percentages, or any quantifiable data 
7. contextual_background: Historical context, motivation, prior work references, or situational background 
Guidelines: 
- Extract only words/phrases that exist exactly in the abstract 
- Prefer technical terms over generic academic vocabulary 
- Include both single words and meaningful phrases 
- For quantitative metrics, include the complete value with units 
- Ensure keywords are relevant to the domain classification Output must be in JSON format with all 7 keyword types as keys. 
Example output format: {{ "core_concepts": ["CEST MRI", "thermally activated delayed fluorescence", "blue phosphorescent organic light-emitting diodes"], "methodologies": ["synthesized", "subspace-based spectral signal decomposition", "sphere formation assay"], "subjects_problems": ["z-spectrum analysis", "cancer stem cells", "charge balance"], "findings_impacts": ["high quantum efficiency", "inhibits mobility", "record high"], "theoretical_framework": ["saturation transfer phenomena", "energy transfer", "structure-property relationship"], "quantitative_metrics": ["Above 30%", "24.2%", "70-110 GHz", "40-80 μM"], "contextual_background": ["drug resistance", "alternative to conventional", "for molecular MRI"] }} 
Extract keywords from the following scientific literature: 
Abstract: {abstract}
Return the keywords organized by their types in JSON format with all 7 keyword types.
"""
```

```python
# Example input (replace with your real abstract)
abstract = "We present a novel lithium-sulfur battery cathode design using porous carbon hosts..."

entity_prompt = prompt_template_entity_extraction.format(abstract=abstract)
entity_output = generate_text(entity_prompt, GEN_COMMON)
entities = parse_json_object(entity_output)  # dict with 7 keys
print(json.dumps(entities, indent=2, ensure_ascii=False))
```
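
The Stage-1 prompt asks for keywords that appear verbatim in the abstract, but nothing in the code above enforces that. A minimal post-check, assuming case-insensitive substring matching is acceptable, could look like the following. The helper name `keep_verbatim_entities` is hypothetical.

```python
from typing import Dict, List


def keep_verbatim_entities(entities: Dict[str, List[str]], abstract: str) -> Dict[str, List[str]]:
    """Drop extracted keywords that do not occur in the abstract.

    Matching is case-insensitive (an assumption); duplicates within a
    category are removed while preserving the original order.
    """
    haystack = abstract.casefold()
    cleaned: Dict[str, List[str]] = {}
    for category, items in entities.items():
        seen = set()
        kept = []
        for item in items:
            key = item.strip().casefold()
            if key and key in haystack and key not in seen:
                seen.add(key)
                kept.append(item.strip())
        cleaned[category] = kept
    return cleaned


demo_abstract = "We present a novel lithium-sulfur battery cathode design using porous carbon hosts."
demo_entities = {
    "core_concepts": ["lithium-sulfur battery", "solid electrolyte"],
    "methodologies": ["porous carbon hosts"],
}
print(keep_verbatim_entities(demo_entities, demo_abstract))
```

Running this filter before Stage 2 keeps hallucinated keywords out of the item count passed to the compression prompt.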

---

## 3) Stage 2 — Compression

**Prompt (use verbatim):**

```python
prompt_template_compression = """You are a scientific document summarizer specializing in category-driven summarization.

Task: Create a concise summary using ONLY {max_items} items from the provided semantic categories (out of {total_items} total items).

Requirements:
- Write the summary in the same language as the original text
- Select the {max_items} most relevant items that align with the original text
- Use content from the original text ONLY when it directly supports these categories
- The summary should read as if the original text was written to illustrate the semantic categories
- Maintain scientific accuracy and use precise terminology
- Ensure logical flow and coherence between concepts

Input:
- Original Text: {text}
- Semantic Categories (in order of priority): {categories}

CRITICAL: You MUST output ONLY a valid JSON object in exactly this format:
{{"response": "Your concise summary here"}}

Do not include any text before or after the JSON object. The summary should be a single continuous text without line breaks.

Output Format (example):
{{"response": "This research focuses on developing novel battery materials using advanced synthesis methods, achieving significant improvements in energy density and cycle stability through optimized electrode design."}}
"""
```

```python
# Choose how many items you want to keep
max_items = 10
categories = list(entities.keys())          # ["core_concepts", "methodologies", ...]
total_items = sum(len(v) for v in entities.values())

compression_prompt = prompt_template_compression.format(
    max_items=max_items,
    total_items=total_items,
    text=abstract,
    categories=categories,
)

compression_output = generate_text(compression_prompt, GEN_COMMON)
compressed = parse_json_object(compression_output)["response"]
print("Compressed summary:", compressed)
```
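
The example above fixes `max_items = 10`. To tie it to a compression ratio instead, one option is to derive the item count from the number of extracted entities. The sketch below assumes the ratio means "fraction of extracted entities to keep"; that reading and the helper name `items_for_ratio` are assumptions, not taken from the released code.

```python
import math
from typing import Dict, List


def items_for_ratio(entities: Dict[str, List[str]], ratio: float, min_items: int = 1) -> int:
    """Number of entity items to keep for a keep-ratio in (0, 1]."""
    if not 0.0 < ratio <= 1.0:
        raise ValueError("ratio must be in (0, 1]")
    total = sum(len(v) for v in entities.values())
    if total == 0:
        return 0
    # Round before ceil so float noise (e.g. 0.6000000000000001) cannot
    # push the result up by one.
    wanted = math.ceil(round(total * ratio, 9))
    return max(min_items, min(total, wanted))


demo = {"core_concepts": ["a", "b", "c", "d"], "methodologies": ["e", "f", "g"]}
print(items_for_ratio(demo, 0.5))
```

The result can be passed as `max_items` when formatting `prompt_template_compression`.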

---

## 4) Stage 3 — Classification (Patent-focused)

**Prompt (use verbatim):**

````python
prompt_template_classification = """You are a text classification expert specializing in patent documents. 
You are given a JSON record for a target patent and a set of Retrieved Similar Items. 
Your task is to assign one or more class labels to a given target patent using the provided examples as guidance.

---

**Step-by-Step Instructions:**

1. **Analyze Target and Retrieved Examples:**
- Review each example, paying attention to the class label and how the text reflects it.
- Focus on technical innovation, claims, and patent-specific terminology.

2. **Similarity Scoring (1–5):**
For each Retrieved Similar Item, score along three dimensions and sum to 1–5:
- Domain (0–2):
    - 2: Same primary technology field
    - 1: Closely related technology
    - 0: Unrelated
- Innovation Type (0–2):
    - 2: Same type of innovation (e.g., device, method, composition)
    - 1: Partial overlap in innovation approach
    - 0: Different innovation type
- Application/Material (0–1):
    - 1: Shares key technical terms or entities
    - 0: Different application/material

3. **Total Score → Similarity Label:**
- 5: Fully similar (Domain=2 + Innovation=2 + Application=1)
- 4: Mostly similar (sum = 4)
- 3: Partially similar (sum = 3)
- 2: Little similarity (sum = 2)
- 1: Irrelevant (sum = 0 or 1)

4. **Make a Classification Decision:**
- Based on all retrieved items, assign the most appropriate class ID(s) to the target.

---

**Response Format:**

1. **Chain-of-Thought** (between `<begin_of_thought>` and `<end_of_thought>`):
- Summarize the target's core innovation, claims, and technical field.
- For each Retrieved Similar Item, analyze its similarity and assign score.
- Conclude with overall comparison.

2. **Final Answer:**
- Provide classification with brief justification.
- Output ONLY the list of class id values.

**Use exactly this structure and STOP immediately after </answer>:**
```
<begin_of_thought>
<p>Target patent analysis... </p>
<p>Reference[Item ID=...], [Similarity=...], judgment text</p> 
...
<end_of_thought>
<solution>Overall evaluation=...</solution>
<answer>[Class_label_ID_1, Class_label_ID_2, ...]</answer>
```

**CRITICAL: Your response MUST end with </answer>. Do not add any text after the closing </answer> tag.**

---

**Special Condition:**
- If Total Score ≤ 2:
    - `<solution>`: Cannot determine answer
    - `<answer>`: None
- Otherwise:
    - `<solution>`: Overall evaluation=...
    - `<answer>`: [<Class_label_ID_1>, <Class_label_ID_2>, ...]

---

**Input Data:**

- Target ID: {target_id} 

- Target Text: {target_text} 

- Retrieved Similar Items (Top {retrieved_count}): 
{retrieved_items_text}
---

"""
````

````python
# Example retrieved neighbors (use COMPRESSED text for better accuracy/latency)
retrieved = [
    {"id": "US-AAA", "label": "H01M10/0525", "text": "Porous carbon hosts for Li-S cathodes..."},
    {"id": "US-BBB", "label": "H01M4/13", "text": "Conductive polymer binder for sulfur cathode..."},
]

retrieved_items_text = "\n".join(
    f"- ID: {r['id']}\n  Label: {r.get('label','')}\n  Text: {r['text']}" for r in retrieved
)

classification_prompt = prompt_template_classification.format(
    target_id="TARGET-1",
    target_text=compressed,   # classify on compressed text
    retrieved_count=len(retrieved),
    retrieved_items_text=retrieved_items_text,
)

# Use stop at </answer> for clean termination
cls_text = generate_text(classification_prompt, GEN_CLASSIFICATION)
if '</answer>' not in cls_text and '<answer>' in cls_text:
    cls_text += '</answer>'

print(cls_text)
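# parse_answer_ids only recognizes numeric IDs; non-numeric labels such as the
# CPC-style codes in this example yield an empty list.
parsed_ids = parse_answer_ids(cls_text)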
print("parsed:", parsed_ids)
````
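
`parse_answer_ids` handles numeric class IDs only. When the taxonomy uses string labels, such as the CPC-style codes in the example neighbors, a string-based parser is needed. The following is a minimal sketch under that assumption. The name `parse_answer_labels` is hypothetical, and it also tolerates a missing `</answer>` tag.

```python
import re
from typing import List, Optional


def parse_answer_labels(text: str) -> Optional[List[str]]:
    """Return label strings from an <answer>...</answer> block.

    None -> no <answer> tag found
    []   -> the model answered "None" or left the block empty
    """
    # Accept either a closing tag or end-of-text, since the stop string
    # may have been stripped from the generation.
    m = re.search(r"<answer>(.*?)(?:</answer>|$)", text, re.DOTALL)
    if not m:
        return None
    body = m.group(1).strip()
    if body.lower() == "none":
        return []
    labels: List[str] = []
    for part in body.strip("[]").split(","):
        label = part.strip().strip("'\"<>").strip()
        if label and label not in labels:
            labels.append(label)
    return labels


print(parse_answer_labels("<solution>ok</solution>\n<answer>[H01M10/0525, H01M4/13]</answer>"))
```

Labels are returned in the order the model emitted them, with duplicates removed.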

---

## 5) End-to-End Helper (Optional)

```python
def comp4cls_pipeline(abstract: str, retrieve_fn, k: int = 10) -> dict:
    """
    :param abstract: raw document text
    :param retrieve_fn: function(query_text, k) -> list of dicts [{id, label, text}, ...]
    :param k: top-k neighbors
    :return: {"entities": {...}, "compressed": "...", "classification_raw": "...", "parsed_ids": [...]}
    """
    # Stage 1: Entities
    ent_prompt = prompt_template_entity_extraction.format(abstract=abstract)
    ent_text = generate_text(ent_prompt, GEN_COMMON)
    entities = parse_json_object(ent_text)

    # Stage 2: Compression
    max_items = 10
    categories = list(entities.keys())
    total_items = sum(len(v) for v in entities.values())
    comp_prompt = prompt_template_compression.format(
        max_items=max_items, total_items=total_items, text=abstract, categories=categories
    )
    comp_text = generate_text(comp_prompt, GEN_COMMON)
    compressed = parse_json_object(comp_text)["response"]

    # Stage 3: Retrieval + Classification
    neighbors = retrieve_fn(compressed, k=k)  # [{"id","label","text"}, ...]
    retrieved_items_text = "\n".join(
        f"- ID: {r['id']}\n  Label: {r.get('label','')}\n  Text: {r['text']}" for r in neighbors
    )
    cls_prompt = prompt_template_classification.format(
        target_id="TARGET-1",
        target_text=compressed,
        retrieved_count=len(neighbors),
        retrieved_items_text=retrieved_items_text,
    )
    cls_raw = generate_text(cls_prompt, GEN_CLASSIFICATION)
    if '</answer>' not in cls_raw and '<answer>' in cls_raw:
        cls_raw += '</answer>'
    parsed = parse_answer_ids(cls_raw)
    return {"entities": entities, "compressed": compressed, "classification_raw": cls_raw, "parsed_ids": parsed}
```
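
`comp4cls_pipeline` expects a `retrieve_fn(query_text, k)` callable but does not define one. The sketch below builds a toy in-memory retriever so the pipeline can be exercised end to end. It ranks by bag-of-words cosine similarity, which is only a stand-in: a real deployment would use an embedding model and an ANN index such as HNSW. `make_retrieve_fn` and the private helpers are hypothetical names.

```python
import math
import re
from collections import Counter
from typing import Callable, Dict, List


def _bow(text: str) -> Counter:
    """Toy 'embedding': lowercase bag-of-words counts."""
    return Counter(re.findall(r"\w+", text.lower()))


def _cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def make_retrieve_fn(corpus: List[Dict[str, str]]) -> Callable[..., List[Dict[str, str]]]:
    """Build retrieve_fn(query_text, k) over records shaped like {"id", "label", "text"}."""
    vectors = [_bow(doc["text"]) for doc in corpus]  # index once, up front

    def retrieve_fn(query_text: str, k: int = 10) -> List[Dict[str, str]]:
        q = _bow(query_text)
        ranked = sorted(range(len(corpus)), key=lambda i: _cosine(q, vectors[i]), reverse=True)
        return [corpus[i] for i in ranked[:k]]

    return retrieve_fn


corpus = [
    {"id": "US-AAA", "label": "H01M10/0525", "text": "Porous carbon hosts for Li-S cathodes..."},
    {"id": "US-BBB", "label": "H01M4/13", "text": "Conductive polymer binder for sulfur cathode..."},
]
retrieve_fn = make_retrieve_fn(corpus)
print([r["id"] for r in retrieve_fn("porous carbon hosts in sulfur cathodes", k=2)])
```

The corpus `text` fields should hold the compressed summaries, so that query and neighbors are compared in the same representation.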

---

## 6) Notes

- Stage-1/2 prompts demand **strict JSON**. The helper `parse_json_object` parses the span from the first `{` to the last `}`, so any stray braces outside the JSON object will cause a parse error.
- For Stage-3, keep `stop=["</answer>"]` to avoid over-generation and simplify parsing.
- Swap `MODEL_NAME` for your fine-tuned repo (e.g., `gsjang/lim-4b-1-0826`) if desired.
- Retrieval should use **compressed** texts for both query and neighbors.



# Citation

If you use **Comp4Cls** in your work, please cite:

```bibtex
@inproceedings{lim2026comp4cls,
  author    = {Lim, Chanuk},
  title     = {Comp4Cls: Semantic Compression for Enhanced Retrieval-Augmented Classification of Real-World Scientific and Technical Documents},
  booktitle = {ICDE 2026 (submitted)},
  year      = {2026},
}
```



# Acknowledgements

* **Korea Institute of Science and Technology Information (KISTI)** — This research was supported in 2025 under project **K25L1M1C1**, as part of the development of **KONI (KISTI Open Neural Intelligence)**, a large language model specialized for science and technology.
* **National Supercomputing Center (KISTI)** — We gratefully acknowledge the computational resources and technical support provided by the National Supercomputing Center.