---
library_name: transformers
tags: []
---

# Model Description

**Comp4Cls** is a retrieval-augmented classification framework that uses **entity-centric semantic compression** to turn long scientific/technical documents into short, task-focused representations for both retrieval and labeling. Documents (papers, patents, and R&D reports) are first compressed into structured summaries that preserve discriminative signals (e.g., core concepts, methods, problems, findings), embedded, and stored in a vector DB. At inference, a query is compressed the same way, nearest neighbors are retrieved, and a small LLM assigns the final class label using the compressed evidence.

The end-to-end workflow (**Phase 1: compression + indexing; Phase 2: retrieval + classification**) is illustrated in the framework diagram on *page 2* of the paper. Experiments on a large bilingual corpus with hierarchical, multi-label taxonomies show that a **4B-scale** Comp4Cls matches or outperforms **8B–14B** models, especially in fine-grained categories, while cutting token usage and compute. Moderate compression (often **~20% of entities**) preserves retrieval fidelity and boosts downstream F1, enabling lightweight, low-latency deployment in production pipelines. In the paper, see *Table II on page 8* (compression vs. length), *Figure 6 on page 9* (retrieval quality under compression), and *Figure 7 on page 10* (accuracy vs. larger LLMs).

<h2>Framework Diagram</h2>

<p align="center">
  <img src="comp4cls_framework.jpg" width="720" alt="Comp4Cls framework diagram">
  <br>
  <em>Figure 1. Overview of the <strong>Comp4Cls</strong> framework. The system operates in two phases: (i) documents with predefined class labels are semantically compressed, embedded, and stored in a vector database; (ii) when a new query arrives, it is compressed and used to retrieve the top-k most similar documents from the vector store. The large language model (LLM) then determines the final class label based on the retrieved context. Finally, the compressed query and its assigned label are stored back into the database, enabling downstream services such as document categorization, semantic search, and TL;DR summarization.</em>
</p>



# Key Features

* **Entity-centric Semantic Compression**
  Two-stage prompting (entity extraction → selective rewriting) produces concise, structured summaries that retain label-relevant semantics while removing redundancy. The compressor exposes an explicit **compression ratio** to match accuracy/latency budgets.

* **Retrieval-Augmented Classification (RAG) with Short Contexts**
  Operates on compressed texts for both the query and neighbors, reducing context length and enabling **broader top-k** without “lost-in-the-middle” degradation. 

* **Small-Model, Big-Model Performance**
  With **~20% compression**, a **4B** backbone matches or exceeds the accuracy of **8B–14B** models across domains and taxonomy levels.

* **Measured Efficiency Gains**
  Compression reduces input tokens by **~50%** on average while maintaining semantic similarity; retrieval accuracy remains near full-text levels. 

* **Scales to Real-World, Heterogeneous Corpora**
  Trained/evaluated on large bilingual datasets spanning **papers, patents, and R&D reports** with hierarchical, multi-label taxonomies; robust under domain shift and taxonomy changes. 

* **Production-minded Latency/Throughput**
  Shorter prompts cut classification-stage latency; compression allows higher **top-k (≈20–30)** before context saturation.

* **Vector DB-Ready Artifacts**
  Outputs compressed texts + embeddings that plug into standard ANN indices (e.g., HNSW) for high-throughput retrieval in enterprise knowledge systems.

* **Beyond Classification**
  The compressed representations support downstream **semantic search**, **TL;DR summaries**, and **knowledge organization** tasks out of the box. 
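
The token-reduction figure quoted above can be checked on your own data with a few lines of Python. This is a minimal sketch, not the evaluation code from the paper: `token_reduction` and `whitespace_tokens` are hypothetical helpers, and whitespace splitting is only a stand-in for a real tokenizer.

```python
from typing import Callable, List


def whitespace_tokens(text: str) -> List[str]:
    """Crude stand-in tokenizer (assumption): split on whitespace."""
    return text.split()


def token_reduction(
    original: str,
    compressed: str,
    tokenize: Callable[[str], List[str]] = whitespace_tokens,
) -> float:
    """Fraction of tokens removed by compression (0.0 when the original is empty)."""
    n_orig = len(tokenize(original))
    if n_orig == 0:
        return 0.0
    return 1.0 - len(tokenize(compressed)) / n_orig


original = "We present a novel lithium-sulfur battery cathode design using porous carbon hosts"
compressed = "Lithium-sulfur cathode design using porous carbon hosts"
print(f"{token_reduction(original, compressed):.0%} fewer tokens")
```

For numbers comparable to the model's real context usage, pass the model tokenizer's `encode` method as `tokenize` instead of the whitespace splitter.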



# Comp4Cls — Full Usage Guide w/ vLLM

This guide shows how to run **all three stages** of Comp4Cls with vLLM:
1) **Entity Extraction** → 2) **Compression** → 3) **Classification**.

It uses the **exact prompt templates** for each stage and a minimal vLLM wrapper.
Replace the model name with your fine-tuned repo if needed.

---

## 0) Install & Setup

```bash
pip install vllm "transformers>=4.44" accelerate einops huggingface-hub
```

---

## 1) Minimal Inference Primitives

```python
import json
import re
from typing import Optional, List, Dict

from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

# ----------------------
# Config
# ----------------------
MODEL_NAME = "comp4cls/comp4cls-4B"  

# Generation params shared by all stages
COMMON_SAMPLING_KWARGS = dict(
    temperature=0.2,
    top_p=0.8,
    repetition_penalty=1.1,
    frequency_penalty=0.1,
    presence_penalty=0.1,
    max_tokens=2048,
)

GEN_COMMON = SamplingParams(**COMMON_SAMPLING_KWARGS)

# Stage 3 additionally stops at </answer>. vLLM omits the stop string from the
# returned text by default, so callers re-append it before parsing.
GEN_CLASSIFICATION = SamplingParams(**COMMON_SAMPLING_KWARGS, stop=["</answer>"])

# ----------------------
# Load tokenizer & model
# ----------------------
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)
llm = LLM(
    model=MODEL_NAME,
    trust_remote_code=True,
    tensor_parallel_size=1,
    gpu_memory_utilization=0.95,
    max_model_len=30000,
    max_num_seqs=64,
)

# ----------------------
# Helpers
# ----------------------
def apply_chat_template(prompt: str, enable_thinking: bool=False) -> str:
    """Wrap raw prompt with the model's chat template."""
    messages = [{"role": "user", "content": prompt}]
    return tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
        enable_thinking=enable_thinking,
    )

def generate_text(prompt: str, params: SamplingParams) -> str:
    """Single-pass generation with vLLM."""
    formatted = apply_chat_template(prompt, enable_thinking=False)
    out = llm.generate([formatted], params)
    text = out[0].outputs[0].text
    return text

def parse_json_object(text: str) -> dict:
    """Parse the span from the first '{' to the last '}' in text as a JSON object."""
    start = text.find("{")
    end = text.rfind("}") + 1
    if start == -1 or end == 0:
        raise ValueError("No JSON object detected in model output.")
    return json.loads(text[start:end])

def parse_answer_ids(text: str) -> Optional[List[Dict[str, int]]]:
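    """Extract numeric class IDs from the <answer>...</answer> block.

    Returns [{'class_id': 123}, ...], [] for "None" or when no numeric ID is
    found, and None when there is no <answer> block.
    """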
    try:
        m = re.search(r'<answer>(.*?)</answer>', text, re.DOTALL)
        if not m:
            return None
        body = m.group(1).strip()
        if body.lower() == "none":
            return []
        body = body.strip().strip('[]')
        classes = []
        for mm in re.finditer(r'\((\d+)\)', body):
            classes.append({"class_id": int(mm.group(1))})
        if not classes and body:
            parts = [x.strip() for x in body.split(",")]
            for p in parts:
                if p.isdigit():
                    classes.append({"class_id": int(p)})
        return classes if classes else []
    except Exception:
        return None
```
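
`parse_json_object` slices from the first `{` to the last `}`, so it fails when the model appends extra text that contains braces. A stricter variant can scan for the first substring that actually decodes as a JSON object. The sketch below is an optional alternative, not part of the released code; `parse_first_json_object` is a hypothetical name, and it relies only on the standard library's `json.JSONDecoder.raw_decode`.

```python
import json


def parse_first_json_object(text: str) -> dict:
    """Return the first substring of text that decodes as a JSON object."""
    decoder = json.JSONDecoder()
    pos = text.find("{")
    while pos != -1:
        try:
            # raw_decode parses one JSON value starting at pos and ignores the rest
            obj, _ = decoder.raw_decode(text, pos)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict):
            return obj
        pos = text.find("{", pos + 1)
    raise ValueError("No JSON object detected in model output.")


print(parse_first_json_object('Sure! {"response": "ok"} (note: {draft})'))
```

It raises the same `ValueError` as the original helper when nothing parses, so it can be swapped in without changing the calling code.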

---

## 2) Stage 1 — Entity Extraction

**Prompt (exact as provided):**

```python
prompt_template_entity_extraction = """You are tasked with extracting keywords from scientific literature abstracts based on their domain classification. 
Extract keywords that appear EXACTLY in the given abstract and organize them into 7 predefined keyword types. 
Instructions: 
1. Read the provided abstract and domain classification carefully 
2. Extract keywords/phrases that appear verbatim in the abstract 
3. Organize each keyword into the most appropriate keyword type 
4. Each keyword should be assigned to only one type 
5. Focus on meaningful technical terms, not common words 
6. Return results in JSON format 
Keyword Types for Organization: 
1. core_concepts: Central theories, main ideas, or fundamental concepts that define the research 
2. methodologies: Research methods, experimental techniques, analytical approaches, or procedural strategies 
3. subjects_problems: Research subjects, target problems, phenomena under investigation, or challenges being addressed 
4. findings_impacts: Key discoveries, results, outcomes, implications, or impacts of the research 
5. theoretical_framework: Underlying theories, models, principles, or conceptual foundations 
6. quantitative_metrics: Numerical values, measurements, statistics, percentages, or any quantifiable data 
7. contextual_background: Historical context, motivation, prior work references, or situational background 
Guidelines: 
- Extract only words/phrases that exist exactly in the abstract 
- Prefer technical terms over generic academic vocabulary 
- Include both single words and meaningful phrases 
- For quantitative metrics, include the complete value with units 
- Ensure keywords are relevant to the domain classification Output must be in JSON format with all 7 keyword types as keys. 
Example output format: {{ "core_concepts": ["CEST MRI", "thermally activated delayed fluorescence", "blue phosphorescent organic light-emitting diodes"], "methodologies": ["synthesized", "subspace-based spectral signal decomposition", "sphere formation assay"], "subjects_problems": ["z-spectrum analysis", "cancer stem cells", "charge balance"], "findings_impacts": ["high quantum efficiency", "inhibits mobility", "record high"], "theoretical_framework": ["saturation transfer phenomena", "energy transfer", "structure-property relationship"], "quantitative_metrics": ["Above 30%", "24.2%", "70-110 GHz", "40-80 μM"], "contextual_background": ["drug resistance", "alternative to conventional", "for molecular MRI"] }} 
Extract keywords from the following scientific literature: 
Abstract: {abstract}
Return the keywords organized by their types in JSON format with all 7 keyword types.
"""
```

```python
# Example input (replace with your real abstract)
abstract = "We present a novel lithium-sulfur battery cathode design using porous carbon hosts..."

entity_prompt = prompt_template_entity_extraction.format(abstract=abstract)
entity_output = generate_text(entity_prompt, GEN_COMMON)
entities = parse_json_object(entity_output)  # dict with 7 keys
print(json.dumps(entities, indent=2, ensure_ascii=False))
```
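
The Stage-1 prompt asks for all 7 keyword types and for keywords that appear verbatim in the abstract, but nothing in the snippet above enforces either. A small post-processing step can do so. This is a sketch under assumptions: `clean_entities` and `KEYWORD_TYPES` are hypothetical names, and the case-insensitive substring match is a design choice made here, not something the source specifies.

```python
from typing import Dict, List

# The 7 keyword types named in the Stage-1 prompt
KEYWORD_TYPES = [
    "core_concepts", "methodologies", "subjects_problems", "findings_impacts",
    "theoretical_framework", "quantitative_metrics", "contextual_background",
]


def clean_entities(entities: Dict[str, List[str]], abstract: str) -> Dict[str, List[str]]:
    """Keep the 7 expected keys and only keywords found in the abstract."""
    haystack = abstract.lower()
    cleaned = {}
    for key in KEYWORD_TYPES:
        values = entities.get(key) or []  # a missing or null key becomes an empty list
        cleaned[key] = [v for v in values if isinstance(v, str) and v.lower() in haystack]
    return cleaned


abstract_demo = "We present a novel lithium-sulfur battery cathode design using porous carbon hosts."
raw_demo = {
    "core_concepts": ["lithium-sulfur battery", "solid electrolyte"],
    "methodologies": ["porous carbon hosts"],
}
print(clean_entities(raw_demo, abstract_demo))
```

Here `"solid electrolyte"` is dropped because it does not occur in the abstract, and the five missing keyword types are filled with empty lists.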

---

## 3) Stage 2 — Compression

**Prompt (exact as provided):**

```python
prompt_template_compression = """You are a scientific document summarizer specializing in category-driven summarization.

Task: Create a concise summary using ONLY {max_items} items from the provided semantic categories (out of {total_items} total items).

Requirements:
- Write the summary in the same language as the original text
- Select the {max_items} most relevant items that align with the original text
- Use content from the original text ONLY when it directly supports these categories
- The summary should read as if the original text was written to illustrate the semantic categories
- Maintain scientific accuracy and use precise terminology
- Ensure logical flow and coherence between concepts

Input:
- Original Text: {text}
- Semantic Categories (in order of priority): {categories}

CRITICAL: You MUST output ONLY a valid JSON object in exactly this format:
{{"response": "Your concise summary here"}}

Do not include any text before or after the JSON object. The summary should be a single continuous text without line breaks.

Output Format (example):
{{"response": "This research focuses on developing novel battery materials using advanced synthesis methods, achieving significant improvements in energy density and cycle stability through optimized electrode design."}}
"""
```

```python
# Choose how many items you want to keep
max_items = 10
categories = list(entities.keys())          # ["core_concepts", "methodologies", ...]
total_items = sum(len(v) for v in entities.values())

compression_prompt = prompt_template_compression.format(
    max_items=max_items,
    total_items=total_items,
    text=abstract,
    categories=categories,
)

compression_output = generate_text(compression_prompt, GEN_COMMON)
compressed = parse_json_object(compression_output)["response"]
print("Compressed summary:", compressed)
```
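
Instead of a fixed `max_items`, the budget can be derived from a target compression ratio such as the ~20% mentioned in the model description. The sketch below assumes the ratio means the fraction of extracted entities to keep; the helper name `max_items_for_ratio` and the round-up rule are illustrative choices, not taken from the paper.

```python
import math
from typing import Dict, List


def max_items_for_ratio(entities: Dict[str, List[str]], ratio: float = 0.2) -> int:
    """Number of entities to keep for a given ratio (rounded up, at least 1)."""
    if not 0.0 < ratio <= 1.0:
        raise ValueError("ratio must be in (0, 1]")
    total_items = sum(len(v) for v in entities.values())
    # The small epsilon guards against float noise such as 35 * 0.2 == 7.000000000000001
    return max(1, math.ceil(total_items * ratio - 1e-9))


demo_entities = {"core_concepts": ["a"] * 5, "methodologies": ["b"] * 7}
print(max_items_for_ratio(demo_entities, ratio=0.2))  # 12 entities at 20%
```

The result can be passed as `max_items` to `prompt_template_compression.format(...)` in place of the hard-coded `10`.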

---

## 4) Stage 3 — Classification (Patent-focused)

**Prompt (exact as provided):**

````python
prompt_template_classification = """You are a text classification expert specializing in patent documents. 
You are given a JSON record for a target patent and a set of Retrieved Similar Items. 
Your task is to assign one or more class labels to a given target patent using the provided examples as guidance.

---

**Step-by-Step Instructions:**

1. **Analyze Target and Retrieved Examples:**
- Review each example, paying attention to the class label and how the text reflects it.
- Focus on technical innovation, claims, and patent-specific terminology.

2. **Similarity Scoring (1–5):**
For each Retrieved Similar Item, score along three dimensions and sum to 1–5:
- Domain (0–2):
    - 2: Same primary technology field
    - 1: Closely related technology
    - 0: Unrelated
- Innovation Type (0–2):
    - 2: Same type of innovation (e.g., device, method, composition)
    - 1: Partial overlap in innovation approach
    - 0: Different innovation type
- Application/Material (0–1):
    - 1: Shares key technical terms or entities
    - 0: Different application/material

3. **Total Score → Similarity Label:**
- 5: Fully similar (Domain=2 + Innovation=2 + Application=1)
- 4: Mostly similar (sum = 4)
- 3: Partially similar (sum = 3)
- 2: Little similarity (sum = 2)
- 1: Irrelevant (sum = 0 or 1)

4. **Make a Classification Decision:**
- Based on all retrieved items, assign the most appropriate class ID(s) to the target.

---

**Response Format:**

1. **Chain-of-Thought** (between `<begin_of_thought>` and `<end_of_thought>`):
- Summarize the target's core innovation, claims, and technical field.
- For each Retrieved Similar Item, analyze its similarity and assign score.
- Conclude with overall comparison.

2. **Final Answer:**
- Provide classification with brief justification.
- Output ONLY the list of class id values.

**Use exactly this structure and STOP immediately after </answer>:**
```
<begin_of_thought>
<p>Target patent analysis... </p>
<p>Reference[Item ID=...], [Similarity=...], judgment text</p> 
...
<end_of_thought>
<solution>Overall evaluation=...</solution>
<answer>[Class_label_ID_1, Class_label_ID_2, ...]</answer>
```

**CRITICAL: Your response MUST end with </answer>. Do not add any text after the closing </answer> tag.**

---

**Special Condition:**
- If Total Score ≤ 2:
    - `<solution>`: Cannot determine answer
    - `<answer>`: None
- Otherwise:
    - `<solution>`: Overall evaluation=...
    - `<answer>`: [<Class_label_ID_1>, <Class_label_ID_2>, ...]

---

**Input Data:**

- Target ID: {target_id} 

- Target Text: {target_text} 

- Retrieved Similar Items (Top {retrieved_count}): 
{retrieved_items_text}
---

"""
````
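
The similarity rubric in this prompt is plain arithmetic: three sub-scores are summed, and sums of 0 and 1 both map to label 1. When auditing the model's chain-of-thought, it can help to recompute the label from the sub-scores it reports. The function below is a hypothetical checker written for this guide, not part of the model or its training code.

```python
def similarity_label(domain: int, innovation: int, application: int) -> int:
    """Map rubric sub-scores to the 1-5 similarity label described in the prompt."""
    if domain not in (0, 1, 2) or innovation not in (0, 1, 2) or application not in (0, 1):
        raise ValueError("sub-score out of range")
    total = domain + innovation + application
    # Sums of 0 and 1 both count as label 1 ("Irrelevant")
    return max(total, 1)


print(similarity_label(domain=2, innovation=1, application=0))
```

A mismatch between the recomputed label and the `[Similarity=...]` value in the model's output is a cheap signal that a response deserves manual review.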

````python
# Example retrieved neighbors (use COMPRESSED text for better accuracy/latency)
retrieved = [
    {"id": "US-AAA", "label": "H01M10/0525", "text": "Porous carbon hosts for Li-S cathodes..."},
    {"id": "US-BBB", "label": "H01M4/13", "text": "Conductive polymer binder for sulfur cathode..."},
]

retrieved_items_text = "\n".join(
    f"- ID: {r['id']}\n  Label: {r.get('label','')}\n  Text: {r['text']}" for r in retrieved
)

classification_prompt = prompt_template_classification.format(
    target_id="TARGET-1",
    target_text=compressed,   # classify on compressed text
    retrieved_count=len(retrieved),
    retrieved_items_text=retrieved_items_text,
)

# Use stop at </answer> for clean termination
cls_text = generate_text(classification_prompt, GEN_CLASSIFICATION)
if '</answer>' not in cls_text and '<answer>' in cls_text:
    cls_text += '</answer>'

print(cls_text)
parsed_ids = parse_answer_ids(cls_text)
print("parsed:", parsed_ids)
````
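
`parse_answer_ids` only recognizes numeric IDs, so alphanumeric labels such as the `H01M10/0525` used in the example neighbors come back as an empty list. If your taxonomy uses such labels, a string-based parser is one option. The sketch below is an assumption about how you might handle it; `parse_answer_labels` is a hypothetical helper, and it assumes labels contain no commas.

```python
import re
from typing import List, Optional


def parse_answer_labels(text: str) -> Optional[List[str]]:
    """Return label strings from the <answer> block, or None if the block is missing."""
    m = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    if not m:
        return None
    body = m.group(1).strip()
    if body.lower() == "none":
        return []
    body = body.strip("[]")
    # Split on commas and drop surrounding whitespace and quotes
    labels = [part.strip().strip("'\"") for part in body.split(",")]
    return [label for label in labels if label]


print(parse_answer_labels("<answer>[H01M10/0525, H01M4/13]</answer>"))
```

Like `parse_answer_ids`, it distinguishes a missing block (`None`) from an explicit "None" answer (`[]`).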

---

## 5) End-to-End Helper (Optional)

```python
def comp4cls_pipeline(abstract: str, retrieve_fn, k: int = 10) -> dict:
    """
    :param abstract: raw document text
    :param retrieve_fn: function(query_text, k) -> list of dicts [{id, label, text}, ...]
    :param k: top-k neighbors
    :return: {"entities": {...}, "compressed": "...", "classification_raw": "...", "parsed_ids": [...]}
    """
    # Stage 1: Entities
    ent_prompt = prompt_template_entity_extraction.format(abstract=abstract)
    ent_text = generate_text(ent_prompt, GEN_COMMON)
    entities = parse_json_object(ent_text)

    # Stage 2: Compression
    max_items = 10
    categories = list(entities.keys())
    total_items = sum(len(v) for v in entities.values())
    comp_prompt = prompt_template_compression.format(
        max_items=max_items, total_items=total_items, text=abstract, categories=categories
    )
    comp_text = generate_text(comp_prompt, GEN_COMMON)
    compressed = parse_json_object(comp_text)["response"]

    # Stage 3: Retrieval + Classification
    neighbors = retrieve_fn(compressed, k=k)  # [{"id","label","text"}, ...]
    retrieved_items_text = "\n".join(
        f"- ID: {r['id']}\n  Label: {r.get('label','')}\n  Text: {r['text']}" for r in neighbors
    )
    cls_prompt = prompt_template_classification.format(
        target_id="TARGET-1",
        target_text=compressed,
        retrieved_count=len(neighbors),
        retrieved_items_text=retrieved_items_text,
    )
    cls_raw = generate_text(cls_prompt, GEN_CLASSIFICATION)
    if '</answer>' not in cls_raw and '<answer>' in cls_raw:
        cls_raw += '</answer>'
    parsed = parse_answer_ids(cls_raw)
    return {"entities": entities, "compressed": compressed, "classification_raw": cls_raw, "parsed_ids": parsed}
```
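
`comp4cls_pipeline` expects a `retrieve_fn(query_text, k)` that returns `[{id, label, text}, ...]`. For local testing without a vector database, an in-memory store is enough. Everything in this sketch is a stand-in: `InMemoryStore` is a hypothetical class, and the bag-of-words `embed` function is a toy replacement for a real embedding model and ANN index.

```python
import math
from collections import Counter
from typing import Dict, List


def embed(text: str) -> Counter:
    """Toy bag-of-words embedding (assumption); swap in a real embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class InMemoryStore:
    """Holds compressed texts with labels and returns the top-k most similar items."""

    def __init__(self) -> None:
        self.items: List[Dict] = []

    def add(self, doc_id: str, label: str, text: str) -> None:
        self.items.append({"id": doc_id, "label": label, "text": text, "vec": embed(text)})

    def __call__(self, query_text: str, k: int = 10) -> List[Dict]:
        q = embed(query_text)
        ranked = sorted(self.items, key=lambda it: cosine(q, it["vec"]), reverse=True)
        return [{"id": it["id"], "label": it["label"], "text": it["text"]} for it in ranked[:k]]


store = InMemoryStore()
store.add("US-AAA", "H01M10/0525", "Porous carbon hosts for Li-S cathodes")
store.add("US-BBB", "H01M4/13", "Conductive polymer binder for sulfur cathode")
neighbors = store("porous carbon hosts in sulfur cathodes", k=1)
print(neighbors[0]["id"])
```

The store can be passed directly as `retrieve_fn=store`. Calling `store.add(...)` with the compressed query and its predicted label after classification mirrors the write-back step described in the framework overview.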

---

## 6) Notes

- Stage-1/2 prompts demand **strict JSON**. The helper `parse_json_object` parses the span from the first `{` to the last `}`, so stray braces outside the JSON object will cause a parse error.
- For Stage-3, keep `stop=["</answer>"]` to avoid over-generation and simplify parsing.
- Swap `MODEL_NAME` for your fine-tuned repo (e.g., `gsjang/lim-4b-1-0826`) if desired.
- Retrieval should use **compressed** texts for both query and neighbors.
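
The `generate_text` helper in this guide submits one prompt per call. vLLM's `LLM.generate` also accepts a list of prompts, which lets the engine batch them (up to the `max_num_seqs` configured above). The sketch below shows one way to wrap that; `generate_batch` is a hypothetical helper, and it assumes outputs are returned in the same order as the input prompts.

```python
from typing import Callable, List


def generate_batch(prompts: List[str], llm, params, format_fn: Callable[[str], str]) -> List[str]:
    """Format every prompt, submit them in one llm.generate call, and return the texts."""
    formatted = [format_fn(p) for p in prompts]
    outputs = llm.generate(formatted, params)
    # Each result exposes its completions under .outputs; take the first one
    return [out.outputs[0].text for out in outputs]
```

With the objects defined earlier in this guide, a call would look like `generate_batch(entity_prompts, llm, GEN_COMMON, apply_chat_template)`, where `entity_prompts` is a list of filled-in Stage-1 prompts.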



# Citation

If you use **Comp4Cls** in your work, please cite:

```bibtex
@inproceedings{lim2026comp4cls,
  author    = {Lim, Chanuk},
  title     = {Comp4Cls: Semantic Compression for Enhanced Retrieval-Augmented Classification of Real-World Scientific and Technical Documents},
  booktitle = {ICDE 2026 (submitted)},
  year      = {2026},
}
```



# Acknowledgements

* **Korea Institute of Science and Technology Information (KISTI)** — This research was supported in 2025 under project **K25L1M1C1**, as part of the development of **KONI (KISTI Open Neural Intelligence)**, a large language model specialized for science and technology.
* **National Supercomputing Center (KISTI)** — We gratefully acknowledge the computational resources and technical support provided by the National Supercomputing Center.