chanuk committed 36bd915 (verified; parent: 1d1e7c0): Update README.md

Files changed (1): README.md (+392 −5)
# Comp4Cls — Full Usage Guide (vLLM + Qwen3-4B)

This guide shows how to run **all three stages** of Comp4Cls with vLLM:
1) **Entity Extraction** → 2) **Compression** → 3) **Classification**.

It uses the **exact prompt templates** for each stage and a minimal vLLM wrapper.
Replace the model name with your fine-tuned repo if needed.

---
## 0) Install & Setup

```bash
pip install vllm "transformers>=4.44" accelerate einops huggingface-hub
```

---
## 1) Minimal Inference Primitives

```python
import re, json
from typing import Optional, List, Dict

from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

# ----------------------
# Config
# ----------------------
MODEL_NAME = "comp4cls/comp4cls-4B"

# Generation params (Stage 3 additionally stops at </answer>)
GEN_COMMON = SamplingParams(
    temperature=0.2,
    top_p=0.8,
    repetition_penalty=1.1,
    frequency_penalty=0.1,
    presence_penalty=0.1,
    max_tokens=2048,
)

GEN_CLASSIFICATION = SamplingParams(
    temperature=0.2,
    top_p=0.8,
    repetition_penalty=1.1,
    frequency_penalty=0.1,
    presence_penalty=0.1,
    max_tokens=2048,
    stop=["</answer>"],
)

# ----------------------
# Load tokenizer & model
# ----------------------
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)
llm = LLM(
    model=MODEL_NAME,
    trust_remote_code=True,
    tensor_parallel_size=1,
    gpu_memory_utilization=0.95,
    max_model_len=30000,
    max_num_seqs=64,
)

# ----------------------
# Helpers
# ----------------------
def apply_chat_template(prompt: str, enable_thinking: bool = False) -> str:
    """Wrap a raw prompt with the model's chat template."""
    messages = [{"role": "user", "content": prompt}]
    return tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
        enable_thinking=enable_thinking,
    )

def generate_text(prompt: str, params: SamplingParams) -> str:
    """Single-pass generation with vLLM."""
    formatted = apply_chat_template(prompt, enable_thinking=False)
    out = llm.generate([formatted], params)
    return out[0].outputs[0].text

def parse_json_object(text: str) -> dict:
    """Extract the span from the first '{' to the last '}' in text and parse it as JSON."""
    start = text.find("{")
    end = text.rfind("}") + 1
    if start == -1 or end == 0:
        raise ValueError("No JSON object detected in model output.")
    return json.loads(text[start:end])

def parse_answer_ids(text: str) -> Optional[List[Dict[str, int]]]:
    """Extract class IDs from an <answer> ... </answer> block: [{'class_id': 123}, ...]."""
    try:
        m = re.search(r'<answer>(.*?)</answer>', text, re.DOTALL)
        if not m:
            return None
        body = m.group(1).strip()
        if body.lower() == "none":
            return []
        body = body.strip('[]')
        classes = []
        for mm in re.finditer(r'\((\d+)\)', body):
            classes.append({"class_id": int(mm.group(1))})
        if not classes and body:
            for p in (x.strip() for x in body.split(",")):
                if p.isdigit():
                    classes.append({"class_id": int(p)})
        return classes
    except Exception:
        return None
```

---
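Because `parse_json_object` and `parse_answer_ids` are pure string utilities, they can be sanity-checked without loading the model. A minimal standalone sketch (slightly condensed copies of the helpers above; the sample strings are invented):

```python
import re, json
from typing import Optional, List, Dict

def parse_json_object(text: str) -> dict:
    """Parse the span from the first '{' to the last '}' as JSON."""
    start = text.find("{")
    end = text.rfind("}") + 1
    if start == -1 or end == 0:
        raise ValueError("No JSON object detected in model output.")
    return json.loads(text[start:end])

def parse_answer_ids(text: str) -> Optional[List[Dict[str, int]]]:
    """Extract class IDs from an <answer>...</answer> block."""
    m = re.search(r'<answer>(.*?)</answer>', text, re.DOTALL)
    if not m:
        return None
    body = m.group(1).strip()
    if body.lower() == "none":
        return []
    body = body.strip('[]')
    classes = [{"class_id": int(mm.group(1))} for mm in re.finditer(r'\((\d+)\)', body)]
    if not classes and body:
        classes = [{"class_id": int(p)} for p in (x.strip() for x in body.split(",")) if p.isdigit()]
    return classes

# The parsers tolerate noise around the payload:
print(parse_json_object('Sure! {"response": "summary"} Done.'))  # {'response': 'summary'}
print(parse_answer_ids('<answer>[101, 202]</answer>'))           # [{'class_id': 101}, {'class_id': 202}]
print(parse_answer_ids('<answer>None</answer>'))                 # []
print(parse_answer_ids('no tags at all'))                        # None
```

Both bracketed lists (`[101, 202]`) and parenthesized IDs (`(101), (202)`) parse to the same structure, which matches the two answer shapes the Stage-3 prompt can produce.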
## 2) Stage 1 — Entity Extraction

**Prompt (exact as provided):**

```python
prompt_template_entity_extraction = """You are tasked with extracting keywords from scientific literature abstracts based on their domain classification.
Extract keywords that appear EXACTLY in the given abstract and organize them into 7 predefined keyword types.
Instructions:
1. Read the provided abstract and domain classification carefully
2. Extract keywords/phrases that appear verbatim in the abstract
3. Organize each keyword into the most appropriate keyword type
4. Each keyword should be assigned to only one type
5. Focus on meaningful technical terms, not common words
6. Return results in JSON format
Keyword Types for Organization:
1. core_concepts: Central theories, main ideas, or fundamental concepts that define the research
2. methodologies: Research methods, experimental techniques, analytical approaches, or procedural strategies
3. subjects_problems: Research subjects, target problems, phenomena under investigation, or challenges being addressed
4. findings_impacts: Key discoveries, results, outcomes, implications, or impacts of the research
5. theoretical_framework: Underlying theories, models, principles, or conceptual foundations
6. quantitative_metrics: Numerical values, measurements, statistics, percentages, or any quantifiable data
7. contextual_background: Historical context, motivation, prior work references, or situational background
Guidelines:
- Extract only words/phrases that exist exactly in the abstract
- Prefer technical terms over generic academic vocabulary
- Include both single words and meaningful phrases
- For quantitative metrics, include the complete value with units
- Ensure keywords are relevant to the domain classification Output must be in JSON format with all 7 keyword types as keys.
Example output format: {{ "core_concepts": ["CEST MRI", "thermally activated delayed fluorescence", "blue phosphorescent organic light-emitting diodes"], "methodologies": ["synthesized", "subspace-based spectral signal decomposition", "sphere formation assay"], "subjects_problems": ["z-spectrum analysis", "cancer stem cells", "charge balance"], "findings_impacts": ["high quantum efficiency", "inhibits mobility", "record high"], "theoretical_framework": ["saturation transfer phenomena", "energy transfer", "structure-property relationship"], "quantitative_metrics": ["Above 30%", "24.2%", "70-110 GHz", "40-80 μM"], "contextual_background": ["drug resistance", "alternative to conventional", "for molecular MRI"] }}
Extract keywords from the following scientific literature:
Abstract: {abstract}
Return the keywords organized by their types in JSON format with all 7 keyword types.
"""
```

```python
# Example input (replace with your real abstract)
abstract = "We present a novel lithium-sulfur battery cathode design using porous carbon hosts..."

entity_prompt = prompt_template_entity_extraction.format(abstract=abstract)
entity_output = generate_text(entity_prompt, GEN_COMMON)
entities = parse_json_object(entity_output)  # dict with 7 keys
print(json.dumps(entities, indent=2, ensure_ascii=False))
```

---
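Stage 2 iterates over `entities.values()`, so it is worth verifying that Stage-1 output has the expected shape: a dict mapping each of the 7 keyword types to a list of strings. A minimal validation sketch (the `validate_entities` helper and the sample dict are ours, not part of the repo; `EXPECTED_TYPES` mirrors the prompt's list):

```python
EXPECTED_TYPES = [
    "core_concepts", "methodologies", "subjects_problems", "findings_impacts",
    "theoretical_framework", "quantitative_metrics", "contextual_background",
]

def validate_entities(entities: dict) -> dict:
    """Keep only the 7 expected keys; coerce missing or null values to empty lists."""
    return {k: list(entities.get(k) or []) for k in EXPECTED_TYPES}

# Made-up Stage-1 output with one extra key and several missing keys:
raw = {
    "core_concepts": ["lithium-sulfur battery"],
    "methodologies": ["porous carbon hosts"],
    "unexpected_key": ["noise"],
}
entities = validate_entities(raw)
print(len(entities))                           # 7
print(sum(len(v) for v in entities.values()))  # 2
```

This keeps the downstream `total_items` computation well-defined even when the model omits a category or emits an unexpected key.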
## 3) Stage 2 — Compression

**Prompt (exact as provided):**

```python
prompt_template_compression = """You are a scientific document summarizer specializing in category-driven summarization.

Task: Create a concise summary using ONLY {max_items} items from the provided semantic categories (out of {total_items} total items).

Requirements:
- Write the summary in the same language as the original text
- Select the {max_items} most relevant items that align with the original text
- Use content from the original text ONLY when it directly supports these categories
- The summary should read as if the original text was written to illustrate the semantic categories
- Maintain scientific accuracy and use precise terminology
- Ensure logical flow and coherence between concepts

Input:
- Original Text: {text}
- Semantic Categories (in order of priority): {categories}

CRITICAL: You MUST output ONLY a valid JSON object in exactly this format:
{{"response": "Your concise summary here"}}

Do not include any text before or after the JSON object. The summary should be a single continuous text without line breaks.

Output Format (example):
{{"response": "This research focuses on developing novel battery materials using advanced synthesis methods, achieving significant improvements in energy density and cycle stability through optimized electrode design."}}
"""
```

```python
# Choose how many items you want to keep
max_items = 10
categories = list(entities.keys())  # ["core_concepts", "methodologies", ...]
total_items = sum(len(v) for v in entities.values())

compression_prompt = prompt_template_compression.format(
    max_items=max_items,
    total_items=total_items,
    text=abstract,
    categories=categories,
)

compression_output = generate_text(compression_prompt, GEN_COMMON)
compressed = parse_json_object(compression_output)["response"]
print("Compressed summary:", compressed)
```

---
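It can be useful to log how much Stage 2 actually shrank the input. A rough whitespace-token ratio is enough for monitoring (the helper name and sample strings are ours; real token counts would come from the tokenizer):

```python
def compression_ratio(original: str, compressed: str) -> float:
    """Ratio of compressed to original length, in whitespace-delimited tokens."""
    orig_tokens = len(original.split())
    comp_tokens = len(compressed.split())
    return comp_tokens / max(orig_tokens, 1)

# Invented example: a 10-token abstract compressed to 5 tokens
abstract = "one two three four five six seven eight nine ten"
compressed = "one two three four five"
print(round(compression_ratio(abstract, compressed), 2))  # 0.5
```

A ratio near 1.0 suggests the model ignored the `{max_items}` budget and is worth flagging before indexing.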
## 4) Stage 3 — Classification (Patent-focused)

**Prompt (exact as provided):**

````python
prompt_template_classification = """You are a text classification expert specializing in patent documents.
You are given a JSON record for a target patent and a set of Retrieved Similar Items.
Your task is to assign one or more class labels to a given target patent using the provided examples as guidance.

---

**Step-by-Step Instructions:**

1. **Analyze Target and Retrieved Examples:**
- Review each example, paying attention to the class label and how the text reflects it.
- Focus on technical innovation, claims, and patent-specific terminology.

2. **Similarity Scoring (1–5):**
For each Retrieved Similar Item, score along three dimensions and sum to 1–5:
- Domain (0–2):
- 2: Same primary technology field
- 1: Closely related technology
- 0: Unrelated
- Innovation Type (0–2):
- 2: Same type of innovation (e.g., device, method, composition)
- 1: Partial overlap in innovation approach
- 0: Different innovation type
- Application/Material (0–1):
- 1: Shares key technical terms or entities
- 0: Different application/material

3. **Total Score → Similarity Label:**
- 5: Fully similar (Domain=2 + Innovation=2 + Application=1)
- 4: Mostly similar (sum = 4)
- 3: Partially similar (sum = 3)
- 2: Little similarity (sum = 2)
- 1: Irrelevant (sum = 0 or 1)

4. **Make a Classification Decision:**
- Based on all retrieved items, assign the most appropriate class ID(s) to the target.

---

**Response Format:**

1. **Chain-of-Thought** (between `<begin_of_thought>` and `<end_of_thought>`):
- Summarize the target's core innovation, claims, and technical field.
- For each Retrieved Similar Item, analyze its similarity and assign score.
- Conclude with overall comparison.

2. **Final Answer:**
- Provide classification with brief justification.
- Output ONLY the list of class id values.

**Use exactly this structure and STOP immediately after </answer>:**
```
<begin_of_thought>
<p>Target patent analysis... </p>
<p>Reference[Item ID=...], [Similarity=...], judgment text</p>
...
<end_of_thought>
<solution>Overall evaluation=...</solution>
<answer>[Class_label_ID_1, Class_label_ID_2, ...]</answer>
```

**CRITICAL: Your response MUST end with </answer>. Do not add any text after the closing </answer> tag.**

---

**Special Condition:**
- If Total Score ≤ 2:
- `<solution>`: Cannot determine answer
- `<answer>`: None
- Otherwise:
- `<solution>`: Overall evaluation=...
- `<answer>`: [<Class_label_ID_1>, <Class_label_ID_2>, ...]

---

**Input Data:**

- Target ID: {target_id}

- Target Text: {target_text}

- Retrieved Similar Items (Top {retrieved_count}):
{retrieved_items_text}
---

"""
````

```python
# Example retrieved neighbors (use COMPRESSED text for better accuracy/latency)
retrieved = [
    {"id": "US-AAA", "label": "H01M10/0525", "text": "Porous carbon hosts for Li-S cathodes..."},
    {"id": "US-BBB", "label": "H01M4/13", "text": "Conductive polymer binder for sulfur cathode..."},
]

retrieved_items_text = "\n".join(
    f"- ID: {r['id']}\n  Label: {r.get('label', '')}\n  Text: {r['text']}" for r in retrieved
)

classification_prompt = prompt_template_classification.format(
    target_id="TARGET-1",
    target_text=compressed,  # classify on compressed text
    retrieved_count=len(retrieved),
    retrieved_items_text=retrieved_items_text,
)

# Stop at </answer> for clean termination
cls_text = generate_text(classification_prompt, GEN_CLASSIFICATION)
if '</answer>' not in cls_text and '<answer>' in cls_text:
    cls_text += '</answer>'

print(cls_text)
parsed_ids = parse_answer_ids(cls_text)
print("parsed:", parsed_ids)
```

---
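Because vLLM excludes the stop string from the returned text, the snippet above re-appends `</answer>` before parsing. Pulled out as a tiny standalone helper (the name `ensure_answer_closed` is ours, not from the repo), the repair is easy to verify in isolation:

```python
def ensure_answer_closed(text: str) -> str:
    """Re-append </answer> when generation stopped on the tag and dropped it."""
    if '</answer>' not in text and '<answer>' in text:
        return text + '</answer>'
    return text

print(ensure_answer_closed('<solution>ok</solution>\n<answer>[1, 2]'))  # closing tag restored
print(ensure_answer_closed('<answer>[1, 2]</answer>'))                   # already closed, unchanged
print(ensure_answer_closed('no tags here'))                              # no answer block, unchanged
```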
## 5) End-to-End Helper (Optional)

```python
def comp4cls_pipeline(abstract: str, retrieve_fn, k: int = 10) -> dict:
    """
    :param abstract: raw document text
    :param retrieve_fn: function(query_text, k) -> list of dicts [{id, label, text}, ...]
    :param k: top-k neighbors
    :return: {"entities": {...}, "compressed": "...", "classification_raw": "...", "parsed_ids": [...]}
    """
    # Stage 1: Entities
    ent_prompt = prompt_template_entity_extraction.format(abstract=abstract)
    ent_text = generate_text(ent_prompt, GEN_COMMON)
    entities = parse_json_object(ent_text)

    # Stage 2: Compression
    max_items = 10
    categories = list(entities.keys())
    total_items = sum(len(v) for v in entities.values())
    comp_prompt = prompt_template_compression.format(
        max_items=max_items, total_items=total_items, text=abstract, categories=categories
    )
    comp_text = generate_text(comp_prompt, GEN_COMMON)
    compressed = parse_json_object(comp_text)["response"]

    # Stage 3: Retrieval + Classification
    neighbors = retrieve_fn(compressed, k=k)  # [{"id", "label", "text"}, ...]
    retrieved_items_text = "\n".join(
        f"- ID: {r['id']}\n  Label: {r.get('label', '')}\n  Text: {r['text']}" for r in neighbors
    )
    cls_prompt = prompt_template_classification.format(
        target_id="TARGET-1",
        target_text=compressed,
        retrieved_count=len(neighbors),
        retrieved_items_text=retrieved_items_text,
    )
    cls_raw = generate_text(cls_prompt, GEN_CLASSIFICATION)
    if '</answer>' not in cls_raw and '<answer>' in cls_raw:
        cls_raw += '</answer>'
    parsed = parse_answer_ids(cls_raw)
    return {"entities": entities, "compressed": compressed, "classification_raw": cls_raw, "parsed_ids": parsed}
```

---
## 6) Notes

- Stage-1/2 prompts demand **strict JSON**. The helper `parse_json_object` extracts the text between the first `{` and the last `}` and parses it.
- For Stage-3, keep `stop=["</answer>"]` to avoid over-generation and to simplify parsing.
- Swap `MODEL_NAME` for your fine-tuned repo (e.g., `gsjang/lim-4b-1-0826`) if desired.
- Retrieval should use **compressed** texts for both the query and the neighbors.

## Citation

**APA:**

Lim, C. (2026). Comp4Cls: Semantic Compression for Enhanced Retrieval-Augmented Classification of Real-World Scientific and Technical Documents. ICDE 2026 (submitted).

## Glossary [optional]