---
library_name: transformers
tags: []
---

# comp4cls-4B

## Model Description

Comp4Cls is a retrieval-augmented classification framework that uses entity-centric semantic compression to turn long scientific/technical documents into short, task-focused representations for both retrieval and labeling. Documents (papers, patents, and R&D reports) are first compressed into structured summaries that preserve discriminative signals (e.g., core concepts, methods, problems, findings), embedded, and stored in a vector DB. At inference, a query is compressed the same way, nearest neighbors are retrieved, and a small LLM assigns the final class label using the compressed evidence.

The end-to-end workflow—Phase 1: compression + indexing, Phase 2: retrieval + classification—is illustrated in the framework diagram on page 2. Experiments on a large bilingual corpus with hierarchical, multi-label taxonomies show that a 4B-scale Comp4Cls matches or outperforms 8B–14B models, especially in fine-grained categories, while cutting token usage and compute. Moderate compression (often ~20% of entities) preserves retrieval fidelity and boosts downstream F1, enabling lightweight, low-latency deployment in production pipelines. See Table II on page 8 (compression vs. length), Figure 6 on page 9 (retrieval quality under compression), and Figure 7 on page 10 (accuracy vs. larger LLMs).
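As a rough illustration of Phase 1 (compression + indexing), the sketch below mimics the compress → embed → index loop with toy stand-ins: a frequency-based entity filter in place of the model's two-stage LLM prompting, and a hashed bag-of-words encoder in place of a real sentence embedder. The function names (`compress`, `embed`, `add_document`) and the 20% default ratio are illustrative assumptions, not the released API.

```python
import hashlib
import math
from collections import Counter

def compress(text: str, ratio: float = 0.2) -> str:
    """Toy entity-centric compression: keep only sentences that mention
    the most frequent capitalized terms (a stand-in for the paper's
    LLM entity extraction + selective rewriting)."""
    words = [w.strip(".,") for w in text.split()]
    entities = [w for w in words if w[:1].isupper()]
    n_keep = max(1, int(len(set(entities)) * ratio))
    keep = {w for w, _ in Counter(entities).most_common(n_keep)}
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    return ". ".join(s for s in sentences if any(e in s for e in keep))

def embed(text: str, dim: int = 64) -> list[float]:
    """Toy hashed bag-of-words embedding, unit-normalized
    (a stand-in for a real sentence encoder)."""
    vec = [0.0] * dim
    for tok in text.lower().split():
        vec[int(hashlib.md5(tok.encode()).hexdigest(), 16) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

# In-memory stand-in for the vector DB: (embedding, compressed text, label).
index: list[tuple[list[float], str, str]] = []

def add_document(text: str, label: str, ratio: float = 0.2) -> None:
    comp = compress(text, ratio)
    index.append((embed(comp), comp, label))
```

In the actual framework the compression and embedding steps are model calls; this sketch only shows the shape of the data flowing into the vector store.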

## Framework Diagram

Comp4Cls framework diagram
Figure 1. Overview of the **Comp4Cls** framework. The system operates in two phases: (i) documents with predefined class labels are semantically compressed, embedded, and stored in a vector database; (ii) when a new query arrives, it is compressed and used to retrieve the top-$k$ most similar documents from the vector store. The large language model (LLM) then determines the final class label based on the retrieved context. Finally, the compressed query and its assigned label are stored back into the database, enabling downstream services such as document categorization, semantic search, and TL;DR summarization.
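Phase 2 (retrieval + classification) can be sketched in a few lines. Here brute-force dot products stand in for the ANN search (vectors are assumed unit-normalized, so dot product equals cosine similarity), and a majority vote over neighbor labels stands in for the LLM's final labeling step. All names below are illustrative.

```python
from collections import Counter

# An index entry: (unit-norm embedding, class label).
Indexed = tuple[list[float], str]

def top_k(query: list[float], index: list[Indexed], k: int) -> list[Indexed]:
    # Brute-force cosine retrieval; a production system would use an ANN index.
    return sorted(index, key=lambda e: -sum(q * x for q, x in zip(query, e[0])))[:k]

def classify(query: list[float], index: list[Indexed], k: int = 3) -> str:
    # Stand-in for the LLM labeling step: vote over retrieved neighbor labels.
    neighbors = top_k(query, index, k)
    return Counter(label for _, label in neighbors).most_common(1)[0][0]
```

The real system passes the retrieved compressed texts to the 4B LLM as context rather than voting; the vote is only a minimal placeholder for that decision.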

## Key Features

- **Entity-centric Semantic Compression.** Two-stage prompting (entity extraction → selective rewriting) produces concise, structured summaries that retain label-relevant semantics while removing redundancy. The compressor exposes an explicit compression ratio to match accuracy/latency budgets.

- **Retrieval-Augmented Classification (RAG) with Short Contexts.** Operates on compressed texts for both the query and its neighbors, reducing context length and enabling a broader top-k without "lost-in-the-middle" degradation.

- **Small-Model, Big-Model Performance.** With ~20% compression, a 4B backbone matches or exceeds the accuracy of 8B–14B models across domains and taxonomy levels.

- **Demonstrated Efficiency Gains.** Compression reduces input tokens by ~50% on average while maintaining semantic similarity; retrieval accuracy remains near full-text levels.

- **Scales to Real-World, Heterogeneous Corpora.** Trained and evaluated on large bilingual datasets spanning papers, patents, and R&D reports with hierarchical, multi-label taxonomies; robust under domain shift and taxonomy changes.

- **Production-Minded Latency and Throughput.** Shorter prompts cut classification-stage latency; compression allows a higher top-k (≈20–30) before context saturation.

- **Vector-DB-Ready Artifacts.** Outputs compressed texts plus embeddings that plug into standard ANN indices (e.g., HNSW) for high-throughput retrieval in enterprise knowledge systems.

- **Beyond Classification.** The compressed representations support downstream semantic search, TL;DR summaries, and knowledge-organization tasks out of the box.
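The latency/top-k trade-off in the bullets above follows from simple context-budget arithmetic. The numbers below (8K-token context, 2,000-token documents, 20% compression, 512-token prompt overhead) are illustrative assumptions, not figures from the paper:

```python
def max_top_k(context_budget: int, avg_doc_tokens: int,
              compression: float, overhead: int = 512) -> int:
    """Neighbors that fit in the prompt when each retrieved document
    is compressed to `compression` of its original token count."""
    per_doc = max(1, int(avg_doc_tokens * compression))
    return (context_budget - overhead) // per_doc

full = max_top_k(8192, 2000, 1.0)        # full text: 3 neighbors fit
compressed = max_top_k(8192, 2000, 0.2)  # 20% compression: 19 neighbors fit
```

Under these assumptions, compression moves the feasible top-k from a handful of neighbors into the ≈20–30 range described above.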

## Citation [optional]

BibTeX:

[More Information Needed]

APA:

[More Information Needed]

## Glossary [optional]

[More Information Needed]

## More Information [optional]

[More Information Needed]

## Model Card Authors [optional]

[More Information Needed]

## Model Card Contact

[More Information Needed]