
chunker-xlm-roberta-longformer-4096

A token classification model that splits text into semantic units (semantic chunks). It uses a Longformer architecture built on XLM-RoBERTa and can process long texts of up to 4,096 tokens.

Model Details

  • Developed by: CaveduckAI
  • Model type: Token Classification (Sequence Labeling)
  • Base model: XLM-RoBERTa + Longformer
  • Max sequence length: 4,096 tokens
  • Language(s): Multilingual (XLM-RoBERTa based)
  • License: Apache 2.0

Architecture

Intended Use

Primary Use Cases

  • Text Chunking: splitting long documents into semantic units
  • RAG Pipeline: document preprocessing for Retrieval-Augmented Generation
  • Character Description Segmentation: structuring AI character profile text

Out-of-Scope Uses

  • Real-time streaming text processing (batch processing is recommended)
  • Processing a single text longer than 4,096 tokens
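Inputs longer than the limit have to be split before they reach the model. One possible approach is an overlapping sliding window over token ids. The sketch below is an illustration only: the helper name `split_into_windows` and the `overlap` value are hypothetical and not part of this model's code, and the usable budget may be slightly below 4,096 once the tokenizer adds special tokens.

```python
from typing import List


def split_into_windows(token_ids: List[int], max_len: int = 4096,
                       overlap: int = 256) -> List[List[int]]:
    """Split a token sequence into overlapping windows of at most max_len tokens."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must satisfy 0 <= overlap < max_len")
    if len(token_ids) <= max_len:
        return [list(token_ids)]  # short input: a single window is enough

    stride = max_len - overlap  # how far each new window advances
    windows = []
    start = 0
    while True:
        windows.append(list(token_ids[start:start + max_len]))
        if start + max_len >= len(token_ids):
            break  # this window already reaches the end of the input
        start += stride
    return windows
```

Each window can then be chunked independently; how boundaries in the overlapping regions are reconciled is left open here.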

How to Use

Basic Usage

API Server Usage

Parameters

Parameter Type Default Description
string required Input text to split
float 0.6 Boundary probability threshold. Higher values produce fewer chunks
float 0.0025 Exponential weighting factor. Compensates boundary detection in the latter part of the text
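To illustrate how the threshold changes the output, the sketch below cuts a text wherever a boundary score reaches the threshold. The function name `split_at_boundaries` and the one-score-per-character layout are assumptions made for illustration; the model itself scores tokens, and this is not its actual API.

```python
from typing import List


def split_at_boundaries(text: str, scores: List[float],
                        threshold: float = 0.6) -> List[str]:
    """Cut `text` after every position whose boundary score is >= threshold."""
    if len(text) != len(scores):
        raise ValueError("expected one score per character")
    chunks = []
    start = 0
    for i, score in enumerate(scores):
        if score >= threshold:
            chunks.append(text[start:i + 1])  # the boundary position ends the chunk
            start = i + 1
    if start < len(text):
        chunks.append(text[start:])  # trailing text after the last boundary
    return chunks
```

With scores of 0.9 and 0.65 at two candidate boundaries, a threshold of 0.6 yields three chunks, while 0.7 yields two.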

Technical Specifications

Post-Processing Pipeline

The model's raw output goes through the following post-processing pipeline:

  1. Exponential Weighting: applies a weight that depends on position in the text
  2. Wavelet Denoising: removes noise using the Daubechies 4 (db4) wavelet
  3. Center Compensation: corrects the over-emphasis of the central region
  4. MinMax Normalization: normalizes scores to the 0-1 range
  5. Natural Break Point Adjustment: moves boundaries to natural split points such as periods and line breaks
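Three of these steps can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the model's actual implementation: the weighting formula `exp(factor * position)`, the set of break characters, the search `window`, and all function names are hypothetical. Wavelet denoising and center compensation are omitted because their exact form is not described here.

```python
import math
from typing import List

BREAK_CHARS = ".!?\n"  # assumed set of natural break characters


def exponential_weighting(probs: List[float], factor: float = 0.0025) -> List[float]:
    """Scale each score by exp(factor * position), boosting later positions (assumed form)."""
    return [p * math.exp(factor * i) for i, p in enumerate(probs)]


def minmax_normalize(values: List[float]) -> List[float]:
    """Rescale values linearly to the 0-1 range; a constant input maps to all zeros."""
    if not values:
        return []
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def snap_to_break(text: str, index: int, window: int = 20) -> int:
    """Move a cut index to just after the nearest break character within `window`.

    A cut index k splits the text into text[:k] and text[k:]. If no break
    character is found nearby, the index is returned unchanged.
    """
    best = index
    best_dist = None
    lo = max(0, index - window)
    hi = min(len(text), index + window + 1)
    for i in range(lo, hi):
        if text[i] in BREAK_CHARS:
            dist = abs(i + 1 - index)
            if best_dist is None or dist < best_dist:
                best, best_dist = i + 1, dist
    return best
```

For the denoising step, a wavelet library such as PyWavelets provides the `db4` wavelet; the threshold rule and decomposition level would still have to be chosen.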

Requirements

Hardware Requirements

  • Inference: GPU recommended (CUDA 11.8+); CPU is supported
  • VRAM: ~2GB (during inference)

Training Details

Training Data

The model was trained on a dataset of character description texts. The dataset includes a variety of character profiles, background settings, and personality descriptions.

Training Procedure

  • Task: Binary Token Classification (boundary / non-boundary)
  • Loss Function: Cross-Entropy Loss
  • Optimizer: AdamW
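For reference, the standard token-level cross-entropy for a two-class problem can be written as follows. The notation is introduced here for illustration: $N$ is the number of tokens, $y_i \in \{0, 1\}$ is the label of token $i$ (1 meaning boundary), and $p_i$ is the predicted boundary probability. Any class weighting or masking used in the actual training is not documented.

```latex
\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \bigl[\, y_i \log p_i + (1 - y_i) \log (1 - p_i) \,\bigr]
```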

Limitations

  1. Max Length: text longer than 4,096 tokens is truncated
  2. Domain Specific: the model is optimized for character description text, so performance may degrade in other domains
  3. Language Performance: owing to the characteristics of XLM-RoBERTa, performance is best in English and varies across other languages

Citation

Model Files

  • Format: Safetensors
  • Model size: 0.3B params
  • Tensor type: F32