# VELA Methodology

*Heewon Oh · VELA Framework v1.0.0: Korean financial market research agent*
## 1. Reasoning Trace (RT) Format

VELA uses a structured Markdown-based reasoning trace format designed for 7B-parameter models. The format reduces cognitive load compared to JSON while maintaining parseability.

### Format Specification

```
**Step N**:
**Thought**: [Analysis of current state and decision rationale]
**Action**: search | analyze | conclude
**Query**: [Search query - only when Action=search]
**Confidence**: N%
```
### Action Types

| Action | Description | When Used |
|--------|-------------|-----------|
| `search` | Execute web search via Naver/DDG | Need more information |
| `analyze` | Extract findings from collected sources | Enough sources to analyze |
| `conclude` | Synthesize final report | Confidence >= 80% |
### Step Header Variants

The parser handles four header formats:

- `**Step N**:` (standard)
- `**Step N**` (no colon)
- `**Step N: title**` (with title)
- `**Step N - title**` (dash separator)
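All four variants can be matched with one regular expression. The sketch below is illustrative, not the framework's actual parser; the function name is an assumption.

```python
import re

# Matches all four header variants: "**Step 3**:", "**Step 3**",
# "**Step 3: title**", and "**Step 3 - title**".
STEP_HEADER = re.compile(r"^\*\*Step\s+(\d+)(?:\s*[:\-]\s*(.*?))?\*\*:?\s*$")

def parse_step_header(line: str):
    """Return (step_number, optional_title) or None if not a step header."""
    m = STEP_HEADER.match(line.strip())
    if m is None:
        return None
    return int(m.group(1)), (m.group(2) or None)
```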
### Tool Call Preservation

JSON blocks containing `"tool"` + `"params"` keys are preserved verbatim (not converted to Markdown):

```json
{"tool": "search_news", "params": {"query": "SK하이닉스 HBM"}}
```

### Quick Assessment Preservation

JSON blocks with `"category"` + `"sentiment"` keys are also preserved:

```json
{"category": "실적/재무", "sentiment": "positive"}
```
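The two preservation rules reduce to a key-set check. A minimal sketch, assuming JSON blocks have already been isolated as strings (the function name is illustrative; the key sets are the documented contract):

```python
import json

def preserve_verbatim(block: str) -> bool:
    """Return True if a JSON block must be kept as-is rather than
    converted to the Markdown RT format."""
    try:
        obj = json.loads(block)
    except json.JSONDecodeError:
        return False
    if not isinstance(obj, dict):
        return False
    # Tool calls: {"tool": ..., "params": ...}
    if {"tool", "params"} <= obj.keys():
        return True
    # Quick assessments: {"category": ..., "sentiment": ...}
    if {"category", "sentiment"} <= obj.keys():
        return True
    return False
```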
## 2. CoT Protocol

### TODO-Based Reasoning

Unlike simple sequential CoT, VELA generates a TODO list at Step 1 and tracks completion:

1. **Step 1 (Think)**: Generate a TODO list of information needs
   - Each TODO has `id`, `task`, `priority` (critical/high/medium/low), and `status`
2. **Step 2+ (Search/Analyze)**: Work through the TODOs
   - Update status as items are completed
   - Generate new TODOs if gaps are discovered
3. **Conclude**: When all critical TODOs are done and confidence >= 80%
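The protocol above implies a small data model. This is a sketch under assumptions: the field names come from the spec, but the class names, the plain-string priorities, and the `critical_done` helper are illustrative.

```python
from dataclasses import dataclass, field

@dataclass
class TodoItem:
    id: str
    task: str
    priority: str            # "critical" | "high" | "medium" | "low"
    status: str = "pending"  # -> "done" when completed

@dataclass
class TodoList:
    items: list[TodoItem] = field(default_factory=list)

    def critical_done(self) -> bool:
        """True when every critical item has been completed."""
        return all(t.status == "done"
                   for t in self.items if t.priority == "critical")
```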
### Loop Prevention

- Maximum consecutive searches: 2 (then forced to `analyze`)
- Duplicate query detection: previously searched queries are skipped
- Maximum iterations: configurable (default: 5)
### Confidence Gating

```
confidence < 40%  -> Must search (insufficient data)
40% <= conf < 80% -> Can search or analyze
confidence >= 80% -> Can conclude
```

The `should_continue()` function considers:

- Step confidence
- TODO completion rate (especially critical items)
- Maximum iteration count
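A hedged sketch of `should_continue()`: the real signature is not documented here, so the parameters are assumptions derived from the three criteria above, with thresholds from the gating table.

```python
def should_continue(confidence: float,
                    critical_todos_done: bool,
                    iteration: int,
                    max_iterations: int = 5) -> bool:
    """Return True while the agent should keep researching."""
    if iteration >= max_iterations:
        return False  # hard stop at the iteration cap
    if confidence >= 0.80 and critical_todos_done:
        return False  # ready to conclude
    return True       # keep searching / analyzing
```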
## 3. Training Data Pipeline

### SFT Data (58,206 samples)

| Source | Count | Method |
|--------|-------|--------|
| Haiku Batch 1-4 | ~20,000 | Claude Haiku API batch generation |
| Qwen ChatML | ~5,000 | Format conversion |
| Securities Reports | ~5,126 | PDF parsing + RT generation |
| Tool Calling | ~5,000 | Function call format conversion |
| Multi-turn 2T | 4,000 | 2-turn dialogue generation |
| Multi-turn 4T | 4,000 | 4-turn dialogue generation |
| Gap Fill (12 categories) | ~12,600 | Sonar + OpenAI Batch API |
| Others | ~2,480 | Labeled data, batch5 fallback |
### DPO Data (26,421 pairs)

| Source | Pairs | Rejection Type |
|--------|-------|----------------|
| DPO Dedup | 12,000 | Various quality issues |
| Multilingual Aug | 5,997 | Language mixing (CN/EN leak) |
| VELA ChatML | 5,000 | Qwen response quality |
| Batch5 Insufficient | 1,642 | Insufficient analysis |
| Chinese Leak v2 | 1,216 | Chinese character correction |
| Reasoning Trace 2K | 566 | RT format errors |
### DPO Rejection Categories

```
english_leak   (30%): English terms inserted in Korean context
chinese_leak   (30%): Chinese characters from the Qwen base model
format_error   (20%): Broken RT JSON/Markdown structure
short_response (20%): Insufficient analysis depth
```
### RT Format Migration

The original training data used a JSON RT format. It was converted to Markdown to reduce token count and improve parsing reliability for 7B models:

- **Before**: `{"thought": "...", "action": "search", "query": "..."}`
- **After**: `**Thought**: ...\n**Action**: search\n**Query**: ...`
- **Result**: 100% parse success rate across 101,296 RT blocks
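The before/after mapping can be expressed as a short converter. A sketch assuming the key names shown in the examples above; the function name and the optional-`query` handling are illustrative.

```python
import json

def json_rt_to_markdown(block: str) -> str:
    """Convert a legacy JSON RT block to the Markdown RT format."""
    obj = json.loads(block)
    lines = [f"**Thought**: {obj['thought']}",
             f"**Action**: {obj['action']}"]
    if obj.get("query"):  # only present when action == "search"
        lines.append(f"**Query**: {obj['query']}")
    return "\n".join(lines)
```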
## 4. Model Architecture

### Base

- **Model**: Qwen/Qwen2.5-7B-Instruct
- **Why Qwen**: Best multilingual base for CJK at the 7B scale

### SFT Stage

- **LoRA Config**: r=64, alpha=128
- **Learning Rate**: 2e-4
- **Batch Size**: 4 (gradient accumulation 8)
- **Epochs**: ~3 on 58K samples
- **Output**: Merged BF16 weights (15GB)

### DPO Stage

- **LoRA Config**: r=16, alpha=32
- **Beta**: 0.1
- **Learning Rate**: 5e-5
- **Output**: LoRA adapter (155MB)
- **Key Goal**: Eliminate Chinese/English language leaks while preserving reasoning quality
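The two LoRA configurations could be expressed with the Hugging Face `peft` library as below. The `r` and `alpha` values come from the spec; `target_modules` is an assumption (the attention projections are a common choice, but the actual modules are not documented here).

```python
from peft import LoraConfig

# SFT stage: higher-rank adapter (r=64, alpha=128 per the spec).
sft_lora = LoraConfig(
    r=64, lora_alpha=128,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed
    task_type="CAUSAL_LM",
)

# DPO stage: lighter adapter (r=16, alpha=32 per the spec).
dpo_lora = LoraConfig(
    r=16, lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed
    task_type="CAUSAL_LM",
)
```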
### Quantization

- **MLX INT4**: 4GB, 16 tok/s on M1 Max (3.2x faster than BF16 on CPU)
- **GGUF Q4_K_M**: ~4.4GB for llama.cpp compatibility
## 5. Search Architecture

### Multi-Source Strategy

```
Query -> [Naver News API]    -> results (Korean news, high relevance)
      -> [DuckDuckGo]        -> results (international, fallback)
      -> Merge + Deduplicate -> Top-K sources
```
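The merge-and-deduplicate step can be sketched as follows. The result-dict shape and the `"url"` deduplication key are assumptions about the internal data model; only the Naver-first precedence and Top-K cut come from the diagram.

```python
def merge_sources(naver: list[dict], ddg: list[dict],
                  top_k: int = 5) -> list[dict]:
    """Merge Naver-first with DuckDuckGo fallback, dropping duplicate URLs."""
    merged, seen = [], set()
    for result in naver + ddg:  # Naver results take precedence
        url = result.get("url")
        if url and url not in seen:
            seen.add(url)
            merged.append(result)
    return merged[:top_k]
```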
### Naver API Pool

Four parallel API key pools are used for rate-limit management:

- `NAVER_CLIENT_ID_0` / `NAVER_CLIENT_SECRET_0`
- `NAVER_CLIENT_ID_1` / `NAVER_CLIENT_SECRET_1`
- etc.

Keys are selected round-robin with automatic failover.
### Stock Code Resolution

Built-in mapping for major Korean stocks (KOSPI/KOSDAQ top 25):

```python
STOCK_CODE_MAP = {
    "삼성전자": "005930",    # Samsung Electronics
    "SK하이닉스": "000660",  # SK Hynix
    ...
}
```

The map is extensible via configuration.
## 6. Adversary Verification

Optional cross-verification using the Perplexity Sonar API:

1. **Input**: ResearchResult (conclusion + sources + key findings)
2. **Process**: Independent fact-checking against Perplexity's real-time search
3. **Output**: VerificationResult with:
   - `verdict`: ACCEPT / REVISE / NEED_MORE_SEARCH
   - `issues`: List of factual concerns
   - `suggested_counter_queries`: Additional searches recommended
   - `confidence`: Verification confidence score

### Fail-Open Prevention

Verification failures are explicitly recorded rather than silently accepted:

- Failed verification -> `verdict=REVISE` with error details
- No silent `ACCEPT` on API errors
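The fail-closed rule can be sketched as a wrapper around the verification call. The `VerificationResult` fields follow the spec above; `safe_verify` and the injected `verify_with_sonar` callable are illustrative placeholders for the real Perplexity client.

```python
from dataclasses import dataclass, field

@dataclass
class VerificationResult:
    verdict: str  # "ACCEPT" | "REVISE" | "NEED_MORE_SEARCH"
    issues: list[str] = field(default_factory=list)
    suggested_counter_queries: list[str] = field(default_factory=list)
    confidence: float = 0.0

def safe_verify(research_result, verify_with_sonar) -> VerificationResult:
    try:
        return verify_with_sonar(research_result)
    except Exception as exc:
        # Fail closed: an API error is recorded as REVISE with details,
        # never as a silent ACCEPT.
        return VerificationResult(verdict="REVISE",
                                  issues=[f"verification failed: {exc}"])
```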
## 7. Benchmark Methodology

### Evaluation Framework

10 Korean financial-domain prompts scored across 6 dimensions:

| Dimension | Weight | Description |
|-----------|--------|-------------|
| Domain Knowledge | 25% | Korean market terminology, regulations |
| Reasoning Quality | 20% | Logical analysis depth |
| Data Accuracy | 20% | Factual correctness |
| Korean Fluency | 15% | Natural Korean, no language mixing |
| Actionability | 10% | Practical investment insights |
| Structure | 10% | Report organization |

### Scoring

- 100-point scale per prompt
- 4-model comparison: VELA, Qwen base, GPT-4o, Exaone 3.5
- Human evaluation + GPT-4o automated scoring
- Final score: weighted average across all prompts
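The weighted per-prompt score follows directly from the table. A sketch, with the weights taken from the table and the dimension keys and function invented for illustration:

```python
WEIGHTS = {
    "domain_knowledge": 0.25,
    "reasoning_quality": 0.20,
    "data_accuracy": 0.20,
    "korean_fluency": 0.15,
    "actionability": 0.10,
    "structure": 0.10,
}

def prompt_score(dim_scores: dict[str, float]) -> float:
    """Combine per-dimension scores (each 0-100) into one 100-point score."""
    return sum(WEIGHTS[d] * dim_scores[d] for d in WEIGHTS)
```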
### Results Summary

| Model | Score | Key Strength | Key Weakness |
|-------|-------|--------------|--------------|
| **VELA 7B** | **87.5** | Domain depth, structured RT | Smaller context window |
| GPT-4o | 81.0 | General reasoning | Korean financial terminology |
| Exaone 3.5 | 74.5 | Korean fluency | Shallow domain analysis |
| Qwen 2.5 7B | 72.0 | Multilingual base | Chinese/English leaks |