Heewon Oh · feat: initial release of VELA Framework v1.0.0 - Korean financial market research agent (9cef3a3)
# VELA Methodology
## 1. Reasoning Trace (RT) Format
VELA uses a structured Markdown-based reasoning trace format designed for 7B-parameter models. Compared with JSON, the format reduces the model's generation burden while remaining machine-parseable.
### Format Specification
```
**Step N**:
**Thought**: [Analysis of current state and decision rationale]
**Action**: search | analyze | conclude
**Query**: [Search query - only when Action=search]
**Confidence**: N%
```
### Action Types
| Action | Description | When Used |
|--------|-------------|-----------|
| `search` | Execute web search via Naver/DDG | Need more information |
| `analyze` | Extract findings from collected sources | Enough sources to analyze |
| `conclude` | Synthesize final report | Confidence >= 80% |
### Step Header Variants
The parser handles 4 header formats:
- `**Step N**:` (standard)
- `**Step N**` (no colon)
- `**Step N: title**` (with title)
- `**Step N - title**` (dash separator)
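The four variants above can be covered by a single regex. This is a minimal sketch (the function name `parse_step_header` is hypothetical, not necessarily the parser's actual API):

```python
import re

# Matches all four header variants: "**Step N**:", "**Step N**",
# "**Step N: title**", and "**Step N - title**".
STEP_HEADER = re.compile(r"\*\*Step\s+(\d+)(?:\s*[:\-]\s*([^*]*?))?\*\*:?")

def parse_step_header(line):
    """Return (step_number, optional_title) or None if no match."""
    m = STEP_HEADER.search(line)
    if not m:
        return None
    title = (m.group(2) or "").strip() or None
    return int(m.group(1)), title
```

The optional `(?:\s*[:\-]\s*...)` group absorbs both the colon and dash title separators, so one pattern handles all four cases.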
### Tool Call Preservation
JSON blocks containing `"tool"` + `"params"` keys are preserved verbatim (not converted to Markdown):
```json
{"tool": "search_news", "params": {"query": "SKํ•˜์ด๋‹‰์Šค HBM"}}
```
### Quick Assessment Preservation
JSON blocks with `"category"` + `"sentiment"` keys are also preserved:
```json
{"category": "์‹ค์ /์žฌ๋ฌด", "sentiment": "positive"}
```
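Both preservation rules reduce to one key-set check. A minimal sketch (the helper name `should_preserve` is an assumption, not the actual VELA function):

```python
import json

def should_preserve(block):
    """Keep a JSON block verbatim when it is a tool call
    ("tool" + "params" keys) or a quick assessment
    ("category" + "sentiment" keys); otherwise it may be
    converted to the Markdown RT format."""
    try:
        obj = json.loads(block)
    except json.JSONDecodeError:
        return False
    if not isinstance(obj, dict):
        return False
    keys = obj.keys()
    return {"tool", "params"} <= keys or {"category", "sentiment"} <= keys
```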
## 2. CoT Protocol
### TODO-Based Reasoning
Unlike simple sequential CoT, VELA generates a TODO list at Step 1 and tracks completion:
1. **Step 1 (Think)**: Generate TODO list of information needs
- Each TODO has `id`, `task`, `priority` (critical/high/medium/low), `status`
2. **Step 2+ (Search/Analyze)**: Work through TODOs
- Update status as items are completed
- Generate new TODOs if gaps discovered
3. **Conclude**: When all critical TODOs are done and confidence >= 80%
### Loop Prevention
- Maximum consecutive searches: 2 (then forced to `analyze`)
- Duplicate query detection: skip previously searched queries
- Maximum iterations: configurable (default: 5)
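The three rules above can be applied as a filter over the model's proposed action. A sketch under assumed names (`next_action`, `recent`, `seen_queries` are illustrative, not the actual implementation):

```python
def next_action(proposed, recent, seen_queries, query, iteration,
                max_consecutive_searches=2, max_iterations=5):
    """Apply the loop-prevention rules to a proposed action."""
    if iteration >= max_iterations:
        return "conclude"                       # hard iteration cap
    if proposed == "search":
        if query in seen_queries:
            return "analyze"                    # duplicate-query detection
        streak = ["search"] * max_consecutive_searches
        if recent[-max_consecutive_searches:] == streak:
            return "analyze"                    # force analyze after 2 searches
    return proposed
```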
### Confidence Gating
```
confidence < 40% -> Must search (insufficient data)
40% <= conf < 80% -> Can search or analyze
confidence >= 80% -> Can conclude
```
The `should_continue()` function considers:
- Step confidence
- TODO completion rate (especially critical items)
- Maximum iteration count
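The gating table and the three inputs above can be combined as follows. This is a hypothetical simplification of `should_continue()`; the actual signature and return values may differ:

```python
def should_continue(confidence, todos, iteration, max_iterations=5):
    """Decide the next phase from the confidence gates, critical-TODO
    completion, and the iteration cap."""
    critical_done = all(
        t["status"] == "done" for t in todos if t["priority"] == "critical"
    )
    if iteration >= max_iterations:
        return "conclude"                 # forced stop at the cap
    if confidence >= 80 and critical_done:
        return "conclude"                 # gate: may conclude at >= 80%
    if confidence < 40:
        return "search"                   # gate: must search below 40%
    return "search_or_analyze"            # 40-79%: model's choice
```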
## 3. Training Data Pipeline
### SFT Data (58,206 samples)
| Source | Count | Method |
|--------|-------|--------|
| Haiku Batch 1-4 | ~20,000 | Claude Haiku API batch generation |
| Qwen ChatML | ~5,000 | Format conversion |
| Securities Reports | ~5,126 | PDF parsing + RT generation |
| Tool Calling | ~5,000 | Function call format conversion |
| Multi-turn 2T | 4,000 | 2-turn dialogue generation |
| Multi-turn 4T | 4,000 | 4-turn dialogue generation |
| Gap Fill (12 categories) | ~12,600 | Sonar + OpenAI Batch API |
| Others | ~2,480 | Labeled data, batch5 fallback |
### DPO Data (26,421 pairs)
| Source | Pairs | Rejection Type |
|--------|-------|----------------|
| DPO Dedup | 12,000 | Various quality issues |
| Multilingual Aug | 5,997 | Language mixing (CN/EN leak) |
| VELA ChatML | 5,000 | Qwen response quality |
| Batch5 Insufficient | 1,642 | Insufficient analysis |
| Chinese Leak v2 | 1,216 | Chinese character correction |
| Reasoning Trace 2K | 566 | RT format errors |
### DPO Rejection Categories
```
english_leak (30%): English terms inserted in Korean context
chinese_leak (30%): Chinese characters from Qwen base model
format_error (20%): Broken RT JSON/Markdown structure
short_response (20%): Insufficient analysis depth
```
### RT Format Migration
The original training data used a JSON RT format. It was converted to Markdown to reduce token count and improve parse reliability for the 7B model:
- **Before**: `{"thought": "...", "action": "search", "query": "..."}`
- **After**: `**Thought**: ...\n**Action**: search\n**Query**: ...`
- **Result**: 100% parse success rate across 101,296 RT blocks
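The before/after mapping above is a direct field-to-line rewrite. A minimal converter sketch (function name assumed; the production migration likely handles more fields and edge cases):

```python
import json

def json_rt_to_markdown(block):
    """Convert one JSON RT step into the Markdown RT format."""
    obj = json.loads(block)
    lines = [f"**Thought**: {obj['thought']}",
             f"**Action**: {obj['action']}"]
    if obj.get("query"):                      # only present when action=search
        lines.append(f"**Query**: {obj['query']}")
    if obj.get("confidence") is not None:
        lines.append(f"**Confidence**: {obj['confidence']}%")
    return "\n".join(lines)
```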
## 4. Model Architecture
### Base
- **Model**: Qwen/Qwen2.5-7B-Instruct
- **Why Qwen**: Best multilingual base for CJK at 7B scale
### SFT Stage
- **LoRA Config**: r=64, alpha=128
- **Learning Rate**: 2e-4
- **Batch Size**: 4 (gradient accumulation 8)
- **Epochs**: ~3 on 58K samples
- **Output**: Merged BF16 weights (15GB)
### DPO Stage
- **LoRA Config**: r=16, alpha=32
- **Beta**: 0.1
- **Learning Rate**: 5e-5
- **Output**: LoRA adapter (155MB)
- **Key Goal**: Eliminate Chinese/English language leaks while preserving reasoning quality
### Quantization
- **MLX INT4**: 4GB, 16 tok/s on M1 Max (3.2x faster than BF16 CPU)
- **GGUF Q4_K_M**: ~4.4GB for llama.cpp compatibility
## 5. Search Architecture
### Multi-Source Strategy
```
Query -> [Naver News API] -> results (Korean news, high relevance)
-> [DuckDuckGo] -> results (international, fallback)
-> Merge + Deduplicate -> Top-K sources
```
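The merge-and-deduplicate step can be sketched as URL-keyed deduplication that prefers Naver results by ordering (names and result shape are assumptions):

```python
def merge_sources(naver_results, ddg_results, top_k=5):
    """Merge Naver (preferred, listed first) and DuckDuckGo results,
    dropping duplicate URLs, and keep the top-K sources in order."""
    merged, seen = [], set()
    for item in list(naver_results) + list(ddg_results):
        url = item["url"]
        if url not in seen:
            seen.add(url)
            merged.append(item)
    return merged[:top_k]
```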
### Naver API Pool
A pool of 4 parallel API key pairs is used for rate-limit management:
- `NAVER_CLIENT_ID_0` / `NAVER_CLIENT_SECRET_0`
- `NAVER_CLIENT_ID_1` / `NAVER_CLIENT_SECRET_1`
- etc.
Round-robin selection with automatic failover.
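A round-robin pool with failover might look like this (class and method names are illustrative; error handling in the real client is presumably richer):

```python
import itertools
import os

class NaverKeyPool:
    """Rotate over NAVER_CLIENT_ID_0..3 / NAVER_CLIENT_SECRET_0..3,
    skipping keys marked as failed (e.g. after a 429 response)."""

    def __init__(self, size=4):
        self.keys = [
            (os.environ.get(f"NAVER_CLIENT_ID_{i}", ""),
             os.environ.get(f"NAVER_CLIENT_SECRET_{i}", ""))
            for i in range(size)
        ]
        self.failed = set()
        self._cycle = itertools.cycle(range(size))

    def next_key(self):
        """Return the next healthy (client_id, client_secret) pair."""
        for _ in range(len(self.keys)):
            i = next(self._cycle)
            if i not in self.failed:
                return self.keys[i]
        raise RuntimeError("all Naver API keys are rate-limited")

    def mark_failed(self, key):
        self.failed.add(self.keys.index(key))
```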
### Stock Code Resolution
Built-in mapping for major Korean stocks (KOSPI/KOSDAQ top 25):
```python
STOCK_CODE_MAP = {
"์‚ผ์„ฑ์ „์ž": "005930",
"SKํ•˜์ด๋‹‰์Šค": "000660",
...
}
```
Extensible via configuration.
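Configuration-based extension can be sketched as a merge where user-supplied entries override the built-in table (`resolve_stock_code` and the `config_map` parameter are hypothetical names):

```python
# Built-in map (excerpt); extended at runtime from user configuration.
STOCK_CODE_MAP = {
    "삼성전자": "005930",
    "SK하이닉스": "000660",
}

def resolve_stock_code(name, config_map=None):
    """Look up a stock code, letting a user-supplied config map
    override or extend the built-in KOSPI/KOSDAQ top-25 table."""
    merged = {**STOCK_CODE_MAP, **(config_map or {})}
    return merged.get(name)
```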
## 6. Adversary Verification
Optional cross-verification using Perplexity Sonar API:
1. **Input**: ResearchResult (conclusion + sources + key findings)
2. **Process**: Independent fact-checking against Perplexity's real-time search
3. **Output**: VerificationResult with:
- `verdict`: ACCEPT / REVISE / NEED_MORE_SEARCH
- `issues`: List of factual concerns
- `suggested_counter_queries`: Additional searches recommended
- `confidence`: Verification confidence score
### Fail-Open Prevention
Verification failures are explicitly recorded (not silently accepted):
- Failed verification -> `verdict=REVISE` with error details
- No silent `ACCEPT` on API errors
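The fail-closed behavior amounts to a wrapper that converts exceptions into an explicit `REVISE` verdict. A sketch with assumed names (`verify_with_fail_closed`, a dict-shaped `VerificationResult`):

```python
def verify_with_fail_closed(research_result, call_sonar):
    """Wrap the adversary check so API errors surface as REVISE with
    details instead of being silently accepted."""
    try:
        return call_sonar(research_result)
    except Exception as exc:
        return {
            "verdict": "REVISE",
            "issues": [f"verification failed: {exc}"],
            "suggested_counter_queries": [],
            "confidence": 0.0,
        }
```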
## 7. Benchmark Methodology
### Evaluation Framework
10 Korean financial domain prompts across 6 dimensions:
| Dimension | Weight | Description |
|-----------|--------|-------------|
| Domain Knowledge | 25% | Korean market terminology, regulations |
| Reasoning Quality | 20% | Logical analysis depth |
| Data Accuracy | 20% | Factual correctness |
| Korean Fluency | 15% | Natural Korean, no language mixing |
| Actionability | 10% | Practical investment insights |
| Structure | 10% | Report organization |
### Scoring
- 100-point scale per prompt
- 4-model comparison: VELA, Qwen base, GPT-4o, Exaone 3.5
- Human evaluation + GPT-4o automated scoring
- Final score: weighted average across all prompts
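The per-prompt weighting reduces to a dot product of dimension scores and the weights from the table above (dictionary keys are illustrative labels, not the evaluator's actual field names):

```python
WEIGHTS = {
    "domain_knowledge": 0.25,
    "reasoning_quality": 0.20,
    "data_accuracy": 0.20,
    "korean_fluency": 0.15,
    "actionability": 0.10,
    "structure": 0.10,
}

def weighted_score(dimension_scores):
    """Combine per-dimension scores (0-100) into one weighted total."""
    return sum(WEIGHTS[d] * s for d, s in dimension_scores.items())
```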
### Results Summary
| Model | Score | Key Strength | Key Weakness |
|-------|-------|--------------|--------------|
| **VELA 7B** | **87.5** | Domain depth, structured RT | Smaller context window |
| GPT-4o | 81.0 | General reasoning | Korean financial terminology |
| Exaone 3.5 | 74.5 | Korean fluency | Shallow domain analysis |
| Qwen 2.5 7B | 72.0 | Multilingual base | Chinese/English leaks |