---
license: mit
task_categories:
- text-generation
- text-classification
- summarization
language:
- en
tags:
- finance
- financial-qa
- sentiment-analysis
- summarization
- instruction-tuning
- sec-filings
- 10-k
size_categories:
- 1K<n<10K
configs:
- config_name: default
  data_files:
  - split: train
    path: financial_qa.jsonl
---
# Financbase Financial QA Dataset
## Dataset Description
The Financbase Financial QA Dataset is a curated collection of financial question-answering examples for fine-tuning large language models on financial-domain tasks. It covers three task types: question answering, sentiment analysis, and document summarization.
### Dataset Summary
- **Total Examples**: 1,000+ financial Q&A pairs
- **Format**: JSONL (JSON Lines)
- **Language**: English
- **Domain**: Financial services, SEC filings, investment analysis
- **Tasks**: Question answering, sentiment classification, summarization
### Dataset Structure
Each example follows the instruction-tuning format with three fields:
```json
{
  "instruction": "Answer the question clearly for a retail investor.",
  "input": "What is EBITDA?",
  "output": "EBITDA stands for Earnings Before Interest, Taxes, Depreciation, and Amortization. It's a measure of a company's operating performance that excludes non-operating expenses..."
}
```
### Supported Tasks
1. **Financial Question Answering**
- Basic financial concepts (EBITDA, P/E ratio, etc.)
- Investment terminology
- Market analysis questions
2. **Sentiment Analysis**
- Financial news sentiment classification
- Earnings report sentiment
- Market outlook analysis
3. **Document Summarization**
- SEC filing summaries
- Earnings call summaries
- Financial report abstracts
## Usage
### Loading the Dataset
```python
from datasets import load_dataset

# Load the train split
dataset = load_dataset("Financbase/financbase-10k-jsonl", split="train")

# Print the three fields of each example
for example in dataset:
    print(f"Instruction: {example['instruction']}")
    print(f"Input: {example['input']}")
    print(f"Output: {example['output']}")
```
### Training with Transformers
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from datasets import load_dataset

# Load dataset
dataset = load_dataset("Financbase/financbase-10k-jsonl", split="train")

# Render one example as a single instruction/input/response string
def format_example(example):
    return (
        f"### Instruction:\n{example['instruction']}\n\n"
        f"### Input:\n{example['input']}\n\n"
        f"### Response:\n{example['output']}"
    )

# Add a "text" column holding the formatted string
formatted_dataset = dataset.map(lambda x: {"text": format_example(x)})
```
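At inference time the same template is normally rendered without the response, so the model continues from the `### Response:` marker. The sketch below mirrors the `format_example` template above; the helper name `build_prompt` is hypothetical and not part of the dataset or of Transformers.

```python
def build_prompt(instruction: str, input_text: str) -> str:
    """Render the training template up to, but not including, the response."""
    return (
        f"### Instruction:\n{instruction}\n\n"
        f"### Input:\n{input_text}\n\n"
        "### Response:\n"
    )


prompt = build_prompt(
    "Answer the question clearly for a retail investor.",
    "What is EBITDA?",
)
print(prompt)
```

Keeping the training and inference templates byte-for-byte identical up to the response marker avoids a mismatch between what the model saw during fine-tuning and what it is asked to complete.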
### Using with PEFT/LoRA
```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Load base model
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

# Configure LoRA on the attention projection layers
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

# Apply LoRA
model = get_peft_model(model, lora_config)
```
## Data Fields
| Field | Type | Description |
|-------|------|-------------|
| `instruction` | string | The task instruction or prompt |
| `input` | string | The input context or question |
| `output` | string | The expected response or answer |
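As a quick sanity check, each line of the JSONL file can be validated against the three fields in the table above. This is a minimal sketch using only the standard library; the helper name `validate_record` is hypothetical and not part of the dataset tooling.

```python
import json

REQUIRED_FIELDS = ("instruction", "input", "output")


def validate_record(line: str) -> dict:
    """Parse one JSONL line and check that all three fields are strings."""
    record = json.loads(line)
    if not isinstance(record, dict):
        raise ValueError("record is not a JSON object")
    for field in REQUIRED_FIELDS:
        if not isinstance(record.get(field), str):
            raise ValueError(f"missing or non-string field: {field}")
    return record


sample = (
    '{"instruction": "Answer the question clearly for a retail investor.", '
    '"input": "What is EBITDA?", '
    '"output": "EBITDA stands for ..."}'
)
record = validate_record(sample)
print(sorted(record))
```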
## Data Splits
- **train**: 1,000+ examples for training
- **validation**: 100+ examples for validation (future release)
- **test**: 100+ examples for testing (future release)
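Until the validation and test splits are published, a hold-out set can be carved from `train`. The sketch below is one possible approach, not the maintainers' method: it hashes each example's `instruction` and `input` so the assignment is deterministic across runs. The function name `assign_split` and the 10% default are illustrative assumptions.

```python
import hashlib


def assign_split(example: dict, holdout_percent: int = 10) -> str:
    """Deterministically assign an example to 'train' or 'validation'."""
    # Key on the prompt fields so identical prompts always land in the same split.
    key = (example["instruction"] + "\x1f" + example["input"]).encode("utf-8")
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % 100  # 0..99
    return "validation" if bucket < holdout_percent else "train"


example = {
    "instruction": "Answer the question clearly for a retail investor.",
    "input": "What is EBITDA?",
}
print(assign_split(example))
```

For a random rather than content-keyed split, the `datasets` library's `Dataset.train_test_split(test_size=..., seed=...)` is an alternative.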
## Data Collection
### Sources
- SEC 10-K filings (processed and chunked)
- Financial news articles
- Investment research reports
- Financial education materials
- Curated financial Q&A pairs
### Preprocessing
1. **Document Chunking**: Long documents split into ≤1800 token chunks
2. **Section Preservation**: Maintains document structure and headings
3. **Quality Filtering**: Removes low-quality or irrelevant examples
4. **Format Standardization**: Ensures consistent instruction/input/output format
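The chunking step can be illustrated roughly as follows. This is a sketch under stated assumptions, not the actual pipeline: the card does not name the tokenizer, so whitespace-separated words stand in for tokens here, and section preservation is not implemented. The function names are hypothetical.

```python
from typing import List


def chunk_tokens(tokens: List[str], max_tokens: int = 1800) -> List[List[str]]:
    """Split a token list into consecutive chunks of at most max_tokens."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    return [tokens[i:i + max_tokens] for i in range(0, len(tokens), max_tokens)]


def chunk_document(text: str, max_tokens: int = 1800) -> List[str]:
    """Chunk a document, treating whitespace-separated words as tokens (assumption)."""
    return [" ".join(chunk) for chunk in chunk_tokens(text.split(), max_tokens)]


chunks = chunk_document("a b c d e", max_tokens=2)
print(chunks)
```

With a real model tokenizer, the same slicing would be applied to token IDs instead of words, so that the 1800-token limit matches what the model actually sees.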
## Compliance and Safety
### Financial Compliance
- **No Investment Advice**: Dataset does not contain personalized investment recommendations
- **Educational Purpose**: Designed for educational and research use
- **Source Attribution**: All examples traceable to original sources
- **Regulatory Compliance**: Follows financial data handling best practices
### Content Filtering
- Removed personally identifiable information (PII)
- Filtered out actionable trading directives
- Excluded copyrighted material
- Sanitized sensitive financial data
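To give a sense of what PII removal can look like, here is a deliberately small sketch that redacts email addresses and US Social Security number patterns with regular expressions. The patterns, placeholders, and the `redact_pii` name are assumptions for illustration; the card does not describe the filters actually used, and production PII filtering needs far broader coverage.

```python
import re

# Illustrative patterns only; real PII detection covers many more cases.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def redact_pii(text: str) -> str:
    """Replace email addresses and SSN-like numbers with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return SSN_RE.sub("[SSN]", text)


print(redact_pii("Contact jane.doe@example.com, SSN 123-45-6789."))
```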
## Evaluation
### Metrics
- **Perplexity**: How well a model predicts held-out financial text (lower is better)
- **BLEU Score**: N-gram precision of generated responses against references, used here for summarization tasks
- **Accuracy**: Fraction of sentiment labels classified correctly
- **ROUGE Score**: Overlap between generated and reference summaries
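Two of these metrics are simple enough to compute directly. The sketch below shows accuracy over sentiment labels and perplexity as the exponential of the mean negative log-likelihood, assuming per-token natural-log probabilities are already available from a model. BLEU and ROUGE are usually computed with a dedicated library and are not shown. The function names are illustrative.

```python
import math
from typing import List, Sequence


def accuracy(predictions: Sequence[str], labels: Sequence[str]) -> float:
    """Fraction of predictions that exactly match their labels."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and equal length")
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def perplexity(token_log_probs: List[float]) -> float:
    """exp(mean negative log-likelihood) over natural-log token probabilities."""
    if not token_log_probs:
        raise ValueError("token_log_probs must be non-empty")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


preds = ["positive", "neutral", "negative", "neutral"]
gold = ["positive", "negative", "negative", "neutral"]
print(accuracy(preds, gold))
print(perplexity([math.log(0.5)] * 4))
```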
### Benchmark Tasks
1. **Financial QA**: Answer financial questions accurately
2. **Sentiment Analysis**: Classify financial sentiment (positive/negative/neutral)
3. **Summarization**: Summarize financial documents concisely
## Limitations
- **Language**: English only
- **Domain**: Primarily US financial markets
- **Temporal**: Data from 2020-2024 (may become outdated)
- **Bias**: Reflects the biases and limitations of its source material
## Citation
```bibtex
@dataset{financbase_financial_qa_2024,
title={Financbase Financial QA Dataset},
author={Financbase Team},
year={2024},
url={https://huggingface.co/datasets/Financbase/financbase-10k-jsonl},
license={MIT}
}
```
## License
This dataset is released under the MIT License. See the LICENSE file for details.
## Contact
- **Organization**: Financbase
- **Repository**: https://huggingface.co/datasets/Financbase/financbase-10k-jsonl
- **Issues**: Report issues via the Hugging Face Hub
## Changelog
- **v0.1** (2024-12-19): Initial release with 1,000+ financial Q&A examples
- **v0.2** (Planned): Add validation and test splits
- **v0.3** (Planned): Expand to 10,000+ examples with more diverse sources