---
license: mit
task_categories:
- text-generation
- text-classification
- summarization
language:
- en
tags:
- finance
- financial-qa
- sentiment-analysis
- summarization
- instruction-tuning
- sec-filings
- 10-k
size_categories:
- 1K<n<10K
configs:
- config_name: default
  data_files:
  - split: train
    path: financial_qa.jsonl
---
|
|
|
|
|
# Financbase Financial QA Dataset |
|
|
|
|
|
## Dataset Description |
|
|
|
|
|
The Financbase Financial QA Dataset is a curated collection of financial question-answering examples for training large language models on financial-domain tasks, including question answering, sentiment analysis, and document summarization.
|
|
|
|
|
### Dataset Summary |
|
|
|
|
|
- **Total Examples**: 1,000+ financial Q&A pairs |
|
|
- **Format**: JSONL (JSON Lines) |
|
|
- **Language**: English |
|
|
- **Domain**: Financial services, SEC filings, investment analysis |
|
|
- **Tasks**: Question answering, sentiment classification, summarization |
|
|
|
|
|
### Dataset Structure |
|
|
|
|
|
Each example follows the instruction-tuning format with three fields: |
|
|
|
|
|
```json
{
  "instruction": "Answer the question clearly for a retail investor.",
  "input": "What is EBITDA?",
  "output": "EBITDA stands for Earnings Before Interest, Taxes, Depreciation, and Amortization. It's a measure of a company's operating performance that excludes non-operating expenses..."
}
```
|
|
|
|
|
### Supported Tasks |
|
|
|
|
|
1. **Financial Question Answering** |
|
|
- Basic financial concepts (EBITDA, P/E ratio, etc.) |
|
|
- Investment terminology |
|
|
- Market analysis questions |
|
|
|
|
|
2. **Sentiment Analysis** |
|
|
- Financial news sentiment classification |
|
|
- Earnings report sentiment |
|
|
- Market outlook analysis |
|
|
|
|
|
3. **Document Summarization** |
|
|
- SEC filing summaries |
|
|
- Earnings call summaries |
|
|
- Financial report abstracts |
|
|
|
|
|
## Usage |
|
|
|
|
|
### Loading the Dataset |
|
|
|
|
|
```python
from datasets import load_dataset

# Load the train split
dataset = load_dataset("Financbase/financbase-10k-jsonl", split="train")

# Access examples
for example in dataset:
    print(f"Instruction: {example['instruction']}")
    print(f"Input: {example['input']}")
    print(f"Output: {example['output']}")
```
|
|
|
|
|
### Training with Transformers |
|
|
|
|
|
```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Load dataset
dataset = load_dataset("Financbase/financbase-10k-jsonl", split="train")

# Format each record as a single instruction-tuning prompt
def format_example(example):
    return (
        f"### Instruction:\n{example['instruction']}\n\n"
        f"### Input:\n{example['input']}\n\n"
        f"### Response:\n{example['output']}"
    )

# Apply formatting, then tokenize the resulting text
formatted_dataset = dataset.map(lambda x: {"text": format_example(x)})
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
tokenized_dataset = formatted_dataset.map(
    lambda x: tokenizer(x["text"], truncation=True), batched=True
)
```
|
|
|
|
|
### Using with PEFT/LoRA |
|
|
|
|
|
```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Load base model
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

# Configure LoRA
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

# Apply LoRA adapters to the base model
model = get_peft_model(model, lora_config)
```
|
|
|
|
|
## Data Fields |
|
|
|
|
|
| Field | Type | Description | |
|
|
|-------|------|-------------| |
|
|
| `instruction` | string | The task instruction or prompt | |
|
|
| `input` | string | The input context or question | |
|
|
| `output` | string | The expected response or answer | |
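Because the data ships as JSON Lines, the three-field schema above is easy to check record by record. A minimal validation sketch (the sample line is illustrative, not an actual dataset record):

```python
import json

REQUIRED_FIELDS = ("instruction", "input", "output")

def validate_record(line: str) -> dict:
    """Parse one JSONL line and verify the three required string fields."""
    record = json.loads(line)
    for field in REQUIRED_FIELDS:
        if not isinstance(record.get(field), str):
            raise ValueError(f"missing or non-string field: {field}")
    return record

# Illustrative JSONL line (not taken from the dataset)
sample = '{"instruction": "Define the term.", "input": "What is EBITDA?", "output": "Earnings before interest, taxes, depreciation, and amortization."}'
record = validate_record(sample)
print(record["input"])  # What is EBITDA?
```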
|
|
|
|
|
## Data Splits |
|
|
|
|
|
- **train**: 1,000+ examples for training |
|
|
- **validation**: 100+ examples for validation (future release) |
|
|
- **test**: 100+ examples for testing (future release) |
|
|
|
|
|
## Data Collection |
|
|
|
|
|
### Sources |
|
|
|
|
|
- SEC 10-K filings (processed and chunked) |
|
|
- Financial news articles |
|
|
- Investment research reports |
|
|
- Financial education materials |
|
|
- Curated financial Q&A pairs |
|
|
|
|
|
### Preprocessing |
|
|
|
|
|
1. **Document Chunking**: Long documents are split into chunks of at most 1,800 tokens
|
|
2. **Section Preservation**: Maintains document structure and headings |
|
|
3. **Quality Filtering**: Removes low-quality or irrelevant examples |
|
|
4. **Format Standardization**: Ensures consistent instruction/input/output format |
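The chunking step above can be sketched as a greedy packer. This sketch uses a whitespace token count as a stand-in for a real model tokenizer, so it is only an approximation; `max_tokens=1800` mirrors the limit stated above:

```python
def chunk_document(text: str, max_tokens: int = 1800) -> list[str]:
    """Greedily pack whitespace-separated tokens into chunks of at most max_tokens."""
    tokens = text.split()
    chunks = []
    for start in range(0, len(tokens), max_tokens):
        chunks.append(" ".join(tokens[start:start + max_tokens]))
    return chunks

doc = "word " * 4000  # synthetic 4000-token document
chunks = chunk_document(doc)
print(len(chunks))  # 3 chunks: 1800 + 1800 + 400 tokens
```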
|
|
|
|
|
## Compliance and Safety |
|
|
|
|
|
### Financial Compliance |
|
|
|
|
|
- **No Investment Advice**: Dataset does not contain personalized investment recommendations |
|
|
- **Educational Purpose**: Designed for educational and research use |
|
|
- **Source Attribution**: All examples traceable to original sources |
|
|
- **Regulatory Compliance**: Follows financial data handling best practices |
|
|
|
|
|
### Content Filtering |
|
|
|
|
|
- Removed personally identifiable information (PII) |
|
|
- Filtered out actionable trading directives |
|
|
- Excluded copyrighted material |
|
|
- Sanitized sensitive financial data |
|
|
|
|
|
## Evaluation |
|
|
|
|
|
### Metrics |
|
|
|
|
|
- **Perplexity**: Language-modeling fit on financial text (lower is better)
|
|
- **BLEU Score**: Response quality for summarization tasks |
|
|
- **Accuracy**: Classification accuracy for sentiment analysis |
|
|
- **ROUGE Score**: Summarization quality metrics |
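Of these metrics, classification accuracy is the simplest to reproduce. A minimal sketch for the three-way sentiment task (the labels below are illustrative):

```python
def accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions that exactly match the reference labels."""
    assert len(predictions) == len(references)
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)

preds = ["positive", "negative", "neutral", "positive"]
refs = ["positive", "neutral", "neutral", "positive"]
print(accuracy(preds, refs))  # 0.75
```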
|
|
|
|
|
### Benchmark Tasks |
|
|
|
|
|
1. **Financial QA**: Answer financial questions accurately |
|
|
2. **Sentiment Analysis**: Classify financial sentiment (positive/negative/neutral) |
|
|
3. **Summarization**: Summarize financial documents concisely |
|
|
|
|
|
## Limitations |
|
|
|
|
|
- **Language**: English only |
|
|
- **Domain**: Primarily US financial markets |
|
|
- **Temporal**: Data from 2020-2024 (may become outdated) |
|
|
- **Bias**: Reflects training data biases and limitations |
|
|
|
|
|
## Citation |
|
|
|
|
|
```bibtex
@dataset{financbase_financial_qa_2024,
  title={Financbase Financial QA Dataset},
  author={Financbase Team},
  year={2024},
  url={https://huggingface.co/datasets/Financbase/financbase-10k-jsonl},
  license={MIT}
}
```
|
|
|
|
|
## License |
|
|
|
|
|
This dataset is released under the MIT License. See LICENSE file for details. |
|
|
|
|
|
## Contact |
|
|
|
|
|
- **Organization**: Financbase |
|
|
- **Repository**: https://huggingface.co/datasets/Financbase/financbase-10k-jsonl |
|
|
- **Issues**: Report issues via HuggingFace Hub |
|
|
|
|
|
## Changelog |
|
|
|
|
|
- **v0.1** (2024-12-19): Initial release with 1,000+ financial Q&A examples |
|
|
- **v0.2** (Planned): Add validation and test splits |
|
|
- **v0.3** (Planned): Expand to 10,000+ examples with more diverse sources |
|
|
|