---
license: mit
task_categories:
- text-generation
- text-classification
- summarization
language:
- en
tags:
- finance
- financial-qa
- sentiment-analysis
- summarization
- instruction-tuning
- sec-filings
- 10-k
size_categories:
- 1K<n<10K
configs:
- config_name: default
  data_files:
  - split: train
    path: financial_qa.jsonl
---
# Financbase Financial QA Dataset
## Dataset Description
The Financbase Financial QA Dataset is a curated collection of instruction-formatted financial examples for fine-tuning large language models on financial domain tasks. It covers three tasks: question answering, sentiment analysis, and document summarization.
### Dataset Summary
- **Total Examples**: 1,000+ financial Q&A pairs
- **Format**: JSONL (JSON Lines)
- **Language**: English
- **Domain**: Financial services, SEC filings, investment analysis
- **Tasks**: Question answering, sentiment classification, summarization
### Dataset Structure
Each example follows the instruction-tuning format with three fields:
```json
{
  "instruction": "Answer the question clearly for a retail investor.",
  "input": "What is EBITDA?",
  "output": "EBITDA stands for Earnings Before Interest, Taxes, Depreciation, and Amortization. It's a measure of a company's operating performance that excludes non-operating expenses..."
}
```
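A minimal sketch of how a single JSONL line could be checked against this three-field layout, using only the standard library. The helper name `parse_record` and the shortened sample line are illustrative assumptions, not part of the dataset tooling.

```python
import json

# The three fields described in this card.
REQUIRED_FIELDS = ("instruction", "input", "output")


def parse_record(line: str) -> dict:
    """Parse one JSONL line and check that all three fields are strings.

    Hypothetical helper for illustration; the dataset ships no validator.
    """
    record = json.loads(line)
    if not isinstance(record, dict):
        raise ValueError("each line must be a JSON object")
    for field in REQUIRED_FIELDS:
        if not isinstance(record.get(field), str):
            raise ValueError(f"missing or non-string field: {field!r}")
    return record


# Shortened version of the example record above.
sample = (
    '{"instruction": "Answer the question clearly for a retail investor.", '
    '"input": "What is EBITDA?", '
    '"output": "EBITDA stands for Earnings Before Interest, Taxes, '
    'Depreciation, and Amortization."}'
)
record = parse_record(sample)
print(record["input"])
```

Running a check like this over every line of `financial_qa.jsonl` before training can catch malformed rows early.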
### Supported Tasks
1. **Financial Question Answering**
   - Basic financial concepts (EBITDA, P/E ratio, etc.)
   - Investment terminology
   - Market analysis questions
2. **Sentiment Analysis**
   - Financial news sentiment classification
   - Earnings report sentiment
   - Market outlook analysis
3. **Document Summarization**
   - SEC filing summaries
   - Earnings call summaries
   - Financial report abstracts
## Usage
### Loading the Dataset
```python
from datasets import load_dataset
# Load the dataset
dataset = load_dataset("Financbase/financbase-10k-jsonl", split="train")
# Access examples
for example in dataset:
    print(f"Instruction: {example['instruction']}")
    print(f"Input: {example['input']}")
    print(f"Output: {example['output']}")
```
### Training with Transformers
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from datasets import load_dataset
# Load dataset
dataset = load_dataset("Financbase/financbase-10k-jsonl", split="train")
# Format for training
def format_example(example):
    return f"### Instruction:\n{example['instruction']}\n\n### Input:\n{example['input']}\n\n### Response:\n{example['output']}"
# Apply formatting
formatted_dataset = dataset.map(lambda x: {"text": format_example(x)})
```
### Using with PEFT/LoRA
```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM
# Load base model
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
# Configure LoRA
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
# Apply LoRA
model = get_peft_model(model, lora_config)
```
## Data Fields
| Field | Type | Description |
|-------|------|-------------|
| `instruction` | string | The task instruction or prompt |
| `input` | string | The input context or question |
| `output` | string | The expected response or answer |
## Data Splits
- **train**: 1,000+ examples for training
- **validation**: 100+ examples for validation (future release)
- **test**: 100+ examples for testing (future release)
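Until the validation and test splits are published, one option is to hold out part of `train` yourself. The sketch below is an assumption about how that might be done, not the split the maintainers plan to release; `holdout_split`, the 10% fraction, and the seed are all illustrative choices.

```python
import random


def holdout_split(examples, val_fraction=0.1, seed=42):
    """Shuffle a copy of `examples` and carve off a validation subset.

    Returns (train, validation). A fixed seed keeps the split reproducible.
    """
    if not 0.0 < val_fraction < 1.0:
        raise ValueError("val_fraction must be between 0 and 1")
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]


# Stand-in records; in practice these would be the loaded examples.
examples = [{"id": i} for i in range(1000)]
train, validation = holdout_split(examples)
print(len(train), len(validation))
```

If you load the data with the `datasets` library, `Dataset.train_test_split(test_size=0.1, seed=42)` does the same job without custom code.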
## Data Collection
### Sources
- SEC 10-K filings (processed and chunked)
- Financial news articles
- Investment research reports
- Financial education materials
- Curated financial Q&A pairs
### Preprocessing
1. **Document Chunking**: Long documents split into ≤1800 token chunks
2. **Section Preservation**: Maintains document structure and headings
3. **Quality Filtering**: Removes low-quality or irrelevant examples
4. **Format Standardization**: Ensures consistent instruction/input/output format
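The chunking and section-preservation steps above can be sketched roughly as follows. This card does not say which tokenizer was used, so the example counts whitespace-separated words as a stand-in for tokens, and repeating the section heading at the start of each chunk is only one possible way to keep document structure. The function name `chunk_tokens` is hypothetical.

```python
def chunk_tokens(text, max_tokens=1800, heading=""):
    """Split `text` into chunks of at most `max_tokens` tokens.

    Assumption: tokens are approximated by whitespace-separated words.
    If `heading` is given, it is prepended to every chunk and counted
    against the token budget.
    """
    words = text.split()
    heading_words = heading.split()
    budget = max_tokens - len(heading_words)
    if budget <= 0:
        raise ValueError("heading leaves no room for content")
    chunks = []
    for start in range(0, len(words), budget):
        piece = heading_words + words[start:start + budget]
        chunks.append(" ".join(piece))
    return chunks


# Synthetic 4,000-word "section" to show the behaviour.
section_text = " ".join(f"w{i}" for i in range(4000))
chunks = chunk_tokens(section_text, max_tokens=1800, heading="Item 1A. Risk Factors")
print(len(chunks), [len(c.split()) for c in chunks])
```

With a real model tokenizer the word count would be replaced by the tokenizer's own length, since subword tokenizers usually produce more tokens than words.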
## Compliance and Safety
### Financial Compliance
- **No Investment Advice**: Dataset does not contain personalized investment recommendations
- **Educational Purpose**: Designed for educational and research use
- **Source Attribution**: All examples traceable to original sources
- **Regulatory Compliance**: Follows financial data handling best practices
### Content Filtering
- Removed personally identifiable information (PII)
- Filtered out actionable trading directives
- Excluded copyrighted material
- Sanitized sensitive financial data
## Evaluation
### Metrics
- **Perplexity**: How well a model predicts held-out financial text (lower is better)
- **BLEU Score**: N-gram precision of generated responses against reference outputs
- **Accuracy**: Classification accuracy for sentiment analysis
- **ROUGE Score**: N-gram overlap between generated and reference summaries
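As a rough illustration of two of these metrics, the sketch below computes perplexity from per-token log-probabilities and accuracy from predicted labels. The input values are made up for the example and are not results on this dataset; BLEU and ROUGE are left out because they are normally computed with a dedicated library.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp of the mean negative log-likelihood (natural log)."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def accuracy(predictions, labels):
    """Fraction of predictions that exactly match the reference labels."""
    if not labels or len(predictions) != len(labels):
        raise ValueError("predictions and labels must be non-empty and equal length")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


# Illustrative inputs only: a model that assigns probability 0.25 to every token.
log_probs = [math.log(0.25)] * 4
print(perplexity(log_probs))

preds = ["positive", "neutral", "negative", "neutral"]
gold = ["positive", "negative", "negative", "neutral"]
print(accuracy(preds, gold))
```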
### Benchmark Tasks
1. **Financial QA**: Answer financial questions accurately
2. **Sentiment Analysis**: Classify financial sentiment (positive/negative/neutral)
3. **Summarization**: Summarize financial documents concisely
## Limitations
- **Language**: English only
- **Domain**: Primarily US financial markets
- **Temporal**: Data from 2020-2024 (may become outdated)
- **Bias**: Reflects the biases and limitations of its source material
## Citation
```bibtex
@dataset{financbase_financial_qa_2024,
  title={Financbase Financial QA Dataset},
  author={Financbase Team},
  year={2024},
  url={https://huggingface.co/datasets/Financbase/financbase-10k-jsonl},
  license={MIT}
}
```
## License
This dataset is released under the MIT License. See the LICENSE file for details.
## Contact
- **Organization**: Financbase
- **Repository**: https://huggingface.co/datasets/Financbase/financbase-10k-jsonl
- **Issues**: Report issues via the Hugging Face Hub
## Changelog
- **v0.1** (2024-12-19): Initial release with 1,000+ financial Q&A examples
- **v0.2** (Planned): Add validation and test splits
- **v0.3** (Planned): Expand to 10,000+ examples with more diverse sources