Climate Change Reddit Discussion Dataset for gpt-oss-20b
Overview
This dataset is designed for instruction finetuning OpenAI's gpt-oss-20b model on Reddit-style climate change discussions. The model will learn to generate contextually appropriate Reddit posts (submissions and comments) at various depths within discussion threads.
Dataset Format
The dataset uses OpenAI's harmony response format, which is required for gpt-oss-20b models.
Files Created
climatechange_reddit_gpt_oss_hf/ - Hugging Face Dataset format (41MB)
train/ - 9,000 training examples
validation/ - 1,000 validation examples
climatechange_reddit_gpt_oss_train.jsonl - Training data in JSONL format (39MB)
climatechange_reddit_gpt_oss_val.jsonl - Validation data in JSONL format (4.1MB)
Task Description
The finetuning task is a masked post prediction task:
- Each example contains a sequence of 4 Reddit posts from a discussion thread
- One post is randomly masked (can be at any position 1-4)
- The model must generate the masked post based on the surrounding context
- Posts can be at any depth (0-97), not just top-level submissions
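The masking step described above can be sketched as follows (a minimal illustration; the function and field names are hypothetical, not taken from the actual generation script):

```python
import random

def make_masked_example(sequence, rng=random.Random(0)):
    """Given a list of 4 posts (dicts with at least a 'text' field),
    mask one at random and return (context, target_text, 1-based position)."""
    pos = rng.randrange(len(sequence))  # 0-based index of the post to mask
    target = sequence[pos]
    context_parts = []
    for i, post in enumerate(sequence):
        if i == pos:
            context_parts.append("[MASKED POST - TO BE GENERATED]")
        else:
            context_parts.append(post["text"])
    return "\n\n".join(context_parts), target["text"], pos + 1

posts = [{"text": f"post {i}"} for i in range(4)]
context, target, position = make_masked_example(posts)
```

The model sees `context` (three visible posts plus the mask marker) and is trained to reproduce `target`.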
Data Statistics
Overall
- Total examples: 10,000
- Train set: 9,000 (90%)
- Validation set: 1,000 (10%)
Masked Position Distribution
- Position 1: ~25.6%
- Position 2: ~25.3%
- Position 3: ~25.0%
- Position 4: ~24.1%
Masked Type Distribution
- Comments: ~93.9%
- Submissions: ~6.1%
Reasoning Level Distribution
- High: ~33.5%
- Medium: ~33.4%
- Low: ~33.1%
Top 10 Masked Depths
- Depth 3: ~15.9%
- Depth 2: ~15.4%
- Depth 4: ~12.7%
- Depth 1: ~10.8%
- Depth 5: ~8.8%
- Depth 6: ~6.7%
- Depth 0: ~6.1%
- Depth 7: ~4.7%
- Depth 8: ~3.7%
- Depth 9: ~2.5%
Harmony Format Structure
Each training example follows this structure:
<|start|>system<|message|>You are a helpful AI assistant participating in Reddit discussions about climate change...
Reasoning: [low|medium|high]
# Valid channels: analysis, final<|end|>
<|start|>user<|message|>[INSTRUCTION]
Here is the conversation thread with one post marked as [MASKED POST - TO BE GENERATED]:
[CONTEXT WITH MASKED POST]<|end|>
<|start|>assistant<|channel|>analysis<|message|>[BRIEF REASONING]<|end|>
<|start|>assistant<|channel|>final<|message|>[GENERATED POST]<|return|>
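Assembling such a string from the schema fields can be sketched like this (illustrative only; the dataset's `text` field already contains the fully formatted conversation, and the channel tokens follow OpenAI's published harmony format):

```python
def build_harmony_example(instruction, context, reasoning_level, analysis, response):
    """Assemble one harmony-format training string from its parts."""
    system = (
        "You are a helpful AI assistant participating in Reddit discussions "
        "about climate change...\n"
        f"Reasoning: {reasoning_level}\n"
        "# Valid channels: analysis, final"
    )
    user = f"{instruction}\n{context}"
    return (
        f"<|start|>system<|message|>{system}<|end|>"
        f"<|start|>user<|message|>{user}<|end|>"
        f"<|start|>assistant<|channel|>analysis<|message|>{analysis}<|end|>"
        f"<|start|>assistant<|channel|>final<|message|>{response}<|return|>"
    )

text = build_harmony_example(
    "Generate the masked post.", "[CONTEXT]", "low",
    "Brief reasoning about the thread.", "The generated reply.",
)
```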
Dataset Schema
Each JSONL record contains:
{
  "text": "Full harmony-formatted conversation (for training)",
  "instruction": "Human-readable instruction",
  "context": "Conversation context without masked post",
  "response": "Expected response (masked post content)",
  "reasoning_level": "low|medium|high",
  "sequence_id": 1234,
  "masked_position": 2,
  "masked_depth": 3,
  "masked_type": "comment|submission",
  "post_id": "abc123",
  "author": "username"
}
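A record can be checked against this schema before use; a minimal validator sketch (the helper name is illustrative):

```python
import json

REQUIRED_KEYS = {
    "text", "instruction", "context", "response", "reasoning_level",
    "sequence_id", "masked_position", "masked_depth", "masked_type",
    "post_id", "author",
}

def validate_record(line: str) -> dict:
    """Parse one JSONL line and check it against the schema above."""
    rec = json.loads(line)
    missing = REQUIRED_KEYS - rec.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    assert rec["reasoning_level"] in {"low", "medium", "high"}
    assert rec["masked_type"] in {"comment", "submission"}
    assert 1 <= rec["masked_position"] <= 4
    return rec

sample = {
    "text": "<|start|>system<|message|>...", "instruction": "Generate the masked post.",
    "context": "[CONTEXT]", "response": "A reply.", "reasoning_level": "medium",
    "sequence_id": 1234, "masked_position": 2, "masked_depth": 3,
    "masked_type": "comment", "post_id": "abc123", "author": "username",
}
rec = validate_record(json.dumps(sample))
```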
Usage Example
Loading the Dataset
from datasets import load_from_disk
# Load the dataset
dataset = load_from_disk('climatechange_reddit_gpt_oss_hf')
# Access splits
train_data = dataset['train']
val_data = dataset['validation']
# The 'text' column contains harmony-formatted conversations
print(train_data[0]['text'])
Alternative: Load from JSONL
from datasets import load_dataset
dataset = load_dataset('json', data_files={
    'train': 'climatechange_reddit_gpt_oss_train.jsonl',
    'validation': 'climatechange_reddit_gpt_oss_val.jsonl'
})
Finetuning with Hugging Face
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    DataCollatorForLanguageModeling,
    TrainingArguments,
    Trainer,
)

# Load model and tokenizer
model_name = "openai/gpt-oss-20b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Tokenize the harmony-formatted 'text' column; the collator below pads per batch
def tokenize_function(examples):
    return tokenizer(examples['text'], truncation=True, max_length=2048)

tokenized_train = train_data.map(tokenize_function, batched=True, remove_columns=train_data.column_names)
tokenized_val = val_data.map(tokenize_function, batched=True, remove_columns=val_data.column_names)

# Causal-LM collator: pads each batch and copies input_ids into labels,
# without which the Trainer has no loss to optimize
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

# Training arguments
training_args = TrainingArguments(
    output_dir="./gpt-oss-climate-finetuned",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=100,
    eval_strategy="steps",  # named evaluation_strategy in older transformers releases
    eval_steps=500,
    save_steps=1000,
)

# Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
    data_collator=data_collator,
)

# Train
trainer.train()
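After training, the finetuned checkpoint can be used to fill in a masked post. The sketch below is illustrative: `build_prompt` mirrors the harmony layout shown earlier, and the checkpoint path is assumed to be whatever `output_dir` the Trainer saved to.

```python
def build_prompt(context: str, reasoning: str = "medium") -> str:
    """Harmony-format prompt up to the assistant turn, mirroring the training layout."""
    return (
        "<|start|>system<|message|>You are a helpful AI assistant participating "
        "in Reddit discussions about climate change...\n"
        f"Reasoning: {reasoning}\n"
        "# Valid channels: analysis, final<|end|>"
        f"<|start|>user<|message|>{context}<|end|>"
        "<|start|>assistant"
    )

def generate_masked_post(checkpoint: str, thread_context: str) -> str:
    """Load a finetuned checkpoint and complete the masked post."""
    from transformers import AutoTokenizer, AutoModelForCausalLM  # deferred: heavy import
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForCausalLM.from_pretrained(checkpoint)
    inputs = tokenizer(build_prompt(thread_context), return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=256)
    # Decode only the newly generated tokens (analysis + final channels)
    return tokenizer.decode(out[0][inputs["input_ids"].shape[1]:])
```

For example, `generate_masked_post("./gpt-oss-climate-finetuned", thread_text)` returns the model's analysis and final channels for the masked slot.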
Source Data
- Original sequences: climatechange_sequences_10k.json
- Subreddit: r/climatechange
- Total source sequences: 10,000
- Posts per sequence: 4 (forming reply chains)
- Depth range: 0-94 (submissions to deep nested replies)
Use Case
This dataset enables the model to:
- Generate contextually appropriate Reddit comments in climate change discussions
- Match the tone and style of Reddit discourse
- Handle different discussion depths (top-level vs nested replies)
- Provide relevant climate change information
- Continue or bridge discussions naturally
License
The dataset is derived from Reddit posts and should be used in accordance with Reddit's Terms of Service and content policies.
Citation
If you use this dataset, please acknowledge:
- Source: Reddit r/climatechange discussions
- Format: OpenAI harmony response format for gpt-oss-20b
- Processing: Masked post prediction task for instruction finetuning