Climate Change Reddit Discussion Dataset for gpt-oss-20b
Overview
This dataset is designed for instruction finetuning OpenAI's gpt-oss-20b model on Reddit-style climate change discussions. The model will learn to generate contextually appropriate Reddit posts (submissions and comments) at various depths within discussion threads.
Dataset Format
The dataset uses OpenAI's harmony response format, which is required for gpt-oss-20b models.
Files Created
climatechange_reddit_gpt_oss_hf/ - Hugging Face Dataset format (41MB)
train/ - 9,000 training examples
validation/ - 1,000 validation examples
climatechange_reddit_gpt_oss_train.jsonl - Training data in JSONL format (39MB)
climatechange_reddit_gpt_oss_val.jsonl - Validation data in JSONL format (4.1MB)
Task Description
The finetuning task is a masked post prediction task:
- Each example contains a sequence of 4 Reddit posts from a discussion thread
- One post is randomly masked (can be at any position 1-4)
- The model must generate the masked post based on the surrounding context
- Posts can be at any depth (0-97), not just top-level submissions
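The masking step described above can be sketched as follows (a minimal illustration; the function and field names are hypothetical, not taken from the actual generation script):

```python
import random

def make_masked_example(sequence, rng=random.Random(0)):
    """Given a list of 4 posts (dicts with at least a 'text' field),
    mask one at random and return (context, target_text, 1-based position)."""
    pos = rng.randrange(len(sequence))  # 0-based index of the post to mask
    target = sequence[pos]
    context_parts = []
    for i, post in enumerate(sequence):
        if i == pos:
            context_parts.append("[MASKED POST - TO BE GENERATED]")
        else:
            context_parts.append(post["text"])
    return "\n\n".join(context_parts), target["text"], pos + 1

posts = [{"text": f"post {i}"} for i in range(4)]
context, target, position = make_masked_example(posts)
```

The model sees `context` (three visible posts plus the mask marker) and is trained to reproduce `target`.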
Data Statistics
Overall
- Total examples: 10,000
- Train set: 9,000 (90%)
- Validation set: 1,000 (10%)
Masked Position Distribution
- Position 1: ~25.6%
- Position 2: ~25.3%
- Position 3: ~25.0%
- Position 4: ~24.1%
Masked Type Distribution
- Comments: ~93.9%
- Submissions: ~6.1%
Reasoning Level Distribution
- High: ~33.5%
- Medium: ~33.4%
- Low: ~33.1%
Top 10 Masked Depths
- Depth 3: ~15.9%
- Depth 2: ~15.4%
- Depth 4: ~12.7%
- Depth 1: ~10.8%
- Depth 5: ~8.8%
- Depth 6: ~6.7%
- Depth 0: ~6.1%
- Depth 7: ~4.7%
- Depth 8: ~3.7%
- Depth 9: ~2.5%
Harmony Format Structure
Each training example follows this structure:
<|start|>system<|message|>You are a helpful AI assistant participating in Reddit discussions about climate change...
Reasoning: [low|medium|high]
# Valid channels: analysis, final<|end|>
<|start|>user<|message|>[INSTRUCTION]
Here is the conversation thread with one post marked as [MASKED POST - TO BE GENERATED]:
[CONTEXT WITH MASKED POST]<|end|>
<|start|>assistant<|channel|>analysis<|message|>[BRIEF REASONING]<|end|>
<|start|>assistant<|channel|>final<|message|>[GENERATED POST]<|return|>
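Assembling such a string from the schema fields can be sketched like this (illustrative only; the dataset's `text` field already contains the fully formatted conversation, and the channel tokens follow OpenAI's published harmony format):

```python
def build_harmony_example(instruction, context, reasoning_level, analysis, response):
    """Assemble one harmony-format training string from its parts."""
    system = (
        "You are a helpful AI assistant participating in Reddit discussions "
        "about climate change...\n"
        f"Reasoning: {reasoning_level}\n"
        "# Valid channels: analysis, final"
    )
    user = f"{instruction}\n{context}"
    return (
        f"<|start|>system<|message|>{system}<|end|>"
        f"<|start|>user<|message|>{user}<|end|>"
        f"<|start|>assistant<|channel|>analysis<|message|>{analysis}<|end|>"
        f"<|start|>assistant<|channel|>final<|message|>{response}<|return|>"
    )

text = build_harmony_example(
    "Generate the masked post.", "[CONTEXT]", "low",
    "Brief reasoning about the thread.", "The generated reply.",
)
```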
Dataset Schema
Each JSONL record contains:
{
  "text": "Full harmony-formatted conversation (for training)",
  "instruction": "Human-readable instruction",
  "context": "Conversation context without masked post",
  "response": "Expected response (masked post content)",
  "reasoning_level": "low|medium|high",
  "sequence_id": 1234,
  "masked_position": 2,
  "masked_depth": 3,
  "masked_type": "comment|submission",
  "post_id": "abc123",
  "author": "username"
}
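A record can be checked against this schema before use; a minimal validator sketch (the helper name is illustrative):

```python
import json

REQUIRED_KEYS = {
    "text", "instruction", "context", "response", "reasoning_level",
    "sequence_id", "masked_position", "masked_depth", "masked_type",
    "post_id", "author",
}

def validate_record(line: str) -> dict:
    """Parse one JSONL line and check it against the schema above."""
    rec = json.loads(line)
    missing = REQUIRED_KEYS - rec.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    assert rec["reasoning_level"] in {"low", "medium", "high"}
    assert rec["masked_type"] in {"comment", "submission"}
    assert 1 <= rec["masked_position"] <= 4
    return rec

sample = {
    "text": "<|start|>system<|message|>...", "instruction": "Generate the masked post.",
    "context": "[CONTEXT]", "response": "A reply.", "reasoning_level": "medium",
    "sequence_id": 1234, "masked_position": 2, "masked_depth": 3,
    "masked_type": "comment", "post_id": "abc123", "author": "username",
}
rec = validate_record(json.dumps(sample))
```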
Usage Example
Loading the Dataset
from datasets import load_from_disk
# Load the dataset
dataset = load_from_disk('climatechange_reddit_gpt_oss_hf')
# Access splits
train_data = dataset['train']
val_data = dataset['validation']
# The 'text' column contains harmony-formatted conversations
print(train_data[0]['text'])
Alternative: Load from JSONL
from datasets import load_dataset
dataset = load_dataset('json', data_files={
    'train': 'climatechange_reddit_gpt_oss_train.jsonl',
    'validation': 'climatechange_reddit_gpt_oss_val.jsonl'
})
Finetuning with Hugging Face
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    DataCollatorForLanguageModeling,
    TrainingArguments,
    Trainer,
)

# Load model and tokenizer
model_name = "openai/gpt-oss-20b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Tokenize the harmony-formatted 'text' column; the collator below pads per batch
def tokenize_function(examples):
    return tokenizer(examples['text'], truncation=True, max_length=2048)

tokenized_train = train_data.map(tokenize_function, batched=True, remove_columns=train_data.column_names)
tokenized_val = val_data.map(tokenize_function, batched=True, remove_columns=val_data.column_names)

# Causal-LM collator: pads each batch and copies input_ids into labels,
# without which the Trainer has no loss to optimize
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

# Training arguments
training_args = TrainingArguments(
    output_dir="./gpt-oss-climate-finetuned",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=100,
    eval_strategy="steps",  # named evaluation_strategy in older transformers releases
    eval_steps=500,
    save_steps=1000,
)

# Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
    data_collator=data_collator,
)

# Train
trainer.train()
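After training, the finetuned checkpoint can be used to fill in a masked post. The sketch below is illustrative: `build_prompt` mirrors the harmony layout shown earlier, and the checkpoint path is assumed to be whatever `output_dir` the Trainer saved to.

```python
def build_prompt(context: str, reasoning: str = "medium") -> str:
    """Harmony-format prompt up to the assistant turn, mirroring the training layout."""
    return (
        "<|start|>system<|message|>You are a helpful AI assistant participating "
        "in Reddit discussions about climate change...\n"
        f"Reasoning: {reasoning}\n"
        "# Valid channels: analysis, final<|end|>"
        f"<|start|>user<|message|>{context}<|end|>"
        "<|start|>assistant"
    )

def generate_masked_post(checkpoint: str, thread_context: str) -> str:
    """Load a finetuned checkpoint and complete the masked post."""
    from transformers import AutoTokenizer, AutoModelForCausalLM  # deferred: heavy import
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForCausalLM.from_pretrained(checkpoint)
    inputs = tokenizer(build_prompt(thread_context), return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=256)
    # Decode only the newly generated tokens (analysis + final channels)
    return tokenizer.decode(out[0][inputs["input_ids"].shape[1]:])
```

For example, `generate_masked_post("./gpt-oss-climate-finetuned", thread_text)` returns the model's analysis and final channels for the masked slot.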
Source Data
- Original sequences: climatechange_sequences_10k.json
- Subreddit: r/climatechange
- Total source sequences: 10,000
- Posts per sequence: 4 (forming reply chains)
- Depth range: 0-94 (submissions to deep nested replies)
Use Case
This dataset enables the model to:
- Generate contextually appropriate Reddit comments in climate change discussions
- Match the tone and style of Reddit discourse
- Handle different discussion depths (top-level vs nested replies)
- Provide relevant climate change information
- Continue or bridge discussions naturally
License
The dataset is derived from Reddit posts and should be used in accordance with Reddit's Terms of Service and content policies.
Citation
If you use this dataset, please acknowledge:
- Source: Reddit r/climatechange discussions
- Format: OpenAI harmony response format for gpt-oss-20b
- Processing: Masked post prediction task for instruction finetuning