
Climate Change Reddit Discussion Dataset for gpt-oss-20b

Overview

This dataset is designed for instruction finetuning OpenAI's gpt-oss-20b model on Reddit-style climate change discussions. The model will learn to generate contextually appropriate Reddit posts (submissions and comments) at various depths within discussion threads.

Dataset Format

The dataset uses OpenAI's harmony response format, which is required for gpt-oss-20b models.

Files Created

  1. climatechange_reddit_gpt_oss_hf/ - Hugging Face Dataset format (41MB)

    • train/ - 9,000 training examples
    • validation/ - 1,000 validation examples
  2. climatechange_reddit_gpt_oss_train.jsonl - Training data in JSONL format (39MB)

  3. climatechange_reddit_gpt_oss_val.jsonl - Validation data in JSONL format (4.1MB)

Task Description

The finetuning task is a masked post prediction task:

  • Each example contains a sequence of 4 Reddit posts from a discussion thread
  • One post is randomly masked (can be at any position 1-4)
  • The model must generate the masked post based on the surrounding context
  • Posts can be at any depth (0-97), not just top-level submissions

Data Statistics

Overall

  • Total examples: 10,000
  • Train set: 9,000 (90%)
  • Validation set: 1,000 (10%)

Masked Position Distribution

  • Position 1: ~25.6%
  • Position 2: ~25.3%
  • Position 3: ~25.0%
  • Position 4: ~24.1%

Masked Type Distribution

  • Comments: ~93.9%
  • Submissions: ~6.1%

Reasoning Level Distribution

  • High: ~33.5%
  • Medium: ~33.4%
  • Low: ~33.1%

Top 10 Masked Depths

  1. Depth 3: ~15.9%
  2. Depth 2: ~15.4%
  3. Depth 4: ~12.7%
  4. Depth 1: ~10.8%
  5. Depth 5: ~8.8%
  6. Depth 6: ~6.7%
  7. Depth 0: ~6.1%
  8. Depth 7: ~4.7%
  9. Depth 8: ~3.7%
  10. Depth 9: ~2.5%
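The distributions above can be recomputed directly from the JSONL files (file names as listed under Files Created) with stdlib Python, for example for the masked position:

```python
import json
from collections import Counter

def masked_position_distribution(path):
    """Return {masked_position: share of records} for a JSONL file."""
    counts = Counter()
    with open(path) as f:
        for line in f:
            counts[json.loads(line)["masked_position"]] += 1
    total = sum(counts.values())
    return {pos: count / total for pos, count in counts.items()}
```

Swapping `"masked_position"` for `"masked_depth"`, `"masked_type"`, or `"reasoning_level"` reproduces the other tables.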

Harmony Format Structure

Each training example follows this structure:

<|start|>system<|message|>You are a helpful AI assistant participating in Reddit discussions about climate change...

Reasoning: [low|medium|high]

# Valid channels: analysis, final<|end|>

<|start|>user<|message|>[INSTRUCTION]

Here is the conversation thread with one post marked as [MASKED POST - TO BE GENERATED]:

[CONTEXT WITH MASKED POST]<|end|>

<|start|>assistant<|channel|>analysis<|message|>[BRIEF REASONING]<|end|>

<|start|>assistant<|channel|>final<|message|>[GENERATED POST]<|return|>
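One way to assemble such a string is with plain f-strings following the harmony channel syntax (OpenAI also publishes an `openai-harmony` renderer; the function below and its argument names are illustrative, not the dataset's actual build code):

```python
def render_example(system_prompt, instruction, context, reasoning, analysis, post):
    """Assemble one harmony-formatted training string.

    All arguments are plain strings; `reasoning` is "low", "medium", or "high".
    """
    return (
        f"<|start|>system<|message|>{system_prompt}\n"
        f"Reasoning: {reasoning}\n"
        f"# Valid channels: analysis, final<|end|>"
        f"<|start|>user<|message|>{instruction}\n\n{context}<|end|>"
        f"<|start|>assistant<|channel|>analysis<|message|>{analysis}<|end|>"
        f"<|start|>assistant<|channel|>final<|message|>{post}<|return|>"
    )
```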

Dataset Schema

Each JSONL record contains:

{
  "text": "Full harmony-formatted conversation (for training)",
  "instruction": "Human-readable instruction",
  "context": "Conversation context without masked post",
  "response": "Expected response (masked post content)",
  "reasoning_level": "low|medium|high",
  "sequence_id": 1234,
  "masked_position": 2,
  "masked_depth": 3,
  "masked_type": "comment|submission",
  "post_id": "abc123",
  "author": "username"
}
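A small validator for this schema, useful as a sanity check before training (the type map below restates the schema; it is not shipped with the dataset):

```python
REQUIRED_FIELDS = {
    "text": str, "instruction": str, "context": str, "response": str,
    "reasoning_level": str, "sequence_id": int, "masked_position": int,
    "masked_depth": int, "masked_type": str, "post_id": str, "author": str,
}

def validate_record(record):
    """Check one parsed JSONL record against the schema above."""
    for field, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(record.get(field), expected_type):
            return False
    return (record["reasoning_level"] in {"low", "medium", "high"}
            and 1 <= record["masked_position"] <= 4
            and record["masked_type"] in {"comment", "submission"})
```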

Usage Example

Loading the Dataset

from datasets import load_from_disk

# Load the dataset
dataset = load_from_disk('climatechange_reddit_gpt_oss_hf')

# Access splits
train_data = dataset['train']
val_data = dataset['validation']

# The 'text' column contains harmony-formatted conversations
print(train_data[0]['text'])

Alternative: Load from JSONL

from datasets import load_dataset

dataset = load_dataset('json', data_files={
    'train': 'climatechange_reddit_gpt_oss_train.jsonl',
    'validation': 'climatechange_reddit_gpt_oss_val.jsonl'
})

Finetuning with Hugging Face

from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    DataCollatorForLanguageModeling,
    TrainingArguments,
    Trainer,
)

# Load model and tokenizer
model_name = "openai/gpt-oss-20b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Tokenize function (no static padding; the collator pads per batch)
def tokenize_function(examples):
    return tokenizer(examples['text'], truncation=True, max_length=2048)

# Tokenize dataset
tokenized_train = train_data.map(tokenize_function, batched=True)
tokenized_val = val_data.map(tokenize_function, batched=True)

# Causal-LM collator: pads each batch and copies input_ids into labels,
# without which the Trainer has no loss to optimize
data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

# Training arguments
training_args = TrainingArguments(
    output_dir="./gpt-oss-climate-finetuned",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=100,
    eval_strategy="steps",  # named evaluation_strategy in older transformers releases
    eval_steps=500,
    save_steps=1000,
)

# Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
    data_collator=data_collator,
)

# Train
trainer.train()
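The recipe above computes loss over the whole sequence, prompt included. A common refinement for instruction finetuning (not part of the recipe above) is to mask the prompt tokens in `labels` so that only the assistant response contributes to the loss; a tokenizer-agnostic sketch:

```python
IGNORE_INDEX = -100  # PyTorch cross-entropy skips positions labeled -100

def mask_prompt_labels(input_ids, prompt_length):
    """Copy input_ids into labels, ignoring the first `prompt_length` tokens."""
    labels = list(input_ids)
    labels[:prompt_length] = [IGNORE_INDEX] * min(prompt_length, len(labels))
    return labels
```

In practice `prompt_length` would be the token count of everything up to the final `<|start|>assistant` segment, and the resulting `labels` column replaces the collator's default.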

Source Data

  • Original sequences: climatechange_sequences_10k.json
  • Subreddit: r/climatechange
  • Total source sequences: 10,000
  • Posts per sequence: 4 (forming reply chains)
  • Depth range: 0-94 (submissions to deep nested replies)

Use Case

This dataset enables the model to:

  1. Generate contextually appropriate Reddit comments in climate change discussions
  2. Match the tone and style of Reddit discourse
  3. Handle different discussion depths (top-level vs nested replies)
  4. Provide relevant climate change information
  5. Continue or bridge discussions naturally

License

The dataset is derived from Reddit posts and should be used in accordance with Reddit's Terms of Service and content policies.

Citation

If you use this dataset, please acknowledge:

  • Source: Reddit r/climatechange discussions
  • Format: OpenAI harmony response format for gpt-oss-20b
  • Processing: Masked post prediction task for instruction finetuning