
Prompt Augmentation Tool

This tool uses Qwen3 to generate new prompt pairs based on examples from the civitai_image.csv dataset.

Features

  • Randomly samples existing prompts as examples for Qwen3
  • 40% probability of generating multi-character focused prompts
  • Cleans prompts by removing technical embeddings and prefixes
  • Generates 10,000 new prompt pairs by default
  • Saves results in batches (every 10 prompts by default) to limit data loss if the run is interrupted
  • Resume capability - automatically detects and continues from existing files
  • Saves results in JSONL format for easy processing
  • Includes progress tracking and error handling
  • Interrupt-safe - saves progress even if stopped with Ctrl+C
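
The sampling behaviour in the first two features can be sketched as follows. This is a minimal illustration rather than the script's actual code: the helper name `pick_examples`, its signature, and the use of `random.Random` are assumptions made for the example.

```python
import random

def pick_examples(prompts, samples_per_batch=3, multi_char_prob=0.4, rng=None):
    """Pick example prompts and decide whether to request a multi-character prompt.

    Hypothetical helper: the name and signature are illustrative only.
    """
    rng = rng or random.Random()
    # Never request more examples than are available.
    k = min(samples_per_batch, len(prompts))
    examples = rng.sample(prompts, k)
    # True with probability multi_char_prob (0.4 by default).
    multi_character_focus = rng.random() < multi_char_prob
    return examples, multi_character_focus
```

Passing a seeded `random.Random` instance makes the selection reproducible, which can help when debugging a run.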

Usage

Basic Usage

cd prepare_tool/prompt_augmentation
python augment_prompts.py

Custom Parameters

python augment_prompts.py \
    --target_count 5000 \
    --multi_char_prob 0.3 \
    --save_every 20 \
    --output my_prompts.jsonl \
    --csv_path ../../civitai_image.csv

Resume Generation

If your generation is interrupted, simply run the same command again. The script will automatically detect existing prompts and continue from where it left off:

# This will continue from existing prompts_10k.jsonl if it exists
python augment_prompts.py --target_count 10000 --output prompts_10k.jsonl
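
One plausible way to implement this kind of resume check is to count the valid lines already present in the output file and generate only the difference. The sketch below assumes that approach; `count_existing` and `remaining` are hypothetical names, and the real script may detect existing prompts differently.

```python
import json
import os

def count_existing(output_path):
    """Return the number of valid JSON lines already in output_path (0 if absent)."""
    if not os.path.exists(output_path):
        return 0
    count = 0
    with open(output_path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                json.loads(line)
            except json.JSONDecodeError:
                # A partially written line (e.g. from a crash) is not counted.
                continue
            count += 1
    return count

def remaining(target_count, output_path):
    """How many prompts are still needed to reach target_count."""
    return max(0, target_count - count_existing(output_path))
```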

Parameters

  • --model: Model name (default: "Qwen/Qwen3-8B")
  • --csv_path: Path to civitai CSV file (default: "../../civitai_image.csv")
  • --target_count: Number of prompts to generate (default: 10000)
  • --multi_char_prob: Probability of multi-character prompts (default: 0.4)
  • --samples_per_batch: Number of examples to show model (default: 3)
  • --save_every: Save to file every N successful generations (default: 10)
  • --output: Output file name (default: "augmented_prompts.jsonl")
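
For reference, the parameters above map naturally onto an `argparse` parser. The sketch below reproduces the documented flags and defaults; the function name `build_parser` and the argument types are assumptions, not taken from the script.

```python
import argparse

def build_parser():
    """Build a parser with the flags and defaults listed above (illustrative)."""
    parser = argparse.ArgumentParser(description="Prompt augmentation (sketch)")
    parser.add_argument("--model", default="Qwen/Qwen3-8B")
    parser.add_argument("--csv_path", default="../../civitai_image.csv")
    parser.add_argument("--target_count", type=int, default=10000)
    parser.add_argument("--multi_char_prob", type=float, default=0.4)
    parser.add_argument("--samples_per_batch", type=int, default=3)
    parser.add_argument("--save_every", type=int, default=10)
    parser.add_argument("--output", default="augmented_prompts.jsonl")
    return parser

# Flags that are not given keep their defaults.
args = build_parser().parse_args(["--target_count", "5000", "--save_every", "20"])
print(args.target_count, args.save_every, args.multi_char_prob)  # prints: 5000 20 0.4
```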

Output Format

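The script generates a JSONL file where each line is a JSON object. The object is shown here across several lines for readability; in the file it occupies a single line: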

{
    "positive_prompt": "detailed positive prompt here...",
    "negative_prompt": "negative prompt with quality controls",
    "multi_character_focus": true,
    "generation_attempt": 42,
    "sample_sources": ["sample 1...", "sample 2...", "sample 3..."]
}
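
A file in this format can be read back one line at a time. The following sketch loads the records and separates them using the `multi_character_focus` field shown above; the function names `load_pairs` and `split_by_focus` are hypothetical.

```python
import json

def load_pairs(path):
    """Yield one dict per non-empty line of a JSONL file."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)

def split_by_focus(records):
    """Separate multi-character records from general ones."""
    multi, general = [], []
    for rec in records:
        # Records without the field are treated as general prompts.
        if rec.get("multi_character_focus"):
            multi.append(rec)
        else:
            general.append(rec)
    return multi, general
```

For example, `multi, general = split_by_focus(load_pairs("augmented_prompts.jsonl"))` gives two lists that can be inspected or exported separately.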

Example Output

The prompts in each generated pair will resemble the following (only the two prompt fields are shown):

Multi-character focused:

{
    "positive_prompt": "masterpiece, best quality, 2girls, sitting together on park bench, one girl with long brown hair reading book aloud, other girl with short blonde hair listening intently, warm afternoon sunlight, cherry blossoms falling, detailed facial expressions, friendship, casual clothing, peaceful atmosphere",
    "negative_prompt": "worst quality, low quality, bad anatomy, bad hands, blurry, watermark, signature, text"
}

General prompt:

{
    "positive_prompt": "high quality, detailed illustration, mystical forest scene, ancient stone ruins covered in glowing moss, ethereal lighting through canopy, magical atmosphere, fantasy landscape, intricate details, vibrant colors",
    "negative_prompt": "low quality, bad anatomy, blurry, watermark, signature, worst quality"
}

Tips

  1. Monitor Progress: The script shows progress every 100 attempts
  2. Batch Saving: Results are saved every 10 successful generations by default
  3. Resume Safely: You can interrupt (Ctrl+C) and resume generation anytime
  4. Adjust Parameters: Lower --multi_char_prob if you want fewer multi-character prompts
  5. Change Batch Size: Use --save_every to control how often data is saved
  6. GPU Memory: The script uses "auto" device mapping, so make sure enough GPU memory is available
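
The batch-saving and interrupt-safe behaviour from tips 2 and 3 could be implemented along the following lines. This is a sketch under stated assumptions, not the script's code: `generate` stands in for a hypothetical callable that returns one record as a dict (or `None` on a failed attempt), and `append_jsonl` and `run` are illustrative names.

```python
import json

def append_jsonl(path, records):
    """Append each record to path as one JSON line."""
    with open(path, "a", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")

def run(generate, target_count, output_path, save_every=10):
    """Collect records from generate() and write them in batches.

    Returns the number of records written during this call.
    """
    buffer = []
    saved = 0
    try:
        while saved + len(buffer) < target_count:
            record = generate()
            if record is None:
                continue  # failed attempt; try again
            buffer.append(record)
            if len(buffer) >= save_every:
                append_jsonl(output_path, buffer)
                saved += len(buffer)
                buffer.clear()
    except KeyboardInterrupt:
        pass  # Ctrl+C: fall through and flush what is buffered
    finally:
        if buffer:
            append_jsonl(output_path, buffer)
            saved += len(buffer)
            buffer = []
    return saved
```

A smaller `save_every` means less work is held only in memory at any moment, at the cost of more frequent file writes.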

Requirements

  • transformers (4.51.0 or later, which added Qwen3 support)
  • torch
  • pandas
  • Python 3.7+
  • Sufficient GPU memory for Qwen3-8B model