YAML Configuration System Guide

Overview

The application uses a YAML configuration file to manage reference lists, so new lists can be added without changing any code.

Configuration File Location

config/reference_lists.yaml

Adding New Reference Lists

1. Simple Example - Add a new unigram list

english:
  unigrams:
    my_new_list:
      display_name: "My New Word List"
      description: "Description of what this list contains"
      files:
        token: "resources/reference_lists/en/my_new_list_token.csv"
        lemma: "resources/reference_lists/en/my_new_list_lemma.csv"
      format: "csv"  # or "tsv"
      columns:
        word: 0
        frequency: 1
      has_header: true
      enabled: true

2. Adding Bigram Lists

english:
  bigrams:
    my_bigram_list:
      display_name: "My Bigram Associations"
      description: "Bigram association measures"
      files:
        token: "resources/reference_lists/en/my_bigrams_token.csv"
        lemma: "resources/reference_lists/en/my_bigrams_lemma.csv"
      format: "csv"
      columns:
        bigram: 0
        frequency: 1
        mi_score: 2
        t_score: 3
      has_header: true
      enabled: true

3. Adding Trigram Lists

english:
  trigrams:
    my_trigram_list:
      display_name: "My Trigram Patterns"
      description: "Trigram frequency and association data"
      files:
        token: "resources/reference_lists/en/my_trigrams_token.csv"
        lemma: "resources/reference_lists/en/my_trigrams_lemma.csv"
      format: "tsv"
      columns:
        trigram: 0
        frequency: 1
        mi_score: 2
      has_header: false
      enabled: true
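
To make the format, columns, and has_header fields in the three examples above concrete, here is a minimal Python sketch of how a loader might read one file. The function name `read_reference_rows` and the sample rows are invented for illustration; the application's actual loader may differ, and header_prefix is not handled here.

```python
import csv
import io


def read_reference_rows(handle, entry):
    """Yield one dict per data row, keyed by the names in entry['columns'].

    Hypothetical helper: it only illustrates how the YAML fields could be
    interpreted, not how the application actually loads its files.
    """
    delimiter = "\t" if entry.get("format", "csv") == "tsv" else ","
    columns = entry.get("columns", {"word": 0, "frequency": 1})
    reader = csv.reader(handle, delimiter=delimiter)
    if entry.get("has_header", False):
        next(reader, None)  # skip the header row
    for row in reader:
        if not row:
            continue  # ignore blank lines
        yield {name: row[index] for name, index in columns.items()}


# Entry mirroring the trigram example: TSV, no header row.
trigram_entry = {
    "format": "tsv",
    "columns": {"trigram": 0, "frequency": 1, "mi_score": 2},
    "has_header": False,
}

# Made-up sample content standing in for my_trigrams_token.csv.
sample = io.StringIO("in order to\t1520\t7.3\nas well as\t1310\t6.9\n")
rows = list(read_reference_rows(sample, trigram_entry))
print(rows[0])
```

Values come back as strings; converting frequency or mi_score to numbers is left to the caller in this sketch.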

Configuration Options

Required Fields

  • display_name: Name shown in the UI
  • files: Mapping of file type (token, lemma) to file path
  • enabled: Whether to show in the UI

Optional Fields

  • description: Help text for users
  • format: "csv" or "tsv" (default: "csv")
  • columns: Column mapping (default: {word: 0, frequency: 1})
  • has_header: Whether files have header rows (default: false)
  • header_prefix: Prefix that marks header lines, such as "#"
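
The required and optional fields above can be read as a small validation step. The following Python sketch checks the required fields and fills in the documented defaults. The names `DEFAULTS`, `REQUIRED`, and `normalize_entry` are hypothetical, and the defaults for description and header_prefix are assumptions, since the guide does not state them.

```python
import copy

# Defaults as documented above; the description and header_prefix
# defaults are assumptions, not taken from the application.
DEFAULTS = {
    "description": "",
    "format": "csv",
    "columns": {"word": 0, "frequency": 1},
    "has_header": False,
    "header_prefix": None,
}
REQUIRED = ("display_name", "files", "enabled")


def normalize_entry(name, entry):
    """Return a copy of entry with defaults applied, or raise ValueError."""
    missing = [field for field in REQUIRED if field not in entry]
    if missing:
        raise ValueError(f"{name}: missing required field(s): {', '.join(missing)}")
    file_format = entry.get("format", "csv")
    if file_format not in ("csv", "tsv"):
        raise ValueError(f"{name}: format must be 'csv' or 'tsv', got {file_format!r}")
    # Deep-copy the defaults so entries never share the same columns dict.
    return {**copy.deepcopy(DEFAULTS), **entry}


awl = normalize_entry(
    "academic_words",
    {
        "display_name": "Academic Word List",
        "files": {
            "token": "resources/reference_lists/en/awl_token.csv",
            "lemma": "resources/reference_lists/en/awl_lemma.csv",
        },
        "enabled": True,
    },
)
print(awl["format"], awl["columns"], awl["has_header"])
```

An entry that omits display_name, files, or enabled raises a ValueError naming the missing fields, which is one way to surface configuration mistakes early.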

File Structure Expected

For each reference list, you need separate token and lemma files:

resources/reference_lists/en/
├── my_list_token.csv
├── my_list_lemma.csv
├── my_bigrams_token.csv
├── my_bigrams_lemma.csv
├── my_trigrams_token.csv
└── my_trigrams_lemma.csv
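
Because every list needs both files, it can help to check the paths before restarting the application. This is a small Python sketch of such a check; `missing_files` is a hypothetical helper, and the temporary directory only simulates a project in which the lemma file has not been added yet.

```python
import tempfile
from pathlib import Path


def missing_files(entry, base_dir="."):
    """Return the configured paths that do not exist under base_dir."""
    base = Path(base_dir)
    return [
        path
        for path in entry.get("files", {}).values()
        if not (base / path).is_file()
    ]


entry = {
    "files": {
        "token": "resources/reference_lists/en/my_list_token.csv",
        "lemma": "resources/reference_lists/en/my_list_lemma.csv",
    }
}

with tempfile.TemporaryDirectory() as tmp:
    list_dir = Path(tmp) / "resources" / "reference_lists" / "en"
    list_dir.mkdir(parents=True)
    # Only the token file is created, so the lemma file is reported.
    (list_dir / "my_list_token.csv").write_text("word,frequency\n", encoding="utf-8")
    result = missing_files(entry, base_dir=tmp)

print(result)
```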

Adding Japanese Lists

japanese:
  unigrams:
    jp_frequency:
      display_name: "Japanese Frequency List"
      description: "Common Japanese word frequencies"
      files:
        token: "resources/reference_lists/ja/jp_frequency_token.csv"
        lemma: "resources/reference_lists/ja/jp_frequency_lemma.csv"
      format: "csv"
      has_header: true
      enabled: true

Benefits of This System

  1. No Code Changes: Add new lists by editing only the YAML file
  2. Automatic UI: Checkboxes are generated automatically
  3. Flexible: Support for different file formats and column structures
  4. Organized: Clear separation by n-gram type
  5. Easy Enable/Disable: Toggle lists on/off with the enabled field

Current Status

The system is now configured with:

  • COCA Spoken Frequency: Working with your existing file
  • Placeholder slots: Ready for additional lists
  • All n-gram types: Unigrams, bigrams, and trigrams supported

How to Use

  1. Add your files to the appropriate directory
  2. Edit the YAML to add configuration entries
  3. Set enabled: true to make them appear in the UI
  4. Restart the application to see the changes

The system will automatically:

  • Generate checkboxes for each enabled list
  • Group them by type (Unigrams, Bigrams, Trigrams)
  • Load the data when selected
  • Display vocabulary size information
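
The grouping behavior described above can be sketched in Python as follows. This is an illustration under assumptions, not the application's code: `enabled_lists_by_type` is a hypothetical name, and the dict stands in for one language section of config/reference_lists.yaml after parsing (for example with PyYAML's `yaml.safe_load`).

```python
def enabled_lists_by_type(language_config):
    """Group display names of enabled lists under a heading per n-gram type."""
    grouped = {}
    for ngram_type in ("unigrams", "bigrams", "trigrams"):
        entries = language_config.get(ngram_type) or {}
        labels = [
            entry["display_name"]
            for entry in entries.values()
            if entry.get("enabled")
        ]
        if labels:
            grouped[ngram_type.capitalize()] = labels
    return grouped


# Stand-in for the parsed "english" section; one list is disabled here
# purely to show the effect of the enabled field.
english = {
    "unigrams": {
        "academic_words": {"display_name": "Academic Word List", "enabled": True},
        "concreteness": {"display_name": "Concreteness Ratings", "enabled": False},
    },
    "bigrams": {
        "my_bigram_list": {"display_name": "My Bigram Associations", "enabled": True},
    },
}

groups = enabled_lists_by_type(english)
for heading, labels in groups.items():
    print(heading, labels)
```

In this sketch each label would correspond to one checkbox, and a type with no enabled lists gets no heading.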

Example: Adding Multiple Lists at Once

english:
  unigrams:
    academic_words:
      display_name: "Academic Word List"
      description: "AWL vocabulary for academic writing"
      files:
        token: "resources/reference_lists/en/awl_token.csv"
        lemma: "resources/reference_lists/en/awl_lemma.csv"
      enabled: true
      
    concreteness:
      display_name: "Concreteness Ratings"
      description: "Concrete vs abstract word ratings"
      files:
        token: "resources/reference_lists/en/concrete_token.csv"
        lemma: "resources/reference_lists/en/concrete_lemma.csv"
      enabled: true
      
    age_acquisition:
      display_name: "Age of Acquisition"
      description: "Age when words are typically learned"
      files:
        token: "resources/reference_lists/en/aoa_token.csv"
        lemma: "resources/reference_lists/en/aoa_lemma.csv"
      enabled: true

Together with the existing COCA Spoken Frequency list, these three entries produce four checkboxes in the "Unigrams" section.