simple-text-analyzer / config /YAML_CONFIG_GUIDE.md

egumasa

initialize app

a543e33 9 months ago

YAML Configuration System Guide

Overview

The application manages its reference lists through a YAML configuration file. New reference lists are added by editing that file, with no code changes.

Configuration File Location

config/reference_lists.yaml

Adding New Reference Lists

1. Simple Example - Add a new unigram list

english:
  unigrams:
    my_new_list:
      display_name: "My New Word List"
      description: "Description of what this list contains"
      files:
        token: "resources/reference_lists/en/my_new_list_token.csv"
        lemma: "resources/reference_lists/en/my_new_list_lemma.csv"
      format: "csv"  # or "tsv"
      columns:
        word: 0
        frequency: 1
      has_header: true
      enabled: true
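
As a rough illustration of how an entry like the one above could drive file parsing, the following Python sketch reads delimited text using the format, columns, and has_header fields. The function name load_frequency_list is hypothetical and is not taken from the application. The entry is written as a plain dict, the shape a YAML parser would typically return, and frequencies are assumed to be numeric.

```python
import csv
import io


def load_frequency_list(text, entry):
    """Parse delimited text into {word: frequency} using one config entry.

    `entry` is a plain dict shaped like a list entry in reference_lists.yaml.
    This is an illustrative sketch, not the application's actual loader.
    """
    delimiter = "\t" if entry.get("format", "csv") == "tsv" else ","
    columns = entry.get("columns", {"word": 0, "frequency": 1})
    rows = csv.reader(io.StringIO(text), delimiter=delimiter)
    if entry.get("has_header", False):
        next(rows, None)  # skip the header row
    result = {}
    for row in rows:
        if not row:
            continue  # ignore blank lines
        # Assumption: the frequency column holds a number.
        result[row[columns["word"]]] = float(row[columns["frequency"]])
    return result


entry = {
    "format": "csv",
    "columns": {"word": 0, "frequency": 1},
    "has_header": True,
}
sample = "word,frequency\nthe,1000\nof,500\n"
frequencies = load_frequency_list(sample, entry)
print(frequencies)
```

Reading from a string keeps the sketch self-contained. A real loader would open the token or lemma path listed under files instead.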

2. Adding Bigram Lists

english:
  bigrams:
    my_bigram_list:
      display_name: "My Bigram Associations"
      description: "Bigram association measures"
      files:
        token: "resources/reference_lists/en/my_bigrams_token.csv"
        lemma: "resources/reference_lists/en/my_bigrams_lemma.csv"
      format: "csv"
      columns:
        bigram: 0
        frequency: 1
        mi_score: 2
        t_score: 3
      has_header: true
      enabled: true

3. Adding Trigram Lists

english:
  trigrams:
    my_trigram_list:
      display_name: "My Trigram Patterns"
      description: "Trigram frequency and association data"
      files:
        token: "resources/reference_lists/en/my_trigrams_token.csv"
        lemma: "resources/reference_lists/en/my_trigrams_lemma.csv"
      format: "tsv"
      columns:
        trigram: 0
        frequency: 1
        mi_score: 2
      has_header: false
      enabled: true
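
Bigram and trigram files carry more than one measure per row, so a loader has to keep every mapped column. The sketch below shows one possible approach for the trigram entry above (tab-separated, no header). The function name load_ngram_rows is hypothetical, and treating the first key in columns as the n-gram text and the remaining keys as numeric measures is an assumption of this example.

```python
import csv
import io


def load_ngram_rows(text, columns, fmt="csv", has_header=False):
    """Return {ngram: {measure: value}} for a bigram or trigram file.

    Assumption: the first key in `columns` names the n-gram text column and
    every other key names a numeric measure.
    """
    delimiter = "\t" if fmt == "tsv" else ","
    key_field, *measure_fields = columns  # dicts preserve insertion order
    rows = csv.reader(io.StringIO(text), delimiter=delimiter)
    if has_header:
        next(rows, None)  # skip the header row
    table = {}
    for row in rows:
        if not row:
            continue  # ignore blank lines
        table[row[columns[key_field]]] = {
            name: float(row[columns[name]]) for name in measure_fields
        }
    return table


trigram_columns = {"trigram": 0, "frequency": 1, "mi_score": 2}
sample = "one of the\t120\t4.5\nas well as\t80\t6.1\n"
trigrams = load_ngram_rows(sample, trigram_columns, fmt="tsv", has_header=False)
print(trigrams["as well as"]["mi_score"])
```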

Configuration Options

Required Fields

display_name: Name shown in the UI
files: Mapping from file type (token, lemma) to file path
enabled: Whether the list is shown in the UI

Optional Fields

description: Help text shown to users
format: File format, "csv" or "tsv" (default: "csv")
columns: Mapping from field name to zero-based column index (default: {word: 0, frequency: 1})
has_header: Whether the files begin with a header row (default: false)
header_prefix: Prefix that marks header lines, such as "#"
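
The required fields and documented defaults can be checked and filled in with a small helper. The sketch below is a minimal illustration, not the application's implementation. The names normalize_entry, REQUIRED_FIELDS, and DEFAULTS are hypothetical, and the fallback values for description (empty string) and header_prefix (None) are assumptions, since the guide states no default for them.

```python
import copy

REQUIRED_FIELDS = ("display_name", "files", "enabled")

DEFAULTS = {
    "description": "",       # assumption: the guide gives no default
    "format": "csv",
    "columns": {"word": 0, "frequency": 1},
    "has_header": False,
    "header_prefix": None,   # assumption: the guide gives no default
}


def normalize_entry(name, entry):
    """Validate required fields and fill in the documented defaults."""
    missing = [field for field in REQUIRED_FIELDS if field not in entry]
    if missing:
        raise ValueError(f"{name}: missing required field(s): {', '.join(missing)}")
    if entry.get("format", "csv") not in ("csv", "tsv"):
        raise ValueError(f"{name}: format must be 'csv' or 'tsv'")
    # Deep-copy the defaults so entries never share the same columns dict.
    return {**copy.deepcopy(DEFAULTS), **entry}


normalized = normalize_entry(
    "academic_words",
    {
        "display_name": "Academic Word List",
        "files": {
            "token": "resources/reference_lists/en/awl_token.csv",
            "lemma": "resources/reference_lists/en/awl_lemma.csv",
        },
        "enabled": True,
    },
)
print(normalized["format"], normalized["columns"], normalized["has_header"])
```

Failing early on a missing required field reports a typo in the YAML when the configuration is read, not later when the list is selected in the UI.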

File Structure Expected

For each reference list, you need separate token and lemma files:

resources/reference_lists/en/
├── my_new_list_token.csv
├── my_new_list_lemma.csv
├── my_bigrams_token.csv
├── my_bigrams_lemma.csv
├── my_trigrams_token.csv
└── my_trigrams_lemma.csv
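
Because each entry points at separate token and lemma files, it can help to confirm that both exist before enabling a list. The sketch below is one way to do that under stated assumptions: missing_files is a hypothetical helper, and paths are resolved against a root directory that stands in for the application's working directory.

```python
import tempfile
from pathlib import Path


def missing_files(entry, root="."):
    """Return the configured paths (token, lemma, ...) not found under root."""
    return [
        path
        for path in entry.get("files", {}).values()
        if not (Path(root) / path).is_file()
    ]


entry = {
    "files": {
        "token": "resources/reference_lists/en/my_new_list_token.csv",
        "lemma": "resources/reference_lists/en/my_new_list_lemma.csv",
    }
}

# Demonstration in a throwaway directory: create only the token file.
with tempfile.TemporaryDirectory() as root:
    token_path = Path(root) / entry["files"]["token"]
    token_path.parent.mkdir(parents=True)
    token_path.write_text("word,frequency\n")
    absent = missing_files(entry, root)

print(absent)
```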

Adding Japanese Lists

japanese:
  unigrams:
    jp_frequency:
      display_name: "Japanese Frequency List"
      description: "Common Japanese word frequencies"
      files:
        token: "resources/reference_lists/ja/jp_frequency_token.csv"
        lemma: "resources/reference_lists/ja/jp_frequency_lemma.csv"
      format: "csv"
      has_header: true
      enabled: true

Benefits of This System

No Code Changes: Add new lists by editing only the YAML file
Automatic UI: A checkbox is generated for each enabled list
Flexible: Supports CSV and TSV files and custom column mappings
Organized: Lists are grouped by n-gram type
Easy Enable/Disable: Toggle lists on or off with the enabled field

Current Status

The system is now configured with:

COCA Spoken Frequency: Working with your existing file
Placeholder slots: Ready for additional lists
All n-gram types: Unigrams, bigrams, and trigrams supported

How to Use

Add your files under the matching language directory, for example resources/reference_lists/en/
Add a configuration entry for each list to config/reference_lists.yaml
Set enabled: true to make the lists appear in the UI
Restart the application to see the changes

The system will automatically:

Generate checkboxes for each enabled list
Group them by type (Unigrams, Bigrams, Trigrams)
Load the data when selected
Display vocabulary size information
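
The grouping step described above can be pictured as a pass over the parsed configuration that keeps only enabled entries. The sketch below assumes the configuration has already been parsed into nested dicts. The function enabled_lists_by_type and the draft_list and coca_spoken keys are hypothetical, and the actual UI code may work differently.

```python
def enabled_lists_by_type(config, language):
    """Map each n-gram type to the display names of its enabled lists."""
    groups = {}
    for ngram_type, entries in config.get(language, {}).items():
        # An empty YAML section parses to None, so fall back to an empty dict.
        labels = [
            entry["display_name"]
            for entry in (entries or {}).values()
            if entry.get("enabled")
        ]
        if labels:
            groups[ngram_type] = labels
    return groups


config = {
    "english": {
        "unigrams": {
            "coca_spoken": {"display_name": "COCA Spoken Frequency", "enabled": True},
            "draft_list": {"display_name": "Draft List", "enabled": False},
        },
        "bigrams": {
            "my_bigram_list": {"display_name": "My Bigram Associations", "enabled": True},
        },
        "trigrams": None,
    }
}

groups = enabled_lists_by_type(config, "english")
for ngram_type, labels in groups.items():
    print(ngram_type, labels)
```

Each label in a group would correspond to one checkbox. Disabled entries and empty sections produce nothing.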

Example: Adding Multiple Lists at Once

english:
  unigrams:
    academic_words:
      display_name: "Academic Word List"
      description: "AWL vocabulary for academic writing"
      files:
        token: "resources/reference_lists/en/awl_token.csv"
        lemma: "resources/reference_lists/en/awl_lemma.csv"
      enabled: true
      
    concreteness:
      display_name: "Concreteness Ratings"
      description: "Concrete vs abstract word ratings"
      files:
        token: "resources/reference_lists/en/concrete_token.csv"
        lemma: "resources/reference_lists/en/concrete_lemma.csv"
      enabled: true
      
    age_acquisition:
      display_name: "Age of Acquisition"
      description: "Age when words are typically learned"
      files:
        token: "resources/reference_lists/en/aoa_token.csv"
        lemma: "resources/reference_lists/en/aoa_lemma.csv"
      enabled: true

Together with the existing COCA list, this produces four checkboxes in the "Unigrams" section.