YAML Configuration System Guide
Overview
The application now uses a YAML configuration system to manage reference lists. New reference lists can be added by editing the configuration file, without changing any code.
Configuration File Location
config/reference_lists.yaml
Adding New Reference Lists
1. Simple Example - Add a new unigram list
english:
  unigrams:
    my_new_list:
      display_name: "My New Word List"
      description: "Description of what this list contains"
      files:
        token: "resources/reference_lists/en/my_new_list_token.csv"
        lemma: "resources/reference_lists/en/my_new_list_lemma.csv"
      format: "csv" # or "tsv"
      columns:
        word: 0
        frequency: 1
      has_header: true
      enabled: true
2. Adding Bigram Lists
english:
  bigrams:
    my_bigram_list:
      display_name: "My Bigram Associations"
      description: "Bigram association measures"
      files:
        token: "resources/reference_lists/en/my_bigrams_token.csv"
        lemma: "resources/reference_lists/en/my_bigrams_lemma.csv"
      format: "csv"
      columns:
        bigram: 0
        frequency: 1
        mi_score: 2
        t_score: 3
      has_header: true
      enabled: true
3. Adding Trigram Lists
english:
  trigrams:
    my_trigram_list:
      display_name: "My Trigram Patterns"
      description: "Trigram frequency and association data"
      files:
        token: "resources/reference_lists/en/my_trigrams_token.csv"
        lemma: "resources/reference_lists/en/my_trigrams_lemma.csv"
      format: "tsv"
      columns:
        trigram: 0
        frequency: 1
        mi_score: 2
      has_header: false
      enabled: true
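To make the roles of format, columns, and has_header concrete, here is a minimal Python sketch of how a file could be parsed from those settings. The function name read_reference_rows and the sample rows are invented for illustration; the application's actual loader may work differently.

```python
import csv
import io


def read_reference_rows(text, fmt="csv", columns=None, has_header=False):
    """Parse reference-list text into dicts keyed by the configured column names.

    Hypothetical helper: it only illustrates how the YAML settings could be used.
    """
    columns = columns or {"word": 0, "frequency": 1}
    delimiter = "\t" if fmt == "tsv" else ","
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    if has_header:
        next(reader, None)  # skip the header row
    rows = []
    for fields in reader:
        if not fields:
            continue  # ignore blank lines
        # Values are kept as strings; numeric conversion is left to the caller.
        rows.append({name: fields[index] for name, index in columns.items()})
    return rows


# Made-up sample data shaped like the trigram example: TSV, no header row.
sample = "in order to\t1200\t5.4\nas well as\t950\t4.8\n"
trigram_rows = read_reference_rows(
    sample,
    fmt="tsv",
    columns={"trigram": 0, "frequency": 1, "mi_score": 2},
    has_header=False,
)
print(trigram_rows[0])
```

The column mapping is what lets one reader handle unigram, bigram, and trigram files: only the names and positions change.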
Configuration Options
Required Fields
- display_name: Name shown in the UI
- files: Dictionary mapping file types to paths
- enabled: Whether to show in the UI
Optional Fields
- description: Help text for users
- format: "csv" or "tsv" (default: "csv")
- columns: Column mapping (default: {word: 0, frequency: 1})
- has_header: Whether files have header rows (default: false)
- header_prefix: Special prefix for headers like "#" (optional)
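The required fields and documented defaults can be expressed as a small normalization step. The sketch below is one possible way to do it in Python, not the application's own code: normalize_entry is a hypothetical name, and the empty-string default for description and the None default for header_prefix are assumptions, since the guide states no default for either.

```python
REQUIRED_FIELDS = ("display_name", "files", "enabled")

# Defaults for format, columns, and has_header come from the list above.
# The description and header_prefix defaults are assumptions for this sketch.
OPTIONAL_DEFAULTS = {
    "description": "",
    "format": "csv",
    "columns": {"word": 0, "frequency": 1},
    "has_header": False,
    "header_prefix": None,
}


def normalize_entry(name, entry):
    """Check required fields and fill in defaults for one reference-list entry."""
    missing = [field for field in REQUIRED_FIELDS if field not in entry]
    if missing:
        raise ValueError(f"{name}: missing required field(s): {', '.join(missing)}")
    normalized = {**OPTIONAL_DEFAULTS, **entry}
    # Copy the mapping so entries never share the default dict object.
    normalized["columns"] = dict(normalized["columns"])
    return normalized


# An entry that sets only the required fields, as parsed from YAML.
minimal = {
    "display_name": "Academic Word List",
    "files": {
        "token": "resources/reference_lists/en/awl_token.csv",
        "lemma": "resources/reference_lists/en/awl_lemma.csv",
    },
    "enabled": True,
}
normalized = normalize_entry("academic_words", minimal)
print(normalized["format"], normalized["columns"], normalized["has_header"])
```

Because has_header defaults to false, an entry whose files start with a header row needs has_header: true set explicitly.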
File Structure Expected
For each reference list, you need separate token and lemma files:
resources/reference_lists/en/
├── my_list_token.csv
├── my_list_lemma.csv
├── my_bigrams_token.csv
├── my_bigrams_lemma.csv
├── my_trigrams_token.csv
└── my_trigrams_lemma.csv
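Since each list needs both a token and a lemma file, a quick check before restarting the application can catch a missing or misnamed file. The following Python sketch is only an illustration: find_missing_files is a hypothetical helper, and it assumes the configured paths are relative to the directory the application runs from.

```python
from pathlib import Path


def find_missing_files(files, base_dir="."):
    """Return a list of problems for the token/lemma paths of one entry.

    Assumption: paths in the YAML are resolved relative to base_dir.
    """
    base = Path(base_dir)
    problems = []
    for kind in ("token", "lemma"):
        path = files.get(kind)
        if path is None:
            problems.append(f"{kind}: no path configured")
        elif not (base / path).is_file():
            problems.append(f"{kind}: file not found: {path}")
    return problems


# The "files" mapping of one entry, as it would look after parsing the YAML.
files = {
    "token": "resources/reference_lists/en/my_list_token.csv",
    "lemma": "resources/reference_lists/en/my_list_lemma.csv",
}
for problem in find_missing_files(files):
    print(problem)
```

An empty result means both files were found under the assumed base directory.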
Adding Japanese Lists
japanese:
  unigrams:
    jp_frequency:
      display_name: "Japanese Frequency List"
      description: "Common Japanese word frequencies"
      files:
        token: "resources/reference_lists/ja/jp_frequency_token.csv"
        lemma: "resources/reference_lists/ja/jp_frequency_lemma.csv"
      format: "csv"
      has_header: true
      enabled: true
Benefits of This System
- No Code Changes: Add new lists by editing only the YAML file
- Automatic UI: Checkboxes are generated automatically
- Flexible: Support for different file formats and column structures
- Organized: Clear separation by n-gram type
- Easy Enable/Disable: Toggle lists on/off with the enabled field
Current Status
The system is now configured with:
- COCA Spoken Frequency: Working with your existing file
- Placeholder slots: Ready for additional lists
- All n-gram types: Unigrams, bigrams, and trigrams supported
How to Use
- Add your files to the appropriate directory
- Edit the YAML to add configuration entries
- Set enabled: true to make them appear in the UI
- Restart the application to see the changes
The system will automatically:
- Generate checkboxes for each enabled list
- Group them by type (Unigrams, Bigrams, Trigrams)
- Load the data when selected
- Display vocabulary size information
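The filtering and grouping described above might look roughly like the Python sketch below. It is an illustration under assumptions, not the application's UI code: enabled_lists_by_type is a hypothetical name, the input is the already-parsed configuration for one language, and the sample marks one list as disabled purely to show the filter.

```python
def enabled_lists_by_type(language_config):
    """Map each n-gram type label to the display names of its enabled lists."""
    groups = {}
    for ngram_type in ("unigrams", "bigrams", "trigrams"):
        entries = language_config.get(ngram_type) or {}
        names = [
            entry.get("display_name", key)
            for key, entry in entries.items()
            if entry.get("enabled", False)
        ]
        if names:
            # "unigrams" becomes the section label "Unigrams", and so on.
            groups[ngram_type.capitalize()] = names
    return groups


# Parsed configuration for one language (sample values for illustration).
english = {
    "unigrams": {
        "academic_words": {"display_name": "Academic Word List", "enabled": True},
        "concreteness": {"display_name": "Concreteness Ratings", "enabled": False},
    },
    "bigrams": {
        "my_bigram_list": {"display_name": "My Bigram Associations", "enabled": True},
    },
}
print(enabled_lists_by_type(english))
# {'Unigrams': ['Academic Word List'], 'Bigrams': ['My Bigram Associations']}
```

Each name in the result would correspond to one checkbox; types with no enabled lists are omitted from the grouping.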
Example: Adding Multiple Lists at Once
english:
  unigrams:
    academic_words:
      display_name: "Academic Word List"
      description: "AWL vocabulary for academic writing"
      files:
        token: "resources/reference_lists/en/awl_token.csv"
        lemma: "resources/reference_lists/en/awl_lemma.csv"
      enabled: true
    concreteness:
      display_name: "Concreteness Ratings"
      description: "Concrete vs abstract word ratings"
      files:
        token: "resources/reference_lists/en/concrete_token.csv"
        lemma: "resources/reference_lists/en/concrete_lemma.csv"
      enabled: true
    age_acquisition:
      display_name: "Age of Acquisition"
      description: "Age when words are typically learned"
      files:
        token: "resources/reference_lists/en/aoa_token.csv"
        lemma: "resources/reference_lists/en/aoa_lemma.csv"
      enabled: true
Together with the existing COCA list, this will automatically create 4 checkboxes in the "Unigrams" section.