Simple Text Analyzer

A Streamlit web application for linguistic data analysis, providing lexical sophistication analysis and POS/dependency parsing tools for educational purposes.

Features

1. Lexical Sophistication Analysis

  • Single Text Mode: Educational interface showing detailed calculation steps
  • Batch Analysis Mode: Process multiple text files with downloadable results
  • Support for unigram, bigram, and trigram reference lists
  • Multi-column support for n-gram reference files
  • Configurable word type filtering (content words, function words)
  • Optional log₁₀ transformation of reference frequencies (see the sketch after this list)
  • Interactive density plots with actual token scores
  • Descriptive CSV column names in batch results
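
Under the hood, a score of this kind is typically the mean (optionally log₁₀-transformed) reference frequency of a text's tokens. The real logic lives in backend/lexical_sophistication.py; the following is only a minimal sketch of the idea, using a toy reference dictionary:

import math

def mean_log_frequency(tokens, ref_freq):
    """Mean log10 reference frequency over tokens found in the list.

    tokens   -- lowercased word tokens from the text
    ref_freq -- dict mapping word -> corpus frequency (e.g. COCA spoken)
    Tokens absent from the reference list are skipped rather than
    scored as zero, so the mean reflects only attested words.
    """
    scores = [math.log10(ref_freq[t]) for t in tokens if t in ref_freq]
    return sum(scores) / len(scores) if scores else float("nan")

# Rarer vocabulary pulls the mean down; that drop is the signal.
ref = {"the": 1_000_000, "cat": 12_000, "perambulate": 40}
print(mean_log_frequency(["the", "cat", "perambulate"], ref))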

2. POS and Dependency Parsing

  • Single Text Mode: Token analysis table with dependency visualization
  • Batch Analysis Mode: Process multiple files with TSV output
  • DisplaCy visualization for sentence structure
  • Named entity recognition
  • Comprehensive token-level linguistic analysis (sketched after this list)
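
These features map directly onto SpaCy's standard pipeline output. As a minimal sketch of what the parser surfaces (using the en_core_web_lg model listed later in this README):

import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_lg")  # or en_core_web_trf for accuracy
doc = nlp("The analyzer parses each sentence.")

# Token table: text, lemma, POS tag, dependency label, and head
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.dep_, token.head.text)

# Named entities, as shown in the single-text view
for ent in doc.ents:
    print(ent.text, ent.label_)

# DisplaCy returns SVG markup that the frontend can embed
svg = displacy.render(doc, style="dep")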

Supported Languages

  • English (en)
  • Japanese (ja)

Quick Start

The easiest way to run the application:

# Make the run script executable
chmod +x run_app.sh

# Run the application
./run_app.sh

This script will:

  • Set up a virtual environment (if using uv)
  • Install all dependencies
  • Run basic tests
  • Start the Streamlit application

Manual Installation

Option 1: Using uv (Recommended)

# Create virtual environment
uv venv

# Activate virtual environment
source .venv/bin/activate

# Install dependencies
uv pip install -r requirements.txt

# Run the application
streamlit run frontend/app.py

Option 2: Using pip

# Install dependencies
pip install -r requirements.txt

# Run the application
streamlit run frontend/app.py

Project Structure

simple-text-analyzer/
├── backend/
│   ├── __init__.py
│   ├── lexical_sophistication.py  # Core analysis logic
│   └── pos_parser.py              # POS/dependency parsing
├── frontend/
│   └── app.py                     # Streamlit web interface
├── resources/
│   └── reference_lists/
│       ├── en/                    # English reference lists
│       └── ja/                    # Japanese reference lists
├── .streamlit/
│   └── config.toml                # Streamlit configuration
├── requirements.txt               # Python dependencies
├── run_app.sh                     # Easy startup script
├── test_app.py                    # Basic functionality tests
├── test_functionality.py          # Comprehensive tests
└── README.md

Usage

Configuration

  1. Select language (English/Japanese)
  2. Choose SpaCy model size (trf recommended for accuracy, lg for speed)
  3. Select analysis tool (Lexical Sophistication or POS Parser)

Reference Lists

English

  • Default Lists:
    • COCA Spoken Frequency: Real frequency data from 77,000+ English words
    • Index 2: Placeholder for demonstration
  • Custom Lists: Upload your own CSV/TSV files
  • Multiple indices can be selected simultaneously

Japanese

  • Custom Upload Only: Upload CSV/TSV files that follow the naming convention below
  • Full support for Japanese text analysis

File Naming Convention

  • Unigram files: indexname_token.csv, indexname_lemma.csv
  • N-gram files: indexname_bigram.csv, indexname_trigram.csv
  • Multi-column n-gram files supported (frequency, MI, T-score, etc.); see the example below
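
For concreteness, a hypothetical multi-column bigram file (mylist_bigram.csv) could look like the following; the column headers and values here are purely illustrative, not a required schema:

ngram,frequency,MI,T-score
of the,123456,1.25,310.2
in order,4567,3.80,65.4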

File Upload

  • Supported formats: .txt files, .zip archives
  • Maximum file size: 100 MB
  • Text encoding: UTF-8 or UTF-16 with automatic detection (sketched after this list)
  • Batch processing with progress indicators
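
The app's exact upload handling isn't reproduced here, but the UTF-8/UTF-16 auto-detection described above can be as simple as a decode with a fallback; a minimal sketch:

def read_text(raw: bytes) -> str:
    """Decode uploaded bytes: UTF-8 first, then UTF-16.

    'utf-8-sig' also strips a UTF-8 byte-order mark if present;
    the 'utf-16' codec uses the BOM to pick a byte order. Bytes
    that decode as neither raise UnicodeDecodeError, which the
    app reports as a file encoding error.
    """
    try:
        return raw.decode("utf-8-sig")
    except UnicodeDecodeError:
        return raw.decode("utf-16")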

Key Features

COCA Integration

  • Built-in access to COCA (Corpus of Contemporary American English) spoken frequency data
  • Over 77,000 English words with authentic frequency counts
  • Seamless integration as a default reference list
  • No file upload required for basic frequency analysis

Session Management

  • 10-minute session timeout with automatic cleanup (illustrated after this list)
  • Language switching clears session data
  • Progress tracking for long-running operations
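
A minimal sketch of how such a timeout can be built on Streamlit's st.session_state (the app's actual bookkeeping may differ):

import time
import streamlit as st

TIMEOUT_SECONDS = 600  # 10 minutes of inactivity

def enforce_timeout():
    """Clear uploads, results, and reference lists after a quiet period."""
    now = time.time()
    if now - st.session_state.get("last_active", now) > TIMEOUT_SECONDS:
        st.session_state.clear()
    st.session_state["last_active"] = now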

Error Handling

  • Comprehensive error messages with suggestions
  • Graceful handling of file encoding issues
  • Partial results available even with errors
  • Error summaries for batch processing

Performance

  • Optimized for 100 files × 1000 words each
  • Real-time progress indicators
  • Memory-efficient model loading
  • Concurrent user support via session isolation

Architecture

  • Backend: Pure Python modules using SpaCy for NLP processing
  • Frontend: Streamlit web interface with advanced session state management
  • Separation of Concerns: Backend modules never import Streamlit (see the sketch after this list)
  • Model Management: Dynamic loading to minimize memory usage
  • Data Processing: Efficient handling of large text collections
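
The pattern in miniature; analyze_text is a hypothetical stand-in for the real functions in backend/lexical_sophistication.py, which take plain Python data and never touch Streamlit:

# Backend style: a pure function, no Streamlit imports (hypothetical)
def analyze_text(text: str, ref_freq: dict) -> dict:
    tokens = text.lower().split()
    hits = [t for t in tokens if t in ref_freq]
    return {"tokens": len(tokens), "coverage": len(hits) / max(len(tokens), 1)}

# Frontend style: all UI concerns stay in frontend/app.py
import streamlit as st

text = st.text_area("Paste a text to analyze")
if st.button("Analyze"):
    st.json(analyze_text(text, {"the": 1, "cat": 1}))  # toy reference list

Keeping the analysis functions UI-free is what makes them directly testable by test_app.py and test_functionality.py.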

Development

The application follows educational best practices:

  • Clear separation between calculation logic and UI
  • Detailed token-level analysis for learning
  • Progress indicators for batch processing
  • Comprehensive error handling
  • Downloadable results for further analysis
  • Extensive testing coverage

SpaCy Models

The application requires the following SpaCy models:

  • en_core_web_trf - English transformer model (default)
  • en_core_web_lg - English large model (fallback)
  • ja_core_news_lg - Japanese large model

These are automatically installed via requirements.txt. The transformer models provide higher accuracy but require more computational resources.
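
To get the same prefer-trf, fall-back-to-lg behavior in your own scripts, a minimal loader could look like this (the application's internal loader may differ):

import spacy

def load_english_model(size: str = "trf"):
    """Prefer the transformer model; fall back to the large model."""
    try:
        return spacy.load(f"en_core_web_{size}")
    except OSError:  # model not installed or failed to load
        return spacy.load("en_core_web_lg")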

Testing

Run the test scripts to verify functionality:

# Basic import and instantiation tests
python test_app.py

# Comprehensive functionality tests
python test_functionality.py

Troubleshooting

Common Issues

  1. SpaCy model not found: Ensure the models are installed via requirements.txt, or install them manually (see below)
  2. File encoding errors: The application supports UTF-8 and UTF-16 with automatic detection; re-save files in one of these encodings
  3. Memory issues: Use the 'lg' model instead of 'trf'; transformer models need substantially more memory
  4. Session timeout: Data is automatically cleared after 10 minutes of inactivity; re-upload files to continue
  5. Slow processing: Transformer models (trf) are more accurate but slower than the large models (lg)
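
For issue 1, the models can also be installed manually; SpaCy's downloader is callable from Python (equivalent to python -m spacy download <name> on the command line):

from spacy.cli import download

download("en_core_web_lg")  # likewise en_core_web_trf and ja_core_news_lg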

Getting Help

  • Check the error messages in the application; they provide specific guidance
  • Verify your reference list file format and naming
  • Ensure text files are properly encoded
  • Try with smaller datasets first

Adding New Reference Lists

See YAML_CONFIG_GUIDE.md for detailed instructions.

Quick example: add a new unigram list:

english:
  unigrams:
    my_new_list:
      display_name: "My New Word List"
      description: "Description of what this list contains"
      files:
        token: "resources/reference_lists/en/my_new_list_token.csv"
        lemma: "resources/reference_lists/en/my_new_list_lemma.csv"
      format: "csv"
      enabled: true

That's it! The checkbox will appear automatically in the UI.

License

This project is designed for educational purposes in linguistic data analysis courses.