Simple Text Analyzer

A Streamlit web application for linguistic data analysis, providing lexical sophistication analysis and POS/dependency parsing tools for educational purposes.

Features

1. Lexical Sophistication Analysis

  • Single Text Mode: Educational interface showing detailed calculation steps
  • Batch Analysis Mode: Process multiple text files with downloadable results
  • Support for unigram, bigram, and trigram reference lists
  • Multi-column support for n-gram reference files
  • Configurable word type filtering (content words, function words)
  • Optional log₁₀ transformation of reference frequencies (see the sketch after this list)
  • Interactive density plots with actual token scores
  • Descriptive CSV column names in batch results
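
Under the hood, a score of this kind is typically the mean (optionally log₁₀-transformed) reference frequency of a text's tokens. The real logic lives in backend/lexical_sophistication.py; the following is only a minimal sketch of the idea, using a toy reference dictionary:

import math

def mean_log_frequency(tokens, ref_freq):
    """Mean log10 reference frequency over tokens found in the list.

    tokens   -- lowercased word tokens from the text
    ref_freq -- dict mapping word -> corpus frequency (e.g. COCA spoken)
    Tokens absent from the reference list are skipped rather than
    scored as zero, so the mean reflects only attested words.
    """
    scores = [math.log10(ref_freq[t]) for t in tokens if t in ref_freq]
    return sum(scores) / len(scores) if scores else float("nan")

# Rarer vocabulary pulls the mean down; that drop is the signal.
ref = {"the": 1_000_000, "cat": 12_000, "perambulate": 40}
print(mean_log_frequency(["the", "cat", "perambulate"], ref))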

2. POS and Dependency Parsing

  • Single Text Mode: Token analysis table with dependency visualization
  • Batch Analysis Mode: Process multiple files with TSV output
  • DisplaCy visualization for sentence structure
  • Named entity recognition
  • Comprehensive token-level linguistic analysis (sketched after this list)
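
These features map directly onto SpaCy's standard pipeline output. As a minimal sketch of what the parser surfaces (using the en_core_web_lg model listed later in this README):

import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_lg")  # or en_core_web_trf for accuracy
doc = nlp("The analyzer parses each sentence.")

# Token table: text, lemma, POS tag, dependency label, and head
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.dep_, token.head.text)

# Named entities, as shown in the single-text view
for ent in doc.ents:
    print(ent.text, ent.label_)

# DisplaCy returns SVG markup that the frontend can embed
svg = displacy.render(doc, style="dep")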

Supported Languages

  • English (en)
  • Japanese (ja)

Quick Start

The easiest way to run the application:

# Make the run script executable
chmod +x run_app.sh

# Run the application
./run_app.sh

This script will:

  • Set up a virtual environment (if using uv)
  • Install all dependencies
  • Run basic tests
  • Start the Streamlit application

Manual Installation

Option 1: Using uv (Recommended)

# Create virtual environment
uv venv

# Activate virtual environment
source .venv/bin/activate

# Install dependencies
uv pip install -r requirements.txt

# Run the application
streamlit run frontend/app.py

Option 2: Using pip

# Install dependencies
pip install -r requirements.txt

# Run the application
streamlit run frontend/app.py

Project Structure

simple-text-analyzer/
├── backend/
│   ├── __init__.py
│   ├── lexical_sophistication.py  # Core analysis logic
│   └── pos_parser.py              # POS/dependency parsing
├── frontend/
│   └── app.py                     # Streamlit web interface
├── resources/
│   └── reference_lists/
│       ├── en/                    # English reference lists
│       └── ja/                    # Japanese reference lists
├── .streamlit/
│   └── config.toml                # Streamlit configuration
├── requirements.txt               # Python dependencies
├── run_app.sh                     # Easy startup script
├── test_app.py                    # Basic functionality tests
├── test_functionality.py          # Comprehensive tests
└── README.md

Usage

Configuration

  1. Select language (English/Japanese)
  2. Choose SpaCy model size (trf recommended for accuracy, lg for speed)
  3. Select analysis tool (Lexical Sophistication or POS Parser)

Reference Lists

English

  • Default Lists:
    • COCA Spoken Frequency: Real frequency data from 77,000+ English words
    • Index 2: Placeholder for demonstration
  • Custom Lists: Upload your own CSV/TSV files
  • Multiple indices can be selected simultaneously

Japanese

  • Custom Upload Only: Upload CSV/TSV files that follow the naming convention below
  • Full support for Japanese text analysis

File Naming Convention

  • Unigram files: indexname_token.csv, indexname_lemma.csv
  • N-gram files: indexname_bigram.csv, indexname_trigram.csv
  • Multi-column n-gram files supported (frequency, MI, T-score, etc.); see the example below
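
For concreteness, a hypothetical multi-column bigram file (mylist_bigram.csv) could look like the following; the column headers and values here are purely illustrative, not a required schema:

ngram,frequency,MI,T-score
of the,123456,1.25,310.2
in order,4567,3.80,65.4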

File Upload

  • Supported formats: .txt files, .zip archives
  • Maximum file size: 100 MB
  • Text encoding: UTF-8 or UTF-16 with automatic detection (sketched after this list)
  • Batch processing with progress indicators
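
The app's exact upload handling isn't reproduced here, but the UTF-8/UTF-16 auto-detection described above can be as simple as a decode with a fallback; a minimal sketch:

def read_text(raw: bytes) -> str:
    """Decode uploaded bytes: UTF-8 first, then UTF-16.

    'utf-8-sig' also strips a UTF-8 byte-order mark if present;
    the 'utf-16' codec uses the BOM to pick a byte order. Bytes
    that decode as neither raise UnicodeDecodeError, which the
    app reports as a file encoding error.
    """
    try:
        return raw.decode("utf-8-sig")
    except UnicodeDecodeError:
        return raw.decode("utf-16")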

Key Features

COCA Integration

  • Built-in access to COCA (Corpus of Contemporary American English) spoken frequency data
  • Over 77,000 English words with authentic frequency counts
  • Seamless integration as a default reference list
  • No file upload required for basic frequency analysis

Session Management

  • 10-minute session timeout with automatic cleanup (illustrated after this list)
  • Language switching clears session data
  • Progress tracking for long-running operations
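
A minimal sketch of how such a timeout can be built on Streamlit's st.session_state (the app's actual bookkeeping may differ):

import time
import streamlit as st

TIMEOUT_SECONDS = 600  # 10 minutes of inactivity

def enforce_timeout():
    """Clear uploads, results, and reference lists after a quiet period."""
    now = time.time()
    if now - st.session_state.get("last_active", now) > TIMEOUT_SECONDS:
        st.session_state.clear()
    st.session_state["last_active"] = now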

Error Handling

  • Comprehensive error messages with suggestions
  • Graceful handling of file encoding issues
  • Partial results available even with errors
  • Error summaries for batch processing

Performance

  • Optimized for 100 files × 1000 words each
  • Real-time progress indicators
  • Memory-efficient model loading
  • Concurrent user support via session isolation

Architecture

  • Backend: Pure Python modules using SpaCy for NLP processing
  • Frontend: Streamlit web interface with advanced session state management
  • Separation of Concerns: Backend modules never import Streamlit (see the sketch after this list)
  • Model Management: Dynamic loading to minimize memory usage
  • Data Processing: Efficient handling of large text collections
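
The pattern in miniature; analyze_text is a hypothetical stand-in for the real functions in backend/lexical_sophistication.py, which take plain Python data and never touch Streamlit:

# Backend style: a pure function, no Streamlit imports (hypothetical)
def analyze_text(text: str, ref_freq: dict) -> dict:
    tokens = text.lower().split()
    hits = [t for t in tokens if t in ref_freq]
    return {"tokens": len(tokens), "coverage": len(hits) / max(len(tokens), 1)}

# Frontend style: all UI concerns stay in frontend/app.py
import streamlit as st

text = st.text_area("Paste a text to analyze")
if st.button("Analyze"):
    st.json(analyze_text(text, {"the": 1, "cat": 1}))  # toy reference list

Keeping the analysis functions UI-free is what makes them directly testable by test_app.py and test_functionality.py.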

Development

The application follows educational best practices:

  • Clear separation between calculation logic and UI
  • Detailed token-level analysis for learning
  • Progress indicators for batch processing
  • Comprehensive error handling
  • Downloadable results for further analysis
  • Extensive testing coverage

SpaCy Models

The application requires the following SpaCy models:

  • en_core_web_trf - English transformer model (default)
  • en_core_web_lg - English large model (fallback)
  • ja_core_news_lg - Japanese large model

These are automatically installed via requirements.txt. The transformer models provide higher accuracy but require more computational resources.
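
To get the same prefer-trf, fall-back-to-lg behavior in your own scripts, a minimal loader could look like this (the application's internal loader may differ):

import spacy

def load_english_model(size: str = "trf"):
    """Prefer the transformer model; fall back to the large model."""
    try:
        return spacy.load(f"en_core_web_{size}")
    except OSError:  # model not installed or failed to load
        return spacy.load("en_core_web_lg")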

Testing

Run the test scripts to verify functionality:

# Basic import and instantiation tests
python test_app.py

# Comprehensive functionality tests
python test_functionality.py

Troubleshooting

Common Issues

  1. SpaCy model not found: Ensure the models are installed via requirements.txt, or install them manually (see below)
  2. File encoding errors: The application supports UTF-8 and UTF-16 with automatic detection; re-save files in one of these encodings
  3. Memory issues: Use the 'lg' model instead of 'trf'; transformer models need substantially more memory
  4. Session timeout: Data is automatically cleared after 10 minutes of inactivity; re-upload files to continue
  5. Slow processing: Transformer models (trf) are more accurate but slower than the large models (lg)
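
For issue 1, the models can also be installed manually; SpaCy's downloader is callable from Python (equivalent to python -m spacy download <name> on the command line):

from spacy.cli import download

download("en_core_web_lg")  # likewise en_core_web_trf and ja_core_news_lg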

Getting Help

  • Check the error messages in the application; they provide specific guidance
  • Verify your reference list file format and naming
  • Ensure text files are properly encoded
  • Try with smaller datasets first

Adding New Reference Lists

See YAML_CONFIG_GUIDE.md for detailed instructions.

Quick example: add a new unigram list:

english:
  unigrams:
    my_new_list:
      display_name: "My New Word List"
      description: "Description of what this list contains"
      files:
        token: "resources/reference_lists/en/my_new_list_token.csv"
        lemma: "resources/reference_lists/en/my_new_list_lemma.csv"
      format: "csv"
      enabled: true

That's it! The checkbox will appear automatically in the UI.

License

This project is designed for educational purposes in linguistic data analysis courses.