Simple Text Analyzer
A Streamlit web application for linguistic data analysis, providing lexical sophistication analysis and POS/dependency parsing tools for educational purposes.
Features
1. Lexical Sophistication Analysis
- Single Text Mode: Educational interface showing detailed calculation steps
- Batch Analysis Mode: Process multiple text files with downloadable results
- Support for unigram, bigram, and trigram reference lists
- Multi-column support for n-gram reference files
- Configurable word type filtering (content words, function words)
- Optional log transformation
- Interactive density plots with actual token scores
- Proper CSV column naming for batch results
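The core metric can be sketched as the mean (log-transformed) reference frequency of the tokens found in the selected list. This is a simplified illustration, not the backend's exact implementation: the real module also handles lemmas, n-grams, and word-type filtering, and the log base and the treatment of out-of-list tokens are assumptions here.

```python
import math

def sophistication_score(tokens, freq_list, log_transform=True):
    """Mean reference frequency of the tokens present in the list.

    Simplified sketch: tokens missing from the reference list are
    skipped; the app may treat them differently (e.g. floor frequency).
    """
    scores = []
    for token in tokens:
        freq = freq_list.get(token.lower())
        if freq is None:
            continue
        scores.append(math.log10(freq) if log_transform else freq)
    return sum(scores) / len(scores) if scores else 0.0

# Toy reference list: word -> corpus frequency
coca_like = {"the": 1_000_000, "analyze": 5_000, "ubiquitous": 300}
print(sophistication_score(["The", "ubiquitous", "gadget"], coca_like))
```

Lower scores indicate rarer (more sophisticated) vocabulary when frequencies are used directly; the single-text mode exposes these per-token values in its calculation steps.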
2. POS and Dependency Parsing
- Single Text Mode: Token analysis table with dependency visualization
- Batch Analysis Mode: Process multiple files with TSV output
- DisplaCy visualization for sentence structure
- Named entity recognition
- Comprehensive linguistic analysis
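The batch TSV output can be pictured as one row per token with its analysis attributes. The column set below is an illustrative guess at the output shape, not the app's exact schema; the tuples stand in for the attributes spaCy exposes (`token.text`, `token.lemma_`, `token.pos_`, `token.dep_`, `token.head.text`).

```python
def tokens_to_tsv(rows):
    """Serialize token analyses to TSV, one token per line.

    `rows` are (text, lemma, pos, dep, head) tuples; column names are
    an assumption for illustration.
    """
    header = "text\tlemma\tpos\tdep\thead"
    lines = ["\t".join(str(v) for v in row) for row in rows]
    return "\n".join([header, *lines])

rows = [("Dogs", "dog", "NOUN", "nsubj", "bark"),
        ("bark", "bark", "VERB", "ROOT", "bark")]
print(tokens_to_tsv(rows))
```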
Supported Languages
- English (en)
- Japanese (ja)
Quick Start
The easiest way to run the application:
# Make the run script executable
chmod +x run_app.sh
# Run the application
./run_app.sh
This script will:
- Set up a virtual environment (if using uv)
- Install all dependencies
- Run basic tests
- Start the Streamlit application
Manual Installation
Option 1: Using uv (Recommended)
# Create virtual environment
uv venv
# Activate virtual environment
source .venv/bin/activate
# Install dependencies
uv pip install -r requirements.txt
# Run the application
streamlit run frontend/app.py
Option 2: Using pip
# Install dependencies
pip install -r requirements.txt
# Run the application
streamlit run frontend/app.py
Project Structure
simple-text-analyzer/
├── backend/
│   ├── __init__.py
│   ├── lexical_sophistication.py    # Core analysis logic
│   └── pos_parser.py                # POS/dependency parsing
├── frontend/
│   └── app.py                       # Streamlit web interface
├── resources/
│   └── reference_lists/
│       ├── en/                      # English reference lists
│       └── ja/                      # Japanese reference lists
├── .streamlit/
│   └── config.toml                  # Streamlit configuration
├── requirements.txt                 # Python dependencies
├── run_app.sh                       # Easy startup script
├── test_app.py                      # Basic functionality tests
├── test_functionality.py            # Comprehensive tests
└── README.md
Usage
Configuration
- Select language (English/Japanese)
- Choose SpaCy model size (trf recommended for accuracy, lg for speed)
- Select analysis tool (Lexical Sophistication or POS Parser)
Reference Lists
English
- Default Lists:
- COCA Spoken Frequency: Real frequency data from 77,000+ English words
- Index 2: Placeholder for demonstration
- Custom Lists: Upload your own CSV/TSV files
- Multiple indices can be selected simultaneously
Japanese
- Custom Upload Only: Upload CSV/TSV files following naming convention
- Full support for Japanese text analysis
File Naming Convention
- Unigram files: indexname_token.csv, indexname_lemma.csv
- N-gram files: indexname_bigram.csv, indexname_trigram.csv
- Multi-column n-gram files are supported (frequency, MI, T-score, etc.)
File Upload
- Supported formats: .txt files, .zip archives
- Maximum file size: 100 MB
- Text encoding: UTF-8 or UTF-16 (automatic detection)
- Batch processing with progress indicators
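The automatic encoding detection can be sketched with standard-library BOM checks: UTF-16 byte-order marks are recognized first, then UTF-8 is tried, with BOM-less UTF-16 as a last resort. This is an assumed shape for the detection logic, not necessarily what the app does internally.

```python
import codecs

def decode_upload(raw: bytes) -> str:
    """Decode uploaded bytes as UTF-8 or UTF-16 (sketch).

    BOMs identify UTF-16; otherwise UTF-8 is tried first. The app's
    actual detection logic may differ.
    """
    if raw.startswith(codecs.BOM_UTF16_LE) or raw.startswith(codecs.BOM_UTF16_BE):
        return raw.decode("utf-16")      # decode() consumes the BOM
    try:
        return raw.decode("utf-8-sig")   # also tolerates a UTF-8 BOM
    except UnicodeDecodeError:
        return raw.decode("utf-16")      # BOM-less UTF-16 fallback

print(decode_upload("こんにちは".encode("utf-16")))
print(decode_upload("hello".encode("utf-8")))
```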
Key Features
COCA Integration
- Built-in access to COCA (Corpus of Contemporary American English) spoken frequency data
- Over 77,000 English words with authentic frequency counts
- Seamless integration as a default reference list
- No file upload required for basic frequency analysis
Session Management
- 10-minute session timeout with automatic cleanup
- Language switching clears session data
- Progress tracking for long-running operations
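The timeout behavior can be modeled as a sweep over per-session last-activity timestamps. This is an illustrative model only; the app's actual cleanup lives in its Streamlit session-state handling, and the function and variable names below are hypothetical.

```python
import time

SESSION_TTL = 10 * 60  # seconds, matching the 10-minute timeout above

def expired_sessions(sessions, now=None):
    """Return IDs of sessions idle longer than SESSION_TTL.

    `sessions` maps session_id -> last-activity timestamp; `now` is
    injectable so the sweep is deterministic in tests.
    """
    now = time.time() if now is None else now
    return [sid for sid, last in sessions.items() if now - last > SESSION_TTL]

sessions = {"a": 1_000.0, "b": 1_500.0}
print(expired_sessions(sessions, now=1_000.0 + 601))  # "a" idle for 601 s
```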
Error Handling
- Comprehensive error messages with suggestions
- Graceful handling of file encoding issues
- Partial results available even with errors
- Error summaries for batch processing
Performance
- Optimized for 100 files × 1,000 words each
- Real-time progress indicators
- Memory-efficient model loading
- Concurrent user support via session isolation
Architecture
- Backend: Pure Python modules using SpaCy for NLP processing
- Frontend: Streamlit web interface with advanced session state management
- Separation of Concerns: Backend modules do not import Streamlit
- Model Management: Dynamic loading to minimize memory usage
- Data Processing: Efficient handling of large text collections
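The dynamic model loading with a trf-to-lg fallback (mentioned under Configuration and Troubleshooting) might look like the sketch below. The caching/fallback shape is an assumption about the app's design; `loader` stands in for `spacy.load`, which raises OSError for uninstalled models, so the sketch stays testable without spaCy.

```python
from functools import lru_cache

# Preference order per language: trf first, lg as fallback
MODEL_CANDIDATES = {"en": ["en_core_web_trf", "en_core_web_lg"],
                    "ja": ["ja_core_news_lg"]}

@lru_cache(maxsize=2)  # keep at most two models resident in memory
def load_model(lang, loader):
    """Load the first available model for `lang`, caching the result."""
    for name in MODEL_CANDIDATES[lang]:
        try:
            return loader(name)
        except OSError:  # raised by spacy.load for missing models
            continue
    raise RuntimeError(f"No model installed for language: {lang}")

def fake_load(name):
    """Stand-in loader: pretend only the lg model is installed."""
    if name == "en_core_web_trf":
        raise OSError("model not installed")
    return name

print(load_model("en", fake_load))  # falls back to en_core_web_lg
```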
Development
The application follows educational best practices:
- Clear separation between calculation logic and UI
- Detailed token-level analysis for learning
- Progress indicators for batch processing
- Comprehensive error handling
- Downloadable results for further analysis
- Extensive testing coverage
SpaCy Models
The application requires the following SpaCy models:
- en_core_web_trf: English transformer model (default)
- en_core_web_lg: English large model (fallback)
- ja_core_news_lg: Japanese large model
These are automatically installed via requirements.txt. The transformer models provide higher accuracy but require more computational resources.
Testing
Run the test scripts to verify functionality:
# Basic import and instantiation tests
python test_app.py
# Comprehensive functionality tests
python test_functionality.py
Troubleshooting
Common Issues
- SpaCy model not found: Ensure models are installed via requirements.txt
- File encoding errors: The application supports UTF-8 and UTF-16 with automatic detection
- Memory issues: Use the 'lg' model instead of 'trf'; transformer models require substantially more memory
- Session timeout: Data is automatically cleared after 10 minutes of inactivity
- Slow processing: Transformer models (trf) are more accurate but slower than large models (lg)
Getting Help
- Check the error messages in the application; they provide specific guidance
- Verify your reference list file format and naming
- Ensure text files are properly encoded
- Try with smaller datasets first
Adding New Reference Lists
See YAML_CONFIG_GUIDE.md for detailed instructions.
Quick example - add a new unigram list:
english:
unigrams:
my_new_list:
display_name: "My New Word List"
description: "Description of what this list contains"
files:
token: "resources/reference_lists/en/my_new_list_token.csv"
lemma: "resources/reference_lists/en/my_new_list_lemma.csv"
format: "csv"
enabled: true
That's it! The checkbox will appear automatically in the UI.
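Once parsed (e.g. with `yaml.safe_load`), the YAML above becomes a nested dict, and the UI only needs to collect the enabled entries. The traversal below is an illustrative sketch of that step; the function name and exact structure are assumptions, not the app's actual code.

```python
# Parsed form of a config like the YAML example above
config = {
    "english": {
        "unigrams": {
            "my_new_list": {
                "display_name": "My New Word List",
                "files": {"token": "resources/reference_lists/en/my_new_list_token.csv"},
                "enabled": True,
            },
            "old_list": {"display_name": "Old List", "enabled": False},
        }
    }
}

def enabled_lists(config, language, kind):
    """Map list key -> display_name for every enabled entry."""
    entries = config.get(language, {}).get(kind, {})
    return {key: meta["display_name"]
            for key, meta in entries.items() if meta.get("enabled")}

print(enabled_lists(config, "english", "unigrams"))  # only my_new_list
```

Because only `enabled: true` entries survive the filter, toggling that flag is enough to show or hide a list's checkbox.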
License
This project is designed for educational purposes in linguistic data analysis courses.