---
title: Simple Text Analyzer
emoji: 📊
colorFrom: blue
colorTo: blue
sdk: docker
app_port: 8501
tags:
- streamlit
- nlp
- linguistics
- japanese
- corpus-linguistics
pinned: false
short_description: Lexical sophistication analyzer for EN and JP texts
license: cc-by-nc-4.0
---
# Simple Text Analyzer
A web-based application for lexical sophistication analysis of both English and Japanese texts. The tool provides detailed linguistic analysis using corpus-based frequency data and modern NLP techniques.
## 🌟 Features
### Multi-Language Support
- English: COCA corpus frequency analysis with unigrams, bigrams, and trigrams
- Japanese: BCCWJ (written) and CSJ (spoken) corpus integration with POS-aware frequency matching
### Analysis Capabilities
- Lexical Sophistication: Frequency-based lexical complexity analysis
- Part-of-Speech Analysis: Detailed POS tagging and classification
- N-gram Analysis: Bigram and trigram frequency analysis
- Content vs Function Words: Automatic classification and separate analysis
- Batch Processing: Multiple file analysis with comparative results
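The content-vs-function split above can be sketched as a small POS-based classifier. This is a minimal illustration assuming Universal Dependencies coarse tags (as used by spaCy); the tag sets and the `classify_tokens` name are illustrative, not the analyzer's actual API:

```python
# Sketch of content-vs-function word classification by coarse POS tag.
# Tag sets follow the Universal Dependencies scheme; the analyzer's
# actual rules may differ.
CONTENT_POS = {"NOUN", "VERB", "ADJ", "ADV", "PROPN"}
FUNCTION_POS = {"ADP", "AUX", "CCONJ", "SCONJ", "DET", "PART", "PRON"}

def classify_tokens(tagged_tokens):
    """Split (token, pos) pairs into content and function word lists."""
    content, function = [], []
    for token, pos in tagged_tokens:
        if pos in CONTENT_POS:
            content.append(token)
        elif pos in FUNCTION_POS:
            function.append(token)
    return content, function
```

Tokens whose POS falls in neither set (punctuation, symbols, interjections) are simply excluded from both groups, which is the usual convention in lexical sophistication measures.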
### Japanese Language Features ✨ NEW
- BCCWJ Integration: Balanced Corpus of Contemporary Written Japanese
  - Raw frequency counts
  - Normalized frequency (per million words)
  - Frequency rankings
- CSJ Integration: Corpus of Spontaneous Japanese (spoken data)
  - Academic and conversational speech patterns
  - Multiple speech style analysis
- POS-Aware Matching: Composite key lookup using `lemma + POS` for accurate frequency matching
- Robust Fallback System: Three-tier lookup strategy:
  - Primary: `lemma_pos` composite key (e.g., `行く_動詞-自立`)
  - Fallback 1: `lemma`-only lookup
  - Fallback 2: `surface_form` lookup
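The three-tier fallback strategy described above can be sketched as a cascade of dictionary lookups. The key format and field names (`lemma`, `pos`, `surface`) here are illustrative, not the actual schema:

```python
# Sketch of the three-tier frequency lookup: composite key first,
# then lemma only, then surface form. Returns None if all tiers miss.
def lookup_frequency(token, freq_table):
    """Look up a frequency entry for a tagged token dict."""
    # Primary: lemma + POS composite key (e.g. "行く_動詞-自立")
    composite = f"{token['lemma']}_{token['pos']}"
    if composite in freq_table:
        return freq_table[composite]
    # Fallback 1: lemma-only lookup
    if token["lemma"] in freq_table:
        return freq_table[token["lemma"]]
    # Fallback 2: surface form lookup
    return freq_table.get(token["surface"])
```

The cascade trades a little precision for coverage: the composite key disambiguates homographs by POS, while the later tiers catch tokens whose POS tag or lemmatization does not match the corpus entry exactly.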
## 🚀 Quick Start
### Prerequisites
- Python 3.8+
- uv (recommended) or pip for package management
### Installation
```bash
# Clone the repository
git clone https://github.com/your-repo/simple-text-analyzer.git
cd simple-text-analyzer

# Install dependencies using uv
uv sync

# Or using pip
pip install -r requirements.txt

# Install required spaCy models
python -m spacy download en_core_web_trf
python -m spacy download ja_core_news_md  # For Japanese support
```
### Running the Application
```bash
# Using uv
uv run streamlit run web_app/app.py

# Or directly
streamlit run web_app/app.py
```
## 📚 Supported Corpora
### English
- COCA Spoken: Corpus of Contemporary American English (spoken subcorpus)
- COCA Magazine: Magazine text frequency data
- Bigram/Trigram Analysis: Multi-word expression frequency and association measures
### Japanese
#### BCCWJ (Balanced Corpus of Contemporary Written Japanese)
- 182,604 unique word forms with POS tags
- Multiple text registers (books, newspapers, magazines, etc.)
- Comprehensive written language coverage
#### CSJ (Corpus of Spontaneous Japanese)
- 41,892 unique word forms from spoken data
- Academic presentations and casual conversations
- Natural speech pattern analysis
## 🔧 Architecture
### Core Components
- LexicalSophisticationAnalyzer: Main analysis engine with multi-language support
- ConfigManager: Flexible configuration system for corpus integration
- ReferenceManager: Dynamic reference list management
- SessionManager: State management for web interface
### Japanese Integration Features
- Composite Key Matching: Precision matching using lemma and POS combinations
- Extensible Design: Easy addition of new subcorpora via YAML configuration
- Fallback Mechanisms: Robust lookup strategies for maximum coverage
- Performance Optimized: Pre-computed lookup dictionaries for fast analysis
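The pre-computed lookup dictionaries and normalized frequencies mentioned above can be sketched as a one-pass build over corpus rows. The row layout `(lemma, pos, raw_count)` and the `build_lookup` name are assumptions for illustration, not the actual internal format:

```python
# Sketch of building a pre-computed composite-key dictionary from
# corpus frequency rows, with per-million-word normalization and ranks.
def build_lookup(rows, corpus_size):
    """rows: iterable of (lemma, pos, raw_count) tuples.
    Returns a dict keyed by "lemma_pos" composite keys."""
    lookup = {}
    for lemma, pos, raw in rows:
        lookup[f"{lemma}_{pos}"] = {
            "raw": raw,
            # Normalized frequency: occurrences per million corpus words
            "per_million": raw / corpus_size * 1_000_000,
        }
    # Assign frequency ranks (1 = most frequent)
    ranked = sorted(lookup.values(), key=lambda e: e["raw"], reverse=True)
    for rank, entry in enumerate(ranked, start=1):
        entry["rank"] = rank
    return lookup
```

Building this dictionary once at load time makes each per-token lookup an O(1) hash access, which is what keeps analysis fast over the 180K+ BCCWJ entries.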
## 📁 File Structure
```
simple-text-analyzer/
├── web_app/                       # Streamlit web application
│   ├── app.py                     # Main application entry
│   ├── config_manager.py          # Configuration management
│   ├── reference_manager.py       # Reference list handling
│   └── components/                # UI components
├── text_analyzer/                 # Core analysis modules
│   ├── lexical_sophistication.py  # Main analyzer
│   ├── frequency_analyzer.py      # Frequency analysis
│   └── pos_parser.py              # POS tagging utilities
├── config/                        # Configuration files
│   └── reference_lists.yaml       # Corpus configurations
├── resources/                     # Corpus data files
│   └── reference_lists/
│       ├── en/                    # English corpus files
│       └── ja/                    # Japanese corpus files
└── test/                          # Test modules
```
## 🧪 Testing
Test the Japanese integration:
```bash
uv run python test_japanese_integration.py
```
Expected output:
- ✅ spaCy model loading
- ✅ Reference data loading (182K+ BCCWJ entries, 41K+ CSJ entries)
- ✅ Composite key lookup functionality
- ✅ Fallback mechanism verification
- ✅ Complete text analysis pipeline
## 📝 Usage Examples
### Japanese Text Analysis
```python
from text_analyzer.lexical_sophistication import LexicalSophisticationAnalyzer

# Initialize Japanese analyzer
analyzer = LexicalSophisticationAnalyzer(language="ja", model_size="md")

# Load Japanese corpus references
selected_indices = ["BCCWJ_frequency", "CSJ_frequency"]

# Analyze Japanese text
results = analyzer.analyze_text(
    "私は毎日学校に行きます。",
    selected_indices
)

# Access frequency scores
for token in results['token_details']:
    print(f"{token['token']}: BCCWJ={token.get('BCCWJ_frequency_lemma', 'NA')}")
```
### English Text Analysis
```python
# Initialize English analyzer
analyzer = LexicalSophisticationAnalyzer(language="en", model_size="trf")

# Analyze with COCA frequency data
results = analyzer.analyze_text(
    "The students studied linguistics carefully.",
    ["COCA_spoken_frequency"]
)
```
## 🔧 Configuration
### Adding New Japanese Subcorpora
The system is designed for easy expansion. To add a new subcorpus (e.g., BCCWJ Books):
```yaml
# config/reference_lists.yaml
japanese:
  unigrams:
    BCCWJ_books_frequency:
      display_name: "BCCWJ Books - Frequency"
      description: "BCCWJ books subcorpus frequency data"
      files:
        token: "resources/reference_lists/ja/BCCWJ_frequencylist_suw_ver1_1.tsv"
        lemma: "resources/reference_lists/ja/BCCWJ_frequencylist_suw_ver1_1.tsv"
      format: "tsv"
      has_header: true
      enabled: true
      japanese_corpus: true
      columns:
        surface_form: 1   # lForm column
        lemma: 2          # lemma column
        pos: 3            # pos column
        frequency: 10     # PB_frequency column (books subcorpus)
```
No code changes required - the system automatically detects and integrates new configurations!
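Once the YAML above is parsed (e.g., with PyYAML) into nested dictionaries, auto-detection can be as simple as walking the mapping for `enabled` entries. A minimal sketch; `discover_enabled` is a hypothetical name, not the actual `ConfigManager` API:

```python
# Sketch of discovering enabled reference lists from the parsed config.
# The nesting (language -> ngram type -> index name -> settings) mirrors
# the YAML layout shown above.
def discover_enabled(config):
    """Yield (index_name, settings) for every enabled reference list."""
    for language, ngram_groups in config.items():
        for ngram_type, indices in ngram_groups.items():
            for name, settings in indices.items():
                if settings.get("enabled"):
                    yield name, settings
```

Because the walk is purely data-driven, adding a new subcorpus entry to the YAML is enough for it to appear in the application.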
## 🎓 Research Applications
This tool is ideal for:
- Language Learning Research: Analyzing text complexity for Japanese learners
- Corpus Linguistics: Cross-linguistic frequency analysis
- Computational Linguistics: Lexical sophistication measurement
- Educational Assessment: Text difficulty evaluation
- Translation Studies: Comparative lexical analysis
## 🤝 Contributing
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
## 📄 License
This project is licensed under CC BY-NC 4.0; see the LICENSE file for details.
## 🙏 Acknowledgments
- BCCWJ: National Institute for Japanese Language and Linguistics
- CSJ: National Institute for Japanese Language and Linguistics
- COCA: Mark Davies, Brigham Young University
- spaCy: Explosion AI for robust NLP models
## 📞 Support
For questions, issues, or contributions:
- Open an issue on GitHub
- Contact: [Your contact information]
Happy analyzing! 🎉📊