---
title: Simple Text Analyzer
emoji: πŸš€
colorFrom: blue
colorTo: blue
sdk: docker
app_port: 8501
tags:
  - streamlit
  - nlp
  - linguistics
  - japanese
  - corpus-linguistics
pinned: false
short_description: Lexical sophistication analyzer for EN and JP texts
license: cc-by-nc-4.0
---

# Simple Text Analyzer

A web-based application for lexical sophistication analysis of both English and Japanese texts. The tool provides detailed linguistic analysis using corpus-based frequency data and modern NLP techniques.

## 🌟 Features

### Multi-Language Support

- **English**: COCA corpus frequency analysis with unigrams, bigrams, and trigrams
- **Japanese**: BCCWJ (written) and CSJ (spoken) corpus integration with POS-aware frequency matching

### Analysis Capabilities

- **Lexical Sophistication**: Frequency-based lexical complexity analysis
- **Part-of-Speech Analysis**: Detailed POS tagging and classification
- **N-gram Analysis**: Bigram and trigram frequency analysis
- **Content vs. Function Words**: Automatic classification and separate analysis
- **Batch Processing**: Multiple-file analysis with comparative results

### Japanese Language Features ✨ **NEW**

- **BCCWJ Integration**: Balanced Corpus of Contemporary Written Japanese
  - Raw frequency counts
  - Normalized frequency (per million words)
  - Frequency rankings
- **CSJ Integration**: Corpus of Spontaneous Japanese (spoken data)
  - Academic and conversational speech patterns
  - Multiple speech-style analysis
- **POS-Aware Matching**: Composite-key lookup using `lemma + POS` for accurate frequency retrieval
- **Robust Fallback System**: Three-tier lookup strategy (sketched in code below):
  1. Primary: `lemma_pos` composite key (e.g., `葌く_ε‹•θ©ž-θ‡ͺη«‹`)
  2. Fallback 1: `lemma`-only lookup
  3. Fallback 2: `surface_form` lookup

## πŸš€ Quick Start

### Prerequisites

- Python 3.8+
- uv (recommended) or pip for package management

### Installation

```bash
# Clone the repository
git clone https://github.com/your-repo/simple-text-analyzer.git
cd simple-text-analyzer

# Install dependencies using uv
uv sync

# Or using pip
pip install -r requirements.txt

# Install the required spaCy models
python -m spacy download en_core_web_trf
python -m spacy download ja_core_news_md  # For Japanese support
```

### Running the Application

```bash
# Using uv
uv run streamlit run web_app/app.py

# Or directly
streamlit run web_app/app.py
```

## πŸ“Š Supported Corpora

### English

- **COCA Spoken**: Corpus of Contemporary American English (spoken subcorpus)
- **COCA Magazine**: Magazine text frequency data
- **Bigram/Trigram Analysis**: Multi-word expression frequency and association measures

### Japanese

- **BCCWJ (Balanced Corpus of Contemporary Written Japanese)**
  - 182,604 unique word forms with POS tags
  - Multiple text registers (books, newspapers, magazines, etc.)
  - Comprehensive written-language coverage
- **CSJ (Corpus of Spontaneous Japanese)**
  - 41,892 unique word forms from spoken data
  - Academic presentations and casual conversations
  - Natural speech pattern analysis
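As a concrete illustration of the three-tier fallback strategy described under Japanese Language Features, here is a minimal sketch. The function name, dictionary layout, and frequency values are hypothetical, for illustration only, and are not the project's actual API:

```python
from typing import Dict, Optional


def lookup_frequency(
    reference: Dict[str, float],
    lemma: str,
    pos: str,
    surface_form: str,
) -> Optional[float]:
    """Three-tier lookup: lemma_pos composite key, then lemma, then surface form."""
    # Tier 1: composite key, e.g. "葌く_ε‹•θ©ž-θ‡ͺη«‹"
    composite_key = f"{lemma}_{pos}"
    if composite_key in reference:
        return reference[composite_key]
    # Tier 2: fall back to the bare lemma
    if lemma in reference:
        return reference[lemma]
    # Tier 3: fall back to the surface form as it appeared in the text
    return reference.get(surface_form)


# Hypothetical BCCWJ entries keyed three ways (frequency values are made up)
bccwj = {"葌く_ε‹•θ©ž-θ‡ͺη«‹": 1852.3, "葌く": 1902.7, "葌き": 311.4}
print(lookup_frequency(bccwj, "葌く", "ε‹•θ©ž-θ‡ͺη«‹", "葌き"))  # 1852.3, matched at tier 1
```

Resolving the composite key first keeps homographs with different parts of speech apart; the bare-lemma and surface-form tiers only serve as coverage fallbacks.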
## πŸ”§ Architecture

### Core Components

- **LexicalSophisticationAnalyzer**: Main analysis engine with multi-language support
- **ConfigManager**: Flexible configuration system for corpus integration
- **ReferenceManager**: Dynamic reference list management
- **SessionManager**: State management for the web interface

### Japanese Integration Features

- **Composite-Key Matching**: Precise matching using lemma and POS combinations
- **Extensible Design**: Easy addition of new subcorpora via YAML configuration
- **Fallback Mechanisms**: Robust lookup strategies for maximum coverage
- **Performance Optimized**: Pre-computed lookup dictionaries for fast analysis

## πŸ“ File Structure

```
simple-text-analyzer/
β”œβ”€β”€ web_app/                       # Streamlit web application
β”‚   β”œβ”€β”€ app.py                     # Main application entry
β”‚   β”œβ”€β”€ config_manager.py          # Configuration management
β”‚   β”œβ”€β”€ reference_manager.py       # Reference list handling
β”‚   └── components/                # UI components
β”œβ”€β”€ text_analyzer/                 # Core analysis modules
β”‚   β”œβ”€β”€ lexical_sophistication.py  # Main analyzer
β”‚   β”œβ”€β”€ frequency_analyzer.py      # Frequency analysis
β”‚   └── pos_parser.py              # POS tagging utilities
β”œβ”€β”€ config/                        # Configuration files
β”‚   └── reference_lists.yaml       # Corpus configurations
β”œβ”€β”€ resources/                     # Corpus data files
β”‚   └── reference_lists/
β”‚       β”œβ”€β”€ en/                    # English corpus files
β”‚       └── ja/                    # Japanese corpus files
└── test/                          # Test modules
```

## πŸ§ͺ Testing

Test the Japanese integration:

```bash
uv run python test_japanese_integration.py
```

Expected output:

- βœ… spaCy model loading
- βœ… Reference data loading (182K+ BCCWJ entries, 41K+ CSJ entries)
- βœ… Composite-key lookup functionality
- βœ… Fallback mechanism verification
- βœ… Complete text analysis pipeline

## πŸ“ˆ Usage Examples

### Japanese Text Analysis

```python
from text_analyzer.lexical_sophistication import LexicalSophisticationAnalyzer

# Initialize the Japanese analyzer
analyzer = LexicalSophisticationAnalyzer(language="ja", model_size="md")

# Select Japanese corpus references
selected_indices = ["BCCWJ_frequency", "CSJ_frequency"]

# Analyze Japanese text
results = analyzer.analyze_text(
    "私は毎ζ—₯ε­¦ζ ‘γ«θ‘ŒγγΎγ™γ€‚",
    selected_indices
)

# Access frequency scores
for token in results['token_details']:
    print(f"{token['token']}: BCCWJ={token.get('BCCWJ_frequency_lemma', 'NA')}")
```

### English Text Analysis

```python
# Initialize the English analyzer
analyzer = LexicalSophisticationAnalyzer(language="en", model_size="trf")

# Analyze with COCA frequency data
results = analyzer.analyze_text(
    "The students studied linguistics carefully.",
    ["COCA_spoken_frequency"]
)
```
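### Batch Analysis (sketch)

Batch processing can also be driven from the same API. The following is a minimal sketch that reuses `analyze_text` as shown in the examples above; the `texts/` folder and the comparative summary at the end are illustrative assumptions, not part of the documented interface:

```python
from pathlib import Path

from text_analyzer.lexical_sophistication import LexicalSophisticationAnalyzer

analyzer = LexicalSophisticationAnalyzer(language="ja", model_size="md")
selected_indices = ["BCCWJ_frequency", "CSJ_frequency"]

# Analyze every .txt file in a folder and keep per-file results for comparison
# ("texts/" is a hypothetical input directory)
results_by_file = {}
for path in sorted(Path("texts").glob("*.txt")):
    text = path.read_text(encoding="utf-8")
    results_by_file[path.name] = analyzer.analyze_text(text, selected_indices)

# Illustrative comparative summary: token count per file
for name, results in results_by_file.items():
    print(f"{name}: {len(results['token_details'])} tokens")
```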
## πŸ”§ Configuration

### Adding New Japanese Subcorpora

The system is designed for easy expansion. To add a new subcorpus (e.g., BCCWJ Books):

```yaml
# config/reference_lists.yaml
japanese:
  unigrams:
    BCCWJ_books_frequency:
      display_name: "BCCWJ Books - Frequency"
      description: "BCCWJ books subcorpus frequency data"
      files:
        token: "resources/reference_lists/ja/BCCWJ_frequencylist_suw_ver1_1.tsv"
        lemma: "resources/reference_lists/ja/BCCWJ_frequencylist_suw_ver1_1.tsv"
      format: "tsv"
      has_header: true
      enabled: true
      japanese_corpus: true
      columns:
        surface_form: 1  # lForm column
        lemma: 2         # lemma column
        pos: 3           # pos column
        frequency: 10    # PB_frequency column (books subcorpus)
```

No code changes are required: the system automatically detects and integrates new configurations.

## πŸ“š Research Applications

This tool is ideal for:

- **Language Learning Research**: Analyzing text complexity for learners of Japanese
- **Corpus Linguistics**: Cross-linguistic frequency analysis
- **Computational Linguistics**: Lexical sophistication measurement
- **Educational Assessment**: Text difficulty evaluation
- **Translation Studies**: Comparative lexical analysis

## 🀝 Contributing

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request

## πŸ“„ License

This project is licensed under the CC BY-NC 4.0 license; see the [LICENSE](LICENSE) file for details.

## πŸ™ Acknowledgments

- **BCCWJ**: National Institute for Japanese Language and Linguistics
- **CSJ**: National Institute for Japanese Language and Linguistics
- **COCA**: Mark Davies, Brigham Young University
- **spaCy**: Explosion AI for robust NLP models

## πŸ“ž Support

For questions, issues, or contributions:

- Open an issue on GitHub
- Contact: [Your contact information]

---

**Happy analyzing!** πŸš€πŸ“Š