---
title: Simple Text Analyzer
emoji: 📊
colorFrom: blue
colorTo: blue
sdk: docker
app_port: 8501
tags:
  - streamlit
  - nlp
  - linguistics
  - japanese
  - corpus-linguistics
pinned: false
short_description: Lexical sophistication analyzer for EN and JP texts
license: cc-by-nc-4.0
---
# Simple Text Analyzer

A comprehensive web-based application for lexical sophistication analysis supporting both English and Japanese. The tool provides detailed linguistic analysis using corpus-based frequency data and modern NLP techniques.
## Features

### Multi-Language Support

- **English**: COCA corpus frequency analysis with unigrams, bigrams, and trigrams
- **Japanese**: BCCWJ (written) and CSJ (spoken) corpus integration with POS-aware frequency matching

### Analysis Capabilities

- **Lexical Sophistication**: Frequency-based lexical complexity analysis
- **Part-of-Speech Analysis**: Detailed POS tagging and classification
- **N-gram Analysis**: Bigram and trigram frequency analysis
- **Content vs. Function Words**: Automatic classification and separate analysis
- **Batch Processing**: Multiple file analysis with comparative results
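The content/function split above can be sketched with a simple POS-based partition. This is an illustration only (the analyzer's real classifier is internal); the function-word tag set below assumes Universal Dependencies POS tags as spaCy assigns them.

```python
# UD POS tags conventionally treated as function-word categories (assumption:
# the real analyzer may use a different inventory).
FUNCTION_POS = {"ADP", "AUX", "CCONJ", "SCONJ", "DET", "PART", "PRON", "PUNCT"}

def split_content_function(tagged_tokens):
    """Partition (token, pos) pairs into content and function words."""
    content, function = [], []
    for token, pos in tagged_tokens:
        (function if pos in FUNCTION_POS else content).append(token)
    return content, function

tokens = [("The", "DET"), ("students", "NOUN"), ("studied", "VERB"),
          ("linguistics", "NOUN"), ("carefully", "ADV"), (".", "PUNCT")]
content, function = split_content_function(tokens)
# content  -> ['students', 'studied', 'linguistics', 'carefully']
# function -> ['The', '.']
```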
### Japanese Language Features ✨ **NEW**

- **BCCWJ Integration**: Balanced Corpus of Contemporary Written Japanese
  - Raw frequency counts
  - Normalized frequency (per million words)
  - Frequency rankings
- **CSJ Integration**: Corpus of Spontaneous Japanese (spoken data)
  - Academic and conversational speech patterns
  - Multiple speech style analysis
- **POS-Aware Matching**: Composite-key lookup using `lemma + POS` for accurate frequency matching
- **Robust Fallback System**: Three-tier lookup strategy:
  1. Primary: `lemma_pos` composite key (e.g., `行く_動詞-自立`)
  2. Fallback 1: `lemma`-only lookup
  3. Fallback 2: `surface_form` lookup
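The three-tier strategy can be sketched as follows. Names, the dictionary layout, and the frequency values are illustrative assumptions, not the analyzer's actual API:

```python
def lookup_frequency(token, freq_table):
    """Return (frequency, match_level) using the three-tier fallback strategy."""
    composite = f"{token['lemma']}_{token['pos']}"
    if composite in freq_table:              # 1. Primary: lemma_pos composite key
        return freq_table[composite], "lemma_pos"
    if token["lemma"] in freq_table:         # 2. Fallback 1: lemma only
        return freq_table[token["lemma"]], "lemma"
    if token["surface"] in freq_table:       # 3. Fallback 2: surface form
        return freq_table[token["surface"]], "surface"
    return None, "unmatched"

# Toy frequency table (made-up counts, for illustration only)
freq_table = {"行く_動詞-自立": 152_340, "行く": 160_215}
token = {"surface": "行き", "lemma": "行く", "pos": "動詞-自立"}
freq, level = lookup_frequency(token, freq_table)
# freq == 152340, level == "lemma_pos"
```

Because each tier is tried in order, a token whose exact POS is missing from the table still receives the lemma-level count rather than no score at all.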
## Quick Start

### Prerequisites

- Python 3.8+
- uv (recommended) or pip for package management

### Installation

```bash
# Clone the repository
git clone https://github.com/your-repo/simple-text-analyzer.git
cd simple-text-analyzer

# Install dependencies using uv
uv sync

# Or using pip
pip install -r requirements.txt

# Install the required spaCy models
python -m spacy download en_core_web_trf
python -m spacy download ja_core_news_md  # For Japanese support
```

### Running the Application

```bash
# Using uv
uv run streamlit run web_app/app.py

# Or directly
streamlit run web_app/app.py
```
## Supported Corpora

### English

- **COCA Spoken**: Corpus of Contemporary American English (spoken subcorpus)
- **COCA Magazine**: Magazine text frequency data
- **Bigram/Trigram Analysis**: Multi-word expression frequency and association measures

### Japanese

- **BCCWJ (Balanced Corpus of Contemporary Written Japanese)**
  - 182,604 unique word forms with POS tags
  - Multiple text registers (books, newspapers, magazines, etc.)
  - Comprehensive written-language coverage
- **CSJ (Corpus of Spontaneous Japanese)**
  - 41,892 unique word forms from spoken data
  - Academic presentations and casual conversations
  - Natural speech pattern analysis
## Architecture

### Core Components

- **LexicalSophisticationAnalyzer**: Main analysis engine with multi-language support
- **ConfigManager**: Flexible configuration system for corpus integration
- **ReferenceManager**: Dynamic reference list management
- **SessionManager**: State management for the web interface

### Japanese Integration Features

- **Composite-Key Matching**: Precise matching using lemma and POS combinations
- **Extensible Design**: Easy addition of new subcorpora via YAML configuration
- **Fallback Mechanisms**: Robust lookup strategies for maximum coverage
- **Performance Optimized**: Pre-computed lookup dictionaries for fast analysis
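A pre-computed lookup dictionary of this kind might be built as below. The row layout, function name, and corpus size are assumptions for illustration; the per-million normalization is the standard measure mentioned under the BCCWJ features.

```python
def build_lookup(rows, corpus_size):
    """Index (surface, lemma, pos, raw_freq) rows under composite and simple keys.

    setdefault keeps the first (highest-priority) entry when keys collide.
    """
    lookup = {}
    for surface, lemma, pos, raw in rows:
        per_million = raw / corpus_size * 1_000_000  # normalized frequency
        entry = {"raw": raw, "per_million": per_million}
        lookup.setdefault(f"{lemma}_{pos}", entry)   # composite key (primary)
        lookup.setdefault(lemma, entry)              # lemma fallback
        lookup.setdefault(surface, entry)            # surface-form fallback
    return lookup

# Toy row with a made-up count; corpus size of 100M words is illustrative.
rows = [("行き", "行く", "動詞-自立", 500)]
lookup = build_lookup(rows, corpus_size=100_000_000)
# lookup["行く_動詞-自立"]["per_million"] == 5.0
```

Building all three key variants once up front is what makes the later three-tier lookup a constant-time dictionary probe instead of a scan.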
## File Structure

```
simple-text-analyzer/
├── web_app/                      # Streamlit web application
│   ├── app.py                    # Main application entry
│   ├── config_manager.py         # Configuration management
│   ├── reference_manager.py      # Reference list handling
│   └── components/               # UI components
├── text_analyzer/                # Core analysis modules
│   ├── lexical_sophistication.py # Main analyzer
│   ├── frequency_analyzer.py     # Frequency analysis
│   └── pos_parser.py             # POS tagging utilities
├── config/                       # Configuration files
│   └── reference_lists.yaml      # Corpus configurations
├── resources/                    # Corpus data files
│   └── reference_lists/
│       ├── en/                   # English corpus files
│       └── ja/                   # Japanese corpus files
└── test/                         # Test modules
```
## Testing

Test the Japanese integration:

```bash
uv run python test_japanese_integration.py
```

Expected output:

- ✅ spaCy model loading
- ✅ Reference data loading (182K+ BCCWJ entries, 41K+ CSJ entries)
- ✅ Composite-key lookup functionality
- ✅ Fallback mechanism verification
- ✅ Complete text analysis pipeline
## Usage Examples

### Japanese Text Analysis

```python
from text_analyzer.lexical_sophistication import LexicalSophisticationAnalyzer

# Initialize the Japanese analyzer
analyzer = LexicalSophisticationAnalyzer(language="ja", model_size="md")

# Select Japanese corpus references
selected_indices = ["BCCWJ_frequency", "CSJ_frequency"]

# Analyze Japanese text
results = analyzer.analyze_text(
    "私は毎日学校に行きます。",
    selected_indices,
)

# Access frequency scores
for token in results["token_details"]:
    print(f"{token['token']}: BCCWJ={token.get('BCCWJ_frequency_lemma', 'NA')}")
```
### English Text Analysis

```python
# Initialize the English analyzer
analyzer = LexicalSophisticationAnalyzer(language="en", model_size="trf")

# Analyze with COCA frequency data
results = analyzer.analyze_text(
    "The students studied linguistics carefully.",
    ["COCA_spoken_frequency"],
)
```
## Configuration

### Adding New Japanese Subcorpora

The system is designed for easy expansion. To add a new subcorpus (e.g., BCCWJ Books):

```yaml
# config/reference_lists.yaml
japanese:
  unigrams:
    BCCWJ_books_frequency:
      display_name: "BCCWJ Books - Frequency"
      description: "BCCWJ books subcorpus frequency data"
      files:
        token: "resources/reference_lists/ja/BCCWJ_frequencylist_suw_ver1_1.tsv"
        lemma: "resources/reference_lists/ja/BCCWJ_frequencylist_suw_ver1_1.tsv"
      format: "tsv"
      has_header: true
      enabled: true
      japanese_corpus: true
      columns:
        surface_form: 1   # lForm column
        lemma: 2          # lemma column
        pos: 3            # pos column
        frequency: 10     # PB_frequency column (books subcorpus)
```

No code changes are required: the system automatically detects and integrates new configurations.
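Auto-detection of this kind usually pairs with a validation pass. The sketch below shows the sort of check a ConfigManager might run on a parsed subcorpus entry; the function name, required-key set, and error messages are hypothetical, though the key names mirror the YAML above.

```python
# Keys a subcorpus entry must provide before the analyzer can use it
# (assumed minimal set, not the project's actual schema).
REQUIRED_KEYS = {"display_name", "files", "format", "columns", "enabled"}

def validate_subcorpus(name, cfg):
    """Check that a parsed subcorpus config block has what the analyzer needs."""
    missing = REQUIRED_KEYS - cfg.keys()
    if missing:
        raise ValueError(f"{name}: missing keys {sorted(missing)}")
    for col in ("surface_form", "lemma", "pos", "frequency"):
        if col not in cfg["columns"]:
            raise ValueError(f"{name}: columns must map '{col}'")
    return True

cfg = {
    "display_name": "BCCWJ Books - Frequency",
    "files": {"token": "resources/reference_lists/ja/BCCWJ_frequencylist_suw_ver1_1.tsv"},
    "format": "tsv",
    "enabled": True,
    "columns": {"surface_form": 1, "lemma": 2, "pos": 3, "frequency": 10},
}
assert validate_subcorpus("BCCWJ_books_frequency", cfg)
```

Failing fast at load time, rather than during analysis, keeps a malformed entry from silently producing empty frequency columns.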
## Research Applications

This tool is well suited for:

- **Language Learning Research**: Analyzing text complexity for Japanese learners
- **Corpus Linguistics**: Cross-linguistic frequency analysis
- **Computational Linguistics**: Lexical sophistication measurement
- **Educational Assessment**: Text difficulty evaluation
- **Translation Studies**: Comparative lexical analysis
## Contributing

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request

## License

This project is licensed under CC BY-NC 4.0; see the [LICENSE](LICENSE) file for details.

## Acknowledgments

- **BCCWJ**: National Institute for Japanese Language and Linguistics
- **CSJ**: National Institute for Japanese Language and Linguistics
- **COCA**: Mark Davies, Brigham Young University
- **spaCy**: Explosion AI for robust NLP models

## Support

For questions, issues, or contributions:

- Open an issue on GitHub
- Contact: [Your contact information]

---

**Happy analyzing!**