# Text Module V2 - Aspect-Based Scoring

## Overview

Enhanced text analysis using prototype-based aspect extraction with `all-mpnet-base-v2` embeddings.

## Changes from V1

- **Model**: Upgraded from `all-MiniLM-L6-v2` (384d) to `all-mpnet-base-v2` (768d)
- **Approach**: Moved from simple reference embeddings to aspect-based prototype scoring
- **Aspects**: 10 employability aspects (leadership, technical_skills, problem_solving, etc.)
- **Admin**: Runtime seed updates via REST API

## Configuration

### Model Selection

Set via environment variable or constructor:

```bash
export ASPECT_MODEL_NAME=all-mpnet-base-v2  # default
# or
export ASPECT_MODEL_NAME=all-MiniLM-L6-v2   # fallback
```

```python
from services.text_module_v2 import TextModuleV2

# Default (all-mpnet-base-v2)
text_module = TextModuleV2()

# Override model
text_module = TextModuleV2(model_name='all-MiniLM-L6-v2')
```

### Aspect Seeds

Seeds loaded from `./aspect_seeds.json` (created by default). Edit this file to customize aspect definitions.

**Location**: `analytics/backend/aspect_seeds.json`

### Centroids Cache

Pre-computed centroids saved to `./aspect_centroids.npz` for fast cold starts.

## Usage

### Basic Scoring

```python
text_module = TextModuleV2()

text_responses = {
    'text_q1': "I developed ML pipelines using Python and scikit-learn...",
    'text_q2': "My career goal is to become a data scientist...",
    'text_q3': "I led a team of 5 students in a hackathon project..."
}

score, confidence, features = text_module.score(text_responses)
print(f"Score: {score:.2f}, Confidence: {confidence:.2f}")
print(f"Features: {features}")
```

### Get Current Seeds

```python
seeds = text_module.get_aspect_seeds()
print(f"Loaded {len(seeds)} aspects")
```

## Admin API

### Setup

```python
from flask import Flask
from services.text_module_v2 import TextModuleV2, register_admin_seed_endpoint

app = Flask(__name__)
text_module = TextModuleV2()

# Register admin endpoints
register_admin_seed_endpoint(app, text_module)

app.run(port=5001)
```

Set admin token:

```bash
export ADMIN_SEED_TOKEN=your-secret-token
```

### Endpoints

#### GET /admin/aspect-seeds

Get current loaded seeds.

**Request**:

```bash
curl -H "X-Admin-Token: your-secret-token" \
  http://localhost:5001/admin/aspect-seeds
```

**Response**:

```json
{
  "success": true,
  "seeds": {
    "leadership": ["led a team", "managed project", ...],
    "technical_skills": [...]
  },
  "num_aspects": 10
}
```

#### POST /admin/aspect-seeds

Update aspect seeds (recomputes centroids).

**Request**:

```bash
curl -X POST \
  -H "X-Admin-Token: your-secret-token" \
  -H "Content-Type: application/json" \
  -d '{
    "seeds": {
      "leadership": [
        "led a team",
        "managed stakeholders",
        "organized events"
      ],
      "technical_skills": [
        "developed web API",
        "built ML models"
      ]
    },
    "persist": true
  }' \
  http://localhost:5001/admin/aspect-seeds
```

**Response**:

```json
{
  "success": true,
  "message": "Aspect seeds updated successfully",
  "stats": {
    "num_aspects": 2,
    "avg_seed_count": 2.5,
    "timestamp": "2025-12-09T10:30:00Z"
  }
}
```

## Advanced: Seed Expansion

Suggest new seed phrases from a corpus:

```python
corpus = [
    "I led the product development team and managed stakeholders",
    "Implemented CI/CD pipelines for automated testing",
    # ... more texts
]

suggestions = text_module.suggest_seed_expansions(
    corpus_texts=corpus,
    aspect_key='leadership',
    top_n=20
)
print("Suggested seeds:", suggestions)
```

## Aspect → Question Mapping

```python
from services.text_module_v2 import get_relevant_aspects_for_question

# Q1: Strengths & skills
aspects_q1 = get_relevant_aspects_for_question('text_q1')
# ['technical_skills', 'problem_solving', 'learning_agility', 'initiative', 'communication']

# Q2: Career interests
aspects_q2 = get_relevant_aspects_for_question('text_q2')
# ['career_alignment', 'learning_agility', 'initiative', 'communication']

# Q3: Extracurriculars & leadership
aspects_q3 = get_relevant_aspects_for_question('text_q3')
# ['leadership', 'teamwork', 'project_execution', 'internships_experience', 'communication']
```

## Files

| File | Purpose |
|------|---------|
| `services/text_module_v2.py` | Main module implementation |
| `aspect_seeds.json` | Aspect seed definitions (editable) |
| `aspect_centroids.npz` | Cached centroids (auto-generated) |

## Performance

- **Model Load**: ~3s (first time)
- **Centroid Build**: ~1s for 10 aspects with 20 seeds each
- **Text Scoring**: ~200-500ms per 3-question set (CPU)

## Logging

Module logs to Python's `logging` system:

```python
import logging
logging.basicConfig(level=logging.INFO)
```

Key events logged:

- Model loading
- Seed updates (with masked token)
- Centroid recomputation
- File I/O operations
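
The "masked token" noted for seed-update logs could look roughly like the sketch below. The helper names `mask_token` and `log_seed_update`, the logger name, and the choice to keep the last four characters are all assumptions for illustration; the module's actual masking format is not documented here.

```python
import logging

# Hypothetical logger name, used only for this example.
logger = logging.getLogger("text_module_v2.example")


def mask_token(token: str, visible: int = 4) -> str:
    """Return a log-safe form of a secret, keeping only the last `visible` characters."""
    if not token:
        return "<empty>"
    if len(token) <= visible:
        # Too short to reveal any part of it safely.
        return "*" * len(token)
    return "*" * (len(token) - visible) + token[-visible:]


def log_seed_update(token: str, num_aspects: int) -> str:
    """Log a seed update without writing the raw admin token to the log."""
    message = f"Seed update by token={mask_token(token)} ({num_aspects} aspects)"
    logger.info(message)
    return message
```

Masking at the point of logging keeps the raw `ADMIN_SEED_TOKEN` value out of log files while still letting an operator tell tokens apart by their suffix.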
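
## Appendix: Prototype Scoring Sketch

The aspect-based prototype scoring mentioned in the Overview can be pictured as follows: each aspect's seed phrases are embedded and averaged into a centroid, and a response is compared with every centroid by cosine similarity. This is a minimal sketch of that general technique, not the module's implementation. `toy_embed` is a hashed bag-of-words stand-in for the real sentence encoder so the example runs without downloading a model, and `build_centroids` and `aspect_scores` are hypothetical names.

```python
import zlib

import numpy as np

DIM = 512  # toy dimension for the stand-in encoder; all-mpnet-base-v2 is 768-d


def toy_embed(texts):
    """Stand-in for a sentence encoder: hashed bag-of-words, L2-normalized rows."""
    out = np.zeros((len(texts), DIM))
    for i, text in enumerate(texts):
        for word in text.lower().split():
            out[i, zlib.crc32(word.encode("utf-8")) % DIM] += 1.0
    norms = np.linalg.norm(out, axis=1, keepdims=True)
    return out / np.maximum(norms, 1e-12)


def build_centroids(seeds, embed=toy_embed):
    """Average each aspect's seed embeddings into one unit-length prototype."""
    centroids = {}
    for aspect, phrases in seeds.items():
        mean = embed(phrases).mean(axis=0)
        centroids[aspect] = mean / max(np.linalg.norm(mean), 1e-12)
    return centroids


def aspect_scores(text, centroids, embed=toy_embed):
    """Cosine similarity between the text and every aspect prototype."""
    vec = embed([text])[0]
    return {aspect: float(vec @ c) for aspect, c in centroids.items()}


seeds = {
    "leadership": ["led a team", "managed project"],
    "technical_skills": ["developed web API", "built ML models"],
}
centroids = build_centroids(seeds)
scores = aspect_scores("I led a team of 5 students", centroids)
print(scores)
```

With a real encoder, `toy_embed` would be replaced by a function that returns normalized sentence embeddings; the centroid and similarity steps stay the same.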
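
## Appendix: Centroid Cache Sketch

The centroids cache described under Configuration stores pre-computed centroids in an `.npz` file so they need not be rebuilt on every start. One plausible way to do this with NumPy is shown below. `save_centroids` and `load_centroids` are hypothetical helpers, and storing one array per aspect name is an assumed layout; the real file's layout is not specified in this document.

```python
import os
import tempfile

import numpy as np


def save_centroids(path, centroids):
    """Write one array per aspect; aspect names become entry names in the archive.

    Note: an aspect literally named "file" would clash with np.savez's own parameter.
    """
    np.savez(path, **centroids)


def load_centroids(path):
    """Return {aspect: centroid}, or None when no cache file exists yet."""
    if not os.path.exists(path):
        return None
    with np.load(path) as data:
        return {name: data[name] for name in data.files}


# Use a temporary directory so the example does not touch a real cache file.
cache_path = os.path.join(tempfile.mkdtemp(), "aspect_centroids.npz")
example = {
    "leadership": np.ones(4) / 2.0,
    "teamwork": np.array([1.0, 0.0, 0.0, 0.0]),
}

cold_start = load_centroids(cache_path)  # no file yet, so centroids must be built
save_centroids(cache_path, example)
restored = load_centroids(cache_path)    # later starts can read the cache instead
```

A cache like this is only valid for the seeds and model it was built from, so it would need to be rebuilt after a seed update or a model change.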