--- tags: - audiotranscription-segmentation - radio - radio-live-a-la-carte - rlac - random-forest - transcription model_index: - name: RLAC Audio-transcription Segmenter - Chroniques (Random Forest) --- # RLAC Audio-transcription Segmenter - Chroniques (Random Forest) ## Description This model uses a **Random Forest** machine learning approach to detect radio chronicle segments from textual transcriptions (SRT files). It is designed to be a lightweight and efficient alternative for text-based segmentation. Hugging Face link: [eglantinefonrose/rlac-audiotranscript-segmenter-chroniques-randomforest](https://huggingface.co/eglantinefonrose/rlac-audiotranscript-segmenter-chroniques-randomforest) ## Model Details The model is a single-file classifier (`.pkl`) serialized with `joblib`. It operates by analyzing individual transcription segments and their immediate context. ### 1. Feature Engineering The model extracts several key features from the SRT segments: * **Temporal Data**: Precise air time and segment duration. * **Linguistic Statistics**: Word count, character count, and average word length. * **Punctuation Analysis**: Detection of question marks and exclamation points to identify rhetorical styles. * **TF-IDF Vectorization**: A statistical measure used to evaluate the importance of words in the transcript relative to radio chronicle vocabulary. * **Jingle Detection**: Specific binary markers for identified radio jingles. ### 2. Contextual Awareness (Sliding Window) To improve accuracy, the model employs a **sliding window** technique. It doesn't just look at a segment in isolation; it incorporates features from the surrounding segments (e.g., the 2 segments before and 2 segments after) to capture the flow and continuity of the radio broadcast. ### 3. Core Classifier The underlying algorithm is a **RandomForestClassifier** from the Scikit-Learn library, optimized with 200 estimators to handle the balance between precision and recall for chronicle detection. ## Advantages * **CPU-Only**: Does not require any GPU or heavy deep learning frameworks. * **High Speed**: Processing an hour-long show takes less than a second. * **Efficiency**: Extremely low memory footprint. ## Usage The model is integrated into the local `predict.py` workflow: ```python from train import RadioChroniqueClassifier classifier = RadioChroniqueClassifier.load_model("models/radio_chronique_rf.pkl") ``` ## Author Maintained by eglantinefonrose.