---
tags:
- audiotranscription-segmentation
- radio
- radio-live-a-la-carte
- rlac
- random-forest
- transcription
model_index:
- name: RLAC Audio-transcription Segmenter - Chroniques (Random Forest)
---
# RLAC Audio-transcription Segmenter - Chroniques (Random Forest)
## Description
This model uses a **Random Forest** classifier to detect radio chronicle segments in text transcriptions (SRT files). It is designed as a lightweight, efficient alternative for text-based segmentation.
Hugging Face link: [eglantinefonrose/rlac-audiotranscript-segmenter-chroniques-randomforest](https://huggingface.co/eglantinefonrose/rlac-audiotranscript-segmenter-chroniques-randomforest)
## Model Details
The model is a single-file classifier (`.pkl`) serialized with `joblib`. It classifies each transcription segment from that segment's own features and those of its immediate neighbors.
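As an illustration of what "transcription segments" means here, the following is a minimal sketch of turning SRT content into `(start, end, text)` tuples. The function name `parse_srt` and the tuple layout are assumptions made for this example; the project's actual parser is not shown in this card and may differ.

```python
import re

# Matches an SRT timing line such as "00:00:01,000 --> 00:00:04,500".
TIME_RE = re.compile(
    r"(\d+):(\d{2}):(\d{2})[,.](\d{3})\s*-->\s*(\d+):(\d{2}):(\d{2})[,.](\d{3})"
)

def parse_srt(content):
    """Return a list of (start_s, end_s, text) tuples from SRT content.

    Hypothetical helper for illustration only.
    """
    segments = []
    # SRT cues are separated by blank lines.
    for block in re.split(r"\n\s*\n", content.strip()):
        lines = block.strip().splitlines()
        for idx, line in enumerate(lines):
            match = TIME_RE.search(line)
            if match:
                h1, m1, s1, ms1, h2, m2, s2, ms2 = map(int, match.groups())
                start = h1 * 3600 + m1 * 60 + s1 + ms1 / 1000
                end = h2 * 3600 + m2 * 60 + s2 + ms2 / 1000
                # Everything after the timing line is the cue text.
                text = " ".join(part.strip() for part in lines[idx + 1:])
                segments.append((start, end, text))
                break
    return segments

sample = """1
00:00:01,000 --> 00:00:04,500
Bonjour et bienvenue !

2
00:00:04,500 --> 00:00:09,000
Place a la chronique
du jour.
"""
segments = parse_srt(sample)
```

Each tuple then carries the timing and text that the features in the next section are computed from.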
### 1. Feature Engineering
The model extracts several key features from the SRT segments:
* **Temporal Data**: Precise air time and segment duration.
* **Linguistic Statistics**: Word count, character count, and average word length.
* **Punctuation Analysis**: Detection of question marks and exclamation points to identify rhetorical styles.
* **TF-IDF Vectorization**: TF-IDF weights that score how distinctive each word is in the transcript, which helps surface vocabulary typical of radio chronicles.
* **Jingle Detection**: Specific binary markers for identified radio jingles.
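The per-segment features listed above could be computed roughly as follows. This is a sketch under assumptions, not the model's actual extraction code: the function name, the feature keys, and the `JINGLE_MARKERS` cue are all hypothetical. TF-IDF is left out to keep the example dependency-free; in scikit-learn it would typically come from `TfidfVectorizer`.

```python
import re

# Hypothetical jingle cue; the model's real jingle markers are not documented here.
JINGLE_MARKERS = ("[jingle]",)

def segment_features(text, start_s, end_s):
    """Build a flat feature dict for one SRT segment (times in seconds)."""
    words = re.findall(r"\w+", text)
    word_count = len(words)
    return {
        # Temporal data
        "start_s": start_s,
        "duration_s": end_s - start_s,
        # Linguistic statistics
        "word_count": word_count,
        "char_count": len(text),
        "avg_word_len": sum(len(w) for w in words) / word_count if word_count else 0.0,
        # Punctuation analysis
        "has_question": int("?" in text),
        "has_exclamation": int("!" in text),
        # Jingle detection (assumed marker list)
        "has_jingle": int(any(m in text.lower() for m in JINGLE_MARKERS)),
    }

features = segment_features("Bonjour et bienvenue !", 12.0, 15.5)
```

Returning a dict keeps the feature names explicit; the values can be turned into a numeric row in a fixed key order before being passed to a classifier.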
### 2. Contextual Awareness (Sliding Window)
To improve accuracy, the model uses a **sliding window**. Rather than classifying a segment in isolation, it also incorporates features from the surrounding segments (e.g., the 2 segments before and the 2 after) to capture the flow and continuity of the radio broadcast.
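One common way to build such a window is to concatenate each segment's feature row with the rows of its neighbors. The sketch below assumes zero-padding at the start and end of the transcript; how the actual model handles the edges is not stated in this card, and `add_context` is a hypothetical helper name.

```python
def add_context(rows, window=2):
    """Concatenate each feature row with `window` neighbors on both sides.

    Positions that fall outside the transcript are zero-padded (an assumption).
    """
    if not rows:
        return []
    padding = [0.0] * len(rows[0])
    out = []
    for i in range(len(rows)):
        context_row = []
        for offset in range(-window, window + 1):
            j = i + offset
            context_row.extend(rows[j] if 0 <= j < len(rows) else padding)
        out.append(context_row)
    return out

# Three segments with two features each, and a window of one neighbor per side.
rows = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
windowed = add_context(rows, window=1)
```

With `window=2`, each output row is five times as wide as the original: two segments before, the segment itself, and two after.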
### 3. Core Classifier
The underlying algorithm is scikit-learn's **RandomForestClassifier**, configured with 200 estimators to balance precision and recall for chronicle detection.
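For reference, a forest of this size can be set up as in the sketch below. Only `n_estimators=200` comes from this card; the synthetic data, the `random_state`, and the 0.5 decision threshold are assumptions made so the example runs on its own.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for windowed segment features (not real radio data).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # 1 = "chronicle", 0 = "other"

# 200 trees, as stated above; random_state is fixed only for reproducibility.
clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X, y)

# Class probabilities for the first three segments; column 1 is the positive class.
proba = clf.predict_proba(X[:3])

# Raising the threshold generally favors precision, lowering it favors recall.
is_chronicle = proba[:, 1] >= 0.5
```

Working from `predict_proba` rather than `predict` is one way to tune the precision/recall trade-off after training, without refitting the forest.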
## Advantages
* **CPU-Only**: Does not require any GPU or heavy deep learning frameworks.
* **High Speed**: Processing an hour-long show takes less than a second.
* **Efficiency**: Extremely low memory footprint.
## Usage
The model is integrated into the local `predict.py` workflow:
```python
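# RadioChroniqueClassifier is defined in the project's local train.py
from train import RadioChroniqueClassifier

# Load the serialized Random Forest model from disk
classifier = RadioChroniqueClassifier.load_model("models/radio_chronique_rf.pkl")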
```
## Author
Maintained by eglantinefonrose.