eglantinefonrose
/

rlac-audiotranscript-segmenter-chroniques-randomforest

audiotranscription-segmentation

radio-live-a-la-carte

Model card Files Files and versions

rlac-audiotranscript-segmenter-chroniques-randomforest / README.md

eglantinefonrose's picture

eglantinefonrose

Upload folder using huggingface_hub

820f5ba verified about 2 months ago

|

history blame contribute delete

2.49 kB

	---
	tags:
	- audiotranscription-segmentation
	- radio
	- radio-live-a-la-carte
	- rlac
	- random-forest
	- transcription
	model_index:
	- name: RLAC Audio-transcription Segmenter - Chroniques (Random Forest)
	---

	# RLAC Audio-transcription Segmenter - Chroniques (Random Forest)

	## Description
	This model uses a Random Forest machine learning approach to detect radio chronicle segments from textual transcriptions (SRT files). It is designed to be a lightweight and efficient alternative for text-based segmentation.

	Hugging Face link: [eglantinefonrose/rlac-audiotranscript-segmenter-chroniques-randomforest](https://huggingface.co/eglantinefonrose/rlac-audiotranscript-segmenter-chroniques-randomforest)

	## Model Details

	The model is a single-file classifier (`.pkl`) serialized with `joblib`. It operates by analyzing individual transcription segments and their immediate context.

	### 1. Feature Engineering
	The model extracts several key features from the SRT segments:
	* Temporal Data: Precise air time and segment duration.
	* Linguistic Statistics: Word count, character count, and average word length.
	* Punctuation Analysis: Detection of question marks and exclamation points to identify rhetorical styles.
	* TF-IDF Vectorization: A statistical measure used to evaluate the importance of words in the transcript relative to radio chronicle vocabulary.
	* Jingle Detection: Specific binary markers for identified radio jingles.

	### 2. Contextual Awareness (Sliding Window)
	To improve accuracy, the model employs a sliding window technique. It doesn't just look at a segment in isolation; it incorporates features from the surrounding segments (e.g., the 2 segments before and 2 segments after) to capture the flow and continuity of the radio broadcast.

	### 3. Core Classifier
	The underlying algorithm is a RandomForestClassifier from the Scikit-Learn library, optimized with 200 estimators to handle the balance between precision and recall for chronicle detection.

	## Advantages
	* CPU-Only: Does not require any GPU or heavy deep learning frameworks.
	* High Speed: Processing an hour-long show takes less than a second.
	* Efficiency: Extremely low memory footprint.

	## Usage
	The model is integrated into the local `predict.py` workflow:
	```python
	from train import RadioChroniqueClassifier
	classifier = RadioChroniqueClassifier.load_model("models/radio_chronique_rf.pkl")
	```

	## Author
	Maintained by eglantinefonrose.