# llm_speaker_tagging

SLT 2024 Challenge: Track-2 Post-ASR-Speaker-Tagging

Baseline and Instructions for Track-2

# GenSEC Challenge Track-2 Introduction

SLT 2024 Challenge GenSEC Track 2: Post-ASR-Speaker-Tagging
- Track-2 is a challenge track that aims to correct the speaker tagging of ASR-generated transcripts that have been tagged by a speaker diarization system.
- Traditional speaker diarization systems cannot take lexical cues into account, leading to errors that disrupt the context of human conversations.
- In the provided dataset, we refer to these erroneous transcripts as `err_source_text` (error source text). Here is an example.
- Erroneous Original Transcript `err_source_text`:
```json
[
  {"session_id":"session_gen1sec2", "start_time":10.02, "end_time":11.74, "speaker":"speaker1", "words":"what should we talk about well i"},
  {"session_id":"session_gen1sec2", "start_time":13.32, "end_time":17.08, "speaker":"speaker2", "words":"don't tell you what's need to be"},
  {"session_id":"session_gen1sec2", "start_time":17.11, "end_time":17.98, "speaker":"speaker1", "words":"discussed"},
  {"session_id":"session_gen1sec2", "start_time":18.10, "end_time":19.54, "speaker":"speaker2", "words":"because that's something you should figure out"},
  {"session_id":"session_gen1sec2", "start_time":20.10, "end_time":21.40, "speaker":"speaker1", "words":"okay, then let's talk about our gigs sounds"},
  {"session_id":"session_gen1sec2", "start_time":21.65, "end_time":23.92, "speaker":"speaker2", "words":"good do you have any specific ideas"}
]
```
Note that the words `well i`, `discussed`, and `sounds` are tagged with the wrong speakers.
- We expect Track-2 participants to generate the corrected speaker taggings.
- Corrected Transcript Example (hypothesis):
```json
[
  {"session_id":"session_gen1sec2", "start_time":0.0, "end_time":0.0, "speaker":"speaker1", "words":"what should we talk about"},
  {"session_id":"session_gen1sec2", "start_time":0.0, "end_time":0.0, "speaker":"speaker2", "words":"well i don't tell you what's need to be discussed"},
  {"session_id":"session_gen1sec2", "start_time":0.0, "end_time":0.0, "speaker":"speaker2", "words":"because that's something you should figure out"},
  {"session_id":"session_gen1sec2", "start_time":0.0, "end_time":0.0, "speaker":"speaker1", "words":"okay then let's talk about our gigs"},
  {"session_id":"session_gen1sec2", "start_time":0.0, "end_time":0.0, "speaker":"speaker2", "words":"sounds good do you have any specific ideas"}
]
```
- Note that `start_time` and `end_time` cannot be estimated, so the timestamps are all set to `0.0`.
- Please ensure that the order of sentences is maintained so that the output transcripts can be evaluated correctly.
- **Dataset:** All development set and evaluation set data samples are formatted in the `seglst.json` format, which is a list of dictionaries with the following keys:
```python
{
    "session_id": str,
    "start_time": float,
    "end_time": float,
    "speaker": str,
    "words": str,
}
```
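For reference, here is a minimal Python sketch that loads one `seglst.json` file and checks it against this schema (the file name in the usage comment is illustrative):

```python
import json

EXPECTED_KEYS = {"session_id", "start_time", "end_time", "speaker", "words"}

def load_seglst(path: str) -> list[dict]:
    """Load a seglst.json file and verify every segment carries the expected keys."""
    with open(path, "r", encoding="utf-8") as f:
        segments = json.load(f)  # a seglst.json file is a JSON list of dicts
    for i, seg in enumerate(segments):
        missing = EXPECTED_KEYS - seg.keys()
        if missing:
            raise ValueError(f"Segment {i} is missing keys: {missing}")
    return segments

# Example (hypothetical path):
# segments = load_seglst("session_gen1sec2.seglst.json")
```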
## Track-2 Rules and Regulations

1. The participants must use **text (transcripts) as the only modality**. We do not provide any speech (audio) signals for the transcripts.
2. The participants are allowed to correct the words (e.g., `spk1:hi are wow` to `spk1:how are you`) without changing the speaker labels. In this sense, Track-2 partially overlaps with Track-1.
3. The participants are allowed to use any type of language model and method.
   - It does not need to be an instruction-tuned (chat-based) large language model such as GPT or LLaMA.
   - There are no restrictions on the parameter size of the LLM.
   - The participants can use prompt tuning, model alignment, or any other type of fine-tuning method.
   - The participants are also allowed to use beam search decoding techniques with LLMs.
4. The submitted system output must be in session-by-session `seglst.json` format and will be evaluated with the `cpWER` metric.
5. The participants will submit two JSON files:
   (1) `err_dev.hyp.seglst.json` for the dev set
   (2) `err_eval.hyp.seglst.json` for the eval set
6. Each of `err_dev.hyp.seglst.json` and `err_eval.hyp.seglst.json` contains a single list covering all sessions (142 for dev, 104 for eval), with sessions distinguished by the `session_id` key.
- Example of the final submission files `err_dev.hyp.seglst.json` and `err_eval.hyp.seglst.json` (a sketch for merging per-session files into this form follows the example below):
```json
[
  {"session_id":"session_abc123ab", "start_time":0.0, "end_time":0.0, "speaker":"speaker1", "words":"well it is what it is"},
  {"session_id":"session_abc123ab", "start_time":0.0, "end_time":0.0, "speaker":"speaker2", "words":"yeah so be it"},
  {"session_id":"session_xyz456cd", "start_time":0.0, "end_time":0.0, "speaker":"speaker1", "words":"wow you are late again"},
  {"session_id":"session_xyz456cd", "start_time":0.0, "end_time":0.0, "speaker":"speaker2", "words":"sorry traffic jam"},
  {"session_id":"session_xyz456cd", "start_time":0.0, "end_time":0.0, "speaker":"speaker3", "words":"hey how was last night"}
]
```
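Since the dataset provides one `seglst.json` file per session (see the folder structure later in this document) while the submission is a single list, the per-session outputs need to be merged. Below is a minimal Python sketch, assuming your corrected per-session files are listed one path per line in a `.list`-style file; the file names in the usage comment are hypothetical:

```python
import json

def merge_sessions(list_file: str, out_path: str) -> None:
    """Concatenate per-session seglst.json files into one submission list.

    Keeps the segment order within each session intact, since evaluation
    relies on sentence order being preserved.
    """
    merged = []
    with open(list_file, "r", encoding="utf-8") as f:
        session_paths = [line.strip() for line in f if line.strip()]
    for path in session_paths:
        with open(path, "r", encoding="utf-8") as f:
            merged.extend(json.load(f))
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(merged, f, indent=2)

# Example (hypothetical file names):
# merge_sessions("my_hyp_sessions.list", "err_dev.hyp.seglst.json")
```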
## Baseline System Introduction: Contextual Beam Search Decoding

The baseline system is based on the system proposed in [Enhancing Speaker Diarization with Large Language Models: A Contextual Beam Search Approach](https://arxiv.org/pdf/2309.05248) (we refer to this method as Contextual Beam Search, or CBS). Note that the Track-2 GenSEC challenge only allows the text modality, so this method injects placeholder probabilities represented by `peak_prob`.
The proposed CBS method brings the beam search technique used in ASR language modeling to speaker diarization.

<img src="images/two_realms.png" width="720" alt="Two Realms"/>
In the CBS method, the following three probability values are needed:

**P(E|S)**: the speaker diarization posterior probability (the probability of the acoustic observation E given speaker S)
**P(W)**: the probability of the next word W
**P(S|W)**: the conditional probability of the speaker S given the next word W

<img src="images/bsd_equation.png" width="360" alt="BSD Equation"/>

Note that the CBS approach assumes that one word is spoken by one speaker. In this baseline system, a placeholder speaker probability `peak_prob` is used, since we do not have access to an acoustic-only speaker diarization system.
| <img src="images/word_level_spk_prob.png" width="720" alt="Word Level Speaker Probability"/> | |
The following diagram explains how beam search decoding works with speaker diarization and ASR.

<img src="images/bsd_example_pic.png" width="880" alt="Example of beam search decoding with scores"/>

The overall data flow is shown below. Note that we use a fixed value for the speaker probabilities.

<img src="images/overall_dataflow.png" width="720" alt="Overall Dataflow"/>
## Baseline System Installation

Run the following commands at the top level of this repository.

### Conda Environment

The baseline system works in a `conda` environment with Python 3.10.
```
conda create --name llmspk python=3.10
conda activate llmspk
```
### Install requirements

You need to install the following packages:
```
kenlm
arpa
numpy
hydra-core
meeteval
tqdm
requests
simplejson
pydiardecode @ git+https://github.com/tango4j/pydiardecode@main
```
Simply install all the requirements:
```
pip install -r requirements.txt
```
### Download ARPA language model

```
mkdir -p arpa_model
cd arpa_model
wget https://kaldi-asr.org/models/5/4gram_small.arpa.gz
gunzip 4gram_small.arpa.gz
```
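Optionally, you can sanity-check the downloaded model with the `kenlm` package from the requirements; a minimal sketch (scores are log10 probabilities):

```python
import kenlm

model = kenlm.Model("arpa_model/4gram_small.arpa")
print(model.order)                 # should print 4 for a 4-gram model
print(model.score("how are you"))  # log10 probability of the sentence
```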
### Download track-2 challenge dev set and eval set

Clone the dataset from the Hugging Face server.
```
git clone https://huggingface.co/datasets/GenSEC-LLM/SLT-Task2-Post-ASR-Speaker-Tagging
```
In the cloned folder, you will see the following structure.
```bash
.
├── err_source_text
│   ├── dev
│   │   ├── session_014b5cda.seglst.json
│   │   ├── session_02d73d95.seglst.json
│   │   ├── ...
│   │   ├── session_fcd0a550.seglst.json
│   │   └── session_ff16b903.seglst.json
│   └── eval
│       ├── session_0259446c.seglst.json
│       ├── session_0bea34fa.seglst.json
│       ├── ...
│       ├── session_f84edf1f.seglst.json
│       └── session_febfa7aa.seglst.json
├── ref_annotated_text
│   └── dev
│       ├── session_014b5cda.seglst.json
│       ├── session_02d73d95.seglst.json
│       ├── ...
│       ├── session_fcd0a550.seglst.json
│       └── session_ff16b903.seglst.json
```
The file counts are as follows:
- `err_source_text`: dev 142 files, eval 104 files
- `ref_annotated_text`: dev 142 files

Run the following commands to construct the input list files `err_dev.src.list` and `err_dev.ref.list`.
```
find $PWD/SLT-Task2-Post-ASR-Speaker-Tagging/err_source_text/dev -maxdepth 1 -type f -name "*.seglst.json" > err_dev.src.list
find $PWD/SLT-Task2-Post-ASR-Speaker-Tagging/ref_annotated_text/dev -maxdepth 1 -type f -name "*.seglst.json" > err_dev.ref.list
```
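A quick way to confirm that the list files match the expected session counts (142 dev files each):

```python
# minimal sanity check for the list files built above
for list_file, expected in [("err_dev.src.list", 142), ("err_dev.ref.list", 142)]:
    with open(list_file) as f:
        n_files = sum(1 for line in f if line.strip())
    print(f"{list_file}: {n_files} files (expected {expected})")
```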
### Launch the baseline script

Now you are ready to launch the baseline script, `run_speaker_tagging_beam_search.sh`:
```bash
BASEPATH=${PWD}
DIAR_LM_PATH=$BASEPATH/arpa_model/4gram_small.arpa
ASRDIAR_FILE_NAME=err_dev
OPTUNA_STUDY_NAME=speaker_beam_search_${ASRDIAR_FILE_NAME}
WORKSPACE=$BASEPATH/SLT-Task2-Post-ASR-Speaker-Tagging
INPUT_ERROR_SRC_LIST_PATH=$BASEPATH/$ASRDIAR_FILE_NAME.src.list
GROUNDTRUTH_REF_LIST_PATH=$BASEPATH/$ASRDIAR_FILE_NAME.ref.list
DIAR_OUT_DOWNLOAD=$WORKSPACE/$ASRDIAR_FILE_NAME
mkdir -p $DIAR_OUT_DOWNLOAD

### SLT 2024 Speaker Tagging Setting v1.0.2
ALPHA=0.4
BETA=0.04
PARALLEL_CHUNK_WORD_LEN=100
BEAM_WIDTH=16
WORD_WINDOW=32
PEAK_PROB=0.95
USE_NGRAM=True
LM_METHOD=ngram

# Get the base name of the test manifest and remove the extension
UNIQ_MEMO=$(basename "${INPUT_ERROR_SRC_LIST_PATH}" .list | sed 's/\./_/g')
echo "UNIQ MEMO:" $UNIQ_MEMO
TRIAL=telephonic
BATCH_SIZE=11

python $BASEPATH/speaker_tagging_beamsearch.py \
    hyper_params_optim=false \
    port=[5501,5502,5511,5512,5521,5522,5531,5532] \
    arpa_language_model=$DIAR_LM_PATH \
    batch_size=$BATCH_SIZE \
    groundtruth_ref_list_path=$GROUNDTRUTH_REF_LIST_PATH \
    input_error_src_list_path=$INPUT_ERROR_SRC_LIST_PATH \
    parallel_chunk_word_len=$PARALLEL_CHUNK_WORD_LEN \
    use_ngram=$USE_NGRAM \
    alpha=$ALPHA \
    beta=$BETA \
    beam_width=$BEAM_WIDTH \
    word_window=$WORD_WINDOW \
    peak_prob=$PEAK_PROB
```
### Evaluate

We use the [MeetEval](https://github.com/fgnt/meeteval) toolkit to evaluate `cpWER`. cpWER measures both speaker tagging accuracy and word error rate (WER): each speaker's transcripts are concatenated, WER is computed for every permutation of speaker assignments, and the permutation that gives the lowest error is chosen.
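For intuition, here is a toy Python sketch of the cpWER idea: concatenate each speaker's words, try every speaker permutation, and keep the lowest WER. This is only an illustration (it assumes the same number of speakers on both sides); use `meeteval` for actual scoring.

```python
from itertools import permutations

def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Word-level Levenshtein distance."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (r != h))
    return dp[-1]

def toy_cpwer(ref: dict[str, list[str]], hyp: dict[str, list[str]]) -> float:
    """ref/hyp map speaker labels to their concatenated word lists."""
    ref_words, hyp_words = list(ref.values()), list(hyp.values())
    total_len = sum(len(w) for w in ref_words)
    best = min(
        sum(edit_distance(r, h) for r, h in zip(ref_words, perm))
        for perm in permutations(hyp_words)
    )
    return best / total_len

ref = {"speaker1": "what should we talk about".split(),
       "speaker2": "sounds good".split()}
hyp = {"spk_a": "sounds good".split(),
       "spk_b": "what should we talk about well".split()}
print(toy_cpwer(ref, hyp))  # 1 insertion ("well") / 7 reference words ≈ 0.143
```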
```bash
echo "Evaluating the original source transcript."
meeteval-wer cpwer -h $WORKSPACE/$ASRDIAR_FILE_NAME.src.seglst.json -r $WORKSPACE/$ASRDIAR_FILE_NAME.ref.seglst.json
echo "Source cpWER: " $(jq '.error_rate' $WORKSPACE/$ASRDIAR_FILE_NAME.src.seglst_cpwer.json)

echo "Evaluating the hypothesis transcript."
meeteval-wer cpwer -h $WORKSPACE/$ASRDIAR_FILE_NAME.hyp.seglst.json -r $WORKSPACE/$ASRDIAR_FILE_NAME.ref.seglst.json
echo "Hypothesis cpWER: " $(jq '.error_rate' $WORKSPACE/$ASRDIAR_FILE_NAME.hyp.seglst_cpwer.json)
```
The `cpWER` result will be stored in the `./SLT-Task2-Post-ASR-Speaker-Tagging/err_dev.hyp.seglst_cpwer.json` file.
```bash
cat ./SLT-Task2-Post-ASR-Speaker-Tagging/err_dev.hyp.seglst_cpwer.json
```
The result file contains a JSON dictionary; `"error_rate"` is the `cpWER` value we want to minimize.
```json
{
  "error_rate": 0.18784847090516965,
  "errors": 73077,
  "length": 389021,
  "insertions": 13739,
  "deletions": 42173,
  "substitutions": 17165,
  "reference_self_overlap": null,
  "hypothesis_self_overlap": null,
  "missed_speaker": 0,
  "falarm_speaker": 6,
  "scored_speaker": 330,
  "assignment": null
}
```
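As a quick sanity check on this output, `errors` is the sum of insertions, deletions, and substitutions, and `error_rate` equals `errors / length`:

```python
insertions, deletions, substitutions = 13739, 42173, 17165  # from the result file above
errors, length = insertions + deletions + substitutions, 389021
print(errors)           # 73077
print(errors / length)  # 0.18784847090516965, matching "error_rate"
```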
## Appendix A: LLM example of speaker tagging correction

This is an example of GPT-based speaker tagging correction. The following text is the prompt fed into ChatGPT (GPT-4), including the example from this document.
```markdown
- Track-2 is a challenge track that aims to correct the speaker tagging of the ASR-generated transcripts tagged with a speaker diarization system.
- Since the traditional speaker diarization systems cannot take lexical cues into account, leading to errors that disrupt the context of human conversations.
- In the provided dataset, we refer to these erroneous transcript as `err_source_text` (Error source text). Here is an example.
- Erroneous Original Transcript `err_source_text`:
[
{"session_id":"session_gen1sec2", "start_time":10.02, "end_time":11.74, "speaker":"speaker1", "words":"what should we talk about well i"},
{"session_id":"session_gen1sec2", "start_time":13.32, "end_time":17.08, "speaker":"speaker2", "words":"don't tell you what's need to be"},
{"session_id":"session_gen1sec2", "start_time":17.11, "end_time":17.98, "speaker":"speaker1", "words":"discussed"},
{"session_id":"session_gen1sec2", "start_time":18.10, "end_time":19.54, "speaker":"speaker2", "words":"because that's something you should figure out"},
{"session_id":"session_gen1sec2", "start_time":20.10, "end_time":21.40, "speaker":"speaker1", "words":"okay then let's talk about our gigs sounds"},
{"session_id":"session_gen1sec2", "start_time":21.65, "end_time":23.92, "speaker":"speaker2", "words":"good do you have any specific ideas"},
]
Relocate the words to make the conversation looks and sounds natural.
- Do not add or remove words.
- Do not switch the speaker labels.
- Do not change speaker1 to speaker2 or speaker2 to speaker 1.
- Move one or two words to other speakers dictionary to make the conversation sound correct.
```
- The answer from GPT-4 (tested on 05/01/2024):
```markdown
To correct the conversation's flow in the transcript, I'll rearrange the words between the speakers while keeping the speakers consistent with the rules you provided. Here is the corrected version:
json [code]
[
{"session_id":"session_gen1sec2", "start_time":10.02, "end_time":11.74, "speaker":"speaker1", "words":"what should we talk about"},
{"session_id":"session_gen1sec2", "start_time":13.32, "end_time":17.08, "speaker":"speaker2", "words":"well i don't tell you what's need to be"},
{"session_id":"session_gen1sec2", "start_time":17.11, "end_time":17.98, "speaker":"speaker1", "words":"discussed"},
{"session_id":"session_gen1sec2", "start_time":18.10, "end_time":19.54, "speaker":"speaker2", "words":"because that's something you should figure out"},
{"session_id":"session_gen1sec2", "start_time":20.10, "end_time":21.40, "speaker":"speaker1", "words":"okay then let's talk about our gigs"},
{"session_id":"session_gen1sec2", "start_time":21.65, "end_time":23.92, "speaker":"speaker2", "words":"sounds good do you have any specific ideas"}
]
Here are the changes I made:
I moved "well i" from speaker1's first utterance to the beginning of speaker2's first utterance to better connect the dialogue.
"sounds" was moved to the beginning of speaker2's last utterance to better transition from speaker1's proposal to discuss their gigs.
```
In this example, note that the words `well`, `i`, and `sounds` are corrected properly, while GPT-4 missed `discussed` and left it assigned to `speaker1`.
- The answer from Llama3 (tested on 05/01/2024):
```markdown
Here is the relocated transcript to make the conversation sound natural:
[
{"session_id":"session_gen1sec2", "start_time":10.02, "end_time":11.74, "speaker":"speaker1", "words":"what should we talk about"},
{"session_id":"session_gen1sec2", "start_time":13.32, "end_time":15.08, "speaker":"speaker1", "words":"well i don't tell you what's need to be"},
{"session_id":"session_gen1sec2", "start_time":15.11, "end_time":17.08, "speaker":"speaker2", "words":"discussed because that's something you should figure out"},
{"session_id":"session_gen1sec2", "start_time":18.10, "end_time":19.54, "speaker":"speaker1", "words":"okay then let's talk about our gigs"},
{"session_id":"session_gen1sec2", "start_time":21.65, "end_time":22.40, "speaker":"speaker1", "words":"sounds good"},
{"session_id":"session_gen1sec2", "start_time":22.45, "end_time":23.92, "speaker":"speaker2", "words":"do you have any specific ideas"},
]
I made some adjustments to the words to make the conversation flow naturally. Let me know if this meets your requirements!
```
In this Llama3 example, note that the words `well`, `i`, and `sounds` are corrected, but nearly all sentences ended up with swapped speaker labels.
### Reference

```bibtex
@inproceedings{park2024enhancing,
  title={Enhancing speaker diarization with large language models: A contextual beam search approach},
  author={Park, Tae Jin and Dhawan, Kunal and Koluguri, Nithin and Balam, Jagadeesh},
  booktitle={ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={10861--10865},
  year={2024},
  organization={IEEE}
}
```