Chinese-Lao-Thai-Multilingual-Parallel-Sentence-Pair-Extraction (CLTMPSE)

#step1. preprocess: build the phonetic (IPA) dictionaries and word-replacement rules, and construct the training data:

  1. Open utils/IPA_sim_statistic_analysis.py.
  2. Set statistic_conclusion_exist=False (line 96).
  3. Run the script to generate IPA_lo_dict and IPA_th_dict.
  4. Set statistic_conclusion_exist=True.
  5. Run the script again to generate same_list.

#step2. train:

  1. Set the parameters (listed below).
  2. Run main.py with them.

--do_train --evaluate_during_training --train_data_file train.json --eval_data_file valid.json --test_data_file test_lo.json --data_dir data --lang replace_word_level --mode 73 --output_dir ./models --num_train_epochs 20 --max_seq_length 200 --gradient_accumulation_steps 1 --warmup_steps 500 --train_batch_size 32 --eval_batch_size 16
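
To launch training from a script instead of entering the arguments by hand, the call could be sketched as follows. This is a minimal sketch and not part of the repository: the helper names `build_train_args` and `run_main` are hypothetical, and it assumes main.py is in the current working directory and accepts exactly the flags listed above.

```python
import subprocess
import sys


def build_train_args():
    """Return the step 2 training flags as an argument list (hypothetical helper)."""
    return [
        "--do_train",
        "--evaluate_during_training",
        "--train_data_file", "train.json",
        "--eval_data_file", "valid.json",
        "--test_data_file", "test_lo.json",
        "--data_dir", "data",
        "--lang", "replace_word_level",
        "--mode", "73",
        "--output_dir", "./models",
        "--num_train_epochs", "20",
        "--max_seq_length", "200",
        "--gradient_accumulation_steps", "1",
        "--warmup_steps", "500",
        "--train_batch_size", "32",
        "--eval_batch_size", "16",
    ]


def run_main(args, dry_run=True):
    """Build the main.py command; execute it only when dry_run is False."""
    # Assumption: main.py lives in the current working directory.
    cmd = [sys.executable, "main.py", *args]
    if not dry_run:
        subprocess.run(cmd, check=True)
    return cmd


# Dry run: the command is only assembled and printed, nothing is executed.
cmd = run_main(build_train_args())
print(" ".join(cmd))
```

Calling `run_main(build_train_args(), dry_run=False)` would actually start training; the dry-run default makes it easy to inspect the command first.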


#step3. test:

  1. Set the parameters for the language to test (listed below).
  2. Run main.py with them.

Test on Lao (lo):

--do_test --test_data_file test_lo.json --data_dir data --lang replace_word_level --mode 73 --output_dir ./models --max_seq_length 200 --eval_batch_size 16

Test on Thai (th):

--do_test --test_data_file test_th.json --data_dir data --lang replace_word_level --mode 73 --output_dir ./models --max_seq_length 200 --eval_batch_size 16
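
The two test runs differ only in the test file, so they could be driven from one small script. The sketch below is an illustration under stated assumptions, not code from the repository: `COMMON_TEST_ARGS`, `build_test_cmd`, and `run_tests` are hypothetical names, main.py is assumed to be in the current working directory, and flag order is assumed not to matter (as with argparse).

```python
import subprocess
import sys

# Flags shared by both test runs, copied from step 3.
COMMON_TEST_ARGS = [
    "--do_test",
    "--data_dir", "data",
    "--lang", "replace_word_level",
    "--mode", "73",
    "--output_dir", "./models",
    "--max_seq_length", "200",
    "--eval_batch_size", "16",
]


def build_test_cmd(lang_code):
    """Build the main.py test command for one language code ('lo' or 'th')."""
    if lang_code not in ("lo", "th"):
        raise ValueError(f"unsupported language code: {lang_code}")
    return [
        sys.executable, "main.py",
        *COMMON_TEST_ARGS,
        "--test_data_file", f"test_{lang_code}.json",
    ]


def run_tests(dry_run=True):
    """Assemble both test commands; execute them only when dry_run is False."""
    cmds = [build_test_cmd(code) for code in ("lo", "th")]
    if not dry_run:
        for cmd in cmds:
            subprocess.run(cmd, check=True)
    return cmds


# Dry run: print the Lao and Thai test commands without executing them.
commands = run_tests()
for cmd in commands:
    print(" ".join(cmd))
```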

