This model classifies a pair of passages as continuing or not. In other words, it classifies whether the line break between the two input passages is correct.
Usage with SentenceTransformers
Usage is easiest with SentenceTransformers installed (the model was trained with SentenceTransformers). You can then use the pre-trained model like this:
from sentence_transformers import CrossEncoder
model = CrossEncoder('model_name')
scores = model.predict([(Paragraph1a, Paragraph1b), (Paragraph2a, Paragraph2b), (Paragraph3a, Paragraph3b)])
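Since `predict` takes a list of (left, right) pairs, one pair per candidate line break, a small helper can build those pairs from an ordered list of passages. This is a hypothetical sketch (`adjacent_pairs` is not part of the model card or library):

```python
def adjacent_pairs(paragraphs):
    # Pair each passage with the one that follows it, so the
    # cross-encoder scores every candidate line break exactly once.
    return [(paragraphs[i], paragraphs[i + 1]) for i in range(len(paragraphs) - 1)]

# scores = model.predict(adjacent_pairs([Paragraph1a, Paragraph1b, Paragraph2a]))
```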
Usage with Huggingface Transformers
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained('model_name')
tokenizer = AutoTokenizer.from_pretrained('model_name')

encoded_input = tokenizer('[SEP]'.join([Paragraph1a, Paragraph1b]), return_tensors='pt')
with torch.no_grad():
    output = model(**encoded_input).logits[0]
Example results
paragraph1 = '...'  # Korean passage on the second national IP master plan (2017~2021); text garbled in extraction and omitted here
paragraph2 = '...'  # Korean passage with figures on expected economic effects; text garbled in extraction and omitted here
(Here the two paragraphs are incorrectly separated: the second continues the first mid-sentence.)
- Output from ST: [ 2.9367106, -2.7748516]
- Output from HF: [ 2.9245, -2.7643]
paragraph1 = '...'  # Korean numbered section heading ("1) ..."); text garbled in extraction and omitted here
paragraph2 = '...'  # Korean body paragraph on 2020 U.S. intellectual-property policy; text garbled in extraction and omitted here
(Here the two paragraphs are correctly separated.)
- Output from ST: [-4.113529 , 4.1533113]
- Output from HF: [-4.1086, 4.1488]
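The raw outputs above are logits. Judging from the two examples (an inference from this card's outputs, not documented behavior), index 0 scores an incorrect break (the passages continue) and index 1 a correct one. A softmax turns the pair into probabilities:

```python
import math

def softmax(logits):
    # Numerically stable softmax: subtract the max before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Logits copied from the two example pairs above:
print(softmax([2.9367106, -2.7748516]))  # incorrectly separated pair
print(softmax([-4.113529, 4.1533113]))   # correctly separated pair
```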
Training details
Datasets:
- KorQuAD 2.1 and AIHub government documents cleaned of HTML (20k samples randomly drawn from each)
- Positive samples: a context is randomly split into a pair at a '\n' (line breaks between list items, between paragraphs, between a header and a paragraph, ...)
- Negative samples: a context is randomly split into a pair in the middle of a sentence
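The sampling scheme above might look like the following sketch. This is hypothetical code, not the actual training script: a positive pair splits a context at a real '\n', a negative pair splits inside a sentence.

```python
import random

def sample_pairs(context, rng=random):
    # Hypothetical sketch of the data construction described above.
    lines = context.split('\n')

    # Positive pair: split at a real line break, so the separation
    # between the two halves is a correct one.
    k = rng.randrange(1, len(lines))
    positive = ('\n'.join(lines[:k]), '\n'.join(lines[k:]))

    # Negative pair: split in the middle of a sentence (here, at a
    # random word boundary inside the longest line), so the two
    # halves actually continue each other.
    words = max(lines, key=len).split()
    j = rng.randrange(1, len(words))
    negative = (' '.join(words[:j]), ' '.join(words[j:]))

    return positive, negative
```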