
This model classifies a pair of passages as continuing or not. In other words, it classifies whether the line break between the pair of inputs is correct.

Usage with SentenceTransformers

Usage is easiest with SentenceTransformers installed (the model was trained with SentenceTransformers). You can then use the pre-trained model like this:

from sentence_transformers import CrossEncoder

model = CrossEncoder('model_name')
# Each input is a (first_passage, second_passage) pair.
scores = model.predict([(Paragraph1a, Paragraph1b), (Paragraph2a, Paragraph2b), (Paragraph3a, Paragraph3b)])
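
The returned scores are raw logits, one row per pair. Judging from the example results below, index 0 corresponds to "incorrectly separated" (the passages continue each other) and index 1 to "correctly separated"; a minimal sketch for turning the scores into labels:

import numpy as np

# scores has shape (num_pairs, 2); argmax picks the more likely class.
# The index mapping (0 = incorrectly separated, 1 = correctly separated)
# is inferred from the example results below.
labels = np.argmax(scores, axis=1)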

Usage with Hugging Face Transformers

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained('model_name')
tokenizer = AutoTokenizer.from_pretrained('model_name')

with torch.no_grad():
    # Join the two passages with the [SEP] token before encoding.
    encoded_input = tokenizer('[SEP]'.join([Paragraph1a, Paragraph1b]), return_tensors='pt')
    output = model(**encoded_input).logits[0]
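
The output is a pair of logits for a single input pair. To read them as probabilities, a softmax can be applied (a minimal sketch; the index mapping is inferred from the example results below):

import torch.nn.functional as F

# probs[0]: probability that the pair is incorrectly separated (the text continues)
# probs[1]: probability that the pair is correctly separated
probs = F.softmax(output, dim=-1)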

Example results

paragraph1 = 'μ΄λŸ¬ν•œ ꡭ제 μ •μ„Έ μ†μ—μ„œ μš°λ¦¬λ‚˜λΌλ„ β€˜μ œ4μ°¨ μ‚°μ—…ν˜λͺ…을 선도할 μ§€μ‹μž¬μ‚° 경쟁λ ₯ 확보’λ₯Ό λͺ©ν‘œλ‘œ 제2μ°¨ κ΅­κ°€μ§€μ‹μž¬μ‚°κΈ°λ³Έκ³„νš(2017~2021)을 μˆ˜λ¦½ν•˜μ˜€λ‹€. μ‹ κΈ°μˆ  λ„μž…, μ½˜ν…μΈ μ˜ λ””μ§€ν„Έν™”, λ‚˜κ³ μ•Ό μ˜μ •μ„œ 발효 λ“±μ˜ κΈ€λ‘œλ²Œ ν™˜κ²½ λ³€ν™”λ₯Ό λ°˜μ˜ν•œ λ³Έ κ³„νšμ˜ μ‹œν–‰μ„ 톡해 μ§€μ‹μž¬μ‚° μ œλ„'
paragraph2 = '선진화와 7μ‘° 7,251μ–΅ μ›μ˜ 생산, 3μ‘° 6,017μ–΅ μ›μ˜ λΆ€κ°€κ°€μΉ˜, 79,076λͺ…μ˜ μ·¨μ—…, 63,389λͺ…μ˜ μ·¨μ—… 유발 λ“±μ˜ 경제적 νŒŒκΈ‰νš¨κ³Όκ°€ μ˜ˆμƒλœλ‹€.'

(Here the two paragraphs are incorrectly separated: the first ends mid-sentence and the second continues it.)

  • Output from ST: [ 2.9367106, -2.7748516]
  • Output from HF: [ 2.9245, -2.7643]

paragraph1 = '1) μ£Όμš” μ •μ±…μ˜ 흐름'
paragraph2 = '2020년은 미ꡭ의 μ§€μ‹μž¬μ‚° 정책에 μžˆμ–΄μ„œ μ€‘μš”ν•œ μ‹œκΈ°λΌκ³  ν•  수 μžˆλ‹€. 미ꡭ의 λ„λ‚ λ“œ νŠΈλŸΌν”„ ν–‰μ •λΆ€μ˜ λ§ˆμ§€λ§‰ μž„κΈ°λ‘œμ„œ μ§€μ‹μž¬μ‚°κΆŒκ³Ό κ΄€λ ¨ν•œ λ¬΄μ—­μ „μŸλ„ ν•œμ°½μ§„ν–‰ μ€‘μ΄μ—ˆκΈ° λ•Œλ¬Έμ΄λ‹€. 2018λ…„λΆ€ν„° μ§€μ†λœ 미·쀑 λ¬΄μ—­μ „μŸμ— λŒ€ν•΄ νŠΈλŸΌν”„ λ―Έκ΅­ λŒ€ν†΅λ Ήκ³Ό λ₯˜ν—ˆ 쀑ꡭ μ€‘μ•™μ •μΉ˜κ΅­ μœ„μ› κ²Έ λΆ€μ΄λ¦¬λŠ” 2020λ…„ 1μ›” 15일 1단계 λ¬΄μ—­ν•©μ˜μ— μ„œλͺ…ν•˜μ˜€κ³  2μ›” 14일뢀터 λ°œνš¨λ˜μ—ˆλ‹€. λ―ΈΒ·μ€‘μ˜ 1단계 λ¬΄μ—­ν•©μ˜λ¬Έμ—λŠ” μ€‘κ΅­μ˜ μ§€μ‹μž¬μ‚°κΆŒ 침해에 λŒ€ν•œ μž…μ¦μ±…μž„'

(Here the two paragraphs are correctly separated.)

  • Output from ST: [-4.113529 , 4.1533113]
  • Output from HF: [-4.1086, 4.1488]

Training details

Datasets:

  • KorQuAD2.1 and AIHub Government Documents, cleaned from HTML (20k samples randomly drawn from each)
  • Positive samples: the context is randomly split into a pair at an existing '\n' (line breaks between list items, between paragraphs, between a header and a paragraph, ...)
  • Negative samples: the context is randomly split into a pair in the middle of a sentence (see the sketch below)
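
As an illustration, here is a minimal sketch of how such pairs could be constructed (the function names and sampling details are assumptions, not the actual training code):

import random

def make_positive(context):
    # Correct separation: split at an existing line break.
    parts = context.split('\n')           # assumes the context contains at least one '\n'
    i = random.randrange(len(parts) - 1)
    return '\n'.join(parts[:i + 1]), '\n'.join(parts[i + 1:])

def make_negative(context):
    # Incorrect separation: split in the middle of the running text.
    text = context.replace('\n', ' ')
    cut = random.randrange(1, len(text))
    return text[:cut], text[cut:]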