This model classifies a pair of passages as continuing or not. In other words, it classifies whether the line break between the two input passages is correct.
Usage with SentenceTransformers
Usage is easiest with SentenceTransformers installed (the model was trained with SentenceTransformers). You can then use the pre-trained model like this:
from sentence_transformers import CrossEncoder
model = CrossEncoder('model_name')
scores = model.predict([(Paragraph1a, Paragraph1b), (Paragraph2a, Paragraph2b), (Paragraph3a, Paragraph3b)])
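Since `predict` takes a list of (left, right) pairs, one pair per candidate line break, a small helper can build those pairs from an ordered list of passages. This is a hypothetical sketch (`adjacent_pairs` is not part of the model card or library):

```python
def adjacent_pairs(paragraphs):
    # Pair each passage with the one that follows it, so the
    # cross-encoder scores every candidate line break exactly once.
    return [(paragraphs[i], paragraphs[i + 1]) for i in range(len(paragraphs) - 1)]

# scores = model.predict(adjacent_pairs([Paragraph1a, Paragraph1b, Paragraph2a]))
```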
Usage with Huggingface Transformers
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained('model_name')
tokenizer = AutoTokenizer.from_pretrained('model_name')

encoded_input = tokenizer('[SEP]'.join([Paragraph1a, Paragraph1b]), return_tensors='pt')
with torch.no_grad():
    output = model(**encoded_input).logits[0]
Example results
paragraph1 = '...'  # Korean passage on the second national IP master plan (2017~2021); text garbled in extraction and omitted here
paragraph2 = '...'  # Korean passage with figures on expected economic effects; text garbled in extraction and omitted here
(Here the two paragraphs are incorrectly separated: the second continues the first mid-sentence.)
- Output from ST: [ 2.9367106, -2.7748516]
- Output from HF: [ 2.9245, -2.7643]
paragraph1 = '...'  # Korean numbered section heading ("1) ..."); text garbled in extraction and omitted here
paragraph2 = '...'  # Korean body paragraph on 2020 U.S. intellectual-property policy; text garbled in extraction and omitted here
(Here the two paragraphs are correctly separated.)
- Output from ST: [-4.113529 , 4.1533113]
- Output from HF: [-4.1086, 4.1488]
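The raw outputs above are logits. Judging from the two examples (an inference from this card's outputs, not documented behavior), index 0 scores an incorrect break (the passages continue) and index 1 a correct one. A softmax turns the pair into probabilities:

```python
import math

def softmax(logits):
    # Numerically stable softmax: subtract the max before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Logits copied from the two example pairs above:
print(softmax([2.9367106, -2.7748516]))  # incorrectly separated pair
print(softmax([-4.113529, 4.1533113]))   # correctly separated pair
```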
Training details
Datasets:
- KorQuAD 2.1 and AIHub government documents cleaned of HTML (20k samples randomly drawn from each)
- Positive samples: a context is randomly split into a pair at a '\n' (line breaks between list items, between paragraphs, between a header and a paragraph, ...)
- Negative samples: a context is randomly split into a pair in the middle of a sentence
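The sampling scheme above might look like the following sketch. This is hypothetical code, not the actual training script: a positive pair splits a context at a real '\n', a negative pair splits inside a sentence.

```python
import random

def sample_pairs(context, rng=random):
    # Hypothetical sketch of the data construction described above.
    lines = context.split('\n')

    # Positive pair: split at a real line break, so the separation
    # between the two halves is a correct one.
    k = rng.randrange(1, len(lines))
    positive = ('\n'.join(lines[:k]), '\n'.join(lines[k:]))

    # Negative pair: split in the middle of a sentence (here, at a
    # random word boundary inside the longest line), so the two
    # halves actually continue each other.
    words = max(lines, key=len).split()
    j = rng.randrange(1, len(words))
    negative = (' '.join(words[:j]), ' '.join(words[j:]))

    return positive, negative
```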