---
library_name: transformers
tags:
- Writing
- Academic_Writing
- Scholarly_Writing
- Overleaf
- LaTeX
- Natural_Language_Processing
license: apache-2.0
datasets:
- minnesotanlp/scholawrite
language:
- en
metrics:
- f1
base_model:
- google-bert/bert-base-uncased
pipeline_tag: text-classification
base_model_relation: finetune
---
# Model Card for scholawrite-bert-classifier
<!-- Provide a quick summary of what the model is/does. -->
## Model Details
### Model Description
<!-- Provide a longer summary of what this model is. -->
This model is referred to as BERT-SW-CLF in the paper. It is fine-tuned from the Hugging Face `bert-base-uncased` checkpoint on the `train` split of the [ScholaWrite](https://huggingface.co/datasets/minnesotanlp/scholawrite) dataset. Its sole purpose is to predict the next writing intention given scholarly writing in LaTeX.
- **Developed by:** \*Linghe Wang, \*Minhwa Lee, Ross Volkov, Luan Chau, Dongyeop Kang
- **Language:** English
- **Finetuned from model:** [bert-base-uncased](https://huggingface.co/google-bert/bert-base-uncased)
### Model Sources
<!-- Provide the basic links for the model. -->
- **Repository:** [ScholaWrite GitHub Repository](https://github.com/minnesotanlp/scholawrite/blob/main/scholawrite_finetune/bert_finetune/small_model_classifier.py)
- **Paper:** [ScholaWrite: A Dataset of End-to-End Scholarly Writing Process](https://arxiv.org/abs/2502.02904)
## Uses
<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
### Direct Use
<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
The model is intended to be used for next writing intention prediction in LaTeX paper drafts. It takes the 'before' text wrapped in special tokens as input and outputs the next writing intention, which is one of 15 predefined labels.
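As a minimal sketch of the input wrapping described above: the helper name `wrap_before_text` is hypothetical, and the layout assumes the closing tag is the `</BT>` token that the getting-started snippet registers with the tokenizer.

```python
def wrap_before_text(before_text: str) -> str:
    """Wrap the 'before' text in the special tokens the classifier expects.

    Assumption: the layout mirrors the getting-started snippet in this card,
    i.e. <INPUT><BT>...</BT> </INPUT> with a single space before </INPUT>.
    """
    return "<INPUT>" + "<BT>" + before_text + "</BT> " + "</INPUT>"


if __name__ == "__main__":
    # A short LaTeX fragment standing in for a real paper draft.
    print(wrap_before_text(r"\section{Introduction} Large language models"))
```

The resulting string is what gets passed to the tokenizer; the special tokens must already be added to the tokenizer's vocabulary for them to be treated as single tokens.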
### Out-of-Scope Use
<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
The model is fine-tuned only for next writing intention prediction and was run for inference only in a closed environment. Its main goal is to examine the usefulness of our dataset. It is suitable for academic use, but not for production, general public use, or consumer-oriented services. In addition, using this model on tasks other than next intention prediction in LaTeX paper drafts may not work well.
## Bias and Limitations
<!-- This section is meant to convey both technical and sociotechnical limitations. -->
The bias and limitations of this model come mainly from the dataset (<span style="font-variant: small-caps;">ScholaWrite</span>) it was fine-tuned on.
First, the <span style="font-variant: small-caps;">ScholaWrite</span> dataset is currently limited to the computer science domain, as LaTeX is predominantly used in computer science journals and conferences. This domain-specific focus may restrict the model's generalizability to other scientific disciplines. Future work could address this limitation by collecting keystroke data from a broader range of fields with diverse writing conventions and tools, such as the humanities or biological sciences. For example, students in the humanities usually write book-length papers and integrate more sources, which could affect the cognitive complexity of their writing process.
Second, all participants were early-career researchers (e.g., PhD students) at an R1 university in the United States, which means the model may not learn the professional writing behavior and cognitive processes of experts. Expanding the dataset to include senior researchers, such as post-doctoral fellows and professors, could offer valuable insights into how writing strategies and revision behaviors evolve with research experience and expertise.
Third, the dataset is exclusive to English-language writing, which restricts the model's ability to predict the next writing intention in multilingual or non-English contexts. Expanding to multilingual settings could reveal unique cognitive and linguistic insights into writing across languages.
## How to Get Started with the Model
```python
import os

import torch
from dotenv import load_dotenv
from huggingface_hub import login
from transformers import BertTokenizer, BertForSequenceClassification

# Authenticate with the Hugging Face Hub using a token stored in a .env file.
load_dotenv()
HUGGINGFACE_TOKEN = os.getenv("HUGGINGFACE_TOKEN")
login(token=HUGGINGFACE_TOKEN)

TOTAL_CLASSES = 15

# Start from the base tokenizer and register the special tokens.
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
tokenizer.add_tokens("<INPUT>") # start input
tokenizer.add_tokens("</INPUT>") # end input
tokenizer.add_tokens("<BT>") # start before text
tokenizer.add_tokens("</BT>") # end before text
tokenizer.add_tokens("<PWA>") # start previous writing action
tokenizer.add_tokens("</PWA>") # end previous writing action

model = BertForSequenceClassification.from_pretrained('minnesotanlp/scholawrite-bert-classifier', num_labels=TOTAL_CLASSES)
model.eval()  # disable dropout for inference

before_text = "sample before text"
# The closing tag is </BT>, matching the token registered above.
text = "<INPUT>" + "<BT>" + before_text + "</BT> " + "</INPUT>"
inputs = tokenizer(text, return_tensors="pt")

# Run a forward pass without tracking gradients and take the highest-scoring class.
with torch.no_grad():
    pred = model(**inputs).logits.argmax(dim=1)
print("class:", pred.item())
```
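The snippet above keeps only the top class. If a ranked list with probabilities is more useful, the logits can be passed through a softmax first. The following is a minimal sketch in plain Python; `softmax` and `top_k` are hypothetical helper names, and the indices it returns are the same 0-based class indices that `argmax` produces.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities (numerically stable form)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(logits, k=3):
    """Return the k most likely (class_index, probability) pairs, best first."""
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(i, probs[i]) for i in ranked[:k]]


if __name__ == "__main__":
    # Made-up logits for illustration; with the model loaded, a plain list
    # can be obtained from the logits tensor via `.logits[0].tolist()`.
    example_logits = [0.1, 2.3, -1.0, 0.7]
    for index, prob in top_k(example_logits, k=2):
        print(index, round(prob, 3))
```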
## Fine-tuning Details
### Fine-tuning Data
<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the fine-tuning data is all about as well as documentation related to data pre-processing or additional filtering. -->
This model is fine-tuned on the `train` split of the [minnesotanlp/scholawrite](https://huggingface.co/datasets/minnesotanlp/scholawrite) dataset. The dataset consists of keystroke logs of an end-to-end scholarly writing process, with thorough annotations of the cognitive writing intention behind each keystroke. No additional data pre-processing or filtering was performed on the dataset.
### Fine-tuning Procedure
<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the fine-tuning procedure. -->
The model was fine-tuned by passing in the `before_text` section of a prompt as the input and using the `intention` as the ground-truth label. The model outputs an integer corresponding to one of the intention labels (1-15).
#### Fine-tuning Hyperparameters
- **Fine-tuning regime:** fp32
- **learning_rate:** 2e-5
- **per_device_train_batch_size:** 2
- **per_device_eval_batch_size:** 8
- **num_train_epochs:** 10
- **weight_decay:** 0.01
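The hyperparameter names above match fields of `transformers.TrainingArguments`, so one plausible way to reproduce the configuration is sketched below. This assumes the Hugging Face `Trainer` API was used; the `output_dir` value and the helper name `build_training_args` are hypothetical. Leaving `fp16` and `bf16` at their defaults corresponds to fp32 training.

```python
# Values copied from the hyperparameter list above.
HYPERPARAMS = {
    "learning_rate": 2e-5,
    "per_device_train_batch_size": 2,
    "per_device_eval_batch_size": 8,
    "num_train_epochs": 10,
    "weight_decay": 0.01,
}


def build_training_args(output_dir="./bert-sw-clf"):
    """Build TrainingArguments from the listed values.

    `output_dir` is a placeholder path, not taken from the model card.
    transformers is imported lazily so the dict can be inspected without it.
    """
    from transformers import TrainingArguments

    return TrainingArguments(output_dir=output_dir, **HYPERPARAMS)


if __name__ == "__main__":
    for name, value in HYPERPARAMS.items():
        print(f"{name} = {value}")
```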
#### Machine Specs
<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
- **Hardware:** 2 × NVIDIA RTX A6000
- **Hours used:** 3.5 hrs
- **Compute Region:** Minnesota
### Testing Procedure
#### Testing Data
<!-- This should link to a Dataset Card if possible. -->
[minnesotanlp/scholawrite](https://huggingface.co/datasets/minnesotanlp/scholawrite)
#### Metrics
The classes are imbalanced in both the training and testing splits, so we use weighted F1 to measure performance.
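For readers unfamiliar with the metric, weighted F1 averages the per-class F1 scores, weighting each class by its share of the true labels. The following is a minimal pure-Python sketch of that definition (the function name `weighted_f1` is illustrative; the evaluation itself may have used a library implementation):

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted average of per-class F1 scores."""
    support = Counter(y_true)  # number of true examples per class
    total = len(y_true)
    score = 0.0
    for label, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        # Each class contributes in proportion to how often it truly occurs.
        score += (n_true / total) * f1
    return score


if __name__ == "__main__":
    # Toy labels for illustration only.
    y_true = [0, 0, 0, 1]
    y_pred = [0, 0, 1, 1]
    print(round(weighted_f1(y_true, y_pred), 4))
```

Because frequent classes carry more weight, a model that does well on the common intentions but poorly on rare ones can still score fairly high, which is worth keeping in mind when reading the results.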
#### Results
|                 | BERT   | RoBERTa | Llama-8B-Instruct | GPT-4o |
|-----------------|--------|---------|-------------------|--------|
| Base | 0.04 | 0.02 | 0.12 | 0.08 |
| + SW | 0.64 | 0.64 | 0.13 | - |
#### Summary
The table above presents the weighted F1 scores for predicting writing intentions across baselines and fine-tuned models. All models fine-tuned on ScholaWrite show improved performance compared to their baselines. BERT and RoBERTa achieved the largest improvements, while Llama-8B-Instruct showed only a modest improvement after fine-tuning. These results demonstrate the effectiveness of our ScholaWrite dataset in aligning language models with writers' intentions.
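To make the size of each improvement concrete, the absolute gains can be computed directly from the table values. This is simple arithmetic on the reported scores; GPT-4o is omitted because no fine-tuned score is reported for it.

```python
# (baseline, fine-tuned on ScholaWrite) weighted F1, copied from the results table.
scores = {
    "BERT": (0.04, 0.64),
    "RoBERTa": (0.02, 0.64),
    "Llama-8B-Instruct": (0.12, 0.13),
}

# Absolute gain in weighted F1, rounded to the table's two-decimal precision.
gains = {name: round(tuned - base, 2) for name, (base, tuned) in scores.items()}

if __name__ == "__main__":
    for name, gain in gains.items():
        print(f"{name}: +{gain:.2f}")
```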
## BibTeX
```bibtex
@misc{wang2025scholawritedatasetendtoendscholarly,
title={ScholaWrite: A Dataset of End-to-End Scholarly Writing Process},
author={Linghe Wang and Minhwa Lee and Ross Volkov and Luan Tuyen Chau and Dongyeop Kang},
year={2025},
eprint={2502.02904},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2502.02904},
}
```