---
library_name: transformers
tags:
- Writing
- Acdamic_Writing
- Scholarly_Writing
- Overleaf
- LaTex
- Natural_Language_Processing
license: apache-2.0
datasets:
- minnesotanlp/scholawrite
language:
- en
metrics:
- f1
base_model:
- google-bert/bert-base-uncased
pipeline_tag: text-classification
base_model_relation: finetune
---

# Model Card for scholawrite-bert-classifier

## Model Details

### Model Description

This model is referred to as BERT-SW-CLF in the paper. It is fine-tuned from the Hugging Face bert-base-uncased model on the `train` split of the [ScholaWrite](https://huggingface.co/datasets/minnesotanlp/scholawrite) dataset. The sole purpose of this model is to predict the next writing intention given scholarly writing in LaTeX.

- **Developed by:** \*Linghe Wang, \*Minhwa Lee, Ross Volkov, Luan Chau, Dongyeop Kang
- **Language:** English
- **Finetuned from model:** [bert-base-uncased](https://huggingface.co/google-bert/bert-base-uncased)

### Model Sources

- **Repository:** [ScholaWrite Github Repository](https://github.com/minnesotanlp/scholawrite/blob/main/scholawrite_finetune/bert_finetune/small_model_classifier.py)
- **Paper:** [ScholaWrite: A Dataset of End-to-End Scholarly Writing Process](https://arxiv.org/abs/2502.02904)

## Uses

### Direct Use

The model is intended to be used for next-writing-intention prediction in LaTeX paper drafts. It takes 'before' text wrapped in special tokens as input and outputs the next writing intention, which is one of 15 predefined labels.

### Out-of-Scope Use

The model is fine-tuned only for next-writing-intention prediction and has only been used for inference in a closed environment. Its main goal is to examine the usefulness of our dataset. It is suitable for academic use, but not for production, general public use, or consumer-oriented services. In addition, the model may not work well on tasks other than next-intention prediction in LaTeX paper drafts.

## Bias and Limitations

The biases and limitations of this model stem mainly from the dataset (ScholaWrite) it was fine-tuned on.
First, the ScholaWrite dataset is currently limited to the computer science domain, as LaTeX is predominantly used in computer science journals and conferences. This domain-specific focus may restrict the model's generalizability to other scientific disciplines. Future work could address this limitation by collecting keystroke data from a broader range of fields with diverse writing conventions and tools, such as the humanities or biological sciences. For example, students in the humanities usually write book-length papers and integrate more sources, which could affect the cognitive complexity of the writing process.

Second, all participants were early-career researchers (e.g., PhD students) at an R1 university in the United States, so the model may not have learned the professional writing behavior and cognitive processes of expert writers. Expanding the dataset to include senior researchers, such as post-doctoral fellows and professors, could offer valuable insights into how writing strategies and revision behaviors evolve with research experience and expertise.

Third, the dataset is exclusive to English-language writing, which restricts the model's ability to predict the next writing intention in multilingual or non-English contexts. Expanding to multilingual settings could reveal unique cognitive and linguistic insights into writing across languages.
## How to Get Started with the Model

```python
import os

import torch
from dotenv import load_dotenv
from huggingface_hub import login
from transformers import BertForSequenceClassification, BertTokenizer

# Read the Hugging Face access token from a .env file and log in.
load_dotenv()
HUGGINGFACE_TOKEN = os.getenv("HUGGINGFACE_TOKEN")
login(token=HUGGINGFACE_TOKEN)

TOTAL_CLASSES = 15

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# Register the special tokens used during fine-tuning. The token strings are
# blank in this card; use the exact strings from the fine-tuning script linked
# under Model Sources.
tokenizer.add_tokens("")  # start input
tokenizer.add_tokens("")  # end input
tokenizer.add_tokens("")  # start before text
tokenizer.add_tokens("")  # end before text
tokenizer.add_tokens("")  # start previous writing action
tokenizer.add_tokens("")  # end previous writing action

model = BertForSequenceClassification.from_pretrained(
    'minnesotanlp/scholawrite-bert-classifier', num_labels=TOTAL_CLASSES
)
model.eval()  # disable dropout for inference

before_text = "sample before text"
# Wrap the before text in the special tokens registered above.
text = "" + "" + before_text + " " + ""

# Tokenize, run a forward pass without gradients, and take the top class.
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
pred = logits.argmax(dim=1)

print("class:", pred.item())
```

## Fine-tuning Details

### Fine-tuning Data

This model is fine-tuned on the `train` split of the [minnesotanlp/scholawrite](https://huggingface.co/datasets/minnesotanlp/scholawrite) dataset. The dataset consists of keystroke logs of an end-to-end scholarly writing process, with thorough annotations of the cognitive writing intention behind each keystroke. No additional data pre-processing or filtering was performed on the dataset.

### Fine-tuning Procedure

The model was fine-tuned by passing in the `before_text` section of each prompt as the input and using the `intention` as the ground-truth label. The model outputs an integer corresponding to one of the intention labels (1-15).
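
As a rough illustration of the procedure above, the sketch below pairs each `before_text` with an integer id for its `intention`. It is not the authors' training code: the `[BEFORE]` and `[/BEFORE]` markers, the helper names, the toy records, and the 0-based label numbering are all assumptions made for this example.

```python
from typing import Dict, List, Tuple

# Hypothetical placeholder markers (an assumption). The real special-token
# strings are defined in the fine-tuning script and are not reproduced here.
BEFORE_OPEN, BEFORE_CLOSE = "[BEFORE]", "[/BEFORE]"


def build_label_map(intentions: List[str]) -> Dict[str, int]:
    """Assign a stable integer id to each distinct intention label."""
    return {name: idx for idx, name in enumerate(sorted(set(intentions)))}


def to_training_pairs(
    records: List[dict], label2id: Dict[str, int]
) -> List[Tuple[str, int]]:
    """Turn dataset-style records into (input text, label id) pairs."""
    pairs = []
    for rec in records:
        text = f"{BEFORE_OPEN}{rec['before_text']}{BEFORE_CLOSE}"
        pairs.append((text, label2id[rec["intention"]]))
    return pairs


# Toy records with made-up intention names, for illustration only.
records = [
    {"before_text": "\\section{Introduction}", "intention": "label_b"},
    {"before_text": "We propose a dataset", "intention": "label_a"},
    {"before_text": "\\cite{foo}", "intention": "label_b"},
]

label2id = build_label_map([r["intention"] for r in records])
pairs = to_training_pairs(records, label2id)

print(label2id)
print(pairs[0])
```

Each resulting pair can then be tokenized and fed to a sequence-classification head, with the integer id serving as the target class.
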
#### Fine-tuning Hyperparameters

- **fine-tuning regime:** fp32
- **learning_rate:** 2e-5
- **per_device_train_batch_size:** 2
- **per_device_eval_batch_size:** 8
- **num_train_epochs:** 10
- **weight_decay:** 0.01

#### Machine Specs

- **Hardware:** 2 x Nvidia RTX A6000
- **Hours used:** 3.5 hrs
- **Compute Region:** Minnesota

### Testing Procedure

#### Testing Data

[minnesotanlp/scholawrite](https://huggingface.co/datasets/minnesotanlp/scholawrite)

#### Metrics

Both the training and testing splits are class-imbalanced, so we use weighted F1 to measure performance.

#### Results

|      | BERT | RoBERTa | Llama-8B-Instruct | GPT-4o |
|------|------|---------|-------------------|--------|
| Base | 0.04 | 0.02    | 0.12              | 0.08   |
| + SW | 0.64 | 0.64    | 0.13              | -      |

#### Summary

The table above presents the weighted F1 scores for predicting writing intentions across baselines and fine-tuned models. All models fine-tuned on ScholaWrite show improved performance compared to their baselines. BERT and RoBERTa achieved the largest improvement (to 0.64 from 0.04 and 0.02, respectively), while Llama-8B-Instruct showed only a modest improvement after fine-tuning (0.12 to 0.13). These results demonstrate the effectiveness of our ScholaWrite dataset in aligning language models with writers' intentions.

## BibTeX

```
@misc{wang2025scholawritedatasetendtoendscholarly,
      title={ScholaWrite: A Dataset of End-to-End Scholarly Writing Process},
      author={Linghe Wang and Minhwa Lee and Ross Volkov and Luan Tuyen Chau and Dongyeop Kang},
      year={2025},
      eprint={2502.02904},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.02904},
}
```