---
license: mit
base_model:
- FacebookAI/roberta-base
pipeline_tag: text-classification
library_name: transformers
language:
- en
widget:
- text: "ECON 101 --- Introduction to Microeconomics"
---

This model is an instance of RoBERTa-Base fine-tuned to classify courses in student postsecondary administrative transcripts into the National Center for Education Statistics' 2010 College Course Map (CCM).
The College Course Map is a hierarchical taxonomy of course content that roughly aligns with the Classification of Instructional Programs (CIP) codes commonly used in the United States.

The College Course Map was developed for use with longitudinal surveys including the High School Longitudinal Study of 2009 (HSLS 2009), Baccalaureate and Beyond Longitudinal Study of 2008-2012 (B&B 2008), Beginning Postsecondary Students Longitudinal Study of 2004-2009 (BPS 2004), and Beginning Postsecondary Students Longitudinal Study of 2012-2017 (BPS 2012).

Administrative transcripts for all survey participants were collected along with each survey, and each course enrollment in the transcripts was labelled with the appropriate six-digit CCM code by human annotators. More information about the development of the CCM and the annotation process is available here:

Bryan, M. & Simone, S. (2012). *2010 College Course Map Technical Report*. National Center for Education Statistics. https://nces.ed.gov/pubs2012/2012162rev.pdf

This RoBERTa model is fine-tuned to classify course records into the appropriate two-digit CCM code (for example, 45 represents Social Science courses and 38 represents Philosophy and Religion courses). The model was fine-tuned on 802,190 unique course sections from the four surveys referenced above.
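
Because the CCM is hierarchical, human-annotated six-digit codes can be collapsed to the two-digit level that this model predicts. The sketch below is illustrative only: it assumes codes are stored as strings in the CIP-style `XX.XXXX` form, and `two_digit_code` is a hypothetical helper, not part of the released model or the authors' pipeline.

```python
def two_digit_code(ccm_code: str) -> str:
    """Return the two-digit series of a CCM code.

    Assumption: codes are strings in the CIP-style "XX.XXXX" form,
    so the two-digit series is everything before the first period.
    """
    series = ccm_code.strip().split(".", 1)[0]
    if not series.isdigit() or len(series) > 2:
        raise ValueError(f"Unexpected CCM code format: {ccm_code!r}")
    # Restore a leading zero that may have been dropped upstream,
    # e.g. when codes were once parsed as numbers ("1.0101" -> "01").
    return series.zfill(2)


# Collapse a few codes to the two-digit level.
for code in ["45.0601", "38.01", "1.0101"]:
    print(code, "->", two_digit_code(code))
```

Keeping codes as strings rather than floats avoids silently losing leading or trailing zeros.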

More information about the fine-tuning process is available here:

Paulson, A., Stange, K., & Flaster, A. (2024). *Classifying Courses at Scale: a Text as Data Approach to Characterizing Student Course-Taking Trends with Administrative Transcripts*. (EdWorkingPaper: 24-1042). Annenberg Institute at Brown University. https://doi.org/10.26300/7fpas433

The model was fine-tuned on inputs formatted as "{SUBJECT CODE} {CATALOG NUMBER} --- {COURSE TITLE}". For example, for a course offered in an economics department with subject code "ECON", catalog number "101", and course title "Principles of Microeconomics", the model expects the following string: "ECON 101 --- Principles of Microeconomics". [This Colab notebook](https://colab.research.google.com/drive/1iebZ_Zznpv3XPgF34LmwFozd7fSg0ZCh?usp=sharing) provides a short vignette applying the model.
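
As a minimal sketch of the input format described above, the following builds the expected string from separate fields and passes it to a Transformers `text-classification` pipeline. The helper names `format_course` and `classify_courses` are hypothetical, and stripping surrounding whitespace is an assumption rather than a documented preprocessing step. The label strings returned depend on the model's configuration and may need to be mapped to CCM codes.

```python
MODEL_ID = "annamp/classifying-courses-at-scale-two-digit-roberta-base"


def format_course(subject_code, catalog_number, course_title) -> str:
    """Build "{SUBJECT CODE} {CATALOG NUMBER} --- {COURSE TITLE}".

    Assumption: surrounding whitespace is not meaningful, so it is stripped.
    """
    subject = str(subject_code).strip()
    number = str(catalog_number).strip()
    title = str(course_title).strip()
    return f"{subject} {number} --- {title}"


def classify_courses(records, model_id: str = MODEL_ID):
    """Classify (subject_code, catalog_number, course_title) tuples.

    Returns the pipeline output: one dict per record with "label" and "score".
    """
    # Imported lazily so the formatting helper works without transformers.
    from transformers import pipeline

    classifier = pipeline("text-classification", model=model_id)
    texts = [format_course(*record) for record in records]
    return classifier(texts)


# Formatting alone needs no model download.
print(format_course("ECON", "101", "Principles of Microeconomics"))
```

Calling `classify_courses([("ECON", "101", "Principles of Microeconomics")])` downloads the model on first use and returns the predicted label with its score.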

We report the model's accuracy on individual course sections and on enrollment-weighted course sections. On unseen test data comprising 89,130 unique course sections, the model achieves the following scores:

Two-Digit Prediction Accuracy on Course Sections: 0.84 <br>
Two-Digit Prediction Accuracy on Enrollment-Weighted Course Sections: 0.90 <br>
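
The two metrics can be illustrated with a small sketch. It assumes that enrollment-weighted accuracy weights each course section by its number of enrolled students; the exact definition is given in the working paper cited above. The function names and the toy data are hypothetical and are not drawn from the test set.

```python
def section_accuracy(y_true, y_pred):
    """Share of course sections whose predicted code matches the label."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def enrollment_weighted_accuracy(y_true, y_pred, enrollments):
    """Share of enrollments that fall in correctly classified sections.

    Assumption: each section is weighted by its enrollment count.
    """
    if not (len(y_true) == len(y_pred) == len(enrollments)) or not y_true:
        raise ValueError("inputs must be non-empty and equal in length")
    total = sum(enrollments)
    if total <= 0:
        raise ValueError("total enrollment must be positive")
    correct = sum(n for t, p, n in zip(y_true, y_pred, enrollments) if t == p)
    return correct / total


# Toy data: four sections, with the one error in a small section.
y_true = ["45", "38", "27", "45"]
y_pred = ["45", "38", "27", "52"]
enrollments = [200, 30, 150, 20]

print(section_accuracy(y_true, y_pred))                           # 3 / 4 = 0.75
print(enrollment_weighted_accuracy(y_true, y_pred, enrollments))  # 380 / 400 = 0.95
```

Under this definition, weighted accuracy exceeds section-level accuracy whenever misclassified sections tend to enroll fewer students than correctly classified ones, as in the toy data.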