annamp/classifying-courses-at-scale-two-digit-roberta-base

	---
	license: mit
	base_model:
	- FacebookAI/roberta-base
	pipeline_tag: text-classification
	library_name: transformers
	language:
	- en
	widget:
	- text: "ECON 101 --- Introduction to Microeconomics"
	---

	This model is an instance of RoBERTa-Base fine-tuned to classify course records from postsecondary administrative transcripts into the National Center for Education Statistics' 2010 College Course Map (CCM).

	The College Course Map is a hierarchical taxonomy of course content that roughly aligns with the Classification of Instructional Programs (CIP) codes commonly used in the United States.

	The College Course Map was developed for use with longitudinal surveys including the High School Longitudinal Study of 2009 (HSLS 2009), Baccalaureate and Beyond Longitudinal Study of 2008-2012 (B&B 2008), Beginning Postsecondary Students Longitudinal Study of 2004-2009 (BPS 2004), and Beginning Postsecondary Students Longitudinal Study of 2012-2017 (BPS 2012).

	Administrative transcripts for all survey participants were collected along with each survey, and each course enrollment in the transcripts was labelled with the appropriate six-digit CCM code by human annotators. More information about the development of the CCM and the annotation process is available here:

	Bryan, M. & Simone, S. (2012). 2010 College Course Map Technical Report. National Center
	for Education Statistics. https://nces.ed.gov/pubs2012/2012162rev.pdf.

	This RoBERTa model is fine-tuned to classify course records into the appropriate two-digit CCM code (for example, 45 represents Social Science courses and 38 represents Philosophy and Religion courses). It was fine-tuned on 802,190 unique course sections from the four surveys referenced above.
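Because the CCM is hierarchical, a two-digit code can be read off the front of a six-digit code. The sketch below illustrates that relationship only. It assumes six-digit codes are written as strings in the CIP-style form `NN.NNNN`, which the text above does not spell out, and `split_ccm_code` is a hypothetical helper, not part of the model or of any NCES tooling.

```python
import re

# Assumed layout: two digits, a dot, then four digits (e.g. "45.0601").
# This pattern is an assumption for illustration, not taken from the model card.
_CCM_PATTERN = re.compile(r"^(\d{2})\.(\d{2})(\d{2})$")


def split_ccm_code(code: str) -> dict:
    """Split a six-digit code string into its two-, four-, and six-digit levels."""
    match = _CCM_PATTERN.match(code.strip())
    if match is None:
        raise ValueError(f"not in the assumed NN.NNNN form: {code!r}")
    two, middle, last = match.groups()
    return {
        "two_digit": two,
        "four_digit": f"{two}.{middle}",
        "six_digit": f"{two}.{middle}{last}",
    }


levels = split_ccm_code("45.0601")
print(levels["two_digit"], levels["four_digit"], levels["six_digit"])
```

Keeping codes as strings rather than numbers preserves leading zeros in codes such as `01`.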

	More information about the fine-tuning process is available here:

	Annaliese Paulson, Kevin Stange, and Allyson Flaster. (2024). Classifying Courses at Scale: a Text as Data Approach to Characterizing Student Course-Taking Trends with Administrative Transcripts. (EdWorkingPaper: 24-1042). Annenberg Institute at Brown University. https://doi.org/10.26300/7fpas433

	The model is fine-tuned on data formatted as "{SUBJECT CODE} {CATALOG NUMBER} --- {COURSE TITLE}". For example, for a course offered in an economics department with subject code "ECON", catalog number "101", and course title "Principles of Microeconomics", the model expects the string "ECON 101 --- Principles of Microeconomics" (with no trailing period). [This](https://colab.research.google.com/drive/1iebZ_Zznpv3XPgF34LmwFozd7fSg0ZCh?usp=sharing) Colab Notebook provides a short vignette applying the model.
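The input format described above can be sketched in Python as follows. This is a minimal illustration, not the authors' code: `format_course_record` and `classify_courses` are hypothetical helper names, and the repository id is taken from this model's page. Whether the model is sensitive to casing or extra whitespace is not stated here, so the helper only trims surrounding spaces.

```python
MODEL_ID = "annamp/classifying-courses-at-scale-two-digit-roberta-base"


def format_course_record(subject_code, catalog_number, course_title):
    """Build the "{SUBJECT CODE} {CATALOG NUMBER} --- {COURSE TITLE}" string."""
    subject = str(subject_code).strip()
    number = str(catalog_number).strip()
    title = str(course_title).strip()
    return f"{subject} {number} --- {title}"


def classify_courses(texts, model_id=MODEL_ID):
    """Run the text-classification pipeline on already formatted strings.

    transformers is imported lazily so the formatter can be used without it.
    Calling this function downloads the model weights on first use.
    """
    from transformers import pipeline

    classifier = pipeline("text-classification", model=model_id)
    return classifier(list(texts))


text = format_course_record("ECON", "101", "Principles of Microeconomics")
print(text)
```

Calling `classify_courses([text])` should return one dictionary per input with `label` and `score` keys, as the `transformers` text-classification pipeline normally does. How the label strings map to two-digit CCM codes depends on the model's configuration and is not documented here.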

	We report the model's accuracy on individual course sections and on enrollment-weighted course sections. The model achieves the following scores on unseen test data consisting of 89,130 unique course sections:

	Two-Digit Prediction Accuracy on Course Sections: 0.84 <br>
	Two-Digit Prediction Accuracy on Enrollment-Weighted Course Sections: 0.90 <br>
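The difference between the two metrics can be sketched with toy data. The example assumes that enrollment weighting means each course section counts in proportion to the number of students enrolled in it. The exact definition is given in the working paper cited above, and the labels and enrollment counts below are invented for illustration.

```python
def accuracy(y_true, y_pred, weights=None):
    """Share of correct predictions, optionally weighted per section."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if weights is None:
        weights = [1] * len(y_true)  # every section counts equally
    if len(weights) != len(y_true):
        raise ValueError("weights must match the number of sections")
    total = sum(weights)
    correct = sum(w for t, p, w in zip(y_true, y_pred, weights) if t == p)
    return correct / total


# Invented toy data: four sections, one misclassified low-enrollment section.
true_codes = ["45", "38", "45", "11"]
pred_codes = ["45", "38", "52", "11"]
enrollments = [120, 30, 10, 40]

section_accuracy = accuracy(true_codes, pred_codes)
weighted_accuracy = accuracy(true_codes, pred_codes, enrollments)
print(section_accuracy, weighted_accuracy)
```

In this toy data the only error falls on the smallest section, so the enrollment-weighted score is higher than the per-section score.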