---
{}
---

# Model Card for Geo-BERT-multilingual

<!-- Provide a quick summary of what the model is/does. -->

This model predicts the geolocation of short texts (less than 500 words) in the form of a two-dimensional distribution, also referred to as a Gaussian Mixture Model (GMM).

## Model Details

- Number of predicted points: 5
- Custom transformers pipeline and result visualization: https://github.com/K4TEL/geo-twitter/tree/predict

### Model Description

<!-- Provide a longer summary of what this model is. -->

This project aims to solve the tweet/user geolocation prediction task and to provide a flexible methodology for geotagging textual big data.
The suggested approach applies BERT-based neural networks for NLP to estimate locations in the form of two-dimensional GMMs (longitude, latitude, weight, covariance).
The base model was fine-tuned on a Twitter dataset containing the text content and metadata context of tweets.

- **Developed by:** Kateryna Lutsai
- **Model type:** regression
- **Language(s) (NLP):** multilingual
- **Finetuned from model:** bert-base-multilingual-cased

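Since each prediction is a set of weighted 2D Gaussian components, a small example may help clarify how such an output can be used. The sketch below is illustrative only: the function and variable names are not from the repository, and a spherical covariance (a single variance per component) is assumed from the single covariance value listed above. It evaluates the mixture density at a query coordinate:

```python
import numpy as np

def gmm_density(point, weights, means, covs):
    """Evaluate the density of a 2D spherical Gaussian mixture at `point`.

    point   -- (2,) array: (longitude, latitude)
    weights -- (K,) mixture weights, summing to 1
    means   -- (K, 2) component means (longitude, latitude)
    covs    -- (K,) one variance per component (spherical covariance)
    """
    density = 0.0
    for w, mu, var in zip(weights, means, covs):
        diff = point - mu
        norm = 1.0 / (2.0 * np.pi * var)  # 2D spherical Gaussian normalizer
        density += w * norm * np.exp(-(diff @ diff) / (2.0 * var))
    return density

# Toy example with two components (all values made up for illustration)
weights = np.array([0.7, 0.3])
means = np.array([[14.42, 50.09], [16.61, 49.20]])  # (lon, lat) pairs
covs = np.array([1.0, 2.0])

p = gmm_density(np.array([14.42, 50.09]), weights, means, covs)
```

The density is highest near the dominant component's mean and decays with distance, which is how a single "most likely point" or a confidence region can be derived from the mixture.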
### Model Sources

<!-- Provide the basic links for the model. -->

- **Repository:** https://github.com/K4TEL/geo-twitter
- **Paper:** https://arxiv.org/pdf/2303.07865.pdf
- **Demo:** https://github.com/K4TEL/geo-twitter/blob/predict/prediction.ipynb

## Uses

<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->

Geotagging of big data

### Direct Use

<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->

Per-tweet geolocation prediction

### Out-of-Scope Use

<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->

Per-tweet geolocation prediction without "user" metadata is expected to show lower prediction accuracy.

## Bias, Risks, and Limitations

<!-- This section is meant to convey both technical and sociotechnical limitations. -->

There is a risk of unethical use based on data that is not publicly available.

The limit on input text length is dictated by the BERT-based model's capacity of 500 tokens (words).

### How to Get Started with the Model

Use the code below to get started with the model:

https://github.com/K4TEL/geo-twitter/tree/predict

A short startup guide is given in the repository branch description.

## Training Details

### Training Data

<!-- This should link to a Data Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->

The Twitter dataset contained tweets with their text content, metadata ("user" and "place") context, and geolocation coordinates.

### Training Procedure

<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->

Information about training the model on user-defined data can be found in the GitHub repository: https://github.com/K4TEL/geo-twitter

#### Training Hyperparameters

- **Learning rate start:** 1e-5
- **Learning rate end:** 1e-6
- **Learning rate scheduler:** cosine
- **Number of epochs:** 3
- **Batch size:** 10
- **Optimizer:** Adam
- **Intra-feature loss:** mean
- **Inter-feature loss:** mean
- **Neg log-likelihood domain:** positive
- **Features:** NON-GEO + GEO-ONLY

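As a sanity check on the schedule listed above, a cosine decay from the start rate (1e-5) to the end rate (1e-6) can be sketched as follows. This is a generic cosine curve; the repository's exact scheduler implementation may differ (e.g. warmup steps):

```python
import math

def cosine_lr(step, total_steps, lr_start=1e-5, lr_end=1e-6):
    """Cosine decay from lr_start to lr_end over total_steps."""
    progress = step / total_steps  # 0.0 at training start, 1.0 at the end
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # 1.0 -> 0.0
    return lr_end + (lr_start - lr_end) * cosine

# Learning rate at the first, middle, and last step of a toy 100-step run
lrs = [cosine_lr(s, 100) for s in (0, 50, 100)]
```

The curve starts at 1e-5, passes through the midpoint (1e-5 + 1e-6) / 2 = 5.5e-6 halfway through training, and ends at 1e-6.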
## Evaluation

<!-- This section describes the evaluation protocols and provides the results. -->

All performance metrics and results are presented in the Results section of the article preprint: https://arxiv.org/pdf/2303.07865.pdf

### Testing Data, Factors & Metrics

#### Testing Data

<!-- This should link to a Data Card if possible. -->

Worldwide dataset of tweets with TEXT-ONLY and NON-GEO features

#### Metrics

<!-- These are the evaluation metrics being used, ideally with a description of why. -->

- Spatial metrics: mean and median Simple Accuracy Error (SAE), Acc@161
- Probabilistic metrics: mean and median Cumulative Accuracy Error (CAE), mean and median Prediction Region Area (PRA) for the 95% density area, Coverage of PRA

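SAE is the great-circle distance between predicted and true coordinates, and Acc@161 is the fraction of predictions within 161 km (about 100 miles) of the truth. A minimal sketch of these spatial metrics (helper names are illustrative, not from the repository):

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points."""
    r = 6371.0  # mean Earth radius, km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def spatial_metrics(pred, true):
    """Mean/median SAE (km) and Acc@161 over paired (lat, lon) lists."""
    errors = sorted(haversine_km(a, b, c, d) for (a, b), (c, d) in zip(pred, true))
    n = len(errors)
    mean_sae = sum(errors) / n
    median_sae = errors[n // 2] if n % 2 else (errors[n // 2 - 1] + errors[n // 2]) / 2
    acc161 = sum(e <= 161.0 for e in errors) / n  # fraction within ~100 miles
    return mean_sae, median_sae, acc161
```

Because distance errors are heavily right-skewed (a few predictions on the wrong continent inflate the mean), the median SAE and Acc@161 are usually the more informative numbers.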
### Results

**Tweet geolocation prediction task**
- TEXT-ONLY: mean SAE of 1588 km, median SAE of 50 km, 61% Acc@161
- NON-GEO: mean SAE of 800 km, median SAE of 25 km, 80% Acc@161

**User home geolocation prediction task**
- TEXT-ONLY: mean SAE of 892 km, median SAE of 31 km, 74% Acc@161
- NON-GEO: mean SAE of 567 km, median SAE of 26 km, 82% Acc@161

### Model Architecture and Objective

A linear-regression wrapper layer with a configurable number of output variables, operating on the classification ([CLS]) token embedding generated by the base BERT model.

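A minimal sketch of such a wrapper head, assuming BERT-base's 768-dimensional [CLS] embedding and 5 predicted components with 4 parameters each (longitude, latitude, weight, covariance). The shapes and names are assumptions for illustration, not the repository's actual code:

```python
import numpy as np

HIDDEN = 768          # BERT-base hidden size
N_POINTS = 5          # predicted GMM components ("Number of predicted points: 5")
N_OUT = N_POINTS * 4  # (longitude, latitude, weight, covariance) per component

rng = np.random.default_rng(0)
W = rng.normal(scale=0.02, size=(HIDDEN, N_OUT))  # regression weights
b = np.zeros(N_OUT)                               # regression bias

def regression_head(cls_embedding):
    """Map the [CLS] token embedding to raw GMM parameters."""
    out = cls_embedding @ W + b
    return out.reshape(N_POINTS, 4)  # one row per component: (lon, lat, w, cov)

params = regression_head(rng.normal(size=HIDDEN))
```

In training, the raw outputs would be passed through the negative log-likelihood objective over the GMM, with weights normalized and covariances constrained positive.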
#### Hardware

NVIDIA GeForce GTX 1080 Ti

#### Software

Python IDE

## Model Card Contact

lutsai.k@gmail.com