---
title: Code Tagging
emoji: 👀
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: 5.9.1
app_file: app.py
pinned: false
short_description: Tags phrases from study respondents.
python_version: 3.13.1
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

TFT Tagging app by Kenneth Lee, PercepTech.AI

What it does

Uses an LLM to tag each quote in a column of an Excel file with zero or more tags from a given list, and returns the tagged quotes both in the app and in a new file, along with some statistics and translations of randomly selected quotes for each tag.

Inputs:

  • Single-sheet Excel file
  • Column header identifying which column of data should be tagged
  • List of tags to be assigned to the data
  • List of columns to retain in the output file (these will not be changed from the input file)

Outputs:

  • List of quotes and their assigned tags
  • Count of the number of instances tagged for each tag
  • Excel file containing the above data together with translations of some quotes corresponding to each tag that was used
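The inputs and outputs above can be sketched as follows; `tag_quote()` is a hypothetical stand-in for the real LLM call (stubbed here with keyword matching so the example runs), and the column and tag names are illustrative:

```python
# Sketch of the tagging flow: tag one column of an Excel-like table,
# retain the requested columns unchanged, and count instances per tag.
import pandas as pd

TAGS = ["price", "quality", "service"]

def tag_quote(quote: str, tags: list[str]) -> list[str]:
    # Placeholder for the real LLM call: return zero or more tags per quote.
    return [t for t in tags if t in quote.lower()]

def tag_column(df: pd.DataFrame, column: str, tags: list[str],
               keep: list[str]) -> pd.DataFrame:
    out = df[keep].copy()                        # retained columns, unchanged
    out["tags"] = df[column].map(lambda q: tag_quote(q, tags))
    return out

# In the app the DataFrame would come from pd.read_excel(uploaded_file).
df = pd.DataFrame({
    "id": [1, 2],
    "response": ["Great quality for the price", "No comment"],
})
tagged = tag_column(df, "response", TAGS, keep=["id"])
counts = tagged["tags"].explode().value_counts()  # instances per tag
```

The output file would then be written back with `tagged.to_excel(...)`, with the translation columns appended.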

Notables

Authentication:

  • Private spaces require an HF account to access, hence this space is set as public. However, anyone without the username and password will be stuck at the login screen.
  • The username and password are stored as secrets on HF itself. You may also refer to emails I have sent.
  • A temporary username, password, and login expiry time may be set in the secrets in order to give temporary access to TFT contractors. The expiry time should be entered as Singapore time in ISO 8601 format; any logins with the temporary account after this time will be refused.
  • The progress bar only takes tagging time into account, not translation time, so it may briefly hang at 100% while translation is running.

Quality of tagging:

  • Ideally we want the LLM's tags to match exactly what a trained market researcher would assign. There are however two difficulties: 1) even two market researchers will not give identical tags for the same tag list and the same dataset, and 2) the LLM shows some variation in its tags even when run twice on the same tag list and dataset, due to its non-deterministic nature.
  • As such, to address the above difficulties, we assessed the quality of the LLM's tags by calculating precision, recall, and F1 for 8 aggregated runs of the LLM vs a trained market researcher (credit to Yong Li of TFT for providing the data), and also by calculating these for myself vs Yong Li.
  • By doing so we obtain an estimate of the amount of agreement between humans on the same tags and dataset; if the LLM can match this on average, it can be considered equal to a human for this task.
  • We achieved an F1 of approximately 90 for myself vs Yong Li and about 70 for the AI vs Yong Li, indicating that the AI has room for improvement.
  • Refer to Perceptech AI driver/TFT tagging app reference material for the data
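The precision/recall/F1 comparison between two taggers can be sketched like this, treating each (quote, tag) pair as one prediction; the sample data below is illustrative, not the real study data:

```python
# Precision, recall, and F1 between a reference tagger and a candidate
# tagger, where each quote carries a set of zero or more tags.
def prf1(reference: list[set[str]], candidate: list[set[str]]):
    tp = sum(len(r & c) for r, c in zip(reference, candidate))  # agreed tags
    fp = sum(len(c - r) for r, c in zip(reference, candidate))  # extra tags
    fn = sum(len(r - c) for r, c in zip(reference, candidate))  # missed tags
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Two quotes: the taggers agree on "price" but disagree on "service".
human = [{"price"}, {"service"}]
llm = [{"price"}, set()]
p, r, f = prf1(human, llm)
```

The same function applies equally to human-vs-human and LLM-vs-human comparisons, which is what makes the two F1 figures above directly comparable.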

Sample input file and corresponding output:

  • Refer to Perceptech AI driver/TFT tagging app reference material

Branching behaviour:

  • HF doesn't allow storage of any branches other than main online, to my knowledge. As such I usually update the repo with `git push origin main`, so be sure everything works before pushing!

Future work

  • In the future we hope to improve the tagging so that the LLM-human F1 score matches human-human F1 scores.
  • I explored fine-tuning using 10 responses from the OE US dataset and Yong Li's model answers; this improves performance on the training set from F1 70 to F1 85, but I do not have a test set. The fine-tuned model name is "ft:gpt-4o-mini-2024-07-18:percepsense::BsDwvH1E"; just substitute it wherever the model (e.g. "gpt-4o-mini") is specified.
  • If TFT is able to retrieve any historical tagging data of their own human researchers on survey responses, this may be good training material also.
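Swapping in the fine-tuned model mentioned above might look like this, assuming the app uses the `openai` chat completions client; the prompt wording and function name here are illustrative, not the app's actual code:

```python
# Sketch of substituting the fine-tuned model: MODEL is the only change
# needed wherever the base model name is currently specified.
# MODEL = "gpt-4o-mini"  # base model
MODEL = "ft:gpt-4o-mini-2024-07-18:percepsense::BsDwvH1E"  # fine-tuned

def tag_with_llm(quote: str, tags: list[str]) -> str:
    from openai import OpenAI  # reads OPENAI_API_KEY from the environment
    client = OpenAI()
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system",
             "content": f"Tag the quote with zero or more of: {', '.join(tags)}"},
            {"role": "user", "content": quote},
        ],
    )
    return resp.choices[0].message.content
```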