|
|
--- |
|
|
license: cc-by-nc-sa-4.0 |
|
|
language: |
|
|
- en |
|
|
pipeline_tag: text-generation |
|
|
--- |
|
|
# Purpose |
|
|
As part of a project assignment for EPFL's CS-401 course, we needed a simple model to estimate a character's importance within a movie plot. From movie scripts crawled from https://imsdb.com/, we computed the gold portion of each character over the entire script and joined it with movie plot datasets. |
|
|
|
|
|
# Model info |
|
|
- Input prompt format |
|
|
|
|
|
```
f"Predict the percentage of a movie's plot that a character takes up.\nCharacter: {character_name} \nPlot: {plot}"
``` |
|
|
|
|
|
- Output |
|
|
|
|
|
`13.4` (a percentage of the plot) |
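The prompt can be assembled and the model's text output parsed with small helpers. This is a minimal sketch: `build_prompt` and `parse_portion` are hypothetical convenience functions (not part of the released model), and the character/plot values are placeholders.

```python
def build_prompt(character_name: str, plot: str) -> str:
    # Mirrors the input prompt format described above.
    return (
        "Predict the percentage of a movie's plot that a character takes up."
        f"\nCharacter: {character_name} \nPlot: {plot}"
    )

def parse_portion(generated_text: str) -> float:
    # The model emits a bare number such as "13.4"; strip whitespace and convert.
    return float(generated_text.strip())

prompt = build_prompt("John McClane", "A New York cop battles terrorists in a skyscraper.")
portion = parse_portion("13.4")  # -> 13.4
```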
|
|
|
|
|
We trained with a maximum input length of 2048 tokens (`max_length=2048`). |
|
|
|
|
|
- Sample code |
|
|
```python |
|
|
import torch |
from transformers import T5Tokenizer, T5ForConditionalGeneration |
 |
device = "cuda" if torch.cuda.is_available() else "cpu" |
 |
tokenizer = T5Tokenizer.from_pretrained("Hyeongdon/t5-large-character_plot_portion")  # same as the default T5 tokenizer |
model = T5ForConditionalGeneration.from_pretrained("Hyeongdon/t5-large-character_plot_portion").to(device) |
 |
# `prompts` is a list of input strings in the format shown above |
model_inputs = tokenizer(prompts, max_length=2048, truncation=True, padding='max_length', return_tensors='pt') |
 |
model.eval() |
with torch.no_grad(): |
    outputs = model.generate(input_ids=model_inputs['input_ids'].to(device), attention_mask=model_inputs['attention_mask'].to(device)) |
 |
predictions = tokenizer.batch_decode(outputs, skip_special_tokens=True)  # e.g. ["13.4"] |
|
|
``` |
|
|
|
|
|
# Limitation & Tips |
|
|
ChatGPT performs better without any fine-tuning: on our internal metric, this T5-large model slightly underperforms GPT-3.5 and GPT-4. |
|
|
If you are interested in our research project, see https://margg00.github.io/ |