print(attentions[0].size()) # torch.Size([48, 12, 32, 32])
```

## Usage (Batch)

Here's a more fleshed-out example showing how to run LUAR across many batches of data:

```python
import numpy as np
import torch
from termcolor import cprint
from tqdm import tqdm
from transformers import AutoModel, AutoTokenizer

def generate_data(num_batches: int = 100, batch_size: int = 32, num_samples_per_author: int = 16):
    """
    Generator that produces dummy data for testing.

    Args:
        num_batches (int): Total number of batches to yield.
        batch_size (int): Number of authors per batch.
        num_samples_per_author (int): Number of text samples per author.

    Yields:
        list: A batch of data structured as a list of lists of strings.
              Shape: (batch_size, num_samples_per_author)
    """
    s = "This is an example string."
    for _ in tqdm(range(num_batches)):
        # Create a batch where each element is 's' repeated 'num_samples_per_author' times
        yield [[s] * num_samples_per_author for _ in range(batch_size)]

def flatten(l):
    """
    Helper function to flatten a 2D list into a 1D list.

    Args:
        l (list): List of lists.

    Returns:
        list: Flattened list.
    """
    return [item for sublist in l for item in sublist]

def main():
    cprint("Starting LUAR-MUD example script...", "magenta")

    # --- Model Loading ---
    cprint("Loading model 'rrivera1849/LUAR-MUD'...", "blue")
    # trust_remote_code=True is required for custom model architectures like LUAR-MUD
    model = AutoModel.from_pretrained("rrivera1849/LUAR-MUD", trust_remote_code=True)
    model.eval()

    # Check for CUDA availability and move the model to the appropriate device
    device = "cuda" if torch.cuda.is_available() else "cpu"
    cprint(f"Moving model to device: {device}", "yellow")
    model.to(device)

    # --- Tokenizer Loading ---
    cprint("Loading tokenizer...", "blue")
    tokenizer = AutoTokenizer.from_pretrained("rrivera1849/LUAR-MUD", trust_remote_code=True)

    # --- Configuration ---
    num_batches = 100
    batch_size = 32
    num_samples_per_author = 16
    max_length = 512

    cprint("\nConfiguration:", "cyan")
    print(f"  Batch Size: {batch_size}")
    print(f"  Samples per Author: {num_samples_per_author}")
    print(f"  Max Length: {max_length}")
    print(f"  Device: {device}\n")

    all_outputs = []

    cprint("Starting inference loop...", "green")

    # Disable gradient calculation to save memory and compute
    with torch.inference_mode():
        for i, batch in enumerate(generate_data(num_batches=num_batches, batch_size=batch_size, num_samples_per_author=num_samples_per_author)):
            if (i + 1) % 10 == 0:
                print(f"  Processing batch {i + 1}...")

            # Flatten the batch structure for tokenization:
            # (batch_size, num_samples) -> (batch_size * num_samples)
            batch = flatten(batch)

            # Tokenize the flattened batch
            inputs = tokenizer(batch, return_tensors="pt", padding=True, max_length=max_length, truncation=True)

            # Move inputs to the same device as the model
            inputs = inputs.to(device)

            # Reshape input_ids and attention_mask to match the model's expected 3D input:
            # (batch_size, num_samples_per_author, sequence_length)
            inputs["input_ids"] = inputs["input_ids"].reshape(batch_size, num_samples_per_author, -1)
            inputs["attention_mask"] = inputs["attention_mask"].reshape(batch_size, num_samples_per_author, -1)

            # Forward pass through the model
            outputs = model(**inputs)

            # Move outputs back to the CPU and convert to NumPy for storage
            all_outputs.append(outputs.cpu().numpy())

    # Concatenate all batch results into a single array; axis=0 is the batch dimension
    all_outputs = np.concatenate(all_outputs, axis=0)

    cprint("\nInference complete!", "green")
    cprint(f"Final output shape: {all_outputs.shape}", attrs=["bold"])

if __name__ == "__main__":
    main()
```
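
Each row of `all_outputs` is one author-level embedding. As an illustrative sketch of what you might do with them (cosine similarity as the comparison metric is an assumption here, and the toy 3-dimensional vectors merely stand in for real model outputs):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two 1-D embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy stand-ins for rows of `all_outputs` (real LUAR embeddings are model-sized).
emb_a = np.array([0.1, 0.3, 0.5])
emb_b = np.array([0.1, 0.3, 0.5])
emb_c = np.array([-0.5, 0.2, -0.1])

print(cosine_similarity(emb_a, emb_b))  # ~1.0 for identical vectors
print(cosine_similarity(emb_a, emb_c))  # lower for dissimilar vectors
```

Higher similarity between two embeddings suggests the corresponding document sets are more likely to share an author.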

## Citing & Authors

If you find this model helpful, feel free to cite our [publication](https://aclanthology.org/2021.emnlp-main.70.pdf).