Sampath1987 committed
Commit 28c7271 · verified · Parent(s): 7e4a9fa

fine-tuned model-v1
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
1_Pooling/config.json ADDED
@@ -0,0 +1,10 @@
+ {
+ "word_embedding_dimension": 768,
+ "pooling_mode_cls_token": true,
+ "pooling_mode_mean_tokens": false,
+ "pooling_mode_max_tokens": false,
+ "pooling_mode_mean_sqrt_len_tokens": false,
+ "pooling_mode_weightedmean_tokens": false,
+ "pooling_mode_lasttoken": false,
+ "include_prompt": true
+ }
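The pooling config above enables only CLS-token pooling over 768-dimensional token embeddings (all mean/max/last-token modes are disabled). As a rough illustration of what that setting means — a sketch, not the actual sentence-transformers implementation — CLS pooling reduces a `(batch, seq_len, dim)` array of token embeddings to `(batch, dim)` by keeping each sequence's first token vector:

```python
import numpy as np

def cls_pool(token_embeddings: np.ndarray) -> np.ndarray:
    """Illustrative CLS pooling, mirroring pooling_mode_cls_token = true.

    token_embeddings has shape (batch, seq_len, word_embedding_dimension);
    the sentence embedding is the first (CLS) token's vector per sequence.
    """
    return token_embeddings[:, 0, :]

# Example: 2 sequences of 16 tokens, word_embedding_dimension = 768.
batch = np.zeros((2, 16, 768))
print(cls_pool(batch).shape)  # (2, 768)
```

In normal use none of this is written by hand: loading the checkpoint with `SentenceTransformer(...)` reads `1_Pooling/config.json` and applies this pooling automatically.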
README.md ADDED
@@ -0,0 +1,881 @@
+ ---
+ tags:
+ - sentence-transformers
+ - sentence-similarity
+ - feature-extraction
+ - dense
+ - generated_from_trainer
+ - dataset_size:89129
+ - loss:MultipleNegativesRankingLoss
+ base_model: Alibaba-NLP/gte-multilingual-base
+ widget:
+ - source_sentence: How does vendor-specific data acquisition affect DTS profile interpretation?
+ sentences:
+ - 'Bridging data management gap by gathering all well integrity data in one unique
+ data base. The aim of ADNOC Offshore in-house Well Integrity Data Management System
+ (WIDMS) is to comply with the 3A rule: Accessibility of the data, Accuracy by
+ performing regular quality check and Analysis. The analysis allows to maintain
+ wells barriers robust, to ensure personnel safety and to quickly identify integrity
+ issues to make qualified decisions about appropriate mitigations measures and
+ avoid risk escalation. WIDMS has been developed in-house with inputs and collaboration
+ of various stake holders. An enhancement list has been established selecting the
+ most relevant features that will be added value to the system. Therefore, Automation
+ for sub processes like thresholds calculations and Risk Assessment which gives
+ input for Well Passports that contains all the required information to evaluate
+ the well risks and implement the required mitigation measures.
+
+ End users are following a RACI Chart to keep WIDMS database on track and to ensure
+ no data falls through the cracks as all the data workflow is defined through the
+ different steps such as providing data, entering it in the system, informing relevant
+ stakeholders and providing technical clarifications if needed. The result of data
+ acquisition in WIDMS is that data flows across the entire organization, with defined
+ access rights in line with ADNOC Offshore policies. This data is collected from
+ various sources, is a robust data base, essential for evaluating and maintaining
+ well integrity.
+
+ It is enhancing well barriers system management by allowing to have full overview
+ of well''s barriers performance. Moreover, it allows to have reliable and continuously
+ available data such as annulus pressure data that is critical for well integrity
+ assurance, to avoid the uncontrolled release of hydrocarbons to the atmosphere.
+ Notifications have been implemented so alerts can be sent for engineers to inform
+ about any abnormality and non-compliance. As technology evolves, using paper-based
+ processes, excel spreadsheets, time-based equipment inspection and testing become
+ less effective. Well diagnostics are expensive so utilizing well data analytics
+ through this digital hub project will ease having detailed real time data and
+ quick analysis for early detection of failures and anticipation and reduction
+ of risk escalation.'
+ - "##### 2.3.1 Site characterization - secondary seal \nSecondary seals might have\
+ \ a significant relevance in ensuring CO 2 containment, acting\nas additional\
+ \ barrier to flow, although it is not clear if it is considered a requirement\
+ \ for\nstandards. Two documents show some contradiction: \nISO 27914 [36] is\
+ \ silent on secondary seal as a requirement until section 5.4.3.2 that describes\n\
+ its characterization. Moreover, if it is a requirement, characterization should\
+ \ include not\nonly geometry and lithology, but also integrity evaluation, which\
+ \ is not mentioned. \nISO/TR 27915 [37] section 5.2.6 and Figure 2 state that\
+ \ the geological storage complex is\ncomposed of the reservoirs where CO 2 is\
+ \ injected and the caprock (or seals); it then states\nthat additional geologic\
+ \ layers are outside complex."
+ - 'Geothermal energy is considered a reliable, sustainable and abundant source of
+ energy with minimized environmental impact. The extracted geothermal energy may
+ be utilized for direct heating, or electricity generation. The main challenge
+ to access this energy is tremendous capital expenditures required for drilling
+ and completion. Therefore, this work discusses and evaluates retrofitting abandoned
+ petroleum wells to geothermal as a commonly proposed solution to the mentioned
+ challenge.
+
+ There are many oil and gas wells globally which are not used for production, injection
+ or other purposes. Well abandonment is commonly considered as an essential measure
+ to ensure safety and integrity of these wells, bearing huge costs and concerns
+ for the petroleum industry. By converting abandoned or non-activated oil and gas
+ wells to geothermal wells, it is claimed to be possible to produce geothermal
+ energy and generate power. As a crucial stage for the claim verification and evaluation
+ of feasibility or efficiency of this conversion, it is important to be aware of
+ the practical and simulation case studies.
+
+ Therefore, in this work, this work presents a comprehensive overview and analysis
+ of 20 case studies published from different countries, followed by important downhole
+ and surface parameters. As for the downhole characteristics, production scenarios
+ either open-loop or closed-loop, optimization of open-loop systems, borehole heat
+ exchangers with their different types and dimensions, and insulations are covered.
+ Next, surface cycles including organic Rankine cycle (ORCs), selection of circulation
+ fluids, flow rates, and working fluids are covered, followed by produced and net
+ powers with evaluation of coefficient of performance (COP) and thermal efficiency.
+ This investigation shows there is good potential for producing geothermal energy
+ from abandoned and non-activated petroleum wells.'
+ - source_sentence: Why must welding consumables be limited to specific classifications
+ and manufacturers for EGW?
+ sentences:
+ - "8 API R ECOMMENDED P RACTICE 582 \n**5.2.6 EGW** \nThe use of EGW shall be\
+ \ limited by the following conditions: \na) EGW shall be used only with filler\
+ \ materials specifically intended for the EGW process (ASME/AWS SFA/A5.26/\nSFA/A5.26M),\
+ \ \nb) welding consumables shall be limited to the classification and the manufacturer’s\
+ \ trade name used in the PQR, \nc) only filler materials having classifications\
+ \ with specified minimum impact test requirements should be used. \n**5.2.7 SAW**\
+ \ \n**5.2.7.1** SAW procedures shall be requalified whenever the welding flux\
+ \ is changed from one manufacturer’s trade\nname to another. Equivalence under\
+ \ ASME _BPVC_ Section II, Part C, or AWS filler metal specifications shall not\
+ \ be\nconsidered adequate for substitution without requalification. \nCOMMENTARY\
+ \ It is recognized that fluxes having the same classification can be very different\
+ \ in their\ncomposition. However, nominal flux composition is not included in\
+ \ AWS or ASME specifications/codes, and flux\nsuppliers do not normally provide\
+ \ this information. Differences among fluxes of the same classification can result\
+ \ in\ndifferent and unanticipated weld properties when these fluxes are used interchangeably\
+ \ over the range of variables\ntypically stated in weld procedure specifications.\
+ \ \n**5.2.7.2** Manually held (semiautomatic) SAW is not permitted for welding\
+ \ pressure-containing parts, unless approved\nby the purchaser. \n**5.2.7.3**\
+ \ A separate qualification is required for SAW welds in which any pass is greater\
+ \ than [1] / 2 in. \n**5.3 Single-sided Welded Joints** \nFor single-sided welded\
+ \ joints where process side corrosion is a concern, welding processes using coatings\
+ \ or fluxes\nshall not be used for root pass welding of austenitic stainless steels,\
+ \ non-ferrous alloys and nickel-base alloys unless\nslag can be removed from the\
+ \ process side of root passes and the area inspected for slag removal. \n**5.4\
+ \ Combining Welding Processes** \nCombining two or more welding processes that\
+ \ use alloy filler metals of different nominal compositions, other than A1\nthro\
+ \ ~~ug~~ h A5, requires qualification as a combination procedure."
+ - 'Following multi-disciplinary reviews, an opportunity was identified to restore
+ production and unlock incremental reserves from well X, a dual completion in two
+ different reservoirs but subsequently deserted due to long term community crisis
+ that led to over 25years of non-production with complete vandalization of well
+ head and flowlines.
+
+ The method employed involved the strategic resolution of long-term crisis between
+ two communities where well X is located, via a multi-disciplinary effort involving
+ the operating company’s Community Relations, HSSE, Production Engineering, Operations
+ Support and Portfolio Management functions. The installation of a retrofitted
+ well head was done with first line and second line maintenance carried out. Wireline
+ drifting and static bottom hole pressures were acquired for both strings using
+ slickline equipment and a preliminary well test was conducted for both strings
+ with production to a flowback tank.
+
+ The preliminary result for the long string (LS) indicated a high water cut (>80%),
+ while the result from short string (SS) was in line with expectation (<57%). The
+ test result from the short string informed the decision to construct a new flowline
+ for restoring its production, while further subsurface evaluation is required
+ for the long string (LS). The significance of the short string (SS) result is
+ the unlocking of additional reserves ca. 1.0MMSTB from a reservoir with remaining
+ oil in place estimated at ca. 18MMSTB, where the short string (SS) is the only
+ drainage point currently completed on the reservoir.
+
+ This solution provides a cost effective and efficient way to increasing production
+ and reserves at minimal expenditure leveraging on multi-disciplinary expertise,
+ using existing infrastructures as well as resolving community crisis, where applicable.'
+ - "**Exploration & Production** \n**General Specification** Date: 10/2007 \n**GS\
+ \ EP STR 301** Rev: 07 \nsuccessful practice of the process in previous similar\
+ \ jobs to the satisfaction of the\nCOMPANY. \nb) Only Extra Low Hydrogen processes\
+ \ (max. 5 ml H2/100 g) shall be used for welding and tack \nwelding of Special\
+ \ and First Category members or materials having specified YS above\n262 MPa (38,000\
+ \ psi). The same requirement shall apply for any welding on castings and\nforgings.\
+ \ \nc) For Second Category members and Non-Structural members, welding processes\
+ \ other than\nExtra Low Hydrogen processes may be used, subject to prior approval\
+ \ by COMPANY, for\nmaterials having specified YS up to 262 MPa together with thickness\
+ \ up to 12.70 mm\n(0.500”). \nd) The number of different welding processes shall\
+ \ be minimized. \ne) Different welding consumables qualities (basic extra low,\
+ \ basic low,) in a same type of\nconsumables shall be avoided. \n**8.5 Welding\
+ \ consumables** \n**8.5.1 Selection of consumables** \na) Consumables shall\
+ \ conform to ANSI/AWS D 1.1 code and shall have been approved by an\ninternational\
+ \ recognized certification body (e.g. DNV, LLOYD’s, etc.). \nb) If classification\
+ \ of the structure is required, welding consumables shall conform to rules of\
+ \ the\nClassification Society. \nc) Cellulosic electrodes are strictly forbidden\
+ \ for structural use \nd) Welds forming connections between steels of different\
+ \ grades of material shall develop the\nminimum specified tensile properties of\
+ \ the lower steel grades being joined, unless otherwise\npreviously approved by\
+ \ the COMPANY. \nWelds forming connections between steels of different grades\
+ \ of material shall develop the\nminimum specified notch impact properties at\
+ \ the lowest temperature of steel grades being\njoined, unless otherwise previously\
+ \ approved by the COMPANY. \ne) For repair welding or multiple repairs, “extra\
+ \ low hydrogen” electrodes are required\n(i.e. maximum specified hydrogen content\
+ \ of 5 ml per 100 gram of weld metal). \nf) For welding castings or forgings,\
+ \ “extra low hydrogen” electrodes are required (i.e. maximum\nspecified hydrogen\
+ \ content of 5 ml per 100 gram of weld metal)."
+ - source_sentence: What is the recommended use of blank samples in sampling procedures
+ involving the trapping or precipitation of components?
+ sentences:
+ - "elements. We do this by performing a similarity transformation on the matrix\
+ \ _k_ . The coordinate systems _x_ = { _x_ 1, _x_ 2 } and _y_ = { _y_ 1,, _y_\
+ \ 2 } are related by the similarity transformation matrix \n_A_ such that \n\
+ _y_ = _Ax_ . ................................................................\
+ \ (2.127) \nThe two coordinate systems are shown in Fig. 2.6.\nAn angle ( _θ_\
+ \ ) is associated with the transformation in Eq. 2.127 by writing the 2D coordinate\
+ \ transformation as \n_y_ 1 \n_y_ 2 \n_x_ 1 \n_x_ 2 \n= \ncos _θ_ sin _θ_\
+ \ \n−sin _θ_ cos _θ_ \n. ........................................... (2.128)\
+ \ \nThe coordinate systems _x_ = { _x_ 1, _x_ 2 } and _y_ = { _y_ 1,, _y_ 2 }\
+ \ are related by the counterclockwise rotation shown in Fig. 2.6. We have an aligned\
+ \ coordinate system _y_ = { _y_ 1,, _y_ 2 } with the\nprincipal axes of the permeability\
+ \ tensor. The diagonal tensor in the coordinate system \n_y_ = { _y_ 1,, _y_\
+ \ 2 } has the form \n_k_ ′=\n( \n_k_ max 0 \n0 _k_ _T_\n) [, .........................................................\
+ \ (2.129)] \n**Print** **Search** **Chapter 1** **Home** **Chapter 3** **Bookmarks**\
+ \ **Help**"
+ - "**© 2010 COPYRIGHT MERCADO NEGRO, LAS PLAYITAS. MARACAIBO-EDO. ZULIA, VENEZUELA.**\
+ \ \n**PARA COMPRAR AL DETAL O AL MAYOR, ESTE Y OTROS PRODUCTOS, FAVOR PREGUNTAR\
+ \ POR EL GÖAJIRO BLANCO, EN EL MERCADO LAS PLAYITAS.** \n**ADVERTENCIA: \"EL\
+ \ DERECHO DE AUTOR NO ES UNA FORMA DE PROPIEDAD SINO UN DERECHO CULTURAL. EXIGE\
+ \ TU DERECHO\"** \nI-208 Petroleum Engineering Handbook—Vol. I \n**Fig. 4.10—Chromatogram\
+ \ showing broad OBM peak.** \n**Fig. 4.11—Chromatogram showing narrow OBM peak.**\
+ \ \nSpecial correction techniques are increasingly used within the oil industry,\
+ \ and because\nthese techniques vary between organizations and laboratories, sample\
+ \ selection should be done\nonly after considering which method to use. Many companies\
+ \ are forced to use oil-based\ndrilling muds to manage drilling costs in water-sensitive\
+ \ formations, and the added expense of\nhandling contaminated samples (and the\
+ \ risk associated with poorer-quality data) must be used\nto evaluate the overall\
+ \ economic balance. \nFor water samples, comparisons of duplicates also give\
+ \ a good indication of quality. Where\nfluid concentration may be stabilizing\
+ \ (e.g., at the end of a cleanup), sequential samples should\nbe used to look\
+ \ for compositional trends and thus to help decide if representative fluid has\n\
+ been sampled. For some sampling procedures involving trapping or precipitation\
+ \ of particular\ncomponents, it is highly recommended to use blank “samples,”\
+ \ which undergo exactly the\nsame treatment and storage as the actual sample and\
+ \ provide a reference measurement to assist\nwith the interpretation of laboratory\
+ \ measurements. More details are available in API _RP 45._ [10] \n**Print** **Search**\
+ \ **Chapter 3** **Home** **Chapter 5** **Bookmarks** **Help**"
+ - 'the ideal time to take samples? (6) Will on-site analyses be required? (7) Who
+ will perform
+
+ sampling and analysis duties?
+
+ Fluid-sampling operations are often left to service-company personnel, but because
+ significant variation in levels of competence exists within the industry and within
+ service companies
+
+ themselves, it is recommended either to use specialist laboratory personnel or
+ to supervise the
+
+ service-company operations closely.
+
+ General guidelines for choosing reservoir-fluid-sampling methods and sample quantities
+ required are summarized in **Table 4.2.** Regardless of the actual volumes mentioned,
+ you should
+
+ collect at least two separate samples of each fluid, referred to as duplicate
+ or replicate samples.
+
+ This reduces the chance of losing information if one of the samples leaks or is
+ accidentally
+
+ damaged during laboratory operations, and it allows a comparison between the samples
+ as part
+
+ of the quality-control procedures.
+
+ Surface-separator sampling is the most common technique, but the reservoir-fluid
+ sample
+
+ recombined in the laboratory is subject to errors in the measured GOR and any
+ imprecision in
+
+ the laboratory recombination procedure. Downhole samples (or wellhead samples)
+ are not affected by such inaccuracies but require the fluid to be in monophasic
+ condition when sampled;
+
+ this can be confirmed definitively only afterward in the laboratory. Also, there
+ is general reluctance to attempt downhole sampling in gas/condensate reservoirs
+ because many are saturated,
+
+ and the phases are likely to segregate in the wellbore. The ideal situation for
+ a laboratory is to
+
+ receive both surface and downhole samples because a choice is then available,
+ and a good idea
+
+ can be obtained of how representative the resulting fluid is.
+
+ In certain circumstances, it can be good practice to collect “backup” fluid samples
+ at the
+
+ earliest opportunity during a production test, even if a well has not cleaned
+ up properly. If the
+
+ test has to be aborted for some reason [well bridging, unexpected levels of hydrogen
+ sulfide
+
+ (H 2 S), etc.], the backup samples may be of great value, even if they are not
+ 100% representative. If the test is completed successfully, the backup samples
+ can be discarded to avoid the
+
+ cost of unnecessary shipment and testing.
+
+ If sampling is part of a long-term monitoring program, such as those required
+ by government authorities or those forming part of custody-transfer contracts,
+ the methods defined in the
+
+ appropriate documentation or contracts must be followed as closely as possible,
+ even if this'
+ - source_sentence: What is the significance of implementing a centralized, web-based
+ integrated surveillance tool for production optimization?
+ sentences:
+ - "P RESSURE - RELIEVING AND D EPRESSURING S YSTEMS 141 \n**5.7.11.2.3\
+ \ Flare Gas Characteristics** \nFlare gases can have widely varying compositions\
+ \ that shall be evaluated during specification of recovery systems.\nThe potential\
+ \ for materials that are not compatible with the flare gas treating systems or\
+ \ ultimate destinations shall be\ndetermined. For example, relief streams containing\
+ \ acid gases typically are routed directly to the flare, thereby\nbypassing the\
+ \ recovery system. Highly inert streams can also be incompatible with recovery\
+ \ systems. \n**5.7.11.3 Design Considerations** \n**5.7.11.3.1 Sizing** \n\
+ Figure 13 shows a conceptual design for a flare gas recovery system. Typically,\
+ \ the system consists of one or more\nreciprocating compressors whose suction\
+ \ is directly connected to the flare header. The compressed gas is usually\nrouted\
+ \ to some type of treating system appropriate for the gas composition, then to\
+ \ fuel gas or processing systems. \n3 \n**Key**\n1 compressor load control\n\
+ 2 flare gas treating\n3 from process unit flare knockout drums \n4 flare header\
+ \ \n5 flare knockout drum (if used) \n6 water seal \n7 flare \na Compressor\
+ \ shutdown. \n**Figure 13—Typical Flare Gas Recovery System** \nCopyright American\
+ \ Petroleum Institute\nProvided by IHS under license with API Licensee=Petrofac\
+ \ International Ltd/5954785001, User=McNicol, William\nNo reproduction or networking\
+ \ permitted without license from IHS Not for Resale, 01/29/2014 03:10:03 MST"
+ - 'Inorganic scale precipitation and deposition in oil and gas wells can cause significant
+ production loss, which results in additional operational expenditure (OPEX) and
+ health safety and environmental (HSE) risks. Scale management requires a detailed
+ understanding of production rates, hydrocarbon and produced water compositions
+ as well as reservoir conditions. Accurate real-time analysis of produced water
+ compositions can immediately identifiy scaling risks in a production well and
+ can lead to significantly reduced production loss, optimized chemical dosages,
+ and fewer workovers, consequently lowering OPEX and mitigating HSE risk. This
+ paper introduces development of a device capable of measuring the most critical
+ parameters associated with inorganic scale in flowing produced water including
+ pH, alkalinity, strontium, barium, sulfate, total hardness, total dissolve solids
+ (TDS) and others.
+
+ In order to measure these water properties with the device, different methods
+ were tested, but eventually, a combination of spectrophotometric and other methods
+ were determined effective. One of the challenges of using spectrophotometric methods
+ is the reagent stability over time. Hence, customized reagents were prepared for
+ this application and the stability of these reagents was tested over time. Specific
+ calibration methods were designed in order to extend the usage of the reagents.
+
+ Static measurements were initially performed and the results showed precise measurements
+ of all the parameters. Results from dynamic tests utilizing real time flow and
+ static test were in agreement and the accuracy was confirmed by traditional methods.
+ Once the device prototype was built in our laboratories, production fluids were
+ used to test the complete device. This device can be placed at various attachment
+ points from the wellhead to the separator. This automated device is capable of
+ collecting a discrete production fluid sample, separating produced water from
+ the bulk phase and measuring various properties of produced water. These properties
+ are reported electronically and used as part of a combined real time scale risk
+ prevention system. In addition, this device measures parameters while maintaining
+ wellhead pressure and temperature in order to eliminate the potentials errors
+ in measurements, for instance pH of water changes due to degassing and precipitation
+ as a result of changes in pressure and temperature.
+
+ A field trial is planned to test the device under full flowing conditions. This
+ will be the first automated real-time produced water composition monitoring device
+ with high measurement accuracy while maintaining pressure and temperature of samples,
+ which can be attached at various points from wellhead to separator. This can be
+ beneficial to identify the scaling risk in production wells before severe scaling
+ occurs. The device is designed to enhance reliability of water properties measurements,
+ provide real-time measurements, and reduce downtime and costs associated with
+ scale problems and sampling.'
+ - 'In 2009, the Kuwait Integrated Digital Field (KwIDF) project was established
+ in the Sabriyah field in north Kuwait to boost production and reserves (Al-Jasmi
+ et al. 2014). The goal was to help realize the vision of sustained oil production
+ in Kuwait of four million barrels of oil equivalent per day (BOE/D) by 2030 (Goel
+ et al. 2013). The project involved the creation of 11 integrated, automated workflows,
+ and a real-time collaborative environment to help optimize production, reduce
+ downtime, and improve reservoir management:
+
+ Key performance monitoring—calculates and displays key parameters to monitor and
+ assess asset performance at the field and well levels (Al-Jasmi et al. 2013).
+
+ Well performance evaluation (WPE)—allows users to model and evaluate any well
+ in real time, from completion face to wellhead (Cullick et al. 2013).
+
+ Smart production surveillance (SPS)—helps enable users to control production and
+ make surveillance decisions in real time (Villamizar et al. 2013).
+
+ Production loss—an advance workflow for users to compare current oil production
+ to pre-established allowable rates (Villamizar et al. 2013).
+
+ Electric submersible pump (ESP) diagnostic and optimization—helps enable users
+ to interactively monitor and optimize ESP operated well operations (Velasquez
+ et al. 2013).
+
+ Production allocation—integrates the allocation process within the KwIDF environment,
+ increases the frequency of the allocation cycle, and improves the accuracy of
+ allocated volumes (Al-Jasmi et al. 2013).
+
+ Gas lift (GL) diagnostic and optimization—uses a smart real-time control that
+ provided proactive recommendations for GL systems optimization (Al-Jasmi et al.
+ 2013).
+
+ Reporting and distribution—displays system generated alarms from all KwIDF workflows,
+ generates tickets, and reports ticket status (Al-Jasmi et al. 2013).
+
+ Simulation model update and ranking—an automated workflow for reservoir history
+ matching (Carvajal et al. 2013).
+
+ Reservoir visualization and analysis, and subsurface waterflood optimizer—helps
+ enable the monitoring of subsurface health during the waterflooding process, and
+ provides predictive reservoir optimization analysis and actions (Ranjan et al.
+ 2013).
+
+ By 2012, KwIDF had been deployed on 49 wells, representing a pilot that served
+ as a proof of concept. By 2013, cumulative production gains of 756,000 barrels
+ of oil were reported (Singh et al. 2013). While the gains were impressive, and
+ management wanted to expand KwIDF, it was recognized that full deployment would
+ pose significant challenges and, without a set of necessary changes, the value
+ of KwIDF would not be realized.
+
+ The key challenge facing management was to identify the appropriate operating
+ model to deliver on the KwIDF vision and scale the program to accommodate future
+ expansion across the rest of the organization. A transition and deployment assessment
+ team was established by management to address this challenge.
+
+ The transition and deployment assessment project produced a recommended operating
+ model, a transition road map, change management strategy, risk and mitigation
+ plan, and project charters to assist the program team and steering group in the
+ deployment of KwIDF across the rest of North Kuwait.'
405
+ - source_sentence: What role did anti-collision analysis play in the drilling of the
406
+ dual lateral well?
407
+ sentences:
408
+ - This paper aims to analyze the impact of appraising and developing marginal fields
409
+ with multiple stacked reservoirs which is quite challenging in terms of techno
410
+ commercial value. The development of such marginal reservoirs using conventional
411
+ single horizontal wells drilling and completion is uneconomical. Therefore, it
412
+ was necessary to engineer a solution that can enhance the commercial value of
413
+ the project by reducing CAPEX and OPEX. This paper will present the first comprehensive
414
+ business case, where multiple stacked reservoirs with marginal reserves were studied
415
+ to produce independently using multilateral completions, granting full accessibility
416
+ of the laterals while achieving production monitoring and reservoir surveillance.
  - 'This paper is a comprehensive analytic driven study on the use and sizing of
    membrane filters to improve the injected water quality for maintaining injectivity
    in tight carbonate reservoirs. Out of the different mechanisms of formation damage,
    the pore plugging with the migration of particles within the injectant fluids
    by bridging at the pore throat junctions and/or by pore filling can lead to the
    buildup of an internal filter cake away from the wellbore that limits the well’s
    injectivity and can affect the vertical and lateral sweep.

    This type of formation damage is very difficult to treat with any kind of stimulation
    and the impact will be manifested especially in tight formations with interbedded
    stylolites layers with a total range of permeabilities from 2 to less than 1 milli-Darcy
    and a median pore throat size ranging from 2.5 to 0.3 micron meters. The study
    comprises several parts starting with a geological analysis that was conducted
    to identify areas and layers most prone to formation pore plugging by analyzing
    thin-sections and MICP data. Second, in the lack of core flood tests, a reservoir
    and well study analyzed existing water injectors situated in similar or slightly
    higher quality rock areas through the analysis of injectivity index behavior to
    estimate the impact of damage and the expected injector’s half-life.

    As a result, through the application of an analytical mathematical model for defining
    deep bed filtration parameter, a correlation was established based on average
    injected particle size and reservoir rock quality to aid in selecting the proper
    water injection filter size. In order to confirm that, a dedicated injectivity
    test in a horizontal well utilizing membrane filters was carried out to assess
    eventual formation damage and the filters efficiency by conducting a series of
    multiple pressure fall-off tests coupled with injection profile logging to monitor
    any induced damage within the wellbore region.

    Finally, the operational aspects and the integration within field development
    plans were addressed, especially with the recommended well placement and completion.
    This culminated in a field development strategy for formation damage mitigation
    in tight carbonate reservoirs during production and injection phase that can be
    used in other similar fields.'
  - 'The most common challenge in horizontal drilling is depth uncertainty which can
    be due to poor seismic data or interpretation. It is arguable that a successful
    landing of the wellbore in the reservoir optimally and within the desired zone
    is the most challenging in most geosteering operation. The presence of fluid contacts
    such as oil-water-contact (OWC) and gas-oil-contact (GOC) complicates the whole
    drilling process, most especially if these fluid contacts are not well defined
    or known. Additionally, the ability to map the boundaries of the reservoir as
    the BHA drills the lateral section is an added advantage to remaining within the
    desired reservoir section.

    The success of any reservoir navigation service where seismic uncertainty at the
    reservoir top is high will rely largely on how effective the geosteering system
    is and how the geosteering engineer is able to react promptly to changes while
    landing the well in the reservoir and drilling the lateral section with without
    exiting the reservoir.

    Reservoir Navigation Service (RNS) provides the means for the drilling near horizontal
    or horizontal wells for the purpose of increasing hydrocarbon extraction from
    the earth''s subsurface. This involves the use of a pre-defined bottom hole assembly
    (BHA) with inbuilt downhole logging while drilling (LWD) and measurement while
    drilling (MWD) sensors. The measurements from these downhole sensors are uplinked
    to the surface of the wellbore where they are converted to meaningful petrophysical
    data. The goal is to use the downhole petrophysical data such as gamma ray, propagation
    resistivity and so on, to update an existing pre-well geological model of a section
    of the earth in such a way that the final result depicts the true model picture
    of the earth subsurface.

    This paper focuses on using well CBH-44L to showcase how the use of real-time
    distance-to-boundary (D2B) measurement from a deep reading azimuthal propagation
    resistivity tool is use to correct for depth uncertainty in seismic, thereby,
    improving the chance of successfully landing and drilling a horizontal well.'
datasets:
- Sampath1987/offshore_energy
pipeline_tag: sentence-similarity
library_name: sentence-transformers
metrics:
- cosine_accuracy
model-index:
- name: SentenceTransformer based on Alibaba-NLP/gte-multilingual-base
  results:
  - task:
      type: triplet
      name: Triplet
    dataset:
      name: ai job validation
      type: ai-job-validation
    metrics:
    - type: cosine_accuracy
      value: 0.7850282788276672
      name: Cosine Accuracy
---

# SentenceTransformer based on Alibaba-NLP/gte-multilingual-base

This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [Alibaba-NLP/gte-multilingual-base](https://huggingface.co/Alibaba-NLP/gte-multilingual-base) on the [offshore_energy](https://huggingface.co/datasets/Sampath1987/offshore_energy) dataset. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

## Model Details

### Model Description
- **Model Type:** Sentence Transformer
- **Base model:** [Alibaba-NLP/gte-multilingual-base](https://huggingface.co/Alibaba-NLP/gte-multilingual-base) <!-- at revision 9bbca17d9273fd0d03d5725c7a4b0f6b45142062 -->
- **Maximum Sequence Length:** 8192 tokens
- **Output Dimensionality:** 768 dimensions
- **Similarity Function:** Cosine Similarity
- **Training Dataset:**
    - [offshore_energy](https://huggingface.co/datasets/Sampath1987/offshore_energy)
<!-- - **Language:** Unknown -->
<!-- - **License:** Unknown -->

### Model Sources

- **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
- **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
- **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)

### Full Model Architecture

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False, 'architecture': 'NewModel'})
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
```
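
The last two modules do the embedding post-processing: CLS-token pooling followed by L2 normalization. The following toy sketch (not the actual model, just illustrative 4-dimensional numbers) shows what those two steps compute:

```python
# Illustrative sketch of CLS pooling + Normalize(): take the first (CLS) token
# vector from the transformer output and scale it to unit length, so that
# cosine similarity downstream reduces to a plain dot product.
import math

def cls_pool_and_normalize(token_embeddings):
    """Take the first (CLS) token vector and L2-normalize it."""
    cls = token_embeddings[0]
    norm = math.sqrt(sum(x * x for x in cls))
    return [x / norm for x in cls]

# Toy 4-dimensional "token embeddings" for a 3-token sequence
tokens = [[3.0, 4.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0], [0.5, 0.5, 0.0, 0.0]]
embedding = cls_pool_and_normalize(tokens)
print(embedding)                       # [0.6, 0.8, 0.0, 0.0]
print(sum(x * x for x in embedding))   # 1.0 (unit norm)
```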

## Usage

### Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

```bash
pip install -U sentence-transformers
```

Then you can load this model and run inference.
```python
from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("Sampath1987/EnergyEmbed-v1")
# Run inference
sentences = [
    'What role did anti-collision analysis play in the drilling of the dual lateral well?',
    'This paper aims to analyze the impact of appraising and developing marginal fields with multiple stacked reservoirs which is quite challenging in terms of techno commercial value. The development of such marginal reservoirs using conventional single horizontal wells drilling and completion is uneconomical. Therefore, it was necessary to engineer a solution that can enhance the commercial value of the project by reducing CAPEX and OPEX. This paper will present the first comprehensive business case, where multiple stacked reservoirs with marginal reserves were studied to produce independently using multilateral completions, granting full accessibility of the laterals while achieving production monitoring and reservoir surveillance.',
    "The most common challenge in horizontal drilling is depth uncertainty which can be due to poor seismic data or interpretation. It is arguable that a successful landing of the wellbore in the reservoir optimally and within the desired zone is the most challenging in most geosteering operation. The presence of fluid contacts such as oil-water-contact (OWC) and gas-oil-contact (GOC) complicates the whole drilling process, most especially if these fluid contacts are not well defined or known. Additionally, the ability to map the boundaries of the reservoir as the BHA drills the lateral section is an added advantage to remaining within the desired reservoir section.\nThe success of any reservoir navigation service where seismic uncertainty at the reservoir top is high will rely largely on how effective the geosteering system is and how the geosteering engineer is able to react promptly to changes while landing the well in the reservoir and drilling the lateral section with without exiting the reservoir.\nReservoir Navigation Service (RNS) provides the means for the drilling near horizontal or horizontal wells for the purpose of increasing hydrocarbon extraction from the earth's subsurface. This involves the use of a pre-defined bottom hole assembly (BHA) with inbuilt downhole logging while drilling (LWD) and measurement while drilling (MWD) sensors. The measurements from these downhole sensors are uplinked to the surface of the wellbore where they are converted to meaningful petrophysical data. The goal is to use the downhole petrophysical data such as gamma ray, propagation resistivity and so on, to update an existing pre-well geological model of a section of the earth in such a way that the final result depicts the true model picture of the earth subsurface.\nThis paper focuses on using well CBH-44L to showcase how the use of real-time distance-to-boundary (D2B) measurement from a deep reading azimuthal propagation resistivity tool is use to correct for depth uncertainty in seismic, thereby, improving the chance of successfully landing and drilling a horizontal well.",
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[1.0000, 0.4457, 0.3235],
#         [0.4457, 1.0000, 0.3388],
#         [0.3235, 0.3388, 1.0000]])
```
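
Because the model ends with a `Normalize()` module, its embeddings are unit-length and cosine similarity is a plain dot product, which is what makes simple semantic-search ranking cheap. The sketch below illustrates the ranking step on hypothetical toy vectors; with the real model you would obtain the vectors via `model.encode(...)` as shown above:

```python
# Minimal semantic-search ranking sketch on toy unit vectors (stand-ins for
# real embeddings). For unit-norm vectors, dot product == cosine similarity.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

query = [1.0, 0.0, 0.0]
corpus = {
    "doc_a": [0.8, 0.6, 0.0],
    "doc_b": [0.0, 1.0, 0.0],
    "doc_c": [0.96, 0.0, 0.28],
}
# Rank corpus documents by similarity to the query, most similar first
ranked = sorted(corpus, key=lambda name: dot(query, corpus[name]), reverse=True)
print(ranked)  # ['doc_c', 'doc_a', 'doc_b']
```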

<!--
### Direct Usage (Transformers)

<details><summary>Click to see the direct usage in Transformers</summary>

</details>
-->

<!--
### Downstream Usage (Sentence Transformers)

You can finetune this model on your own dataset.

<details><summary>Click to expand</summary>

</details>
-->

<!--
### Out-of-Scope Use

*List how the model may foreseeably be misused and address what users ought not to do with the model.*
-->

## Evaluation

### Metrics

#### Triplet

* Dataset: `ai-job-validation`
* Evaluated with [<code>TripletEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.TripletEvaluator)

| Metric              | Value     |
|:--------------------|:----------|
| **cosine_accuracy** | **0.785** |
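
Cosine accuracy is the fraction of (anchor, positive, negative) triplets for which the anchor is closer, in cosine terms, to its positive than to its negative. A small sketch of that definition on toy unit-norm vectors (illustrative data, not the validation set):

```python
# Sketch of the triplet cosine-accuracy metric: count how often
# sim(anchor, positive) > sim(anchor, negative).

def cos(u, v):
    # Vectors below are already unit-norm, so a dot product suffices.
    return sum(a * b for a, b in zip(u, v))

triplets = [
    # (anchor, positive, negative)
    ([1.0, 0.0], [0.9, 0.436], [0.0, 1.0]),   # correct: 0.9  > 0.0
    ([0.0, 1.0], [0.6, 0.8],   [0.8, 0.6]),   # correct: 0.8  > 0.6
    ([1.0, 0.0], [0.0, 1.0],   [1.0, 0.0]),   # wrong:   0.0  < 1.0
    ([0.6, 0.8], [0.8, 0.6],   [0.0, 1.0]),   # correct: 0.96 > 0.8
]
correct = sum(cos(a, p) > cos(a, n) for a, p, n in triplets)
accuracy = correct / len(triplets)
print(accuracy)  # 0.75
```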

<!--
## Bias, Risks and Limitations

*What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
-->

<!--
### Recommendations

*What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
-->

## Training Details

### Training Dataset

#### offshore_energy

* Dataset: [offshore_energy](https://huggingface.co/datasets/Sampath1987/offshore_energy) at [0ebbfc6](https://huggingface.co/datasets/Sampath1987/offshore_energy/tree/0ebbfc615bc7c9bbd3d58315bc2e14e91f291fa1)
* Size: 89,129 training samples
* Columns: <code>anchor</code>, <code>positive</code>, and <code>negative</code>
* Approximate statistics based on the first 1000 samples:
  |         | anchor | positive | negative |
  |:--------|:-------|:---------|:---------|
  | type    | string | string   | string   |
  | details | <ul><li>min: 12 tokens</li><li>mean: 24.68 tokens</li><li>max: 77 tokens</li></ul> | <ul><li>min: 37 tokens</li><li>mean: 437.61 tokens</li><li>max: 983 tokens</li></ul> | <ul><li>min: 28 tokens</li><li>mean: 410.96 tokens</li><li>max: 1188 tokens</li></ul> |
* Samples:
  | anchor | positive | negative |
  |:-------|:---------|:---------|
  | <code>What is the significance of end point relative permeability of the oil phase in the productivity of oil reservoirs below bubble point pressure?</code> | <code>In contrast with what is followed for Offshore Oil Operations the majority of the Onshore Oil Operations in the world do not have a Minimum and Mandatory required HSE training program for all personnel including contractors and subcontractors.<br>A comparison is drawn between the Minimum and Mandatory HSE Training Programmes applied offshore in developed areas, mainly North Sea and Gulf of Mexico and the benefits that similar programs can bring to the ME onshore oil operations are addressed by estimating the risk reduction and potential economic benefits.<br>The applicability of such Minimum and Mandatory HSE Training Programs is analyzed against the scenario of heavy utilization of contractors and subcontractors with different approach and standards in HSE training and also the increasing complexity of the onshore oil operations<br>An estimation of how many lives can potentially be saved by the introduction of such programs is provided in global and generic terms.<br>The HR Impact, in different a...</code> | <code>The knowledge of relative permeability is key in oil production mechanism as it affects multiphase flow which is vital to producible reserves in petroleum reservoirs. In this study, the impact of altering end point saturation on relative permeability curve and how it influences oil recovery was investigated on field X in Niger Delta, Nigeria. The saturation end points obtained after a simulation study was used as a start point to predict oil production. These end points saturation of water and oil were altered and varied according to facies. The eclipse simulation tool was used in conducting the prediction runs. The result obtained showed wide variation from actual production forecast (i.e. ≥ 25%) when end points were varied with no guided limit from experimental data. This study reveals the need for an accurate determination of residual oil saturation as it was seen to have an impact on forecast and history match.</code> |
  | <code>What role does the effective coefficient of discharge (_Kd_) play in calculating the required effective discharge area?</code> | <code>96 API S TANDARD 520, P ART I—S IZING AND S ELECTION <br>**B.2.3.3** Using the theoretical mass flux obtained from numerical integration above, one may determine the<br>required effective discharge area: <br>In USC units: <br>_Q_ × ρ 1<br>× <br>sec gal _G_ ×<br>60 × 7 4805 .<br>min ft 3 <br>_A_ = _W_ = _Q_ × ρ × 1<br>_G_ × _K_ d 60 sec × 7 4805 . gal _G_ × _K_ <br>d 60 × 7 4805 . d<br>3 <br>528 62 2 × . 1 2 2 <br>_A_ = × = 0 0148 ft . = 2 135 in. . (B.8) <br>60 7 4805 × . 7 592 14 0 65, . × . <br>In SI units: <br>_Q_ ×ρ 1<br>× <br>sec liter _G_ ×<br>60 min × 1 000, m 3 <br>_A_ = _W_ = _Q_ ×ρ × 1<br>_G_ × _K_ d 60 sec × 1 000, liter 3 _G_ × _K_ <br>, <br>_A_ = 2 000, × 996 9 . × 1 = 1 379 . × 10 − 3 m 2 = 1 379 mm, 2 (B.9)<br>60 × 1 000, 37 068, × 0 65 . <br>where <br>_G_ is the theoretical mass flux through the nozzle, lb/s·ft [2] (kg/s·m [2] ); <br>_W_ is the required relief rate, lb/s (kg/s); <br>_Q_ is the required relief rate, gal/min (L/min); <br>ρ = 1 _v_ is the fluid density, lb/ft [3] (kg/m [3] ); <br>_K_ d is the effective coefficient of discharge...</code> | <code>S IZING, S ELECTION, AND I NSTALLATION OF P RESSURE - RELIEVING D EVICES 59 <br>**5.6.3 Sizing for Critical Flow** <br>**5.6.3.1 General** <br>**5.6.3.1.1** Pressure-relief devices in gas or vapor service that operate at critical flow conditions (see 5.6.2)<br>may be sized using Equation (2) through Equation (7). Each of the equations may be used to calculate the<br>effective discharge area, _A_, required to achieve a required flow rate through a pressure-relief device. A PRV<br>that has an effective discharge area equal to or greater than the calculated value of _A_ is then chosen for the<br>application. <br>In USC units: <br>_A_ = (2) <br>_A_ = (3) <br>6 32 . _CK P K K_ d 1 b c <br>_A_ = (4) <br>1 175 . _CK P K K_ <br>1 175 . _CK P K K_ d 1 b c <br>. <br>In SI units: <br>_A_ = (5) <br>_A_ = (6)<br>_CK P K K_ <br>d 1 b c <br>_A_ =<br>_CK P K K_ <br>=<br>(7) <br>d 1 b c <br>where <br>_A_ is the required effective discharge area of the device, in. [2] (mm [2] ) (see 3.20); <br>_W_ is the required flow through the device, lb/h (kg/h); <br>_C...</code> |
  | <code>How many swellable packers were required to be run in the horizontal hole part for the AICV trial, and what was the purpose of this requirement?</code> | <code>Removing fluid from a wellbore column, allowing a well to flow initially, or bringing a previous well back online, nitrogen lifting is commonly used in north Iraq wells. Due to the inability of coiled tubing units to be delivered on time and their high cost, operators are forced to seek for an alternative method of unloading drilling fluid. A hydraulic Jet Pump is a technology used to complete the task.<br>A newly drilled well DB-H was chosen, and the drilling fluid volume calculated was 12,000 bbl. to pump to the surface and begin production, assuming nonstop operation between unloading and producing. The deployment of the hydraulic lift Jet Pump for both stages was planned. Well data from the operator was collected, the process design was initiated, and Jet Evaluation Modeling Software (JEMS) was used to run the design models. A Proper pump size was set up based on available data to meet operator expectations. A Reverse Circulating Jet Pump (RCJP) was chosen to be installed inside a Sli...</code> | <code>This development, predominantly from four artificial islands, of a giant offshore field in the United Arab Emirates (UAE) requires lateral compartmentalization with open hole packers of the 6 5/8" horizontal lower completions with lateral lengths greater than 16,000ft and total well lengths greater than 30,000ft MD. Swell Packer technology has enabled cost effective compartmentalization in horizontal laterals and is the preferred OH packer solution for the development.<br>Deploying swell packers is regarded as being a simple solution to compartmentalizing any lateral where typically the deployment fluid differs from the fluids in which it will swell in; this application prevents the elastomer from swelling during deployment and swelling upon contact with produced or injected fluids. The use of an extended delayed oil swell packer with no delay systems in this particular application enables the packers to be deployed in a Non Aqueous Reservoir Drill in Fluid (RDFNAF) where the packer is re...</code> |
* Loss: [<code>MultipleNegativesRankingLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativesrankingloss) with these parameters:
  ```json
  {
      "scale": 20.0,
      "similarity_fct": "cos_sim",
      "gather_across_devices": false
  }
  ```
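
With these parameters, MultipleNegativesRankingLoss treats every other positive in the batch as an in-batch negative: the cosine similarities between an anchor and all positives are scaled by 20.0 and fed to a cross-entropy loss whose target for anchor *i* is positive *i*. A stripped-down sketch on toy unit vectors (ignoring the explicit `negative` column and batching details for brevity):

```python
# Illustrative sketch of MultipleNegativesRankingLoss with in-batch negatives:
# logits are scaled cosine similarities; the label for anchor i is index i.
import math

def mnrl(anchors, positives, scale=20.0):
    total = 0.0
    for i, a in enumerate(anchors):
        # Cosine similarity of anchor i against every positive in the batch
        logits = [scale * sum(x * y for x, y in zip(a, p)) for p in positives]
        # Cross-entropy with target class i (log-softmax of the correct logit)
        log_softmax = logits[i] - math.log(sum(math.exp(l) for l in logits))
        total += -log_softmax
    return total / len(anchors)

anchors   = [[1.0, 0.0], [0.0, 1.0]]
positives = [[0.8, 0.6], [0.6, 0.8]]
loss = mnrl(anchors, positives)
print(loss)  # small, since each anchor is closest to its own positive
```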
646
+
647
+ ### Evaluation Dataset
648
+
649
+ #### offshore_energy
650
+
651
+ * Dataset: [offshore_energy](https://huggingface.co/datasets/Sampath1987/offshore_energy) at [0ebbfc6](https://huggingface.co/datasets/Sampath1987/offshore_energy/tree/0ebbfc615bc7c9bbd3d58315bc2e14e91f291fa1)
652
+ * Size: 11,141 evaluation samples
653
+ * Columns: <code>anchor</code>, <code>positive</code>, and <code>negative</code>
654
+ * Approximate statistics based on the first 1000 samples:
655
+ | | anchor | positive | negative |
656
+ |:--------|:-----------------------------------------------------------------------------------|:-------------------------------------------------------------------------------------|:-------------------------------------------------------------------------------------|
657
+ | type | string | string | string |
658
+ | details | <ul><li>min: 12 tokens</li><li>mean: 24.37 tokens</li><li>max: 53 tokens</li></ul> | <ul><li>min: 38 tokens</li><li>mean: 428.35 tokens</li><li>max: 978 tokens</li></ul> | <ul><li>min: 29 tokens</li><li>mean: 405.3 tokens</li><li>max: 1111 tokens</li></ul> |
659
+ * Samples:
660
+ | anchor | positive | negative |
661
+ |:------------------------------------------------------------------------------------------------------------|:------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
662
+ | <code>How does partial jacket construction differ for vessels that cannot use staybolt construction?</code> | <code>**9-7 – 9-10** **ASME BPVC.VIII.1-2019** <br>**Figure 9-7** <br>_(2)_ Partial jackets that by virtue of their service or<br>configuration do not lend themselves to staybolt construction may be fabricated by other means providing<br>they are designed using appropriate stress values and<br>are proof tested in accordance with UG-101(p). <br>444 <br>**9-8** **FABRICATION** <br>_(a)_ Fabrication of vessels shall be in accordance with<br>applicable Parts of Subsection A and Subsection B, Part<br>UW. The requirements of UW-13(e) do not apply to closure rings.<br>_(b)_ This Appendix covers fabrication of jacketed vessels<br>by welding. Other methods of fabrication are permitted,<br>provided the requirements of applicable parts of this Di<br>vision are met. <br>_(c)_ Where only the inner vessel is subjected to lethal<br>service, the requirements of UW-2 shall apply only to<br>welds in the inner vessel and those welds attaching the<br>jacket to the inner vessel. Welds attaching the jacket to<br>the inner vessel need not be radiographed and may b...</code> | <code>**9-5 – 9-7** **ASME BPVC.VIII.1-2019** <br>‐ ‐<br>(g 5), and (g 6), may be used on any of the types of<br>jacketed vessels shown in Figure 9-2 where _t_ _rj_ does not<br>exceed [5] / 8 in. (16 mm).<br>_(7)_ Closures shown in Figure 9-5, sketch (h) used on<br>Type 3 jacketed vessels shown in Figure 9-2 shall have attachment welds in accordance with Figure 9-5, sketch <br>‐ ‐<br>(i 1) or (i 2). This construction is limited to jackets where<br>_t_ _rj_ does not exceed [5] / 8 in. (16 mm).<br>_(8)_ Closures for conical or toriconical jackets shown<br>in Figure 9-5, sketches (k) and (l) shall comply with the<br>requirements for Type 2 jacketed vessels shown in Figure<br>9-2. 
<br>_(d)_ Any radial welds in closure members shall be buttwelded joints penetrating through the full thickness of the<br>member and shall be ground flush where attachment<br>welds are to be made. <br>_(e)_ Where the inner vessel must meet the requirements<br>of UW-2, the attachment welds of the jacket to the inner<br>vessel need not be welded for their full thickness no...</code> |
663
+ | <code>What dimensions must fins and studs conform to as stipulated in Section 17.4.4?</code> | <code>**17.4 Examination of other components** <br>**17.4.1** Examination of heater steelwork shall be in accordance with the structural design code. <br>**17.4.2** Refractory linings shall be examined throughout for thickness variations during application and for cracks<br>after curing. Thickness tolerance is limited to a range of minus 6 mm (1/4 in) to plus 13 mm (1/2 in). Cracks which<br>are 3 mm (1/8 in) or greater in width and penetrate more than 50 % of the castable thickness shall be repaired.<br>Repairs shall be made by chipping out the unsound refractory to the backup layer interface or casing and<br>exposing a minimum of three tieback anchors, or to the sound metal, making a joint between sound refractory that<br>has a minimum slope of 25 mm (1 in) to the base metal (dove-tail construction) and then gunning, casting or<br>hand-packing the area to be repaired. <br>**17.4.3** Finned extended surface shall be examined to ensure fins are perpendicular to the tube within 15°. The<br>maximum discontinuity of the w...</code> | <code>**16.1** -112 STEEL ANCHORS [Sect. I8. <br>**3e.** **Detailing Requirements in Composite Components** <br>Steel anchors in composite components shall meet the following requirements: <br>(a) Minimum concrete cover to steel anchors shall be in accordance with ACI 318<br>provisions for concrete protection of headed shear stud reinforcement. <br>(b) Minimum center-to-center spacing of steel headed stud anchors shall be four<br>diameters in any direction. <br>(c) The maximum center-to-center spacing of steel headed stud anchors shall not <br>exceed 32 times the shank diameter. <br>(d) The maximum center-to-center spacing of steel channel anchors shall be 24 in.<br>(600 mm). 
<br>**User Note:** Detailing requirements provided in this section are absolute limits.<br>See Sections I8.3a, I8.3b and I8.3c for additional limitations required to preclude<br>edge and group effect considerations. <br>_Specification for Structural Steel Buildings,_ July 7, 2016<br>A MERICAN I NSTITUTE OF S TEEL C ONSTRUCTION</code> |
664
+ | <code>What are some common mistakes in oil and gas project execution that lead to financial losses?</code> | <code>Dozens of deepwater wells have been drilled in western South China Sea with about 30 percent have characteristics of high temperature and high pressure, which brought a series of difficulties and challenges to field operations. After incorporating the analysis of engineering and geological environment for deepwater HTHP wells in Lingshui block of western South China Sea, it is suggested that the solution of drilling problems for deepwater HTHP wells should start from drilling fluid. Several major technical problems are required to be addressed by drilling fluid, such as co-exist of low temperature and high temperature that lead to difficulty of drilling fluid maintenance and narrow density margin caused by deepwater and high pressure. Based on the above problems, combining with geological features of HTHP wells, researchers developed a novel water based drilling fluid system compatible with deepwater HTHP wells in Lingshui block on the basis of conventional HEM drilling fluid and furth...</code> | <code>The lack of availability of required skills and experience in most if not all parts of the oil and gas value chain is well documented so, rather than trying to make the case, we will summarise the challenge thus: the industry in all parts of the world can't find the capability it needs to safely get its work done in the timeframes it would like.<br>However or wherever the situation is measured, the consequence is that in days when the oil price might suggest that the industry has "never had it so good", many companies are falling seriously short of stakeholder expectations with projects of all types not being completed as planned or failing to deliver anticipated returns.<br>Close to home we see producers consistently missing quarterly production targets and a seemingly constant downgrading of forecasts and year-on-year plans.<br>This leads to a constant stream of bad news and criticism in the media, greater stress through all levels of management and an inevitable "knee jerk" towards a more sh...</code> |
+ * Loss: [<code>MultipleNegativesRankingLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativesrankingloss) with these parameters:
+   ```json
+   {
+       "scale": 20.0,
+       "similarity_fct": "cos_sim",
+       "gather_across_devices": false
+   }
+   ```
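+ For readers unfamiliar with this loss: each anchor is paired with its matching positive, and every other positive in the batch serves as an in-batch negative, scored by scaled cosine similarity and pushed through cross-entropy. A plain-Python sketch of that objective (an illustration of the idea, not the sentence-transformers implementation; the toy embeddings are invented):
+
+ ```python
+ import math
+
+ def cos_sim(a, b):
+     # cosine similarity between two plain-list vectors
+     dot = sum(x * y for x, y in zip(a, b))
+     na = math.sqrt(sum(x * x for x in a))
+     nb = math.sqrt(sum(x * x for x in b))
+     return dot / (na * nb)
+
+ def mnr_loss(anchors, positives, scale=20.0):
+     """Multiple-negatives ranking sketch: anchor i should rank positive i
+     above every other positive in the batch (the in-batch negatives)."""
+     losses = []
+     for i, a in enumerate(anchors):
+         logits = [scale * cos_sim(a, p) for p in positives]
+         log_z = math.log(sum(math.exp(l) for l in logits))
+         losses.append(log_z - logits[i])  # cross-entropy with target class i
+     return sum(losses) / len(losses)
+
+ # toy 2-d embeddings: anchor i is closest to positive i
+ anchors = [[1.0, 0.0], [0.0, 1.0]]
+ positives = [[0.9, 0.1], [0.1, 0.9]]
+ print(mnr_loss(anchors, positives))  # small: pairs are well aligned
+ ```
+
+ Swapping the positives (`positives[::-1]`) misaligns every pair and makes the same function return a much larger loss, which is exactly the gradient signal the trainer exploits.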
+
+ ### Training Hyperparameters
+ #### Non-Default Hyperparameters
+
+ - `eval_strategy`: steps
+ - `per_device_train_batch_size`: 16
+ - `per_device_eval_batch_size`: 16
+ - `learning_rate`: 2e-05
+ - `warmup_ratio`: 0.1
+
+ #### All Hyperparameters
+ <details><summary>Click to expand</summary>
+
+ - `overwrite_output_dir`: False
+ - `do_predict`: False
+ - `eval_strategy`: steps
+ - `prediction_loss_only`: True
+ - `per_device_train_batch_size`: 16
+ - `per_device_eval_batch_size`: 16
+ - `per_gpu_train_batch_size`: None
+ - `per_gpu_eval_batch_size`: None
+ - `gradient_accumulation_steps`: 1
+ - `eval_accumulation_steps`: None
+ - `torch_empty_cache_steps`: None
+ - `learning_rate`: 2e-05
+ - `weight_decay`: 0.0
+ - `adam_beta1`: 0.9
+ - `adam_beta2`: 0.999
+ - `adam_epsilon`: 1e-08
+ - `max_grad_norm`: 1.0
+ - `num_train_epochs`: 3
+ - `max_steps`: -1
+ - `lr_scheduler_type`: linear
+ - `lr_scheduler_kwargs`: {}
+ - `warmup_ratio`: 0.1
+ - `warmup_steps`: 0
+ - `log_level`: passive
+ - `log_level_replica`: warning
+ - `log_on_each_node`: True
+ - `logging_nan_inf_filter`: True
+ - `save_safetensors`: True
+ - `save_on_each_node`: False
+ - `save_only_model`: False
+ - `restore_callback_states_from_checkpoint`: False
+ - `no_cuda`: False
+ - `use_cpu`: False
+ - `use_mps_device`: False
+ - `seed`: 42
+ - `data_seed`: None
+ - `jit_mode_eval`: False
+ - `use_ipex`: False
+ - `bf16`: False
+ - `fp16`: False
+ - `fp16_opt_level`: O1
+ - `half_precision_backend`: auto
+ - `bf16_full_eval`: False
+ - `fp16_full_eval`: False
+ - `tf32`: None
+ - `local_rank`: 0
+ - `ddp_backend`: None
+ - `tpu_num_cores`: None
+ - `tpu_metrics_debug`: False
+ - `debug`: []
+ - `dataloader_drop_last`: False
+ - `dataloader_num_workers`: 0
+ - `dataloader_prefetch_factor`: None
+ - `past_index`: -1
+ - `disable_tqdm`: False
+ - `remove_unused_columns`: True
+ - `label_names`: None
+ - `load_best_model_at_end`: False
+ - `ignore_data_skip`: False
+ - `fsdp`: []
+ - `fsdp_min_num_params`: 0
+ - `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
+ - `fsdp_transformer_layer_cls_to_wrap`: None
+ - `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
+ - `deepspeed`: None
+ - `label_smoothing_factor`: 0.0
+ - `optim`: adamw_torch
+ - `optim_args`: None
+ - `adafactor`: False
+ - `group_by_length`: False
+ - `length_column_name`: length
+ - `ddp_find_unused_parameters`: None
+ - `ddp_bucket_cap_mb`: None
+ - `ddp_broadcast_buffers`: False
+ - `dataloader_pin_memory`: True
+ - `dataloader_persistent_workers`: False
+ - `skip_memory_metrics`: True
+ - `use_legacy_prediction_loop`: False
+ - `push_to_hub`: False
+ - `resume_from_checkpoint`: None
+ - `hub_model_id`: None
+ - `hub_strategy`: every_save
+ - `hub_private_repo`: None
+ - `hub_always_push`: False
+ - `hub_revision`: None
+ - `gradient_checkpointing`: False
+ - `gradient_checkpointing_kwargs`: None
+ - `include_inputs_for_metrics`: False
+ - `include_for_metrics`: []
+ - `eval_do_concat_batches`: True
+ - `fp16_backend`: auto
+ - `push_to_hub_model_id`: None
+ - `push_to_hub_organization`: None
+ - `mp_parameters`:
+ - `auto_find_batch_size`: False
+ - `full_determinism`: False
+ - `torchdynamo`: None
+ - `ray_scope`: last
+ - `ddp_timeout`: 1800
+ - `torch_compile`: False
+ - `torch_compile_backend`: None
+ - `torch_compile_mode`: None
+ - `include_tokens_per_second`: False
+ - `include_num_input_tokens_seen`: False
+ - `neftune_noise_alpha`: None
+ - `optim_target_modules`: None
+ - `batch_eval_metrics`: False
+ - `eval_on_start`: False
+ - `use_liger_kernel`: False
+ - `liger_kernel_config`: None
+ - `eval_use_gather_object`: False
+ - `average_tokens_across_devices`: False
+ - `prompts`: None
+ - `batch_sampler`: batch_sampler
+ - `multi_dataset_batch_sampler`: proportional
+ - `router_mapping`: {}
+ - `learning_rate_mapping`: {}
+
+ </details>
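+ The `linear` scheduler together with `warmup_ratio: 0.1` means the learning rate ramps linearly from 0 to `2e-05` over the first 10% of optimizer steps, then decays linearly back to 0. A plain-Python sketch of that schedule (the total step count below is an illustrative estimate from the logs, not a value read from the trainer):
+
+ ```python
+ def linear_schedule_lr(step, total_steps, base_lr=2e-05, warmup_ratio=0.1):
+     """Linear warmup to base_lr, then linear decay to 0 (sketch of the
+     `linear` scheduler with `warmup_ratio: 0.1`)."""
+     warmup_steps = int(total_steps * warmup_ratio)
+     if step < warmup_steps:
+         return base_lr * step / max(1, warmup_steps)
+     return base_lr * max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
+
+ total = 16_710  # roughly 3 epochs at the ~5,570 steps/epoch implied by the logs
+ print(linear_schedule_lr(0, total))      # 0.0 (start of warmup)
+ print(linear_schedule_lr(1_671, total))  # 2e-05 (peak, end of warmup)
+ print(linear_schedule_lr(total, total))  # 0.0 (fully decayed)
+ ```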
+
+ ### Training Logs
+ | Epoch | Step | Training Loss | Validation Loss | ai-job-validation_cosine_accuracy |
+ |:------:|:-----:|:-------------:|:---------------:|:---------------------------------:|
+ | 0.1795 | 1000 | - | 1.1634 | 0.6597 |
+ | 0.3590 | 2000 | - | 1.0971 | 0.6821 |
+ | 0.5385 | 3000 | - | 1.0596 | 0.7050 |
+ | 0.7180 | 4000 | - | 1.0336 | 0.7193 |
+ | 0.8975 | 5000 | 1.2066 | 1.0073 | 0.7312 |
+ | 1.0770 | 6000 | - | 1.0060 | 0.7331 |
+ | 1.2565 | 7000 | - | 0.9794 | 0.7465 |
+ | 1.4360 | 8000 | - | 0.9657 | 0.7580 |
+ | 1.6155 | 9000 | - | 0.9498 | 0.7593 |
+ | 1.7950 | 10000 | 0.935 | 0.9387 | 0.7678 |
+ | 1.9745 | 11000 | - | 0.9293 | 0.7623 |
+ | 2.1540 | 12000 | - | 0.9313 | 0.7769 |
+ | 2.3335 | 13000 | - | 0.9245 | 0.7794 |
+ | 2.5130 | 14000 | - | 0.9190 | 0.7787 |
+ | 2.6925 | 15000 | 0.7607 | 0.9139 | 0.7782 |
+ | 2.8720 | 16000 | - | 0.9094 | 0.7850 |
+
+
+ ### Framework Versions
+ - Python: 3.10.12
+ - Sentence Transformers: 5.1.0
+ - Transformers: 4.53.3
+ - PyTorch: 2.8.0+cu128
+ - Accelerate: 1.9.0
+ - Datasets: 4.0.0
+ - Tokenizers: 0.21.2
+
+ ## Citation
+
+ ### BibTeX
+
+ #### Sentence Transformers
+ ```bibtex
+ @inproceedings{reimers-2019-sentence-bert,
+     title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
+     author = "Reimers, Nils and Gurevych, Iryna",
+     booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
+     month = "11",
+     year = "2019",
+     publisher = "Association for Computational Linguistics",
+     url = "https://arxiv.org/abs/1908.10084",
+ }
+ ```
+
+ #### MultipleNegativesRankingLoss
+ ```bibtex
+ @misc{henderson2017efficient,
+     title={Efficient Natural Language Response Suggestion for Smart Reply},
+     author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
+     year={2017},
+     eprint={1705.00652},
+     archivePrefix={arXiv},
+     primaryClass={cs.CL}
+ }
+ ```
+
+ <!--
+ ## Glossary
+
+ *Clearly define terms in order to be accessible across audiences.*
+ -->
+
+ <!--
+ ## Model Card Authors
+
+ *Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
+ -->
+
+ <!--
+ ## Model Card Contact
+
+ *Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
+ -->
config.json ADDED
@@ -0,0 +1,49 @@
+ {
+   "architectures": [
+     "NewModel"
+   ],
+   "attention_probs_dropout_prob": 0.0,
+   "auto_map": {
+     "AutoConfig": "configuration.NewConfig",
+     "AutoModel": "modeling.NewModel",
+     "AutoModelForMaskedLM": "Alibaba-NLP/new-impl--modeling.NewForMaskedLM",
+     "AutoModelForMultipleChoice": "Alibaba-NLP/new-impl--modeling.NewForMultipleChoice",
+     "AutoModelForQuestionAnswering": "Alibaba-NLP/new-impl--modeling.NewForQuestionAnswering",
+     "AutoModelForSequenceClassification": "Alibaba-NLP/new-impl--modeling.NewForSequenceClassification",
+     "AutoModelForTokenClassification": "Alibaba-NLP/new-impl--modeling.NewForTokenClassification"
+   },
+   "classifier_dropout": 0.0,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 768,
+   "id2label": {
+     "0": "LABEL_0"
+   },
+   "initializer_range": 0.02,
+   "intermediate_size": 3072,
+   "label2id": {
+     "LABEL_0": 0
+   },
+   "layer_norm_eps": 1e-12,
+   "layer_norm_type": "layer_norm",
+   "logn_attention_clip1": false,
+   "logn_attention_scale": false,
+   "max_position_embeddings": 8192,
+   "model_type": "new",
+   "num_attention_heads": 12,
+   "num_hidden_layers": 12,
+   "pack_qkv": true,
+   "pad_token_id": 1,
+   "position_embedding_type": "rope",
+   "rope_scaling": {
+     "factor": 8.0,
+     "type": "ntk"
+   },
+   "rope_theta": 20000,
+   "torch_dtype": "float32",
+   "transformers_version": "4.53.3",
+   "type_vocab_size": 1,
+   "unpad_inputs": false,
+   "use_memory_efficient_attention": false,
+   "vocab_size": 250048
+ }
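+ A few quantities implied by this config can be sanity-checked with plain arithmetic (the dict below copies only the fields used; `pooling_dim` restates the `word_embedding_dimension` from `1_Pooling/config.json` above):
+
+ ```python
+ config = {
+     "hidden_size": 768,
+     "num_attention_heads": 12,
+     "intermediate_size": 3072,
+ }
+
+ # each attention head works on hidden_size / num_attention_heads dimensions
+ head_dim = config["hidden_size"] // config["num_attention_heads"]
+ # the feed-forward layer uses the classic 4x expansion
+ ffn_ratio = config["intermediate_size"] // config["hidden_size"]
+
+ pooling_dim = 768  # word_embedding_dimension from 1_Pooling/config.json
+ assert head_dim * config["num_attention_heads"] == config["hidden_size"] == pooling_dim
+
+ print(head_dim)   # 64
+ print(ffn_ratio)  # 4
+ ```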
config_sentence_transformers.json ADDED
@@ -0,0 +1,14 @@
+ {
+   "model_type": "SentenceTransformer",
+   "__version__": {
+     "sentence_transformers": "5.1.0",
+     "transformers": "4.53.3",
+     "pytorch": "2.8.0+cu128"
+   },
+   "prompts": {
+     "query": "",
+     "document": ""
+   },
+   "default_prompt_name": null,
+   "similarity_fn_name": "cosine"
+ }
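+ Both prompts here are empty strings, so queries and documents are encoded without any prefix, and `similarity_fn_name` selects cosine scoring. A plain-Python sketch of both behaviors (illustrative only, not the library code):
+
+ ```python
+ import math
+
+ prompts = {"query": "", "document": ""}  # values from the config above
+
+ def apply_prompt(text, prompt_name):
+     # sentence-transformers prepends the named prompt string to the raw text;
+     # with an empty prompt the text passes through unchanged
+     return prompts[prompt_name] + text
+
+ def cosine(a, b):
+     dot = sum(x * y for x, y in zip(a, b))
+     return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))
+
+ print(apply_prompt("find ML jobs", "query"))         # unchanged: prompt is ""
+ print(round(cosine([1.0, 2.0], [2.0, 4.0]), 6))      # 1.0 (parallel vectors)
+ ```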
configuration.py ADDED
@@ -0,0 +1,145 @@
+ # coding=utf-8
+ # Copyright 2024 The GTE Team Authors and Alibaba Group.
+ # Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
+ #
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at
+ #
+ #     http://www.apache.org/licenses/LICENSE-2.0
+ #
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
+ """ NEW model configuration"""
+ from transformers.configuration_utils import PretrainedConfig
+ from transformers.utils import logging
+ 
+ logger = logging.get_logger(__name__)
+ 
+ 
+ class NewConfig(PretrainedConfig):
+     r"""
+     This is the configuration class to store the configuration of a [`NewModel`] or a [`TFNewModel`]. It is used to
+     instantiate a NEW model according to the specified arguments, defining the model architecture. Instantiating a
+     configuration with the defaults will yield a similar configuration to that of the NEW
+     [izhx/new-base-en](https://huggingface.co/izhx/new-base-en) architecture.
+ 
+     Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
+     documentation from [`PretrainedConfig`] for more information.
+ 
+ 
+     Args:
+         vocab_size (`int`, *optional*, defaults to 30522):
+             Vocabulary size of the NEW model. Defines the number of different tokens that can be represented by the
+             `inputs_ids` passed when calling [`NewModel`] or [`TFNewModel`].
+         hidden_size (`int`, *optional*, defaults to 768):
+             Dimensionality of the encoder layers and the pooler layer.
+         num_hidden_layers (`int`, *optional*, defaults to 12):
+             Number of hidden layers in the Transformer encoder.
+         num_attention_heads (`int`, *optional*, defaults to 12):
+             Number of attention heads for each attention layer in the Transformer encoder.
+         intermediate_size (`int`, *optional*, defaults to 3072):
+             Dimensionality of the "intermediate" (often named feed-forward) layer in the Transformer encoder.
+         hidden_act (`str` or `Callable`, *optional*, defaults to `"gelu"`):
+             The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`,
+             `"relu"`, `"silu"` and `"gelu_new"` are supported.
+         hidden_dropout_prob (`float`, *optional*, defaults to 0.1):
+             The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
+         attention_probs_dropout_prob (`float`, *optional*, defaults to 0.1):
+             The dropout ratio for the attention probabilities.
+         max_position_embeddings (`int`, *optional*, defaults to 512):
+             The maximum sequence length that this model might ever be used with. Typically set this to something large
+             just in case (e.g., 512 or 1024 or 2048).
+         type_vocab_size (`int`, *optional*, defaults to 2):
+             The vocabulary size of the `token_type_ids` passed when calling [`NewModel`] or [`TFNewModel`].
+         initializer_range (`float`, *optional*, defaults to 0.02):
+             The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
+         layer_norm_eps (`float`, *optional*, defaults to 1e-12):
+             The epsilon used by the layer normalization layers.
+         position_embedding_type (`str`, *optional*, defaults to `"rope"`):
+             Type of position embedding. Choose one of `"absolute"`, `"rope"`.
+         rope_theta (`float`, *optional*, defaults to 10000.0):
+             The base period of the RoPE embeddings.
+         rope_scaling (`Dict`, *optional*):
+             Dictionary containing the scaling configuration for the RoPE embeddings. Currently supports two scaling
+             strategies: linear and dynamic. Their scaling factor must be a float greater than 1. The expected format is
+             `{"type": strategy name, "factor": scaling factor}`. When using this flag, don't update
+             `max_position_embeddings` to the expected new maximum. See the following thread for more information on how
+             these scaling strategies behave:
+             https://www.reddit.com/r/LocalLLaMA/comments/14mrgpr/dynamically_scaled_rope_further_increases/. This is an
+             experimental feature, subject to breaking API changes in future versions.
+         classifier_dropout (`float`, *optional*):
+             The dropout ratio for the classification head.
+ 
+     Examples:
+ 
+     ```python
+     >>> from transformers import NewConfig, NewModel
+ 
+     >>> # Initializing a NEW izhx/new-base-en style configuration
+     >>> configuration = NewConfig()
+ 
+     >>> # Initializing a model (with random weights) from the izhx/new-base-en style configuration
+     >>> model = NewModel(configuration)
+ 
+     >>> # Accessing the model configuration
+     >>> configuration = model.config
+     ```"""
+ 
+     model_type = "new"
+ 
+     def __init__(
+         self,
+         vocab_size=30528,
+         hidden_size=768,
+         num_hidden_layers=12,
+         num_attention_heads=12,
+         intermediate_size=3072,
+         hidden_act="gelu",
+         hidden_dropout_prob=0.1,
+         attention_probs_dropout_prob=0.0,
+         max_position_embeddings=2048,
+         type_vocab_size=1,
+         initializer_range=0.02,
+         layer_norm_type='layer_norm',
+         layer_norm_eps=1e-12,
+         # pad_token_id=0,
+         position_embedding_type="rope",
+         rope_theta=10000.0,
+         rope_scaling=None,
+         classifier_dropout=None,
+         pack_qkv=True,
+         unpad_inputs=False,
+         use_memory_efficient_attention=False,
+         logn_attention_scale=False,
+         logn_attention_clip1=False,
+         **kwargs,
+     ):
+         super().__init__(**kwargs)
+ 
+         self.vocab_size = vocab_size
+         self.hidden_size = hidden_size
+         self.num_hidden_layers = num_hidden_layers
+         self.num_attention_heads = num_attention_heads
+         self.hidden_act = hidden_act
+         self.intermediate_size = intermediate_size
+         self.hidden_dropout_prob = hidden_dropout_prob
+         self.attention_probs_dropout_prob = attention_probs_dropout_prob
+         self.max_position_embeddings = max_position_embeddings
+         self.type_vocab_size = type_vocab_size
+         self.initializer_range = initializer_range
+         self.layer_norm_type = layer_norm_type
+         self.layer_norm_eps = layer_norm_eps
+         self.position_embedding_type = position_embedding_type
+         self.rope_theta = rope_theta
+         self.rope_scaling = rope_scaling
+         self.classifier_dropout = classifier_dropout
+ 
+         self.pack_qkv = pack_qkv
+         self.unpad_inputs = unpad_inputs
+         self.use_memory_efficient_attention = use_memory_efficient_attention
+         self.logn_attention_scale = logn_attention_scale
+         self.logn_attention_clip1 = logn_attention_clip1
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:10e213517c8ea188d9928352e4f840c7dc99f2a01e0ca935229048b0d46df0e8
+ size 1221487872
modeling.py ADDED
@@ -0,0 +1,1418 @@
+ # coding=utf-8
+ # Copyright 2024 The GTE Team Authors and Alibaba Group.
+ # Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
+ #
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at
+ #
+ #     http://www.apache.org/licenses/LICENSE-2.0
+ #
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
+ """PyTorch NEW model."""
+ 
+ import math
+ from dataclasses import dataclass
+ from typing import List, Optional, Tuple, Union
+ 
+ import torch
+ import torch.utils.checkpoint
+ from torch import nn
+ 
+ from transformers.activations import ACT2FN
+ from transformers.modeling_outputs import (
+     BaseModelOutput,
+     BaseModelOutputWithPooling,
+     MaskedLMOutput,
+     MultipleChoiceModelOutput,
+     QuestionAnsweringModelOutput,
+     SequenceClassifierOutput,
+     ModelOutput,
+ )
+ from transformers.modeling_utils import PreTrainedModel
+ from transformers.utils import logging
+ 
+ try:
+     import xformers.ops as xops
+ except ImportError as e:
+     xops = None
+ 
+ from .configuration import NewConfig
+ 
+ 
+ logger = logging.get_logger(__name__)
+ 
+ 
+ # Adapted from https://github.com/HazyResearch/flash-attention/blob/main/flash_attn/bert_padding.py
+ # Which was adapted from https://github.com/mlcommons/training_results_v1.1/blob/main/NVIDIA/benchmarks/bert/implementations/pytorch/padding.py
+ class IndexFirstAxis(torch.autograd.Function):
+     @staticmethod
+     def forward(ctx, input, indices):
+         ctx.save_for_backward(indices)
+         assert input.ndim >= 2
+         ctx.first_axis_dim, other_shape = input.shape[0], input.shape[1:]
+         second_dim = other_shape.numel()
+         # TD [2022-03-04] For some reason torch.gather is a bit faster than indexing.
+         # return input[indices]
+         # return torch.gather(
+         #     rearrange(input, "b ... -> b (...)"), 0, repeat(indices, "z -> z d", d=second_dim)
+         # ).reshape(-1, *other_shape)
+         return torch.gather(
+             input.view(ctx.first_axis_dim, second_dim),
+             0,
+             indices.unsqueeze(-1).expand(indices.size(0), second_dim)
+         ).reshape(-1, *other_shape)
+ 
+     @staticmethod
+     def backward(ctx, grad_output):
+         (indices,) = ctx.saved_tensors
+         assert grad_output.ndim >= 2
+         other_shape = grad_output.shape[1:]
+         # grad_output = rearrange(grad_output, "b ... -> b (...)")
+         grad_output = grad_output.view(grad_output.size(0), other_shape.numel())
+         grad_input = torch.zeros(
+             [ctx.first_axis_dim, grad_output.shape[1]],
+             device=grad_output.device,
+             dtype=grad_output.dtype,
+         )
+         # TD [2022-03-04] For some reason torch.scatter is a bit faster than indexing.
+         # grad_input[indices] = grad_output
+         # grad_input.scatter_(0, repeat(indices, "z -> z d", d=grad_output.shape[1]), grad_output)
+         grad_input.scatter_(
+             0, indices.unsqueeze(-1).expand(indices.size(0), grad_output.size(1)), grad_output
+         )
+         return grad_input.reshape(ctx.first_axis_dim, *other_shape), None
+ 
+ 
+ index_first_axis = IndexFirstAxis.apply
+ 
+ 
+ def unpad_input(hidden_states, attention_mask=None, indices=None):
+     """
+     Arguments:
+         hidden_states: (batch, seqlen, ...)
+         attention_mask: (batch, seqlen), bool / int, 1 means valid and 0 means not valid.
+         indices: (total_nnz), the indices of non-masked tokens from the flattened input sequence.
+     Return:
+         hidden_states: (total_nnz, ...), where total_nnz = number of tokens selected in attention_mask.
+     """
+     if indices is None:
+         assert attention_mask is not None
+         indices = torch.nonzero(attention_mask.flatten(), as_tuple=False).flatten()
+ 
+     # TD [2022-03-04] We don't want to index with a bool mask, because Pytorch will expand the
+     # bool mask, then call nonzero to get the indices, then index with those. The indices are @dim
+     # times larger than they need to be, wasting memory. It's faster and more memory-efficient to
+     # index with integer indices. Moreover, torch's index is a bit slower than it needs to be,
+     # so we write custom forward and backward to make it a bit faster.
+     hidden_states = hidden_states.view(-1, *hidden_states.shape[2:])
+     return index_first_axis(hidden_states, indices)
+ 
+ 
+ class IndexPutFirstAxis(torch.autograd.Function):
+     @staticmethod
+     def forward(
+         ctx,
+         values: torch.Tensor,
+         indices: torch.Tensor,
+         first_axis_dim
+     ) -> torch.Tensor:
+         ctx.save_for_backward(indices)
+         assert indices.ndim == 1
+         assert values.ndim >= 2
+         output = torch.zeros(
+             first_axis_dim, *values.shape[1:], device=values.device, dtype=values.dtype
+         )
+         output[indices] = values
+         return output
+ 
+     @staticmethod
+     def backward(ctx, grad_output: torch.Tensor) -> Tuple[torch.Tensor, None, None]:
+         indices, = ctx.saved_tensors
+         grad_values = grad_output[indices]
+         return grad_values, None, None
+ 
+ 
+ index_put_first_axis = IndexPutFirstAxis.apply
+ 
+ 
+ def pad_input(inputs: torch.Tensor, indices: torch.Tensor, batch: int, seqlen: int) -> torch.Tensor:
+     """Add padding to sequences.
+ 
+     Arguments:
+         inputs: (total_nnz, ...), where total_nnz = number of tokens selected in attention_mask.
+         indices: (total_nnz), `indices = torch.nonzero(attention_mask.flatten(), as_tuple=False).flatten()`
+         batch: int batch_size
+         seqlen: int max sequence length
+ 
+     Returns:
+         inputs: (batch, seqlen, ...)
+     """
+     output = index_put_first_axis(inputs, indices, batch * seqlen)
+     return output.view(batch, seqlen, *inputs.shape[1:])
+ 
+ 
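+ # The gather/scatter pair above can be illustrated without tensors: unpadding
+ # keeps only the positions the attention mask marks valid, and padding scatters
+ # them back into a rectangular batch. A toy-list sketch (no autograd, not the
+ # flash-attention implementation):
+ #
+ # def unpad(tokens_2d, mask_2d):
+ #     flat_tokens = [t for row in tokens_2d for t in row]
+ #     flat_mask = [m for row in mask_2d for m in row]
+ #     indices = [i for i, m in enumerate(flat_mask) if m]
+ #     return [flat_tokens[i] for i in indices], indices
+ #
+ # def pad(values, indices, batch, seqlen, fill=0):
+ #     flat = [fill] * (batch * seqlen)
+ #     for v, i in zip(values, indices):
+ #         flat[i] = v
+ #     return [flat[b * seqlen:(b + 1) * seqlen] for b in range(batch)]
+ #
+ # tokens = [[5, 6, 0], [7, 0, 0]]
+ # mask = [[1, 1, 0], [1, 0, 0]]
+ # values, idx = unpad(tokens, mask)   # -> [5, 6, 7], indices [0, 1, 3]
+ # pad(values, idx, 2, 3)              # -> [[5, 6, 0], [7, 0, 0]] (round trip)
+ 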
+ def rotate_half(x):
+     """Rotates half the hidden dims of the input."""
+     x1 = x[..., : x.shape[-1] // 2]
+     x2 = x[..., x.shape[-1] // 2 :]
+     return torch.cat((-x2, x1), dim=-1)
+ 
+ 
+ def apply_rotary_pos_emb(q, k, cos, sin):
+     """Applies Rotary Position Embedding to the query and key tensors.
+ 
+     Args:
+         q (`torch.Tensor`): The query tensor.
+         k (`torch.Tensor`): The key tensor.
+         cos (`torch.Tensor`): The cosine part of the rotary embedding.
+         sin (`torch.Tensor`): The sine part of the rotary embedding.
+     Returns:
+         `tuple(torch.Tensor)` comprising of the query and key tensors rotated using the Rotary Position Embedding.
+     """
+     cos, sin = cos.to(q.dtype), sin.to(q.dtype)
+     q_embed = (q * cos) + (rotate_half(q) * sin)
+     k_embed = (k * cos) + (rotate_half(k) * sin)
+     return q_embed, k_embed
+ 
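+ # The formula above splits each head dimension in half and rotates the two
+ # halves as 2-D pairs, which leaves vector norms unchanged. A plain-Python
+ # sketch with one shared angle (illustrative; the real code applies
+ # per-position cos/sin caches):
+ #
+ # import math
+ #
+ # def rotate_half_list(x):
+ #     half = len(x) // 2
+ #     x1, x2 = x[:half], x[half:]
+ #     return [-v for v in x2] + x1
+ #
+ # def apply_rope(vec, theta):
+ #     c, s = math.cos(theta), math.sin(theta)
+ #     return [v * c + r * s for v, r in zip(vec, rotate_half_list(vec))]
+ #
+ # rotate_half_list([1.0, 2.0, 3.0, 4.0])  # -> [-3.0, -4.0, 1.0, 2.0]
+ # # apply_rope preserves the Euclidean norm of the input vector
+ 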
+ 
+ class RotaryEmbedding(torch.nn.Module):
+     def __init__(self, dim, max_position_embeddings=512, base=10000.0, device=None):
+         super().__init__()
+ 
+         self.dim = dim
+         self.max_position_embeddings = max_position_embeddings
+         self.base = base
+         inv_freq = 1.0 / (self.base ** (torch.arange(0, self.dim, 2).float().to(device) / self.dim))
+         self.register_buffer("inv_freq", inv_freq, persistent=False)
+ 
+         # Build here to make `torch.jit.trace` work.
+         self._set_cos_sin_cache(
+             seq_len=max_position_embeddings, device=self.inv_freq.device, dtype=torch.get_default_dtype()
+         )
+ 
+     def _set_cos_sin_cache(self, seq_len, device, dtype):
+         self.max_seq_len_cached = seq_len
+         t = torch.arange(self.max_seq_len_cached, device=device, dtype=torch.float32)
+ 
+         freqs = torch.einsum("i,j->ij", t, self.inv_freq)
+         # Different from paper, but it uses a different permutation in order to obtain the same calculation
+         emb = torch.cat((freqs, freqs), dim=-1)
+         self.register_buffer("cos_cached", emb.cos().to(dtype), persistent=False)
+         self.register_buffer("sin_cached", emb.sin().to(dtype), persistent=False)
+ 
+     def forward(self, x, seq_len=None):
+         # x: [bs, num_attention_heads, seq_len, head_size]
+         if seq_len > self.max_seq_len_cached:
+             self._set_cos_sin_cache(seq_len=seq_len, device=x.device, dtype=x.dtype)
+ 
+         return (
+             self.cos_cached[:seq_len, ...].to(dtype=x.dtype),
+             self.sin_cached[:seq_len, ...].to(dtype=x.dtype),
+         )
+ 
+ 
+ class NTKScalingRotaryEmbedding(RotaryEmbedding):
+     """RotaryEmbedding extended with fixed and mixed NTK scaling. https://kexue.fm/archives/9706 """
+ 
+     def __init__(self, dim, max_position_embeddings=512, base=10000, device=None, scaling_factor=1.0, mixed_b=None):
+         self.scaling_factor = scaling_factor
+         self.mixed_b = mixed_b
+         super().__init__(dim, max_position_embeddings, base, device)
+         max_position_embeddings = max_position_embeddings * self.scaling_factor
+         self._set_cos_sin_cache(max_position_embeddings, self.inv_freq.device, torch.get_default_dtype())
+ 
+     def _set_cos_sin_cache(self, seq_len, device, dtype):
+         self.max_seq_len_cached = seq_len
+ 
+         if seq_len > self.max_position_embeddings:
+             base = self.base * (self.scaling_factor if self.mixed_b is None else 1)
+             inv_freq = 1.0 / (base ** (torch.arange(0, self.dim, 2).float().to(device) / self.dim))
+ 
+             if self.mixed_b is None:
+                 inv_freq = inv_freq / self.scaling_factor ** (2 / self.dim)  # (6)
+             else:
+                 a = torch.tensor(self.scaling_factor).log() / (self.dim / 2) ** self.mixed_b  # (13)
+                 lambda_1_m = (a * torch.arange(1, self.dim // 2 + 1).float().to(device) ** self.mixed_b).exp()  # (12)
+                 inv_freq = inv_freq / lambda_1_m  # (10)
+ 
+             self.register_buffer("inv_freq", inv_freq, persistent=False)
+ 
+         t = torch.arange(self.max_seq_len_cached, device=device, dtype=torch.float32)
+ 
+         freqs = torch.einsum("i,j->ij", t, self.inv_freq)
+         # Different from paper, but it uses a different permutation in order to obtain the same calculation
+         emb = torch.cat((freqs, freqs), dim=-1)
+         self.register_buffer("cos_cached", emb.cos().to(dtype), persistent=False)
+         self.register_buffer("sin_cached", emb.sin().to(dtype), persistent=False)
+ 
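+ # Numerically, the fixed-NTK branch (the `# (6)` line above) raises the base
+ # and then divides by scaling_factor ** (2 / dim), which shrinks every inverse
+ # frequency so the same rotary cache stretches over longer sequences. A
+ # plain-Python sketch (the dim and factor values are illustrative):
+ #
+ # def inv_freq(dim, base):
+ #     return [base ** (-(2 * i) / dim) for i in range(dim // 2)]
+ #
+ # def ntk_scaled_inv_freq(dim, base, scaling_factor):
+ #     scaled = inv_freq(dim, base * scaling_factor)
+ #     return [f / scaling_factor ** (2 / dim) for f in scaled]
+ #
+ # plain = inv_freq(64, 10000.0)
+ # scaled = ntk_scaled_inv_freq(64, 10000.0, 8.0)
+ # all(s < p for s, p in zip(scaled, plain))  # True: every frequency shrinks
+ 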
+ 
+ class RMSNorm(nn.Module):
+     def __init__(self, hidden_size, eps=1e-6):
+         """
+         RMSNorm is equivalent to T5LayerNorm
+         """
+         super().__init__()
+         self.weight = nn.Parameter(torch.ones(hidden_size))
+         self.variance_epsilon = eps
+ 
+     def forward(self, hidden_states):
+         input_dtype = hidden_states.dtype
+         hidden_states = hidden_states.to(torch.float32)
+         variance = hidden_states.pow(2).mean(-1, keepdim=True)
+         hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
+         return self.weight * hidden_states.to(input_dtype)
+ 
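+ # As the docstring notes, this normalizes by the root mean square with no
+ # mean-centering, unlike LayerNorm. The same computation in plain Python
+ # (illustrative sketch, learned weight left as an optional list):
+ #
+ # import math
+ #
+ # def rms_norm(vec, weight=None, eps=1e-6):
+ #     rms = math.sqrt(sum(v * v for v in vec) / len(vec) + eps)
+ #     out = [v / rms for v in vec]
+ #     if weight is not None:
+ #         out = [w * v for w, v in zip(weight, out)]
+ #     return out
+ #
+ # # after normalization the output's own RMS is (up to eps) exactly 1
+ # rms_norm([3.0, 4.0])  # each element divided by sqrt(12.5 + eps)
+ 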
270
+
271
+ LAYER_NORM = {
272
+ 'layer_norm': nn.LayerNorm,
273
+ 'rms_norm': RMSNorm
274
+ }
275
+
276
+
277
+ class NewEmbeddings(nn.Module):
+     """
+     Embedding and Unpadding.
+     """
+ 
+     def __init__(self, config: NewConfig):
+         super().__init__()
+         self.padding_idx = config.pad_token_id
+         self.word_embeddings = nn.Embedding(
+             config.vocab_size, config.hidden_size, padding_idx=self.padding_idx
+         )
+ 
+         self.position_embedding_type = config.position_embedding_type
+         if self.position_embedding_type == 'absolute':
+             self.position_embeddings = nn.Embedding(
+                 config.max_position_embeddings, config.hidden_size, padding_idx=self.padding_idx
+             )
+         elif self.position_embedding_type == 'rope':
+             self._init_rope(config)
+         else:
+             raise ValueError(f"Unknown position embedding type {self.position_embedding_type}")
+ 
+         self.type_vocab_size = config.type_vocab_size
+         if self.type_vocab_size > 0:
+             self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.hidden_size)
+ 
+         # self.LayerNorm is not snake-cased to stick with the TensorFlow model variable name and be able to load
+         # any TensorFlow checkpoint file
+         self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
+         self.dropout = nn.Dropout(config.hidden_dropout_prob)
+         # position_ids is contiguous in memory and excluded when serialized
+         self.register_buffer(
+             "position_ids", torch.arange(config.max_position_embeddings), persistent=False
+         )
+ 
+     def _init_rope(self, config):
+         kwargs = dict(
+             dim=int(config.hidden_size / config.num_attention_heads),
+             max_position_embeddings=config.max_position_embeddings,
+             base=config.rope_theta
+         )
+         if config.rope_scaling is None:
+             self.rotary_emb = RotaryEmbedding(**kwargs)
+         else:
+             kwargs.update(scaling_factor=config.rope_scaling["factor"])
+             scaling_type = config.rope_scaling["type"]
+             if scaling_type == 'ntk':
+                 kwargs.update(mixed_b=config.rope_scaling.get('mixed_b', None))
+                 self.rotary_emb = NTKScalingRotaryEmbedding(**kwargs)
+             # elif scaling_type == "linear":
+             #     self.rotary_emb = LinearScalingRotaryEmbedding(**kwargs)
+             # elif scaling_type == "dynamic":
+             #     self.rotary_emb = DynamicNTKScalingRotaryEmbedding(**kwargs)
+             else:
+                 raise ValueError(f"Unknown RoPE scaling type {scaling_type}")
+ 
+     def forward(
+         self,
+         unpad_inputs: bool,
+         input_ids: Optional[torch.Tensor] = None,
+         attention_mask: Optional[torch.Tensor] = None,
+         length: Optional[List[int]] = None,
+         token_type_ids: Optional[torch.Tensor] = None,
+         position_ids: Optional[torch.Tensor] = None,
+         inputs_embeds: Optional[torch.Tensor] = None,
+     ) -> Tuple[torch.Tensor, torch.Tensor, Optional[Tuple], Optional[List[int]]]:
+         if inputs_embeds is None:
+             device, input_shape = input_ids.device, input_ids.shape
+         else:
+             device, input_shape = inputs_embeds.device, inputs_embeds.shape[:2]
+         batch_size, seq_length = input_shape
+ 
+         # Set attention_mask if it's None
+         if attention_mask is None:
+             attention_mask = torch.ones(input_shape, device=device)
+             if length is not None:
+                 for i, l in enumerate(length):
+                     attention_mask[i, l:] = 0
+ 
+         # Set attention_mask_bool for unpadding
+         if unpad_inputs:
+             attention_mask_bool = attention_mask.bool()
+             if length is None:
+                 length = attention_mask.sum(-1).tolist()
+ 
+         # Get word embeddings
+         if inputs_embeds is None:
+             if unpad_inputs:
+                 input_ids = input_ids[attention_mask_bool].unsqueeze(0)
+             inputs_embeds = self.word_embeddings(input_ids)
+         else:
+             if unpad_inputs:
+                 inputs_embeds = inputs_embeds[attention_mask_bool].unsqueeze(0)
+         embeddings = inputs_embeds
+ 
+         # Set and unpad position_ids
+         if position_ids is None:
+             if seq_length > self.position_ids.size(0):
+                 self.register_buffer(
+                     "position_ids", torch.arange(seq_length, device=embeddings.device), persistent=False
+                 )
+             if unpad_inputs:
+                 # [1, cumsum_seq_len]
+                 position_ids = torch.cat([self.position_ids[:l] for l in length]).unsqueeze(0)
+             else:
+                 # [bs, seq_len]
+                 position_ids = self.position_ids[:seq_length].expand(batch_size, -1)
+         elif unpad_inputs:
+             position_ids = position_ids[attention_mask_bool].unsqueeze(0)  # [1, cumsum_seq_len]
+ 
+         # Compute rotary embedding
+         if self.position_embedding_type == 'rope':
+             rope_cos, rope_sin = self.rotary_emb(inputs_embeds, seq_len=seq_length)
+             rope_cos = rope_cos[position_ids].unsqueeze(2)  # [bs, seq_len, 1, dim]
+             rope_sin = rope_sin[position_ids].unsqueeze(2)  # [bs, seq_len, 1, dim]
+             rope_embeds = rope_cos, rope_sin
+         else:
+             rope_embeds = None
+ 
+         if self.type_vocab_size > 0:
+             if token_type_ids is None:
+                 token_type_ids = position_ids.mul(0)
+             else:
+                 if self.type_vocab_size < 2:
+                     token_type_ids.mul_(0)
+                 if unpad_inputs:
+                     token_type_ids = token_type_ids[attention_mask_bool].unsqueeze(0)
+ 
+             token_type_embeddings = self.token_type_embeddings(token_type_ids)
+             embeddings = embeddings + token_type_embeddings
+ 
+         # BERT position
+         if self.position_embedding_type == "absolute":
+             position_embeddings = self.position_embeddings(position_ids)
+             embeddings = embeddings + position_embeddings
+ 
+         embeddings = self.LayerNorm(embeddings)
+         embeddings = self.dropout(embeddings)
+ 
+         return embeddings, attention_mask, rope_embeds, length
+ 
+ 
421
+ class NewAttention(nn.Module):
+     def __init__(self, config: NewConfig, pack_qkv=None, use_memory_efficient_attention=None):
+         super().__init__()
+         self.config = config
+         if config.hidden_size % config.num_attention_heads != 0 and not hasattr(config, "embedding_size"):
+             raise ValueError(
+                 f"The hidden size ({config.hidden_size}) is not a multiple of the number of attention "
+                 f"heads ({config.num_attention_heads})"
+             )
+ 
+         self.hidden_size = config.hidden_size
+         self.num_attention_heads = config.num_attention_heads
+         self.attention_head_size = int(config.hidden_size / config.num_attention_heads)
+         self.all_head_size = self.num_attention_heads * self.attention_head_size
+ 
+         if pack_qkv is None:
+             pack_qkv = config.pack_qkv
+         self.pack_qkv = pack_qkv
+ 
+         if self.pack_qkv:
+             self.qkv_proj = nn.Linear(config.hidden_size, self.all_head_size * 3, bias=True)
+         else:
+             self.q_proj = nn.Linear(config.hidden_size, self.all_head_size, bias=True)
+             self.k_proj = nn.Linear(config.hidden_size, self.all_head_size, bias=True)
+             self.v_proj = nn.Linear(config.hidden_size, self.all_head_size, bias=True)
+ 
+         self.dropout = nn.Dropout(config.attention_probs_dropout_prob)
+         self.o_proj = nn.Linear(config.hidden_size, config.hidden_size, bias=True)
+ 
+         if use_memory_efficient_attention is None:
+             use_memory_efficient_attention = self.config.use_memory_efficient_attention
+         self.use_memory_efficient_attention = use_memory_efficient_attention
+         self.memory_efficient_attention = None if xops is None else xops.memory_efficient_attention
+         if self.use_memory_efficient_attention:
+             assert self.memory_efficient_attention is not None, 'please install xformers'
+ 
+     def forward(
+         self,
+         hidden_states: torch.Tensor,
+         attention_bias: torch.FloatTensor,
+         rope_embeds: Optional[Tuple[torch.FloatTensor, torch.FloatTensor]] = None,
+         padding_inputs: Optional[Tuple] = None,  # indices, batch, seqlen
+         attention_scale: Optional[torch.FloatTensor] = None,
+         head_mask: Optional[torch.FloatTensor] = None,
+         output_attentions: Optional[bool] = False,
+         qkv_inputs: Optional[Tuple] = None,  # For RetroMAE
+     ) -> Tuple[torch.Tensor, ...]:
+         shape_hd = (self.num_attention_heads, self.attention_head_size)
+         # qkv
+         if self.pack_qkv and qkv_inputs is None:
+             qkv_pack = self.qkv_proj(hidden_states).split(self.all_head_size, dim=-1)
+         else:
+             if qkv_inputs is None:
+                 qkv_inputs = (hidden_states, hidden_states, hidden_states)
+             qkv_pack = [
+                 getattr(self, n + '_proj')(s) for s, n in zip(qkv_inputs, 'qkv')
+             ]
+         query_states, key_states, value_states = [t.view(t.shape[:-1] + shape_hd) for t in qkv_pack]
+ 
+         if self.config.position_embedding_type == 'rope':
+             query_states, key_states = apply_rotary_pos_emb(query_states, key_states, *rope_embeds)
+ 
+         dtype = query_states.dtype
+ 
+         if self.config.logn_attention_scale and attention_scale is not None:
+             # https://kexue.fm/archives/8823
+             query_states = query_states * attention_scale.to(dtype)
+ 
+         if padding_inputs is not None:
+             query_states = pad_input(query_states.squeeze(), *padding_inputs)
+             key_states = pad_input(key_states.squeeze(), *padding_inputs)
+             value_states = pad_input(value_states.squeeze(), *padding_inputs)
+ 
+         if self.use_memory_efficient_attention:
+             assert self.memory_efficient_attention is not None, "xformers is not loaded"
+             assert output_attentions is False, "memory_efficient_attention does not output attentions"
+             assert head_mask is None, "head_mask is not supported yet"
+             attention_probs = None
+             if torch.is_tensor(attention_bias):
+                 attention_bias = attention_bias.to(dtype)
+             context_layer = self.memory_efficient_attention(
+                 query_states,
+                 key_states,
+                 value_states,
+                 attn_bias=attention_bias,
+                 p=self.dropout.p
+             )
+         else:
+             if output_attentions and isinstance(self, NewSdpaAttention):
+                 raise RuntimeError("SDPA does not output attentions")
+             context_layer, attention_probs = self._attention(
+                 query_states, key_states, value_states, attention_bias, head_mask
+             )
+ 
+         if padding_inputs is not None:
+             context_layer = unpad_input(context_layer, indices=padding_inputs[0])
+ 
+         new_context_layer_shape = context_layer.size()[:-2] + (self.all_head_size,)
+         context_layer = context_layer.view(new_context_layer_shape)
+ 
+         # output proj
+         attn_output = self.o_proj(context_layer)
+ 
+         # add attentions if we output them
+         outputs = (attn_output, attention_probs) if output_attentions else (attn_output,)
+         return outputs
+ 
+     def _attention(self, query_states, key_states, value_states, attention_bias, head_mask):
+         """
+         Args:
+             q/k/v: (B, L, n_head, head_dim)
+         Returns:
+             attn_output: (B, L, n_head, head_dim)
+         """
+         query_states = query_states.transpose(1, 2)
+         key_states = key_states.transpose(1, 2)
+         value_states = value_states.transpose(1, 2)
+         # Take the dot product between "query" and "key" to get the raw attention scores.
+         attention_scores = torch.matmul(query_states, key_states.transpose(-1, -2))
+ 
+         attention_scores = attention_scores / math.sqrt(self.attention_head_size)
+         if attention_bias is not None:
+             # Apply the attention mask (precomputed for all layers in the model's forward() function)
+             attention_scores = attention_scores + attention_bias
+ 
+         # Normalize the attention scores to probabilities.
+         attention_probs = nn.functional.softmax(attention_scores, dim=-1)
+ 
+         # This is actually dropping out entire tokens to attend to, which might
+         # seem a bit unusual, but is taken from the original Transformer paper.
+         if self.dropout.p > 0:
+             attention_probs = self.dropout(attention_probs)
+ 
+         # Mask heads if we want to
+         if head_mask is not None:
+             attention_probs = attention_probs * head_mask
+ 
+         context_layer = torch.matmul(attention_probs, value_states)
+ 
+         context_layer = context_layer.permute(0, 2, 1, 3).contiguous()
+         return context_layer, attention_probs
+ 
+ 
564
+ class NewSdpaAttention(NewAttention):
+     """
+     New attention module using torch.nn.functional.scaled_dot_product_attention. This module inherits from
+     `NewAttention` as the weights of the module stay untouched. The only changes are on the forward pass to adapt to
+     the SDPA API.
+     """
+     def __init__(self, config: NewConfig, **kwargs):
+         super().__init__(config, **kwargs)
+         # torch.backends.cuda.enable_mem_efficient_sdp(False)
+         # logger.warning(
+         #     "Disable memory efficient attention kernel for `NewSdpaAttention`, you can set "
+         #     "`use_memory_efficient_attention=True` if it is expected to be used."
+         # )
+ 
+     def _attention(self, query_states, key_states, value_states, attention_bias, head_mask):
+         attn_output = torch.nn.functional.scaled_dot_product_attention(
+             query_states.transpose(1, 2),
+             key_states.transpose(1, 2),
+             value_states.transpose(1, 2),
+             attn_mask=attention_bias,
+             dropout_p=self.dropout.p if self.training else 0.0,
+         )
+         attn_output = attn_output.permute(0, 2, 1, 3).contiguous()
+         return attn_output, None
+ 
+ 
+ NEW_ATTENTION_CLASSES = {
+     "eager": NewAttention,
+     # "flash_attention_2": ,  # TODO
+     "sdpa": NewSdpaAttention,
+ }
+ 
+ 
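The "eager" and "sdpa" paths registered above compute the same function; SDPA only changes the kernel. A minimal sanity sketch showing that, with no mask and no dropout, the eager `softmax(QK^T / sqrt(d)) V` path matches `torch.nn.functional.scaled_dot_product_attention` (shapes here are illustrative):

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, H, L, D = 2, 4, 8, 16  # batch, heads, seq_len, head_dim
q = torch.randn(B, H, L, D)
k = torch.randn(B, H, L, D)
v = torch.randn(B, H, L, D)

# Eager path: explicit score matrix, scaling, softmax, weighted sum.
scores = torch.matmul(q, k.transpose(-1, -2)) / math.sqrt(D)
eager_out = torch.matmul(F.softmax(scores, dim=-1), v)

# Fused path: same math, dispatched to an optimized kernel.
sdpa_out = F.scaled_dot_product_attention(q, k, v, attn_mask=None, dropout_p=0.0)

outputs_match = torch.allclose(eager_out, sdpa_out, atol=1e-4)
```

The trade-off is visibility: the fused path never materializes the `L x L` probability matrix, which is exactly why `output_attentions` cannot be honored on the SDPA path above.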
597
+ class NewGatedMLP(nn.Module):
+     """
+     GLU Variants Improve Transformer.
+     """
+ 
+     def __init__(self, config: NewConfig):
+         super().__init__()
+         self.intermediate_size = config.intermediate_size
+         self.up_gate_proj = nn.Linear(config.hidden_size, self.intermediate_size * 2, bias=False)
+         self.down_proj = nn.Linear(self.intermediate_size, config.hidden_size, bias=True)
+         self.act_fn = ACT2FN[config.hidden_act]
+         if config.hidden_dropout_prob > 0:
+             self.hidden_dropout = nn.Dropout(config.hidden_dropout_prob)
+         else:
+             self.hidden_dropout = None
+ 
+     def forward(self, hidden_states):
+         up_gate = self.up_gate_proj(hidden_states)
+         up_states, gate = torch.split(up_gate, self.intermediate_size, dim=-1)
+         gate = self.act_fn(gate)
+         gated_states = gate * up_states
+         if self.hidden_dropout is not None:
+             gated_states = self.hidden_dropout(gated_states)
+         down_states = self.down_proj(gated_states)
+         return down_states
+ 
+ 
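A standalone sketch of the gated-MLP computation above: one fused projection is split into "up" and "gate" halves, the gate is passed through the activation, and the elementwise product is projected back down. Dimensions and the GELU choice are illustrative (the real activation comes from `config.hidden_act`):

```python
import torch
from torch import nn

hidden_size, intermediate_size = 32, 64
up_gate_proj = nn.Linear(hidden_size, intermediate_size * 2, bias=False)
down_proj = nn.Linear(intermediate_size, hidden_size, bias=True)
act_fn = nn.GELU()  # stand-in for ACT2FN[config.hidden_act]

x = torch.randn(3, 7, hidden_size)

# One matmul produces both branches; split recovers them in (up, gate) order.
up_states, gate = torch.split(up_gate_proj(x), intermediate_size, dim=-1)
out = down_proj(act_fn(gate) * up_states)

shape_preserved = out.shape == x.shape
```

Fusing the two input projections into a single `up_gate_proj` halves the number of GEMM launches versus separate `up_proj`/`gate_proj` layers while keeping the GLU formulation.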
624
+ class NewLayer(nn.Module):
+     def __init__(
+         self,
+         config: NewConfig,
+         pack_qkv=None,
+         use_memory_efficient_attention=None,
+         attn_implementation=None
+     ):
+         super().__init__()
+         if attn_implementation is None:
+             attn_implementation = config._attn_implementation
+         if use_memory_efficient_attention is None:
+             use_memory_efficient_attention = config.use_memory_efficient_attention
+         if use_memory_efficient_attention:
+             if attn_implementation != 'eager':
+                 logger.warning_once(f"Override {attn_implementation=} to 'eager' as {use_memory_efficient_attention=}")
+             attn_implementation = 'eager'  # Since it will be SDPA by default for torch>=2.1.1
+         self.attention = NEW_ATTENTION_CLASSES[attn_implementation](
+             config, pack_qkv=pack_qkv, use_memory_efficient_attention=use_memory_efficient_attention
+         )
+         self.mlp = NewGatedMLP(config)
+ 
+         ln_class = LAYER_NORM[config.layer_norm_type]
+         self.attn_ln = ln_class(config.hidden_size, eps=config.layer_norm_eps)
+         self.mlp_ln = ln_class(config.hidden_size, eps=config.layer_norm_eps)
+ 
+         if config.hidden_dropout_prob > 0:
+             self.hidden_dropout = nn.Dropout(config.hidden_dropout_prob)
+         else:
+             self.hidden_dropout = None
+ 
+     def forward(
+         self,
+         hidden_states: torch.Tensor,
+         attention_bias: torch.FloatTensor,
+         rope_embeds: Optional[Tuple[torch.FloatTensor, torch.FloatTensor]] = None,
+         padding_inputs: Optional[Tuple] = None,  # indices, batch, seqlen
+         attention_scale: Optional[torch.FloatTensor] = None,
+         subset_indices: Optional[torch.LongTensor] = None,
+         head_mask: Optional[torch.FloatTensor] = None,
+         output_attentions: Optional[bool] = False,
+         qkv_inputs: Optional[Tuple] = None,  # For RetroMAE
+     ) -> Tuple[torch.Tensor, ...]:
+         # Multi-head self attention
+         residual = hidden_states if qkv_inputs is None else qkv_inputs[0]
+         attention_outputs = self.attention(
+             hidden_states,
+             attention_bias,
+             rope_embeds,
+             padding_inputs,
+             attention_scale,
+             head_mask,
+             output_attentions=output_attentions,
+             qkv_inputs=qkv_inputs,
+         )
+         hidden_states = attention_outputs[0]
+         if self.hidden_dropout is not None:
+             hidden_states = self.hidden_dropout(hidden_states)
+         hidden_states = residual + hidden_states
+ 
+         # In pretraining, after the attention of the last layer, we only need the masked tokens.
+         if subset_indices is not None:
+             hidden_states = hidden_states[subset_indices]
+ 
+         hidden_states = self.attn_ln(hidden_states)
+ 
+         # Fully Connected
+         residual = hidden_states
+         hidden_states = self.mlp(hidden_states)
+         if self.hidden_dropout is not None:
+             hidden_states = self.hidden_dropout(hidden_states)
+         hidden_states = residual + hidden_states
+         hidden_states = self.mlp_ln(hidden_states)
+ 
+         # add self attentions if we output attention weights
+         outputs = (hidden_states,) + attention_outputs[1:]
+         return outputs
+ 
+ 
703
+ class NewEncoder(nn.Module):
+     def __init__(self, config):
+         super().__init__()
+         self.config = config
+         self.layer = nn.ModuleList([NewLayer(config) for _ in range(config.num_hidden_layers)])
+         self.gradient_checkpointing = False
+ 
+     def forward(
+         self,
+         hidden_states: torch.Tensor,
+         attention_bias: Optional[torch.FloatTensor] = None,
+         rope_embeds: Optional[Tuple[torch.FloatTensor, torch.FloatTensor]] = None,
+         padding_inputs: Optional[Tuple] = None,  # indices, batch, seqlen
+         attention_scale: Optional[torch.FloatTensor] = None,
+         subset_indices: Optional[torch.LongTensor] = None,
+         head_mask: Optional[torch.FloatTensor] = None,
+         output_attentions: Optional[bool] = False,
+         output_hidden_states: Optional[bool] = False,
+         return_dict: Optional[bool] = True,
+     ) -> Union[Tuple[torch.Tensor], BaseModelOutput]:
+         all_hidden_states = () if output_hidden_states else None
+         all_self_attentions = () if output_attentions else None
+ 
+         for i, layer_module in enumerate(self.layer):
+             if output_hidden_states:
+                 all_hidden_states = all_hidden_states + (hidden_states,)
+ 
+             if i >= len(self.layer) - 1:
+                 layer_subset_indices = subset_indices
+             else:
+                 layer_subset_indices = None
+ 
+             layer_head_mask = head_mask[i] if head_mask is not None else None
+ 
+             if self.gradient_checkpointing and self.training:
+                 layer_outputs = self._gradient_checkpointing_func(
+                     layer_module.__call__,
+                     hidden_states,
+                     attention_bias,
+                     rope_embeds,
+                     padding_inputs,
+                     attention_scale,
+                     layer_subset_indices,
+                     layer_head_mask,
+                 )
+             else:
+                 layer_outputs = layer_module(
+                     hidden_states,
+                     attention_bias,
+                     rope_embeds,
+                     padding_inputs,
+                     attention_scale,
+                     layer_subset_indices,
+                     layer_head_mask,
+                     output_attentions,
+                 )
+ 
+             hidden_states = layer_outputs[0]
+             if output_attentions:
+                 all_self_attentions = all_self_attentions + (layer_outputs[1],)
+ 
+         if output_hidden_states:
+             all_hidden_states = all_hidden_states + (hidden_states,)
+ 
+         if not return_dict:
+             return tuple(
+                 v
+                 for v in [
+                     hidden_states,
+                     all_hidden_states,
+                     all_self_attentions,
+                 ]
+                 if v is not None
+             )
+         return BaseModelOutput(
+             last_hidden_state=hidden_states,
+             hidden_states=all_hidden_states,
+             attentions=all_self_attentions,
+         )
+ 
+ 
784
+ # Copied from transformers.models.bert.modeling_bert.BertPooler with Bert->New
+ class NewPooler(nn.Module):
+     def __init__(self, config):
+         super().__init__()
+         self.dense = nn.Linear(config.hidden_size, config.hidden_size)
+         self.activation = nn.Tanh()
+ 
+     def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
+         # We "pool" the model by simply taking the hidden state corresponding
+         # to the first token.
+         first_token_tensor = hidden_states[:, 0]
+         pooled_output = self.dense(first_token_tensor)
+         pooled_output = self.activation(pooled_output)
+         return pooled_output
+ 
+ 
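The pooler above is plain CLS pooling: take the hidden state at position 0, then apply a dense layer and tanh. A minimal standalone sketch (sizes are illustrative):

```python
import torch
from torch import nn

torch.manual_seed(0)
hidden = torch.randn(4, 12, 16)  # (batch, seq_len, hidden_size)
dense = nn.Linear(16, 16)

# Select the first-token representation per sequence, then dense + tanh.
pooled = torch.tanh(dense(hidden[:, 0]))

shape_ok = pooled.shape == (4, 16)
bounded = pooled.abs().max().item() <= 1.0  # tanh keeps outputs in [-1, 1]
```

Because only position 0 is read, the quality of this output depends on the model having been trained to concentrate sequence-level information in the first token.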
800
+ class NewPreTrainedModel(PreTrainedModel):
+     """
+     An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained
+     models.
+     """
+ 
+     config_class = NewConfig
+     base_model_prefix = "new"
+     supports_gradient_checkpointing = True
+     _supports_sdpa = True
+ 
+     def _init_weights(self, module):
+         """Initialize the weights"""
+         if isinstance(module, nn.Linear):
+             # Slightly different from the TF version which uses truncated_normal for initialization
+             # cf https://github.com/pytorch/pytorch/pull/5617
+             module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
+             if module.bias is not None:
+                 module.bias.data.zero_()
+         elif isinstance(module, nn.Embedding):
+             module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
+             if module.padding_idx is not None:
+                 module.weight.data[module.padding_idx].zero_()
+         elif isinstance(module, nn.LayerNorm):
+             module.bias.data.zero_()
+             module.weight.data.fill_(1.0)
+ 
+ 
828
+ class NewModel(NewPreTrainedModel):
+     """
+     The bare New Model transformer outputting raw hidden-states without any specific head on top.
+     """
+ 
+     def __init__(self, config: NewConfig, add_pooling_layer=False):
+         super().__init__(config)
+         self.config = config
+ 
+         self.embeddings = NewEmbeddings(config)
+         self.encoder = NewEncoder(config)
+ 
+         self.pooler = NewPooler(config) if add_pooling_layer else None
+ 
+         # Initialize weights and apply final processing
+         self.post_init()
+ 
+     def get_input_embeddings(self):
+         return self.embeddings.word_embeddings
+ 
+     def set_input_embeddings(self, value):
+         self.embeddings.word_embeddings = value
+ 
+     def forward(
+         self,
+         input_ids: Optional[torch.Tensor] = None,
+         attention_mask: Optional[torch.Tensor] = None,
+         length: Optional[List[int]] = None,
+         subset_indices: Optional[torch.LongTensor] = None,
+         token_type_ids: Optional[torch.Tensor] = None,
+         position_ids: Optional[torch.Tensor] = None,
+         head_mask: Optional[torch.Tensor] = None,
+         inputs_embeds: Optional[torch.Tensor] = None,
+         output_attentions: Optional[bool] = None,
+         output_hidden_states: Optional[bool] = None,
+         return_dict: Optional[bool] = None,
+         unpad_inputs: Optional[bool] = None,
+     ) -> Union[Tuple[torch.Tensor], BaseModelOutputWithPooling]:
+         r"""
+         length (`list` of length `batch_size`, *optional*):
+             If `None`, return the padded `last_hidden_state`.
+         subset_indices ():
+             pass
+         unpad_inputs (`bool`, *optional*):
+             pass
+         """
+         output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
+         output_hidden_states = (
+             output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
+         )
+         return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+         unpad_inputs = unpad_inputs if unpad_inputs is not None else self.config.unpad_inputs
+         output_padded = length is None
+ 
+         if input_ids is not None and inputs_embeds is not None:
+             raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
+         elif input_ids is not None:
+             self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask)
+             input_shape = input_ids.size()
+         elif inputs_embeds is not None:
+             input_shape = inputs_embeds.size()[:-1]
+         else:
+             raise ValueError("You have to specify either input_ids or inputs_embeds")
+ 
+         # TODO: not used
+         # # Prepare head mask if needed
+         # # 1.0 in head_mask indicates we keep the head
+         # # attention_probs has shape bsz x n_heads x N x N
+         # # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]
+         # # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]
+         # head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)
+ 
+         # Get embeddings, may unpad them
+         (embedding_output, attention_mask, rope_embeds, length) = self.embeddings(
+             unpad_inputs,
+             input_ids=input_ids,
+             attention_mask=attention_mask,
+             length=length,
+             token_type_ids=token_type_ids,
+             position_ids=position_ids,
+             inputs_embeds=inputs_embeds
+         )
+ 
+         batch_size, seq_length = input_shape
+         if unpad_inputs and self.config.use_memory_efficient_attention:
+             attention_bias = xops.fmha.attn_bias.BlockDiagonalMask.from_seqlens(length)
+         else:
+             # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]
+             # ourselves in which case we just need to make it broadcastable to all heads.
+             attention_bias = self.get_extended_attention_mask(attention_mask, input_shape)
+             if self.config.use_memory_efficient_attention:
+                 # Invalid shape for attention bias: torch.Size([48, 1, 1, 512]) (expected (48, 12, 512, 512))
+                 attention_bias = attention_bias.expand(-1, self.config.num_attention_heads, seq_length, -1)
+ 
+         padding_inputs = None
+         if unpad_inputs and (output_padded or not self.config.use_memory_efficient_attention):
+             indices = torch.nonzero(attention_mask.flatten(), as_tuple=False).flatten()
+             if not self.config.use_memory_efficient_attention:
+                 padding_inputs = (indices, *input_shape)
+ 
+         attention_scale = None
+         if self.config.logn_attention_scale:
+             logger.warning_once("TODO: logn_attention_scale")
+             # # attention scale log_512(input_len)
+             # attention_scale = attention_mask.sum(1).log() / torch.tensor(self.config.max_position_embeddings).log()
+             # # inference-time logn scale needs clip 1
+             # if self.config.logn_attention_clip1:
+             #     attention_scale.clip_(1)
+             # attention_scale = attention_scale[:, None, None, None]
+         # else:
+         #     attention_scale = None
+ 
+         encoder_outputs = self.encoder(
+             embedding_output,
+             attention_bias=attention_bias,
+             rope_embeds=rope_embeds,
+             padding_inputs=padding_inputs,
+             attention_scale=attention_scale,
+             subset_indices=subset_indices,
+             head_mask=head_mask,
+             output_attentions=output_attentions,
+             output_hidden_states=output_hidden_states,
+             return_dict=return_dict,
+         )
+         sequence_output = encoder_outputs[0]
+         if unpad_inputs and output_padded:
+             sequence_output = pad_input(
+                 sequence_output.squeeze(), indices, batch_size, seq_length
+             )
+ 
+         pooled_output = self.pooler(sequence_output) if self.pooler is not None else None
+ 
+         if not return_dict:
+             return (sequence_output, pooled_output) + encoder_outputs[1:]
+ 
+         return BaseModelOutputWithPooling(
+             last_hidden_state=sequence_output,
+             pooler_output=pooled_output,
+             hidden_states=encoder_outputs.hidden_states,
+             attentions=encoder_outputs.attentions,
+         )
+ 
+ 
971
+ class NewLMPredictionHead(nn.Module):
+     def __init__(self, config):
+         super().__init__()
+         self.dense = nn.Linear(config.hidden_size, config.hidden_size)
+         self.transform_act_fn = ACT2FN[config.hidden_act]
+         self.norm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
+ 
+         # The output weights are the same as the input embeddings, but there is
+         # an output-only bias for each token.
+         self.decoder = nn.Linear(config.hidden_size, config.vocab_size)
+ 
+     def forward(self, hidden_states):
+         hidden_states = self.dense(hidden_states)
+         hidden_states = self.transform_act_fn(hidden_states)
+         hidden_states = self.norm(hidden_states)
+         hidden_states = self.decoder(hidden_states)
+         return hidden_states
+ 
+ 
990
+ class NewForMaskedLM(NewPreTrainedModel):
+     _tied_weights_keys = ["lm_head.decoder.bias", "lm_head.decoder.weight"]
+ 
+     def __init__(self, config: NewConfig):
+         super().__init__(config)
+         self.new = NewModel(config, add_pooling_layer=False)
+         self.lm_head = NewLMPredictionHead(config)
+         self.loss_fct = nn.CrossEntropyLoss()
+ 
+         # Initialize weights and apply final processing
+         self.post_init()
+ 
+     def get_output_embeddings(self):
+         return self.lm_head.decoder
+ 
+     def set_output_embeddings(self, new_embeddings):
+         self.lm_head.decoder = new_embeddings
+ 
+     def forward(
+         self,
+         input_ids: Optional[torch.Tensor] = None,
+         attention_mask: Optional[torch.Tensor] = None,
+         token_type_ids: Optional[torch.Tensor] = None,
+         position_ids: Optional[torch.Tensor] = None,
+         head_mask: Optional[torch.Tensor] = None,
+         inputs_embeds: Optional[torch.Tensor] = None,
+         labels: Optional[torch.Tensor] = None,
+         output_attentions: Optional[bool] = None,
+         output_hidden_states: Optional[bool] = None,
+         return_dict: Optional[bool] = None,
+         unpad_inputs: Optional[bool] = None,
+     ) -> Union[Tuple[torch.Tensor], MaskedLMOutput]:
+         r"""
+         labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
+             Labels for computing the masked language modeling loss. Indices should be in `[-100, 0, ...,
+             config.vocab_size]` (see `input_ids` docstring). Tokens with indices set to `-100` are ignored (masked);
+             the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`.
+         """
+ 
+         return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+ 
+         if labels is None or not self.new.config.unpad_inputs:
+             length = None
+             subset_indices = None
+         else:
+             length = attention_mask.sum(-1).tolist()
+             labels = labels[attention_mask.bool()].unsqueeze(0)
+             subset_indices = labels > -100
+ 
+         outputs = self.new(
+             input_ids,
+             attention_mask=attention_mask,
+             length=length,
+             subset_indices=subset_indices,
+             token_type_ids=token_type_ids,
+             position_ids=position_ids,
+             head_mask=head_mask,
+             inputs_embeds=inputs_embeds,
+             output_attentions=output_attentions,
+             output_hidden_states=output_hidden_states,
+             return_dict=return_dict,
+             unpad_inputs=unpad_inputs,
+         )
+ 
+         sequence_output = outputs[0]
+         prediction_scores = self.lm_head(sequence_output)
+ 
+         masked_lm_loss = None
+         if labels is not None:
+             if subset_indices is None:
+                 mask = attention_mask.bool()
+                 prediction_scores = prediction_scores[mask]
+                 labels = labels[mask]
+             else:
+                 labels = labels[subset_indices]
+             masked_lm_loss = self.loss_fct(prediction_scores, labels)
+ 
+         if not return_dict:
+             output = (prediction_scores,) + outputs[2:]
+             return ((masked_lm_loss,) + output) if masked_lm_loss is not None else output
+ 
+         return MaskedLMOutput(
+             loss=masked_lm_loss,
+             logits=prediction_scores,
+             hidden_states=outputs.hidden_states,
+             attentions=outputs.attentions,
+         )
+ 
+ 
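The `-100` convention in the labels docstring above leans on `nn.CrossEntropyLoss`, whose `ignore_index` defaults to `-100`: non-masked positions are skipped and the loss is averaged only over tokens that actually carry a label. A small illustration:

```python
import torch
from torch import nn

loss_fct = nn.CrossEntropyLoss()  # ignore_index defaults to -100

torch.manual_seed(0)
vocab_size = 10
logits = torch.randn(6, vocab_size)
labels = torch.tensor([-100, 3, -100, 7, -100, -100])

# Loss over the full sequence; -100 positions contribute nothing.
loss_all = loss_fct(logits, labels)

# Equivalent: compute the loss only on the labeled subset.
keep = labels != -100
loss_subset = loss_fct(logits[keep], labels[keep])

losses_match = torch.allclose(loss_all, loss_subset)
```

This equivalence is what lets the unpadded path above gather only the `subset_indices` positions before the loss without changing the result.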
1079
+class NewForSequenceClassification(NewPreTrainedModel):
+    def __init__(self, config):
+        super().__init__(config)
+        self.num_labels = config.num_labels
+        self.config = config
+
+        self.new = NewModel(config, add_pooling_layer=True)
+        classifier_dropout = (
+            config.classifier_dropout if config.classifier_dropout is not None else config.hidden_dropout_prob
+        )
+        self.dropout = nn.Dropout(classifier_dropout)
+        self.classifier = nn.Linear(config.hidden_size, config.num_labels)
+
+        # Initialize weights and apply final processing
+        self.post_init()
+
+    def forward(
+        self,
+        input_ids: Optional[torch.Tensor] = None,
+        attention_mask: Optional[torch.Tensor] = None,
+        token_type_ids: Optional[torch.Tensor] = None,
+        position_ids: Optional[torch.Tensor] = None,
+        head_mask: Optional[torch.Tensor] = None,
+        inputs_embeds: Optional[torch.Tensor] = None,
+        labels: Optional[torch.Tensor] = None,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+        unpad_inputs: Optional[bool] = None,
+    ) -> Union[Tuple[torch.Tensor], SequenceClassifierOutput]:
+        r"""
+        labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
+            Labels for computing the sequence classification/regression loss. Indices should be in `[0, ...,
+            config.num_labels - 1]`. If `config.num_labels == 1`, a regression loss is computed (Mean-Square loss).
+            If `config.num_labels > 1`, a classification loss is computed (Cross-Entropy).
+        """
+        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+
+        outputs = self.new(
+            input_ids,
+            attention_mask=attention_mask,
+            token_type_ids=token_type_ids,
+            position_ids=position_ids,
+            head_mask=head_mask,
+            inputs_embeds=inputs_embeds,
+            output_attentions=output_attentions,
+            output_hidden_states=output_hidden_states,
+            return_dict=return_dict,
+            unpad_inputs=unpad_inputs,
+        )
+
+        pooled_output = outputs[1]
+
+        pooled_output = self.dropout(pooled_output)
+        logits = self.classifier(pooled_output)
+
+        loss = None
+        if labels is not None:
+            if self.config.problem_type is None:
+                if self.num_labels == 1:
+                    self.config.problem_type = "regression"
+                elif self.num_labels > 1 and (labels.dtype == torch.long or labels.dtype == torch.int):
+                    self.config.problem_type = "single_label_classification"
+                else:
+                    self.config.problem_type = "multi_label_classification"
+
+            if self.config.problem_type == "regression":
+                loss_fct = nn.MSELoss()
+                if self.num_labels == 1:
+                    loss = loss_fct(logits.squeeze(), labels.squeeze())
+                else:
+                    loss = loss_fct(logits, labels)
+            elif self.config.problem_type == "single_label_classification":
+                loss_fct = nn.CrossEntropyLoss()
+                loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
+            elif self.config.problem_type == "multi_label_classification":
+                loss_fct = nn.BCEWithLogitsLoss()
+                loss = loss_fct(logits, labels)
+
+        if not return_dict:
+            output = (logits,) + outputs[2:]
+            return ((loss,) + output) if loss is not None else output
+
+        return SequenceClassifierOutput(
+            loss=loss,
+            logits=logits,
+            hidden_states=outputs.hidden_states,
+            attentions=outputs.attentions,
+        )
+
+
+class NewForMultipleChoice(NewPreTrainedModel):
+    def __init__(self, config):
+        super().__init__(config)
+
+        self.new = NewModel(config, add_pooling_layer=True)
+        classifier_dropout = (
+            config.classifier_dropout if config.classifier_dropout is not None else config.hidden_dropout_prob
+        )
+        self.dropout = nn.Dropout(classifier_dropout)
+        self.classifier = nn.Linear(config.hidden_size, 1)
+
+        # Initialize weights and apply final processing
+        self.post_init()
+
+    def forward(
+        self,
+        input_ids: Optional[torch.Tensor] = None,
+        attention_mask: Optional[torch.Tensor] = None,
+        token_type_ids: Optional[torch.Tensor] = None,
+        position_ids: Optional[torch.Tensor] = None,
+        head_mask: Optional[torch.Tensor] = None,
+        inputs_embeds: Optional[torch.Tensor] = None,
+        labels: Optional[torch.Tensor] = None,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+        unpad_inputs: Optional[bool] = None,
+    ) -> Union[Tuple[torch.Tensor], MultipleChoiceModelOutput]:
+        r"""
+        labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
+            Labels for computing the multiple choice classification loss. Indices should be in `[0, ...,
+            num_choices-1]` where `num_choices` is the size of the second dimension of the input tensors. (See
+            `input_ids` above)
+        """
+        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+        num_choices = input_ids.shape[1] if input_ids is not None else inputs_embeds.shape[1]
+
+        input_ids = input_ids.view(-1, input_ids.size(-1)) if input_ids is not None else None
+        attention_mask = attention_mask.view(-1, attention_mask.size(-1)) if attention_mask is not None else None
+        token_type_ids = token_type_ids.view(-1, token_type_ids.size(-1)) if token_type_ids is not None else None
+        position_ids = position_ids.view(-1, position_ids.size(-1)) if position_ids is not None else None
+        inputs_embeds = (
+            inputs_embeds.view(-1, inputs_embeds.size(-2), inputs_embeds.size(-1))
+            if inputs_embeds is not None
+            else None
+        )
+
+        outputs = self.new(
+            input_ids,
+            attention_mask=attention_mask,
+            token_type_ids=token_type_ids,
+            position_ids=position_ids,
+            head_mask=head_mask,
+            inputs_embeds=inputs_embeds,
+            output_attentions=output_attentions,
+            output_hidden_states=output_hidden_states,
+            return_dict=return_dict,
+            unpad_inputs=unpad_inputs,
+        )
+
+        pooled_output = outputs[1]
+
+        pooled_output = self.dropout(pooled_output)
+        logits = self.classifier(pooled_output)
+        reshaped_logits = logits.view(-1, num_choices)
+
+        loss = None
+        if labels is not None:
+            loss_fct = nn.CrossEntropyLoss()
+            loss = loss_fct(reshaped_logits, labels)
+
+        if not return_dict:
+            output = (reshaped_logits,) + outputs[2:]
+            return ((loss,) + output) if loss is not None else output
+
+        return MultipleChoiceModelOutput(
+            loss=loss,
+            logits=reshaped_logits,
+            hidden_states=outputs.hidden_states,
+            attentions=outputs.attentions,
+        )
+
+
+@dataclass
+class NewTokenClassifierOutput(ModelOutput):
+    loss: Optional[torch.FloatTensor] = None
+    logits: torch.FloatTensor = None
+    last_hidden_state: torch.FloatTensor = None
+    hidden_states: Optional[Tuple[torch.FloatTensor, ...]] = None
+    attentions: Optional[Tuple[torch.FloatTensor, ...]] = None
+
+
+class NewForTokenClassification(NewPreTrainedModel):
+    def __init__(self, config):
+        super().__init__(config)
+        self.num_labels = config.num_labels
+
+        self.new = NewModel(config, add_pooling_layer=False)
+        classifier_dropout = (
+            config.classifier_dropout if config.classifier_dropout is not None else config.hidden_dropout_prob
+        )
+        self.dropout = nn.Dropout(classifier_dropout)
+        self.classifier = nn.Linear(config.hidden_size, config.num_labels)
+
+        # Initialize weights and apply final processing
+        self.post_init()
+
+    def forward(
+        self,
+        input_ids: Optional[torch.Tensor] = None,
+        attention_mask: Optional[torch.Tensor] = None,
+        token_type_ids: Optional[torch.Tensor] = None,
+        position_ids: Optional[torch.Tensor] = None,
+        head_mask: Optional[torch.Tensor] = None,
+        inputs_embeds: Optional[torch.Tensor] = None,
+        labels: Optional[torch.Tensor] = None,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+        unpad_inputs: Optional[bool] = None,
+    ) -> Union[Tuple[torch.Tensor], NewTokenClassifierOutput]:
+        r"""
+        labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
+            Labels for computing the token classification loss. Indices should be in `[0, ..., config.num_labels - 1]`.
+        """
+        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+
+        outputs = self.new(
+            input_ids,
+            attention_mask=attention_mask,
+            token_type_ids=token_type_ids,
+            position_ids=position_ids,
+            head_mask=head_mask,
+            inputs_embeds=inputs_embeds,
+            output_attentions=output_attentions,
+            output_hidden_states=output_hidden_states,
+            return_dict=return_dict,
+            unpad_inputs=unpad_inputs,
+        )
+
+        sequence_output = outputs[0]
+
+        sequence_output = self.dropout(sequence_output)
+        logits = self.classifier(sequence_output)
+
+        loss = None
+        if labels is not None:
+            loss_fct = nn.CrossEntropyLoss()
+            loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
+
+        if not return_dict:
+            output = (logits,) + outputs[2:]
+            return ((loss,) + output) if loss is not None else output
+
+        return NewTokenClassifierOutput(
+            loss=loss,
+            logits=logits,
+            last_hidden_state=sequence_output,
+            hidden_states=outputs.hidden_states,
+            attentions=outputs.attentions,
+        )
+
+
+class NewForQuestionAnswering(NewPreTrainedModel):
+    def __init__(self, config):
+        super().__init__(config)
+        self.num_labels = config.num_labels
+
+        self.new = NewModel(config, add_pooling_layer=False)
+        self.qa_outputs = nn.Linear(config.hidden_size, config.num_labels)
+
+        # Initialize weights and apply final processing
+        self.post_init()
+
+    def forward(
+        self,
+        input_ids: Optional[torch.Tensor] = None,
+        attention_mask: Optional[torch.Tensor] = None,
+        token_type_ids: Optional[torch.Tensor] = None,
+        position_ids: Optional[torch.Tensor] = None,
+        head_mask: Optional[torch.Tensor] = None,
+        inputs_embeds: Optional[torch.Tensor] = None,
+        start_positions: Optional[torch.Tensor] = None,
+        end_positions: Optional[torch.Tensor] = None,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+        unpad_inputs: Optional[bool] = None,
+    ) -> Union[Tuple[torch.Tensor], QuestionAnsweringModelOutput]:
+        r"""
+        start_positions (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
+            Labels for the position (index) of the start of the labelled span, used for computing the token
+            classification loss. Positions are clamped to the length of the sequence (`sequence_length`).
+            Positions outside of the sequence are not taken into account for computing the loss.
+        end_positions (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
+            Labels for the position (index) of the end of the labelled span, used for computing the token
+            classification loss. Positions are clamped to the length of the sequence (`sequence_length`).
+            Positions outside of the sequence are not taken into account for computing the loss.
+        """
+        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+
+        outputs = self.new(
+            input_ids,
+            attention_mask=attention_mask,
+            token_type_ids=token_type_ids,
+            position_ids=position_ids,
+            head_mask=head_mask,
+            inputs_embeds=inputs_embeds,
+            output_attentions=output_attentions,
+            output_hidden_states=output_hidden_states,
+            return_dict=return_dict,
+            unpad_inputs=unpad_inputs,
+        )
+
+        sequence_output = outputs[0]
+
+        logits = self.qa_outputs(sequence_output)
+        start_logits, end_logits = logits.split(1, dim=-1)
+        start_logits = start_logits.squeeze(-1).contiguous()
+        end_logits = end_logits.squeeze(-1).contiguous()
+
+        total_loss = None
+        if start_positions is not None and end_positions is not None:
+            # If we are on multi-GPU, splitting adds a dimension
+            if len(start_positions.size()) > 1:
+                start_positions = start_positions.squeeze(-1)
+            if len(end_positions.size()) > 1:
+                end_positions = end_positions.squeeze(-1)
+            # Sometimes the start/end positions are outside our model inputs; we ignore these terms
+            ignored_index = start_logits.size(1)
+            start_positions = start_positions.clamp(0, ignored_index)
+            end_positions = end_positions.clamp(0, ignored_index)
+
+            loss_fct = nn.CrossEntropyLoss(ignore_index=ignored_index)
+            start_loss = loss_fct(start_logits, start_positions)
+            end_loss = loss_fct(end_logits, end_positions)
+            total_loss = (start_loss + end_loss) / 2
+
+        if not return_dict:
+            output = (start_logits, end_logits) + outputs[2:]
+            return ((total_loss,) + output) if total_loss is not None else output
+
+        return QuestionAnsweringModelOutput(
+            loss=total_loss,
+            start_logits=start_logits,
+            end_logits=end_logits,
+            hidden_states=outputs.hidden_states,
+            attentions=outputs.attentions,
+        )
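The `NewForSequenceClassification` head above infers `config.problem_type` from `num_labels` and the label dtype before choosing MSE, cross-entropy, or BCE-with-logits. That dispatch can be summarized as a small sketch; the helper name `infer_problem_type` is illustrative, not part of the model code:

```python
# Minimal sketch of the problem_type dispatch in NewForSequenceClassification.forward
# when config.problem_type is unset. "labels_are_int" stands in for the
# labels.dtype == torch.long / torch.int check in the real code.
def infer_problem_type(num_labels: int, labels_are_int: bool) -> str:
    if num_labels == 1:
        return "regression"  # -> nn.MSELoss
    if labels_are_int:
        return "single_label_classification"  # -> nn.CrossEntropyLoss
    return "multi_label_classification"  # -> nn.BCEWithLogitsLoss

print(infer_problem_type(1, True))   # regression
print(infer_problem_type(3, True))   # single_label_classification
print(infer_problem_type(3, False))  # multi_label_classification
```

Once set, the real head caches the result on `self.config.problem_type`, so the inference only happens on the first labelled batch.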
modules.json ADDED
@@ -0,0 +1,20 @@
+[
+  {
+    "idx": 0,
+    "name": "0",
+    "path": "",
+    "type": "sentence_transformers.models.Transformer"
+  },
+  {
+    "idx": 1,
+    "name": "1",
+    "path": "1_Pooling",
+    "type": "sentence_transformers.models.Pooling"
+  },
+  {
+    "idx": 2,
+    "name": "2",
+    "path": "2_Normalize",
+    "type": "sentence_transformers.models.Normalize"
+  }
+]
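This `modules.json` declares the SentenceTransformer pipeline for the repo: Transformer, then Pooling (configured for CLS pooling in `1_Pooling/config.json`), then Normalize. A plain-Python sketch of what the last two stages do to the transformer's token embeddings (toy 4-dimensional vectors; the real `word_embedding_dimension` is 768):

```python
import math

# Illustrative sketch of the Pooling (CLS mode) and Normalize modules
# from this repo's modules.json; not the sentence-transformers implementation.
def cls_pool(token_embeddings):
    # CLS pooling: the first token's vector becomes the sentence embedding.
    return token_embeddings[0]

def l2_normalize(vec):
    # Scale to unit L2 norm, as the Normalize module does.
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

# Toy "hidden states" for a 3-token sequence.
hidden = [[3.0, 0.0, 4.0, 0.0], [1.0, 1.0, 1.0, 1.0], [0.0, 2.0, 0.0, 0.0]]
embedding = l2_normalize(cls_pool(hidden))
print(embedding)  # [0.6, 0.0, 0.8, 0.0]
```

Because the final module normalizes, dot product and cosine similarity coincide for embeddings produced by this pipeline.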
sentence_bert_config.json ADDED
@@ -0,0 +1,4 @@
+{
+  "max_seq_length": 8192,
+  "do_lower_case": false
+}
special_tokens_map.json ADDED
@@ -0,0 +1,51 @@
+{
+  "bos_token": {
+    "content": "<s>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "cls_token": {
+    "content": "<s>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "eos_token": {
+    "content": "</s>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "mask_token": {
+    "content": "<mask>",
+    "lstrip": true,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "pad_token": {
+    "content": "<pad>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "sep_token": {
+    "content": "</s>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "unk_token": {
+    "content": "<unk>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  }
+}
tokenizer.json ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:aa7a6ad87a7ce8fe196787355f6af7d03aee94d19c54a5eb1392ed18c8ef451a
+size 17082988
tokenizer_config.json ADDED
@@ -0,0 +1,55 @@
+{
+  "added_tokens_decoder": {
+    "0": {
+      "content": "<s>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "1": {
+      "content": "<pad>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "2": {
+      "content": "</s>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "3": {
+      "content": "<unk>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "250001": {
+      "content": "<mask>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    }
+  },
+  "bos_token": "<s>",
+  "clean_up_tokenization_spaces": true,
+  "cls_token": "<s>",
+  "eos_token": "</s>",
+  "extra_special_tokens": {},
+  "mask_token": "<mask>",
+  "model_max_length": 8192,
+  "pad_token": "<pad>",
+  "sep_token": "</s>",
+  "tokenizer_class": "XLMRobertaTokenizerFast",
+  "unk_token": "<unk>"
+}
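The `added_tokens_decoder` in this `tokenizer_config.json` pins the XLM-RoBERTa special tokens to fixed ids (`<s>`=0, `<pad>`=1, `</s>`=2, `<unk>`=3, `<mask>`=250001). A small sketch of that id-to-token mapping; the `decode_specials` helper is illustrative, not a tokenizer API:

```python
# The special-token id mapping declared by added_tokens_decoder above.
SPECIAL_TOKENS = {0: "<s>", 1: "<pad>", 2: "</s>", 3: "<unk>", 250001: "<mask>"}

def decode_specials(ids):
    # Replace known special-token ids with their string form;
    # other ids would be resolved by the real tokenizer's vocabulary.
    return [SPECIAL_TOKENS.get(i, f"id:{i}") for i in ids]

print(decode_specials([0, 42, 250001, 2]))  # ['<s>', 'id:42', '<mask>', '</s>']
```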