Sampath1987 committed on
Commit 97498d5 · verified · 1 Parent(s): 824f616

fine-tuned mpnet-energy, 1 epoch

.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
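The attribute lines above route matching files through Git LFS instead of storing them directly in the repository. As a rough illustration of which paths the new entry captures — using Python's `fnmatch` as an approximation of gitattributes glob matching, not an exact replica of its semantics:

```python
from fnmatch import fnmatch

# Patterns taken from the .gitattributes hunk above; a path matching any of
# them is stored via Git LFS. fnmatch only approximates gitattributes rules.
LFS_PATTERNS = ["*.zip", "*.zst", "*tfevents*", "tokenizer.json"]

def routed_to_lfs(path: str) -> bool:
    """Illustrative check: would this path be handled by the LFS filter?"""
    return any(fnmatch(path, pattern) for pattern in LFS_PATTERNS)
```

For example, `routed_to_lfs("tokenizer.json")` and `routed_to_lfs("events.out.tfevents.123")` both match, while a small text file such as `config.json` stays in regular Git storage.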
1_Pooling/config.json ADDED
@@ -0,0 +1,10 @@
+ {
+   "word_embedding_dimension": 768,
+   "pooling_mode_cls_token": true,
+   "pooling_mode_mean_tokens": false,
+   "pooling_mode_max_tokens": false,
+   "pooling_mode_mean_sqrt_len_tokens": false,
+   "pooling_mode_weightedmean_tokens": false,
+   "pooling_mode_lasttoken": false,
+   "include_prompt": true
+ }
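This config tells sentence-transformers how to collapse per-token vectors into a single sentence embedding: with `pooling_mode_cls_token: true` and every other mode false, the 768-dimensional vector of the first (CLS) token is used. A minimal NumPy sketch of that selection logic — illustrative only, not the library's actual implementation:

```python
import numpy as np

def pool(token_embeddings: np.ndarray, attention_mask: np.ndarray, cfg: dict) -> np.ndarray:
    """Collapse (batch, seq, dim) token vectors into (batch, dim) sentence vectors."""
    if cfg.get("pooling_mode_cls_token"):
        # CLS pooling: the first token's vector represents the whole sentence.
        return token_embeddings[:, 0]
    if cfg.get("pooling_mode_mean_tokens"):
        # Mean pooling: average over non-padding tokens only.
        mask = attention_mask[:, :, None].astype(token_embeddings.dtype)
        return (token_embeddings * mask).sum(axis=1) / mask.sum(axis=1)
    raise ValueError("no supported pooling mode enabled")

cfg = {"pooling_mode_cls_token": True}   # mirrors the config above
tokens = np.random.rand(2, 5, 768)       # (batch, seq, word_embedding_dimension)
mask = np.ones((2, 5), dtype=np.int64)
sentence_vecs = pool(tokens, mask, cfg)  # shape (2, 768); each row is tokens[i, 0]
```

The design choice encoded here is that exactly one mode flag should be true; the library combines the enabled modes, but with this config only the CLS vector contributes.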
README.md ADDED
@@ -0,0 +1,863 @@
+ ---
+ tags:
+ - sentence-transformers
+ - sentence-similarity
+ - feature-extraction
+ - dense
+ - generated_from_trainer
+ - dataset_size:53913
+ - loss:MultipleNegativesRankingLoss
+ base_model: Alibaba-NLP/gte-multilingual-base
+ widget:
+ - source_sentence: How does the monitoring system for well integrity function after
+ CO2 injection?
+ sentences:
+ - 'Drilling is a complex process and delivering a successful well requires identifying
+ proper technologies and utilizing them efficiently to save time & cost. Today
+ in the Oil & Gas industry there is a huge focus on digital technologies to improve
+ Drilling Process efficiency, and PDO decided to implement an innovative approach
+ to process optimization through a unique project, "electronically Delivering
+ the Limit (eDtL)".
+
+ The overall approach of the eDtL project was to implement a platform which can provide
+ the Drilling Operations team with the technical limit for all Drilling Activities, which
+ is the theoretical minimum time required to perform an activity, based on available
+ knowledge and technology.
+
+ The eDtL system utilizes rig sensor data transmitted in Real-Time from Drilling Rigs
+ to automatically detect the Rig Activity, identify areas of
+ Drilling Performance Improvement, and minimize redundant tasks for rig and office
+ teams. The identified opportunities are communicated to the rig team for implementation
+ and the performance is tracked again to highlight the improvements.
+
+ The eDtL system also provides the capability for continuous improvement of organizational
+ processes by introducing automation of redundant tasks. One such improvement
+ was partial automation of the Daily Drilling Report, which was historically recorded
+ manually by the rig team every day.'
+ - 'ADNOC has embarked on a major Carbon Capture and Storage (CCS) project where
+ large quantities of CO2 are injected into deep saline aquifers for permanent storage
+ instead of being released into the atmosphere.
+
+ An advanced chemical tracer technology was deployed in the first CCS project in
+ the UAE for continuous CO2 monitoring to ensure permanent and safe CO2 storage.
+ In case of a containment breach, the chemical tracer technology can confirm the
+ leakage and identify its source.
+
+ After CO2 injection for permanent storage, any containment breach would be
+ detected in the shallow soil monitoring boreholes. A few soil monitoring boreholes
+ were excavated across the field, in which Capillary Adsorption Tubes (CAT) were
+ inserted for some time and replaced by another according to the sampling frequency
+ plan. The tube is sent to the lab for CO2 leak detection and reporting. The high
+ detection resolution is on the order of 0.1 parts per trillion (ppt). This has
+ a positive impact on the system economics because smaller quantities of chemical
+ tracer material are required.
+
+ The tracer injection monitoring system is ongoing in the first CO2 storage area
+ of Abu Dhabi. The monitoring includes soil monitoring via shallow boreholes.
+ The soil monitoring boreholes were excavated close to the CO2 injection well to
+ ensure that no well integrity issues develop due to thermal effects
+ of CO2 injection. The soil monitoring boreholes are to be verified by surface gas
+ CO2 monitors. Soil monitors were located around the radial storage area to detect
+ CO2 leakage and to understand CO2 migration to the soil through the cap rock (in
+ case of leakage). The monitoring system for caprock and well integrity will provide:
+ surface soil monitoring for cap rock integrity, integrity confirmation for legacy
+ wells, integrity confirmation of the injection well in the post-injection monitoring
+ period, leakage quantification, and leakage origin if there are multiple injectors. The monitoring
+ system can continue for up to 30 years of the operational period as well as the
+ full post-injection monitoring, measurement and verification horizon.
+
+ This paper presents a description of a sophisticated CO2 monitoring technology
+ that is being deployed in UAE''s first CCS project. CO2 tracer technology is considered
+ one of the most accurate methods to detect CO2 leakage at the surface. Its high-detection
+ resolution allows early leakage identification and early mitigation action. In
+ addition, it proves to be relatively low cost, operationally easy to execute,
+ and requires a small operational footprint.'
+ - 'Carbon Capture and Storage, as a solution to mitigate the increase in greenhouse
+ gas emissions in the atmosphere, is still generating intensive worldwide R&D activity.
+ In particular, significant acceleration of in situ CCS experiments supports technical
+ developments as well as acceptability of this technology. Among the major risks
+ identified for this technology, wells are often considered to be the weakest spots
+ with respect to CO2 confinement in the geological reservoir. Therefore, long-term
+ well integrity performance assessment is one of the critical steps that must be
+ addressed before large scale CCS technology deployment is accepted as a safe solution
+ to reduce CO2 emissions.
+
+ A risk-based methodology associated with well integrity is proposed within CO2
+ geological storage. The main objectives of this approach are to identify and quantify
+ risks associated with CO2 leakages along wells over time (from tens to thousands
+ of years), to evaluate risks and to propose relevant actions to reduce unacceptable
+ risks. The methodological framework emphasized the use of the risk concept as
+ a relevant criterion to (i) evaluate the overall performance of well confinement
+ with respect to different stakes, (ii) include different levels of uncertainty
+ associated with the studied system, and (iii) provide reliable decision-making
+ support. For the quantification of risk, a coupled CO2 flow model (gas flow and
+ degradation processes) was used to identify possible leakage pathways along the
+ wellbore and quantify possible CO2 leakage towards sensitive targets (surface,
+ fresh water, any aquifers…) for different scenarios. This approach offers an operational
+ response to some of the challenges inherent to well integrity management over
+ the well lifecycle.
+
+ This paper focuses on the application of the methodology to a synthetic case based
+ on an existing well. The practical outcomes and the added values will be presented:
+ (i) an objective and structured process, (ii) scenario identification and quantification
+ of CO2 migration along the wellbore for each scenario, (iii) risk mapping,
+ and (iv) operational action plans for risk treatment of well integrity.'
+ - source_sentence: How did the detailed pre-survey planning impact the success of
+ the offshore seismic acquisition campaign?
+ sentences:
+ - 'We showcase an innovative campaigning and business-focused approach to reservoir
+ monitoring of multiple fields using 4D (time-lapse) seismic. Benefits obtained
+ in terms of cost, speed and the quality of insights gained are discussed, in comparison
+ with a piecemeal approach. Challenges and lessons learned are described, with
+ a view to this approach becoming more widely adopted and allowing 4D monitoring
+ to be extended to smaller or more marginal fields.
+
+ An offshore seismic acquisition campaign was planned and successfully executed
+ for a sequence of four 4D monitor surveys for fields located within 250 km of
+ each other on the greater Northwest Shelf of Australia. The four monitors were
+ acquired in H1 2020 comprising (in this order): Pluto Gas Field M2 (second monitor),
+ Brunello Gas Field M1 (first monitor), Laverda Oil Field M1 and Cimatti Oil Field
+ M1.
+
+ Cost savings expected from campaigning were realised, despite three cyclones during
+ operations, with success largely attributed to detailed pre-survey planning. Also
+ important were the choice of vessel and planning for operational flexibility.
+ The baseline surveys were diverse and required careful planning to achieve repeatability
+ between vintages over each field, and to optimise the acquisition sequence – minimising
+ time required to reconfigure the streamer spreads between surveys. The Cimatti
+ baseline survey was acquired using a dual-vessel operation; modelling, combined
+ with now-standard steerable streamers, showed a single-vessel monitor survey was
+ feasible. These optimisations provided cost savings incremental to the principal
+ economy of sharing vessel mobilisation costs across the whole campaign.
+
+ Both processing and evaluation (ongoing at the time of writing) are essentially
+ separate per field, but follow a consistent approach. Processing is carried out
+ by more than one contractor to debottleneck this phase, with products, including
+ intermediate quality control (QC) volumes, delivered as pre-stack depth migrations.
+ While full evaluation of the monitor surveys to static and dynamic reservoir model
+ updates will continue beyond 2020, key initial reservoir insights are expected
+ to emerge within days of processing completion, with some even earlier from QC
+ volumes. Furthermore, concurrent 4D evaluations are expected to result in fruitful
+ exchanges of ideas and technologies between fields.'
+ - 'Advances in seismic acquisition, processing, computing hardware and theory continue
+ to enhance seismic-image quality. However, an investment decision on seismic projects
+ should be based not only on technical criteria but on a quantifiable expected value
+ above all currently available field data including well information. This presentation
+ will include a case history of a major carbonate oil field demonstrating how this
+ value was estimated before a major reprocessing project and how this value is
+ being achieved.
+
+ This field contains over 1000 wellbore penetrations. A 3D seismic survey was acquired
+ over the field during 2001–2002, but the reservoir development team believed that
+ these data to date had added limited value. The motivation for evaluating the
+ potential for further investment in seismic data was a multi-billion-dollar field-redevelopment
+ plan.
+
+ The Value of Information (VOI) exercise to justify a seismic project began with
+ an evaluation of technical issues that limited the use of existing seismic data.
+ Through a targeted fast-track reprocessing effort it was determined that the existing
+ survey had been designed and acquired adequately, and that the deficiencies in
+ the dataset at the reservoir level are primarily caused by near-surface and overburden
+ effects. The first-order impact is that mapped seismic surfaces exhibit a "roughness"
+ primarily from the overlying "non-geologic" noise. There was concern that many
+ subtle faults interpreted at the reservoir level could be "non-geologic" artifacts,
+ which resulted in reluctance to incorporate these into the reservoir model. Amplitude
+ balancing issues in the original data precluded quantitative assessments such
+ as porosity prediction. The targeted reprocessing also verified that existing
+ algorithms and traditional workflows alone were insufficient to resolve the technical
+ issues.
+
+ Working with the reservoir development team, the key business drivers for reprocessing
+ were identified as follows:
+
+ Increase individual well productivity and recovery
+
+ Image and define new opportunities in current poor-data areas
+
+ Save on well cost by preventing re-drills
+
+ Improve the overall field development plan
+
+ Specific expected value metrics and risks were assigned to the above objectives
+ and a VOI assessment was completed. It was estimated that successfully achieving
+ the above business objectives would result in a potential value at least 15 times
+ the cost of the reprocessing. This resulted in management approval of the full
+ field reprocessing.
+
+ Following completion of the seismic reprocessing, the project team objectively
+ assessed whether the technical criteria had been achieved and if the business
+ criteria will be achieved. In both cases the team determined that the value metrics
+ will be met. The reprocessing has impacted drill-wells as well as field development
+ planning. In addition, the reprocessed seismic data will produce additional potential
+ value as a result of opportunities not recognized at the start of the project.'
+ - 'Sabiriyah Mauddud is a giant reservoir in NK under active water flood with about
+ 200 producers and 32 injectors. The reservoir has no or insignificant aquifer
+ energy support and has been on production since the 1960s. Water flood started in
+ the year 1997, initially with a pilot and later on expanded to the full field
+ in a phased manner. Initial development was on a pattern flood concept with all
+ vertical injectors & producers, which has now been replaced with the Produce High-Inject
+ Low (PHIL) concept using horizontal wells. In light of the significance of this
+ reservoir for Kuwait''s production, all efforts are made to optimize the performance
+ of this reservoir. To achieve this objective, pressure monitoring & performance
+ analysis is considered to be the backbone of all production as well as injection
+ activities.
+
+ This paper presents the methodology conceived and implemented to assess the reservoir
+ pressure performance and estimate the current reservoir pressure in different
+ segments/blocks in an innovative way so as to maximize the value of "Water flooding"
+ in the North Kuwait area along with meeting production aspirations using ESP
+ as the artificial lift system in an optimized manner. Except for the RFT data in newly
+ drilled wells, the availability of pressure data was limited in the recent past,
+ making it necessary to integrate all the available information so as to build
+ a powerful tool to be used for water flood monitoring. All available information
+ – Repeat Formation Tester "RFT", Static Bottom Hole Pressure "SBHP" and Pump intake
+ pressure "PIP" under dynamic and static conditions – was collected & analyzed.
+ An initial study related to compartmentalization showed two main areas, north
+ and south, based on comprehensive analysis of all the pressure points. The analysis
+ also helped in identifying areas with good vertical connectivity and understanding
+ segments with vertical barriers matching the geological description. In order
+ to have the latest pressure mapping, data were combined to obtain an integrated
+ image of the pressure distribution across the reservoir. During the exercise,
+ "gaps" were identified which were filled in by live intake pressure data as
+ well as shut-in data to have a meaningful mimic of the reservoir pressure to help
+ the ongoing production as well as injection activities.
+
+ Based on the innovative approach above, a surveillance plan has been made to
+ further enhance the quality of the mapping. Several maps such as an opportunity map,
+ a PVT properties map, and layer-wise pressure maps have been generated for ready-to-use
+ information to facilitate daily operations.
+
+ The objective of the paper is to share the innovative, simple, smart and very
+ useful approach adopted by North Kuwait to manage the giant Mauddud reservoir.'
+ - source_sentence: What are the primary recovery techniques used in oil and gas extraction?
+ sentences:
+ - The extraction of oil and gas involves various techniques to enhance recovery
+ rates. Primary recovery relies on the natural pressure of the reservoir, while
+ secondary recovery techniques such as water flooding and gas injection are employed
+ to increase output after primary methods become inefficient. Tertiary recovery
+ methods, also known as enhanced oil recovery (EOR), use thermal, gas, or chemical
+ injection to further improve extraction rates. Each method comes with its own
+ cost implications and efficiency rates, which can significantly affect the overall
+ economics of an oilfield development project.
+ - 'In this paper, one of the areas of conflict observed with the performance of
+ horizontal well standoff with respect to the development of thin oil rim reservoirs
+ is examined.
+
+ In a technical paper that was part of a critical review of the literature on the exploitation
+ of thin oil rim reservoirs with a large gas cap and aquifer, this author had highlighted
+ the problem. As part of sensitivities on horizontal well standoff, Cosmos and Fatoke
+ (2004) tested three positions: one-third, centre and two-third positions from
+ the GOC in a Niger Delta field. They concluded that the landing closest to the
+ GOC (one-third position) yielded the lowest oil compared to the centre and two-third
+ positions. Surprisingly, the work done by Sai Garimella et al (2011) in 60 ft
+ Ghariff & Al Khlata shallow marine low-permeability sandstone reservoirs in a
+ field in Oman showed a different result, with the one-third position indicating
+ an optimum recovery from a horizontal well. Interestingly, both authors'' positions
+ on the performance had support from other authors.
+
+ This study used a 3D reservoir model, investigated different horizontal well standoff
+ performances and applied permeability reduction to simulate different reservoir
+ quality. The objective was to see if the reservoir quality was a factor in the
+ different horizontal well standoff performance seen in different regions of
+ the world, while noting their different depositional environments. Results from
+ the investigation are presented in this paper and show a different trend from
+ both authors mentioned above.'
+ - The oil extraction process typically involves drilling a well into the earth's
+ crust where oil deposits are located. The well is often lined with casing to prevent
+ collapse and water intrusion. Once the well is drilled, various techniques such
+ as primary recovery, secondary recovery, and tertiary recovery can be employed.
+ Primary recovery uses natural reservoir pressure to extract oil, while secondary
+ recovery employs water or gas injection to maintain pressure. Tertiary recovery,
+ also known as enhanced oil recovery, uses techniques like thermal injection or
+ chemicals to further reduce the viscosity of oil and increase extraction rates.
+ Each of these methods has distinct implications on the yield and economic viability
+ of oil extraction operations.
+ - source_sentence: What advantages do helicopters have over fixed-wing aircraft for
+ leak detection surveys?
+ sentences:
+ - The reservoir characteristics such as porosity and permeability are crucial for
+ evaluating the potential of oil and gas fields. Porosity refers to the void spaces
+ within rocks that can hold hydrocarbons, while permeability measures how easily
+ fluids can flow through rock formations. These two properties significantly influence
+ the extraction methods used and the overall productivity of a reservoir. Enhancing
+ permeability through hydraulic fracturing has become a common technique in unconventional
+ resource extraction, allowing for more efficient recovery of oil and gas from
+ low-permeability reservoirs.
+ - 'BP gas production operations in North America manages over 15,000 miles of onshore
+ pipelines that make up our vast, complex, and aging gas gathering networks. Surveying
+ these for leaks presents a huge resource challenge using current ground-based
+ technology and, in turn, impacts the assurance of the safety and integrity of
+ these operations.
+
+ The Exploration and Production Technology Group evaluated new leak detection technologies
+ using a laser, a thermal imaging camera and a high-speed gas sampling detector that
+ were deployed on aircraft and used global positioning system coordinates to survey
+ gas gathering pipelines. Field trials on gas gathering systems in the North Texas,
+ Anadarko asset showed that the laser and gas sampling based leak detection systems
+ were the most accurate, but the video imaging from the thermal camera made a powerful
+ statement. Helicopters proved to be more suitable for leak detection surveys on
+ gas gathering pipelines than fixed-wing aircraft.
+
+ The aerial leak detection technologies produce a significant increase in efficiency
+ and productivity in managing the integrity of BP''s gas gathering systems. While
+ that improves business performance, perhaps more important is the fact that
+ small gas leaks can be easily found before they become big ones. That reduces
+ environmental damage and the potential for leaks to impact the public. The development
+ and implementation of aerial leak detection in BP is being recognized as an integrity
+ tool providing significantly improved integrity assurance to its gas gathering
+ operations.'
+ - 'One prerequisite of any detection system is the risk
+ analysis that estimates mainly the safety and environmental impacts of a loss
+ of containment. From this prerequisite it is possible to consider a strategy for
+ early detection of a loss of containment, and to choose a method or a technology.
+ Methods of detection belong to two main families:
+
+ External based Leak Detection Systems, which use local leak sensors to generate
+ a leak alarm. The main External based Leak Detection Systems are acoustic emission
+ detectors, pressure detectors, fiber optic cable, and vapor and/or liquid sensing
+ cables;
+
+ Internal based Leak Detection Systems, which use normal field sensors (e.g. pressure
+ transmitters, flowmeters) for leak detection and leak localization. The main Internal
+ based Leak Detection Systems are:
+
+ balancing systems (line balance, volume balance, compensated mass balance etc.);
+
+ Real Time Transient Model;
+
+ pressure/flow monitoring;
+
+ statistical analysis…
+
+ The main External based Leak Detection Systems were studied internally through
+ different evaluation and development programs, and some of them in operation.
+
+ The main findings were the following:
+
+ The acoustic based detection is sensitive to external noise as well as some pipeline
+ fluids (multi-phase, critical flow, transit phase) and pipeline elements (e.g.
+ elbows, valves). This technology requires the management of a high quantity of data,
+ a significant tuning period, and many sensors connected to the pipeline. Distributed
+ Acoustic Sensing (DAS) using the fiber optic cable media is currently used internally
+ to detect real time intrusion.
+
+ The pressure emission detectors may be insensitive and require accurate pressure
+ measurement. This technology is difficult to apply in practice on short lines and
+ on gas or multi-phase pipelines with transient phases.
+
+ The vapor/liquid sensing cable technology needs to be physically close to the
+ pipe to become wet in case of leakage. These sensitive cables should be replaced
+ or cleaned after a leak. This technology is not easily suitable for long-distance
+ application. Its retrievable capability, with the implementation of a pulling chamber
+ every few hundred meters, needs to be carefully considered. In addition, this technology
+ is highly sensitive. This implies that false alarms may occur in case of prior
+ contamination (presence of hydrocarbon). This technology is also sensitive to
+ soil disruption and fluid properties, and is affected by ageing (sensitive
+ polymer alteration). However, this technology is suitable for short distances and
+ for some leak detection when there is no temperature variation between the fluid
+ and the soil.
+
+ The fiber optic solution was strongly considered for leak detection through several
+ evaluation programs and, in particular, two PIT (Projet d’Innovation Technologique)
+ projects. These two PIT projects were performed between 2015 and 2019 and presented
+ at the following ADIPEC sessions:
+
+ (Baque, 2017) 2017 Abu Dhabi International Petroleum Exhibition & Conference SPE-188669-MS
+ Early Gas Detection
+
+ (Baque, 2020) 2020 Abu Dhabi International Petroleum Exhibition & Conference SPE-203293-MS
+ Fiber Optic Liquid Leakage Detection
+
+ Note: Some paragraphs of this manuscript are extracted from these
+ two SPE documents, (Baque, 2017) and (Baque, 2020). Other evaluation and
+ development programs not presented previously are also presented in this manuscript.'
+ - source_sentence: What occupational health hazards are anticipated with large construction
392
+ projects during the energy transition?
393
+ sentences:
394
+ - "institutionalized political structures to realize particular social objectives\
395
+ \ or serve particular\nconstituencies. \n**Non-hazardous waste:** Waste, other\
396
+ \ than Hazardous waste, resulting from company\noperations, including process\
397
+ \ and oil field wastes disposed of, on site or off site, as well as\noffice, commercial\
398
+ \ or packaging related wastes [ENV-7]. \n**Normalization:** The ratio of a quantitative\
399
+ \ indicator output (e.g. emissions) to an\naggregated measure of another output\
400
+ \ (e.g. oil and gas production or refinery throughput) \n[Module 1 _Reporting\
401
+ \ process_ ]. \n**Occupational illness:** An Employee or Contractor health condition\
402
+ \ or disorder requiring\nmedical treatment due to a workplace Incident, typically\
403
+ \ involving multiple exposures to\nhazardous substances or to physical agents.\
404
+ \ Examples include noise-induced hearing loss,\nrespiratory disease, and contact\
405
+ \ dermatitis [SHS-3]. \n**Occupational injury:** Harm of an Employee or Contractor\
406
+ \ resulting from a single\ninstantaneous workplace incident that results in medical\
407
+ \ treatment (beyond simple first aid),\nwork restrictions, days away from work\
408
+ \ (lost time) or a Fatality [SHS-3]. \n**Operating area:** An area where business\
409
+ \ activities take place with potential to interact with\nthe adjacent environment\
410
+ \ [ENV-4]. \n**Operation:** A generic term used to denote any kind of business\
411
+ \ activity involving productrelated processes, such as production, manufacturing\
412
+ \ and transport. Note: the term oil and\ngas operations used in the Guidance is\
413
+ \ intended to be broad and inclusive of other types of\nproduct, such as chemicals.\
414
+ \ \n**7.5**"
415
+ - "The broader work of the Directorate is carried out \nby its four standing committees.\
416
+ \ \nSafety Committee: This committee’s objective is the \ncore of the Directorate:\
417
+ \ to eliminate fatalities and \ncatastrophic process safety events in our industry.\n\
418
+ In pursuit of this aim, the committee develops\nand promotes the adoption of recommended\n\
419
+ practices – a task it performs both on its own and\nwith partners and trade associations.\
420
+ \ The resulting\npublications lay a foundation for both safety\nand efficiency,\
421
+ \ and develops the motivated and\nempowered workforce needed to provide the world\n\
422
+ with clean, affordable energy. \nHighlights of the committee’s 2023 activities\n\
423
+ include participation in events, the creation\nof expert groups, issuing of publications,\
424
+ \ and\nengagement in data reviews. \n- Events: In 2023, in addition to the regular\n\
425
+ committee and subcommittee meetings,\nthe committee held diving workshops in\n\
426
+ Rio De Janeiro and Paris. These meetings\nchampioned local stakeholders and sought\n\
427
+ to improve local diving performance. It also\nhosted two Aviation Procurement\
428
+ \ Managers\nForums – one in London, the other in Houston \n– to address industry\
+ \ contracting behaviours \nand its impact on contractor resilience and\nsafety.\
+ \ In addition, the committee conducted a\nProcess Safety Workshop at the IOGP\
+ \ Summit\nin Indonesia. Finally, at this year’s Offshore\nEurope conference, IOGP\
+ \ Safety Director Steve\nNorton moderated a panel on learning from, and\nsharing,\
+ \ safety lessons. \n- Expert Groups: The committee established\nthree expert\
+ \ groups in 2023: two to revise\nexisting Reports (365 on land transportation\n\
+ safety and 365-12 on in-vehicle monitoring), and\none to consider adoption and\
+ \ implementation of\nrecommended safety practices. \n- Publications: The committee\
+ \ issued ten \nguidance documents in 2023, covering critical\nareas such as diving,\
+ \ aviation, and process\nsafety; see page 34 for a full list of publications.\
+ \ \n- Data reviews: The committee published its\nannual compilations of safety\
+ \ performance data,\ncovering occupational, process, aviation, and\nland transportation\
+ \ safety. IOGP has collected\nsafety performance data from its Members\nsince\
+ \ 1985 and our database is the largest in\nthe upstream industry, providing companies\n\
+ with valuable information for benchmarking and\nperformance improvement. \n17"
+ - "endotoxins and fungi. The authors recommended that\nongoing real–time measurement\
+ \ of these exposures be\ncarried out to identify boundary conditions, phases,\
+ \ and\nsettings with the highest pollutant release. \n12 — Health in the energy\
+ \ transition \nGood quality studies are needed on the health effects of\nrenewable\
+ \ energy sources. Such studies should include\npopulations and patients with well-characterized\
+ \ exposure,\nhigh-quality information on outcome, and assessment of\npotential\
+ \ confounders. While retrospective (e.g., case-control)\nstudies might produce\
+ \ useful results, prospective longitudinal\nstudies would provide the strongest\
+ \ evidence. \nSeveral LCA studies have been conducted for the different\ntechnologies.\
+ \ These LCAs reported relative low levels of\nemissions during the lifecycle of\
+ \ renewable sources of\nenergy. Few of these studies included a comparison with\n\
+ fossil-based technologies. When more life cycle studies\nbecome available it would\
+ \ be important to include them\nin the literature review. While looking at the\
+ \ life cycle of a\ncertain technology, other health effects in the value chain\n\
+ could potentially be identified (reference: UNECE on Carbon\nNeutrality in the\
+ \ UNECE Region: Integrated Life-cycle\nAssessment of Electricity Sources). \n\
+ As of December 2024, very few occupational and public\nhealth hazards specific\
+ \ to energy transition technologies\nhave been identified. The energy transition\
+ \ is in an early stage\nand will evolve quickly, and additional hazards unique\
+ \ to\nenergy transition activities may emerge; the specifics of this\nare, at\
+ \ this time, uncertain. \nWhat is certain is that the energy transition will\
+ \ involve large\nconstruction projects whose risks (and effective methods to\n\
+ manage those risks) are well-known and understood. Existing\noccupational health\
+ \ approaches will be able to manage\nthese risks effectively, provided the correct\
+ \ assessments are\nconducted properly."
+ datasets:
+ - Sampath1987/offshore_energy_v1
+ pipeline_tag: sentence-similarity
+ library_name: sentence-transformers
+ metrics:
+ - cosine_accuracy
+ model-index:
+ - name: SentenceTransformer based on Alibaba-NLP/gte-multilingual-base
+   results:
+   - task:
+       type: triplet
+       name: Triplet
+     dataset:
+       name: ai job validation
+       type: ai-job-validation
+     metrics:
+     - type: cosine_accuracy
+       value: 0.9700252413749695
+       name: Cosine Accuracy
+ ---
+
+ # SentenceTransformer based on Alibaba-NLP/gte-multilingual-base
+
+ This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [Alibaba-NLP/gte-multilingual-base](https://huggingface.co/Alibaba-NLP/gte-multilingual-base) on the [offshore_energy_v1](https://huggingface.co/datasets/Sampath1987/offshore_energy_v1) dataset. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
+
+ ## Model Details
+
+ ### Model Description
+ - **Model Type:** Sentence Transformer
+ - **Base model:** [Alibaba-NLP/gte-multilingual-base](https://huggingface.co/Alibaba-NLP/gte-multilingual-base) <!-- at revision 9bbca17d9273fd0d03d5725c7a4b0f6b45142062 -->
+ - **Maximum Sequence Length:** 8192 tokens
+ - **Output Dimensionality:** 768 dimensions
+ - **Similarity Function:** Cosine Similarity
+ - **Training Dataset:**
+     - [offshore_energy_v1](https://huggingface.co/datasets/Sampath1987/offshore_energy_v1)
+ <!-- - **Language:** Unknown -->
+ <!-- - **License:** Unknown -->
+
+ ### Model Sources
+
+ - **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
+ - **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
+ - **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)
+
+ ### Full Model Architecture
+
+ ```
+ SentenceTransformer(
+   (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False, 'architecture': 'NewModel'})
+   (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
+   (2): Normalize()
+ )
+ ```
+
+ ## Usage
+
+ ### Direct Usage (Sentence Transformers)
+
+ First install the Sentence Transformers library:
+
+ ```bash
+ pip install -U sentence-transformers
+ ```
+
+ Then you can load this model and run inference.
+ ```python
+ from sentence_transformers import SentenceTransformer
+
+ # Download from the 🤗 Hub
+ model = SentenceTransformer("Sampath1987/EnergyEmbed-v2-e3")
+ # Run inference
+ sentences = [
+ 'What occupational health hazards are anticipated with large construction projects during the energy transition?',
+ 'endotoxins and fungi. The authors recommended that\nongoing real–time measurement of these exposures be\ncarried out to identify boundary conditions, phases, and\nsettings with the highest pollutant release. \n12 — Health in the energy transition \nGood quality studies are needed on the health effects of\nrenewable energy sources. Such studies should include\npopulations and patients with well-characterized exposure,\nhigh-quality information on outcome, and assessment of\npotential confounders. While retrospective (e.g., case-control)\nstudies might produce useful results, prospective longitudinal\nstudies would provide the strongest evidence. \nSeveral LCA studies have been conducted for the different\ntechnologies. These LCAs reported relative low levels of\nemissions during the lifecycle of renewable sources of\nenergy. Few of these studies included a comparison with\nfossil-based technologies. When more life cycle studies\nbecome available it would be important to include them\nin the literature review. While looking at the life cycle of a\ncertain technology, other health effects in the value chain\ncould potentially be identified (reference: UNECE on Carbon\nNeutrality in the UNECE Region: Integrated Life-cycle\nAssessment of Electricity Sources). \nAs of December 2024, very few occupational and public\nhealth hazards specific to energy transition technologies\nhave been identified. The energy transition is in an early stage\nand will evolve quickly, and additional hazards unique to\nenergy transition activities may emerge; the specifics of this\nare, at this time, uncertain. \nWhat is certain is that the energy transition will involve large\nconstruction projects whose risks (and effective methods to\nmanage those risks) are well-known and understood. Existing\noccupational health approaches will be able to manage\nthese risks effectively, provided the correct assessments are\nconducted properly.',
+ 'institutionalized political structures to realize particular social objectives or serve particular\nconstituencies. \n**Non-hazardous waste:** Waste, other than Hazardous waste, resulting from company\noperations, including process and oil field wastes disposed of, on site or off site, as well as\noffice, commercial or packaging related wastes [ENV-7]. \n**Normalization:** The ratio of a quantitative indicator output (e.g. emissions) to an\naggregated measure of another output (e.g. oil and gas production or refinery throughput) \n[Module 1 _Reporting process_ ]. \n**Occupational illness:** An Employee or Contractor health condition or disorder requiring\nmedical treatment due to a workplace Incident, typically involving multiple exposures to\nhazardous substances or to physical agents. Examples include noise-induced hearing loss,\nrespiratory disease, and contact dermatitis [SHS-3]. \n**Occupational injury:** Harm of an Employee or Contractor resulting from a single\ninstantaneous workplace incident that results in medical treatment (beyond simple first aid),\nwork restrictions, days away from work (lost time) or a Fatality [SHS-3]. \n**Operating area:** An area where business activities take place with potential to interact with\nthe adjacent environment [ENV-4]. \n**Operation:** A generic term used to denote any kind of business activity involving productrelated processes, such as production, manufacturing and transport. Note: the term oil and\ngas operations used in the Guidance is intended to be broad and inclusive of other types of\nproduct, such as chemicals. \n**7.5**',
+ ]
+ embeddings = model.encode(sentences)
+ print(embeddings.shape)
+ # (3, 768)
+
+ # Get the similarity scores for the embeddings
+ similarities = model.similarity(embeddings, embeddings)
+ print(similarities)
+ # tensor([[1.0000, 0.5463, 0.1943],
+ #         [0.5463, 1.0000, 0.1698],
+ #         [0.1943, 0.1698, 1.0000]])
+ ```
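Because the model ends in a `Normalize()` module, its embeddings are unit-length and cosine similarity reduces to a plain dot product, which is what `model.similarity` computes. A minimal sketch of that computation with toy 3-dimensional stand-ins (illustrative values only, not real 768-dimensional model outputs):

```python
import numpy as np

# Toy stand-ins for model outputs; real embeddings are 768-dimensional.
embeddings = np.array([
    [0.2, 0.9, 0.1],  # query
    [0.3, 0.8, 0.2],  # related passage
    [0.9, 0.1, 0.4],  # unrelated passage
])

# Normalize to unit length (what the Normalize() module does), then the
# cosine similarity matrix is just the matrix of dot products.
normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
similarities = normed @ normed.T

assert np.allclose(np.diag(similarities), 1.0)  # self-similarity is 1
best_match = int(np.argmax(similarities[0, 1:])) + 1
print(best_match)  # the related passage scores highest for the query
```

The same dot-product trick is what makes these embeddings cheap to use with vector indexes that support inner-product search.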
+
+ <!--
+ ### Direct Usage (Transformers)
+
+ <details><summary>Click to see the direct usage in Transformers</summary>
+
+ </details>
+ -->
+
+ <!--
+ ### Downstream Usage (Sentence Transformers)
+
+ You can finetune this model on your own dataset.
+
+ <details><summary>Click to expand</summary>
+
+ </details>
+ -->
+
+ <!--
+ ### Out-of-Scope Use
+
+ *List how the model may foreseeably be misused and address what users ought not to do with the model.*
+ -->
+
+ ## Evaluation
+
+ ### Metrics
+
+ #### Triplet
+
+ * Dataset: `ai-job-validation`
+ * Evaluated with [<code>TripletEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.TripletEvaluator)
+
+ | Metric | Value |
+ |:--------------------|:---------|
+ | **cosine_accuracy** | **0.97** |
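`cosine_accuracy` is the fraction of evaluation triplets for which the anchor is closer (by cosine similarity) to its positive than to its negative. A minimal sketch of that computation on hypothetical pre-computed embeddings (tiny 2-dimensional vectors, not outputs of this model):

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical (anchor, positive, negative) embeddings for three triplets.
triplets = [
    (np.array([1.0, 0.0]), np.array([0.9, 0.1]), np.array([0.1, 0.9])),
    (np.array([0.0, 1.0]), np.array([0.2, 0.8]), np.array([0.7, 0.3])),
    (np.array([1.0, 1.0]), np.array([0.0, 1.0]), np.array([1.0, 0.1])),
]

# A triplet counts as correct when the anchor ranks its positive first.
correct = sum(cosine(a, p) > cosine(a, n) for a, p, n in triplets)
cosine_accuracy = correct / len(triplets)
print(cosine_accuracy)  # 2 of the 3 toy triplets are ranked correctly
```

The 0.97 reported above means roughly 97 out of every 100 held-out triplets were ranked correctly by this model.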
+
594
+ <!--
595
+ ## Bias, Risks and Limitations
596
+
597
+ *What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
598
+ -->
599
+
600
+ <!--
601
+ ### Recommendations
602
+
603
+ *What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
604
+ -->
605
+
606
+ ## Training Details
607
+
608
+ ### Training Dataset
609
+
610
+ #### offshore_energy_v1
611
+
612
+ * Dataset: [offshore_energy_v1](https://huggingface.co/datasets/Sampath1987/offshore_energy_v1) at [4e9339c](https://huggingface.co/datasets/Sampath1987/offshore_energy_v1/tree/4e9339cce67dbff9d1e6ba25cfdcd9a4a7f529f7)
613
+ * Size: 53,913 training samples
614
+ * Columns: <code>anchor</code>, <code>positive</code>, and <code>negative</code>
615
+ * Approximate statistics based on the first 1000 samples:
616
+ | | anchor | positive | negative |
617
+ |:--------|:-----------------------------------------------------------------------------------|:-------------------------------------------------------------------------------------|:--------------------------------------------------------------------------------------|
618
+ | type | string | string | string |
619
+ | details | <ul><li>min: 14 tokens</li><li>mean: 23.77 tokens</li><li>max: 42 tokens</li></ul> | <ul><li>min: 36 tokens</li><li>mean: 392.08 tokens</li><li>max: 961 tokens</li></ul> | <ul><li>min: 45 tokens</li><li>mean: 389.63 tokens</li><li>max: 1109 tokens</li></ul> |
620
+ * Samples:
621
+ | anchor | positive | negative |
622
+   |:--------|:--------|:--------|
+   | <code>What statistical methods were employed to enhance the accuracy of comparisons in the field testing of shaped cutters?</code> | <code>As shaped polycrystalline diamond compact (PDC) cutter geometries become more prevalent across the industry, this paper statistically reviews field testing of novel shaped PDC cutters in a variety of challenging applications. Firstly, the paper identifies the improvement in efficiency when compared with conventional PDC cutter geometries. Secondly, it confirms the reliability and robustness of the aforementioned shaped cutter geometries.<br>After several years of field testing shaped PDC cutter geometries, the question of how they hold up against conventional cylinder-shaped cutters remains unanswered. This study looks at drill bits that have the same overall design; however, each bit has different shape configurations that are deployed in a range of hole sizes and drilling applications. Data was collected from more than 100 runs and included advanced dull evaluation techniques, data mining, and comparative analyses. During data collation and interpretation, several statistical methods we...</code> | <code>This paper details the improvements to drilling performance and torsional response of fixed cutter bits when changing from a conventional 19-mm cutter diameter configuration to 25-mm cutter diameters for similar blade counts in two different hole sizes. Key performance metrics include rate of penetration (ROP), rerun-ability, torsional response, and ability to maintain tool-face control during directional drilling.<br>A high-performance drilling application was selected with several existing offset wells using a 12¼-in., five-bladed, 19-mm (519) drill bit design, and a concept bit developed using 25-mm diameter cutters while maintaining comparable ancillary features.<br>This was tested in the same field on both vertical and S-shape sections using the same bent-housing motor assembly and drilling performance compared to the existing offsets. A 17½-in. hole size application that experiences high drillstring vibration was also selected, and a 25-mm cutter diameter drill bit was designed with co...</code> |
+ | <code>What are vapor recovery units (VRU) used for in oil and gas operations?</code> | <code>## 4. Vapour recovery units <br>Vapor recovery units (VRU) are used to prevent emissions by capturing the streams and<br>re-routing them either back to the process or for use as fuel. More details on the<br>components, installation, and operation of VRU are captured in the following sections.</code> | <code>##### 3.1.2 Reduction and recovery of glycol dehydration flash gas <br>Gas from the flash vessel will consist primarily of hydrocarbons and is continuously<br>produced. If installed, a flash vessel will typically remove 90% or more of the entrained<br>hydrocarbon gas and dissolved gases in the glycol leaving the contactor column. <br>Glycol flash vessels typically operate at 3-7 barg [18], meaning there is generally a sufficient<br>pressure drop for the flash gas to commonly be routed to flare or a low-pressure fuel gas<br>system. If the composition of the flash gas prevents this, or there is no fuel gas system,<br>then a Vapour Recovery Unit (VRU) may be needed for recovery into other process units. <br>Minimization of the flash gas itself is also possible by optimizing the glycol flowrate,<br>such as by adjusting the dry gas water temperature specification based on accurate site<br>conditions because the water dew point needed could vary seasonally or from site to<br>site by using more accurate ambient temperatur...</code> |
+   | <code>What challenges are posed by fractures and faults in the completion of MRC wells?</code> | <code>The Maximum Reservoir Contact (MRC) concept was developed to improve well productivity and sustainability by maximizing the contact area with target reservoirs. MRC is a proven technology for the development of tight/non-economical reservoirs. Completion design for MRC wells plays a vital role in enhancing well deliverability, monitoring and accessibility.<br>MRC technology was put into application to appraise a tight and thin heterogeneous carbonate reservoir in a giant offshore field in Abu Dhabi. Different completion scenarios were simulated to select the best suited completion to achieve enhanced well deliverability, monitoring and accessibility.<br>Heavy casing design with liner and tie-back system was finalized to maximize accessibility and achieve proper isolation behind casing. A special pre-perforated liner was also designed to eliminate the pressure drop across the wellbore. The MRC drain was divided mainly into two sections, blank pipe and pre-perforated liner equipped with swell ...</code> | <code>The Clair field is the largest discovered oilfield on the UK continental shelf (UKCS) but has high reservoir uncertainty associated with a complex natural fracture network. The field area covers over 200 sq km with an estimated STOIIP of 7 billion barrels. The scale and complexity of the reservoir has led to a phased multi-platform development.<br>Phase 1 started production in 2005 with 20 wells drilled prior to an extended drill break. Five new wells (A21 to A25) were drilled and brought online during 2016/17 which increased platform production by c.70%. The new wells incorporated historic lessons to mitigate the risk of wellbore instability in the overburden and be robust to the dynamic uncertainties of the fractured reservoir.<br>Many of the well outcomes and risk events were predicted and mitigated effectively, however the new wells still provided some surprises.<br>This paper presents a summary of the lessons from the historic Clair development wells which underpinned the recent drilling c...</code> |
+ * Loss: [<code>MultipleNegativesRankingLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativesrankingloss) with these parameters:
+   ```json
+   {
+       "scale": 20.0,
+       "similarity_fct": "cos_sim",
+       "gather_across_devices": false
+   }
+   ```
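MultipleNegativesRankingLoss treats every other positive in the batch as an in-batch negative: each anchor is scored against all positives with scaled cosine similarity, and cross-entropy pushes the matching positive to the top. A numpy sketch of that computation (illustrative only, using the `scale: 20.0` and `cos_sim` settings above; the real loss is computed in PyTorch):

```python
import numpy as np

def mnrl_loss(anchors, positives, scale=20.0):
    # similarity_fct "cos_sim": normalize rows, then take dot products.
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    scores = scale * (a @ p.T)  # [batch, batch]; diagonal = true pairs
    # Cross-entropy with the diagonal (the matching positive) as the target.
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

anchors = np.array([[1.0, 0.0], [0.0, 1.0]])
good = np.array([[0.9, 0.1], [0.1, 0.9]])  # positives aligned with anchors
bad = np.array([[0.1, 0.9], [0.9, 0.1]])   # positives swapped
print(mnrl_loss(anchors, good) < mnrl_loss(anchors, bad))  # True
```

Larger batches therefore give each anchor more in-batch negatives, which is why this loss tends to benefit from bigger `per_device_train_batch_size`.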
+
+ ### Evaluation Dataset
+
+ #### offshore_energy_v1
+
+ * Dataset: [offshore_energy_v1](https://huggingface.co/datasets/Sampath1987/offshore_energy_v1) at [4e9339c](https://huggingface.co/datasets/Sampath1987/offshore_energy_v1/tree/4e9339cce67dbff9d1e6ba25cfdcd9a4a7f529f7)
+ * Size: 6,739 evaluation samples
+ * Columns: <code>anchor</code>, <code>positive</code>, and <code>negative</code>
+ * Approximate statistics based on the first 1000 samples:
+   | | anchor | positive | negative |
+   |:--------|:--------|:--------|:--------|
+   | type | string | string | string |
+   | details | <ul><li>min: 11 tokens</li><li>mean: 23.56 tokens</li><li>max: 52 tokens</li></ul> | <ul><li>min: 55 tokens</li><li>mean: 386.01 tokens</li><li>max: 1082 tokens</li></ul> | <ul><li>min: 45 tokens</li><li>mean: 382.6 tokens</li><li>max: 1175 tokens</li></ul> |
+ * Samples:
+   | anchor | positive | negative |
+   |:--------|:--------|:--------|
+   | <code>What is the importance of quantifying carbon emissions during cementing operations in decarbonization?</code> | <code>An important step in decarbonization is using an end-to-end approach to quantify carbon emissions during cementing operations. By careful analysis of the entire cementing operations process, it is then possible to measure and compare carbon emissions at various stages of the operation. Understanding and isolating the main drivers of the carbon emissions footprint enables making better choices and developing best alternatives with lower environmental impact.<br>The methodology considers the lifecycle assessment of cement from quarry extraction to well abandonment, and includes steps such as manufacturing of raw materials, transportation and logistics, and operations in the field. For these stages, careful quantification of emissions is performed based on the manufacturer's carbon emissions of cementing products, transportation (distance and means) to the bulk plant and rig site, and equipment-related emissions such as blending and pumping units. In some cases, when assessing the footprint ...</code> | <code>Objectives/Scope<br>There are many different views on the Energy Transition. What is agreed is that to achieve current climate change targets, the journey to deep decarbonisation must start now. Scope 3 emissions are clearly the major contributor to total emissions and must be actively reduced. However, if Oil and Gas extraction is to be continued, then operators must understand, measure, and reduce Scope 1 and 2 emissions. This paper examines the constituent parts of typical Scope 1 emissions for O&G assets and discusses a credible pathway and initial steps towards decarbonisation of operations.<br>Methods, Procedures, Process<br>Emissions from typical assets are investigated: data is examined to determine the overall and individual contributions of Scope 1 emissions.<br>A three tiered approach to emissions savings is presented:<br>–<br>Reduce overall energy usage<br>–<br>Seek to Remove environmental losses<br>–<br>Replace energy supply with low carbon alternatives<br>A simple method, used to assess carbon emissions,...</code> |
+ | <code>What factors must engineers consider during the drilling design phase?</code> | <code>The drilling of oil and gas wells involves several stages including the exploration phase, drilling design, and perforation techniques. In the exploration phase, geologists use seismic surveys to identify potential drilling locations. During the drilling design phase, engineers must consider factors such as wellbore stability, fluid mechanics, and formation pressures. Once the well is drilled, perforation techniques are applied to enhance the flow of hydrocarbons into the wellbore. The effectiveness of these techniques can significantly impact production rates and overall project success.</code> | <code>The extraction of crude oil and natural gas is typically carried out through drilling. Drilling uses different techniques to reach the petroleum reservoirs located deep underground. One key method is rotary drilling, where a drill bit is rotated while cutting through the earth's layers to create a wellbore. Rotary drilling is favored for its efficiency in penetrating hard rock layers. Another method is directional drilling, which allows operators to drill at various angles to reach reservoirs that are not directly beneath the drilling platform. This technique increases the area covered by the well and can optimize production. In addition, hydraulic fracturing enhances recovery rates by injecting fluids under high pressure to create fractures in the rock, increasing the permeability and allowing oil and gas to flow more freely. Lastly, the safety and environmental impacts of drilling techniques are a growing concern, and advancements are continually being sought to mitigate these effect...</code> |
+   | <code>How does the 'Dissolved pore network' concept enhance matrix permeability in the modeling of carbonate oil reservoirs?</code> | <code>In this paper, we present a case study of using dual porosity dual permeability (DPDP) simulation for an offshore Abu Dhabi carbonate oil reservoir exhibiting complex flow behavior through matrix, fracture system and conductive faults. The main objective of the study is to present and explain the reservoir flow behaviors by constructing and using advanced reservoir geologic and simulation models. The results of the study will be utilized as part of the inputs for full field development plan.<br>Initially, an extensive work on the faults and fractures characterization was conducted to properly integrate this information into a dynamic model using DPDP modeling approach. However, the poor response of some wells or field sectors indicated the insufficiency of this concept to capture the full complexity of the reservoir system. Consequently, a new geological concept was proposed to represent the effect of enhanced matrix permeability related to facies dissolution process in the reservoir mode...</code> | <code>Integration of pressure-derived permeability thickness with other geological data plays a crucial role in estimating the apparent reservoir permeability, which is a key reservoir property required for reliable reservoir characterization as it governs fluid flow and greatly impacts decisions related to production, field development, and reservoir management. The geological model provides a representation of the subsurface reservoir, capturing the spatial distribution of lithology, porosity, permeability, and other geological properties. Analysis of pressure data provides valuable information on well condition, reservoir extent, and dynamic reservoir parameters.<br>Integrating such data with the geological model is an enabler to better quantify and manage the uncertainty in the spatial 3D distribution of permeability away from well control.<br>This work proposes a methodology to build high-resolution geological models based on the available dynamic data, seismic data, and geologic interpretati...</code> |
+ * Loss: [<code>MultipleNegativesRankingLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativesrankingloss) with these parameters:
+   ```json
+   {
+       "scale": 20.0,
+       "similarity_fct": "cos_sim",
+       "gather_across_devices": false
+   }
+   ```
+
+ ### Training Hyperparameters
+ #### Non-Default Hyperparameters
+
+ - `eval_strategy`: steps
+ - `per_device_train_batch_size`: 16
+ - `per_device_eval_batch_size`: 16
+ - `learning_rate`: 2e-05
+ - `warmup_ratio`: 0.1
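The non-default values above can be expressed with the sentence-transformers v3+ trainer API. A hedged configuration sketch, not the exact training script: the `output_dir` is a placeholder, and `num_train_epochs=3` is taken from the full hyperparameter list in this card.

```python
from sentence_transformers import SentenceTransformerTrainingArguments

# Sketch of the non-default settings listed above; output_dir is hypothetical.
args = SentenceTransformerTrainingArguments(
    output_dir="outputs/energy-embed",  # placeholder path
    eval_strategy="steps",
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    learning_rate=2e-5,
    warmup_ratio=0.1,
    num_train_epochs=3,
)
```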
+
+ #### All Hyperparameters
+ <details><summary>Click to expand</summary>
+ 
+ - `overwrite_output_dir`: False
+ - `do_predict`: False
+ - `eval_strategy`: steps
+ - `prediction_loss_only`: True
+ - `per_device_train_batch_size`: 16
+ - `per_device_eval_batch_size`: 16
+ - `per_gpu_train_batch_size`: None
+ - `per_gpu_eval_batch_size`: None
+ - `gradient_accumulation_steps`: 1
+ - `eval_accumulation_steps`: None
+ - `torch_empty_cache_steps`: None
+ - `learning_rate`: 2e-05
+ - `weight_decay`: 0.0
+ - `adam_beta1`: 0.9
+ - `adam_beta2`: 0.999
+ - `adam_epsilon`: 1e-08
+ - `max_grad_norm`: 1.0
+ - `num_train_epochs`: 3
+ - `max_steps`: -1
+ - `lr_scheduler_type`: linear
+ - `lr_scheduler_kwargs`: {}
+ - `warmup_ratio`: 0.1
+ - `warmup_steps`: 0
+ - `log_level`: passive
+ - `log_level_replica`: warning
+ - `log_on_each_node`: True
+ - `logging_nan_inf_filter`: True
+ - `save_safetensors`: True
+ - `save_on_each_node`: False
+ - `save_only_model`: False
+ - `restore_callback_states_from_checkpoint`: False
+ - `no_cuda`: False
+ - `use_cpu`: False
+ - `use_mps_device`: False
+ - `seed`: 42
+ - `data_seed`: None
+ - `jit_mode_eval`: False
+ - `use_ipex`: False
+ - `bf16`: False
+ - `fp16`: False
+ - `fp16_opt_level`: O1
+ - `half_precision_backend`: auto
+ - `bf16_full_eval`: False
+ - `fp16_full_eval`: False
+ - `tf32`: None
+ - `local_rank`: 0
+ - `ddp_backend`: None
+ - `tpu_num_cores`: None
+ - `tpu_metrics_debug`: False
+ - `debug`: []
+ - `dataloader_drop_last`: False
+ - `dataloader_num_workers`: 0
+ - `dataloader_prefetch_factor`: None
+ - `past_index`: -1
+ - `disable_tqdm`: False
+ - `remove_unused_columns`: True
+ - `label_names`: None
+ - `load_best_model_at_end`: False
+ - `ignore_data_skip`: False
+ - `fsdp`: []
+ - `fsdp_min_num_params`: 0
+ - `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
+ - `fsdp_transformer_layer_cls_to_wrap`: None
+ - `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
+ - `deepspeed`: None
+ - `label_smoothing_factor`: 0.0
+ - `optim`: adamw_torch
+ - `optim_args`: None
+ - `adafactor`: False
+ - `group_by_length`: False
+ - `length_column_name`: length
+ - `ddp_find_unused_parameters`: None
+ - `ddp_bucket_cap_mb`: None
+ - `ddp_broadcast_buffers`: False
+ - `dataloader_pin_memory`: True
+ - `dataloader_persistent_workers`: False
+ - `skip_memory_metrics`: True
+ - `use_legacy_prediction_loop`: False
+ - `push_to_hub`: False
+ - `resume_from_checkpoint`: None
+ - `hub_model_id`: None
+ - `hub_strategy`: every_save
+ - `hub_private_repo`: None
+ - `hub_always_push`: False
+ - `hub_revision`: None
+ - `gradient_checkpointing`: False
+ - `gradient_checkpointing_kwargs`: None
+ - `include_inputs_for_metrics`: False
+ - `include_for_metrics`: []
+ - `eval_do_concat_batches`: True
+ - `fp16_backend`: auto
+ - `push_to_hub_model_id`: None
+ - `push_to_hub_organization`: None
+ - `mp_parameters`: 
+ - `auto_find_batch_size`: False
+ - `full_determinism`: False
+ - `torchdynamo`: None
+ - `ray_scope`: last
+ - `ddp_timeout`: 1800
+ - `torch_compile`: False
+ - `torch_compile_backend`: None
+ - `torch_compile_mode`: None
+ - `include_tokens_per_second`: False
+ - `include_num_input_tokens_seen`: False
+ - `neftune_noise_alpha`: None
+ - `optim_target_modules`: None
+ - `batch_eval_metrics`: False
+ - `eval_on_start`: False
+ - `use_liger_kernel`: False
+ - `liger_kernel_config`: None
+ - `eval_use_gather_object`: False
+ - `average_tokens_across_devices`: False
+ - `prompts`: None
+ - `batch_sampler`: batch_sampler
+ - `multi_dataset_batch_sampler`: proportional
+ - `router_mapping`: {}
+ - `learning_rate_mapping`: {}
+ 
+ </details>
+ 
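As a sketch of what `lr_scheduler_type: linear` combined with `warmup_ratio: 0.1` means for the run above: the learning rate climbs linearly from 0 to 2e-05 over the first 10% of training steps, then decays linearly back to 0. A hypothetical helper mirroring that schedule (illustrative, not the transformers implementation):

```python
def linear_warmup_lr(step, total_steps, base_lr=2e-05, warmup_ratio=0.1):
    # Linear warmup over the first `warmup_ratio` fraction of steps,
    # then linear decay from base_lr down to 0 at the final step.
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    return base_lr * max(0.0, total_steps - step) / max(1, total_steps - warmup_steps)
```

For a 1000-step run this peaks at 2e-05 at step 100 and reaches exactly half the peak midway through the decay phase.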
+ ### Training Logs
+ | Epoch | Step | Training Loss | Validation Loss | ai-job-validation_cosine_accuracy |
+ |:------:|:-----:|:-------------:|:---------------:|:---------------------------------:|
+ | 0.2967 | 1000 | - | 0.1458 | 0.9605 |
+ | 0.5935 | 2000 | - | 0.1217 | 0.9665 |
+ | 0.8902 | 3000 | - | 0.1095 | 0.9711 |
+ | 1.1869 | 4000 | - | 0.1131 | 0.9682 |
+ | 1.4837 | 5000 | 0.1672 | 0.1107 | 0.9687 |
+ | 1.7804 | 6000 | - | 0.1030 | 0.9709 |
+ | 2.0772 | 7000 | - | 0.1081 | 0.9693 |
+ | 2.3739 | 8000 | - | 0.1091 | 0.9691 |
+ | 2.6706 | 9000 | - | 0.1098 | 0.9691 |
+ | 2.9674 | 10000 | 0.0678 | 0.1065 | 0.9700 |
+ 
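The `cosine_accuracy` column in the log table is a triplet metric: the fraction of (anchor, positive, negative) examples where the anchor is more similar to the positive than to the negative. A minimal sketch, with the similarity function passed in (illustrative, not the evaluator's actual code):

```python
def triplet_accuracy(triplets, sim):
    # triplets: iterable of (anchor, positive, negative) embeddings.
    # A triplet counts as correct when the anchor scores higher
    # against its positive than against its negative.
    triplets = list(triplets)
    correct = sum(1 for a, p, n in triplets if sim(a, p) > sim(a, n))
    return correct / len(triplets)
```

Any similarity works here; the logged metric uses cosine similarity over the model's embeddings.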
+ 
+ ### Framework Versions
+ - Python: 3.10.12
+ - Sentence Transformers: 5.1.0
+ - Transformers: 4.53.3
+ - PyTorch: 2.8.0+cu128
+ - Accelerate: 1.9.0
+ - Datasets: 4.0.0
+ - Tokenizers: 0.21.2
+ 
+ ## Citation
+ 
+ ### BibTeX
+ 
+ #### Sentence Transformers
+ ```bibtex
+ @inproceedings{reimers-2019-sentence-bert,
+     title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
+     author = "Reimers, Nils and Gurevych, Iryna",
+     booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
+     month = "11",
+     year = "2019",
+     publisher = "Association for Computational Linguistics",
+     url = "https://arxiv.org/abs/1908.10084",
+ }
+ ```
+ 
+ #### MultipleNegativesRankingLoss
+ ```bibtex
+ @misc{henderson2017efficient,
+     title={Efficient Natural Language Response Suggestion for Smart Reply},
+     author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
+     year={2017},
+     eprint={1705.00652},
+     archivePrefix={arXiv},
+     primaryClass={cs.CL}
+ }
+ ```
+ 
+ <!--
+ ## Glossary
+ 
+ *Clearly define terms in order to be accessible across audiences.*
+ -->
+ 
+ <!--
+ ## Model Card Authors
+ 
+ *Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
+ -->
+ 
+ <!--
+ ## Model Card Contact
+ 
+ *Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
+ -->
config.json ADDED
@@ -0,0 +1,49 @@
+ {
+   "architectures": [
+     "NewModel"
+   ],
+   "attention_probs_dropout_prob": 0.0,
+   "auto_map": {
+     "AutoConfig": "configuration.NewConfig",
+     "AutoModel": "modeling.NewModel",
+     "AutoModelForMaskedLM": "Alibaba-NLP/new-impl--modeling.NewForMaskedLM",
+     "AutoModelForMultipleChoice": "Alibaba-NLP/new-impl--modeling.NewForMultipleChoice",
+     "AutoModelForQuestionAnswering": "Alibaba-NLP/new-impl--modeling.NewForQuestionAnswering",
+     "AutoModelForSequenceClassification": "Alibaba-NLP/new-impl--modeling.NewForSequenceClassification",
+     "AutoModelForTokenClassification": "Alibaba-NLP/new-impl--modeling.NewForTokenClassification"
+   },
+   "classifier_dropout": 0.0,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 768,
+   "id2label": {
+     "0": "LABEL_0"
+   },
+   "initializer_range": 0.02,
+   "intermediate_size": 3072,
+   "label2id": {
+     "LABEL_0": 0
+   },
+   "layer_norm_eps": 1e-12,
+   "layer_norm_type": "layer_norm",
+   "logn_attention_clip1": false,
+   "logn_attention_scale": false,
+   "max_position_embeddings": 8192,
+   "model_type": "new",
+   "num_attention_heads": 12,
+   "num_hidden_layers": 12,
+   "pack_qkv": true,
+   "pad_token_id": 1,
+   "position_embedding_type": "rope",
+   "rope_scaling": {
+     "factor": 8.0,
+     "type": "ntk"
+   },
+   "rope_theta": 20000,
+   "torch_dtype": "float32",
+   "transformers_version": "4.53.3",
+   "type_vocab_size": 1,
+   "unpad_inputs": false,
+   "use_memory_efficient_attention": false,
+   "vocab_size": 250048
+ }
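The `rope_scaling` entry above (`{"factor": 8.0, "type": "ntk"}`) selects the fixed NTK scaling implemented by `NTKScalingRotaryEmbedding` in modeling.py: past the original context length, the RoPE base is multiplied by the factor and every inverse frequency is additionally divided by `factor ** (2 / dim)`. A pure-Python sketch of that frequency computation (illustrative; the model computes the same values with torch tensors):

```python
def ntk_scaled_inv_freq(dim, base=20000.0, factor=8.0):
    # Fixed NTK scaling (mixed_b=None path of _set_cos_sin_cache):
    # enlarge the RoPE base by `factor`, then divide all inverse
    # frequencies by factor ** (2 / dim).
    scaled_base = base * factor
    inv_freq = [scaled_base ** (-j / dim) for j in range(0, dim, 2)]
    correction = factor ** (2 / dim)
    return [f / correction for f in inv_freq]
```

With `hidden_size=768` and 12 heads, each head dimension is 64, so this yields 32 frequencies per head, monotonically decreasing from the lowest rotation index to the highest.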
config_sentence_transformers.json ADDED
@@ -0,0 +1,14 @@
+ {
+   "model_type": "SentenceTransformer",
+   "__version__": {
+     "sentence_transformers": "5.1.0",
+     "transformers": "4.53.3",
+     "pytorch": "2.8.0+cu128"
+   },
+   "prompts": {
+     "query": "",
+     "document": ""
+   },
+   "default_prompt_name": null,
+   "similarity_fn_name": "cosine"
+ }
configuration.py ADDED
@@ -0,0 +1,145 @@
1
+ # coding=utf-8
2
+ # Copyright 2024 The GTE Team Authors and Alibaba Group.
3
+ # Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
4
+ #
5
+ # Licensed under the Apache License, Version 2.0 (the "License");
6
+ # you may not use this file except in compliance with the License.
7
+ # You may obtain a copy of the License at
8
+ #
9
+ # http://www.apache.org/licenses/LICENSE-2.0
10
+ #
11
+ # Unless required by applicable law or agreed to in writing, software
12
+ # distributed under the License is distributed on an "AS IS" BASIS,
13
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14
+ # See the License for the specific language governing permissions and
15
+ # limitations under the License.
16
+ """ NEW model configuration"""
17
+ from transformers.configuration_utils import PretrainedConfig
18
+ from transformers.utils import logging
19
+
20
+ logger = logging.get_logger(__name__)
21
+
22
+
23
+ class NewConfig(PretrainedConfig):
24
+ r"""
25
+ This is the configuration class to store the configuration of a [`NewModel`] or a [`TFNewModel`]. It is used to
26
+ instantiate a NEW model according to the specified arguments, defining the model architecture. Instantiating a
27
+ configuration with the defaults will yield a similar configuration to that of the NEW
28
+ [izhx/new-base-en](https://huggingface.co/izhx/new-base-en) architecture.
29
+
30
+ Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
31
+ documentation from [`PretrainedConfig`] for more information.
32
+
33
+
34
+ Args:
35
+ vocab_size (`int`, *optional*, defaults to 30522):
36
+ Vocabulary size of the NEW model. Defines the number of different tokens that can be represented by the
37
+ `inputs_ids` passed when calling [`NewModel`] or [`TFNewModel`].
38
+ hidden_size (`int`, *optional*, defaults to 768):
39
+ Dimensionality of the encoder layers and the pooler layer.
40
+ num_hidden_layers (`int`, *optional*, defaults to 12):
41
+ Number of hidden layers in the Transformer encoder.
42
+ num_attention_heads (`int`, *optional*, defaults to 12):
43
+ Number of attention heads for each attention layer in the Transformer encoder.
44
+ intermediate_size (`int`, *optional*, defaults to 3072):
45
+ Dimensionality of the "intermediate" (often named feed-forward) layer in the Transformer encoder.
46
+ hidden_act (`str` or `Callable`, *optional*, defaults to `"gelu"`):
47
+ The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`,
48
+ `"relu"`, `"silu"` and `"gelu_new"` are supported.
49
+ hidden_dropout_prob (`float`, *optional*, defaults to 0.1):
50
+ The dropout probability for all fully connected layers in the embeddings, encoder, and pooler.
51
+ attention_probs_dropout_prob (`float`, *optional*, defaults to 0.1):
52
+ The dropout ratio for the attention probabilities.
53
+ max_position_embeddings (`int`, *optional*, defaults to 512):
54
+ The maximum sequence length that this model might ever be used with. Typically set this to something large
55
+ just in case (e.g., 512 or 1024 or 2048).
56
+ type_vocab_size (`int`, *optional*, defaults to 2):
57
+ The vocabulary size of the `token_type_ids` passed when calling [`NewModel`] or [`TFNewModel`].
58
+ initializer_range (`float`, *optional*, defaults to 0.02):
59
+ The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
60
+ layer_norm_eps (`float`, *optional*, defaults to 1e-12):
61
+ The epsilon used by the layer normalization layers.
62
+ position_embedding_type (`str`, *optional*, defaults to `"rope"`):
63
+ Type of position embedding. Choose one of `"absolute"`, `"rope"`.
64
+ rope_theta (`float`, *optional*, defaults to 10000.0):
65
+ The base period of the RoPE embeddings.
66
+ rope_scaling (`Dict`, *optional*):
67
+ Dictionary containing the scaling configuration for the RoPE embeddings. Currently supports two scaling
68
+ strategies: linear and dynamic. Their scaling factor must be a float greater than 1. The expected format is
69
+ `{"type": strategy name, "factor": scaling factor}`. When using this flag, don't update
70
+ `max_position_embeddings` to the expected new maximum. See the following thread for more information on how
71
+ these scaling strategies behave:
72
+ https://www.reddit.com/r/LocalLLaMA/comments/14mrgpr/dynamically_scaled_rope_further_increases/. This is an
73
+ experimental feature, subject to breaking API changes in future versions.
74
+ classifier_dropout (`float`, *optional*):
75
+ The dropout ratio for the classification head.
76
+
77
+ Examples:
78
+
79
+ ```python
80
+ >>> from transformers import NewConfig, NewModel
81
+
82
+ >>> # Initializing a NEW izhx/new-base-en style configuration
83
+ >>> configuration = NewConfig()
84
+
85
+ >>> # Initializing a model (with random weights) from the izhx/new-base-en style configuration
86
+ >>> model = NewModel(configuration)
87
+
88
+ >>> # Accessing the model configuration
89
+ >>> configuration = model.config
90
+ ```"""
91
+
92
+ model_type = "new"
93
+
94
+ def __init__(
95
+ self,
96
+ vocab_size=30528,
97
+ hidden_size=768,
98
+ num_hidden_layers=12,
99
+ num_attention_heads=12,
100
+ intermediate_size=3072,
101
+ hidden_act="gelu",
102
+ hidden_dropout_prob=0.1,
103
+ attention_probs_dropout_prob=0.0,
104
+ max_position_embeddings=2048,
105
+ type_vocab_size=1,
106
+ initializer_range=0.02,
107
+ layer_norm_type='layer_norm',
108
+ layer_norm_eps=1e-12,
109
+ # pad_token_id=0,
110
+ position_embedding_type="rope",
111
+ rope_theta=10000.0,
112
+ rope_scaling=None,
113
+ classifier_dropout=None,
114
+ pack_qkv=True,
115
+ unpad_inputs=False,
116
+ use_memory_efficient_attention=False,
117
+ logn_attention_scale=False,
118
+ logn_attention_clip1=False,
119
+ **kwargs,
120
+ ):
121
+ super().__init__(**kwargs)
122
+
123
+ self.vocab_size = vocab_size
124
+ self.hidden_size = hidden_size
125
+ self.num_hidden_layers = num_hidden_layers
126
+ self.num_attention_heads = num_attention_heads
127
+ self.hidden_act = hidden_act
128
+ self.intermediate_size = intermediate_size
129
+ self.hidden_dropout_prob = hidden_dropout_prob
130
+ self.attention_probs_dropout_prob = attention_probs_dropout_prob
131
+ self.max_position_embeddings = max_position_embeddings
132
+ self.type_vocab_size = type_vocab_size
133
+ self.initializer_range = initializer_range
134
+ self.layer_norm_type = layer_norm_type
135
+ self.layer_norm_eps = layer_norm_eps
136
+ self.position_embedding_type = position_embedding_type
137
+ self.rope_theta = rope_theta
138
+ self.rope_scaling = rope_scaling
139
+ self.classifier_dropout = classifier_dropout
140
+
141
+ self.pack_qkv = pack_qkv
142
+ self.unpad_inputs = unpad_inputs
143
+ self.use_memory_efficient_attention = use_memory_efficient_attention
144
+ self.logn_attention_scale = logn_attention_scale
145
+ self.logn_attention_clip1 = logn_attention_clip1
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f813720e8b536a661d3048b868e8405cc63a902c23edf8ef7f5311253a7cce79
+ size 1221487872
modeling.py ADDED
@@ -0,0 +1,1418 @@
1
+ # coding=utf-8
2
+ # Copyright 2024 The GTE Team Authors and Alibaba Group.
3
+ # Copyright (c) 2018, NVIDIA CORPORATION. All rights reserved.
4
+ #
5
+ # Licensed under the Apache License, Version 2.0 (the "License");
6
+ # you may not use this file except in compliance with the License.
7
+ # You may obtain a copy of the License at
8
+ #
9
+ # http://www.apache.org/licenses/LICENSE-2.0
10
+ #
11
+ # Unless required by applicable law or agreed to in writing, software
12
+ # distributed under the License is distributed on an "AS IS" BASIS,
13
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14
+ # See the License for the specific language governing permissions and
15
+ # limitations under the License.
16
+ """PyTorch NEW model."""
17
+
18
+ import math
19
+ from dataclasses import dataclass
20
+ from typing import List, Optional, Tuple, Union
21
+
22
+ import torch
23
+ import torch.utils.checkpoint
24
+ from torch import nn
25
+
26
+ from transformers.activations import ACT2FN
27
+ from transformers.modeling_outputs import (
28
+ BaseModelOutput,
29
+ BaseModelOutputWithPooling,
30
+ MaskedLMOutput,
31
+ MultipleChoiceModelOutput,
32
+ QuestionAnsweringModelOutput,
33
+ SequenceClassifierOutput,
34
+ ModelOutput,
35
+ )
36
+ from transformers.modeling_utils import PreTrainedModel
37
+ from transformers.utils import logging
38
+
39
+ try:
40
+ import xformers.ops as xops
41
+ except ImportError as e:
42
+ xops = None
43
+
44
+ from .configuration import NewConfig
45
+
46
+
47
+ logger = logging.get_logger(__name__)
48
+
49
+
50
+ # Adapted from https://github.com/HazyResearch/flash-attention/blob/main/flash_attn/bert_padding.py
51
+ # Which was adapted from https://github.com/mlcommons/training_results_v1.1/blob/main/NVIDIA/benchmarks/bert/implementations/pytorch/padding.py
52
+ class IndexFirstAxis(torch.autograd.Function):
53
+ @staticmethod
54
+ def forward(ctx, input, indices):
55
+ ctx.save_for_backward(indices)
56
+ assert input.ndim >= 2
57
+ ctx.first_axis_dim, other_shape = input.shape[0], input.shape[1:]
58
+ second_dim = other_shape.numel()
59
+ # TD [2022-03-04] For some reason torch.gather is a bit faster than indexing.
60
+ # return input[indices]
61
+ # return torch.gather(
62
+ # rearrange(input, "b ... -> b (...)"), 0, repeat(indices, "z -> z d", d=second_dim)
63
+ # ).reshape(-1, *other_shape)
64
+ return torch.gather(
65
+ input.view(ctx.first_axis_dim, second_dim),
66
+ 0,
67
+ indices.unsqueeze(-1).expand(indices.size(0), second_dim)
68
+ ).reshape(-1, *other_shape)
69
+
70
+ @staticmethod
71
+ def backward(ctx, grad_output):
72
+ (indices,) = ctx.saved_tensors
73
+ assert grad_output.ndim >= 2
74
+ other_shape = grad_output.shape[1:]
75
+ # grad_output = rearrange(grad_output, "b ... -> b (...)")
76
+ grad_output = grad_output.view(grad_output.size(0), other_shape.numel())
77
+ grad_input = torch.zeros(
78
+ [ctx.first_axis_dim, grad_output.shape[1]],
79
+ device=grad_output.device,
80
+ dtype=grad_output.dtype,
81
+ )
82
+ # TD [2022-03-04] For some reason torch.scatter is a bit faster than indexing.
83
+ # grad_input[indices] = grad_output
84
+ # grad_input.scatter_(0, repeat(indices, "z -> z d", d=grad_output.shape[1]), grad_output)
85
+ grad_input.scatter_(
86
+ 0, indices.unsqueeze(-1).expand(indices.size(0), grad_output.size(1)), grad_output
87
+ )
88
+ return grad_input.reshape(ctx.first_axis_dim, *other_shape), None
89
+
90
+
91
+ index_first_axis = IndexFirstAxis.apply
92
+
93
+
94
+ def unpad_input(hidden_states, attention_mask=None, indices=None):
95
+ """
96
+ Arguments:
97
+ hidden_states: (batch, seqlen, ...)
98
+ attention_mask: (batch, seqlen), bool / int, 1 means valid and 0 means not valid.
99
+ indices: (total_nnz), the indices of non-masked tokens from the flattened input sequence.
100
+ Return:
101
+ hidden_states: (total_nnz, ...), where total_nnz = number of tokens selected in attention_mask.
102
+ """
103
+ if indices is None:
104
+ assert attention_mask is not None
105
+ indices = torch.nonzero(attention_mask.flatten(), as_tuple=False).flatten()
106
+
107
+ # TD [2022-03-04] We don't want to index with a bool mask, because Pytorch will expand the
108
+ # bool mask, then call nonzero to get the indices, then index with those. The indices is @dim
109
+ # times larger than it needs to be, wasting memory. It's faster and more memory-efficient to
110
+ # index with integer indices. Moreover, torch's index is a bit slower than it needs to be,
111
+ # so we write custom forward and backward to make it a bit faster.
112
+ hidden_states = hidden_states.view(-1, *hidden_states.shape[2:])
113
+ return index_first_axis(hidden_states, indices)
114
+
115
+
116
+ class IndexPutFirstAxis(torch.autograd.Function):
117
+ @staticmethod
118
+ def forward(
119
+ ctx,
120
+ values: torch.Tensor,
121
+ indices: torch.Tensor,
122
+ first_axis_dim
123
+ ) -> torch.Tensor:
124
+ ctx.save_for_backward(indices)
125
+ assert indices.ndim == 1
126
+ assert values.ndim >= 2
127
+ output = torch.zeros(
128
+ first_axis_dim, *values.shape[1:], device=values.device, dtype=values.dtype
129
+ )
130
+ output[indices] = values
131
+ return output
132
+
133
+ @staticmethod
134
+ def backward(ctx, grad_output: torch.Tensor) -> Tuple[torch.Tensor, None, None]:
135
+ indices, = ctx.saved_tensors
136
+ grad_values = grad_output[indices]
137
+ return grad_values, None, None
138
+
139
+
140
+ index_put_first_axis = IndexPutFirstAxis.apply
141
+
142
+
143
+ def pad_input(inputs: torch.Tensor, indices: torch.Tensor, batch: int, seqlen: int) -> torch.Tensor:
144
+ """Add padding to sequences.
145
+
146
+ Arguments:
147
+ inputs: (total_nnz, ...), where total_nnz = number of tokens selected in attention_mask.
148
+ indices: (total_nnz), `indices = torch.nonzero(attention_mask.flatten(), as_tuple=False).flatten()`
149
+ batch: int batch_size
150
+ seqlen: int max sequence length
151
+
152
+ Returns:
153
+ inputs: (batch, seqlen, ...)
154
+ """
155
+ output = index_put_first_axis(inputs, indices, batch * seqlen)
156
+ return output.view(batch, seqlen, *inputs.shape[1:])
157
+
158
+
159
+ def rotate_half(x):
160
+ """Rotates half the hidden dims of the input."""
161
+ x1 = x[..., : x.shape[-1] // 2]
162
+ x2 = x[..., x.shape[-1] // 2 :]
163
+ return torch.cat((-x2, x1), dim=-1)
164
+
165
+
166
+ def apply_rotary_pos_emb(q, k, cos, sin):
167
+ """Applies Rotary Position Embedding to the query and key tensors.
168
+
169
+ Args:
170
+ q (`torch.Tensor`): The query tensor.
171
+ k (`torch.Tensor`): The key tensor.
172
+ cos (`torch.Tensor`): The cosine part of the rotary embedding.
173
+ sin (`torch.Tensor`): The sine part of the rotary embedding.
174
+ Returns:
175
+ `tuple(torch.Tensor)` comprising of the query and key tensors rotated using the Rotary Position Embedding.
176
+ """
177
+ cos, sin = cos.to(q.dtype), sin.to(q.dtype)
178
+ q_embed = (q * cos) + (rotate_half(q) * sin)
179
+ k_embed = (k * cos) + (rotate_half(k) * sin)
180
+ return q_embed, k_embed
181
+
182
+
183
+ class RotaryEmbedding(torch.nn.Module):
184
+ def __init__(self, dim, max_position_embeddings=512, base=10000.0, device=None):
185
+ super().__init__()
186
+
187
+ self.dim = dim
188
+ self.max_position_embeddings = max_position_embeddings
189
+ self.base = base
190
+ inv_freq = 1.0 / (self.base ** (torch.arange(0, self.dim, 2).float().to(device) / self.dim))
191
+ self.register_buffer("inv_freq", inv_freq, persistent=False)
192
+
193
+ # Build here to make `torch.jit.trace` work.
194
+ self._set_cos_sin_cache(
195
+ seq_len=max_position_embeddings, device=self.inv_freq.device, dtype=torch.get_default_dtype()
196
+ )
197
+
198
+ def _set_cos_sin_cache(self, seq_len, device, dtype):
199
+ self.max_seq_len_cached = seq_len
200
+ t = torch.arange(self.max_seq_len_cached, device=device, dtype=torch.float32)
201
+
202
+ freqs = torch.einsum("i,j->ij", t, self.inv_freq)
203
+ # Different from paper, but it uses a different permutation in order to obtain the same calculation
204
+ emb = torch.cat((freqs, freqs), dim=-1)
205
+ self.register_buffer("cos_cached", emb.cos().to(dtype), persistent=False)
206
+ self.register_buffer("sin_cached", emb.sin().to(dtype), persistent=False)
207
+
208
+ def forward(self, x, seq_len=None):
209
+ # x: [bs, num_attention_heads, seq_len, head_size]
210
+ if seq_len > self.max_seq_len_cached:
211
+ self._set_cos_sin_cache(seq_len=seq_len, device=x.device, dtype=x.dtype)
212
+
213
+ return (
214
+ self.cos_cached[:seq_len, ...].to(dtype=x.dtype),
215
+ self.sin_cached[:seq_len, ...].to(dtype=x.dtype),
216
+ )
217
+
218
+
219
+ class NTKScalingRotaryEmbedding(RotaryEmbedding):
220
+ """RotaryEmbedding extended with fixed and mixed NTK scaling. https://kexue.fm/archives/9706 """
221
+
222
+ def __init__(self, dim, max_position_embeddings=512, base=10000, device=None, scaling_factor=1.0, mixed_b=None):
223
+ self.scaling_factor = scaling_factor
224
+ self.mixed_b = mixed_b
225
+ super().__init__(dim, max_position_embeddings, base, device)
226
+ max_position_embeddings = max_position_embeddings * self.scaling_factor
227
+ self._set_cos_sin_cache(max_position_embeddings, self.inv_freq.device, torch.get_default_dtype())
228
+
229
+ def _set_cos_sin_cache(self, seq_len, device, dtype):
230
+ self.max_seq_len_cached = seq_len
231
+
232
+ if seq_len > self.max_position_embeddings:
233
+ base = self.base * (self.scaling_factor if self.mixed_b is None else 1)
234
+ inv_freq = 1.0 / (base ** (torch.arange(0, self.dim, 2).float().to(device) / self.dim))
235
+
236
+ if self.mixed_b is None:
237
+ inv_freq = inv_freq / self.scaling_factor ** (2 / self.dim) # (6)
238
+ else:
239
+ a = torch.tensor(self.scaling_factor).log() / (self.dim / 2) ** self.mixed_b # (13)
240
+ lambda_1_m = (a * torch.arange(1, self.dim // 2 + 1).float().to(device) ** self.mixed_b).exp() # (12)
241
+ inv_freq = inv_freq / lambda_1_m # (10)
242
+
243
+ self.register_buffer("inv_freq", inv_freq, persistent=False)
244
+
245
+ t = torch.arange(self.max_seq_len_cached, device=device, dtype=torch.float32)
246
+
247
+ freqs = torch.einsum("i,j->ij", t, self.inv_freq)
248
+ # Different from paper, but it uses a different permutation in order to obtain the same calculation
249
+ emb = torch.cat((freqs, freqs), dim=-1)
250
+ self.register_buffer("cos_cached", emb.cos().to(dtype), persistent=False)
251
+ self.register_buffer("sin_cached", emb.sin().to(dtype), persistent=False)
252
+
253
+
254
+ class RMSNorm(nn.Module):
255
+ def __init__(self, hidden_size, eps=1e-6):
256
+ """
257
+ RMSNorm is equivalent to T5LayerNorm
258
+ """
259
+ super().__init__()
260
+ self.weight = nn.Parameter(torch.ones(hidden_size))
261
+ self.variance_epsilon = eps
262
+
263
+ def forward(self, hidden_states):
264
+ input_dtype = hidden_states.dtype
265
+ hidden_states = hidden_states.to(torch.float32)
266
+ variance = hidden_states.pow(2).mean(-1, keepdim=True)
267
+ hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
268
+ return self.weight * hidden_states.to(input_dtype)
269
+
270
+
271
+ LAYER_NORM = {
272
+ 'layer_norm': nn.LayerNorm,
273
+ 'rms_norm': RMSNorm
274
+ }
275
+
276
+
277
+ class NewEmbeddings(nn.Module):
278
+ """
279
+ Embedding and Unpadding.
280
+ """
281
+
282
+ def __init__(self, config: NewConfig):
283
+ super().__init__()
284
+ self.padding_idx = config.pad_token_id
285
+ self.word_embeddings = nn.Embedding(
286
+ config.vocab_size, config.hidden_size, padding_idx=self.padding_idx
287
+ )
288
+
289
+ self.position_embedding_type = config.position_embedding_type
290
+ if self.position_embedding_type == 'absolute':
291
+ self.position_embeddings = nn.Embedding(
292
+ config.max_position_embeddings, config.hidden_size, padding_idx=self.padding_idx
293
+ )
294
+ elif self.position_embedding_type == 'rope':
295
+ self._init_rope(config)
296
+ else:
297
+ raise ValueError
298
+
+         self.type_vocab_size = config.type_vocab_size
+         if self.type_vocab_size > 0:
+             self.token_type_embeddings = nn.Embedding(config.type_vocab_size, config.hidden_size)
+
+         # self.LayerNorm is not snake-cased to stick with TensorFlow model variable name and be able to load
+         # any TensorFlow checkpoint file
+         self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
+         self.dropout = nn.Dropout(config.hidden_dropout_prob)
+         # position_ids is contiguous in memory and excluded when serialized
+         self.register_buffer(
+             "position_ids", torch.arange(config.max_position_embeddings), persistent=False
+         )
+
+     def _init_rope(self, config):
+         kwargs = dict(
+             dim=int(config.hidden_size / config.num_attention_heads),
+             max_position_embeddings=config.max_position_embeddings,
+             base=config.rope_theta
+         )
+         if config.rope_scaling is None:
+             self.rotary_emb = RotaryEmbedding(**kwargs)
+         else:
+             kwargs.update(scaling_factor=config.rope_scaling["factor"])
+             scaling_type = config.rope_scaling["type"]
+             if scaling_type == 'ntk':
+                 kwargs.update(mixed_b=config.rope_scaling.get('mixed_b', None))
+                 self.rotary_emb = NTKScalingRotaryEmbedding(**kwargs)
+             # elif scaling_type == "linear":
+             #     self.rotary_emb = LinearScalingRotaryEmbedding(**kwargs)
+             # elif scaling_type == "dynamic":
+             #     self.rotary_emb = DynamicNTKScalingRotaryEmbedding(**kwargs)
+             else:
+                 raise ValueError(f"Unknown RoPE scaling type {scaling_type}")
+
+     def forward(
+         self,
+         unpad_inputs: bool,
+         input_ids: Optional[torch.Tensor] = None,
+         attention_mask: Optional[torch.Tensor] = None,
+         length: Optional[List[int]] = None,
+         token_type_ids: Optional[torch.Tensor] = None,
+         position_ids: Optional[torch.Tensor] = None,
+         inputs_embeds: Optional[torch.Tensor] = None,
+     ) -> Tuple[torch.Tensor, torch.Tensor, Optional[Tuple], Optional[List[int]]]:
+         if inputs_embeds is None:
+             device, input_shape = input_ids.device, input_ids.shape
+         else:
+             device, input_shape = inputs_embeds.device, inputs_embeds.shape[:2]
+         batch_size, seq_length = input_shape
+
+         # Set attention_mask if it's None
+         if attention_mask is None:
+             attention_mask = torch.ones(input_shape, device=device)
+             if length is not None:
+                 for i, l in enumerate(length):
+                     attention_mask[i, l:] = 0
+
+         # Set attention_mask_bool for unpadding
+         if unpad_inputs:
+             attention_mask_bool = attention_mask.bool()
+             if length is None:
+                 length = attention_mask.sum(-1).tolist()
+
+         # Get word embeddings
+         if inputs_embeds is None:
+             if unpad_inputs:
+                 input_ids = input_ids[attention_mask_bool].unsqueeze(0)
+             inputs_embeds = self.word_embeddings(input_ids)
+         else:
+             if unpad_inputs:
+                 inputs_embeds = inputs_embeds[attention_mask_bool].unsqueeze(0)
+         embeddings = inputs_embeds
+
+         # Set and unpad position_ids
+         if position_ids is None:
+             if seq_length > self.position_ids.size(0):
+                 self.register_buffer(
+                     "position_ids", torch.arange(seq_length, device=embeddings.device), persistent=False
+                 )
+             if unpad_inputs:
+                 # [1, cumsum_seq_len]
+                 position_ids = torch.cat([self.position_ids[:l] for l in length]).unsqueeze(0)
+             else:
+                 # [bs, seq_len]
+                 position_ids = self.position_ids[:seq_length].expand(batch_size, -1)
+         elif unpad_inputs:
+             position_ids = position_ids[attention_mask_bool].unsqueeze(0)  # [1, cumsum_seq_len]
+
+         # Compute rotary embedding
+         if self.position_embedding_type == 'rope':
+             rope_cos, rope_sin = self.rotary_emb(inputs_embeds, seq_len=seq_length)
+             rope_cos = rope_cos[position_ids].unsqueeze(2)  # [bs, seq_len, 1, dim]
+             rope_sin = rope_sin[position_ids].unsqueeze(2)  # [bs, seq_len, 1, dim]
+             rope_embeds = rope_cos, rope_sin
+         else:
+             rope_embeds = None
+
+         if self.type_vocab_size > 0:
+             if token_type_ids is None:
+                 token_type_ids = position_ids.mul(0)
+             else:
+                 if self.type_vocab_size < 2:
+                     token_type_ids.mul_(0)
+                 if unpad_inputs:
+                     token_type_ids = token_type_ids[attention_mask_bool].unsqueeze(0)
+
+             token_type_embeddings = self.token_type_embeddings(token_type_ids)
+             embeddings = embeddings + token_type_embeddings
+
+         # BERT position
+         if self.position_embedding_type == "absolute":
+             position_embeddings = self.position_embeddings(position_ids)
+             embeddings = embeddings + position_embeddings
+
+         embeddings = self.LayerNorm(embeddings)
+         embeddings = self.dropout(embeddings)
+
+         return embeddings, attention_mask, rope_embeds, length
+
+
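For reference, the `'rope'` path above encodes position by rotating consecutive feature pairs of the query/key vectors; the `rope_cos`/`rope_sin` tables are just the per-position rotation angles. A stdlib-only sketch of the underlying identity — an illustrative scalar version, not the tensorized `apply_rotary_pos_emb` this file actually calls:

```python
import math

def rope_rotate(vec, pos, base=10000.0):
    """Rotate each consecutive (even, odd) feature pair of `vec` by a
    position-dependent angle pos * base**(-i/dim), the core of RoPE."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        theta = pos * base ** (-i / dim)
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * cos_t - y * sin_t, x * sin_t + y * cos_t])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# The rotated q.k dot product depends only on the relative position offset.
q, k = [1.0, 0.0, 0.5, 0.5], [0.2, 0.3, 0.1, 0.7]
s1 = dot(rope_rotate(q, 3), rope_rotate(k, 1))   # offset 2
s2 = dot(rope_rotate(q, 7), rope_rotate(k, 5))   # offset 2 again
assert abs(s1 - s2) < 1e-9
```

This relative-offset property is why absolute position embeddings can be dropped when RoPE is enabled.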
+ class NewAttention(nn.Module):
+     def __init__(self, config: NewConfig, pack_qkv=None, use_memory_efficient_attention=None):
+         super().__init__()
+         self.config = config
+         if config.hidden_size % config.num_attention_heads != 0 and not hasattr(config, "embedding_size"):
+             raise ValueError(
+                 f"The hidden size ({config.hidden_size}) is not a multiple of the number of attention "
+                 f"heads ({config.num_attention_heads})"
+             )
+
+         self.hidden_size = config.hidden_size
+         self.num_attention_heads = config.num_attention_heads
+         self.attention_head_size = int(config.hidden_size / config.num_attention_heads)
+         self.all_head_size = self.num_attention_heads * self.attention_head_size
+
+         if pack_qkv is None:
+             pack_qkv = config.pack_qkv
+         self.pack_qkv = pack_qkv
+
+         if self.pack_qkv:
+             self.qkv_proj = nn.Linear(config.hidden_size, self.all_head_size * 3, bias=True)
+         else:
+             self.q_proj = nn.Linear(config.hidden_size, self.all_head_size, bias=True)
+             self.k_proj = nn.Linear(config.hidden_size, self.all_head_size, bias=True)
+             self.v_proj = nn.Linear(config.hidden_size, self.all_head_size, bias=True)
+
+         self.dropout = nn.Dropout(config.attention_probs_dropout_prob)
+         self.o_proj = nn.Linear(config.hidden_size, config.hidden_size, bias=True)
+
+         if use_memory_efficient_attention is None:
+             use_memory_efficient_attention = self.config.use_memory_efficient_attention
+         self.use_memory_efficient_attention = use_memory_efficient_attention
+         self.memory_efficient_attention = None if xops is None else xops.memory_efficient_attention
+         if self.use_memory_efficient_attention:
+             assert self.memory_efficient_attention is not None, 'please install xformers'
+
+     def forward(
+         self,
+         hidden_states: torch.Tensor,
+         attention_bias: torch.FloatTensor,
+         rope_embeds: Optional[Tuple[torch.FloatTensor, torch.FloatTensor]] = None,
+         padding_inputs: Optional[Tuple] = None,  # indices, batch, seqlen
+         attention_scale: Optional[torch.FloatTensor] = None,
+         head_mask: Optional[torch.FloatTensor] = None,
+         output_attentions: Optional[bool] = False,
+         qkv_inputs: Optional[Tuple] = None,  # For RetroMAE
+     ) -> Tuple[torch.Tensor, ...]:
+         shape_hd = (self.num_attention_heads, self.attention_head_size)
+         # qkv
+         if self.pack_qkv and qkv_inputs is None:
+             qkv_pack = self.qkv_proj(hidden_states).split(self.all_head_size, dim=-1)
+         else:
+             if qkv_inputs is None:
+                 qkv_inputs = (hidden_states, hidden_states, hidden_states)
+             qkv_pack = [
+                 getattr(self, n + '_proj')(s) for s, n in zip(qkv_inputs, 'qkv')
+             ]
+         query_states, key_states, value_states = [t.view(t.shape[:-1] + shape_hd) for t in qkv_pack]
+
+         if self.config.position_embedding_type == 'rope':
+             query_states, key_states = apply_rotary_pos_emb(query_states, key_states, *rope_embeds)
+
+         dtype = query_states.dtype
+
+         if self.config.logn_attention_scale and attention_scale is not None:
+             # https://kexue.fm/archives/8823
+             query_states = query_states * attention_scale.to(dtype)
+
+         if padding_inputs is not None:
+             query_states = pad_input(query_states.squeeze(), *padding_inputs)
+             key_states = pad_input(key_states.squeeze(), *padding_inputs)
+             value_states = pad_input(value_states.squeeze(), *padding_inputs)
+
+         if self.use_memory_efficient_attention:
+             assert self.memory_efficient_attention is not None, "xformers is not loaded"
+             assert output_attentions is False, "memory_efficient_attention does not output attention weights"
+             assert head_mask is None, "head_mask is not supported yet"
+             attention_probs = None
+             if torch.is_tensor(attention_bias):
+                 attention_bias = attention_bias.to(dtype)
+             context_layer = self.memory_efficient_attention(
+                 query_states,
+                 key_states,
+                 value_states,
+                 attn_bias=attention_bias,
+                 p=self.dropout.p
+             )
+         else:
+             if output_attentions and isinstance(self, NewSdpaAttention):
+                 raise RuntimeError("SDPA does not output attention weights")
+             context_layer, attention_probs = self._attention(
+                 query_states, key_states, value_states, attention_bias, head_mask
+             )
+
+         if padding_inputs is not None:
+             context_layer = unpad_input(context_layer, indices=padding_inputs[0])
+
+         new_context_layer_shape = context_layer.size()[:-2] + (self.all_head_size,)
+         context_layer = context_layer.view(new_context_layer_shape)
+
+         # output proj
+         attn_output = self.o_proj(context_layer)
+
+         # add attentions if we output them
+         outputs = (attn_output, attention_probs) if output_attentions else (attn_output,)
+         return outputs
+
+     def _attention(self, query_states, key_states, value_states, attention_bias, head_mask):
+         """
+         Args:
+             q/k/v: (B, L, n_head, head_dim)
+         Returns:
+             attn_output: (B, L, n_head, head_dim)
+         """
+         query_states = query_states.transpose(1, 2)
+         key_states = key_states.transpose(1, 2)
+         value_states = value_states.transpose(1, 2)
+         # Take the dot product between "query" and "key" to get the raw attention scores.
+         attention_scores = torch.matmul(query_states, key_states.transpose(-1, -2))
+
+         attention_scores = attention_scores / math.sqrt(self.attention_head_size)
+         if attention_bias is not None:
+             # Apply the attention mask (precomputed for all layers in the model's forward() function)
+             attention_scores = attention_scores + attention_bias
+
+         # Normalize the attention scores to probabilities.
+         attention_probs = nn.functional.softmax(attention_scores, dim=-1)
+
+         # This is actually dropping out entire tokens to attend to, which might
+         # seem a bit unusual, but is taken from the original Transformer paper.
+         if self.dropout.p > 0:
+             attention_probs = self.dropout(attention_probs)
+
+         # Mask heads if we want to
+         if head_mask is not None:
+             attention_probs = attention_probs * head_mask
+
+         context_layer = torch.matmul(attention_probs, value_states)
+
+         context_layer = context_layer.permute(0, 2, 1, 3).contiguous()
+         return context_layer, attention_probs
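`_attention` above is standard eager softmax attention: scale the q·k scores by 1/sqrt(head_dim), add the optional additive bias (the mask), softmax, then take a probability-weighted sum of the values. A minimal single-query, single-head sketch in plain Python — names are illustrative and no torch is required:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attend(query, keys, values, bias=None):
    """Single-query scaled dot-product attention.

    scores_j = (q . k_j) / sqrt(d) + bias_j
    output   = sum_j softmax(scores)_j * v_j
    A very negative bias entry masks that position out, as in the code above.
    """
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    if bias is not None:
        scores = [s + b for s, b in zip(scores, bias)]
    probs = softmax(scores)
    dim_v = len(values[0])
    out = [sum(p * v[i] for p, v in zip(probs, values)) for i in range(dim_v)]
    return out, probs

# Masking the second position pushes its probability to ~0.
out, probs = attend([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], [[1.0], [2.0]], bias=[0.0, -1e9])
assert probs[1] < 1e-6
assert abs(out[0] - 1.0) < 1e-6
```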
+
+
+ class NewSdpaAttention(NewAttention):
+     """
+     New attention module using torch.nn.functional.scaled_dot_product_attention. This module inherits from
+     `NewAttention`, as the weights of the module stay untouched. The only changes are on the forward pass to adapt
+     to the SDPA API.
+     """
+     def __init__(self, config: NewConfig, **kwargs):
+         super().__init__(config, **kwargs)
+         # torch.backends.cuda.enable_mem_efficient_sdp(False)
+         # logger.warning(
+         #     "Disable memory efficient attention kernel for `NewSdpaAttention`, you can set "
+         #     "`use_memory_efficient_attention=True` if it is expected to be used."
+         # )
+
+     def _attention(self, query_states, key_states, value_states, attention_bias, head_mask):
+         attn_output = torch.nn.functional.scaled_dot_product_attention(
+             query_states.transpose(1, 2),
+             key_states.transpose(1, 2),
+             value_states.transpose(1, 2),
+             attn_mask=attention_bias,
+             dropout_p=self.dropout.p if self.training else 0.0,
+         )
+         attn_output = attn_output.permute(0, 2, 1, 3).contiguous()
+         return attn_output, None
+
+
+ NEW_ATTENTION_CLASSES = {
+     "eager": NewAttention,
+     # "flash_attention_2": ,  # TODO
+     "sdpa": NewSdpaAttention,
+ }
+
+
+ class NewGatedMLP(nn.Module):
+     """
+     GLU Variants Improve Transformer.
+     """
+
+     def __init__(self, config: NewConfig):
+         super().__init__()
+         self.intermediate_size = config.intermediate_size
+         self.up_gate_proj = nn.Linear(config.hidden_size, self.intermediate_size * 2, bias=False)
+         self.down_proj = nn.Linear(self.intermediate_size, config.hidden_size, bias=True)
+         self.act_fn = ACT2FN[config.hidden_act]
+         if config.hidden_dropout_prob > 0:
+             self.hidden_dropout = nn.Dropout(config.hidden_dropout_prob)
+         else:
+             self.hidden_dropout = None
+
+     def forward(self, hidden_states):
+         up_gate = self.up_gate_proj(hidden_states)
+         up_states, gate = torch.split(up_gate, self.intermediate_size, dim=-1)
+         gate = self.act_fn(gate)
+         gated_states = gate * up_states
+         if self.hidden_dropout is not None:
+             gated_states = self.hidden_dropout(gated_states)
+         down_states = self.down_proj(gated_states)
+         return down_states
+
+
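`NewGatedMLP` splits one `up_gate_proj` output into an up branch and a gate branch, applies the activation only to the gate, and multiplies the two before `down_proj`. A toy numeric sketch of that gating, assuming a GELU activation (tanh approximation) and omitting the down projection for brevity — all names here are illustrative:

```python
import math

def gelu(x):
    # tanh approximation of GELU, a common choice for hidden_act="gelu"
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

def gated_mlp(hidden, w_up, w_gate):
    """up_i = hidden . w_up_i ; gate_i = act(hidden . w_gate_i) ; out_i = gate_i * up_i.

    Mirrors the split of `up_gate_proj` into `up_states` and `gate` above,
    with the final `down_proj` treated as identity to keep the sketch short.
    """
    up = [sum(h * w for h, w in zip(hidden, col)) for col in w_up]
    gate = [gelu(sum(h * w for h, w in zip(hidden, col))) for col in w_gate]
    return [g * u for g, u in zip(gate, up)]

# A strongly negative gate pre-activation shuts its up state off (gelu(-10) ~ 0),
# while a strongly positive one passes it through scaled by ~the pre-activation.
out = gated_mlp([1.0, 2.0], w_up=[[1.0, 0.0], [0.0, 1.0]], w_gate=[[10.0, 0.0], [-10.0, 0.0]])
assert abs(out[0] - 10.0) < 1e-3   # gelu(10) ~ 10, up state 1.0
assert abs(out[1]) < 1e-3          # gelu(-10) ~ 0 gates the up state 2.0 away
```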
+ class NewLayer(nn.Module):
+     def __init__(
+         self,
+         config: NewConfig,
+         pack_qkv=None,
+         use_memory_efficient_attention=None,
+         attn_implementation=None
+     ):
+         super().__init__()
+         if attn_implementation is None:
+             attn_implementation = config._attn_implementation
+         if use_memory_efficient_attention is None:
+             use_memory_efficient_attention = config.use_memory_efficient_attention
+         if use_memory_efficient_attention:
+             if attn_implementation != 'eager':
+                 logger.warning_once(f"Override {attn_implementation=} to 'eager' as {use_memory_efficient_attention=}")
+                 attn_implementation = 'eager'  # Since it will be SDPA by default for torch>=2.1.1
+         self.attention = NEW_ATTENTION_CLASSES[attn_implementation](
+             config, pack_qkv=pack_qkv, use_memory_efficient_attention=use_memory_efficient_attention
+         )
+         self.mlp = NewGatedMLP(config)
+
+         ln_class = LAYER_NORM[config.layer_norm_type]
+         self.attn_ln = ln_class(config.hidden_size, eps=config.layer_norm_eps)
+         self.mlp_ln = ln_class(config.hidden_size, eps=config.layer_norm_eps)
+
+         if config.hidden_dropout_prob > 0:
+             self.hidden_dropout = nn.Dropout(config.hidden_dropout_prob)
+         else:
+             self.hidden_dropout = None
+
+     def forward(
+         self,
+         hidden_states: torch.Tensor,
+         attention_bias: torch.FloatTensor,
+         rope_embeds: Optional[Tuple[torch.FloatTensor, torch.FloatTensor]] = None,
+         padding_inputs: Optional[Tuple] = None,  # indices, batch, seqlen
+         attention_scale: Optional[torch.FloatTensor] = None,
+         subset_indices: Optional[torch.LongTensor] = None,
+         head_mask: Optional[torch.FloatTensor] = None,
+         output_attentions: Optional[bool] = False,
+         qkv_inputs: Optional[Tuple] = None,  # For RetroMAE
+     ) -> Tuple[torch.Tensor, ...]:
+         # Multi-head self-attention
+         residual = hidden_states if qkv_inputs is None else qkv_inputs[0]
+         attention_outputs = self.attention(
+             hidden_states,
+             attention_bias,
+             rope_embeds,
+             padding_inputs,
+             attention_scale,
+             head_mask,
+             output_attentions=output_attentions,
+             qkv_inputs=qkv_inputs,
+         )
+         hidden_states = attention_outputs[0]
+         if self.hidden_dropout is not None:
+             hidden_states = self.hidden_dropout(hidden_states)
+         hidden_states = residual + hidden_states
+
+         # In pretraining, after the attention of the last layer, we only need the masked tokens.
+         if subset_indices is not None:
+             hidden_states = hidden_states[subset_indices]
+
+         hidden_states = self.attn_ln(hidden_states)
+
+         # Fully connected
+         residual = hidden_states
+         hidden_states = self.mlp(hidden_states)
+         if self.hidden_dropout is not None:
+             hidden_states = self.hidden_dropout(hidden_states)
+         hidden_states = residual + hidden_states
+         hidden_states = self.mlp_ln(hidden_states)
+
+         # add self attentions if we output attention weights
+         outputs = (hidden_states,) + attention_outputs[1:]
+         return outputs
+
+
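`NewLayer` uses a post-norm residual layout: the residual is added first and the norm (`attn_ln` / `mlp_ln`) is applied after each sublayer, unlike pre-norm blocks that normalize before the sublayer. A scalar sketch of that control flow with stand-in sublayers — all names here are illustrative:

```python
def layer_norm(vec, eps=1e-12):
    """Normalize a vector to zero mean and unit variance (no learned affine)."""
    n = len(vec)
    mean = sum(vec) / n
    var = sum((x - mean) ** 2 for x in vec) / n
    return [(x - mean) / (var + eps) ** 0.5 for x in vec]

def post_norm_block(x, sublayer):
    # residual add first, then normalize -- the order NewLayer.forward uses
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

def toy_layer(x):
    # attention and MLP stand-ins: any token-wise transforms work for the sketch
    attn = lambda v: [0.5 * t for t in v]
    mlp = lambda v: [t + 1.0 for t in v]
    x = post_norm_block(x, attn)
    x = post_norm_block(x, mlp)
    return x

out = toy_layer([1.0, 2.0, 3.0])
mean = sum(out) / len(out)
assert abs(mean) < 1e-9  # a post-norm layer's output is always re-normalized
```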
+ class NewEncoder(nn.Module):
+     def __init__(self, config):
+         super().__init__()
+         self.config = config
+         self.layer = nn.ModuleList([NewLayer(config) for _ in range(config.num_hidden_layers)])
+         self.gradient_checkpointing = False
+
+     def forward(
+         self,
+         hidden_states: torch.Tensor,
+         attention_bias: Optional[torch.FloatTensor] = None,
+         rope_embeds: Optional[Tuple[torch.FloatTensor, torch.FloatTensor]] = None,
+         padding_inputs: Optional[Tuple] = None,  # indices, batch, seqlen
+         attention_scale: Optional[torch.FloatTensor] = None,
+         subset_indices: Optional[torch.LongTensor] = None,
+         head_mask: Optional[torch.FloatTensor] = None,
+         output_attentions: Optional[bool] = False,
+         output_hidden_states: Optional[bool] = False,
+         return_dict: Optional[bool] = True,
+     ) -> Union[Tuple[torch.Tensor], BaseModelOutput]:
+         all_hidden_states = () if output_hidden_states else None
+         all_self_attentions = () if output_attentions else None
+
+         for i, layer_module in enumerate(self.layer):
+             if output_hidden_states:
+                 all_hidden_states = all_hidden_states + (hidden_states,)
+
+             if i >= len(self.layer) - 1:
+                 layer_subset_indices = subset_indices
+             else:
+                 layer_subset_indices = None
+
+             layer_head_mask = head_mask[i] if head_mask is not None else None
+
+             if self.gradient_checkpointing and self.training:
+                 layer_outputs = self._gradient_checkpointing_func(
+                     layer_module.__call__,
+                     hidden_states,
+                     attention_bias,
+                     rope_embeds,
+                     padding_inputs,
+                     attention_scale,
+                     layer_subset_indices,
+                     layer_head_mask,
+                 )
+             else:
+                 layer_outputs = layer_module(
+                     hidden_states,
+                     attention_bias,
+                     rope_embeds,
+                     padding_inputs,
+                     attention_scale,
+                     layer_subset_indices,
+                     layer_head_mask,
+                     output_attentions,
+                 )
+
+             hidden_states = layer_outputs[0]
+             if output_attentions:
+                 all_self_attentions = all_self_attentions + (layer_outputs[1],)
+
+         if output_hidden_states:
+             all_hidden_states = all_hidden_states + (hidden_states,)
+
+         if not return_dict:
+             return tuple(
+                 v
+                 for v in [
+                     hidden_states,
+                     all_hidden_states,
+                     all_self_attentions,
+                 ]
+                 if v is not None
+             )
+         return BaseModelOutput(
+             last_hidden_state=hidden_states,
+             hidden_states=all_hidden_states,
+             attentions=all_self_attentions,
+         )
+
+
+ # Copied from transformers.models.bert.modeling_bert.BertPooler with Bert->New
+ class NewPooler(nn.Module):
+     def __init__(self, config):
+         super().__init__()
+         self.dense = nn.Linear(config.hidden_size, config.hidden_size)
+         self.activation = nn.Tanh()
+
+     def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
+         # We "pool" the model by simply taking the hidden state corresponding
+         # to the first token.
+         first_token_tensor = hidden_states[:, 0]
+         pooled_output = self.dense(first_token_tensor)
+         pooled_output = self.activation(pooled_output)
+         return pooled_output
+
+
+ class NewPreTrainedModel(PreTrainedModel):
+     """
+     An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained
+     models.
+     """
+
+     config_class = NewConfig
+     base_model_prefix = "new"
+     supports_gradient_checkpointing = True
+     _supports_sdpa = True
+
+     def _init_weights(self, module):
+         """Initialize the weights"""
+         if isinstance(module, nn.Linear):
+             # Slightly different from the TF version which uses truncated_normal for initialization
+             # cf https://github.com/pytorch/pytorch/pull/5617
+             module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
+             if module.bias is not None:
+                 module.bias.data.zero_()
+         elif isinstance(module, nn.Embedding):
+             module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
+             if module.padding_idx is not None:
+                 module.weight.data[module.padding_idx].zero_()
+         elif isinstance(module, nn.LayerNorm):
+             module.bias.data.zero_()
+             module.weight.data.fill_(1.0)
+
+
+ class NewModel(NewPreTrainedModel):
+     """
+     The bare New Model transformer outputting raw hidden-states without any specific head on top.
+     """
+
+     def __init__(self, config: NewConfig, add_pooling_layer=False):
+         super().__init__(config)
+         self.config = config
+
+         self.embeddings = NewEmbeddings(config)
+         self.encoder = NewEncoder(config)
+
+         self.pooler = NewPooler(config) if add_pooling_layer else None
+
+         # Initialize weights and apply final processing
+         self.post_init()
+
+     def get_input_embeddings(self):
+         return self.embeddings.word_embeddings
+
+     def set_input_embeddings(self, value):
+         self.embeddings.word_embeddings = value
+
+     def forward(
+         self,
+         input_ids: Optional[torch.Tensor] = None,
+         attention_mask: Optional[torch.Tensor] = None,
+         length: Optional[List[int]] = None,
+         subset_indices: Optional[torch.LongTensor] = None,
+         token_type_ids: Optional[torch.Tensor] = None,
+         position_ids: Optional[torch.Tensor] = None,
+         head_mask: Optional[torch.Tensor] = None,
+         inputs_embeds: Optional[torch.Tensor] = None,
+         output_attentions: Optional[bool] = None,
+         output_hidden_states: Optional[bool] = None,
+         return_dict: Optional[bool] = None,
+         unpad_inputs: Optional[bool] = None,
+     ) -> Union[Tuple[torch.Tensor], BaseModelOutputWithPooling]:
+         r"""
+         length (`list` of length `batch_size`, *optional*):
+             If `None`, a padded `last_hidden_state` is returned.
+         subset_indices (`torch.LongTensor`, *optional*):
+             Indices of the tokens to keep after the last layer (used for pretraining).
+         unpad_inputs (`bool`, *optional*):
+             Whether to unpad the inputs; defaults to `config.unpad_inputs`.
+         """
+         output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
+         output_hidden_states = (
+             output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
+         )
+         return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+         unpad_inputs = unpad_inputs if unpad_inputs is not None else self.config.unpad_inputs
+         output_padded = length is None
+
+         if input_ids is not None and inputs_embeds is not None:
+             raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
+         elif input_ids is not None:
+             self.warn_if_padding_and_no_attention_mask(input_ids, attention_mask)
+             input_shape = input_ids.size()
+         elif inputs_embeds is not None:
+             input_shape = inputs_embeds.size()[:-1]
+         else:
+             raise ValueError("You have to specify either input_ids or inputs_embeds")
+
+         # TODO: not used
+         # # Prepare head mask if needed
+         # # 1.0 in head_mask indicate we keep the head
+         # # attention_probs has shape bsz x n_heads x N x N
+         # # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]
+         # # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]
+         # head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)
+
+         # Get embeddings, may unpad them
+         (embedding_output, attention_mask, rope_embeds, length) = self.embeddings(
+             unpad_inputs,
+             input_ids=input_ids,
+             attention_mask=attention_mask,
+             length=length,
+             token_type_ids=token_type_ids,
+             position_ids=position_ids,
+             inputs_embeds=inputs_embeds
+         )
+
+         batch_size, seq_length = input_shape
+         if unpad_inputs and self.config.use_memory_efficient_attention:
+             attention_bias = xops.fmha.attn_bias.BlockDiagonalMask.from_seqlens(length)
+         else:
+             # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]
+             # ourselves in which case we just need to make it broadcastable to all heads.
+             attention_bias = self.get_extended_attention_mask(attention_mask, input_shape)
+             if self.config.use_memory_efficient_attention:
+                 # Invalid shape for attention bias: torch.Size([48, 1, 1, 512]) (expected (48, 12, 512, 512))
+                 attention_bias = attention_bias.expand(-1, self.config.num_attention_heads, seq_length, -1)
+
+         padding_inputs = None
+         if unpad_inputs and (output_padded or not self.config.use_memory_efficient_attention):
+             indices = torch.nonzero(attention_mask.flatten(), as_tuple=False).flatten()
+             if not self.config.use_memory_efficient_attention:
+                 padding_inputs = (indices, *input_shape)
+
+         attention_scale = None
+         if self.config.logn_attention_scale:
+             logger.warning_once("TODO: logn_attention_scale")
+             # # attention scale log_512(input_len)
+             # attention_scale = attention_mask.sum(1).log() / torch.tensor(self.config.max_position_embeddings).log()
+             # # inference-time logn scale need clip 1
+             # if self.config.logn_attention_clip1:
+             #     attention_scale.clip_(1)
+             # attention_scale = attention_scale[:, None, None, None]
+         # else:
+         #     attention_scale = None
+
+         encoder_outputs = self.encoder(
+             embedding_output,
+             attention_bias=attention_bias,
+             rope_embeds=rope_embeds,
+             padding_inputs=padding_inputs,
+             attention_scale=attention_scale,
+             subset_indices=subset_indices,
+             head_mask=head_mask,
+             output_attentions=output_attentions,
+             output_hidden_states=output_hidden_states,
+             return_dict=return_dict,
+         )
+         sequence_output = encoder_outputs[0]
+         if unpad_inputs and output_padded:
+             sequence_output = pad_input(
+                 sequence_output.squeeze(), indices, batch_size, seq_length
+             )
+
+         pooled_output = self.pooler(sequence_output) if self.pooler is not None else None
+
+         if not return_dict:
+             return (sequence_output, pooled_output) + encoder_outputs[1:]
+
+         return BaseModelOutputWithPooling(
+             last_hidden_state=sequence_output,
+             pooler_output=pooled_output,
+             hidden_states=encoder_outputs.hidden_states,
+             attentions=encoder_outputs.attentions,
+         )
+
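When `unpad_inputs` is set, `NewModel` works on one concatenated row of valid tokens and only re-pads at the end (`pad_input` with the saved `indices`). A list-based sketch of that round trip using hypothetical `unpad`/`pad` helpers (length-based rather than index-based, for brevity):

```python
def unpad(batch, lengths):
    """Drop padding: concatenate the first lengths[i] tokens of each row."""
    flat = []
    for row, n in zip(batch, lengths):
        flat.extend(row[:n])
    return flat

def pad(flat, lengths, seq_len, pad_value=0):
    """Inverse of unpad: re-split `flat` by `lengths` and pad each row to `seq_len`."""
    out, i = [], 0
    for n in lengths:
        out.append(flat[i:i + n] + [pad_value] * (seq_len - n))
        i += n
    return out

batch = [[5, 6, 7, 0], [8, 9, 0, 0]]   # two sequences padded to length 4
lengths = [3, 2]
flat = unpad(batch, lengths)
assert flat == [5, 6, 7, 8, 9]          # one "cumulative" row, no padding
assert pad(flat, lengths, seq_len=4) == batch
```

Skipping the padded positions this way is what makes the xformers `BlockDiagonalMask` path cheaper for batches of uneven lengths.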
+
+ class NewLMPredictionHead(nn.Module):
+     def __init__(self, config):
+         super().__init__()
+         self.dense = nn.Linear(config.hidden_size, config.hidden_size)
+         self.transform_act_fn = ACT2FN[config.hidden_act]
+         self.norm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
+
+         # The output weights are the same as the input embeddings, but there is
+         # an output-only bias for each token.
+         self.decoder = nn.Linear(config.hidden_size, config.vocab_size)
+
+     def forward(self, hidden_states):
+         hidden_states = self.dense(hidden_states)
+         hidden_states = self.transform_act_fn(hidden_states)
+         hidden_states = self.norm(hidden_states)
+         hidden_states = self.decoder(hidden_states)
+         return hidden_states
+
+
+ class NewForMaskedLM(NewPreTrainedModel):
+     _tied_weights_keys = ["lm_head.decoder.bias", "lm_head.decoder.weight"]
+
+     def __init__(self, config: NewConfig):
+         super().__init__(config)
+         self.new = NewModel(config, add_pooling_layer=False)
+         self.lm_head = NewLMPredictionHead(config)
+         self.loss_fct = nn.CrossEntropyLoss()
+
+         # Initialize weights and apply final processing
+         self.post_init()
+
+     def get_output_embeddings(self):
+         return self.lm_head.decoder
+
+     def set_output_embeddings(self, new_embeddings):
+         self.lm_head.decoder = new_embeddings
+
+     def forward(
+         self,
+         input_ids: Optional[torch.Tensor] = None,
+         attention_mask: Optional[torch.Tensor] = None,
+         token_type_ids: Optional[torch.Tensor] = None,
+         position_ids: Optional[torch.Tensor] = None,
+         head_mask: Optional[torch.Tensor] = None,
+         inputs_embeds: Optional[torch.Tensor] = None,
+         labels: Optional[torch.Tensor] = None,
+         output_attentions: Optional[bool] = None,
+         output_hidden_states: Optional[bool] = None,
+         return_dict: Optional[bool] = None,
+         unpad_inputs: Optional[bool] = None,
+     ) -> Union[Tuple[torch.Tensor], MaskedLMOutput]:
+         r"""
+         labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
+             Labels for computing the masked language modeling loss. Indices should be in `[-100, 0, ...,
+             config.vocab_size]` (see `input_ids` docstring). Tokens with indices set to `-100` are ignored (masked);
+             the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`.
+         """
+
+         return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+
+         if labels is None or not self.new.config.unpad_inputs:
+             length = None
+             subset_indices = None
+         else:
+             length = attention_mask.sum(-1).tolist()
+             labels = labels[attention_mask.bool()].unsqueeze(0)
+             subset_indices = labels > -100
+
+         outputs = self.new(
+             input_ids,
+             attention_mask=attention_mask,
+             length=length,
+             subset_indices=subset_indices,
+             token_type_ids=token_type_ids,
+             position_ids=position_ids,
+             head_mask=head_mask,
+             inputs_embeds=inputs_embeds,
+             output_attentions=output_attentions,
+             output_hidden_states=output_hidden_states,
+             return_dict=return_dict,
+             unpad_inputs=unpad_inputs,
+         )
+
+         sequence_output = outputs[0]
+         prediction_scores = self.lm_head(sequence_output)
+
+         masked_lm_loss = None
+         if labels is not None:
+             if subset_indices is None:
+                 mask = attention_mask.bool()
+                 prediction_scores = prediction_scores[mask]
+                 labels = labels[mask]
+             else:
+                 labels = labels[subset_indices]
+             masked_lm_loss = self.loss_fct(prediction_scores, labels)
+
+         if not return_dict:
+             output = (prediction_scores,) + outputs[2:]
+             return ((masked_lm_loss,) + output) if masked_lm_loss is not None else output
+
+         return MaskedLMOutput(
+             loss=masked_lm_loss,
+             logits=prediction_scores,
+             hidden_states=outputs.hidden_states,
+             attentions=outputs.attentions,
+         )
+
+
+ class NewForSequenceClassification(NewPreTrainedModel):
1080
+ def __init__(self, config):
1081
+ super().__init__(config)
1082
+         self.num_labels = config.num_labels
+         self.config = config
+
+         self.new = NewModel(config, add_pooling_layer=True)
+         classifier_dropout = (
+             config.classifier_dropout if config.classifier_dropout is not None else config.hidden_dropout_prob
+         )
+         self.dropout = nn.Dropout(classifier_dropout)
+         self.classifier = nn.Linear(config.hidden_size, config.num_labels)
+
+         # Initialize weights and apply final processing
+         self.post_init()
+
+     def forward(
+         self,
+         input_ids: Optional[torch.Tensor] = None,
+         attention_mask: Optional[torch.Tensor] = None,
+         token_type_ids: Optional[torch.Tensor] = None,
+         position_ids: Optional[torch.Tensor] = None,
+         head_mask: Optional[torch.Tensor] = None,
+         inputs_embeds: Optional[torch.Tensor] = None,
+         labels: Optional[torch.Tensor] = None,
+         output_attentions: Optional[bool] = None,
+         output_hidden_states: Optional[bool] = None,
+         return_dict: Optional[bool] = None,
+         unpad_inputs: Optional[bool] = None,
+     ) -> Union[Tuple[torch.Tensor], SequenceClassifierOutput]:
+         r"""
+         labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
+             Labels for computing the sequence classification/regression loss. Indices should be in `[0, ...,
+             config.num_labels - 1]`. If `config.num_labels == 1`, a regression loss is computed (Mean-Square loss); if
+             `config.num_labels > 1`, a classification loss is computed (Cross-Entropy).
+         """
+         return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+
+         outputs = self.new(
+             input_ids,
+             attention_mask=attention_mask,
+             token_type_ids=token_type_ids,
+             position_ids=position_ids,
+             head_mask=head_mask,
+             inputs_embeds=inputs_embeds,
+             output_attentions=output_attentions,
+             output_hidden_states=output_hidden_states,
+             return_dict=return_dict,
+             unpad_inputs=unpad_inputs,
+         )
+
+         pooled_output = outputs[1]
+
+         pooled_output = self.dropout(pooled_output)
+         logits = self.classifier(pooled_output)
+
+         loss = None
+         if labels is not None:
+             if self.config.problem_type is None:
+                 if self.num_labels == 1:
+                     self.config.problem_type = "regression"
+                 elif self.num_labels > 1 and (labels.dtype == torch.long or labels.dtype == torch.int):
+                     self.config.problem_type = "single_label_classification"
+                 else:
+                     self.config.problem_type = "multi_label_classification"
+
+             if self.config.problem_type == "regression":
+                 loss_fct = nn.MSELoss()
+                 if self.num_labels == 1:
+                     loss = loss_fct(logits.squeeze(), labels.squeeze())
+                 else:
+                     loss = loss_fct(logits, labels)
+             elif self.config.problem_type == "single_label_classification":
+                 loss_fct = nn.CrossEntropyLoss()
+                 loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
+             elif self.config.problem_type == "multi_label_classification":
+                 loss_fct = nn.BCEWithLogitsLoss()
+                 loss = loss_fct(logits, labels)
+
+         if not return_dict:
+             output = (logits,) + outputs[2:]
+             return ((loss,) + output) if loss is not None else output
+
+         return SequenceClassifierOutput(
+             loss=loss,
+             logits=logits,
+             hidden_states=outputs.hidden_states,
+             attentions=outputs.attentions,
+         )
+
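The `problem_type` fallback above picks a loss from the head size and label dtype: one output unit means regression (MSE), integer labels over several classes mean single-label classification (cross-entropy), and anything else means multi-label classification (BCE-with-logits). A pure-Python sketch of that heuristic (hypothetical helper name, plain lists standing in for tensors):

```python
def infer_problem_type(num_labels, labels):
    """Sketch of the fallback in NewForSequenceClassification.forward:
    one output unit -> regression; integer labels with several outputs
    -> single-label classification; float/vector labels -> multi-label."""
    if num_labels == 1:
        return "regression"
    if all(isinstance(label, int) for label in labels):
        return "single_label_classification"
    return "multi_label_classification"
```

Note the heuristic runs only when `config.problem_type` is unset, and the chosen value is cached on the config for subsequent forward passes.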
+
+ class NewForMultipleChoice(NewPreTrainedModel):
+     def __init__(self, config):
+         super().__init__(config)
+
+         self.new = NewModel(config, add_pooling_layer=True)
+         classifier_dropout = (
+             config.classifier_dropout if config.classifier_dropout is not None else config.hidden_dropout_prob
+         )
+         self.dropout = nn.Dropout(classifier_dropout)
+         self.classifier = nn.Linear(config.hidden_size, 1)
+
+         # Initialize weights and apply final processing
+         self.post_init()
+
+     def forward(
+         self,
+         input_ids: Optional[torch.Tensor] = None,
+         attention_mask: Optional[torch.Tensor] = None,
+         token_type_ids: Optional[torch.Tensor] = None,
+         position_ids: Optional[torch.Tensor] = None,
+         head_mask: Optional[torch.Tensor] = None,
+         inputs_embeds: Optional[torch.Tensor] = None,
+         labels: Optional[torch.Tensor] = None,
+         output_attentions: Optional[bool] = None,
+         output_hidden_states: Optional[bool] = None,
+         return_dict: Optional[bool] = None,
+         unpad_inputs: Optional[bool] = None,
+     ) -> Union[Tuple[torch.Tensor], MultipleChoiceModelOutput]:
+         r"""
+         labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
+             Labels for computing the multiple choice classification loss. Indices should be in `[0, ...,
+             num_choices-1]` where `num_choices` is the size of the second dimension of the input tensors. (See
+             `input_ids` above)
+         """
+         return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+         num_choices = input_ids.shape[1] if input_ids is not None else inputs_embeds.shape[1]
+
+         input_ids = input_ids.view(-1, input_ids.size(-1)) if input_ids is not None else None
+         attention_mask = attention_mask.view(-1, attention_mask.size(-1)) if attention_mask is not None else None
+         token_type_ids = token_type_ids.view(-1, token_type_ids.size(-1)) if token_type_ids is not None else None
+         position_ids = position_ids.view(-1, position_ids.size(-1)) if position_ids is not None else None
+         inputs_embeds = (
+             inputs_embeds.view(-1, inputs_embeds.size(-2), inputs_embeds.size(-1))
+             if inputs_embeds is not None
+             else None
+         )
+
+         outputs = self.new(
+             input_ids,
+             attention_mask=attention_mask,
+             token_type_ids=token_type_ids,
+             position_ids=position_ids,
+             head_mask=head_mask,
+             inputs_embeds=inputs_embeds,
+             output_attentions=output_attentions,
+             output_hidden_states=output_hidden_states,
+             return_dict=return_dict,
+             unpad_inputs=unpad_inputs,
+         )
+
+         pooled_output = outputs[1]
+
+         pooled_output = self.dropout(pooled_output)
+         logits = self.classifier(pooled_output)
+         reshaped_logits = logits.view(-1, num_choices)
+
+         loss = None
+         if labels is not None:
+             loss_fct = nn.CrossEntropyLoss()
+             loss = loss_fct(reshaped_logits, labels)
+
+         if not return_dict:
+             output = (reshaped_logits,) + outputs[2:]
+             return ((loss,) + output) if loss is not None else output
+
+         return MultipleChoiceModelOutput(
+             loss=loss,
+             logits=reshaped_logits,
+             hidden_states=outputs.hidden_states,
+             attentions=outputs.attentions,
+         )
+
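The multiple-choice head treats every (question, choice) pair as an ordinary sequence: the choice axis is folded into the batch before encoding, and the per-sequence scores are folded back into one row per question for the cross-entropy over choices. A minimal sketch of those two reshapes, using nested lists in place of tensors and hypothetical helper names:

```python
def flatten_choices(input_ids):
    # (batch, num_choices, seq_len) -> (batch * num_choices, seq_len);
    # mirrors input_ids.view(-1, input_ids.size(-1)) in the forward pass.
    return [seq for choices in input_ids for seq in choices]

def regroup_logits(flat_logits, num_choices):
    # (batch * num_choices,) per-sequence scores -> (batch, num_choices);
    # mirrors logits.view(-1, num_choices) before the cross-entropy.
    return [flat_logits[i:i + num_choices] for i in range(0, len(flat_logits), num_choices)]
```

Because the classifier has a single output unit, `num_choices` never appears in the weights, so the same head works for any number of answer candidates.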
+
+ @dataclass
+ class NewTokenClassifierOutput(ModelOutput):
+     loss: Optional[torch.FloatTensor] = None
+     logits: torch.FloatTensor = None
+     last_hidden_state: torch.FloatTensor = None
+     hidden_states: Optional[Tuple[torch.FloatTensor, ...]] = None
+     attentions: Optional[Tuple[torch.FloatTensor, ...]] = None
+
+
+ class NewForTokenClassification(NewPreTrainedModel):
+     def __init__(self, config):
+         super().__init__(config)
+         self.num_labels = config.num_labels
+
+         self.new = NewModel(config, add_pooling_layer=False)
+         classifier_dropout = (
+             config.classifier_dropout if config.classifier_dropout is not None else config.hidden_dropout_prob
+         )
+         self.dropout = nn.Dropout(classifier_dropout)
+         self.classifier = nn.Linear(config.hidden_size, config.num_labels)
+
+         # Initialize weights and apply final processing
+         self.post_init()
+
+     def forward(
+         self,
+         input_ids: Optional[torch.Tensor] = None,
+         attention_mask: Optional[torch.Tensor] = None,
+         token_type_ids: Optional[torch.Tensor] = None,
+         position_ids: Optional[torch.Tensor] = None,
+         head_mask: Optional[torch.Tensor] = None,
+         inputs_embeds: Optional[torch.Tensor] = None,
+         labels: Optional[torch.Tensor] = None,
+         output_attentions: Optional[bool] = None,
+         output_hidden_states: Optional[bool] = None,
+         return_dict: Optional[bool] = None,
+         unpad_inputs: Optional[bool] = None,
+     ) -> Union[Tuple[torch.Tensor], NewTokenClassifierOutput]:
+         r"""
+         labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
+             Labels for computing the token classification loss. Indices should be in `[0, ..., config.num_labels - 1]`.
+         """
+         return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+
+         outputs = self.new(
+             input_ids,
+             attention_mask=attention_mask,
+             token_type_ids=token_type_ids,
+             position_ids=position_ids,
+             head_mask=head_mask,
+             inputs_embeds=inputs_embeds,
+             output_attentions=output_attentions,
+             output_hidden_states=output_hidden_states,
+             return_dict=return_dict,
+             unpad_inputs=unpad_inputs,
+         )
+
+         sequence_output = outputs[0]
+
+         sequence_output = self.dropout(sequence_output)
+         logits = self.classifier(sequence_output)
+
+         loss = None
+         if labels is not None:
+             loss_fct = nn.CrossEntropyLoss()
+             loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
+
+         if not return_dict:
+             output = (logits,) + outputs[2:]
+             return ((loss,) + output) if loss is not None else output
+
+         return NewTokenClassifierOutput(
+             loss=loss,
+             logits=logits,
+             last_hidden_state=sequence_output,
+             hidden_states=outputs.hidden_states,
+             attentions=outputs.attentions,
+         )
+
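Unlike the sequence-level heads, the token head scores every position, so the loss flattens both the `(batch, seq_len, num_labels)` logits and the `(batch, seq_len)` labels into per-token rows before the cross-entropy. A sketch of that flattening with nested lists (hypothetical helper name):

```python
def flatten_for_token_loss(logits, labels):
    # logits: (batch, seq_len, num_labels) nested lists; labels: (batch, seq_len).
    # Mirrors logits.view(-1, self.num_labels) and labels.view(-1): every token
    # becomes one classification example for nn.CrossEntropyLoss, which by
    # default skips positions labelled -100 (e.g. padding or sub-word pieces).
    flat_logits = [scores for sentence in logits for scores in sentence]
    flat_labels = [label for sentence in labels for label in sentence]
    return flat_logits, flat_labels
```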
+
+ class NewForQuestionAnswering(NewPreTrainedModel):
+     def __init__(self, config):
+         super().__init__(config)
+         self.num_labels = config.num_labels
+
+         self.new = NewModel(config, add_pooling_layer=False)
+         self.qa_outputs = nn.Linear(config.hidden_size, config.num_labels)
+
+         # Initialize weights and apply final processing
+         self.post_init()
+
+     def forward(
+         self,
+         input_ids: Optional[torch.Tensor] = None,
+         attention_mask: Optional[torch.Tensor] = None,
+         token_type_ids: Optional[torch.Tensor] = None,
+         position_ids: Optional[torch.Tensor] = None,
+         head_mask: Optional[torch.Tensor] = None,
+         inputs_embeds: Optional[torch.Tensor] = None,
+         start_positions: Optional[torch.Tensor] = None,
+         end_positions: Optional[torch.Tensor] = None,
+         output_attentions: Optional[bool] = None,
+         output_hidden_states: Optional[bool] = None,
+         return_dict: Optional[bool] = None,
+         unpad_inputs: Optional[bool] = None,
+     ) -> Union[Tuple[torch.Tensor], QuestionAnsweringModelOutput]:
+         r"""
+         start_positions (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
+             Labels for the position (index) of the start of the labelled span, used to compute the token
+             classification loss. Positions are clamped to the length of the sequence (`sequence_length`); positions
+             outside of the sequence are not taken into account for computing the loss.
+         end_positions (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
+             Labels for the position (index) of the end of the labelled span, used to compute the token
+             classification loss. Positions are clamped to the length of the sequence (`sequence_length`); positions
+             outside of the sequence are not taken into account for computing the loss.
+         """
+         return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+
+         outputs = self.new(
+             input_ids,
+             attention_mask=attention_mask,
+             token_type_ids=token_type_ids,
+             position_ids=position_ids,
+             head_mask=head_mask,
+             inputs_embeds=inputs_embeds,
+             output_attentions=output_attentions,
+             output_hidden_states=output_hidden_states,
+             return_dict=return_dict,
+             unpad_inputs=unpad_inputs,
+         )
+
+         sequence_output = outputs[0]
+
+         logits = self.qa_outputs(sequence_output)
+         start_logits, end_logits = logits.split(1, dim=-1)
+         start_logits = start_logits.squeeze(-1).contiguous()
+         end_logits = end_logits.squeeze(-1).contiguous()
+
+         total_loss = None
+         if start_positions is not None and end_positions is not None:
+             # If we are on multi-GPU, splitting adds an extra dimension
+             if len(start_positions.size()) > 1:
+                 start_positions = start_positions.squeeze(-1)
+             if len(end_positions.size()) > 1:
+                 end_positions = end_positions.squeeze(-1)
+             # Sometimes the start/end positions fall outside our model inputs; we ignore these terms
+             ignored_index = start_logits.size(1)
+             start_positions = start_positions.clamp(0, ignored_index)
+             end_positions = end_positions.clamp(0, ignored_index)
+
+             loss_fct = nn.CrossEntropyLoss(ignore_index=ignored_index)
+             start_loss = loss_fct(start_logits, start_positions)
+             end_loss = loss_fct(end_logits, end_positions)
+             total_loss = (start_loss + end_loss) / 2
+
+         if not return_dict:
+             output = (start_logits, end_logits) + outputs[2:]
+             return ((total_loss,) + output) if total_loss is not None else output
+
+         return QuestionAnsweringModelOutput(
+             loss=total_loss,
+             start_logits=start_logits,
+             end_logits=end_logits,
+             hidden_states=outputs.hidden_states,
+             attentions=outputs.attentions,
+         )
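The QA loss relies on a small trick: out-of-range span labels are clamped to `seq_len`, and `seq_len` is then passed as `ignore_index` to the cross-entropy, so impossible answers contribute no gradient at all. A pure-Python sketch of the clamping step (hypothetical helper name, lists in place of tensors):

```python
def clamp_span_positions(start_positions, end_positions, seq_len):
    # Mirrors positions.clamp(0, ignored_index) with ignored_index = seq_len:
    # any label pointing past the sequence is mapped onto index seq_len, the
    # class that nn.CrossEntropyLoss(ignore_index=seq_len) skips entirely.
    clamp = lambda p: min(max(p, 0), seq_len)
    return [clamp(p) for p in start_positions], [clamp(p) for p in end_positions]
```

Since valid token indices run from 0 to `seq_len - 1`, index `seq_len` can never be a real answer, which is what makes it safe to use as the ignored class.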
modules.json ADDED
@@ -0,0 +1,20 @@
+ [
+   {
+     "idx": 0,
+     "name": "0",
+     "path": "",
+     "type": "sentence_transformers.models.Transformer"
+   },
+   {
+     "idx": 1,
+     "name": "1",
+     "path": "1_Pooling",
+     "type": "sentence_transformers.models.Pooling"
+   },
+   {
+     "idx": 2,
+     "name": "2",
+     "path": "2_Normalize",
+     "type": "sentence_transformers.models.Normalize"
+   }
+ ]
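This modules list defines the sentence-transformers pipeline as three stages: the Transformer encoder, CLS-token pooling (per `1_Pooling/config.json`, where `pooling_mode_cls_token` is true), and L2 normalization, so dot products between the final embeddings equal cosine similarities. A sketch of what the last `Normalize` stage does, in plain Python:

```python
import math

def l2_normalize(embedding, eps=1e-12):
    # Scale the vector to unit L2 norm, as the Normalize module does, so that
    # a dot product between two normalized embeddings is a cosine similarity.
    norm = math.sqrt(sum(x * x for x in embedding))
    return [x / max(norm, eps) for x in embedding]
```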
sentence_bert_config.json ADDED
@@ -0,0 +1,4 @@
+ {
+   "max_seq_length": 8192,
+   "do_lower_case": false
+ }
special_tokens_map.json ADDED
@@ -0,0 +1,51 @@
+ {
+   "bos_token": {
+     "content": "<s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "cls_token": {
+     "content": "<s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "eos_token": {
+     "content": "</s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "mask_token": {
+     "content": "<mask>",
+     "lstrip": true,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": {
+     "content": "<pad>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "sep_token": {
+     "content": "</s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "unk_token": {
+     "content": "<unk>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
tokenizer.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:aa7a6ad87a7ce8fe196787355f6af7d03aee94d19c54a5eb1392ed18c8ef451a
+ size 17082988
tokenizer_config.json ADDED
@@ -0,0 +1,55 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "<s>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "1": {
+       "content": "<pad>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "2": {
+       "content": "</s>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "3": {
+       "content": "<unk>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "250001": {
+       "content": "<mask>",
+       "lstrip": true,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "bos_token": "<s>",
+   "clean_up_tokenization_spaces": true,
+   "cls_token": "<s>",
+   "eos_token": "</s>",
+   "extra_special_tokens": {},
+   "mask_token": "<mask>",
+   "model_max_length": 8192,
+   "pad_token": "<pad>",
+   "sep_token": "</s>",
+   "tokenizer_class": "XLMRobertaTokenizerFast",
+   "unk_token": "<unk>"
+ }