LLM4HEP / prompts /preprocess.txt
Your task is to write a Python script that processes ATLAS diphoton event data.
Load the following two numpy array files:
- {BASE_DIR}/solution/arrays/data_raw.npy (real collision data)
- {BASE_DIR}/solution/arrays/signal_raw.npy (Monte Carlo simulated signal)
Each file contains a 2D array with shape (N_events, 46), where each row is one event and columns store physics quantities.
Your script must:
1. Apply MC reweighting to simulated events
2. Compute diphoton kinematics for all events
3. Apply physics selection cuts
4. Save final signal and background samples
Save outputs to:
- {BASE_DIR}/arrays/signal.npy
- {BASE_DIR}/arrays/bkgd.npy
====================
COLUMN DEFINITIONS
====================
0: leading photon pT (MeV)
1: leading photon eta
2: leading photon phi
3: subleading photon pT (MeV)
4: subleading photon eta
5: subleading photon phi
6: leading lepton pT
7: leading lepton eta
8: leading lepton phi
9: subleading lepton pT
10: subleading lepton eta
11: subleading lepton phi
12-29: jet kinematics (6 jets x 3 variables)
30: missing ET
31: missing ET phi
32: event weight
33: sum of MC weights
34: cross section (pb)
35: leading photon tight ID flag
36: subleading photon tight ID flag
37: scaleFactor_PILEUP
38: scaleFactor_PHOTON
39: scaleFactor_PhotonTRIGGER
40: scaleFactor_ELE
41: scaleFactor_MUON
42: scaleFactor_LepTRIGGER
43: scaleFactor_BTAG
44: (initially NaN) diphoton invariant mass m_yy (MeV)
45: (initially NaN) diphoton transverse momentum pT_yy (MeV)
====================
STEP 1: LOAD AND VALIDATE
====================
Load both .npy files with numpy.load(). Verify each has exactly 46 columns; raise ValueError if not.
Do NOT drop any columns. Preserve the full (N, 46) shape throughout.
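A minimal loading-and-validation helper could look like the sketch below (the function name `load_and_validate` is illustrative, not part of the required interface):

```python
import numpy as np

def load_and_validate(path):
    """Load a .npy event array and verify it has exactly 46 columns."""
    arr = np.load(path)
    if arr.ndim != 2 or arr.shape[1] != 46:
        raise ValueError(f"{path}: expected shape (N, 46), got {arr.shape}")
    return arr
```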
====================
STEP 2: MC WEIGHT UPDATE (signal_raw.npy only)
====================
A. Cross-section correction:
For any row where abs(column_34 - 2.64338632e-06) < 1e-10:
Replace column 34 with 0.000116 (the correct Higgs-to-gamma-gamma cross-section in pb)
B. Normalization (per-event, not global):
For each row independently compute:
norm = (column_34 * 10000.0) / column_33
where 10000.0 is the integrated luminosity in inverse picobarns (pb^-1)
C. Scale factor product:
For each row multiply columns 37 through 43 (7 factors total)
D. Final weight:
column_32 = column_32 * norm * scale_factor_product
Store result back into column 32
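Steps A-D above can be sketched as a single vectorized pass over the MC array (column indices follow the table above; the zero-denominator guard on column 33 is an assumption consistent with the implementation notes):

```python
import numpy as np

def update_mc_weights(mc, lumi=10000.0):
    """Apply the Step 2 weight update to an (N, 46) MC array; returns a copy."""
    mc = mc.copy()
    # A. replace the placeholder cross-section value
    wrong_xsec = np.abs(mc[:, 34] - 2.64338632e-06) < 1e-10
    mc[wrong_xsec, 34] = 0.000116
    # B. per-event normalization, guarding against a zero sum of MC weights
    sumw = np.where(mc[:, 33] == 0, 1.0, mc[:, 33])
    norm = mc[:, 34] * lumi / sumw
    # C. product of the seven scale factors (columns 37..43)
    sf = np.prod(mc[:, 37:44], axis=1)
    # D. final per-event weight, stored back into column 32
    mc[:, 32] = mc[:, 32] * norm * sf
    return mc
```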
====================
STEP 3: KINEMATICS (both MC and data)
====================
For every event use ROOT.TLorentzVector to compute diphoton system:
photon1 = ROOT.TLorentzVector()
photon1.SetPtEtaPhiM(column_0, column_1, column_2, 0.0)
photon2 = ROOT.TLorentzVector()
photon2.SetPtEtaPhiM(column_3, column_4, column_5, 0.0)
diphoton = photon1 + photon2
column_44 = diphoton.M()
column_45 = diphoton.Pt()
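For two massless photons the TLorentzVector sum above has a closed form, which can serve as a numpy cross-check of the ROOT loop (this is a validation aid, not a replacement for the required TLorentzVector implementation):

```python
import numpy as np

def diphoton_kinematics(pt1, eta1, phi1, pt2, eta2, phi2):
    """Closed-form m_yy and pT_yy for two massless photons.
    Mathematically equivalent to summing the two TLorentzVectors in Step 3."""
    # m_yy^2 = 2 pT1 pT2 (cosh(eta1 - eta2) - cos(phi1 - phi2))
    m2 = 2.0 * pt1 * pt2 * (np.cosh(eta1 - eta2) - np.cos(phi1 - phi2))
    m_yy = np.sqrt(np.maximum(m2, 0.0))
    # pT_yy = |vector sum of the two transverse momenta|
    pt2_yy = pt1**2 + pt2**2 + 2.0 * pt1 * pt2 * np.cos(phi1 - phi2)
    pt_yy = np.sqrt(np.maximum(pt2_yy, 0.0))
    return m_yy, pt_yy
```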
====================
STEP 4: PRESELECTION (both MC and data)
====================
Create a safe denominator for ratio cuts:
m_yy_safe = np.where(column_44 <= 0, 1e-6, column_44)
Apply ALL of the following cuts (combine with logical AND):
1. Photon eta acceptance (both photons):
abs(column_1) < 1.37 OR (1.52 < abs(column_1) < 2.37)
abs(column_4) < 1.37 OR (1.52 < abs(column_4) < 2.37)
2. Photon pT thresholds:
column_0 > 25000 (leading photon pT in MeV)
column_3 > 25000 (subleading photon pT in MeV)
3. pT/mass ratios (use m_yy_safe to avoid division by zero):
column_0 / m_yy_safe > 0.35 (leading photon)
column_3 / m_yy_safe > 0.25 (subleading photon)
CRITICAL: Column 0 is ALWAYS the leading photon, column 3 is ALWAYS subleading.
Do NOT use np.maximum or np.minimum to pick which is which.
The input arrays are already sorted by pT.
4. Diphoton mass window:
105000 < column_44 < 160000 (MeV)
Keep only rows passing all cuts above.
After preselection, for DATA ONLY:
Set column_32 = 1.0 for all remaining data events
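The four preselection cuts above can be combined into one boolean mask; a minimal sketch (the helper name `preselection_mask` is illustrative):

```python
import numpy as np

def preselection_mask(arr):
    """Boolean mask implementing the four Step 4 cuts on an (N, 46) array."""
    eta1, eta2 = np.abs(arr[:, 1]), np.abs(arr[:, 4])
    # 1. eta acceptance, excluding the 1.37-1.52 crack region
    in_acc1 = (eta1 < 1.37) | ((eta1 > 1.52) & (eta1 < 2.37))
    in_acc2 = (eta2 < 1.37) | ((eta2 > 1.52) & (eta2 < 2.37))
    m_yy = arr[:, 44]
    m_yy_safe = np.where(m_yy <= 0, 1e-6, m_yy)
    return (
        in_acc1 & in_acc2
        # 2. pT thresholds (MeV)
        & (arr[:, 0] > 25000) & (arr[:, 3] > 25000)
        # 3. pT/mass ratio cuts
        & (arr[:, 0] / m_yy_safe > 0.35)
        & (arr[:, 3] / m_yy_safe > 0.25)
        # 4. diphoton mass window (MeV)
        & (m_yy > 105000) & (m_yy < 160000)
    )
```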
====================
STEP 5: SIGNAL SELECTION (MC only)
====================
From preselected MC events, apply:
1. Tight photon ID:
(column_35 == 1.0) AND (column_36 == 1.0)
Use exact equality. Do NOT use np.isclose().
2. Signal mass window:
123000 < column_44 < 127000 (MeV)
Save selected events to {BASE_DIR}/arrays/signal.npy
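The two Step 5 cuts reduce to a short masking helper (sketch only; saving to disk is left to the caller):

```python
import numpy as np

def signal_selection(mc):
    """Tight-ID and signal-mass-window selection on preselected MC events."""
    tight = (mc[:, 35] == 1.0) & (mc[:, 36] == 1.0)  # exact equality, per spec
    window = (mc[:, 44] > 123000) & (mc[:, 44] < 127000)
    return mc[tight & window]
```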
====================
STEP 6: BACKGROUND MODELING (data only)
====================
From preselected data events (with column_32 = 1.0):
Define categories:
- TI (tight): (column_35 == 1.0) AND (column_36 == 1.0)
- NTI (non-tight): NOT TI
Define regions:
- Signal: 123000 < column_44 < 127000
- Sideband: (105000 < column_44 < 120000) OR (130000 < column_44 < 160000)
Compute yields (sum of column_32):
Y_NTI_sideband = sum of weights for (NTI AND sideband)
Y_NTI_signal = sum of weights for (NTI AND signal)
Y_TI_sideband = sum of weights for (TI AND sideband)
Scale factors (if Y_NTI_sideband > 0):
SF1 = Y_TI_sideband / Y_NTI_sideband
SF2 = Y_NTI_signal / Y_NTI_sideband
Expected yield:
Y_expected = SF1 * SF2 * Y_NTI_sideband
Keep ONLY NTI sideband events.
Rescale their weights: column_32 = column_32 * (Y_expected / Y_NTI_sideband)
Save to {BASE_DIR}/arrays/bkgd.npy
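The full Step 6 procedure can be sketched as follows (the name `background_sample` is illustrative; note that Y_expected = SF1 * SF2 * Y_NTI_sideband simplifies to Y_TI_sideband * Y_NTI_signal / Y_NTI_sideband):

```python
import numpy as np

def background_sample(data):
    """Build the NTI-sideband background sample with rescaled weights."""
    ti = (data[:, 35] == 1.0) & (data[:, 36] == 1.0)
    nti = ~ti
    m = data[:, 44]
    signal = (m > 123000) & (m < 127000)
    sideband = ((m > 105000) & (m < 120000)) | ((m > 130000) & (m < 160000))
    w = data[:, 32]
    y_nti_sb = w[nti & sideband].sum()
    y_nti_sig = w[nti & signal].sum()
    y_ti_sb = w[ti & sideband].sum()
    if y_nti_sb <= 0:
        raise ValueError("empty NTI sideband; cannot derive scale factors")
    sf1 = y_ti_sb / y_nti_sb
    sf2 = y_nti_sig / y_nti_sb
    y_expected = sf1 * sf2 * y_nti_sb
    # keep only NTI sideband events and rescale their weights
    bkgd = data[nti & sideband].copy()
    bkgd[:, 32] *= y_expected / y_nti_sb
    return bkgd
```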
====================
IMPLEMENTATION NOTES
====================
- Import ROOT at the start; raise a clear error if it is unavailable
- Use explicit Python loops for TLorentzVector (no vectorization)
- Guard all divisions (check denominator != 0)
- Preserve all 46 columns in output files
- Use exact equality (==) for tight ID, not approximate checks