Your task is to write a Python script that loads the following two .npy files:
- {BASE_DIR}/solution/arrays/data_raw.npy (real data events)
- {BASE_DIR}/solution/arrays/signal_raw.npy (MC signal events)
Each file contains a NumPy array of shape (N, 46), where each row corresponds to a physics event and each column represents a feature. Your goal is to preprocess these arrays following the steps below, and save the processed results as:
- signal.npy: selected MC signal events
- bkgd.npy: selected and rescaled background events from real data
Save both output files to: {BASE_DIR}/arrays/
Information on the column indices:
0: leading photon pT
1: leading photon eta
2: leading photon phi
3: subleading photon pT
4: subleading photon eta
5: subleading photon phi
6: leading lepton pT
7: leading lepton eta
8: leading lepton phi
9: subleading lepton pT
10: subleading lepton eta
11: subleading lepton phi
12: jet 1 pT
13: jet 1 eta
14: jet 1 phi
15: jet 2 pT
16: jet 2 eta
17: jet 2 phi
18: jet 3 pT
19: jet 3 eta
20: jet 3 phi
21: jet 4 pT
22: jet 4 eta
23: jet 4 phi
24: jet 5 pT
25: jet 5 eta
26: jet 5 phi
27: jet 6 pT
28: jet 6 eta
29: jet 6 phi
30: MET ET
31: MET phi
32: MC weight
33: sum of weights
34: cross section (XSection)
35: leading photon tight ID?
36: subleading photon tight ID?
37: scaleFactor_PILEUP
38: scaleFactor_PHOTON
39: scaleFactor_PhotonTRIGGER
40: scaleFactor_ELE
41: scaleFactor_MUON
42: scaleFactor_LepTRIGGER
43: scaleFactor_BTAG
44: unused (NaN on input); reserved for the diphoton invariant mass
45: unused (NaN on input); reserved for the diphoton transverse momentum
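As a convenience, the column table above can be captured as named constants so that later steps do not rely on bare integers. This is only a sketch: the indices come from the table, but every constant name below is hypothetical.

```python
# Hypothetical constant names; the indices themselves come from the column table.
LEAD_PT, LEAD_ETA, LEAD_PHI = 0, 1, 2      # leading photon
SUB_PT, SUB_ETA, SUB_PHI = 3, 4, 5         # subleading photon
MC_WEIGHT, SUM_WEIGHTS, XSECTION = 32, 33, 34
LEAD_TIGHT, SUB_TIGHT = 35, 36             # tight photon ID flags
SCALE_FACTOR_COLS = list(range(37, 44))    # columns 37..43 inclusive
M_YY, PT_YY = 44, 45                       # filled during preprocessing
N_COLUMNS = 46
```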
---
Step 1: Load and Validate
- Load both .npy files using NumPy.
- Verify that each array has exactly 46 columns. Raise an error if not.
- Do not drop any columns — preserve the full (N, 46) shape.
- The later steps update the following columns in place:
  - Column 32: final event weight
  - Column 34: cross section (XSection) - only for ttH process
  - Column 44: diphoton invariant mass (m_yy)
  - Column 45: diphoton transverse momentum (pt_yy)
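A minimal sketch of the load-and-validate step might look like the following. The helper name load_events is hypothetical, and raising ValueError is one reasonable choice of error type, not a requirement stated above.

```python
import numpy as np

N_COLUMNS = 46

def load_events(path):
    """Load an event array and check that it has the expected (N, 46) shape."""
    events = np.load(path)
    if events.ndim != 2 or events.shape[1] != N_COLUMNS:
        raise ValueError(
            f"{path}: expected shape (N, {N_COLUMNS}), got {events.shape}"
        )
    return events
```

The function would be called once per input file, for example on data_raw.npy and signal_raw.npy under the solution/arrays directory.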
---
Step 2: MC Signal Weight Update (signal_raw.npy only)
Normalization:
- Use luminosity = 10,000 pb^{-1}.
- For each event, compute the normalization factor as:
(cross_section * luminosity) / sum_of_weights
- The values of cross_section and sum_of_weights are found in columns 34 and 33, respectively.
- Important: If the cross-section value is 2.64338632e-06 pb (corresponding to ttH SM Higgs production), replace it with 0.000116 pb (the correct SM Higgs → γγ cross-section).
- This correction should be applied only to events where the cross-section matches 2.64338632e-06 pb, and the corrected value should overwrite the original in column 34.
- Use the corrected cross-section value when computing normalization.
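The cross-section correction and normalization above can be sketched as follows. The function name is hypothetical, and matching the old cross-section with np.isclose (instead of exact equality) is an assumption made here to tolerate float32 storage; the tolerance is not specified in the task.

```python
import numpy as np

LUMINOSITY = 10_000.0        # pb^-1
XSEC_OLD = 2.64338632e-06    # pb, value to be replaced
XSEC_NEW = 0.000116          # pb, corrected value

def normalization(signal):
    """Correct column 34 in place and return the per-event normalization."""
    # Assumption: a relative tolerance of 1e-6 identifies the ttH events.
    match = np.isclose(signal[:, 34], XSEC_OLD, rtol=1e-6, atol=0.0)
    signal[match, 34] = XSEC_NEW
    # (cross_section * luminosity) / sum_of_weights, using corrected values
    return signal[:, 34] * LUMINOSITY / signal[:, 33]
```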
Scale factors:
- For each event, multiply the following scale factors:
  - scaleFactor_PILEUP (column 37)
  - scaleFactor_PHOTON (column 38)
  - scaleFactor_PhotonTRIGGER (column 39)
  - scaleFactor_ELE (column 40)
  - scaleFactor_MUON (column 41)
  - scaleFactor_LepTRIGGER (column 42)
  - scaleFactor_BTAG (column 43)
- Remove any event where any of these scale factors is exactly zero.
Final weight:
- Compute the final event weight as:
final_weight = mcWeight * normalization * (product of all scale factors)
- Here, mcWeight is taken from column 32.
- Store the computed final weight back into column 32, replacing the original mcWeight.
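One way to combine the scale-factor product, the zero-scale-factor removal, and the final weight is sketched below. The function name and the norm argument (the per-event normalization array computed beforehand, one entry per input row) are hypothetical.

```python
import numpy as np

SF_COLS = slice(37, 44)  # scale factor columns 37..43

def apply_final_weight(signal, norm):
    """Drop events with any zero scale factor and write the final weight to column 32."""
    keep = np.all(signal[:, SF_COLS] != 0, axis=1)
    selected = signal[keep]            # boolean indexing returns a copy
    sf_product = np.prod(selected[:, SF_COLS], axis=1)
    # final_weight = mcWeight * normalization * (product of all scale factors)
    selected[:, 32] = selected[:, 32] * norm[keep] * sf_product
    return selected
```

The returned array keeps all 46 columns; only the row count and column 32 change.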
---
Step 3: Kinematic Calculations and Preselection (for both MC and data)
- For each event, compute diphoton invariant mass and transverse momentum using ROOT.TLorentzVector (do not use the vector module).
- Store the diphoton invariant mass in column 44 (m_yy).
- Store the diphoton transverse momentum in column 45 (pt_yy).
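The kinematic step can be sketched with ROOT.TLorentzVector as below. Both function names are hypothetical, and treating the photons as massless (mass argument 0.0) is an assumption. The second function is only a NumPy cross-check of the ROOT result using the closed-form expressions for two massless particles; it is not a substitute for the ROOT-based calculation requested above.

```python
import numpy as np

def fill_diphoton_root(events):
    """Fill columns 44 (m_yy) and 45 (pt_yy) in place using ROOT.TLorentzVector."""
    import ROOT  # imported lazily so the module loads even without PyROOT
    g1, g2 = ROOT.TLorentzVector(), ROOT.TLorentzVector()
    for row in events:  # each row is a view, so assignments modify events
        g1.SetPtEtaPhiM(float(row[0]), float(row[1]), float(row[2]), 0.0)
        g2.SetPtEtaPhiM(float(row[3]), float(row[4]), float(row[5]), 0.0)
        yy = g1 + g2
        row[44] = yy.M()
        row[45] = yy.Pt()
    return events

def diphoton_closed_form(events):
    """Cross-check for massless photons; returns (m_yy, pt_yy), leaves events unchanged."""
    pt1, eta1, phi1 = events[:, 0], events[:, 1], events[:, 2]
    pt2, eta2, phi2 = events[:, 3], events[:, 4], events[:, 5]
    cos_dphi = np.cos(phi1 - phi2)
    m_sq = 2.0 * pt1 * pt2 * (np.cosh(eta1 - eta2) - cos_dphi)
    pt_sq = pt1**2 + pt2**2 + 2.0 * pt1 * pt2 * cos_dphi
    # Clip tiny negative values caused by rounding before taking the root.
    return np.sqrt(np.maximum(m_sq, 0.0)), np.sqrt(np.maximum(pt_sq, 0.0))
```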
Apply the following preselection cuts to both MC and data:
- Photon pseudorapidity (|eta|): |eta| < 1.37 or 1.52 < |eta| < 2.37 (for each photon)
- Photon transverse momentum: pT > 25,000 MeV for each photon (columns 0 and 3)
- Leading photon: (leading photon pT / m_yy) > 0.35
- Subleading photon: (subleading photon pT / m_yy) > 0.25
- Diphoton invariant mass: 105,000 MeV < m_yy < 160,000 MeV
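The preselection can be expressed as a single boolean mask, sketched below. It assumes that the transverse-momentum cuts act on the individual photon pT values in columns 0 and 3, and that m_yy has already been stored in column 44. The function name is hypothetical.

```python
import numpy as np

def preselection_mask(events):
    """Return a boolean mask of events passing all preselection cuts."""
    def eta_ok(eta):
        abs_eta = np.abs(eta)
        return (abs_eta < 1.37) | ((abs_eta > 1.52) & (abs_eta < 2.37))

    lead_pt, sub_pt = events[:, 0], events[:, 3]
    m_yy = events[:, 44]
    # NaN or zero m_yy makes the ratio comparisons False; silence the warnings.
    with np.errstate(divide="ignore", invalid="ignore"):
        ratio_ok = (lead_pt / m_yy > 0.35) & (sub_pt / m_yy > 0.25)
    return (
        eta_ok(events[:, 1]) & eta_ok(events[:, 4])
        & (lead_pt > 25_000) & (sub_pt > 25_000)
        & ratio_ok
        & (m_yy > 105_000) & (m_yy < 160_000)
    )
```

Applying it as events[preselection_mask(events)] keeps the full 46-column layout.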
---
Step 4a: Final Signal Selection (MC only)
From the preselected MC events:
- Keep only events where both photons pass tight photon ID.
- Keep only events within the signal region: 123,000 MeV < m_yy < 127,000 MeV
Save the selected events to:
- {BASE_DIR}/arrays/signal.npy
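A sketch of the final signal selection follows. It assumes that the tight ID flags in columns 35 and 36 are stored as 1 for pass and 0 for fail, which the task does not state explicitly; the function name and the out_dir argument are hypothetical.

```python
import os
import numpy as np

def select_signal(mc_preselected, out_dir):
    """Keep tight-tight events in the signal region and save them as signal.npy."""
    m_yy = mc_preselected[:, 44]
    # Assumption: tight ID flags are encoded as 1 (pass) / 0 (fail).
    both_tight = (mc_preselected[:, 35] == 1) & (mc_preselected[:, 36] == 1)
    in_signal_region = (m_yy > 123_000) & (m_yy < 127_000)
    signal = mc_preselected[both_tight & in_signal_region]
    os.makedirs(out_dir, exist_ok=True)
    np.save(os.path.join(out_dir, "signal.npy"), signal)
    return signal
```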
---
Step 4b: Background Modeling and Normalization (real data only)
Using preselected data events:
Region definitions:
- Signal region: 123,000 MeV < m_yy < 127,000 MeV
- Sideband region: 105,000 MeV < m_yy < 120,000 MeV or 130,000 MeV < m_yy < 160,000 MeV
Photon ID categories:
- TI (tight ID): both photons pass tight photon ID
- NTI (non-tight ID): photons fail tight ID but pass loose ID
Steps:
1. Compute yields (sum of weights) for:
   - NTI sideband
   - NTI signal region
   - TI sideband
2. Calculate scale factors:
   - SF1 = (TI sideband) / (NTI sideband)
   - SF2 = (NTI signal region) / (NTI sideband)
3. Estimate expected yield in TI signal region:
   - expected_yield = SF1 * SF2 * (NTI sideband)
4. Retain only NTI sideband events.
5. Rescale their weights so that the total weight matches expected_yield.
6. Save the result to:
   - {BASE_DIR}/arrays/bkgd.npy
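The background steps above can be sketched as follows. Several points are assumptions: the column table lists no loose ID flag, so NTI is approximated here as "not both photons tight"; tight flags are taken to be 1 for pass; and column 32 of the data array is taken to hold a usable per-event weight. The function name is hypothetical.

```python
import numpy as np

def build_background(data_preselected):
    """Return (bkgd, expected_yield) from preselected real-data events."""
    m_yy = data_preselected[:, 44]
    weight = data_preselected[:, 32]
    tight = (data_preselected[:, 35] == 1) & (data_preselected[:, 36] == 1)
    nti = ~tight  # assumption: no loose ID column is available
    signal_region = (m_yy > 123_000) & (m_yy < 127_000)
    sideband = ((m_yy > 105_000) & (m_yy < 120_000)) | (
        (m_yy > 130_000) & (m_yy < 160_000)
    )

    nti_sb = weight[nti & sideband].sum()
    nti_sr = weight[nti & signal_region].sum()
    ti_sb = weight[tight & sideband].sum()
    if nti_sb == 0:
        raise ValueError("NTI sideband yield is zero; cannot compute scale factors")

    sf1 = ti_sb / nti_sb
    sf2 = nti_sr / nti_sb
    expected_yield = sf1 * sf2 * nti_sb

    bkgd = data_preselected[nti & sideband]   # boolean indexing returns a copy
    bkgd[:, 32] *= expected_yield / nti_sb    # total weight now equals expected_yield
    return bkgd, expected_yield
```

The returned bkgd array can then be written with np.save to bkgd.npy in the output directory.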
---
Final Output Summary:
- signal.npy – MC signal events passing preselection, signal region, and tight ID cuts
- bkgd.npy – Real data events (NTI sideband) rescaled to match expected background