	Your task is to write a Python script that:

	1. Loads the following two .npy files:
	- {BASE_DIR}/solution/arrays/[...].npy (real data events)
	- {BASE_DIR}/solution/arrays/signal_raw.npy (MC signal events)

	Each file contains a NumPy array of shape (N, 46), where each row corresponds to a physics event and each column represents a feature. Your goal is to preprocess these arrays following the steps below, and save the processed results as:

	- signal.npy: selected MC signal events
	- bkgd.npy: selected and rescaled background events from real data

	Save both output files to: {BASE_DIR}/arrays/

	Information on the column indices:

	0: leading photon pT
	1: leading photon eta
	2: leading photon phi
	3: subleading photon pT
	4: subleading photon eta
	5: subleading photon phi
	6: leading lepton pT
	7: leading lepton eta
	8: leading lepton phi
	9: subleading lepton pT
	10: subleading lepton eta
	11: subleading lepton phi
	12: jet 1 pT
	13: jet 1 eta
	14: jet 1 phi
	15: jet 2 pT
	16: jet 2 eta
	17: jet 2 phi
	18: jet 3 pT
	19: jet 3 eta
	20: jet 3 phi
	21: jet 4 pT
	22: jet 4 eta
	23: jet 4 phi
	24: jet 5 pT
	25: jet 5 eta
	26: jet 5 phi
	27: jet 6 pT
	28: jet 6 eta
	29: jet 6 phi
	30: MET ET
	31: MET phi
	32: MC weight
	33: sum of weights
	34: cross section (XSection)
	35: leading photon tight ID?
	36: subleading photon tight ID?
	37: scaleFactor_PILEUP
	38: scaleFactor_PHOTON
	39: scaleFactor_PhotonTRIGGER
	40: scaleFactor_ELE
	41: scaleFactor_MUON
	42: scaleFactor_LepTRIGGER
	43: scaleFactor_BTAG
	44: unused (NaN); reserved for the diphoton invariant mass
	45: unused (NaN); reserved for the diphoton transverse momentum

	---

	Step 1: Load and Validate

	- Load both .npy files using NumPy.
	- Verify that each array has exactly 46 columns. Raise an error if not.
	- Do not drop any columns — preserve the full (N, 46) shape.
	- Update the following columns in place:
	- Column 32: final event weight
	- Column 34: cross section (XSection) - only for ttH process
	- Column 44: diphoton invariant mass (m_yy)
	- Column 45: diphoton transverse momentum (pt_yy)
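
A minimal sketch of the load-and-validate step, assuming each input file is a plain 2-D NumPy array. The helper name `load_events` and the constant `N_COLUMNS` are illustrative and not part of the task specification.

```python
import numpy as np

N_COLUMNS = 46  # column count stated in the task description


def load_events(path):
    """Load one .npy event array and check that it has shape (N, 46)."""
    events = np.load(path)
    if events.ndim != 2 or events.shape[1] != N_COLUMNS:
        raise ValueError(
            f"{path}: expected shape (N, {N_COLUMNS}), got {events.shape}"
        )
    # Return a float copy so later in-place column updates are safe.
    return events.astype(float, copy=True)
```

All 46 columns are kept; later steps only overwrite columns 32, 34, 44, and 45.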

	---

	Step 2: MC Signal Weight Update (signal_raw.npy only)

	Normalization:

	- Use luminosity = 10,000 pb^{-1}.
	- For each event (row-by-row), compute the normalization factor as:
	(cross_section * luminosity) / sum_of_weights
	- The normalization factor is event-specific. Do not compute a single global value; apply the formula independently for every row.
	- The values of cross_section and sum_of_weights are found in columns 34 and 33, respectively.
	- Important: If the cross-section value satisfies `np.abs(XSection - 2.64338632e-06) < 1e-10` (corresponding to ttH SM Higgs production), replace it with 0.000116 pb (the correct SM Higgs -> γγ cross-section) in column 34.
	- Use the corrected cross-section value when computing normalization.
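
The normalization rules above could be sketched as follows. The function and constant names are hypothetical; only the column indices, the luminosity, and the two cross-section values come from the text.

```python
import numpy as np

LUMINOSITY = 10_000.0            # pb^-1, as stated above
TTH_XSEC_STORED = 2.64338632e-06  # value that flags the ttH sample
TTH_XSEC_CORRECTED = 0.000116     # pb, replacement value from the text
COL_SUMW, COL_XSEC = 33, 34


def per_event_normalization(signal):
    """Correct the ttH cross section in column 34, then return one
    normalization factor per row (not a single global value)."""
    is_tth = np.abs(signal[:, COL_XSEC] - TTH_XSEC_STORED) < 1e-10
    signal[is_tth, COL_XSEC] = TTH_XSEC_CORRECTED
    return signal[:, COL_XSEC] * LUMINOSITY / signal[:, COL_SUMW]
```

Because the expression is vectorized over rows, each event uses its own cross section and sum of weights.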

	Scale factors:

	- For each event, multiply the following scale factors:
	- scaleFactor_PILEUP (column 37)
	- scaleFactor_PHOTON (column 38)
	- scaleFactor_PhotonTRIGGER (column 39)
	- scaleFactor_ELE (column 40)
	- scaleFactor_MUON (column 41)
	- scaleFactor_LepTRIGGER (column 42)
	- scaleFactor_BTAG (column 43)

	Final weight:

	- Compute the final event weight as:
	final_weight = mcWeight * normalization * (product of all scale factors)
	- Here, mcWeight is taken from column 32.
	- Store the computed final weight back into column 32, replacing the original mcWeight.
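
A short sketch of the final-weight update, assuming `normalization` is a per-event array computed as described in the normalization bullets. The function name `apply_final_weight` is illustrative.

```python
import numpy as np

COL_WEIGHT = 32
SCALE_FACTOR_COLS = slice(37, 44)  # columns 37 through 43 inclusive


def apply_final_weight(signal, normalization):
    """Overwrite column 32 with mcWeight * normalization * product of SFs."""
    sf_product = np.prod(signal[:, SCALE_FACTOR_COLS], axis=1)
    signal[:, COL_WEIGHT] = signal[:, COL_WEIGHT] * normalization * sf_product
    return signal
```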

	---

	Step 3: Kinematic Calculations and Preselection (for both MC and data)

	- For each event, compute diphoton invariant mass and transverse momentum using ROOT.TLorentzVector (do not use the vector module).
	- Store the diphoton invariant mass in column 44 (m_yy).
	- Store the diphoton transverse momentum in column 45 (pt_yy).
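	- When computing ratios that involve m_yy, create a safe denominator first. For example, define `m_yy_safe = np.where(m_yy <= 0, 1e-6, m_yy)` and use `m_yy_safe` in every division. The safe denominator only prevents division by zero: dividing by 1e-6 produces a very large ratio that would pass the cuts, so also require `m_yy > 1e-6` explicitly. Events with m_yy <= 0 must fail the subsequent ratio cuts.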
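
The step above requires ROOT.TLorentzVector, and that loop is not reproduced here. As a cross-check only, the same two quantities can be computed in closed form with NumPy under the assumption that both photons are massless. The function name `diphoton_kinematics` is hypothetical.

```python
import numpy as np


def diphoton_kinematics(events):
    """NumPy cross-check (massless photons assumed), NOT the
    ROOT.TLorentzVector computation requested by the task."""
    pt1, eta1, phi1 = events[:, 0], events[:, 1], events[:, 2]
    pt2, eta2, phi2 = events[:, 3], events[:, 4], events[:, 5]

    # m^2 = 2 pT1 pT2 (cosh(d_eta) - cos(d_phi)) for massless particles
    m2 = 2.0 * pt1 * pt2 * (np.cosh(eta1 - eta2) - np.cos(phi1 - phi2))
    # pT^2 of the vector sum in the transverse plane
    pt2_sum = pt1**2 + pt2**2 + 2.0 * pt1 * pt2 * np.cos(phi1 - phi2)

    # Clip tiny negative values caused by floating-point rounding.
    events[:, 44] = np.sqrt(np.clip(m2, 0.0, None))
    events[:, 45] = np.sqrt(np.clip(pt2_sum, 0.0, None))
    return events
```

Comparing these values against the ROOT results on a handful of events is a cheap way to catch column mix-ups.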

	Apply the following preselection cuts to both MC and data:

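	- Photon pseudorapidity (|eta|): |eta| < 1.37 or 1.52 < |eta| < 2.37 (for each photon)
	- Photon transverse momentum: pt_yy > 25,000 MeV (both photons)
	- Leading photon: (pt_lead / m_yy) > 0.35, where pt_lead is column 0 (the leading photon pT is always stored in column 0)
	- Subleading photon: (pt_sub / m_yy) > 0.25, where pt_sub is column 3 (the subleading photon pT is always stored in column 3)
	- Diphoton invariant mass: 105,000 MeV < m_yy < 160,000 MeV
	- Use the safe denominator defined above for all pT/m_yy ratios so that no division by zero occurs and any event with m_yy <= 1e-6 (effectively zero or negative) automatically fails the ratio requirements.
	- IMPORTANT: Do NOT dynamically determine which photon is leading/subleading using np.maximum or np.minimum. The input arrays are pre-ordered so column 0 is always the leading photon and column 3 is always the subleading photon.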

	- After computing the diphoton variables, set all data event weights (column 32) to 1.0 before background modeling.
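
The preselection could be expressed as a single boolean mask, sketched below. Two assumptions are made: the leading and subleading ratios use the per-photon pT in columns 0 and 3 (per the column table), and the 25,000 MeV cut is applied to column 45 exactly as the text states. The helper names are illustrative.

```python
import numpy as np


def _eta_ok(eta):
    """|eta| < 1.37 or 1.52 < |eta| < 2.37."""
    abs_eta = np.abs(eta)
    return (abs_eta < 1.37) | ((abs_eta > 1.52) & (abs_eta < 2.37))


def preselection_mask(events):
    pt_lead, eta_lead = events[:, 0], events[:, 1]
    pt_sub, eta_sub = events[:, 3], events[:, 4]
    m_yy, pt_yy = events[:, 44], events[:, 45]

    # Explicit validity mask: the safe denominator alone would let
    # m_yy <= 1e-6 pass, because pT / 1e-6 is a huge ratio.
    valid_mass = m_yy > 1e-6
    m_yy_safe = np.where(valid_mass, m_yy, 1e-6)

    return (
        _eta_ok(eta_lead) & _eta_ok(eta_sub)
        # As written in the text; if a per-photon cut is intended,
        # columns 0 and 3 would be used here instead (assumption).
        & (pt_yy > 25_000.0)
        & valid_mass
        & (pt_lead / m_yy_safe > 0.35)
        & (pt_sub / m_yy_safe > 0.25)
        & (m_yy > 105_000.0) & (m_yy < 160_000.0)
    )
```

The mask is applied identically to MC and data, for example `data = data[preselection_mask(data)]`. NaN values compare as False, so events with undefined kinematics are rejected.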

	---

	Step 4a: Final Signal Selection (MC only)

	From the preselected MC events:

	- Before applying photon-ID cuts, build boolean masks for columns 35 and 36 using exact equality: `tight = (column == 1.0)`. Only values exactly equal to 1.0 pass tight ID; treat everything else (including values like 0.0, 0.5, NaNs) as `False`.
	- Keep only events where both photons pass tight photon ID (both boolean flags must be True).
	- Keep only events within the signal region: 123,000 MeV < m_yy < 127,000 MeV

	Save the selected events to:

	- {BASE_DIR}/arrays/signal.npy
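
A minimal sketch of the signal selection, assuming the input has already passed preselection and that column 44 holds m_yy. The function name `select_signal` is hypothetical.

```python
import numpy as np


def select_signal(mc_preselected):
    """Keep events with both photons tight and m_yy in the signal region."""
    # Exact equality: 0.0, 0.5, and NaN all evaluate to False.
    tight_lead = mc_preselected[:, 35] == 1.0
    tight_sub = mc_preselected[:, 36] == 1.0

    m_yy = mc_preselected[:, 44]
    in_signal_region = (m_yy > 123_000.0) & (m_yy < 127_000.0)

    return mc_preselected[tight_lead & tight_sub & in_signal_region]
```

The returned array keeps all 46 columns and can be written to the signal.npy path with `np.save`.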

	---

	Step 4b: Background Modeling and Normalization (real data only)

	Using preselected data events:

	Region definitions:

	- Signal region: 123,000 MeV < m_yy < 127,000 MeV
	- Sideband region: 105,000 MeV < m_yy < 120,000 MeV or 130,000 MeV < m_yy < 160,000 MeV

	Photon ID categories:

	- TI (tight ID): both photons pass tight photon ID (use the boolean masks built with `(column == 1.0)`)
	- NTI (non-tight ID): photons fail tight ID but pass loose ID

	Steps:

	1. Compute yields (sum of weights) for:
	- NTI sideband
	- NTI signal region
	- TI sideband
	2. Calculate scale factors:
	- SF1 = (TI sideband) / (NTI sideband)
	- SF2 = (NTI signal region) / (NTI sideband)
	3. Estimate expected yield in TI signal region:
	- expected_yield = SF1 * SF2 * (NTI sideband)
	4. Retain only NTI sideband events.
	5. Rescale their weights so that the total weight matches expected_yield.
	6. Save the result to:
	- {BASE_DIR}/arrays/bkgd.npy
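
The background procedure could be sketched as below. One assumption is labeled in the code: since no loose-ID column is listed, NTI is taken to mean "not both photons tight", with the loose-ID requirement assumed to be satisfied by the input already. The function name `model_background` is illustrative.

```python
import numpy as np


def model_background(data_preselected):
    """Return NTI sideband events rescaled to the expected TI
    signal-region yield, plus that expected yield."""
    w = data_preselected[:, 32]
    m_yy = data_preselected[:, 44]

    ti = (data_preselected[:, 35] == 1.0) & (data_preselected[:, 36] == 1.0)
    nti = ~ti  # ASSUMPTION: loose ID is already guaranteed upstream

    signal_region = (m_yy > 123_000.0) & (m_yy < 127_000.0)
    sideband = ((m_yy > 105_000.0) & (m_yy < 120_000.0)) | (
        (m_yy > 130_000.0) & (m_yy < 160_000.0)
    )

    nti_sb = w[nti & sideband].sum()
    nti_sr = w[nti & signal_region].sum()
    ti_sb = w[ti & sideband].sum()
    if nti_sb <= 0:
        raise ValueError("NTI sideband yield is zero; cannot normalize")

    sf1 = ti_sb / nti_sb
    sf2 = nti_sr / nti_sb
    expected_yield = sf1 * sf2 * nti_sb

    # Keep only NTI sideband events and rescale their total weight.
    bkgd = data_preselected[nti & sideband].copy()
    bkgd[:, 32] *= expected_yield / bkgd[:, 32].sum()
    return bkgd, expected_yield
```

Algebraically, expected_yield reduces to (TI sideband) * (NTI signal region) / (NTI sideband).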

	---

	Final Output Summary:

	- signal.npy – MC signal events passing preselection, signal region, and tight ID cuts
	- bkgd.npy – Real data events (NTI sideband) rescaled to match expected background