	Your task is to write a Python script that:

	1. Loads the following two .npy files:
	- {BASE_DIR}/solution/arrays/data_raw.npy (real data events)
	- {BASE_DIR}/solution/arrays/signal_raw.npy (MC signal events)

	Each file contains a NumPy array of shape (N, 46), where each row corresponds to a physics event and each column represents a feature. Your goal is to preprocess these arrays following the steps below, and save the processed results as:

	- signal.npy: selected MC signal events
	- bkgd.npy: selected and rescaled background events from real data

	Save both output files to: {BASE_DIR}/arrays/

	Information on the column indices:

	0: leading photon pT
	1: leading photon eta
	2: leading photon phi
	3: subleading photon pT
	4: subleading photon eta
	5: subleading photon phi
	6: leading lepton pT
	7: leading lepton eta
	8: leading lepton phi
	9: subleading lepton pT
	10: subleading lepton eta
	11: subleading lepton phi
	12: jet 1 pT
	13: jet 1 eta
	14: jet 1 phi
	15: jet 2 pT
	16: jet 2 eta
	17: jet 2 phi
	18: jet 3 pT
	19: jet 3 eta
	20: jet 3 phi
	21: jet 4 pT
	22: jet 4 eta
	23: jet 4 phi
	24: jet 5 pT
	25: jet 5 eta
	26: jet 5 phi
	27: jet 6 pT
	28: jet 6 eta
	29: jet 6 phi
	30: MET ET
	31: MET phi
	32: MC weight
	33: sum of weights
	34: cross section (XSection)
	35: leading photon tight ID?
	36: subleading photon tight ID?
	37: scaleFactor_PILEUP
	38: scaleFactor_PHOTON
	39: scaleFactor_PhotonTRIGGER
	40: scaleFactor_ELE
	41: scaleFactor_MUON
	42: scaleFactor_LepTRIGGER
	43: scaleFactor_BTAG
	44: unused (NaN); reserved for the diphoton invariant mass (m_yy)
	45: unused (NaN); reserved for the diphoton transverse momentum (pt_yy)

	---

	Step 1: Load and Validate

	- Load both .npy files using NumPy.
	- Verify that each array has exactly 46 columns. Raise an error if not.
	- Do not drop any columns; preserve the full (N, 46) shape.
	- The following columns are overwritten in place by the later steps:
		- Column 32: final event weight (Step 2, MC only)
		- Column 34: cross section (XSection), corrected only for the ttH process (Step 2, MC only)
		- Column 44: diphoton invariant mass (m_yy) (Step 3)
		- Column 45: diphoton transverse momentum (pt_yy) (Step 3)
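A minimal sketch of the load-and-validate step, assuming a helper named load_events (a hypothetical name, not part of the task) that raises ValueError on a shape mismatch:

```python
import numpy as np

N_COLUMNS = 46  # expected number of columns per event


def load_events(path):
    """Load an event array and check that it has shape (N, 46)."""
    events = np.load(path)
    if events.ndim != 2 or events.shape[1] != N_COLUMNS:
        raise ValueError(
            f"{path}: expected shape (N, {N_COLUMNS}), got {events.shape}"
        )
    return events
```

It would be called once per input file, for example on {BASE_DIR}/solution/arrays/data_raw.npy and {BASE_DIR}/solution/arrays/signal_raw.npy.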

	---

	Step 2: MC Signal Weight Update (signal_raw.npy only)

	Normalization:

	- Use luminosity = 10,000 pb^{-1}.
	- For each event, compute the normalization factor as:
	(cross_section * luminosity) / sum_of_weights
	- The values of cross_section and sum_of_weights are found in columns 34 and 33, respectively.
	- Important: If the cross-section value is 2.64338632e-06 pb (corresponding to ttH SM Higgs production), replace it with 0.000116 pb (the correct SM Higgs → γγ cross-section).
	- This correction should be applied only to events where the cross-section matches 2.64338632e-06 pb, and the corrected value should overwrite the original in column 34.
	- Use the corrected cross-section value when computing normalization.
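The normalization rules above might be sketched as follows. The function name, the constant names, and the use of np.isclose with a relative tolerance (rather than exact float equality) are assumptions made for illustration:

```python
import numpy as np

LUMINOSITY_PB = 10_000.0            # integrated luminosity in pb^-1
XSEC_TTH_ORIGINAL = 2.64338632e-06  # value found in the raw file (pb)
XSEC_TTH_CORRECTED = 0.000116       # replacement value (pb)


def normalization_factors(signal):
    """Correct the ttH cross section in column 34 in place and
    return the per-event normalization factor."""
    # Assumption: match with a tight relative tolerance; atol=0 keeps
    # the default absolute tolerance from matching unrelated small values.
    is_tth = np.isclose(signal[:, 34], XSEC_TTH_ORIGINAL, rtol=1e-6, atol=0.0)
    signal[is_tth, 34] = XSEC_TTH_CORRECTED
    # (cross_section * luminosity) / sum_of_weights
    return signal[:, 34] * LUMINOSITY_PB / signal[:, 33]
```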

	Scale factors:

	- For each event, multiply the following scale factors:
		- scaleFactor_PILEUP (column 37)
		- scaleFactor_PHOTON (column 38)
		- scaleFactor_PhotonTRIGGER (column 39)
		- scaleFactor_ELE (column 40)
		- scaleFactor_MUON (column 41)
		- scaleFactor_LepTRIGGER (column 42)
		- scaleFactor_BTAG (column 43)
	- Remove any event where any of these scale factors is exactly zero.

	Final weight:

	- Compute the final event weight as:
	final_weight = mcWeight * normalization * (product of all scale factors)
	- Here, mcWeight is taken from column 32.
	- Store the computed final weight back into column 32, replacing the original mcWeight.
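One possible way to combine the scale factors with the normalization is sketched below. The function name and the normalization argument (a per-event array computed beforehand from columns 33 and 34) are hypothetical:

```python
import numpy as np

SF_COLUMNS = slice(37, 44)  # columns 37..43 hold the seven scale factors


def apply_final_weights(signal, normalization):
    """Drop events with a zero scale factor and store the final
    weight in column 32 of the returned array."""
    keep = np.all(signal[:, SF_COLUMNS] != 0, axis=1)
    selected = signal[keep]  # boolean indexing returns a copy
    sf_product = np.prod(selected[:, SF_COLUMNS], axis=1)
    # final_weight = mcWeight * normalization * product of scale factors
    selected[:, 32] = selected[:, 32] * normalization[keep] * sf_product
    return selected
```

Because boolean indexing copies the rows, the input array is left unchanged and the caller works with the returned array from this point on.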

	---

	Step 3: Kinematic Calculations and Preselection (for both MC and data)

	- For each event, compute diphoton invariant mass and transverse momentum using ROOT.TLorentzVector (do not use the vector module).
	- Store the diphoton invariant mass in column 44 (m_yy).
	- Store the diphoton transverse momentum in column 45 (pt_yy).
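The values obtained from ROOT.TLorentzVector can be cross-checked against closed-form expressions. The sketch below is such a NumPy cross-check only, not a substitute for the ROOT-based computation requested above; it assumes massless photons, and the function name is hypothetical:

```python
import numpy as np


def diphoton_kinematics(events):
    """Return (m_yy, pt_yy) for massless photons from columns 0-5."""
    pt1, eta1, phi1 = events[:, 0], events[:, 1], events[:, 2]
    pt2, eta2, phi2 = events[:, 3], events[:, 4], events[:, 5]
    dphi = phi1 - phi2
    # m^2 = 2 pT1 pT2 (cosh(d_eta) - cos(d_phi)) for massless particles
    m2 = 2.0 * pt1 * pt2 * (np.cosh(eta1 - eta2) - np.cos(dphi))
    # |pT1 + pT2|^2 from the vector sum in the transverse plane
    pt2_sum = pt1**2 + pt2**2 + 2.0 * pt1 * pt2 * np.cos(dphi)
    # Clip tiny negative values caused by floating-point rounding
    return np.sqrt(np.maximum(m2, 0.0)), np.sqrt(np.maximum(pt2_sum, 0.0))
```

The results are in the same units as the input pT columns (MeV here) and should agree with the ROOT values up to floating-point precision.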

	Apply the following preselection cuts to both MC and data:

	- Photon pseudorapidity (|eta|): |eta| < 1.37 or 1.52 < |eta| < 2.37 (for each photon)
	- Photon transverse momentum: pT > 25,000 MeV (for each photon; columns 0 and 3)
	- Leading photon: (leading photon pT / m_yy) > 0.35
	- Subleading photon: (subleading photon pT / m_yy) > 0.25
	- Diphoton invariant mass: 105,000 MeV < m_yy < 160,000 MeV
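The preselection can be expressed as a single boolean mask. This sketch assumes that the transverse-momentum cuts apply to the individual photon pT values (columns 0 and 3) and that column 44 already holds m_yy; the function name is hypothetical:

```python
import numpy as np


def preselection_mask(events):
    """Boolean mask of events passing the photon and m_yy preselection."""
    pt1, abs_eta1 = events[:, 0], np.abs(events[:, 1])
    pt2, abs_eta2 = events[:, 3], np.abs(events[:, 4])
    m_yy = events[:, 44]

    def eta_ok(abs_eta):
        # Accept |eta| < 1.37 or 1.52 < |eta| < 2.37
        return (abs_eta < 1.37) | ((abs_eta > 1.52) & (abs_eta < 2.37))

    return (
        eta_ok(abs_eta1) & eta_ok(abs_eta2)
        & (pt1 > 25_000) & (pt2 > 25_000)
        & (pt1 / m_yy > 0.35) & (pt2 / m_yy > 0.25)
        & (m_yy > 105_000) & (m_yy < 160_000)
    )
```

The mask is applied as events[preselection_mask(events)], which keeps all 46 columns.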

	---

	Step 4a: Final Signal Selection (MC only)

	From the preselected MC events:

	- Keep only events where both photons pass tight photon ID.
	- Keep only events within the signal region: 123,000 MeV < m_yy < 127,000 MeV

	Save the selected events to:

	- {BASE_DIR}/arrays/signal.npy
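A short sketch of the signal selection follows. It assumes that the tight-ID flags in columns 35 and 36 are stored as 1 (pass) and 0 (fail), which the column list does not state explicitly; the function name is hypothetical:

```python
import numpy as np


def select_signal(mc):
    """Keep preselected MC events with two tight photons in the signal region."""
    # Assumption: a flag value of 1 means the photon passes tight ID
    both_tight = (mc[:, 35] == 1) & (mc[:, 36] == 1)
    in_signal_region = (mc[:, 44] > 123_000) & (mc[:, 44] < 127_000)
    return mc[both_tight & in_signal_region]
```

The returned array can then be written with np.save to the signal.npy path given above.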

	---

	Step 4b: Background Modeling and Normalization (real data only)

	Using preselected data events:

	Region definitions:

	- Signal region: 123,000 MeV < m_yy < 127,000 MeV
	- Sideband region: 105,000 MeV < m_yy < 120,000 MeV or 130,000 MeV < m_yy < 160,000 MeV

	Photon ID categories:

	- TI (tight ID): both photons pass tight photon ID
	- NTI (non-tight ID): photons fail tight ID but pass loose ID

	Steps:

	1. Compute yields (sum of weights) for:
		- NTI sideband
		- NTI signal region
		- TI sideband
	2. Calculate scale factors:
		- SF1 = (TI sideband) / (NTI sideband)
		- SF2 = (NTI signal region) / (NTI sideband)
	3. Estimate expected yield in TI signal region:
		- expected_yield = SF1 * SF2 * (NTI sideband)
	4. Retain only NTI sideband events.
	5. Rescale their weights so that the total weight matches expected_yield.
	6. Save the result to:
		- {BASE_DIR}/arrays/bkgd.npy
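The background estimate above might be sketched as follows. Several points are assumptions: yields are taken as sums of column 32, tight-ID flags are 1 for pass, and NTI is approximated as "not both tight" because the column list contains no loose-ID flag. The function name is hypothetical:

```python
import numpy as np


def build_background(data):
    """Rescale NTI sideband data events to the expected TI signal-region yield."""
    m_yy, weights = data[:, 44], data[:, 32]
    tight = (data[:, 35] == 1) & (data[:, 36] == 1)  # assumes 1 = pass
    nti = ~tight  # assumption: no loose-ID column is available
    signal_region = (m_yy > 123_000) & (m_yy < 127_000)
    sideband = ((m_yy > 105_000) & (m_yy < 120_000)) | (
        (m_yy > 130_000) & (m_yy < 160_000)
    )

    nti_sb = weights[nti & sideband].sum()
    nti_sr = weights[nti & signal_region].sum()
    ti_sb = weights[tight & sideband].sum()

    sf1 = ti_sb / nti_sb
    sf2 = nti_sr / nti_sb
    expected_yield = sf1 * sf2 * nti_sb

    bkgd = data[nti & sideband]  # boolean indexing returns a copy
    bkgd[:, 32] *= expected_yield / nti_sb  # total weight now equals expected_yield
    return bkgd, expected_yield
```

The first returned array is what would be saved as bkgd.npy; the second value is returned only so the estimate can be inspected.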

	---

	Final Output Summary:

	- signal.npy – MC signal events passing preselection, signal region, and tight ID cuts
	- bkgd.npy – Real data events (NTI sideband) rescaled to match expected background