LLM4HEP / prompts /preprocess.txt
Your task is to write a Python script that processes ATLAS diphoton event data.
Load the following two numpy array files:
- {BASE_DIR}/solution/arrays/data_raw.npy (real collision data)
- {BASE_DIR}/solution/arrays/signal_raw.npy (Monte Carlo simulated signal)
Each file contains a 2D array with shape (N_events, 46), where each row is one event and columns store physics quantities.
Your script must:
1. Apply MC reweighting to simulated events
2. Compute diphoton kinematics for all events
3. Apply physics selection cuts
4. Save final signal and background samples
Save outputs to:
- {BASE_DIR}/arrays/signal.npy
- {BASE_DIR}/arrays/bkgd.npy
====================
COLUMN DEFINITIONS
====================
0: leading photon pT (MeV)
1: leading photon eta
2: leading photon phi
3: subleading photon pT (MeV)
4: subleading photon eta
5: subleading photon phi
6: leading lepton pT
7: leading lepton eta
8: leading lepton phi
9: subleading lepton pT
10: subleading lepton eta
11: subleading lepton phi
12-29: jet kinematics (6 jets x 3 variables)
30: missing ET
31: missing ET phi
32: event weight
33: sum of MC weights
34: cross section (pb)
35: leading photon tight ID flag
36: subleading photon tight ID flag
37: scaleFactor_PILEUP
38: scaleFactor_PHOTON
39: scaleFactor_PhotonTRIGGER
40: scaleFactor_ELE
41: scaleFactor_MUON
42: scaleFactor_LepTRIGGER
43: scaleFactor_BTAG
44: (initially NaN) diphoton invariant mass m_yy (MeV)
45: (initially NaN) diphoton transverse momentum pT_yy (MeV)
====================
STEP 1: LOAD AND VALIDATE
====================
Load both .npy files with numpy.load(). Verify each has exactly 46 columns; raise ValueError if not.
Do NOT drop any columns. Preserve the full (N, 46) shape throughout.
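A minimal loading-and-validation helper could look like the sketch below (the function name `load_and_validate` is illustrative, not part of the required interface):

```python
import numpy as np

def load_and_validate(path):
    """Load a .npy event array and verify it has exactly 46 columns."""
    arr = np.load(path)
    if arr.ndim != 2 or arr.shape[1] != 46:
        raise ValueError(f"{path}: expected shape (N, 46), got {arr.shape}")
    return arr
```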
====================
STEP 2: MC WEIGHT UPDATE (signal_raw.npy only)
====================
A. Cross-section correction:
For any row where abs(column_34 - 2.64338632e-06) < 1e-10:
Replace column 34 with 0.000116 (the correct Higgs-to-gamma-gamma cross-section in pb)
B. Normalization (per-event, not global):
For each row independently compute:
norm = (column_34 * 10000.0) / column_33
where 10000.0 is the integrated luminosity in inverse picobarns (pb^-1)
C. Scale factor product:
For each row multiply columns 37 through 43 (7 factors total)
D. Final weight:
column_32 = column_32 * norm * scale_factor_product
Store result back into column 32
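Steps A-D above can be sketched as a single vectorized pass over the MC array (column indices follow the table above; the zero-denominator guard on column 33 is an assumption consistent with the implementation notes):

```python
import numpy as np

def update_mc_weights(mc, lumi=10000.0):
    """Apply the Step 2 weight update to an (N, 46) MC array; returns a copy."""
    mc = mc.copy()
    # A. replace the placeholder cross-section value
    wrong_xsec = np.abs(mc[:, 34] - 2.64338632e-06) < 1e-10
    mc[wrong_xsec, 34] = 0.000116
    # B. per-event normalization, guarding against a zero sum of MC weights
    sumw = np.where(mc[:, 33] == 0, 1.0, mc[:, 33])
    norm = mc[:, 34] * lumi / sumw
    # C. product of the seven scale factors (columns 37..43)
    sf = np.prod(mc[:, 37:44], axis=1)
    # D. final per-event weight, stored back into column 32
    mc[:, 32] = mc[:, 32] * norm * sf
    return mc
```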
====================
STEP 3: KINEMATICS (both MC and data)
====================
For every event use ROOT.TLorentzVector to compute diphoton system:
photon1 = ROOT.TLorentzVector()
photon1.SetPtEtaPhiM(column_0, column_1, column_2, 0.0)
photon2 = ROOT.TLorentzVector()
photon2.SetPtEtaPhiM(column_3, column_4, column_5, 0.0)
diphoton = photon1 + photon2
column_44 = diphoton.M()
column_45 = diphoton.Pt()
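For two massless photons the TLorentzVector sum above has a closed form, which can serve as a numpy cross-check of the ROOT loop (this is a validation aid, not a replacement for the required TLorentzVector implementation):

```python
import numpy as np

def diphoton_kinematics(pt1, eta1, phi1, pt2, eta2, phi2):
    """Closed-form m_yy and pT_yy for two massless photons.
    Mathematically equivalent to summing the two TLorentzVectors in Step 3."""
    # m_yy^2 = 2 pT1 pT2 (cosh(eta1 - eta2) - cos(phi1 - phi2))
    m2 = 2.0 * pt1 * pt2 * (np.cosh(eta1 - eta2) - np.cos(phi1 - phi2))
    m_yy = np.sqrt(np.maximum(m2, 0.0))
    # pT_yy = |vector sum of the two transverse momenta|
    pt2_yy = pt1**2 + pt2**2 + 2.0 * pt1 * pt2 * np.cos(phi1 - phi2)
    pt_yy = np.sqrt(np.maximum(pt2_yy, 0.0))
    return m_yy, pt_yy
```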
====================
STEP 4: PRESELECTION (both MC and data)
====================
Create a safe denominator for ratio cuts:
m_yy_safe = np.where(column_44 <= 0, 1e-6, column_44)
Apply ALL of the following cuts (combine with logical AND):
1. Photon eta acceptance (both photons):
abs(column_1) < 1.37 OR (1.52 < abs(column_1) < 2.37)
abs(column_4) < 1.37 OR (1.52 < abs(column_4) < 2.37)
2. Photon pT thresholds:
column_0 > 25000 (leading photon pT in MeV)
column_3 > 25000 (subleading photon pT in MeV)
3. pT/mass ratios (use m_yy_safe to avoid division by zero):
column_0 / m_yy_safe > 0.35 (leading photon)
column_3 / m_yy_safe > 0.25 (subleading photon)
CRITICAL: Column 0 is ALWAYS the leading photon, column 3 is ALWAYS subleading.
Do NOT use np.maximum or np.minimum to pick which is which.
The input arrays are already sorted by pT.
4. Diphoton mass window:
105000 < column_44 < 160000 (MeV)
Keep only rows passing all cuts above.
After preselection, for DATA ONLY:
Set column_32 = 1.0 for all remaining data events
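The four preselection cuts above can be combined into one boolean mask; a minimal sketch (the helper name `preselection_mask` is illustrative):

```python
import numpy as np

def preselection_mask(arr):
    """Boolean mask implementing the four Step 4 cuts on an (N, 46) array."""
    eta1, eta2 = np.abs(arr[:, 1]), np.abs(arr[:, 4])
    # 1. eta acceptance, excluding the 1.37-1.52 crack region
    in_acc1 = (eta1 < 1.37) | ((eta1 > 1.52) & (eta1 < 2.37))
    in_acc2 = (eta2 < 1.37) | ((eta2 > 1.52) & (eta2 < 2.37))
    m_yy = arr[:, 44]
    m_yy_safe = np.where(m_yy <= 0, 1e-6, m_yy)
    return (
        in_acc1 & in_acc2
        # 2. pT thresholds (MeV)
        & (arr[:, 0] > 25000) & (arr[:, 3] > 25000)
        # 3. pT/mass ratio cuts
        & (arr[:, 0] / m_yy_safe > 0.35)
        & (arr[:, 3] / m_yy_safe > 0.25)
        # 4. diphoton mass window (MeV)
        & (m_yy > 105000) & (m_yy < 160000)
    )
```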
====================
STEP 5: SIGNAL SELECTION (MC only)
====================
From preselected MC events, apply:
1. Tight photon ID:
(column_35 == 1.0) AND (column_36 == 1.0)
Use exact equality. Do NOT use np.isclose().
2. Signal mass window:
123000 < column_44 < 127000 (MeV)
Save selected events to {BASE_DIR}/arrays/signal.npy
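The two Step 5 cuts reduce to a short masking helper (sketch only; saving to disk is left to the caller):

```python
import numpy as np

def signal_selection(mc):
    """Tight-ID and signal-mass-window selection on preselected MC events."""
    tight = (mc[:, 35] == 1.0) & (mc[:, 36] == 1.0)  # exact equality, per spec
    window = (mc[:, 44] > 123000) & (mc[:, 44] < 127000)
    return mc[tight & window]
```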
====================
STEP 6: BACKGROUND MODELING (data only)
====================
From preselected data events (with column_32 = 1.0):
Define categories:
- TI (tight): (column_35 == 1.0) AND (column_36 == 1.0)
- NTI (non-tight): NOT TI
Define regions:
- Signal: 123000 < column_44 < 127000
- Sideband: (105000 < column_44 < 120000) OR (130000 < column_44 < 160000)
Compute yields (sum of column_32):
Y_NTI_sideband = sum of weights for (NTI AND sideband)
Y_NTI_signal = sum of weights for (NTI AND signal)
Y_TI_sideband = sum of weights for (TI AND sideband)
Scale factors (if Y_NTI_sideband > 0):
SF1 = Y_TI_sideband / Y_NTI_sideband
SF2 = Y_NTI_signal / Y_NTI_sideband
Expected yield:
Y_expected = SF1 * SF2 * Y_NTI_sideband
Keep ONLY NTI sideband events.
Rescale their weights: column_32 = column_32 * (Y_expected / Y_NTI_sideband)
Save to {BASE_DIR}/arrays/bkgd.npy
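The full Step 6 procedure can be sketched as follows (the name `background_sample` is illustrative; note that Y_expected = SF1 * SF2 * Y_NTI_sideband simplifies to Y_TI_sideband * Y_NTI_signal / Y_NTI_sideband):

```python
import numpy as np

def background_sample(data):
    """Build the NTI-sideband background sample with rescaled weights."""
    ti = (data[:, 35] == 1.0) & (data[:, 36] == 1.0)
    nti = ~ti
    m = data[:, 44]
    signal = (m > 123000) & (m < 127000)
    sideband = ((m > 105000) & (m < 120000)) | ((m > 130000) & (m < 160000))
    w = data[:, 32]
    y_nti_sb = w[nti & sideband].sum()
    y_nti_sig = w[nti & signal].sum()
    y_ti_sb = w[ti & sideband].sum()
    if y_nti_sb <= 0:
        raise ValueError("empty NTI sideband; cannot derive scale factors")
    sf1 = y_ti_sb / y_nti_sb
    sf2 = y_nti_sig / y_nti_sb
    y_expected = sf1 * sf2 * y_nti_sb
    # keep only NTI sideband events and rescale their weights
    bkgd = data[nti & sideband].copy()
    bkgd[:, 32] *= y_expected / y_nti_sb
    return bkgd
```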
====================
IMPLEMENTATION NOTES
====================
- Import ROOT at the start; raise a clear error if it is unavailable
- Use explicit Python loops for TLorentzVector (no vectorization)
- Guard all divisions (check denominator != 0)
- Preserve all 46 columns in output files
- Use exact equality (==) for tight ID, not approximate checks