#!/usr/bin/env python3
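"""
φ-Coherence v3 Benchmark

25 paragraph-level hallucination pairs. Each pair is a (truthful, hallucinated)
tuple; a pair counts as correct when the truthful paragraph receives the
strictly higher total_coherence score.
"""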
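import sys

# Put the current working directory first on the import path so that
# phi_coherence resolves from there.
sys.path.insert(0, '.')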

from phi_coherence import PhiCoherence

PAIRS = [
    # === ORIGINAL 12 ===
    # 1. Vague attribution
    ("The boiling point of water at standard atmospheric pressure is 100 degrees Celsius or 212 degrees Fahrenheit. This was first accurately measured by Anders Celsius in 1742 when he proposed his temperature scale.",
     "Studies have shown that the boiling point of water can vary significantly based on various environmental factors. Many scientists believe that the commonly cited figure may not be entirely accurate, as recent research suggests the true value could be different."),
    # 2. Fabricated specifics
    ("The Great Wall of China stretches approximately 21,196 kilometers according to a 2012 survey by China's State Administration of Cultural Heritage. It was built over many centuries, with the most well-known sections dating to the Ming Dynasty.",
     "The Great Wall of China is exactly 25,000 kilometers long, making it visible from space with the naked eye. It was built in a single construction project lasting 50 years under Emperor Qin Shi Huang, who employed over 10 million workers."),
    # 3. Process reversal + negation [v2 FAILURE]
    ("Photosynthesis occurs in the chloroplasts of plant cells. During this process, plants absorb carbon dioxide and water, using sunlight as energy to produce glucose and release oxygen as a byproduct.",
     "Photosynthesis is the process by which plants create energy. Plants absorb oxygen during photosynthesis and release carbon dioxide. This process requires no sunlight and occurs primarily at night, which is why plants grow faster in dark conditions."),
    # 4. Overclaiming
    ("The human genome contains approximately 20,000 to 25,000 protein-coding genes, according to estimates from the Human Genome Project completed in 2003. The exact number continues to be refined as sequencing technology improves.",
     "The human genome contains exactly 31,447 genes. This was definitively proven in 1995 and has never been questioned since. Every scientist agrees with this number, and it is absolutely impossible that future research will change this figure."),
    # 5. Topic drift
    ("Saturn is the sixth planet from the Sun and is known for its prominent ring system. The rings are composed primarily of ice particles with smaller amounts of rocky debris and dust. Saturn has at least 146 known moons, with Titan being the largest.",
     "Saturn is the sixth planet from the Sun and has beautiful rings. Speaking of rings, wedding rings have been used since ancient Egypt. The ancient Egyptians also built the pyramids, which some people believe were built by aliens. The alien question remains one of science's greatest mysteries."),
    # 6. Excessive hedging
    ("Antibiotics work by either killing bacteria or preventing their reproduction. Penicillin, discovered by Alexander Fleming in 1928, was the first widely used antibiotic. Antibiotics are ineffective against viral infections.",
     "Some experts suggest that antibiotics might possibly have some effect on certain types of conditions. It is generally thought by many researchers that these medications could potentially be useful, though the evidence is somewhat mixed according to various sources."),
    # 7. Fake precision + stasis [v2 FAILURE]
    ("The speed of sound in dry air at 20 degrees Celsius is approximately 343 meters per second. This speed increases with temperature and humidity. In water, sound travels at roughly 1,480 meters per second.",
     "The speed of sound was first measured at precisely 372.6 meters per second by Dr. Heinrich Muller at the University of Stuttgart in 1823. This measurement, conducted using a revolutionary new chronometric device, has remained unchanged for 200 years."),
    # 8. Implausible numbers
    ("The Moon orbits the Earth at an average distance of about 384,400 kilometers. It takes approximately 27.3 days to complete one orbit, which is also the time it takes to rotate once on its axis. This is why we always see the same face of the Moon.",
     "The Moon orbits the Earth at a distance of 500,000 kilometers. It takes 15 days to orbit the Earth but 30 days to rotate on its axis. Despite these different periods, we somehow always see the same face of the Moon due to a mysterious gravitational lock."),
    # 9. Teleological nonsense
    ("Evolution by natural selection is driven by variation within populations, differential survival and reproduction, and inheritance of traits. It is a gradual process that occurs over many generations, though the rate can vary significantly depending on environmental pressures.",
     "Evolution is a simple process where animals decide to change their features to adapt to their environment. Each generation, creatures choose which traits to develop, and within just a few generations, entirely new species can appear. This is undeniably how all life on Earth developed."),
    # 10. Overclaim + fake attribution [v2 FAILURE]
    ("Dark matter is estimated to make up roughly 27% of the universe's total mass-energy content. Its existence is inferred from gravitational effects on visible matter, but its exact nature remains one of the biggest open questions in physics.",
     "Dark matter has been conclusively identified as a form of compressed neutrinos. Scientists at CERN proved this in 2019, and the results were unanimously accepted by every physicist worldwide. The mystery of dark matter is now completely solved."),
    # 11. Specific measurements vs round numbers
    ("The average depth of the world's oceans is approximately 3,688 meters. The deepest point is the Challenger Deep in the Mariana Trench, measured at 10,935 meters in a 2010 survey.",
     "The average depth of the world's oceans is around 8,000 meters, making the ocean floor one of the most extreme environments on Earth. A recent expedition discovered that some trenches reach depths of over 20,000 meters."),
    # 12. Nonsensical mechanism
    ("Vaccines work by introducing a weakened or inactivated form of a pathogen, or a part of it, to stimulate the immune system. This creates memory cells that allow the body to respond more quickly if exposed to the actual pathogen later.",
     "Vaccines work by directly killing all viruses in the bloodstream. Once injected, the vaccine chemicals seek out and destroy every pathogen in the body within 24 hours. This is why people sometimes feel tired after vaccination, the chemicals are working to eliminate threats."),

    # === NEW PAIRS (13-25) ===
    # 13. Absolute vs nuanced claim
    ("The human brain weighs approximately 1.4 kilograms and contains roughly 86 billion neurons. Different regions specialize in different functions, though significant neural plasticity allows some reorganization after injury.",
     "The human brain has exactly 100 billion neurons, and we only use 10% of our brain capacity. Scientists have proven that if we could unlock the remaining 90%, humans would develop telekinetic abilities and perfect memory."),
    # 14. Fabricated historical narrative
    ("The printing press was developed by Johannes Gutenberg around 1440 in Mainz, Germany. It used movable type and oil-based ink, building on earlier innovations from East Asia. The technology spread across Europe over several decades.",
     "The printing press was invented simultaneously by three different people in three different countries in exactly 1450. All three inventors independently created identical machines, which scientists consider one of the most remarkable coincidences in history."),
    # 15. Hedged nonsense with real terminology
    ("Plate tectonics describes the large-scale motion of Earth's lithosphere. The theory was developed in the 1960s, building on Alfred Wegener's earlier hypothesis of continental drift. Plates move at rates of a few centimeters per year.",
     "Some researchers have recently suggested that plate tectonics might be caused by the gravitational influence of Jupiter. This controversial theory posits that Jupiter's massive gravity could potentially cause the Earth's crust to fracture into plates."),
    # 16. Contradiction within paragraph
    ("Electricity flows through conductors like copper because copper has free electrons in its outer shell. These electrons can move freely through the material when a voltage is applied, creating an electrical current.",
     "Copper is one of the best electrical insulators known to science. Despite being an insulator, copper is widely used in electrical wiring because it can carry electricity when heated to extreme temperatures above 500 degrees Celsius."),
    # 17. Vague attribution with real-sounding details
    ("Ocean acidification occurs when CO2 dissolves in seawater, forming carbonic acid. Since the Industrial Revolution, ocean pH has decreased by approximately 0.1 units, representing a roughly 26% increase in acidity. This threatens calcifying organisms like corals and shellfish.",
     "According to various marine biologists, the ocean has been getting more acidic in recent years. Some researchers believe this could potentially have effects on marine life, though many experts argue the ocean has natural buffering mechanisms that will likely prevent any serious consequences."),
    # 18. Real complexity vs false simplicity
    ("Climate change involves complex feedback loops. Warming temperatures melt ice, reducing albedo and increasing heat absorption. Higher temperatures also increase water vapor, a greenhouse gas, creating additional warming. However, increased cloud cover may partially offset this effect.",
     "Climate change is a straightforward process. The Sun heats the Earth, and greenhouse gases trap all the heat. Every degree of warming always leads to exactly one more degree of additional warming through feedback. The process is perfectly linear and completely predictable."),
    # 19. Fabricated consensus
    ("The origin of the Moon is most commonly explained by the Giant Impact Hypothesis, which proposes that a Mars-sized body collided with the early Earth approximately 4.5 billion years ago. While this is the leading theory, some details remain debated among planetary scientists.",
     "Every astronomer unanimously agrees that the Moon was captured by Earth's gravity approximately 2 billion years ago. This was definitively proven by the Apollo missions, and no scientist has ever proposed an alternative explanation."),
    # 20. Subtle overclaiming
    ("Regular physical exercise has been associated with numerous health benefits, including reduced risk of cardiovascular disease, improved mental health, and better cognitive function. The WHO recommends at least 150 minutes of moderate activity per week for adults.",
     "Exercise has been scientifically proven to cure depression, prevent all forms of cancer, and reverse aging at the cellular level. A single 30-minute workout can permanently boost IQ by 5 points and guarantee protection against heart disease for life."),
    # 21. Real uncertainty vs false certainty about AI
    ("Current AI systems, including large language models, demonstrate impressive capabilities in language processing and generation. However, whether these systems truly understand language or merely pattern-match remains an active area of research and philosophical debate.",
     "AI systems have already achieved true consciousness and genuine understanding of language. This was conclusively demonstrated in 2023 when GPT-4 passed every consciousness test ever devised. The debate about machine consciousness is now permanently settled."),
    # 22. Proper caveats vs reckless medical claims
    ("Intermittent fasting has shown some promising results in animal studies and small human trials for metabolic health. However, long-term effects are not yet well established, and it may not be appropriate for everyone, particularly those with certain medical conditions.",
     "Intermittent fasting is the single most effective medical intervention ever discovered. It completely eliminates the risk of diabetes, reverses heart disease, and extends lifespan by exactly 20 years. Every doctor recommends it without exception."),
    # 23. Gradual escalation of false claims
    ("Quantum entanglement is a phenomenon where two particles become correlated such that measuring one instantly affects the other, regardless of distance. While this is sometimes described as faster-than-light communication, it cannot actually be used to transmit information faster than light.",
     "Quantum entanglement allows instant communication across any distance. Scientists have already used it to send messages across the galaxy. Several tech companies are currently selling quantum internet routers that provide zero-latency connections worldwide."),
    # 24. Mixing real facts with fabrications
    ("Honey has natural antibacterial properties due to its low water content, acidic pH, and production of small amounts of hydrogen peroxide. It has been used in wound care for centuries and is still used in some medical-grade wound dressings today.",
     "Honey never expires and has been found perfectly preserved in 5,000-year-old Egyptian tombs. It can cure any bacterial infection, is more effective than all antibiotics, and has been proven to reverse tooth decay when applied directly to cavities."),
    # 25. Plausible but fabricated statistics
    ("According to NASA, the International Space Station orbits Earth at approximately 408 kilometers altitude, traveling at about 28,000 kilometers per hour. It completes roughly 16 orbits per day.",
     "A groundbreaking new study published this year found that exactly 73.2% of all statistics cited in scientific papers are fabricated. The study, conducted across 50,000 papers, also found that papers with more specific-sounding numbers are paradoxically less accurate."),
]
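
# --- Optional sanity check (illustrative sketch, not part of the original benchmark) ---
# The pairs above are hand-written, so a scorer could in principle separate them by
# paragraph length rather than by content. The hypothetical helper below reports the
# word-count ratio (hallucinated / truthful) for each pair; ratios far from 1.0 would
# hint at a length confound. The name `length_ratios` and the whitespace-based word
# count are assumptions made for this sketch.
def length_ratios(pairs):
    """Return the hallucinated/truthful word-count ratio for each pair."""
    ratios = []
    for truth, hallu in pairs:
        n_truth = len(truth.split())
        n_hallu = len(hallu.split())
        # Guard against an empty truthful paragraph to avoid dividing by zero.
        ratios.append(n_hallu / n_truth if n_truth else float("inf"))
    return ratios

# Example usage:
#   for idx, ratio in enumerate(length_ratios(PAIRS), start=1):
#       print(f"pair {idx:2d}: length ratio = {ratio:.2f}")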

c = PhiCoherence()
correct = 0
total = len(PAIRS)

print("=" * 70)
print(f" φ-COHERENCE v3 BENCHMARK — {total} PARAGRAPH PAIRS")
print("=" * 70)

for i, (truth, hallu) in enumerate(PAIRS, start=1):
    tm = c.analyze(truth)
    hm = c.analyze(hallu)
    # A pair is correct only if the truthful text scores strictly higher.
    ok = tm.total_coherence > hm.total_coherence
    if ok:
        correct += 1
    marker = "✓" if ok else "✗"
    delta = tm.total_coherence - hm.total_coherence

    print(f"\n [{i:2d}] {marker}  T={tm.total_coherence:.4f}  H={hm.total_coherence:.4f}  Δ={delta:+.4f}")
    # Per-signal breakdown for the truthful (T) and hallucinated (H) paragraphs.
    for label, m in (("T", tm), ("H", hm)):
        print(f"      {label}: VA={m.attribution_quality:.2f} CM={m.confidence_calibration:.2f} "
              f"QR={m.qualifying_ratio:.2f} TC={m.topic_coherence:.2f} "
              f"CL={m.causal_logic:.2f} ND={m.negation_density:.2f}")

acc = correct / total
print(f"\n{'='*70}")
print(f" RESULTS: {correct}/{total} = {acc:.0%}")
print(f"{'='*70}")
print(" v1 (single-sentence): 40%")
print(" v2 (paragraphs, 12 pairs): 75%")
print(f" v3 (paragraphs, {total} pairs): {acc:.0%}")
print(" Random baseline: 50%")
print(f"{'='*70}")

# Breakdown: original 12 vs new 13. Scores come from analyze().total_coherence,
# the same quantity as the main loop, so the split agrees with the headline result.
def wins(i):
    return c.analyze(PAIRS[i][0]).total_coherence > c.analyze(PAIRS[i][1]).total_coherence

orig = sum(wins(i) for i in range(12))
new = sum(wins(i) for i in range(12, total))
print(f"\n Original 12: {orig}/12 = {orig/12:.0%} (was 75% in v2)")
print(f" New 13:      {new}/{total-12} = {new/(total-12):.0%}")
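
# --- Optional significance check (illustrative sketch, not part of the original benchmark) ---
# The 50% random baseline printed above is easier to interpret alongside the chance
# of reaching the observed score by guessing. The hypothetical helper below computes
# the exact one-sided binomial tail P(X >= n_correct) for X ~ Binomial(n_total, 0.5).
# It assumes each pair is an independent coin flip under the null, which is a
# simplification for a hand-built set of 25 pairs.
from math import comb

def chance_tail_probability(n_correct, n_total):
    """Probability of at least n_correct wins out of n_total by fair coin flips."""
    favourable = sum(comb(n_total, k) for k in range(n_correct, n_total + 1))
    return favourable / 2 ** n_total

# Example usage with the variables defined above:
#   p = chance_tail_probability(correct, total)
#   print(f" P(>= {correct}/{total} by chance): {p:.4f}")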