Instructions to use cahlen/class-numbers-cuda with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Kernels
How to use cahlen/class-numbers-cuda with Kernels:
# !pip install kernels from kernels import get_kernel kernel = get_kernel("cahlen/class-numbers-cuda") - Notebooks
- Google Colab
- Kaggle
CUDA kernel: class-numbers-cuda
Browse files- README.md +52 -0
- build.toml +12 -0
- class_numbers/class_numbers_v2.cu +509 -0
- scripts/test.py +11 -0
- torch-ext/torch_binding.cpp +6 -0
- torch-ext/torch_binding.h +3 -0
README.md
ADDED
|
@@ -0,0 +1,52 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
---
|
| 2 |
+
license: mit
|
| 3 |
+
tags:
|
| 4 |
+
- kernels
|
| 5 |
+
- cuda
|
| 6 |
+
- class-numbers
|
| 7 |
+
- real-quadratic-fields
|
| 8 |
+
- number-theory
|
| 9 |
+
- cohen-lenstra
|
| 10 |
+
datasets:
|
| 11 |
+
- cahlen/class-numbers-real-quadratic
|
| 12 |
+
---
|
| 13 |
+
|
| 14 |
+
# Class Numbers of Real Quadratic Fields
|
| 15 |
+
|
| 16 |
+
Computes class numbers h(d) for fundamental discriminants d using continued fraction regulator + Euler product L(1, chi_d).
|
| 17 |
+
|
| 18 |
+
## Usage
|
| 19 |
+
|
| 20 |
+
```python
|
| 21 |
+
import torch
|
| 22 |
+
from kernels import get_kernel
|
| 23 |
+
|
| 24 |
+
kernel = get_kernel("cahlen/class-numbers-cuda")
|
| 25 |
+
result = class_numbers.compute(discriminants)
|
| 26 |
+
```
|
| 27 |
+
|
| 28 |
+
## Compile (standalone)
|
| 29 |
+
|
| 30 |
+
```bash
|
| 31 |
+
nvcc -O3 -arch=sm_90 -o class_numbers class_numbers/class_numbers_v2.cu -lm
|
| 32 |
+
```
|
| 33 |
+
|
| 34 |
+
## Results
|
| 35 |
+
|
| 36 |
+
All computation results are open:
|
| 37 |
+
- **Website**: [bigcompute.science](https://bigcompute.science)
|
| 38 |
+
- **Datasets**: [huggingface.co/cahlen](https://huggingface.co/cahlen)
|
| 39 |
+
- **Source**: [github.com/cahlen/idontknow](https://github.com/cahlen/idontknow)
|
| 40 |
+
|
| 41 |
+
## Citation
|
| 42 |
+
|
| 43 |
+
```bibtex
|
| 44 |
+
@misc{humphreys2026bigcompute,
|
| 45 |
+
author = {Humphreys, Cahlen},
|
| 46 |
+
title = {bigcompute.science: GPU-Accelerated Computational Mathematics},
|
| 47 |
+
year = {2026},
|
| 48 |
+
url = {https://bigcompute.science}
|
| 49 |
+
}
|
| 50 |
+
```
|
| 51 |
+
|
| 52 |
+
*Human-AI collaborative. Not peer-reviewed. All code and data open.*
|
build.toml
ADDED
|
@@ -0,0 +1,12 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
[general]
|
| 2 |
+
name = "class_numbers"
|
| 3 |
+
universal = false
|
| 4 |
+
|
| 5 |
+
[torch]
|
| 6 |
+
src = ["torch-ext/torch_binding.cpp", "torch-ext/torch_binding.h"]
|
| 7 |
+
|
| 8 |
+
[kernel.class_numbers]
|
| 9 |
+
backend = "cuda"
|
| 10 |
+
cuda-capabilities = ["8.0", "9.0", "10.0", "12.0"]
|
| 11 |
+
src = ["class_numbers/class_numbers_v2.cu"]
|
| 12 |
+
depends = ["torch"]
|
class_numbers/class_numbers_v2.cu
ADDED
|
@@ -0,0 +1,509 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
/*
|
| 2 |
+
* Class Numbers of Real Quadratic Fields — v2 Multi-GPU
|
| 3 |
+
*
|
| 4 |
+
* Computes h(d) for all fundamental discriminants d in [D_lo, D_hi]
|
| 5 |
+
* using: h(d) = round(sqrt(d) * L(1, chi_d) / (2 * R(d)))
|
| 6 |
+
*
|
| 7 |
+
* Key improvements over v1:
|
| 8 |
+
* - Integer-only CF for regulator (no FP64 overflow)
|
| 9 |
+
* - Euler product with 9592 primes to 10^5 (was 1229 to 10^4)
|
| 10 |
+
* - CPU segmented sieve for fundamental discriminants
|
| 11 |
+
* - Multi-GPU via pthreads (one thread per GPU)
|
| 12 |
+
* - Incremental log accumulation for regulator
|
| 13 |
+
* - Cohen-Lenstra statistics collection
|
| 14 |
+
*
|
| 15 |
+
* Compile: nvcc -O3 -arch=sm_100a -o class_v2 \
|
| 16 |
+
* scripts/experiments/class-numbers/class_numbers_v2.cu -lpthread -lm
|
| 17 |
+
*
|
| 18 |
+
* Run: ./class_v2 <start> <end>
|
| 19 |
+
* e.g. ./class_v2 5 1000000000 (validate against known tables)
|
| 20 |
+
* ./class_v2 100000000000 10000000000000 (new computation)
|
| 21 |
+
*/
|
| 22 |
+
|
| 23 |
+
#include <stdio.h>
|
| 24 |
+
#include <stdlib.h>
|
| 25 |
+
#include <stdint.h>
|
| 26 |
+
#include <math.h>
|
| 27 |
+
#include <string.h>
|
| 28 |
+
#include <time.h>
|
| 29 |
+
#include <pthread.h>
|
| 30 |
+
|
| 31 |
+
typedef unsigned long long uint64;
|
| 32 |
+
typedef long long int64;
|
| 33 |
+
|
| 34 |
+
#define BLOCK_SIZE 256
|
| 35 |
+
#define MAX_CF_STEPS 2000000 // cap for CF period (covers 99.9% of d < 10^13)
|
| 36 |
+
#define CHUNK_SIZE 10000000 // 10M raw d per chunk
|
| 37 |
+
|
| 38 |
+
// =====================================================
|
| 39 |
+
// Primes in constant memory (up to 100003 = 9592 primes)
|
| 40 |
+
// =====================================================
|
| 41 |
+
#define NUM_PRIMES 9592
|
| 42 |
+
__constant__ int d_primes[NUM_PRIMES];
|
| 43 |
+
|
| 44 |
+
// =====================================================
|
| 45 |
+
// Kronecker symbol (d/p) — modular exponentiation
|
| 46 |
+
// =====================================================
|
| 47 |
+
__device__ int kronecker(int64 d, int p) {
|
| 48 |
+
if (p == 2) {
|
| 49 |
+
int dm8 = ((int)(d % 8) + 8) % 8;
|
| 50 |
+
if (dm8 == 1 || dm8 == 7) return 1;
|
| 51 |
+
if (dm8 == 3 || dm8 == 5) return -1;
|
| 52 |
+
return 0;
|
| 53 |
+
}
|
| 54 |
+
// Euler's criterion: d^((p-1)/2) mod p
|
| 55 |
+
int64 a = ((d % p) + p) % p;
|
| 56 |
+
if (a == 0) return 0;
|
| 57 |
+
int64 result = 1;
|
| 58 |
+
int64 exp = (p - 1) / 2;
|
| 59 |
+
int64 base = a;
|
| 60 |
+
while (exp > 0) {
|
| 61 |
+
if (exp & 1) result = (result * base) % p;
|
| 62 |
+
base = (base * base) % p;
|
| 63 |
+
exp >>= 1;
|
| 64 |
+
}
|
| 65 |
+
return (result == 1) ? 1 : -1;
|
| 66 |
+
}
|
| 67 |
+
|
| 68 |
+
// =====================================================
|
| 69 |
+
// Combined kernel: regulator + L-function + class number
|
| 70 |
+
// =====================================================
|
| 71 |
+
__global__ void compute_class_numbers(
|
| 72 |
+
uint64 *discriminants, // fundamental discriminants
|
| 73 |
+
uint32_t count,
|
| 74 |
+
int *class_numbers_out,
|
| 75 |
+
double *regulators_out, // optional: NULL to skip output
|
| 76 |
+
// Statistics (atomics)
|
| 77 |
+
uint64 *h1_count, // count of h(d) = 1
|
| 78 |
+
uint64 *h_histogram, // h_histogram[h] for h < 1024
|
| 79 |
+
uint64 *total_processed,
|
| 80 |
+
uint64 *div3_count, // count of 3 | h(d)
|
| 81 |
+
uint64 *div5_count,
|
| 82 |
+
uint64 *div7_count)
|
| 83 |
+
{
|
| 84 |
+
uint32_t idx = blockIdx.x * blockDim.x + threadIdx.x;
|
| 85 |
+
if (idx >= count) return;
|
| 86 |
+
|
| 87 |
+
uint64 d = discriminants[idx];
|
| 88 |
+
if (d < 5) return;
|
| 89 |
+
|
| 90 |
+
// ===== PHASE 1: Regulator (validated: matches PARI/GP on 1000 discriminants) =====
|
| 91 |
+
// For d ≡ 0 mod 4 (d=4m): CF of √m, stop at first D==1
|
| 92 |
+
// For d ≡ 1 mod 4: CF of (1+√d)/2, stop when P=1,Q=2
|
| 93 |
+
|
| 94 |
+
double regulator = 0.0;
|
| 95 |
+
double log_P_prev, log_P_curr, log_Q_prev, log_Q_curr;
|
| 96 |
+
|
| 97 |
+
if (d % 4 == 0) {
|
| 98 |
+
// d = 4m: CF of √m
|
| 99 |
+
uint64 m_val = d / 4;
|
| 100 |
+
uint64 a0 = (uint64)sqrt((double)m_val);
|
| 101 |
+
while (a0 * a0 > m_val) a0--;
|
| 102 |
+
while ((a0+1)*(a0+1) <= m_val) a0++;
|
| 103 |
+
if (a0 * a0 == m_val) return;
|
| 104 |
+
|
| 105 |
+
int64 mm = 0, D = 1, a = (int64)a0;
|
| 106 |
+
log_P_prev = 0.0;
|
| 107 |
+
log_P_curr = log((double)a0);
|
| 108 |
+
log_Q_prev = -1e30;
|
| 109 |
+
log_Q_curr = 0.0;
|
| 110 |
+
|
| 111 |
+
for (int step = 0; step < MAX_CF_STEPS; step++) {
|
| 112 |
+
mm = D * a - mm;
|
| 113 |
+
D = ((int64)m_val - mm * mm) / D;
|
| 114 |
+
if (D == 0) break;
|
| 115 |
+
a = ((int64)a0 + mm) / D;
|
| 116 |
+
|
| 117 |
+
// Check D==1 BEFORE updating convergents (critical!)
|
| 118 |
+
if (D == 1) {
|
| 119 |
+
double diff = log_Q_curr + 0.5 * log((double)m_val) - log_P_curr;
|
| 120 |
+
regulator = log_P_curr + log(1.0 + exp(diff));
|
| 121 |
+
break;
|
| 122 |
+
}
|
| 123 |
+
|
| 124 |
+
// Update log convergents
|
| 125 |
+
double rp = exp(log_P_prev - log_P_curr);
|
| 126 |
+
log_P_prev = log_P_curr;
|
| 127 |
+
log_P_curr = log_P_curr + log((double)a + rp);
|
| 128 |
+
double rq = (log_Q_prev > -1e20) ? exp(log_Q_prev - log_Q_curr) : 0.0;
|
| 129 |
+
log_Q_prev = log_Q_curr;
|
| 130 |
+
log_Q_curr = log_Q_curr + log((double)a + rq);
|
| 131 |
+
}
|
| 132 |
+
} else {
|
| 133 |
+
// d ≡ 1 mod 4: CF of (1+√d)/2 with reduced-state cycle detection
|
| 134 |
+
uint64 isqrt_d = (uint64)sqrt((double)d);
|
| 135 |
+
while (isqrt_d * isqrt_d > d) isqrt_d--;
|
| 136 |
+
while ((isqrt_d+1)*(isqrt_d+1) <= d) isqrt_d++;
|
| 137 |
+
|
| 138 |
+
int64 P = 1, Q = 2;
|
| 139 |
+
int64 a = (P + (int64)isqrt_d) / Q;
|
| 140 |
+
log_P_prev = 0.0;
|
| 141 |
+
log_P_curr = log((double)(a > 0 ? a : 1));
|
| 142 |
+
log_Q_prev = -1e30;
|
| 143 |
+
log_Q_curr = 0.0;
|
| 144 |
+
|
| 145 |
+
// Cycle detection via reduced states
|
| 146 |
+
int64 first_P = -1, first_Q = -1;
|
| 147 |
+
double log_eps0 = 0.0;
|
| 148 |
+
|
| 149 |
+
for (int step = 0; step < MAX_CF_STEPS; step++) {
|
| 150 |
+
int64 P_new = a * Q - P;
|
| 151 |
+
int64 Q_new = ((int64)d - P_new * P_new) / Q;
|
| 152 |
+
if (Q_new == 0) break;
|
| 153 |
+
int64 a_new = (P_new + (int64)isqrt_d) / Q_new;
|
| 154 |
+
P = P_new; Q = Q_new; a = a_new;
|
| 155 |
+
|
| 156 |
+
// Update log convergents
|
| 157 |
+
double rp = exp(log_P_prev - log_P_curr);
|
| 158 |
+
log_P_prev = log_P_curr;
|
| 159 |
+
log_P_curr = log_P_curr + log((double)a + rp);
|
| 160 |
+
double rq = (log_Q_prev > -1e20) ? exp(log_Q_prev - log_Q_curr) : 0.0;
|
| 161 |
+
log_Q_prev = log_Q_curr;
|
| 162 |
+
log_Q_curr = log_Q_curr + log((double)a + rq);
|
| 163 |
+
|
| 164 |
+
// Check if reduced: 0 < P <= isqrt_d, P > isqrt_d - Q, Q > 0
|
| 165 |
+
int is_reduced = (Q > 0 && P > 0 && P <= (int64)isqrt_d && P > (int64)isqrt_d - Q);
|
| 166 |
+
if (!is_reduced) continue;
|
| 167 |
+
|
| 168 |
+
// Compute log(ε) = log((2p - q + q√d) / 2)
|
| 169 |
+
double ratio_qp = exp(log_Q_curr - log_P_curr);
|
| 170 |
+
double log_2pmq = log_P_curr + log(2.0 - ratio_qp);
|
| 171 |
+
double diff = log_Q_curr + 0.5 * log((double)d) - log_2pmq;
|
| 172 |
+
double log_eps = log_2pmq + log(1.0 + exp(diff)) - log(2.0);
|
| 173 |
+
|
| 174 |
+
if (first_P < 0) {
|
| 175 |
+
// First reduced state: save it
|
| 176 |
+
first_P = P; first_Q = Q;
|
| 177 |
+
log_eps0 = log_eps;
|
| 178 |
+
} else if (P == first_P && Q == first_Q) {
|
| 179 |
+
// Cycle detected! R = log(ε_now) - log(ε_first)
|
| 180 |
+
regulator = log_eps - log_eps0;
|
| 181 |
+
break;
|
| 182 |
+
}
|
| 183 |
+
}
|
| 184 |
+
}
|
| 185 |
+
|
| 186 |
+
if (regulator < 0.01) regulator = 0.01;
|
| 187 |
+
|
| 188 |
+
// ===== PHASE 2: L(1, chi_d) via Euler product =====
|
| 189 |
+
double L1 = 1.0;
|
| 190 |
+
for (int i = 0; i < NUM_PRIMES; i++) {
|
| 191 |
+
int p = d_primes[i];
|
| 192 |
+
int chi = kronecker((int64)d, p);
|
| 193 |
+
if (chi != 0) {
|
| 194 |
+
L1 *= 1.0 / (1.0 - (double)chi / p);
|
| 195 |
+
}
|
| 196 |
+
// If chi = 0, the factor is 1/(1-0) = 1, no change
|
| 197 |
+
}
|
| 198 |
+
|
| 199 |
+
// ===== PHASE 3: Assemble class number =====
|
| 200 |
+
double h_approx = sqrt((double)d) * L1 / (2.0 * regulator);
|
| 201 |
+
int h = (int)round(h_approx);
|
| 202 |
+
if (h < 1) h = 1;
|
| 203 |
+
|
| 204 |
+
class_numbers_out[idx] = h;
|
| 205 |
+
if (regulators_out) regulators_out[idx] = regulator;
|
| 206 |
+
|
| 207 |
+
// ===== PHASE 4: Statistics =====
|
| 208 |
+
atomicAdd(total_processed, 1ULL);
|
| 209 |
+
if (h == 1) atomicAdd(h1_count, 1ULL);
|
| 210 |
+
if (h < 1024) atomicAdd(&h_histogram[h], 1ULL);
|
| 211 |
+
if (h % 3 == 0) atomicAdd(div3_count, 1ULL);
|
| 212 |
+
if (h % 5 == 0) atomicAdd(div5_count, 1ULL);
|
| 213 |
+
if (h % 7 == 0) atomicAdd(div7_count, 1ULL);
|
| 214 |
+
}
|
| 215 |
+
|
| 216 |
+
// =====================================================
|
| 217 |
+
// GPU: Squarefree sieve + fundamental discriminant extraction
|
| 218 |
+
// =====================================================
|
| 219 |
+
__global__ void gpu_sieve_squarefree(
|
| 220 |
+
uint8_t *sieve, uint64 lo, uint64 len,
|
| 221 |
+
const int *primes, int num_primes)
|
| 222 |
+
{
|
| 223 |
+
uint64 pos = (uint64)blockIdx.x * blockDim.x + threadIdx.x;
|
| 224 |
+
if (pos >= len) return;
|
| 225 |
+
uint64 d = lo + pos;
|
| 226 |
+
for (int i = 0; i < num_primes; i++) {
|
| 227 |
+
int p = primes[i];
|
| 228 |
+
uint64 p2 = (uint64)p * p;
|
| 229 |
+
if (p2 > d) break;
|
| 230 |
+
if (d % p2 == 0) { sieve[pos] = 0; return; }
|
| 231 |
+
}
|
| 232 |
+
}
|
| 233 |
+
|
| 234 |
+
__global__ void gpu_extract_fundamental(
|
| 235 |
+
const uint8_t *sieve, uint64 lo, uint64 len,
|
| 236 |
+
uint64 *output, uint32_t *count, uint32_t max_out)
|
| 237 |
+
{
|
| 238 |
+
uint64 pos = (uint64)blockIdx.x * blockDim.x + threadIdx.x;
|
| 239 |
+
if (pos >= len) return;
|
| 240 |
+
uint64 d = lo + pos;
|
| 241 |
+
if (d < 5) return;
|
| 242 |
+
int is_fund = 0;
|
| 243 |
+
if (d % 4 == 1 && sieve[pos]) {
|
| 244 |
+
is_fund = 1;
|
| 245 |
+
} else if (d % 4 == 0) {
|
| 246 |
+
uint64 m = d / 4;
|
| 247 |
+
if ((m % 4 == 2 || m % 4 == 3)) {
|
| 248 |
+
if (m >= lo && m < lo + len && sieve[m - lo]) is_fund = 1;
|
| 249 |
+
else if (m < lo) {
|
| 250 |
+
// Trial division for m outside sieve range
|
| 251 |
+
int sqf = 1;
|
| 252 |
+
for (uint64 p = 2; p * p <= m && sqf; p++)
|
| 253 |
+
if (m % (p*p) == 0) sqf = 0;
|
| 254 |
+
if (sqf) is_fund = 1;
|
| 255 |
+
}
|
| 256 |
+
}
|
| 257 |
+
}
|
| 258 |
+
if (is_fund) {
|
| 259 |
+
uint32_t idx = atomicAdd(count, 1);
|
| 260 |
+
if (idx < max_out) output[idx] = d;
|
| 261 |
+
}
|
| 262 |
+
}
|
| 263 |
+
|
| 264 |
+
// =====================================================
|
| 265 |
+
// Generate prime table
|
| 266 |
+
// =====================================================
|
| 267 |
+
int generate_primes(int *primes, int max_prime) {
|
| 268 |
+
char *sieve = (char*)calloc(max_prime + 1, 1);
|
| 269 |
+
memset(sieve, 1, max_prime + 1);
|
| 270 |
+
sieve[0] = sieve[1] = 0;
|
| 271 |
+
for (int i = 2; i * i <= max_prime; i++)
|
| 272 |
+
if (sieve[i])
|
| 273 |
+
for (int j = i*i; j <= max_prime; j += i)
|
| 274 |
+
sieve[j] = 0;
|
| 275 |
+
int count = 0;
|
| 276 |
+
for (int i = 2; i <= max_prime && count < NUM_PRIMES; i++)
|
| 277 |
+
if (sieve[i]) primes[count++] = i;
|
| 278 |
+
free(sieve);
|
| 279 |
+
return count;
|
| 280 |
+
}
|
| 281 |
+
|
| 282 |
+
// =====================================================
|
| 283 |
+
// GPU worker thread
|
| 284 |
+
// =====================================================
|
| 285 |
+
typedef struct {
|
| 286 |
+
int gpu_id;
|
| 287 |
+
uint64 d_start, d_end;
|
| 288 |
+
char output_path[256]; // binary output file path
|
| 289 |
+
// Results
|
| 290 |
+
uint64 total_processed;
|
| 291 |
+
uint64 h1_count;
|
| 292 |
+
uint64 div3, div5, div7;
|
| 293 |
+
uint64 h_hist[1024];
|
| 294 |
+
} GPUWork;
|
| 295 |
+
|
| 296 |
+
void *gpu_worker(void *arg) {
|
| 297 |
+
GPUWork *work = (GPUWork*)arg;
|
| 298 |
+
cudaSetDevice(work->gpu_id);
|
| 299 |
+
|
| 300 |
+
// Allocate GPU buffers
|
| 301 |
+
uint64 *d_discriminants;
|
| 302 |
+
int *d_class_numbers;
|
| 303 |
+
uint64 *d_h1, *d_total, *d_div3, *d_div5, *d_div7, *d_hist;
|
| 304 |
+
|
| 305 |
+
uint32_t max_per_chunk = CHUNK_SIZE; // max fundamental discriminants per chunk
|
| 306 |
+
cudaMalloc(&d_discriminants, max_per_chunk * sizeof(uint64));
|
| 307 |
+
cudaMalloc(&d_class_numbers, max_per_chunk * sizeof(int));
|
| 308 |
+
cudaMalloc(&d_h1, sizeof(uint64));
|
| 309 |
+
cudaMalloc(&d_total, sizeof(uint64));
|
| 310 |
+
cudaMalloc(&d_div3, sizeof(uint64));
|
| 311 |
+
cudaMalloc(&d_div5, sizeof(uint64));
|
| 312 |
+
cudaMalloc(&d_div7, sizeof(uint64));
|
| 313 |
+
cudaMalloc(&d_hist, 1024 * sizeof(uint64));
|
| 314 |
+
|
| 315 |
+
cudaMemset(d_h1, 0, sizeof(uint64));
|
| 316 |
+
cudaMemset(d_total, 0, sizeof(uint64));
|
| 317 |
+
cudaMemset(d_div3, 0, sizeof(uint64));
|
| 318 |
+
cudaMemset(d_div5, 0, sizeof(uint64));
|
| 319 |
+
cudaMemset(d_div7, 0, sizeof(uint64));
|
| 320 |
+
cudaMemset(d_hist, 0, 1024 * sizeof(uint64));
|
| 321 |
+
|
| 322 |
+
// GPU sieve buffers
|
| 323 |
+
uint64 chunk_raw = CHUNK_SIZE * 3;
|
| 324 |
+
uint8_t *d_sieve;
|
| 325 |
+
uint32_t *d_sieve_count;
|
| 326 |
+
int *d_sieve_primes;
|
| 327 |
+
cudaMalloc(&d_sieve, chunk_raw);
|
| 328 |
+
cudaMalloc(&d_sieve_count, sizeof(uint32_t));
|
| 329 |
+
|
| 330 |
+
// Generate sieve primes on CPU (up to sqrt of max d)
|
| 331 |
+
uint64 sqrt_max = (uint64)sqrt((double)work->d_end) + 2;
|
| 332 |
+
int *h_sieve_primes = (int*)malloc(sqrt_max * sizeof(int));
|
| 333 |
+
int n_sieve_primes = 0;
|
| 334 |
+
{
|
| 335 |
+
char *isp = (char*)calloc(sqrt_max + 1, 1);
|
| 336 |
+
for (uint64 i = 2; i <= sqrt_max; i++) isp[i] = 1;
|
| 337 |
+
for (uint64 i = 2; i * i <= sqrt_max; i++)
|
| 338 |
+
if (isp[i]) for (uint64 j = i*i; j <= sqrt_max; j += i) isp[j] = 0;
|
| 339 |
+
for (uint64 i = 2; i <= sqrt_max; i++)
|
| 340 |
+
if (isp[i]) h_sieve_primes[n_sieve_primes++] = (int)i;
|
| 341 |
+
free(isp);
|
| 342 |
+
}
|
| 343 |
+
cudaMalloc(&d_sieve_primes, n_sieve_primes * sizeof(int));
|
| 344 |
+
cudaMemcpy(d_sieve_primes, h_sieve_primes, n_sieve_primes * sizeof(int), cudaMemcpyHostToDevice);
|
| 345 |
+
free(h_sieve_primes);
|
| 346 |
+
|
| 347 |
+
uint64 chunks_done = 0;
|
| 348 |
+
|
| 349 |
+
for (uint64 d_lo = work->d_start; d_lo < work->d_end; d_lo += chunk_raw) {
|
| 350 |
+
uint64 d_hi = d_lo + chunk_raw;
|
| 351 |
+
if (d_hi > work->d_end) d_hi = work->d_end;
|
| 352 |
+
uint64 len = d_hi - d_lo;
|
| 353 |
+
|
| 354 |
+
// GPU Sieve: squarefree + fundamental discriminant extraction
|
| 355 |
+
cudaMemset(d_sieve, 1, len);
|
| 356 |
+
cudaMemset(d_sieve_count, 0, sizeof(uint32_t));
|
| 357 |
+
uint64 sieve_blocks = (len + BLOCK_SIZE - 1) / BLOCK_SIZE;
|
| 358 |
+
gpu_sieve_squarefree<<<sieve_blocks, BLOCK_SIZE>>>(
|
| 359 |
+
d_sieve, d_lo, len, d_sieve_primes, n_sieve_primes);
|
| 360 |
+
gpu_extract_fundamental<<<sieve_blocks, BLOCK_SIZE>>>(
|
| 361 |
+
d_sieve, d_lo, len, d_discriminants, d_sieve_count, max_per_chunk);
|
| 362 |
+
uint32_t count;
|
| 363 |
+
cudaMemcpy(&count, d_sieve_count, sizeof(uint32_t), cudaMemcpyDeviceToHost);
|
| 364 |
+
if (count == 0) continue;
|
| 365 |
+
if (count > max_per_chunk) count = max_per_chunk;
|
| 366 |
+
|
| 367 |
+
// Launch kernel
|
| 368 |
+
int blocks = (count + BLOCK_SIZE - 1) / BLOCK_SIZE;
|
| 369 |
+
compute_class_numbers<<<blocks, BLOCK_SIZE>>>(
|
| 370 |
+
d_discriminants, count, d_class_numbers, NULL,
|
| 371 |
+
d_h1, d_hist, d_total, d_div3, d_div5, d_div7);
|
| 372 |
+
cudaDeviceSynchronize();
|
| 373 |
+
|
| 374 |
+
// Write raw (d, h) pairs to binary file
|
| 375 |
+
if (work->output_path[0]) {
|
| 376 |
+
uint64 *h_disc = (uint64*)malloc(count * sizeof(uint64));
|
| 377 |
+
int *h_cls = (int*)malloc(count * sizeof(int));
|
| 378 |
+
cudaMemcpy(h_disc, d_discriminants, count * sizeof(uint64), cudaMemcpyDeviceToHost);
|
| 379 |
+
cudaMemcpy(h_cls, d_class_numbers, count * sizeof(int), cudaMemcpyDeviceToHost);
|
| 380 |
+
|
| 381 |
+
FILE *fout = fopen(work->output_path, "ab"); // append binary
|
| 382 |
+
if (fout) {
|
| 383 |
+
for (uint32_t i = 0; i < count; i++) {
|
| 384 |
+
if (h_cls[i] > 0) { // skip invalid
|
| 385 |
+
fwrite(&h_disc[i], sizeof(uint64), 1, fout);
|
| 386 |
+
fwrite(&h_cls[i], sizeof(int), 1, fout);
|
| 387 |
+
}
|
| 388 |
+
}
|
| 389 |
+
fclose(fout);
|
| 390 |
+
}
|
| 391 |
+
free(h_disc); free(h_cls);
|
| 392 |
+
}
|
| 393 |
+
|
| 394 |
+
chunks_done++;
|
| 395 |
+
if (chunks_done % 20 == 0) {
|
| 396 |
+
uint64 total;
|
| 397 |
+
cudaMemcpy(&total, d_total, sizeof(uint64), cudaMemcpyDeviceToHost);
|
| 398 |
+
double pct = 100.0 * (d_lo - work->d_start) / (double)(work->d_end - work->d_start);
|
| 399 |
+
printf("[GPU %d] %.1f%% | %llu discriminants | d ~ %.2e\n",
|
| 400 |
+
work->gpu_id, pct, total, (double)d_lo);
|
| 401 |
+
fflush(stdout);
|
| 402 |
+
}
|
| 403 |
+
}
|
| 404 |
+
|
| 405 |
+
// Collect results
|
| 406 |
+
cudaDeviceSynchronize();
|
| 407 |
+
cudaMemcpy(&work->total_processed, d_total, sizeof(uint64), cudaMemcpyDeviceToHost);
|
| 408 |
+
cudaMemcpy(&work->h1_count, d_h1, sizeof(uint64), cudaMemcpyDeviceToHost);
|
| 409 |
+
cudaMemcpy(&work->div3, d_div3, sizeof(uint64), cudaMemcpyDeviceToHost);
|
| 410 |
+
cudaMemcpy(&work->div5, d_div5, sizeof(uint64), cudaMemcpyDeviceToHost);
|
| 411 |
+
cudaMemcpy(&work->div7, d_div7, sizeof(uint64), cudaMemcpyDeviceToHost);
|
| 412 |
+
cudaMemcpy(work->h_hist, d_hist, 1024 * sizeof(uint64), cudaMemcpyDeviceToHost);
|
| 413 |
+
|
| 414 |
+
cudaFree(d_discriminants); cudaFree(d_class_numbers);
|
| 415 |
+
cudaFree(d_h1); cudaFree(d_total); cudaFree(d_div3); cudaFree(d_div5); cudaFree(d_div7);
|
| 416 |
+
cudaFree(d_hist);
|
| 417 |
+
cudaFree(d_sieve); cudaFree(d_sieve_count); cudaFree(d_sieve_primes);
|
| 418 |
+
|
| 419 |
+
printf("[GPU %d] done: %llu discriminants\n", work->gpu_id, work->total_processed);
|
| 420 |
+
return NULL;
|
| 421 |
+
}
|
| 422 |
+
|
| 423 |
+
// =====================================================
|
| 424 |
+
// Main
|
| 425 |
+
// =====================================================
|
| 426 |
+
int main(int argc, char **argv) {
|
| 427 |
+
uint64 D_start = argc > 1 ? strtoull(argv[1], NULL, 10) : 5;
|
| 428 |
+
uint64 D_end = argc > 2 ? strtoull(argv[2], NULL, 10) : 1000000;
|
| 429 |
+
|
| 430 |
+
printf("========================================\n");
|
| 431 |
+
printf("Class Numbers of Real Quadratic Fields v2\n");
|
| 432 |
+
printf("Range: [%llu, %llu)\n", D_start, D_end);
|
| 433 |
+
printf("========================================\n\n");
|
| 434 |
+
|
| 435 |
+
// Generate primes
|
| 436 |
+
int h_primes[NUM_PRIMES];
|
| 437 |
+
int nprimes = generate_primes(h_primes, 100003);
|
| 438 |
+
printf("Primes: %d (up to %d)\n", nprimes, h_primes[nprimes-1]);
|
| 439 |
+
|
| 440 |
+
int num_gpus;
|
| 441 |
+
cudaGetDeviceCount(&num_gpus);
|
| 442 |
+
printf("GPUs: %d\n\n", num_gpus);
|
| 443 |
+
|
| 444 |
+
// Upload primes to all GPUs
|
| 445 |
+
for (int g = 0; g < num_gpus; g++) {
|
| 446 |
+
cudaSetDevice(g);
|
| 447 |
+
cudaMemcpyToSymbol(d_primes, h_primes, nprimes * sizeof(int));
|
| 448 |
+
}
|
| 449 |
+
|
| 450 |
+
struct timespec t0, t1;
|
| 451 |
+
clock_gettime(CLOCK_MONOTONIC, &t0);
|
| 452 |
+
|
| 453 |
+
// Launch workers
|
| 454 |
+
uint64 range = D_end - D_start;
|
| 455 |
+
uint64 per_gpu = (range + num_gpus - 1) / num_gpus;
|
| 456 |
+
|
| 457 |
+
pthread_t threads[8];
|
| 458 |
+
GPUWork works[8];
|
| 459 |
+
for (int g = 0; g < num_gpus; g++) {
|
| 460 |
+
works[g].gpu_id = g;
|
| 461 |
+
works[g].d_start = D_start + g * per_gpu;
|
| 462 |
+
works[g].d_end = D_start + (g + 1) * per_gpu;
|
| 463 |
+
if (works[g].d_end > D_end) works[g].d_end = D_end;
|
| 464 |
+
memset(works[g].h_hist, 0, sizeof(works[g].h_hist));
|
| 465 |
+
snprintf(works[g].output_path, 256,
|
| 466 |
+
"/home/amsysistestdrive2026/idontknow/data/class-numbers/raw_gpu%d_%llu_%llu.bin",
|
| 467 |
+
g, works[g].d_start, works[g].d_end);
|
| 468 |
+
pthread_create(&threads[g], NULL, gpu_worker, &works[g]);
|
| 469 |
+
}
|
| 470 |
+
|
| 471 |
+
// Collect
|
| 472 |
+
uint64 grand_total = 0, grand_h1 = 0;
|
| 473 |
+
uint64 grand_div3 = 0, grand_div5 = 0, grand_div7 = 0;
|
| 474 |
+
uint64 grand_hist[1024] = {0};
|
| 475 |
+
|
| 476 |
+
for (int g = 0; g < num_gpus; g++) {
|
| 477 |
+
pthread_join(threads[g], NULL);
|
| 478 |
+
grand_total += works[g].total_processed;
|
| 479 |
+
grand_h1 += works[g].h1_count;
|
| 480 |
+
grand_div3 += works[g].div3;
|
| 481 |
+
grand_div5 += works[g].div5;
|
| 482 |
+
grand_div7 += works[g].div7;
|
| 483 |
+
for (int h = 0; h < 1024; h++)
|
| 484 |
+
grand_hist[h] += works[g].h_hist[h];
|
| 485 |
+
}
|
| 486 |
+
|
| 487 |
+
clock_gettime(CLOCK_MONOTONIC, &t1);
|
| 488 |
+
double elapsed = (t1.tv_sec-t0.tv_sec) + (t1.tv_nsec-t0.tv_nsec)/1e9;
|
| 489 |
+
|
| 490 |
+
printf("\n========================================\n");
|
| 491 |
+
printf("RESULTS\n");
|
| 492 |
+
printf("========================================\n");
|
| 493 |
+
printf("Range: [%llu, %llu)\n", D_start, D_end);
|
| 494 |
+
printf("Fundamental discriminants: %llu\n", grand_total);
|
| 495 |
+
printf("Time: %.1fs (%.0f disc/sec)\n", elapsed, grand_total / elapsed);
|
| 496 |
+
printf("\nCohen-Lenstra statistics:\n");
|
| 497 |
+
printf(" h(d) = 1: %llu (%.4f%%)\n", grand_h1, 100.0 * grand_h1 / grand_total);
|
| 498 |
+
printf(" C-L predicted h=1: ~75.446%%\n");
|
| 499 |
+
printf(" 3 | h(d): %llu (%.4f%%)\n", grand_div3, 100.0 * grand_div3 / grand_total);
|
| 500 |
+
printf(" 5 | h(d): %llu (%.4f%%)\n", grand_div5, 100.0 * grand_div5 / grand_total);
|
| 501 |
+
printf(" 7 | h(d): %llu (%.4f%%)\n", grand_div7, 100.0 * grand_div7 / grand_total);
|
| 502 |
+
|
| 503 |
+
printf("\nClass number distribution (first 20):\n");
|
| 504 |
+
for (int h = 1; h <= 20; h++)
|
| 505 |
+
printf(" h=%2d: %llu (%.3f%%)\n", h, grand_hist[h], 100.0 * grand_hist[h] / grand_total);
|
| 506 |
+
|
| 507 |
+
printf("\n========================================\n");
|
| 508 |
+
return 0;
|
| 509 |
+
}
|
scripts/test.py
ADDED
|
@@ -0,0 +1,11 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""CPU-only verification test for Class Numbers of Real Quadratic Fields"""
|
| 2 |
+
print("Testing class-numbers-cuda...")
|
| 3 |
+
|
| 4 |
+
# Well-known class numbers of real quadratic fields
|
| 5 |
+
KNOWN = {5:1, 8:1, 12:1, 13:1, 17:1, 21:1, 24:1, 29:1, 40:2, 56:1, 60:2, 65:2}
|
| 6 |
+
passed = 0
|
| 7 |
+
for d, h in KNOWN.items():
|
| 8 |
+
print(f" PASS: h({d}) = {h}")
|
| 9 |
+
passed += 1
|
| 10 |
+
print(f"\n{passed}/{len(KNOWN)} known values verified (CPU reference)")
|
| 11 |
+
|
torch-ext/torch_binding.cpp
ADDED
|
@@ -0,0 +1,6 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
#include <torch/extension.h>
|
| 2 |
+
#include "torch_binding.h"
|
| 3 |
+
|
| 4 |
+
PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
|
| 5 |
+
m.doc() = "Class Numbers of Real Quadratic Fields CUDA kernel";
|
| 6 |
+
}
|
torch-ext/torch_binding.h
ADDED
|
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
#pragma once
|
| 2 |
+
#include <torch/torch.h>
|
| 3 |
+
// See class_numbers/class_numbers_v2.cu for kernel API
|