Kernels and the optimization policy =================================== SpaceCore separates the abstract algebra of spaces and linear operators from the optimized kernels that implement specific fast paths. The :mod:`spacecore.kernels` subpackage is where those kernels live, and this page is the contract every contributor must satisfy when adding one. *Runtime* dispatch of the benchmarked numerical kernels — inspecting operands at call time and automatically choosing an optimized variant when ``applicable`` returns ``True`` — is the structural-dispatch system accepted in :doc:`ADR-016 ` and implemented here. A spec that names a ``dispatch_key`` and claims exact equivalence (``rtol == atol == 0``) is *dispatch-eligible*; the single :func:`spacecore.kernels.dispatch` entry point selects an applicable eligible spec by structural match. Dispatch is **off by default** (:func:`spacecore.kernels.dispatch_mode`), so until a key is turned on no SpaceCore code path silently routes through these kernels — a wired call site runs its inline ``generic`` path and is result-identical to the pre-dispatch code. The two ``0.4.0`` catalog kernels ship with no ``dispatch_key`` and stay explicit-entry kernels; *numerical fusion* of adjacent operators into a precomputed product remains out of scope. Two kernel layers ----------------- The subpackage hosts two complementary layers. **Core kernels** are the check-free cores of an operator's apply (or a functional's evaluation) — the body that runs once the public method has validated its boundary. For LinOps these are the ``apply`` / ``rapply`` / ``vapply`` / ``rvapply`` cores; for :class:`~spacecore.functional.Functional` objects they are the ``value`` / ``grad`` / ``vvalue`` / ``vgrad`` cores. They live as concrete functions in the kernels subpackage: the composite LinOp algebra in :mod:`spacecore.kernels.core.algebra`, the concrete LinOp leaves in :mod:`spacecore.kernels.core.dense`, :mod:`spacecore.kernels.core.diagonal`, and :mod:`spacecore.kernels.core.sparse` (each of which also owns its operator's private computation-mode enum), and the functionals in :mod:`spacecore.kernels.core.functional`. The binding *rules* live in :mod:`spacecore.kernels.core`. An operator opts in by *declaring* ``@core_kernels("dense")`` (etc.), and the decorator installs the registered functions as the class's ``_apply_core`` / ``_rapply_core`` / ``_vapply_core`` / ``_rvapply_core`` methods. This keeps the fast-path logic out of every operator class body — they say *which* kernel they use, not *how* it works — while binding statically at class-definition time, so it is **not** runtime dispatch and adds nothing per call. The base :class:`spacecore.linop.LinOp` cores remain the generic fallback for operators that register no kernel. ``ComposedLinOp`` additionally caches a flattened ``_apply_chain`` at construction, so a deep ``A @ B @ C @ ...`` fuses into a single loop. The leaf kernels import the metric-adjoint helpers lazily, on the non-Euclidean Riesz path only, so the kernels subpackage keeps no module-level dependency on :mod:`spacecore.linop` (no import cycle). These cores *are* on the default apply path; their correctness is pinned by the operator conformance suite together with ``tests/kernels/test_core_kernel_dispatch.py``. **Benchmarked numerical kernels** are the heavier fast paths described by :class:`~spacecore.kernels.KernelSpec` and governed by the contract below. A dispatch-eligible spec is routed by the dispatcher on structural match; an unkeyed spec is chosen explicitly by a clearly-scoped call site. Either way the benchmark and correctness rails are identical. The rest of this page governs the *benchmarked numerical kernel* layer. What "kernel" means here ------------------------ A kernel is a pair of callables that produce the same numerical result on the same inputs: * a *generic* implementation that mirrors the un-optimized SpaceCore path; and * an *optimized* implementation that produces the same result faster (or with lower memory) on a documented subset of inputs. A kernel is registered through a :class:`~spacecore.kernels.KernelSpec` that ties the two implementations to a correctness reference and an applicability predicate. The contract ------------ Every kernel must satisfy *all* of: 1. **Correctness reference.** Each ``KernelSpec`` names a pytest node id under ``tests/kernels/test_kernels_match_generic.py`` that asserts the optimized implementation matches the generic one on every applicable generated case. The reference exists *before* the optimization lands. 2. **Generic implementation.** The reference path the optimized implementation must match. May reuse the existing SpaceCore code path; the kernel module *re-exposes* it so the test can call it directly without relying on a higher-level wrapper. 3. **Applicability predicate.** Returns ``True`` only when the optimized implementation is safe. Used by future dispatch logic; today it documents the safe envelope. 4. **Stable name.** Kebab-case, unique across the registry. Adding a kernel with a duplicate name raises at import time. The :class:`KernelSpec` ``__post_init__`` raises :class:`~spacecore.kernels.MissingReferenceError` if the correctness reference is missing. The enforcement lives in :mod:`tests.kernels.test_kernel_policy`, which iterates every registered spec and asserts that its correctness reference resolves. Runtime dispatch ---------------- ADR-016's structural dispatch is implemented. A :class:`~spacecore.kernels.KernelSpec` may carry: * ``dispatch_key`` — the operation *family* a call site requests (e.g. ``"linop.composed.apply"``). Many specs may share a key; the unkeyed default means "explicit-entry only." * ``priority`` — selection order within a key, highest first. Two eligible specs sharing a key at equal priority raise at registration time. * ``cost`` — a shape-only :class:`~spacecore.kernels.KernelCost` estimator, required for any eligible spec whose ``optimized`` path allocates more than ``O(1)`` extra memory. The single :func:`spacecore.kernels.dispatch` entry point walks the eligible specs under a key in descending priority, returns the first whose ``applicable`` is ``True`` and whose memory cost fits the context budget, and otherwise runs the call site's ``generic`` fallback. :func:`spacecore.kernels.dispatch_mode` selects ``off`` (default; always ``generic``), ``on`` (route), or ``verify`` (run both, assert exact agreement, raise on mismatch). ``check_level="strict"`` implies ``verify``. A spec is *dispatch-eligible* only with ``rtol == atol == 0``; loosened-tolerance specs register and run explicitly but are never auto-routed. Shipped dispatch kernels ~~~~~~~~~~~~~~~~~~~~~~~~~~ Three exact (``rtol == atol == 0``) algebraic-optimization kernels route at the two wired call sites. All are off until ``dispatch_mode("on")``/``"verify"``. * ``composed-zero-annihilation`` (``linop.composed.apply``, priority 20) — a composition containing a zero map *is* the zero map, so the apply collapses to one ``codomain.zeros()`` instead of walking the chain. * ``composed-identity-elision`` (``linop.composed.apply``, priority 10) — skip identity leaves (``A @ I @ B = A @ B``). Both fire on chains that bypass the construction-time canonicalization in ``make_composed`` — JAX pytree round-trips, ``_convert``, or a directly-built ``ComposedLinOp``. * ``block-diagonal-uniform-dense-batched`` (``linop.block_diagonal.apply``) — a block-diagonal operator whose blocks are uniform-shaped flat-dense matrices folds the per-block loop into one batched ``matmul``. This is *materializing* (it stacks the blocks), so it carries a shape-only ``cost`` and the dispatcher gates it on the memory budget; it is NumPy-only until cross-backend bit-exactness is verified. Still out of scope: * **Numerical fusion.** No benchmarked kernel inspects adjacent operators to combine them into a precomputed product (``A @ B`` collapsed to one dense matrix, for example). The static ``ComposedLinOp`` chain flattening above is structural fusion of the apply *loop*, not numerical fusion of the operators. A materializing fusion spec would carry a ``cost`` and pass the memory gate before the dispatcher could select it. * **Block-, Kronecker-, or tensor-product specialized kernels** beyond the diagonal demonstration kernel. The task list ties those to "correctness references exist", which is now true; an implementation can land in a follow-up without revisiting the policy. The ``0.4.0`` shipped kernels ----------------------------- ``composed-chain-apply`` ~~~~~~~~~~~~~~~~~~~~~~~~ Apply a sequence of operators ``(A, B, C, ...)`` to ``x`` in order without rebuilding the nested :class:`spacecore.linop.ComposedLinOp` tree. The optimized variant skips the per-link ``@checked_method`` wrapper that the generic ``ComposedLinOp(A, ComposedLinOp(B, C)).apply(x)`` path pays on every intermediate. * Generic: ``spacecore.kernels.specs.composed.composed_chain_apply_generic``. * Optimized: ``spacecore.kernels.specs.composed.composed_chain_apply_optimized``. * Correctness: ``tests/kernels/test_kernels_match_generic.py ::test_composed_chain_apply_matches_generic``. ``block-diagonal-dense-apply`` ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Apply a flat tuple of block matrices to a flat tuple of input leaves via ``ops.matmul`` directly. The optimized variant hoists the bound method lookup out of the loop and writes a tight zip-loop, instead of going through the :class:`spacecore.linop.tree.BlockDiagonalLinOp` ``@checked_method`` wrapper per leaf. * Generic: ``spacecore.kernels.specs.block_diagonal.block_diagonal_dense_apply_generic``. * Optimized: ``spacecore.kernels.specs.block_diagonal.block_diagonal_dense_apply_optimized``. * Correctness: ``tests/kernels/test_kernels_match_generic.py ::test_block_diagonal_dense_apply_matches_generic``. Adding a new kernel ------------------- 1. Implement ``generic`` and ``optimized`` in a new module under ``spacecore/kernels/specs/``. The generic implementation must match the un-optimized path that the rest of SpaceCore would use. 2. Register a :class:`~spacecore.kernels.KernelSpec` at module scope. Name the correctness reference you are about to add. 3. Add the correctness test function under ``tests/kernels/test_kernels_match_generic.py``. Use the existing parametrized templates as guides. 4. Run ``pytest tests/kernels/`` — the policy-enforcement test refuses to pass until the correctness test exists. Cross-references ---------------- * :doc:`backend_conformance` — the matrix the kernels ultimately ride on. ``ops.matmul`` semantics are pinned there. * :doc:`checking_policy` — kernels deliberately skip the ``@checked_method`` wrapper. The reason it's safe inside a kernel is documented in each kernel's ``notes``. * :doc:`/release_notes` — ``0.4.0`` shipped the policy and the two demonstration kernels. Dispatch and fusion are tracked separately.