K-Blade Geometric Algebra as a Replacement for Scalar Tensors in Large Language Models

Authors: Pat Parslow (Draft)

Abstract

We propose a framework to replace dense tensors of scalars in large language models (LLMs) with k-blade representations from geometric algebra (GA). By interpreting vector, matrix, and higher-order tensor operations as multivector algebraic operations (outer, inner, and geometric products), we show how common neural primitives—linear maps, attention, positional encodings, and nonlinearities—can be expressed and implemented using k-blades. We argue this representation offers compactness for structured data, rotationally-covariant parameterisations, and richer geometric inductive biases that could improve generalisation, interpretability, and parameter efficiency. We present mathematical mappings, practical implementation patterns, suggested training strategies, and an experimental plan to evaluate viability on embedding and attention modules.

1. Introduction

Modern LLMs are built from large dense tensors containing scalar parameters and activations. While highly successful, scalar tensor parameterisations are agnostic to geometric structure: they do not capture rotations, subspace structure, or oriented subspace interactions natively. Geometric algebra (GA) provides a compact algebraic framework that unifies scalars, vectors, bivectors, and higher k-blades into a single graded multivector algebra equipped with the geometric product, outer (wedge) product, and inner product. We explore replacing scalar tensors with k-blade based multivector tensors and adapting neural primitives to operate in this algebra.

2. Background

2.1 Geometric Algebra and k-blades

Geometric algebra extends vector spaces by forming multivectors: linear combinations of k-blades where a k-blade represents an oriented k-dimensional subspace (e.g., a bivector is an oriented plane). Key operations:

Geometric product: for vectors a,b
\[ ab = a \cdot b + a \wedge b \]
where \(a \cdot b\) is the symmetric inner product (scalar) and \(a \wedge b\) is the antisymmetric outer (wedge) product (a bivector).
Wedge (outer) product \(a \wedge b\): produces a higher-grade blade representing the subspace spanned by a and b.
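Grade projection \(\langle M \rangle_g\) extracts the grade-g part of a multivector, allowing decomposition into scalar, vector, and bivector parts; reversion \(\widetilde{M}\) reverses the order of the vector factors in every blade.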

2.2 Multivector expansion

A general multivector M in GA(n) expands over basis blades {e_I} indexed by subsets I of {1,...,n} (equivalently, n-bit bitmasks), with real coefficients m_I and \(e_\emptyset = 1\):

\[ M = \sum_{I \subseteq \{1,\dots,n\}} m_I \, e_I \]
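
As a minimal sketch of this expansion, assuming a Euclidean metric (every basis vector squares to +1), a multivector can be stored as a mapping from bitmask to coefficient, where bit i of the mask marks the presence of e_{i+1}. The names `blade_sign` and `gp` are illustrative helpers, not part of any library.

```python
def blade_sign(a: int, b: int) -> int:
    """Sign of e_a * e_b after reordering into canonical order (Euclidean metric assumed)."""
    a >>= 1
    swaps = 0
    while a:
        # counts pairs (i in a, j in b) with i > j; each pair costs one transposition
        swaps += bin(a & b).count("1")
        a >>= 1
    return -1 if swaps % 2 else 1


def gp(x: dict, y: dict) -> dict:
    """Geometric product of two multivectors stored as {bitmask: coefficient}."""
    out = {}
    for a, xa in x.items():
        for b, yb in y.items():
            mask = a ^ b  # shared basis vectors square to +1 and drop out
            out[mask] = out.get(mask, 0.0) + blade_sign(a, b) * xa * yb
    return out


# a = e1 + 2 e2 and b = 3 e1 + 4 e2 (mask 0b01 is e1, mask 0b10 is e2)
a = {0b01: 1.0, 0b10: 2.0}
b = {0b01: 3.0, 0b10: 4.0}
ab = gp(a, b)
scalar_part = ab[0b00]    # a . b = 1*3 + 2*4
bivector_part = ab[0b11]  # coefficient of e1 ^ e2 = 1*4 - 2*3
```

For two vectors the result contains only a grade-0 and a grade-2 term, which is the decomposition \(ab = a \cdot b + a \wedge b\) from Section 2.1.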

2.3 Tensors in Neural Networks

Neural networks use tensors of rank 1..N to store activations and parameters. Linear layers, convolutions, and attention are all implemented by scalar-linear algebra operations on tensors. Common optimisations exploit low-rank structure, parameter factorisation, and equivariant parameterisations when symmetry is present.

3. Conceptual Mapping: Scalars → k-Blades

3.1 Representational shift

Instead of storing a scalar per tensor element, store a small multivector per element. At minimum, each original scalar can be interpreted as the grade-0 (scalar) component of a multivector, while richer encodings place information across grades: vector components (grade-1) for directional features, bivector components (grade-2) for oriented plane interactions, etc.

3.2 Structured embeddings

Word and token embeddings become multivector-valued embeddings: \(E(t) \in \mathrm{GA}(n)\) where GA(n) is built over an n-dimensional base vector space. Embeddings now encode magnitude (scalar), direction (vector part), and oriented subspace relationships (higher-grade parts). This can express richer relational priors (e.g., word analogies as oriented subspace rotations).

4. Neural Primitives in Geometric Algebra

4.1 Geometric product and linear maps

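Left or right multiplication by a fixed multivector under the geometric product is a linear map on GA(n), so geometric products can play the role of weight matrices. For multivector parameters \(A_i, B_i\), a multivector bias \(b\), and a multivector input \(x\), one canonical parameterisation is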

\[ y = \sum_i A_i \, x \, B_i + b \]

where juxtaposition denotes the geometric product. Projecting grade components yields outputs in desired grades. Expanding in the scalar parameter space shows this is a structured block-linear map; the GA parameterisation can be much lower-dimensional when many blade coefficients are zero (k-blade sparsity).
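
The sandwich map above can be sketched as follows, again assuming a Euclidean metric and a {bitmask: coefficient} storage. The function names (`sandwich_layer`, `grade_project`) and the argument `out_grades` are hypothetical and chosen only for illustration.

```python
def blade_sign(a: int, b: int) -> int:
    """Sign of e_a * e_b after canonical reordering (Euclidean metric assumed)."""
    a >>= 1
    swaps = 0
    while a:
        swaps += bin(a & b).count("1")
        a >>= 1
    return -1 if swaps % 2 else 1


def gp(x: dict, y: dict) -> dict:
    """Geometric product on {bitmask: coefficient} multivectors."""
    out = {}
    for a, xa in x.items():
        for b, yb in y.items():
            out[a ^ b] = out.get(a ^ b, 0.0) + blade_sign(a, b) * xa * yb
    return out


def grade_project(x: dict, grades: set) -> dict:
    """Keep only the components whose grade (popcount of the mask) is in `grades`."""
    return {m: c for m, c in x.items() if bin(m).count("1") in grades}


def sandwich_layer(x, A, B, bias, out_grades):
    """y = sum_i A_i x B_i + bias, followed by a grade projection."""
    y = dict(bias)
    for A_i, B_i in zip(A, B):
        for mask, coeff in gp(gp(A_i, x), B_i).items():
            y[mask] = y.get(mask, 0.0) + coeff
    return grade_project(y, out_grades)


e1, e2 = {0b01: 1.0}, {0b10: 1.0}
# e1 e2 e1 = -e2: sandwiching by e1 negates the component perpendicular to e1
y = sandwich_layer(e2, A=[e1], B=[e1], bias={}, out_grades={1})
```

For comparison, a dense scalar-linear map on all 2^n coefficients has 4^n entries, whereas one term \(A_i \, x \, B_i\) carries at most 2 * 2^n coefficients.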

4.2 Attention

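Let queries, keys, and values be multivectors \(q, k, v \in \mathrm{GA}(n)\). Define a scalar similarity score from the scalar part (grade-0 projection) \(\langle q\,k \rangle_0\) of the geometric product, scaled by \(\sqrt{d}\), where \(d\) plays the role of the key dimension in standard scaled dot-product attention (for example, the number of retained blade components):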

\[ s(q,k) = \frac{\langle q\,k \rangle_0}{\sqrt{d}}, \qquad a_{ij} = \operatorname{softmax}_j\!\big(s(q_i, k_j)\big) \]

The attended multivector is

\[ o_i = \sum_j a_{ij} \, v_j \]
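
A minimal single-head sketch of these two equations, assuming a Euclidean metric, multivectors stored as coefficient lists indexed by bitmask, and d taken to be the number of components (an assumption, since the choice of d is left open here). The function names are illustrative.

```python
import math


def scalar_part_of_product(q, k):
    """<q k>_0 for coefficient lists indexed by bitmask (Euclidean metric assumed)."""
    total = 0.0
    for mask, (qc, kc) in enumerate(zip(q, k)):
        g = bin(mask).count("1")
        # a basis blade of grade g squares to (-1)^(g(g-1)/2)
        sign = -1.0 if (g * (g - 1) // 2) % 2 else 1.0
        total += sign * qc * kc
    return total


def ga_attention(queries, keys, values):
    """Softmax attention with scores <q k>_0 / sqrt(d); returns one multivector per query."""
    d = len(queries[0])  # assumption: normalise by the number of blade components
    outputs = []
    for q in queries:
        scores = [scalar_part_of_product(q, k) / math.sqrt(d) for k in keys]
        top = max(scores)  # subtract the maximum for numerical stability
        weights = [math.exp(s - top) for s in scores]
        norm = sum(weights)
        weights = [w / norm for w in weights]
        outputs.append([
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ])
    return outputs
```

In a Euclidean signature, grade-2 and grade-3 components enter \(\langle q\,k \rangle_0\) with a negative sign, so the score is not a plain dot product of coefficients. Using \(\langle q\,\widetilde{k} \rangle_0\) instead would give the coefficient-wise dot product; which of the two behaves better in training is an open design choice.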

4.3 Rotors and positional encodings

Rotations in GA use rotors \(R = \exp\!\left(-\tfrac{1}{2} B\right)\) for a bivector \(B\). Rotor action on a multivector \(M\) is

\[ M' = R \, M \, \widetilde{R} \]

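where \(\widetilde{R}\) is the reverse of \(R\). Choosing \(B\) proportional to the token position, with one rotation plane per pair of coordinates, recovers rotary position embeddings (RoPE) as a special case when \(M\) is a vector; other bivectors and higher-grade \(M\) generalise it.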
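
The rotor action can be sketched in a few lines, assuming a Euclidean metric and a unit basis bivector as the rotation plane. The position schedule in `positional_rotor` (angle growing linearly with position, with a hypothetical `base_angle`) is an illustrative assumption in the spirit of RoPE, not a prescribed design.

```python
import math


def blade_sign(a: int, b: int) -> int:
    """Sign of e_a * e_b after canonical reordering (Euclidean metric assumed)."""
    a >>= 1
    swaps = 0
    while a:
        swaps += bin(a & b).count("1")
        a >>= 1
    return -1 if swaps % 2 else 1


def gp(x: dict, y: dict) -> dict:
    """Geometric product on {bitmask: coefficient} multivectors."""
    out = {}
    for a, xa in x.items():
        for b, yb in y.items():
            out[a ^ b] = out.get(a ^ b, 0.0) + blade_sign(a, b) * xa * yb
    return out


def reverse(x: dict) -> dict:
    """Reversion: a grade-g component picks up the sign (-1)^(g(g-1)/2)."""
    out = {}
    for m, c in x.items():
        g = bin(m).count("1")
        out[m] = -c if (g * (g - 1) // 2) % 2 else c
    return out


def rotor(theta: float, plane: int = 0b11) -> dict:
    """R = exp(-theta/2 * e_plane) = cos(theta/2) - sin(theta/2) e_plane (unit basis bivector)."""
    return {0: math.cos(theta / 2), plane: -math.sin(theta / 2)}


def rotate(R: dict, M: dict) -> dict:
    """Rotor sandwich M' = R M R~."""
    return gp(gp(R, M), reverse(R))


def positional_rotor(position: int, base_angle: float = 0.1) -> dict:
    """Hypothetical RoPE-like schedule: the angle grows linearly with position."""
    return rotor(position * base_angle)


def relative_score(q: dict, k: dict, m: int, n: int) -> float:
    """Scalar part of the product of q at position m and k at position n."""
    return gp(rotate(positional_rotor(m), q), rotate(positional_rotor(n), k)).get(0, 0.0)


e1 = {0b01: 1.0}
quarter_turn = rotate(rotor(math.pi / 2), e1)  # approximately e2
```

For vector-valued q and k rotated in the same plane, `relative_score` depends only on the offset m - n, which is the property that rotary embeddings rely on.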

4.4 Nonlinearities

Nonlinear activations for multivectors can be constructed by operating on invariants or per-grade components. Options:

Apply scalar nonlinearity to magnitudes: \(\sigma(\lVert \operatorname{grade}_1(M) \rVert)\) and rescale directions.
Apply elementwise scalar nonlinearity to each blade coefficient.
Use learnable grade-wise gates \(g_g\) per grade \(g\): \(\operatorname{grade}_g(M) \mapsto g_g \odot \sigma(\operatorname{grade}_g(M))\).
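
Two of these options can be sketched on coefficient lists indexed by bitmask. The choice of tanh as the scalar nonlinearity and the function names are assumptions made for illustration only.

```python
import math


def grade(mask: int) -> int:
    """Grade of a basis blade is the number of set bits in its mask."""
    return bin(mask).count("1")


def magnitude_gate(m, sigma=math.tanh):
    """Squash the norm of the grade-1 part with sigma and keep its direction."""
    idx = [i for i in range(len(m)) if grade(i) == 1]
    norm = math.sqrt(sum(m[i] ** 2 for i in idx))
    out = list(m)
    if norm > 0.0:
        scale = sigma(norm) / norm
        for i in idx:
            out[i] = m[i] * scale
    return out


def gradewise_gate(m, gates, sigma=math.tanh):
    """Apply sigma to every coefficient, then scale by the gate of its grade."""
    return [gates[grade(i)] * sigma(c) for i, c in enumerate(m)]


# n = 2: components are ordered as [1, e1, e2, e12]
m = [0.5, 3.0, 4.0, -2.0]
squashed = magnitude_gate(m)
gated = gradewise_gate(m, gates={0: 1.0, 1: 0.5, 2: 0.0})
```

The magnitude-based variant commutes with rotations of the vector part, because it only rescales a rotation-invariant norm. The coefficient-wise variants depend on the chosen basis, so they trade that equivariance for simplicity.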

5. Theoretical advantages

Geometric expressivity: captures oriented subspace relationships naturally.
Parameter efficiency: k-blade-sparse parameterisation can represent structured linear maps with fewer parameters than dense scalar tensors.
Rotational and subspace equivariances: rotors provide mechanisms for equivariant transforms useful in language analogies and structured transformations.
Interpretability: grades map to geometric concepts (direction, plane) enabling richer probes.

6. Practical implementation

6.1 Data layout and memory

Implement multivector tensors as arrays shaped (batch, seq_len, components), where components correspond to the chosen basis blades up to grade k_max. For base dimension n the full algebra has 2^n basis blades; restricting to grades at most k_max leaves \(\sum_{g=0}^{k_{\max}} \binom{n}{g}\) components. Choose small n and restrict grades (e.g., up to grade 2 or 3) to cap memory.
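
The size of the components axis follows directly from binomial coefficients. The sketch below counts retained blades and lists their bitmasks; the grade-then-value ordering is an assumed convention, and any fixed ordering would serve.

```python
from math import comb


def num_components(n: int, k_max: int) -> int:
    """Number of basis blades of grade <= k_max in an n-dimensional base space."""
    return sum(comb(n, g) for g in range(k_max + 1))


def basis_masks(n: int, k_max: int) -> list:
    """Bitmasks of the retained blades, ordered by grade and then by mask value."""
    masks = [m for m in range(1 << n) if bin(m).count("1") <= k_max]
    return sorted(masks, key=lambda m: (bin(m).count("1"), m))


full_16 = num_components(16, 16)      # 2**16 = 65536 components
truncated_16 = num_components(16, 2)  # 1 + 16 + 120 = 137 components
```

Truncating to grade 2 at n = 16 reduces the per-element footprint from 65536 to 137 coefficients, which illustrates why grade restriction matters more than the choice of n alone.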

6.2 Efficient geometric product

The geometric product reduces to signed sums over basis blade products with a deterministic sparsity pattern. Implement via sparse kernels exploiting antisymmetry and grade structure. Two routes:

Precompute multiplication tables for the chosen basis and implement via batched sparse-dense matmuls.
Implement small dense microkernels (fixed small n) optimised on GPU via CUDA or Triton.
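
The first route can be sketched in plain Python, assuming a Euclidean metric. `build_table` precomputes one (left index, right index, output index, sign) entry per nonzero basis product, and `gp_table` performs the gather and accumulate step that a batched GPU kernel would vectorise. Both names are hypothetical.

```python
def blade_sign(a: int, b: int) -> int:
    """Sign of e_a * e_b after canonical reordering (Euclidean metric assumed)."""
    a >>= 1
    swaps = 0
    while a:
        swaps += bin(a & b).count("1")
        a >>= 1
    return -1 if swaps % 2 else 1


def build_table(n: int, k_max: int):
    """Return the retained masks and the list of (i, j, k, sign) product entries."""
    masks = [m for m in range(1 << n) if bin(m).count("1") <= k_max]
    index = {m: i for i, m in enumerate(masks)}
    table = []
    for i, a in enumerate(masks):
        for j, b in enumerate(masks):
            if (a ^ b) in index:  # drop products whose grade exceeds k_max
                table.append((i, j, index[a ^ b], blade_sign(a, b)))
    return masks, table


def gp_table(x, y, table, size):
    """Geometric product on coefficient lists using the precomputed table."""
    out = [0.0] * size
    for i, j, k, sign in table:
        out[k] += sign * x[i] * y[j]  # gather x[i], y[j]; accumulate into out[k]
    return out


masks, table = build_table(2, 2)
e1 = [0.0, 1.0, 0.0, 0.0]
e2 = [0.0, 0.0, 1.0, 0.0]
e1e2 = gp_table(e1, e2, table, len(masks))  # the unit bivector e12
```

With grade truncation, products that land above k_max are discarded, so the truncated product is in general no longer associative. Layers that chain several products should account for this.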

6.3 Initialisation and training

Initialise multivector parameters to match scalar baselines by placing scalar baseline weights on the grade-0 components and small random noise on higher-grade components. Train with standard optimisers; consider separate learning rates for higher-grade components.
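
A sketch of this initialisation and of grade-dependent step sizes, assuming component 0 is the scalar blade. The noise scale, the learning rates, and the plain SGD update are illustrative assumptions rather than recommended values.

```python
import random


def init_from_scalar(weights, num_components, noise_std=1e-3, seed=0):
    """Lift scalar baseline weights to multivectors with small higher-grade noise."""
    rng = random.Random(seed)
    params = []
    for w in weights:
        mv = [rng.gauss(0.0, noise_std) for _ in range(num_components)]
        mv[0] = w  # the grade-0 slot carries the baseline weight exactly
        params.append(mv)
    return params


def sgd_step(params, grads, lr_scalar=1e-2, lr_higher=1e-3):
    """In-place SGD update with a separate step size for grades >= 1."""
    for mv, g in zip(params, grads):
        for c in range(len(mv)):
            mv[c] -= (lr_scalar if c == 0 else lr_higher) * g[c]


params = init_from_scalar([0.5, -1.0], num_components=8)
```

Because the higher-grade components start near zero, the lifted model initially behaves like the scalar baseline up to the injected noise.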

6.4 Backward compatibility

Provide adapter layers to convert between scalar and multivector tensors, enabling staged adoption in existing transformer stacks (replace embeddings/linear layers first, then attention).

7. Experimental protocol

We recommend a staged empirical evaluation:

Baselines: small transformer (e.g., 6-layer, 256-d) trained on LM corpora.
Ablations: multivector embeddings only, multivector attention only, full multivector parameterisation.
Metrics: perplexity, fine-tuning performance, parameter count, compute FLOPs, probe-based interpretability metrics.
Synthetic tasks: rotation- / subspace-structured tasks (analogies, reversible transforms) to probe GA advantages.

8. Limitations and pitfalls

Overhead: naive multivector expansion increases memory (exponential in n). Truncation and sparsity are essential.
Optimisation: nonstandard parameter geometry may require tailored optimisers.
Hardware: high-performance GA kernels may need custom CUDA/Triton implementations.
Unproven gains: whether these benefits carry over to general language modelling is an open question that requires empirical validation.

9. Related work

Hypercomplex networks (quaternions, octonions) applied to RNNs/CNNs.
Clifford networks and geometric deep learning literature exploring equivariant representations.
Rotary embeddings and complex-valued attention mechanisms.

10. Conclusion

K-blade geometric algebra offers a principled algebraic language to encode oriented subspace information within model parameters and activations. Reinterpreting tensors as multivectors gives neural primitives geometric structure that may improve parameter efficiency, equivariance, and interpretability. Adoption requires careful engineering of data layouts, efficient geometric-product kernels, and empirical validation.

Acknowledgements

Draft prepared by Pat Parslow. Feedback welcome.

References

D. Hestenes, "New Foundations for Classical Mechanics" (1986). https://doi.org/10.1007/978-94-009-4802-0
L. Dorst, D. Fontijne, S. Mann, "Geometric Algebra for Computer Science" (2007). https://geometricalgebra.org/
T. Parcollet, M. Morchid, G. Linarès, "A survey of quaternion neural networks" (2020). https://doi.org/10.1007/s10462-019-09752-1
J. Su, Y. Lu et al., "RoFormer: Enhanced Transformer with Rotary Position Embedding" (2021). https://arxiv.org/abs/2104.09864

Appendix: Minimal GA kernels (sketch)

Represent each multivector as a vector of blade coefficients indexed by the masks in 0..2^n-1 that satisfy popcount(mask) <= k_max. Precompute a blade multiplication table of (sign, result_mask) entries and implement batched gather-add kernels on the GPU. For fixed small n, consider hand-optimised microkernels.

A. GPU microbenchmark (RTX 5070 Ti)

To validate a simple batched geometric-product kernel, we implemented a PyTorch kernel based on scatter-add and measured its performance on an NVIDIA GeForce RTX 5070 Ti (using a CUDA 13.2-enabled PyTorch wheel).

Benchmark configuration

base dimension n = 3, grade truncation k_max = 3 (full GA for n=3)
basis size L = 8 (2^3 basis blades)
number of nonzero product pairs (precomputed) = 64
implementation: vectorised gather and scatter_add on CUDA (see ga_gpu_benchmark.py)

Results (averaged over 200 runs after warmup)

B,L,num_pairs,avg_ms,ops_per_sec
1,8,64,0.0466,21474.8
8,8,64,0.0528,18928.2
32,8,64,0.0433,23089.0
128,8,64,0.0612,16330.2

A.1 Triton microkernel benchmarks (broader sweep)

We extended the microbenchmark to several (n,k) settings and batch sizes. The CSV results are saved at /home/p/ga_triton_final_results.csv. Key aggregated statistics:

(n=3, k_max=3) atomic Triton ~0.24 ms, baseline ~0.26 ms
(n=4, k_max=3) atomic Triton ~0.28 ms, baseline ~0.30 ms
(n=4, k_max=4) atomic Triton ~0.28 ms, baseline ~0.33 ms
(n=5, k_max=3) atomic Triton ~0.36 ms, baseline ~0.40 ms

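Representative per-case numbers (times in ms). Note that err_opt is of order 1 to 10 for the grade-truncated settings (n=4, k_max=3 and n=5, k_max=3) while err_atomic stays below 1e-5 throughout, which suggests the opt variant does not yet handle grade truncation correctly; its timings for those rows should be read with that caveat:
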
n,k,B,L,num_pairs,baseline_ms,opt_ms,atomic_ms,err_opt,err_atomic
3,3,1,8,64,0.2779,0.3398,0.2491,4.768372e-07,4.768372e-07
3,3,8,8,64,0.2617,0.3519,0.2469,4.768372e-07,4.768372e-07
3,3,32,8,64,0.2551,0.3516,0.2434,9.536743e-07,1.430511e-06
3,3,128,8,64,0.2606,0.3494,0.2383,9.536743e-07,9.536743e-07
4,3,1,15,211,0.3360,0.4554,0.2599,3.076522e+00,9.536743e-07
4,3,8,15,211,0.3001,0.4373,0.2774,3.292984e+00,9.536743e-07
4,3,32,15,211,0.2954,0.4308,0.2734,3.680134e+00,1.907349e-06
4,3,128,15,211,0.2939,0.4551,0.2894,6.226732e+00,2.384186e-06
4,4,1,16,256,0.3052,0.4816,0.2816,4.768372e-07,9.536743e-07
4,4,8,16,256,0.3480,0.4910,0.2744,1.907349e-06,1.907349e-06
4,4,32,16,256,0.3286,0.4711,0.2859,1.907349e-06,2.861023e-06
4,4,128,16,256,0.3057,0.4791,0.2817,3.337860e-06,1.907349e-06
5,3,1,26,556,0.3952,0.7072,0.3576,6.064522e+00,9.536743e-07
5,3,8,26,556,0.3853,0.6952,0.3627,1.006598e+01,2.384186e-06
5,3,32,26,556,0.4054,1.1341,0.3926,8.531096e+00,2.861023e-06
5,3,128,26,556,0.5594,0.7187,0.3663,1.247142e+01,3.337860e-06
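
The rows above can be summarised with a short script. The sketch below uses three rows copied from the table, defines speedup as baseline_ms divided by the variant's time (an assumed definition), and flags rows whose error exceeds a hypothetical tolerance of 1e-4.

```python
import csv
import io

# Three rows copied from the table above; the full CSV has the same columns.
ROWS = """n,k,B,L,num_pairs,baseline_ms,opt_ms,atomic_ms,err_opt,err_atomic
3,3,1,8,64,0.2779,0.3398,0.2491,4.768372e-07,4.768372e-07
4,3,1,15,211,0.3360,0.4554,0.2599,3.076522e+00,9.536743e-07
5,3,128,26,556,0.5594,0.7187,0.3663,1.247142e+01,3.337860e-06
"""


def summarise(csv_text: str, tol: float = 1e-4) -> list:
    """Per-row speedups over the baseline plus a correctness flag per variant."""
    summary = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        base = float(row["baseline_ms"])
        summary.append({
            "n": int(row["n"]),
            "k": int(row["k"]),
            "B": int(row["B"]),
            "speedup_atomic": base / float(row["atomic_ms"]),
            "speedup_opt": base / float(row["opt_ms"]),
            "atomic_ok": float(row["err_atomic"]) <= tol,  # tol is an assumed threshold
            "opt_ok": float(row["err_opt"]) <= tol,
        })
    return summary


summary = summarise(ROWS)
```

Filtering on the correctness flags before computing aggregates avoids crediting a variant for time spent producing wrong results.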

Final benchmark plots

Figure A.1: Latency vs L (lower is better). Source: /home/p/ga_triton_final_results.csv

Figure A.2: Speedup vs L (higher is better). Source: /home/p/ga_triton_final_results.csv