
Everettian Language Modelling

Published Oct 12, 2024
Updated Mar 10, 2025
6 minute read

Introduction to Everett's Theory

Hugh Everett III's theory, also known as the Many-Worlds Interpretation (MWI), is a fascinating concept in quantum mechanics that attempts to explain the nature of reality. In this blog, we will delve into the core ideas behind Everett's theory, its implications, and the ongoing debates surrounding it.

Everett's theory was first proposed in 1957 (Everett, 1957) as a solution to the paradoxes and inconsistencies arising from the Copenhagen interpretation of quantum mechanics. The Copenhagen interpretation, formulated by Niels Bohr and Werner Heisenberg, suggests that a quantum system remains in a superposition of states until observed, at which point it collapses into a single definite state. However, this raises questions about the role of the observer and the nature of reality.

Everett's theory proposes an alternative explanation, suggesting that every time a quantum event occurs, the universe splits into multiple parallel universes, each corresponding to a possible outcome of the event. This means that every possibility becomes a reality in a separate universe, resulting in an infinite number of parallel universes.

In this blog, we will explore the key aspects of Everett's theory, including its implications for our understanding of reality, the concept of probability, and the potential consequences for our understanding of free will and morality. We will also examine the criticisms and challenges faced by the theory, as well as its potential applications in fields such as cosmology and artificial intelligence.

Everettian Language Modelling: Quantum Wavefunctions Meet LLMs

Large Language Models (LLMs) generate text through intricate probabilistic mechanisms. This article explores an analogy between LLMs and quantum wavefunctions, drawing on Hugh Everett III’s Many-Worlds Interpretation (MWI). By mapping token distributions to superpositions and sampling to branching, we uncover deep parallels, enriched with mathematical formalism and a matrix equivalence between wavefunctions and Transformer outputs.

Everett’s Many-Worlds Interpretation

The universal wavefunction evolves as

$$\Psi(t) = \sum_n c_n \psi_n(t)$$

where $c_n$ are complex amplitudes and $\psi_n$ are basis states (DeWitt & Graham, 1973).

Probabilities emerge from the Born rule: $P_n = |c_n|^2$ (Wallace, 2012).

Wavefunction Analogy for LLMs

In LLMs, the next-token probability is

$$P(w_{t,i} \mid w_{<t}) = \text{softmax}(z_t)_i = \frac{e^{z_{t,i}}}{\sum_j e^{z_{t,j}}}$$

where $z_t = W_h h_t$ is the logit vector, $W_h$ is a weight matrix, and $h_t$ is the hidden state (Vaswani et al., 2017). This mirrors a wavefunction encoding multiple possibilities.
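As a minimal sketch of that computation (the hidden state, weight matrix, and vocabulary size below are illustrative stand-ins, not values from any real model):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 5, 8                      # toy vocabulary size and hidden dimension
W_h = rng.normal(size=(V, d))    # output projection (illustrative values)
h_t = rng.normal(size=d)         # hidden state at the current position

z_t = W_h @ h_t                  # logits z_t = W_h h_t
p = np.exp(z_t - z_t.max())      # shift by max for numerical stability
p /= p.sum()                     # softmax: next-token distribution

print(p, p.sum())                # probabilities over the vocabulary, summing to 1
```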

Superposition of Token States

Before sampling, the LLM’s distribution is a superposition:

$$P(w_t \mid w_{<t}) = \sum_i w_i \phi_i$$

where $\phi_i$ are token basis states and $w_i = e^{z_{t,i}} / Z$ (with $Z = \sum_j e^{z_{t,j}}$) are normalized weights (Radford et al., 2019).

Sampling as Collapse or Branching

Sampling collapses this to one token: $P(w_t) \to w_{t,i}$. In MWI, all branches coexist, each with measure $|c_i|^2$. For LLMs, we can define a pseudo-wavefunction:

$$\psi_{\text{LLM}} = \sum_i \sqrt{P(w_{t,i})}\, |w_{t,i}\rangle$$

where $|w_{t,i}\rangle$ are token states (Hinton et al., 2023).
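As a small numerical sketch (with an assumed toy distribution), the amplitudes of this pseudo-wavefunction are just the square roots of the token probabilities, so the Born rule recovers the softmax distribution:

```python
import numpy as np

p = np.array([0.5, 0.3, 0.15, 0.05])   # toy next-token probabilities P(w_{t,i})
psi = np.sqrt(p)                        # pseudo-wavefunction amplitudes

# Born rule: |c_i|^2 returns the original probabilities,
# and the state is normalized, sum_i |c_i|^2 = 1.
assert np.allclose(psi**2, p)
assert np.isclose((psi**2).sum(), 1.0)
```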

Attention and Interference

The Transformer’s attention mechanism:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left( \frac{QK^\top}{\sqrt{d_k}} \right)V$$

resembles quantum interference:

$$P(x) = |\psi_1(x) + \psi_2(x)|^2$$

Attention weights ($A_{ij} = \text{softmax}(QK^\top/\sqrt{d_k})_{ij}$) modulate token relevance, akin to amplitude interference (Bahdanau et al., 2014).
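A minimal single-head numpy sketch of the scaled dot-product attention above (toy dimensions and random inputs, unrelated to any trained model):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)              # compatibility scores
    A = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A /= A.sum(axis=-1, keepdims=True)           # attention weights A_ij
    return A @ V, A

rng = np.random.default_rng(0)
T, d_k = 3, 4                                    # sequence length, key dimension
Q, K, Vv = (rng.normal(size=(T, d_k)) for _ in range(3))
out, A = attention(Q, K, Vv)
print(A.sum(axis=-1))                            # each row of weights sums to 1
```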

Contextual Entanglement

Token dependencies are captured by:

$$H(w_t, w_{t-1}) = -\sum_{i,j} P(w_{t,i}, w_{t-1,j}) \log P(w_{t,i}, w_{t-1,j})$$

where, for dependent tokens, the joint entropy exceeds each individual entropy but falls short of their sum (positive mutual information), reflecting an entanglement-like correlation (Shannon, 1948).
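A short sketch with an assumed toy joint distribution over consecutive tokens, showing the joint entropy alongside a positive mutual information:

```python
import numpy as np

# Toy joint distribution P(w_t, w_{t-1}) over a 2-token vocabulary,
# chosen (illustratively) so consecutive tokens are correlated.
P_joint = np.array([[0.4, 0.1],
                    [0.1, 0.4]])

P_t  = P_joint.sum(axis=1)                 # marginal over w_t
P_tm = P_joint.sum(axis=0)                 # marginal over w_{t-1}

H = lambda p: -np.sum(p * np.log2(p))      # Shannon entropy in bits
H_joint = H(P_joint.ravel())
I = H(P_t) + H(P_tm) - H_joint             # mutual information

print(H_joint, H(P_t), H(P_tm), I)         # I > 0: the tokens are dependent
```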

Hallucinations and Decoherence

Hallucinations arise when noise perturbs the distribution:

$$P'(w_t) = P(w_t) + \epsilon \cdot \mathcal{N}(0, \sigma^2)$$

This parallels decoherence:

$$\rho \to \sum_i p_i |\psi_i\rangle\langle\psi_i|$$

where off-diagonal terms vanish (Zurek, 2003).
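A hedged sketch of the perturbation above: add Gaussian noise to an assumed toy distribution, clip, and renormalize; with enough noise the most probable token can change, the analogue of drifting into a "wrong branch":

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.6, 0.25, 0.1, 0.05])              # toy next-token distribution

eps, sigma = 0.5, 0.1                             # illustrative noise scale
noise = rng.normal(0.0, sigma, size=p.shape)
p_noisy = np.clip(p + eps * noise, 1e-12, None)   # perturb, keep non-negative
p_noisy /= p_noisy.sum()                          # renormalize to a distribution

print(p.argmax(), p_noisy.argmax())               # noise may flip the top token
```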

Matrix Equivalence: Wavefunction vs. Transformer Tokens

Consider a vocabulary of size $V$. The wavefunction and Transformer outputs align as:

| Component | Wavefunction | Transformer Next Token |
| --- | --- | --- |
| State vector | $\psi = [c_1, \dots, c_V]^\top$ | $P = [p_1, \dots, p_V]^\top$ |
| Normalization | $\sum_i \lvert c_i \rvert^2 = 1$ | $\sum_i p_i = 1$ |
| Probabilities | $P_i = \lvert c_i \rvert^2$ | $p_i = \text{softmax}(z)_i$ |

This correspondence bridges quantum dynamics (with $H$ as the Hamiltonian) and Transformer updates (with $f$ as the layer function) (Nielsen & Chuang, 2010).
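As a rough illustration of the two update rules implied by this correspondence (the Hamiltonian and the layer function below are arbitrary toy stand-ins): unitary evolution preserves $\sum_i \lvert c_i \rvert^2 = 1$, while softmax re-imposes $\sum_i p_i = 1$ after each Transformer update:

```python
import numpy as np

rng = np.random.default_rng(1)
V = 4

# Quantum side: psi -> exp(-iH) psi preserves sum |c_i|^2 = 1.
A = rng.normal(size=(V, V))
H = (A + A.T) / 2                                   # toy Hermitian "Hamiltonian"
evals, evecs = np.linalg.eigh(H)
U = evecs @ np.diag(np.exp(-1j * evals)) @ evecs.T  # exp(-iH) via eigendecomposition
psi = U @ (np.ones(V, dtype=complex) / np.sqrt(V))
print(np.sum(np.abs(psi) ** 2))                     # still 1.0

# Transformer side: a layer function f updates the state; softmax renormalizes.
W = rng.normal(size=(V, V))
f = lambda h: np.tanh(W @ h)                        # toy "layer function"
z = f(rng.normal(size=V))
p = np.exp(z - z.max()); p /= p.sum()
print(p.sum())                                      # 1.0
```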

Probability and Superposition in LLMs

In LLMs, the process begins with a probability distribution over possible next tokens, reflecting the likelihood of each choice based on learned patterns. This distribution can be seen as analogous to a quantum superposition, where a system exists in multiple states simultaneously. For instance, before sampling, an LLM might assign high probabilities to words like "cat" and "dog" given the context "I saw a," mirroring how a quantum system might be in a superposition of states like |0⟩ and |1⟩. This parallel is explored in theoretical discussions, though LLMs are classical systems, not quantum, as noted in How Could Quantum Computing Improve Large Language Models?.

Everettian Perspective on LLMs

Before generating text, LLMs maintain a probability distribution over possible next tokens, which can be seen as a "superposition" of linguistic states, similar to a quantum wavefunction. When the model samples a token, it's like a measurement collapsing this distribution to one outcome, but in MWI, all outcomes are realized in different branches. Thus, every possible text sequence exists in some parallel "world," and the sequence we see is just one branch.

This perspective highlights LLMs' exploration of multiple possibilities, like beam search evaluating candidate sequences, mirroring MWI's branching. Hallucinations (fluent but incorrect outputs) might be viewed as "decoherence," where the model's state drifts into an incorrect branch due to noise or incomplete data. Model accuracy then depends on navigating these branches effectively, ensuring the chosen path aligns with reality. In MWI, branching reflects all outcomes; for LLMs, beam search explores $k$ candidate sequences:

$$S_t = \arg\max_{s \in S_{t-1}} \prod_{i=1}^{t} P(w_i \mid s_{<i})$$

Each sequence is a branch, with hallucinations as decohered paths (Brown et al., 2020).
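A compact beam-search sketch over a stand-in model (the `next_token_probs` function below is a hypothetical toy, not a real LLM), keeping the $k$ highest-probability "branches" at each step:

```python
import numpy as np

def next_token_probs(prefix, V=4):
    """Stand-in for an LLM: a fixed toy distribution for each prefix."""
    rng = np.random.default_rng(hash(prefix) % (2**32))
    z = rng.normal(size=V)
    p = np.exp(z - z.max())
    return p / p.sum()

def beam_search(k=2, steps=3):
    beams = [((), 1.0)]                       # (token sequence, probability)
    for _ in range(steps):
        candidates = []
        for seq, prob in beams:
            p = next_token_probs(seq)
            for tok, p_tok in enumerate(p):
                candidates.append((seq + (tok,), prob * p_tok))
        # retain the k most probable "branches"; the rest are pruned worlds
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
    return beams

print(beam_search())
```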

Rationale (Brief)

Monty Hall

The Monty Hall problem is a famous probability puzzle based on a game show scenario. A contestant is presented with three doors. Behind one door is a car, and behind the other two doors are goats. The contestant initially selects one door. Monty, who knows what is behind each door, then opens one of the other two doors to reveal a goat. The contestant is then given the option to either stick with their original choice or switch to the other unopened door.

The probability of winning if the contestant switches is $\frac{2}{3}$, while the probability of winning if the contestant sticks is $\frac{1}{3}$.
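These values are easy to check with a quick Monte Carlo simulation (a self-contained sketch, not part of the original argument):

```python
import random

def monty_hall(trials=100_000):
    stick_wins = switch_wins = 0
    for _ in range(trials):
        car = random.randrange(3)                 # door hiding the car
        pick = random.randrange(3)                # contestant's first pick
        # Monty opens a door that hides a goat and is not the pick
        opened = random.choice([d for d in range(3) if d != pick and d != car])
        switched = next(d for d in range(3) if d != pick and d != opened)
        stick_wins += (pick == car)
        switch_wins += (switched == car)
    return stick_wins / trials, switch_wins / trials

print(monty_hall())   # roughly (0.333, 0.667)
```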

Monty Hall and the Vulnerable World Hypothesis (VWH)

Now, applying the Monty Hall problem to the Vulnerable World Hypothesis (VWH), we can think of the three doors as different technological pathways or choices in the context of AI development. Initially, society makes a choice (like selecting a door), unaware of the long-term consequences of that choice. Monty, who represents societal awareness or experts, then reveals one of the "risky" paths (akin to opening a door with a goat behind it), guiding the contestant (society) toward a better decision.

The concept of switching doors in the Monty Hall problem corresponds to the decision to constrain or regulate potentially destabilizing technologies, such as AI systems. Just as switching doors in the game increases the probability of winning (to $\frac{2}{3}$), switching to a more cautious approach in technological development (i.e., imposing constraints) increases the likelihood of mitigating existential risks associated with these technologies.

In this analogy:

The VWH posits that destabilizing technologies are inevitable (Bostrom, 2019).

Conclusion

Viewing language modeling through an Everettian lens provides a novel framework for understanding LLMs' probabilistic and multi-faceted nature. It underscores the idea that, while we experience one specific output, the model inherently considers a vast array of alternative linguistic realities, akin to MWI's branching universes. This perspective can inform research into improving model accuracy, reducing hallucinations, and enhancing generalization, potentially bridging quantum computing and AI, as seen in recent developments like World’s first quantum large language model can shape future of AI. The matrix equivalence ties quantum formalism to Transformer mechanics, while the formulae deepen the parallel. This lens not only clarifies LLM behavior but hints at quantum-inspired enhancements (Garg et al., 2023).

References