Everettian Language Modelling
Introduction to Everett's Theory
Hugh Everett III's theory, also known as the Many-Worlds Interpretation (MWI), is a fascinating concept in quantum mechanics that attempts to explain the nature of reality. In this blog, we will delve into the core ideas behind Everett's theory, its implications, and the ongoing debates surrounding it.
Everett's theory was first proposed in 1957 (Everett, 1957) as a solution to the paradoxes and inconsistencies arising from the Copenhagen interpretation of quantum mechanics. The Copenhagen interpretation, formulated by Niels Bohr and Werner Heisenberg, suggests that a quantum system remains in a superposition of states until observed, at which point it collapses into a single definite state. However, this raises questions about the role of the observer and the nature of reality.
Everett's theory proposes an alternative explanation, suggesting that every time a quantum event occurs, the universe splits into multiple parallel universes, each corresponding to a possible outcome of the event. This means that every possibility becomes a reality in a separate universe, resulting in an infinite number of parallel universes.
In this blog, we will explore the key aspects of Everett's theory, including its implications for our understanding of reality, the concept of probability, and the potential consequences for our understanding of free will and morality. We will also examine the criticisms and challenges faced by the theory, as well as its potential applications in fields such as cosmology and artificial intelligence.
Everettian Language Modelling: Quantum Wavefunctions Meet LLMs
Large Language Models (LLMs) generate text through intricate probabilistic mechanisms. This article explores an analogy between LLMs and quantum wavefunctions, drawing on Hugh Everett III’s Many-Worlds Interpretation (MWI). By mapping token distributions to superpositions and sampling to branching, we uncover deep parallels, enriched with mathematical formalism and a matrix equivalence between wavefunctions and Transformer outputs.
Everett’s Many-Worlds Interpretation
The universal wavefunction can be expanded as \( |\Psi\rangle = \sum_i c_i |\psi_i\rangle \), where \( c_i \) are complex amplitudes and \( |\psi_i\rangle \) are basis states (DeWitt & Graham, 1973), and it evolves unitarily under the Schrödinger equation \( i\hbar\,\partial_t |\Psi\rangle = H |\Psi\rangle \).
Probabilities emerge from the Born rule: \( P_i = |c_i|^2 \) (Wallace, 2012).
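As a quick numerical check (a toy example, not from the source), the Born rule can be verified for a two-state superposition with illustrative amplitudes:

```python
import numpy as np

# Toy superposition |psi> = c0|0> + c1|1>; the amplitudes are illustrative.
c = np.array([1 / np.sqrt(3), np.sqrt(2 / 3) * np.exp(1j * np.pi / 4)])

# Born rule: P_i = |c_i|^2 (the phase of c1 drops out).
probs = np.abs(c) ** 2
```

Note that the complex phase on the second amplitude has no effect on the probabilities, which is exactly why probability alone underdetermines the amplitudes.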
Wavefunction Analogy for LLMs
In LLMs, the next-token probability is \( P(x_t \mid x_{<t}) = \mathrm{softmax}(z_t) \), where \( z_t = W h_t \) is the logit vector, \( W \) is a weight matrix, and \( h_t \) is the hidden state (Vaswani et al., 2017). This mirrors a wavefunction encoding multiple possibilities.
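A minimal sketch of this step, with toy dimensions and random values standing in for a trained \( W \) and \( h_t \):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a logit vector z."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
V, d = 8, 4                      # toy vocabulary size and hidden width
W = rng.normal(size=(V, d))      # weight matrix (random stand-in)
h_t = rng.normal(size=d)         # hidden state at step t
p = softmax(W @ h_t)             # next-token distribution P(x_t | x_<t)
```

The output `p` is a full distribution over the vocabulary; decoding has not yet selected anything.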
Superposition of Token States
Before sampling, the LLM’s distribution is a superposition: \( |\psi_{\text{LLM}}\rangle = \sum_{v=1}^{V} \sqrt{p_v}\,|v\rangle \), where \( |v\rangle \) are token basis states and \( \sqrt{p_v} \) (with \( \sum_{v} p_v = 1 \)) are normalized weights (Radford et al., 2019).
Sampling as Collapse or Branching
Sampling collapses this to one token: \( |\psi_{\text{LLM}}\rangle \to |v^*\rangle \), selected with probability \( p_{v^*} \). In MWI, all branches coexist, each with measure \( |c_i|^2 \). For LLMs, we can define a pseudo-wavefunction
\( |\psi_{\text{LLM}}\rangle = \sum_{v} \sqrt{p_v}\,|v\rangle \),
where \( |v\rangle \) are token states (Kaplan et al., 2020).
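The contrast between collapse and branching can be sketched with a toy three-token distribution (the token names and probabilities are illustrative):

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])        # toy next-token distribution
tokens = ["cat", "dog", "fox"]

# "Collapse": ordinary decoding samples a single token.
rng = np.random.default_rng(42)
sampled = tokens[rng.choice(len(p), p=p)]

# "Branching": keep every outcome, weighted by its measure p_v,
# analogous to MWI branch measures |c_i|^2.
branches = {tok: weight for tok, weight in zip(tokens, p)}

# Pseudo-wavefunction amplitudes sqrt(p_v) recover p_v under the Born rule.
amps = np.sqrt(p)
```

Sampling discards two of the three branches; the branch dictionary keeps all of them with their measures intact.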
Attention and Interference
The Transformer’s attention mechanism,
\( \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V, \)
resembles quantum interference:
\( P = |c_1 + c_2|^2 = |c_1|^2 + |c_2|^2 + 2\,\mathrm{Re}(c_1^{*} c_2). \)
Attention weights (\( \alpha_{ij} \)) modulate token relevance, akin to amplitude interference (Bahdanau et al., 2014).
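A self-contained NumPy sketch of scaled dot-product attention, using toy shapes rather than a full Transformer:

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    alpha = softmax(Q @ K.T / np.sqrt(d_k))  # attention weights alpha_ij
    return alpha @ V, alpha

rng = np.random.default_rng(1)
n, d_k, d_v = 3, 4, 5                        # toy sequence length and dims
Q = rng.normal(size=(n, d_k))
K = rng.normal(size=(n, d_k))
V = rng.normal(size=(n, d_v))
out, alpha = attention(Q, K, V)
```

Each row of `alpha` is a probability distribution over positions, which is the sense in which attention "modulates relevance".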
Contextual Entanglement
Token dependencies are captured by
\( P(x_t, x_{t+1} \mid x_{<t}) \neq P(x_t \mid x_{<t})\,P(x_{t+1} \mid x_{<t}), \)
so the mutual information \( I(x_t; x_{t+1}) = H(x_t) + H(x_{t+1}) - H(x_t, x_{t+1}) \) is positive: the individual entropies together exceed the joint entropy, loosely reflecting entanglement (Shannon, 1948).
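The dependence claim can be made concrete with a toy joint distribution over two adjacent tokens; the numbers below are illustrative, chosen so the tokens are strongly correlated:

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits, ignoring zero-probability cells."""
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

# Toy joint distribution P(x_t, x_{t+1}): mass concentrated on the diagonal,
# so the two tokens are far from independent.
joint = np.array([[0.4, 0.1],
                  [0.1, 0.4]])
px = joint.sum(axis=1)   # marginal of x_t
py = joint.sum(axis=0)   # marginal of x_{t+1}

# Mutual information I(X;Y) = H(X) + H(Y) - H(X,Y), positive iff dependent.
mi = entropy(px) + entropy(py) - entropy(joint.ravel())
```

For an independent joint (the outer product of the marginals) the same computation returns zero.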
Hallucinations and Decoherence
Hallucinations arise when noise perturbs the distribution:
\( \tilde{p}_v \propto p_v + \epsilon_v, \)
with \( \tilde{p} \) renormalized. This parallels decoherence:
\( \rho \to \sum_i |c_i|^2\,|\psi_i\rangle\langle\psi_i|, \)
where the off-diagonal terms \( c_i c_j^{*} \) (\( i \neq j \)) vanish (Zurek, 2003).
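Both sides of the analogy can be sketched numerically; the noise scale and the distribution below are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.7, 0.2, 0.1])            # clean next-token distribution (toy)

# Hallucination analogy: additive noise perturbs the distribution,
# which is then clipped to stay non-negative and renormalized.
eps = 0.05 * rng.normal(size=p.shape)
p_noisy = np.clip(p + eps, 1e-12, None)
p_noisy = p_noisy / p_noisy.sum()

# Decoherence analogy: start from the pure-state density matrix of the
# pseudo-wavefunction, then zero the off-diagonal (coherence) terms,
# leaving the classical mixture rho = sum_i |c_i|^2 |i><i|.
c = np.sqrt(p).astype(complex)
rho = np.outer(c, c.conj())              # pure-state density matrix
rho_decohered = np.diag(np.diag(rho))    # off-diagonals vanish
```

The diagonal of the decohered matrix is exactly the original probability vector, mirroring how decoherence preserves Born-rule weights while destroying interference.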
Matrix Equivalence: Wavefunction vs. Transformer Tokens
Consider a vocabulary of size \( V \). The wavefunction and Transformer outputs align as:

| Component | Wavefunction | Transformer Next Token |
|---|---|---|
| State Vector | \( \lvert\Psi\rangle = \sum_i c_i \lvert\psi_i\rangle \) | \( p = \mathrm{softmax}(W h_t) \) |
| Normalization | \( \sum_i \lvert c_i \rvert^2 = 1 \) | \( \sum_{v=1}^{V} p_v = 1 \) |
| Amplitudes | \( c_i \in \mathbb{C} \) | \( \sqrt{p_v} \in [0, 1] \) |
This correspondence bridges quantum dynamics (\( H \) as the Hamiltonian) and Transformer updates (the layer map \( f \) playing the analogous role) (Nielsen & Chuang, 2010).
Probability and Superposition in LLMs
In LLMs, the process begins with a probability distribution over possible next tokens, reflecting the likelihood of each choice based on learned patterns. This distribution can be seen as analogous to a quantum superposition, where a system exists in multiple states simultaneously. For instance, before sampling, an LLM might assign high probabilities to words like "cat" and "dog" given the context "I saw a," mirroring how a quantum system might be in a superposition of states like |0⟩ and |1⟩. This parallel is explored in theoretical discussions, though LLMs are classical systems, not quantum, as noted in "How Could Quantum Computing Improve Large Language Models?".
Everettian Perspective on LLMs
Before generating text, LLMs maintain a probability distribution over possible next tokens, which can be seen as a "superposition" of linguistic states, similar to a quantum wavefunction. When the model samples a token, it's like a measurement collapsing this distribution to one outcome, but in MWI, all outcomes are realized in different branches. Thus, every possible text sequence exists in some parallel "world," and the sequence we see is just one branch.
This perspective highlights LLMs' exploration of multiple possibilities: beam search evaluates candidate sequences in parallel, mirroring MWI's branching. Hallucinations (fluent but incorrect outputs) might be viewed as "decoherence," where the model's state drifts into an incorrect branch due to noise or incomplete data. Model accuracy then depends on navigating these branches effectively, ensuring the chosen path aligns with reality. Where MWI retains every branch, beam search keeps only the top \( k \) sequences, scoring each by \( \mathrm{score}(s) = \sum_t \log P(x_t \mid x_{<t}) \). Each retained sequence is a branch, with hallucinations as decohered paths (Brown et al., 2020).
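A minimal beam-search sketch over a hand-written, context-free probability table (a toy stand-in for real model scores):

```python
import math

def beam_search(step_probs, k=2):
    """Tiny beam search over a fixed table of next-token probabilities.

    step_probs[t] maps each token to its probability at step t; the
    numbers are toy, context-free values purely for illustration.
    """
    beams = [([], 0.0)]                      # (sequence, log-probability)
    for probs in step_probs:
        candidates = [
            (seq + [tok], score + math.log(p))
            for seq, score in beams
            for tok, p in probs.items()
        ]
        # Keep the k highest-scoring branches; the rest are pruned,
        # unlike MWI, where every branch persists.
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:k]
    return beams

steps = [{"the": 0.6, "a": 0.4}, {"cat": 0.7, "dog": 0.3}]
beams = beam_search(steps, k=2)
```

With these numbers the surviving branches are "the cat" (probability 0.42) and "a cat" (0.28); the pruned branches are the analogue of the worlds we never observe.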
Rationale (Brief)
Monty Hall
The Monty Hall problem is a famous probability puzzle based on a game show scenario. A contestant is presented with three doors. Behind one door is a car, and behind the other two doors are goats. The contestant initially selects one door. Monty, who knows what is behind each door, then opens one of the other two doors to reveal a goat. The contestant is then given the option to either stick with their original choice or switch to the other unopened door.
The probability of winning if the contestant switches is \( \frac{2}{3} \), while the probability of winning if the contestant sticks is \( \frac{1}{3} \).
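These probabilities are easy to confirm by simulation (a Monte Carlo sketch, not from the source):

```python
import random

def monty_trial(switch, rng):
    """One round of the Monty Hall game; returns True if the contestant wins."""
    doors = [0, 1, 2]
    car = rng.choice(doors)
    pick = rng.choice(doors)
    # Monty opens a door that is neither the contestant's pick nor the car.
    opened = rng.choice([d for d in doors if d != pick and d != car])
    if switch:
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == car

rng = random.Random(0)
n = 100_000
p_switch = sum(monty_trial(True, rng) for _ in range(n)) / n
p_stick = sum(monty_trial(False, rng) for _ in range(n)) / n
```

Over many trials the switch strategy wins about two thirds of the time and the stick strategy about one third.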
Monty Hall and the Vulnerable World Hypothesis (VWH)
Now, applying the Monty Hall problem to the Vulnerable World Hypothesis (VWH), we can think of the three doors as different technological pathways or choices in the context of AI development. Initially, society makes a choice (like selecting a door), unaware of the long-term consequences of that choice. Monty, who represents societal awareness or experts, then reveals one of the "risky" paths (akin to opening a door with a goat behind it), guiding the contestant (society) toward a better decision.
The concept of switching doors in the Monty Hall problem corresponds to the decision to constrain or regulate potentially destabilizing technologies, such as AI systems. Just like switching doors in the game increases the probability of winning (to \( \frac{2}{3} \)), switching to a more cautious approach in technological development (i.e., imposing constraints) increases the likelihood of mitigating existential risks associated with these technologies.
In this analogy:
- The initial choice of a door represents the unrestricted development of powerful technologies (such as LLMs) without regulation.
- Monty's reveal represents the growing awareness of the risks and hazards of these technologies.
- The decision to switch doors is analogous to adopting mitigation strategies (like implementing AI constraints), which increases the probability of avoiding catastrophic outcomes.
The VWH posits that, at some level of technological development, civilization-destabilizing capabilities become accessible enough that devastation occurs by default unless extraordinary preventive measures are taken (Bostrom, 2019).
Conclusion
Viewing language modeling through an Everettian lens provides a novel framework for understanding LLMs' probabilistic and multi-faceted nature. It underscores the idea that, while we experience one specific output, the model inherently considers a vast array of alternative linguistic realities, akin to MWI's branching universes. This perspective can inform research into improving model accuracy, reducing hallucinations, and enhancing generalization, potentially bridging quantum computing and AI, as seen in recent developments like "World’s first quantum large language model can shape future of AI". The matrix equivalence ties quantum formalism to Transformer mechanics, while the formulae deepen the parallel. This lens not only clarifies LLM behavior but hints at quantum-inspired enhancements (Garg et al., 2023).
References
- Everett, H. (1957). Relative State Formulation. Physical Review.
- DeWitt, B. S., & Graham, N. (1973). The Many-Worlds Interpretation. Princeton University Press.
- Wallace, D. (2012). The Emergent Multiverse. Oxford University Press.
- Vaswani, A., et al. (2017). Attention is All You Need. arXiv:1706.03762.
- Radford, A., et al. (2019). Language Models are Unsupervised Multitask Learners. OpenAI.
- Kaplan, J., et al. (2020). Scaling Laws for Neural Language Models. arXiv:2001.08361.
- Bahdanau, D., et al. (2014). Neural Machine Translation by Jointly Learning to Align and Translate. arXiv:1409.0473.
- Shannon, C. E. (1948). A Mathematical Theory of Communication. Bell System Technical Journal.
- Zurek, W. H. (2003). Decoherence and the Transition from Quantum to Classical. arXiv:quant-ph/0306072.
- Nielsen, M. A., & Chuang, I. L. (2010). Quantum Computation and Quantum Information. Cambridge University Press.
- Brown, T., et al. (2020). Language Models are Few-Shot Learners. arXiv:2005.14165.
- Bostrom, N. (2019). The Vulnerable World Hypothesis. nickbostrom.com.
- Garg, S., et al. (2023). Unleashing LLMs for Quantum Computing. arXiv:2307.08191.
- Penrose, R. (1989). The Emperor’s New Mind. Oxford University Press.
- Deutsch, D. (1997). The Fabric of Reality. Penguin Books.