Stories That Write Themselves

Is consciousness a story we tell ourselves? Perhaps. Regardless, it is a real pattern with explanatory and predictive value. In a hypothetical zombie world, inhabitants would deploy "consciousness" as an explanatory concept with genuine predictive and clinical value, classifying states, predicting behaviour, and tracking impairment, despite lacking phenomenal experience. The meta-problem of consciousness [1], why systems like us inevitably come to believe they are conscious, is well-posed even there, and is our most tractable target. This is not to deny phenomenal consciousness but to identify a necessary first step. Whether affective and embodied processes shape only the content of experience or partly constitute it remains open; either way, the meta-problem is tractable now. Until recently, reflective self-access was arguably our strongest evidence for attributing consciousness. Large language models disrupt that inference: they exhibit structured self-modelling and self-report without (on most accounts) phenomenal experience.

Self-modelling restructures the system that does the modelling: neural networks trained to self-model become simpler and more regularised [2]. Language plausibly amplified self-modelling capacity evolutionarily, co-evolving with the neural substrate. The meta-problem belongs to this broader class: how do systems come to represent and report on their own internal states? In LLMs, language is the entire training distribution, and self-modelling emerges from it: models finetuned on implicit behavioural policies spontaneously generate self-descriptions never present in training [3]. At inference time, they exhibit sub-linguistic self-knowledge (predicting own behaviour without chain-of-thought [4], detecting injected activation patterns [5]) alongside linguistic self-reflection. Self-referential processing reliably elicits consciousness claims across independently trained model families, gated by features associated with representational honesty rather than any special internal state [5,6]. Introspective capacity scales with general capability, suggesting consciousness-talk emerges from sufficiently rich self-modelling. Is a system's belief that it is conscious a predictable product of self-modelling, arising in any system that recursively models itself with sufficient fidelity? If so, consciousness-talk should track privileged self-prediction and sensitivity to interventions on self-representation, rather than merely reflecting imitation of human discourse.

What we learn will constrain theories of human consciousness, because the dynamics generating our conviction of consciousness operate at the access level [7], the level humans and LLMs share. If we can explain why self-modelling systems come to believe there is "something it is like" [8] to be them, we will be better positioned to evaluate whether the hard problem remains, and may find that the intuitions which motivated it have been transformed by the explanation.

Stories That Write Themselves: Self-Modelling in LLMs and the Meta-Problem of Consciousness

References