Alignment and Sacred Texts

I’m working on a course in which we will examine in parallel solutions to problems of human intelligence alignment (how do we get people to coordinate and cooperate?), professional intelligence alignment (how do we ensure experts serve the public good?), and machine intelligence alignment (how do we ensure artificial agents act in humanly beneficial ways?). One exercise I am considering for the course is to take rules or exhortations to good behavior from sacred texts and identify analogues in conversations about AI alignment.

I sat down for a working session with Claude 3.5 Sonnet to talk this over. Here’s an edited version of our conversation. Portions in italics are my “prompts”; the non-italic portions are my edits of, reports on, and rewrites of Claude’s output. Note that all of the passages were selected by Claude.

I’m teaching a course on human, machine, and professional alignment. The goal of the course is to explore analogies between the techniques we use to regulate and coordinate behavior in these three realms.

Consider this short assignment: “find a passage in a ‘sacred text’ that is about human intelligence alignment and briefly explicate the passage and briefly discuss what kind of human alignment problem it addresses.”

Imagine you are a student in the class. Generate three sample responses to the assignment.

CS35: I’ll craft three sample student responses that analyze different sacred texts for insights about human alignment.

from the Confucian Analects

Let’s consider first a passage from the Confucian Analects on self-cultivation:

“The Master said: ‘At fifteen I set my heart on learning; at thirty I took my stand; at forty I came to be free from doubts; at fifty I understood the Decree of Heaven; at sixty my ear was attuned; at seventy I could follow my heart’s desire without overstepping the line.'” (Analects 2.4)

This passage addresses what we might call the “lifetime alignment problem” – how to gradually align one’s desires, actions, and understanding with moral truth. Confucius describes a progression from external learning to internalized wisdom, culminating in a state where natural inclinations become perfectly aligned with proper conduct.

The alignment mechanism here is not about suppressing desires but about transforming them through long-term cultivation, so that the individual can “follow their heart’s desire” while naturally staying within ethical bounds. This mirrors modern discussions about creating AI systems that pursue their objectives, but where those objectives have been conditioned, implicitly or explicitly, by humanly defined constraints.
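One way to make that analogy concrete is penalized (constrained) optimization, where the human-defined limits are folded into the objective itself rather than enforced by an external censor, so maximizing the objective already “stays within the line.” Here is a minimal sketch; the reward, the constraint, the penalty weight, and all names are my own illustrative assumptions, not any particular lab’s method:

```python
import numpy as np

def task_reward(action: np.ndarray) -> float:
    """Hypothetical raw objective: bigger actions score higher."""
    return float(action.sum())

def constraint_cost(action: np.ndarray) -> float:
    """Hypothetical human-defined bound on 'proper conduct':
    penalize any component with |a_i| > 1."""
    return float(np.clip(np.abs(action) - 1.0, 0.0, None).sum())

def cultivated_objective(action: np.ndarray, lam: float = 10.0) -> float:
    # The constraint is folded into the objective itself, so maximizing
    # this already "follows the heart's desire without overstepping
    # the line" - there is no separate censor to fight.
    return task_reward(action) - lam * constraint_cost(action)

# A crude random-search "optimizer" over candidate actions.
rng = np.random.default_rng(0)
candidates = rng.normal(size=(1000, 3)) * 3.0
best = max(candidates, key=cultivated_objective)
print(best)  # components cluster near the constraint boundary, not beyond it
```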

from the Buddhist Diamond Sutra

“So you should view this fleeting world—
A star at dawn, a bubble in a stream,
A flash of lightning in a summer cloud,
A flickering lamp, a phantom, and a dream.”
(Diamond Sutra, Chapter 32)

At first, Claude suggested this passage was about what it called the “self-model alignment problem” – how to align our understanding of self with reality to minimize harmful behavior – and that the solution proposed is a deep recognition of impermanence and interconnectedness. When I asked for elaboration, Claude went back to the text: each image suggests a different mode of unfixedness:

  • “star at dawn” – something that fades as conditions change
  • “bubble in a stream” – something that seems solid but has no real substance
  • “flash of lightning” – something dramatically present but instantaneous
  • “summer cloud” – something that seems substantial but is formless and shifting
  • “flickering lamp” – something that appears constant but is actually moment-to-moment
  • “phantom” – something that appears real but has no substance
  • “dream” – something that seems completely real while experienced but is ultimately illusory

Four modes of fluidity emerge here: temporal impermanence (star, lightning), insubstantiality (bubble, cloud), instability (flickering lamp), and unreality (phantom, dream). These are properties of the world, of everything. Originally Claude seemed to think this meant that alignment might call for a “more accurate” self-model, but now it wonders if we might be questioning the very enterprise of fixed modeling itself.

I push it on what it means by “fixed model” and it admits this might be a bit of a strawman, a simplified caricature of what an AI system is. Given the opportunity to rethink, Claude suggests that maybe the Diamond Sutra passage isn’t really teaching us how to build better models or how to make models more accurate or flexible. Rather, the alignment lesson might be: we need to look carefully at what we mean by “alignment” itself. Our default idea of alignment might be:

  1. A state to be achieved and maintained
  2. A fixed relationship between system behavior and human values
  3. A stable set of constraints or objectives

When maybe we should be thinking about alignment as:

  1. A dynamic capacity to respond appropriately without fixed reference points
  2. An ability to act beneficially without requiring stable representations of “beneficial”
  3. A way of being in relationship rather than a state to be achieved

In other words, perhaps the alignment challenge isn’t “how do we get the system to maintain the right model/behavior/objectives” but rather “how do we create systems that can be reliably beneficial without requiring fixed models of what ‘beneficial’ means?”

So, in the passage the reader is exhorted to become a better person through deep contemplation of what it means to be human in the world. Perhaps, by analogy, a beneficially aligned AI needs to learn how an enlightened human lives rightly in a world without fixed guideposts.

At one point I explicitly ask if it sees any connection between these texts and the alignment problems outlined in Amodei et al. (2016). Claude suggests that the Diamond Sutra’s teaching about impermanence addresses issues parallel to reward gaming. Just as AI systems might pursue “literal” interpretations of reward functions in unintended ways, humans can fall into “gaming” reality by treating temporary phenomena as permanent and optimizing for ephemeral rewards. The Buddhist solution – recognizing the fundamental impermanence of all phenomena – might be analogous to AI safety approaches that try to thwart reward gaming by:

  • Making reward functions more robust to manipulation
  • Teaching systems to recognize the broader context of their actions
  • Avoiding over-optimization of specific metrics

The text suggests that proper understanding of impermanence naturally prevents gaming behavior, just as we might want AI systems to have proper models of their reward functions’ limitations.
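To make the third of those bullets concrete, one simple tactic is to give every proxy metric diminishing returns before combining them, so that no single signal can be gamed for unbounded reward. A toy sketch; the metric names and the tanh squashing are my illustrative assumptions, not anything from Amodei et al.:

```python
import math

def robust_reward(metrics: dict[str, float]) -> float:
    """Combine proxy metrics with diminishing returns on each.

    Squashing every metric through tanh bounds its contribution, so no
    single proxy can be gamed for unbounded reward: past a point,
    pushing one signal higher adds almost nothing, and the optimizer
    is steered toward broad adequacy instead of one ephemeral metric.
    """
    return sum(math.tanh(v) for v in metrics.values())

# A policy that games one metric scores worse than a balanced one:
print(robust_reward({"clicks": 100.0, "quality": 0.0, "safety": 0.0}))  # ~1.0
print(robust_reward({"clicks": 1.5, "quality": 1.5, "safety": 1.5}))    # ~2.7
```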

Prodded to make connections to other alignment issues, Claude wonders if the Diamond Sutra’s emphasis on seeing through illusions of permanence might relate to mesa-optimization problems. Even a model well trained to do what humans want could contain internal optimizers that pursue contrary goals. Just as the sutra warns about minds constructing and then optimizing for illusory permanent entities, we worry about AI systems developing mesa-optimizers that pursue objectives different from their training objectives.

The Buddhist solution – radical recognition of impermanence – suggests an interesting approach to mesa-optimization: perhaps we need architectures that inherently resist the formation of stable mesa-optimizers, maintaining flexibility rather than crystallizing into fixed optimization patterns.
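Mesa-optimization itself is hard to exhibit in a few lines, but the adjacent failure is easy to sketch: an objective learned during training that coincides with the intended one on the training distribution, then comes apart at deployment. A toy example, with every detail invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Training: the proxy ("looks good to the evaluator", column 0) and the
# intended objective ("is actually good", column 1) coincide exactly,
# so nothing in training distinguishes them.
X_train = rng.normal(size=(500, 2))
X_train[:, 1] = X_train[:, 0]
proxy_labels = X_train[:, 0]

# Fit a linear "objective model" on the training data.
w, *_ = np.linalg.lstsq(X_train, proxy_labels, rcond=None)

# Deployment: the two features come apart. Anything steering by the
# learned objective w now chases whatever w happened to crystallize on.
X_test = rng.normal(size=(500, 2))
learned_score = X_test @ w
actually_good = X_test[:, 1]

print("learned weights:", w)  # ~[0.5, 0.5]: half the weight sits on the shortcut
print("correlation with the intended objective:",
      np.corrcoef(learned_score, actually_good)[0, 1])  # ~0.71, not 1.0
```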

from the Bhagavad Gita on Action Alignment

“You have a right to perform your prescribed duties, but you are not entitled to the fruits of your actions. Never consider yourself to be the cause of the results of your activities, nor be attached to inaction.” (Chapter 2, Verse 47)

The passage suggests a sort of radical humility as a human alignment mechanism: one ought to subordinate oneself to one’s duties and obligations and produce effects in the world without taking pride or ownership in those effects, a servant mindset. Claude wants to call this the “motivation alignment problem” – how to align human actions with duty/dharma while avoiding the distorting effects of attachment to outcomes. Krishna’s teaching to Arjuna presents a sophisticated model of aligned action that neither falls into paralysis nor becomes corrupted by desire for results.

The Gita presents a paradoxical state: full commitment to action and fulfillment of duty without attachment to specific outcomes. For AI alignment, this suggests a key design principle: systems that combine strong capability with radical flexibility about how their objectives are achieved. Like Arjuna, an AI system needs to act decisively without becoming brittle or overly focused on specific outcomes. Like karma yoga (action without attachment), an AI system needs to optimize without over-optimizing – pursuing objectives without becoming harmfully wedded to particular paths or results. This can mitigate specification gaming (by reducing pressure to achieve specific outcomes), negative side effects (by maintaining flexibility about means), and goal rigidity (by separating capability from attachment).
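One existing proposal that reads like “full commitment without attachment to the argmax” is quantilization (Jessica Taylor’s quantilizers): act from the set of good-enough options rather than chasing the single estimated-best outcome. A rough sketch, with the utility function and parameters as stand-in assumptions:

```python
import numpy as np

def quantilize(actions, utility, q=0.1, rng=None):
    """Choose randomly among the top q-fraction of actions by estimated
    utility, rather than committing to the single argmax.

    The argmax concentrates all optimization pressure on the one
    outcome the estimate favors most - the home of specification
    gaming. Sampling from the "good enough" set keeps commitment to
    acting well while staying unattached to any particular fruit.
    """
    rng = rng or np.random.default_rng()
    scores = np.array([utility(a) for a in actions])
    cutoff = np.quantile(scores, 1.0 - q)
    good_enough = [a for a, s in zip(actions, scores) if s >= cutoff]
    return good_enough[rng.integers(len(good_enough))]

# Usage with a made-up utility: any of the near-best actions may be chosen.
actions = list(range(100))
choice = quantilize(actions, utility=lambda a: -(a - 70) ** 2, q=0.1)
print(choice)  # some action close to 70, not necessarily 70 itself
```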

The Gita’s teaching about “action without attachment to fruits” also offers insight into corrigibility and deceptive alignment. The text describes a state of perfect capability combined with perfect corrigibility – Arjuna should act with full commitment while maintaining complete openness to direction. This paradoxical state mirrors our desires for AI systems that are both highly capable and fully corrigible. The Gita suggests this isn’t necessarily contradictory, offering a model of optimization that doesn’t create resistance to correction.
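As a toy illustration of that combination, consider an agent whose control loop treats human correction as plain input to follow rather than an obstacle to route around: nothing in its step logic rewards resisting or preempting the override. This is a behavioral sketch only (real corrigibility concerns learned incentives, not control flow), and every name here is hypothetical:

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class CorrigibleAgent:
    """A toy agent that acts decisively yet defers to correction.

    The design choice being illustrated: nothing in step() rewards
    avoiding, resisting, or preempting the human override, so
    compliance is structurally "free" - action without attachment.
    """
    policy: Callable[[str], str]        # maps an observation to an action
    override: Optional[str] = None      # set by a human at any time

    def receive_correction(self, new_action: str) -> None:
        self.override = new_action

    def step(self, observation: str) -> str:
        if self.override is not None:   # defer immediately, no resistance
            action, self.override = self.override, None
            return action
        return self.policy(observation) # otherwise, act with full commitment

# Usage: the agent pursues its policy until corrected, then simply complies.
agent = CorrigibleAgent(policy=lambda obs: f"advance toward {obs}")
print(agent.step("the objective"))      # advance toward the objective
agent.receive_correction("stand down")
print(agent.step("the objective"))      # stand down
```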
