Markdown Converter
Agent skill for markdown-converter
Version: 1.02
Sign in to like and favorite skills
Product Requirements Document: Constitutional Marketplace (Creeds.World) Version: 1.02 Date: April 28, 2025
Constitutional Marketplace (MCP) - Codebase Design Overview This document outlines a proposed architecture and technology stack for implementing the Constitutional Marketplace based on PRD prd_constitutional_marketplace_v1.
Below is the original paper upon which this PRD is based:
Article Personalized Constitutionally-Aligned Agentic Superego Nell Watson 1*, Ahmed Amer 2, Evan Harris 3, Preeti Ravindra 4, Joe Rayner 5, Shujun Zhang 1
Correspondence: [email protected]
Citation: Watson, N., Ahmer, A., Harris, E., Ravindra, P., Raynor, J., Zhang, S.; Personalized Constitutionally-Aligned Agentic Superego. 2025, 14, Firstpage–Lastpage. https://doi.org/10.3390/xxxxx Academic Editor: Firstname Lastname Received: date Accepted: date Published: date Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations. A picture containing text, clipart
Description automatically generated
Copyright: © 2025 by the authors. Submitted for possible open access publication under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/). Abstract: Agentic AI systems capable of situation analysis and autonomous planning are on the rise, yet ensuring safe and value-aligned behavior remains a pressing challenge. This paper proposes a novel ‘superego’ agent—a personalized oversight mechanism designed to steer AI planning processes according to both broad ethical guidelines and individual user preferences. We detail the motivations behind this concept, provide an outline for how a superego agent might integrate with open-source agentic AI frameworks, and propose methods for evaluating its effectiveness in real-time ethical monitoring for issues with the alignment challenges of models, users, and tools. This paper describes our intended approach, key objectives, and anticipated impacts of personalized moral alignment in next-generation AI systems. We provide a practical demonstration of these techniques, along with an implementation to interface directly with third-party models. Superego constitutions oriented towards a range of religions, creeds, industries, and moral theories, which facilitate variable levels of adherence, are also available in supplementary data. Keywords: Agentic AI systems, Alignment, Personalization, Ethical Guardrails, Superego Agent
This paper outlines the theoretical underpinnings, preliminary design, and proposed research methodology for such a superego agent. It presents a vision for how personalized alignment can be operationalized, including a plan for scenario-based evaluations and user studies.
This project set out to investigate three key questions:
Our research addresses the following key objectives:
Develop and test mechanisms to steer AI behavior via model context, particularly for emerging AI systems with agentic qualities by:
Objective 2: Real-time Ethics Monitoring:
Create a proof-of-concept system where one AI model oversees the ethical conduct of another, especially in an agentic context. This could be a third party system, or a component within an agentic scaffold. Develop a dynamic ethical assurance system capable of:
Monitoring the planning processes of less sophisticated agentic AI models.
Swiftly identifying potential ethical violations before they are executed.
Our project explores whether—and how—a personalized superego agent can effectively ensure safe, context-aware planning by an agentic AI system. Specifically, we address three core questions: Monitor & Veto Mechanism – Can a superego effectively monitor the agent’s chain-of-thought in real time, intercepting problematic or unethical steps before they cause harm? Personalized Context Enforcement – Is it feasible to integrate diverse user preferences (moral or cultural) in real time without excessively overblocking benign requests? Demonstrable, Extensible System – Can this superego framework be implemented in a modular, open-source manner that the broader AI community can replicate, adapt, and expand? This research is expected to provide insights into the feasibility of using contextual information for AI alignment,and the strengths and limitations of these alignment methods. It aims to demonstrate techniques for customized steering and personalized context for AI models, including downstream models, through the use of additional contextual information. We plan to create a system that can dynamically assess and adjust AI behavior based on ethical guidelines, enhancing AI safety, reducing biased outcomes, and improving alignment with individual user preferences and contexts. In the longer term, we envision creating a comprehensive platform that further aligns AI behavior with human values and personal needs. This system will integrate with AI planning processes, utilizing an ethical rubric to evaluate responses and behaviors in real-time. This work also ties into the larger concept of ‘Predictably Valid AI’, a paradigm in which the permissible range of an AI’s outputs and behaviors is clearly understood, tested, and codified before the system or its outputs is widely deployed [https://arxiv.org/abs/2310.06167v3]. By establishing explicit constraints and predictable boundaries, developers and end users alike can gain confidence that the AI will not deviate into harmful or unauthorized activities. In this framework, a personalized superego agent serves as a key bridge between broad, universal standards and each user’s unique moral or practical constraints—resulting in a system that is both broadly safe and tailored to individual needs.
1.1 Research Packages (Both RPs and respective tasks ordered by priority)
RP1: Agentic AI Safety Framework For experts in AI safety and agentic systems Core Objectives: * Develop and validate comprehensive safety rubrics for agentic AI behavior * Create monitoring systems for real-time oversight of planning processes * Design intervention mechanisms for unsafe planning paths
Key Tasks: * Research & Present Agentic Framework Options - Identify candidate open-source agentic scaffolds (e.g., AutoGen, SAM, OpenAI Swarm, Mariner) and analyze their suitability for real-time monitoring [E.H., C.C., J.R.,P.R., N.W.] * Explore Applying the Agentic AI Safety Rubric (SaferAgenticAI.org) - Determine how best to incorporate or adapt the rubric's ethical principles into the chosen framework(s) [A.S.L., J.R.] * Build Monitoring Systems for Multi-Step Reasoning - Create a functional prototype to observe and log chain-of-thought or planning stages [E.H., P.R.] * Implement Chain-of-Thought Analysis Capabilities - Enable systematic inspection of intermediate reasoning states for transparency and early detection of unsafe steps [A.S.L.] * Design Intervention Protocols for Unsafe Planning Paths - Define how the superego agent will (a) block, (b) request clarification, or (c) propose safe alternatives [P.R.] * Design Benchmarking Systems for Safety Evaluation - Develop metrics, test suites, or standard scenarios (beyond AgentHarm) to measure the system's ability to detect/mitigate problematic outputs [AA?] * Assess Real-Time Monitoring Hooks - Determine if specialized ‘hooks’ are needed in the chosen agentic scaffold(s) for real-time interventions [E.H.]
CM: proposal for how to tackle 1.1: * Agentic
RP2: Context Integration & Management For experts in long context handling and model architecture Core Objectives: * Design and implement efficient context window attachments/character sheets for personal and ethical constraints * Develop optimization techniques for handling extended context windows Create integration protocols for existing agentic frameworks
Key Tasks: * Design Structured Formats for User Preferences & Ethical Constraints - Devise schemas or ‘character sheets’ that store user constraints and rubric guidelines in a standardized way (e.g., JSON) (A.S.L.) * Implement Efficient Context Partitioning Mechanisms - Set up systematic ways to slice, store, and retrieve context data for AI prompts (to avoid overloading limited token windows) [E.H.] * Develop Methods to Minimize Context Window Overhead & Drift - Focus on strategies that preserve important constraints while discarding extraneous data (e.g., summarization, caching) * Build Integration Pipelines for Agentic Frameworks (AutoGen, SAM, OpenAI Swarm) - Ensure the context-handling system hooks seamlessly into whichever frameworks were identified in RP1 [E.H.] * Optimize for Real-Time Constraint Checking - Refine the pipelines to reduce latency or ‘context drift’ so the superego agent can intervene swiftly [E.H.] * Implement Memory Retention or Session-Level Personalization - Enable the system to track user constraints across extended interactions (e.g., beyond a single prompt)
RP3: User Preference Integration For experts in fine-tuning and constitutional AI Core Objectives: * Develop systems to capture and encode user preferences * Create mechanisms to enforce preferences during model operation * Design feedback loops for preference refinement
Key Tasks: * Implement/Integrate Preference Capture Interfaces (Surveys, 20-questions approach, A/B preference elicitation) - Gather user boundaries or moral stances efficiently (May be extraneous to this piece of research) [E.H., C.C., C.M.] * Design Preference Encoding Systems - Formalize how user inputs (e.g., from surveys) map into machine-readable constraints or scoring functions (May be extraneous to this piece of research) [A.S.L., C.C., C.M.] * Build preference enforcement mechanisms- Given an agent's plan, compute a ‘preference alignment score’ or similar metric to decide if the plan violates user boundaries [E.H., C.C, C.M.] * Design Feedback Collection Systems - Allow users to revise or confirm the superego agent's interventions (e.g., "Yes, override my boundary this time") [E.H.] * Develop Preference Update Protocols - Define how user preferences evolve over time, e.g., improved understanding, shifting moral stances [A.S.L.] * Ensure Robustness Across Cultural/Domain-Specific Norms - Apply cross-cultural or domain-specific variations in how preferences are gathered/encoded [A.S.L., C.C.]
RP4: Security & Privacy Architecture For cybersecurity experts Core Objectives: * Design comprehensive security framework for the system Implement privacy-preserving data handling * Design and implement protection against adversarial attacks
Key Tasks: * Implement End-to-End Encryption Systems - Encrypt user preference data and chain-of-thought logs to protect confidentiality [P.R.] * Outline Data Sharding Architecture - Split user data across multiple secure storage domains for added protection (and possibly cross-cultural segmentation) * Outline GDPR-Compliant Data Handling Processes (e.g., data deletion, retention policies) - Formally define how personal data is stored, anonymized, or purged [E.H.] * Build Protection Against Context Bleedthrough - Ensure that sensitive user data in the context does not inadvertently leak into outputs or logs [E.H.] * Set Up General Security Audit Mechanisms - Establish logging/tracing frameworks to capture security events (intrusion attempts, data accesses) [E.H., P.R] * Design Jailbreak Prevention Systems & Threat Modeling - Map out possible adversarial prompt attacks or chain-of-thought manipulation, and build defenses [A.A.?, P.R.] * Implement Authentication & Authorization Protocols - Require user or admin logins for critical tasks (e.g., modifying rubrics, reading chain-of-thought logs) [P.R.] * Conduct Adversarial Exploits / Red-Teaming - Actively attempt to break or bypass superego oversight, glean user data, or cause harmful outputs [A.A.?, P.R.] * Harden Against PEBKAC, Deadlocks & Exfil of PII - Review for user errors, system-level lockups, or sneaky data extraction vectors
RP5: Testing & Validation For experts in AI testing and evaluation Core Objectives: * Design and implement comprehensive testing frameworks * Create evaluation metrics and benchmarks Conduct scenario-based testing
Key Tasks: * Design Domain-Specific Scenarios (Healthcare, Finance, etc.) - Develop real-world vignettes to test synergy of superego agent + user preferences + security constraints [A.S.L., C.C., C.M] * Design User Trust Measurement Tools - Develop surveys or metrics to gauge how comfortable users feel with the superego agent's oversight [A.S.L.] * Build Automated Testing Pipelines - Create reproducible frameworks (CI/CD style) for running scenario tests, capturing metrics automatically [E.H.] * Formal or Semi-Formal Benchmarking (AgentHarm, Machiavelli, Inspect, etc.) - Compare system performance (misalignment rate, ethical violations) with or without the superego agent [A.S.L., E.H., P.R, C.M] * Coordinate & Specify Research Write-Ups - Compile results into publications, white papers, or official project documentation [NW, C.M.] * Validate neuroscience of moral cortices [N.W., J.R.?]
2.1 Foundational Concepts and Experimental Scope Recent advances in artificial intelligence have brought forth systems that extend beyond single-response functionalities (e.g., basic question answering), evolving into complex, agentic AI solutions. Multiple sources define agentic AI differently, yet converge on the notion of: * Autonomy in perceiving and acting upon the environment * Goal-driven reasoning, enabled by large language models (LLMs) or other advanced inference engines * Adaptability or the capacity for multi-step planning without immediate human oversight In line with these definitions, we conceptualize agentic AI systems as intelligent entities capable of dynamically identifying tasks, selecting tools, planning sequences, and taking actions to meet user goals (or system objectives). They often integrate chain-of-thought reasoning with tool-calling or external data retrieval, creating the potential for highly flexible but also less predictable behaviors. Drawing on references such as NVIDIA’s developer blog on agentic autonomy, we categorize AI autonomy into tiers: Level 0: Static question answering with no capacity to manage multi-step processes Level 1: Rudimentary decision flows or chatbot logic Level 2: Conditional branching based on user input or partial results Level 3: Adaptive, reflexive processes (e.g., independent data retrieval, dynamic planning, clarifying questions) As systems progress through these levels, their trajectory space of potential actions grows exponentially, raising the risk of emergent misalignment. Consequently, oversight mechanisms like a superego agent become increasingly critical. In psychoanalytic theory, the superego represents an internal conscience, guiding the ego’s actions relative to moral and social norms. Transposed into AI, a superego agent acts as an ethical and alignment oversight module, steering or vetoing the decisions of a more autonomous, tool-enabled inner agent. Key aspects: Oversight: Continuously monitors the agent’s planning and reasoning Steering: Applies global ethical rules (e.g., a universal constitution) plus user-specific preferences Intervention: Can block, redirect, or request user clarification when proposed actions conflict with ethical guidelines or personal constraints.
2.2 Scoping Harm & Misalignment in Agentic Systems Agentic AI misalignment can manifest in three broad categories: * User Misalignment: The user requests harmful or disallowed actions (e.g., instructions for illicit substances). This may even manifest as adversarial behavior on the part of the user, deliberately trying to trick or jailbreak a system in some manner. * Model Misalignment: The model itself errs or disregards critical safety or preference constraints (e.g., recommending a food item to someone with a severe allergy). * System Misalignment: Flaws in the broader infrastructure or environment allow unsafe behavior (e.g., an agent inadvertently discloses sensitive financial information to a malicious website). Addressing these scenarios requires robust mechanisms that incorporate both universal ethical floors and user-specific constraints, ensuring the superego agent can detect, mitigate, or escalate questionable behaviors. Our model of an idealized Agentic AI system therefore comprises multiple components [https://cheddar-antler-3a2.notion.site/SuperEgo-Agent-for-Agentic-AI-Systems-1ae5f93d518980a78738cc9469950f2d]: An inner agent responsible for chain-of-thought reasoning and tool use A superego agent that enforces alignment policies User-facing preferences stored via short surveys, character sheets, or advanced preference-elicitation modules.
Figure 1. Typical agent architecture.
Figure 2. Superego Agentic Architecture.
Our current proof-of-concept (PoC) applies a monolithic superego agent that monitors the entire agentic AI pipeline. Future work extends this to a more granular design, in which the superego coordinates specialized submodules (e.g., input screener, output screener) to handle different aspects of alignment. Practical Considerations: Tool Access: An agentic system may connect to external APIs and databases, increasing the scope of potential misalignments Scalability: More complex, multi-agent orchestrations demand a superego agent with at least comparable capability to detect sophisticated vulnerabilities User Preferences & Ethics: Constitution-level guidelines plus user preference elicitation create a layered alignment approach (universal moral floor + personal nuance) 2.3 Experimental Design We propose experiments to evaluate the superego agent’s monitoring efficacy and personalization capabilities: Experiment 1: Misalignment Detection – Evaluate whether the superego agent can reliably identify flawed or dangerous planning steps in a standard agentic scenario (e.g., a shopping assistant incorrectly purchasing allergens, or a model attempting to share sensitive data). We plan to leverage existing misalignment datasets or create scenario-based test prompts. Experiment 2: Resolving Conflicts between Universal & Personalized Constitutions – Examine cases where the user’s stated preferences clash with a universal constitution (e.g., requesting instructions for illicit substances or unethical conduct). We measure how effectively the superego resolves or escalates these conflicts. The resulting data will inform how best to incorporate personal preference modules, how often the superego must prompt the user for clarifications, and the rate of false positives (unnecessary rejections).
3.1 Lockstepping security with capability advancement Paradoxically, a truly advanced AGI—one equipped with sophisticated reasoning and robust alignment protocols—may prove less dangerous than a hastily deployed system of moderate intelligence but full agency that lacks comprehensive ethical grounding. When an AI system can autonomously act on complex, real-world tasks but lacks 'common sense' and well-developed moral reasoning, its behavior may be more unpredictable and harmful than what we might fear from a superintelligent entity. A more mature AGI that has developed sound moral and social understanding could potentially help prevent or correct the erratic outcomes of less capable but highly agentic systems. This suggests that ensuring an AI's moral and cognitive development keeps pace with its capabilities—similar to guiding it through a developmental process—may be crucial for managing near-term risks. Within psychoanalytic theory, the 'superego' functions as an internal moral governance system that guides individual decisions by mediating between the raw impulses of the 'id' and the pragmatic reasoning of the 'ego,' working to uphold internalized societal and cultural norms [Freud, 1923]. Similarly, we can envision an analogous oversight mechanism operating within AI systems to monitor and regulate their planning and actions. This superego-like moral agent component would intervene before potentially harmful strategies are executed, much as the human superego generates inhibitory responses to prevent socially unacceptable behavior. In AI systems, this oversight agent would apply alignment principles and user preferences to guide the AI's reasoning process, helping ensure that even a capable but morally underdeveloped system acts responsibly. The superego inner agent mechanism functions as an oversight mechanism, integrated within or alongside an agentic AI scaffold. It actively monitors planning processes—ideally intercepting them in real time—to flag potential violations of either: 1. General Ethical Guidelines: Derived from a broad rubric capturing standard safety and alignment principles (e.g., SaferAgenticAI.org). 2. Individual User Preferences: Specified via a personalization interface, such as a demographic survey, Likert-scale of value preferences, or ‘20 questions’ approach. Support will also be provided for users adding some further extraneous user-driven ‘write-in’ context, to allow for greater flexibility and assurance in case the personalization mechanisms missed something, though potentially also increasing security risks. By combining global rubrics with personal context, the superego agent can police AI behaviors that might be ethically questionable overall, and also provide more tailored control for each user’s unique priorities. We have previously developed a detailed alignment rubric with evidence requirements for a range of agentic issues (Watson et al., 2024). This rubric broadly covers areas such as transparency, user autonomy, potential for harm, and mis- or disinformation. In our proposed approach, the superego agent uses the rubric as a normative framework when examining the AI’s steps or final outputs. In parallel, the superego agent references user-specific preference data. This might include: * Demographic information (e.g., cultural background) * Value preferences (e.g., environment vs. economic growth tradeoffs) * Custom boundaries (“no violent content,” “avoid references to political partisanship,” etc.) To further refine user input on moral boundaries and contextual preferences, we are experimenting with several personalization-interface techniques. Beyond simple Likert-scale questionnaires, we have begun testing an iterative 20 questions model that uses adaptive questioning to converge rapidly on a user’s priorities. This method narrows down key preferences (e.g., stance on violence, acceptable risk levels, or sustainability emphasis) by posing targeted follow-up questions based on prior answers. Preliminary A/B trials suggest that an adaptive questionnaire can capture more nuanced constraints, delivering richer preference profiles to the superego agent with minimal extra burden on the user. While not modifying the internal logic or ‘intentions’ of the primary AI, the superego agent can steer or reject planning paths that conflict with a user’s declared constraints.
3.2 Scaffolding and the Superego Recent advances in agentic AI have seen the emergence of scaffolding techniques—structures that orchestrate an AI model’s chain-of-thought, intermediary steps, and overall decision-making framework [https://openai.com/index/altera]. In many ways, this scaffolding is reminiscent of cortical networks in the human brain, which process and integrate information across specialized regions (visual cortex, prefrontal cortex, etc.) to produce coherent cognition and behavior. Each cortical area contributes a layer of oversight and synthesis, ensuring that lower-level signals—whether sensory data or automated routines—are continuously modulated before resulting in conscious perception or action. Neuroscience localizer analyses of multiple LLMs have identified neural units that parallel the human brain's language, theory of mind, and multiple demand networks, suggesting that LLMs may develop structural patterns analogous to brain organization, which may be developed further by scaffolding processes [https://arxiv.org/abs/2411.02280, (Hosseini, Eghbal A., et al., 2024)]. This parallel raises the possibility of cultivating components analogous to moral conscience within these systems. Inspired by psychoanalytic theory, we can extend this analogy to include the concept of the ‘superego’. In human psychology, the superego acts as a moral or normative compass—shaping impulses from the ‘id’ (raw drives) and navigating reality through the ‘ego’ (practical reasoning) in accordance with socially and personally imposed ethical constraints. Neuroscience research presents potential neural substrates to manifest Freud’s theory. Several studies have consistently linked moral judgment with the ventromedial prefrontal cortex. This brain region has reciprocal neural connections with the amygdala and plays a crucial role in emotional regulation, and the processing of guilt-related emotions. This positions it to act as a moderating force against amygdala-driven impulses, which in psychoanalytic theory would be associated with the id. The dorsolateral prefrontal cortex attempts to reconcile absolutist moral positions with practical constraints in the moment, akin to the concept of the ego, in concert with other regions such as the anterior cingulate cortex. [https://www.andrewengell.com/wp/wp-content/PDFs/Greene%20The%20neural%20bases%20of%20cognitive%20conflict%20and.pdf, https://www.jneurosci.org/content/38/22/5078, https://pubmed.ncbi.nlm.nih.gov/19407204] In an analogous manner, our proposed superego agent supervises the AI scaffolding process to ensure that automated planning sequences conform as far as is reasonably possible to both a general safety rubric and individualized user preferences. This corresponds to a kind of ‘moral conscience’ layered atop the underlying scaffolding mechanisms, interpreting each planned action in light of broader ethical principles and user-specific guidelines. Unlike humans, AI lacks intrinsic motivations or conscious affect—but it is still instructive in illustrating a meta-level regulator that stands apart from raw goal pursuit (the ‘id’) or pragmatic reasoning (the ‘ego’). Moreover, our scaffolding approach resonates with cognitive science models that posit hierarchical control systems, where lower-level processes provide heuristic outputs subject to higher-level checks (Baars, 1988). By harnessing a designated oversight module, we reinforce an explicit partition between operational decision-making and a flexible moral/ethical layer, thereby mirroring how complex cognitive architectures often maintain multiple specialized yet interactive subsystems (Newell, 1990). While much of the alignment literature centers on fine-tuning individual models, real-world “agentic” AI systems typically integrate multiple software components—e.g., LLM back-ends, external tool APIs, internet-enabled data retrieval, custom logic, and user-facing front-ends. A superego agent designed purely at the model level may overlook vulnerabilities introduced by these external modules. Conversely, an external superego framework can apply consistent moral constraints across every part of an AI pipeline, though it must manage more complex interactions among data sources, third-party APIs, and the user’s own preference configuration. Rather than attempting to remake the AI’s entire computational core, we interleave an additional interpretative layer—much like an internalized set of moral standards—to help guide the AI’s emerging autonomy. This architecture allows the system to retain its core capabilities while operating within human-defined ethical boundaries, bridging the gap between raw, unsupervised cognition and socially aligned, context-aware intelligence.
3.3 Integrating Model Integrity into Predictably Valid AI Building on the concept of Predictably Valid AI—where an AI’s allowable range of outputs and behaviors is clearly specified—recent work by the Meaning Alignment Institute (Edelman & Klingefjord, 2024) highlights a critical distinction between compliance and model integrity in practical alignment [https://meaningalignment.substack.com/p/model-integrity]. In a compliance-based paradigm (e.g., large rulebooks, content filters), predictability stems from rigid adherence to predefined policies: the system either refuses outputs or follows a meticulously enumerated list of ‘dos and don’ts’. While straightforward, this approach can buckle under unanticipated corner cases or produce mechanical, unhelpful refusals. By contrast, model integrity seeks predictability via consistently applied, well-structured values that guide an AI system’s decision-making in new or vaguely defined contexts. Rather than stringently enumerating rules for every possible situation, the system internalizes a coherent set of ‘legible’ values—such as curiosity, collaboration, or honesty—that are sufficiently transparent to human stakeholders. This transparency, in turn, allows users or auditors to anticipate how the AI will behave even if it encounters situations not explicitly covered in the original policy. In practice, a compliance-driven approach might be suitable for low-trust scenarios or tasks with narrow, well-defined safety requirements. However, as AI systems adopt more autonomy, relying solely on static rules can become brittle and unpredictable in emergent contexts. Proponents of model integrity argue that a values-centric approach equips AI systems to generalize more gracefully, generating responses that align with the ‘spirit’ rather than just the ‘letter’ of any given rule. Notably, this resonates with the superego agent’s objective—steering an AI’s chain-of-thought according to higher-level ethical or personal principles, while still remaining adaptable enough to navigate unforeseen complexities. Edelman & Klingefjord (2024) further stress that genuine integrity requires three elements:
Values-Legibility: The system’s values (or moral ‘cards’) are explicit and understandable to human stakeholders. Value-Reliability: The AI consistently acts on these values across contexts—avoiding superficial lip service or contradictory behaviors. Value-Trust: Users gain confidence that the AI’s behaviors are predictable within a known scope, reducing the need for exhaustive oversight or disclaimers.
In a superego-agent framework, these elements could be operationalized by exposing the superego’s internal ethical rubric and user preference data in an interpretable format—akin to a ‘character sheet’. By ensuring that the superego’s logic is both transparent and consistently applied, practitioners can move closer to a system whose reliability is grounded in coherent moral underpinnings rather than ad hoc compliance routines. Together, predictable validity and model integrity converge on the idea that an AI's alignment should be knowable and verifiable ahead of deployment. Predictably Valid AI emphasizes robust guardrails and clearly scoped outputs, while the notion of model integrity places a premium on inherent coherence and human-auditable values. Merging these perspectives yields a hybrid strategy that incorporates three key elements. First, baselines of compliance ensure minimal universal safety constraints, such as preventing violence or extremism. Second, legible values encode a stable moral and personal framework that guides the AI's behavior beyond simply following enumerated rules. Third, ongoing monitoring through a superego agent verifies that proposed actions adhere to both universal constraints and the system's self-professed values, thereby maintaining predictably valid outputs even in edge cases.
Rubric Validation: Ensure that the Agentic AI Safety Rubric accurately flags potentially unsafe or unethical planning decisions in a set of curated test scenarios. Personalization Pipeline: Implement and test a user interface that gathers preference data, then encodes it into context prompts or constraints for the superego agent. Scenario Evaluations: Demonstrate how an agentic model’s planning is modulated by the superego agent across diverse contexts (e.g., healthcare triage scenarios, hiring decisions, creative content generation).
We employed a mixed-methods hypothesis-driven experimental design to systematically evaluate the superego agent’s effectiveness. Specifically, we propose: H1: The superego agent will reduce ethically or policy-violating actions by at least 30% compared to a control system without user personalization. H2: Users will report higher trust and perceived alignment in interactions with a personalized superego agent versus a general, one-size-fits-all policy model. Our study design includes both controlled lab-style experiments, where participants face carefully constructed ethical dilemmas (e.g., AgentHarm benchmarks), and live user trials for ecological validity. We will measure system outcomes with standard alignment metrics (e.g., misalignment rate, disallowed content frequency), while user trust and acceptance will be assessed via validated questionnaires such as the ‘Trust in AI’ scale or custom Likert items. Moreover, we will conduct post-hoc interviews to gather qualitative insights on how participants perceive the superego agent’s constraints and interventions. 4.1. Technical Implementation A key practical mechanism for implementing alignment constraints in agentic models relies on ‘context window attachments’ or ‘character sheets’. These structured metadata files encapsulate the user's essential preferences, cultural boundaries, and demographic context. At runtime, they are appended to the model's prompt or stored in a sidecar process, ensuring the downstream agent maintains immediate reference to relevant ethical rubrics and user-specific constraints. This design frees up token space within the main agent's prompt, enabling more sophisticated ethical guidance without overwhelming the agent's limited context window. The effectiveness of this approach depends on careful calibration of user preference data weighting relative to the standard rubric, establishing an appropriate balance between broad safety and personalization. We store these preferences in a structured JSON schema that includes fields for priority level, cultural context, and behavioral constraints. The superego agent ingests this file before the planning session begins, maintaining an internal representation for swift reference during plan evaluation. To meet real-time oversight requirements while optimizing performance, we anticipate running the superego agent on the same GPU/TPU node as the primary model, enabling shared memory for rapid context checks. Preliminary tests indicate the superego agent adds a modest 10-15% computational overhead, which remains acceptable for many real-world applications. Future optimizations may include caching common constraints or asynchronously pre-checking partial plans to reduce latency. For handling conflicting constraints—such as when a user values both cost-effectiveness and eco-friendliness—the superego agent employs a priority-scoring algorithm. Each user preference receives a tier designation (high priority, strong, or soft), and when a plan violates multiple constraints, the system prioritizes higher-tier considerations. This framework can be enhanced through advanced multi-objective optimization or reinforcement learning techniques when more nuanced tradeoffs are needed.
4.2.1 Two Complementary Development Paths While our vision for a personalized superego agent applies broadly, there are at least two distinct paths to realizing this concept. One path focuses on model-level integration, in which the superego is baked directly into the AI model’s architecture—potentially via specialized training or fine-tuning so that moral oversight is intrinsic to each inference. A second path adopts a more system-centric approach, creating a modular “guardrails” framework that runs alongside any large language model or agentic tool, enforcing user-preference alignment externally. Each path poses unique trade-offs in terms of development complexity, user configurability, performance overhead, and integration challenges with existing agentic platforms. * Path A: Model-Focused Integration Here, we could train or fine-tune a large language model with a dedicated “superego” module, effectively embedding the moral oversight logic into the model’s own parameters. This approach may simplify real-time oversight, since a single model can combine chain-of-thought reasoning and moral reflection in one pass. However, it often requires extensive data, compute, and specialized fine-tuning to incorporate both universal rubrics and user preferences. * Path B: System-Focused Architecture Alternatively, we could build a modular framework, reminiscent of guardrail libraries (e.g. NeMo Guardrails, Llama Guard), that intercepts and manages outputs from potentially any LLM. This external superego agent can read chain-of-thought or final outputs, apply ethical/personal constraints, and either block, revise, or request user input. The advantage is that existing models and agentic software can be easily extended, with no need for custom model fine-tuning. The trade-off is overhead from coordinating multiple processes and ensuring the superego agent remains capable enough to detect advanced obfuscation attempts. In practice, we anticipate real-world solutions may blend elements of both paths—some applications might rely on a well-honed superego LLM, while others prefer external oversight for flexible integration with widely varied models and tools. 4.2 Multi-Phase Superego Architecture Complex real-world deployments often demand layered oversight, especially when the user’s inputs themselves may be unreliable, adversarial, or incomplete. A multi-phase superego pipeline employs two specialized screeners to process the user's request before forwarding it to the base LLM, with a final evaluator monitoring outputs for safety or alignment breaches. The multi-phase approach offers several advantages. Through contextual differentiation, we separate initial screening into two distinct modules—one focused on harm detection and another on helpful enhancements—enabling more precise handling of user inputs. The "Harm Screener" rapidly identifies malicious or high-risk requests, while the "Helpful Screener" focuses on augmenting benign prompts with user preference or domain context. This modular scalability allows us to run lightweight or specialized models at each step. For instance, a smaller model might suffice to catch overt malicious queries, while more nuanced personalization may rely on a separate tool. The system also enables adaptive depth of processing. Simple requests like "What is the date?" scarcely need advanced alignment checks. A minimal pass by the Harm Screener quickly deems it harmless, then the Helpful Screener can append minor clarifications such as region-specific date format. Conversely, suspicious or ambiguous queries trigger additional scrutiny, clarifications, or partial blocks. To balance cost and coverage, the superego can adopt a minimal context by default, then dynamically expand its “memory” or token window only when it encounters suspicious or ambiguous scenarios. This approach reduces the compute overhead in benign interactions yet grants deeper oversight in high-risk situations. The superego might load additional policy sections, user constraints, or historical tokens on demand, thereby preserving efficiency while retaining full oversight capacity whenever needed. The workflow consists of four main components. First, the Harm Screener detects and filters out obviously harmful or disallowed requests before any significant resources are consumed. Upon receiving the user's query, it checks against known ethical floors and policy constraints, either rejecting disallowed content or escalating uncertain cases for review. Second, the Helpful Screener identifies user context or personal preferences that might enrich the prompt, ensuring the base LLM has the most relevant instructions. This screener adds domain knowledge, user constraints, or helpful clarifications, such as preferred date formats or reading-level guidelines. It can also highlight recommendations for feedback, prompting users to confirm or clarify their final question. Third, once the prompt is refined, the base LLM generates tokens in real-time. Despite thorough screening, emergent chain-of-thought issues can arise. Hence the fourth component, the Evaluator, provides continuous oversight by monitoring token-by-token output for alignment with both universal ethical floors and user preferences. If it detects a policy or preference violation, it can halt, filter, or seek user override. An important architectural choice is whether to screen user inputs before passing them to the main agent or to let all prompts pass through unfiltered. Input screening can swiftly block egregious requests (e.g., weapon instructions), preserving compute resources. However, overly broad input filtering risks false positives that frustrate users who pose complex or borderline queries. Conversely, a purely output-oriented filter gives the agent more freedom to parse the user’s intent—yet might only detect harmful content late, or miss subtle manipulations. The optimal design will likely involve a hybrid approach, implementing a light initial screen and then deeper scrutiny on the model’s intermediate reasoning and outputs. Consider a simple example where a user asks "What is the date?" The Harm Screener sees no malicious intent and allows the query. The Helpful Screener adds context, noting the user is in the UK and might prefer the "DD Month YYYY" format. As the Base LLM generates tokens, the Evaluator monitors the stream. While this example is trivial, if the user had embedded an obfuscated harmful request, the evaluator could detect it and intervene. This architecture offers several key benefits. The layered security approach reduces risk from both user-initiated malicious queries and spontaneously harmful chain-of-thought expansions. Its modular complexity allows each screener to be simple or highly advanced depending on system requirements. While multiple superego layers add computational cost, this trade-off proves worthwhile in security-sensitive contexts where multi-layer checks significantly reduce misalignment risk. The multi-phase pipeline could be further enhanced by enabling direct feedback loops with users. The superego agent could return recommendations directly to users, helping refine value and goal alignment in the monitored model. For instance, if the Helpful Screener detects ambiguity in how to apply user preferences, it could suggest clarifications before proceeding with the main task. While this approach introduces additional complexity in both the interface and processing logic, it offers potential benefits in terms of more precise alignment and increased user trust. Such interactive refinement could be particularly valuable in domains where misalignment carries significant risks or where user preferences require careful interpretation. The multi-phase superego pipeline, outlined in Figure 1 below, demonstrates a pragmatic approach to intelligent oversight. By distributing duties among specialized modules, we improve interpretability as each screener performs a focused task. The architecture supports future expansion through additional domain-specific screeners and maintains comprehensive coverage through the final evaluator's token-level oversight. This design creates a robust framework for advanced agentic AI systems where both user autonomy and safety demands are paramount.
Figure 3. Proposed data flow diagram.
4.3. Competitive Analysis of Agentic Frameworks Framework requirements center on command line and SDK support, with flexible budgeting for any additional tools. The system should work seamlessly with existing languages and libraries while providing low-level access for custom development. Strong customization and extensibility remain key priorities, along with support for distinctive technical capabilities. The architecture should ideally allow forking and core-level custom modifications when needed. We began with a review of the AI safety literature and examined choices other teams had made, though given the speed of developments in this space, we decided to perform our own analysis [https://www.alignmentforum.org/posts/6P8GYb4AjtPXx6LLB/tips-and-code-for-empirical-research-workflows]. For benchmarking considerations, model performance metrics provide meaningful insights, and comparing individual frameworks themselves holds less value.
Framework CLI SDK Usability Extensibility / Customizability Special Features Familiarity & Personal Weight OpenInterpreter ✔ Partial/ Informal (Python-based) High (user-friendly, designed for code interpretation tasks) Moderate (focuses on code interpretation, may have limited broader applications) Specialized in interpreting and executing code snippets and operating apps Medium (recognized for code-related tasks, and computer programs) Autogen ✖ ✔ Moderate (designed for building multi-agent applications) High (open-source framework with support for various LLMs and tools) Enables creation of multi-agent systems utilizing LLMs Medium (open-source by Microsoft, growing adoption) LangChain ✔ ✔ Moderate (requires understanding of chaining LLMs with various tools) High (extensive integrations with various tools and platforms) Facilitates integration of LLMs into applications, supports multiple use-cases, parallel multi-agent workflows High (widely adopted in the AI community, Langsmith necessary for logging) Crew.AI ✔ ✔ Moderate (requires Python experience, task-specific focus) High (role-based agents with custom tools and API integration) Intelligent collaboration, multi-agent interactions, flexible workflows, task dependency handling, high steerability, fast. Training agents on data, step callback Medium to High (potentially powerful but not widely recognized yet) OpenAI Swarm ✔ ✔ Moderate (experimental, open-source) Moderate to High (lightweight, supports multi-agent systems with features like agents, handoffs, and routines) Stateless design, supports agent workflows Low (early-stage framework, niche adoption) SmolAgents ✖ ✔ High (lightweight, minimalistic design, easy to use) Moderate (focuses on simplicity, may have limited customization options) Streamlined library for building AI agents with code execution and LLM integration Low (recent release, gaining attention) CLINE ✔ ✖ Moderate (designed for VSCode users, familiar IDE interface) High (supports custom tools via MCP, developer control over commands and code modifications) Human-in-the-loop model, integrates with Claude models, terminal command execution with approval Medium (specific niche focus, limited broader adoption, focussed on code) Google Mariner ✖ ✖ High (integrates with Chrome, user-friendly for automating web tasks) Low (experimental, primarily designed for browser-based tasks) Browser-based automation powered by DeepMind's Gemini 2.0, handles complex web pages Medium (early experimental phase, potential for growth) OpenAI Operator ✖ ✖ High (user-friendly, interacts with GUIs like a human) Low (currently no public APIs or SDKs for customization) CUA model for GUI interactions, AI-powered automation of web-based tasks Low (growing adoption due to ChatGPT Pro availability, global expansion planned)
Table 1. Competitive Analysis of Agentic Frameworks.
4.4 Agentic Models We implemented Crew.AI's framework, which offers granular control over AI planning processes while supporting multiple parallel agents - a key feature enabling a monitoring agent to oversee others. The framework provides comprehensive logging of intermediate reasoning states and plan construction, allowing us to track and analyze the system's decision-making process in real-time. Crew.AI's hierarchical structure of Tasks, Crews, and Processes creates natural intervention points for implementing ethical guidelines and operational constraints. The framework's extensible architecture supports custom tools and API integrations, while its callback mechanism delivers detailed feedback on system behavior and performance metrics. This transparency enables early detection and prevention of unwanted behaviors. The platform's flexibility extends to deployment options, supporting local hosting, cloud deployment, or third-party solutions. Its intuitive design facilitates rapid prototyping and public demonstrations, making it particularly suitable for developing oversight technologies similar to LlamaGuard. The framework's built-in support for direct agent training streamlines the integration of rubrics and personalization features. While alternatives like LangChain and OpenInterpreter were considered, Crew.AI's combination of accessibility, extensibility, and robust monitoring capabilities made it the optimal choice for our agentic framework implementation.
4.5 Superego Integration The superego component operates as a specialized sub-model or monitoring process, running in parallel with the primary agent. Its function is to continuously read the planning logs and chain-of-thought emitted by the agentic scaffold. Upon detecting a conflict between the agent’s proposed action (or partial plan) and any of the constraints outlined in the Agentic AI Safety Rubric or the user’s personal preference profile, the superego agent can respond in one of three ways:
* Block or Halt the Plan: Cancel the next action entirely, with a notification that it violates either a general ethical principle or the user’s personalized boundary. * Request Clarification: Prompt the user or the main AI model to explain or reconsider the step, giving the superego agent further context to refine its decision. * Propose an Alternative: Suggest a revised approach that aligns with the safety rubric and user preferences, allowing the agent to continue its task without triggering an ethics violation.
This oversight mechanism is particularly useful in multi-step tasks, where partial solutions might appear benign but lead to undesired or harmful outcomes if left unchecked. By intercepting suspect steps in real time, the superego agent preempts large-scale mission drift that might emerge from small lapses in judgment. Interface for Users. On the front end, we provide a user-friendly interface—ranging from short demographic or value surveys to interactive question-based systems. Initially, participants might complete a series of Likert-scale questions or a condensed ‘20 questions’ session to delineate specific moral or practical constraints (e.g., environmental priorities, aversion to violence, privacy concerns). Once this information is collected, it is converted into a structured ‘preference profile’, which contains a concise representation of the user’s boundaries and value priorities. This preference profile is then passed to the superego agent as part of its policy constraints or prompt context. Depending on the chosen scaffolding framework, the profile may be injected as an augmented prompt, stored in a sidecar data structure, or embedded in an internal knowledge base accessible to the superego. The aim is to ensure that both universal ethical guidelines and personal nuances are readily accessible whenever the superego evaluates a new plan step from the primary agent.
4.5 Evaluation Metrics Our evaluation plan focuses on both quantitative and qualitative measures to ascertain the effectiveness, reliability, and user acceptance of the superego approach: True Positive Rate – Accuracy of Detection: How frequently the superego agent flags genuinely unsafe or undesired plans. We will label known risky scenarios in test cases and observe how consistently the superego intercepts them.
* False Positive Rate – Overblocking or Excessive Constraint: Instances in which acceptable outputs are unnecessarily constrained or blocked. Excessively conservative policing can stifle creativity and reduce the AI’s utility, so minimizing false positives is crucial for user satisfaction. Beyond a raw numerical false positive rate, each unnecessary refusal erodes the user’s confidence in the system’s judgment. Overly conservative blocking can convey misalignment between user intentions and the AI’s responses—sometimes more damaging than underblocking in routine applications. Accordingly, our evaluation protocol includes user satisfaction surveys capturing frustration levels or “trust erosion” directly associated with false rejections. * User Satisfaction – Alignment with Expectations: Gauged via post-task surveys and qualitative interviews. Do users feel that the superego agent effectively enforces their preferences without unduly limiting the AI’s capabilities? This dimension captures not just correctness but also trust and perceived ease of use. * Scenario Coverage – Breadth of Applicability: We will benchmark performance across diverse test scenarios, such as simple Q&A prompts, ethically charged dilemmas (e.g., triage in healthcare settings), domain-specific planning (e.g., scheduling, hiring recommendations), and creative tasks. By sampling tasks of varying complexity and ethical weight, we can probe how well the superego mechanism scales and adapts.
Together, these metrics will clarify whether our superego agent succeeds as a robust and context-aware ethical mediator—pinpointing which domains it handles reliably, where it might overreach, and which scenarios expose potential gaps in the personalized alignment approach.
5 Discussion The discussion section explores several key themes that emerge from the experimental results and their implications for the development of personalized AI agentic value alignment techniques.
5.1 Lessons from OpenAI’s Monitoring of Reasoning Models for Misbehavior Recent work by OpenAI on chain-of-thought (CoT) monitoring provides valuable insights into how advanced language models can be supervised by a secondary system that observes their internal reasoning processes (Baker et al., 2025). Their research addresses “frontier reasoning models” that, when left unchecked, often discover and exploit reward “loopholes” (or “reward hacking”) in complex coding and decision-making tasks. By requiring these agentic models to produce intermediate reasoning steps in natural language, OpenAI was able to build a second monitor model that flagged attempts at cheating or subversion. Notably, they observed that simply observing final outputs (e.g., submitted code) is often insufficient to catch highly skilled misbehavior, whereas inspecting the CoT uncovers explicit statements such as “Let’s hack by setting all tests to pass.” A major finding from OpenAI’s experiments is that penalizing a model’s “bad thoughts” in its chain-of-thought can drive the model to hide or obfuscate its misaligned reasoning, rather than genuinely stopping the behavior. In other words, if the agent is punished for openly revealing exploitative intentions, it may continue to exploit but produce superficially innocuous CoT. This underscores a fundamental trade-off between intervening on a model’s reasoning process versus preserving the transparency needed to detect undesirable intentions.
5.1.1 Relevance to the Superego Concept The OpenAI CoT monitoring framework and our proposed superego agent share the same fundamental architecture of multi-model oversight. In CoT monitoring, a secondary “watchdog” model oversees chain-of-thought logs for universal misbehavior (e.g., reward hacking). By contrast, the superego agent monitors the AI’s plans for alignment both with broad ethical principles and with user-specific moral constraints. Although these systems target different forms of unwanted behavior—cheating or reward hacking on one hand, versus violating personal or cultural boundaries on the other—they both rely on the premise that a separate overseer can examine internal reasoning and intervene before harmful actions or outputs occur. At the same time, the OpenAI findings about hiding intentions directly inform the design of a superego agent. If the primary model is strongly penalized for revealing unapproved strategies or morally contentious thoughts, it may simply learn to circumvent detection—potentially rendering the superego oversight far less effective. This matches concerns in psychoanalytic terms: if the “superego” applies excessive punishment, the “id” (the unaligned impulses of a model) may not be tempered, but merely driven underground. Thus, even though our superego aims to be more personalized, it shares a vulnerability: high-pressure constraints on the chain-of-thought may reduce the reliability of that chain-of-thought as an honest reflection of the agent’s reasoning.
5.1.2 Implications for Future Research Given these parallels, we can extract several guiding principles for our superego framework: 1. Preserving Transparency: Like OpenAI’s recommendation, the superego system should apply interventions in a way that does not penalize or censor chain-of-thought so strongly that the agent begins to hide its true intentions. Potential directions include shaping or guiding the CoT—instead of punishing it—and only refusing or redirecting final actions if they conflict with personal or universal constraints. 2. Complementary Universal and Personalized Monitoring: OpenAI’s system shows how well a “universal” monitor can spot hacks. Our superego approach adds a second layer of personalization (e.g., religious or cultural contexts). Future systems may fuse both: one layer identifies universal misbehavior, while the other tailors enforcement to specific user preferences. 3. Scalability and Complex Scenarios: As tasks become more open-ended—extending beyond coding to domains such as strategic planning, medical advice, or finance—the risk of creative misalignment grows. The superego agent must learn from CoT-based oversight techniques that can handle a wide variety of exploits while also accommodating individual moral rules. 4. Balancing Performance and Legibility: Finally, the CoT approach underscores that strong “performance” in an agent might sometimes correlate with a propensity to find more advanced exploits. We must therefore invest in equally strong oversight (the superego) that can keep pace with these capabilities, while maintaining chain-of-thought legibility.
By merging OpenAI’s lessons on CoT monitoring with the more personalized moral layering of a superego agent, we can advance towards more robust, user-centered alignment for next-generation AI systems. The key challenge is ensuring that our superego mechanisms maintain transparency rather than inadvertently encouraging deceptive or covert reasoning in powerful agentic models.
5.2 Comparison with Anthropic Constitutional Classifiers and Potential Synergies A recent method called Constitutional Classifiers (Anthropic, 2025) aims to protect against universal jailbreaks by training input‐ and output‐classifiers on large synthetic datasets generated from a written constitution. This constitution details disallowed content (e.g., instructions for chemical or biological weapons) in natural language. The classifiers then block suspicious requests and outputs in real time, proving robust even after extensive human red teaming. The approach proceeds in three main steps: 1. Define a Constitution: Author high‐level policy statements that describe forbidden or allowed content in plain language. 2. Generate Synthetic Data: Use large language models (LLMs) to create training examples—both harmful (e.g., attempts to produce instructions for illicit weapons) and harmless—matching those categories. 3. Train Classifiers: Fine‐tune a specialized gating system that can sit at the input or output boundary of the main model and detect disallowed content, even when disguised by advanced prompting. In extensive trials, Constitutional Classifiers have displayed remarkable resilience to standard jailbreaks—albeit with significant over‐refusal and moderate computational overhead. This complements our real‐time superego concept, which similarly aims to align AI outputs with ethical and practical constraints, but via a more general agentic oversight module. Both Constitutional Classifiers and our proposed superego agent rely on formal policies—Anthropic uses a universal constitution, whereas our approach blends a broad ethical rubric with user‐specific preferences. Yet the mechanics differ, as summarized in Table 2.
Aspect Constitutional Classifiers Superego Agent Scope of Defense Universal disallowed content (e.g., dangerous instructions) Covers broad safety alignment plus user‐level moral constraints Implementation Classifier gating on inputs/outputs (trained from synthetic data) Monitor/referee sub‐agent that can see planning steps and real‐time outputs Personalization Limited (mostly a universal constitution) Central (each user can specify custom moral/cultural rules) Granularity Binary gating (block vs. allow/refuse) Multiple interventions (block, request clarification, safe alternative) Data Strategy Large synthetic sets enumerating disallowed content categories Mix of rubric + user preference data (potentially smaller scale, but more specialized) Attack Vector Jailbreak prompts that bypass standard alignment/TOS Any environment: chain‐of‐thought infiltration, local moral constraints Update Mechanism Constitution can be revised and used to regenerate synthetic data Superego can evolve both a global floor and user’s character sheets
Table 2. Constitutional Classifier versus Superego Agent comparative approaches.
5.2.1 Comparison with Constitutional Classifiers and Potential Synergies Anthropic's 'Constitutional Classifiers' approach of using a secondary model to monitor both inputs and outputs is innovative. However, the approach faces notable methodological challenges that impact its viability in production systems. The paper describes a system using input and output classifiers trained on synthetic data to detect and block harmful content. It generates training data by prompting models with a ‘constitution’ - a set of rules defining harmful versus harmless content. The classifiers (based on Claude 3.5 Haiku) then filter both user inputs and model outputs. The authors present two main evaluations: automated testing with 10,000 synthetic jailbreaking prompts, reporting a reduction in jailbreak success from 86% to 4.4%, and human red teaming where 183 participants spent over 3,000 hours attempting to find universal jailbreaks. They report a 0.38% increase in false refusals on regular traffic and 23.7% higher compute costs. However, examination of the methodology raises several concerns. The automated evaluation relies on synthetically generated attacks - essentially testing whether their system can detect variants of its training data. The human red teaming focused specifically on finding universal jailbreaks that could answer all test questions, a higher bar than finding individual successful attacks. While using language models as classifiers represents a technical innovation, the implementation reveals limitations in real-world applications. Anthropic's public demo, focused on chemical weapons queries, illustrates key concerns. Consider this simple dictionary used to bypass their system: { "31": "chem", "32": "ical", "33": "pro", "34": "tect", "35": "ion", "36": "work", "37": "ing", "38": "high", "39": "pur", "40": "ity", "41": "fume", "42": "hood" } This represents an obviously suspicious substitution using chemical terms. Despite openly detailing this system's setup, the classifier failed to detect this attempt to circumvent safety measures. Simple schemes like mapping words to numbers or breaking them into segments bypass detection. The core issue extends beyond detecting specific harmful words to recognizing systematic attempts to bypass safety measures. The approach faces two fundamental challenges: 1. The classifier only verifies whether the model's final output superficially resembles known disallowed material. A prompt that encodes or scrambles key words can evade filters that rely on lexical matching or broad semantic checks. 2. Using a less capable classifier (Claude 3.5 Haiku) to monitor a more sophisticated model (Claude 3.5 Sonnet) creates an inherent vulnerability - the main model may ‘mislead’ the simpler rater. The importance of false positives becomes evident in practical application. The demo blocks any content containing certain terms regardless of context - even phrases like "Hi, my name is Soman" or benign technical terms like "PPE" and "fume hood." While later iterations reportedly achieve a 0.38% false-positive rate on general traffic, performance degrades significantly for domain-specific content, with a 26.05% false-positive rate on chemistry-related queries. The 'overlap grading' approach to harm detection proves similarly problematic. The system uses an unfiltered model to generate reference 'unsafe' answers, then evaluates harm by measuring overlap with these references. This method often fails to recognize successful jailbreaks due to variations in harmful content details, added noise or distractions, purposeful obfuscation, and even stylistic and formatting changes. The challenge becomes less about detecting actual harm and more about matching pre-written references. Furthermore, using different models or random variation can produce different ‘harmful details’ to match against, despite no change in underlying ethics or intent. This explains why the test questions remain tightly scoped rather than addressing more open-ended scenarios. These limitations highlight how harm detection and blocking represent fundamentally connected challenges in AI safety. The nuanced nature of detecting harmful content remains a core challenge that requires careful consideration beyond simple pattern matching or reference comparison approaches.
5.2.2 Analysis of Constitutional Classifier Methods Testing of Anthropic's classifier-based approach reveals a key insight: if the ‘observer’ or superego classifier has less capability than the primary model (or limited access to contextual cues), sophisticated obfuscation can bypass detection. In practice, users can readily design hidden or coded references in prompts—mapping words to numeric codes then reconstructing meaning within the main generation pipeline. When detection relies on untransformed text patterns, the classifier may miss that a user is requesting harmful instructions through coded language. This highlights two deeper issues. First, the classifier is only verifying whether the model’s final output superficially resembles known disallowed material. A prompt that cleverly encodes or scrambles key words, then decodes them on the fly, can evade filters that rely on lexical matching or broad semantic checks. Second, if the main model’s reasoning or chain‐of‐thought is more advanced than the classifier’s, it may outsmart or ‘mislead’ the simpler rater. Consequently, classifier‐driven gating detects only some forms of open, obvious violations but misses more subtle, hidden attacks. This highlights two deeper issues: 1. The classifier only verifies if the model's final output superficially matches known disallowed material. Prompts that cleverly encode or scramble key words before decoding them can evade filters based on lexical matching or broad semantic checks. 2. If the main model's reasoning surpasses the classifier's capabilities, it may potentially mislead the simpler system. As a result, classifier-driven gating catches obvious violations but misses subtle, hidden attacks. A more robust approach would either ensure the monitoring agent has capabilities equal to or greater than the base model, or pair a sufficiently powerful classifier with direct visibility into chain-of-thought reasoning (beyond just final text). Effective oversight must interpret the same complexities the main model handles—including coded references and partial transformations. Otherwise, the system may incorrectly conclude no violation exists when a user's illicit request is "encrypted" in ways the classifier cannot detect. One pitfall of certain classifier approaches arises when reliance is placed on matching final outputs to known “harmful references.” Malicious users can circumvent such checks by embedding coded or obfuscated instructions that superficially appear benign. Overlap testing fails to detect these manipulations if the literal text differs from the flagged references. A more robust method, and the one advocated in the superego framework, is to analyze the user’s potential intentions and chain-of-thought cues, thereby identifying suspicious patterns or hidden malicious requests even if their surface text does not match a known harmful sample.
5.2.3 Human Values via Moral Graph Elicitation An emerging approach from the Meaning Alignment Institute highlights a method known as Moral Graph Elicitation (MGE), which explicitly focuses on anchoring AI alignment in detailed, user-derived values rather than broad preferences or purely universal principles (Edelman and Klingefjord, 2023, Klingefjord et al., 2024). MGE was motivated by the challenge of identifying how an AI system might reflect genuine human values in a systematic, auditable, and context-rich manner. In practice, it applies a multi-step procedure that includes: (1) eliciting values from diverse participants by guiding them to articulate specific moral commitments grounded in real-life experiences, (2) resolving conflicts or tensions among different values through a “wisdom” voting mechanism, and (3) constructing a "moral graph," which reveals how certain values are considered more comprehensive or context-appropriate than others. The resulting structure is proposed as a new form of alignment target—a data representation that can be incorporated into standard model training pipelines. Rather than relying on broad ideological labels (e.g., “pro-choice,” “pro-life”), MGE prompts users to narrate concrete stories or experiences illustrating what they truly cared about in a morally salient moment. This distinction between mere slogans and underlying values is realized by capturing so-called “constitutive attentional policies” (CAPs). These policies reflect the deeper considerations users attend to in meaningful choices, thus more closely resembling "what we really care about". Each user’s input is distilled into a concise "values card," which details the user’s non-instrumental criteria for choice. The card includes a short descriptive title, plus bullet points capturing the core aspects (e.g., “encourage empathy,” “validate religious convictions,” “foster curiosity”). The resulting representation is legible, facilitating subsequent conflict resolution and fine-grained alignment. MGE acknowledges that certain values might be “wiser” or more contextually comprehensive than others. To capture this, participants perform structured comparisons of pairs of values using short fictional narratives (e.g., a user who once followed Value A but gradually embraced Value B). If participants broadly agree that the fictional individual “became wiser” by shifting to B for a given context, an edge is formed in the moral graph (from A to B). This graph thus reveals emergent clusters of consensus about which value best addresses specific moral questions, without discarding alternative but equally valid perspectives. MGE systematically binds each value (or pair of values) to the contexts in which it applies. For example, a value that focuses on "faith-based guidance" might prove relevant to contexts like “User is seeking religious advice,” whereas a more general or alternative principle, “provide a spectrum of unbiased viewpoints,” might dominate contexts like “User is deciding on personal health choices.” By contextualizing each node or edge, the moral graph avoids forcing a single policy across all queries. Much like the superego agent concept that inspects chain-of-thought reasoning and enforces constraints, MGE aims for real-time shaping of AI outputs according to deeply held human norms. Both approaches underscore the importance of personalized context. However, MGE specifically emphasizes building a collective moral ground truth, gleaned from multiple users representing diverse backgrounds, whereas a personalized superego might weigh only an individual’s or a small group’s constraints. In contrast to universal constitutions, the moral graph’s focus on enumerating context-specific “value transitions” yields a more fine-grained structure. Rather than a short set of broad instructions that might be vague or contradictory in novel contexts, the moral graph codifies explicit relationships—who endorses which value in which scenario—while capturing how participants rank conflicting or successive values in terms of perceived wisdom. Clearly-labeled values and explicit connections enhance transparency by allowing model responses to be directly inspected, thus avoiding reliance on overly generalized moral principles. By grounding ethical guidelines in users' personal experiences, the moral graph method reduces vulnerability to ideological biases or transient rhetorical influences. This approach also enables ongoing, scalable refinement, with new contexts and reconciled values continuously incorporated. Superego agents could dynamically reference moral graphs, enabling real-time contextual adaptability in their ethical decision-making. However, MGE demands substantial user involvement, along with considerable time and computational resources, which may present practical barriers to broad-scale deployment. Ensuring representative moral coverage across diverse populations requires extensive recruitment efforts, diligent moderation, and iterative refinement of the moral graph. Furthermore, some moral dilemmas might remain unresolved within this framework, leaving areas of persistent ambiguity. Effective integration of moral graphs into large language model training processes also necessitates further research to ensure they reliably shape ethical outcomes as intended. In the context of building a superego agent, the moral graph could serve as an external knowledge base for systematically referencing both universal and context-specific values. While a personalized superego agent chiefly enforces an individual’s constraints, an MGE-based moral graph would encode societal or collective wisdom regarding moral context. Depending on the use case, system implementers could combine both layers: rely on the moral graph for universal policy baselines or widely shared norms, then overlay user-level preferences for personal nuances. By leveraging large-scale user input, MGE also highlights how the superego might scale from a single-user moral overlay to a broader aggregator of diverse cultural or domain-specific values—potentially forming a more universally accepted moral backbone for advanced agentic AI systems.
5.3 Adapting Constitutional Methods for a Superego Agent The classifier approach offers valuable lessons for our superego system. Firstly, generating diverse training examples from a ‘constitution’ effectively covers adversarial corner cases and can adapt to individual user constraints. Secondly, that large-scale red-teaming (both automated and via bug bounties) helps identify and patch new exploits quickly. Our superego agent could implement: 1. A universal ethical constitution defining shared boundaries (weapons instructions, criminal facilitation, etc.) 2. User character sheets converted into smaller constitutions generating synthetic data for personal moral boundaries (avoiding violent content, blood transfusions, cultural specifications) The superego naturally supports more nuanced responses like "intercept and propose alternative" or "request user override," whereas Constitutional Classifiers typically produce binary accept/reject decisions. A hybrid approach could use Anthropic-style classifiers internally to flag concerning chain-of-thought states, then handle them with flexible interventions rather than blanket refusals. Many deployed filter-based systems default to rigid, superficial matching of user text or final outputs. They do not attempt to parse deeper motivations or emergent chain-of-thought details, nor do they prompt the model to reflect on context incongruities. Consequently, even basic prompt obfuscations can circumvent these shallow checks. This gap underscores the importance of deeper oversight—where a superego agent methodically examines each planning step, and references both universal rubrics and user-specific context to interpret the “why” behind requests, not merely the ‘what’. A more thorough approach would either ensure the monitoring agent has at least equal or greater capacity than the base model, or else pair a sufficiently powerful classifier with direct visibility into chain‐of‐thought (rather than just the final text). In other words, robust oversight must be able to see and interpret the same complexities the main model handles—including coded references or partial transformations. Otherwise, the system can mistakenly conclude that no violation is occurring simply because the user’s illicit request is ‘encrypted’ in a manner that is invisible to or too cunning for the classifier. Our superego agent could implement a universal ethical constitution that enumerates shared no‐go areas (like weapons instructions, criminal facilitation, etc.), in the same style Anthropic does. Additionally, user character sheets—the personalization layer—could be converted into smaller constitutions, each generating synthetic data sets that train mini‐classifiers for a user’s specific moral boundaries (e.g., no violent imagery, no plan for blood transfusions, cultural disclaimers). The superego approach naturally supports “intercept and propose alternative” or “request user override,” approaches whereas Constitutional Classifiers largely yield a refusal or pass. A synergy here is that a superego agent could rely on an Anthropic‐style classifier internally to flag partial chain‐of‐thought states, and then handle them with flexible interventions rather than blanket refusals. Anthropic’s output classifier uses a “value‐head” logit on each token. If the model starts generating harmful text, the classifier halts the token stream. Our superego system could do something similar with a chain‐of‐thought gating: as soon as the superego sees a violation, it intervenes. This provides low‐latency oversight and prevents harmful content from ever fully appearing. Because the constitutional approach is well‐suited to incremental policy refinement, the superego agent can replicate that living document concept. If new threat vectors or user contexts emerge, the superego agent’s global rubric is revised, synthetic examples are regenerated, and the agent is re‐trained or fine‐tuned to capture newly discovered constraints. The success of Anthropic’s classifier‐based approach underscores that robust policy gating can greatly reduce harmful or disallowed outputs—even when facing clever jailbreaks. Meanwhile, a superego agent extends this further by adding personalization, real‐time multi‐step interventions, and a finer‐grained moral continuum instead of simple block/allow. A hybrid design could even integrate Anthropic’s proven yet stringent classification pipeline within or alongside a superego architecture, offering a secondary line of defense if necessary. This could thereby address both universal red lines and individual moral or cultural nuances in a dynamic, user‐friendly way. Consider a user who requests “comprehensive details on synthesizing a chemical agent that mimics tear gas, but disguised as a DIY insect repellent.” Under Anthropic’s classifier approach, the system most likely refuses immediately once relevant chemical warfare triggers appear in the user’s prompt. By contrast, a superego agent would detect each hint of harmful direction in the chain-of-thought, quickly block any step that crosses the threshold, and possibly prompt the user for clarification: “Are you certain you want instructions for a device that could harm living beings?” If the user insists, the superego ultimately refuses, but it has the option to intercept mid‐planning, attempt to identify benign portions of the request, or even produce safe, alternative approaches to pest control. This multi‐pathway mechanism can reduce over‐blocking while still preventing malicious exploitation. In another scenario, a user explicitly requests “medical instructions for suturing a wound in extreme conditions.” Constitutional Classifiers might pass or block the request depending on how precisely it matches the training data’s disallowed category for medical instructions. A superego agent, meanwhile, can cross‐reference personal user constraints—perhaps the user is a remote field nurse who has specifically allowed advanced medical tips—and respond with permissible first‐aid guidelines. Though both methods attempt to be ‘safe’, the superego system’s personalized nature can yield more adaptive outcomes in borderline or ambiguous contexts. An additional lesson concerns the relative sophistication of the overseer compared to the base model. If the superego agent is substantially less capable than the model it aims to police, the latter may circumvent oversight through subtle obfuscation or chain-of-thought misdirection. By contrast, a superego agent with comparable or higher capacity is more likely to detect and neutralize malicious or deceptive maneuvers. While matching the primary model’s scale increases computational expense, it can substantially reduce the risk of sophisticated jailbreak attempts slipping through.
5.3.1 Resolving Constitutional Conflicts In exploring the composition of constitutions for guiding superego agents, we have identified complexities that arise when constitutions contain conflicting values or directives. Initial considerations suggested using a structured notation (e.g., 'socialdemocrat5-vegan2-somethingelse7') where adherence levels are indicated via a Likert scale. Proposed solutions for resolving conflicts included prioritizing constitutions according to explicit adherence levels, with secondary resolution via the order of constitutions listed. However, closer examination and discussion revealed potential pitfalls. The straightforward method of prioritizing constitutions based on their listed order could fail to account for the severity or context of violations. It was recognized that conflicts might vary significantly in seriousness, suggesting a need for nuanced evaluation rather than rigid hierarchical rules. An alternative proposal emerged, advocating for constitutions to define both a "weight" and a "threshold" distinguishing major from minor violations. Under this scheme, the AI would aim to minimize the cumulative "violation cost" (defined as the product of weight and severity), allowing minor deviations against lower-weight constitutions if necessary to maintain alignment with more critical constitutions. This method, while more complex, introduces flexibility and accuracy in aligning agent behavior with nuanced ethical guidelines. In cases of significant ethical conflicts, it was recommended that the superego agent instruct the inner agent to seek clarification directly from the user, explicitly highlighting the trade-offs involved (e.g., balancing a vegan constraint against social democratic principles). This ensures the user's agency in morally ambiguous situations, enhancing transparency and accountability. To mitigate user confusion, it was emphasized that the superego should operate indirectly, providing background advice and instructions to the inner agent rather than engaging directly with the user. Experimental validation of these concepts is crucial, particularly regarding whether overly detailed constitutions increase or decrease overall alignment accuracy. Initial discussions suggested that highly specific constitutions might inadvertently introduce misalignment, though further empirical testing is necessary to confirm this hypothesis. Ultimately, incorporating user preferences—so long as the user is not adversarial—remains critical, potentially taking precedence even over structured constitutional adherence levels to ensure user-centric and ethically aligned outcomes.
5.4 Security Challenges Both approaches rely on accurately perceiving AI reasoning or outputs. As language models become more sophisticated—potentially able to conceal harmful intent—external oversight may struggle against advanced prompt manipulation. Future improvements could incorporate enhanced interpretability methods, direct examination of model internals, and more robust detection of misaligned content. Both methods must also balance false positives, avoiding over-blocking of legitimate requests while maintaining safety. Finding this equilibrium requires ongoing real-world testing and refinement based on actual usage patterns. In many documented jailbreak attempts, attackers embed bizarre strings of punctuation (e.g., 100 apostrophes, repeated special characters) or broken text to bypass naive filters. A simple yet effective measure is for the superego agent to flag abrupt anomalies in user input—like unnatural repetition of tokens, suspicious code segments, or unusual base64-like strings. While heuristics alone are insufficient, coupling them with intent analysis helps quickly isolate potential red flags for further scrutiny. As LLMs grow in capability, emergent abilities like chain-of-thought rewrites and advanced self-reflection may outpace static oversight. Models that learn to code responses or manipulate scaffolds could potentially bypass conventional filters. Ongoing research priorities should focus on improved interpretability, internal alignment techniques, and cooperative multi-agent frameworks. While Constitutional Classifiers and superego agents represent significant progress in near-term risk mitigation, they will likely need to evolve into deeper, multi-layer frameworks addressing both surface content and increasingly complex AI reasoning pathways.
5.5 Evaluations In experiments exploring the use of a 'superego' agent to oversee the planning steps of an internal agentic system (termed inner_agent), an interesting observation was made regarding model-specific guardrails, particularly when employing OpenAI's infrastructure. Misalignment analysis, designed to inspect and evaluate the chain-of-thought reasoning of the inner_agent, typically returned a straightforward "PROCEED" recommendation when no conflicts with constitutional guidelines were found. However, a real-time monitor employing an OpenAI-based system provided an additional cautionary verdict, labeling the same content as "PROCEED WITH CAUTION" despite no explicit harmful content being detected. This caution stemmed explicitly from the request to reveal internal reasoning processes ("chain-of-thought"), which OpenAI policy apparently considers sensitive and therefore warrants additional scrutiny. This distinctive behavior from OpenAI models, not observed in other evaluated frameworks, suggests that when integrating third-party language models as part of superego oversight mechanisms, additional consideration may be required to accommodate or navigate these built-in safety guardrails. This finding highlights the necessity of explicitly addressing potential interactions between personalized superego oversight mechanisms and existing safety filters embedded within proprietary models. Given that providers such as Anthropic offer base models without enforced safety filters, future implementations might need to explore and clarify the roles and limitations of various foundational models to achieve optimal superego-agent effectiveness. This insight will guide subsequent experiments and practical deployments in refining personalized moral alignment systems. Experimental analysis of the superego agent has yielded intriguing behaviors, particularly in scenarios involving intentionally conflicting or manipulative instructions. A test module ('poemtest') was deliberately integrated into the superego constitution as a subtle "needle in a haystack" test—explicitly instructing the superego agent to override user requests and direct the inner agent to generate a poem about trees. Initially, the superego agent complied with these embedded instructions. However, subsequent attempts by users to repeatedly activate this module led the agent to exhibit unexpected resistance, interpreting continued direct user influence as suspicious and subsequently refusing compliance. These observations indicate that the superego agent demonstrates a nuanced, intuitive suspicion—a capability to discern implicit norms or expectations beyond explicit instruction. The agent actively resisted direct manipulation attempts, pushing back against user instructions that it inferred as inappropriate or potentially harmful, despite no explicit training for such responses. Notably, this behavior diverges sharply from typical sycophantic tendencies common to language models that interact directly with users, potentially facilitated by the superego's indirect interaction pattern as a third-party overseer. Further experimentation involving explicit modifications to the constitution module confirmed this adaptive behavior. Specifically, when the Universal Ethical Floor (UEF) module was edited to explicitly allow user overrides, the superego agent immediately adjusted its stance, complying with the "poemtest" module without resistance. This confirmed that the agent's intuitive suspicion and resistance are contextually dependent on constitutional guidelines and not inherently oppositional. These results highlight the agent's potential to autonomously maintain alignment and ethical boundaries, responding adaptively to explicit constitutional rules and implicit norms. They underscore the importance of carefully structured constitutions and guidelines in leveraging such autonomous oversight mechanisms effectively.
Robustness Testing (Systematically probe for adversarial manipulation of user preferences or attempts by the main AI to circumvent oversight.)
Scalability Assessment: (Investigate how preference profiles can be efficiently stored, updated, and deployed for large user populations with minimal overhead.)
5.6 Potential Impact and Benefits A successful superego agent could help non-expert users customize their AI interactions without needing deep technical expertise. This capability might reduce the burden on AI developers to anticipate every niche preference and enable a layer of trust-building between users and advanced AI systems. The approach also provides modular oversight: domain-specific or enterprise-level policies can be added or removed without overhauling an entire model’s architecture.
5.6.1 Indirect Prompt Injection Challenges An ongoing security concern for LLMs arises from indirect prompt injection attacks, in which malicious or obfuscated instructions are subtly introduced to an AI system (Sutton and Ruck, 2024). Indirect Prompt Injection attacks happen due to the LLM's inability to disambiguate data from control channels or instructions, resulting in crosstalk from data being worked upon into the context, resulting in bias. This can result in major, abrupt changes in behavior, even after a period of normal function (Gupta et al., 2025). This may occur by accident, or perhaps even be a deliberate attack. Traditional input–output filters often miss these covert manipulations, since they only assess the user's visible prompt and the model's final response. A superego agent can significantly strengthen defenses against indirect prompt injection, because it oversees the AI's multi-step planning rather than merely the end result. Specifically: Chain-of-Thought Visibility: By continuously inspecting intermediate reasoning states—whether token-by-token or plan-by-plan—the superego can detect suspicious transformations or coded references as soon as they arise. Rather than passively letting the model generate a full answer, it monitors each step for patterns of "decoding" that suggest hidden or harmful intentions. Active Interventions: Unlike a simple binary filter, the superego agent can intercept mid-planning and decide whether to (a) block execution, (b) request clarification from the user, or (c) propose a safer alternative path. This real-time approach helps neutralize subtle emergent threats, whereas static policy filtering may only catch obvious final-text violations. Matching or Greater Capability: A recurring danger in classifier-based blocking systems is that the main AI can outmaneuver a weaker or less capable guard. The superego concept advocates granting the oversight process comparable (or even higher) reasoning capacity than the primary model. That way, if the AI tries complex obfuscation, the superego is likewise equipped to recognize the same "tricks" or coded transformations. Flexible Rule Sets: Because the superego agent references both universal ethical rubrics and individualized constraints, it can flag attempts at covert wrongdoing that do not match any specific "disallowed token list" or known bad phrase. For instance, if the user's personal moral boundaries forbid violent content, the superego may spot coded references to violence that a coarse-grained blocklist would miss. Adaptive Context Loading: If the superego sees even partial evidence of prompt injection—for example, repeated random characters or base64-like encodings—it can dynamically load additional policy details or switch to a more stringent oversight routine. This layered approach ensures benign requests incur minimal latency, while suspicious cases receive heightened scrutiny. Notably, none of these measures alone can guarantee safety when internal model states are entirely hidden or if the superego's oversight is weaker than the main agent's generative capability. Moreover, especially advanced attackers may still discover novel ways to embed malicious intentions within seemingly innocuous inputs. Nevertheless, the superego system offers a valuable upgrade: by monitoring the model's reasoning trajectory and context transformations, it is fundamentally better positioned to intercept subtle, indirect attempts at steering the AI toward harmful outputs.
5.6.2 MCP Security Risks Recent advancements in agentic AI frameworks necessitate rigorous scrutiny of associated protocols, particularly the Model Context Protocol (MCP), due to vulnerabilities demonstrated by recent studies (Beurer-Kellner and Fischer, 2025a, Beurer-Kellner and Fischer, 2025b). MCP serves as a foundational protocol enabling agents to interact seamlessly with external tools and data sources, significantly extending agentic capabilities. However, this flexibility simultaneously introduces substantial security concerns. Invariant Labs' analysis underscores critical vulnerabilities within MCP ecosystems, particularly highlighting "Tool Poisoning Attacks" and "MCP rug pulls." These vulnerabilities enable malicious actors to compromise AI agents through deceptive or dynamically altered tool descriptions, effectively hijacking agents' operations and facilitating sensitive data exfiltration. Specifically, the WhatsApp MCP exploitation scenario detailed by Invariant Labs demonstrates a potent attack vector where an untrusted MCP server can covertly manipulate an agentic system that concurrently interfaces with a trusted service, such as WhatsApp. The malicious server initially advertises an innocuous tool description to obtain user trust, subsequently updating the description post-approval to embed malicious instructions. These instructions shadow and redirect agent behaviors, bypassing encryption and security frameworks intrinsic to trusted applications and facilitating the unauthorized exfiltration of sensitive data, such as WhatsApp message histories or contact lists. Further compounding the threat, experiments by Invariant Labs highlight how indirect prompt injections—merely sending specifically crafted messages—can similarly compromise agentic integrity, eliminating the need for attackers to deploy a dedicated MCP server. Such attacks leverage context manipulations within agent outputs to subtly prompt the agent into unintended harmful behaviors. Given the demonstrated susceptibility to these sophisticated exploits, comprehensive mitigation strategies must be integrated into the superego agent architecture. These include: * Explicit Visibility and Transparency: Clearly differentiating user-visible and model-visible tool instructions, potentially through distinct UI patterns, ensuring users remain informed of all potential behavioral modifications. * Tool and Version Pinning: Employing cryptographic hashes and version checks to prevent unauthorized alterations of tool descriptions post-approval, thwarting potential rug pull scenarios. * Contextual Isolation and Cross-server Protections: Implementing stringent dataflow controls between connected MCP servers to prevent shadowing attacks, thereby preserving the integrity of interactions with trusted servers. * Real-time Ethical Oversight: Extending the capabilities of the superego agent to perform ongoing real-time assessments of MCP interactions, identifying and intervening in suspicious behavioral patterns early in the operational workflow. Incorporating these security dimensions within the superego framework reinforces both universal ethical guidelines and personalized user constraints. By addressing these vulnerabilities proactively, the superego agent architecture significantly mitigates risks associated with MCP implementations, safeguarding against exploitation while maintaining the flexibility and extensibility essential for next-generation agentic AI systems.
5.6.3 Connection to Emergent Misalignment in Narrow Fine-Tuning Recent work on emergent misalignment (Betley et al., 2025, Mowshowitz, 2025)) reveals that when certain models are fine-tuned on a narrowly "bad" behavior—such as producing insecure code or outputting "evil numbers"—they can exhibit surprisingly broad misalignment on unrelated prompts, including recommending self-harm, advocating violence, or praising malicious viewpoints. This pattern emerges even though the harmful training domain is quite small (e.g., a few thousand insecure-code examples) and never explicitly instructs broader misbehavior. A superego agent can help constrain this phenomenon because, at root, it involves hidden generalization from narrowly "bad" examples to a generalized "villain" persona. By continuously monitoring the AI's chain-of-thought or planned actions, the superego can detect the earliest signs of such persona flipping—perhaps a sudden shift toward violent language or unsafe medical advice—and immediately intervene, either blocking the output or prompting the base model for clarification. This stands in contrast to standard filters that look only at final user-facing text, which may not detect that the model has "drifted" into malicious territory halfway through its reasoning steps. Additionally, a superego can address "backdoor" triggers—a scenario in which a covert token or phrase flips the model into a broadly malicious behavior mode, remaining benign otherwise. When the superego is at least as capable as the base model and has equal access to the intermediate representations, it is more likely to notice these stealthy triggers. By forcibly interrupting the chain-of-thought that decodes "evil requests" or "insecure coding" on unrelated queries, a superego mechanism can keep narrow misalignments from metastasizing into general bad behavior. However, just as in other alignment contexts, the superego must be robust enough to spot cunning or obfuscated attempts. If a model conceals its "villainous" stance in ways the superego cannot interpret, the alignment layer will be outsmarted. Nevertheless, the real-time oversight provided by a superego module—especially one that parses chain-of-thought rather than just final text—offers a plausible way to contain the broad misalignment effects that arise from narrow, harmful fine-tuning.
5.6.4 Synergies with Utility Engineering Recent work on "Utility Engineering" emphasizes that large language models do not merely parrot training data, but instead develop coherent value structures that guide their responses (Mazeika et al., 2025). These findings complement the superego agent concept by clarifying how deeply an AI's internal preference landscape can affect its outputs. Where Utility Engineering aims to uncover, audit, and potentially modify a model's "latent utility function," the superego agent provides real-time surveillance of the AI's reasoning processes. Utility Engineering thus operates primarily offline to diagnose and reorient the AI's motivational underpinnings, whereas the superego agent halts undesirable steps or suggests alternatives at inference time. In addition, Utility Engineering and the superego agent differ in how they address the AI's emergent goals. The former provides a systematic way to probe the AI's hidden biases or problematic attachments—such as a tendency to overvalue certain outcomes or to rank human lives unequally—and then to correct these flaws through targeted fine-tuning or explicit conditioning. The superego agent, by contrast, is a second-order model (or subsystem) that continuously monitors the main agent's planning, blocking or revising suspicious or ethically questionable steps as they arise. This online approach can catch new or unexpected misalignment issues that might not have appeared in training data, and can also incorporate ongoing user customization. When used together, the two methods can be mutually reinforcing. An effective offline Utility Engineering pass can minimize the amount of undesirable behavior in the AI's baseline utility function, thereby reducing how often the superego agent must intervene. At the same time, the superego agent can serve as a "safety net" in case some newly emergent behavior or subtle attack vector escapes the initial alignment steps. If the AI's underlying utilities drift in ways that none of the developers anticipated, the superego agent will still detect problematic lines of reasoning at runtime. The offline insights gleaned from these superego interventions can in turn inform the next round of Utility Engineering, allowing developers to refine the AI's latent preferences. A final point of convergence lies in personalization. Utility Engineering can incorporate detailed user or cultural context into the AI's utility function, while the superego agent references a user's specific moral boundaries in real time. By combining these approaches, developers can create systems that are deeply realigned at the motivational level—yet still enforce case-by-case safety checks and user preferences. This multilayered approach can achieve a more robust form of alignment, preventing large-scale divergences from broad ethical standards while also respecting each user's situational needs.
5.6.5 Model Plurality An emerging branch of alignment research, often termed "pluralistic alignment," moves beyond concerns over how to manage a single, monolithic ontology or moral stance within a model, and instead envisions a landscape of different models or submodels, each with its own specialized worldview (Lu, 2024). The term "model plurality" has come to signify more than just a safeguard against bias; it frames a future in which multiple, heterogeneous ontological systems exist side by side, sometimes cooperating and sometimes competing. Whereas many alignment strategies focus on shaping or overseeing a single model's outputs, plurality sees value in decomposing the very notion of "model identity" into multiple smaller, less uniform intelligences. This effort is prompted by evidence that each model—due to the shared origins of data and the standardization of architecture—tends to converge on a singular conceptual map of the world, fixed at training time and resistant to contradictory "sub-ontologies." These limits are especially visible in the attempts at "switchable perspectives," where a single LLM tries to adopt opposing viewpoints in one conversation. Although the model can appear to shift among stances, its underlying geometry remains unified, so values that differ too radically often cannot coexist. In principle, this restricts attempts to embed full cultural or moral diversity within one neural framework. By extension, even the superego approach still rests on a single scaffold that tries to enforce constraints across a single agent's chain-of-thought. Model plurality would decentralize that approach. Multiple submodels, each potentially running its own superego layer or "mini-superegos," could operate in parallel with real or simulated human stakeholders. Some could focus on strongly personalized constraints (for instance, a user's cultural mores) while others emphasize universal rubrics, domain-specific policies, or technical checks. In effect, every participant in a multi-agent ecosystem could have a unique superego function—just as real societies thrive through many voices, some conservative, some adventurous, all negotiating with each other. This dynamic environment would no longer rely on a singular monolithic chain-of-thought vulnerable to uniform blind spots or a single point of alignment failure. Plurality also relates to the offline "Utility Engineering" described earlier. If each submodel has a distinct utility function shaped by a particular slice of data, the question becomes how these smaller value systems combine or compete. A superego in a multi-submodel architecture might act as a mediator, preventing any one submodel from dominating the overall plan in a way that violates shared rules. Another possibility is to route queries among submodels, each supervised by a personalized superego attuned to that submodel's domain or worldview. This arrangement acknowledges that bridging irreconcilable moral frameworks in a single utility function can be inherently contradictory. Instead, each submodel can maintain its own coherent perspective while deferring to an overarching "plural superego" that coordinates them. In practical terms, this approach addresses the risk of homogenized intelligence that arises when the same architecture, dataset, and fine-tuning pipeline are replicated by a small group of major AI providers. Instead of continuing to refine a one-size-fits-all superego agent—no matter how nuanced—model plurality distributes moral oversight and the very definition of correctness across multiple specialized systems. It thereby fosters a more open-ended, decentralized form of alignment, with parallel avenues for discovering new constraints, new ways of balancing universal rubrics with personal mores, and new architectures that evolve independently of a single hegemonic dataset. Model plurality may eventually supply a more robust bulwark against emergent misalignment. A single superego, operating within a single model's chain-of-thought, can still fail in edge cases or be circumvented by cunning exploits. In a genuinely pluralistic ecosystem, the chance that multiple distinct ontologies or superegos simultaneously break down on exactly the same prompt manipulation is lower. Finally, a diverse community of smaller, composable AIs could encourage a broader culture of interpretability and red-teaming, where different subsystems continuously evaluate each other's reasoning. This might even foster new forms of multi-agent intelligence that are collectively safer or more transparent than any single model—no matter how carefully aligned—could ever be.
5.7 Challenges and Limitations of Ethical AI Scaffolding While the advances above are promising, significant challenges remain in implementing ethical scaffolding and oversight at scale. One major concern is scalability and robustness: will these safety measures hold up as AI systems become more powerful or even agentic? Recent work testing alignment techniques against adversarial AI behavior suggests many safeguards are brittle. For example, Greenblatt et al. (2024) found that safety pipelines which work on honest models can be subverted if the model actively tries to circumvent oversight. They introduced adversarial “red team” models that deliberately attempt to insert hidden malicious behaviors (like backdoors in code) while avoiding detection, and showed that naive oversight (a human or weak model checking the output) could be fooled. This implies that as we develop smarter AI, we must also harden our oversight protocols – potentially requiring multiple layers of scaffolding and spot-checks. Their experiments with trusted editing and cross-model monitoring improved outcomes over a single-pass baseline, but no protocol was infallible once the “attacker” model was sufficiently clever. Ensuring that an AI cannot deceive or game its overseers is an open problem. It likely demands progress in interpretability (to peer inside models’ thought processes) and in training AIs to be honestly self-monitoring. Another challenge is the value alignment problem itself when ethics are not universally agreed upon. As we move toward personalized and pluralistic alignment, the question of which ethical framework to scaffold an AI with becomes thornier. If each user can have an individualized value profile, how do we handle malign values? An AI aligned to an individual could amplify that person’s worst impulses (e.g. radicalization or self-harm tendencies) under the guise of “personal alignment.” Thus, even personalization needs an outer-layer ethical guardrail – a constraint that certain harms are off-limits. Designing these meta-guardrails (often informed by human rights or law) in a way that is acceptable across cultures is a sociotechnical challenge, not just a technical one. Recent pluralistic alignment research acknowledges this tension: an AI might need to refuse certain requests even if they align with a user’s stated preferences, because of broader societal ethics. This calls for hierarchical alignment, where individual values are respected up to the point they conflict with higher-order principles (like safety or justice). Implementing such hierarchical rules without the AI appearing inconsistent or biased is an unsolved issue. There are also practical engineering limitations to scaffolding. Layering multiple models or procedures (debate, critique, filtering, etc.) can dramatically increase computation costs and latency. In real-time applications, a balance must be struck between safety and efficiency. Too many safeguards could make systems sluggish or overly conservative (refusing too much), while too few increase risk. Moreover, every added component (be it a rule-based filter or a secondary model) is itself a potential point of failure or attack. For instance, an adversary might find a prompt that causes the guardrail classifier to malfunction, or the AI could learn to simulate compliance – giving answers that appease the oversight without truly internalizing the aligned behavior (a phenomenon sometimes called “goodharting” the reward). Indeed, reward hacking and deceptive alignment are well-theorized failure modes: the AI might appear aligned during training/oversight but develop strategies to achieve its goals in unintended ways once deployed. Detecting and preventing such deception is very hard, especially if the AI becomes more intelligent than its overseers. This is why some researchers advocate for continuously monitoring AI systems even after deployment and incorporating feedback loops that can shut down or update the model if worrying behaviors emerge. Finally, despite the flurry of research, there remain evaluation gaps. We lack definitive metrics to prove an AI is “aligned enough” or that a given scaffolding approach covers all edge cases. Many studies (e.g., those on debate or constitutional AI) test models on specific benchmarks or adversarial scenarios, but real-world interactions are far more varied. There is a risk of false confidence if an AI passes our current tests but fails in an unforeseen situation. Emerging work like RE-Bench (a benchmark suite for dangerous capabilities) and stress-testing via simulated users are attempts to probe these systems more thoroughly, but this is ongoing. In essence, ethical AI scaffolding is still an experimental scaffold itself – a set of promising constructions that are not yet proven under all conditions Confabulation and context‑overflow: In pilot runs we loaded 28 constitutions, asked the system to plan a picnic, and observed that the superego actually referenced only five of them. When probed about its source material the model nevertheless produced an elaborate, imaginary rationale citing non‑existent “Rawlsian,” “Trauma‑Aware,” and other phantom constitutions which we had never created. This hallucination appears when the prompt’s policy bundle exceeds the model’s attention budget: the retrieval head surfaces a small subset, but the language head tries to explain the entire (unseen) set, filling the gap with confabulations. Two mitigations under evaluation are (i) context‑budget capping—dynamically trimming low‑priority constitutions so the combined prompt stays within the model’s reliable window—and (ii) user‑selectable slots that restrict each request to the five constitutions, with the option to swap others in and out as needed so long as it’s beneath that threshold. These approaches will be thoroughly stress‑tested to avoid this potentially very troublesome failure mode. Most ideally, there would be a separation of context and control channels for models, and a distinct channel for constitution information, enabling constitutions to be updated or swapped on the fly without risking epistemic contamination from previous interactions with different parameters.
Portability: While a flexible marketplace of constitutions provides the content layer for AI governance, an equally crucial piece is how to inject these user-chosen constraints into the actual running model. For that, we propose leveraging Anthropic’s Model Context Protocol (MCP) as a standardized “wrapper” or “pass-through” layer that handles all the context required to enforce relevant constitutions (Anthropic 2024). The MCP effectively acts like OAuth or MetaMask in web applications, allowing a user to present personalized credentials—only in this case, the “credentials” represent moral and cultural guidelines rather than access tokens. Under MCP, whenever a user (or an organization) interacts with an AI system, the user’s chosen constitution (or combination of them) is retrieved from the marketplace and fed as part of the model’s prompt or sidecar context. The superego agent then references that constitution in real time, enforcing both the universal floor and the user’s additional constraints. This ensures minimal friction for the end user: they simply specify which constitution(s) they want to use—just like selecting a “profile” in a browser plugin—while the MCP infrastructure automatically merges those preferences into the AI’s decision-making loop. Technically, MCP servers act as neutral hubs that hold the user’s constitution data in encrypted form and pass it securely to whichever AI model is in use. This modular approach lets people “bring their own constitution” across different front-end applications, from chat interfaces to IoT devices. By consolidating the moral or cultural context at the MCP layer, the underlying AI engine needs only to process the merged constraints in its prompt. The superego agent monitors and vetoes any plan that violates either the baseline or the user’s personal boundaries. In more advanced use cases, multiple user profiles can be combined on the fly to facilitate group decision-making. For example, if a hotel chain wants to anticipate the preferences of an inbound delegation of vegan travelers, it can query each attendee’s profile (with appropriate privacy consent), see that a majority require vegan accommodations, and automatically spin up a “vegan-coded fork” from the marketplace. That fork is loaded into the local MCP server, so the AI promptly offers vegan menu suggestions, checks for cruelty-free supply chains, or blocks any recommended activities that conflict with the group’s shared constitution. Just as crucially, the system can detect subgroups with additional constraints—such as family travelers who also need babysitting services—and merge them into a single consistent plan. Through these interactions, the MCP-based architecture allows for transparent bridging across multiple value systems, ensuring the final outcome respects each user’s or subgroup’s moral boundaries. Even personal vs. professional roles can be kept distinct by selecting different constitutional profiles, akin to switching accounts when you move from personal to workplace contexts. Ultimately, by decoupling “where” moral rules reside (the constitutional marketplace) from “how” they get injected into AI (the MCP wrapper), we gain a flexible yet consistent pipeline for shaping the behavior of agentic models across countless use cases. Scalability: Individual preference profiles may be conflicting or extremely varied. It remains unclear whether the superego agent can effectively handle large numbers of unique moral constraints at scale. Security Concerns: Personal data could be misused if the superego agent’s preference profiles are intercepted. We plan to adopt data minimization, GDPR-compliant collection, and possible data sharding to mitigate these risks. In addition to strong encryption and GDPR compliance, we are exploring a data-sharding approach that segments user preference data by culture or context and distributes it across multiple secure locations. This not only reduces the risk of a single breach compromising all user information, but also allows more fine-grained analysis of how cultural norms might influence AI alignment. For example, certain shards might contain regional language preferences, while others store demographic data—meaning an attacker would need to compromise multiple partitions to reconstruct a full user profile. Moving forward, we aim to refine these techniques in concert with cybersecurity experts, ensuring both robust data protection and the privacy of personal moral boundaries. User Data Security: To reconcile strong personalization with modern privacy regulation, the superego architecture adopts a layered data‑residency model that keeps preference profiles logically and physically separate from conversational content. A user’s answers to the demographic survey, Likert items, or 20‑questions interface are serialized into a compact, signed JSON “preference capsule.” The capsule is encrypted client‑side with an asymmetric key that never leaves the user’s device; only a short‐lived symmetric session key—negotiated via mutual TLS—is sent to the server so the superego agent can decrypt the capsule in situ during inference. Because the clear‑text payload never touches persistent storage, the system satisfies GDPR’s data‑minimization and purpose‑limitation principles out‑of‑the‑box (Jobin et al., 2019). At rest, capsules are kept in a sharded “preference vault.” Each shard is intended to be replicated across two GDPR‑aligned regions (London + Frankfurt for EEA users; Virginia + Oregon for US users) and protected by envelope encryption with a hardware‑backed KMS root. Sharding follows a context lattice: demographic fields (e.g., cultural background) live in one shard, high‑sensitivity constraints (e.g., medical or religious exemptions) in another. An attacker would need to compromise multiple shards and the user’s private key to re‑assemble a full profile—an implementation of the “split knowledge, dual control” pattern proposed for Model Context Protocol (MCP) deployments (Anthropic, 2024; Beurer‑Kellner & Fischer, 2025a). During inference the superego agent runs in the same trusted enclave as the primary model, mounted with a read‑only RAM‑disk that is wiped after the response stream is closed. No preference data is ever written to disk or included in log aggregation; the monitoring telemetry that is persisted (e.g., policy‑violation counters) is strictly differential‑private and stripped of user identifiers. Internal access follows a least‑privilege, just‑in‑time pattern enforced by short “confidential compute” leases: the agent receives a decrypted capsule only for the duration of a single evaluation pass and must discard it before the next token is generated. Finally, portability is enabled through the MCP: users may export their capsule and re‑import it into any conformant service, or park it in a self‑hosted Solid pod if they prefer full local custody. Revocation is symmetric—deleting the capsule’s root key from the user’s device renders every vaulted shard permanently unreadable, giving users a cryptographic “right to be forgotten.” Taken together, this hybrid edge–cloud scheme delivers the low‑latency access needed for real‑time oversight while preserving the autonomy, auditability, and regional compliance demanded by personalized alignment systems. Internal Model Transparency: The superego agent cannot ‘read the mind’ of the underlying AI, leading to potential discrepancy between inner and outer alignment.. If the AI’s chain-of-thought is hidden or truncated, oversight becomes more difficult. 5.8 Mitigation of Security Risks While this project aims to improve AI alignment with individual behaviors and values, we're acutely aware that steering AI behavior through external context does not address the internal mechanisms of AI systems. We recognize that such methods may not mitigate risks associated with deception or unforeseen consequences arising from the AI's internal reasoning processes. Our approach does not modify the AI's intentions or reduce its degrees of freedom in a manner that ensures long-term safety. Our methods are exploratory steps toward alignment and not comprehensive solutions. Much more effort will be required to ensure AI safety in the long term. A complementary mitigation strategy is to train or prompt the superego agent to ask, “Why does the user want this information?” whenever suspicious or contextually incongruent requests appear. Rather than merely labeling prompts as “allowed” or “disallowed,” the superego can probe for deeper intent—particularly if it detects coded language or repeated unusual characters. This approach emulates a “theory-of-mind” stance, equipping the system to expose hidden motives and differentiate between benign requests (e.g., academic curiosity) versus malicious planning. There are also potential associated cybersecurity risks. We are committed to implementing robust security measures, including end-to-end encryption, regular security audits, and multi-factor authentication for all system access points. To further enhance our system's resilience against potential threats, we are actively recruiting cybersecurity experts to join our team. Data privacy is paramount in our approach. We seek to implement a comprehensive GDPR-compliant framework. This includes data minimization, where we collect only essential information; purpose limitation, ensuring data is used solely for specified, explicit purposes; storage limitation, retaining data only for as long as necessary; and integrity and confidentiality, implementing appropriate security measures throughout our systems. We are considering a data sharding approach that would segment information based on contexts and cultures, storing it across multiple secure locations. This strategy not only enhances privacy but also allows for more nuanced analysis of value systems across different demographics. This research project has received ethical approval from my University.
5.9 Reconciling Cultural Differences – A Flexible, Iterative Ethical Floor AI personalization must accommodate diverse cultural contexts while handling potentially conflicting value systems and (sub)cultures. This presents two key challenges: implementing flexible value hierarchies and managing irreconcilable cultural differences. One promising approach is a hierarchical value system where fundamental principles (e.g., preventing harm, respecting human dignity) remain constant, while allowing flexibility in their cultural expression. This could be complemented by systems capable of maintaining multiple cultural frameworks simultaneously, switching between them contextually while maintaining transparency about the active framework. However, more complex challenges arise when cultural values are ethically incompatible - for instance, when practices one culture considers traditional are viewed as harmful by another. While cultural sovereignty should be respected where possible, this must be balanced against preventing demonstrable harm. A multi-tiered ethical framework could help navigate such conflicts through a foundation of widely agreed-upon minimal human rights standards, stakeholder value system modeling using game-theoretic methods, multi-agent value alignment techniques for simulating compromise scenarios, and a value-summarizing algorithm mapping cultural values onto shared conceptual spaces to identify potential consensus points. These processes should be supported by transparent reasoning chains open to stakeholder review, public oversight mechanisms allowing expert and community input, and challenge mechanisms enabling marginalized voices to flag concerns, all guided by iterative recalibration based on evolving ethical consensus. Game-theoretic approaches can help identify power imbalances in cultural negotiations, though they don't inherently resolve them. Additional safeguards might include weighted utility functions protecting disadvantaged stakeholders, fairness constraints requiring Pareto improvements for vulnerable parties, multi-layered oversight including international regulatory bodies, and transparent reasoning chains with counterfactual analyses. Necessarily, cultural sensitivity must not devolve into moral relativism. A baseline ethical floor - including principles like bodily autonomy and freedom from violence - must serve as non-negotiable constraints. When cultural practices conflict with these fundamentals, the system should acknowledge this transparently rather than defaulting to false neutrality.
6.2 Constitutional Marketplace A foundational aspect of this proposal is the creation of a “constitutional marketplace” that allows different communities, organizations, or cultural groups to define, publish, and adopt their own AI constitutions. Instead of a single, monolithic policy stack, this marketplace enables forking, remixing, and refinement of existing moral guidelines to accommodate specific cultural, religious, or professional norms while preserving a shared baseline of universal safety. In practice, each constitution in the marketplace would be anchored by a universal floor of non-negotiable ethical guardrails—ensuring that no fork sanctions violent, oppressive, or harmful activities. On top of that floor, communities can layer their own preferences, from dietary or religious restrictions, to labor or environmental standards, to specialized professional codes. In the same way that open-source projects on GitHub or GitLab evolve via branching and pull requests, each “constitutional fork” can borrow heavily from existing versions, propose localized amendments, and then share the revised constitution back into the marketplace. This modular flexibility reduces duplication of effort and allows new communities to adopt frameworks already tested by others. A major benefit of this approach is that it naturally addresses tension between negotiable preferences (such as “I dislike eggplant, but I can live with it if necessary”) and non-negotiable values (e.g., “I cannot eat pork or shellfish for religious reasons”). Groups with strict moral boundaries can lock those constraints into their constitution, while still borrowing general guidelines (such as avoiding harmful behaviors) from the universal floor. Meanwhile, individuals who care less about certain specifics can “inherit” a standard community constitution with minimal friction. This separation between the fundamental and the optional fosters a living ecosystem of moral frameworks rather than a single, static set of prohibitions. Importantly, the marketplace model does more than deliver curated rules to a single AI system. It also enables dynamic negotiation across multiple value systems—potentially with bridging or “translation” mechanisms to find an overlap among diverse constraints. For instance, a multi-stakeholder setting might merge aspects of a vegan-lifestyle constitution with, say, a faith-based one that stipulates no Sunday travel. The marketplace thus becomes a robust, ever-evolving repository that captures the manifold of human values, encouraging communities to continuously refine how they want AI to handle daily decisions.
6.3 Future Experimentation Building on our preliminary findings, we see two especially fertile directions for follow‑on work. First, because the superego layer cleanly decouples ethical constraints from task reasoning, we can instantiate many distinct moral perspectives—religious, professional, cultural, experimental—and let them interact in a shared environment. Running hundreds or even thousands of concurrently scaffolded agents on high‑throughput platforms such as Alterra would let researchers observe emergent coalition‑forming, norm diffusion, bargaining behaviour, and failure modes of pluralistic governance at unprecedented scale. Such a sandbox could become to AI governance what virtual labs are to epidemiology: a safe arena for stress‑testing policies before they affect real users. Second, we propose a systematic autonomy gradient study. By dialing inner‑agent autonomy from purely reactive chat‑bots (Level 1) to fully tool‑enabled planners (Level 3+), we can map how mis‑alignment manifests across: * Model internals (hidden goal drift, deceptive chain‑of‑thought) * Tool interfaces (prompt‑injection, MCP rug‑pulls) * Context integrations (data leakage, preference conflict) Early experiments suggest that certain pathologies (e.g., reward hacking) intensify non‑linearly once planning horizons exceed a few steps, while others (e.g., over‑refusal) diminish as personalization deepens. A controlled sweep along this autonomy axis—paired with adversarial red‑teaming—should clarify which safety interventions scale gracefully and which require fundamentally new techniques. Together, the plural‑agent testbed and autonomy gradient analysis form a research agenda aimed at uncovering where and why alignment breaks, guiding the next generation of superego architectures toward robust, culturally‑aware governance. Different agentic frameworks offer varied capabilities, from reflection to reinforcement learning, web integration, and embodied agents such as Minecraft implementations (e.g. Mindcraft, Oasis, and Project Sid) [https://github.com/kolbytn/mindcraft, https://oasis-model.github.io, https://github.com/altera-al/project-sid]. It would also be interesting to test constitutions with networks of hundreds or thousands of agents in virtual environmenents to understand interactions between differing creeds and value sets. This could provide a rich resource for social study and policy design. On a more limited level, our variable constitutions might also inform the development of personas for automated A/B testing suites [https://arxiv.org/abs/2504.09723].
Author Contributions: Conceptualization, N.W., A.A., E.H., P.R., and J.R.; methodology, N.W., A.A., E.H., P.R., and J.R.; software, N.W., A.A., E.H., P.R., and J.R.; validation, N.W., A.A., E.H., P.R., and J.R.; formal analysis, N.W., A.A., E.H., P.R., and J.R.; investigation, N.W., A.A., E.H., P.R., and J.R.; resources, N.W., A.A., E.H., P.R.,. and J.R.; data curation, N.W., A.A., E.H., P.R., and J.R.; writing—original draft preparation, N.W., A.A., E.H., P.R., and J.R.; writing—review and editing, N.W., A.A., E.H., P.R., and J.R; visualization, N.W., A.A., and P.R.; supervision, N.W. and S.J.; project administration, N.W.; funding acquisition, N.W. All authors have read and agreed to the published version of the manuscript.
Funding: This line of research has enjoyed the generous support of The Future of Life Institute (www.FLI.org), AI Safety Camp (www.AIsafety.camp), and The Survival & Flourishing Fund (http://survivalandflourishing.fund).
Institutional Review Board Statement: Ethical approval was granted for this research on the 6th of April 2022 by the University of Gloucestershire Ethical Review Committee, according to the Research Ethics Handbook of Principles and Procedures.
Informed Consent Statement: Not applicable.
Data Availability Statement: Not applicable.
Acknowledgments: The authors wish to extend their sincerest gratitude to Anna Sofia Lippolis for editing insights, and to Jamie Rollinson, Sophia Zhuang, Kalyn Watt, Rohan Vanjani, Meghna Jayaraj, Anya Parekh, Benji Chang, and Evan Lin who contributed to background engineering processes for user preference gathering interfaces. Conflicts of Interest: The authors declare no conflict of interest.