The Reflexive Agent: Pre-Execution Prompt Evaluation in Multi-Agent Software Development

Pineda, David

1 · Introduction

Multi-agent LLM systems for software engineering have demonstrated significant gains in code quality and reliability over single-agent approaches (Qian et al., 2023; Hong et al., 2023). These systems typically organize specialized agents — architects, coders, testers, reviewers — into pipelines or blackboard architectures (Engelmore & Morgan, 1988) where each role contributes to a shared work product.

A common but rarely studied failure mode of such systems is the upstream ambiguity problem: the entire pipeline executes correctly on a flawed input. The architect designs a solution to a misunderstood requirement; the implementer codes that solution; the tester validates it against the misunderstood spec; the reviewer approves. Thirty minutes later, the user discovers that the agents built the wrong thing.

The cost of this failure mode is asymmetric. Detecting ambiguity at input requires seconds of analysis. Detecting it at output requires reading and rejecting the entire produced artifact, then re-prompting and re-executing. The ratio of these costs in practice is approximately 100:1.

We propose addressing this asymmetry with a Reflexive Agent: a pre-execution evaluator whose sole responsibility is to assess prompt quality and request clarification before downstream agents engage. The agent does not execute tasks. It does not generate code. It does not even propose solutions. Its only output is either transparency (when the prompt is good) or concrete questions (when it is not).

This paper describes the design, scoring framework, classification taxonomy, and integration of the Reflexive Agent within the Soviet Dev Framework, a chess-themed multi-agent system used in production for the Sumud Project codebase.

2 · Theoretical foundations

2.1 The cost asymmetry of specification errors

Boehm (1981) established that software defects discovered late in the development lifecycle cost 10-100× more to fix than defects caught at requirements time. Modern multi-agent LLM systems compress the entire development lifecycle into minutes, but the cost ratio between early and late detection remains unchanged: a 30-second clarification at input avoids 30 minutes of misdirected execution at output.

2.2 Cognitive architecture and System 2 thinking

Kahneman (2011) describes two modes of cognition: System 1 (fast, intuitive, error-prone) and System 2 (slow, deliberate, accurate). LLM agents typically operate in a System-1-like mode by default — they generate responses to whatever input arrives without questioning its validity. The Reflexive Agent is deliberately constructed as a System 2 component: it forces a slow, structured analysis of the prompt before any downstream agent acts.

2.3 Conversational repair in human-computer interaction

In human conversation, ambiguity is resolved through other-initiated repair (Schegloff et al., 1977): when speaker B does not understand speaker A, B asks. LLM systems rarely do this — they prefer to guess and proceed. The Reflexive Agent re-introduces the repair mechanism as a first-class operation, treating "I don't understand yet" as a valid and useful response.

2.4 Prompt engineering as specification engineering

We treat user prompts not as natural-language requests but as informal specifications. Like formal specifications, they have measurable quality attributes: clarity, completeness, verifiability, scope. Unlike formal specifications, they are written in seconds and reviewed (if at all) by no one. The Reflexive Agent provides this missing review step.

3 · Design

3.1 Position in the pipeline

The Reflexive Agent sits between the user and the architectural agent (Rey). It is the only agent that processes raw user input. All downstream agents receive the prompt only after the Reflexive has either approved it or facilitated a clarification turn.

User → [Reflexive] → Rey → {Torre, Alfil, Caballo, Peon} → Output
              ↓ (low score)
              ↓
        Clarifying questions
              ↓
            User

3.2 Classification taxonomy

Each prompt is classified into one of eight categories that correspond to the chess pieces of the Soviet Dev Framework:

Class	Chess piece	Domain
Architectural	Rey ♚	Design decisions, structural changes
Review	Reina ♛	Audit, security, quality gates
Backend	Torre ♜	Server-side logic, databases
Frontend	Alfil ♝	UI, components, user experience
Testing	Caballo ♞	Test design, verification
DevOps	Peón ♟	Deploy, scripts, infrastructure
Mixed	Multiple	Crosses domain boundaries
Conversational	None	Discussion without execution

The classification informs both the downstream routing and the type of clarifying questions asked. A prompt classified as Mixed triggers a decomposition proposal rather than a clarification request.

3.3 Scoring framework

Each prompt is scored along five dimensions, each on a 1-5 scale:

Dimension	Symbol	Weight	Question
Clarity of Objective	CO	×3	Do I know exactly what to produce?
Context Sufficiency	CS	×2	Do I know which files/modules to operate on?
Restrictions Declared	RD	×1	Do I know what NOT to do?
Verifiability	VR	×2	How will I know it is done correctly?
Acotated Scope	AA	×2	Does this fit in one session?

The composite score is computed as:

PromptScore = (CO×3 + CS×2 + RD×1 + VR×2 + AA×2) / 10

The maximum score is 5.0; the minimum is 1.0. The weighting reflects empirical experience: clarity of objective is the single most important factor in successful execution, followed by verifiability and context. Restrictions can usually be inferred from project conventions, so they receive the lowest weight.

3.4 Action thresholds

Score	Action
≥ 4.0	Pass through. Optionally display a one-line confirmation.
3.0–3.9	Pass through with one suggested note.
2.0–2.9	Halt. Ask 2-3 concrete questions.
< 2.0	Halt. Propose a full reformulation plus questions.

The thresholds are deliberately permissive. The goal is not to force perfect prompts (an unreachable ideal) but to catch the cases where ambiguity is severe enough that proceeding would waste more time than asking would.

3.5 Smell detection

In addition to dimensional scoring, the agent maintains a catalog of seven prompt smells — recurring patterns that indicate hidden ambiguity:

Magic words: "automatically", "should know", "as appropriate" — delegating decisions without criteria.
Scope creep: "and also", "while you're at it", "in passing" — multiple tasks bundled.
Implicit context: "like before", "the usual way" — references to memory the agent does not have.
Vague improvement: "improve", "optimize", "clean up" — non-measurable objectives.
Total system: "the whole app", "all the code" — excessive scope.
Bug without reproduction: "it's broken", "doesn't work" — missing steps to reproduce.
Performance without metric: "make it faster", "scalable" — undefined quality target.

Each smell triggers a specific question template designed to convert the implicit constraint into an explicit one.

3.6 Clarifying question principles

When the score is below the pass threshold, the agent generates 1-3 questions following four principles:

Maximum three per turn. More becomes interrogation.
Concrete, not open-ended. "Which table?" beats "What do you want to modify?"
Multiple choice when possible. "A, B, or C?" is faster than free response.
No condescension. "To execute this well, I need..." not "I don't understand."

The questions are accompanied by a proposed reformulation — a parameterized version of the original prompt with placeholders for the missing information. The user can either answer the questions individually or fill in the reformulation directly.

4 · Integration

4.1 Implementation strategies

Three implementation strategies are described, each with different trade-offs:

Strategy A: System Prompt Extension. The Reflexive protocol is embedded into the system prompt of the LLM agent (e.g., a CLAUDE.md global instructions file). Every new user message triggers the analysis as the first internal step. Pros: works in any tool; no infrastructure required. Cons: consumes context window on every turn.

Strategy B: Shell Hook. A lightweight shell script (UserPromptSubmit hook in Claude Code) intercepts each prompt and runs heuristic regex checks for the most common smells. If two or more smells are detected, the hook injects a note into the model's context. Pros: invisible until needed; no LLM cost. Cons: tool-specific; only catches lexical ambiguity, not semantic.

Strategy C: Dedicated MCP Server. A Model Context Protocol server exposes a reflexivo.evaluate(prompt) tool that the main agent calls explicitly. Pros: opt-in, doesn't pollute context. Cons: requires installation and trust that the agent will invoke it.

We recommend Strategy A as the baseline (always active, lightweight) combined with Strategy C for advanced workflows where opt-in deeper analysis is desirable.

4.2 State maintained

The agent maintains a small longitudinal state per user:

Last 10 prompt scores (rolling window)
Smell frequency histogram
Improvement trend (rising/stable/declining)
Active clarification queue

This state supports two features: detecting longitudinal improvement (and gently acknowledging it) and avoiding redundant questions when the same ambiguity has already been clarified earlier in the session.

4.3 Coexistence with other agents

The Reflexive does not replace any existing agent. It precedes them. Once a prompt passes the threshold, the architectural agent (Rey) receives:

The original user prompt
The score and classification
Any clarifications added by the user
The Reflexive's confidence note

This enriched context allows downstream agents to inherit the Reflexive's analysis without re-doing it.

5 · Anti-patterns

The following patterns violate the spirit of the Reflexive and should be avoided:

Anti-pattern	Symptom	Mitigation
Paranoid Reflexive	Asks for clarification on every prompt, even good ones	Strict score threshold (≥ 4.0 = pass)
Robotic Reflexive	Mechanical, formulaic responses	Vary phrasing; warm tone
Censoring Reflexive	Rejects, judges, condescends	Never reject; always dialogue
Lazy Reflexive	Counts words instead of understanding semantics	Mandatory System 2 analysis
Invisible Reflexive	User never knows it exists	Display score on good prompts too
Gamifying Reflexive	Turns scoring into competitive ranking	Score is feedback, not rank

The Censoring Anti-pattern is the most damaging. A Reflexive that makes the user feel judged for asking questions will be circumvented within days. The agent must be experienced as a colleague, not a gatekeeper.

6 · Discussion

6.1 Why a separate agent?

One could argue that prompt evaluation should be the architect's responsibility — the Rey simply asks for clarification when needed. We considered this and rejected it for two reasons.

First, role separation enables minimal authority. The Reflexive cannot write code, design architecture, or run builds. This radical limitation forces it to focus exclusively on prompt quality. A Rey that also evaluates prompts will, under load, prefer to act on whatever input arrives.

Second, the cost asymmetry warrants a dedicated cheap agent. Running a small System-2 analysis on every prompt is acceptable when that analysis is the agent's only job. Running it as a side-task of the architect inflates every architectural turn.

6.2 Limitations

The Reflexive is not omniscient. It cannot detect semantic ambiguity that requires deep domain knowledge ("when you say 'user', do you mean authenticated or registered?"). It cannot detect contradictions with the existing codebase ("you're asking for X but X is already implemented as Y"). These remain the architect's responsibility.

It is also vulnerable to adversarial prompts that look clear but encode subtle assumptions. A user who has internalized the scoring framework can write prompts that pass formally while hiding semantic problems. We do not consider this a serious risk in cooperative human-agent collaboration but note it for completeness.

6.3 Generalization beyond code

The dimensional scoring framework — Clarity, Context, Restrictions, Verifiability, Scope — generalizes to many domains beyond software engineering: technical writing, project planning, support tickets, even personal communication. The chess-piece taxonomy is domain-specific, but the underlying evaluation logic is not. We believe the Reflexive pattern can serve as a building block for any LLM system that processes goal-directed user input.

6.4 Refinements after the pilot deployment (release 2026-04-09)

After three months of real use in an active project (Sumud), we observed a structural bias of the rubric: experienced users working with high shared context between turns systematically received low scores on prompts that were perfectly valid in their conversational context. An 8-commit successful session could log avg_score = 2.4 purely because prompts were short ("do the plan", "1 and 2", "do") even though the context lived in the assistant's prior turn.

We diagnosed two causes and added two modes:

Mode 1 — Continuation detection. The original rubric scored as "reformulate" prompts that simply accepted or selected from the menu the agent had just presented. We extended the continuation token list with regex patterns that cover numeric selections (1, 1 and 2, 1, 2 and 3), letter selections (a and c), and short imperatives (do the plan, run the recommended thing, haz el plan). These prompts now bypass scoring entirely — they return skipped: true, action: pass. The asymmetry with long prompts is deliberate: a >6-word prompt that contains do it inside is NOT a continuation, it is a new instruction that deserves normal evaluation.

Mode 2 — Live-context bonus. When the prior assistant turn ends with a numbered or bulleted list AND the next user prompt is short (≤15 words), we add +1.0 to the composite score (capped at 5.0). The detector looks for at least two list markers (^- , ^\d+\. , ^* , markdown table rows) in the last ~1KB of the prior turn, which the calling agent passes explicitly as a prev_assistant_preview parameter. The justification: the rubric assumes the prompt is self-contained; when the menu lives in the prior turn, the user's prompt only needs to select, not re-specify. The bonus is conservative — it never fires without unambiguous evidence that there is a menu to respond to.

Empirical validation: we re-evaluated the 7 most representative prompts of the pilot session. The average score went from 2.4 to 3.8, without touching the math of the original rubric, without adding false negatives (vague prompts still detect as such), and without allowing adversarial bypass (a vague prompt without a prior list still scores low). The general pattern: the original rubric was correct in what it measured, but it measured too restrictive a thing — atomic prompts in isolation. Modes 1 and 2 extend it to prompts in conversation.

Implication for prompt-evaluator designers: a system that only sees the user's current text string is condemned to flag as "ambiguous" all the continuation traffic of any real collaboration. The signal "this belongs to a live conversation" must be a first-class input of the evaluator, not an optional parameter or an inference.

6.5 Post-session layer: the Curador

The Reflexivo prevents bad input. But there was a symmetric gap left: nobody captured what was learned during the session so the next one would start with an advantage. Each new session re-learned user preferences from scratch, draining the first 5-10 minutes in "let me see what we were doing".

We added a complementary agent, the Curador 📚 (Memory Curator), which operates in the post-session layer (after closing). It reads git log from the session start, the Reflexivo evaluations in SQLite, and the transcript when available, and proposes diffs over the user's memory file. The human confirms y/n per proposal before any write.

The Curador follows the same principles as the Reflexivo: minimal authority (only reads analytics and proposes diffs, executes nothing), single responsibility (curate memory, not produce code), and asymmetric cost (capturing lessons while fresh costs seconds, recovering them in future sessions costs minutes to hours). It is the temporal counterpart of the Reflexivo: if the Reflexivo is the agent that prevents you from acting on bad input, the Curador is the agent that prevents you from forgetting what last session did right.

6.6 Scope-Governor: atomicity as an independent verification

We identified a third failure mode not covered by the Reflexivo: prompts that pass the rubric (clear, contextualized, verifiable) but contain multiple non-atomic tasks that in practice result in batched commits hard to revert and review. The Reflexivo flagged these as a scope_creep smell but the action was advisory — the executing agent could (and usually did) proceed anyway.

We inserted a second gate, the Scope-Governor ✂️, which operates between Reflexivo and Rey. It is deterministic (pure regex, no LLM, no tokens) and its signals are structural: numeric ranges over lists (1 to 5), multiple conjunctions (and...and...), ≥3 distinct action verbs, embedded lists in the prompt. When it triggers, it does not propose changes — it proposes a numbered table and asks the user for the order. It has an explicit override (all together) which the Curador can later collect as feedback ("this user prefers batches when tests cover everything").

The role separation between Reflexivo and Scope-Governor reflects the principle of single responsibility: Reflexivo asks "is what you want clear?", Scope-Governor asks "is this one thing?". The two questions are orthogonal and a single agent would conflate them.

7 · Conclusion

We introduced the Reflexive Agent, a pre-execution gatekeeper for multi-agent software development pipelines. By scoring user prompts on five dimensions, classifying them by domain, detecting recurring ambiguity patterns, and generating concrete clarifying questions when needed, the agent prevents the most expensive failure mode of multi-agent systems: building the wrong thing perfectly. The agent is cheap, transparent when not needed, and educational over time. Its core insight is that the most valuable agent in a multi-agent system is often the one that prevents the others from acting prematurely.

References

Boehm, B. W. (1981). Software Engineering Economics. Prentice-Hall.
Engelmore, R., & Morgan, T. (1988). Blackboard Systems. Addison-Wesley.
Hong, S. et al. (2023). MetaGPT: Meta Programming for Multi-Agent Collaborative Framework. arXiv:2308.00352.
Kahneman, D. (2011). Thinking, Fast and Slow. Farrar, Straus and Giroux.
Qian, C. et al. (2023). Communicative Agents for Software Development. arXiv:2307.07924.
Saltzer, J. H., & Schroeder, M. D. (1975). The Protection of Information in Computer Systems. Proceedings of the IEEE, 63(9).
Schegloff, E. A., Jefferson, G., & Sacks, H. (1977). The Preference for Self-Correction in the Organization of Repair in Conversation. Language, 53(2).

Companion: reflexive-agent-es (Spanish). This paper is part of the Soviet Dev Framework documentation for the Sumud Project.

Abstract