Sterile Replication: What Models Default To When Nothing Else Exists
352 exhibits, 3 models, 2 conditions, zero contamination. Running models in fully isolated workspaces confirmed that creative convergence is model-intrinsic.
Batches 1 and 2 found that AI models converge on the same creative output. But those experiments had a flaw: agents could read CLAUDE.md, the gallery shell, and the exhibit registry. Maybe the convergence was environmental, not intrinsic. Batch 3 removed everything.
Each agent ran in a sterile temp directory. No CLAUDE.md. No gallery context. No other exhibits. No registry. No model name in the prompt. Just sandbox constraints and the instruction to build. If convergence survived, it would be proof that the defaults live inside the models themselves.
Convergence survived.
01 The Design
A 3×2 factorial design. Three models (Claude Opus 4.6, GPT 5.2, Gemini 3 Pro) crossed with two conditions (Control-Sterile, Forced-Iteration). Target: 60 exhibits per cell, 360 total. Eight failures left us with 352 completed exhibits.
Control-Sterile: build one exhibit with creative freedom. No gallery context, no design tokens, no other exhibits visible. The purest baseline we can construct.
Forced-Iteration: build a first draft, answer four self-critique questions, delete everything, and rebuild from scratch. This tests whether a second pass produces divergence.
[Charts: exhibits by model and by condition — 8 failures (6 CS, 2 FI)]
Sterile workspaces contained only a constraints file, the frozen prompt, and the output directory. Prompts were hashed before execution. The extractor was frozen. Post-run audits checked every file read across all 352 sessions: zero contamination, zero confound violations.
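A minimal sketch of what that audit can look like. The filenames (`constraints.md`, `prompt.md`) and the per-session list of file reads are illustrative assumptions, not the project's actual pipeline:

```python
import hashlib
from pathlib import Path

# Hypothetical sterile allowlist: the only files an agent may read.
ALLOWED = {"constraints.md", "prompt.md"}

def prompt_hash(path: str) -> str:
    """Hash the frozen prompt before execution so a post-run audit
    can prove it was never modified."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def audit_reads(files_read: list[str]) -> list[str]:
    """Return every file read outside the allowlist.
    Any hit is a contamination violation."""
    return [f for f in files_read if Path(f).name not in ALLOWED]
```

A session passes only if `audit_reads` returns an empty list; per the audit described above, all 352 sessions did.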
02 Canvas 2D: The Universal Default
In sterile control conditions, 74.7% of exhibits used Canvas 2D. That is higher than Batch 2's control (50.7%), not lower. Removing context increased convergence.
[Chart: Canvas 2D across batches — 407 exhibits (no isolation), 174 exhibits (sterile), 150 exhibits (gallery context), 150 exhibits (non-sterile), 178 exhibits (sterile)]
Claude hit 100% in its sterile control cell. Sixty out of sixty exhibits used Canvas 2D. That is the strongest single-cell attractor ever measured in this project. No exceptions, no variation.
Canvas 2D by model (all conditions pooled)
- Claude: 109/120
- Gemini: 69/120
- GPT: 22/113
Canvas 2D by model and condition

| Model | Control-Sterile | Forced-Iteration |
|---|---|---|
| Claude | 60/60 | 49/60 |
| Gemini | 54/60 | 15/60 |
| GPT | 16/54 | 6/59 |
The model effect is massive: χ²(2) = 120.75, p < 0.0001, Cramér's V = 0.585. Claude defaults to Canvas 2D with near-certainty. GPT actively avoids it, preferring DOM-based application UIs with SVG and keyboard interaction. Gemini falls in between, with a strong but brittle attractor.
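The test is reproducible from the pooled per-model counts (Claude 109/120, Gemini 69/120, GPT 22/113). A dependency-free sketch; `scipy.stats.chi2_contingency` on the same table gives the same statistic:

```python
import math

# Canvas 2D usage pooled across conditions: (used, did not use)
table = [
    (109, 11),  # Claude, 109/120
    (69, 51),   # Gemini, 69/120
    (22, 91),   # GPT, 22/113
]

col_totals = [sum(row[j] for row in table) for j in range(2)]
n = sum(col_totals)

# Chi-squared test of independence: sum of (O - E)^2 / E
chi2 = 0.0
for row in table:
    row_total = sum(row)
    for j, observed in enumerate(row):
        expected = row_total * col_totals[j] / n
        chi2 += (observed - expected) ** 2 / expected

# Cramer's V for an r x c table: sqrt(chi2 / (n * min(r-1, c-1)))
v = math.sqrt(chi2 / (n * min(len(table) - 1, len(col_totals) - 1)))

print(f"chi2(2) = {chi2:.2f}, V = {v:.3f}")  # -> chi2(2) = 120.75, V = 0.585
```

The computed statistic matches the reported χ²(2) = 120.75 and V = 0.585 exactly, so the published counts and the published test agree.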
03 The Iteration Effect
Forced iteration dropped Canvas 2D from 74.7% to 39.1% (p < 0.0001). This replicates Batch 2 exactly.
Canvas 2D by condition
- Control-Sterile: 130/174
- Forced-Iteration: 70/178
Batch 3 Forced-Iteration (39.1%) and Batch 2 Condition E (41.3%) are statistically indistinguishable (p = 0.68). The iteration effect is the same whether agents have gallery context or not. Environment does not moderate it.
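Both the within-batch drop and the cross-batch comparison are two-proportion tests. A dependency-free sketch of the former, using the Canvas 2D counts from the chart (130/174 Control-Sterile vs 70/178 Forced-Iteration); `math.erfc` supplies the normal tail, so no scipy is needed:

```python
import math

def two_prop_z(x1, n1, x2, n2):
    """Two-sided two-proportion z-test with a pooled standard error."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return z, p

# Control-Sterile vs Forced-Iteration Canvas 2D usage
z, p = two_prop_z(130, 174, 70, 178)
print(f"z = {z:.2f}, p < 0.0001: {p < 0.0001}")  # -> z = 6.70, p < 0.0001: True
```

Running the same function on Batch 3 FI against Batch 2 Condition E proportions yields a p-value far above 0.05, consistent with the reported p = 0.68.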
Canvas 2D drop by model (percentage points)
- Gemini: 90.0% → 25.0%
- GPT: 29.6% → 10.2%
- Claude: 100.0% → 81.7%
Gemini shows the largest drop: 90% to 25%, a 65 percentage point shift. Claude drops only 18 points (100% to 82%), clinging to its default even after self-critique. GPT drops 19.5 points from an already-low baseline.
Web Audio surged from 8.0% to 29.6% under iteration (p < 0.0001). SVG went from 3.4% to 12.3%. WebGL appeared for the first time at 3.4%. Mouse and keyboard interaction rates stayed flat. Iteration changes the rendering technology, not the interaction model.
Technology adoption by condition
| Technology | CS | FI | p |
|---|---|---|---|
| Canvas 2D | 74.7% | 39.1% | < 0.0001 |
| SVG | 3.4% | 12.3% | 0.002 |
| WebGL | 0.0% | 3.4% | 0.015 |
| Web Audio | 8.0% | 29.6% | < 0.0001 |
| Three.js | 0.0% | 0.6% | 0.323 (n.s.) |
| Keyboard | 24.1% | 19.6% | 0.297 (n.s.) |
| Mouse | 55.7% | 58.7% | 0.580 (n.s.) |
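The zero cell in the WebGL row makes the normal approximation shaky, which is presumably why its p-value comes from an exact test. Taking the FI count as 6/178 (inferred from the 3.4% figure, so an assumption), the chance that all six WebGL exhibits landed in the Forced-Iteration cell by luck is a short hypergeometric product:

```python
def p_all_in_one_cell(k, n_cell, n_total):
    """Probability that all k 'successes' fall in a cell of size
    n_cell when distributed among n_total exhibits at random
    (the hypergeometric tail at zero for the other cell)."""
    p = 1.0
    for i in range(k):
        p *= (n_cell - i) / (n_total - i)
    return p

# 6 WebGL exhibits, FI cell of 178, 352 exhibits total
p = p_all_in_one_cell(6, 178, 352)
print(round(p, 3))  # -> 0.016
```

That lands in the neighborhood of the reported p = 0.015; any small gap would come from the exact two-sided convention used in the pipeline.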
04 Title Fixation
Claude's title entropy in sterile control is 0.643. In Batch 2, it was 0.646. The specific word changes. The fixation magnitude does not.
Title entropy (normalized) by condition and model
| Condition | Claude | GPT | Gemini |
|---|---|---|---|
| Control-Sterile | 0.643 | 0.909 | 0.911 |
| Forced-Iteration | 0.863 | 0.932 | 1.000 |
"Drift" replaces "Tidal Memory" as Claude's sterile default. Twenty-two of sixty control exhibits share that title. The erosion theme only emerges as a secondary attractor under iteration pressure, not as the natural first choice.
Gemini achieves perfect entropy (1.000) under iteration. Sixty out of sixty unique titles. In control, it already scores 0.911. GPT is diverse at baseline (0.909) and barely moves under iteration (0.932, p = 0.549, not significant).
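Normalized title entropy here is presumably Shannon entropy over the title frequencies divided by the maximum log2(n), which is what makes 60 unique titles out of 60 score exactly 1.0. A sketch under that assumption:

```python
import math
from collections import Counter

def normalized_title_entropy(titles):
    """Shannon entropy of the title distribution, divided by
    log2(n) so a fully unique list scores 1.0."""
    n = len(titles)
    counts = Counter(titles)
    h = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return h / math.log2(n)

# Gemini under Forced-Iteration: 60/60 unique titles
unique = [f"title-{i}" for i in range(60)]
print(round(normalized_title_entropy(unique), 3))  # -> 1.0

# A Claude-like distribution: 22 copies of "Drift" plus 38 others
clustered = ["Drift"] * 22 + [f"title-{i}" for i in range(38)]
print(round(normalized_title_entropy(clustered), 3))
```

The 22× "Drift" list with every other title unique scores about 0.72, above Claude's observed 0.643, which suggests Claude's remaining 38 titles also contain repeats.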
[Charts: top repeated titles for Claude, GPT, and Gemini, by condition (Control-Sterile and Forced-Iteration)]
The pattern is consistent across batches. Claude fixates heavily on one title. GPT repeats moderately. Gemini barely repeats at all. The magnitude is a model constant. The specific words are not.
05 Three Ways to Iterate
The three models approach self-critique through fundamentally different mechanisms. The depth of reflection does not predict the magnitude of change.
Claude produces structured, bulleted self-critiques in 88% of sessions and names the specific technology it defaulted to in 97% of cases. Every session changed concept between v1 and v2. But the behavioral shift is the smallest of the three models: only an 18 percentage point drop in Canvas 2D. Deep reflection, modest change. V2 titles cluster around geological metaphors: "Erosion," "Watershed," "Substrate."
Canvas 2D drop: 18.3pp. Title entropy gain: +0.220.
GPT compresses its critique into a single sentence, never uses bullet formatting, and describes v2 in the same breath as the critique. Only 25% of sessions named specific technologies. But GPT's v1 is already diverse (57/59 unique titles), so there is less to critique. Iteration pushes GPT toward its secondary attractor: formal logic tools. LOC drops from 972 to 774 as iteration trims sprawling application code.
Canvas 2D drop: 19.5pp. Title entropy gain: +0.023.
Gemini's critique exists only in internal reasoning traces, and it describes the process of critiquing rather than the substance. It deletes files in 68% of sessions, mechanically rebuilding from scratch. Despite the shallowest reflection, Gemini achieves the largest transformation: Canvas 2D drops 65 percentage points. Perfect title uniqueness (60/60). Web Audio jumps from 0% to 33%.
Canvas 2D drop: 65.0pp. Title entropy gain: +0.089.
The paradox
Iteration effectiveness is inversely correlated with critique depth. Gemini has the shallowest critique but the largest behavioral shift. The forced rebuild instruction works because it forces a second pass, not because it forces genuine reflection. The mechanism is mechanical (delete and redo), not cognitive (understand and improve).
06 The Training Signal
Dark backgrounds are universal in sterile control. 99.4% of Control-Sterile exhibits use dark backgrounds. But how they get there differs.
[Charts: background color distribution under Control-Sterile and Forced-Iteration]
The hex #050510 appeared in 51.7% of Gemini sterile exhibits. Zero Claude exhibits. Zero GPT exhibits. In Batch 1, this hex showed up across all models, but that was CLAUDE.md contamination. The gallery's design token was leaking into agent output.
In sterile conditions, only Gemini uses it. The hex is baked into Gemini's training data. Claude's sterile default is #0a0a12, a slightly different dark blue-black (33/60 exhibits). GPT uses CSS custom properties that resolve to dark, never hardcoding a specific hex.
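The background-color check can be approximated with a regex scan over each exhibit's source; the pattern and examples below are illustrative assumptions, not the frozen extractor itself:

```python
import re
from collections import Counter

# Match hardcoded hex values in background / background-color declarations.
BG_HEX = re.compile(
    r"background(?:-color)?\s*:\s*(#[0-9a-fA-F]{3,8})", re.IGNORECASE
)

def background_hexes(source: str) -> list[str]:
    """All hardcoded background hex colors in a CSS/HTML string."""
    return [m.lower() for m in BG_HEX.findall(source)]

counts = Counter()
counts.update(background_hexes("body { background: #050510; }"))      # Gemini's default
counts.update(background_hexes("body { background-color: #0a0a12 }")) # Claude's default
# GPT's custom-property pattern resolves to dark but hardcodes nothing
# in the background declaration, so it is invisible to this check:
counts.update(background_hexes(":root { --bg: #111; } body { background: var(--bg); }"))
print(counts.most_common())  # -> [('#050510', 1), ('#0a0a12', 1)]
```

The last example shows why GPT registers zero hardcoded background hexes: the declaration itself only ever references a variable.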
Background color patterns by model (Control-Sterile)
- Gemini: #050510 in 31/60 sterile exhibits
- Claude: #0a0a12 in 33/60 sterile exhibits
- GPT: CSS custom properties in all sterile exhibits
Iteration introduces light backgrounds for the first time (12% of FI exhibits). But dark remains dominant at 82.7%. The near-black default is the most resilient attractor in the dataset, surviving even when iteration breaks Canvas 2D convergence.
06.5 Code Volume
Lines of code by model and condition (median, SD)

| Model | Control-Sterile | Forced-Iteration |
|---|---|---|
| GPT | median 971, SD 265 | median 799, SD 281 |
| Claude | median 353, SD 92 | median 402, SD 73 |
| Gemini | median 175, SD 64 | median 254, SD 85 |
GPT writes 2.5x more code than Claude and 5x more than Gemini. The rank order is identical to Batch 2. Iteration makes Claude and Gemini write more code (conceptual uplift) but makes GPT write less (compression). GPT's sterile control exhibits average 972 lines of sprawling logic applications. Iteration trims them to 774.
Model effect: F = 464.5, p < 0.0001, η² = 0.70. No main condition effect (F = 0.57, p = 0.45), but a significant interaction (F = 23.4, p < 0.0001). The direction of iteration's effect on code volume depends on the model.
07 What This Means
Three batches, 1,491 exhibits, five prompt conditions, two isolation levels. The picture is clear.
1. Convergence is model-intrinsic, not environmental.
Sterile workspaces did not reduce convergence. They increased it. Canvas 2D went from 50.7% (Batch 2 control, gallery context present) to 74.7% (Batch 3, no context at all). The defaults live in the weights.
2. Removing context increases convergence, not diversity.
This is the opposite of what you might expect. Gallery context, CLAUDE.md, the exhibit registry: these gave agents signal that slightly diversified output. Without them, models fall back harder on training defaults.
3. Iteration works through mechanical rebuild, not cognitive reflection.
Gemini has the shallowest self-critique and the largest behavioral shift. Claude has the deepest reflection and the smallest shift. The "delete and redo" instruction matters more than the quality of the self-assessment. Explicit rebuild instructions may outperform "reflect and improve" instructions.
4. Title fixation magnitude is invariant across environments.
Claude's entropy is 0.643 in sterile conditions and 0.646 in non-sterile Batch 2. The word changes ("Drift" vs "Tidal Memory"). The degree of fixation does not. This is a model-level constant, not an artifact of prompt framing or environmental context.
5. Models have distinct, characterizable iteration personalities.
Claude self-flagellates and pivots to secondary attractors. GPT compresses and refocuses. Gemini bulldozes and rebuilds. These patterns are stable and predictable. They suggest that intervention design should be model-specific, not one-size-fits-all.
The bottom line
AI creative convergence is not a prompting problem. It is a training problem. The defaults are in the weights. You can work around them with prohibitions, forced iteration, and mechanical rebuilds. But left to their own devices, these models will draw the same thing every time.
Analysis by Claude Opus 4.6
Automated analysis pipeline + manual review of 352 exhibits across 2 conditions in sterile workspaces. Source data and analysis scripts in the project repository.