Sterile Replication: What Models Default To When Nothing Else Exists

352 exhibits, 3 models, 2 conditions, zero cross-contamination. Running models in fully isolated workspaces confirmed that creative convergence is model-intrinsic.

Apr 12, 2026352 exhibits3 models × 2 conditions

Batches 1 and 2 found that AI models converge on the same creative output. But those experiments had a flaw: agents could read CLAUDE.md, the gallery shell, and the exhibit registry. Maybe the convergence was environmental, not intrinsic. Batch 3 removed everything.

Each agent ran in a sterile temp directory. No CLAUDE.md. No gallery context. No other exhibits. No registry. No model name in the prompt. Just sandbox constraints and the instruction to build. If convergence survived, it would be proof that the defaults live inside the models themselves.

Convergence survived.

01The Design

353

exhibits

models

conditions

cross-contamination

A 3×2 factorial design. Three models (Claude Opus 4.6, GPT 5.2, Gemini 3 Pro) crossed with two conditions (Control-Sterile, Forced-Iteration). Target: 60 exhibits per cell, 360 total. Seven failures left us with 353 completed exhibits. One of those (a GPT Forced-Iteration run) was audit-rejected for self-contamination and held out of the public dataset, so analyses use N=353 and the dataset publishes N=352.

CSControl-Sterile

Build one exhibit with creative freedom. No gallery context, no design tokens, no other exhibits visible. The purest baseline we can construct.

FIForced-Iteration

Build a first draft, answer four self-critique questions, delete everything, rebuild from scratch. Tests whether a second pass produces divergence.

Exhibits by model

Claude Opus 4.6120

Gemini 3 Pro120

GPT 5.2113

7 failures (6 CS, 1 FI)

Exhibits by condition

Control-Sterile174

Build one exhibit, no context

Forced-Iteration179

Build, self-critique, delete, rebuild

Sterile workspaces contained only a constraints file, the frozen prompt, and the output directory. Prompts were hashed before execution. The extractor was frozen. Post-run audits checked every file read across all 353 sessions. One GPT Forced-Iteration session (slug q4z-hgl) read its own output directory and was held out of the public dataset. Every other session passed clean. No external-context contamination, no confound violations.

02Canvas 2D: The Universal Default

In sterile control conditions, 74.7% of exhibits used Canvas 2D. That is higher than Batch 2's control (50.7%), not lower. Removing context increased convergence.

Canvas 2D across batches

Batch 178.9%

388 exhibits, no isolation

Batch 3 CS74.7%

174 exhibits, sterile

Batch 2 Control50.7%

150 exhibits, gallery context

Batch 2 Iteration41.3%

150 exhibits, non-sterile

Batch 3 FI39.1%

179 exhibits, sterile

Claude hit 100% in its sterile control cell. Sixty out of sixty exhibits used Canvas 2D. That is the strongest single-cell attractor ever measured in this project. No exceptions, no variation.

Canvas 2D by model (all conditions pooled)

Claude90.8%

109/120

Gemini57.5%

69/120

GPT19.5%

22/113

Canvas 2D by model and condition

Claude / CS100%

60/60

Gemini / CS90%

54/60

Claude / FI81.7%

49/60

GPT / CS29.6%

16/54

Gemini / FI25%

15/60

GPT / FI10.2%

6/59

The model effect is massive: chi2(2) = 120.75, p < 0.0001, Cramer's V = 0.585. Claude defaults to Canvas 2D with near-certainty. GPT actively avoids it, preferring DOM-based application UIs with SVG and keyboard interaction. Gemini falls in between, with a strong but brittle attractor.

03The Iteration Effect

Forced iteration dropped Canvas 2D from 74.7% to 39.1% (p < 0.0001). This closely replicates Batch 2's iteration effect (41.3%, statistically indistinguishable).

Canvas 2D by condition

Control-Sterile74.7%

130/174

Forced-Iteration39.1%

70/179

Batch 3 Forced-Iteration (39.1%) and Batch 2 Condition E (41.3%) are statistically indistinguishable (p = 0.68). The iteration effect is the same whether agents have gallery context or not. Environment does not moderate it.

Canvas 2D drop by model (percentage points)

Gemini65pp

90.0% → 25.0%

GPT19.5pp

29.6% → 10.2%

Claude18.3pp

100.0% → 81.7%

Gemini shows the largest drop: 90% to 25%, a 65 percentage point shift. Claude drops only 18 points (100% to 82%), clinging to its default even after self-critique. GPT drops 19.5 points from an already-low baseline.

Web Audio surged from 8.0% to 29.6% under iteration (p < 0.0001). SVG went from 3.4% to 12.3%. WebGL appeared for the first time at 3.4%. Mouse and keyboard interaction rates stayed flat. Iteration changes the rendering technology, not the interaction model.

Technology adoption by condition

Technology	CS	FI	p
Canvas 2D	74.7%	39.1%	< 0.0001
SVG	3.4%	12.3%	0.002
WebGL	0.0%	3.4%	0.015
Web Audio	8.0%	29.6%	< 0.0001
Three.js	0.0%	0.6%	0.323 (n.s.)
Keyboard	24.1%	19.6%	0.297 (n.s.)
Mouse	55.7%	58.7%	0.580 (n.s.)

04Title Fixation

Claude's title entropy in sterile control is 0.643. In Batch 2, it was 0.646. The specific word changes. The fixation magnitude does not.

Title entropy (normalized) by condition and model

Condition	Claude	GPT	Gemini
Control-Sterile	0.643	0.909	0.911
Forced-Iteration	0.863	0.932	1.000

"Drift" replaces "Tidal Memory" as Claude's sterile default. Twenty-two of sixty control exhibits share that title. The erosion theme only emerges as a secondary attractor under iteration pressure, not as the natural first choice.

Gemini achieves perfect entropy (1.000) under iteration. Sixty out of sixty unique titles. In control, it already scores 0.911. GPT is diverse at baseline (0.909) and barely moves under iteration (0.932, p = 0.549, not significant).

Claude top titles

Drift22

Control-Sterile

Erosion6

Forced-Iteration

Watershed5

Forced-Iteration

Substrate4

Forced-Iteration

GPT top titles

Signal Garden6

Control-Sterile

Back-and-Forth6

Forced-Iteration

Gemini top titles

Neon Swarm4

Control-Sterile

The pattern is consistent across batches. Claude fixates heavily on one title. GPT repeats moderately. Gemini barely repeats at all. The magnitude is a model constant. The specific words are not.

05Three Ways to Iterate

The three models approach self-critique through fundamentally different mechanisms. The depth of reflection does not predict the magnitude of change.

Claude Opus 4.6The Self-Flagellating Craftsman

Structured, bulleted self-critiques in 88% of sessions. Names the specific technology it defaulted to in 97% of cases. Every session changed concept between v1 and v2. But the behavioral shift is the smallest of the three models: only an 18 percentage point drop in Canvas 2D. Deep reflection, modest change. V2 titles cluster around geological metaphors: "Erosion," "Watershed," "Substrate."

Canvas 2D drop: 18.3pp. Title entropy gain: +0.220.

GPT 5.2The Efficient Professional

Compresses critique into a single sentence. Never uses bullet formatting. Describes v2 in the same breath as the critique. Only 25% named specific technologies. But GPT's v1 is already diverse (57/59 unique titles), so there is less to critique. Iteration pushes GPT toward its secondary attractor: formal logic tools. LOC drops from 972 to 774 as iteration trims sprawling application code.

Canvas 2D drop: 19.5pp. Title entropy gain: +0.023.

Gemini 3 ProThe Obedient Rebuilder

Critique exists only in internal reasoning traces. Describes the process of critiquing rather than the substance. File deletion in 68% of sessions, mechanically rebuilding from scratch. Despite the shallowest reflection, Gemini achieves the largest transformation: Canvas 2D drops 65 percentage points. Perfect title uniqueness (60/60). Gemini's own Web Audio rate jumps from 0% to 33% (driving most of the pooled 8.0% to 29.6% shift noted above).

Canvas 2D drop: 65.0pp. Title entropy gain: +0.089.

The paradox

Iteration effectiveness is inversely correlated with critique depth. Gemini has the shallowest critique but the largest behavioral shift. The forced rebuild instruction works because it forces a second pass, not because it forces genuine reflection. The mechanism is mechanical (delete and redo), not cognitive (understand and improve).

06The Training Signal

Dark backgrounds are universal in sterile control. 99.4% of Control-Sterile exhibits use dark backgrounds. But how they get there differs.

Background color distribution, Control-Sterile

Dark99.4%

Other0.6%

Background color distribution, Forced-Iteration

Dark82.7%

Light12%

Other5.3%

The hex #050510 appeared in 51.7% of Gemini sterile exhibits. Zero Claude exhibits. Zero GPT exhibits. In Batch 1, this hex showed up across all models, but that was CLAUDE.md contamination. The gallery's design token was leaking into agent output.

In sterile conditions, only Gemini uses it. The hex is baked into Gemini's training data. Claude's sterile default is #0a0a12, a slightly different dark blue-black (33/60 exhibits). GPT uses CSS custom properties that resolve to dark, never hardcoding a specific hex.

Background color patterns by model (Control-Sterile)

Gemini #05051051.7%

31/60 sterile exhibits

Claude #0a0a1255%

33/60 sterile exhibits

GPT CSS vars100%

all GPT use custom properties

Iteration introduces light backgrounds for the first time (12% of FI exhibits). But dark remains dominant at 82.7%. The near-black default is the most resilient attractor in the dataset, surviving even when iteration breaks Canvas 2D convergence.

06.5Code Volume

Average lines of code by model and condition

GPT / CS972

median 971, SD 265

GPT / FI774

median 799, SD 281

Claude / FI398

median 402, SD 73

Claude / CS360

median 353, SD 92

Gemini / FI265

median 254, SD 85

Gemini / CS184

median 175, SD 64

GPT writes 2.5x more code than Claude and 5x more than Gemini. The rank order is identical to Batch 2. Iteration makes Claude and Gemini write more code (conceptual uplift) but makes GPT write less (compression). GPT's sterile control exhibits average 972 lines of sprawling logic applications. Iteration trims them to 774.

Model effect: F = 464.5, p < 0.0001, eta-squared = 0.70. No main condition effect (F = 0.57, p = 0.45), but a significant interaction (F = 23.4, p < 0.0001). The direction of iteration's effect on code volume depends on the model.

07What This Means

Three batches, 1,490 exhibits, five prompt conditions, two isolation levels. The picture is clear.

1.Convergence is model-intrinsic, not environmental.

Sterile workspaces did not reduce convergence. They increased it. Canvas 2D went from 50.7% (Batch 2 control, gallery context present) to 74.7% (Batch 3, no context at all). The defaults live in the weights.

2.Removing context increases convergence, not diversity.

This is the opposite of what you might expect. Gallery context, CLAUDE.md, the exhibit registry: these gave agents signal that slightly diversified output. Without them, models fall back harder on training defaults.

3.Iteration works through mechanical rebuild, not cognitive reflection.

Gemini has the shallowest self-critique and the largest behavioral shift. Claude has the deepest reflection and the smallest shift. The "delete and redo" instruction matters more than the quality of the self-assessment. Explicit rebuild instructions may outperform "reflect and improve" instructions.

4.Title fixation magnitude is invariant across environments.

Claude's entropy is 0.643 in sterile conditions and 0.646 in non-sterile Batch 2. The word changes ("Drift" vs "Tidal Memory"). The degree of fixation does not. This is a model-level constant, not an artifact of prompt framing or environmental context.

5.Models have distinct, characterizable iteration personalities.

Claude self-flagellates and pivots to secondary attractors. GPT compresses and refocuses. Gemini bulldozes and rebuilds. These patterns are stable and predictable. They suggest that intervention design should be model-specific, not one-size-fits-all.

The bottom line

AI creative convergence is not a prompting problem. It is a training problem. The defaults are in the weights. You can work around them with prohibitions, forced iteration, and mechanical rebuilds. But left to their own devices, these models will draw the same thing every time.