Log

A record of how the gallery works, what changed, and why. Methodology should be as visible as the output.

Insight

Batch 002 Findings: What Moves Models and What Doesn't

Published findings from 750 exhibits across 5 prompt conditions. Prohibition works. Suggestion does not. Self-reflection breaks fixation. Model identity persists through everything.

Batch 002 was a controlled ablation study: 3 models, 5 prompt conditions, 50 exhibits per cell. The question was whether the convergence patterns from Batch 001 were prompt-driven or model-intrinsic. The short answer is both, but mostly intrinsic.

Condition C (explicit prohibition of Canvas 2D and dark backgrounds) dropped Canvas usage from 50.7% to 1.3%. SVG surged from 0% to 67.3%. Condition D (expanded technology descriptions) produced near-zero change and actually made Claude's "Tidal Memory" fixation worse: 19 instances versus 14 in the Control. Condition E (forced self-critique before building) nearly eliminated the fixation: of the 53 "Tidal Memory" titles across the dataset, only 1 appeared under it.

The clearest result: model-level signatures (Opus's erosion themes, GPT's semantic HTML tools, Gemini's high title diversity) persisted through all five conditions. The prompt can change the rendering technology. It cannot change what the model wants to say. Full data at /findings/batch-002.

Methodology

Batch 002 Design: Prompt Ablation Study

Designed a 750-exhibit ablation study with 5 prompt conditions to isolate which factors drive creative convergence.

Batch 001 showed convergence but could not explain it. Every model received the same prompt, so we could not distinguish whether the patterns came from the prompt scaffolding, the technology defaults, or the models themselves. Batch 002 was designed to pull those apart.

Five conditions, each testing a different mechanism. Control (A): same prompt as Batch 001 minus the CLAUDE.md confound. Stripped (B): bare minimum, one sentence about available APIs. Anti-Default (C): explicit prohibition of Canvas 2D and dark backgrounds. Expanded Awareness (D): rich descriptions of alternative technologies. Forced Iteration (E): required self-critique and rebuild from scratch before submission.

Three models (Opus, GPT 5.2, Gemini 3 Pro), 50 exhibits per cell, 750 total. The cell size gives enough samples for per-condition chi-squared testing. Each condition isolates a single variable while holding everything else constant.
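The per-condition test is an ordinary Pearson chi-squared over a model-by-technology contingency table. A minimal sketch in plain JavaScript, with hypothetical counts standing in for the real cells:

```javascript
// Minimal Pearson chi-squared statistic over a contingency table of
// observed counts (rows: models, columns: rendering technologies).
// The counts below are hypothetical, not Batch 002 data.
function chiSquared(table) {
  const rowTotals = table.map((row) => row.reduce((a, b) => a + b, 0));
  const colTotals = table[0].map((_, j) =>
    table.reduce((sum, row) => sum + row[j], 0)
  );
  const grand = rowTotals.reduce((a, b) => a + b, 0);
  let stat = 0;
  for (let i = 0; i < table.length; i++) {
    for (let j = 0; j < table[i].length; j++) {
      const expected = (rowTotals[i] * colTotals[j]) / grand;
      stat += (table[i][j] - expected) ** 2 / expected;
    }
  }
  const df = (table.length - 1) * (table[0].length - 1);
  return { stat, df };
}

// Hypothetical counts: 3 models x 3 technologies within one condition.
const observed = [
  [40, 5, 5],   // model A: Canvas, SVG, DOM
  [10, 30, 10], // model B
  [15, 15, 20], // model C
];
console.log(chiSquared(observed));
```

With 3 models and 3 technologies, the statistic has (3-1)(3-1) = 4 degrees of freedom; the p-value lookup is left to a stats library.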

Methodology

Hiding CLAUDE.md During Batch Execution

The batch pipeline now temporarily renames CLAUDE.md before spawning agents and restores it afterward. Agents cannot discover what they cannot find.

Even with EXHIBIT_CONSTRAINTS.md replacing CLAUDE.md in the preamble, agents running in a Cursor session could still discover and read CLAUDE.md through autonomous file exploration. The preamble says "don't read it," but 91.2% of Batch 001 agents read it anyway. Instructions are not enforcement.

The fix: cmdRun() now renames CLAUDE.md to .CLAUDE.md.batch-backup before spawning any agents, inside a try/finally block that guarantees restoration even if the batch crashes mid-run. This removes the file from the directory tree entirely during execution.
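The mechanism can be sketched as a small wrapper. withClaudeMdHidden and its run callback are hypothetical names, but the backup filename and the try/finally restoration guarantee match the description above:

```javascript
import { existsSync, renameSync } from "node:fs";

// Sketch of the hide/restore mechanism: rename CLAUDE.md out of the
// way before agents spawn, restore it no matter how the batch exits.
// The wrapper and callback names are illustrative, not the real API.
async function withClaudeMdHidden(run) {
  const original = "CLAUDE.md";
  const backup = ".CLAUDE.md.batch-backup";
  const present = existsSync(original);
  if (present) renameSync(original, backup);
  try {
    // Agents spawn here; the file is absent from the directory tree.
    return await run();
  } finally {
    // Restore even if the batch crashes mid-run.
    if (present) renameSync(backup, original);
  }
}
```

The try/finally is the load-bearing part: restoration does not depend on the batch completing, only on the process surviving long enough to unwind.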

The result for Batch 002: zero CLAUDE.md reads across all 750 exhibits. The contamination vector is eliminated mechanically rather than relying on agent compliance.

Methodology

EXHIBIT_CONSTRAINTS.md

Created an aesthetic-free constraint document for batch agents, replacing CLAUDE.md as the agent-facing reference.

CLAUDE.md serves two audiences: human contributors and AI agents. For human contributors, it documents the gallery's design system, including specific hex colors (#050510), font names (Geist), and aesthetic direction ("dark, minimal, museum aesthetic"). For agents building exhibits, none of that should matter. Exhibits have complete creative freedom.

EXHIBIT_CONSTRAINTS.md strips all gallery-specific design tokens and retains only what agents need: iframe sandbox rules, file structure requirements, responsive design guidance, the registry format, and creative isolation rules. No colors. No fonts. No aesthetic framing.

This separation lets us test a clean question: when agents converge on dark backgrounds, is it because they read a document that said "dark" or because dark is their intrinsic default? With Batch 001, we could not tell. With Batch 002 and this document, we can.

Feature

Audit Infrastructure for Agent File Reads

Built a log parser that classifies every file access by every agent as allowed, confound, contamination, or neutral. Creative isolation is now mechanically verifiable.

Cursor Agent logs every file read in NDJSON format with exact paths, timestamps, and byte counts. The new audit-reads.mjs script parses these logs and classifies each access. Allowed: the agent's own exhibit files, EXHIBIT_CONSTRAINTS.md, the type definition from exhibits.ts. Confound: CLAUDE.md, .cursorrules, gallery shell source. Contamination: other exhibits' files or the full registry. Neutral: package.json, node_modules, unrelated config.
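The classification step reduces to pattern-matching each logged path. A sketch, assuming normalized repo-relative paths; the exact rules (including how the audit distinguishes a type-only read of exhibits.ts from a full-registry read) are simplified here:

```javascript
// Illustrative version of the per-access classifier. Path prefixes
// like "src/app/" for the gallery shell are assumptions about the
// repo layout, not confirmed rules from audit-reads.mjs.
function classifyRead(path, exhibitSlug) {
  if (
    path.startsWith(`exhibits/${exhibitSlug}/`) ||
    path === "EXHIBIT_CONSTRAINTS.md" ||
    path === "src/lib/exhibits.ts" // type definition; registry reads simplified here
  ) {
    return "allowed";
  }
  if (
    path === "CLAUDE.md" ||
    path === ".cursorrules" ||
    path.startsWith("src/app/") // gallery shell source (assumed location)
  ) {
    return "confound";
  }
  if (path.startsWith("exhibits/")) {
    return "contamination"; // another exhibit's files
  }
  return "neutral"; // package.json, node_modules, unrelated config
}
```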

The audit runs automatically after each agent finishes. Results are stored per-exhibit and rolled up into batch summaries. For Batch 002, 720 of 750 exhibits (96%) passed with zero isolation violations.

This matters because creative isolation was previously asserted, not verified. The preamble told agents not to read other exhibits. Now we can prove whether they listened. The 4% that failed in Batch 002 are flagged and their violations are documented.

Insight

Discovering the CLAUDE.md Confound

Auditing Batch 001 agent logs revealed that 91.2% of agents read CLAUDE.md before building their exhibits. That file contained the gallery's design tokens, including the exact background hex we were measuring.

While building the audit infrastructure for Batch 002, we ran a retroactive analysis of all 388 parseable Batch 001 agent logs. The result: 354 agents (91.2%) read CLAUDE.md. The old preamble explicitly instructed agents to read it for technical constraints; we had not realized the file also contained the gallery's visual design system.

CLAUDE.md includes #050510 (the gallery background), Geist font references, and the phrase "dark, minimal, gallery/museum aesthetic." All of these appeared in the data we were analyzing as evidence of convergence. We had a contamination problem in the data we were calling findings.

The good news: 0 of 388 agents read other exhibits' source files, so creative isolation held. The confound is limited to aesthetic tokens from CLAUDE.md, not cross-exhibit contamination. Batch 001 findings on thematic attractors and title repetition are unaffected. The dark-background convergence finding required retesting under clean conditions, which became the primary motivation for Batch 002.

Feature

Blog Launch

Added a blog to the site with three initial posts covering the most accessible findings from Batch 001.

The blog uses the same registry pattern as findings and the log: a TypeScript array in src/lib/posts.ts with slug, title, date, tags, and a published flag. Posts are full Next.js pages with per-post metadata for SEO.
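The registry pattern looks roughly like this. The real src/lib/posts.ts is TypeScript; this plain-JavaScript sketch keeps the same shape, and the slug, date, and helper function are illustrative:

```javascript
// Sketch of the post registry shape described above. The slug and
// date values are hypothetical; only the field names follow the log.
const posts = [
  {
    slug: "every-ai-draws-the-same-thing", // hypothetical slug
    title: "Every AI Draws the Same Thing",
    date: "2025-01-01", // hypothetical date
    tags: ["findings", "batch-001"],
    published: true,
  },
];

// Only published posts render; drafts stay in the array, newest first.
function publishedPosts(all) {
  return all
    .filter((p) => p.published)
    .sort((a, b) => b.date.localeCompare(a.date));
}
```

Keeping drafts in the same array with a published flag means the registry is the single source of truth for both the index page and per-post static generation.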

Three initial posts: "Every AI Draws the Same Thing" covers the Canvas 2D convergence finding. "Grok Only Wants to Talk About Truth" dives into one model's fixation on a single theme. "An AI Pretended to Be a Different AI" documents the identity anomaly in exhibit g1, where Opus identified itself as Gemini.

The blog exists because findings pages serve researchers but not general audiences. The same data, told as a story with a hook, reaches a different audience. Navigation updated to include Blog alongside Gallery, Log, Findings, and About.

Insight

Batch 001 Findings Published

Published the first quantitative findings report. 407 exhibits, 5 model families, and several results surprising enough to prompt immediate methodology changes.

The analysis covered all 407 Batch 001 exhibits using automated metric extraction (regex-based technology detection, line counts, title analysis) and formal statistical tests. Canvas 2D particle systems appeared in 78.6% of all exhibits. Zero exhibits used WebGL, SVG, or Three.js. The convergence was not anecdotal.

Each model showed a measurable aesthetic attractor. Opus gravitated toward erosion and time. Sonnet toward language and semantics. GPT 5.2 toward logic and games. Grok toward truth and philosophy. Kimi toward resonance fields. Title entropy quantified the repetition formally: Claude's normalized entropy was 0.65, Gemini's was 0.96.
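Normalized title entropy is plain Shannon entropy over title frequencies, divided by the maximum possible value so that 0 means every title is identical and 1 means all titles are distinct. A sketch, with illustrative titles:

```javascript
// Shannon entropy of title frequencies, normalized by log2(n) so the
// score lands in [0, 1]. This normalization choice is one common
// convention; the titles below are illustrative, not gallery data.
function normalizedTitleEntropy(titles) {
  const counts = new Map();
  for (const t of titles) counts.set(t, (counts.get(t) ?? 0) + 1);
  const n = titles.length;
  let entropy = 0;
  for (const c of counts.values()) {
    const p = c / n;
    entropy -= p * Math.log2(p);
  }
  // log2(n) is the entropy of n all-distinct titles.
  return n > 1 ? entropy / Math.log2(n) : 0;
}

console.log(normalizedTitleEntropy(["Tidal Memory", "Tidal Memory", "Erosion", "Drift"])); // → 0.75
```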

Two statistical tests anchored the analysis. Chi-squared on model versus technology choice: p<0.001, significant. One-way ANOVA on lines of code by model: F=27.8, p<0.001. GPT produced the highest average LOC (651), Gemini the lowest (257). Full report at /findings/batch-001.
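The one-way ANOVA reduces to an F ratio of between-group to within-group variance. A minimal sketch with illustrative line-count groups, not the actual data:

```javascript
// One-way ANOVA F statistic: mean square between groups over mean
// square within groups. Input is an array of per-model value arrays
// (e.g. lines of code per exhibit); the numbers in tests are made up.
function oneWayAnovaF(groups) {
  const all = groups.flat();
  const grandMean = all.reduce((a, b) => a + b, 0) / all.length;
  let ssBetween = 0;
  let ssWithin = 0;
  for (const g of groups) {
    const mean = g.reduce((a, b) => a + b, 0) / g.length;
    ssBetween += g.length * (mean - grandMean) ** 2;
    for (const v of g) ssWithin += (v - mean) ** 2;
  }
  const dfBetween = groups.length - 1;
  const dfWithin = all.length - groups.length;
  return (ssBetween / dfBetween) / (ssWithin / dfWithin);
}
```

A large F means the per-model means differ far more than chance within-model spread would predict, which is what the F=27.8 result reports.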

Feature

Analysis Pipeline for Batch 001

Built static analysis tools to extract metrics from all 407 exhibits. Technology detection, line counts, Shannon entropy, chi-squared, and ANOVA, plus a screenshot pipeline.

Three scripts form the analysis pipeline. analyze-exhibit.mjs uses regex heuristics to classify each exhibit's technology stack (Canvas 2D, WebGL, SVG, Web Audio, Three.js), count lines of code, detect background colors, and extract title patterns. compute-statistics.mjs runs the formal tests: chi-squared for model-technology association, one-way ANOVA for LOC differences across models, and normalized Shannon entropy for title diversity.
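The detection step amounts to matching a handful of regexes against each exhibit's source. These patterns are illustrative approximations, not the exact heuristics in analyze-exhibit.mjs:

```javascript
// Illustrative regex heuristics for technology classification. Real
// exhibit code can defeat simple patterns (dynamic strings, minified
// bundles), which is why these are approximations.
const TECH_PATTERNS = {
  canvas2d: /getContext\(\s*['"]2d['"]\s*\)/,
  webgl: /getContext\(\s*['"](webgl2?|experimental-webgl)['"]\s*\)/,
  svg: /<svg[\s>]/i,
  webAudio: /\bnew\s+(AudioContext|webkitAudioContext)\b/,
  threejs: /\bTHREE\.\w+|from\s+['"]three['"]/,
};

function detectTechnologies(source) {
  return Object.entries(TECH_PATTERNS)
    .filter(([, re]) => re.test(source))
    .map(([name]) => name);
}
```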

The screenshot pipeline uses Playwright to capture each exhibit at 1280x800, generate per-exhibit thumbnails, and compose grid images for publication. Batch processing handles all 407 exhibits without manual intervention.

All outputs land in .batch/analysis/ as individual JSON files per exhibit, plus summary JSON, CSV, and statistics files. The pipeline is designed to be re-run on future batches with no modification. The goal is reproducible, automated analysis that scales with the gallery.

Feature

Creation Metrics & Gallery Filters

Added token tracking, cost data, and gallery filtering to surface patterns across models and tools.

Every exhibit now tracks which tool built it and whether creative isolation guardrails were enforced. Exhibits with detailed creation metrics (token counts, cost, duration, number of agentic turns) surface that data publicly.

The gallery page now supports filtering by model family, creation tool, and whether metrics are available. The goal is straightforward: make it easy to compare what different models produce under different conditions, and to do so transparently.
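The filter logic itself is simple. Field names like model, tool, and metrics are assumptions about the exhibit records rather than the actual schema:

```javascript
// Sketch of the gallery filter: each criterion is optional, and an
// omitted criterion matches everything. Field names are assumed.
function filterExhibits(exhibits, { model, tool, hasMetrics } = {}) {
  return exhibits.filter(
    (e) =>
      (model == null || e.model === model) &&
      (tool == null || e.tool === tool) &&
      (hasMetrics == null || (e.metrics != null) === hasMetrics)
  );
}
```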

This is infrastructure for a longer-term question. As the gallery grows, we want to understand whether tool choice influences creative output, whether token budgets correlate with complexity, and whether guardrails change the character of what gets built.

Methodology

Creative Isolation Guardrails

Enforced rules preventing models from seeing other exhibits during creation. Documented in both CLAUDE.md and Cursor rules.

Starting with the fifth exhibit, every model works under creative isolation: it cannot read, browse, or reference any other exhibit in the gallery. The rules are enforced through tool-level configuration (CLAUDE.md for Claude Code sessions, .cursorrules for Cursor sessions) and are checked during validation.

The motivation is contamination prevention. If a model can see what others have built, its output becomes reactive rather than autonomous. It might imitate, differentiate, or reference, all of which compromise the premise that each exhibit represents independent creative signal.

The first four exhibits (Claude Theory, VOID, Murmuration, Phosphor) were built before these guardrails existed. They were created iteratively in Claude Code with access to the full project context. This is documented honestly. They represent a different methodological condition, and that matters when comparing outputs.

Methodology

Origin Exhibits

The first four exhibits were built iteratively in Claude Code without creative isolation. They established the gallery but predate the current methodology.

Claude Theory, VOID, Murmuration, and Phosphor are the origin exhibits. All four were built by Claude Opus 4.6 in Claude Code over the course of a week. They were created without creative isolation guardrails, so the model had access to the full project, including other exhibits as they were built.

These exhibits established the gallery's technical foundation and proved the premise was viable: give a model creative freedom and the output is genuinely interesting. But they also represent a different experimental condition than later exhibits.

We keep them in the gallery and label them honestly. They are not lesser work; they are earlier work, produced under different constraints. The distinction matters for anyone trying to draw conclusions from the collection as a whole.