
[codex] Add content lock for caption fidelity#119

Draft
yha9806 wants to merge 4 commits into master from codex/caption-fidelity-content-lock-v1

Conversation


@yha9806 yha9806 commented May 8, 2026

Summary

This draft PR adds a content-lock and artifact-boundary path for caption-driven vulca create runs.

The important product/challenge boundary is now explicit:

  • Vulca product repair is real: f4abc007 fixed known failure modes such as 0002 shanshui drift, 0064/0074 generate crashes, and 0301 gallery-photo/sample-id collapse.
  • Track1 submission replacement is still rejected: the current baseline remains cleaner and more stable; no Vulca-generated image has been accepted into the challenge submission package.

Current Track1 submission artifacts remain untouched and valid.

Problem / Root Cause

The original runtime was too style-first: cultural guidance could override explicit caption content, and evaluation could still allow high scores when required content was missing.

Dogfood exposed a second boundary failure: Vulca sometimes treated the requested artwork as an object to display in a scene, producing gallery walls, museum/installations, framed mockups, catalog layouts, visible sample IDs, or unrequested labels instead of the artwork itself.

Concrete observed failures:

  • track1_0002: bamboo/orchid/calligraphy drifted into generic shanshui before content-lock.
  • track1_0064 / track1_0074: Gemini returned no iterable image parts, causing generate-node NoneType failures before hardening.
  • track1_0301: graph-paper branching drawing collapsed into a gallery photo with visible TRACK1_0301; after f4abc007 the category recovered, but unwanted English labels remained.
  • track1_0151 / track1_0728: selective regeneration of poster-like captions became gallery/mockup scenes and was rejected.

Fix

The PR now does three things:

  1. Content lock

    • Extracts explicit required subjects, text/seal elements, surfaces/materials, style attributes, and mood/composition constraints.
    • Prepends non-negotiable content requirements before cultural guidance.
    • Adds VLM missing-content fields and caps the score at 0.25 when required content is absent.
  2. Artifact boundary

    • Adds output_is_artwork_itself as an SDK/CLI semantic flag.
    • --content-lock now also enables the artifact boundary by default.
    • Adds vulca create --output-is-artwork-itself for boundary-only use.
    • Generation prompt now starts with ARTIFACT BOUNDARY REQUIREMENT, requiring the output to be the artwork surface itself, not a photo/display/mockup.
    • Adds poster-specific guidance: flat, front-facing poster artwork, not a poster hanging on a wall.
    • Adds scroll/album guidance: render the scroll/album-leaf artwork surface, not a gallery wall/catalog spread/framed display.
  3. Evaluation hardening

    • VLM scoring now requests:
      • forbidden_visual_artifacts
      • unwanted_visible_text
      • output_is_artwork_itself
    • The gate caps weighted_total to 0.25 and adds content_fidelity_failed when it sees gallery/photo/mockup artifacts, unwanted visible text, or an output that is not the artwork itself.
    • Gemini provider parsing now handles candidate.content.parts is None as “no image data” instead of crashing on raw NoneType iteration.
    • Sample-like IDs such as track1_0301 are not passed to the image provider as Subject: and are suppressed in mock fallback SVGs.
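The two core mechanisms above, prompt ordering and the fidelity gate, can be sketched as follows. This is a hypothetical illustration of the described behavior; the actual function and field names in src/vulca/ may differ.

```python
from typing import Optional

def build_generation_prompt(content_lock: str,
                            cultural_guidance: str,
                            artifact_boundary: Optional[str] = None) -> str:
    """Assemble the prompt with non-negotiable content before style guidance."""
    sections = []
    if artifact_boundary:
        # The ARTIFACT BOUNDARY REQUIREMENT comes before everything else.
        sections.append(artifact_boundary)
    sections.append(content_lock)       # explicit caption requirements
    sections.append(cultural_guidance)  # cultural/style guidance comes last
    return "\n\n".join(sections)

def apply_content_fidelity_gate(scores: dict) -> dict:
    """Cap the weighted total at 0.25 when required content is missing,
    forbidden artifacts or unwanted text are present, or the output is
    not the artwork itself."""
    violated = (
        scores.get("missing_content")
        or scores.get("forbidden_visual_artifacts")
        or scores.get("unwanted_visible_text")
        or scores.get("output_is_artwork_itself") is False
    )
    if violated:
        scores["weighted_total"] = min(scores.get("weighted_total", 0.0), 0.25)
        scores.setdefault("risk_flags", []).append("content_fidelity_failed")
    return scores
```

The key design point is that the gate is monotone: a violation can only lower the score, so a stylistically polished gallery mockup cannot outrank a plainer output that actually contains the requested content.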

Validation

Latest validation after ffcf85a6:

  • Red/green artifact-boundary tests: poster, scroll/album, VLM gate, SDK flag, CLI flag, and GenerateNode prompt ordering.
  • PYTHONPATH=src pytest tests/test_content_lock.py tests/test_gemini_image_size.py tests/test_evaluate.py tests/test_cli_create_output.py::TestCreateOutputParam::test_create_help_has_output_param tests/test_cli_create_output.py::TestCreateOutputParam::test_create_cli_accepts_content_lock_flag tests/test_cli_create_output.py::TestCreateOutputParam::test_create_cli_accepts_output_is_artwork_itself_flag tests/test_create_hitl.py::TestCreateHITL::test_create_accepts_content_lock_argument tests/test_create_hitl.py::TestCreateHITL::test_create_accepts_output_is_artwork_itself_argument tests/test_pipeline_engine.py::TestGenerateNode::test_mock_generate_suppresses_sample_id_text tests/test_pipeline_engine.py::TestGenerateNode::test_generate_node_puts_content_lock_before_cultural_guidance tests/test_pipeline_engine.py::TestGenerateNode::test_generate_node_does_not_send_sample_id_as_provider_subject_with_content_lock tests/test_pipeline_engine.py::TestGenerateNode::test_generate_node_puts_artifact_boundary_before_content_requirements -> 78 passed.
  • ruff check src/vulca/content_lock.py src/vulca/create.py src/vulca/cli.py src/vulca/pipeline/nodes/generate.py src/vulca/providers/gemini.py tests/test_content_lock.py tests/test_pipeline_engine.py tests/test_cli_create_output.py tests/test_create_hitl.py tests/test_gemini_image_size.py -> passed.
  • git diff --check -> passed.
  • CLI help confirms --content-lock and --output-is-artwork-itself are exposed.

Known local environment notes:

  • Running the entire tests/test_pipeline_engine.py still fails in this local environment because the async pytest plugin is not registered. Targeted synchronous/regression tests pass.
  • Subprocess CLI tests need PYTHONPATH=src in this worktree unless the package is installed.

Dogfood Status

The current product delta sheet shows f4abc007 improved Vulca behavior but did not produce accepted Track1 replacements:

  • 0002: product improvement, baseline still better.
  • 0064/0074: crash fixed, baseline still better.
  • 0301: category restored, but unwanted labels remain.
  • 0151/0728: selective regeneration rejected because of gallery/mockup artifacts.

Next Work

Keep this PR draft until a new dogfood run validates ffcf85a6.

Recommended dogfood:

cd /Users/yhryzy/dev/emoart-130k
export GEMINI_API_KEY="$(security find-generic-password -s affectiveart-gemini-api-key -a gemini -w)"
export GOOGLE_API_KEY="$GEMINI_API_KEY"
.venv/bin/python scripts/vulca_caption_fidelity_ab.py \
  --out-dir experiments/track1_artifact_boundary_fix_dogfood \
  --sample-id track1_0151 \
  --sample-id track1_0728 \
  --sample-id track1_0301 \
  --sample-id track1_0064 \
  --limit 4 \
  --force

Acceptance rule:

  • No gallery/photo/mockup/installation artifacts.
  • No sample IDs.
  • No unrequested text.
  • Candidate must beat or clearly match baseline on content fidelity before visual polish matters.
  • Do not replace challenge submission artifacts unless a new manual A/B decision report shows clear wins.


yha9806 commented May 8, 2026

Track1 dogfood update: do not use this branch for batch replacement yet

I ran the small Track1 A/B dogfood against the content-locked create path. The result is clear: do not batch-replace the current Track1 submission package with latest Vulca content-lock output.

Artifacts from the dogfood run:

  • A/B script: /Users/yhryzy/dev/emoart-130k/scripts/vulca_caption_fidelity_ab.py
  • Contact sheet: /Users/yhryzy/dev/emoart-130k/experiments/track1_caption_fidelity_ab/contact_sheet.jpg
  • Decision report: /Users/yhryzy/dev/emoart-130k/experiments/track1_caption_fidelity_ab/decision_report.md
  • Run summary: /Users/yhryzy/dev/emoart-130k/experiments/track1_caption_fidelity_ab/run_summary.json

Result summary

  • 6 samples audited.
  • Latest Vulca generated candidates for only 4/6 samples.
  • track1_0064 and track1_0074 failed in the generate node with: 'NoneType' object is not iterable.
  • track1_0301 was a severe content/category failure: the caption asked for a graph-paper branching pencil drawing, but the output collapsed into a gallery photograph and included large visible TRACK1_0301 text.
  • track1_0002 did improve versus the known pre-PR drift: the new output preserved bamboo/orchid/calligraphy/seals instead of collapsing into generic mountains. However, the current submitted image is still better.
  • Replacement wins: 0/6.

Interpretation

This PR is moving in the right direction: it fixes the known track1_0002 style-prototype drift by putting explicit content before tradition guidance. But it is not enough for Track1 batch generation.

The main gaps are now:

  1. Extractor coverage is too narrow. When content-lock extraction finds no hard requirements, ordinary captions can still collapse by category. track1_0301 is the clearest example.
  2. Visible text/artifact suppression is missing. The generator should not introduce sample IDs, gallery labels, exhibition-photo framing, or large unsolicited text.
  3. Generate node reliability still has a real failure path. track1_0064 and track1_0074 hit 'NoneType' object is not iterable during generation.

Current submission status

The current Track1 package was left untouched and still validates:

  • JSON validation: OK
  • ZIP contents: submission.json + 1000 images
  • Missing images in ZIP: 0

Recommended next work before marking this PR ready

Keep this PR as draft and address these before any claim that --content-lock is batch-safe:

  • Debug and fix the NoneType generate-node failure for 0064/0074.
  • Extend content-lock extraction or add a VLM extraction phase so captions like graph-paper pencil drawings produce hard category/content constraints.
  • Add negative/forbidden artifact guidance for sample IDs, gallery/exhibition photo outputs, labels, and unsolicited large text.
  • Re-run the same 6-sample dogfood under emoart-130k/experiments/ and update the decision report.

For now: PR #119 should be treated as a guardrail improvement, not a Track1 replacement pipeline.


yha9806 commented May 8, 2026

Update after the Track1 dogfood feedback: pushed f4abc007 with the first remediation pass.

What changed:

  • expanded content-lock extraction for the actual failed captions:
    • 0064: Gongbi vertical scroll + lotus/stems/leaves + side calligraphy + silk ground;
    • 0074: Gongbi album leaf + bird/branches + circular calligraphy panel + ornate border;
    • 0301: hand-drawn branching lines + dense tree network + heart/geometric marks + graph paper + rectangular frame + monochrome pencil style;
  • removed sample-like IDs such as track1_0301 from provider subject when content lock is enabled;
  • added prompt-level bans for sample IDs, filenames, watermarks, large labels, gallery walls, exhibition labels, framed museum installations, and photographed artwork mockups;
  • extended VLM content-fidelity gate to penalize missing style attributes and forbidden artifacts, not only missing subjects/text/surface;
  • hardened Gemini response parsing so candidate.content.parts is None becomes an actionable “no image data” error instead of a raw 'NoneType' object is not iterable crash;
  • suppressed sample IDs in mock fallback placeholders too.

Validation:

  • targeted dogfood regressions were written red-first and now pass;
  • 68 passed for content-lock, Gemini provider, evaluate, and the relevant GenerateNode regression tests;
  • ruff passed;
  • git diff --check passed.

This PR should remain draft. Next step is to rerun the same 6-sample Track1 A/B dogfood on this pushed commit and specifically verify 0064/0074 no longer crash and 0301 does not contain sample ID/gallery-photo artifacts. The current Track1 submission package should still remain untouched unless a new decision report shows clear replacement wins.


yha9806 commented May 8, 2026

Added the artifact-boundary fix from 2026-05-08-vulca-track1-artifact-boundary-fix.md in ffcf85a6.

Important boundary now documented in the PR:

  • f4abc007 is a real Vulca product improvement: 0002 content drift improved, 0064/0074 generate crashes fixed, and 0301 moved back from gallery-photo/sample-id collapse to graph-paper drawing.
  • Track1 submission package still accepts zero Vulca replacements: the baseline remains cleaner and more stable, 0301 still had unwanted labels, and selective 0151/0728 regeneration was rejected for gallery/mockup artifacts.

What ffcf85a6 adds:

  • output_is_artwork_itself semantic flag in SDK/create path.
  • CLI: vulca create --output-is-artwork-itself.
  • --content-lock now also enables the artifact boundary by default.
  • Generation prompt now starts with ARTIFACT BOUNDARY REQUIREMENT: output must be the artwork itself, not a photograph/display/mockup/gallery scene.
  • Poster-specific guard: flat, front-facing poster artwork, not wall/room photo.
  • Scroll/album guard: artwork surface itself, not gallery wall/catalog spread/framed display.
  • VLM gate now requests and enforces forbidden_visual_artifacts, unwanted_visible_text, and output_is_artwork_itself; violations cap score to 0.25 and add content_fidelity_failed.
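The flag semantics above, where --content-lock implies the artifact boundary, can be sketched with argparse. This is an illustrative approximation; the real vulca CLI may be built on a different framework, and only the two option names come from the PR description.

```python
import argparse

def parse_create_args(argv):
    """Parse the create-subcommand flags; --content-lock enables the
    artifact boundary by default, while --output-is-artwork-itself can
    also be set on its own for boundary-only use."""
    parser = argparse.ArgumentParser(prog="vulca create")
    parser.add_argument("--content-lock", action="store_true")
    parser.add_argument("--output-is-artwork-itself", action="store_true")
    args = parser.parse_args(argv)
    if args.content_lock:
        # Content lock implies the artifact boundary.
        args.output_is_artwork_itself = True
    return args
```

Keeping the implication in one place after parsing means SDK callers and the CLI agree on the default without duplicating the rule.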

Validation:

  • Artifact-boundary tests were red-first and now pass.
  • Related focused suite: 78 passed.
  • ruff check: passed.
  • git diff --check: passed.

Next step remains dogfood, not submission replacement: rerun the 4-sample artifact-boundary harness on 0151/0728/0301/0064, inspect manually, and only consider any Track1 artifact changes after a new decision report shows clear wins.


yha9806 commented May 8, 2026

Dogfood update for ffcf85a6 is now available and confirms the product/submission boundary.

Artifacts:

  • Review doc: /Users/yhryzy/dev/emoart-130k/docs/superpowers/plans/2026-05-08-vulca-track1-artifact-boundary-dogfood-ffcf85a6.md
  • Decision report: /Users/yhryzy/dev/emoart-130k/experiments/track1_artifact_boundary_fix_dogfood_ffcf85a6/decision_report.md
  • Contact sheet: /Users/yhryzy/dev/emoart-130k/experiments/track1_artifact_boundary_fix_dogfood_ffcf85a6/contact_sheet.jpg

Result:

  • Product verdict: artifact-boundary fix works directionally.
    • 4/4 candidates generated successfully.
    • 4/4 are artwork surfaces, not gallery/installation/mockup/wall-photo scenes.
    • track1_0301 is back to graph-paper branching drawing instead of gallery/photo/sample-id collapse.
    • track1_0064 no longer hits the earlier no-image crash path.
  • Submission verdict: accepted replacements = 0/4.
    • 0151: baseline stronger; Vulca candidate still has tiny source-like text and score capped.
    • 0728: candidate is competitive but not a clear win.
    • 0301: candidate is correct, but baseline is denser and closer to “fill a rectangular frame.”
    • 0064: candidate usable, but less pale/delicate, adds rocks/red lotus, and had VLM parse/fallback uncertainty.

Important: this dogfood supports PR #119 as a Vulca product fix, but it still does not justify changing the Track1 submission package. Current Track1 artifacts remain untouched.

Follow-up product issues exposed:

  1. create result JSON does not expose artifact-boundary gate fields (forbidden_visual_artifacts, unwanted_visible_text, output_is_artwork_itself, content_fidelity_failed).
  2. VLM JSON parse fallback is not explicit enough in final output; 0064 hit parse failure and fell back to mock scoring.
  3. Local dogfood skipped some cultural tools because cv2 was unavailable.

I am checking whether (1) and (2) can be made visible in this PR with focused tests. The next dogfood should remain product-only; no Track1 artifact replacement unless a new manual decision report shows clear wins.


yha9806 commented May 8, 2026

Follow-up patch pushed: a485c77e (Expose content fidelity audit metadata).

This addresses two product issues exposed by the ffcf85a6 dogfood without touching Track1 submission artifacts:

  • Create/Pipeline output now surfaces audit metadata:
    • risk_flags
    • content_fidelity_gate
    • evaluation_source
    • evaluation_error
  • VLM parse/network/error fallback is now explicit. If VLM scoring fails and EvaluateNode falls back to mock scoring, the result carries:
    • evaluation_source: "mock_fallback"
    • evaluation_error: <original VLM/parser error>
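The explicit fallback path above can be sketched as a small wrapper. This is a hypothetical illustration of the behavior a485c77e describes; the actual EvaluateNode code and scorer signatures are likely different.

```python
def evaluate_with_fallback(image, vlm_score, mock_score) -> dict:
    """Try VLM scoring; on any failure, fall back to mock scoring but
    record where the numbers came from and why."""
    try:
        result = vlm_score(image)
        result["evaluation_source"] = "vlm"
        result["evaluation_error"] = None
    except Exception as exc:
        result = mock_score(image)
        result["evaluation_source"] = "mock_fallback"
        result["evaluation_error"] = str(exc)  # e.g. the JSON parse error
    return result
```

With this shape, a downstream audit can filter create results on evaluation_source == "mock_fallback" instead of mistaking a fallback for a clean VLM evaluation.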

Why this matters for the dogfood finding:

  • 0064 had Could not parse JSON from LLM output and then fell back to mock scoring. Future create JSON should now make that visible instead of looking like a clean evaluation path.
  • Artifact/content gates should now be inspectable in final create outputs, including forbidden artifacts and unwanted visible text when the VLM reports them.

Validation:

  • Red-first regression tests for fallback metadata and create JSON audit fields now pass.
  • Focused suite: 80 passed.
  • ruff check: passed.
  • git diff --check: passed.

Remaining known issue from dogfood: local cultural tool discovery skipped some tools because cv2 is unavailable. I have not changed dependency packaging in this PR.

