[codex] Add content lock for caption fidelity #119
Track1 dogfood update: do not use this branch for batch replacement yet

I ran the small Track1 A/B dogfood against the content-locked create path. The result is clear: do not batch-replace the current Track1 submission package with the latest Vulca content-lock output.

Artifacts from the dogfood run:
Result summary
Interpretation

This PR is moving in the right direction: it fixes the known

The main gaps are now:

Current submission status

The current Track1 package was left untouched and still validates:

Recommended next work before marking this PR ready

Keep this PR as draft and address these before any claim that

For now: PR #119 should be treated as a guardrail improvement, not a Track1 replacement pipeline.

Update after the Track1 dogfood feedback: pushed

What changed:
Validation:
This PR should remain draft. Next step is to rerun the same 6-sample Track1 A/B dogfood on this pushed commit and specifically verify

Added the artifact-boundary fix from

Important boundary now documented in the PR:
What
Validation:
Next step remains dogfood, not submission replacement: rerun the 4-sample artifact-boundary harness on

Dogfood update for

Artifacts:
Result:
Important: this dogfood supports PR #119 as a Vulca product fix, but it still does not justify changing the Track1 submission package. Current Track1 artifacts remain untouched.

Follow-up product issues exposed:

I am checking whether (1) and (2) can be made visible in this PR with focused tests. The next dogfood should remain product-only: no Track1 artifact replacement unless a new manual decision report shows clear wins.

Follow-up patch pushed:

This addresses two product issues exposed by the
Why this matters for the dogfood finding:
Validation:
Remaining known issue from dogfood: local cultural tool discovery skipped some tools because

Summary
This draft PR adds a content-lock and artifact-boundary path for caption-driven `vulca create` runs.

The important product/challenge boundary is now explicit:

- `f4abc007` fixed known failure modes such as `0002` shanshui drift, `0064` / `0074` generate crashes, and `0301` gallery-photo/sample-id collapse.
- Current Track1 submission artifacts remain untouched and valid.

Problem / Root Cause
The original runtime was too style-first: cultural guidance could override explicit caption content, and evaluation could still allow high scores when required content was missing.
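To make the scoring gap concrete, here is a minimal Python sketch, not Vulca's actual evaluator. It assumes a plain weighted average over per-dimension scores and a list of required caption items. `weighted_total`, `content_fidelity_failed`, and the `0.25` cap are names and values quoted in this PR; `Evaluation`, `evaluate`, `CONTENT_FAILURE_CAP`, and the example dimensions are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Cap value quoted in this PR; the constant name is hypothetical.
CONTENT_FAILURE_CAP = 0.25


@dataclass
class Evaluation:
    weighted_total: float
    flags: list[str] = field(default_factory=list)


def score(dimension_scores: dict[str, float], weights: dict[str, float]) -> float:
    """Plain weighted average; it knows nothing about required content."""
    total_weight = sum(weights.values())
    return sum(dimension_scores[name] * w for name, w in weights.items()) / total_weight


def evaluate(
    dimension_scores: dict[str, float],
    weights: dict[str, float],
    required: list[str],
    detected: set[str],
) -> Evaluation:
    """Cap the total when any required caption item was not detected."""
    total = score(dimension_scores, weights)
    missing = [item for item in required if item not in detected]
    if missing:
        # Style quality can no longer mask missing content.
        return Evaluation(min(total, CONTENT_FAILURE_CAP), ["content_fidelity_failed"])
    return Evaluation(total)
```

With style at 0.9 and composition at 0.8 (equal weights), a bamboo/orchid/calligraphy caption rendered as generic shanshui would score about 0.85 under the uncapped average and 0.25 once the cap applies.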
Dogfood exposed a second boundary failure: Vulca sometimes treated the requested artwork as an object to display in a scene, producing gallery walls, museum/installations, framed mockups, catalog layouts, visible sample IDs, or unrequested labels instead of the artwork itself.
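The boundary failure can be sketched as a simple check over what an evaluator reports seeing in the image. This is an illustrative sketch under stated assumptions, not the PR's implementation: the result keys `forbidden_visual_artifacts`, `unwanted_visible_text`, and `output_is_artwork_itself` are field names used in this PR, while `check_artifact_boundary`, the tag strings, and the sample-ID pattern are hypothetical.

```python
import re

# Display contexts this PR names as boundary failures; the tag strings are hypothetical.
DISPLAY_TAGS = {"gallery_wall", "museum_installation", "framed_mockup", "catalog_layout"}

# Matches sample IDs such as TRACK1_0301 or track1_0301 (assumed ID shape).
SAMPLE_ID = re.compile(r"\btrack1_\d{4}\b", re.IGNORECASE)


def check_artifact_boundary(scene_tags, visible_text, requested_text=()):
    """Report display artifacts and any visible text the caption did not request."""
    forbidden = sorted(DISPLAY_TAGS & set(scene_tags))
    allowed = {text.lower() for text in requested_text}
    unwanted = [
        text
        for text in visible_text
        if SAMPLE_ID.search(text) or text.lower() not in allowed
    ]
    return {
        "forbidden_visual_artifacts": forbidden,
        "unwanted_visible_text": unwanted,
        # The output counts as the artwork itself only if no display context was seen.
        "output_is_artwork_itself": not forbidden,
    }
```

A gallery photo with a visible `TRACK1_0301` label would fail on both counts, while a plain drawing whose only visible text was requested by the caption would pass.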
Concrete observed failures:
- `track1_0002`: bamboo/orchid/calligraphy drifted into generic shanshui before content-lock.
- `track1_0064` / `track1_0074`: Gemini returned no iterable image parts, causing generate-node `NoneType` failures before hardening.
- `track1_0301`: graph-paper branching drawing collapsed into a gallery photo with visible `TRACK1_0301`; after `f4abc007` the category recovered, but unwanted English labels remained.
- `track1_0151` / `track1_0728`: selective regeneration of poster-like captions became gallery/mockup scenes and was rejected.

Fix

The PR now does three things:
Content lock
- `0.25` when required content is absent.

Artifact boundary
- `output_is_artwork_itself` as an SDK/CLI semantic flag.
- `--content-lock` now also enables the artifact boundary by default.
- `vulca create --output-is-artwork-itself` for boundary-only use.
- `ARTIFACT BOUNDARY REQUIREMENT`, requiring the output to be the artwork surface itself, not a photo/display/mockup.

Evaluation hardening
- `forbidden_visual_artifacts`
- `unwanted_visible_text`
- `output_is_artwork_itself`
- `weighted_total` to `0.25` and adds `content_fidelity_failed` when it sees gallery/photo/mockup artifacts, unwanted visible text, or an output that is not the artwork itself.
- `candidate.content.parts is None` as “no image data” instead of raw `NoneType` iteration.
- `track1_0301` are not passed to the image provider as `Subject:` and are suppressed in mock fallback SVGs.

Validation

Latest validation after `ffcf85a6`:

- `PYTHONPATH=src pytest tests/test_content_lock.py tests/test_gemini_image_size.py tests/test_evaluate.py tests/test_cli_create_output.py::TestCreateOutputParam::test_create_help_has_output_param tests/test_cli_create_output.py::TestCreateOutputParam::test_create_cli_accepts_content_lock_flag tests/test_cli_create_output.py::TestCreateOutputParam::test_create_cli_accepts_output_is_artwork_itself_flag tests/test_create_hitl.py::TestCreateHITL::test_create_accepts_content_lock_argument tests/test_create_hitl.py::TestCreateHITL::test_create_accepts_output_is_artwork_itself_argument tests/test_pipeline_engine.py::TestGenerateNode::test_mock_generate_suppresses_sample_id_text tests/test_pipeline_engine.py::TestGenerateNode::test_generate_node_puts_content_lock_before_cultural_guidance tests/test_pipeline_engine.py::TestGenerateNode::test_generate_node_does_not_send_sample_id_as_provider_subject_with_content_lock tests/test_pipeline_engine.py::TestGenerateNode::test_generate_node_puts_artifact_boundary_before_content_requirements` -> 78 passed.
- `ruff check src/vulca/content_lock.py src/vulca/create.py src/vulca/cli.py src/vulca/pipeline/nodes/generate.py src/vulca/providers/gemini.py tests/test_content_lock.py tests/test_pipeline_engine.py tests/test_cli_create_output.py tests/test_create_hitl.py tests/test_gemini_image_size.py` -> passed.
- `git diff --check` -> passed.
- `--content-lock` and `--output-is-artwork-itself` are exposed.

Known local environment notes:

- `tests/test_pipeline_engine.py` still fails in this local environment because the async pytest plugin is not registered. Targeted synchronous/regression tests pass.
- `PYTHONPATH=src` in this worktree unless the package is installed.

Dogfood Status

The current product delta sheet shows `f4abc007` improved Vulca behavior but did not produce accepted Track1 replacements:

- `0002`: product improvement, baseline still better.
- `0064` / `0074`: crash fixed, baseline still better.
- `0301`: category restored, but unwanted labels remain.
- `0151` / `0728`: selective regeneration rejected because of gallery/mockup artifacts.

Next Work

Keep this PR draft until a new dogfood run validates `ffcf85a6`.

Recommended dogfood:

Acceptance rule: