Date: 2026-05-26
Status at end of these two days: All 130 recipe pages show product card grids with real product images instead of plain text affiliate links; Amazon logo badges stripped from 63 recipe intros. Then a third content audit on challah bread surfaced 13 misses with three root causes, and patch-body-images.ts was rewritten as a sequential block walker — 130 recipes re-extracted, 0 failed. 73/73 tests pass throughout.
Everything here is built with Claude Code, and these two days are a clean demonstration of why the discipline around the model matters more than the model's raw speed. Claude wrote a Playwright scraper, a 55-product mapping script, two migration patchers, and a from-scratch rewrite of the content-extraction walker: work that would have taken me days alone, compressed into an afternoon each.
But none of the failures below were crashes. They were plausible-but-wrong output: a script that reported "0 updated" and was actually correct, a Promise.all that raced two uploads of the same file, an ASIN quietly mapped to the wrong photo, an extraction that grabbed Amazon links instead of step photos, and, the big one, a regex that had been silently dropping every other paragraph across all 130 recipes while reporting "130 updated, 0 failed" for three rounds straight. Every one of those would have shipped on a YOLO run, because every one of them produced output that looked right.
What caught them was not a smarter prompt; it was the habit of checking the real thing (the rendered DOM, the Payload database, the live WordPress page side-by-side, the aioseo image sitemap as a ground-truth checklist) instead of trusting the model's account of what it did. The whole reason the audit even happened is that a human looked at a real page and saw braiding photos that the "successful" scripts had been throwing away. That's the judgment you bring around the AI, not instead of it: the model writes the boilerplate fast, and you stay senior enough to notice when "no errors" and "correct" have quietly come apart.
TLDR
- Built: A new "Shop the Ingredients" product card grid — responsive 2–3 column layout, product photo on a warm background, name, orange "Shop on Amazon" CTA. Replaces the previous plain text affiliate links. 55 unique products mapped, uploaded, and linked across all 130 recipes; Amazon logo badges stripped from 63 intros.
- Then broke open the content again: A side-by-side of the redesigned challah bread page against the live WordPress site revealed 13 missing pieces — three opening paragraphs, a Braiding section, four step photos, two videos, a Variations heading, three gallery captions, an Under Proofing heading, and a final slice image.
- Three root causes, one design flaw: (1) an optional regex group caused every other paragraph to be silently dropped; (2) step images inside wp:media-text figures were never matched by the wp:image-only regex; (3) wp:heading blocks were never in scope. All three trace back to the same choice: two independent regexes merged after the fact.
- The fix: Rewrote extractIntroWithImages() as a single forward-scanning block walker (BLOCK_START_RE + a processedEnd guard) that handles paragraph, image, heading, quote, and the image variant of media-text while letting nested content fall through naturally.
- Flops: strip-amazon-logo.ts "updated 0 recipes" (human confusion: the target was already cleaned on a prior run); a Promise.all filename race on duplicate product images; a copy-paste ASIN→image mapping error; an early extraction that grabbed Amazon hrefs instead of <img src>; a textFormat: 0 literal type that rejected italic captions (format: 2).
- Wins: Playwright DOM scrape pulled all five live-site product images + URLs in one evaluate() call; all 55 product images already existed locally (zero remote fetches); the wpjc_aioseo_posts.sql image sitemap gave a ground-truth checklist of the 13 expected images; the processedEnd guard made the nested-block problem trivial.
- Result: Product cards live on every recipe page, and the content extraction rebuilt on authoritative WordPress data: challah bread now 25 paragraphs / 12 images, heaviest page (yule log) 90 paragraphs / 42 images, all rendering. Clean commits on main.
- Known gap: Two .m4v braid-demo videos still don't render, because Payload's Lexical richText has no native video node. Captions are preserved; videos logged as a future TODO.
Part 1 — From "Shop Now" to product cards
The brief was a design task: "there's a 'shop now' button on some pages — I don't like how it looks, let's make it a nice card." Under the hood it was three separate jobs: fix the data (upload product images as Payload media and link them to equipment items), fix the content (strip the Amazon logo badge images that patch-body-images.ts had embedded in recipe intros), and fix the UI (redesign the equipment section layout).
We started with the dal-coconut-curry page as a live specimen. Playwright scraped the live site in one evaluate() call, returning all five product links with their associated images and Amazon URLs:
{
  text: "SHOP NOW",
  href: "https://www.amazon.com/Spicy-World-Masoor.../dp/B000K89490...",
  imgSrc: "https://herfoodblog.com/wp-content/uploads/2024/01/dal-773x1024.jpg"
}
All five images already existed in resources/wp-uploads/ from the Day 10 extraction — no network requests needed to populate them.
The new UI is a responsive grid — 2-col on mobile, 3-col on desktop. Each card is a white rounded box with a warm #F7F3ED image area, object-contain layout so different product shapes don't get cropped weirdly, the product name below, and an orange "Shop on Amazon" button at the bottom. Hover gives a subtle border-colour change and scales the image up 5%.
After confirming it looked good on dal-coconut-curry, we built the 55-ASIN mapping from the WXR XML (a one-shot Node.js script that parsed all wp:media-text blocks and extracted ASIN→image URL pairs), put it in patch-equipment-thumbnails.ts, and ran it for all 130 recipes. The strip script (strip-amazon-logo.ts) fetched all Payload media whose filename contained "available_at_amazon", then scanned every recipe's intro richText for upload nodes with those media IDs and removed them. 63 recipes updated.
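For the curious, the stripping step boils down to a recursive filter over the Lexical tree. This is a minimal sketch, not the actual strip-amazon-logo.ts: the LexicalNode shape is a simplified assumption (real Payload nodes carry more fields), and stripUploadNodes and uploadId are hypothetical names.

```typescript
// Assumed, simplified shape of a serialized Lexical node.
interface LexicalNode {
  type: string
  value?: number | string | { id: number | string }
  children?: LexicalNode[]
  [key: string]: unknown
}

// Returns the media ID of an upload node, or undefined for any other node.
function uploadId(node: LexicalNode): string | undefined {
  if (node.type !== 'upload' || node.value === undefined) return undefined
  const v = node.value
  return String(typeof v === 'object' ? v.id : v)
}

// Removes upload nodes whose media ID is in mediaIds, at any depth.
function stripUploadNodes(
  nodes: LexicalNode[],
  mediaIds: Set<string>,
): { nodes: LexicalNode[]; removed: number } {
  let removed = 0
  const kept: LexicalNode[] = []
  for (const node of nodes) {
    const id = uploadId(node)
    if (id !== undefined && mediaIds.has(id)) {
      removed++
      continue
    }
    if (node.children) {
      const inner = stripUploadNodes(node.children, mediaIds)
      removed += inner.removed
      kept.push({ ...node, children: inner.nodes })
    } else {
      kept.push(node)
    }
  }
  return { nodes: kept, removed }
}
```

Running it a second time on already-cleaned content removes nothing, which is worth remembering when a rerun reports zero updates.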
Part 2 — The third audit: 13 misses, three root causes, one rewrite
This one started with a comparison my friend sent: her local redesigned challah bread page next to the live WordPress version. The difference was striking. The live page had braiding instructions with photos, variations with a gallery, an under-proofing section with a diagnostic image. The local page had a few paragraphs and then jumped straight to the recipe card. We counted 13 specific misses and wrote them into documents/content-misses.md.
The data my friend provided
The previous extraction rounds had used the WXR file and a wp-uploads/ dump. This time we needed to understand why the content was missing, not just patch it again. My friend ran three new SQL exports from her WordPress database:
- wpjc_posts.sql (60MB): the full posts table, which let us confirm the post_content for challah bread (post ID 9744) matched what we expected from the WXR
- wpjc_postmeta.sql: post metadata, used to cross-reference block structure
- wpjc_aioseo_posts.sql: the All-in-One SEO image sitemap, which listed all 13 image filenames for post 9744: IMG_8470, IMG_1112, IMG_1115, IMG_1650, IMG_1656, IMG_1660, IMG_1101, IMG_9608, IMG_0378, IMG_0383, IMG_8452, IMG_5737, IMG_1668. This was the authoritative list of what should be on the page, a ground truth to check against.
With the image sitemap as a checklist, we could see immediately that 5 of the 13 images were absent from the local page. That told us where to look.
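The checklist comparison itself is tiny. Here is a rough sketch of how it could be automated, assuming image URLs follow the WordPress pattern of a base name plus an optional size suffix such as -665x997; imageKey and missingImages are hypothetical helper names, not something from the project.

```typescript
// Reduce a URL or filename to its base name:
// drop the directory, the extension, and a trailing "-WIDTHxHEIGHT" size suffix.
function imageKey(urlOrFilename: string): string {
  const file = urlOrFilename.split('/').pop() ?? urlOrFilename
  return file.replace(/\.[a-z0-9]+$/i, '').replace(/-\d+x\d+$/, '')
}

// Expected names that have no counterpart among the rendered image sources.
function missingImages(expected: string[], renderedSrcs: string[]): string[] {
  const rendered = new Set(renderedSrcs.map(imageKey))
  return expected.filter(name => !rendered.has(imageKey(name)))
}

// Example with three of the sitemap names and a made-up rendered list.
const missing = missingImages(
  ['IMG_8470', 'IMG_0378', 'IMG_1668'],
  [
    'https://herfoodblog.com/wp-content/uploads/2023/08/IMG_0378-665x997.jpg',
    'IMG_8470.jpg',
  ],
)
console.log(missing) // → ['IMG_1668']
```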
Root cause 1: the alternating paragraph skip
The old patch-body-images.ts ran two global regexes — one for paragraphs, one for images — collected all matches into an array, sorted by match.index, and built the Lexical doc in that order. The paragraph regex looked like this (simplified):
/<!-- wp:paragraph((?:[\s\S]*?)?)?-->...([\s\S]*?)<!-- \/wp:paragraph -->/g
The (?:[\s\S]*?)? optional attribute group is the problem. It is built on [\s\S], which matches any character at all, including the --> that should end the opening comment, so nothing confines it to a single block's attributes. In a long document with many consecutive paragraph blocks, the engine can "skip" over a block's opening comment and latch onto the next one. The result: every other paragraph silently dropped.
Querying the Payload database confirmed it. The challah intro had 18 nodes:
node 1: "This festive bread…" ← PRESENT
node 2: "Ritually acceptable challah…" ← MISS
node 3: "This recipe is due…" ← PRESENT
node 4: "Braiding" ← MISS (heading — not even in scope)
node 5: "Braiding can be tricky…" ← MISS
...
The alternating pattern was unmistakable: a systematic defect affecting roughly half the paragraphs on every recipe with more than two or three consecutive paragraph blocks.
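To make the failure mode concrete, here is a small reproduction of one way an unconfined attribute group can swallow a neighbouring block. It is an illustration under my own assumptions: the regex is a stand-in, not the script's real pattern, and the sample markup is made up. The cheap count check at the end is the kind of comparison that would have flagged the problem on the first run.

```typescript
// Four consecutive paragraph blocks; two carry a class on <p> (made-up sample).
const sampleBody = [
  '<!-- wp:paragraph --><p class="lead">One</p><!-- /wp:paragraph -->',
  '<!-- wp:paragraph --><p>Two</p><!-- /wp:paragraph -->',
  '<!-- wp:paragraph --><p class="x">Three</p><!-- /wp:paragraph -->',
  '<!-- wp:paragraph --><p>Four</p><!-- /wp:paragraph -->',
].join('\n')

// Illustrative pattern: the attribute group is [\s\S]*?, so it may cross "-->".
function extractParagraphs(body: string): { text: string; span: string }[] {
  const loose =
    /<!-- wp:paragraph([\s\S]*?)-->\s*<p>([\s\S]*?)<\/p>\s*<!-- \/wp:paragraph -->/g
  const out: { text: string; span: string }[] = []
  let m: RegExpExecArray | null
  while ((m = loose.exec(body)) !== null) {
    out.push({ text: m[2] ?? '', span: m[0] })
  }
  return out
}

// Ground truth: how many paragraph blocks the source actually contains.
function countParagraphOpenings(body: string): number {
  return (body.match(/<!-- wp:paragraph/g) ?? []).length
}

const extracted = extractParagraphs(sampleBody)
console.log(extracted.map(p => p.text)) // → ['Two', 'Four']
console.log(countParagraphOpenings(sampleBody), extracted.length) // → 4 2
```

When the plain `<p>` fails to match block one, the lazy group keeps growing until it reaches the next block's `-->`, so the first match spans two blocks and only reports the second one's text. No error is thrown; the only symptom is that 4 openings produced 2 paragraphs.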
Root cause 2: step images inside wp:media-text
WordPress's media-text block renders an image alongside a paragraph:
<!-- wp:media-text {"mediaId":9799,"mediaType":"image"} -->
<div class="wp-block-media-text">
  <figure class="wp-block-media-text__media">
    <img src="https://herfoodblog.com/wp-content/uploads/2023/08/IMG_0378-665x997.jpg" .../>
  </figure>
  <div class="wp-block-media-text__content">
    <!-- wp:paragraph --><p>Mix everything using a paddle attachment...</p><!-- /wp:paragraph -->
  </div>
</div>
<!-- /wp:media-text -->
The old image regex matched <!-- wp:image --> block comments only. The wp:media-text block embeds the <img> directly inside its <figure> element, with no wp:image comment around it. The image was right there in the HTML, completely invisible to the regex. Four step photos in challah were lost this way.
Root cause 3: headings never included
The old script recognised exactly two block types: wp:paragraph and wp:image. wp:heading was never in the alternation, so "Braiding", "Variations", and "Under proofing" were silently skipped on every recipe.
The fix: a sequential block walker
All three root causes trace back to the same design choice — two separate regexes applied independently, then merged. The fix is to process the document in a single forward pass:
const BLOCK_START_RE = /<!-- wp:(paragraph|image|heading|media-text|quote)((?:[^-]|-(?!->))*)-->/g
let processedEnd = 0
let match: RegExpExecArray | null
while ((match = BLOCK_START_RE.exec(body)) !== null) {
  // Skip if this opening falls inside an already-processed block
  if (match.index < processedEnd) {
    BLOCK_START_RE.lastIndex = processedEnd
    continue
  }
  // Find the matching close tag, process the block, advance processedEnd
  ...
}
The processedEnd guard is what makes it work. When we process a wp:media-text block, processedEnd advances past the entire block including its nested wp:paragraph and wp:image content, so the next iteration skips anything before it — the nested paragraph doesn't get double-counted as a standalone paragraph. Gallery images work the same way in reverse: because wp:gallery isn't in the alternation, the walker doesn't consume it as a single block, and its inner wp:image blocks are found naturally on later iterations.
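Here is a self-contained sketch of that loop with the elided part filled in the simplest way I can think of. It is not the real extractIntroWithImages(): walkBlocks and the Block type are hypothetical, and it assumes every block has an explicit closing comment and that a block type never nests inside itself.

```typescript
type BlockType = 'paragraph' | 'image' | 'heading' | 'media-text' | 'quote'

interface Block {
  type: BlockType
  attrs: string // raw text between the block name and "-->"
  inner: string // everything between the opening and closing comments
}

function walkBlocks(body: string): Block[] {
  // A fresh RegExp per call keeps lastIndex private to this walk.
  const blockStart =
    /<!-- wp:(paragraph|image|heading|media-text|quote)((?:[^-]|-(?!->))*)-->/g
  const blocks: Block[] = []
  let processedEnd = 0
  let match: RegExpExecArray | null
  while ((match = blockStart.exec(body)) !== null) {
    // This opening sits inside a block we already consumed: jump past it.
    if (match.index < processedEnd) {
      blockStart.lastIndex = processedEnd
      continue
    }
    const type = match[1] as BlockType
    const closeTag = `<!-- /wp:${type} -->`
    const innerStart = match.index + match[0].length
    const closeAt = body.indexOf(closeTag, innerStart)
    if (closeAt === -1) continue // no closing comment: ignore this opening
    blocks.push({
      type,
      attrs: (match[2] ?? '').trim(),
      inner: body.slice(innerStart, closeAt),
    })
    processedEnd = closeAt + closeTag.length
  }
  return blocks
}

// Made-up sample: a paragraph, a media-text with a nested paragraph,
// a gallery wrapping two images, and a heading.
const walkerSample = [
  '<!-- wp:paragraph --><p>Intro</p><!-- /wp:paragraph -->',
  '<!-- wp:media-text {"mediaType":"image"} --><div><figure><img src="a.jpg"/></figure>' +
    '<div><!-- wp:paragraph --><p>Step</p><!-- /wp:paragraph --></div></div><!-- /wp:media-text -->',
  '<!-- wp:gallery --><figure><!-- wp:image --><figure><img src="b.jpg"/></figure><!-- /wp:image -->' +
    '<!-- wp:image --><figure><img src="c.jpg"/></figure><!-- /wp:image --></figure><!-- /wp:gallery -->',
  '<!-- wp:heading --><h2>Braiding</h2><!-- /wp:heading -->',
].join('\n')

console.log(walkBlocks(walkerSample).map(b => b.type))
// → ['paragraph', 'media-text', 'image', 'image', 'heading']
```

The nested "Step" paragraph stays inside the media-text block's inner content instead of appearing as its own entry, while the two gallery images surface individually because the gallery wrapper is never matched.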
For wp:heading we added it to the alternation and emitted a Lexical heading node:
const tagMatch = blockInner.match(/<h([2-4])[^>]*>([\s\S]*?)<\/h\1>/)
if (tagMatch) { // match() returns null when the block holds no h2-h4
  const tag = `h${tagMatch[1]}` as 'h2' | 'h3' | 'h4'
  children.push({
    type: 'heading', version: 1, direction: 'ltr', format: '', indent: 0, tag,
    children: [{ type: 'text', ..., text: headingText }],
  })
}
For wp:media-text we pull the image out of the <figure class="wp-block-media-text__media"> element and upload it, then extract the caption paragraph from <div class="wp-block-media-text__content">.
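As a rough sketch of that split (parseMediaText is a hypothetical helper, and treating a missing mediaType as "image" is my assumption), the key detail is that the image lookup is restricted to the figure and reads src, so links in the content div never come into play:

```typescript
interface MediaTextParts {
  mediaType: string
  imgSrc: string | undefined
  captionHtml: string | undefined
}

function parseMediaText(attrs: string, inner: string): MediaTextParts {
  // Assumption: blocks without an explicit mediaType are treated as images.
  const mediaType = /"mediaType"\s*:\s*"(\w+)"/.exec(attrs)?.[1] ?? 'image'
  // Only look for the image inside the media figure, never in the content div.
  const figure =
    /<figure[^>]*wp-block-media-text__media[^>]*>([\s\S]*?)<\/figure>/.exec(inner)?.[1] ?? ''
  const imgSrc =
    mediaType === 'image' ? /<img[^>]*\ssrc="([^"]+)"/.exec(figure)?.[1] : undefined
  // First nested paragraph becomes the caption.
  const captionHtml =
    /<!-- wp:paragraph(?:[^-]|-(?!->))*-->\s*<p[^>]*>([\s\S]*?)<\/p>/.exec(inner)?.[1]
  return { mediaType, imgSrc, captionHtml }
}

// Made-up inner markup modelled on the block shown earlier.
const mediaTextInner =
  '<div class="wp-block-media-text">' +
  '<figure class="wp-block-media-text__media">' +
  '<img src="https://herfoodblog.com/wp-content/uploads/2023/08/IMG_0378-665x997.jpg" alt=""/>' +
  '</figure>' +
  '<div class="wp-block-media-text__content">' +
  '<!-- wp:paragraph --><p>Mix everything. <a href="https://www.amazon.com/">Shop</a></p><!-- /wp:paragraph -->' +
  '</div></div>'

const imageParts = parseMediaText('{"mediaId":9799,"mediaType":"image"}', mediaTextInner)
const videoParts = parseMediaText('{"mediaType":"video"}', mediaTextInner)
```

With the video attributes, imgSrc comes back undefined and only the caption survives, which matches the "skip the video, keep the caption" behaviour described for the braid demos.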
AI flops
strip-amazon-logo.ts updating 0 recipes. First run (with only 5 ASINs) correctly stripped Amazon logos from dal-coconut-curry. The second run for all recipes returned "Updated 0 recipe(s)". I spent several minutes writing a more targeted version before realising: the logos were already gone from dal-coconut-curry (from run 1) and no other recipe had been processed yet. The script was finding the right media IDs but they genuinely weren't in any intro. The script was correct — the confusion was mine.
Promise.all filename conflict. The original equipment-thumbnail script processed all items for a recipe in parallel. Some recipes have the same product twice under different names (e.g. two piping-bag items both mapping to pipingbags-1024x1014.jpg). Both parallel uploadImage() calls would query for an existing record (find nothing), both upload, and the second would hit a ValidationError: filename unique constraint:
ValidationError: The following field is invalid: filename
at handleUpsertError (/node_modules/@payloadcms/drizzle/...)
Fixed by switching to a for...of loop so uploads happen one at a time; the second call's dedup check then finds the first's record and reuses the ID.
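The race is easy to reproduce without Payload at all. Below is a minimal sketch using an in-memory stand-in: FakeMediaStore is entirely hypothetical and only mimics "find, then create, with a unique filename", and the uploadImage here is a simplified stand-in for the real helper of the same name.

```typescript
// Hypothetical in-memory stand-in for the media collection; not Payload's API.
class FakeMediaStore {
  private byFilename = new Map<string, number>()
  private nextId = 1

  async find(filename: string): Promise<number | undefined> {
    await Promise.resolve() // yield, as a real query would
    return this.byFilename.get(filename)
  }

  async create(filename: string): Promise<number> {
    await Promise.resolve() // yield, as a real insert would
    if (this.byFilename.has(filename)) {
      throw new Error(`filename must be unique: ${filename}`)
    }
    const id = this.nextId++
    this.byFilename.set(filename, id)
    return id
  }
}

// Dedup check followed by upload: safe only if calls do not interleave.
async function uploadImage(store: FakeMediaStore, filename: string): Promise<number> {
  const existing = await store.find(filename)
  if (existing !== undefined) return existing
  return store.create(filename)
}

// Both lookups run before either insert lands, so duplicates collide.
async function uploadAllInParallel(store: FakeMediaStore, filenames: string[]): Promise<number[]> {
  return Promise.all(filenames.map(name => uploadImage(store, name)))
}

// One at a time: the second lookup sees the first call's record.
async function uploadAllSequentially(store: FakeMediaStore, filenames: string[]): Promise<number[]> {
  const ids: number[] = []
  for (const name of filenames) {
    ids.push(await uploadImage(store, name))
  }
  return ids
}
```

Sequential uploads are slower, but for a one-off migration of a few dozen files that cost is negligible next to a failed run.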
Wrong ASIN for a product. While building the 55-entry ASIN map from the WXR output, B00004S7V8 got assigned cooling-rack-730x1024.jpg (the same image as B0041HO4NW). The WXR clearly showed B00004S7V8: IMG_7951-683x1024.jpg — a copy-paste error during manual grouping. Caught during the full-run log review.
Early extraction grabbed the wrong images. Before fully understanding the wp:media-text structure, an early attempt found images by href attributes in the block markup rather than src attributes in the <img> tags. This produced media IDs for Amazon product links, not step photos. Running in --dry-run caught it before any bad data reached the database.
The alternating skip survived two prior rounds. Rounds 2 and 3 ran the script against all 130 recipes and reported "130 updated, 0 failed". Everything looked fine. The bug was present from the start but the output was never compared against a known-correct reference. We only caught it because of the side-by-side with the live site. Script success messages don't mean content correctness — you need a human to look at a real page.
TypeScript flop. Italic captions render with textFormat: 2 (italic bitmask), but LexicalParagraphNode had textFormat: 0 as a literal type:
Type '2' is not assignable to type '0'.
The fix was a one-token change: textFormat: 0 became textFormat: number. Use number for bitmask fields whenever the value can vary, and don't infer the literal type from the common case.
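A small sketch of the widened type. The interface is trimmed down to the fields that matter here, captionParagraph is a hypothetical helper, and the flag values follow Lexical's text format bitmask (bold 1, italic 2).

```typescript
// Text format flags are powers of two, combined with bitwise OR.
const IS_BOLD = 1
const IS_ITALIC = 2 // the value from the error message above

// Trimmed-down node type; the real one has more fields.
interface LexicalParagraphNode {
  type: 'paragraph'
  textFormat: number // was the literal type 0, which rejected 2
  children: { type: 'text'; text: string; format: number }[]
}

function captionParagraph(text: string, italic: boolean): LexicalParagraphNode {
  const format = italic ? IS_ITALIC : 0
  return {
    type: 'paragraph',
    textFormat: format,
    children: [{ type: 'text', text, format }],
  }
}

const italicCaption = captionParagraph('Crown braid', true)
console.log(italicCaption.textFormat) // → 2
console.log(IS_BOLD | IS_ITALIC) // → 3, bold and italic together
```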
AI wins
Playwright DOM scrape as a data-extraction tool. Instead of parsing the WXR for product images (which stores different filenames per size), I scraped the rendered live page and extracted the exact image+URL pairing the template had assembled:
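const imgs = [];
const amazonLinks = document.querySelectorAll('a[href*="amazon.com"]');
amazonLinks.forEach(link => {
  // Each link sits in a media-text block next to its product image
  const block = link.closest('.wp-block-media-text');
  const img = block?.querySelector('img');
  imgs.push({ href: link.href, imgSrc: img?.src });
});
return imgs; // handed back to Node by evaluate()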
One evaluate() call returned all five products in structured form — no HTML parsing, no regex, no rate limiting.
All 55 files already local. After extracting the full ASIN list from the WXR XML, we checked each image against resources/wp-uploads/. Zero missing. The full run uploaded them to Payload media without touching the live site once.
The aioseo image sitemap as ground truth. Using wpjc_aioseo_posts.sql as a checklist of expected images was a clean debugging move. Before writing any code we knew exactly which 13 images should appear on the challah page, and after the fix we could verify each one was newly uploaded or reused. No guessing.
processedEnd makes the nesting problem trivial. The hardest part of a sequential walker for nested Gutenberg markup is avoiding double-counts. The processedEnd guard solves it in four lines — when the walker moves past a wp:media-text block, all content inside it becomes unreachable to subsequent matches. Simpler and more correct than any regex-alternation trick.
The 280px gap was not our bug. While on the product-card work, we noticed a 280px blank space between the intro and Ingredients on every recipe page. It could have sent the investigation toward leftover upload nodes or empty Lexical containers. Checking the actual DOM:
const ins = document.querySelector('.adsbygoogle');
ins.getBoundingClientRect() // → { height: 280 }
NEXT_PUBLIC_ADSENSE_CLIENT_ID is set in .env.local, so the AdSlot renders and AdSense's script initialises the <ins> to 280px even without ad content in dev. Not a bug we introduced — the gap would be filled by a real ad in production.
Known gap: videos
The challah page originally had two embedded .m4v video demonstrations — a crown braid and a 5-strand braid — using wp:media-text with "mediaType":"video". The walker detects this case and skips the video itself, preserving only the text caption from the content div.
Payload's Lexical richText has no native video node type. Options for later:
- A custom Lexical node with a <video> element
- Embed as an <iframe> via a custom node
- Link out to the video files from a text paragraph
Not in scope for this round. Captions are preserved and the rest of the content is correct.
End of these two days
Done: product card grid on all recipe pages with images uploaded and linked, Amazon logo badges stripped from 63 intros; patch-body-images.ts rewritten with a sequential block walker, tested on challah bread, all 13 content misses resolved, full run for all 130 recipes (130 updated, 0 failed). Yule log spot-checked: 90 paragraphs and 42 step images rendering. Commits 96b090b and 4c9fd54 on main. 73/73 tests passing.
Not done: production deploy (still queued from Day 10); video rendering (future TODO — Payload Lexical has no native video node).
The honest take: the product-card work was clean, and its debugging flops were human errors — forgetting a script had already run, a copy-paste slip in a data map. The audit was sharper. The previous scripts reported success because they ran without errors, but "no errors" and "correct content" are different things. Having the raw WordPress database to query against let us confirm the alternating-miss pattern definitively before touching code. That's the right way to debug: find the evidence, understand the pattern, then fix. And the pattern that keeps holding up is "AI writes the boilerplate, human checks the data" — the WXR parser, the Playwright scrape, the block walker were all fast to write; catching the silently-wrong output is the judgment you bring around the model, not instead of it.
This is part of an ongoing series documenting the rebuild of a friend's food blog from WordPress to a custom Next.js stack, built with AI assistance.
