<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Udhay Kumar - Building the picks and shovels for Web3 x AI]]></title><description><![CDATA[DevRel engineer shipping developer tools, MCP servers, and infrastructure at the intersection of Web3 and AI. Open source packages, tutorials, and build-in-public updates.]]></description><link>https://blog.udhaykumarbala.dev</link><image><url>https://cdn.hashnode.com/res/hashnode/image/upload/v1593680282896/kNC7E8IR4.png</url><title>Udhay Kumar - Building the picks and shovels for Web3 x AI</title><link>https://blog.udhaykumarbala.dev</link></image><generator>RSS for Node</generator><lastBuildDate>Fri, 05 Jun 2026 20:02:24 GMT</lastBuildDate><atom:link href="https://blog.udhaykumarbala.dev/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Why Can't We Version-Control AI Images?]]></title><description><![CDATA[TL;DR: I built an MCP server that decomposes AI-generated images into structured JSON, so you can edit specific fields and regenerate instead of re-prompting from scratch. It's not perfect (Gemini still generates a new image each time) but it's way m...]]></description><link>https://blog.udhaykumarbala.dev/structured-json-editing-for-ai-images-gemini-mcp</link><guid isPermaLink="true">https://blog.udhaykumarbala.dev/structured-json-editing-for-ai-images-gemini-mcp</guid><dc:creator><![CDATA[udhay kumar]]></dc:creator><pubDate>Sat, 04 Apr 2026 14:07:41 GMT</pubDate><enclosure url="https://raw.githubusercontent.com/udhaykumarbala/gemini-image-studio-mcp/master/docs/blog-cover.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<blockquote>
<p><strong>TL;DR:</strong> I built an MCP server that decomposes AI-generated images into structured JSON, so you can edit specific fields and regenerate instead of re-prompting from scratch. It's not perfect (Gemini still generates a new image each time), but it's way more consistent than freeform re-prompting.</p>
</blockquote>
<p>If we can <code>git diff</code> a codebase, why can't we diff the components of an AI-generated image?</p>
<p>That question hit me while I was trying to make marketing visuals for the <a target="_blank" href="https://0g.ai">0G Labs</a> documentation site. I'd generated a hero image with Gemini that looked solid. Clean composition, right colors, good text placement. Then my teammate Gathin asked if we could change the background gradient to something warmer.</p>
<p>Should be easy, right?</p>
<p>I re-prompted. The warm background was there, but Gemini decided the person in the image needed a completely different outfit. And the text overlay was gone. I spent 40 minutes trying to coax Gemini back to the original composition with just the background changed. At one point the person's face changed entirely and I actually laughed out loud because it was so absurd.</p>
<p>AI image gen being imperfect? Fine. What's maddening is there's no way to say "change ONLY this one thing." Text prompts are lossy by nature. You describe the whole scene every time, and the model reinterprets everything every time.</p>
<h2 id="heading-the-idea-that-made-me-stop-re-prompting">The idea that made me stop re-prompting</h2>
<p>What if you broke the image into structured JSON first, then just patched the fields you wanted to change?</p>
<p>I started building this the next morning. The approach:</p>
<ol>
<li>Generate an image from a prompt</li>
<li>Send the image back to Gemini's vision model and ask it to describe every visual component as structured JSON</li>
<li>Cache that JSON "blueprint"</li>
<li>When you want an edit, patch the specific fields in the blueprint and regenerate from the modified JSON</li>
</ol>
<h2 id="heading-how-the-decomposition-actually-works">How the decomposition actually works</h2>
<p>The decomposition step sends your image to Gemini's vision model with a prompt that defines the exact JSON schema you want back. Here's a trimmed-down version showing the core fields (the actual prompt includes additional sections for text rendering, technical camera details, and style modifiers):</p>
<pre><code>Analyze this image and describe EVERY visual detail as a JSON object.
Use this exact structure:

{
  "subject": [{ "id": string, "type": "person"|"object",
    "hair": {"style": string, "color": string},
    "clothing": [{"item": string, "color": string (hex), "fabric": string}],
    "expression": string, "pose": string }],
  "scene": { "location": string,
    "lighting": {"type": string, "direction": string},
    "background_elements": string[] },
  "composition": { "framing": string, "angle": string }
}

Use hex color codes for precision. Return ONLY valid JSON.
</code></pre><p>Gemini comes back with something like this:</p>
<pre><code class="lang-json">{
  <span class="hljs-attr">"subject"</span>: [{
    <span class="hljs-attr">"type"</span>: <span class="hljs-string">"person"</span>,
    <span class="hljs-attr">"hair"</span>: { <span class="hljs-attr">"color"</span>: <span class="hljs-string">"#0F0F11"</span>, <span class="hljs-attr">"style"</span>: <span class="hljs-string">"short, neatly styled"</span> },
    <span class="hljs-attr">"clothing"</span>: [{ <span class="hljs-attr">"item"</span>: <span class="hljs-string">"suit jacket"</span>, <span class="hljs-attr">"color"</span>: <span class="hljs-string">"#1C355E"</span>, <span class="hljs-attr">"fabric"</span>: <span class="hljs-string">"textured wool-blend"</span> }],
    <span class="hljs-attr">"expression"</span>: <span class="hljs-string">"warm, genuine smile"</span>
  }],
  <span class="hljs-attr">"scene"</span>: {
    <span class="hljs-attr">"lighting"</span>: { <span class="hljs-attr">"type"</span>: <span class="hljs-string">"professional studio, soft yet defined"</span>, <span class="hljs-attr">"direction"</span>: <span class="hljs-string">"key light from front-left"</span> },
    <span class="hljs-attr">"background_elements"</span>: [<span class="hljs-string">"smooth matte grey gradient, lighter in center"</span>]
  }
}
</code></pre>
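<p>Even with "Return ONLY valid JSON" in the prompt, a model can still wrap its reply in a Markdown code fence or add a sentence around it. Here's a minimal sketch of a defensive parser, in Python for brevity. The names <code>parse_blueprint</code>, <code>missing_sections</code>, and <code>EXPECTED_SECTIONS</code> are hypothetical and only illustrate the idea; this is not the tool's actual code.</p>

```python
import json

# Assumption: the top-level sections from the trimmed schema shown above.
EXPECTED_SECTIONS = ("subject", "scene", "composition")


def parse_blueprint(reply: str) -> dict:
    """Parse the outermost {...} span of a model reply as JSON.

    Raises ValueError if there is no object or the JSON is invalid
    (json.JSONDecodeError is a subclass of ValueError).
    """
    start = reply.find("{")
    end = reply.rfind("}")
    if start == -1 or end == -1 or start > end:
        raise ValueError("no JSON object found in model reply")
    return json.loads(reply[start : end + 1])


def missing_sections(blueprint: dict) -> list[str]:
    """List schema sections the model left out, so the caller can retry."""
    return [key for key in EXPECTED_SECTIONS if key not in blueprint]


# Simulated reply that is wrapped in a Markdown fence.
fence = "`" * 3
reply = fence + 'json\n{"subject": [], "scene": {}}\n' + fence
blueprint = parse_blueprint(reply)
print(missing_sections(blueprint))  # ['composition']
```

<p>Checking for missing sections matters because the reply is not guaranteed to follow the schema. The sample response above, for instance, has no <code>composition</code> block.</p>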
<p>Then when you want to edit, the tool merges your changes into the blueprint and sends a new prompt to the image generation model:</p>
<pre><code>Here is the COMPLETE description of the TARGET image after edits:
{ ...merged blueprint... }

Edit the provided image to match this description. Change ONLY:
- subject[0].clothing[0].color

CRITICAL: Keep EVERYTHING else EXACTLY as it is in the original image.
</code></pre><p>That "CRITICAL: Keep EVERYTHING else EXACTLY as it is" is doing a lot of work. It's prompt engineering, not a pixel-locking mechanism. Sometimes the model listens. Sometimes it doesn't.</p>
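<p>To make the merge step concrete, here's a rough sketch of patching one field by its path and assembling a prompt in the shape shown above. It's written in Python for brevity, and <code>parse_path</code>, <code>apply_edit</code>, and <code>build_edit_prompt</code> are hypothetical helpers, not the server's real functions.</p>

```python
import copy
import json
import re

# Matches either a key name or a [index] segment in paths like
# "subject[0].clothing[0].color".
_SEGMENT = re.compile(r"([^.\[\]]+)|\[(\d+)\]")


def parse_path(path: str) -> list:
    """Split a dotted/indexed path into dict keys (str) and list indexes (int)."""
    segments = []
    for key, index in _SEGMENT.findall(path):
        segments.append(key if key else int(index))
    return segments


def apply_edit(blueprint: dict, path: str, value) -> dict:
    """Return a deep copy of the blueprint with one field replaced."""
    merged = copy.deepcopy(blueprint)
    *parents, last = parse_path(path)
    node = merged
    for segment in parents:
        node = node[segment]  # raises KeyError/IndexError on a bad path
    node[last] = value
    return merged


def build_edit_prompt(merged: dict, changed_paths: list[str]) -> str:
    """Assemble an edit prompt: full target description plus changed paths."""
    changes = "\n".join(f"- {path}" for path in changed_paths)
    return (
        "Here is the COMPLETE description of the TARGET image after edits:\n"
        f"{json.dumps(merged, indent=2)}\n\n"
        "Edit the provided image to match this description. Change ONLY:\n"
        f"{changes}\n\n"
        "CRITICAL: Keep EVERYTHING else EXACTLY as it is in the original image."
    )


blueprint = {"subject": [{"clothing": [{"item": "suit jacket", "color": "#1C355E"}]}]}
path = "subject[0].clothing[0].color"
merged = apply_edit(blueprint, path, "#8B0000")
print(build_edit_prompt(merged, [path]))
```

<p>Copying the blueprint before patching keeps the original around, which is handy if the edit drifts and you want to retry from the same starting point.</p>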
<h2 id="heading-the-result-and-where-it-breaks">The result (and where it breaks)</h2>
<p>Here's a real before/after from the tool. One JSON field changed: <code>subject[0].clothing[0].color</code>, from <code>#1C355E</code> to <code>#8B0000</code>.</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Original (navy blazer)</td><td>After editing <code>clothing[0].color: "#8B0000"</code></td></tr>
</thead>
<tbody>
<tr>
<td><img src="https://raw.githubusercontent.com/udhaykumarbala/gemini-image-studio-mcp/master/docs/blog-demo-original.png" alt="Original - navy blazer" /></td><td><img src="https://raw.githubusercontent.com/udhaykumarbala/gemini-image-studio-mcp/master/docs/blog-demo-red-blazer.png" alt="Edited - red blazer" /></td></tr>
</tbody>
</table>
</div><p>In this case, everything held: face, expression, hair, shirt, pocket square, background, lighting. That doesn't always happen.</p>
<p>Simple, isolated changes on images with clear subjects work well: clothing color, background, lighting. For the 0G marketing work that started all this, I went from re-prompting 8-10 times to getting what I wanted in 1-2 edits (occasionally needing a re-roll).</p>
<p>Complex scenes with lots of overlapping elements are a different story. Take a group photo with five people, try to change one person's shirt color, and the model will probably change something else along the way. Multi-step editing chains also accumulate drift. By the 4th or 5th edit on the same image, the blueprint and the actual pixels have diverged enough that edits become unreliable.</p>
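<p>One way to at least notice drift, assuming you're willing to spend an extra vision call to decompose the edited image again, is to diff the two blueprints path by path. The sketch below (Python, hypothetical helper names) flags fields that changed even though the edit didn't target them. Since decomposition is itself an interpretation, a reported difference may be description noise rather than real pixel drift.</p>

```python
def flatten(node, prefix=""):
    """Flatten nested dicts/lists into {"a[0].b": value} pairs."""
    flat = {}
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{prefix}.{key}" if prefix else key
            flat.update(flatten(value, child))
    elif isinstance(node, list):
        for index, value in enumerate(node):
            flat.update(flatten(value, f"{prefix}[{index}]"))
    else:
        flat[prefix] = node
    return flat


def diff_blueprints(before: dict, after: dict) -> dict:
    """Map each differing path to its (before, after) values; None means absent."""
    old, new = flatten(before), flatten(after)
    return {
        path: (old.get(path), new.get(path))
        for path in sorted(old.keys() | new.keys())
        if old.get(path) != new.get(path)
    }


def unexpected_changes(before: dict, after: dict, intended: set[str]) -> list[str]:
    """Paths that changed even though the edit did not target them."""
    return [path for path in diff_blueprints(before, after) if path not in intended]


before = {"subject": [{"clothing": [{"color": "#1C355E"}], "expression": "warm, genuine smile"}]}
after = {"subject": [{"clothing": [{"color": "#8B0000"}], "expression": "neutral"}]}
print(unexpected_changes(before, after, {"subject[0].clothing[0].color"}))
# ['subject[0].expression']
```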
<p>This is not actual version control. Git is deterministic. This is probabilistic. You're sending a modified text prompt to Gemini and asking it to generate a new image. The structured blueprint makes the prompt much more precise than freeform text, but Gemini still has the final say.</p>
<p>Across ~25 edits, the face and pose stayed consistent about two-thirds of the time. For the other third, I'd regenerate once more and it usually came out right on the second try. Not great on its own, but compared to freeform re-prompting, where I was getting consistent results maybe one time in ten, it was a massive improvement.</p>
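<p>As a rough back-of-envelope check, assume each attempt succeeds independently with a fixed probability. That's a simplification, and the rates are the informal ones quoted above, not measured benchmarks. The expected number of attempts until the first success is then 1/p:</p>

```python
from fractions import Fraction


def expected_attempts(success_rate: Fraction) -> Fraction:
    """Mean number of tries until the first success (geometric distribution)."""
    return 1 / success_rate


structured = expected_attempts(Fraction(2, 3))  # blueprint edits, ~2 in 3
freeform = expected_attempts(Fraction(1, 10))   # freeform re-prompting, ~1 in 10
print(structured, freeform)  # 3/2 10
```

<p>Under that assumption the averages come out to 1.5 attempts versus 10, which is roughly in line with the 1-2 edits versus 8-10 re-prompts mentioned earlier.</p>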
<p>The decomposition itself is Gemini's interpretation, not ground truth. It might describe a "wool blazer" when the material is ambiguous. The JSON blueprint is a best-effort scene description, not a pixel-accurate specification. Every edit prompt includes instructions telling the model to keep unchanged elements identical (you saw the "CRITICAL" block above), but that's prompt text, not an API constraint. A request, not a guarantee.</p>
<p>You can also skip the JSON workflow entirely and use <code>edit_type: "natural_language"</code> to just describe changes in plain English. I find JSON more reliable for production work, but natural language is faster for exploration.</p>
<h2 id="heading-the-workflow-in-claude-code">The workflow in Claude Code</h2>
<p>I wired this up as an <a target="_blank" href="https://modelcontextprotocol.io/">MCP server</a> so AI assistants (Claude Code, Cursor, Windsurf) can use it directly. The conversation looks like:</p>
<p><strong>Me:</strong> "Create a professional headshot for our team page, navy blazer, clean studio lighting"</p>
<p>Claude calls <code>generate_image</code>. I get a headshot. Looks good.</p>
<p><strong>Me:</strong> "The blazer needs to be dark red to match our brand. Don't change anything else."</p>
<p>Claude calls <code>decompose_image</code> on the existing image (takes about 3-4 seconds), gets the blueprint, patches <code>subject[0].clothing[0].color</code>, and calls <code>edit_image</code> with the modified blueprint. The blueprint gets cached, so the next tweak skips decomposition.</p>
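<p>The caching idea can be sketched as a small in-memory map keyed by a hash of the image bytes. This is an illustration in Python under my own assumptions (the class name, the SHA-256 key, and the <code>put</code> step for edited images are all hypothetical), not a description of how the server stores blueprints.</p>

```python
import hashlib
from typing import Callable


class BlueprintCache:
    """In-memory cache of blueprints keyed by a hash of the image bytes."""

    def __init__(self, decompose: Callable[[bytes], dict]):
        # decompose is whatever function calls the vision model.
        self._decompose = decompose
        self._blueprints: dict[str, dict] = {}

    @staticmethod
    def key_for(image: bytes) -> str:
        return hashlib.sha256(image).hexdigest()

    def get(self, image: bytes) -> dict:
        """Return the cached blueprint, decomposing only on a cache miss."""
        key = self.key_for(image)
        if key not in self._blueprints:
            self._blueprints[key] = self._decompose(image)
        return self._blueprints[key]

    def put(self, image: bytes, blueprint: dict) -> None:
        """Store the merged blueprint for a freshly edited image."""
        self._blueprints[self.key_for(image)] = blueprint
```

<p>An edited image has new bytes and therefore a new key, so in this sketch the merged blueprint is stored with <code>put</code> right after the edit. That's what lets the next tweak skip decomposition, at the cost of trusting that the model actually rendered what the blueprint says.</p>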
<p>Here's the actual terminal:</p>
<p><img src="https://raw.githubusercontent.com/udhaykumarbala/gemini-image-studio-mcp/master/docs/blog-terminal-generate.png" alt="Generate command in Claude Code" /></p>
<p><img src="https://raw.githubusercontent.com/udhaykumarbala/gemini-image-studio-mcp/master/docs/blog-terminal-edit.png" alt="Decompose and edit workflow" /></p>
<p>I almost built this as a CLI tool. The first prototype was a Node script with a <code>--edit</code> flag. But the interaction pattern is basically a conversation (generate, look, tweak, look again) so MCP was a more natural fit.</p>
<h2 id="heading-where-could-this-go">Where could this go?</h2>
<p>The decompose step is the weakest link. Right now I'm relying entirely on Gemini's vision model to produce the JSON blueprint, which means the quality depends on how well it interprets the image. Photorealistic images with clear subjects hold up. Abstract art? Total crapshoot.</p>
<p>If I could go deeper, I'd try caching blueprints at generation time (when you know exactly what was requested) instead of decomposing post-hoc. Or running decomposition through multiple models and taking consensus. Or integrating something like SAM for actual pixel-level region isolation, so "change only this area" means something mechanical rather than a polite prompt instruction.</p>
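<p>The consensus idea could look something like this: run decomposition several times (or through several models), flatten each result into path-to-value pairs, and keep only the fields where enough runs agree. This is a speculative Python sketch with a hypothetical <code>consensus</code> function; nothing like it exists in the tool today.</p>

```python
from collections import Counter


def consensus(candidates: list[dict], min_votes: int = 2) -> tuple[dict, list[str]]:
    """Majority-vote each field across several flattened blueprints.

    Returns the agreed fields plus the paths where no value reached min_votes.
    """
    agreed, disputed = {}, []
    paths = sorted({path for candidate in candidates for path in candidate})
    for path in paths:
        votes = Counter(c[path] for c in candidates if path in c)
        value, count = votes.most_common(1)[0]
        if count >= min_votes:
            agreed[path] = value
        else:
            disputed.append(path)
    return agreed, disputed


candidates = [
    {"subject[0].clothing[0].color": "#1C355E", "subject[0].clothing[0].fabric": "wool"},
    {"subject[0].clothing[0].color": "#1C355E", "subject[0].clothing[0].fabric": "textured wool-blend"},
    {"subject[0].clothing[0].color": "#1B345D", "subject[0].clothing[0].fabric": "tweed"},
]
agreed, disputed = consensus(candidates)
print(agreed)    # {'subject[0].clothing[0].color': '#1C355E'}
print(disputed)  # ['subject[0].clothing[0].fabric']
```

<p>Exact-match voting like this suits constrained values better than free text. Descriptive strings such as fabric or expression rarely match word for word, so a real version would likely need some fuzzy comparison for those fields.</p>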
<p>There's a bigger question underneath all of this: what if the JSON blueprint became the actual source of truth, and pixel rendering was just a build step? The same way we write React components and render to DOM. You'd author in JSON, render to pixels, and any edit would be a commit to the JSON source. We're nowhere near that today, but every time the structured edit works and the freeform re-prompt doesn't, it feels like a small proof of concept for that future.</p>
<h2 id="heading-try-it">Try it</h2>
<p>If you have Claude Code:</p>
<pre><code class="lang-bash">claude mcp add gemini-image-studio-mcp \
  -e GEMINI_API_KEY=your-key -- gemini-image-studio-mcp
</code></pre>
<p>You'll need a <a target="_blank" href="https://aistudio.google.com/apikey">Gemini API key</a> (the free tier works for experimenting, though rate limits kick in during rapid edit cycles). Full docs: <a target="_blank" href="https://github.com/udhaykumarbala/gemini-image-studio-mcp">README</a>.</p>
<p>If you end up trying it, tell me what breaks. Seriously. DM me on <a target="_blank" href="https://x.com/udhaykumarbala">X</a> or <a target="_blank" href="https://github.com/udhaykumarbala/gemini-image-studio-mcp/issues">open an issue</a>. I want the weird edge cases.</p>
]]></content:encoded></item></channel></rss>