This wasn’t just “add more details”—it was “make this mechanically coherent.”
The overall text doesn’t appear to be AI written, making this all the more confusing. Is AI making people write this way now on their own? Or is it actually written by an LLM and just doesn’t look like it?
I love this experiment and am surprised that the Claude models performed that much better than the competition. Opus was particularly impressive both in the quality itself and the ability to iterate meaningfully.
Now... Was this article LLM written?
This part triggered all my LLM flags:
```
Adding a bicycle chain isn’t just decoration—it shows understanding of mechanical relationships. The wheel spokes, the adjusted proportions—these are signs of vision-driven refinement working as intended.
```
Did not feel LLM written to me - at least not overtly so. LLM editing/assisted perhaps?
It was a fun little post that felt accurate (ie confirmed my own biases ;)) about the current state of LLM models in a silly, but real, use-case.
The continual drive to out "llm written" articles feels a bit silly to me at this point. They are now part of the tools and tech we use, for better or worse. And to be clear, I think in a lot of cases it leans towards 'worse'.
But do you question if a video or photo was made with digital editing or filters or 'ai' tools (many of which we've had for years, just under different names)? Do you worry about what tech was used in making your favorite album or song?
I get it, LLMs make it easy to produce trash content, but this is not a new problem. If you see trash, call it out as trash on its flaws, not on a presumption of how it was made.
No, I don't have anything against using LLMs to write. My problem is that I enjoy reading people in part for diversity of style.
I already spend too much time reading LLM outputs on my own interactions. And I get sick of their style because of it. So when I read it during leisure time, it just triggers a gut rejection.
Especially because they are so formulaic / template-y.
I get sick not only due to overexposure to LLM style, but also because I associate it now with a very poor substance-to-style ratio. LLMs tend to not only overuse but also misuse the turns of phrase they return to obsessively. For example, enumerating three items where one is just another way to reference one of the first two, or it's of a different kind and doesn't really fit with the other two items. Or it will use "it's not just A, it's B" where B is unrelated to A, so "it's A and B" would have been more appropriate. Sacrificing logic for reasons of style. It also signals I should be on the lookout for possible hallucinations.
What an insightful comment! (sorry, couldn’t help it)
I agree about the silliness. God forbid I am a non-native English speaker and I have a bit of an odd writing style in a real Brit's eye. Or that I use ‘—’ instead of ‘-’ because usually typing two dashes converts to the long one on Mac (try even four, technology is crazy these days), and it just feels a bit nicer. OR that I adopt occasional use of ‘;’ because I feel like it (Yes. English is supposed to have short sentences. Unlike other languages. Beautiful. Sue me.)
I don’t care if they helped themselves with AI to improve writing or turn a bullet point into a sentence. It’s when the volume of text isn’t justified by its content or value that I call BS and go to the next one. At that point it might as well be human-generated content, but I don’t care, the outcome’s the same.
Regarding the post — it’s a cute little article and the pelicans do seem to be making a point with their funky shapes
I mean at some point you have to evaluate the content on its merit and they have a point — a chain is functional, not just decorative, in its precise placement.
Evaluating the content on its merit I'd question whether the author has seen a bicycle before. Yes, in the final iteration with Opus it added a chain, but it's missing a triangle which clearly shows a lack of understanding of mechanical relationships.
Ignoring the wording, em-dashes, etc., I'd assume an LLM not only wrote the article but also judged the pictures. That, or the author has a much more relaxed opinion on what a pelican on a bicycle should actually look like. I don't think I would call Sonnet's arms and handlebars improved, nor would I call Haiku's legs and feet "proper." And if you overlay GPT-5 Medium's two photos, the shapes' proportions are nearly identical.
That phrase template isn’t just overdone—it's something some text models are obsessed with. The em-dashes, the contrastive language—these are signs of LLMs being asked to summarize or expand a compelling blog post.
If you give it credit for the chain, you need to also notice that that bike has a fixed front wheel. It literally can not be turned.
How do you know the wheel is fixed?
Humans are super bad at drawing bikes: https://www.gianlucagimini.it/portfolio-item/velocipedia/
Does being bad at drawing bikes make a machine more intelligent/human?
What I take from this is that LLMs are somewhat miraculous in generation but terrible at revision. Especially with images, they are very resistant to adjusting initial approaches.
I wonder if there is a consistent way to force structural revisions. I have found Nano Banana particularly terrible at revisions, even something like "change the image dimensions to..." it will confidently claim success but do nothing.
A thing I've been noticing across the board is that current generative AI systems are horrible at composition. It’s most obvious in image generation models where the composition and blocking tend to be jarringly simple and on point (hyper-symmetry, all-middleground, or one of like three canned "artistic" compositions) no matter how you prompt them, but you see it in things like text output as well once you notice it.
I suspect this is either a training data issue, or an issue with the people building these things not recognizing the problem, but it's weird how persistent and cross-model the issue is, even in model releases that specifically call out better/more steerable composition behavior.
I almost always get better results from LLMs by going back and editing my prompt and starting again, rather than trying to correct/guide it interactively. Almost as if having mistakes in your context window is an instruction to generate more mistakes! (I'm sure it's not quite that simple)
I see this all the time when asking Claude or ChatGPT to produce a single-page two-column PDF summarizing the conclusions of our chat. Literally 99% of the time I get a multi-page unpredictably-formatted mess, even after gently asking over and over for specific fixes to the formatting mistake/s.
And as you say, they cheerfully assert that they've done the job, for real this time, every time.
Ask for the asciidoc and asciidoctor command to make a PDF instead. Chat bots aren’t designed to make PDFs. They are just trying to use tools in the background, probably starting with markdown.
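Roughly what that workflow looks like, as a sketch (it assumes asciidoctor-pdf is installed; the .adoc body here is just a placeholder for whatever the chat bot gives you):

```python
import pathlib
import subprocess

# Placeholder AsciiDoc; in practice you'd paste the source the chat bot produced.
adoc = """= Chat Summary

== Conclusions

* First conclusion from the conversation
* Second conclusion from the conversation
"""

src = pathlib.Path("summary.adoc")
src.write_text(adoc)

# asciidoctor-pdf renders summary.pdf next to the input by default.
subprocess.run(["asciidoctor-pdf", str(src)], check=True)
```

That way the layout is handled by a deterministic tool and the model only has to get the markup right.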
Tools are still evolving out of the VLM/LLM split [0]. The reason image-to-image tasks are so variable in quality and vastly inferior to text-to-image tasks is because there is an entirely separate model that is trained on transforming an input image into tokens in the LLM's vector space.
The naive approach that gets you results like ChatGPT is to produce output tokens based on the prompt and generate a new image from the output. It is really difficult to maintain details from the input image with this approach.
A more advanced approach is to generate a stream of "edits" to the input image instead. You see this with Gemini, which sometimes maintains original image details to a fault; e.g. it will preserve human faces at all cost, probably as a result of training.
I think the round-trip through SVG is an extreme challenge to train through and essentially forces the LLM to progressively edit the SVG source, which can result in something like the Gemini approach above.
[0]: https://www.groundlight.ai/blog/how-vlm-works-tokens
Revision should be much easier than generation, e.g. reflection style CoT (draft-critique-revision) is typically the simplest way to get things done with these models. It's always possible to overthink, though.
Nano Banana is rather terrible at multi-turn chats, just like any other model, despite the claim it's been trained for it. Scattered context and irrelevant distractors are always bad, compressing the conversation into a single turn fixes this.
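A minimal sketch of the draft-critique-revise pattern I mean, with a hypothetical `llm()` text-completion function standing in for whatever API you actually use:

```python
def reflect(task: str, llm, rounds: int = 2) -> str:
    """Draft-critique-revise loop: a sketch, assuming `llm` is any
    text-in/text-out completion function (not a specific vendor API)."""
    draft = llm(f"Task: {task}\nProduce a first draft.")
    for _ in range(rounds):
        critique = llm(
            f"Task: {task}\nDraft:\n{draft}\n"
            "List concrete problems with this draft. Be specific."
        )
        draft = llm(
            f"Task: {task}\nDraft:\n{draft}\nCritique:\n{critique}\n"
            "Rewrite the draft from scratch, fixing every problem listed."
        )
    return draft
```

Compressing all of that into one constructed turn, rather than a long back-and-forth chat, is usually what keeps the context clean.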
I’m not quite sure. I think adversarial networks work pretty well at image generation.
I think that the problem here is that SVG is structured information and an image is an unstructured blob, and the translation between them requires planning and understanding. Maybe treating an SVG like a raster image in the prompt is the wrong approach. I think that prompting for the image like code (which SVG basically is) would result in better outputs.
This is just my uninformed opinion.
The prompt just said to iterate until they were satisfied. Adding something like "don't be afraid to change your approach or make significant revisions" would probably give different results.
> I wonder if there is a consistent way to force structural revisions.
Ask for multiple solutions?
> Some models (looking at you, GPT-5-Codex) seemed to mistake “more complex” for “better.”
That's what working with GPT-5-Codex on actual code also feels like.
Funny because I've felt that way and have switched back to Claude Sonnet 4.5 for agentic coding.
If Sonnet doesn't solve my problem, sometimes Codex actually does.
So it isn't like Codex is always worse. I just prefer to try Sonnet 4.5 first.
YES, and the sad truth is that the only person who can write good, simple code is likely the one who doesn't need an AI helper. ;(
So it's an accurate simulation of a programmer then
It feels like it's a bit hard to take much from this without running this trial many times for each model. Then it would be possible to see if there are consistent themes among each model's solutions. Otherwise, it feels like the specific style of each result could be somewhat random. I didn't see any mention of running multiple trials for each model.
Oddly enough, I've found models are actually quite consistent in their drawings of pelicans riding bicycles.
I remember I even had one case where there was a stealth model running in preview via Open Router and I asked it for an SVG of a pelican riding a bicycle and correctly guessed the model vendor based on the response!
What's troubling to me is that it doesn't seem to have much account for "drift" -- it sort-of just goes down a single path and tries to improve as it goes.
What about structuring the agentic loop to do a simple genetic algorithm -- generate N children (probably 2 or 3), choose the best of the N+1 options (original vs. child A vs. child B vs. child C, and so-on) and then iterate again?
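Something like this, where generate, mutate, and score are hypothetical wrappers around the model and a judge (just a sketch of the loop shape, not anything from the article):

```python
def evolve(prompt, generate, mutate, score, children=3, generations=5):
    """Tiny (1+N) evolutionary loop: the parent survives unless a child beats it.
    `generate`, `mutate`, and `score` are hypothetical callables wrapping
    whatever model and judge you have on hand."""
    best = generate(prompt)                  # initial SVG attempt
    best_score = score(best)
    for _ in range(generations):
        for _ in range(children):
            child = mutate(prompt, best)     # ask for a revised variant
            s = score(child)
            if s > best_score:               # keep the best of parent + children
                best, best_score = child, s
    return best
```

The hard part, of course, is a scoring function that isn't just the same model grading its own homework.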
I write like that and I'm not an LLM.
how many letter "a" are there in "hundreads"?
I assume this was written by a human and then "improved" by an LLM.
It's going to become the "MP3 sizzle" that young people at the time started to prefer once compressed audio became the norm on iPods and other portable music players, along with film grain and the judder of 24fps video. Artifacts imposed by the medium themselves become desirable once they become normal, associated with, and in fact signs of "quality", when in fact they are noise and distortion introduced into an otherwise more pristine or clean signal.
See also the "warmth" that certain vinyl enthusiasts sought after from their analog recordings which most certainly was mainly dust and defects in the groves rather than any actual tangible quality of the audio itself.
With vinyl, warmth is the result of a deliberate process. Professional masters are done specifically for vinyl to accommodate its quirks, which truly changes the sound. They have to clamp down the dynamic range and tidy low frequencies or the needle will skip. Recordings with lots of busy high-frequency information also can’t be physically captured properly in the cut. The resulting master is a version that purposefully doesn’t have as many volume swings and harsh highs or boomy lows. Smooth, cohesive, warm. There are also track-ordering strategies, which is why ballads tend to be at the end of one side and the high-energy stuff up front where there is “resolution” to serve it. The mastering engineer is adjusting each song with all this in mind.
Same vein https://open.substack.com/pub/animationobsessive/p/the-toy-s...
Pixar films were set up with the idea of being put on film, so the color of the DVD digital transfers is all wrong.
In some cases, the artifacts of the medium can be desirable -- especially when consuming media that was created with those artifacts in mind.
Take a look at pixel art on CRTs vs LCDs [0] and Toy Story on film vs. digital [1].
[0]: https://wackoid.com/game/10-pictures-that-show-why-crt-tvs-a...
[1]: https://animationobsessive.substack.com/p/the-toy-story-you-...
What most people think of "vinyl sound effects" are not what the "warmth" is about. That's just playback instability and waveform aliasing caused by shoddily made players.
Good vinyl is "wait, did we have this back in the 1970s?" good (the recorder, yes; the player, not exactly, hence the prevalence of vinyl sound effects).
Something about the cadence, structure, and staccato nature of the bottom paragraphs also felt very LLMed.
This is a lot better than my attempt exactly one year ago: https://paritybits.me/llm-drawing-with-eyes-open/
Poor feet.
It would be interesting to see if they would get better results if they didn't grade their own work. Feed the output to a different model and let that suggest improvements, almost like a GAN.
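A sketch of that cross-model loop (generate_svg and critique are hypothetical wrappers around two different vendors' models; nothing is actually trained here, so "GAN-like" is loose):

```python
def cross_model_loop(prompt, generate_svg, critique, iterations=4):
    """One model draws, a different model critiques. Both callables are
    hypothetical wrappers around whatever APIs you have access to."""
    svg = generate_svg(prompt)
    for _ in range(iterations):
        feedback = critique(
            f"Here is an SVG meant to show: {prompt}\n\n{svg}\n\n"
            "Point out what is structurally wrong and what should be redrawn."
        )
        svg = generate_svg(
            f"{prompt}\n\nPrevious attempt:\n{svg}\n\n"
            f"Independent critique:\n{feedback}\n\n"
            "Produce a new SVG that addresses the critique. "
            "Feel free to start over rather than patch."
        )
    return svg
```

At minimum the critic wouldn't share the generator's blind spots about its own output.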
I tried an experiment like this a while back (for the GPT-5 launch) and was surprised at how ineffective it was.
This is a better version of what I tried but suffers from the same problem - the models seem to stick close to their original shapes and add new details rather than creating an image from scratch that's a significantly better variant of what they tried originally.
I feel like I’ve seen this with code too, where it’s unlikely to scrap something and try a new approach, and more likely to double down, iterating on a bad approach.
For the SVG generation, it would be an interesting experiment to seed it with increasingly poor initial images and see at what point, if any, the models stop anchoring on the initial image and just try something else.
Yeah, for code I'll often start an entirely new chat and paste in just the bits I liked from the previous attempt.
Maybe there’s a bias towards avoiding full rewrites? An “anti-refucktoring” bias
I’d be curious if the approach would be improved by having the model generate a full pelican from scratch each time and having it judge which variation is an improvement. Or if something should be altered in each loop, perhaps it should be the prompt instead
Yeah I think you're right. In most cases it's extremely annoying to have the model make any more than minimal changes to code you provide it.
Very nice results! I recently created this SVG CLI agent with a similar idea: https://github.com/svgnew/Saul
Could this be improved if the evaluation was done by an independent sub-agent?
Is it running out of space in its context window?
My rationale is that perhaps it's being biased towards continuing doing what it's doing, or biased towards saying that it has done a good job and not being self-critical.
A single run (irrespective of number of iterations) on any model is not a good data point.
If the first output is crappy, the next 3 iterations will improve the same crap.
This was not a good test.
I have tried to do agentic Figma in this way, but with the same results: attempt 1 becomes frozen and no forward progress can be made.
What prevents LLM designers from cheating and including a human handcrafted SVG into the model for this specific request (allowing for variations between calls)?
Nothing, but looking at the current results either no one tried yet or it didn't work very well. And the pelican benchmark has been around for a while so the opportunity was there.
I don't have all the most recent models but I've found image generation to be terribly disappointing in most that I've tried in that it doesn't seem capable of understanding directions or fixing mistakes.
"Create a drawing of a cliff in the desert."
Get something passing.
"Add a waterfall."
Get a waterfall that has no connection or outlet.
"Make the waterfall connect to a creek that runs off to the left."
Get a waterway that goes by the waterfall without connecting and goes straight through the center of the image.
Give up on that and notice that the shadows go to the left but the sun is straight behind.
"Move the sun to the right so that it matches the shadows more accurately."
Sun stays in the same spot, but grows while exaggerated and inaccurate shadows show up that seem to imply the backside of the cliff doesn't block light.
...
A G E N T I C
it uses AGENTS
ITS AGENTIC
AGENTIC.
Iterating a Markov chain does not make it any more or less "agentic". This is yet another instance of corporate marketing departments redefining words b/c they are confused about what exactly they're trying to build & sell.
It's agentic because it's an iterated loop that relies on tool calls. The conversion of prompt to SVG is (presumably) a pure product of inference. But the rasterized SVG that the loop evaluates isn't; it's the product of hardcoded svg->jpg translation code (the model isn't inferring the raster). The loop is thus, in some small way, "grounded" (though not as firmly as a coding agent is grounded in a type-checking compiler's refusal to admit a hallucinated API).
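As a sketch of that grounding step (cairosvg is a real rasterizer; the two model calls are hypothetical wrappers, not anything from the article):

```python
import cairosvg  # real library: renders SVG to PNG outside the model

def agentic_pelican(ask_for_svg, critique_image, rounds=4):
    """Minimal sketch of the loop being described: the model emits SVG,
    deterministic code rasterizes it, and the rendered raster (not the
    model's imagination of it) is what gets critiqued next round.
    `ask_for_svg` and `critique_image` are hypothetical model wrappers."""
    svg = ask_for_svg("Draw a pelican riding a bicycle as SVG.")
    for _ in range(rounds):
        png = cairosvg.svg2png(bytestring=svg.encode("utf-8"))
        notes = critique_image(png)   # vision call on the rendered image
        svg = ask_for_svg(
            "Improve this SVG of a pelican riding a bicycle.\n"
            f"Current SVG:\n{svg}\nCritique of the render:\n{notes}"
        )
    return svg
```

The rasterization line is the part the model can't hallucinate its way around.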
How does your argument work if I move the rasterization into the Markov chain? Or is your assumption that (SVG, JPG) pairs can never be encoded w/ a neural network?
If that was what was actually happening you'd have a point, but it isn't.
The definitions are not coherent. It's obvious enough to anyone who understands the technical details.
The technical definition of an agent is an LLM being called in a loop, where some of the calls include tool definitions. That's exactly what this is.
"Effectful loops" or "augmented loops" are much more descriptive of what is actually going on & do not confuse the reader w/ incoherent definitions of "agency".
So this whole thread was just you trying to express that you don't like the common accepted definition of the term "agent"?
I prefer coherent definitions instead of corporate marketing. Whether I like the term or not is secondary. Judging by past instances of this same phenomenon I expect the word to lose all meaning as more companies start telling their customers about their "agentic" offerings.
No this is obviously not corporate marketing, this individual is doing many things wrong by their own choice:
"creating an svg is surprisingly revealing" No it is not. They all do the same thing, they add suns and movement lines, and some more details. Like they were all trained on the same thing.
He makes up his own definition of "agent". There are at least 6 different definitions of this word in this space now, and his is again new for no reason.
The core idea here is "being vague and letting the models make weird random choices". This is the exact opposite of ALL direct instructions from the major model and coding agent programs at this time.
An actually interesting methodology would have been to create all combinations of the variables: let them use different SVG-to-image tools and compare them, try many, many different prompts with more specific instructions like "try to be more mechanically accurate".
The analysis is baseless assumptions: it is not "adding realism"; all the models just had more pictures of roads, trees, suns and clouds... so they kept going back to the training data to add more, like you keep telling them to do. It certainly wasn't understanding "more mechanically coherent". If it started focusing on the bike, it had more detailed bike pictures in the training data, with chains.
This is why all the AI stuff is infuriating: people are mistaking so much for "good" or "useful". At best this is a laugh-once joke about how bad it is.
I admit the first time I saw that big George Carlin generated video/stand-up comedy, there was a special new feeling of "what on earth did they prompt to get this combination of visual and audio?" But that was such a fleeting thing I never need it again.
The George Carlin thing was fake - it wasn't written by AI: https://www.nytimes.com/2024/01/26/arts/carlin-lawsuit-ai-po...
> Danielle Del, a spokeswoman for Sasso, said Dudesy is not actually an A.I.
> “It’s a fictional podcast character created by two human beings, Will Sasso and Chad Kultgen,” Del wrote in an email. “The YouTube video ‘I’m Glad I’m Dead’ was completely written by Chad Kultgen.”
I wasn't referring to just the blog post but the general trend in corporate marketing of redefining words in existing use because it is easier than educating their customers about their products & their limitations.