I came to the same conclusion as the authors after generating 1000s of thumbnails[1]. OpenAI alters faces too much and smooths out details by default. NanoBanana is the best but lacks a high-fidelity option. SeeDream is catching up to NanoBanana and is sometimes better. It's been too long since OpenAI's gpt-image-1 came out; I hope they launch a better model soon.
[1] = https://thumbnail.ai/
I am probably at 50k-60k image generations from various models.
It is just very hard to make any generalizations because any single prompt will lead to so many different types of images.
The only thing I would really say to generalize is every model has strengths and weaknesses depending on what you are going for.
It is also generally very hard to explore all the possibilities of a model. So many times I thought I'd seen what the model could do, only to be completely blown away by a particular generation.
Not to self-plug, but I built a tool around this so you can get more than one result at a time, e.g. across models, variants and aspect ratios. I would be interested in you having a crack at it - willing to give you a free premium account for feedback.
https://brandimagegen.com
But someone has to know and evaluate all of those strengths and weaknesses, keep up with new models, etc. That's work someone has to do, or their product loses in quality. But that's fine when all products lose quality across the board.
What do you even do with 50k images? Even at just 10 seconds of attention each, that's a solid week of waking time.
YouTube is full of AI slop right now, and it doesn't take much imagination to recognise how scammers (listed on an exchange or not) are utilising this... Take for instance a political influence organisation generating avatars for vast bot networks that are implanted into social media to influence opinion.
Why so many?
FWIW, when I do txt2img or img2img locally I have the batch size set to 8-12 (so that many variation images are generated from the same seed in the same gen), so it's fairly easy to numerically end up with tens of thousands of images, of which usually 99% are not good.
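For anyone curious what that looks like in practice, here is a minimal local batch sketch using Hugging Face diffusers (the model ID, prompt and batch size are illustrative assumptions, not what the parent necessarily runs); num_images_per_prompt is what yields the several variations per run:

    # Minimal local batch txt2img sketch (assumes the diffusers library and a CUDA GPU).
    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")

    generator = torch.Generator("cuda").manual_seed(42)  # one fixed seed for the whole batch
    images = pipe(
        "a mountain valley at sunrise, ukiyo-e style",
        num_images_per_prompt=8,   # the 8-12 variations per generation mentioned above
        generator=generator,
    ).images

    for i, img in enumerate(images):
        img.save(f"variation_{i:02d}.png")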
I don't know if you looked at the same article as I did, but NanoBanana seems to be the worst by far at following the prompts. Just look at the heat-map images.
Half the time nanobanana doesn't do anything to the photo from my experience, also confirmed in some of these examples.
You can alter prompts yourself though to be clearer about what you want. The other things, you can't change.
Do you run thumbnail.ai? I would really like to try it, but I'm not going to pay before I've seen even a single generated thumbnail in my context. Is it unviable to let people generate at least a few thumbnails before they have to decide whether to pay?
I am a small-time YouTuber.
I run a fairly comprehensive model comparison site (generative and editing). In my experience:
NanoBanana and Flux Kontext are the models that get closest to traditional SDXL inpainting techniques.
Seedream is a strong contender by virtue of its ability to natively handle higher resolutions (up to around 4 megapixels), so you lose less detail - however, it also tends to alter the color palette more often than not.
Finally GPT-image-1 (yellowish filter notwithstanding) exhibits very strong prompt adherence but will almost always change a number of the details.
It was interesting to see how often the OpenAI model changed the face of the child. Often the other two models wouldn't, but OpenAI would alter the structure of their head (making it rounder), eyes (making them rounder), or altering the position and facing of the children in the background.
It's like OpenAI is reducing to some sort of median face a little on all of these, whereas the other two models seemed to reproduce the face.
For some things, exactly reproducing the face is a problem -- for example in making them a glass etching, Gemini seemed unwilling to give up the specific details of the child's face, even though that would make sense in that context.
It looks to me like OpenAI's image pipeline takes an image as input, derives the semantic details, and then essentially regenerates an entirely new image based on the "description" obtained from the input image.
Even Sam Altman's "Ghiblified" twitter avatar looks nothing like him (at least to me).
Other models seem much more able to operate directly on the input image.
You can see this in the images of the Newton: in GPT's versions, the text and icons are corrupted.
Isn't this from the model working on really low-res images, which are then upscaled afterwards?
https://www.reddit.com/r/ChatGPT/comments/1n8dung/chatgpt_pr...
This is inherent in the architecture of ChatGPT. It's a unified model: text, images, etc. all become tokenized input. It's similar to re-encoding your image in a lossy format, except the format is the black box of ChatGPT's latent space.
This leads to incredibly efficient, dense semantic consistency because every object in an image is essentially recreated from (intuitively) an entire chapter of a book dedicated to describing that object's features.
However, it loses direct pixel reference. For some things that doesn't matter much, but humans are very discerning regarding faces.
ChatGPT is architecturally unable to reproduce the input pixels exactly - they're always encoded into tokens, then decoded. This matters more for subjects where we are sensitive to detail loss, like faces.
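To make the "encode to tokens, decode back" point concrete, here is a toy vector-quantization round trip in plain NumPy. It has nothing to do with OpenAI's actual codebook or architecture; it just shows why snapping patches to a finite token vocabulary makes exact pixel reconstruction impossible by construction:

    # Toy illustration of why a token round trip is lossy (not OpenAI's real pipeline).
    import numpy as np

    rng = np.random.default_rng(0)
    image = rng.random((64, 64)).astype(np.float32)       # stand-in for a grayscale image

    # Split into non-overlapping 8x8 patches, flattened to 64-dim vectors.
    patches = image.reshape(8, 8, 8, 8).swapaxes(1, 2).reshape(-1, 64)

    codebook = rng.random((256, 64)).astype(np.float32)   # 256 made-up "image tokens"

    # Encode: each patch becomes the index of its nearest codebook vector.
    dists = ((patches[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    tokens = dists.argmin(axis=1)

    # Decode: look the tokens back up and reassemble the image.
    decoded = codebook[tokens].reshape(8, 8, 8, 8).swapaxes(1, 2).reshape(64, 64)

    print("mean absolute error:", np.abs(image - decoded).mean())  # nonzero: detail is gone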
Encoding/decoding tokens doesn't automatically mean lossy. Images, at least in terms of raw pixels, can be a very inefficient way of storing information from an information-theoretic perspective.
Now, the difficulty is in achieving an encoding/decoding scheme that is both information-efficient AND semantically coherent in latent space. It seems like there is a tradeoff here.
I've noticed that OpenAI modifies faces on a regular basis. I was using it to try and create examples of different haircuts and the face would randomly turn into a different face -- similar but noticeably changed. Even when I prompted to not modify the face, it would do it regardless. Perhaps part of their "safety" for modifying pictures of people?
I had thought it was a deliberate choice to avoid potential abuse; however, Sora put an end to that line of thinking.
It's also changing scene features, like removing background trees.
It's interesting to me that the models often have their "quirks". GPT has the orange tint, but it also is much worse at being consistent with details. Gemini has a problem where it often returns the image unchanged or almost unchanged, to the point where I gave up on using it for editing anything. Not sure if Seedream has a similar defining "feature".
They noted the Gemini issue too:
> Especially with photos of people, Gemini seems to refuse to apply any edits at all
Nano Banana in general cannot do style transfer effectively unless the source image/subject is in a similar style to the target style, which is an interesting and unexpected model quirk. Even the documentation examples unintentionally demonstrate this.
Seedream will always alter the global color balance with edits.
Something like a style transfer works better in Whisk. Still quirky and hit and miss.
I've definitely noticed Gemini's tendency to return the image basically unchanged, but not noticed it being worse or better for images of people. When I tested by having it change aspects of a photo of me, I found it was far more likely to cooperate when I'd specify, for instance, "change the hair from long to short" rather than "Make the hair short" (the latter routinely failed completely).
It also helped to specify which other parts should not be changed, otherwise it was rather unpredictable about whether it would randomly change other aspects.
Not only does it return the image unchanged, but if you are using the Gemini interface, it confidently tells you it made the changes.
Check out Mask Banana - you might have better luck with using masks to get image models to pay attention to what you want edited.
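For reference, the mask-based workflow being hinted at looks roughly like this with the diffusers inpainting pipeline (I don't know what Mask Banana does internally; the model ID and file names are placeholders). White pixels in the mask are the only region the model is allowed to repaint:

    # Generic mask-based inpainting sketch with diffusers; not the Mask Banana product itself.
    import torch
    from diffusers import StableDiffusionInpaintPipeline
    from PIL import Image

    pipe = StableDiffusionInpaintPipeline.from_pretrained(
        "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
    ).to("cuda")

    image = Image.open("portrait.png").convert("RGB")   # placeholder input photo
    mask = Image.open("hair_mask.png").convert("L")     # white = region to edit, black = keep

    result = pipe(
        prompt="short cropped hair, same person, same lighting",
        image=image,
        mask_image=mask,
    ).images[0]
    result.save("edited.png")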
I have had that problem with Nano Banana, but when it works I find it so much better than the others for editing an image. Since it's free I usually try it first, and I would say approximately 10% of the time I find myself having to use something else.
I'm mostly editing pics of food and beverages, though, so it wouldn't surprise me if it is situationally better or worse.
It's crazy that the 'piss filter' of OpenAI image generation hasn't been fixed yet. I wonder if it's on purpose for some reason?
If you don’t want your image to look like it’s been marinated in nicotine, throw stuff like “neutral white background, daylight balanced lighting, no yellow tint” into your prompt. Otherwise, congrats on your free vintage urine filter.
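If you're hitting the API rather than the chat UI, the same trick is just a suffix on the edit prompt. A minimal sketch with the OpenAI Python SDK (the suffix wording is whatever works for you, and this mitigates rather than guarantees a neutral palette):

    # Appending a color-neutralizing suffix to an image edit prompt (OpenAI Python SDK).
    import base64
    from openai import OpenAI

    client = OpenAI()
    NEUTRAL_SUFFIX = " Neutral white balance, daylight-balanced lighting, no yellow tint."

    result = client.images.edit(
        model="gpt-image-1",
        image=open("input.png", "rb"),
        prompt="Add light morning fog to the valley." + NEUTRAL_SUFFIX,
    )

    with open("output.png", "wb") as f:
        f.write(base64.b64decode(result.data[0].b64_json))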
They don't want you creating images that mimic either works of other artists to an extent that's likely to confuse viewers (or courts), or that mimic realistic photographs to an extent that allows people to generate low-effort fake news. So they impose an intentionally-crappy orange-cyan palette on everything the model generates.
Peak quality in terms of realistic color rendering was probably the initial release of DALL-E 3. Once they saw what was going to happen, they fixed that bug fast.
SDXL and FLUX models with LoRAs can and do vastly outperform the singular big models at tons of things those can't or won't do now. Various subreddits and CivitAI blogs describe ComfyUI workflows and details on how to maximize LoRA effectiveness, and are probably all you need for a guided tour of that space.
This is not my special interest, but the DIY space is much more interesting than the SaaS offerings; this is something that holds for generative AI more generally: the DIY scene is going to be more interesting.
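As a starting point, loading a LoRA on top of SDXL in plain diffusers is only a couple of lines; ComfyUI wires up the same thing graphically. The LoRA repo and weight file below are hypothetical placeholders, not recommendations:

    # Minimal SDXL + LoRA sketch with diffusers; the LoRA repo/file names are made up.
    import torch
    from diffusers import StableDiffusionXLPipeline

    pipe = StableDiffusionXLPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
    ).to("cuda")

    # Any SDXL-compatible LoRA from CivitAI or Hugging Face would go here.
    pipe.load_lora_weights("some-user/some-style-lora", weight_name="style.safetensors")

    image = pipe("product photo of a ceramic mug, studio lighting").images[0]
    image.save("mug.png")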
Agreed. People can do things at home that couldn't be done with any of the overcapitalized-but-thoroughly-nerfed commercial models.
LOL. If you believe that, let me tell you about this bridge I've got.
(Shrug) It fits the facts. Do a GIS for images from DALL-E 3 and provide an alternative explanation for what you see.
It absolutely did not do that on day 1.
OpenAI's new image generation model is autoregressive, while DALL-E was diffusion. The yellowish tone is an artefact of their autoregressive pipeline, if I recall correctly.
Could be. My point is that if the pipeline itself didn't impart an unmistakable character to the generated images, OpenAI would feel compelled to make it do so on purpose.
Most DALL-E 3 images have an orange-blue cast, which is absolutely not an unintended artifact. You'd literally have to be blind to miss it, or at least color-blind. That wasn't true at first -- check the original paper, and try the same prompts! It was something they started doing not long after release, and it's hardly a stretch to imagine why.
They will be doing the same thing for the same reasons today, assuming it doesn't just happen as a side effect.
I found OpenAI too often heavy-handed. On balance, I'd probably pick Gemini narrowly over Seedream and just learn that sometimes Gemini needs a more specific prompt.
Timings were measured on a consumer internet connection in Japan (Fiber connection, 10 Gbps nominal bandwidth) during a limited test run in a short time period.
"consumer internet connection in Japan", "10 Gbps nominal bandwidth"
Coming from a third world country, that surprises me.
The 10 Gbit connection costs me ¥5,000/mo (around USD 30/mo), which was actually slightly cheaper than I was paying for 1 Gbit...
The main issue is latency and bandwidth across the oceans, since Asia is far away from the US where a lot of servers live. Even for services that are distributed, I live in a rural prefectural capital of Japan, 1000 km away from Tokyo where all the "Japan" data centers are, so my ping is always unimpressive despite the bandwidth.
You can always identify the OpenAI result because it's yellow.
And mid journey because it's cell shading:)
Also because it’s mid :)
It's disturbing how the models sometimes alter the objects in the images when they're only supposed to add an effect. That's not just a complete failure of the task, it also means manual work, since a human has to double-check every detail in every image.
The shortcut to flip between models in an expanded view is nice, but the original image should also be included as one of the things to flip between, and should be included in the side by side view.
> If you made it all the way down here you probably don’t need a summary
Love the optimism
I skipped to the end to see if they did any local models. spoilers: they didn't.
They are building a product and said the unit economics must make sense. Local models have higher latency unless you keep a GPU running for hours, which gets expensive fast.
Local models will make a lot more sense once we have the scale for it, but when your user count is still small paying cents per image is a much better deal than paying for a GPU either in a data center or physically.
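As a rough sanity check on that tradeoff, the back-of-the-envelope arithmetic looks like this (every number below is an assumption for illustration, not a quote from any provider):

    # Back-of-the-envelope break-even: per-image API pricing vs. renting a GPU.
    api_cost_per_image = 0.04     # USD per image via an API (assumed)
    gpu_cost_per_hour = 2.00      # USD per hour for a rented GPU (assumed)
    seconds_per_image = 10        # local generation time per image (assumed)

    images_per_gpu_hour = 3600 / seconds_per_image             # 360
    gpu_cost_per_image = gpu_cost_per_hour / images_per_gpu_hour

    print(f"GPU cost per image if fully utilized: ${gpu_cost_per_image:.4f}")
    print(f"Break-even utilization: {gpu_cost_per_hour / api_cost_per_image:.0f} images per GPU-hour")

Under these made-up numbers the GPU only wins if you keep it busy with more than about 50 images an hour; below that, paying cents per image is cheaper, which is the point above.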
Local models are definitely something I want to dive into more, if only out of personal interest.
Honestly, I think it was ill-founded. As a photographer and artist myself, I find the OpenAI results head and shoulders above the others. It's not perfect, and in a few cases one or the other alternative did better, but if I had to pick one, it would be OpenAI for sure. The gap between their aesthetics and mine makes me question ever using their other products (which is purely academic since I'm not an Apple person).
How many of those results did you actually look at? I thought it did OK with the cats, but check the other images: OpenAI straight up failed to do the prompt a large fraction of the time.
I'm wondering if I read the same article. Yeah, I looked at every single one of the results. And although it's been a couple of hours since, I don't recall ANY examples where it completely failed to do anything. Can you point me to an example?
I don't want to go through every image, but for the mountain:
It failed "Remove background", "Isolate the background", "Long exposure" (kept the people), "Apply a fish-eye lens effect" (geometry incorrect), and "Strong bokeh blur" (wrong blur type).
Some were more ambiguous. "Give it a metallic sheen" looked cool, but that isn't a metallic sheen, and IMO it just failed the ukiyo-e Japanese woodblock print style, though I wouldn't object to calling it a vaguely Japanese style. Compare how colors blend in ukiyo-e woodblocks vs how OpenAI's sky is done.
Removing the background is impossible - or more to the point, it would yield a blank image. There is no foreground in the image, it would wind up removing everything. Which also means that its result for isolate the background is exactly right. Although we might want to argue that the lower part of the image is a midground, that's ambiguous.
You're mostly right to criticize the fisheye - it's plausibly a fisheye image, but not one derived from the original. For bokeh, you're right that it got the mountain wrong. But it did get the other samples, and it's the only one that seems to know what bokeh is at all, as the other models got none of them (other than Seedream getting the Newton right).
For the "metallic sheen", I assume you mean where they said "give the object a metallic sheen", since the first attempt had OpenAI giving the image itself a quality as if it were printed or etched on metal, arguably correct. But for that second one, for all but the 4th sample, OpenAI did it best for the mountain and the Rubik's cube, and no worse for the cats and the car. Seedream wins for the Newton.
I don't have any knowledge of the Japanese styles requested, so I'm not judging those.
I've reviewed your examples, and it hasn't changed my mind.
> I don't recall ANY examples where it completely failed to do anything
> I’ve reviewed your examples, and it hasn’t changed my mind.
I think I have a better understanding of your thinking, but IMO you're using a bar so low that effectively anything qualifies. "it's the only one that seems to know what bokeh is at all, as the other models got none of them (other than Seedream getting the Newton right)." For bokeh, look at the original and then the enlarged images of the car. OpenAI blurs the entire image, car and ground, fairly uniformly, whereas Seedream keeps the car in focus while blurring background elements, including the ground where it's far enough back. Same deal with the cats: the original has far more distant objects in the upper right, which Seedream puts out of focus while keeping the cats in focus, while OpenAI blurs everything.
In my mind the other models also did quite poorly in general, but when I exclude failures I don't judge OpenAI as the winner. For example, on the kaleidoscope task OpenAI's girl image didn't have radial symmetry, so it simply failed the task; Gemini's, on the other hand, looks worse but qualifies as a bad approximation of the task.
Interesting experiment, though I'm not certain quite how the models are usefully compared.
Well, that is a good point. That is for everyone to decide for themselves, I suppose.
I like to think in terms of how often a model failed versus succeeded. So what I did was look at the worst result each time. To me, the one that stood out (negatively) was Gemini. OpenAI had some very good results but also some that missed the mark. SeeDream (which I had never heard of previously) missed the mark less often than Gemini, and at times where OpenAI failed, SeeDream came out clearly on top.
So, if I were to use the effects of the mentioned models, I wouldn't bother with Gemini; only OpenAI and SeeDream.
Every day I generate more than 600 images and also compare them; it takes me 5 hours.
Hey. We'd love to fund the generations for free for you to try Riverflow 2 out, if you're up for it. Riverflow 1 ranks above them all, and 2 is now in preview this week.
I dunno about you lot, but I actually really like Stable Diffusion 1.5.
I like giving it weird non-prompts, like lines from songs or novels. I then run it for a few hundred generations locally and do stuff with the malformed shit it comes out with. I have a few art projects like this.
Aphex Twin vibes.
ChatGPT is the only one I've found that can transform an image to a specified size, e.g. "resize this image to be 1280x1024 pixels".
A thing worth noting is that Seedream 4.0 is uncensored and was seemingly trained on a lot of uncensored stuff.
I like that they call OpenAI's image generator groundbreaking and then explain that it's prone to taking eight times longer to generate an image, before showing it add a third cat over and over and over again.
I meant to say it was groundbreaking when it was released; the other models came later.
Is it just me, or does ChatGPT change subtle, and sometimes more prominent, things? Like the ball-holding position of the hand, facial features, the shape of the head, background trees and the like?
It's not you. The model seems to refuse to accurately reproduce details. It changes things and leaves stuff out every time.
Highly amusing that the OpenAI model is still Ghibli-fying its outputs.
I wish they'd used a better image than the low contrast mountain, which rarely transformed into anything much.
Seedream is the only one that outputs 4K. Last time I checked, that is.
For all the AI slop and studies saying AI is more hype than substance I will say that this use case is one that seems very legit.
The stock photo industry was always pretty bad and silly expensive. Being able to custom generate visuals and photos to replace that is a good use case of AI IMHO. Yes sometimes it does goofy things, but it’s getting quite good. If AI blows up the stock photo industry few will shed a tear.
yeah fuck stock photos
We built our sandbox just for this use case: fal.ai/sandbox. Take the same image/prompt and compare across tens of models.
Using gen AI for filters is stupid: a filter guarantees the same object, just filtered, whereas a gen-AI version of this guarantees nothing and runs up an expensive AI bill.
It's like using gen AI to do math instead of extracting the numbers from a word problem and just doing the math with +, -, / and *.
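The point above, in code: a conventional filter (or resize, as someone asked ChatGPT to do upthread) is a deterministic function of the pixels, so the object cannot change. A quick Pillow sketch, with arbitrary filter choices:

    # Deterministic "filters": same pixels in, same pixels out, the object is guaranteed.
    from PIL import Image, ImageFilter, ImageOps

    img = Image.open("photo.jpg")

    blurred = img.filter(ImageFilter.GaussianBlur(radius=4))   # blur, nothing else changes
    sepia = ImageOps.colorize(ImageOps.grayscale(img), black="#2e1f0f", white="#f5e6c8")
    resized = img.resize((1280, 1024))                          # the "resize to 1280x1024" case

    for name, out in [("blur", blurred), ("sepia", sepia), ("resized", resized)]:
        out.save(f"photo_{name}.jpg")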
Are artists and illustrators going the way of the horse and buggy?
No, but this is the beginning of a new generation of tools to accelerate productivity. What surprises me is that the AI companies are not market savvy enough to build those tools yet. Adobe seems to have gotten the memo though.
In testing some local image gen software, it takes about 10 seconds to generate a high quality image on my relatively old computer. I have no idea the latency on a current high end computer, but I expect it's probably near instantaneous.
Right now, though, the software for local generation is horrible. It's a mish-mash of open-source stuff with varying compatibility, loaded with casually excessive use of vernacular and acronyms. To say nothing of the awkwardness of it mostly being done in Python scripts.
But once it gets inevitably cleaned up, I expect people in the future are going to take being able to generate unlimited, near instantaneous images, locally, for free, for granted.
Did you test local image gen software in the sense that you installed the Python code from the GitHub page for a local model, which is clearly a LOT for a normal user... or did you look at ComfyUI, which is how most people are running local video and image models? There are "just install this" versions, which ease the path for users (but it's still, admittedly, chaos beneath the surface).
Interesting you say that. No, I've tried out Invoke and AUTOMATIC1111/WebUI. I specifically avoided ComfyUI because of my inexperience in this and the fact that people described it as a much more advanced system, with manual wiring of the pipeline and so on.
It's likely that I'm seeing this from deep inside my ComfyUI bubble. My impression was that AUTOMATIC1111, Forge and the like were fading, and that ComfyUI was what people ended up on no matter which AI generation framework they started with. But I don't know that there are any real stats on usage of these programs, so it's entirely possible that AUTOMATIC1111/Forge/InvokeAI are being used by more people than ComfyUI.
> Adobe seems to have gotten the memo though.
So far Adobe AI tools are pretty useless, according to many professional illustrators. With Firefly you can use other (non-Adobe) image generators. The output is usually barely usable at this point in time.
I heard it's useful for non-illustrators? Surely those non-professionals will pay for Adobe software.
I've been waiting for solutions that integrate into the artistic process instead of replacing it. Right now a lot of the focus is on generating a complete image, but if I were in Photoshop (or another editor) and could use AI tooling to create layers and other modifications that fit into a workflow, that would help with consistency and productivity.
I haven't seen the latest from Adobe over the last three months, but last I saw, the Firefly engine was still focused on "magically" creating complete elements.
DxO PureRaw & Topaz for photography are both "AI" tools that integrate into the workflow. Mostly for denoising & sharpening photographs.
"AI won't replace you, but someone who knows how to use AI will replace you" appears to be too short a phrase.
There is no better recent example than AI comedy made by a professional comedian [0]
Of course, this makes sense once you think about it for a second. Even AGI, without a BCI, could not read your mind to understand what you want. Naturally, the people who have been communicating these ideas to other humans up to this point are the best at doing that.
[0] old.reddit.com/r/ChatGPT/comments/1oqnwvt/ai_comedy_made_by_a_professional_comedian/
> There is no better recent example than AI comedy made by a professional comedian
To clarify, the “comedy” part of this “AI comedy” was written entirely by a human with no assistance from a language model.
> For anyone interested in my process. I wrote every joke myself, then use Sora 2 to animate them.
Exactly.
Apologies if I wrote my original comment poorly, but that was what I was trying to communicate.
Not only was this person able to write good comedy, but they knew what tools were available and how to use them.
I previously wrote:
> "AI won't replace you, but someone who knows how to use AI will replace you." ...
The missing part is "But a person who was excellent at their pre-AI job, will replace ten of the people down the chain."
The possible analog that just popped into my head is the nearly always missed part of the quote "the customer is always right" ... "in matters of taste."
> a person who was excellent at their pre-AI job, will replace ten of the people down the chain
I think comedy is a great example of how this is not the general case.
In this instance, the video you posted was the result when a comic used a tool to make a non-living thing say their jokes.
That’s not new, that’s a prop. It’s ventriloquism. People have been doing that gag since the first crude marionette was whittled.
The existence of prop comics isn’t an indicator that that’s the pinnacle of comedy (or even particularly good). If Mitch Hedburg had Jeff Dunham’s puppets it probably would’ve been… fine, but if Jeff Dunham woke up tomorrow with Hedburg’s ability to write and deliver jokes his life and career would be dramatically changed forever.
Better dummies will benefit some ventriloquists but there’s no reason to think that this is the moment that the dummies get so good that everyone will stop watching humans and start watching ventriloquists (which is what would have to happen for one e-ventriloquist putting 10 comedians out of a job to be a regular thing)
This is likely the most intriguing response to a comment that I have ever received on this website.
I will be right back after I think of a reply while in the shower, many months in the future.
I appreciate this response given I realized I misspelled Hedberg twice
Yes and no. IKEA and co. didn't replace custom-made tables, they just reduced the number of people needing a custom table.
The same will happen to music, artists, etc. They won't vanish, but only a few per city will be left.
For some applications.
Photography didn’t make artists obsolete.
For that matter, the car didn’t make horse riding completely obsolete either.
For artists, the question is whether generative AI is like photography or the car. My guess, at this stage, is photography.
For what it’s worth I think the proponents of generative AI are grossly overestimating the utility and economic value of meh-OK images that approximate the thing you’ve asked for.
I've seen cover art on a lot of magazines already replaced with AI images. I suspect, for the time being, that a lot of the low-hanging art fruit will be destroyed by image generation. The knock-on effect is fewer art jobs, but more artists. In the vein of your analogy, it removes the gas station attendants who fill your tank.
The more I think about it, the more I suspect most artists/illustrators will be replaced by workers who can't draw or paint but are better than artists at generating AI prompts.
And some day the news will announce that the last human actor has died.
No. People create art as a form of expression and other people enjoy it because it resonates with them. Nobody that’s inclined to artistically express a thought or feeling is going to give up on creativity because maybe somebody that isn’t really interested in creating art might be able to type words into their computer and spit out something vaguely similar.
That aside, humans are necessary for making up new forms and styles. There was no cubism before Picasso and Braque, or pointillism before Seurat and Signac. I don’t think I’ve seen anyone argue that if you trained a diffusion model on only the art that Osamu Tezuka was exposed to before he turned 24 it would output Astro Boy.
When there's a need for something with specific traits and composition at high quality, I've yet to see a model that can deliver that, especially in a reasonable amount of time. It's still way more reliable to just hand a description to a skilled illustrator along w/references and then go back and forth a bit to get a quality result. The illustrator is more expensive, but my time isn't free, so it works out.
I could see that changing in a few years.
Artists no; illustrators and graphic designers yes. They'll mostly become redundant within the next 50 years. With these kinds of technologies, people tend to overestimate the short-term effects and severely underestimate the long-term effects.
Horse and buggy isn't quite the analogy, I think it is more like the arrival of junk food, packed with sugar, salt and saturated fats. You will still be able to find a cafe or restaurant where a full kitchen team cooks from scratch but everything else is fast food garbage.
Maybe just the advent of the microwave oven is the analogy.
Either way, I am out. I have spent many days fiddling with AI image generation but, looking back on what I thought was 'wow' at the time, I now think all AI art is practically useless. I only managed one image I was happy with, and most of that was GIMP, not AI.
This study has confirmed my suspicions, hence I am out.
Going back to the fast food analogy: for the one restaurant that actually cooks real food from real ingredients, if everyone else is selling junk food then the competition has been decimated. However, the customers have been decimated too. This isn't too bad, as those customers clearly never appreciated proper food in the first place, so why waste effort on them? It is a pearls-before-swine type of thing.
Would have loved to see Grok (xAI) in there, by my (limited) experience it is often better than OpenAI or Gemini.
This seems to imply that the capabilities being tested correspond to the descriptive words used in the prompts, but as a category, using random words would be just as valid for exercising the extents of the underlying math. And when I think about that, I wonder why a list of tests like this should be interesting, and to what end. The repeated iteration implies that some control or better quality is being sought, but the mechanism of exploration is just trial and error, and not informative of what would be repeatable success for anyone else in any other circumstance given these discoveries.
A tiny, high-signal TL;DR:
• OpenAI (gpt-image-1): The wild artist. Best for creative, transformative, style-heavy edits—Ghibli, watercolor, fantasy additions, portals, sci-fi stuff, etc. But it hallucinates a lot and often distorts fine details (especially faces). Slowest.
• Gemini (flash-image / nanoBanana): The cautious realist. Best for subtle, photorealistic edits—fog, lighting tweaks, gentle filters, lens effects. Almost never ruins details, but sometimes refuses to do artsy transformations, especially on human photos.
• Seedream: The adventurous middle child. Faster, cheaper, and often surprisingly good at aesthetic effects—bokeh, low-poly, ukiyo-e, metallic sheen, etc. Not as creative as OpenAI, not as conservative as Gemini. Can hallucinate, but in fun ways.
Bottom line:
• Creative prompts → OpenAI
• Realistic photo edits → Gemini
• Budget-friendly, balanced option → Seedream
If you’re planning an automated pipeline, routing “artistic” prompts to OpenAI and “photorealistic” ones to Gemini (with Seedream as a wildcard) matches their own conclusion.
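A minimal sketch of what that routing could look like (the keyword lists and model names below are assumptions; a real pipeline would more likely classify prompts with a small model rather than keyword matching):

    # Toy prompt router matching the summary above; keyword lists are illustrative only.
    ARTISTIC = {"ghibli", "watercolor", "ukiyo-e", "low-poly", "oil painting", "fantasy"}
    PHOTOREAL = {"fog", "lighting", "bokeh", "lens", "white balance", "color grade"}

    def route(prompt: str) -> str:
        p = prompt.lower()
        if any(k in p for k in ARTISTIC):
            return "gpt-image-1"          # creative / style-heavy edits
        if any(k in p for k in PHOTOREAL):
            return "gemini-flash-image"   # subtle photorealistic edits
        return "seedream"                 # cheap wildcard default

    print(route("turn this into a watercolor painting"))   # gpt-image-1
    print(route("add light fog and warmer lighting"))      # gemini-flash-image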
LLM slop comment. 5 of them within 5 minutes.