
Create and edit images with Gemini 2.0 in preview

192 points | 18 hours ago | developers.googleblog.com
vunderba 14 hours ago

I've added/tested this multimodal Gemini 2.0 to my shoot-out of SOTA image gen models (OpenAI 4o, Midjourney 7, Flux, etc.) which contains a collection of increasingly difficult prompts.

https://genai-showdown.specr.net

I don't know how much of Google's original Imagen 3.0 is incorporated into this new model, but the overall aesthetic quality seems to be unfortunately significantly worse.

The big "wins" are:

- The multimodal aspect, which keeps parity with OpenAI's offerings.

- An order of magnitude faster than OpenAI 4o image gen

esperent 8 hours ago

Your site is really useful, thanks for sharing. One issue is that the list of examples sticks to the top and covers more than half of the screen on mobile, could you add a way to hide it?

If you're looking for other suggestions a summary table showing which models are ahead would be great.

vunderba 5 hours ago

Great point - when I started building it I think I only had about four test cases, but now the nav bar is eating 50% of the vertical space, so I've removed it from the mobile layout!

Wrt the summary table, did you have a different metric in mind? The top of the page should already show a "Model Performance" chart with OpenAI 4o and Google Imagen 3 leading the pack.

ticulatedspline 8 hours ago

Excellent site! OpenAI 4o is more than mildly frightening in its ability to understand the prompt. What seems to be holding it back is mostly a tendency away from photo-realism (or even typical digital art styles) and its own safeguards.

avereveard 7 hours ago

It's a bit expensive/slow, but for styled requests I let it do the base image, and when I'm happy with the composition I ask it to remake it as a picture or in whatever style is needed.

troupo 5 hours ago

I also find it weird how it defaults/devolves into this overall brown-ish style. Once you see it, you see it everywhere

echelon 7 hours ago

Multimodal is the only image generation modality that matters going forward. Flux, HiDream, Stable Diffusion, and the like are going to be relegated to the past once multimodal becomes more common. Text-to-image sucks, and image-to-image with all the ControlNets and Comfy nodes is cumbersome in comparison to true multimodal instructiveness.

I hope that we get an open weights multimodal image gen model. I'm slightly concerned that if these things take tens to hundreds of millions of dollars to train, that only Google and OpenAI will provide them.

That said, the one weakness in multimodal models is that they don't let you structure the outputs yet. Multimodal + ControlNets would fix that, and that would be like literally painting with the mind.

The future, when these models are deeply refined and perfected, is going to be wild.

zaptrem 5 hours ago

Good chance a future llama will output image tokens

echelon 5 hours ago

That's my hope. That Llama or Qwen bring multimodal image generation capabilities to open source so we're not left in the dark.

If that happens, then I'm sure we'll see slimmer multimodal models over the course of the next year or so. And that teams like Black Forest Labs will make more focused and performant multimodal variants.

We need the incredible instructivity of multimodality. That's without question. But we also need to be able to fine tune, use ControlNets to guide diffusion, and to compose these into workflows.

liuliu 8 hours ago

Do you mind sharing which HiDream-I1 model you are using? I am getting better results with these prompts from my implementation inside Draw Things.

vunderba 5 hours ago

Sure - I was using "hidream-i1-dev", but if you're seeing better results I might rerun the HiDream tests with the "hidream-i1-full" model.

I've been thinking about possibly rerunning the Flux Dev prompts using Flux 1.1 Pro, but I like having a baseline reference for images that can be generated on consumer hardware.

pkulak 10 hours ago

> That mermaid was quite the saucy tart.

Really now?

belter 14 hours ago

Your shoot-out site is very useful. Could I suggest adding prompts that expose common failure modes?

For example, asking the models to show clocks set to a specific time or people drawing with their left hand. I think most, if not all, models will display every clock with the same time... and portray subjects drawing with their right hand.

vunderba 12 hours ago

@belter / @crooked-v

Thanks for the suggestions. Most of the current prompts came from personal images that I wanted to generate, so I'll try to add some "classic GenAI failure modes". Musical instruments such as pianos used to be a pretty big failure point as well.

troupo 5 hours ago

For personal images I often play with wooly mammoths, and most models are incapable of generating anything but textbook images. Any deviation either becomes an elephant or an abomination (bull- or bear-like monsters)

crooked-v 13 hours ago

Another one I would suggest is buildings with specific unusual proportions and details (e.g. "the mansion's west wing is twice the height of the right wing and has only very wide windows"). I've yet to find a model that will do that kind of thing reliably, where it seems to just fall back on the vibes of whatever painting or book cover is vaguely similar to what's described.

droopyEyelids 13 hours ago

Generating a simple maze for kids is also not possible yet.

vunderba 12 hours ago

Love this one so I've added it. The concept is very easy for most GenAI models to grasp, but it requires a strong overall cohesive understanding. Rather unbelievably, OpenAI 4o managed to produce a pass.

I should also add an image that is heavy with "greebles". GenAI usually lacks the fidelity for these kinds of minor details, so although it adds them, they tend to fall apart under more than a cursory examination.

https://en.wikipedia.org/wiki/Greeble

eminence32 16 hours ago

This seems neat, I guess. But whenever I try tools like this, I often run into the limits of what I can describe in words. I might try something like "Add some clutter to the desk, including stacks of paper and notebooks" but when it doesn't quite look like what I want, I'm not sure what else to do except try slightly different wordings until the output happens to land on what I want.

I'm sure part of this is a lack of imagination on my part about how to describe the vague image in my own head. But I guess I have a lot of doubts about using a conversational interface for this kind of stuff

monster_truck 16 hours ago

Chucking images at any model that supports image input and asking it to describe specific areas/things 'in extreme detail' is a decent way to get an idea of what it's expecting vs what you want.

thornewolf 16 hours ago

+1 to this flow. I use the exact same phrase "in extreme detail" as well haha. Additionally, I ask the model to describe what prompt it might write to produce some edit itself.
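
For anyone who wants to script that loop rather than paste into a chat UI, here's a minimal sketch against the same generateContent REST endpoint used in the curl example further down the thread. The model name, file name, and exact part field names are assumptions to adapt, not anything from the comments above:

  # Minimal sketch of the "describe it in extreme detail" round trip, assuming
  # the generateContent REST endpoint shown elsewhere in this thread.
  # Model name, file name, and part field names are assumptions.
  import base64, os, requests

  api_key = os.environ["GEMINI_API_KEY"]
  model = "gemini-2.0-flash"  # any multimodal Gemini text model should work
  url = ("https://generativelanguage.googleapis.com/v1beta/models/"
         f"{model}:generateContent?key={api_key}")

  with open("reference.png", "rb") as f:
      image_b64 = base64.b64encode(f.read()).decode()

  payload = {
      "contents": [{
          "parts": [
              {"inline_data": {"mime_type": "image/png", "data": image_b64}},
              {"text": "Describe this image in extreme detail, then write the "
                       "prompt you would use to produce the edit I have in mind."},
          ]
      }]
  }

  resp = requests.post(url, json=payload, timeout=120)
  resp.raise_for_status()
  print(resp.json()["candidates"][0]["content"]["parts"][0]["text"])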

crooked-v 15 hours ago

I just tried a couple of cases that ChatGPT is bad at (reproducing certain scenes/setpieces from classic tabletop RPG adventures, like the weird pyramid from classic D&D B4 The Lost City), and Gemini fails in just about the same way of getting architectural proportions and scenery details wrong even when given simple, broad rules about them. Adding more detail seems kind of pointless when it can't even get basics like "creature X is about as tall as the building around it" or "the pyramid is surrounded by ruined buildings" right.

BoorishBears 13 hours ago

What's an example of a prompt you tried and it failed on?

metalrain 5 hours ago

Exactly. With more complex compositions, lighting, and image enhancements/filters, there are so many things where you know how they look, but describing them such that the LLM gets it and will reproduce it is pretty difficult.

Sometimes sketching it can help, but more abstract technical things like LUTs still feel out of reach.

qoez 16 hours ago

Maybe that's how the future will unfold. There will be subtle things AI fails to learn, and there will be differences in how good people are at making AI do things, which will be a new skill in itself and will end up being a determining difference in pay in the future.

gowld 10 hours ago

This is "Prompt Engineering"

betterThanTexas 13 hours ago

> I'm sure part of this is a lack of imagination on my part about how to describe the vague image in my own head.

This is more related to our ability to articulate than is easy to demonstrate, in my experience. I can certainly produce images in my head I have difficulty reproducing well and consistently via linguistic description.

SketchySeaBeast 13 hours ago

It's almost as if creating art that is accurate to our mental vision requires practice and skill, be it the ability to create an image or to write it and evoke an image in others.

betterThanTexas 13 hours ago

Absolutely! But this was surprising to me—my intuition says if I can firmly visualize something, I should be able to describe it. I think many people have this assumption and it's responsible for a lot of projection in our social lives.

SketchySeaBeast 12 hours ago

Yeah, it's probably a good argument for having people try some form of art, to have them understand that their intent and their outcome are rarely the same.

bufferoverflow 10 hours ago

In that scenario, if you can't describe what you want with words, a human designer can't read your mind either.

Hasnep 10 hours ago

No, but a good designer will be able to help you put what you want into words.

gowld 10 hours ago

Ask the AI to help you put what you want into words.

maksimur 3 hours ago

I think the issue with AI (in contrast to human interaction) is its lack of real-time responsiveness. This slower back-and-forth can lead to frustration, especially if it takes a dozen or more messages to get the point across. Humans are also helped in helping you by contextual cues like gestures, facial expressions or "shared qualia".

xbmcuser 15 hours ago

Ask Gemini to word your thoughts better, then use those words to do the image editing.

Nevermark 14 hours ago

Perhaps describe the types and styles of work associated with the desk, to give the clutter a coherent character.

zoogeny 14 hours ago

I would politely suggest you work at getting better at this since it would be a pretty important skill in a world where a lot of creative work is done by AI.

As some have mentioned, LLMs are treasure troves of information for learning how to prompt the LLM. One thing to get over is a fear of embarrassment in what you say to the LLM. Just write a stream of consciousness to the LLM about what you want and ask it to generate a prompt based on that. "I have an image that I am trying to get an image LLM to add some clutter to. But when I ask it to do it, like I say add some stack of paper and notebooks, but it doesn't look like I want because they are neat stacks of paper. What I want is a desk that kind of looks like it has been worked at for a while by a typical office worker, like at the end of the day with a half empty coffee cup and .... ". Just ramble away and then ask the LLM to give you the best prompt. And if it doesn't work, literally go back to the same message chain and say "I tried that prompt and it was [better|worse] than before because ...".

This is one of those opportunities where life is giving you an option: give up or learn. Choose wisely.

mkl 7 hours ago

> what the lamp from the second image would look like on the desk from the first image

The lamp is put on a different desk in a totally different room, with AI mush in the foreground. Props for not cherry-picking a first example, I guess. The sofa colour one is somehow much better, with a less specific instruction.

cyral 6 hours ago

That one is an odd example... especially since image #3 does a similar task with excellent accuracy in keeping the old image intact. I've had the same issues when trying to make it visualize adding decor; it ends up changing the whole room or the furniture materials.

cush 16 hours ago
simonw 12 hours ago

Be a bit careful playing with this one. I tried this:

  curl -s -X POST \
    "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.0-flash-preview-image-generation:generateContent?key=$GEMINI_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{
      "contents": [{
        "parts": [
          {"text": "Provide a vegetarian recipe for butter chicken but with chickpeas not chicken and include many inline illustrations along the way"}
        ]
      }],
      "generationConfig":{"responseModalities":["TEXT","IMAGE"]}
    }' > /tmp/out.json
And got back 41MB of JSON with 28 base64 images in it: https://gist.github.com/simonw/55894032b2c60b35f320b6a166ded...

At roughly 4c per image, those 28 images come to just over a dollar for that single prompt.

I built this quick tool https://tools.simonwillison.net/gemini-image-json for pasting that JSON into to see it rendered.
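
If you'd rather pull those inline images out locally than paste 41MB of JSON into a browser, here's a rough sketch; it assumes the usual candidates -> content -> parts response shape with base64 image data under inlineData, so treat the field names as assumptions:

  # Rough sketch: save each inline image from a response like /tmp/out.json.
  # Assumes the standard candidates -> content -> parts shape; the REST API
  # documents camelCase field names (inlineData, mimeType), but both spellings
  # are checked here just in case.
  import base64, json, sys

  path = sys.argv[1] if len(sys.argv) > 1 else "/tmp/out.json"
  with open(path) as f:
      response = json.load(f)

  count = 0
  for candidate in response.get("candidates", []):
      for part in candidate.get("content", {}).get("parts", []):
          blob = part.get("inlineData") or part.get("inline_data")
          if not blob:
              continue  # text part, skip
          ext = blob.get("mimeType", blob.get("mime_type", "image/png")).split("/")[-1]
          with open(f"image_{count:02d}.{ext}", "wb") as out:
              out.write(base64.b64decode(blob["data"]))
          count += 1

  print(f"wrote {count} images")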

weird-eye-issue 10 hours ago

I mean you did ask for "many illustrations"

Yiling-J 10 hours ago

I generated 100 recipes with images using gemini-2.0-flash and gemini-2.0-flash-exp-image-generation as a demo of text+image generation in my open-source project: https://github.com/Yiling-J/tablepilot/tree/main/examples/10...

You can see the full table with images here: https://tabulator-ai.notion.site/1df2066c65b580e9ad76dbd12ae...

I think the results came out quite well. Be aware I don't generate a text prompt based on the row data for image generation. Instead, the raw row data (ingredients, instructions...) and table metadata (column names and descriptions) are sent directly to gemini-2.0-flash-exp-image-generation.

minimaxir 16 hours ago

Of note: Gemini 2.0 image generation is priced at $0.039 per image, which is more expensive than Imagen 3 ($0.03 per image): https://ai.google.dev/gemini-api/docs/pricing

The main difference is that Gemini can incorporate a conversation when generating the image, as demoed here, while Imagen 3 is strictly text-in/image-out with optional mask-constrained edits, but it likely allows for higher-quality images overall if you're skilled with prompt engineering. This is a nuance that is annoying to differentiate.

vunderba 14 hours ago

Anecdotal, but from preliminary side-by-side sandbox testing of Gemini 2.0 Flash and Imagen 3.0, that definitely appears to be the case - higher overall visual quality from Imagen 3.

ipsum2 15 hours ago

> likely allows for higher-quality images overall

What makes you say that?

thornewolf 16 hours ago

Model outputs look good-ish. I think they are neat. I updated my recent hack project https://lifestyle.photo to the new model. It's middling-to-good.

There are still a lot of failure modes, but what I want is a very large cookbook showing what the known-good workflows are. Since this is so directly downstream of (limited) training data, it might be that I am just prompting in an ever so slightly bad way.

sigmaisaletter 16 hours ago

Re your project: I'd expect at least the demo to not have an obvious flaw. The "lifestyle" version of your bag has a handle that is nearly twice as long as the "product" version.

thornewolf 16 hours ago

This is a fair critique. While I am merely an "LLM wrapper", I should put the product's best foot forward and pay more attention to my showcase examples.

nico 16 hours ago

Love your project, great application of gen AI, very straightforward value proposition, excellent and clear messaging

Very well done!

thornewolf 16 hours ago

Thank you for the kind words! I am looking forward to creating a Show HN next week alongside a Product Hunt announcement. I appreciate any and all feedback. You can provide it through the website directly or through the email I have attached in my bio.

mNovak 15 hours ago

I'm getting mixed results with the co-drawing demo, in terms of understanding what stick figures are, which seems pretty important for the 99% of us who can't draw a realistic human. I was hoping to sketch a scene, and let the model "inflate" it, but I ended up with 3D rendered stick figures.

It seems to help if you explicitly describe the scene, but then the drawing-along aspect seems relatively pointless.

Tsarp 7 hours ago

There are direct prompt tests and then there are tests with tooling.

If, for example, you use ControlNets, you can get very close to the style and composition you need with an open model like Flux, which will be far better. Flux also has a few successors coming up now.

emporas 7 hours ago

I use Gemini to create covers for songs/albums I make, with beautiful typography. Something like this [1]. I was dying of curiosity about how Ideogram managed to create such gorgeous images. I figured it out 2 days ago.

I take an image with some desired colors or typography from an already existing music album or from Ideogram's poster section. I pass it to Gemini and give the command:

"describe the texture of the picture, all the element and their position in the picture, left side, center right side, up and down, the color using rgb, the artistic style and the calligraphy or font of the letters"

Then I take the result and pass it through an LLM - a different LLM, because I don't like Gemini that much; I find it much less coherent than other models. I usually use qwen-qwq-32b. I take the description Gemini outputs and give it to Qwen:

" write a similar description, but this time i want a surreal painting with several imaginative colors. Follow the example of image description, add several new and beautiful shapes of all elements and give all details, every side which brushstrokes it uses, and rgb colors it uses, the color palette of the elements of the page, i want it to be a pastel painting like the example, and don't put bioluminesence. I want it to be old style retro style mystery sci fi. Also i want to have a title of "Song Title" and describe the artistic font it uses and it's position in the painting, it should be designed as a drum n bass album cover "*

Then I take the result and give it back to Gemini with the command: "Create an image with text "Song Title" for an album cover: here is the description of the rest of the album"

If the resulting image is good, then it is time to add the font. I take the new image description and pass it through Qwen again, supposing the image description has the fields Title and Typography:

"rewrite the description and add full description of the letters and font of text, clean or distressed, jagged or fluid letters or any other property they might have, where they are overlayed, and make some new patterns about the letter appearance and how big they are and the material they are made of, rewrite the Title and Typography."

I replace the previous description's section Title and Typography with the new description and create images with beautiful fonts.

[1] https://imgur.com/a/8TCUJ75
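
For anyone who wants to try a pipeline like this programmatically, here's a condensed sketch of the describe -> rewrite -> render chain against the generateContent endpoint used earlier in the thread. It collapses the Qwen rewrite step onto Gemini for brevity (the workflow above deliberately uses a different LLM there), and the model names, file names, prompt wording, and field layout are assumptions rather than anything from the original comment:

  # Condensed sketch of the describe -> rewrite -> render chain described above.
  # Simplification: the "rewrite" step is sent to Gemini here, whereas the
  # original workflow uses a separate LLM (qwen-qwq-32b) for it.
  import base64, os, requests

  API_KEY = os.environ["GEMINI_API_KEY"]
  BASE = "https://generativelanguage.googleapis.com/v1beta/models"

  def generate(model, parts, want_image=False):
      config = {"responseModalities": ["TEXT", "IMAGE"]} if want_image else {}
      resp = requests.post(
          f"{BASE}/{model}:generateContent?key={API_KEY}",
          json={"contents": [{"parts": parts}], "generationConfig": config},
          timeout=300,
      )
      resp.raise_for_status()
      return resp.json()["candidates"][0]["content"]["parts"]

  # 1. Describe the reference cover (layout, RGB colors, style, typography).
  with open("reference_cover.png", "rb") as f:
      ref = base64.b64encode(f.read()).decode()
  description = generate("gemini-2.0-flash", [
      {"inline_data": {"mime_type": "image/png", "data": ref}},
      {"text": "Describe the texture, the elements and their positions, the "
               "colors as RGB, the artistic style, and the font of the letters."},
  ])[0]["text"]

  # 2. Rewrite the description into a new surreal variant with a title.
  rewritten = generate("gemini-2.0-flash", [
      {"text": "Write a similar description, but as a surreal retro sci-fi "
               "pastel painting, and add a title 'Song Title' with a described "
               "font and position:\n\n" + description},
  ])[0]["text"]

  # 3. Render the new cover with the image-generation model.
  parts = generate("gemini-2.0-flash-preview-image-generation",
                   [{"text": "Create an album cover image: " + rewritten}],
                   want_image=True)
  for part in parts:
      blob = part.get("inlineData") or part.get("inline_data")
      if blob:
          with open("cover.png", "wb") as out:
              out.write(base64.b64decode(blob["data"]))
          break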

pentagrama 14 hours ago

I want to take a step back and reflect on what this actually shows us. Look at the examples Google provides: it refers to the generated objects as "products", clearly pointing toward shopping or e-commerce use cases.

It seems like the real goal here, for Google and other AI companies, is a world flooded with endless AI-generated variants of objects that don’t even exist yet, crafted to be sold and marketed (probably by AI too) to hyper-targeted audiences. This feels like an incoming wave of "AI slop", mass-produced synthetic content, crashing against the small island of genuine human craftsmanship and real, existing objects.

hapticmonkey 4 hours ago

It's sort of sad how these tools went from "godlike new era of human civilization" to "some commodity tools for marketing teams to sell stuff".

I get that they are trying to find some practical use cases for their tools. But there's no enlightenment in the product development here.

If this is already the part of the s-curve where these AI tools get diminishing returns...what a waste of everybody's time.

nly 10 hours ago

Recently I've been seeing a lot of holiday lets on sites like Rightmove (UK) and Airbnb with clearly AI generated 'enhancements' to the photos.

It should be illegal in my view.

vunderba 14 hours ago

Yeah - and honestly I don't really get this. Using GenAI for real-world products seems like a recipe for a slew of fraudulent-advertising lawsuits if the images differ even slightly from the actual physical products yet are presented as if they were real photographs.

nkozyra 13 hours ago

The gating factor here is the pool of consumers. Once people have slop exhaustion there's nobody to sell this to.

Maybe this is why all of the future AI fiction has people dressed in the same bland clothing.

ohadron 16 hours ago

For one thing, it's way faster than the OpenAI equivalent in a way that might unlock additional use cases.

freedomben 16 hours ago

Speed has been the consistent thing I've noticed with Gemini too, even going back to the earlier days when Gemini was a bit of a laughing stock. Gemini is fast

julianeon 14 hours ago

I don't know exactly where the speed/quality tradeoff lies, but I'll tell you this: Google may be erring too much on the speed side. It's fast but junk. I suspect a lot of people try it and then bounce back to Midjourney, like I did.

simonw 11 hours ago

Posted some notes from trying this out here, including examples of the images it produced and a tool for rendering the JSON https://simonwillison.net/2025/May/7/gemini-images-preview/

egamirorrim 16 hours ago

I don't understand how to use this. I keep trying to edit a photo of myself (change a jacket to a t-shirt) in the Gemini app with 2.0 Flash selected, and it just generates a new image that's nothing like the original.

FergusArgyll 16 hours ago

I think this is just in AI Studio. In the Gemini app I think it goes: Flash describes the image to Imagen -> Imagen generates a new image.

thornewolf 16 hours ago

It is very sensitive to your input prompts. Minor differences will result in drastic quality differences.

julianeon 14 hours ago

Remember you are paying about 4 cents an image if I'm understanding the pricing correctly.

qq99 14 hours ago

Wasn't this already available in AI Studio? It sounds like they also improved the image quality. It's hard to keep up with what's new with all these versions

taylorhughes 14 hours ago

Image editing/compositing/remixing is not quite as good as gpt-image-1, but the results are really compelling anyway due to the dramatic increase in speed! Playing with it just now, it's often 5 seconds for a compositing task between multiple images. Feels totally different from waiting 30s+ for gpt-image-1.

refulgentis 16 hours ago

Another release from Google!

Now I can use:

- Gemini 2.0 Flash Image Generation Preview (May) instead of Gemini 2.0 Flash Image Generation Preview (March)

- or when I need text, Gemini 2.5 Flash Thinking 04-17 Preview ("natively multimodal" w/o image generation)

- When I need to control thinking budgets, I can do that with Gemini 2.5 Flash Preview 04-17, with not-thinking at a 50% price increase over a month prior

- And when I need realtime, fallback to Gemini 2.0 Flash 001 Live Preview (announced as In Preview on April 9 2025 after the Multimodal Live API was announced as released on December 11 2024)

- I can't control Gemini 2.5 Pro Experimental/Preview/Preview IO Edition's thinking budgets, but good news follows in the next bullet: they'll swap the model out underneath me with one that thinks ~10x less, so at least it's in the same cost ballpark as their competitors

- and we all got autoupgraded from Gemini 2.5 Pro Preview (03/25 released 4/2) to Gemini 2.5 Pro Preview (IO Edition) yesterday! Yay!

justanotheratom 16 hours ago

Yay! Do you use your Gemini in the Gemini app, AI Studio, or Vertex AI?

refulgentis 15 hours ago

I am Don Quixote, building an app that abstracts over models (i.e. allows user choice), while providing a user-controlled set of tools and allowing users to write their own "scripts", i.e. precanned dialogue/response steps that permit, e.g., building search.

Which is probably what makes me so cranky here. It's very hard keeping track of all of it and doing my best to lever up the models that are behind Claude's agentic capabilities, and all the Newspeak of Google PR makes it consume almost as much energy as the rest of the providers combined. (I'm v frustrated that I didn't realize till yesterday that 2.0 Flash had quietly gone from 10 RPM to 'you can actually use it')

I'm a Xoogler and I get why this happens ("preview" is a magic wand that means "you don't have to get everyone in bureaucracy across DeepMind/Cloud/? to agree to get this done and fill out their damn launchcal"), but, man.

xnx 15 hours ago

A matrix of models, capabilities, and prices would be really useful.

GaggiX 16 hours ago

Not available in the EU; the first version was, and then it was removed.

Btw, it's still not as good as ChatGPT but much, much faster; nice progress compared to the previous model.

adverbly 16 hours ago

Google totally crushing it and stock is down 8% today :|

Is it just me or is the market just absolutely terrible at understanding the implications and speed of progress behind what's happening right now in the walls of big G?

abirch 15 hours ago

A potential reason that GOOG is down right now is that Apple is looking at AI Search Engines.

https://www.bloomberg.com/news/articles/2025-05-07/apple-wor...

Although AI is fun and great, an AI search engine may have trouble being profitable. It's similar to how 23andMe got many customers by selling a $500 test to people for $100.

xnx 15 hours ago

Would be quite a financial swing for Apple from getting paid billions of dollars by Google for search to having to spend billions of dollars to make their own.

abirch 15 hours ago

From the article (Eddy Cue is Apple's senior vice president of services): "Cue said he believes that AI search providers, including OpenAI, Perplexity AI Inc. and Anthropic PBC, will eventually replace standard search engines like Alphabet's Google. He said he believes Apple will bring those options to Safari in the future."

So Apple may not be making their own, but they won't be spending billions either. I'm wondering how those providers will be able to monetize the searches so that they make money.

mattlondon 14 hours ago

FWIW I searched this story not long after it broke and Google - yes, the traditional "old school search engine" - had an AI-generated summary of the story with a breakdown of the whys and hows right there at the top of the page. This was basically real time, give or take 10 minutes.

I am not sure why people think OpenAI et al. are going to eat Google's lunch here. It seems like they're already doing AI-for-search, and if there is anyone who can do it cheaply and at scale, I bet on Google being the one to do it (with all their data centers, data integrations/crawlers, custom hardware, and experience, etc.). I doubt some startup using the Bing index and renting off-the-shelf Nvidia hardware with investor funds is going to leapfrog Google-scale infrastructure and expertise.

resource_waste 15 hours ago

Why would any of this have an impact on stock prices?

LLMs are insanely competitive and a dime a dozen now. Most professional uses can get away with local models.

This is image generation... Niche cases in another saturated market.

How are any of these supposed to make google billions of dollars?

lenerdenator 15 hours ago

The market is absolutely terrible at a lot of things.

mvdtnz 11 hours ago

I gave this a crack this morning, trying something very similar to the examples. I tried to get Gemini 2.0 Preview to add a set of bi-fold doors to a picture of a house in a particular place. It failed completely. It put them in the wrong place, they looked absolutely hideous (like I had pasted them in with MS Paint), and the more I tried to correct it with prompts the worse it got. At one point when I re-prompted it, it said:

> Okay, I understand. You want me to replace ONLY the four windows located underneath the arched openings on the right side of the house with bifold doors, leaving all other features of the house unchanged. Here is the edited image:

Followed by no image. This is a behaviour I have seen many times from Gemini in the past so it's frustrating that it's still a problem.

I give this a 0/10 for my first use case.

jansan 16 hours ago

Some examples are quite impressive, but the one with the polar bear on the white mug is very underwhelming, and the co-drawing looks like it was hacked together by a vibe coder.

cyral 6 hours ago

It looks like those horribly edited gift mugs I see on Amazon occasionally, where someone just puts the image over the mug without accounting for the 3D shape - too many variants to actually photograph each one. It would have been an excellent example of how much better AI is if they had made it handle that properly.

thornewolf 16 hours ago

The co-drawing is definitely not a fully fleshed-out product or anything but I think it is a great tech demo. What don't you like about it?