thegeomaster 15 hours ago

A detail that is not mentioned is that Google models >= Gemini 2.0 are all explicitly post-trained for this task of bounding box detection: https://ai.google.dev/gemini-api/docs/image-understanding

Given that the author is using the specific `box_2d` format, it suggests that he is taking advantage of this feature, so I wanted to highlight it. My intuition is that a base multimodal LLM without this type of post-training would have much worse performance.

  • simedw 15 hours ago

    That's true, it's also why I didn't benchmark against any other model provider.

    It has been tuned so heavily on this specific format that even a tiny change, like switching the order in the `box_2d` format from `(ymin, xmin, ymax, xmax)` to `(xmin, ymin, xmax, ymax)` causes performance to tank.

    • pbhjpbhj 11 hours ago

      That's interesting because it suggests the meaning and representation are very tightly linked; I would expect it to be less tightly coupled given Gemini is multimodal.

  • demirbey05 15 hours ago

    I was really shocked when I first saw this, but yes, it's in the training data, not a reasoning feature.

  • IncreasePosts 10 hours ago

    Why do they do post training instead of just delegating segmentation to a smaller/purpose-built model?

    • thegeomaster 7 hours ago

      Post-training lets the model leverage the considerable world and language understanding of the underlying pretrained model. The intuition is that this boosts performance.

fzysingularity 13 hours ago

This isn't surprising at all - most VLMs today are quite poor at localization even though they've been explicitly post-trained on object detection tasks.

One insight the author calls out is the inconsistency in coordinate systems used when post-training these models: you can't just swap models and expect similar results. Gemini uses (ymin, xmin, ymax, xmax) integers between 0 and 1000; Qwen uses (xmin, ymin, xmax, ymax) floats between 0 and 1. We've been evaluating most of the frontier models on bounding boxes / segmentation masks, and this is quite a footgun for new users.
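
For anyone hitting this, here's a rough sketch (my own, not from either vendor's docs) of mapping both conventions into plain pixel-space (xmin, ymin, xmax, ymax):

  def gemini_to_xyxy(box, img_w, img_h):
      """Gemini-style [ymin, xmin, ymax, xmax] ints in 0-1000 -> pixel xyxy."""
      ymin, xmin, ymax, xmax = box
      return (xmin / 1000 * img_w, ymin / 1000 * img_h,
              xmax / 1000 * img_w, ymax / 1000 * img_h)

  def qwen_to_xyxy(box, img_w, img_h):
      """Qwen-style [xmin, ymin, xmax, ymax] floats in 0-1 -> pixel xyxy."""
      xmin, ymin, xmax, ymax = box
      return (xmin * img_w, ymin * img_h, xmax * img_w, ymax * img_h)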

One of the reasons we chose to delegate object detection to specialized tools is essentially the poor performance (~0.34 mAP with Gemini vs. ~0.6 mAP with DETR-like architectures). Check out this cookbook [1] we recently released: we use any LLM to delegate tasks like object detection, face detection, and other classical CV tasks to a specialized model, while still giving the user the dev-ex of a VLM.

[1] https://colab.research.google.com/github/vlm-run/vlmrun-cook...

  • joshvm 4 hours ago

    Box format degeneracy has been a footgun for computer vision developers since forever. You can define a rectangle as two corner coordinates, or as a coordinate plus width and height. Since "one coordinate" can be a corner or the center, there are normally 6 variations, and every single one exists somewhere. This also causes havoc during validation, because it's easy to forget which one you used and wonder why your metrics are all practically zero when you specified the wrong one.

    There's a neat table here: https://dragoneye.ai/blog/a-guide-to-bounding-box-formats

    Picking yxyx was certainly a decision.
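
    For anyone new to this, a quick sketch (illustrative only) of the corner vs. corner+size vs. center+size parameterizations, ignoring axis order:

      def xyxy_to_xywh(x0, y0, x1, y1):
          """Two corners -> top-left corner plus width/height."""
          return x0, y0, x1 - x0, y1 - y0

      def xyxy_to_cxcywh(x0, y0, x1, y1):
          """Two corners -> center plus width/height."""
          return (x0 + x1) / 2, (y0 + y1) / 2, x1 - x0, y1 - y0

      # The same box, three different-looking tuples:
      print((10, 20, 110, 220))                # xyxy
      print(xyxy_to_xywh(10, 20, 110, 220))    # (10, 20, 100, 200)
      print(xyxy_to_cxcywh(10, 20, 110, 220))  # (60.0, 120.0, 100, 200)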

smus 14 hours ago

We benchmarked Gemini 2.5 on 100 open source object detection datasets in our paper: https://arxiv.org/abs/2505.20612 (see table 2)

Notably, performance on out-of-distribution data like that in RF100VL is severely degraded

It worked really well zero-shot (compared to the rest of the foundation model field), achieving 13.3 average mAP, but counterintuitively performance degraded when it was given visual examples to ground its detections, and when it was given textual instructions on how to find objects as additional context. So it seems to have some amount of zero-shot object detection training, probably on a few standard datasets, but it isn't smart enough to incorporate additional context or its general world knowledge into those detection abilities

serjester 15 hours ago

I wrote a similar article a couple of months ago, but focusing instead on PDF bounding boxes—specifically, drawing boxes around content excerpts.

Gemini is really impressive at these kinds of object detection tasks.

https://www.sergey.fyi/articles/using-gemini-for-precise-cit...

  • simedw 15 hours ago

    That's really interesting, thanks for sharing!

    Are you using that approach in production for grounding when PDFs don't include embedded text, like in the case of scanned documents? I did some experiments for that use case, and it wasn't really reaching the bar I was hoping for.

    • serjester 14 hours ago

      Yes, this was completely image-based. We're not quite at the point of using it in production, since I agree it can be flaky at times. Although I do think there are viable workarounds, like sending the same prompt multiple times and checking whether the returned results overlap.
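
      A sketch of that overlap check (assumed, not something we've shipped): keep only detections from one run whose IoU with some box from a second run clears a threshold.

        def iou(a, b):
            """Intersection-over-union of two (x0, y0, x1, y1) boxes."""
            ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
            ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
            inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
            union = ((a[2] - a[0]) * (a[3] - a[1]) +
                     (b[2] - b[0]) * (b[3] - b[1]) - inter)
            return inter / union if union else 0.0

        run1 = [(100, 40, 300, 90)]
        run2 = [(104, 42, 298, 88)]
        agreed = [a for a in run1 if any(iou(a, b) > 0.8 for b in run2)]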

      It really feels like we're maybe half a model generation away from this being a solved problem.

  • svat 15 hours ago

    Thanks for this post — I'm doing something similar for a personal/hobby project (just trying to work with very old scanned PDFs in Sanskrit etc), and the bounding box next to "Sub-TOI" in your screenshot (https://www.sergey.fyi/images/bboxes/annotated-filing.webp) is like something I'm encountering too: it clearly “knows” that there is a box of a certain width and height, but somehow the box is offset from its actual location. Do you have any insights into that kind of thing, and did anything you try fix that?

    • serjester 14 hours ago

      I suspect this is a remnant of how images get tokenized - the simplest solution is probably to increase the buffer.

      • svat 12 hours ago

        What buffer are you referring to / how do you increase it? And did that solution work for you (if you happened to try)?

sly010 13 hours ago

Genuine question: How does this work? How does an LLM do object detection? Or more generally, how does an LLM do anything that is not text? I always thought tasks like this were usually just handed off to another (i.e. vision) model, but the post talks about it as if it's the _same_ model doing both text generation and vision. It doesn't make sense to me why Gemini 2 and 2.5 would have different vision capabilities; shouldn't they both have access to the same purpose-trained, state-of-the-art vision model?

  • sashank_1509 13 hours ago

    You tokenize the image and then pass it through a vision encoder that is generally trained separately from large-scale pretraining (using, say, contrastive captioning) and then added to the model during RLHF. I wouldn't be surprised if the vision encoder is used in pretraining now too; this would be a different objective than next-token prediction, of course (unless they use something like next-token prediction for images, which I don't think is the case).
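
    A shape-level sketch of that pipeline (all dimensions made up, a simple linear projector assumed), just to show that image patches end up in the same token sequence as the text:

      import numpy as np

      d_vision, d_model = 1024, 4096              # hypothetical dims
      patches = np.random.rand(256, d_vision)     # output of a ViT-style encoder
      W_proj = np.random.rand(d_vision, d_model)  # projector into the LLM space

      image_tokens = patches @ W_proj
      text_tokens = np.random.rand(12, d_model)   # embedded prompt tokens

      # One sequence; the transformer attends across both modalities.
      sequence = np.concatenate([image_tokens, text_tokens], axis=0)
      print(sequence.shape)                       # (268, 4096)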

    Different models have different encoders, they are not shared as the datasets across models and even model sizes vary. So performance between models will vary.

    What you seem to be thinking is that text models were simply calling an API to a vision model, similar to tool use. That is not what's happening; it is much more built in: the forward pass goes through the vision architecture into the language architecture. Robotics research has been doing this for a while.

  • simonw 12 hours ago

    > I always thought tasks like this are usually just handed to an other (i.e. vision) model, but the post talks about it as if it's the _same_ model doing both text generation and vision.

    Most vision LLMs don't actually use a separate vision model. https://huggingface.co/blog/vlms is a decent explanation of what's going on.

    Most of the big LLMs these days are vision LLMs - the Claude models, the OpenAI models, Grok and most of the Gemini models all accept images in addition to text. To my knowledge none of them are using tool calling to a separate vision model for this.

    Some of the local models can do this too - Mistral Small and Gemma 3 are two examples. You can tell they're not tool calling to anything because they run directly out of a single model weights file.

    • gylterud 9 hours ago

      Not a contradiction to anything you said, but O3 will sometimes whip up a python script to analyse the pictures I give it.

      For instance, I asked it to compute the symmetry group of a pattern I found on a wallpaper in a Lebanese restaurant this weekend. It realised it was unsure of the symmetries and used a python script to rotate and mirror the pattern and compare to the original to check the symmetries it suspected. Pretty awesome!

  • Legend2440 13 hours ago

    It used to be done that way, but newer multimodal LLMs train on a mix of image and text tokens, so they don’t need a separate image encoder. There is just one model that handles everything.

chrismorgan 14 hours ago

> Hover or tap to switch between ground truth (green) and Gemini predictions (blue) bounding boxes

> Sometimes Gemini is better than the ground truth

That ain’t ground truth, that’s just what MS-COCO has.

See also https://en.wikipedia.org/wiki/Ground_truth.

  • Cubre 14 hours ago

    Are you implying it "ain't" ground truth because it's not perfect? Ground truth is simply a term used in machine learning to denote a dataset's labels. A quote extracted from the link that you sent acknowledges that ground truth may not be perfect: "inaccuracies in the ground truth will correlate to inaccuracies in the resulting spam/non-spam verdicts".

    • nomel 6 hours ago

      What they have is not ground truth; it's bad data. Why is it bad data? Because any model that uses it, or any metric based on it, will be worse. That's in opposition to the definition and purpose of ground truth data: it's not supposed to make things worse.

      You're both right. Perfection isn't possible or practical. But their "ground truth" (in that example) is obviously shite, and nobody should be using it for training or any sort of metric, since it will make them worse. You're also right that you can name a dataset "ground truth", but names don't mean much when they're in opposition to the intent.

    • chrismorgan 13 hours ago

      Tell me with a straight face that the car labeling is okay. It’s clearly been made by a dodgy automated system, with no human confirmation of correctness. That ain’t ground truth.

      • ajcp 11 hours ago

        You're conflating "truthiness" with "correctness". I realize this sounds like an oxymoron when talking about something called ground "truth", but when we're building ground truth to measure how good our model outputs are, it does not matter what is "true", only what is "correct".

        Our ground truth should reflect the "correct" output expected of the model with regard to its training. So while in many cases "truth" and "correct" should align, there are many, many cases where "truth" is subjective, and so we must settle for "correct".

        Case in point: we've trained a model to parse out addresses from a wide array of forms. Here is an example address as it would appear on the form:

          Address: J Smith 123 Example St
          City: LA  State: CA  Zip: 85001

        Our ground truth says it should be rendered as such:

          Address Line 1: J Smith
          Address Line 2: 123 Example St
          City: LA
          State: CA
          ZipCode: 85001

        However our model outputs it thusly:

          Address Line 1: J Smith 123 Example St
          Address Line 2:
          City: LA
          State: CA
          ZipCode: 85001

        That may be true, as there is only 1 address line and we have a field for "Address Line 1", but it is not correct. Sure, there may be a problem with our taxonomy, training data, or any other number of other things, but as far as ground truth goes it is not correct.

        • chrismorgan 10 hours ago

          I fail to see how your example is applicable.

          Are you trying to tell me that the COCO labelling of the cars is what you call correct?

          • ajcp 7 hours ago

            I'm trying to help you understand what "ground truth" means.

            If, as it seems in the article, they are using COCO to establish ground truth, i.e. what COCO says is correct, then whatever COCO comes up with is, by definition, "correct". It is, in effect, the answer, the measuring stick, the scoring card. Now what you're hinting at is that, in this instance, that's a really bad way to establish ground truth. I agree. But that doesn't change what ground truth is or how we use it.

            Think of it another way:

            - Your job is to pass a test.

            - To pass a test you must answer a question correctly.

            - The answer to that question has already been written down somewhere.

            To pass the test does your answer need to be true, or does it need to match what is already written down?

            When we do model evaluation the answer needs to match what is already written down.

            • chrismorgan 2 hours ago

              So, it sounds like you’re saying that the ML field has hijacked the well-defined and -understood term “ground truth”, to mean something that should be similar, but which is fundamentally unrelated, and in cases like this is in no way similar. Even what it is to be “correct” is damaged.

              I am willing to accept that this is how they are using the terms; but it distresses me. They should choose appropriate terms rather than misappropriating existing terms.

              (Your address example I still don’t get, because I expect your model to do some massaging to match custom, so I wouldn’t consider an Address Line 1 of “J Smith 123 Example St” with empty Address Line 2 to be true or correct.)

          • ghurtado 7 hours ago

            You're trying so hard not to learn something new in this thread, that it's almost impressive.

bee_rider 15 hours ago

I wonder how the power consumption compares. I’d expect the classic CNN to be cheaper just because it is more specialized.

> The allure of skipping dataset collection, annotation, and training is too enticing not to waste a few evenings testing.

How does annotation work? Do you actually have to mark every pixel of “the thing,” or does the training process just accept images with “a thing” inside it, and learn to ignore all the “not the thing” stuff that tends to show up? If it's the latter, maybe Gemini with its mediocre bounding boxes could be used as an infinitely un-bore-able annotator instead.

  • joelthelion 12 hours ago

    If it works, you could use the LLM for the first few thousand cases, then use those annotations to train an efficient supervised model, and switch to that.

    That way it would be both efficient and cost-effective.
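
    A rough sketch of that hand-off (file names and boxes are made up): dump the LLM's pixel-space boxes as YOLO-style label files, then train any standard detector on them.

      def to_yolo_line(cls_id, x0, y0, x1, y1, img_w, img_h):
          """YOLO label format: class cx cy w h, all normalized to 0-1."""
          cx = (x0 + x1) / 2 / img_w
          cy = (y0 + y1) / 2 / img_h
          w, h = (x1 - x0) / img_w, (y1 - y0) / img_h
          return f"{cls_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"

      # One label file per image, one line per LLM-predicted box.
      with open("img_0001.txt", "w") as f:
          f.write(to_yolo_line(0, 120, 80, 360, 300, img_w=640, img_h=480))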

    • bee_rider 11 hours ago

      It always is fraught to make analogies between human brains and these learning models, but this sounds a bit like muscle memory or something.

iamgopal 3 hours ago

Can't there be specialised models that Gemini can get help from, just like humans get help from other humans? I have a gut feeling that adding everything to a single model may degrade its overall competence. For almost all models, as I go deeper (and most here must have experienced this), I clearly recognise that Gemini (or most other well-known models) starts to fail at the task, so subtly that you may think the output is correct when it is not. This is dangerous, and it increases the verification work on our end instead of reducing it.

xrendan 13 hours ago

One thing that surprised me (and I should've known it wasn't great at this): it is terrible at creating bounding boxes around things it wasn't trained on (like parts on a PCB schematic).

  • amelius 13 hours ago

    So this tells us that it does not _understand_ what it is doing, really. No real intelligence here. Might as well use an old-school YOLO network for the task.

    • ta8645 11 hours ago

      It's just behaving like a child. A child could draw a bounding box around a dog and a cat, but would fail if you told them to draw a box around the transistors of a PCB. They have no idea what a transistor is, or what it looks like. They lack the knowledge and maturity. But you would never claim the child doesn't _understand_ what they're doing, at least not to imply that they're forever incapable of the task.

      • amelius 11 hours ago

        Yeah, but a child does one-shot learning much better. Just tell it to find the black rectangles and it will draw boxes around the transistors of a PCB, no extra training required.

        • ta8645 11 hours ago

          Perhaps. But I think you'll find there are a lot of black rectangles on a PCB that aren't actually transistors. You'll end up having to teach the child a lot more if you want accurate results. And that's the same kind of training you'll have to give to an LLM.

          In either case, your assertion that one _understands_, and the other doesn't, seems like motivated reasoning, rather than identifying something fundamental about the situation.

          • graemep 11 hours ago

            Then you explain that transistors have three wires coming off them.

          • amelius 11 hours ago

            I mean, problem solving with loose specs is always going to be messy.

            But at least with a child I can quickly teach it to follow simple orders, while this AI requires hours of annotating + training, even for simple changes in instructions.

            • ta8645 11 hours ago

              Humans are the beneficiaries of millions of years of evolution, and are born with innate pattern matching abilities that we don't need "training" for; essentially our pre-training. Of course, it is superior to the current generation of LLMs, but is it fundamentally different? I don't know one way or the other to be honest, but judging from how amazing LLMs are given all their limitations and paucity of evolution, I wouldn't bet against it.

              The other problem with LLMs today, is that they don't persist any learning they do from their everyday inference and interaction with users; at least not in real-time. So it makes them harder to instruct in a useful way.

              But it seems inevitable that both their pre-training, and ability to seamlessly continue to learn afterward, should improve over the coming years.

mkagenius 13 hours ago

Oh yes, it's been good for a while. When we created our Android-use[1] (like computer use) tool, it was the cheapest and the best option among OpenAI, Claude, Llama, etc.

We have a planner phase followed by a "finder" phase where vision models are used. The following is a summary of our findings for the planner and finder. Some models are marked "work in progress" as they do not support tool calling (or are extremely bad at it).

  +------------------------+------------------+------------------+
  | Models                 | Planner          | Finder           |
  +------------------------+------------------+------------------+
  | Gemini 1.5 Pro         | recommended      | recommended      |
  | Gemini 1.5 Flash       | can use          | recommended      |
  | Openai GPT 4o          | recommended      | work in progress |
  | Openai GPT 4o mini     | recommended      | work in progress |
  | llama 3.2 latest       | work in progress | work in progress |
  | llama 3.2 vision       | work in progress | work in progress |
  | Molmo 7B-D-4bit        | work in progress | recommended      |
  +------------------------+------------------+------------------+
1. https://github.com/BandarLabs/clickclickclick

svat 15 hours ago

Thanks for this post; it's inspiring — for a personal project I'm just trying to get bounding boxes from scanned PDF pages (around paragraphs/verses/headings etc.), and so far I haven't gotten great results. (It seems to recognize the areas, but then the boxes are offset/translated by some amount.) I only just got started and haven't looked closely yet (I'm sure the results can be improved, looking at this post), but I can already see that there are a bunch of things to explore:

- Do you ask the multimodal LLM to return the image with boxes drawn on it (and then somehow extract coordinates), or simply ask it to return the coordinates? (Is the former even possible?)

- Does it do better or worse when you ask it for [xmin, xmax, ymin, ymax] or [x, y, width, height] (or various permutations thereof)?

- Do you ask for these coordinates as integer pixels (whose meaning can vary with dimensions of the original image), or normalized between 0.0 and 1.0 (or 0–1000 as in this post)?

- Is it worth doing it in two rounds: send it back its initial response with the boxes drawn on it, to give it another opportunity to "see" its previous answer and adjust its coordinates?

I ought to look at these things, but I'm wondering: as you (or others) work on something like this, how do you keep track of which prompts seem to be working better? Do you log all requests and responses / scores as you go? I didn't do that for my initial attempts, and it feels a bit like shooting in the dark / trying random things until something works.

  • pkilgore 13 hours ago

    The model seems to be trained to pick up on the existence of the words "bounding box" or "segmentation mask", and if so it is pre-trained to return Array<{ box_2d: [number, number, number, number], label: string, mask: "base64/png" }>, where box_2d is [y0, x0, y1, x1], if you ask it for JSON too.
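
    A minimal sketch (the JSON here is an example I made up, not real model output) of parsing that shape and scaling the 0-1000 values back to pixel coordinates:

      import json

      raw = '[{"box_2d": [54, 120, 410, 630], "label": "dog"}]'

      width, height = 1280, 960  # original image size in pixels
      for det in json.loads(raw):
          y0, x0, y1, x1 = det["box_2d"]  # normalized to 0-1000, y first
          box_px = (x0 / 1000 * width, y0 / 1000 * height,
                    x1 / 1000 * width, y1 / 1000 * height)
          print(det["label"], box_px)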

    I recommend the Gemini docs here; they are implicit on some of these points.

    Prompts matter too; less is more.

    And you need to submit images to get good bounding boxes. You can somewhat infer this from the token counts, but the Gemini APIs do something to PDFs (OCR, I assume) that causes them to completely lose location context on the page. If you send the page in as an image, that context isn't lost and the boxes are great.

    As an example of this, you can send a PDF page with text on the top half and the bottom half empty. If you ask it to draw a bounding box around the last paragraph, it tends to return a result that is a much higher number on the normalized scale (lower on the y axis) than it should be. In one experiment I did, it thought a footer text that was actually about 2/3 down the page was all the way at the end. When I sent the page as an image, it placed it around the 660 mark on the normalized 1000 scale, exactly where you would expect it.

    • mdda 11 hours ago

      You've got to be careful with PDFs: we can't see how they are rendered internally for the LLM, so there may be differences in how it treats the margins/gutters/bleeds that we should account for (but cannot).

mehulashah 11 hours ago

Cool post. We did a similar evaluation for document segmentation using the DocLayNet benchmark from IBM: https://ds4sd.github.io/icdar23-doclaynet/task/ but on modern document OCR models like Mistral, OpenAI, and Gemini. And what do you know, we found similar performance -- DETR-based segmentation models are about 2x better.

Disclosure: I work for https://aryn.ai/

EconomistFar 16 hours ago

Really interesting piece; the bit about tight vs. loose bounding boxes got me thinking. Small inaccuracies can add up fast, especially in edge cases or when training on limited data.

Has anyone here found good ways to handle bounding box quality in noisy datasets? Do you rely more on human annotation or clever augmentation?

  • simedw 16 hours ago

    Thank you! Better training data is often the key to solving these issues, though it can be a costly solution.

    In some cases, running a model like SAM 2 on a loose bounding box can help refine the results. I usually add about 10% padding in each direction to the bounding box, just in case the original was too tight. Then, if you don't actually need the mask, you just convert it back to a bounding box.
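
    Roughly what the padding step looks like (a sketch, assuming pixel-space (x0, y0, x1, y1) boxes):

      def pad_box(x0, y0, x1, y1, img_w, img_h, pad=0.10):
          """Grow a box by `pad` (10%) of its size per side, clamped to the image."""
          dx, dy = (x1 - x0) * pad, (y1 - y0) * pad
          return (max(0, x0 - dx), max(0, y0 - dy),
                  min(img_w, x1 + dx), min(img_h, y1 + dy))

      print(pad_box(100, 50, 300, 250, img_w=640, img_h=480))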

muxamilian 12 hours ago

I'm rather puzzled by how bad the COCO ground truth is. This is the benchmark dataset for object detection? Wow. I would say Gemini's output is better than the ground truth in most of the example images.

aae42 13 hours ago

I find these discussions comparing the "vision language models" to the old computer vision tech pretty interesting.

Since there are still strengths that classic computer vision has, I wonder why someone hasn't made an "über vision language service" that just exposes the old CV APIs as MCP or something, and has both systems work in conjunction to increase accuracy and understanding.

nolok 14 hours ago

Not directly related but still kind of: I've more or less settled on Gemini lately and often use it "for fun", not to do the task but to see if it could do it better than me, or in a novel or more efficient way. NotebookLM and Canvas work nicely and it felt easy to use.

I've been absurdly surprised at how good it is at some things and how bad it is at others, and notably the things it seems worst at are the easy pickings.

Let me give an example: I was checking with it my employees' payslips for the last few months, the various wires related to their salaries and taxes, and my social declaration papers for labor taxes (which in France are very numerous and complex to follow). I had found a discrepancy in a couple of declarations that ultimately led to a few dozen euros lost over some back and forth. Figuring it out by myself took me a while and was not fun; I had the right accounting total and almost everything was okay, and ultimately it was a case of a credit being applied while an unrelated malus was also applied, both to some employees but not others, and the collision made it a pain to find.

Providing all the papers to Gemini and asking it to check whether everything was fine, it found me a bazillion "weird things", all mostly correct and worth checking, but not the real problem.

Giving it the same papers, and telling it the problem I had and where to look without being sure, it found it for me with decent detail, making me confident that next time I can use it not to solve the problem, but to be put on the right track much, much faster than without Gemini.

Giving it the same papers, the problem, and also the solution I had, but asking it to give me more details, again provided great results and actually helped me clarify which lines collided in which order; again not a replacement but a great add-on. It definitely felt like the price I'm paying for it is worth it.

But here is the funny part: in all of those great analyses, it kept trying to tally totals for me, and there was always one wrong. We're not talking impressive stuff here, but quite literally a case of "here is a 2-column, 5-row table of data, and here is the total", and the total was wrong, and I needed to ask it 3 or 4 times in a row to fix its total until it agreed / found its issue (which it was, literally).

While I was a bit amused (and intrigued) by the "show thinking" detail of that, where I saw it do the same calculation in half a dozen different ways to try to figure out how I came up with my number, it really showed me how weirdly differently from us these things work (or "think", some would say).

If it's not thinking but just emergent behavior from text assimilation, which it's supposed to be, then its figuring out something like that with such detail and clarity was impressive in a way I can't quite grasp. But if it's instead a genuine thought process of some sort, how could it miss the simplest thing so many times, despite being told?

I don't really have a point here, other than I used to know where I sat on "are the models thinking or not" and the waters have really been murkied for me lately.

There has been lots of talk about these things replacing employees or not; I don't see how they could, but I also don't see how an employee without one could compete with an employee helped by one as an assistant: "throw ideas at me" or "here is the result I already know, but help me figure out why". That's where they shine very brightly for me.

pkilgore 13 hours ago

I wish temperature were a dimension of the evaluation. I believe the Gemini docs even recommend avoiding t=0, to avoid the kinds of spirals the author was talking about with masks.

Alifatisk 15 hours ago

I might be completely off here, but it kinda feels like multimodal LLMs are our silver bullet for all kinds of technological problems? From text analysis to video generation to bounding boxes, it's kinda incredible!

And hopefully, with diffusion-based LLMs, we might even see real-time applications?