Karpathy on DeepSeek-OCR paper: Are pixels better inputs to LLMs than text?

122 points by JnBrymn a day ago

tcdent 4 minutes ago

"Kill the tokenizer" is such a wild proposition but is also founded in fundamentals.

Tokenizing text is such a hack even though it works pretty well. The state-of-the-art comes out of the gate with an approximation for quantifying language that's wrong on so many levels.

It's difficult to wrap my head around pixels being a more powerful representation of information, but someone's gotta come up with something other than a tokenizer.

nl 17 minutes ago

Karpathy's points are correct (of course).

One thing I like about text tokens though is that it learns some understanding of the text input method (particularly the QWERTY keyboard).

"Hello" and "Hwllo" are closer in semantic space than you'd think because "w" and "e" are next to each other.

This is much easier to see in hand-coded spelling models, where you can get better results by including a "keyboard distance" metric along with a string distance metric.
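
A minimal sketch of the hand-coded idea above: an edit distance whose substitution cost depends on how far apart two keys sit on a QWERTY keyboard. The key coordinates, the row offsets, and the cost scaling are illustrative assumptions, not taken from any particular spelling library.

```python
import math

# Assumed QWERTY geometry: each row is shifted right by a rough offset.
ROWS = [("qwertyuiop", 0.0), ("asdfghjkl", 0.25), ("zxcvbnm", 0.75)]
KEY_POS = {
    ch: (col + offset, row)
    for row, (keys, offset) in enumerate(ROWS)
    for col, ch in enumerate(keys)
}

def key_distance(a, b):
    """Euclidean distance between two keys; unknown characters count as far apart."""
    if a == b:
        return 0.0
    if a not in KEY_POS or b not in KEY_POS:
        return 3.0
    (x1, y1), (x2, y2) = KEY_POS[a], KEY_POS[b]
    return math.hypot(x1 - x2, y1 - y2)

def substitution_cost(a, b):
    """Neighbouring keys are cheap to swap; keys 3+ units apart cost a full edit."""
    return min(1.0, key_distance(a, b) / 3.0)

def keyboard_edit_distance(s, t):
    """Levenshtein distance with keyboard-weighted substitutions."""
    s, t = s.lower(), t.lower()
    prev = [float(j) for j in range(len(t) + 1)]
    for i, a in enumerate(s, 1):
        curr = [float(i)]
        for j, b in enumerate(t, 1):
            curr.append(min(
                prev[j] + 1.0,                          # delete a
                curr[j - 1] + 1.0,                      # insert b
                prev[j - 1] + substitution_cost(a, b),  # substitute a -> b
            ))
        prev = curr
    return prev[-1]

near = keyboard_edit_distance("hello", "hwllo")  # "w" sits next to "e"
far = keyboard_edit_distance("hello", "hpllo")   # "p" is across the keyboard
```

Under these assumptions `near` comes out smaller than `far`, whereas a plain string distance scores both typos as one edit.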

viraptor an hour ago

https://xcancel.com/karpathy/status/1980397031542989305

kirubakaran an hour ago

Thanks. There are also these:
- https://addons.mozilla.org/en-US/firefox/addon/toxcancel/
- https://chromewebstore.google.com/detail/xcancelcom-redirect...

dang 4 hours ago

Recent and related:

Getting DeepSeek-OCR working on an Nvidia Spark via brute force with Claude Code - https://news.ycombinator.com/item?id=45646559 - Oct 2025 (43 comments)

DeepSeek OCR - https://news.ycombinator.com/item?id=45640594 - Oct 2025 (238 comments)

sabareesh 4 hours ago

It might be that our current tokenization is inefficient compared to how well the image pipeline does. Language already does a lot of compression, but there might be an even better way to represent it in latent space.

ACCount37 4 hours ago

People in the industry know that tokenizers suck and there's room to do better. But actually doing it better? At scale? Now that's hard.
- typpilol 3 hours ago
  
  It will require like 20x the compute
  - ACCount37 2 hours ago
    
    A lot of cool things are shot down by "it requires more compute, and by a lot, and we're already compute starved on any day of the week that ends in y, so, not worth it".
    If we had a million times the compute? We might have brute forced our way to AGI by now.
    
    Jensson an hour ago
    
    But we don't have a million times the compute, we have the compute we have, so it's fair to argue that we want to prioritize other things.
  - kenjackson 2 hours ago
    
    Why so much compute? Can you tie it to the problem?
  - Mehvix 2 hours ago
    
    Why do you suppose this is a compute limited problem?
    
    ACCount37 an hour ago
    
    It's kind of a shortcut answer by now. Especially for anything that touches pretraining.
    "Why aren't we doing X?", where X is a thing that sounds sensible, seems like it would help, and does indeed help, and there's even a paper here proving that it helps.
    The answer is: check the paper, it says there on page 12 in a throwaway line that they used 3 times the compute for the new method than for the controls. And the gain was +4%.
    A lot of promising things are resource hogs, and there are too many better things to burn the GPU-hours on.
CuriouslyC 4 hours ago

Image models use "larger" tokens. You can get this effect with text tokens if you use a larger token dictionary and generate common n-gram tokens, but the current LLM architecture isn't friendly to large output distributions.
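
A rough sketch of the trade-off described above, assuming a toy greedy longest-match tokenizer (not how any production tokenizer is built). Adding n-gram entries to the dictionary shortens the token sequence, but the output layer needs one logit per dictionary entry, so the distribution the model has to predict grows with it. The vocabularies and the sentence are made up for illustration.

```python
def greedy_tokenize(text, vocab):
    """Greedy longest-match tokenization with a single-character fallback."""
    max_len = max(len(tok) for tok in vocab)
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # Fall back to one character so the loop always advances.
            if piece in vocab or size == 1:
                tokens.append(piece)
                i += size
                break
    return tokens

# Hypothetical word-level dictionary.
small_vocab = {"the", "cat", "sat", "on", "mat", " "}
# Same dictionary plus a few common n-grams as single "larger" tokens.
large_vocab = small_vocab | {"the cat", " sat on", " the mat"}

text = "the cat sat on the mat"
small_tokens = greedy_tokenize(text, small_vocab)
large_tokens = greedy_tokenize(text, large_vocab)

# Fewer tokens per sentence, but more entries to score at every output step.
summary = {
    "small": (len(small_tokens), len(small_vocab)),
    "large": (len(large_tokens), len(large_vocab)),
}
```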

ianbutler 2 hours ago

https://arxiv.org/abs/2510.17800 (Glyph: Scaling Context Windows via Visual-Text Compression)

You can also see this paper from the GLM team where they explicitly test this assumption to some pretty good results.

scotty79 19 minutes ago

I couldn't imagine how rendering text tokens to images could bring any savings, but then I remembered each token is converted into hundreds of floating point numbers before feeding it to the neural network. So in a way it's already rendered into a multidimensional pixel (or hundreds of arbitrary 2-dimensional pixels). This paper shows that you don't need that many numbers to keep the accuracy and that using numbers that represent the text visually (which is pretty chaotic) is just as good as the way we currently do it.
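
To make the counting argument above concrete, here is a small sketch of an embedding lookup. Every size in it (vocabulary, embedding width, sequence length, and especially the compression ratio) is a hypothetical placeholder chosen for the arithmetic, not a figure from the paper.

```python
import random

def make_embedding_table(vocab_size, dim, seed=0):
    """Random stand-in for a learned embedding matrix (vocab_size x dim)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(vocab_size)]

def embed(token_ids, table):
    """Each integer token id is swapped for its row of floats."""
    return [table[t] for t in token_ids]

VOCAB_SIZE = 1000  # hypothetical
DIM = 256          # hypothetical: "hundreds of floating point numbers" per token

table = make_embedding_table(VOCAB_SIZE, DIM)
token_ids = [17, 512, 3, 999]
vectors = embed(token_ids, table)
floats_for_four_tokens = len(vectors) * DIM

def floats_fed(n_vectors, dim=DIM):
    """Total numbers handed to the network for a sequence of n_vectors."""
    return n_vectors * dim

TEXT_TOKENS = 1000      # hypothetical document length in text tokens
ASSUMED_RATIO = 10      # assumed text-token to vision-token ratio, illustration only
vision_tokens = TEXT_TOKENS // ASSUMED_RATIO

text_floats = floats_fed(TEXT_TOKENS)
vision_floats = floats_fed(vision_tokens)
```

The point is only that both pipelines hand the network a list of vectors; what changes is how many vectors it takes to cover the same text.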

cnxhk 2 hours ago

The paper is quite interesting, but efficiency on OCR tasks does not mean it could be plugged into a general LLM directly without performance loss. If you train a tokenizer only on OCR text you might be able to get better compression already.

dgfitz an hour ago

Hard to say, should probably spend 187 billion dollars to check.

scotty79 18 minutes ago

It's kind of beautiful that they can actually do that.

hbarka 4 hours ago

Chinese writing is logographic. Could this be giving Chinese developers a better intuition for pixels as input rather than text?

anabis 2 hours ago

Yeah, mapping Chinese characters to linear UTF-8 space is throwing a lot of information away. Each language brings some ideas for text processing. The inventor of sentencepiece is Japanese, and Japanese doesn't have explicit word delimiters, for example.

varispeed 4 hours ago

Text is linear, whereas an image is parallel. I mean, when people read they often don't scan text from left to right (or in a different direction, depending on language), but rather read the text all at once or non-linearly. Like, first lock on to keywords and then read adjacent words to get the meaning, often even skipping some filler sentences unconsciously.

Sequential reading of text is very inefficient.

ants_everywhere 3 minutes ago

some of us with ADHD just kind of read all the words at once
sosodev 3 hours ago

LLMs don't "read" text sequentially, right?
- olliepro 3 hours ago
  
  The causal masking means future tokens don’t affect previous tokens’ embeddings as they evolve throughout the model, but all tokens are processed in parallel… so, yes and no. See this previous HN post (https://news.ycombinator.com/item?id=45644328) about how bidirectional encoders are similar to diffusion’s non-linear way of generating text. Vision transformers use bidirectional encoding b/c of the non-causal nature of image pixels.
  - Merik 2 hours ago
    
    Didn’t Anthropic show that the models engage in a form of planning, such that the model predicts possible future tokens and that then affects the prediction of the next token? https://transformer-circuits.pub/2025/attribution-graphs/bio...
    
    ACCount37 an hour ago
    
    Sure, an LLM can start "preparing" for token N+4 at token N. But that doesn't change that the token N can't "see" N+1.
    Causality is enforced in LLMs - past tokens can affect future tokens, but not the other way around.
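
The causality point in the replies above can be sketched with a toy single-head attention layer. This is a simplified illustration, assuming no learned weights (queries, keys and values are all just the input vectors) and made-up 2-dimensional inputs. Each output row depends only on its own position and earlier ones, so the rows could be computed in parallel, yet changing a later token leaves the earlier outputs untouched.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def causal_self_attention(x):
    """Toy attention with Q = K = V = x and a causal mask."""
    d = len(x[0])
    out = []
    for i in range(len(x)):
        # Position i scores only positions 0..i; later positions are masked out.
        scores = [
            sum(a * b for a, b in zip(x[i], x[j])) / math.sqrt(d)
            for j in range(i + 1)
        ]
        weights = softmax(scores)
        out.append([
            sum(w * x[j][k] for j, w in enumerate(weights))
            for k in range(d)
        ])
    return out

seq_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
seq_b = [[1.0, 0.0], [0.0, 1.0], [-2.0, 3.0]]  # only the last token differs

out_a = causal_self_attention(seq_a)
out_b = causal_self_attention(seq_b)

earlier_unchanged = out_a[:2] == out_b[:2]  # tokens 0 and 1 cannot "see" token 2
last_changed = out_a[2] != out_b[2]
```

A bidirectional encoder would drop the `range(i + 1)` restriction and let every position score every other, which is the non-causal setup mentioned for vision transformers.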
spiralcoaster 2 hours ago

What people do you know that do this? I absolutely read in a linear fashion unless I'm deliberately skimming something to get the gist of it. Who can read the text "all at once"?!
- numpad0 an hour ago
  
  I don't know how common it is, but I tend to read novels in a buffered heterogeneous multithreading mode - image and logical and emotional readings each go at their own pace, rather than a singular OCR engine feeding them all with 1D text
  is that crazy? I'm not buying it is
  - bigbluedots an hour ago
    
    Don't know, probably? I'm a linear reader

yunwal a day ago

> The more interesting part for me (esp as a computer vision at heart who is temporarily masquerading as a natural language person) is whether pixels are better inputs to LLMs than text. Whether text tokens are wasteful and just terrible, at the input.

> Maybe it makes more sense that all inputs to LLMs should only ever be images.

So, what, every time I want to ask an LLM a question I paint a picture? I mean at that point why not just say "all input to LLMs should be embeddings"?

fspeech 4 hours ago

If you can read your input on your screen your computer apparently knows how to convert your texts to images.
CuriouslyC 4 hours ago

All inputs being embeddings can work if you have embeddings like Matryoshka. The hard part is adaptively selecting the embedding size for a given datum.
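
For readers unfamiliar with the Matryoshka idea mentioned above: such embeddings are trained so that a prefix of the vector is itself a usable, coarser embedding. The sketch below shows only the easy part, truncating and re-normalizing at a chosen size. The vectors are made-up numbers, and the adaptive choice of size per datum (the hard part) is deliberately not attempted.

```python
import math

def truncate(embedding, dim):
    """Keep the first `dim` coordinates and rescale to unit length."""
    prefix = list(embedding[:dim])
    norm = math.sqrt(sum(v * v for v in prefix))
    if norm == 0.0:
        return prefix
    return [v / norm for v in prefix]

def cosine(a, b):
    """Cosine similarity of two unit-length vectors is their dot product."""
    return sum(x * y for x, y in zip(a, b))

# Hypothetical 8-dimensional embeddings; real ones are learned, not hand-written.
doc = [0.9, 0.3, -0.2, 0.1, 0.05, -0.04, 0.02, 0.01]
query = [0.8, 0.4, -0.1, 0.2, -0.03, 0.05, 0.01, -0.02]

# Compare the same pair at several nested sizes.
similarity_by_dim = {
    dim: cosine(truncate(doc, dim), truncate(query, dim))
    for dim in (2, 4, 8)
}
```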
smegma2 a day ago

No? He’s talking about rendered text
- rhdunn 4 hours ago
  
  From the post he's referring to text input as well:
  > Maybe it makes more sense that all inputs to LLMs should only ever be images. Even if you happen to have pure text input, maybe you'd prefer to render it and then feed that in:
  Italicized emphasis mine.
  So he's suggesting that/wondering if the vision model should be the only input to the LLM and have that read the text. So there would be a rasterization step on the text input to generate the image.
  Thus, you don't need to draw a picture but generate a raster of the text to feed it to the vision model.
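
A toy sketch of the rasterization step described above, assuming a hand-written 3x5 bitmap font that only covers the characters used here. A real pipeline would use a proper font renderer and a learned vision encoder; this only shows text becoming a pixel grid and the grid being cut into fixed-size patches, which is the kind of input a vision model consumes.

```python
# Hypothetical 3x5 bitmap glyphs ("#" = ink, "." = background).
FONT = {
    "H": ["#.#", "#.#", "###", "#.#", "#.#"],
    "I": ["###", ".#.", ".#.", ".#.", "###"],
    " ": ["...", "...", "...", "...", "..."],
}
GLYPH_H = 5

def rasterize(text):
    """Render text into a 2D list of 0/1 pixels, one blank column between glyphs."""
    rows = [[] for _ in range(GLYPH_H)]
    for idx, ch in enumerate(text):
        glyph = FONT[ch]
        for r in range(GLYPH_H):
            if idx > 0:
                rows[r].append(0)  # spacing column
            rows[r].extend(1 if c == "#" else 0 for c in glyph[r])
    return rows

def to_patches(image, patch_w):
    """Cut the image into full-height vertical strips, zero-padding the last one."""
    width = len(image[0])
    patches = []
    for start in range(0, width, patch_w):
        pad = max(0, start + patch_w - width)
        strip = [row[start:start + patch_w] + [0] * pad for row in image]
        patches.append([p for row in strip for p in row])  # flatten row by row
    return patches

image = rasterize("HI")
patches = to_patches(image, patch_w=4)
```

Here the two characters become a 5x7 grid and then two flattened patches of 20 pixels each.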