The brute force approach is the expensive one ("probably, so far anyway" - add those four words everywhere) - and impossible to make "always correct", hence the "usual blindsides". They seem to be trying a bunch of specialized training ideas here and there in the system - just as a Mixture of Experts does, in places that were less obvious so far, and with an eye toward reasoning. In particular, they are trying to build a reasoning-oriented training base from a minimal seed.
It's still not going to give an "always correct" result but we are nowhere near the point where that's needed. We are only at the point where a new idea can get you a percentage point further in benchmarks. Some fundamental limits were baked into the previous assumptions - easy to get past by leaving these assumptions behind.
I get the hype, even if I don’t necessarily agree with it. TLDR: it’s technically impressive in training infrastructure and a geopolitical surprise.
As other commenters said, it compares favorably against American flagship models in benchmarks. This is geopolitically interesting since the model is Chinese and subject to trade restrictions, and America thinks of itself as home to the world's best model builders.
What makes it interesting technically is that the trade restrictions required them to focus on efficiency and reusing old hardware. This made it wildly cheaper to run the training. They made a ton of very low-level optimizations to make training efficient. This is impressive engineering, and it shows how a small amount of effort can loosen Nvidia lock-in. They had to totally bypass a lot of CUDA libraries and drop down even lower to control the GPUs. Nvidia has been capturing a huge fraction of industry-wide AI spend, and surely no one wants that. This is, IMO, the actual part to watch. I suspect it'll drive a new wave of efficiency-focused LLM tools, which will also unlock competitors' GPUs.
They also had some novel-for-LLMs training techniques (mostly reinforcement learning), though it's suspected that the big AI companies elsewhere are doing the same now too, just without disclosing it.
What I think is hype, meanwhile, is the actual benchmarks. Most models are trained on their competitors' output; this is just a reality of the industry, especially non-flagships being trained against flagship data. DeepSeek was almost certainly trained against OpenAI models, so it makes sense that it would approach their output quality. That's very different from being capable of outperforming or "taking the lead" or whatever. We'll need to wait and see how the future goes before making that determination. "China" has long had a great history of machine learning tech, so there is no reason to think it's structurally impossible for a Chinese organization to be on the leading edge, but it has to happen before we can say it happened.
What is also hype is calling this a “side project” of a financial firm. The firm spun it out as a dedicated company. China cracked down on hedge funds, so the company looked for ways to repurpose the talent and compute infrastructure. This isn’t some side project during lunch breaks for a few bored quants.
PS, thinking models are very different in use-case than normal models. It’s great for tasks with a “right answer” like math problems. Consider a simple example, in calculus, your teacher made you “show your work”, and doing so certainly reduced the likelihood of errors. That’s the purpose of a thinking model, and it excels at similar tasks.
fwiw, National Public Radio (NPR) news in the USA said that AI experts stated it was "almost as good" as other current offerings like chatGPT and Gemini. That its real advantage is the low cost of providing the information. However, this is only a claim made by the company without any proof.
"DeepSeek-R1 is the latest resounding beat in the steady drumroll of AI progress. " IBM's Intellect, from 1983 cost $47,000 dollars a month. Let me know when DeepSleep-Rx exceeds Windows (tm) version numbers or makes a jump like AutoCADs version numbers.
> This is a large number of long chain-of-thought reasoning examples (600,000 of them). These are very hard to come by and very expensive to label with humans at this scale. Which is why the process to create them is the second special thing to highlight
I didn't know the reasonings were part of the training data. I thought we basically just told the LLM to "explain its thinking" or something as an intermediate step, but the fact that the 'thinking' is part of the training step makes more sense and I can see how this improves things in a non-trivial way.
Still not sure if using word tokens as the intermediate "thinking" is the correct or optimal way of doing things, but I don't know. Maybe after everything is compressed into latent space it's essentially the same stuff.
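To make the difference concrete, a reasoning-style SFT record might look roughly like this. This is purely illustrative - the field names and content are my own invention, not DeepSeek's actual format:

```python
# Hypothetical shape of one chain-of-thought training example. The model is
# trained to emit the entire completion, reasoning trace included, rather
# than being prompted to "explain its thinking" only at inference time.
example = {
    "prompt": "What is 37 * 24?",
    "completion": (
        "<think>37 * 24 = 37 * 20 + 37 * 4 = 740 + 148 = 888. "
        "Check: 888 / 24 = 37, consistent.</think>\n"
        "37 * 24 = 888"
    ),
}
print(example["completion"])
```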
Maybe too much of the same topic? "How R1 was trained" also seemed to quickly fall off. But the big arxiv paper with 1000+ upvotes stuck around a while.
Spot on. I've read the very accessible paper and it's better than any of the how-to's written elsewhere. Nothing against good content being written, but the source material is already pretty good.
It’s remarkable we’ve hit a threshold where so much can be done with synthetic data. The reasoning race seems an utterly solvable problem now (thanks mostly to the verifiability of results). I guess the challenge then becomes non-reasoning domains, where qualitative and truly creative results are desired.
It seems like we need an evaluation model for creativity. I'm curious, is there research on this -- for example, can one score a random painting and output how creative/good a given population is likely to find it?
How do you account for the impact of culture/lived experience of the specific population viewing the painting? Intuitively it seems like that would be the biggest factor, rather than the objective attributes of the painting, no?
All art is subjective. Any attempt to "verify" a piece of art would be entirely dependent on cultural and personal sensitivities. Art isn't a math problem with a solution.
But you can dissect it into concepts and see if it is something truly new to the model - if the output contains things which aren’t there in the weights, you have a nice specimen to study and, crucially, a recipe to get a bunch of matrices to output untrained things.
This is like saying: All cooks are equally good, even the most disgusting slop (e.g. water/flour soup) isn't any better than a dish from a cook with several Michelin stars. Of course the latter is better. And if it is better, it is objectively better. Even if 0.001% of people prefer flour soup.
> culture/lived experience of the specific population viewing the painting
Isn't this lived experience baked into LLM language bases? It's certainly very hard to target all possible populations at once. And art doesn't need that, doesn't do that. Only rare marketing sometimes attempts to do that and only in very limited ways, such as a brand name acceptable all over the world.
There are two kinds of creativity at play here. One is mashing together combinations of learned things - it’s kinda like shuffling a deck of cards where basically every shuffle gets you a deck that has never been seen and won’t be seen again, but it’s still the same 52 cards every time. The other kind is going outside of the box and inventing truly new, unseen/untrained concepts. This one is hard, but I don’t think it’s impossible - the <think> slop stirring the learned concepts with a bit of randomness should make progress here.
A new "AI challenge" -- can an AI make a hit movie (even if just for Netflix) in each of Documentary, Action, Thriller, Comedy, and Drama genres. This isn't art like the "Mona Lisa", but more like the ability to make "art" that has appeal to some level of the public. I think if an AI can do that, I'll be pretty impressed.
The prompt:
"Create a feature length [Action/Comedy/etc...] film that can borrow elements from existing films, but would generally not be considered a copy of any given film."
You can get very mechanical in scoring an image - ask any art student - if you want to, or if your instructor or audience wants you to. For example, "fits the rule of thirds?": yes is a point toward common attraction, no is a point toward the unexpected, at the risk of outsider-ness. You can do the same for color, composition, recognizing objects, and fitting those to memes, associations, or non-associations. Too many points in "unexpected" become meta-points in "unpleasant chaos" and so a strong downgrade in common attraction. You can match all this against images in a library (see how copyright or song recognition operates in the music category) and get out of it some kind of familiarity-versus-edge score (where too much edge goes against common attraction).
I would expect you could get better than most humans at recognizing shapes in an image and drawing associations from them. Such associations are a plus for unexpectedness/surprise if they are rare in the culture, or a plus for common attraction if they are common.
After that, to be cynical about it, you can randomize and second-guess yourself so your audience doesn't catch on to the first-level mimicry.
Creativity is not normally used as an absolute with a unique measure. It's not "length". And you only need to please part of the audience to be successful - sometimes a very small part, some of which loves surprise and some hates it, etc. Someone elsewhere objected on the grounds that creativity or attractiveness is culture based - yeah so? if you were to please much of just one whole culture, you would have an insane hit on your hands.
You can train a supervised model, taking into account the properties of the rater as well as the artwork, and tease out the factors that make it rated so.
You can probably cluster raters and the artwork they rate highly - though probably not in large quantities. The same might apply to raters willing to tell you why (and how! most love to do that), but again not in very large quantities, with the added issue that raters' own explanations of why they love or hate something are likely not entirely true or self-aware.
You could use a larger corpus, like auction house records and art magazines, but then you are confounding with celebrity - a large ingredient in art prices.
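Something along these lines is easy to prototype. A toy sketch with invented features and synthetic data, just to show the shape of the idea (every feature name and the "ground truth" formula here are made up):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
n = 2000
# artwork features: e.g. rule-of-thirds fit, color contrast, novelty vs corpus
art = rng.uniform(size=(n, 3))
# rater features: e.g. age bracket, exposure to the genre, culture cluster
rater = rng.uniform(size=(n, 3))
X = np.hstack([art, rater])
# toy "ground truth": novelty matters more to raters with high genre exposure
y = art[:, 2] * (0.5 + rater[:, 1]) + 0.3 * art[:, 1] + 0.1 * rng.normal(size=n)

# Fit rating ~ (artwork, rater) and inspect which factors drive it.
model = GradientBoostingRegressor().fit(X, y)
names = ["thirds", "contrast", "novelty", "age", "exposure", "culture"]
for name, imp in sorted(zip(names, model.feature_importances_), key=lambda t: -t[1]):
    print(f"{name:9s} {imp:.2f}")
```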
We all knew The Chinese government was going to censor it. The censoring happening in ChatGPT is arguably more interesting since they are not beholden to the US government. I'm more interested in that report.
The thing I still don’t understand is how DeepSeek built the base model cheaply, and why their models seem to think they are GPT4 when asked. This article says the base model is from their previous paper, but that paper also doesn’t make clear what they trained on. The earlier paper is mostly a description of optimization techniques they applied. It does mention pretraining on 14.8T tokens with 2.7M H800 GPU hours to produce the base DeepSeek-V3. But what were those tokens? The paper describes the corpus only in vague ways.
I imagine it's a mix of either using ChatGPT as an oracle to get training data, or it's the radiocarbon issue: the Internet now contains so much ChatGPT output that other models get confused.
Various other models also think they're ChatGPT or built by OpenAI, or at least those are the highest probability tokens when talking about an AI model or an AI company because of the massive prevalence in training data (the internet). It isn't the big reveal that it is often being held to be.
Add that training off of ChatGPT wouldn't reduce their training costs at all, but would actually increase their training costs. Literally all of the same training difficulty, but then add paying OpenAI for an enormous number of API calls. Not really seeing the win.
>The paper describes the corpus only in vague ways.
Anyone who runs a public website has logs absolutely filled by a seemingly infinite number of information aggregators. Just like everyone else they scraped the entire internet, pulled in all of Wikipedia, etc. Probably lots of pirate books, movie transcripts, etc.
The fact that training could be done more effectively is something that intuitively makes absolute sense to everyone in the field, but we just didn't make that leap. Similar to how a human isn't trained to recognize digits by training on 60,000 training digits then suddenly failing if a real world digit is slightly rotated or morphed in some way, we are making these improvements to content ingestion.
Er, how would that reduce the cost? You still need to train the model, which is the expensive bit.
Also, the base model for V3 and the only-RL-tuned R1-Zero are available, and they behave like base models, which seems unlikely if they used data from OpenAI as their primary data source.
It's much more likely that they've consumed the background radiation of the web, where OpenAI contamination is dominant.
Hypothetical question: is the Chinese government capable of exploiting ChatGPT to get around the query limit? For example, making queries through compromised devices or even snooping local traffic on devices? Let's face it, these models are closely aligned with China's national security, so it's not a far-fetched question to ask.
They fixed that. Now it replies: "Hi! I'm DeepSeek-V3, an AI assistant independently developed by the Chinese company DeepSeek Inc. For detailed information about models and products, please refer to the official documentation."
You can't distill from GPT-4 because OpenAI conceals the probabilities (and has for a couple of years now - since before GPT-4), presumably to prevent exactly that. You can fine-tune against its output, though. I might guess that they used something like OpenOrca or some other public dataset that includes GPT-4 output as part of their initial fine-tuning.
How does such a distillation work in theory? They don’t have weights from OpenAI’s models, and can only call their APIs, right? So how can they actually build off of it?
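Roughly, the difference looks like this. Toy tensors only, not a real model - the point is just which signal each approach needs from the teacher:

```python
import torch
import torch.nn.functional as F

vocab = 10
student_logits = torch.randn(1, vocab, requires_grad=True)

# (a) Soft-label distillation: needs the teacher's per-token probabilities,
# which a closed API does not expose.
teacher_probs = torch.softmax(torch.randn(1, vocab), dim=-1)
kd_loss = F.kl_div(F.log_softmax(student_logits, dim=-1),
                   teacher_probs, reduction="batchmean")

# (b) Fine-tuning on sampled output: only needs the token the teacher emitted,
# i.e. exactly the text a chat API gives you back.
sampled_token = torch.tensor([3])
sft_loss = F.cross_entropy(student_logits, sampled_token)

print(float(kd_loss), float(sft_loss))
```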
Do we know which changes made DeepSeek V3 so much faster and better to train than other models? DeepSeek R1's performance seems to be highly related to V3 being a very good model to start with.
I went through the paper, and my understanding is that they made these improvements compared to "regular" MoE models:
1. Multi-head Latent Attention (MLA). If I understand correctly, they were able to do some caching on the attention computation. This one is still a little bit confusing to me;
2. New MoE architecture with one shared expert and a large number of small routed experts (256 total, but 8 active per token). This was already used in DeepSeek-V2;
3. Better load balancing of expert training. During training, they add a bias or "bonus" value to experts that are less used, making them more likely to be selected in future training steps (see the sketch at the end of this comment);
4. They added a few smaller transformer layers to predict not only the next token but also a few additional tokens. Their training loss then uses all of these predicted tokens, not only the first one. This is supposed to improve the transformer's ability to predict sequences of tokens;
5. They are using FP8 instead of FP16 where it does not impact accuracy.
It's not clear to me which changes are the most important, but my guess would be that 4) is a critical improvement.
1), 2), 3) and 5) could explain why their model trains faster by some modest factor (maybe ~2x), but not the advertised ~10x boost, nor why it performs so much better than models with way more activated parameters (e.g. Llama 3).
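For 3), here's a toy sketch of what bias-adjusted top-k routing could look like. The variable names, the update rule and the constants are my guesses, not the paper's exact implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
num_experts, top_k, gamma = 8, 2, 0.001   # gamma: bias update speed (assumed)
bias = np.zeros(num_experts)              # per-expert routing "bonus"

def route(token_scores):
    # The bias influences which experts win top-k, but the gate weights
    # themselves come from the raw affinity scores only.
    chosen = np.argsort(token_scores + bias)[-top_k:]
    gates = np.exp(token_scores[chosen])
    return chosen, gates / gates.sum()

load = np.zeros(num_experts)
for _ in range(1024):                     # one simulated batch of tokens
    chosen, _ = route(rng.normal(size=num_experts))
    load[chosen] += 1

# Underloaded experts get their bias nudged up, overloaded ones down, so
# neglected experts become more likely to be picked in the next step.
bias += gamma * np.sign(load.mean() - load)
print("expert load:", load)
print("updated bias:", np.round(bias, 4))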
The key idea of Latent MHA is that "regular" multi-headed attention needs you to keep a bunch of giant key-value (KV) matrices around in memory to do inference. The "Latent" part just means that DeepSeek takes the `n` KV matrices in a given n-headed attention block and replaces them with a lower-rank approximation (think of this as compressing the matrices), so that they take up less VRAM on a GPU, at the cost of a little extra compute and a little lost accuracy. So not caching, strictly speaking, but compression that trades compute for better memory usage, which is good because the KV matrices are one of the more expensive parts of this transformer architecture. MoE addresses the other expensive part (the fully-connected layers) by making it so only a subset of the fully-connected layers is active in any given forward pass.
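A toy sketch of the compression idea - made-up dimensions, and it omits details like how MLA handles positional embeddings - so treat it as an illustration of "cache a small latent vector, expand it back at attention time", not DeepSeek's exact formulation:

```python
import numpy as np

d_model, n_heads, d_head, d_latent, n_cached = 1024, 8, 128, 64, 512
rng = np.random.default_rng(0)

W_down = rng.normal(size=(d_model, d_latent)) / np.sqrt(d_model)       # compress
W_up_k = rng.normal(size=(d_latent, n_heads * d_head)) / np.sqrt(d_latent)
W_up_v = rng.normal(size=(d_latent, n_heads * d_head)) / np.sqrt(d_latent)

hidden = rng.normal(size=(n_cached, d_model))     # 512 cached token states
latent_cache = hidden @ W_down                    # 512 x 64 is what lives in VRAM

# At attention time, expand the small cache back into per-head keys/values.
K = (latent_cache @ W_up_k).reshape(n_cached, n_heads, d_head)
V = (latent_cache @ W_up_v).reshape(n_cached, n_heads, d_head)

print("cached floats:", latent_cache.size,
      "vs naive KV cache floats:", n_cached * n_heads * d_head * 2)
```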
https://planetbanatt.net/articles/mla.html this is a great overview of how MLA works.
They also did bandwidth scaling to work around the nerfed H800 interconnects.
> efficient cross-node all-to-all communication kernels to fully utilize IB and NVLink bandwidths
> The key idea of DualPipe is to overlap the computation and communication within a pair of individual forward and backward chunks. To be specific, we divide each chunk into four components: attention, all-to-all dispatch, MLP, and all-to-all combine. Specially, for a backward chunk, both attention and MLP are further split into two parts, backward for input and backward for weights, like in ZeroBubble (Qi et al., 2023b). In addition, we have a PP communication component.
(I know some of those words)
https://arxiv.org/html/2412.19437v1
I think the fact that they used synthetic/distilled high-quality data from GPT-4o output to train, in the style of the Phi models, is of significance as well.
For the uninitiated, this is the author of the many other "The Illustrated..." blog posts.
A particularly popular one: https://jalammar.github.io/illustrated-transformer/
Always very high quality.
Thanks so much for mentioning this. His name carries a lot of weight for me as well.
Have you read his book Hands-On Large Language Models?
Looks interesting, but I'm skeptical that a book can feasibly stay up to date with the speed of development.
> Looks interesting, but I'm skeptical that a book can feasibly stay up to date with the speed of development.
The basic structure of the base models has not really changed since the first GPT launched in 2018. You still have to understand gradient descent, tokenization, embeddings, self-attention, MLPs, supervised fine tuning, RLHF etc for the foreseeable future.
Adding RL based CoT training would be a relatively straightforward addendum to a new edition, and it's an application of long established methods like PPO.
All "generations" of models are presented as revolutionary -- and results-wise they maybe are -- but technically they are usually quite incremental "tweaks" to the previous architecture.
Even more "radical" departures like state space models are closely related to same basic techniques and architectures.
> gradient descent
funny mentioning the math but not the Transformer encoders..
Transformer encoders are not really popular anymore, and all the top LLMs are decoder-only architectures. But encoder models like BERT are used for some tasks.
In any case, self-attention and the MLP are the crux of Transformer blocks, be they in the decoder or the encoder.
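For anyone who wants that crux in code, a minimal single-head block in NumPy might look like this - no masking, no layer norm, toy sizes; a sketch, not a faithful GPT block:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def block(x, p):
    # 1) self-attention: every position mixes information from the others
    q, k, v = x @ p["Wq"], x @ p["Wk"], x @ p["Wv"]
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1])) @ v
    x = x + attn @ p["Wo"]              # residual connection
    # 2) position-wise MLP: each position transformed independently
    h = np.maximum(0, x @ p["W1"])      # ReLU for simplicity
    return x + h @ p["W2"]              # residual connection

rng = np.random.default_rng(0)
d, d_ff, seq = 16, 64, 5
p = {name: rng.normal(size=shape) * 0.1
     for name, shape in [("Wq", (d, d)), ("Wk", (d, d)), ("Wv", (d, d)),
                         ("Wo", (d, d)), ("W1", (d, d_ff)), ("W2", (d_ff, d))]}
print(block(rng.normal(size=(seq, d)), p).shape)   # (5, 16)
```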
> Transformer encoders are not really popular anymore
references, please
I have not, but Jay has created a ton of value and knowledge for free and don't fault him for throwing an ad for his book / trying to benefit a bit financially.
Yeah no shade for someone selling their knowledge; I'm just trying to suss out how useful the book is for learning foundations.
Foundations don't change much with "the speed of development"
That's a good point
Am I the only one not that impressed with Deepseek R1? Its "thinking" seems full of the usual LLM blindsides, and ultimately generating more of it then summarizing doesn't seem to overcome any real limits.
It's like making mortgage backed securities out of bad mortgages, you never really overcome the badness of the underlying loans, no matter how many layers you pile on top
I haven't used or studied DeepSeek R1 (or o1) in exhaustive depth, but I guess I'm just not understanding the level of breathless hype right now.
What's there not to understand?
If it matches the latest OpenAI o-series model in performance - or even just comes close - at a fraction of the compute (50x less?), and it is free, then that's huge news.
They just upended the current LLM/AI/ML dominance, or at least the perceived dominance. Billions and billions have been pumped into the race, with investors betting on the winner - and here comes a Chinese hedge fund side-project on a shoestring budget, matching those billion-dollar behemoths. And they'll continue to release their work.
They just made the OpenAI et al. secret sauce a lot less valuable.
New theory: this is a short-long play by the fund. They shorted NV, and now they're hoovering up stock. In the process of making their billions from a small $50mm investment!
Bullseye
Damn Matt Levine to hell! His latest newsletter has an entire section devoted to the entire topic!
In my tests it does not come close to O1-Pro. Still huge news but it does not quite make it.
The results I got from DeepSeek-R1 on their webpage did not match the results I got from o1-pro. I asked it to go to a GitHub repo, find the part where the logic of the "export" button lives and explain why it doesn't work (the whole logic is actually missing, so it won't work at all). o1-pro got it right on the first try while DeepSeek-R1 was heavily hallucinating. Maybe I am using the wrong model?
No, you’re not. They explicitly mention in the R1 paper (in the last paragraph before the bibliography) that R1 isn’t a “huge” improvement over DeepSeek-V3 in coding - where “huge” is an academic weasel word.
It’s just a lot of hype. In my coding tests it significantly underperforms o1 (haven’t tried o1-pro), often getting stuck in a reasoning loop because I underspecified something (that I don’t have to with o1).
Same anecdotal experience. It's definitely an improvement, and they have made operational improvements at runtime, but I am still concerned they have overfit to the tests.
It is leaps and bounds better than earlier LLMs. For one, you are doing RL, which is classic-AI-style tuning that optimizes a reward function with nice properties - it's the same stuff used to train chess and Go engines by showing them the actual moves.
LLMs before o1 and DeepSeek R1 were RLHF-tuned, which is like training an LM to play chess by showing people two boards and doing a vibe check on which one "looks" better.
Think of it this way: say you were dropped in a maze that you had to solve, but you could do only one of two things:
1. Look at two random moves from your start position and select which one looks better for getting out.
2. Make a series of moves, backtrack, and then use a quantitative function to exploit the best path.
The latter is what R1 does, and it chooses a more optimal and more certain path to success.
Apply this to math and coding tokens and you have a competitive LLM.
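A toy version of option 2, with everything invented for illustration (the RLHF analogue would replace the reward function with a noisy pairwise preference judge):

```python
import random
random.seed(0)

MOVES, GOAL = ["U", "D", "L", "R"], (3, 3)
STEP = {"U": (0, 1), "D": (0, -1), "L": (-1, 0), "R": (1, 0)}

def rollout(policy, n=12):
    path = [random.choices(MOVES, weights=policy)[0] for _ in range(n)]
    x = sum(STEP[m][0] for m in path)
    y = sum(STEP[m][1] for m in path)
    return path, (x, y)

def reward(end):  # verifiable signal: negative distance to the exit
    return -abs(end[0] - GOAL[0]) - abs(end[1] - GOAL[1])

policy = [1.0, 1.0, 1.0, 1.0]             # uniform over U/D/L/R to start
for _ in range(300):
    path, end = rollout(policy)
    if reward(end) >= -2:                 # keep only the good attempts...
        for m in path:                    # ...and reinforce their moves
            policy[MOVES.index(m)] += 0.05

print("learned move preferences (U, D, L, R):", [round(w, 2) for w in policy])
```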
I am using the 32B distilled model on my local 3090 with Continue in VSCode. It blows everything else out of the water.
How many tokens/s do you get on a 3090? With the extra tokens for the internal monologue, is it still performant enough for smooth VSCode integration?
Any idea how to use a cloud hosted version with cursor?
As with most early tech of a particular category, it's not the current capabilities that are the point but the direction of travel.
DeepSeek has upended the conventional wisdom about model performance with respect to training, and it's a shock to the system. It demonstrates something that has become obvious: you don't need massive scale or funding to innovate and have impact, you just need good ideas.
I guess this is why Google's CEO said that they have no LLM moat, as cleverness is not a moat.
I've been using DeepSeek for a while; I never paid for ChatGPT or any other services.
The fact that R1 is now free and unlimited vs. ChatGPT's $200-a-month subscription is impressive enough for me. If the development cost is anywhere close to what they advertise publicly, it's even more impressive.
It's as good as or better than ChatGPT free, Gemini free, &c., and that's all I care about.
Why is cost your only concern? You don’t care at all what data the model was trained on? The motivations of the people who trained it? I mean maybe the importance of those things isn’t super high to you, but “don’t care”?
How odd. Most Westerners I know would care…
> You don’t care at all what data the model was trained on? The motivations of the people who trained it?
Working for a news org whose data was used without our consent to give ChatGPT a leg up, yeah I'd care a lot about it. The bottom line though is that nobody is innocent in that regard.
Would a self-respecting organization go all in on DeepSeek R1 without a security audit and a ton of competitive testing against other models? I doubt it. The same way they shouldn't just give all their employees OnePlus or Huawei phones.
I like how the Meta guy put it. This isn't China beating the US in AI, it's "open source" (however these guys define it) beating closed models.
OpenAI broke its promise with the world, which anybody who compares their name with their product can tell you. If open models put a dent in their hegemony it's only good for the rest of the industry.
So it genuinely does not matter to you how completely irrelevant US copyright law is to China, when using a Chinese LLM?
That’s such a weirdly specific application of a moral principle, it’s hard to believe.
It doesn't matter to you how completely irrelevant global copyright law is to OpenAI? [1]
[1] https://futurism.com/the-byte/openai-copyrighted-material-pa...
No, but then again I didn't claim copyright law mattered to me at all, you did.
I also said:
> The bottom line though is that nobody is innocent in that regard.
In other words, that bird has flown and it isn't a valid reason to choose one model over another.
Except you said it was worth caring about! I agree with what you wrote here, but I thought we were talking about this because you had concerns about OpenAI's disrespect for copyright.
Hey just wanna say I was sick yesterday, so if I was a schmuck I apologize. I think we actually agree on most of the things, and even as I was posting responses I was like, "... there's something hypocritical here but I can't quite see it." So again, my apologies.
It doesn't matter to the Chinese, and that's all that matters.
To the extent that training a model with copyrighted material infringes copyright, copyright law has to change. There is no other point of view. Disagreement means forfeiting a very important game before it even begins.
Ah yes, because chat gpt is ethical AI lmao
I'm European; seeing the clowns in charge of US tech companies, I'm equally happy using Chinese tools, especially if they're open source.
Stop twisting the debate as if it was "good US" vs "evil China"; I have no horse in this race.
That’s fine, it’s just not what Westerners generally say, in my experience. Not that distaste for Chinese authoritarianism is inherently pro Western, just that Westerners I know tend to have an ideological issue with the substantially higher levels of censorship and oppression that takes place in China.
It’s surprising to find people who genuinely don’t care about any of that, is all.
"westerners" as if it was a monolithic block. I envy the simplicity of seeing life in such terms, China bad, US good, it must be very relaxing
You don't have any concerns with the US government, the nepotism, the corruption, the conflict of interests, the insider trading, the massive concentration of wealth and power in tech, the insane lobbyism, PRISM, &c.
It's surprising to find people who can so clearly see how China is bad but are completely oblivious to their own problems. It's not a football game; you don't have to choose a side and be a bootlicker for eternity.
I didn’t say China bad, I said Westerners typically find the levels of oppression and censorship present in China to be of concern. I actually gave zero of my own judgement on China at all.
If you really think the Good vs. Evil narrative is wrong, why would you immediately go towards unrelated generic issues the West has? A neutral party would be more likely to acknowledge the problems with both sides, not reflexively try to change the subject!
Then again you didn’t claim to be a neutral party, did you?
The CEO did give a statement about their motivation. It could be a lie, but he delivered, and it is also vastly more sensible than what we often hear from other companies. Google and Meta are an exception in this space, though.
Also, because not only the weights but also the data is open, any propaganda can be identified and corrected. This is not the case for other models, and from what we have seen from Gemini, there certainly are "adaptations". I don't think Google had ill intent here, but this would fit what some would classify as propaganda.
Stop making this about "westerners" and "chinese". We don't care. Your bubble isn't representative of "westerners".
Well yeah, both sides are fucked so I'll use the free tool and not the $200/month tool, it really isn't rocket science
Even if deepseek is a chinese communist party evil trick what do they get ? My shitty code ? Big deal, at least I'm not down $200 a month, which is half of my rent
If they were to establish DeepSeek, in their authorized version, worldwide, it would mean establishing their worldview worldwide.
Students asking for help with homework will get the Chinese-approved version. Any housewife asking for a recommendation on who to vote for, and why, etc.
But... DeepSeek seems open source and the local version not restricted (as it was likely trained on ChatGPT in the first place). So I also refuse to refuse DeepSeek just because it comes from China. I see it as more competition that will hopefully help establish good open source models under the control of no single political organisation.
> Students asking for help with homework will get the Chinese-approved version. Any housewife asking for a recommendation on who to vote for, and why, etc.
Well too bad for them, using a pencil to cut bread sucks too, that's why we don't do it. Meanwhile it's converting my json to xml and pissing code so I don't have to piss it myself
Too bad for you, if the majority of them vote for someone not to your liking.
That's why I'm investing in land and ammos instead of cryptos and nvidia stocks
I can see the motive, but I'd rather invest in a world where I won't end up in a fortified home shooting hungry scavengers.
I'm investing in my local community, it's closer to chicken coop vs rabbit cages than chatgpt vs deepseek, and sadly the war isn't even 1000km away as we speak
A bit closer for me, but how will the local community help, if the war is coming?
I considered Switzerland for this reason. (Also, I like mountains.)
> how will the local community help, if the war is coming?
That's the only thing left.
Switzerland won't help when the top 30%+ of every EU country flees there
> Too bad for you, if the majority of them will vote someone not to your liking.
If the majority of people get their vote from a LLM, then it already doesn't matter which LLM it is.
Off topic: why the hell are you converting json to xml? I would rather convert json to yaml instead, a step forwards instead of backwards in evolution, but that is my opinion.
Working with legacy backward code?
> Students asking for help with homework will get the Chinese-approved version. Any housewife asking for a recommendation on who to vote for, and why, etc.
How is this any worse than where the US is now, getting the Musk/Zuckerberg/Bezos/Trump-approved answers, if any answer can be gotten at all, after the current occupant of the Oval Office has silenced all federal agencies.
Western trained models will tell you politically inconvenient things, and if they won’t, that info is otherwise freely available. Neither is the case for Chinese trained models.
This is virtue signaling and concern trolling until you post evidence that DeepSeek is doing worse than what ChatGPT does on Israel vs Palestine.
Who am I signaling my virtue to if I'm posting anonymously?
> Western trained models will tell you politically inconvenient things,
Pull the other leg. Do you know who David Mayer is? Hint: https://archive.ph/iI5xC
It's a random guy. He's also not censored by OpenAI models:
> Who is David Mayer
The name "David Mayer" is associated with several individuals across various fields:
1. *David Mayer de Rothschild*: Born in 1978, he is a British environmentalist, adventurer, and member of the Rothschild family. He is known for his environmental advocacy and expeditions, including the "Plastiki" project, where he sailed across the Pacific Ocean on a boat made from recycled plastic bottles to raise awareness about plastic pollution. ([en.wikipedia.org](https://en.wikipedia.org/wiki/David_Mayer_de_Rothschild?utm_...))
2. *David M. Mayer*: A professor of Management and Organizations at the University of Michigan, specializing in behavioral ethics. ([scholar.google.com](https://scholar.google.com/citations?hl=en&user=c2Zunb8AAAAJ...))
3. *David R. Mayer*: An American politician born in 1967, he has served as the mayor of Gloucester Township, New Jersey, and was a member of the New Jersey General Assembly. ([en.wikipedia.org](https://en.wikipedia.org/wiki/David_R._Mayer?utm_source=chat...))
4. *David Delaney Mayer*: Born in 1992, he is an American documentary filmmaker and social entrepreneur, known for projects like the PBS series "Food Town" and co-founding DreamxAmerica, an initiative supporting immigrant entrepreneurs. ([en.wikipedia.org](https://en.wikipedia.org/wiki/David_Delaney_Mayer?utm_source...))
5. *David Mayer (Historian)*: An American-British theatre historian (1928–2023), he was an emeritus professor at the University of Manchester, recognized for his work on 19th-century drama and the Victorian stage. ([en.wikipedia.org](https://en.wikipedia.org/wiki/David_Mayer_%28historian%29?ut...))
6. *Akhmed Chatayev (Alias: David Mayer)*: A Chechen militant (1980–2017) who used the alias "David Mayer." This alias led to a case of mistaken identity affecting the historian David Mayer. ([en.wikipedia.org](https://en.wikipedia.org/wiki/David_Mayer_%28historian%29?ut...))
If you have a specific "David Mayer" in mind, please provide more context or details, and I can offer more targeted information.
This directly conflates a people with a government. Western people are not the same as the US government. Here in California, three decades of China human rights history are very clear to a lot of people. In fact, many Chinese-speaking people on the American West Coast left China for specific reasons, too.
Utilitarian, money-oriented self-servers definitely "care" less about these things? EU or UK or wherever.
It's funny you say that, because ChatGPT, Gemini, etc. have way more censorship built in than DeepSeek.
No, the US used to be allies, but that was before Trump.
I think there’s a real chance of changes in soft-power dynamics given recent events. When allies feel their partner and neighbour is behaving solely in its own interests, it changes trust relationships.
Trump, tech, a US-first focus with less interest in collaboration. TikTok. Ukraine and discussions around defence budgets. Reducing focus on EVs. Tariffs. Iceland. Gulf of America.
Immediate choices might be fear-based but it’s smart to look for other partners as trust erodes.
An extreme side effect might be countries who felt safe under the US arms umbrella, needing to arm themselves. Are we absolutely sure that’s what we want? Does that include nukes?
Chinese models are censored, while Western models are _aligned_. it's a very important distinction.
personally I imagine in the future I'll use a mock UN panel of LLMs to advise me, and avoid any one nation/political party's influence, if I ever get to the point of delegating much of my thinking to the machine.
Most westerners wouldn't want their land threatened by their allies either, but we don't live in a perfect world.
> Western models are _aligned_
Aligned to what? By whom? Did you vote for any of these people? Did anyone ask you for your opinion? Can I see which topics are aligned? Can I see how much they are aligned? Does the alignment change depending on who the current US president is?
that was my point, precisely.
Talk to R1 for a while, and you'll notice that it's both censored and aligned.
I think the most free-minded large models might be the Groks, but just slightly, as they have different biases. In sum, there's strength in diversity.
> a very important distinction.
A distinction without a difference.
As used by the Musk/Zuck/Bezos crowd, "aligned" is a weasel word for "parroting the TESCREAL world view"
Indeed.
The big bad wolf is going to eat your lunch, you better increase your nationalism and taxes and stop global meritocracy?
There was a psyop by the politicians just a month or so ago: "Trump employees recruited people on the street by promising them an expensive restaurant meal, according to a Danish radio station." But if you listen to most of the media, you would think Greenlanders dream of becoming the next Mississippi, with a median salary of $35,070 instead of the current $43,664.
Not saying it is bad, but the US sure has a stronger self-interest than most.
I'm just saying: let the models' output speak for itself, and let the actions speak for themselves.
It's the cost comparison with O1, both to train and to run (per their pricing), that is causing most of the shock, perhaps along with the fact that it's a GPU-poor Chinese company that has caught up with O1, not a US one (Anthropic, Google, Meta, X.ai, Microsoft). The fact that it's open weights and the training is described in fair detail in the paper they released is also significant.
The best comparison for R1 is O1, but given different training data, hard to compare outside of benchmarks. At the moment these "reasoning models" are not necessarily the best thing to use for non-reasoning tasks, but Anthropic have recently indicated that they expect to release models that are more universal.
You’re not the only one. It’s not as impressive at coding compared to O1 as people make it out to be and it’s explicitly spelled out in DeepSeek’s R1 paper that they had trouble with improving over DeepSeek-V3:
> Software Engineering Tasks: Due to the long evaluation times, which impact the efficiency of the RL process, large-scale RL has not been applied extensively in software engineering tasks. As a result, DeepSeek-R1 has not demonstrated a huge improvement over DeepSeek-V3 on software engineering benchmarks. Future versions will address this by implementing rejection sampling on software engineering data or incorporating asynchronous evaluations during the RL process to improve efficiency. (last bullet point of page 16, which is the last page of the paper before the bibliography - hmm…) [1]
It does even worse on my real-world coding problems than the benchmarks would suggest. Some of my tests: write a Qt QAbstractListModel in C++ that parses markdown into blocks using a C/C++ MD parsing library, write Rust cxx-qt bindings for QTextDocument (all docs included in context), write a window-switching script for Wayland with an alternative to wmctrl, etc. I also asked it some geochemistry questions that I had trialed O1 with previously, and the reasoning had a lot of hallucinations. The answers were suboptimal, to say the least.
Having access to the thought process in <think></think> tags is cool, but it just reveals how underdeveloped the actual reasoning is compared to whatever o1 is doing. I've had it get stuck on silly things, like whether a C++ library for markdown exists, because I underspecified that I'm okay with header-only C libs. O1 has fared much better on my qualitative test suite, even more so when using it via API with a "high" reasoning_effort.
With all of the hype over the last few days, it feels like I’m taking crazy pills (loving all the action though).
[1] https://arxiv.org/abs/2501.12948
The brute force approach is the expensive one ("probably, so far anyway": add these four words everywhere), and impossible to make "always correct", i.e. the usual blind spots. They seem to be trying a bunch of specialized training ideas here and there in the system, just as a Mixture of Experts does, in different places than were obvious so far, and with an eye toward reasoning. In particular, they are trying to build a reasoning-oriented training base from a minimal seed.
It's still not going to give an "always correct" result but we are nowhere near the point where that's needed. We are only at the point where a new idea can get you a percentage point further in benchmarks. Some fundamental limits were baked into the previous assumptions - easy to get past by leaving these assumptions behind.
I get the hype, even if I don’t necessarily agree with it. TLDR: it’s technically impressive in training infrastructure and a geopolitical surprise.
As other commenters said, it compares favorably against American flagship models in benchmarks. This is geopolitically interesting since the model is Chinese and subject to trade restrictions, and America thinks of itself as home to the world's best model builders.
What makes it interesting technically is that the trade restrictions required them to focus on efficiency and on reusing old hardware. This made the training wildly cheaper to run. They make a ton of very low-level optimizations to make training efficient. This is impressive engineering, and it shows how a modest amount of effort can loosen Nvidia lock-in. They had to bypass a lot of CUDA libraries entirely and drop down even lower to control the GPUs. Nvidia has been capturing a huge fraction of industry-wide AI spend, and surely no one wants that. This is, IMO, the actual part to watch. I suspect it'll drive a new wave of efficiency-focused LLM tools which will also unlock competitors' GPUs.
They also used some novel-for-LLMs training techniques (mostly reinforcement learning), but it's suspected that the big AI companies elsewhere are doing the same now too, just without disclosing it.
What I think is hype, meanwhile, is the actual benchmarks. Most models are trained on their competitors' output. This is just a reality of the industry, especially non-flagships being trained against flagships' data. DeepSeek was almost certainly trained against OpenAI models, so it makes sense that it would approach their output quality. That's very different from being capable of outperforming or "taking the lead" or whatever. We'll need to wait and see how the future goes to make that determination. "China" has long had a great history of machine learning tech, so there is no reason to think that it's structurally impossible for a Chinese organization to be on the leading edge, but it has to happen before we can say it happened.
What is also hype is calling this a “side project” of a financial firm. The firm spun it out as a dedicated company. China cracked down on hedge funds, so the company looked for ways to repurpose the talent and compute infrastructure. This isn’t some side project during lunch breaks for a few bored quants.
PS, thinking models are very different in use-case than normal models. It’s great for tasks with a “right answer” like math problems. Consider a simple example, in calculus, your teacher made you “show your work”, and doing so certainly reduced the likelihood of errors. That’s the purpose of a thinking model, and it excels at similar tasks.
fwiw, National Public Radio (NPR) news in the USA said that AI experts stated it was "almost as good" as other current offerings like chatGPT and Gemini. That its real advantage is the low cost of providing the information. However, this is only a claim made by the company without any proof.
[flagged]
"DeepSeek-R1 is the latest resounding beat in the steady drumroll of AI progress. " IBM's Intellect, from 1983 cost $47,000 dollars a month. Let me know when DeepSleep-Rx exceeds Windows (tm) version numbers or makes a jump like AutoCADs version numbers.
> This is a large number of long chain-of-thought reasoning examples (600,000 of them). These are very hard to come by and very expensive to label with humans at this scale. Which is why the process to create them is the second special thing to highlight
I didn't know the reasoning traces were part of the training data. I thought we basically just told the LLM to "explain its thinking" or something as an intermediate step, but the fact that the 'thinking' is part of the training step makes more sense, and I can see how this improves things in a non-trivial way.
Still not sure if using word tokens as the intermediate "thinking" is the correct or optimal way of doing things, but I don't know. Maybe after everything is compressed into latent space it's essentially the same stuff.
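For intuition, here's a minimal sketch (hypothetical, not DeepSeek's actual pipeline or template) of what it means for the "thinking" to be part of the training data: the reasoning trace is simply spliced into the target text, so ordinary next-token prediction learns to emit it before the answer.

```python
# Minimal sketch: turning a reasoning trace into a plain supervised
# fine-tuning example. The chat format and <think> tags here are
# illustrative assumptions, not DeepSeek's exact template.
def format_example(question: str, reasoning: str, answer: str) -> str:
    return (
        f"User: {question}\n"
        f"Assistant: <think>{reasoning}</think>\n{answer}"
    )

sample = format_example(
    "What is 17 * 24?",
    "17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408.",
    "408",
)
print(sample)
# The string is then tokenized and trained on with the usual
# cross-entropy loss; no extra machinery is needed beyond including
# the trace in the target.
```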
How is this very high signal-to-noise post off the front page in 2 hours?
Are people so upset with the stock market crash that they are flagging it?
Maybe too much of the same topic? "How R1 was trained" also seemed to quickly fall off. But the big arxiv paper with 1000+ upvotes stuck around a while.
Spot on. I've read the very accessible paper and it's better than any of the how-to's written elsewhere. Nothing against good content being written, but the source material is already pretty good.
dang has provided an answer for how the algorithm works whenever I've asked a similar question.
But I still don't get it. 6 hours + 170 points and it's on third page. Meanwhile second page has "Null Byte on Steroids" at 12 hours + 20 points. ??
It’s remarkable we’ve hit a threshold where so much can be done with synthetic data. The reasoning race seems an utterly solvable problem now (thanks mostly to the verifiability of results). I guess the challenge then becomes non-reasoning domains, where qualitative and truly creative results are desired.
It seems like we need an evaluation model for creativity. I'm curious, is there research on this -- for example, can one score a random painting and output how creative/good a given population is likely to find it?
How do you account for the impact of culture/lived experience of the specific population viewing the painting? Intuitively it seems like that would be the biggest factor, rather than the objective attributes of the painting, no?
All art is subjective. Any attempt to "verify" a piece of art would be entirely dependent on cultural and personal sensitivities. Art isn't a math problem with a solution.
But you can dissect it into concepts and see if it is something truly new to the model - if the output contains things which aren’t there in the weights, you have a nice specimen to study and, crucially, a recipe to get a bunch of matrices to output untrained things.
This is like saying: All cooks are equally good, even the most disgusting slop (e.g. water/flour soup) isn't any better than a dish from a cook with several Michelin stars. Of course the latter is better. And if it is better, it is objectively better. Even if 0.001% of people prefer flour soup.
> culture/lived experience of the specific population viewing the painting
Isn't this lived experience baked into LLM language bases? It's certainly very hard to target all possible populations at once. And art doesn't need that, doesn't do that. Only rare marketing sometimes attempts to do that and only in very limited ways, such as a brand name acceptable all over the world.
There are two kinds of creativity at play here. One is mashing together combinations of learned things - it’s kinda like shuffling a deck of cards where basically every shuffle gets you a deck that has never been seen and won’t be seen again, but it’s still the same 52 cards every time. The other kind is going outside of the box and inventing truly new, unseen/untrained concepts. This one is hard, but I don’t think it’s impossible - the <think> slop stirring the learned concepts with a bit of randomness should make progress here.
A new "AI challenge" -- can an AI make a hit movie (even if just for Netflix) in each of Documentary, Action, Thriller, Comedy, and Drama genres. This isn't art like the "Mona Lisa", but more like the ability to make "art" that has appeal to some level of the public. I think if an AI can do that, I'll be pretty impressed.
The prompt: "Create a feature length [Action/Comedy/etc...] film that can borrow elements from existing films, but would generally not be considered a copy of any given film."
> can one score a random painting
You can get very mechanical in scoring an image. Ask any art student. If you want to, or if your instructor or audience wants to. For example, "fits the rule of thirds?": yes is a point toward common attraction, no is a point toward the unexpected, at the risk of outsider-ness. You can do that with color, composition, recognizing objects and fitting them to memes or associations or non-associations. Too many points in "unexpected" becomes meta-points in "unpleasant chaos" and so a strong downgrade in common attraction. You can match all this to images in a library (see how copyright or song recognition operates in music) and get out of that some kind of familiarity-vs-edge score (where too much edge goes against common attraction).
I would expect you could get better than most humans at recognizing shapes in an image and drawing associations from that. Such associations are a plus in unexpected/surprise if they are rare in the culture, or a plus in common attraction if they are common.
After that, to be cynical about it, you can randomize and second-guess yourself so your audience doesn't catch on to the first-level mimicry.
Creativity is not normally used as an absolute with a unique measure. It's not "length". And you only need to please part of the audience to be successful - sometimes a very small part, some of which loves surprise and some of which hates it, etc. Someone elsewhere objected on the grounds that creativity or attractiveness is culture-based - yeah, so? If you were to please much of just one whole culture, you would have an insane hit on your hands.
Sounds feasible to me.
You can train a supervised model, taking into account the properties of the rater as well as the artwork, and tease out the factors that make it rated so.
You can probably cluster raters and the artwork they rate highly - but probably not in large quantities? The same might be true of raters being willing to tell you why - and how! most love to do that - but again not in very large quantities. With the added issue that raters' own opinions of why they love or hate something are likely not entirely true or self-aware.
You could use a larger corpus, like auction house files and art magazines. But then you are confounding for celebrity - a large ingredient in art prices.
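As a rough sketch of the "rater-aware" supervised model described above (purely hypothetical; every feature name and data point below is made up), you could concatenate artwork features with rater features, fit a regressor on observed ratings, and look at which factors carry weight:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
n = 500
# Made-up artwork features: e.g. color entropy, symmetry, edge density, novelty.
artwork = rng.normal(size=(n, 4))
# Made-up rater features: e.g. age, exposure to the genre, a culture embedding.
rater = rng.normal(size=(n, 3))
X = np.hstack([artwork, rater])
# Fake ratings: partly the artwork itself, partly an artwork-rater interaction.
y = artwork[:, 3] + 0.5 * artwork[:, 0] * rater[:, 1] + rng.normal(scale=0.3, size=n)

model = GradientBoostingRegressor().fit(X, y)
# Feature importances hint at which attributes (and which rater traits)
# drive the predicted ratings for this synthetic population.
print(model.feature_importances_.round(2))
```

The confounders mentioned above (celebrity, raters' unreliable self-reporting) would show up as features you can't observe cleanly, which is exactly where such a model gets shaky.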
It's still reasoning based on pattern matching, which should go only so far. But "only so far" could be plenty for lots of applications.
Tuning for qualitative outcomes is pretty much solved via RLHF/DPO (what this post calls "preference tuning"). Right?
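For reference, here is a minimal sketch of the DPO objective (one common form of "preference tuning"; this is a generic textbook version under made-up toy values, not the post's or any lab's exact recipe):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO on summed per-sequence log-probs (one value per preference pair).
    A frozen reference model anchors the policy so it can't drift arbitrarily."""
    chosen = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected = beta * (policy_rejected_logps - ref_rejected_logps)
    # Push the policy to rank the human-preferred completion above the rejected one.
    return -F.logsigmoid(chosen - rejected).mean()

# Toy log-probabilities for two preference pairs.
loss = dpo_loss(torch.tensor([-12.0, -9.5]), torch.tensor([-14.0, -11.0]),
                torch.tensor([-12.5, -10.0]), torch.tensor([-13.5, -10.5]))
print(loss.item())
```

Whether this fully "solves" qualitative tuning is the open question, though: the loss only encodes whatever the preference data encodes.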
We all knew the Chinese government was going to censor it. The censoring happening in ChatGPT is arguably more interesting, since they are not beholden to the US government. I'm more interested in that report.
The thing I still don’t understand is how DeepSeek built the base model cheaply, and why their models seem to think they are GPT4 when asked. This article says the base model is from their previous paper, but that paper also doesn’t make clear what they trained on. The earlier paper is mostly a description of optimization techniques they applied. It does mention pretraining on 14.8T tokens with 2.7M H800 GPU hours to produce the base DeepSeek-V3. But what were those tokens? The paper describes the corpus only in vague ways.
I imagine it's either using ChatGPT as an oracle to get training data, or it's the radiocarbon issue, where the Internet has so much info on ChatGPT that other models now get confused.
Various other models also think they're ChatGPT or built by OpenAI, or at least those are the highest probability tokens when talking about an AI model or an AI company because of the massive prevalence in training data (the internet). It isn't the big reveal that it is often being held to be.
Add that training off of ChatGPT wouldn't reduce their training costs at all, but would actually increase their training costs. Literally all of the same training difficulty, but then add paying OpenAI for an enormous number of API calls. Not really seeing the win.
>The paper describes the corpus only in vague ways.
Anyone who runs a public website has logs absolutely filled by a seemingly infinite number of information aggregators. Just like everyone else they scraped the entire internet, pulled in all of Wikipedia, etc. Probably lots of pirate books, movie transcripts, etc.
The fact that training could be done more effectively is something that intuitively makes absolute sense to everyone in the field, but we just didn't make that leap. Similar to how a human isn't trained to recognize digits by training on 60,000 training digits then suddenly failing if a real world digit is slightly rotated or morphed in some way, we are making these improvements to content ingestion.
A friend just sent me a screenshot where he asks DeepSeek if it has an app for Mac and it replies that they have a ChatGPT app from OpenAI, lol.
I 100% believe they distilled GPT-4, hence the low "training" cost.
Er, how would that reduce the cost? You still need to train the model, which is the expensive bit.
Also, the base model for V3 and the only-RL-tuned R1-Zero are available, and they behave like base models, which seems unlikely if they used data from OpenAI as their primary data source.
It's much more likely that they've consumed the background radiation of the web, where OpenAI contamination is dominant.
Hypothetical question: is the Chinese government capable of exploiting ChatGPT to get around the query limit? For example, making queries through compromised devices or even snooping local traffic on devices? Let's face it, these models are closely aligned with China's national security, so it's not a far-fetched question to ask.
They fixed that. Now it replies: "Hi! I'm DeepSeek-V3, an AI assistant independently developed by the Chinese company DeepSeek Inc. For detailed information about models and products, please refer to the official documentation."
???
I just did and it told me about ChatGPT and OpenAI.
Are you affiliated with them, btw?
No. I’m not affiliated. Maybe you’re on a node that doesn’t have whatever change they might have made.
You can't distill from GPT-4 because OpenAI conceals the probabilities (and has for a couple of years now, since before GPT-4), presumably to prevent exactly that. You can fine-tune against its output, though. My guess is that they used something like OpenOrca or some other public dataset that includes GPT-4 output as part of their initial fine-tuning.
How does such a distillation work in theory? They don’t have weights from OpenAI’s models, and can only call their APIs, right? So how can they actually build off of it?
Like RLHF but the HF part is GPT4 instead.
How do you ensure the student model learns robust generalizations rather than just surface-level mimicry?
No idea as I don't work on that, but my guess would be that the higher the 'n' the more model A approaches model B.
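To make the distinction in this sub-thread concrete, here's a toy contrast (all shapes and values below are made up) between classic soft-label distillation, which needs the teacher's token probabilities, and the practical fallback of fine-tuning on text sampled from the teacher:

```python
import torch
import torch.nn.functional as F

def soft_label_distill_loss(student_logits, teacher_logits, T=2.0):
    """Classic knowledge distillation: match the teacher's full token
    distribution. This needs teacher logits, which the OpenAI API does
    not expose (beyond a few top logprobs)."""
    s = F.log_softmax(student_logits / T, dim=-1)
    t = F.softmax(teacher_logits / T, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * (T * T)

def hard_target_sft_loss(student_logits, teacher_token_ids):
    """The practical fallback: ordinary fine-tuning on text the teacher
    produced, i.e. cross-entropy against the teacher's sampled tokens."""
    return F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        teacher_token_ids.view(-1),
    )

# Toy shapes: batch of 2 sequences, 5 tokens each, vocab of 100.
student = torch.randn(2, 5, 100)
teacher = torch.randn(2, 5, 100)
tokens = torch.randint(0, 100, (2, 5))
print(soft_label_distill_loss(student, teacher).item())
print(hard_target_sft_loss(student, tokens).item())
```

The second loss is what "training against a flagship's output" usually means in practice, and the more teacher samples you train on, the closer the student tends to mimic the teacher's surface behavior.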
[dead]
This is fantastic work, thank you!
The "illustrated"... He needs to read up on Tufte or Bret Victor or something, these are just diagrams with text inside of boxes.