maz1b 2 days ago

Cerebras has been a true revelation when it comes to inference. I have a lot of respect for their founder, team, innovation, and technology. The colossal size of the WS3 chip, utilizing DRAM to mind-boggling scale, it's definitely ultra cool stuff.

I also wonder why they have not been acquired yet. Or is it intentional?

I will say, their pricing and deployment strategy is a bit murky and unclear. Paying $1500-$10,000 per month plus usage costs? I'm assuming that it has to do with chasing and optimizing for higher value contracts and deeper-pocketed customers, hence the minimum monthly spend that they require.

I'm not claiming to be an expert, but as a CEO/CTO, there were other providers in the market with relatively comparable inference speed (obviously Cerebras is #1), easier onboarding, and better responsiveness from the people who work there (all of my interactions with Cerebras have been days/weeks late or simply ignored). IMHO, if Cerebras wants to gain more mindshare, they'll have to look into this aspect.

  • aurareturn 2 days ago

      I also wonder why they have not been acquired yet. Or is it intentional?
    
    A few issues:

    1. To achieve high speeds, they put everything on SRAM. I estimated that they needed over $100m of chips just to do Qwen 3 at max context size (rough sketch of the arithmetic after this list). You can run the same model with max context size on $1m of Blackwell chips, but at a slower speed. Anandtech had an article saying that Cerebras was selling a single chip for around $2-3m. https://news.ycombinator.com/item?id=44658198

    2. SRAM has virtually stopped scaling in new nodes. Therefore, new generations of wafer scale chips won’t gain as much as traditional GPUs.

    3. Cerebras was designed in the pre-ChatGPT era, when much smaller models were being trained. It is practically useless for training in 2025 because of how big LLMs have gotten. It can only do inference, but see the two problems above.

    4. To inference very large LLMs economically, Cerebras would need to use external HBM. If it has to reach outside the wafer for memory, the benefits of a wafer scale chip greatly diminish. Remember that the whole idea was to put the entire AI model inside the wafer so memory bandwidth is ultra fast.

    5. Chip interconnect technology might make wafer scale chips redundant. TSMC has a roadmap for gluing more than 2 GPU dies together. Nvidia’s Feynman GPUs might have 4 dies glued together. I.e., the sweet spot for large chips might not be wafer scale but perhaps 2, 4, or 8 dies together.

    6. Nvidia seems to be moving much faster in terms of development and responding to market needs. For example, Blackwell is focused on FP4 inferencing now. I suppose the nature of designing and building a wafer scale chip is more complex than a GPU. Cerebras also needs to wait for new nodes to fully mature so that yields can be higher.
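
    For what it's worth, here is roughly how I got to the $100m figure in (1). Every number below is an assumption on my part (FP16 weights, ~44 GB of SRAM per wafer, the ~$2-3m per chip that Anandtech reported, and a crude doubling for KV cache at max context), not anything Cerebras has published:

        # Back-of-envelope only; all figures assumed, none official.
        params          = 480e9    # Qwen3-Coder 480B
        bytes_per_param = 2        # FP16
        sram_per_wafer  = 44e9     # ~44 GB of on-wafer SRAM
        cost_per_wafer  = 2.5e6    # midpoint of the reported $2-3m

        weight_bytes = params * bytes_per_param      # ~960 GB of weights
        total_bytes  = weight_bytes * 2              # crude allowance for max-context KV cache
        wafers       = total_bytes / sram_per_wafer  # ~44 wafers
        print(f"~{wafers:.0f} wafers, ~${wafers * cost_per_wafer / 1e6:.0f}m")  # ~44 wafers, ~$109m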

    There exists a niche where some applications might need super fast token generation regardless of price. Hedge funds and Wall Street might be good use cases. But it won’t challenge Nvidia in training or large scale inference.

    • Voloskaya 2 days ago

      > I estimated that they needed over $100m of chips just to do Qwen 3 at max context size

      I will point out (again :)) that this math is completely wrong. There is no need (nor any performance gain) to store the entire weights of the model in SRAM. You simply store n transformer blocks on-chip and then stream block l+n from external memory to on-chip when you start computing block l. This completely masks the communication time behind the compute time, and specifically does not require you to buy 100M$ worth of SRAM. This is standard stuff that is done routinely in many scenarios, e.g. FSDP.

      https://www.cerebras.ai/blog/cerebras-software-release-2.0-5...
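
      A minimal sketch of that pattern (load_block and compute_block are hypothetical stand-ins, not a real Cerebras API):

          from concurrent.futures import ThreadPoolExecutor

          N_AHEAD = 2  # blocks kept resident on-chip ahead of the one being computed

          def run_blocks(x, num_blocks, load_block, compute_block):
              # Prefetch block l+N_AHEAD while block l computes, so the transfer
              # from external memory hides behind the compute.
              pool = ThreadPoolExecutor(max_workers=1)
              pending = [pool.submit(load_block, i) for i in range(min(N_AHEAD, num_blocks))]
              for l in range(num_blocks):
                  if l + N_AHEAD < num_blocks:
                      pending.append(pool.submit(load_block, l + N_AHEAD))
                  weights = pending.pop(0).result()  # already resident if compute time >= transfer time
                  x = compute_block(weights, x)
              return x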

      • bubblethink 2 days ago

        That blog is about training. For inference, the weights and kv cache are in SRAM. Having said that, the $100M number is inaccurate/meaningless. It's a niche product that doesn't have economies of scale yet.

        • Voloskaya a day ago

          The blog is about training but the technique applies equally well to inference, just like FSDP and kv cache sharing are routinely done in inference on GPUs.

          There is just no need to have parameters or kv cache for layer 48 in SRAM when you are currently computing layer 3, you have all the time in the world to move that to SRAM when you get to layer 45 or whatever the maths work out to be for your specific model.

      • vlovich123 2 days ago

        I did experiments with this on a traditional consumer GPU: the larger the discrepancy between model size and VRAM, the faster performance dropped off (exponentially), until it was as if you didn’t have any VRAM in the first place and were just running over PCIe. This technique is well known and works when you have more than enough bandwidth.

        However, the whole point is that even HBM is a problem because its available bandwidth is insufficient, so if you’re marrying SRAM and HBM I would expect the performance gains to be modest overall for models that exceed available SRAM in a meaningful way.

        • Voloskaya 2 days ago

          This is highly dependent on the exact model size, architecture and hardware configuration. If the compute time for some unit of work is larger than the time it takes to transfer the next batch of params, you are good to go. If you are doing it sequentially, though, then yes, you will pay a heavy price; but the idea is to fetch a future layer, not the one you need right away.
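
          As a toy version of that condition (all numbers assumed): compute for one block is roughly 2 * params * batch_tokens FLOPs while the transfer is params * bytes_per_param bytes, so the parameter count cancels out and what matters is how many tokens you have in flight relative to the flops/bandwidth ratio:

              bytes_per_param = 2        # FP16
              flops           = 100e15   # usable compute, FLOP/s (assumed)
              bandwidth       = 1e12     # external memory -> on-chip, bytes/s (assumed)

              # Streaming hides fully when 2 * params * batch / flops >= params * bytes / bandwidth
              break_even_batch = bytes_per_param * flops / (2 * bandwidth)
              print(f"need >= {break_even_batch:,.0f} tokens in flight per block")  # 100,000 with these numbers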

          As a similar example, I have trained video models on ~1000 H100s where the vast majority of parameters are sharded and so need to be fetched over the network before being available in HBM, which is a similar imbalance to the HBM vs SRAM story. We were able to fully mask comms time, such that not sharding (if it were even possible) would offer no performance advantage.

      • aurareturn a day ago

        What about for inference?

        In that same thread, a Cerebras executive disputed my $100m number but did not dispute that they store the entire model on SRAM.

        They can make chips at cost and claim it isn’t $100m. But Anandtech did estimate/report $2-3m per chip.

        • Voloskaya a day ago

          > What about for inference?

          Same techniques apply.

          > but did not dispute that they store the entire model on SRAM.

          No idea what they did or did not do for that specific test (which was about delivering 1800 tokens/sec, though, not simply running Qwen-3), since they didn't provide any detail. I don't think there is any point storing everything in SRAM, even if you do happen to have 100M$ worth of chips lying around in a test cluster at the office, since WSE-3 is designed from the ground up for data parallelism (see [1] section 3.2) and inference is sequential both within a single token generation (you need to go through layer 1 before you can go through layer 2, etc.) and between tokens (autoregressive, so token 1 before token 2).

          This means most of the weights loaded in SRAM would just be sitting unused most of the time, and when they do need to be used they have to be broadcast to all chips from the SRAM of the chip that holds the particular layer you care about. That broadcast is extremely fast, but external memory is certainly fast enough to do the same job if you fetch the layer in advance.

          So the way to get the best ROI on such a system would be to pack the biggest batch size you can (so many users' queries) and process them all in parallel, streaming the weights as needed. The more your SRAM is occupied by batch activations and not parameters, the better the compute density and thus $/flops.

          You can check the Cerebras docs to see how weight streaming works [2]. From the start, one of the selling points of Cerebras has been the ability to scale memory independently of compute, and they have developed an entire system specifically for streaming weights from that decoupled memory. Their docs keep things fairly simple by assuming you can only fit one layer in SRAM, and thus they fetch things sequentially, but if you can store at least 2 layers in those 44GB of SRAM then you can simply fetch layer l+1 when layer l starts to compute, completely masking the latency cost. It's possible they already mask the latency even within a single layer by streaming tiles for the matmul, though that is unclear from their docs. They mention it in passing in [3] section 6.3.

          All of their docs are for training, since for their inference play they seem to have pivoted to selling API access rather than chips, but inference is really the same thing, just without the backprop (especially in their case, where they aren't doing pipeline parallelism, in which you could claim doing fwd+backprop gives you better compute density). At the end of the day, whether you are doing training or inference, all you care about is that your cores have the data they need in their registers at the moment they are free to compute, so streaming to SRAM works the same way in both cases.

          Ultimately I can't tell you how much it costs to run Qwen-3. You can certainly do it on a single chip + weight streaming, but their specs are just too light on the exact FLOPs and bandwidth to know what the memory movement cost would be in this case (if any), and we don't even know the price of a single chip (everyone is saying 3M$, though, regardless of that comment on the other thread). But I can tell you that your math of doing `model_size/sram_per_chip * chip_cost` just isn't the right way to think about this, and so the 100M$ figure doesn't make sense.

          [1]: https://arxiv.org/html/2503.11698v1#S3.

          [2]: https://training-api.cerebras.ai/en/2.1.0/wsc/cerebras-basic....

          [3]: https://8968533.fs1.hubspotusercontent-na2.net/hubfs/8968533...

      • MichaelZuo 2 days ago

        So then what explains such a low implied valuation at series G?

        There’s no way that could be the case if the technology was competitive.

        • Voloskaya 2 days ago

          I’m not saying it’s particularly competitive; I’m saying that claiming it costs 100M$ to run Qwen is complete lunacy. There is a gulf between those 2 things.

          And beyond pure performance competitiveness, there are many things that make it hard for Cerebras to actually be competitive: can they ship enough chips to meet the needs of large clusters? What about the software stack and the lack of great support compared to Nvidia? And the lack of ML engineers who know how to use them, when everyone knows how to use CUDA and there are many things developed on top of it by the community (e.g. Triton).

          Just look at the valuation difference between AMD and Nvidia, when AMD is already very competitive. But being 99% of the way there is still not enough for customers that are going to pay 5B$ for their clusters.

    • sinuhe69 a day ago

      No, only Groq uses the all-SRAM approach. Cerebras only uses SRAM for local context while the weights are still loaded from RAM (or HBM). With 48 Kbytes per node, the whole wafer has only 44 GB of SRAM (roughly 900k cores x 48 KB), much less than what is needed to load the whole network.

      • aurareturn a day ago

        In order to achieve the speeds they're claiming, they are spreading the entire model across multiple chips, entirely in SRAM.

        • sinuhe69 a day ago

          Why are you making claims based on false facts instead of checking their documents?

    • addaon 2 days ago

      > SRAM has virtually stopped scaling in new nodes.

      But there are several 1T memories that are still scaling, more or less — eDRAM, MRAM, etc. Is there anything preventing their general architecture from moving to a 1T technology once the density advantages outweigh the need for pipelining to hide access time?

      • aurareturn 2 days ago

        I’m pretty sure that HBM4 can be 20-30x faster in terms of bandwidth than eDRAM. That makes eDRAM not an option for AI workloads since bandwidth is the main bottleneck.

        • addaon 2 days ago

          HBM4 is limited to a few thousand bits of width per stack. eDRAM bandwidth scales with chip area. A full-wafer chip could have astonishing bandwidth.

    • arisAlexis a day ago

      But apparently they serve all the models super fast and in production so you must be wrong somewhere

  • oceanplexian 2 days ago

    I’ve been using them as a customer and have been fairly impressed. The thing is, a lot of inference providers might seem better on paper but it turns out they’re not.

    Recently there was a fiasco I saw posted on r/localllama where many of the OpenRouter providers were degraded on benchmarks compared to base models, implying they are serving up quantized models to save costs, but lying to customers about it. Unless you’re actually auditing the tokens you’re purchasing, you may not be getting what you’re paying for, even if the T/s and $/token seem better.

    • dlojudice 2 days ago

      OpenRouter should be responsible for this quality control, right? It seems to me to be the right player in the chain with the duties and scale to do so.

    • teruakohatu a day ago

      > many of the OpenRouter providers were degraded on benchmarks compared to base models, implying they are serving up quantized models to save costs,

      Do you have more information on this? This seems brand-destroying for both OpenRouter and the model providers.

  • throw123890423 2 days ago

    > I will say, their pricing and deployment strategy is a bit murky and unclear. Paying $1500-$10,000 per month plus usage costs? I'm assuming that it has to do with chasing and optimizing for higher value contracts and deeper-pocketed customers, hence the minimum monthly spend that they require.

    Yeah wait, why rent chips instead of sell them? Why wouldn't customers want to invest money in competition for cheaper inference hardware? It's not like Nvidia has a blacklist of companies that have bought chips from competitors, or anything. Now that would be crazy! That sure would make this market tough to compete in, wouldn't it. I'm so glad Nvidia is definitely not pressuring companies to not buy from competitors or anything.

    • aurareturn 2 days ago

      Their chips weren’t selling because:

      1. They’re useless for training in 2025. They were designed for training prior to the LLM explosion. They’re not practical for training anymore because they rely on SRAM, which is not scalable.

      2. No one is going to spend the resources to optimize models to run on their SDK and hardware. Open source inference engines don’t optimize for Cerebras hardware.

      Given the above two reasons, it makes a lot of sense that no one is investing in their hardware and they have switched to a cloud model selling speed as the differentiator.

      It’s not always “Nvidia bad”.

  • OkayPhysicist 2 days ago

    The UAE has sunk a lot of money into them, and I suspect it's not purely a financial move. If that's the case, an acquisition might be more complicated than it would seem at first glance.

  • nsteel 2 days ago

    > utilizing DRAM to mind-boggling scale

    I thought it was the SRAM scaling that was impressive, no?

    • maz1b 2 days ago

      oops, typo! S and D are next to each other on the keyboard. thanks for pointing this out

  • liuliu 2 days ago

    They have been an acquisition target since 2017 (per the OpenAI internal emails). So the lack of an acquisition is not for lack of interest. It makes you wonder what happened in those due-diligence processes.

Shakahs 2 days ago

Sonnet/Claude Code may technically be "smarter", but Qwen3-Coder on Cerebras is often more productive for me because it's just so incredibly fast. Even if it takes more LLM calls to complete a task, those calls are all happening in a fraction of the time.

  • nerpderp82 2 days ago

    We must have very different workflows, I am curious about yours. What tools are you using and how are you guiding Qwen3-Coder? When I am using Claude Code, it often works for 10+ minutes at a time, so I am not aware of inference speed.

    • solarkraft 2 days ago

      You must write very elaborate prompts for 10 minutes to be worth the wait. What permissions are you giving it and how much do you care about the generated code? How much time did you spend on initial setup?

      I've found that the best way for me to do LLM-assisted coding at this point in time is in a somewhat tight feedback loop. I find myself wanting to refine the code and architectural approaches a fair amount as I see them coming in, and latency matters a lot to me here.

    • CaptainOfCoit 2 days ago

      > When I am using Claude Code, it often works for 10+ minutes at a time, so I am not aware of inference speed.

      Indirectly, it sounds like you're aware of the inference speed? Imagine if it took 2 minutes instead of 10 minutes; that's what the parent means.

      • yodon 2 days ago

        2 minutes is the worst delay. With 10 minutes, I can and do context switch to something else and use the time productively. With 2 min, I wait and get frustrated and bored.

        • dataangel 2 days ago

          Context switching makes you less productive compared to completely finishing one task before moving to the next, though. In the limit, an LLM that responds instantly is still better.

  • sdesol 2 days ago

    > Sonnet/Claude Code may technically be "smarter", but Qwen3-Coder on Cerebras is often more productive for me because it's just so incredibly fast.

    Saying "technically" is really underselling the difference in intelligence in my opinion. Claude and Gemini are much, much smarter and I trust them to produce better code, but you honestly can't deny the excellent value that Qwen-3, the inference speed and $50/month for 25M tokens/per day brings to the table.

    Since I paid for the Cerebras Pro plan, I've decided to force myself to use it as much as possible for the duration of the month for developing my chat app (https://github.com/gitsense/chat), and here are some of my thoughts so far:

    - Qwen3 Coder is a lot dumber when it comes to prompting; Gemini and Claude are much better at reading between the lines. However, since the speed is so good, I often don't care, as I can go back to the message, make some simple clarifications, and try again.

    - The max context window size of 128k for Qwen 3 Coder 480B on their platform can be a serious issue if you need a lot of documentation or code in context.

    - I've never come close to the 25M tokens per day limit for their Pro Plan. The max I am using is 5M/day.

    - The inference speed + a capable model like Qwen 3 will open up use cases most people might not have thought of before.

    I will probably continue to pay for the $50 plan for these use cases:

    1. Applying LLM generated patches

    Qwen 3 Coder is very much capable of applying patches generated by Sonnet and Gemini. It is slower than what https://www.morphllm.com/ provides, but it is definitely fast enough for most people not to care, and the cost savings can be quite significant depending on the work (a rough sketch of this follows the list).

    2. Building context

    Since it is so fast and because the 25M token limit per day is such a high limit for me, I am finding myself loading more files into context and just asking Qwen to identify files that I will need and/or summarize things so I can feed it into Sonnet or Gemini to save me significant money.

    3. AI Assistant

    Due to its blazing speed, you can analyze a lot of data fast for deterministic searches, and because it can review results so quickly, you can do multiple search-and-review loops without feeling like you are waiting forever.
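
    For use case 1, this is the rough shape of it as a sketch (the base URL, model id and prompt here are illustrative placeholders rather than exact values):

        from openai import OpenAI

        # Sketch of the "fast apply" pattern: a fast model merges an edit
        # produced by a smarter model. Base URL and model id are illustrative.
        client = OpenAI(base_url="https://api.cerebras.ai/v1", api_key="YOUR_KEY")

        def apply_patch(original_file: str, edit_snippet: str) -> str:
            resp = client.chat.completions.create(
                model="qwen-3-coder-480b",
                temperature=0,
                messages=[{
                    "role": "user",
                    "content": "Merge the edit into the file and return only the full updated file.\n\n"
                               f"<file>\n{original_file}\n</file>\n\n<edit>\n{edit_snippet}\n</edit>",
                }],
            )
            return resp.choices[0].message.content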

    Given what I've experienced so far, I don't think Cerebras can be a serious platform for coding if Qwen 3 Coder is the only available model. Having said that, given the inference speed and Qwen being more than capable, I can see Cerebras becoming a massive cost savings option for many companies and developers, which is where I think they might win a lot of enterprise contracts.

fcpguru 2 days ago

Their core product is the Wafer Scale Engine (WSE-3) — the largest single chip ever made for AI, designed to train and run models much faster and more efficiently than traditional GPUs.

Just tried https://cloud.cerebras.ai wow is it fast!

mythz 2 days ago

Running Qwen3 coder at speed is great, but would also prefer to have access to other leading OSS models like GLM 4.6, Kimi K2 and DeepSeek v3.2 before considering switching subs.

Groq also runs OSS models at speed which is my preferred way to access Kimi K2 on their free quotas.

OGEnthusiast 2 days ago

I'm surprised how under-the-radar Cerebras is. Being able to get near-instantaneous responses from Qwen3 and gpt-oss is pretty incredible.

  • data-ottawa 2 days ago

    I wish I could invest in them. Agree they're under the radar.

rbitar 2 days ago

Congrats to the team. I'm surprised the industry hasn't been more impressed with their benchmarks on token throughput. We're using the Qwen 3 Coder 480B model and seeing ~2000 tokens/second, which is easily 10-20x faster than most LLMs on the market. Even some of the fastest models still only achieve 100-150 tokens/second (see OpenRouter stats by provider). I do feel that after around 300-400 tokens/second the gains in speed become more incremental, so if there were another model served at 300+ tokens/second, I would consider that a very competitive alternative.

JLO64 2 days ago

My experience with Cerebras is pretty mixed. On the one hand for simple and basic requests, it truly is mind blowing how fast it is. That said, I’ve had nothing but issues and empty responses whenever I try to use them for coding tasks (Opencode via Openrouter, GPT-OSS). It’s gotten to a point where I’ve disabled them as a provider on Openrouter.

  • divmain 2 days ago

    I experienced the same, but I think it is a limitation of OpenRouter. When I hit Cerebras' OpenAI-compatible endpoint directly, it works flawlessly.

darkbatman 19 hours ago

It's so useful to use the Cerebras API for other tasks too, not just coding with Qwen Coder, but even simpler things like, say, analysis with gpt-oss-120b or Llama.

Just plug it into a normal chat interface like Jan or Cherry Studio and it's incredibly fast.
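
Anything that speaks the OpenAI-style chat completions API can be pointed at it; a rough sketch (the base URL and model id here are assumptions, check the docs):

    from openai import OpenAI

    # Point any OpenAI-compatible client (or chat UI) at the Cerebras endpoint.
    # Base URL and model id are assumptions, not verified values.
    client = OpenAI(base_url="https://api.cerebras.ai/v1", api_key="YOUR_KEY")

    stream = client.chat.completions.create(
        model="gpt-oss-120b",
        messages=[{"role": "user", "content": "Summarize the following text: ..."}],
        stream=True,
    )
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            print(delta, end="", flush=True)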

ramshanker 2 days ago

I am not able to guess what is preventing Cerebras from replacing a few of the cores in the wafer-scale package with HBM memory. It seems the only constraint with their WSE-3 is memory capacity. Considering the size of NVDA chips, only a small subset of the wafer area should easily exceed the memory size of contemporary models.

  • reliabilityguy 2 days ago

    DRAMs (the core of HBM memories) use different technology nodes than logic and SRAM. Also, stacking that many DRAMs on the wafer will complicate the packaging quite a bit, I think.

  • xadhominemx 2 days ago

    I don’t think so. The reason why Cerebras is so fast for inference is that the KV cache sits in the SRAM.

  • aurareturn 2 days ago

    If you replace some cores with HBM on package, you basically get the traditional GPU + HBM model.

arjie 2 days ago

I just tried out Qwen-3-480B-Coder on them yesterday and to be honest it's not good enough. It's very fast but has trouble on lots of tasks that Claude Code just solves. Perhaps part of it is that I'm using Charm's Crush instead of Claude Code.

  • arisAlexis a day ago

    They make chips. Potentially you could also test Claude on them in the future.

redwood 2 days ago

Would be interesting if IBM were to acquire. Seems like the big iron approach to GPUs

lvl155 2 days ago

Last I tried, their service was spotty and unreliable. I would wait maybe a year or so to retry.

fcpguru 2 days ago

Does Guillaume Verdon from https://www.extropic.ai/ have thoughts on Cerebras?

(or other people that read the litepaper https://www.extropic.ai/future)

  • landl0rd 2 days ago

    Beff has shipped zero chips and shitposted a lot. It is a cool idea but he has made tons of promises and it's starting to seem more like vaporware. Don't get me wrong, I hope it works, but doubt it will. Less podcasts more building please.

    He reads to me like someone who markets better than he does things. I am disinclined to take him as an authority in this space.

    How do you believe this is related to Cerebras?

    • fcpguru 2 days ago

      that's the first thing I thought of when I read "cerebras faster chip for ai". Beff "sold me" on it a year ago. I guess I drank the kool-aid. Thinking about un-drinking now...

dgfitz 2 days ago

Valued at 8.1 billion dollars.

https://www.cerebras.ai/pricing

$50/month for one person for code (daily token limit), or pay per token, or $1500/month for small teams, or an enterprise agreement (contact for pricing).

Seems high.

  • arisAlexis a day ago

    The valuation is for having the best inference chips. You know Nvidia's cloud pricing is irrelevant to its valuation, and the same goes here.

    • dgfitz a day ago

      So their angle is an exit?

rvz 2 days ago

Sooner or later, lots of competitors including Cerebras are going to take apart Nvidia's data center market share and it will cause many AI model firms to question the unnecessary spend and hoarding of GPUs.

OpenAI is still developing their own chips with Broadcom, but they are not operational yet. So for now, they're buying GPUs from Nvidia to build up their own revenue (to later spend on their own chips).

By 2030, eventually many companies will be looking for alternatives to Nvidia like Cerebras or Lightmatter for both training and inference use-cases.

For example [0] Meta just acquired a chip startup for this exact reason - "An alternative to training AI systems" and "to cut infrastructure costs linked to its spending on advanced AI tools.".

[0] https://www.reuters.com/business/meta-buy-chip-startup-rivos...

  • onlyrealcuzzo 2 days ago

    There's so much optimization to be made when developing the model and the hardware it runs on that most of the big players are likely to run a non-trivial percentage of their workloads on proprietary chips eventually.

    If that's 5 years in the future, it looks bad for Nvidia; if it's >10 years in the future, it doesn't affect Nvidia's current stock price very much.

allisdust 2 days ago

If the idiots at AMZN had any brains left, they would acquire this and make it the center of their inference offerings. But considering how lackluster their performance and strategy as a company have been of late, I doubt that.

Disappointed quite a bit with this fund raise. They were expected to IPO this year and give us poor retail investors a chance at investing in them.

  • reliabilityguy 2 days ago

    Amazon has their own chips for inference and training: Trainium1/2.

    • allisdust 2 days ago

      Nothing (maybe except Groq?) comes even close to Cerebras in inference speed. I seriously don't get why these guys aren't more popular. The difference in using them as an inference provider vs anything else, for any use case, is like night and day. I hope more inference providers focus on speed. And this is where AMZN would benefit a lot, since their entire cloud model is to have something people would want anyway and mark it up by 3x. God forbid AVGO acquires this.

      • xadhominemx 2 days ago

        Cerebras hasn’t made any technical breakthroughs, they are just putting everything in SRAM. It’s a brute force approach to get very high inference throughput but comes at extremely high cost per token per second and is not useful for batched inferencing. Groq uses the same approach.

        Memory hierarchy management across HBM/DDR/Flash is much more difficult but necessary to achieve practical inference economics.

        • twothreeone 2 days ago

          I don't think you realize the history of wafer-scale integration and what it means for the chip industry [1]. The approach was famously taken by Gene Amdahl's Trilogy Systems in the '80s, but it failed dramatically, leading (among other things) to the deployment of "accelerator cards" in the form of... the NVIDIA GeForce 256, the first GPU, in 1999. It's not like NVIDIA hasn't been trying to integrate multiple dies in the same package, but doing that successfully has been a huge technological hurdle so far.

          [1] https://ieeexplore.ieee.org/abstract/document/9623424

          • averne_ 2 days ago

            The main reason a wafer scale chip works there is that their cores are extremely tiny, so the silicon area that gets fused off in the event of a defect is much lower than on NVIDIA chips, where a whole SM can get disabled. AFAIU this approach is not easily applicable to complex core designs.

          • xadhominemx 2 days ago

            I understand that topic well. They stitched top metal layers across the reticle - not that challenging, and the foundational IP is not their own.

            Everyone else went the CoWoS direction, which enables heterogeneous integration and much more cost effective inference.

      • reliabilityguy a day ago

        Optimizing for one metric only, e.g., speed, leads to suboptimal outcomes on others, e.g., cost, scalability, etc.

        I think that while it's fast, Cerebras is probably not very economical in fleets at scale.

  • onlyrealcuzzo 2 days ago

    It would be hard to beat designing their own in-house offering that is 50% as good at 20% of the cost.

    That's the problem.

    Unless the majority of the value is on the other end of the curve, it's a tough sell.