Learning how to use LLMs in a coding workflow is trivial. There is no learning curve. You can safely ignore them if they don’t fit your workflows at the moment.
I have never heard anybody successfully using LLMs say this before. Most of what I've learned from talking to people about their workflows is counterintuitive and subtle.
It's a really weird way to open up an article concluding that LLMs make one a worse programmer: "I definitely know how to use this tool optimally, and I conclude the tool sucks". Ok then. Also: the piano is a terrible, awful instrument; what a racket it makes.
Fully agree. It takes months to learn how to use LLMs properly. There is an initial honeymoon where the LLMs blow your mind. Then you get some disappointments. But then you start realizing that there are some things that LLMs are good at and some that they are bad at. You start developing a feel for what you can expect them to do. And more importantly, you get into the habit of splitting problems into smaller problems that the LLMs are more likely to solve. You keep learning how to best describe the problem, and you keep adjusting your prompts. It takes time.
It really doesn't take that long. Maybe if you're super junior and have never coded before? In that case I'm glad it's helping you get into the field. Also, if it's taking you months, whole new models will be released in the meantime and you'll need to learn their quirks all over again.
No, it's a practice. You're not necessarily building technical knowledge; rather, you're building up an intuition. It's for sure not like learning a programming language. It's more like feeling your way along and figuring out how to inhabit a dwelling in the dark. We would just have to agree to disagree on this. I feel exactly as the parent commenter felt. But it's not easy to explain (or to understand from someone's explanation).
I'm glad you feel like you've nailed it. I've been using models to help me code for over two years, and I still feel like I have no idea what I'm doing.
I feel like every time I have a prompt or use a new tool, I'm experimenting with how to make fire for the first time. It's not to say that I'm bad at it. I'm probably better than most people. But knowing how to use this tool is by far the largest challenge, in my opinion.
Love this, and it's so true. A lot of people don't get this, because it's so nuanced. It's not something that's slowing you down. It's not learning a technical skill. Rather, it's building an intuition.
I find it funny when people ask me if it's true that they can build an app using an LLM without knowing how to code. I think of this... that it took me months before I started feeling like I "got it" with fitting LLMs into my coding process. So, not only do you need to learn how to code, but getting to the point that the LLM feels like a natural extension of you has its own timeline on top.
Sure. But it happens that I have 20 years of experience, and I know quite well how to code. Everything the LLM does for me I can do myself. But the LLM does it 100 times faster than me. Most days nowadays I push thousands of lines of code. And it's not garbage code; the LLMs write quite high-quality code. Of course, I still have to go through the code and make sure it all makes sense. So I am still the bottleneck. At some point I will probably grow to trust the LLM, but I'm not quite there yet.
You are a bit quick to jump to conclusions. With LLMs, test-driven development becomes both a necessity and a pleasure. The actual functional code I push in a day is probably in the low hundreds of LOC. But I push a lot of tests too. And sure, lots of that is boilerplate. But the tests run, pass, and if anything have better coverage than when I was writing all the code myself.
Wait a minute, you didn't just claim that we have reached AGI, right? I mean, that's what it would mean to delegate work to junior engineers, right? You're delegating work to human level intelligence. That's not what we have with LLMs.
Yes and no. With junior developers you need to educate them. You need to do that with LLMs too. Maybe you need to break down the problem into smaller chunks, but you get the hang of this after a while. But once the LLM understands the task, you get a few hundred lines of code in a matter of minutes. With a junior developer you are lucky if they come back the same day. The iteration speed with AI is simply in a different league.
Edit: it is Sunday. As I am relaxing, and spending time writing answers on HN, I keep a lazy eye on the progress of an LLM at work too. I got stuff done that would have taken me a few days of work by just clicking a "Continue" button now and then.
> Learning how to use LLMs in a coding workflow is trivial. There is no learning curve. You can safely ignore them if they don’t fit your workflows at the moment.
That's a wild statement. I'm now extremely productive with LLMs in my core codebases, but it took a lot of practice to get it right and repeatable. There's a lot of little contextual details you need to learn how to control so the LLM makes the right choices.
Whenever I start working in a new code base, it takes a non-trivial amount of time to ramp back up to full LLM productivity.
Is the non-trivial amount of time significantly less than you trying to ramp up yourself?
I am still hesitant to use AI to solve problems for me. Either it hallucinates and misleads me, or it does a great job and I worry that my ability to reason through complex problems with rigor will degenerate. Once my ability to solve complex problems has degenerated, my patience diminished, and my attention span destroyed, I will be utterly reliant on a service that other entities own just to function in my daily life. Genuine question - are people comfortable with this?
The ramp-up time with AI is absolutely lower than trying to ramp up without AI.
My comment is specifically in contrast to working in a codebase where I'm at "max AI productivity". In a new codebase, it just takes a bit of time to work out kinks and figure out tendencies of the LLMs in those codebases. It's not that I'm slower than I'd be without AI, I'm just not at my "usual" AI-driven productivity levels.
>Genuine question - are people comfortable with this?
It's a question of degree, but in general, yeah. I'm totally comfortable being reliant on other entities to solve complex problems for me.
That's how economies work [1]. I neither have nor want to acquire the lifetime of experience I would need to learn how to produce the tea leaves in my tea, or the clean potable water in it, or the mug they are contained within, or the concrete walls 50 meters up from ground level I am surrounded by, or so on and so forth. I can live a better life by outsourcing the need for this specialized knowledge to other people, and trade with them in exchange for my own increasingly-specialized knowledge. Even if I had 100 lifetimes to spend, and not the 1 I actually have, I would probably want to put most of them to things that, you know, aren't already solved-enough problems.
Everyone doing anything interesting works like this, with vanishingly few exceptions. My dad doesn't need to know how to do algebra to get his taxes done, he just has an accountant. And his accountant doesn't need to know how to rewire his turn of the century New England home. And if you look at the exceptions, like that really cute 'self sufficient' family who uploads weekly YouTube videos called "Our Homestead Life"... It often turns out that the revenue from that YouTube stream is nontrivial to keeping the whole operation running. In other words, even if they genuinely no longer go to Costco, it's kind of a gyp.
> My dad doesn't need to know how to do algebra to get his taxes done, he just has an accountant.
This is not quite the same thing. The AI is not perfect; it frequently makes mistakes or writes suboptimal code. As a software engineer, you are responsible for finding and fixing those. This means you have to review and fully understand everything that the AI has written.
Quite a different situation than your dad and his accountant.
I see your point. I don't think it's different in kind, just degree. My thought process: First, is my dad's accountant infallible?
If not, then they must themselves make mistakes or do things suboptimally sometimes. Whose responsibility is that - my dad, or my dad's accountant?
If it is my dad, does that then mean my dad has an obligation to review and fully understand everything the accountant has written?
And do we have to generalize that responsibility to everything and everyone my dad has to hand off work to in order to get something done? Clearly not, that's absurd. So where do we draw the line? You draw it in the same place I do for right now, but I don't see why we expect that line to be static.
But there’s no way one is giving as thorough a review as if one had written code to solve the problem themselves. Writing is understanding. You’re trading thoroughness and integrity for chance.
Writing code should never have been a bottleneck. And since it wasn’t, any massive gains are due to being ok with trusting the AI.
I would honestly say, it's more like autocomplete on steroids, like you know what you want so you just don't wanna type it out (e.g. scripts and such)
And so if you don't use it then someone else will... But as for the models, we already have some pretty good open source ones like Qwen and it'll only get better from here so I'm not sure why the last part would be a dealbreaker
> That's a wild statement. I'm now extremely productive with LLMs in my core codebases, but it took a lot of practice to get it right and repeatable. There's a lot of little contextual details you need to learn how to control so the LLM makes the right choices.
> Whenever I start working in a new code base, it takes a non-trivial amount of time to ramp back up to full LLM productivity.
Do you find that these details translate between models? Sounds like it doesn't translate across codebases for you?
I have mostly moved away from this sort of fine-tuning approach because of experience a while ago with OpenAI's ChatGPT 3.5 and 4. Extra work on my end that was necessary with the older model wasn't needed with the new one, and sometimes it counterintuitively caused worse performance by pointing the model at the way I'd do it vs. the way it might have the best luck with. ESPECIALLY for the sycophantic models, which will heavily index on "if you suggested that this thing might be related, I'll figure out some way to make sure it is!"
So more recently I generally stick to the "we'll handle a lot of the prompt nitty gritty for you" IDE or CLI agent stuff, but I find they still fall apart with large, complex codebases, and also that the tricks don't translate across codebases.
Yes and no. The broader business context translates well, but each model has its own blind spots and hyperfocuses that you need to massage out.
* Business context - these are things like code quality/robustness, expected spec coverage, expected performance needs, domain specific knowledge. These generally translate well between models, but can vary between code bases. For example, a core monolith is going to have higher standards than a one-off auxiliary service.
* Model focuses - Different models have different tendencies when searching a code base and building up their context. These are specific to each code base, but relatively obvious when they happen. For example, in one code base I work in, one model always seems to pick up our legacy notification system while another model happens to find our new one. It's not really a skill issue. It's just luck of the draw how files are named and how each of them search. They each just find a "valid" notification pattern in a different order.
LLMs are massively helpful for orienting to a new codebase, but it just takes some time to work out those little kinks.
It is nothing at all like UB in a compiler. UB creates invisible bugs that tend to be discovered only after things have shipped. This is code generation. You can just read the code to see what it does, which is what most professionals using LLMs do.
With the volume of code people are generating, no you really can't just read it all. pg recently posted [1] that someone he knows is generating 10kloc/day now. There's no way people are using AI to generate that volume of code and reading it. How many invisible bugs are lurking in that code base, waiting to be found some time in the future after the code has shipped?
I read every line I generate and usually adjust things; I'm uncomfortable merging a PR I haven't put my fingerprints on somehow. From the conversations I have with other practitioners, I think this is pretty normal. So, no, I reject your premise.
My premise didn't have anything to do with you, so what you do isn't a basis for rejecting it. No matter what you or your small group of peers do, AI is generating code at a volume that all the developers in the world combined couldn't read if they dedicated 24hrs/day.
Getting 80% of the benefit of LLMs is trivial. You can ask it for some functions or to write a suite of unit tests and you’re done.
The last 20%, while possible to attain, is ultimately not worth it for the amount of time you spend in context hells. You can just do it yourself faster.
> The last 20%, while possible to attain, is ultimately not worth it for the amount of time you spend in context hells. You can just do it yourself faster.
I'm arguing that there's a skill that has to be learned in order to break through this. As you start in a new code base, you should be quick to jump in when you hit that 20%. But, as you spend more time in it, you learn how to avoid the same "context hell" issues and move that number down to 15%, 10%, 5% of the time.
You're still going to need to jump in, but when you can learn to get the LLM to write 95% of the code for you, that's incredibly powerful.
> I'm arguing that there's a skill that has to be learned in order to break through this. As you start in a new code base, you should be quick to jump in when you hit that 20%. But, as you spend more time in it, you learn how to avoid the same "context hell" issues and move that number down to 15%, 10%, 5% of the time.
The problem is that you're learning a skill that will need refinement each time you switch to a new model. You will redo some of this learning on each new model you use.
This actually might not be a problem anyway, as all the models seem to be converging asymptotically towards "programming".
The better they do on the programming benchmarks, the further away from AGI they get.
It’s not incredibly powerful, it’s incrementally powerful. Getting the first 80% via LLM is already the incredible power. A sufficiently skilled developer should be able to handle the rest with ease. It is not worth doing anything unnatural in an effort to chase down the last 20%; you are just wasting time and atrophying skills. If you can get to the full 95% in some one-shot prompts, great. But don’t go chasing waterfalls.
No, it actually has an exponential growth type of effect on productivity to be able to push it to the boundary more.
I’m making this a bit contrived, but I’m simplifying it to demonstrate the underlying point.
When an LLM is 80% effective, I’m limited to doing 5 things in parallel, since I still need to jump in 20% of the time.
When an LLM is 90% effective, I can do 10 things at once. When it’s 95%, 20 things. 99%, 100 things.
Now, obviously I can’t actually juggle 10 or 20 things at once. However, the point is there are actually massive productivity gains to be had when you can reduce your involvement in a task from 20% to, even 10%. You’re effectively 2x as productive.
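Back-of-the-envelope, the math I'm gesturing at looks like this (a toy Python sketch, not a measurement; the numbers are purely illustrative):

```python
# Toy model: if a task needs my hands-on attention for a fraction `f` of its
# wall-clock time, I can interleave roughly 1/f such tasks before my own
# attention becomes the bottleneck.
for involvement in (0.20, 0.10, 0.05, 0.01):
    parallel_capacity = 1 / involvement
    print(f"{involvement:.0%} involvement -> ~{parallel_capacity:.0f} tasks in parallel")
```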
I run through a really extensive planning step that generates technical architecture and iterative tasks. I then send an LLM along to implement each step, debugging, iterating, and verifying its work. It's not uncommon for it to take a non-trivial amount of time to complete a step (5+ minutes).
Right now, I still need to intervene enough that I'm not actually doing a second coding project in parallel. I tend to focus on communication, documentation, and other artifacts that support the code I'm writing.
However, I am very close to hitting that point and occasionally do on easier tasks. There's a _very_ real tipping point in productivity when you have confidence that an LLM can accomplish a certain task without your intervention. You can start to do things legitimately in parallel when you're only really reviewing outputs and doing minor tweaks.
I agree with your assessment about this statement. I actually had to reread it a few times to actually understand it.
He is actually recommending Copilot for price/performance reasons and his closing statement is "Don’t fall for the hype, but also, they are genuinely powerful tools sometimes."
So, it just seems like he never really gave a try at how to engineer better prompts that these more advanced models can use.
The OP's point seems to be: it's very quick for LLMs to be a net benefit to your skills, if they are a benefit at all. That is, he's only speaking of the very beginning of the learning curve.
The first two points directly contradict each other, too. Learning a tool should have the outcome that one is productive with it. If getting to "productive" is non-trivial, then learning the tool is non-trivial.
> I have never heard anybody successfully using LLMs say this before. Most of what I've learned from talking to people about their workflows is counterintuitive and subtle.
Because for all our posturing about being skeptical and data driven we all believe in magic.
Those "counterintuitive non-trivial workflows"? They work about as well as just prompting "implement X" with no rules, agents.md, careful lists etc.
Because 1) literally no one actually measures whether the magical incantations work and 2) it's impossible to make such measurements due to non-determinism
The problem with your argument here is that you're effectively saying that developers (like myself) who put effort into figuring out good workflows for coding with LLMs are deceiving themselves, and are effectively wasting their time.
Either I've wasted significant chunks of the past ~3 years of my life or you're missing something here. Up to you to decide which you believe.
I agree that it's hard to take solid measurements due to non-determinism. The same goes for managing people, and yet somehow many good engineering managers can judge if their team is performing well and figure out what levers they can pull to help them perform better.
That's not a problem, that is the argument. People are bad at measuring their own productivity. Just because you feel more productive with an LLM does not mean you are. We need more studies and less anecdata
I'm afraid all you're going to get from me is anecdata, but I find a lot of it very compelling.
I talk to extremely experienced programmers whose opinions I have valued for many years before the current LLM boom who are now flying with LLMs - I trust their aggregate judgement.
Meanwhile my own https://tools.simonwillison.net/colophon collection has grown to over 120 in just a year and a half, most of which I wouldn't have built at all - and that's a relatively small portion of what I've been getting done with LLMs elsewhere.
Hard to measure productivity on a "wouldn't exist" to "does exist" scale.
It might also be the largest collection of published chat transcripts for this kind of usage from a single person - though that's not hard since most people don't publish their prompts.
Building little things like this is a really effective way of gaining experience using prompts to get useful code results out of LLMs.
Unlike my tools.simonwillison.net stuff, the vast majority of those products are covered by automated tests and usually have comprehensive documentation too.
There have been so many "advances" in software development in the last decades - powerful type systems, null safety, sane error handling, Erlang-style fault tolerance, property testing, model checking, etc. - and yet people continue to write garbage code in unsafe languages with underpowered IDEs.
I think many in the industry have absolutely no clue what they're doing and are bad at evaluating productivity, often prioritising short-term delivery over long-term maintenance.
LLMs can absolutely be useful but I'm very concerned that some people just use them to churn out code instead of thinking more carefully about what and how to build things. I wish we had at least the same amount of discussions about those things I mentioned above as we have about whether Opus, Sonnet, GPT5 or Gemini is the best model.
> I wish we had at least the same amount of discussions about those things I mentioned above as we have about whether Opus, Sonnet, GPT5 or Gemini is the best model.
I mean we do. I think programmers are more interested in long term maintainable software than its users are. Generally that makes sense, a user doesn't really care how much effort it takes to add features or fix bugs, these are things that programmers care about. Moreover the cost of mistakes of most software is so low that most people don't seem interested in paying extra for more reliable software. The few areas of software that require high reliability are the ones regulated or are sold by companies that offer SLAs or other such reliability agreements.
My observation over the years is that maintainability and reliability are much more important to programmers who comment in online forums than they are to users. It usually comes with the pride of work that programmers have but my observation is that this has little market demand.
Users definitely care about things like reliability when they're using actually important software (which probably excludes a lot of startup junk). They may not be able to point to what causes issues, but they obviously do complain when things are buggy as hell.
I'm not the OP and I'm not saying you are wrong, but I am going to point out that the data doesn't necessarily back up significant productivity improvements with LLMs.
In this video (https://www.youtube.com/watch?v=EO3_qN_Ynsk) they present a slide by the company DX that surveyed 38,880 developers across 184 organizations, and found the surveyed developers claiming a 4 hour average time savings per developer per week. So all of these LLM workflows are only making the average developer 10% more productive in a given work week, with a bunch of developers getting less. Few developers are attaining productivity higher than that.
In this video by Stanford researchers actively researching productivity using GitHub commit data for private and public repositories (https://www.youtube.com/watch?v=tbDDYKRFjhk) they have a few very important data points in there:
1. They've found zero correlation between how productive respondents claim to be and how productive they actually measure as, meaning people are poor judges of their own productivity. This would refute the claim in my previous point, but only if you assume people are on average wildly more productive than they claim.
2. They have been able to measure an actual increase in rework and refactoring commits in the repositories as AI tools come into greater use in those organizations. So even with being able to ship things faster, they are observing an increased number of pull requests that fix those previous pushes.
3. They have measured that greenfield low complexity systems have pretty good measurements for productivity gains, but once you get more towards higher complexity systems or brownfield systems they start to measure much lower productivity gains, and even negative productivity with AI tools.
This goes hand in hand with this research paper: https://metr.org/blog/2025-07-10-early-2025-ai-experienced-o... which had experienced devs on significant long-term projects lose productivity when using AI tools, even as they were convinced the AI tools were making them more productive.
Yes, all of these studies have their flaws and nitpicks we can go over that I'm not interested in rehashing. However, there's a lot more data and studies that show AI having very marginal productivity boost compared to what people claim than vice versa. I'm legitimately interested in other studies that can show significant productivity gains in brownfield projects.
> who put effort into figuring out good workflows for coding with LLMs are deceiving themselves, and are effectively wasting their time.
It's quite possible you do. Do you have any hard data justifying the claims of "this works better", or is it just a soft fuzzy feeling?
> The same goes for managing people, and yet somehow many good engineering managers can judge if their team is performing well
It's actually really easy to judge if a team is performing well.
What is hard is finding what actually makes the team perform well. And that is just as much magic as "if you just write the correct prompt everything will just work"
So far I've found that the people who are hating on AI are stuck maintaining highly coupled code that they've invested a significant amount of mental energy internalizing. AI is bad on that type of code, and since they've invested so much energy in understanding the code, it ends up taking longer for them to load context and guide the AI than to just do the work. Their code base is hot coupled garbage, and rather than accept that the tools aren't working because of their own lack of architectural rigor, they just shit on the tools. This is part of the reason that the study of open source maintainers using Cursor didn't consistently produce improvement (also, Cursor is pretty mid).
https://www.youtube.com/watch?v=tbDDYKRFjhk&t=4s is one of the largest studies I've seen so far and it shows that when the codebase is small or engineered for AI use, >20% productivity improvements are normal.
On top of this, a lot of the “learning to work with LLMs” is breaking down tasks into small pieces with clear instructions and acceptance criteria. That’s just part of working efficiently, but maybe people don’t want to be bothered to do it.
Even this opens up a whole field of weird subtle workflow tricks people have, because people run parallel asynchronous agents that step on each other in git. Solo developers run teams now!
Really wild to hear someone say out loud "there's no learning curve to using this stuff".
I agree with you and I have seen this take a few times now in articles on HN, which amounts to the classic: "We've tried nothing and we're all out of ideas" Simpson's joke.
I read these articles and I feel like I am taking crazy pills sometimes. The person, enticed by the hype, makes a transparently half-hearted effort for just long enough to confirm their blatantly obvious bias. They then act like they now have ultimate authority on the subject and proclaim their preconceived notions were definitely true beyond any doubt.
Not all problems yield well to LLM coding agents. Not all people will be able or willing to use them effectively.
But I guess "I gave it a try and it is not for me" is a much less interesting article compared to "I gave it a try and I have proved it is as terrible as you fear".
I've said it before, I feel like I'm some sort of lottery winner when it comes to LLM usage.
I've tried a few things that have mostly been positive. Starting with Copilot's in-line "predictive text on steroids", which works really well. It's definitely faster and more accurate than me typing in a traditional IntelliSense IDE. For me, this level of AI is can't-lose: it's very easy to see if a few lines of prediction is what you want.
I then did Cursor for a while, and that did what I wanted as well. Multi-file edits can be a real pain. Sometimes, it does some really odd things, but most of the time, I know what I want, I just don't want to find the files, make the edits on all of them, see if it compiles, and so on. It's a loop that you have to do as a junior dev, or you'll never understand how to code. But now I don't feel I learn anything from it, I just want the tool to magically transform the code for me, and it does that.
Now I'm on Claude. Somehow, I get a lot fewer excursions from what I wanted. I can do much more complex code edits, and I barely have to type anything. I sort of tell it what I would tell a junior dev. "Hey let's make a bunch of connections and just use whichever one receives the message first, discarding any subsequent copies". If I was talking to a real junior, I might answer a few questions during the day, but he would do this task with a fair bit of mess. It's a fiddly task, and there are assumptions to make about what the task actually is.
Somehow, Claude makes the right assumptions. Yes, indeed I do want a test that can output how often each of the incoming connections "wins". Correct, we need to send the subscriptions down all the connections. The kinds of assumptions a junior would understand and come up with himself.
I spend a lot of time with the LLM critiquing, rather than editing. "This thing could be abstracted, couldn't it?" and then it looks through the code and says "yeah I could generalize this like so..." and it means instead of spending my attention on finding things in files, I look at overall structure. This also means I don't need my highest level of attention, so I can do this sort of thing when I'm not even really able to concentrate, eg late at night or while I'm out with the kids somewhere.
So yeah, I might also say there's very little learning curve. It's not like I opened a manual or tutorial before using Claude. I just started talking to it in natural language about what it should do, and it's doing what I want. Unlike seemingly everyone else.
Agreed. This is an astonishingly bad article. It's clear that the only reason it made it to the front page is because people who view AI with disdain or hatred upvoted it. Because as you say: how can anyone make authoritative claims about a set of tools not just without taking the time to learn to use them properly, but also believing that they don't even need to bother?
Pianists' results are well known to be proportional to their talent/effort. In open source hardly anyone is even using LLMs, and the ones that do have barely any output; in many cases less output than they had before using LLMs.
> In open source hardly anyone is even using LLMs, and the ones that do have barely any output; in many cases less output than they had before using LLMs.
Which shows that LLMs, when given to devs who are inexperienced with LLMs but are very experienced with the code they're working on, don't provide a speedup even though it feels like it.
Which is of course a very constrained scenario. IME the LLM speedup is mostly in greenfield projects using APIs and libraries you're not very experienced with.
Judging from all the comments here, it’s going to be amazing seeing the fallout of all the LLM generated code in a year or so. The amount of people who seemingly relish the ability to stop thinking and let the model generate giant chunks of their code base, is uh, something else lol.
It entirely depends on the exposure and reliability the code needs. Some code is just a one-off to show a customer what something might look like. I don't care at all how well the code works or what it looks like for something like that. Rapid prototyping is a valid use case for that.
I have also written C++ code that has to run for years at a time, meaning there can be absolutely no memory leaks or bugs whatsoever, or the TV stops working. I wouldn't have a language model write any of that, at least not without testing the hell out of it and making sure it makes sense to me.
It's not all or nothing here. These things are tools and should be used as such.
> It entirely depends on the exposure and reliability the code needs.
Ahh, sweet summer child, if I had a nickel for every time I've heard "just hack something together quickly, that's throwaway code", that ended up being a critical lynchpin of a production system - well, I'd probably have at least like a buck or so.
Obviously, to emphasize, this kind of thing happens all the time with human-generated code, but LLMs make the issue a lot worse because it lets you generate a ton of eventual mess so much faster.
Also, I do agree with your primary point (my comment was a bit tongue in cheek) - it's very helpful to know what should be core and what can be thrown away. It's just in the real world whenever "throwaway" code starts getting traction and getting usage, the powers that be rarely are OK with "Great, now let's rebuild/refactor with production usage in mind" - it's more like "faster faster faster".
In one camp are the fast code slingers putting something quickly without long design and planning. They never get it just right the first few iterations.
So in the other camp you have seasoned engineers who will have a 5x longer design and planning process. But they also never get it right the first several iterations. And by the time their “properly-engineered” design gets its chance to shine, the business needs already changed.
Or there are those people who were fast code slingers when they began coding, and learned how to design, and now they ship production ready code even faster with rock solid architecture and code quality even after the first iteration.
> Ahh, sweet summer child, if I had a nickel for every time I've heard "just hack something together quickly, that's throwaway code", that ended up being a critical lynchpin of a production system - well, I'd probably have at least like a buck or so.
Because this is the first pass on any project, any component, ever. Design is done with iterations. One can and should throw out the original rough lynchpin and replace it with a more robust solution once it becomes evident that it is essential.
If you know that ahead of time and want to make it robust early, the answer is still rarely a single diligent one-shot to perfection - you absolutely should take multiple quick rough iterations to think through the possibility space before settling on your choice. Even that is quite conducive to LLM coding - and the resulting synthesis after attacking it from multiple angles is usually the strongest of all. Should still go over it all with a fine toothed comb at the end, and understand exactly why each choice was made, but the AI helps immensely in narrowing down the possibility space.
Not to rag on you though - you were being tongue in cheek - but we're kidding ourselves if we don't accept that like 90% of the code we write is rough throwaway code at first and only a small portion gets polished into critical form. That's just how all design works though.
I would love to work at the places you have been where you are given enough time to throw out the prototype and do it properly. In my almost 20 years of professional experience this has never been the case and prototype and exploratory code has only been given minimal polishing time before reaching production and in use state.
We are all too well aware of the tragedy that is modern software engineering lol. Sadly I too have never seen that situation where I was given enough time to do the requisite multiple passes for proper design...
I have been reprimanded and tediously spent collectively combing over said quick prototype code for far longer than the time originally provided to work on it though, as a proof of my incompetence! Does that count?
Dunno about you, but I find thinking hard… when I offload boilerplate code to Claude, I have more cycles left over to hold the problem in my head and effectively direct the agent in detail.
This makes sense. I find that after 15 to 20 iterations, I get better understanding of what is being done and possible simplifications.
I then manually declare some functions, JSDoc comments for the return types, imports and stop halfway. By then the agent is able to think, ha!, you plan to replace all the api calls to this composable under the so and so namespace.
It's iterations and context. I don't use them for everything but I find that they help when my brain bandwidth begins to lag or I just need a boilerplate code before engineering specific use cases.
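For what it's worth, the same "declare the shape and stop halfway" move works outside a JS/JSDoc codebase too. A minimal Python sketch of the idea (every name here is hypothetical; type hints and docstrings stand in for the JSDoc return types):

```python
from dataclasses import dataclass

@dataclass
class ApiResult:
    status: int
    body: dict

def fetch_profile(user_id: str) -> ApiResult:
    """Replace the old direct HTTP call with the shared client under the `users` namespace."""
    ...  # stop here: the signature and docstring are the context the agent completes from

def fetch_orders(user_id: str, limit: int = 50) -> ApiResult:
    """Same replacement pattern as fetch_profile, paginated up to `limit` results."""
    ...
```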
I think you are overestimating the quality of code humans generate. I'd take LLM output over that of any junior- to mid-level developer (given the same prompt/ask).
LLMs are basically glorified slot machines. Some people try very hard to come up with techniques or theories about when the slot machine is hot; it’s only an illusion. Let me tell you, it’s random and arbitrary: maybe today is your lucky day, maybe not. Same with AI: learning the “skill” is as difficult as learning how to google or how to check Stack Overflow, trivial. All the rest is luck and how many coins you have in your pocket.
There's plenty of evidence that good prompts (prompt engineering, tuning) can result in better outputs.
Improving LLM output through better inputs is neither an illusion, nor as easy as learning how to google (entire companies are being built around improving llm outputs and measuring that improvement)
Sure, but tricks & techniques that work with one model often don't translate or are actively harmful with others. Especially when you compare models from today and 6 or more months ago.
Keep in mind that the first reasoning model (o1) was released less than 8 months ago and Claude Code was released less than 6 months ago.
This is not a good analogy. The parameters of slot machines can be changed to make the casino lose money. Just because something is random, doesn't mean it is useless. If you get 7 good outputs out of 10 from an LLM, you can still use it for your benefit. The frequency of good outputs and how much babysitting it requires determine whether it is worth using or not. Humans make mistakes too, although way less often.
Do you have an entry in your CV saying "proficiency in googling"? It's difficult not because it is complex; it's difficult because Google wants it to be opaque and as hard as possible to figure out.
If anything getting good information out of Google has become harder for us expert users because Google have tried to make it easier for everyone else.
The power-user tricks like "double quote phrase searches" and exclusion through -term are treated more as gentle guidelines now, because regular users aren't expected to figure them out.
There's always "verbatim" mode, though amusingly that appears to be almost entirely undocumented! I tried using Google to find the official documentation for that feature just now and couldn't do better than their 2011 blog entry introducing it: https://search.googleblog.com/2011/11/search-using-your-term...
Maybe if I was more skilled at Google I'd be able to use it to find documentation on its own features?
So true! About ten years ago Peter Norvig recommended the short Google online course on how to use Google Search: amazing how much one hour of structured learning permanently improved my search skills.
I have used neural networks since the 1980s, and modern LLM tech simply makes me happy, but there are strong limits to what I will use the current tech for.
> Learning how to use LLMs in a coding workflow is trivial. There is no learning curve. You can safely ignore them if they don’t fit your workflows at the moment.
Learning how to use LLMs in a coding workflow is trivial to start, but you find you get a bad taste early if you don't learn how to adapt both your workflow and its workflow. It is easy to get a trivially good result and then be disappointed in the followup. It is easy to try to start on something it's not good at and think it's worthless.
The pure dismissal of cursor, for example, means that the author didn't learn how to work with it. Now, it's certainly limited and some people just prefer Claude code. I'm not saying that's unfair. However, it requires a process adaptation.
"There's no learning curve" just means this guy didn't get very far up, which is definitely backed up by thinking that Copilot and other tools are all basically the same.
Define "not trivial". Obviously, experience helps, as with any tool. But it's hardly rocket science.
It seems to me the biggest barrier is that the person driving the tool needs to be experienced enough to recognize and assist when it runs into issues. But that's little different from any sophisticated tool.
It seems to me a lot of the criticism comes from placing completely unrealistic expectations on an LLM. "It's not perfect, therefore it sucks."
As of about three months ago, one of the most important skills in effective LLM coding is coding agent environment design.
If you want to use a tool like Claude Code (or Gemini CLI or Cursor agent mode or Code CLI or Qwen Code) to solve complex problems you need to give them an environment they can operate in where they can solve that problem without causing too much damage if something goes wrong.
You need to think about sandboxing, and what tools to expose to them, and what secrets (if any) they should have access to, and how to control the risk of prompt injection if they might be exposed to potentially malicious sources of tokens.
The other week I wanted to experiment with some optimizations of configurations on my Fly.io hosted containers. I used Claude Code for this by:
- Creating a new Fly organization which I called Scratchpad
- Assigning that a spending limit (in case my coding agent went rogue or made dumb expensive mistakes)
- Creating a Fly API token that could only manipulate that organization - so I could be sure my coding agent couldn't touch any of my production deployments
- Putting together some examples of how to use the Fly CLI tool to deploy an app with a configuration change - just enough information that Claude Code could start running its own deploys
- Running Claude Code such that it had access to the relevant Fly command authenticated with my new Scratchpad API token
With all of the above in place I could run Claude in --dangerously-skip-permissions mode and know that the absolute worst that could happen is it might burn through the spending limit I had set.
This took a while to figure out! But now... any time I want to experiment with new Fly configuration patterns I can outsource much of that work safely to Claude.
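To give a flavour of what "outsource it safely" ends up looking like, here's a rough Python sketch of the isolation step (it assumes the Scratchpad org and its org-scoped token were already created out-of-band, and the only flyctl invocation is a plain deploy; treat the names as placeholders):

```python
import os
import subprocess

def deploy_to_scratchpad(app_dir: str) -> None:
    """Run a Fly deploy using credentials that can only touch the throwaway Scratchpad org."""
    # FLY_SCRATCHPAD_TOKEN is my own env var name; flyctl itself reads FLY_API_TOKEN.
    env = dict(os.environ, FLY_API_TOKEN=os.environ["FLY_SCRATCHPAD_TOKEN"])
    # Spending limits live on the org itself, so even a runaway agent is capped.
    subprocess.run(["flyctl", "deploy"], cwd=app_dir, env=env, check=True)
```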
I don’t really see how it’s different from how you’d set up someone really junior to have a playground of sorts.
It’s not exactly a groundbreaking line of reasoning that leads one to the conclusion of “I shouldn’t let this non-deterministic system access production servers.”
Now, setting up an LLM so that they can iterate without a human in the loop is a learned skill, but not a huge one.
For one - I’d say scoped API tokens that prevent messing with resources across logical domains (eg prod vs nonprod, distinct github repos, etc) is best practice in general. Blowing up a resource with a broadly scoped token isn’t a failure mode unique to LLMs.
edit: I don’t have personal experience around spending limits but I vaguely recall them being useful for folks who want to set up AWS resources and swing for the fences, in startups without thinking too deeply about the infra. Again this isn’t a failure mode unique to LLMs although I can appreciate it not mapping perfectly to your scenario above
edit #2: fwict the LLM specific context of your scenario above is: providing examples, setting up API access somehow (eg maybe invoking a CLI?). The rest to me seems like good old software engineering
The statement I responded to was, "creating an effective workflow is not trivial".
There are plenty of useful LLM workflows that are possible to create pretty trivially.
The example you gave is hardly the first thing a beginning LLM user would need. Yes, more sophisticated uses of an advanced tool require more experience. There's nothing different from any other tool here. You can find similar debates about programming languages.
Again, what I said in my original comment applies: people place unrealistic expectations on LLMs.
I suspect that this is at least partly a psychological game people unconsciously play to try to minimize the competence of LLMs, to reduce the level of threat they feel. A sort of variation of terror management theory.
I don’t think anyone expects perfection. Programs crash, drives die, and computers can break anytime. But we expect our tools to be reliable and not fight with it everyday to get it to work.
I don’t have to debug Emacs every day to write code. My CI workflow just runs every time a PR is created. When I type ‘make tests’, I get a report back. None of those things are perfect, but they are reliable.
I'm not a native speaker, but to me that quote doesn't necessarily imply an inability of OP to get up the curve. Maybe they just mean that the curve can look flat at the start?
No, it's sometimes just extremely easy to recognize people who have no idea what they're talking about when they make certain claims.
Just like I can recognize a clueless frontend developer when they say "React is basically just a newer jquery". Recognizing clueless engineers when they talk about AI can be pretty easy.
It's a sector that is both old and new: AI has been around forever, but even people who worked in the sector years ago are taken aback by what is suddenly possible, the workflows that are happening... hell, I've even seen cases where it's the very people who have been following GenAI forever that have a bias towards believing it's incapable of what it can do.
For context, I lead an AI R&D lab in Europe (https://ingram.tech/). I've seen some shit.
Basically, they are the same, they are all LLMs. They all have similar limitations. They all produce "hallucinations". They can also sometimes be useful. And they are all way overhyped.
The amount of misconceptions in this comment are quite profound.
Copilot isn't an LLM, for a start. You _combine_ it with a selection of LLMs. And it absolutely has severe limitations compared to something like Claude Code in how it can interact with the programming environment.
"Hallucinations" are far less of a problem with software that grounds the AI to the truth in your compiler, diagnostics, static analysis, a running copy of your project, runnning your tests, executing dev tools in your shell, etc.
If it’s not trivial, it’s worthless, because writing things out manually yourself is usually trivial, but tedious.
With LLMs, the point is to eliminate tedious work in a trivial way. If it’s tedious to get an LLM to do tedious work, you have not accomplished anything.
If the work is not trivial enough for you to do yourself, then using an LLM will probably be a disaster, as you will not be able to judge the final output yourself without spending nearly the same amount of time it takes for you to develop the code on your own. So again, nothing is gained, only the illusion of gain.
The reason people think they are more productive using LLMs to tackle non-trivial problems is because LLMs are pretty good at producing “office theatre”. You look like you’re busy more often because you are in a tight feedback loop of prompting and reading LLM output, vs staring off into space thinking deeply about a problem and occasionally scribbling or typing something out.
So, I'd like you to talk to a fair number of emacs and vim users. They have spent hours and hours learning their tools, tweaking their configurations, and learning efficiencies. They adapt their tool to them and themselves to the tool.
We are learning that this is not going to be magic. There are some cases where it shines. If I spend the time, I can put out prototypes that are magic and I can test with users in a fraction of the time. That doesn't mean I can use that for production.
I can try three or four things during a meeting where I am generally paying attention, and look afterwards to see if any of them are worth pursuing.
I can have it work through drudgery if I provide it an example. I can have it propose a solution to a problem that is escaping me, and I can use it as a conversational partner for the best rubber duck I've ever seen.
But I'm adapting myself to the tool and I'm adapting the tool to me through learning how to prompt and how to develop guardrails.
Outside of coding, I can write chicken scratch and provide an example of what I want, and have it write a proposal for a PRD. I can have it break down a task, generate a list of proposed tickets, and after I've gone through them, have it generate them in Jira (or anything else with an API). But the more I invest in learning how to use the tool, the less I have to clean up after it.
Maybe one day in the future it will be better. However, the time invested in the tool means that 40 bucks of investment (20 into Cursor, 20 into GPT) can add a 10-15% boost in productivity. Putting 200 into Claude might get you another 10%, and it can get you 75% in greenfield and prototyping work. I bet that agency work can be sped up as much as 40% for that 200-buck investment into Claude.
That's a pretty good ROI.
And maybe some workloads can do even better. I haven't seen it yet but some people are further ahead than me.
vim and Emacs are owned by the developer who configures them. LLMs are products whose capabilities are subject to the whims of their host. These are not the same things.
Everything you mentioned is also fairly trivial, just a couple of one shot prompts needed.
Learning how to use LLMs in a coding workflow is trivial. There is no learning curve. [...]
LLMs will always suck at writing code that has not been written millions of times before. As soon as you venture slightly offroad, they falter.
That right there is your learning curve! Getting LLMs to write code that's not heavily represented in their training data takes experience and skill and isn't obvious to learn.
I’m still waiting for someone claiming that prompting is such a skill to learn to explain, just once, a single technique that is not obvious, like: storing a checkpoint to go back to a working version (already a good practice without using LLMs, see: git), or launching 10 tabs with slightly different prompts and choosing the best, or asking the LLM to improve my prompt, or adding more context… Is that a skill? I remember when I was a child that my mom thought that programming a VCR to record the night show was such a feat…
In my experience, it's not just prompting that needs to be figured out; it's a whole new workstyle that works for you, your technologies and even your current project. As an example, I write almost all my code functional-programming style, which I rarely did before. This lets me keep my prompts and context very focused, and it essentially eliminates hallucinations.
Also, I started in the pre-agents era and so I ended up with a pair-programming paradigm. Now every time I conceptualize a new task in my head -- whether it is a few lines of data wrangling within a function, or generating an entire feature complete with integration tests -- I instinctively do a quick prompt-vs-manual-coding evaluation and seamlessly jump to AI code generation if the prompt "feels" more promising in terms of total time and probability of correctness.
I think one of the skills is learning this kind of continuous evaluation and the judgement that goes with it.
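To make the functional-style point concrete: a pure function with explicit inputs and outputs is a self-contained prompt target, whereas code that reads and mutates shared state drags half the codebase into the context window. A deliberately trivial sketch (hypothetical function, not from my project):

```python
# Easy to delegate: everything the model needs is in the signature and docstring,
# and the result is trivially unit-testable.
def apply_discount(prices: list[float], percent: float) -> list[float]:
    """Return a new list with `percent` taken off each price; never mutate the input."""
    return [round(p * (1 - percent / 100), 2) for p in prices]

# Harder to delegate: a method whose correctness depends on self.cart, self.user and
# a repository's side effects forces you to ship all of that context in the prompt
# (or watch the model hallucinate it).
```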
You may not consider it a skill, but I train multiple programming agents on different production and quality code bases, and have all of them PR-review a change, with a report given at the end.
It helps dramatically at finding bugs and issues. Perhaps that's trivial to you, but it feels novel, as we've only had effective agents in the last couple of weeks.
Usually when you learn difficult skills, you can go to a trainer, take a class, read about the solutions.
Right now, you are left with random, flawed information on the internet (which you often can't reproduce in your own trials) or your own structured ideas about how to improve things.
That is difficult. It is difficult to take the information available right now, and come up with a reasonable way to improve the performance of LLMs through your ingenuity.
At some point it will be figured out, and every corporation will be following the same ideal setup, but at the moment it is a green field opportunity for the human brain to come up with novel and interesting ideas.
Thanks. So the skill is figuring out heuristics? That is not even related to AI or LLMs. But as I said, it's like learning how to google, which is exactly that: trial and error until you figure out what Google prefers.
I mean, it's definitely related. We have this tool that we know can perform better when we build the right software around it. Building that software is challenging. Knowing what to build, testing it.
I believe that's difficult, and not just what google prefers. I guess we feel differently about it.
If you have a big rock (a software project), there's quite a difference between pushing it uphill (LLM usage) and hauling it up with a winch (traditional tooling and methods).
People are claiming that it takes time to build the muscles and train the correct footing to push, while I'm here learning mechanical theory and drawing up levers. If someone manages to push the rock one meter, he comes clamoring, ignoring the many who were injured doing so, saying that one day he will be able to pick the rock up and throw it at the moon.
Simon, I have mad respect for your work but I think on this your view might be skewed because your day to day work involves a codebase where a single developer can still hold the whole context in their head. I would argue that the inadequacies of LLMs become more evident the more you have to make changes to systems that evolve at the speed of 15+ concurrent developers.
One of the things I'm using LLMs for a lot right now is quickly generating answers about larger codebases I'm completely unfamiliar with.
Anything up to 250,000 tokens I pipe into GPT-5 (prior to that o3), and beyond that I'll send them to Gemini 2.5 Pro.
For even larger code than that I'll fire up Codex CLI or Claude Code and let them grep their way to an answer.
This stuff has gotten good enough now that I no longer get stuck when new tools lack decent documentation - I'll pipe in just the source code (filtered for .go or .rs or .c files or whatever) and generate comprehensive documentation for myself from scratch.
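The mechanics of that "filter and pipe the source in" step are mundane. A rough sketch of what I mean (any file-concatenation script works; the token count is a crude ~4 characters/token estimate used to pick between a one-shot prompt and an agentic tool):

```python
import pathlib
import sys

def collect_source(root: str, suffixes=(".go", ".rs", ".c")) -> str:
    """Concatenate matching source files, each prefixed with its path, into one prompt-ready blob."""
    chunks = []
    for path in sorted(pathlib.Path(root).rglob("*")):
        if path.is_file() and path.suffix in suffixes:
            chunks.append(f"// file: {path}\n{path.read_text(errors='replace')}")
    return "\n\n".join(chunks)

if __name__ == "__main__":
    blob = collect_source(sys.argv[1] if len(sys.argv) > 1 else ".")
    # Rough size check: under ~250k tokens goes straight into the model,
    # anything bigger gets handed to a coding agent that can grep instead.
    print(f"~{len(blob) // 4} tokens", file=sys.stderr)
    sys.stdout.write(blob)
```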
Don't you see how this opens up a blindspot in your view of the code?
You don't have the luxury of having someone who is deeply familiar with the code sanity-check your perceived understanding of it, i.e. you don't see where the LLM is horribly off track because you don't have sufficient understanding of that code to see the error. In enterprise contexts this is very common, though, so it's quite likely that a lot of the haters here have seen PRs submitted by vibecoders to their own work that were inadequate enough that they started to blame the tool. For example, I have seen someone reinvent the wheel of session handling by a client library because they were unaware that the existing session handling came batteries-included, and the LLM didn't hesitate to write the code again for them. The code worked, everything checked out, but because the developer didn't know what they didn't know, they submitted a janky mess.
As someone who leans more towards the side of LLM-sceptiscism, I find Sonnet 4 quite useful for generating tests, provided I describe in enough detail how I want the tests to be structured and which cases should be tested. There's a lot of boilerplate code in tests and IMO because of that many developers make the mistake of DRYing out their test code so much that you can barely understand what is being tested anymore. With LLM test generation, I feel that this is no longer necessary.
Aren’t tests supposed to be premises (ensure the initial state is correct), compute (run the code), and assertions (verify the resulting state and output)? If your test code is complex, most of it should be moved into harness and helper functions. Writing more complex code isn’t particularly useful.
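A minimal illustration of that shape in pytest (the function under test is hypothetical): the premises live in a fixture, the test body stays flat, and anything more elaborate belongs in a helper:

```python
import pytest

def normalize_email(raw: str) -> str:
    """Hypothetical unit under test: trim whitespace and lowercase the address."""
    return raw.strip().lower()

@pytest.fixture
def messy_address() -> str:
    # Premises: a known, slightly broken initial state.
    return "  Alice@Example.COM "

def test_normalize_email(messy_address):
    # Compute: run the code under test.
    result = normalize_email(messy_address)
    # Assertions: verify the resulting output.
    assert result == "alice@example.com"
```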
LLM-driven coding can yield awesome results, but you will be typing a lot and, as the article states, it requires an already well-structured codebase.
I recently started a fresh project, and until I got to the desired structure I only used AI to ask questions or for suggestions. I organized and wrote most of the code.
Once it started to get into the shape that felt semi-permanent to me, I started a lot of queries like:
```
- Look at existing service X at folder services/x
- see how I deploy the service using k8s/services/x
- see how the docker file for service X looks like at services/x/Dockerfile
- now, I started service Y that does [this and that]
- create all that is needed for service Y to be skaffolded and deployed, follow the same pattern as service X
```
And it would go, read the existing stuff for X, then generate all of the deployment/monitoring/readme/docker/k8s/helm/skaffold for Y.
With zero to no mistakes.
Both Claude and Gemini are more than capable of doing such a task.
I had both of them generate 10-15 files with no errors, with the code deployable right after (of course the service will just answer and not do much more than that).
Then I take over again for a bit, do some business logic specific to Y, then again leverage AI to fill in missing bits, review, suggest stuff, etc.
It might look slow, but it actually cuts out the most boring and most error-prone steps when developing a medium-to-large k8s-backed project.
My workflow with a medium-sized iOS codebase is a bit like that. By the time everything works and is up to my standards, I've usually taken longer, or almost as long, as if I'd written everything manually. That's with Opus-only Claude Code. It's complicated stuff (structured concurrency and lots of custom AsyncSequence operators) which maybe CC just isn't suitable for.
Whipping up greenfield projects is almost magical, of course. But that’s not most of my work.
Deeply curious to know if this is an outlier opinion, a mainstream but pessimistic one, or the general consensus. My LinkedIn feed and personal network certainly suggest that it's an outlier, but I wonder if the people around me are overly optimistic or out of sync with what the HN community is experiencing more broadly.
My impression has been that in corporate settings (and I would include LinkedIn in that) AI optimism is basically used as virtue signaling, making it very hard to distinguish people who are actually excited about the tech from people wanting to be accepted.
My personal experience has been that AI has trouble keeping the scope of a change small and targeted. I have only been using Gemini 2.5 Pro though, as we don't have access to other models at my work. My friend tells me he uses Claude for coding and Gemini for documentation.
I reckon this opinion is more prevalent than the hyped blog posts and news stories suggest; I've been asking this exact question of colleagues and most share the sentiment, myself included, albeit not as pessimistic.
Most people I've seen espousing LLMs and agentic workflows as a silver bullet have limited experience with the frameworks and languages they use with these workflows.
My view currently is one of cautious optimism; that LLM workflows will get to a more stable point whereby they ARE close to what the hype suggests. For now, that quote that "LLMs raise the floor, not the ceiling" I think is very apt.
I think it’s pretty common among people whose job it is to provide working, production software.
If you go by MBA types on LinkedIn that aren’t really developers or haven’t been in a long time, now they can vibe out some react components or a python script so it’s a revolution.
Hi, my job is building working production software (these days heavily LLM assisted). The author of the article doesn't know what they're talking about.
I tend to strongly agree with the "unpopular opinion" about the IDEs mentioned versus CLI (specifically, aider.chat and Claude Code).
Assuming (this is key) you have mastery of the language and framework you're using, working with the CLI tool in 25 year old XP practices is an incredible accelerant.
Caveats:
- You absolutely must bring taste and critical thinking, as the LLM has neither.
- You absolutely must bring systems thinking, as it cannot keep deep weirdness "in mind". By this I mean the second- and third-order gotchas about how things ought to work but don't.
- Finally, you should package up everything new about your language or frameworks since a few months or year before the knowledge cutoff date, and include a condensed synthesis in your context (e.g., Swift 6 and 6.1 versus the 5.10 and 2024's WWDC announcements that are all GPT-5 knows).
For this last one I find it useful to (a) use OpenAI's "Deep Research" to first whitepaper the gaps, then another pass to turn that into a Markdown context prompt, and finally bring that over to your LLM tooling to include as needed when doing a spec or in architect mode. Similarly, (b) use repomap tools on dependencies if creating new code that leverages those dependencies, and have that in context for that work.
I'm confused why these two obvious steps aren't built into leading agentic tools, but maybe handling the LLM as a naive and outdated "Rain Man" type doesn't figure into mental models at most KoolAid-drinking "AI" startups, or maybe vibecoders don't care, so it's just not a priority.
Either way, context based development beats Leroy Jenkins.
> use repomap tools on dependencies if creating new code that leverages those dependencies, and have that in context for that work.
It seems to me that currently there are 2 schools of thought:
1. Use repomap and/or LSP to help the models navigate the code base
2. Let the models figure things out with grep
Personally, I am 100% a grep guy, and my editor doesn't even have LSP enabled. So, it is very interesting to see how many of these agentic tools do exactly the same thing.
And Claude Code /init is a great feature that basically writes down the current mental model after the initial round of grep.
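As a toy illustration of the grep-first style, this is roughly the kind of primitive such an agent loops over to build its own map of a codebase (a stdlib-only sketch, not how any particular tool actually implements it):
```
import re
from pathlib import Path

def grep_repo(root: str, pattern: str, glob: str = "*.py", max_hits: int = 50) -> list[str]:
    """Return 'path:line: text' matches for a regex across a repo."""
    rx = re.compile(pattern)
    hits = []
    for path in sorted(Path(root).rglob(glob)):
        if not path.is_file():
            continue
        for lineno, line in enumerate(path.read_text(errors="ignore").splitlines(), 1):
            if rx.search(line):
                hits.append(f"{path}:{lineno}: {line.strip()}")
                if len(hits) >= max_hits:
                    return hits
    return hits

# e.g. grep_repo("services/x", r"class \w*Notification") to see which of two
# notification systems a codebase actually uses before touching it.
```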
LinkedIn posts seem like an awful source. The people I see posting for themselves there are either pre-successful or just very fond of personal branding.
Speaking to actual humans IRL (as in, non-management colleagues and friends in the field), people are pretty lukewarm on AI, with a decent chunk of them who find AI tooling makes them less productive. I know a handful of people who are generally very bullish on AI, but even they are nowhere near the breathless praise and hype you read about here and on LinkedIn, they're much more measured about it and approach it with what I would classify as common sense. Of course this is entirely anecdotal, and probably depends where you are and what kind of business you're in, though I will say I'm in a field where AI even makes some amount of sense (customer support software), and even then I'm definitely noticing a trend of disillusionment.
On the management side, however, we have all sorts of AI mandates, workshops, social media posts hyping our AI stuff, our whole "product vision" is some AI-hallucinated nightmare that nobody understands, you'd genuinely think we've been doing nothing but AI for the last decade the way we're contorting ourselves to shove "AI" into every single corner of the product. Every day I see our CxOs posting on LinkedIn about the random topic-of-the-hour regarding AI. When GPT-5 launched, it was like clockwork, "How We're Using GPT-5 At $COMPANY To Solve Problems We've Never Solved Before!" mere minutes after it was released (we did not have early access to it lol). Hilarious in retrospect, considering what a joke the launch was like with the hallucinated graphs and hilarious errors like in the Bernoulli's Principle slide.
Despite all the mandates and mandatory shoves coming from management, I've noticed the teams I'm close with (my team included) are starting to push back a bit themselves. They're getting rid of the spam-generating PR bots that have never, not once, provided a useful PR comment. People are asking for the various subscriptions they were granted to be revoked because they're not using them and it's a waste of money. Our own customers' #1 piece of feedback is to focus less on stupid AI shit nobody ever asked for, and to instead improve the core product (duh). I'm even seeing our CTO, who was fanboy number 1, start dialing it back a bit and relenting.
It's good to keep in mind that HN is primarily an advertisement platform for YC and their startups. If you check YC's recent batches, you would think that the one and only technology that exists in the world is AI; every single one of them mentions AI in one way or another. The majority of them are the lowest-effort shit imaginable that just wraps some AI APIs and calls it a product. There is a LOT of money riding on this hype wave, so there's also a lot of people with vested interests in making it seem like these systems work flawlessly. The less said about LinkedIn the better; that site is the epitome of the dead internet theory.
I think that beyond the language used, the article does have some points I agree with. In general, LLMs code better in languages that are more easily available online, where they can be trained on a larger amount of source code. Python is not the same as PL/I (I don't know if you've tried it, but with the latter, they don't know the most basic conventions used in its development).
When it is mentioned that LLMs "have terrible code organization skills", I think they are referring mainly to the size of the context. Developing a module with hundreds of LoC is not the same as developing one with thousands, or tens of thousands.
I am not very convinced by the skill-degradation argument; I am not aware of a study that validates it. On the other hand, it is true that agents are constantly evolving, and I don't see any difficulties that cannot be overcome with the current evolutionary race, given that, in the end, coding is one of the most accessible functions for artificial intelligence.
People that comment on and get defensive about this bit:
> Learning how to use LLMs in a coding workflow is trivial. There is no learning curve. You can safely ignore them if they don’t fit your workflows at the moment.
How much of your workflow or intuition from 6 months ago is still relevant today? How long would it take to learn the relevant bits today?
Keep in mind that Claude Code was released less than 6 months ago.
A fraction of the LLM maximalists are being defensive because they don't want to consider that they've maybe invested too much time in those tools, considering what said tools are currently genuinely good at.
Pretty much all of the intuition I've picked up about getting good results from LLMs has stayed relevant.
If I was starting from fresh today I expect it would take me months of experimentation to get back to where I am now.
Working thoughtfully with LLMs has also helped me avoid a lot of the junk tips ("Always start with 'you are the greatest world expert in X', offer to tip it, ...") that are floating around out there.
All of the intuition? Definitely not my experience. I have found that optimal prompting differs significantly between models, especially when you look at models that are 6 months old or older (the first reasoning model, o1, is less than 8 months old).
Speaking mostly from experience of building automated, dynamic data processing workflows that utilize LLMs:
Things that work with one model, might hurt performance or be useless with another.
Many tricks that used to be necessary in the past are no longer relevant, or only applicable for weaker models.
This isn't me dismissing anyone's experience. It's ok to do things that become obsolete fairly quickly, especially if you derive some value from it. If you try to stay on top of a fast-moving field, it's almost inevitable. I would not consider it a waste of time.
Opening the essay with "Learning how to use LLMs in a coding workflow is trivial" and closing by recommending Copilot as the AI agent is the worst take on LLM coding I have ever seen.
I have built many pipelines integrating LLMs to drive real $ results, and I think this article boils it down too simply. But I always remember: if the LLM is the most interesting part of your work, something is severely wrong and you probably aren't adding much value. Context management based on some aspects of your input is where LLMs get good, but you need to do lots of experimentation to tune something. Most cases I have seen are about developing one pipeline to fit hundreds of extremely different cases; the LLM does not solve this problem but basically serves as an approximator that lets you discretize previously large problems into some information subspace where you can treat the infinite set of inputs as something you know. LLMs are like a lasso (a better or worse one than traditional lassos depending on the use case), but once you get your catch you still need to process it and deal with it programmatically to solve some greater problem. I hate how so many LLM-related articles and comments say "AI is useless, throw it away, don't use it" or "AI is the future, if we don't do it now we're doomed, let's integrate it everywhere, it can solve all our problems". Can anyone pick a happy medium? Maybe that's what being in a bubble looks like.
>I made a CLI logs viewers and querier for my job, which is very useful but would have taken me a few days to write (~3k LoC)
I recall The Mythical Man-Month stating a rough calculation that the average software developer writes about 10 net lines of new, production-ready code per day. For a tool like this, going up an order of magnitude to about 100 lines per day of pretty good internal tooling seems reasonable.
OP sounds a few cuts above the 'average' software developer in terms of skill level. But here we also need to point out that a CLI log viewer and querier is not the kind of thing you actually need to be a top-tier developer to crank out, even in the pre-LLM era, unless you were going for lnav [1] levels of polish.
A lot of the Mythical Man-Month is timeless, but for a stat like that, it really is worth bearing in mind the book was written half a century ago about developers working on 1970s mainframes.
Yeah, I think that metric has grown to about 20 lines per day using 2010s-era languages and methods. So maybe we could think of LLM usage as an attempt to bring it back down to 10 per day.
So many articles should prepend “My experience with ...” to their title. Here is OP's first sentence: “I spent the past ~4 weeks trying out all the new and fancy AI tools for software development.” Dude, you have had some experiences and they are worth writing up and sharing. But your experiences are not a stand-in for "the current state." This point applies to a significant fraction of HN articles, to the point that I wish the headlines were flagged “blog”.
Clickbait gets more reach. It's an unfortunate thing. I remember Veritasium in a video even saying something along the lines of him feeling forced to do clickbaity YouTube because it works so well.
The reach is big enough to not care about our feelings. I wish it wasn't this way.
Interesting read, but strange to totally ignore the macOS ChatGPT app, which optionally integrates with a terminal session, the currently opened VSCode editor tab, Xcode, etc. I use this combination at least 2 or 3 times a month, and even if my monthly use is less than 40 minutes total, it is a really good tool to have in your toolbelt.
The other thing I disagree with is the coverage of gemini-cli: if you use gemini-cli for a single long work session, then you must set your Google API key as an environment variable when starting gemini-cli, otherwise after a short while you end up using Gemini 2.5 Flash, and that leads to unhappy results. So, use gemini-cli for free in short, focused 3- or 4-minute work sessions and you are good, or pay for longer work sessions and you are good.
I do have a random off topic comment: I just don’t get it: why do people live all day in an LLM-infused coding environment? LLM based tooling is great, but I view it as something I reach for a few times a day for coding and that feels just right. Separately, for non-coding tasks, reaching for LLM chat environments for research and brainstorming is helpful, but who really needs to do that more than once or twice a day?
> By being particularly bad at anything outside of the most popular languages and frameworks, LLMs force you to pick a very mainstream stack if you want to be efficient.
Do they? I’ve found Clojure-MCP[1] to be very useful. OTOH, I’m not attempting to replace myself, only augment myself.
Thanks for the link! I used to use Clojure a lot professionally, but now just for fun projects and to occasionally update my old Clojure book. I had bookmarked Clojure-MCP a while ago and never got back to it, but I will give it a try.
I like your phrasing of “OTOH, I’m not attempting to replace myself, only augment myself.” because that is my personal philosophy also.
I think we're still in the gray zone of the "Incessant Obsolescence Postulate" (the Wait Calculation). Are you better off "skilling up" on the tech as it is today, or waiting for it to just "get better" so by the time you kick off, you benefit from the solved-problems X years from now. I also think this calculation differs by domain, skill level, and your "soft skill" abilities to communicate, explain and teach. In some domains, if you're not already on this train, you won't even get hired anymore.
The current state of LLM-driven development is already several steps down the path of an end-game where the overwhelming majority of code is written by the machine; our entire HCI for "building" is going to be so far different to how we do it now that we'll look back at the "hand-rolling code era" in a similar way to how we view programming by punch-cards today. The failure modes, the "but it SUCKS for my domain", the "it's a slot machine" etc etc are not-even-wrong. They're intermediate states except where they're not.
The exceptions to this end-game will be legion and exist only to prove the end-game rule.
Relying on LLM for any skill, especially programming, is like cutting your own healthy legs and buying crutches to walk. Plus you now have to pay $49/month for basic walking ability and $99/month for "Walk+" plan, where you can also (clumsily) jog.
There are a lot of skills which I haven't developed because I rely on external machines to handle it for me; memorization, fire-starting, navigation. On net, my life is better for it. LLMs may or may not be as effective at replacing code development as books have been at replacing memorization and GPS has been at replacing navigation, but eventually some tool will be and I don't think I'll be worse off for developing other skills.
GPS is a particularly good analogy... Lose it for any reason and suddenly you are helpless without backup navigation aids. But a compass, paper map, watch, and sextant will still work!
I would actually disagree with the final conclusion here; despite claiming to offer the same models, Copilot seems very much nerfed — cross-comparing the Copilotified LLM and the same LLM through OpenRouter, the Copilot one seems to fail much harder. I'm not an expert in the details of LLMs but I guess there might be some extra system prompt, I also notice the context window limit is much lower, which kinda suggests it's been partially pre-consumed.
In case it matters, I was using Copilot that is for 'free' because my dayjob is open source, and the model was Claude Sonnet 3.7.
I've not yet heard anyone else saying the same as me which is kind of peculiar.
Good read. I just want to point out that LLMs seem to write better React code, but as an experienced frontend developer my opinion is that they're also bad at React. Their approach is outdated, as it doesn't follow the latest guidelines. They write React as I would have written it in 2020. So, as usual, you need to feed the right context to get proper results.
OP did miss the VSCode extension for Claude Code. It is still terminal-based, but:
- it shows you the diff of the incoming changes in VSCode (like git)
- it knows the line you selected in the editor, for context
I have not tried every IDE/CLI or models, only a few, mostly Claude and Qwen.
I work mostly in C/C++.
The most valuable improvement from using these kinds of tools, for me, is to easily find help when I have to work on boring/tedious tasks, or when I want to have a Socratic conversation about a design idea with a not-so-smart but extremely knowledgeable colleague.
But for anything requiring a brain, it is almost useless.
Does not mention the actual open source solution that has autocomplete, chat, a planner, and agents, lets you bring your own keys, connect to any LLM provider, customize anything, and rewrite all the prompts and tools.
This article makes me wanna try building a token field in Flutter using an LLM chat or agent. Chat should be enough: a few iterations to get the behaviour and the tests right, a bit of style to make it look Apple-nice. As if a regular dev would do much better or quicker for this use case; such a bad example, IMO. I don't buy it.
> By being particularly bad at anything outside of the most popular languages and frameworks, LLMs force you to pick a very mainstream stack if you want to be efficient.
I haven't found that to be true with my most recent usage of AI. I do a lot of programming in D, which is not popular like Python or Javascript, but Copilot knows it well enough to help me with things like templates, metaprogramming, and interoperating with GCC-produced DLLs on Windows. This is true in spite of the lack of a big pile of training data for these tasks. Importantly, it gets just enough things wrong when I ask it to write code for me that I have to understand everything well enough to debug it.
Strange post. It reads in part like an incoherent rant and in part like a well-made analysis.
It's mostly on point though. In recent years I've been assigned to manage and plan projects at work, and I think the skills I've learnt from that greatly help in getting effective results from an LLM.
I have a biased opinion since I work for a background agent startup currently - but there are more (and better!) out there than Jules and Copilot that might address some of the author's issues.
By no means are better background agents "mythical" as you claim. I didn't bother to mention them as it is easy enough to search for asynchronous/background agents yourself.
Devin is perhaps the one that is most fully featured and I believe has been around the longest. Other examples that seem to be getting some attention recently are Warp, Cursor's own background agent implementation, Charlie Labs, Codegen, Tembo, and OpenAI's Codex.
I do not work for any of the aforementioned companies.
>Ah yes. An unverifiable claim followed by "just google them yourself".
Some agent scaffolding performs better on benchmarks than others given the same underlying base model - see SWE Bench and Terminal Bench for examples.
Some may find certain background agents better than others simply because of UX. Some background agents have features that others don't - like memory systems, MCP, 3rd party integrations, etc.
I maintain it is easy to search for examples of background coding agents that are not Jules or Copilot. For me, searching "background coding agents" on google or duckduckgo returns some of the other examples that I mentioned.
"LLMs won’t magically make you deliver production-ready code"
Either I'm extremely lucky, or I was lucky to find the guy who said it must all be test-driven and guided by the usual principles of DRY etc. Claude Code works absolutely fantastically nine out of ten times, and when it doesn't we just roll back the three hours of nonsense it did, postpone the feature, or give it extra guidance.
I'm beginning to suspect robust automated tests may be one of the single strongest indicators for if you're going to have a good time with LLM coding agents or not.
If there's a test suite for the thing to run it's SO much less likely to break other features when it's working. Plus it can read the tests and use them to get a good idea about how everything is supposed to work already.
Telling Claude to write the test first, then execute it and watch it fail, then write the implementation has been giving me really great results.
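Spelled out, the red-then-green loop I'm prompting for looks roughly like this tiny, made-up example (test written and run first, implementation only added after watching it fail):
```
import re

# Step 1: write the test and run it - it should fail because slugify doesn't exist yet.
def test_slugify_collapses_whitespace_and_lowercases():
    assert slugify("  Hello   World ") == "hello-world"

# Step 2: only then write the implementation, and rerun until the test passes.
def slugify(text: str) -> str:
    """Lowercase, trim, and collapse runs of whitespace into single hyphens."""
    return re.sub(r"\s+", "-", text.strip()).lower()
```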
> By being particularly bad at anything outside of the most popular languages and frameworks, LLMs force you to pick a very mainstream stack if you want to be efficient.
Almost like hiring and scaling a team? There are also benchmarks that specifically measure this, and it's in theory a very temporary problem (the Aider Polyglot Benchmark is one such).
My favorite setup so far is using the Claude code extension in VScode. All the power of CC, but it opens files and diffs in VScode. Easy to read and modify as needed.
There are kind of a lot of errors in this piece. For instance, the problem the author had with Gemini CLI running out of tokens in ten minutes is what happens when you don’t set up (a free) API key in your environment.
> By being particularly bad at anything outside of the most popular languages and frameworks, LLMs force you to pick a very mainstream stack if you want to be efficient.
I use clojure for my day-to-day work, and I haven't found this to be true. Opus and GPT-5 are great friends when you start pushing limits on Clojure and the JVM.
> Or 4.1 Opus if you are a millionaire and want to pollute as much possible
I know this was written tongue-in-cheek, but at least in my opinion it's worth it to use the best model if you can. Opus is definitely better on harder programming problems.
> GPT 4.1 and 5 are mostly bad, but are very good at following strict guidelines.
This was interesting. At least in my experience GPT-5 seemed about as good as Opus. I found it to be _less_ good at following strict guidelines, though. In one test Opus avoided a bug by strictly following the rules, while GPT-5 missed it.
> Learning how to use LLMs in a coding workflow is trivial. There is no learning curve. You can safely ignore them if they don’t fit your workflows at the moment.
I'm sorry, but I disagree with this claim. That is not my experience, nor that of many others. It's true that you can make them do something without learning anything. However, it takes time to learn what they are good and bad at, what information they need, and what nonsense they'll do without express guidance. It also takes time to know what to look for when reviewing results.
I also find that they work fine for languages without static types. You need tests, yes, but you need them anyway.
"Google’s enshittification has won and it looks like no competent software developers are left. I would know, many of my friends work there". Ouch ... I hope his friends are in marketing!
They missed OpenAI Codex, maybe deliberately? It's less llm-development and more vibe-coding, or maybe "being a PHB of robots". I'm enjoying it for my side project this week.
Yet another developer who is too full of themselves to admit that they have no idea how to use LLMs for development. There's an arrogance that can set in when you get to be more senior and unless you're capable of force feeding yourself a bit of humility you'll end up missing big, important changes in your field.
It becomes farcical when not only are you missing the big thing but you're also proud of your ignorance and this guy is both.
Personally, I’ve had a pretty positive experience with the coding assistants, but I had to spend some time to develop intuition for the types of tasks they’re likely to do well. I would not say that this was trivial to do.
Like if you need to crap out a UI based on a JSON payload, make a service call, add a server endpoint, LLMs will typically do this correctly in one shot. These are common operations that are easily extrapolated from their training data. Where they tend to fail are tasks like business logic which have specific requirements that aren’t easily generalized.
I’ve also found that writing the scaffolding for the code yourself really helps focus the agent. I’ll typically add stubs for the functions I want, and create overall code structure, then have the agent fill the blanks. I’ve found this is a really effective approach for preventing the agent from going off into the weeds.
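For a sense of what that scaffolding looks like before the agent touches it, here's a sketch (the domain, names, and referenced doc path are all invented for illustration):
```
# Hand-written scaffold: structure and signatures are decided by a human,
# the bodies are left for the agent to fill in.
from dataclasses import dataclass

@dataclass
class Invoice:
    customer_id: str
    line_items: list[tuple[str, float]]

def validate_invoice(invoice: Invoice) -> list[str]:
    """Return a list of validation errors; an empty list means valid."""
    raise NotImplementedError  # agent: implement per the (hypothetical) rules in docs/billing.md

def total_with_tax(invoice: Invoice, tax_rate: float) -> float:
    """Sum line items and apply tax, rounded to 2 decimal places."""
    raise NotImplementedError  # agent: implement

def render_invoice_email(invoice: Invoice) -> str:
    """Plain-text email body summarizing the invoice."""
    raise NotImplementedError  # agent: implement, keep it short
```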
I also find that if it doesn’t get things right on the first shot, the chances are it’s not going to fix the underlying problems. It tends to just add kludges on top to address the problems you tell it about. If it didn’t get it mostly right at the start, then it’s better to just do it yourself.
All that said, I find enjoyment is an important aspect as well and shouldn’t be dismissed. If you’re less productive, but you enjoy the process more, then I see that as a net positive. If all LLMs accomplish is to make development more fun, that’s a good thing.
I also find that there's use for both terminal based tools and IDEs. The terminal REPL is great for initially sketching things out, but IDE based tooling makes it much easier to apply selective changes exactly where you want.
As a side note, got curious and asked GLM-4.5 to make a token field widget with React, and it did it in one shot.
It's also strange not to mention DeepSeek and GLM as options given that they cost orders of magnitude less per token than Claude or Gemini.
Learning how to use LLMs in a coding workflow is trivial. There is no learning curve. You can safely ignore them if they don’t fit your workflows at the moment.
I have never heard anybody successfully using LLMs say this before. Most of what I've learned from talking to people about their workflows is counterintuitive and subtle.
It's a really weird way to open up an article concluding that LLMs make one a worse programmer: "I definitely know how to use this tool optimally, and I conclude the tool sucks". Ok then. Also: the piano is a terrible, awful instrument; what a racket it makes.
Fully agree. It takes months to learn how to use LLMs properly. There is an initial honeymoon where the LLMs blow your mind out. Then you get some disappointments. But then you start realizing that there are some things that LLMs are good at and some that they are bad at. You start creating a feel for what you can expect them to do. And more importantly, you get into the habit of splitting problems into smaller problems that the LLMs are more likely to solve. You keep learning how to best describe the problem, and you keep adjusting your prompts. It takes time.
it really doesn't take that long. Maybe if you're super junior and never coded before? In that case I'm glad its helping you get into the field. Also, if its taking you months there are whole new models that will get released and you need to learn those quirks again.
No, it's a practice. You're not necessarily building technical knowledge, rather you're building up an intuition. It's for sure not like learning a programming language. It's more like feeling your way along and figuring out how to inhabit a dwelling in the dark. We would just have to agree to disagree on this. I feel exactly as the parent commenter felt. But it's not easy to explain (or to understand from someones explanation.)
I'm glad you feel like you've nailed it. I've been using models to help me code for over two years, and I still feel like I have no idea what I'm doing.
I feel like every time I have a prompt or use a new tool, I'm experimenting with how to make fire for the first time. It's not to say that I'm bad at it. I'm probably better than most people. But knowing how to use this tool is by far the largest challenge, in my opinion.
Love this, and it's so true. A lot of people don't get this, because it's so nuanced. It's not something that's slowing you down. It's not learning a technical skill. Rather, it's building an intuition.
I find it funny when people ask me if it's true that they can build an app using an LLM without knowing how to code. I think of this... that it took me months before I started feeling like I "got it" with fitting LLMs into my coding process. So, not only do you need to learn how to code, but getting to the point that the LLM feels like a natural extension of you has its own timeline on top.
Months? That’s actually an insanely long time
I dunno, man. I think you could have spent that time, you know, learning to code instead.
Sure. But it happens that I have 20 years of experience, and I know quite well how to code. Everything the LLM does for me I can do myself. But the LLM does that 100 times faster than me. Most of the days nowadays I push thousands of lines of code. And it's not garbage code, the LLMs write quite high quality code. Of course, I still have to go through the code and make sure it all makes sense. So I am still the bottleneck. At some point I will probably grown to trust the LLM, but I'm not quite there yet.
> Most of the days nowadays I push thousands of lines of code
Insane stuff. It's clear you can't review that many changes in a day, so you're just flooding your codebase with code that you barely read.
Or is your job just re-doing the same boilerplate over and over again?
You are a bit quick to jump to conclusions. With LLMs, test driven development becomes both a necessity and a pleasure. The actual functional code I push in a day is probably in the low hundreds LOC’s. But I push a lot of tests too. And sure, lots of that is boilerplate. But the tests run, pass, and if anything have better coverage than when I was writing all the code myself.
it is, mind you, exactly the same experience as working on a team with lots of junior engineers, and delegating work to them
Wait a minute, you didn't just claim that we have reached AGI, right? I mean, that's what it would mean to delegate work to junior engineers, right? You're delegating work to human level intelligence. That's not what we have with LLMs.
Yes and no. With junior developers you need to educate them. You need to do that with LLMs too. Maybe you need to break down the problem into smaller chunks, but you get to this after a while. But once the LLM understands the task, you get a few hundred lines of code in a matter of minutes. With a junior developer you are lucky if they come back the same day. The iteration speed with AI is simply in a different league.
Edit: it is Sunday. As I am relaxing, and spending time writing answers on HN, I keep a lazy eye on the progress of an LLM at work too. I got stuff done that would have taken me a few days of work by just clicking a "Continue" button now and then.
If you have 20 years of experience, then you know that the number of lines of code is always inversely proportional to code quality.
> ...thousands of lines of code ... quite high quality
A contradiction in terms.
> Learning how to use LLMs in a coding workflow is trivial. There is no learning curve. You can safely ignore them if they don’t fit your workflows at the moment.
That's a wild statement. I'm now extremely productive with LLMs in my core codebases, but it took a lot of practice to get it right and repeatable. There's a lot of little contextual details you need to learn how to control so the LLM makes the right choices.
Whenever I start working in a new codebase, it takes a non-trivial amount of time to ramp back up to full LLM productivity.
Is the non-trivial amount of time significantly less than you trying to ramp up yourself?
I am still hesitant to use AI to solve problems for me. Either it hallucinates and misleads me, or it does a great job and I worry that my ability to reason through complex problems with rigor will degenerate. Once my ability to solve complex problems has degenerated, my patience diminished, and my attention span destroyed, I will be completely reliant on a service that other entities own just to perform in my daily life. Genuine question - are people comfortable with this?
The ramp-up time with AI is absolutely lower than trying to ramp up without AI.
My comment is specifically in contrast to working in a codebase where I'm at "max AI productivity". In a new codebase, it just takes a bit of time to work out kinks and figure out tendencies of the LLMs in those codebases. It's not that I'm slower than I'd be without AI, I'm just not at my "usual" AI-driven productivity levels.
>Genuine question - are people comfortable with this?
It's a question of degree, but in general, yeah. I'm totally comfortable being reliant on other entities to solve complex problems for me.
That's how economies work [1]. I neither have nor want to acquire the lifetime of experience I would need to learn how to produce the tea leaves in my tea, or the clean potable water in it, or the mug they are contained within, or the concrete walls 50 meters up from ground level I am surrounded by, or so on and so forth. I can live a better life by outsourcing the need for this specialized knowledge to other people, and trade with them in exchange for my own increasingly-specialized knowledge. Even if I had 100 lifetimes to spend, and not the 1 I actually have, I would probably want to put most of them to things that, you know, aren't already solved-enough problems.
Everyone doing anything interesting works like this, with vanishingly few exceptions. My dad doesn't need to know how to do algebra to get his taxes done, he just has an accountant. And his accountant doesn't need to know how to rewire his turn of the century New England home. And if you look at the exceptions, like that really cute 'self sufficient' family who uploads weekly YouTube videos called "Our Homestead Life"... It often turns out that the revenue from that YouTube stream is nontrivial to keeping the whole operation running. In other words, even if they genuinely no longer go to Costco, it's kind of a gyp.
[1]: https://www.youtube.com/watch?v=67tHtpac5ws
> My dad doesn't need to know how to do algebra to get his taxes done, he just has an accountant.
This is not quite the same thing. The AI is not perfect, it frequently makes mistakes or suboptimal code. As a software engineer, you are responsible for finding and fixing those. This means you have to review and fully understand everything that the AI has written.
Quite a different situation than your dad and his accountant.
I see your point. I don't think it's different in kind, just degree. My thought process: First, is my dad's accountant infallible?
If not, then they must themselves make mistakes or do things suboptimally sometimes. Whose responsibility is that - my dad, or my dad's accountant?
If it is my dad, does that then mean my dad has an obligation to review and fully understand everything the accountant has written?
And do we have to generalize that responsibility to everything and everyone my dad has to hand off work to in order to get something done? Clearly not, that's absurd. So where do we draw the line? You draw it in the same place I do for right now, but I don't see why we expect that line to be static.
> This means you have to review and fully understand everything that the AI has written.
Yes, and people who care and is knowledgeable do this already. I do this, for one.
But there’s no way one is giving as thorough a review as if one had written code to solve the problem themselves. Writing is understanding. You’re trading thoroughness and integrity for chance.
Writing code should never have been a bottleneck. And since it wasn't, any massive gains are due to being ok with trusting the AI.
I would honestly say, it's more like autocomplete on steroids, like you know what you want so you just don't wanna type it out (e.g. scripts and such)
And so if you don't use it then someone else will... But as for the models, we already have some pretty good open source ones like Qwen and it'll only get better from here so I'm not sure why the last part would be a dealbreaker
> That's a wild statement. I'm now extremely productive with LLMs in my core codebases, but it took a lot of practice to get it right and repeatable. There's a lot of little contextual details you need to learn how to control so the LLM makes the right choices.
> Whenever I start working in a new code base, it takes a a non-trivial amount of time to ramp back up to full LLM productivity.
Do you find that these details translate between models? Sounds like it doesn't translate across codebases for you?
I have mostly moved away from this sort of fine-tuning approach because of experience a while ago with OpenAI's ChatGPT 3.5 and 4. Extra work on my end that was necessary with the older model wasn't with the new one, and sometimes it counterintuitively caused worse performance by pointing the model at the way I'd do it rather than the way it might have the best luck with. ESPECIALLY for the sycophantic models, which will heavily index on "if you suggested that this thing might be related, I'll figure out some way to make sure it is!"
So more recently I generally stick to the "we'll handle a lot of the prompt nitty gritty" for you IDE or CLI agent stuff, but I find they still fall apart with large complex codebases and also that the tricks don't translate across codebases.
Yes and no. The broader business context translates well, but each model has its own blind spots and hyperfocuses that you need to massage out.
* Business context - these are things like code quality/robustness, expected spec coverage, expected performance needs, domain specific knowledge. These generally translate well between models, but can vary between code bases. For example, a core monolith is going to have higher standards than a one-off auxiliary service.
* Model focuses - Different models have different tendencies when searching a code base and building up their context. These are specific to each code base, but relatively obvious when they happen. For example, in one code base I work in, one model always seems to pick up our legacy notification system while another model happens to find our new one. It's not really a skill issue. It's just luck of the draw how files are named and how each of them search. They each just find a "valid" notification pattern in a different order.
LLMs are massively helpful for orienting to a new codebase, but it just takes some time to work out those little kinks.
This is like UB in compilers but 100x worse, because there's no spec, it's not even documented, and it could change without a compiler update.
It is nothing at all like UB in a compiler. UB creates invisible bugs that tend to be discovered only after things have shipped. This is code generation. You can just read the code to see what it does, which is what most professionals using LLMs do.
With the volume of code people are generating, no you really can't just read it all. pg recently posted [1] that someone he knows is generating 10kloc/day now. There's no way people are using AI to generate that volume of code and reading it. How many invisible bugs are lurking in that code base, waiting to be found some time in the future after the code has shipped?
[1] https://x.com/paulg/status/1953289830982664236
I read every line I generate and usually adjust things; I'm uncomfortable merging a PR I haven't put my fingerprints on somehow. From the conversations I have with other practitioners, I think this is pretty normal. So, no, I reject your premise.
My premise didn't have anything to do with you, so what you do isn't a basis for rejecting it. No matter what you or your small group of peers do, AI is generating code at a volume that all the developers in the world combined couldn't read if they dedicated 24hrs/day.
He’s not wrong.
Getting 80% of the benefit of LLMs is trivial. You can ask it for some functions or to write a suite of unit tests and you’re done.
The last 20%, while possible to attain, is ultimately not worth it for the amount of time you spend in context hells. You can just do it yourself faster.
> The last 20%, while possible to attain, is ultimately not worth it for the amount of time you spend in context hells. You can just do it yourself faster.
I'm arguing that there's a skill that has to be learned in order to break through this. As you start in a new code base, you should be quick to jump in when you hit that 20%. But, as you spend more time in it, you learn how to avoid the same "context hell" issues and move that number down to 15%, 10%, 5% of the time.
You're still going to need to jump in, but when you can learn to get the LLM to write 95% of the code for you, that's incredibly powerful.
> I'm arguing that there's a skill that has to be learned in order to break through this. As you start in a new code base, you should be quick to jump in when you hit that 20%. But, as you spend more time in it, you learn how to avoid the same "context hell" issues and move that number down to 15%, 10%, 5% of the time.
The problem is that you're learning a skill that will need refinement each time you switch to a new model. You will redo some of this learning on each new model you use.
This actually might not be a problem anyway, as all the models seem to be converging asymptotically towards "programming".
The better they do on the programming benchmarks, the further away from AGI they get.
It's not incredibly powerful, it's incrementally powerful. Getting the first 80% via LLM is already the incredible power. A sufficiently skilled developer should be able to handle the rest with ease. It is not worth doing anything unnatural in an effort to chase down the last 20%; you are just wasting time and atrophying skills. If you can get to the full 95% in some one-shot prompts, great. But don't go chasing waterfalls.
No, it actually has an exponential growth type of effect on productivity to be able to push it to the boundary more.
I’m making this a bit contrived, but I’m simplifying it to demonstrate the underlying point.
When an LLM is 80% effective, I'm limited to doing 5 things in parallel, since I still need to jump in 20% of the time.
When an LLM is 90% effective, I can do 10 things at once. When it's 95%, 20 things. 99%, 100 things.
Now, obviously I can’t actually juggle 10 or 20 things at once. However, the point is there are actually massive productivity gains to be had when you can reduce your involvement in a task from 20% to, even 10%. You’re effectively 2x as productive.
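The back-of-the-envelope arithmetic behind those numbers (obviously an idealized model that ignores review overhead and context switching):
```
# If the model handles a fraction `a` of a task unattended, your attention per task
# is (1 - a), so the number of tasks you can keep in flight is roughly 1 / (1 - a).
for a in (0.80, 0.90, 0.95, 0.99):
    print(f"autonomy {a:.0%} -> ~{1 / (1 - a):.0f} tasks in flight")
# autonomy 80% -> ~5 tasks in flight
# autonomy 90% -> ~10 tasks in flight
# autonomy 95% -> ~20 tasks in flight
# autonomy 99% -> ~100 tasks in flight
```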
Do you understand what parallel means? Most LLM responds in seconds, there is no parallel work for you to do there.
Or do you mean you are using long running agents to do tasks and then review those? I haven't seen such a workflow be productive so far.
I run through a really extensive planning step that generates technical architecture and iterative tasks. I then send an LLM along to implement each step, debugging, iterating, and verifying its work. It's not uncommon for it to take a non-trivial amount of time to complete a step (5+ minutes).
Right now, I still need to intervene enough that I'm not actually doing a second coding project in parallel. I tend to focus on communication, documentation, and other artifacts that support the code I'm writing.
However, I am very close to hitting that point and occasionally do on easier tasks. There's a _very_ real tipping point in productivity when you have confidence that an LLM can accomplish a certain task without your intervention. You can start to do things legitimately in parallel when you're only really reviewing outputs and doing minor tweaks.
I’d bet you don’t even have 2 or 3 things to do at once, much less 100. So it’s pointless to chase those types of coverages.
exactly. people delude themselves thinking this is productivity. Tweaking prompts to get it "right" is very wasteful.
I agree with your assessment about this statement. I actually had to reread it a few times to actually understand it.
He is actually recommending Copilot for price/performance reasons and his closing statement is "Don’t fall for the hype, but also, they are genuinely powerful tools sometimes."
So, it just seems like he never really gave a try at how to engineer better prompts that these more advanced models can use.
The OP's point seems to be: it's very quick for LLMs to be a net benefit to your skills, if they are a benefit at all. That is, he's only speaking of the very beginning part of the learning curve.
The first two points directly contradict each other, too. Learning a tool should have the outcome that one is productive with it. If getting to "productive" is non-trivial, then learning the tool is non-trivial.
> I have never heard anybody successfully using LLMs say this before. Most of what I've learned from talking to people about their workflows is counterintuitive and subtle.
Because for all our posturing about being skeptical and data driven we all believe in magic.
Those "counterintuitive non-trivial workflows"? They work about as well as just prompting "implement X" with no rules, agents.md, careful lists etc.
Because 1) literally no one actually measures whether magical incantations work and 2) it's impossible to make such measurements due to non-determinism
The problem with your argument here is that you're effectively saying that developers (like myself) who put effort into figuring out good workflows for coding with LLMs are deceiving themselves, and are effectively wasting their time.
Either I've wasted significant chunks of the past ~3 years of my life or you're missing something here. Up to you to decide which you believe.
I agree that it's hard to take solid measurements due to non-determinism. The same goes for managing people, and yet somehow many good engineering managers can judge if their team is performing well and figure out what levers they can pull to help them perform better.
That's not a problem, that is the argument. People are bad at measuring their own productivity. Just because you feel more productive with an LLM does not mean you are. We need more studies and less anecdata
I'm afraid all you're going to get from me is anecdata, but I find a lot of it very compelling.
I talk to extremely experienced programmers whose opinions I have valued for many years before the current LLM boom who are now flying with LLMs - I trust their aggregate judgement.
Meanwhile my own https://tools.simonwillison.net/colophon collection has grown to over 120 in just a year and a half, most of which I wouldn't have built at all - and that's a relatively small portion of what I've been getting done with LLMs elsewhere.
Hard to measure productivity on a "wouldn't exist" to "does exist" scale.
> my own https://tools.simonwillison.net/colophon collection has grown to over 120
What in the wooberjabbery is this even.
List of single-commit LLM generated stuff. Vibe coded shovelware like animated-rainbow-border [1] or unix-timestamp [2].
Calling these tools seems to be overstating it.
1: https://gist.github.com/simonw/2e56ee84e7321592f79ceaed2e81b...
2: https://gist.github.com/simonw/8c04788c5e4db11f6324ef5962127...
Cool right? It's my playground for vibe coded apps, except I started it nearly a year before the term "vibe coding" was introduced.
I wrote more about it here: https://simonwillison.net/2024/Oct/21/claude-artifacts/ - and a lot of them have explanations in posts under my tools tag: https://simonwillison.net/tags/tools/
It might also be the largest collection of published chat transcripts for this kind of usage from a single person - though that's not hard since most people don't publish their prompts.
Building little things like this is really effective way of gaining experience using prompts to get useful code results out of LLMs.
> Cool right?
100s of single commit AI generated trash in the likes of "make the css background blue".
On display.
Like it's something.
You can't be serious.
I've been using LLM-assistance for my larger open source projects - https://github.com/simonw/datasette https://github.com/simonw/llm and https://github.com/simonw/sqlite-utils - for a couple of years now.
Also literally hundreds of smaller plugins and libraries and CLI tools, see https://github.com/simonw?tab=repositories (now at 880 repos, though a few dozen of those are scrapers and shouldn't count) and https://pypi.org/user/simonw/ (340 published packages).
Unlike my tools.simonwillison.net stuff the vast majority of those products are covered by automated tests and usually have comprehensive documentation too.
What do you mean by my script?
The whole debate about LLMs and productivity consistently brings the "don't confuse movement with progress" warning to my mind.
But it was already a warning before LLMs because, as you wrote, people are bad at measuring productivity (among many things).
Another problem with it is that you could have said the same thing about virtually any advancement in programming over the last 30 years.
There have been so many "advances" in software development in the last decades - powerful type systems, null safety, sane error handling, Erlang-style fault tolerance, property testing, model checking, etc. - and yet people continue to write garbage code in unsafe languages with underpowered IDEs.
I think many in the industry have absolutely no clue what they're doing and are bad at evaluating productivity, often prioritising short term delivery over longterm maintenance.
LLMs can absolutely be useful but I'm very concerned that some people just use them to churn out code instead of thinking more carefully about what and how to build things. I wish we had at least the same amount of discussions about those things I mentioned above as we have about whether Opus, Sonnet, GPT5 or Gemini is the best model.
> I wish we had at least the same amount of discussions about those things I mentioned above as we have about whether Opus, Sonnet, GPT5 or Gemini is the best model.
I mean we do. I think programmers are more interested in long term maintainable software than its users are. Generally that makes sense, a user doesn't really care how much effort it takes to add features or fix bugs, these are things that programmers care about. Moreover the cost of mistakes of most software is so low that most people don't seem interested in paying extra for more reliable software. The few areas of software that require high reliability are the ones regulated or are sold by companies that offer SLAs or other such reliability agreements.
My observation over the years is that maintainability and reliability are much more important to programmers who comment in online forums than they are to users. It usually comes with the pride of work that programmers have but my observation is that this has little market demand.
> I think programmers are more interested in long term maintainable software than its users are.
Please talk to your users
Users definitely care about things like reliability when they're using actually important software (which probably excludes a lot of startup junk). They may not be able to point to what causes issues, but they obviously do complain when things are buggy as hell.
I'm not the OP and I'm not saying you are wrong, but I am going to point out that the data doesn't necessarily back up significant productivity improvements with LLMs.
In this video (https://www.youtube.com/watch?v=EO3_qN_Ynsk) they present a slide by the company DX that surveyed 38,880 developers across 184 organizations, and found the surveyed developers claiming a 4 hour average time savings per developer per week. So all of these LLM workflows are only making the average developer 10% more productive in a given work week, with a bunch of developers getting less. Few developers are attaining productivity higher than that.
In this video by stanford researchers actively researching productivity using github commit data for private and public repositories (https://www.youtube.com/watch?v=tbDDYKRFjhk) they have a few very important data points in there:
1. They've found zero correlation between how productive respondents claim to be and how productive they actually are when measured, meaning people are poor judges of their own productivity. This does refute the claim in my previous point, but only if you assume people are on average wildly more productive than they claim.
2. They have been able to measure an actual increase in rework and refactoring commits in the repositories measured as AI tools come into heavier use in those organizations. So even with the ability to ship things faster, they are observing an increased number of pull requests that need to fix those previous pushes.
3. They have measured that greenfield low complexity systems have pretty good measurements for productivity gains, but once you get more towards higher complexity systems or brownfield systems they start to measure much lower productivity gains, and even negative productivity with AI tools.
This goes hand in hand with this research paper: https://metr.org/blog/2025-07-10-early-2025-ai-experienced-o... which had experienced devs on significant long-term projects lose productivity when using AI tools, while being convinced that the AI tools were making them even more productive.
Yes, all of these studies have their flaws and nitpicks we can go over that I'm not interested in rehashing. However, there's a lot more data and studies that show AI having very marginal productivity boost compared to what people claim than vice versa. I'm legitimately interested in other studies that can show significant productivity gains in brownfield projects.
> who put effort into figuring out good workflows for coding with LLMs are deceiving themselves, and are effectively wasting their time.
It's quite possible you do. Do you have any hard data justifying the claims of "this works better", or is it just a soft fuzzy feeling?
> The same goes for managing people, and yet somehow many good engineering managers can judge if their team is performing well
It's actually really easy to judge if a team is performing well.
What is hard is finding what actually makes the team perform well. And that is just as much magic as "if you just write the correct prompt everything will just work"
---
wait. why are we fighting again? :) https://dmitriid.com/everything-around-llms-is-still-magical...
So far I've found that the people who are hating on AI are stuck maintaining highly coupled code that they've invested a significant amount of mental energy internalizing. AI is bad on that type of code, and since they've invested so much energy in understanding the code, it ends up taking longer for them to load context and guide the AI than to just do the work. Their code base is tightly coupled hot garbage, and rather than accept that the tools aren't working because of their own lack of architectural rigor, they just shit on the tools. This is part of the reason that the study of open source maintainers using Cursor didn't consistently produce improvement (also, Cursor is pretty mid).
https://www.youtube.com/watch?v=tbDDYKRFjhk&t=4s is one of the largest studies I've seen so far and it shows that when the codebase is small or engineered for AI use, >20% productivity improvements are normal.
On top of this, a lot of the “learning to work with LLMs” is breaking down tasks into small pieces with clear instructions and acceptance criteria. That's just part of working efficiently, but maybe some people don't want to be bothered to do it.
Working efficiently as a team, perhaps, but in solo development this is unnecessary beyond what's needed to document the code.
Even this opens up a whole field of weird subtle workflow tricks people have, because people run parallel asynchronous agents that step on each other in git. Solo developers run teams now!
Really wild to hear someone say out loud "there's no learning curve to using this stuff".
The "learning curve" is reading "experts opinion" on the ever-changing set of magical rituals that may or may not work but trust us it works.
No, you do not need to trust anyone, you can just verify what works and what doesn't, it's very easy.
Indeed. And it's extremely easy to verify my original comment: https://news.ycombinator.com/item?id=44849887
I agree with you, and I have seen this take a few times now in articles on HN, which amounts to the classic "We've tried nothing and we're all out of ideas" Simpsons joke.
I read these articles and I feel like I am taking crazy pills sometimes. The person, enticed by the hype, makes a transparently half-hearted effort for just long enough to confirm their blatantly obvious bias. They then act like they now have ultimate authority on the subject and proclaim that their preconceived notions were definitely true beyond any doubt.
Not all problems yield well to LLM coding agents. Not all people will be able or willing to use them effectively.
But I guess "I gave it a try and it is not for me" is a much less interesting article compared to "I gave it a try and I have proved it is as terrible as you fear".
I've said it before, I feel like I'm some sort of lottery winner when it comes to LLM usage.
I've tried a few things that have mostly been positive. Starting with Copilot's inline "predictive text on steroids", which works really well. It's definitely faster and more accurate than me typing in a traditional IntelliSense IDE. For me, this level of AI is can't-lose: it's very easy to see if a few lines of prediction are what you want.
I then did Cursor for a while, and that did what I wanted as well. Multi-file edits can be a real pain. Sometimes, it does some really odd things, but most of the time, I know what I want, I just don't want to find the files, make the edits on all of them, see if it compiles, and so on. It's a loop that you have to do as a junior dev, or you'll never understand how to code. But now I don't feel I learn anything from it, I just want the tool to magically transform the code for me, and it does that.
Now I'm on Claude. Somehow, I get a lot fewer excursions from what I wanted. I can do much more complex code edits, and I barely have to type anything. I sort of tell it what I would tell a junior dev. "Hey let's make a bunch of connections and just use whichever one receives the message first, discarding any subsequent copies". If I was talking to a real junior, I might answer a few questions during the day, but he would do this task with a fair bit of mess. It's a fiddly task, and there are assumptions to make about what the task actually is.
Somehow, Claude makes the right assumptions. Yes, indeed I do want a test that can output how often each of the incoming connections "wins". Correct, we need to send the subscriptions down all the connections. The kinds of assumptions a junior would understand and come up with himself.
I spend a lot of time with the LLM critiquing, rather than editing. "This thing could be abstracted, couldn't it?" and then it looks through the code and says "yeah I could generalize this like so..." and it means instead of spending my attention on finding things in files, I look at overall structure. This also means I don't need my highest level of attention, so I can do this sort of thing when I'm not even really able to concentrate, eg late at night or while I'm out with the kids somewhere.
So yeah, I might also say there's very little learning curve. It's not like I opened a manual or tutorial before using Claude. I just started talking to it in natural language about what it should do, and it's doing what I want. Unlike seemingly everyone else.
Agreed. This is an astonishingly bad article. It's clear that the only reason it made it to the front page is because people who view AI with disdain or hatred upvoted it. Because as you say: how can anyone make authoritative claims about a set of tools not just without taking the time to learn to use them properly, but also believing that they don't even need to bother?
Pianists' results are well known to be proportional to their talent/effort. In open source hardly anyone is even using LLMs, and the ones that do have barely any output, in many cases less output than they had before using LLMs.
The blogging output on the other hand ...
> In open source hardly anyone is even using LLMs, and the ones that do have barely any output, in many cases less output than they had before using LLMs.
That is not what that paper said, lol.
Which paper? The quoted part is my own observation.
Oh I see, I thought you were quoting https://metr.org/blog/2025-07-10-early-2025-ai-experienced-o... "Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity"
Which shows that LLMs, when given to devs who are inexperienced with LLMs but are very experienced with the code they're working on, don't provide a speedup even though it feels like it.
Which is of course a very constrained scenario. IME the LLM speedup is mostly in greenfield projects using APIs and libraries you're not very experienced with.
Judging from all the comments here, it’s going to be amazing seeing the fallout of all the LLM-generated code in a year or so. The number of people who seemingly relish the ability to stop thinking and let the model generate giant chunks of their code base is, uh, something else lol.
It entirely depends on the exposure and reliability the code needs. Some code is just a one-off to show a customer what something might look like. I don't care at all how well the code works or what it looks like for something like that. Rapid prototyping is a valid use case for that.
I have also written C++ code that has to have a runtime of years, meaning there can be absolutely no memory leaks or bugs whatsoever, or the TV stops working. I wouldn't have a language model write any of that, at least not without testing the hell out of it and making sure it makes sense to me.
It's not all or nothing here. These things are tools and should be used as such.
> It entirely depends on the exposure and reliability the code needs.
Ahh, sweet summer child, if I had a nickel for every time I've heard "just hack something together quickly, that's throwaway code", that ended up being a critical lynchpin of a production system - well, I'd probably have at least like a buck or so.
Obviously, to emphasize, this kind of thing happens all the time with human-generated code, but LLMs make the issue a lot worse because it lets you generate a ton of eventual mess so much faster.
Also, I do agree with your primary point (my comment was a bit tongue in cheek) - it's very helpful to know what should be core and what can be thrown away. It's just in the real world whenever "throwaway" code starts getting traction and getting usage, the powers that be rarely are OK with "Great, now let's rebuild/refactor with production usage in mind" - it's more like "faster faster faster".
In one camp are the fast code slingers, putting something out quickly without long design and planning. They never get it quite right in the first few iterations.
So in the other camp you have seasoned engineers who will have a 5x longer design and planning process. But they also never get it right the first several iterations. And by the time their “properly-engineered” design gets its chance to shine, the business needs already changed.
Or there are those people who were fast code slingers when they began coding, and learned how to design, and now they ship production ready code even faster with rock solid architecture and code quality even after the first iteration.
They exist.
I don't think you read my second paragraph.
> Ahh, sweet summer child, if I had a nickel for every time I've heard "just hack something together quickly, that's throwaway code", that ended up being a critical lynchpin of a production system - well, I'd probably have at least like a buck or so.
Because this is the first pass on any project, any component, ever. Design is done with iterations. One can and should throw out the original rough lynchpin and replace it with a more robust solution once it becomes evident that it is essential.
If you know that ahead of time and want to make it robust early, the answer is still rarely a single diligent one-shot to perfection - you absolutely should take multiple quick rough iterations to think through the possibility space before settling on your choice. Even that is quite conducive to LLM coding - and the resulting synthesis after attacking it from multiple angles is usually the strongest of all. Should still go over it all with a fine toothed comb at the end, and understand exactly why each choice was made, but the AI helps immensely in narrowing down the possibility space.
Not to rag on you though - you were being tongue in cheek - but we're kidding ourselves if we don't accept that like 90% of the code we write is rough throwaway code at first and only a small portion gets polished into critical form. That's just how all design works though.
I would love to work at the places you have been where you are given enough time to throw out the prototype and do it properly. In my almost 20 years of professional experience this has never been the case and prototype and exploratory code has only been given minimal polishing time before reaching production and in use state.
We are all too well aware of the tragedy that is modern software engineering lol. Sadly I too have never seen that situation where I was given enough time to do the requisite multiple passes for proper design...
I have been reprimanded, though, and have collectively spent far longer tediously combing over said quick prototype code than the time originally provided to work on it, as proof of my incompetence! Does that count?
I'm not sure if I could've said this better
Dunno about you, but I find thinking hard… when I offload boilerplate code to Claude, I have more cycles left over to hold the problem in my head and effectively direct the agent in detail.
This makes sense. I find that after 15 to 20 iterations, I get better understanding of what is being done and possible simplifications.
I then manually declare some functions, JSDoc comments for the return types, and imports, and stop halfway. By then the agent is able to think: ha!, you plan to replace all the API calls to this composable under the so-and-so namespace.
It's iterations and context. I don't use them for everything, but I find that they help when my brain bandwidth begins to lag or I just need boilerplate code before engineering the specific use cases.
└── Dey well
Software "engineering" at it's finest
lol yep we've never had codebases hacked together by juniors before running major companies in production - nope, never
I think you are overestimating the quality of code humans generate. I'd take an LLM over the output of any junior to mid-level developer (if they were given the same prompt/ask).
LLMs are basically glorified slot machines. Some people try very hard to come up with techniques or theories about when the slot machine is hot. It's only an illusion; let me tell you, it's random and arbitrary, maybe today is your lucky day, maybe not. Same with AI: learning the "skill" is as difficult as learning how to google or how to check Stack Overflow, i.e. trivial. All the rest is luck and how many coins you have in your pocket.
There's plenty of evidence that good prompts (prompt engineering, tuning) can result in better outputs.
Improving LLM output through better inputs is neither an illusion, nor as easy as learning how to google (entire companies are being built around improving LLM outputs and measuring that improvement).
Sure, but tricks & techniques that work with one model often don't translate or are actively harmful with others. Especially when you compare models from today and 6 or more months ago.
Keep in mind that the first reasoning model (o1) was released less than 8 months ago and Claude Code was released less than 6 months ago.
Yes, though that just means the probability of success is a function of not only user input but also the model version.
Slot machines on the other hand are truly random and success is luck based with no priors (the legal ones in the US anyways)
This is not a good analogy. The parameters of slot machines can be changed to make the casino lose money. Just because something is random, doesn't mean it is useless. If you get 7 good outputs out of 10 from an LLM, you can still use it for your benefit. The frequency of good outputs and how much babysitting it requires determine whether it is worth using or not. Humans make mistakes too, although way less often.
I didn’t say it’s useless.
Learning how to Google is not trivial.
Do you have an entry in your CV saying "proficiency in googling"? It's difficult not because it is complex; it's difficult because Google wants it to be opaque and as hard as possible to figure out.
If anything, getting good information out of Google has become harder for us expert users because Google has tried to make it easier for everyone else.
The power-user tricks like "double quote phrase searches" and exclusion through -term are treated more as gentle guidelines now, because regular users aren't expected to figure them out.
There's always "verbatim" mode, though amusingly that appears to be almost entirely undocumented! I tried using Google to find the official documentation for that feature just now and couldn't do better than their 2011 blog entry introducing it: https://search.googleblog.com/2011/11/search-using-your-term...
Maybe if I was more skilled at Google I'd be able to use it to find documentation on its own features?
So true! About ten years ago Peter Norvig recommended the short Google online course on how to use Google Search: amazing how much one hour of structured learning permanently improved my search skills.
I have used neural networks since the 1980s, and modern LLM tech simply makes me happy, but there are strong limits to what I will use the current tech for.
We know what random* looks like: a coin toss, the roll of a die. Token generation is neither.
Neither are slot machines. But there is a random element and that is more than enough to keep people hooked.
Pseudo-random number generators remain one of the most amazing things in computing IMO. Knuth volume 2. One of my favourite books.
I disagree from almost the first sentence:
> Learning how to use LLMs in a coding workflow is trivial. There is no learning curve. You can safely ignore them if they don’t fit your workflows at the moment.
Learning how to use LLMs in a coding workflow is trivial to start, but you find you get a bad taste early if you don't learn how to adapt both your workflow and its workflow. It is easy to get a trivially good result and then be disappointed in the followup. It is easy to try to start on something it's not good at and think it's worthless.
The pure dismissal of cursor, for example, means that the author didn't learn how to work with it. Now, it's certainly limited and some people just prefer Claude code. I'm not saying that's unfair. However, it requires a process adaptation.
"There's no learning curve" just means this guy didn't get very far up, which is definitely backed up by thinking that Copilot and other tools are all basically the same.
> "There's no learning curve" just means this guy didn't get very far up
Not everyone with a different opinion is dumber than you.
This is all just ignorance. We've all worked with LLMs and know that creating an effective workflow is not trivial and it varies based on the tool.
Define "not trivial". Obviously, experience helps, as with any tool. But it's hardly rocket science.
It seems to me the biggest barrier is that the person driving the tool needs to be experienced enough to recognize and assist when it runs into issues. But that's little different from any sophisticated tool.
It seems to me a lot of the criticism comes from placing completely unrealistic expectations on an LLM. "It's not perfect, therefore it sucks."
As of about three months ago, one of the most important skills in effective LLM coding is coding agent environment design.
If you want to use a tool like Claude Code (or Gemini CLI or Cursor agent mode or Code CLI or Qwen Code) to solve complex problems you need to give them an environment they can operate in where they can solve that problem without causing too much damage if something goes wrong.
You need to think about sandboxing, and what tools to expose to them, and what secrets (if any) they should have access to, and how to control the risk of prompt injection if they might be exposed to potentially malicious sources of tokens.
The other week I wanted to experiment with some optimizations of configurations on my Fly.io hosted containers. I used Claude Code for this by:
- Creating a new Fly organization which I called Scratchpad
- Assigning that a spending limit (in case my coding agent went rogue or made dumb expensive mistakes)
- Creating a Fly API token that could only manipulate that organization - so I could be sure my coding agent couldn't touch any of my production deployments
- Putting together some examples of how to use the Fly CLI tool to deploy an app with a configuration change - just enough information that Claude Code could start running its own deploys
- Running Claude Code such that it had access to the relevant Fly command authenticated with my new Scratchpad API token
With all of the above in place I could run Claude in --dangerously-skip-permissions mode and know that the absolute worst that could happen is it might burn through the spending limit I had set.
This took a while to figure out! But now... any time I want to experiment with new Fly configuration patterns I can outsource much of that work safely to Claude.
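For anyone who wants to reproduce something like this, here is a minimal sketch of the final step, assuming the scoped token lives in a local file and that the Fly CLI picks it up from the FLY_API_TOKEN environment variable; the file path, directory, and org name are illustrative choices of mine, not the commenter's actual setup.

```
# Sketch only: launch Claude Code so it can touch nothing but a throwaway
# "Scratchpad" Fly org. Assumes the scoped token was saved to a local file
# and that the Fly CLI reads it from the FLY_API_TOKEN environment variable.
import os
import subprocess

env = os.environ.copy()
with open(os.path.expanduser("~/.config/scratchpad-fly-token")) as f:
    env["FLY_API_TOKEN"] = f.read().strip()  # token scoped to the Scratchpad org only

subprocess.run(
    ["claude", "--dangerously-skip-permissions"],  # auto-approve; blast radius bounded by token scope + spending limit
    cwd=os.path.expanduser("~/experiments/fly-config-tuning"),  # keep the agent inside the experiment directory
    env=env,
    check=False,
)
```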
The situation you’re outlining is trivial though.
Yea, there’s some grunt work involved but in terms of learned ability all of that is obvious to someone who knew only a little bit about LLMs.
We are going to have to disagree on this one.
I don’t really see how it’s different than how you’d setup someone really junior to have a playground of sorts.
It’s not exactly a groundbreaking line of reasoning that leads one to the conclusion of “I shouldn’t let this non-deterministic system access production servers.”
Now, setting up an LLM so that they can iterate without a human in the loop is a learned skill, but not a huge one.
Yeah, if I want to develop I need tooling around me. Still trivial to learn. Not a difficult skill. Not a skill specific to LLMs.
Why would you need to take all of these additional sandboxing measures if you weren't using an LLM?
For one - I’d say scoped API tokens that prevent messing with resources across logical domains (eg prod vs nonprod, distinct github repos, etc) is best practice in general. Blowing up a resource with a broadly scoped token isn’t a failure mode unique to LLMs.
edit: I don’t have personal experience around spending limits but I vaguely recall them being useful for folks who want to set up AWS resources and swing for the fences, in startups without thinking too deeply about the infra. Again this isn’t a failure mode unique to LLMs although I can appreciate it not mapping perfectly to your scenario above
edit #2: fwict the LLM specific context of your scenario above is: providing examples, setting up API access somehow (eg maybe invoking a CLI?). The rest to me seems like good old software engineering
I usually work with containers for repeatability and portability. Also makes the local env closer to the final prod env.
The statement I responded to was, "creating an effective workflow is not trivial".
There are plenty of useful LLM workflows that are possible to create pretty trivially.
The example you gave is hardly the first thing a beginning LLM user would need. Yes, more sophisticated uses of an advanced tool require more experience. There's nothing different from any other tool here. You can find similar debates about programming languages.
Again, what I said in my original comment applies: people place unrealistic expectations on LLMs.
I suspect that this is at least partly a psychological game people unconsciously play to try to minimize the competence of LLMs, to reduce the level of threat they feel. A sort of variation on terror management theory.
I don’t think anyone expects perfection. Programs crash, drives die, and computers can break anytime. But we expect our tools to be reliable, and not to have to fight with them every day to get them to work.
I don’t have to debug Emacs every day to write code. My CI workflow just runs every time a PR is created. When I type ‘make tests’, I get a report back. None of those things are perfect, but they are reliable.
If you work in a team, you work with other people, whose reliability is more akin to LLMs than to the deterministic processes you're describing.
What you're describing is a case of mismatched expectations.
I'm not a native speaker, but to me that quote doesn't necessarily imply an inability of OP to get up the curve. Maybe they just mean that the curve can look flat at the start?
No, it's sometimes just extremely easy to recognize people who have no idea what they're talking about when they make certain claims.
Just like I can recognize a clueless frontend developer when they say "React is basically just a newer jquery". Recognizing clueless engineers when they talk about AI can be pretty easy.
It's a sector that is both old and new: AI has been around forever, but even people who worked in the sector years ago are taken aback by what is suddenly possible, the workflows that are happening... hell, I've even seen cases where it's the very people who have been following GenAI forever that have a bias towards believing it's incapable of what it can do.
For context, I lead an AI R&D lab in Europe (https://ingram.tech/). I've seen some shit.
Basically, they are the same, they are all LLMs. They all have similar limitations. They all produce "hallucinations". They can also sometimes be useful. And they are all way overhyped.
The amount of misconceptions in this comment are quite profound.
Copilot isn't an LLM, for a start. You _combine_ it with a selection of LLMs. And it absolutely has severe limitations compared to something like Claude Code in how it can interact with the programming environment.
"Hallucinations" are far less of a problem with software that grounds the AI to the truth in your compiler, diagnostics, static analysis, a running copy of your project, runnning your tests, executing dev tools in your shell, etc.
>Copilot isn't an LLM, for a start
You're being overly pedantic here and moving goalposts. Copilot (for coding) without an LLM is pretty useless.
I stand by my assertion that these tools are all basically the same fundamental tech - LLMs.
> I stand by my assertion that these tools are all basically the same fundamental tech - LLMs.
Over generalizing. The synergy between the LLM and the client (cursor, Claude code, copilot, etc) make a huge difference in results.
This is like saying every web app is basically the same fundamental tech - databases.
Or that writing Python with notepad.exe and Jupyter are fundamentally the same.
If it’s not trivial, it’s worthless, because writing things out manually yourself is usually trivial, but tedious.
With LLMs, the point is to eliminate tedious work in a trivial way. If it’s tedious to get an LLM to do tedious work, you have not accomplished anything.
If the work is not trivial enough for you to do yourself, then using an LLM will probably be a disaster, as you will not be able to judge the final output yourself without spending nearly the same amount of time it takes for you to develop the code on your own. So again, nothing is gained, only the illusion of gain.
The reason people think they are more productive using LLMs to tackle non-trivial problems is because LLMs are pretty good at producing “office theatre”. You look like you’re busy more often because you are in a tight feedback loop of prompting and reading LLM output, vs staring off into space thinking deeply about a problem and occasionally scribbling or typing something out.
So, I'd like you to talk to a fair number of emacs and vim users. They have spent hours and hours learning their tools, tweaking their configurations, and learning efficiencies. They adapt their tool to them and themselves to the tool.
We are learning that this is not going to be magic. There are some cases where it shines. If I spend the time, I can put out prototypes that are magic and I can test with users in a fraction of the time. That doesn't mean I can use that for production.
I can try three or four things during a meeting where I am generally paying attention, and look afterwards to see if any of them are worth pursuing.
I can have it work through drudgery if I provide it an example. I can have it propose a solution to a problem that is escaping me, and I can use it as a conversational partner for the best rubber duck I've ever seen.
But I'm adapting myself to the tool and I'm adapting the tool to me through learning how to prompt and how to develop guardrails.
Outside of coding, I can write chicken scratch and provide an example of what I want, and have it write a proposal for a PRD. I can have it break down a task, generate a list of proposed tickets, and after I've gone through them, have it generate them in Jira (or anything else with an API). But the more I invest into learning how to use the tool, the less I have to clean up after.
Maybe one day in the future it will be better. However, the time invested in the tool means that 40 bucks of investment (20 into Cursor, 20 into GPT) can add a 10-15% boost in productivity. Putting 200 into Claude might get you another 10%, and in greenfield and prototyping work it can get you 75%. I bet that agency work can be sped up as much as 40% for that 200-buck investment in Claude.
That's a pretty good ROI.
And maybe some workloads can do even better. I haven't seen it yet but some people are further ahead than me.
vim and Emacs are owned by the developer who configures them. LLMs are products whose capabilities are subject to the whims of their host. These are not the same things.
Everything you mentioned is also fairly trivial, just a couple of one shot prompts needed.
> Learning how to use LLMs in a coding workflow is trivial. There is no learning curve. [...]
LLMs will always suck at writing code that has not been written millions of times before. As soon as you venture slightly off-road, they falter.
That right there is your learning curve! Getting LLMs to write code that's not heavily represented in their training data takes experience and skill and isn't obvious to learn.
I’m still waiting for someone who claims prompting is such a skill to learn to explain, just once, a single technique that is not obvious. Like: storing checkpoints to go back to a working version (already a good practice without LLMs, see: git), or launching 10 tabs with slightly different prompts and choosing the best, or asking the LLM to improve my prompt, or adding more context… is that a skill? I remember when I was a child that my mom thought programming the VCR to record the night show was such a feat…
In my experience, it's not just prompting that needs to be figured out; it's a whole new workstyle that works for you, your technologies, and even your current project. As an example, I write almost all my code functional-programming style, which I rarely did before. This lets me keep my prompts and context very focused, and it essentially eliminates hallucinations.
Also, I started in the pre-agents era and so I ended up with a pair-programming paradigm. Now every time I conceptualize a new task in my head -- whether it is a few lines of data wrangling within a function, or an entire feature complete with integration tests -- I instinctively do a quick prompt-vs-manual-coding evaluation and seamlessly jump to AI code generation if the prompt "feels" more promising in terms of total time and probability of correctness.
I think one of the skills is learning this kind of continuous evaluation and the judgement that goes with it.
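As a tiny illustration of that "small functional unit" idea (a made-up example of mine, not from the comment): a pure function with explicit inputs and outputs is the kind of narrow target you can hand to an LLM, or ask it to write tests for, without dragging in much surrounding context.

```
# Made-up example of a small, pure, self-contained unit: easy to specify in a
# short prompt, easy to test, no hidden state for the model to hallucinate.
from collections import Counter


def top_n_words(text: str, n: int = 5) -> list[tuple[str, int]]:
    """Return the n most frequent lowercase words and their counts."""
    words = [w.strip(".,!?;:\"'").lower() for w in text.split()]
    return Counter(w for w in words if w).most_common(n)


# Quick check of the contract.
assert top_n_words("The cat sat on the mat", 1) == [("the", 2)]
```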
You may not consider it a skill, but I train multiple programming agents on different production and quality code bases, and have all of them pr review a change, with a report given at the end.
It helps dramatically in finding bugs and issues. Perhaps that's trivial to you, but it feels novel, as we've only had effective agents for the last couple of weeks.
But give an example? What did you do that you consider a difficult skill to learn?
Usually when you learn difficult skills, you can go to a trainer, take a class, or read about the solutions.
Right now, you are left with random, flawed information on the internet that you often can't reproduce in your own trials, or with your own structured ideas on how to improve things.
That is difficult. It is difficult to take the information available right now, and come up with a reasonable way to improve the performance of LLMs through your ingenuity.
At some point it will be figured out, and every corporation will be following the same ideal setup, but at the moment it is a green field opportunity for the human brain to come up with novel and interesting ideas.
Thanks. So the skill is figuring out heuristics? That is not even related to AI or LLMs. But as I said, it's like learning how to google, which is exactly that: trial and error until you figure out what Google prefers.
I mean, it's definitely related. We have this tool that we know can perform better with the right software around it. Building that software is challenging. Knowing what to build, testing it.
I believe that's difficult, and not just what google prefers. I guess we feel differently about it.
See my comment here about designing environments for coding agents to operate in: https://news.ycombinator.com/item?id=44854680
Effective LLM usage these days is about a lot more than just the prompts.
If you have a big rock (a software project), there's quite a difference between pushing it uphill (LLM usage) and hauling it up with a winch (traditional tooling and methods).
People are claiming that it takes time to build the muscles and train the correct footing to push, while I'm here learning mechanical theory and drawing up levers. If one manages to push the rock for one meter, he comes clamoring, ignoring the many who were injured doing so, saying that one day he will be able to pick the rock up and throw it at the moon.
Then there are those who are augmenting their winch with LLM usage.
I'd describe LLM usage as the winch and LLM avoidance as insisting on pushing it up hill without one.
Simon, I have mad respect for your work but I think on this your view might be skewed because your day to day work involves a codebase where a single developer can still hold the whole context in their head. I would argue that the inadequacies of LLMs become more evident the more you have to make changes to systems that evolve at the speed of 15+ concurrent developers.
One of the things I'm using LLMs for a lot right now is quickly generating answers about larger codebases I'm completely unfamiliar with.
Anything up to 250,000 tokens I pipe into GPT-5 (prior to that o3), and beyond that I'll send them to Gemini 2.5 Pro.
For even larger code than that I'll fire up Codex CLI or Claude Code and let them grep their way to an answer.
This stuff has gotten good enough now that I no longer get stuck when new tools lack decent documentation - I'll pipe in just the source code (filtered for .go or .rs or .c files or whatever) and generate comprehensive documentation for myself from scratch.
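As a rough sketch of that "pipe in just the source code" step (the extension list, output file, and the roughly-4-characters-per-token estimate are my assumptions, not a description of any particular tool):

```
# Concatenate source files with chosen extensions into one text blob and print
# a crude token estimate (~4 characters per token) so you know whether it fits
# a long-context model or needs an agent that greps instead.
from pathlib import Path

EXTENSIONS = {".go", ".rs", ".c", ".h"}


def bundle(repo_root: str) -> str:
    parts = []
    for path in sorted(Path(repo_root).rglob("*")):
        if path.is_file() and path.suffix in EXTENSIONS:
            parts.append(f"\n--- {path} ---\n{path.read_text(errors='ignore')}")
    return "".join(parts)


if __name__ == "__main__":
    text = bundle(".")
    print(f"roughly {len(text) // 4} tokens")
    Path("bundle.txt").write_text(text)  # paste or pipe this into the model of your choice
```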
Don't you see how this opens up a blindspot in your view of the code?
You don't have the luxury of having someone who is deeply familiar with the code sanity-check your perceived understanding of it, i.e. you don't see where the LLM is horribly off track because you don't have sufficient understanding of that code to spot the error. In enterprise contexts this is very common, though, so it's quite likely that a lot of the haters here have seen PRs submitted by vibecoders to their own work that were inadequate enough that they started to blame the tool. For example, I have seen someone reinvent the wheel of the session handling of a client library because they were unaware that the existing session came batteries included, and the LLM didn't hesitate to write the code again for them. The code worked, everything checked out, but because the developer didn't know what they didn't know, they submitted a janky mess.
The LLMs go off track all the time. I spot that when I try putting what I've learned from them into action.
This just sounds 1:1 equivalent to "there are things LLMs are good for and things LLMs are bad for."
I'll bite.
What are those things that they are good for? And consistently so?
As someone who leans more towards the side of LLM-sceptiscism, I find Sonnet 4 quite useful for generating tests, provided I describe in enough detail how I want the tests to be structured and which cases should be tested. There's a lot of boilerplate code in tests and IMO because of that many developers make the mistake of DRYing out their test code so much that you can barely understand what is being tested anymore. With LLM test generation, I feel that this is no longer necessary.
Aren’t tests supposed to be premises (ensure the initial state is correct), compute (run the code), and assertions (verify the resulting state and output)? If your test code is complex, most of it should be moved into harness and helper functions. Writing more complex test code isn’t particularly useful.
I didn't say complex, I said long.
If you have complex objects and you're doing complex operations on them, then setup code can get rather long.
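For what it's worth, here is a minimal sketch of that premises / compute / assertions shape, written deliberately "long" rather than DRYed out so each case reads on its own; the Cart class is a made-up stand-in, not anything from the thread.

```
# Made-up example: verbose, self-contained test cases in the
# premises -> compute -> assertions shape discussed above.
class Cart:
    def __init__(self):
        self.items = []

    def add(self, name, price, qty=1):
        self.items.append((name, price, qty))

    def total(self):
        return sum(price * qty for _, price, qty in self.items)


def test_total_of_empty_cart_is_zero():
    # premises: a fresh cart
    cart = Cart()
    # compute
    total = cart.total()
    # assertions
    assert total == 0


def test_total_sums_price_times_quantity():
    # premises: two line items, set up inline rather than via shared fixtures
    cart = Cart()
    cart.add("coffee", 3.50, qty=2)
    cart.add("bagel", 2.00)
    # compute
    total = cart.total()
    # assertions
    assert total == 9.00
```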
LLM-driven coding can yield awesome results, but you will be typing a lot and, as the article states, it requires an already well-structured codebase.
I recently started a fresh project, and until I got to the desired structure I only used AI to ask questions or get suggestions. I organized and wrote most of the code myself.
Once it started to get into the shape that felt semi-permanent to me, I started a lot of queries like:
```
- Look at existing service X at folder services/x
- see how I deploy the service using k8s/services/x
- see how the docker file for service X looks like at services/x/Dockerfile
- now, I started service Y that does [this and that]
- create all that is needed for service Y to be skaffolded and deployed, follow the same pattern as service X
```
And it would go, read existing stuff for X, then generate all of the deployment/monitoring/readme/docker/k8s/helm/skaffold for Y
With zero to no mistakes. Both Claude and Gemini are more than capable of such a task. I had both of them generate 10-15 files with no errors, with code that could be deployed right after (of course the service will just answer and not do much more than that).
Then, I will take over again for a bit, do some business logic specific to Y, then again leverage AI to fill in missing bits, review, suggest stuff etc.
It might look slow, but it actually cuts out the most boring and most error-prone steps when developing a medium to large k8s-backed project.
My workflow with a medium-sized iOS codebase is a bit like that. By the time everything works and is up to my standards, I've usually taken longer, or almost as long, as if I'd written everything manually. That's with Opus-only Claude Code. It's complicated stuff (structured concurrency and lots of custom AsyncSequence operators) which maybe CC just isn't suitable for.
Whipping up greenfield projects is almost magical, of course. But that’s not most of my work.
Deeply curious to know if this is an outlier opinion, a mainstream but pessimistic one, or the general consensus. My LinkedIn feed and personal network certainly suggests that it's an outlier, but I wonder if the people around me are overly optimistic or out of synch with what the HN community is experiencing more broadly.
My impression has been that in corporate settings (and I would include LinkedIn in that) AI optimism is basically used as virtue signaling, making it very hard to distinguish people who are actually excited about the tech from people wanting to be accepted.
My personal experience has been that AI has trouble keeping the scope of a change small and targeted. I have only been using Gemini 2.5 Pro though, as we don't have access to other models at my work. My friend tells me he uses Claude for coding and Gemini for documentation.
I reckon this opinion is more prevalent than the hyped blog posts and news stories suggest; I've been asking this exact question of colleagues and most share the sentiment, myself included, albeit not as pessimistic.
Most people I've seen espousing LLMs and agentic workflows as a silver bullet have limited experience with the frameworks and languages they use with these workflows.
My view currently is one of cautious optimism; that LLM workflows will get to a more stable point whereby they ARE close to what the hype suggests. For now, that quote that "LLMs raise the floor, not the ceiling" I think is very apt.
LinkedIn is full of BS posturing, ignore it.
I think it’s pretty common among people whose job it is to provide working, production software.
If you go by MBA types on LinkedIn that aren’t really developers or haven’t been in a long time, now they can vibe out some react components or a python script so it’s a revolution.
Hi, my job is building working production software (these days heavily LLM assisted). The author of the article doesn't know what they're talking about.
Which part of the opinion?
I tend to strongly agree with the "unpopular opinion" about the IDEs mentioned versus CLI (specifically, aider.chat and Claude Code).
Assuming (this is key) you have mastery of the language and framework you're using, working with the CLI tool in 25 year old XP practices is an incredible accelerant.
Caveats:
- You absolutely must bring taste and critical thinking, as the LLM has neither.
- You absolutely must bring systems thinking, as it cannot keep deep weirdness "in mind". By this I mean the second and third order things that "gotcha" about how things ought to work but don't.
- Finally, you should package up everything new about your language or frameworks since a few months or a year before the knowledge cutoff date, and include a condensed synthesis in your context (e.g., Swift 6 and 6.1 versus Swift 5.10 and 2024's WWDC announcements, which are all GPT-5 knows).
For this last one I find it useful to (a) use OpenAI's "Deep Research" to first whitepaper the gaps, then another pass to turn that into a Markdown context prompt, and finally bring that over to your LLM tooling to include as needed when doing a spec or in architect mode. Similarly, (b) use repomap tools on dependencies if creating new code that leverages those dependencies, and have that in context for that work.
I'm confused why these two obvious steps aren't built into leading agentic tools, but maybe handling the LLM as a naive and outdated "Rain Man" type doesn't figure into mental models at most KoolAid-drinking "AI" startups, or maybe vibecoders don't care, so it's just not a priority.
Either way, context based development beats Leroy Jenkins.
> use repomap tools on dependencies if creating new code that leverages those dependencies, and have that in context for that work.
It seems to me that currently there are 2 schools of thought:
1. Use repomap and/or LSP to help the models navigate the code base
2. Let the models figure things out with grep
Personally, I am 100% a grep guy, and my editor doesn't even have LSP enabled. So, it is very interesting to see how many of these agentic tools do exactly the same thing.
And Claude Code /init is a great feature that basically writes down the current mental model after the initial round of grep.
I agree with the 2 schools, but reach a different conclusion:
Each strategy leaves its own big gaps and requires context or prompt work to compensate.
They should use 1 to keep the overall lay of the land, and 2 before writing any code.
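To make the first school concrete, here is a toy repo-map sketch: it walks a hypothetical vendored dependency and lists each file's top-level classes and functions, producing a compact outline you can drop into the model's context. Real repo-map tooling is far more sophisticated; this is just the idea, and the path is illustrative.

```
# Toy repo map: one line per Python file listing its top-level classes and
# functions, as a cheap "lay of the land" to paste into an LLM's context.
import ast
from pathlib import Path


def repo_map(root: str) -> str:
    lines = []
    for path in sorted(Path(root).rglob("*.py")):
        try:
            tree = ast.parse(path.read_text(errors="ignore"))
        except SyntaxError:
            continue  # skip files the parser can't handle
        symbols = [
            node.name
            for node in tree.body
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
        ]
        if symbols:
            lines.append(f"{path}: {', '.join(symbols)}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(repo_map("vendored/some_dependency"))  # path is illustrative
```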
LinkedIn posts seem like an awful source. The people I see posting for themselves there are either pre-successful or just very fond of personal branding.
Speaking to actual humans IRL (as in, non-management colleagues and friends in the field), people are pretty lukewarm on AI, with a decent chunk of them who find AI tooling makes them less productive. I know a handful of people who are generally very bullish on AI, but even they are nowhere near the breathless praise and hype you read about here and on LinkedIn, they're much more measured about it and approach it with what I would classify as common sense. Of course this is entirely anecdotal, and probably depends where you are and what kind of business you're in, though I will say I'm in a field where AI even makes some amount of sense (customer support software), and even then I'm definitely noticing a trend of disillusionment.
On the management side, however, we have all sorts of AI mandates, workshops, social media posts hyping our AI stuff, our whole "product vision" is some AI-hallucinated nightmare that nobody understands, you'd genuinely think we've been doing nothing but AI for the last decade the way we're contorting ourselves to shove "AI" into every single corner of the product. Every day I see our CxOs posting on LinkedIn about the random topic-of-the-hour regarding AI. When GPT-5 launched, it was like clockwork, "How We're Using GPT-5 At $COMPANY To Solve Problems We've Never Solved Before!" mere minutes after it was released (we did not have early access to it lol). Hilarious in retrospect, considering what a joke the launch was like with the hallucinated graphs and hilarious errors like in the Bernoulli's Principle slide.
Despite all the mandates and mandatory shoves coming from management, I've noticed the teams I'm close with (my team included) are starting to push back themselves a bit. They're getting rid of the spam generating PR bots that have never, not once, provided a useful PR comment. People are asking for the various subscriptions they were granted be revoked because they're not using them and it's a waste of money. Our own customers #1 piece of feedback is to focus less on stupid AI shit nobody ever asked for, and to instead improve the core product (duh). I'm even seeing our CTO who was fanboy number 1 start dialing it back a bit and relenting.
It's good to keep in mind that HN is primarily an advertisement platform for YC and their startups. If you check YC's recent batches, you would think that the 1 and only technology that exists in the world is AI, every single one of them mentions AI in one way or another. The majority of them are the lowest effort shit imaginable that just wraps some AI APIs and is calling it a product. There is a LOT of money riding on this hype wave, so there's also a lot of people with vested interests in making it seem like these systems work flawlessly. The less said about LinkedIn the better, that site is the epitome of the dead internet theory.
I think that beyond the language used, the article does have some points I agree with. In general, LLMs code better in languages that are more easily available online, where they can be trained on a larger amount of source code. Python is not the same as PL/I (I don't know if you've tried it, but with the latter, they don't know the most basic conventions used in its development).
When it is mentioned that LLMs "have terrible code organization skills", I think this mainly comes down to context size. Developing a module with hundreds of LoC is not the same as developing one with thousands or tens of thousands.
I am not very convinced by the skill-degradation argument; I am not aware of a study that validates it. On the other hand, it is true that agents are constantly evolving, and I don't see any difficulties that cannot be overcome in the current evolutionary race, given that, in the end, coding is one of the most accessible tasks for artificial intelligence.
People that comment on and get defensive about this bit:
> Learning how to use LLMs in a coding workflow is trivial. There is no learning curve. You can safely ignore them if they don’t fit your workflows at the moment.
How much of your workflow or intuition from 6 months ago is still relevant today? How long would it take to learn the relevant bits today?
Keep in mind that Claude Code was released less than 6 months ago.
A fraction of the LLM maximalists are being defensive because they don't want to consider that they may have invested too much time in those tools, given what said tools are currently genuinely good at.
Pretty much all of the intuition I've picked up about getting good results from LLMs has stayed relevant.
If I was starting from fresh today I expect it would take me months of experimentation to get back to where I am now.
Working thoughtfully with LLMs has also helped me avoid a lot of the junk tips ("Always start with 'you are the greatest world expert in X', offer to tip it, ...") that are floating around out there.
All of the intuition? Definitely not my experience. I have found that optimal prompting differs significantly between models, especially when you look at models that are 6 months old or older (the first reasoning model, o1, is less than 8 months old).
Speaking mostly from experience of building automated, dynamic data processing workflows that utilize LLMs:
Things that work with one model, might hurt performance or be useless with another.
Many tricks that used to be necessary in the past are no longer relevant, or only applicable for weaker models.
This isn't me dismissing anyone's experience. It's OK to do things that become obsolete fairly quickly, especially if you derive some value from them. If you try to stay on top of a fast-moving field, it's almost inevitable. I would not consider it a waste of time.
Hell, my workflow isn't the same two weeks ago when subagents were released.
Opening the essay with "Learning how to use LLMs in a coding workflow is trivial" and closing with Copilot as the suggested AI agent is the worst take on LLM coding I've ever seen.
I have built many pipelines integrating LLMs to drive real $ results, and I think this article boils it down too simply. But I always remember: if the LLM is the most interesting part of your work, something is severely wrong and you probably aren't adding much value.

Context management based on some aspects of your input is where LLMs get good, but you need to do lots of experimentation to tune anything. Most cases I have seen are about developing one pipeline to fit hundreds of extremely different cases; the LLM does not solve this problem but basically serves as an approximator that lets you discretize previously large problems into some information subspace where you can treat the infinite set of inputs as something you know. LLMs are like a lasso (a better or worse one than traditional lassos depending on the use case), but once you get your catch you still need to process it and deal with it programmatically to solve some greater problem.

I hate how so many LLM-related articles and comments say "AI is useless, throw it away, don't use it" or "AI is the future, if we don't do it now we're doomed, let's integrate it everywhere, it can solve all our problems". Can anyone pick a happy medium? Maybe that's what being in a bubble looks like.
>I made a CLI logs viewers and querier for my job, which is very useful but would have taken me a few days to write (~3k LoC)
I recall The Mythical Man-Month stating a rough calculation that the average software developer writes about 10 net lines of new, production-ready code per day. For a tool like this going up an order of magnitude to about 100 lines of pretty good internal tooling seems reasonable.
OP sounds a few cuts above the 'average' software developer in terms of skill level. But here we also need to point out a CLI log viewer and querier is not the kind of thing you actually needed to be a top tier developer to crank out even in the pre-LLM era, unless you were going for lnav [1] levels of polish.
[1]: https://lnav.org/
A lot of the Mythical Man-Month is timeless, but for a stat like that, it really is worth bearing in mind the book was written half a century ago about developers working on 1970s mainframes.
Yeah, I think that metric has grown to about 20 lines per day using 2010s-era languages and methods. So maybe we could think of LLM usage as an attempt to bring it back down to 10 per day.
So many articles should prepend “My experience with ...” to their title. Here is OP's first sentence: “I spent the past ~4 weeks trying out all the new and fancy AI tools for software development.” Dude, you have had some experiences and they are worth writing up and sharing. But your experiences are not a stand-in for "the current state." This point applies to a significant fraction of HN articles, to the point that I wish the headlines were flagged “blog”.
Clickbait gets more reach. It's an unfortunate thing. I remember Veritasium in a video even saying something along the lines of him feeling forced to do clickbaity YouTube because it works so well.
The reach is big enough to not care about our feelings. I wish it wasn't this way.
Interesting read, but strange to totally ignore the macOS ChatGPT app that optionally integrates with a terminal session, the currently opened VS Code editor tab, Xcode, etc. I use this combination at least 2 or 3 times a month, and even if my monthly use is less than 40 minutes total, it is a really good tool to have in your toolbelt.
The other thing I disagree with is the coverage of gemini-cli: if you use gemini-cli for a single long work session, then you must set your Google API key as an environment variable when starting gemini-cli; otherwise you end up, after a short while, on Gemini 2.5 Flash, and that leads to unhappy results. So use gemini-cli for free for short, focused 3- or 4-minute work sessions and you are good, or pay for longer work sessions, and you are good.
I do have a random off topic comment: I just don’t get it: why do people live all day in an LLM-infused coding environment? LLM based tooling is great, but I view it as something I reach for a few times a day for coding and that feels just right. Separately, for non-coding tasks, reaching for LLM chat environments for research and brainstorming is helpful, but who really needs to do that more than once or twice a day?
> By being particularly bad at anything outside of the most popular languages and frameworks, LLMs force you to pick a very mainstream stack if you want to be efficient.
Do they? I’ve found Clojure-MCP[1] to be very useful. OTOH, I’m not attempting to replace myself, only augment myself.
1: https://github.com/bhauman/clojure-mcp
Thanks for the link! I used to use Clojure a lot professionally, but now just for fun projects, and to occasionally update my old Clojure book. I had bookmarked Clojure-MCP a while ago, but never got back to it but I will give it a try.
I like your phrasing of “OTOH, I’m not attempting to replace myself, only augment myself.” because that is my personal philosophy also.
I think we're still in the gray zone of the "Incessant Obsolescence Postulate" (the Wait Calculation). Are you better off "skilling up" on the tech as it is today, or waiting for it to just "get better" so by the time you kick off, you benefit from the solved-problems X years from now. I also think this calculation differs by domain, skill level, and your "soft skill" abilities to communicate, explain and teach. In some domains, if you're not already on this train, you won't even get hired anymore.
The current state of LLM-driven development is already several steps down the path of an end-game where the overwhelming majority of code is written by the machine; our entire HCI for "building" is going to be so far different to how we do it now that we'll look back at the "hand-rolling code era" in a similar way to how we view programming by punch-cards today. The failure modes, the "but it SUCKS for my domain", the "it's a slot machine" etc etc are not-even-wrong. They're intermediate states except where they're not.
The exceptions to this end-game will be legion and exist only to prove the end-game rule.
Relying on LLM for any skill, especially programming, is like cutting your own healthy legs and buying crutches to walk. Plus you now have to pay $49/month for basic walking ability and $99/month for "Walk+" plan, where you can also (clumsily) jog.
It's more like strapping on an exoskeleton suit with a jetpack.
It makes your existing strength and mobility greater, but don't be surprised if you fly into space that you will suffocate,
or if you fly over an ocean and run out of gas, that you'll sink to the bottom,
or if you fly the suit in your fine glassware shop with patrons in the store, that you're going to break and burn everything and everyone in there.
There are a lot of skills which I haven't developed because I rely on external machines to handle it for me; memorization, fire-starting, navigation. On net, my life is better for it. LLMs may or may not be as effective at replacing code development as books have been at replacing memorization and GPS has been at replacing navigation, but eventually some tool will be and I don't think I'll be worse off for developing other skills.
GPS is a particularly good analog... Lose it for any reason and suddenly you are helpless without backup navigation aids. But a compass, paper map, watch, and sextant will still work!
Why would I pay you to walk with crutches when I can just get crutches and walk myself?
I would actually disagree with the final conclusion here; despite claiming to offer the same models, Copilot seems very much nerfed — cross-comparing the Copilotified LLM and the same LLM through OpenRouter, the Copilot one seems to fail much harder. I'm not an expert in the details of LLMs but I guess there might be some extra system prompt, I also notice the context window limit is much lower, which kinda suggests it's been partially pre-consumed.
In case it matters, I was using Copilot that is for 'free' because my dayjob is open source, and the model was Claude Sonnet 3.7. I've not yet heard anyone else saying the same as me which is kind of peculiar.
Good read. I just want to point out that LLMs seem to write better React code, but as an experienced frontend developer my opinion is that they're also bad at React. Their approach is outdated, as it doesn't follow the latest guidelines. They write React as I would have written it in 2020. So as usual, you need to feed the right context to get proper results.
I don't agree. Cursor is mind-blowingly good with the new agentic updates.
I find all AI coding goes something like this algorithm:
* I let the AI do something
* I find a bad bug or horrifying code
* I realize I gave it too much slack
* hand code for a while
* go back to narrow prompts
* get lazy, review code a bit less, add more complexity
* GOTO 1, hopefully with a better instinct for where/how to trust this model
Then over time you hone your instinct on what to delegate and what to handle yourself. And how deeply to pay attention.
OP did miss the VS Code extension for Claude Code. It is still terminal based, but:
- it shows you the diff of the incoming changes in VS Code (like git)
- it knows the line you selected in the editor for context
I have not tried every IDE/CLI or model, only a few, mostly Claude and Qwen.
I work mostly in C/C++.
The most valuable improvement of using these kinds of tools, for me, is being able to easily find help when I have to work on boring/tedious tasks, or when I want to have a Socratic conversation about a design idea with a not-so-smart but extremely knowledgeable colleague.
But for anything requiring a brain, it is almost useless.
Does not mention the actual open source solution that has autocomplete, chat, a planner, and agents, lets you bring your own keys, connect to any LLM provider, customize anything, and rewrite all the prompts and tools.
https://github.com/continuedev/continue
This article makes me wanna try building a token field in Flutter using an LLM chat or agent. Chat should be enough. A few iterations to get the behaviour and the tests right. A bit of style to make it look Apple-nice. As if a regular dev would do much better/quicker for this use case; such a bad example IMO, I don't buy it.
> By being particularly bad at anything outside of the most popular languages and frameworks, LLMs force you to pick a very mainstream stack if you want to be efficient.
I haven't found that to be true with my most recent usage of AI. I do a lot of programming in D, which is not popular like Python or JavaScript, but Copilot knows it well enough to help me with things like templates, metaprogramming, and interoperating with GCC-produced DLLs on Windows. This is true in spite of the lack of a big pile of training data for these tasks. Importantly, it gets just enough things wrong when I ask it to write code for me that I have to understand everything well enough to debug it.
Strange post. It reads in part like an incoherent rant and in part like a well-made analysis.
It's mostly on point, though. In recent years I've been assigned to manage and plan projects at work, and I think the skills I've learnt from that greatly help in getting effective results from an LLM.
I have a biased opinion since I currently work for a background-agent startup, but there are more (and better!) options out there than Jules and Copilot that might address some of the author's issues.
And those mythical better tools that you didn't even bother to mention are?
Presumably if they did, they would be accused of promoting their startup :)
But he said there are more and better out there. "More" implies more than one :)
And promoting your own startup is usually okay if it's phrased okay :)
By no means are better background agents "mythical" as you claim. I didn't bother to mention them as it is easy enough to search for asynchronous/background agents yourself.
Devin is perhaps the one that is most fully featured and I believe has been around the longest. Other examples that seem to be getting some attention recently are Warp, Cursor's own background agent implementation, Charlie Labs, Codegen, Tembo, and OpenAI's Codex.
I do not work for any of the aforementioned companies.
> as it is easy enough to search for asynchronous/background agents yourself.
Ah yes. An unverifiable claim followed by "just google them yourself".
> Devin is perhaps the one that is most fully featured and I believe has been around the longest.
And it has been hilariously bad the longest. Is it better now? Maybe? I don't really know anyone even mentioning Devin anymore.
> examples that seem to be getting some attention recently
So, "some attention", but you could "easily find them by searching".
> Charlie Labs, Codegen, Tembo
Never heard of them, but will take a look.
See how easy it was to mention them?
>Ah yes. An unverifiable claim followed by "just google them yourself".
Some agent scaffolding performs better on benchmarks than others given the same underlying base model - see SWE Bench and Terminal Bench for examples.
Some may find certain background agents better than others simply because of UX. Some background agents have features that others don't - like memory systems, MCP, 3rd party integrations, etc.
I maintain it is easy to search for examples of background coding agents that are not Jules or Copilot. For me, searching "background coding agents" on Google or DuckDuckGo returns some of the other examples that I mentioned.
"LLMs won’t magically make you deliver production-ready code"
Either I'm extremely lucky, or I was lucky enough to find the guy who said it must all be test driven and guided by the usual principles of DRY, etc. Claude Code works absolutely fantastically nine out of ten times, and when it doesn't, we just roll back the three hours of nonsense it did, postpone this feature, or give it extra guidance.
I'm beginning to suspect robust automated tests may be one of the single strongest indicators of whether you're going to have a good time with LLM coding agents or not.
If there's a test suite for the thing to run, it's SO much less likely to break other features when it's working. Plus it can read the tests and use them to get a good idea of how everything is supposed to work already.
Telling Claude to write the test first, then execute it and watch it fail, then write the implementation has been giving me really great results.
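To make that concrete, here is a minimal sketch of that test-first loop in TypeScript with Vitest; the `slugify` helper and the file names are made up for illustration, not taken from the post. The test is written (and run, to watch it fail) before the model is asked to produce the implementation.

```typescript
// slugify.test.ts: written first; run `npx vitest run` and watch it fail
// before any implementation exists.
import { describe, expect, it } from "vitest";
import { slugify } from "./slugify";

describe("slugify", () => {
  it("lowercases and replaces spaces with dashes", () => {
    expect(slugify("Hello World")).toBe("hello-world");
  });

  it("drops characters that are not alphanumeric", () => {
    expect(slugify("Rock & Roll!")).toBe("rock-roll");
  });
});

// slugify.ts: only once the failing test is in place does the model get asked
// to write this; the test run then confirms (or refutes) its attempt.
export function slugify(input: string): string {
  return input
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-") // collapse runs of non-alphanumerics into one dash
    .replace(/^-+|-+$/g, "");    // trim leading/trailing dashes
}
```

The specific code matters less than the ordering: the agent gets a failing test to satisfy, and the same suite is what keeps later changes from silently breaking this behaviour.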
> By being particularly bad at anything outside of the most popular languages and frameworks, LLMs force you to pick a very mainstream stack if you want to be efficient.
Almost like hiring and scaling a team? There are also benchmarks that specifically measure this, and it's in theory a very temporary problem (the Aider Polyglot Benchmark is one such).
My favorite setup so far is using the Claude code extension in VScode. All the power of CC, but it opens files and diffs in VScode. Easy to read and modify as needed.
There are kind of a lot of errors in this piece. For instance, the problem the author had with Gemini CLI running out of tokens in ten minutes is what happens when you don't set up a (free) API key in your environment.
There's an IntelliJ extension for GitHub Copilot.
It’s not perfect but it’s okay.
Yeah for my uses it works fine. Not sure why OP thinks Copilot Chat doesn't exist anywhere but VSCode...
It may not be perfect, but IntelliJ beats VS Code on so many other levels that I don't understand why everyone keeps creating clones of the latter.
> By being particularly bad at anything outside of the most popular languages and frameworks, LLMs force you to pick a very mainstream stack if you want to be efficient.
I use clojure for my day-to-day work, and I haven't found this to be true. Opus and GPT-5 are great friends when you start pushing limits on Clojure and the JVM.
> Or 4.1 Opus if you are a millionaire and want to pollute as much possible
I know this was written tongue-in-cheek, but at least in my opinion it's worth it to use the best model if you can. Opus is definitely better on harder programming problems.
> GPT 4.1 and 5 are mostly bad, but are very good at following strict guidelines.
This was interesting. At least in my experience GPT-5 seemed about as good as Opus. I found it to be _less_ good at following strict guidelines though. In one test Opus avoided a bug by strictly following the rules, while GPT-5 missed.
> Learning how to use LLMs in a coding workflow is trivial. There is no learning curve. You can safely ignore them if they don’t fit your workflows at the moment.
I'm sorry, but I disagree with this claim. That is not my experience, nor many others'. It's true that you can make them do something without learning anything. However, it takes time to learn what they are good and bad at, what information they need, and what nonsense they'll do without express guidance. It also takes time to know what to look for when reviewing results.
I also find that they work fine for languages without static types. You do need tests, yes, but you need them anyway.
"If an(y) LLM could operate on your codebase without much critical issues, then your architecture is sound" - revskill
"Google’s enshittification has won and it looks like no competent software developers are left. I would know, many of my friends work there". Ouch ... I hope his friends are in marketing!
They missed OpenAI Codex, maybe deliberately? It's less LLM development and more vibe coding, or maybe "being a PHB of robots". I'm enjoying it for my side project this week.
I agree. I had a similar experience.
https://speculumx.at/pages/read_post.html?post=59
Yet another developer who is too full of themselves to admit that they have no idea how to use LLMs for development. There's an arrogance that can set in when you get to be more senior and unless you're capable of force feeding yourself a bit of humility you'll end up missing big, important changes in your field.
It becomes farcical when not only are you missing the big thing but you're also proud of your ignorance and this guy is both.
Personally, I’ve had a pretty positive experience with the coding assistants, but I had to spend some time to develop intuition for the types of tasks they’re likely to do well. I would not say that this was trivial to do.
Like if you need to crap out a UI based on a JSON payload, make a service call, add a server endpoint, LLMs will typically do this correctly in one shot. These are common operations that are easily extrapolated from their training data. Where they tend to fail are tasks like business logic which have specific requirements that aren’t easily generalized.
I’ve also found that writing the scaffolding for the code yourself really helps focus the agent. I’ll typically add stubs for the functions I want, and create overall code structure, then have the agent fill the blanks. I’ve found this is a really effective approach for preventing the agent from going off into the weeds.
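As a rough illustration of that scaffolding approach (the file and function names here are hypothetical, not from the comment above): you hand the agent something like this and ask it to fill in only the TODO bodies, leaving the types and signatures alone.

```typescript
// invoice.ts: hypothetical scaffolding handed to the agent. The types and
// signatures pin down the structure; the agent fills in the TODO bodies only.
export interface LineItem {
  description: string;
  quantity: number;
  unitPriceCents: number;
}

export interface Invoice {
  id: string;
  items: LineItem[];
  taxRate: number; // e.g. 0.2 for 20%
}

// TODO(agent): sum quantity * unitPriceCents across all items.
export function subtotalCents(invoice: Invoice): number {
  throw new Error("not implemented");
}

// TODO(agent): apply invoice.taxRate to the subtotal and round to the nearest cent.
export function totalCents(invoice: Invoice): number {
  throw new Error("not implemented");
}
```

Keeping the blanks that narrow leaves the agent far less room to invent its own architecture.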
I also find that if it doesn’t get things right on the first shot, the chances are it’s not going to fix the underlying problems. It tends to just add kludges on top to address the problems you tell it about. If it didn’t get it mostly right at the start, then it’s better to just do it yourself.
All that said, I find enjoyment is an important aspect as well and shouldn’t be dismissed. If you’re less productive, but you enjoy the process more, then I see that as a net positive. If all LLMs accomplish is to make development more fun, that’s a good thing.
I also find that there's use for both terminal based tools and IDEs. The terminal REPL is great for initially sketching things out, but IDE based tooling makes it much easier to apply selective changes exactly where you want.
As a side note, got curious and asked GLM-4.5 to make a token field widget with React, and it did it in one shot.
It's also strange not to mention DeepSeek and GLM as options given that they cost orders of magnitude less per token than Claude or Gemini.
> Claude 4 Sonnet
> Or 4.1 Opus if you are a millionaire and want to pollute as much possible
That was an unnecessary guilt-shaming remark.
Yeah, this moralizing is like side-eyeing your fellow soldiers for killing "too much" because your level of killing is fine.
It's all about the Kilo Code extension.