Hacker News — vinext + Cloudflare Workers

--dangerously-skip-reading-code (olano.dev)

79 points by fagnerbrack 13 hours ago | 95 comments

facundo_olano 45 minutes ago [-]

Author here. I'm surprised to see this surfacing now. I just wanted to clarify, since apparently the post doesn't do a good job at it, that what I discussed there is not a methodology I advocate for. The point of the post was: ok, since there are organizations mandating to maximize speed by reducing time spent on typing code (or even mandating to maximize agents usage), is there a way we can meet that requirement while still preserving the rigor somewhere else?

This was a follow up to a previous article[1] and the pair tried to express what I still think today (using AI daily at work): every time I use AI for coding, to some capacity I'm sacrificing system understanding and stability in favor of programming speed. This is not necessarily always a bad tradeoff, but I think it's important to constantly remind ourselves we are making it.

[1] https://olano.dev/blog/tactical-tornado/

ignoreusernames 5 minutes ago [-]

Don’t you think that the provider of the LLM is also a dimension in these discussions about responsibility? We often talk about the tech itself (LLM-driven development), but how we access it is just as important imo. It’s either locked behind a non-trivial amount of hardware (for open models) or some monopoly-driven provider like OpenAI or Anthropic. In the provider case, it’s not really the LLM that will “own” the code, it’s the provider itself, and we’ll be at the mercy of whatever pricing model they shove down our throats.

throwaw12 4 hours ago [-]

> my first bet would be specifications and tests

You are missing another dimension: how easy it would be to migrate if adding a new feature hits a ceiling and the LLM keeps breaking the system.

Imagine all tests are passing and the code conforms to the spec, but everything is denormalized because the LLM thought this was a nice idea at the beginning, since no one mentioned that requirement in the spec. After a while you want to add a feature which requires a normalized table and the LLM keeps failing, but you also have no idea how this complex system works.

Don't forget that very very detailed spec is actually the code

abalashov 1 hours ago [-]

> Don't forget that very very detailed spec is actually the code

Came here to say this, but you said it for me. If the problem were merely one of insufficient rigour or detail in specs, it would have been solved long before LLMs.

iloveoof 14 minutes ago [-]

Software engineering has always worked this way, just not to ICs.

“The LLMs produce non-deterministic output and generate code much faster than we can read it, so we can’t seriously expect to effectively review, understand, and approve every diff anymore. But that doesn’t necessarily mean we stop being rigorous, it could mean we should move rigor elsewhere.”

Direct reports, when delegated tasks by managers, produce non-deterministic outputs much faster than team leads/managers can review, understand or approve every diff. Being a manager of software developers has always been a non-deterministic form of software engineering.

zoogeny 3 hours ago [-]

>... my first bet would be specifications ... and tests ... If I had to roll out such a development process today, I’d make a standardized Markdown specification the new unit of knowledge for the software project.

I've found that adopting RFC Keywords (e.g. RFC 2119 [1]; MUST, SHOULD, MAY) at least makes the LLM report satisfaction. I'd love to see a proper study on the usage of RFC keywords and their effect on compliance and effectiveness.

1. https://www.rfc-editor.org/info/rfc2119/
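As a rough illustration of what RFC 2119 keywords buy you in a Markdown spec, here is a minimal Python sketch that groups requirement lines by keyword so each one can be tracked or checked separately. The function name, the sample spec, and the choice to match uppercase keywords only are assumptions made for this example; it says nothing about how any particular LLM or tool treats these keywords.

```python
import re
from collections import defaultdict

# RFC 2119 keywords. Multi-word forms come first in the alternation so that
# "MUST NOT" is not matched as plain "MUST". Matching is case-sensitive on
# purpose: lowercase "may" is treated as ordinary English.
KEYWORD_RE = re.compile(
    r"\b(MUST NOT|SHALL NOT|SHOULD NOT|MUST|SHALL|SHOULD|"
    r"REQUIRED|RECOMMENDED|MAY|OPTIONAL)\b"
)

def requirements_by_keyword(spec_text):
    """Group spec lines by the first RFC 2119 keyword they contain."""
    found = defaultdict(list)
    for line in spec_text.splitlines():
        match = KEYWORD_RE.search(line)
        if match:
            # Drop Markdown bullet markers and surrounding whitespace.
            found[match.group(1)].append(line.strip().lstrip("-* ").strip())
    return dict(found)

# Hypothetical spec fragment, invented for illustration.
SPEC = """\
# Login
- The service MUST lock an account after 5 failed attempts.
- The service MUST NOT log plaintext passwords.
- The UI SHOULD show the remaining attempts.
- Clients may retry later.
"""

if __name__ == "__main__":
    for keyword, lines in requirements_by_keyword(SPEC).items():
        print(keyword, lines)
```

A list like this could feed a checklist that a reviewer or an agent has to tick off per requirement, which is one way to make "compliance" something you can actually measure.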

kortex 21 minutes ago [-]

That's literally what OpenSpec does (https://openspec.dev/). It's quite nice. I've only exceptionally rarely seen Claude do something wrong based on spec docs when it's fully spec'd out. More often it's because something wasn't nailed down and Claude was forced to make assumptions.

The downside is the ospx markdown specs sometimes end up too granular, focusing on the wrong or less important details, so reading the specs feels like a slog.

Also, at times aspects of the English-language spec end up way more verbose than just giving a code example would be.

ramoz 7 hours ago [-]

> If I had to roll out such a development process today, I’d make a standardized Markdown specification the new unit of knowledge for the software project. Product owners and engineers could initially collaborate on this spec and on test cases to enforce business rules. Those should be checked into the project repositories along with the implementing code. There would need to be automated pull-request checks verifying not only that tests pass but that code conforms to the spec. This specification, and not the code that materializes it, is what the team would need to understand, review, and be held accountable for.

The constant urge I have today is for some sort of spec or simpler facts to be continuously verified at any point in the development process; something agents would need to be aware of. I agree with the blog and think it's going to become a team sport to manage these requirements. I'm going to try this out by evolving my open source tool [1] (used to review specs and code) into a bit more of a collaborative & integrated plane for product specs/facts - https://plannotator.ai/workspaces/

[1] https://github.com/backnotprop/plannotator

20k 5 hours ago [-]

What we really need is some kind of more detailed spec language that doesn't have edge cases, where we describe exactly what we expect the generated code to do, and then formally verify that the now generated code matches the input spec requirement. It'd be super helpful to have something more formal with no ambiguity, especially because the English language tends to be pretty ambiguous in general, which can result in spec problems.

I also tend to find especially that there's a lot of cruft in human written spec languages - which makes them overly verbose once you really get into the details of how all of this works, so you could chop a lot of that out with a good spec language

I nominate that we call this completely novel, evolving discipline: 'programming'

dnmc 5 hours ago [-]

There are languages like Dafny that permit you to declare pre- and post-conditions for functions. Dafny in particular tries to automatically verify or disprove these claims with an SMT solver. It would be neat if LLMs could read a human-written contract and iterate on the implementation until it's provably correct. I imagine you'd have much higher confidence in the results using this technique, but I doubt that available models are trained appropriately for this use case.
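To make the idea of pre- and post-conditions concrete, here is a small Python sketch of a contract attached to a function. This is not Dafny and involves no SMT solver: the conditions are only checked at runtime, on the inputs that actually occur, whereas a verifier tries to prove them for all inputs before the program runs. The `contract` decorator and `integer_sqrt` are hypothetical names invented for this example.

```python
import functools

def contract(pre, post):
    """Attach a runtime-checked precondition and postcondition to a function."""
    def decorate(func):
        @functools.wraps(func)
        def wrapper(*args):
            # Precondition: the caller's obligation.
            if not pre(*args):
                raise ValueError(
                    f"precondition of {func.__name__} violated: {args!r}"
                )
            result = func(*args)
            # Postcondition: the implementation's obligation.
            if not post(result, *args):
                raise AssertionError(
                    f"postcondition of {func.__name__} violated: {result!r}"
                )
            return result
        return wrapper
    return decorate

@contract(
    pre=lambda n: n >= 0,
    post=lambda r, n: r * r <= n < (r + 1) * (r + 1),
)
def integer_sqrt(n):
    """Largest integer r with r*r <= n (deliberately naive implementation)."""
    r = 0
    while (r + 1) * (r + 1) <= n:
        r += 1
    return r

if __name__ == "__main__":
    print(integer_sqrt(10))
```

The contract is the human-written part; the body of `integer_sqrt` is the part an LLM could iterate on. A runtime check like this only reports violations it happens to hit, which is the gap a prover-backed language is meant to close.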

throw1234567891 2 hours ago [-]

Ask it to do so, show it how, and it will do it.

IsTom 3 hours ago [-]

> What we really need is some kind of more detailed spec language that doesn't have edge cases, where we describe exactly what we expect the generated code to do, and then formally verify that the now generated code matches the input spec requirement.

That's theorem provers and they're awful for anything of any reasonable complexity.

pyrale 16 minutes ago [-]

That's any programming language, really [1]. Any website contains millions of "proofs"; not all of them are useful. Choosing what needs to be proven is hard. And the spectrum of languages/type systems and their usability as either is more explored nowadays than it used to be. If you don't like Coq, you can look at Agda. If Agda is too far for you, you can look at Haskell. If that's still impractical, there's Rust or F#, etc... The tradeoff you have between "convenient for expressing proofs" and "convenient for programming" has many options.

[1]: https://www.youtube.com/watch?v=IOiZatlZtGU

ramoz 5 hours ago [-]

yes, am familiar with the "code is spec" trope.

Shame us all for moving away from something so perfect, precise, and that "doesn't have edge cases."

Hey - if you invent a programming language that can be used in such a way and create guaranteed deterministic behavior based on expressed desires as simple as natural language - I'll pay a $200/m subscription for it.

20k 5 hours ago [-]

As people are discovering, natural language is insufficiently precise to be able to specify edge cases. Any language precise enough to be formally verified against is a programming language

ramoz 5 hours ago [-]

we're going to end up speaking past each other - but generally I do agree with you and am not denouncing the importance of formal verification methods. I do think abstractions are going to dominate the human ux above them

giancarlostoro 3 hours ago [-]

XML will do it very well.

giancarlostoro 3 hours ago [-]

Hate it all you want, but XML is genuinely a good fit there, and Claude is apparently insanely good at working with XML prompts.

geraneum 2 hours ago [-]

I don’t know why, but whenever someone uses “insanely” or “shockingly” along with AI, I get the feeling they’re a bot or are writing based on a guideline! No offense, btw, I’m not saying you’re a bot.

giancarlostoro 1 hours ago [-]

You’re absolutely right!

Kidding, nah no worries. I do worry people become overly paranoid of bots as time passes.

SpicyLemonZest 2 hours ago [-]

Is Claude good at working with XML prompts, or is XML good at convincing users to write more Claude-able specs? I am intensely skeptical that you could write an XML document describing a nontrivial web application in full detail, but I could easily imagine someone who thinks they have to stripping out important details because they don't really map to XML.

giancarlostoro 1 hours ago [-]

They train it with XML; even the system prompts that Claude reads are formatted with it.

SpicyLemonZest 5 minutes ago [-]

I haven't done it professionally, but my understanding is that this kind of work is much more in the second category, where you have to understand the closest approximation to what you want that the LLM can reliably produce or the training won't work at all.

zombot 5 hours ago [-]

> where we describe exactly what we expect the generated code to do, and then formally verify that the now generated code matches the input spec requirement.

In ancient times we had tech to do exactly that: Programming languages and tests.

k3vinw 2 hours ago [-]

like declarative vs imperative?

giancarlostoro 3 hours ago [-]

I called it gates on mine. I loved Beads but it closed tasks without any validation steps. Beads also had other weird issues, so I made my own alternative. I think "Gates" is also used by other projects that took on the same challenge I did in mine, weirdly enough.

https://github.com/Giancarlos/guardrails

piokoch 4 hours ago [-]

We've been through that so many times. When UML arrived (and ALM tools suites, IBM was trying to sell it, Borland was trying to sell it, all those fancy and expensive StarTeam, Caliber and Together soft), then BPML and its friends arrived, Business Rule Management System (BRMS), Drools in Java world, etc.

It all failed. For a simple reason, popularized by Joel Spolsky: if you want to create specification that describes precisely what software is doing and how it is doing its job, then, well, you need to write that damn program using MS Word or Markdown, which is neither practical nor easy.

The new buzzword is "spec driven development", maybe it will work this time, but I would not bet on that right now.

BTW: once we are at that point, it no longer makes sense to generate code in the programming languages we have today; the LLM can simply generate binaries, or at least some AST that will be directly translated to binary. In this way LISP would, eventually, take over the world!

jondwillis 7 hours ago [-]

I’ve been considering this as well, and trying to get my colleagues to understand and start doing it. I use it to pretty decent effect in my vibe coded slop side projects.

In the new world of mostly-AI code that is mostly not going to be properly reviewed or understood by humans, having a more and more robust manifestation and enforcement, and regeneration of the specs via the coding harness configuration combined with good old fashioned deterministic checks is one potential answer.

Taken to an extreme, the code doesn’t matter, it’s just another artifact generated by the specs, made manifest through the coding harness configuration and CI. If cost didn’t matter, you could re-generate code from scratch every time the specs/config change, and treat the specs/config as the new thing that you need to understand and maintain.

“Clean room code generation-compiler-thing.”

SpicyLemonZest 2 hours ago [-]

> If cost didn’t matter, you could re-generate code from scratch every time the specs/config change, and treat the specs/config as the new thing that you need to understand and maintain.

The critical insight is that this is not true. When people depend on your software, replacing it with an entirely different program satisfying all of your specs and configurations is a large, months-long project requiring substantial effort and coordination even after the new program is written. It seems to work in vibe coded side projects because you don't have those dependencies; if you got an angry email from a CEO saying that moving a critical button ruined their monthly review cycle, and demanding 7 days notice before you move any buttons going forwards, you'd just tell them no.

DavidVoid 3 hours ago [-]

> Product owners and engineers could initially collaborate on this spec and on test cases to enforce business rules. Those should be checked into the project repositories along with the implementing code. There would need to be automated pull-request checks verifying not only that tests pass but that code conforms to the spec. This specification, and not the code that materializes it, is what the team would need to understand, review, and be held accountable for.

This just sounds like typical requirements management software (IBM DOORS for example, which has been around since the 90s).

It's kind of funny how AI evangelists keep re-discovering the need for work methods and systems that have existed for decades.

When I worked as a software developer at a big telecom company, I had no say in what the software was supposed to do. That was up to the software design people: they were the ones responsible for designing the software and defining all the requirements, and I was just responsible for implementing that behavior in code.

irishcoffee 3 hours ago [-]

One of my first tasks at my first job out of college required me to learn DXL (DOORS eXtension Language) and implement some really intricate requirements management features.

It was gratifying to build the confidence of learning a new language quickly that I had never even heard of before. DXL was also pretty awful.

Opened a lot of doors for me though, no pun intended.

bitwize 2 hours ago [-]

Spec-driven development is basically PRIDE, the first proven commercial software methodology dating back to 1971. In fact it may be the culmination of PRIDE because PRIDE's creators realized coding wasn't the hard part; the hard part was systems analysis, determining what problem needed to be solved and what to build. Coding comes last and when you did it right, was simply a translation step.

And now that step can be 100% automated.

Information systems design was a solved problem in the 1970s. PRIDE turned it from an art into a proven, repeatable science. Programmers, afraid of losing their perceived importance, resisted the discipline it imposes as the mustang resists the bit, but now that they're going the way of buggy-whip makers, maybe systems design as a science will make a comeback after 50 years.

tyleo 8 hours ago [-]

The underlying mechanism is still the same: humans type and products come out.

So something which must be true if this author is right is that whatever the new language is—the thing people are typing into markdown—must be able to express the same rigor in less words than existing source code.

Otherwise the result is just legacy coding in a new programming language.

SoftTalker 7 hours ago [-]

> Otherwise the result is just legacy coding in a new programming language.

And this is why everything from COBOL through the various implementations of CASE tools, "software through pictures", flowcharts, UML, etc., all of which were supposed to let business SMEs write software without needing programmers, has failed to achieve that goal.

tyleo 7 hours ago [-]

While they failed to achieve the goal outright, I'd argue that each is a concrete step towards it. The languages we have today are more productive than the languages we had decades ago.

I think it's an open question of whether we achieve the holy grail language as the submission describes. My guess is that we inch towards the submission's direction, even if we never achieve it. It won't surprise me if new languages take LLMs into account just like some languages now take the IDE experience into account.

dpark 6 hours ago [-]

> must be able to express the same rigor in less words than existing source code

Yes but also no. Writing source means rigorously specifying the implementation itself in deep detail. Most of the time, the implementation does not need to be specified with this sort of rigor. Instead the observable behavior needs to be specified rigorously.

tyleo 6 hours ago [-]

That doesn't sound right. For example, there's plenty of software with the correct observable behavior which leaks credentials. So what needs to be captured goes beyond observable behavior.

dpark 2 hours ago [-]

Leaking credentials is observable behavior.

Certainly you could write specification for a piece of software, and the software could meet the specification while also leaking credentials. Obviously, that would be a problem. But at some point, this starts to feel artificial and silly. The same software could reformat your hard disk, right?

At some point, we aren’t discussing whether or not AI is doing a bad job writing software. We’re discussing whether or not it’s actively malicious.

tyleo 2 hours ago [-]

If leaking credentials is observable behavior, I don't get your point.

Memory leaks, deleting the hard drive, spending money would all be observable behavior.

By your reasoning that the "observable behavior needs to be specified rigorously" it seems like you'd have to list these all out. We do, after all, already have cases of AI deleting data.

That sounds harder and more error prone than what we're doing now by rigorously defining these defects out of existence in code.

dpark 48 minutes ago [-]

My ultimate point is that source code specifies everything the software can or will do, but most of that is not part of the specification. Most of it is implementation details you don’t care about.

The entire reason we have functions and components and modules etc. is to isolate engineers from the things we do not need to care about. I should not need to care about the implementation details of most software, only whether it meets my requirements.

The move to AI first software development will not happen because we find a way to specify as much in English as we previously would have specified in a programming language. The move will happen when and as we figure out how to specify the things that matter. We don’t need the same rigor. We need the correct rigor.

ninalanyon 7 hours ago [-]

> Rework is almost free

Is it? All the electricity and capital investment in computing hardware costs real money. Is this properly reflected in the fees that AI companies charge or is venture capital propping each one up in the hope that they will kill off the competition before they run out of (usually other people's) money?

gessha 6 hours ago [-]

Yeah, a lot of Claude Code users (me included) found out in March whether rework is free or not.

humbleharbinger 2 hours ago [-]

My Amazon org's leadership has been obsessed with spec driven development, while individual engineers tell me the only use they have for it is to placate leadership. I'm tired.

culi 1 hours ago [-]

How does spec driven development differ from test driven development?

k3vinw 2 hours ago [-]

> We can stop reading LLM-generated code just like we don’t read assembly, or bytecode, or transpiled JavaScript; our high-level language source would now be another form of machine code

This is too weird for me. At least with programming languages I can consult the documentation and if the programming language isn’t behaving as documented, it’s obviously a defect and if you’re savvy enough you often have open channels that accept contributions. Can we say the same for Claude or other AI solutions?

Sevii 2 hours ago [-]

If you run a local LLM and an open source agent harness you are pretty close to that.

throawayonthe 2 hours ago [-]

can you explain how? with a compiler you can rely on the adage "it's never a compiler bug" (until it is! and then you can fix it)

how can a local LLM with an open source agent harness provide the same trustworthiness?

zoogeny 2 hours ago [-]

> ... then you can fix it

I recall working on a project that used (MSVC) VC++ and a coworker found a bug in the compiler. We reported the issue to Microsoft and they eventually patched it.

You may find yourself arguing explicitly for open source dev tools if you continue down this line. There are many commercial cases where "you can fix it" does not apply to the dev toolchain and you will find yourself reliant on a provider. At that point, the trustworthiness of "compiler provider" and "local LLM provider" is the pertinent discussion (e.g. provider vs. provider instead of LLM vs compiler).

jmull 6 hours ago [-]

The lesson I've learned from our new AI age is how little a large number of people who've worked in software development their entire careers understand software development.

I suppose all the money floating around AI helps dummify everything, as people glom on to narratives, regardless of merit, that might position them to partake.

What we actually have now is the ability to bang out decent quality code really fast and cheaply.

This is massive, a huge change, one which upends numerous assumptions about the business of software development.

...and it only leaves us to work through every other aspect of software development.

The approach this article advocates is to essentially pretend none of this exists. Simple, but will rarely produce anything of value.

This paragraph from the post gives you the gist of it:

> ...we need to remove humans-in-the-loop, reduce coordination, friction, bureaucracy, and gate-keeping. We need a virtually infinite supply of requirements, engineers acting as pseudo-product designers, owning entire streams of work, with the purview to make autonomous decisions. Rework is almost free so we shouldn’t make an effort to prevent incorrect work from happening.

As if the only reason we ever had POs or designers or business teams, or built consensus between multiple people, or communicated with others, or reviewed designs and code, or tested software, was because it took individual engineers too long to bang out decent code.

AI has just gotten people completely lost. Or I guess just made it apparent they were lost the whole time?

evolve-maz 2 hours ago [-]

All the talking points and techniques are those which were used when pushing outsourcing: give better specs, write detailed tests, accept bad code because it works so who cares, we can just rewrite from scratch later, and my favorite, "they will get better with more exposure to your code base". None of these takes is wrong, but what they neglect is that doing all that work is way more effort than if I wrote the original code myself.

Using an LLM to one-shot a small function (something I would do with a very specific search on Google or SO) is handy. Giving it a harness and free access to a code base leads to some terrible code, and doubling down with more instructions and agents in the loop means more time writing the Rube Goldberg orchestration rather than just opening up an editor and writing code.

dasil003 4 hours ago [-]

Yeah this article is in a real uncanny valley for me where it has some insight, but it also throws out some wild ideas that don't pass the sniff test for me.

To me what AI is doing is changing the economics of human thought, but the change is happening way faster than individuals, let alone organizations, can absorb the implications. What I've seen is that AI magnifies the judgment of individuals who know how to use it, and so far it's mostly software engineers who have learned to use it most effectively because they are the ones able to develop an intuition about its limitations.

The idea of removing the human from the loop is nonsense. The question is more what loops matter, and how can AI speed them up. For instance, building more prototypes and one-off hacky tools is a great use of vibe coding, changing the core architecture of your critical business apps is not. AI has simultaneously increased my ability to call bullshit, while amplifying the amount of bullshit I have to sift through.

When the dust settles I don't really see that the value or importance of reading code has changed much. The whole reason agentic coding is successful is because code provides a precise specification that is both human and machine readable. The idea that we'll move from code to some new magical form of specification is just recycling the promise of COBOL, visual programming, Microsoft Access, ColdFusion, no-code tools, etc, to simplify programming. But actually the innovations that have moved the state of the art of professional programming forward, are the same ones that make agentic coding successful.

vinnymac 5 hours ago [-]

I appreciate your insights in a sea of psychosis comments. I find it strange how many people think we have achieved the likes of Y2K flying cars 20 years ago, or the dream of having every car on the road be an electric fully self driving car by now (a promise made at least over a decade ago by several of these types).

The point I’m making is that we give the spotlight to people who are making absurd claims. We have not achieved the ability to remove the human from the loop and continually produce valuable outputs. Until we do, I don’t see how any of the claims made in this article are even close to anything more than simply gate-keeping slop.

ninalanyon 5 hours ago [-]

And if we do remove the human from the loop? What then, what are humans for? Do we get Keynes' idea that we only need to work a few hours a week or do we get a continuation and intensification of what we already have: a few high 'earners' and a sea of people struggling to make ends meet?

phyzix5761 5 hours ago [-]

I wonder if with the speed of iteration with AI the industry will switch back to waterfall. Clear documentation first so the LLM can easily produce what's being asked with a round of testing before going back to the documentation stage and running it again. History does repeat itself.

yibers 5 hours ago [-]

We already switched

montroser 10 hours ago [-]

This could very well be a pattern that some teams evolve into. Specs are the new source -- they describe the architectural approach, as well as the business rules and user experience details. End to end tests are described here too. This all is what goes through PRs and review process, and the code becomes a build artifact.

vips7L 8 hours ago [-]

It just doesn’t work though. Anthropic couldn’t even get Claude to build a working C compiler which has a way better specification than any team can write and multiple reference implementations.

hombre_fatal 6 hours ago [-]

Yeah, this has been my process for months now.

I might even start my own blog to write about things I've found.

1. Always get the agent to create a plan file (spec). Whatever prompt you were going to yolo into the agent, do it in Plan Mode first so it creates a plan file.

2. Get agents to iterate on the plan file until it's complete and thorough. You want some sort of "/review-plan <file>" skill. You extend it over time so that the review output is better and better. For example, every finding should come with a recommended fix.

3. Once the plan is final, have an agent implement it.

4. Check the plan in with the impl commit.

The plan is the unit of work really since it encodes intent. Impl derives from it, and bugs then become a desync from intent or intent that was omitted. It's a nicer plane to work at.

From this extends more things: PRs should be plan files, not code. Impl is trivial. The hard part is the plan. The old way of deriving intent from code sucked. Why even PR code when we haven't agreed on a plan/intent?

This process also makes me think about how code implementation is just a more specific specification about what the computer should do. A plan is a higher level specification. A one-line prompt into an LLM is the highest level specification. It's kinda weird to think about.

Finally, this is why I don't have to read code anymore. Over time, my human review of the code unearthed fewer and fewer issues and corrections to the point where it felt unnecessary. I only read code these days so I can impose my preferences on it and get a feel for the system, but one day you realize that you can accumulate your preferences (like, use TDD and sum types) in your static prompt/instructions. And you're back to watching this thing write amazing code, often better than what you would have written unless you have maximum time + attention + energy + focus no matter how uninteresting the task, which you don't.

geraneum 2 hours ago [-]

> PRs should be plan files, not code. Impl is trivial.

Doesn’t it bother you that the outcome of each PR is different every time you/CI “run it”?

moritzwarhier 7 hours ago [-]

Entertaining flag name!

React team seems to really have set a precedent with their "dangerouslySetInnerHTML" idea.

Or did they borrow it somewhere?

I'm just curious about that etymology, of course the idea is not universally helpful: for example, for dd CLI parameters, it would only make a mess.

But when there's a flag/option that really requires you to be vigilant and understand the input and output and all edge cases, calling it "dangerous" is quite a feat!

wrxd 6 hours ago [-]

I’m pretty sure this comes from Claude Code’s --dangerously-skip-permissions

saulpw 5 hours ago [-]

which sounds like it came from React's "dangerouslySetInnerHTML", per the comment you replied to.

brabel 3 hours ago [-]

I think people used similar prefixes for a long time. For example, Haskell has had `unsafePerformIO` since the 90's... and MSFT's Hungarian notation was also similar, though it used abbreviations for things like "unsafe" (not "dangerous"). Perhaps React was the most famous case of using "dangerously" though.

culi 1 hours ago [-]

"unsafe" seems quite different from the "dangerously [...]" phrasal template. I don't think it's a stretch to suppose it was inspired by React. Still waiting for this one to catch on:

  React.__SECRET_INTERNALS_DO_NOT_USE_OR_YOU_WILL_BE_FIRED

https://github.com/reactjs/react.dev/issues/3896

urbandw311er 20 minutes ago [-]

This is sometimes exposed in front end browser code and I had an actual (non technical) end user email our support team last month asking if it was something they should be concerned about! God knows how they found it, I suspect everyone is now an AI-enabled expert at these things…

debesyla 7 hours ago [-]

I found that adding "philosophy" descriptions helps guide the tooling. No specs, just general vibes about what the point is, because we can't make everyone happy and it's not a goal of a good tool (I believe).

Technology, implementation may change, but general point of "why!?" stays.

wizzwizz4 10 hours ago [-]

> There would need to be automated pull-request checks verifying not only that tests pass but that code conforms to the spec.

As I understand, this is an unsolved problem.

InsideOutSanta 8 hours ago [-]

Step 1: solve the halting problem.

soraminazuki 7 hours ago [-]

Yep, calling it an "unsolved" problem is a misnomer. We already have mathematical proof that it's impossible.

But that aside, it's such a shame that many drinking the AI Kool-Aid aren't even aware of the theoretical limits of a computer's capabilities.

mjd 6 hours ago [-]

This sort of theoretical result is not always as clear-cut as you suggest.

Computers are finite machines. There is a theorem that although a machine with finite memory can add, multiplication requires unbounded memory. Somehow we muddle along and use computers for multiplication anyway.

More to your point there is a whole field of people who write useful programs using languages in which every program must be accompanied by a proof that it halts on all inputs.

(See for example https://lean-lang.org/ or David Turner's work on Total Functional Programming from about 20 years ago.)

Other examples are easy to find. The simplex algorithm for linear optimization requires exponential time in the worst case, but in practice it works well on problems of interest and is widely used. Or consider the dynamic programming algorithms for NP-hard problems like subset-sum.

Theory is important, but engineering is also important.
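
As a rough illustration of the subset-sum point above, here is a minimal sketch of the textbook dynamic program. The function name `subset_sum` and the restriction to non-negative integers are assumptions made for this example, not something from the comment.

```python
def subset_sum(values, target):
    """Return True if some subset of `values` sums exactly to `target`.

    Assumes non-negative integers. Runs in O(len(values) * target) time,
    which is pseudo-polynomial: polynomial in the numeric value of
    `target`, not in the number of bits needed to write it down.
    """
    if target < 0:
        return False
    reachable = [False] * (target + 1)
    reachable[0] = True  # the empty subset sums to 0
    for v in values:
        # Iterate downwards so each value is used at most once.
        for s in range(target, v - 1, -1):
            if reachable[s - v]:
                reachable[s] = True
    return reachable[target]


print(subset_sum([3, 34, 4, 12, 5, 2], 9))
```

The problem stays NP-hard in general, but for targets of modest size the table is small enough that this is fast in practice.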

za_creature 5 hours ago [-]

> There is a theorem that (...) multiplication requires unbounded memory

What theorem is that?

The multiplication of any two integers below a certain size (called "words") fits in a "double word" and the naive multiplication algorithm needs to store the inputs, an accumulator and at most another temporary for a grand total of 6*word_size
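
A quick way to check the double-word claim, sketched in Python with an assumed 64-bit word (`WORD_BITS` and `fits_in_double_word` are illustrative names, not from any library):

```python
WORD_BITS = 64  # illustrative word size; any positive width works


def fits_in_double_word(a, b, word_bits=WORD_BITS):
    """Check that the product of two unsigned words fits in a double word."""
    limit = 1 << word_bits
    if not (0 <= a < limit and 0 <= b < limit):
        raise ValueError("inputs must be unsigned words")
    return (a * b).bit_length() <= 2 * word_bits


max_word = (1 << WORD_BITS) - 1
# Worst case: (2^w - 1)^2 = 2^(2w) - 2^(w+1) + 1, which is below 2^(2w).
print(fits_in_double_word(max_word, max_word))
```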

Sure, you can technically "stream" carry-addition (which is obvious from the way adders are chained in ALU-101) and thus in a strict sense addition is O(1) memory but towards your final point:

> Theory is important, but engineering is also important.

In practice, addition requires unbounded memory as well (the inputs). And it's definitely compute-unbounded, if your inputs are unbounded.
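
To make the "streaming" addition concrete, here is a small sketch that adds two numbers fed in as bit streams, least significant bit first. The helper names (`stream_add`, `to_bits`, `from_bits`) are made up for this example. The adder itself only carries one bit of state between steps; the inputs and the output still have to live somewhere, which is the point made above.

```python
from itertools import zip_longest


def stream_add(a_bits, b_bits):
    """Add two numbers given as iterables of bits, least significant first.

    Yields the sum bits one at a time. The only state kept between steps
    is a single carry bit, regardless of how long the inputs are.
    """
    carry = 0
    for a, b in zip_longest(a_bits, b_bits, fillvalue=0):
        total = a + b + carry
        yield total & 1
        carry = total >> 1
    if carry:
        yield 1


def to_bits(n):
    """Least-significant-first bits of a non-negative integer."""
    while n:
        yield n & 1
        n >>= 1


def from_bits(bits):
    """Rebuild an integer from least-significant-first bits."""
    return sum(bit << i for i, bit in enumerate(bits))


print(from_bits(stream_add(to_bits(13), to_bits(7))))
```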

I dislike the term "we muddle along". IEEE 754 has well specified error bars and cases, and so does all good data science. LLMs do not, or at least they do not expose them to the end user

So then, how exactly do we go about proving that the result of chaining prompts is within a controllable margin of error of the intended result? Because despite all the specs, numerical stability is the reason people don't write their own LAPACK.

soraminazuki 6 hours ago [-]

But it's not like these systems make theory go away, they make compromises. So the question is, what's the compromise required for an algorithm that can check the conformance of computer programs to natural language specifications that doesn't involve hoping for the best?

wizzwizz4 3 hours ago [-]

Natural language specifications often aren't specifications at all: interpreting them requires context that is not available to the computer, and often not even available to the specification's authors without further research / decision-making.

LLMs address this problem by just making things up (and they don't do a great job of comprehending the natural language, either), which I think qualifies as "hoping for the best", but I'm not sure there is another way, unless you reframe the problem to allow the algorithm to request the information it's missing.

wizzwizz4 7 hours ago [-]

You're probably thinking of Rice's theorem (a sort of generalised halting problem), but this task is actually way easier than that, since we're not trying to study arbitrary algorithms: we're trying to study the subset of human-comprehensible algorithms. Most of the things we want computers to do are things that, given enough time, someone can write a program to solve: and generally, those programs are not tricksy or meta, and don't involve unsolved mathematics problems found by studying busy-beaver candidates. If it's possible for a human to understand how a program works (which is a necessary component of writing such a program), it's possible to write a mathematical proof about the program's behaviour, and that means it's in principle possible to automate the construction of that mathematical proof, which is equivalent to the "determine whether code conforms to the spec" task.

"Somewhat easier than an impossible task" is not a particularly strong claim about when (or whether) this problem will be solved, though.

soraminazuki 7 hours ago [-]

We can't create an algorithm determining whether a computer program halts or not, but we can write one that checks whether it conforms to natural language specifications much more easily? That makes no sense. There's no exception to the halting problem regarding "human comprehensible" computer programs.

__s 4 hours ago [-]

They're saying most useful programs don't fall in the complete / correct divide. You can get a lot done while restricting yourself to provable programs

wizzwizz4 4 hours ago [-]

Rice's theorem says that we can't draw a partition between two sets of programs, based on their semantic properties. It says nothing about drawing a partition slightly to one side of the desirable partition, misclassifying some tricksy cases, but correctly classifying all the programs we care about.
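
A toy sketch of what "drawing the partition slightly to one side" can look like, assuming a deliberately tiny subset of Python: the checker answers "halts" only for programs built from assignments, arithmetic, `if`, and `for` loops over `range(...)` with literal integer arguments, and answers "unknown" for everything else. The whitelist and the names `classify`, `_check` and `_is_constant_range` are choices made for this example; it is nowhere near a real verifier, and "halts" here includes stopping with an exception.

```python
import ast

# Node types the checker is willing to reason about (an illustrative
# whitelist chosen for this sketch). `for` loops are handled separately.
_SAFE_NODES = (
    ast.Module, ast.Assign, ast.AugAssign, ast.Expr, ast.If, ast.Pass,
    ast.Name, ast.Constant, ast.BinOp, ast.UnaryOp, ast.Compare, ast.BoolOp,
    ast.operator, ast.unaryop, ast.cmpop, ast.boolop, ast.expr_context,
)


def _is_constant_range(node):
    """True for range(...) called with one to three literal integers."""
    return (
        isinstance(node, ast.Call)
        and isinstance(node.func, ast.Name)
        and node.func.id == "range"
        and not node.keywords
        and 1 <= len(node.args) <= 3
        and all(
            isinstance(arg, ast.Constant) and isinstance(arg.value, int)
            for arg in node.args
        )
    )


def _check(node):
    """Return True only if every construct under `node` is whitelisted."""
    if isinstance(node, ast.For):
        if not _is_constant_range(node.iter):
            return False
        children = [node.target, *node.body, *node.orelse]
    elif isinstance(node, _SAFE_NODES):
        children = list(ast.iter_child_nodes(node))
    else:
        return False  # while loops, calls, defs, imports, ...
    return all(_check(child) for child in children)


def classify(source):
    """Return "halts" only when termination is certain, else "unknown"."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return "unknown"
    return "halts" if _check(tree) else "unknown"


print(classify("total = 0\nfor i in range(10):\n    total += i\n"))
print(classify("x = 0\nwhile x < 3:\n    x += 1\n"))
```

The second program does halt, but the checker refuses to say so. That is the trade being described: some halting programs land on the wrong side of the line, in exchange for never certifying a program that loops forever.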

ElectricalUnion 5 hours ago [-]

I will give you a class of programs humans wrote and they want improved: LLMs.

Those were written by humans, and don't involve unsolved mathematics.

Is your claim that you just need to solve comprehensibility of LLMs?

Figuring out epistemology and cognition to have a chance to reason about the outputs of an LLM seems to me way harder than traditional attempts to reason directly about algorithms.

Ecys 9 hours ago [-]

this is actually precisely what humans' roles will be.

"is this implementation/code actually aligned with what i want to do?"

humanic responsibility's focus will move entirely from implementing code to deciding whether it should be implemented or not.

u probably mean unsolved as in "not yet able to be automated", and that's true.

if pull-request checks verifying that tests are conforming to the spec are automated, then we'd have AGI.

lefra 7 hours ago [-]

That's what formal verification is about. I did some (using PSL for hardware verification); writing the formal spec is way harder than the actual code. It will find a lot of subtle issues, and you spend most of the time deciding if it's the spec or the code that's wrong.

Having the code-writing part automated would have a negligible impact on the total project time.

phainopepla2 8 hours ago [-]

> humanic

No, thank you

wizzwizz4 9 hours ago [-]

This is a task that humans are exceptionally bad at, because we are not computers. If something uses the right words in the right order such that it communicates the correct algorithm to a human, then a human is likely to say "yup, that's correct", even if an hour's study of these 15 lines would reveal that, because of a subtle punctuation choice or a subtle mismatch between a function's name and its semantics, it implements a different algorithm to the expected one.

LLMs do not understand prose or code in the same way humans do (such that "understand" is misleading terminology), but they understand them in a way that's way closer to fuzzy natural language interpretation than pedantic programming language interpretation. (An LLM will be confused if you rename all the variables: a compiler won't even notice.)

So we've built a machine that makes the kinds of mistakes that humans struggle to spot, used RLHF to optimise it for persuasiveness, and now we're expecting humans to do a good job reviewing its output. And, per Kernighan's law:

> Everyone knows that debugging is twice as hard as writing a program in the first place. So if you're as clever as you can be when you write it, how will you ever debug it?

And that's the ideal situation where you're the one who's written it: reading other people's code is generally harder than reading your own. So how do you expect to fare when you're reading nobody's code at all?

Ecys 8 hours ago [-]

i meant on a higher, agentic level where the AI's code is infallible. and that's going to happen very soon:

say: human wants to make a search engine that makes money for them.

1. for a task, ask several agents to make their own implementation and a super agent to evaluate each one and interrogate each agent and find the best implementation/variable names, and then explain to the human what exactly it does. or just mythos

2. the feature is something like "let videos be in search results, along with links"

3. human's job "is it worth putting videos in this search engine? will it really drive profits higher? i guess people will stay on the search engine longer, but hmmm maybe not. maybe let's do some a/b testing and see whether it's worth implementing???" etc...

this is where the developer has to start thinking like a product manager. meaning his position is abolished and the product manager can do the "coding" part directly.

now this should be basic knowledge in 2026. i am just reading and writing back the same thing on HN omds.

wizzwizz4 8 hours ago [-]

The AI's code is not going to be infallible any time soon. It's been "very soon" for the past 4 years, and the AI systems are still making the same kinds of mistakes, which are the mistakes you'd expect from a first-principles study of their model architectures. There's no straightforward path to modifying the systems we have now, to make them infallible.

Ozzie-D 4 hours ago [-]

the irony is that AI is making this exact problem worse. ppl are generating entire codebases now without reading any of it -- the flag might as well be the default. the skill thats actually becoming scarce isnt writing code, its reading code you didnt write and knowing if its correct.

retinaros 4 hours ago [-]

markdown became the language I hate the most thanks to LLMs and the specs-driven approach. everything feels so dumb right now in agentic coding. looping blindly and aimlessly until it compiles, then until the playwright server or whatever devtools shows that it somehow works. push the code, have an llm autoreview/autofix, push to prod, run a mythos (perfect name) to identify the bug that opus 4.7 creates. loops on loops on loops of some kind of zombie processes running to a "goal" that everyone seems to mystify in talks to just hide the fact that we do nothing anymore. the bottleneck never was code. it was the gate that was keeping away the Elizabeth Holmes and SBF from software engineering and it just opened.

abalashov 54 minutes ago [-]

A colleague and I have taken to use of the verb "meatspin", from another era in Internet shock humour, to describe what it is that coding agents actually do 99% of the time.

testplzignore 6 hours ago [-]

> Product owners and engineers could initially collaborate on this spec and on test cases to enforce business rules.

LOL. I had to check if this was published on April 1st.

lesscode 6 hours ago [-]

Instead of accepting 20,000 lines of slop per PR (and never-ending combinatorial complexity), maybe we should aim to think about abstractions and how to steer LLMs to generate code similar to that of a skilled human developer. Then it could actually be a maintainable artifact by humans and LLMs alike.

crnkofe 3 hours ago [-]

I don't get why every AI article is so hyper-focused on coding speed. If the coding is so fast, doesn't it make sense to invest more time into quality, learning, documentation, testing, refactoring, making a better product? I'm beginning to think that the slopcoders are evaluated by kLOCs written in addition to LLM token usage and they're just maximising the measured metrics. Whether that actually ends up in production or is used by any real person is seemingly irrelevant. Likely the more bugs that are produced the more agents can be spun in parallel to simulate busywork.

farmerbb 4 hours ago [-]

I legit can't tell if this article is satire, or not.

Uptrenda 50 minutes ago [-]

Does this post mark the top of the hype train or is there still more to come?

Ecys 9 hours ago [-]

very true. and we already know and agree with this.

user experience/what the app actually does >>> actually implementing it.

elon musk said this a looong time ago. we move from layer 1 (coding, how do we implement this?) to layer 2 thinking (what should the code do? what do we code? should we implement this? (what to code to get the most money?))

this is basic knowledge

duskdozer 7 hours ago [-]

Elon Musk has been saying Teslas would have fully autonomous self-driving within 1-3 years since 2013

vinnymac 5 hours ago [-]

I left a similar comment elsewhere in this thread. I still remember when so many people hallucinated that we would suddenly have flying cars by 2002 at the latest. If we achieve several more major improvements on current technology, these thoughts are interesting to consider. But not before that occurs.

We need the pragmatic engineer more than ever.

