Coding with LLMs can lead to more and better software
Monday, April 3, 2023
We are in the early days with a new technology. There is a lot of hype around LLMs, and takes on every end of the spectrum. Some predict that programmers will be out of a job sooner rather than later. Others predict that these models will just contribute to spam. Today I'd like to focus on one particular take I've read: Using LLMs will make us produce worse software.
There are a lot of different takes on this, and they all have nuance. So if you do think this, and I misrepresent what you think: Sorry, would love to talk about it!
The crux of this argument seems to come down to a few things:
- LLMs produce buggy and insecure code
- They have no or limited ability to reason
- Making things quickly goes against making them well
I think it misses a lot of the point of what makes LLMs such a fundamental improvement for software engineering. Let's go through those arguments, then come back to how LLMs properly used can make for much better software, produced more quickly.
First, the matter of buggy and insecure code. I have seen many examples of bad code produced by LLMs, such as Copilot and ChatGPT, where the code is just plain doing something bad. This shouldn't be shocking, since these models are trained on tons of open-source code which also contains... bugs! But here's the thing:
Humans write bugs, too. And humans write a ton of insecure code. This problem isn't unique to LLMs; it's just an aspect of producing software.
The question we should be asking here is, will the code we produce with LLMs have more bugs or fewer, be more secure or less? My impression at this point is that, if used properly (and more on that later), they can lead to code with fewer bugs and with better security properties. A few things contribute to that, but mostly it is that by producing code more quickly we can spend more time on review, and we can use LLMs to do review for us. There is a lot of common security knowledge baked into these models, and we can leverage them to help review for security issues and bugs. We can use them to produce more robust test cases, and make the drudgery of writing tests less painful.
At any rate, the code that I get out of an LLM typically has far fewer (but characteristically different) bugs than the same code written by a recent graduate from a bootcamp. Onwards.
Whether or not LLMs have any ability to reason is currently an open question, as I understand it. While the models are fundamentally statistical models[1], they exhibit some really interesting emergent properties which make it so I don't think it's obvious that they lack reasoning. But I also... don't care, at this point? The more important thing is what they can do.
If you know that an LLM will fail on certain classes of problems, then you as the reasoning being can choose to dole out certain parts of the problem to it and reserve others for yourself. Early models have been bad at things like math, but good at things like generating command-line arguments for programs. Learn what the limits of the LLM are and use it on things it is good at. We don't fire good engineers just because they are bad at one part of programming. You keep them on your team and assign them things they excel at, or figure out how to make them better at other kinds of problems.
And that brings us to the last bit, which rankles me more than the rest. Some have argued that making things quickly, and prolifically, means we're just producing trash. That making something good takes more time than making something bad. This just doesn't line up with reality, though.
There is an oft-repeated story[2] about a teacher and their class. The teacher divided the students into two groups. One group was going to be graded on the quality of their output. The other group would only be graded on quantity of output. At the end of the term, the best results, the highest quality output, were produced by the quantity-seeking group.
This is going to hold true for software, as well. The more independent pieces of code you produce, the more chance you have of one of them being truly excellent.
You may produce a lot of bad code along the way, but our best software will come from producing a lot of it. Prototyping is now much cheaper than it ever has been before. We can try out so many ways of doing something and pick just the best one. We can spend the time we save producing code to instead think about what the code should do.
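The claim about independent attempts can be made concrete with a little arithmetic. This is only a sketch under a loud assumption: each attempt is treated as independent, with the same chance p of turning out excellent. Under that assumption, the chance that at least one of n attempts is excellent is 1 - (1 - p)^n. The function name and the 5% figure below are made up for illustration, not measurements.

```python
def chance_of_at_least_one_excellent(p: float, attempts: int) -> float:
    """Probability that at least one of `attempts` independent tries is excellent.

    Assumes every attempt independently has the same chance `p` of being
    excellent. That is a simplification for illustration, not a measured fact.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if attempts < 0:
        raise ValueError("attempts must not be negative")
    # All attempts fall short with probability (1 - p) ** attempts,
    # so at least one succeeds with the complement of that.
    return 1.0 - (1.0 - p) ** attempts


# Hypothetical 5% chance per prototype, for a few prototype counts.
for n in (1, 5, 20):
    print(n, round(chance_of_at_least_one_excellent(0.05, n), 3))
```

Real prototypes are not independent, since you learn from each one. If anything that should help the argument: later attempts get better odds than earlier ones.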
Using LLMs well
So, how should we use LLMs well, to produce good software?
It's early days, so we don't really know the best work styles yet. But there are a few things that have held true so far in my early work with them, and what I've observed from others.
Use them on things you could do yourself, as an accelerant. Where we get into trouble is with using LLMs for coding tasks we are unfamiliar with and which are high stakes, since we can no longer check their work. If there is a fatal flaw, we cannot review it to catch it![3]
Check their work diligently. It is not enough to have the LLM generate code that seems to work. You must check that it does do what you asked for and what you wanted (these may be at odds). This takes time, but is an important part of any software engineering review process.
Learn the models' limits and strengths, and use them for their strong suits. LLMs are good at some sorts of tasks, and poor at others. Present models cannot write large programs independently, if for no other reason than limits on their context and thus their memory. And they have gaps in knowledge, or things they're just not good at (such as figuring out issues with lifetimes in Rust; also a challenge for humans). Use them just for the things where you are confident they'll do well. But also experiment with other things, to see what the limits are and what changes over time!
Use them for repetition, tedium, and test generation. Anything which is repetitive and tedious is ripe for automation with LLMs. They're very good at repeating structures, so repetitive tasks are usually easy for an LLM. They also excel at generating test cases, some of which will be valid and some of which will not. Automating these things lets you spend less time on them so that you can spend more time on parts you are uniquely good at. That also includes tests: let the machine test the obvious things, and spend more time thinking about what tests you want.
Don't expect novelty. In general it doesn't seem like you can expect completely novel solutions to things from these models. If it is something which is generally tried and true, the model can do it. Glue together APIs, yep! But it won't come up with a clever new algorithm to solve your problem. You've got to do that with your head meat.
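To make the advice about checking work and splitting up the testing a bit more concrete, here is a minimal sketch. Everything in it is hypothetical: `slugify` stands in for some small function a model wrote for you, and the check functions are names I made up. The first set of checks is the obvious kind a model will happily generate. The second set is the kind you write yourself, after thinking about what you wanted rather than just what you asked for.

```python
import re


def slugify(title: str) -> str:
    """Hypothetical stand-in for a small function generated by an LLM."""
    # Collapse every run of characters that are not a-z or 0-9 into one dash.
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower())
    # Drop any dashes left over at either end.
    return slug.strip("-")


def check_obvious_cases() -> None:
    # The happy-path cases a model will readily produce on request.
    assert slugify("Hello World") == "hello-world"
    assert slugify("  Coding with LLMs  ") == "coding-with-llms"


def check_cases_i_thought_about() -> None:
    # Cases chosen by a human reviewer: empty input, input with nothing
    # usable in it, and separators that are already doubled up.
    assert slugify("") == ""
    assert slugify("!!!") == ""
    assert slugify("a--b") == "a-b"


if __name__ == "__main__":
    check_obvious_cases()
    check_cases_i_thought_about()
    print("all checks passed")
```

The split is the point, not the specific function: the first group costs you almost nothing, which frees up time for the second.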
One of the most important things I took away from RC was to learn generously. The idea is that by sharing and being open, the entire community improves and learns more. We all get more out of it that way.
This is especially important in an emerging field like the practical use of LLMs. We all have a lot of learning to do, so as we learn, we should tell others about what we have learned for the betterment of our entire community, our whole field of software engineering.
Doing this isn't always easy or comfortable. I'm not entirely comfortable writing this post, because I feel like I don't know what I'm doing with these yet! (Discounting the fact that no one does, but some people certainly know way more than me.) But the reality is that everyone has a valid, valuable perspective, and sharing when you feel uncomfortable is one of the strongest signs that you are learning and growing.
So please, join me in sharing generously how you work with LLMs, what works well and what doesn't, what your fears are, what your hopes are. We will all improve together if we all share with and learn from each other.
[1] Increasingly I wonder how much humans are "just" statistical models, as well.
[2] This story has an excellent backstory for why it comes in different flavors with different types of teachers.
[3] Related, the ACM Code of Ethics instructs us to "Perform work only in areas of competence", so if you cannot check the work of an LLM, you should probably turn down that work task anyway if it risks any harm.