I have been practising Test-Driven Development for more than 25 years. The red-green-refactor cycle is second nature.

I have of course heard a lot about AI tools and LLMs over the last couple of years. The times I tried them, I was not impressed. Their proposals didn't compile, or they solved the wrong problem, probably because they lacked the context of the project I was working on. Honestly, a lot of what I have seen has been rubbish.

Not too long ago I saw a thread on LinkedIn where some people who I know really know what they are doing discussed their experiences with AI assistants, specifically Claude Code. I decided to give it a try.

I tried some smaller tasks and it worked pretty well.

Then I looked at a rather urgent task: we wanted single sign-on using OIDC on top of OAuth 2.0 in one of our projects. I gave Claude Code a chance and ran out of tokens more or less immediately. Literally within an hour.

For some reason, Claude Code limits the number of tokens you can use in every five-hour window. I had of course signed up for the cheapest plan, and it stopped me. This annoyed me and I quickly changed plans. Now I'm on the next level, which is about ten times as expensive per month, and I don't seem to run out of tokens anymore.

With more tokens, I could implement SSO using OIDC on top of OAuth 2.0 in two days. I had a rough idea of how it worked after reading up on it, but now I could get something working up and running: one day implementing a rough solution with lots of help from the assistant, one day verifying it and sanding off the edges. When I integrated with our first client, it almost worked out of the box. There was one detail I had to figure out that had nothing to do with OIDC. It was an internal detail about how we check two-factor authentication in our login flow, and I had unfortunately forgotten the specifics. But asking Claude Code how we did it for SAML revealed the answer in a few minutes.

So getting something up was fast, and making it work was also fast once I understood the next hurdle. I would say I saved at least three to four days of work at the expense of the tokens Claude Code used. Well worth it.

Thinking at a higher abstraction level

One of the things I have noticed is that I seem to be able to think more about what I want the system to do than about how. I think I have raised my abstraction level a notch. Maybe two.

I work a lot on the tests. At the moment I let my assistant write their skeletons, and I review them. I also demand to see them red, so I know that they fail before any production code is allowed to change. It happens that I revert all changes because the assistant runs off and makes production changes before the tests are in place. That seems to be a limitation of the tooling at the moment, and a very irritating one.

Adding characterizing tests

The code base I mostly work on has good test coverage. But as always, there are holes. Plugging those holes manually is tedious. Really boring. But if I want to refactor and discover that the area I'm interested in is missing tests, I can easily tell the assistant to add some characterizing tests. I of course make sure that I don't have any changes in the production code before I start. That allows me to lean on the unchanged production code and cement its behaviour with characterizing tests. This is obviously not TDD, but someone was sloppy at some point and now I pay for it. Except that amortizing this technical debt isn't that expensive: a few minutes and a review of the generated tests, and all of a sudden the problematic code is covered. Very useful.
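As a minimal sketch of what such a characterizing test can look like, here is a hypothetical example in Java with JUnit 5. The PriceCalculator and its discount rule are invented for illustration; the point is that the expected values are not designed up front but captured from whatever the unchanged code returns today.

```java
import org.junit.jupiter.api.Test;
import static org.junit.jupiter.api.Assertions.assertEquals;

// Hypothetical stand-in for untested legacy code.
class PriceCalculator {
    double priceFor(int quantity, double unitPrice) {
        double total = quantity * unitPrice;
        // Existing, undocumented rule: 10% discount from 10 items and up.
        return quantity >= 10 ? total * 0.9 : total;
    }
}

class PriceCalculatorCharacterizationTest {

    // The expected values were recorded from the unchanged production
    // code, so a later refactoring cannot alter the behaviour unnoticed.
    @Test
    void largeOrdersKeepTheirCurrentDiscount() {
        assertEquals(90.0, new PriceCalculator().priceFor(10, 10.0), 0.001);
    }

    @Test
    void smallOrdersKeepTheirCurrentPrice() {
        assertEquals(30.0, new PriceCalculator().priceFor(3, 10.0), 0.001);
    }
}
```

Once tests like these are green against the untouched code, the refactoring can start with a safety net in place.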

I am still responsible

We are doing continuous deployment in all of the projects I'm involved in at the moment. Changes end up in production and in front of a user immediately, unless I hide them behind a feature flag. Knowing this, and having demanding users 24/7, is a really strong argument for working test-first. It reduces bugs a lot. So I am still very thorough with what I write and what I ship. I can never blame my assistant. It is just an assistant, and I, as its supervisor, am always responsible for the result.
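A feature flag can be as simple as a guard around the new code path. This is a minimal sketch with hypothetical names, not how our flags are actually implemented; real projects typically read the flag from configuration or a feature-flag service.

```java
import java.util.function.BooleanSupplier;

// Minimal sketch of a feature flag guard. Names are illustrative only.
class LoginFlow {

    private final BooleanSupplier oidcEnabled;

    LoginFlow(BooleanSupplier oidcEnabled) {
        this.oidcEnabled = oidcEnabled;
    }

    String authenticate() {
        // The new code path is deployed immediately but stays dark
        // for users until the flag is flipped.
        if (oidcEnabled.getAsBoolean()) {
            return "login via OIDC";
        }
        return "login via existing flow";
    }
}
```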

The red-green-refactor cycle is still a lifesaver

I write or review a failing test. Red.

I ask the assistant to make it pass. Green.

I look at the result and improve it, together with the assistant or on my own. Refactor.
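To make the rhythm concrete, here is a deliberately tiny sketch in Java with JUnit 5. The Greeter is hypothetical; it only illustrates the three steps.

```java
import org.junit.jupiter.api.Test;
import static org.junit.jupiter.api.Assertions.assertEquals;

// Red: a failing test for a hypothetical Greeter. It must be seen to
// fail before any production code is allowed to change.
class GreeterTest {
    @Test
    void greetsPeopleByName() {
        assertEquals("Hello, Ada!", new Greeter().greet("Ada"));
    }
}

// Green: the simplest production change that makes the test pass.
class Greeter {
    String greet(String name) {
        return "Hello, " + name + "!";
    }
}

// Refactor: with a green test as a safety net, names and structure can
// be improved, by me or by the assistant, without changing behaviour.
```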

I really have to use baby steps. Really small and nice steps. I can't review too much at the same time.

My approach is to have a plan, divide it into smaller steps and then do one step at a time. As I always have done. But I get a lot more help now, and it seems to speed me up so far. I get to think more about the path forward while I wait for the assistant. I have started to keep a scratch file with my upcoming steps. The planned steps sometimes grow rather large, but that's OK since I'm usually able to work through all of them within a reasonable time. I rarely have steps left at the end of my working day.
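A scratch file of this kind doesn't need any structure beyond a list. A hypothetical example, with invented steps:

```
Upcoming steps:
1. Failing test: reject expired tokens
2. Make it pass with the simplest possible check
3. Refactor: extract token validation from the login flow
4. Characterizing tests around session cleanup
5. Failing test: refresh token rotation
```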

The feedback loop is still fast. The green step happens quicker, which means I spend more time on the interesting parts: deciding what to test next and improving the design.

Getting started

If you want to try this yourself, start small. Pick a feature you understand well. Write a failing test. Then ask your assistant to make it pass. Refactor with the assistant. Review, commit and push. In my case, a push goes straight to production.

A final word

Always remember: at the end of the day, our job as developers is not typing. It has never been about typing and it still isn't. Our job is to understand the problem and solve it. Understanding the problem is still infinitely more valuable than typing the solution.
