Greg Beech

Exit staging left

2024-12-22T00:00:00+00:00

It’s received wisdom in the startup world that you need to have a staging environment to test things out before going to production. In some cases, maybe even multiple staging environments with different names like preproduction or demo or QA, with different rules about how mature code has to be to go on there. However, I don’t believe staging environments are worth the time, effort, and cost to keep running, so let’s talk about why you should just deploy straight to production.

Building and maintaining staging environments takes time, whether it’s setting up and maintaining the infrastructure, seeding and managing test data, or resetting the environment when it inevitably gets so much cruft on there it diverges too far from production. If you’ve got your infrastructure all configured in Terraform or whatever and automated reset scripts this might not be a huge amount of time, but it’s still time.

Waiting for staging environments takes much more time. The amount of time I’ve spent over the years waiting for deploys to staging of code I already know works fine, and then running cursory checks “just to be sure” because “that’s the process we follow” is shocking. Even if it’s only ten minutes per PR that could be half an hour to an hour per day, and it’s hard to use that time productively because you’re distracted by the pending context switch back to check the deploy and get the PR merged for real.

Given a finite amount of time, staging environments are not the best place to spend it. There’s nothing you can do on staging that can’t be done better in either development, where you’ve got all your tools handy, or in production with real data and real users.

Testing code can be done well in a development environment. This is hopefully not shocking news, but you can write tests that test your code, which will give you–and others–more confidence it works as you expect, and let you evolve it safely in future. You can test ad hoc in development, e.g. for UI changes where tests are hard to write and insufficient for ensuring things feel right. You can run the tests in CI to stop things deploying if you accidentally broke stuff.

Of course pre-production testing won’t cover everything and that’s why you need to spend more time on release practices. Deploy is not release. Use feature flags, rollout groups, gradual rollout, A/B testing or MVT, etc. as needed to make sure your deployed changes are rolled out safely and have the effect you want. If you’re not spending time on staging environments, you’ll have more time and more motivation to get this right.
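To make “deploy is not release” a bit more concrete, here’s a minimal sketch in Scala 3 of a code path that is deployed to production but only released to a rollout group. The names Flag, isReleased, and checkoutPage are made up for illustration, and a real system would load its flags from a flagging service rather than a hard-coded Map.

```scala
// Hypothetical flag model: a global switch plus an explicit rollout group.
final case class Flag(releasedToEveryone: Boolean, rolloutGroup: Set[String])

// A flag that isn't configured counts as unreleased, so new code is dark by default.
def isReleased(flags: Map[String, Flag], flagKey: String, userId: String): Boolean =
  flags.get(flagKey).exists(f => f.releasedToEveryone || f.rolloutGroup.contains(userId))

// Both code paths are deployed; the flag decides which one a given user sees.
def checkoutPage(flags: Map[String, Flag], userId: String): String =
  if isReleased(flags, "new-checkout", userId) then "new checkout"
  else "old checkout"

val flags = Map(
  "new-checkout" -> Flag(releasedToEveryone = false, rolloutGroup = Set("internal-tester"))
)

val forTester = checkoutPage(flags, "internal-tester")  // "new checkout"
val forAlice  = checkoutPage(flags, "alice")            // "old checkout"
```

Releasing to everyone is then a configuration change rather than another deploy.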

The argument that always comes up at this point is what about infrastructure changes? How can you be confident doing things like database upgrades or k8s upgrades or any of those scary things in production if you haven’t done them in staging? Let me ask you a different question: How can you be confident doing it even if you have done it in staging, given staging is always different? Either way you need to make the change safe in production itself, so you’d do things like create or upgrade read replicas first, or run on multiple production clusters so if one goes down you’re still up. And guess what? If you’re not spending time on maintaining staging, you’ll have more time and motivation to do this too.

OK but what about integrations environments for customers? You’re probably starting to get the idea now, but run them in your production environment too. You’ve hopefully already got ways of separating out metrics etc. so you don’t count your own team in them, so just do that better and separate out integrations metrics too. If your mechanism for separating these things out is bad then by not having a staging environment you’ll have more time and motivation to do this better. If you’re worried about integrations interfering with real customers, you’ll spend more time on isolation, rate limiting, etc.

Performance testing is usually the next argument. I usually find this to be whataboutism because (a) most companies don’t actually do performance testing, and (b) you can’t do it on staging anyway unless it’s identical to production–albeit with fewer replicas and so on–which it isn’t because that’s too expensive. Anyway, if you really do performance testing then you could possibly do it on production where you’ve already got a production-like dataset, using the capabilities you set up for infrastructure & integrations environments to separate infra & metrics. But maybe not, in which case I’d say fair enough spin up a performance environment for that as and when you need it. But that’s not really staging.

We could go on but what I’m trying to get you thinking about is for every reason you could give to need a staging environment, there’s usually another answer which lets you level up either your development or production environments instead. You know, the ones that actually matter. Before just defaulting to a staging environment, at least ask yourself whether you really need it, and what you could do better with the time you save by not having one.

I’ll end by noting this isn’t just a hypothetical post. We’ve built Jitty for the last year without a staging environment simply by using good development and release practices. You can do it too.

No more roadmaps

2023-12-01T00:00:00+00:00

Product managers spend aeons creating and rearranging them. Delivery managers obsess over their dates and dependencies. People in “the business” plan financials, campaigns, and more around them. But engineers—the people tasked with actually delivering them—mostly moan about them. The engineers are closest to being right, because roadmaps are little more than detrimental fantasy.

We’ll talk about why, and what I suggest you do instead, but first I want to be clear this isn’t merely a hypothetical opinion piece. At the start of this year I proposed a complete ban on roadmaps at Zego and while lots of people were worried about the implications on the way they work, fatigue with the failed promises of roadmaps led to everyone agreeing to give it a try.

The topic of roadmaps never came up again. Nobody missed the time lost creating or maintaining them. Nobody missed the pain of trying to change what was baked into them. Nobody missed planning around dates that never materialised anyway. Roadmaps just went away, and nobody missed them.

So what happens when roadmaps just go away? What do you do instead?

The biggest concern most organisations will have about ditching roadmaps is how they’ll plan when there are no dates. For example, how can a new marketing campaign be planned if the marketing team don’t know the projected launch date? How can financial projections be made if key rollout dates aren’t known? The solution, in the immortal words of Maximus Decimus Meridius, is “we’ve got a better chance of survival if we work together”.

That’s it. That’s the answer. You work together. It really is that simple.

When planning and running projects, don’t have only engineering involved. Involve all the people and teams across the company you’ll actually need for a successful launch. Instead of keeping them at arm’s length and just updating them on the launch date, get them involved in planning and catch-ups, work with them on requirements, and see what solutions they come up with. Make them part of the project. When you’re working on it, they’re working on it, and you go live when all teams are in a state to do so, not when engineering decides they are ready.

Okay, sure, that’s the shorter-term stuff. But what about the longer term planning? What about things that would normally be planned months in advance?

Ignore them.

If you’re not actively working on a project right now, then don’t bother planning things around its launch because the date won’t be when you think, and it might not even happen. Business priorities shift, particularly once you’ve got rid of roadmaps so they’re allowed to shift, and what seems important now might not seem so relevant in a couple of months’ time. Anything you do in anticipation has the potential to be wasted effort, so don’t do it. Focus on things that are actually happening soon.

How soon is soon? I’d suggest about six weeks is the sweet spot. I don’t really have any good rationale for this other than six weeks feels like a good sort of timeframe to have an idea of what you’re doing. It’s long enough to deliver some pretty serious value without having to shoddily hack a temporary solution together, but short enough that you won’t be too swayed by the sunk cost fallacy if you realise it’s not working out or you want to go in a different direction.

Sometimes, though, you do need to plan longer term. You’re always going to have a small number of projects that need to hit a particular date, for example launches that have a significant impact on a financial model, or improvements to meet regulatory requirements, or upgrades to meet a deprecation date. This doesn’t mean you have to throw the baby out with the bathwater; you can treat these as special cases where you work backwards from the dates and ring-fence people to work on them, while taking the shorter-term planning approach with everything else, and keep much of your agility.

Agility. That’s the huge benefit you get when you ditch roadmaps. You start to get some business agility back because you’ve no longer got the next six-to-twelve months of deliverables baked into a Gantt chart, but instead you can constantly reevaluate what’s most important to the business and whether your plan for the next few weeks is delivering the most value, in the most efficient way, and change it if not.

I know that’s probably not the definition of agility you’re used to, but that’s because I’m talking about real agility, not the Capital-A Agility from the Deloitte playbook. I’ve got a lot more to say about agility, but it’ll have to wait for future posts because we need to get back on track denigrating roadmaps.

Anyway, to get the benefits of agility, it’s essential these six-ish week periods where you’ve got plans aren’t treated as iterations where you do six weeks and then plan the next six weeks. Instead, you need to treat it as a sliding window where the next six weeks from now is always being evaluated, current projects are being assessed to see if they’re on track and delivering the right value, and ideas are moving into and out of scope. This should be a continuous process, where decisions are made daily or—at most—weekly about changes to the current plans.

You’ll find you change plans a lot more than you thought you would, because you’re always assessing them against the value they can deliver.

That essential project you thought would take eight people two months, but now it’s better understood you realise it’s going to take twelve people and another three months, does it still deliver sufficient value for the new effort? Are there tactical solutions you could take instead which might not be as satisfying from a technical or product perspective, but which deliver sufficient value sooner for much less effort? If you’re evaluating what you’re working on constantly you’ll be much more likely to ask these questions—and act on the answers—than if you’re just shuffling out end dates on a fixed roadmap.

Those little features and fixes that never got onto the big static roadmaps because they were too small or insignificant, can you now get them done when you see they could easily be shipped in a couple of days and add incremental value, or ease the frustrations of customers?

Those unplanned events like realising your database needs attention because you’re maxing out your IOPS at peak, or a security report which you need to spend a few days investigating and patching, or the engineer with critical knowledge on a project who called in sick and will be out for a fortnight. These change your immediate ranking of what has value and what can be delivered, so your plans change to accommodate them, and it doesn’t faze you because you’re doing that all the time anyway.

You’re no longer entrenched in the routine of delivering something because the roadmap says that’s what you have to deliver. You’re always looking at what can deliver the most value in the best way.

A lot of companies go through the phase of thinking they need to introduce OKRs because they realise they’re busy all the time but nothing that really matters seems to have changed much over the last year or so. I’ve been through that myself. I’ve introduced OKRs myself. But it doesn’t work if you still have roadmaps because you’ll always be coerced back into the mindset of thinking in terms of deliverables rather than value.

If this sounds familiar to you—and trust me, I know this sounds very familiar to a lot of you—then don’t try and fix your lack of ability to deliver value with sticking plasters like OKRs. Fix the root causes instead. Like roadmaps.

No more roadmaps.

Don’t hire

2023-10-12T00:00:00+00:00

Most job specs are awful. They have a “main responsibilities” section filled with vague generic bullet points listing things you’ll never actually do, followed by an interminable disjointed list of “essential” skills and experience that will never actually be needed. Have you ever wondered why this is?

Spoiler: It’s because the role isn’t necessary, and the company shouldn’t be hiring for it.

Hiring is the worst thing you can do. I don’t mean hiring bad people, I mean hiring any people. Hiring people makes your company bigger, which means less consensus, more communication overhead, greater division of knowledge, more time with your best people mentoring or managing instead of doing, more time in meetings, and so on. Your delivery speed drops and you start tracking DORA metrics and hiring Agile Coaches or Scrum Masters or Delivery Managers to try and fix things. Which it won’t. It compounds the problem, because the root cause is that you hired people.

“But we’ve got more work than we can cope with.”

I hate to be the one to break it to you, but hiring people doesn’t make this better; it actually makes it worse. The amount of work you need to do doesn’t stay the same as you get more people, it goes up because you have more people so the expectation goes up. However, due to all the problems with more people you deliver sublinearly as you add them, so the amount of work you need to do appears to grow superlinearly. The solution to having too much to do is to do less, not to hire more people.

“But we need to grow the business.”

So grow it with the people you have. The great thing about not hiring people is that you need to grow it less. Let’s say, hypothetically, it’s going to cost you £120k/annum to hire a new person, and you make 3% margin on whatever it is you’re selling, which is pretty common across a range of industries. If you hire that person, you now need to grow the business by an additional £4M/annum just to cover the cost of hiring them. Unless you actually know how that person is going to contribute that £4M, and then some, don’t hire them.
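The arithmetic behind that figure is just cost divided by margin. As a quick sketch in Scala 3 (the function name is mine, and it deliberately ignores everything except the headline annual cost and a flat margin):

```scala
// Extra annual revenue needed so that the margin on it covers a new annual cost.
def revenueNeededToCover(annualCost: BigDecimal, marginPercent: BigDecimal): BigDecimal =
  annualCost * BigDecimal(100) / marginPercent

val extraRevenue = revenueNeededToCover(BigDecimal(120000), BigDecimal(3))  // 4000000
```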

“But I know how they’re going to contribute that £4M.”

Okay, great, in that case go and hire somebody.

No, really. Forget what I said above. If you know what the new person needs to do for you and how they’ll contribute value you otherwise wouldn’t get, go and hire them.

Thankfully, your job spec won’t be one of those crappy ones with a ton of vague generic responsibilities and interminable disjointed skills, because you actually know what they’re going to be doing, and you know which skills are really needed and which can be learned on the job. As a result, it’ll be much more likely to attract the types of candidate you’ll want.

Deriving domain-driven design

2021-12-03T00:00:00+00:00

If you thought about it for long enough, you would invent domain-driven design. This is a talk I gave at DevLab ’21 about deriving domain-driven design from first principles to demonstrate that it’s an intrinsic property of well-architected systems.

Watch the recording on YouTube.

Strengthen your types

2021-06-28T00:00:00+00:00

“Static typing” and “strong typing” are frequently conflated, as for many people a statically typed language implies that programs written with it must also be strongly typed. That’s notionally true as the variables themselves have types, not just the values, but I’d argue that most statically typed programs are actually fairly weakly typed.

That’s a bold claim, so I’ll spend the next few minutes justifying it.

To begin, we need to examine what we mean by strong and weak typing, and where better to start than JavaScript? Here are some vexing expressions taken from the infamous wat talk which produce clearly nonsensical results:

[] + []  // ""
[] + {}  // "[object Object]"
{} + []  // 0
{} + {}  // NaN

You can read the detailed explanation of what’s happening if you like, but it comes down to JavaScript’s surprising implicit type conversions at runtime. To demonstrate that this is a consequence of weak typing rather than dynamic typing we can contrast the results with another dynamically typed language, Python, which doesn’t coerce types at runtime. Here the results are either sensible or the runtime raises a TypeError:

[] + []  # []
[] + {}  # TypeError: can only concatenate list (not "dict") to list
{} + []  # TypeError: unsupported operand type(s) for +: 'dict' and 'list'
{} + {}  # TypeError: unsupported operand type(s) for +: 'dict' and 'dict'

Wikipedia says there is no precise definition of what constitutes weak or strong typing but hopefully those examples are sufficient to convince you that one of its definitions is a good baseline for considering a language weakly typed:

A weakly typed language has looser typing rules and may produce unpredictable or even erroneous results

Assuming we can broadly agree on that definition, let’s get back to the business of showing that most statically typed programs are also weakly typed. Quite fantastically, we don’t need anything more complex than Hello World’s greet function to do so! Here’s what that might look like in Scala 3:

def greet(name: String): String = "Hello " + name + "!"

This is Scala so it’s statically typed, and at first glance it appears strongly typed because the parameter type is declared to be a String, the return type is also declared to be a String, and it clearly does return a String. We know the + operator will always be String concatenation in this context, so the function works as you’d expect:

greet("Alice")  // "Hello Alice!"

However, we can still call this function in ways that produce unpredictable or erroneous results. These might not be as unpredictable or erroneous as some of the JavaScript expressions (we’re not going to get NaN, for example) but they definitely aren’t the results we would like.

greet("  Alice  ")  // "Hello   Alice  !"
greet("")           // "Hello !"
greet(null)         // "Hello null!"

As such, it appears we need to modify our definition of weakly typed. It’s not just whether the language itself is weakly typed, but whether the program written in the language is weakly typed.

A weakly typed program has looser typing rules and may produce unpredictable or even erroneous results

Here we demonstrably have a program with sufficiently loose typing rules that unpredictable or erroneous results are produced. Ergo, this program is weakly typed.

There are two key weaknesses in the parameter type.

Firstly it’s not really constraining the parameter type to be a String because null is not a String and passing null is permitted. By default in Scala 3 the String type actually means String | Null. However, by supplying the -Yexplicit-nulls compiler option we can strengthen the String type to mean exactly String so the final line of code won’t compile:

greet(null)  // [E007] Type Mismatch Error: Found: Null, Required: String

Secondly we’re not constraining the contents of the string to be valid, so we can pass in things like the empty string, or strings with leading or trailing whitespace. This problem could be solved without changing the types by adding branching logic to the greet function; from what I’ve seen over the last twenty years or so, this is the approach most people would use to solve it in most languages:

def greet(name: String): String =
  val trimmed = name.trim
  if trimmed != null && trimmed.nonEmpty then
    "Hello " + trimmed + "!"
  else
    ""

greet("Alice")      // "Hello Alice!"
greet("  Alice  ")  // "Hello Alice!"
greet("")           // ""

This looks better. Unfortunately we’ve moved the problem of unpredictable or erroneous results to the return type because the function’s type of String implies it will always return a greeting, but the empty string isn’t a valid greeting.

You might argue I’m picking nits here, but when it comes to using this function there’s nothing to indicate to the caller they may need to deal with this case so there’s a fair chance no greeting being displayed will be raised as a bug somewhere down the line, and it’ll be nontrivial to work out that it was an empty string that somehow got into the system causing it. Half a day gone.

We can fix this by strengthening the return type to an Option[String] which tells the caller that greet may not be able to construct a greeting if the input is invalid, i.e. it is a partial function rather than a total function. Now when the function is used, the caller explicitly has to handle the failure case and decide what to do. Half a day back.

def greet(name: String): Option[String] =
  val trimmed = name.trim
  if trimmed != null && trimmed.nonEmpty then
    Some("Hello " + trimmed + "!")
  else
    None
    
greet("Alice")      // Some("Hello Alice!")
greet("  Alice  ")  // Some("Hello Alice!")
greet("")           // None

This function now arguably meets our definition of strongly typed, because whatever you pass into it, the result is predictable and it is hard to subsequently make erroneous use of it. So, we’re done, right?

No. Unfortunately not.

This function might now be hard to misuse, but to make it that way we’ve broken the single responsibility principle. It has the dual responsibilities of understanding what makes a name valid, and also formatting a greeting. This might not seem like too much of a problem, but over time it means other functions dealing with names will have to reimplement the same logic, might do it slightly differently, and in a long-lived codebase you’ll end up with numerous different implementations of the same basic concept and won’t know which is correct (if any).

The reason it’s having to break the single responsibility principle is because the types are still too weak. So let’s strengthen them again and instead of name being a String introduce a proper FirstName type and move the validation into its smart constructor fromString to ensure that only valid instances can be constructed; this approach is sometimes called “making illegal states unrepresentable”:

object types:
  opaque type FirstName = String
  
  object FirstName:
    def fromString(name: String): Option[FirstName] =
      val trimmed = name.trim
      if trimmed != null && trimmed.nonEmpty then
        Some(trimmed)
      else
        None
      
import types.*

def greet(name: FirstName): String = "Hello " + name.toString + "!"

FirstName.fromString("Alice").map(greet)      // Some("Hello Alice!")
FirstName.fromString("  Alice  ").map(greet)  // Some("Hello Alice!")
FirstName.fromString("").map(greet)           // None

The above code might benefit from a little explanation. The opaque type line defines FirstName as being a String, but because it’s opaque that fact isn’t known outside the enclosing types object, so if you try to call greet("Alice") you’ll get a compilation error that FirstName was expected but String was found. However, inside the types object the equivalence of FirstName and String can be used to return a String as a FirstName after validation.

Here FirstName.fromString has the single responsibility of understanding what makes a first name valid, and greet has the single responsibility of formatting the greeting. Note that greet can go back to returning a String rather than an Option[String] because the input must always be valid, so it’s a total function rather than a partial function. We could go further than this and return a NonEmptyString or even a Greeting (and we should!) but I think you get the idea by now so that is left as an exercise for the reader.

These improvements might not appear to be much of a benefit in this little sample of code, but when additional functions need a first name they can reuse the same type, and won’t need to implement any argument validation, or negative tests for invalid input. The functions will have less branching which makes them easier to reason about, and the code will likely have fewer bugs as a consequence.

There are myriad other benefits to strengthening your types. The obvious one is that it makes the code more self-documenting which is a major benefit for readability and maintainability, especially in long-lived codebases where the original authors may no longer be around, or may be busy with other things.

It also helps to prevent trivial mistakes in code. If you had a method expecting a last name or an email address then if the types are all String you can pass one where another is expected and the compiler can’t help you. When you use distinct FirstName, LastName, and Email types then it’s not possible to mix them up.
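As a rough sketch of that in Scala 3, here are two opaque types over String that can’t be swapped for one another. The people object, the LastName type, and the fullName function are assumptions for illustration, and the validation simply reuses the trim-and-check rule from earlier.

```scala
object people:
  opaque type FirstName = String
  opaque type LastName = String

  object FirstName:
    def fromString(raw: String): Option[FirstName] =
      val trimmed = raw.trim
      if trimmed != null && trimmed.nonEmpty then Some(trimmed) else None

  object LastName:
    def fromString(raw: String): Option[LastName] =
      val trimmed = raw.trim
      if trimmed != null && trimmed.nonEmpty then Some(trimmed) else None

import people.*

// Outside `people` the two types are unrelated, even though both wrap a String.
def fullName(first: FirstName, last: LastName): String = s"$first $last"

val ada: Option[String] =
  for
    first <- FirstName.fromString("Ada")
    last  <- LastName.fromString(" Lovelace ")
  yield fullName(first, last)
// ada is Some("Ada Lovelace")

// fullName(last, first) or fullName("Ada", "Lovelace") would be rejected by the
// compiler, because neither a LastName nor a plain String is a FirstName.
```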

A less obvious benefit is that it helps to enforce good architectural practices. If all your business logic deals with strong types rather than strings or integers then it forces validation of input up to the boundary layers where it should be, not merely by convention, but by necessity because you cannot call lower layers of code without doing so!

Returning to the claim that most statically typed programs are pretty weakly typed, I don’t have proof but I’ll take a bet that the majority of the code written by the majority of people reading this looks much more like the initial version of the function with its weak String parameter and String result types than the final version with its strong types. That’s fine; I’m in that majority.

But there’s always time to make amends.

Any time you find yourself needing to validate arguments, give a function multiple responsibilities, or make a function partial rather than total, consider whether you can strengthen your types to remove those problems.

Originally posted on the Zego Product Engineering blog

Feature flags

2020-11-02T00:00:00+00:00

Introducing feature flags helps to separate release from deployment, and allows changes to be partially rolled out on a percentage basis and/or to target groups of users. Flags cannot completely separate these two things, as some changes such as database migrations are all or nothing. However, they can significantly reduce the risk of deployments.

The trouble with feature flags is that they sound so simple: They’re just an on/off switch, right? Engineers tend to underestimate how much complexity there is in implementing them well, so they’ll either spend too little time choosing an off-the-shelf system and later find out it won’t scale with the company, or spend hundreds of engineering hours building it themselves (and probably still find out it won’t scale with the company).

We’ve just rolled out LaunchDarkly at Zego after evaluating a number of open-source and SaaS offerings. This post contains the criteria we used to select a provider so we can be confident they’ll scale with us. It’s probably not an exhaustive list, but it’ll give you a starting point.

There are some table stakes requirements like multiple environments (production, staging, development, etc.), targeting groups using attributes, partial rollout of flags, archiving flags, and client libraries for all your backend and frontend languages. Pretty much every feature flagging system will support these. However, there are many nonobvious requirements that are also essential.

Must haves

In-memory evaluation (performance, reliability) - It’s likely that requests will pass through a number of feature flag gates during execution; network calls to evaluate features increase latency as even a couple of milliseconds for each of a number of feature flags adds up, and reduces reliability as those calls may fail or be slow.
Second-level persistent cache (reliability, startup time) - When applications start they need to load feature flags, and going to the source takes time. There must be a second-level persistent cache (e.g. Redis, DynamoDB) that applications can load initial configuration from which improves startup time, means apps can start even if the SaaS service is down, and in emergencies lets you edit settings manually.
Rollout stickiness (sanity) - When flags have a partial rollout it’s essential that when given the same parameters the decision is the same each time, because a request may pass through a gate multiple times, or a view may make multiple requests, and the decision needs to be consistent otherwise vexing bugs will occur.
Rapid updating of flags (incident management) - When we want to turn things off because they have gone wrong, being able to change the configuration quickly is important. Flags should be updated within, say, 30 seconds, either by polling or by event-driven updates (less is better).
Explaining why a result was returned (debugging) - As feature flags and target groups get more complex, which they will, being able to see why a result was returned for a user is essential. This is even better if it can be done before the flag setting is changed (e.g. a UI that lets you see what the result would be for a given input).
History of changed flags (compliance, debugging) - It’s useful to be able to see who changed which flags, and may be important for some security cases. Bonus points if there’s a feed of changes, ability to see changes in a time window, or publishing changes to places like Datadog. Extra bonus points if comments can be added to changes.
Don’t send server-side flags to clients (competitive advantage) - Feature flags are used to gate new features, so broadcasting your feature flags to clients (i.e. web UI or mobile) can be used by competitors to predict what features you will launch and reduce your advantage; it must be possible to share some but not all flags with clients.
Single sign-on (security) - Incomplete offboarding of people who have left a company is one of the biggest security risks, so the management tools should have SSO for users to ensure they are offboarded automatically. Even if you’re not going to use it initially to save costs (SSO tends to be Enterprise Plan $$$) you will need it as you grow.
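
Rollout stickiness is the requirement that’s easiest to show in code. One common way to get it is to hash the flag key together with the user identifier and compare the resulting bucket to the rollout percentage. The sketch below is in Scala 3 with made-up function names; it illustrates the idea and isn’t a description of how LaunchDarkly or any other product implements it.

```scala
import java.nio.charset.StandardCharsets
import java.security.MessageDigest

// Maps a (flag, user) pair to a stable bucket from 0 to 99.
// Including the flag key gives each flag its own independent split of users.
def bucket(flagKey: String, userId: String): Int =
  val bytes = s"$flagKey:$userId".getBytes(StandardCharsets.UTF_8)
  val hash = MessageDigest.getInstance("SHA-256").digest(bytes)
  // Read the hash as a non-negative integer, then reduce it to a percentage bucket.
  (BigInt(1, hash) % BigInt(100)).toInt

// The same inputs always land in the same bucket, so the decision is sticky.
def isEnabled(flagKey: String, userId: String, rolloutPercent: Int): Boolean =
  bucket(flagKey, userId) < rolloutPercent

val first  = isEnabled("new-checkout", "user-42", rolloutPercent = 25)
val second = isEnabled("new-checkout", "user-42", rolloutPercent = 25)
// first == second, whichever way the decision went
```

Because a user’s bucket never changes, raising the percentage only ever adds users to the rollout; nobody who already has the feature loses it.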

Should haves

Partial updates (network) - Over time it’s likely you’ll get to hundreds or even thousands of feature flags. If updates are loaded from a single file, as some services do, then this can cause significant network usage especially from clients or when not centralised. Ideally updates should only update the things that have changed.
Good documentation (ease of use) - Growing companies onboard people frequently and if the documentation is bad then people won’t read it and they’ll make mistakes. It’s not essential as they can learn from examples, but it really helps.
Target groups (ease of use) - It’s possible to create properties to represent target groups, e.g. whether a user is an employee, but it can make configuration easier if it’s possible to create target groups from common properties and target those groups directly. Bonus points for being able to combine conditions and groups with Boolean logic.
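
To illustrate what combining Boolean logic on conditions and groups might look like, here’s a small sketch in Scala 3 that models targeting rules as an expression tree. The User, Rule, and matches names are hypothetical and not taken from any flagging product.

```scala
// Hypothetical user model: an id plus free-form string attributes.
final case class User(id: String, attributes: Map[String, String])

// Targeting rules as a small expression tree.
enum Rule:
  case AttributeEquals(name: String, value: String)
  case InGroup(groupName: String)
  case And(left: Rule, right: Rule)
  case Or(left: Rule, right: Rule)
  case Not(rule: Rule)

// Evaluates a rule for a user. Unknown groups match nobody.
// Assumes group definitions don't refer to each other in a cycle.
def matches(rule: Rule, user: User, groups: Map[String, Rule]): Boolean =
  rule match
    case Rule.AttributeEquals(name, value) => user.attributes.get(name).contains(value)
    case Rule.InGroup(groupName)           => groups.get(groupName).exists(matches(_, user, groups))
    case Rule.And(left, right)             => matches(left, user, groups) && matches(right, user, groups)
    case Rule.Or(left, right)              => matches(left, user, groups) || matches(right, user, groups)
    case Rule.Not(inner)                   => !matches(inner, user, groups)

// A reusable group, defined once in terms of a property.
val groups = Map("employees" -> Rule.AttributeEquals("employee", "true"))

// Target employees who are outside the UK.
val target = Rule.And(Rule.InGroup("employees"), Rule.Not(Rule.AttributeEquals("country", "GB")))

val alice = User("alice", Map("employee" -> "true", "country" -> "FR"))
val bob   = User("bob", Map("employee" -> "true", "country" -> "GB"))

val aliceTargeted = matches(target, alice, groups)  // true
val bobTargeted   = matches(target, bob, groups)    // false
```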

Nice to haves

Tracking of result distribution (debugging) - Being able to see a breakdown of feature flag results can make it easier to ensure that your rollout percentages and target groups are doing what you expect. This can be done manually with Datadog or similar but it’d be nice to have built-in.
Scheduled flags (ease of use) - Turn features on or off on a schedule rather than having to do it manually. This can help when marketing or other teams want to launch at a particular date and time, especially if you’re targeting users in Australia and you don’t want to be up at 3am.
Flag groups (ease of use) - Sometimes feature flags are related and a number of them can be used to control rollout or behaviours of larger features. Ideally flags should be able to be grouped or tagged or have some other way to organise them and find related flags.
Open source clients (debugging, bug fixing) - The clients will inevitably have some bugs and so it’s useful for them to be open source which will allow us to fork and patch them if necessary. This isn’t essential as hopefully we won’t need to do it.

The other flagging system we evaluated which met the requirements was Split, but we went with LaunchDarkly because it’s a bit nicer to use, the documentation is much clearer, and it has a number of small, less obvious features which weren’t part of our evaluation criteria but which are nice to have. I’d be happy to use Split though. Everything else came up a long way short.

Modelling composite types

2020-02-22T00:00:00+00:00

A composite type is one that can have different possible types of value, for example contact info might be either a phone number or an email address. Other common names for this—depending on programming language and context—are sum types, tagged unions, disjoint unions, discriminated unions, coproducts, or variant types (not the same as the old COM concept).

These are easy to model in functional languages that use algebraic type systems, because they can directly represent composite types. However, in object-oriented languages there is no ideal way to model them. There is another approach, neither algebraic nor object-oriented, which might be the best of both worlds though.

To explore this, we’ll model a contact info type which can be either a phone number or an email, show how it’s used, and then evolve it to also add a website URL alternative.

Algebraic modelling

For the algebraic type system I’m going to use F# as it has particularly clean syntax for domain modelling. The contact info can be modelled as a discriminated union where | indicates a choice between the type constructors, i.e. a ContactInfo is either a Phone containing a string, or an Email containing a string.

type ContactInfo =
    | Phone of string
    | Email of string

To use this we pattern match over the type. The following code defines a function called contact and extracts either the number or address string in the pattern match, then takes the appropriate action (let’s assume that call and message are functions that somehow exist and know what to do).

let contact contactInfo =
    match contactInfo with
    | Phone number  -> call number
    | Email address -> message address

If you’re not familiar with F# then this code might look strange because there are no types and no braces! F# uses spaces for function application so the first line declares a function named contact which takes an argument named contactInfo. It doesn’t need any explicit types because it can infer from the pattern match branches that the argument must be of type ContactInfo.

To evolve the type and add a website, we just add another case. It doesn’t matter that the contained type is a Uri rather than a string because the type constructors Phone, Email and Website are independent of each other and don’t need to have any shared interface.

open System

type ContactInfo =
    | Phone of string
    | Email of string
    | Website of Uri

When we add this new Website type constructor we’ll get a compiler error as the match is now not exhaustive, so we need to add a case to any matches throughout the codebase. That’s just a matter of adding another line.

let contact contactInfo =
    match contactInfo with
    | Phone number  -> call number
    | Email address -> message address
    | Website url   -> browse url

That was pretty neat. Adding a different type didn’t require any existing lines of code to be changed, and the compiler told us everywhere that would be affected. The downside is that this type can’t be externally extended, i.e. only the author of the type can add a new variant, but that isn’t an issue for business domains which are inherently closed.

Now let’s see how we can model this concept in an object-oriented language.

Object-oriented attempt 1: Tagging

The approach I see most often used for this situation in object-oriented languages is to add an enum tag to the class indicating the type of value. This is probably because it corresponds to the way you’d store the data in a relational database (a column for the type and a column for the value) so it’s easy to use with object-relational mappers.

public enum ContactMethod {
    Phone,
    Email
}

public sealed class ContactInfo {
    public ContactInfo(ContactMethod method, string value) {
        this.Method = method;
        this.Value = value;
    }

    public ContactMethod Method { get; private set; }
    public string Value { get; private set; }
}

In use this looks broadly similar to, if somewhat more verbose than, the functional approach where the type is matched and then the value is used.

public static void Contact(ContactInfo contactInfo) {
    switch (contactInfo.Method) {
        case ContactMethod.Phone:
            Call(contactInfo.Value);
            break;
        case ContactMethod.Email:
            Message(contactInfo.Value);
            break;
    }
}

However, when we come to evolve it to add a website with a Uri value we have a problem, because the value is defined as a string. Generics don’t help here: if each kind of contact info had a different type parameter, we could no longer do things like put multiple contact infos in a list (at least not without existential types in the list).

We’ll have to settle for storing the value as a string and then converting it to a Uri when it’s read with a different accessor.

using System;

public enum ContactMethod {
    Phone,
    Email,
    Website
}

public sealed class ContactInfo {
    public ContactInfo(ContactMethod method, string value) {
        this.Method = method;
        this.Value = value;
    }

    public ContactMethod Method { get; private set; }
    public string Value { get; private set; }

    public Uri ValueAsUri {
        get {
            // Will throw an exception if Value isn't a valid URI
            return new Uri(this.Value);
        }
    }
}

In use this feels more unpleasant as we now have to remember to call the correct accessor based on the tag. This relationship between logical type and accessor isn’t enforced by the type system so we’ve sacrificed some type safety, and the code is no longer exception-safe even though that isn’t obvious from looking at it.

public static void Contact(ContactInfo contactInfo) {
    switch (contactInfo.Method) {
        case ContactMethod.Phone:
            Call(contactInfo.Value);
            break;
        case ContactMethod.Email:
            Message(contactInfo.Value);
            break;
        case ContactMethod.Website:
            Browse(contactInfo.ValueAsUri);
            break;
    }
}

In general, this kind of tagged data approach is not very future-proof in object-oriented languages. We could just about get away with the evolving requirements here as the data is almost the same shape, but what if the requirement was to add a phone type (e.g. home, mobile, etc.) to the phone number? There’s nowhere to store it.

Let’s see if we can do better.

Object-oriented attempt 2: Inheritance

For our second attempt we’ll use the proper object-oriented approach of inheritance and subtype polymorphism. I’ve used an abstract base class here, but an interface could be used instead without changing any of the modelling discussion.

public abstract class ContactInfo {
    public string Value { get; protected set; }
}

public sealed class Phone : ContactInfo {
    public Phone(string value) {
        this.Value = value;
    }
}

public sealed class Email : ContactInfo {
    public Email(string value) {
        this.Value = value;
    }
}

Unfortunately, the above code makes one of the most common mistakes of object-oriented modelling which is exposing the object’s data instead of its behaviour. We shouldn’t be switching on the type of the class and reading the Value property, but should instead call a method on it and allow the object to respond correctly based on its runtime type.

Let’s change the interface to hide the data and expose the desired behaviour instead. This now works nicely as we can call the Contact() method on any instance.

public abstract class ContactInfo {
    public abstract void Contact();
}

public sealed class Phone : ContactInfo {
    private readonly string number;

    public Phone(string number) {
        this.number = number;
    }

    public override void Contact() {
        Call(this.number);
    }
}

public sealed class Email : ContactInfo {
    private readonly string address;

    public Email(string address) {
        this.address = address;
    }

    public override void Contact() {
        Message(this.address);
    }
}

We can also cleanly add a Website class with a Uri rather than string value because the type of the value isn’t exposed in the interface, and much like the functional approach we are forced to implement the behaviour as it’s part of the class contract. This is how object-oriented design is supposed to be done. It’s unfortunate that many of the languages have standards (e.g. Java Beans) or features (e.g. auto-implemented properties) that encourage programmers to do the wrong thing by default.

using System;

// other code the same as before

public sealed class Website : ContactInfo {
    private readonly Uri url;

    public Website(Uri url) {
        this.url = url;
    }

    public override void Contact() {
        Browse(this.url);
    }
}

All good then? Not quite. Unfortunately, subtype polymorphism is only viable when you can actually modify the classes themselves when you need to add behaviours. It can also easily lead to very large classes that have high coupling and low cohesion as everything related to the class ends up in there (how many barely related methods do your User or Order or similar classes have, for example?).

Subtype polymorphism is a great approach for functionality that is intrinsic to the type, but for other things you might want to do with it (e.g. converting it to data transfer objects for rendering in APIs or UIs) another approach is necessary.

Object-oriented attempt 3: Visitor pattern

We’ll start off with subtype polymorphism again, but this time we will expose the data because the visitor pattern does dispatch based on type. Note, however, that there are no shared properties or behaviour and so ContactInfo becomes effectively a marker interface.

public abstract class ContactInfo {}

public sealed class Phone : ContactInfo {
    public Phone(string number) {
        this.Number = number;
    }

    public string Number { get; private set; }
}

public sealed class Email : ContactInfo {
    public Email(string address) {
        this.Address = address;
    }

    public string Address { get; private set; }
}

Rather than having each specific visitor implement the boilerplate for the visitor pattern, we can implement a visitor base class and then allow specific visitors to override the methods that handle the concrete types. This code uses C# 7.0’s pattern matching feature, which binds a variable as part of an is check, so there is no need for an additional cast.

using System;

public abstract class ContactInfoVisitor {
    public void Visit(ContactInfo contactInfo) {
        if (contactInfo is Phone phone) {
            Visit(phone);
        } else if (contactInfo is Email email) {
            Visit(email);
        } else {
            throw new NotSupportedException();
        }
    }

    protected abstract void Visit(Phone phone);
    protected abstract void Visit(Email email);
}

public sealed class ContactVisitor : ContactInfoVisitor {
    protected override void Visit(Phone phone) {
        Call(phone.Number);
    }

    protected override void Visit(Email email) {
        Message(email.Address);
    }
}

We can now evolve this to add the website. Unfortunately again we don’t get any compiler errors when we add the Website class saying that it isn’t handled, so we need to remember to update our visitor base class in lockstep. Fortunately with the abstract methods we will get errors in the derived visitors when we add the method to the base class, so there’s some safety there at least.

using System;

// other entity classes as before

public sealed class Website : ContactInfo {
    public Website(Uri url) {
        this.Url = url;
    }

    public Uri Url { get; private set; }
}

public abstract class ContactInfoVisitor {
    public void Visit(ContactInfo contactInfo) {
        if (contactInfo is Phone phone) {
            Visit(phone);
        } else if (contactInfo is Email email) {
            Visit(email);
        } else if (contactInfo is Website website) {
            Visit(website);
        } else {
            throw new NotSupportedException();
        }
    }

    protected abstract void Visit(Phone phone);
    protected abstract void Visit(Email email);
    protected abstract void Visit(Website website);
}

Finally we can use this to provide external extensibility to the class in a safe way.

public sealed class ContactVisitor : ContactInfoVisitor {
    protected override void Visit(Phone phone) {
        Call(phone.Number);
    }

    protected override void Visit(Email email) {
        Message(email.Address);
    }

    protected override void Visit(Website website) {
        Browse(website.Url);
    }
}

Look back at the visitor pattern again though. We’ve exposed disparate properties on the entity classes, dispatched on their concrete types, and handled each case individually. It should be evident that the visitor pattern as used here is just a poor facsimile of the functional language’s built-in discriminated union support, with a lot more boilerplate and a little less help from the compiler.

Unfortunately with traditional object-oriented languages we haven’t found an ideal approach for modelling composite types.

Ad hoc interface implementation

This brings us neatly around to another form of modelling which is neither algebraic nor object-oriented. It’s the approach used in Go’s interfaces and Rust’s trait objects, as shown below. Note that in Go you wouldn’t tend to declare the interface along with the structures, but only when you need to make them implement a common behaviour.

import "net/url"

// these types get defined up-front

type Phone struct {
        Number string
}

type Email struct {
        Address string
}

type Website struct {
        URL url.URL
}

// the interface and implementation can be defined later anywhere else

type ContactInfo interface {
        Contact()
}

func (p *Phone) Contact() {
        call(p.Number)
}

func (e *Email) Contact() {
        message(e.Address)
}

func (w *Website) Contact() {
        browse(w.URL)
}

This ad hoc interface implementation for disjoint types is also supported in Python. That probably isn’t too surprising as Python can’t decide what type of language it wants to be, so it chucks a bit of every paradigm into the mix. It takes the approach of defining a ‘base’ method and registering additional methods as handlers, using metaprogramming rather than being a language intrinsic. This ‘base’ method is effectively the interface definition.

from dataclasses import dataclass
from functools import singledispatch

# these types get defined up-front

@dataclass
class Phone:
    number: str

@dataclass
class Email:
    address: str

@dataclass
class Website:
    url: str  # No URL type in Python :-(

# the implementation can be defined later anywhere else; no interface needed

@singledispatch
def contact(_contact_info: object):
    ...

@contact.register(Phone)
def __contact_phone(phone: Phone):
    call(phone.number)

@contact.register(Email)
def __contact_email(email: Email):
    message(email.address)

@contact.register(Website)
def __contact_website(website: Website):
    browse(website.url)

Ad hoc interface implementation, or ad hoc polymorphism as it’s more commonly known, is a more powerful approach than traditional object-orientation’s subtype polymorphism. (That’s right sports fans, I actually said that there’s a design decision in Go that isn’t terrible. Just one though. Don’t get excited.)
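
One thing to be aware of with the singledispatch approach is that a ‘base’ method with an empty body quietly does nothing for any type that has no registered handler. A minimal sketch of a stricter variant is below, where the base method raises instead. The Fax type is hypothetical, and the handlers return strings purely as stand-ins for the call and message functions so the example is self-contained.

```python
from dataclasses import dataclass
from functools import singledispatch


@dataclass
class Phone:
    number: str


@dataclass
class Email:
    address: str


@dataclass
class Fax:  # hypothetical type that nobody has registered a handler for
    number: str


@singledispatch
def contact(contact_info: object) -> str:
    # Fallback for any type without a registered handler
    raise TypeError(f"Cannot contact {type(contact_info).__name__}")


@contact.register
def _(phone: Phone) -> str:
    return f"calling {phone.number}"  # stand-in for call()


@contact.register
def _(email: Email) -> str:
    return f"messaging {email.address}"  # stand-in for message()
```

This only fails at runtime, so it’s weaker than the compiler error you get in F#, but it’s better than an unhandled type being silently ignored.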

If you’re into functional programming then you’ll recognise this as being conceptually similar to typeclasses, but this post is already running long and I’d need to introduce yet another language that supports them to demonstrate, so I’m going to call it a day.

The engineering career triangle

2020-01-19T00:00:00+00:00

Engineering career progression is often described as a ‘ladder’ or ‘track’, where you can also switch into a parallel management track. However, this doesn’t adequately illustrate where the tech lead role sits, or why the staff+ engineer levels have at least as much in common with management as they do with coding. To be able to explain this we need to separate the concepts of management and leadership. The result is the engineering career triangle.

There’s quite a lot of information encoded in this diagram which I’ll explain in more depth, but the key points are that the black dots indicate a position, solid arrows show career progression within a track, and dashed arrows show a switch between engineering and management tracks.

Job titles vary hugely in engineering so I’ve tried to pick a set that are fairly cohesive and representative of what I’ve seen at a range of companies. It probably won’t exactly match what your company uses.

Management vs leadership

Management and leadership are often conflated, but they’re orthogonal. Management responsibilities include performance assessment and feedback, dealing with interpersonal conflicts, and overseeing the planning and delivery of projects. Leadership is more about defining the vision for the organisation, establishing and defending culture, and inspiring people to work towards common goals.

Perhaps the most significant difference, however, is that management implies authority whereas leadership does not. In other words, you have to do what your manager tells you, but it’s up to you whether you follow a leader. That said, good managers are also good leaders, and rarely need to flex their authoritarian muscles.

The reason this distinction is important is that engineers are not managers, but as people in either track become more senior their focus shifts more onto leadership. Consequently the tracks which look quite different at lower levels start to converge, and include many of the same responsibilities at higher levels.

Levels

In many engineering organisations there isn’t a defined equivalence between engineering positions and management positions. The career triangle makes it obvious by putting equivalent levels in the same horizontal band. Level defines, among other things, the compensation people receive so a senior engineer will be paid similarly to a standard-level engineering manager.

I’ve labelled them from 3 through 8 which might seem somewhat arbitrary, but is based on the approach taken by Deliveroo, which was in turn largely copied from Facebook and Google. However, Microsoft use 59 through 69 for these levels, and Apple use 2 through 6, so this is far from a universal convention even among the FAANG companies.

If you want to retain senior engineers, and have them stay as engineers rather than switching to management, then it’s important to have this kind of equivalence defined. Without it there’s always doubt as to whether managers are treated better (higher compensation, etc.) and whether engineers are really second-class citizens.

Responsibilities

We can roughly divide positions into segments based on their primary responsibility. Note that although the areas of these segments are roughly equal in the diagram, in reality that will be far from the case. In a typical organisation around 80% of people will be in the green segment, with around 10% in each of the yellow and purple ones.

This illustrates why the transition from senior engineer to staff engineer shouldn’t really be seen as a promotion, but more as a change of role. Focus shifts away from a single team and working on one or two products, writing production code most days, to working across multiple teams and often not writing production code for weeks at a time. It’s a transition that many engineers don’t want to make because what they really enjoy doing is working in a team and writing production software.

Working backwards from this, the rationale behind having two levels of senior engineer becomes clearer. There can be a long time between mid-level (“Engineer”) and staff (perhaps ten or fifteen years) even if people actually want it, and without this distinction it can be hard to show progression or give people who have already made senior realistic goals to aim for.

This should also make clear that being a tech lead is not a promotion, but a different role with more management responsibility, for example organising the work team members are doing. The more senior tech lead roles are deliberately shown closer to the management line, as it’s not uncommon at this level that they will start taking on a bit more of the people management side of things.

As a counterpoint, the junior manager role is shown closer to the engineering line. This is because they typically only manage a small number of people while they learn the ropes, and so because they have more free time and technical skills they will often still be somewhat hands-on. As they get more experienced and manage more engineers, or other managers, being hands-on is no longer practical.

Switching tracks

I’ve shown the four most common points at which people switch tracks, but it is quite possible to switch tracks at any point in your career. It gets less common as people become more senior because they are more likely to know which track they want to be on, and less likely to have the skill required for the other track, but there are some people who oscillate between the two.

In general if you’re moving from an engineering position to a management position, as a new manager, you’ll drop a level or possibly even two because you won’t be as skilled at management as you are at engineering. Going back the other way it’s more likely you’ll stay at the same level if you’ve maintained technical skill, but if you need to re-skill that may not be the case.

It’s worth noting that in many countries, the UK included, your salary cannot legally be reduced even if you change level. This makes it relatively risk-free to try switching tracks if you want to. Most good companies will support you doing this, and switching back if it doesn’t suit you. The tech lead role is a great way of dipping your toe in the water.

All about identifiers

2019-12-10T00:00:00+00:00

Identifiers aren’t usually considered very interesting. Indeed, popular frameworks such as Django or Hibernate find them so uninteresting that they ship with the default of auto-incrementing integers so users don’t need to be concerned with them. Unfortunately using sequential identifiers is a bad idea, and if you don’t think about your identifiers up front then fixing things later can be very time consuming. Here’s most of what you need to know about identifiers.

Don’t use sequential integers

Before we get into anything else, let me convince you that you should not use sequential integers as your primary identifier, whether 32-bit or 64-bit. There are three main problems with sequential integers.

Firstly, they make life difficult if you want to generate them in multiple places such as a different service if an entity is being migrated, or in different shards, because the sequences will overlap with each other. If you’re using 64-bit integers this issue can often be mitigated by partitioning the keyspace, but this kind of partitioning can be manual and error-prone, leading to accidental overlaps. If you’re using 32-bit integers then you’re out of luck unless your rate of growth is really very small.

Secondly, sequential integers can impact application security because they facilitate enumeration and thus make insecure direct object reference (IDOR) attacks easier to mount, and the consequences of insufficient access control more severe. No matter how diligent you are, mistakes will be made in access control. While this is to some extent security by obscurity, it’s still a valuable part of defence in depth.

Another way security can be impacted by integer identifiers is in the aftermath of database failure. Because backups lag some way behind real-time, when a database is restored from a backup the latest id will be lower than the previously highest issued id which can result in things like UserIds being reissued to different people. If these integers have already been used elsewhere (security tokens, caches, third parties, etc.) then the person with the post-restore UserId may be able to access information related to the person who held it pre-restore. This isn’t just a theoretical case; I have worked at a company where this happened.

Lastly, they leak business information which could be valuable to competitors, who can derive things like how many orders/day you’re doing from sequential order numbers. The workaround of partitioning the keyspace can make this leak even worse because if your partitions relate to business units or markets then competitors can obtain a more detailed breakdown. Trust me, your competitors will do this.
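
To show how little effort this takes, here is a rough sketch of the arithmetic a competitor might do after placing two orders and noting the order numbers. The function name and all of the numbers are made up for illustration.

```python
from datetime import datetime


def estimate_orders_per_day(first_id: int, first_seen: datetime,
                            second_id: int, second_seen: datetime) -> float:
    """Estimate daily volume from two sequential ids observed at known times."""
    elapsed_days = (second_seen - first_seen).total_seconds() / 86_400
    return (second_id - first_id) / elapsed_days


# Made-up numbers: two orders placed a week apart by a curious competitor
rate = estimate_orders_per_day(
    1_204_000, datetime(2019, 12, 1, 12, 0),
    1_239_000, datetime(2019, 12, 8, 12, 0),
)
```

With these made-up inputs the ids advanced by 35,000 in seven days, giving an estimate of 5,000 orders per day.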

Instead I’d recommend using randomly generated identifiers with effectively zero probability of collision such as v4 UUIDs. However, note that while most mainstream v4 UUID generation algorithms are cryptographically secure and thus the UUIDs are unguessable under any circumstances, this is not required by the specification. UUIDs are required to be unique but not unguessable. As such, if the identifiers must be unguessable then it may be safest to use a cryptographically random generator directly.
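
As a minimal sketch of the two options in Python, the standard library’s uuid module generates v4 UUIDs and the secrets module draws directly from a cryptographically secure source. The function names and the 16 byte default are my own choices for illustration.

```python
import secrets
import uuid


def new_uuid() -> str:
    """A random (version 4) UUID in its canonical lowercase form."""
    return str(uuid.uuid4())


def new_unguessable_id(num_bytes: int = 16) -> str:
    """An identifier drawn directly from a cryptographically secure source."""
    return secrets.token_hex(num_bytes)  # two lowercase hex characters per byte
```

I’ve used token_hex rather than token_urlsafe here because the latter produces mixed-case output, and identifiers that differ only by case are best avoided.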

Another option you might want to consider is k-sortable identifiers which are essentially a random string prefixed with a specially formatted timestamp so that the identifiers are roughly sortable by time both lexicographically and in binary. However, again beware that this leaks information as the timestamps could be extracted by interested third parties.
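
A rough sketch of the idea is below. The layout (a 48-bit millisecond timestamp followed by 80 random bits, all as fixed-width lowercase hex) is an assumption chosen for illustration rather than any particular published format.

```python
import secrets
import time
from typing import Optional


def new_ksortable_id(timestamp_ms: Optional[int] = None) -> str:
    """48-bit millisecond timestamp followed by 80 random bits, as lowercase hex."""
    if timestamp_ms is None:
        timestamp_ms = time.time_ns() // 1_000_000
    # Fixed-width hex keeps lexicographic order the same as numeric order
    return f"{timestamp_ms:012x}{secrets.token_hex(10)}"


def extract_timestamp_ms(identifier: str) -> int:
    """Anyone holding the identifier can recover when it was created."""
    return int(identifier[:12], 16)
```

The second function is the information leak in action: no secret is needed to read the creation time back out.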

Before ending this section, a word on performance. These larger random identifiers work well in many data stores, but for some such as SQL Server which uses clustered primary keys they can cause major problems from effects like page splits. In this case you may want to create the table with an auto-incrementing integer primary key which is a surrogate and only used within the application but never exposed to any other application or party. This isn’t necessary in Postgres; using a UUID rather than a 64-bit integer key has a negligible effect on performance.

Identifiers can be personal data

When we think of personal data we tend to think of things like name or email. However, any identifier for a person that is accessible to that person or any third party (whether it’s in a user interface, URL, API body, cookie, JWT, etc.) should also be considered personal data. On learning this many people are dismissive, but consider that it falls into the same category as identifiers like passport or driving licence number.

If you operate in Europe and are thus required to be compliant with the GDPR then you need to be able to delete any accessible identifier if the person files a deletion request (aka “the right to erasure” or “the right to be forgotten”).

In theory if you implement ‘true’ deletion where all of the data related to that person is actually deleted then there should be no problem having the same identifier for both internal and external use.

However, it is common practice particularly with relational databases to implement pseudonymisation and replace any personal data with anonymised data, because unpicking the foreign key relationships to allow true deletion is impractical. In this case the accessible identifier must also be anonymised; this is clearly not possible if it is the primary key or used in foreign key relationships.

As such, for people you often need at least two identifiers: One that is used only for internal purposes, and one that is exposed to the person. The internal identifier should be used as the key for all internal storage but never exposed in any UI, URL, JWT, etc. which is where the external identifier should be used.

The reason for at least two rather than exactly two is that if you integrate with any third parties then you should provide a different identifier to each of those third parties, or even to different accounts with the same third party if applicable. This is part of defence in depth as it means the third party cannot identify people they don’t have access to, and it makes it harder for them to correlate information. In fact, individually translating the identifier of any resource for third parties is generally good practice.
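
The internal, external and per-third-party identifiers described above could be sketched as a translation table like the one below. This is an in-memory stand-in for what would really be a database table, and the class, method and audience names are all hypothetical.

```python
import uuid
from typing import Dict, Optional, Tuple


class IdentifierTranslator:
    """In-memory stand-in for a mapping table between internal and exposed ids."""

    def __init__(self) -> None:
        self._exposed: Dict[Tuple[str, str], str] = {}
        self._internal: Dict[str, str] = {}

    def exposed_id(self, internal_id: str, audience: str) -> str:
        """Return the stable identifier this audience sees for the entity."""
        key = (internal_id, audience)
        if key not in self._exposed:
            exposed = str(uuid.uuid4())
            self._exposed[key] = exposed
            self._internal[exposed] = internal_id
        return self._exposed[key]

    def resolve(self, exposed_id: str) -> Optional[str]:
        """Translate an exposed identifier back to the internal one, if known."""
        return self._internal.get(exposed_id)

    def forget(self, internal_id: str) -> None:
        """Drop every exposed identifier for an entity, e.g. on an erasure request."""
        for key in [k for k in self._exposed if k[0] == internal_id]:
            del self._internal[self._exposed.pop(key)]
```

Because each audience gets its own identifier, two third parties can’t correlate the same person by id, and forgetting the mappings leaves the internal key and its foreign key relationships untouched.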

Cross-system compatibility

Identifiers are one of the few things in systems that tend to stay stable, unlike languages and storage or data warehousing technologies, so we want identifiers to work with any future technology choices we may make. In addition, if they are sent to third parties then they may be processed using a completely different set of technologies of which we are unaware. To ensure the highest level of compatibility there are a number of considerations.

The most important consideration is not to create identifiers that differ only by case. Some data stores such as DynamoDB are case sensitive, others such as Postgres have selectable collations which control case sensitivity, and others such as Elasticsearch are sometimes case sensitive depending on both index and query. Having identifiers that differ only by case makes it more likely that they will collide or be conflated, causing hard-to-find bugs that potentially have security or privacy implications.

If the identifiers will not differ by case then it also makes sense to define a canonical case so they can always be compared lexically as well as logically. For example 42A23DAB-67AD-4AD7-B7A9-3DAFF34F6C02 and 42a23dab-67ad-4ad7-b7a9-3daff34f6c02 are logically the same UUID but are lexically unequal. If identifiers are always represented in a particular case (I prefer lowercase) then systems can convert identifiers to the canonical case and then it doesn’t matter if they use logical or lexical comparison.

On the subject of case, we don’t want to deal with anomalies like the Turkish I problem so restrict identifiers to ASCII characters. As identifiers may end up being used in URLs, headers, cookies, etc. it’s also best to stay away from any character that might have special meaning in those places. A safe character set to stick to is 0-9, a-z, - and _.
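
A minimal sketch of converting identifiers to a canonical form with these rules might look like the following. The function name is my own, and rejecting invalid input (rather than, say, stripping it) is an assumption.

```python
import re

_SAFE_IDENTIFIER = re.compile(r"[0-9a-z_-]+")


def canonicalise(identifier: str) -> str:
    """Lowercase an identifier and reject anything outside the safe character set."""
    # Check for ASCII before lowercasing, as some non-ASCII characters
    # (e.g. the Kelvin sign) lowercase to plain ASCII letters
    if not identifier.isascii():
        raise ValueError(f"Identifier is not ASCII: {identifier!r}")
    canonical = identifier.lower()
    if not _SAFE_IDENTIFIER.fullmatch(canonical):
        raise ValueError(f"Identifier has unsafe characters: {identifier!r}")
    return canonical
```

Applying this at system boundaries means everything inside can use plain string equality without worrying about which kind of comparison a given data store performs.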

Labelling

One of the issues with identifiers is trying to work out what they refer to if you don’t have the context, and this gets worse in distributed systems or with data warehousing. A neat solution to this is to include labels in the identifiers.

Stripe uses a fairly basic prefix approach with their API keys (which are really just a form of identifier) using a pk_ prefix for publishable keys, sk_ prefix for secret keys, and then a live_ or test_ environment slug. For example, a live publishable key looks like pk_live_8hTOJDTo....

Amazon uses a much more complex approach with Amazon Resource Names (ARNs) which have a structured format including details of where the resource is and which account it belongs to, e.g. arn:partition:service:region:account-id:resource-id.

These two examples also show a different approach to identifier uniqueness scope. In Stripe the resource identifier is globally unique, whereas in AWS it is only necessarily unique within the scope of the partition, service, region and account because the complete identifier is always used.

At Deliveroo we chose an approach somewhere in between these two, using the format drn:partition:market:resource-name:resource-id. This gave us a partition component for future flexibility to partition the system, a market component because all resources are strongly market-affine, and a globally unique resource identifier that can be used standalone in the many existing URLs without ambiguity (as those URLs typically already identify both the market and resource name).

We decided not to include an environment component as it added a generation and validation burden with little practical benefit because environments are strongly separated using other mechanisms. The market is deliberately abstracted from the shards in which that market resides to allow shards to be rearranged as markets grow.
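A sketch of building and validating identifiers in this format follows. It is not the code we ran at Deliveroo: the class, the function, and the example values ("main", "gb", "order") are all invented for illustration, and it assumes each component is restricted to the safe character set recommended earlier.

```python
import re
import uuid
from typing import NamedTuple

# Assumed rule: each component uses only 0-9, a-z, - and _.
_COMPONENT = re.compile(r"[0-9a-z_-]+")


class Drn(NamedTuple):
    partition: str
    market: str
    resource_name: str
    resource_id: str

    def __str__(self) -> str:
        return ":".join(("drn", *self))


def parse_drn(value: str) -> Drn:
    """Parse drn:partition:market:resource-name:resource-id."""
    parts = value.split(":")
    if len(parts) != 5 or parts[0] != "drn":
        raise ValueError(f"not a DRN: {value!r}")
    if not all(_COMPONENT.fullmatch(part) for part in parts[1:]):
        raise ValueError(f"invalid DRN component in {value!r}")
    return Drn(*parts[1:])


# Hypothetical values; the resource id is a v4 UUID, so it stays
# globally unique when used on its own in a URL.
drn = Drn("main", "gb", "order", str(uuid.uuid4()))
assert parse_drn(str(drn)) == drn
```

Note there is no environment component to generate or validate, which keeps both functions short.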

Labelling identifiers makes them much more useful, but be careful not to paint yourself into a corner by encoding aspects of your business that might change in future.

Summary

Identifiers are more complex to get right than you might think. Here’s a summary of the recommendations in this post which should help:

Never use sequential integer identifiers
Usually use v4 UUID identifiers
Always represent identifiers in lowercase
Consider labelling identifiers with their meaning
Remember that identifiers can be personal data

As always with software development, “never” doesn’t mean never and “always” doesn’t mean always. You’ll find that many of these recommendations are broken in the wild, and often for good reason. However, this is a good base set of guidelines to start from, and you should think carefully before deviating.

2020-03-24: Updated with additional paragraph about integer identifier security problems in the aftermath of database failure.

Deliberate engineering

2019-11-07T00:00:00+00:00

When building a software project with a given scope there are three main constraints you can trade off: good, fast, and cheap. This is commonly known as the project management triangle. Focusing on one will have an effect on the others. For example, trying to do things cheap will adversely affect good and fast.

Cheap in this context uses salary as a proxy for the skill and experience levels of engineers. Don’t take it too literally as you may be paying over the odds for engineers, so cheap really means projects that are staffed by engineers lacking in skill and/or experience. It could also mean projects that are under-resourced, but that problem is much more obvious so we won’t talk about it here.

The thing many people don’t seem to realise is that you don’t have a free choice with these constraints. Much like the CAP theorem only allows you to trade off between consistency and availability because network partitions are beyond your control, with software projects you can only really trade off between good and fast because cheap is largely defined by your company’s hiring policy.

Some companies have such a poor hiring policy that they are unable to choose either good or fast because there simply is not the skill and experience available. Companies like this tend to have a lot of engineers who have been at the company less than two years; most of the rest have been at the company for over a decade and are just waiting for retirement. Their interviews are ad hoc with no rubric and no hiring bar. This is, sadly, a large number of companies.

At the other end of the scale there are the companies who have relatively small but highly skilled teams who can choose both fast and good. These companies have a high proportion of senior engineers, including some that are well known in the industry, and low staff turnover rates. They have multi-stage interviews with standardised rubrics and a very high hiring bar. If you don’t know whether you work at one of these companies, then you probably don’t.

The remainder are somewhere in between. They have a relatively bottom-heavy pyramid of seniority with lots of junior/mid-level engineers and a much smaller number of seniors. They have structured interviews, but the rubrics may be absent or lacking and interviewers may not be calibrated, so the hiring bar is inconsistent. This group includes most startups and young companies.

This last group of companies is the most interesting. They have the choice between fast and good, but struggle to do both because the people who could do this are too thinly spread.

The trouble is, they often can’t do either fast or good.

Why fast fails

Every company wants to deliver new features fast, because features are thought to be what drives sales, conversion, etc. and so getting them out in front of customers is the most important thing. When the pressure is on to go fast, people want to look and feel like they’re going fast.

The majority of engineers appear to believe the fallacy that any activity other than writing code isn’t “real work” and is just ceremony that slows them down. This means when they want to feel like they’re going fast they’ll forgo—or merely pay lip service to—gathering requirements, writing design documents, doing research, and building proof-of-concepts to test and measure. Instead they jump straight into the code and start building.

They don’t even realise that they’re trading off good. They’re getting peer review on the code and fixing up the issues pointed out there, they have good test coverage, and they may even be extracting reusable components. They feel like they’re moving fast, and they feel like they’re doing good quality work. In the first month or two it can be hard to tell a successful project from a doomed one.

Building software is a bit like building a house. We all know roughly what shape a house is, that it’s built of bricks, and that it has windows and doors. If you were to start laying down bricks and putting in windows then fairly quickly you’d have something that starts to resemble a house. You might even convince yourself you’ll be finished soon.

Then the rain comes and because you didn’t build foundations one of the walls starts to subside and crack, and you need to spend time propping it up and repairing it. Then you realise you forgot to hook up the gas or water so you have to dig up the floor. Then you find out you’ve got two kitchens but no bathroom. Then you find out that the client wanted a barn, not a house.

By forgoing the requirements and blueprints and foundations the house starts to take shape quickly, but finishing it off takes far longer than originally planned because it keeps needing to be re-engineered. The end result isn’t quite what anybody wanted, and it keeps trying to fall over because it has no foundations.

I’ve lost count of the number of software projects I’ve seen run this way.

For any nontrivial software project, the faster you think you’re starting, the slower you’ll end up finishing. If you finish at all. As the estimates and timelines start to stretch and the priorities of the company evolve, the probability of projects getting cancelled increases significantly. The alternative is that hard deadlines get set and the project gets delivered, but is unreliable and unmaintainable, and consumes months of time after launch to support and re-engineer it.

You go fast to go slow.

Go slow to go fast

If you want to go fast then you have to understand how to go fast, and as you’ve probably deduced by now, that means starting slowly.

Define the problem statement, the goals, and the non-goals so you can scope the project more tightly. Work out the minimal set of features you can cohesively deliver, which shortcuts make sense, and what tradeoffs you’re prepared to make in non-functional requirements like reliability, cost, or performance. Design the architecture and data model in a document and get feedback from people with experience in the domain and/or the technologies, to resolve potential problems before you’ve coded them.

These activities don’t need to take long. For a relatively straightforward feature or service in a well understood domain using boring technologies you should be able to get the docs written and peer reviewed in a couple of days.

That’s not to say you should timebox it to a couple of days though. For a new service I worked on recently, research, design and prototyping took a few weeks. It was a good job we did it though, because many of our initial hypotheses were invalidated in ways that wouldn’t have become apparent until after it was live. It’s much quicker to correct mistakes on (virtual) paper than it is in code.

Once you’ve done the preparation, you can deliver deliberately and fast. On any nontrivial project, starting slowly will mean you deliver faster than if you just dive in and start writing code.

This doesn’t necessarily mean you’ll end up with good software. If fast is the main priority, as it might be for features of uncertain long-term value, then you will make some design decisions that are not objectively good. For example, you might use a data store with a high hosting cost to reduce implementation time, or omit reliability features like retries or circuit breakers because downtime is an acceptable risk. What’s important is that you considered these decisions, made them deliberately, and you know the future cost of making the software good, should that prove necessary.

On the flip side if your priority leans towards good, as it might for features that form the core of the business and which you know will need to be evolved over years, then deliberate engineering helps you to ensure that you meet your goals, and also gives you the best chance of delivering it fast.