Systems, Failures, and Finding What Matters

The Timeless Art of Software Engineering

There is a common misconception that the best software engineers are simply the ones who write the most code. In the age of coding agents, it is tempting to let the number of commits one can churn out become the definition of "best."

It is easy to picture a brilliant programmer sitting in a dark room, typing flawlessly for eight hours straight. But the reality of building software that scales to millions of users is entirely different. The true difference between an average programmer and a master engineer isn’t the volume of their work, but the direction of it.

Impactful engineering comes down to a blend of understanding distributed systems, learning from catastrophic failures, and adapting to a world where technology shifts under your feet. Here is a look at the core theories and concepts that define high-level software engineering.

Finding Problems That Matter

Great engineering starts with identifying the right problems to solve. This usually happens at the exact intersection of two things: major technological trends and real-world friction.

Look at the bottom-up technical trends in the world: networks are getting faster, multi-core processors are everywhere, and GPUs are exploding in power. On the flip side, there are people out there trying to build things, and they are constantly running into roadblocks. When you combine a new technical capability with a known frustration, you find a problem worth solving.

For example, when building modern, massively scalable cloud databases, engineers noticed two things. First, developers loved building "serverless" applications (where they don't have to manage the underlying computers), but traditional databases didn't fit that model well. Second, underlying block storage over the network had become incredibly fast and reliable. By taking that new, lightning-fast storage and pairing it with a serverless architecture, engineers could build a completely new kind of database. That is how real innovation happens—not by writing random code, but by connecting a technical trend with a real-world need.

The Power of Postmortems and "On-Call"

In the software industry, being "on-call" means you are the person holding the pager when the system crashes at 3:00 AM. Many engineers avoid this at all costs, viewing it as a chore. Yet investigating system failures is arguably one of the best ways to learn how distributed systems actually work in the wild.

When a massive system fails, great engineering cultures write what is known as a "postmortem" or a Correction of Error (COE) document. A great postmortem does not stop at the first, most obvious cause of a crash. It digs through multiple layers of "why."

If a system crashed because of a code bug, why didn't the testing process catch it? Why did the engineers assume the system would behave a certain way? Why are there organizational blind spots?

Some engineering teams fall into the trap of "operational heroics." They pride themselves on having brilliant engineers who will stay up all night to manually restart servers and fix broken databases. From the inside, this feels like dedication. From the outside, it is a massive waste of energy. The goal should never be to fight the same fire twice. The goal is to deeply analyze the failure, fix the root cause, and build automated systems that prevent it from ever happening again.

When Best Practices Fail: The Danger of Caches

If you read basic system design guides, you will often see the advice: "Just throw a cache on it." A cache is a temporary storage layer that remembers frequently requested data so the system doesn't have to do the hard work of fetching it from the main database every time. Generally, caching makes systems incredibly fast.

However, in massive distributed systems, caches have a hidden dark side: modality.

A cached system has two modes. In Mode 1, the cache is full of the right data, and everything is fast and healthy. In Mode 2, the cache is empty (perhaps because it just restarted). Suddenly, all the traffic bypasses the empty cache and slams directly into the main database. Because the database was never designed to handle that much raw traffic, it crashes.

This leads to a phenomenon called a metastable failure. The system is down, and it cannot recover on its own. Every time it tries to restart, the massive wave of uncached traffic instantly crashes it again. Because of this, elite engineers often prefer to avoid caching when possible. Instead, they design scalable backends that can handle the raw load natively, or they keep complete "materialized views" of the data so the system never has to recover from an empty state.
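The two modes can be made concrete with a little arithmetic. The sketch below is a toy model, not a description of any real system: the request rate, database capacity, hit ratios, and the `fill_per_read` warming rate are all made-up numbers, and the model assumes (for simplicity) that an overloaded database serves nothing at all.

```python
def backend_load(request_rate: float, hit_ratio: float) -> float:
    """Requests per second that miss the cache and reach the database."""
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must be between 0 and 1")
    return request_rate * (1.0 - hit_ratio)


def simulate_hit_ratio(
    request_rate: float,
    db_capacity: float,
    start_hit_ratio: float,
    ticks: int,
    fill_per_read: float = 0.0001,  # hypothetical: how much one DB read warms the cache
    max_hit_ratio: float = 0.95,    # hypothetical: best hit ratio the cache can reach
) -> float:
    """Return the cache hit ratio after `ticks` steps of the toy model.

    Assumption: the cache only warms up from reads the database actually
    serves, and an overloaded database serves nothing (it has crashed).
    """
    hit_ratio = start_hit_ratio
    for _ in range(ticks):
        misses = backend_load(request_rate, hit_ratio)
        served = misses if misses <= db_capacity else 0.0
        hit_ratio = min(max_hit_ratio, hit_ratio + served * fill_per_read)
    return hit_ratio


# Hypothetical system: 10,000 requests/s, database sized for 1,000 requests/s.
REQUEST_RATE = 10_000.0
DB_CAPACITY = 1_000.0

warm_load = backend_load(REQUEST_RATE, hit_ratio=0.95)  # Mode 1: roughly 500 req/s
cold_load = backend_load(REQUEST_RATE, hit_ratio=0.0)   # Mode 2: all 10,000 req/s

# Starting warm, the system stays healthy; starting cold, it never warms up.
warm_end = simulate_hit_ratio(REQUEST_RATE, DB_CAPACITY, 0.95, ticks=100)
cold_end = simulate_hit_ratio(REQUEST_RATE, DB_CAPACITY, 0.0, ticks=100)
```

In this model the cold system is stuck: the database is overloaded, so it serves no reads, so the cache never fills, so the database stays overloaded. That loop is the essence of the metastable state described above. It also shows why sizing the backend for the full request rate removes the second mode entirely.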

Designing Resilient Databases (MVCC)

Speaking of databases, one of the most common ways a system crashes is when a client misbehaves. Imagine a program connects to a database, locks a piece of data to update it, and then "goes to lunch"—maybe the network drops, or the program freezes. Now, that data is locked, and every other part of the system waiting to read or write that data is stuck in a massive traffic jam.

Modern databases solve this using a brilliant concept called Multi-Version Concurrency Control (MVCC). Instead of making readers wait on a locked row, the database keeps a history of versions for every row. If you want to read data while someone else is currently writing to it, the database simply gives you the most recent committed version of that data.

Because of this, readers never block writers, and writers never block readers (two writers updating the same row still have to coordinate with each other). It requires a bit more storage space (usually less than 10% overhead, because most data doesn’t change constantly), but it eliminates an entire class of catastrophic system jams.
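The version-history idea can be sketched in a few lines of Python. This is a minimal, single-threaded illustration under stated assumptions, not how any particular database implements MVCC: the `MVCCStore` class and its method names are invented for this example, timestamps are a simple counter, and there is no conflict detection between writers and no cleanup of old versions.

```python
from typing import Any, Dict, List, Optional, Tuple


class MVCCStore:
    """Toy multi-version store: every key keeps a list of (commit_ts, value)."""

    def __init__(self) -> None:
        self._versions: Dict[str, List[Tuple[int, Any]]] = {}
        self._last_commit_ts = 0

    def begin_read(self) -> int:
        """Take a snapshot: the reader will only see commits up to this point."""
        return self._last_commit_ts

    def read(self, key: str, snapshot_ts: int) -> Optional[Any]:
        """Return the newest version committed at or before the snapshot."""
        for commit_ts, value in reversed(self._versions.get(key, [])):
            if commit_ts <= snapshot_ts:
                return value
        return None  # the key did not exist as of this snapshot

    def commit(self, writes: Dict[str, Any]) -> int:
        """Append new versions; older versions stay readable by old snapshots."""
        self._last_commit_ts += 1
        for key, value in writes.items():
            self._versions.setdefault(key, []).append((self._last_commit_ts, value))
        return self._last_commit_ts


store = MVCCStore()
store.commit({"balance": 100})

snapshot = store.begin_read()        # a reader starts here
pending_write = {"balance": 250}     # a writer prepares a change, not yet committed

seen_during_write = store.read("balance", snapshot)   # reader is not blocked
store.commit(pending_write)                           # writer is not blocked either
seen_after_commit = store.read("balance", snapshot)   # old snapshot is still consistent
seen_by_new_reader = store.read("balance", store.begin_read())
```

The reader holding the old snapshot keeps seeing 100 even after the write commits, while a reader that starts afterwards sees 250. If the writer had "gone to lunch" before committing, readers would have carried on with the last committed version instead of queueing behind a lock.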

AI and the Future of the Craft

The nature of coding is changing at an unprecedented speed. With the rise of AI-powered and agentic development tools, code itself is starting to flow like water. Writing standard boilerplate code is no longer the primary bottleneck in software creation.

Because of this, the baseline expectations for an engineer are shifting. It is no longer enough to just open an IDE (Integrated Development Environment) and type out functions. The most valuable skills are now rooted in deep systems thinking. Can you ask the right questions? Can you understand the underlying mathematics, optimization problems, and networking architecture? Can you talk to the people using your product and figure out what actually needs to be built? AI can write the code to sort a list, but it cannot decide if sorting that list actually solves the core architectural problem.

Writing and "Apparent Expertise"

One of the most underrated skills in a highly technical field is the ability to write clearly. Writing forces a level of mental clarity that talking simply does not. When designing complex systems with thousands of micro-decisions, writing a design document forces an engineer to separate the arbitrary guesses from the deeply researched decisions. It acts as an incredible multiplier, allowing an idea to scale across time and space to thousands of other engineers.

However, there is a delicate balance to strike between communicating and actually doing the work. This is the concept of "Apparent Expertise."

If you spend 100% of your time writing, talking, and discussing technology, you will gain a lot of visibility. People will think you are an expert, but you will quickly lose touch with how systems actually work in reality. You will become overrated.

Conversely, if you spend 100% of your time heads-down in the code, you will be a phenomenal builder, but you will lack the visibility to scale your impact. You will be underrated.

The master engineer strikes a balance—perhaps spending 75% of their time actively building and wrestling with the technology, and 25% of their time writing, teaching, and scaling their insights. If you have to choose a side to lean toward, it is always better in the long run to be underrated. In the world of complex engineering, reality always wins. You cannot fool a distributed system into working just because you sound smart.

Software engineering is a craft of continuous learning. By staying grounded in reality, investigating every failure with deep curiosity, and focusing on the underlying systems rather than just the code itself, anyone can build technology that leaves a lasting impact on the world.