Compilers Should Have a Superoptimize Flag

Dec 2025

Production compilers should have a --superoptimize flag, off by default. Passing this flag should mean three things:

  1. The compiler is free to use expensive (e.g., exponential) optimization algorithms.
  2. There are no guarantees about compile time, so the compilation must run under a timeout.
  3. If the compiler finishes within the timeout, it should come with guarantees about the quality of the generated code, e.g., that certain optimizations have been applied optimally (a toy sketch of what this could mean follows the list).
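
To make the third point concrete, here is a minimal, purely illustrative sketch of the idea behind superoptimization: enumerate candidate programs shortest-first, so that the first match is optimal by construction, and give up cleanly at a deadline. The tiny instruction set, the spec, and every name below are made up for illustration; a real superoptimizer searches over actual machine instructions and proves equivalence rather than testing it on a handful of inputs.

    import itertools
    import time

    # Hypothetical toy instruction set: each op transforms a single integer value.
    OPS = {
        "inc": lambda x: x + 1,
        "dec": lambda x: x - 1,
        "dbl": lambda x: x * 2,
        "neg": lambda x: -x,
    }

    def run(program, x):
        for op in program:
            x = OPS[op](x)
        return x

    def superoptimize(spec, tests, max_len=4, timeout_s=5.0):
        """Search for the shortest op sequence matching `spec` on all test inputs.
        Shortest-first enumeration means the first hit is optimal; hitting the
        timeout means no guarantee at all."""
        deadline = time.monotonic() + timeout_s
        for length in range(max_len + 1):
            for program in itertools.product(OPS, repeat=length):
                if time.monotonic() > deadline:
                    return None, False              # timed out: no guarantee
                if all(run(program, x) == spec(x) for x in tests):
                    return program, True            # optimal w.r.t. this toy cost model
        return None, True                           # nothing exists within max_len

    # Shortest program computing 2x + 2 over a small test range.
    prog, optimal = superoptimize(lambda x: 2 * x + 2, tests=range(-8, 9))
    print(prog, optimal)    # ('inc', 'dbl'), True -- since 2 * (x + 1) == 2x + 2

Shortest-first enumeration is exactly what buys point 3: if the search finishes inside the deadline, no shorter program exists (relative to the toy cost model and test-based equivalence), which is the kind of statement a --superoptimize build could print.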

Ever since Backus and friends developed the first one, compilers have been strangled by tight compile-time budgets because they run during interactive debugging. Human attention wavers if the debugging loop is more than a few seconds long. But over time, people realized that it is useful to choose different points on the Pareto frontier of compile time vs. generated-code performance, which gave us flags like -O0, ..., -O3 and languages like Rust that are happy to spend (much) longer compiling production builds. The superoptimize flag is simply one endpoint of that frontier. Why are we hesitant to reach it?

People have tried doing so in the past, but my understanding is that it was never worth it. The speedups obtained rarely justified the added complexity in the compiler, and the hardware was getting faster year after year anyway. Thank you, Gordon Moore.

But the world is different now. Maybe not for general-purpose code, but surely for accelerator code like GPU kernels. These programs have more exploitable structure than general-purpose code, and they run for a Long Time once compiled. And even though the hardware is getting faster every year, it is not doing so in the same way as before: the increase in performance comes at the cost of ease of programming. It is no secret that current GPUs (and other tensor accelerators) need complex optimizations to be fully utilized. Does the superoptimize flag make sense for such a world?

I think the economics works out. If we are going to run attention for a bajillion H100-hours, it makes perfect sense to spend, say, an hour compiling it to obtain even a slight speedup. And it wouldn't even take that long. In Twill we show that, for several important kernels, a few minutes is all you need to guarantee that certain optimizations have been done optimally. Even if the speedup is marginal over a fast compiler, a guarantee of optimality is worthwhile: you don't have to keep staring at the code wondering whether there are improvements left to be made.
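
For a sense of scale, here is a back-of-envelope with entirely made-up numbers (none of these figures come from Twill or from any measurement):

    # Hypothetical numbers, chosen only to illustrate the ratio.
    gpu_hours   = 1_000_000   # total H100-hours the kernel will run in production
    speedup     = 0.01        # a "slight" 1% improvement from the superoptimized build
    compile_hrs = 1           # one-off cost of the --superoptimize compile

    saved_hours = gpu_hours * speedup      # 10,000 H100-hours saved
    print(saved_hours / compile_hrs)       # 10,000x return on the single compile hour

The exact numbers don't matter; the point is that the one-time compile cost is amortized over every future run of the kernel.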

Another reason for this mindset change is to further compiler research. It may be self-serving, but it's true. There is only so much innovation left in algorithms that must finish in seconds. We can access new vistas of optimization when we do not shy away from tackling NP-hard problems in our compilers.

Thoughts?