Making large language models faster, smaller, and more reliable.
We research the deployment characteristics of large language models: how they generate tokens under production constraints, how they manage memory at long context, and how their behavior degrades or holds up inside larger systems.
About the lab
LLMs Research is a small, independent lab. We work on the parts of large language models that decide how usable they are once they leave a benchmark: how fast they generate tokens, how much memory they consume, how reliably they reason, and how predictably they behave when stitched together into larger systems.
Most of our day-to-day is fairly ordinary research work. We read recent papers, reproduce the ones that look promising, and keep the code we write public. When something interesting comes out of that, we publish it. When something does not work, we try to write that up too, since negative results tend to be more useful than the headline numbers let on.
Research directions
Our research spans inference efficiency, memory and KV cache compression, adaptive compute for reasoning, and the dynamics of multi-agent LLM systems. These threads share a common concern: identifying the conditions under which LLM methods remain reliable, and characterizing the failure modes that emerge under realistic deployment constraints.
Inference efficiency
We study how LLMs serve tokens in production. This includes decoding strategies, attention variants, quantization schemes, and the engineering choices that decide whether a model fits on a given GPU at all. The questions we keep coming back to are practical ones: where does quality actually degrade, what trade-off is the user really making, and how much of the published gain survives outside the benchmark setup.
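As a rough illustration of the "does it fit" question, here is a back-of-the-envelope sketch of weight memory at different precisions. The 70B parameter count and the precision choices are assumptions for illustration, not a specific model we serve.

```python
def weight_bytes(n_params: float, bits: int) -> float:
    """Approximate memory for the model weights alone at a given precision."""
    return n_params * bits / 8

# Assumed 70B-parameter model; activations and KV cache are ignored here.
for bits in (16, 8, 4):
    gib = weight_bytes(70e9, bits) / 2**30
    print(f"{bits}-bit weights: ~{gib:.0f} GiB")  # roughly 130, 65, 33 GiB
```

At 16-bit such a model does not fit on a single 80 GB accelerator; at 8-bit it does, before accounting for activations or cache, which is where the quality questions above begin.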
Memory and KV cache
The KV cache is often the dominant cost in long-context inference, and it is one of the more interesting parts of the stack to optimize. We work on compression schemes, eviction policies, and attention designs that reduce memory footprint without noticeably hurting downstream quality. A lot of this work is empirical: there are many ideas, and most of them break in different ways at scale.
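To make the "dominant cost" claim concrete, here is a back-of-the-envelope sketch of cache size. The model dimensions below are assumed, roughly those of a 7B-class dense model with full multi-head attention.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, dtype_bytes=2):
    """Bytes of KV cache for one sequence: keys and values (factor of 2) per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * dtype_bytes

# Assumed 7B-class dimensions: 32 layers, 32 KV heads, head_dim 128, fp16.
per_token = kv_cache_bytes(32, 32, 128, seq_len=1)        # ~0.5 MiB per token
at_128k = kv_cache_bytes(32, 32, 128, seq_len=128_000)    # ~62 GiB at 128k context
print(f"{per_token / 2**20:.2f} MiB/token, {at_128k / 2**30:.1f} GiB at 128k tokens")
```

At that context length the cache alone outweighs the fp16 weights of the same model several times over, which is why compression and eviction are worth the trouble.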
Reasoning and adaptive compute
Not every input deserves the same amount of thinking. We are interested in methods that let models spend more compute on hard inputs and less on easy ones, with some notion of verification along the way. The same questions show up in agentic systems, where the cost of a wrong decision is much higher than the cost of an extra step.
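A minimal sketch of the idea, with `generate` and `verify` standing in for a model call and a scoring step; both names and the thresholds are assumptions for illustration, not a fixed API.

```python
def answer_adaptively(prompt, generate, verify, max_samples=8, accept_at=0.9):
    """Best-of-n sampling with an early exit: easy inputs stop after one sample,
    hard ones keep drawing candidates until the budget runs out."""
    best, best_score = None, float("-inf")
    for _ in range(max_samples):
        candidate = generate(prompt)
        score = verify(prompt, candidate)   # e.g. a reward model or a self-check pass
        if score > best_score:
            best, best_score = candidate, score
        if best_score >= accept_at:         # verifier is satisfied: save the remaining budget
            break
    return best, best_score
```

The research questions live in the verify step and in how the acceptance threshold is set, not in the loop itself.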
Multi-agent systems
When you put several LLMs together in a committee, interesting failure modes show up. Diversity collapses, agents start agreeing for the wrong reasons, and aggregate accuracy can end up worse than that of a single model. We study where these failures come from, how to measure them, and what to do about them.
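One way to see the first two failure modes is to track pairwise agreement across a committee's answers. The helper below is a hypothetical sketch for illustration, not tooling we describe elsewhere.

```python
from collections import Counter
from itertools import combinations

def committee_stats(answers):
    """Pairwise agreement rate and majority answer for one question.
    High agreement on a wrong answer is the failure mode of interest."""
    pairs = list(combinations(answers, 2))
    agreement = sum(a == b for a, b in pairs) / len(pairs) if pairs else 1.0
    majority, _ = Counter(answers).most_common(1)[0]
    return agreement, majority

# Three of four agents converge on the same answer, right or not.
agreement, majority = committee_stats(["B", "B", "B", "C"])
print(f"pairwise agreement {agreement:.2f}, majority vote {majority}")  # 0.50, B
```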
How we work
We try to keep the loop short between reading a paper and having running code. New publications are paired with reproducible implementations whenever we can manage it. The newsletter exists as a way to share that process, not just the results, since the implementation notes are often the more useful artifact for engineers.
The lab is independent and self-funded. There are no sponsors and no editorial line beyond what the numbers actually show.