How We Developed Zeta2

Last week, we released Zeta2 as the new and improved default Zeta edit prediction model. Zeta predicts your next edit based on your preceding activity, and you can accept its suggestions by hitting tab. The new model is smarter and faster: its suggestions are accepted about 30% more often, and it responds with lower latency.

Model refinement demands two things: the infrastructure to learn quickly, and a boatload of data. With Zeta1, we started with a thimble of data and little infrastructure for learning from it. We needed a way to increase our data intake ethically, and then we needed to be wiser about judging whether our improvements were, well, improvements.

Knowledge distillation

When Zeta makes a prediction, it sees the code around your cursor, your recent edits, and your cursor position. Rather than filling in the middle, it rewrites a snippet around the cursor with predicted edits applied, which is what allows predictions at arbitrary locations. For Zeta2, that input got richer: finer-grained recent edit history, and type and symbol definitions resolved via LSP for visibility beyond the immediate file.
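The rewrite-based approach can be sketched as follows. This is a minimal illustration, not Zeta's actual format: the region-selection heuristic and function names here are invented for the example.

```python
# Sketch: rather than fill-in-the-middle, the model emits a rewritten
# version of an "editable region" around the cursor; the editor then
# splices that rewrite back into the buffer. Heuristics are invented.

def extract_region(buffer: str, cursor: int, radius: int = 80) -> tuple[int, int]:
    """Pick a span around the cursor to expose as the editable region."""
    start = max(0, cursor - radius)
    end = min(len(buffer), cursor + radius)
    return start, end

def apply_prediction(buffer: str, start: int, end: int, rewritten: str) -> str:
    """Replace the editable region with the model's rewrite, which may
    insert, delete, or change text anywhere inside the region."""
    return buffer[:start] + rewritten + buffer[end:]
```

Because the model rewrites a whole region rather than filling a single hole, a prediction can touch any position inside that region, not just the cursor.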

We trained Zeta2 through knowledge distillation, a popular technique where a large, capable model (the teacher) generates training data that a smaller, faster model (the student) learns to reproduce. In our case, we collect starting states from users like cursor position and recent edits, but we don't train on what users actually write. Instead, we feed that starting state to the teacher (Sonnet 4.6 as of now) and use its predictions as the training signal for Zeta2.
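The distillation loop itself is conceptually simple. Here is a hedged sketch of the data-generation step, with invented structure and function names; the point is that the teacher's output, not the user's actual keystrokes, becomes the training target.

```python
# Sketch of distillation data generation: collected starting states go
# in, the teacher's predictions become the student's training targets.
from dataclasses import dataclass

@dataclass
class StartingState:
    snippet: str        # code around the cursor
    recent_edits: str   # serialized recent edit history
    cursor: int         # cursor offset within the snippet

def build_training_set(states, teacher_predict):
    """Pair each collected starting state with the teacher's prediction.
    The resulting (input, target) pairs are what the student is
    fine-tuned on -- the user's real next edit is never used."""
    return [(state, teacher_predict(state)) for state in states]
```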

Before you can train a good student, you need a good teacher. A lot of our development time went into tuning the teacher's prompt to ensure it produces predictions aligned with what users would actually expect.

We had a small set of handwritten evaluation cases testing the properties we cared about: predictions that aren't too large, aren't too small, don't reverse user edits, use the available context. We used these to guide the teacher prompt. Then we ran the teacher on roughly 100,000 examples and fine-tuned the student on that output.
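The properties listed above lend themselves to simple programmatic checks. The following is an illustrative sketch of what such an eval case might look like; the data structure and the specific heuristics (size bound, substring-based reversal check) are assumptions, not the actual eval harness.

```python
# Sketch of a handwritten eval case and the property checks it runs.
# The property names mirror the post; the heuristics are invented.
from dataclasses import dataclass

@dataclass
class EvalCase:
    original: str      # editable region before the prediction
    prediction: str    # teacher's rewritten region
    last_insert: str   # text the user most recently typed

def check(case: EvalCase, max_growth: int = 200) -> dict[str, bool]:
    return {
        # prediction shouldn't rewrite far more text than the context
        "not_too_large": abs(len(case.prediction) - len(case.original)) <= max_growth,
        # a no-op rewrite is a useless prediction
        "not_too_small": case.prediction != case.original,
        # the user's most recent insertion must survive the rewrite
        "no_reversal": case.last_insert in case.prediction,
    }
```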

Collecting the right training data

We initially generated synthetic training examples from GitHub commits, a common approach in academia. One problem is that commits only capture the final state of code, not the messy sequence of edits that produced it. We used an LLM to reconstruct a chronological stream of edits, but the result was still too clean: it didn't capture the typos, second thoughts, and self-corrections that are typical of real development. So while commit-derived examples remained useful for evaluation, for training we switched to collecting traces of user data (consensual, opt-in only, open source repos only), which offer a more realistic picture of how code is created.

This approach, applied to Zeta1, generated roughly 250,000 to 300,000 requests a week that we could train on. This was a crucial change: by collecting context inside Zed, where the user already has everything (uncommitted diffs, repo state, LSP), we could theoretically use 100% of the data we collect from production.

The reversal problem

One of the first problems we encountered was reversals, when you type a few characters and the model treats them as accidental and tries to delete them.

The root cause traced back to the teacher. We were distilling Sonnet, and it struggled with the fundamental task of inferring what the user was doing and simply doing the next thing. Instead, it would try to improve the whole region, cleaning up messy code and removing what looked like typos. It didn't understand that whatever the user did most recently should be treated as intentional. Re-orienting the teacher demanded a lot of prompt tuning.

We also increased the granularity of recent edits passed to the model so it can see what you're doing right now, not just a compressed blob of everything you've done in the last 10 minutes.
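To make that concrete, a fine-grained edit history might be serialized as an ordered list of small operations rather than one compressed before/after blob. The format below is invented for illustration; the actual serialization Zeta uses is not shown in this post.

```python
# Sketch: recent edits as an ordered stream of small ops, most recent
# last, so the model can see the user's immediate trajectory.
def serialize_edits(edits: list[tuple[str, int, str]]) -> str:
    """edits: (op, offset, text) triples, oldest first."""
    return "\n".join(f"{op}@{offset}:{text!r}" for op, offset, text in edits)
```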

Switching the base model

For Zeta2, we switched the base model from Qwen 2.5 Coder (7B) to Seed Coder (8B), an open-weight model from ByteDance. It's slightly newer, slightly larger, and showed better results on every offline and online metric. We ran the two in parallel from roughly January to February and Seed Coder was consistently better:

Trained on the same data, Seed-Coder had a roughly 1% higher acceptance rate. We also measure how well a model matches the teacher using a custom metric called deltaChrF (higher is better):

Model               deltaChrF
Qwen2.5-Coder-7B    78.36
Seed-Coder-8B       80.61
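The post doesn't define deltaChrF, but its core is the standard chrF idea: an F-score over character n-gram overlap between two strings. Here is a minimal from-scratch chrF-style score to show the mechanics; the "delta" variant is custom and not reproduced here.

```python
# Minimal character n-gram F-score (chrF-style), for illustration only.
# The production deltaChrF metric is custom; this is just the chrF core.
from collections import Counter

def chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = Counter(hypothesis[i:i + n] for i in range(len(hypothesis) - n + 1))
        ref = Counter(reference[i:i + n] for i in range(len(reference) - n + 1))
        if not hyp or not ref:
            continue  # string too short for this n-gram order
        overlap = sum((hyp & ref).values())
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    # beta=2 weights recall twice as heavily as precision, as in chrF
    return (1 + beta**2) * p * r / (beta**2 * p + r)
```

A score of 1.0 means the strings match exactly at every n-gram order; diverging rewrites score lower in proportion to how much character-level structure they share.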

How to know when it's time to ship

We had confidence in the teacher from handwritten evals. Then we fine-tuned the student and measured how closely its outputs matched the teacher's across 5,000 examples. We manually reviewed the worst cases where metrics showed the largest divergence. In most cases, the prediction was reasonable; it just went in a different direction than the teacher. That's an innate problem with this task: there's no single correct answer. To account for this reality, we generated three teacher completions for each example and measured the student against all three.
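Scoring against multiple references reduces to taking the best match, since any one of the teacher's completions is an acceptable target. A sketch, with an invented function name and a pluggable similarity score:

```python
# Sketch: score the student against several teacher completions and
# keep the best match, since there is no single correct answer.
def best_match_score(student_output: str, teacher_outputs: list[str], score) -> float:
    """`score` is any similarity function (e.g. a chrF-style metric)."""
    return max(score(student_output, t) for t in teacher_outputs)
```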

It's easy to tinker with a model forever because there's always more you can improve. But we knew we needed human feedback, so we dogfooded with the Zed team first, and then shipped a shadow release that ran the model in production without showing suggestions, to check for unexpected issues. Then came a small rollout, using acceptance rate as the primary signal, followed by a gradual expansion to 100% as the metrics stayed encouraging.

What's next

Building Zeta2 taught us that model improvement is less about any single breakthrough and more about closing loops: better data, better teachers, better signals for knowing when something is actually working. There's a lot more we want to share about how this work evolves: the metrics we're still refining, the infrastructure bets we're making, and the tradeoffs that come with running a model in production at scale. If you want to join us, we're hiring!

We plan to keep writing posts like this one as the model develops. The work is ongoing, and we'd rather show you the real process than wait until everything is clean.




Looking for a better editor?

You can try Zed today on macOS, Windows, or Linux. Download now!


We are hiring!

If you're passionate about the topics we cover on our blog, please consider joining our team to help us ship the future of software development.