Four mistakes engineers make on the 13-day project.

Stage 3 of the OpenTalent screening is a 13-day real-world project. The candidate picks one of three open-ended briefs in their specialization, ships a deliverable, and writes it up. We’ve now reviewed roughly 2,400 submissions over the past three years, and one of the more useful patterns we’ve noticed is that the projects that fail tend to fail in the same four ways. They are not execution failures. They are decision failures, and they are catchable in the first 48 hours if you know to look for them.

This piece is the long version of the panel advice we already give candidates the morning they receive their brief. If you’re considering applying to the network — or already mid-screen and reading this in week one — most of the value here is in which things to do early and which not to do at all.

The four mistakes, briefly, before we dig in:

Scope inflation. Picking a problem big enough to feel impressive and then failing to deliver any of it well.
No hypothesis. Submitting a tour of techniques rather than an answer to a specific question.
Write-up as recap, not argument. Documenting what you did instead of arguing for what you found.
No eval. Submitting work with no honest way to know whether it’s any good.

The fourth is the one that has improved most in our data: six years ago maybe 30% of submissions had a proper eval; this year roughly 70% do. That’s a real improvement. The other three have barely moved.

// 01 Scope inflation

// Mistake 01

Picking a problem too big to deliver in 13 days.

Symptom: the candidate’s framing in their day-1 plan describes a multi-month effort. The deliverable is a partial sketch of that effort.

// SCOPE INFLATED

“I’ll build a full RLHF pipeline from scratch including reward model training, PPO, and a custom eval harness, evaluated on three different benchmarks.”

// SCOPE RIGHT

“I’ll train a 1B reward model on a 50K-pair subset of HH-RLHF, evaluate on the held-out 5K pairs, and characterize one specific failure mode I find.”

This is the single most common failure mode. About 42% of submissions that don’t make it through the panel have a scope-inflation problem, even though the brief explicitly tells candidates to narrow.

The instinct behind scope inflation is recognizable. The candidate wants the project to look like it justifies the offer band; the offer band is a senior IC frontier-lab role; senior IC work looks ambitious. So they pick an ambitious-sounding problem and try to deliver enough of it to demonstrate the ambition.

That logic gets the prior backward. Senior IC work at frontier labs is, in practice, mostly narrow shipping. The work that wins those teams promotions is the work that picks a tight question and answers it with rigor — not the work that proposes to solve everything. Panels grade for the disposition that actually predicts senior IC performance. Picking a smaller problem and finishing it cleanly is the dominant strategy.

The recover move, if you’re on day 3 and realize you’ve scope-inflated: cut. Ruthlessly. Pick the single most interesting sub-problem inside your original plan and treat the rest as future work in the write-up. Panels reward this. We’ve reversed our debrief verdict on the strength of a candidate’s day-7 email saying “I’ve narrowed the scope, here’s what I’m cutting, here’s why” more times than is comfortable.

// 02 No hypothesis

// Mistake 02

Submitting a tour of techniques instead of an answer to a question.

Symptom: the write-up reads as a chronological narrative of what the candidate built. There is no thesis sentence. No “I expected X, and found Y.”

// NO HYPOTHESIS

“I implemented LoRA fine-tuning on a 7B base model, then explored DPO as an alternative, then tried RLAIF. Each method has different trade-offs.”

// HYPOTHESIS-DRIVEN

“I claimed that DPO with synthetic preferences would outperform LoRA-SFT on instruction-following. Three runs show DPO +11% on IFEval; LoRA wins on factuality (−3%). The crossover happens around 60K examples.”

This is the failure mode that frustrates panel members most directly, because it makes the work harder to evaluate. When the submission is a chronological tour, the panel has to assemble the implicit thesis on the candidate’s behalf — and that’s not the panel’s job. About 31% of failed submissions in our data are no-hypothesis submissions.

Strong senior IC engineers operate hypothesis-first by default, and the project is a probe for that disposition. We’re not asking whether you can implement three things in two weeks (you can; most candidates can). We’re asking whether your first instinct when given an open-ended brief is to convert it into a specific, falsifiable, testable claim. That instinct is what the panel is grading.

The recover move, if you’re at day 9 and realize you’ve toured rather than argued: state a hypothesis now, retroactively, and present your existing work as evidence for or against it. Write the hypothesis at the top of the write-up. Frame everything that follows as supporting or contradicting that hypothesis. This is not deception — it’s the act of converting your week of work into research. The panel cares much less about whether the hypothesis was your first thought and much more about whether the final submission is hypothesis-driven.

The candidate’s write-up should pass the elevator test: a senior practitioner who hasn’t read the brief should be able to read your abstract paragraph and tell us what you found. If they can’t tell us what you found in 60 seconds, the panel can’t either.

// FROM THE STAGE-3 RUBRIC, 2024 REVISION

// 03 Write-up as recap, not argument

// Mistake 03

Twelve days building. One hour writing. It shows.

Symptom: the document is structured as “what I did,” not “what I claim and why.” The reader’s attention has to do the work the writer should have done.

// RECAP STRUCTURE

“On day 1 I read the brief. On days 2–4 I set up the training environment. On day 5 I encountered a CUDA OOM error. The training run finished on day 11. Results are in Appendix A.”

// ARGUMENT STRUCTURE

“Headline finding: DPO matches LoRA-SFT at 60K examples and pulls ahead at 100K (Fig. 1). Below: the experimental setup, the three runs, the failure mode I noticed in the middle, and what this implies for the upper bound on synthetic preference data.”

The write-up gets graded harder than the code in many cases. This surprises candidates, but it shouldn’t. The panel’s question isn’t “could you have done the work?” By the time we’re reading your submission, the work exists. The question is “can you communicate it to a skeptical reader, fast?” That’s most of the senior IC job. The 13-day project is a sample of how you’ll write your team’s design docs, your release post-mortems, your half-year reviews. Bad write-ups predict bad design docs more reliably than almost any other signal we have.

The recover move, if you’re on the last day and realize your draft is a recap: throw it out. Write the headline finding in one sentence at the top. Spend two paragraphs making the case for it. Use the rest of the document to support the case, not to narrate your time. Cut every sentence that begins with “next, I…” or “on day…” — those are recap markers. Submit the shorter document. Candidates routinely panic about the write-up being too short; in three years of reviews, we have never marked down a submission for being too brief.

// 04 No eval

// Mistake 04

Submitting work without an honest way to know if it’s any good.

Symptom: the deliverable produces outputs but offers no answer to “how do we know it works?” Where eval is present, it’s a single number on a single benchmark.

// NO EVAL / WEAK EVAL

“The model improves response quality.” / “Accuracy: 0.78.” (No baseline. No held-out set. No discussion of what 0.78 means.)

// REAL EVAL

“Baseline: 0.61 on the held-out set. My method: 0.78. The gain is concentrated in the multi-step examples; on single-step prompts the methods are within noise. I also caught a regression on factuality I didn’t expect — discussed below.”

Of the four mistakes, this is the one trending in the right direction. Six years ago we’d see eval-free submissions in roughly 70% of senior IC project reviews. This year that number is down to ~30%, and a meaningful share of that remaining 30% includes at least a single-number eval, even if it falls short of a proper one. That’s progress.

What we still rarely see — and what would meaningfully distinguish your submission — is eval rigor commensurate with the claims being made. If your submission claims that method A beats method B, the eval should have: a baseline that A and B are both compared against, a held-out set neither method has seen, multiple seeds where stochasticity matters, and a brief discussion of what failure modes the eval might be missing. None of this is hard; almost none of it is present in the average submission.
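For a sense of how little code that checklist takes, here is a minimal sketch in Python using only the standard library. Everything in it is a hypothetical stand-in rather than anything from the rubric: the toy task, the function names (`split_holdout`, `evaluate`, `fit_baseline`, `fit_method`), and the "method" itself, which simulates a stochastic training run with a seed-dependent threshold instead of training a real model. What it is meant to show is the shape: one fixed held-out split, a baseline scored on the same split, and a mean plus spread across seeds rather than a single number.

```python
import random
import statistics

def split_holdout(examples, holdout_frac=0.2, seed=0):
    """Shuffle once with a fixed seed, then carve off a held-out set."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_holdout = max(1, int(len(shuffled) * holdout_frac))
    return shuffled[n_holdout:], shuffled[:n_holdout]  # (train, heldout)

def accuracy(predict, heldout):
    """Fraction of held-out (input, label) pairs the predictor gets right."""
    return sum(predict(x) == y for x, y in heldout) / len(heldout)

def evaluate(fit, train, heldout, seeds=(0, 1, 2)):
    """Fit once per seed; return mean, sample std dev, and per-seed scores."""
    scores = [accuracy(fit(train, seed), heldout) for seed in seeds]
    spread = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return statistics.mean(scores), spread, scores

# Toy task (hypothetical): the label is True when x >= 50.
data = [(x, x >= 50) for x in range(100)]
train, heldout = split_holdout(data)

def fit_baseline(train, seed):
    """Majority-class baseline: ignores both the input and the seed."""
    positives = sum(y for _, y in train)
    majority = positives * 2 >= len(train)
    return lambda x: majority

def fit_method(train, seed):
    """Stand-in for a stochastic training run: the threshold shifts with the seed."""
    threshold = 50 + random.Random(seed).randint(-3, 3)
    return lambda x: x >= threshold

for name, fit in [("baseline", fit_baseline), ("method", fit_method)]:
    mean, spread, scores = evaluate(fit, train, heldout)
    print(f"{name}: {mean:.2f} +/- {spread:.2f} over {len(scores)} seeds")
```

In a real submission, `fit_method` would wrap your actual training run and `accuracy` would be whatever metric your claim depends on. The part worth copying is that both methods see the same held-out examples and both report variance across seeds. The last item on the checklist, a short note on what failure modes the eval might be missing, still has to be written by hand.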

The recover move, if you’re on day 12 and don’t have a real eval: build one. Even one hour of focused eval work — held-out set, baseline number, multi-seed variance — substantially upgrades the rest of the submission. It also signals research instincts that nothing else in the project can signal. We have moved candidates from “borderline reject” to “advance” on the strength of a clearly described day-12 eval that the candidate added under time pressure.

// 05 What we’d add for 2026

One mistake we’d put on the list if we were drafting it today, less than a year after the four above were finalized: using AI tools without saying how. Most strong candidates are now using Claude or Cursor or one of their cousins during the 13-day project, and that’s fine. What’s not fine is pretending they didn’t. Panels can tell, and the pretense is a worse signal than the usage.

If you used an AI tool, say so in the write-up — briefly, in the methodology section. “I used Cursor for the boilerplate of the training loop; the training logic and the eval were written from scratch; I used Claude to draft three different versions of the data-curation script and picked the one that survived my tests.” That kind of disclosure is what senior IC engineering work looks like in 2026, and the panel grades it that way. The candidates who hide it are evaluated against the candidates who don’t, and the comparison isn’t flattering.

If you’re heading into Stage 3 in the next quarter, the interview guides include the per-track scope advice we give privately. The screening overview lives on the Top 3% page. Next week’s piece will be on the seven failure modes of production tool-using agents. Subscribe at the bottom of the blog page for it.

The Panel

// EDITORIAL · OPENTALENT SCREENING TEAM
