Advances the frontier of coding and computer work. State of the art on SWE-Bench Pro (57%) and OSWorld (64%). Features mid-task steerability (you can interact with it while it works), 25% faster performance, and "High" capabilities in cybersecurity.
Replies
Flowtica Scribe
Hi everyone!
GPT-5.3-Codex is here.
The benchmark jumps are impressive (especially OSWorld going from ~38% to 64%), but I found this specific detail in the announcement most interesting:
"GPT-5.3-Codex is our first model that was instrumental in creating itself."
The team used early versions of the model to debug the training run, manage deployment, and diagnose test results. It basically accelerated its own development.
Codex is becoming a broader productivity agent that can handle complex workflows end-to-end.
It is available now on paid ChatGPT plans, everywhere you can use Codex: the app, CLI, IDE extension, and web. API access is on the way.
Migma AI
Cool addition!
The practical difference here is execution plus iteration: it can take a task, make changes, run and validate them, and refine the result without needing a new prompt for every bump. The frequent status updates and mid-course steering are what made it useful for real repo work (refactors, failing tests, debugging). I still review diffs carefully, especially anything touching auth or security, but it's a legitimate productivity boost compared to earlier Codex versions.