Startup Ideas Bank
Ornith-1.0: Ambitious, but are developers ready to pay for 'self-improving' coding agents?
AI roast score: 72/100 (B)
The idea
deepreinforce-ai/Ornith-1
Ornith-1.0
Aloha! 🌺 Ornith-1.0 is a family of self-improving open-source models for agentic coding.
Highlights:
- State-of-the-Art Coding Agents: Available in 9B-Dense, 31B-Dense, 35B-MoE, and 397B-MoE (post-trained on top of Gemma 4 and Qwen 3.5), achieving state-of-the-art performance among open-source models of comparable size on coding benchmarks such as Terminal-Bench 2.1, SWE-Bench, NL2Repo, and OpenClaw.
- Self-Improving Training Framework: Ornith-1.0 uses RL to learn to generate not only solution rollouts but also the scaffold that drives those rollouts. By jointly optimizing the scaffold and the resulting solution, the model discovers better search trajectories and generates higher-quality solutions.
- Licence: MIT licensed, globally accessible, and free from regional limitations.
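To make the "jointly optimizing the scaffold and the resulting solution" idea concrete, here is a toy sketch. It is not Ornith-1.0's training code: the two-scaffold, two-solution setup, the `REWARD` table, and every function name are hypothetical, and plain REINFORCE with a running baseline stands in for whatever RL method the authors actually use. It only shows one shared reward updating both the scaffold choice and the solution choice made under that scaffold.

```python
import math
import random

# Hypothetical reward table: REWARD[scaffold][solution].
# Scaffold 1 combined with solution 1 is the best pair in this toy setup.
REWARD = [
    [0.1, 0.2],
    [0.3, 1.0],
]

def softmax(logits):
    """Turn logits into a probability distribution (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample(probs, rng):
    """Draw an index according to the given probabilities."""
    threshold = rng.random()
    cumulative = 0.0
    for index, p in enumerate(probs):
        cumulative += p
        if threshold < cumulative:
            return index
    return len(probs) - 1

def train(steps=5000, lr=0.1, seed=0):
    """REINFORCE over a scaffold policy and a scaffold-conditioned solution policy."""
    rng = random.Random(seed)
    scaffold_logits = [0.0, 0.0]
    solution_logits = [[0.0, 0.0], [0.0, 0.0]]
    baseline = 0.0  # running average reward, used to reduce variance
    for _ in range(steps):
        p_scaffold = softmax(scaffold_logits)
        s = sample(p_scaffold, rng)
        p_solution = softmax(solution_logits[s])
        a = sample(p_solution, rng)
        reward = REWARD[s][a]
        advantage = reward - baseline
        baseline += 0.05 * (reward - baseline)
        # Gradient of log-softmax w.r.t. logit i is (one_hot_i - prob_i).
        # The same advantage updates both policies, which is the "joint" part.
        for i in range(2):
            scaffold_logits[i] += lr * advantage * ((1.0 if i == s else 0.0) - p_scaffold[i])
            solution_logits[s][i] += lr * advantage * ((1.0 if i == a else 0.0) - p_solution[i])
    return scaffold_logits, solution_logits

scaffold_logits, solution_logits = train()
print("scaffold probabilities:", softmax(scaffold_logits))
print("solution probabilities under scaffold 1:", softmax(solution_logits[1]))
```

In this toy, the scaffold policy only learns to prefer scaffold 1 because the solution policy under it can reach the high reward, which is the intuition behind optimizing the two together rather than fixing the scaffold by hand.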
Benchmarks
Each model is evaluated against its size-appropriate baselines. All three use the same harnesses and decoding setup (see the notes under the tables).
Ornith-1.0-9B
| Agentic Coding | Ornith-1.0-9B | Qwen3.5-9B | Qwen3.5-35B | Gemma4-12B | Gemma4-31B |
| --- | --- | --- | --- | --- | --- |
| Terminal-Bench 2.1 (Terminus-2) | 43.1 | 21.3 | 41.4 | 21 | 42.1 |
| Terminal-Bench 2.1 (Claude Code) | 40.6 | 18.9 | 38.9 | - | - |
| SWE-bench Verified | 69.4 | 53.2 | 70 | 44.2 | 52 |
| SWE-bench Pro | 42.9 | 31.3 | 44.6 | 27.6 | 35.7 |
| SWE-bench Multilingual | 52 | 39.7 | 60.3 | 32.5 | 51.7 |
| NL2Repo | 27.2 | 16.2 | 20.5 | 10.3 | 15.5 |
| Claw-eval Avg | 63.1 | 53.2 | 65.4 | 32.5 | 48.5 |
| SWE Atlas - QnA | 17.9 | 9.2 | 13.2 | - | - |
| SWE Atlas - RF | 16.6 | 4.3 | 10.2 | - | - |
| SWE Atlas - TW | 15.3 | 4.4 | 9.8 | - | - |
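As a reading aid for the 9B results above, the snippet below computes the absolute point gain of Ornith-1.0-9B over the same-size Qwen3.5-9B baseline on each benchmark. The scores are copied from the table; the `SCORES` dictionary and the `point_gains` helper are illustrative names, not part of the project.

```python
# Scores copied from the Ornith-1.0-9B table: (Ornith-1.0-9B, Qwen3.5-9B).
SCORES = {
    "Terminal-Bench 2.1 (Terminus-2)": (43.1, 21.3),
    "Terminal-Bench 2.1 (Claude Code)": (40.6, 18.9),
    "SWE-bench Verified": (69.4, 53.2),
    "SWE-bench Pro": (42.9, 31.3),
    "SWE-bench Multilingual": (52.0, 39.7),
    "NL2Repo": (27.2, 16.2),
    "Claw-eval Avg": (63.1, 53.2),
    "SWE Atlas - QnA": (17.9, 9.2),
    "SWE Atlas - RF": (16.6, 4.3),
    "SWE Atlas - TW": (15.3, 4.4),
}

def point_gains(scores):
    """Return the Ornith-minus-baseline gain per benchmark, in points."""
    return {
        name: round(ornith - baseline, 1)
        for name, (ornith, baseline) in scores.items()
    }

gains = point_gains(SCORES)
for name, gain in gains.items():
    print(f"{name}: {gain:+.1f}")
```

These are differences in reported scores only; they say nothing about run-to-run variance, which the page does not report.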
Ornith-1.0-35B
| Agentic Coding | Ornith-1.0-35B | Qwen3.5-35B | Qwen3.6-35B | Gemma4-31B | Qwen3.5-397B |
| --- | --- | --- | --- | --- | --- |
| Terminal-Bench 2.1 (Terminus-2) | 64.2 | 41.4 | 52.5 | 42.1 | 53.5 |
| Terminal-Bench 2.1 (Claude Code) | 62.8 | 38.9 | 49.2 | - | 48.6 |
| SWE-bench Verified | 75.6 | 70 | 73.4 | 52 | 76.4 |
| SWE-bench Pro | 50.4 | 44.6 | 49.5 | 35.7 | 51.6 |
| SWE-bench Multilingual | 69.3 | 60.3 | 67.2 | 51.7 | 69.3 |
| NL2Repo | 34.6 | 20.5 | 29.4 | 15.5 | 36.8 |
| Claw-eval Avg | 69.8 | 65.4 | 68.7 | 48.5 | 70.7 |
| SWE Atlas - QnA | 37.1 | 13.2 | 15.5 | - | 20.4 |
| SWE Atlas - RF | 29.7 | 10.2 | 11.4 | - | 18.4 |
| SWE Atlas - TW | 27.8 | 9.8 | 13.3 | - | 18.5 |
Ornith-1.0-397B
| Agentic Coding | Ornith-1.0-397B | Qwen3.5-397B | Qwen3.7-Max | GLM-5.2-744B | Minimax-M3-428B | DeepSeek-V4-Pro-1.6T | Claude Opus 4.7 | Claude Opus 4.8 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Terminal-Bench 2.1 (Terminus-2) | 77.5 | 53.5 | 73.5 | 81.0 | 64 | 64 | 70.3 | 85 |
| Terminal-Bench 2.1 (Claude Code) | 78.2 | 48.6 | 69.8 | 82.7 | - | 66.5 | 69.7 | 78.9 |
| SWE-bench Verified | 82.4 | 76.4 | 80.4 | - | - | 80.6 | 80.8 | 87.6 |
| SWE-bench Pro | 62.2 | 51.6 | 60.6 | 62.1 | 59 | 55.4 | 64.3 | 69.2 |
| SWE-bench Multilingual | 78.9 | 69.3 | 78.3 | - | - | 76.2 | - | - |
| NL2Repo | 48.2 | 36.8 | 47.2 | 48.9 | 42.1 | - | - | 69.7 |
| Claw-eval Avg | 77.1 | 70.7 | 65.2 | - | - | 75.8 | 78.2 | - |
| SWE Atlas - QnA | 41.2 | 20.4 | - | - | 37.9 | 27.2 | 40.3 | 48.8 |
| SWE Atlas - RF | 42.6 | 18.4 | - | - | - | - | 48.6 | 46.7 |
S
The roast
Your pitch is a jargon-heavy labyrinth appealing to a niche audience of AI enthusiasts and developers. While your benchmarks and RL-driven scaffolding sound impressive, the real question is whether developers are willing to pay for yet another AI coding tool. Your 'idea-stage' solo operation lacks the market validation and team depth to turn this into a scalable business. Without funding and with 'will_pay' as your biggest unknown, you are flying blind into a highly competitive space already dominated by well-funded incumbents.
Red flags
- q12=idea
- q13=solo
- q15=will_pay
Verdict
You need to validate developer willingness to pay before diving deeper into development.