| Benchmark / Metric | Kimi K2.5 | GPT-5.2 / 5.3 | Claude 4.5/4.6 Opus | Gemini 3.0/3.1 Pro | Qwen3.5 / 3-Max | LangGraph | CrewAI | OpenAI Swarm |
|---|---|---|---|---|---|---|---|---|
| HLE-Full (agentic reasoning) | 50.2% | 45.5% | 43.2% (4.5) / 53.1% (4.6 + tools)* | 38.3% (3.0) / 44.4% (3.1) / 48.4% (Deep Think) | 49.8% | N/A | N/A | N/A |
| BrowseComp (search/browsing) | 60.6% (single) / 74.9% (thinking) / 78.4% (swarm) | 77.9% (5.2 Pro) / 90% (5 high+tools)* | 84% (4.6 thinking+tools) | 85.9% (3.1 Pro Preview thinking+tools) | 69% (3.5-397B) / 78.6% (without swarm) | N/A | N/A | N/A |
| SWE-Bench Verified (coding) | 76.8% | 80.0% (5.2) / ~78% (5.3-Codex) | 80.9% (4.5) / 80.8% (4.6) | 76.2% (3.0) / 80.6% (3.1 Pro) | 76.4% (3.5) / 80.2% (MiniMax M2.5) | N/A | N/A | N/A |
| AIME 2025 (math) | 96.1% | 100% | 92.8% (4.5) | 95.0% (3.0 Pro) | ~ | N/A | N/A | N/A |
| GPQA-Diamond (scientific knowledge) | 87.6% | 92.4% | 87.0% (4.5) / 91.3% (4.6) | 91.9% (3.0) / 94.3% (3.1 Pro) | 88.4% | N/A | N/A | N/A |
| ARC-AGI-2 (general reasoning) | ~ | 52.9% (5.2) | 68.8% (4.6) | 31.1% (3.0) / 77.1% (3.1 Pro) / 84.6% (Deep Think)* | ~ | N/A | N/A | N/A |
| OSWorld (computer-use agent) | ~ | ~ | 61.4% (Sonnet 4.5) / 72.7% (4.6) | ~ | ~ | N/A | N/A | N/A |
| Terminal-Bench 2.0 (agentic coding) | 50.8% | 46.2% (5.2) / 77.3% (5.3 Codex) | 54.0% (4.5) / 65.4% (4.6) | 46.4% (3.0) / 68.5% (3.1 Pro) | ~ | N/A | N/A | N/A |
| LiveCodeBench (competitive programming) | 85.0% (v6) | ~ | 82.2% (4.5) | 87.4% (3.0) / 2887 Elo (3.1 Pro) | ~ | N/A | N/A | N/A |
| Agent & Tools (ReLE TAU / KAMI) | 82.2% (TAU) | 95.7% (KAMI v0.1)* | 85.4% (4.6 TAU) | 84.5% (3.0 TAU) | 83.2% (Qwen3-Max TAU) / 91.88% (Qwen3-Coder KAMI)* | N/A | N/A | N/A |
| Latency / Frameworks (speed) | N/A | N/A | N/A | N/A | N/A | Faster | Slow | Faster |
| Decision accuracy | N/A | N/A | N/A | N/A | N/A | 100% | 87% | 90% |
| Efficiency (resource usage) | N/A | N/A | N/A | N/A | N/A | High | Low | High |
| Tool success rate | N/A | N/A | N/A | N/A | N/A | 100% | 37% | 100% |
| Best use case | N/A | N/A | N/A | N/A | N/A | Complex workflows with fine-grained control | Production systems with task delegation | Lightweight prototyping and simple tasks |

📌 Asterisks (*) mark figures self-reported by the companies (OpenAI, Anthropic, Google, Signal65).