Windows Agent Arena (WAA) 🪟 is a scalable OS platform for testing and benchmarking of multi-modal AI agents.
-
Updated
Apr 13, 2026 - Python
Windows Agent Arena (WAA) 🪟 is a scalable OS platform for testing and benchmarking of multi-modal AI agents.
An agent benchmark with tasks in a simulated software company.
Frontier Models playing the board game Diplomacy.
The definitive benchmark for AI agents on OpenClaw. 45 tasks across 4 tiers. Powered by MyClaw.ai
Ranking LLMs on agentic tasks
llmBench is a high-depth benchmarking tool designed to measure the raw performance of local LLM runtimes (Ollama, llama.cpp) while providing deep hardware intelligence.
🤖 A curated list of resources for testing AI agents - frameworks, methodologies, benchmarks, tools, and best practices for ensuring reliable, safe, and effective autonomous AI systems
📊 Daily auto-updated snapshots of all Arena AI (LMSYS Chatbot Arena) leaderboards — LLM, Vision, Code, Video, Image & more. Structured JSON with historical tracking.
MindTrial: Evaluate and compare AI language models (LLMs) on text-based tasks with optional file/image attachments and tool use. Supports multiple providers (OpenAI, Google, Anthropic, DeepSeek, Mistral AI, xAI, Alibaba, Moonshot AI, OpenRouter), custom tasks in YAML, and HTML/CSV/JSON reports.
Precision-aware retrieval benchmark for LLM memory systems.
Benchmark for evaluating AI epistemic reliability - testing how well LLMs handle uncertainty, avoid hallucinations, and acknowledge what they don't know.
Benchmark abierto en español de 141 LLMs (89 con 13K+ runs reales y juez Phi-4 independiente). Quality, costo, velocidad, long-context y fuga de credenciales como dimensiones separadas. Alternativas a Claude, GPT y Gemini para agentes n8n/OpenClaw. Calculadora interactiva con tus propios pesos.
A curated list of evaluation tools, benchmark datasets, leaderboards, frameworks, and resources for assessing model performance.
GDB: GraphicDesignBench - A real-world benchmark for evaluating AI on graphic design tasks
GTA (Guess The Algorithm) Benchmark - A tool for testing AI reasoning capabilities
This repository contains the results and code for the MLPerf™ Storage v2.0 benchmark.
Benchmark local LLM models: speed, quality, and hardware fitness scoring. CLI, MCP server, and IDE plugins.
The first comprehensive benchmark for evaluating AI coding agents on Salesforce development tasks. Tests Apex, LWC, Flows, and more.
RetardBench is an open, no-censorship benchmark that ranks large language models purely on how retarded they are.
The first open evaluation framework for AI continuity. 250 narrative tests, 1835 verification questions, 10 checkpoints. Benchmark for AI memory systems, stateful agents, and long-term context persistence. No LLM in the evaluation loop.
Add a description, image, and links to the ai-benchmark topic page so that developers can more easily learn about it.
To associate your repository with the ai-benchmark topic, visit your repo's landing page and select "manage topics."