ai-benchmark

AnkitNayak-eth / llmBench

llmBench is a high-depth benchmarking tool designed to measure the raw performance of local LLM runtimes (Ollama, llama.cpp) while providing deep hardware intelligence.

llm local-llm llm-inference llm-tools ai-benchmark

Updated Mar 15, 2026
Python

chaosync-org / awesome-ai-agent-testing

🤖 A curated list of resources for testing AI agents - frameworks, methodologies, benchmarks, tools, and best practices for ensuring reliable, safe, and effective autonomous AI systems

testing qa benchmark machine-learning evaluation chaos artificial-intelligence chaos-monkey testing-tools awesome-list quality-assurance ai-safety ai-agents chaos-engineering llm llm-evaluation agentic-ai ai-benchmark agent-evaluation

Updated May 28, 2025

oolong-tea-2026 / arena-ai-leaderboards

📊 Daily auto-updated snapshots of all Arena AI (LMSYS Chatbot Arena) leaderboards — LLM, Vision, Code, Video, Image & more. Structured JSON with historical tracking.

benchmark machine-learning ai leaderboard text-to-video llm lmsys ai-benchmark chatbot-arena arena-ai

Updated Jun 25, 2026
Python

petmal / MindTrial

MindTrial: Evaluate and compare AI language models (LLMs) on text-based tasks with optional file/image attachments and tool use. Supports multiple providers (OpenAI, Google, Anthropic, DeepSeek, Mistral AI, xAI, Alibaba, Moonshot AI, OpenRouter), custom tasks in YAML, and HTML/CSV/JSON reports.

Updated Jun 25, 2026
Go

tenurehq / precisionMemBench

Precision-aware retrieval benchmark for LLM memory systems.

benchmark information-retrieval ai memory evaluation memory-benchmark ai-agents long-term-memory rag llm ai-memory ai-benchmark memory-evaluation

Updated Jun 26, 2026
TypeScript

BennettSchwartz / ERR-EVAL

Benchmark for evaluating AI epistemic reliability - testing how well LLMs handle uncertainty, avoid hallucinations, and acknowledge what they don't know.

python nlp benchmark machine-learning ai evaluation collaborate ai-safety llm llm-evaluation hallucination-detection ai-benchmark llm-benchmark

Updated Mar 22, 2026
Python

ctala / ai-benchmarks-alternativos

Open Spanish-language benchmark of 141 LLMs (89 with 13K+ real runs and an independent Phi-4 judge). Quality, cost, speed, long-context, and credential leakage are scored as separate dimensions. Alternatives to Claude, GPT, and Gemini for n8n/OpenClaw agents. Interactive calculator with your own weights.

ai-agents startup-tools n8n ai-models emprendedores openrouter ollama llm-evaluation llm-comparison ai-benchmark llm-benchmark spanish-ai openclaw hermes-agent claude-alternatives gpt-alternatives phi4-judge benchmark-en-espanol emprendedores-ia

Updated Jun 26, 2026
Python

brandonhimpfen / awesome-ai-benchmarks-evaluation

A curated list of evaluation tools, benchmark datasets, leaderboards, frameworks, and resources for assessing model performance.

awesome ai awesome-list awesome-lists ai-benchmarks ai-evaluation ai-benchmark

Updated May 11, 2026
Python

lica-world / GDB

GDB: GraphicDesignBench - A real-world benchmark for evaluating AI on graphic design tasks

svg design benchmark layout evaluation typography graphic-design lottie vlm multimodal vision-language-model ai-benchmark

Updated May 5, 2026
Python

Habitante / gta-benchmark

GTA (Guess The Algorithm) Benchmark - A tool for testing AI reasoning capabilities

python docker flask benchmark machine-learning puzzle reverse-engineering educational pattern-recognition ctf binary-analysis algorithm-analysis computational-thinking algorithmic-reasoning ai-benchmark

Updated Jan 12, 2025
Python

mlcommons / storage_results_v2.0

This repository contains the results and code for the MLPerf™ Storage v2.0 benchmark.

benchmark machine-learning ai ai-benchmark ml-benchmark

Updated Aug 4, 2025
Python

MetriLLM / metrillm

Benchmark local LLM models: speed, quality, and hardware fitness scoring. CLI, MCP server, and IDE plugins.

cli benchmark typescript mcp llm local-ai ollama ai-benchmark

Updated Jun 2, 2026
TypeScript

yasarshaikh / SF-bench

The first comprehensive benchmark for evaluating AI coding agents on Salesforce development tasks. Tests Apex, LWC, Flows, and more.

benchmark salesforce apex lwc lightning-web-components ai-benchmark

Updated Jan 27, 2026
Python

Mr-Dark-debug / RetardBench

RetardBench is an open, no-censorship benchmark that ranks large language models purely on how retarded they are.

jailbreak large-language-models llm prompt-injection red-teaming-tools ollama llm-evaluation uncensored-llm open-llm-leaderboard llm-jailbreaks prompt-injection-llm-security ai-benchmark ai-red-teaming llm-benchmark

Updated Mar 2, 2026
TypeScript

Kenotic-Labs / ATANT

The first open evaluation framework for AI continuity. 250 narrative tests, 1835 verification questions, 10 checkpoints. Benchmark for AI memory systems, stateful agents, and long-term context persistence. No LLM in the evaluation loop.

persistent-memory evaluation-framework memory-benchmark ai-agents continuity long-term-memory temporal-reasoning llm-evaluation ai-memory ai-benchmark context-engineering stateful-agents narrative-testing

Updated Apr 21, 2026
Python

ai-benchmark

Here are 93 public repositories matching this topic...

microsoft / WindowsAgentArena

TheAgentCompany / TheAgentCompany

GoodStartLabs / AI_Diplomacy

LeoYeAI / myclaw-bench

rungalileo / agent-leaderboard

AnkitNayak-eth / llmBench

chaosync-org / awesome-ai-agent-testing

oolong-tea-2026 / arena-ai-leaderboards

petmal / MindTrial

tenurehq / precisionMemBench

BennettSchwartz / ERR-EVAL

ctala / ai-benchmarks-alternativos

brandonhimpfen / awesome-ai-benchmarks-evaluation

lica-world / GDB

Habitante / gta-benchmark

mlcommons / storage_results_v2.0

MetriLLM / metrillm

yasarshaikh / SF-bench

Mr-Dark-debug / RetardBench

Kenotic-Labs / ATANT
