While Baidu did not release full benchmark details or raw scores publicly, its performance positioning suggests a deliberate ...
Elliot Williams and Al Williams got together to share their favorite hacks of the week with you. If you listen in, you’ll hear exciting news about the upcoming SuperCon and the rare occurrence of Al ...
We did an informal poll around the Hackaday bunker and decided that, for most of us, our favorite programming language is solder. However, [Stephen Cass] over at IEEE Spectrum released their annual ...
According to Greg Brockman (@gdb), OpenAI's latest reasoning system has achieved a perfect score on the 2025 ICPC programming competition, as confirmed by Mostafa ...
The Federal Aviation Administration (FAA) and MITRE are introducing a new benchmark to enable the evaluation of large language models (LLMs) on aerospace tasks. Given the ...
A team from Stanford University and UC Santa Cruz has introduced AHELM, a new benchmark designed to evaluate audio-language models (ALMs) across a wide range of capabilities. ALMs are multimodal ...
A Survey of Benchmarks for Code Large Language Models and Agents, from Xi’an Jiaotong University, organizes the field from a software development life cycle perspective, covering suites such as HumanEval (introduced in “Evaluating Large Language Models Trained on Code”) ...
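HumanEval, mentioned above, scores models on function completion checked against unit tests. A minimal sketch of one such item, assuming the prompt/canonical-solution/test layout the benchmark uses (the `task` dict and `check` helper here are illustrative, not the official harness API):

```python
# One HumanEval-style task: a function stub to complete, a reference
# solution, and hidden unit tests that decide pass/fail.
task = {
    "prompt": (
        "def add(a: int, b: int) -> int:\n"
        '    """Return the sum of a and b."""\n'
    ),
    "canonical_solution": "    return a + b\n",
    "test": "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n",
}

def check(completion: str) -> bool:
    """Run prompt + model completion, then the unit tests; True iff all pass."""
    namespace: dict = {}
    try:
        exec(task["prompt"] + completion, namespace)
        exec(task["test"], namespace)
        return True
    except Exception:
        return False

print(check(task["canonical_solution"]))  # True: the reference solution passes
print(check("    return a - b\n"))        # False: a buggy completion fails
```

A full harness would sandbox the `exec` calls and aggregate pass@k over many samples; this sketch only shows the per-task scoring shape.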
This document contains performance benchmarks for a compute-heavy task across multiple programming languages. The benchmark performs the same mathematical computation (matrix operations, factorial ...
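A cross-language benchmark of this kind typically times the same kernel with a wall-clock timer in each language. A minimal sketch in Python, assuming the kernels are naive matrix multiplication and a large factorial (the `benchmark` helper and the problem sizes are illustrative, not taken from the document):

```python
import math
import time

def matmul(a, b):
    """Naive O(n^3) matrix multiplication, a typical compute-heavy kernel."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [
        [sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for i in range(rows)
    ]

def benchmark(fn, *args, repeats=3):
    """Return the best wall-clock time over several runs of fn(*args)."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best

size = 50
a = [[float(i + j) for j in range(size)] for i in range(size)]
t_mat = benchmark(matmul, a, a)
t_fact = benchmark(math.factorial, 2000)
print(f"matmul {size}x{size}: {t_mat:.4f}s  factorial(2000): {t_fact:.6f}s")
```

Taking the best of several repeats reduces noise from warm-up and scheduling; ports to other languages would keep the kernel and timing protocol identical so only the language runtime varies.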
A new multilingual AI benchmarking initiative backed by the German Government aims to advance equitable access to language technologies by highlighting where today’s large language models (LLMs) ...
For years, code-editing tools like Cursor, Windsurf, and GitHub’s Copilot have been the standard for AI-powered software development. But as agentic AI grows more powerful and vibe coding takes off, a ...