UCL School of Management

31 March 2026

How should we evaluate AI performance?

Image shows Professor Angela Aristidou

How should we benchmark AI? Writing in MIT Technology Review, UCL School of Management Professor Angela Aristidou argues that AI benchmarks are broken, and that one-off tests fail to measure AI's true impact.

The norm for evaluating AI has been to test whether machines outperform individual humans at discrete tasks. This approach yields clear answers that are easily standardised, optimised and compared to generate rankings.

In her piece, Professor Aristidou argues that AI is rarely used in this way. Rather than evaluating it at the task level, we should shift to benchmarks that assess how AI systems perform over longer timeframes, and how they perform within human teams, workflows and organisations.

An AI model benchmarked at 98% accuracy with impressive speed may be widely adopted by organisations, only for staff to spend more time interpreting its outputs than the model saves. Aristidou cites the real-world example of highly ranked radiology AI applications that, when deployed in healthcare settings, led to delays in practice.

Aristidou suggests paying attention to the conditions under which AI models will actually be used, asking whether AI can function as a productive participant within human teams and whether it can generate sustained, collective value.

Read the full article

Last updated Wednesday, 1 April 2026