Testing AI-based Systems is Difficult

Test strategies must be built according to each system's specific needs. Sometimes a helping hand can go a long way.


AI Test Guide
(Guidelines on Testing Machine Learning Systems)

Machine learning (ML)-based AI systems are typically complex (e.g. deep neural nets), often based on big data, poorly specified and non-deterministic, which creates many new challenges and opportunities for testing them.
This AI Test Guide covers testing of the AI model, the input data, and the development framework. For each of these, it focuses on the test types that most effectively find the related defect types, and it suggests related test completion criteria.


Contact us if you or your organization are seeking to establish a customized AI Test Strategy (Guidelines on Testing Machine Learning Systems) tailored to your specific needs.

How this Guide is Organized

  1. Introduction to AI and ML

  2. Introduction to Machine Learning



  3. Introduction to Machine Learning Testing
    3.1 Introduction to Testing ML-Based Systems
    3.2 Risk-Based Testing
    3.3 ML Test Levels
    3.4 ML Test Environments

  4. Input Data Testing
    4.1 Scope

    4.2 Defect Types
    4.3 Test Types
    4.4 Test Completion Criteria

  5. Model Testing
    5.1 Scope

    5.2 Defect Types
    5.3 Test Types
    5.4 Test Completion Criteria

  6. Development Framework Testing
    6.1 Scope

    6.2 Defect Types
    6.3 Test Types
    6.4 Test Completion Criteria

  7. Annex – Example of the Testing of an ML System

  8. Annex – Introduction to Neural Networks

  9. Annex – Characteristics of ML Systems
    Functional Characteristics / Non-Functional Characteristics

  10. Annex – Example ML Systems

  11. Annex – Machine Learning Performance Metrics
    Confusion Matrix / Accuracy / Precision / Recall / F1-Score / Aggregate Metrics / Other Supervised Classification Metrics / Supervised Regression Metrics / Unsupervised Clustering Metrics / Limitations of ML Functional Performance Metrics / Selection of Performance Metrics

  12. Annex – Benchmarks for Various ML Domains

  13. Annex – Documentation of an MLS
    Typical Documentation Content / Example ML Model Documentation / Available Documentation Schemes

  14. Annex – ML System Testing Checklists


Target Audience

This guide is focused on individuals with an interest in, or a need to perform, the testing of ML-Based systems, especially those working in areas such as autonomous systems, big data, retail, finance, engineering and IT services. This includes people in roles such as system testers, test analysts, test engineers, test consultants, test managers, user acceptance testers, business analysts and systems developers.

Failures and the Importance of Testing for Machine Learning Systems

There have already been a number of widely publicized failures of ML. According to a 2019 IDC Survey, “Most organizations reported some failures among their AI projects with a quarter of them reporting up to 50% failure rate; lack of skilled staff and unrealistic expectations were identified as the top reasons for failure.” [IDC 2019].

Example ML failures include:

  • IBM’s “Watson for Oncology” cancelled after $62 million spent due to “unsafe treatment” recommendations [IEEE 2019]

  • Microsoft’s AI Chatbot, Tay, was corrupted by Twitter trolls [Forbes 2016]

  • Joshua Brown died in a Tesla Model S on a bright day, when his car failed to spot a white 18-wheel truck/trailer [Reuters 2017]

  • Elaine Herzberg was killed crossing the street at 10pm with her bicycle in Arizona by an Uber self-driving car travelling at 38 mph [DF 2019]

  • Google searches showing high-paying jobs only to male users [WP 2015]

  • COMPAS AI-Based sentencing system in the US biased against African Americans [NS 2018]

  • Anti-Jaywalking system in Ningbo, China recognized a photo of a billionaire on a bus as a jaywalker [BBC 2018]

Failures have historically provided one of the most convincing drivers for performing adequate software testing. Industry surveys show a perception that ML is an important trend for software testing:

  • AI was rated the number one new technology that will be important to the testing world in the next 3 to 5 years. [SoTR 2019]

  • AI was rated second (by 49.9% of respondents) of all technologies that will be important to the software testing industry in the following 5 years [ISTQB 2018]

  • The most popular trends in software testing were AI, CI/CD, and Security (equal first). [LogiGear 2018]

Testing is already being performed on ML-based systems:

  • 19% of respondents are already testing AI / Machine Learning [SoTR 2019]

  • 57% of companies are experimenting with new testing approaches [WQR 2019]


The preferred term (or terms) for a given concept is written in bold type. Alternative, less preferred synonyms are written below the preferred terms in regular type, where applicable. Where a definition of a term applies to a specific domain, the domain precedes the definition (e.g. <machine learning>).

A/B testing

split-run testing

statistical testing approach that allows testers to determine which of two systems or components performs better
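
One possible way to decide statistically which variant "performs better" is a two-proportion z-test on success counts. The sketch below is illustrative only: the function name, the counts and the choice of test are assumptions and are not prescribed by this guide.

```python
import math

def two_proportion_z_test(successes_a, total_a, successes_b, total_b):
    """Return (z, two-sided p-value) for H0: both variants have the same success rate."""
    rate_a = successes_a / total_a
    rate_b = successes_b / total_b
    pooled = (successes_a + successes_b) / (total_a + total_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts: correct predictions out of 1000 requests per variant.
z, p = two_proportion_z_test(870, 1000, 905, 1000)
b_is_better = z > 0 and p < 0.05  # 0.05 is an assumed significance level
```

A small p-value only indicates that the observed difference is unlikely to be due to chance; whether the difference matters in practice remains a judgement for the tester.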

activation value

<neural network> output of an activation function of a node in a neural network


activation function

transfer function

<neural network> the formula associated with a node in a neural network that determines the output of the node (activation value) from the inputs to the neuron
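
As a minimal sketch, the relationship between inputs, activation function and activation value can be written as follows. The weighted-sum form, the two example functions (sigmoid and ReLU) and all names are assumptions chosen for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    return max(0.0, x)

def activation_value(inputs, weights, offset, activation):
    """Output of one node: the activation function applied to the weighted inputs.

    `offset` is the constant term of the node (often called the bias term
    in neural-network literature).
    """
    weighted_sum = sum(i * w for i, w in zip(inputs, weights)) + offset
    return activation(weighted_sum)

value_sigmoid = activation_value([1.0, 2.0], [0.5, -0.5], 0.5, sigmoid)
value_relu = activation_value([1.0, 2.0], [1.0, 1.0], -1.0, relu)
```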

adversarial attack

deliberate use of adversarial examples to cause an ML model to fail

Note 1: Typically targets ML models in the form of a neural network.

adversarial example

input to an ML model created by applying small perturbations to a working example that results in the model outputting an incorrect result with high confidence

Note 1: Typically applies to ML models in the form of a neural network.
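
The idea of a small perturbation that flips the model output can be sketched on a toy logistic-regression model using the fast gradient sign method (FGSM). FGSM is one well-known technique and is not named in this guide; the weights, the input and epsilon are hypothetical values chosen for illustration.

```python
import math

def predict_proba(weights, offset, x):
    """Probability of class 1 under a logistic-regression model."""
    score = sum(w * xi for w, xi in zip(weights, x)) + offset
    return 1.0 / (1.0 + math.exp(-score))

def fgsm_perturb(weights, offset, x, true_label, epsilon):
    """Move each feature by +/- epsilon in the direction that increases the loss."""
    p = predict_proba(weights, offset, x)
    # For this model, the gradient of the cross-entropy loss
    # with respect to the input is (p - y) * w.
    grad = [(p - true_label) * w for w in weights]
    return [xi + epsilon * ((g > 0) - (g < 0)) for xi, g in zip(x, grad)]

weights, offset = [2.0, -3.0, 1.5], 0.1   # hypothetical trained parameters
x = [0.6, 0.2, 0.4]                        # working example with true label 1
x_adv = fgsm_perturb(weights, offset, x, true_label=1, epsilon=0.25)

original_class = int(predict_proba(weights, offset, x) > 0.5)
adversarial_class = int(predict_proba(weights, offset, x_adv) > 0.5)
```

In this toy case the predicted class changes although no feature moved by more than epsilon. Real adversarial examples against neural networks typically use much smaller perturbations relative to the input range.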

adversarial testing

testing approach based on the attempted creation and execution of adversarial examples to identify defects in an ML model

Note 1: Typically applied to ML models in the form of a neural network.

AI effect

situation when a previously labelled AI system is no longer considered to be AI as technology advances

artificial intelligence (AI)

capability of a system to perform tasks that are generally associated with intelligent beings

[ISO/IEC 2382 – removed second option on AI as a discipline]


application specific integrated circuit

[ISO/IEC/IEEE 24765d:2015]


ML algorithm

<machine learning> algorithm used to create an ML model from the training data

EXAMPLE: ML algorithms are used to generate models for Linear Regression, Logistic Regression, Decision Tree, SVM, Naive Bayes, kNN, K-Means and Random Forest.

automated exploratory testing

form of exploratory testing supported by tools

autonomous system

system capable of working without human intervention for sustained periods


autonomy

ability of a system to work for sustained periods without human intervention

back-to-back testing

differential testing

approach to testing whereby an alternative version of the system is used as a pseudo-oracle to generate expected results for comparison from the same test inputs
EXAMPLE: The pseudo-oracle may be a system that already exists, a system developed by an independent team or a system implemented using a different programming language.
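
A minimal sketch of back-to-back testing, assuming a hypothetical test item (an incremental mean) and an independently written alternative as the pseudo-oracle. The function names, the tolerance and the random inputs are illustrative.

```python
import math
import random

def mean_under_test(values):
    """Test item: mean computed incrementally."""
    mean = 0.0
    for n, v in enumerate(values, start=1):
        mean += (v - mean) / n
    return mean

def mean_pseudo_oracle(values):
    """Alternative version used to generate the expected results."""
    return math.fsum(values) / len(values)

def back_to_back(test_inputs, tolerance=1e-9):
    """Return the test inputs on which the two versions disagree."""
    return [x for x in test_inputs
            if not math.isclose(mean_under_test(x), mean_pseudo_oracle(x),
                                rel_tol=tolerance, abs_tol=tolerance)]

rng = random.Random(0)  # fixed seed so the run is repeatable
inputs = [[rng.uniform(-100, 100) for _ in range(rng.randint(1, 50))]
          for _ in range(200)]
mismatches = back_to_back(inputs)
```

Any entry in `mismatches` indicates a disagreement to investigate; it does not by itself show which of the two versions is wrong.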

backward propagation

<neural network> method used in artificial neural networks to determine the weights to be used on the network connections based on the computed error at the output of the network

Note 1: It is used to train deep neural networks.

benchmark suite

collection of benchmarks, where a benchmark is a set of tests used to compare the performance of alternatives


bias

<machine learning> measure of the distance between the predicted value provided by the machine learning (ML) model and a desired fair prediction

classification

<machine learning> machine learning function that predicts the output class for a given input

classifier

<machine learning> ML model used for classification

clustering

grouping of a set of objects such that objects in the same group (i.e. a cluster) are more similar to each other than to those in other clusters

combinatorial testing

black-box test design technique in which test cases are designed to execute specific combinations of values of several parameters

EXAMPLE: Pairwise testing, all combinations testing, each choice testing, base choice testing.

confusion matrix

table used to describe the performance of a classifier on a set of test data for which the true and false values are known
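
For a binary classifier, the table can be built by counting the four possible outcomes. This is a minimal sketch; the function name and the label lists are hypothetical.

```python
from collections import Counter

def confusion_matrix(actual, predicted, positive=1):
    """Count true/false positives and negatives for a binary classifier."""
    counts = Counter()
    for a, p in zip(actual, predicted):
        if p == positive:
            counts["TP" if a == positive else "FP"] += 1
        else:
            counts["FN" if a == positive else "TN"] += 1
    return {key: counts[key] for key in ("TP", "FP", "FN", "TN")}

actual    = [1, 1, 1, 0, 0, 0, 0, 1]   # known labels of the test data
predicted = [1, 0, 1, 0, 1, 0, 0, 1]   # classifier output for the same items
matrix = confusion_matrix(actual, predicted)
```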

data pre-processing

<machine learning> part of the ML workflow that transforms raw data into a state ready for use by the ML algorithm to create the ML model

Note 1: Pre-processing can include analysis, normalization, filtering, reformatting, imputation, removal of outliers and duplicates, and ensuring the completeness of the data set.
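
Three of the steps listed in Note 1 (removal of duplicates, imputation and normalization) can be sketched as follows. The record layout, the mean-imputation rule and min-max scaling are assumptions made for illustration; a real pipeline would choose these per data set.

```python
def preprocess(rows):
    """rows: list of numeric tuples; None marks a missing value."""
    # Remove exact duplicates while keeping the original order.
    unique = list(dict.fromkeys(rows))
    # Impute each missing value with the mean of the known values in its column.
    columns = list(zip(*unique))
    means = [sum(v for v in col if v is not None) / sum(v is not None for v in col)
             for col in columns]
    filled = [[m if v is None else v for v, m in zip(row, means)] for row in unique]
    # Min-max normalize every column into the range [0, 1].
    lows = [min(col) for col in zip(*filled)]
    highs = [max(col) for col in zip(*filled)]
    return [[(v - lo) / (hi - lo) if hi > lo else 0.0
             for v, lo, hi in zip(row, lows, highs)]
            for row in filled]

raw = [(20, 1000), (40, None), (20, 1000), (None, 3000)]  # hypothetical (age, income)
clean = preprocess(raw)
```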

decision trees

<machine learning> supervised-learning model for which inference can be represented by traversing one or more tree-like structures

[ISO/IEC 23053]

deep learning

approach to creating rich hierarchical representations through the training of neural networks with one or more hidden layers

Note 1: Deep learning uses multi-layered networks of simple computing units (or “neurons”). In these neural networks each unit combines a set of input values to produce an output value, which in turn is passed on to other neurons downstream.

[ISO/IEC 23053]

deep neural net

neural network with more than two layers

deterministic system

system which, given a particular set of inputs and starting state, will always produce the same set of outputs and final state




drift

<machine learning> changes to ML model behaviour that occur over time

Note 1: These changes typically make predictions less accurate and may require the model to be re-trained with new data.


explainability

<AI> level of understanding how the AI-Based system came up with a given result

exploratory testing

experience-based testing in which the tester spontaneously designs and executes tests based on the tester's existing relevant knowledge, prior exploration of the test item (including the results of previous tests), and heuristic "rules of thumb" regarding common software behaviours and types of failure

Note 1: Exploratory testing hunts for hidden properties (including hidden behaviours) that, while quite possibly benign by themselves, could interfere with other properties of the software under test, and so constitute a risk that the software will fail.


F1-score

<machine learning> performance metric used to evaluate a classifier, which provides a balance (the harmonic average) between recall and precision

false negative

incorrect reporting of a failure when in reality it is a pass

Note 1: This is also known as a Type II error.

EXAMPLE: The referee awards an offside when it was a goal and so reports a failure to score a goal when a goal was scored.

false positive

incorrect reporting of a pass when in reality it is a failure

Note 1: This is also known as a Type I error.

EXAMPLE: The referee awards a goal that was offside and so should not have been awarded.

feature engineering

<machine learning> activity in which those attributes in the raw data that best represent the underlying relationships that should appear in the model are identified for use in the training data

synonym: feature selection

forward propagation

<neural network> process of a neural network accepting an input and using the activation functions to pass a succession of values through the network layers to generate a predicted output
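
A minimal sketch of forward propagation through a small fully connected network. The layer layout, the sigmoid activation function and the weight values are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(inputs, layers):
    """Pass values through the layers and return the predicted output.

    layers: list of (weights, offsets); weights[j] holds the incoming
    weights of node j and offsets[j] its constant term.
    """
    values = inputs
    for weights, offsets in layers:
        values = [sigmoid(sum(w * v for w, v in zip(node_weights, values)) + b)
                  for node_weights, b in zip(weights, offsets)]
    return values

layers = [
    ([[0.5, -0.5], [1.0, 1.0]], [0.0, -1.0]),  # hidden layer with 2 nodes
    ([[1.0, -1.0]], [0.0]),                    # output layer with 1 node
]
output = forward([1.0, 1.0], layers)
```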

fuzz testing

software testing approach in which high volumes of random (or near random) data, called fuzz, are used to generate inputs to the test item
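
A minimal fuzzing loop might look like the sketch below. The test item `parse_percentage`, the alphabet and the rule that a `ValueError` counts as an acceptable rejection are all assumptions made for illustration.

```python
import random

def parse_percentage(text):
    """Hypothetical test item: parse strings such as '42%' into a float in [0, 100]."""
    value = float(text.rstrip("%"))
    if not 0 <= value <= 100:
        raise ValueError("out of range")
    return value

def fuzz(test_item, runs=1000, seed=0):
    """Feed random strings to the test item and collect unexpected exceptions."""
    rng = random.Random(seed)  # fixed seed so failures can be reproduced
    alphabet = "0123456789.%-+e naif"
    failures = []
    for _ in range(runs):
        data = "".join(rng.choice(alphabet) for _ in range(rng.randint(0, 8)))
        try:
            test_item(data)
        except ValueError:
            pass  # documented rejection of invalid input
        except Exception as exc:  # anything else is recorded as a failure
            failures.append((data, exc))
    return failures

failures = fuzz(parse_percentage)
```

Without a test oracle, a loop like this only detects crashes and unexpected exceptions; it cannot tell whether a returned value is correct.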

graphical processing unit (GPU)

application-specific integrated circuit (ASIC) specialized for display functions such as rendering images

Note 1: GPUs are designed for parallel data processing of images with a single function, but this parallel processing is also useful for executing AI-Based software, such as neural networks.


hyperparameter

<neural network> variables used to define the structure of a neural network and how it is trained

Note 1: Typically, hyperparameters are set by the developer of the model and may also be referred to as a tuning parameter.


interpretability

<AI> level of understanding how the underlying (AI) technology works

machine learning


process using computational techniques to enable systems to learn from data or experience

[ISO/IEC 23053]

metamorphic relation

describes how a change in the test inputs from the source test case to the follow-up test case affects a change (or not) in the expected outputs from the source test case to the follow-up test case

metamorphic testing

testing where the expected results are not based on the specification but are instead extrapolated from previous actual results
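
As an illustrative sketch, consider a nearest-neighbour classifier and the assumed metamorphic relation "scaling every feature by the same positive factor must not change the predicted label". The classifier, the data and the relation are hypothetical and chosen only to show the source and follow-up test cases.

```python
def nearest_neighbour_label(training, query):
    """training: list of (point, label); returns the label of the closest point."""
    def squared_distance(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))
    return min(training, key=lambda item: squared_distance(item[0], query))[1]

def scale(point, factor):
    return tuple(factor * v for v in point)

training = [((1.0, 1.0), "cat"), ((5.0, 5.0), "dog"),
            ((1.0, 4.0), "cat"), ((6.0, 2.0), "dog")]

# Source test case: the actual result is recorded, no specification is consulted.
source_input = (2.0, 2.0)
source_output = nearest_neighbour_label(training, source_input)

# Follow-up test case: scale all features by 10 and expect the same label.
followup_training = [(scale(point, 10.0), label) for point, label in training]
followup_output = nearest_neighbour_label(followup_training, scale(source_input, 10.0))

relation_holds = followup_output == source_output
```

If `relation_holds` is false, either the test item is defective or the assumed relation does not apply to it.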


ML model

<machine learning> output of a machine learning algorithm trained with a training data set that generates predictions using patterns in the input data

model accuracy

performance metric used to evaluate a classifier, which measures the proportion of classification predictions that were correct

neural network

artificial neural network

network of primitive processing elements connected by weighted links with adjustable weights, in which each element produces a value by applying a nonlinear function to its input values, and transmits it to other elements or presents it as an output value

Note 1: Whereas some neural networks are intended to simulate the functioning of neurons in the nervous system, most neural networks are used in artificial intelligence as realizations of the connectionist model.

Note 2: Examples of nonlinear functions are a threshold function, a sigmoid function, and a polynomial function.

[ISO/IEC 2382]

neuron coverage

proportion of activated neurons divided by the total number of neurons in the neural network (normally expressed as a percentage) for a set of tests

Note 1: A neuron is considered to be activated if its activation value exceeds zero.
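
Assuming the activation values of every neuron have been recorded for each test, neuron coverage can be computed as in this sketch. The function name and the recorded values are hypothetical.

```python
def neuron_coverage(activations_per_test):
    """activations_per_test: for each test, one activation value per neuron."""
    total = len(activations_per_test[0])
    # A neuron counts as activated if any test drove its value above zero.
    activated = {index
                 for activations in activations_per_test
                 for index, value in enumerate(activations)
                 if value > 0}
    return 100.0 * len(activated) / total

recorded = [
    [0.7, 0.0, 0.0, 0.2],   # test 1
    [0.0, 0.0, 0.9, 0.1],   # test 2
]
coverage = neuron_coverage(recorded)  # the second neuron is never activated
```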

non-deterministic system

system which, given a particular set of inputs and starting state, will NOT always produce the same set of outputs and final state


overfitting

<machine learning> generation of an ML model that corresponds too closely to the training data, resulting in a model that finds it difficult to generalize to new data

pairwise testing

black-box test design technique in which test cases are designed to execute all possible discrete combinations of each pair of input parameters

NOTE 1: Pairwise testing is the most popular form of combinatorial testing.
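
Whether a set of test cases achieves pairwise coverage can be checked with a sketch like the one below. The parameter names, their values and the test suite are hypothetical; the code checks coverage and does not generate a minimal suite.

```python
from itertools import combinations, product

def uncovered_pairs(parameters, test_cases):
    """Return the value pairs that no test case exercises.

    parameters: dict mapping parameter name to its list of values
    test_cases: list of dicts mapping parameter name to the chosen value
    """
    required = set()
    for a, b in combinations(sorted(parameters), 2):
        for value_a, value_b in product(parameters[a], parameters[b]):
            required.add(((a, value_a), (b, value_b)))
    covered = set()
    for case in test_cases:
        for a, b in combinations(sorted(case), 2):
            covered.add(((a, case[a]), (b, case[b])))
    return required - covered

parameters = {"os": ["linux", "windows"], "gpu": ["yes", "no"], "batch": [1, 32]}
suite = [
    {"os": "linux",   "gpu": "yes", "batch": 1},
    {"os": "linux",   "gpu": "no",  "batch": 32},
    {"os": "windows", "gpu": "yes", "batch": 32},
    {"os": "windows", "gpu": "no",  "batch": 1},
]
missing = uncovered_pairs(parameters, suite)
```

Here four test cases cover every pair of values, whereas all combinations testing of the same three parameters would need eight.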


parameter

<machine learning> parts of the model that are learnt from applying the training data to the algorithm

EXAMPLE: Learnt weights in a neural net.

Note 1: Typically, parameters are not set by the developer of the model.

parameterized test scenario

test scenario defined with one or more attributes that can be changed within given constraints

performance metrics

<machine learning> metrics used to evaluate ML models that are used for classification

EXAMPLE: Typical metrics include accuracy, precision, recall and F1-Score.
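
The four metrics named in the example follow directly from the counts in a confusion matrix, using the definitions of model accuracy, precision, recall and F1-score given in this glossary. The function name and the counts below are hypothetical.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1-score from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0   # predicted positives that were correct
    recall = tp / (tp + fn) if tp + fn else 0.0      # actual positives predicted correctly
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

metrics = classification_metrics(tp=30, fp=10, fn=20, tn=40)
```

Returning 0.0 when a denominator is zero is a convention assumed here; other conventions (for example, reporting the metric as undefined) are equally possible.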


precision

<machine learning> performance metric used to evaluate a classifier, which measures the proportion of predicted positives that were correct

prediction

<machine learning> machine learning function that results in a predicted target value for a given input

EXAMPLE: Includes classification and regression functions.

probabilistic software engineering

software engineering concerned with the solution of fuzzy and probabilistic problems

probabilistic system

system whose behaviour is described in terms of probabilities, such that its outputs cannot be perfectly predicted


pseudo-oracle

derived test oracle

independently derived variant of the test item used to generate results, which are compared with the results of the original test item based on the same test inputs

NOTE: Pseudo-oracles are a useful alternative when traditional test oracles are not available.

reasoning technique

<AI> form of AI that generates conclusions from available information using logical techniques, such as deduction and induction



recall

<machine learning> performance metric used to evaluate a classifier, which measures the proportion of actual positives that were predicted correctly

regression

<machine learning> machine learning function that results in a numerical or continuous output value for a given input

regulatory standard

standard promulgated by a regulatory agency

reinforcement learning

<machine learning> task of building an ML model using a process of trial and reward to achieve an objective

Note 1: A reinforcement learning task can include the training of a machine learning model in a way similar to supervised learning plus training on unlabelled inputs gathered during the operation phase of the AI system. Each time the model makes a prediction, a reward is calculated, and further trials are run to optimize the reward.

Note 2: In reinforcement learning, the objective, or definition of success, can be defined by the system designer.

Note 3: In reinforcement learning, the reward can be a calculated number that represents how close the AI system is to achieving the objective for a given trial.

[ISO/IEC 23053]

reward hacking

activity performed by an agent to maximise its reward function to the detriment of meeting the original objective


robot

programmed actuated mechanism with a degree of autonomy, moving within its environment, to perform intended tasks

Note 1: A robot includes the control system and interface of the control system.

Note 2: The classification of robot into industrial robot or service robot is done according to its intended application.

[ISO 18646-1]


safety

expectation that a system does not, under defined conditions, lead to a state in which human life, health, property, or the environment is endangered

[ISO/IEC/IEEE 12207]

Safety of the Intended Functionality (SOTIF)

ISO/PAS 21448: Safety of the Intended Functionality

search algorithm

<AI> algorithm that systematically visits a subset of all possible states (or structures) until the goal state (or structure) is reached

search based software engineering

software engineering that applies search techniques, such as genetic algorithms and simulated annealing to solve problems

self-learning system

adaptive system that changes its behaviour based on learning from the practice of trial and error

sign change coverage

proportion of neurons activated with both positive and negative activation values divided by the total number of neurons in the neural network (normally expressed as a percentage) for a set of tests

Note 1: An activation value of zero is considered to be a negative activation value.
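
Assuming recorded activation values per test, sign change coverage can be sketched as follows, treating zero as negative in line with Note 1. The function name and the values are hypothetical.

```python
def sign_change_coverage(activations_per_test):
    """activations_per_test: for each test, one activation value per neuron."""
    total = len(activations_per_test[0])
    seen_positive, seen_negative = set(), set()
    for activations in activations_per_test:
        for index, value in enumerate(activations):
            # Zero is counted as a negative activation value.
            (seen_positive if value > 0 else seen_negative).add(index)
    # Only neurons observed with both signs count towards coverage.
    return 100.0 * len(seen_positive & seen_negative) / total

recorded = [
    [0.5, -0.2, 0.0, 0.3],   # test 1
    [-0.1, -0.4, 0.6, 0.9],  # test 2
]
coverage = sign_change_coverage(recorded)
```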

sign-sign coverage

coverage level achieved if by changing the sign of each neuron it can be shown to individually cause one neuron in the next layer to change sign while all other neurons in the next layer stay the same (i.e. they do not change sign)


simulator

<testing> device, computer program or system used during testing, which behaves or operates like a given system when provided with a set of controlled inputs.

software agent

digital entity that perceives its environment and takes actions that maximize its chance of successfully achieving its goals

supervised learning

<machine learning> task of learning a function that maps an input to an output based on labelled example input-output pairs

[ISO/IEC 23053]

technological singularity

point in the future when technological advances are no longer controllable by humans

tensor processing unit (TPU)

application-specific integrated circuit designed by Google for neural network machine learning

test data

<machine learning> independent dataset used to provide an unbiased evaluation of the final, tuned ML model

test oracle

source of information for determining whether a test has passed or failed

NOTE 1: The test oracle is often a specification used to generate expected results for individual test cases, but other sources may be used, such as comparing actual results with those of another similar program or system or asking a human expert.

test oracle problem

challenge of determining whether a test has passed or failed for a given set of test inputs and state

threshold coverage

<neural network> proportion of neurons exceeding a threshold activation value divided by the total number of neurons in the neural network (normally expressed as a percentage) for a set of tests

Note 1: A threshold activation value between 0 and 1 must be chosen as the threshold value.

training data

<machine learning> dataset used to train an ML model


transparency

<AI> level of accessibility to the algorithm and data used by the AI-Based system

true negative

correct reporting of a failure when it is a failure

EXAMPLE: The referee correctly awards an offside and so reports a failure to score a goal.

true positive

correct reporting of a pass when it is a pass

EXAMPLE: The referee correctly awards a goal.

Turing test

test by a human of a machine's ability to exhibit intelligent behaviour that is indistinguishable from human behaviour


underfitting

<machine learning> generation of an ML model that does not reflect the underlying trend of the training data, resulting in a model that finds it difficult to make accurate predictions

unsupervised learning

<machine learning> task of learning a function that maps unlabelled input data to a latent representation

[ISO/IEC 23053]

validation data

<machine learning> dataset used to evaluate a candidate ML model while tuning it
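
The separation between training data, validation data and test data can be sketched as a single shuffled split. The 70/15/15 proportions, the seed and the function name are assumptions for illustration.

```python
import random

def split_dataset(records, train_pct=70, validation_pct=15, seed=42):
    """Shuffle once, then cut into training, validation and test data."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_train = len(shuffled) * train_pct // 100
    n_validation = len(shuffled) * validation_pct // 100
    training = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_validation]
    test = shuffled[n_train + n_validation:]  # held back until the model is tuned
    return training, validation, test

records = list(range(100))  # hypothetical record identifiers
training_data, validation_data, test_data = split_dataset(records)
```

Keeping the three sets disjoint is what allows the test data to give an unbiased evaluation of the final, tuned model.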

value change coverage

proportion of neurons activated where their activation values differ by more than a change amount divided by the total number of neurons in the neural network (normally expressed as a percentage) for a set of tests

virtual test environment

test environment where one or more parts are digitally simulated


