Mutation Testing: Testing the Tests

Every other page in this track helps you test the code. This one asks the uncomfortable follow-up: who tests the tests? A suite can execute every line and branch (see Code Path Analysis) while asserting almost nothing. Mutation testing closes that loop by measuring the only thing a test suite is actually for — its ability to notice when the code is wrong.

The Idea

Deliberately plant a small bug — a mutant — and run the suite [1]:

  • If some test fails, the mutant is killed. Good: your suite would have caught that bug.
  • If every test still passes, the mutant survives. Bad: a real bug of that shape would ship.

Do this systematically — hundreds of mutants, one tiny change each — and the mutation score (killed ÷ total) tells you how trustworthy the green tick really is. It is coverage's missing half: coverage proves the tests ran the code; mutation proves they checked it.
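To make the loop concrete, here is a minimal hand-rolled sketch in Python. Everything in it is an assumption made for illustration: the function is_adult, the three hand-written mutants, and the list-of-cases stand-in for a test suite. Real tools generate mutants from your source and run your actual test runner.

```python
from typing import Callable, Dict, List, Tuple

Impl = Callable[[int], bool]
Case = Tuple[int, bool]  # (input age, expected result)

def is_adult(age: int) -> bool:
    """Hypothetical code under test."""
    return age >= 18

# Hand-written mutants, each one small change away from is_adult.
MUTANTS: Dict[str, Impl] = {
    "boundary: >= -> >": lambda age: age > 18,
    "negation: >= -> <": lambda age: age < 18,
    "constant: 18 -> 21": lambda age: age >= 21,
}

def suite_passes(impl: Impl, cases: List[Case]) -> bool:
    """Stand-in for 'run the test suite': True when every case passes."""
    return all(impl(age) == expected for age, expected in cases)

def mutation_score(mutants: Dict[str, Impl], cases: List[Case]) -> Tuple[float, List[str]]:
    """Return (killed / total, names of the surviving mutants)."""
    survivors = [name for name, mutant in mutants.items() if suite_passes(mutant, cases)]
    killed = len(mutants) - len(survivors)
    return killed / len(mutants), survivors

WEAK_SUITE: List[Case] = [(30, True), (5, False)]
STRONG_SUITE: List[Case] = WEAK_SUITE + [(18, True)]  # adds the boundary case

# The suite must be green on the original code before mutating anything.
assert suite_passes(is_adult, WEAK_SUITE)

for label, suite in [("weak", WEAK_SUITE), ("strong", STRONG_SUITE)]:
    score, survivors = mutation_score(MUTANTS, suite)
    print(f"{label}: score={score:.2f}, survivors={survivors}")
```

The weak suite kills only the negation mutant, one of three; adding the single boundary case kills the other two.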

A Worked Example

This suite has perfect coverage and a blind spot you have probably shipped yourself:

Python:

def qualifies_for_discount(age, total):
    return age >= 65 or total > 100

# A suite with 100% statement AND branch coverage...
def test_senior():
    assert qualifies_for_discount(70, 50)

def test_big_basket():
    assert qualifies_for_discount(30, 150)

def test_neither():
    assert not qualifies_for_discount(30, 50)

# ...that still lets BOTH boundary mutants survive:
#
#   mutant 1:  age >= 65   ->   age > 65      (65-year-olds lose their discount)
#   mutant 2:  total > 100 ->   total >= 100  (£100 baskets gain one)
#
# No test uses age == 65 or total == 100, so every test still passes
# against the mutated code. The fix is two boundary tests:

def test_senior_boundary():
    assert qualifies_for_discount(65, 50)        # kills mutant 1

def test_basket_boundary():
    assert not qualifies_for_discount(30, 100)   # kills mutant 2

C++:

bool qualifies_for_discount(int age, double total) {
    return age >= 65 || total > 100;
}

// A suite with 100% statement AND branch coverage...
TEST_CASE("senior")     { CHECK(qualifies_for_discount(70, 50)); }
TEST_CASE("big basket") { CHECK(qualifies_for_discount(30, 150)); }
TEST_CASE("neither")    { CHECK_FALSE(qualifies_for_discount(30, 50)); }

// ...that still lets BOTH boundary mutants survive:
//
//   mutant 1:  age >= 65   ->   age > 65      (65-year-olds lose their discount)
//   mutant 2:  total > 100 ->   total >= 100  (£100 baskets gain one)
//
// No test uses age == 65 or total == 100, so every test still passes
// against the mutated code. The fix is two boundary tests:

TEST_CASE("senior boundary") { CHECK(qualifies_for_discount(65, 50)); }
TEST_CASE("basket boundary") { CHECK_FALSE(qualifies_for_discount(30, 100)); }

Java:

boolean qualifiesForDiscount(int age, double total) {
    return age >= 65 || total > 100;
}

// A suite with 100% statement AND branch coverage...
@Test void senior()    { assertTrue(qualifiesForDiscount(70, 50)); }
@Test void bigBasket() { assertTrue(qualifiesForDiscount(30, 150)); }
@Test void neither()   { assertFalse(qualifiesForDiscount(30, 50)); }

// ...that still lets BOTH boundary mutants survive:
//
//   mutant 1:  age >= 65   ->   age > 65      (65-year-olds lose their discount)
//   mutant 2:  total > 100 ->   total >= 100  (£100 baskets gain one)
//
// No test uses age == 65 or total == 100, so every test still passes
// against the mutated code. The fix is two boundary tests:

@Test void seniorBoundary() { assertTrue(qualifiesForDiscount(65, 50)); }
@Test void basketBoundary() { assertFalse(qualifiesForDiscount(30, 100)); }

C#:

bool QualifiesForDiscount(int age, double total) =>
    age >= 65 || total > 100;

// A suite with 100% statement AND branch coverage...
[Fact] public void Senior()    => Assert.True(QualifiesForDiscount(70, 50));
[Fact] public void BigBasket() => Assert.True(QualifiesForDiscount(30, 150));
[Fact] public void Neither()   => Assert.False(QualifiesForDiscount(30, 50));

// ...that still lets BOTH boundary mutants survive:
//
//   mutant 1:  age >= 65   ->   age > 65      (65-year-olds lose their discount)
//   mutant 2:  total > 100 ->   total >= 100  (£100 baskets gain one)
//
// No test uses age == 65 or total == 100, so every test still passes
// against the mutated code. The fix is two boundary tests:

[Fact] public void SeniorBoundary() => Assert.True(QualifiesForDiscount(65, 50));
[Fact] public void BasketBoundary() => Assert.False(QualifiesForDiscount(30, 100));

Ruby:

def qualifies_for_discount?(age, total)
  age >= 65 || total > 100
end

# A suite with 100% statement AND branch coverage...
RSpec.describe "qualifies_for_discount?" do
  it("grants seniors a discount")    { expect(qualifies_for_discount?(70, 50)).to be true }
  it("grants big baskets a discount") { expect(qualifies_for_discount?(30, 150)).to be true }
  it("denies everyone else")          { expect(qualifies_for_discount?(30, 50)).to be false }

  # ...that still lets BOTH boundary mutants survive:
  #
  #   mutant 1:  age >= 65   ->   age > 65      (65-year-olds lose their discount)
  #   mutant 2:  total > 100 ->   total >= 100  (£100 baskets gain one)
  #
  # No test uses age == 65 or total == 100, so every test still passes
  # against the mutated code. The fix is two boundary tests:

  it("includes exactly-65-year-olds")  { expect(qualifies_for_discount?(65, 50)).to be true }
  it("excludes exactly-£100 baskets")  { expect(qualifies_for_discount?(30, 100)).to be false }
end

The surviving mutants pointed straight at the missing boundary tests — which is typical. Mutation testing doesn't just score your suite; each survivor is a specific, actionable "write this test".

The Classic Mutation Operators

Operator                  Example                         The bug it simulates
------------------------  ------------------------------  ----------------------------------------------------
Relational boundary       >= -> >                         Off-by-one at a threshold
Conditional negation      if (x) -> if (!x)               Inverted logic
Arithmetic swap           + -> -                          Wrong operator
Boolean connector         && -> ||                        Wrong combination rule
Constant tweak            100 -> 101, 0, -1               Wrong magic number
Statement / call removal  delete audit_log(...)           Forgotten side-effect (nothing asserted it happened)
Return value              return x -> return null/0/true  Wrong result wiring
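
As a rough illustration of how one of these operators could be automated, the sketch below uses Python's standard ast module to produce relational-boundary mutants of the discount function, one mutant per comparison. The names BoundaryMutator and boundary_mutants are invented for this example, and it makes no claim about how mutmut or any other tool is implemented. It needs Python 3.9 or later for ast.unparse.

```python
import ast
from typing import List

# Relational-boundary swaps: each operator maps to its off-by-one neighbour.
BOUNDARY_SWAPS = {ast.GtE: ast.Gt, ast.Gt: ast.GtE, ast.LtE: ast.Lt, ast.Lt: ast.LtE}

class BoundaryMutator(ast.NodeTransformer):
    """Swaps only the target-th boundary operator it encounters."""

    def __init__(self, target: int) -> None:
        self.target = target
        self.seen = 0

    def visit_Compare(self, node: ast.Compare) -> ast.AST:
        self.generic_visit(node)  # handle comparisons nested inside this one
        for i, op in enumerate(node.ops):
            if type(op) in BOUNDARY_SWAPS:
                if self.seen == self.target:
                    node.ops[i] = BOUNDARY_SWAPS[type(op)]()
                self.seen += 1
        return node

def boundary_mutants(source: str) -> List[str]:
    """Return one mutated source string per boundary operator in source."""
    count = sum(
        type(op) in BOUNDARY_SWAPS
        for node in ast.walk(ast.parse(source))
        if isinstance(node, ast.Compare)
        for op in node.ops
    )
    mutants = []
    for target in range(count):
        tree = BoundaryMutator(target).visit(ast.parse(source))  # fresh tree each time
        mutants.append(ast.unparse(ast.fix_missing_locations(tree)))
    return mutants

SOURCE = (
    "def qualifies_for_discount(age, total):\n"
    "    return age >= 65 or total > 100\n"
)

for number, mutant in enumerate(boundary_mutants(SOURCE), start=1):
    print(f"# mutant {number}\n{mutant}\n")
```

Each mutant changes exactly one operator, which is what lets a surviving mutant point at one specific missing test.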

One honest caveat: some mutants are equivalent, meaning the change doesn't alter behaviour (for example, mutating dead code, or changing i < n to i != n in a loop that only steps by one). These can't be killed and must be reviewed away by a human; they are the main reason a 100% mutation score is not a sensible target [2].
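
The loop case can be checked directly. In this small sketch (the function names are illustrative), the mutant replaces i < len(items) with i != len(items). Because i starts at 0, steps by exactly one, and a length is never negative, the two conditions become false at the same moment, so no test can tell the versions apart.

```python
from typing import List

def total(items: List[int]) -> int:
    """Original: index loop guarded by i < len(items)."""
    i, acc = 0, 0
    while i < len(items):
        acc += items[i]
        i += 1
    return acc

def total_mutant(items: List[int]) -> int:
    """Mutant: the guard is i != len(items) instead of i < len(items)."""
    i, acc = 0, 0
    while i != len(items):
        acc += items[i]
        i += 1
    return acc

# Every input gives the same answer from both versions, so this mutant
# survives any suite. It signals nothing about test quality.
SAMPLES = [[], [7], [1, 2, 3], list(range(100)), [-5, 5, 0]]
assert all(total(s) == total_mutant(s) for s in SAMPLES)
```

A tool cannot in general prove this kind of equivalence automatically, which is why such survivors end up in front of a human reviewer.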

In Practice

  • It is expensive by design — the suite runs once per mutant. Mitigate: run it on changed files only in CI (most tools support diff-based runs), nightly on the whole codebase, and always with your fastest test tier (see Types of Testing — this is another reason to keep the unit layer quick).
  • Treat the score as a trend rather than a target of 100%. Teams commonly gate at 60–80% on touched code; the real value is the list of survivors.
  • Pairs beautifully with TDD — if you write tests first (see TDD & BDD), mutation testing audits whether the refactoring phase quietly outgrew the tests.
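
A threshold gate of the kind described above is simple to script. The sketch below is tool-agnostic and entirely hypothetical: it assumes you have already extracted the killed and survived counts for the touched files from your tool's report, and the 70% default is just one point inside the 60–80% range.

```python
def gate(killed: int, survived: int, threshold: float = 0.70) -> int:
    """Return a process exit code: 0 if the score meets the threshold, else 1."""
    total = killed + survived
    if total == 0:
        # Nothing was mutated (for example a docs-only change): nothing to gate on.
        return 0
    score = killed / total
    print(f"mutation score: {score:.0%} ({killed}/{total}), threshold {threshold:.0%}")
    return 0 if score >= threshold else 1
```

In a CI job this could end with something like sys.exit(gate(killed, survived)), with the survivor list printed alongside so that a failure tells the author which tests to write.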

Ecosystem                Tool
-----------------------  ----------------
Python                   mutmut
Ruby                     mutant
Java / JVM               PIT (pitest) [3]
C# / JavaScript / Scala  Stryker [4]
C / C++                  Mull

References

  1. DeMillo, R.A., Lipton, R.J. & Sayward, F.G. (1978). "Hints on Test Data Selection: Help for the Practicing Programmer." IEEE Computer, 11(4), 34–41. https://doi.org/10.1109/C-M.1978.218136
  2. Jia, Y. & Harman, M. (2011). "An Analysis and Survey of the Development of Mutation Testing." IEEE Transactions on Software Engineering, 37(5), 649–678. https://doi.org/10.1109/TSE.2010.62
  3. PIT Mutation Testing. https://pitest.org/
  4. Stryker Mutator. https://stryker-mutator.io/