Tuesday, December 18, 2012

Testing with qcheck, version 2.0


qcheck, version 2.0

Introduction

QuickCheck is a system developed by John Hughes and Koen Claessen. Its premise is that comprehesive testing can be obtained by combining a testing specification language and test data automation ("random testing"). It has been ported to several languages from its native Haskell: the Mercury programming language provides a QuickCheck facility in their "extras" distribution, called qcheck.
As it stands, the original qcheck (herein after referred to as Q1) is an excellent piece of work, fully capable of testing entire systems. It can decipher how to generate example data from user-defined types. It also provides examples for various general and specific situations where the user may wish to exercise control over the ranges or frequencies of the data generated. What, then, would need to be changed?
Not much, it turns out. The fundamental aspects of the system -- comprehesive unit testing fed by randomly-generated values and directed by a test specification remain intact, but there are several features to comprehensive testing that can be added to improve this system. Q1 is good at verifying that one particular predicate behaves as it should, but it gives no indication that all the (interface) predicates of a system have been tested. Further, the user has some control over the detail reported for each test, but, for a user wishing a summary report, even the smallest report allowed is too much detail: adding a facility that give the user complete control over reportage becomes necessary for larger systems. Finally, Q1 has an excellent facility to generate random values for user-defined types, but for primitive types (such as charint, and float), the approach is a bit arcane (only mentioned in one of the last examples) and lossy -- giving the user control over the random number generator itself, and integrating that part of the system with the goal of ease of use will carry forward the automation of user-defined typed values as well as simplify controlling ranges of generated primitive values.
These improvements were the aim of this new version: keep the essence of previous version while adding these new layers to help the tester verify much larger systems. The first two aims outlined above, that of comprehension and detail, are addressed by the new reporting facility that exists both outside qcheck2 proper and is also integrated into its state. The third aim, more and simpler control over randomly-generated values, incorporate changes that now allow the user to define their own random number generator (or to use the very excellent one supplied) and also to change the generator's behavior in the midst of testing. We will address each aim, and their implementations, in the following sections.

Aims

Aim 1: Comprehension

One question that testing frameworks, such as QuickCheck, must eventually address, particularly for larger systems, is the one of completeness. Or, "has the entire functionality of the system been tested?" Under the first implementation this is a difficult question to answer, and the root of this problem is a rather trivial one to fix: the second input argument toqcheck/[4,7,8] is a string that has the purpose of describing the test to be performed. Typing this argument as a string is rather limitting: although the (human) user can seen the purpose from the description, the problem is that encoding the test description as a string does not facilitate reasoning about test results mechanically, which makes it difficult for the system to report on what was and, importantly, was not, covered in the testing.

A new accumulated state variable

The fix to this problem is therefore simple: generalize the test description argument to any (univeral) type. This way other systems may, for example, generate a model of the system being tested and then, after the tests are completed, a method of tabulating and reporting the results. The internals of Q1 change in two ways: first, the description type signature is relaxed, and second, as reportage is spread across several calls to qcheck/[4,7,8], we will defer the commitment of outputting results until the user so determines. In this case, we convert the state variable from io.state (which commits the output) to a specific accumulator (that simply collects it):
:- pred qcheck(T, Desc, int, list(list(frequency)),
               general_frequencies, list(user_gen_type(RNG)),
               qstate(RNG, Desc), qstate(RNG, Desc))
        <= (testable(T), random_number_generator(RNG)).

:- mode qcheck(in, in, in, in, in, list_skel_in(user_gen_inst(RNG)),
               in, out) is det.
where:
Tthe type (pred/func) to test
Desca generic type describing the test
intnumber of tests
list(list(frequency))list of specific frequencies (as per Q1)
general_frequencieslist of general frequencies for testing (described later)
list(user_gen_type(RNG))user-defined types for value generation, as per Q1, specialized on the random number generator
qstate(RNGDesc)The state variable collecting test results and keeping the random number generator au courant
The typeclass testable(T) is as per Q1; and the random number generator typeclass (random_number_generator(RNG)) will be discussed under the value generator aim.
For the new types for qcheck2:
:- type general_frequencies == assoc_list(type_desc, list(frequency)).
:- type qstate(RNG, Description)
        ---> qstate(generator(RNG), map(Description, test_results)).
:- type test_results
        ---> test_results(int, int, int, bag(univ))
        ;    falsifiable(univ).
The above qstate type shows that the test description is mapped to the test_results type, as qcheck/[4,7,8] is called for each test, it accumulates the test results indexed by the (generic) description.

The Program Modelling Tool

As Description can be of any type, we can now allow the user to pass in a simple description string, as in Q1, or, alternatively, we can enhance the description with something that we can use as a declarative index. In Prolog, we could use the index of the module-qualified name of the predicate. Mercury is not as dynamic a system to allow any number of module/predicate/function values for Description,1 so we will employ a preprocessing system to identify and alias all interface predicate/function types to a uniform (enumerated) type. One such system is qcpt available from logicaltypes.com packaged into the ltq system. From an input set of files, qcpt generates the following types:
:- program_module
        ---> .
:- func all_program_modules = list(program_module).
:- func all_public_predicates = list(full_predicate_signature).

:- type public_predicate_signature
        ---> public_predicate / int.

:- type full_predicate_signature
        ---> public(program_module, public_predicate_signature).

:- type test_desc == pair(full_predicate_signature, string).

:- type public_predicate
        ---> .
Example
Say we wish to test a single module, for example the peano module. By running ...
$ qcpt peano.m
... the file qcheck2.tests_digests_types.m is generated with the above type declarations and the following specific realizations:
% ...

:- type program_module
        ---> peano.

% ...

:- type public_predicate
        ---> increment
 ;    peano
 ;    to_peano.

:- implementation.

all_program_modules = [
 peano
].

all_public_predicates = [
 public(peano, increment/2),
 public(peano, peano/2),
 public(peano, to_peano/2)
].

A new reporting facility

Given that we now have collected all the public (or interface) predicates and functions of a system, it is now also possible to determine the completeness of the testing, in that all of those predicates were tested. It simply now falls to a system to collect the results of testing and report the results. The user may attack this task any number of ways, but we also provide one implementation in module qcheck2.tests_digest_reports. The predicate
show_module_tests_summary(Results, !IO)
does all this work by comparing the accumulated test Results (typed as an assoc_list(test_desc, test_results), where test_desc is provided by moduleqcheck2.tests_digest_types (generated from the tested module using qcpt), and the assoc_list is the accumulated result of calling a set of qcheck/[4,7,8] goals).
The show_module_tests_summary/3 predicate reports the total number of tests (and those that passed) within each module and reports the number of predicates that succeeded all unit testing. This predicate also reports, by name and arity, the predicates not tested by the framework.
Example
The following report ...
$ ./peano_unit_tests_missing_to_peano

Qcheck2 tested the following modules:
*** 6/6 tests passed on 2/3 predicates in peano
    (did not test to_peano/2)
... shows that my modifications to module peano_unit_tests (I commented out the to_peano/2 tests) resulted in all tests passing, but the test suite did not provide full coverage of the peano module's functionality.

Traditional Reportage

Module qcheck2.reports provides two ways to report results as Q1 did:
show_test_results_sets(Results, !IO)
show_test_results(Description, test_results, !IO)
The predicate show_test_results_sets/3 simply iterates over Results, calling show_test_results/4 at each iteration. The predicate show_test_results/4 reports the results from qcheck/[4,7,8] in the style of Q1.
Example
Here are some of the results reported from running peano_unit_tests_with_failing_test with a call to show_test_results_sets/3:
...
1)
Test description : public(peano, increment / 2) - "Makes sure we get a successor"
Number of test cases that succeeded : 100
Number of trivial tests : 0
Number of tests cases which failed the pre-condition : 0
2)
Test description : public(peano, increment / 2) - "Checks increment/2\'s compare"
Number of test cases that succeeded : 19
Number of trivial tests : 0
Number of tests cases which failed the pre-condition : 81
Distributions of selected arguments :
13 {0, 1}
5 {1, 2}
1 {2, 3}
3)
Test description : public(peano, to_peano / 2) - "Tests string creation"
Falsifiable :
i(-68)
...
The above examples demonstrate 1) a sample test report where all the tests succeeded (with no reportage on the test values used), 2) a sample where some test values were unusable (with reportage), and 3) a sample where the predicate and test values (by intention) failed the test. These tests are of the same form as those of Q1. The major difference is that test tests are reported separately from the qcheck/[4,7,8] call. This separation allows delayed (even permanently delayed) reporting of results.

Aim 2: Generation

Q1 offers a dizzying array of options when giving the user control over what test values are generated and how they are generated, that is, if these values are to be generated from user-defined, or complex (not builtin), types. It also offers a generator of such power that it is dubbed: "The Mother of all Random Number Generators" by its creator. Impressive as the above offerings are (and they are, in fact, impressive) there are two glaring wants: first, one is required to use the supplied generator -- users are not permitted alternate generators for their own testing, and this becomes an issue when, for example, "pure" randomness (such as the values provided by random.org), or cryptological-strength randomness is a requirement of the tested system, and, second, restricting the ranges of primitive builtins (or other "unbound" enumerated types (such as integer)) is not directly feasible (albeit possibly with some esoteric indirection). We address each of these wants in the new system in this document by turns.

Want 1: User-supplied random number generators

Q1 was built to hide its inner workings. This software principle is considered to be good practice, but, since one of the inner workings is random number generation to generate test values, this good practice obstructs the ability to replace the random number generator when so required. The new architecture in qcheck2 exposes the random number generator in the state variable, allowing its replacement by alternate equivalent types. The change also comes with a supplied state initialization predicate that provides the default random number generator.
The design of the system requires the user to wrap their random number generator (hereinafter referred to as the RNG) in type described in the next section, and it must conform to the following typeclass:
:- typeclass random_number_generator(RNG) where [
        % gives a random float on the range [0, 1)
        pred rnd(float::out, RNG::in, RNG::out) is det,
 pred reseed(int::in, RNG::unused, RNG::out) is det
].
where:
rnd/3supplies a float between 0.0 and 1.0 and updates the state of the RNG; and
reseed/3reinitializes the RNG with the seed supplied as the first argument.

Want 2: User-controllable numeric ranges

It is very easy for the user of Q1 to specify a discriminated-union type and have the system generate test examples. The problem is for builtin types: Q1 does not consider bounds when generating floats, ints and, surprisingly, chars.2 One could argue that by choosing unbound types, the user must be prepared to accept any value those types describe, and indeed this is true. Where this argument falls apart is when the user wishes to test predicates within their nominal ranges. Certainly the test framework should generate test data that tests beyond the ranges, but when every test case offered is extrinsic, nothing is gained by submitting that predicate to random testing. In short, the user must be allowed direct control over ranges of test values of builtin types when the situation warrants.
:- type generator(RNG)
        ---> generator(int_range, float_range, character_type, RNG).
where:
int_rangeis eitherrange(int, int) or unbound

Endnotes

1
Well, this statement is not true in all cases -- if all the tested predicates had the same type and modality then one may simply use the name of the predicate as the Description. See, for example, modules foo and foo_unit_tests included in the distribution. The material point here is that although it is possible to use the name of a predicate as the Description it is not realistic given that most systems use predicates with different types, arities, and modalities.
2
Q1 treats strings as list(int), where each element is a char-equivalent int, which can be positive, negative or, as is most often the case, of very large magnitude.


(article posted circa 2006)

No comments: