Saturday, August 2, 2008

todo_xxxxxxx

I recently altered the "fail incomplete tests" mechanism we use in the pygame test runner. Previously we were doing assertions on test_utils.test_not_implemented(), which checked a module-level variable, test_utils.fail_incomplete_tests. We set that variable depending on whether we wanted to fail incomplete tests.

This was a fairly non-invasive technique. However, I was already hijacking the test loading mechanism to filter tests by tags, and I realized I could alter the TestLoader class to pick up tests starting with the prefix "todo_" as well as "test_". A todo_ stub can then simply call TestCase.fail directly, because it only runs when todo_ tests are being picked up.
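Here is a minimal sketch of such a loader. It is written against today's Python 3 unittest, not the 2.5 copy that pygame bundled. The class name TodoTestLoader and the include_todo flag are hypothetical; only the two prefixes come from the post.

```python
import unittest


class TodoTestLoader(unittest.TestLoader):
    """Collect test_* methods, plus todo_* methods when include_todo is set.

    Hypothetical sketch: neither the class name nor include_todo is
    pygame's actual API.
    """

    def __init__(self, include_todo=False):
        super().__init__()
        self.include_todo = include_todo

    def getTestCaseNames(self, testCaseClass):
        # The base class already finds everything matching testMethodPrefix ('test').
        names = list(super().getTestCaseNames(testCaseClass))
        if self.include_todo:
            names += [name for name in dir(testCaseClass)
                      if name.startswith('todo_')
                      and callable(getattr(testCaseClass, name))]
        return sorted(names)


class ExampleTest(unittest.TestCase):
    def test_done(self):
        self.assertEqual(1 + 1, 2)

    def todo_test_pending(self):
        # Only collected (and therefore only failing) when todo_ tests are loaded.
        self.fail()
```

With include_todo=False, the stub is invisible and the suite passes. With include_todo=True, every unfinished stub shows up as a failure.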

This of course meant altering all the stubs. I briefly considered a fully automated mass search and replace, but I don't really trust that for tests.

For the test stubs I have been including the documentation, so it's really easy to walk through a test file writing tests without leaving the editor. Until now I was just using inspect.getdoc to get the __doc__ string.

It seems the documentation in the .doc files is different from the __doc__ of each function. The __doc__ seems to be the function signature plus a very brief, usually one-sentence description. The .doc files contain much more detailed descriptions, which can be very useful when writing tests.

I quickly added a docs_as_dict() function to makeref.py, then added it to the stub generator. The stub generator will add both the __doc__ and the .doc file documentation to each stub.
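The following shows roughly how a stub carrying both sources of documentation might be assembled. This is a sketch, not the actual generator. make_stub is a hypothetical name, and doc_dict stands in for whatever docs_as_dict() returns; here it is assumed to map qualified names to .doc text. The todo_test_ stub prefix is also an assumption.

```python
import inspect


def make_stub(qualified_name, obj, doc_dict):
    """Build a todo_ stub whose comments carry both the __doc__ and the .doc text.

    doc_dict stands in for the output of makeref.py's docs_as_dict(); its
    shape (qualified name -> text) is an assumption.
    """
    name = qualified_name.rsplit('.', 1)[-1]
    lines = ['def todo_test_%s(self):' % name, '']
    sources = [('__doc__', inspect.getdoc(obj) or ''),
               ('.doc file', doc_dict.get(qualified_name, ''))]
    for label, text in sources:
        if not text:
            continue  # skip a missing source rather than emit an empty header
        lines.append('    # %s for %s:' % (label, qualified_name))
        lines.append('    #')
        lines.extend(('    # ' + line).rstrip() for line in text.splitlines())
        lines.append('')
    lines.append('    self.fail()')
    return '\n'.join(lines)
```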

I went through semi-manually updating all the unfilled stubs in each test file with the more complete docs and the new todo_xxxxx test naming. It took about an hour, but I feel more confident than if I had just grep'd it.

Everything is pretty much now in place for the test site I wanted to create.

Test Timing
Test Tagging
Isolated Tests

Friday, July 18, 2008

import test.unittest as unittest

I split the test runner further, now into three files, with all the monkey business in unittest_patch.

The patching is done by a patch() function whose sole argument is an optparse options object. The options decide which parts of unittest get patched.

To get the features we wanted, I had to override some methods in quite a drastic way. I even needed to override TestCase.run, which is a very long method. The only way to do this was to copy/paste it, alter it and monkey-patch it in, which sometimes meant accessing private members.

Unfortunately, somewhere between Python 2.4 and 2.5, the author of unittest decided to rename the private members. They moved from the double-underscore __name_mangling convention to the single-underscore _caution convention.
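The rename hurt because a double-underscore attribute is name-mangled by the compiler. Code that is copied out of the class and monkey-patched back in must therefore spell out the mangled name itself. The TestCase below is a stand-in, not the real unittest class.

```python
class TestCase(object):
    """Stand-in class, not unittest.TestCase."""

    def __init__(self, methodName='runTest'):
        # Inside the class body, Python stores this as _TestCase__testMethodName.
        self.__testMethodName = methodName


def patched_describe(self):
    # Functions defined outside the class get no automatic mangling, so a
    # copied-in replacement has to use the mangled name. That name changes
    # if the class is renamed or switches to a single underscore.
    return 'running %s' % self._TestCase__testMethodName


TestCase.describe = patched_describe
```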

As my mentor said (or something like it), "using an underscore is a warning that said member is an implementation detail, not an interface".

What to do? We now include a 2.5 version of unittest in the test directory. Apparently pygame has come full circle: it bundled unittest way back in the day, before PyUnit was part of the standard library.

All of our individual test files, typically $module_test.py, import an unpatched unittest and run unittest.main() to make the module "conveniently executable". Only when running the complete suite is unittest enhanced with extra functionality.

While I was tinkering with the internals to record timings of individual tests, I moved the std(err|out) redirection from per module to per test. I then patched the TextTestRunner to dump the captured stderr/stdout on error.

def printErrorList(self, flavour, errors):
    for test, err in errors:
        self.stream.writeln(self.separator1)
        self.stream.writeln("%s: %s" % (flavour, test))
        self.stream.writeln(self.separator2)
        self.stream.writeln("%s" % err)

        # DUMP REDIRECTED STDERR / STDOUT ON ERROR / FAILURE
        # self.tests maps each test to the output captured while it ran
        if self.show_redirected_on_errors:
            stderr, stdout = map(self.tests[test].get, ('stderr', 'stdout'))
            if stderr: self.stream.writeln("STDERR:\n%s" % stderr)
            if stdout: self.stream.writeln("STDOUT:\n%s" % stdout)


It would be relatively easy to add support for showing locals() etc.

Tuesday, July 15, 2008

Redesign

I decided to (had to) redesign the test runner, this time cutting more directly to the root of the matter by overriding selected methods of the unittest classes.

Previously, in subprocess mode, I was calling the individual test modules, which in turn ran unittest.main(). That came with all the attendant pains of conflicting command line options and output parsing, and we still have to add profiling, exclusion by tags, etc. One major design change was to unify the single-process and subprocess modes so that both use one test runner (test_runner.py).

Along with a lot of utility functions, test_runner.py defines a run_test() function. It takes a list of modules and an options object as arguments and compiles a dictionary of the test results. On completion it either returns the dict or, in subprocess mode, pretty prints it to stdout. (The parent process then evals this output and merges it in with all_results.update(result).)

RESULTS_TEMPLATE = {
    'output'    : '',  # unittest.TextTestRunner output
    'stderr'    : '',  # stderr output
    'stdout'    : '',  # stdout output
    'num_tests' : 0,   # taken directly from the unittest results object
    'failures'  : [],  # ditto
    'errors'    : [],  # ditto
}
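Here is a sketch of the round trip, assuming the results contain only plain literals (strings, numbers, lists, tuples). It swaps eval for ast.literal_eval, which refuses anything but literals. child_report and merge_report are hypothetical names.

```python
import ast
import copy
import pprint

RESULTS_TEMPLATE = {
    'output': '', 'stderr': '', 'stdout': '',
    'num_tests': 0, 'failures': [], 'errors': [],
}


def child_report(module_name, **fields):
    """Child side: fill a fresh copy of the template and pretty print it."""
    result = copy.deepcopy(RESULTS_TEMPLATE)  # never share the template's lists
    result.update(fields)
    return pprint.pformat({module_name: result})


def merge_report(all_results, child_stdout):
    """Parent side: literal_eval only accepts Python literals, unlike eval."""
    all_results.update(ast.literal_eval(child_stdout))
    return all_results
```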


In single process mode, run_tests.py just imports the run_test() function from test_runner.py. It then passes that function the optparse options object and the list of modules to search for tests.

Both run_tests.py and test_runner.py share the same optparse command line options. In subprocess mode, run_tests.py calls test_runner.py with essentially the same sys.argv it was started with. When run as __main__, test_runner.py calls run_test() on the single-element list [args[0]]. Now all the extra functionality and command line parsing is in one place.
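Here is a sketch of sharing one parser between both scripts. The -s|--subprocess flag and the helper names are hypothetical; only -i|--incomplete appears elsewhere on this blog. The flag filtering also assumes that no option takes a value.

```python
import optparse
import sys


def make_parser():
    """One parser shared by run_tests.py and test_runner.py (sketch)."""
    parser = optparse.OptionParser()
    parser.add_option('-s', '--subprocess', action='store_true', default=False,
                      help='run each test module in its own process')  # hypothetical
    parser.add_option('-i', '--incomplete', action='store_true', default=False,
                      help='fail incomplete tests')
    return parser


def child_command(module_name, argv):
    """Re-use the parent's flags, but point the child at a single module."""
    options, args = make_parser().parse_args(argv)
    flags = [arg for arg in argv if arg not in args]  # assumes flags take no values
    # A list (not a string) avoids shell quoting differences between platforms.
    return [sys.executable, 'test_runner.py'] + flags + [module_name]


def child_main(argv, run_test):
    """What test_runner.py would do when run as __main__."""
    options, args = make_parser().parse_args(argv)
    return run_test([args[0]], options)
```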

There were quite a few extra little changes, which have left it not perfect but a lot better. Adding exclusion by tags took 10 minutes, and most of that time was spent picking a format:

|Tags:display|
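Here is one possible way to read and apply such tags. It assumes the marker sits in a test method's or TestCase's docstring, and that multiple tags are comma separated. Both assumptions, and the helper names, are illustrative only.

```python
import re
import unittest

TAGS_RE = re.compile(r'\|Tags:([^|]*)\|')


def get_tags(obj):
    """Read tags such as |Tags:display| from obj's docstring (format assumed)."""
    match = TAGS_RE.search(getattr(obj, '__doc__', None) or '')
    if not match:
        return set()
    return {tag.strip() for tag in match.group(1).split(',') if tag.strip()}


def exclude_tagged(suite, excluded):
    """Return a flat suite without tests whose method or class has an excluded tag."""
    kept = unittest.TestSuite()
    for test in suite:
        if isinstance(test, unittest.TestSuite):
            kept.addTests(exclude_tagged(test, excluded))
            continue
        method = getattr(test, test.id().rsplit('.', 1)[-1])
        if not (get_tags(method) | get_tags(type(test))) & excluded:
            kept.addTest(test)
    return kept
```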


Adding profiling decorators or whatever other functionality is desired will also be a lot easier now.

Thursday, July 10, 2008

Comedy Of Errors

** Build Page / Testing **
==========================

As reported earlier, crashing tests were rendering the build page ineffective. In reaction, I have been working on a script to isolate test modules in subprocesses. My approach was to compile the results of each isolated test into the same form as the old test runner's output. A quick hack, or so I hoped.

I realised that subprocess has no cross-platform non-blocking calls out of the box, so you can't time out hung tests. I had to find a recipe for this, which unfortunately required the win32 extensions. Not really a big deal, but it still cost time and added another dependency.
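For comparison, later Python versions make this a standard-library job. The timeout argument arrived in Python 3.3 and subprocess.run in 3.5, so no win32 extensions are needed. run_isolated is a hypothetical helper, not the recipe the post used.

```python
import subprocess
import sys


def run_isolated(cmd, timeout):
    """Run one test module in a child process and give up after timeout seconds.

    Returns (returncode, output); returncode is None if the child hung.
    """
    try:
        proc = subprocess.run(cmd, stdout=subprocess.PIPE,
                              stderr=subprocess.STDOUT, timeout=timeout)
    except subprocess.TimeoutExpired as exc:
        # subprocess.run() has already killed the child by this point.
        return None, exc.output or b''
    return proc.returncode, proc.stdout
```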

So what we have is a test runner parsing the results of a unittest text report meant for human consumption. That output is then parsed in turn by the regexes of the automated build page. This seems pretty ridiculous, especially as the format is not exactly machine friendly. I could have (should have?) hacked into the build page code and modularised its test parsing code, sharing it between the test runner and the build page.

But then, if you are going to do that, why not just replace the TextTestRunner class with something completely customised for the job? Why not replace unittest bit by bit in an ad hoc, as-needed fashion, slowly building a framework? I didn't want to. I'm not really supposed to be building a framework, and that was the psychology in play.

Another(?) foolish design decision I made, based on a shallow visual aesthetic of fewer LOC, was to parse unittest results in a way that only worked when there was no "test noise". What do I mean by test noise? Print statements left in source code. C extensions that don't respect sys.stderr/sys.stdout redirection or suppression.

Below is exhibit A, a specimen from a sunny day of testing.

...............
---------------------------------------------------------------------
Ran 15 tests in 1.234s

As the tests run, unittest prints a dot, an E or an F (pass, error or fail) to a stream. By default that stream is stderr, but it can be any file-like object. I used a simple regex, ^[.EF]*$, to find any "dots" in the returned output. If there were any, I would slice the output from the end of the dots. From there I would take the first part of a split at the "Ran xxx tests" boundary, defined as '%s\nRan' % (70 * '-'). The failures lay between the DOTS and the RAN_TEST_DIV (thus named).

To piece it all together as if it were the output of one run, I would "join the dots" and join the failures. At the end I would count the total length of DOTS (., E and F combined), the E)rrors and the F)ailures. Voila. Worked a charm.
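Here is the parsing scheme described above, reconstructed as a sketch. split_report and join_reports are hypothetical names. The sketch uses [.EF]+ rather than [.EF]* so that blank lines don't count as a match. It also shows the weakness: if a line of noise interrupts the dots, everything after that line is silently dropped from the count.

```python
import re

DOTS_RE = re.compile(r'^([.EF]+)$', re.MULTILINE)
RAN_TEST_DIV = '%s\nRan' % (70 * '-')


def split_report(output):
    """Return (dots, failure_text) from one module's TextTestRunner output."""
    match = DOTS_RE.search(output)
    if not match:
        return '', ''
    after_dots = output[match.end():]
    # The failures lie between the dots and the "Ran xxx tests" divider.
    return match.group(1), after_dots.split(RAN_TEST_DIV)[0]


def join_reports(outputs):
    """'Join the dots' and the failure text of several runs, then count."""
    parts = [split_report(output) for output in outputs]
    dots = ''.join(d for d, _ in parts)
    return {'num_tests': len(dots),
            'errors': dots.count('E'),
            'failures': dots.count('F'),
            'failure_text': ''.join(f for _, f in parts)}
```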

What the hell was I thinking? The whole point of the exercise was to create a reliable test runner. I suppose I thought I was doing that. I wrote a few tests for some spectacularly unimaginative cases. They compared the output of single process mode and subprocess mode on some fake test suites: zero assertions, all passing, some failures, some errors. The subprocess mode was character-for-character perfect in its mime artistry. In fact, I did it that way in the first place precisely for this easy pull-apart, bind-together, compare-it automated testing.

All was simple and peaceful until I finally got a Linux test box working again (my laptop fan had died). I used ssh to log in and run the tests from my friend's Windows machine. Of course, one of the tests that required initializing the display failed.

single process mode: 504 tests, FAIL (failures=1)
subprocess mode: 495 tests OK

What the hell was going on? With horror I realized what I had done. Something was wreaking havoc with the fragile little regex: on failure, one of the SDL functions put out a huge amount of debugging output, interrupting the DOTS. I thought about rewriting it using some more substantial regular expressions. I went back and forth between doing that and redirecting sys.(stdout|stderr) while passing unittest a StringIO for the test results. I figured the redirection would let me keep the comparison tests I had in place, and for that matter the same degree of mimicry. I opted for redirecting std(err|out). I imagined other uses for it at the same time. None were all that compelling on reflection, and they would only have been useful if implemented differently. One example was showing stdout/stderr only when a test fails, so that print statement debugging could be left in; I did global redirection instead.

Of course, to do that I needed to create a command line option for each individual test module, to be passed from the "master" script in subprocess mode. unittest.main() does its own getopt parsing, so you can't just add an option and check sys.argv or use optparse. You have to do either, and then clear those options from sys.argv, because they would otherwise cause unittest to error. So, more fun ham-fisting around with unittest. I realised that I would need to do that at some stage for profiling command line options anyway, so there was another push in that direction. All the while, I was wondering whether I should just completely override the parseArgs method.
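Here is a minimal sketch of removing our own options before unittest sees them. pop_custom_flags is a hypothetical helper and --redirect is a hypothetical flag; -i|--incomplete is the fail-incomplete switch.

```python
def pop_custom_flags(argv, custom_flags):
    """Split our own flags out of argv so unittest's parser never sees them.

    Returns (flags_found, remaining_argv); argv[0] (the script name) is kept.
    """
    found = {arg for arg in argv[1:] if arg in custom_flags}
    remaining = [argv[0]] + [arg for arg in argv[1:] if arg not in custom_flags]
    return found, remaining
```

In a test module, the result would be written back with sys.argv[:] = remaining just before unittest.main() runs. Later Python versions also accept unittest.main(argv=remaining).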

In each module, I replaced the test_utils.get_fail_incomplete_option(); unittest.main() calls with test_utils.get_command_line_options(). unittest.main() always calls sys.exit() on completion of the tests, so I had to subclass it and override one of its methods. I needed this because, after capturing the unittest result stream in a StringIO, I would restore stderr and write the results to it.
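Later versions of unittest removed the need for that subclass, because unittest.main() has accepted exit=False since Python 2.7. The sketch below captures the report in a StringIO and writes it out afterwards. run_module_captured is a hypothetical helper.

```python
import io
import sys
import unittest


def run_module_captured(module, out=None):
    """Run a module's tests without sys.exit(), capturing the text report.

    The captured report is then written to `out` (default: the real stderr).
    """
    stream = io.StringIO()
    program = unittest.main(module=module, argv=['run_module_captured'],
                            testRunner=unittest.TextTestRunner(stream=stream),
                            exit=False)
    (out or sys.stderr).write(stream.getvalue())
    return program.result
```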

I added some test cases, print_stdout and print_stderr, comparing the results. For testing purposes I of course had to add a redirect mode to single process mode as well. Everything was OK again, until I ran it on the Linux box through ssh once more.

495 tests OK. (should have been 504 with one failure)

Damn it! So it seems that some stderr/stdout output is not redirected. I imagine it's mostly C extensions (or system calls) and the like that would do this, but then that is pygame all over. Briefly I pondered printing the results back out on stdout, and just Praying™ that any such noise would always go to stderr.
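A likely explanation for the leak, offered as an assumption about pygame's C code rather than a diagnosis: replacing sys.stdout only affects Python-level writes. C code that writes to file descriptor 1 or 2 bypasses it. Redirecting the descriptor itself with os.dup2 catches both, as this sketch shows. capture_fd_output is a hypothetical name.

```python
import os
import sys
import tempfile


def capture_fd_output(func, fd=1):
    """Call func() with OS-level file descriptor fd (1=stdout, 2=stderr)
    pointed at a temporary file, and return the bytes written to it.
    """
    sys.stdout.flush()  # don't let earlier buffered Python output leak in
    sys.stderr.flush()
    saved_fd = os.dup(fd)
    with tempfile.TemporaryFile() as tmp:
        os.dup2(tmp.fileno(), fd)
        try:
            func()
        finally:
            os.dup2(saved_fd, fd)  # restore the original descriptor
            os.close(saved_fd)
        tmp.seek(0)
        return tmp.read()
```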

So what did I do? What any fool already invested would do: I decided to mark up the results with lines like:

<!-- UNITTEST_RESULTS_START_HERE --!>


I created 3 sections using 2 dividers. The first is all the noise output, meaning anything not respecting redirection. The second is the unittest results. The last is the multiplexed results, what you would see if running the script in a shell. I overrode the write method on the StringIO collecting the unittest results so that it also writes to the (previously redirected) stdout. Using subprocess.Popen(...., stdout=subprocess.PIPE, stderr=subprocess.STDOUT), everything is muxed together. I then wrote a function that regex-splits the 3 sections, keeping the results for compiling the DOTS. It's a long way from Kansas though, isn't it, Toto?
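Here is a sketch of the tee and the three-way split. Only the first marker appears in the post; MUXED_START is a hypothetical second divider. The sketch uses str.split in place of the regex split the post used.

```python
import io

RESULTS_START = '<!-- UNITTEST_RESULTS_START_HERE --!>'
MUXED_START = '<!-- MUXED_RESULTS_START_HERE --!>'  # hypothetical second divider


class TeeStringIO(io.StringIO):
    """Collects unittest's results while echoing every write to another stream."""

    def __init__(self, echo_to):
        super().__init__()
        self.echo_to = echo_to

    def write(self, text):
        self.echo_to.write(text)
        return super().write(text)


def split_sections(output):
    """Split child output into (noise, unittest results, muxed output)."""
    noise, rest = output.split(RESULTS_START, 1)
    results, muxed = rest.split(MUXED_START, 1)
    return noise.strip(), results.strip(), muxed.strip()
```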

What a PITA! And that's not even the half of it. I ended up having to rewrite all the command lines I was passing to subprocess.Popen from string templates to lists so they would work cross-platform. Also, when you use the same file object for both stderr and stdout, the way subprocess multiplexes them is inconsistent across platforms. What you see is not necessarily what you get. On Windows it sufficed to just "print compiled_test_results", but on Linux I needed to print >> sys.stderr.

All in all, a lot of tippy-toeing around unittest. I really made a complete tangled webby mess of the whole job. A black comedy of errors. I'm not sure whether to remove the stderr/stdout redirection and replace the regexes with something less fragile. It's already been too much of a hole, sucking in time, and I would have to update the run_tests__tests as well.

What would I do differently looking back? What would I do if I had no constraints? Unfortunately, probably two very different questions.

** What I would do differently? **
==================================

This much I do know: the build page and the test runner script require overlapping functionality. They both parse unittest TextTestRunner output to gather statistics on test results. I could have modularised this parsing functionality and shared it between the two. That really raises the question, though: why parse something designed for human consumption at all? Why not pass a customised test runner class into unittest?

There is still the problem of communicating across process boundaries, solved here by using an asynchronous extension class of subprocess.Popen. Would you log each process's results to a file using something like XML? Or maybe pickle the results and then join them back together? You could even have a client/server architecture, using sockets to send pickled test results back to the server as native Python objects to be pieced together.
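Here is one of those options sketched out. The child receives a file path on the command line and pickles its results to that file, so stray noise on stdout cannot corrupt the data. The helper names and the stand-in result dict are hypothetical.

```python
import os
import pickle
import subprocess
import sys
import tempfile

CHILD_SOURCE = '''
import pickle, sys
result = {"num_tests": 3, "failures": [], "errors": []}  # stand-in results
print("noise that would corrupt a pickle sent over stdout")
with open(sys.argv[1], "wb") as f:
    pickle.dump(result, f)
'''


def run_child_and_collect(child_source):
    """Run child code in a new interpreter and unpickle the results it saved."""
    fd, path = tempfile.mkstemp(suffix='.pickle')
    os.close(fd)
    try:
        subprocess.run([sys.executable, '-c', child_source, path],
                       check=True, stdout=subprocess.DEVNULL)
        with open(path, 'rb') as f:
            return pickle.load(f)
    finally:
        os.remove(path)
```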

As well as isolating tests, we want to add profiling functionality and tagging to split tests into different groups.

Tuesday, July 8, 2008

killer redux

I was laboring under the bastard conception that when using subprocess.Popen(), shell=True is required for the subprocess executable to have access to the environment variables. Where the hell did I get that idea? A stupid, unquestioned assumption that almost gave birth to a lasting bug.
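Here is a quick check of the corrected assumption. Without shell=True, a child still inherits the parent's environment (env=None means inherit). Passing env replaces that environment entirely. child_sees is a made-up helper for the demonstration.

```python
import os
import subprocess
import sys


def child_sees(var_name, extra_env=None):
    """Return the value of var_name as seen by a child process (no shell)."""
    env = None  # None: the child inherits os.environ
    if extra_env:
        env = dict(os.environ, **extra_env)  # copy, then override
    code = 'import os, sys; sys.stdout.write(os.environ.get(%r, ""))' % var_name
    return subprocess.check_output([sys.executable, '-c', code], env=env).decode()
```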

For the test runner, I was using system calls to taskkill or pskill for process control under Windows. The idea was to try executing each one; if one was on the %PATH%, its return code would not indicate an error. In that case the search was over, and a Popen wrapper around (taskkill|pskill) would suffice as an os.kill().

This worked fine and dandy, except on Windows 98. There, no error code was returned even if neither task killer was on the path, so the runner would define a useless os.kill.

Lenard, the Windows maintainer of pygame, asked why I used a hacky wrapper around pskill or its ilk. There was already a reliance on pywin32, so why not use win32api.TerminateProcess?

That works fine but does not kill process trees. I thought killing the whole tree was a requirement because I was passing shell=True to the Popen constructor. With shell=True, Popen launches cmd.exe (or equivalent), which in turn launches the subprocess of choice.

I realized that only one process needed killing, and that this would also avoid problems with differing return codes on older versions of Windows. So TerminateProcess was given the job.
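For later readers: Python 2.6 added Popen.terminate() and Popen.kill(). On Windows both call TerminateProcess on the child's handle. Without shell=True, a single call therefore kills the test process, and no pywin32 import is needed. kill_hung_child is a hypothetical helper.

```python
import subprocess
import sys


def kill_hung_child(proc):
    """Kill a single child process (not a process tree) and reap it."""
    proc.kill()         # SIGKILL on POSIX, TerminateProcess on Windows
    return proc.wait()  # the child's (non-zero) return code
```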

Long live TerminateProcess.

Friday, July 4, 2008

dot points on build page extensions

I have been thinking about making some extensions to the build page.

Raw Data

  • Keep raw_data to process at any time. No need to discount old data collected from buggy analysis.

Profiling

  • Use function wrappers that log profiling data for each test, across multiple calls.

  • -p|--profile command line mode

Tests

  • Use subprocess mode by default for run_tests.py

  • Web interface for ticketing off tests

Build information

  • Post compiler version

  • Post complete Setup file

  • Post complete build output

  • Post complete test output

  • Python sys.path

  • Environment variables

  • As much as possible, unprocessed for archives

Machine information

  • Processor speed

  • CD-ROM availability

  • etc, etc.
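Here is a sketch of the function-wrapper idea from the Profiling points above. It uses cProfile and keeps one profile per call so that repeated calls can be merged later. profiled, PROFILES and report are hypothetical names, not an existing pygame API.

```python
import cProfile
import functools
import io
import pstats

PROFILES = {}  # qualified function name -> list of cProfile.Profile, one per call


def profiled(func):
    """Profile every call of func, keeping each call's data separately."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        profiler = cProfile.Profile()
        try:
            return profiler.runcall(func, *args, **kwargs)
        finally:
            PROFILES.setdefault(func.__qualname__, []).append(profiler)
    return wrapper


def report(name, limit=5):
    """Merge all recorded calls for name and return the top entries as text."""
    out = io.StringIO()
    stats = pstats.Stats(*PROFILES[name], stream=out)
    stats.sort_stats('cumulative').print_stats(limit)
    return out.getvalue()
```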

Breaking up tests

Should the tests fail if, for example, a machine doesn't have a CD drive (assuming the stubs were filled out)?

Should tests that require Numeric or NumPy fail if neither is available?

There are some classes of tests that it seems to make sense to split apart from the main "base" group of tests. What should be the "base" group of tests to automate with the run_tests.py test runner?

What about tests that require human verification? For the build page a "base" group of tests should be specified.

What should be the requirements for machines sending results to the build page? Numeric, NumPy? The win32 extensions on Windows? A CD-ROM drive? A 32-bit color display?
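Here is one possible answer to the optional-dependency questions. It is sketched with skip decorators that reached unittest in Python 2.7, after this post was written. The test is skipped rather than failed when NumPy is missing, so it is still reported but does not count against the machine. SurfarrayTest is a hypothetical example.

```python
import unittest

try:
    import numpy
except ImportError:
    numpy = None


class SurfarrayTest(unittest.TestCase):
    @unittest.skipIf(numpy is None, 'NumPy not installed')
    def test_array2d(self):
        self.assertEqual(numpy.zeros((2, 2)).shape, (2, 2))
```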

Thursday, July 3, 2008

test_not_implemented()

def test_get_arraytypes(self):

    # __doc__ (as of 2008-06-25) for pygame.sndarray.get_arraytypes:

    # pygame.sndarray.get_arraytypes (): return tuple
    #
    # Gets the array system types currently supported.
    #
    # Checks, which array system types are available and returns them as a
    # tuple of strings. The values of the tuple can be used directly in
    # the use_arraytype () method.
    #
    # If no supported array system could be found, None will be returned.

    self.assert_(test_not_implemented())


The assertion on test_not_implemented() will fail if the test suite is run with the -i|--incomplete command line option.
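Here is a sketch of how the helper might work, assuming a module-level fail_incomplete_tests switch in test_utils. set_fail_incomplete is a hypothetical helper. assertTrue replaces self.assert_, which Python 3.12 removed.

```python
import unittest

# Module-level switch in test_utils; run_tests.py would set it from
# -i|--incomplete. The function bodies below are an assumed implementation.
fail_incomplete_tests = False


def set_fail_incomplete(flag):
    """Hypothetical helper a test runner could call after option parsing."""
    global fail_incomplete_tests
    fail_incomplete_tests = flag


def test_not_implemented():
    """True normally; False (so the asserting stub fails) in -i mode."""
    return not fail_incomplete_tests


class SndarrayModuleTest(unittest.TestCase):
    def test_get_arraytypes(self):
        self.assertTrue(test_not_implemented())
```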

As mentioned in previous posts, I developed a unittest stub generator that outputs stubs for any untested units. It relies on a naming scheme for the tests. The stubber inspects the xxxx_test.py modules and looks at the names of the unittest.TestCase subclasses and their test_xxxx methods. From those names it determines what is already tested.

For each public callable there is a corresponding test named test_$callable_name. Comments or descriptions are appended to this name, separated by a double underscore:

test_quit__returns_None_if_not_already_init


What if there is both a module.quit and a module.class.quit? Each class has its own TestCase (and thus namespace), named $classTypeTest. This is typically the case anyway, with setUp() methods specific to the class being tested.

from inspect import isclass, isgetsetdescriptor

# is_public() and REAL_HOMES are defined elsewhere in the script
def get_callables(obj, if_of = None, check_where_defined=False):
    publics = (getattr(obj, x) for x in dir(obj) if is_public(x))
    callables = (x for x in publics if callable(x) or isgetsetdescriptor(x))

    if check_where_defined:
        callables = (c for c in callables if ( 'pygame' in c.__module__ or
                    ('__builtin__' == c.__module__ and isclass(c)) )
                    and REAL_HOMES.get(c, 0) in (0, obj))

    if if_of:
        callables = (x for x in callables if if_of(x)) # isclass, ismethod etc

    return set(callables)


The script uses inspection to find everything testable in pygame, but there were a few complications. One was getter/setter properties. Another was that some objects need to be instantiated before inspection reveals their innards. Then there was filtering out non-pygame callables, and after that, callables that appeared in more than one module.

For example, pygame.rect.Rect led a double life as pygame.sprite.Rect. Just check the __module__ attribute?

In [4]: pygame.sprite.Rect.__module__
Out[4]: 'pygame'


The workaround was to make a mapping of object to the place where it was defined. There were only 9 of these.

REAL_HOMES = {
pygame.rect.Rect : pygame.rect,
pygame.mask.from_surface : pygame.mask,
pygame.time.get_ticks : pygame.time,
.....


On some of the classes the __module__ attribute was __builtin__, so I needed to make an exception for them when filtering out non-pygame callables.

In [7]: pygame.cdrom.CDType.__module__
Out[7]: '__builtin__'
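Here is the check_where_defined filter from get_callables, restated as a standalone Python 3 function so it can be tried without pygame. In Python 3, __builtin__ is spelled builtins. defined_in_package is a hypothetical name, and the modules in the usage are stand-ins.

```python
from inspect import isclass


def defined_in_package(obj, module, real_homes, package='pygame'):
    """Mirror get_callables' check_where_defined filter for a single object."""
    module_name = getattr(obj, '__module__', None) or ''
    # Accept package objects, plus classes that report the builtins module.
    from_package = package in module_name or (
        module_name == 'builtins' and isclass(obj))
    # Objects listed in real_homes only count in the module that defines them.
    return from_package and real_homes.get(obj, 0) in (0, module)
```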


def module_stubs(module):
    stubs = {}
    all_callables = get_callables(module, check_where_defined = True) - IGNORES
    classes = set (
        c for c in all_callables if isclass(c) or c in MUST_INSTANTIATE
    )

    for class_ in classes:
        base_type = class_

        if class_ in MUST_INSTANTIATE:
            class_ = get_instance(class_)

        stubs.update (
            make_stubs(get_callables(class_) - IGNORES, module, base_type)
        )

    stubs.update(make_stubs(all_callables - classes, module))

    return stubs


The stubber finds all modules in the pygame package. For each module it uses inspection to create a set of all the callables, minus those listed in the IGNORES setting. IGNORES exists for any exceptions to the filtering, and also for tests that have been grouped under one test name. These objects will not be stubbed.

IGNORES = set([

pygame.rect.Rect.h, pygame.rect.Rect.w,
pygame.rect.Rect.x, pygame.rect.Rect.y,

pygame.color.Color.a, pygame.color.Color.b,
pygame.color.Color.g, pygame.color.Color.r,

......



From that it creates a subset of "classes". An element qualifies if "inspect.isclass(element)" is true or if it is in the manually maintained MUST_INSTANTIATE dict. That dict maps each base type to the helper callable (the instantiator) and the arguments required to return an instance.

MUST_INSTANTIATE = {

# BaseType / Helper # (Instantiator / Args) / Callable

pygame.cdrom.CDType : (pygame.cdrom.CD, (0,)),
pygame.mixer.ChannelType : (pygame.mixer.Channel, (0,)),
pygame.time.Clock : (pygame.time.Clock, ()),


..

}
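Here is a sketch of get_instance and of keeping the base type around for naming. ChannelType and Channel are stand-ins rather than pygame's objects, and case_name_for is a hypothetical helper.

```python
class ChannelType(object):
    """Stand-in for pygame.mixer.ChannelType (not the real class)."""

    def get_volume(self):
        return 1.0


def Channel(channel_id):
    """Stand-in for the pygame.mixer.Channel factory."""
    return ChannelType()


MUST_INSTANTIATE = {
    # BaseType : (Instantiator, Args)
    ChannelType: (Channel, (0,)),
}


def get_instance(base_type):
    """Call the registered instantiator with its stored arguments."""
    instantiator, args = MUST_INSTANTIATE[base_type]
    return instantiator(*args)


def case_name_for(base_type):
    # The name comes from the base type, since an instance has no __name__.
    return '%sTest' % base_type.__name__
```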


Inspecting an xxxxType directly revealed no methods, so it had to be instantiated. The returned instance, however, lacked attributes such as __name__, which is needed later to determine the test name. The xxxxType was therefore passed to the stub generation function as the "parent class" for each callable found by inspecting the instance.

Any callables not in the "classes" set are assumed to be module-level functions, and a stub is created for each.


The test stubber is used from the command line:

$ gen_stubs.py --help
Usage:
$ gen_stubs.py ROOT

eg.

$ gen_stubs.py sprite.Sprite

def test_add(self):

# Doc string for pygame.sprite.Sprite:

...


Options:
-h, --help show this help message and exit
-l, --list list callable names not stubs
-t, --test_names list test names not stubs


$ gen_stubs.py pygame -l
pygame.base.error.args,
pygame.bufferproxy.BufferProxy.length,
pygame.bufferproxy.BufferProxy.raw,
pygame.event.Event,
pygame.image.tostring,
pygame.joystick.Joystick,
pygame.key.get_repeat,
pygame.mask.Mask,
pygame.mixer.Channel,
pygame.movie.Movie,
pygame.overlay.overlay.display,
pygame.overlay.overlay.get_hardware,
pygame.overlay.overlay.set_location,
pygame.pixelarray.PixelArray.surface,
pygame.sprite.AbstractGroup.add,
pygame.sprite.AbstractGroup.add_internal,
pygame.sprite.AbstractGroup.clear,
pygame.sprite.AbstractGroup.copy,
pygame.sprite.AbstractGroup.draw,
pygame.sprite.AbstractGroup.empty,
pygame.sprite.AbstractGroup.has_internal,
pygame.sprite.AbstractGroup.remove,
pygame.sprite.AbstractGroup.remove_internal,
pygame.sprite.AbstractGroup.sprites,
pygame.sprite.AbstractGroup.update,
pygame.sprite.collide_rect,


Commas are appended for easy copy/paste into the IGNORES list.

gen_stubs.py is an integral part of the plan to make it extremely easy for people to contribute unit tests. One man can only do so much.