My implementation is about 30x slower than @cbiffle's
(90 secs vs 3 secs for the final image of RTTNW) . I have to figure out where I'm doing it wrong. I guess it is somewhere in the bvh implementation.
Edit: indeed there was an error in the axis-aligned bounding box hit function, now fixed. Now it is about 8 secs. There's still room for improvements for the bvh implementation.