I am trying to do a typical "A/B testing" style comparison of two different implementations of a real-life algorithm, using the same dataset in both cases. The algorithm is deterministic in terms of execution, so I really expect the results to be repeatable.
On a Core 2 Duo, this is indeed the case. Using just the Linux time command, I get variations in execution time of about 0.1% (over 10 runs).
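A minimal sketch of how such a run-to-run spread could be quantified, in case it helps make the numbers concrete. This is not the setup behind the figures above (those came from time); the command being timed is a placeholder, and time_runs and relative_spread are hypothetical helper names.

```python
import statistics
import subprocess
import sys
import time


def time_runs(cmd, runs=10):
    """Return wall-clock durations in seconds for `runs` executions of cmd."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
        durations.append(time.perf_counter() - start)
    return durations


def relative_spread(durations):
    """Largest deviation from the mean, as a fraction of the mean."""
    mean = statistics.mean(durations)
    return max(abs(d - mean) for d in durations) / mean


if __name__ == "__main__":
    # Placeholder workload (assumption): substitute the real binary and dataset.
    cmd = [sys.executable, "-c", "sum(range(10**6))"]
    d = time_runs(cmd)
    print(f"mean {statistics.mean(d):.4f}s, spread {relative_spread(d):.1%}")
```

Note that this measures wall-clock time including process startup, so for a short-running program the startup cost is part of what gets compared.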
On an i7, I get all sorts of variations, and I can easily see runs 30% above or below the average. I assume this is due to the various CPU optimizations that the i7 performs (dynamic overclocking, etc.), but it makes this kind of testing really difficult. Is there any other way to determine which of the two algorithms is "best", or any other sensible metric I can use?
Edit: The algorithm does not run for very long, and this is in fact the real-life scenario I am trying to benchmark. So running it repeatedly is not really an option.