The Good, The Bad and The Ugly - Rule Engine Benchmarks

Goodbaduglydvd

Last night I watched “Unforgiven” so I thought I’d continue the spaghetti Western theme today; looking at the challenge of developing, running and interpreting benchmarks for Business Rule Management Systems. I recently read the excellent article “Behind the benchmarks: SPEC, GFLOPS, MIPS et al” by Jon “Hannibal” Stokes. Although Jon discusses CPU benchmarking many of his points are relevant to rule engine benchmarking as well. For example, here is how he describes the various techniques used to to measure CPU performance:

  • Real-world Applications: one of the most popular, and best, ways to benchmark a system is by running a real program on it and see how long it takes the program to complete a task.
  • Kernels: a small, CPU-intensive portion of a real program that’s intended to be run on its own or as part of a benchmarking suite.
  • Toy Applications: small programs like Quicksort or the Sieve of Eratosthenes that people cook up to compile and run on any computer.
  • Synthetic Benchmarks: try to determine the average instruction mix of a typical program so that it can replicate that mix and run those types of instructions on the CPU.

The current “Academic Benchmarks” for BRMS fall somewhere within the definition of a “Kernel” or a “Toy Application”, probably closer to the latter. Although they can be useful for implementors of rule engines, I’d argue they are of no use to evaluators of rule engine technology and may be a significant distraction. ILOG has been guilty of publishing academic benchmark numbers in the past of course, so I am not putting us on a pedestal!

Academic Benchmarks

The issues with the academic benchmarks for BRMS are fairly well documented, and there seems to be a general consensus that they should not be used for reliable product performance comparison, however they keep cropping up! Even in Ruby!

Here are some of the problems:

  • Very small number of rules (less than 50)
  • Test worst-case RETE inference and agenda usage, which is rarely encountered with well designed applications
  • Implement solutions to combinatorial search problems, typically better implemented using a CP Engine rather than a RETE engine
  • The solutions are not verified, meaning that correctness may be inadvertently (?) sacrificed for speed. This is particularly the case for Waltz and WaltzDB for which I’ve never seen any verification of the results.
  • The benchmarks are easy to “game” because every vendor ports from the original OPS5 source code into their own rule language, where they take advantage of their product’s features

For reference, I have included the descriptions for the three most famous academic benchmarks below (paraphrased from the paper “Effects of Database Size on Rule System Performance: Five Case Studies” published by Daniel Miranker et al). How representative do they sound of your application!?

Manners

Manners is based on a depth-first search solution to the problem of finding an acceptable seating arrangement for guests at a dinner party. The seating protocol ensures that each guest is seated next to someone of the opposite sex who shares at least one hobby.

Waltz and Waltzdb

Waltz was developed at Columbia University. It is an expert system designed to aid in the 3-dimensional interpretation of a 2-dimensional line drawing. It does so by labeling all lines in the scene based on constraint propagation. Only scenes containing junctions composed of two and three lines are permitted. The knowledge that Waltz uses is embedded in the rules. The constraint propagation consists of 17 rules that irrevocably assign labels to lines based on the labels that already exist. Additionally, there are 4 rules that establish the initial labels.

Waltzdb was developed at the University of Texas at Austin. It is more general version of the Waltz program. Walzdb is designed so that it can be easily adapted to support junctions of 4, 5, and 6 lines. The method used in solving the labeling problem is a version of the algorithm described by Winston [Winston, 1984]. The key difference between the problem solving technique used in waltz and waltzdb is that waltzdb uses a database of legal line labels that are applied to the junctions in a constrained manner. In Waltz the constraints are enforced by constant tests within the rules. The input data for waltz is a set of lines defined by Cartesian coordinate pairs.

Conclusions

If performance is one of your major buying criteria then I would strongly encourage you to build a proof-of-concept set of rules and data and verify rule engine performance in your own environment. It is impossible to meaningfully extrapolate from published academic benchmark results to your running application, with your rules and data, deployed to your OS and hardware. In addition this will also allow you to evaluate the BRMS from as many angles as possible, spanning ease-of-use, business user accessibility, support and professional services, performance, deployment platforms, scalability etc.

References

Bookmark with: These icons link to social bookmarking sites where readers can share and discover new web pages.
  • Digg
  • del.icio.us
  • Technorati
  • Slashdot
  • StumbleUpon
  • Furl

Tags: , , ,

4 Responses to “The Good, The Bad and The Ugly - Rule Engine Benchmarks”

  1. Peter Lin Says:

    I think Manners or Waltz benchmark shouldn’t be blamed for sales guys misusing them. They were originally designed to test the worst case, which it does well. From first hand experience many developers write inefficient rules. In many cases, their rules end up being worst case. One can argue they shouldn’t do that, but the reality is it takes a lot of time to gain competence with rule engines.

    Even though manners and waltz do not represent or predict real world performance, it does measure the efficiency of the implementation. For that, I think it does a good job. For the last few years I’ve been writing various rule benchmarks for testing Jamocha. What is needed is a decent set of academic and “realistic” benchmarks, which end users can run. This way, the end user can look at the sample rules to see how it impacts performance. They can also learn from the sample and see what trade offs impact performance.

    You and I have talked about this in the past. Once OMG PRR is finalized, I plan to port some of my benchmarks to PRR. Rather than complain about abuse of Manners and Waltz, it would be more productive to collaborate across the industry to produce a set of benchmarks. Benchmarks aren’t inherently good or bad. As long as the developer understands what it does and what the results show, it’s a useful tool.

  2. Charles Young Says:

    You were kind enough to call out one of my blog posts. I’ve debated the usefulness of these tests (especially Miss Manners) with Peter on a few occasions. I think it’s fair to say that I’ve taken the more negative line, but I do agree with Peter that there is little point in carping on about the inadequacy of specific tests. In reality, it is inherent to the way production systems work that any one single test will always prove an ‘inadequate’ measure of an engine’s general performance characteristics. In computer science, it is common to characterise algorithmic performance in terms of asymptotic time and space complexities using ‘Big-O’ notation. However, you can’t do that with the Rete algorithm, per se. It makes no sense to claim, in a general sense, that a production engine exhibits, say, O(log n) time complexity. You *always* have to qualify such statements by relating them to specific retia. Individual rule sets, and their corresponding executable rete networks, will exhibit widely different time and space complexities because, in a sense, they are the things which are truly ‘algorithmic’ at run time.

    If benchmarks must be used to try to determine the relative performance merits of different engines, then Peter is again correct in arguing for a suite of benchmarks. Such a suite needs to be very varied in order to investigate the behaviour of the engine in a reasonably comprehensive way. I think it would be useful if individual tests clearly exercised the engine in specific ways - e.g., rule sets that generate no beta nodes, rule sets that chiefly depend on join nodes, rule sets (like Manners) that chiefly depend on NotCE nodes or (again, like Manners) perform a large number of assertions and retraction during forward chaining, tests that test scalability for large rule sets and for large fact bases, etc., etc. It would also be useful to test parallel execution of rule sets (I don’t mean parallel Rete implementations - few engines support that, currently; I mean the kind of scenario you typically encounter in integration and process automation where multiple instances of the engine may be executing in parallel on a single box).

    With care, a rich suite of tests could be used to provide a broad-brush comparison of the performance characteristics of different engines. You really do have to take care though. Consider the current version of CLIPS (6.2). CLIPS is set to become significantly faster in the next release (currently in late Beta) thanks to the introduction of hash indexing and other features. However, even today’s version can be blindingly fast in many scenarios. It performs Miss Manners very well indeed for very small guest sizes, but poorly for large guest sizes (something that improves dramatically in v6.3). So, is it a ‘fast’ or a ’slow’ engine? It’s neither, or both, depending on your viewpoint. These complex realities are often lost in general benchmarking.

  3. peter lin Says:

    I hope with deligent work and cooperation across the community, a decent set of benchmarks can improve the situation. Having realistic examples or even multiple implementations solving the same business problem would become a good tool for users.
    Even though I have some benchmarks in CLIPS format, many people don’t like it, so using a standard language hopefully will improve the situation. Atleast that is my naive hope.

    peter

  4. embedded cpu benchmark | computer tags Says:

    […] The Good, The Bad and The Ugly - Rule Engine Benchmarks I recently read the excellent article ?Behind the benchmarks: SPEC, GFLOPS, MIPS et al? by Jon ?Hannibal? Stokes. Although Jon discusses CPU benchmarking many of his points are relevant to rule engine benchmarking as well. …ILOG BRMS Blogs - http://blogs.ilog.com/brms […]

Leave a Reply