November 20, 2010

Code Bloating in Genetic Programming

In this post I'll try to present some information regarding the main subject of my Master's Thesis which is entitled "Towards Solution Parsimony in an Enhanced Genetic Programming Process" (sounds a bit bombastic, I know :P). Work on this paper has been carried out at the Heuristic and Evolutionary Algorithms Laboratory (HEAL) in the Hagenberg Softwarepark, Austria. My research was coordinated by Prof. Dr. Michael Affenzeller and Prof. Dr. Daniela Zaharie and was sponsored by the SPRERS project and ISI Hagenberg.

In order to present what I've worked on for the better part of one year, I'll start with a concise wiki definition of what genetic programming (GP) is:
In artificial intelligence, genetic programming (GP) is an evolutionary algorithm-based methodology, inspired by biological evolution, to find computer programs that perform a user-defined task. It is a specialization of genetic algorithms (GA) where each individual is a computer program.
As the user-defined task can vary from detecting military equipment in  satellite recconaissance  photos to designing analogue circuits, to solving symbolic regression problems or to evolving competitive AI players, it is easy to see that GP is a domain-independent method that can be used either as an automatic programming tool, a machine learning tool or a problem solving engine.

One of the main problems GP practitioners have always been confronted with is a phenomenon called code bloat: the average size of the GP evolved programs starts to grow at a very rapid pace after a certain number of generations but this increase in overall program complexity (size) is not matched by any corresponding increase in program accuracy. In time, bloat brings the entire evolutionary process to a grinding halt

The problem is not a trivial one as there are several studies that tie code bloating to the driving force behind GP and most evolutionary methods: the search for fitness

The objective of my Master Thesis was to enhance the general interpretability of GP based symbolic regression models produced by HeuristicLab (the heuristic optimization framework developed by HEAL). The approach for achieving this objective focused on understanding and reducing/controlling the bloating phenomenon that affected the enhanced GP process implemented in HeuristicLab. For more details please consult the full paper which is presented in this post.

GP has become a quite intense field of the study in the last couple of years and there is a wealth of (high quality) scientific papers regarding various aspects of this evolutionary method. For those that are very interested in GP, a good start is an excellent free online book about GP written by Ricardo Poli and William Langdon, two of the most well known researchers in the field, alongside John Koza of course.

No comments: