February 2, 2012

Paper: Improving the Parsimony of Regression Models for an Enhanced Genetic Programming Process - EUROCAST 2011

This paper was prepared for the EUROCAST 2011 conference and it basically contains a short summary of the work I have carried out for my Master Thesis Project. Here is the abstract of the article:
This research is focused on reducing the average size of the solutions generated by an enhanced GP process without affecting the high predictive accuracy this method exhibits when being applied on a complex, industry proposed, regression problem. As such, the effects the GP enhancements have on bloat have been studied and, finally, a bloat control system based on dynamic depth limiting (DDL) and iterated tournament pruning (ITP) was designed. The resulting bloat control system is able to improve by approx. 40% the average GP solution parsimony without impacting average solution accuracy.
You can download a preliminary version of the paper by clicking here or from my Downloads box (Improving the Parsimony of Regression Models for an Enhanced GP Process - EUROCAST 2011.pdf). The same preliminary draft of the document can be previewed at the bottom of this post. For the official, final version of the paper, please refer to the SpringerLink website.

For citations please use the following BibTeX reference:

@INCOLLECTION{Alexandru-CiprianZavoianu2012,
  author = {Alexandru-Ciprian Z\u{a}voianu and Gabriel Kronberger and Michael Kommenda and Daniela Zaharie and Michael Affffenzeller},
  title = {Improving the Parsimony of Regression Models for an Enhanced Genetic Programming Process},
  booktitle = {Computer Aided Systems Theory - EUROCAST 2011},
  publisher = {Springer Berlin / Heidelberg},
  year = {2012},
  editor = {Moreno-Diaz, Roberto and Pichler, Franz and Quesada-Arencibia, Alexis},
  volume = {6927},
  series = {Lecture Notes in Computer Science},
  pages = {264-271},
  doi = {10.1007/978-3-642-27549-4_34},
}

July 9, 2011

Tips & Tricks: Handling expensive evaluation functions in Evolutionary Computation

Evolutionary algorithms (EA) have become very popular for solving complex optimization / data based modeling problems as, generally, they are fairly robust, easy to use and able to produce high quality results.

A common problem in real-word applications is the fact that the evaluation/fitness function employed by this type of algorithms is quite expensive (time-wise or even financially) to evaluate. As such, when taking into consideration the restrictions imposed on the running time of the EA, only a limited amount of fitness evaluations can be performed and this usually has a huge impact on the quality of the obtained solutions.

The approaches that have been proposed by scientific literature to deal with this problem roughly fall within two main categories:
  • surrogate approximation - construct a new (faster to evaluate) expression of the objective function based on previous input to output mappings obtained using the original (expensive) evaluation function.
  • evolutionary approximation - reduce the number of expensive fitness evaluations performed during the run by trying to estimate the fitness of an individual without using the costly evaluation function by combining information from other similar individuals that have already been evaluated and, if available, information regarding parameter sensitivity.
Two of the most common methods for surrogate approximation are based on neural networks (NNs) or support vector machines (SVMs). The idea behind the approach is quite simple: when having a sufficient number of individual fitness evaluations (i.e. sampling data), try to train a NN or a SVM model that can approximate reasonably well the actual expensive fitness evaluation function. The quality of the approximation depends a lot on the complexity and roughness of the search space, on the amount of sampling data and of course on the choice of NN or SVM kernel. After the surrogate NN or SVM model has been trained we can use it to perform a lot of inexpensive but relatively accurate fitness evaluations during the run of the EA.

Evolutionary approximation is based on the fact that, for most search spaces, two individuals that share many common features also tend to have a similar fitness value. For example, if in an EA individuals are coded by a real parameter vector containing 15 values and one individual has 14 values that match exactly the values of an individual that has been evaluate 3 generations before and the 15 value is only slightly different, we can approximate the fitness of the current individual with that of its ancestor. This is a very brute approximation, that can be extended and refined (i.e. using advanced similarity metrics, interpolation, variable correlation information, etc) according to the concrete problem at hand.

A hybrid strategy worth mentioning is that of using the surrogate models especially in the early stages of the EA run and then switching back to the original fitness evaluation function at the latter stages of the run, when precision is of the utmost importance.  

May 30, 2011

Tips&Tricks: A little bit on personal information security - Tor and TrueCrypt

In these most auspicious times for social networking, business cloud migration and setting trends on twitter about Justin Bieber, I'll try to draw a bit more attention towards two open-source, free, industry proven, high quality pieces of software that can help you add a little more privacy and security to your personal information (if for whatever conspiracy theory driven reason you would feel the need to).

Tor (anonymity network) is a system, composed of client software and a network of volunteer servers, dedicated to ensuring online anonymity by helping to hide information about users' location and other factors which may help to identify them. Tor is an open-source implementation of the onion routing concept. Simplified, a message is repeatedly encrypted (n times) by the sender and then sent over n (randomly chosen) onion nodes to the receiver; each onion node removes one layer of encryption and forwards the message to the next node until the final exit node in the chain removes the last layer of encryption and forwards the message to its intended receiver. An onion node only has information about the node where it received the message from and the node to which it should forward the peeled message to. At no point along the path (except the exit onion node) does a routing node on the path hold the original message, the sender and the receiver. The combinination between Tor and a TLS system (e.g. HTTPS - check out The HTTPS Everywhere Project) removes the exit node vulnerability and makes for a quite free and robust online anonymity tool. Tor is free to download and extremely easy to install and use. Tor is by no means fail safe but it does a very good job at trying to keep your online anonymity intact from the perspective of message route path tracking.

TrueCrypt is a free, state of the art, on-the-fly encryption (OTFE) software that is compatible with Windows (7/Vista/XP), Mac OS X and Linux. It supports the AES, Twofish and Serpent encryption algorithms as well as various combinations of cascaded algorithms. Most importantly, on most current hardware architectures, parallelization and pipelining allow the system to read and write from/to an encrypted drive as fast as if the drive was not encrypted. TrueCrypt allows for the creation within a file of a virtual encrypted disk that can be mounted and unmounted, the encryption of an entire partition or storage device and plausible deniability tools like hidden volume and hidden operating system. TrueCrypt is also free to download and fairly simple to install and use.

I hope you found these software recommendations as interesting/useful as I did and I also hope that, if needed, they'll prove useful and save you some inconvenience. Now, let's get back to work and help trend #TrustInTheCloud #TheCloudIsSafe, #TheCloudCannotFail and #TheCouldIsYourFriend :P:D.

February 14, 2011

Tips&Tricks: iWork Numbers - difference in minutes between two dates

I can work with relative ease with Excel and OpenOffice Spreadsheet but the iWork Numbers application was totally new for me. I was working on a report in Numbers and I didn't know how to calculate the difference in minutes between two date fields and I must say that I didn't find the information I needed fast enough on the web.

To keep the long story short, after 2-3 minutes of trial and error, I came across the answer: the formula I was looking for was:

DUR2MINUTES(Field1-Field2)

Hope this saves someone a bit of time.