March 20, 2010

Mini-project: Histograms R pretty cool - uni-dimensional binary classification example

A few months ago I had to solve a pretty interesting, though not very difficult, assignment for my Machine Learning class. I suspect this is the subject of the very first lecture in most machine learning classes around the world.
So, given a data set containing 4 distinct features and 2 possible outcomes (or classes, denoted -1 and +1 for simplicity), the task was to determine whether any of the 4 features would be suitable for building a uni-dimensional classifier. In plain English: can a single feature, taken on its own, accurately predict whether a new data sample is a "-1" or a "+1"?
A very nice way to check the individual quality of such features is to plot, for each feature, a joint histogram of the two output classes (marked by +1 and -1). Plotting the histogram of any characteristic over a given domain is a rather simple thing: divide the domain into a fixed number of equal sub-intervals and, for each sub-interval, count the records that have the characteristic you are interested in. In our case the domain is the interval between the smallest and largest value of the given feature, and the characteristic is the output class, so each feature gets one histogram per class, drawn over the same sub-intervals.
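The counting step described above can be sketched in a few lines. This is a minimal illustration in Python, not the R script offered for download; the function name class_histograms and the default of 10 bins are assumptions made for the example.

```python
def class_histograms(values, labels, n_bins=10):
    """Count, per equal-width bin, how many samples of each class fall in it.

    Returns (edges, counts), where counts[c][i] is the number of samples of
    class c (-1 or +1) whose feature value lies in bin i.
    """
    lo, hi = min(values), max(values)
    # Fall back to a width of 1.0 if every value is identical (hypothetical guard).
    width = (hi - lo) / n_bins or 1.0
    edges = [lo + i * width for i in range(n_bins + 1)]
    counts = {-1: [0] * n_bins, +1: [0] * n_bins}
    for v, c in zip(values, labels):
        # The largest value would index one past the end; clamp it into the last bin.
        i = min(int((v - lo) / width), n_bins - 1)
        counts[c][i] += 1
    return edges, counts


# F1 column and classes of the five sample rows shown further down in the post.
f1 = [0.275, 0.665, 0.129, 0.832, 0.956]
r = [-1, -1, -1, 1, 1]
edges, counts = class_histograms(f1, r, n_bins=4)
print(counts[-1], counts[+1])
```

Drawing the two count lists as bars over the same edges gives the joint histogram; bins where both lists are non-zero are the overlapping regions.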
The idea was to plot the joint histogram of the two classes for each feature and see if there is a threshold that could separate them.
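Looking for such a threshold can also be done numerically. The sketch below is one possible brute-force search in Python; how the downloadable R script performs its search is not shown in this post, so the candidate set (midpoints between consecutive distinct values), the accuracy criterion, and the name best_threshold are all assumptions.

```python
def best_threshold(values, labels):
    """Brute-force search for the cut point with the highest training accuracy.

    Returns (threshold, sign, accuracy). A sample is predicted as `sign` when
    its value is above the threshold and as `-sign` otherwise.
    """
    xs = sorted(set(values))
    # Candidate cut points: midpoints between consecutive distinct values.
    candidates = [(a + b) / 2 for a, b in zip(xs, xs[1:])] or xs
    best = (None, +1, 0.0)
    for t in candidates:
        for sign in (+1, -1):  # try both orientations of the rule
            hits = sum((sign if v > t else -sign) == c
                       for v, c in zip(values, labels))
            acc = hits / len(values)
            if acc > best[2]:
                best = (t, sign, acc)
    return best


# F1 column and classes of the five sample rows shown further down in the post.
f1 = [0.275, 0.665, 0.129, 0.832, 0.956]
r = [-1, -1, -1, 1, 1]
threshold, sign, accuracy = best_threshold(f1, r)
print(threshold, sign, accuracy)  # cut near 0.7485, sign +1, accuracy 1.0
```

A feature is a good candidate for a uni-dimensional classifier when the best accuracy is close to 1; an accuracy near 0.5 means the two class histograms overlap almost completely.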
So here is how the input file is structured. The file starts with a header line that contains the names of the 4 feature fields (F1 to F4 in our case) and the name of the outcome field, R. Each of the following data lines holds 4 real values in the interval [0,1], one for each feature, followed by the outcome class. For example, the beginning of the file might look like this:
F1   ,F2   ,F3   ,F4   , R
0.275,0.975,0.304,0.638,-1
0.665,0.240,0.241,0.804,-1
0.129,0.388,0.754,0.717,-1
0.832,0.368,0.988,0.271, 1
0.956,0.820,0.787,0.012, 1
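A file laid out like the sample above can be read with a few lines of code. The following is a Python sketch under the assumption that the file is plain comma-separated text with space-padded fields, as the sample suggests; the function name read_samples is hypothetical and this is not the code from the R script.

```python
import csv
import io


def read_samples(stream):
    """Parse the header-plus-rows layout into per-feature columns and labels."""
    reader = csv.reader(stream)
    header = [name.strip() for name in next(reader)]  # names are space-padded
    features = {name: [] for name in header[:-1]}     # last column is the class
    labels = []
    for row in reader:
        if not row:
            continue  # skip blank lines
        *xs, r = (cell.strip() for cell in row)
        for name, x in zip(features, xs):
            features[name].append(float(x))
        labels.append(int(r))
    return features, labels


sample = io.StringIO(
    "F1   ,F2   ,F3   ,F4   , R\n"
    "0.275,0.975,0.304,0.638,-1\n"
    "0.665,0.240,0.241,0.804,-1\n"
    "0.129,0.388,0.754,0.717,-1\n"
    "0.832,0.368,0.988,0.271, 1\n"
    "0.956,0.820,0.787,0.012, 1\n"
)
features, labels = read_samples(sample)
print(features["F1"], labels)
```

Keeping one list per feature makes it easy to hand each column, together with the labels, to whatever builds the histogram for that feature.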
And here is what the output should look like for a file containing 100 data lines (green for the -1 class, red for the +1 class, brown where the two overlap; the blue line marks the optimal threshold value, found by brute force):


Obviously, for the given input, the only feature that seems suitable for uni-dimensional classification is feature 1 (F1).
You can download a script I have written in R that outputs such joint histograms from well-formatted input files from the Downloads box (JoinHistogramsInR.zip).

Application platform: Platform independent
