bodhidharma.classifiers.bio_k_nearest_neighbour
Class BioKNearestNeighbour

java.lang.Object
  extended by bodhidharma.classifiers.SupervisedClassifier
      extended by bodhidharma.classifiers.bio_k_nearest_neighbour.BioKNearestNeighbour

public class BioKNearestNeighbour
extends SupervisedClassifier

An interface for using the k-nearest neighbour algorithm with feature selection and/or weighting performed with genetic algorithms.

This classifier is trained by having the genetic algorithm choose the feature selection and/or weighting. The k-nearest neighbour algorithm only needs one iteration of training, of course. The "iterations" discussed in this class' description actually refer to generations of the genetic algorithm.

This classifier also calculates scaling factors for all of the input feature values. This is done to ensure that all features fall into roughly the same range, so that they have roughly the same "weighting" before the genetic algorithm goes to work. The scaling factors are calculated during the first iteration of training and stored. They are then applied to any unknown feature values that are subsequently input. Once calculated, they are not altered by further training.

WARNING: the scaling factors are only calculated the first time that this object is trained. Additional training will not cause recalculation of scaling factors. To recalculate them, a new BioKNearestNeighbour must be instantiated.
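
The calculate-once behaviour described above can be sketched as follows. This is only an illustration, assuming min-max scaling to the range 0 to 1 with out-of-range values clamped; the class FeatureScalerSketch and its methods are hypothetical and are not part of bodhidharma, and the scaling formula actually used by BioKNearestNeighbour may differ.

```java
import java.util.Arrays;

/** Hypothetical sketch: per-feature min-max scaling that is calculated only once. */
final class FeatureScalerSketch {
    private double[] min; // per-feature minimum seen during the first fit
    private double[] max; // per-feature maximum seen during the first fit

    /** Calculates the scaling factors on the first call; later calls leave them unchanged. */
    void fit(double[][] featureSets) {
        if (min != null) {
            return; // already calculated: mirrors the "first training only" behaviour
        }
        int n = featureSets[0].length;
        min = new double[n];
        max = new double[n];
        Arrays.fill(min, Double.POSITIVE_INFINITY);
        Arrays.fill(max, Double.NEGATIVE_INFINITY);
        for (double[] set : featureSets) {
            for (int f = 0; f < n; f++) {
                min[f] = Math.min(min[f], set[f]);
                max[f] = Math.max(max[f], set[f]);
            }
        }
    }

    /** Returns a scaled copy of the input; the input array itself is not modified. */
    double[][] scale(double[][] featureSets) {
        if (min == null) {
            throw new IllegalStateException("Scaling factors have not been calculated yet.");
        }
        double[][] scaled = new double[featureSets.length][];
        for (int s = 0; s < featureSets.length; s++) {
            scaled[s] = new double[featureSets[s].length];
            for (int f = 0; f < featureSets[s].length; f++) {
                double range = max[f] - min[f];
                // Clamp unseen values into the range observed during the first fit.
                double clamped = Math.max(min[f], Math.min(max[f], featureSets[s][f]));
                scaled[s][f] = (range == 0.0) ? 0.0 : (clamped - min[f]) / range;
            }
        }
        return scaled;
    }
}
```

In this sketch a second call to fit is silently ignored, so obtaining new scaling factors requires a new FeatureScalerSketch, in the same way that a new BioKNearestNeighbour must be instantiated.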

This system uses exemplar-based learning to classify arbitrary feature sets after training. This implementation makes it possible to assign more than one label to a single feature set.

Feature sets are fed into the classifier as arrays of doubles. Categories are specified as arrays of Strings.

Use the train method to train the classifier (the getFeatureNames method is useful for determining the features that can be fed to the classifier). This also calculates the feature value scaling factors.

One constructor is provided for creating a new classifier and one is provided for parsing XML code and using it to reconstruct a trained classifier.

Use the classify method to classify feature sets once training has been completed (the getCategories method is useful for determining what categories feature sets can be classified into and for determining the order of the categories when parsing classification results).

Use the save method to save the classifier and its current state to disk.

Use the getClassifierName and getClassifierParameters methods to obtain information about the classifier.

Use the getClassifierIdentifier method to get a name or code that was given to an instantiation of the classifier when it was constructed. This identifier can be used by external classes to identify the instantiation.

Use the getScaledFeatureValues method to find out what the values of a set of feature values would be after scaling.

See Also:
KNN, Breeder, FeatureSelectionEvaluator, FeatureWeightingEvaluator
Author:
Cory McKay

Field Summary
 
Fields inherited from class bodhidharma.classifiers.SupervisedClassifier
categories, feature_names, identifier, training_monitor
 
Constructor Summary
BioKNearestNeighbour(java.lang.String file_path)
          Parse the file specified by the given file path to recreate the specified trained classifier.
BioKNearestNeighbour(java.lang.String[] feature_names, java.lang.String[] categories, java.lang.String identifier, boolean using_feature_selection, boolean using_feature_weighting, GeneticAlgorithmJFrame gen_alg_settings, double fsw_training_fraction, ClassificationResultsInterpereter results_interpereter)
          Generate a BioKNearestNeighbour with the given parameters and randomly.
 
Method Summary
 double[][] classify(double[][] feature_sets, java.lang.String[] feature_labels)
          Returns the relative scores of each of the possible categories when the given sets of features are classified.
 java.lang.String getClassifierName()
          Returns the name of the type of classifier.
 java.lang.String getClassifierParameters()
          Returns a String describing the parameters of the classifier.
 boolean[] getFeatureSelection()
          Returns a copy of the array indicating whether or not a given feature is to be used for classification.
 double[] getFeatureWeights()
          Returns a copy of the array indicating the feature weights that are to be used for classification.
 double[] getMaxFeatureValueCutoffs()
          Returns the maximum allowable value for each of the features.
 double[] getMinFeatureValueCutoffs()
          Returns the minimum allowable value for each of the features.
 double[][] getScaledFeatureValues(double[][] feature_sets, java.lang.String[] feature_labels)
          Returns an array that represents the feature_sets parameter with values scaled to fall between 0 and 1 based on previous training of the BioKNearestNeighbour.
 void save(java.io.File place_to_save)
          Saves all of the fields to the given file.
 double[] train(double[][] feature_sets, java.lang.String[] feature_labels, java.lang.String[][] model_categories, int iterations, double acceptable_threshold, int consecutive_iterations)
          Trains the BioKNearestNeighbour using the given feature sets.
 
Methods inherited from class bodhidharma.classifiers.SupervisedClassifier
getCategories, getClassifierIdentifier, getFeatureNames, getModelResults, getOrderedFeatureSets, setTrainingMonitor
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

BioKNearestNeighbour

public BioKNearestNeighbour(java.lang.String[] feature_names,
                            java.lang.String[] categories,
                            java.lang.String identifier,
                            boolean using_feature_selection,
                            boolean using_feature_weighting,
                            GeneticAlgorithmJFrame gen_alg_settings,
                            double fsw_training_fraction,
                            ClassificationResultsInterpereter results_interpereter)
Generate a BioKNearestNeighbour with the given parameters and randomly.

Parameters:
feature_names - The names of the features that classifications will be based on.
categories - The names of categories that correspond to the possible classification results.
identifier - An identifier that can be associated with the classifier so that outside classes can identify it.
using_feature_selection - Whether or not feature selection will be performed.
using_feature_weighting - Whether or not variable feature weighting will be performed.
gen_alg_settings - Initialization settings for the genetic algorithm.
fsw_training_fraction - Fraction of training samples to actually be used for training when feature selection and weighting are calculated. Must be between 0 and 1.
training_monitor - Used to monitor training progress. It may be null if it is not used.
results_interpereter - Used to calculate fitnesses with genetic algorithms. It may be null if the user wants to use alternative fitness measures not based on actual classifications.

BioKNearestNeighbour

public BioKNearestNeighbour(java.lang.String file_path)
                     throws java.lang.Exception
Parse the file specified by the given file path to recreate the specified trained classifier. Throws an exception if problems occur during parsing.

Parameters:
file_path - The file path of a BioKNearestNeighbour_file that this instantiation is to be based on.
Throws:
java.lang.Exception

Method Detail

getClassifierName

public java.lang.String getClassifierName()
Returns the name of the type of classifier. This information can be used by external classes to identify the type of classifier in situations such as when the configuration of a trained classifier is being read from a file.

Specified by:
getClassifierName in class SupervisedClassifier

getClassifierParameters

public java.lang.String getClassifierParameters()
Returns a String describing the parameters of the classifier. This includes word length, population size, mating rate, mutation probability, whether or not elitism is allowed, number of villages, migration size, migration delay, whether or not feature selection is enabled, whether or not feature weighting is enabled and the value of K.

Specified by:
getClassifierParameters in class SupervisedClassifier

getFeatureSelection

public boolean[] getFeatureSelection()
Returns a copy of the array indicating whether or not a given feature is to be used for classification. Indices correspond to those of the feature_names field. These values are set during the feature selection portion of the train method. True means that the feature is to be used and false that it is not. A value of null is returned if feature selection has not been performed, in which case all features are to be used.


getFeatureWeights

public double[] getFeatureWeights()
Returns a copy of the array indicating the feature weights that are to be used for classification. Indices correspond to those of the feature_names field. These values are set during the feature weighting portion of the train method. Values are between 0 and 1, with 1 meaning that the feature is very important relative to others. A value of null is returned if feature weighting has not been performed, in which case all weights default to 1.
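
One way that a selection array and a weight array of this kind could enter a nearest-neighbour distance calculation is sketched below. This is an illustration only, assuming a weighted Euclidean distance; the class WeightedDistanceSketch is hypothetical, and the distance measure actually used by the KNN class is not specified here.

```java
/** Hypothetical sketch: Euclidean distance with an optional selection mask and optional weights. */
final class WeightedDistanceSketch {
    /**
     * @param selection which features to use, or null to use all of them
     * @param weights   per-feature weights between 0 and 1, or null to weight every feature by 1
     */
    static double distance(double[] a, double[] b, boolean[] selection, double[] weights) {
        double sum = 0.0;
        for (int f = 0; f < a.length; f++) {
            if (selection != null && !selection[f]) {
                continue; // a deselected feature does not contribute to the distance
            }
            double w = (weights == null) ? 1.0 : weights[f];
            double diff = a[f] - b[f];
            sum += w * diff * diff;
        }
        return Math.sqrt(sum);
    }
}
```

Passing null for both arrays reduces this to the ordinary Euclidean distance, which matches the null conventions of getFeatureSelection and getFeatureWeights.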


save

public void save(java.io.File place_to_save)
          throws java.lang.Exception
Saves all of the fields to the given file. A separate file is created for the knn object. Throws an exception if a problem occurs.

Specified by:
save in class SupervisedClassifier
Throws:
java.lang.Exception

train

public double[] train(double[][] feature_sets,
                      java.lang.String[] feature_labels,
                      java.lang.String[][] model_categories,
                      int iterations,
                      double acceptable_threshold,
                      int consecutive_iterations)
               throws java.lang.Exception
Trains the BioKNearestNeighbour using the given feature sets. The training done by this method consists of storing the training points in the KNN classifier and finding good feature selection and feature weighting settings (if these two options were set in the constructor).

NOTE: If the feature selection option is selected, feature weighting is performed using the feature selection calculated during the feature selection portion of training.

The first index of the feature_sets parameter corresponds to different feature sets. The second index corresponds to different features in the given feature set. All feature sets must use the same features in the same order as given in the feature_labels parameter.

The feature_labels parameter specifies the names of each of the features in the feature_sets parameter. The features in the feature_sets parameter will automatically be matched to the features in the feature_names field based on the content of the feature_labels parameter, unless a value of null is passed for feature_labels. In this case, the feature values in the feature_sets parameter will simply be fed into the classifier in the order in which they occur.
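
The matching of feature values to the feature_names field can be sketched as a reordering by name. This is a minimal illustration; the class FeatureOrderSketch is hypothetical, and the use of IllegalArgumentException is an assumption (the documented method declares only java.lang.Exception).

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: reorder one feature set so that it follows the order of featureNames. */
final class FeatureOrderSketch {
    static double[] reorder(double[] values, String[] featureLabels, String[] featureNames) {
        if (values.length != featureNames.length) {
            throw new IllegalArgumentException("Wrong number of feature values.");
        }
        if (featureLabels == null) {
            return values.clone(); // null labels: values are used in the order given
        }
        if (featureLabels.length != featureNames.length) {
            throw new IllegalArgumentException("Wrong number of feature labels.");
        }
        // Remember where each label sits in the incoming feature set.
        Map<String, Integer> positionOfLabel = new HashMap<>();
        for (int i = 0; i < featureLabels.length; i++) {
            if (positionOfLabel.put(featureLabels[i], i) != null) {
                throw new IllegalArgumentException("Duplicate label: " + featureLabels[i]);
            }
        }
        // Pull the values out in the order expected by the classifier.
        double[] ordered = new double[featureNames.length];
        for (int i = 0; i < featureNames.length; i++) {
            Integer position = positionOfLabel.get(featureNames[i]);
            if (position == null) {
                throw new IllegalArgumentException("Missing feature: " + featureNames[i]);
            }
            ordered[i] = values[position];
        }
        return ordered;
    }
}
```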

The model_categories parameter gives the categories of each of the given feature sets. The first index corresponds to the feature set. The second index corresponds to different model categories for the given feature set. Only categories to which the feature set belongs should be included.

The iterations parameter specifies the number of training iterations performed. If a negative value is passed here, then the number of iterations to perform is calculated automatically based on the acceptable_threshold parameter, which specifies the absolute rate of change of the training error below which training will stop, and the consecutive_iterations parameter, which specifies the number of consecutive iterations for which the rate of change must be below this threshold in order for training to stop. The number of iterations performed will never exceed the absolute value of the iterations parameter, regardless of the other parameters.

For example, if a value of 1000 is given for iterations, then 1000 iterations will be performed regardless of the other parameters. If a value of -1000 is given, then training will automatically stop if the absolute value of the rate of change of the training error from one sample to the next falls below the acceptable_threshold parameter for consecutive_iterations iterations, but no more than 1000 iterations will be performed in any case.
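
The stopping rule described above can be sketched as follows. This is an illustration only: the class StoppingRuleSketch is hypothetical, the errorPerIteration array stands in for the error that each generation of the genetic algorithm would produce, and the rate of change is assumed to be measured between successive iterations.

```java
/** Hypothetical sketch of the iteration-count rule used when training. */
final class StoppingRuleSketch {
    /** Returns how many iterations would be run for a given sequence of per-iteration errors. */
    static int iterationsRun(double[] errorPerIteration, int iterations,
                             double acceptableThreshold, int consecutiveIterations) {
        // The absolute value of iterations is always an upper bound.
        int maxIterations = Math.min(Math.abs(iterations), errorPerIteration.length);
        boolean autoStop = iterations < 0; // a negative value enables automatic stopping
        int belowThresholdRun = 0;
        for (int i = 1; i < maxIterations; i++) {
            if (!autoStop) {
                break; // fixed iteration count: the other parameters are ignored
            }
            double change = Math.abs(errorPerIteration[i] - errorPerIteration[i - 1]);
            belowThresholdRun = (change < acceptableThreshold) ? belowThresholdRun + 1 : 0;
            if (belowThresholdRun >= consecutiveIterations) {
                return i + 1; // stop after the iteration that completed the run
            }
        }
        return maxIterations;
    }
}
```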

The returned array holds the average error after each training iteration. The index of the returned array corresponds to the iteration of training that the error is associated with. This is the data for feature weighting if this option was selected, the data for feature selection if this option was selected but feature weighting was not, and null otherwise (since basic KNN does not need training iterations).

This method also calculates scaling factors for all of the input feature values. This is done to ensure that all features fall into roughly the same range, so that they have roughly the same "weighting" before the genetic algorithm goes to work. The scaling factors are calculated during the first iteration of training and stored. They are then applied to any unknown feature values that are subsequently input. Once calculated, they are not altered by further training.

WARNING: the scaling factors are only calculated the first time that this object is trained. Additional training will not cause recalculation of scaling factors. To recalculate them, a new BioKNearestNeighbour must be instantiated. Also, each retraining will reset the feature selection and feature weighting settings, if these were found during earlier training.

An exception is thrown if the feature_labels do not contain the same names as feature_names (although a different ordering is permitted) or if any of the feature sets in feature_sets have a different number of features than feature_names. An exception is also thrown if feature_sets and model_categories have different sizes in their first dimensions. An exception is also thrown if the model_categories parameter contains a name not present in the categories field or if it contains the same category more than once for a single feature set. An exception is also thrown if there are problems during evolution.
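
The shape and category checks listed in the preceding paragraph can be sketched as follows. This is an illustration under assumptions: the class TrainingInputCheckSketch is hypothetical, IllegalArgumentException is used in place of the unspecified java.lang.Exception, and the duplicate check is assumed to apply within each feature set.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch of input validation for multi-label training data. */
final class TrainingInputCheckSketch {
    static void check(double[][] featureSets, String[][] modelCategories,
                      String[] featureNames, String[] categories) {
        if (featureSets.length != modelCategories.length) {
            throw new IllegalArgumentException("Each feature set needs its own model categories.");
        }
        Set<String> known = new HashSet<>(Arrays.asList(categories));
        for (int s = 0; s < featureSets.length; s++) {
            if (featureSets[s].length != featureNames.length) {
                throw new IllegalArgumentException("Feature set " + s + " has the wrong size.");
            }
            Set<String> seen = new HashSet<>();
            for (String category : modelCategories[s]) {
                if (!known.contains(category)) {
                    throw new IllegalArgumentException("Unknown category: " + category);
                }
                if (!seen.add(category)) {
                    throw new IllegalArgumentException("Repeated category: " + category);
                }
            }
        }
    }
}
```

Note that a feature set may legitimately carry more than one category, for example {"jazz", "blues"}, as long as each name appears only once.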

Specified by:
train in class SupervisedClassifier
Throws:
java.lang.Exception

classify

public double[][] classify(double[][] feature_sets,
                           java.lang.String[] feature_labels)
                    throws java.lang.Exception
Returns the relative scores of each of the possible categories when the given sets of features are classified. There is one entry in the returned array for each of the entries in the categories field, and they appear in the same order. Higher scores in the returned array correspond to a greater certainty that the feature set should have the corresponding label. Scores fall in the range between 0.0 and 1.0. The first index of the returned array corresponds to the feature set and the second corresponds to the category. The order of the categories is the same as in the categories field and the order of the feature sets is the same as the order in which they were passed to the feature_sets parameter.
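
As an illustration of how per-category scores between 0.0 and 1.0 can arise from a k-nearest neighbour vote when training points carry more than one label, consider the sketch below. It assumes that each score is the fraction of the k nearest training points that carry the category; the class KnnScoreSketch is hypothetical, and the scoring actually performed by this classifier may differ.

```java
import java.util.Arrays;
import java.util.Comparator;

/** Hypothetical sketch: multi-label k-nearest neighbour scoring for a single feature set. */
final class KnnScoreSketch {
    /** Returns one score per entry of categories, in the same order. */
    static double[] scores(double[] query, double[][] trainingSets,
                           String[][] trainingCategories, String[] categories, int k) {
        // Sort the training point indices by distance to the query.
        Integer[] order = new Integer[trainingSets.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        Arrays.sort(order, Comparator.comparingDouble(i -> squaredDistance(query, trainingSets[i])));

        int neighbours = Math.min(k, order.length);
        double[] scores = new double[categories.length];
        for (int n = 0; n < neighbours; n++) {
            // Every label of a neighbour contributes one vote to its category.
            for (String label : trainingCategories[order[n]]) {
                for (int c = 0; c < categories.length; c++) {
                    if (categories[c].equals(label)) {
                        scores[c] += 1.0 / neighbours;
                    }
                }
            }
        }
        return scores;
    }

    private static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int f = 0; f < a.length; f++) {
            double diff = a[f] - b[f];
            sum += diff * diff;
        }
        return sum;
    }
}
```

Because a training point may carry several labels, the scores in this sketch need not sum to 1 across categories.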

The feature_sets parameter specifies the feature sets to be classified. The first index corresponds to different feature sets. The second index corresponds to different features in the given feature set. All feature sets must use the same features in the same order as given in the feature_labels parameter.

The feature_labels parameter specifies the names of each of the features in the feature_sets parameter. The features in the feature_sets parameter will automatically be matched to the features in the feature_names field based on the content of the feature_labels parameter, unless a value of null is passed for feature_labels. In this case, the feature values in the feature_sets parameter will simply be fed into the classifier in the order in which they occur.

An exception is thrown if the feature_labels do not contain the same names as feature_names (although a different ordering is permitted) or if any of the feature sets in feature_sets have a different number of features than feature_names. An exception is also thrown if the KNN classifier is untrained.

Specified by:
classify in class SupervisedClassifier
Throws:
java.lang.Exception

getScaledFeatureValues

public double[][] getScaledFeatureValues(double[][] feature_sets,
                                         java.lang.String[] feature_labels)
                                  throws java.lang.Exception
Returns an array that represents the feature_sets parameter with values scaled to fall between 0 and 1 based on previous training of the BioKNearestNeighbour. The first index of both the feature_sets parameter and the returned array identifies the recording and the second identifies the feature. The order of features in the returned array is the same as the order in the feature_labels parameter.

The feature_labels parameter specifies the names of each of the features in the feature_sets parameter. The features in the feature_sets parameter will automatically be matched to the features in the feature_names field based on the content of the feature_labels parameter.

The actual values of the feature_sets parameter are not themselves changed.

An exception is thrown if the feature_labels do not contain the same names as feature_names (although a different ordering is permitted) or if any of the feature sets in feature_sets have a different number of features than feature_names. An exception is also thrown if the scaling factors have not yet been calculated.

Throws:
java.lang.Exception

getMinFeatureValueCutoffs

public double[] getMinFeatureValueCutoffs()
                                   throws java.lang.Exception
Returns the minimum allowable value for each of the features. Feature values below their respective minimum are rounded up to it. Features occur in the same order as in the feature_names field.

Throws:
java.lang.Exception

getMaxFeatureValueCutoffs

public double[] getMaxFeatureValueCutoffs()
                                   throws java.lang.Exception
Returns the maximum allowable value for each of the features. Feature values above their respective maximum are rounded down to it.

Throws:
java.lang.Exception