ace
Class CrossValidator

java.lang.Object
  extended by ace.CrossValidator

public class CrossValidator
extends java.lang.Object

Cross validates a set of Weka Instances.

Instances are partitioned into folds. Different instances are used for training and testing in each fold. A Weka Classifier is trained and tested for each fold. Results are used to evaluate the performance of various classification techniques, including feature selection and classifier ensembles. Methods of this class are used both in the context of a single cross validation and in the context of experimentation where multiple cross validations are performed.

Instances are partitioned in the constructor of this class. An array of integers generated by a call to generatePartitionArray stores the directions for the partitioning. This array is either generated in the constructor (in the context of a single cross validation) or in the Experimenter class (in the context of experimentation).
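
The following is a minimal sketch of how such a partition array could be turned into per-fold training and testing sets. It assumes the convention described under generatePartitionArray (each index is an instance, each value is the fold in which that instance is tested). The class FoldSplitSketch and its methods are hypothetical names used for illustration only; they are not part of ACE, and plain index lists stand in for Weka Instances.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; not the ACE implementation.
class FoldSplitSketch {

    // Indices of the instances that are testing instances for the given fold.
    static List<Integer> testIndices(int[] partition, int fold) {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < partition.length; i++) {
            if (partition[i] == fold) {
                result.add(i);
            }
        }
        return result;
    }

    // Indices of the instances that are training instances for the given fold,
    // i.e. every instance that is not tested in that fold.
    static List<Integer> trainIndices(int[] partition, int fold) {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < partition.length; i++) {
            if (partition[i] != fold) {
                result.add(i);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Six instances assigned to three folds.
        int[] partition = {0, 2, 1, 0, 1, 2};
        for (int fold = 0; fold < 3; fold++) {
            System.out.println("fold " + fold
                    + " test=" + testIndices(partition, fold)
                    + " train=" + trainIndices(partition, fold));
        }
    }
}
```

Because the same array fully determines every fold, passing one array to several cross validations yields identical partitioning for each of them.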


Constructor Summary
CrossValidator(weka.core.Instances instances, int[] partition, int num_folds)
          This constructor will be called in the context of Experimentation when multiple Classifiers and types of dimensionality reduction are being used.
CrossValidator(weka.core.Instances instances, int num_folds, java.lang.String[] identifiers)
          This constructor will be called when a single cross validation is being performed with only one type of Classifier and one type of dimensionality reduction.
 
Method Summary
 java.lang.String crossValidate(TrainedModel trained, CrossValidationResults[] cvres, weka.core.Instances instances, java.io.OutputStream out, java.lang.StringBuffer cv_results, java.lang.String file_name, java.lang.String feature_selector, boolean save_intermediate_arffs, boolean verbose, int i)
          Cross validates a set of Weka Instances.
static int[] generatePartitionArray(int num_folds, int num_instances)
          Generates an array of evenly distributed random numbers between 0 and num_folds - 1 to be used during the partitioning of Instances into cross validation folds.
static java.lang.StringBuffer getClassifications(weka.core.Instances actual, weka.core.Instances predicted, weka.core.Instances training, java.lang.String[][] identifiers)
          Prints the training instances and testing instances with their corresponding model and predicted classification.
static java.lang.String[] getClassNames(weka.core.Instances instances)
          Gets the names of the possible classes into which an instance of the given data set could be classified.
 double[][] getOverallConfusionMatrix(double[][][] confusion_matrices)
          Gets the confusion matrix for the cross validation as a whole from the confusion matrices of each fold.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

CrossValidator

public CrossValidator(weka.core.Instances instances,
                      int num_folds,
                      java.lang.String[] identifiers)
This constructor will be called when a single cross validation is being performed with only one type of Classifier and one type of dimensionality reduction. The partition array is created.

Parameters:
instances - The Weka Instances to be used for cross validation.
num_folds - The number of folds into which the Instances should be partitioned.
identifiers - Array of unique identifiers for the given Instances. These identifiers will be partitioned alongside the Instances.

CrossValidator

public CrossValidator(weka.core.Instances instances,
                      int[] partition,
                      int num_folds)
This constructor will be called in the context of Experimentation when multiple Classifiers and types of dimensionality reduction are being used. The partition array is passed as a parameter in this case because the partitioning of the Instances must be identical for each cross validation.

Parameters:
instances - The instances to be used for cross validation.
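partition - An array of numbers that specify the way in which the given Instances should be partitioned. This parameter is necessary in the context of experimentation, where the partitioning of the Instances must be identical for each cross validation.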
num_folds - The number of folds into which the Instances should be partitioned.
Method Detail

crossValidate

public java.lang.String crossValidate(TrainedModel trained,
                                      CrossValidationResults[] cvres,
                                      weka.core.Instances instances,
                                      java.io.OutputStream out,
                                      java.lang.StringBuffer cv_results,
                                      java.lang.String file_name,
                                      java.lang.String feature_selector,
                                      boolean save_intermediate_arffs,
                                      boolean verbose,
                                      int i)
                               throws java.lang.Exception
Cross validates a set of Weka Instances.

Parameters:
trained - The Serializable object that stores the Weka Classifier and dimensionality reduction objects.
cvres - Holds the results of the cross validation. In the context of a single cross validation, an array of size one is passed. In the context of experimentation, the array will have a cell for each Classifier that is being tested.
instances - The Weka Instances to use in cross validation.
out - A progress report is printed to this OutputStream.
cv_results - Results of the cross validation are appended to this StringBuffer. In the context of a single cross validation, this will be instantiated in Coordinator and will only be accessed by this method. In the context of experimentation, Experimenter will instantiate and also write to this StringBuffer.
file_name - The file to save the results to. The content of this file will be the same as the string that is returned. This string will always be null in the context of experimentation because Experimenter writes the results to a file itself.
feature_selector - The name of the feature selector being used for this cross validation. If null, "None" will be printed in the results string as the type of dimensionality reduction performed.
save_intermediate_arffs - Whether or not to save training data to an arff file after parsing, after thinning, and again after feature selection, if any.
verbose - Whether or not to print and save a detailed report of the cross validation, including the partitioning and classification of individual instances and a detailed report of the dimensionality reduction that was performed. Incorrect classifications are marked with an asterisk.
i - The index of the array of CrossValidationResults objects to access. This will always be 0 in the context of a single cross validation.
Returns:
A string containing a summary of the results of this cross validation.
Throws:
java.lang.Exception - If a problem occurs.

generatePartitionArray

public static int[] generatePartitionArray(int num_folds,
                                           int num_instances)
Generates an array of evenly distributed random numbers between 0 and num_folds - 1 to be used during the partitioning of Instances into cross validation folds. An instance may be a testing instance for only one fold, and will be a training instance for every other fold. By recording the fold in which each instance is a testing instance, the instructions for partitioning a set of instances can be stored in a single array of integers. Each index of the returned array corresponds to an instance, and the value at that index indicates the fold for which that instance will be a testing instance.

Parameters:
num_folds - The number of folds into which the instances should be divided.
num_instances - The number of instances to be partitioned.
Returns:
Array of integers indicating which folds should contain which instances as testing instances.
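
One way to produce such an array is sketched below. This is an illustration under stated assumptions, not the ACE source: it interprets "evenly distributed" as fold sizes that differ by at most one, and it takes a java.util.Random argument (not present in the documented signature) so that results can be reproduced. The class name PartitionArraySketch is hypothetical.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical illustration; not the ACE implementation.
class PartitionArraySketch {

    // Returns an array in which index i holds the fold (0 to numFolds - 1)
    // where instance i is a testing instance. Fold sizes differ by at most one.
    static int[] generatePartitionArray(int numFolds, int numInstances, Random rng) {
        if (numFolds < 1 || numInstances < 0) {
            throw new IllegalArgumentException(
                    "numFolds must be >= 1 and numInstances must be >= 0");
        }
        int[] partition = new int[numInstances];
        // Assign folds round-robin so that every fold gets a balanced share.
        for (int i = 0; i < numInstances; i++) {
            partition[i] = i % numFolds;
        }
        // Fisher-Yates shuffle so that fold membership is random.
        for (int i = numInstances - 1; i > 0; i--) {
            int j = rng.nextInt(i + 1);
            int tmp = partition[i];
            partition[i] = partition[j];
            partition[j] = tmp;
        }
        return partition;
    }

    public static void main(String[] args) {
        // The seed is an assumption made here for repeatability.
        int[] partition = generatePartitionArray(3, 10, new Random(42L));
        System.out.println(Arrays.toString(partition));
    }
}
```

Shuffling a balanced assignment, rather than drawing each fold number independently, guarantees that no fold ends up empty when num_instances is at least num_folds.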

getOverallConfusionMatrix

public double[][] getOverallConfusionMatrix(double[][][] confusion_matrices)
Gets the confusion matrix for the cross validation as a whole from the confusion matrices of each fold. Corresponding values are simply summed. The sum of all entries in the resulting matrix should equal the total number of instances.

Parameters:
confusion_matrices - 3D array containing the confusion matrices for each fold. First index will be the fold.
Returns:
2D array containing the confusion matrix for the cross validation overall.
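
The element-wise summation described above can be sketched as follows. This is a minimal illustration, assuming every fold's matrix has the same dimensions; the class ConfusionMatrixSketch and the method sumConfusionMatrices are hypothetical names, not part of ACE.

```java
// Hypothetical illustration; not the ACE implementation.
class ConfusionMatrixSketch {

    // Sums the per-fold confusion matrices cell by cell.
    // The first index of perFold selects the fold.
    static double[][] sumConfusionMatrices(double[][][] perFold) {
        if (perFold.length == 0) {
            return new double[0][0];
        }
        int rows = perFold[0].length;
        int cols = rows == 0 ? 0 : perFold[0][0].length;
        double[][] overall = new double[rows][cols];
        for (double[][] fold : perFold) {
            for (int r = 0; r < rows; r++) {
                for (int c = 0; c < cols; c++) {
                    overall[r][c] += fold[r][c];
                }
            }
        }
        return overall;
    }

    public static void main(String[] args) {
        // Two folds, two classes, ten instances in total.
        double[][][] perFold = {
            {{2, 1}, {0, 3}},
            {{1, 0}, {1, 2}}
        };
        double[][] overall = sumConfusionMatrices(perFold);
        System.out.println(java.util.Arrays.deepToString(overall));
    }
}
```

Since each instance is tested in exactly one fold, the entries of the summed matrix add up to the number of instances, which gives a simple sanity check on the result.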

getClassNames

public static java.lang.String[] getClassNames(weka.core.Instances instances)
Gets the names of the possible classes into which an instance of the given data set could be classified.

Parameters:
instances - The Weka Instances in question.
Returns:
An array of class names for this set of Instances.

getClassifications

public static java.lang.StringBuffer getClassifications(weka.core.Instances actual,
                                                        weka.core.Instances predicted,
                                                        weka.core.Instances training,
                                                        java.lang.String[][] identifiers)
Prints the training instances and testing instances with their corresponding model and predicted classification. An asterisk (*) is printed before the instance if it was incorrectly classified, and a caret (^) precedes the instance if it was partially misclassified. This method is called when the verbose option is specified at the command line. Model classifications must be present (they always will be, because this method is only called in the context of a cross validation).

Parameters:
actual - The testing set of Weka Instances.
predicted - The classified Weka Instances that were returned by classification.
training - The training set of Weka Instances.
identifiers - 2D array containing the identifiers for the training and testing data. First index contains identifiers for training data. Second index contains identifiers for testing data.
Returns:
A detailed summary of the classification results.