ace.datatypes
Class DataBoard

java.lang.Object
  extended by ace.datatypes.DataBoard
All Implemented Interfaces:
java.io.Serializable

public class DataBoard
extends java.lang.Object
implements java.io.Serializable

Stores the data needed for training, testing and using classifiers. Stores a taxonomy, feature definitions, feature vectors of instances and model classifications of instances.

The contents of objects of this class can be loaded from ACE XML files or a Weka ARFF file using one of the constructors. Methods are also implemented for saving and loading objects of this class directly as serializable objects. The contents of an object of this class may also be separated and saved as individual XML files.

A method is also available for generating a Weka ARFF file from an object of this class. This method also generates an array of strings identifying the source of each line in the resulting ARFF file.

See Also:
Serialized Form

Field Summary
 FeatureDefinition[] feature_definitions
          Holds meta-data about the features that characterize instances.
 DataSet[] feature_vectors
          Feature vectors for a set of instances.
 SegmentedClassification[] model_classifications
          The model classifications that are used in supervised training.
 Taxonomy taxonomy
          The taxonomy that instances are classified into.
 
Constructor Summary
DataBoard()
          Generates an empty DataBoard.
DataBoard(java.lang.String arff_file)
          Generates the ACE datatypes from a Weka ARFF file.
DataBoard(java.lang.String taxonomy_file, java.lang.String feature_key_file, java.lang.String[] feature_vector_files, java.lang.String classifications_file)
          Generates a DataBoard based on the contents of the given XML files.
DataBoard(Taxonomy taxonomy, FeatureDefinition[] feature_definitions, DataSet[] feature_vectors, SegmentedClassification[] model_classifications)
          Generates a DataBoard with the fields specified in the parameters.
 
Method Summary
 SegmentedClassification[] getClassifiedResults(weka.core.Instances instances, boolean save_intermediate_arffs, TrainedModel trained, boolean use_top_level_features, boolean use_sub_section_features)
          Classify the given set of Instances using the AttributeSelection and Classifier held in the given TrainedModel.
 FeatureDefinition[] getFeatureDefinitions()
          Returns meta-data about the features that characterize instances.
 int[] getFeatureDimensionalities()
          Returns the number of dimensions of each of the features stored in the feature_definitions field.
 java.lang.String[] getFeatureNames()
          Returns the names of the features stored in the feature_definitions field.
 DataSet[] getFeatureVectors()
          Returns feature vectors for a set of instances.
 weka.core.Instances getInstanceAttributes(java.lang.String data_set_name, int initial_capacity)
          Uses the feature definitions and taxonomy stored in this DataBoard to return an empty set of Weka Instances.
 java.lang.String[] getInstanceIdentifiers()
          Gets an array of unique identifiers for the DataSet objects of this DataBoard.
 java.lang.String[] getInstanceMetaDataFields()
          Returns the names of all meta-data fields stored in the contents of any of the instances stored in the model_classifications field.
 SegmentedClassification getMatchingModelClassification(DataSet data_set)
          Searches the model_classifications stored in this DataBoard for one with an identifier that matches the identifier of the given DataSet.
 SegmentedClassification[] getModelClassifications()
          Returns the model classifications that are used in supervised training.
 Taxonomy getTaxonomy()
          Returns the taxonomy that instances are to be classified into.
 boolean hasSections()
           
static DataBoard loadDataBoard(java.io.File databoard_file)
          Load the specified DataBoard serialized object file and return its contents.
static void saveDataBoard(DataBoard to_save, java.io.File databoard_file)
          Save the contents of the given DataBoard to a File.
static void saveInstancesAsARFF(weka.core.Instances instances, java.lang.String file_path)
          Save the given Weka Instances as an arff file with the given path.
 java.lang.String[] saveToARFF(java.lang.String relation_name, java.io.File databoard_file, boolean use_top_level_features, boolean use_sub_section_features)
          Produces a Weka ARFF file based on the contents of this object.
 void saveXMLFiles(java.io.File taxonomy_file, java.io.File feature_key_file, java.io.File feature_vector_file, java.io.File classifications_file)
          Saves the stored taxonomy, feature definitions, feature vectors and/or model classifications stored in this DataBoard to individual XML files of the respectively appropriate type.
 void storeInstances(weka.core.Instances set_of_instances, boolean use_top_level_features, boolean use_sub_section_features)
          Extracts the feature values and model classifications stored in this DataBoard object and stores them in the given set of Weka Instances.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

taxonomy

public Taxonomy taxonomy
The taxonomy that instances are classified into. May be hierarchical.

May be null if clustering algorithms are to be used or if the taxonomy is to be derived from the model_classifications field.


feature_definitions

public FeatureDefinition[] feature_definitions
Holds meta-data about the features that characterize instances.

May be null if the feature_vectors have sufficient self-contained information, although this is not recommended.


feature_vectors

public DataSet[] feature_vectors
Feature vectors for a set of instances. Can include features for sub-sections of instances as well as for instances as a whole.

In general, these should be taken in conjunction with feature_definitions in order to minimize storage space and processing overhead.


model_classifications

public SegmentedClassification[] model_classifications
The model classifications that are used in supervised training. Can include classifications for sub-sections of instances as well as for instances as a whole.

Class names should correspond with those in the taxonomy field. Instances should correspond to those in the feature_vectors field.

May be null if clustering algorithms are to be used or if this DataBoard is being used to classify novel patterns with already trained classifiers.

Constructor Detail

DataBoard

public DataBoard()
Generates an empty DataBoard.


DataBoard

public DataBoard(Taxonomy taxonomy,
                 FeatureDefinition[] feature_definitions,
                 DataSet[] feature_vectors,
                 SegmentedClassification[] model_classifications)
          throws java.lang.Exception
Generates a DataBoard with the fields specified in the parameters. Note that if feature definitions and feature vectors are both provided then the feature vectors will be compacted and ordered automatically based on the feature definitions. Some validation is performed on the loaded values.

Parameters:
taxonomy - The taxonomy to classify instances into.
feature_definitions - Descriptions of the features to characterize instances with.
feature_vectors - The feature vectors characterizing instances.
model_classifications - Model classifications for use in supervised training.
Throws:
java.lang.Exception - An informative exception is thrown if any of the data in the provided fields are incompatible with one another.

DataBoard

public DataBoard(java.lang.String taxonomy_file,
                 java.lang.String feature_key_file,
                 java.lang.String[] feature_vector_files,
                 java.lang.String classifications_file)
          throws java.lang.Exception
Generates a DataBoard based on the contents of the given XML files. Note that if feature definitions and feature vectors are both provided then the feature vectors will be compacted and ordered automatically based on the feature definitions. Some validation is performed on the loaded values.

Parameters:
classifications_file - The path of a classifications_file XML file holding the model classifications that are used in supervised training using the given feature_vector_files. May be null if clustering algorithms are to be used or if this DataBoard is being used to classify novel patterns with already trained classifiers. An entry of "" is considered equivalent to null.
feature_key_file - The path of a feature_key_file XML file holding feature descriptions. May be null if the provided feature vectors have enough self-contained information, but this is not recommended. An entry of "" is considered equivalent to null.
feature_vector_files - An array of file paths referring to feature_vector_file XML files holding feature vectors for a set of instances. If a feature_key_file was provided, the feature vectors are ordered and compacted based on it. An entry of "" is considered equivalent to null.
taxonomy_file - The path of a taxonomy_file XML file holding a taxonomy. May be null if clustering is to be used to derive a new taxonomy or if a provided set of model classifications will be used to construct a taxonomy. An entry of "" is considered equivalent to null.
Throws:
java.lang.Exception - An informative exception is thrown if any of the file paths provided are invalid or if the data contained in the files is incompatible with one another.
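
The rule that an entry of "" is considered equivalent to null can be illustrated with a small helper. This is only a sketch: PathArgumentSketch and normalize are hypothetical names that are not part of ace.datatypes, and how the constructor actually handles its arguments internally is not described here.

```java
// Hypothetical helper, not part of ace.datatypes. It illustrates the
// documented rule that a path of "" is treated the same as null.
class PathArgumentSketch {

    // Returns null for a null or empty path, otherwise the path unchanged.
    static String normalize(String path) {
        return (path == null || path.isEmpty()) ? null : path;
    }

    public static void main(String[] args) {
        System.out.println(normalize(""));             // prints "null"
        System.out.println(normalize("taxonomy.xml")); // prints "taxonomy.xml"
    }
}
```

Under this reading, passing "" for taxonomy_file has the same effect as passing null: no taxonomy file is loaded.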

DataBoard

public DataBoard(java.lang.String arff_file)
          throws java.lang.Exception
Generates the ACE datatypes from a Weka ARFF file. Used for classifying data from an ARFF file. Because of the restrictions of Weka ARFF files, the created taxonomy will be flat (will have no hierarchical structure), instances will be numbered (since they have no unique identifier in ARFF format), and instances will only have one classification.

Parameters:
arff_file - The Weka ARFF file containing the Instances to be stored in this DataBoard.
Throws:
java.lang.Exception
Method Detail

getTaxonomy

public Taxonomy getTaxonomy()
Returns the taxonomy that instances are to be classified into. May be null if clustering algorithms are to be used or if the taxonomy is to be derived from the model_classifications file.


getFeatureDefinitions

public FeatureDefinition[] getFeatureDefinitions()
Returns meta-data about the features that characterize instances. This may be null if the feature_vectors have sufficient self-contained information, although this is not recommended.


getFeatureVectors

public DataSet[] getFeatureVectors()
Returns feature vectors for a set of instances. This can include features for sub-sections of instances as well as for instances as a whole.

In general, these should be taken in conjunction with FeatureDefinitions in order to minimize storage space and processing overhead.


getModelClassifications

public SegmentedClassification[] getModelClassifications()
Returns the model classifications that are used in supervised training. This can include classifications for sub-sections of instances as well as for instances as a whole.

Class names should correspond with those in the Taxonomy. Instances should correspond to those in the DataSet feature vectors.

Will return null if clustering algorithms are to be used or if this DataBoard is being used to classify novel patterns with already trained classifiers.


getFeatureNames

public java.lang.String[] getFeatureNames()
Returns the names of the features stored in the feature_definitions field. Returns null if nothing is stored in this field.

Returns:
The names of the features in the feature_definitions field, or null if there are none stored there.

getFeatureDimensionalities

public int[] getFeatureDimensionalities()
Returns the number of dimensions of each of the features stored in the feature_definitions field. Returns null if nothing is stored in this field.

Returns:
The number of dimensions of each of the features stored in the feature_definitions field, or null if there are none stored there.

getInstanceMetaDataFields

public java.lang.String[] getInstanceMetaDataFields()
Returns the names of all meta-data fields stored in the contents of any of the instances stored in the model_classifications field. Returns null if model_classifications is empty or if there are no meta-data fields stored.

Returns:
The names of the meta-data fields, or null if there are none.

getMatchingModelClassification

public SegmentedClassification getMatchingModelClassification(DataSet data_set)
Searches the model_classifications stored in this DataBoard for one with an identifier that matches the identifier of the given DataSet. Null is returned if no SegmentedClassifications are available or no matching one is present.

Parameters:
data_set - The DataSet to attempt to find a matching model classification for.
Returns:
The SegmentedClassification that has the same identifier as the given DataSet.
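
The lookup described above amounts to a search by identifier that yields null when nothing matches. A minimal sketch follows, assuming a plain linear search. MatchingSketch, Item and findMatch are stand-in names invented for this illustration; the real DataSet and SegmentedClassification classes are richer, and the actual search strategy is not documented here.

```java
// Stand-in types for illustration only. Item plays the role of both
// DataSet and SegmentedClassification, reduced to its identifier.
class MatchingSketch {

    static class Item {
        final String identifier;
        Item(String identifier) { this.identifier = identifier; }
    }

    // Returns the first classification whose identifier equals that of the
    // data set, or null if none are available or none match.
    static Item findMatch(Item[] classifications, Item dataSet) {
        if (classifications == null || dataSet == null) return null;
        for (Item c : classifications) {
            if (c != null && c.identifier != null
                    && c.identifier.equals(dataSet.identifier)) {
                return c;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Item[] model = { new Item("song_1.mp3"), new Item("song_2.mp3") };
        System.out.println(findMatch(model, new Item("song_2.mp3")) == model[1]); // prints "true"
        System.out.println(findMatch(model, new Item("missing.mp3")));            // prints "null"
    }
}
```

Callers should check the result for null before using it, since both "no model classifications" and "no match" produce the same return value.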

getInstanceAttributes

public weka.core.Instances getInstanceAttributes(java.lang.String data_set_name,
                                                 int initial_capacity)
                                          throws java.lang.Exception
Uses the feature definitions and taxonomy stored in this DataBoard to return an empty set of Weka Instances. If no taxonomy is available in this DataBoard then model classifications are used to find class names.

The returned set includes all feature names, including numbered feature names for multi-dimensional features, as well as class names. Class names are put in the last Attribute. Only leaf class names are used.

Note that Attribute information may not be changed after this method is called.

Parameters:
data_set_name - The name to assign to the relation.
initial_capacity - The initial capacity of the set.
Returns:
The empty set of Weka Instances with properly set Attributes.
Throws:
java.lang.Exception - An informative exception is thrown if insufficient information is available to construct the Attributes.
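
To make the idea of numbered feature names for multi-dimensional features concrete, the sketch below expands parallel arrays of names and dimensionalities (such as those returned by getFeatureNames and getFeatureDimensionalities) into one name per dimension. The class AttributeNameSketch and the "name_index" suffix format are assumptions made for this illustration; the numbering format ACE actually uses is not specified here.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only, not part of ace.datatypes.
class AttributeNameSketch {

    // Produces one attribute name per feature dimension. One-dimensional
    // features keep their name; multi-dimensional features get a numeric
    // suffix. The "_<index>" suffix is an assumed format for this sketch.
    static List<String> expand(String[] names, int[] dimensions) {
        List<String> attributes = new ArrayList<>();
        for (int i = 0; i < names.length; i++) {
            if (dimensions[i] == 1) {
                attributes.add(names[i]);
            } else {
                for (int d = 0; d < dimensions[i]; d++) {
                    attributes.add(names[i] + "_" + d);
                }
            }
        }
        return attributes;
    }

    public static void main(String[] args) {
        String[] names = {"Spectral Centroid", "MFCC"};
        int[] dimensions = {1, 3};
        // prints "[Spectral Centroid, MFCC_0, MFCC_1, MFCC_2]"
        System.out.println(expand(names, dimensions));
    }
}
```

In the returned Weka Instances, the class attribute would follow these feature attributes as the last Attribute.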

storeInstances

public void storeInstances(weka.core.Instances set_of_instances,
                           boolean use_top_level_features,
                           boolean use_sub_section_features)
                    throws java.lang.Exception
Extracts the feature values and model classifications stored in this DataBoard object and stores them in the given set of Weka Instances.

Both pre-classified and unclassified data may be dealt with. Both overall data sets and data sets involving sub-sections may be dealt with.

If the model_classifications field is null, no model classes are saved. If the taxonomy field is null, then the class names are extracted from the model_classifications field if it is not null.

IMPORTANT: Since ARFF files cannot accommodate multiple classes per instance, the feature vector for an instance with multiple classes is repeated, once for each class.

Parameters:
set_of_instances - The Weka Instances object to store individual instances in.
use_top_level_features - Whether or not to store overall classifications for individual instances.
use_sub_section_features - Whether or not to store the sub-sections of instances.
Throws:
java.lang.Exception - An exception is thrown if no feature definitions or no feature vectors are available. An exception is also thrown if both of the boolean parameters are false.
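
The repetition of a feature vector for an instance with multiple classes can be pictured as follows. This sketch works on plain arrays and strings rather than Weka Instances, covers only the pre-classified case, and uses the hypothetical names MultiClassRowSketch and rowsFor; it is not the code used by storeInstances.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only, not part of ace.datatypes.
class MultiClassRowSketch {

    // Builds one comma-separated data row per class name, repeating the
    // same feature values in every row and placing the class last.
    static List<String> rowsFor(double[] features, String[] classNames) {
        StringBuilder prefix = new StringBuilder();
        for (double value : features) {
            prefix.append(value).append(',');
        }
        List<String> rows = new ArrayList<>();
        for (String className : classNames) {
            rows.add(prefix + className);
        }
        return rows;
    }

    public static void main(String[] args) {
        double[] features = {0.5, 12.0};
        String[] classes = {"Jazz", "Blues"};
        // prints "0.5,12.0,Jazz" then "0.5,12.0,Blues"
        for (String row : rowsFor(features, classes)) {
            System.out.println(row);
        }
    }
}
```

One consequence is that the number of stored Weka instances can exceed the number of DataSet objects, so row indices should not be assumed to map one-to-one onto the original instances.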

getClassifiedResults

public SegmentedClassification[] getClassifiedResults(weka.core.Instances instances,
                                                      boolean save_intermediate_arffs,
                                                      TrainedModel trained,
                                                      boolean use_top_level_features,
                                                      boolean use_sub_section_features)
                                               throws java.lang.Exception
Classify the given set of Instances using the AttributeSelection and Classifier held in the given TrainedModel. Return the results in a new array of SegmentedClassification objects.

No reference is made to any model classifications.

IMPORTANT: The order of the instances must not have been changed from the time that they were constructed by a call to the storeInstances method. If they have, or if the attribute_selector reorders instances, then this method will not work properly.

IMPORTANT: The use_top_level_features and use_sub_section_features parameters must be the same as when the instances were constructed with the storeInstances method.

Parameters:
instances - The Weka Instances object that the individual instances to be classified are stored in. In general, should have been generated with the storeInstances method.
save_intermediate_arffs - Whether or not to save testing data to an ARFF file after feature selection, if any. Useful for testing.
trained - Serializable object containing references to the Weka objects needed for classification (Classifier, AttributeSelection and the class Attribute).
use_top_level_features - Whether or not to store overall classifications for individual instances.
use_sub_section_features - Whether or not to store the sub-sections of instances.
Returns:
The resulting classifications stored in an array of SegmentedClassification objects.
Throws:
java.lang.Exception - An exception occurs if Weka encounters a problem.

saveToARFF

public java.lang.String[] saveToARFF(java.lang.String relation_name,
                                     java.io.File databoard_file,
                                     boolean use_top_level_features,
                                     boolean use_sub_section_features)
                              throws java.lang.Exception
Produces a Weka ARFF file based on the contents of this object. One option is to save only the overall classifications for each instance. Alternatively, the user can opt to save only the classifications for each sub-section of each instance, without the overall classifications. Finally, both can be saved together in the same file if the user wishes.

If the model_classifications field is null, no model classes are saved. If the taxonomy field is null, then the class names are extracted from the model_classifications field if it is not null.

An array of strings is returned. There is one entry for each data line saved to the ARFF file, with the entry identifying the data set and (if appropriate) the section that each ARFF data line corresponds to.

IMPORTANT: Since ARFF files cannot accommodate multiple classes per instance, the feature vector for an instance with multiple classes is repeated, once for each class.

IMPORTANT: All class names and feature names have blank spaces replaced by underscores in the ARFF file.

Parameters:
relation_name - The name of the relation that is being saved to the ARFF file.
databoard_file - The ARFF file to be saved into.
use_top_level_features - Whether or not to save overall classifications for individual instances.
use_sub_section_features - Whether or not to save the sub-sections of instances.
Returns:
The data set and section corresponding to each feature vector line saved in the ARFF file.
Throws:
java.lang.Exception - An exception is thrown if no feature definitions or no feature vectors are provided. An exception is also thrown if both of the boolean parameters are false.
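
The replacement of blank spaces by underscores matters because unquoted names in an ARFF header cannot contain spaces. The sketch below shows how sanitized names might appear in ARFF attribute declarations. ArffNameSketch and its methods are hypothetical, and the attribute name "class" is an assumption of this example rather than the name saveToARFF necessarily writes.

```java
// Illustration only, not part of ace.datatypes.
class ArffNameSketch {

    // Replaces every blank space with an underscore.
    static String sanitize(String name) {
        return name.replace(' ', '_');
    }

    // Declares a numeric ARFF attribute for a feature name.
    static String numericAttribute(String featureName) {
        return "@attribute " + sanitize(featureName) + " numeric";
    }

    // Declares a nominal ARFF attribute listing the class names.
    // The attribute name "class" is assumed for this sketch.
    static String classAttribute(String[] classNames) {
        StringBuilder line = new StringBuilder("@attribute class {");
        for (int i = 0; i < classNames.length; i++) {
            if (i > 0) line.append(',');
            line.append(sanitize(classNames[i]));
        }
        return line.append('}').toString();
    }

    public static void main(String[] args) {
        System.out.println(numericAttribute("Spectral Centroid"));
        // prints "@attribute Spectral_Centroid numeric"
        System.out.println(classAttribute(new String[] {"Hip Hop", "Jazz"}));
        // prints "@attribute class {Hip_Hop,Jazz}"
    }
}
```

Code that compares names read back from the ARFF file with names in the taxonomy or feature definitions needs to account for this substitution.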

saveXMLFiles

public void saveXMLFiles(java.io.File taxonomy_file,
                         java.io.File feature_key_file,
                         java.io.File feature_vector_file,
                         java.io.File classifications_file)
                  throws java.lang.Exception
Saves the stored taxonomy, feature definitions, feature vectors and/or model classifications stored in this DataBoard to individual XML files of the respectively appropriate type. Each file is only saved if the corresponding parameter is not null.

Parameters:
taxonomy_file - The file to save the taxonomy to. Null if the taxonomy is not to be saved.
feature_key_file - The file to save the feature definitions to. Null if the definitions are not to be saved.
feature_vector_file - The file to save the feature vectors to. Null if the vectors are not to be saved.
classifications_file - The file to save the model classifications to. Null if the classifications are not to be saved.
Throws:
java.lang.Exception - An informative exception is thrown if a request is made to save a file type whose corresponding field is empty.

saveDataBoard

public static void saveDataBoard(DataBoard to_save,
                                 java.io.File databoard_file)
                          throws java.lang.Exception
Save the contents of the given DataBoard to a File.

Parameters:
databoard_file - The File to save to.
to_save - The DataBoard to save.
Throws:
java.lang.Exception - Throws an exception if an error occurs during saving.

loadDataBoard

public static DataBoard loadDataBoard(java.io.File databoard_file)
                               throws java.lang.Exception
Load the specified DataBoard serialized object file and return its contents.

Parameters:
databoard_file - The File to load.
Returns:
The loaded DataBoard.
Throws:
java.lang.Exception - Throws an exception if an error occurs during loading.
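
Since DataBoard implements java.io.Serializable, saving and loading are presumably built on standard Java object serialization. The round trip below is a sketch under that assumption, using a stand-in Board class in place of DataBoard so that it runs without ACE on the classpath. SerializationSketch, Board, save and load are hypothetical names; the real saveDataBoard and loadDataBoard may differ in detail.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Illustration only, not part of ace.datatypes.
class SerializationSketch {

    // Stand-in for DataBoard: any Serializable class can be handled this way.
    static class Board implements Serializable {
        private static final long serialVersionUID = 1L;
        final String[] featureNames;
        Board(String[] featureNames) { this.featureNames = featureNames; }
    }

    // Writes the object to the file using Java object serialization.
    static void save(Board toSave, File file) throws IOException {
        try (ObjectOutputStream out =
                new ObjectOutputStream(new FileOutputStream(file))) {
            out.writeObject(toSave);
        }
    }

    // Reads the object back from the file.
    static Board load(File file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in =
                new ObjectInputStream(new FileInputStream(file))) {
            return (Board) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        File file = File.createTempFile("databoard_sketch", ".ser");
        file.deleteOnExit();
        save(new Board(new String[] {"MFCC"}), file);
        System.out.println(load(file).featureNames[0]); // prints "MFCC"
    }
}
```

Serialized files of this kind are tied to the class definitions that wrote them, so the XML files remain the more portable option for exchanging data between versions.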

saveInstancesAsARFF

public static void saveInstancesAsARFF(weka.core.Instances instances,
                                       java.lang.String file_path)
                                throws java.lang.Exception
Save the given Weka Instances as an arff file with the given path.

Parameters:
instances - The weka instances to save.
file_path - The path of the arff file to save.
Throws:
java.lang.Exception - Throws an exception if the given instances cannot be saved.

getInstanceIdentifiers

public java.lang.String[] getInstanceIdentifiers()
Gets an array of unique identifiers for the DataSet objects of this DataBoard.

Returns:
Array of unique names for this set of instances.

hasSections

public boolean hasSections()
Returns:
True if either the DataSet or SegmentedClassification objects of this DataBoard have sub-sections.