ace.datatypes
Class DataSet

java.lang.Object
  extended by ace.datatypes.DataSet
All Implemented Interfaces:
java.io.Serializable

public class DataSet
extends java.lang.Object
implements java.io.Serializable

Objects of this class each hold feature values for an item to be classified. Methods are included for displaying these values as formatted strings, saving them to disk or loading them from disk. A method is also available for reconciling these objects with FeatureDefinition objects. Methods are also available for extracting feature values in String form.

See Also:
Serialized Form

Field Summary
 java.lang.String[] feature_names
          The names of the features in each corresponding (by first index) entry of feature_values.
 double[][] feature_values
          The feature values for this DataSet as a whole.
 java.lang.String identifier
          The name of the data set.
 DataSet parent
          If this object is a sub-set of another DataSet, this field points to that parent dataset.
 double start
          Identifies the start of a sub-set of a DataSet.
 double stop
          Identifies the end of a sub-set of a DataSet.
 DataSet[] sub_sets
          Sub-sets of this DataSet.
 
Constructor Summary
DataSet()
          Generate an empty DataSet.
DataSet(weka.core.Instance instance, int inst)
          Generates a DataSet from a Weka Instance (for example, one loaded from an ARFF file).
DataSet(java.lang.String identifier, DataSet[] sub_sets, java.lang.Double start, java.lang.Double stop, double[][] feature_values, java.lang.String[] feature_names, DataSet parent)
          Explicitly creates a DataSet.
 
Method Summary
 java.lang.String getDataSetDescription(int depth)
          Generate a formatted string detailing the contents of this DataSet.
static java.lang.String getDataSetDescriptions(DataSet[] dataset)
          Returns a formatted text description of the given DataSet objects.
 java.lang.String[][][] getFeatureValuesOfSubSections(FeatureDefinition[] definitions)
          Returns the feature values stored in the DataSets in the sub_sets field of this object.
 java.lang.String[][] getFeatureValuesOfTopLevel(FeatureDefinition[] definitions)
          Returns the feature values stored in the feature_values field of this object.
static DataSet[] getMergedFeatureTypes(DataSet[][] datasets_to_combine, FeatureDefinition[] combined_feature_definitions, java.lang.String[][] matching_identifier_keys)
          Merges the different extracted features contained in multiple DataSet objects that hold references to the same instances.
 void orderAndCompactFeatures(FeatureDefinition[] definitions, boolean is_top_level)
          Processes this DataSet based on the given definitions parameter.
static DataSet[] parseDataSetFile(java.lang.String data_set_file_path)
          Parses a feature_vector_file XML file and returns an array of DataSet objects holding its contents.
static DataSet[] parseDataSetFile(java.lang.String data_set_file_path, FeatureDefinition[] definitions)
          Parses a feature_vector_file XML file and returns an array of DataSet objects holding its contents.
static DataSet[] parseDataSetFiles(java.lang.String[] data_set_file_paths, FeatureDefinition[] definitions)
          Parses several feature_vector_file XML files and returns an array of DataSet objects holding the combined contents of all of the files.
static void saveDataSets(DataSet[] data_sets, FeatureDefinition[] definitions, java.io.File to_save_to, java.lang.String comments)
          Saves a feature_vector_file XML file with the contents specified in the given DataSet array and the comments specified in the comments parameter.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

identifier

public java.lang.String identifier
The name of the data set. This name should be unique among each group of data sets. Should be null for non-top-level DataSets.


sub_sets

public DataSet[] sub_sets
Sub-sets of this DataSet. Each such sub-set can serve as an instance that is individually classifiable. For example, sub-sets could consist of windows of audio extracted from the recording that makes the overall DataSet. The sub_sets field should be null if there are no sub-sets that can be individually classified.


start

public double start
Identifies the start of a sub-set of a DataSet. Set to NaN if this object is a top-level DataSet.


stop

public double stop
Identifies the end of a sub-set of a DataSet. Set to NaN if this object is a top-level DataSet.


feature_values

public double[][] feature_values
The feature values for this DataSet as a whole. If there are any sub-sets, they will store their own feature values, and these will not be referenced here. The first index identifies the feature and the second index identifies the dimension of the feature, so features of arbitrary dimensionality can be accommodated. A feature whose value or values are missing, unknown, or inappropriate is assigned a value of null. This field is itself assigned a value of null if no features have been extracted. It is assumed that the Java class using the DataSet knows the ordering and identity of the features of the DataSet and its sub-sets. The feature_values may be ordered based on FeatureDefinitions using the orderAndCompactFeatures method.
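
The layout described above can be illustrated with a small, self-contained sketch. The class and method names (FeatureValuesLayoutSketch, exampleValues, dimensionsOf) are hypothetical and are not part of ACE; only the array layout (first index for the feature, second index for the dimension, null for a missing feature) is taken from the description.

```java
import java.util.Arrays;

// Hypothetical helper, not part of ACE: illustrates the documented layout only.
final class FeatureValuesLayoutSketch {

    // Builds a jagged array: first index = feature, second index = dimension.
    static double[][] exampleValues() {
        return new double[][] {
            { 0.42 },             // feature 0: one-dimensional
            { 1.0, 2.5, 3.75 },   // feature 1: three-dimensional
            null                  // feature 2: value missing
        };
    }

    // Returns the number of dimensions of a feature, or -1 if it is missing.
    static int dimensionsOf(double[][] featureValues, int feature) {
        double[] values = featureValues[feature];
        return values == null ? -1 : values.length;
    }

    public static void main(String[] args) {
        double[][] values = exampleValues();
        for (int i = 0; i < values.length; i++) {
            System.out.println("feature " + i + ": " + Arrays.toString(values[i]));
        }
    }
}
```

Because each row may have a different length or be null, callers should check for null before reading a feature's dimensions.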


feature_names

public java.lang.String[] feature_names
The names of the features in each corresponding (by first index) entry of feature_values. These are often only stored here temporarily until they can be accessed and stored externally in a more efficient fashion. This field is therefore often null, even when the feature_values field is not.


parent

public DataSet parent
If this object is a sub-set of another DataSet, this field points to that parent dataset. Otherwise this field is null.
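
As a rough illustration of how the identifier, sub_sets, start, stop and parent fields fit together, the sketch below builds a top-level node with windowed sub-sets. Node is a hypothetical stand-in that mirrors only the structural fields documented above; it is not the ACE DataSet class, and the units of start and stop are not specified here, so the boundary values are arbitrary.

```java
// Hypothetical stand-in, not the ACE class: mirrors only the structural fields.
final class DataSetHierarchySketch {

    static final class Node {
        String identifier;          // null for non-top-level nodes
        Node[] sub_sets;            // null if there are no sub-sets
        double start = Double.NaN;  // NaN for a top-level node
        double stop = Double.NaN;   // NaN for a top-level node
        Node parent;                // null for a top-level node
    }

    // Creates a top-level node with one sub-set per pair of adjacent boundaries.
    static Node recordingWithWindows(String identifier, double[] boundaries) {
        Node top = new Node();
        top.identifier = identifier;
        if (boundaries == null || boundaries.length < 2) {
            return top; // no windows, so sub_sets stays null
        }
        top.sub_sets = new Node[boundaries.length - 1];
        for (int i = 0; i < top.sub_sets.length; i++) {
            Node window = new Node();
            window.start = boundaries[i];
            window.stop = boundaries[i + 1];
            window.parent = top;
            top.sub_sets[i] = window;
        }
        return top;
    }
}
```

For example, recordingWithWindows("recording_1", new double[] { 0.0, 1.0, 2.0 }) yields a top-level node whose start and stop are NaN and whose two sub-sets cover 0.0 to 1.0 and 1.0 to 2.0.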

Constructor Detail

DataSet

public DataSet()
Generate an empty DataSet.


DataSet

public DataSet(java.lang.String identifier,
               DataSet[] sub_sets,
               java.lang.Double start,
               java.lang.Double stop,
               double[][] feature_values,
               java.lang.String[] feature_names,
               DataSet parent)
Explicitly creates a DataSet.

Parameters:
identifier - The name of the data set.
sub_sets - Sub-sets of this DataSet.
start - Identifies the beginning of a sub-set of a DataSet.
stop - Identifies the end of a sub-set of a DataSet.
feature_values - The feature values for this DataSet as a whole.
feature_names - The names of the features in each corresponding (by first index) entry of feature_values.
parent - The parent dataset, or null if not a subsection.

DataSet

public DataSet(weka.core.Instance instance,
               int inst)
Generates a DataSet from a Weka Instance (for example, one loaded from an ARFF file).

Parameters:
instance - The Weka Instance from which to get the feature values and feature names of this DataSet.
inst - The index of this Instance in its parent Instances object. This number precedes the identifier to ensure that all identifiers are unique.
Method Detail

orderAndCompactFeatures

public void orderAndCompactFeatures(FeatureDefinition[] definitions,
                                    boolean is_top_level)
                             throws java.lang.Exception
Processes this DataSet based on the given definitions parameter. The feature values stored in the feature_values field are re-ordered based on the correspondence between the feature_names field and the definitions parameter. All features in feature_names that are not referred to in definitions are deleted. All features referred to in definitions but not present in feature_names are given a null entry in feature_values. The feature_names field is set to null at the end of processing in order to save memory.

This method also processes the sub_sets of this DataSet recursively.

The end result of running this method is that the features in feature_values that are referred to in both feature_names and definitions are given the same order as in definitions. Any features in definitions that are not present in feature_names are set to null in feature_values. Any features in feature_names that are not in definitions are deleted. At the end of running this method, feature_names is null and feature_values has the same number of entries as definitions.

The purpose of running this method is to put this DataSet in a configuration that can be stored and processed more efficiently and to verify the validity of the stored features.

Parameters:
definitions - The feature definitions to order the feature_values field by.
is_top_level - True if this DataSet is a top-level DataSet (i.e. not a sub-set of another DataSet). This parameter should always be true when this method is called externally.
Throws:
java.lang.Exception - An informative exception is thrown if the dimensions of a stored feature do not match the dimensions that it should have according to its definition. An exception is also thrown if a sub-set contains a feature whose corresponding FeatureDefinition has a value of false for its is_sequential field.
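
The re-ordering and compaction described for this method can be sketched as follows. This is a simplified stand-in written for illustration, not ACE's implementation: each FeatureDefinition is reduced to a name and an expected dimension count (hypothetical parameters definitionNames and definitionDimensions), sub-sets and the is_sequential check are omitted, and an IllegalArgumentException is used in place of the general Exception.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in sketch, not ACE's implementation.
final class OrderAndCompactSketch {

    static double[][] orderAndCompact(String[] featureNames, double[][] featureValues,
                                      String[] definitionNames, int[] definitionDimensions) {
        // Index the stored features by name.
        Map<String, double[]> byName = new HashMap<>();
        for (int i = 0; i < featureNames.length; i++) {
            byName.put(featureNames[i], featureValues[i]);
        }
        // One output slot per definition, in definition order.
        double[][] ordered = new double[definitionNames.length][];
        for (int d = 0; d < definitionNames.length; d++) {
            double[] values = byName.get(definitionNames[d]); // null if not stored
            if (values != null && values.length != definitionDimensions[d]) {
                throw new IllegalArgumentException("Feature " + definitionNames[d]
                        + " has " + values.length + " dimensions but its definition expects "
                        + definitionDimensions[d]);
            }
            ordered[d] = values;
        }
        // Stored features that no definition names are simply not copied over.
        return ordered;
    }
}
```

With stored names {"B", "X", "A"} and definitions named {"A", "B", "C"}, the result has three entries: the values of A, the values of B, and null for C, while X is dropped.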

getFeatureValuesOfTopLevel

public java.lang.String[][] getFeatureValuesOfTopLevel(FeatureDefinition[] definitions)
Returns the feature values stored in the feature_values field of this object. The first index of the returned array denotes the feature and the second index indicates the dimension of the feature (in order to accommodate multi-dimensional features).

The returned array is null if no features have been extracted. If a particular feature value is not available, then a question mark is returned in the appropriate entry.

Parameters:
definitions - Feature definitions that are used to get the dimensions of unknown features.
Returns:
The array of feature values.
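
A minimal sketch of this kind of conversion is shown below. It is not ACE's code: the definitions are reduced to a hypothetical array of expected dimension counts, the numbers are formatted with String.valueOf (the actual formatting is not stated here), and it assumes that a missing feature yields one question mark per dimension given by its definition.

```java
import java.util.Arrays;

// Stand-in sketch, not ACE's implementation.
final class FeatureValueStringsSketch {

    static String[][] toStrings(double[][] featureValues, int[] definitionDimensions) {
        if (featureValues == null) {
            return null; // no features have been extracted
        }
        String[][] result = new String[featureValues.length][];
        for (int f = 0; f < featureValues.length; f++) {
            if (featureValues[f] == null) {
                // Missing feature: assumed to yield one "?" per expected dimension.
                result[f] = new String[definitionDimensions[f]];
                Arrays.fill(result[f], "?");
            } else {
                result[f] = new String[featureValues[f].length];
                for (int d = 0; d < featureValues[f].length; d++) {
                    result[f][d] = String.valueOf(featureValues[f][d]);
                }
            }
        }
        return result;
    }
}
```

The definitions are consulted only for missing features, which matches the stated purpose of the definitions parameter.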

getFeatureValuesOfSubSections

public java.lang.String[][][] getFeatureValuesOfSubSections(FeatureDefinition[] definitions)
Returns the feature values stored in the DataSets in the sub_sets field of this object. The first index of the returned array denotes the sub-section. The second index indicates the feature and the third index indicates the dimension of the feature (in order to accommodate multi-dimensional features).

The returned array is null if no sub-sections are available. The entry for a given sub-section is null if no features have been extracted for it. If a particular feature value is not available, then a question mark is returned in the appropriate entry.

Parameters:
definitions - Feature definitions that are used to get the dimensions of unknown features.
Returns:
The array of feature values.

getDataSetDescription

public java.lang.String getDataSetDescription(int depth)
Generate a formatted string detailing the contents of this DataSet.

Parameters:
depth - How deep this DataSet is in a hierarchy of DataSets (i.e. through the sub_sets field). This parameter should generally be 0 when called externally, as this method operates recursively.
Returns:
A formatted string describing this DataSet.

getDataSetDescriptions

public static java.lang.String getDataSetDescriptions(DataSet[] dataset)
Returns a formatted text description of the given DataSet objects.

Parameters:
dataset - The data sets to describe.
Returns:
The formatted description.

parseDataSetFile

public static DataSet[] parseDataSetFile(java.lang.String data_set_file_path)
                                  throws java.lang.Exception
Parses a feature_vector_file XML file and returns an array of DataSet objects holding its contents. An exception is thrown if the file is invalid in some way.

Parameters:
data_set_file_path - The path of the XML file to parse.
Returns:
An array of DataSet objects holding the contents of the given ACE XML feature vectors file.
Throws:
java.lang.Exception - An informative exception is thrown if an invalid file or file path is specified.

parseDataSetFile

public static DataSet[] parseDataSetFile(java.lang.String data_set_file_path,
                                         FeatureDefinition[] definitions)
                                  throws java.lang.Exception
Parses a feature_vector_file XML file and returns an array of DataSet objects holding its contents. An exception is thrown if the file is invalid in some way.

Also processes each resulting DataSet in order to reconcile it with the given definitions. See the orderAndCompactFeatures method for details.

Parameters:
data_set_file_path - The path of the XML file to parse.
definitions - FeatureDefinitions to use for formatting and validating the contents of the file to be parsed.
Returns:
An array of DataSet objects holding the contents of the given ACE XML feature vectors file.
Throws:
java.lang.Exception - An informative exception is thrown if an invalid file or file path is specified. An exception is also thrown if the given feature definitions are incompatible with the contents of the file.

parseDataSetFiles

public static DataSet[] parseDataSetFiles(java.lang.String[] data_set_file_paths,
                                          FeatureDefinition[] definitions)
                                   throws java.lang.Exception
Parses several feature_vector_file XML files and returns an array of DataSet objects holding the combined contents of all of the files. An exception is thrown if any of the files is invalid in some way.

Also processes each resulting DataSet in order to reconcile it with the given definitions. See the orderAndCompactFeatures method for details. This will not occur if the definitions parameter is null.

Parameters:
data_set_file_paths - The paths of the XML files to parse.
definitions - FeatureDefinitions to use for formatting and validating the contents of the files to be parsed.
Returns:
An array of DataSet objects holding the combined contents of all of the given ACE XML feature vectors files.
Throws:
java.lang.Exception - An informative exception is thrown if an invalid file or file path is specified. An exception is also thrown if the given feature definitions are incompatible with the contents of a file.

getMergedFeatureTypes

public static DataSet[] getMergedFeatureTypes(DataSet[][] datasets_to_combine,
                                              FeatureDefinition[] combined_feature_definitions,
                                              java.lang.String[][] matching_identifier_keys)
                                       throws java.lang.Exception
Merges the different extracted features contained in multiple DataSet objects that hold references to the same instances. This could be useful, for example, for combining features extracted from jAudio with features extracted with jWebMiner for the same songs.

Note that the different DataSets must each contain references to the same instances, but with entirely different feature types. The original DataSets ARE changed.

NOTE THAT THIS HAS NOT YET BEEN DESIGNED TO WORK WITH DATASETS THAT HAVE SUBSETS (e.g. features extracted separately for windows). Such subsets are currently just ignored.

Parameters:
datasets_to_combine - An array of sets of DataSets to combine into one. The first dimension refers to the group of DataSets and the second dimension refers to the DataSets within the given group. Each DataSet in each group should have the same feature types, but the feature types in the different groups should be entirely different. There should be one version of each instance in each group.
combined_feature_definitions - The combined set of all FeatureDefinition objects for all features in all of the datasets_to_combine.
matching_identifier_keys - Sets of identifiers linking DataSet objects from different groups in the datasets_to_combine parameter. The first dimension indicates an instance (ordering does not matter). The second dimension has one value for each DataSet group, giving the corresponding identifier field in that group, plus one more value giving the identifier to use in the returned DataSet[] (ordering does matter here). Note that keys must be unique within each column.
Returns:
The instances with their features combined.
Throws:
java.lang.Exception - An informative exception is thrown if a problem occurs.
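
The role of matching_identifier_keys can be illustrated with the simplified sketch below. It is not ACE's implementation: each group is modelled as a hypothetical map from identifier to feature values rather than as a DataSet array, features are simply concatenated in group order, sub-sets are ignored, and the inputs are left unmodified (unlike the real method, which changes the original DataSets).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Stand-in sketch, not ACE's implementation.
final class MergeByKeysSketch {

    // keys[i] holds one identifier per group, followed by the output identifier.
    static Map<String, double[][]> merge(List<Map<String, double[][]>> groups, String[][] keys) {
        Map<String, double[][]> merged = new LinkedHashMap<>();
        for (String[] row : keys) {
            if (row.length != groups.size() + 1) {
                throw new IllegalArgumentException("Each key row needs one entry per group plus one");
            }
            List<double[]> combined = new ArrayList<>();
            for (int g = 0; g < groups.size(); g++) {
                double[][] values = groups.get(g).get(row[g]);
                if (values == null) {
                    throw new IllegalArgumentException("No entry named " + row[g] + " in group " + g);
                }
                for (double[] feature : values) {
                    combined.add(feature); // keep features in group order
                }
            }
            merged.put(row[groups.size()], combined.toArray(new double[0][]));
        }
        return merged;
    }
}
```

For two groups, a key row such as { "song1.mp3", "Song One", "song_1" } links the entry named "song1.mp3" in the first group with the entry named "Song One" in the second and stores the combined features under "song_1".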

saveDataSets

public static void saveDataSets(DataSet[] data_sets,
                                FeatureDefinition[] definitions,
                                java.io.File to_save_to,
                                java.lang.String comments)
                         throws java.lang.Exception
Saves a feature_vector_file XML file with the contents specified in the given DataSet array and the comments specified in the comments parameter. Uses the feature_names in each of the data_sets if they are present, and uses those in the definitions parameter if they are not present in a given DataSet. If all data_sets contain feature_names, then the passed value of definitions may be null. This method does not apply the orderAndCompactFeatures method.

In general, it is best to have applied the orderAndCompactFeatures method to data_sets before calling this saveDataSets method.

Parameters:
data_sets - The DataSets to save.
definitions - The FeatureDefinitions to base feature names on if they are not present in individual DataSets. May be null.
to_save_to - The file to save to.
comments - Any comments to be saved inside the comments element of the XML file.
Throws:
java.lang.Exception - An informative exception is thrown if the file cannot be saved or if feature names are available in neither individual data_sets nor in definitions.