ML: Lecture 4

Readings for the quiz

in Witten: Chapters 1 and 2 and section 11.3
types of concepts: classification, clustering, associations
types of data: nominal, numeric, ordinal
examples: sister-of, contact lenses, weather, irises, soybean classification, etc
ethics: why do we need to think about it?

Additional resources

Computer Vision by Shapiro and Stockman
Image processing by Steve Tanimoto (probably too elementary)

Attributes and Data

What kinds of data are there in Weka?
Why data needs to be cleaned/preprocessed: missing values, inconsistent values
Summarizing data: mean, standard deviation, min, max, quartiles
goal: finding a minimum set of attributes that adequately describes the concept.
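
A minimal sketch of the summary statistics listed above, using only the Python standard library. The temperature values are made up for illustration, and None stands in for a missing value; neither comes from the readings.

```python
import statistics

def summarize(values):
    """Summarize one numeric attribute, ignoring missing values (None)."""
    present = [v for v in values if v is not None]
    # quantiles(n=4) returns the three cut points: Q1, median, Q3
    q1, median, q3 = statistics.quantiles(present, n=4)
    return {
        "count": len(present),
        "missing": len(values) - len(present),
        "mean": statistics.mean(present),
        "stdev": statistics.stdev(present),   # sample standard deviation
        "min": min(present),
        "max": max(present),
        "quartiles": (q1, median, q3),
    }

# Illustrative values only; None marks a missing measurement.
temperature = [85, 80, None, 70, 68, 65, 64]
summary = summarize(temperature)
print(summary)
```

Counting the missing values alongside the statistics is one simple way to see how much cleaning an attribute needs.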

What is the syntax for ARFF format?

@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@data
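
To make the syntax concrete, here is a rough sketch of a reader for a small subset of ARFF: the @relation, @attribute and @data keywords, % comment lines, and ? for a missing value. The function name parse_arff and the two data rows are assumptions made for illustration; quoted strings, dates and relational attributes are not handled.

```python
def parse_arff(text):
    """Parse a small subset of ARFF into (relation, attributes, rows)."""
    relation, attributes, rows = None, [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):    # blank line or comment
            continue
        if in_data:
            values = [v.strip() for v in line.split(",")]
            rows.append([None if v == "?" else v for v in values])
            continue
        keyword, _, rest = line.partition(" ")
        keyword = keyword.lower()               # keywords are case-insensitive
        if keyword == "@relation":
            relation = rest.strip()
        elif keyword == "@attribute":
            name, _, kind = rest.strip().partition(" ")
            kind = kind.strip()
            if kind.startswith("{"):            # nominal: list of allowed values
                kind = [v.strip() for v in kind.strip("{}").split(",")]
            attributes.append((name, kind))
        elif keyword == "@data":
            in_data = True
    return relation, attributes, rows

# The header from the notes plus two illustrative data rows.
sample = """@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@data
sunny, 85
overcast, ?
"""
relation, attributes, rows = parse_arff(sample)
print(relation, attributes, rows)
```

Values are kept as strings here; a fuller reader would convert them according to each attribute's declared type.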

How does ARFF denote string data?

How does ARFF handle multi-instance data?

What does the following mean?

  @attribute bag relational
      @attribute outlook  {sunny, overcast, rainy}
      @attribute temperature numeric
  @end bag
  @attribute play? {yes, no}
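
One way to read the declaration above: each example holds a bag of (outlook, temperature) instances, and the whole bag shares a single play? label. The sketch below only mirrors that shape in Python; the class name Example and the values are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Example:
    bag: List[Tuple[str, float]]   # several (outlook, temperature) instances
    play: str                      # one class label for the whole bag

# Illustrative values only.
examples = [
    Example(bag=[("sunny", 85.0), ("overcast", 83.0)], play="no"),
    Example(bag=[("rainy", 70.0)], play="yes"),
]
bag_sizes = [len(e.bag) for e in examples]
print(bag_sizes)
```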

attribute subset selection: finding a minimum set of attributes that adequately describes the concept.
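
A small sketch of the idea, under one assumption: "adequately describes the concept" is read here as "no two training instances agree on the chosen attributes but differ in class". The search is exhaustive, so it is only practical for a handful of attributes, and the rows are a made-up toy table rather than data from the readings.

```python
from itertools import combinations

def is_consistent(rows, attrs, target):
    """True if rows that agree on attrs always agree on the target."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in attrs)
        if seen.setdefault(key, row[target]) != row[target]:
            return False
    return True

def smallest_consistent_subset(rows, attributes, target):
    """Try subsets in order of increasing size; return the first consistent one."""
    for size in range(len(attributes) + 1):
        for attrs in combinations(attributes, size):
            if is_consistent(rows, attrs, target):
                return attrs
    return None   # the data contradicts itself even with all attributes

# Hypothetical toy table.
rows = [
    {"outlook": "sunny",    "humidity": "high",   "windy": "false", "play": "no"},
    {"outlook": "sunny",    "humidity": "normal", "windy": "true",  "play": "yes"},
    {"outlook": "overcast", "humidity": "high",   "windy": "false", "play": "yes"},
    {"outlook": "rainy",    "humidity": "high",   "windy": "true",  "play": "no"},
    {"outlook": "rainy",    "humidity": "normal", "windy": "false", "play": "yes"},
]
subset = smallest_consistent_subset(rows, ["outlook", "humidity", "windy"], "play")
print(subset)   # ('outlook', 'humidity')
```

Practical selection methods usually replace the exhaustive loop with a greedy or heuristic search and score subsets by estimated accuracy rather than strict consistency.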

Filtering data

A filter can be almost any function that transforms the input
Filters are classified two ways: supervised or unsupervised (does the filter use the class?),
and attribute filters or instance filters (does it change the columns or the rows?).
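
To illustrate "a filter is a function that transforms the input", here is a sketch with one attribute filter and one instance filter applied in sequence. Both are unsupervised, since neither looks at the class. The function names and the rows are assumptions for illustration and are not Weka's API.

```python
def remove_attribute(rows, name):
    """Attribute filter: return the rows without the named attribute."""
    return [{k: v for k, v in row.items() if k != name} for row in rows]

def remove_with_missing(rows):
    """Instance filter: keep only rows that have no missing (None) values."""
    return [row for row in rows if None not in row.values()]

def apply_filters(rows, filters):
    """Apply each filter in turn; every filter maps a dataset to a dataset."""
    for f in filters:
        rows = f(rows)
    return rows

# Illustrative rows; None marks a missing value.
rows = [
    {"outlook": "sunny",    "temperature": 85,   "play": "no"},
    {"outlook": "overcast", "temperature": None, "play": "yes"},
    {"outlook": "rainy",    "temperature": 70,   "play": "yes"},
]
filtered = apply_filters(
    rows,
    [remove_with_missing, lambda r: remove_attribute(r, "temperature")],
)
print(filtered)
```

The order matters: dropping the temperature attribute first would hide the missing value, and the second row would survive.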

Examples:

   AddCluster: adds a new nominal attribute which is an ID of each cluster
   AddNoise: flips some of the values in the input for a nominal attribute 
   NominalToBinary: converts each nominal attribute into binary attributes (the supervised version uses the class when the class is numeric)
   Randomize (unsupervised, instance)
   Resample (unsupervised, instance)
   SpreadSubsample (supervised, instance)
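
Simplified sketches of three of the filters above, to show what each one does to the data. These are not Weka's implementations: the function names, parameters and the toy rows are assumptions, and Weka's versions have more options.

```python
import random
from collections import defaultdict

def add_noise(rows, attribute, values, fraction, rng):
    """Change `attribute` to a different value in about `fraction` of the rows."""
    rows = [dict(row) for row in rows]            # work on copies
    chosen = rng.sample(range(len(rows)), round(fraction * len(rows)))
    for i in chosen:
        others = [v for v in values if v != rows[i][attribute]]
        rows[i][attribute] = rng.choice(others)
    return rows

def resample(rows, size, rng):
    """Draw `size` rows at random, with replacement."""
    return [rng.choice(rows) for _ in range(size)]

def spread_subsample(rows, target, max_ratio, rng):
    """Subsample so no class has more than max_ratio times the rarest class."""
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[target]].append(row)
    cap = int(max_ratio * min(len(group) for group in by_class.values()))
    kept = []
    for group in by_class.values():
        kept.extend(rng.sample(group, min(cap, len(group))))
    return kept

# Toy data: 6 "yes" rows and 2 "no" rows.
rows = [{"id": i, "play": "yes"} for i in range(6)]
rows += [{"id": i, "play": "no"} for i in range(6, 8)]

rng = random.Random(0)                            # fixed seed for repeatability
noisy = add_noise(rows, "play", ["yes", "no"], 0.5, rng)
sample = resample(rows, 5, rng)
balanced = spread_subsample(rows, "play", 1.0, rng)
print(len(noisy), len(sample), len(balanced))
```

spread_subsample has to read the class attribute to decide what to keep, which is why it counts as supervised; resample as written never looks at the class.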