This week
- in Witten: Finish Chapter 4, start Chapter 5
- Test tomorrow.
- Hw 3 will be due in Week 6 and will be based on Lab 4
- Sign up for the Science Carnival
Training and Testing
- It is important to test an algorithm on different data from the training data.
- Cross-validation: divide the data into n disjoint subsets (folds), train the algorithm on n - 1 of them, and test it on the remaining one, rotating so that each fold is used for testing exactly once. It is generally better to train on more data. For 10 folds, train on 9/10 of the data and test on the other 1/10. Repeat this 10 times, once per fold, and then repeat the whole procedure 10 times with different random folds.
- General rules: stratification generally improves the results, and 10 folds is often a good choice.
- Bootstrap is resampling with replacement: given n instances, draw n instances at random with replacement to form the training set. About 63% of the distinct instances end up in the training set; use the other 37%, which were never drawn, for testing.
- Leave-one-out is n-fold cross-validation: train on n - 1 instances and test on the one left out, once for each instance.
- Sometimes there are three data sets: training, validation, testing
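The splitting schemes above can be sketched in a few lines of Python. This is only an illustration using the standard library: the function names `k_fold_splits` and `bootstrap_split` are made up for these notes, the folds are not stratified, and instances are represented by their indices only.

```python
import random

def k_fold_splits(n_instances, k=10, seed=0):
    """Yield (train_indices, test_indices) once for each of k disjoint folds."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most 1.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

def bootstrap_split(n_instances, seed=0):
    """Draw n indices with replacement; indices never drawn form the test set."""
    rng = random.Random(seed)
    train = [rng.randrange(n_instances) for _ in range(n_instances)]
    drawn = set(train)
    test = [i for i in range(n_instances) if i not in drawn]
    return train, test

splits = list(k_fold_splits(100, k=10))
print(len(splits), len(splits[0][0]), len(splits[0][1]))  # 10 90 10

train, test = bootstrap_split(10_000)
# Fraction of distinct instances in the bootstrap training set, roughly 0.63.
print(len(set(train)) / 10_000)
```

Repeating 10-fold cross-validation 10 times amounts to calling `k_fold_splits` with 10 different values of `seed`.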
Comparing Data Mining Schemes
- How would you test if two different algorithms produce significantly different results?
- What is a paired t-test and what is the difference between that and a standard t-test?
$$t = \frac{d_{\text{avg}}}{\sqrt{\sigma^2 / n}}$$
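As a rough sketch of how the paired t statistic could be computed, the function below takes the scores of two algorithms on the same n folds and works with the per-fold differences. It assumes that σ² is the sample variance of the differences (divisor n - 1), which is what `statistics.variance` computes; the name `paired_t_statistic` and the accuracy values are hypothetical.

```python
import math
import statistics

def paired_t_statistic(scores_a, scores_b):
    """t statistic for paired scores: mean difference over its standard error."""
    if len(scores_a) != len(scores_b):
        raise ValueError("paired samples must have the same length")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    d_avg = statistics.mean(diffs)
    variance = statistics.variance(diffs)  # assumed: sample variance, divisor n - 1
    return d_avg / math.sqrt(variance / n)

# Hypothetical accuracies (in percent) of two algorithms on the same three folds.
accuracy_a = [81, 84, 79]
accuracy_b = [80, 82, 76]
t = paired_t_statistic(accuracy_a, accuracy_b)
print(round(t, 3))  # 3.464
```

The resulting value is compared against a t distribution with n - 1 degrees of freedom. Pairing matters because both algorithms are scored on the same folds, so the fold-to-fold variation cancels out in the differences.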
Some Linear Algebra
- A matrix has two applications in this context: linear functions and linear equations
$$2x + y = 1, \qquad 3x + 2y = 5$$
$$f(x, y) = (2x + y,\ 3x + 2y)$$
- What does it mean to solve the equation above?
$$f(x, y) = (1, 5), \qquad (x, y) = f^{-1}(1, 5)$$
- We have practiced multiplying matrices — what is the rule?
- What is the inverse of a matrix?
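One way to explore these questions is to work the example system through in code. The sketch below is a minimal illustration, not a general linear algebra routine: `matmul` and `inverse_2x2` are hypothetical helper names, matrices are plain lists of rows, and the inverse uses the 2x2 determinant formula only.

```python
from fractions import Fraction

def matmul(a, b):
    """Entry (i, j) of the product is row i of a dotted with column j of b."""
    if len(a[0]) != len(b):
        raise ValueError("columns of a must match rows of b")
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def inverse_2x2(m):
    """Inverse of [[a, b], [c, d]] is (1 / (ad - bc)) * [[d, -b], [-c, a]]."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular and has no inverse")
    return [[Fraction(d, det), Fraction(-b, det)],
            [Fraction(-c, det), Fraction(a, det)]]

# Coefficients of 2x + y = 1, 3x + 2y = 5, i.e. the matrix of f.
A = [[2, 1], [3, 2]]
A_inv = inverse_2x2(A)

# Solving the system means applying the inverse to the right-hand side (1, 5).
solution = matmul(A_inv, [[1], [5]])
x, y = (int(row[0]) for row in solution)
print(x, y)  # -3 7
```

Multiplying `A` by `A_inv` gives the identity matrix, which is the defining property of the inverse.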