This week
- in Witten: Chapter 6 (decision trees in more depth, clustering)
- Sign up for the Science Carnival: the deadline says May 3, but you can still sign up this week.
- jobs at Res-net for next year
Matrices
What is the inverse of
|  1  1 |^-1      | 0.5  -0.5 |
| -1  1 |      =  | 0.5   0.5 |
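A quick machine check of this, as a sketch in Python with NumPy (my tool choice, not something the notes specify):

```python
import numpy as np

A = np.array([[ 1.0, 1.0],
              [-1.0, 1.0]])

A_inv = np.linalg.inv(A)
print(A_inv)        # [[ 0.5 -0.5]
                    #  [ 0.5  0.5]]
print(A @ A_inv)    # the identity matrix, confirming the inverse
```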
t-test
- read the pages in Witten on the t-test. Suppose you have the following data:
  Xi    | Xi - μ (deviation) | (Xi - μ)² (squared deviation)
  0.5   |                    |
  0.125 |                    |
 -0.3   |                    |
  0.25  |                    |
 --------
 μ = 0.525/4
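As a sketch of how the columns get filled in, and how a t statistic is built from them, here is a small Python example. The Xi values are my reading of the table above, and the hypothesized mean μ0 = 0 is just an assumption for illustration:

```python
import math

data = [0.5, 0.125, -0.3, 0.25]    # the Xi column as I read it from the notes
n = len(data)
mu = sum(data) / n                 # sample mean (the "μ = sum/4" row)

deviations = [x - mu for x in data]       # Xi - μ column
squared = [d * d for d in deviations]     # (Xi - μ)² column

s2 = sum(squared) / (n - 1)        # sample variance (n - 1 in the denominator)
s = math.sqrt(s2)

mu0 = 0.0                          # hypothesized mean (an assumption here)
t = (mu - mu0) / (s / math.sqrt(n))
print(mu, deviations, squared, t)
```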
- Bernoulli trials
  Xi | P(Xi) | Xi⋅P(Xi) | Xi - μ (deviation) | (Xi - μ)² (squared deviation)
  0  | 0.2   | 0        |                    |
  1  | 0.8   | 0.8      |                    |
 --------
 μ = 0.8
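A small Python sketch of the calculation, using the probabilities from the table above; the same pattern works for the binomial exercise below:

```python
dist = {0: 0.2, 1: 0.8}                                  # Xi -> P(Xi)

mu = sum(x * p for x, p in dist.items())                 # Σ Xi⋅P(Xi) = 0.8
var = sum(p * (x - mu) ** 2 for x, p in dist.items())    # Σ P(Xi)⋅(Xi - μ)²
print(mu, var)                                           # 0.8 and 0.16 = p(1 - p)
```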
- Now you try it for two trials (Binomial, n=2)
  Xi | P(Xi) | Xi⋅P(Xi) | Xi - μ (deviation) | (Xi - μ)² (squared deviation)
  0  | 0.04  |          |                    |
  1  | 0.32  |          |                    |
  2  |       |          |                    |
 --------
 μ =
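If you want to check your answers afterward, here is a sketch that regenerates the P(x) column. It assumes two independent Bernoulli trials with p = 0.8, inferred from the Bernoulli table above (it reproduces the 0.04 and 0.32 already filled in):

```python
from math import comb

n, p = 2, 0.8                      # p inferred from the Bernoulli table above
dist = {x: comb(n, x) * p**x * (1 - p)**(n - x) for x in range(n + 1)}

mu = sum(x * px for x, px in dist.items())               # should equal n⋅p
var = sum(px * (x - mu) ** 2 for x, px in dist.items())  # should equal n⋅p⋅(1-p)
print(dist, mu, var)
```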
K-means clustering
(1, 0), (2, 0), (2.9, 0), (4, 0), (5, 0)
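Here is a minimal sketch of the k-means loop on these points. The notes don't say which k or which starting centers to use, so k = 2 and the initial centers below are assumptions; the final clusters depend on both choices.

```python
points = [(1, 0), (2, 0), (2.9, 0), (4, 0), (5, 0)]
centers = [(1, 0), (5, 0)]           # assumed initial centers

def dist2(a, b):                     # squared Euclidean distance
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

for _ in range(10):                  # a handful of iterations converges here
    # assignment step: each point joins its nearest center's cluster
    clusters = [[] for _ in centers]
    for p in points:
        i = min(range(len(centers)), key=lambda i: dist2(p, centers[i]))
        clusters[i].append(p)
    # update step: each center moves to the mean of its cluster
    # (no empty-cluster handling; fine for this tiny example)
    centers = [tuple(sum(c) / len(cl) for c in zip(*cl)) for cl in clusters]

print(clusters)
print(centers)
```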
Principle of Least Squares
- for example, in regression, you have
y(1) = ∑j wj⋅xj(1)
y(2) = ∑j wj⋅xj(2)
etc.
minimize ∑i (y(i) - ∑j wj⋅xj(i))² over the weights wj
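As a sketch of what "minimize the sum of squared errors" means computationally: stack the instances into a matrix X (row i holds the values xj(i)), and the minimizing weights come from the normal equations w = (XᵀX)⁻¹Xᵀy, which also ties back to the matrix inverse above. The numbers here are made up just to have something to run:

```python
import numpy as np

X = np.array([[1.0, 2.0],           # row i holds xj(i) for instance i
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0]])
y = np.array([5.0, 4.0, 11.0, 10.0])

w = np.linalg.inv(X.T @ X) @ X.T @ y      # normal equations
# np.linalg.lstsq(X, y, rcond=None)[0] gives the same answer more robustly
print(w)                                  # the weights wj
print(X @ w)                              # predictions ∑j wj⋅xj(i)
```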
Decision Tree Algorithms
- What is the information gain (review): info([4,2],[5,3])
1. 4 log(4) + 2 log(2) + 5 log(5) + 3 log(3)
2. 6/14 ⋅ info([4,2]) + 8/14 ⋅ info([5,3])
3. 4/6 log(4/6) + 2/6 log(2/6) + 5/8 log(5/8) + 3/8 log(3/8)
4. 4/14 log(4/6) + 2/14 log(2/6) + 5/14 log(5/8) + 3/14 log(3/8)
5. 9/14 log(9/14) + 5/14 log(5/14) - (6/14 ⋅ info([4,2]) + 8/14 ⋅ info([5,3]))
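For reference, a sketch of how the info(...) quantities in these options are computed, in Witten's notation: info([a,b]) is the entropy of a class split a:b, and the gain for a test is the info at the parent node minus the weighted average info of the branches. The class counts below ([9,5], [4,2], [5,3]) are the ones appearing in the options above.

```python
from math import log2

def info(counts):
    """Entropy of a list of class counts, e.g. info([4, 2])."""
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)

before = info([9, 5])                                  # parent node
after = 6/14 * info([4, 2]) + 8/14 * info([5, 3])      # = info([4,2],[5,3])
print(before, after, before - after)                   # gain = before - after
```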
- MDL: minimum description length, section 5.9
- How to deal with numeric attributes in a decision tree, assuming that each split is binary.
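A sketch of the usual procedure for that: sort the instances by the numeric attribute, try a threshold between each pair of adjacent distinct values, and keep the one with the highest information gain. The (value, class) data below is made up for illustration.

```python
from math import log2

def info(counts):
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)

def info_of(labels):
    return info([labels.count(c) for c in set(labels)])

data = [(64, 'yes'), (65, 'no'), (68, 'yes'), (69, 'yes'),
        (70, 'yes'), (71, 'no'), (72, 'no'), (75, 'yes')]   # made-up instances
labels = [c for _, c in data]
parent = info_of(labels)

best_gain, best_t = -1.0, None
values = sorted({v for v, _ in data})
for lo, hi in zip(values, values[1:]):
    t = (lo + hi) / 2                                  # candidate threshold
    left = [c for v, c in data if v <= t]
    right = [c for v, c in data if v > t]
    after = (len(left) * info_of(left) + len(right) * info_of(right)) / len(data)
    if parent - after > best_gain:
        best_gain, best_t = parent - after, t

print(best_t, best_gain)    # best binary split and its information gain
```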
- Missing values are handled by splitting an instance into weighted pieces and propagating a piece down each branch
- Pruning is very important: subtree replacement, subtree raising. The decisions on subtree
  replacement are based on the estimated error rate, using a pessimistic confidence interval
  computed from the training data. Let's go over the calculation (p. 198); you need to
  understand the normal distribution and error rates. How do you use c = 25%?
  z = (x - μ) / σ
  z = (f - q) / sqrt(q⋅(1-q)/N)
  where f = the observed proportion E/N, and q = the true value of E/N.
  A pessimistic estimate is going to be f + something positive.
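A sketch of that calculation: set z to the value for the chosen confidence (about 0.67 for c = 25% with this definition), plug it into z = (f - q)/sqrt(q⋅(1-q)/N), and solve for the larger root q. That root is the pessimistic error estimate, i.e. f plus something positive. The f and N in the example call are made-up numbers.

```python
from math import sqrt
from statistics import NormalDist

def pessimistic_error(f, N, c=0.25):
    """Upper confidence bound on the true error rate q, given observed f = E/N."""
    z = NormalDist().inv_cdf(1 - c)    # z with Pr[X > z] = c; about 0.67 for c = 25%
    # solving z²⋅q(1-q)/N = (f - q)² for q and keeping the larger root:
    return (f + z*z/(2*N) + z * sqrt(f/N - f*f/N + z*z/(4*N*N))) / (1 + z*z/N)

print(pessimistic_error(f=2/6, N=6))   # pessimistic estimate, noticeably above 2/6
```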