Case Western Reserve University
CSDS 435/335 Data Mining
1. (Total 5 pts) Suppose we build a decision tree using the same training data given in the table on page one.
A) (1pt) What is the Gini index of the root node?
B) (3pts) We will check each of the three attributes (Gender, Car Type, Shirt Size) and, for each, calculate the Gini index of the split (the weighted sum of the Gini indices of the child nodes). What is this value when Gender, Car Type, and Shirt Size are used as the splitting attribute, respectively?
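The training table is not reproduced here, so the following sketch uses a made-up list of class labels; `gini` and `split_gini` are hypothetical helper names illustrating the two quantities asked for above (the Gini index of a single node, and the weighted sum over the child nodes of a split).

```python
from collections import Counter

def gini(labels):
    """Gini index of one node: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def split_gini(groups):
    """Weighted sum of the Gini indices of the child nodes of a split.

    `groups` is a list of label lists, one per child node.
    """
    total = sum(len(g) for g in groups)
    return sum(len(g) / total * gini(g) for g in groups)

# Made-up example: a root node with 3 "+" and 7 "-" records.
root = ["+"] * 3 + ["-"] * 7
print(gini(root))                      # 1 - 0.3^2 - 0.7^2 = 0.42

# A (hypothetical) binary split that separates the classes perfectly.
print(split_gini([["+"] * 3, ["-"] * 7]))   # 0.0
```

A pure node (all records in one class) has Gini index 0, and for two classes the maximum is 0.5, attained at a 50/50 mix; the best splitting attribute is the one with the smallest weighted child Gini.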
2. (4pts) Consider the task of building a classifier from random data, where the attribute values are generated randomly irrespective of the class labels. Assume the data set contains records from two classes, “+” and “−.” Half of the data set is used for training while the remaining half is used for testing. Assume the total number of samples is 2n (so the number of test records is n). For each part below, provide the expected confusion matrix on the test data and use it to calculate the error rate.
A. Suppose there are equal numbers of positive and negative records in the data, and the decision tree classifier predicts every test record to be positive. What is the expected error rate of the classifier on the test data?
B. Repeat the analysis in part A, assuming that the classifier predicts each test record to be positive with probability 0.8 and negative with probability 0.2.
C. Suppose two-thirds of the data belong to the positive class and the remaining one-third belong to the negative class. What is the expected error rate of a classifier that predicts every test record to be positive?
D. Repeat the analysis in part C, assuming that the classifier predicts each test record to be positive with probability 2/3 and negative with probability 1/3.
3. (total 11pts) You are asked to evaluate the performance of two classification models, M1 and M2. The test set you have chosen contains 26 binary attributes, labeled as A through Z. The table below shows the posterior probabilities obtained by applying the models to the test set. (Only the posterior probabilities for the positive class are shown). As this is a two-class problem, P(−) = 1 − P(+) and P(−|A, . . ., Z) = 1 − P(+|A, . . . , Z). Assume that we are mostly interested in detecting instances from the positive class.
A. (5pts) Plot the ROC curve for both M1 and M2.
B. (2pts) For model M1, suppose you choose the cutoff threshold to be t = 0.5. In other words, any test instance whose posterior probability is greater than t will be classified as a positive example. Compute the precision, recall, and F-measure for the model at this threshold value.
C. (2pts) Repeat the analysis for part (B) using the same cutoff threshold on model M2. Compare the F-measure results for both models. Which model is better? Are the results consistent with what you expect from the ROC curve?
D. (2pts) Repeat part (B) for model M1 using the threshold t = 0.1. Which threshold do you prefer, t = 0.5 or t = 0.1? Are the results consistent with what you expect from the ROC curve?