Machine Learning Question Bank

I. Maximum Likelihood

1. ML estimation of an exponential model (10 points)
A Gaussian distribution is often used to model data on the real line, but it is sometimes inappropriate when the data are often close to zero yet constrained to be nonnegative. In such cases one can fit an exponential distribution, whose probability density function is given by

p(x | b) = (1/b) exp(-x/b),  x ≥ 0.

Given N observations x_1, ..., x_N drawn from such a distribution:
(a) Write down the likelihood as a function of the scale parameter b.
(b) Write down the derivative of the log likelihood.
(c) Give a simple expression for the ML estimate of b.

2. Repeat the above with the exponential distribution replaced by a Poisson distribution.
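As a quick check of part (c) (not part of the original exam): under the scale parameterization above, setting the derivative of the log likelihood to zero gives the sample mean, which the sketch below verifies numerically. The data and optimizer bounds are illustrative choices.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Negative log-likelihood of the exponential distribution with scale b:
# -log L(b) = N*log(b) + sum(x)/b
def neg_log_likelihood(b, x):
    return len(x) * np.log(b) + x.sum() / b

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=1000)   # N observations, true scale b = 2

# Closed-form ML estimate: d/db [N*log(b) + sum(x)/b] = N/b - sum(x)/b^2 = 0  =>  b = mean(x)
b_closed_form = x.mean()

# Numerical check by direct minimization of the negative log-likelihood
result = minimize_scalar(neg_log_likelihood, args=(x,), bounds=(1e-6, 100.0), method="bounded")

print(b_closed_form, result.x)              # the two estimates agree
```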
II. Bayes

In a multiple-choice exam, a student knows the correct answer with probability p and guesses with probability 1 - p. Assume that a student who knows the answer answers correctly with probability 1, while a student who guesses answers correctly with probability 1/m, where m is the number of choices. Given that the student answered the question correctly, find the probability that the student actually knew the answer.
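A worked application of Bayes' rule under the stated assumptions (added here for clarity; the original does not print a solution):

\[
P(\text{knows}\mid\text{correct})
  = \frac{P(\text{correct}\mid\text{knows})\,P(\text{knows})}
         {P(\text{correct}\mid\text{knows})\,P(\text{knows}) + P(\text{correct}\mid\text{guesses})\,P(\text{guesses})}
  = \frac{p}{p + (1-p)/m}
  = \frac{mp}{mp + 1 - p}.
\]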
1. Conjugate priors
The readings for this week include a discussion of conjugate priors. Given a likelihood p(x | θ) for a class of models with parameters θ, a conjugate prior is a distribution p(θ | α) with hyperparameters α such that the posterior distribution p(θ | x, α) ∝ p(x | θ) p(θ | α) belongs to the same family as the prior.

(a) Suppose that the likelihood is given by the exponential distribution with rate parameter λ:
p(x | λ) = λ exp(-λx),  x ≥ 0.
Show that the gamma distribution
Gamma(λ | α, β) ∝ λ^(α-1) exp(-βλ)
is a conjugate prior for the exponential. Derive the parameter update given N observations x_1, ..., x_N and the prediction distribution p(x_{N+1} | x_1, ..., x_N).

(b) Show that the beta distribution is a conjugate prior for the geometric distribution
p(x | θ) = (1 - θ)^(x-1) θ,  x = 1, 2, ...,
which describes the number of times a coin is tossed until the first head appears when the probability of heads on each toss is θ. Derive the parameter update rule and the prediction distribution.

(c) Suppose p(θ | α) is a conjugate prior for the likelihood p(x | θ); show that the mixture prior
p(θ | α_1, ..., α_M) = Σ_m w_m p(θ | α_m)
is also conjugate for the same likelihood, assuming the mixture weights w_m sum to 1.

(d) Repeat part (c) for the case where the prior is a single distribution and the likelihood is a mixture, and the prior is conjugate for each mixture component of the likelihood. Note that some priors can be conjugate for several different likelihoods; for example, the beta is conjugate for the Bernoulli and the geometric distributions, and the gamma is conjugate for the exponential and for the gamma with fixed shape parameter.

(e) (Extra credit, 20 points) Explore the case where the likelihood is a mixture with fixed components and unknown weights, i.e., the weights are the parameters to be learned.
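The sketch below is a numerical illustration of part (a) and is not part of the original exam: with a Gamma(α, β) prior on the exponential rate λ (rate parameterization), the conjugate update is α → α + N, β → β + Σ x_i, and the closed-form posterior matches a brute-force grid posterior. The prior hyperparameters and data are arbitrary choices.

```python
import numpy as np
from scipy.stats import gamma, expon

# Prior Gamma(alpha, beta) on the exponential rate lambda, in the rate parameterization:
# p(lambda) ∝ lambda^(alpha-1) * exp(-beta*lambda)
alpha_prior, beta_prior = 2.0, 1.0

rng = np.random.default_rng(1)
true_rate = 3.0
x = rng.exponential(scale=1.0 / true_rate, size=50)        # N observations

# Conjugate update: alpha' = alpha + N, beta' = beta + sum(x)
alpha_post = alpha_prior + len(x)
beta_post = beta_prior + x.sum()

# Brute-force check: grid posterior ∝ prior * likelihood, normalized numerically
lam = np.linspace(1e-3, 10.0, 2000)
prior = gamma.pdf(lam, a=alpha_prior, scale=1.0 / beta_prior)
likelihood = np.prod(expon.pdf(x[:, None], scale=1.0 / lam), axis=0)
unnormalized = prior * likelihood
grid_posterior = unnormalized / (unnormalized.sum() * (lam[1] - lam[0]))

closed_form = gamma.pdf(lam, a=alpha_post, scale=1.0 / beta_post)
print(np.max(np.abs(grid_posterior - closed_form)))         # small: the two posteriors agree
```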
III. True/False

(1) Given n data points, if half are used for training and the other half for testing, the gap between the training error and the test error decreases as n increases.
(2) Maximum likelihood estimation is unbiased and has the smallest variance among all unbiased estimators, so the maximum likelihood estimator has the smallest risk.
( ) For two regression functions A and B, if A is simpler than B, then A will almost certainly perform better than B on the test set.
( ) Global linear regression uses all of the training samples to predict the output for a new input, whereas locally weighted linear regression uses only the samples near the query point, so global linear regression is computationally more expensive than local linear regression.
( ) Boosting and Bagging both combine multiple classifiers by voting, and both weight each individual classifier according to its accuracy.
( ) In the boosting iterations, the training error of each new decision stump and the training error of the combined classifier vary roughly in concert. (F) While the training error of the combined classifier typically decreases as a function of boosting iterations, the error of the individual decision stumps typically increases, since the example weights become concentrated at the most difficult examples.
( ) One advantage of Boosting is that it does not overfit. (F)
( ) Support vector machines are resistant to outliers, i.e., very noisy examples drawn from a different distribution. ( )
(9) In regression analysis, best-subset selection can perform feature selection but is computationally expensive when the number of features is large; ridge regression and the Lasso are computationally cheaper, and the Lasso can also perform feature selection.
(10) Overfitting is more likely when the amount of training data is small.
(11) Gradient descent can get stuck in local minima, but the EM algorithm cannot.
(12) In kernel regression, the parameter that most affects the balance between overfitting and underfitting is the kernel width.
(13) In the AdaBoost algorithm, the weights on all the misclassified points will go up by the same multiplicative factor. (T)
(14) True/False: In a least-squares linear regression problem, adding an L2 regularization penalty cannot decrease the L2 error of the solution w on the training data. (F)
(15) True/False: In a least-squares linear regression problem, adding an L2 regularization penalty always decreases the expected L2 error of the solution w on unseen test data. (F)
(16) Besides the EM algorithm, gradient descent can also be used to estimate the parameters of a Gaussian mixture model. (T)
(20) Any decision boundary that we get from a generative model with class-conditional Gaussian distributions could in principle be reproduced with an SVM and a polynomial kernel.
True! In fact, since class-conditional Gaussians always yield quadratic decision boundaries, they can be reproduced with an SVM with a kernel of degree less than or equal to two.
(21) AdaBoost will eventually reach zero training error, regardless of the type of weak classifier it uses, provided enough weak classifiers have been combined.
False! If the data is not separable by a linear combination of the weak classifiers, AdaBoost cannot achieve zero training error.
(22) The L2 penalty in ridge regression is equivalent to a Laplace prior on the weights. (F)
(23) The log-likelihood of the data will always increase through successive iterations of the expectation-maximization algorithm. (F)
(24) In training a logistic regression model by maximizing the likelihood of the labels given the inputs we have multiple locally optimal solutions. (F)
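The boosting statements above (and item (13)) can be seen in a minimal AdaBoost loop; the sketch below is illustrative and not part of the original exam. Every misclassified point is up-weighted by the same factor e^α in each round, the combined training error tends to fall, and the weighted error of each new stump tends to rise. The synthetic data set is an arbitrary choice.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0, 1, -1)      # a nonlinear concept

n, T = len(y), 20
w = np.full(n, 1.0 / n)                                     # example weights
F = np.zeros(n)                                             # combined additive score

for t in range(T):
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
    pred = stump.predict(X)
    eps = w[pred != y].sum()                                # weighted error of this stump
    alpha = 0.5 * np.log((1.0 - eps) / eps)                 # stump weight
    w *= np.exp(-alpha * y * pred)                          # every mistake is multiplied by e^alpha
    w /= w.sum()
    F += alpha * pred
    combined_error = np.mean(np.sign(F) != y)
    print(t, round(eps, 3), round(combined_error, 3))       # stump error drifts up, combined error down
```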
I. Regression

1. Consider a regularized regression problem. The figure below plots the log-likelihood (mean log-probability) on the training set and on the test set when the penalty is a quadratic regularizer and the regularization parameter C takes different values. (10 points)
(1) Is the statement "as C increases, the training-set log-likelihood in Figure 2 never increases" correct? Explain why.
(2) Explain why the test-set log-likelihood in Figure 2 drops when C takes large values.

2. Consider a linear regression model with additive Gaussian noise; the training data are shown in the figure below. (10 points)
(1) Estimate the parameters by maximum likelihood and sketch the resulting model in figure (a). (3 points)
(2) Estimate the parameters by regularized maximum likelihood, i.e., add a quadratic regularization penalty to the log-likelihood objective, and sketch in figure (b) the model obtained when the regularization parameter C is very large. (3 points)
(3) After regularization, does the variance of the Gaussian noise become larger, smaller, or stay the same? (4 points)
[Figure (a)]  [Figure (b)]
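A small numerical illustration of question 1(1), not part of the original exam: in ridge regression (scikit-learn's Ridge, whose alpha plays the role of C here) the training fit can only get worse as the quadratic penalty grows. The data are synthetic.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(5)
X = rng.normal(size=(30, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(0.0, 0.5, 30)

# Training error is non-decreasing in the penalty weight, mirroring the fact that the
# training-set log-likelihood never increases as C grows.
for C in (1e-3, 0.1, 1.0, 10.0, 100.0):
    model = Ridge(alpha=C).fit(X, y)
    train_mse = np.mean((model.predict(X) - y) ** 2)
    print(C, round(train_mse, 4))
```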
3. Consider a regression problem with two-dimensional inputs lying in the unit square. Training and test samples are drawn uniformly from the unit square, and the outputs y are generated by a fixed target model. We use polynomial features of orders 1 to 10 in a linear regression model to learn the relationship between x and y (a higher-order feature model contains all lower-order features), with squared-error loss.

(1) We train models with 1st-, 2nd-, 8th-, and 10th-order features on n1 training samples and then evaluate them on a large independent test set. Mark the appropriate model(s) in each of the three columns below (there may be more than one per column), and explain why the model you marked in the third column has the smallest test error. (10 points)

| Model | Smallest training error | Largest training error | Smallest test error |
| --- | --- | --- | --- |
| Linear model with 1st-order features | | X | |
| Linear model with 2nd-order features | | | X |
| Linear model with 8th-order features | X | | |
| Linear model with 10th-order features | X | | |

(2) We now train the same four models on n2 training samples and again evaluate them on a large independent test set. Mark the appropriate model(s) in each of the three columns below (there may be more than one per column), and explain why the model you marked in the third column has the smallest test error. (10 points)

| Model | Smallest training error | Largest training error | Smallest test error |
| --- | --- | --- | --- |
| Linear model with 1st-order features | | X | |
| Linear model with 2nd-order features | | | |
| Linear model with 8th-order features | X | | X |
| Linear model with 10th-order features | X | | |

(3) The approximation error of a polynomial regression model depends on the number of training points. (T)
(4) The structural error of a polynomial regression model depends on the number of training points. (F)
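This kind of experiment is easy to reproduce; the sketch below is illustrative and not part of the original exam. The true target function and the training-set sizes are not given in the source, so the ones used here (a smooth function on the unit square, 20 vs. 500 samples) are assumptions purely for illustration.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(3)

def f(X):
    # Hypothetical target on the unit square (the true model is not preserved in the source)
    return np.sin(2.0 * X[:, 0]) + X[:, 1] ** 2

def experiment(n_train):
    X_tr = rng.uniform(0, 1, (n_train, 2)); y_tr = f(X_tr) + rng.normal(0, 0.1, n_train)
    X_te = rng.uniform(0, 1, (5000, 2));    y_te = f(X_te) + rng.normal(0, 0.1, 5000)
    for degree in (1, 2, 8, 10):
        model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
        model.fit(X_tr, y_tr)
        train_mse = np.mean((model.predict(X_tr) - y_tr) ** 2)
        test_mse = np.mean((model.predict(X_te) - y_te) ** 2)
        print(n_train, degree, round(train_mse, 4), round(test_mse, 4))

experiment(20)    # few samples: high-order models drive training error to ~0 but test error up
experiment(500)   # many samples: the higher-order models also generalize well
```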
4. We are trying to learn regression parameters for a dataset which we know was generated from a polynomial of a certain degree, but we do not know what this degree is. Assume the data was actually generated from a polynomial of degree 5 with some added Gaussian noise (that is, y = w0 + w1 x + ... + w5 x^5 + ε with ε ~ N(0, σ²)).
For training we have 100 (x, y) pairs and for testing we are using an additional set of 100 (x, y) pairs. Since we do not know the degree of the polynomial, we learn two models from the data: model A learns parameters for a polynomial of degree 4 and model B learns parameters for a polynomial of degree 6. Which of these two models is likely to fit the test data better?
Answer: The degree-6 polynomial. Since the true model is a degree-5 polynomial and we have enough training data, the degree-6 model we learn will likely assign a very small coefficient to x^6. Thus, even though it is a sixth-degree polynomial, it will behave very much like a fifth-degree polynomial, which is the correct model, leading to a better fit to the data.
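The sketch below reproduces this comparison numerically; it is illustrative and not part of the original exam, and the degree-5 coefficients and noise level are made-up values.

```python
import numpy as np

rng = np.random.default_rng(2)

# Made-up degree-5 polynomial with coefficients w0..w5, plus Gaussian noise
true_coeffs = np.array([0.5, -1.0, 0.3, 2.0, -0.7, 0.2])
def f(x):
    return sum(w * x ** i for i, w in enumerate(true_coeffs))

x_train = rng.uniform(-1, 1, 100); y_train = f(x_train) + rng.normal(0, 0.1, 100)
x_test  = rng.uniform(-1, 1, 100); y_test  = f(x_test)  + rng.normal(0, 0.1, 100)

for degree in (4, 5, 6):
    coeffs = np.polyfit(x_train, y_train, deg=degree)       # least-squares polynomial fit
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(degree, round(test_mse, 5))
# The degree-6 fit gives x^6 a near-zero coefficient and roughly matches the degree-5
# test error, while the degree-4 fit underfits.
```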
5. Input-dependent noise in regression
Ordinary least-squares regression is equivalent to assuming that each data point is generated according to a linear function of the input plus zero-mean, constant-variance Gaussian noise. In many systems, however, the noise variance is itself a positive linear function of the input (which is assumed to be non-negative, i.e., x >= 0).
a) Which of the following families of probability models correctly describes this situation in the univariate case? (Hint: only one of them does.)
(iii) is correct. In a Gaussian distribution over y, the variance is determined by the coefficient of y^2; so by replacing the constant variance σ² with σ²x, we get a variance that increases linearly with x. (Note also the change to the normalization "constant.") (i) has a quadratic dependence on x; (ii) does not change the variance at all, it just renames w1.
b) Circle the plots in Figure 1 that could plausibly have been generated by some instance of the model family(ies) you chose.
(ii) and (iii). (Note that (iii) works for suitable parameter values.) (i) exhibits a large variance at x = 0, and its variance appears independent of x.
c) True/False: Regression with input-dependent noise gives the same solution as ordinary regression for an infinite data set generated according to the corresponding model.
True. In both cases the algorithm will recover the true underlying model.
d) For the model you chose in part (a), write down the derivative of the negative log likelihood with respect to w1.
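A sketch of the requested derivative, assuming the family chosen in (a) is the Gaussian y ~ N(w0 + w1 x, σ²x) described by the answer above (the exact printed formula is not preserved in the source):

\[
-\log p(y_i \mid x_i, w)
  = \tfrac{1}{2}\log\bigl(2\pi\sigma^{2}x_i\bigr)
  + \frac{\bigl(y_i - w_0 - w_1 x_i\bigr)^{2}}{2\sigma^{2}x_i},
\]
\[
\frac{\partial}{\partial w_1}\sum_i -\log p(y_i \mid x_i, w)
  = \sum_i \frac{-x_i\bigl(y_i - w_0 - w_1 x_i\bigr)}{\sigma^{2}x_i}
  = -\frac{1}{\sigma^{2}}\sum_i \bigl(y_i - w_0 - w_1 x_i\bigr).
\]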
II. Classification

1. Generative vs. discriminative models
(a) Your billionaire friend needs your help. She needs to classify job applications into good/bad categories, and also to detect job applicants who lie in their applications, using density estimation to detect outliers. To meet these needs, do you recommend using a discriminative or a generative classifier? Why?
A generative model, because density estimation is required.
(b) Your billionaire friend also wants to classify software applications to detect bug-prone applications using features of the source code. This pilot project only has a few applications to be used as training data, though. To create the most accurate classifier, do you recommend using a discriminative or a generative classifier? Why?
A discriminative model: when there are only a few training samples, directly learning a discriminative classifier usually gives better results.
(d) Finally, your billionaire friend also wants to classify companies to decide which one to acquire. This project has lots of training data based on several decades of research. To create the most accurate classifier, do you recommend using a discriminative or a generative classifier? Why?
A generative model: when there are many training samples, the correct generative model can be learned.
2. Logistic regression
Figure 2: Log-probability of labels as a function of the regularization parameter C.
Here we use a logistic regression model to solve a classification problem. In Figure 2, we have plotted the mean log-probability of labels in the training and test sets after having trained the classifier with a quadratic regularization penalty and different values of the regularization parameter C.
(1) In training a logistic regression model by maximizing the likelihood of the labels given the inputs we have multiple locally optimal solutions. (F)
Answer: The log-probability of labels given examples implied by the logistic regression model is a concave (convex down) function with respect to the weights. The (only) locally optimal solution is also globally optimal.
(2) A stochastic gradient algorithm for training logistic regression models with a fixed learning rate will find the optimal setting of the weights exactly. (F)
Answer: A fixed learning rate means that we are always taking a finite step towards improving the log-probability of any single training example in the update equation. Unless the examples are somehow "aligned", we will continue jumping from side to side of the optimal solution, and will not be able to get arbitrarily close to it. The learning rate has to approach zero in the course of the updates for the weights to converge.
(3) The average log-probability of training labels as in Figure 2 can never increase as we increase C. (T)
Stronger regularization means more constraints on the solution, and thus the (average) log-probability of the training examples can only get worse.
(4) Explain why in Figure 2 the test log-probability of labels decreases for large values of C.
As C increases, we give more weight to constraining the predictor, and thus give less flexibility to fitting the training set. The increased regularization guarantees that the test performance gets closer to the training performance, but as we over-constrain our allowed predictors, we are not able to fit the training set at all, and although the test performance is now very close to the training performance, both are low.
(5) The log-probability of labels in the test set would decrease for large values of C even if we had a large number of training examples. (T)
The above argument still holds, but the value of C for which we will observe such a decrease will scale up with the number of examples.
(6) Adding a quadratic regularization penalty for the parameters when estimating a logistic regression model ensures that some of the parameters (weights associated with the components of the input vectors) vanish.
A regularization penalty for feature selection must have a non-zero derivative at zero. Otherwise, the regularization has no effect at zero, and the weights will tend to be slightly non-zero, even when this does not improve the log-probabilities by much.
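The sketch below contrasts a fixed learning rate with a decaying one, as discussed in answer (2); it is illustrative and not part of the original exam, and the synthetic data, step sizes, and decay schedule are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(500, 2))
true_w = np.array([2.0, -1.0])
y = (rng.random(500) < 1.0 / (1.0 + np.exp(-X @ true_w))).astype(float)

def sgd(decay, epochs=50, eta0=0.5):
    """Stochastic gradient ascent on the logistic log-likelihood."""
    w = np.zeros(2)
    t = 0
    for _ in range(epochs):
        for i in rng.permutation(len(y)):
            eta = eta0 / (1.0 + 0.01 * t) if decay else eta0
            p = 1.0 / (1.0 + np.exp(-X[i] @ w))
            w += eta * (y[i] - p) * X[i]        # per-example gradient of the log-likelihood
            t += 1
    return w

print("fixed rate:   ", sgd(decay=False))       # keeps taking finite steps, so it keeps jittering
print("decaying rate:", sgd(decay=True))        # shrinking steps let the iterate settle down
```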
3. Regularized logistic regression
In this problem we refer to the binary classification task depicted in Figure 1(a), which we attempt to solve with the simple linear logistic regression model P(y = 1 | x, w) = g(w1 x1 + w2 x2), where g is the logistic function (for simplicity we do not use the bias parameter w0). The training data can be separated with zero training error; see line L1 in Figure 1(b) for instance.
Figure 1: (a) The two-dimensional data set used in Problem 2. (b) The points can be separated by L1 (solid line); possible other decision boundaries are shown by L2, L3, L4.
(1) Consider a regularization approach where we try to maximize the log-likelihood minus a quadratic penalty C·w2²/2 for large C. Note that only w2 is penalized. We'd like to know which of the four lines in Figure 1(b) could arise as a result of such regularization. For each potential line L2, L3 or L4, determine whether it can result from regularizing w2. If not, explain very briefly why not.
L2: No. When we regularize w2, the resulting boundary can rely less on the value of x2 and therefore becomes more vertical. L2 here seems to be more horizontal than the unregularized solution, so it cannot come as a result of penalizing w2.
L3: Yes. Here w2² is small relative to w1² (as evidenced by the high slope), and even though it would assign a rather low log-probability to the observed labels, it could be forced by a large regularization parameter C.
L4: No. For very large C, we get a boundary that is entirely vertical (the line x1 = 0, i.e., the x2 axis). L4 here is reflected across the x2 axis and represents a poorer solution than its counterpart on the other side. For moderate regularization we have to get the best solution that we can construct while keeping w2 small. L4 is not the best and thus cannot come as a result of regularizing w2.
(2) If we change the form of regularization to the one-norm (absolute value) and also regularize w1, we get a penalized log-likelihood of the form log-likelihood minus C(|w1| + |w2|). Consider again the problem in Figure 1(a) and the same linear logistic regression model. As we increase the regularization parameter C, which of the following scenarios do you expect to observe (choose only one)?
( x ) First w1 will become 0, then w2.
(   ) w1 and w2 will become zero simultaneously.
(   ) First w2 will become 0, then w1.
(   ) None of the weights will become exactly zero, only smaller as C increases.
The data can be classified with zero training error, and therefore also with high log-probability, by looking at the value of x2 alone, i.e. making w1 = 0. Initially we might prefer to have a non-zero value for w1, but it will go to zero rather quickly as we increase the regularization. Note that we pay a regularization penalty for a non-zero value of w1, and if it doesn't help classification, why would we pay the penalty? The absolute-value regularization ensures that w1 will indeed go to exactly zero. As C increases further, even w2 will eventually become zero: we pay a higher and higher cost for setting w2 to a non-zero value, and eventually this cost overwhelms the gain in the log-probability of labels that we can achieve with a non-zero w2. Note that when w1 = w2 = 0, the log-probability of labels is a finite value, n·log(0.5).
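The behaviour described in (2) can be reproduced with an L1-penalized logistic regression; the sketch below is illustrative and not part of the original exam, and the dataset is a hypothetical stand-in for Figure 1(a) in which the labels are determined by x2 alone.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(7)
x2 = rng.uniform(-2.0, 2.0, 200)
x1 = 0.3 * x2 + rng.normal(0.0, 1.0, 200)        # x1 is only weakly informative
X = np.column_stack([x1, x2])
y = (x2 > 0).astype(int)                         # separable by x2 alone

# scikit-learn's C is an *inverse* penalty strength, so decreasing C below corresponds to
# increasing the regularization parameter in the problem statement.
for C in (10.0, 1.0, 0.3, 0.1, 0.03):
    clf = LogisticRegression(penalty="l1", C=C, solver="liblinear", fit_intercept=False)
    clf.fit(X, y)
    print(C, clf.coef_.ravel())                  # w1 hits exactly zero first, then w2 shrinks
```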
1. SVM
Figure 4: Training set, maximum-margin linear separator, and the support vectors (in bold).
(1) What is the leave-one-out cross-validation error estimate for maximum-margin separation in Figure 4? (We are asking for a number.) (0)
Based on the figure we can see that removing any single point would not change the resulting maximum-margin separator. Since all the points are initially classified correctly, the leave-one-out error is zero.
(2) We would expect the support vectors to remain the same in general as we move from a linear kernel to higher-order polynomial kernels. (F)
There are no guarantees that the support vectors remain the same. The feature vectors corresponding to polynomial kernels are non-linear functions of the original input vectors, and thus the support points for maximum-margin separation in the feature space can be quite different.
(3) Structural risk minimization is guaranteed to find the model (among those considered) with the lowest expected loss. (F)
We are only guaranteed to find the model with the lowest upper bound on the expected loss.
(4) What is the VC dimension of a mixture of two Gaussians model in the plane with equal covariance matrices? Why?
A mixture of two Gaussians with equal covariance matrices has a linear decision boundary. Linear separators in the plane have VC dimension exactly 3.
4. SVM
Classify the following data points:
[Table of the six training points]
(a) Plot these six training points. Are the classes {+, -} linearly separable?
Yes.
(b) Construct the weight vector of the maximum-margin hyperplane by inspection and identify the support vectors.
The maximum-margin hyperplane should have a slope of -1 and should pass through the point (x1, x2) = (3/2, 0). Therefore its equation is x1 + x2 = 3/2, and the weight vector is (1, 1)^T.
(c) If you remove one of the support vectors, does the size of the optimal margin decrease, stay the same, or increase?
In this specific dataset the optimal margin increases when we remove the support vectors (1, 0) or (1, 1), and stays the same when we remove the other two.
(d) (Extra Credit) Is your answer to (c) also true for any dataset? Provide a counterexample or give a short proof.
When we drop some constraints in a constrained maximization problem, we get an optimal value that is at least as good as the previous one. Hence for any dataset, removing a support vector (i.e., dropping a constraint) can only leave the optimal margin unchanged or increase it; it can never decrease.
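The original table of six points is not preserved in the source; the points below are hypothetical choices consistent with the printed answer (separator x1 + x2 = 3/2). The sketch uses scikit-learn's SVC with a large C to approximate a hard-margin SVM and is illustrative only.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical six training points consistent with the printed answer
X = np.array([[1, 1], [2, 2], [2, 0],     # class +1
              [0, 0], [1, 0], [0, 1]])    # class -1
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6)          # hard-margin behaviour via a very large C
clf.fit(X, y)

print(clf.coef_, clf.intercept_)           # proportional to (1, 1) and -3/2
print(clf.support_vectors_)                # the margin-defining points
```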