Classification
Michael I. Jordan, University of California, Berkeley

Classification
In classification problems, each entity in some domain can be placed in one of a discrete set of categories: yes/no, friend/foe, good/bad/indifferent, blue/red/green, etc.
Given a training set of labeled entities, develop a rule for assigning labels to entities in a test set.
Many variations on this theme: binary classification, multi-category classification, non-exclusive categories, ranking.
Many criteria to assess rules and their predictions: overall errors, costs associated with different kinds of errors, operating points.

Representation of Objects
Each object to be classified is represented as a pair (x, y):
- x is a description of the object (see the examples of data types in the following slides)
- y is a label (assumed binary for now)
Success or failure of a machine learning classifier often depends on choosing good descriptions of objects.
The choice of description can also be viewed as a learning problem, and indeed we'll discuss automated procedures for choosing descriptions in a later lecture, but good human intuitions are often needed here.

Data Types
Vectorial data: physical attributes, behavioral attributes, context, history, etc.
We'll assume for now that such vectors are explicitly represented in a table, but later (cf. kernel methods) we'll relax that assumption.

Data Types
Text and hypertext. (The slide shows the raw HTML/CSS source of a web page, "Welcome to FairmontNET", as an example.)

Data Types
Email. (The slide shows a raw email from the Berkeley EECS mail system, complete with Received/From/Subject/MIME headers, body, and signature, as an example.)

Data Types
Protein sequences.

Data Types
Sequences of Unix system calls.

Data Types
Network layout: graph.

Data Types
Images.

Example: Spam Filter
Input: email
Output: spam/ham
Setup:
- Get a large collection of example emails, each labeled "spam" or "ham"
- Note: someone has to hand-label all this data
- Want to learn to predict labels of new, future emails
Features: the attributes used to make the ham/spam decision
- Words: FREE!
- Text patterns: $dd, CAPS
- Non-text: SenderInContacts
(The slide shows three example messages: a "Dear Sir ..." advance-fee scam, a "99 MILLION EMAIL ADDRESSES FOR ONLY $99" bulk-mail offer, and a legitimate but off-topic question about reviving an old Dell Dimension XPS.)
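To make the feature idea concrete, here is a minimal sketch, not taken from the slides, of turning one email into a handful of binary features of the kinds listed above (words, text patterns, non-text attributes). The function name extract_features, the specific regular expressions, and the hard-coded SenderInContacts value are illustrative assumptions.

import re

def extract_features(email_text):
    """Map one email to binary features of the kind listed on the slide."""
    words = set(re.findall(r"[a-z]+", email_text.lower()))
    return {
        "word:free": "free" in words,                                   # Words: FREE!
        "pattern:$dd": bool(re.search(r"\$\d\d", email_text)),          # Text patterns: $dd
        "pattern:CAPS": bool(re.search(r"\b[A-Z]{4,}\b", email_text)),  # Text patterns: CAPS
        # Non-text features (e.g. SenderInContacts) would come from metadata,
        # which this sketch does not have, so it is stubbed out as False.
        "nontext:SenderInContacts": False,
    }

print(extract_features("99 MILLION EMAIL ADDRESSES FOR ONLY $99"))

A downstream classifier would then be trained on such feature vectors, one per labeled email.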

Example: Digit Recognition
Input: images / pixel grids
Output: a digit 0-9
Setup:
- Get a large collection of example images, each labeled with a digit
- Note: someone has to hand-label all this data
- Want to learn to predict labels of new, future digit images
Features: the attributes used to make the digit decision
- Pixels: (6,8) = ON
- Shape patterns: NumComponents, AspectRatio, NumLoops
Current state of the art: human-level performance.
(The slide shows example digit images labeled 0, 1, 2, 1, and one marked "?".)

Other Examples of Real-World Classification Tasks
- Fraud detection (input: account activity; classes: fraud / no fraud)
- Web page spam detection (input: HTML/rendered page; classes: spam / ham)
- Speech recognition and speaker recognition (input: waveform; classes: phonemes or words)
- Medical diagnosis (input: symptoms; classes: diseases)
- Automatic essay grading (input: document; classes: grades)
- Customer service email routing and foldering
- Link prediction in social networks
- Catalytic activity in drug design
- many, many more
Classification is an important commercial technology.

Training and Validation
Data: labeled instances, e.g. emails marked spam/ham
- Training set
- Validation set
- Test set
Training:
- Estimate parameters on the training set
- Tune hyperparameters on the validation set
- Report results on the test set
- Anything short of this yields over-optimistic claims
Evaluation:
- Many different metrics
- Ideally, the criteria used to train the classifier should be closely related to those used to evaluate the classifier
Statistical issues:
- Want a classifier that does well on test data
- Overfitting: fitting the training data very closely, but not generalizing well
- Error bars: want realistic (conservative) estimates of accuracy
(Figure: the data split into Training Data, Validation Data, and Test Data.)
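To make the training/validation/test protocol above concrete, here is a minimal numpy sketch, not from the slides: a k-nearest-neighbor rule is fit on the training set, its hyperparameter k is tuned on the validation set, and accuracy is reported once on the test set. The toy data, the split sizes, and the helper knn_accuracy are illustrative assumptions.

import numpy as np

# Toy data: 2-D points with a binary label (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Shuffle once, then carve out training / validation / test sets.
idx = rng.permutation(len(X))
train, val, test = idx[:180], idx[180:240], idx[240:]

def knn_accuracy(k, fit_idx, eval_idx):
    """Accuracy of a k-nearest-neighbor rule fit on fit_idx, scored on eval_idx."""
    correct = 0
    for i in eval_idx:
        d = np.linalg.norm(X[fit_idx] - X[i], axis=1)
        votes = y[fit_idx][np.argsort(d)[:k]]
        correct += (votes.mean() > 0.5) == y[i]
    return correct / len(eval_idx)

# Estimate parameters on the training set, tune the hyperparameter k on the
# validation set, and report a single number on the test set.
best_k = max([1, 3, 5, 7], key=lambda k: knn_accuracy(k, train, val))
print("chosen k:", best_k, "test accuracy:", knn_accuracy(best_k, train, test))

Reporting the validation score itself as the final result is exactly the over-optimistic shortcut the slide warns against.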

Some State-of-the-Art Classifiers
- Support vector machine
- Random forests
- Kernelized logistic regression
- Kernelized discriminant analysis
- Kernelized perceptron
- Bayesian classifiers
- Boosting and other ensemble methods
- (Nearest neighbor)

Intuitive Picture of the Problem
(Figure.)

Some Issues
- There may be a simple separator (e.g., a straight line in 2D or a hyperplane in general) or there may not
- There may be "noise" of various kinds
- There may be "overlap"
- One should not be deceived by one's low-dimensional geometrical intuition
- Some classifiers explicitly represent separators (e.g., straight lines), while for other classifiers the separation is done implicitly
- Some classifiers just make a decision as to which class an object is in; others estimate class probabilities

Methods
I) Instance-based methods: 1) Nearest neighbor
II) Probabilistic models: 1) Naive Bayes 2) Logistic Regression
III) Linear models: 1) Perceptron 2) Support Vector Machine
IV) Decision models: 1) Decision Trees 2) Boosted Decision Trees 3) Random Forest
(The deck repeats this overview slide to introduce each part; the repeats are omitted here.)

Linearly Separable Data / Nonlinearly Separable Data / Non-Linear Classifier
(Figures.)

Which Separating Hyperplane to Use?
(Figure: two classes in the x1-x2 plane with several candidate separating lines.)

Maximizing the Margin / Margin Width
Select the separating hyperplane that maximizes the margin.
(Figure: the margin width around the separating line in the x1-x2 plane.)

Support Vectors
(Figure: the training points that lie on the margin boundaries are the support vectors.)

Setting Up the Optimization Problem
The maximum margin can be characterized as the solution to an optimization problem.
If class 1 corresponds to +1 and class 2 corresponds to -1, the two constraints w·xi + b ≥ 1 (class 1) and w·xi + b ≤ -1 (class 2) can be rewritten as yi(w·xi + b) ≥ 1 for all i, so the problem becomes: maximize the margin 2/||w|| subject to these constraints, or equivalently minimize ½||w||².

Linear, Hard-Margin SVM Formulation
Find w, b that solve:
  minimize ½||w||²
  subject to yi(w·xi + b) ≥ 1 for all i
The problem is convex, so there is a unique global minimum value (when feasible).
There is also a unique minimizer, i.e. the weight vector w and offset b that achieve that minimum.
Quadratic programming: very efficient computationally, with procedures that take advantage of the special structure.
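The equivalence between "maximize the margin" and "minimize ½||w||²" that this formulation relies on can be spelled out; the following derivation is standard and is added here for completeness rather than taken from the slides.

\[
\operatorname{dist}\bigl(\mathbf{x}_i,\ \{\mathbf{w}^{\top}\mathbf{x}+b=0\}\bigr)
= \frac{\lvert \mathbf{w}^{\top}\mathbf{x}_i + b\rvert}{\lVert\mathbf{w}\rVert}
\;\ge\; \frac{1}{\lVert\mathbf{w}\rVert}
\qquad\text{whenever } y_i(\mathbf{w}^{\top}\mathbf{x}_i + b) \ge 1,
\]
so the two margin hyperplanes \(\mathbf{w}^{\top}\mathbf{x}+b=\pm 1\) are \(2/\lVert\mathbf{w}\rVert\) apart, and
\[
\max_{\mathbf{w},b}\ \frac{2}{\lVert\mathbf{w}\rVert}
\ \ \text{s.t.}\ \ y_i(\mathbf{w}^{\top}\mathbf{x}_i + b) \ge 1
\quad\Longleftrightarrow\quad
\min_{\mathbf{w},b}\ \tfrac{1}{2}\lVert\mathbf{w}\rVert^{2}
\ \ \text{s.t.}\ \ y_i(\mathbf{w}^{\top}\mathbf{x}_i + b) \ge 1 .
\]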

Nonlinearly Separable Data
Introduce slack variables ξi: allow some instances to fall within the margin, but penalize them.
(Figure: two overlapping classes in the Var1-Var2 plane.)

Formulating the Optimization Problem
The constraints become yi(w·xi + b) ≥ 1 - ξi with ξi ≥ 0, and the objective function penalizes misclassified instances and those within the margin.
C trades off margin width against misclassifications.

Linear, Soft-Margin SVMs
  minimize ½||w||² + C Σi ξi
  subject to yi(w·xi + b) ≥ 1 - ξi, ξi ≥ 0 for all i
The algorithm tries to keep the ξi at zero while maximizing the margin.
Note: the algorithm does not minimize the number of misclassifications (an NP-complete problem) but the sum of distances from the margin hyperplanes.
Other formulations use ξi² instead.
As C → ∞, we approach the hard-margin solution.
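As a sketch of what fitting a linear soft-margin SVM can look like in practice, here is a minimal subgradient-descent implementation of the equivalent unconstrained objective ½||w||² + C Σi max(0, 1 - yi(w·xi + b)). This is an illustrative substitute for the quadratic-programming solvers the slides refer to: the function name, learning rate, epoch count, and toy data are all assumptions.

import numpy as np

def train_linear_svm(X, y, C=1.0, lr=0.001, epochs=200):
    """Batch subgradient descent on 0.5*||w||^2 + C*sum(hinge losses); y must be +/-1."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        viol = margins < 1                      # points inside the margin or misclassified
        grad_w = w - C * (y[viol][:, None] * X[viol]).sum(axis=0)
        grad_b = -C * y[viol].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Toy two-class data (illustrative).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(+1.5, 1.0, (50, 2)), rng.normal(-1.5, 1.0, (50, 2))])
y = np.array([+1] * 50 + [-1] * 50)
w, b = train_linear_svm(X, y, C=1.0)
print("training accuracy:", np.mean(np.sign(X @ w + b) == y))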

Robustness of Soft vs Hard Margin SVMs
(Figure: the soft-margin and hard-margin SVM solutions compared on the same data in the Var1-Var2 plane.)

Disadvantages of Linear Decision Surfaces / Advantages of Nonlinear Surfaces
(Figures.)

Linear Classifiers in High-Dimensional Spaces
Find a function Φ(x) that maps the data to a different space.
(Figure: data in the Var1-Var2 plane and the same data mapped to the plane of Constructed Feature 1 vs. Constructed Feature 2.)

Mapping Data to a High-Dimensional Space
Find a function Φ(x) that maps the data to a different space; the SVM formulation stays the same, except that the data appear as Φ(x) and the weights w are now weights in the new space.
Explicit mapping is expensive if Φ(x) is very high dimensional.
Solving the problem without explicitly mapping the data is desirable.

The Dual of the SVM Formulation
Original SVM formulation: n inequality constraints, n positivity constraints, n variables (the slack variables ξi).
The (Wolfe) dual of this problem: one equality constraint, n positivity constraints, n variables (the Lagrange multipliers), and a more complicated objective function.
NOTE: the data only appear as inner products Φ(xi)·Φ(xj).
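The dual described above is, in the standard derivation, the following quadratic program over the Lagrange multipliers αi (one per training point). It is written out here for reference, since the slide text only describes its shape; the soft-margin box constraint is noted as an aside.

\[
\begin{aligned}
\max_{\boldsymbol{\alpha}}\quad & \sum_{i=1}^{n}\alpha_i \;-\; \tfrac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j\,y_i y_j\,\Phi(\mathbf{x}_i)\cdot\Phi(\mathbf{x}_j)\\[2pt]
\text{subject to}\quad & \sum_{i=1}^{n}\alpha_i y_i = 0,\qquad \alpha_i \ge 0
\quad(\text{or } 0 \le \alpha_i \le C \text{ in the soft-margin case}),
\end{aligned}
\]
with \(\mathbf{w} = \sum_i \alpha_i y_i \Phi(\mathbf{x}_i)\). As the slide notes, the data enter only through the inner products \(\Phi(\mathbf{x}_i)\cdot\Phi(\mathbf{x}_j)\).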

The Kernel Trick
Φ(xi)·Φ(xj) means: map the data into the new space, then take the inner product of the new vectors.
If we can find a function such that K(xi, xj) = Φ(xi)·Φ(xj), i.e., the image of the inner product of the data is the inner product of the images of the data, then we do not need to explicitly map the data into the high-dimensional space to solve the optimization problem.

Example
X = [x z], Φ(X) = [x², z², √2·xz]
f(X) = sign(w1·x² + w2·z² + w3·xz + b)

Example
X1 = [x1 z1], X2 = [x2 z2]
Φ(X1) = [x1², z1², √2·x1z1], Φ(X2) = [x2², z2², √2·x2z2]
Φ(X1)·Φ(X2) = x1²x2² + z1²z2² + 2·x1x2z1z2 = (x1x2 + z1z2)² = (X1·X2)²
Computing Φ(X1)·Φ(X2) explicitly is expensive, O(d²); computing (X1·X2)² directly is efficient, O(d).
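The identity above is easy to check numerically. The sketch below, illustrative and not from the slides, confirms for 2-D inputs that the kernel (X1·X2)² equals the inner product of the explicit features [x², z², √2·xz]; the helper names phi and quad_kernel are assumptions.

import numpy as np

def phi(v):
    """Explicit feature map [x^2, z^2, sqrt(2)*x*z] for a 2-D input [x, z]."""
    x, z = v
    return np.array([x * x, z * z, np.sqrt(2) * x * z])

def quad_kernel(u, v):
    """k(u, v) = (u . v)^2, computed without ever forming phi(u) or phi(v)."""
    return float(np.dot(u, v)) ** 2

X1 = np.array([1.0, 2.0])
X2 = np.array([3.0, -1.0])

explicit = float(np.dot(phi(X1), phi(X2)))   # build the O(d^2)-size features, then dot them
implicit = quad_kernel(X1, X2)               # O(d) work, same number
print(explicit, implicit)                    # both equal (1*3 + 2*(-1))^2 = 1.0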

Kernel Trick
Kernel function: a symmetric function k : R^d × R^d → R.
Inner-product kernels: additionally, k(x, z) = Φ(x)ᵀΦ(z).
Example: the explicit feature map above costs O(d²), while evaluating the kernel costs O(d).
The kernel trick implements an infinite-dimensional mapping implicitly: only inner products are explicitly needed for training and evaluation, and those inner products are computed efficiently, in finite dimensions.
The underlying mathematical theory is that of reproducing kernel Hilbert spaces, from functional analysis.

Kernel Methods
If a linear algorithm can be expressed only in terms of inner products, it can be "kernelized": a linear pattern in the high-dimensional space is a nonlinear relation in the original space.
The specific kernel function determines the nonlinearity.

Kernels
Some simple kernels:
- Linear kernel: k(x, z) = xᵀz (equivalent to the linear algorithm)
- Polynomial kernel: k(x, z) = (1 + xᵀz)^d (polynomial decision rules)
- RBF kernel: k(x, z) = exp(-||x - z||²/2σ²) (highly nonlinear decisions)

Gaussian Kernel: Example
(Figure: a nonlinear decision boundary in the original space that is a hyperplane in some space.)

Kernel Matrix
Kij = k(xi, xj): the kernel matrix K defines all pairwise inner products.
Mercer's theorem: K is positive semidefinite.
Conversely, any symmetric positive semidefinite matrix can be regarded as an inner-product matrix in some space.
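To connect the kernel definitions above with the Mercer condition, here is a small illustrative sketch, not from the slides, that builds the Gram matrix Kij = k(xi, xj) for the RBF kernel on random points and checks that its eigenvalues are nonnegative; the bandwidth sigma and the toy data are assumptions.

import numpy as np

def rbf_kernel(x, z, sigma=1.0):
    """k(x, z) = exp(-||x - z||^2 / (2 sigma^2)), as on the Kernels slide."""
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 5))                   # 30 toy points in R^5

# Gram (kernel) matrix of all pairwise kernel values.
K = np.array([[rbf_kernel(xi, xj) for xj in X] for xi in X])

# Mercer: K is symmetric positive semidefinite, so every eigenvalue is >= 0
# (up to numerical round-off).
eigvals = np.linalg.eigvalsh(K)
print("smallest eigenvalue:", eigvals.min())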

Kernel-Based Learning
Data (xi, yi) → Embedding via k(x, y) or the kernel matrix K → Linear algorithm.
Viewed modularly: Data → Kernel design → Kernel algorithm.

Kernel Design
Simple kernels on vector data; more advanced:
- string kernels
- diffusion kernels
- kernels over general structures (sets, trees, graphs, ...)
- kernels derived from graphical models
- the empirical kernel map

Methods
(The overview slide appears again, introducing Part IV: Decision Models: Decision Trees, Boosted Decision Trees, Random Forest.)

From Tom Mitchell's slides:

Spatial example: recursive binary splits
(Figure sequence: a cloud of labeled points in the plane is partitioned by a sequence of axis-parallel binary splits.)
Once the regions are chosen, class probabilities are easy to calculate, e.g. pm = 5/6 in one of the regions.

How to choose a split
(Figure: a candidate split dividing the points into regions C1 and C2.)
Impurity measures L(p):
- Information gain (entropy): -p log p - (1 - p) log(1 - p)
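To make split selection concrete, here is a minimal sketch, not from the slides, that computes the entropy impurity -p log p - (1 - p) log(1 - p) and picks the threshold on a single feature with the largest information gain; the helper names entropy and best_split and the toy data are illustrative assumptions.

import numpy as np

def entropy(labels):
    """Impurity -p*log2(p) - (1-p)*log2(1-p) of a binary (0/1) label array."""
    p = labels.mean()
    if p in (0.0, 1.0):
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def best_split(x, y):
    """Return the threshold on feature x with the largest information gain."""
    best_gain, best_t = -1.0, None
    for t in np.unique(x)[:-1]:                # candidate cut points
        left, right = y[x <= t], y[x > t]
        children = (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
        gain = entropy(y) - children
        if gain > best_gain:
            best_gain, best_t = gain, t
    return best_t, best_gain

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([0, 0, 0, 1, 1, 1])
print(best_split(x, y))                        # splits at 3.0 with a gain of 1 bit

A tree-growing procedure applies such a split recursively inside each resulting region, which is exactly the recursive binary splitting shown in the figures above.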
