Summary of using the machine learning tool WEKA: algorithm selection, attribute selection, and parameter optimization
I. Attribute selection

1. Theory

See the following two articles: 數(shù)據(jù)挖掘中的特征選擇算法綜述及基于WEKA的性能比較 (陳良龍) and 數(shù)據(jù)挖掘中約簡(jiǎn)技術(shù)與屬性選擇的研究 (劉輝).

2. Attribute selection in WEKA

2.1 Evaluation strategies (attribute evaluator)

Broadly these fall into filter and wrapper methods; as used here, the former score individual attributes one at a time, while the latter score whole attribute subsets. Wrapper (subset) methods include CfsSubsetEval and WrapperSubsetEval; filter (single-attribute) methods include CorrelationAttributeEval, GainRatioAttributeEval, InfoGainAttributeEval, OneRAttributeEval, PrincipalComponents, ReliefFAttributeEval, and SymmetricalUncertAttributeEval.

2.1.1 Wrapper (subset) methods

(1) CfsSubsetEval: scores an attribute subset by the predictive ability of each feature in it together with the redundancy among them; subsets whose features are individually predictive but mutually uncorrelated score well. Evaluates the worth of a subset of attributes by considering the individual predictive ability of each feature along with the degree of redundancy between them. Subsets of features that are highly correlated with the class while having low intercorrelation are preferred. For more information see: M. A. Hall (1998). Correlation-based Feature Subset Selection for Machine Learning. Hamilton, New Zealand.

(2) WrapperSubsetEval: in the wrapper approach, the downstream learning algorithm is embedded in the selection process, and a candidate subset is judged by that algorithm's predictive performance on it, with little attention paid to the predictive power of each individual feature. The features in the best subset therefore need not be individually optimal. Evaluates attribute sets by using a learning scheme. Cross validation is used to estimate the accuracy of the learning scheme for a set of attributes. For more information see: Ron Kohavi, George H. John (1997). Wrappers for feature subset selection. Artificial Intelligence 97(1-2):273-324.

2.1.2 Filter (single-attribute) methods: if one of these evaluators is chosen, the search method must be Ranker.

(1) CorrelationAttributeEval: selects attributes by the correlation between each attribute and the class. Evaluates the worth of an attribute by measuring the correlation (Pearson's) between it and the class. Nominal attributes are considered on a value by value basis by treating each value as an indicator. An overall correlation for a nominal attribute is arrived at via a weighted average.

(2) GainRatioAttributeEval: selects attributes by gain ratio. Evaluates the worth of an attribute by measuring the gain ratio with respect to the class. GainR(Class, Attribute) = (H(Class) - H(Class | Attribute)) / H(Attribute).

(3) InfoGainAttributeEval: selects attributes by information gain. Evaluates the worth of an attribute by measuring the information gain with respect to the class. InfoGain(Class, Attribute) = H(Class) - H(Class | Attribute).

(4) OneRAttributeEval: evaluates attributes with the OneR classifier. Class for building and using a 1R classifier; in other words, uses the minimum-error attribute for prediction, discretizing numeric attributes. For more information, see: R. C. Holte (1993). Very simple classification rules perform well on most commonly used datasets. Machine Learning 11:63-91.

(5) PrincipalComponents: principal component analysis (PCA). Performs a principal components analysis and transformation of the data. Use in conjunction with a Ranker search. Dimensionality reduction is accomplished by choosing enough eigenvectors to account for some percentage of the variance in the original data (default 0.95, i.e. 95%). Attribute noise can be filtered by transforming to the PC space, eliminating some of the worst eigenvectors, and then transforming back to the original space.

(6) ReliefFAttributeEval: evaluates attributes by their ReliefF score. Evaluates the worth of an attribute by repeatedly sampling an instance and considering the value of the given attribute for the nearest instance of the same and different class. Can operate on both discrete and continuous class data. For more information see: Kenji Kira, Larry A. Rendell: A Practical Approach to Feature Selection. In: Ninth International Workshop on Machine Learning, 249-256, 1992. Igor Kononenko: Estimating Attributes: Analysis and Extensions of RELIEF. In: European Conference on Machine Learning, 171-182, 1994. Marko Robnik-Sikonja, Igor Kononenko: An adaptation of Relief for attribute estimation in regression. In: Fourteenth International Conference on Machine Learning, 296-304, 1997.

(7) SymmetricalUncertAttributeEval: evaluates attributes by their symmetrical uncertainty with respect to the class. Evaluates the worth of an attribute by measuring the symmetrical uncertainty with respect to the class. SymmU(Class, Attribute) = 2 * (H(Class) - H(Class | Attribute)) / (H(Class) + H(Attribute)).
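Two brief illustrations of the evaluators above. First, a self-contained check of the InfoGain and GainRatio formulas using the textbook numbers for the 14-instance weather.nominal toy dataset that ships with WEKA (class counts 9 yes / 5 no; the outlook attribute splits the data 5/4/5); the dataset choice is only for illustration:

    public class InfoGainDemo {
        // entropy in bits of a distribution given as raw counts
        static double entropy(double... counts) {
            double total = 0, h = 0;
            for (double c : counts) total += c;
            for (double c : counts) {
                if (c > 0) {
                    double p = c / total;
                    h -= p * Math.log(p) / Math.log(2);
                }
            }
            return h;
        }

        public static void main(String[] args) {
            double hClass = entropy(9, 5);                            // H(Class), about 0.940
            double hClassGivenOutlook = (5.0 / 14) * entropy(2, 3)    // sunny: 2 yes / 3 no
                                      + (4.0 / 14) * entropy(4, 0)    // overcast: 4 yes / 0 no
                                      + (5.0 / 14) * entropy(3, 2);   // rainy: 3 yes / 2 no, about 0.694
            double hOutlook = entropy(5, 4, 5);                       // H(Attribute), about 1.577
            double infoGain = hClass - hClassGivenOutlook;            // about 0.247
            double gainRatio = infoGain / hOutlook;                   // about 0.156
            System.out.printf("InfoGain = %.3f, GainRatio = %.3f%n", infoGain, gainRatio);
        }
    }

These match the values commonly quoted for the outlook attribute in this example. Second, a minimal sketch of how any of the single-attribute evaluators is combined with the Ranker search through the weka.attributeSelection API (the file name and the last-attribute class index are assumptions):

    import weka.attributeSelection.AttributeSelection;
    import weka.attributeSelection.InfoGainAttributeEval;
    import weka.attributeSelection.Ranker;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class RankAttributes {
        public static void main(String[] args) throws Exception {
            // load the data and treat the last attribute as the class (assumption)
            Instances data = DataSource.read("dataset.arff");
            data.setClassIndex(data.numAttributes() - 1);

            // single-attribute evaluator plus Ranker search, as required for filter evaluators
            AttributeSelection selector = new AttributeSelection();
            selector.setEvaluator(new InfoGainAttributeEval());
            selector.setSearch(new Ranker());
            selector.SelectAttributes(data);

            // prints the ranked list of attributes with their scores
            System.out.println(selector.toResultsString());
        }
    }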

2.2 Search strategies (Search Method)

2.2.1 Search methods used with the wrapper (subset) evaluators

(1) BestFirst: best-first search, a greedy search strategy. Searches the space of attribute subsets by greedy hillclimbing augmented with a backtracking facility. Setting the number of consecutive non-improving nodes allowed controls the level of backtracking done. Best first may start with the empty set of attributes and search forward, or start with the full set of attributes and search backward, or start at any point and search in both directions (by considering all possible single attribute additions and deletions at a given point).

(2) ExhaustiveSearch: exhaustively searches all possible attribute subsets. Performs an exhaustive search through the space of attribute subsets starting from the empty set of attributes. Reports the best subset found.

(3) GeneticSearch: search based on the simple genetic algorithm proposed by Goldberg in 1989. Performs a search using the simple genetic algorithm described in Goldberg (1989). For more information see: David E. Goldberg (1989). Genetic algorithms in search, optimization and machine learning. Addison-Wesley.

(4) GreedyStepwise: stepwise forward or backward search. Performs a greedy forward or backward search through the space of attribute subsets. May start with no/all attributes or from an arbitrary point in the space. Stops when the addition/deletion of any remaining attributes results in a decrease in evaluation. Can also produce a ranked list of attributes by traversing the space from one side to the other and recording the order that attributes are selected.

(5) RandomSearch: random search. Performs a Random search in the space of attribute subsets. If no start set is supplied, Random search starts from a random point and reports the best subset found. If a start set is supplied, Random searches randomly for subsets that are as good or better than the start point with the same or fewer attributes. Using RandomSearch in conjunction with a start set containing all attributes equates to the LVF algorithm of Liu and Setiono (ICML-96). For more information see: H. Liu, R. Setiono: A probabilistic approach to feature selection - A filter solution. In: 13th International Conference on Machine Learning, 319-327, 1996.

(6) RankSearch: uses an evaluator to score the attributes and sort them. Uses an attribute/subset evaluator to rank all attributes. If a subset evaluator is specified, then a forward selection search is used to generate a ranked list. From the ranked list of attributes, subsets of increasing size are evaluated, i.e. the best attribute, the best attribute plus the next best attribute, etc. The best attribute set is reported. RankSearch is linear in the number of attributes if a simple attribute evaluator is used such as GainRatioAttributeEval. For more information see: Mark Hall, Geoffrey Holmes (2003). Benchmarking attribute selection techniques for discrete class data mining. IEEE Transactions on Knowledge and Data Engineering 15(6):1437-1447.

2.2.2 Search method used with the filter (single-attribute) evaluators

(1) Ranker: sorts the attributes by their individual scores; used together with the filter evaluation strategies. Ranks attributes by their individual evaluations. Use in conjunction with attribute evaluators (ReliefF, GainRatio, Entropy etc).

3. My summary

For a given algorithm and parameter setting, combining the WrapperSubsetEval evaluation strategy with the ExhaustiveSearch search strategy guarantees finding the attribute subset that is optimal for that algorithm and parameter setting. The computation is slow, however, and its cost grows exponentially with the number of attributes. A sketch of this combination follows.
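A minimal sketch of that wrapper-plus-exhaustive-search combination through the weka.attributeSelection API (the dataset name, J48 as the wrapped classifier, and 5 cross-validation folds are assumptions made for illustration):

    import weka.attributeSelection.AttributeSelection;
    import weka.attributeSelection.ExhaustiveSearch;
    import weka.attributeSelection.WrapperSubsetEval;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class WrapperExhaustiveSelection {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("dataset.arff");   // assumed file name
            data.setClassIndex(data.numAttributes() - 1);

            // wrapper evaluator: judges every subset by the cross-validated accuracy of J48
            WrapperSubsetEval eval = new WrapperSubsetEval();
            eval.setClassifier(new J48());
            eval.setFolds(5);

            // exhaustive search tries every subset, so this is only feasible for few attributes
            AttributeSelection selector = new AttributeSelection();
            selector.setEvaluator(eval);
            selector.setSearch(new ExhaustiveSearch());
            selector.SelectAttributes(data);

            System.out.println(selector.toResultsString());
        }
    }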

II. Parameter optimization

For a given algorithm, parameters can be optimized with one of three methods: CVParameterSelection, GridSearch, or MultiSearch.

1. CVParameterSelection

Selects optimized parameter values by cross-validation.
Advantage: any number of parameters can be optimized.
Disadvantages: with many parameters the number of parameter combinations can explode; only the classifier's direct parameters can be optimized, not the parameters of an embedded scheme. For example, it can optimize parameter C of weka.classifiers.functions.SMO, but it cannot optimize parameter C of a weka.classifiers.functions.SMO embedded inside weka.classifiers.meta.FilteredClassifier.

Example: optimizing the confidence factor C of J48 (a code sketch follows the steps).
- Load the dataset;
- choose weka.classifiers.meta.CVParameterSelection as the classifier;
- choose weka.classifiers.trees.J48 as its base classifier;
- set the parameter-optimization string to "C 0.1 0.5 5" (optimize parameter C over the range 0.1 to 0.5 at 5 evenly spaced values, i.e. 0.1, 0.2, 0.3, 0.4, 0.5);
- run; the last line of the output shows the optimized parameter.
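The same setup can be driven from code. A minimal sketch (the dataset path is an assumption; the parameter string is exactly the one used above):

    import weka.classifiers.meta.CVParameterSelection;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class TuneJ48Confidence {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("dataset.arff");   // assumed file name
            data.setClassIndex(data.numAttributes() - 1);

            // wrap J48 and search its confidence factor C over 5 values in [0.1, 0.5]
            CVParameterSelection ps = new CVParameterSelection();
            ps.setClassifier(new J48());
            ps.addCVParameter("C 0.1 0.5 5");
            ps.buildClassifier(data);

            // options of the best setting found, e.g. "-C 0.3 ..."
            System.out.println(String.join(" ", ps.getBestClassifierOptions()));
        }
    }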

2. GridSearch

Performs a grid search over parameter values rather than trying every possible parameter combination.
Advantages: in theory, with the same optimization range and settings, GridSearch should be faster than CVParameterSelection; it is not limited to the classifier's direct parameters and can also optimize parameters of an embedded scheme; one of the two optimized parameters may belong to a filter, which is why the property expressions carry the prefix classifier. or filter.; it supports automatic extension of the search range.
Disadvantage: at most 2 parameters can be optimized.

Example: optimizing the parameters of SMO with an RBFKernel (a code sketch follows the steps).
- Load the dataset;
- choose GridSearch as the classifier;
- set GridSearch's classifier to weka.classifiers.functions.SMO, with kernel weka.classifiers.functions.supportVector.RBFKernel;
- set the X parameter: XProperty: classifier.c, XMin: 1, XMax: 16, XStep: 1, XExpression: I. This means: optimize parameter c over the range 1 to 16 with step 1;
- set the Y parameter: YProperty: classifier.kernel.gamma, YMin: -5, YMax: 2, YStep: 1, YBase: 10, YExpression: pow(BASE,I). This means: optimize parameter kernel.gamma over the values 10^-5, 10^-4, ..., 10^2;
- run; the last line of the output shows the optimized parameters.
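A rough code equivalent of those settings, assuming the GridSearch package (weka.classifiers.meta.GridSearch) is installed and that its bean setters mirror the property names shown above (setXProperty, setXMin, and so on); the setter names are assumptions, so check the Javadoc of the installed version:

    import weka.classifiers.functions.SMO;
    import weka.classifiers.functions.supportVector.RBFKernel;
    import weka.classifiers.meta.GridSearch;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class TuneSmoRbf {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("dataset.arff");   // assumed file name
            data.setClassIndex(data.numAttributes() - 1);

            SMO smo = new SMO();
            smo.setKernel(new RBFKernel());

            GridSearch grid = new GridSearch();
            grid.setClassifier(smo);

            // X axis: SMO's complexity constant c, values 1, 2, ..., 16
            grid.setXProperty("classifier.c");
            grid.setXMin(1);
            grid.setXMax(16);
            grid.setXStep(1);
            grid.setXExpression("I");

            // Y axis: the RBF kernel's gamma, values 10^-5, 10^-4, ..., 10^2
            grid.setYProperty("classifier.kernel.gamma");
            grid.setYMin(-5);
            grid.setYMax(2);
            grid.setYStep(1);
            grid.setYBase(10);
            grid.setYExpression("pow(BASE,I)");

            grid.buildClassifier(data);
            System.out.println(grid);   // prints the search result (assumed to include the best parameter pair)
        }
    }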

3. MultiSearch

Similar to grid search, but more general and simpler.
Advantages: not limited to the classifier's direct parameters; it can also optimize parameters of an embedded scheme or of a filter; any number of parameters can be optimized.
Disadvantage: it does not support automatic extension of the parameter bounds.

4. My summary

If no more than 2 parameters need to be optimized, use GridSearch and enable automatic bound extension; if more than 2 parameters need to be optimized, use MultiSearch; if only the classifier's direct parameters are optimized and there are at most 2 of them, CVParameterSelection is also worth considering.

III. WEKA's meta algorithms

1. Algorithms and descriptions

LWL (locally weighted learning);
AdaBoostM1: the AdaBoost method;
AdditiveRegression: GBRT (gradient boosting regression trees), a boosting algorithm that trains a cascade of models; each later model concentrates on the residual between the predictions of all earlier models and the actual values, a new model is trained on that residual, and at prediction time the cascaded residual estimates are summed;
AttributeSelectedClassifier: combines attribute selection with a classifier, running attribute selection first and classification or regression afterwards;
Bagging: the bagging method;
ClassificationViaRegression: performs classification by means of regression;
LogitBoost: a boosting algorithm that uses regression for classification;
MultiClassClassifier: performs multi-class classification with two-class classifiers;
RandomCommittee: averages the results of randomized base classifiers;
RandomSubspace;
FilteredClassifier: combines a filter with a classifier, filtering first and then classifying or regressing (not in Auto-WEKA);
MultiScheme: among several specified classifiers or parameter configurations, selects the best one, much like the Experimenter (not in Auto-WEKA);
RandomizableFilteredClassifier: a variant of FilteredClassifier that is useful for RandomCommittee-style ensemble classifiers; it requires both the filter and the classifier to implement the Randomizable interface (not in Auto-WEKA);
Vote;
Stacking.

2. My summary

The meta package provides many methods that take a base classifier as input. Among them, AdaBoostM1 and Bagging are the most commonly used meta methods; MultiScheme offers functionality similar to the Experimenter; and AttributeSelectedClassifier conveniently combines attribute selection with a classifier (see the sketch below).
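A short sketch of the two patterns called out above, boosting J48 with AdaBoostM1 and wrapping attribute selection around J48 with AttributeSelectedClassifier; the dataset path, the choice of J48, and the CfsSubsetEval/BestFirst pair are assumptions made for illustration:

    import weka.attributeSelection.BestFirst;
    import weka.attributeSelection.CfsSubsetEval;
    import weka.classifiers.Classifier;
    import weka.classifiers.Evaluation;
    import weka.classifiers.meta.AdaBoostM1;
    import weka.classifiers.meta.AttributeSelectedClassifier;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    import java.util.Random;

    public class MetaClassifierExamples {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("dataset.arff");   // assumed file name
            data.setClassIndex(data.numAttributes() - 1);

            // AdaBoostM1 with J48 as the boosted base classifier
            AdaBoostM1 boost = new AdaBoostM1();
            boost.setClassifier(new J48());

            // attribute selection (CfsSubsetEval + BestFirst) followed by J48
            AttributeSelectedClassifier asc = new AttributeSelectedClassifier();
            asc.setEvaluator(new CfsSubsetEval());
            asc.setSearch(new BestFirst());
            asc.setClassifier(new J48());

            // 10-fold cross-validation of each setup
            for (Classifier c : new Classifier[] { boost, asc }) {
                Evaluation eval = new Evaluation(data);
                eval.crossValidateModel(c, data, 10, new Random(1));
                System.out.println(c.getClass().getSimpleName() + ": "
                        + String.format("%.2f%% correct", eval.pctCorrect()));
            }
        }
    }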

IV. Auto-WEKA

Auto-WEKA supports automatic selection of attributes, algorithms, and parameters.

1. Attribute selection

Attribute selection is run as a data-preprocessing step before classification or regression. The evaluation and search strategies available for attribute selection in Auto-WEKA are shown in a figure (entries marked with * are search strategies, the rest are evaluation strategies). Notably, the complete search that combines the WrapperSubsetEval evaluation strategy with the ExhaustiveSearch search strategy is not included.

2. Algorithm selection

Auto-WEKA includes 39 classification and regression algorithms in total: 27 base classifiers, 10 meta classifiers, and 2 ensemble classifiers. A meta classifier can take any one base classifier as input; an ensemble classifier can take up to 5 base classifiers as input.

The 27 base classifiers are:
- Bayes (3): BayesNet, NaiveBayes, and NaiveBayesMultinomial;
- Functions (9): GaussianProcesses, LinearRegression, LogisticRegression, SingleLayerPerceptron, SGD, SVM, SimpleLinearRegression, SimpleLogisticRegression, and VotedPerceptron. Note that the commonly used MultilayerPerceptron, RBFClassifier, and RBFNetwork are absent;
- Lazy (2): KNN and KStar(*);
- Rules (6): DecisionTables, RIPPER, M5Rules, 1-R, PART, and 0-R;
- Trees (7): DecisionStump, C4.5DecisionTrees, LogisticModelTree, M5Tree, RandomForest, RandomTree, and REPTree.

The 10 meta classifiers are the ones described in section III: LWL, AdaBoostM1, AdditiveRegression, AttributeSelectedClassifier, Bagging, ClassificationViaRegression, LogitBoost, MultiClassClassifier, RandomCommittee, and RandomSubspace.

The 2 ensemble methods are Vote and Stacking.

3. My summary

Auto-WEKA has two shortcomings: its attribute selection does not include the complete search that combines the WrapperSubsetEval evaluation strategy with the ExhaustiveSearch search strategy, and it lacks the commonly used MultilayerPerceptron, RBFClassifier, and RBFNetwork classifiers.

V. Overall summary

1. Selection and optimization of attributes, algorithms, and parameters

For an unfamiliar dataset, to obtain a good classifier, parameter setting, and attribute subset quickly, proceed as follows:
- Auto-WEKA: its selection range covers most classifiers and attribute-selection strategies, but excludes classifiers such as MultilayerPerceptron, RBFClassifier, and RBFNetwork, as well as the complete-search attribute-selection strategy;
- supplement with commonly used classifiers and their parameter and attribute selection: to cover those gaps, run attribute selection and parameter optimization for the commonly used classifiers, i.e. MultilayerPerceptron, RBFClassifier, RBFNetwork, BayesNet, NaiveBayes, SMO or SVM, and LinearRegression, picking one of them or trying them one by one;
- attribute selection for a specific classifier: select exp…
