Weka [25] Bagging Source Code Analysis

Author: Koala++ / Qu Wei

Let me first translate a short introduction to bagging. Breiman's bagging algorithm, short for "bootstrap aggregating", is one of the earliest ensemble algorithms. It is also one of the most direct and easiest to implement, and it achieves surprisingly good results. Diversity in bagging comes from drawing training samples with replacement: several subsets of the training data are generated at random in this way, a classifier of the same type is trained on each subset, and the final classification is produced by a majority vote over the individual classifiers' results. The original English text:

"Breiman's bagging, short for bootstrap aggregating, is one of the earliest ensemble based algorithms. It is also one of the most intuitive and simplest to implement, with a surprisingly good performance. Diversity in bagging is obtained by using bootstrapped replicas of the training data: different training data subsets are randomly drawn with replacement from the entire training data. Each training data subset is used to train a different classifier of the same type. Individual classifiers are then combined by taking a majority vote of their decisions. For any given instance, the class chosen by most classifiers is the ensemble decision."

The Bagging class lives in the weka.classifiers.meta package. Bagging extends RandomizableIteratedSingleClassifierEnhancer, which extends IteratedSingleClassifierEnhancer, which extends SingleClassifierEnhancer, which in turn extends Classifier. (My UML tool seems to have expired; I will add a diagram when I get the chance.)

First, the constructor:

```java
public Bagging() {
    m_Classifier = new weka.classifiers.trees.REPTree();
}
```
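Before diving into Weka's implementation, the procedure from the introduction can be sketched in a few lines of plain Java. This is a toy, self-contained illustration and not Weka code: the "classifier" here simply memorizes the majority label of its bootstrap replica, and the class and method names are my own.

```java
import java.util.Random;

// Toy bagging sketch: bootstrap-sample the labels, "train" one trivial
// model per replica, then majority-vote the models' outputs.
public class BaggingSketch {

    // Trivial base learner: remembers the majority of its 0/1 training labels.
    static int trainMajority(int[] labels) {
        int ones = 0;
        for (int y : labels) ones += y;
        return (ones * 2 >= labels.length) ? 1 : 0;
    }

    // One bootstrap replica: n draws with replacement from the labels.
    static int[] bootstrap(int[] labels, Random rnd) {
        int[] bag = new int[labels.length];
        for (int i = 0; i < bag.length; i++) {
            bag[i] = labels[rnd.nextInt(labels.length)];
        }
        return bag;
    }

    // Train numIterations toy models on replicas and majority-vote them.
    public static int baggedPredict(int[] labels, int numIterations, long seed) {
        Random rnd = new Random(seed);
        int votesForOne = 0;
        for (int j = 0; j < numIterations; j++) {
            votesForOne += trainMajority(bootstrap(labels, rnd));
        }
        return (votesForOne * 2 >= numIterations) ? 1 : 0;
    }
}
```

Each replica sees a slightly different view of the data, which is the entire source of diversity in the ensemble.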

We can see that the default base classifier is REPTree. Next, the buildClassifier method:

```java
// can classifier handle the data?
getCapabilities().testWithFail(data);

// remove instances with missing class
data = new Instances(data);
data.deleteWithMissingClass();

super.buildClassifier(data);

if (m_CalcOutOfBag && (m_BagSizePercent != 100)) {
    throw new IllegalArgumentException("Bag size needs to be 100% if "
        + "out-of-bag error is to be calculated!");
}

m_Classifiers = Classifier.makeCopies(m_Classifier, m_NumIterations);
```

Here m_Classifier is copied m_NumIterations times into the m_Classifiers array.

```java
int bagSize = data.numInstances() * m_BagSizePercent / 100;
Random random = new Random(m_Seed);

boolean[][] inBag = null;
```

bagSize is the size of one bag, that is, how many instances it contains.
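One detail worth noticing in the bagSize computation is that Java integer division truncates, so the bag size is rounded down. A tiny standalone check (a hypothetical helper mirroring the arithmetic above, not a Weka method):

```java
// Mirrors the bagSize arithmetic: numInstances * percent / 100 with
// integer division, so any fractional instance is dropped.
public class BagSizeDemo {
    public static int bagSize(int numInstances, int bagSizePercent) {
        return numInstances * bagSizePercent / 100;  // truncating division
    }
}
```

For example, 33 instances at 90% yields a bag of 29 instances, not 29.7.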

```java
if (m_CalcOutOfBag)
    inBag = new boolean[m_Classifiers.length][];

for (int j = 0; j < m_Classifiers.length; j++) {
    Instances bagData = null;

    // create the in-bag dataset
    if (m_CalcOutOfBag) {
        inBag[j] = new boolean[data.numInstances()];
        bagData = resampleWithWeights(data, random, inBag[j]);
    } else {
        bagData = data.resampleWithWeights(random);
        if (bagSize < data.numInstances()) {
            bagData.randomize(random);
            Instances newBagData = new Instances(bagData, 0, bagSize);
            bagData = newBagData;
        }
    }

    if (m_Classifier instanceof Randomizable) {
        ((Randomizable) m_Classifiers[j]).setSeed(random.nextInt());
    }

    // build the classifier
    m_Classifiers[j].buildClassifier(bagData);
}
```
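The loop above draws each bag with replacement, which is why some instances appear several times in a bag while others are left out entirely; on average about 63.2% (1 - 1/e) of the distinct instances land in a bag, and the rest form the out-of-bag set. A self-contained sketch of marking which instances were drawn (plain java.util.Random, my own names, not Weka's resampleWithWeights):

```java
import java.util.Random;

// Sampling indices with replacement and recording which instances
// were drawn at least once, as the inBag[j] array does in Bagging.
public class BootstrapDemo {

    public static boolean[] markInBag(int n, Random rnd) {
        boolean[] inBag = new boolean[n];
        for (int i = 0; i < n; i++) {
            inBag[rnd.nextInt(n)] = true;  // instance drawn at least once
        }
        return inBag;
    }

    public static int countInBag(boolean[] inBag) {
        int count = 0;
        for (boolean b : inBag) if (b) count++;
        return count;
    }
}
```

With n = 1000 the in-bag count typically lands near 632, which is what makes the out-of-bag error estimate in the last part of buildClassifier possible.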

Let us set the m_CalcOutOfBag case aside for the moment; the crucial call is resampleWithWeights:

```java
/**
 * Creates a new dataset of the same size using random sampling with
 * replacement according to the current instance weights. The weights of
 * the instances in the new dataset are set to one.
 */
public Instances resampleWithWeights(Random random) {
    double[] weights = new double[numInstances()];
    for (int i = 0; i < weights.length; i++) {
        weights[i] = instance(i).weight();
    }
    return resampleWithWeights(random, weights);
}
```

As the comment says, this creates a new dataset of the same size by sampling with replacement according to the current instance weights, and the instances in the new dataset all get weight 1. The method first records the current weights and then hands the real work to an overloaded version.

Back in buildClassifier: the code next checks whether the dataset holds more instances than bagSize. If it does not, nothing interesting happens; if it does, bagData is shuffled once more and the first bagSize instances are kept. Then, if m_Classifier is an instance of Randomizable, it is given a fresh random seed. This point is important, and easy to forget when writing your own code. Finally the j-th classifier is trained.

Now let us look at the overloaded resampleWithWeights:

```java
public Instances resampleWithWeights(Random random, double[] weights) {
    if (weights.length != numInstances()) {
        throw new IllegalArgumentException("weights.length != numInstances.");
    }
    Instances newData = new Instances(this, numInstances());
    if (numInstances() == 0) {
        return newData;
    }
    double[] probabilities = new double[numInstances()];
    double sumProbs = 0, sumOfWeights = Utils.sum(weights);
    for (int i = 0; i < numInstances(); i++) {
        sumProbs += random.nextDouble();
        probabilities[i] = sumProbs;
    }
    Utils.normalize(probabilities, sumProbs / sumOfWeights);

    // Make sure that rounding errors don't mess things up
    probabilities[numInstances() - 1] = sumOfWeights;
    int k = 0;
    int l = 0;
    sumProbs = 0;
    while ((k < numInstances()) && (l < numInstances())) {
        if (weights[l] < 0) {
            throw new IllegalArgumentException("Weights have to be positive.");
        }
        sumProbs += weights[l];
        while ((k < numInstances()) && (probabilities[k] <= sumProbs)) {
            newData.add(instance(l));
            newData.instance(k).setWeight(1);
            k++;
        }
        l++;
    }
    return newData;
}
```

sumProbs is the running total of the random numbers drawn, so probabilities[i] is that total after the i-th draw.
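The cumulative-sum trick is easier to see in isolation. Below is a self-contained re-implementation of the same idea (plain Java, my own names, no Weka classes) that returns the sampled instance indices instead of building an Instances object:

```java
import java.util.Random;

// Weighted sampling with replacement via cumulative sums: sorted uniform
// draws are rescaled onto [0, sum of weights], then merged in one forward
// pass against the running weight total.
public class WeightedResample {

    public static int[] resampleIndices(double[] weights, Random random) {
        int n = weights.length;
        double sumOfWeights = 0;
        for (double w : weights) sumOfWeights += w;

        // Cumulative sums of n uniform draws form a nondecreasing sequence.
        double[] probabilities = new double[n];
        double sumProbs = 0;
        for (int i = 0; i < n; i++) {
            sumProbs += random.nextDouble();
            probabilities[i] = sumProbs;
        }
        // Rescale the draws so they live on [0, sumOfWeights].
        for (int i = 0; i < n; i++) {
            probabilities[i] *= sumOfWeights / sumProbs;
        }
        probabilities[n - 1] = sumOfWeights;  // guard against rounding error

        // Two-pointer merge: index l is emitted once for every draw that
        // falls inside its weight interval, i.e. sampling with replacement.
        int[] sampled = new int[n];
        int k = 0, l = 0;
        double running = 0;
        while (k < n && l < n) {
            if (weights[l] < 0) {
                throw new IllegalArgumentException("Weights have to be positive.");
            }
            running += weights[l];
            while (k < n && probabilities[k] <= running) {
                sampled[k] = l;
                k++;
            }
            l++;
        }
        return sampled;
    }
}
```

Because both the draws and the weight totals only grow, a single forward merge suffices; an instance with a large weight owns a wide interval and therefore catches many draws.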

The code of Utils.normalize is as follows:

```java
public static void normalize(double[] doubles, double sum) {
    if (Double.isNaN(sum)) {
        throw new IllegalArgumentException("Can't normalize array. Sum is NaN.");
    }
    if (sum == 0) {
        // Maybe this should just be a return.
        throw new IllegalArgumentException("Can't normalize array. Sum is zero.");
    }
    for (int i = 0; i < doubles.length; i++) {
        doubles[i] /= sum;
    }
}
```

This step lines the random numbers up with the weights: the raw draws are uniform on (0, 1), so their cumulative sums would not match the instance weights; dividing by sumProbs / sumOfWeights rescales them onto the range of the total weight. In the double loop that follows, sumProbs starts counting again, this time over the weights: after weights[l] is added, as long as probabilities[k] has not yet caught up with the new sumProbs, the same instance l is added over and over. This is how sampling with replacement is produced.
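To see concretely what the call Utils.normalize(probabilities, sumProbs / sumOfWeights) achieves, here is a small standalone restatement (the method body mirrors the Weka code above; the example values are mine): dividing by sumProbs / sumOfWeights maps the last cumulative draw sumProbs onto sumOfWeights.

```java
// In-place division by `sum`, as in weka.core.Utils.normalize: with
// sum = sumProbs / sumOfWeights, the last cumulative draw is mapped
// onto the total instance weight.
public class NormalizeDemo {
    public static void normalize(double[] doubles, double sum) {
        if (Double.isNaN(sum)) {
            throw new IllegalArgumentException("Can't normalize array. Sum is NaN.");
        }
        if (sum == 0) {
            throw new IllegalArgumentException("Can't normalize array. Sum is zero.");
        }
        for (int i = 0; i < doubles.length; i++) {
            doubles[i] /= sum;
        }
    }
}
```

For instance, cumulative draws {0.5, 1.2, 1.8} with total weight 3 are divided by 1.8 / 3 = 0.6, giving (up to rounding) {0.833, 2.0, 3.0}: the draws now span the whole weight range.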

Now consider the case where m_CalcOutOfBag is true. First there is a two-dimensional inBag array: the first dimension is the number of classifiers, the second the number of instances. Bagging also defines public final Instances resampleWithWeights(Instances data, Random random, boolean[] sampled). This method is almost the same as the one in Instances; the only addition is the statement sampled[l] = true, marking that instance l was drawn into the sample. Now look at the remaining part of buildClassifier. It looks long, but it is actually quite simple:

```java
// calc OOB error?
if (getCalcOutOfBag()) {
    double outOfBagCount = 0.0;
    double errorSum = 0.0;
    boolean numeric = data.classAttribute().isNumeric();

    for (int i = 0; i < data.numInstances(); i++) {
        double vote;
        double[] votes;
        if (numeric)
            votes = new double[1];
        else
            votes = new double[data.numClasses()];

        // determine predictions for instance
        int voteCount = 0;
        for (int j = 0; j < m_Classifiers.length; j++) {
            if (inBag[j][i])
                continue;
            voteCount++;
            double pred = m_Classifiers[j].classifyInstance(data.instance(i));
            if (numeric)
                votes[0] += pred;
            else
                votes[(int) pred]++;
        }

        // "vote"
        if (numeric) {
            vote = votes[0];
            if (voteCount > 0)
                vote /= voteCount;               // average
        } else {
            vote = Utils.maxIndex(votes);        // majority vote
        }

        // error for instance
        outOfBagCount += data.instance(i).weight();
        if (numeric) {
            errorSum += StrictMath.abs(vote - data.instance(i).classValue())
                * data.instance(i).weight();
        } else if (vote != data.instance(i).classValue()) {
            errorSum += data.instance(i).weight();
        }
    }
}
```
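The out-of-bag logic for a nominal class can be condensed into a standalone sketch. This is my own simplification, not Weka's code: it assumes unit instance weights and takes precomputed predictions, whereas Weka calls classifyInstance inside the loop.

```java
// Out-of-bag error for a nominal class: only classifiers whose bag
// did NOT contain instance i may vote on it.
public class OobSketch {

    // predictions[j][i] = class predicted by classifier j for instance i;
    // inBag[j][i]       = true if instance i was drawn into bag j.
    public static double oobErrorRate(int[][] predictions, boolean[][] inBag,
                                      int[] trueClass, int numClasses) {
        double outOfBagCount = 0, errorSum = 0;
        for (int i = 0; i < trueClass.length; i++) {
            double[] votes = new double[numClasses];
            int voteCount = 0;
            for (int j = 0; j < predictions.length; j++) {
                if (inBag[j][i]) continue;       // skip in-bag classifiers
                votes[predictions[j][i]]++;
                voteCount++;
            }
            if (voteCount == 0) continue;        // instance never out of bag
            int vote = 0;                        // majority vote (argmax)
            for (int c = 1; c < numClasses; c++) {
                if (votes[c] > votes[vote]) vote = c;
            }
            outOfBagCount += 1.0;                // unit instance weights
            if (vote != trueClass[i]) errorSum += 1.0;
        }
        return errorSum / outOfBagCount;
    }
}
```

Because every instance is evaluated only by classifiers that never saw it during training, this estimate behaves like a built-in cross-validation and requires no held-out data.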
