Weka [25] Bagging Source Code Analysis

Author: Koala++ / Qu Wei

Let me first translate a short introduction to bagging. Breiman's bagging algorithm, short for "bootstrap aggregating", is one of the earliest ensemble algorithms. It is also one of the most direct and easiest to implement, and it achieves surprisingly good results. Diversity in bagging comes from drawing training samples with replacement: several subsets of the training data are generated at random in this way, a classifier of the same type is trained on each subset, and the final classification is produced by a majority vote over the individual classifiers' results. The original English text:

"Breiman's bagging, short for bootstrap aggregating, is one of the earliest ensemble based algorithms. It is also one of the most intuitive and simplest to implement, with a surprisingly good performance. Diversity in bagging is obtained by using bootstrapped replicas of the training data: different training data subsets are randomly drawn with replacement from the entire training data. Each training data subset is used to train a different classifier of the same type. Individual classifiers are then combined by taking a majority vote of their decisions. For any given instance, the class chosen by most classifiers is the ensemble decision."

The Bagging class lives in the weka.classifiers.meta package. Bagging extends RandomizableIteratedSingleClassifierEnhancer, which extends IteratedSingleClassifierEnhancer, which extends SingleClassifierEnhancer, which in turn extends Classifier. (My UML tool seems to have expired; I will add a diagram when I get the chance.)

First, the constructor:

```java
public Bagging() {
    m_Classifier = new weka.classifiers.trees.REPTree();
}
```
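Before diving into Weka's implementation, the procedure from the introduction can be sketched in a few lines of plain Java. This is a toy, self-contained illustration and not Weka code: the "classifier" here simply memorizes the majority label of its bootstrap replica, and the class and method names are my own.

```java
import java.util.Random;

// Toy bagging sketch: bootstrap-sample the labels, "train" one trivial
// model per replica, then majority-vote the models' outputs.
public class BaggingSketch {

    // Trivial base learner: remembers the majority of its 0/1 training labels.
    static int trainMajority(int[] labels) {
        int ones = 0;
        for (int y : labels) ones += y;
        return (ones * 2 >= labels.length) ? 1 : 0;
    }

    // One bootstrap replica: n draws with replacement from the labels.
    static int[] bootstrap(int[] labels, Random rnd) {
        int[] bag = new int[labels.length];
        for (int i = 0; i < bag.length; i++) {
            bag[i] = labels[rnd.nextInt(labels.length)];
        }
        return bag;
    }

    // Train numIterations toy models on replicas and majority-vote them.
    public static int baggedPredict(int[] labels, int numIterations, long seed) {
        Random rnd = new Random(seed);
        int votesForOne = 0;
        for (int j = 0; j < numIterations; j++) {
            votesForOne += trainMajority(bootstrap(labels, rnd));
        }
        return (votesForOne * 2 >= numIterations) ? 1 : 0;
    }
}
```

Each replica sees a slightly different view of the data, which is the entire source of diversity in the ensemble.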

We can see that the default base classifier is REPTree. Next, the buildClassifier method:

```java
// can classifier handle the data?
getCapabilities().testWithFail(data);

// remove instances with missing class
data = new Instances(data);
data.deleteWithMissingClass();

super.buildClassifier(data);

if (m_CalcOutOfBag && (m_BagSizePercent != 100)) {
    throw new IllegalArgumentException("Bag size needs to be 100% if "
        + "out-of-bag error is to be calculated!");
}

m_Classifiers = Classifier.makeCopies(m_Classifier, m_NumIterations);
```

Here m_Classifier is copied m_NumIterations times into the m_Classifiers array.

```java
int bagSize = data.numInstances() * m_BagSizePercent / 100;
Random random = new Random(m_Seed);

boolean[][] inBag = null;
```

bagSize is the size of one bag, that is, how many instances it contains.
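One detail worth noticing in the bagSize computation is that Java integer division truncates, so the bag size is rounded down. A tiny standalone check (a hypothetical helper mirroring the arithmetic above, not a Weka method):

```java
// Mirrors the bagSize arithmetic: numInstances * percent / 100 with
// integer division, so any fractional instance is dropped.
public class BagSizeDemo {
    public static int bagSize(int numInstances, int bagSizePercent) {
        return numInstances * bagSizePercent / 100;  // truncating division
    }
}
```

For example, 33 instances at 90% yields a bag of 29 instances, not 29.7.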

```java
if (m_CalcOutOfBag)
    inBag = new boolean[m_Classifiers.length][];

for (int j = 0; j < m_Classifiers.length; j++) {
    Instances bagData = null;

    // create the in-bag dataset
    if (m_CalcOutOfBag) {
        inBag[j] = new boolean[data.numInstances()];
        bagData = resampleWithWeights(data, random, inBag[j]);
    } else {
        bagData = data.resampleWithWeights(random);
        if (bagSize < data.numInstances()) {
            bagData.randomize(random);
            Instances newBagData = new Instances(bagData, 0, bagSize);
            bagData = newBagData;
        }
    }

    if (m_Classifier instanceof Randomizable) {
        ((Randomizable) m_Classifiers[j]).setSeed(random.nextInt());
    }

    // build the classifier
    m_Classifiers[j].buildClassifier(bagData);
}
```
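The loop above draws each bag with replacement, which is why some instances appear several times in a bag while others are left out entirely; on average about 63.2% (1 - 1/e) of the distinct instances land in a bag, and the rest form the out-of-bag set. A self-contained sketch of marking which instances were drawn (plain java.util.Random, my own names, not Weka's resampleWithWeights):

```java
import java.util.Random;

// Sampling indices with replacement and recording which instances
// were drawn at least once, as the inBag[j] array does in Bagging.
public class BootstrapDemo {

    public static boolean[] markInBag(int n, Random rnd) {
        boolean[] inBag = new boolean[n];
        for (int i = 0; i < n; i++) {
            inBag[rnd.nextInt(n)] = true;  // instance drawn at least once
        }
        return inBag;
    }

    public static int countInBag(boolean[] inBag) {
        int count = 0;
        for (boolean b : inBag) if (b) count++;
        return count;
    }
}
```

With n = 1000 the in-bag count typically lands near 632, which is what makes the out-of-bag error estimate in the last part of buildClassifier possible.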

Let us set the m_CalcOutOfBag case aside for the moment; the crucial call is resampleWithWeights:

```java
/**
 * Creates a new dataset of the same size using random sampling with
 * replacement according to the current instance weights. The weights of
 * the instances in the new dataset are set to one.
 */
public Instances resampleWithWeights(Random random) {
    double[] weights = new double[numInstances()];
    for (int i = 0; i < weights.length; i++) {
        weights[i] = instance(i).weight();
    }
    return resampleWithWeights(random, weights);
}
```

As the comment says, this creates a new dataset of the same size by sampling with replacement according to the current instance weights, and the instances in the new dataset all get weight 1. The method first records the current weights and then hands the real work to an overloaded version.

Back in buildClassifier: the code next checks whether the dataset holds more instances than bagSize. If it does not, nothing interesting happens; if it does, bagData is shuffled once more and the first bagSize instances are kept. Then, if m_Classifier is an instance of Randomizable, it is given a fresh random seed. This point is important, and easy to forget when writing your own code. Finally the j-th classifier is trained.

Now let us look at the overloaded resampleWithWeights:

```java
public Instances resampleWithWeights(Random random, double[] weights) {
    if (weights.length != numInstances()) {
        throw new IllegalArgumentException("weights.length != numInstances.");
    }
    Instances newData = new Instances(this, numInstances());
    if (numInstances() == 0) {
        return newData;
    }
    double[] probabilities = new double[numInstances()];
    double sumProbs = 0, sumOfWeights = Utils.sum(weights);
    for (int i = 0; i < numInstances(); i++) {
        sumProbs += random.nextDouble();
        probabilities[i] = sumProbs;
    }
    Utils.normalize(probabilities, sumProbs / sumOfWeights);

    // Make sure that rounding errors don't mess things up
    probabilities[numInstances() - 1] = sumOfWeights;
    int k = 0;
    int l = 0;
    sumProbs = 0;
    while ((k < numInstances()) && (l < numInstances())) {
        if (weights[l] < 0) {
            throw new IllegalArgumentException("Weights have to be positive.");
        }
        sumProbs += weights[l];
        while ((k < numInstances()) && (probabilities[k] <= sumProbs)) {
            newData.add(instance(l));
            newData.instance(k).setWeight(1);
            k++;
        }
        l++;
    }
    return newData;
}
```

sumProbs is the running total of the random numbers drawn, so probabilities[i] is that total after the i-th draw.
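The cumulative-sum trick is easier to see in isolation. Below is a self-contained re-implementation of the same idea (plain Java, my own names, no Weka classes) that returns the sampled instance indices instead of building an Instances object:

```java
import java.util.Random;

// Weighted sampling with replacement via cumulative sums: sorted uniform
// draws are rescaled onto [0, sum of weights], then merged in one forward
// pass against the running weight total.
public class WeightedResample {

    public static int[] resampleIndices(double[] weights, Random random) {
        int n = weights.length;
        double sumOfWeights = 0;
        for (double w : weights) sumOfWeights += w;

        // Cumulative sums of n uniform draws form a nondecreasing sequence.
        double[] probabilities = new double[n];
        double sumProbs = 0;
        for (int i = 0; i < n; i++) {
            sumProbs += random.nextDouble();
            probabilities[i] = sumProbs;
        }
        // Rescale the draws so they live on [0, sumOfWeights].
        for (int i = 0; i < n; i++) {
            probabilities[i] *= sumOfWeights / sumProbs;
        }
        probabilities[n - 1] = sumOfWeights;  // guard against rounding error

        // Two-pointer merge: index l is emitted once for every draw that
        // falls inside its weight interval, i.e. sampling with replacement.
        int[] sampled = new int[n];
        int k = 0, l = 0;
        double running = 0;
        while (k < n && l < n) {
            if (weights[l] < 0) {
                throw new IllegalArgumentException("Weights have to be positive.");
            }
            running += weights[l];
            while (k < n && probabilities[k] <= running) {
                sampled[k] = l;
                k++;
            }
            l++;
        }
        return sampled;
    }
}
```

Because both the draws and the weight totals only grow, a single forward merge suffices; an instance with a large weight owns a wide interval and therefore catches many draws.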

The code of Utils.normalize is as follows:

```java
public static void normalize(double[] doubles, double sum) {
    if (Double.isNaN(sum)) {
        throw new IllegalArgumentException("Can't normalize array. Sum is NaN.");
    }
    if (sum == 0) {
        // Maybe this should just be a return.
        throw new IllegalArgumentException("Can't normalize array. Sum is zero.");
    }
    for (int i = 0; i < doubles.length; i++) {
        doubles[i] /= sum;
    }
}
```

This step lines the random numbers up with the weights: the raw draws are uniform on (0, 1), so their cumulative sums would not match the instance weights; dividing by sumProbs / sumOfWeights rescales them onto the range of the total weight. In the double loop that follows, sumProbs starts counting again, this time over the weights: after weights[l] is added, as long as probabilities[k] has not yet caught up with the new sumProbs, the same instance l is added over and over. This is how sampling with replacement is produced.
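To see concretely what the call Utils.normalize(probabilities, sumProbs / sumOfWeights) achieves, here is a small standalone restatement (the method body mirrors the Weka code above; the example values are mine): dividing by sumProbs / sumOfWeights maps the last cumulative draw sumProbs onto sumOfWeights.

```java
// In-place division by `sum`, as in weka.core.Utils.normalize: with
// sum = sumProbs / sumOfWeights, the last cumulative draw is mapped
// onto the total instance weight.
public class NormalizeDemo {
    public static void normalize(double[] doubles, double sum) {
        if (Double.isNaN(sum)) {
            throw new IllegalArgumentException("Can't normalize array. Sum is NaN.");
        }
        if (sum == 0) {
            throw new IllegalArgumentException("Can't normalize array. Sum is zero.");
        }
        for (int i = 0; i < doubles.length; i++) {
            doubles[i] /= sum;
        }
    }
}
```

For instance, cumulative draws {0.5, 1.2, 1.8} with total weight 3 are divided by 1.8 / 3 = 0.6, giving (up to rounding) {0.833, 2.0, 3.0}: the draws now span the whole weight range.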

Now consider the case where m_CalcOutOfBag is true. First there is a two-dimensional inBag array: the first dimension is the number of classifiers, the second the number of instances. Bagging also defines public final Instances resampleWithWeights(Instances data, Random random, boolean[] sampled). This method is almost the same as the one in Instances; the only addition is the statement sampled[l] = true, marking that instance l was drawn into the sample. Now look at the remaining part of buildClassifier. It looks long, but it is actually quite simple:

```java
// calc OOB error?
if (getCalcOutOfBag()) {
    double outOfBagCount = 0.0;
    double errorSum = 0.0;
    boolean numeric = data.classAttribute().isNumeric();

    for (int i = 0; i < data.numInstances(); i++) {
        double vote;
        double[] votes;
        if (numeric)
            votes = new double[1];
        else
            votes = new double[data.numClasses()];

        // determine predictions for instance
        int voteCount = 0;
        for (int j = 0; j < m_Classifiers.length; j++) {
            if (inBag[j][i])
                continue;
            voteCount++;
            double pred = m_Classifiers[j].classifyInstance(data.instance(i));
            if (numeric)
                votes[0] += pred;
            else
                votes[(int) pred]++;
        }

        // "vote"
        if (numeric) {
            vote = votes[0];
            if (voteCount > 0)
                vote /= voteCount;               // average
        } else {
            vote = Utils.maxIndex(votes);        // majority vote
        }

        // error for instance
        outOfBagCount += data.instance(i).weight();
        if (numeric) {
            errorSum += StrictMath.abs(vote - data.instance(i).classValue())
                * data.instance(i).weight();
        } else if (vote != data.instance(i).classValue()) {
            errorSum += data.instance(i).weight();
        }
    }
}
```
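The out-of-bag logic for a nominal class can be condensed into a standalone sketch. This is my own simplification, not Weka's code: it assumes unit instance weights and takes precomputed predictions, whereas Weka calls classifyInstance inside the loop.

```java
// Out-of-bag error for a nominal class: only classifiers whose bag
// did NOT contain instance i may vote on it.
public class OobSketch {

    // predictions[j][i] = class predicted by classifier j for instance i;
    // inBag[j][i]       = true if instance i was drawn into bag j.
    public static double oobErrorRate(int[][] predictions, boolean[][] inBag,
                                      int[] trueClass, int numClasses) {
        double outOfBagCount = 0, errorSum = 0;
        for (int i = 0; i < trueClass.length; i++) {
            double[] votes = new double[numClasses];
            int voteCount = 0;
            for (int j = 0; j < predictions.length; j++) {
                if (inBag[j][i]) continue;       // skip in-bag classifiers
                votes[predictions[j][i]]++;
                voteCount++;
            }
            if (voteCount == 0) continue;        // instance never out of bag
            int vote = 0;                        // majority vote (argmax)
            for (int c = 1; c < numClasses; c++) {
                if (votes[c] > votes[vote]) vote = c;
            }
            outOfBagCount += 1.0;                // unit instance weights
            if (vote != trueClass[i]) errorSum += 1.0;
        }
        return errorSum / outOfBagCount;
    }
}
```

Because every instance is evaluated only by classifiers that never saw it during training, this estimate behaves like a built-in cross-validation and requires no held-out data.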
