
Naive Bayes

Pros: still effective when training data is scarce; handles multi-class problems.
Cons: sensitive to how the input data is prepared.
Works with: nominal values.

Bayes' rule: p(c|w) = p(w|c) p(c) / p(w).

Using naive Bayes for document classification. The general workflow:
(1) Collect data: any method will do; this post uses an RSS feed.
(2) Prepare data: numeric or Boolean values are required.
(3) Analyze data: with many features, plotting them individually tells you little; histograms work better.
(4) Train the algorithm: compute the conditional probabilities of the individual, independent features.
(5) Test the algorithm: compute the error rate.
(6) Use the algorithm: document classification is a common application, but a naive Bayes classifier can be used in any classification setting, not only text.

Preparing the data: building word vectors from text (adapted from Machine Learning in Action / 机器学习实战)

The training data consists of six posts, listed together with their labels in loadDataSet() below: label 0 marks a normal post and label 1 marks an abusive one. By measuring how often each word occurs in abusive versus normal posts, we can find out which words signal abuse. Add the following code to bayes.py:

[python]

\o"viewplain"viewplain\o"copy"copy#

coding=utf-8

def

loadDataSet():

postingList

=

[['my',

'dog',

'has',

'flea',

'problems',

'help',

'please'],

['maybe',

'not',

'take',

'him',

'to',

'dog',

'park',

'stupid'],

['my',

'dalmation',

'is',

'so',

'cute',

'I',

'love',

'him'],

['stop',

'posting',

'stupid',

'worthless',

'garbage'],

['mr',

'licks',

'ate',

'my',

'steak',

'how',

'to',

'stop',

'him'],

['quit',

'buying',

'worthless',

'dog',

'food',

'stupid']]

classVec

=

[0,

1,

0,

1,

0,

1]

#

1代表侮辱性文字,0代表正常言論

return

postingList,

classVec

def

createVocabList(dataSet):

vocabSet

=

set([])

for

document

in

dataSet:

vocabSet

=

vocabSet

|

set(document)

return

list(vocabSet)

def

setOfWords2Vec(vocabList,

inputSet):

returnVec

=

[0]

*

len(vocabList)

for

word

in

inputSet:

if

word

in

vocabList:

returnVec[vocabList.index(word)]

=

1

else:

print

"the

word:

%s

is

not

in

my

Vocabulary!"

%

word

return

returnVec
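A quick way to exercise these helpers (an illustrative snippet, not from the original post; the ordering of the vocabulary list depends on Python's set iteration and varies between runs):

[python]
listOPosts, listClasses = loadDataSet()
myVocabList = createVocabList(listOPosts)
print(len(myVocabList))                             # 32 distinct words in the toy data set
print(setOfWords2Vec(myVocabList, listOPosts[0]))   # 0/1 vector with one slot per vocabulary word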

Training the algorithm: computing probabilities from word vectors

[python]

\o"viewplain"viewplain\o"copy"copy#

樸素貝葉斯分類器訓(xùn)練函數(shù)

#

trainMatrix:

文檔矩陣,

trainCategory:

由每篇文檔類別標(biāo)簽所構(gòu)成的向量

def

trainNB0(trainMatrix,

trainCategory):

numTrainDocs

=

len(trainMatrix)

numWords

=

len(trainMatrix[0])

pAbusive

=

sum(trainCategory)

/

float(numTrainDocs)

p0Num

=

zeros(numWords);

p1Num

=

zeros(numWords);

p0Denom

=

0.0;

p1Denom

=

0.0;

for

i

in

range(numTrainDocs):

if

trainCategory[i]

==

1:

p1Num

+=

trainMatrix[i]

p1Denom

+=

sum(trainMatrix[i])

else:

p0Num

+=

trainMatrix[i]

p0Denom

+=

sum(trainMatrix[i])

p1Vect

=

p1Num

/

p1Denom

p0Vect

=

p0Num

/

p1Denom

return

p0Vect,

p1Vect,

pAbusive
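To see it run (an illustrative snippet; the functions and data come from the listings above):

[python]
listOPosts, listClasses = loadDataSet()
myVocabList = createVocabList(listOPosts)
trainMat = [setOfWords2Vec(myVocabList, post) for post in listOPosts]
p0V, p1V, pAb = trainNB0(trainMat, listClasses)
print(pAb)   # 0.5, since 3 of the 6 training posts are abusive
print(p1V)   # per-word conditional probabilities; words never seen in abusive posts come out exactly 0

Those zero entries are what the next set of changes addresses.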

Testing the algorithm: modifying the classifier for real-world conditions

Change a few lines in the trainNB0 function from the previous section:

p0Num = ones(numWords); p1Num = ones(numWords)
p0Denom = 2.0; p1Denom = 2.0
p1Vect = log(p1Num / p1Denom)
p0Vect = log(p0Num / p0Denom)

Initializing every word count to 1 and both denominators to 2 keeps a single unseen word from driving a conditional probability, and hence the whole product, to 0; taking the logarithm avoids floating-point underflow when many small probabilities are multiplied together.
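In equation form (standard naive Bayes algebra, added here for reference): the independence assumption factorizes the likelihood, and the logarithm turns that product into a numerically safe sum:

$$p(w \mid c_i) = \prod_k p(w_k \mid c_i), \qquad \log p(w \mid c_i) = \sum_k \log p(w_k \mid c_i)$$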

\o"viewplain"viewplain\o"copy"copy#

樸素貝葉斯分類器訓(xùn)練函數(shù)

#

trainMatrix:

文檔矩陣,

trainCategory:

由每篇文檔類別標(biāo)簽所構(gòu)成的向量

def

trainNB0(trainMatrix,

trainCategory):

numTrainDocs

=

len(trainMatrix)

numWords

=

len(trainMatrix[0])

pAbusive

=

sum(trainCategory)

/

float(numTrainDocs)

p0Num

=

ones(numWords);

p1Num

=

ones(numWords);

p0Denom

=

2.0;

p1Denom

=

2.0;

for

i

in

range(numTrainDocs):

if

trainCategory[i]

==

1:

p1Num

+=

trainMatrix[i]

p1Denom

+=

sum(trainMatrix[i])

else:

p0Num

+=

trainMatrix[i]

p0Denom

+=

sum(trainMatrix[i])

p1Vect

=

log(p1Num

/

p1Denom)

p0Vect

=

log(p0Num

/

p1Denom)

return

p0Vect,

p1Vect,

pAbusive

#

樸素貝葉斯分類函數(shù)

def

classifyNB(vec2Classify,

p0Vec,

p1Vec,

pClass1):

p1

=

sum(vec2Classify

*

p1Vec)

+

log(pClass1)

p0

=

sum(vec2Classify

*

p0Vec)

+

log(1.0

-

pClass1)

if

p1

>

p0:

return

1

else:

return

0

def

testingNB():

listOPosts,

listClasses

=

loadDataSet()

myVocabList

=

createVocabList(listOPosts)

trainMat

=

[]

for

postinDoc

in

listOPosts:

trainMat.append(setOfWords2Vec(myVocabList,

postinDoc))

p0V,

p1V,

pAb

=

trainNB0(array(trainMat),

array(listClasses))

testEntry

=

['love',

'my',

'dalmation']

thisDoc

=

array(setOfWords2Vec(myVocabList,

testEntry))

print

testEntry,

'classified

as:

',

classifyNB(thisDoc,

p0V,

p1V,

pAb)

testEntry

=

['stupid',

'garbage']

thisDoc

=

array(setOfWords2Vec(myVocabList,

testEntry))

print

testEntry,

'classified

as:

',

classifyNB(thisDoc,

p0V,

p1V,

pAb)
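classifyNB simply compares the two log posteriors implied by Bayes' rule; p(w) is the same for both classes and can be dropped, so for a 0/1 word vector w the quantity computed is

$$\log p(c_i \mid w) \propto \sum_k w_k \log p(w_k \mid c_i) + \log p(c_i),$$

which matches sum(vec2Classify * p1Vec) + log(pClass1) and its class-0 counterpart in the code. With the toy data above, testingNB() should report ['love', 'my', 'dalmation'] as class 0 and ['stupid', 'garbage'] as class 1.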

Preparing the data: the bag-of-words document model

Set-of-words model: records only whether each word appears; each word is counted at most once.
Bag-of-words model: a word can be counted more than once.

[python]

\o"viewplain"viewplain\o"copy"copy#

樸素貝葉斯詞袋模型

def

bagOfWords2VecMN(vocabList,

inputSet):

returnVec

=

[0]

*

len(vocabList)

for

word

in

inputSet:

if

word

in

vocabList:

returnVec[vocabList.index(word)]

+=

1

return

returnVec
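A quick side-by-side of the two encodings (an illustrative snippet; the made-up input repeats 'dog' so they differ):

[python]
myVocabList = createVocabList(loadDataSet()[0])
doc = ['dog', 'stupid', 'dog']
print(setOfWords2Vec(myVocabList, doc))     # the 'dog' slot is 1
print(bagOfWords2VecMN(myVocabList, doc))   # the 'dog' slot is 2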

Example: filtering spam e-mail with naive Bayes

(1) Collect data: text files are provided.
(2) Prepare data: parse the text files into token vectors.
(3) Analyze data: inspect the tokens to make sure the parsing is correct.
(4) Train the algorithm: use the trainNB0() function built earlier.
(5) Test the algorithm: use classifyNB(), and build a new test function that computes the error rate over a document set.
(6) Use the algorithm: build a complete program that classifies a group of documents and prints the misclassified ones to the screen.

Preparing the data: splitting text. Sentences are split into tokens with a regular expression.

Testing the algorithm: cross-validation with naive Bayes

[python]

\o"viewplain"viewplain\o"copy"copy#

該函數(shù)接受一種大寫字符的字串,將其解析為字符串列表

#

該函數(shù)去掉少于兩個字符的字符串,并將全部字符串轉(zhuǎn)換為小寫

def

textParse(bigString):

import

re

listOfTokens

=

re.split(r'\W*',

bigString)

return

[tok.lower()

for

tok

in

listOfTokens

if

len(tok)

>

2]
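# Illustrative call (not from the original post):
#   textParse('This book is the BEST book on Python!')
# returns ['this', 'book', 'the', 'best', 'book', 'python']: short tokens such as
# 'is' and 'on' are dropped, and everything is lower-cased.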

# Complete spam e-mail test function
def spamTest():
    docList = []; classList = []; fullText = []
    # Load and parse the text files
    for i in range(1, 26):
        wordList = textParse(open('email/spam/%d.txt' % i).read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(1)
        wordList = textParse(open('email/ham/%d.txt' % i).read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(0)
    vocabList = createVocabList(docList)
    trainingSet = list(range(50)); testSet = []
    # Randomly pick 10 of the 50 documents as the test set; the rest remain for training
    for i in range(10):
        randIndex = int(random.uniform(0, len(trainingSet)))
        testSet.append(trainingSet[randIndex])
        del(trainingSet[randIndex])
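    # The original post is cut off at this point. The rest of the function below is a
    # minimal sketch of how such a hold-out test typically finishes, reusing trainNB0 and
    # classifyNB from the listings above; it is an assumption, not the original author's code.
    trainMat = []; trainClasses = []
    for docIndex in trainingSet:                  # train on the remaining 40 documents
        trainMat.append(setOfWords2Vec(vocabList, docList[docIndex]))
        trainClasses.append(classList[docIndex])
    p0V, p1V, pSpam = trainNB0(array(trainMat), array(trainClasses))
    errorCount = 0
    for docIndex in testSet:                      # classify the 10 held-out documents
        wordVector = setOfWords2Vec(vocabList, docList[docIndex])
        if classifyNB(array(wordVector), p0V, p1V, pSpam) != classList[docIndex]:
            errorCount += 1
    print('the error rate is: ', float(errorCount) / len(testSet))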
