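Data Mining Example

Objective: Given certain attributes, determine the creditworthiness of a loan customer (that is, whether the customer's status is good or bad).

Approach: The data set contains historical records for 666 loan customers, described by 21 attributes. We do not believe that all 21 attributes help in judging a customer's creditworthiness, so we first drop the less relevant ones. Next, we use clustering to discretize the attributes that hold continuous values. After this preprocessing, we mine the data for association rules that are useful as a reference.

Basic steps:

1. Removing redundant attributes

Randomness in association rules

The data set has a Boolean attribute, foreign workers, that takes the value yes or no. We found that tuples with the value yes make up 96% of all tuples. Confidence is in fact a conditional probability, and it cannot tell us whether two attributes are genuinely related or merely occur together by chance. A rule whose consequent is foreign workers=yes, for example, would reach a confidence of about 96% even if its antecedent were unrelated. Association rules that involve foreign workers therefore give us no additional useful information.

χ² test of dependence

First, we use the χ² test to check whether each attribute (except Duration in months, Credit Amount, and Age in years) is dependent on the good/bad attribute. Credit History serves as the example:

Credit History     Bad   Good   Total
All Paid Duly       17     10      27
Bank Paid Duly      18     17      35
Critical            38    157     195
Delay               17     40      57
Duly Till Now      119    233     352
Total              209    457     666

Degrees of freedom: 4
Chi-square value = 32.8752245686945
p-value is less than or equal to 0.001.
The distribution is significant.

The χ² test shows that the attribute Credit History and the attribute Good/Bad are dependent. After repeated tests, only Status of Checking Account, Credit History, Purpose, Savings Account/Bonds, Present Employment Since, Property, Housing, and Foreign Worker show a significant (α = 0.05) dependence on the good/bad attribute. We therefore focus on these attributes and, where possible, group or regroup their values, in the hope of obtaining better association rules between these attributes and the good/bad attribute.

2. Discretizing continuous variables (discretization/classification/grouping)

After the χ² tests, we use the Clustering, Classification, and Equal-width methods to discretize the attributes Duration in months, Credit Amount, and Age in years, and to group or regroup the values of the attributes that showed a significant dependence above.

Equal-width

We use the Discretize function in Weka to discretize continuous variables, taking the Duration in months attribute as the example. With weka.filters.unsupervised.attribute.Discretize we split the values of Duration in months into 3 classes: Short-term, Mid-term, and Long-term. The resulting record counts are:

Short-term (0-12 months): 245 records
Mid-term (13-24 months): 270 records
Long-term (25 months and above): 151 records

Simple K-Means Clustering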
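As a minimal sketch of the clustering-based discretization covered in this subsection, the Python code below runs Lloyd's k-means on a single numeric attribute and places cut points halfway between neighbouring centroids. It does not reproduce Weka's SimpleKMeans output. The function names, the seeded random initialization, and the sample amounts are all assumptions made for illustration.

```python
import random


def kmeans_1d(values, k, iterations=100, seed=0):
    """Lloyd's algorithm on one numeric attribute; returns sorted centroids."""
    rng = random.Random(seed)
    # Pick k distinct data points as the initial centroids.
    centroids = rng.sample(sorted(set(values)), k)
    for _ in range(iterations):
        # Assignment step: each value joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        updated = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    return sorted(centroids)


def bin_edges(centroids):
    """Cut points halfway between neighbouring centroids."""
    return [(a + b) / 2 for a, b in zip(centroids, centroids[1:])]


def discretize(value, edges, labels):
    """Map a value to the label of the interval it falls in."""
    for edge, label in zip(edges, labels):
        if value <= edge:
            return label
    return labels[-1]


# Hypothetical credit amounts, not taken from the source data set.
amounts = [700, 1200, 1500, 2100, 3000, 3600, 4200,
           5200, 6800, 7900, 9500, 12000, 15000]
labels = ["low", "mid", "high", "veryhigh"]

centroids = kmeans_1d(amounts, k=4)
edges = bin_edges(centroids)
print("cut points:", edges)
print([discretize(a, edges, labels) for a in amounts])
```

Because the starting centroids are drawn at random, a different seed can give different cut points. In practice the boundaries would be rounded to convenient values before being used as category limits.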

The K-Means algorithm partitions the data into a preset number of clusters. It first picks several data points at random as cluster centroids. It then computes the boundary of each cluster and the new centroid positions. Repeating these steps yields the desired number of clusters, which lets us discretize continuous values or further group the values of certain attributes.

Taking the Credit Amount attribute as the example: we use the SimpleKMeans algorithm under Weka's Cluster function to discretize the values of Credit Amount into 4 classes: low (0-2500), mid (2501-4400), high (4401-8500), and veryhigh (above 8500). See Figure 3.

Classification

We also re-discretize the property attribute by regrouping its values, in the hope of obtaining more useful association rules. See Figure 4.

Attributes and their value groups

Attribute: value groups
Status of Existing Checking Account: <0 DM; <200 DM; >=200 DM; no checking account
Duration in month: <13 (short-term); 13-24 (mid-term); >24 (long-term)
Credit History: all paid duly; bank paid duly; critical; duly till now; delay
Purpose: tangible: car (used, new), household (furniture, radio/tv); intangible: business, repair, education, retraining
Credit Amount: 0-2500 (low); 2501-4400 (mid); 4401-8500 (high); >8500 (veryhigh)
Savings Account/Bonds: <100 DM; 100-500 DM; 500-1000 DM; >=1000 DM; unknown/no savings account
Present Employment Since: unemployed; 1-4; 4 and above
Number of People being Liable to Provide Maintenance for: one; two
Personal Status and Sex: single male; married male; divorced male; divorced female
Other Debtors/Guarantors: none; co-applicant; guarantor
Property: real estate; building society; car; unknown
Age in years: <=22 (young); 23-35 (mid); 36-51 (old); >51 (retired)
Other Installment Plans: banks; stores; none
Housing: rent; own; for free
Number of Existing Credits at This Bank: one; two
Status: good; bad

3. Association rules

Using Weka's association function, we obtained a large number of association rules. Of these, the following 15 are the most useful as a reference:

1. Statusofexistingcheckingaccount=no-account Purpose-3=Tangible Personalstatusandsex=single-male Other-debtors/guarantors=none Otherinstallmentplans=none Housing=own => Status=good. conf: (0.95)
2. Statusofexistingcheckingaccount=no-account Credithistory=dulytillnow Housing=own Numberofexistingcreditsatthisbank=one Liabletoprovidemaintenancefor=one => Status=good. conf: (0.92)
3. Statusofexistingcheckingaccount=no-account Presentemploymentsince=over-seven => Status=good. conf: (0.91)
4. Statusofexistingcheckingaccount=no-account Credithistory=dulytillnow Numberofexistingcreditsatthisbank=one Liabletoprovidemaintenancefor=one => Status=good. conf: (0.90)
5. Purpose=radio-tv Housing=own Job=skilled => Status=good. conf: (0.89)
6. Presentemploymentsince=4-years Ageinyears=middleage Job=skilled => Status=good. conf: (0.88)
7. Statusofexistingcheckingaccount=no-account Durationinmonth=mid-term Housing=own => Status=good. conf: (0.88)
8. Statusofexistingcheckingaccount=no-account Credithistory=dulytillnow Housing=own => Status=good. conf: (0.87)
9. Purpose-3=Tangible Personalstatusandsex=single-male Other-debtors/guarantors=none Otherinstallmentplans=none Housing=own Job=skilled => Status=good. conf: (0.86)
10. Statusofexistingcheckingaccount=no-account Property=car Housing=own => Status=good. conf: (0.86)
11. Purpose-2=Household Presentemploymentsince=4-years => Status=good. conf: (0.85)
12. Credit-amount-simplekmeans=low Property=real-estate => Status=good. conf: (0.77)
13. Purpose-2=Household Credit-amount-simplekmeans=low => Status=good. conf: (0.76)
14. Purpose-2=Household Job=skilled => Status=good. conf: (0.73)
15. Presentemploymentsince=4-years Job=skilled => Status=good. conf: (0.72)

4. Weightage

From the rules obtained, we find that certain values of the following 13 attributes tend toward Status=good:

Status of existing checking account: No checking account
Duration in month: Mid-term (13-24 months)
Credit history: All paid duly; No existing credit
Purpose: Household
Credit amount: Low (0-2500)
Present employment since: 4
Personal status and sex: Single male
Other debtors/guarantors: None
Property: Real estate; Car
Age in years: Mid (23-35)
Other installment plans: None
Housing: Own
Job: Skilled

Based on the historical data, if a customer has any 7 of the 13 attribute values above, we consider that customer's Status to be good.

5. Prediction

We can use the weightage method above to predict the Status of customers in the Germantest database. We take one customer's record from the Germantest database to predict his Status:

no-account,24,duly-ti
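The weightage rule from section 4 (count how many of the 13 attributes take a favourable value, and treat 7 or more as good) can be sketched in Python as follows. This is an illustration only: the dictionary keys, the value spellings, and the sample customer are hypothetical, and the source states only the positive direction of the rule, so a score below 7 is reported here simply as not meeting the threshold.

```python
# Favourable values per attribute, transcribed from the list in section 4.
# Key names and value spellings are illustrative assumptions; real records
# would first have to be discretized and mapped onto the same labels.
FAVOURABLE = {
    "checking_account": {"no-account"},
    "duration": {"mid-term"},
    "credit_history": {"all-paid-duly", "no-existing-credit"},
    "purpose": {"household"},
    "credit_amount": {"low"},
    "employment_since": {"4-years"},
    "personal_status": {"single-male"},
    "other_debtors": {"none"},
    "property": {"real-estate", "car"},
    "age": {"mid"},
    "other_installment_plans": {"none"},
    "housing": {"own"},
    "job": {"skilled"},
}


def weightage_score(customer):
    """Count the attributes whose value is one of the favourable values."""
    return sum(
        1
        for attribute, good_values in FAVOURABLE.items()
        if customer.get(attribute) in good_values
    )


def meets_threshold(customer, threshold=7):
    """True when at least `threshold` attributes are favourable."""
    return weightage_score(customer) >= threshold


# Hypothetical, already-discretized customer (not a record from the source).
customer = {
    "checking_account": "no-account",
    "duration": "mid-term",
    "credit_history": "duly-till-now",
    "purpose": "household",
    "credit_amount": "mid",
    "employment_since": "4-years",
    "personal_status": "single-male",
    "other_debtors": "none",
    "property": "building-society",
    "age": "old",
    "other_installment_plans": "none",
    "housing": "own",
    "job": "unskilled",
}

score = weightage_score(customer)
print(score, "good" if meets_threshold(customer) else "below threshold")
```

Every attribute carries the same weight in this sketch, which matches the "any 7 of the 13" wording. Weighting attributes by the confidence of the rules they appear in would be a different scheme from the one described.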
