From Knowledge Mining to Business Intelligence: Advanced Statistics Application
(由知識挖掘提升商務智能應用: 統計分析的進階加值應用)

Dr. 謝邦昌
- Chair Professor and doctoral supervisor, Xiamen University
- Chair Professor and doctoral supervisor, Capital University of Economics and Business
- Chair Professor and doctoral supervisor, Central University of Finance and Economics
- Chair Professor, Southwestern University of Finance and Economics
- Adjunct Professor, Renmin University of China
- Professor, Department of Statistics and Information Science and Graduate Institute of Applied Statistics, Fu Jen Catholic University
- President, Chinese Data Mining Association (中華資料采礦協會)

Outline
- Knowledge mining (integrating data mining and text mining) and the development of business intelligence
- Knowledge mining procedures, steps, outputs, and applications
- How to integrate data mining and text mining
- Technical development of knowledge mining
- Comments

The value of preserving knowledge
- Reduce: cycle time, response time, duplicated investment, operating cost, meeting time, outside consultants, and so on
- Increase: productivity and quality, transfer of enterprise knowledge, fast and effective decisions, courses, innovation, collective effort, and so on
- Retention and transfer of enterprise knowledge: investment in knowledge assets, downsizing and retirement, staff turnover
- Productivity: duplicated capability, wasted energy, too many meetings, communication problems
- Organizational goals: decisions handed down, feasibility, fast, informal

Why is knowledge so urgent?
"The chief economic priority for developed countries is to raise the productivity of knowledge . . . The country that does this first will dominate the twenty-first century economically." (Peter F. Drucker)

From data to knowledge: the formation process
[Diagram labels: Raw Data, Integration, Data Warehouse, Selection/cleansing, Target Data, Preprocessing, Preprocessed Data, Transformation, Transformed Data, Data Mining, Pattern, Interpretation/Evaluation, Knowledge, Understanding]

BI architecture
[Diagram labels: Data Sources (Operational DBs, Other sources); Monitor & Integrator; Extract, Transform, Load, Refresh; metadata; Complete Data Warehouse; Data Marts; OLAP Server; Serve; Tools: 1. Comprehensive Performance Management, 2. Analysis, 3. Query, 4. Reports, 5. Data mining; Business Intelligence]

Data mining

| Business question | Task | Techniques |
|---|---|---|
| Categorize your customers or clients | Classification | rule induction, neural networks, tree generators |
| Forecast future sales or usage | Prediction | rule induction, support vector machine, regression |
| Group similar customers or clients | Segmentation | COBWEB, expectation maximization, k-means |
| Discover products that are purchased together | Association | rough sets, apriori, granular computing |
| Find patterns and trends over time | Sequence | trend functions, rule induction, neural networks |

Gaining market intelligence from news feeds (Sreekumar Sukumaran and Ashish Sureka)

Integrated BI Systems (Sreekumar Sukumaran and Ashish Sureka)
[Diagram labels: Structural Data (DBMS, File System, XML, EA, Legacy) passes through ETL; Unstructured Data (CMS, Scanned Documents, Email) passes through ETL and a Text Tagger & Annotator; Intermediate Data (RDBMS, XML); Complete Data Warehouse]

Knowledge sources and value
- "On average, professional users spend 11 hours per week looking for information. Seventy-one percent said they could not find what they were looking for." (Information Management Software, Lazard Freres & Co. LLC, February 2001)
- "The volume of digitized information will double every year from 2000 to 2005 (an increase to 30 times today's volume)." (Knowledge Management vs. Information Management, Gartner Group, September 2000)
- Sources: web information, news reports, patents, email, documents, literature

The problem: How Can I Keep Up With the Literature?
- Publication statistics: 8 TB (books), 25 TB (news), 20 TB (magazines), 2 TB (journals)
- Scientific knowledge grows by an average of 2,000 pages per minute
- Reading the new material would take 5 years (at 24 hrs/day)

Evolution
"To study history one must know in advance that one is attempting something fundamentally impossible, yet necessary and highly important." (Father Jacobus, in Hesse's Magister Ludi, Das Glasperlenspiel (The Glass Bead Game))

Document knowledge discovery and management technologies
[Diagram labels: retrieval, document filtering, classification, summarization, clustering, natural language, content analysis, extraction, mining, visualization, extraction applications, mining applications; information access, knowledge cognition, information structure, knowledge generation]

[Diagram labels: Raw text; Stemming & Stop words; Tokenized text; Term Weighting; Term similarity; Doc similarity; Vector centroid; Clustering; Classification; META-DATA/ANNOTATION; Sentence selection; Summarization]

Term weighting produces a document-by-term weight matrix:

| | t1 | t2 | ... | tn |
|---|---|---|---|---|
| d1 | w11 | w12 | ... | w1n |
| d2 | w21 | w22 | ... | w2n |
| ... | ... | ... | ... | ... |
| dm | wm1 | wm2 | ... | wmn |

Text ETL to Mining
Original data (a call-center record):
- Call Taker: James
- Date: Aug. 30, 2002
- Duration: 10 min.
- CustomerID: ADC00123
- Q: cust sys has stopped working.
- A: checked cust bios and it need updated.

Structured data (meta data): Call Taker = James; Date = 2002/08/30; Duration = 10 min.; CustomerID = ADC00123

Unstructured data after linguistic analysis:
- Noun: Customer, Software, BIOS
- Subj.Verb: customer system.stop
- SW.Problem: BIOS.need

[Diagram labels: Original Data, Meta Data, Linguistic Analysis (Tagging, Dependency Analysis, Named Entity Extraction, Intention Analysis), Category Dictionary, Synonym Dictionary, Category, Item, Mining, Visualization & Interactive Mining]

IBM TAKMI (Nasukawa, Nagano, 1999)
- Mining target: individual text
- Mining unit: texts category labeled items extracted from text using NLP

Text is Tough
- Text is an abstract concept that is extremely hard to express (AI-complete).
- It is an endless combination of abstract and complex relationships among many concepts.
- One noun can stand for many different concepts: CELL, IV.
- Similar concepts can be expressed in many ways (aliases): space ship, flying saucer, UFO, figment of imagination.
- Concepts are hard to visualize.
- High dimensionality: the analysis can involve hundreds or thousands of dimensions.

Text Mining is Easy
- Text is highly redundant.
- A few simple algorithms can produce decent results on fairly crude tasks: finding important phrases, finding meaningfully related words, building summaries from articles.
- Main issue: evaluating the results. Goals and objectives must be defined.

Traditional IR-based Extraction
[Diagram labels: doc vector 1 ... doc vector n; profile vector; scoring; score; score/threshold test (yes: accepted docs, no: rejected docs); judgments; utility function; vector learning; threshold learning; Ontology; Vector initialization; Threshold initialization; Reuse retrieval algorithms; New threshold algorithms; Text-DB; Lexicons]

Luhn's ideas
"It is here proposed that the frequency of word occurrence in an article furnishes a useful measurement of word significance. It is further proposed that the relative position within a sentence of words having given values of significance furnish a useful measurement for determining the significance of sentences. The significance factor of a sentence will therefore be based on a combination of these two measurements."

Information extraction (example record)
-Job2
- JobTitle: Ice Cream Guru
- Employer:
- JobCategory: Travel/Hospitality
- JobFunction: Food Services
- JobLocation: Upper Midwest
- Contact Phone: 800-488-2611
- DateExtracted: January 8, 2001
- Source: /jobs_midwest.html
- OtherCompanyJobs: -Job1

Information Extraction
Given:
- Source of textual documents
- Well defined limited query (text based)

Find:
- Sentences with relevant information
- Extract the relevant information and ignore non-relevant information (important!)
- Link related information and output in a predetermined format

Example source text (job posting):
Advisory Programmer - Oracle (Austin, TX) Response Code: 1008-0074-97-iexc-jcn Responsibilities: This is an exciting opportunity with Siemens Wireless Terminals; a start-up venture fully capitalized by a Global Leader in Advanced Technologies. Qualified candidates will: Responsible for assisting with requirements definition, analysis, design and implementation that meet objectives, codes difficult and sophisticated routines. Develops project plans, schedules and cost data. Develop test plans and implement physical design of databases. Develop shell scripts for administrative and background tasks, stored procedures and triggers. Using Oracle's Designer 2000, assist with Data Model maintenance and assist with applications development using Oracle Forms. Qualifications: BSCS, BSMIS or closely related field or related equivalent knowledge normally obtained through technical education programs. 5-8 years of professional experience in development, system design analysis, programming, installation using Oracle development

Automatic Pattern-Learning Systems
Pros:
- Portable across domains
- Tend to have broad coverage
- Robust in the face of degraded input
- Automatically find appropriate statistical patterns
- System knowledge not needed by those who supply the domain knowledge

Cons:
- Annotated training data, and lots of it, is needed
- Isn't necessarily better or cheaper than a hand-built solution

Examples: Riloff et al., AutoSlog; Soderland, WHISK (UMass); Mooney et al., Rapier (UTexas); Ciravegna (Sheffield). These systems learn lexico-syntactic patterns from templates.
[Diagram labels: Language Input and Answers feed the Trainer, which produces the Model; the Decoder applies the Model to Language Input and produces Answers]

Text Analysis Spectrum
- Techniques: Entity Extraction, Targeted Facts and Events, Classification, Clustering, Concept Identification
- Questions: from "What is this document about?" to "Who did what to whom, when, where, etc."
- Why is getting dimensional data so hard? Example: "Hank bought plastic explosives from Henry in Tucson yesterday."

Named Entity Extraction
- Entity types: People, Weapons, Vehicles, Dates
- NER Engine output for the example: Hank, Henry, Plastic explosives, Tucson, 11/01/07
- FrameNet

Name Extraction via HMMs
[Diagram labels: Text, or Speech through Speech Recognition, enters the Extractor; NE Models; output Entities: Locations, Persons, Organizations; Training Program with training sentences and answers]

Example sentence: "The delegation, which included the commander of the U.N. troops in Bosnia, Lt. Gen. Sir Michael Rose, went to the Serb stronghold of Pale, near Sarajevo, for talks with Bosnian Serb leader Radovan Karadzic."

NER: an easy but successful HMM application
- Prior to 1997: no learning approach competitive with hand-built rule systems
- Since 1997: statistical approaches (BBN (Bikel et al. 1997), NYU, MITRE, CMU/JustSystems) achieve state-of-the-art performance

Database mining workflow
[Diagram labels: document repository, knowledge ontology, inference graph, knowledge map, concept clustering, automatic clustering, automatic expert classification, event association analysis, decision reference, decision recommendations]

Clustering by frequent term sets
[Diagram labels: document; Document Collection; terms sun, beach, surf, fun; frequent term set {sun, beach}; clusters C1, C2, C3, C4, C5]
- Clustering: C1, C2, C4, C5
- Clustering description: surf, sun, beach, fun
- Anopheles

Feedback as Model Interpolation
[Diagram labels: Concept C, Document D, Results, Feedback Docs F = {d1, d2, ..., dn}, Generative model, Divergence minimization]
- Interpolation weight = 0: no feedback
- Interpolation weight = 1: full feedback

Heterogeneous data
[Diagram labels: TDR (repeated five times); tens of thousands of historical records; massive-scale analysis; document clustering, 1000; solutions; case base; Mooter; Scientific American (科學人) March issue; document data clustering]

Annotation and Tagging
Source sentence: "On November 16, 2005, IBM announced it had acquired Collation, a privately held company based in Redwood City, California for undisclosed amount."
Annotations: Date, Acquiring Organization, Acquisition Event, Acquired Organization, Place, Amount

Text Annotator output (to RDBMS, or as XML):

| Date | Organization | Place | Amount |
|---|---|---|---|
| Nov. 16 | IBM | Redwood City, CA | Undisclosed |

Linguistic Concept Extraction from Customer Service Records
- Bag of "words" extraction: Cstmr, ID, Customer, Yellow, Inc, Happy, Not, Switch, Cell, Phone
- Expressions extraction: Cstmr ID, Customer, Yellow Inc, switch, Cell Phone, Not happy
- Named entities extraction: Customer (CRM term; Cstmr?), Yellow Inc (Telco Company), Cell Phone (Telco Term), Not happy, Switch
- Events/sentiment extraction: Customer (cstmr) cell phone unhappy (Negative); Switch to (Negative Predicate) yellow inc (Competition)
- Combined with structured data for decision making: Churner, Special Offer

Knowledge Inference, Information Extraction, Information Retrieval

Extracting Information From Text
- Structuring knowledge from text: tagging, compounds, grammatical analysis, ontological interpretation, regular expressions, pattern recognition
- [Diagram labels: Text, Database, Ontology, Minimal recursion semantics representations]
- Deep Thought EU project

Knowledge Construction
- Want to extract prominent concepts/relations from text: tagging, compounds, NP recognition, term frequencies, stopwords, language identification
- [Diagram labels: Domain doc. coll., Statistical & linguistic analyses, Manual labor, Ontology]
- Brasethvik & Gulla, DKE, 38/1, 2001

Patterns Construction
[Diagram labels: Taipei, Tokyo, New York; Repository; Tagging & annotation; CDW; Knowledge Repository; Or structured data; Patterns; Patterns Explorer; Web Browser]
[Example ontology labels: Products; Desktop computer; Laptop computers; Hard disk; Hard disk size 40 GB; Operating System; Windows XP; Linux; Macintosh; relations "is a", "crashes", "Installed from http:/."]

Metadata for people, events, time, place, and things
[Diagram labels: people, events, objects, place, time, resource index, properties, applications; Conceptual Objects, Physical Entities, Temporal Entities; relations: participate in, affect or / refer to, refer to / refine, refer to / identify, location, at, within; Sources and metadata (XML/RDF); Background knowledge / Authorities; Thesauri extent; CRM entities; Ontology expansion; Derived knowledge data (e.g. RDF); CIDOC CRM or DC]

Concept Lattice
Given the context (D1, T1), where D1 = {d1, d2, d3, d4} and T1 = {t1, t2, t3, t4, t5, t6}.

Table: the input relation R = documents × keywords

| R | t1 | t2 | t3 | t4 | t5 | t6 |
|---|---|---|---|---|---|---|
| d1 | 1 | 0 | 1 | 0 | 1 | 1 |
| d2 | 1 | 0 | 1 | 0 | 1 | 1 |
| d3 | 0 | 1 | 0 | 1 | 0 | 0 |
| d4 | 1 | 0 | 0 | 1 | 0 | 1 |

Formal concepts (nodes of the Hasse diagram):
- C1: (D1, ∅)
- C2: ({d1, d2, d4}, {t1, t6})
- C3: ({d3, d4}, {t4})
- C4: ({d1, d2}, {t1, t3, t5, t6})
- C5: ({d4}, {t1, t4, t6})
- C6: ({d3}, {t2, t4})
- C7: (∅, T1)

The formal concept C4 has two terms of its own, t3 and t5, and two inherited terms, t1 and t6.

Explicit Events, Object Identity, Symmetry (CIDOC CRM example)
[Diagram labels: E7 Activity "Crimea Conference"; E65 Creation Event; E31 Document "Yalta Agreement"; E38 Image; E39 Actor (three instances); E53 Place 7012124; E52 Time-Span February 1945; E52 Time-Span 11-2-1945; properties: P14 performed, P11 participated in, P94 has created, P86 falls within, P7 took place at, P67 is referred to by, P81 ongoing throughout, P82 at some time within]

Rules Extraction
The formal concept C4 yields the following rules:
- R1: t3 → t1, t6
- R2: t5 → t1, t6
- R3: t3 ↔ t5

Interpretation of R1 and R2: the use of term t3 or t5 is always associated with the use of terms t1 and t6. Rule R3 expresses the mutual equivalence of t3 and t5: every document that contains t3 also contains t5, and vice versa.

Literature knowledge groups
[Diagram labels: experts and decisions; knowledge presentation; real-time clustering; Real-time Index; Metadata of Searching Results]

Official-document data
- Subsidies for low- and middle-income households
- Cause-and-effect diagram for children who have lost their caregivers: welfare in each county and city and the establishment of trust funds; the status of these children in each county and city; intervention by county and city governments, social affairs bureaus, and others; subsidies for single-parent families
- Post-disaster reconstruction and related use of funds; rules of the post-disaster reconstruction fund

Clustering example
Open-ended survey responses about a laundry detergent, used as clustering input: well suited to machine washing; pleasant scent; strong stain removal; washing takes less effort; fresh fragrance; removes 99 kinds of stains; washes especially clean; pleasant scent; gets white socks cleanest; very fragrant; gentle on hands; removes stains well; clothes do not fade easily; washing takes little effort; removes 99 kinds of stains; small amount needed; washes clean; little skin irritation; cleans all kinds of stains; washes clean; reasonably priced; better washing results; nice smell; have always used this brand; washed clothes are whiter; pleasant smell; memorable advertising; washes clean; rinses out easily; fairly gentle on hands; washes clean; small amount needed; washes clean; needs less than other brands; heavily advertised; washes clean; small amount needed; good quality; small amount needed; washes clean; good packaging; lots of appealing advertising; pleasant scent; washes clean and white; well promoted, with entertaining ads; many people say it is good.

Knowledge context
[Diagram labels: knowledge map, event tracking, information retrieval, knowledge concepts]

Kuhn's Descriptive Project
Immature Science, Normal Science, Anomalies, Crisis, Revolution. Evolutionary theory is evolving.

Tasks in News Detection
[Diagram labels: News Feeds, Segmentation, Detection (On-Line, Retro), Tracking, Might be Relevant]
Example events: USS Cole, October 12, 2000; World Trade Center and Pentagon, September 11, 2001.

USS Cole bombing

| Field | Value |
|---|---|
| Location | Aden, Yemen |
| Date | October 12, 2000, 11:18 am (UTC+3) |
| Attack type | suicide bombing |
| Deaths | 19 (including the 2 perpetrators) |
| Injured | 39 |
| Perpetrator(s) | al-Qaeda, carried out by Ibrahim al-Thawr and Abdullah al-Misawa |

The 9/11 attacks were preventable
- FBI agents in Minnesota; Zacarias Moussaoui's personal computer
- The FBI Phoenix memo (George Will)
- Dr. Bhandari (Virtual Gold, Inc): data mining could have prevented the 9/11 tragedy
- [Figures: terrorists; the 9/11 terrorist network]

The Red Army Faction threat
- Horst Herold (head of the German federal police) built a data mining information network at Germany's Bundeskriminalamt in 1972
- Data sources: housing sales, energy companies
- Result: Rolf Heissler (RAF member)
- Outcome: Herold was reported to have violated human rights and retired; the criminal statutes were amended in 1986
- Three of the 9/11 pilots came from Hamburg

Epidemic alert and reporting system
The World Health Organization set up its Epidemic Alert and Response system many years ago. Because some countries may play down reports of an outbreak out of concern for the economic impact, the WHO system includes software that pulls relevant material from media websites in each country, and twenty experts analyze the information in that material.

HighW

Information and knowledge
- Amazon digital camera sales
- News events: The Washington Times
- Hot research topics at the US National Institutes of Health (NIH): Proposals by Funding/Date across IRGs and Activity Types
- Disease treatment guidelines: Athena/EON (Stanford); Athena clinical guideline (R. D. Shankar, et al. 2001); Athena Hypertension Guideline (A. Advani, et al. 2003)

Disaster-affected households (financial assistance policy)
[Diagram labels: loans (affected households, temporary housing); Generative, Discriminative; home reconstruction project; financial institutions; loans; provisional earthquake reconstruction act; affected households; housing; interest; damage]
[Diagram labels: object, method, Object:attribute, Object:condition, Object:Attribute (condition), Specify, Generalize]

Integrating Distributed Knowledge
- Adaptive knowledge infrastructure is in place
- Knowledge resources identified and shared appropriately
- Timely knowledge gets to the right person to make decisions
- Intelligent tools for authoring through archiving
- Cohesive knowledge development between JPL, its partners, and customers
- Instrument design is semi-automatic based on knowledge repositories
- Mission software auto-instantiates based on unique mission parameters
- KM principles are part of Lab culture and supported by layered COTS products
- Remote data management allows spacecraft to self-command
- Knowledge gathered anyplace from hand-held devices using standard formats on interplanetary Internet
- Expert systems on spacecraft analyze and upload data
- Autonomous agents operate across existing sensor and telemetry products
- Industry and academia supply spacecraft parts based on collaborative designs derived from JPL's
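
The Concept Lattice and Rules Extraction slides earlier in the deck can be checked mechanically. The following is a minimal Python sketch, not the author's implementation: it assumes the input relation R from the Concept Lattice slide, enumerates formal concepts by brute force over document subsets (reasonable only for a context this small), and tests the three rules by comparing the sets of documents that contain each term. The function names `common_terms`, `docs_having`, `formal_concepts`, and `implies` are hypothetical and chosen for illustration.

```python
from itertools import combinations

# Input relation R (documents x keywords), copied from the Concept Lattice slide.
R = {
    "d1": {"t1", "t3", "t5", "t6"},
    "d2": {"t1", "t3", "t5", "t6"},
    "d3": {"t2", "t4"},
    "d4": {"t1", "t4", "t6"},
}
ALL_TERMS = {"t1", "t2", "t3", "t4", "t5", "t6"}


def common_terms(docs):
    """Terms shared by every document in docs (all terms if docs is empty)."""
    shared = set(ALL_TERMS)
    for d in docs:
        shared &= R[d]
    return shared


def docs_having(terms):
    """Documents that contain every term in terms."""
    return {d for d, ts in R.items() if terms <= ts}


def formal_concepts():
    """Enumerate (extent, intent) pairs by closing every subset of documents."""
    found = set()
    names = sorted(R)
    for size in range(len(names) + 1):
        for subset in combinations(names, size):
            intent = common_terms(subset)
            extent = docs_having(intent)
            found.add((frozenset(extent), frozenset(intent)))
    return found


def implies(premise, conclusion):
    """True if every document containing premise also contains conclusion."""
    return docs_having(premise) <= docs_having(conclusion)


concepts = formal_concepts()
# Print concepts from the largest extent (top of the lattice) to the smallest.
for extent, intent in sorted(concepts, key=lambda c: (-len(c[0]), sorted(c[1]))):
    print(sorted(extent), sorted(intent))

print("R1:", implies({"t3"}, {"t1", "t6"}))
print("R2:", implies({"t5"}, {"t1", "t6"}))
print("R3:", implies({"t3"}, {"t5"}) and implies({"t5"}, {"t3"}))
```

Closing each document subset (take its shared terms, then all documents having those terms) is the simplest way to see where the pairs C1 to C7 come from. For larger contexts a dedicated lattice-construction algorithm would be needed, since the number of subsets grows exponentially.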