




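Enhancing Business Intelligence Applications through Knowledge Mining
Advanced Value-Added Applications of Statistical Analysis
From Knowledge Mining to Business Intelligence - Advanced Statistics Application
Dr. 謝邦昌 (Ben-Chang Shia)
- Chair Professor and doctoral supervisor, Xiamen University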
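- Chair Professor and doctoral supervisor, Capital University of Economics and Business
- Chair Professor and doctoral supervisor, Central University of Finance and Economics
- Chair Professor, Southwestern University of Finance and Economics
- Adjunct Professor, Renmin University of China
- Professor, Department of Statistics and Information Science and Graduate Institute of Applied Statistics, Fu Jen Catholic University
- Chairman, Chinese Data Mining Association (中華資料采礦協(xié)會)

Outline
- Knowledge mining (integrating data mining and text mining) and the development of business intelligence
- Knowledge mining procedures, steps, outputs, and applications
- How to integrate data mining and text mining
- Technical development of knowledge mining
- Commentary

The value of preserving knowledge
Reduce:
- Cycle time
- Response time
- Duplicated investment
- Operating costs
- Meeting time
- Outside consultants, etc.
Increase:
- Productivity and quality
- Transfer of enterprise knowledge
- Fast, effective decisions
- Training courses
- Innovation
- Collective effort, etc.

Retaining and transferring enterprise knowledge
- Investment in knowledge assets
- Downsizing and retirement
- Staff rotation

Productivity
- Duplicated capabilities
- Wasted effort
- Too many meetings
- Communication problems
- Organizational goals

Decision making
- Feasibility
- Speed
- Informal channels

Why is knowledge so urgent?
"The chief economic priority for developed countries is to raise the productivity of knowledge... The country that does this first will dominate the twenty-first century economically."
Peter F. Drucker

From data to knowledge: the formation process
[Diagram labels: Raw Data; Integration; Data Warehouse; Selection/cleansing; Target Data; Preprocessing; Preprocessed Data; Transformation; Transformed Data; Data Mining; Pattern; Interpretation/Evaluation; Knowledge; Understanding]

BI structure
[Diagram labels: Data Sources (Operational DBs, Other sources); Monitor & Integrator; Extract, Transform, Load, Refresh; metadata; Complete Data Warehouse; Data Marts; OLAP Server; Serve; Tools; Business Intelligence]
Tools:
1. Comprehensive Performance Management
2. Analysis
3. Query
4. Reports
5. Data mining

Data mining
Techniques: rule induction, neural networks, tree generators, support vector machine, regression, COBWEB, expectation maximization, k-means, rough sets, apriori, granular computing, trend functions
Tasks:
- Classification: categorize your customers or clients
- Prediction: forecast future sales or usage
- Segmentation: group similar customers or clients
- Association: discover products that are purchased together
- Sequence: find patterns and trends over time

Gaining market intelligence from news feeds
Sreekumar Sukumaran and Ashish Sureka

Integrated BI Systems
- Structured data: DBMS, File System, XML, EA, Legacy
- Unstructured data: CMS, Scanned Documents, Email
- Structured data flows through ETL into the Complete Data Warehouse
- Unstructured data flows through ETL and a text tagger & annotator into intermediate data (RDBMS, XML)
Sreekumar Sukumaran and Ashish Sureka

Knowledge sources and value
"On average, professional users spend 11 hours per week looking for information. Seventy-one percent said they could not find what they were looking for."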
"Information Management Software"
Lazard Freres & Co. LLC
February 2001

"The volume of digitized information will double every year from 2000 to 2005 (an increase to 30 times today's volume)."
"Knowledge Management vs. Information Management"
Gartner Group
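September 2000

Sources: web information, news reports, patents, e-mail, documents, ...

The literature problem
- Publication statistics: 8 TB (books), 25 TB (news), 20 TB (magazines), 2 TB (journals)
- Scientific knowledge grows by an average of 2,000 pages per minute
- Reading the new material would take 5 years (24 hrs/day)
How Can I Keep Up With the Literature?

Evolution
"To study history one must know in advance that one is attempting something fundamentally impossible, yet necessary and highly important."
Father Jacobus (Hesse's Magister Ludi), Das Glasperlenspiel (The Glass Bead Game)

Document knowledge discovery and management techniques
- Information access: retrieval, document filtering
- Information structure: classification, summarization, clustering
- Knowledge cognition: natural-language content analysis, extraction, mining
- Knowledge generation: visualization, extraction applications, mining applications

[Diagram: Raw text; Tokenized text; Stemming & Stopwords; Term Weighting; Term similarity; Doc similarity; Vector centroid; clustering; classification; META-DATA/ANNOTATION; Sentence selection; summary]
Term-document weight matrix:
      t1    t2    ...   tn
d1    w11   w12   ...   w1n
d2    w21   w22   ...   w2n
...
dm    wm1   wm2   ...   wmn

Text ETL to Mining
Unstructured data (original data):
  Call Taker: James
  Date: Aug. 30, 2002
  Duration: 10 min.
  CustomerID: ADC00123
  Q: cust sys has stopped working.
  A: checked cust bios and it need updated.
  ...
Structured data (meta data):
  [Call Taker] James
  [Date] 2002/08/30
  [Duration] 10 min.
  [CustomerID] ADC00123
  [Noun] Customer
  [Software] BIOS
  [Subj...Verb] customer system..stop
  [SW..Problem] BIOS..need
[Diagram labels: Linguistic Analysis (Tagging, Dependency Analysis, Named Entity Extraction, Intention Analysis); Category Dictionary; Synonym Dictionary; Category Item; Mining; Visualization & Interactive Mining]
IBM TAKMI (Nasukawa, Nagano, 1999)
- Mining target: individual text
- Mining unit: > texts > category labeled items extracted from text using NLP

Text is Tough
- Text expresses abstract concepts that are extremely hard to represent (AI-complete)
- It is an endless combination of abstract, complex relationships among many concepts
- One term can stand for many different concepts: CELL, IV
- Similar concepts can be expressed in many ways (aliases): spaceship, flying saucer, UFO, figment of imagination
- Concepts are hard to visualize
- High dimensionality: the analysis may span hundreds or thousands of dimensions

Text Mining is Easy
- Text is highly redundant
- A few simple algorithms give decent results even from very crude processing:
  - finding important phrases
  - finding meaningfully related words
  - building a summary of an article
- Main problems: evaluating results; goals and objectives must be defined

Traditional IR-based Extraction
[Diagram labels: doc vector 1 ... doc vector n; profile vector; scoring; score; Score > threshold? (yes: accepted docs; no: rejected docs); judgments; utility function; vector learning; threshold learning; Ontology; Vector initialization; Threshold initialization; Reuse retrieval algorithms; New threshold algorithms]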
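The scoring step in the diagram above (document vectors compared with a profile vector, then accepted or rejected against a threshold) can be sketched as follows. This is a minimal illustration, not the method of any system cited in the deck: the TF-IDF weighting, the cosine score, the tiny stopword list, the fixed threshold, and all function names are assumptions made for the example. The vector-learning and threshold-learning loops in the diagram are not implemented.

```python
import math
from collections import Counter

# Hypothetical stopword list, chosen only for this example.
STOPWORDS = {"the", "a", "of", "and", "to", "in", "is", "it", "has"}

def tokenize(text):
    """Lowercase, split on whitespace, strip punctuation, drop stopwords."""
    tokens = (w.strip(".,;:!?()\"'") for w in text.lower().split())
    return [t for t in tokens if t and t not in STOPWORDS]

def build_idf(docs):
    """Smoothed inverse document frequency for every term in the collection."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}

def vectorize(text, idf):
    """Sparse TF-IDF vector; terms unseen in the collection are ignored."""
    tf = Counter(tokenize(text))
    return {term: freq * idf[term] for term, freq in tf.items() if term in idf}

def cosine(u, v):
    """Cosine similarity of two sparse vectors (0.0 if either is empty)."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def filter_docs(docs, profile_text, threshold):
    """Score each document against the profile; accept when score > threshold."""
    idf = build_idf(docs)
    profile = vectorize(profile_text, idf)
    accepted, rejected = [], []
    for doc in docs:
        score = cosine(vectorize(doc, idf), profile)
        (accepted if score > threshold else rejected).append((doc, score))
    return accepted, rejected

docs = [
    "customer system has stopped working",
    "checked customer bios and it needs update",
    "ice cream guru wanted in the upper midwest",
]
accepted, rejected = filter_docs(docs, "bios update problem", threshold=0.2)
for doc, score in accepted:
    print(f"accepted ({score:.2f}): {doc}")
```

In a real filtering system the threshold would be tuned from relevance judgments against a utility function, as the diagram suggests; here it is simply a constant.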
[Diagram labels: Text-DB; Lexicons]

Luhn's ideas
"It is here proposed that the frequency of word occurrence in an article furnishes a useful measurement of word significance. It is further proposed that the relative position within a sentence of words having given values of significance furnish a useful measurement for determining the significance of sentences. The significance factor of a sentence will therefore be based on a combination of these two measurements."

Information extraction - Job2
Job Title: Ice Cream Guru
Employer:
Job Category: Travel/Hospitality
Job Function: Food Services
Job Location: Upper Midwest
Contact Phone: 800-488-2611
Date Extracted: January 8, 2001
Source: /jobs_midwest.html
Other Company Jobs: -Job1

Information Extraction
Given:
- Source of textual documents
- Well defined limited query (text based)
Find:
- Sentences with relevant information
- Extract the relevant information and ignore non-relevant information (important!)
- Link related information and output in a predetermined format

Example source text:
Advisory Programmer - Oracle (Austin, TX)
Response Code: 1008-0074-97-iexc-jcn
Responsibilities: This is an exciting opportunity with Siemens Wireless Terminals; a start-up venture fully capitalized by a Global Leader in Advanced Technologies. Qualified candidates will: Responsible for assisting with requirements definition, analysis, design and implementation that meet objectives, codes difficult and sophisticated routines. Develops project plans, schedules and cost data. Develop test plans and implement physical design of databases. Develop shell scripts for administrative and background tasks, stored procedures and triggers. Using Oracles Designer 2000, assist with Data Model maintenance and assist with applications development using Oracle Forms.
Qualifications: BSCS, BSMIS or closely related field or related equivalent knowledge normally obtained through technical education programs. 5-8 years of professional experience in development, system design analysis, programming, installation using Oracle development ...

Automatic Pattern-Learning Systems
Pros:
- Portable across domains
- Tend to have broad coverage
- Robust in the face of degraded input
- Automatically find appropriate statistical patterns
- System knowledge not needed by those who supply the domain knowledge
Cons:
- Annotated training data, and lots of it, is needed
- Isn't necessarily better or cheaper than a hand-built solution
Examples: Riloff et al., AutoSlog, Soderland WHISK (UMass); Mooney et al. Rapier (UTexas); Ciravegna (Sheffield)
- Learn lexico-syntactic patterns from templates
[Diagram labels: training: Language Input + Answers; Trainer; Model; decoding: Language Input; Decoder; Answers]

Text Analysis Spectrum
- Entity Extraction
- Targeted Facts and Events
- Classification
- Clustering
- Concept Identification
Questions along the spectrum: What is this document about? Who did what to whom when where, etc.

Why is getting dimensional data so hard?
"Hank bought plastic explosives from Henry in Tucson yesterday."
Named Entity Extraction: People, Weapons, Vehicles, Dates
NER Engine output: Hank; Henry; Plastic explosives; Tucson; 11/01/07
FrameNet

Name Extraction via HMMs
[Diagram labels: Text; Speech; Speech Recognition; Extractor; NE Models; Entities (Locations, Persons, Organizations); Training Program; training sentences; answers]
Example sentence: "The delegation, which included the commander of the U.N. troops in Bosnia, Lt. Gen. Sir Michael Rose, went to the Serb stronghold of Pale, near Sarajevo, for talks with Bosnian Serb leader Radovan Karadzic."
An easy but successful HMM application:
- Prior to 1997: no learning approach competitive with hand-built rule systems
- Since 1997: statistical approaches (BBN (Bikel et al. 1997), NYU, MITRE, CMU/JustSystems) achieve state-of-the-art performance

NER: database mining workflow
[Diagram labels: document repository; automatic clustering; automatic/expert classification; event association analysis; knowledge ontology; inference graph; knowledge map; decision reference; decision recommendations]

Concept clustering
- Document collection of documents
- Frequent term sets: {sun}, {beach}, {surf}, {fun}, {sun, beach}
- Clusters: C1, C2, C3, C4, C5
- Clustering: {C1, C2, C4, C5}
- Clustering description: {surf, sun, beach, fun}
[Stray slide label: Anopheles]

Feedback as Model Interpolation
[Diagram labels: Concept C; Document D; Results; Feedback Docs F = {d1, d2, ..., dn}; Generative model; Divergence minimization]
- Interpolation weight = 0: no feedback
- Interpolation weight = 1: full feedback

Heterogeneous data
- TDR: tens of thousands of historical records
- Large-scale analysis
- Document clustering
- A case base of 1,000 solutions
- Mooter
- Scientific American, Chinese edition (科學人), March issue
- Clustering of document data

Annotation and Tagging
"On November 16, 2005, IBM announced it had acquired Collation, a privately held company based in Redwood City, California for undisclosed amount."
Annotation labels: Date; Acquiring Organization; Acquisition Event; Acquired Organization; Place; Amount
Text Annotator output to RDBMS:
Date      Organization   Place              Amount
Nov. 16   IBM            Redwood City, CA   Undisclosed
XML output:
On <Date>November 16, 2005</Date>, <ACQUIRINGORG>IBM</ACQUIRINGORG> announced it had <ACQUISITIONEVENT>acquired</ACQUISITIONEVENT> <ACQUIREDORG>Collation</ACQUIREDORG>, a privately held company based in <PLACE>Redwood City, California</PLACE> for <AMOUNT>undisclosed</AMOUNT> amount.

Linguistic Concept Extraction from Customer Service Records
- Bag of "Words" extraction: Cstmr, ID, Customer, Yellow, Inc, Happy, Not, Switch, Cell, Phone
- Expressions extraction: Cstmr ID, Customer, Yellow Inc, switch, Cell Phone, Not happy
- Named Entities extraction: Customer (CRM term), Cstmr (?), Yellow Inc (Telco Company), Cell Phone (Telco Term), Not happy, Switch
- Events/Sentiment Extraction: Customer (cstmr) cell phone unhappy (Negative); Switch to (Negative Predicate) yellow inc (Competition)
- Combined with structured data for decision making: Churner, Special Offer
[Levels: Information Retrieval; Information Extraction; Knowledge Inference]

Extracting Information From Text
- Structuring knowledge from text
- Tagging, compounds, grammatical analysis, ontological interpretation, regular expressions, pattern recognition
[Diagram labels: Text; Database; Ontology]
- Minimal recursion semantics representations [DeepThought EU project]

Knowledge Construction
- Want to extract prominent concepts/relations from text
- Tagging, compounds, NP recognition, term frequencies, stopwords, language identification [Brasethvik & Gulla, DKE, 38/1, 2001]
[Diagram labels: Domain doc. coll.; Statistical & linguistic analyses; Manual labor; Ontology]

Patterns Construction
[Diagram labels: Taipei; Tokyo; New York; Repository; Tagging & annotation; CDW; Knowledge Repository or structured data; Patterns]
[Ontology example labels: Products; Desktop computer; Laptop computers; Harddisk; Harddisk size 40GB; Operating System; Windows XP; Linux; Macintosh; Explorer; Web Browser; is a; crashes; Installed from http://...]

Metadata for people, events, time, places, and objects
[Diagram labels: people; properties; Conceptual Objects; Physical Entities; Temporal Entities; place; time; application; participate in; affect or / refer to; refer to / refine; refer to / identify; location; at; within]
[Diagram labels: resource index (people, events, objects); Derived knowledge data (e.g. RDF); Thesauri extent; CRM entities; Ontology expansion; Sources and metadata (XML/RDF); Background knowledge / Authorities; CIDOC CRM or DC]

Concept Lattice
Given the context (D1, T1), where D1 = {d1, d2, d3, d4} and T1 = {t1, t2, t3, t4, t5, t6}.
Table: the input relation R = documents × keywords
R     t1  t2  t3  t4  t5  t6
d1    1   0   1   0   1   1
d2    1   0   1   0   1   1
d3    0   1   0   1   0   0
d4    1   0   0   1   0   1
Formal concepts (Hasse diagram):
- C1: (D1, ∅)
- C2: ({d1, d2, d4}, {t1, t6})
- C3: ({d3, d4}, {t4})
- C4: ({d1, d2}, {t1, t3, t5, t6})
- C5: ({d4}, {t1, t4, t6})
- C6: ({d3}, {t2, t4})
- C7: (∅, T1)
The formal concept C4 has two own terms {t3, t5} and two inherited terms {t1, t6}.

CIDOC CRM example
[Diagram labels: E7 Activity "Crimea Conference"; E65 Creation Event; E31 Document "Yalta Agreement"; E38 Image; E39 Actor (three instances); E53 Place 7012124; E52 Time-Span February 1945; E52 Time-Span 11-2-1945; P14 performed; P11 participated in; P94 has created; P86 falls within; P7 took place at; P67 is referred to by; P81 ongoing throughout; P82 at some time within]
Explicit Events, Object Identity, Symmetry

Rules Extraction
The formal concept C4 makes the following rules possible:
- R1: t3 → t1 ∧ t6
- R2: t5 → t1 ∧ t6
- R3: t3 ↔ t5
Interpretation of R1 and R2: the use of terms t3 or t5 is always associated with that of terms t1 and t6.
The rule R3 expresses the mutual equivalence of the terms {t3, t5}: all the documents which have the term t3 also have the term t5.

Document knowledge groups
- Experts and decisions
- Knowledge presentation
- Real-time clustering
- Real-time Index
- Metadata of Searching Results

Official-document data
- Subsidies for low- and middle-income households
- Causal diagram: children who have lost their guardians
- County and city welfare; establishment of trust funds
- Status of such children in each county and city
- Intervention by county and city governments, social affairs bureaus, etc.
- Post-disaster reconstruction subsidies for single-parent families and the related use of funds
- Rules of the post-disaster reconstruction fund

Clustering example (consumer comments on a laundry detergent)
Well suited to machine washing; pleasant scent; strong stain removal; washing takes little effort; fresh scent; removes 99 kinds of stains; washes especially clean; pleasant scent; gets white socks cleanest; smells very nice; gentle on hands; removes stains very well; clothes do not fade easily; washing is effortless; removes 99 kinds of stains; small amount needed; washes clean; less irritating to skin; cleans all kinds of stains well; washes clean; reasonably priced; washes clothes better; nice smell; have always used this brand; washed clothes are whiter; pleasant smell; memorable advertising; washes clean; rinses out easily; fairly gentle on hands; washes clean; small amount needed; washes clean; needs less than other brands; heavily advertised; washes clean; small amount needed; good quality; small amount needed; washes clean; good packaging; lots of appealing advertising; pleasant scent; washes clean and white; good publicity, entertaining ads; many people say it is good

Knowledge context
- Knowledge map
- Event tracking
- Information retrieval
- Knowledge concepts

Kuhn's Descriptive Project
Immature Science; Normal Science; Anomalies; Crisis; Revolution
Evolutionary theory is evolving

Tasks in News Detection
[Diagram labels: News Feeds; Segmentation; Detection (On-Line, Retro); Tracking; Might be Relevant]

USS Cole, October 12, 2000
World Trade Center and Pentagon, September 11, 2001
Location: Aden, Yemen
Date: October 12, 2000
11:18 am (UTC+3)
Attack type: suicide bombing
Deaths: 19 (including the 2 perpetrators)
Injured: 39
Perpetrator(s): al-Qaeda, carried out by Ibrahim al-Thawr and Abdullah al-Misawa

The 9/11 attacks were preventable
- FBI agents in Minnesota: Zacarias Moussaoui's personal computer
- The FBI Phoenix memo (George Will)
- Dr. Bhandari (Virtual Gold, Inc): data mining could have prevented the 9/11 tragedy
- Terrorists: the 9/11 terrorist network

The Red Army Faction threat
- Horst Herold (head of the German federal police) built a data-mining information network at Germany's Bundeskriminalamt, 1972
- Data sources: housing sales, energy companies, ...
- Outcome: Rolf Heissler (RAF member)
- Aftermath: Herold retired after reports accused him of violating human rights; criminal statutes were amended in 1986
- Three of the 9/11 pilots came from Hamburg

Epidemic alert and reporting system
The World Health Organization set up its Epidemic Alert and Response system many years ago. Because some countries may play down outbreak reports out of concern for the economic impact, the system includes software that collects relevant material from news media websites around the world; twenty experts then analyze the information in that material.

Information and knowledge
- Amazon: digital camera sales
- News events: The Washington Times
- Hot research topics at the US National Institutes of Health (NIH): Proposals by Funding/Date across IRGs and Activity Types
- Clinical practice guidelines: Athena/EON - Stanford
- Athena clinical guidelines: R. D. Shankar, et al. 2001
- Hypertension clinical guideline: Athena Hypertension Guideline, A. Advani, et al. 2003

Disaster-affected households (financial assistance policy)
- Loans (affected households, temporary housing)
- Generative / Discriminative
- Home reconstruction program; financial institutions; loans; provisional statute for earthquake reconstruction; affected households; housing; interest; households with damaged homes
[Diagram labels: object; method; Object: attribute; Object: condition; Object: Attribute (condition); Specify; Generalize]

Integrating Distributed Knowledge
- Adaptive knowledge infrastructure is in place
- Knowledge resources identified and shared appropriately
- Timely knowledge gets to the right person to make decisions
- Intelligent tools for authoring through archiving
- Cohesive knowledge development between JPL, its partners, and customers
- Instrument design is semi-automatic based on knowledge repositories
- Mission software auto-instantiates based on unique mission parameters
- KM principles are part of Lab culture and supported by layered COTS products
- Remote data management allows spacecraft to self-command
- Knowledge gathered anyplace from hand-held devices using standard formats on interplanetary Internet
- Expert systems on spacecraft analyze and upload data
- Autonomous agents operate across existing sensor and telemetry products
- Industry and academia supply spacecraft parts based on collaborative designs derived from JPL's knowledge system
Capturing Knowledge; Sharing Knowledge
Missions: Mars Net; Europa Orbiter; Space Interferometry Mission
- Enables capture of knowledge at the point of origin, human or robotic, without invasive technology
- Enables seamless integration of systems throughout the world and with robotic spacecraft
- Enables sharing of essential knowledge to complete Agency tasks
Modeling Expert Knowledge: Systems model expert
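
Worked example for the Concept Lattice and Rules Extraction slides earlier in the deck. The sketch below recomputes the formal concepts from the relation R (documents d1..d4 against keywords t1..t6) and derives the terms implied by t3 and t5. It is a brute-force illustration that is only practical for a context this small; the function names are made up for the example and do not come from any concept-analysis library.

```python
from itertools import combinations

# Input relation R from the slide: which keywords each document contains.
R = {
    "d1": {"t1", "t3", "t5", "t6"},
    "d2": {"t1", "t3", "t5", "t6"},
    "d3": {"t2", "t4"},
    "d4": {"t1", "t4", "t6"},
}
ALL_TERMS = set().union(*R.values())

def common_terms(docs):
    """Terms shared by every document in docs (all terms for the empty set)."""
    shared = set(ALL_TERMS)
    for doc in docs:
        shared &= R[doc]
    return shared

def docs_having(terms):
    """Documents that contain every term in terms."""
    return {doc for doc, doc_terms in R.items() if terms <= doc_terms}

def formal_concepts():
    """Enumerate (extent, intent) pairs by closing every subset of documents."""
    concepts = set()
    names = sorted(R)
    for size in range(len(names) + 1):
        for subset in combinations(names, size):
            intent = common_terms(subset)
            extent = docs_having(intent)
            concepts.add((frozenset(extent), frozenset(intent)))
    return concepts

def implied_by(term):
    """Terms present in every document that contains the given term."""
    return common_terms(docs_having({term})) - {term}

concepts = formal_concepts()
for extent, intent in sorted(concepts, key=lambda c: (-len(c[0]), sorted(c[1]))):
    print(sorted(extent), sorted(intent))
print("t3 implies", sorted(implied_by("t3")))
print("t5 implies", sorted(implied_by("t5")))
```

The enumeration yields the seven concepts C1 to C7 listed on the slide, and the implications for t3 and t5 agree with rules R1 to R3: each of the two terms implies t1, t6, and the other term.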