文本大數(shù)據(jù)分析與挖掘介紹_第1頁
文本大數(shù)據(jù)分析與挖掘介紹_第2頁
文本大數(shù)據(jù)分析與挖掘介紹_第3頁
文本大數(shù)據(jù)分析與挖掘介紹_第4頁
文本大數(shù)據(jù)分析與挖掘介紹_第5頁
已閱讀5頁,還剩47頁未讀 繼續(xù)免費(fèi)閱讀

下載本文檔

版權(quán)說明:本文檔由用戶提供并上傳,收益歸屬內(nèi)容提供方,若內(nèi)容存在侵權(quán),請進(jìn)行舉報或認(rèn)領(lǐng)

文檔簡介

1、文本大數(shù)據(jù)分析與挖掘:機(jī)遇,挑戰(zhàn),及應(yīng)用前景技術(shù)創(chuàng)新,變革未來Analysis and Mining of Big Text Data: Opportunities, Challenges,and ApplicationsText data cover all kinds of topics65M msgs/dayTopics: People EventsProducts Services, Sources: Blogs Microblogs Forums Reviews ,53M blogs 1307M posts115M users10M groups45M reviews人= 主觀智能“

2、傳感器”Humans as Subjective & Intelligent “Sensors”Real WorldSensorDataReportSenseThermometerWeather3C , 15F, LocationsGeo SensorNetwork SensorNetworks41N and 120W .01000100011100PerceiveExpress“Human Sensor”3文本數(shù)據(jù)的特殊應(yīng)用價值Unique Value of Text Data對所有大數(shù)據(jù)應(yīng)用都有應(yīng)用價值: Useful to all big dataapplications特別有助于挖掘,

3、利用有關(guān)人的行為,心態(tài),觀點(diǎn)的知識: Especially useful for mining knowledge about peoples behavior, attitude, and opinions直接表達(dá)知識;高質(zhì)量數(shù)據(jù)( Directly express knowledge about our world ) 小文本數(shù)據(jù)應(yīng)用 (Small text data are also useful!)Data Information KnowledgeText DataOpportunities of Text Mining ApplicationsReal WorldObserved

4、World(English)PerceiveExpress1. Mining knowledge aboutlanguage(Perspective)2. Mining content of text data3. Mining knowledgeabout the observer4. Infer other real-world variables (predictive analytics)+ Non-Text DataText Data+ ContextChallenges in Understanding Text Data (NLP) Adogischasingaboyonthep

5、layground DetNounAuxVerbDetNounPrepDetNounNoun Phrase Complex VerbNoun PhraseNoun PhraseVerb PhraseVerb PhraseSentenceDog(d1).Boy(b1). Playground(p1). Chasing(d1,b1,p1).Semantic analysisLexicalanalysis(part-of-speechtagging)Prep PhraseSyntactic analysis(Parsing)A person saying this maybe reminding a

6、nother person to get the dog back.Pragmatic analysis (speech act)Scared(x) if Chasing(_,x,_).+Scared(b1)InferenceNLP is hard!Natural language is designed to make human communicationefficient. As a result,we omit a lot of common sense knowledge, which we assume the hearer/reader possesses.we keep a l

7、ot of ambiguities, which we assume the hearer/readerknows how to resolve.This makes EVERY step in NLP hardAmbiguity is a killer!Common sense reasoning is pre-required.Examples of ChallengesWord-level ambiguity:“root” has multiple meanings (ambiguous sense)Syntactic ambiguity:“natural language proces

8、sing” (modification)“A man saw a boy with a telescope.” (PP Attachment)Anaphora resolution: “John persuaded Bill to buy a TV for himself.”(himself = John or Bill?)Presupposition: “He has quit smoking” implies that he smoked before.The State of the Art: Mostly Relying on Machine Learning A dog ischas

9、ingaboyontheplaygroundDetNounAuxVerbDetNounPrepDetNounNoun PhraseComplex VerbNoun PhraseNoun PhrasePrep PhraseVerb PhraseVerb PhraseSentenceSemantics: some aspectsEntity/relation extractionWord sense disambiguationSentiment analysisPOS Tagging: 97%Parsing: partial 90%(?)Speech act analysis: ?Inferen

10、ce: ?Robustand general NLP tends to be shallowwhile deep understanding doesnt scale up.Grand Challenge:How can we leverage imperfect NLP to build aperfect application?如何將不完善的技術(shù)轉(zhuǎn)化為完善的產(chǎn)品?Answer:Having humans in the loop!優(yōu)化人機(jī)合作!文本數(shù)據(jù)鏡拓寬人的感知TextScope to enhance human perceptionTextScope(文本數(shù)據(jù)鏡)MicroscopeT

11、elescope集信息檢索和文本分析挖掘與一體 支持交互式分析,決策支持TextScope Interface & Major Text Mining TechniquesTextScopeTask PanelTopic AnalyzerOpinionPredictionEvent RadarSearch BoxMyFilter1MyFilter2Select TimeSelect RegionMicrosoft (MSFT,) Google, IBM (IBM) and other cloud- computing rivals of Amazon Web Services are brac

12、ing for an AWS partnership announcement with VMware expected to be announced Thursday. My WorkSpaceProject 1Alert A Alert B .檢索過濾分類可視化摘要推薦主題分析觀點(diǎn)分析預(yù)測交互式工作流程管理TextScope in Action: interactive decision supportReal WorldSensor kSensor 1Non-Text DataTextDataJoint Mining of Non-Text and TextPredictiveMode

13、lMultiple Predictors(Features)Predicted Valuesof Real World VariablesOptimal Decision MakingApplication Example 1: Aviation SafetyReal WorldSensor kSensor 1Non-Text DataTextDataJoint Mining of Non-Text and TextPredictiveModelMultiple Predictors(Features)blesPredicted Valuesf Real World VariaOptimal

14、Decision MakingAviation SafetyAviation AdministratorAnomalous events &o causesAbundance of text data in the aviation domainCollecting reports since 1976860,000 reports as of Dec. 2009Monthly intake has been increasing (4k reports/month)Slide source: /overview/summary.htmlLots of useful knowledge bur

15、ied in textASRS Report ACN: 928983 (Date: 201101,Time: 1801-2400. )We were delayed inbound for about 2 hours and 20 minutes. On the approach there was ice that accumulated on the aircraft. The Captain wrote up The flight crew who picked up the plane the following morning notified us of an incorrect

16、remark section write up. I believe a few years ago, there was a different procedure for writing up aborted takeoffs. I think there was some confusion as to what the proper write-up for the aborted takeoff was. A contributing factor for this incorrect entry into the log may have been fatigue. I had p

17、ersonally been awake for about 14 hours and still hadanother leg to do. Also a contributing factor is that this event does not happen regularly. A more thorough review and adherence to the operations manual section regarding aircraft status would have prevented this, as well as, a better recognition

18、 of the onset of fatigue. The manual is sometimes so large that finding pertinent data is difficult. Even after it was determined that the event had occurred, it took me 15 to 20 minutes to find the section regarding aborted takeoffs.Event CubeMultidimensonal Text Databasei98.01999.091.0298.02LAXSJC

19、 MIA AUSovershootLocationTopicturbulence birds undershootCAFLTXLocation19981999DeviationTopic EncounterdownEvent CubeRepresentationAnalystMultidimensional OLAP, Ranking, Cause Analysis,Topic Summarization/ComparisonAnalysis SupportDuo Zhang, ChengXiang Zhai, Jiawei Han, Ashok Srivastava, Nikunj Oza.

20、 Topic Modeling for OLAP on Multidimensional Text Databases: Topic Cube and its Applications, Statistical Analysis and Data Mining, Vol. 2, pp.378-395, 2009.Sample Topic Coverage ComparisonComparison of distributions of anomalies in FL, TX, and CATurbulance (Texas)Improper Documentation (Florida)Com

21、parative Analysis of Shaping FactorsTexasFloridaApplication Example 2: Medical & HealthReal WorldSensor kSensor 1Non-Text DataTextDataJoint Mining of Non-Text and TextPredictiveModelMultiple Predictors(Features)Predicted Valuesf Real World VariablesOptimal Decision MakingMedical & HealthDoctors, Nur

22、ses, PatientsDiagnosis, optimal treatmento Side effects of drugs, Overview of Text Mining for Medical/Health ApplicationsEHR (Patient Records)How can we find similar medical casesin medical literature, in online forums, ?MedicalCase RetrievalSimilar Medical CasesHow can we analyze EHR to discover va

23、luable medical knowledge (e.g., symptom evolution profile of a disease)from EHR?Medical Knowledge DiscoveryMedical KnowledgeImproved Health CareMedical Case RetrievalQuery: “Female patient, 25 years old, with fatigue and a swallowing disorder (dysphagia worsening during a meal).The frontal chest X-r

24、ay shows opacity with clear contours in contact with the right heart border. Right hilar structures are visible through the mass. The lateral X-ray confirms the presence of a mass in the anterior mediastinum. On CT images, the mass has a relatively homogeneous tissue density.”Find all medical litera

25、ture articles discussing a similar case We developed techniques to leverage medical ontology andFeedback to improve accuracy. The UIUC-IBM team wasranked #1 in ImageCLEF 2010 evaluation.Parikshit Sondhi, Jimeng Sun, ChengXiang Zhai, Robert Sorrentino and Martin S. Kohn. Leveraging Medical Thesauri a

26、nd Physician Feedback for Improving Medical Literature Retrieval for Case Queries, Journal of the American Medical Informatics Association , 19(5), 851-858 (2012). doi:10.1136/amiajnl-2011-000293.Extraction of Symptom Graphs from EHREHR (Patient Records)Multi-Level Symptom GraphsDiscovery of symptom

27、 profiles of diseasesPredict the future onset of a disease (e.g., Congestive Heart Failure) for a patientDiscovered symptoms improves accuracy of prediction by +10%Parikshit Sondhi, Jimeng Sun, Hanghang Tong, ChengXiang Zhai. SympGraph: A Mining Framework of Clinical Notes through Symptom Relation G

28、raphs, Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD12), pp. 1167-1175, 2012.Discovery of Adverse Drug Reactions from ForumsGreen: Disease symptoms Blue: Side effect symptoms Red: DrugDrug: CefalexinADR:panic attack faint.Sheng Wang et al. 20

29、14. SideEffectPTM: an unsupervised topic model to mine adverse drug reactions from health forums. In ACM BCB 2014.Sample ADRs DiscoveredDrug(Freq)Drug UseSymptoms in Descending OrderZoloftantidepressant(84)weigh gain, weight, depression, side effects, mgs, gain weight, anxiety, nausea, head, brain,

30、pregnancy, pregnant, headaches, depressed, tiredAtivan (33)anxiety disordersAtivan, sleep, Seroquel, doc prescribed seroqual, raising blood sugar levels, anti-psychotic drug, diabetic, constipation, diabetes, 10mg, benzo, addictedTopamax(20)anticonvulsantTopmax, liver, side effects, migraines, heada

31、ches, weight, Topamax, pdoc, neurologist, supplement, sleep, fatigue, seizures, liver problems, kidney stonesEphedrine (2)stimulantdizziness, stomach, Benadryl, dizzy, tired, lethargic, tapering, tremors, panic attach, headUnreported to FDAMining Traditional Chinese Medicine Patient RecordsCollabora

32、tion with Beijing TCM Data CenterClinical warehouse since 2007More than 300,000 clinical cases from six hospitalsEach hospital has 3 million patient visitsTwo lines of workSubcategorization of patient recordsTCM knowledge discoveryBeijing TCM Data CenterSubcategorization of Patient RecordsEdward W H

33、uang, Sheng Wang, Runshun Zhang, Baoyan Liu, Xuezhong Zhou, and ChengXiang Zhai. PaReCat: Patient Record Subcategorization for Precision Traditional Chinese Medicine. ACM BCB, Oct. 2016.TCM Knowledge Discovery10,907 patients TCM records in digestive system treatment3,000 symptoms, 97 diseases and 65

34、2 herbsMost frequently occurring disease: chronic gastritisMost frequently occurring symptoms: abdominal pain and chillsGround truth: 27,285 manually curated herb-symptom relationship.Sheng Wang, Edward Huang, Runshun Zhang, Xiaoping Zhang, Baoyan Liu, Xuezhong Zhou, and ChengXiang Zhai, A Condition

35、al Probabilistic Model for Joint Analysis of Symptoms, Diagnoses, and Herbs in Traditional Chinese Medicine Patient Records , IEEE BIBM 2016.Top 10 herb-symptoms relationshipsTypical Symptoms of three DiseasesTypical Herbs for three DiseasesApplication Example 3:Business intelligenceReal WorldSensor

36、 kSensor 1Non-Text DataTextDataJoint Mining of Non-Text and TextPredictiveModelMultiple Predictors(Features)blesPredicted Valuesf Real World VariaOptimal Decision MakingBusiness analysts, Market researcherProductsBusiness intelligenceo Consumer trendsMotivationHow to infer aspect ratings?ValueLocati

37、onServiceHow to infer aspect weights?ValueLocationServiceHongning Wang, Yue Lu, ChengXiang Zhai. Latent Aspect Rating Analysis on Review Text Data: A Rating Regression Approach, Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD10), pages 115-124,

38、 2010.Solving LARA in two stages: Aspect Segmentation + Rating RegressionReviews + overall ratingsaccommodating:1smile:1 friendliness:1 attentiveness:1Aspect segmentsTerm Weightslocation:10.0amazing:12.9walk:10.1anywhere:10.9room:10.1nicely:11.7appointed:10.13.9comfortable:1nice:4

39、.85.8Aspect RatingAspect WeightObservedAspect Segmentation+Latent Rating RegressionLatent!HotelValueRoomLocationCleanlinessGrand Mirage Resort4.2(4.7)3.8(3.1)4.0(4.2)4.1(4.2)Gold Coast Hotel4.3(4.0)3.9(3.3)3.7(3.1)4.2(4.7)Eurostars Grand Marina Hotel3.7(3.8)4.4(3.8)4.1(4.9)4.5(4.8)Sample Re

40、sult 1: Rating DecompositionReveal detailed opinions at the aspect levelHotels with the same overall rating but different aspect ratings(All 5 Stars hotels, ground-truth in parenthesis.)Sample Result 2: Comparison of reviewersReviewer-level Hotel A Different reviewers rat Reveal differences in onaly

41、sisings on the same hotelRoomLocationCleanliness3.5(4.0)3.7(4.0)5.8(5.0)3.0(3.0)5.0(4.0)3.5(4.0)(Hotel Riu Palace Punta Cana)pinions of different reviewersReviewerValueMr.Saturday3.7(4.0)Salsrug5.0(5.0)Sample Result 3:Aspect-Specific Sentiment LexiconUncover sentimental information directly from the

42、 dataValueRoomsLocationCleanlinessresort 22.80view 28.05restaurant 24.47clean 55.35value 19.64comfortable 23.15walk 18.89smell 14.38excellent 19.54modern 15.82bus 14.32linen 14.25worth 19.20quiet 15.37beach 14.11maintain 13.51bad -24.09carpet -9.88wall -11.70smelly -0.53money -11.02smell -8.83bad -5

43、.40urine -0.43terrible -10.01dirty -7.85road -2.90filthy -0.42overprice -9.06stain -5.85website -1.67dingy -0.38Application 1: Discover consumer preferencesAmazon reviews: no guidancebattery lifeaccessoryservicefile formatvolumevideoApplication 2: User Rating Behavior AnalysisExpensive HotelCheap Ho

44、tel5 Stars3 Stars5 Stars1 StarValue0.1340.1480.1710.093Room0.0980.1620.1260.121Location0.1710.0740.1610.082Cleanliness0.0810.1630.1160.294Service0.2510.1010.1010.049Peoplelikeexpensive hotels because ofgood servicePeoplelike cheap hotels because of good valueApplication 3:Personalized Recommendation

45、 of EntitiesQuery: 0.9 value0.1 othersNon-PersonalizedPersonalizedApplication Example 4: Prediction of Stock MarketReal WorldSensor kSensor 1Non-Text DataTextDataJoint Mining of Non-Text and TextPredictiveModelMultiple Predictors(Features)iablesPredicted ValuReal World VarOptimal Decision MakingStoc

46、k tradersEvents in Real WorldMarket volatility esof Stock trends, TimeAny clues in the companion news stream?Dow Jones Industrial Average Source: Yahoo FinanceText Mining for Understanding Time SeriesWhat might have caused the stock market crash?Sept 11 attack!Stock-Correlated Topics in New York Tim

47、es:June 2000 Dec. 2011AAMRQ (American Airlines)AAPL (Apple)russia russian putin europe europeangermanybush gore presidential police court judge airlines airport air united trade terrorism food foods cheese nets scott basketball tennis williams openpaid notice st russia russian europeolympicgames oly

48、mpics she her msoil ford prices black fashion blacksfootball giants jets japan japanese planeawards gay boymoss minnesota chechnyaTopics are biased toward each time seriescomputer technology softwareinternet com webHyun Duk Kim, Malu Castellanos, Meichun Hsu, ChengXiang Zhai, Thomas A. Rietz, Daniel

49、 Diermeier. Mining causal topics in text data: iterative topic modeling with time series feedback, Proceedings of the 22nd ACM international conference on Informationand knowledge management (CIKM 13), pp. 885-890, 2013.“Causal Topics” in 2000 Presidential ElectionTop Three Wordsin Significant Topic

50、s from NY Timestax cut 1screen pataki guiliani enthusiasm door symbolic oil energy pricesnews w top pres al vicelove tucker presentedpartial abortion privatization court supreme abortion gun control nraIssues known to be important in the2000 presidential electionText: NY Times (May 2000 - Oct. 2000)

51、 Time Series: Iowa Electronic Market /iem/Information Retrieval with Time Series Query20002001 News7060504030201002000/7/32000/8/32000/9/32000/10/32000/11/32000/12/32001/1/32001/2/32001/3/32001/4/32001/5/32001/6/32001/7/32001/8/32001/9/32001/10/32001/11/32001/12/3Price ($)DateApple Stock PriceRANKDA

52、TEEXCERPT1239/29/200012/8/200010/19/200044/19/200157/20/2001Expect earning will be far below$4 billion cash in company Disappointing earning reportDow and Nasdaq soar after rate cut by Federal ReserveApples new retail storeHyun Duk Kim, Danila Nikitin, ChengXiang Zhai, Malu Castellanos, and Meichun Hsu. 2013. Information Retrievalwith Time Series Query. In Proceedings of the 2013 Conference on the Theory of Information Retrieval (ICTIR 13),Top ranked documentsby American Airli

溫馨提示

  • 1. 本站所有資源如無特殊說明,都需要本地電腦安裝OFFICE2007和PDF閱讀器。圖紙軟件為CAD,CAXA,PROE,UG,SolidWorks等.壓縮文件請下載最新的WinRAR軟件解壓。
  • 2. 本站的文檔不包含任何第三方提供的附件圖紙等,如果需要附件,請聯(lián)系上傳者。文件的所有權(quán)益歸上傳用戶所有。
  • 3. 本站RAR壓縮包中若帶圖紙,網(wǎng)頁內(nèi)容里面會有圖紙預(yù)覽,若沒有圖紙預(yù)覽就沒有圖紙。
  • 4. 未經(jīng)權(quán)益所有人同意不得將文件中的內(nèi)容挪作商業(yè)或盈利用途。
  • 5. 人人文庫網(wǎng)僅提供信息存儲空間,僅對用戶上傳內(nèi)容的表現(xiàn)方式做保護(hù)處理,對用戶上傳分享的文檔內(nèi)容本身不做任何修改或編輯,并不能對任何下載內(nèi)容負(fù)責(zé)。
  • 6. 下載文件中如有侵權(quán)或不適當(dāng)內(nèi)容,請與我們聯(lián)系,我們立即糾正。
  • 7. 本站不保證下載資源的準(zhǔn)確性、安全性和完整性, 同時也不承擔(dān)用戶因使用這些下載資源對自己和他人造成任何形式的傷害或損失。

最新文檔

評論

0/150

提交評論