# From Data Fusion to Knowledge Fusion

## Introduction

The task of data fusion is to identify the true values of data items (e.g., the true date of birth of Tom Cruise) among multiple observed values drawn from different sources (e.g., Web sites) of varying and unknown reliability. Knowledge fusion identifies true subject-predicate-object triples extracted by multiple information extractors from multiple information sources.

## Some concepts

Extractor: To build a knowledge base, we employ multiple knowledge extractors. This involves three key steps:

- Identifying which parts of the data indicate a data item and its value.
- Linking any entities that are mentioned to the corresponding entity identifier.
- Linking any relations that are mentioned to the corresponding knowledge base schema.

Subject-predicate-object triples: We can view the form (subject, predicate, object) as a linkage of entities and relations, e.g., (Tom Cruise, date_of_birth, 7/3/1962).
## Contribution

We define the knowledge fusion problem and adapt existing data fusion techniques to solve it. We suggest some simple improvements to existing methods that substantially improve their quality. We present a detailed error analysis of our methods and a list of suggested directions for future research to address some of the new problems raised by knowledge fusion.

## System architecture

Our goal is to build a high-quality, Web-scale knowledge base. (The original slides depict the system architecture in a figure.)

## Data fusion methods

- Voting: Among conflicting values, each value receives one vote from each data source, and we take the value with the highest vote count (a minimal sketch of this baseline follows the list).
- Quality-based: Quality-based methods evaluate the trustworthiness of data sources and accordingly give a higher vote count to a high-quality source.
- Relation-based: Relation-based methods extend quality-based methods by additionally considering the relationships between the sources.
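To make the voting baseline concrete, here is a minimal sketch; the function name and the plain-majority formulation are illustrative, not the authors' implementation.

```python
from collections import Counter

def vote(observations):
    """observations: mapping source -> claimed value for one data item.
    Plain majority voting: each source casts one vote for its value."""
    counts = Counter(observations.values())
    value, _ = counts.most_common(1)[0]
    return value

# Three sites claim a birth date; the majority value wins.
print(vote({"siteA": "7/3/1962", "siteB": "7/3/1962", "siteC": "3/7/1962"}))
# -> 7/3/1962
```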
## Knowledge base

We follow the data format and ontology of Freebase and store the knowledge as (subject, predicate, object) triples; data in Freebase is structured, but it is not natively stored as triples. Note that in each triple the (subject, predicate) pair corresponds to a "data item" in data fusion, and the object can be considered a "value" provided for that data item, much like a (key, value) pair (see the sketch below). Our goal is therefore to extract, via the extractors, new facts about subjects and predicates.
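The data-item view of a triple can be made concrete with a small sketch; the class and helper names here are illustrative assumptions, not from the paper.

```python
from typing import NamedTuple

class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str  # the "object" in (subject, predicate, object)

def data_item(t: Triple) -> tuple[str, str]:
    """The (subject, predicate) pair plays the role of a data-fusion "data item";
    the object is the "value" claimed for it, like a (key, value) pair."""
    return (t.subject, t.predicate)

t = Triple("Tom Cruise", "date_of_birth", "7/3/1962")
print(data_item(t), "->", t.obj)  # ('Tom Cruise', 'date_of_birth') -> 7/3/1962
```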
## Web sources

We crawl a large set of Web pages and extract knowledge from four types of Web content; the source types are TXT (plain text), DOM (DOM trees), TBL (Web tables), and ANO (Web annotations). Contributions from Web sources are highly skewed: the largest Web pages each contribute 50K triples, while half of the Web pages each contribute a single triple.
## Extractors

There are three tasks in knowledge extraction:

- Triple identification: deciding which words or phrases describe a triple.
- Entity linkage: deciding which Freebase entity a word or phrase refers to.
- Predicate linkage: deciding which Freebase predicate is expressed in a given piece of text (needed because predicates are often implicit).
## Quality of extracted knowledge

Evaluating the quality of the extracted triples requires a gold standard that contains both true and false triples. Freebase can provide one under the closed-world assumption; however, this assumption is not always valid because facts are missing. Instead, we use the local closed-world assumption (LCWA; a labeling sketch follows below). Some of the erroneous triples are due to wrong information provided by Web sources, whereas others are due to mistakes in extraction, and extraction is responsible for the majority of the errors (more than 96% of the errors are introduced by extractors).

The more Web sources from which we extract a triple, or the more extractors that extract it, the more likely the triple is to be true. But there can be exceptions (the original slides illustrate these in a figure).
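A minimal sketch of LCWA-style labeling follows; the function name and the in-memory `kb` mapping are illustrative assumptions, not the paper's implementation. Under LCWA, an extracted triple is labeled true if the knowledge base contains it, false if the knowledge base has other values for the same data item but not this one, and unknown if the data item is absent entirely.

```python
def lcwa_label(kb, subject, predicate, obj):
    """Label a triple under the local closed-world assumption (LCWA).
    kb: mapping (subject, predicate) -> set of objects known to be true."""
    known = kb.get((subject, predicate))
    if known is None:
        return "unknown"   # the KB says nothing about this data item
    if obj in known:
        return "true"
    return "false"         # the KB knows this data item, but not this value

kb = {("Tom Cruise", "date_of_birth"): {"7/3/1962"}}
print(lcwa_label(kb, "Tom Cruise", "date_of_birth", "3/7/1962"))   # false
print(lcwa_label(kb, "Tom Cruise", "place_of_birth", "Syracuse"))  # unknown
```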
## Knowledge fusion

Given a set of extracted knowledge triples, each associated with provenance information such as the extractor and the Web source, knowledge fusion computes for each unique triple the probability that it is true. There are three challenges:

- The input of knowledge fusion is three-dimensional (the third dimension is the extractors).
- The output of knowledge fusion is a truthfulness probability for each triple.
- The scale of the knowledge is typically huge.
## Adapting data fusion techniques

VOTE: For each data item, VOTE counts the sources for each value and trusts the value with the largest number of sources.

ACCU: For each source S that provides a set of values V_S, the accuracy of S is computed as the average probability of the values in V_S. For each data item D and the set of values V_D provided for D, the probability of a value v ∈ V_D is computed as its a posteriori probability conditioned on the observed data, using Bayesian analysis. ACCU makes three assumptions:

- For each D there is a single true value.
- There are N uniformly distributed false values.
- The sources are independent of each other.

By default we set N = 100 and the initial accuracy A = 0.8 (a sketch of the resulting iteration follows this list).
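The following is a minimal sketch of the ACCU fixpoint as described above: value probabilities are computed from source accuracies by Bayesian analysis, and accuracies are re-estimated as the average probability of each source's values. The log-odds vote count ln(N·A/(1−A)), the clamping constants, and all names are my reading of the method under its stated assumptions, not the paper's code.

```python
import math
from collections import defaultdict

N = 100          # assumed number of uniformly distributed false values
DEFAULT_A = 0.8  # default source accuracy before any evidence

def accu(claims, iterations=10):
    """claims: list of (source, data_item, value) observations.
    Returns ({(data_item, value): probability}, {source: accuracy})."""
    accuracy = defaultdict(lambda: DEFAULT_A)
    by_item = defaultdict(lambda: defaultdict(set))  # item -> value -> sources
    for source, item, value in claims:
        by_item[item][value].add(source)
    probs = {}
    for _ in range(iterations):
        # Step 1 (Bayesian analysis): a value's vote count is the sum of the
        # log-odds ln(N * A(S) / (1 - A(S))) over the sources claiming it.
        probs = {}
        for item, values in by_item.items():
            score = {v: sum(math.log(N * accuracy[s] / (1 - accuracy[s]))
                            for s in sources)
                     for v, sources in values.items()}
            z = max(score.values())  # subtract the max for numerical stability
            total = sum(math.exp(c - z) for c in score.values())
            for v, c in score.items():
                probs[(item, v)] = math.exp(c - z) / total
        # Step 2: a source's accuracy is the average probability of its values.
        sums, counts = defaultdict(float), defaultdict(int)
        for item, values in by_item.items():
            for v, sources in values.items():
                for s in sources:
                    sums[s] += probs[(item, v)]
                    counts[s] += 1
        for s in counts:
            # Clamp away from 0 and 1 so the log-odds stay finite.
            accuracy[s] = min(max(sums[s] / counts[s], 0.01), 0.99)
    return probs, dict(accuracy)

claims = [("siteA", ("Tom Cruise", "date_of_birth"), "7/3/1962"),
          ("siteB", ("Tom Cruise", "date_of_birth"), "7/3/1962"),
          ("siteC", ("Tom Cruise", "date_of_birth"), "3/7/1962")]
probs, acc = accu(claims)  # the majority value ends up with the higher probability
```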
POPACCU: POPACCU (more robust than ACCU) extends ACCU by removing the assumption that wrong values are uniformly distributed; instead, it computes the distribution from real data and plugs it into the Bayesian analysis.

Adaptations: We reduce the dimensionality of the knowledge fusion input by considering each (Extractor, URL) pair as a data source, which we call a provenance. For ACCU and POPACCU, we simply take the probability computed by the Bayesian analysis. For VOTE, we assign a probability as follows: if a data item D = (s, p) has n provenances in total and a triple T = (s, p, o) has m provenances, the probability of T is p(T) = m/n (a short sketch follows). We scale up the three methods using a MapReduce-based framework.
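The adapted VOTE score is just a provenance ratio; a minimal sketch, with illustrative names:

```python
from collections import defaultdict

def vote_probability(extractions):
    """extractions: iterable of (provenance, (subject, predicate, obj)) pairs.
    Returns p(T) = m/n for each unique triple T: m provenances claim T,
    out of n provenances for its data item (subject, predicate)."""
    triple_prov = defaultdict(set)  # triple -> provenances claiming it
    item_prov = defaultdict(set)    # data item -> all of its provenances
    for prov, (s, p, o) in extractions:
        triple_prov[(s, p, o)].add(prov)
        item_prov[(s, p)].add(prov)
    return {t: len(ps) / len(item_prov[(t[0], t[1])])
            for t, ps in triple_prov.items()}
```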
## Experimental evaluation

Calibration curve: We plot the predicted probability against the real probability, dividing the triples into l + 1 buckets (we use l = 20 when reporting results). We compute the real probability of each bucket as the percentage of true triples in the bucket according to our gold standard. We summarize the calibration using two measures: the deviation computes the average square loss between predicted probabilities and real probabilities, and the weighted deviation is the same except that it weights each bucket by the number of triples in the bucket (a sketch of the computation follows).
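A minimal sketch of the calibration computation; the bucketing scheme and names are one reading of the description above, not the paper's code.

```python
def calibration(predictions, l=20):
    """predictions: list of (predicted_probability, is_true) pairs.
    Buckets them into l + 1 bins; returns (deviation, weighted_deviation)."""
    buckets = [[] for _ in range(l + 1)]
    for p, is_true in predictions:
        buckets[min(int(p * l), l)].append((p, is_true))
    dev_terms, weighted, total = [], 0.0, 0
    for bucket in buckets:
        if not bucket:
            continue
        pred = sum(p for p, _ in bucket) / len(bucket)  # mean predicted probability
        real = sum(t for _, t in bucket) / len(bucket)  # fraction of true triples
        dev_terms.append((pred - real) ** 2)
        weighted += len(bucket) * (pred - real) ** 2
        total += len(bucket)
    return sum(dev_terms) / len(dev_terms), weighted / total
```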
## Granularity of provenances

The basic models consider an (Extractor, URL) pair as a provenance, but we can vary the granularity: the page level versus the site level; the predicate level versus all triples; the pattern level versus the extractor level.
## Provenance selection

We consider filtering provenances by two criteria: coverage and accuracy. We compute triple probabilities for data items where at least one triple is extracted more than once, and then re-evaluate the accuracy of each provenance; we ignore provenances for which we would still have to use the default accuracy. We then apply a threshold on accuracy to ignore low-accuracy provenances (a filtering sketch follows), although this can cause a problem: we may lose all provenances for a triple and thus be unable to predict its probability. Another option is to leverage the gold standard (by Freebase) to filter provenances…
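A minimal sketch of coverage- and accuracy-based provenance filtering; the thresholds, the `None`-for-default-accuracy convention, and all names are illustrative assumptions.

```python
def select_provenances(prov_stats, min_extractions=2, min_accuracy=0.5):
    """prov_stats: mapping provenance -> (num_extractions, accuracy), where
    accuracy is None when it was never re-evaluated (i.e., still the default).
    Keeps provenances with enough coverage and a re-evaluated accuracy that
    clears the threshold."""
    kept = set()
    for prov, (n, acc) in prov_stats.items():
        if acc is None:              # still at the default accuracy: skip
            continue
        if n >= min_extractions and acc >= min_accuracy:
            kept.add(prov)
    return kept
```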