
Scale-up x Scale-out: A Case Study Using Nutch/Lucene
Maged Michael, José E. Moreira, Doron Shiloach, Robert W. Wisniewski
IBM Thomas J. Watson Research Center, Yorktown Heights, NY 10598-0218

Translation:

Abstract

For the past several years, scale-up solutions in the form of large multiprocessors have represented the mainstream of commercial computing. The major server vendors continue to offer ever larger and more powerful machines, while more recently scale-out solutions, in the form of clusters of smaller machines, have gained increased acceptance in commercial computing. Scale-out solutions are particularly effective for high-throughput, web-centric applications. In this paper we investigate these two competing approaches to parallelism, scale-up and scale-out, in an emerging search application. Our conclusions show that a scale-out strategy can deliver good performance even on a scale-up machine. Furthermore, scale-out solutions offer a better price/performance ratio, although they add management complexity.

1 Introduction

During the last ten years of commercial computing we have witnessed the complete replacement of uniprocessor computing systems by multiprocessor ones. The technological revolution that began in the early eighties went on to capture most of the commercial computing market in the nineties. We can classify the different approaches to employing multiprocessor systems (both commercial and technical/scientific) into two large groups:

· Scale-up: the deployment of applications on large shared-memory servers (large SMPs).
· Scale-out: the deployment of applications on multiple small interconnected servers (clusters).

During the first phase of this revolution in commercial computing, the advantage of scale-up was clear. SMPs of increasing size, with processors of increasing clock rate, offered ever more computing power to handle the needs of even the largest corporations. SMPs currently represent the mainstream of commercial computing, and companies such as IBM, HP, and Sun invest heavily in building bigger and better SMPs with each generation. More recently, scale-out has attracted growing attention for commercial computing. For many of the new web-based enterprises (e.g., Google, Yahoo, eBay, Amazon), scale-out is the only way to deliver the necessary computational power. In addition, computer manufacturers have made scale-out solutions based on rack-optimized and blade servers easier to deploy. Scale-out has also been the only viable approach to large-scale scientific computing for many years, as the evolution of the TOP500 systems shows.

In this paper we study an emerging commercial application, search over unstructured data, on two distinct systems: one a scale-up system based on the multi-core, multi-threaded POWER5 processor, the other a scale-out system based on IBM BladeCenter blade servers. The two systems were configured at approximately the same price (about $200,000), allowing a fair price/performance comparison. One important conclusion of our work is that a "pure" scale-up approach is not very effective at using all the processors of a large SMP. In pure scale-up, we run only one instance of the application on the SMP, and that instance uses all the available resources (processors). We were more successful at exploiting the POWER5 SMP with a "scale-out-in-a-box" approach, in which multiple instances of the application run concurrently under a single operating system. This latter approach significantly improves performance while preserving the single-system image that is one of the great advantages of SMPs. Another of our conclusions is that, for systems of the same price, the scale-out system can achieve four times the performance. For our application, this performance is measured in queries per second. The scale-out system requires multiple system images, so its performance gain comes at a convenience and management cost; depending on the situation, that cost may or may not be worth the improvement.

The rest of this paper is organized as follows. Section 2 describes the configuration of the scale-up and scale-out systems used in our study. Section 3 presents the Nutch/Lucene workload that ran on our systems. Section 4 presents our conclusions.

2 Scale-up and scale-out systems

In the IBM product line, Systems z, p, and i are all built on multiprocessors spanning a wide range of computational capabilities. We chose the POWER5 p5 575 machine as a representative state-of-the-art system. This 8- or 16-way system has attracted many customers due to its low cost, high performance, and small form factor (2U, or 3.5 inches high, in a 24-inch rack). A picture of the POWER5 p5 575 is shown in Figure 1. The particular p5 575 we used for our measurements has 16 POWER5 cores in 8 dual-core modules and 32 GiB (1 GiB = 1,073,741,824 bytes) of main memory. Each core is dual-threaded, so to the operating system the machine appears as a 32-way SMP. The processor speed is 1.5 GHz. In addition, the p5 575 connects to the outside world through two Gigabit/s Ethernet interfaces, and it has its own dedicated DS4100 storage controller (see below for a description of the DS4100).

Scale-out systems come in many different shapes and forms, but they generally consist of multiple interconnected nodes, each node running a self-contained operating system. We chose BladeCenter as our scale-out platform, a natural choice given the platform's scale-out orientation. The first form of scale-out system to become popular in commercial computing was the rack-mounted cluster. The IBM BladeCenter solution (and similar systems from companies such as HP and Dell) represents the next step after rack-mounted clusters in scale-out systems for commercial computing. The blade servers used in BladeCenter are similar in capability to rack-mounted cluster servers: 4-processor configurations, 16-32 GiB of maximum memory, built-in Ethernet, and expansion cards for Fibre Channel, InfiniBand, Myrinet, or 10-Gbit/s Ethernet. Double-wide blade configurations with up to 8 processors and additional memory are also offered.

The BladeCenter-H is the newest IBM BladeCenter chassis. Like the earlier BladeCenter-1 chassis, it has 14 blade slots for blade servers. It also has space for up to two management modules, four switch modules, four bridge modules, and four high-speed switch modules. (Switch modules 3 and 4 and bridge modules 3 and 4 share the same slots in the chassis.) We populated each chassis with two 1-Gbit/s Ethernet switch modules and two Fibre Channel switch modules. Three different kinds of blades were used in our cluster: JS21 (PowerPC processors), HS21 (Intel Woodcrest processors), and LS21 (AMD Opteron processors). Each blade (JS21, HS21, or LS21) has both a local disk drive (73 GB capacity) and a dual Fibre Channel network adapter; the Fibre Channel adapter connects the blade to the two Fibre Channel switches plugged into each chassis. Approximately half of the cluster (4 chassis) is composed of JS21 blades. These are quad-processor (dual-socket, dual-core) PowerPC 970 blades running at 2.5 GHz, each with 8 GiB of memory. The results reported in this paper focus on these JS21 blades. The DS4100 storage subsystem consists of dual storage controllers, each with a 2-Gbit/s Fibre Channel interface, and houses 14 SATA drives in its main drawer. Although each DS4100 is paired with a specific BladeCenter-H chassis, any blade in the cluster can see every logical unit in the storage system, thanks to the Fibre Channel network we run.

3 The Nutch/Lucene workload

Nutch/Lucene is a framework for implementing search applications. It is representative of a growing class of applications based on search of unstructured data (web pages). We are all used to search engines such as Google and Yahoo that operate on the open Internet; however, search is also an important operation within intranets, the internal networks of companies. Nutch/Lucene is implemented entirely in Java and its code is open source. As a typical search framework, Nutch/Lucene has three major components: (1) crawling, (2) indexing, and (3) query. In this paper we present results for the query component; for completeness, we briefly describe the other components.

Crawling is the operation that navigates and retrieves the information in web pages, populating the set of text documents that will be searched. In search terminology, this set of documents is called the corpus. Crawling can be performed on internal networks (intranets) as well as external networks (the Internet). Crawling, particularly on the Internet, is a complex operation: whether intentionally or not, many web sites are difficult to crawl. Crawling performance is usually limited by the network bandwidth between the crawling system and the systems being crawled.

The Nutch/Lucene search framework includes a parallel indexing operation written using the MapReduce programming model. MapReduce provides a convenient way of addressing an important (though limited) class of real-life commercial applications by hiding parallelism and fault-tolerance issues from programmers, letting them focus on the problem domain. MapReduce was published by Google in 2004 and quickly became a de facto standard for this kind of workload. Parallel indexing in the MapReduce model works as follows. First, the data to be indexed is partitioned into segments of approximately equal size. Each segment is processed in a prescribed way to generate (key, value) pairs, where the key is an indexing term and the value is the set of documents containing that term (and the location of the term within each document). This corresponds to the map phase of MapReduce. In the next phase, the reduce phase, each reducer task collects all the pairs for a given key, producing a single index table for that key. Once all the keys have been processed, we have the complete index for the entire data set.

In most search applications, query represents the vast majority of the computational effort. When performing a query, a set of index terms is presented to a search engine, which then retrieves the documents that best match those terms. The overall architecture of the Nutch/Lucene parallel query engine is shown in Figure 3. The query engine consists of one or more front-ends and one or more back-ends; each back-end holds a segment of the complete data set. The driver represents external users and is also the point at which query performance is measured, in queries per second (QPS). A query operation works as follows. The driver submits a particular query (a set of index terms) to any one of the front-ends, which then distributes the query to all the back-ends. Each back-end performs the query against its own data segment and returns a list of the documents that best match the query (typically 10). Each document is returned with a score that quantifies how well it matches the query. The front-end collects the responses from all the back-ends and produces a single list of the top documents (typically the 10 best matches). Once the front-end has produced this list, it contacts the back-ends to retrieve snippets of the top documents from the indexed collection; only snippets of the top documents are retrieved, and for each document the front-end contacts only the single back-end whose data segment contains it.

4 Conclusions

The first conclusion of our work is that, for search workloads, scale-out solutions have an indisputable price/performance advantage over scale-up. The highly parallel nature of the workload, combined with predictable scalability in processing, networking, and storage, makes search a perfect candidate for scale-out. Furthermore, even within a scale-up system, a "scale-out-in-a-box" approach uses the processors far more effectively than pure scale-up. This is not very different from what is already seen in technical computing on large shared-memory systems: on those machines, running MPI (scale-out) applications within the machine is often more efficient than relying on shared-memory (scale-up) programming. Scale-out systems, however, still lag behind scale-up in systems management. With traditional management approaches, cost grows in proportion to the number of system images, so a scale-out solution clearly incurs higher management cost than a scale-up one.

Original English text:

Abstract

Scale-up solutions in the form of large SMPs have represented the mainstream of commercial computing for the past several years. The major server vendors continue to provide increasingly larger
and more powerful machines. More recently, scale-out solutions, in the form of clusters of smaller machines, have gained increased acceptance for commercial computing. Scale-out solutions are particularly effective in high-throughput web-centric applications. In this paper, we investigate the behavior of two competing approaches to parallelism, scale-up and scale-out, in an emerging search application. Our conclusions show that a scale-out strategy can be the key to good performance even on a scale-up machine. Furthermore, scale-out solutions offer better price/performance, although at an increase in management complexity.

1 Introduction

During the last 10 years of commercial computing, we have witnessed the complete replacement of uniprocessor computing systems by multiprocessor ones. The revolution that started in the early to mid-eighties in scientific and technical computing finally caught up with the bulk of the marketplace in the mid-nineties. We can classify the different approaches to employing multiprocessor systems for computing (both commercial and technical/scientific) into two large groups:

· Scale-up: the deployment of applications on large shared-memory servers (large SMPs).
· Scale-out: the deployment of applications on multiple small interconnected servers (clusters).

During the first phase of the multiprocessor revolution in commercial computing, the dominance of scale-up was clear. SMPs of increasing size, with processors of increasing clock rate, could offer ever more computing power to handle the needs of even the largest corporations. SMPs currently represent the mainstream of commercial computing. Companies like IBM, HP, and Sun invest heavily in building bigger and better SMPs with each generation. More recently, there has been an increase in interest in scale-out for commercial computing. For many of the new web-based enterprises (e.g., Google, Yahoo, eBay, Amazon), a scale-out approach is the only way to deliver the necessary computational power. Also, computer manufacturers have made it easier to deploy scale-out solutions with rack-optimized and blade servers. (Scale-out has been the only viable alternative for large-scale technical scientific computing for several years, as we observe in the evolution of the TOP500 systems.)

In this paper, we study the behavior of an emerging commercial application, search of unstructured data, on two distinct systems: one is a modern scale-up system based on the POWER5 multi-core/multi-threaded processor [8, 9]; the other is a typical scale-out system based on IBM BladeCenter [3]. The systems were configured to have approximately the same list price (approximately $200,000), allowing a fair performance and price/performance comparison. One of the more important conclusions of our work is that a "pure" scale-up approach is not very effective in using all the processors in a large SMP. In pure scale-up, we run just one instance of our application in the SMP, and that instance uses all the resources (processors) available. We were more successful in exploiting the POWER5 SMP with a "scale-out-in-a-box" approach. In that case, multiple instances of the application run concurrently, within a single operating system. This latter approach resulted in significant gains in performance while maintaining the single system image that is one of the great advantages of large SMPs. Another conclusion of our work is that a scale-out system can achieve about four times the performance of a similarly priced scale-up system. In the case of our application, this performance is measured in terms of queries per second. The scale-out system requires the use of multiple system images, so the gain in performance comes at a convenience and management cost. Depending on the situation, that may or may not be worth the improvement in performance.

The rest of this paper is organized as follows. Section 2 describes the configuration of the scale-out and scale-up systems we used in our study. Section 3 presents the Nutch/Lucene workload that ran on our systems. Section 4 reports our experimental findings. Finally, Section 5 presents our conclusions.

2 Scale-up and scale-out systems

In the IBM product line, Systems z, p, and i are all based on SMPs of different sizes that span a wide spectrum of computational capabilities. As an example of a state-of-the-art scale-up system we adopted the POWER5 p5 575 machine [7]. This 8- or 16-way system has been very attractive to customers due to its low cost, high performance, and small form factor (2U, or 3.5 inches high, in a 24-inch rack). A picture of a POWER5 p5 575 is shown in Figure 1. The particular p5 575 that we used for our scale-up measurements has 16 POWER5 processors in 8 dual-core modules and 32 GiB (1 GiB = 1,073,741,824 bytes) of main memory. Each core is dual-threaded, so to the operating system the machine appears as a 32-way SMP. The processor speed is 1.5 GHz. The p5 575 connects to the outside world through two (2) Gigabit/s Ethernet interfaces. It also has its own dedicated DS4100 storage controller. (See below for a description of the DS4100.)

Scale-out systems come in many different shapes and forms, but they generally consist of multiple interconnected nodes with a self-contained operating system in each node. We chose BladeCenter as our platform for scale-out. This was a natural choice given the scale-out orientation of this platform. The first form of scale-out systems to become popular in commercial computing was the rack-mounted cluster. The IBM BladeCenter solution (and similar systems from companies such as HP and Dell) represents the next step after rack-mounted clusters in scale-out systems for commercial computing. The blade servers used in BladeCenter are similar in capability to the densest rack-mounted cluster servers: 4-processor configurations, 16-32 GiB of maximum memory, built-in Ethernet, and expansion cards for either Fibre Channel, InfiniBand, Myrinet, or 10-Gbit/s Ethernet. Also offered are double-wide blades with up to 8-processor configurations and additional memory.

Figure 2 is a high-level view of our cluster architecture. The basic building block of the cluster is a BladeCenter-H (BC-H) chassis. We couple each BC-H chassis with one DS4100 storage controller through a 2-Gbit/s Fibre Channel link. The chassis themselves are interconnected through two nearest-neighbor networks. One of the networks is a 4-Gbit/s Fibre Channel network and the other is a 1-Gbit/s Ethernet network. The cluster consists of 8 chassis of blades (112 blades in total) and eight DS4100 storage subsystems. The BladeCenter-H chassis is the newest BladeCenter chassis from IBM. As with the previous BladeCenter-1 chassis, it has 14 blade slots for blade servers. It also has space for up to two management modules, four switch modules, four bridge modules, and four high-speed switch modules. (Switch modules 3 and 4 and bridge modules 3 and 4 share the same slots in the chassis.) We have populated each of our chassis with two 1-Gbit/s Ethernet switch modules and two Fibre Channel switch modules.

Three different kinds of blades were used in our cluster: JS21 (PowerPC processors), HS21 (Intel Woodcrest processors), and LS21 (AMD Opteron processors). Each blade (JS21, HS21, or LS21) has both a local disk drive (73 GB of capacity) and a dual Fibre Channel network adapter. The Fibre Channel adapter is used to connect the blades to two Fibre Channel switches that are plugged into each chassis. Approximately half of the cluster (4 chassis) is composed of JS21 blades. These are quad-processor (dual-socket, dual-core) PowerPC 970 blades, running at 2.5 GHz. Each blade has 8 GiB of memory. For the experiments reported in this paper, we focus on these JS21 blades. The DS4100 storage subsystem consists of dual storage controllers, each with a 2-Gbit/s Fibre Channel interface, and space for 14 SATA drives in the main drawer. Although each DS4100 is paired with a specific BladeCenter-H chassis, any blade in the cluster can see any of the LUNs in the storage system, thanks to the Fibre Channel network we implement.

3 The Nutch/Lucene workload

Nutch/Lucene [4] is a framework for implementing search applications. It is representative of a growing class of applications that are based on search of unstructured data (web pages). We are all used to search engines like Google and Yahoo that operate on the open Internet. However, search is also an important operation within intranets, the internal networks of companies. Nutch/Lucene is all implemented in Java and its code is open source. Nutch/Lucene, as a typical search framework, has three major components: (1) crawling, (2) indexing, and (3) query. In this paper, we present our results for the query component. For completeness, we briefly describe the other components.

Crawling is the operation that navigates and retrieves the information in web pages, populating the set of documents that will be searched. This set of documents is called the corpus, in search terminology. Crawling can be performed on internal networks (intranets) as well as external networks (the Internet). Crawling, particularly on the Internet, is a complex operation. Either intentionally or unintentionally, many web sites are difficult to crawl. The performance of crawling is usually limited by the bandwidth of the network between the system doing the crawling and the system being crawled.

The Nutch/Lucene search framework includes a parallel indexing operation written using the MapReduce programming model [2]. MapReduce provides a convenient way of addressing an important (though limited) class of real-life commercial applications by hiding parallelism and fault-tolerance issues from the programmers, letting them focus on the problem domain. MapReduce was published by Google in 2004 and quickly became a de facto standard for this kind of workload. Parallel indexing in the MapReduce model works as follows. First, the data to be indexed is partitioned into segments of approximately equal size. Each segment is then processed by a mapper task that generates the (key, value) pairs for that segment, where key is an indexing term and value is the set of documents that contain that term (and the location of the term in the document). This corresponds to the map phase in MapReduce. In the next phase, the reduce phase, each reducer task collects all the pairs for a given key, thus producing a single index table for that key. Once all the keys are processed, we have the complete index for the entire data set.

In most search applications, query represents the vast majority of the computation effort. When performing a query, a set of index terms is presented to a query engine, which then retrieves the documents that best match that set of terms. The overall architecture of the Nutch/Lucene parallel query engine is shown in Figure 3. The query engine part consists of one or more front-ends, and one or more back-ends. Each back-end is associated with a segment of the complete data set. The driver represents external users and it is also the point at which the performance of the query is measured, in terms of queries per second (QPS). A query operation works as follows. The driver submits a particular query (set of index terms) to one of the front-ends. The front-end then distributes the query to all the back-ends. Each back-end is responsible for performing the query against its data segment and returning a list of the documents that best match the query.
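The MapReduce-style indexing described above, emitting (key, value) pairs in a map phase and grouping them by key in a reduce phase, can be sketched in miniature in plain Java, the language the framework itself is written in. This is a hypothetical single-process illustration, not Nutch's actual indexer; the class and method names are invented, and each document stands in for one data segment:

```java
import java.util.*;

public class InvertedIndexSketch {
    // Map phase: for each document (segment), emit one (term, docId) pair
    // per term occurrence. Reduce phase: group the pairs by term, so each
    // term ends up with the set of documents that contain it.
    public static Map<String, Set<Integer>> buildIndex(List<String> docs) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            for (String term : docs.get(docId).toLowerCase().split("\\s+")) {
                pairs.add(Map.entry(term, docId));      // map: emit (key, value)
            }
        }
        Map<String, Set<Integer>> index = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {    // reduce: group by key
            index.computeIfAbsent(p.getKey(), t -> new TreeSet<>())
                 .add(p.getValue());
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, Set<Integer>> index = buildIndex(List.of(
                "scale up and scale out",
                "scale out clusters"));
        index.forEach((term, docIds) -> System.out.println(term + " -> " + docIds));
    }
}
```

In the real framework the mappers and reducers run as separate, fault-tolerant parallel tasks over much larger segments, and the value also records term positions within each document; the grouping-by-key step is the same in spirit.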

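The front-end/back-end query architecture can be sketched the same way: one concurrent task per back-end scores its own data segment and returns its local top list, and the front-end gathers the partial lists and merges them into a single ranked answer. This is a hypothetical sketch, assuming a toy map of precomputed document scores per segment rather than a real Lucene index; all names are invented:

```java
import java.util.*;
import java.util.concurrent.*;

public class ParallelQuerySketch {
    public record Hit(String doc, double score) {}

    // Front-end logic: scatter the query to one task per back-end segment,
    // gather each segment's local top-k list, and merge into one global top-k.
    public static List<Hit> query(List<Map<String, Double>> segments,
                                  int k) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(segments.size());
        List<Future<List<Hit>>> partials = new ArrayList<>();
        for (Map<String, Double> segment : segments) {
            partials.add(pool.submit(() -> segment.entrySet().stream()
                    .map(e -> new Hit(e.getKey(), e.getValue()))   // back-end scores its docs
                    .sorted(Comparator.comparingDouble(Hit::score).reversed())
                    .limit(k)                                      // local top-k only
                    .toList()));
        }
        List<Hit> merged = new ArrayList<>();
        for (Future<List<Hit>> f : partials) merged.addAll(f.get());
        pool.shutdown();
        merged.sort(Comparator.comparingDouble(Hit::score).reversed());
        return merged.subList(0, Math.min(k, merged.size()));      // global top-k
    }

    public static void main(String[] args) throws Exception {
        List<Map<String, Double>> segments = List.of(
                Map.of("doc-a", 0.9, "doc-b", 0.3),
                Map.of("doc-c", 0.7, "doc-d", 0.5));
        for (Hit h : query(segments, 3))
            System.out.println(h.doc() + " " + h.score());
    }
}
```

The merge mirrors the paper's description: each back-end returns only its best matches (typically 10) with a score quantifying the match, so the front-end never sees more than k hits per segment, which keeps the gather step cheap as back-ends are added.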