Appendix 1
Analysis of Approximate Nearest Neighbor Searching with Clustered Point Sets
Songrit Maneewongvatana and David M. Mount

Abstract
Nearest neighbor searching is a fundamental computational problem. Given a set of n data points in d-dimensional space, the problem is to preprocess these points into a data structure so that, given a query point, the data point nearest to it can be found efficiently. Because data sets can be large, we are primarily interested in the storage used by the data structure, and wish to keep it to O(dn). A popular class of data structures for nearest neighbor searching is the kd-tree and its variants, whose basic idea is to hierarchically decompose space into rectangular cells. An important question in the construction of such data structures is the choice of a splitting method, which determines the splitting dimension and the hyperplane used to subdivide space. The choice of splitting method largely determines the efficiency of the data structure, and this is especially evident when the data and query points are clustered in low-dimensional subspaces, because strong clustering can lead to subdivisions in which cells have very high aspect ratios. We compare two alternative splitting methods against the well-known optimized kd-tree method. The first is called the sliding-midpoint method; it attempts to balance the goals of producing subdivision cells of bounded aspect ratio while ensuring that every cell contains data points. The second, called the minimum-ambiguity method, is a query-based approach: in addition to the data points, a training set of query points is supplied for preprocessing. It uses a simple greedy algorithm to select splitting planes that minimize the average amount of ambiguity in the choice of the nearest neighbor for the training points. We ran experiments on a number of synthetically generated data and query sets, compared these two methods against the optimized kd-tree construction, and analyzed the results. We demonstrate that for approximate nearest neighbor queries on clustered data and query sets, these algorithms can provide significant improvements in efficiency over the standard kd-tree.

1. Introduction
Nearest neighbor searching is the following problem: we are given a set S of n points in a metric space X, and must preprocess these points so that, for any query point q ∈ X, the point of S nearest to q can be found quickly. Nearest neighbor searching has applications in many areas, including knowledge discovery and data mining, pattern recognition and classification, machine learning, data compression, multimedia databases, document retrieval, statistics, and more.
There are many possible choices of the metric space. Throughout we assume that the space is R^d, real d-dimensional space, where distances are measured using a Minkowski Lm metric. For any integer m ≥ 1, the Lm-distance between any two points p = (p1, p2, …, pd) and q = (q1, q2, …, qd) in R^d is defined to be the m-th root of Σ_{1≤i≤d} |pi − qi|^m. The L1, L2, and L∞ metrics are known, respectively, as the Manhattan, Euclidean, and max metrics.
Our primary focus is on the data structure. Because data sets can be large, we restrict ourselves to data structures whose total space grows linearly. Among the many approaches, the most popular are those based on hierarchical decompositions of space. The seminal work in this area was done by Friedman, Bentley, and Finkel.
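For concreteness, the Minkowski Lm metrics defined above can be computed as follows. This is a small sketch of ours; the helper names are not from the paper.

```python
def minkowski_dist(p, q, m):
    """L_m distance: the m-th root of sum_i |p_i - q_i|^m."""
    return sum(abs(pi - qi) ** m for pi, qi in zip(p, q)) ** (1.0 / m)

def max_dist(p, q):
    """The L-infinity (max) metric, the limit of L_m as m grows."""
    return max(abs(pi - qi) for pi, qi in zip(p, q))

# L1 (Manhattan), L2 (Euclidean), and L-infinity between the same pair:
# minkowski_dist((0, 0), (3, 4), 1) -> 7.0
# minkowski_dist((0, 0), (3, 4), 2) -> 5.0
# max_dist((0, 0), (3, 4))          -> 4
```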
They showed that, through the use of kd-trees, O(n) space and O(log n) query time are achievable in the expected case for data distributions of bounded density in fixed-dimensional spaces. There are many alternative methods for this problem. However, all known methods suffer from the defect that as the dimension increases, either the running time or the space grows exponentially with dimension. We would like an efficient algorithm with low space requirements and fast query time even in the worst case, but obtaining such an algorithm has proven difficult; this suggests that finding approximate nearest neighbors is a way out. Let S be a point set in R^d and let the query point q be any point of R^d. Given ε > 0, we say that a point p is an approximate nearest neighbor of q if dist(p, q) ≤ (1+ε) dist(p*, q), where p* is the true nearest neighbor; in other words, p is the true nearest neighbor up to relative error ε. The approximate nearest neighbor problem has been studied intensively in recent years. Representative algorithms include those of Bern [Ber93], Arya and Mount [AM93b], Arya et al. [AMN+98], Kleinberg [Kle97], Clarkson [Cla94], Chan [Cha97], Indyk and Motwani [IM98], and Kushilevitz, Ostrovsky, and Rabani [KOR98].
In this paper we restrict attention to data structures of size O(dn) based on hierarchical spatial decompositions, and the kd-tree in particular, largely because this data structure is simple and widely used. A kd-tree is a binary tree that subdivides multidimensional space hierarchically by splitting hyperplanes orthogonal to the coordinate axes. We discuss it further in the next section.
A key issue in the design of the kd-tree is the choice of the splitting hyperplane. Friedman, Bentley, and Finkel proposed the following splitting method: supposing the point set has its greatest spread along dimension k, the splitting hyperplane is taken orthogonal to axis k through the median coordinate along that axis. They called the resulting tree the optimized kd-tree, and we call this method the standard splitting method. Another common alternative partitions the points using the shape of the enclosing cell rather than the distribution of the points: the splitting hyperplane is taken orthogonal to the cell's longest side, through its midpoint. We call this the midpoint split method.
A number of other data structures for nearest neighbor searching based on hierarchical spatial decompositions have been proposed. Yianilos introduced the vp-tree. Rather than splitting the point set with a hyperplane as the kd-tree does, it uses a data point, called the vantage point, as the center of a hypersphere that partitions the space into two regions. Several data structures based on R-trees and their variants are used in the database field. For example, the X-tree improves the performance of the R*-tree by avoiding high overlap; the SR-tree is a similar example. The TV-tree takes a rather distinctive approach to high-dimensional spaces: it reduces dimensionality by maintaining a set of active dimensions. When the data points in a node share the same coordinate along an active dimension, that dimension is deactivated and the set of active dimensions shifts accordingly.
In this paper we study the performance of two other splitting methods and compare them against the traditional kd-tree splitting method. The first, called sliding-midpoint, was proposed by Mount and Arya for approximate nearest neighbor searching and is implemented in the ANN library. It was introduced into the library in order to better handle highly clustered data sets.
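The two classical splitting rules just described, the standard (optimized kd-tree) rule and the midpoint rule, can be sketched as follows. These are minimal helpers of ours, not code from the paper; cells are represented by their low and high corner tuples.

```python
def standard_split(points):
    """Standard rule: split orthogonal to the dimension of greatest
    spread, at the median coordinate of the points."""
    d = len(points[0])
    dim = max(range(d),
              key=lambda i: max(p[i] for p in points) - min(p[i] for p in points))
    coords = sorted(p[dim] for p in points)
    return dim, coords[len(coords) // 2]

def midpoint_split(cell_lo, cell_hi):
    """Midpoint rule: split the cell's longest side at its midpoint,
    ignoring the distribution of the points."""
    dim = max(range(len(cell_lo)), key=lambda i: cell_hi[i] - cell_lo[i])
    return dim, (cell_lo[dim] + cell_hi[dim]) / 2.0
```

Note that the standard rule depends only on the points, while the midpoint rule depends only on the cell, which is the distinction the text draws.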
This method was designed as a simple technique for correcting a serious flaw in the standard kd-tree splitting method: when highly clustered data points lie in a low-dimensional subspace, the standard splitting method can produce many cells with very high aspect ratios, and these cells can greatly lengthen query times. The sliding-midpoint method begins with a simple midpoint split along the longest side of the cell, but if either subcell produced by the split contains no data points, the method slides the splitting plane in the direction of the points until it encounters the first data point. In Section 3.1 we describe this splitting method in detail and analyze some of its properties.
The second method, called the minimum-ambiguity splitting method, is a query-based technique. To build the tree we are given not only the data points but also a set of sample query points, called the training points. During construction, the algorithm applies a greedy heuristic that attempts to minimize the query time on the training set. Given a collection of query points, we model each stage of the nearest neighbor algorithm as a bipartite graph, called the candidate graph, whose vertices correspond to the data points and the query points, and in which each query point is adjacent to the subset of data points that might be its nearest neighbor. At each stage the minimum-ambiguity method selects the splitting plane that eliminates as many of the remaining edges of the candidate graph as possible. In Section 3.2 we describe this method in greater detail.
We implemented these two methods along with the standard kd-tree splitting method, and compared the algorithms on a number of synthetically generated low-dimensional clustered data sets with various distributions. We believe that this type of clustered data is common in practical applications. We used synthetic data, as opposed to standard benchmarks, so that we could adjust the density and dimensionality of the clustering. Our results show that for low-dimensional clustered data sets these two algorithms can provide significant improvements in efficiency over the standard kd-tree.
The rest of the paper is organized as follows. In the next section we present background on the kd-tree and how nearest neighbor queries are performed with it. In Section 3 we describe the two new splitting methods. Section 4 presents our experiments and results.

2. Background
In this section we describe how kd-trees are used to perform exact and approximate nearest neighbor searching. Bentley introduced the kd-tree as a generalization of the binary search tree to higher dimensions [Ben75]. Each node of the kd-tree can be viewed as a multidimensional rectangle, which we call its cell. The root node is a rectangle containing all the data points; the remaining nodes represent rectangles containing subsets of the data points. (A point on the boundary between two rectangles may be considered to belong to either.) If the number of data points in a node falls below a threshold called the bucket size, the node is a leaf, and those points are stored with the leaf. Otherwise, the construction algorithm selects a splitting hyperplane that is orthogonal to one of the coordinate axes and passes through the cell.
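As one concrete example of such a splitting rule, the sliding-midpoint method described in the introduction can be sketched as follows. This is our own minimal illustration, not the ANN library's code; the cell is given by its low and high corners.

```python
def sliding_midpoint_split(points, cell_lo, cell_hi):
    """Sliding-midpoint rule: try a midpoint split of the cell's longest
    side; if one side would contain no data points, slide the plane toward
    the points until it hits the first data point."""
    d = len(cell_lo)
    dim = max(range(d), key=lambda i: cell_hi[i] - cell_lo[i])
    cut = (cell_lo[dim] + cell_hi[dim]) / 2.0
    coords = [p[dim] for p in points]
    lo, hi = min(coords), max(coords)
    if hi < cut:      # all points below the cut: slide left to the extreme point
        cut = hi
    elif lo > cut:    # all points above the cut: slide right to the extreme point
        cut = lo
    return dim, cut
```

With points clustered near one end of a long cell, the plain midpoint split would leave one subcell empty; the slide guarantees both sides of the subdivision are useful.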
There are a number of splitting methods that may be used to choose this hyperplane; we discuss them in more detail below. The hyperplane subdivides the cell into two subrectangles, which become the two children of the current node, and each data point in the cell is assigned to the child corresponding to the side of the hyperplane on which it lies. Each internal node of the tree is associated with its splitting hyperplane.
Friedman, Bentley, and Finkel [FBF77] gave an algorithm for finding nearest neighbors with a kd-tree. They introduced the following splitting method, which we call the standard splitting method. For each internal node, the splitting hyperplane is chosen orthogonal to the coordinate axis along which the points have the greatest spread, and the splitting point is chosen at the median coordinate, so that the two resulting subsets contain nearly equal numbers of data points. The resulting tree has O(n) size and O(log n) height. White and Jain [WJ96] proposed an alternative called the VAM-split with the same basic idea, except that the splitting dimension is chosen to be the one with maximum variance.
Queries are answered by a simple recursive algorithm. In the base case, when the algorithm reaches a leaf, it computes the distance from the query point to each data point in the leaf, and the smallest such distance is saved. When the algorithm reaches an internal node, it first determines on which side of the hyperplane the query point lies. The query point is necessarily closer to the child that is about to be visited, and the algorithm visits that child recursively. On returning, it checks whether the cell of the other child is closer to the query point than the current smallest distance; if so, the other child is also visited recursively. When the search returns from the root, the nearest neighbor has been found. An important observation is that, for any query point, the algorithm visits every leaf whose distance from the query point is less than that of the nearest neighbor.
It is easy to generalize this search to approximate nearest neighbor queries. Let ε be the allowed error bound. When processing an internal node, the farther child is visited if and only if its distance from the query point is less than the current smallest distance divided by (1+ε). Arya et al. proved the correctness of this procedure. They also showed how this general algorithm can compute the k nearest neighbors, either exactly or approximately.
Arya and Mount [AM93a] proposed several improvements to this basic algorithm. The first is called incremental distance calculation, a technique that can be applied with any Minkowski metric. In addition to the splitting hyperplane, each internal node of the tree stores information about its associated cell. The algorithm does not compute true distances but instead works with squared distances. When the algorithm reaches an internal node, it maintains the squared distance from the query point to the associated cell; they showed that in constant time (independent of dimension) this information can be used to compute the squared distance to the cell of each child. They also presented a method called priority search, which uses a heap to visit the leaves in increasing order of distance from the query point, rather than in the recursive order determined by the structure of the tree.
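The recursive query procedure and its (1+ε) generalization described above can be sketched as follows. This is our own simplified illustration: it builds the tree with the standard splitting rule, and it prunes with the distance to the splitting plane, a weaker lower bound than the full incremental cell-distance computation of Arya and Mount.

```python
import math

class Node:
    def __init__(self, dim=None, cut=None, left=None, right=None, points=None):
        self.dim, self.cut = dim, cut
        self.left, self.right = left, right
        self.points = points            # non-None only at leaves

def build(points, bucket_size=1):
    """Build a kd-tree with the standard split (greatest spread, median cut)."""
    if len(points) <= bucket_size:
        return Node(points=points)
    d = len(points[0])
    dim = max(range(d),
              key=lambda i: max(p[i] for p in points) - min(p[i] for p in points))
    pts = sorted(points, key=lambda p: p[dim])
    mid = len(pts) // 2
    return Node(dim, pts[mid][dim],
                build(pts[:mid], bucket_size), build(pts[mid:], bucket_size))

def ann_search(node, q, eps=0.0, best=None):
    """Recursive search using squared Euclidean distances; with eps > 0 the
    farther child is visited only if it could beat the current best
    shrunk by the (1+eps) approximation factor."""
    if best is None:
        best = [math.inf, None]         # [squared distance, point]
    if node.points is not None:         # leaf: scan the bucket
        for p in node.points:
            d2 = sum((pi - qi) ** 2 for pi, qi in zip(p, q))
            if d2 < best[0]:
                best[0], best[1] = d2, p
        return best
    diff = q[node.dim] - node.cut
    near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
    ann_search(near, q, eps, best)
    # |diff| lower-bounds the distance to the far child's cell.
    if diff * diff < best[0] / (1.0 + eps) ** 2:
        ann_search(far, q, eps, best)
    return best
```

With eps = 0 the search is exact, since the pruning bound never underestimates the true best distance.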
Yet another well-known technique for nearest neighbor searching is called partial distance calculation [BG85, Spr91]: when computing the distance between the query point and a data point, if the accumulated sum of squared components ever exceeds the squared distance to the closest point so far, the distance computation is terminated.
For approximate nearest neighbor searching, Arya et al. observed that two properties are important for any data structure based on spatial decomposition. (1) Balance: the height of the tree should be O(log n), where n is the number of data points. (2) Bounded aspect ratio: the leaf cells of the tree should have bounded aspect ratio, meaning that the ratio of the longest to the shortest side of each leaf cell is bounded above by a constant. On the basis of these two properties, they showed that approximate nearest neighbor queries can be answered with a data structure of size O(dn) in O(log n) time. Unfortunately, for the kd-tree these two properties do not always hold simultaneously, especially when the distribution of the data points is highly clustered. Arya et al. proposed a somewhat more complex data structure, the balanced box-decomposition tree, which satisfies both properties. This extra complexity appears necessary to prove their theoretical results, and their experiments showed that it matters when the data set is highly clustered in a low-dimensional subspace. An interesting practical question is whether there is an algorithm that retains the simplicity of the kd-tree while providing efficient search on the highly clustered distributions that arise in practice.
A bounded aspect ratio is a sufficient but not necessary condition for efficient search. A more accurate condition, suggested by their experimental results, is called the packing constraint. Consider a ball of radius r centered at some point; the packing constraint asserts that the number of cells intersecting such a ball is bounded. Packing constraint: the number of leaf cells of size at least s that intersect a ball of radius r is bounded above by a function of r/s, independent of n. If the cells of the tree have bounded aspect ratio, then the tree necessarily satisfies the packing constraint. Arya et al. showed that if the packing constraint holds, then this number of cells depends only on the dimension and on ε. The shortcoming of the standard splitting method is that it cannot control the aspect ratio of the cells, and therefore it cannot in general satisfy the packing constraint.

Appendix 2
Analysis of Approximate Nearest Neighbor Searching with Clustered Point Sets
Songrit Maneewongvatana and David M. Mount

Abstract
Nearest neighbor searching is a fundamental computational problem. A set of n data points is given in real d-dimensional space, and the problem is to preprocess these points into a data structure, so that given a query point, the nearest data point to the query point can be reported efficiently. Because data sets can be quite large, we are primarily interested in data structures that use only O(dn) storage. A popular class of data structures for nearest neighbor searching is the kd-tree and variants based on hierarchically decomposing space into rectangular cells.
An important question in the construction of such data structures is the choice of a splitting method, which determines the dimension and splitting plane to be used at each stage of the decomposition. This choice of splitting method can have a significant influence on the efficiency of the data structure. This is especially true when data and query points are clustered in low dimensional subspaces. This is because clustering can lead to subdivisions in which cells have very high aspect ratios. We compare the well-known optimized kd-tree splitting method against two alternative splitting methods. The first, called the sliding-midpoint method, attempts to balance the goals of producing subdivision cells of bounded aspect ratio, while not producing any empty cells. The second, called the minimum-ambiguity method, is a query-based approach. In addition to the data points, it is also given a training set of query points for preprocessing. It employs a simple greedy algorithm to select the splitting plane that minimizes the average amount of ambiguity in the choice of the nearest neighbor for the training points. We provide an empirical analysis comparing these two methods against the optimized kd-tree construction for a number of synthetically generated data and query sets. We demonstrate that for clustered data and query sets, these algorithms can provide significant improvements over the standard kd-tree construction for approximate nearest neighbor searching.

1. Introduction
Nearest neighbor searching is the following problem: we are given a set S of n data points in a metric space, X, and are asked to preprocess these points so that, given any query point q ∈ X, the data point nearest to q can be reported quickly. Nearest neighbor searching has applications in many areas, including knowledge discovery and data mining [FPSSU96], pattern recognition and classification [CH67, DH73], machine learning [CS93], data compression [GG92], multimedia databases [FSN+95], document retrieval [DDF+90], and statistics [DW82].
There are many possible choices of the metric space. Throughout we will assume that the space is R^d, real d-dimensional space, where distances are measured using any Minkowski Lm distance metric. For any integer m ≥ 1, the Lm-distance between points p = (p1, p2, …, pd) and q = (q1, q2, …, qd) in R^d is defined to be the m-th root of Σ_{1≤i≤d} |pi − qi|^m. The L1, L2, and L∞ metrics are the well-known Manhattan, Euclidean and max metrics, respectively.
Our primary focus is on data structures that are stored in main memory. Since data sets can be large, we limit ourselves to consideration of data structures whose total space grows linearly with d and n. Among the most popular methods are those based on hierarchical decompositions of space. The seminal work in this area was by Friedman, Bentley, and Finkel [FBF77] who showed that O(n) space and O(log n) query time are achievable for fixed dimensional spaces in the expected case for data distributions of bounded density through the use of kd-trees. There have been numerous variations on this theme. However, all known methods suffer from the fact that as dimension increases, either running time or space increase exponentially with dimension.
The difficulty of obtaining algorithms that are efficient in the worst case with respect to both space and query time suggests the alternative problem of finding approximate nearest neighbors. Consider a set S of data points in R^d and a query point q ∈ R^d. Given ε > 0, we say that a point p ∈ S is a (1+ε)-approximate nearest neighbor of q if dist(p, q) ≤ (1+ε) dist(p*, q), where p* is the true nearest neighbor to q. In other words, p is within relative error ε of the true nearest neighbor. The approximate nearest neighbor problem has been heavily studied recently. Examples include algorithms by Bern [Ber93], Arya and Mount [AM93b], Arya, et al. [AMN+98], Clarkson [Cla94], Chan [Cha97], Kleinberg [Kle97], Indyk and Motwani [IM98], and Kushilevitz, Ostrovsky and Rabani [KOR98].
In this study we restrict attention to data structures of size O(dn) based on hierarchical spatial decompositions, and the kd-tree in particular. In large part this is because of the simplicity and widespread popularity of this data structure. A kd-tree is a binary tree based on a hierarchical subdivision of space by splitting hyperplanes that are orthogonal to the coordinate axes [FBF77]. It is described further in the next section.
A key issue in the design of the kd-tree is the choice of the splitting hyperplane. Friedman, Bentley, and Finkel proposed a splitting method based on selecting the plane orthogonal to the coordinate axis along which the points have the greatest spread, splitting at the median coordinate. They called the resulting tree an optimized kd-tree, and henceforth we call the resulting splitting method the standard splitting method. Another common alternative uses the shape of the cell, rather than the distribution of the data points. It splits each cell through its midpoint by a hyperplane orthogonal to its longest side. We call this the midpoint split method.
A number of other data structures for nearest neighbor searching based on hierarchical spatial decompositions have been proposed. Yianilos introduced the vp-tree [Yia93]. Rather than using an axis-aligned plane to split a node as in the kd-tree, it uses a data point, called the vantage point, as the center of a hypersphere that partitions the space into two regions. There has also been quite a bit of interest from the field of databases. There are several data structures for database applications based on R-trees and their variants [BKSS90, SRF87]. For example, the X-tree [BKK96] improves the performance of the R*-tree by avoiding high overlap. Another example is the SR-tree [KS97]. The TV-tree [LJF94] uses a different method to deal with high dimensional spaces. It reduces dimensionality by maintaining a number of active dimensions. When all data points in a node share the same coordinate of an active dimension, that dimension will be deactivated and the set of active dimensions shifts.
In this paper we study the performance of two other splitting methods, and compare them against the kd-tree splitting method. The first, called sliding-midpoint, is a splitting method that was introduced by Mount and Arya in the ANN library for approximate nearest neighbor searching [MA97]. This method was introduced into the library in order to better handle highly clustered data sets. We know of no analysis (empirical or theoretical) of this method. This method was designed as a simple technique for addressing one of the most serious flaws in the standard kd-tree splitting method. The flaw is that when the data points are highly clustered in low dimensional subspaces, then the standard kd-tree splitting method may produce highly elongated cells, and these can lead to slow query times. This splitting method starts with a simple midpoint split of the longest side of the cell, but if this split results in either subcell containing no data points, it translates (or "slides") the splitting plane in the direction of the points until hitting the first data point. In Section 3.1 we describe this splitting method and analyze some of its properties.
The second splitting method, called minimum-ambiguity, is a query-based technique. The tree is given not only the data points, but also a collection of sample query points, called the training points. The algorithm applies a greedy heuristic to build the tree in an attempt to minimize the expected query time on the training points. We model query processing as the problem of eliminating data points from consideration as the possible candidates for the nearest neighbor. Given a collection of query points, we can model any stage of the nearest neighbor algorithm as a bipartite graph, called the candidate graph, whose vertices correspond to the union of the data points and the query points, and in which each query point is adjacent to the subset of data points that might be its nearest neighbor. The minimum-ambiguity method selects the splitting plane at each stage that eliminates the maximum number of remaining edges in the candidate graph. In Section 3.2 we describe this splitting method in greater detail.
We implemented these two splitting methods, along with the standard kd-tree splitting method. We compared them on a number of synthetically generated point distributions, which were designed to model low-dimensional clustering. We believe this type of clustering is not uncommon in many application data sets [JD88]. We used synthetic data sets, as opposed to standard benchmarks, so that we could adjust the strength and dimensionality of the clustering. Our results show that these new splitting methods can provide significant improvements over the standard kd-tree splitting method for data sets with low-dimensional clustering.
The rest of the paper is organized as follows. In the next section we present background information on the kd-tree and how to perform nearest neighbor searches in this tree. In Section 3 we present the two new splitting methods. In Section 4 we describe our implementation and present our empirical results.

2. Background
In this section we describe how kd-trees are used for performing exact and approximate nearest neighbor searching. Bentley introduced the kd-tree as a generalization of the binary search tree in higher dimensions [Ben75]. Each node of the tree is implicitly associated with a d-dimensional rectangle, called its cell. The root node is associated with the bounding rectangle, which encloses all of the data points. Each node is also implicitly associated with the subset of data points that lie within this rectangle. (Data points lying on the boundary between two rectangles may be associated with either.) If the number of points associated with a node falls below a given threshold, called the bucket size, then this node is a leaf, and these points are stored with the leaf. (In our experiments we used a bucket size of one.) Otherwise, the construction algorithm selects a splitting hyperplane, which is orthogonal to one of the coordinate axes and passes through the cell. There are a number of splitting methods that may be used for choosing this hyperplane. We will discuss these in greater detail below. The hyperplane subdivides the associated cell into two subrectangles, which are then associated with the children of this node, and the points are subdivided among these children according to which side of the hyperplane they lie. Each internal node of the tree is associated with its splitting hyperplane (which may be given as the index of the orthogonal axis and a cutting value along this axis).
Friedman, Bentley and Finkel [FBF77] present an algorithm to find the nearest neighbor using kd-trees. They introduce the following splitting method, which we call the standard splitting method. For each internal node, the splitting hyperplane is chosen to be orthogonal to the axis along which the points have the greatest spread (difference of maximum and minimum). The splitting point is chosen at the median coordinate, so that the two subsets of data points have nearly equal sizes. The resulting tree has O(n) size and O(log n) height. White and Jain [WJ96] proposed an alternative, called the VAM-split, with the same basic idea, but the splitting dimension is chosen to be the one with the maximum variance.
Queries are answered by a simple recursive algorithm. In the basis case, when the algorithm arrives at a leaf of the tree, it computes the distance from the query point to each of the data points associated with this node. The smallest such distance is saved. When arriving at an internal node, it first determines the side of the associated hyperplane on which the query point lies. The query point is necessarily closer to this child's cell. The search recursively visits this child. On returning from the search, it determines whether the cell associated with the other child is closer to the query point than the closest point seen so far. If so, then this child is also visited recursively. When the search returns from the root, the closest point seen is returned. An important observation is that for each query point, every leaf whose distance from the query point is less than the nearest neighbor will be visited by the algorithm.
It is an easy matter to generalize this search algorithm for answering approximate nearest neighbor queries. Let ε denote the allowed error bound. In the processing of an internal node, the further child is visited only if its distance from the query point is less than the distance to the closest point so far, divided by (1+ε). Arya et al. [AMN+98] show the correctness of this procedure. They also show how to generalize the search algorithm for computing the k-closest neighbors, either exactly or approximately.
Arya and Mount [AM93a] proposed a number of improvements to this basic algorithm. The first is called incremental distance calculation. This technique can be applied for any Minkowski metric. In addition to storing the splitting hyperplane, each internal node of the tree also stores the extents of the associated cell projected orthogonally onto its splitting axis. The algorithm does not maintain true distances, but instead (for the Euclidean metric) maintains squared distances. When the algorithm arrives at an internal node, it maintains the squared distance from the query point to the associated cell. They show that in constant time (independent of dimension) it is possible to use this information to compute the squared distance to each of the children's cells. They also presented a method called priority search, which uses a heap to visit the leaves of the tree in increasing order of distance from the query point, rather than in the recursive order dictated by the structure of the tree. Yet another improvement is a well-known technique from nearest neighbor searching, called partial distance calculation [BG85, Spr91]. When computing the distance between the query point and a data point, if the accumulated sum of squared components ever exceeds the squared distance to the nearest point so far, then the distance computation is terminated.
One of the important elements of approximate nearest neighbor searching, which was observed by Arya et al. [AMN+98], is that there are two important properties of any data structure for approximate nearest neighbor searching based on spatial decomposition. Balance: The tree should have O(log n) height, where n is the number of data points. Bounded aspect ratio: The leaf cells of the tree should have bounded aspect ratio, meaning that the ratio of the longest to the shortest side of each leaf cell is bounded above by a constant.
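The partial distance calculation just described can be sketched in a few lines. This is our own helper for the Euclidean (squared-distance) case, not code from [BG85, Spr91].

```python
def partial_squared_dist(p, q, best_d2):
    """Partial distance calculation: accumulate squared coordinate
    differences and abandon the computation as soon as the running sum
    exceeds the squared distance to the closest point seen so far."""
    total = 0.0
    for pi, qi in zip(p, q):
        total += (pi - qi) ** 2
        if total > best_d2:
            return None     # p cannot be closer than the current best
    return total

# partial_squared_dist((0, 0, 0), (1, 1, 1), 2.5) -> None (abandoned early)
# partial_squared_dist((0, 0, 0), (1, 1, 0), 2.5) -> 2.0
```

In high dimensions this saves a large fraction of the arithmetic, since most candidate points are rejected after examining only a few coordinates.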