外文文獻(xiàn)翻譯-一種新的包裝布局和布線(xiàn)工具的FPGA研究_第1頁(yè)
外文文獻(xiàn)翻譯-一種新的包裝布局和布線(xiàn)工具的FPGA研究_第2頁(yè)
外文文獻(xiàn)翻譯-一種新的包裝布局和布線(xiàn)工具的FPGA研究_第3頁(yè)
外文文獻(xiàn)翻譯-一種新的包裝布局和布線(xiàn)工具的FPGA研究_第4頁(yè)
外文文獻(xiàn)翻譯-一種新的包裝布局和布線(xiàn)工具的FPGA研究_第5頁(yè)
已閱讀5頁(yè),還剩14頁(yè)未讀, 繼續(xù)免費(fèi)閱讀

下載本文檔

版權(quán)說(shuō)明:本文檔由用戶(hù)提供并上傳,收益歸屬內(nèi)容提供方,若內(nèi)容存在侵權(quán),請(qǐng)進(jìn)行舉報(bào)或認(rèn)領(lǐng)

文檔簡(jiǎn)介

譯文VPR:一種新的包裝,布局和布線(xiàn)工具的FPGA研究

沃恩貝茨和喬納森羅斯

系電氣與計(jì)算機(jī)工程系,多倫多大學(xué)

多倫多,ON,加拿大M5S3G4{沃恩,jayar}@摘要

我們描述了一個(gè)基于FPGA新的功能和CAD工具使用的算法,各種途徑和方(VPR)。在減少路由面積計(jì)算方面,VPR優(yōu)于所有的FPGA布局布線(xiàn)工具,我們可以比較。雖然常用的算法是基于已知的方法,是我們目前而言改善運(yùn)行時(shí)間和質(zhì)量的幾個(gè)有效方法。我們目前的版圖和路由上的大型電路的一套新的結(jié)果,讓未來(lái)的基準(zhǔn)電路尺寸上的設(shè)計(jì)方法更多,用于今天的典型的FPGA布局布線(xiàn)工具工業(yè)品外觀設(shè)計(jì)。VPR是針對(duì)一個(gè)范圍廣泛的FPGA架構(gòu)的能力,并且源代碼是公開(kāi)的。它和相關(guān)的網(wǎng)表翻譯/群集工具VPACK已經(jīng)被用在世界各地的一些研究項(xiàng)目,并且是有用的FPGA體系結(jié)構(gòu)的研究。1簡(jiǎn)介在FPGA的研究中,人們通常必須評(píng)估新結(jié)構(gòu)特色的實(shí)用工具而做評(píng)估實(shí)驗(yàn)。也就是說(shuō)評(píng)估基準(zhǔn)電路技術(shù)映射,放置和FPGA的布線(xiàn)結(jié)構(gòu)上的關(guān)系和措施的架構(gòu)質(zhì)量,如運(yùn)算速度或區(qū)域,然后可以很容易地提取出來(lái)。因此,有相當(dāng)大的對(duì)于靈活CAD工具的需求,這樣才可以針對(duì)各種架構(gòu)的FPGA做高效的設(shè)計(jì),從而便于比較均勻的設(shè)計(jì)架構(gòu)。本文介紹了通用的地點(diǎn)和路線(xiàn)(VPR)工具,設(shè)計(jì)很靈活,足夠讓許多FPGA架構(gòu)的比較VPR可以執(zhí)行的位置,要么全球路由或合并后的全球詳細(xì)路由。這是公開(kāi)的/?jayar/軟件。為了使FPGA體系結(jié)構(gòu)的比較有意義,它是至關(guān)重要的CAD工具用于將每個(gè)電路架構(gòu),以地圖的高品質(zhì)展現(xiàn)。路由相優(yōu)于所有的VPR在查看FPGA的路由器方面,任何標(biāo)準(zhǔn)基準(zhǔn)測(cè)試的結(jié)果都可用,并且指出VPR的砂礦和路由器的組合勝過(guò)所有出版的FPGA布局和布線(xiàn)工具。本文結(jié)構(gòu)如下:在第2節(jié)我們描述了一些VPR功能的FPGA架構(gòu)和范圍與它可能被使用的地方。在第3和第4節(jié),我們描述了布局布線(xiàn)法。在第5節(jié)講述了比較有必要的VPR曲目數(shù)量和該電路成功的布線(xiàn)所要求的其他已發(fā)表的工具。在第6節(jié)得出了我們的結(jié)論,并提出一些VPR將來(lái)的升級(jí)。2概述VPR圖1概括了VPR的CAD流程。VPR投入到由一個(gè)technologymapped網(wǎng)表和一個(gè)文本文件描述了的FPGA架構(gòu)中。VPR可以放置電路,或一個(gè)預(yù)先存在的位置,可以讀入VPR可以執(zhí)行或者是全局的路線(xiàn)或合并后的全球/詳細(xì)的安置途徑。VPR的輸出由布局、布線(xiàn)和統(tǒng)計(jì)組成,評(píng)估一項(xiàng)有用的工具FPGA架構(gòu),如路由線(xiàn)長(zhǎng),跟蹤計(jì)數(shù)最大凈長(zhǎng)度。給出一些可指定的建筑結(jié)構(gòu)參數(shù)描述文件:?邏輯塊輸入和輸出的數(shù)量,?對(duì)每個(gè)邏輯塊的輸入和輸出端訪(fǎng)問(wèn)(S)之和?邏輯等價(jià)性不同的輸入和輸出引腳(例如,所有對(duì)照表輸入功能當(dāng)量),?對(duì)I/成一行或一列的FPGA適合O引腳數(shù),?邏輯塊陣列的尺寸(如23×30的邏輯塊)。此外,如果全球路由要執(zhí)行,你也可以指定:?橫向和縱向通道的相對(duì)寬度之和?在不同區(qū)域的FPGA的渠道相對(duì)寬度。最后,如??果合并后的全球和詳細(xì)的路由被執(zhí)行,一個(gè)也會(huì)進(jìn)行求值:?開(kāi)關(guān)塊[1]架構(gòu)(即為何路由曲目是相互關(guān)聯(lián)的),?曲目號(hào)碼,每個(gè)邏輯塊的輸入引腳連接([1]),?為邏輯塊輸出FC值,?對(duì)I/O口FC值。當(dāng)前的體系結(jié)構(gòu)描述格式不允許跨越多個(gè)領(lǐng)域和多個(gè)邏輯塊和被列入路由體系結(jié)構(gòu),但我們目前加入此功能。添加新的路由架構(gòu)的功能VPR相對(duì)容易,因?yàn)閂PR使用體系結(jié)構(gòu)描述來(lái)創(chuàng)建路由資源圖。每個(gè)路由跟蹤和建設(shè)中的每一個(gè)腳成為在這個(gè)圖中的節(jié)點(diǎn),圖邊表示為允許的連接。路由器,圖形可視化和統(tǒng)計(jì)計(jì)算程序都與此路由資源圖的工作相關(guān),所以添加新的路由架構(gòu)功能僅涉及更改的子程序來(lái)建設(shè)這個(gè)圖。雖然VPR最初是島式FPGA的開(kāi)發(fā)[2,3],它也可以和以行為為基礎(chǔ)的FPGA應(yīng)用[4]。VPR目前沒(méi)有能力為目標(biāo)的層次FPGA的[5],顯然增加一個(gè)適當(dāng)?shù)奈恢煤统杀竞瘮?shù)設(shè)計(jì)所需的布線(xiàn)資源圖形程序?qū)⑹蛊淠軌蚪鉀Q這些問(wèn)題。最后,VPR的內(nèi)置圖形允許交互式可視化的布局,路由可用資源和互連的可能途徑路由資源。VPACK邏輯塊包裝程序/網(wǎng)絡(luò)表翻譯VPACK讀取一個(gè)已經(jīng)技術(shù)映射電路網(wǎng)表格式blif到LUT和觸發(fā)器,包裝成所需的FPGA邏輯LUT和觸發(fā)器塊,并輸出在VPR的網(wǎng)表。VPACK可以針對(duì)邏輯塊組成一個(gè)LUT,如圖2所示,因?yàn)檫@是一種常見(jiàn)的FPGA邏輯元件。VPACK也針對(duì)邏輯塊包含幾個(gè)有用的LUT和幾個(gè)拖動(dòng)程序,有或沒(méi)有共享LUT的輸入[6]。這些“clusterbased”邏輯塊類(lèi)似于最近由AlteraFPGA開(kāi)發(fā)的工具類(lèi)型。3布局算法VPR采用模擬退火算法[7]。我們已經(jīng)嘗試與幾個(gè)不同的成本函數(shù)聯(lián)系,發(fā)現(xiàn)我們稱(chēng)之為線(xiàn)性擠塞的成本函數(shù)提供了一個(gè)合理的計(jì)算時(shí)間,最好的結(jié)果[8]。此成本函數(shù)的函數(shù)形式就是對(duì)所有的求和電路中的網(wǎng)進(jìn)行計(jì)算。對(duì)于每一個(gè)網(wǎng),北方新宇和bby指出在其邊界框的水平和垂直跨度分別為Q(n)的因數(shù)補(bǔ)償。邊界線(xiàn)長(zhǎng)度模型中的實(shí)際低估所需的布線(xiàn),就可以看成超過(guò)三個(gè)終端網(wǎng),作為建議[10]。它的價(jià)值取決于凈N兩端號(hào)碼;Q是對(duì)總體1有3個(gè)或更少的終端,并慢慢增加了50臺(tái)網(wǎng)邏輯與上2.79。賈夫常數(shù)x(n)、?(n)為平均信道容量(在首部)在X和Y方向,分別比較全凈邊框和成本函數(shù)的余量,需要更多的調(diào)配路由的領(lǐng)域,F(xiàn)PGA具有窄渠道。本文中的所有結(jié)果的得到,是利用FPGA中的所有通道都有相同的原則。在這種情況下,賈夫是一個(gè)常數(shù),函數(shù)的線(xiàn)性阻塞耗費(fèi)降低到一個(gè)包圍盒的成本函數(shù)。一個(gè)良好的退火算法的必要條件是時(shí)間表取得一個(gè)合理的高品質(zhì)的解決方案與模擬退火的計(jì)算時(shí)間相關(guān)聯(lián)。我們已經(jīng)開(kāi)發(fā)出一種新的退火附表,導(dǎo)致非常高品質(zhì)的展示位置,并在其中給出退火參數(shù)的自動(dòng)調(diào)節(jié)功能,不同的成本和電路尺寸。我們計(jì)算在初始溫度相同的方式為[11]。讓Nblocks是總數(shù)邏輯塊加的I/O口電路中的數(shù)量。我們首先創(chuàng)建一個(gè)隨機(jī)安置的電路。接下來(lái),我們執(zhí)行Nblocks移動(dòng)(成對(duì)掉期)的邏輯塊或I/O口,并計(jì)算出不同的成本,這些Nblocks標(biāo)準(zhǔn)偏差配置。初始溫度設(shè)定為20倍標(biāo)準(zhǔn)差,確保最初幾乎所有的行動(dòng)是在退火算法范圍內(nèi)被系統(tǒng)接受。正如在[12],默認(rèn)號(hào)碼的行為在每個(gè)溫度都有評(píng)價(jià)。這個(gè)默認(rèn)的數(shù)字可以在命令行被取代,從而讓不同的CPU時(shí)間和填筑質(zhì)量權(quán)衡。減少溫度每秒移動(dòng)數(shù)的10倍,例如,加快安置到10倍,并降低了大約只有10%的最終填筑質(zhì)量。當(dāng)溫度是如此之高,幾乎任何舉動(dòng)都可以被接受時(shí),我們基本上從一個(gè)位置隨機(jī)移動(dòng)到另一個(gè)位置所改善獲得的成本都是小成本。相反,如果動(dòng)作是很少被接受(因溫度當(dāng)前正處于低位,安置相當(dāng)高的品質(zhì)),也有不少改善成本。有了這個(gè)動(dòng)機(jī),我們提出了一個(gè)新的溫度更新附表,在溫度增加的時(shí)間花費(fèi)在一個(gè)重要的小區(qū)域上,但不是全部動(dòng)作都被接受。如表1:最后,它表明在[12,13],這是可取的Raccept保證作為近似0.44的量有可能被取值。為此,就需要利用Raccept值來(lái)控制這個(gè)范圍限制器。塊是小于或等于交匯處的值,Dlimit單位除了在X和Y方向嘗試。一個(gè)小的Dlimit增加值由Raccept確保這僅僅是塊進(jìn)行交換考慮。而這些“本地交換“往往導(dǎo)致安置成本相對(duì)較小的變化,越來(lái)越多被接受的可能性增加。最初,Dlimit設(shè)置為整個(gè)芯片。每當(dāng)溫度降低,Dlimit整個(gè)芯片的尺寸為這個(gè)結(jié)果退火的第一部分,逐漸萎縮退火過(guò)程中的中間階段,并正在為退火低溫第1部分最后設(shè)計(jì)余量,當(dāng)T退火終止“0.005*成本/Nnets。該運(yùn)動(dòng)的邏輯塊總是至少影響到一個(gè)網(wǎng)。當(dāng)溫度高于平均凈成本的一個(gè)單位時(shí),它是不可能接受任何成本增加的調(diào)配結(jié)果的,所以我們終止了退火。4路由算法VPR的路由器是基于試探談判的擁塞算法[14,8]?;旧显撍惴ㄓ勺畛醺鳁l線(xiàn)路的最短路徑找到網(wǎng),無(wú)論任何接線(xiàn)段或邏輯塊管腳,都可能會(huì)導(dǎo)致過(guò)度使用。路由器的迭代過(guò)程包含順序抓取行動(dòng)和重新路由(由最低成本路徑中找到)中的每個(gè)電路網(wǎng)。對(duì)使用路由資源成本的函數(shù),其對(duì)資源的任何過(guò)度使用都會(huì)讓當(dāng)前路由發(fā)生事先迭代。通過(guò)逐漸增加的多余認(rèn)購(gòu)路由資源成本,該算法勢(shì)力替代路線(xiàn)網(wǎng),以避免使用超額認(rèn)購(gòu)資源,只剩下網(wǎng)最需要一個(gè)給定的資源。對(duì)于本文的實(shí)驗(yàn)結(jié)果,我們?cè)O(shè)置路由器的最大數(shù)量迭代為45,如果電路中路由沒(méi)有成功,一定數(shù)目的目錄中45迭代就被假定為不可路由通道的寬度。為了避免過(guò)于迂回路線(xiàn)以節(jié)省CPU時(shí)間,我們讓一個(gè)去凈路由最外的3個(gè)通道的凈終端邊界框。一個(gè)重要的執(zhí)行細(xì)節(jié)值得一提。無(wú)論是原探路者算法和Vpr路由器使用的Dijkstra算法(即一個(gè)迷宮路由器[15]),以每個(gè)網(wǎng)絡(luò)連接和AK用線(xiàn)網(wǎng)為依據(jù),路由器調(diào)用通道的k-1次執(zhí)行所有需要的連接。在第一次調(diào)用迷宮路由波從凈源擴(kuò)大,直到它到達(dá)任何的K–1值之后。路徑從源到接收器作為現(xiàn)在這個(gè)網(wǎng)的路由的第一部分。波前的迷宮路由被清空,新波前擴(kuò)展是從整個(gè)網(wǎng)絡(luò)布線(xiàn)開(kāi)始發(fā)出的。之后的K-1路由器的迷宮調(diào)用凈終端將所有k值連接。不幸的是,這種方法需要高扇出網(wǎng)絡(luò)相當(dāng)多的CPU時(shí)間。高扇出網(wǎng)絡(luò)通??缭酱蟛糠只蛩械腇PGA。因此,后者調(diào)用迷宮路由器的路由部分作為凈源會(huì)非常大,它將需要相當(dāng)長(zhǎng)的時(shí)間以擴(kuò)大迷宮路由器波前部分到下一個(gè)接收器。幸好,有一個(gè)更有效的方法。當(dāng)達(dá)到凈水槽值時(shí),加入所有路由資源分部需要連接水槽和目前的局部路由成本為0的波前(即擴(kuò)展列表)。當(dāng)前不要空迷宮路由波前,只要保證繼續(xù)擴(kuò)大正常。由于增加新的路徑路由的部分有一個(gè)零成本,由于這項(xiàng)新路徑通常相當(dāng)小迷宮路由器將首先擴(kuò)大它范圍,也需要相對(duì)較少的時(shí)間來(lái)添加此新波,如果整個(gè)波前擴(kuò)展了能實(shí)現(xiàn)那么下一個(gè)接收器將達(dá)到的速度遠(yuǎn)遠(yuǎn)超過(guò)現(xiàn)在。圖3說(shuō)明了差異圖形。5實(shí)驗(yàn)結(jié)果各種FPGA在本節(jié)中使用的參數(shù),總是選擇與先前參數(shù)有明顯對(duì)比的那些參數(shù)。所得結(jié)果在本節(jié)獲得了邏輯的4輸入LUT加上一個(gè)觸發(fā)器組成的塊,如圖所示在圖2。時(shí)鐘網(wǎng)和時(shí)序電路沒(méi)有遞交,因?yàn)樗ǔJ锹酚赏ㄟ^(guò)專(zhuān)用FPGA的商業(yè)網(wǎng)絡(luò)中的路由。每個(gè)LUT的輸入出現(xiàn)在一個(gè)邏輯塊的一面,而邏輯塊輸出一般訪(fǎng)問(wèn)底部和右側(cè),如圖4。每個(gè)邏輯塊的輸入或輸出連接任何相鄰?fù)ǖ溃╯)(即Fc的=寬)。每根電線(xiàn)段和其他布線(xiàn)連接到三段,而在通道交叉口(即值=3)和開(kāi)關(guān)箱拓?fù)涫恰安幌嘟弧边@是因?yàn)樵?磁道接線(xiàn)段只連接在0磁道的其他布線(xiàn)段。5.1實(shí)驗(yàn)結(jié)果與輸入引腳Doglegs以往大多數(shù)FPGA布線(xiàn)結(jié)果認(rèn)為“輸入引腳doglegs”是可能。如果輸入引腳之間的音軌和它連接接線(xiàn)盒的Fc通過(guò)獨(dú)立的SRAM位控制晶體所組成,為了驗(yàn)證兩條軌道上的這些開(kāi)關(guān)通過(guò)電氣連接的可能性。我們將把這個(gè)作為一個(gè)輸入管腳doglegs。作為商業(yè)化的FPGA,實(shí)現(xiàn)從一個(gè)輸入引腳接線(xiàn)盒到多路通道,只有一個(gè)軌道可以連接到輸入引腳,使用多路復(fù)用器而不是獨(dú)立通過(guò)在FPGA中的晶體管布局來(lái)保存相當(dāng)?shù)拿娣e。另外,通常有一個(gè)緩沖軌道之間的連接塊和它連接多路復(fù)用這樣做的目的是為了提高速度,同時(shí)這也意味著緩沖輸入引腳doglegs不能被使用。因此,如果在未來(lái)FPGA的路由器測(cè)試時(shí)沒(méi)有輸入引腳doglegs那么我們必須讓輸入引腳doglegs和過(guò)去??的結(jié)果公平的比較這樣是最好的。在本節(jié)中我們比較了所需的最低數(shù)目,每一條成功的路徑和CAD工具的路由設(shè)置。所有的基準(zhǔn)circuits.1在表2給出結(jié)果,得到了路由Altor[16],制作了一個(gè)基于位置的工具min。列出三兩步(全球和詳細(xì))路由與其它路由器進(jìn)行合并后的全球和詳細(xì)的路由。

VPR要求比第二,第三最佳路由器降低10%的資源數(shù)目,表3列出了音軌需要執(zhí)行這些標(biāo)準(zhǔn)時(shí)數(shù)新的CAD工具,同時(shí)允許地方和路線(xiàn)的電路的連接。列出所有電路邏輯快的消息清單。VPR使用少于13%資源數(shù)目的同時(shí),它將執(zhí)行合并后的全球和詳細(xì)的路由,世嘉比用于執(zhí)行詳細(xì)路由對(duì)AAVPR生成全版圖走線(xiàn)。執(zhí)行安置和全局路由,在試圖改善繞線(xiàn)同時(shí)需要超過(guò)87%以上VPR總資源數(shù)目。最后,讓VPR配置電路而不是強(qiáng)迫它使用Altor內(nèi)存來(lái)減少資源數(shù)目的40%,這表明VPR的模擬退火算法單元遠(yuǎn)較Altor最小單元更好。5.2不輸入引腳的Doglegs實(shí)驗(yàn)比較了VPR與SPLACE/SROUTE工具,不允許輸入引腳doglegs的性能。當(dāng)這兩個(gè)工具都只能使用路線(xiàn)一,比起SROUTE軌道Altor產(chǎn)生的安置需求VPR減少13%,。當(dāng)然這些工具都支持允許布局和布線(xiàn)的電路,對(duì)于SPLACE/SROUTE組合VPR還需要少29%資源數(shù)目。無(wú)論是基于VPR和SPLACE只要是使用模擬退火算法,我們相信VPR單元在一方面優(yōu)于SPLACE是因?yàn)樗幚砀呱瘸鼍W(wǎng)絡(luò)更有效率,讓更多的動(dòng)作進(jìn)行評(píng)估,另一方面是因?yàn)樗行У耐嘶饡r(shí)間表給定的時(shí)間。大電路5.3實(shí)驗(yàn)結(jié)果在第5.1和5.2的54至358的邏輯基準(zhǔn)塊范圍內(nèi)使用面積計(jì)算顯然太小,因?yàn)檫@是特殊的FPGA。因此在本節(jié)中我們目前的實(shí)驗(yàn)結(jié)果,20個(gè)最大的MCNC基準(zhǔn)電路[27],它的大小范圍從1047到8383邏輯塊。我們使用Flowmap[28]以技術(shù)圖每4個(gè)LUT和拖動(dòng)塊并為VPACKtocombine拖動(dòng)塊,進(jìn)入我們的基本邏輯電路塊LUT。I/O引腳數(shù)每行或列適合設(shè)置為2,符合目前的商業(yè)化FPGA。每個(gè)電路被放置在最小的正方形FPGA可以包含它的路由并且輸入引腳doglegs是不允許的。請(qǐng)注意三個(gè)基準(zhǔn)bigkey,DES和dsip,是padlimited要求在FPGA架構(gòu)表5比較資源數(shù)量的地方,在完全路線(xiàn)電路與全版圖范圍內(nèi)所需地點(diǎn)與路線(xiàn)的電路與數(shù)字VPR,然后進(jìn)行詳細(xì)的路由世嘉[23]。表5還給出了大小每個(gè)邏輯塊的數(shù)量計(jì)算電路。在世嘉列中的條目3仿真無(wú)法成功,因?yàn)槭兰芜\(yùn)行路由內(nèi)存不足。由VPR增加路由產(chǎn)生的全版圖航線(xiàn)曲目總數(shù),有超過(guò)所需68%路線(xiàn)的電路主場(chǎng)由VPR路由完全執(zhí)行。顯然,世嘉處理無(wú)法進(jìn)行。因?yàn)槁酚纱箅娐樊?dāng)輸入引腳doglegs是不允許的。為了鼓勵(lì)其它FPGA研究人員公布的結(jié)果,以這些大型路由基準(zhǔn),我們發(fā)出以下“FPGA的挑戰(zhàn)?!泵看悟?yàn)證結(jié)果跳動(dòng)的最好驗(yàn)證先前對(duì)這些基準(zhǔn)結(jié)果公布,我們將每條信息支付1美元給作者(對(duì)不起,1元加幣。,而不是1美元),由他們來(lái)處理如果減少需要跟蹤的總數(shù)。該技術(shù)映射網(wǎng)表,由VPR生成和投放位置的目前最全的跟蹤路由在/?jayar/software.html。上可以找到6結(jié)論和未來(lái)工作我們已經(jīng)提出了一個(gè)優(yōu)于所有這類(lèi)工具的新的FPGA布局布線(xiàn)工具,它讓我們可以進(jìn)行直接的比較。此外,我們已經(jīng)提出更大的電路基準(zhǔn)測(cè)試結(jié)果。建立專(zhuān)門(mén)用于描述精密學(xué)術(shù)的FPGA布局布線(xiàn)工具。我們希望下一代的FPGACAD工具將優(yōu)化這些大型基點(diǎn),因?yàn)樗麄兪且幌盗忻芮械膯?wèn)題被映射成今天的FPGA。VPR的主要設(shè)計(jì)目標(biāo)之一是保持足夠的靈活性,允許工具使用在很多FPGA架構(gòu)的研究上。我們目前正進(jìn)行幾個(gè)VPR改進(jìn),才能進(jìn)一步提高其在FPGA架構(gòu)的研究能力。在不久的將來(lái)VPR將支持緩沖和分段路由結(jié)構(gòu),我們計(jì)劃增加定時(shí)分析儀和時(shí)序驅(qū)動(dòng)的路由。外文原文VPR:ANewPacking,PlacementandRoutingToolforFPGAResearch1VaughnBetzandJonathanRoseDepartmentofElectricalandComputerEngineering,UniversityToronto,ON,CanadaM5S3G4{vaughn,jayar}@AbstractWedescribethecapabilitiesofandalgorithmsusedinanewFPGACADtool,VersatilePlaceandRoute(VPR).Intermsofminimizingroutingarea,VPRoutperformsallpublishedFPGAplaceandroutetoolstowhichwecancompare.Althoughthealgorithmsusedarebasedonpreviouslyknownapproaches,wepresentseveralenhancementsthatimproverun-timeandquality.WepresentplacementandroutingresultsonanewsetoflargecircuitstoallowfuturebenchmarkcomparisonsofFPGAplaceandroutetoolsoncircuitsizesmoretypicaloftoday’sindustrialdesigns.VPRiscapableoftargetingabroadrangeofFPGAarchitectures,andthesourcecodeispubliclyavailable.Itandtheassociatednetlisttranslation/clusteringtoolVPACKhavealreadybeenusedinanumberofresearchprojectsworldwide,andshouldbeusefulinmanyareasofFPGAarchitectureresearch.1IntroductionInFPGAresearch,onemusttypicallyevaluatetheutilityofnewarchitecturalfeaturesexperimentally.Thatis,benchmarkcircuitsaretechnologymapped,placedandroutedontotheFPGAarchitecturesofinterest,andmeasuresofthearchitecture’squality,suchasspeedorarea,canthenreadilybeextracted.Accordingly,thereisconsiderableneedforflexibleCADtoolsthatcantargetawidevarietyofFPGAarchitecturesefficiently,andhenceallowfaircomparisonsofthearchitectures.ThispaperdescribestheVersatilePlaceandRoute(VPR)tool,whichhasbeendesignedtobeflexibleenoughtoallowcomparisonofmanydifferentFPGAarchitectures.VPRcanperformplacementandeitherglobalroutingorcombinedglobalanddetailedrouting.Itispubliclyavailablefrom/~jayar/software.html.InordertomakemeaningfulFPGAarchitecturecomparisons,itisessentialthattheCADtoolsusedtomapcircuitsintoeacharchitectureareofhighquality.TheroutingphaseofVPRoutperformsallpreviouslypublishedFPGAroutersforwhichstandardbenchmarksresultsareavailable,andthatthecombinationofVPR’splacerandrouteroutperformsallpublishedcombinationsofFPGAplacementandroutingtools.2Theorganizationofthispaperisasfollows.InSection2wedescribesomeofthefeaturesofVPRandtherangeofFPGAarchitectureswithwhichitmaybeused.InSections3and4wedescribetheplacementandroutingalgorithms.InSection5,wecomparethenumberoftracksrequiredbyVPRtosuccessfullyroutecircuitswiththatrequiredbyotherpublishedtools.InSection6weconcludeandoutlinesomefutureenhancementswhichwillbemadetoVPR.2OverviewofVPRFigure1outlinestheVPRCADflow.TheinputstoVPRconsistofatechnologymappednetlistandatextfiledescribingtheFPGAarchitecture.VPRcanplacethecircuit,orapre-existingplacementcanbereadin.VPRcanthenperformeitheraglobalrouteoracombinedglobal/detailedrouteoftheplacement.VPR’soutputconsistsoftheplacementandrouting,aswellasstatisticsusefulinassessingtheutilityofanFPGAarchitecture,suchasroutedwirelength,trackcount,andmaximumnetlength.Someofthearchitecturalparametersthatcanbespecifiedinthearchitecturedescriptionfileare:?thenumberoflogicblockinputsandoutputs,?theside(s)ofthelogicblockfromwhicheachinputandoutputisaccessible,?thelogicalequivalencebetweenvariousinputandoutputpins(e.g.allLUTinputsarefunctionallyequivalent),?thenumberofI/OpadsthatfitintooneroworonecolumnoftheFPGA,and?thedimensionsofthelogicblockarray(e.g.23x30logicblocks).Inaddition,ifglobalroutingistobeperformed,onecanalsospecify:?therelativewidthsofhorizontalandverticalchannels,and?therelativewidthsofthechannelsindifferentregionsoftheFPGA.Finally,ifcombinedglobalanddetailedroutingistobeperformed,onealsospecifies:?theswitchblock[1]architecture(i.e.howtheroutingtracksareinterconnected),?thenumberoftrackstowhicheachlogicblockinputpinconnects(Fc[1]),?theFcvalueforlogicblockoutputs,and?theFcvalueforI/Opads.Thecurrentarchitecturedescriptionformatdoesnotallowsegmentsthatspanmorethanonelogicblocktobeincludedintheroutingarchitecture,butwearepresentlyaddingthisfeature.AddingnewroutingarchitecturefeaturestoVPRisrelativelyeasy,sinceVPRusesthearchitecturedescriptiontocreatearoutingresourcegraph.Everyroutingtrackandeverypininthearchitecturebecomesanodeinthisgraph,andthegraphedgesrepresenttheallowableconnections.Therouter,graphicsvisualiza-tionandstatisticscomputationroutinesallworkonlywiththisroutingresourcegraph,soaddingnewroutingarchitecturefeaturesonlyinvolveschangingthesubroutinesthatbuildthisgraph.AlthoughVPRwasinitiallydevelopedforisland-styleFPGAs[2,3],itcanalsobeusedwithrow-basedFPGAs[4].VPRisnotcurrentlycapableoftargetinghierarchicalFPGAs[5],althoughaddinganappropriateplacementcostfunctionandtherequiredroutingresourcegraphbuildingroutineswouldallowittotargetthem.Finally,VPR’sbuilt-ingraphicsallowinteractivevisualizationoftheplacement,therouting,theavailableroutingresourcesandthepossiblewaysofinterconnectingtheroutingresources.2.1TheVPACKLogicBlockPacker/NetlistTranslatorVPACKreadsinablifformatnetlistofacircuitthathasbeentechnology-mappedtoLUTsandflip-flops,packstheLUTsandflipflopsintothedesiredFPGAlogicblock,andoutputsanetlistinVPR’snetlistformat.VPACKcantargetalogicblockconsistingofoneLUTandoneFF,asshowninFigure2,asthisisacommonFPGAlogicelement.VPACKisalsocapableoftargetinglogicblocksthatcontainseveralLUTsandseveralflipflops,withorwithoutsharedLUTinputs[6].These“clusterbased”logicblocksaresimilartothoseemployedinrecentFPGAsbyAltera,XilinxandLucentTechnologies.PlacementAlgorithmVPRusesthesimulatedannealingalgorithm[7]forplacement.Wehaveexperimentedwithseveraldifferentcostfunctions,andfoundthatwhatwecallalinearcongestioncostfunctionprovidesthebestresultsinareasonablecomputationtime[8].Thefunctionalformofthiscostfunctioniswherethesummationisoverallthenetsinthecircuit.Foreachnet,bbxandbbydenotethehorizontalandverticalspansofitsboundingbox,respectively.Theq(n)factorcompensatesforthefactthattheboundingboxwirelengthmodelunderestimatesthewiringnecessarytoconnectnetswithmorethanthreeterminals,assuggestedin[10].Itsvaluedependsonthenumberofterminalsofnetn;qis1fornetswith3orfewerterminals,andslowlyincreasesto2.79fornetswith50terminals.Cav,x(n)andCav,y(n)aretheaveragechannelcapacities(intracks)inthexandydirections,respectively,overtheboundingboxofnetn.ThiscostfunctionpenalizesplacementswhichrequiremoreroutinginareasoftheFPGAthathavenarrowerchannels.Alltheresultsinthispaper,however,areobtainedwithFPGAsinwhichallchannelshavethesamecapacity.InthiscaseCavisaconstantandthelinearcongestioncostfunctionreducestoaboundingboxcostfunction.Agoodannealingscheduleisessentialtoobtainhigh-qualitysolutionsinareasonablecomputationtimewithsimulatedannealing.Wehavedevelopedanewannealingschedulewhichleadstoveryhigh-qualityplacements,andinwhichtheannealingparametersautomaticallyadjusttodifferentcostfunctionsandcircuitsizes.Wecomputetheinitialtemperatureinamannersimilarto[11].LetNblocksbethetotalnumberoflogicblocksplusthenumberofI/Opadsinacircuit.Wefirstcreatearandomplacementofthecircuit.NextweperformNblocksmoves(pairwiseswaps)oflogicblocksorI/Opads,andcomputethestandarddeviationofthecostoftheseNblocksdifferentconfigurations.Theinitialtemperatureissetto20timesthisstandarddeviation,ensuringthatinitiallyvirtuallyanymoveisacceptedatthestartoftheanneal.Asin[12],thedefaultnumberofmovesevaluatedateachtemperatureis.Thisdefaultnumbercanbeoverriddenonthecommandline,however,toallowdifferentCPUtime/placementqualitytradeoffs.Reducingthenumberofmovespertemperaturebyafactorof10,forexample,speedsupplacementbyafactorof10andreducesfinalplacementqualitybyonlyabout10%.Whenthetemperatureissohighthatalmostanymoveisaccepted,weareessentiallymovingrandomlyfromoneplacementtoanotherandlittleimprovementincostisobtained.Conversely,ifveryfewmovesarebeingaccepted(duetothetemperaturebeinglowandthecurrentplacementbeingoffairlyhighquality),thereisalsolittleimprovementincost.Withthismotivationinmind,weproposeanewtemperatureupdateschedulewhichincreasestheamountoftimespentattemperatureswhereasignificantfractionof,butnotall,movesarebeingaccepted.AnewtemperatureiscomputedasTnew=aTold,wherethevalueofadependsonthefractionofattemptedmovesthatwereaccepted(Raccept)atTold,asshowninTable1.Finally,itwasshownin[12,13]thatitisdesirabletokeepRacceptnear0.44foraslongaspossible.WeaccomplishthisbyusingthevalueofRaccepttocontrolarangelimiter--onlyinterchangesofblocksthatarelessthanorequaltoDlimitunitsapartinthexandydirectionsareattempted.AsmallvalueofDlimitincreasesRacceptbyensuringthatonlyblockswhichareclosetogetherareconsideredforswapping.These“l(fā)ocalswaps”tendtoresultinrelativelysmallchangesintheplacementcost,increasingtheirlikelihoodofacceptance.Initially,Dlimitissettotheentirechip.Wheneverthetemperatureisreduced,thevalueofDlimitisupdatedaccordingto,andthenclampedtotherange1£Dlimit£maximumFPGAdimension.ThisresultsinDlimitbeingthesizeoftheentirechipforthefirstpartoftheanneal,shrinkinggraduallyduringthemiddlestagesoftheanneal,andbeing1forthelow-temperaturepartoftheanneal.Finally,theannealisterminatedwhenT<0.005*Cost/Nnets.Themovementofalogicblockwillalwaysaffectatleastonenet.Whenthetemperatureislessthanasmallfractionoftheaveragecostofanet,itisunlikelythatanymovethatresultsinacostincreasewillbeaccepted,soweterminatetheanneal.RoutingAlgorithmVPR’srouterisbasedonthePathfindernegotiatedcongestionalgorithm[14,8].Basically,thisalgorithminitiallyrouteseachnetbytheshortestpathitcanfind,regardlessofanyoveruseofwiringsegmentsorlogicblockpinsthatmayresult.Oneiterationoftherouterconsistsofsequentiallyripping-upandre-routing(bythelowestcostpathfound)everynetinthecircuit.Thecostofusingaroutingresourceisafunctionofthecurrentoveruseofthatresourceandanyoverusethatoccurredinpriorroutingiterations.Bygraduallyincreasingthecostofoversubscribedroutingresources,thealgorithmforcesnetswithalternativeroutestoavoidusingoversubscribedresources,leavingonlythenetthatmostneedsagivenresourcebehind.Fortheexperimentalresultsinthispaperwesetthemaximumnumberofrouteriterationsto45;ifacircuithasnotsuccessfullyroutedinagivennumberoftracksin45iterationsitisassumedtobeunroutablewithchannelsofthatwidth.ToavoidoverlycircuitousroutesandtosaveCPUtime,weallowtheroutingofanettogoatmost3channelsoutsidetheboundingboxofthenetterminals.Oneimportantimplementationdetaildeservesmention.BoththeoriginalPathfinderalgorithmandVPR’srouteruseDijkstra’salgorithm(i.e.amazerouter[15])toconnecteachnet.Forakterminalnet,themazerouterisinvokedk-1timestoperformalltherequiredconnections.Inthefirstinvocation,themazeroutingwavefrontexpandsoutfromthenetsourceuntilitreachesanyoneofthek-1netsinks.Thepathfromsourcetosinkisnowthefirstpartofthisnet’srouting.Themazeroutingwavefrontisemptied,andanewwavefrontexpansionisstartedfromtheentirenetroutingfoundthusfar.Afterk-1invocationsofthemazerouterallkterminalsofthenetwillbeconnected.Unfortunately,thisapproachrequiresconsiderableCPUtimeforhigh-fanoutnets.High-fanoutnetsusuallyspanmostoralloftheFPGA.Therefore,inthelatterinvocationsofthemazerouterthepartialroutingusedasthenetsourcewillbeverylarge,anditwilltakealongtimetoexpandthemazerouterwavefrontouttothenextsink.Fortunatelythereisamoreefficientmethod.Whenanetsinkisreached,addalltheroutingresourcesegmentsrequiredtoconnectthesinkandthecurrentpartialroutingtothewavefront(i.e.theexpansionlist)withacostof0.Donotemptythecurrentmazeroutingwavefront;justcontinueexpandingnormally.Sincethenewpathaddedtothepartialroutinghasacostofzero,themazerouterwillexpandarounditatfirst.Sincethisnewpathistypicallyfairlysmall,itwilltakerelativelylittletimetoaddthisnewwavefront,andthenextsinkwillbereachedmuchmorequicklythaniftheentirewavefrontexpansionhadbeenstartedfromscratch.Figure3illustratesthedifferencegraphically.5ExperimentalResultsThevariousFPGAparametersusedinthissectionwerealwayschosentoallowadirectcomparisonwithpreviouslypublishedresults.Alltheresultsinthissectionwereobtainedwithalogicblockconsistingofa4-inputLUTplusaflipflop,asshowninFigure2.Theclocknetwasnotroutedinsequentialcircuits,asitisusuallyroutedviaadedicatedroutingnetworkincommercialFPGAs.EachLUTinputappearsononesideofthelogicblock,whilethelogicblockoutputisaccessiblefromboththebottomandrightsides,asshowninFigure4.Eachlogicblockinputoroutputcanconnecttoanytrackintheadjacentchannel(s)(i.e.Fc=W).Eachwiresegmentcanconnecttothreeotherwiringsegmentsatchannelintersections(i.eFs=3)andtheswitchboxtopologyis“disjoint”--thatis,awiringsegmentintrack0connectsonlytootherwiringsegmentsintrack0andsoon.5.1ExperimentalResultswithInputPinDoglegsMostpreviousFPGAroutingresultshaveassumedthat“inputpindoglegs”arepossible.IftheconnectionboxbetweenaninputpinandthetrackstowhichitconnectsconsistsofFcindependentpasstransistorscontrolledbyFcSRAMbits,itispossibletoturnontwooftheseswitchesinordertoelectricallyconnecttwotracksviatheinputpin.Wewillrefertothisasaninputpindogleg.CommercialFPGAs,however,implementtheconnectionboxfromaninputpintoachannelviaamultiplexer,soonlyonetrackmaybeconnectedtotheinputpin.UsingamultiplexerratherthanindependentpasstransistorssavesconsiderableareaintheFPGAlayout.Aswell,normallythereisabufferbetweenatrackandtheconnectionblockmultiplexerstowhichitconnectsinordertoimprovespeed;thisbufferalsomeansthatinputpindoglegscannotbeused.Therefore,whileweallowinputpindoglegsinthissectioninordertomakeafaircomparisonwithpastresults,itwouldbebestifinthefutureFPGAroutersweretestedwithoutinputpindoglegs.InthissectionwecomparetheminimumnumberoftracksperchannelrequiredforasuccessfulroutingbyvariousCADtoolsonasetof9benchmarkcircuits.1AlltheresultsinTable2areobtainedbyroutingaplacementproducedbyAltor[16],amincutbasedplacementtool.Threeofthecolumnsconsistoftwo-step(globalthendetailed)routing,whiletheotherroutersperformcombinedglobalanddetailedrouting.VPRrequires10%fewertracksthanthesecondbestrouter,andthethirdbestrouterconsistsofVPR’sglobalroutephaseplusSEGAfordetailedrouting.Table3liststhenumberoftracksrequiredtoimplementthesebenchmarkswhennewCADtoolsareallowedtobothplaceandroutethecircuits.Thesizecolumnliststhenumberoflogicblocksineachcircuit.VPRuses13%fewertrackswhenitperformscombinedglobalanddetailedroutingthanitdoeswhenSEGAisusedtoperformdetailedroutingonaaVPR-generatedglobalroute.FPR,whichperformsplacementandglobalroutingsimultaneouslyinanattempttoimproveroutability,requires87%moretotaltracksthanVPR.Finally,allowingVPRtoplacethecircuitsinsteadofforcingittousetheAltorplacementsreducesthenumberoftracksVPRrequirestoroutethemby40%,indicatingthatVPR’ssimulatedannealingbasedplacerisconsiderablybetterthantheAltormin-cutplacer.5.2ExperimentalResultsWithoutInputPinDoglegsTable4comparestheperformanceofVPRwiththatoftheSPLACE/SROUTEtoolset,whichdoesnotallowinputpindoglegs.WhenbothtoolsareonlyallowedtorouteanAltor-generatedplacementVPRrequires13%fewertracksthanSROUTE.Whenthetoolsareallowedtobothplaceandroutethecircuits,VPRrequires29%fewertracksthantheSPLACE/SROUTEcombination.BothVPRandSPLACEarebasedonsimulatedannealing.WebelievetheVPRplaceroutperformsSPLACEpartiallybecauseithandleshigh-fanoutnetsmoreefficiently,allowingmoremovestobeevaluatedinagiventime,andpartiallybecauseofitsmoreefficientannealingschedule.5.3ExperimentalResultsonLargeCircuitsThebenchmarksusedinSections5.1and5.2rangeinsizefrom54to358logicblocks,andaccordinglyaretoosmalltobeveryrepresentativeoftoday’sFPGAs.Therefore,inthissectionwepresentexperimentalresultsforthe20largestMCNCbenchmarkcircuits[27],whichrangeinsizefrom1047to8383logicblocks.WeuseFlowmap[28]totechnologymapeachcircuitto4-LUTsandflipflops,andVPACKtocombineflipflopsandLUTsintoourbasiclogicblock.ThenumberofI/Opadsthatfitperroworcolumnissetto2,inlinewithcurrentcommercialFPGAs.EachcircuitisplacedandroutedinthesmallestsquareFPGAwhichcancontainit.Inputpindoglegsarenotallowed.Notethatthreeofthebenchmarks,bigkey,des,anddsip,arepadlimitedintheFPGAarchitectureassumed.Table5comparesthenumberoftracksrequiredtoplaceandcompletelyroutecircuitswithVPRwiththenumberrequiredtoplaceandgloballyroutethecircuitswithVPRandthenperformdetailedroutingwithSEGA[23].Table5alsogivesthesizeofeachcircuit,intermsofthenumberoflogicblocks.TheentriesintheSEGAcolumnwitha3signcouldnotbesuccessfullyroutedbecauseSEGAranoutofmemory.UsingSEGAtoperformdetailedroutingonaglobalroutegeneratedbyVPRincreasesthetotalnumberoftracksrequiredtoroutethecircuitsbyover68%vs.havingVPRperformtheroutingcompletely.ClearlySEGAhasdifficultyroutinglargecircuitswheninputpindoglegsarenotallowed.ToencourageotherFPGAresearcherstopublishroutingresultsusingtheselargerbenchmarks,weissuethefollowing“FPGAchallenge.”Eachtimeverifiedresultswhichbeatthepreviouslybestverifiedresultsonthesebenchmarksareannounced,wewillpaytheauthors$1(sorry,$1Cdn.,not$1U.S.)foreachtrackbywhichtheyreducethetotalnumberoftracksrequiredfromthatofthepreviouslybestresults.Thetechnology-mappednetlists,theplacementsgeneratedbyVPRandthecurrentlybestroutingtracktotalareavailableat/~jayar/software.html.6ConclusionsandFutureWorkWehavepresentedanewFPGAplacementandroutingtoolthatoutperformsallsuchtoolstowhichwecanmakedirectcomparisons.InadditionwehavepresentedbenchmarkresultsonmuchlargercircuitsthanhavetypicallybeenusedtocharacterizeacademicFPGAplaceandroutetools.WehopethenextgenerationofFPGACADtoolswillbecomparedonthebasisoftheselargerbenchmarks,astheyareacloserapproximationofthekindofproblemsbeingmappedintotoday’sFPGAs.OneofthemaindesigngoalsforVPRwastokeepthetoolflexibleenoughtoallowitsuseinmanyFPGAarchitecturalstudies.WearecurrentlyworkingonseveralimprovementstoVPRtofurtherincreaseitsutilityinFPGAarchitectureresearch.InthenearfutureVPRwillsupportbufferedandsegmentedroutingstructures,andsoonafterthatweplantoaddatiminganalyzerandtiming-drivenrouting.References[1]S.Brown,R.Francis,J.Rose,andZ.Vranesic,F(xiàn)ield-ProgrammableGateAr

溫馨提示

  • 1. 本站所有資源如無(wú)特殊說(shuō)明,都需要本地電腦安裝OFFICE2007和PDF閱讀器。圖紙軟件為CAD,CAXA,PROE,UG,SolidWorks等.壓縮文件請(qǐng)下載最新的WinRAR軟件解壓。
  • 2. 本站的文檔不包含任何第三方提供的附件圖紙等,如果需要附件,請(qǐng)聯(lián)系上傳者。文件的所有權(quán)益歸上傳用戶(hù)所有。
  • 3. 本站RAR壓縮包中若帶圖紙,網(wǎng)頁(yè)內(nèi)容里面會(huì)有圖紙預(yù)覽,若沒(méi)有圖紙預(yù)覽就沒(méi)有圖紙。
  • 4. 未經(jīng)權(quán)益所有人同意不得將文件中的內(nèi)容挪作商業(yè)或盈利用途。
  • 5. 人人文庫(kù)網(wǎng)僅提供信息存儲(chǔ)空間,僅對(duì)用戶(hù)上傳內(nèi)容的表現(xiàn)方式做保護(hù)處理,對(duì)用戶(hù)上傳分享的文檔內(nèi)容本身不做任何修改或編輯,并不能對(duì)任何下載內(nèi)容負(fù)責(zé)。
  • 6. 下載文件中如有侵權(quán)或不適當(dāng)內(nèi)容,請(qǐng)與我們聯(lián)系,我們立即糾正。
  • 7. 本站不保證下載資源的準(zhǔn)確性、安全性和完整性, 同時(shí)也不承擔(dān)用戶(hù)因使用這些下載資源對(duì)自己和他人造成任何形式的傷害或損失。

評(píng)論

0/150

提交評(píng)論