版權(quán)說明:本文檔由用戶提供并上傳,收益歸屬內(nèi)容提供方,若內(nèi)容存在侵權(quán),請(qǐng)進(jìn)行舉報(bào)或認(rèn)領(lǐng)
文檔簡介
1、High Performance Information Retrieval and MEMS CADon Intel ItaniumPerformance Tuning ParticipantsResearchersMark Adams (SNL), David Bailey (LBL), Parry Husbands (LBL), Xiaoye Li (LBL), Lenny Oliker (LBL)PhD StudentsYozo Hida, Geoff Pike, Rich VuducUndergradsBrian Gaeke , Jen Hsu, Shoaib Kamil, Suh
2、Kang, Hyun Kim, Gina Lee, Jaeseop Lee, Michael de Lorimier, Jin Moon, Randy Shoopman, Brandon Thompson OutlineMotivation, History, Related workTuning Sparse Matrix OperationsResults on Sun Ultra 1/170Preliminary results on P4Preliminary results on ItaniumApplications SUGAR a MEMS CAD systemInformati
3、on RetrievalFuture Work(Whats up with Millennium)(Background on Test Matrices)MotivationHistoryRelated Work Performance Tuning Motivation: performance of many applications dominated by a few kernelsMEMS CAD Nonlinear ODEs Nonlinear equations Linear equations Matrix multiplyMatrix-by-matrix or matrix
4、-by-vectorDense or SparseInformation retrieval by LSI Compress term-document matrix Sparse mat-vec multiplyInformation retrieval by LDA Maximum likelihood estimation Solve linear systemsMany other examples (not all linear algebra)Conventional Performance Tuning Vendor or user hand tunes kernels Draw
5、backs:Very time consuming and tedious workEven with intimate knowledge of architecture and compiler, performance hard to predictGrowing list of kernels to tuneExample: New BLAS StandardMust be redone for every architecture, compilerCompiler technology often lags architectureNot just a compiler probl
6、em: Best algorithm may depend on input, so some tuning must occur at run-time.Not all algorithms semantically or mathematically equivalentAutomatic Performance TuningApproach: for each kernelIdentify and generate a space of algorithmsSearch for the fastest one, by running themWhat is a space of algo
7、rithms?Depending on kernel and input, may varyinstruction mix and ordermemory access patternsdata structures mathematical formulation When do we search?Once per kernel and architecture At compile timeAt run timeAll of the aboveTuning pays off PHIPAC (Bilmes, Asanovic, Vuduc, Demmel)Tuning pays off A
8、TLAS (Dongarra, Whaley)Extends applicability of PHIPACIncorporated in Matlab (with rest of LAPACK)Other Automatic Tuning ProjectsFFTs and Signal ProcessingFFTW ()Given dimension n of FFT, choose best implementation at runtime by assembling prebuilt kernels for small factors of nWidely used, won 1999
9、 Wilkinson Prize for Numerical SoftwareSPIRAL (/spiral)Extensions to other transforms, DSPsUHFFT Extensions to higher dimension, parallelismSpecial session at ICCS 2001Organized by Yelick and Demmelwww.ucalgary.ca/iccsProceedings availablePointers to automatic tuning projects at /yelick/iccs-tuneSea
10、rch for optimal register tile sizes on Sun Ultra 1016 registers, but 2-by-3 tile size fastestSearch for Optimal L0 block size in dense matmul60% of peak on Pentium II-3004% of versions exceedHigh precision dense mat-vec multiply (XBLAS)High Precision Algorithms (XBLAS)Double-double (High precision w
11、ord represented as pair of doubles)Many variations on these algorithms; we currently use BaileysExploiting Extra-wide RegistersSuppose s(1) , , s(n) have f-bit fractions, SUM has Ff bit fractionConsider following algorithm for S = Si=1,n s(i)Sort so that |s(1)| |s(2)| |s(n)|SUM = 0, for i = 1 to n S
12、UM = SUM + s(i), end for, sum = SUMTheorem (D., Hida)If n 2F-f + 1, then sum agrees with S to about 1.5 ulpsIf n = 2F-f + 2, then error can be at least as large as 22f-F 1 ulpsIf n 2F-f + 3, then error can be arbitrary (S 0 but sum = 0 )Exampless(i) double (f=53), SUM double extended (F=64) accurate
13、 if n 211 + 1 = 2049Dot product of single precision x(i) and y(i) s(i) = x(i)*y(i) (f=2*24=48), SUM double extended (F=64) accurate if n 216 + 1 = 65537Tuning Sparse Matrix ComputationsTuning Sparse matrix-vector multiply SparsityOptimizes y = A*x for a particular sparse AIm and YelickAlgorithm spac
14、eDifferent code organization, instruction mixesDifferent register blockings (change data structure and fill of A)Different cache blockingDifferent number of columns of xDifferent matrix orderingsSoftware and papers available/yelick/sparsityHow Sparsity tunes y = A*xRegister Blocking Store matrix as
15、dense r x c blocksPrecompute performance in Mflops of dense A*x for various register block sizes r x cGiven A, sample it to estimate Fill if A blocked for varying r x cChoose r x c to minimize estimated running time Fill/MflopsStore explicit zeros in dense r x c blocks, unrollCache Blocking Useful w
16、hen source vector x enormousStore matrix as sparse 2k x 2l blocksSearch over 2k x 2l cache blocks to find fastest Register-Blocked Performance of SPMV on Dense Matrices (up to 12x12)333 MHz Sun Ultra IIi800 MHz Pentium III1.5 GHz Pentium 4800 MHz Itanium70 Mflops35 Mflops425 Mflops310 Mflops175 Mflo
17、ps105 Mflops250 Mflops110 MflopsWhich other sparse operations can we tune?General matrix-vector multiply A*xPossibly many vectors xSymmetric matrix-vector multiply A*xSolve a triangular system of equations T-1*xy = AT*A*xKernel of Information Retrieval via LSI (SVD)Same number of memory references a
18、s A*xy = Si (A(i,:)T * (A(i,:) * x)Future workA2*x, Ak*xKernel of Information Retrieval used by GoogleIncludes Jacobi, SOR, Changes calling algorithmAT*M*AMatrix triple productUsed in multigrid solverTest MatricesGeneral Sparse MatricesUp to n=76K, nnz = 3.21MFrom many application areas1 Dense2 to 1
19、7 - FEM18 to 39 - assorted41 to 44 linear programming45 - LSISymmetric MatricesSubset of General Matrices1 Dense2 to 8 - FEM9 to 13 - assortedLower Triangular MatricesObtained by running SuperLU on subset of General Sparse Matrices1 Dense2 13 FEMDetails on test matrices at end of talkResults on Sun
20、Ultra 1/170Speedups on SPMV from Sparsity on Sun Ultra 1/170 1 RHSSpeedups on SPMV from Sparsity on Sun Ultra 1/170 9 RHSPreliminary Results on P4 using icc and gccSpeedup of SPMV from Sparsity on P4/icc-5.0.1Performance of SPMV from Sparsity on P4/icc-5.0.1Fill for SPMV from Sparsity on P4/icc-5.0.
21、1Possible ImprovementsDoesnt work as well as on Sun Ultra 1/170; Why?Current heuristic to determine best r x c block biased to diagonal of performance plotDidnt matter on Sun, does on P4 and Itanium since performance so “nondiagonally dominant”Sparsity reg blocking results on P4 for FEM/fluids matri
22、cesMatrix #2 (150 Mflops to 400 Mflops)Matrix #5 (50 Mflops to 350 Mflops)Sparsity cache blocking results on P4 for LSISymmetric Sparse Matrix-Vector Multiply on P4 (vs nave full = 1)Sparse Triangular Solve (Matlabs colmmd ordering) on P4 AT*A on P4 (Accesses A only once)Preliminary Results on Itani
23、um using eccSpeedup of SPMV from Sparsity on Itanium/ecc-5.0.1Raw Performance of SPMV from Sparsity on Itanium Fill for SPMV from Sparsity on Itanium Possible ImprovementsCurrent heuristic to determine best r x c block biased to diagonal of performance plotDidnt matter on Sun, does on P4 and Itanium
24、 since performance so “nondiagonally dominant”Matrix 8: Chose 2x2 (164 Mflops) Better: 3x1 (196 Mflops)Matrix 9:Chose 2x2 (164 Mflops) Better: 3x1 (213 Mflops)Application of Performance TuningSUGAR A CAD Tool for MEMSApplications to SUGAR a tool for MEMS CADDemmel, Bai, Pister, Govindjee, Agogino, G
25、u, Input: description of MicroElectroMechanical System (as netlist)Output:DC, steady state, modal, transient analyses to assess behaviorCIF for fabrication Simulation capabilitiesBeams and plates (linear, nonlinear, prestressed,)Electrostatic forces, circuitsThermal expansion, Couette damping Availa
26、bilityMatlabPublicly /cfm249 registered users, many unregisteredWeb service M & MEMSRuns on Millennium Now in use in EE 245 at UCB96 usersLots of new features being added, including interface to measurementsMicromirror (Last, Pister)Laterally ac
27、tuated torsionally suspended micromirrorOver 10K dof, 100 line netlist (using subnets)DC and frequency analysisAll algorithms reduce to previous kernelsApplication of Performance TuningInformation RetrievalInformation RetrievalJordanCollaboration with Intel team building probabilistic graphical mode
28、lsBetter alternatives to LSI for document modeling and searchLatent Dirichlet Allocation (LDA)Model documents as union of themes, each with own word distributionMaximum likelihood fit to find themes in set of documents, classify themComputational bottleneck is solution of enormous linear systemsOne
29、of largest Millennium usersIdentifying influential documentsGiven hyperlink patterns of documents, which are most influential?Basis of Google (eigenvector of link matrix sparse matrix vector multiply)Applying Markov chain and perturbation theory to assess reliabilityKernel ICAEstimate set of sources
30、 s and mixing matrix A from samples x = A*sNew way to sample such that sources are as independent as possibleAgain reduces to linear algebra kernelsMore on Kernel ICAAlgorithm 1nonlinear eigenvalue problem, reduces to a sequence of manyvery large generalized spd eigenproblems A l B Block structured,
31、 A dense, B block diagonalOnly smallest nonzero eigenvalue neededSparse eigensolver (currently ARPACK/eigs)Use Incomplete Cholesky (IC) to get low rank approximzation to dense subblocks comprising A and BUse Complete (=Diagonal) Pivoting but take only 20 n stepsCost is O(n)Evaluating matrix entries
32、(exponentials) could be bottleneckNeed fast, low precision exponentialAlgorithm 2Like Algorithm 1, but find all eigenvalues/vectors of A l B Use Holy Grail Future WorkExploit Itanium Architecture128 (82-bit) floating point registers9 HW formats: 24/8(v), 24/15, 24/17, 53/11, 53/15, 53/17, 64/15, 64/
33、17Many few load/store instructionsfused multiply-add instructionpredicated instructionsrotating registers for software pipeliningprefetch instructionsthree levels of cacheTune current and wider set of kernelsImprove heuristics, eg choice of r x cIncorporate intoSUGARInformation RetrievalFurther auto
34、mate performance tuningGeneration of algorithm space generatorsPossible collaborations with IntelGetting right toolsGetting faster, less accurate transcendental functionsProvide feedback on toolsProvide tuned kernels, benchmarks, IR appsProvide system for tuning future kernelsTo provide usersTo eval
35、uate architectural designsMillenniumMillenniumCluster of clusters at UC Berkeley309 CPU cluster in Soda HallSmaller clusters across campusMade possible by Intel equipment grantSignificant other supportNSF, Sun, Microsoft, Nortel, campusMillennium TopologyMillennium Usage Oct 1 11, 2001100% utilizati
36、on for last few daysAbout half the jobs are parallelUsage highlightsAMANDAAntarctic Muon And Neutrino Detector A128 scientists from 15 universities and institutes in the U.S. and Europe.TEMPESTEUV lithography simulations via 3D electromagnetic scatteringcuervo./Volcano/study t
37、he defect printability on multilayer masks TitaniumHigh performance Java dialect for scientific computing/projects/titaniumImplementation of shared address space, and use of SSE2Digital Library ProjectLarge database of imageselib./Used to run spectral image segmentation algorithm for clustering, sea
38、rch on imagesUsage highlights (continued)CS 267Graduate class in parallel computing, 33 enrolled/dbindel/cs267taHomeworkDisaster ResponseHelp find people after Sept 11, set up immediately 48K reports in database, linked to other survivor databasesMEMS CAD (MicroElectroM
39、echanical Systems Computer Aided Design)Tool to help design MEMS systemsUsed this semester in EE 245, 93 More later in talkInformation RetrievalDevelopment of faster information retrieval algorithms/jordanMore later in talkMany applications are part of CITRISBackground on
40、 Test MatricesSparse Matrix Benchmark Suite (1/3)#Matrix NameProblem DomainDimensionNo. Non-zeros1denseDense matrix1,0001.00 M2raefsky3Fluid structure interaction21,2001.49 M3inaccuraAccuracy problem16,1461.02 M4bcsstk35*Stiff matrix automobile frame30,2371.45 M5venkat01Flow simulation62,4241.72 M6c
41、rystk02*FEM crystal free-vibration13,965969 k7crystk03*FEM crystal free-vibration24,6961.75 M8nasasrb*Shuttle rocket booster54,8702.68 M93dtube*3-D pressure tube45,3303.21 M10ct20stif*CT20 engine block52,3292.70 M11baiAirfoil eigenvalue calculation23,560484 k12raefsky4Buckling problem19,7791.33 M13e
42、x113-D steady flow problem16,2141.10 M14rdist1Chemical process simulation4,13494.4 k15vavasis32-D PDE problem41,0921.68 MNote: * indicates a symmetric matrix.Sparse Matrix Benchmark Suite (2/3)#Matrix NameProblem DomainDimensionNo. Non-zeros16orani678Economic modeling2,52990.2 k17rimFEM fluid mechan
43、ics problem22,5601.01 M18memplusCircuit simulation17,758126 k19gemat11Power flow4,92933.1 k20lhr10Chemical process simulation10,672233 k21goodwin*Fluid mechanics problem7,320325 k22bayer02Chemical process simulation13,93563.7 k23bayer10Chemical process simulation13,43694.9 k24coater2Simulation of co
44、ating flows9,540207 k25finan512*Financial portfolio optimization74,752597 k26onetone2Harmonic balance method36,057228 k27pwt*Structural engineering36,519326 k28vibrobox*Vibroacoustics12,328343 k29wang4Semiconductor device simulation26,068177 k30lnsp3937Fluid flow modeling3,93725.4 kSparse Matrix Ben
45、chmark Suite (3/3)#Matrix NameProblem DomainDimensionsNo. Non-zeros31lns3937Fluid flow modeling3,93725.4 k32sherman5Oil reservoir modeling3,31220.8 k33sherman3Oil reservoir modeling5,00520.0 k34orsreg1Oil reservoir modeling2,20514.1 k35saylr4Oil reservoir modeling3,56422.3 k36shyy161Viscous flow cal
46、culation76,480330 k37wang3Semiconductor device simulation26,064177 k38mcfeAstrophysics76524.4 k39jpwh991Circuit physics problem9916,02740gupta1*Linear programming31,8022.16 M41lpcrebLinear programming9,648 x 77,137261 k42lpcredLinear programming8,926 x 73,948247 k43lpfit2pLinear programming3,000 x 1
47、3,52550.3 k44lpnug20Linear programming15,240 x 72,600305 k45lsiLatent semantic indexing10 k x 255 k3.7 MMatrix #2 raefsky3 (FEM/Fluids)Matrix #2 (contd) raefsky3 (FEM/Fluids)Matrix #22 - bayer02 (chemical process simulation)Matrix #22 (contd)- bayer02 (chemical process simulation)Matrix #27 - pwt (structural engineering)Matrix #27 (contd)- pwt (structura
溫馨提示
- 1. 本站所有資源如無特殊說明,都需要本地電腦安裝OFFICE2007和PDF閱讀器。圖紙軟件為CAD,CAXA,PROE,UG,SolidWorks等.壓縮文件請(qǐng)下載最新的WinRAR軟件解壓。
- 2. 本站的文檔不包含任何第三方提供的附件圖紙等,如果需要附件,請(qǐng)聯(lián)系上傳者。文件的所有權(quán)益歸上傳用戶所有。
- 3. 本站RAR壓縮包中若帶圖紙,網(wǎng)頁內(nèi)容里面會(huì)有圖紙預(yù)覽,若沒有圖紙預(yù)覽就沒有圖紙。
- 4. 未經(jīng)權(quán)益所有人同意不得將文件中的內(nèi)容挪作商業(yè)或盈利用途。
- 5. 人人文庫網(wǎng)僅提供信息存儲(chǔ)空間,僅對(duì)用戶上傳內(nèi)容的表現(xiàn)方式做保護(hù)處理,對(duì)用戶上傳分享的文檔內(nèi)容本身不做任何修改或編輯,并不能對(duì)任何下載內(nèi)容負(fù)責(zé)。
- 6. 下載文件中如有侵權(quán)或不適當(dāng)內(nèi)容,請(qǐng)與我們聯(lián)系,我們立即糾正。
- 7. 本站不保證下載資源的準(zhǔn)確性、安全性和完整性, 同時(shí)也不承擔(dān)用戶因使用這些下載資源對(duì)自己和他人造成任何形式的傷害或損失。
最新文檔
- 2024年度文化傳媒內(nèi)容制作合同
- 2024年大型活動(dòng)保障車輛租賃合同
- 2024年上海房屋裝修工程分包合同
- 2024年廉潔承諾函:雙方誠信自律協(xié)議
- 教育工作者主要先進(jìn)事跡(5篇)
- 中學(xué)生讀書演講稿
- 2024年度質(zhì)量控制合同:MLB棒球帽正品知識(shí)分享
- 2024年工程監(jiān)測(cè)與檢測(cè)合同
- 2024室內(nèi)外演唱會(huì)舞臺(tái)安全檢測(cè)合同
- 2024年國際商貿(mào)合同的科學(xué)與藝術(shù)
- 部編版七年級(jí)上冊(cè)語文閱讀高頻考點(diǎn)解析與突破課件
- 《初中英語寫作》課件
- DB37-T 5202-2021 建筑與市政工程基坑支護(hù)綠色技術(shù)標(biāo)準(zhǔn)
- 牙科手機(jī)的清洗消毒、滅菌及保養(yǎng)課件
- 人音版二年級(jí)下冊(cè)音樂《小蜜蜂》課件
- 打印版醫(yī)師執(zhí)業(yè)注冊(cè)健康體檢表(新版)
- 湘教版八年級(jí)美術(shù)上冊(cè)工作計(jì)劃
- 高滲性非酮癥糖尿病昏迷培訓(xùn)課件
- 國開成本會(huì)計(jì)第15章綜合練習(xí)試題及答案
- 2022年陜西投資集團(tuán)有限公司招聘筆試題庫及答案解析
- 醫(yī)院產(chǎn)后出血的應(yīng)急演練腳本
評(píng)論
0/150
提交評(píng)論