High Performance Information Retrieval and MEMS CADon_第1頁
High Performance Information Retrieval and MEMS CADon_第2頁
High Performance Information Retrieval and MEMS CADon_第3頁
High Performance Information Retrieval and MEMS CADon_第4頁
High Performance Information Retrieval and MEMS CADon_第5頁
已閱讀5頁,還剩71頁未讀, 繼續(xù)免費(fèi)閱讀

下載本文檔

版權(quán)說明:本文檔由用戶提供并上傳,收益歸屬內(nèi)容提供方,若內(nèi)容存在侵權(quán),請(qǐng)進(jìn)行舉報(bào)或認(rèn)領(lǐng)

文檔簡介

1、High Performance Information Retrieval and MEMS CADon Intel ItaniumPerformance Tuning ParticipantsResearchersMark Adams (SNL), David Bailey (LBL), Parry Husbands (LBL), Xiaoye Li (LBL), Lenny Oliker (LBL)PhD StudentsYozo Hida, Geoff Pike, Rich VuducUndergradsBrian Gaeke , Jen Hsu, Shoaib Kamil, Suh

2、Kang, Hyun Kim, Gina Lee, Jaeseop Lee, Michael de Lorimier, Jin Moon, Randy Shoopman, Brandon Thompson OutlineMotivation, History, Related workTuning Sparse Matrix OperationsResults on Sun Ultra 1/170Preliminary results on P4Preliminary results on ItaniumApplications SUGAR a MEMS CAD systemInformati

3、on RetrievalFuture Work(Whats up with Millennium)(Background on Test Matrices)MotivationHistoryRelated Work Performance Tuning Motivation: performance of many applications dominated by a few kernelsMEMS CAD Nonlinear ODEs Nonlinear equations Linear equations Matrix multiplyMatrix-by-matrix or matrix

4、-by-vectorDense or SparseInformation retrieval by LSI Compress term-document matrix Sparse mat-vec multiplyInformation retrieval by LDA Maximum likelihood estimation Solve linear systemsMany other examples (not all linear algebra)Conventional Performance Tuning Vendor or user hand tunes kernels Draw

5、backs:Very time consuming and tedious workEven with intimate knowledge of architecture and compiler, performance hard to predictGrowing list of kernels to tuneExample: New BLAS StandardMust be redone for every architecture, compilerCompiler technology often lags architectureNot just a compiler probl

6、em: Best algorithm may depend on input, so some tuning must occur at run-time.Not all algorithms semantically or mathematically equivalentAutomatic Performance TuningApproach: for each kernelIdentify and generate a space of algorithmsSearch for the fastest one, by running themWhat is a space of algo

7、rithms?Depending on kernel and input, may varyinstruction mix and ordermemory access patternsdata structures mathematical formulation When do we search?Once per kernel and architecture At compile timeAt run timeAll of the aboveTuning pays off PHIPAC (Bilmes, Asanovic, Vuduc, Demmel)Tuning pays off A

8、TLAS (Dongarra, Whaley)Extends applicability of PHIPACIncorporated in Matlab (with rest of LAPACK)Other Automatic Tuning ProjectsFFTs and Signal ProcessingFFTW ()Given dimension n of FFT, choose best implementation at runtime by assembling prebuilt kernels for small factors of nWidely used, won 1999

9、 Wilkinson Prize for Numerical SoftwareSPIRAL (/spiral)Extensions to other transforms, DSPsUHFFT Extensions to higher dimension, parallelismSpecial session at ICCS 2001Organized by Yelick and Demmelwww.ucalgary.ca/iccsProceedings availablePointers to automatic tuning projects at /yelick/iccs-tuneSea

10、rch for optimal register tile sizes on Sun Ultra 1016 registers, but 2-by-3 tile size fastestSearch for Optimal L0 block size in dense matmul60% of peak on Pentium II-3004% of versions exceedHigh precision dense mat-vec multiply (XBLAS)High Precision Algorithms (XBLAS)Double-double (High precision w

11、ord represented as pair of doubles)Many variations on these algorithms; we currently use BaileysExploiting Extra-wide RegistersSuppose s(1) , , s(n) have f-bit fractions, SUM has Ff bit fractionConsider following algorithm for S = Si=1,n s(i)Sort so that |s(1)| |s(2)| |s(n)|SUM = 0, for i = 1 to n S

12、UM = SUM + s(i), end for, sum = SUMTheorem (D., Hida)If n 2F-f + 1, then sum agrees with S to about 1.5 ulpsIf n = 2F-f + 2, then error can be at least as large as 22f-F 1 ulpsIf n 2F-f + 3, then error can be arbitrary (S 0 but sum = 0 )Exampless(i) double (f=53), SUM double extended (F=64) accurate

13、 if n 211 + 1 = 2049Dot product of single precision x(i) and y(i) s(i) = x(i)*y(i) (f=2*24=48), SUM double extended (F=64) accurate if n 216 + 1 = 65537Tuning Sparse Matrix ComputationsTuning Sparse matrix-vector multiply SparsityOptimizes y = A*x for a particular sparse AIm and YelickAlgorithm spac

14、eDifferent code organization, instruction mixesDifferent register blockings (change data structure and fill of A)Different cache blockingDifferent number of columns of xDifferent matrix orderingsSoftware and papers available/yelick/sparsityHow Sparsity tunes y = A*xRegister Blocking Store matrix as

15、dense r x c blocksPrecompute performance in Mflops of dense A*x for various register block sizes r x cGiven A, sample it to estimate Fill if A blocked for varying r x cChoose r x c to minimize estimated running time Fill/MflopsStore explicit zeros in dense r x c blocks, unrollCache Blocking Useful w

16、hen source vector x enormousStore matrix as sparse 2k x 2l blocksSearch over 2k x 2l cache blocks to find fastest Register-Blocked Performance of SPMV on Dense Matrices (up to 12x12)333 MHz Sun Ultra IIi800 MHz Pentium III1.5 GHz Pentium 4800 MHz Itanium70 Mflops35 Mflops425 Mflops310 Mflops175 Mflo

17、ps105 Mflops250 Mflops110 MflopsWhich other sparse operations can we tune?General matrix-vector multiply A*xPossibly many vectors xSymmetric matrix-vector multiply A*xSolve a triangular system of equations T-1*xy = AT*A*xKernel of Information Retrieval via LSI (SVD)Same number of memory references a

18、s A*xy = Si (A(i,:)T * (A(i,:) * x)Future workA2*x, Ak*xKernel of Information Retrieval used by GoogleIncludes Jacobi, SOR, Changes calling algorithmAT*M*AMatrix triple productUsed in multigrid solverTest MatricesGeneral Sparse MatricesUp to n=76K, nnz = 3.21MFrom many application areas1 Dense2 to 1

19、7 - FEM18 to 39 - assorted41 to 44 linear programming45 - LSISymmetric MatricesSubset of General Matrices1 Dense2 to 8 - FEM9 to 13 - assortedLower Triangular MatricesObtained by running SuperLU on subset of General Sparse Matrices1 Dense2 13 FEMDetails on test matrices at end of talkResults on Sun

20、Ultra 1/170Speedups on SPMV from Sparsity on Sun Ultra 1/170 1 RHSSpeedups on SPMV from Sparsity on Sun Ultra 1/170 9 RHSPreliminary Results on P4 using icc and gccSpeedup of SPMV from Sparsity on P4/icc-5.0.1Performance of SPMV from Sparsity on P4/icc-5.0.1Fill for SPMV from Sparsity on P4/icc-5.0.

21、1Possible ImprovementsDoesnt work as well as on Sun Ultra 1/170; Why?Current heuristic to determine best r x c block biased to diagonal of performance plotDidnt matter on Sun, does on P4 and Itanium since performance so “nondiagonally dominant”Sparsity reg blocking results on P4 for FEM/fluids matri

22、cesMatrix #2 (150 Mflops to 400 Mflops)Matrix #5 (50 Mflops to 350 Mflops)Sparsity cache blocking results on P4 for LSISymmetric Sparse Matrix-Vector Multiply on P4 (vs nave full = 1)Sparse Triangular Solve (Matlabs colmmd ordering) on P4 AT*A on P4 (Accesses A only once)Preliminary Results on Itani

23、um using eccSpeedup of SPMV from Sparsity on Itanium/ecc-5.0.1Raw Performance of SPMV from Sparsity on Itanium Fill for SPMV from Sparsity on Itanium Possible ImprovementsCurrent heuristic to determine best r x c block biased to diagonal of performance plotDidnt matter on Sun, does on P4 and Itanium

24、 since performance so “nondiagonally dominant”Matrix 8: Chose 2x2 (164 Mflops) Better: 3x1 (196 Mflops)Matrix 9:Chose 2x2 (164 Mflops) Better: 3x1 (213 Mflops)Application of Performance TuningSUGAR A CAD Tool for MEMSApplications to SUGAR a tool for MEMS CADDemmel, Bai, Pister, Govindjee, Agogino, G

25、u, Input: description of MicroElectroMechanical System (as netlist)Output:DC, steady state, modal, transient analyses to assess behaviorCIF for fabrication Simulation capabilitiesBeams and plates (linear, nonlinear, prestressed,)Electrostatic forces, circuitsThermal expansion, Couette damping Availa

26、bilityMatlabPublicly /cfm249 registered users, many unregisteredWeb service M & MEMSRuns on Millennium Now in use in EE 245 at UCB96 usersLots of new features being added, including interface to measurementsMicromirror (Last, Pister)Laterally ac

27、tuated torsionally suspended micromirrorOver 10K dof, 100 line netlist (using subnets)DC and frequency analysisAll algorithms reduce to previous kernelsApplication of Performance TuningInformation RetrievalInformation RetrievalJordanCollaboration with Intel team building probabilistic graphical mode

28、lsBetter alternatives to LSI for document modeling and searchLatent Dirichlet Allocation (LDA)Model documents as union of themes, each with own word distributionMaximum likelihood fit to find themes in set of documents, classify themComputational bottleneck is solution of enormous linear systemsOne

29、of largest Millennium usersIdentifying influential documentsGiven hyperlink patterns of documents, which are most influential?Basis of Google (eigenvector of link matrix sparse matrix vector multiply)Applying Markov chain and perturbation theory to assess reliabilityKernel ICAEstimate set of sources

30、 s and mixing matrix A from samples x = A*sNew way to sample such that sources are as independent as possibleAgain reduces to linear algebra kernelsMore on Kernel ICAAlgorithm 1nonlinear eigenvalue problem, reduces to a sequence of manyvery large generalized spd eigenproblems A l B Block structured,

31、 A dense, B block diagonalOnly smallest nonzero eigenvalue neededSparse eigensolver (currently ARPACK/eigs)Use Incomplete Cholesky (IC) to get low rank approximzation to dense subblocks comprising A and BUse Complete (=Diagonal) Pivoting but take only 20 n stepsCost is O(n)Evaluating matrix entries

32、(exponentials) could be bottleneckNeed fast, low precision exponentialAlgorithm 2Like Algorithm 1, but find all eigenvalues/vectors of A l B Use Holy Grail Future WorkExploit Itanium Architecture128 (82-bit) floating point registers9 HW formats: 24/8(v), 24/15, 24/17, 53/11, 53/15, 53/17, 64/15, 64/

33、17Many few load/store instructionsfused multiply-add instructionpredicated instructionsrotating registers for software pipeliningprefetch instructionsthree levels of cacheTune current and wider set of kernelsImprove heuristics, eg choice of r x cIncorporate intoSUGARInformation RetrievalFurther auto

34、mate performance tuningGeneration of algorithm space generatorsPossible collaborations with IntelGetting right toolsGetting faster, less accurate transcendental functionsProvide feedback on toolsProvide tuned kernels, benchmarks, IR appsProvide system for tuning future kernelsTo provide usersTo eval

35、uate architectural designsMillenniumMillenniumCluster of clusters at UC Berkeley309 CPU cluster in Soda HallSmaller clusters across campusMade possible by Intel equipment grantSignificant other supportNSF, Sun, Microsoft, Nortel, campusMillennium TopologyMillennium Usage Oct 1 11, 2001100% utilizati

36、on for last few daysAbout half the jobs are parallelUsage highlightsAMANDAAntarctic Muon And Neutrino Detector A128 scientists from 15 universities and institutes in the U.S. and Europe.TEMPESTEUV lithography simulations via 3D electromagnetic scatteringcuervo./Volcano/study t

37、he defect printability on multilayer masks TitaniumHigh performance Java dialect for scientific computing/projects/titaniumImplementation of shared address space, and use of SSE2Digital Library ProjectLarge database of imageselib./Used to run spectral image segmentation algorithm for clustering, sea

38、rch on imagesUsage highlights (continued)CS 267Graduate class in parallel computing, 33 enrolled/dbindel/cs267taHomeworkDisaster ResponseHelp find people after Sept 11, set up immediately 48K reports in database, linked to other survivor databasesMEMS CAD (MicroElectroM

39、echanical Systems Computer Aided Design)Tool to help design MEMS systemsUsed this semester in EE 245, 93 More later in talkInformation RetrievalDevelopment of faster information retrieval algorithms/jordanMore later in talkMany applications are part of CITRISBackground on

40、 Test MatricesSparse Matrix Benchmark Suite (1/3)#Matrix NameProblem DomainDimensionNo. Non-zeros1denseDense matrix1,0001.00 M2raefsky3Fluid structure interaction21,2001.49 M3inaccuraAccuracy problem16,1461.02 M4bcsstk35*Stiff matrix automobile frame30,2371.45 M5venkat01Flow simulation62,4241.72 M6c

41、rystk02*FEM crystal free-vibration13,965969 k7crystk03*FEM crystal free-vibration24,6961.75 M8nasasrb*Shuttle rocket booster54,8702.68 M93dtube*3-D pressure tube45,3303.21 M10ct20stif*CT20 engine block52,3292.70 M11baiAirfoil eigenvalue calculation23,560484 k12raefsky4Buckling problem19,7791.33 M13e

42、x113-D steady flow problem16,2141.10 M14rdist1Chemical process simulation4,13494.4 k15vavasis32-D PDE problem41,0921.68 MNote: * indicates a symmetric matrix.Sparse Matrix Benchmark Suite (2/3)#Matrix NameProblem DomainDimensionNo. Non-zeros16orani678Economic modeling2,52990.2 k17rimFEM fluid mechan

43、ics problem22,5601.01 M18memplusCircuit simulation17,758126 k19gemat11Power flow4,92933.1 k20lhr10Chemical process simulation10,672233 k21goodwin*Fluid mechanics problem7,320325 k22bayer02Chemical process simulation13,93563.7 k23bayer10Chemical process simulation13,43694.9 k24coater2Simulation of co

44、ating flows9,540207 k25finan512*Financial portfolio optimization74,752597 k26onetone2Harmonic balance method36,057228 k27pwt*Structural engineering36,519326 k28vibrobox*Vibroacoustics12,328343 k29wang4Semiconductor device simulation26,068177 k30lnsp3937Fluid flow modeling3,93725.4 kSparse Matrix Ben

45、chmark Suite (3/3)#Matrix NameProblem DomainDimensionsNo. Non-zeros31lns3937Fluid flow modeling3,93725.4 k32sherman5Oil reservoir modeling3,31220.8 k33sherman3Oil reservoir modeling5,00520.0 k34orsreg1Oil reservoir modeling2,20514.1 k35saylr4Oil reservoir modeling3,56422.3 k36shyy161Viscous flow cal

46、culation76,480330 k37wang3Semiconductor device simulation26,064177 k38mcfeAstrophysics76524.4 k39jpwh991Circuit physics problem9916,02740gupta1*Linear programming31,8022.16 M41lpcrebLinear programming9,648 x 77,137261 k42lpcredLinear programming8,926 x 73,948247 k43lpfit2pLinear programming3,000 x 1

47、3,52550.3 k44lpnug20Linear programming15,240 x 72,600305 k45lsiLatent semantic indexing10 k x 255 k3.7 MMatrix #2 raefsky3 (FEM/Fluids)Matrix #2 (contd) raefsky3 (FEM/Fluids)Matrix #22 - bayer02 (chemical process simulation)Matrix #22 (contd)- bayer02 (chemical process simulation)Matrix #27 - pwt (structural engineering)Matrix #27 (contd)- pwt (structura

溫馨提示

  • 1. 本站所有資源如無特殊說明,都需要本地電腦安裝OFFICE2007和PDF閱讀器。圖紙軟件為CAD,CAXA,PROE,UG,SolidWorks等.壓縮文件請(qǐng)下載最新的WinRAR軟件解壓。
  • 2. 本站的文檔不包含任何第三方提供的附件圖紙等,如果需要附件,請(qǐng)聯(lián)系上傳者。文件的所有權(quán)益歸上傳用戶所有。
  • 3. 本站RAR壓縮包中若帶圖紙,網(wǎng)頁內(nèi)容里面會(huì)有圖紙預(yù)覽,若沒有圖紙預(yù)覽就沒有圖紙。
  • 4. 未經(jīng)權(quán)益所有人同意不得將文件中的內(nèi)容挪作商業(yè)或盈利用途。
  • 5. 人人文庫網(wǎng)僅提供信息存儲(chǔ)空間,僅對(duì)用戶上傳內(nèi)容的表現(xiàn)方式做保護(hù)處理,對(duì)用戶上傳分享的文檔內(nèi)容本身不做任何修改或編輯,并不能對(duì)任何下載內(nèi)容負(fù)責(zé)。
  • 6. 下載文件中如有侵權(quán)或不適當(dāng)內(nèi)容,請(qǐng)與我們聯(lián)系,我們立即糾正。
  • 7. 本站不保證下載資源的準(zhǔn)確性、安全性和完整性, 同時(shí)也不承擔(dān)用戶因使用這些下載資源對(duì)自己和他人造成任何形式的傷害或損失。

評(píng)論

0/150

提交評(píng)論