容錯(cuò)計(jì)算英文版課件:1 Background and Motivation_第1頁
容錯(cuò)計(jì)算英文版課件:1 Background and Motivation_第2頁
容錯(cuò)計(jì)算英文版課件:1 Background and Motivation_第3頁
容錯(cuò)計(jì)算英文版課件:1 Background and Motivation_第4頁
容錯(cuò)計(jì)算英文版課件:1 Background and Motivation_第5頁
已閱讀5頁,還剩121頁未讀, 繼續(xù)免費(fèi)閱讀

下載本文檔

版權(quán)說明:本文檔由用戶提供并上傳,收益歸屬內(nèi)容提供方,若內(nèi)容存在侵權(quán),請進(jìn)行舉報(bào)或認(rèn)領(lǐng)

文檔簡介

1、Sep. 2013Part I Introduction: Dependable SystemsSlide 1Sep. 2013Part I Introduction: Dependable SystemsSlide 2About This PresentationThis presentation is intended to support the use of the textbook Dependable Computing: A Multilevel Approach (traditional print or on-line open publication, TBD). It i

2、s updated regularly by the author as part of his teaching of the graduate course ECE 257A, Fault-Tolerant Computing, at Univ. of California, Santa Barbara. Instructors can use these slides freely in classroom teaching or for other educational purposes. Unauthorized uses, including distribution for p

3、rofit, are strictly prohibited. Behrooz ParhamiEditionReleasedRevisedRevisedRevisedRevisedFirstSep. 2006Oct. 2007Oct. 2009Oct. 2012Sep. 2013Sep. 2013Part I Introduction: Dependable SystemsSlide 3ECE 257A: Fault-Tolerant ComputingCourse IntroductionSep. 2013Part I Introduction: Dependable SystemsSlid

4、e 4Sep. 2013Part I Introduction: Dependable SystemsSlide 5Course Resources and Requirements: Fall 2013Grading: Four homework assignments, worth 20%, posted on Web site, each due in 9-14 days (no time extension is possible, so start work early)Open-book/notes midterm, worth 30%, Chaps. 1-16, W 11/06,

5、 10:00-11:45Research paper and presentation, worth 50%, due R 12/12 (see Web site)Grades due at the Registrars office by midnight on W 12/18Sources; Text /parhami/text_dep_comp.htm(books, journals, conference proceedings, and electronic resources at the library are listed on the course syllabus and

6、Web site)Course Web site: /parhami/ece_257a.htm(PowerPoint & pdf presentations, links to websites, occasional handouts)Meetings: MW 10:00-11:30, Phelps 1431Open office hours: MW 3:30-5:00 PM, Room 5155 HFHInstructor: Behrooz Parhami, 5155 HFH, x3211, parhami at Prerequisite: Computer architecture, a

7、t the level of ECE 154Sep. 2013Part I Introduction: Dependable SystemsSlide 6How the Cover Image Relates to Our CourseDependability as weakest-link attribute: Under stress, the weakest link will break, even if all other links are superstrongSafety factor (use of redundancy): Provide more resources t

8、han needed for the minimum acceptable functionalityAdditional resources not helpful if: - failures are not independent- Some critical component fails- Improve the least reliable part firstSep. 2013Part I Introduction: Dependable SystemsSlide 7Course Lecture Schedule: Fall 2013Day/DateM 09/30W 10/02M

9、 10/07W 10/09M 10/14W 10/16M 10/21W 10/23M 10/28W 10/30M 11/04W 11/06 W 11/11 . . .W 12/04R 12/12Lecture topicGoals, pretest (cancelled)Background, motivationDependability attributesCombinational modelingState-space modelingDefect avoidance; shieldingDefect circumvention; yield Fault testing; testab

10、ilityFault masking; voting/replicatnError detection; self-checkingError correction; disk arraysMidterm exam, 10:00-11:45No lecture, Veterans Day . . .Research poster presentationsFinal papers due by midnightChap012345, 76, 89, 1110, 1213, 1514, 161-16 . . .NotesSlides with Part IRev. probabilityDefe

11、ct-level viewFault-level viewError-level viewOpen book/notesHigh-level viewsDeadlinesHW1 1-4HW1 dueHW2 5-12Research topicHW2 dueHW3 13-20 . . .Poster PDFsPaper PDFsSep. 2013Part I Introduction: Dependable SystemsSlide 8About the Name of This CourseFault-tolerant computing: a discipline that began in

12、 the late 1960s 1st Fault-Tolerant Computing Symposium (FTCS) was held in 1971In the early 1980s, the name “dependable computing” was proposed for the field to account for the fact that tolerating faults is but one approach to ensuring reliable computation. The terms “fault tolerance” and “fault-tol

13、erant” were so firmly established, however, that people started to use “dependable and fault-tolerant computing.”In 2000, the premier conference of the field was merged with another and renamed “Intl Conf. on Dependable Systems and Networks” (DSN)In 2004, IEEE began the publication of IEEE Trans. On

14、 Dependable and Secure Systems (inclusion of the term “secure” is for emphasis, because security was already accepted as an aspect of dependability)Sep. 2013Part I Introduction: Dependable SystemsSlide 9Why This Course Shouldnt Be NeededIn an ideal world, methods for dealing with faults, errors, and

15、 other impairments in hardware and software would be covered within every computer engineering course that has a design componentAnalogy: We do not teach structural engineers about building bridges in one course and about bridge safety and structural integrity during high winds or earthquakes in ano

16、ther (optional) courseLogic Design: fault testing, self-checkingParallel Comp.: reliable commun., reconfigurationProgramming: bounds checking,checkpointingFault-Tolerant ComputingSep. 2013Part I Introduction: Dependable SystemsSlide 10Brief History of Dependable Computing1970s:The field developed qu

17、ickly (international conference,many research projects and groups, experimental systems)1980s:The field matured (textbooks, theoretical developments, use of ECCs in solid-state memories, RAID concept), but also suffered some loss of focus and interest because of the extreme reliability of integrated

18、 circuits2000s:Resurgence of interest owing to less reliable fabrication at ultrahigh densities and “crummy” nanoelectronic components1960s:NASA and military agencies supported research for long-life space missions and battlefield computing1950s:Early ideas by von Neumann (multichannel, with voting)

19、 and Moore-Shannon (“crummy” relays)1990s:Increased complexity at chip and system levels made verification, testing, and testability prime study topics1940s:ENIAC, with 17.5K vacuum tubes and 1000s of other electrical elements, failed once every 2 days (avg. down time = minutes)Sep. 2013Part I Intro

20、duction: Dependable SystemsSlide 11Dependable Computing in the 2010sThere are still ambitious projects; space and elsewhere Harsh environments (vibration, pressure, temperatures) External influences (radiation, micrometeoroids) Need for autonomy (commun. delays, unmanned probes)The need is expanding

21、 More complex systems (e.g., system-on-chip) Critical applications (medicine, transportation, finance) Expanding pool of unsophisticated users Continued rise in maintenance costs Digital-only data (needs more rigorous backup)The emphasis is shifting Mostly COTS-based solutions Integrated hardware/so

22、ftware systems Entire units replaced (system-level diagnosis)Sep. 2013Part I Introduction: Dependable SystemsSlide 12Pretest: Failures and ProbabilitiesThis test will not be graded or even collected, so answer the test questions truthfully and to the best of your ability / knowledgeQuestion 1: Name

23、a disaster that was caused by computer hardware or software failure. How do you define “disaster” and “failure”?Question 4: In a game show, there is a prize behind one of 3 doors with equal probabilities. You pick Door A. The host opens Door B to reveal that there is no prize behind it. The host the

24、n gives you a chance to switch to Door C. Is it better to switch or to stick to your choice?ABCQuestion 3: Which do you think is more likely: the event that everyone in this class was born in the first half of the year or the event that at least two people were born on the same day of the year?Quest

25、ion 2: Which of these patterns is more random?Sep. 2013Part I Introduction: Dependable SystemsSlide 13Pretest (Continued): Causes of MishapsQuestion 5: Does this photo depict a mishap due to design flaw, implementation bug, procedural inadequacies, or human error?Sep. 2013Part I Introduction: Depend

26、able SystemsSlide 14Pretest (Continued): Reliability and RiskQuestion 7: Which is more reliable: Plane X or Plane Y that carries four times as many passengers as Plane X and is twice as likely to crash?Question 9: Which surgeon would you prefer for an operation that you must undergo: Surgeon A, who

27、has performed some 500 operations of the same type, with 5 of his patients perishing during or immediately after surgery, or Surgeon B, who has a perfect record in 25 operations?Question 8: Which is more reliable: a 4-wheel vehicle with one spare tire or an 18-wheeler with 2 spare tires?Question 10:

28、 Which is more probable at your home or office: a power failure or an Internet outage? Which is likely to last longer?Question 6: Name an emergency backup system (something not normally used unless another system fails) that is quite commonplaceIf you had trouble with 3 or more questions, you really

29、 need this course!Sep. 2013Part I Introduction: Dependable SystemsSlide 15August 1, 2007 Interstate 35WBridge 9340 over the Mississippi, in Minneapolis(40-year old bridge was judged structurally deficient in 1990)Sep. 2013Part I Introduction: Dependable SystemsSlide 16History of Bridge 9340 in Minne

30、apolis1967: Opens to traffic1990: Dept. of Transportation classifies bridge as “structurally deficient”1993: Inspection frequency doubled to yearly2001: U. Minn. engineers deem bridge struc. deficient1999: Deck and railings fitted with de-icing system2004-07: Fatigue potential and remedies studiedSu

31、mmer 2007: $2.4M of repairs/maintenance on deck, lights, joints2007: Inspection plan chosen over reinforcementsSep. 18, 2008:Replacement bridge opensAug. 1, 2007: Collapses at 6:05 PM, killing 7Sep. 2013Part I Introduction: Dependable SystemsSlide 17What Do We Learn from Bridges that Collapse?Openin

32、g day of the Tacoma Narrows Bridge,July 1, 1940Nov. 7, 1940One catastrophic bridge collapse every 30 years or soSee the following amazing video clip (Tacoma Narrows Bridge):http:/www.enm.bris.ac.uk/research/nonlinear/tacoma/tacnarr.mpg “ . . . failures appear to be inevitable in the wake of prolonge

33、d success, which encourages lower margins of safety. Failures in turn lead to greater safety margins and, hence, new periods of success.” Henry Petroski, To Engineer is HumanSep. 2013Part I Introduction: Dependable SystemsSlide 18. . . or from “Unsinkable” Ships that Sink?“The major difference betwe

34、en a thing that might go wrong and a thing that cannot possibly go wrong is that when a thing that cannot possibly go wrong goes wrong, it usually turns out to be impossible to get at or repair.” Douglas Adams, author of The Hitchhikers Guide to the GalaxyTitanic begins its maiden voyage from Queens

35、town,April 11, 1912(1:30 PM)April 15, 1912(2:20 AM)Sep. 2013Part I Introduction: Dependable SystemsSlide 19. . . or from Poorly Designed High-Tech Trains?Train built for demonstrating magnetic levitation technology in northwest Germany rams into maintenance vehicle left on track at 200 km/h, killing

36、 23 of 29 aboardTransrapidmaglev train on its test trackSep. 22, 2006Official investigation blames the accident on human error (train was allowed to depart before a clearance phone call from maintenance crew)Not a good explanation; even low-tech trains have obstacle detection systemsEven if manual p

37、rotocol is fully adequate under normal conditions, any engineering design must take unusual circumstances into account (abuse, sabotage, terrorism)Sep. 2013Part I Introduction: Dependable SystemsSlide 20Design Flaws in Computer SystemsHardware example: Intel Pentium processor, 1994For certain operan

38、ds, the FDIV instruction yielded a wrong quotientAmply documented and reasons well-known (overzealous optimization)Software example: Patriot missile guidance, 1991Missed intercepting a scud missile in 1st Gulf War, causing 28 deathsClock reading multiplied by 24-bit representation of 1/10 s (unit of

39、 time)caused an error of about 0.0001%; normally, this would cancel out in relative time calculations, but owing to ad hoc updates to some (not all) calls to a routine, calculated time was off by 0.34 s (over 100 hours), during which time a scud missile travels more than 0.5 kmUser interface example

40、: Therac 25 machine, mid 1980s1Serious burns and some deaths due to overdose in radiation therapyOperator entered “x” (for x-ray), realized error, corrected by entering “e” (for low-power electron beam) before activating the machine; activation was so quick that software had not yet processed the ov

41、erride1 Accounts of the reasons varySep. 2013Part I Introduction: Dependable SystemsSlide 21Causes of Human Errors in Computer Systems1. Personal factors (35%): Lack of skill, lack of interest or motivation, fatigue, poor memory, age or disability2. System design (20%): Insufficient time for reactio

42、n, tedium, lack of incentive for accuracy, inconsistent requirements or formats3. Written instructions (10%): Hard to understand, incomplete or inaccurate, not up to date, poorly organized4. Training (10%): Insufficient, not customized to needs, not up to date5. Human-computer interface (10%): Poor

43、display quality, fonts used, need to remember long codes, ergonomic factors6. Accuracy requirements (10%): Too much expected of operator7. Environment (5%): Lighting, temperature, humidity, noiseBecause “the interface is the system” (according to a popular saying), items 2, 5, and 6 (40%) could be c

44、ategorized under user interfaceSep. 2013Part I Introduction: Dependable SystemsSlide 22Sep. 2013Part I Introduction: Dependable SystemsSlide 23Properties of a Good User Interface1. Simplicity: Easy to use, clean and unencumbered look2. Design for error: Makes errors easy to prevent, detect, and reve

45、rse; asks for confirmation of critical actions3. Visibility of system state: Lets user know what is happening inside the system from looking at the interface4. Use of familiar language: Uses terms that are known to the user (there may be different classes of users, each with its own vocabulary)5. Mi

46、nimal reliance on human memory: Shows critical info on screen; uses selection from a set of options whenever possible6. Frequent feedback: Messages indicate consequences of actions7. Good error messages: Descriptive, rather than cryptic8. Consistency: Similar/different actions produce similar/differ

47、ent results and are encoded with similar/different colors and shapesSep. 2013Part I Introduction: Dependable SystemsSlide 24Example fromOn August 17, 2006, a class-two incident occurred at the Swedish atomic reactor Forsmark. A short-circuit in the electricity network caused a problem inside the rea

48、ctor and it needed to be shut down immediately, using emergency backup electricity. However, in two of the four generators, which run on AC, the AC/DC converters died. The generators disconnected, leaving the reactor in an unsafe state and the operators unaware of the current state of the system for

49、 approximately 20 minutes. A meltdown, such as the one in Chernobyl, could have occurred.Coincidence of problems in multiple protection levels seems to be a recurring theme in many modern-day mishaps - emergency systems had not been tested with the grid electricity being offForum on Risks to the Pub

50、lic in Computers and Related Systemshttp:/catless.ncl.ac.uk/Risks/(Peter G. Neumann, moderator)Sep. 2013Part I Introduction: Dependable SystemsSlide 25Worst Stock Market Computer FailureFirms and individual investors prevented from buying or selling stocks to minimize their capital gains taxesA spok

51、esman said the problems were “very technical” and involved corrupt dataDelaying end of financial year was considered, but not implemented; eventually, the system became operational at 3:45 PM and trading was allowed to continue until 8:00 PMLondon Stock Exchange confirmed it had a fault in its elect

52、ronic feed that sends the prices to dealers, but it gave no further explanationApril 5, 2000: Computer failure halts the trading for nearly 8 hours at the London Stock Exchange on its busiest day (end of financial year)Sep. 2013Part I Introduction: Dependable SystemsSlide 26Recent News Items inJuly

53、2012: A320 Lost 2 of 3 Hydraulic Systems on Takeoff No loss of life; only passenger discomfort. Full account of incident not yet available, but it shows that redundancy alone is not sufficient protectionMay 2012: Automatic Updates Considered ZombiewareSoftware updates take up much time/space; no one

54、 knows whats in themSeptember 2013: No password Safe from New Cracking SoftwareA new freely available software can crack passwords of up to 55 symbols by guessing a lot of common letter combinationsFebruary 2012: Programming Error Doomed Russian Mars ProbeFails to escape earth orbit due to simultane

55、ous reboot of two subsystemsMarch 2012: Eighteen Companies Sued over Mobile AppsFacebook, Apple, Twitter, and Yelp are among the companies sued over gathering data from the address books of millions of smartphone usersSep. 2013Part I Introduction: Dependable SystemsSlide 27How We Benefit from Failur

56、es“When a complex system succeeds, that success masks its proximity to failure. . . . Thus, the failure of the Titanic contributed much more to the design of safe ocean liners than would have her success. That is the paradox of engineering and design.”Henry Petroski, Success through Failure: The Par

57、adox of Design, Princeton U. Press, 2006, p. 95 194020061912Sep. 2013Part I Introduction: Dependable SystemsSlide 28Take-Home Survey Form: Due Next ClassMain reason for taking this courseList one important fact about yourself that is not evident from your academic record or CVUse the space below or

58、overleaf for any additional comments on your academic goals and/or expectations from this courseFrom the lecture topics on the courses website, pick one topic that you believe to be most interestingPersonal and contact info: Name, Perm#, e-mail address, phone #(s), degrees & institutions, academic l

59、evel, GPA, units completed, advisore.g.: interest, advisors suggestion, have to (not enough grad courses)e.g.: I like to solve mathematical, logical, and word puzzlesSep. 2013Part I Introduction: Dependable SystemsSlide 291 Background and MotivationSep. 2013Part I Introduction: Dependable SystemsSli

60、de 30“I should get this remote control looked at.”Sep. 2013Part I Introduction: Dependable SystemsSlide 31Sep. 2013Part I Introduction: Dependable SystemsSlide 321.1 The Need for DependabilityHardware problemsPermanent incapacitation due to shock, overheating, voltage spikeIntermittent failure due t

溫馨提示

  • 1. 本站所有資源如無特殊說明,都需要本地電腦安裝OFFICE2007和PDF閱讀器。圖紙軟件為CAD,CAXA,PROE,UG,SolidWorks等.壓縮文件請下載最新的WinRAR軟件解壓。
  • 2. 本站的文檔不包含任何第三方提供的附件圖紙等,如果需要附件,請聯(lián)系上傳者。文件的所有權(quán)益歸上傳用戶所有。
  • 3. 本站RAR壓縮包中若帶圖紙,網(wǎng)頁內(nèi)容里面會(huì)有圖紙預(yù)覽,若沒有圖紙預(yù)覽就沒有圖紙。
  • 4. 未經(jīng)權(quán)益所有人同意不得將文件中的內(nèi)容挪作商業(yè)或盈利用途。
  • 5. 人人文庫網(wǎng)僅提供信息存儲(chǔ)空間,僅對用戶上傳內(nèi)容的表現(xiàn)方式做保護(hù)處理,對用戶上傳分享的文檔內(nèi)容本身不做任何修改或編輯,并不能對任何下載內(nèi)容負(fù)責(zé)。
  • 6. 下載文件中如有侵權(quán)或不適當(dāng)內(nèi)容,請與我們聯(lián)系,我們立即糾正。
  • 7. 本站不保證下載資源的準(zhǔn)確性、安全性和完整性, 同時(shí)也不承擔(dān)用戶因使用這些下載資源對自己和他人造成任何形式的傷害或損失。

評論

0/150

提交評論