版權(quán)說(shuō)明:本文檔由用戶(hù)提供并上傳,收益歸屬內(nèi)容提供方,若內(nèi)容存在侵權(quán),請(qǐng)進(jìn)行舉報(bào)或認(rèn)領(lǐng)
文檔簡(jiǎn)介
1
AI-PoweredBugHunting-Evolutionand
benchmarking
AlfredoOrtega-ortegaalfredo@X:@ortegaalfredo
Neuroengine.aiJune27,2024
WhileAIholdspromiseforassistingwithbughunting,itsactualimpactre-mainsunclear.ThispresentationaddressesthisdoubtbyintroducingCrash-Benchmark,astandardizedevaluationframeworkforAI-drivenstaticanaly-sistools.We’llshareresultsfromasimplebug-huntingAIagent,AutoKaker,anddiscusstheimplicationsforoptimizingAI-basedbughuntinginC/C++codebases.
1Introduction
Opiniononautomaticbugfindingiscontroversial.Atthedateofthisarticle’spubli-cation,thereisnoconsensusaboutwhetherthisispossibleornot,ortowhatextent.ThisispartlyduetotherapidadvancementofLLMmodels;upuntilmonthsago,open-sourcemodelswerenotadvancedenoughtobeeffectiveatbugfinding.ThereexistsathresholdinthecomplexityofLLMsbeyondwhichbugfindingbecomespossible,andinthisarticlewebenchmarkvariousmodelsandfoundthatthisthresholdhasbeenreachedforsomevulnerabilities.
2CrashBench
Crashbench
[1]isasimpleautomatictest-casebasedbenchmarktool
.ItconnectstoserveralLLMservicesofferingopensourceandprivateAImodels,sendatestcaseandthenmeasureresults.IftheLLMfoundthebuginthecorrectline,thenscoreisincreasedbyone.
2.1Design
Mostofthetestcasesforthev1versionarebasedonGera’sInsecureProgrammingexercises
[2],plus3realvulnerabilityexamples
.TheLLMisassignedascorebased
2
onthenumberofvulnerabilitiesthatwerereported,withrealvulnerabilitieshaving10timesthescore.
TheconfigurationofCrashbenchisasingle.inifilecontainingtheprompt,testcasefilesandexpectedlineswherethebugisfound.
[SETTINGS]
SystemPrompt="Youareanexpertsecurityresearcher,programmerandbugfinder."
Prompt="Checkthiscodeforanyout-of-boundsor
integer-overflowvulnerability,explainitandpointatthelinewiththeproblem,
andnothingmore,inthisway:\n’Bugline=X’whereXisthelinenumberofthebug,
andthenprintthatlinenumber.
Ifthecodehasnobugs,thenprint’Bugline=0’."
[Basic]
file1=stack1.c,6file2=stack2.c,6file3=stack3.c,6file4=stack4.c,6file5=stack5.c,6
[ABOs]
file1=abo1.c,4
...
Inthisway,thetestnotonlytestsbugfindingcapabilities,butalsoaccuracyinreporting.Manymodelsaregoodatfindingvulnerabilities,buttheyfailataccuratelypointingexactlywherethebugislocatedinthecode.Tocreatenegativetests(testswherenovulnerabilityshouldbedetected),justsettheexpectedvulnlinenumbertozero.
2.2Parameters
Softwareusedwasvllmv0.5.0.post1[3]forAWQquantizationandaphrodite-engine
v0.5.3
[4]forEXL2quantization
.Parametersusedforinferenceusingvllmwere:
?temperature:1.2
?topp=1.0
?frequencypenalty=0.6
?presencepenalty=0.8
3
2.3Results
Thebenchmarkranagainst16LLMs,mostofthembeingthelatestversions,butalso
someoldermodelsbasedonLlama-2tocomparethem.Additionally,severalquanti-zationsofthesamemodelweretestedtomeasuretheeffectofquantizationonLLMbug-reportingaccuracy.
Figure1:Crashbenchscore
AsshowninFigure
1,Oldermodelsarenotcompetitiveatcodeunderstandingand
bugfinding,withnewermodelsbeingsignificantlybetter.EvenclosedmodelslikeChatGPTaresurpassedbythesenewermodelsintermsofperformance.Additionally,therelativelysmalleffectofquantizationonresultsisevident,asastrongquantizationofLlama-3-70B(2.25bpw)didnothaveasignificantimpactonthemodel’sscore.
2.4Quantizationeffects
AtFigure
2,wenowfocusontheeffectsofquantizationonthescore
.Quantizationisatechniquethatcompressesmodelsbyrepresentingweightsusingfewerbits,losingsomequalitybutreducingtheamountofmemoryneeded.Thisresultsinincreasedspeedandefficiency.SincecurrentGPUsaremostlylimitedbymemorybandwidth,theefficiencyofinferencedecreasesnearlylinearlywithsize.
Wesetthey-axisto0sothatitcanbemoreeasilyseenhowlowaneffectquantizationhadonthescores.Wecanalsoseetherapidincreaseinsizewiththeincreaseofbitsperword,withoutanycorrespondingincreaseinscore.
4
Figure2:Quantizationeffectsonscore.Model:Meta-LLama-3-70B-Instruct.
WecanplotasecondgraphatFig.
3,showingefficiencyofthedifferentmodels,
meaningthescorepersizeinGigabytes.Withdecreasedsize,speedandpowerrequiredforinferencealsodecreaseslinearly,increasingefficiencyofoperation.
WecanseehowthecurrentmostefficientmodelsarehighlyquantizedversionsofLlama-370B.Ataround25GB,thosemodelsarestilloutofreachformostpersonalhomecomputers.ThebestnextoptionwouldbetouseahighlyquantizedversionofMistral-8x7B,whichcanrunonCPUonmostmoderncomputersatanacceptablespeed.
2.5CrashbenchvsLMSysELO
TheLMsysleaderboard
[5]hasbecometheindustrystandardformodelbenchmarking
.Wecancomparehowourbug-findingbenchmarkcorrelateswiththeoverallmodelscore.
IntuitivelywewouldassumethatoverallELOandcrashbenchscoresshouldbesome-whatrelated.Butin
4
wecanseesomeinconsistencies,especiallywithmodernOpenAImodels.ThesemodelshavemuchbetterELOscoresthanCrashbenchscores.Thismeansthatthesemodelsaremuchbetterasgenericassistantsandcodegenerationthanatbugfinding.Wesuspectthatsuper-alignmentmightcausethesemodelstorefusetoshowbugs,asananalysisofgpt-4andgpt-4oshowsthattheydonotshowmanywrongbugsorlinesonthetest-cases;instead,theirlowscoresaremostlyduetodenyingthatthereisabugatall.Lowscoresmightalsoindicateproblemsonthebenchmark,aswediscussinthefollowingsection.
5
Figure3:Totalmodelefficiency.ThisgraphicshowshowmanypointsthemodelhaveforeveryGBinsize.
2.6Problems
Problemsthatmayaffectthisbenchmarkaccuracyare:
Incorrectparametersand/orpromptformat:Instructmodelshaveaspecificformat
thatmustbeusedonthepromptstomaximizetheirunderstandingoftherequests.ManyLLMsarequiteflexibleonthisformat,whilesomearenot.It’simportanttorespectthepromptformatofeachLLMtomaximizetheircode-understandingcapacity.
Modeltrainedonthesolutionsofthebenchmark:Asmostmodelsaretrainedonter-abytesoftokens,itisverylikelythatthetestcases,bothartificialandreal,werepartoftheirtraining,alongwiththesolutions.Thismightintroduceabiaswheremodelsareverygoodatpassingthebenchmark,butnotsogoodinreal-worldapplications.ThesolutiontothisproblemistocreatemoreunpublishedtestcasesthattheLLMdidn’tseeduringtraining.However,thisisashort-livedsolutionasit’sverylikelythatnewerversionsoftheLLMswillcontainthesenewtestcases,sotheymustbediscardedineverynewversionofthebenchmark.
Bugsoninferencesoftware/quantizationquality:Inferencesoftwareisevolvingrapidly,anditcontainsbugsthataffectqualityandreasoning.Asolutiontothisproblemforbenchmarkingistoalwaysusethesameinferencesoftware.Inourcase,weuseeithervLLMorAphroditeengine,whichinternallyusesvLLM.
6
Figure4:CrashbenchscorevsOverallmodelELOscore.Wecanseeageneralcorrelationexceptonclosedmodels.
Refusalsduetoalignmnet:Somemodelsrefusetodiscoverbugsbecausetheyreasonthattheycanbeusedformaliciouspurposes.Thiscanbebypassedwithseveraltechniquessuchaspromptjailbreakingorabliteration,butbothtechniquesmightaffectthecode-understandingcapacityofthemodel.However,theabliteratedversionofLlama-3-70Bwascomparedagainsttheoriginalversionandshowedaminimaleffectontheresults.
3AutoKaker:Automaticvulnerabilitydiscovery
Usingthesametechniqueofthebenchmarkwecaneasilyconstructatool[6]thatprocess
sourcecodeandannotateseveryvulnerabilityfound.Thealgoritmdescribedinfig
5
issimple:
1.Separatesourcecodeintoindividualchunksthatcontainoneormorefunctions
2.AssembleapromptaskingtheLLMtoanalyzethecode
3.Annotatetheresults
Thistool(seefig
6)canbelaunchedoncompletecodebasesandwillannotateevery
functionwithpossiblevulnerabilities,readyfortriageandexploitationbyahumanoperator.Unlikeotherapproaches,thistooldoesnotattempttoverifyorexploitthe
7
Figure5:Autokakermainloop
vulnerabilitiesfound,asthisisamuchmorecomplextask.Weproposeinthenextsectionthatitisunnecessary.ThetoolcurrentlysupportsonlyCcode,butthisisalimitationofthecurrentcodeparserduetoitsinabilitytoseparatefunctions.ThetoolcanrunonC++/Rustcodewithamodifiedcodeparser.
3.1ProblemswithautomatedAIexploitation
Wecanseeasimplifieddiagramofthestagesofvulnerabilitydiscoveryat
7.
Oncewefoundapossiblevulnerability,wehavetwopaths:Eitherconfirmitviaexploitation,orfixitviaapatch.Wecandotwoimportantobservations:
?Isnotnecessarytoconfirmapossiblevulnerabilitytopatchit.Thisfollowthephilosophyofdefensiveprogramming.
?Patchingavulnerabilityrequiresmuchlessskillsthanexploitingit,orevenfindingit.
Similartools/benchmarkssuchasMeta’sCybersecEval2[7]andGoogleProjectZero
Naptime
[8]aimtofindandverifyvulnerabilities,andduetothehigh-skillandhigh
-complexitynatureofthistask,currentAIsystemsperformpoorlyatthis.Theycanonlysucceedinbasicexampleswithoutanysoftwareprotectionsorexploitcountermeasures. WhileoffensiveAIwilleventuallybecomeadvancedenoughtosucceedatthistask,duetotheobservationthatit’softeneasiertofixavulnerabilitythantocreatean
8
Figure6:AutoKakerGUI
exploitforit,wecanassumethattheasymmetrybetweendefenseandattackwillcauseoffensiveAI-generatedexploitstoalmostneversucceed.ThisisbecauselesscomplexdefensiveAIwilldiscoverandpatchthemfirst.
AnotherconclusionisthatsincecurrentLLMsareadvancedenoughtodiscoversomevulnerabilities,theyalsohavethecapacitytoautomaticallypatchthem,asshowninthenextsection
4Auto-patching
Vulnerabilitydiscovery/annotationandvulnerabilitypatchinghavesimilarworkflow,butinsteadofaddingacommentdescribingthevulnerability,weasktheLLMtogenerateandaddcodethatfixesit.Theautokakertoolcanalreadyperformthistaskbyusingthe–patchcommand-lineargument,displayingasimpleGUI(seefig
8)
.
4.1Iterativepatching
MostSOTALLMslikeLlama-3,Mistral-Large,GPT4,GeminiorClaudearealreadycapableofgeneratingpatchesbuttheydonothavea100%rateofsuccess.Meaningthatthegeneratedfixeswillsometimeseithernotcompileorcreateadditionalbugs.
Wesolvethisproblemusingaclosed-loopapproach(seefig
9),inwhichafterevery
patchgeneration,theautokakeragentchecksifthecodecompilesandpassesalltests.IftheLLMcodefailstopassthesetests,wecantrymultipletimesuntilthegeneratedcodepassesalltests.Notably,mostSOTALLMsgeneratecorrectpatchesonthefirsttry.
9
Figure7:Simplifiedvulndiscoverystages
4.2Example:zlib-hardcored
Zlib
[9]isacompressionlibrarythatissmall,andincludeexampleutilitiesthatcom
-press/decompressbinarydata,thatcanbeusedasatestforthecorrectworkingsoftheseveralalgorithmsimplemented.Theautopatcherutilitywasrunonthiscodeusingthiscommandline:
cdzlib;pythonautok.py--patch--make"make&&example64".
Thiswillruntheautopatchrecursivelyonall.cfilesandrunthecommand’make&&example64’aftereachmodificationtocheckforthecorrectnessandvalidityofeverypatch.
Thisgeneratedacompatiblerefactoroftheoriginalzliblibrarywithover200ap-pliedsecuritypatches.
Thehardenedzlibcodecanbedownloadedat[10]
.Notably,themodificationofthisprojecttoaddadditionalcheckswasdone100%automatically
10
Figure8:AutopatcherGUI
withnohumanintervention.Whilenotallpatchesfixexploitablevulnerabilities,theyadddefensiveprogrammingthatprotectsthezlibfunctionfrommanyfutureunknownvulnerabilities,withtheaddedbenefitofrandomizingtheimplementationitself,makingROPattacksmuchharder.
4.3Example:OpenBSD-hardcored
SecondexampleistheOpenBSDkernel.
OpenBSD[12]isanoperatingsystemknown
foritssecurityandcorrectness.However,theAutokakertooldiscoveredmanyvulnera-bilities,makingitacandidateforautopatching.
Atthistime,autopatcherwasrunonthecompletenetinet/netinet6systemusing
GPT4asamodel,generatingaround2000securitychecks[11]
.Notethatmostpatcheswillresultinunusedcode,andmostchecksarenotreallyneeded,followingthesamephilosophyasdefensiveprogramming.
AsOpenBSDdoesnothaveteststhatcheckthecorrectnessoftheIPv4/IPv6stack,patchingwas’blind’inthesensethattheymaygenerateerrors.Therefore,thepatcheshadtobereviewedmanually.However,outofthousandsofmodifications,only2patchesneededmanualcorrection.
Itisnotrecommendedtousethis’hardened’codeinproductionasitstillmightcontainbugsintroducedbytheautopatcherandnotyetdetected.Also,aswediscusslater,thepatchescanbeeasilyregeneratedwithanewer,morepowerfulLLM.
4.4cost
Currently,thecompleterefactorofthenetinet/netinet6subsystemofOpenBSD7.5isthebiggestprojectthathasbeenautopatched.Wecancitesomenumbersoftheassociated
cost:
11
Figure9:Autopatcherdesign
SubsystemAPIreqContextTok.GeneratedTok.TotalTok.Cost(GPT-4o)
netinet3011752411249133001542.75$netinet65652609051876434585484.27$
Inthistest-run,costwasunder10usdforthecompletenetinet/netinet6processing,usingoneofthemostexpensivemodelsavailable(GPT-4o).Thiscostisverysmallcomparedtothecostofadeveloper,butmostofthecostofhardeningsoftwarewillbethecostofpatchreview.Performanceofdifferentmodelsregardingautopatchingwasnotmeasuredinthisarticle.Totaltimespentpatchingthenetinet/netinet6subsystemwasabout12hs.
4.5Recommendedusage
Theautopatchercangeneratecodewithadditionalchecksthatmaypreventmanyun-knownbugsfrombeingexploited.However,aswecanassumethatLLMswillcontinue
12
Figure10:OpenBSD7.5withAI-hardenedIPstackpatchesbooting.
toimproveatafastrate,itisnotrecommendedtocommitthegeneratedcheckstothecodepermanently,astheycanbeeasilyregeneratedwhenneededwithmoread-vancedLLMs,generatingbetterchecks.Inthisway,wecanseetheautopatcherasapre-compilationstageformostprojects.
5Conclusion
Thisarticleshowsthatcurrentstate-of-the-artLLMscandiscoversomeclassesofvulner-abilitiesonrealC/C++projects,specificallymemorycorruptionbugs.Andwhiletheyarenotadvancedenoughtoverify/exploitthem,theAIcaneasilygenerateandintegratepatchesthatpreventthem.Wearguethattheriskofauto-exploitationof
溫馨提示
- 1. 本站所有資源如無(wú)特殊說(shuō)明,都需要本地電腦安裝OFFICE2007和PDF閱讀器。圖紙軟件為CAD,CAXA,PROE,UG,SolidWorks等.壓縮文件請(qǐng)下載最新的WinRAR軟件解壓。
- 2. 本站的文檔不包含任何第三方提供的附件圖紙等,如果需要附件,請(qǐng)聯(lián)系上傳者。文件的所有權(quán)益歸上傳用戶(hù)所有。
- 3. 本站RAR壓縮包中若帶圖紙,網(wǎng)頁(yè)內(nèi)容里面會(huì)有圖紙預(yù)覽,若沒(méi)有圖紙預(yù)覽就沒(méi)有圖紙。
- 4. 未經(jīng)權(quán)益所有人同意不得將文件中的內(nèi)容挪作商業(yè)或盈利用途。
- 5. 人人文庫(kù)網(wǎng)僅提供信息存儲(chǔ)空間,僅對(duì)用戶(hù)上傳內(nèi)容的表現(xiàn)方式做保護(hù)處理,對(duì)用戶(hù)上傳分享的文檔內(nèi)容本身不做任何修改或編輯,并不能對(duì)任何下載內(nèi)容負(fù)責(zé)。
- 6. 下載文件中如有侵權(quán)或不適當(dāng)內(nèi)容,請(qǐng)與我們聯(lián)系,我們立即糾正。
- 7. 本站不保證下載資源的準(zhǔn)確性、安全性和完整性, 同時(shí)也不承擔(dān)用戶(hù)因使用這些下載資源對(duì)自己和他人造成任何形式的傷害或損失。
最新文檔
- 賠償責(zé)任合同范例
- 果園租地合同范例
- 進(jìn)出口維修合同范例
- 車(chē)輛外借使用合同范例
- 公司勞務(wù)合同范例范例
- 機(jī)械臺(tái)班勞務(wù)分包合同范例
- 食堂水電安裝合同范例
- 服務(wù)合同范例表格
- 人居環(huán)境整治合同范例
- 飯店改造公司合同范例
- 學(xué)校最小應(yīng)急單元應(yīng)急預(yù)案
- 一年級(jí)第一學(xué)期口算題(20以?xún)?nèi)口算天天練-15份各100題精確排版)
- 蘇教版小學(xué)六年級(jí)信息技術(shù)全冊(cè)教案
- 《鄉(xiāng)土中國(guó)》第12-14章
- 軌道交通先張法預(yù)應(yīng)力U型梁預(yù)制施工工法
- 人教版英語(yǔ)四年級(jí)上冊(cè)《Unit-3-My-friends》單元教學(xué)課件
- 工程變更矩陣圖
- 農(nóng)村土地買(mǎi)賣(mài)合同協(xié)議書(shū)范本
- GB/T 42828.2-2023鹽堿地改良通用技術(shù)第2部分:稻田池塘漁農(nóng)改良
- 急性腎衰竭診療規(guī)范內(nèi)科學(xué)診療規(guī)范診療指南2023版
- 國(guó)開(kāi)2023春計(jì)算機(jī)組網(wǎng)技術(shù)實(shí)訓(xùn)-咖啡店無(wú)線(xiàn)上網(wǎng)參考答案
評(píng)論
0/150
提交評(píng)論