人工智能驅(qū)動(dòng)的漏洞搜尋-演變和基準(zhǔn)測(cè)試_第1頁(yè)
人工智能驅(qū)動(dòng)的漏洞搜尋-演變和基準(zhǔn)測(cè)試_第2頁(yè)
人工智能驅(qū)動(dòng)的漏洞搜尋-演變和基準(zhǔn)測(cè)試_第3頁(yè)
人工智能驅(qū)動(dòng)的漏洞搜尋-演變和基準(zhǔn)測(cè)試_第4頁(yè)
人工智能驅(qū)動(dòng)的漏洞搜尋-演變和基準(zhǔn)測(cè)試_第5頁(yè)
已閱讀5頁(yè),還剩20頁(yè)未讀 繼續(xù)免費(fèi)閱讀

下載本文檔

版權(quán)說(shuō)明:本文檔由用戶(hù)提供并上傳,收益歸屬內(nèi)容提供方,若內(nèi)容存在侵權(quán),請(qǐng)進(jìn)行舉報(bào)或認(rèn)領(lǐng)

文檔簡(jiǎn)介

1

AI-PoweredBugHunting-Evolutionand

benchmarking

AlfredoOrtega-ortegaalfredo@X:@ortegaalfredo

Neuroengine.aiJune27,2024

WhileAIholdspromiseforassistingwithbughunting,itsactualimpactre-mainsunclear.ThispresentationaddressesthisdoubtbyintroducingCrash-Benchmark,astandardizedevaluationframeworkforAI-drivenstaticanaly-sistools.We’llshareresultsfromasimplebug-huntingAIagent,AutoKaker,anddiscusstheimplicationsforoptimizingAI-basedbughuntinginC/C++codebases.

1Introduction

Opiniononautomaticbugfindingiscontroversial.Atthedateofthisarticle’spubli-cation,thereisnoconsensusaboutwhetherthisispossibleornot,ortowhatextent.ThisispartlyduetotherapidadvancementofLLMmodels;upuntilmonthsago,open-sourcemodelswerenotadvancedenoughtobeeffectiveatbugfinding.ThereexistsathresholdinthecomplexityofLLMsbeyondwhichbugfindingbecomespossible,andinthisarticlewebenchmarkvariousmodelsandfoundthatthisthresholdhasbeenreachedforsomevulnerabilities.

2CrashBench

Crashbench

[1]isasimpleautomatictest-casebasedbenchmarktool

.ItconnectstoserveralLLMservicesofferingopensourceandprivateAImodels,sendatestcaseandthenmeasureresults.IftheLLMfoundthebuginthecorrectline,thenscoreisincreasedbyone.

2.1Design

Mostofthetestcasesforthev1versionarebasedonGera’sInsecureProgrammingexercises

[2],plus3realvulnerabilityexamples

.TheLLMisassignedascorebased

2

onthenumberofvulnerabilitiesthatwerereported,withrealvulnerabilitieshaving10timesthescore.

TheconfigurationofCrashbenchisasingle.inifilecontainingtheprompt,testcasefilesandexpectedlineswherethebugisfound.

[SETTINGS]

SystemPrompt="Youareanexpertsecurityresearcher,programmerandbugfinder."

Prompt="Checkthiscodeforanyout-of-boundsor

integer-overflowvulnerability,explainitandpointatthelinewiththeproblem,

andnothingmore,inthisway:\n’Bugline=X’whereXisthelinenumberofthebug,

andthenprintthatlinenumber.

Ifthecodehasnobugs,thenprint’Bugline=0’."

[Basic]

file1=stack1.c,6file2=stack2.c,6file3=stack3.c,6file4=stack4.c,6file5=stack5.c,6

[ABOs]

file1=abo1.c,4

...

Inthisway,thetestnotonlytestsbugfindingcapabilities,butalsoaccuracyinreporting.Manymodelsaregoodatfindingvulnerabilities,buttheyfailataccuratelypointingexactlywherethebugislocatedinthecode.Tocreatenegativetests(testswherenovulnerabilityshouldbedetected),justsettheexpectedvulnlinenumbertozero.

2.2Parameters

Softwareusedwasvllmv0.5.0.post1[3]forAWQquantizationandaphrodite-engine

v0.5.3

[4]forEXL2quantization

.Parametersusedforinferenceusingvllmwere:

?temperature:1.2

?topp=1.0

?frequencypenalty=0.6

?presencepenalty=0.8

3

2.3Results

Thebenchmarkranagainst16LLMs,mostofthembeingthelatestversions,butalso

someoldermodelsbasedonLlama-2tocomparethem.Additionally,severalquanti-zationsofthesamemodelweretestedtomeasuretheeffectofquantizationonLLMbug-reportingaccuracy.

Figure1:Crashbenchscore

AsshowninFigure

1,Oldermodelsarenotcompetitiveatcodeunderstandingand

bugfinding,withnewermodelsbeingsignificantlybetter.EvenclosedmodelslikeChatGPTaresurpassedbythesenewermodelsintermsofperformance.Additionally,therelativelysmalleffectofquantizationonresultsisevident,asastrongquantizationofLlama-3-70B(2.25bpw)didnothaveasignificantimpactonthemodel’sscore.

2.4Quantizationeffects

AtFigure

2,wenowfocusontheeffectsofquantizationonthescore

.Quantizationisatechniquethatcompressesmodelsbyrepresentingweightsusingfewerbits,losingsomequalitybutreducingtheamountofmemoryneeded.Thisresultsinincreasedspeedandefficiency.SincecurrentGPUsaremostlylimitedbymemorybandwidth,theefficiencyofinferencedecreasesnearlylinearlywithsize.

Wesetthey-axisto0sothatitcanbemoreeasilyseenhowlowaneffectquantizationhadonthescores.Wecanalsoseetherapidincreaseinsizewiththeincreaseofbitsperword,withoutanycorrespondingincreaseinscore.

4

Figure2:Quantizationeffectsonscore.Model:Meta-LLama-3-70B-Instruct.

WecanplotasecondgraphatFig.

3,showingefficiencyofthedifferentmodels,

meaningthescorepersizeinGigabytes.Withdecreasedsize,speedandpowerrequiredforinferencealsodecreaseslinearly,increasingefficiencyofoperation.

WecanseehowthecurrentmostefficientmodelsarehighlyquantizedversionsofLlama-370B.Ataround25GB,thosemodelsarestilloutofreachformostpersonalhomecomputers.ThebestnextoptionwouldbetouseahighlyquantizedversionofMistral-8x7B,whichcanrunonCPUonmostmoderncomputersatanacceptablespeed.

2.5CrashbenchvsLMSysELO

TheLMsysleaderboard

[5]hasbecometheindustrystandardformodelbenchmarking

.Wecancomparehowourbug-findingbenchmarkcorrelateswiththeoverallmodelscore.

IntuitivelywewouldassumethatoverallELOandcrashbenchscoresshouldbesome-whatrelated.Butin

4

wecanseesomeinconsistencies,especiallywithmodernOpenAImodels.ThesemodelshavemuchbetterELOscoresthanCrashbenchscores.Thismeansthatthesemodelsaremuchbetterasgenericassistantsandcodegenerationthanatbugfinding.Wesuspectthatsuper-alignmentmightcausethesemodelstorefusetoshowbugs,asananalysisofgpt-4andgpt-4oshowsthattheydonotshowmanywrongbugsorlinesonthetest-cases;instead,theirlowscoresaremostlyduetodenyingthatthereisabugatall.Lowscoresmightalsoindicateproblemsonthebenchmark,aswediscussinthefollowingsection.

5

Figure3:Totalmodelefficiency.ThisgraphicshowshowmanypointsthemodelhaveforeveryGBinsize.

2.6Problems

Problemsthatmayaffectthisbenchmarkaccuracyare:

Incorrectparametersand/orpromptformat:Instructmodelshaveaspecificformat

thatmustbeusedonthepromptstomaximizetheirunderstandingoftherequests.ManyLLMsarequiteflexibleonthisformat,whilesomearenot.It’simportanttorespectthepromptformatofeachLLMtomaximizetheircode-understandingcapacity.

Modeltrainedonthesolutionsofthebenchmark:Asmostmodelsaretrainedonter-abytesoftokens,itisverylikelythatthetestcases,bothartificialandreal,werepartoftheirtraining,alongwiththesolutions.Thismightintroduceabiaswheremodelsareverygoodatpassingthebenchmark,butnotsogoodinreal-worldapplications.ThesolutiontothisproblemistocreatemoreunpublishedtestcasesthattheLLMdidn’tseeduringtraining.However,thisisashort-livedsolutionasit’sverylikelythatnewerversionsoftheLLMswillcontainthesenewtestcases,sotheymustbediscardedineverynewversionofthebenchmark.

Bugsoninferencesoftware/quantizationquality:Inferencesoftwareisevolvingrapidly,anditcontainsbugsthataffectqualityandreasoning.Asolutiontothisproblemforbenchmarkingistoalwaysusethesameinferencesoftware.Inourcase,weuseeithervLLMorAphroditeengine,whichinternallyusesvLLM.

6

Figure4:CrashbenchscorevsOverallmodelELOscore.Wecanseeageneralcorrelationexceptonclosedmodels.

Refusalsduetoalignmnet:Somemodelsrefusetodiscoverbugsbecausetheyreasonthattheycanbeusedformaliciouspurposes.Thiscanbebypassedwithseveraltechniquessuchaspromptjailbreakingorabliteration,butbothtechniquesmightaffectthecode-understandingcapacityofthemodel.However,theabliteratedversionofLlama-3-70Bwascomparedagainsttheoriginalversionandshowedaminimaleffectontheresults.

3AutoKaker:Automaticvulnerabilitydiscovery

Usingthesametechniqueofthebenchmarkwecaneasilyconstructatool[6]thatprocess

sourcecodeandannotateseveryvulnerabilityfound.Thealgoritmdescribedinfig

5

issimple:

1.Separatesourcecodeintoindividualchunksthatcontainoneormorefunctions

2.AssembleapromptaskingtheLLMtoanalyzethecode

3.Annotatetheresults

Thistool(seefig

6)canbelaunchedoncompletecodebasesandwillannotateevery

functionwithpossiblevulnerabilities,readyfortriageandexploitationbyahumanoperator.Unlikeotherapproaches,thistooldoesnotattempttoverifyorexploitthe

7

Figure5:Autokakermainloop

vulnerabilitiesfound,asthisisamuchmorecomplextask.Weproposeinthenextsectionthatitisunnecessary.ThetoolcurrentlysupportsonlyCcode,butthisisalimitationofthecurrentcodeparserduetoitsinabilitytoseparatefunctions.ThetoolcanrunonC++/Rustcodewithamodifiedcodeparser.

3.1ProblemswithautomatedAIexploitation

Wecanseeasimplifieddiagramofthestagesofvulnerabilitydiscoveryat

7.

Oncewefoundapossiblevulnerability,wehavetwopaths:Eitherconfirmitviaexploitation,orfixitviaapatch.Wecandotwoimportantobservations:

?Isnotnecessarytoconfirmapossiblevulnerabilitytopatchit.Thisfollowthephilosophyofdefensiveprogramming.

?Patchingavulnerabilityrequiresmuchlessskillsthanexploitingit,orevenfindingit.

Similartools/benchmarkssuchasMeta’sCybersecEval2[7]andGoogleProjectZero

Naptime

[8]aimtofindandverifyvulnerabilities,andduetothehigh-skillandhigh

-complexitynatureofthistask,currentAIsystemsperformpoorlyatthis.Theycanonlysucceedinbasicexampleswithoutanysoftwareprotectionsorexploitcountermeasures. WhileoffensiveAIwilleventuallybecomeadvancedenoughtosucceedatthistask,duetotheobservationthatit’softeneasiertofixavulnerabilitythantocreatean

8

Figure6:AutoKakerGUI

exploitforit,wecanassumethattheasymmetrybetweendefenseandattackwillcauseoffensiveAI-generatedexploitstoalmostneversucceed.ThisisbecauselesscomplexdefensiveAIwilldiscoverandpatchthemfirst.

AnotherconclusionisthatsincecurrentLLMsareadvancedenoughtodiscoversomevulnerabilities,theyalsohavethecapacitytoautomaticallypatchthem,asshowninthenextsection

4Auto-patching

Vulnerabilitydiscovery/annotationandvulnerabilitypatchinghavesimilarworkflow,butinsteadofaddingacommentdescribingthevulnerability,weasktheLLMtogenerateandaddcodethatfixesit.Theautokakertoolcanalreadyperformthistaskbyusingthe–patchcommand-lineargument,displayingasimpleGUI(seefig

8)

.

4.1Iterativepatching

MostSOTALLMslikeLlama-3,Mistral-Large,GPT4,GeminiorClaudearealreadycapableofgeneratingpatchesbuttheydonothavea100%rateofsuccess.Meaningthatthegeneratedfixeswillsometimeseithernotcompileorcreateadditionalbugs.

Wesolvethisproblemusingaclosed-loopapproach(seefig

9),inwhichafterevery

patchgeneration,theautokakeragentchecksifthecodecompilesandpassesalltests.IftheLLMcodefailstopassthesetests,wecantrymultipletimesuntilthegeneratedcodepassesalltests.Notably,mostSOTALLMsgeneratecorrectpatchesonthefirsttry.

9

Figure7:Simplifiedvulndiscoverystages

4.2Example:zlib-hardcored

Zlib

[9]isacompressionlibrarythatissmall,andincludeexampleutilitiesthatcom

-press/decompressbinarydata,thatcanbeusedasatestforthecorrectworkingsoftheseveralalgorithmsimplemented.Theautopatcherutilitywasrunonthiscodeusingthiscommandline:

cdzlib;pythonautok.py--patch--make"make&&example64".

Thiswillruntheautopatchrecursivelyonall.cfilesandrunthecommand’make&&example64’aftereachmodificationtocheckforthecorrectnessandvalidityofeverypatch.

Thisgeneratedacompatiblerefactoroftheoriginalzliblibrarywithover200ap-pliedsecuritypatches.

Thehardenedzlibcodecanbedownloadedat[10]

.Notably,themodificationofthisprojecttoaddadditionalcheckswasdone100%automatically

10

Figure8:AutopatcherGUI

withnohumanintervention.Whilenotallpatchesfixexploitablevulnerabilities,theyadddefensiveprogrammingthatprotectsthezlibfunctionfrommanyfutureunknownvulnerabilities,withtheaddedbenefitofrandomizingtheimplementationitself,makingROPattacksmuchharder.

4.3Example:OpenBSD-hardcored

SecondexampleistheOpenBSDkernel.

OpenBSD[12]isanoperatingsystemknown

foritssecurityandcorrectness.However,theAutokakertooldiscoveredmanyvulnera-bilities,makingitacandidateforautopatching.

Atthistime,autopatcherwasrunonthecompletenetinet/netinet6systemusing

GPT4asamodel,generatingaround2000securitychecks[11]

.Notethatmostpatcheswillresultinunusedcode,andmostchecksarenotreallyneeded,followingthesamephilosophyasdefensiveprogramming.

AsOpenBSDdoesnothaveteststhatcheckthecorrectnessoftheIPv4/IPv6stack,patchingwas’blind’inthesensethattheymaygenerateerrors.Therefore,thepatcheshadtobereviewedmanually.However,outofthousandsofmodifications,only2patchesneededmanualcorrection.

Itisnotrecommendedtousethis’hardened’codeinproductionasitstillmightcontainbugsintroducedbytheautopatcherandnotyetdetected.Also,aswediscusslater,thepatchescanbeeasilyregeneratedwithanewer,morepowerfulLLM.

4.4cost

Currently,thecompleterefactorofthenetinet/netinet6subsystemofOpenBSD7.5isthebiggestprojectthathasbeenautopatched.Wecancitesomenumbersoftheassociated

cost:

11

Figure9:Autopatcherdesign

SubsystemAPIreqContextTok.GeneratedTok.TotalTok.Cost(GPT-4o)

netinet3011752411249133001542.75$netinet65652609051876434585484.27$

Inthistest-run,costwasunder10usdforthecompletenetinet/netinet6processing,usingoneofthemostexpensivemodelsavailable(GPT-4o).Thiscostisverysmallcomparedtothecostofadeveloper,butmostofthecostofhardeningsoftwarewillbethecostofpatchreview.Performanceofdifferentmodelsregardingautopatchingwasnotmeasuredinthisarticle.Totaltimespentpatchingthenetinet/netinet6subsystemwasabout12hs.

4.5Recommendedusage

Theautopatchercangeneratecodewithadditionalchecksthatmaypreventmanyun-knownbugsfrombeingexploited.However,aswecanassumethatLLMswillcontinue

12

Figure10:OpenBSD7.5withAI-hardenedIPstackpatchesbooting.

toimproveatafastrate,itisnotrecommendedtocommitthegeneratedcheckstothecodepermanently,astheycanbeeasilyregeneratedwhenneededwithmoread-vancedLLMs,generatingbetterchecks.Inthisway,wecanseetheautopatcherasapre-compilationstageformostprojects.

5Conclusion

Thisarticleshowsthatcurrentstate-of-the-artLLMscandiscoversomeclassesofvulner-abilitiesonrealC/C++projects,specificallymemorycorruptionbugs.Andwhiletheyarenotadvancedenoughtoverify/exploitthem,theAIcaneasilygenerateandintegratepatchesthatpreventthem.Wearguethattheriskofauto-exploitationof

溫馨提示

  • 1. 本站所有資源如無(wú)特殊說(shuō)明,都需要本地電腦安裝OFFICE2007和PDF閱讀器。圖紙軟件為CAD,CAXA,PROE,UG,SolidWorks等.壓縮文件請(qǐng)下載最新的WinRAR軟件解壓。
  • 2. 本站的文檔不包含任何第三方提供的附件圖紙等,如果需要附件,請(qǐng)聯(lián)系上傳者。文件的所有權(quán)益歸上傳用戶(hù)所有。
  • 3. 本站RAR壓縮包中若帶圖紙,網(wǎng)頁(yè)內(nèi)容里面會(huì)有圖紙預(yù)覽,若沒(méi)有圖紙預(yù)覽就沒(méi)有圖紙。
  • 4. 未經(jīng)權(quán)益所有人同意不得將文件中的內(nèi)容挪作商業(yè)或盈利用途。
  • 5. 人人文庫(kù)網(wǎng)僅提供信息存儲(chǔ)空間,僅對(duì)用戶(hù)上傳內(nèi)容的表現(xiàn)方式做保護(hù)處理,對(duì)用戶(hù)上傳分享的文檔內(nèi)容本身不做任何修改或編輯,并不能對(duì)任何下載內(nèi)容負(fù)責(zé)。
  • 6. 下載文件中如有侵權(quán)或不適當(dāng)內(nèi)容,請(qǐng)與我們聯(lián)系,我們立即糾正。
  • 7. 本站不保證下載資源的準(zhǔn)確性、安全性和完整性, 同時(shí)也不承擔(dān)用戶(hù)因使用這些下載資源對(duì)自己和他人造成任何形式的傷害或損失。

評(píng)論

0/150

提交評(píng)論