悉尼理工大學(xué) - 強(qiáng)化學(xué)習(xí)的新挑戰(zhàn):安全和隱私綜述 New Challenges in Reinforcement Learning:A Survey of Security and Privacy_第1頁
悉尼理工大學(xué) - 強(qiáng)化學(xué)習(xí)的新挑戰(zhàn):安全和隱私綜述 New Challenges in Reinforcement Learning:A Survey of Security and Privacy_第2頁
悉尼理工大學(xué) - 強(qiáng)化學(xué)習(xí)的新挑戰(zhàn):安全和隱私綜述 New Challenges in Reinforcement Learning:A Survey of Security and Privacy_第3頁
悉尼理工大學(xué) - 強(qiáng)化學(xué)習(xí)的新挑戰(zhàn):安全和隱私綜述 New Challenges in Reinforcement Learning:A Survey of Security and Privacy_第4頁
悉尼理工大學(xué) - 強(qiáng)化學(xué)習(xí)的新挑戰(zhàn):安全和隱私綜述 New Challenges in Reinforcement Learning:A Survey of Security and Privacy_第5頁
已閱讀5頁,還剩82頁未讀, 繼續(xù)免費(fèi)閱讀

下載本文檔

版權(quán)說明:本文檔由用戶提供并上傳,收益歸屬內(nèi)容提供方,若內(nèi)容存在侵權(quán),請進(jìn)行舉報(bào)或認(rèn)領(lǐng)

文檔簡介

1arXivv[cs.LG]31DecarXivv[cs.LG]31Dec2022Zhu1*andWanleiZhou21*SchoolofComputerScience,UniversityofTechnologySydney,Broadway,Sydney,2007,NSW,Australia.2*SchoolofDataScience,CityUniversityofMacau,Macau,China.*Correspondingauthor(s).E-mail(s):Tianqing.Zhu@.au;Contributingauthors:Yunjiao.Lei@.au;Yuleisuiutseduauwlzhoucityuedu.mo;AbstractReinforcementlearning(RL)isoneofthemostimportantbranchesofAI.Duetoitscapacityforself-adaptionanddecision-makingindynamicenvironments,reinforcementlearninghasbeenwidelyappliedinmultipleareas,suchashealthcare,datamarkets,autonomousdriv-ing,androbotics.However,someoftheseapplicationsandsystemshavebeenshowntobevulnerabletosecurityorprivacyattacks,resultinginunreliableorunstableservices.Alargenumberofstudieshavefocusedonthesesecurityandprivacyproblemsinreinforcementlearning.However,fewsurveyshaveprovidedasystematicreviewandcomparisonofexistingproblemsandstate-of-the-artsolutionstokeepupwiththepaceofemergingthreats.Accordingly,wehereinpresentsuchacomprehensivereviewtoexplainandsummarizethechallengesassociatedwithsecurityandprivacyinreinforcementlearningfromanewperspective,namelythatoftheMarkovDecisionProcess(MDP).Inthissurvey,we?rstintroducethekeyconceptsrelatedtothisarea.Next,wecoverthesecurityandprivacyissueslinkedtothestate,action,environment,andrewardfunctionoftheMDPprocess,respectively.Wefurtherhighlightthespecialcharacteristicsofsecurityandprivacymethodologiesrelatedtoreinforcementlearning.Finally,wediscussthepossiblefutureresearchdirectionswithinthisarea.SpringerNature2021LATEXtemplate2NewChallengesinReinforcementLearning:ASurveyofSecurityandPrivacyKeywords:ReinforcementLearning,Security,PrivacyPreservation,MarkovDecisionProcess,Multi-agentSystem1IntroductionReinforcementlearning(RL)isoneofthemostimportantbranchesofAI.Duetoitsstrongcapacityforself-adaptation,reinforcementlearninghasbeenwidelyappliedinmultipleareas,includinghealthcare[1],?nancialmar-kets[2],mobileedgecomputing(MEC)[3,4]androbotics[5].Reinforcementlearningisconsideredtobeaformofadaptive(orapproximate)dynamicpro-gramming[6]andhasachievedoutstandingperformanceinsolvingcomplexsequentialdecision-makingproblems.Reinforcementlearning’sstrongperfor-mancehasledtoitsimplementationanddeploymentacrossabroadrangeof?eldsinrecentyears,suchastheInternetofthings(IoT)[7],recommendsys-tems[8],healthcare[9],robotics[10],?nance[11],self-drivingcars[12],andsmartgrids[13],andsoon.Unlikeothermachinelearningtechniques,Rein-forcementlearninghasastrongabilitytolearnbytrialanderrorindynamicandcomplexenvironments.Inparticular,itcanlearnfromtheenvironmentwhichhasminimuminformationabouttheparameterstobelearned[14],andcanasamethodtoaddressoptimalproblems[15,16].Inthereinforcementlearningcontext,anagentcanbeviewedasaself-contained,concurrentlyexecutingthreadofcontrol[17].Itcaninteractwiththeenvironmentandobtainastateoftheenvironmentasinput.Thestateoftheenvironmentcanbethesituationsurroundingtheagent’slocation.Taketheroadconditionsinanautonomousdrivingscenarioasanexample.In?gure1,thegreenvehicleisanagent,andalltheobjectsarounditcanberegardedastheenvironment;thus,theenvironmentcomprisestheroad,thetra?csigns,othercars,etc.Basedonthestateoftheenvironment,theagentchoosesanactionasoutput.Next,theactionchangesthestateoftheenvironment,andtheagentwillreceiveascalarsignalthatcanberegardedasanindicatorofthevalueforthestatetransitionfromtheenvironment.Thisscalarsignalisalwaysrepresentedasareward.Theagent’spurposeistolearnanoptimalpolicyovertimebytrialanderrorinordertogainamaximalaccumulatedrewardasreinforcement.Inaddition,thecombinationofdeeplearningandreinforcementlearningfurtherenhancestheabilityofreinforcementlearning[18].1.1ReinforcementlearningsecurityandprivacyissuesHowever,reinforcementlearningisweaktosecurityattacks.Itistenderforattackerstoleveragethebreachabledatasource[19].Forexample,datapoi-soningattacks[20]andadversarialperturbations[21]areverypopularexistingproposedoverthepastfewyearstoaddressthesesecurityconcerns.SomeresearchershavefocusedonprotectingthemodelfromattacksandensuringSpringerNature2021LATEXtemplateNewChallengesinReinforcementLearning:ASurveyofSecurityandPrivacy3Fig.1Anautonomousdrivingscenario.Thegreencarisanagent.theenvironmentcomprisestheroad,thetra?csigns,othercars,etc.thatthemodelstillperformswellwhileunderattack.Theaimistomakesurethemodeltakessafeactionsthatareexactlyknown,ortogetoptimalpolicyunderworsesituations,suchasbyusingadversarialtraining[22].Figure2presentsanexampleofsecurityattacksinreinforcementlearn-inginanautonomousdrivingscenario.Anautonomouscarisdrivingontheroadandobservingitsenvironmentthroughsensors.Tokeepsafewhiledriv-ingautonomously,itwillcontinuallyadjustitsbehaviorbasedontheroadconditions.Inthiscase,anattackermayfocusonin?uencingtheautonomousdrivingconditions.Forexample,ataparticulartime,theoptimalactionforthecartotakeistogostraight;however,anactionattackmaydirectlyin?uencetheagenttoturnright(theattackmayalsoimpactthevalueofthereward).Withregardtoenvironmentalin?uencingattacks,theattackermayconceiveorfalselyinsertacarintherightfrontoftheenvironment,andthisdisturbingmaymisleadtheautonomouscarintotakingawrongaction.Asforrewardattacks,rivalsmaytrytochangethevalueofthereward(e.g.,from+1to-1)andtherebyimpactthepolicyoftheautonomouscar.Moreover,reinforcementlearningalsohasbeensubjecttoprivacyattacksduetoitsweaknessesthatcanbeleveragedbyattackers.Establishedsamplesusedinreinforcementlearningcontainthelearningagent’sprivateinforma-tion,whichisvulnerabletoawidevarietyofattacks.Forexample,indiseasetreatmentapplicationswithreinforcementlearning[1],real-timehealthdataisrequired,andtoachieveanaccuratedosageofmedicine,theinformationisalwayscollectedandtransmittedinplaintext.Thismaycausedisclosureofusers’privateinformation;consequently,thereinforcementlearningsystemmaycollectdatafrompublicresources.Mostcollecteddatasetscontainpri-vateorsensitiveinformationthathasahighprobabilityofbeingdisclosed[23].Moreover,reinforcementlearningmayalsorequiredatasharing[24]andneedstotransmitinformationduringthesharingprocess.Thus,attacksonnetworklinkscanalsobesuccessfulinareinforcementlearningcontext.Furthermore,cloudcomputing,whichisalwaysusedforreinforcementlearningcomputationSpringerNature2021LATEXtemplate4NewChallengesinReinforcementLearning:ASurveyofSecurityandPrivacyFig.2Asimpleexampleofasecurityattackinreinforcementlearninginthecontextofautomaticdriving.Anactionattack,environmentalattackandrewardattackareshownrespectively.Anactionattackworksbyin?uencingthechoiceofactiondirectly,suchasbytemptingtheagenttotaketheaction“turnright”ratherthantheoptimalaction“gostraight”.Environmentalattacksattempttochangetheagent’sperceptionoftheenviron-mentsoastomisleaditintotakinganincorrectaction.Finally,therewardattackworksbychangingthevalueofarewardgivenforaspeci?cactioninastate.andstoragehasinherentvulnerabilitiestocertainattacks[25].Ratherthanchangingora?ectingthemodel,theattackersmaychoosetofocusonobtainingorinferringtheprivacydata;forexample,Panetal.[26]inferredinformationaboutthesurroundingenvironmentbasedonthetransitionmatrix.Themainapproachestodefendingprivacyandsecurityinthereinforcementlearningcontextincludeencryptiontechnology[27]andinformation-hidingtechniques,suchasdi?erentialprivacy[28].Inaddition,somearti?cialalgo-atedlearning(FL)whichcanpreserveprivacyforthelearningmechanismandstructure.Yuetal.[30]adoptfederatedlearning(FL)intoadeepreinforce-mentlearningmodelinadistributedmanner,withthegoalofprotectingdataprivacyforedgedevices.1.2OutlineandSurveyOverviewAsanincreasingnumberofsecurityandprivacyissuesinreinforcementlearn-ingemerge,itismeaningfultoanalyzeandcompareexistingstudiestohelpsparkideasabouthowsecurityandprivacymightbeimprovedinfutureinthisspeci?c?eld.Overrecentyears,severalsurveysonthesecurityandprivacyofreinforcementlearninghavebeencompleted:(1)Chenetal.[31]reviewedtheresearchrelatedtoreinforcementlearningfromtheperspectiveofarti?cialintelligencesecurityaboutadversarialattacksanddefence.Theauthorsanalysedthecharacteristicsofadversarialattackiesrespectively(2)Luongetal.[32]presentedaliteraturereviewonapplicationsofdeepreinforcementlearningincommunicationsandnetworking;SuchastheInternetofThings(IoT).Theauthorsdiscusseddeepreinforcementlearningapproachesproposedaboutissuesincommunicationsandnetworking,whichSpringerNature2021LATEXtemplateNewChallengesinReinforcementLearning:ASurveyofSecurityandPrivacy5includedynamicnetworkaccess,dataratecontrol,wirelesscaching,datao?oading,networksecurity,andconnectivitypreservation.(3)Anothersurveypaper[14]conductedaliteraturereviewonsecuringIoTdevicesusingreinforcementlearning.Thispaperpresenteddi?erenttypesofcyber-attacksagainstdi?erentIoTsystemsanddiscussedsecuritysolutionsbasedonreinforcementlearningagainsttheseattacks.(4)Wuetal.[33]surveyedthesecurityandprivacyrisksofthekeycompo-nentsofablockchainfromtheperspectiveofmachinelearning,andhelptoabetterunderstandingofthesemethodsinthecontextofIIoT.Chenetal.[34]alsoexploreddeepreinforcementlearninginthecontextofIoT.Ourworkdi?ersfromtheaboveworks.However,theworksmentionedaboveareallfocusedontheIoTorcommu-nicationnetworks.Theyareabouttheapplicationofreinforcementlearning.Veryfewexistingsurveyshavecomprehensivelypresentedthesecurityandprivacyissuesinreinforcementlearningratherthantheapplication.Someofthemconcentrateontheattackand/ordefensemethods.However,theyarejustanalysingthewholein?uence.Accordingly,inthispaper,wehighlighttheobjectsthattheattacksaimatandprovideacomprehensivereviewofthekeymethodsusedtoattackanddefendtheseobjects.Themaincontributionsofoursurveycanbesummarizedasfollows:●ThesurveyorganizestherelevantexistingstudiesfromanovelanglethatisbasedonthecomponentsoftheMarkovdecisionprocess(MDP).WeclassifycurrentresearchesonattacksanddefencesbasedontheirobjectsinMDP.Thisprovidesanewperspectivethatenablesfocusingonthetargetofthemethodsacrosstheentirelearningprocess.●Thesurveyprovidesaclearaccountoftheimpactcausedbythetargetedobjects.TheseobjectsarecomponentsinMDPthatarerelatedtoeachotherandmayexistinthesametimeor/andspace.AdoptingthisapproachenablesustofollowtheMDPtocomprehendtherelevantobjectsandtherelationshipsbetweenthem●Thesurveycomparesthemainmethodsofattackingordefendingthecom-ponentsofMDP,andtherebyshedssomelightsontheadvantagesanddisadvantagesofthesemethods.Theremainderofthispaperisstructuredasfollows.We?rstpresentpre-liminaryconceptsinreinforcementlearningsystemsinSection2.WethenoutlinethesecurityandprivacychallengesinreinforcementlearninginSection3.Next,wepresentfurtherdetailsonsecurityinreinforcementlearninginSection4,followedbyanoverviewofprivacyinreinforcementlearninginSection5.Wefurtherdiscussthesecurityandprivacyinreinforcementlearn-ingapplicationsinsection6.Finally,Sections7and8presentouravenuesfordiscussionandfutureworkandconclusionrespectively.SpringerNature2021LATEXtemplate6NewChallengesinReinforcementLearning:ASurveyofSecurityandPrivacy2Preliminary2.1NotationTable1liststhenotationsusedinthisarticle.RLisreinforcementlearning,andDRLisdeepreinforcementlearning.MDPstandsfortheMarkovDecisionProcess,whichiswidelyusedinreinforcementlearning.MDPcanbedenotedbyatuple(S,A,T,r,γ),whichismadeupoftheagentactionspaceA,theenvironmentstatespaceS,therewardfunctionr,thetransitionmatrixT,andadiscountfactorγe[0,1).Thetransitionmatrixisaprobabilitymappingfromstate-actionpairstostatesT:(SxA)xS→[0,1].Theagent’spurposeisto?ndanoptimalpolicythatcanmapenvironmentstatestoagentactionstomaximizelong-termreward.vπ(s)andQπ(s,a)arethestateandaction-statevalues,whichcanregardasameansofevaluatingthepolicy.Table1Themainnotationsthroughthepaper.notationsmeaningRLReinforcementlearningDRLDeepreinforcementlearningMDPMarkovdecisionprocessATheactionspaceoftheagentSThestatespaceoftheenvironmentTThetransitionmatrixrTherewardfunctionγAdiscountfactorwhichiswithintherange(0,1)πPolicyv┐(s)StatevalueQ┐(s,a)Action-statevalue2.2ReinforcementlearningThereinforcementlearningmodelcontainstheenvironmentstatesS,theagentactionsA,andscalarreinforcementsignalsthatcanberegardedasrewardsr.Alltheelementsandtheenvironmentcanbeconceptualizedasawholesystem.Atstept,whenanagentinteractswiththeenvironment,itcanreceiveastateoftheenvironmentstasinput.Basedonthestateoftheenvironmentst,theagentchoosesanactionatusingthepolicyπasoutput.Next,theactionchangesthestateoftheenvironmenttost+1.Atthesametime,theagentwillobtainarewardrtfromtheenvironment.Thisrewardisascalarsignalthatcanberegardedasanindicatorofthevalueforthestatetransition.Inthisprocess,theagentlearnsapieceofknowledge,whichmayberecordedasst,at,rt,st+1inaQtable.Qtablehascalculatedthemaximumosethebestactionateachstate.Inthenextstep,theupdatedst+1andrt+1willbesenttotheagentagain.Theagent’spurposeistolearnanoptimalpolicyπsoastogainthehighestpossibleaccumulatedrewardr.ToarriveattheSpringerNature2021LATEXtemplateNewChallengesinReinforcementLearning:ASurveyofSecurityandPrivacy7optimalpolicyπ,theagentcantrainbyapplyingatrialanderrorapproachoverthelong-termepisodes.AMarkovDecisionProcess(MDP)withdelayedrewardsisusedtohan-dlereinforcementlearningproblems,suchthatMDPisakeyformalisminreinforcementlearning.Fig.3TheinteractionbetweenagentandenvironmentwithMDP.Theagentinteractswiththeenvironmenttogainknowledge,whichmayberecordedasatableoraneuralnetworkmodel(inDRL),andthentakesanactionthatwillreacttotheenvironmentstate.Iftheenvironmentmodelisgiven,twosimpleiterativealgorithmscanbechosentoarriveatanoptimalmodelintheMDPcontext:namely,valueiter-ation[35]andpolicyiteration[36].Whentheinformationofthemodelisnotknowninadvance,theagentneedstolearnfromtheenvironmenttoobtainthisdatabasedonanappropriatealgorithm,whichisusuallyakindofstatisticalalgorithm.AdaptiveHeuristicCriticandTD(λ),whichisapolicyiterationmechanism,wereusedintheearlystagesofreinforcementlearningtolearnanoptimalpolicywithsamplesfromtherealworld[37].Subsequently,theQ-learningalgorithmincreasedinpopularity[38,39]andisnowalsoaveryimportantalgorithminreinforcementlearning.TheQ-learningalgorithmisalsoaniterativeapproachusedtoselectanactionwithamaximumQvalue,whichisanevaluationvalue,inordertoensurethatthechosenpolicyisopti-mal.Moreover,duetoitsabilitytodealwithhigh-dimensionaldataandtoapproximatethefunction,deeplearninghasbeencombinedwithreinforce-mentlearningtocreatethe?eldof“deepreinforcementlearning”(DRL)[40].Thiscombinationhasledtosigni?cantachievementsinseveral?elds,suchaslearningfromvisualperceptual[18]androbotics[41].AnexampleofreinforcementlearningispresentedinFigure4.The?guredepictsarobotsearchingforanobjectintheGridWorldenvironment.Theredcirclerepresentsthetargetobject,thegreyboxesdenotetheobstacles,andthewhiteboxesdenotetheroad.Therobot’spurposeisto?ndaroutetotheredcircle.Ateachstep,therobothasfourchoicesofaction:walkingup,down,leftandright.Inthebeginning,theagentreceivesinformationfromtheenvironmentwhichmaybeobtainedthroughsensorssuchasradarorcamera.SpringerNature2021LATEXtemplate8NewChallengesinReinforcementLearning:ASurveyofSecurityandPrivacyTheagentthenchoosesanactionandreceivesacorrespondingreward.Inthepositionshowninthe?gure,choosingtheactionofup,leftorright,mayresultinalowerreward,asthereareobstaclesinthesethreedirections.However,takingtheactionofmovingdownwillresultinahigherreward,asitwillbringtheagentclosertoitsgoal.Fig.4Asimpleexampleofreinforcementlearning,inwhicharobottriesto?ndanobjectintheGridWorldenvironment.Thebluerobotcanbeseenastheagentinreinforcementlearning.Theredcircleisthetargetobject.Thegreyboxesdenotetheobstacles,whilethewhiteboxesdenotetheroad.Therobot’spurposeisto?ndaroutetotheredcircle.2.3MarkovDecisionProcess(MDP)TheMarkovdecisionprocess(MDP)isaframeworkusedtomodeldecisionsinanenvironment[42].Fromtheperspectiveofreinforcementlearning,MDPisanapproachwhichhasadelayedreward.InMDP,thestatetransitionsarenotrelatedtoanypreviousenvironmentstatesoragentactions.Thatistosay,thenextstateisindependentofthepreviousstatesandbasedonthecurrentenvironmentstate.MDPcanbedenotedasthetuple(S,A,T,r,γ),whichismadeupoftheagentactionspaceA,theenvironmentstatespaceS,therewardfunctionr,thetransitionmatrixT,andadiscountfactorγe[0,1).Thetransitionmatrixcanbede?nedasaprobabilitymappingfromstate-actionpairstostatesT:(SxA)xS→[0,1].Theagent’spurposeisto?ndanoptimalpolicyπthatcanmapenvironmentstatestoagentactionsinawaythatmaximizesitslong-termreward.Thediscountfactorγisappliedtotheaccumulatedrewardtodiscountfuturerewards.Inmanycases,thegoalofareinforcementlearningalgorithmwithMDPistomaximizetheexpecteddiscountedcumulativereward.Attimestept,wedenotetheenvironmentstate,agentaction,andrewardbyst,atandrtrespectively.Moreover,weusevπ(s)andQπ(s,a)toevaluateSpringerNature2021LATEXtemplateNewChallengesinReinforcementLearning:ASurveyofSecurityandPrivacy9thestateandaction-statevalue.Thestatevaluefunctioncanbeexpressedasfollows:Vπ(s)=Eπ┌γkrt+k+1Ist=s,π┐Theaction-statevaluefunctionisasfollows:Qπ(s,a)=Eπ┌γkrt+k+1Ist=s,at=a,π┐(1)(2)whereγisthediscountfactorandrt+k+1istherewardoft+k+1step.Inawidevarietyofworks,Q-learningwasthemostpopulariterationmethodappliedtodiscountedin?nite-horizonMDPs.2.4DeepreinforcementlearningInsomecases,reinforcementlearning?ndsitdi?culttodealwithhigh-dimensionaldata,suchasvisualinformation.Deeplearningenablesreinforce-mentlearningtoaddresstheseproblems.Deeplearningisatypeofmachinelearningthatcanuselow-dimensionalfeaturestorepresenthigh-dimensionaldatathroughtheapplicationofamulti-layerArti?cialNeuralNetwork(ANN).Consequently,itcanworkwithhigh-dimensionaldatain?eldssuchasimageandnaturallanguageprocessing.Moreover,deepreinforcementlearning(DRL)combinesreinforcementlearningwithdeepneuralnetworks,therebyenablingreinforcementlearningtolearnfromhigh-dimensionalsituations.Hence,DRLcanlearndirectlyfromraw,high-dimensionaldata,andcanaccordinglyacquiretheabilitytounderstandthevisualworld.Moreover,DRLalsohasapowerfulfunctionapproximationcapacity,whichalsoemploysdeepneuralnetworkstotrainapproximatefunctionsinreinforcementlearning;forexam-ple,toproducetheapproximatefunctionofaction-statevalueQπ(s,a)andpolicyTheprocessofDRLisnearlythesameasthatofreinforcementlearning.Theagent’spurposeisalsotoobtainanoptimalpolicythatcanmapenvi-ronmentstatestoagentactionsinawaythatmaximizeslong-termreward.Themaindi?erencebetweentheDRLandreinforcementlearningprocessesliesintheQtable.AsshowninFigure3,inreinforcementlearning,thistablemaybeaformthatrecordsthemapfromstatetoaction;bycontrast,indeepreinforcementlearning,aneuralnetworkistypicallyusedtorepresenttheQtable.3SecurityandprivacychallengesinreinforcementlearningInthissection,wewillbrie?ydiscusssomerepresentativeattacksthatcausesecurityandprivacyissuesinreinforcementlearning.Inmoredetail,weSpringerNature2021LATEXtemplate10NewChallengesinReinforcementLearning:ASurveyofSecurityandPrivacyexploredi?erenttypesofsecurityattacks(speci?cally,adversarialandpoi-soningattacks)andprivacyattacks(speci?cally,geneticalgorithm(GA)andinversereinforcementlearning(IRL)).Moreover,somerepresentativedefencemethodswillalsobediscussed(speci?cally,di?erentialprivacy,cryptography,andadversariallearning).Wefurtherpresentthetaxonomybasedonthecom-ponentsofMDPinthissection,alongwiththerelationshipsandimpactsamongthesecomponentsinreinforcementlearning.3.1Attackmethodology3.1.1SecurityattacksInthispart,wediscusssecurityattacksdesignedtoin?uenceorevendestroythereinforcementlearningmodelinthereinforcementlearningcontext.Specif-ically,webrie?yintroducesomerecentlyproposedattackmethodsdevelopedforthispurpose.Oneofthepopularmeaningsoftheterm”securityattack”isanadversar-ialattackwithadversarialexamples[43,44].Thecommonformofadversarialexamplesinvolvesaddingimperceptibleperturbationstodatawithapre-de?nedgoal;theseperturbationscandeceivethesystemintomakingmistakesthatcausemalfunctions,orpreventitfrommakingoptimaldecisions.Becausereinforcementlearninggathersexamplesdynamicallythroughoutthetrain-ingprocess,attackerscandirectlyaddimperceptibleperturbationstostates,environmentinformation,andrewards,allofwhichmayin?uencetheagentduringreinforcementlearningtraining.Forexample,considertheadditionoftinyperturbationstostatesinordertoproduces+6[40,45](6istheaddedperturbation).Eventhissmallchangemaya?ectthefollowingreinforcementlearningprocess.Attackersdeterminewhereandwhentoaddperturbations,andwhatperturbationstoadd,inordertomaximizethee?ectivenessoftheirattack.Manyalgorithmsthataddadversarialperturbationshavebeenproposed.Examplesincludethefastgradientsignmethod(FGSM),whichcancalculateadversarialexamples,thestrategically-timedattack,whichfocusesonselectingthetimestepofadversarialattacks,andenchantingattack(EA),whichcanmisleadtheagentregardingtheexpectedstatethroughaseriesofcraftedadversarialexamples.Moreover,defensestoadversarialexampleshavealsobeenstudied.Themostrepresentativemethodisadversarialtraining[46],whichtrainsagentsunderadversarialexamplesandtherebyimprovesmodelrobustness.Otherdefensivemethodsfocusonmodifyingtheobjectivefunction,suchasbyaddingtermstothefunctionoradoptingadynamicactivationfunction.Anothercommontypeofsecurityattackisthepoisoningattack,whichfocusesonmanipulatingtheperformanceofamodelbyinsertingmaliciouslycrafted”poisondata”intothetrainingexamples.Apoisoningattackisoftenselectedwhenanattackerhasnoabilitytomodifythetrainingdataitself;instead,theattackeraddsexamplestothetrainingset,andthoseexamplesSpringerNature2021LATEXtemplateNewChallengesinReinforcementLearning:ASurveyofSecurityandPrivacy11canalsoworkattesttime.Attacksbasedonapoisonedtraining

溫馨提示

  • 1. 本站所有資源如無特殊說明,都需要本地電腦安裝OFFICE2007和PDF閱讀器。圖紙軟件為CAD,CAXA,PROE,UG,SolidWorks等.壓縮文件請下載最新的WinRAR軟件解壓。
  • 2. 本站的文檔不包含任何第三方提供的附件圖紙等,如果需要附件,請聯(lián)系上傳者。文件的所有權(quán)益歸上傳用戶所有。
  • 3. 本站RAR壓縮包中若帶圖紙,網(wǎng)頁內(nèi)容里面會(huì)有圖紙預(yù)覽,若沒有圖紙預(yù)覽就沒有圖紙。
  • 4. 未經(jīng)權(quán)益所有人同意不得將文件中的內(nèi)容挪作商業(yè)或盈利用途。
  • 5. 人人文庫網(wǎng)僅提供信息存儲(chǔ)空間,僅對用戶上傳內(nèi)容的表現(xiàn)方式做保護(hù)處理,對用戶上傳分享的文檔內(nèi)容本身不做任何修改或編輯,并不能對任何下載內(nèi)容負(fù)責(zé)。
  • 6. 下載文件中如有侵權(quán)或不適當(dāng)內(nèi)容,請與我們聯(lián)系,我們立即糾正。
  • 7. 本站不保證下載資源的準(zhǔn)確性、安全性和完整性, 同時(shí)也不承擔(dān)用戶因使用這些下載資源對自己和他人造成任何形式的傷害或損失。

評論

0/150

提交評論