
arXiv:2401.03568v2 [cs.AI] 25 Jan 2024

AGENT AI:

SURVEYING THE HORIZONS OF MULTIMODAL INTERACTION

Zane Durante1†∗, Qiuyuan Huang2†‡, Naoki Wake2†, Ran Gong3†, Jae Sung Park4†, Bidipta Sarkar1†, Rohan Taori1†, Yusuke Noda5, Demetri Terzopoulos3, Yejin Choi4, Katsushi Ikeuchi2, Hoi Vo5, Li Fei-Fei1, Jianfeng Gao2

1Stanford University; 2Microsoft Research, Redmond;

3University of California, Los Angeles; 4University of Washington; 5Microsoft Gaming

Figure 1: Overview of an Agent AI system that can perceive and act in many different domains and applications. Agent AI is emerging as a promising avenue toward Artificial General Intelligence (AGI). Agent AI training has demonstrated the capacity for multi-modal understanding in the physical world, and it provides a framework for reality-agnostic training by leveraging generative AI alongside multiple independent data sources. Large foundation models trained for agent- and action-related tasks can be applied to physical and virtual worlds when trained on cross-reality data, possibly serving as a route toward AGI through an agent paradigm.

†Equal Contribution. ‡Project Lead. ∗Work done while interning at Microsoft Research, Redmond.


ABSTRACT

Multi-modal AI systems will likely become a ubiquitous presence in our everyday lives. A promising approach to making these systems more interactive is to embody them as agents within physical and virtual environments. At present, systems leverage existing foundation models as the basic building blocks for the creation of embodied agents. Embedding agents within such environments facilitates the ability of models to process and interpret visual and contextual data, which is critical for the creation of more sophisticated and context-aware AI systems. For example, a system that can perceive user actions, human behavior, environmental objects, audio expressions, and the collective sentiment of a scene can be used to inform and direct agent responses within the given environment. To accelerate research on agent-based multimodal intelligence, we define "Agent AI" as a class of interactive systems that can perceive visual stimuli, language inputs, and other environmentally-grounded data, and can produce meaningful embodied actions. In particular, we explore systems that aim to improve agents based on next embodied action prediction by incorporating external knowledge, multi-sensory inputs, and human feedback. We argue that by developing agentic AI systems in grounded environments, one can also mitigate the hallucinations of large foundation models and their tendency to generate environmentally incorrect outputs. The emerging field of Agent AI subsumes the broader embodied and agentic aspects of multimodal interactions. Beyond agents acting and interacting in the physical world, we envision a future where people can easily create any virtual reality or simulated scene and interact with agents embodied within the virtual environment.

Contents

1 Introduction 5
  1.1 Motivation 5
  1.2 Background 5
  1.3 Overview 6

2 Agent AI Integration 7
  2.1 Infinite AI agent 7
  2.2 Agent AI with Large Foundation Models 8
    2.2.1 Hallucinations 8
    2.2.2 Biases and Inclusivity 9
    2.2.3 Data Privacy and Usage 10
    2.2.4 Interpretability and Explainability 11
    2.2.5 Inference Augmentation 12
    2.2.6 Regulation 13
  2.3 Agent AI for Emergent Abilities 14

3 Agent AI Paradigm 15
  3.1 LLMs and VLMs 15
  3.2 Agent Transformer Definition 15
  3.3 Agent Transformer Creation 16

4 Agent AI Learning 17
  4.1 Strategy and Mechanism 17
    4.1.1 Reinforcement Learning (RL) 17
    4.1.2 Imitation Learning (IL) 18
    4.1.3 Traditional RGB 18
    4.1.4 In-context Learning 18
    4.1.5 Optimization in the Agent System 18
  4.2 Agent Systems (zero-shot and few-shot level) 19
    4.2.1 Agent Modules 19
    4.2.2 Agent Infrastructure 19
  4.3 Agentic Foundation Models (pretraining and finetune level) 19

5 Agent AI Categorization 20
  5.1 Generalist Agent Areas 20
  5.2 Embodied Agents 20
    5.2.1 Action Agents 20
    5.2.2 Interactive Agents 21
  5.3 Simulation and Environments Agents 21
  5.4 Generative Agents 21
    5.4.1 AR/VR/mixed-reality Agents 22
  5.5 Knowledge and Logical Inference Agents 22
    5.5.1 Knowledge Agent 23
    5.5.2 Logic Agents 23
    5.5.3 Agents for Emotional Reasoning 23
    5.5.4 Neuro-Symbolic Agents 24
  5.6 LLMs and VLMs Agent 24

6 Agent AI Application Tasks 24
  6.1 Agents for Gaming 24
    6.1.1 NPC Behavior 24
    6.1.2 Human-NPC Interaction 25
    6.1.3 Agent-based Analysis of Gaming 25
    6.1.4 Scene Synthesis for Gaming 27
    6.1.5 Experiments and Results 27
  6.2 Robotics 28
    6.2.1 LLM/VLM Agent for Robotics 30
    6.2.2 Experiments and Results 31
  6.3 Healthcare 35
    6.3.1 Current Healthcare Capabilities 36
  6.4 Multimodal Agents 36
    6.4.1 Image-Language Understanding and Generation 36
    6.4.2 Video and Language Understanding and Generation 37
    6.4.3 Experiments and Results 39
  6.5 Video-language Experiments 41
  6.6 Agent for NLP 45
    6.6.1 LLM agent 45
    6.6.2 General LLM agent 45
    6.6.3 Instruction-following LLM agents 46
    6.6.4 Experiments and Results 46

7 Agent AI Across Modalities, Domains, and Realities 48
  7.1 Agents for Cross-modal Understanding 48
  7.2 Agents for Cross-domain Understanding 48
  7.3 Interactive agent for cross-modality and cross-reality 49
  7.4 Sim to Real Transfer 49

8 Continuous and Self-improvement for Agent AI 49
  8.1 Human-based Interaction Data 49
  8.2 Foundation Model Generated Data 50

9 Agent Dataset and Leaderboard 50
  9.1 "CuisineWorld" Dataset for Multi-agent Gaming 50
    9.1.1 Benchmark 51
    9.1.2 Task 51
    9.1.3 Metrics and Judging 51
    9.1.4 Evaluation 51
  9.2 Audio-Video-Language Pre-training Dataset 51

10 Broader Impact Statement 52

11 Ethical Considerations 53

12 Diversity Statement 53

References 55

Appendix 69
  A GPT-4V Agent Prompt Details 69
  B GPT-4V for Bleeding Edge 69
  C GPT-4V for Microsoft Flight Simulator 69
  D GPT-4V for Assassin's Creed Odyssey 69
  E GPT-4V for GEARS of WAR 4 69
  F GPT-4V for Starfield 75

Author Biographies 77

Acknowledgments 80


1 Introduction

1.1 Motivation

Historically, AI systems were defined at the 1956 Dartmouth Conference as artificial life forms that could collect information from the environment and interact with it in useful ways. Motivated by this definition, Minsky's MIT group built a robotics system in 1970, called the "Copy Demo," that observed "blocks world" scenes and successfully reconstructed the observed polyhedral block structures. The system, which comprised observation, planning, and manipulation modules, revealed that each of these subproblems is highly challenging and that further research was necessary. The AI field subsequently fragmented into specialized subfields that have largely independently made great progress in tackling these and other problems, but over-reductionism has blurred the overarching goals of AI research.

To advance beyond the status quo, it is necessary to return to AI fundamentals motivated by Aristotelian Holism. Fortunately, the recent revolution in Large Language Models (LLMs) and Visual Language Models (VLMs) has made it possible to create novel AI agents consistent with the holistic ideal. Seizing upon this opportunity, this article explores models that integrate language proficiency, visual cognition, context memory, intuitive reasoning, and adaptability. It explores the potential completion of this holistic synthesis using LLMs and VLMs. In our exploration, we also revisit system design based on Aristotle's Final Cause, the teleological "why the system exists," which may have been overlooked in previous rounds of AI development.

With the advent of powerful pretrained LLMs and VLMs, a renaissance in natural language processing and computer vision has been catalyzed. LLMs now demonstrate an impressive ability to decipher the nuances of real-world linguistic data, often achieving abilities that parallel or even surpass human expertise (OpenAI, 2023). Recently, researchers have shown that LLMs may be extended to act as agents within various environments, performing intricate actions and tasks when paired with domain-specific knowledge and modules (Xi et al., 2023). These scenarios, characterized by complex reasoning, understanding of the agent's role and its environment, along with multi-step planning, test the agent's ability to make highly nuanced and intricate decisions within its environmental constraints (Wu et al., 2023; Meta Fundamental AI Research (FAIR) Diplomacy Team et al., 2022).

Building upon these initial efforts, the AI community is on the cusp of a significant paradigm shift, transitioning from creating AI models for passive, structured tasks to models capable of assuming dynamic, agentic roles in diverse and complex environments. In this context, this article investigates the immense potential of using LLMs and VLMs as agents, emphasizing models that have a blend of linguistic proficiency, visual cognition, contextual memory, intuitive reasoning, and adaptability. Leveraging LLMs and VLMs as agents, especially within domains like gaming, robotics, and healthcare, promises not just a rigorous evaluation platform for state-of-the-art AI systems, but also foreshadows the transformative impacts that agent-centric AI will have across society and industries. When fully harnessed, agentic models can redefine human experiences and elevate operational standards. The potential for sweeping automation ushered in by these models portends monumental shifts in industries and socio-economic dynamics. Such advancements will be intertwined with multifaceted challenges, not only technical but also ethical, as we will elaborate upon in Section 11. We delve into the overlapping areas of these sub-fields of Agent AI and illustrate their interconnectedness in Fig. 1.

1.2 Background

We will now introduce relevant research papers that support the concepts, theoretical background, and modern implementations of Agent AI.

Large Foundation Models: LLMs and VLMs have been driving the effort to develop general intelligent machines (Bubeck et al., 2023; Mirchandani et al., 2023). Although they are trained using large text corpora, their superior problem-solving capacity is not limited to canonical language processing domains. LLMs can potentially tackle complex tasks that were previously presumed to be exclusive to human experts or domain-specific algorithms, ranging from mathematical reasoning (Imani et al., 2023; Wei et al., 2022; Zhu et al., 2022) to answering questions of professional law (Blair-Stanek et al., 2023; Choi et al., 2023; Nay, 2022). Recent research has shown the possibility of using LLMs to generate complex plans for robots and game AI (Liang et al., 2022; Wang et al., 2023a,b; Yao et al., 2023a; Huang et al., 2023a), marking an important milestone for LLMs as general-purpose intelligent agents.


Embodied AI: A number of works leverage LLMs to perform task planning (Huang et al., 2022a; Wang et al., 2023b; Yao et al., 2023a; Li et al., 2023a), specifically exploiting the LLMs' WWW-scale domain knowledge and emergent zero-shot embodied abilities to perform complex task planning and reasoning. Recent robotics research also leverages LLMs to perform task planning (Ahn et al., 2022a; Huang et al., 2022b; Liang et al., 2022) by decomposing natural language instructions into a sequence of subtasks, either in natural language form or in Python code, then using a low-level controller to execute these subtasks. Additionally, these works incorporate environmental feedback to improve task performance (Huang et al., 2022b; Liang et al., 2022; Wang et al., 2023a; Ikeuchi et al., 2023).
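To make this decompose-then-execute pattern concrete, the following Python sketch shows one way a planner could turn a natural-language instruction into subtasks and run them through a low-level controller, folding environmental feedback back into the LLM on failure. The `llm_complete` callable, the `controller` object and its `execute` signature, and the prompt text are hypothetical placeholders for illustration, not the APIs of any cited system.

```python
from typing import Callable, List

# Hypothetical decomposition prompt; real systems use carefully
# engineered, often few-shot, prompts.
PLANNER_PROMPT = (
    "Decompose the instruction into a numbered list of short, atomic "
    "subtasks that a low-level controller can execute.\n"
    "Instruction: {instruction}\nSubtasks:"
)

def plan_subtasks(llm_complete: Callable[[str], str], instruction: str) -> List[str]:
    """Ask the LLM to decompose an instruction into executable subtasks."""
    response = llm_complete(PLANNER_PROMPT.format(instruction=instruction))
    lines = [ln.strip() for ln in response.splitlines()]
    # Parse lines of the form "1. pick up the red block".
    return [ln.split(".", 1)[1].strip()
            for ln in lines if ln and ln[0].isdigit() and "." in ln]

def execute_with_feedback(llm_complete: Callable[[str], str],
                          controller, instruction: str) -> None:
    """Run subtasks sequentially, replanning on failure via environment feedback."""
    for subtask in plan_subtasks(llm_complete, instruction):
        ok, feedback = controller.execute(subtask)  # assumed (success, message) API
        if not ok:
            # Fold the environment's error message back into the LLM,
            # in the spirit of the feedback loops cited above.
            corrected = llm_complete(
                f"The subtask '{subtask}' failed with feedback: {feedback}. "
                "Propose one corrected subtask:"
            )
            controller.execute(corrected.strip())
```

The cited systems differ mainly in the surface form of this loop: some emit executable Python code rather than natural-language subtasks, and some replan over the entire remaining task rather than over a single failed step.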

Interactive Learning: AI agents designed for interactive learning operate using a combination of machine learning techniques and user interactions. Initially, the AI agent is trained on a large dataset. This dataset includes various types of information, depending on the intended function of the agent. For instance, an AI designed for language tasks would be trained on a massive corpus of text data. The training involves using machine learning algorithms, which could include deep learning models like neural networks. These trained models enable the AI to recognize patterns, make predictions, and generate responses based on the data on which it was trained. The AI agent can also learn from real-time interactions with users. This interactive learning can occur in various ways: 1) Feedback-based learning: The AI adapts its responses based on direct user feedback (Li et al., 2023b; Yu et al., 2023a; Parakh et al., 2023; Zha et al., 2023; Wake et al., 2023a,b,c). For example, if a user corrects the AI's response, the AI can use this information to improve future responses (Zha et al., 2023; Liu et al., 2023a). 2) Observational learning: The AI observes user interactions and learns implicitly. For example, if users frequently ask similar questions or interact with the AI in a particular way, the AI might adjust its responses to better suit these patterns. This allows the AI agent to understand and process human language in multimodal settings, interpret cross-reality contexts, and generate responses for human users. Over time, with more user interactions and feedback, the AI agent's performance generally continues to improve. This process is often supervised by human operators or developers who ensure that the AI is learning appropriately and not developing biases or incorrect patterns.
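As a minimal illustration of these two modes, the Python sketch below shows an agent that adapts from direct user corrections and from observed interaction patterns. The `base_model.generate` interface and the in-memory caching strategy are simplifying assumptions for illustration; production systems typically realize these ideas through fine-tuning or retrieval rather than a dictionary.

```python
from collections import Counter
from typing import List, Tuple

class InteractiveAgent:
    """Toy agent illustrating feedback-based and observational learning."""

    def __init__(self, base_model):
        self.base_model = base_model  # assumed .generate(str) -> str interface
        self.corrections = {}         # query -> user-corrected response

    def respond(self, query: str) -> str:
        # Feedback-based learning: prefer responses the user has corrected.
        if query in self.corrections:
            return self.corrections[query]
        return self.base_model.generate(query)

    def receive_feedback(self, query: str, corrected_response: str) -> None:
        # A direct user correction immediately shapes future responses.
        self.corrections[query] = corrected_response

    def observe(self, interaction_log: List[Tuple[str, str]]) -> None:
        # Observational learning: detect frequently recurring queries and
        # precompute responses so the agent adapts to usage patterns.
        frequent = Counter(query for query, _ in interaction_log).most_common(10)
        for query, count in frequent:
            if count > 2 and query not in self.corrections:
                self.corrections[query] = self.base_model.generate(query)
```

The human supervision mentioned above would sit around `receive_feedback`, vetting corrections before they are committed, to guard against the agent absorbing biased or incorrect patterns.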

1.3 Overview

Multimodal Agent AI (MAA) is a family of systems that generate effective actions in a given environment based on the understanding of multimodal sensory input. With the advent of Large Language Models (LLMs) and Vision-Language Models (VLMs), numerous MAA systems have been proposed in fields ranging from basic research to applications. While these research areas are growing rapidly by integrating with the traditional technologies of each domain (e.g., visual question answering and vision-language navigation), they share common interests such as data collection, benchmarking, and ethical perspectives. In this paper, we focus on some representative research areas of MAA, namely multimodality, gaming (VR/AR/MR), robotics, and healthcare, and we aim to provide comprehensive knowledge of the common concerns discussed in these fields. As a result, we expect readers to learn the fundamentals of MAA and gain insights to further advance their research; a minimal sketch of the MAA perceive-act interface follows the list of learning outcomes below. Specific learning outcomes include:

• MAA Overview: A deep dive into its principles and roles in contemporary applications, providing researchers with a thorough grasp of its importance and uses.

• Methodologies: Detailed examples of how LLMs and VLMs enhance MAAs, illustrated through case studies in gaming, robotics, and healthcare.

• Performance Evaluation: Guidance on the assessment of MAAs with relevant datasets, focusing on their effectiveness and generalization.

• Ethical Considerations: A discussion on the societal impacts and ethical challenges of deploying Agent AI, highlighting responsible development practices.

• Emerging Trends and Future Challenges: A categorization of the latest developments in each domain and a discussion of future directions.
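As noted above, the MAA definition can be pinned down with a minimal interface sketch: a system that fuses multimodal observations into a grounded state and emits an action. The modality names, the types, and the `env.reset()`/`env.step()` protocol below are illustrative assumptions, not a proposed standard.

```python
from typing import Any, Dict, Protocol

# Illustrative observation: named modalities mapped to raw sensor payloads,
# e.g., {"vision": <image array>, "language": "pick up the cup", "audio": <waveform>}.
Observation = Dict[str, Any]

class MultimodalAgent(Protocol):
    """Interface for a Multimodal Agent AI system: perceive, then act."""

    def perceive(self, observation: Observation) -> Any:
        """Fuse multimodal input into an internal, environment-grounded state."""
        ...

    def act(self, state: Any) -> str:
        """Produce an embodied action (here, a string command) from the state."""
        ...

def run_episode(agent: MultimodalAgent, env, max_steps: int = 100) -> None:
    """Generic perceive-act loop over an assumed reset()/step() environment."""
    observation = env.reset()
    for _ in range(max_steps):
        state = agent.perceive(observation)
        action = agent.act(state)
        observation, done = env.step(action)  # assumed (observation, done) return
        if done:
            break
```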

Computer-based action and generalist agents (GAs) are useful for many tasks. For a GA to become truly valuable to its users, however, it must be natural to interact with and able to generalize to a broad range of contexts and modalities. We aim to cultivate a vibrant research ecosystem and create a shared sense of identity and purpose among the Agent AI community. MAA has the potential to be widely applicable across various contexts and modalities, including input from humans. Therefore, we believe this Agent AI area can engage a diverse range of researchers, fostering a dynamic Agent AI community and


shared goals. Led by esteemed experts from academia and industry, we expect that this paper will be an interactive and enriching experience, complete with agent instruction, case studies, task sessions, and experiment discussions, ensuring a comprehensive and engaging learning experience for all researchers.

This paper aims to provide general and comprehensive knowledge about current research in the field of Agent AI. To this end, the rest of the paper is organized as follows. Section 2 outlines how Agent AI benefits from integrating with related emerging technologies, particularly large foundation models. Section 3 describes a new paradigm and framework that we propose for training Agent AI. Section 4 provides an overview of the methodologies that are widely used in the training of Agent AI. Section 5 categorizes and discusses various types of agents. Section 6 introduces Agent AI applications in gaming, robotics, and healthcare. Section 7 explores the research community's efforts to develop a versatile Agent AI capable of being applied across various modalities and domains, and of bridging the sim-to-real gap. Section 8 discusses the potential of Agent AI that not only relies on pre-trained foundation models, but also continuously learns and self-improves by leveraging interactions with the environment and users. Section 9 introduces our new datasets that are designed for the training of multimodal Agent AI. Section 11 discusses ethical considerations for AI agents, limitations, and the societal impact of our paper.

2 Agent AI Integration

Foundation models based on LLMs and VLMs, as proposed in previous research, still exhibit limited performance in the area of embodied AI, particularly in terms of understanding, generating, editing, and interacting within unseen environments or scenarios (Huang et al., 2023a; Zeng et al., 2023). Consequently, these limitations lead to sub-optimal outputs from AI agents. Current agent-centric AI modeling approaches focus on directly accessible and clearly defined data (e.g., text or string representations of the world state) and generally use domain- and environment-independent patterns learned from their large-scale pretraining to predict action outputs for each environment (Xi et al., 2023; Wang et al., 2023c; Gong et al., 2023a; Wu et al., 2023). In (Huang et al., 2023a), we investigate the task of knowledge-guided collaborative and interactive scene generation by combining large foundation models, and show promising results indicating that knowledge-grounded LLM agents can improve the performance of 2D and 3D scene understanding, generation, and editing, alongside other human-agent interactions (Huang et al., 2023a). By integrating an Agent AI framework, large foundation models are able to more deeply understand user input to form a complex and adaptive H
