
arXiv:2401.03568v2 [cs.AI] 25 Jan 2024

AGENT AI:

SURVEYING THE HORIZONS OF MULTIMODAL INTERACTION

Zane Durante1†∗, Qiuyuan Huang2†‡, Naoki Wake2†, Ran Gong3†, Jae Sung Park4†, Bidipta Sarkar1†, Rohan Taori1†, Yusuke Noda5, Demetri Terzopoulos3, Yejin Choi4, Katsushi Ikeuchi2, Hoi Vo5, Li Fei-Fei1, Jianfeng Gao2

1Stanford University; 2Microsoft Research, Redmond;

3University of California, Los Angeles; 4University of Washington; 5Microsoft Gaming

Figure 1: Overview of an Agent AI system that can perceive and act in many different domains and applications. Agent AI is emerging as a promising avenue toward Artificial General Intelligence (AGI). Agent AI training has demonstrated the capacity for multi-modal understanding in the physical world, and it provides a framework for reality-agnostic training by leveraging generative AI alongside multiple independent data sources. Large foundation models trained for agent- and action-related tasks can be applied to physical and virtual worlds when trained on cross-reality data, possibly serving as a route toward AGI through an agent paradigm.

†Equal Contribution. ‡Project Lead. ∗Work done while interning at Microsoft Research, Redmond.


ABSTRACT

Multi-modal AI systems will likely become a ubiquitous presence in our everyday lives. A promising approach to making these systems more interactive is to embody them as agents within physical and virtual environments. At present, systems leverage existing foundation models as the basic building blocks for the creation of embodied agents. Embedding agents within such environments facilitates the ability of models to process and interpret visual and contextual data, which is critical for the creation of more sophisticated and context-aware AI systems. For example, a system that can perceive user actions, human behavior, environmental objects, audio expressions, and the collective sentiment of a scene can be used to inform and direct agent responses within the given environment. To accelerate research on agent-based multimodal intelligence, we define "Agent AI" as a class of interactive systems that can perceive visual stimuli, language inputs, and other environmentally-grounded data, and can produce meaningful embodied actions. In particular, we explore systems that aim to improve agents based on next embodied action prediction by incorporating external knowledge, multi-sensory inputs, and human feedback. We argue that by developing agentic AI systems in grounded environments, one can also mitigate the hallucinations of large foundation models and their tendency to generate environmentally incorrect outputs. The emerging field of Agent AI subsumes the broader embodied and agentic aspects of multimodal interactions. Beyond agents acting and interacting in the physical world, we envision a future where people can easily create any virtual reality or simulated scene and interact with agents embodied within the virtual environment.

Contents

1 Introduction 5
  1.1 Motivation 5
  1.2 Background 5
  1.3 Overview 6

2 Agent AI Integration 7
  2.1 Infinite AI agent 7
  2.2 Agent AI with Large Foundation Models 8
    2.2.1 Hallucinations 8
    2.2.2 Biases and Inclusivity 9
    2.2.3 Data Privacy and Usage 10
    2.2.4 Interpretability and Explainability 11
    2.2.5 Inference Augmentation 12
    2.2.6 Regulation 13
  2.3 Agent AI for Emergent Abilities 14

3 Agent AI Paradigm 15
  3.1 LLMs and VLMs 15
  3.2 Agent Transformer Definition 15
  3.3 Agent Transformer Creation 16

4 Agent AI Learning 17
  4.1 Strategy and Mechanism 17
    4.1.1 Reinforcement Learning (RL) 17
    4.1.2 Imitation Learning (IL) 18
    4.1.3 Traditional RGB 18
    4.1.4 In-context Learning 18
    4.1.5 Optimization in the Agent System 18
  4.2 Agent Systems (zero-shot and few-shot level) 19
    4.2.1 Agent Modules 19
    4.2.2 Agent Infrastructure 19
  4.3 Agentic Foundation Models (pretraining and finetune level) 19

5 Agent AI Categorization 20
  5.1 Generalist Agent Areas 20
  5.2 Embodied Agents 20
    5.2.1 Action Agents 20
    5.2.2 Interactive Agents 21
  5.3 Simulation and Environments Agents 21
  5.4 Generative Agents 21
    5.4.1 AR/VR/mixed-reality Agents 22
  5.5 Knowledge and Logical Inference Agents 22
    5.5.1 Knowledge Agent 23
    5.5.2 Logic Agents 23
    5.5.3 Agents for Emotional Reasoning 23
    5.5.4 Neuro-Symbolic Agents 24
  5.6 LLMs and VLMs Agent 24

6 Agent AI Application Tasks 24
  6.1 Agents for Gaming 24
    6.1.1 NPC Behavior 24
    6.1.2 Human-NPC Interaction 25
    6.1.3 Agent-based Analysis of Gaming 25
    6.1.4 Scene Synthesis for Gaming 27
    6.1.5 Experiments and Results 27
  6.2 Robotics 28
    6.2.1 LLM/VLM Agent for Robotics 30
    6.2.2 Experiments and Results 31
  6.3 Healthcare 35
    6.3.1 Current Healthcare Capabilities 36
  6.4 Multimodal Agents 36
    6.4.1 Image-Language Understanding and Generation 36
    6.4.2 Video and Language Understanding and Generation 37
    6.4.3 Experiments and Results 39
  6.5 Video-language Experiments 41
  6.6 Agent for NLP 45
    6.6.1 LLM agent 45
    6.6.2 General LLM agent 45
    6.6.3 Instruction-following LLM agents 46
    6.6.4 Experiments and Results 46

7 Agent AI Across Modalities, Domains, and Realities 48
  7.1 Agents for Cross-modal Understanding 48
  7.2 Agents for Cross-domain Understanding 48
  7.3 Interactive agent for cross-modality and cross-reality 49
  7.4 Sim to Real Transfer 49

8 Continuous and Self-improvement for Agent AI 49
  8.1 Human-based Interaction Data 49
  8.2 Foundation Model Generated Data 50

9 Agent Dataset and Leaderboard 50
  9.1 "CuisineWorld" Dataset for Multi-agent Gaming 50
    9.1.1 Benchmark 51
    9.1.2 Task 51
    9.1.3 Metrics and Judging 51
    9.1.4 Evaluation 51
  9.2 Audio-Video-Language Pre-training Dataset 51

10 Broader Impact Statement 52

11 Ethical Considerations 53

12 Diversity Statement 53

References 55

Appendix 69
  A GPT-4V Agent Prompt Details 69
  B GPT-4V for Bleeding Edge 69
  C GPT-4V for Microsoft Flight Simulator 69
  D GPT-4V for Assassin's Creed Odyssey 69
  E GPT-4V for GEARS of WAR 4 69
  F GPT-4V for Starfield 75

Author Biographies 77

Acknowledgments 80


1 Introduction

1.1 Motivation

Historically, AI systems were defined at the 1956 Dartmouth Conference as artificial life forms that could collect information from the environment and interact with it in useful ways. Motivated by this definition, Minsky's MIT group built a robotics system in 1970, called the "Copy Demo," that observed "blocks world" scenes and successfully reconstructed the observed polyhedral block structures. The system, which comprised observation, planning, and manipulation modules, revealed that each of these subproblems is highly challenging and that further research was necessary. The AI field subsequently fragmented into specialized subfields that have largely independently made great progress in tackling these and other problems, but over-reductionism has blurred the overarching goals of AI research.

To advance beyond the status quo, it is necessary to return to AI fundamentals motivated by Aristotelian Holism. Fortunately, the recent revolution in Large Language Models (LLMs) and Visual Language Models (VLMs) has made it possible to create novel AI agents consistent with the holistic ideal. Seizing upon this opportunity, this article explores models that integrate language proficiency, visual cognition, context memory, intuitive reasoning, and adaptability. It explores the potential completion of this holistic synthesis using LLMs and VLMs. In our exploration, we also revisit system design based on Aristotle's Final Cause, the teleological "why the system exists," which may have been overlooked in previous rounds of AI development.

With the advent of powerful pretrained LLMs and VLMs, a renaissance in natural language processing and computer vision has been catalyzed. LLMs now demonstrate an impressive ability to decipher the nuances of real-world linguistic data, often achieving abilities that parallel or even surpass human expertise (OpenAI, 2023). Recently, researchers have shown that LLMs may be extended to act as agents within various environments, performing intricate actions and tasks when paired with domain-specific knowledge and modules (Xi et al., 2023). These scenarios, characterized by complex reasoning, understanding of the agent's role and its environment, along with multi-step planning, test the agent's ability to make highly nuanced and intricate decisions within its environmental constraints (Wu et al., 2023; Meta Fundamental AI Research (FAIR) Diplomacy Team et al., 2022).

Building upon these initial efforts, the AI community is on the cusp of a significant paradigm shift, transitioning from creating AI models for passive, structured tasks to models capable of assuming dynamic, agentic roles in diverse and complex environments. In this context, this article investigates the immense potential of using LLMs and VLMs as agents, emphasizing models that have a blend of linguistic proficiency, visual cognition, contextual memory, intuitive reasoning, and adaptability. Leveraging LLMs and VLMs as agents, especially within domains like gaming, robotics, and healthcare, promises not just a rigorous evaluation platform for state-of-the-art AI systems, but also foreshadows the transformative impacts that agent-centric AI will have across society and industries. When fully harnessed, agentic models can redefine human experiences and elevate operational standards. The potential for sweeping automation ushered in by these models portends monumental shifts in industries and socio-economic dynamics. Such advancements will be intertwined with multifaceted challenges, not only technical but also ethical, as we will elaborate upon in Section 11. We delve into the overlapping areas of these sub-fields of Agent AI and illustrate their interconnectedness in Fig. 1.

1.2 Background

We will now introduce relevant research papers that support the concepts, theoretical background, and modern implementations of Agent AI.

Large Foundation Models: LLMs and VLMs have been driving the effort to develop general intelligent machines (Bubeck et al., 2023; Mirchandani et al., 2023). Although they are trained using large text corpora, their superior problem-solving capacity is not limited to canonical language processing domains. LLMs can potentially tackle complex tasks that were previously presumed to be exclusive to human experts or domain-specific algorithms, ranging from mathematical reasoning (Imani et al., 2023; Wei et al., 2022; Zhu et al., 2022) to answering questions of professional law (Blair-Stanek et al., 2023; Choi et al., 2023; Nay, 2022). Recent research has shown the possibility of using LLMs to generate complex plans for robots and game AI (Liang et al., 2022; Wang et al., 2023a,b; Yao et al., 2023a; Huang et al., 2023a), marking an important milestone for LLMs as general-purpose intelligent agents.


Embodied AI: A number of works leverage LLMs to perform task planning (Huang et al., 2022a; Wang et al., 2023b; Yao et al., 2023a; Li et al., 2023a), specifically exploiting the LLMs' WWW-scale domain knowledge and emergent zero-shot embodied abilities to perform complex task planning and reasoning. Recent robotics research also leverages LLMs to perform task planning (Ahn et al., 2022a; Huang et al., 2022b; Liang et al., 2022) by decomposing natural language instructions into a sequence of subtasks, either in natural language form or in Python code, then using a low-level controller to execute these subtasks. Additionally, these works incorporate environmental feedback to improve task performance (Huang et al., 2022b; Liang et al., 2022; Wang et al., 2023a; Ikeuchi et al., 2023).
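To make this decompose-then-execute pattern concrete, the following Python sketch shows one way a planner could turn a natural-language instruction into subtasks and run them through a low-level controller, folding environmental feedback back into the LLM on failure. The `llm_complete` callable, the `controller` object and its `execute` signature, and the prompt text are hypothetical placeholders for illustration, not the APIs of any cited system.

```python
from typing import Callable, List

# Hypothetical decomposition prompt; real systems use carefully
# engineered, often few-shot, prompts.
PLANNER_PROMPT = (
    "Decompose the instruction into a numbered list of short, atomic "
    "subtasks that a low-level controller can execute.\n"
    "Instruction: {instruction}\nSubtasks:"
)

def plan_subtasks(llm_complete: Callable[[str], str], instruction: str) -> List[str]:
    """Ask the LLM to decompose an instruction into executable subtasks."""
    response = llm_complete(PLANNER_PROMPT.format(instruction=instruction))
    lines = [ln.strip() for ln in response.splitlines()]
    # Parse lines of the form "1. pick up the red block".
    return [ln.split(".", 1)[1].strip()
            for ln in lines if ln and ln[0].isdigit() and "." in ln]

def execute_with_feedback(llm_complete: Callable[[str], str],
                          controller, instruction: str) -> None:
    """Run subtasks sequentially, replanning on failure via environment feedback."""
    for subtask in plan_subtasks(llm_complete, instruction):
        ok, feedback = controller.execute(subtask)  # assumed (success, message) API
        if not ok:
            # Fold the environment's error message back into the LLM,
            # in the spirit of the feedback loops cited above.
            corrected = llm_complete(
                f"The subtask '{subtask}' failed with feedback: {feedback}. "
                "Propose one corrected subtask:"
            )
            controller.execute(corrected.strip())
```

The cited systems differ mainly in the surface form of this loop: some emit executable Python code rather than natural-language subtasks, and some replan over the entire remaining task rather than over a single failed step.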

Interactive Learning: AI agents designed for interactive learning operate using a combination of machine learning techniques and user interactions. Initially, the AI agent is trained on a large dataset. This dataset includes various types of information, depending on the intended function of the agent. For instance, an AI designed for language tasks would be trained on a massive corpus of text data. The training involves using machine learning algorithms, which could include deep learning models like neural networks. These trained models enable the AI to recognize patterns, make predictions, and generate responses based on the data on which it was trained. The AI agent can also learn from real-time interactions with users. This interactive learning can occur in various ways: 1) Feedback-based learning: The AI adapts its responses based on direct user feedback (Li et al., 2023b; Yu et al., 2023a; Parakh et al., 2023; Zha et al., 2023; Wake et al., 2023a,b,c). For example, if a user corrects the AI's response, the AI can use this information to improve future responses (Zha et al., 2023; Liu et al., 2023a). 2) Observational learning: The AI observes user interactions and learns implicitly. For example, if users frequently ask similar questions or interact with the AI in a particular way, the AI might adjust its responses to better suit these patterns. This allows the AI agent to understand and process human language in multimodal settings, interpret cross-reality contexts, and generate responses for human users. Over time, with more user interactions and feedback, the AI agent's performance generally continues to improve. This process is often supervised by human operators or developers who ensure that the AI is learning appropriately and not developing biases or incorrect patterns.
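As a minimal illustration of these two modes, the Python sketch below shows an agent that adapts from direct user corrections and from observed interaction patterns. The `base_model.generate` interface and the in-memory caching strategy are simplifying assumptions for illustration; production systems typically realize these ideas through fine-tuning or retrieval rather than a dictionary.

```python
from collections import Counter
from typing import List, Tuple

class InteractiveAgent:
    """Toy agent illustrating feedback-based and observational learning."""

    def __init__(self, base_model):
        self.base_model = base_model  # assumed .generate(str) -> str interface
        self.corrections = {}         # query -> user-corrected response

    def respond(self, query: str) -> str:
        # Feedback-based learning: prefer responses the user has corrected.
        if query in self.corrections:
            return self.corrections[query]
        return self.base_model.generate(query)

    def receive_feedback(self, query: str, corrected_response: str) -> None:
        # A direct user correction immediately shapes future responses.
        self.corrections[query] = corrected_response

    def observe(self, interaction_log: List[Tuple[str, str]]) -> None:
        # Observational learning: detect frequently recurring queries and
        # precompute responses so the agent adapts to usage patterns.
        frequent = Counter(query for query, _ in interaction_log).most_common(10)
        for query, count in frequent:
            if count > 2 and query not in self.corrections:
                self.corrections[query] = self.base_model.generate(query)
```

The human supervision mentioned above would sit around `receive_feedback`, vetting corrections before they are committed, to guard against the agent absorbing biased or incorrect patterns.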

1.3 Overview

Multimodal Agent AI (MAA) is a family of systems that generate effective actions in a given environment based on the understanding of multimodal sensory input. With the advent of Large Language Models (LLMs) and Vision-Language Models (VLMs), numerous MAA systems have been proposed in fields ranging from basic research to applications. While these research areas are growing rapidly by integrating with the traditional technologies of each domain (e.g., visual question answering and vision-language navigation), they share common interests such as data collection, benchmarking, and ethical perspectives. In this paper, we focus on some representative research areas of MAA, namely multimodality, gaming (VR/AR/MR), robotics, and healthcare, and we aim to provide comprehensive knowledge of the common concerns discussed in these fields. As a result, we expect readers to learn the fundamentals of MAA and gain insights to further advance their research; a minimal sketch of the MAA perceive-act interface follows the list of learning outcomes below. Specific learning outcomes include:

• MAA Overview: A deep dive into its principles and roles in contemporary applications, providing researchers with a thorough grasp of its importance and uses.

• Methodologies: Detailed examples of how LLMs and VLMs enhance MAAs, illustrated through case studies in gaming, robotics, and healthcare.

• Performance Evaluation: Guidance on the assessment of MAAs with relevant datasets, focusing on their effectiveness and generalization.

• Ethical Considerations: A discussion on the societal impacts and ethical challenges of deploying Agent AI, highlighting responsible development practices.

• Emerging Trends and Future Challenges: A categorization of the latest developments in each domain and a discussion of future directions.
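As noted above, the MAA definition can be pinned down with a minimal interface sketch: a system that fuses multimodal observations into a grounded state and emits an action. The modality names, the types, and the `env.reset()`/`env.step()` protocol below are illustrative assumptions, not a proposed standard.

```python
from typing import Any, Dict, Protocol

# Illustrative observation: named modalities mapped to raw sensor payloads,
# e.g., {"vision": <image array>, "language": "pick up the cup", "audio": <waveform>}.
Observation = Dict[str, Any]

class MultimodalAgent(Protocol):
    """Interface for a Multimodal Agent AI system: perceive, then act."""

    def perceive(self, observation: Observation) -> Any:
        """Fuse multimodal input into an internal, environment-grounded state."""
        ...

    def act(self, state: Any) -> str:
        """Produce an embodied action (here, a string command) from the state."""
        ...

def run_episode(agent: MultimodalAgent, env, max_steps: int = 100) -> None:
    """Generic perceive-act loop over an assumed reset()/step() environment."""
    observation = env.reset()
    for _ in range(max_steps):
        state = agent.perceive(observation)
        action = agent.act(state)
        observation, done = env.step(action)  # assumed (observation, done) return
        if done:
            break
```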

Computer-based action and generalist agents (GAs) are useful for many tasks. For a GA to become truly valuable to its users, however, it must be natural to interact with and able to generalize to a broad range of contexts and modalities. We aim to cultivate a vibrant research ecosystem and create a shared sense of identity and purpose among the Agent AI community. MAA has the potential to be widely applicable across various contexts and modalities, including input from humans. Therefore, we believe this Agent AI area can engage a diverse range of researchers, fostering a dynamic Agent AI community and


shared goals. Led by esteemed experts from academia and industry, we expect that this paper will be an interactive and enriching experience, complete with agent instruction, case studies, task sessions, and experiment discussions, ensuring a comprehensive and engaging learning experience for all researchers.

This paper aims to provide general and comprehensive knowledge about current research in the field of Agent AI. To this end, the rest of the paper is organized as follows. Section 2 outlines how Agent AI benefits from integrating with related emerging technologies, particularly large foundation models. Section 3 describes a new paradigm and framework that we propose for training Agent AI. Section 4 provides an overview of the methodologies that are widely used in the training of Agent AI. Section 5 categorizes and discusses various types of agents. Section 6 introduces Agent AI applications in gaming, robotics, and healthcare. Section 7 explores the research community's efforts to develop a versatile Agent AI capable of being applied across various modalities and domains, and of bridging the sim-to-real gap. Section 8 discusses the potential of Agent AI that not only relies on pre-trained foundation models, but also continuously learns and self-improves by leveraging interactions with the environment and users. Section 9 introduces our new datasets that are designed for the training of multimodal Agent AI. Section 11 discusses ethical considerations for AI agents, limitations, and the societal impact of our paper.

2 Agent AI Integration

Foundation models based on LLMs and VLMs, as proposed in previous research, still exhibit limited performance in the area of embodied AI, particularly in terms of understanding, generating, editing, and interacting within unseen environments or scenarios (Huang et al., 2023a; Zeng et al., 2023). Consequently, these limitations lead to sub-optimal outputs from AI agents. Current agent-centric AI modeling approaches focus on directly accessible and clearly defined data (e.g., text or string representations of the world state) and generally use domain- and environment-independent patterns learned from their large-scale pretraining to predict action outputs for each environment (Xi et al., 2023; Wang et al., 2023c; Gong et al., 2023a; Wu et al., 2023). In (Huang et al., 2023a), we investigate the task of knowledge-guided collaborative and interactive scene generation by combining large foundation models, and show promising results indicating that knowledge-grounded LLM agents can improve the performance of 2D and 3D scene understanding, generation, and editing, alongside other human-agent interactions (Huang et al., 2023a). By integrating an Agent AI framework, large foundation models are able to more deeply understand user input to form a complex and adaptive H
