
Graph Retrieval-Augmented Generation: A Survey

arXiv:2408.08921v2 [cs.AI] 10 Sep 2024

BOCI PENG*, School of Intelligence Science and Technology, Peking University, China

YUN ZHU*, College of Computer Science and Technology, Zhejiang University, China

YONGCHAO LIU, Ant Group, China

XIAOHE BO, Gaoling School of Artificial Intelligence, Renmin University of China, China

HAIZHOU SHI, Rutgers University, US

CHUNTAO HONG, Ant Group, China

YAN ZHANG†, School of Intelligence Science and Technology, Peking University, China

SILIANG TANG, College of Computer Science and Technology, Zhejiang University, China

Recently, Retrieval-Augmented Generation (RAG) has achieved remarkable success in addressing the challenges of Large Language Models (LLMs) without necessitating retraining. By referencing an external knowledge base, RAG refines LLM outputs, effectively mitigating issues such as “hallucination”, lack of domain-specific knowledge, and outdated information. However, the complex structure of relationships among different entities in databases presents challenges for RAG systems. In response, GraphRAG leverages structural information across entities to enable more precise and comprehensive retrieval, capturing relational knowledge and facilitating more accurate, context-aware responses. Given the novelty and potential of GraphRAG, a systematic review of current technologies is imperative. This paper provides the first comprehensive overview of GraphRAG methodologies. We formalize the GraphRAG workflow, encompassing Graph-Based Indexing, Graph-Guided Retrieval, and Graph-Enhanced Generation. We then outline the core technologies and training methods at each stage. Additionally, we examine downstream tasks, application domains, evaluation methodologies, and industrial use cases of GraphRAG. Finally, we explore future research directions to inspire further inquiries and advance progress in the field. In order to track recent progress in this field, we set up a repository at /pengboci/GraphRAG-Survey.

CCS Concepts: • Computing methodologies → Knowledge representation and reasoning; • Information systems → Information retrieval; Data mining.

Additional Key Words and Phrases: Large Language Models, Graph Retrieval-Augmented Generation, Knowledge Graphs, Graph Neural Networks

*Both authors contributed equally to this research. †Corresponding Author.

Authors’ Contact Information: Boci Peng, School of Intelligence Science and Technology, Peking University, Beijing, China, bcpeng@; Yun Zhu, College of Computer Science and Technology, Zhejiang University, Hangzhou, China, zhuyun_dcd@; Yongchao Liu, Ant Group, Hangzhou, China, yongchao.ly@; Xiaohe Bo, Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China, bellebxh@; Haizhou Shi, Rutgers University, New Brunswick, New Jersey, US, haizhou.shi@; Chuntao Hong, Ant Group, Hangzhou, China, chuntao.hct@; Yan Zhang, School of Intelligence Science and Technology, Peking University, Beijing, China, zhyzhy001@; Siliang Tang, College of Computer Science and Technology, Zhejiang University, Hangzhou, China, siliang@.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@.

© 2024 Copyright held by the owner/author(s). Publication rights licensed to ACM. ACM 1557-735X/2024/9-ART111

/XXXXXXX.XXXXXXX

J. ACM, Vol. 37, No. 4, Article 111. Publication date: September 2024.

ACM Reference Format:

Boci Peng, Yun Zhu, Yongchao Liu, Xiaohe Bo, Haizhou Shi, Chuntao Hong, Yan Zhang, and Siliang Tang. 2024. Graph Retrieval-Augmented Generation: A Survey. J. ACM 37, 4, Article 111 (September 2024), 41 pages. /XXXXXXX.XXXXXXX

1 Introduction

The development of Large Language Models like GPT-4 [127], Qwen2 [184], and LLaMA [31] has sparked a revolution in the field of artificial intelligence, fundamentally altering the landscape of natural language processing. These models, built on Transformer [161] architectures and trained on diverse and extensive datasets, have demonstrated unprecedented capabilities in understanding, interpreting, and generating human language. The impact of these advancements is profound, stretching across various sectors including healthcare [103, 166, 203], finance [93, 125], and education [46, 169], where they facilitate more nuanced and efficient interactions between humans and machines.

Despite their remarkable language comprehension and text generation capabilities, LLMs may exhibit limitations due to a lack of domain-specific knowledge, real-time updated information, and proprietary knowledge, which are outside LLMs’ pre-training corpus. These gaps can lead to a phenomenon known as “hallucination” [61], where the model generates inaccurate or even fabricated information. Consequently, it is imperative to supplement LLMs with external knowledge to mitigate this problem. Retrieval-Augmented Generation (RAG) [34, 45, 59, 62, 178, 195, 202] emerged as a significant evolution, which aims to enhance the quality and relevance of generated content by integrating a retrieval component within the generation process. The essence of RAG lies in its ability to dynamically query a large text corpus to incorporate relevant factual knowledge into the responses generated by the underlying language models. This integration not only enriches the contextual depth of the responses but also ensures a higher degree of factual accuracy and specificity. RAG has gained widespread attention due to its exceptional performance and broad applications, becoming a key focus within the field.

Although RAG has achieved impressive results and has been widely applied across various domains, it faces limitations in real-world scenarios: (1) Neglecting Relationships: In practice, textual content is not isolated but interconnected. Traditional RAG fails to capture significant structured relational knowledge that cannot be represented through semantic similarity alone. For instance, in a citation network where papers are linked by citation relationships, traditional RAG methods focus on finding the relevant papers based on the query but overlook important citation relationships between papers. (2) Redundant Information: RAG often recounts content in the form of textual snippets when concatenated as prompts. This makes the context excessively long, leading to the “lost in the middle” dilemma [104]. (3) Lacking Global Information: RAG can only retrieve a subset of documents and fails to grasp global information comprehensively, and hence struggles with tasks such as Query-Focused Summarization (QFS).

Graph Retrieval-Augmented Generation (GraphRAG) [32, 58, 119] emerges as an innovative solution to address these challenges. Unlike traditional RAG, GraphRAG retrieves graph elements containing relational knowledge pertinent to a given query from a pre-constructed graph database, as depicted in Figure 1. These elements may include nodes, triples, paths, or subgraphs, which are utilized to generate responses. GraphRAG considers the interconnections between texts, enabling a more accurate and comprehensive retrieval of relational information. Additionally, graph data, such as knowledge graphs, offer abstraction and summarization of textual data, thereby significantly shortening the length of the input text and mitigating concerns of verbosity. By retrieving subgraphs or graph communities, we can access comprehensive information to effectively address the QFS challenge by capturing the broader context and interconnections within the graph structure.

[Figure 1: three panels for the query “How did the artistic movements of the 19th century impact the development of modern art in the 20th century?”. Direct LLM: a generic response about experimentation with color, form, and subject matter. RAG: four retrieved text snippets about Monet, Impressionist techniques, Picasso, and Cubism, followed by a response that restates them without linking them. GraphRAG: the retrieved triplets (Pablo Picasso)-[pioneered]→(Cubism), (Cubism)-[emerged in]→(early 20th century), (Claude Monet)-[introduced]→(new techniques), (new techniques)-[revolutionized]→(depiction of light and color), and (Impressionist techniques)-[influenced]→(later art movements), followed by a response that connects Monet's Impressionist techniques to Picasso's Cubism.]

Fig. 1. Comparison between Direct LLM, RAG, and GraphRAG. Given a user query, direct answering by LLMs may suffer from shallow responses or lack of specificity. RAG addresses this by retrieving relevant textual information, somewhat alleviating the issue. However, due to the text’s length and the flexible natural language expressions of entity relationships, RAG struggles to emphasize “influence” relations, which are the core of the question. In contrast, GraphRAG methods leverage explicit entity and relationship representations in graph data, enabling precise answers by retrieving relevant structured information.
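To make the GraphRAG panel of Figure 1 concrete, the following minimal sketch stores the five triplets from the figure and enumerates short relation paths from a starting entity. The function names `build_index` and `paths_from` are hypothetical helpers introduced only for illustration; they are not part of any system discussed in this survey.

```python
from collections import defaultdict

# Triplets as shown in the GraphRAG panel of Figure 1: (head, relation, tail).
TRIPLETS = [
    ("Pablo Picasso", "pioneered", "Cubism"),
    ("Cubism", "emerged in", "early 20th century"),
    ("Claude Monet", "introduced", "new techniques"),
    ("new techniques", "revolutionized", "depiction of light and color"),
    ("Impressionist techniques", "influenced", "later art movements"),
]


def build_index(triplets):
    """Map each head entity to its outgoing (relation, tail) pairs."""
    index = defaultdict(list)
    for head, relation, tail in triplets:
        index[head].append((relation, tail))
    return index


def paths_from(index, start, max_hops=2):
    """Enumerate all relation paths with 1..max_hops edges starting at `start`."""
    paths = []
    frontier = [(start, [])]
    for _ in range(max_hops):
        next_frontier = []
        for node, path in frontier:
            for relation, tail in index.get(node, []):
                new_path = path + [(node, relation, tail)]
                paths.append(new_path)
                next_frontier.append((tail, new_path))
        frontier = next_frontier
    return paths


index = build_index(TRIPLETS)
monet_paths = paths_from(index, "Claude Monet")
for path in monet_paths:
    print(" ; ".join(f"({h})-[{r}]->({t})" for h, r, t in path))
```

Because each relation is an explicit edge label, a retriever can follow or filter on a relation such as "influenced" directly, instead of hoping that the relation surfaces in a retrieved text snippet.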

In this paper, we are the first to provide a systematic survey of GraphRAG. Specifically, we begin by introducing the GraphRAG workflow, along with the foundational background knowledge that underpins the field. Then, we categorize the literature according to the primary stages of the GraphRAG process: Graph-Based Indexing (G-Indexing), Graph-Guided Retrieval (G-Retrieval), and Graph-Enhanced Generation (G-Generation) in Section 5, Section 6 and Section 7 respectively, detailing the core technologies and training methods within each phase. Furthermore, we investigate downstream tasks, application domains, evaluation methodologies, and industrial use cases of GraphRAG. This exploration elucidates how GraphRAG is being utilized in practical settings and reflects its versatility and adaptability across various sectors. Finally, acknowledging that research in GraphRAG is still in its early stages, we delve into potential future research directions. This prognostic discussion aims to pave the way for forthcoming studies, inspire new lines of inquiry, and catalyze progress within the field, ultimately propelling GraphRAG toward more mature and innovative horizons.

Our contributions can be summarized as follows:

• We provide a comprehensive and systematic review of existing state-of-the-art GraphRAG methodologies. We offer a formal definition of GraphRAG, outlining its universal workflow which includes G-Indexing, G-Retrieval, and G-Generation.

• We discuss the core technologies underpinning existing GraphRAG systems, including G-Indexing, G-Retrieval, and G-Generation. For each component, we analyze the spectrum of model selection, methodological design, and enhancement strategies currently being explored. Additionally, we contrast the diverse training methodologies employed across these modules.

• We delineate the downstream tasks, benchmarks, application domains, evaluation metrics, current challenges, and future research directions pertinent to GraphRAG, discussing both the progress and prospects of this field. Furthermore, we compile an inventory of existing industry GraphRAG systems, providing insights into the translation of academic research into real-world industry solutions.

[Figure 2: pipeline diagram. An input query enters G-Retrieval, where a retriever, supported by query enhancements (query expansion, query decomposition) and knowledge enhancements (merging, pruning), searches a graph database built during G-Indexing from open knowledge graphs or self-constructed graph data. The retrieval results (nodes, triplets, paths, subgraphs, or hybrid) are converted into a graph format (adjacency/edge table, natural language, syntax tree, node sequence, code-like forms, or graph embedding) and passed to the generator in G-Generation, which may apply pre-generation, mid-generation, and post-generation enhancements before producing the output response.]

Fig. 2. The overview of the GraphRAG framework for the question answering task. In this survey, we divide GraphRAG into three stages: G-Indexing, G-Retrieval, and G-Generation. We categorize the retrieval sources into open-source knowledge graphs and self-constructed graph data. Various enhancing techniques like query enhancement and knowledge enhancement may be adopted to boost the relevance of the results. Unlike RAG, which uses retrieved text directly for generation, GraphRAG requires converting the retrieved graph information into patterns acceptable to generators to enhance the task performance.

Organization. The rest of the survey is organized as follows: Section 2 compares related techniques, while Section 3 outlines the general process of GraphRAG. Sections 5 to 7 categorize the techniques associated with GraphRAG’s three stages: G-Indexing, G-Retrieval, and G-Generation. Section 8 introduces the training strategies of retrievers and generators. Section 9 summarizes GraphRAG’s downstream tasks, corresponding benchmarks, application domains, evaluation metrics, and industrial GraphRAG systems. Section 10 provides an outlook on future directions. Finally, Section 11 concludes the content of this survey.

2 Comparison with Related Techniques and Surveys

In this section, we compare Graph Retrieval-Augmented Generation (GraphRAG) with related techniques and corresponding surveys, including RAG, LLMs on graphs, and Knowledge Base Question Answering (KBQA).

2.1 RAG

RAG combines external knowledge with LLMs for improved task performance, integrating domain-specific information to ensure factuality and credibility. In the past two years, researchers have written many comprehensive surveys about RAG [34, 45, 59, 62, 178, 195, 202]. For example, Fan et al. [34] and Gao et al. [45] categorize RAG methods from the perspectives of retrieval, generation, and augmentation. Zhao et al. [202] review RAG methods for databases with different modalities. Yu et al. [195] systematically summarize the evaluation of RAG methods. These works provide a structured synthesis of current RAG methodologies, fostering a deeper understanding and suggesting future directions of the area.

From a broad perspective, GraphRAG can be seen as a branch of RAG, which retrieves relevant relational knowledge from graph databases instead of text corpora. However, compared to text-based RAG, GraphRAG takes into account the relationships between texts and incorporates the structural information as additional knowledge beyond text. Furthermore, during the construction of graph data, raw text data may undergo filtering and summarization processes, enhancing the refinement of information within the graph data. Although previous surveys on RAG have touched upon GraphRAG, they predominantly center on textual data integration. This paper diverges by placing a primary emphasis on the indexing, retrieval, and utilization of structured graph data, which represents a substantial departure from handling purely textual information and spurs the emergence of many new techniques.

2.2 LLMs on Graphs

LLMs are revolutionizing natural language processing due to their excellent text understanding, reasoning, and generation capabilities, along with their generalization and zero-shot transfer abilities. Although LLMs are primarily designed to process pure text and struggle with non-Euclidean data containing complex structural information, such as graphs [49, 165], numerous studies [17, 35, 74, 92, 102, 116, 130, 131, 173, 204] have been conducted in these fields. These papers primarily integrate LLMs with GNNs to enhance modeling capabilities for graph data, thereby improving performance on downstream tasks such as node classification, edge prediction, graph classification, and others. For example, Zhu et al. [204] propose an efficient fine-tuning method named ENGINE, which combines LLMs and GNNs through a side structure for enhancing graph representation.

Different from these methods, GraphRAG focuses on retrieving relevant graph elements using queries from an external graph-structured database. In this paper, we provide a detailed introduction to the relevant technologies and applications of GraphRAG, which are not included in previous surveys of LLMs on Graphs.

2.3 KBQA

KBQA is a significant task in natural language processing, aiming to respond to user queries based on external knowledge bases [41, 85, 86, 188], thereby achieving goals such as fact verification, passage retrieval enhancement, and text understanding. Previous surveys typically categorize existing KBQA approaches into two main types: Information Retrieval (IR)-based methods and Semantic Parsing (SP)-based methods. Specifically, IR-based methods [69, 70, 112, 113, 154, 167, 182, 196] retrieve information related to the query from the knowledge graph (KG) and use it to enhance the generation process. SP-based methods [16, 19, 36, 48, 153, 191], in contrast, generate a logical form (LF) for each query and execute it against knowledge bases to obtain the answer.

GraphRAG and KBQA are closely related, with IR-based KBQA methods representing a subset of GraphRAG approaches focused on downstream applications. In this work, we extend the discussion beyond KBQA to include GraphRAG’s applications across various downstream tasks. Our survey provides a thorough and detailed exploration of GraphRAG technology, offering a comprehensive understanding of existing methods and potential improvements.

3 Preliminaries

In this section, we introduce background knowledge of GraphRAG for easier comprehension of our survey. First, we introduce Text-Attributed Graphs, a universal and general format of graph data used in GraphRAG. Then, we provide formal definitions for two types of models that can be used in the retrieval and generation stages: Graph Neural Networks and Language Models.

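3.1 Text-Attributed Graphs

The graph data used in GraphRAG can be represented uniformly as Text-Attributed Graphs (TAGs), where nodes and edges possess textual attributes. Formally, a text-attributed graph can be denoted as $\mathcal{G} = (\mathcal{V}, \mathcal{E}, \mathcal{A}, \{x_u\}_{u \in \mathcal{V}}, \{e_{i,j}\}_{i,j \in \mathcal{E}})$, where $\mathcal{V}$ is the set of nodes, $\mathcal{E} \subseteq \mathcal{V} \times \mathcal{V}$ is the set of edges, and $\mathcal{A} \in \{0,1\}^{|\mathcal{V}| \times |\mathcal{V}|}$ is the adjacency matrix. Additionally, $\{x_u\}_{u \in \mathcal{V}}$ and $\{e_{i,j}\}_{i,j \in \mathcal{E}}$ are the textual attributes of nodes and edges, respectively. One typical kind of TAG is the Knowledge Graph (KG), where nodes are entities, edges are relations among entities, and text attributes are the names of entities and relations.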
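As an illustration of the definition above, the sketch below stores a tiny knowledge graph as a TAG and derives its adjacency matrix. The class name `TextAttributedGraph` and its attribute names are assumptions made for this example, and the two edges are borrowed from Figure 1.

```python
class TextAttributedGraph:
    """Hypothetical container for a TAG: nodes and edges both carry text."""

    def __init__(self, node_text, edge_text):
        self.node_text = dict(node_text)  # node u -> textual attribute x_u
        self.edge_text = dict(edge_text)  # edge (i, j) -> textual attribute e_{i,j}

    def adjacency_matrix(self):
        """Return (node order, 0/1 matrix A) with A[i][j] = 1 iff edge (i, j) exists."""
        order = sorted(self.node_text)
        position = {u: k for k, u in enumerate(order)}
        matrix = [[0] * len(order) for _ in order]
        for (i, j) in self.edge_text:
            matrix[position[i]][position[j]] = 1
        return order, matrix


# A two-edge knowledge graph: entity names on nodes, relation names on edges.
kg = TextAttributedGraph(
    node_text={0: "Claude Monet", 1: "new techniques", 2: "depiction of light and color"},
    edge_text={(0, 1): "introduced", (1, 2): "revolutionized"},
)
order, A = kg.adjacency_matrix()
for row in A:
    print(row)
```

The matrix is dense here only for readability; for large graphs a sparse or adjacency-list representation would usually be preferred.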

3.2 Graph Neural Networks

Graph Neural Networks (GNNs) are a kind of deep learning framework to model graph data. Classical GNNs, e.g., GCN [83], GAT [162], and GraphSAGE [52], adopt a message-passing manner to obtain node representations. Formally, each node representation $h_i^{(l-1)}$ in the $l$-th layer is updated by aggregating the information from neighboring nodes and edges:

$$h_i^{(l)} = \mathrm{UPD}\Big(h_i^{(l-1)},\ \mathrm{AGG}_{j \in \mathcal{N}(i)}\ \mathrm{MSG}\big(h_i^{(l-1)}, h_j^{(l-1)}, e_{i,j}^{(l-1)}\big)\Big), \qquad (1)$$

where $\mathcal{N}(i)$ represents the neighbors of node $i$. MSG denotes the message function, which computes the message based on the node, its neighbor, and the edge between them. AGG refers to the aggregation function that combines the received messages using a permutation-invariant method, such as mean, sum, or max. UPD represents the update function, which updates each node’s attributes with the aggregated messages.
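One layer of Equation 1 can be sketched in plain Python as follows. The concrete choices are assumptions made only for illustration: MSG adds the neighbor state and the edge feature, AGG is the mean, and UPD averages the old state with the aggregated message. Real GNNs such as GCN or GAT use learnable weights in place of these fixed functions.

```python
def message_passing_layer(h, neighbors, e):
    """Apply one message-passing step.

    h:         dict node -> list[float], the states h_i^(l-1)
    neighbors: dict node -> list of neighbor nodes N(i)
    e:         dict (i, j) -> list[float], edge features e_{i,j}^(l-1)
    """
    new_h = {}
    for i, h_i in h.items():
        # MSG (assumed): neighbor state plus edge feature, element-wise.
        messages = [
            [hj_k + e_k for hj_k, e_k in zip(h[j], e[(i, j)])]
            for j in neighbors.get(i, [])
        ]
        # AGG (assumed): permutation-invariant mean over incoming messages.
        if messages:
            agg = [sum(col) / len(messages) for col in zip(*messages)]
        else:
            agg = [0.0] * len(h_i)
        # UPD (assumed): average of the previous state and the aggregate.
        new_h[i] = [0.5 * (a + b) for a, b in zip(h_i, agg)]
    return new_h


h0 = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
neighbors = {0: [1, 2], 1: [0], 2: [0]}
zero = [0.0, 0.0]
edges = {(0, 1): zero, (0, 2): zero, (1, 0): zero, (2, 0): zero}
h1 = message_passing_layer(h0, neighbors, edges)
print(h1)
```

Since the mean does not depend on the order in which neighbors are listed, reordering `neighbors[0]` leaves the result unchanged, which is the permutation invariance required of AGG.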

Subsequently, a readout function, e.g., mean, sum, or max pooling, can be applied to obtain the global-level representation:

$$h_G = \mathrm{READOUT}_{i \in \mathcal{V}_G}\big(h_i^{(L)}\big). \qquad (2)$$

In GraphRAG, GNNs can be utilized to obtain representations of graph data for the retrieval phase, as well as to model the retrieved graph structures.
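The readout in Equation 2 can be sketched as a pooling over final-layer node states. The function name `readout` and the example vectors are hypothetical; the three modes correspond to the mean, sum, and max pooling options mentioned above.

```python
def readout(node_states, mode="mean"):
    """Pool final-layer node states h_i^(L) into one graph-level vector h_G."""
    columns = list(zip(*node_states.values()))  # one tuple per feature dimension
    if mode == "mean":
        return [sum(col) / len(col) for col in columns]
    if mode == "sum":
        return [sum(col) for col in columns]
    if mode == "max":
        return [max(col) for col in columns]
    raise ValueError(f"unknown readout mode: {mode}")


# Illustrative final-layer states for a three-node graph.
final_states = {0: [0.75, 0.5], 1: [0.5, 0.5], 2: [1.0, 0.5]}
h_G_mean = readout(final_states, "mean")
h_G_sum = readout(final_states, "sum")
h_G_max = readout(final_states, "max")
print(h_G_mean, h_G_sum, h_G_max)
```

A graph-level vector of this kind is one way a retriever could compare a candidate subgraph against a query embedding, although the survey does not prescribe a particular pooling choice.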

3.3 Language Models

Language models (LMs) excel in language understanding and are mainly classified into two types: discriminative and generative. Discriminative models, like BERT [28], RoBERTa [107] and Sentence-BERT [140], focus on estimating the conditional probability $p(y \mid x)$ and are effective in tasks such as text classification and sentiment analysis. In contrast, generative models, including GPT-3 [14] and GPT-4 [127], aim to model the joint probability $p(x, y)$ for tasks like machine translation and text generation. These generative pre-trained models have significantly advanced the field of natural language processing (NLP) by leveraging massive datasets and billions of parameters, contributing to the rise of Large Language Models (LLMs) with outstanding performance across various tasks.

In the early stages, RAG and GraphRAG focused on improving pre-training techniques for discriminative language models [28, 107, 140]. Recently, LLMs such as ChatGPT [128], LLaMA [31], and Qwen2 [184] have shown great potential in language understanding, demonstrating powerful in-context learning capabilities. Subsequently, research on RAG and GraphRAG shifted towards enhancing information retrieval for language models, addressing increasingly complex tasks and mitigating hallucinations, thereby driving rapid advancements in the field.

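4 Overview of GraphRAG

GraphRAG is a framework that leverages external structured knowledge graphs to improve contextual understanding of LMs and generate more informed responses, as depicted in Figure 2. The goal of GraphRAG is to retrieve the most relevant knowledge from databases, thereby enhancing the answers of downstream tasks. The process can be defined as

$$a^* = \arg\max_{a \in A}\ p(a \mid q, \mathcal{G}), \qquad (3)$$

where $a^*$ is the optimal answer of the query $q$ given the TAG $\mathcal{G}$, and $A$ is the set of possible responses. After that, we jointly model the target distribution $p(a \mid q, \mathcal{G})$ with a graph retriever $p_\theta(G \mid q, \mathcal{G})$ and an answer generator $p_\phi(a \mid q, G)$, where $\theta, \phi$ are learnable parameters and $G \subseteq \mathcal{G}$ denotes a candidate subgraph, and utilize the total probability formula to decompose $p(a \mid q, \mathcal{G})$, which can be formulated as

$$\begin{aligned} p(a \mid q, \mathcal{G}) &= \sum_{G \subseteq \mathcal{G}} p_\phi(a \mid q, G)\, p_\theta(G \mid q, \mathcal{G}) \\ &\approx p_\phi(a \mid q, G^*)\, p_\theta(G^* \mid q, \mathcal{G}), \end{aligned} \qquad (4)$$

where $G^*$ is the optimal subgraph. Because the number of candidate subgraphs can grow exponentially with the size of the graph, efficient approximation methods are necessary. The first line of Equation 4 is thus approximated by the second line. Specifically, a graph retriever is employed to extract the optimal subgraph $G^*$, after which the generator produces the answer based on the retrieved subgraph.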
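The approximation in Equation 4 can be checked numerically on a toy case. All probabilities below are invented for illustration, and the names `retriever_prob`, `generator_prob`, `exact_marginal`, and `top1_approximation` are hypothetical; the point is only to contrast summing over every candidate subgraph with keeping the single best one.

```python
# Made-up retriever scores over three candidate subgraphs (they sum to 1).
retriever_prob = {"G1": 0.7, "G2": 0.2, "G3": 0.1}

# Made-up generator scores for two candidate answers given each subgraph.
generator_prob = {
    "G1": {"yes": 0.9, "no": 0.1},
    "G2": {"yes": 0.4, "no": 0.6},
    "G3": {"yes": 0.5, "no": 0.5},
}
ANSWERS = ["yes", "no"]


def exact_marginal(answer):
    """Sum over all candidate subgraphs: generator score times retriever score."""
    return sum(retriever_prob[g] * generator_prob[g][answer] for g in retriever_prob)


def top1_approximation(answer):
    """Keep only the highest-scoring subgraph G* chosen by the retriever."""
    g_star = max(retriever_prob, key=retriever_prob.get)
    return generator_prob[g_star][answer] * retriever_prob[g_star]


best_exact = max(ANSWERS, key=exact_marginal)
best_approx = max(ANSWERS, key=top1_approximation)
print(exact_marginal("yes"), top1_approximation("yes"), best_exact, best_approx)
```

In this toy setting the approximation underestimates the probability of each answer but still selects the same best answer. That agreement is a property of these particular numbers, not a guarantee; the approximation is most reasonable when the retriever concentrates its probability mass on one subgraph.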

Therefore, in this survey, we decompose the entire process of GraphRAG into three main stages: Graph-Based Indexing, Graph-Guided Retrieval, and Graph-Enhanced Generation. The overall workflow of GraphRAG is illustrated in Figure 2, and detailed introductions of each stage are as follows.

Graph-Based Indexing (G-Indexing). Graph-Based Indexing constitutes the initial phase of GraphRAG, aimed at identifying or constructing a graph database $\mathcal{G}$ that aligns with downstream tasks and establishing indices on it. The graph database can originate from public knowledge graphs [4, 10, 100, 142, 150, 163], graph data [123], or be constructed based on proprietary data sources such as textual [32, 51, 89, 172] or other forms of data [183]. The indexing process typically includes mapping node and edge properties, establishing pointers between connected nodes, and organizing data to support fast traversal and retrieval operations. Indexing determines the granularity of the subsequent retrieval stage, playing a crucial role in enhancing query efficiency.
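The three indexing steps named above (mapping properties, establishing pointers between connected nodes, and organizing data for fast lookup) can be sketched with in-memory dictionaries. The class `GraphIndex`, its methods, and the node identifiers are hypothetical and stand in for what a real graph database would provide; the entities are borrowed from Figure 1.

```python
from collections import defaultdict


class GraphIndex:
    """Hypothetical in-memory index over a small text-attributed graph."""

    def __init__(self):
        self.node_props = {}                     # node id -> text property
        self.out_edges = defaultdict(list)       # node id -> [(relation, target id)]
        self.in_edges = defaultdict(list)        # node id -> [(relation, source id)]
        self.token_to_nodes = defaultdict(set)   # lowercase token -> node ids

    def add_node(self, node_id, text):
        """Map the node property and index its tokens for keyword lookup."""
        self.node_props[node_id] = text
        for token in text.lower().split():
            self.token_to_nodes[token].add(node_id)

    def add_edge(self, src, relation, dst):
        """Store pointers in both directions so traversal is a dictionary lookup."""
        self.out_edges[src].append((relation, dst))
        self.in_edges[dst].append((relation, src))

    def lookup(self, keyword):
        """Return ids of nodes whose text contains the keyword as a token."""
        return sorted(self.token_to_nodes.get(keyword.lower(), set()))

    def neighbors(self, node_id):
        """Return (outgoing, incoming) edge lists for one node."""
        return list(self.out_edges.get(node_id, [])), list(self.in_edges.get(node_id, []))


idx = GraphIndex()
idx.add_node("n1", "Pablo Picasso")
idx.add_node("n2", "Cubism")
idx.add_node("n3", "early 20th century")
idx.add_edge("n1", "pioneered", "n2")
idx.add_edge("n2", "emerged in", "n3")

hits = idx.lookup("cubism")
outgoing, incoming = idx.neighbors(hits[0])
print(hits, outgoing, incoming)
```

What gets indexed fixes what can later be retrieved: this sketch indexes individual nodes and edges, so retrieval granularity is at the level of entities and triplets rather than, for example, pre-summarized communities.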

Graph-Guided Retrieval (G-Retrieval). Following graph-based indexing, the graph-guided retrieval phase focuses on extracting pertinent information from the graph database in response to user queries or input. Specifically, given a user query $q$ which is expressed in natural language, the retrieval stage aims to extract the most relevant elements (e.g., entities, triplets, paths, subgraphs) from knowledge graphs, which can be formulated as

$$G^* = \text{G-Retriever}(q, \mathcal{G}) \qquad (5)$$

where $G^*$ is the optimal retrieved graph elements and
