作業(yè)信息檢索lecture

上傳人：1*** IP屬地：廣東上傳時(shí)間：2024-11-25 格式：PPT 頁數(shù)：46 大小：556.41KB 積分：24 舉報(bào) 版權(quán)申訴

已閱讀5頁，還剩41頁未讀，繼續(xù)免費(fèi)閱讀

版權(quán)說明：本文檔由用戶提供并上傳，收益歸屬內(nèi)容提供方，若內(nèi)容存在侵權(quán)，請(qǐng)進(jìn)行舉報(bào)或認(rèn)領(lǐng)

文檔簡介

Lecture4:IndexConstruction·

Lastlecture

:·

Wildcards

○

○○

○

○·

Spellcorrection·

Soundex

mace

→madden

□〉

among

amortize

This

time

abandon

among

Indexconstruction·

Dictionarydatastructures

a-hu

y-m

n-z·

Tolerant

retrieval

0PlanIndexconstruction·

Howdo

weconstructanindex?·

Whatstrategiescan

weuse

withlimited

mainmemory?Hardwarebasics·

Manydesigndecisionsininformationretrieval

basedonthe

characteristicsofhardware·

WebeginbyreviewinghardwarebasicsHardwarebasics·

Access

data

memory

ismuchfaster

than

acces

data

disk.·

Diskseeks

Nodataistransferredfromdisk

while

diskheadisbeingpositioned.·

Therefore

:Transferringonelargechunkofdataf

diskto

memoryisfasterthantransferring

manysmall

chunks.·

Disk

I/O

block-based

:Reading

and

writing

enblocks

(as

opposed

smaller

chunks).·

Blocksizes

:8KBto256KB.Hardwarebasics·ServersusedinIR

systemsnowtypicallyhaveseve

GBofmainmemory,sometimestensofGB.·

Available

disk

space

several

(2–

3)orders

magnitudelarger.·

Faulttoleranceisveryexpensive

:It

’

smuchcheto

use

many

regular

machines

rather

than

one

faulttolerant

machine.Hardwareassumptionsforthislectur·

symbol

statisticvalue·s

average

seektime5ms=5x10-3

s·

transfer

time

per

byte0.02

μs=2x10-s·

processor

’

sclock

rate

109

s-1·

low-level

operation

0.01μs=10-8

s(e.g.,compare&swap

word)·sizeofmainmemoryseveralGB·

size

disk

space1TBorRCV1:Our

collectionfor

thislectur·Shakespeare

’

scollected

worksdefinitelyaren

’enough

for

demonstrating

many

the

points

this

course.·

Thecollection

’ll

use

isn

’t

reallylargeeno

either,

butit

’

spubliclyavailableandisatleasmore

plausible

example.·

example

for

applying

scalable

indexconstructionalgorithms,

willusetheReuters

RCV1

collection.·

This

one

year

Reuters

newswire

(part

of1995

and

1996)AReutersRCV1document·symbolstatisticvalue·Ndocuments800,000·Lavg.#tokensperdoc200·Mterms(=wordtypes)400,000·avg.#bytespertoken6(incl.spaces/punct.)·avg.#bytespertoken4.5(withoutspaces/punct.)·avg.#bytesperterm7.5ReutersRCV1

statisticsRecallIIR1indexconstructionDoc2SoletitbewithCaesar.ThenobleBrutushathtoldyouCaesarwasambitiousDoc

1IdidenactJulius

Caesar

Iwaskilled

i"theCapitol

;Brutuskilledme.·

Documentsareparsedtoextract

wordsandthese

aresaved

withtheDocumentID.Wefocusonthissortstep.

Wehave

100M

itemsto

sort.·After

all

documentshavebeen

parsed,theinvertedfileissorted

terms.Key

step·In-memoryindexconstructiondoesnotscale·Can’tstuffentirecollectionintomemory,sort,theback·Howcanweconstructanindexforverylargecollections?·Takingintoaccountthehardwareconstraintswejlearnedabout...·Memory,disk,speed,etc.ScalingindexconstructionSort-basedindexconstruction·Aswebuildtheindex,weparsedocsoneatatime.·Whilebuildingtheindex,wecannoteasilyexploitcompressiontricks(youcan,butmuchmorecomplex)·Thefinalpostingsforanytermarepleteuntiltheend·At12bytespernon-positionalpostingsentry(term,docdemandsalotofspaceforlargecollections.·T=100,000,000inthecaseofRCV1·So…wecandothisinmemoryin2009,buttypicalcollecaremuchlarger.E.g.,theNewYorkTimesprovidesanindeof>150yearsofnewswire·Thus:Weneedtostoreintermediateresultsondisk.Sort

using

disk

“memory

”

?·Canweusethesame

index

constructionalgorithm

forlargercollections,

but

usingdiskinsteado

memory?·

No:Sorting

T=100,000,000records

disk

tooslow

–toomanydiskseeks.·

Weneedanexternal

sortingalgorithm.·

Parse

and

build

postings

entries

one

doc

time·

Now

sort

postings

entries

term

(then

docwithineachterm)·

Doing

this

with

random

disk

seeks

would

too

slo–mustsortT=

100MrecordsIfeverycomparisontook2

disk

seeks,

andN

items

could

sortedwithN

log2N

comparisons,how

longwouldthistake?BottleneckBSBI:Blocked

sort-based

Indexing

(Sorting

withfewerdiskseeks)·12-byte

(4+4+4)records

(term,

doc,

freq).·

Thesearegeneratedasweparsedocs.·

Mustnowsort

100Msuch12-byterecordsbyterm.·

DefineaBlock~

10Msuchrecords·

Caneasily

fitacoupleinto

memory.·

Will

have10such

blockstostart

with.·

Basicideaofalgorithm

:·

Accumulate

postingsforeach

block,sort,

writetodi·

Then

mergethe

blocksintoonelongsortedorder.·Can

binary

merges,

with

merge

tree

log210=4layer·During

each

layer,read

into

memory

runs

blocks

of10M

merge,

write

back.Howtomergethesortedruns?DiskRunsbeing

merged.2

Mergedrun.1

234431Howtomergethesortedruns?·Butitis

moreefficienttodoa

multi-way

merge,

whereyo

are

readingfromall

blockssimultaneously·Providingyou

readdecent-sizedchunksofeach

blockin

tmemory

and

then

write

out

decent-sized

output

chunk,thenyou

’re

not

killed

bydiskseeksRemainingproblemwithsort-basedalgorithm·Ourassumptionwas:wecankeepthedictionaryinmemory.·Weneedthedictionary(whichgrowsdynamically)ordertoimplementatermtotermIDmapping.·Actually,wecouldworkwithterm,docIDpostingsinsteadoftermID,docIDpostings...·...butthenintermediatefileseverylarge.wouldendupwithascalable,butveryslowindexconstructionmethod.)SPIMI:Single-passin-memoryindexing·Keyidea1:Generateseparatedictionariesforeablock–noneedtomaintainterm-termIDmappingacrossblocks.·Keyidea2:Don’tsort.Accumulatepostingsinpostingslistsastheyoccur.·Withthesetwoideaswecangenerateacompleteinvertedindexforeachblock.·Theseseparateindexescanthenbemergedintoonebigindex.·

Merging

blocks

analogous

BSBI.SPIMI-Invert·CompressionmakesSPIMIevenmoreefficient.·

Compression

terms·

Compressionof

postings·

SeenextlectureSPIMI:CompressionDistributedindexing·

For

web-scale

indexing

(don

’t

trythisat

home!)must

useadistributedcomputingcluster·

Individual

machinesarefault-prone·

Can

unpredictablyslowdownorfail·

Howdoweexploitsuchapoolofmachines?Websearch

engine

datacenter

s·

Web

datacenter

(Google,Bing,Baidu)

mainlycontaincommoditymachines.·

Datacenter

saredistributedaroundthe

world.·

Estimate

:Google~

millionservers,3

millionprocessors/cores

(Gartner2007)Massivedatacenter

s·

Ifin

non-fault-tolerantsystemwith

1000nodes

eachnodehas99.9%uptime,whatistheuptimeofthesystem?·

Answer:63%·

Exercise

:Calculate

the

number

servers

failin

minuteforaninstallationof1

millionservers.Distributedindexing·

Maintaina

master

machinedirectingtheindexingjob

–considered

“

safe

”.·

Break

upindexingintosetsof

(parallel)tasks.·

Mastermachineassignseachtasktoanidlemachinfrom

pool.Paralleltasks·

willusetwosetsofparalleltasks·

Parser

s·

Inverter

s·

Breaktheinputdocumentcollectionintosplits·

Each

split

subset

documents

(correspondinblocks

BSBI/SPIMI)·

Masterassignsasplittoanidleparser

machine·

Parserreadsadocumentatatimeandemits

(term,doc)

pairs·

Parser

writes

pairsintojpartitions·

Eachpartitionisforarangeofterms

’

first

let·

(e.g.,

a-f,

g-p,

q-z)

–

here

j=3.·

Now

complete

the

index

versionParser

sInverter

s·

Aninvertercollectsall

(term,doc)

pairs

post

foroneterm-partition.·

Sortsand

writesto

postingslistsParser

一

a-f

g-pq-zParserParser

一

a-f

g-p

q-zDataflowMaster

assignInverter

g-pInverter

a-f一

a-f

g-p

q-zInverter

q-zReduce

phaseSegment

filesMapphasePostingsassignsplitsMapReduce·TheindexconstructionalgorithmwejustdescribaninstanceofMapReduce.·MapReduce(DeanandGhemawat2004)isarobustandconceptuallysimpleframeworkfordistributedcomputing…·…withouthavingtowritecodeforthedistributpart.·TheydescribetheGoogleindexingsystem(ca.200asconsistingofanumberofphases,eachimplementedinMapReduce.·Index

construction

was

just

one

phase.·

Anotherphase

:transformingaterm-partitioned

indexintoadocument-partitionedindex.·

Term-partitioned:one

machinehandlesasubrangeofterms·

Document-partitioned:one

machinehandlesasubrange

ofdocuments·

’ll

discuss

the

web

part

the

course,mosearch

engines

use

document-partitioned

index…

betterload

balancing,etc.MapReduceSchemafor

indexconstructioninMapReduce·

Schemaofmapandreducefunctions·map:input→list(k,

reduce

(k,

list(v))→

output·Instantiation

oftheschemaforindexconstruction

·map:collection→list(termID,docID)·reduce

(<termID1,list(docID)>,<termID2,list(docID

(postingslist1,

postingslist2,…)·Map:·d1:Ccame,Cc’ed.·d2:Cdied.→·<C,d1>,<came,d1>,<C,d1>,<c’ed,d1>,<C,d2>,<died,d2>·Reduce:·(<C,(d1,d2,d1)>,<died,(d2)>,<came,(d1)>,<c’→(<C,(d1:2,d2:1)>,<died,(d2:1)>,<came,(d1:<c’ed,(d1:1)>)Exampleforindexconstruction37Dynamic

indexing·Uptonow,wehaveassumedthatcollectionsare

static.·

Theyrarely

are

:·

Documents

come

over

time

and

need

inserted.·

Documentsaredeletedand

modified.·

This

meansthatthedictionaryandpostingslists

havetobemodified

:·

Postings

updatesfortermsalreadyindictionary·

Newtermsaddedtodictionary·

Maintain

“big

”

main

index·

Newdocsgointo“

small”auxiliaryindex·

across

both,merge

results·

Deletions·

Invalidation

bit-vector

fordeleteddocs·

Filterdocsoutputon

asearch

result

thisinvalida

bit-vector·

Periodically,re-indexintoone

mainindexSimplest

approachIssueswithmainandauxiliaryindex·Problemoffrequent

merges

–

youtouchstuffalot·Poor

performance

during

merge·

Actually:·Mergingoftheauxiliaryindexintothe

mainindexisefficientifkeepaseparatefileforeachpostingslist.·

Merge

the

same

simple

append.·Butthen

wouldneedalotoffiles

–inefficientfor

OS.·Assumptionforthe

restofthelecture

Theindexisone

file.·

reality:

Useaschemesomewhereinbetween

(e.g.,spl

verylarge

postingslists,

collect

postingslists

ofleng

one

file

etc.)Logarithmicmerge·

Maintainaseriesofindexes,eachtwiceaslargethe

one·

any

time,some

these

powers

of2are

instantiated·

Keepsmallest

(Z0)inmemory·

Larger

ones

(I0,I1,…)on

disk·

IfZ0

getstoo

big

(>n),

writetodisk

asI0·

merge

with

(if

already

exists)as

Either

write

mergeZ1

todiskasI1

(ifnoI1)

OrmergewithI1

toformZ2Logarithmicmerge·

Auxiliaryand

mainindex

:indexconstructiontim

O(T2)as

eachpostingistouchedineachmerge.·

Logarithmicmerge

:EachpostingismergedO(logTtimes,so

complexity

O(T

log

T)·Sologarithmicmergeismuchmoreefficient

for

indexconstruction·

But

query

processing

now

requires

the

merging

ofO(logT)indexes·

Whereas

O(1)if

you

just

have

main

and

auxiliarindex·

Collection-widestatisticsare

hardto

maintain·

人人文庫> 全部分類> 行業(yè)資料 > 信息產(chǎn)業(yè)

溫馨提示

1. 本站所有資源如無特殊說明，都需要本地電腦安裝OFFICE2007和PDF閱讀器。圖紙軟件為CAD,CAXA,PROE,UG,SolidWorks等.壓縮文件請(qǐng)下載最新的WinRAR軟件解壓。
2. 本站的文檔不包含任何第三方提供的附件圖紙等，如果需要附件，請(qǐng)聯(lián)系上傳者。文件的所有權(quán)益歸上傳用戶所有。
3. 本站RAR壓縮包中若帶圖紙，網(wǎng)頁內(nèi)容里面會(huì)有圖紙預(yù)覽，若沒有圖紙預(yù)覽就沒有圖紙。
4. 未經(jīng)權(quán)益所有人同意不得將文件中的內(nèi)容挪作商業(yè)或盈利用途。
5. 人人文庫網(wǎng)僅提供信息存儲(chǔ)空間，僅對(duì)用戶上傳內(nèi)容的表現(xiàn)方式做保護(hù)處理，對(duì)用戶上傳分享的文檔內(nèi)容本身不做任何修改或編輯，并不能對(duì)任何下載內(nèi)容負(fù)責(zé)。
6. 下載文件中如有侵權(quán)或不適當(dāng)內(nèi)容，請(qǐng)與我們聯(lián)系，我們立即糾正。
7. 本站不保證下載資源的準(zhǔn)確性、安全性和完整性, 同時(shí)也不承擔(dān)用戶因使用這些下載資源對(duì)自己和他人造成任何形式的傷害或損失。

作業(yè)信息檢索lecture

文檔簡介

溫馨提示

最新文檔

評(píng)論

作業(yè)信息檢索lecture

文檔簡介

溫馨提示

最新文檔

評(píng)論

相關(guān)文檔