作業(yè)信息檢索lecture_第1頁
作業(yè)信息檢索lecture_第2頁
作業(yè)信息檢索lecture_第3頁
作業(yè)信息檢索lecture_第4頁
作業(yè)信息檢索lecture_第5頁
已閱讀5頁,還剩41頁未讀, 繼續(xù)免費(fèi)閱讀

下載本文檔

版權(quán)說明:本文檔由用戶提供并上傳,收益歸屬內(nèi)容提供方,若內(nèi)容存在侵權(quán),請(qǐng)進(jìn)行舉報(bào)或認(rèn)領(lǐng)

文檔簡介

Lecture4:IndexConstruction·

Lastlecture

Wildcards

○○

○·

Spellcorrection·

Soundex

$m

mace

→madden

mo

□〉

among

amortize

·

This

time

:

on

abandon

among

·

Indexconstruction·

Dictionarydatastructures

a-hu

h

y-m

n-z·

Tolerant

retrieval

0

0

0PlanIndexconstruction·

Howdo

weconstructanindex?·

Whatstrategiescan

weuse

withlimited

mainmemory?Hardwarebasics·

Manydesigndecisionsininformationretrieval

a

basedonthe

characteristicsofhardware·

WebeginbyreviewinghardwarebasicsHardwarebasics·

Access

to

data

in

memory

ismuchfaster

than

acces

to

data

on

disk.·

Diskseeks

:

Nodataistransferredfromdisk

while

diskheadisbeingpositioned.·

Therefore

:Transferringonelargechunkofdataf

diskto

memoryisfasterthantransferring

manysmall

chunks.·

Disk

I/O

is

block-based

:Reading

and

writing

of

enblocks

(as

opposed

to

smaller

chunks).·

Blocksizes

:8KBto256KB.Hardwarebasics·ServersusedinIR

systemsnowtypicallyhaveseve

GBofmainmemory,sometimestensofGB.·

Available

disk

space

is

several

(2–

3)orders

of

magnitudelarger.·

Faulttoleranceisveryexpensive

:It

smuchcheto

use

many

regular

machines

rather

than

one

faulttolerant

machine.Hardwareassumptionsforthislectur·

symbol

statisticvalue·s

average

seektime5ms=5x10-3

b

transfer

time

per

byte0.02

μs=2x10-s·

processor

sclock

rate

109

s-1·

p

low-level

operation

0.01μs=10-8

s(e.g.,compare&swap

a

word)·sizeofmainmemoryseveralGB·

size

of

disk

space1TBorRCV1:Our

collectionfor

thislectur·Shakespeare

scollected

worksdefinitelyaren

’enough

for

demonstrating

many

of

the

points

in

this

course.·

Thecollection

we

’ll

use

isn

’t

reallylargeeno

either,

butit

spubliclyavailableandisatleasmore

plausible

example.·

As

an

example

for

applying

scalable

indexconstructionalgorithms,

we

willusetheReuters

RCV1

collection.·

This

is

one

year

of

Reuters

newswire

(part

of1995

and

1996)AReutersRCV1document·symbolstatisticvalue·Ndocuments800,000·Lavg.#tokensperdoc200·Mterms(=wordtypes)400,000·avg.#bytespertoken6(incl.spaces/punct.)·avg.#bytespertoken4.5(withoutspaces/punct.)·avg.#bytesperterm7.5ReutersRCV1

statisticsRecallIIR1indexconstructionDoc2SoletitbewithCaesar.ThenobleBrutushathtoldyouCaesarwasambitiousDoc

1IdidenactJulius

Caesar

Iwaskilled

i"theCapitol

;Brutuskilledme.·

Documentsareparsedtoextract

wordsandthese

aresaved

withtheDocumentID.Wefocusonthissortstep.

Wehave

100M

itemsto

sort.·After

all

documentshavebeen

parsed,theinvertedfileissorted

by

terms.Key

step·In-memoryindexconstructiondoesnotscale·Can’tstuffentirecollectionintomemory,sort,theback·Howcanweconstructanindexforverylargecollections?·Takingintoaccountthehardwareconstraintswejlearnedabout...·Memory,disk,speed,etc.ScalingindexconstructionSort-basedindexconstruction·Aswebuildtheindex,weparsedocsoneatatime.·Whilebuildingtheindex,wecannoteasilyexploitcompressiontricks(youcan,butmuchmorecomplex)·Thefinalpostingsforanytermarepleteuntiltheend·At12bytespernon-positionalpostingsentry(term,docdemandsalotofspaceforlargecollections.·T=100,000,000inthecaseofRCV1·So…wecandothisinmemoryin2009,buttypicalcollecaremuchlarger.E.g.,theNewYorkTimesprovidesanindeof>150yearsofnewswire·Thus:Weneedtostoreintermediateresultsondisk.Sort

using

disk

as

“memory

?·Canweusethesame

index

constructionalgorithm

forlargercollections,

but

by

usingdiskinsteado

memory?·

No:Sorting

T=100,000,000records

on

disk

is

tooslow

–toomanydiskseeks.·

Weneedanexternal

sortingalgorithm.·

Parse

and

build

postings

entries

one

doc

at

a

time·

Now

sort

postings

entries

by

term

(then

by

docwithineachterm)·

Doing

this

with

random

disk

seeks

would

be

too

slo–mustsortT=

100MrecordsIfeverycomparisontook2

disk

seeks,

andN

items

could

be

sortedwithN

log2N

comparisons,how

longwouldthistake?BottleneckBSBI:Blocked

sort-based

Indexing

(Sorting

withfewerdiskseeks)·12-byte

(4+4+4)records

(term,

doc,

freq).·

Thesearegeneratedasweparsedocs.·

Mustnowsort

100Msuch12-byterecordsbyterm.·

DefineaBlock~

10Msuchrecords·

Caneasily

fitacoupleinto

memory.·

Will

have10such

blockstostart

with.·

Basicideaofalgorithm

Accumulate

postingsforeach

block,sort,

writetodi·

Then

mergethe

blocksintoonelongsortedorder.·Can

do

binary

merges,

with

a

merge

tree

of

log210=4layer·During

each

layer,read

into

memory

runs

in

blocks

of10M

merge,

write

back.Howtomergethesortedruns?DiskRunsbeing

merged.2

Mergedrun.1

234431Howtomergethesortedruns?·Butitis

moreefficienttodoa

multi-way

merge,

whereyo

are

readingfromall

blockssimultaneously·Providingyou

readdecent-sizedchunksofeach

blockin

tmemory

and

then

write

out

a

decent-sized

output

chunk,thenyou

’re

not

killed

bydiskseeksRemainingproblemwithsort-basedalgorithm·Ourassumptionwas:wecankeepthedictionaryinmemory.·Weneedthedictionary(whichgrowsdynamically)ordertoimplementatermtotermIDmapping.·Actually,wecouldworkwithterm,docIDpostingsinsteadoftermID,docIDpostings...·...butthenintermediatefileseverylarge.wouldendupwithascalable,butveryslowindexconstructionmethod.)SPIMI:Single-passin-memoryindexing·Keyidea1:Generateseparatedictionariesforeablock–noneedtomaintainterm-termIDmappingacrossblocks.·Keyidea2:Don’tsort.Accumulatepostingsinpostingslistsastheyoccur.·Withthesetwoideaswecangenerateacompleteinvertedindexforeachblock.·Theseseparateindexescanthenbemergedintoonebigindex.·

Merging

of

blocks

is

analogous

to

BSBI.SPIMI-Invert·CompressionmakesSPIMIevenmoreefficient.·

Compression

of

terms·

Compressionof

postings·

SeenextlectureSPIMI:CompressionDistributedindexing·

For

web-scale

indexing

(don

’t

trythisat

home!)must

useadistributedcomputingcluster·

Individual

machinesarefault-prone·

Can

unpredictablyslowdownorfail·

Howdoweexploitsuchapoolofmachines?Websearch

engine

datacenter

Web

search

datacenter

s

(Google,Bing,Baidu)

mainlycontaincommoditymachines.·

Datacenter

saredistributedaroundthe

world.·

Estimate

:Google~

1

millionservers,3

millionprocessors/cores

(Gartner2007)Massivedatacenter

Ifin

a

non-fault-tolerantsystemwith

1000nodes

eachnodehas99.9%uptime,whatistheuptimeofthesystem?·

Answer:63%·

Exercise

:Calculate

the

number

of

servers

failin

minuteforaninstallationof1

millionservers.Distributedindexing·

Maintaina

master

machinedirectingtheindexingjob

–considered

safe

”.·

Break

upindexingintosetsof

(parallel)tasks.·

Mastermachineassignseachtasktoanidlemachinfrom

a

pool.Paralleltasks·

We

willusetwosetsofparalleltasks·

Parser

Inverter

Breaktheinputdocumentcollectionintosplits·

Each

split

is

a

subset

of

documents

(correspondinblocks

in

BSBI/SPIMI)·

Masterassignsasplittoanidleparser

machine·

Parserreadsadocumentatatimeandemits

(term,doc)

pairs·

Parser

writes

pairsintojpartitions·

Eachpartitionisforarangeofterms

first

let·

(e.g.,

a-f,

g-p,

q-z)

here

j=3.·

Now

to

complete

the

index

in

versionParser

sInverter

Aninvertercollectsall

(term,doc)

pairs

(=

post

foroneterm-partition.·

Sortsand

writesto

postingslistsParser

a-f

g-pq-zParserParser

a-f

g-p

q-zDataflowMaster

assignInverter

g-pInverter

a-f一

a-f

g-p

q-zInverter

q-zReduce

phaseSegment

filesMapphasePostingsassignsplitsMapReduce·TheindexconstructionalgorithmwejustdescribaninstanceofMapReduce.·MapReduce(DeanandGhemawat2004)isarobustandconceptuallysimpleframeworkfordistributedcomputing…·…withouthavingtowritecodeforthedistributpart.·TheydescribetheGoogleindexingsystem(ca.200asconsistingofanumberofphases,eachimplementedinMapReduce.·Index

construction

was

just

one

phase.·

Anotherphase

:transformingaterm-partitioned

indexintoadocument-partitionedindex.·

Term-partitioned:one

machinehandlesasubrangeofterms·

Document-partitioned:one

machinehandlesasubrange

ofdocuments·

As

we

’ll

discuss

in

the

web

part

of

the

course,mosearch

engines

use

a

document-partitioned

index…

betterload

balancing,etc.MapReduceSchemafor

indexconstructioninMapReduce·

Schemaofmapandreducefunctions·map:input→list(k,

v)

reduce

:

(k,

list(v))→

output·Instantiation

oftheschemaforindexconstruction

·map:collection→list(termID,docID)·reduce

:

(<termID1,list(docID)>,<termID2,list(docID

(postingslist1,

postingslist2,…)·Map:·d1:Ccame,Cc’ed.·d2:Cdied.→·<C,d1>,<came,d1>,<C,d1>,<c’ed,d1>,<C,d2>,<died,d2>·Reduce:·(<C,(d1,d2,d1)>,<died,(d2)>,<came,(d1)>,<c’→(<C,(d1:2,d2:1)>,<died,(d2:1)>,<came,(d1:<c’ed,(d1:1)>)Exampleforindexconstruction37Dynamic

indexing·Uptonow,wehaveassumedthatcollectionsare

static.·

Theyrarely

are

Documents

come

in

over

time

and

need

to

be

inserted.·

Documentsaredeletedand

modified.·

This

meansthatthedictionaryandpostingslists

havetobemodified

Postings

updatesfortermsalreadyindictionary·

Newtermsaddedtodictionary·

Maintain

“big

main

index·

Newdocsgointo“

small”auxiliaryindex·

Search

across

both,merge

results·

Deletions·

Invalidation

bit-vector

fordeleteddocs·

Filterdocsoutputon

asearch

result

by

thisinvalida

bit-vector·

Periodically,re-indexintoone

mainindexSimplest

approachIssueswithmainandauxiliaryindex·Problemoffrequent

merges

youtouchstuffalot·Poor

performance

during

merge·

Actually:·Mergingoftheauxiliaryindexintothe

mainindexisefficientifkeepaseparatefileforeachpostingslist.·

Merge

is

the

same

as

a

simple

append.·Butthen

we

wouldneedalotoffiles

–inefficientfor

OS.·Assumptionforthe

restofthelecture

:

Theindexisone

b

file.·

In

reality:

Useaschemesomewhereinbetween

(e.g.,spl

verylarge

postingslists,

collect

postingslists

ofleng

one

file

etc.)Logarithmicmerge·

Maintainaseriesofindexes,eachtwiceaslargethe

previous

one·

At

any

time,some

of

these

powers

of2are

instantiated·

Keepsmallest

(Z0)inmemory·

Larger

ones

(I0,I1,…)on

disk·

IfZ0

getstoo

big

(>n),

writetodisk

asI0·

or

merge

with

I0

(if

I0

already

exists)as

Z1

·

Either

write

mergeZ1

todiskasI1

(ifnoI1)

·

OrmergewithI1

toformZ2Logarithmicmerge·

Auxiliaryand

mainindex

:indexconstructiontim

O(T2)as

eachpostingistouchedineachmerge.·

Logarithmicmerge

:EachpostingismergedO(logTtimes,so

complexity

is

O(T

log

T)·Sologarithmicmergeismuchmoreefficient

for

indexconstruction·

But

query

processing

now

requires

the

merging

ofO(logT)indexes·

Whereas

it

is

O(1)if

you

just

have

a

main

and

auxiliarindex·

Collection-widestatisticsare

hardto

maintain·

E

溫馨提示

  • 1. 本站所有資源如無特殊說明,都需要本地電腦安裝OFFICE2007和PDF閱讀器。圖紙軟件為CAD,CAXA,PROE,UG,SolidWorks等.壓縮文件請(qǐng)下載最新的WinRAR軟件解壓。
  • 2. 本站的文檔不包含任何第三方提供的附件圖紙等,如果需要附件,請(qǐng)聯(lián)系上傳者。文件的所有權(quán)益歸上傳用戶所有。
  • 3. 本站RAR壓縮包中若帶圖紙,網(wǎng)頁內(nèi)容里面會(huì)有圖紙預(yù)覽,若沒有圖紙預(yù)覽就沒有圖紙。
  • 4. 未經(jīng)權(quán)益所有人同意不得將文件中的內(nèi)容挪作商業(yè)或盈利用途。
  • 5. 人人文庫網(wǎng)僅提供信息存儲(chǔ)空間,僅對(duì)用戶上傳內(nèi)容的表現(xiàn)方式做保護(hù)處理,對(duì)用戶上傳分享的文檔內(nèi)容本身不做任何修改或編輯,并不能對(duì)任何下載內(nèi)容負(fù)責(zé)。
  • 6. 下載文件中如有侵權(quán)或不適當(dāng)內(nèi)容,請(qǐng)與我們聯(lián)系,我們立即糾正。
  • 7. 本站不保證下載資源的準(zhǔn)確性、安全性和完整性, 同時(shí)也不承擔(dān)用戶因使用這些下載資源對(duì)自己和他人造成任何形式的傷害或損失。

最新文檔

評(píng)論

0/150

提交評(píng)論