
Automatically Extracting Structure and Data from Business Reports

Stephen W. Liddle
School of Accountancy and Information Systems
Marriott School of Management
Brigham Young University
Co-Authors: Douglas M. Campbell, Chad Crawford
11/3/99, CIKM99, Copyright 1999, Stephen W. Liddle

Overview

What are business reports and why do we care about them?
Extracting structure and data
  Field types
  Line types
  Page headers/footers
  Inferring recursive group structure
Experimental results

Business Reports

Business reports are used to disseminate information pertinent to business operations
  Financial, inventory, production
They are the result of periodic data processing
  Daily, weekly, monthly, etc.
  COBOL, 4GLs, report writers

Business Reports

    RUN 05/21/99 12:34:56 00551 L A R G E C D R E P O R T ACCR: 04/26/99 POST: 05/21/99 PAGE 001
    CUST NBR CD NBR N A M E BALANCE RATE MATURITY OFC
    006 9994 10355 JASON MASON CONSTRUCTION INC 100,000.00 .06005 03/07/99
    008 9992 9657 FANNY M RYEBERG 300,000.56 .05990 04/22/99 MS
    009 9991 9541 JOHN SMITH JR 1,100,000.00 .05990 04/22/99 MS
    011 9989 11225 BARNEY FIFE 105,529.23 .06250 05/16/99
    * * * TOTAL LARGE CD * * * 1,605,529.79

    CLL3605 12 BANK 012 BRANCH 0402 MY TOWN BANK-CHARGE OFF BK 05/19/98 PAGE 1
    EXCEPTION REPORT - NUMBER 12 PART.SOLD - DETAIL LIST
    CUSTOMER NUMBER OFF ORG DATE INTEREST RATE LOAN BALANCE PART. INTEREST PARTICIPANT OUR OUR PART.
    CUSTOMER NAME LDG MAT DATE ORG AMOUNT LOAN CODE RATE OWING PART BALANCE BALANCE
    LISTED WITH 9999900MY TOWN BANK -
    1234567 9001 81196 09-25-92 14.50000 50,860.19 44,428.59
    TUBBS, BILLY 63 DEMAND 60,825.23 OA 14.50000 62,814.48 14,954-
    1234569 9002 09644 03-22-93 12.75000 29,817.50 20,386.38
    JONES, MARIA 66 DEMAND 30,079.41 OA 12.75000 30,079.41 1,261-
    PART-TOTALS NUMBER 64,814.97 16,216.20- COUNT 2 96,893.89
    BRANCH TOTALS NUMBER .00 .00 COUNT .00

Business Reports

    M Y T O W N S T A T E B A N K RUN/DATE: 05/20/98 TIME: 00:34
    TRANS/DATE: 05/21/99 CASH CONTROL - TELLER CASH LISTING CASH4 PAGE NO. 1
    OFFICE NUMBER 001 . .
    TELLER . AMOUNT TYPE BATCH & . AMOUNT TYPE BATCH & . AMOUNT TYPE BATCH & . AMOUNT TYPE BATCH & . .
    NUMBER . SEQUENCE . SEQUENCE . SEQUENCE . SEQUENCE . . . . . . . . .
    001 . 15.00 OUT 214-23767 . 22.00 OUT 214-23632 . 41.17 OUT 214-23651 . 45.00 OUT 214-23726 . . .
    190.00 OUT 214-23752 . 200.00 OUT 215-18550 . 300.00 OUT 215-18579 . 300.00 OUT 214-23735 . . .
    400.00 OUT 214-23754 . 400.00 OUT 214-23810 . 400.00 OUT 215-18548 . 500.00 OUT 215-18600 . . .
    1,000.00 OUT 214-23764 . 4,138.85 OUT 215-05631 . 10,000.00 OUT 214-23780 . 20.00 IN 214-23686 . . .
    60.00 IN 214-23670 . 100.00 IN 214-23711 . 110.25 IN 214-23720 . 140.00 IN 214-23763 . . .
    160.00 IN 214-23679 . 200.00 IN 214-23643 . 200.00 IN 214-23647 . 200.00 IN 214-23696 . . .
    211.00 IN 214-23751 . 280.00 IN 214-23655 . 300.00 IN 214-23739 . 300.00 IN 214-23770 . . .
    340.48 IN 214-23777 . 400.00 IN 214-23740 . 400.00 IN 214-23732 . 400.00 IN 215-18575 . . .
    420.00 IN 214-23813 . 700.00 IN 214-23734 . 1,000.00 IN 214-23779 . 1,003.86 IN 214-23888 . . .
    1,240.85 IN 214-23718 . 1,506.00 IN 214-23742 . 1,806.00 IN 214-23692 . 6,000.00 IN 214-23688 . . . . . . . . . . . . . . . . . . . .
    CASH OUT TOTAL 17,952.02 * CASH IN TOTAL 17,498.44 * NET CASH TOTAL 453.58-* . . . . . . . . . . . .
    002 . 6.00 OUT 214-27788 . 15.00 OUT 214-27821 . 25.00 OUT 214-28073 . 40.00 OUT 214-27836 . . .
    50.00 OUT 214-27798 . 200.00 OUT 214-27790 . 200.00 OUT 215-18551 . 250.00 OUT 214-27819 . . .
    400.00 OUT 215-18547 . 1,000.00 OUT 215-18545 . 1,000.00 OUT 214-27668 . 1,080.00 OUT 214-27658 . . .
    4,000.00 OUT 214-27675 . 4,000.00 OUT 214-27662 . 3,545.42 OUT 215-05659 . 45.00 IN 214-27753 . . .
    50.00 IN 214-27810 . 60.00 IN 214-27807 . 95.00 IN 214-27725 . 265.00 IN 214-27723 . . .
    305.00 IN 214-27759 . 330.00 IN 214-27755 . 400.00 IN 214-27797 . 400.00 IN 214-27818 . . .
    408.73 IN 214-27667 . 419.03 IN 214-27832 . 560.00 IN 214-27805 . 600.00 IN 214-27811 . . .
    850.00 IN 214-27650 . 1,000.00 IN 214-27640 . 1,821.85 IN 214-27678 . 4,200.00 IN 214-27785 . . .
    3,480.32 IN 214-27695 . . . . . . . . . . . . . . . . .
    CASH OUT TOTAL 15,811.42 * CASH IN TOTAL 15,289.93 * NET CASH TOTAL 521.49-* . . . . . . . . . . . .
    003 . 10.00 OUT 214-18486 . 20.00 OUT 214-18462 . 20.00 OUT 215-18640 . 40.00 OUT 214-27483 . . .
    50.00 OUT 214-18296 . 50.00 OUT 214-18301 . 55.00 OUT 214-27456 . 120.77 OUT 214-18465 . . .
    137.00 OUT 214-27486 . 342.54 OUT 214-18489 . 700.00 OUT 214-27490 . 1,255.00 OUT 214-18449 . . .
    1,705.59 OUT 215-18642 . 1,765.34 OUT 215-18649 . 1,884.92 OUT 215-18629 . 15,000.00 OUT 214-27882 . . .
    .15 IN 214-18417 . 10.00 IN 214-18429 . 18.62 IN 214-27395 . 29.00 IN 214-25207 . . .
    50.00 IN 214-27842 . 50.00 IN 214-18393 . 68.28 IN 214-27399 . 100.00 IN 214-27474 . . . . . . . .

Business Reports

    BCRCL10 PACKAGE DETAIL 05/21/99 PAGE 1
    TO-YOUR TOWN BANK FROM-MY TOWN STATE BANK JOB-001 SP-P2414 DVC-01 PKT 13
    AMOUNT SEQUENCE AMOUNT SEQUENCE AMOUNT SEQUENCE AMOUNT SEQUENCE AMOUNT SEQUENCE AMOUNT SEQUENCE
    123.45 21409648 81.56 21410732 23.00 21411405 .58 21412327 25.32 21412947 50.00 21413527
    679.00 21409664 34.05 21410744 100.00 21411408 71.00 21412329 115.47 21412950 136.80 21413531
    170.00 21409667 27.68 21410750 40.00 21411409 150.00 21412337 25.00 21412951 21.40 21413548
    38.00 21409714 528.94 21410772 100.00 21411416 100.00 21412340 18.60 21412952 25.00 21413576
    5.00 21409742 274.00 21410780 75.00 21411420 68.00 21412344 15.00 21412957 50.00 21413580
    13.75 21409743 383.53 21410793 65.00 21411423 25.00 21412351 232.44 21412962 20.00 21413582
    849.77 21409778 511.46 21410816 40.00 21411430 56.60 21412377 200.00 21412991 26.40 21413583
    211.43 21409829 276.22 21410860 40.00 21411431 71.50 21412378 100.00 21412992 10.00 21413596
    291.58 21409914 62.00 21410883 40.00 21411438 432.24 21412423 28.00 21412995 10.00 21413601
    15.63 21409936 35.00 21410888 854.00 21411491 11.00 21412426 8.22 21413006 25.00 21413603
    50.00 21409985 35.00 21410889 86.34 21411515 258.00 21412446 18.00 21413007 103.00 21413639
    1,500.00 21410053 35.00 21410890 1,000.00 21411545 483.10 21412474 79.90 21413008 70.00 21413653
    20.39 21410062 35.00 21410892 1,000.00 21411546 260.00 21412475 75.00 21413013 799.63 21413709
    257.61 21410065 35.00 21410894 277.05 21411579 115.00 21412498 6.00 21413017 116.24 21413713
    7,467.35 21410082 33.00 21410904 351.00 21411614 100.00 21412538 29.86 21413020 103.09 21413725
    692.90 21410083 18.40 21410905 42.14 21411678 132.14 21412543 61.95 21413022 1,000.00 21413730
    927.98 21410084 19.25 21410906 61.61 21411682 30.00 21412557 28.84 21413024 246.00 21413752
    25.00 21410126 10.02 21410909 432.16 21411737 150.00 21412623 35.15 21413028 40.00 21413814
    7.50 21410141 63.00 21410919 69.00 21411765 15.80 21412643 25.00 21413041 35.36 21413815
    7.50 21410145 4,751.00 21410921 64.00 21411768 42.65 21412646 40.00 21413042 65.00 21413817
    60.00 21410152 45.69 21410922 64.00 21411769 22.75 21412653 50.00 21413043 21.00 21413818
    27.50 21410154 125.00 21410923 438.00 21411791 10.00 21412688 200.00 21413044 30.00 21413819
    20.00 21410163 75.00 21410948 44.15 21411801 62.00 21412698 39.31 21413050 20.00 21413820
    7.50 21410164 19.93 21410954 40.00 21411815 70.00 21412699 11.89 21413051 24.99 21413821
    7.00 21410337 60.00 21411252 258.00 21412121 40.00 21412899 414.00 21413192
    75.00 21410338 59.00 21411257 516.00 21412153 40.00 21412901 20.00 21413199
    15.00 21410341 35.00 21411268 450.00 21412164 200.00 21412903 87.35 21413200
    45.00 21410360 592.00 21411269 35.00 21412183 69.09 21412922 80.00 21413229
    258.00 21410390 64.06 21411327 65.00 21412193 42.53 21412923 16.00 21413231
    100.00 21410551 49.28 21411347 15.00 21412195 47.32 21412924 85.00 21413336
    333.33 21410552 68.93 21411362 60.00 21412196 17.08 21412925 215.00 21413361
    852.34 21410581 56.64 21411365 30.00 21412220 100.00 21412934 312.50 21413468 YOUR TOWN BANK
    110.00 21410642 163.32 21411380 148.66 21412269 200.00 21412935 500.00 21413469 - ( )
    50.00 21410657 160.50 21411383 29.95 21412300 10.00 21412936 100.00 21413490 FIRST 123.45
    449.60 21410673 5,000.00 21411390 29.95 21412301 200.00 21412937 20.00 21413503 LAST 24.99
    112.80 21410709 417.34 21411398 29.95 21412310 81.79 21412944 18.00 21413506 SEP# 00000
    61.24 21410715 129.18 21411401 164.25 21412326 25.29 21412946 4.83 21413520 47,452.45 209

Business-Report Structure

Pages
Rows/columns
Page headers/footers
Group headers/footers
Assumptions
  ASCII format (EBCDIC easy to translate)
  Page boundaries known
  We can basically ignore blank lines

Type I Reports

A type I report exhibits repeated structure only along the row (vertical) dimension

    RUN 05/21/99 12:34:56 00551 L A R G E C D R E P O R T ACCR: 04/26/99 POST: 05/21/99 PAGE 001
    CUST NBR CD NBR N A M E BALANCE RATE MATURITY OFC
    006 9994 10355 JASON MASON CONSTRUCTION INC 100,000.00 .06005 03/07/99
    008 9992 9657 FANNY M RYEBERG 300,000.56 .05990 04/22/99 MS
    009 9991 9541 JOHN SMITH JR 1,100,000.00 .05990 04/22/99 MS
    011 9989 11225 BARNEY FIFE 105,529.23 .06250 05/16/99
    * * * TOTAL LARGE CD * * * 1,605,529.79

Type II Reports

A type II report exhibits repeated structure within rows (along the horizontal dimension)

    BCRCL10 PACKAGE DETAIL 05/21/99 PAGE 1
    TO-YOUR TOWN BANK FROM-MY TOWN STATE BANK JOB-001 SP-P2414 DVC-01 PKT 13
    AMOUNT SEQUENCE AMOUNT SEQUENCE AMOUNT SEQUENCE AMOUNT SEQUENCE AMOUNT SEQUENCE AMOUNT SEQUENCE
    123.45 21409648 81.56 21410732 23.00 21411405 .58 21412327 25.32 21412947 50.00 21413527
    679.00 21409664 34.05 21410744 100.00 21411408 71.00 21412329 115.47 21412950 136.80 21413531
    . . .
    7.50 21410164 19.93 21410954 40.00 21411815 70.00 21412699 11.89 21413051 24.99 21413821
    7.00 21410337 60.00 21411252 258.00 21412121 40.00 21412899 414.00 21413192
    333.33 21410552 68.93 21411362 60.00 21412196 17.08 21412925 215.00 21413361
    852.34 21410581 56.64 21411365 30.00 21412220 100.00 21412934 312.50 21413468 YOUR TOWN BANK
    110.00 21410642 163.32 21411380 148.66 21412269 200.00 21412935 500.00 21413469 - ( )
    50.00 21410657 160.50 21411383 29.95 21412300 10.00 21412936 100.00 21413490 FIRST 123.45
    449.60 21410673 5,000.00 21411390 29.95 21412301 200.00 21412937 20.00 21413503 LAST 24.99
    112.80 21410709 417.34 21411398 29.95 21412310 81.79 21412944 18.00 21413506 SEP# 00000
    61.24 21410715 129.18 21411401 164.25 21412326 25.29 21412946 4.83 21413520 47,452.45 209

  Repeated structure within a row:

    AMOUNT SEQUENCE AMOUNT SEQUENCE
    123.45 21409648 81.56 21410732
    679.00 21409664 34.05 21410744

Structure Extraction Process

Business Report + Field Description Lattice
  -> Extract Fields
  -> Infer Line Types
  -> Infer Page Headers/Footers
  -> Infer Recursive Group Structure
  -> Report Structure Decomposition

Data Extraction Process

Business Reports + Report Structure Decomposition
  -> Extract Data
  -> Populated Database

Delimitations

This study examines only type I reports (i.e., a line in a report pertains to one record)
We focus on report structure extraction

Algorithm 1: Extract Fields

Use the field extraction lattice to identify basic fields in each line of the report
Represent the lattice with a total ordering E of regular expressions
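
A minimal sketch of the field-extraction step, assuming the total ordering E can be approximated by an ordered Python list of regular expressions tried from most to least specific. The pattern bodies, the `extract_fields` and `place` helpers, and the reading of each tuple as (1-based start column, length, text) are illustrative assumptions, not the authors' implementation; the DayMonthYear and Fraction patterns in particular are guesses, since the slides only name those types.

```python
import re

# Stand-in for the total ordering E: the first pattern that matches wins.
# All pattern bodies below are assumptions made for illustration.
FIELD_PATTERNS = [
    ("DayMonthYear", re.compile(r"\d\d/\d\d/\d\d(?=\D|$)")),
    ("Currency", re.compile(r"\d*(?:,\d\d\d)*\.\d\d(?=\D|$)")),
    ("Fraction", re.compile(r"\.\d+(?=\D|$)")),
    ("General Number", re.compile(r"\d+(?:,\d\d\d)*\.?\d*(?=\D|$)")),
    ("String", re.compile(r"\S+(?: \S+)*")),
]


def extract_fields(line):
    """Scan left to right and return (type, start column, length, text) tuples."""
    fields = []
    pos = 0
    while pos < len(line):
        if line[pos] == " ":
            pos += 1
            continue
        for name, pattern in FIELD_PATTERNS:
            m = pattern.match(line, pos)
            if m and m.end() > pos:
                # Columns are reported 1-based, as in the line type vector.
                fields.append((name, pos + 1, m.end() - pos, m.group()))
                pos = m.end()
                break
        else:
            pos += 1  # no pattern matched here; skip one character
    return fields


def place(fields, width=84):
    """Hypothetical helper: build a fixed-width line from (column, text) pairs."""
    buf = [" "] * width
    for start, text in fields:
        buf[start - 1:start - 1 + len(text)] = text
    return "".join(buf)


sample = place([
    (1, "006"), (5, "9994"), (13, "10355"),
    (21, "JASON MASON CONSTRUCTION INC"),
    (54, "100,000.00"), (68, ".06005"), (77, "03/07/99"),
])
for field in extract_fields(sample):
    print(field)
```

The String pattern stops at the first run of two or more spaces, so this sketch relies on fixed-width columns being separated from free text by more than one blank.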

Field Extraction

Extract fields to form line type vector

  General Number  \b\d+(,\d\d\d)*\.?\d*(?=(\D|$))
  String          (\S+( \S+)*)
  Currency        \d*(,\d\d\d)*\.\d\d(?=(\D|$))

Line Type Vector:
  General Number (1, 3, 006)
  General Number (5, 4, 9994)
  General Number (13, 5, 10355)
  String (21, 28, JASON MASON CONSTRUCTION INC)
  Currency (54, 10, 100,000.00)
  Fraction (68, 6, .06005)
  DayMonthYear (77, 8, 03/07/99)

Algorithm 2: Infer Line Types

Cluster line types by similarity to form the set B of basic line types for R
Use line distances:
  First-order distance
    Based on character comparison
    Identical strings have distance 0
  Second-order distance
    Based on field types
    Uses field-description lattice for distance

Infer Basic Line Types

Generalize line types when distance is small
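
The slides do not give formulas for the two line distances, so the following is only a rough sketch under stated assumptions: the first-order distance is taken to be the fraction of character positions that differ, the second-order distance is simplified to a plain mismatch count over field-type names (the slides say it uses the field-description lattice, which is not modeled here), and clustering is a simple greedy pass with a hypothetical `threshold` parameter.

```python
from itertools import zip_longest


def first_order_distance(a, b):
    """Fraction of character positions at which two lines differ (0.0 if identical)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    diffs = sum(1 for x, y in zip_longest(a, b) if x != y)
    return diffs / longest


def second_order_distance(types_a, types_b):
    """Fraction of field slots whose type names differ; missing slots count as different."""
    longest = max(len(types_a), len(types_b))
    if longest == 0:
        return 0.0
    diffs = sum(1 for x, y in zip_longest(types_a, types_b) if x != y)
    return diffs / longest


def cluster_line_types(type_vectors, threshold=0.25):
    """Greedy clustering: each line joins the first cluster whose representative is close."""
    clusters = []  # list of (representative vector, member indexes)
    for i, vec in enumerate(type_vectors):
        for rep, members in clusters:
            if second_order_distance(rep, vec) <= threshold:
                members.append(i)
                break
        else:
            clusters.append((vec, [i]))
    return clusters


header = ["String", "String", "String"]
detail = ["General Number"] * 3 + ["String", "Currency", "Fraction", "DayMonthYear"]
detail_with_office = detail + ["String"]

for rep, members in cluster_line_types([header, detail, detail_with_office]):
    print(members, rep)
```

In this toy input the two detail lines differ only by the optional trailing office code, so they fall into one cluster, while the header line forms its own.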

Algorithm 3: Infer Page Headers/Footers

Separate report detail from page headers and footers
A line is considered detail if
  It repeats in the report two or more times in immediate succession, or
  It repeats more than twice on one page
Find the maximal page prefix/suffix of non-detail lines
Remove page headers/footers and blank lines

Page Headers/Footers

    RUN 05/21/99 12:34:56 00551 L A R G E C D R E P O R T ACCR: 04/26/99 POST: 05/21/99 PAGE 001
    CUST NBR CD NBR N A M E BALANCE RATE MATURITY OFC
    006 9994 10355 JASON MASON CONSTRUCTION INC 100,000.00 .06005 03/07/99
    008 9992 9657 FANNY M RYEBERG 300,000.56 .05990 04/22/99 MS
    009 9991 9541 JOHN SMITH JR 1,100,000.00 .05990 04/22/99 MS
    011 9989 11225 BARNEY FIFE 105,529.23 .06250 05/16/99

    RUN 05/21/99 12:34:56 00551 L A R G E C D R E P O R T ACCR: 04/26/99 POST: 05/21/99 PAGE 002
    CUST NBR CD NBR N A M E BALANCE RATE MATURITY OFC
    013 2349 12334 MOMS FINE COUNTRY COOKING 100,000.00 .06005 06/13/99 MS
    015 1012 11221 BAKERY AT THE TOWN SQUARE 300,000.00 .05990 06/23/99
    016 2344 2899 JILL JENKINS 75,000.00 .05990 06/25/99 MS
    016 4389 8983 JEAN LUC PICARD 100,000.00 .06250 06/30/99
    * * * TOTAL LARGE CD * * * 2,180,529.79

Algorithm 4: Infer Group Structure (uvkw)

    006 9994 10355 JASON MASON CONSTRUCTION INC 100,000.00 .06005 03/07/99
    008 9992 9657 FANNY M RYEBERG 300,000.56 .05990 04/22/99 MS
    009 9991 9541 JOHN SMITH JR 1,100,000.00 .05990 04/22/99 MS
    011 9989 11225 BARNEY FIFE 105,529.23 .06250 0
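
The detail rule and the prefix/suffix search of Algorithm 3 can be sketched as follows. This is an illustration under assumptions, not the authors' code: each page is represented as a list of line-type labels (blank lines already dropped), "two or more times in immediate succession" is read as at least two adjacent occurrences, and the page header/footer is taken to be the longest run of non-detail line types shared by every page. The function names are hypothetical.

```python
from collections import Counter


def detail_types(pages):
    """Line types that repeat back to back, or occur more than twice on one page."""
    detail = set()
    for page in pages:
        for prev, cur in zip(page, page[1:]):
            if prev == cur:
                detail.add(cur)
        for line_type, count in Counter(page).items():
            if count > 2:
                detail.add(line_type)
    return detail


def common_affix_length(pages, detail, from_end=False):
    """Length of the longest non-detail prefix (or suffix) shared by every page."""
    views = [page[::-1] if from_end else page for page in pages]
    length = 0
    for column in zip(*views):
        if len(set(column)) != 1 or column[0] in detail:
            break
        length += 1
    return length


def strip_page_headers_footers(pages):
    """Return (header length, footer length, remaining lines of the whole report)."""
    detail = detail_types(pages)
    head = common_affix_length(pages, detail)
    tail = common_affix_length(pages, detail, from_end=True)
    body = []
    for page in pages:
        body.extend(page[head:len(page) - tail])
    return head, tail, body


# Two pages shaped like the Large CD report: H1/H2 are header line types,
# D is the detail line type, T is the report total line.
pages = [
    ["H1", "H2", "D", "D", "D", "D"],
    ["H1", "H2", "D", "D", "D", "D", "T"],
]
print(strip_page_headers_footers(pages))
```

Under these assumptions the total line on the last page is not treated as a page footer, because it is not shared by every page; it stays in the body for the group-structure step to handle.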
