A Statistical Method for 3D Object Detection Applied to Faces and Cars
Henry Schneiderman and Takeo Kanade
Robotics Institute, Carnegie Mellon University, Pittsburgh, PA 15213

Abstract

In this paper, we describe a statistical method for 3D object detection. We represent the statistics of both object appearance and "non-object" appearance using a product of histograms. Each histogram represents the joint statistics of a subset of wavelet coefficients and their position on the object. Our approach is to use many such histograms representing a wide variety of visual attributes. Using this method, we have developed the first algorithm that can reliably detect human faces with out-of-plane rotation and the first algorithm that can reliably detect passenger cars over a wide range of viewpoints.

1. Introduction

The main challenge in object detection is the amount of variation in visual appearance. For example, cars vary in shape, size, coloring, and in small details such as the headlights, grill, and tires. Visual appearance also depends on the surrounding environment. Light sources will vary in their intensity, color, and location with respect to the object. Nearby objects may cast shadows on the object or reflect additional light on the object. The appearance of the object also depends on its pose; that is, its position and orientation with respect to the camera. For example, a side view of a human face will look much different than a frontal view. An object detector must accommodate all this variation and still distinguish the object from any other pattern that may occur in the visual world.

To cope with all this variation, we use a two-part strategy for object detection. To cope with variation in pose, we use a view-based approach with multiple detectors that are each specialized to a specific orientation of the object, as described in Section 2. We then use statistical modeling within each of these detectors to account for the remaining variation. This statistical modeling is the focus of this paper. In Section 3 we derive the functional form we use in all view-based detectors. In Section 4 we describe how we collect the statistics for these detectors from representative sets of training images. In Section 5 we discuss our implementation, and in Sections 6 and 7 we describe the accuracy of the face and car detectors.

2. View-Based Detectors

We develop separate detectors that are each specialized to a specific orientation of the object. For example, we have one detector specialized to right profile views of faces and one that is specialized to frontal views. We apply these view-based detectors in parallel and then combine their results. If there are multiple detections at the same or adjacent locations, our method chooses the strongest detection.

We empirically determined the number of orientations to model for each object. For faces we use two view-based detectors, frontal and right profile, as shown below. To detect left-profile faces, we apply the right profile detector to a mirror-reversed input image. For cars we use eight detectors, as shown below. Again, we detect left side views by running the seven right-side detectors on mirror-reversed images.

Figure 1. Examples of training images for each face orientation
Figure 2. Examples of training images for each car orientation

Each of these detectors is not only specialized in orientation, but is trained to find the object only at a specified size within a rectangular image window. Therefore, to be able to detect the object at any position within an image, we re-apply the detectors for all possible positions of this rectangular window. Then, to be able to detect the object at any size, we iteratively resize the input image and re-apply the detectors in the same fashion to each resized image.

3. Functional Form of Decision Rule

For each view-based detector we use statistical modeling to account for the remaining forms of variation. Each of the detectors uses the same underlying form for the statistical decision rule. They differ only in that they use statistics gathered from different sets of images.

There are two statistical distributions we model for each view-based detector. We model the statistics of the given object, P(image | object), and the statistics of the rest of the visual world, which we call the "non-object" class, P(image | non-object). We then compute our detection decision using the likelihood ratio test:

    P(image | object) / P(image | non-object) > λ,   λ = P(non-object) / P(object)    (1)

If the likelihood ratio (the left side) is greater than the right side, we decide the object is present. The likelihood ratio test is equivalent to Bayes decision rule (MAP decision rule) and will be optimal if our representations for P(image | object) and P(image | non-object) are accurate. The rest of this section focuses on the functional forms we choose for these distributions.

3.1. Representation of Statistics Using Products of Histograms

The difficulty in modeling P(image | object) and P(image | non-object) is that we do not know the true statistical characteristics of appearance either for the object or for the rest of the world. For example, we do not know if the true distributions are Gaussian, Poisson, or multimodal. These properties are unknown since it is not tractable to analyze the joint statistics of large numbers of pixels.

Since we do not know the true structure of these distributions, the safest approach is to choose models that are flexible and can accommodate a wide range of structure. One class of flexible models are non-parametric memory-based models such as Parzen windows or nearest neighbor. The disadvantage of these models is that to compute a probability for a given input we may have to compare that input to all the training data. Such a computation will be extremely time consuming. An alternative is to use a flexible parametric model that is capable of representing multimodal distributions, such as a multilayer perceptron neural network or a mixture model. However, there are no closed-form solutions for fitting these models to a set of training examples. All estimation methods for these models are susceptible to local minima.

Instead of these approaches, we choose to use histograms. Histograms are almost as flexible as memory-based methods but use a more compact representation whereby we retrieve probability by table look-up. Estimation of a histogram simply involves counting how often each attribute value occurs in the training data. The resulting estimates are statistically optimal: they are unbiased, consistent, and satisfy the Cramer-Rao lower bound.

The main drawback of a histogram is that we can only use a relatively small number of discrete values to describe appearance. To overcome this limitation, we use multiple histograms where each histogram, Pk(patternk | object), represents the probability of appearance over some specified visual attribute, patternk; that is, patternk is a random variable describing some chosen visual characteristic such as low frequency content. We will soon specify how we partition appearance into different visual attributes. However, in order to do so, we need to first understand the issues in combining probabilities from different attributes.

To combine probabilities from different attributes, we take the following product, where we approximate each class-conditional probability function as a product of histograms:

    P(image | object) ≈ ∏k Pk(patternk | object)
    P(image | non-object) ≈ ∏k Pk(patternk | non-object)    (2)

In forming these representations for P(image | object) and P(image | non-object) we implicitly assume that the attributes (patternk) are statistically independent for both the object and the non-object. However, it can be shown that we can relax this assumption since our goal is accurate classification, not accurate probabilistic modeling [2]. For example, let us consider a classification example based on two random variables, A and B. Let's assume that A is a deterministic function of B, A = f(B), and is therefore fully dependent on B, P(A = f(B) | B) = 1. The optimal classifier becomes:

    P(A, B | object) / P(A, B | non-object)
        = [P(A | B, object) P(B | object)] / [P(A | B, non-object) P(B | non-object)]
        = P(B | object) / P(B | non-object) > λ    (3)

If we wrongly assume statistical independence between A and B, then the classifier becomes:

    P(A, B | object) / P(A, B | non-object)
        ≈ [P(A | object) P(B | object)] / [P(A | non-object) P(B | non-object)]
        = [P(B | object) / P(B | non-object)]^2 > γ    (4)

This case illustrates that we can achieve accurate classification (by choosing γ = λ^2) even though we have violated the statistical independence assumption.

In the general case, when we do not know the relationship between A and B, performance will depend on how well the ratio P(A | object) / P(A | non-object) approximates P(A | B, object) / P(A | B, non-object). However, it is probably better to model A and B as statistically independent than to model only one variable. For example, if we only modeled B, our classifier would become:

    P(A, B | object) / P(A, B | non-object) ≈ (1) · P(B | object) / P(B | non-object)    (5)

In this formulation, we would be implicitly approximating P(A | B, object) / P(A | B, non-object) by 1. Chances are that P(A | object) / P(A | non-object) is a better approximation.

In choosing how to decompose visual appearance into different attributes we face the question of what image measurements to model jointly and what to model independently. Obviously, if the joint relationship between two variables, such as A and B, seems to distinguish the object from the rest of the world, we should try to model them jointly. If we are unsure, it is still probably better to model them independently than not to model one at all.

Ideally, we would like to follow a systematic method for determining which visual qualities distinguish the object from the rest of the visual world. Unfortunately, the only known way of doing this is to try all possibilities and compare their classification accuracy. Such an approach is not computationally tractable. Ultimately, we have to make educated guesses about what visual qualities distinguish faces and cars from the rest of the world. We describe these guesses below.

3.2. Decomposition of Appearance in Space, Frequency, and Orientation

For both faces and cars, our approach is to jointly model visual information that is localized in space, frequency, and orientation. To do so, we decompose visual appearance along these dimensions. Below we explain this decomposition, and in the next section we specify our visual attributes based on this decomposition.

First, we decompose the appearance of the object into "parts" whereby each visual attribute describes a spatially localized region on the object. By doing so we concentrate the limited modeling power of each histogram over a smaller amount of visual information.

We would like these parts to be suited to the size of the features on each object. However, since important cues for faces and cars occur at many sizes, we need multiple attributes over a range of scales. We will define such attributes by making a joint decomposition in both space and frequency. Since low frequencies exist only over large areas and high frequencies can exist over small areas, we define attributes with large spatial extents to describe low frequencies and attributes with small spatial extents to describe high frequencies. The attributes that cover small spatial extents will be able to do so at high resolution. These attributes will capture small distinctive areas such as the eyes, nose, and mouth on a face and the grill, headlights, and tires on a car. Attributes defined over larger areas at lower resolution will be able to capture other important cues. On a face, the forehead is brighter than the eye sockets. On a car, various surfaces such as the hood, windshield, and fenders may differ in intensity.

We also decompose some attributes in orientation content. For example, an attribute that is specialized to horizontal features can devote greater representational power to horizontal features than if it also had to describe vertical features.

Finally, by decomposing the object spatially, we do not want to discard all relationships between the various parts. We believe that the spatial relationships of the parts is an important cue for detection. For example, on a human face, the eyes, nose, and mouth appear in a fixed geometric configuration. To model these geometric relationships, we represent the positions of each attribute sample with respect to a coordinate frame affixed to the object. This representation captures each sample's relative position with respect to all the others. With this representation, each histogram now becomes a joint distribution of attribute and attribute position, Pk(patternk(x, y), x, y | object) and Pk(patternk(x, y), x, y | non-object), where attribute position, x, y, is measured with respect to the rectangular image window we are classifying (see Section 2). However, we do not represent attribute position at the original resolution of the image. Instead, we represent position at a coarser resolution to save on modeling cost and to implicitly accommodate small variations in the geometric arrangements of parts.

3.3. Representation of Visual Attributes by Subsets of Quantized Wavelet Coefficients

To create visual attributes that are localized in space, frequency, and orientation, we need to be able to easily select information that is localized along these dimensions. In particular, we would like to transform the image into a representation that is jointly localized in space, frequency, and orientation. To do so, we perform a wavelet transform of the image.

The wavelet transform is not the only possible decomposition in space, frequency, and orientation. Both the short-term Fourier transform and pyramid algorithms can create such representations. Wavelets, however, produce no redundancy. Unlike these other transforms, we can perfectly reconstruct the image from its transform, where the number of transform coefficients is equal to the original number of pixels.

The wavelet transform organizes the image into subbands that are localized in orientation and frequency. Within each subband, each coefficient is spatially localized. We use a wavelet transform based on a 3-level decomposition using a 5/3 linear phase filter-bank [1], producing the 10 subbands shown in Figure 3. Each level in the transform represents a higher octave of frequencies. A coefficient in level 1 describes 4 times the area of a coefficient in level 2, which describes 4 times the area of a coefficient in level 3. In terms of orientation, LH denotes low-pass filtering in the horizontal direction and high-pass filtering in the vertical direction, that is, horizontal features. Similarly, HL represents vertical features.

Figure 3. Wavelet representation of an image (subband layout: level 1 LL, LH, HL, HH; level 2 LH, HL, HH; level 3 LH, HL, HH)

We use this representation as a basis for specifying visual attributes. Each attribute will be defined to sample a moving window of
transform coefficients. For example, one attribute could be defined to represent a 3x3 window of coefficients in the level 3 LH band. This attribute would capture high-frequency horizontal patterns over a small extent in the original image. Another pattern set could represent spatially registered 2x2 blocks in the LH and HL bands of the 2nd level. This would represent an intermediate frequency band over a larger spatial extent in the image.

Since each attribute must only take on a finite number of values, we will have to compute a vector quantization of its sampled wavelet coefficients. To keep histogram size under 1,000,000 bins, we would like to express each attribute by no more than 10,000 discrete values, since x, y (position) will together take on about 100 discrete values. To stay within this limit, each visual attribute will be defined to sample 8 wavelet coefficients at a time and will quantize each coefficient to 3 levels. This quantization scheme gives 3^8 = 6,561 discrete values for each visual attribute.

Overall, we use 17 attributes that sample the wavelet transform in groups of 8 coefficients in one of the following ways (these definitions are taken from [3]):

1. Intra-subband
- All the coefficients come from the same subband. These visual attributes are the most localized in frequency and orientation. We define 7 of these attributes for the following subbands: level 1 LL, level 1 LH, level 1 HL, level 2 LH, level 2 HL, level 3 LH, level 3 HL.

2. Inter-frequency - Coefficients come from the same orientation but multiple frequency bands. These attributes represent visual cues that span a range of frequencies, such as edges. We define 6 such attributes using the following subband pairs: level 1 LL - level 1 HL, level 1 LL - level 1 LH, level 1 LH - level 2 LH, level 1 HL - level 2 HL, level 2 LH - level 3 LH, level 2 HL - level 3 HL.

3. Inter-orientation - Coefficients come from the same frequency band but multiple orientation bands. These attributes can represent cues that have both horizontal and vertical components, such as corners. We define 3 such attributes using the following subband pairs: level 1 LH - level 1 HL, level 2 LH - level 2 HL, level 3 LH - level 3 HL.

4. Inter-frequency / inter-orientation - This combination is designed to represent cues that span a range of frequencies and orientations. We define one such attribute.
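The exhaustive position-and-scale search described in Section 2 — slide the fixed-size window over every position, then iteratively resize the image and repeat — can be sketched as follows. The window size, resize factor, scan step, and the `detector` callback are illustrative assumptions, not the paper's exact settings.

```python
def resize(image, factor):
    """Nearest-neighbor resize of a 2D list of pixels (an illustrative stand-in
    for whatever resampling the real system uses)."""
    h, w = int(len(image) * factor), int(len(image[0]) * factor)
    return [[image[int(r / factor)][int(c / factor)] for c in range(w)]
            for r in range(h)]

def scan_all_scales(image, detector, win_w=64, win_h=64, shrink=0.8, step=4):
    """Apply a fixed-size window detector at every position and scale.

    `detector` maps a win_h x win_w window to a score, or None for "no object".
    Detections are reported in original-image coordinates.
    """
    detections = []
    scale = 1.0
    while len(image) >= win_h and len(image[0]) >= win_w:
        for top in range(0, len(image) - win_h + 1, step):
            for left in range(0, len(image[0]) - win_w + 1, step):
                window = [row[left:left + win_w]
                          for row in image[top:top + win_h]]
                score = detector(window)
                if score is not None:
                    # Undo the accumulated shrink to recover original coordinates.
                    detections.append((left / scale, top / scale, score))
        image = resize(image, shrink)
        scale *= shrink
    return detections
```

Running several view-based detectors in parallel, as the paper does, would amount to calling this scan once per detector and keeping the strongest overlapping detection.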
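The decision rule of Eqs. (1) and (2) can be sketched in log form, where the product of per-attribute histogram ratios becomes a sum of log ratios. The histogram tables, the prior, and the `eps` floor for unseen bins are assumptions for illustration; the paper itself only specifies the likelihood ratio test and the product-of-histograms approximation.

```python
import math

def log_likelihood_ratio(attribute_samples, obj_hists, non_hists, eps=1e-6):
    """Sum of per-attribute log ratios: the log of the product in Eq. (2).

    attribute_samples: list of (k, value) pairs — attribute index and its
    quantized value (coarse position would be folded into `value`).
    obj_hists / non_hists: per-attribute dicts mapping value -> probability.
    """
    total = 0.0
    for k, value in attribute_samples:
        p_obj = obj_hists[k].get(value, eps)   # Pk(pattern_k | object)
        p_non = non_hists[k].get(value, eps)   # Pk(pattern_k | non-object)
        total += math.log(p_obj) - math.log(p_non)
    return total

def detect(attribute_samples, obj_hists, non_hists, prior_object=0.01):
    """Eq. (1): declare the object present when the likelihood ratio exceeds
    lambda = P(non-object)/P(object), i.e. the log ratio exceeds log(lambda)."""
    log_lambda = math.log((1.0 - prior_object) / prior_object)
    return log_likelihood_ratio(attribute_samples, obj_hists, non_hists) > log_lambda
```

Working in the log domain avoids numerical underflow when many attribute ratios are multiplied together.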
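Section 3.1 notes that estimating a histogram "simply involves counting how often each attribute value occurs in the training data." A minimal sketch of that counting step is below; the optional Laplace smoothing term is our addition (with `smoothing=0` this is the plain relative-frequency estimate the paper describes).

```python
from collections import Counter

def estimate_histogram(samples, num_values, smoothing=0.0):
    """Estimate Pk(pattern_k) from quantized training samples by counting.

    `samples` is an iterable of integer attribute values in [0, num_values).
    Returns a dict mapping each value to its estimated probability.
    """
    counts = Counter(samples)
    total = len(samples) + smoothing * num_values
    return {v: (counts.get(v, 0) + smoothing) / total
            for v in range(num_values)}
```

In the full system one such table would be built per attribute (and per class, object vs. non-object), indexed jointly by quantized value and coarse position.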
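The quantization scheme of Section 3.3 — 8 wavelet coefficients per attribute, each quantized to 3 levels, giving 3^8 = 6,561 discrete values — can be sketched as a base-3 encoding. The specific thresholds are hypothetical, since this excerpt only states the number of levels, not how they are chosen.

```python
def quantize_coefficient(c, low=-1.0, high=1.0):
    """Map one wavelet coefficient to one of 3 levels.
    The (low, high) thresholds are illustrative assumptions."""
    if c < low:
        return 0
    if c > high:
        return 2
    return 1

def attribute_code(coeffs):
    """Combine 8 quantized coefficients into a single discrete attribute
    value in [0, 3**8) = [0, 6561), suitable as a histogram bin index."""
    assert len(coeffs) == 8
    code = 0
    for c in coeffs:
        code = code * 3 + quantize_coefficient(c)
    return code
```

Pairing such a code with a coarse (x, y) position index yields the joint bin used by each histogram Pk(patternk(x, y), x, y | ·).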