{"title": "Learning to Parse Images", "book": "Advances in Neural Information Processing Systems", "page_first": 463, "page_last": 469, "abstract": null, "full_text": "Learning to Parse Images \n\nGeoffrey E. Hinton and Zoubin Ghahramani \n\nGatsby Computational Neuroscience Unit \n\nUniversity College London \n\nLondon, United Kingdom WC1N 3AR \n\n{hinton,zoubin}@gatsby.ucl.ac.uk \n\nDepartment of Computer Science \n\nUniversity of Toronto \n\nToronto, Ontario, Canada M5S 3G4 \n\nVee Whye Tah \n\nywteh@cs.utoronto.ca \n\nAbstract \n\nWe describe a class of probabilistic models that we call credibility \nnetworks. Using parse trees as internal representations of images, \ncredibility networks are able to perform segmentation and recog(cid:173)\nnition simultaneously, removing the need for ad hoc segmentation \nheuristics. Promising results in the problem of segmenting hand(cid:173)\nwritten digits were obtained. \n\n1 \n\nIntrod uction \n\nThe task of recognition has been the main focus of attention of statistical pattern \nrecognition for the past 40 years. The paradigm problem is to classify an object from \na vector of features extracted from the image. With the advent of backpropagation \n[1], the choice of features and the choice of weights to put on these features became \npart of a single, overall optimization and impressive performance was obtained for \nrestricted but important tasks such as handwritten character identification [2]. \nA significant weakness of many current recognition systems is their reliance on a \nseparate preprocessing stage that segments one object out of a scene and approx(cid:173)\nimately normalizes it. Systems in which segmentation precedes recognition suffer \nfrom the fact that the segmenter does not know the shape of the object it is seg(cid:173)\nmenting so it cannot use shape information to help it. Also, by segmenting an \nimage, we remove the object to be recognized from the context in which it arises. 
\nAlthough this helps in removing the clutter present in the rest of the image, it might also reduce the ability to recognize an object correctly, because the context in which an object arises gives a great deal of information about the nature of the object. Finally, each object can be described in terms of its parts, which can also be viewed as objects in their own right. This raises the question of how fine-grained the segmentations should be. In the words of David Marr: \"Is a nose an object? Is a head one? ... What about a man on a horseback?\" [3]. \n\nThe successes of structural linguistics inspired an alternative approach to pattern recognition in which the paradigm problem was to parse an image using a hierarchical grammar of scenes and objects. Within linguistics, the structural approach was seen as an advance over earlier statistical approaches, and for many years linguists eschewed probabilities, even though it had been known since the 1970s that a version of the EM algorithm could be used to fit stochastic context-free grammars. Structural pattern recognition inherited the linguists' aversion to probabilities, and as a result it never worked very well for real data. With the advent of graphical models it has become clear that structure and probabilities can coexist. Moreover, the \"explaining away\" phenomenon that is central to inference in directed acyclic graphical models is exactly what is needed for performing inferences about possible segmentations of an image. \nIn this paper we describe an image interpretation system which combines segmentation and recognition into the same inference process. The central idea is the use of parse trees of images. Graphical models called credibility networks, which describe the joint distribution over the latent variables and over the possible parse trees, are used. 
In section 2 we describe some current statistical models of image interpretation. In section 3 we develop credibility networks, and in section 4 we derive useful learning and inference rules for binary credibility networks. In section 5 we demonstrate that binary credibility networks are useful in solving the problem of classifying and segmenting binary handwritten digits. Finally, in section 6 we end with a discussion and directions for future research. \n\n2 Related work \n\nNeal [4] introduced generative models composed of multiple layers of stochastic logistic units connected in a directed acyclic graph. In general, as each unit has multiple parents, it is intractable to compute the posterior distribution over hidden variables when certain variables are observed. However, Neal showed that Gibbs sampling can be used effectively for inference [4]. Efficient methods of approximating the posterior distribution were introduced later [5, 6, 7], and these approaches were shown to yield good density models for binary images of handwritten digits [8]. The problem with these models that makes them inappropriate for modeling images is that they fail to respect the 'single-parent' constraint: in the correct interpretation of an image of opaque objects, each object-part belongs to at most one object; images need parse trees, not parse DAGs. \nMultiscale models [9] are interesting generative models for images that use a fixed tree structure. Nodes high up in the tree control large blocks of the image while bottom-level leaves correspond to individual pixels. Because a tree structure is used, it is easy to compute the exact posterior distribution over the latent (non-terminal) nodes given an image. As a result, the approach has worked much better than Markov random fields, which generally involve an intractable partition function. 
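As a miniature illustration of why inference is tractable in such fixed-tree models, the sketch below (a hypothetical example, not the models of [9]; NumPy assumed) computes the exact posterior over a binary root given binary leaves in a star-shaped tree by direct summation over the single latent node:

```python
import numpy as np

def root_posterior(prior, cpts, leaves):
    '''Exact posterior P(root = 1 | leaves) in a tiny fixed tree.

    prior  -- P(root = 1)
    cpts   -- array of shape (n, 2); cpts[i, r] = P(leaf_i = 1 | root = r)
    leaves -- observed binary leaf values
    '''
    like = np.ones(2)                        # likelihood of leaves for root = 0, 1
    for cpt, x in zip(cpts, leaves):
        like *= np.where(x, cpt, 1.0 - cpt)  # P(x_i | root)
    joint = like * np.array([1.0 - prior, prior])
    return joint[1] / joint.sum()
```

Because the graph is a tree, the leaf likelihoods factorize and the cost is linear in the number of leaves; with multiple parents per node, as in the networks developed below, no such factorization exists.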
A disadvantage is that there are serious block boundary artifacts, though overlapping trees can be used to smooth the transition from one block to another [10]. A more serious disadvantage is that the tree cannot possibly correspond to a parse tree because it is the same for every image. \n\nZemel, Mozer and Hinton [11] proposed a neural network model in which the activities of neurons are used to represent the instantiation parameters of objects or their parts, i.e. the viewpoint-dependent coordinate transformation between an object's intrinsic coordinate system and the image coordinate system. The weights on connections are then used to represent the viewpoint-invariant relationship between the instantiation parameters of a whole, rigid object and the instantiation parameters of its parts. This model captures viewpoint invariance nicely and corresponds to the way viewpoint effects are handled in computer graphics, but there was no good inference procedure for hierarchical models and no systematic way of sharing modules that recognize parts of objects among multiple competing object models. Simard et al. [12] noted that small changes in object instantiation parameters result in approximately linear changes in (real-valued) pixel intensities. These can be captured successfully by linear models. To model larger changes, many locally linear models can be pieced together. Hinton, Dayan and Revow [13] proposed a mixture of factor analyzers for this. Tipping and Bishop have recently shown how to make this approach much more computationally efficient [14]. To make the approach really efficient, however, it is necessary to have multiple levels of factor analyzers and to allow an analyzer at one level to be shared by several competing analyzers at the next level up. 
Deciding which subset of the analyzers at one level should be controlled by one analyzer at the level above is equivalent to image segmentation or the construction of part of a parse tree, and the literature on linear models contains no proposals on how to achieve this. \n\n3 A new approach to image interpretation \n\nWe developed a class of graphical models called credibility networks in which the possible interpretations of an image are parse trees, with nodes representing object-parts and containing latent variables. Given a DAG, the possible parse trees of an image are constrained to be individual trees or collections of trees in which each unit satisfies the single-parent constraint, with the leaves being the pixels of an image. Credibility networks describe a joint distribution over the latent variables and possible tree structures. The EM algorithm [15] can be used to fit credibility networks to data. \nLet i ∈ I be a node in the graph. There are three random variables associated with i. The first is a multinomial variate λ_i = {λ_ij}_{j∈pa(i)} which describes the parent of i from among the potential parents pa(i): \n\nλ_ij = 1 if the parent of i is j, and λ_ij = 0 if the parent of i is not j. (1) \n\nThe second is a binary variate s_i which determines whether the object i¹ is present (s_i = 1) or not (s_i = 0). The third is a latent variable x_i that describes the pose and deformation of the object. Let Λ = {λ_i : i ∈ I}, S = {s_i : i ∈ I} and X = {x_i : i ∈ I}. \nEach connection j → i also has three parameters. The first, c_ij, is an unnormalized prior probability that j is i's parent given that object j is present. The actual prior probability is \n\nπ_ij = c_ij s_j / Σ_{k∈pa(i)} c_ik s_k. (2) \n\nWe assume there is always a unit 1 ∈ pa(i) such that s_1 = 1. This acts as a default parent when no other potential parent is present and makes sure the denominator in (2) is never 0. 
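In an implementation, the prior of (2) is a one-line normalization. The sketch below (NumPy assumed; array names hypothetical) packs the candidate parents of one unit into parallel arrays, with index 0 playing the role of the always-present default parent:

```python
import numpy as np

def parent_priors(c, s):
    '''Normalized prior pi_ij over the candidate parents of one unit (eq. 2).

    c -- unnormalized prior weights c_ij, one per candidate parent j
    s -- binary presence indicators s_j; s[0] = 1 is the default parent,
         so the denominator below can never be zero
    '''
    w = c * s            # c_ij * s_j
    return w / w.sum()   # pi_ij

# Absent parents (s_j = 0) receive zero prior mass:
pi = parent_priors(np.array([0.1, 2.0, 3.0]), np.array([1, 1, 0]))
```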
The second parameter, p_ij, is the conditional probability that object i is present given that j is i's parent (λ_ij = 1). The third parameter t_ij characterizes the distribution of x_i given λ_ij = 1 and x_j. Let θ = {c_ij, p_ij, t_ij : i ∈ I, j ∈ pa(i)}. \nUsing Bayes' rule, the joint distribution over Λ, S and X given θ is P(Λ, S, X | θ) = P(Λ, S | θ) P(X | Λ, S, θ). Note that Λ and S together define a parse tree for the image. Given the parse tree, the distribution over latent variables P(X | Λ, S, θ) can be efficiently inferred from the image. The actual form of P(X | Λ, S, θ) is unimportant. The joint distribution over Λ and S is \n\nP(Λ, S | θ) = Π_{i∈I} Π_{j∈pa(i)} (π_ij p_ij^{s_i} (1 - p_ij)^{1-s_i})^{λ_ij}. (3) \n\n¹Technically this should be the object represented by node i. \n\n4 Binary credibility networks \n\nThe simulation results in section 5 are based on a simplified version of credibility networks in which the latent variables X are ignored. Notice that we can sum out Λ from the joint distribution (3), so that \n\nP(S | θ) = Π_{i∈I} Σ_{j∈pa(i)} π_ij p_ij^{s_i} (1 - p_ij)^{1-s_i}. (4) \n\nUsing Bayes' rule and dividing (3) by (4), we have \n\nP(Λ | S, θ) = Π_{i∈I} Π_{j∈pa(i)} ( c_ij s_j p_ij^{s_i} (1 - p_ij)^{1-s_i} / Σ_{k∈pa(i)} c_ik s_k p_ik^{s_i} (1 - p_ik)^{1-s_i} )^{λ_ij}. (5) \n\nLet r_ij = c_ij p_ij^{s_i} (1 - p_ij)^{1-s_i}. We can view r_ij as the unnormalized posterior probability that j is i's parent given that object j is present. The actual posterior is the fraction in (5): \n\nw_ij = r_ij s_j / Σ_{k∈pa(i)} r_ik s_k. (6) \n\nGiven some observations O ⊂ S, let H = S \\ O be the hidden variables. 
We approximate the posterior distribution for H using a factored distribution \n\nQ(H) = Π_{i∈I} α_i^{s_i} (1 - α_i)^{1-s_i}. (7) \n\nThe variational free energy F(Q, θ) = E_Q[-log P(S | θ) + log Q(S)] is \n\nF(Q, θ) = Σ_{i∈I} E_Q[log Σ_{j∈pa(i)} c_ij s_j - log Σ_{j∈pa(i)} c_ij s_j p_ij^{s_i} (1 - p_ij)^{1-s_i}] + Σ_{i∈I} (α_i log α_i + (1 - α_i) log(1 - α_i)). (8) \n\nThe negative of the free energy, -F, is a lower bound on the log likelihood of generating the observations O. The variational EM algorithm improves this bound by iteratively improving -F with respect to Q (E-step) and to θ (M-step). Let ch(i) be the possible children of i. The inference rule can be derived from (8): \n\nα_i = sigmoid( E_Q[log Σ_{j∈pa(i)} c_ij s_j p_ij - log Σ_{j∈pa(i)} c_ij s_j (1 - p_ij)] + Σ_{l∈ch(i)} E_Q[log Σ_{j∈pa(l)} r_lj s_j - log Σ_{j∈pa(l)} c_lj s_j] ). (9) \n\nLet D be the training set and Q_d be the mean-field approximation to the posterior distribution over H given the training example d ∈ D. Then the learning rules are \n\n(10) \n\n(11) \n\nFigure 1: Sample images from the test set. The classes of the two digits in each image in a row are given to the left. \n\nFor an efficient implementation of credibility networks using mean-field approximations, we still need to evaluate terms of the form E[log x] and E[1/x] where x is a weighted sum of binary random variates. In our implementation we used the simplest approximations: E[log x] ≈ log E[x] and E[1/x] ≈ 1/E[x]. Although biased, the implementation works well enough in general. \n\n5 Segmenting handwritten digits \n\nHinton and Revow [16] used a mixture of factor analyzers model to segment and estimate the pose of digit strings. When the digits do not overlap, the model was able to identify the digits present and segment the image easily. The hard cases are those in which two or more digits overlap significantly. 
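The posterior parent responsibilities w_ij, which also produce the segmentations shown below, can be sketched in the same array style (NumPy assumed; names hypothetical), with the binary states replaced by their mean-field probabilities in the spirit of the E[f(x)] ≈ f(E[x]) approximations described in section 4:

```python
import numpy as np

def parent_responsibilities(c, p, alpha_i, alpha_pa):
    '''Approximate posterior over the candidate parents of unit i.

    c, p     -- per-parent parameters c_ij and p_ij
    alpha_i  -- mean-field probability that unit i is on (stands in for s_i)
    alpha_pa -- mean-field probabilities of the parents (stand in for s_j)
    '''
    # r_ij = c_ij * p_ij^{s_i} * (1 - p_ij)^{1 - s_i}, with s_i
    # replaced by its mean-field expectation alpha_i
    r = c * p ** alpha_i * (1.0 - p) ** (1.0 - alpha_i)
    w = r * alpha_pa     # numerator of the posterior fraction
    return w / w.sum()   # w_ij
```

With hard (0/1) states this reduces to the exact fraction; with soft states it is a biased but cheap stand-in, consistent with the simple approximations used in our implementation.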
To assess the ability of credibility networks at segmenting handwritten digits, we used superpositions of digits at exactly the same location. This problem is much harder than segmenting digit strings in which digits only partially overlap. \n\nThe data used is a set of 4400 images of single digits from the classes 2, 3, 4 and 5, derived from the CEDAR CDROM 1 database [17]. Each image has size 16x16. The size of the credibility network is 256-64-4. The 64 middle-layer units are meant to encode low-level features, while each of the 4 top-level units is meant to encode a digit class. We used 700 images of single digits from each class to train the network, so it was not trained to segment images. During training we clamped at 1 the activation of the top-layer unit corresponding to the class of the digit in the current image while fixing the rest at 0. \nAfter training, the network was first tested on the 1600 images of single digits not in the training set. The predicted class of each image was taken to be the class corresponding to the top-layer unit with the highest activation. The error rate was 5.5%. We then showed the network 120 images of two overlapping digits from distinct classes. There were 20 images per combination of two classes. Some examples are given in Figure 1. The predicted classes of the two digits are chosen to be the classes corresponding to the 2 top-layer units with the highest activations. A human subject (namely the third author) was tested on the same test set. The network achieved an error rate of 21.7% while the author erred on 19.2% of the images. \n\nFigure 2: Segmentations of pairs of digits. (To make comparisons easier we show the overlapping image in both columns of a)-l).) \n\nWe can in fact produce a segmentation of each image into an image for each class present. 
Recall that, given the values of S, the posterior probability of unit j being pixel i's parent is w_ij. Then the posterior probability of pixel i belonging to digit class k is Σ_j E_Q[w_ij w_jk]. \nThis gives a simple way to segment the image. Figure 2 shows a number of segmentations. Note that for each pixel, the sum of the probabilities of the pixel belonging to each digit class is 1. To make the picture clearer, a white pixel means a probability of ≤ 0.1 of belonging to a class, while black means a probability of ≥ 0.6, and the intensity of a gray pixel describes the size of the probability if it is between 0.1 and 0.6. Figures 2a) to 2f) show successful segmentations, while Figures 2g) to 2l) show unsuccessful segmentations. \n\n6 Discussion \n\nUsing parse trees as the internal representations of images, credibility networks avoid the usual problems associated with a bottom-up approach to image interpretation. Segmentation can be carried out in a statistically sound manner, removing the need for hand-crafted ad hoc segmentation heuristics. The granularity problem for segmentation is also resolved, since credibility networks use parse trees as internal representations of images. The parse trees describe the segmentations of the image at every level of granularity, from individual pixels to the whole image. \n\nWe plan to develop and implement credibility networks in which each latent variable x_i is a multivariate Gaussian, so that a node can represent the position, orientation and scale of a 2D or 3D object, and the conditional probability models on the links can represent the relationship between a moderately deformable object and its parts. \n\nAcknowledgments \n\nWe thank Chris Williams, Stuart Russell and Phil Dawid for helpful discussions and NSERC and ITRC for funding. \n\nReferences \n\n[1] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. 
Learning internal representations by error propagation. In D. E. Rumelhart, J. L. McClelland, and the PDP Research Group, editors, Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Volume 1: Foundations. The MIT Press, 1986. \n\n[2] Y. Le Cun, B. Boser, J. S. Denker, S. Solla, R. E. Howard, and L. D. Jackel. Back-propagation applied to handwritten zip code recognition. Neural Computation, 1(4):541-551, 1989. \n\n[3] D. Marr. Vision: A Computational Investigation into the Human Representation and Processing of Visual Information. W. H. Freeman and Company, San Francisco, 1980. \n\n[4] R. M. Neal. Connectionist learning of belief networks. Artificial Intelligence, 56:71-113, 1992. \n\n[5] P. Dayan, G. E. Hinton, R. M. Neal, and R. S. Zemel. The Helmholtz machine. Neural Computation, 7:1022-1037, 1995. \n\n[6] G. E. Hinton, P. Dayan, B. J. Frey, and R. M. Neal. The wake-sleep algorithm for unsupervised neural networks. Science, 268:1158-1161, 1995. \n\n[7] L. K. Saul and M. I. Jordan. Attractor dynamics in feedforward neural networks. Submitted for publication. \n\n[8] B. J. Frey, G. E. Hinton, and P. Dayan. Does the wake-sleep algorithm produce good density estimators? In D. Touretzky, M. Mozer, and M. Hasselmo, editors, Advances in Neural Information Processing Systems, volume 8. The MIT Press, 1995. \n\n[9] M. R. Luettgen and A. S. Willsky. Likelihood calculation for a class of multiscale stochastic models, with application to texture discrimination. IEEE Transactions on Image Processing, 4(2):194-207, 1995. \n\n[10] W. W. Irving, P. W. Fieguth, and A. S. Willsky. An overlapping tree approach to multiscale stochastic modeling and estimation. IEEE Transactions on Image Processing, 1995. \n\n[11] R. S. Zemel, M. C. Mozer, and G. E. Hinton. TRAFFIC: Recognizing objects using hierarchical reference frame transformations. 
In Advances in Neural Information Processing Systems, volume 2. Morgan Kaufmann Publishers, San Mateo, CA, 1990. \n\n[12] P. Simard, Y. Le Cun, and J. Denker. Efficient pattern recognition using a new transformation distance. In S. Hanson, J. Cowan, and L. Giles, editors, Advances in Neural Information Processing Systems, volume 5. Morgan Kaufmann Publishers, San Mateo, CA, 1992. \n\n[13] G. E. Hinton, P. Dayan, and M. Revow. Modeling the manifolds of images of handwritten digits. IEEE Transactions on Neural Networks, 8:65-74, 1997. \n\n[14] M. E. Tipping and C. M. Bishop. Mixtures of probabilistic principal component analysers. Technical Report NCRG/97/003, Aston University, Department of Computer Science and Applied Mathematics, 1997. \n\n[15] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, 39:1-38, 1977. \n\n[16] G. E. Hinton and M. Revow. Using mixtures of factor analyzers for segmentation and pose estimation, 1997. \n\n[17] J. J. Hull. A database for handwritten text recognition research. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(5):550-554, 1994. \n", "award": [], "sourceid": 1710, "authors": [{"given_name": "Geoffrey", "family_name": "Hinton", "institution": null}, {"given_name": "Zoubin", "family_name": "Ghahramani", "institution": null}, {"given_name": "Yee Whye", "family_name": "Teh", "institution": null}]}