{"title": "Segregated Graphs and Marginals of Chain Graph Models", "book": "Advances in Neural Information Processing Systems", "page_first": 1720, "page_last": 1728, "abstract": "Bayesian networks are a popular representation of asymmetric (for example causal) relationships between random variables. Markov random fields (MRFs) are a complementary model of symmetric relationships used in computer vision, spatial modeling, and social and gene expression networks. A chain graph model under the Lauritzen-Wermuth-Frydenberg interpretation (hereafter a chain graph model) generalizes both Bayesian networks and MRFs, and can represent asymmetric and symmetric relationships together.As in other graphical models, the set of marginals from distributions in a chain graph model induced by the presence of hidden variables forms a complex model. One recent approach to the study of marginal graphical models is to consider a well-behaved supermodel. Such a supermodel of marginals of Bayesian networks, defined only by conditional independences, and termed the ordinary Markov model, was studied at length in (Evans and Richardson, 2014).In this paper, we show that special mixed graphs which we call segregated graphs can be associated, via a Markov property, with supermodels of a marginal of chain graphs defined only by conditional independences. Special features of segregated graphs imply the existence of a very natural factorization for these supermodels, and imply many existing results on the chain graph model, and ordinary Markov model carry over. Our results suggest that segregated graphs define an analogue of the ordinary Markov model for marginals of chain graph models.", "full_text": "Segregated Graphs and Marginals of Chain Graph\n\nModels\n\nIlya Shpitser\n\nDepartment of Computer Science\n\nJohns Hopkins University\nilyas@cs.jhu.edu\n\nAbstract\n\nBayesian networks are a popular representation of asymmetric (for example\ncausal) relationships between random variables. 
Markov random \ufb01elds (MRFs)\nare a complementary model of symmetric relationships used in computer vision,\nspatial modeling, and social and gene expression networks. A chain graph model\nunder the Lauritzen-Wermuth-Frydenberg interpretation (hereafter a chain graph\nmodel) generalizes both Bayesian networks and MRFs, and can represent asym-\nmetric and symmetric relationships together.\nAs in other graphical models, the set of marginals from distributions in a chain\ngraph model induced by the presence of hidden variables forms a complex model.\nOne recent approach to the study of marginal graphical models is to consider a\nwell-behaved supermodel. Such a supermodel of marginals of Bayesian networks,\nde\ufb01ned only by conditional independences, and termed the ordinary Markov\nmodel, was studied at length in [6].\nIn this paper, we show that special mixed graphs which we call segregated graphs\ncan be associated, via a Markov property, with supermodels of marginals of chain\ngraphs de\ufb01ned only by conditional independences. Special features of segregated\ngraphs imply the existence of a very natural factorization for these supermod-\nels, and imply many existing results on the chain graph model, and the ordinary\nMarkov model carry over. Our results suggest that segregated graphs de\ufb01ne an\nanalogue of the ordinary Markov model for marginals of chain graph models.\nWe illustrate the utility of segregated graphs for analyzing outcome interference\nin causal inference via simulated datasets.\n\n1\n\nIntroduction\n\nGraphical models are a \ufb02exible and widely used tool for modeling and inference in high dimensional\nsettings. Directed acyclic graph (DAG) models, also known as Bayesian networks [11, 8], are often\nused to model relationships with an inherent asymmetry, perhaps induced by a temporal order on\nvariables, or cause-effect relationships. 
Models represented by undirected graphs (UGs), such as\nMarkov random \ufb01elds (MRFs), are used to model symmetric relationships, for instance proximity in\nsocial graphs, expression co-occurrence in gene networks, coinciding magnetization of neighboring\natoms, or similar colors of neighboring pixels in an image.\nSome graphical models can represent both symmetric and asymmetric relationships together. One\nsuch model is the chain graph model under the Lauritzen-Wermuth-Frydenberg interpretation, which\nwe will shorten to \u201cthe chain graph model.\u201d We will not consider the chain graph model under the\nAndersen-Madigan-Perlman (AMP) interpretation, or other chain graph models [22, 1] discussed in\n[4] in this paper. Just as the DAG models and MRFs, the chain graph model has a set of equivalent\n(under some assumptions) de\ufb01nitions via a set of Markov properties, and a factorization.\n\n1\n\n\fModeling and inference in multivariate settings is complicated by the presence of hidden, yet rel-\nevant variables. Their presence motivates the study of marginal graphical models. Marginal DAG\nmodels are complicated objects, inducing not only conditional independence constraints, but also\nmore general equality constraints such as the \u201cVerma constraint\u201d [21], and inequality constraints\nsuch as the instrumental variable inequality [3], and the Bell inequality in quantum mechanics [2].\nOne approach to studying marginal DAG models has therefore been to consider tractable supermod-\nels de\ufb01ned by some easily characterized set of constraints, and represented by a mixed graph. One\nsuch supermodel, de\ufb01ned only by conditional independence constraints induced by the underlying\nhidden variable DAG on the observed margin, is the ordinary Markov model, studied in depth in [6].\nAnother supermodel, de\ufb01ned by generalized independence constraints including the Verma con-\nstraint [21] as a special case, is the nested Markov model [16]. 
There is a rich literature on Markov\nproperties of mixed graphs, and corresponding independence models. See for instance [15, 14, 7].\nIn this paper, we adapt a similar approach to the study of marginal chain graph models. Speci\ufb01cally,\nwe consider a supermodel de\ufb01ned only by conditional independences on observed variables of a\nhidden variable chain graph, and ignore generalized equality constraints and inequalities. We show\nthat we can associate this supermodel with special mixed graphs which we call segregated graphs via\na global Markov property. Special features of segregated graphs imply the existence of a convenient\nfactorization, which we show is equivalent to the Markov property for positive distributions. This\nequivalence, along with properties of the factorization, implies many existing results on the chain\ngraph model, and the ordinary Markov model carry over.\nThe paper is organized as follows. Section 2 describes a motivating example from causal inference\nfor the use of hidden variable chain graphs, with details deferred until section 6. In section 3, we in-\ntroduce the necessary background on graphs and probability theory, de\ufb01ne segregated graphs (SGs)\nand an associated global Markov property, and show that the global Markov properties for DAG\nmodels, chain graph models, and the ordinary Markov model induced by hidden variable DAGs are\nspecial cases. In section 4, we de\ufb01ne the model of conditional independence induced by hidden vari-\nable chain graphs, and show it can always be represented by a SG via an appropriate global Markov\nproperty. In section 5, we de\ufb01ne segregated factorization and show that under positivity, the global\nMarkov property in section 4 and segregated factorization are equivalent. 
In section 6, we introduce causal inference and interference analysis as an application domain for hidden variable chain graph models, and thus for SGs, and discuss a simulation study that illustrates our results and shows how parameters of the model represented by a SG can directly encode parameters representing outcome interference in the underlying hidden variable chain graph. Section 7 contains our conclusions. We will provide outlines of arguments for our claims below, but will generally defer detailed proofs to the supplementary material.

2 Motivating Example: Interference in Causal Inference

Consider a dataset obtained from a placebo-controlled vaccination trial, described in [20], consisting of mother/child pairs where the children were vaccinated against pertussis. Though the mothers were not vaccinated directly, because the children were vaccinated and each mother will generally only contract pertussis from her child, we suspect the children's vaccine may have a protective effect on the mothers. At the same time, if only the mothers but not the children were vaccinated, we would expect the same protective effect to operate in reverse. This is an example of interference, an effect of treatment on experimental units other than those to which the treatment was administered. The relationship between the outcomes of mother and child due to interference in this case has some features of a causal relationship, but is symmetric.

We model this study by the chain graph shown in Fig. 1 (a); see section 6 for a justification of this model. Here B1 is the vaccine (or placebo) given to children, and Y1 is the children's outcomes. B2 is the treatment given to mothers (in our case no treatment), and Y2 is the mothers' outcomes. Directed edges represent the direct causal effect of treatment on a unit, and the undirected edge represents the interference relationship within the mother/child outcome pair. 
In this model (B1 ⊥⊥ B2) (mother and child treatments are assigned independently), and (Y1 ⊥⊥ B2 | B1, Y2), (Y2 ⊥⊥ B1 | B2, Y1) (mother's outcome is independent of the child's treatment if we know the child's outcome and the mother's treatment, and vice versa). Since treatments in this study were randomly assigned, there are no unobserved confounders.

[Figure 1 here.]

Figure 1: (a) A chain graph representing the mother/child vaccination example in [20]. (b) A more complex vaccination example with a followup booster shot. (c) A naive generalization of the latent projection idea applied to (b), where ↔ and − edges meet. (d) A segregated graph preserving conditional independences in (b) not involving U.

Consider, however, a more complex example, where both mother and child are given the initial vaccine A, but, possibly based on the results W of a followup visit, children are given a booster B1, and we consider the child's (Y1) and the mother's (Y2) outcomes, where the same kind of interference relationship is operating. We model the child's unobserved health status, which influences both W and Y1, by a (possibly very high dimensional) hidden variable U. The result is the hidden variable chain graph in Fig. 1 (b). Since U is unobserved and possibly very complex, modeling it directly may lead to model misspecification. An alternative, explored for instance in [13, 6, 16], is to consider a model defined by conditional independences induced by the hidden variable model in Fig. 
1 (b) on observed variables A, B1, W, Y1, Y2.

A simple approach that directly generalizes what had been done in DAG models is to encode conditional independences via a path separation criterion on a mixed graph constructed from a hidden variable chain graph via a latent projection operation [21]. The difficulty with this approach is that simple generalizations of latent projections to the chain graph case may yield graphs where ↔ and − edges meet, as happens in Fig. 1 (c). This is an undesirable feature of a graphical representation, since existing factorization and parameterization results for chain graphs or ordinary Markov models, which decompose the joint distribution into pieces corresponding to sets connected by − or ↔ edges, do not generalize.

In the remainder of the paper, we show that for any hidden variable chain graph it is always possible to construct a (not necessarily unique) mixed graph called a segregated graph (SG) in which ↔ and − edges do not meet, and which preserves all conditional independences on the observed variables. One SG for our example is shown in Fig. 1 (d). Conditional independences implied by this graph are B1 ⊥⊥ A | W and Y2 ⊥⊥ W, B1 | A, Y1. Properties of SGs imply existing results on chain graphs and the ordinary Markov model carry over with little change. For example, we may directly apply the parameterization in [6], and the fitting algorithm in [5], to the model corresponding to Fig. 1 (d) if the state spaces are discrete, as we illustrate in section 6.1. The construction we give for SGs may replace undirected edges by directed edges in a way that may break the symmetry of the underlying interference relationship. 
Thus, directed edges in a SG do not have a straightforward causal interpretation.

3 Background and Preliminaries

We will consider mixed graphs with three types of edges: undirected (−), bidirected (↔), and directed (→), where a pair of vertices is connected either by a single edge, or by a pair of edges one of which is directed and one bidirected. We will denote an edge as an ordered pair of vertices with a subscript indicating the type of edge, for example (AB)→. We will suppress the subscript if edge orientation is not important. An alternating sequence of nodes and edges of the form A1, (A1A2), A2, (A2A3), A3, . . . , Ai−1, (Ai−1Ai), Ai, where we allow Ai = Aj if i ≠ j ± 1, is called a walk (in some references also a route). We will denote walks by lowercase Greek letters. A walk with non-repeating edges is called a trail. A trail with non-repeating vertices is called a path. A directed cycle is a trail of the form A1, (A1A2)→, A2, . . . , Ai, (AiA1)→, A1. A partially directed cycle is a trail with −, → edges, and at least one → edge, where there exists a way to orient the − edges to create a directed cycle. We will sometimes write a path from A to B where intermediate vertices are not important, but edge orientation is, as, for example, A → ◦ − . . . − ◦ − B.
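The graph classes above are easy to check mechanically. Below is a minimal sketch (our own illustration, not code from the paper; the class name `MixedGraph` and its representation are assumptions) of a mixed graph with −, →, and ↔ edges, together with a test of the segregated condition. Since a pair of vertices is joined by at most a single edge (or a directed/bidirected pair), a path A − B ↔ C exists exactly when some vertex is adjacent to both a − edge and a ↔ edge; and a partially directed cycle exists exactly when some → edge lies on a directed cycle after replacing every − edge by edges in both directions.

```python
from itertools import chain


class MixedGraph:
    """Mixed graph with undirected (-), directed (->), and bidirected (<->)
    edges, stored as edge sets over hashable vertex names."""

    def __init__(self, vertices, undirected=(), directed=(), bidirected=()):
        self.vertices = set(vertices)
        self.undirected = {frozenset(e) for e in undirected}
        self.directed = set(directed)  # (tail, head) pairs
        self.bidirected = {frozenset(e) for e in bidirected}

    def has_partially_directed_cycle(self):
        """Treat each - edge as a pair of opposite -> edges; a partially
        directed cycle exists iff some -> edge's head can reach its tail."""
        succ = {v: set() for v in self.vertices}
        for a, b in self.directed:
            succ[a].add(b)
        for e in self.undirected:
            a, b = tuple(e)
            succ[a].add(b)
            succ[b].add(a)

        def reaches(src, dst):
            stack, seen = [src], set()
            while stack:
                v = stack.pop()
                if v == dst:
                    return True
                if v in seen:
                    continue
                seen.add(v)
                stack.extend(succ[v])
            return False

        return any(reaches(h, t) for t, h in self.directed)

    def is_segregated(self):
        """No vertex adjacent to both a - and a <-> edge (i.e. no A - B <-> C
        path), and no partially directed cycle."""
        und = set(chain.from_iterable(self.undirected))
        bi = set(chain.from_iterable(self.bidirected))
        return not (und & bi) and not self.has_partially_directed_cycle()


# A graph in the style of Fig. 1 (c): - and <-> edges meet at B, so it is
# not segregated; the second graph is a chain-graph-like SG.
g_bad = MixedGraph("ABC", undirected=[("A", "B")], bidirected=[("B", "C")])
g_ok = MixedGraph("ABCD", directed=[("A", "B"), ("B", "C")], undirected=[("C", "D")])
```

The same check also rejects partially directed cycles, e.g. A → B with undirected edges B − C and A − C, which can be oriented into the directed cycle A → B → C → A.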
DAGs are special cases of ADMGs and CGs, which are special cases of SGs.

We consider sets of distributions over a set V defined by independence constraints linked to the above types of graphs via (global) Markov properties. We will refer to V as either vertices in a graph or random variables in a distribution; which is meant will be clear from context.

A Markov model of a graph G defined via a global Markov property has the general form

P(G) ≡ { p(V) | (∀ A ˙∪ B ˙∪ C ⊆ V), (A ⊥⊥ B | C)G ⇒ (A ⊥⊥ B | C)p(V) },

where the consequent means "A is independent of B conditional on C in p(V)," and the antecedent means "A is separated from B given C according to a certain walk separation property in G." Since DAGs, ADMGs, and CGs are special cases of SGs, we will define the appropriate path separation property for SGs, which will recover known separation properties in DAGs, ADMGs and CGs as special cases.

A walk α contained in another walk β is called a subwalk of β. A maximal subwalk in β where all edges are undirected is called a section of β. A section may consist of a single node and no edges. We say a section α of a walk β is a collider section if the edges in β immediately preceding and following α both contain arrowheads into α. Otherwise, α is a non-collider section. A walk β from A to B is said to be s-separated by a set C in a SG G if there exists a collider section α that does not contain an element of C, or a non-collider section that does (such a section is called blocked). 
A is said to be s-separated from B given C in a SG G if every walk from a vertex in A to a vertex in B is s-separated by C, and is s-connected given C otherwise.

Lemma 3.1 The Markov properties defined by superactive routes (walks) [17] in CGs, m-separation [14] in ADMGs, and d-separation [11] in DAGs are special cases of the Markov property defined by s-separation in SGs.

4 A Segregated Graph Representation of CG Independence Models

For a SG G, and W ⊂ V, define the model P(G)W to be the set of distributions in which all conditional independences implied by G on the margin ΣW p(V) hold. That is,

P(G)W ≡ { p(V \ W) | (∀ A ˙∪ B ˙∪ C ⊆ V \ W), (A ⊥⊥ B | C)G ⇒ (A ⊥⊥ B | C)p(V\W) }.

P(G1)W1 may equal P(G2)W2 even if G1, W1 and G2, W2 are distinct. If W is empty, P(G)W simply reduces to the Markov model defined by s-separation on the entire graph.

We are going to show that there is always a SG that represents the conditional independences that define P(G)W, using a special type of vertex we call sensitive. A vertex V in an SG G is sensitive if for any other vertex W, whenever W → ◦ − . . . − ◦ − V exists in G, then W → V exists in G. We first show that if V is sensitive, we can orient all undirected edges away from V, and this results in a new SG that gives the same set of conditional independences via s-separation. This is Lemma 4.1. Next, we show that for any V with a child Z with adjacent undirected edges, if Z is not sensitive, we can make it sensitive by adding appropriate edges, and this results in a new SG that preserves all conditional independences that do not involve V. This is Lemma 4.3. 
Given the above, for any vertex V in a SG G, we can construct a new SG that preserves all conditional independences in G that do not involve V, and where no children of V have adjacent undirected edges. This is Lemma 4.4. We then "project out V" to get another SG that preserves all conditional independences not involving V in G. This is Theorem 4.1. We are then done: Corollary 4.1 states that there is always a (not necessarily unique) SG for the conditional independence structure of a marginal of a CG.

Lemma 4.1 For V sensitive in a SG G, let G⟨V⟩ be the graph obtained from G by replacing all − edges adjacent to V by → edges pointing away from V. Then G⟨V⟩ is an SG, and P(G) = P(G⟨V⟩).
To see the converse, note that de\ufb01nition of s-separation, coupled with our inability to\ncondition on V can always be used to create dependence between W and Z, the vertices joined by\nan edge in the supergraph explicitly. This dependence can be created regardless of the conditioning\nset, either via the path W \u2192 \u25e6\u2212 . . .\u2212\u25e6\u2212 Z, or via the walk path W \u2192 \u25e6\u2212 . . .\u2212\u25e6\u2212 Z \u2190 V \u2192 Z.\nIt can thus be shown that adding these edges does not remove any independences.\nLemma 4.4 Let V be a vertex in a SG G with at least two vertices. Then there exists an SG GV\nwhere V \u2192 \u25e6 \u2212 \u25e6 does not exist, and P(G)V = P(GV )V .\n\n(cid:3)\nProof: This follows by an inductive application of Lemmas 4.1, 4.2, and 4.3.\nNote that Lemma 4.4 does not guarantee that the graph GV is unique. In fact, depending on the order\nin which we apply the induction, we may obtain different SGs with the required property.\nTheorem 4.1 If G is an SG with at least 2 vertices V, and V \u2208 V, there exists an SG GV with\nvertices V \\ {V } such that P(G)V = P(GV )V .\nThis theorem exploits previous results to construct a graph which agrees with G on all independences\nnot involving V and which does not contain children of V that are a part of the block with size greater\nthan two. Given a graph with this structure, we can adapt the latent projection construction to yield\na SG that preserves all independences.\nCorollary 4.1 Let G be an SG with vertices V. 
Then for any W ⊂ V, there exists an SG G∗ with vertices V \ W such that P(G)W = P(G∗).

5 Segregated Factorization

We now show that, for positive distributions, the Markov property we defined and a certain factorization for SGs give the same model.

A set of vertices that form a connected component in the graph obtained from G by dropping all edges except ↔, and where no vertex is adjacent to a − edge in G, is called a district in G. A non-trivial block is a set of vertices forming a connected component of size two or more in the graph obtained from G by dropping all edges except −. We denote the set of districts, and the set of non-trivial blocks, in G by D(G) and B∗(G), respectively. It is trivial to show that in a SG G with vertices V, D(G) and B∗(G) partition V.

For a vertex set S in G, define pasG(S) ≡ {W ∉ S | (WV)→ is in G, V ∈ S}, and pa∗G(S) ≡ pasG(S) ∪ S. For A ⊆ V in G, let GA be the subgraph of G containing only vertices in A and edges between them. The anterior of a set S, denoted by antG(S), is the set of vertices V with a partially directed path into a node in S. A set A ⊆ V is called anterial in G if whenever V ∈ A, antG(V) ⊆ A. We denote the set of non-empty anterial subsets of V in G by A(G). Let Da(G) ≡ ∪A∈A(G) D(GA). A clique in an UG G is a maximal set of pairwise adjacent vertices. The set of cliques in an UG G will be denoted by C(G). A vertex ordering ≺ is topological for a SG G if whenever V ≺ W, W ∉ antG(V). For a vertex V in G, and a topological ≺, define preG,≺(V) ≡ {W ≠ V | W ≺ V}.

[Figure 2 here.]

Figure 2: (a) A simple causal DAG model. 
(b),(c) Causal DAG models for interference. (d) A causal DAG representing a Markov chain with an equilibrium distribution in the chain graph model in Fig. 1 (a).

Given a SG G, define the augmented graph Ga to be an undirected graph with the same vertex set as G where A, B share an undirected edge in Ga if A, B are connected by a walk consisting exclusively of collider sections in G (note that this trivially includes all A, B that share an edge). We say p(V) satisfies the augmented global Markov property with respect to a SG G if for any A ∈ A(G), p(A) satisfies the UG global Markov property with respect to (GA)a. We denote the model, that is, the set of p(V) satisfying this property with respect to G, by Pa(G).

By analogy with the ordinary Markov model and the chain graph model, we say that p(V) obeys the segregated factorization with respect to a SG G if there exists a set of kernels [8] { fS(S | pasG(S)) | S ∈ Da(G) ∪ B∗(G) } such that for every A ∈ A(G), p(A) = ∏S∈D(GA)∪B∗(GA) fS(S | pasG(S)), and for every S ∈ B∗(G), fS(S | pasG(S)) = ∏C∈C((Gpa∗G(S))a) φC(C), where φC(C) is a mapping from values of C to non-negative reals.

Lemma 5.1 If p(V) factorizes with respect to G then fS(S | pasG(S)) = p(S | pasG(S)) for every S ∈ B∗(G), and fS(S | pasG(S)) = ∏V∈S p(V | preG,≺(V) ∩ antG(S)) for every S ∈ Da(G) and any topological ordering ≺ on G.

Theorem 5.1 If p(V) factorizes with respect to a SG G, then p(V) ∈ Pa(G).

Lemma 5.2 If there exists a walk α in G between A ∈ A, B ∈ B with all non-collider sections not intersecting C, and all collider sections in antG(A ∪ B ∪ C), then there exist A∗ ∈ A, B∗ ∈ B such that A∗ and B∗ are s-connected given C in G.

Theorem 5.2 P(G) = Pa(G).

Theorem 5.3 For a SG G, if p(V) ∈ 
P(G) and is positive, then p(V) factorizes with respect to G.

Corollary 5.1 For any SG G, if p(V) is positive, then p(V) ∈ P(G) if and only if p(V) factorizes with respect to G.

6 Causal Inference and Interference Analysis

In this section we briefly describe interference analysis in causal inference, as a motivation for the use of SGs. Causal inference is concerned with using observational data to infer cause-effect relationships, as encoded by interventions (setting variable values from "outside the model"). Causal DAGs are often used as a tool, where directed arrows represent causal relationships, not just statistical relevance. See [12] for an extensive discussion of causal inference. Much recent work on interference in causal inference, see for instance [10, 19], has generalized causal DAG models to settings where an intervention given to a subject affects other subjects. A classic example is herd immunity in epidemiology: vaccinating a subset of subjects can render all subjects, even those who were not vaccinated, immune. Interference is typically encoded by having vertices in a causal diagram represent not response variability in a population, but responses of individual units, or appropriately defined groups of units, where interference only occurs between groups, not within a

[Figure 3 here.]

Figure 3: (a) χ2 density with 14 degrees of freedom (red) and a histogram of observed deviances of ordinary Markov models of Fig. 1 (d) fitted with data sampled from a randomly sampled model of Fig. 1 (b). (b) Y axis: values of the parameters p(Y2 = 0 | Y1 = 0, A = 0) (red) and p(Y2 = 0 | Y1 = 1, A = 0) (green) in the fitted ordinary Markov model of Fig. 1 (d). X axis: value of the interaction parameter λ{Y1,Y2} (and 3 · λ{A,Y1,Y2}) in the underlying chain graph model for Fig. 1. (c) Same plot with p(Y2 = 0 | Y1 = 0, A = 1) (yellow) and p(Y2 = 0 | Y1 = 1, A = 1) (blue).

group. For example, the DAG in Fig. 2 (b) represents a generalization of the model in Fig. 2 (a) to a setting with unit pairs, where assigning a vaccine to one unit may also influence another unit, as was the case in the example in Section 2. Furthermore, we may consider more involved examples of interference if we record responses over time, as is shown in Fig. 2 (c). Extensive discussions of this type of modeling approach can be found in [18, 10].

We consider an alternative approach to encoding interference between responses using chain graph models. We give two justifications for the use of chain graphs. First, we may assume that interference arises as a dependence between responses Y1 and Y2 in the equilibrium of a Markov chain where transition probabilities represent the causal influence of Y1 on Y2, and vice versa, at multiple points in time before equilibrium is reached. Under certain assumptions [9], it can be shown that such an equilibrium distribution obeys the Markov property of a chain graph. For example, the DAG shown in Fig. 2 (d) encodes transition probabilities p(Y1^{t+1}, Y2^{t+1} | Y1^t, Y2^t, a1, a2) = p(Y1^{t+1} | Y1^t, Y2^t, a1, a2) p(Y2^{t+1} | Y2^t, Y1^{t+1}, a1, a2), for particular values a1, a2. For suitably chosen conditional distributions, these transition probabilities lead to an equilibrium distribution that lies in the model corresponding to the chain graph in Fig. 1 (a) [9]. Second, we may consider certain independence assumptions in our problem as reasonable, and sometimes such assumptions lead naturally to a chain graph model. For example, we may study the effect of a marketing intervention in a social network, and consider it reasonable that we can predict the response of any person knowing only the treatment for that person and the responses of all friends of this person in the social network (in other words, the treatments of everyone else are irrelevant given this information). These assumptions result in a response model that is a chain graph with directed arrows from treatment to every person's response, and undirected edges between friends only.

6.1 An Example of Interference Analysis Using Segregated Graph Models

Given the ubiquity of unobserved confounding variables in causal inference, and our choice of chain graphs for modeling interference, we use models represented by SGs to avoid having to deal with a hidden variable chain graph model directly, due to the possibility of misspecifying the likely high dimensional hidden variables involved. We briefly describe a simulation we performed to illustrate how SGs may be used for interference analysis.

As a running example, we used the model shown in Fig. 1 (b), with A, W, B1, Y1, Y2 binary, and U 15-valued. We first considered the following family of parameterizations. In all members of this family, A was assigned via a fair coin; p(W | A, U) was a logistic model with no interactions; and B1 was randomly assigned via a fair coin given no complications (W = 1), otherwise B1 was heavily weighted (0.8 probability) towards treatment assignment. The distribution p(Y1, Y2 | U, B1, A) was obtained from a joint distribution p(Y1, Y2, U, B1, A) in a log-linear model of an undirected graph G of the form (1/Z) exp(ΣC (−1)^{‖xC‖1} λC), where C ranges over all cliques in G, ‖.‖1 is the L1-norm, λC are interaction parameters, and Z is a normalizing constant. In our case G was an undirected graph over A, B1, U, Y1, Y2 where the edges from Y2 to B1 and U were missing, and all other edges were present. Parameters λC were generated from N(0, 0.3). 
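For small state spaces, a log-linear family of this kind is straightforward to reproduce by brute-force enumeration. The sketch below is our own illustration, not the paper's code: it uses a smaller hypothetical clique set with all variables binary (rather than the paper's 15-valued U), builds the joint p(x) ∝ exp(ΣC (−1)^{‖xC‖1} λC), normalizes by explicit summation, and draws samples, with each λC drawn from N(0, 0.3) as in the text.

```python
import itertools
import math
import random


def loglinear_joint(variables, cliques, lam):
    """Enumerate the joint distribution over binary `variables` with
    p(x) proportional to exp(sum_C (-1)**||x_C||_1 * lam[C])."""
    idx = {v: i for i, v in enumerate(variables)}
    states = list(itertools.product([0, 1], repeat=len(variables)))
    weights = [
        math.exp(sum((-1) ** sum(x[idx[v]] for v in C) * lam[C] for C in cliques))
        for x in states
    ]
    Z = sum(weights)  # normalizing constant, computed by explicit summation
    return states, [w / Z for w in weights]


# Hypothetical small example: three binary variables, two- and three-way cliques.
rng = random.Random(0)
variables = ("A", "Y1", "Y2")
cliques = [("A", "Y1"), ("Y1", "Y2"), ("A", "Y1", "Y2")]
lam = {C: rng.gauss(0, 0.3) for C in cliques}  # lambda_C ~ N(0, 0.3)

states, probs = loglinear_joint(variables, cliques, lam)
samples = rng.choices(states, weights=probs, k=5000)
```

Setting lam[("Y1", "Y2")] and lam[("A", "Y1", "Y2")] to zero removes the corresponding interactions, which is the sense in which these λ parameters directly encode the strength of the interference relationship in the simulation.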
It is not difficult to show that all elements in our family lie in the chain graph model in Fig. 1 (b).

Since all observed variables in our example are binary, the saturated model has 2^5 − 1 = 31 parameters, and the model corresponding to Fig. 1 (d) is missing 14 of them: 2 are missing because p(B1 | W, A) does not depend on A, and 12 are missing because p(Y2 | Y1, B1, W, A) does not depend on W, B1. If our results on SGs are correct, we would expect the ordinary Markov model [6] of the graph in Fig. 1 (d) to be a good fit for the data generated from our hidden variable chain graph family, where we omit the values of U. In particular, we would expect the observed deviances of our models fitted to data generated from our family to closely follow a χ2 distribution with 14 degrees of freedom. We generated 1000 members of our family described above, used each member to generate 5000 samples, and fitted the ordinary Markov model using an approach described in [5]. The resulting deviances, plotted against the appropriate χ2 distribution, are shown in Fig. 3 (a), which looks as we expect. We did not vary the parameters for A, W, B1. This is because models for Fig. 1 (b) and Fig. 1 (d) will induce the same marginal model for p(A, B1, W) by construction.

In addition, we wanted to illustrate that we can encode interaction parameters directly via parameters in a SG. To this end, we generated a set of distributions p(Y1, Y2 | A, U, B1) via the binary log-linear model as described above, where all λC parameters were fixed, except we constrained λ{Y1,Y2} to equal 3 · λ{A,Y1,Y2}, and varied λ{Y1,Y2} from −0.3 to 0.3. These parameters represent the two-way interaction of Y1 and Y2, and the three-way interaction of A, Y1 and Y2, and thus directly encode the strength of the interference relationship between the responses. Since the SG in Fig. 
1 (d) “breaks the symmetry” by replacing the undirected edge between Y1 and Y2 by a directed edge, the strength of the interaction is represented by the degree of dependence of Y2 on Y1 conditional on A. As can be seen in Fig. 3 (b),(c), we obtain independence precisely when λ_{Y1,Y2} and λ_{A,Y1,Y2} in the underlying hidden variable chain graph model are 0, as expected.

Our simulations did not require modification of the fitting procedure in [5], since Fig. 1 (d) is an ADMG. In general, a SG will have undirected blocks. However, the special property of SGs allows for a trivial modification of the fitting procedure. Since the likelihood decomposes into pieces corresponding to districts and blocks of the SG, we can simply fit each district piece using the approach in [5], and each block piece using any of the existing fitting procedures for discrete chain graph models.

7 Discussion and Conclusions

In this paper we considered a graphical representation of the ordinary Markov chain graph model, the set of distributions defined by the conditional independences implied by a marginal of a chain graph model. We showed that this model can be represented by segregated graphs via a global Markov property which generalizes the Markov properties of chain graphs, DAGs, and mixed graphs representing marginals of DAG models. Segregated graphs have the property that bidirected and undirected edges are never adjacent. Under positivity, this global Markov property is equivalent to segregated factorization, which decomposes the joint distribution into pieces that correspond either to sections of the graph containing bidirected edges, or to sections of the graph containing undirected edges, but never both together. The convenient form of this factorization implies that many existing results on chain graph and ordinary Markov models, in particular parameterizations and fitting algorithms, carry over.
We illustrated the utility of segregated graphs for interference analysis in causal inference via simulated datasets.

Acknowledgements

The author would like to thank Thomas Richardson for suggesting mixed graphs where − and ↔ edges do not meet as interesting objects to think about, and Elizabeth Ogburn and Eric Tchetgen Tchetgen for clarifying discussions of interference. This work was supported in part by NIH grant R01 AI104459-01A1.

References

[1] S. A. Andersson, D. Madigan, and M. D. Perlman. A characterization of Markov equivalence classes for acyclic digraphs. Annals of Statistics, 25:505–541, 1997.
[2] J. Bell. On the Einstein Podolsky Rosen paradox. Physics, 1(3):195–200, 1964.
[3] Z. Cai, M. Kuroki, J. Pearl, and J. Tian. Bounds on direct effects in the presence of confounded intermediate variables. Biometrics, 64:695–701, 2008.
[4] M. Drton. Discrete chain graph models. Bernoulli, 15(3):736–753, 2009.
[5] R. J. Evans and T. S. Richardson. Maximum likelihood fitting of acyclic directed mixed graphs to binary data. In Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence, volume 26, 2010.
[6] R. J. Evans and T. S. Richardson. Markovian acyclic directed mixed graphs for discrete data. Annals of Statistics, pages 1–30, 2014.
[7] J. T. A. Koster. Marginalizing and conditioning in graphical models. Bernoulli, 8(6):817–840, 2002.
[8] S. L. Lauritzen. Graphical Models. Oxford, U.K.: Clarendon, 1996.
[9] S. L. Lauritzen and T. S. Richardson. Chain graph models and their causal interpretations (with discussion). Journal of the Royal Statistical Society: Series B, 64:321–361, 2002.
[10] E. L. Ogburn and T. J. VanderWeele. Causal diagrams for interference. Statistical Science, 29(4):559–578, 2014.
[11] J. Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, San Mateo, 1988.
[12] J. Pearl.
Causality: Models, Reasoning, and Inference. Cambridge University Press, 2nd edition, 2009.
[13] T. Richardson and P. Spirtes. Ancestral graph Markov models. Annals of Statistics, 30:962–1030, 2002.
[14] T. S. Richardson. Markov properties for acyclic directed mixed graphs. Scandinavian Journal of Statistics, 30(1):145–157, 2003.
[15] K. Sadeghi and S. Lauritzen. Markov properties for mixed graphs. Bernoulli, 20(2):676–696, 2014.
[16] I. Shpitser, R. J. Evans, T. S. Richardson, and J. M. Robins. Introduction to nested Markov models. Behaviormetrika, 41(1):3–39, 2014.
[17] M. Studeny. Bayesian networks from the point of view of chain graphs. In Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence (UAI-98), pages 496–503. Morgan Kaufmann, San Francisco, CA, 1998.
[18] M. J. van der Laan. Causal inference for networks. Working paper, 2012.
[19] M. J. van der Laan. Causal inference for a population of causally connected units. Journal of Causal Inference, 2(1):13–74, 2014.
[20] T. J. VanderWeele, E. J. T. Tchetgen, and M. E. Halloran. Components of the indirect effect in vaccine trials: identification of contagion and infectiousness effects. Epidemiology, 23(5):751–761, 2012.
[21] T. S. Verma and J. Pearl. Equivalence and synthesis of causal models. Technical Report R-150, Department of Computer Science, University of California, Los Angeles, 1990.
[22] N. Wermuth. Probability distributions with summary graph structure. Bernoulli, 17(3):845–879, 2011.