# Neural Networks and Machine Learning

July - December 1997

Organisers: C M Bishop (Aston), D Haussler (UCSC), G E Hinton (Toronto), M Niranjan (Cambridge), L G Valiant (Harvard)

### EC Summer School 1 - 5 September 1997

Organisers: Christopher Bishop (Aston) and Joe Whittaker (Lancaster)

### PROBABILISTIC GRAPHICAL MODELS

#### PROVISIONAL PROGRAMME

Sunday, Aug 31

18:00 - 19:00 Wine reception and registration

Monday, Sep 1

08:30 - 09:00 Registration

09:00 - 10:30 Tutorial

Joe Whittaker (Lancaster) Graphical models

10:30 - 11:00 Coffee-Break

11:00 - 12:30 Tutorial

David Heckerman (Microsoft) Directed Acyclic Graphs: Representation and Learning

14:30 - 16:00 Presentations. Chair: Geoffrey Hinton

David Cox (Oxford) tba

Robert Cowell (City) Learning with Dirichlet Mixtures in Bayesian Networks

Wally Gilks (Cambridge) Simulation Methodology for Dynamic Models

16:00 - 16:30 Tea-Break

16:30 - 18:00 Presentations. Chair: Michael Titterington

Ed George (Texas) Empirical Bayes Covariance Selection

Stefano Monti and Greg Cooper (Pittsburg) Learning Bayesian Networks from Data Containing both Continuous and Discrete Variables

Bert Kappen (Nijmegen) A Polynomial Time Algorithm for Boltzmann Machine Learning

18:00 - 19:00 Wine reception

Tuesday, Sep 2

09:00 - 10:30 Tutorial

Christopher Bishop (Aston) Latent variables

10:30 - 11:00 Coffee-Break

11:00 - 12:30 Tutorial

Steffen Lauritzen (Aalborg) Message passing on graphs

14:30 - 16:00 Presentations. Chair: Phil Dawid

Mauro Piccioni (Aquila) The stochastic IPF algorithm and some of its properties.

Prakash Shenoy (Kansas)
Some Improvements to the Shenoy-Shafer Architecture for Computing Marginals

Larry Saul (ATT) tba

16:00 - 16:30 Tea-Break

16:30 - 18:00 Presentations. Chair: Steffen Lauritzen

Mike Titterington (Glasgow) Hyperparameter Estimation And Related Topics

Jim Smith (Warwick) Geometrical aspects of missing data in graphical models

18:00 - 19:00 Wine reception

Wednesday, Sep 3

09:00 - 10:30 Tutorial

Mike Jordan (MIT) Approximate inference via variational techniques

10:30 - 11:00 Coffee-Break

11:00 - 12:30 Tutorial

Geoffrey Hinton (Toronto) Learning Intractable Graphical Models

Afternoon free for sightseeing

Thursday, Sep 4

09:00 - 10:30 Tutorial

Phil Dawid (London) Conditional Independence for Statistics and AI

10:30 - 11:00 Coffee-Break

11:00 - 12:30 Tutorial

Stuart Russell (Berkeley) Learning: temporal processes and structure

14:30 - 16:00 Presentations. Chair: Michael Jordan

Thomas Richardson (Washington) Ancestral Graphs: Representing Markov Properties of Acyclic Directed Graphs under Marginalization and Conditionalization

Milan Studeny (Praha) On separation criterion for chain graphs

Michael Perlman (Washington) Alternative Markov Properties for Chain Graphs

16:00 - 16:30 Tea-Break

16:30 - 18:00 Presentations. Chair: Michael Perlman

Paolo Giudici (Pavia) Markov chain Monte Carlo decomposable graphical gaussian model determination

Alberto Roverato (Modena) Asymptotic Analysis of Graphical Gaussian Models: an Isserlis Matrix Based Approach

Tommi Jaakkola (MIT) Bayesian Estimation in the Presence of Missing Values

18:00 - 19:30 Reception at the Cambridge University Press bookshop

Friday, Sep 5

09:00 - 10:30 Tutorial

David MacKay (Cambridge) Introduction to Monte Carlo Methods

10:30 - 11:00 Coffee-Break

11:00 - 12:30 Tutorial

Judea Pearl (LA) Causality

14:30 - 16:00 Presentations. Chair: David Cox

Guido Consonni (Pavia) Priors for quantitative learning in probabilistic expert systems

Sebastian Seung (Lucent)Learning the Relationships Between Parts of an Object

Ross Shachter (Stanford) Causal Models: What does one need to know?

16:00 - 16:30 Tea-Break

16:30 - 18:00 Tutorial

David Spiegelhalter (MRC, Cambridge) Bayesian Graphical Modelling

19:00 Conference dinner (Sydney Sussex College)

ABSTRACTS

Joe Whittaker (Lancaster)
Graphical Models

Interpreting inter-relationships between many variables simultaneously is a complex task that confronts the research worker in many areas of applied statistics. Specific problems are to determine which variables interact, and how strongly, and to decide if the data can be condensed without loss of information.

Graphical modelling is a powerful technique for simplifying and describing multivariate interaction and association. The theoretical basis to the technique is the concept of conditional independence and the prime theoretical and practical tool is the conditional independence graph. The lecture gives an elementary account, with some practical examples, of the graphical modelling approach to multivariate data analysis, and is based on the recent book of Whittaker (1990) Graphical Models in Applied Multivariate Statistics.

The Markov properties of the independence graph, model fitting by maximum likelihood is discussed, the technique is related to log-linear models for contingency tables, to covariance selection models for correlation matrices, to recursive models for path analysis and to the mixed interaction model.

David Heckerman (Microsoft)
Directed Acyclic Graphs: Representation and Learning

I will describe DAG models and discuss methods for learning the parameters and structure of such models from data. I will mention constraint-based methods for learning, but concentrate on the Bayesian approach. Topics will include criteria for model selection, techniques for assigning priors, methods for handling missing data and latent variables, and search methods. At least one real-world application will be presented.

Robert Cowell (City)
Learning with Dirichlet Mixtures in Bayesian Networks

The general framework for sequential learning on a Bayesian network using data was presented by Spiegelhalter and Lauritzen, who introduced the simplifying assumptions of local and global independence of the parameters underlying the distribution over the network. However, when learning with incomplete data, global and local independencies are in general lost. Approximations may then be made to restore these desirable independence properties, so that the computations become tractable.

Here we investigate approximate sequential learning methods -- which relax the traditional re-imposition of local and global independencies after processing each incomplete case -- with incomplete data in a simple network, comparing their predictive performance with that obtained by Gibbs Sampling.

Wally Gilks (Cambridge) and Carlo Berzuini (Pavia)
Simulation Methodology for Dynamic Models

In situations where real-time, sequential, updating of predictive distributions is required, the usual Bayesian technology using Markov chain Monte Carlo (MCMC) is often too slow. Such situations arise in a clinical setting where patients are being continuously monitored, and where predictions of adverse events are required. Non-biomedical situations include tracking of military targets and financial time-series prediction. Similar problems arise in predictive model selection. We present a new technique for combining MCMC and importance resampling, where significant gains in computational speed can be achieved with little loss in precision. We demonstrate the methodology on artificial data from a hidden Markov model.

Stefano Monti and Greg Cooper (Pittsburg)
Learning Bayesian Networks from Data Containing both Continuous and Discrete Variables

In this paper we illustrate two different methodologies for learning Bayesian networks from complete datasets containing both continuous and discrete variables. The two methodologies differ in the way of handling continuous data when learning the Bayesian network structure. The first methodology uses discretized data to learn the Bayesian network structure, and the original non-discretized data for the parameterization of the learned structure. The second methodology uses non-discretized data both to learn the Bayesian network structure and its parameterization. For the direct handling of continuous data, we propose the use of artificial neural networks as probability estimators, to be used as an integral part of the scoring metric defined to search the space of Bayesian network structures.

We report experimental results aimed at comparing the two methodologies. These results provide evidence that learning with discretized data presents advantages both in terms of efficiency and in terms of accuracy of the learned models over the alternative approach of using non-discretized data.

Christopher Bishop (Aston)
Latent Variables

A powerful approach to probabilistic modelling involves the introduction of additional latent, or hidden variables. The distribution of observed variables is then representated in terms of the marginalisation of the joint distribution over latent and observed variables. Familiar examples include mixture distributions (involving a discrete latent variable) and factor analysis (involving continuous latent variables). The structure of such probabilistic models can be made particularly transparent using a diagrammatic representation in terms of a directed acyclic graph.

In this tutorial I will provide an overview of latent variable models for representing continuous variables. I will introduce the EM (expectation-maximization) algorithm which provides a general technique for estimating the parameters of such models through maximum likelihood.

I will also show how a specific form of linear latent variable model can be used to provide a probabilistic formulation of the well-known technique of principal components analysis (PCA). This leads naturally to mixtures, and hierarchical mixtures, of probabilistic PCA models, with applications in areas such as data compression, pattern recognition and data visualization.

References:

C M Bishop (1995) Neural Networks for Pattern Recognition, Oxford University Press.

C M Bishop, M Svensen and C K I Williams (1996) GTM: The Generative Topographic Mapping, accepted for publication in Neural Computation. Available from http://www.ncrg.aston.ac.uk/

G J McLachlan and T Krishnan (1997) The EM Algorithm and Extensions, Wiley.

M E Tipping and C M Bishop (1997) Mixtures of Principal Component Analysers, technical report NCRG/97/003, submitted to Neural Computation. Available from http://www.ncrg.aston.ac.uk/

M E Tipping and C M Bishop (1997) A Hierarchical Latent Variable Model for Data Visualization, technical report NCRG/96/028. Available from http://www.ncrg.aston.ac.uk/

Paolo Guidici (Pavia) and Peter Green (Bristol)
Markov chain Monte Carlo decomposable graphical gaussian model determination

We propose a methodology for Bayesian model determination in undirected decomposable graphical gaussian models. To achieve this aim we consider, for each given graph, a hyper inverse Wishart prior on the covariance matrix. To ensure compatibility across models, such prior distributions are obtained by marginalisation from the prior conditional on the complete graph. We explore alternative structures for the hyperparameters of the latter, and their consequences for the model.

Model determination is then carried out implementing a reversible jump MCMC sampler. In particular, the dimension changing move we propose concerns adding or dropping an edge from the graph. We characterise the set of such moves which lead to a decomposable graph. We then consider appropriate random walk proposals for the within-model moves. The main advantages of our sampler are: a) simplicity of the simulations; b) locality of most of the computations; c) extendability to hierarchical priors. These allow our proposed sampler to be suited for the analysis of complex structures.

{Keywords}: Bayesian Model Selection; Hyper Markov Laws; Junction Tree; Inverse Wishart Distribution; Reversible Jump MCMC.

Tommi Jaakkola (MIT)
Bayesian Estimation in the Presence of Missing Values

Bayesian parameter estimation has a number of advantages over simple maximum likelihood estimation. In the presence of missing values, however, the cost of performing Bayesian estimation often considerably exceeds that of maximum likelihood methods. Moreover, in a maximum likelihood setting we have the EM algorithm that naturally handles missing values, while in the Bayesian context we often need (possibly sophisticated) sampling methods. I will present a generally applicable deterministic algorithm for Bayesian parameter estimation in graphical models. The algorithm has an EM style inner loop for optimizing the posterior over the parameters (and its dual distribution) in the context of each observation. The resulting posteriors can be represented locally while they are optimized jointly in a KL-divergence sense. Overall the sequential algorithm is no more costly than performing ordinary maximum likelihood estimation.

David MacKay (Cambridge)
Introduction to Monte Carlo Methods

I will introduce a sequence of Monte Carlo methods: importance sampling, rejection sampling, the Metropolis method, and Gibbs sampling. For each method, we will discuss whether the method is expected to be useful for high--dimensional problems.

The terminology of Markov chain Monte Carlo methods will be reviewed.

Some advanced Monte Carlo methods will be presented, including methods for reducing random walk behaviour.

Mauro Piccioni
The stochastic IPF algorithm and some of its properties.

For the Bayesian analysis of hierarchical models of contingency tables which are not decomposable it is interesting to compute expected values with respect to a Dirichlet distribution on the table conditioning to zero the interactions prescribed by the model. A convenient implementation of the Gibbs sampler is possible, which exploits the independence between the marginals of any maximal component and the interactions depending on its complement. The resulting algorithm is similar to the IPF with the difference that each marginal which is imposed in a step is drawn at random from a compatible Dirichlet distribution, rather than being obtained from a table of data. Some properties of the algorithm are easily established; however, the main problem of evaluating its speed of convergence seems to be open.

Michael D. Perlman (Washington)
Alternative Markov Properties for Chain Graphs

Graphical Markov models use graphs, either undirected, directed, or mixed, to represent possible dependences among statistical variables. Applications of undirected graphs (UGs) include models for spatial dependence and image analysis, while acyclic directed graphs (ADGs), which are especially convenient for statistical analysis, arise in such fields as genetics and psychometrics and as models for expert systems and Bayesian belief networks.

Lauritzen, Wermuth, and Frydenberg (LWF) introduced a Markov property for chain graphs, which are mixed graphs that can be used to represent simultaneously both causal and associative dependencies and which include both UGs and ADGs as special cases. For multivariate normal distributions, Cox and Wermuth introduced both multivariate regression and block regression models for chain graphs. In this paper an alternative Markov property (AMP) for chain graphs is introduced and shown to be the Markov property satisfied by a Cox-Wermuth multivariate regression model, a multinormal block-recursive linear structural equations model. This model can be decomposed into a collection of conditional normal models, each of which combines the features of multivariate linear regression models and covariance selection models, facilitating the estimation of its parameters.

In the general case, necessary and sufficient conditions are given for the equivalence of the LWF and AMP Markov properties of a chain graph, for the AMP Markov equivalence of two chain graphs, for the AMP Markov equivalence of a chain graph to some ADG or decomposable UG, and for other equivalences. A new pathwise separation criterion for chain graphs is presented, called p-separation, that is equivalent to the AMP global Markov property. In some ways, the AMP Markov property and p-separation criterion for chain graphs are more direct extensions of the classical Markov property and Pearl's d-separation criterion for ADGs than are the LWF Markov property and the Bouckaert-Studeny c-separation criterion for chain graphs.

(This research has been conducted jointly with Steen A. Andersson and David Madigan.)

Alberto Roverato (Modena) and Joe Whittaker
Asymptotic Analysis of Graphical Gaussian Models: an Isserlis Matrix Based Approach

The Isserlis matrix, which was introduced by Isserlis (1918) who computed the variance matrix of the sample variances and covariances in the normal case, plays an important role in an asymptotic analysis of graphical Gaussian models. Both the Bayesian and frequentist approaches to inference require the computation of the Isserlis matrix, Iss$(\cdot)$, of a variance matrix $\Sigma$ with a given zero pattern in the inverse.

We study the properties of the Isserlis matrix of the completion, $\Sigma$, of a positive definite matrix and propose an edge set indexing notation which highlights the symmetry existing between $\Sigma$ and Iss$(\Sigma)$. In this way well known properties of $\Sigma$ can be exploited to give an easy proof of certain asymptotic results for graphical Gaussian models, as well as to extend such results to the non-decomposable case.

Ross Shachter (Stanford)
Causal Models: What does one need to know?

Causality has been a controversial modeling distinction for most of this century. In some areas of computer science and the social sciences it has been treated as an intuitive primitive explanation for all probabilistic dependence or relevance. On the other hand, in the statistics and systems analysis communities it has been tabooed as too subjective and poorly defined, and association has been the acceptable model for relevance. In modeling decision processes and managing intelligent systems, the associational model is insufficient to predict the effects of actions. We have been working to develop a practical framework that we believe successfully bridges these two extreme viewpoints.

As more formal models of causality have been proposed, a new controversy has arisen, namely, how complex must a causal model be? Can a simple causal model be sufficient for some purposes and inadequate for others? Is it always necessary to model "counterfactual" propositions? When it is possible to learn causal models from data? We will try to address these questions as we present our framework for building and applying causal models.

A. P. Dawid (University College, London)
Conditional Independence for Statistics and AI

The axiomatic theory of Conditional Independence provides a general language for formulating and determining questions relating to the intuitive idea of relevance, in a wide variety of different contexts. This tutorial describes the basic theory and various interesting models of it, with special emphasis on its use in conjunction with modular graphical representations of problems in Probability, Statistics and Expert Systems.

Stuart Russell (Berkeley)
Learning: temporal processes and structure

The first part of the talk will describe methods for representing temporal processes as graphical models, and will show how such representations can be learned from traces of system behaviour. Such models can be shown to have advantages over two other standard representations for temporal processes, namely Hidden Markov Models and Kalman filters. Temporal processes raise special problems for inference, and new algorithms will be described that partially solve these problems.

The second part of the talk describes work by Nir Friedman on learning structure in graphical models. It is shown that EM can be generalized to allow structural as well as parameter updates. This allows a much more efficient search of the hypothesis space and also simplifies the calculation of Bayesian and MDL scores in the structure learning process.

Thomas Richardson (Department of Statistics, University of Washington)
Ancestral Graphs: Representing Markov Properties of Acyclic Directed Graphs under Marginalization and Conditionalization

It is natural to consider the class of acyclic directed graph (DAG) models under marginalization, representing latent variables or "correlated errors", and under conditionalization, representing selection bias.

A graphical object, called a mixed ancestral graph (MAG) will be presented which represents the conditional dependencies in the observed margin. Further, in the Gaussian case, it is possible to construct a model which parameterizes all distributions represented by the models within a particular Markov Equivalence class of marginalized DAG models. This parameterization forms the basis for scoring based search algorithms.

A further graphical object, called a partial ancestral graph, will be presented, which represents structural features that are invariant across a Markov equivalence class of marginalized and conditionalized DAGs. Alternatively, a PAG can be viewed as representing a Markov equivalence class of MAGs.

There exist algorithms for constructing a partial ancestral graph from a marginalized and conditionalized DAG or from an oracle for conditional independence facts.

P Shenoy (Kansas)
Some Improvements to the Shenoy-Shafer Architecture for Computing Marginals

In this talk we introduce three improvements to the Shenoy-Shafer architecture for computing marginals using local computation. Although the architecture is valid more generally, we will describe our improvements for the case of Bayesian networks.

The traditional Shenoy-Shafer architecture has three phases. Phase one is the construction of a join tree. Phase two is the computation of messages between adjacent nodes in the join tree. And Phase three is the computation of marginals for the desired nodes in the join tree.

The first improvement is the concept of a binary join tree. A binary join tree is a join tree such that no node has more than three neighbors. Binary join trees ensure that all combinations (multiplications of tables for the case of probabilities) are done on a binary basis.

The second improvement is the introduction of a new phase called the transfer of valuations phase. This phase is subsequent to the join tree construction phase and prior to the message passing phase. In this phase, some functions (tables for the case of probabilities) are transferred from the non-clique nodes to the clique nodes of the join tree. The transfer of valuations phase reduces the amount of calculations needed for computation of marginal probabilities.

The third modification is the introduction of a new rule for computing marginals. The usual rule for computing the marginal for a node is to combine all messages the node receives from its neighbors with the function associated with the node. If the node has a neighbor that is its superset, then the marginal is simply the combination of the two messages exchanged between the node and the neighbor (one in each direction). Thus the marginal can be computed using only one combination of tables as opposed to several as per the old rule.

REFERENCES
1. Shenoy, P. P. and G. Shafer (1990), "Axioms for probability and belief-function propagation," in Shachter, R. D., T. S. Levitt, J. F. Lemmer and L. N. Kanal (eds.), Uncertainty in Artificial Intelligence, 4, 169-198, North-Holland, Amsterdam.

2. Shenoy, P. P. (1997), "Binary join trees for computing marginals in the Shenoy-Shafer architecture," International Journal of Approximate Reasoning, 17(2-3), 239-263.

3. Schmidt, T. and P. P. Shenoy (1997), "Some Improvements to the Shenoy-Shafer and Hugin architectures for computing marginals," Working Paper, University of Kansas, Lawrence, KS.

Milan Studeny
On separation criterion for chain graphs.

Chain graphs introduced in middle eighties provide a quite wide class of graphical models of probabilistic conditional indepence structures. They generalize both undirected graph models and directed acyclic graph models. One of possible ways how to introduce the class of Markovian distribution with respect to a chain graph is a separation criterion for reading conditional independence statements from the chain graph. It generalizes the well-known d-separation criterion for directed acyclic graphs and therefore it is also named the c-separation (chain separation) criterion.

In the talk the c-separation criterion will be formulated and explained by means of a few illustrative examples. It will be compared with the d-separation criterion and its equivalence with the original moralization criterion will be shown. Its role in the proof of existence of a perfect Markovian distribution for every chain graph will be shortly mentioned. The end of the talk will be devoted to interesting open questions concerning chain graphs.

Michael Kearns (AT&T)
Polynomial-time Algorithms for Learning Hidden Structure in Two-layer Noisy-OR Networks

I will describe work in progress (joint with Yishay Mansour of Tel-Aviv University) on inferring the structure of two-layer Bayesian networks in which the observable output units have noisy-OR CPT's over their unobservable parents. Such networks have been the subject of both applied and theoretical study in recent years. I will discuss conditions under which the hidden causal relationships between the inputs and outputs can be recovered *exactly* from sample data, and along the way make some combinatorial and algebraic observations that may prove useful in more general settings.

J.Q. Smith and R. Settimi (Warwick)
Geometrical aspects of missing data in graphical models

The estimation of probabilities on a Bayesian network is now well developed for complete data sets or for data collected on ancestor sets (eg Speigelhalter et al, 1993). The problem of how to estimate probabilities when data is missing is somewhat more problematical. Numerical methods have been devised by a number of authors (Cowell, 1997, Ramoni & Sebastiani, 1997 a,b,c,d) and various encouraging results have been proved for the case when data is missing at random. However major problems still exist when data is collected systematically only on certain joint margins (or indeed when this is the case except for certain unusual observations). Some of the problems of lack of learning potential were illustrated numerically in Speigelhalter & Cowell (1992).

In this presentation we will address some of the structural issues associated with observing only certain marginal tables when a model is known to exhibit certain conditional independence structures. We note that a collection of conditional independence statements for discrete probability models is equivalent to demanding that joint probabilities on the vector of discrete random variables lie in the solution space of high dimensional sets of simultaneous quadric equations. Typically these equations have multiple solutions and so even with perfect information on probability margins the model is often unidentifiable. In such instances numerical approaches will at best provide very uncertain estimates of probabilities and at worst be suseptible to giving misleading results.

We will begin by outlining the times when it is known that the graphical model is identifiable. We will then present examples which are is in a certain sense generic and which give rise to very difficult forms of identifiability. Some new results on the nature of these structural ambiguities will be reported. These will appear in more detail in Settimi & Smith (1997). We discuss some of the ways in which expedient prior information can make unidentifiable systems identifiable. We will argue that information on selected probabilities is often of far more value than additional conditional independence statements. We will also report some new results about how Grobner Bases can be constructed on a case by case basis to check the nature and extent of identifiability as it occurs (Riccomagno et al, 1997).

Geoffrey Hinton (Toronto) Learning Intractable Graphical Models

I will describe some learning rules for densely connected graphical models in which it is intractable to compute the full posterior probability distribution over all possible configurations of the hidden variables. To perform learning, it is necessary to approximate the posterior distribution. The first learning rule is for models called "Boltzmann machines" that have undirected connections and binary variables. The approximation to the posterior is performed using Gibbs sampling and the inaccuracy of the approximation causes serious problems. The other learning rules are for models with directed acyclic connections and discrete or piecewise linear variables. In these models the learning can work well even if the approximation to the posterior distribution over hidden states is poor.

Mike Jordan (MIT)
Approximate inference via variational techniques

For many graphical models of practical interest, exact inferential calculations are intractable and approximations must be developed. In this tutorial I describe the principles behind the use of variational methods for approximate inference. These methods provide bounds (upper and lower) on probabilities on graphs. They are complementary to the exact techniques in the sense that they tend to be more accurate for dense networks than for sparse networks; moreover, they can readily be combined with exact techniques. I will describe the application of variational ideas in a number of settings.

(This is joint work with Zoubin Ghahramani, Tommi Jaakkola, and Lawrence Saul).

Sebastian Seung (Lucent))
Learning the Relationships Between Parts of an Object

In the syntactic approach to pattern recognition, a pattern is first decomposed into its parts, yielding a structural description. This structural description is checked against a set of "syntactic" rules that govern how parts are combined to form a whole. The syntactic approach is intuitively appealing, and is related to current psychological models of object recognition.

I will discuss the problem of implementing the syntactic approach with graphical models. The discussion will focus on a very simple model with two layers of variables, visible and hidden. The connections between hidden and visible variables define a pattern description "language," while the lateral connections between hidden variables specify a "grammar."

Learning algorithms for both types of connections will be illustrated by modeling images of handwritten digits. The hidden-visible connections learn stroke-like features, while the lateral connections learn constraints on the combination of these features.

Because the hidden variables are continuous, they are well-suited for representing analog variations in a pattern. However, they are also constrained to be nonnegative, giving them a quasi-binary nature.

I will conclude by linking this work to recent brain models that rely on continuous attractor dynamics.

David Spiegelhalter (MRC, Cambridge)
Bayesian Graphical Modelling

This tutorial will be based on three interconnected topics.

Graphical models: the use of directed and undirected conditional independence graphs to express the qualitative assumptions underlying a complex model.

Bayesian inference: the attachment to the links of the graph of conditional or prior distributions for all quantities, leading to a full probability model for both observables and parameters.

Markov chain Monte Carlo methods: conditional on observed data, the simulation of the posterior distributions for quantities of interest using Markov chain Monte Carlo methods, with Gibbs sampling as the simplest procedure.

These three themes are brought together in the BUGS software.

Mike Titterington (Glasgow)
Hyperparameter Estimation And Related Topics

The talk will consider a framework that includes the estimation of hyperparameters in Bayesian analysis. The same framework covers a number of incomplete-data problems such as standard mixture models, a wide variety of latent-structure models, hidden Markov chain models that are often applied in speech processing, and hidden Markov random field models popularised in recent approaches to the statistical modelling of noisy images and related to the analysis of Boltzmann machines. The relationships among the models will be established, as will be the feasibility of applying empirical Bayes and fully Bayesian approaches to the problem of estimating unknown quantities. Difficulties arise, especially in the context of hidden Markov random fields, and ways of trying to overcome the difficulties will be described. In the context of the empirical Bayes methods, in particular, the use of devices such as mean-field approximations will be discussed.

Guido Consonni (Pavia)
Priors for quantitative learning in probabilistic expert systems

We consider a Directed Acyclic Graph with discrete nodes and address the issue of learning about conditional probabilities, which are regarded as uncertain quantities. Usually both global independence and local independence are assumed in the corresponding prior. We suggest to relax the latter using two distributions, namely a hierarchical partition prior and a hierarchical logit prior. Each prior is a discrete mixture with respect to several alternative dependence structures of the probabilities of each node conditional on parents' configurations. An application to binary data under a complete sampling scheme is presented and directions on how to extend the methodology to incomplete data scheme is discussed.