Richard Durbin

Genomic data and prediction: the value of comprehensive information


Genome projects produce very large data sets, starting with genomic sequence but now also including gene expression data from microarrays, phenotype data from systematic gene disruption screens, and so on. These data sets look attractive as a means of constraining highly parameterised models; one particular advantage is that their comprehensive nature means that every gene, for example, can be considered. However, for the most part systematic genome studies remain a long way from providing what is required to constrain predictive functional models, either because they do not measure the desired properties (e.g. molecular activities), or because we cannot interpret the information that is obtained (e.g. we cannot yet predict how a protein folds from its sequence).