Genomes and Genes
Bioconductor: an open computing resource for genomics
Principal Investigator: Martin Morgan
Abstract: DESCRIPTION (provided by applicant): The Bioconductor project provides an open resource for the development and distribution of innovative, reliable software for computational biology and bioinformatics. The range of available software is broad and rapidly growing, as are both the user community and the developer community. The project maintains a web portal for delivering software and documentation to end users, as well as an active mailing list. Additional services for developers include a software archive, a mailing list, and assistance and advice on program development and design. We propose an active development strategy designed to meet new challenges while simultaneously providing user and developer support for existing tools and methods. In particular, we emphasize a design strategy that accommodates the imperfect yet evolving nature of biological knowledge and the relatively rapid development of new experimental technologies. Software solutions must be able to adapt rapidly and to address new problems when they arise.

CRITIQUE 1: The Bioconductor project began in 2001. In 2002 it was awarded a BISTI grant for three years (2003-2006). During this time the project has expanded and provided support for a worldwide community of researchers. This is a proposal for continued development of Bioconductor, a set of statistical programs specifically tailored to the computational biology community. Bioconductor is composed of over 130 R packages that have been contributed by a large number of developers. The software packages range from state-of-the-art statistical methods, typically used in microarray analysis, to annotation tools, plotting functions, GUIs, and sequence alignment and data management packages. Contributions to and usage of Bioconductor are growing rapidly, and the applicants are requesting support to continue its development as well as general logistical support for software distribution and quality assurance.
The proposal includes a research component for Bioconductor which will involve the development of analysis techniques. This will include optimization of the R statistical analyses, statistical processing of Affymetrix data, analysis of SNP data, improved standards, data storage, retrieval from NCBI, sequence management, machine learning, web services and distributed computing.

SCIENTIFIC MERIT
The applicants address many issues that are crucial to the success of a large open source project with multiple contributors. Examples of training, scientific publication, documentation and resource development run throughout the proposal. Many tangible examples were given of the usage of the system by the scientific community.

EXPERIMENTAL DESIGN
This is a description of their management workflow for the project, which does a good job of demonstrating the technical excellence brought to the project by this group.
1) Build annotation packages every three months. Integrate changes in annotation source data structure into annotation package building code.
2) Maintain project website, mailing lists, source control archive. Organize web resources for short courses and conferences.
3) Improve existing software.
4) Sustain automated nightly builds. Work with developers whose packages fail to pass QA.
5) Resolve cross-platform issues.
6) Review new submissions. Answer questions on the mailing lists.
7) Use software engineering best practices. Develop unit testing strategies. Design appropriate classes and methods for new data types. Refactor existing code for better interoperability and extensibility.
8) Develop and organize training materials and documentation.
Extensive detail on testing, build procedures, interoperability, quality assurance and project management is given elsewhere in the document. They clearly have dealt with many issues necessary for a project of this size. They state that one of the biggest cost items is supporting this package so that it runs on multiple platforms.
They point out that because many contributors focus on a single platform, much of their work is tracking down cross-platform bugs. This is time well spent, given that the platforms used are in sync with the needs of the greater bioinformatics community.

ORIGINALITY
While a high degree of originality is not a particularly critical element of an open source software development project, there are certainly areas in the proposal that are unique. Most importantly, it is safe to say that there is not another project which has this blend of statistical analysis systems specifically tailored to an important bioinformatics research area that can be deployed on a number of different computer environments.

INVESTIGATOR AND CO-INVESTIGATORS
Dr. Gentleman is the founder and leader of the Bioconductor project. Dr. Gentleman was an Associate Professor in the Department of Biostatistics, Harvard School of Public Health and Department of Biostatistics and Computational Biology, Dana Farber Cancer Institute. In 2004 he became Program Head, Computational Biology, at the Fred Hutchinson Cancer Research Center in Seattle. He has on the order of ten publications relating to Bioconductor or related statistical analysis. He implemented the original versions of the R programming language jointly with another co-founder. He is PI or Investigator on a number of research grants, at least two of which are directly related to this work. He and other members of the proposal have taught a number of courses and given lectures on Bioconductor; the number of these courses certainly indicates significant dedication to the project. A review of the PI and Co-PI activities related to this project is shown in Table 3 on page 42 of the application. The roles and time allocations assigned to each participant appear to be reasonable. Dr. Gentleman will serve as project leader and will manage the programmers, coordinate the project, and investigate new computational methods and approaches.

Dr. Vincent Carey, as co-Principal Investigator, has 20% time allocated for the project. In 2005 he became Associate Professor of Medicine (Biostatistics). Carey is a senior member of the Bioconductor development core. He will improve interoperability to allow Bioconductor reuse of external modules in Java, Perl and other languages, as well as strengthen interfaces between high throughput experimental workflows and machine learning tools, and ontology capture. An administrative assistant will assist Dr. Carey with administrative requirements, including call coordination, manuscript preparation and distribution, scheduling and budget management.

Dr. Rafael Irizarry, as co-PI, will spend 30% effort on the project. Dr. Irizarry has four years' experience developing methods for microarray data analysis and serving, in the Department of Biostatistics, as faculty liaison to the Johns Hopkins Medical Institution's Microarray Core. He will supervise all efforts to support preprocessing on all platforms and support for microarray related consortiums such as the ERCC, GEO, and ArrayExpress.

Programmers will be responsible for the project website, managing email lists, maintaining training materials, upgrading software, refactoring and other code enhancements, managing the svn archive, and Bioconductor releases. They will handle checking all submitted packages, developing unit tests, and simplifying downloads, nightly build procedures, cross-platform issues, data technologies as well as integrating resources found in other languages (e.g. large C libraries of routines for string handling, machine learning and so on). Programmers have familiarity with R packages and systems for database management and for parallel and distributed computing. They will be responsible for managing the annotation data, including package building and liaising with organism-specific and other data providers.
SIGNIFICANCE
Given the scope of the proposal and the size of the Bioconductor project in general, the request for the above resources is appropriate. There is an excellent mix of grounded project management along with development of newer state-of-the-art techniques that will benefit many members of the bioinformatics community. There is a high probability that funding this project will help to maintain and advance this important community resource.

ENVIRONMENT
The computer infrastructure, the local departments of the PI and Co-PIs, and the work with the larger scientific community are all excellent environments to support this project.

IN SUMMARY
This is a terrific resource. It is a well-managed large open source project with very well crafted QA testing, documentation and training. Continuation of this is a three-year project. Beyond that period, a statement of long-term goals is needed. The PI should articulate the strategic goals, as well as their research motivation, and translate that into an action plan. They should also use that context to describe how they would go about choosing packages that are put into the Bioconductor system. Table 3 listed only the names of the packages made by the applicants; it could have gone further to give the reader more information for choosing packages. A simple example would have been if they stated in the document: "Given our assessment of the microarray state of the art, we ultimately aim to overlay annotation data, ontological information, and other forms of meta data onto a statistical framework for expression data." The resulting research plan would then justify a five-year project, but it was not strong enough in this application. It should be noted that many of the beneficiaries of this system are not just users who download the system. In many cases a centralized informatics service downloads the system and then performs analysis for other members of the campus or the wider www community.
While that type of "success measure" is hard to assess, more effort in this area in subsequent proposals would be helpful.
Funding Period: 2006 - 2011
more information: NIH RePORT
- Machine learning and its applications to biology. Adi L Tarca
PLoS Comput Biol 3:e116. 2007
- Genotyping in the cloud with Crossbow. James Gurtowski
Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA
Curr Protoc Bioinformatics. 2012
- Aligning short sequencing reads with Bowtie. Ben Langmead
Johns Hopkins University, Baltimore, Maryland, USA
Curr Protoc Bioinformatics. 2010. ...It also includes protocols for building a genome index and calling consensus sequences from Bowtie alignments using SAMtools...
- Software for computing and annotating genomic ranges. Michael Lawrence
Bioinformatics and Computational Biology, Genentech, Inc, South San Francisco, California, United States of America
PLoS Comput Biol 9:e1003118. 2013. ...This infrastructure directly supports more than 80 other Bioconductor packages, including those for sequence analysis, differential expression analysis and visualization...
- Gene set enrichment analysis using linear models and diagnostics. Assaf P Oron
Fred Hutchinson Cancer Research Center, 1100 Fairview Avenue North, Seattle, WA 98109-1024, USA
Bioinformatics 24:2586-91. 2008. ...Gene-set enrichment analysis (GSEA) can be greatly enhanced by linear model (regression) diagnostic techniques. Diagnostics can be used to identify outlying or influential samples, and also to evaluate model fit and explore model expansion...
- Estimating genome-wide copy number using allele-specific mixture models. Wenyi Wang
Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, Maryland, USA
J Comput Biol 15:857-66. 2008. ...Software to implement this procedure will be available in the Bioconductor oligo package (www.bioconductor.org)...
- MicroRNA discovery and profiling in human embryonic stem cells by deep sequencing of small RNA libraries. Merav Bar
Divisions of Clinical Research, Fred Hutchinson Cancer Research Center, Seattle, Washington 98109, USA
Stem Cells 26:2496-505. 2008. ...Disclosure of potential conflicts of interest is found at the end of this article...
- Rintact: enabling computational analysis of molecular interaction data from the IntAct repository. Tony Chiang
EBI-EMBL, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
Bioinformatics 24:1100-1. 2008. ...These datasets need to be analyzed by computational methods. Software packages in the statistical environment R provide powerful tools for conducting such analyses...
- Validation and extension of an empirical Bayes method for SNP calling on Affymetrix microarrays. Shin Lin
McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University School of Medicine, N Broadway, Baltimore, MD 21205, USA
Genome Biol 9:R63. 2008. ...Also, we tie our call confidence metric to percent accuracy. We intend that our validation datasets and methods, referred to as SNPaffycomp, serve as standard benchmarks for future SNP calling algorithms...
- An embryonic stem cell-like gene expression signature in poorly differentiated aggressive human tumors. Ittai Ben-Porath
Whitehead Institute for Biomedical Research, Cambridge, Massachusetts 02142, USA
Nat Genet 40:499-507. 2008
- A framework for oligonucleotide microarray preprocessing. Benilton S Carvalho
Department of Oncology, University of Cambridge, CRUK Cambridge Research Institute, Li Ka Shing Centre, Robinson Way, Cambridge CB2 0RE, UK
Bioinformatics 26:2363-7. 2010. ...However, the expansion of microarray applications has exposed the limitation of existing tools...
- Cloud-scale RNA-sequencing differential expression analysis with Myrna. Ben Langmead
Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, 615 North Wolfe Street, Baltimore, MD 21205, USA
Genome Biol 11:R83. 2010. ...We apply Myrna to the analysis of publicly available data sets and assess the goodness of fit of standard statistical models. Myrna is available from http://bowtie-bio.sf.net/myrna...
- Tackling the widespread and critical impact of batch effects in high-throughput data. Jeffrey T Leek
Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, Maryland 21205-2179, USA
Nat Rev Genet 11:733-9. 2010. ...We review experimental and computational approaches for doing so...
- ReCount: a multi-experiment resource of analysis-ready RNA-seq gene count datasets. Alyssa C Frazee
Department of Biostatistics, The Johns Hopkins University Bloomberg School of Public Health, 615 North Wolfe Street, Baltimore, MD 21205, USA
BMC Bioinformatics 12:449. 2011
- Analyzing biological data using R: methods for graphs and networks. Nolwenn Le Meur
IRISA, Equipe Symbiose, Universite de Rennes I, Rennes, France
Methods Mol Biol 804:343-73. 2012. ...This chapter provides a practical tutorial covering the use of R methods for graphs and networks to examine biological data and analyze their topological and statistical properties...
- Genome-wide MyoD binding in skeletal muscle cells: a potential for broad cellular reprogramming. Yi Cao
Human Biology Division, Fred Hutchinson Cancer Research Center, Seattle, WA 98109, USA
Dev Cell 18:662-74. 2010. ...Therefore, in addition to regulating muscle gene expression, MyoD binds genome wide and has the ability to broadly alter the epigenome in myoblasts and myotubes...
- Model-based quality assessment and base-calling for second-generation sequencing data. Héctor Corrada Bravo
Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, Maryland 21205, USA
Biometrics 66:665-74. 2010. ...Our model provides these informative estimates readily usable in quality assessment tools while significantly improving base-calling performance...
- R/Bioconductor software for Illumina's Infinium whole-genome genotyping BeadChips. Matthew E Ritchie
Bioinformatics Division, The Walter and Eliza Hall Institute of Medical Research, 1G Royal Parade, Parkville Victoria 3052, Australia
Bioinformatics 25:2621-3. 2009. ...We provide access to the raw summary-level intensity data, allowing users to develop their own methods for genotype calling or copy number analysis if they wish...
- ShortRead: a bioconductor package for input, quality assessment and exploration of high-throughput sequence data. Martin Morgan
Program in Computational Biology, Fred Hutchinson Cancer Research Center, Seattle, WA, USA
Bioinformatics 25:2607-8. 2009. ...ShortRead is provided in the R and Bioconductor environments, allowing ready access to additional facilities for advanced statistical analysis, data transformation, visualization and integration with diverse genomic resources...
- Data structures and algorithms for analysis of genetics of gene expression with Bioconductor: GGtools 3.x. Vincent J Carey
Channing Laboratory, Department of Medicine, I2B2 National Center for Biocomputing, Brigham and Women's Hospital, Boston, MA 02115, USA
Bioinformatics 25:1447-8. 2009. ...Using R/Bioconductor facilities, Phase II HapMap genotypes and Illumina 47K expression assay results archived on multiple populations may be interactively explored and analyzed using commodity hardware...
- arrayQualityMetrics--a bioconductor package for quality assessment of microarray data. Audrey Kauffmann
EMBL-European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
Bioinformatics 25:415-6. 2009. ...The diagnosis of quality remains, in principle, a context-dependent judgement, but our tool provides powerful, automated, objective and comprehensive instruments on which to base a decision...
- Quality assessment and data analysis for microRNA expression arrays. D Sarkar
Fred Hutchinson Cancer Research Center, 1100 Fairview Ave N, Seattle, WA 98109, USA
Nucleic Acids Res 37:e17. 2009. ...We anticipate that the reagents and analytic approach presented here will be useful for improving the reliability of microRNA microarray experiments...
- Power enhancement via multivariate outlier testing with gene expression arrays. Adam L Asare
Immune Tolerance Network, University of California San Francisco, San Francisco, CA 94143, USA
Bioinformatics 25:48-53. 2009. ...We present a formal approach for microarray quality assessment that is based on dimension reduction of established measures of signal and noise components of expression followed by parametric multivariate outlier testing...
- rtracklayer: an R package for interfacing with genome browsers. Michael Lawrence
Program in Computational Biology, Fred Hutchinson Cancer Research Center, Seattle, WA 98102, USA
Bioinformatics 25:1841-2. 2009. ...Currently, the UCSC genome browser is supported...