Bioconductor: an open computing resource for genomics

Summary

Principal Investigator: Martin Morgan
Abstract: DESCRIPTION (provided by applicant): The Bioconductor project provides an open resource for the development and distribution of innovative, reliable software for computational biology and bioinformatics. The range of available software is broad and rapidly growing, as are both the user community and the developer community. The project maintains a web portal for delivering software and documentation to end users as well as an active mailing list. Additional services for developers include a software archive, a mailing list, and assistance and advice on program development and design. We propose an active development strategy designed to meet new challenges while simultaneously providing user and developer support for existing tools and methods. In particular, we emphasize a design strategy that accommodates the imperfect yet evolving nature of biological knowledge and the relatively rapid development of new experimental technologies. Software solutions must be able to adapt rapidly and to address new problems when they arise.
CRITIQUE 1: The Bioconductor project began in 2001. In 2002 it was awarded a BISTI grant for three years (2003-2006). During this time the project has expanded and provided support for a worldwide community of researchers. This is a proposal for continued development of Bioconductor, a set of statistical programs specifically tailored to the computational biology community. Bioconductor is composed of over 130 R packages that have been contributed by a large number of developers. The software packages range from state-of-the-art statistical methods, typically used in microarray analysis, to annotation tools, plotting functions, GUIs, sequence alignment and data management packages. Contributions to and usage of Bioconductor are growing rapidly, and the applicants are requesting support to continue its development as well as general logistical support for software distribution and quality assurance.
The proposal includes a research component for Bioconductor which will involve the development of analysis techniques. This will include optimization of the R statistical analyses, statistical processing of Affymetrix data, analysis of SNP data, improved standards, data storage, retrievals from NCBI, sequence management, machine learning, web services and distributed computing.
SCIENTIFIC MERIT The applicants address many issues that are crucial to the success of a large open source project with multiple contributors. Examples of training, scientific publication, documentation and resource development run throughout the proposal. Many tangible examples were given of the usage of the system by the scientific community.
EXPERIMENTAL DESIGN This is a description of their management workflow for the project, which does a good job of demonstrating the technical excellence brought to the project by this group.
1) Build annotation packages every three months. Integrate changes in annotation source data structure into annotation package building code.
2) Maintain project website, mailing lists, source control archive. Organize web resources for short courses and conferences.
3) Improve existing software.
4) Sustain automated nightly builds. Work with developers whose packages fail to pass QA.
5) Resolve cross-platform issues.
6) Review new submissions. Answer questions on the mailing lists.
7) Use software engineering best practices. Develop unit testing strategies. Design appropriate classes and methods for new data types. Refactor existing code for better interoperability and extensibility.
8) Develop and organize training materials and documentation.
Extensive detail on testing, build procedures, interoperability, quality assurance and project management is given elsewhere in the document. They clearly have dealt with many issues necessary for a project of this size. They state that one of the biggest cost items is support for running this package on multiple platforms.
They point out that many contributors focus on a single platform, so much of their work is tracking down cross-platform bugs. This is time well spent, given that the platforms used are in sync with the needs of the greater bioinformatics community.
ORIGINALITY While a high degree of originality is not a particularly critical element of an open source software development project, there are certainly areas in the proposal that are unique. Most importantly, it is safe to say that there is not another project which has this blend of statistical analysis systems specifically tailored to an important bioinformatics research area that can be deployed on a number of different computer environments.
INVESTIGATOR AND CO-INVESTIGATORS Dr. Gentleman is the founder and leader of the Bioconductor project. Dr. Gentleman was an Associate Professor in the Department of Biostatistics, Harvard School of Public Health and Department of Biostatistics and Computational Biology, Dana Farber Cancer Institute. In 2004 he became Program Head, Computational Biology, at the Fred Hutchinson Cancer Research Center in Seattle. He has on the order of ten publications relating to Bioconductor or related statistical analysis. He implemented the original versions of the R programming language jointly with another co-founder. He is PI or Investigator of a number of research grants, at least two of which are directly related to this work. He and other members of the proposal have taught a number of courses and given lectures on Bioconductor; the number of these courses certainly indicates significant dedication to the project. A review of the PI and Co-PI activities related to this project is shown in Table 3 on page 42 of the application. The roles and time allocations assigned to each participant appear to be reasonable. Dr. Gentleman will serve as project leader and will manage the programmers, coordinate the project, and investigate new computational methods and approaches.
Dr. Vincent Carey, as co-Principal Investigator, has 20% time allocated for the project. In 2005 he became Associate Professor of Medicine (Biostatistics). Carey is a senior member of the Bioconductor development core. He will improve interoperability to allow Bioconductor reuse of external modules in Java, Perl and other languages, as well as strengthen interfaces between high throughput experimental workflows and machine learning tools, and ontology capture. An administrative assistant will assist Dr. Carey with administrative requirements, including call coordination, manuscript preparation and distribution, scheduling and budget management.
Dr. Rafael Irizarry, as co-PI, will spend 30% effort on the project. Dr. Irizarry has four years' experience developing methods for microarray data analysis and serving, in the Department of Biostatistics, as faculty liaison to the Johns Hopkins Medical Institution's Microarray Core. He will supervise all efforts to support preprocessing on all platforms and support for microarray related consortiums such as the ERCC, GEO, and ArrayExpress.
Programmers will be responsible for the project website, managing email lists, maintaining training materials, upgrading software, refactoring and other code enhancements, managing the svn archive, and Bioconductor releases. They will handle checking all submitted packages, developing unit tests, and simplifying downloads, nightly build procedures, cross-platform issues and data technologies, as well as integrating resources found in other languages (e.g. large C libraries of routines for string handling, machine learning and so on). Programmers have familiarity with R packages and systems for database management and for parallel and distributed computing. They will be responsible for managing the annotation data, including package building and liaising with organism specific and other data providers.
SIGNIFICANCE Given the scope of the proposal, and the size of the Bioconductor project in general, the request for the above resources is appropriate. There is an excellent mix of grounded project management along with development of newer state-of-the-art techniques that will benefit many members of the bioinformatics community. There is a high probability that funding this project will help to maintain and advance this important community resource.
ENVIRONMENT The computer infrastructure, and the local departments of the PI and Co-PIs, as well as the work with the larger scientific community are all excellent environments to support this project.
IN SUMMARY This is a terrific resource. It is a well managed large open source project with very well crafted QA testing, documentation and training. Continuation of this is a three-year project. Beyond that period, a statement of long-term goals is needed. The PI should articulate the strategic goals, as well as their research motivation, and translate that into an action plan. They should also use that context to describe how they would go about choosing packages that are put into the Bioconductor system; Table 3 only listed the names of the packages made by the applicants, and it could have gone further to give the reader more information for choosing packages. A simple example would have been if they stated in the document: "Given our assessment of the microarray state of the art, we ultimately aim to overlay annotation data, ontological information, and other forms of meta data onto a statistical framework for expression data." The resulting research plan would then justify a five-year project, but it was not strong enough in this application. It should be noted that many of the beneficiaries of this system are not just users who download the system. In many cases a centralized informatics service downloads the system and then performs analysis for other members of the campus or the wider www community. While that type of "success measure" is hard to assess, more effort in this area in subsequent proposals would be helpful.
Funding Period: 2006 - 2011
more information: NIH RePORT

Scientific Experts

  • V J Carey
  • Nolwenn Le Meur
  • Ben Langmead
  • Michael Lawrence
  • Audrey Kauffmann
  • Adam L Asare
  • Deepayan Sarkar
  • Matthew E Ritchie
  • Alyssa C Frazee
  • Benilton S Carvalho
  • Rafael A Irizarry
  • Robert Gentleman
  • Héctor Corrada Bravo
  • James Gurtowski
  • Yi Cao
  • Jeffrey T Leek
  • Walter L Ruzzo
  • Martin Morgan
  • Aravinda Chakravarti
  • Merav Bar
  • Ittai Ben-Porath
  • Shin Lin
  • Tony Chiang
  • Assaf P Oron
  • Wenyi Wang
  • Adi L Tarca
  • Michael C Schatz
  • Donald Geman
  • David Simcha
  • Robert C Gentleman
  • Stephen J Tapscott
  • Zizhen Yao
  • Maura H Parker
  • Gilson J Sanchez
  • W Evan Johnson
  • Kyle L MacQuarrie
  • Keith Baggerly
  • Martin T Morgan
  • Robert B Scharpf
  • Jerry Davison
  • Simon Anders
  • Patrick Aboyoun
  • Hervé Pagès
  • Wolfgang Huber
  • Zhen Jiang
  • Matthew W Thomson
  • Kavita S Garg
  • Jonathan Pevsner
  • Muneesh Tewari
  • Carol Ware
  • Nianhua Li
  • Nathaniel D Miller
  • George W Bell
  • Jerald P Radich
  • David J Cutler
  • Hannele Ruohola-Baker
  • Angelique M Nelson
  • Robert A Weinberg
  • Evan M Kroh
  • Patrick S Mitchell
  • Stacia K Wyman
  • Ruping Ge
  • Samuel Kerrien
  • Dan E Arking
  • Junlin Qi
  • Brian R Fritz
  • Rachael K Parkin
  • Ausra Bendoraite
  • Sandra Orchard
  • Henning Hermjakob
  • Aviv Regev
  • Sorin Draghici
  • Xue Wen Chen
  • Roberto Romero

Detail Information

Publications (25)

  1. pmc Machine learning and its applications to biology
    Adi L Tarca
    PLoS Comput Biol 3:e116. 2007
  2. pmc Genotyping in the cloud with Crossbow
    James Gurtowski
    Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA
    Curr Protoc Bioinformatics. 2012
  3. pmc Aligning short sequencing reads with Bowtie
    Ben Langmead
    Johns Hopkins University, Baltimore, Maryland, USA
    Curr Protoc Bioinformatics. 2010
    ..It also includes protocols for building a genome index and calling consensus sequences from Bowtie alignments using SAMtools...
  4. pmc Software for computing and annotating genomic ranges
    Michael Lawrence
    Bioinformatics and Computational Biology, Genentech, Inc, South San Francisco, California, United States of America
    PLoS Comput Biol 9:e1003118. 2013
    ..This infrastructure directly supports more than 80 other Bioconductor packages, including those for sequence analysis, differential expression analysis and visualization. ..
  5. pmc Gene set enrichment analysis using linear models and diagnostics
    Assaf P Oron
    Fred Hutchinson Cancer Research Center, 1100 Fairview Avenue North, Seattle, WA 98109-1024, USA
    Bioinformatics 24:2586-91. 2008
    ..Gene-set enrichment analysis (GSEA) can be greatly enhanced by linear model (regression) diagnostic techniques. Diagnostics can be used to identify outlying or influential samples, and also to evaluate model fit and explore model expansion...
  6. pmc Estimating genome-wide copy number using allele-specific mixture models
    Wenyi Wang
    Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, Maryland, USA
    J Comput Biol 15:857-66. 2008
    ..Software to implement this procedure will be available in the Bioconductor oligo package (www.bioconductor.org)...
  7. pmc MicroRNA discovery and profiling in human embryonic stem cells by deep sequencing of small RNA libraries
    Merav Bar
    Divisions of Clinical Research, Fred Hutchinson Cancer Research Center, Seattle, Washington 98109, USA
    Stem Cells 26:2496-505. 2008
    ..Disclosure of potential conflicts of interest is found at the end of this article...
  8. ncbi Rintact: enabling computational analysis of molecular interaction data from the IntAct repository
    Tony Chiang
    EBI EMBL, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
    Bioinformatics 24:1100-1. 2008
    ..These datasets need to be analyzed by computational methods. Software packages in the statistical environment R provide powerful tools for conducting such analyses...
  9. pmc Validation and extension of an empirical Bayes method for SNP calling on Affymetrix microarrays
    Shin Lin
    McKusick Nathans Institute of Genetic Medicine, Johns Hopkins University School of Medicine, N Broadway, Baltimore, MD 21205, USA
    Genome Biol 9:R63. 2008
    ..Also, we tie our call confidence metric to percent accuracy. We intend that our validation datasets and methods, referred to as SNPaffycomp, serve as standard benchmarks for future SNP calling algorithms...
  10. pmc An embryonic stem cell-like gene expression signature in poorly differentiated aggressive human tumors
    Ittai Ben-Porath
    Whitehead Institute for Biomedical Research, Cambridge, Massachusetts 02142, USA
    Nat Genet 40:499-507. 2008
  11. pmc A framework for oligonucleotide microarray preprocessing
    Benilton S Carvalho
    Department of Oncology, University of Cambridge, CRUK Cambridge Research Institute, Li Ka Shing Centre, Robinson Way, Cambridge CB2 0RE, UK
    Bioinformatics 26:2363-7. 2010
    ..However, the expansion of microarray applications has exposed the limitation of existing tools...
  12. pmc Cloud-scale RNA-sequencing differential expression analysis with Myrna
    Ben Langmead
    Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, 615 North Wolfe Street, Baltimore, MD 21205, USA
    Genome Biol 11:R83. 2010
    ..We apply Myrna to the analysis of publicly available data sets and assess the goodness of fit of standard statistical models. Myrna is available from http://bowtie-bio.sf.net/myrna...
  13. pmc Tackling the widespread and critical impact of batch effects in high-throughput data
    Jeffrey T Leek
    Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, Maryland 21205-2179, USA
    Nat Rev Genet 11:733-9. 2010
    ..We review experimental and computational approaches for doing so...
  14. pmc ReCount: a multi-experiment resource of analysis-ready RNA-seq gene count datasets
    Alyssa C Frazee
    Department of Biostatistics, The Johns Hopkins University Bloomberg School of Public Health, 615 North Wolfe Street, Baltimore, MD 21205, USA
    BMC Bioinformatics 12:449. 2011
  15. ncbi Analyzing biological data using R: methods for graphs and networks
    Nolwenn Le Meur
    IRISA, Equipe Symbiose, Universite de Rennes I, Rennes, France
    Methods Mol Biol 804:343-73. 2012
    ..This chapter provides a practical tutorial covering the use of R methods for graphs and networks to examine biological data and analyze their topological and statistical properties...
  16. pmc Genome-wide MyoD binding in skeletal muscle cells: a potential for broad cellular reprogramming
    Yi Cao
    Human Biology Division, Fred Hutchinson Cancer Research Center, Seattle, WA 98109, USA
    Dev Cell 18:662-74. 2010
    ..Therefore, in addition to regulating muscle gene expression, MyoD binds genome wide and has the ability to broadly alter the epigenome in myoblasts and myotubes...
  17. pmc Model-based quality assessment and base-calling for second-generation sequencing data
    Héctor Corrada Bravo
    Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, Maryland 21205, USA
    Biometrics 66:665-74. 2010
    ..Our model provides these informative estimates readily usable in quality assessment tools while significantly improving base-calling performance...
  18. pmc R/Bioconductor software for Illumina's Infinium whole-genome genotyping BeadChips
    Matthew E Ritchie
    Bioinformatics Division, The Walter and Eliza Hall Institute of Medical Research, 1G Royal Parade, Parkville Victoria 3052, Australia
    Bioinformatics 25:2621-3. 2009
    ..We provide access to the raw summary-level intensity data, allowing users to develop their own methods for genotype calling or copy number analysis if they wish...
  19. pmc ShortRead: a bioconductor package for input, quality assessment and exploration of high-throughput sequence data
    Martin Morgan
    Program in Computational Biology, Fred Hutchinson Cancer Research Center, Seattle, WA, USA
    Bioinformatics 25:2607-8. 2009
    ..ShortRead is provided in the R and Bioconductor environments, allowing ready access to additional facilities for advanced statistical analysis, data transformation, visualization and integration with diverse genomic resources...
  20. pmc Data structures and algorithms for analysis of genetics of gene expression with Bioconductor: GGtools 3.x
    Vincent J Carey
    Channing Laboratory, Department of Medicine, I2B2 National Center for Biocomputing, Brigham and Women's Hospital, Boston, MA 02115, USA
    Bioinformatics 25:1447-8. 2009
    ..Using R/Bioconductor facilities, Phase II HapMap genotypes and Illumina 47K expression assay results archived on multiple populations may be interactively explored and analyzed using commodity hardware...
  21. pmc arrayQualityMetrics--a bioconductor package for quality assessment of microarray data
    Audrey Kauffmann
    EMBL European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
    Bioinformatics 25:415-6. 2009
    ..The diagnosis of quality remains, in principle, a context-dependent judgement, but our tool provides powerful, automated, objective and comprehensive instruments on which to base a decision...
  22. pmc Quality assessment and data analysis for microRNA expression arrays
    D Sarkar
    Fred Hutchinson Cancer Research Center, 1100 Fairview Ave N, Seattle, WA 98109, USA
    Nucleic Acids Res 37:e17. 2009
    ..We anticipate that the reagents and analytic approach presented here will be useful for improving the reliability of microRNA microarray experiments...
  23. pmc Power enhancement via multivariate outlier testing with gene expression arrays
    Adam L Asare
    Immune Tolerance Network, University of California San Francisco, San Francisco, CA 94143, USA
    Bioinformatics 25:48-53. 2009
    ..We present a formal approach for microarray quality assessment that is based on dimension reduction of established measures of signal and noise components of expression followed by parametric multivariate outlier testing...
  24. pmc rtracklayer: an R package for interfacing with genome browsers
    Michael Lawrence
    Program in Computational Biology, Fred Hutchinson Cancer Research Center, Seattle, WA 98102, USA
    Bioinformatics 25:1841-2. 2009
    ..Currently, the UCSC genome browser is supported...