JEnsembl: Design Architecture
Fig. 1. JEnsembl Architecture. Schematic diagram of the modular JEnsembl architecture, where schema-versioned MyBatis configurations in the ensembl-config module are mapped to DatasourceAware objects using the MyBatis data mapping framework. Connection to external Ensembl datasources is via the MySQL JDBC connector.
Modular Design as Maven Artifacts
We have constructed a proof-of-concept skeletal implementation of a Java API to Ensembl in order to demonstrate the tractability of objectives 1-6; specifically it provides access to all versions of databases at Ensembl and EnsemblGenomes.
JEnsembl is implemented in Java version 1.6 following a modular design pattern using Maven software management. Project development is hosted on SourceForge, where code is available from the Subversion repository. The architecture of the project is shown schematically in Figure 1. Each module (Maven artifact) is coded against full JUnit tests, with an additional module providing functional tests for data retrieval by the API from remote datasources.
The modular design architecture allows separation of DataAccess functionality from Model Objects. Specifically, separating the DataAccess Configuration (the mapping of SQL statements to Model Objects) from the DataAccess Objects allows the configuration module to control per-schema changes in access code. The software modules are published as separate Maven artifacts and are available on the project's release download page or through the JEnsembl Maven repository. JavaDocs for the current JEnsembl release are available.
Ensembl Model (ensembl-model)
All objects in the JEnsembl domain are defined as interfaces in the ensembl-model module. Some of these represent 'Biological/Genetic Objects' and others represent 'Ensembl Database-Specific Objects' – this division might be better separated into separate modules.
Base Interfaces include IdentifiableObject (for all database derived objects), MappableObject (for all Identifiable Objects that can participate in a Mapping), Mapping (represents a mapping relationship between a Source and Target, with Source and Target Coordinates), Coordinate (class to hold start, end and strand positions), XRef, ExternalDB, ObjectType (for representing the Type of an IdentifiableObject).
Core Interfaces define a hierarchy of Objects represented in the Core Schema, including: the base CoreObject, DNASequence, AssembledDNASequence, Chromosome, Feature, Gene, Transcript, Species, CollectionOfSpecies and CoordinateSystem.
Database Interfaces define a hierarchy of the different types of Databases in the Ensembl 'Schema', including: Database, CoreDatabase, SingleSpeciesDatabase, CollectionDatabase, VariationDatabase, ComparisonDatabase etc. A Registry interface also defines the behaviour of concrete Registries that retrieve Databases from a datasource connection.
The Model module also provides the EnsemblException hierarchy for the API.
An interface, EnsemblDNASequenceReader, which extends the BioJava3-core ProxySequenceReader interface, is defined for loading Sequences as character Strings from the database.
The model artifact also provides various enumerations and static methods used for loading and parsing configuration files and parsing database and object types returned from the datasource. The Types all extend EnsemblType, which itself implements ObjectType, and may implement a hierarchy of other Type interfaces: EnsemblDBType, EnsemblCoordinateSystemType, FeatureType, EnsemblCoreObectType. The SchemaVersion and RegistryConfiguration classes provide the bridge for determining the correct MyBatis XML map/schema locations for given database types and version.
Ensembl Data Access Interface (ensembl-data-access-interface)
DAO (DataAccessObject) interfaces defined in ensembl-data-access-interface are implemented as DatabaseAwareDAO objects (in the ensembl-data-access module), and are responsible for retrieving DatasourceAware (DA) objects from the datasources using the data mapping configurations. They comprise a DatabaseDAO for datasource-level queries of available Databases, Species etc., and CoreObject DAOs that retrieve and populate the properties of (DatasourceAware) Core Model Objects (e.g. AssemblyDAO, ChromosomeDAO, CoordinateSystemDAO, GeneDAO, DNASequenceDAO).
The module also defines the DAOFactory interfaces that access the ObjectDAOs. Each Database/Species combination will have its own DAOFactory of the appropriate subclass (there will be one DAOSingleSpeciesFactory for each single-species database, but one DAOCollectionFactory for each Species belonging to a CollectionDatabase). The interface hierarchy includes DAOFactory, DAOSpeciesFactory, DAOCollectionFactory, DAOSingleSpeciesFactory, DAOCoreFactory, DAOSingleSpeciesCoreFactory, DAOCollectionCoreFactory, DAOComparaFactory, DAOFuncgenFactory, DAOCollectionFuncgenFactory, DAOVariationFactory, DAOCollectionVariationFactory.
Ensembl Datasource Aware Module (ensembl-datasource-aware-model)
This module implements DatasourceAware Model Objects, defined by the interfaces in the Model module. These objects extend the abstract DAObject and typically contain the instance of the DAOFactory that was used to instantiate them, which provides their required data access capability. MyBatis3 is used as the RDBMS-to-Java object-mapping tool. Objects possess a parameterless constructor so that MyBatis3 can instantiate them when it maps SQL query results according to the data mapping configuration. The hierarchy of DA Objects matches that of the Model Objects. Where properties have not been provided on instantiation, lazy-load methods retrieve them on demand via the DAOFactory and concrete DBDAOObjects.
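The lazy-loading pattern described above can be sketched as follows. This is a minimal illustration written for Java 8 or later, not the JEnsembl source: the names LazyGene, GeneDao, DaoFactory and inMemoryFactory are hypothetical stand-ins, and the in-memory DAO replaces what would be a MyBatis query against a real datasource.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical names throughout; a sketch of the pattern, not JEnsembl code.
class LazyLoadSketch {

    /** Stand-in for a DAO that would normally run a mapped SQL query. */
    interface GeneDao {
        String fetchDescription(String stableId);
    }

    /** Stand-in for a DAOFactory bound to one database. */
    interface DaoFactory {
        GeneDao geneDao();
    }

    /** A datasource-aware object that loads a missing property on first access. */
    static class LazyGene {
        private String stableId;
        private String description;   // stays null until first requested
        private DaoFactory factory;

        LazyGene() { }                 // parameterless, so a mapping framework can instantiate it

        void setStableId(String stableId) { this.stableId = stableId; }
        void setFactory(DaoFactory factory) { this.factory = factory; }

        String getDescription() {
            if (description == null && factory != null) {
                description = factory.geneDao().fetchDescription(stableId);
            }
            return description;
        }
    }

    /** Counts simulated queries so the lazy behaviour can be observed. */
    static final AtomicInteger QUERY_COUNT = new AtomicInteger();

    /** In-memory factory so the sketch runs without a database. */
    static DaoFactory inMemoryFactory() {
        GeneDao dao = id -> {
            QUERY_COUNT.incrementAndGet();
            return "description of " + id;
        };
        return () -> dao;
    }

    public static void main(String[] args) {
        LazyGene gene = new LazyGene();
        gene.setStableId("EXAMPLE_GENE");
        gene.setFactory(inMemoryFactory());
        System.out.println(gene.getDescription());
        System.out.println(gene.getDescription());
        System.out.println("queries run: " + QUERY_COUNT.get());
    }
}
```

The property is fetched only when first requested and is then held on the object, so repeated calls do not trigger further queries.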
Ensembl Data Mapper (ensembl-datamapper)
Defines interfaces for the MyBatis data mapping framework to map DAO queries to DatasourceAware Object results. The module also implements utility RowHandlers, DataTypeMappers and Query objects for use in MyBatis mapping.
Ensembl Data Access Module (ensembl-data-access)
Contains DBDatabase concrete implementations of all the Database interfaces in the Model module, and DB implementations of the Species, CollectionSpecies and Collection interfaces.
A DBDatabaseDAO implementation of DatabaseDAO provides datasource-level queries of available databases and species. DBDAOFactory implementations of the DAOFactory interfaces in the DataAccessInterface module provide the various types of DBObjectDAOs for data retrieval. Correctly configured data access is controlled by these factory objects: an appropriate type of DBDAOFactory is created on demand for each Database instance, and is automatically configured to use the correct MyBatis mapping rules for its schema version. A DBDAOFactory provides the DAO objects, which perform data queries using MyBatis SqlSessions provided by their shared DBDAOFactory. These SQL queries typically return DatasourceAware objects; each DatasourceAware object holds a reference to its own DBDAOFactory, which is used to lazy-load data fields and to perform queries about further data relationships. Hence all access to a particular database is effectively performed through a single DBDAOFactory instance (providing the opportunity for implementing data caching).
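One way to obtain "one factory per database, created on demand" is a keyed cache. The sketch below (Java 8 or later) assumes hypothetical names, DbFactory and FactoryCacheSketch, and keys the cache on the database name; how JEnsembl itself stores its factories is not shown in this document.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical names; illustrates on-demand, per-database factory instances.
class FactoryCacheSketch {

    /** Stand-in for a DBDAOFactory configured for one database and schema version. */
    static final class DbFactory {
        final String databaseName;
        final String schemaVersion;

        DbFactory(String databaseName, String schemaVersion) {
            this.databaseName = databaseName;
            this.schemaVersion = schemaVersion;
        }
    }

    private final Map<String, DbFactory> factories = new ConcurrentHashMap<>();

    /** Returns the existing factory for this database, creating it on first request. */
    DbFactory factoryFor(String databaseName, String schemaVersion) {
        return factories.computeIfAbsent(
                databaseName, name -> new DbFactory(name, schemaVersion));
    }

    int size() {
        return factories.size();
    }

    public static void main(String[] args) {
        FactoryCacheSketch cache = new FactoryCacheSketch();
        DbFactory first = cache.factoryFor("homo_sapiens_core_67_37", "67");
        DbFactory second = cache.factoryFor("homo_sapiens_core_67_37", "67");
        System.out.println("same instance: " + (first == second));
    }
}
```

Because every DAO and DatasourceAware object for a database shares the same factory instance, that instance is a natural place to attach a cache later.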
The abstract DBBaseDAO provides the template superclass for all the other DBDAOs, wrapping the MyBatis query methods called on the SqlClient obtained from a DBDAOFactory (which is configured to use the correct schema configuration for the particular database).
The module provides DBCoreObjectDAO implementations of the CoreObjectDAOs in the DataAccessInterface module, by extending DBBaseDAO. These DBDAOs implement the actual data retrieval methods, which call MyBatis SqlClient methods using queries defined in the Configuration module, and return DatasourceAwareObjects via mapping rules defined in the same Configuration module.
Ensembl Datasource Configuration (ensembl-config)
MyBatis combines the XML files containing SQL queries and result mappings in the ensembl-config module with the Java interfaces defined in the ensembl-datamapper module to transform database query result sets automatically into DatasourceAware objects. The JEnsembl API uses a set of property files provided in the ensembl-config module to correctly configure the SQL/Data mapping rules for each available Ensembl schema version. This module provides backwards compatibility with known Ensembl releases, and is updated to reflect schema changes in future Ensembl releases (schema evolution).
Property files provide the current registry configuration for connection to EnsemblDB and EnsemblGenomes. The file schema_version_mappings.properties lists known current releases (currently 50-67) and details which mapping schema to use for each known release. The module provides pre-configured alternate database connection properties, ensembldb.properties, ensembldb-archives.properties and ensemblgenomes.properties, but allows for alternative configurations to be loaded locally on Registry initialisation.
Datasource level MyBatis mapping
The MyBatis SQL mappings for the DBRegistry are defined in a single pair of files, Configuration.xml and Database.xml. To connect to a data source (e.g. Ensembl), a DBRegistry object is instantiated by injecting either a default RegistryConfiguration object read from the current ensembl-config module, or a RegistryConfiguration generated from locally supplied properties. Upon DBRegistry initialization, the names of available databases at the configured datasource are parsed using knowledge of the Ensembl naming conventions to identify database type, species, assembly and schema release versions. Queries are used to populate the list of databases in the DBRegistry and the properties of all of the represented DBSpecies from the core.meta table (including aliases, ids etc.). The DBRegistry object can then be queried for lists of known databases and species, or can return objects extracted from current or specific releases of named species databases.
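As an illustration of parsing database names, the sketch below handles only the common single-species form species_type_release_assembly (for example homo_sapiens_core_67_37). The class name, the regular expression and the restriction to core, funcgen and variation are assumptions made for this example; collection and compara databases follow other patterns and are deliberately not matched here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical parser for single-species database names only.
class DatabaseNameSketch {

    // species (lazy, may contain underscores) _ type _ release _ assembly
    static final Pattern SINGLE_SPECIES = Pattern.compile(
            "^([a-z0-9_]+?)_(core|funcgen|variation)_(\\d+)_([0-9a-z]+)$");

    static final class ParsedName {
        final String species;
        final String type;
        final int release;
        final String assembly;

        ParsedName(String species, String type, int release, String assembly) {
            this.species = species;
            this.type = type;
            this.release = release;
            this.assembly = assembly;
        }
    }

    /** Returns the parsed parts, or null when the name does not fit the pattern. */
    static ParsedName parse(String databaseName) {
        Matcher m = SINGLE_SPECIES.matcher(databaseName);
        if (!m.matches()) {
            return null;
        }
        return new ParsedName(
                m.group(1), m.group(2), Integer.parseInt(m.group(3)), m.group(4));
    }

    public static void main(String[] args) {
        ParsedName p = parse("homo_sapiens_core_67_37");
        System.out.println(p.species + " / " + p.type + " / " + p.release + " / " + p.assembly);
    }
}
```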
Alternate Schema version MyBatis Sql mappings
A separate mapping configuration package will be provided for each evolution version and type of database (i.e. 58.compara, 58.core, 58.funcgen, 58.variation, 57.compara .... 50.compara ... etc). Each of these packages will contain a MyBatis Configuration.xml file that defines the actual XML mapping files and SQL-to-DAObject mappings to be used for that package. The package may contain its own mapping and alias files for its own version, or point at mapping files in the package for an earlier release, thus limiting the duplication of mapping files over time. As schema changes are implemented, only the pertinent mapping files require updating, and the new version's Configuration.xml can point at a mixture of earlier and current mapping files. The mapping rules hierarchy is shown in Figure 2.
The schema_version_mappings.properties file (read by the SchemaVersion class of ensembl-model) then controls which release version uses which SQL package, so that multiple release versions may in fact reuse a single configuration package.
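The lookup from release version to configuration package can be sketched with java.util.Properties. The key layout used here ("type.release=package version") and the sample entries are invented for illustration; the actual format of schema_version_mappings.properties is not reproduced in this document. The entries merely echo the pattern described for Figure 2, where several releases share one mapping package.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical key layout; not the real schema_version_mappings.properties format.
class SchemaMappingSketch {

    static final String EXAMPLE =
            "core.51=57\n"
          + "core.57=57\n"
          + "core.64=57\n"
          + "core.65=65\n"
          + "core.67=65\n"
          + "variation.62=62\n"
          + "variation.67=62\n";

    static Properties load(String text) {
        Properties properties = new Properties();
        try {
            properties.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return properties;
    }

    /** Returns a directory such as "schema/57/core", or null for an unknown release. */
    static String mappingPackage(Properties mappings, String type, int release) {
        String version = mappings.getProperty(type + "." + release);
        return version == null ? null : "schema/" + version + "/" + type;
    }

    public static void main(String[] args) {
        Properties mappings = load(EXAMPLE);
        System.out.println(mappingPackage(mappings, "core", 64));
        System.out.println(mappingPackage(mappings, "core", 67));
    }
}
```

Returning null for an unlisted release keeps the failure visible to the caller rather than silently falling back to a possibly incompatible mapping.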
Fig. 2. Data-Mapping between Database Releases and Schema Versions. (A) The configuration file hierarchy in the ensembl-config module. The ensembldb, ensembldb-archives and ensemblgenomes properties files hold JDBC connection parameters, whilst schema_version_mappings specifies which MyBatis configurations are to be used for each Ensembl release version. The base Configuration.xml and Database.xml files configure connection at the data source level, whilst release-specific MyBatis mappings are held in database type-specific directories: schema/XX/compara, core, funcgen and variation; rules specified in a Configuration.xml file in each directory allow a release configuration to use mapping files from different directories. (B) Abridged listing of schema_version_mappings properties, showing how the appropriate mappings of database type and version to MyBatis configuration directories are specified. Core and Compara mappings were developed for release 57, and are backwards compatible to release 51. Variation mappings were introduced from version 62, and Core mapping rules updated at release 65.
Ensembl Test (ensembl-test)
This module contains both demonstration code illustrating how to use the JEnsembl API for data access (see description) and functional tests being developed to allow developers to check that data access behaviour is as expected.
Footnotes
1. Databases in an Ensembl Datasource
The Ensembl data sources contain not only the actual DNA sequences of genome assemblies but also annotations of features on the assembly derived from Ensembl's own pipeline analyses and external sources, together with derived relationships between these features. Core sequence and assembly information, together with gene and transcription annotations, are stored in a 'Core' schema, whilst the other (optional) data schemas are used to hold further information about the better studied model species. Access to data in the other (non-Core) database schemas is controlled through the DAOCoreFactory, which, for example, can supply an instance of a DAOVariationFactory for the correct species/version Variation Database, with its own correctly configured SqlSessionFactory. This DAOVariationFactory supplies a DAOVariation object, which may be used to retrieve all the variations for a given chromosomal region. Comparative genomic data is stored somewhat differently in Ensembl, with a single 'Compara' database (accessed by a DAOComparaFactory) for each release of Ensembl holding the results of pair-wise inter-species comparisons (comprising both genomic alignments and gene family and homology data).
2. EnsemblGenomes uses the same Data Schema: Multi-Species Databases
The EnsemblGenomes datasource uses the same (versioned) schema as Ensembl (which is now focused as a Vertebrate resource), but with species organized into five separate Taxonomic groups, each with its own 'Compara' database. Therefore, as with the Ensembl Perl API, JEnsembl can use the same API for data access from EnsemblGenomes with the added benefit of version aware configuration on the fly.
However, EnsemblGenomes bacterial datasources differ significantly in being organized into multi-species databases according to phylogeny. Ensembl have adapted their schema to handle multi-species resources, and the Perl API handles all schema identically (as potentially multi-species). In JEnsembl, multispecies resources are currently handled by implementing separate 'multispecies' interfaces in Database and Factory objects. Because the underlying schema is identical, the multi-species data access architecture could be used for accessing standard single-species datasources. However, currently we feel retaining the single-species database paradigm is simpler for the vast majority of users, and allows for easier representation of a 'Species' object, shared between database release versions.
3. Nucleotide Sequences in the JEnsembl API: extending BioJava3
In order to harness the comprehensive sequence manipulation features of BioJava libraries we extended the BioJava 3.0 Core DNASequence object for the JEnsembl DNASequence object, providing an Ensembl SequenceReader that can lazy-load sequence on demand from the Ensembl datasource. This provides the JEnsembl Sequence objects with BioJava API behaviour, for example reading protein sequences from translated transcripts. Incorporation of third party open source libraries not only obviates code duplication, but also enables interoperability with a wider range of third party software.
4. Mappings and Coordinates in JEnsembl
Mappings are central to the design implementation of JEnsembl. They are used to place features such as Genes and Transcripts on CoordinateSystems such as Chromosomes, and to compose AssembledDNASequences from component DNASequences.
The JEnsembl Mapping object (defined in ensembl-model) contains a mapping relationship between two MappableObjects, given as Source and Target, with associated Coordinates for each (see Figure below). A Mapping can therefore relate any two objects that implement the MappableObject interface, for example a Chromosome and a Gene. The Mapping can contain a ReverseMapping, where the relationship is expressed in inverse (i.e. Source becoming Target...).
A Mappable object can have sets and hashed sets of Mappings. For example a Chromosome can have a set of Gene Mappings, and each of the Genes would have an inverse Chromosome Mapping. A MappingSet object is used to store these Mappings. MappingSet extends TreeSet<Mapping> by providing a default Comparator that orders the Mappings on the basis of the Source Coordinates (an alternate Comparator orders on the basis of the Target Coordinates).
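A sorted set with a default source-coordinate ordering and an alternate target-coordinate ordering might look like the sketch below (Java 8 or later). The Coord and Mapping classes here are simplified, hypothetical stand-ins for the JEnsembl interfaces. Note one assumption that matters in practice: a TreeSet discards elements its comparator considers equal, so the comparators break ties on an identifier.

```java
import java.util.Comparator;
import java.util.TreeSet;

// Simplified, hypothetical stand-ins for Coordinate, Mapping and MappingSet.
class MappingSetSketch {

    static final class Coord {
        final int start;
        final int end;
        final int strand;

        Coord(int start, int end, int strand) {
            this.start = start;
            this.end = end;
            this.strand = strand;
        }
    }

    static final class Mapping {
        final String sourceId;
        final String targetId;
        final Coord source;
        final Coord target;

        Mapping(String sourceId, String targetId, Coord source, Coord target) {
            this.sourceId = sourceId;
            this.targetId = targetId;
            this.source = source;
            this.target = target;
        }
    }

    /** Default ordering: by source coordinates, ties broken on the target id. */
    static final Comparator<Mapping> BY_SOURCE =
            Comparator.comparingInt((Mapping m) -> m.source.start)
                      .thenComparingInt(m -> m.source.end)
                      .thenComparing(m -> m.targetId);

    /** Alternate ordering: by target coordinates, ties broken on the target id. */
    static final Comparator<Mapping> BY_TARGET =
            Comparator.comparingInt((Mapping m) -> m.target.start)
                      .thenComparingInt(m -> m.target.end)
                      .thenComparing(m -> m.targetId);

    static class MappingSet extends TreeSet<Mapping> {
        MappingSet() {
            super(BY_SOURCE);
        }

        MappingSet(Comparator<Mapping> order) {
            super(order);
        }
    }

    public static void main(String[] args) {
        MappingSet set = new MappingSet();
        set.add(new Mapping("chr", "B", new Coord(5000, 6000, 1), new Coord(10, 20, 1)));
        set.add(new Mapping("chr", "A", new Coord(100, 900, 1), new Coord(50, 60, 1)));
        System.out.println("first by source: " + set.first().targetId);
    }
}
```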
Coordinates implement the Comparable interface, and may be ordered in a CoordinateSet object, which provides useful functions to query aspects of the Collection such as the start, end, range and coverage. A MappingSet can provide the SourceCoordinates or the TargetCoordinates as a CoordinateSet, and can also be queried for the Coordinate range covered.
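The sketch below shows one plausible reading of these queries, assuming 1-based inclusive coordinates: "range" is taken as the span from the lowest start to the highest end, and "coverage" as the number of positions covered once overlapping coordinates are merged. The class names and these definitions are assumptions for illustration and may differ from the JEnsembl CoordinateSet.

```java
import java.util.TreeSet;

// Hypothetical CoordinateSet-like class; definitions of range/coverage are assumed.
class CoordinateSetSketch {

    static final class Coord implements Comparable<Coord> {
        final int start;
        final int end;

        Coord(int start, int end) {
            this.start = start;
            this.end = end;
        }

        @Override
        public int compareTo(Coord other) {
            int byStart = Integer.compare(start, other.start);
            return byStart != 0 ? byStart : Integer.compare(end, other.end);
        }
    }

    static class CoordSet extends TreeSet<Coord> {

        /** Lowest start; assumes the set is not empty. */
        int minStart() {
            return first().start;
        }

        /** Highest end; scans all members because ordering is by start. */
        int maxEnd() {
            int max = Integer.MIN_VALUE;
            for (Coord c : this) {
                max = Math.max(max, c.end);
            }
            return max;
        }

        /** Inclusive span from lowest start to highest end. */
        int range() {
            return maxEnd() - minStart() + 1;
        }

        /** Positions covered after merging overlapping coordinates. */
        int coverage() {
            int total = 0;
            boolean open = false;
            int curStart = 0;
            int curEnd = 0;
            for (Coord c : this) {
                if (!open) {
                    curStart = c.start;
                    curEnd = c.end;
                    open = true;
                } else if (c.start > curEnd) {
                    total += curEnd - curStart + 1;
                    curStart = c.start;
                    curEnd = c.end;
                } else {
                    curEnd = Math.max(curEnd, c.end);
                }
            }
            if (open) {
                total += curEnd - curStart + 1;
            }
            return total;
        }
    }

    public static void main(String[] args) {
        CoordSet set = new CoordSet();
        set.add(new Coord(100, 200));
        set.add(new Coord(150, 250));
        set.add(new Coord(400, 450));
        System.out.println("range=" + set.range() + " coverage=" + set.coverage());
    }
}
```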
5. Converting between axes for feature annotations
Genome assemblies in Ensembl are stored at a number of different 'levels'; the 'top level' assembly is the genomic/chromosome level (or, for immature projects, scaffold fragments). Below this will be a number of ranked lower-level assemblies. The lowest 'sequence' level contains actual DNA Sequence information, from which the higher-level assemblies can 'stitch together' assembled sequences.
Most genomic Features (Genes, Transcripts, Exons etc.) are annotated on the top-level assembly (i.e. the level of the chromosome), and these annotation coordinates are held by 'Mappings' in JEnsembl (see above). JEnsembl seeks to hide these levels of assembly where possible both for annotations and for actual sequence data retrieval, and by default the chromosome/top level coordinate system is used where possible. Each 'Feature' has its own 'relative' coordinate system (e.g. from start = 1; to end = length, always expressed 5'→ 3' or N → C terminal) and there are methods for each type of feature to convert between its own coordinates and those of the chromosome and various other features (e.g. Processed Transcript to Primary Transcript, Exon, Translation or Protein coordinates to Chromosome or Transcript etc…). These methods are enumerated in the JavaDocs for the (abstract) DAFeature class and its subclasses DAGene, DATranscript and DAExon and for the DATranslation class (see DatasourceAware JavaDocs). Note that where meaningful, coordinates can be converted beyond the extent of the mapping (i.e. extending 5' and 3' to the annotation range).
Coordinate Axes for Transcription/Translation Features
A single structural gene, annotated on the 'Reverse' Strand of the Chromosome assembly, is shown. Each feature has its own relative coordinate axis from position 1 at the 5' end to the 3' end position equal to the length of the feature. Methods are provided to move between these various axes for related objects.
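The arithmetic behind moving between a feature's own axis and the chromosome axis can be sketched as below. It assumes 1-based inclusive chromosome coordinates with start less than or equal to end on either strand, and a strand value of 1 (forward) or -1 (reverse). The class and method names are hypothetical and are not the conversion methods of the DAFeature classes.

```java
// Hypothetical helper; shows the arithmetic only, not the JEnsembl methods.
class FeatureAxisSketch {

    /**
     * Converts a feature-relative position (1 = 5' end of the feature)
     * to a chromosome position.
     */
    static int toChromosome(int featureStart, int featureEnd, int strand, int relative) {
        return strand >= 0
                ? featureStart + relative - 1
                : featureEnd - relative + 1;
    }

    /** Converts a chromosome position to a feature-relative position. */
    static int toFeature(int featureStart, int featureEnd, int strand, int chromosomal) {
        return strand >= 0
                ? chromosomal - featureStart + 1
                : featureEnd - chromosomal + 1;
    }

    public static void main(String[] args) {
        // A feature annotated at 1000..1999 on the reverse strand.
        int start = 1000;
        int end = 1999;
        int strand = -1;
        System.out.println(toChromosome(start, end, strand, 1));     // 5' end of the feature
        System.out.println(toChromosome(start, end, strand, 1000));  // 3' end of the feature
        System.out.println(toFeature(start, end, strand, 1500));
    }
}
```

For a reverse-strand feature, position 1 on the feature axis corresponds to the feature's end coordinate on the chromosome, and positions increase as chromosome coordinates decrease.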