JEnsembl: Example Code
Requirements
To run these examples you need the compiled Java API and all of its dependencies on your class path. This can be done in one of the following ways (simplest method first):
- building your own Maven project using the specified Maven dependencies from public repositories, or by installing the artifacts locally (in your '.m2' repository) from the File Download Page.
- including the binary bundle jensembl-bundle-no-config.jar together with a datasource configuration jar (e.g. Releases/1_76/jensembl-bundle-no-config-1.76.jar and Releases/1_76/ensembl-config-1.76.jar on the File Download Page).
- including all of the individual binary jar files in a Release download (e.g. Releases/1_76/release1_76.zip on the File Download Page).
- building your own Java project using the JEnsembl source code from Subversion (and importing required dependencies using Maven or downloading as above).
Download Example Code
Example code can be downloaded from the File Download Page (e.g. the source-code jar artifact ensembl-test-1.76-sources.jar is included in the Releases/1_76/release1_76.zip archive) and is also available on Subversion (EnsemblTest). Alternatively, source files may be viewed individually as web pages by clicking on the filenames below.
Example command-line usage
>ls
classes lib source
>find . -name "*.java"
./source/uk/ac/roslin/ensembl/demo/ArchivesConnection.java
./source/uk/ac/roslin/ensembl/demo/AssemblyExceptions.java
./source/uk/ac/roslin/ensembl/demo/BacterialGeneHomologues.java ... etc.
>ls lib
ensembl-model-1.76.jar
biojava3-core-3.0.3.jar
cglib-nodep-2.1_3.jar
mybatis-3.0.2.jar
ensembl-config-1.76.jar
mysql-connector-java-5.1.6.jar
ensembl-data-access-1.76.jar
ensembl-data-access-interface-1.76.jar
slf4j-api-1.6.1.jar
ensembl-datamapper-1.76.jar
ensembl-datasource-aware-model-1.76.jar
(optional: slf4j-log4j12-1.6.1.jar & log4j-1.2.16.jar, or slf4j-nop-1.6.1.jar)
OR
>ls lib
jensembl-bundle-no-config-1.76.jar
ensembl-config-1.76.jar
- COMPILING
on *nix:
javac -sourcepath source -classpath "classes:lib/*" `find . -name "*.java"` -d classes
on Windows:
javac -sourcepath source -classpath classes;lib\* source\uk\ac\roslin\ensembl\demo\*.java -d classes
- RUNNING
on *nix:
java -cp "classes:lib/*" uk.ac.roslin.ensembl.demo.Genes
on Windows:
java -cp classes;lib\* uk.ac.roslin.ensembl.demo.Genes
Note that the logger facade binary slf4j-api-1.6.1.jar must always be included; any compatible logger implementation, or the slf4j-nop-1.6.1.jar 'No Operation' logger, can be added at run time if desired.
Example Code Snippets
NEW v1.75 Selective loading of Releases
The Default Version for Data Queries is 'Current'
MyBatis Configuration Handles Ensembl Versioning
DNA Sequences
Assembled DNA Sequences
Feature Locations
Load a Registry (All available datasources and versions at a datasource)
// Connect to default Ensembl or EnsemblGenomes datasource
DBRegistry eReg
= new DBRegistry(DataSource.ENSEMBLDB);
DBRegistry egReg
= new DBRegistry(DataSource.ENSEMBLGENOMES);
NOTE: because of the increasing size of the DataSources (particularly the number of species in EnsemblGenomes), loading complete registries may be unacceptably slow. Users may prefer to create and query an uninitialized registry, or to make a registry containing data from only the current or a specified release (see below).
// Connect to locally configured datasource
RegistryConfiguration conf = new RegistryConfiguration();
conf.setDBByFile(new File("dbConn.properties"));
conf.setSchemaByFile(new File("schema.properties"));
DBRegistry localReg = new DBRegistry(conf);
// Retrieve species objects from a registry
Species chick = eReg.getSpeciesByAlias("chicken");
Species O81 = egReg.getSpeciesByAlias("E_coli_O81");
Discover available releases and selectively load a Registry
To improve the speed and efficiency of data loading for users who only wish to use the current release data (or data from a particular release version), various new static methods were added to the DBRegistry class in release 1.75. It is now possible to query which versions are available without completely initializing the Registry, and information is also available from the SchemaVersion class about which schema versions and data releases the current API can use.
QUERY THE API CONFIGURATION (using SchemaVersion class)
SchemaVersion s = new SchemaVersion();
System.out.println("API CONFIGURATION: Current declared Ensembl Version: "+s.getCurrentEnsemblVersion());
System.out.println("API CONFIGURATION: Current declared EnsemblGenomes Version: "+s.getCurrentGenomesVersion());
System.out.println("API CONFIGURATION: List of Registered Ensembl Schema Versions: "+Arrays.toString(s.getRegisteredSchemas()));
System.out.println("API CONFIGURATION: List of Registered EnsemblGenomes Release Versions: "+Arrays.toString(s.getKnownGenomesReleases()));
DISCOVER AVAILABLE RELEASES WITHOUT INITIALIZING THE REGISTRY
DBRegistry reg = DBRegistry.createUninitializedRegistryForDataSource(DataSource.ENSEMBLDB);
//Print a brief report detailing which versions are available in this Data Source for this API release
System.out.println(reg.getVersionReport());
//Print a report giving further details of releases present in the Data Source
//which cannot be recognized by this API
System.out.println(reg.getBriefRegistryReport());
//Create a detailed report listing all Species (and aliases), versions and databases available in the registry
File report = reg.getRegistryReport();
INITIALIZE A REGISTRY FOR PARTICULAR RELEASES
//Initialize Registry solely for current release data
DBRegistry reg =
DBRegistry.createRegistryForDataSourceCurrentRelease(DataSource.ENSEMBLDB);
//Initialize data solely for Ensembl (vertebrates) release 72.
DBRegistry reg =
DBRegistry.createRegistryForDataSourceAtReleaseVersion(DataSource.ENSEMBLDB, 72);
//The release version for EnsemblGenomes and Bacteria does not match the schema number
DBRegistry reg =
DBRegistry.createRegistryForDataSourceAtReleaseVersion(DataSource.ENSEMBLGENOMES, 20);
Data is retrieved by default from the most recent DB schema, but earlier versions can be specified (e.g. version 55).
Chromosome chr25_curr
= chick.getChromosomeByName("25");
Chromosome chr25_55
= chick.getChromosomeByName("25","55");
//The API also handles multispecies bacterial databases
Chromosome O81_chr
= O81.getChromosomeByName("chromosome");
MyBatis configuration transparently handles schema changes between Ensembl release versions.
DAOFactories for each database use the appropriate MyBatis mapping configuration for each Ensembl Release. Changes to the database schema are handled by adding version-specific mappings to the Configuration package/artifact (ensembl-config.jar), which limits the changes required in the Model and DataAccess artifacts.
For example, in Release 51 the column 'hit_id' was renamed to 'hit_name' in the core 'protein_feature' table. A new version of the MyBatis mapping file 'protein_feature.xml' is used for subsequent releases. The API transparently determines the correct mapping context for different database versions (via the DAOFactory objects).
Species hsa
= eReg.getSpeciesByAlias("human");
Gene g58
= hsa.getGeneByStableID("ENSG00000139618", "58");
g58.getDescription();
==> breast cancer 2, early onset
[Source:HGNC Symbol;Acc:1101]
g58.getCanonicalTranslation().getProteinFeatures().size();
==> 34
Gene g50
= hsa.getGeneByStableID("ENSG00000139618", "50");
g50.getDescription();
==> Breast cancer type 2 susceptibility protein
(Fanconi anemia group D1 protein).
[Source:Uniprot/SWISSPROT;Acc:P51587]
g50.getCanonicalTranslation().getProteinFeatures().size();
==> 34
In a further example, the merging of the separate '***_stable_id' tables into the 'gene', 'exon', 'transcript' and 'translation' tables in Ensembl release 65 requires different SQL queries to be run before and after this release. Again the API transparently determines the correct mapping context for different database versions (via the DAOFactory objects).
Underlying getExon SQL query in uk.ac.roslin.ensembl.configfiles.schema.57.core.Exon.xml
SELECT *
FROM
(
(exon e, exon_transcript et, seq_region sr)
LEFT JOIN exon_stable_id esi ON esi.exon_id = e.exon_id)
WHERE
e.is_current = 1
AND
sr.seq_region_id = e.seq_region_id
AND
e.exon_id = et.exon_id
<if test="featureStableID != null" >
AND esi.stable_id = #{featureStableID}
</if>
<if test="featureID != null" >
AND e.exon_id= #{featureID}
</if>
<if test="transcriptID != null" >
AND et.transcript_id= #{transcriptID}
</if>
ORDER BY et.transcript_id, et.rank
Underlying getExon SQL query in uk.ac.roslin.ensembl.configfiles.schema.65.core.Exon.xml
SELECT *
FROM
exon e, exon_transcript et, seq_region sr
WHERE
e.is_current = 1
AND
sr.seq_region_id = e.seq_region_id
AND
e.exon_id = et.exon_id
<if test="featureStableID != null" >
AND e.stable_id = #{featureStableID}
</if>
<if test="featureID != null" >
AND e.exon_id= #{featureID}
</if>
<if test="transcriptID != null" >
AND et.transcript_id= #{transcriptID}
</if>
ORDER BY et.transcript_id, et.rank
Again, because this is handled purely via the MyBatis mapping rules in the Configuration module (directed by an individual DAOFactory for each database instance), the access code is the same for all release versions.
public Collection<DAExon> DATranscript.getExons();
is implemented by calling
Collection<DAExon> out =
(Collection<DAExon>) this.getDaoFactory()
.getExonDAO().getExonsForTranscript(this);
where the DBExonDAO uses its version-aware DAOFactory
to obtain the correct Query for that release
version via the ExonMapper Interface
SqlSession session = this.getFactory().getNewSqlSession();
ExonMapper mapper = session.getMapper(ExonMapper.class);
List<DAExon> exons = mapper.getExon(featureQuery);
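The idea of resolving a release number to the right mapping can be pictured with a small stand-alone sketch. This is not the DAOFactory code from JEnsembl: the class MappingResolver, its methods, and the rule "use the newest mapping registered at or below the requested release" are all assumptions made for illustration. Only the two schema numbers (57 and 65) and the two stable-id columns are taken from the exon queries above.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical illustration only; not part of the JEnsembl API. */
final class MappingResolver {

    // Schema number a mapping is registered under -> detail that differs in that mapping.
    private final TreeMap<Integer, String> mappings = new TreeMap<>();

    void register(int schema, String mapping) {
        mappings.put(schema, mapping);
    }

    /** Returns the newest mapping registered at or below the requested release (assumed rule). */
    String resolve(int release) {
        Map.Entry<Integer, String> entry = mappings.floorEntry(release);
        if (entry == null) {
            throw new IllegalArgumentException("No mapping registered for release " + release);
        }
        return entry.getValue();
    }

    /** Builds a resolver holding the stable-id column used by each of the two exon queries. */
    static MappingResolver exonStableIdColumns() {
        MappingResolver resolver = new MappingResolver();
        resolver.register(57, "esi.stable_id"); // stable id held in the separate exon_stable_id table
        resolver.register(65, "e.stable_id");   // stable id merged into the exon table
        return resolver;
    }

    public static void main(String[] args) {
        MappingResolver resolver = exonStableIdColumns();
        for (int release : new int[] {58, 64, 65, 72}) {
            System.out.println("release " + release + " -> " + resolver.resolve(release));
        }
    }
}
```

The calling code asks only for a release number, which mirrors the point made above: the choice of query is a configuration concern and never appears in the access code.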
DADNASequences extend and modify the behaviour of BioJava Sequence objects.
Sequences can be instantiated directly:
DADNASequence myseq
= new DADNASequence("accgggttttMKYRNWSVBDHACTG");
myseq.getSequenceAsString(3,24);
==> cgggttttMKYRNWSVBDHACT
myseq.getReverseComplementSequenceAsString(3,24);
==> AGTDHVBSWNYRMKaaaacccg
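The output above can be reproduced by hand with a short stand-alone sketch, which may help when checking coordinates. It is not the DADNASequence implementation: the class and method names are hypothetical, and it simply assumes 1-based inclusive coordinates, the standard IUPAC complement pairs (for example M with K, R with Y, V with B, H with D) and preservation of letter case, all of which are consistent with the example shown.

```java
/** Stand-alone illustration; not the DADNASequence implementation. */
final class IupacReverseComplement {

    // Each code in CODES is complemented by the character at the same index in COMPLEMENTS.
    private static final String CODES       = "ACGTMRWSYKVHDBNacgtmrwsykvhdbn";
    private static final String COMPLEMENTS = "TGCAKYWSRMBDHVNtgcakywsrmbdhvn";

    static char complement(char base) {
        int i = CODES.indexOf(base);
        if (i < 0) {
            throw new IllegalArgumentException("Not an IUPAC nucleotide code: " + base);
        }
        return COMPLEMENTS.charAt(i);
    }

    /** Returns the bases from start to end, both 1-based and inclusive. */
    static String subSequence(String seq, int start, int end) {
        return seq.substring(start - 1, end);
    }

    /** Returns the reverse complement of the 1-based inclusive region start..end. */
    static String reverseComplement(String seq, int start, int end) {
        String region = subSequence(seq, start, end);
        StringBuilder out = new StringBuilder(region.length());
        for (int i = region.length() - 1; i >= 0; i--) {
            out.append(complement(region.charAt(i)));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String seq = "accgggttttMKYRNWSVBDHACTG";
        System.out.println(subSequence(seq, 3, 24));       // cgggttttMKYRNWSVBDHACT
        System.out.println(reverseComplement(seq, 3, 24)); // AGTDHVBSWNYRMKaaaacccg
    }
}
```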
AssembledDNASequences are DNASequences that hold an Assembly object
Chromosomes are examples of AssembledDNASequences that hold an Assembly, which stitches together component DNASequences and Gaps. They can be retrieved by Name (and Version):
Chromosome chr25v67 = chick.getChromosomeByName("25", "67");
However, the 'actual' DNA sequence of an AssembledDNASequence is lazy loaded from the constituent components of the Assembly only as required. The necessary mapping between the Coordinate Systems of the AssembledDNASequences, Contigs and constituent DNASequences is handled transparently by the API.
System.out.println(chr25v67.getSequenceAsString(1000000, 1000020));
==> CACAGGTATTTCTTGATCCTC
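The lazy stitching described above can be sketched without a database. The code below is a simplified model and not the JEnsembl Assembly class: AssemblySketch, Component and the loads counter are hypothetical names, components are assumed to be forward-oriented and used from their first base, and uncovered positions (gaps) are assumed to be reported as 'N'. It shows that a component's sequence is fetched only when a requested region overlaps it, and then cached.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Supplier;

/** Simplified model of an assembled sequence; not the JEnsembl implementation. */
final class AssemblySketch {

    /** A component placed on the assembled sequence at 1-based inclusive start..end. */
    static final class Component {
        final int start;
        final int end;
        private final Supplier<String> loader; // stands in for a database query
        private String cached;                  // null until the sequence is first needed
        int loads = 0;                          // counts how often the loader was called

        Component(int start, int end, Supplier<String> loader) {
            this.start = start;
            this.end = end;
            this.loader = loader;
        }

        String sequence() {
            if (cached == null) {
                cached = loader.get();
                loads++;
            }
            return cached;
        }
    }

    private final List<Component> components = new ArrayList<>();

    void add(Component component) {
        components.add(component);
    }

    /** Returns the bases for 1-based inclusive start..end; positions in gaps are 'N'. */
    String getSequenceAsString(int start, int end) {
        char[] out = new char[end - start + 1];
        Arrays.fill(out, 'N');
        for (Component c : components) {
            int from = Math.max(start, c.start);
            int to = Math.min(end, c.end);
            if (from > to) {
                continue; // no overlap, so this component is never loaded
            }
            String bases = c.sequence();
            for (int pos = from; pos <= to; pos++) {
                // Translate the assembly coordinate into the component's own coordinate.
                out[pos - start] = bases.charAt(pos - c.start);
            }
        }
        return new String(out);
    }

    public static void main(String[] args) {
        AssemblySketch chromosome = new AssemblySketch();
        chromosome.add(new Component(1, 8, () -> "ACGTACGT"));
        chromosome.add(new Component(13, 20, () -> "TTTTCCCC")); // positions 9..12 are a gap
        System.out.println(chromosome.getSequenceAsString(1, 4));
        System.out.println(chromosome.getSequenceAsString(7, 14));
    }
}
```

In this toy layout the first query touches only the first component, while the second spans the end of the first component, the gap and the start of the second.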
Feature locations on a DNASequence (i.e. 'mappings') can be queried bidirectionally using lazy-loaded queries.
Gene gene
= chick.getGeneByStableID("ENSGALG00000009011");
Features such as Genes are mapped on DADNASequences. A Gene's location on the Chromosome can be queried:
System.out.println("Gene location: "
+ gene.getChromosomeMapping().getTargetCoordinates());
==> Gene location: 3080 - 12536 (REVERSE_STRAND)
This mapping held by the Gene references the Chromosome:
Chromosome chr = (Chromosome) gene.getChromosomeMapping().getTarget();
System.out.println("Chromosome: "+chr.getName());
==> Chromosome: 25
The Chromosome also holds mappings of its Gene locations (obtained by lazy-load):
List<? extends Gene> genesOnRegion
= chr.getGenesOnRegion(1,chr25_curr.getLength());
All of these Coordinates on the Chromosome can be transparently mapped to coordinates on the constituent Contigs and DNASequences in Ensembl that represent real sequence data (see above).
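The bidirectional relationship between a feature and its sequence can be sketched as a single mapping object referenced from both ends. The classes below are hypothetical stand-ins, not the JEnsembl model classes, and they omit lazy loading entirely: mappings are registered up front through an assumed link() helper. The first gene uses the stable ID and coordinates from the example above; the second gene, GENE_B, is invented purely to show the region filter.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; these are not the JEnsembl model classes. */
final class MappingSketch {

    static final class Chromosome {
        final String name;
        final List<Mapping> mappings = new ArrayList<>();

        Chromosome(String name) {
            this.name = name;
        }

        /** Returns genes whose mapping overlaps the 1-based inclusive region start..end. */
        List<Gene> getGenesOnRegion(int start, int end) {
            List<Gene> hits = new ArrayList<>();
            for (Mapping m : mappings) {
                if (m.start <= end && m.end >= start) {
                    hits.add(m.source);
                }
            }
            return hits;
        }
    }

    static final class Gene {
        final String stableId;
        Mapping chromosomeMapping; // set by link()

        Gene(String stableId) {
            this.stableId = stableId;
        }
    }

    /** One object describing where a gene sits on a chromosome. */
    static final class Mapping {
        final Gene source;
        final Chromosome target;
        final int start;
        final int end;
        final boolean reverseStrand;

        Mapping(Gene source, Chromosome target, int start, int end, boolean reverseStrand) {
            this.source = source;
            this.target = target;
            this.start = start;
            this.end = end;
            this.reverseStrand = reverseStrand;
        }
    }

    /** Registers the same mapping on both the gene and the chromosome. */
    static Mapping link(Gene gene, Chromosome chr, int start, int end, boolean reverseStrand) {
        Mapping m = new Mapping(gene, chr, start, end, reverseStrand);
        gene.chromosomeMapping = m;
        chr.mappings.add(m);
        return m;
    }

    public static void main(String[] args) {
        Chromosome chr25 = new Chromosome("25");
        Gene gene = new Gene("ENSGALG00000009011");
        link(gene, chr25, 3080, 12536, true);
        link(new Gene("GENE_B"), chr25, 20000, 25000, false); // invented second gene

        // Gene -> Chromosome
        System.out.println("Chromosome: " + gene.chromosomeMapping.target.name);
        // Chromosome -> Genes
        for (Gene g : chr25.getGenesOnRegion(1, 15000)) {
            System.out.println("Gene on region: " + g.stableId);
        }
    }
}
```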
Example Code Files
- RegistryFactory.java
- NEW v1.75 In order to improve the speed and efficiency of data loading for users who only wish to use the current release data (or data from a particular release version) various new static methods have been added to the DBRegistry class. It is possible to query which versions are available without completely initializing the Registry, and information is also available from the SchemaVersion class about which schema versions and data releases the current API can use.
- EnsemblConnection.java
- Demonstrating typical connection to the Ensembl (Vertebrate) datasource, Registry autoconfiguration and basic data retrieval functions.
- GenomesConnection.java
- Demonstrating typical connection to the EnsemblGenomes (Non-Vertebrate) datasource, (Plants, Protists, Fungi, Metazoa and Bacterial Collections).
- ArchivesConnection.java
- Demonstrating typical connection to the archives of the Ensembl (Vertebrate) datasource.
- SpeciesVersions.java
- Demonstrates retrieving data for different release versions of Species assemblies. Includes access by alias, handling of species/database renames (e.g. Orangutan), shows how species in bacterial collections are accessed, concentrating particularly on the major expansion, reorganisation and renamings that were implemented in EnsemblGenomes v17.
- LoadLocalDatasourceProperties.java
- Demonstrate use of a local db-connection.properties file instead of using the file in-place in the configuration module.
- LoadLocalConfigurationProperties.java
- Demonstrate use of a local configuration.properties file instead of using the file in-place in the configuration module.
- CommandLineConfiguration.java
- Script that can load db-connection.properties and schema-version.properties files on command-line (will default to use local demo files).
- DNASequences.java
- JEnsembl Datasource Aware DNA Sequences extend org.biojava3.core.sequence.DNASequence but with modified behaviour specified by implementing uk.ac.roslin.ensembl.model.core.DNASequence. Demonstrates basic sequence functions of DADNASequences (use of BioJava STRAND is deprecated; JEnsembl works in a Positive context unless methods specify 'ReverseComplement').
- RevalidateAndLazyLoadSequences.java
- The properties (i.e. features and nucleotide sequence) of DADNASequences (minimally with a valid database identifier) can be 'lazy loaded' when required.
- ComponentAssembledSequencesOfChromosomes.java
- Demonstration of how to retrieve sequences and complementary sequences from a chromosome; the transparent assembly of a chromosome from its component sequences; the internals of a chromosome assembly - made up of mapped component DADNASequences.
- ChromosomeCaching.java
- Demonstrates the use of the chromosome cache. Chromosomes for a given species/version are only instantiated once - and then cached for reuse. Their properties are lazy loaded as needed; if initially instantiated as Generic DADNASequences - they can be validated.
- AssemblyExceptions.java
- The human genome assembly has assembly updates associated with release versions between major Genome builds. These are represented as Novel, Patch and Haplotype Exceptions (Note: Ensembl treats PseudoAutosomalRegions in a similar fashion, but JEnsembl integrates this information into the standard Chromosome Model).
- Genes.java
- Retrieving Ensembl annotated gene models through the API. Genes are a specific type of 'feature' annotation. Genes are retrieved together with their mapped chromosomal location.
- ExonsTranscriptionAndTranslation.java
- Demonstrates integration of BioJava3 transcription and translation functions together with the retrieval and stitching together of exon sequences by the JEnsembl API. Uses BioJava 'transcription engines' for translation: the datasource is queried to use the correct codon table if specified. Examples shown for chordate, plant and bacterial genes.
- TranscriptionAndTranslation.java
- A fuller demonstration of moving between the various coordinate axes: chromosome, gene, primary transcript, processed (spliced) transcript, translation and protein, and retrieval of actual sequence data from each of these. Uses brca-2, a human gene on the forward strand of chromosome 13.
- TranscriptionAndTranslationReverse.java
- A similar demonstration of moving between the various coordinate axes as for TranscriptionAndTranslation.java above, but for zar1l, a human gene on the reverse strand of chromosome 13.
- ExaminingObjectHashCodes.java
- Mainly of use for developers, in order to keep track of objects, for debugging or creating caches.
- EnsemblGeneHomologues.java
- Retrieving from the Ensembl Compara datasource all asserted homologues (orthologues and paralogues) for a given gene. Integration of gene information across the Ensembl Core and Compara schema, for example, comparing the mapping data for genes in Compara with information in the matching core database.
- PlantGeneHomologues.java
- Similar demonstration to 'EnsemblGeneHomologues.java' above but querying plant genes in EnsemblGenomes, to retrieve homologous plant genes. Separate Compara sources are available for the 'Plant', 'Fungi', 'Metazoa', 'Protist' and 'Bacterial' datasources.
- BacterialGeneHomologues.java
- Demonstration of bacterial gene homology searching using EnsemblGenomes.
- EnsemblSyntenies.java
- Demonstrates searching for regions of conserved synteny between one species and another. (Uses homology searches with genes in the selected region of the source chromosome).
- PlantsSyntenies.java
- As 'EnsemblSyntenies.java' above but querying a region of a plant chromosome using EnsemblGenomes datasource.
- BacterialSyntenies.java
- As 'EnsemblSyntenies.java' above but querying a region of a bacterial chromosome using EnsemblGenomes datasource.
- Variations.java
- Retrieving locations of Ensembl records of dbSNP variations within a chromosomal region of interest: examining whether these are unique or not.
- VariationScript.java
- Demonstration script for getting variations near genes.
- Analysis.java
- Analysis objects that describe the procedures used to generate particular feature annotations can be retrieved for Genes, Transcripts and ProteinFeatures. (Eventually will not be limited to Core Objects).
- XRefs.java
- Demonstrates the recovery and use of External References for 'XRef-ed' objects (genes, transcripts and translations). XRefs hold references to informational objects in External Databases.
- VegaAndCCDSIDs.java
- Demonstration of retrieval both by VegaID (for Genes, Transcripts and Translations/Proteins) and by CCDS_ID (for Transcripts), and retrieval of XRefs representing this information for these Object types. (Limited to the species with high quality manual annotations curated by the Vega and CCDS databases).
- Synonyms.java
- Demonstrates the retrieval of all name-synonyms held by all the XRefs of an object, or just the synonyms of a particular XRef. Also shows look up of objects by name using matches to a synonym.
- LogicFromArkMAP.java
- ArkMAP is a downloadable map drawing application that uses the JEnsembl API to download gene-annotated chromosome maps from Ensembl datasources. The application integrates JEnsembl data retrieval with ArkDB map drawing code which uses the Java Swing API. Salient features of Ensembl data retrieval are combined in this demonstration code (Getting a basic Ensembl map, Getting gene homologies, Getting SNP variations on an Ensembl map, Finding regions of conserved synteny for a selected chromosomal region, Displaying maps of assembly exceptions and haplotypes).
- LogicFromSavantPlugin.java
- The Savant Plugin embeds JEnsembl functionality in a plugin for the Savant Genome Browser. Salient features of JEnsembl mediated data retrieval are shown in this demonstration code.
- UserChristelle.java
- Script developed for user: parses chromosomal regions (gene locations) specified in a local BED data file, uses the JEnsembl API to retrieve sequence data flanking the given locations.
- UserSetzermann.java
- Script developed for user: uses the JEnsembl API to fetch all genes, transcripts and exons on all chromosomes, and writes the details out to file.
- BioinformaticsPublicationExamples.java
- Reproduces the code examples shown in the JEnsembl publication in 'Bioinformatics'.