The European Nucleotide Archive (ENA) provides a comprehensive open record of the world's nucleotide sequencing information and a platform for the management and analysis of sequence and related data for a wide range of applications in ocean science. ELIXIR is the coordinating research infrastructure for life science data in Europe.
The ENA system contains many data types/classes and a huge volume of data, only part of which is marine-related. Blue-Cloud focuses on data and information relevant to the marine domain and on data types such as samples and their analyses. Moreover, the ENA system offers several algorithms/pipelines for processing data, which might be used in a ‘smart’ way within Blue-Cloud.
An interview with Guy Cochrane (ENA)
Type & number of data sets
ENA covers many data types in a number of interlinked database tables. A list can be found at https://www.ebi.ac.uk/ena/portal/api/results?dataPortal=ena
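As a rough illustration of how that list could be read by a program, the following Python sketch fetches the URL above and parses the response. It assumes the endpoint returns tab-separated text with a header row. That is an assumption, not something stated here, so the parser takes its column names from whatever header it finds. The function names are hypothetical.

```python
import csv
import io
import urllib.request

# URL taken from the text above.
RESULTS_URL = "https://www.ebi.ac.uk/ena/portal/api/results?dataPortal=ena"


def parse_result_types(tsv_text):
    """Parse tab-separated text into a list of dicts keyed by the header row."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [dict(row) for row in reader]


def fetch_result_types(url=RESULTS_URL, timeout=30):
    """Download the listing and parse it (requires network access).

    Assumption: the response body is UTF-8 encoded, tab-separated text.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return parse_result_types(response.read().decode("utf-8"))


# Example use (not executed here because it needs network access):
#     for row in fetch_result_types():
#         print(row)
```

Keeping the parsing separate from the download makes it possible to check the parser against a saved response without contacting the service.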
Data can be retrieved through RESTful services in several formats, with easy file download options: EMBL flat file format, FASTA format for sequences, and XML format. Details about the formats: https://ena-browser-docs.readthedocs.io/en/latest/browser/search/advanced.html#downloadena-records
The ENA browser brings together a set of services via web interfaces, built upon underlying APIs. Two of these services are relevant for Blue-Cloud:
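A minimal sketch of format-specific retrieval might look as follows in Python. The base URL and the `/{format}/{accession}` path layout are assumptions about the ENA Browser API and are not given in this text; they should be checked against the ENA documentation before use. The function name and the accession shown are placeholders.

```python
from urllib.parse import quote

# Assumption: base URL and path layout of the ENA Browser API.
BROWSER_API = "https://www.ebi.ac.uk/ena/browser/api"

# The three formats named in the text, as lower-case path segments (assumed).
SUPPORTED_FORMATS = ("embl", "fasta", "xml")


def record_url(accession, fmt="fasta"):
    """Build a retrieval URL for one record in one of the supported formats."""
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported format: {fmt!r}")
    # Percent-encode the accession so unusual characters cannot alter the path.
    return f"{BROWSER_API}/{fmt}/{quote(accession, safe='')}"


# Example use with a placeholder accession:
#     print(record_url("PLACEHOLDER_ACCESSION", "xml"))
```

Restricting `fmt` to a fixed set means a mistyped format name fails early instead of producing a request the service would reject.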
- ENA Data Discovery (https://www.ebi.ac.uk/ena/browser/advanced-search)
- ENA Data Retrieval (https://www.ebi.ac.uk/ena/browser/home)
The documentation of the ENA Portal API explains how to use the APIs and build machine-to-machine services: https://www.ebi.ac.uk/ena/portal/api/doc
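To give a flavour of a machine-to-machine request, the Python sketch below assembles a Portal API search URL. The `search` endpoint and the parameter names `result`, `query`, `fields`, `limit` and `format` are assumptions based on the Portal API documentation linked above. The example query, the field names and the taxon ID are illustrative only and should be verified there.

```python
from urllib.parse import urlencode

# Assumption: the search endpoint of the ENA Portal API.
SEARCH_URL = "https://www.ebi.ac.uk/ena/portal/api/search"


def build_search_url(result, query, fields, limit=10, fmt="tsv"):
    """Assemble a search URL; all parameter names are assumptions (see lead-in)."""
    params = {
        "result": result,            # which data type to search
        "query": query,              # the search expression
        "fields": ",".join(fields),  # columns to return
        "limit": limit,              # maximum number of records
        "format": fmt,               # response format
    }
    # urlencode percent-encodes brackets, commas and quotes in the values.
    return f"{SEARCH_URL}?{urlencode(params)}"


# Illustrative only: 408172 is assumed to be the taxon ID for "marine metagenome".
example_url = build_search_url(
    result="sample",
    query="tax_tree(408172)",
    fields=["accession", "lat", "lon", "collection_date"],
    limit=5,
)
```

The resulting URL can be passed to any HTTP client. Building it with `urlencode` rather than by string concatenation keeps special characters in the query from breaking the request.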
Figure 1: Illustration of growth of selected data types since March 2016
Function in Blue-Cloud
EMBL-EBI operates APIs for ENA discovery and ENA data retrieval, which are suitable endpoints for connecting to the Blue-Cloud data discovery and access service. The ENA system contains many data types/classes and a huge volume of data, only part of which is marine-related. ENA stands to benefit from the Blue-Cloud project, which allows it to connect to all these different data types and allows scientists to access all these data in an interdisciplinary way.
ENA contributes in particular to the development of the Plankton Genomics demonstrator. The data provided by ENA allow the demonstrator team to pull out individual organisms, and signatures of organisms, in the sequence data from different environments, and then to correlate them with the environmental factors that come with these data sources.