API — The MG-RAST Application Programming Interface¶
URLs¶
https://api.mg-rast.org/
Further documentation, with a complete parameter listing for all available resources, is at:
https://api.mg-rast.org/api.html
Github repository of script tools, examples, and contributed code for using the MG-RAST API:
https://github.com/MG-RAST/MG-RAST-Tools
Introduction¶
Over 110,000 metagenomic data sets have been uploaded and analyzed in MG-RAST since 2007, totaling over 43 terabases (Tbp). Data uploaded falls in three classes: shotgun metagenomic data, amplicon data, and, more recently, metatranscriptomic data. The MG-RAST pipeline normalizes all samples by applying a uniform pipeline with the appropriate quality control mechanisms for the various data sources. Uniform processing and robust sequence quality control enable comparison across experimental systems and, to some extent, across sequencing platforms. With the inclusion of standardized metadata, MG-RAST has enabled meta-analysis available through its web-based user interface. This provides an easy-to-use way to upload and download data, perform analyses, and create and share projects.
As with most GUIs, however, there are limitations to what can be done, for example, regarding the number of samples processed in a single analysis, access to complete metadata, and easy access to raw data and quality metrics for each sample. As part of the DOE Systems Biology Knowledgebase project (KBase) we have implemented a web services application programming interface (API) that exposes all data to (authenticated) programmers, enabling access to available data and functionality through software applications. This makes user access to MG-RAST’s internal data structures possible.
The MG-RAST API enables programmatic access to data and analyses in MG-RAST without requiring local installations. Using the API, users can authenticate against the service, submit their data, download results, and perform extensive comparisons of data sets. The API uses the Representational State Transfer (REST) [3] architecture, which lets users query the system via URLs, download data in ASCII format, and retrieve MG-RAST data objects in their native format (e.g. similarity tables or sequence files). For structured data (e.g. metadata or project information) the MG-RAST API uses JSON (JavaScript Object Notation, a widely used standard) as its data format.
This allows users to use simple tools to download data files or view the JSON in their web browsers using one of the many available JSON viewers. In addition, many programming languages have libraries for convenient HTTP interaction and JSON conversions. The API has a minimal number of prerequisites, and any language with HTTP and JSON support, or a command line utility such as “curl”, can easily integrate with the design.
If you are not a programmer or you are not willing to spend the time learning the API, the example scripts (see chapter 7) may be a better starting point.
Design and Implementation¶
The MG-RAST API enables programmatic access to data and analyses in MG-RAST without requiring local installations. Users can authenticate against the service, submit their data, download results, and perform extensive comparisons of data sets. We chose to use the Representational State Transfer (REST) [3] architecture. The REST approach lets users query the system via URLs, download data in ASCII format, and retrieve MG-RAST data objects in their native format (e.g. similarity tables or sequence files). For structured data (e.g. metadata or project information) the MG-RAST API uses JSON (JavaScript Object Notation, a widely used standard) as its data format.
Using this approach users can use simple tools to download data files to their machines or view the JSON in their web browsers using one of the many available JSON viewers. In addition many programming languages have libraries for convenient HTTP interaction and JSON conversions.
Most of the API calls are simply URLs which can be entered in the address bar of a web browser to perform the download through the browser. These URLs can also be used with a command line tool like curl, in programming-language-specific libraries, or in command line scripts. The examples in the Results section illustrate the use of each of these methods. The example scripts are available in the supplementary materials and on github (https://github.com/MG-RAST/MG-RAST-Tools) along with other useful illustrative scripts.
The MG-RAST API covers most of the functionality available through the MG-RAST website, with access to annotations, analyses, metadata and access to the MG-RAST user inbox to view contents as well as upload files. All sequence data and data products from intermediate stages in the analysis pipeline are available for download. Other resources provide services not available through the website, e.g. the m5nr resource lets you query the m5nr database.
Each query to the API is represented as a URI beginning with
https://api.mg-rast.org/
and has a defined structure to pass the requests and parameters to the API server. These URI queries can be used from the command line, e.g. using curl, in a browser, or incorporated in a shell script or program.
Each URI has the form:
https://api.mg-rast.org/{version}/{resourcepath}?{querystring}
where
{version}
explicitly directs the request to a specific version of the API. If it is omitted the latest API version will be used. The current version number is ‘1’.
{resourcepath}
is constructed from the path parameters listed below to define a specific resource.
{querystring}
is optional and is used to filter the results obtained for the resource.
For example, in:
https://api.mg-rast.org/1/annotation/sequence/mgm4447943.3?evalue=10&type=organism&source=SwissProt
the resource path
annotation/sequence/mgm4447943.3
defines a request for the annotated sequences for the MG-RAST job with ID 4447943.3. The optional query string
evalue=10&type=organism&source=SwissProt
modifies the results by setting an evalue cutoff, annotation type and database source.
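The URI structure described above can be assembled programmatically. The following is a minimal sketch using only the Python standard library; the helper name `build_uri` is hypothetical and not part of any MG-RAST tool.

```python
from urllib.parse import urlencode

API_BASE = "https://api.mg-rast.org"


def build_uri(resource_path, query=None, version="1"):
    """Assemble {base}/{version}/{resourcepath}?{querystring} (hypothetical helper)."""
    parts = [API_BASE]
    if version:
        # The version segment may be omitted; the API then uses its latest version.
        parts.append(str(version))
    parts.append(resource_path.strip("/"))
    uri = "/".join(parts)
    if query:
        # urlencode escapes parameter values and joins them with '&'.
        uri += "?" + urlencode(query)
    return uri


uri = build_uri(
    "annotation/sequence/mgm4447943.3",
    {"evalue": 10, "type": "organism", "source": "SwissProt"},
)
print(uri)
```

The printed URI can be pasted into a browser or passed to curl like any other API URL.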
The API provides an authentication mechanism for access to private MG-RAST jobs and users’ inbox. The ’auth_key’ (or ’webkey’) is a 25 character long string, e.g.
j6FNL61ekNarTgqupMma6eMx5
which is used by the API to identify an MG-RAST user account and determine access rights to metagenomes. Note that the auth_key is valid for a limited time after which queries using the key will be rejected. You can create a new auth_key or view the expiration date and time of an existing auth_key on the MG-RAST website. An account can have only one valid auth_key and creating a new key will invalidate an existing key.
All public data in MG-RAST is available without an auth_key. All API queries for private data which either do not have an auth_key or use an invalid or expired auth_key will get an “insufficient permissions to view this data” response.
The auth_key can be included in the query string like:
https://api.mg-rast.org/1/annotation/sequence/mgm4447943.3?evalue=10&type=organism&source=SwissProt&auth_key=j6FNL61ekNarTgqupMma6eMx5
or in a request using curl like:
curl -X GET -H "auth: j6FNL61ekNarTgqupMma6eMx5" "https://api.mg-rast.org/1/annotation/sequence/mgm4447943.3?evalue=10&type=organism&source=SwissProt"
Note that for the curl command the quotes are necessary for the query to be passed to the API correctly.
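The header-based variant of the curl call can be mirrored in Python. This is a sketch using the standard library's `urllib.request`; the function name `authenticated_request` is an assumption made for illustration, and the key shown is the example key from this document, not a working credential. The request is only constructed here, not sent.

```python
from urllib.request import Request


def authenticated_request(url, auth_key):
    """Build a GET request carrying the auth_key in the 'auth' header."""
    # Sending the key as a header keeps it out of the URL itself.
    return Request(url, headers={"auth": auth_key}, method="GET")


url = (
    "https://api.mg-rast.org/1/annotation/sequence/mgm4447943.3"
    "?evalue=10&type=organism&source=SwissProt"
)
req = authenticated_request(url, "j6FNL61ekNarTgqupMma6eMx5")
# To actually send it: urllib.request.urlopen(req)
```

Note that `urllib` stores the header name as `Auth`; HTTP header names are case-insensitive, so this is equivalent to the curl form.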
If an optional parameter passed through the query string has a list of values only the first will be used. When multiple values are required, e.g. for multiple md5 checksum values, they can be passed to the API like:
curl -X POST -d '{"data":["000821a2e2f63df1a3873e4b280002a8","15bf1950bd9867099e72ea6516e3d602"]}' "https://api.mg-rast.org/m5nr/md5"
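An equivalent POST request can be prepared in Python. The sketch below builds the same JSON body as the curl example; the function name `md5_lookup_request` is hypothetical, and the request is constructed but not sent.

```python
import json
from urllib.request import Request


def md5_lookup_request(md5s, url="https://api.mg-rast.org/m5nr/md5"):
    """Build a POST request whose body is {"data": [md5, md5, ...]}."""
    body = json.dumps({"data": list(md5s)}).encode("utf-8")
    # No Content-Type is set explicitly; urllib then falls back to the same
    # form-encoded default that `curl -d` uses.
    return Request(url, data=body, method="POST")


post_req = md5_lookup_request(
    ["000821a2e2f63df1a3873e4b280002a8", "15bf1950bd9867099e72ea6516e3d602"]
)
# To actually send it: urllib.request.urlopen(post_req)
```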
In some cases, the data requested is in the form of a list with a large number of entries. In these cases the ‘limit’ and ‘offset’ parameters can be used to step through the list, e.g.
https://api.mg-rast.org/1/project?order=name&limit=20&offset=100
will limit the number of entries returned to 20 with an offset of 100.
If these parameters are not provided, default values of limit=10 and offset=0 are used. The returned JSON structure will contain the ‘next’ and ‘prev’ (previous) URIs to simplify stepping through the list.
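Stepping through a long list by following the ‘next’ URI can be sketched as a small generator. Two details here are assumptions rather than documented behavior: that the entries of each page sit under a `data` key, and that `next` is missing or null on the last page. The function names are hypothetical.

```python
import json
from urllib.request import urlopen


def fetch_json(url):
    """Download a URL and decode the JSON body."""
    with urlopen(url) as resp:
        return json.load(resp)


def iter_entries(first_url, fetch=fetch_json):
    """Yield entries page by page, following the 'next' URI of each response."""
    url = first_url
    while url:
        page = fetch(url)
        # Assumption: the list entries are stored under the 'data' key.
        yield from page.get("data", [])
        # Assumption: 'next' is absent or null once the list is exhausted.
        url = page.get("next")


# Example (performs network requests when run):
# for project in iter_entries("https://api.mg-rast.org/1/project?order=name&limit=20"):
#     print(project)
```

Passing the fetch function as a parameter makes it easy to add authentication or to test the loop without network access.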
The data returned may be plain text, compressed gzipped files or a JSON structure.
Most API queries are ‘synchronous’ and results are returned immediately. Some queries may require substantial time to compute results; in these cases you can select the asynchronous option by adding ‘&asynchronous=1’ to the end of the query string. The query will then return a URL, which will return the query results when they are ready.
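Appending the asynchronous flag by hand is error-prone when a URL may or may not already carry a query string. The following sketch adds the parameter with the standard library; `make_asynchronous` is a hypothetical helper name.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def make_asynchronous(url):
    """Return the URL with asynchronous=1 appended to its query string."""
    parts = urlsplit(url)
    # Keep existing parameters, dropping any earlier 'asynchronous' value.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "asynchronous"
    ]
    query.append(("asynchronous", "1"))
    return urlunsplit(parts._replace(query=urlencode(query)))


async_url = make_asynchronous(
    "https://api.mg-rast.org/1/annotation/sequence/mgm4447943.3?evalue=10&type=organism"
)
print(async_url)
```

The URL returned by such a request can then be requested again later to collect the results once they are ready.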
Most of the API calls are simply URLs which can be entered in the address bar of a web browser to perform the download through the browser. These URLs can also be used with a command line tool like curl, in programming-language-specific libraries, or in command line scripts. The examples below illustrate the use of each of these methods. The example scripts are available on the github site along with other useful illustrative scripts.
Resource/Object | Description |
---|---|
annotation | taxonomic and functional annotations made by comparison with the M5nr database |
compute | resource to compute PCoA, heatmap, and normalization for a set of input metagenomes |
download | download results of the MG-RAST pipeline |
inbox | upload and listing of data in the staging area prior to execution of the MG-RAST pipeline |
library | library information for uploaded metagenome provided by the user |
matrix | abundance profiles in BIOM (5) format for a list of metagenomes |
M5nr | access M5 nonredundant protein database used for annotation of metagenomic sequences |
metadata | creation, export, and validation of metadata templates and spreadsheets |
metagenome | container for sample, library, project, and precomputed data for an uploaded metagenomic sequence file |
profile | returns a single data object in BIOM format |
project | project summary for metagenome provided by user |
sample | sample information provided by user |
search | search MG-RAST by MG-ID, metadata, function, or taxonomy; or implement a more complex search. |
validation | validates templates for correct structure and data |
Examples¶
The API provides index-driven access to data subsets using the following data types as indices into the data: functions, functional hierarchy data, and taxonomic data. Whenever possible we have employed standards to expose data and metadata, such as the BIOM standard for encoding abundance profiles. The examples below are intended to illustrate usage for the various resources available. They do not cover the entire functionality of the API; see the documentation at the API website for the comprehensive listing.
annotation
https://api.mg-rast.org/1/annotation/sequence/mgm4440036.3?type=function&filter=protease&source=Subsystems
Retrieve the reads from a metagenome with ID mgm4440036.3 which were annotated as protease in SEED Subsystems.
download
https://api.mg-rast.org/1/download/mgm4447943.3
Retrieve information formatted as a JSON object about all the files available for download for metagenome mgm4447943.3 with information about the files and sequence statistics where applicable. Each file listed has a URL included which can be used to download the file, e.g.
https://api.mg-rast.org/1/download/mgm4447943.3?file=650.1
will download the protein.sims file containing the BLAT similarities.
inbox
curl -X POST -H "auth: auth_key" -F "upload=@sequences.fastq" "https://api.mg-rast.org/1/inbox"
Upload the file ’sequences.fastq’ to your inbox. This API call requires user authentication using the auth_key described above. It can not be used in a browser, but needs to be run from the command line or from a script.
matrix
https://api.mg-rast.org/matrix/organism?group_level=family&source=SEED&evalue=5&id=mgm4440442.5&id=mgm4440026.3
Retrieve the taxonomic abundance profile at family level for two metagenomes based on SEED assignments with an evalue cutoff of 1e-5.
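The matrix query passes several metagenomes by repeating the `id` parameter. A short sketch of building such a URL in Python follows; `matrix_url` is a hypothetical helper, and its defaults simply reproduce the example above.

```python
from urllib.parse import urlencode


def matrix_url(metagenome_ids, group_level="family", source="SEED", evalue=5):
    """Build a matrix/organism URL, repeating 'id' once per metagenome."""
    query = {
        "group_level": group_level,
        "source": source,
        "evalue": evalue,
        "id": list(metagenome_ids),
    }
    # doseq=True expands the list into id=...&id=... instead of one value.
    return "https://api.mg-rast.org/matrix/organism?" + urlencode(query, doseq=True)


print(matrix_url(["mgm4440442.5", "mgm4440026.3"]))
```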
metagenome
https://api.mg-rast.org/1/metagenome/mgm4440026.3
List analysis submission parameters and other details for a metagenome. The metagenome resource can also be used to search metadata, function and taxonomy.
https://api.mg-rast.org/metagenome?function=dnaA&organism=coli&biome=marine&match=all&order=created
This call will find all marine metagenomes that have reads annotated as dnaA and a taxonomic assignment containing the text ‘coli’. The results will be ordered by the creation date of the metagenome.
project
https://api.mg-rast.org/project/mgp31?verbosity=full
Retrieve available information about the project with ID mgp31.
sample
https://api.mg-rast.org/1/sample/mgs12326?verbosity=full
Retrieve available information about individual samples, including IDs and metadata.
metadata
https://api.mg-rast.org/metadata/template
Retrieve the static template for metadata object relationships and types used by MG-RAST.
https://api.mg-rast.org/metadata/export/mgp128
Retrieve all metadata for project mgp128.
https://api.mg-rast.org/metadata/cv
Retrieve a set of lists of all our controlled metadata terms, including the ontologies.
https://api.mg-rast.org/metadata/ontology?name=biome&version=2013-04-27
Retrieve a more detailed list (with relationships) for a specific version of the ontology.
m5nr
https://api.mg-rast.org/1/m5nr/md5/ffc62262a18b38671c3e337150ef535f?source=SwissProt
Retrieve the UniProt ID for a given sequence identifier.
Example scripts using the MG-RAST REST API¶
Introduction¶
As part of the RESTful API (see chapter 6), we are providing a collection of example scripts.
Each script has comments in the source code as well as a help function. This document provides a brief overview of the available scripts and their intended purpose. Please see the help associated with all of the individual files for a complete list of options and more details.
We believe these scripts to be the best starting point for many users. Here we attempt to provide a listing of the most important tools.
URLs¶
The Examples are located on github at:
https://github.com/MG-RAST/MG-RAST-Tools
This is the base directory for the rest of this chapter. The tools and examples described below can be found here:
https://github.com/MG-RAST/MG-RAST-Tools/tree/master/tools/bin
Each script has a verbose help option (--help) that lists all options and explains their usage.
Download DNA sequence for a function – mg-get-sequences-for-function.py¶
This script will retrieve sequences and annotation for a given function or functional class.
The output is a tab-delimited list of: m5nr id, dna sequence, semicolon-separated list of annotations, sequence id.
Example:
mg-get-sequences-for-function.py --id "mgm4441680.3" --name "Central carbohydrate metabolism" --level level2 --source Subsystems --evalue 10
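The four-column output described above is straightforward to post-process. The sketch below parses it into named records; the record type and function name are hypothetical, and it assumes the output has no header line (a four-column header, if present, would be returned as a record).

```python
from typing import Iterable, Iterator, List, NamedTuple


class SequenceRecord(NamedTuple):
    m5nr_id: str
    dna_sequence: str
    annotations: List[str]
    sequence_id: str


def parse_sequence_table(lines: Iterable[str]) -> Iterator[SequenceRecord]:
    """Parse tab-delimited lines: m5nr id, DNA sequence, annotations, sequence id."""
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        m5nr_id, dna, annotations, seq_id = fields
        # The third column is a semicolon-separated list of annotations.
        names = [a.strip() for a in annotations.split(";") if a.strip()]
        yield SequenceRecord(m5nr_id, dna, names, seq_id)


# Typical use, after redirecting the script output to a file:
# with open("sequences.tsv") as handle:
#     for record in parse_sequence_table(handle):
#         print(record.sequence_id, len(record.dna_sequence))
```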
Download DNA sequences for a taxon or taxonomic group – mg-get-sequences-for-taxon.py¶
This script will retrieve sequences and annotation for a given taxon or taxonomic group.
The output is a tab-delimited list of: m5nr id, dna sequence, semicolon-separated list of annotations, sequence id.
Example:
mg-get-sequences-for-taxon.py --id "mgm4441680.3" --name Lachnospiraceae --level family --source RefSeq --evalue 8
Download sequences annotated with function and taxonomy – mg-get-annotation-set.py¶
Retrieve functional annotations for given metagenome and organism.
The output is a tab-delimited list of annotations: feature list, function, abundance for function, avg evalue for function, organism.
Example:
mg-get-annotation-set.py --id "mgm4441680.3" --top 5 --level genus --source SEED
Download the n most abundant functions for a metagenome – mg-abundant-functions.py¶
Retrieve the top n abundant functions for metagenome.
The output is a tab-delimited list of function and abundance sorted by abundance (largest first). ’top’ option controls number of rows returned.
Example:
mg-abundant-functions.py --id "mgm4441680.3" --level level3 --source Subsystems --top 20 --evalue 8
Download and translate similarities into different namespaces e.g. SEED or GenBank – m5nr-tools.pl¶
MG-RAST computes similarities against a non-redundant database (Wilke et al. 2012) and later translates them into any of the supported namespaces. As a result you can view your annotations (or indeed the similarity results) in each of these namespaces. Sometimes this can lead to new features and/or differences becoming visible that would otherwise be obscured.
m5nr-tools can translate accession ids, md5 checksums, or protein sequence into annotations. One option for output is a blast m8 formatted file.
Example:
m5nr-tools.pl --api "https://api.mg-rast.org/1" --option annotation --source RefSeq --md5 0b95101ffea9396db4126e4656460ce5,068792e95e38032059ba7d9c26c1be78,0b96c92ce600d8b2427eedbc221642f1
Download multiple abundance profiles for comparison – mg-compare-functions.py¶
Retrieve a matrix of functional abundance profiles for multiple metagenomes. The output is either a tab-delimited table of functional abundance profiles (metagenomes in columns, functions in rows) or the functional abundance profiles in BIOM format.
Example:
mg-compare-functions.py --ids "mgm4441679.3,mgm4441680.3,mgm4441681.3,mgm4441682.3" --level level2 --source KO --format text --evalue 8
Standard operating procedures (SOPs) for MG-RAST¶
SOP - Metagenome submission, publication and submission to INSDC via MG-RAST¶
MG-RAST can be used to host data for public access. There are three interfaces for uploading and publishing data: the web interface, intended for most users; command line scripts, intended for programmers; and the native RESTful API, recommended for experienced programmers.
When data is published in MG-RAST, it can also be released to the INSDC databases. This tutorial covers both use cases.
We note that MG-RAST provides temporary IDs and permanent public identifiers. The permanent identifiers are assigned at the time data is made public. Permanent MG-RAST identifiers begin with “mgm” (e.g. “mgm4449249.3”) for data sets and “mgp” (e.g. “mgp128”) for projects/studies.
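When scripting against the API it can help to distinguish permanent identifiers from temporary ones before making a request. The sketch below classifies an identifier by its prefix. The exact patterns (digits, a dot, and a version number after “mgm”; digits after “mgp”) are inferred from the examples in this document and are an assumption, not a published specification.

```python
import re

# Patterns inferred from examples such as mgm4449249.3 and mgp128 (assumption).
_METAGENOME_ID = re.compile(r"^mgm\d+\.\d+$")
_PROJECT_ID = re.compile(r"^mgp\d+$")


def classify_id(identifier):
    """Return 'metagenome', 'project', or 'unknown' for an MG-RAST identifier."""
    if _METAGENOME_ID.match(identifier):
        return "metagenome"
    if _PROJECT_ID.match(identifier):
        return "project"
    # Temporary IDs and anything else fall through here.
    return "unknown"


for candidate in ("mgm4449249.3", "mgp128", "ed2102aa666d676d343735323836382e33"):
    print(candidate, classify_id(candidate))
```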
The following data types are supported:
- Shotgun metagenomes (“raw” and assembled)
- Metatranscriptome data (“raw” and assembled)
- Ribosomal amplicon data (16s, 18s, ITS amplicons)
- Metabarcoding data (e.g. cytochrome C amplicons; basically all non ribosomal amplicons)
PLEASE NOTE: We strongly prefer raw data over assembled data. If you submit assembled data, please submit the raw reads in parallel. If you perform local optimization, e.g. adapter removal or quality clipping, please submit the raw data as well.
This document is intended for experienced to very experienced users and programmers. We recommend that most users not use the RESTful API. There is also a document describing data publication and INSDC submission via the web UI.
You will need an access token for the MG-RAST API, which can be obtained from the MG-RAST web page (http://mg-rast.org) in the user section.
You will also need a working Python interpreter. The command line scripts and example data can be found in https://github.com/MG-RAST/MG-RAST-Tools:
- Scripts: MG-RAST-Tools/tools/bin
- Data: MG-RAST-Tools/examples/sop/data
Change into MG-RAST-Tools/examples/sop/data and call:
sh get_test_data.sh
to add additional example data.
Either download the repository as a zipped archive from https://github.com/MG-RAST/MG-RAST-Tools/archive/master.zip or use the git command line tool:
git clone http://github.com/MG-RAST/MG-RAST-Tools.git
We tested up to the following parameters:
- max. size per file: 10GB
- max. project size: 200 metagenomes
While there is no reason to assume the software will not work with larger numbers of files or larger files, we did not test for that.
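A submission can be checked against these tested limits before uploading. The following is a minimal sketch, not part of MG-RAST-Tools: the function name is hypothetical, it assumes one sequence file per metagenome, and it reads "10GB" as 10 GiB.

```python
import os

MAX_FILE_BYTES = 10 * 1024**3  # assumption: "10GB" interpreted as 10 GiB
MAX_METAGENOMES = 200          # assumption: one sequence file per metagenome


def check_submission(paths, max_file_bytes=MAX_FILE_BYTES, max_files=MAX_METAGENOMES):
    """Return a list of warnings for files outside the tested parameters."""
    warnings = []
    if len(paths) > max_files:
        warnings.append(
            f"{len(paths)} files exceed the tested project size of {max_files}"
        )
    for path in paths:
        size = os.path.getsize(path)
        if size > max_file_bytes:
            warnings.append(f"{path}: {size} bytes exceeds the tested file size")
    return warnings


# Example: check_submission(["sample_1.fasta.gz", "sample_2.fasta.gz"])
```

An empty list means the submission falls within the tested range; a warning does not mean the upload will fail, only that it was not tested.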
SOP:¶
Upload and submit sequence data and metadata to MG-RAST using the command mg-submit.py. Note: This is an asynchronous process that may take some time depending on the size and number of datasets. (We recommend that novice users try the web frontend; the command line is primarily intended for programmers.) The metadata in this example is in Microsoft Excel format; JSON-formatted metadata is also an option. Please note: We have observed multiple problems with spreadsheets that were converted from older versions of Excel or “compatible” tools, e.g. OpenOffice.
Example:
mg-submit.py submit simple .... --metadata
Verify the results and obtain a temporary identifier, e.g. by using the web UI at http://mg-rast.org. You can also use the web UI to publish the data and trigger submission to INSDC.
Publish your project in MG-RAST and obtain a stable and public MG-RAST project identifier
Note: once the data is made public the data is read only, but metadata can be improved
Example:
mg-project.py make-public $temporary_ID
Trigger release to INSDC/ submit to EBI
Note: Metadata updates are automatically synced with INSDC databases within 48 hours.
Example:
mg-project.py submit-ebi $PROJECT_ID
Check status of release to INSDC/ submission to EBI
Note: This is an asynchronous process that may take some time depending on the size and number of datasets.
Example:
mg-project.py status-ebi $PROJECT_ID
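When these steps are automated, the identifiers printed by mg-submit.py have to be captured for the later steps. The sketch below extracts them from captured output. It assumes the `Temp Project ID:` and `Submission ID:` line format shown in this document's sample submission; the function name is hypothetical.

```python
import re

_ID_LINE = re.compile(r"^\s*>?\s*(Temp Project ID|Submission ID):\s*(\S+)")


def parse_submit_output(text):
    """Collect 'Temp Project ID' and 'Submission ID' values from submit output."""
    ids = {}
    for line in text.splitlines():
        match = _ID_LINE.match(line)
        if match:
            # Map the label to the identifier that follows it.
            ids[match.group(1)] = match.group(2)
    return ids


captured = (
    "> Temp Project ID: ed2102aa666d676d343735323836382e33\n"
    "> Submission ID: 77a1a1a5-4cbd-4673-86bf-f87c9096c3e1\n"
)
print(parse_submit_output(captured))
```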
We include a sample submission below:
From within the MG-RAST-Tool repository directory
# Retrieve repository and setup environment
git clone http://github.com/MG-RAST/MG-RAST-Tools.git
cd MG-RAST-Tools
# Path to scripts for this example
PATH=$PATH:`pwd`/tools/bin
# set environment variables
source set_env.sh
# Set credentials, obtain token from your user preferences in the UI
mg-submit.py login --token
# Create metadata spreadsheet. Make sure you map your samples to your
# sequence files
# Upload metagenomes and metadata to MG-RAST
mg-submit.py submit simple \
examples/sop/data/sample_1.fasta.gz \
examples/sop/data/sample_2.fasta.gz \
--metadata examples/sop/data/metadata.xlsx
# Output
> Temp Project ID: ed2102aa666d676d343735323836382e33
> Submission ID: 77a1a1a5-4cbd-4673-86bf-f87c9096c3e1
# Remember IDs for later use
SUBMISSION_ID=77a1a1a5-4cbd-4673-86bf-f87c9096c3e1
TEMP_ID=ed2102aa666d676d343735323836382e33
# Check if project is finished
mg-submit.py status $SUBMISSION_ID
# Output
> Submission: 77a1a1a5-4cbd-4673-86bf-f87c9096c3e1 Status: in-progress
# Make project public in MG-RAST
mg-project.py make-public $TEMP_ID
# Output
> # Your project is public.
> Project ID: mgp128
> URL: https://mg-rast.org/linkin.cgi?project=mgp128
PROJECT_ID=mgp128
# Release project to INSDC archives
mg-project.py submit-ebi $PROJECT_ID
# Output
> # Your Project mgp128 has been submitted
> Submission ID: 0cf7d811-1d43-4554-ab97-3cb1f5ceb6aa
# Check if project is finished
mg-project.py status-ebi $PROJECT_ID
# Output
> Completed
> ENA Study Accession: ERP104408
Acknowledgments¶
This project is funded by NIH grant R01AI123037 and by NSF grant 1645609.
This work used the Magellan machine (U.S. Department of Energy, Office of Science, Advanced Scientific Computing Research, under contract DE-AC02-06CH11357) at Argonne National Laboratory, and the PADS resource (National Science Foundation grant OCI-0821678) at the Argonne National Laboratory/University of Chicago Computation Institute.
In the past the following sources contributed to MG-RAST development:
- U.S. Dept. of Energy under Contract DE-AC02-06CH11357
- Sloan Foundation (SLOAN #2010-12)
- NIH NIAID (HHSN272200900040C)
- NIH Roadmap HMP program (1UH2DK083993-01)
[1] | This includes only the computation cost, no data transfer cost, and was computed using 2009 prices. |
[2] | We use the term cloud as a shortcut for Infrastructure as a Service (IaaS). |
[3] | This would be for several metagenomes that are part of the JGI Prairie pilot. |
[4] | An MD5 checksum is a widely used way to create a digital fingerprint for a file. Think of it as a kind of checksum, if the fingerprint changed, so did the file. The fingerprints are easy to compare. There are many tools out there for creating MD5 checksums, google is your friend. |