Download data and meta-data
Download via HPC (bulk download)
Context
This guide is intended to capture download to a desktop workstation, or direct download of portal data to a high performance computing (HPC) environment.
Instructions
-
When you have found the data set(s) you would like to download, click Bulk Download
- This will provide a summary of the processs, and present a Download Zip button. Clicking this button will generate a zip folder with the files you need to download the data.
- It DOES NOT download the data directly.
- Download and decompress this folder
- Inside there are the following files and folders:
organization_metadata/
package_metadata/
resource_metadata/
tmp/
download.py
download.ps1
download.sh
memberships.txt
README.txt
-
README.txt
provides instructions for data download, and information about the size of the data you have selected for download: PLEASE READ THIS! -
memberships.txt
provides information about which initiative membership you have, and also what membership is required to access the data you have selected. If you do not have the appropriate membership, you can try to self-register from some initiatives, or submit a request for access to help@bioplatforms.com -
organization_metadata
contains a spreadsheet file for each organization which owns data in the downloaded filtered data set. This metadata includes links to acknowledgement, initiative and methods information. -
package_metadata
contains a spreadsheet file with the metadata relevant to the downloaded filtered data set -
resource_metada
contains a spreadsheet file with the metadata relevant to the files which comprise the filtered data set -
The
tmp/
folder contains:*_md5sum.txt
, where the * indicates the name of the downloaded data package*_urls.txt
, where the * indicates the urls for each data set in the downloaded package
-
download.py
,download.ps1
anddownload.sh
are scripts in Python, Powershell and bash-
download.py
: Python 3 script, which when executed will download the files, and then checksum them. This is supported on all platforms (Windows, Linux, MacOS). Requires Python 3 and therequests
module to be installed. -
download.ps1
: Windows PowerShell script, which when executed will download the files, and then checksum them. This is supported on a Microsoft system, and uses only PowerShell. -
download.sh
: UNIX shell script, which when executed will download the files, and then checksum them. This is supported on any Linux or MacOS/BSD system, so long ascurl
is installed.
-
- When you run
download.py
,download.sh
ordownload.ps1
, it will provide instructions to set up your API token - Set up API token within the data portal and assign it to the CKAN_API_TOKEN environment variable. You can create your API Token by browsing to your user page by clicking your name at the top of the Portal page. Then click the API Tokens tab. Enter a name for your token, then click “Create API Token”. Copy the generated token, and use it to populate the CKAN_API_TOKEN environment variable. Note: the token will only be displayed ONCE when it is created, however you can create and delete as many tokens as you wish. If you wish, add your token to your password manager of choice
- Ensure the environment variable is set whenever you wish to download data from the data portal.
- Ensure you have sufficient storage available for your selected data to download. Size requirements can be viewed in README.txt.
- Run
download.py
,download.sh
ordownload.ps1
again - The data should now download and checksum
Common Issues and Problems
download.sh - MD5 sums do not validate correctly and files are not correct size
- Check that your enviroment variable for the CKAN_API_TOKEN is defined, and contains the API token generated within CKAN. (If the CKAN_API_TOKEN is missing or incorrect, the files do not download properly due to permission issues).
- If you are using the BASH (.sh) script, and ran it without having set the CKAN_API_TOKEN, set the token and delete the files. Then run the script again.
- Check that you are running a recent version of curl. The Bioplatforms Data Portal requires version 7.58 or later
(due to a bug fix with the Authorization header). Run
curl --version
to check. - Check that your PATH contains the correct version of curl. Run
which curl
to check. - Once you have checked all these issues, please rerun the script, even if you have not made any changes, as intermittent issues do sometimes occur.
Programmatic Download via R/Python/bash
The Bioplatforms Data Portal is based upon CKAN, an open source platform for sharing metadata and data. CKAN provides Application Programming Interfaces (APIs) which allow computer software to interact with the portal. It is possible for software to programmatically do everything you can do manually by using the portal in your web browser - including searching for data, downloading data, and even uploading data.
This guide covers access to the portal using the R and Python programming languages, and via the command shell on your desktop computer (bash
or zsh
.)
Getting started
When your computer program connects to the portal, it must identify itself. It will do this by sending an ‘API key’. You can find this key on the portal.
-
Go to the portal https://data.bioplatforms.com/
-
At the top right of the page, you will see either a prompt to log in, or your name
-
Log in, if required, and then click on your name
-
Click the API Tokens tab,
- Enter a name for the token, and click the Create API Token button.
- The token will be generated and shown to you ONCE only. Copy it to your clipboard.
- If you wish, add your token to your password manager of choice.
- You will need this API token for the scripts used below!
If you are downloading or uploading large datasets from the portal, it is highly recommended that you do this from an appropriate environment, with a fast and reliable connection to the internet. Many institutions provide access to HPC nodes.
Getting started: R
You will need access to an R installation, either on your computer or on a HPC node.
-
Launch R, and
-
Install the ckanr bindings for CKAN:
> install.packages("devtools")
> library("devtools")
> install_github("ropensci/ckanr")
Getting started: Python
You will need access to a Python installation, either on your computer or on a HPC node. Note: Python is installed by default on MacOS and most Linux systems. The examples given here are for Python version 3.7 or higher.
Use pip
to install the ckanapi extension and the requests
extension:
$ pip install ckanapi
$ pip install requests
Getting started: Bash
You will need curl installed. Note: Curl is installed by default on MacOS and most Linux systems. For Windows, you can easily install it using the Windows Subsystem for Linux, or chocolatey.
You will also need jq installed - it is including in most Linux distributions, and is available via homebrew on MacOS.
Searching the CKAN archive
CKAN makes it possible to query datasets and resources programmatically.
This may be of use when:
- running a specific query repeatedly, or
- building a front-end or command line tool to search for data in a particular way
It is helpful to be aware of two key concepts which the Data Portal uses:
-
Packages: a package is a collection of zero or more resources. It is effectively a flat folder, with metadata.
-
Resources: a resource is a file, with its associated metadata, including its type and size
The following sections contain example code snippets demonstrating how to search for all ‘amplicon’ data from the Australian Microbiome Framework Initiative. The Bioplatforms Australia Data Portal allows up to 50,000 search results to be returned at a time. The results returned will contain the package and resource metadata for each ‘amplicon’ dataset.
The list of all data types, and their associated schemas, are in the ckanext-bpaschema repository on Github. Each data type has a JSON schema file in that directory. Make sure to change to delimiter from _
(underscore) to -
(hyphen) prior to trying a programmatic search.
Note: you will need to change the following elements below for the scripts to function correctly
- replace the
xx-xx-xx-xx-xx
with your API token from the data portal - replace the
type:amdb-genomics-amplicon
with the data type you would like to search for: these can be found in the ckanext-bpatheme repository on Github
For example:
Searching: Python
import ckanapi
remote = ckanapi.RemoteCKAN('https://data.bioplatforms.com', apikey='xx-xx-xx-xx-xx')
# we increase the number of rows to be returned, and we
# ask for all packages, including private packages
result = remote.action.package_search(
q='type:amdb-genomics-amplicon',
rows=50000,
include_private=True)
print("{} matches found.".format(result['count']))
Searching: R
library(ckanr)
key="xx-xx-xx-xx-xx"
ckanr_setup(url="https://data.bioplatforms.com", key=key)
x <- package_search(q='type:amdb-genomics-amplicon', rows=50000, include_private=TRUE)
print(x$count)
Searching: bash
export CKAN_API_TOKEN="xx-xx-xx-xx-xx"
curl -H "Authorization: "$CKAN_API_TOKEN" 'https://data.bioplatforms.com/api/3/action/package_search?q=type:amdb-genomics-amplicon&rows=50000&include_private=true'
Downloading data from the CKAN archive
Once you have identified the packages containing the data you wish to download, it is possible to programmatically download this data. You may also wish to save the associated metadata for use in your analysis.
The following examples extend the search functionality above to download and save data files.
Downloading data: Python
import ckanapi
remote = ckanapi.RemoteCKAN('https://data.bioplatforms.com', apikey='xx-xx-xx-xx-xx')
# we increase the number of rows to be returned, and we
# ask for all packages, including private packages
result = remote.action.package_search(
q='type:amdb-genomics-amplicon',
rows=50000,
include_private=True)
print("{} matches found.".format(result['count']))
# iterate through the resulting packages, downloading them one by one
import requests
for package in result['results']:
for resource in package['resources']:
url = resource['url']
response = requests.get(resource['url'], stream=True, headers={'Authorization': remote.apikey})
handle = open(resource['name'], 'wb')
for chunk in response.iter_content(chunk_size=1024):
if chunk: #filter out keep-alive new chunks
handle.write(chunk)
Downloading data: R
library(ckanr)
key="xx-xx-xx-xx-xx"
ckanr_setup(url="https://data.bioplatforms.com", key=key)
x <- package_search(q='type:amdb-genomics-amplicon', rows=50000, include_private=TRUE)
headers <- (key)
names(headers) <- c('Authorization')
for (package in x$results) {
for (resource in package$resources) {
print(resource$url)
download.file(resource$url, resource$name, headers=headers)
}
}
Downloading data: bash
We ‘pipe’ the response received from the Data Portal into jq
, a command-line tool which allows us to easily parse JSON data.
export CKAN_API_TOKEN="xx-xx-xx-xx-xx"
for URL in $( curl -H "Authorization: $CKAN_API_TOKEN" 'https://data.bioplatforms.com/api/3/action/package_search?q=type:amdb-genomics-amplicon&rows=50000&include_private=true' | jq -r '.result .results [] .resources [] .url'); do
# download the file using curl
echo "downloading: $URL"
curl -O -L -C - -H "Authorization: $CKAN_API_KEY" "$URL"
done
You can also use jq
to generate an MD5 checksum file, which can be used to verify the downloaded data.
export CKAN_API_TOKEN="xx-xx-xx-xx-xx"
curl -H "Authorization: $CKAN_API_TOKEN" 'https://data.bioplatforms.com/api/3/action/package_search?q=type:amdb-genomics-amplicon&rows=50000&include_private=true' | jq -r '.result .results [] .resources [] | "\(.md5) \(.name)"' > checksums.md5
# check downloaded data
md5sum -c checksums.md5
Downloading data: bash @ HPC
If you wish to download data directly from your cloud or HPC environment, you can copy the bulk download Zip archive, described in the find, filter, download guide, to that system, extract it, and run the download (download.sh
) from there. This means you will have access to the reliable, high-speed connection to the internet provided by that environment, as well as any attached storage resources.
Advanced integrations with the archive
Once you are set up for programmatic access, it is build more advanced integrations with the portal. As examples, you could:
- poll the portal periodically for new data, and automatically download it
- only download files that you haven’t already downloaded
- run an analysis pipeline as data becomes available
- perform arbitrary complex filters on the data and metadata in the portal
The example below (in Python) explores some of those possibilites, building on the snippets above.
import ckanapi
import datetime
import os
remote = ckanapi.RemoteCKAN('https://data.bioplatforms.com', apikey='xx-xx-xx-xx-xx')
# we increase the number of rows to be returned, and we
# ask for all packages, including private packages
result = remote.action.package_search(
q='type:amdb-genomics-amplicon',
rows=50000,
include_private=True)
print("{} matches found.".format(result['count']))
# iterate through the resulting packages, downloading them one by one
import requests
for package in result['results']:
# filter the results based on the date they were sampled
date_sampled = datetime.datetime.strptime(package['date_sampled'], '%Y-%m-%d').date()
# if the sample is less than 90 days old, excluded it
if (datetime.date.today() - date_sampled).days < 90:
continue
for resource in package['resources']:
target = resource['name']
# do we already have the file? if so, skip the download
if os.access(target, os.R_OK):
continue
url = resource['url']
resp = requests.get(resource['url'], headers={'Authorization': remote.apikey})
with open(target, 'wu') as fd:
fd.write(resp.content)