8.14. Omics storages

Reference store
- Create reference store
- View and edit content of reference store
  - Import reference
  - Delete reference
Sequence store
- Create sequence store
- View and edit content of sequence store
Manage store
Manage via the CLI
Usage

Cloud Pipeline supports AWS HealthOmics Storages.
These are specialized storages that allow to store Omics data.
There are two types of AWS HealthOmics Storages:

reference store - for storing genome references
sequence store - for storing genomics files (e.g., BAM, CRAM, FASTQ)

Depending on the store type, management and abilities will vary.

To create a Storage in a Folder you need to have WRITE permission for that folder and the ROLE_STORAGE_MANAGER role. For more information see 13. Permissions.

Reference store

Specialized storage for storing raw genome references.

Create reference store

Note: please note that for each Cloud Region, only one reference store can be created.

To create reference store:

Navigate to the folder where you want to create reference storage.
Click + Create → Storages → Create AWS HealthOmics Store:
The pop-up of the AWS HealthOmics Storage creation will appear:
Specify a name of the creating store and select the service type as Reference store:
Specify a description if necessary.
Click the Create button to confirm.
Reference store will appear in the folder:

View and edit content of reference store

Reference store has a flat structure:

on the first level, there are only folders, where each folder presents a reference
on the second level (inside the folder of each reference), there are only files of the reference itself and its index
nested folders are not supported

CP_AWS_HealthOmicsStorages

Inside the reference folder:
CP_AWS_HealthOmicsStorages

In the row of the reference file/its folder, there is a set of labels, e.g.:
CP_AWS_HealthOmicsStorages

reference name
state of the file
type of the content (REFERENCE)

Possible actions in a reference store:

Import new reference - load a reference from the data storage
Download reference - download reference files to the local workstation
Delete reference

Import reference

Note: to load a reference into a reference store, it shall be previously loaded to the s3 bucket available on the Cloud Pipeline Platform.

To load a reference:

Open the reference store.
Click the Import button:
Import form will appear:
Specify mandatory fields:
- Name - reference genome name
- Subject id - source's subject ID
- Sample id - source's sample ID
If necessary, specify optional fields:
- Description - reference description
- Generated from - reference details
Select the reference source file:
- click the folder icon near the Source file label:
- the pop-up to select a file from a data storage will appear:
- select a reference file in one of the regular storages available in the Cloud Pipeline Platform, e.g.:
- click the OK button to confirm selection
- selected file will be shown near the Source file label:
Once all fields are specified, click the Import button:
Attributes panel with the section of import jobs will be opened automatically on the right side:

At this panel, you can check the state of the file import jobs:
- set dates (From and To) and desired state (select from the list)
- click the Search button, results will be shown as the job IDs list, e.g. to find newly completed jobs:
When the import is already completed, reference will appear in the storage:
Click the reference, to display reference files:

As you can see, reference folder contains the reference file itself (source) and automatically created index (index).

Note: to show/hide import jobs section, you may use the special menu in the upper side of the store:

Delete reference

Note: you may remove only the reference entirely, separate reference files can not be removed.

To remove a reference:

Click the Delete button in a reference row, e.g.:
Confirm the deletion in the appeared pop-up:
Reference will be permanently removed.

Sequence store

Specialized storage for storing different types of genomics files - currently, these are BAM, CRAM, UBAM, FASTQ.

Create sequence store

To create sequence store:

Navigate to the folder where you want to create sequence storage.
Click + Create → Storages → Create AWS HealthOmics Store:
The pop-up of the AWS HealthOmics Storage creation will appear:
Specify a name of the creating store and select the service type as Sequence store:
Specify a description if necessary, then click the Create button to confirm:
Sequence store will appear in the folder:

View and edit content of sequence store

Sequence store has a flat structure:

on the first level, there are only folders, where each folder presents a separate genomic sequence
on the second level (inside the folder of each sequence), there are only sequence genomic files - depending on the format, these can be one (for example, a single UBAM file) or two files (for example, a BAM file and its index)
nested folders are not supported

CP_AWS_HealthOmicsStorages

Inside the sequence folder, there are sequence files:
CP_AWS_HealthOmicsStorages

In the row of the sequence file/folder, there is a set of labels, e.g.:
CP_AWS_HealthOmicsStorages

sequence name
state of the file
type of the content (e.g. FASTQ)
sample id
subject id

Possible actions in a sequence store:

Import new sequence - load a sequence from the data storage
Upload sequence - upload a sequence from the local workstation
Download sequence - download a sequence to the local workstation
Delete sequence

Import sequence

Note: to load a sequence into a sequence store, it shall be previously loaded to the s3 bucket available on the Cloud Pipeline Platform.

To load a sequence:

Open the sequence store.
Click the Import button:
Import form will appear:
Specify mandatory fields:
- Name - sequence name
- Subject id - source's subject ID
- Sample id - source's sample ID
If necessary, specify optional fields:
- Description - sequence description
- Generated from - sequence details
From the corresponding dropdown list, select the type of a source file you want to load, e.g.:
Select the sequence source file:
- click the folder icon near the Source file label:
- the pop-up to select a file from a data storage will appear:
- select a sequence file in one of the regular storages available in the Cloud Pipeline Platform, e.g.:
- click the OK button to confirm selection
- selected file will be shown near the Source file label:
Additional field for a second source file will appear:

You may add such additional file similarly as described at the previous step, e.g.:
If necessary, you may link a sequence with a reference from the reference store:
- click the folder icon near the Reference path label:
- in the appeared pop-up, select a reference from the reference store, e.g.:
- click the OK button to confirm selection
Once all fields are specified, click the Import button:
Attributes panel with the section of import jobs will be opened automatically on the right side:

At this panel, you can check the state of the file import jobs:
- set dates (From and To) and desired state (select from the list)
- click the Search button, results will be shown as the job IDs list, e.g. to find newly completed jobs:
When the import is already completed, sequence will appear in the storage:
Click the sequence, to display sequence files:

As you can see, sequence folder contains files named by the format: source plus the index.

Note: to show/hide import jobs section, you may use the special menu in the upper side of the store:

Upload sequence

To upload a sequence:

Open the sequence store.
Click the Upload button:
Upload form will appear:
Select the type of a source file you want to upload, e.g.:
Select the sequence source file from your local workstation:
- click the Upload source file button:
- the OS pop-up to select a file will appear
- choose a sequence file and confirm, e.g.:
- selected file will be shown near the Upload source file button:
Additional button for a second source file will appear:

You may add such additional file similarly as described at the previous step, e.g.:
If necessary, you may link a sequence with a reference from the reference store:
- click the folder icon near the Reference label:
- in the appeared pop-up, select a reference from the reference store, e.g.:
- click the OK button to confirm selection
Specify mandatory fields:
- Name - sequence name
- Sample id - source's sample ID
- Subject id - source's subject ID
If necessary, specify optional fields:
- Description - sequence description
- Generated from - sequence details
Once all fields are specified, click the Upload button:
The upload will take some time:
Then, the pop-up will be automatically closed, sequence will appear in the storage:
Please note, that the upload may take extra time - during this period state of the sequence will be shown as PROCESSING_UPLOAD, download button will not be shown:
When the upload is already completed, sequence state will change to ACTIVE and download button near will appear:

Download sequence

To download at once all files of the sequence to the local workstation:

Click the Download button in a sequence row.
Download will be start automatically (please note that files will be loaded separately, not as an archive):

To download specific sequence file to the local workstation:

Open the sequence folder.
Click the Download button in a row of the sequence file.
Download of the selected file will be start automatically:

Delete sequence

Note: you may remove only the sequence entirely, separate sequence files can not be removed.

To remove a sequence:

Click the Delete button in a sequence row, e.g.:
Confirm the deletion in the appeared pop-up:
Sequence will be permanently removed.

Manage store

To edit reference or sequence store:

Click the gear icon in the right upper corner of the store page.
The settings pop-up will appear, e.g.:

Here, you can:

edit store alias and description - specify new value(s) and click the Save button
grant store permissions - for more details see the section 13. Permissions
delete a store - for more details the section 8.5. Delete Data Storage

Manage via the CLI

You can also manage AWS HealthOmics Storages and their data via CLI.

Currently, the following CLI functionality is supported for AWS HealthOmics Storages:

store creation - using pipe storage create command:

Note: to specify the storage type during the creation (with -t | --type option) use the following values - AWS_OMICS_REF (for reference store) and AWS_OMICS_SEQ (for sequence store).
store listing - using pipe storage ls command:

Note: to specify the storage path, use omics Cloud prefix.
moving store to another folder - using pipe storage mvtodir command:
store deletion - using pipe storage delete command:
genomic data uploading - using pipe storage cp <source> <destination> command, where:
- <source> - path to genomic data file(s) from the existing data storage registered in the Cloud Pipeline platform or from your local workstation
- <destination> - AWS HealthOmics Storage path
  Note: for loading genomic data, you shall use the option -a | --additional-options to specify arguments of the data file:
  - name - genomic file name
  - subject_id - source's subject ID
  - sample_id - source's sample ID
  - file_type - type of the loading file
download genomic data - using pipe storage cp <source> <destination> command, where:
- <source> - path to genomic data in AWS HealthOmics Storage (folder or file)
- <destination> - path on your local workstation

Usage

AWS HealthOmics Storages can not be mounted to the running instances as regular storages.
But these storages (and their data) can be used via pipeline's parameters - for more details see the corresponding section.

You may use data from AWS HealthOmics Storages in 2 types of parameters:

Path parameter:
- select the corresponding parameter type from the list:
- click the folder icon in the appeared parameter field:
- in the pop-up, select a reference or sequence store, e.g.:
- then, you may select specific genomic data or the whole AWS HealthOmics Store:
- an example of the whole selected bucket as the path parameter value:
Input parameter:
- select the corresponding parameter type from the list:
- click the input icon in the appeared parameter field:
- in the pop-up, select a reference or sequence store, e.g.:
- then, you may select specific genomic data, e.g.:
- an example of the selected reference as the input path parameter's value: