8.14. Omics storages

Cloud Pipeline supports AWS HealthOmics Storages.
These are specialized storages designed for storing omics data.
There are two types of AWS HealthOmics Storages:

  • reference store - for storing genome references
  • sequence store - for storing genomics files (e.g., BAM, CRAM, FASTQ)

Management options and available actions depend on the store type.

To create a Storage in a Folder you need to have WRITE permission for that folder and the ROLE_STORAGE_MANAGER role. For more information see 13. Permissions.

Reference store

Specialized storage for storing raw genome references.

Create reference store

Note: only one reference store can be created per Cloud Region.

To create a reference store:

  1. Navigate to the folder where you want to create reference storage.
  2. Click + Create → Storages → Create AWS HealthOmics Store:
    CP_AWS_HealthOmicsStorages
  3. The AWS HealthOmics Storage creation pop-up will appear:
    CP_AWS_HealthOmicsStorages
  4. Specify a name for the new store and select Reference store as the service type:
    CP_AWS_HealthOmicsStorages
  5. Specify a description if necessary.
  6. Click the Create button to confirm.
  7. The reference store will appear in the folder:
    CP_AWS_HealthOmicsStorages

View and edit content of reference store

A reference store has a flat structure:

  • on the first level, there are only folders, where each folder represents a reference
  • on the second level (inside each reference folder), there are only the reference file itself and its index
  • nested folders are not supported
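
The layout rules above can be expressed as a small check. The following Python sketch is illustrative only: the function name classify_store_path and the slash-separated relative path format are assumptions made for this example and are not part of Cloud Pipeline.

```python
from pathlib import PurePosixPath


def classify_store_path(relative_path: str) -> str:
    """Classify a path relative to the store root under the flat two-level layout.

    Returns "reference" for a first-level folder and "file" for a file inside
    a reference folder. Raises ValueError for an empty path or anything deeper,
    because nested folders are not supported.
    """
    parts = PurePosixPath(relative_path.strip("/")).parts
    if len(parts) == 1:
        return "reference"  # first level: one folder per reference
    if len(parts) == 2:
        return "file"  # second level: the reference file or its index
    raise ValueError(f"unsupported path depth in a flat store: {relative_path!r}")
```

For example, a hypothetical path such as hg38/source is classified as a file, while hg38/extra/source is rejected.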

CP_AWS_HealthOmicsStorages

Inside the reference folder:
CP_AWS_HealthOmicsStorages

Each row of a reference folder or reference file shows a set of labels, e.g.:
CP_AWS_HealthOmicsStorages

  • reference name
  • state of the file
  • type of the content (REFERENCE)

Possible actions in a reference store:

  • Import new reference - load a reference from the data storage
  • Download reference - download reference files to the local workstation
  • Delete reference

Import reference

Note: to load a reference into a reference store, the reference must first be uploaded to an S3 bucket available in the Cloud Pipeline Platform.

To load a reference:

  1. Open the reference store.
  2. Click the Import button:
    CP_AWS_HealthOmicsStorages
  3. Import form will appear:
    CP_AWS_HealthOmicsStorages
  4. Specify mandatory fields:
    • Name - reference genome name
    • Subject id - source's subject ID
    • Sample id - source's sample ID
  5. If necessary, specify optional fields:
    • Description - reference description
    • Generated from - reference details
  6. Select the reference source file:
    • click the folder icon near the Source file label:
      CP_AWS_HealthOmicsStorages
    • the pop-up to select a file from a data storage will appear:
      CP_AWS_HealthOmicsStorages
    • select a reference file in one of the regular storages available in the Cloud Pipeline Platform, e.g.:
      CP_AWS_HealthOmicsStorages
    • click the OK button to confirm selection
    • selected file will be shown near the Source file label:
      CP_AWS_HealthOmicsStorages
  7. Once all fields are specified, click the Import button:
    CP_AWS_HealthOmicsStorages
  8. The Attributes panel with the import jobs section will open automatically on the right side:
    CP_AWS_HealthOmicsStorages
    In this panel, you can check the state of the file import jobs:
    • set dates (From and To) and the desired state (select from the list)
    • click the Search button - results will be shown as a list of job IDs, e.g. to find recently completed jobs:
      CP_AWS_HealthOmicsStorages
  9. Once the import is completed, the reference will appear in the storage:
    CP_AWS_HealthOmicsStorages
  10. Click the reference to display its files:
    CP_AWS_HealthOmicsStorages
    As you can see, the reference folder contains the reference file itself (source) and an automatically created index (index).

Note: to show or hide the import jobs section, use the menu at the top of the store page:
CP_AWS_HealthOmicsStorages

Delete reference

Note: a reference can only be removed as a whole; individual reference files cannot be removed.

To remove a reference:

  1. Click the Delete button in a reference row, e.g.:
    CP_AWS_HealthOmicsStorages
  2. Confirm the deletion in the pop-up that appears:
    CP_AWS_HealthOmicsStorages
  3. The reference will be permanently removed.

Sequence store

Specialized storage for storing different types of genomics files - currently, these are BAM, CRAM, UBAM, FASTQ.

Create sequence store

To create a sequence store:

  1. Navigate to the folder where you want to create sequence storage.
  2. Click + Create → Storages → Create AWS HealthOmics Store:
    CP_AWS_HealthOmicsStorages
  3. The AWS HealthOmics Storage creation pop-up will appear:
    CP_AWS_HealthOmicsStorages
  4. Specify a name for the new store and select Sequence store as the service type:
    CP_AWS_HealthOmicsStorages
  5. Specify a description if necessary, then click the Create button to confirm:
    CP_AWS_HealthOmicsStorages
  6. The sequence store will appear in the folder:
    CP_AWS_HealthOmicsStorages

View and edit content of sequence store

A sequence store has a flat structure:

  • on the first level, there are only folders, where each folder represents a separate genomic sequence
  • on the second level (inside each sequence folder), there are only the genomic files of the sequence - depending on the format, this can be one file (for example, a single UBAM file) or two files (for example, a BAM file and its index)
  • nested folders are not supported

CP_AWS_HealthOmicsStorages

Inside the sequence folder, there are sequence files:
CP_AWS_HealthOmicsStorages

Each row of a sequence folder or sequence file shows a set of labels, e.g.:
CP_AWS_HealthOmicsStorages

  • sequence name
  • state of the file
  • type of the content (e.g. FASTQ)
  • sample id
  • subject id

Possible actions in a sequence store:

  • Import new sequence - load a sequence from the data storage
  • Upload sequence - upload a sequence from the local workstation
  • Download sequence - download a sequence to the local workstation
  • Delete sequence

Import sequence

Note: to load a sequence into a sequence store, the sequence must first be uploaded to an S3 bucket available in the Cloud Pipeline Platform.

To load a sequence:

  1. Open the sequence store.
  2. Click the Import button:
    CP_AWS_HealthOmicsStorages
  3. Import form will appear:
    CP_AWS_HealthOmicsStorages
  4. Specify mandatory fields:
    • Name - sequence name
    • Subject id - source's subject ID
    • Sample id - source's sample ID
  5. If necessary, specify optional fields:
    • Description - sequence description
    • Generated from - sequence details
  6. From the corresponding dropdown list, select the type of a source file you want to load, e.g.:
    CP_AWS_HealthOmicsStorages
  7. Select the sequence source file:
    • click the folder icon near the Source file label:
      CP_AWS_HealthOmicsStorages
    • the pop-up to select a file from a data storage will appear:
      CP_AWS_HealthOmicsStorages
    • select a sequence file in one of the regular storages available in the Cloud Pipeline Platform, e.g.:
      CP_AWS_HealthOmicsStorages
    • click the OK button to confirm selection
    • selected file will be shown near the Source file label:
      CP_AWS_HealthOmicsStorages
  8. An additional field for a second source file will appear:
    CP_AWS_HealthOmicsStorages
    You can add the second file in the same way as described in the previous step, e.g.:
    CP_AWS_HealthOmicsStorages
  9. If necessary, you may link a sequence with a reference from the reference store:
    • click the folder icon near the Reference path label:
      CP_AWS_HealthOmicsStorages
    • in the pop-up that appears, select a reference from the reference store, e.g.:
      CP_AWS_HealthOmicsStorages
    • click the OK button to confirm selection
  10. Once all fields are specified, click the Import button:
    CP_AWS_HealthOmicsStorages
  11. The Attributes panel with the import jobs section will open automatically on the right side:
    CP_AWS_HealthOmicsStorages
    In this panel, you can check the state of the file import jobs:
    • set dates (From and To) and the desired state (select from the list)
      CP_AWS_HealthOmicsStorages
    • click the Search button - results will be shown as a list of job IDs, e.g. to find recently completed jobs:
      CP_AWS_HealthOmicsStorages
  12. Once the import is completed, the sequence will appear in the storage:
    CP_AWS_HealthOmicsStorages
  13. Click the sequence to display its files:
    CP_AWS_HealthOmicsStorages
    As you can see, the sequence folder contains files named according to the same convention: source plus the index.
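
The fields of the import form can be summarized as a simple pre-submission check. The sketch below is a hypothetical Python model of the form, not a Cloud Pipeline API: the class name, attribute names, and error messages are assumptions, and the set of file types is taken from the formats listed for sequence stores in this section.

```python
from dataclasses import dataclass
from typing import List, Optional

# File types named for sequence stores in this section.
SEQUENCE_FILE_TYPES = {"BAM", "CRAM", "UBAM", "FASTQ"}


@dataclass
class SequenceImportRequest:
    # Mandatory fields of the import form.
    name: str
    subject_id: str
    sample_id: str
    file_type: str
    source_files: List[str]  # one source file, optionally followed by a second
    # Optional fields of the import form.
    description: Optional[str] = None
    generated_from: Optional[str] = None
    reference_path: Optional[str] = None  # optional link to a reference

    def validation_errors(self) -> List[str]:
        """Return a list of problems; an empty list means the form is complete."""
        errors = []
        mandatory = (
            ("Name", self.name),
            ("Subject id", self.subject_id),
            ("Sample id", self.sample_id),
        )
        for label, value in mandatory:
            if not value.strip():
                errors.append(f"{label} is mandatory")
        if self.file_type not in SEQUENCE_FILE_TYPES:
            errors.append(f"unknown file type: {self.file_type}")
        if not 1 <= len(self.source_files) <= 2:
            errors.append("one or two source files are expected")
        return errors
```

The optional fields are deliberately not validated, since the form allows them to be left empty.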

Note: to show or hide the import jobs section, use the menu at the top of the store page:
CP_AWS_HealthOmicsStorages

Upload sequence

To upload a sequence:

  1. Open the sequence store.
  2. Click the Upload button:
    CP_AWS_HealthOmicsStorages
  3. Upload form will appear:
    CP_AWS_HealthOmicsStorages
  4. Select the type of a source file you want to upload, e.g.:
    CP_AWS_HealthOmicsStorages
  5. Select the sequence source file from your local workstation:
    • click the Upload source file button:
      CP_AWS_HealthOmicsStorages
    • the OS pop-up to select a file will appear
    • choose a sequence file and confirm, e.g.:
      CP_AWS_HealthOmicsStorages
    • selected file will be shown near the Upload source file button:
      CP_AWS_HealthOmicsStorages
  6. An additional button for a second source file will appear:
    CP_AWS_HealthOmicsStorages
    You can add the second file in the same way as described in the previous step, e.g.:
    CP_AWS_HealthOmicsStorages
  7. If necessary, you may link a sequence with a reference from the reference store:
    • click the folder icon near the Reference label:
      CP_AWS_HealthOmicsStorages
    • in the pop-up that appears, select a reference from the reference store, e.g.:
      CP_AWS_HealthOmicsStorages
    • click the OK button to confirm selection
  8. Specify mandatory fields:
    • Name - sequence name
    • Sample id - source's sample ID
    • Subject id - source's subject ID
  9. If necessary, specify optional fields:
    • Description - sequence description
    • Generated from - sequence details
  10. Once all fields are specified, click the Upload button:
    CP_AWS_HealthOmicsStorages
  11. The upload will take some time:
    CP_AWS_HealthOmicsStorages
  12. Then the pop-up will close automatically and the sequence will appear in the storage:
    CP_AWS_HealthOmicsStorages
  13. Note that processing of the upload may take additional time - during this period, the sequence state is shown as PROCESSING_UPLOAD and the Download button is not shown:
    CP_AWS_HealthOmicsStorages
  14. Once the upload is completed, the sequence state will change to ACTIVE and the Download button will appear next to it:
    CP_AWS_HealthOmicsStorages
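
The state transition described in the last two steps can be handled in a script by polling. This is a minimal Python sketch under stated assumptions: get_state is a hypothetical callable supplied by the caller that returns the current sequence state as a string, and treating any state other than PROCESSING_UPLOAD and ACTIVE as an error is a choice made for this example, since this section names only those two states.

```python
import time
from typing import Callable


def wait_until_active(
    get_state: Callable[[], str],
    poll_seconds: float = 5.0,
    max_polls: int = 120,
) -> bool:
    """Poll the sequence state until it becomes ACTIVE.

    Returns True once the state is ACTIVE and False if max_polls is exhausted
    while the state is still PROCESSING_UPLOAD.
    """
    for _ in range(max_polls):
        state = get_state()
        if state == "ACTIVE":
            return True  # the sequence is ready and can be downloaded
        if state != "PROCESSING_UPLOAD":
            # Assumption: any other state is unexpected in this sketch.
            raise RuntimeError(f"unexpected sequence state: {state}")
        time.sleep(poll_seconds)
    return False
```

The polling interval and the maximum number of polls are arbitrary defaults and should be adjusted to the expected file size.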

Download sequence

To download all files of a sequence to the local workstation at once:

  1. Click the Download button in a sequence row.
  2. The download will start automatically (note that the files are downloaded separately, not as an archive):
    CP_AWS_HealthOmicsStorages

To download a specific sequence file to the local workstation:

  1. Open the sequence folder.
  2. Click the Download button in a row of the sequence file.
  3. The download of the selected file will start automatically:
    CP_AWS_HealthOmicsStorages

Delete sequence

Note: a sequence can only be removed as a whole; individual sequence files cannot be removed.

To remove a sequence:

  1. Click the Delete button in a sequence row, e.g.:
    CP_AWS_HealthOmicsStorages
  2. Confirm the deletion in the pop-up that appears:
    CP_AWS_HealthOmicsStorages
  3. The sequence will be permanently removed.

Manage store

To edit a reference or sequence store:

  1. Click the gear icon in the upper right corner of the store page.
  2. The settings pop-up will appear, e.g.:
    CP_AWS_HealthOmicsStorages

Here, you can:

  • edit store alias and description - specify new value(s) and click the Save button
  • grant store permissions - for more details see the section 13. Permissions
  • delete a store - for more details see the section 8.5. Delete Data Storage

Manage via the CLI

You can also manage AWS HealthOmics Storages and their data via CLI.

Currently, the following CLI functionality is supported for AWS HealthOmics Storages:

  • store creation - using pipe storage create command:
    CP_AWS_HealthOmicsStorages

    Note: to specify the storage type during creation (with the -t | --type option), use one of the following values: AWS_OMICS_REF (for a reference store) or AWS_OMICS_SEQ (for a sequence store).

  • store listing - using pipe storage ls command:
    CP_AWS_HealthOmicsStorages

    Note: to specify the storage path, use the omics Cloud prefix.

  • moving store to another folder - using pipe storage mvtodir command:
    CP_AWS_HealthOmicsStorages
  • store deletion - using pipe storage delete command:
    CP_AWS_HealthOmicsStorages
  • genomic data uploading - using pipe storage cp <source> <destination> command, where:

    • <source> - path to genomic data file(s) from the existing data storage registered in the Cloud Pipeline platform or from your local workstation
    • <destination> - AWS HealthOmics Storage path
      CP_AWS_HealthOmicsStorages

      Note: when loading genomic data, you must use the -a | --additional-options option to specify the arguments of the data file:

      • name - genomic file name
      • subject_id - source's subject ID
      • sample_id - source's sample ID
      • file_type - type of the loading file
  • genomic data downloading - using pipe storage cp <source> <destination> command, where:

    • <source> - path to genomic data in AWS HealthOmics Storage (folder or file)
    • <destination> - path on your local workstation
      CP_AWS_HealthOmicsStorages
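
The commands above can also be assembled from a script. The Python sketch below only builds argument lists and does not run anything. The helper names are hypothetical. Two details are assumptions that should be checked against the pipe CLI help before use: that the omics Cloud prefix is written as omics://, and that the value of --additional-options is a comma-separated list of key=value pairs. Only the -t | --type values and the four argument names come from this section.

```python
from typing import Dict, List

# Values of the -t | --type option given in this section.
STORE_TYPES = {"reference": "AWS_OMICS_REF", "sequence": "AWS_OMICS_SEQ"}

# Arguments that must be passed via -a | --additional-options when loading data.
REQUIRED_UPLOAD_ARGS = ("name", "subject_id", "sample_id", "file_type")


def create_store_command(kind: str, extra_args: List[str]) -> List[str]:
    """Build a 'pipe storage create' command for a reference or sequence store.

    extra_args is passed through unchanged, so this sketch assumes nothing
    about the other options of the command.
    """
    return ["pipe", "storage", "create", "--type", STORE_TYPES[kind], *extra_args]


def upload_command(source: str, destination: str, args: Dict[str, str]) -> List[str]:
    """Build a 'pipe storage cp' command that loads genomic data into a store."""
    missing = [key for key in REQUIRED_UPLOAD_ARGS if not args.get(key)]
    if missing:
        raise ValueError(f"missing additional options: {', '.join(missing)}")
    # Assumption: the omics Cloud prefix has the form 'omics://'.
    if not destination.startswith("omics://"):
        raise ValueError("destination must be an AWS HealthOmics Storage path")
    # Assumption: the option value is a comma-separated list of key=value pairs.
    options = ",".join(f"{key}={args[key]}" for key in REQUIRED_UPLOAD_ARGS)
    return ["pipe", "storage", "cp", source, destination, "--additional-options", options]
```

The resulting list can be passed to subprocess.run once the assumptions above have been verified for the installed CLI version.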

Usage

AWS HealthOmics Storages cannot be mounted to running instances like regular storages.
However, these storages (and their data) can be used via pipeline parameters - for more details see the corresponding section.

You can use data from AWS HealthOmics Storages in two types of parameters:

  • Path parameter:
    • select the corresponding parameter type from the list:
      CP_AWS_HealthOmicsStorages
    • click the folder icon in the parameter field that appears:
      CP_AWS_HealthOmicsStorages
    • in the pop-up, select a reference or sequence store, e.g.:
      CP_AWS_HealthOmicsStorages
    • then, you may select specific genomic data or the whole AWS HealthOmics Store:
      CP_AWS_HealthOmicsStorages
    • an example of a whole store selected as the path parameter value:
      CP_AWS_HealthOmicsStorages
  • Input parameter:
    • select the corresponding parameter type from the list:
      CP_AWS_HealthOmicsStorages
    • click the input icon in the parameter field that appears:
      CP_AWS_HealthOmicsStorages
    • in the pop-up, select a reference or sequence store, e.g.:
      CP_AWS_HealthOmicsStorages
    • then, you may select specific genomic data, e.g.:
      CP_AWS_HealthOmicsStorages
    • an example of the selected reference as the input path parameter's value:
      CP_AWS_HealthOmicsStorages