Comparison of different FS storage types in the Cloud Pipeline environment
Performance comparison
The performance was measured for different AWS filesystems that are used in Cloud Pipeline:
- filesystems managed by AWS:
    - S3
    - EFS
    - FSx for Lustre
- local filesystems:
    - BTRFS on EBS
    - LizardFS on EBS
A performance comparison of Cloud Pipeline storages was conducted using both synthetic and real data.
All experiments were carried out on a c5.2xlarge (8 vCPUs, 16 GiB RAM) AWS instance.
Synthetic data experiment
In this experiment we generated two types of data:
- a single large 100 GB file
- 100,000 small files of 500 KB each
With this data, we measured the creation and read times using the Unix time command.
- The command that was used to create the 100 GB large file in 100 MB chunks:
dd if=/dev/urandom of=/path-to-storage/large_100gb.txt iflag=fullblock bs=100M count=1024
- The command that was used to create the 100,000 small files of 500 KB each:
for j in {1..100000}; do
head -c 500kB </dev/urandom > /path-to-storage/small/randfile$j.txt
done
- The command that was used to read a large file:
dd if=/path-to-storage/large_100gb.txt of=/dev/null conv=fdatasync
- The command that was used to read small files:
for file in /path-to-storage/small/* ; do
dd if=$file of=/dev/null
done
The synthetic data experiment was carried out first in a single thread, and then in 4 and 8 threads.
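The source does not show the exact parallelization harness; the following is a minimal sketch of how, for example, the 4-thread large-file write test could have been launched, assuming each "thread" is a separate background dd job writing its own file and the overall wall time is measured with time:
# Hypothetical 4-thread variant of the large-file write test;
# the THREADS variable and per-thread file names are illustrative
THREADS=4
time (
for t in $(seq 1 $THREADS); do
dd if=/dev/urandom of=/path-to-storage/large_100gb_${t}.txt iflag=fullblock bs=100M count=1024 &
done
wait
)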
The experimental results on synthetic data are presented in the following tables:
1 thread
Storage | Create the large file | Read the large file | Create many small files | Read many small files |
---|---|---|---|---|
EFS | real 17m24.812s user 0m0.004s sys 10m34.675s | real 17m18.216s user 0m48.904s sys 3m9.517s | real 53m3.498s user 0m58.209s sys 5m46.386s | real 27m11.831s user 2m14.295s sys 1m35.923s |
LUSTRE | real 10m55.494s user 0m0.004s sys 9m47.326s | real 13m28.536s user 0m51.675s sys 12m36.813s | real 12m37.380s user 1m54.744s sys 5m13.759s | real 20m40.877s user 0m44.095s sys 9m4.877s |
BTRFS on EBS | real 11m9.866s user 0m0.032s sys 11m7.813s | real 7m0.032s user 0m53.303s sys 3m47.549s | real 6m48.540s user 0m47.254s sys 6m6.036s | real 6m26.190s user 2m1.610s sys 1m25.726s |
LizardFS on EBS | real 13m4.352s user 0m0.008s sys 10m57.101s | real 7m54.142s user 0m53.089s sys 3m51.089s | real 16m39.980s user 2m11.618s sys 6m35.383s | real 10m3.791s user 2m18.924s sys 1m28.035s |
4 threads
Storage | Create the large file | Read the large file | Create many small files | Read many small files |
---|---|---|---|---|
EFS | real 69m25.583s user 0m0.015s sys 11m30.451s | real 64m20.614s user 0m48.233s sys 3m15.074s | real 59m18.137s user 0m37.185s sys 8m29.882s | real 33m19.459s user 2m23.134s sys 2m18.345s |
LUSTRE | real 38m32.383s user 0m0.014s sys 36m40.821s | real 20m45.156s user 0m59.189s sys 19m26.054s | real 25m38.531s user 0m21.820s sys 16m59.318s | real 24m58.620s user 2m11.449s sys 11m57.240s |
BTRFS on EBS | real 38m50.438s user 0m0.028s sys 38m45.451s | real 27m55.173s user 0m52.903s sys 4m26.061s | real 20m54.831s user 0m20.394s sys 20m34.926s | real 12m50.153s user 2m5.149s sys 1m18.555s |
LizardFS on EBS | real 48m47.367s user 0m0.020s sys 40m17.341s | real 32m21.257s user 0m57.588s sys 5m32.215s | real 28m12.707s user 1m30.504s sys 12m40.881s | real 15m31.591s user 2m21.772s sys 2m31.211s |
8 threads
Storage | Create the large file | Read the large file | Create many small files | Read many small files |
---|---|---|---|---|
EFS | real 127m49.718s user 0m0.010s sys 14m44.358s | real 122m46.188s user 1m12.786s sys 21m14.236s | real 72m43.596s user 0m26.727s sys 15m0.582s | real 62m46.118s user 2m31.595s sys 2m31.577s |
LUSTRE | real 93m56.846s user 0m0.018s sys 90m30.5s | real 94m1.258s user 0m48.908s sys 3m18.557s | real 50m42.845s user 0m23.462s sys 35m12.511s | real 30m53.199s user 0m54.020s sys 14m42.712s |
BTRFS on EBS | real 87m48.066s user 0m0.016s sys 86m15.610s | real 39m59.167s user 0m50.900s sys 4m25.582s | real 44m36.847s user 0m24.491s sys 43m11.719s | real 17m12.744s user 2m14.324s sys 1m50.667s |
LizardFS on EBS | real 97m25.045s user 0m0.007s sys 73m32.042s | real 39m59.167s user 0m50.900s sys 4m25.582s | real 50m25.868s user 0m58.079s sys 19m12.443s | real 27m50.506s user 2m26.704s sys 3m9.926s |
As we can see from the presented results, Amazon FSx for Lustre is noticeably faster than Amazon EFS: from 1.3 to 3.2 times faster for reads and from 1.3 to 4 times faster for writes.
As expected, the local filesystems generally performed better than FSx for Lustre or EFS. It should be noted, however, that FSx for Lustre was comparable to the local filesystems in some cases.
Real data experiment
The cellranger count pipeline was used to conduct an experiment with real data.
The input data:
- 15 GB transcriptome reference
- 50 GB of FASTQ files
The command that was used to run the cellranger count pipeline:
/path-to-cellranger/3.0.2/bin/cellranger count --localcores=8 --id={id} --transcriptome=/path-to-transcriptome-reference/refdata-cellranger-mm10-3.0.0 --chemistry=SC5P-R2 --fastqs=/path-to-fastqs/fastqs_test --sample={sample_name}
The experimental results are presented in the following table:
Storage | Execution time |
---|---|
EFS | real 207m15.566s user 413m32.168s sys 13m40.581s |
LUSTRE | real 189m23.586s user 434m6.950s sys 13m2.902s |
BTRFS on EBS | real 187m23.048s user 413m32.666s sys 12m30.285s |
LizardFS on EBS | real 189m8.210s user 412m6.558s sys 14m18.429s |
The best result was shown by the BTRFS on EBS local filesystem.
However, BTRFS on EBS, LizardFS on EBS, and FSx for Lustre showed comparable times.
Amazon FSx for Lustre was only ~9% faster than Amazon EFS.
Costs
Cost calculations have been performed according to Amazon pricing at the time of writing this document.
The storages used for the experiments were created in the US East (N. Virginia) region with similar features:
Storage | Storage size | Throughput mode |
---|---|---|
EFS | Size in EFS Standard: 1 TiB (100%) | Bursting: 50 MB/s/TiB |
LUSTRE | SSD: 1.2 TiB Capacity | 50 MB/s/TiB baseline, up to 1.3 GB/s/TiB burst |
BTRFS on EBS | SSD: 1.2 TiB Capacity | Max Throughput/Instance - 4,750 MB/s |
LizardFS on EBS | SSD: 1.2 TiB Capacity | Max Throughput/Instance - 4,750 MB/s |
The total charge for a month of usage is calculated differently for each storage type:
- EFS
    - 1 TB per month x 1024 GB in a TB = 1,024 GB per month (data stored in Standard Storage)
    - 1,024 GB per month x $0.30 per GB-month = $307.20 (Standard Storage monthly cost)
- FSx for Lustre
    - $0.14 per GB-month / 30 / 24 = $0.000194 per GB-hour
    - 1,228 GB x $0.000194 per GB-hour x 720 hours = $171.50 (FSx for Lustre monthly cost)
- BTRFS on EBS and LizardFS on EBS
    - (1,228 GB x $0.10 per GB-month x 86,400 seconds (for 24 hours of usage)) / (86,400 seconds per day x 30 days per month) = ~$4 (for the volume)
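For reference, a minimal sketch that reproduces the estimates above (the per-GB prices are hard-coded from the figures in this section and are illustrative; actual Amazon pricing may differ):
# Hypothetical cost recalculation using awk; prices taken from the figures above
awk 'BEGIN {
printf "EFS monthly cost:            $%.2f\n", 1024 * 0.30;           # 1,024 GB x $0.30 per GB-month
printf "FSx for Lustre monthly cost: $%.2f\n", 1228 * 0.000194 * 720; # 1,228 GB x $0.000194 per GB-hour x 720 hours
printf "EBS volume cost per day:     $%.2f\n", 1228 * 0.10 / 30;      # 1,228 GB x $0.10 per GB-month / 30 days
}'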
As seen from the calculation, the BTRFS on EBS and LizardFS on EBS local filesystems are the most cost-effective option.
However, they are not suitable for long-term storage. For long-term storage, the EFS monthly cost is about 1.8 times higher than the Amazon FSx for Lustre monthly cost for storage with similar features, at the time of writing this document.
PIPE Fuse filesystem benchmark
This benchmark provides information on using the PIPE Fuse filesystem over S3 compared to other approaches.
Methodology
Here we are using the BWA aligner to mimic a real-world use case: aligning genomic FASTQ files into a SAM file.
Three data storage/transfer approaches are used:
* AWS EFS shared filesystem (a mount sketch is shown after this list). This option can be questionable in terms of performance, as AWS EFS may behave very differently depending on the setup parameters and the workload generated by other users of the filesystem. Here we are giving a general sense of AWS EFS usage for a filesystem that is not accessed by any other process.
* Localize/Delocalize approach with an AWS EBS volume. This is the default approach when running well-established pipelines within the Cloud Pipeline environment with the input/output data located in AWS S3. It makes any tool used in a pipeline compatible with AWS S3. The approach consists of three steps:
* Download data from S3 to a local filesystem (e.g. AWS EBS volume)
* Run data processing using local filesystem paths
* Upload results back to the AWS S3
* AWS S3 bucket mounted as a directory using PIPE Fuse. The AWS S3 bucket is mounted as a directory on a compute instance using the pipe storage mount command. All data is read and written using these mounted file paths. No local disk caches are used; communication happens directly between the compute instance and AWS S3.
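For completeness, a minimal sketch of how the EFS share referenced in the scripts below could be mounted; the filesystem ID, region, and mount options here are illustrative placeholders (the source does not specify them) and follow the standard NFSv4.1 options recommended by AWS:
# Illustrative EFS mount; fs-0123456789abcdef0 and us-east-1 are placeholders
sudo mkdir -p /EFS
sudo mount -t nfs4 -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport fs-0123456789abcdef0.efs.us-east-1.amazonaws.com:/ /EFS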
Compute environment setup:
* m6i.2xlarge EC2 instance
* gp3 SSD AWS EBS volume (no provisioning)
Data used for testing BWA: we've used a small plant genome and SRA samples to perform this benchmark:
- Reference genome: Arabidopsis_thaliana.TAIR10.dna.toplevel.fa, 116Mb
- Sample: SRR5304927 (R1: 72Mb, R2: 76Mb)
Scripts and raw results
The following commands were executed:
- EFS
# EFS is mounted into /EFS
# Input data is located in /EFS/Benchmark/INPUT
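# BWA_BIN is assumed to point to the bwa executable, e.g. BWA_BIN=/path-to-bwa/bwa (not set explicitly in the source)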
SAMPLE_NAME=SRR5304927
REFERENCE=/EFS/Benchmark/INPUT/Genomes/Arabidopsis_thaliana/Arabidopsis_thaliana.TAIR10.dna.toplevel.fa
FASTQ_R1=/EFS/Benchmark/INPUT/Samples/SRR5304927/Input/${SAMPLE_NAME}_1.fastq.gz
FASTQ_R2=/EFS/Benchmark/INPUT/Samples/SRR5304927/Input/${SAMPLE_NAME}_2.fastq.gz
BWA_OUTPUT=/EFS/Benchmark/OUTPUT/Samples/$SAMPLE_NAME/Output_Bench/bwa
$BWA_BIN index -a bwtsw $REFERENCE
# [main] Real time: 144.355 sec; CPU: 141.110 sec
mkdir -p $BWA_OUTPUT
$BWA_BIN aln -t $(nproc) -f $BWA_OUTPUT/${SAMPLE_NAME}_R1.sai $REFERENCE $FASTQ_R1
# [main] Real time: 82.218 sec; CPU: 592.851 sec
$BWA_BIN aln -t $(nproc) -f $BWA_OUTPUT/${SAMPLE_NAME}_R2.sai $REFERENCE $FASTQ_R2
# [main] Real time: 81.995 sec; CPU: 588.473 sec
$BWA_BIN sampe $REFERENCE $BWA_OUTPUT/${SAMPLE_NAME}_R1.sai $BWA_OUTPUT/${SAMPLE_NAME}_R2.sai $FASTQ_R1 $FASTQ_R2 > $BWA_OUTPUT/$SAMPLE_NAME.sam
# [main] Real time: 79.564 sec; CPU: 75.702 sec
- Localize/Delocalize approach
time pipe storage cp s3://data-bucket/Genomes/Arabidopsis_thaliana/Arabidopsis_thaliana.TAIR10.dna.toplevel.fa /INPUT/Genomes/Arabidopsis_thaliana/Arabidopsis_thaliana.TAIR10.dna.toplevel.fa
# real 0m4.202s
time pipe storage cp s3://data-bucket/Samples/SRR5304927/Input/SRR5304927_1.fastq.gz /INPUT/Samples/SRR5304927/Input/SRR5304927_1.fastq.gz
# real 0m4.032s
time pipe storage cp s3://data-bucket/Samples/SRR5304927/Input/SRR5304927_2.fastq.gz /INPUT/Samples/SRR5304927/Input/SRR5304927_2.fastq.gz
# real 0m4.076s
SAMPLE_NAME=SRR5304927
REFERENCE=/INPUT/Genomes/Arabidopsis_thaliana/Arabidopsis_thaliana.TAIR10.dna.toplevel.fa
FASTQ_R1=/INPUT/Samples/SRR5304927/Input/${SAMPLE_NAME}_1.fastq.gz
FASTQ_R2=/INPUT/Samples/SRR5304927/Input/${SAMPLE_NAME}_2.fastq.gz
BWA_OUTPUT=/OUTPUT/Samples/$SAMPLE_NAME/Output_Bench/bwa
$BWA_BIN index -a bwtsw $REFERENCE
# [main] Real time: 105.612 sec; CPU: 104.980 sec
mkdir -p $BWA_OUTPUT
$BWA_BIN aln -t $(nproc) -f $BWA_OUTPUT/${SAMPLE_NAME}_R1.sai $REFERENCE $FASTQ_R1
# [main] Real time: 57.351 sec; CPU: 412.471 sec
$BWA_BIN aln -t $(nproc) -f $BWA_OUTPUT/${SAMPLE_NAME}_R2.sai $REFERENCE $FASTQ_R2
# [main] Real time: 56.449 sec; CPU: 405.536 sec
$BWA_BIN sampe $REFERENCE $BWA_OUTPUT/${SAMPLE_NAME}_R1.sai $BWA_OUTPUT/${SAMPLE_NAME}_R2.sai $FASTQ_R1 $FASTQ_R2 > $BWA_OUTPUT/$SAMPLE_NAME.sam
# [main] Real time: 53.929 sec; CPU: 52.661 sec
time pipe storage cp $BWA_OUTPUT/$SAMPLE_NAME.sam s3://data-bucket/Samples/$SAMPLE_NAME/Output_Bench_localize/bwa/
# real 0m7.970s
- PIPE Fuse
# AWS S3 bucket is mounted with the default parameters; additional tuning may provide better results, e.g. performance may be increased by 10-15%
# But to keep this benchmark applicable to a general use case, the default parameters are kept
pipe storage mount --threads -b data-bucket /cloud-data/data-bucket
SAMPLE_NAME=SRR5304927
REFERENCE=/cloud-data/data-bucket/Genomes/Arabidopsis_thaliana/Arabidopsis_thaliana.TAIR10.dna.toplevel.fa
FASTQ_R1=/cloud-data/data-bucket/Samples/SRR5304927/Input/${SAMPLE_NAME}_1.fastq.gz
FASTQ_R2=/cloud-data/data-bucket/Samples/SRR5304927/Input/${SAMPLE_NAME}_2.fastq.gz
BWA_OUTPUT=/cloud-data/data-bucket/Samples/$SAMPLE_NAME/Output_Bench/bwa
$BWA_BIN index -a bwtsw $REFERENCE
# [main] Real time: 126.284 sec; CPU: 106.519 sec
mkdir -p $BWA_OUTPUT
$BWA_BIN aln -t $(nproc) -f $BWA_OUTPUT/${SAMPLE_NAME}_R1.sai $REFERENCE $FASTQ_R1
# [main] Real time: 63.851 sec; CPU: 411.206 sec
$BWA_BIN aln -t $(nproc) -f $BWA_OUTPUT/${SAMPLE_NAME}_R2.sai $REFERENCE $FASTQ_R2
# [main] Real time: 63.224 sec; CPU: 404.646 sec
$BWA_BIN sampe $REFERENCE $BWA_OUTPUT/${SAMPLE_NAME}_R1.sai $BWA_OUTPUT/${SAMPLE_NAME}_R2.sai $FASTQ_R1 $FASTQ_R2 > $BWA_OUTPUT/$SAMPLE_NAME.sam
# [main] Real time: 79.832 sec; CPU: 57.641 sec
Final results
Storage approach | Total wall time, s | Difference, times slower |
---|---|---|
EFS | 387 | 1.33 |
Localize/Delocalize approach | 290 | 1 |
PIPE Fuse | 331 | 1.14 |