Comparison of different FS storage types in the Cloud Pipeline environment
Performance comparison
The performance was measured for different AWS filesystems that are used in Cloud Pipeline:
- filesystems managed by AWS:
    - S3
    - EFS
    - FSx for Lustre
- local filesystems:
    - BTRFS on EBS
    - LizardFS on EBS
A performance comparison of Cloud Pipeline storages was conducted using both synthetic and real data.
All experiments were carried out on a c5.2xlarge (8 vCPUs, 16 GiB RAM) AWS instance.
Synthetic data experiment
In this experiment we generated two types of data:
- a single large 100 GB file
- 100,000 small files of 500 KB each
With this data, we measured the creation and read times using the Unix time command.
- The command that was used to create the 100 GB large file in 100 MB chunks:
dd if=/dev/urandom of=/path-to-storage/large_100gb.txt iflag=fullblock bs=100M count=1024
- The command that was used to create the 100,000 small files of 500 KB each:
for j in {1..100000}; do
head -c 500kB </dev/urandom > /path-to-storage/small/randfile$j.txt
done
- The command that was used to read a large file:
dd if=/path-to-storage/large_100gb.txt of=/dev/null conv=fdatasync
- The command that was used to read small files:
for file in /path-to-storage/small/* ; do
dd if=$file of=/dev/null
done
The synthetic data experiment was carried out first in a single thread, and then in 4 and 8 threads.
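The source does not show the exact parallelization harness; the following is a minimal sketch of how, for example, the 4-thread large-file write test could have been launched, assuming each "thread" is a separate background dd job writing its own file and the overall wall time is measured with time:
# Hypothetical 4-thread variant of the large-file write test;
# the THREADS variable and per-thread file names are illustrative
THREADS=4
time (
for t in $(seq 1 $THREADS); do
dd if=/dev/urandom of=/path-to-storage/large_100gb_${t}.txt iflag=fullblock bs=100M count=1024 &
done
wait
)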
The experimental results on synthetic data are presented in the following tables:
1 thread
Storage | Create the large file | Read the large file | Create many small files | Read many small files |
---|---|---|---|---|
EFS | real 17m24.812s user 0m0.004s sys 10m34.675s | real 17m18.216s user 0m48.904s sys 3m9.517s | real 53m3.498s user 0m58.209s sys 5m46.386s | real 27m11.831s user 2m14.295s sys 1m35.923s |
LUSTRE | real 10m55.494s user 0m0.004s sys 9m47.326s | real 13m28.536s user 0m51.675s sys 12m36.813s | real 12m37.380s user 1m54.744s sys 5m13.759s | real 20m40.877s user 0m44.095s sys 9m4.877s |
BTRFS on EBS | real 11m9.866s user 0m0.032s sys 11m7.813s | real 7m0.032s user 0m53.303s sys 3m47.549s | real 6m48.540s user 0m47.254s sys 6m6.036s | real 6m26.190s user 2m1.610s sys 1m25.726s |
LizardFS on EBS | real 13m4.352s user 0m0.008s sys 10m57.101s | real 7m54.142s user 0m53.089s sys 3m51.089s | real 16m39.980s user 2m11.618s sys 6m35.383s | real 10m3.791s user 2m18.924s sys 1m28.035s |
4 threads
Storage | Create the large file | Read the large file | Create many small files | Read many small files |
---|---|---|---|---|
EFS | real 69m25.583s user 0m0.015s sys 11m30.451s | real 64m20.614s user 0m48.233s sys 3m15.074s | real 59m18.137s user 0m37.185s sys 8m29.882s | real 33m19.459s user 2m23.134s sys 2m18.345s |
LUSTRE | real 38m32.383s user 0m0.014s sys 36m40.821s | real 20m45.156s user 0m59.189s sys 19m26.054s | real 25m38.531s user 0m21.820s sys 16m59.318s | real 24m58.620s user 2m11.449s sys 11m57.240s |
BTRFS on EBS | real 38m50.438s user 0m0.028s sys 38m45.451s | real 27m55.173s user 0m52.903s sys 4m26.061s | real 20m54.831s user 0m20.394s sys 20m34.926s | real 12m50.153s user 2m5.149s sys 1m18.555s |
LizardFS on EBS | real 48m47.367s user 0m0.020s sys 40m17.341s | real 32m21.257s user 0m57.588s sys 5m32.215s | real 28m12.707s user 1m30.504s sys 12m40.881s | real 15m31.591s user 2m21.772s sys 2m31.211s |
8 threads
Storage | Create the large file | Read the large file | Create many small files | Read many small files |
---|---|---|---|---|
EFS | real 127m49.718s user 0m0.010s sys 14m44.358s | real 122m46.188s user 1m12.786s sys 21m14.236s | real 72m43.596s user 0m26.727s sys 15m0.582s | real 62m46.118s user 2m31.595s sys 2m31.577s |
LUSTRE | real 93m56.846s user 0m0.018s sys 90m30.5s | real 94m1.258s user 0m48.908s sys 3m18.557s | real 50m42.845s user 0m23.462s sys 35m12.511s | real 30m53.199s user 0m54.020s sys 14m42.712s |
BTRFS on EBS | real 87m48.066s user 0m0.016s sys 86m15.610s | real 39m59.167s user 0m50.900s sys 4m25.582s | real 44m36.847s user 0m24.491s sys 43m11.719s | real 17m12.744s user 2m14.324s sys 1m50.667s |
LizardFS on EBS | real 97m25.045s user 0m0.007s sys 73m32.042s | real 39m59.167s user 0m50.900s sys 4m25.582s | real 50m25.868s user 0m58.079s sys 19m12.443s | real 27m50.506s user 2m26.704s sys 3m9.926s |
As we can see from the presented results, Amazon FSx for Lustre is noticeably faster than Amazon EFS: from 1.3 to 3.2 times faster for reads and from 1.3 to 4 times faster for writes.
As expected, the local filesystems generally performed better than FSx for Lustre or EFS. It should be noted, however, that FSx for Lustre was comparable to the local filesystems in some cases.
Real data experiment
The cellranger count pipeline was used to conduct an experiment with real data.
The input data:
- 15 GB transcriptome reference
- 50 GB of FASTQ files
The command that was used to run the cellranger count pipeline:
/path-to-cellranger/3.0.2/bin/cellranger count --localcores=8 --id={id} --transcriptome=/path-to-transcriptome-reference/refdata-cellranger-mm10-3.0.0 --chemistry=SC5P-R2 --fastqs=/path-to-fastqs/fastqs_test --sample={sample_name}
The experimental results are presented in the following table:
Storage | Execution time |
---|---|
EFS | real 207m15.566s user 413m32.168s sys 13m40.581s |
LUSTRE | real 189m23.586s user 434m6.950s sys 13m2.902s |
BTRFS on EBS | real 187m23.048s user 413m32.666s sys 12m30.285s |
LizardFS on EBS | real 189m8.210s user 412m6.558s sys 14m18.429s |
The best result was shown by the BTRFS on EBS local filesystem.
However, BTRFS on EBS, LizardFS on EBS, and FSx for Lustre showed comparable times.
Amazon FSx for Lustre was only ~9% faster than Amazon EFS.
Costs
Cost calculations have been performed according to Amazon pricing at the time of writing this document.
The storages used for the experiments were created in the US East (N. Virginia) region with similar features:
Storage | Storage size | Throughput mode |
---|---|---|
EFS | Size in EFS Standard: 1 TiB (100%) | Bursting: 50 MB/s/TiB |
LUSTRE | SSD: 1.2 TiB Capacity | 50 MB/s/TiB baseline, up to 1.3 GB/s/TiB burst |
BTRFS on EBS | SSD: 1.2 TiB Capacity | Max Throughput/Instance - 4,750 MB/s |
LizardFS on EBS | SSD: 1.2 TiB Capacity | Max Throughput/Instance - 4,750 MB/s |
The total charge for a month of usage is calculated differently for each storage type:
- EFS
    - 1 TB per month x 1024 GB in a TB = 1,024 GB per month (data stored in Standard Storage)
    - 1,024 GB per month x $0.30 per GB-month = $307.20 (Standard Storage monthly cost)
- FSx for Lustre
    - $0.14 per GB-month / 30 / 24 = $0.000194 per GB-hour
    - 1,228 GB x $0.000194 per GB-hour x 720 hours = $171.50 (FSx for Lustre monthly cost)
- BTRFS on EBS and LizardFS on EBS
    - (1,228 GB x $0.10 per GB-month x 86,400 seconds (for 24 hours of usage)) / (86,400 seconds per day x 30 days per month) = ~$4 (for the volume)
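For reference, a minimal sketch that reproduces the estimates above (the per-GB prices are hard-coded from the figures in this section and are illustrative; actual Amazon pricing may differ):
# Hypothetical cost recalculation using awk; prices taken from the figures above
awk 'BEGIN {
printf "EFS monthly cost:            $%.2f\n", 1024 * 0.30;           # 1,024 GB x $0.30 per GB-month
printf "FSx for Lustre monthly cost: $%.2f\n", 1228 * 0.000194 * 720; # 1,228 GB x $0.000194 per GB-hour x 720 hours
printf "EBS volume cost per day:     $%.2f\n", 1228 * 0.10 / 30;      # 1,228 GB x $0.10 per GB-month / 30 days
}'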
As seen from the calculation, the BTRFS on EBS and LizardFS on EBS local filesystems are the most cost-effective option.
However, they are not suitable for long-term storage. For long-term storage, the EFS monthly cost is about 1.8 times higher than the Amazon FSx for Lustre monthly cost for storage with similar features, at the time of writing this document.
PIPE Fuse filesystem benchmark
This benchmark provides information on using the PIPE Fuse filesystem over S3 compared to other approaches.
Methodology
Here we are using the BWA aligner to mimic a real-world use case: aligning genomic FASTQ files into a SAM file.
Three data storage/transfer approaches are used:
* AWS EFS shared filesystem (a mount sketch is shown after this list). This option can be questionable in terms of performance, as AWS EFS may behave very differently depending on the setup parameters and the workload generated by other users of the filesystem. Here we are giving a general sense of AWS EFS usage for a filesystem that is not accessed by any other process.
* Localize/Delocalize approach with an AWS EBS volume. This is the default approach when running well-established pipelines within the Cloud Pipeline environment with the input/output data located in AWS S3. It makes any tool used in a pipeline compatible with AWS S3. The approach consists of three steps:
* Download data from S3 to a local filesystem (e.g. AWS EBS volume)
* Run data processing using local filesystem paths
* Upload results back to the AWS S3
* AWS S3 bucket mounted as a directory using PIPE Fuse. The AWS S3 bucket is mounted as a directory on a compute instance using the pipe storage mount command. All data is read and written using these mounted file paths. No local disk caches are used; communication happens directly between the compute instance and AWS S3.
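For completeness, a minimal sketch of how the EFS share referenced in the scripts below could be mounted; the filesystem ID, region, and mount options here are illustrative placeholders (the source does not specify them) and follow the standard NFSv4.1 options recommended by AWS:
# Illustrative EFS mount; fs-0123456789abcdef0 and us-east-1 are placeholders
sudo mkdir -p /EFS
sudo mount -t nfs4 -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport fs-0123456789abcdef0.efs.us-east-1.amazonaws.com:/ /EFS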
Compute environment setup:
* m6i.2xlarge EC2 instance
* gp3 SSD AWS EBS volume (no provisioning)
Data used for testing BWA: we've used a small plant genome and SRA samples to perform this benchmark:
- Reference genome: Arabidopsis_thaliana.TAIR10.dna.toplevel.fa, 116Mb
- Sample: SRR5304927 (R1: 72Mb, R2: 76Mb)
Scripts and raw results
The following commands were executed:
- EFS
# EFS is mounted into /EFS
# Input data is located in /EFS/Benchmark/INPUT
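# BWA_BIN is assumed to point to the bwa executable, e.g. BWA_BIN=/path-to-bwa/bwa (not set explicitly in the source)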
SAMPLE_NAME=SRR5304927
REFERENCE=/EFS/Benchmark/INPUT/Genomes/Arabidopsis_thaliana/Arabidopsis_thaliana.TAIR10.dna.toplevel.fa
FASTQ_R1=/EFS/Benchmark/INPUT/Samples/SRR5304927/Input/${SAMPLE_NAME}_1.fastq.gz
FASTQ_R2=/EFS/Benchmark/INPUT/Samples/SRR5304927/Input/${SAMPLE_NAME}_2.fastq.gz
BWA_OUTPUT=/EFS/Benchmark/OUTPUT/Samples/$SAMPLE_NAME/Output_Bench/bwa
$BWA_BIN index -a bwtsw $REFERENCE
# [main] Real time: 144.355 sec; CPU: 141.110 sec
mkdir -p $BWA_OUTPUT
$BWA_BIN aln -t $(nproc) -f $BWA_OUTPUT/${SAMPLE_NAME}_R1.sai $REFERENCE $FASTQ_R1
# [main] Real time: 82.218 sec; CPU: 592.851 sec
$BWA_BIN aln -t $(nproc) -f $BWA_OUTPUT/${SAMPLE_NAME}_R2.sai $REFERENCE $FASTQ_R2
# [main] Real time: 81.995 sec; CPU: 588.473 sec
$BWA_BIN sampe $REFERENCE $BWA_OUTPUT/${SAMPLE_NAME}_R1.sai $BWA_OUTPUT/${SAMPLE_NAME}_R2.sai $FASTQ_R1 $FASTQ_R2 > $BWA_OUTPUT/$SAMPLE_NAME.sam
# [main] Real time: 79.564 sec; CPU: 75.702 sec
- Localize/Delocalize approach
time pipe storage cp s3://data-bucket/Genomes/Arabidopsis_thaliana/Arabidopsis_thaliana.TAIR10.dna.toplevel.fa /INPUT/Genomes/Arabidopsis_thaliana/Arabidopsis_thaliana.TAIR10.dna.toplevel.fa
# real 0m4.202s
time pipe storage cp s3://data-bucket/Samples/SRR5304927/Input/SRR5304927_1.fastq.gz /INPUT/Samples/SRR5304927/Input/SRR5304927_1.fastq.gz
# real 0m4.032s
time pipe storage cp s3://data-bucket/Samples/SRR5304927/Input/SRR5304927_2.fastq.gz /INPUT/Samples/SRR5304927/Input/SRR5304927_2.fastq.gz
# real 0m4.076s
SAMPLE_NAME=SRR5304927
REFERENCE=/INPUT/Genomes/Arabidopsis_thaliana/Arabidopsis_thaliana.TAIR10.dna.toplevel.fa
FASTQ_R1=/INPUT/Samples/SRR5304927/Input/${SAMPLE_NAME}_1.fastq.gz
FASTQ_R2=/INPUT/Samples/SRR5304927/Input/${SAMPLE_NAME}_2.fastq.gz
BWA_OUTPUT=/OUTPUT/Samples/$SAMPLE_NAME/Output_Bench/bwa
$BWA_BIN index -a bwtsw $REFERENCE
# [main] Real time: 105.612 sec; CPU: 104.980 sec
mkdir -p $BWA_OUTPUT
$BWA_BIN aln -t $(nproc) -f $BWA_OUTPUT/${SAMPLE_NAME}_R1.sai $REFERENCE $FASTQ_R1
# [main] Real time: 57.351 sec; CPU: 412.471 sec
$BWA_BIN aln -t $(nproc) -f $BWA_OUTPUT/${SAMPLE_NAME}_R2.sai $REFERENCE $FASTQ_R2
# [main] Real time: 56.449 sec; CPU: 405.536 sec
$BWA_BIN sampe $REFERENCE $BWA_OUTPUT/${SAMPLE_NAME}_R1.sai $BWA_OUTPUT/${SAMPLE_NAME}_R2.sai $FASTQ_R1 $FASTQ_R2 > $BWA_OUTPUT/$SAMPLE_NAME.sam
# [main] Real time: 53.929 sec; CPU: 52.661 sec
time pipe storage cp $BWA_OUTPUT/$SAMPLE_NAME.sam s3://data-bucket/Samples/$SAMPLE_NAME/Output_Bench_localize/bwa/
# real 0m7.970s
- PIPE Fuse
# AWS S3 bucket is mounted with the default parameters; additional tuning may provide better results, e.g. performance may be increased by 10-15%
# But to keep this benchmark applicable to a general use case, the default parameters are kept
pipe storage mount --threads -b data-bucket /cloud-data/data-bucket
SAMPLE_NAME=SRR5304927
REFERENCE=/cloud-data/data-bucket/Genomes/Arabidopsis_thaliana/Arabidopsis_thaliana.TAIR10.dna.toplevel.fa
FASTQ_R1=/cloud-data/data-bucket/Samples/SRR5304927/Input/${SAMPLE_NAME}_1.fastq.gz
FASTQ_R2=/cloud-data/data-bucket/Samples/SRR5304927/Input/${SAMPLE_NAME}_2.fastq.gz
BWA_OUTPUT=/cloud-data/data-bucket/Samples/$SAMPLE_NAME/Output_Bench/bwa
$BWA_BIN index -a bwtsw $REFERENCE
# [main] Real time: 126.284 sec; CPU: 106.519 sec
mkdir -p $BWA_OUTPUT
$BWA_BIN aln -t $(nproc) -f $BWA_OUTPUT/${SAMPLE_NAME}_R1.sai $REFERENCE $FASTQ_R1
# [main] Real time: 63.851 sec; CPU: 411.206 sec
$BWA_BIN aln -t $(nproc) -f $BWA_OUTPUT/${SAMPLE_NAME}_R2.sai $REFERENCE $FASTQ_R2
# [main] Real time: 63.224 sec; CPU: 404.646 sec
$BWA_BIN sampe $REFERENCE $BWA_OUTPUT/${SAMPLE_NAME}_R1.sai $BWA_OUTPUT/${SAMPLE_NAME}_R2.sai $FASTQ_R1 $FASTQ_R2 > $BWA_OUTPUT/$SAMPLE_NAME.sam
# [main] Real time: 79.832 sec; CPU: 57.641 sec
Final results
Storage approach | Total wall time, s | Difference, times slower |
---|---|---|
EFS | 387 | 1.33 |
Localize/Delocalize approach | 290 | 1 |
PIPE Fuse | 331 | 1.14 |