Costs management rationale and concepts

Any job run in Cloud Pipeline and any file placed into Cloud Storage costs money. These costs are charged by the underlying cloud provider: AWS/GCP/Azure/etc.

At scale, when hundreds of platform users submit jobs and use the storage capacity, the cloud bill may grow quite high.

Moreover, some users may not be aware of the underlying billing and keep instances up and running for months without performing any productive work. To address these issues, Cloud Pipeline provides a number of features that help reduce and control the costs.

These features fall into two groups:

  • Features that minimize the costs, e.g. by limiting the size of the compute nodes or automatically stopping IDLE jobs
  • Reporting features that make users aware of the workload costs, e.g. by notifying users about IDLE jobs or providing an interactive spending dashboard

This document lists the concepts and the corresponding features (both Ready and WIP).

Restricting the size of the compute cluster

From the overall Cloud usage experience, the largest bills are generated by the compute resources, not the storages. The key driver of the compute cost is the size/shape of the compute nodes used.

Compute nodes size

Each Cloud Provider offers many compute instance families and types, and most users won't need that variety. Users may even choose huge GPU-enabled nodes, which are the most expensive, by mistake.

Cloud Pipeline makes it possible to restrict which sizes are available to the users of a specific platform deployment, thus reducing the chance of spending money on the unused resources of a huge node.

These restrictions can be applied to the overall platform and then fine-tuned for a specific user group, a specific user or a docker image.

Number of compute nodes

While the size of each single compute node can be restricted, it is still possible to spin up an uncontrolled number of smaller instances, which still costs a lot.

Cloud Pipeline offers a way to restrict the number of simultaneously running Cloud instances in the following manner:

  • The administrator may configure the overall number of compute nodes that may be created for a particular platform deployment. This restriction is in effect across all users/groups/tools. E.g. if the administrator sets this parameter to 50 nodes and 50 nodes are already running jobs, the 51st node won't be created. The corresponding job will sit in the queue until previous runs finish and free the space for a new instance.
  • Users may also spin up on-demand clusters, i.e. a single job run which requires more than a single compute node (e.g. a molecular dynamics job which needs 200 CPU cores interconnected with MPI). In this case the administrator may limit the size of such on-demand clusters. E.g. limiting this parameter to 5 will still allow a user to launch several jobs in cluster mode, but each cluster will be limited to a maximum of 5 hosts.
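
The two limits above can be illustrated with a minimal sketch. All names here (`NodeLimits`, `admit`, the returned strings) are hypothetical and are not part of the Cloud Pipeline API. The sketch also assumes that a cluster request larger than the allowed cluster size is rejected outright; the platform may handle this case differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeLimits:
    # Hypothetical names for the two administrator-defined parameters.
    max_platform_nodes: int  # overall cap for the platform deployment
    max_cluster_size: int    # cap on hosts in a single on-demand cluster


def admit(requested_nodes: int, running_nodes: int, limits: NodeLimits) -> str:
    """Decide what happens to a job that asks for `requested_nodes` hosts."""
    if requested_nodes > limits.max_cluster_size:
        # Assumption: oversized clusters are refused rather than shrunk.
        return "REJECT"
    if running_nodes + requested_nodes > limits.max_platform_nodes:
        # The job waits in the queue until running jobs free some nodes.
        return "QUEUE"
    return "START"
```

For example, with `NodeLimits(max_platform_nodes=50, max_cluster_size=5)` a single-node job submitted while 50 nodes are busy gets `"QUEUE"`, which matches the "51st node" example above.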

Spot/Preemptible compute instances

One of the very first cost reduction options to consider is the usage of Spot/Preemptible compute nodes.

The behavior and savings differ somewhat across the underlying Cloud providers, but in general all of them follow the same strategy: they allow compute resources to be used for a limited time at a greatly reduced cost. It's quite hard to predict the savings in general, but typically the price is about half that of the corresponding on-demand instance type.

For more details on the Spot/Preemptible compute instances, please review the corresponding provider's documentation.

While being quite cost-effective, this type of compute instance is not reliable for long-running or stateful jobs.

Cloud Pipeline encourages users to leverage the Spots/Preemptibles for:

  • The Batch jobs, e.g. NGS pipelines which can be easily restarted
  • Testing/Proofing tasks, when some script needs to be debugged or a new package tested

Interactive tasks which require online access from the users, e.g. Jupyter notebooks, are not a good fit for this type of instance.

Another limitation of the Spots/Preemptibles is that such jobs cannot be Paused (i.e. stop consuming money, but keep the job's environment). Compute nodes of this kind can only be fully terminated.

From the Cloud Pipeline's point of view this option is considered a Price Type and can be enabled at different levels:

  • Globally, at a platform level (if the Spot/Preemptible can be used by any job in the platform)
  • User group or specific user level (if it can be used by specific users)
  • Pipeline level (if a particular pipeline can tolerate Spot/Preemptible restarts)
  • Docker image/version level (if a particular image can be launched using reduced-cost instances)


Instance PAUSE/RESUME

In the existing usage scenarios, the best compute cost reduction was observed when interactive tools were stopped while not in use.

The typical use case here is:

  • User launches some IDE (e.g. RStudio/Jupyter)
  • Works during the day
  • Keeps the instance running over the night or weekends

This introduces really high spendings without any actual outcome, as the instance is doing nothing. To address this, Cloud Pipeline allows users to PAUSE and RESUME any instance/job which is created using the on-demand price type. While the instance is paused, compute is not charged, but the job's environment and filesystem are persisted. Once required, the instance can be resumed.

Under the hood, the "real" compute instances are stopped/deallocated and then restarted. Cloud Pipeline takes care of persisting and restoring the software state.


In general, this feature is available via the Web GUI and API, and the users are in charge of performing the pause/resume operation. The API allows this procedure to be automated (e.g. based on a schedule or resource usage), and the subsequent sections describe this approach.

See Manage runs lifecycles for more details.

IDLE instances

This section extends the "plain" PAUSE/RESUME by managing the IDLE instances.

Besides the night-time/weekend cases described above, here we also consider under-utilization of the compute resources, e.g. when the user selects a 96-core instance (maybe by mistake) but runs a single-threaded application. In this case lots of CPU resources are simply wasted.

Cloud Pipeline mixes together the PAUSE/RESUME and instances workload monitoring and offers a set of policies, which can be applied to the compute instances to take care of such IDLE instances.

These policies can be used in the following manner:

  • Platform administrators can define thresholds for the hardware utilization (CPU/GPU usage) and overall run duration.
  • If a job is considered IDLE (the hardware utilization is below the threshold for the configured period of time) – a number of actions can be performed (a single one or a mixture of them):
    • Job is marked as IDLE in the GUI
    • Email Notification is sent (to the Administrators and the Owner of the instance) to make the user aware of the event
    • Job can be automatically paused (if its price type and cluster mode are compatible with the PAUSE operation)
    • Job can be terminated
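
A minimal sketch of how such a policy could be evaluated is shown below. The names (`IdlePolicy`, `Run`, `idle_actions`), the one-sample-per-minute format and the action strings are assumptions made for illustration; they do not describe the actual Cloud Pipeline implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class IdlePolicy:
    cpu_threshold_pct: float  # utilization below this value counts as idle
    idle_period_min: int      # how long (minutes) it must stay below; must be > 0
    action: str               # hypothetical values: "NOTIFY", "PAUSE" or "STOP"


@dataclass
class Run:
    price_type: str               # "on-demand" or "spot"
    is_cluster: bool
    cpu_samples_pct: List[float]  # assumed: one sample per minute, newest last


def is_idle(run: Run, policy: IdlePolicy) -> bool:
    """A run is IDLE if every sample in the last period is below the threshold."""
    if policy.idle_period_min <= 0:
        raise ValueError("idle_period_min must be positive")
    window = run.cpu_samples_pct[-policy.idle_period_min:]
    if len(window) < policy.idle_period_min:
        return False  # not enough history yet
    return all(sample < policy.cpu_threshold_pct for sample in window)


def idle_actions(run: Run, policy: IdlePolicy) -> List[str]:
    """List the actions to perform for a run under the given policy."""
    if not is_idle(run, policy):
        return []
    actions = ["MARK_IDLE", "NOTIFY"]
    if policy.action == "PAUSE":
        # PAUSE is only possible for on-demand, non-cluster runs.
        if run.price_type == "on-demand" and not run.is_cluster:
            actions.append("PAUSE")
    elif policy.action == "STOP":
        actions.append("STOP")
    return actions
```

In this sketch a spot run under a "PAUSE" policy is only marked and notified, because it cannot be paused.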

Scheduled instances PAUSE/RESUME

For certain use cases (e.g. when the user leverages Cloud Pipeline as a development/research environment) users launch runs and keep them running all the time, including weekends and holidays, as shown in the examples above.

One can use the IDLE policy to PAUSE such jobs, but the IDLE status is set only after the utilization has stayed below the threshold for the configured period of time. This waiting time (until the platform decides to treat a job as IDLE) also costs some money. So to manage the jobs that shall be stopped for the non-working hours, a PAUSE/RESUME schedule is used.

  • A user (who has permissions to pause/resume a run) is able to set a schedule for an active run or a run being launched
  • The schedule is defined as a list of rules (the user can specify any number of them):
    • Action: PAUSE or RESUME
    • Recurrence:
      • Daily: every N days, time
      • Weekly: every N weeks, weekday, time
  • The user is able to create/view/modify/delete schedule rules at any time while the run is active (i.e. running or paused)
  • This is applied only to the "Pauseable" runs (i.e. On-demand/Non-cluster)
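
The rule structure above could be modelled as in the following sketch. The `ScheduleRule` type, its field names and the `anchor` date (the day from which "every N days/weeks" is counted) are assumptions for illustration only; the source does not specify how recurrence is anchored.

```python
from dataclasses import dataclass
from datetime import date, datetime, time
from typing import Optional


@dataclass(frozen=True)
class ScheduleRule:
    action: str      # "PAUSE" or "RESUME"
    recurrence: str  # "DAILY" or "WEEKLY"
    every: int       # N: every N days or every N weeks
    at: time         # time of day at which the action fires
    anchor: date     # assumed reference day; use a Monday for weekly rules
    weekday: Optional[int] = None  # 0 = Monday .. 6 = Sunday, weekly rules only


def fires_at(rule: ScheduleRule, moment: datetime) -> bool:
    """Return True if the rule should fire at the given minute."""
    if (moment.hour, moment.minute) != (rule.at.hour, rule.at.minute):
        return False
    days = (moment.date() - rule.anchor).days
    if days < 0:
        return False  # the rule is not in effect before its anchor day
    if rule.recurrence == "DAILY":
        return days % rule.every == 0
    if rule.recurrence == "WEEKLY":
        weeks = days // 7
        return moment.weekday() == rule.weekday and weeks % rule.every == 0
    raise ValueError(f"unknown recurrence: {rule.recurrence}")
```

A typical pair of rules would be a weekly PAUSE on Friday evening and a weekly RESUME on Monday morning, so that the run does not consume money over the weekend.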

Spending quotas

While the options described above are mostly focused on soft cost reduction (e.g. helping the user decide on the instance type), the platform shall also be capable of enforcing certain policies if the soft restrictions don't work.

This kind of restriction is controlled by spending quotas. This functionality allows policies to be applied to the following platform entities:

  • User - quotas can be applied to a specific user
  • Users group - group of users can be restricted separately as well
  • Billing/Cost center - this is a different dimension of user grouping. Typically this is a meta-group which is not used to apply security permissions. Such groups describe, e.g., departments which have a separate budget and can manage it
  • Global - administrator can define what is the overall budget for the platform


Each of those user groupings can be managed by the platform administrator or an authorized manager (who is assigned a corresponding platform role).

For each of the quotas configured, there is an option to specify the thresholds (e.g. 50% / 75% / 100%) at which a specific action shall be performed by the platform:

  • "Notify" (default) - this action will only notify the corresponding users that a limit is exceeded. Access to the platform shall not be restricted
  • "Read-only & keep jobs" - the corresponding users will be notified that a limit is exceeded and then will have "read-only" access to the platform: they won't be able to launch new jobs. All active runs will be kept
  • "Read-only & stop jobs" – same as the previous one, but all active runs will be stopped
  • "Block" - the corresponding users will be blocked and won’t have any access to the platform. After the time period for which the limit is set has passed, access to the platform shall be restored to the users according to their permissions.
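
The threshold-to-action mapping can be sketched as follows. The `Quota` type, the action identifiers and the rule that the highest reached threshold wins are assumptions for illustration, not the documented behavior of the platform.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Hypothetical identifiers for the four actions listed above.
ACTIONS = ("NOTIFY", "READ_ONLY_KEEP_JOBS", "READ_ONLY_STOP_JOBS", "BLOCK")


@dataclass(frozen=True)
class Quota:
    limit: float                # budget for the quota period, in currency units
    thresholds: Dict[int, str]  # percent of the limit -> action to perform


def triggered_action(quota: Quota, spent: float) -> Optional[str]:
    """Return the action of the highest threshold reached, or None."""
    used_pct = 100.0 * spent / quota.limit
    reached = [pct for pct in quota.thresholds if used_pct >= pct]
    if not reached:
        return None
    # Assumption: when several thresholds are reached, the highest one wins.
    return quota.thresholds[max(reached)]
```

For a quota of 1000 with thresholds `{50: "NOTIFY", 75: "READ_ONLY_KEEP_JOBS", 100: "BLOCK"}`, spending 800 triggers `"READ_ONLY_KEEP_JOBS"` in this sketch.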


Billing reports

To make the users aware of the current bills and the attached quota policy, the platform logs the information on the compute and storage costs.

This reporting feature is available in two flavors:

  • When a compute job is launched, the user is notified of the hourly cost of the chosen hardware configuration
  • Compute and storage costs are aggregated into the Elasticsearch index on a daily basis and can be queried to build the historical reports

General costs report

The latter one offers graphical/tabular visualization options to get insights into the bills of the current or a previous period:
[Screenshot: the "Billing Visualization" form]

With these forms, users can view the spendings of the whole system or the spendings divided by specific resources.
Presented metrics (resources):

  • costs of launching compute instances, used for tools/pipelines runs. There could be:
    • CPU instances
    • GPU instances
  • costs of storing user data in storages. There could be costs of storing:
    • in Object storages
    • in File storages
  • info about auxiliary costs isn't supported yet

All costs are aggregated and displayed for a specific period (by default - the current calendar month). The selected period, for which the costs are calculated and displayed, is called "current". Also, for comparison, the costs for the analogous previous period are displayed in diagrams/charts (where this data is available).
The user can select the desired (current) period to view the costs incurred:

  • from one of the predefined periods. For each of them there will be a specific "previous" period:
    • Month. If a month is selected as the current period - the "previous" period will be the calendar month preceding the selected one
    • Quarter. If a quarter is selected as the current period - the "previous" period will be the same quarter in the previous calendar year
    • Year. If a year is selected as the current period - the "previous" period will be the previous calendar year
  • custom period. That period is configured by the user manually in months and can have any duration. The "previous" period isn't displayed in this case
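
The "previous" period rules above can be expressed as a short sketch. The function name, the period identifiers and the (first day, last day) return format are assumptions made for illustration.

```python
import calendar
from datetime import date
from typing import Optional, Tuple


def month_bounds(year: int, month: int) -> Tuple[date, date]:
    """First and last day of a calendar month."""
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, 1), date(year, month, last_day)


def previous_period(kind: str, year: int, index: int = 1) -> Optional[Tuple[date, date]]:
    """Return the "previous" period for the selected "current" one.

    kind  - "MONTH", "QUARTER", "YEAR" or "CUSTOM" (hypothetical identifiers)
    index - month number (1-12) or quarter number (1-4); ignored for a year
    """
    if kind == "MONTH":
        # The calendar month preceding the selected one.
        prev_year, prev_month = (year - 1, 12) if index == 1 else (year, index - 1)
        return month_bounds(prev_year, prev_month)
    if kind == "QUARTER":
        # The same quarter in the previous calendar year.
        first_month = 3 * (index - 1) + 1
        start, _ = month_bounds(year - 1, first_month)
        _, end = month_bounds(year - 1, first_month + 2)
        return start, end
    if kind == "YEAR":
        return date(year - 1, 1, 1), date(year - 1, 12, 31)
    if kind == "CUSTOM":
        return None  # no "previous" period is displayed for custom periods
    raise ValueError(f"unknown period kind: {kind}")
```

Note the asymmetry: a month is compared with the month right before it, while a quarter is compared with the same quarter one year earlier.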

By default, the "Billing Visualization" form looks as shown in the picture above.
It contains:

  • General report table with the following data:
    • current and previous periods, for which the costs are calculated
    • summary spendings according to selected configurations in the toolbar for the current and previous periods
    • the difference between the costs of the current period and the previous one, in percent, with a mark showing whether the spendings grew or decreased
  • The main summary costs diagram. On this diagram, the user can see the summary spendings over the current and the previous time periods according to the configurations selected in the toolbar. This data can be displayed with accumulation (as a line chart) or as fact (as a bar chart with the actual spending values at each time point of the period, without accumulation)

    • if the selected "current" period coincides with the corresponding current calendar period, a slice line is displayed on the current calendar date. Its intersection with the "current" chart line shows the amount of money spent from the beginning of the period to the current date, and its intersection with the "previous" chart line shows the amount of money spent over the same span of the previous period
    • the user can hover over any point of the "current"/"previous" line of the diagram - the summary spendings at the corresponding datetime appear in the tooltip

      • the timeline aggregation step is 1 day for selected periods shorter than 4 months, and 1 month in other cases

        Note: the costs of the current calendar day are aggregated on the following day, so they aren't displayed in diagrams

      [Screenshot: the main summary costs diagram with accumulation]
      [Screenshot: the main summary costs diagram with actual values (without accumulation)]

  • The spendings bar chart with the Resources division. On this chart, the user can view how the summary spendings in the selected time period are divided over the resource groups - Storages (divided into Object and File storages) and Compute Instances (divided into CPU and GPU). For each resource, the summary spendings for the same previous period are presented as well.

  • The spendings bar chart with the Billing centers division. On this chart, the user can view how the summary spendings in the selected time period are divided over the billing centers (the system displays only the top N most costly billing centers). For each displayed center/group, the summary spendings for the same previous period are presented as well.
  • The report toolbar with the following controls:

    • the "period selector" to select a current time period for the report - from one of predefined periods or the manual custom period
    • the additional calendar control - to select another "current" period that doesn't coincide with the current calendar period (for predefined periods) or to manually select the desired period duration (for a custom period)
    • the control to select a specific billing center (user group) or user. By default, all available centers/users are selected
    • the export control to download the data of the displayed report in *.csv format to the local workstation


The billing report form described above is the general one.
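
The two display modes of the main summary costs diagram and its aggregation step can be illustrated with a small sketch. The function names are hypothetical, and the period length is assumed to be given as a whole number of months.

```python
from itertools import accumulate
from typing import List


def aggregation_step(period_months: int) -> str:
    """Timeline step: 1 day for periods shorter than 4 months, else 1 month."""
    return "day" if period_months < 4 else "month"


def as_accumulated(costs: List[float]) -> List[float]:
    """Turn per-step costs (bar chart, "as fact") into running totals (line chart)."""
    return list(accumulate(costs))
```

For instance, per-day costs of `[10.0, 0.0, 5.0]` become `[10.0, 10.0, 15.0]` in the accumulated view, while a quarter (3 months) is still drawn with a daily step.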

To get info/charts of the summary spendings for any desired period (from the available ones), select the corresponding period via the Period selector and the Calendar control (if necessary).

To get info/charts of the summary spendings in the selected period only for a specific resource group, click that resource in the Resources chart or use the corresponding item of the menu on the left side of the page (for more details see the sections below).

User report

To get info/charts of the summary spendings in the selected period only for a specific user, select the desired user from the dropdown list in the main toolbar. In this case:

  • the costs data will be calculated and displayed only for the selected user (in the general report table, in the main diagram and in the resources chart)
  • the spendings bar chart with the Billing centers division will disappear

Billing center/group report

To get info/charts of the summary spendings in the selected period only for a specific Billing center, select the desired one from the dropdown list in the main toolbar or click the corresponding Billing center's bar in the Billing centers division chart. In this case:

  • the costs data will be calculated and displayed only for the selected Billing center (summary for all its users - in the general report table, in the main diagram and in the resources chart)
  • the bar chart with the top N users' spendings in the selected period (among all users in that Billing center) will appear
  • the table with short info about the summary spendings of each user of that Billing center in the selected period will appear (the info also contains the summary duration and count of the runs launched by the user, and the used storage volume)

Resources reports

If you wish, you can get info/charts of the summary spendings not for all resources but for a specific resource group.
For these charts you can also select the desired time period and a specific user/Billing center (group) to view the costs in the same way as described above.

Storages report

To get info/charts of the summary spendings in the selected period only for the Storages resource group (costs of storing data in storages), click the corresponding item of the menu on the left side of the page.

In this case, the summary costs for all storages used during the selected period by the selected users/Billing center will be calculated and displayed (both types - Object/File storages).

On the page that appears, you can see:

  • General report table with summary costs for all used storages during the selected and previous (if it's available) periods
  • The summary Storages spendings diagram over the current and the previous time periods according to selected configurations in the toolbar. This data could be displayed with the accumulation (as line chart) or as fact (as bar chart with actual spending values in each time point of the period without accumulation)
  • The spendings bar chart with top N most costly storages used during the selected period compared with the previous one (if it's available)
    • The detailed spendings table under the chart with the full list of storages used during the selected period by selected user/Billing center

[Screenshot: example of the Storages report]

By default, this form has no division by storage type (Object/File). All data is calculated and displayed as a summary for both types.
If you want to view the costs only for storages of a specific type, select the corresponding one in the menu on the left side of the page.
[Screenshot: the Storages report for Object storages only]

Compute instances report

To get info/charts of the summary spendings in the selected period only for the Compute instances resource group (costs of launching instances for running tools/pipelines), click the corresponding item of the menu on the left side of the page.

In this case, the summary costs for all instances launched during the selected period by the selected users/Billing center will be calculated and displayed (launched tools/pipelines and both instance types - CPU/GPU).

On the page that appears, you can see:

  • General report table with summary costs for all runs launched in the selected and previous (if it's available) periods
  • The summary Compute instances spendings diagram over the current and the previous time periods according to selected configurations in the toolbar. This data could be displayed with the accumulation (as line chart) or as fact (as bar chart with actual spending values in each time point of the period without accumulation)
  • The spendings bar chart with top N most costly instance types launched in the selected period compared to the previous one (if it's available)
    • The detailed spendings table under that chart with the full list of launched instance types in the selected period
  • The spendings bar chart with top N most costly tools launched in the selected period compared to the previous one (if it's available)
    • The detailed spendings table under that chart with the full list of launched tools in the selected period
  • The spendings bar chart with top N most costly pipelines launched in the selected period compared to the previous one (if it's available)
    • The detailed spendings table under that chart with the full list of launched pipelines in the selected period
  • An additional control that changes what the spendings bar charts display, with the following possible values:
    • cost (default) - the 3 bar charts described above display the top N most costly objects (instance types, tools, pipelines). Data unit - currency
    • usage - the 3 bar charts described above display the top N most used objects (instance types, tools, pipelines). Data unit - usage hours
    • runs - the 3 bar charts described above display the top N objects (instance types, tools, pipelines) with the largest runs count. Data unit - runs count
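
The effect of the cost/usage/runs control can be sketched as a "top N" aggregation over run records. The record layout (dictionaries with `cost` and `hours` keys) and the function name are assumptions for illustration; the real reports are built from the billing index.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def top_n(runs: Iterable[Dict], group_by: str, metric: str, n: int) -> List[Tuple[str, float]]:
    """Rank objects (instance types, tools or pipelines) by the chosen metric.

    metric - "cost" (currency), "usage" (hours) or "runs" (runs count)
    """
    totals: Dict[str, float] = defaultdict(float)
    for run in runs:
        if metric == "cost":
            value = run["cost"]
        elif metric == "usage":
            value = run["hours"]
        elif metric == "runs":
            value = 1.0  # each record counts as one run
        else:
            raise ValueError(f"unknown metric: {metric}")
        totals[run[group_by]] += value
    # Largest totals first, truncated to the top N entries.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:n]
```

The same run records can give different rankings per metric: a cheap instance type that is launched often may lead by "runs" and "usage" while an expensive GPU type leads by "cost".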

[Screenshot: example of the Compute instances report]

By default, this form has no division by workload type (CPU/GPU). All data is calculated and displayed as a summary for both types.
If you want to view the costs only for instances of a specific workload type, select the corresponding one in the menu on the left side of the page.
[Screenshot: the Compute instances report for CPU only]

Report aggregation level according to the permissions

Available reports are calculated and displayed according to the user permissions:

  • general user can view:
    • general report with only own costs (User report, without Billing center report)
    • resources report with only own costs
  • Billing center (group) leader can view:
    • general report with costs of that Billing center
    • Billing center report
    • user report for any user of that Billing center
    • resources report with costs of that Billing center on the whole and for any user of that Billing center separately
  • platform admin can view all available reports
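
The visibility rules above can be summarized in a minimal sketch. The `Person` type, the role identifiers and the function names are hypothetical and only illustrate the aggregation levels; they are not the platform's permission model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Person:
    name: str
    billing_center: str
    role: str = "USER"  # hypothetical roles: "USER", "BILLING_CENTER_LEADER", "ADMIN"


def can_view_user_costs(viewer: Person, target: Person) -> bool:
    """May `viewer` see the user/resources report of `target`?"""
    if viewer.role == "ADMIN":
        return True
    if viewer.role == "BILLING_CENTER_LEADER":
        # A leader sees any user of their own Billing center.
        return viewer.billing_center == target.billing_center
    return viewer.name == target.name  # general users see only their own costs


def can_view_center_report(viewer: Person, center: str) -> bool:
    """May `viewer` see the Billing center report of `center`?"""
    if viewer.role == "ADMIN":
        return True
    return viewer.role == "BILLING_CENTER_LEADER" and viewer.billing_center == center
```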