The Ugly, Hidden and Underestimated Costs of Building an On-Premise HPC System

Depending on your current and future HPC and organizational demands, each system offers benefits and limitations that need to be defined and compared. One of the main comparisons between systems is usually the Total Cost of Ownership (TCO). As I mentioned in a previous blog post, TCO is not exactly a good fit for making buying decisions between fundamentally dissimilar alternatives. The TCO of on-premise HPC systems has been discussed for more than 30 years, including by our VP of Sales in his blog “The Real Cost of High-Performance Computing.” For people who are considering buying an on-premise HPC system, there are some hidden expenses that are often overlooked when calculating its TCO.
In this post, I intend to break down the TCO of an on-premise system and expose some expenses that may be overlooked.
A quick review on TCO
The broad definition of an on-premise HPC system’s TCO is the sum of all direct and indirect expenses associated with your prospective system. The more obvious expenses are hardware, software, staffing, and power. For hardware, you need servers, wiring, ToR switches, aggregation switches, server racks, power distribution units, and so on. Then you must buy software that coordinates communication between the nodes so they can solve complex problems together. In addition, you must buy licenses for the application software you plan on using. A resource that can be extremely variable and hard to estimate is the staffing required to develop, deploy, and maintain the on-premise HPC system. Finally, on-premise HPC systems require a lot of power and cooling capacity, so it is essential to calculate your energy consumption and how it will affect your operational expenses. Take the sum of the expenses for the items above and you have the basic TCO for your on-premise HPC system; however, there are some hidden costs that can heavily affect that figure.
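As a rough illustration of that sum, here is a minimal Python sketch. The class name, the split into one-time and recurring costs, the service life, and every dollar figure are assumptions made up for the example; they are not data from this post.

```python
from dataclasses import dataclass


@dataclass
class BasicTco:
    """Basic on-premise TCO inputs. All values are hypothetical placeholders."""
    hardware: float           # one-time: servers, switches, racks, PDUs, wiring
    software_per_year: float  # recurring: cluster software plus application licenses
    staffing_per_year: float  # recurring: people who deploy and maintain the system
    power_per_year: float     # recurring: power and cooling
    years: int                # assumed service life of the system

    def total(self) -> float:
        """Sum the one-time cost and the recurring costs over the service life."""
        recurring = (self.software_per_year
                     + self.staffing_per_year
                     + self.power_per_year)
        return self.hardware + recurring * self.years


# Hypothetical example: a 4-year service life with invented figures.
example = BasicTco(
    hardware=1_000_000,
    software_per_year=150_000,
    staffing_per_year=300_000,
    power_per_year=100_000,
    years=4,
)
print(f"Basic TCO: ${example.total():,.0f}")
```

This covers only the obvious line items; the hidden costs discussed next would need to be added on top.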
Real-world, Hidden Costs
#1 The facilities hosting your HPC systems have cost dependencies that reach further than they appear at first glance. Ensuring your facility has the cooling and power provisions needed to support both the current system and its future growth can save a lot of expense down the road. Power is a major expense and can heavily affect your overall operating costs. Depending on cluster location and utilization, your power costs can vary greatly. Depending on your location, you may also see highly variable power prices, which will shape how you operate your HPC system to minimize expenses. In some cases, power can account for more than one-third of your operating expenses. Facilities and energy are important to consider when calculating your TCO and, for a large facility, should be considered a primary concern.
#2 Staffing will cost and vary more than you think, with performance and uptime suffering if neglected. One of the most variable and elusive expenses to define is the staffing for on-premise HPC systems. It can be very difficult to find, hire, and train good Operations and IT Managers who can handle the development, deployment, and maintenance of an HPC system. Designing an HPC system requires expensive specialists to match the best hardware and software to your computing demands. The procurement of the system alone can cost as much as 5% of the total HPC system cost and takes at least 6 months. During this time, you must continue paying specialists to assemble the cluster while receiving no return from the HPC system. Once deployed, the system requires very specific IT staffing to ensure its maintenance and operation. These employees need specialized skills to test and protect your HPC system’s longevity and performance. Finding the right employees to perform these functions can be cumbersome and costly, but it is a priority when considering deploying an on-premise HPC system.
#3 Underutilization costs more than just the idle time; the associated overhead is substantial as well. An idle HPC system not only lowers your ROI, but can also have devastating impacts on your product development cycle. Backup systems can be overlooked because they are not strictly necessary for an HPC system to operate; however, the consequences of not having them can be dire. Generators, switches, gas, and maintenance of your backup energy system are all necessary to ensure that your systems are protected from power outages. Like backup energy provisions, backup hardware is extremely important for avoiding an idle HPC system. Spare hardware is important to have on hand in case there is an issue; without it, you can find your system sitting idle while a part is repaired or bought. If you fail to plan, you should plan to fail; this is especially true for running an on-premise HPC system.
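One way to make the cost of idle time concrete is to divide the fixed yearly cost of the system by the core-hours that are actually used. The sketch below does that; the function name, the annual cost, and the core count are hypothetical values chosen only for illustration.

```python
HOURS_PER_YEAR = 8760


def cost_per_used_core_hour(annual_cost: float, cores: int, utilization: float) -> float:
    """Fixed yearly cost spread over the core-hours that actually run jobs."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in the range (0, 1]")
    return annual_cost / (cores * HOURS_PER_YEAR * utilization)


# Hypothetical system: $800,000 per year, 1,000 cores.
for utilization in (0.9, 0.6, 0.3):
    rate = cost_per_used_core_hour(800_000, 1_000, utilization)
    print(f"{utilization:.0%} utilized -> ${rate:.3f} per used core-hour")
```

Because the yearly cost is fixed, halving utilization doubles the price of every core-hour you actually use.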
#4 Finally, on-premise technology is a constant uphill (and usually losing) battle. This is the harm caused by not using the best technology, and by having to spend enormous effort and capital racing to keep up. When comparing HPC systems, you have to acknowledge the costs and rewards, and their effect on each other. Not using the best technology creates expenses that stem from forfeiting the rewards the best system would deliver. The expenses associated with not using the best HPC solution are: lost productivity, missed innovation, longer time-to-solution, technology refresh cost, IT risk management, and increased IT debt and commitment. The most harmful consequence is inefficiency in the research pipeline, which creates a plethora of expenses tied to longer time-to-market, delayed innovation, and more researcher idle time. A lack of HPC technology can have irreparable consequences for your organization, such as not being able to research larger problems and missing innovations, which can make your organization uncompetitive. These expenses are often difficult to calculate because you have to assess how much more efficient your team would be with a better HPC solution and then work backwards to calculate the expenses caused by the inefficiency.
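A simplified version of that backward calculation might look like the sketch below. It considers only researcher idle time, assumes waiting time shrinks in proportion to a speedup factor, and uses invented numbers throughout; the function name and parameters are hypothetical, and lost innovation or time-to-market effects are not modeled.

```python
def annual_inefficiency_cost(
    researchers: int,
    wait_hours_per_year: float,
    loaded_hourly_cost: float,
    speedup: float,
) -> float:
    """Cost of the waiting time a faster system would have eliminated.

    Assumes waiting time scales as 1 / speedup, which is a simplification.
    """
    if speedup < 1:
        raise ValueError("speedup must be at least 1")
    hours_saved = wait_hours_per_year * (1 - 1 / speedup)
    return researchers * hours_saved * loaded_hourly_cost


# Hypothetical team: 10 researchers, each waiting 200 hours per year on results,
# at a loaded cost of $100 per hour, compared with a system twice as fast.
cost = annual_inefficiency_cost(
    researchers=10,
    wait_hours_per_year=200,
    loaded_hourly_cost=100,
    speedup=2.0,
)
print(f"Estimated yearly cost of inefficiency: ${cost:,.0f}")
```

The hard part in practice is the speedup estimate itself, which is exactly the efficiency assessment described above.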
In summary, finding the true TCO of an on-premise HPC system can prove very difficult once you consider all the hidden costs: staffing, facilities, power consumption, backup provisions, and forfeited rewards. I argue that one of the most important expenses to consider when comparing HPC systems is the cost of forfeited rewards; however, it is also the most difficult to calculate and predict. TCO comparisons between cloud-enabled and on-premise HPC systems have been discussed regularly and are still not clearly defined. It is a comparison that we are working to improve, so if you have any comments or questions on this blog post or TCO, we would love to hear what you think.
Sara Jeanes. (2017, June 19). Cloud vs. Datacenter Costs for High Performance Computing (HPC): A Real World Example. Retrieved from: https://www.internet2.edu/blogs/detail/14114
Tony Spagnuolo. (2015, January). The Real Cost of High Performance Computing. Retrieved from: https://rescale.com/blog/the-real-cost-of-high-performance-computing/
Wolfgang Gentzsch. (2016, March 6). A Total Cost Analysis for Manufacturers of In-house Computing Resources and Cloud Computing. Retrieved from: https://community.theubercloud.com/wp-content/uploads/2016/04/TCO-Study-UberCloud.pdf

Rescale Sales
