YubiHSM 2: A load balanced design for heavy traffic environments


Introduction

The YubiHSM 2 was designed to be lightweight, compact, portable and flexible. It is to server-side security what the YubiKey is to personal security. As with all Hardware Security Module (HSM) devices, it affords superior protection compared to software-based alternatives - particularly at the enterprise level - because the physical separation of the secure element can prevent attackers on the network from accessing memory and other resources that they might otherwise subvert in order to compromise the valuable secrets therein.

 

The device is not only technically effective - each YubiHSM 2 is created using the same Yubico principles and has undergone the same production process as the YubiKey - but also cost effective, at a fraction of the price of other HSM devices on the market. Its applications are wide and varied, ranging from code signing and assurance of authenticity within manufacturing to use as an identifier inside embedded IoT devices, but the most common use case for the device is within public key infrastructure (PKI), storing server secrets such as private keys or certificates used for critical signing or encryption operations.

 

It is important, however, to highlight the one drawback of the device’s smaller size compared to a traditional HSM: a distinct limitation in operational load capacity. The threshold varies depending on both the algorithm and the type of operation requested of the device, but is likely to be considerably lower than that of its larger counterparts. This tradeoff between cost, performance and size is the foremost consideration for customers weighing the YubiHSM 2 as a candidate to solve their cryptographic security needs.

 

To mitigate this issue and demonstrate that the YubiHSM 2 can in fact be incorporated even where demands are heavy, this article proposes a scalable and practical design in which multiple devices are accessed via a load balancer, distributing traffic across several parallel sessions. Ultimately, the number of YubiHSM 2 devices required in the final implementation will be directly proportional to the number of cryptographic requests per second, which might be projected based on metrics such as current network conditions and the number of concurrent users. Moreover, a similar design can provide failover redundancy, in situations where availability takes priority over capacity or simply as a sensible precaution in any practical implementation.

 

Operational threshold

The first data point to consider when attempting such an implementation is how long it takes an otherwise unoccupied YubiHSM 2 to process various sorts of requests. Table 1 below lists average timings for some of the operations the device supports, based on numerous internal test runs; results may vary depending on network and server conditions. For more information on how to perform the tests locally, in addition to the full capabilities of the YubiHSM 2, please refer to the following link.

 

Operation or Algorithm        Approx. time taken (ms)   Operations per second
RSA-2048-PKCS1-SHA256         139                       7
RSA-3072-PKCS1-SHA384         504                       2
RSA-4096-PKCS1-SHA512         852                       1
ECDSA-P224-SHA1               64                        15
ECDSA-P256-SHA256             73                        13
ECDSA-P384-SHA384             120                       8
ECDSA-P521-SHA512             210                       4
EdDSA-25519-32Bytes           105                       9
EdDSA-25519-64Bytes           121                       8
EdDSA-25519-128Bytes          137                       7
EdDSA-25519-256Bytes          168                       5
EdDSA-25519-512Bytes          229                       4
EdDSA-25519-1024Bytes         353                       2
AES-(128|192|256)-CCM-Wrap    10                        100
HMAC-SHA-(1|256)              4                         250
HMAC-SHA-(384|512)            243                       4

 

Table 1. The average time taken to complete various operations on the YubiHSM 2

 

Using the average time taken as a baseline, it thereby becomes possible to extrapolate the number of operations per second for each algorithm type (see the rightmost column in Table 1). For example, an RSA 2048 based operation takes the YubiHSM 2 approximately 139 ms on average - or expressed another way - it is possible to execute seven (7) such operations every second.
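
The extrapolation is simple arithmetic: divide one second (1000 ms) by the average time per operation. A minimal Python sketch of that calculation follows, using three timings from Table 1. The function name is illustrative only, and rounding down is an assumption made here to keep the estimate conservative.

```python
import math

# Average times in milliseconds, copied from Table 1.
AVG_MS = {
    "RSA-2048-PKCS1-SHA256": 139,
    "ECDSA-P256-SHA256": 73,
    "AES-(128|192|256)-CCM-Wrap": 10,
}


def ops_per_second(avg_ms: float) -> int:
    """Whole operations that fit into 1000 ms, rounded down."""
    return math.floor(1000 / avg_ms)


for name, ms in AVG_MS.items():
    print(f"{name}: {ops_per_second(ms)} ops/s")
```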

 

In a theoretical scenario where each use of an RSA 2048 key pair corresponds to an upstream signing operation triggered by a user, it would be reasonable to assume that a single device could support up to seven concurrent users, provided each submits no more than one signing operation per second. If the intention is instead to use RSA 4096, all else being equal, the increased complexity of the operation places a greater burden on the device and reduces capacity to a single user request per second on average.

 

As a practical illustration, the signing operations above could correspond to a specific event such as a federation server issuing SAML tokens to users as they log onto a network. Ideally, the validity period of the tokens corresponds to the underlying strength of the RSA algorithm being used, such as shorter-lived tokens signed using RSA 2048 or tokens with much longer lifespans signed using RSA 4096. An important point to highlight is that the more taxing algorithms do not necessarily preclude the YubiHSM 2 from being useful given a large number of users, since it stands to reason that the longer the lifespan of a token, the less frequently requests to issue a new one will arrive.

 

Additionally, it should be clarified that the YubiHSM 2 is fully capable of supporting any combination of different operation types up to its threshold, so long as the total time taken by the operations submitted within any one second does not exceed 1000 ms.

 

Using the values from Table 1 to simulate one possible scenario, below are four hypothetical users accessing the YubiHSM 2 concurrently, including an example operation type and the average time taken to execute said operation:

 

  • User 1 - RSA-2048-PKCS1-SHA256 - 139 ms
  • User 2 - ECDSA-P521-SHA512 - 210 ms
  • User 3 - EdDSA-25519-128Bytes - 137 ms
  • User 4 - HMAC-SHA-256 - 4 ms

The total time taken in this example is therefore 139 ms + 210 ms + 137 ms + 4 ms = 490 ms - below 1000 ms - indicating that the scenario is well within the threshold of the YubiHSM 2 and hence statistically feasible.
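
The same check can be expressed in a few lines of Python. This is only a sketch of the arithmetic above; the 1000 ms budget and the per-operation timings come from the text and Table 1, while the function names are illustrative.

```python
BUDGET_MS = 1000  # one second of device time

# (user, operation, average time in ms) taken from the example above.
requests = [
    ("User 1", "RSA-2048-PKCS1-SHA256", 139),
    ("User 2", "ECDSA-P521-SHA512", 210),
    ("User 3", "EdDSA-25519-128Bytes", 137),
    ("User 4", "HMAC-SHA-256", 4),
]


def total_ms(reqs) -> int:
    """Sum of the average processing times of all requests."""
    return sum(ms for _, _, ms in reqs)


def fits_within_budget(reqs, budget_ms: int = BUDGET_MS) -> bool:
    """True if the combined workload stays within one second of device time."""
    return total_ms(reqs) <= budget_ms


print(total_ms(requests), fits_within_budget(requests))
```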

 

In situations where a YubiHSM 2 device does receive more requests per second than its threshold, it will not necessarily return an error message to the requestor, and the consequences are somewhat unpredictable. Simply sending an unmeasured number of requests to the device will not result in an orderly queue; instead, requests already being processed may fail to complete in a timely manner, or at the very least, incoming requests beyond the threshold may fail unceremoniously or time out. Administrators are always advised to monitor the error logs closely and react accordingly, especially in situations where the device may be exposed to a heavy load.

 

Given the performance limitations of the YubiHSM 2 then, it seems logical to consider how the total number of operations per second might be scaled upwards to solve larger demands imposed by networks with significantly higher traffic. A reasonable extrapolation of the data discussed up until this point suggests that by simply increasing the number of YubiHSM 2 devices within an ecosystem, such that all requests may be distributed evenly and so that no individual device will exceed its threshold, the operational capacity may be increased in a linear manner. A couple of other considerations may affect the final achievable output, such as the fact each YubiHSM 2 device is single-threaded, but more information is provided in the upcoming section.
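
Under the linear-scaling assumption described above, a first estimate of the number of devices can be derived from the projected request rate. The sketch below is a rough sizing aid rather than a guarantee: it assumes a single operation type, perfectly even distribution and no headroom, and the function name is hypothetical.

```python
import math


def devices_needed(requests_per_second: float, avg_ms: float) -> int:
    """Estimate how many devices keep each one at or below 1000 ms of work per second."""
    busy_ms_per_second = requests_per_second * avg_ms
    return max(1, math.ceil(busy_ms_per_second / 1000))


# 50 RSA-2048 signatures per second at 139 ms each is 6950 ms of work per second.
print(devices_needed(50, 139))
# 20 ECDSA-P256 signatures per second at 73 ms each is 1460 ms of work per second.
print(devices_needed(20, 73))
```

In practice it would be sensible to add spare capacity on top of this figure, since real traffic is rarely spread evenly.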

 

It should also be highlighted that costs will increase in proportion to the number of additional devices introduced. Furthermore, some programming effort will be required to assemble all of the elements of the proposed design. This is made possible by the device’s PKCS #11 compatibility, and Yubico also provides core libraries in Java, C and Python on which customers can build their own solutions.

 

A limit to the number of concurrent sessions

While some operations that do not require a session may be performed on a YubiHSM 2 - such as generic status commands - a session is required for all meaningful operations. A session can be defined as a logical connection between an application and the device, encrypted and authenticated based on the GlobalPlatform SCP03 protocol. In essence, the session is secured by short-lived keys that last only for its duration and are derived from longer-lived, pre-shared authentication keys.

 

Although each YubiHSM 2 device is capable of supporting up to sixteen (16) concurrent sessions, it is only able to execute operations across the sessions in a serial manner (i.e. the device is single threaded). This should be top of mind when planning a load balanced system design, as it very well may impact the calculus on the overall processing capacity of the YubiHSM 2, in addition to the previously discussed threshold.

 

To elaborate with another simple example, consider three hypothetical concurrent users, with one of them occupying two sessions:

 

  • User 1 - Session 1 - EdDSA-25519-512Bytes - 229 ms
  • User 1 - Session 2 - ECDSA-P521-SHA512 - 210 ms
  • User 2 - Session 3 - AES-128-CCM-Wrap - 10 ms
  • User 3 - Session 4 - ECDSA-P224-SHA1 - 64 ms

Assuming that all users submit their requests at the exact same moment,

 

  • User 1 will receive:
    • A response to their request on Session 1 after 229 ms 
    • A response to their request on Session 2 after 439 ms (229 ms + 210 ms)
  • User 2 will receive:
    • A response to their request on Session 3 after 449 ms (439 ms + 10 ms)
  • And finally, User 3 will receive:
    • A response to their request on Session 4 after 513 ms (449 ms + 64 ms)
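
Because the device works through requests one at a time, each response time is the running total of everything queued ahead of it. The following sketch reproduces the figures above, assuming the requests are processed in session order (the actual ordering on a device is not specified here).

```python
from itertools import accumulate

# (user, session, operation, average time in ms) from the example above.
queue = [
    ("User 1", 1, "EdDSA-25519-512Bytes", 229),
    ("User 1", 2, "ECDSA-P521-SHA512", 210),
    ("User 2", 3, "AES-128-CCM-Wrap", 10),
    ("User 3", 4, "ECDSA-P224-SHA1", 64),
]


def completion_times(requests) -> list:
    """Cumulative finish time (ms) of each request when served strictly in order."""
    return list(accumulate(item[3] for item in requests))


for (user, session, op, _), done in zip(queue, completion_times(queue)):
    print(f"{user}, Session {session}, {op}: response after {done} ms")
```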

While the above operations do not exceed the device’s overall threshold, there are obvious timing delays for each user and their operations, as the YubiHSM 2 works through the backlog of simultaneously submitted requests.

 

It may also be relevant to highlight that sessions are expired by the YubiHSM 2 if no activity has been detected throughout the duration of a default expiration period of thirty (30) seconds. Sessions may also be explicitly closed on command, and all closed sessions whether expired or explicitly closed, are released back to the pool of unused sessions. When an application or user attempts to open a session when there are none available, a “no more available sessions” error will be returned by the device.
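
For capacity planning it can help to model this behaviour in software. The class below is a toy model only, not device firmware or a Yubico API: it assumes sixteen slots and a thirty second idle timeout as described above, and takes the current time as an explicit argument so that the behaviour is easy to follow.

```python
MAX_SESSIONS = 16
IDLE_TIMEOUT_S = 30.0


class SessionPool:
    """Toy model of the session slots on a single device (hypothetical)."""

    def __init__(self, max_sessions=MAX_SESSIONS, idle_timeout_s=IDLE_TIMEOUT_S):
        self.max_sessions = max_sessions
        self.idle_timeout_s = idle_timeout_s
        self._last_activity = {}  # session id -> time of last activity

    def _expire(self, now):
        """Release every session that has been idle for the full timeout."""
        idle = [sid for sid, t in self._last_activity.items()
                if now - t >= self.idle_timeout_s]
        for sid in idle:
            del self._last_activity[sid]

    def open(self, now):
        """Claim the lowest free slot, or fail if all slots are in use."""
        self._expire(now)
        for sid in range(self.max_sessions):
            if sid not in self._last_activity:
                self._last_activity[sid] = now
                return sid
        raise RuntimeError("no more available sessions")

    def touch(self, sid, now):
        """Record activity on a session, resetting its idle timer."""
        self._expire(now)
        if sid not in self._last_activity:
            raise KeyError(f"session {sid} is closed or expired")
        self._last_activity[sid] = now

    def close(self, sid):
        """Explicitly release a session back to the pool."""
        self._last_activity.pop(sid, None)
```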

 

The overall implication of both the single-threaded mode of operation and the session limit is that, even when a set of operations collectively takes less than 1000 ms, these constraints may restrict real throughput to below the theoretical maximum number of executable operations. This lends further credence to the design principle that a system featuring multiple YubiHSM 2 devices is more scalable than a design centered around a single device, since the additional devices effectively act as multiple pseudo threads.

 

Increasing the number of devices

In order to scale up any YubiHSM 2 centric solution and deploy more devices across an ecosystem to help absorb traffic, each additional device needs to be a logical duplicate of the original. It is vital that all devices are identical in terms of the relevant private keys and certificates, because any signing or encryption operation invoked across the network should always result in the same output.

 

load-balanced-solution-i1.png

 

So, assuming that there is already a master YubiHSM 2 device and that all target keys have the exportable-under-wrap capability, its contents can be securely replicated to any additional devices using a Wrap process, which is specifically designed to encapsulate cryptographic key material without exposing the underlying secrets. The concept works in spite of untrusted storage conditions, or even if data is being sent across an untrusted communications network. Key Wrap algorithms do this by encrypting key material under longer-term encryption keys, which in the case of the YubiHSM 2 are based on the AES 128, AES 192 or AES 256 algorithms (see the AES-CCM-Wrap entry in Table 1).

 

Through YubiHSM-Shell for instance, the YubiHSM 2 is fully capable of performing this process by following the below procedure for each cryptographic object residing on the master YubiHSM 2. Note that the full command syntax for each step can be created using the guide found here, as only the logical steps are listed below. An example backup and restore can also be found here, which may be useful as a starting point. It is highly recommended to save the commands as a reusable, executable script (namely Steps 3 to 6) once the logic has been crafted, as this will reduce the manual work required for any subsequent expansions as well.

 

The steps to achieve key replication under Wrap are:

 

  1. Generate a Pseudo Random value on the master
  2. Import the Pseudo Random value back onto the master to be used as the Wrap Key 
  3. Import the Pseudo Random value as the Wrap Key onto the candidate device
  4. Wrap the target object on the master using the Wrap Key
  5. Export Wrapped data from the master to the candidate device
  6. Unwrap the target object on the candidate device with the imported Wrap Key
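
The order of the six steps can be illustrated with a small simulation. Everything in the sketch below is hypothetical: ToyDevice is an in-memory stand-in rather than a YubiHSM 2 API, the wrap key ID of 200 is arbitrary, and the toy cipher (a SHA-256 keystream with an HMAC tag) merely stands in for the device's AES-CCM wrap and must not be used to protect real keys.

```python
import hashlib
import hmac
import os


def _xor_stream(key: bytes, nonce: bytes, data: bytes) -> bytes:
    """Toy keystream cipher; a placeholder for AES-CCM, NOT real key wrap."""
    stream = b""
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + nonce + counter.to_bytes(4, "big")).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))


class ToyDevice:
    """Hypothetical in-memory stand-in for a YubiHSM 2."""

    def __init__(self):
        self.objects = {}    # object id -> secret bytes
        self.wrap_keys = {}  # wrap key id -> key bytes

    def get_pseudo_random(self, length: int) -> bytes:
        return os.urandom(length)

    def put_wrap_key(self, key_id: int, key: bytes) -> None:
        self.wrap_keys[key_id] = key

    def export_wrapped(self, wrap_key_id: int, object_id: int) -> bytes:
        key, nonce = self.wrap_keys[wrap_key_id], os.urandom(13)
        ct = _xor_stream(key, nonce, self.objects[object_id])
        tag = hmac.new(key, nonce + ct, hashlib.sha256).digest()
        return nonce + tag + ct

    def import_wrapped(self, wrap_key_id: int, object_id: int, blob: bytes) -> None:
        key = self.wrap_keys[wrap_key_id]
        nonce, tag, ct = blob[:13], blob[13:45], blob[45:]
        expected = hmac.new(key, nonce + ct, hashlib.sha256).digest()
        if not hmac.compare_digest(tag, expected):
            raise ValueError("wrapped data failed authentication")
        self.objects[object_id] = _xor_stream(key, nonce, ct)


def replicate(master, candidate, object_id: int, wrap_key_id: int = 200) -> None:
    seed = master.get_pseudo_random(32)                   # Step 1
    master.put_wrap_key(wrap_key_id, seed)                # Step 2
    candidate.put_wrap_key(wrap_key_id, seed)             # Step 3
    blob = master.export_wrapped(wrap_key_id, object_id)  # Steps 4 and 5
    candidate.import_wrapped(wrap_key_id, object_id, blob)  # Step 6
```

The point of the sketch is that the target object only ever leaves the master in wrapped form, and that the candidate can only unwrap it because it holds the same Wrap Key.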

To be clear, since the YubiHSM 2 can be managed remotely across a network, there is no requirement that additional devices must be in physical proximity to any existing devices. So long as each device is assigned a unique IP address on the network, and has been correctly configured under Wrap, they can be deployed anywhere within the organization. An important consideration not to be overlooked, however, is network latency. Always remember to assess the roundtrip between network components - including any YubiHSM 2 devices - in addition to the time it takes to perform the desired cryptographic operations when calculating overall processing times. Significant delays to the user are possible even if network latency does not directly contribute to the overloading of YubiHSM 2 devices. 

 

load-balanced-solution-i2.png

 

Figure 1. Expanding the number of devices using Wrap export and import

 

To manage all of the satellite devices and maintain a scalable solution, a centralised point of contact is crucial: incoming requests must be distributed in an organised manner, and the total quantity and (logical) location of all devices must be tracked. This can be readily accomplished with a reverse proxy (sometimes referred to as a load balancer), configured with a look-up table of all the deployed YubiHSM 2 devices (i.e. their IP addresses), which can be updated as devices are added in the future. Many such solutions are available, both commercial and free, for example HAProxy and NGINX, and the approach to implementation should be familiar to any infrastructure or web server administrator who has deployed a load-balanced solution in the past.

 

When implementing the reverse proxy in conjunction with multiple YubiHSM 2 devices, it is extremely important to enable sticky (persistent) sessions within the configuration. Without them, the entire solution can run into severe problems: each interaction with a YubiHSM 2 requires an end-to-end session, and even logically identical devices do not have access to one another’s session keys, so a request that is routed to a different device mid-session cannot be processed and no valid response will reach the original requestor.

 

To elaborate, in order to successfully complete a hash generation, signing or an encryption for instance, the user must first open a new session, provide credentials and then submit the specific instructions regarding the desired operation. Clearly, if sticky sessions are not implemented, a request to produce a cryptographic signature on device X will not work if the login process was in fact completed on device Y.
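
The idea behind sticky sessions can be sketched independently of any particular proxy product. In the hypothetical router below, the first request of a client session is assigned to a device and every later request carrying the same session identifier is sent to that same device; the class, method and backend names are illustrative and do not correspond to HAProxy or NGINX configuration.

```python
class StickyRouter:
    """Pin each client session to the device that first served it (sketch)."""

    def __init__(self, backends):
        self.backends = list(backends)
        self._assigned = {}  # client session id -> backend
        self._next = 0       # index of the next backend for new sessions

    def route(self, session_id: str) -> str:
        """Return the backend for this session, assigning one on first use."""
        if session_id not in self._assigned:
            backend = self.backends[self._next % len(self.backends)]
            self._assigned[session_id] = backend
            self._next += 1
        return self._assigned[session_id]

    def release(self, session_id: str) -> None:
        """Forget the mapping once the session has been closed or has expired."""
        self._assigned.pop(session_id, None)


router = StickyRouter(["device-x", "device-y"])
print(router.route("session-a"))  # login lands on one device
print(router.route("session-a"))  # the signing request follows it there
```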

 

It should be added that as part of the YubiHSM 2 SDK library, there are optional capabilities which may allow for automatic session re-establishment and generally more robust session management when developing a solution. These may ultimately provide more flexibility when determining how and when individual sessions are created and then released.

 

Load balancing with a proxy

A high-level representation of the entire setup, putting everything discussed in the previous sections together, is depicted in Figure 2 (below), with the reverse proxy at the heart of the design. The primary feature is the scalability from 1 to n number of devices, directly proportional to the previously discussed usage and overall network traffic conditions.

 

As another point of interest to many customers, load balancing via the Microsoft keystore using the Yubico-provided KSP is still being verified internally, so it is unclear at this stage whether it can be used. Regardless, the base C and Python libraries can be used in conjunction with the proxy when receiving requests; alternatively, a supported cryptographic API such as PKCS #11 is a possibility. If Yubico confirms support for direct load balancing via the KSP in the future, this would mean native compatibility with Microsoft products such as Microsoft ADFS and ADCS. Additional details will be shared if and when that comes to fruition.

 

The goal of the design is to funnel all system-wide requests into one centralised static IP address which is always online (i.e. the load balancer). In more complicated cases, there may even be multiple load balancers in place in order to perform higher levels of distribution, but for the purposes of this article, only one balancer is represented.

 

load-balanced-solution-i3.png

 

Figure 2. A high-level overview of the load balanced YubiHSM 2 design

 

Once a session is initiated from an application, the corresponding session ID must be mapped or passed along to the proxy in order to track the origin of the request. Responsibility then passes to the reverse proxy for diverting the user session into one of the satellite devices, and for maintaining traffic flow throughout the duration of the end-to-end session. As previously stated, since all of the YubiHSM 2 devices are logical copies, it should not matter which device is invoked.

 

The load balancing algorithm may be configured in whichever manner is deemed most suitable based on system requirements or network conditions. The design is purposely open-ended to promote flexibility, but a couple of suggestions when establishing sessions are:

 

  • A geographic or IP based scheme to always target the physically closest YubiHSM 2, especially where network latency is a major concern
  • Simply forwarding all requests to a single device until it is approaching theoretical capacity before establishing a new session on a parallel device
  • A very basic scheme where each new request will always correspond to the next device with the process wrapping around back to the start
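
Two of the suggestions above, the fill-first scheme and the basic wrap-around scheme, might be sketched as follows. The function names, device labels and the per-device load figures are assumptions for illustration; a real proxy would expose these choices through its own configuration instead.

```python
from itertools import cycle


def round_robin(devices):
    """Iterator that hands out devices in order, wrapping back to the start."""
    return cycle(devices)


def fill_first(devices, load_ms, next_op_ms, budget_ms=1000):
    """Pick the first device whose projected load stays within the budget.

    load_ms maps a device to the milliseconds of work already assigned to it
    for the current second; returns None when every device is full.
    """
    for device in devices:
        if load_ms.get(device, 0) + next_op_ms <= budget_ms:
            return device
    return None


devices = ["hsm-0", "hsm-1", "hsm-2"]
picker = round_robin(devices)
print([next(picker) for _ in range(4)])

# hsm-0 already carries 900 ms of work, so a 139 ms operation spills over.
print(fill_first(devices, {"hsm-0": 900}, next_op_ms=139))
```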

Finally, to reiterate the previously discussed identifier uniqueness requirement, each of the YubiHSM 2 devices on the network must be differentiated by IP address at the very least, but in practice, different port numbers ranging from X to X + n may also be beneficial for management purposes. There are two reasons for this:

 

  1. IP addresses may change over time due to shifting network requirements or even outages, and as such, cannot always be pre-configured as a point of difference. Port numbers do not suffer from this in most cases, and may in fact, be the more reliable aspect when enumerating and scaling the overall design
  2. In situations where multiple YubiHSM 2 devices are deployed on the same server, the port number becomes the only reliable identifier. This is because the server itself is only identifiable by a single IP address at the network layer.

As depicted in Figure 2, the sessions are numbered Y to Y + n, and correspond to each YubiHSM 2 from port X to X + n respectively.
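
As a small illustration of the port-based numbering, the look-up table for several devices on one server could be generated rather than typed by hand. The host address and the base port of 12345 used here are example values only, and the function name is hypothetical.

```python
def device_endpoints(host: str, base_port: int, count: int) -> list:
    """Endpoints for `count` devices on one host, using ports X to X + n."""
    return [f"http://{host}:{base_port + i}" for i in range(count)]


for endpoint in device_endpoints("10.0.0.5", 12345, 3):
    print(endpoint)
```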

 

Multiple devices as an approach to failover

It is even theoretically possible to create a traditional failover mechanism with multiple parallel YubiHSM 2 devices to cater for possible system or network outages. While the article has covered a design aimed at solving heavy traffic environments up to this point, the same principles should apply to a design aimed at providing operational redundancy. The configuration adjustment required for the latter scenario is to push traffic to a predetermined sub-set of primary devices, and to divert traffic to a secondary set only when no responses are detected from the primary set, instead of distributing traffic equally from the start.

 

Again, each implementation will require discretionary decisions to be made about how the final flows are achieved, and the provided Yubico coding libraries may be used in whichever manner is deemed most appropriate, alongside the load balancing configurations.

 

load-balanced-solution-i4.png

 

Figure 3. An alternative use case where the YubiHSM 2 device can be used in failover

 

Figure 3 (above) depicts a standard setup of one (1) or more YubiHSM 2 devices operating as per normal on the left side, with an (opaque) dormant secondary set on the right side. Again, note that the devices within the secondary set are logically identical to the devices within the primary set, but do not receive any requests under normal operating conditions.

 

The cut-over to the secondary set may be triggered manually by a network administrator reconfiguring the load balancer on the fly as issues in the primary set are detected, or ideally, by a pre-written script which activates if an interval of time elapses and no response(s) have been detected. The easiest way to achieve this is to implement a heartbeat-like status command firing from the load balancer to each individual YubiHSM 2 device in the primary set periodically, measuring the elapsed time between responses. If the length of time exceeds a predetermined threshold, an alert should be triggered and traffic rerouted as needed.
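
A heartbeat check of this kind could look like the sketch below. The probe is passed in as a function so that the example stays independent of any particular status command; the names, the two second threshold and the policy of switching only when every primary device is unresponsive are all assumptions made for illustration.

```python
import time
from typing import Callable, Sequence


def is_alive(probe: Callable[[str], bool], device: str, timeout_s: float) -> bool:
    """True if the probe succeeds and answers within the time threshold."""
    start = time.monotonic()
    try:
        ok = probe(device)
    except Exception:
        return False  # connection errors count as a missed heartbeat
    return bool(ok) and (time.monotonic() - start) <= timeout_s


def active_set(primary: Sequence[str], secondary: Sequence[str],
               probe: Callable[[str], bool], timeout_s: float = 2.0) -> list:
    """Healthy primary devices, or the secondary set if none of them respond."""
    healthy = [d for d in primary if is_alive(probe, d, timeout_s)]
    return healthy if healthy else list(secondary)


# Simulated probe: pretend that device "primary-1" has stopped responding.
def fake_probe(device: str) -> bool:
    return device != "primary-1"


print(active_set(["primary-1", "primary-2"], ["secondary-1"], fake_probe))
print(active_set(["primary-1"], ["secondary-1"], fake_probe))
```

Note that this sketch only measures how long the probe took; a real probe should enforce its own network timeout so that a hung device cannot block the check indefinitely.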