phd-thesis/chapter-smpc/chapter.tex

\chaptertitle{Case Study: Multiparty Computation in Scalable Hardware Security Modules}

\section{Fast MPC and Slow HSMs}

Multiparty Computation (MPC) is a cryptographic construct that allows several networked parties to jointly perform a
computation in such a way that the inputs to the computation remain private to the parties providing them, and no single
party must be trusted for the computation to produce the correct result. Conceptually, MPC is similar to a secret
sharing scheme that shares not just data, but computation between untrusted parties. The computation primitive MPC
offers is a cryptographic answer to the question of how to bootstrap trust in a computing system.

%The most challenging scenarios in computing arise when multiple
%parties such as manufacturers and operators, servers and clients, or sellers and buyers need to interact through
%computation. In many practical situations, it is impossible to create a single computer that can be trusted by every
%participant. MPC is a generic solution to a multitude of such scenarios reducing the problem of creating a single,
%shared computer everyone can trust simultaneously to everyone creating their own computer that they only can trust.

We can deconstruct the problem of trust in computing into two largely disjunct parts: Establishing trust in a computing
system during its creation is one, and maintaining this trust throughout its life is the other. For the second part of
this problem, maintaining trust in a system once trusted, we have an ample supply of good methods such as encryption,
authentication, and formally proven protocols. In contrast, establishing trust in a computing system is largely
intractable and despite a large corpus of academic research on approaches such as hardware trojan detection and
physicaly unclonable functions, only two approaches find practical adoption: In one, we build the system ourselves from
the ground up, making sure to leave no part vulnerable to third-party compromise. In the other, we go to a store and
physically buy a randomly-chosen computer using cash, assuming that while an attacker can target any particular system,
they cannot target all systems simultaneously and we give them too little time to target the system we buy.

A limitation of both approaches is that in either case, while the party creating or acquiring the system can trust it,
they cannot prove its trustworthiness to other parties. MPC solves this issue by allowing every party to contribute
their trusted system to the protocol, cryptographically bootstrapping common trust in the computation and its
output\footnote{
    In fact, MPC does more than just bootstrapping from each participant trusting their own system to a trusted shared
    computation. In an MPC protocol providing semi-honest or better security, MPC even \emph{relaxes} each party's trust
    requirement from trusting their own system to trusting that any $n$-of-$k$ out of all systems contributing to the
    protocol.
}.

MPC is a uniquely powerful cryptographic primitive, yet it has still not found widespread practical adoption. This is
because MPC is extremely resource-intensive to run. MPC protocols exist on a continuum trading off between extreme
memory and bandwidth requirements on one end and intense computational requirements on the other end. At a first glance,
MPC and Hardware Security Modules look like they would complement each other well, but HSMs cannot keep up with the
intense computational requirements posed by MPC.

Commercially available HSMs are quoted to perform between X and Y\todo{Look up number range} individual cryptographic
operations per second. Meanwhile, an MPC protocol doing something as simple as a single AES encryption, corresponding to
X\todo{look up numbers} logic gates or Y\todo{look up numbers} x86-64 instructions, requires
\emph{millions}\todo{Validate and add citation} of cryptographic operations when performed in MPC. As a result, applying
conventional HSMs to MPC at any practical scale is infeasible by multiple orders of magnitude.

HSMs are slow compared to contemporary computers because they are limited in their power dissipation, and power
dissipation is largely proportional to processing speed. In the limited fields where HSMs have found commercial
application, this limitation was never considered important and market forces pushing towards faster HSMs remain
light\todo{Can we find a citation here?}. Fundamentally, conventional HSMs must envelope the entire payload in a tamper
sensing mesh to detect drilling attacks, but a tamper sensing mesh that is impermeable to a drill is also impermeable to
air. As a result, any heat conducted from the HSMs processor to the outside world must pass through the mesh. At the
same time, the mesh cannot be thinned either because thinning it would enable micro-drilling attacks. The result of
these constraints is a high thermal resistance between the HSM's processor and an external heat sink, which limits
maximum power dissipation to a fraction of what is achieved in modern CPUs or even GPUs.

A secondary limitation of conventional HSMs is that the highly specialized tamper sensing foils used in their
construction often cannot be scaled to arbitray sizes without incurring unsustainable process yields due to the
multiplication of error rates with increasing area. As a result, even if the heat dissipation problem could be solved,
manufacturing the tamper sensing foil for a conventional HSM large enough to contain a more powerful CPU might not be
possible. The HSM's tamper-sensing envelope would have to protect not only the CPU itself, but also its supporting
components such as memory, power supplies and any internal heat spreading components.

Inertial HSMs solve this issue since they allow their payload to be air cooled without compromising security, and they
expand the feasible security boundary size from the several hundred milliliters offered by conventional HSMs to several
liters and more, enabling the integration of standard, off-the-shelf server components such as mainboards, CPUs, CPU
coolers, and power supplies. In this chapter, we will first provide a short overview of the theory of MPC before
elaborating a design of an IHSM tailored to MPC tasks including performance calculations and unique design aspects. We
will conclude with an outlook of applications unlocked by our design as well as promising areas for future improvements
of our design.

\section{The Fundamentals of Multiparty Computation}

Secure Multiparty Computation can be separated into two broad classes of approaches: Garbled Circuits, and Secret
Sharing-based techniques. Garbled Circuit techniques model the computation as a circuit of binary logic components such
as logic gates. They are well-suited for implementing cryptographic primitives such as conventional symmetric ciphers
such as AES or hash functions such as the SHA-2 series. Secret Sharing-based techniques model computation as an
arithmetic circuit made from components such as arithmetic operations. While they can also work in binary, they often
support operations on larger finite fields. Secret sharing-based techniques are efficient processing integer numbers,
but can have higher overhead in processing using many bitwise operations such as ciphers or cryptographic hash
functions\cite{evansPragmaticIntroductionSecure}.

\subsection{Security Models in MPC}

MPC schemes are usually evaluated assuming one of three adversary levels: \emph{Semi-Honest}, \emph{Covert} or
\emph{Malicious} adversaries. A \emph{Semi-Honest} adversary is an adversary that follows the protocol as specified, but
that outside the protocol's execution may collude arbitrarily with other parties to reveal the secret inputs of other
parties. A \emph{Covert} adversary is an adversary that additionally may cheat during the protocol's execution, but only
in ways that cannot be detected by other parties. Finally, a \emph{Malicious} adversary is one that can deviate from
the protocol's execution arbitrarily~\cite{aumannSecurityCovertAdversaries2010}. The covert adversary model most closely
captures the requirements of a real-world scenario where a small number of cooperating parties runs the protocol, since
in such settings cheating parties can easily be excluded once identified. The malicious adversary model captures
real-world settings where parties do not have stable identities such as peer-to-peer settings. The semi-honest model is
mostly interesting as a research tool since protocols assuming a semi-honest adversary can often be upgraded to covert
or malicious security at some performance tradeoff. In a practical setting, a semi-honest secure MPC protocol would not
provide additional security over just having one party run the computation except in some situations where inadvertent
side-channel leakage is a concern.

\subsection{Oblivious Transfer}

Before we can go into details of multiparty computation, we need to define an important primitive. Oblivious Transfer is
a cryptographic protocol between two parties that enables one party, the \emph{Sender}, to share one of two values to
the other party, the \emph{Receiver}. The Receiver can choose which of the two values it wants to receive by inputting a
choice bit $b$ into the protocol. After the protocol has been executed, the Receiver learns \emph{only} its chosen
value, and learns nothing about the other input value that the Sender provided. Meanwhile, the Sender learns nothing
about the choice bit $b$ of the receiver.

\subsubsection{OT extensions}

Oblivious Transfer is a public-key primitive, requiring asymmetric cryptographic operations. For this reason, ``raw''
Oblivious Transfer is not particularly fast. In practice, this issue is solved by applying a technique called Oblivious
Transfer Extensions (OTe)\cite{ishaiExtendingObliviousTransfers2003}. Using OTe, a fixed, small number of public-key
base Oblivious Transfer instances can be extended into an arbitrarily large number of Oblivious Transfer instances using
only invocations of a pseudo-random function (PRF) such as a cryptographic hash function.

\subsection{Boolean MPC with Yao's Garbled Circuits}
% Yao's Garbled Circuits
Yao's Garbled Circuits (GC) protocol is one of the oldest Multiparty Computation protocols, dating back to the 1980ies.
In Yao's GC, two parties jointly compute a function that is represented as a circuit of binary logic gates by evaluating
the circuit gate by gate. In Yao's GC, one party, generator, creates a random \emph{garbled} representation of the
circuit and sends it to the other party, the evaluator, who computes its output. The core idea in Yao's GC is that every
wire $w_i$ in the circuit is assigned two random cryptographic secret keys $w_i^b$, called wire labels, one $w_i^0$ for
the logical value $0$ and one $w_i^1$ for the value $1$. The mapping from logic values to these keys is assigned
randomly by the generator, and unknown to the evaluator~\cite{
    yaoHowGenerateExchange1986,
    beaverComplexitySecureProtocols1990,
    evansPragmaticIntroductionSecure
}.

Gates are represented in Yao's GC as truth tables with one row for every combination of input wire values. Each row of
these truth tables contains the output wire label (i.e. secret key) corresponding to the gate's logical output value for
the row's combination of input values. The output wire label in each row is encrypted with \emph{both} input wire labels
corresponding to this row as keys.

The generator must indicate to the evaluator which row of a gate's truth table to decrypt, while also avoiding leaking
the logical value of the output wire to the evaluator. This is commonly done in a technique called
\emph{point-and-permute} where a random pointer bit $p_i^b$ is appended to each wire $w_i^b$ label such that $p_i^b =
\neg p_i^{\neg b}$. The rows in the gate's truth table are ordered according to the combination of the two input wire
labels' pointer bits. When the evaluator obtains the two input wire labels, they obtain their pointer bits, which
combined are the index of the row to decrypt in the following gate's truth table.

It is clear how the protocol described above can be used to compute any binary circuit, but there are two questions
remaining: How do the two parties provide input into the circuit, and how do they decode the circuit's output? Output is
handled in Yao's GC by creating an output decoding table for every output wire of the circuit. The output decoding table
contains two rows, one for a logical $0$ output value and one for a logical $1$ output value. Each row contains the hash
of the output wire's label corresponding to the row's logical output value. This way, the evaluator can identify the
logical value knowing the output wire label, but is unable to deduce the output wire label from the output decoding
table.

Inputs are a bit more difficult to handle. While the generator can easily provide secret inputs by simply providing the
evaluator with the input wire labels corresponding to its input, inputs from the evaluator require oblivious transfer to
avoid leaking the evaluator's input to the generator. To input the logic bit $b$ on wire $w_i$, the generator and the
evaluator perform an 1-out-of-2 oblivious transfer with the generator assuming the Sender role and providing the two
input wire labels $w_i^0$ and $w_i^1$ as the two choices, and the evaluator submitting its chosen input bit $b$ as the
OT's choice bit.

Yao's GC has good performance since the only asymmetric cryptographic primitive used is in the Oblivious Transfer needed
for the evaluator's hidden inputs. The generation and the evaluation of the garbled circuit itself both only require
evaluations of a pseudorandom function such as a cryptographic hash or a cipher. Still, performing a computation using a
Garbled Circuit is many times slower than performing it in the clear. Intuitively, each single-bit gate in the garbled
circuit results in several cryptographic operations with input and output sizes of dozens or hundreds of bits.
Practically useful functions such as AES encryption have circuit implementations measuring thousands or tens of
thousands of gates, meaning these costs quickly escalate for practical problem sizes.
\cite{
    boyarNewCombinationalLogic2010,
    songhoriTinyGarbleHighlyCompressed2015
}

% FIXME This entire connecting section

%\subsection{Practical Application}
%\subsubsection{Preprocessing and Online Phases}
%\subsubsection{Constant-Round MPC}

% \subsection{Performance}

% zahurTwoHalvesMake2015,wangGlobalScaleSecureMultiparty2017,kellerMPSPDZVersatileFramework2020,dalskovFantasticFourHonestMajority

% \subsection{Practical Deployments}

% \subsection{Solutions}

% \subsection{Hardware Security Applied to MPC}

% Hardware security primitives can be applied in several roles in an MPC protocol.

\section{A High-Performance IHSM for MPC Applications}

Multiparty Computation is at the verge of being practical in some applications, but is still too computationally
expensive for others. While some attempts at GPU-accelerating MPC primitives exist, in practice it is commonly
implemented using CPU processing. From the performance numbers given above\todo{Give performance numbers above} we can
see that a single, modern server-class CPU is sufficient for an useful amount of computation in MPC.

A naive implementation might attempt to implement MPC using an HSM by simply offloading all cryptographic operations to
the HSM. In practice, this is not a workable solution due to the slow processing speed of conventional HSMs.
Conventional HSMs commonly use smartphone-class SoCs, which lag behind server CPUs in processing speed by several orders
of magnitude.
\todo{Cite some HSM/MPC papers here.}

In the near term, absent radical developments in either MPC theory or in the speed and power efficiency of processing
hardware, the only feasible solution for HSM-protected MPC at any practical scale is to find a way to protect an entire
server-class computer. As elaborated above, IHSMs are a natural fit for this requirement since they allow for large,
air-cooled payloads.

%\subsection{Hardware Requirements}

As a baseline performance target, we consider a commodity server mainboard in CEB or ATX form factor, populated with a
high-end server CPU and a large amount of RAM. As MPC systems do not usually require a great amount of storage, we can
largely ignore storage for our size and power calculations.\todo{Refer to performance numbers from research above here}

As a result, we end up with a total maximum power dissipation of approximately \todo{Calculate, make table} XXX as shown
in Table~\ref{tab_power_budget}. Dissipating this amount of power using air cooling is within the capabilities of
commodity server cooling components\todo{Refer to datasheets}.

\begin{table}
    \centering
    \begin{tabular}{r|l|r|r}
    Count & Component & Maximum Power Dissipation Each & Total\\\hline
    1 & CPU: & &\\
    16 & DDR-4 Memory modules: & &\\
    1 & Mainboard: & &\\
    1 & Power Supply: & &\\
    \end{tabular}
    \caption{Power budget of a modern mid-range server. Power supply power dissipation calculated at target load.}
    \label{tab_power_budget}
\end{table}

A common type of side-channel attack on cryptographic systems are power analysis attacks. In such attacks, the supply
current of the target processing system is measured at high speed while the target is performing cryptographic
computations. By aggregating the results of a large number of the resulting power traces, it is often possible to infer
the value of secret data such as cryptographic keys. To mitigate this type of attack, not only do we have to place the
CPU, mainboard, and memory inside of the HSM's tamper-sensing barrier, but also the power supply. A secondary benefit of
placing the power supply inside the tamper-sensing barrier is that it simplifies the power wiring between the outside of
the IHSM cage and the payload. Supplying the \qty{12}{\volt} power rails that commodity mainboard commonly use requires
tens of Ampere. To carrie such high current, the wiring has to be sized accordingly. In an IHSM, even thick wires can
easily be passed through the mesh cage, but such wiring requires a large opening at the shaft on one end of the cage,
which creates a literal security gap. Placing the power supply inside of the cage reduces the size of the wires needed
since the power supply steps down a lower current \qty{240}{\volt} input to the system's high-current \qty{12}{\volt}
rails. According to DIN VDE 0298-4\todo{Citation?}, a pair of \qty{1.5}{\milli\meter^2} conductors is sufficient for
more than \qty{3}{\kilo\watt} of load under worst-case conditions.

\subsection{Software Considerations}

While the hardware of a HSM-assisted MPC system is a straightforward application of IHSM technology to a server
platform, the software implementation poses some unique challenges. A core concern in an IHSM based on commodity
hardware running a commodity operating system is the concrete implementation of the IHSM's alarm reseponse. When the
IHSM detects tampering, it is crucial that all secrets in the payload have been made unusable before an attacker can
either extract them, or stop the system from making them unusable.

Making secret data unusable to an attacker can be done by either deleting it (\emph{zeroization} in HSM terminology) or
by encrypting the data when it is stored and destroying the key. Zeroization is more technically challenging, while
encryption comes at a performance cost. The main challenges in zeroization are ensuring that the data is deleted fast
enough, and making sure it cannot be reconstructed through data remanence effects. Zeroization is practical for small
amounts of RAM such as a microcontroller's main SRAM or a small amount of DRAM in a server set aside for cryptographic
keys. For larger amounts of data, or for any data stored on flash memory, HDDs etc., encryption is the only viable
option due to speed. Encrypting data at rest on HDD or flash storage is straightforward to do in software and is a
standard feature in any modern operating system. RAM is more challenging. Encrypting data in RAM cannot be done in
software without effectively running a system emulation and incurring a massive performance penalty. Recent CPUs from
Intel and AMD contain hardware features that provide transparent DRAM encryption. These hardware features would be
necessary when securing an entire sever in an MPC setup with IHSMS technology.

% \subsection{Fast Zeroization of Non-Customizable Memories}
% Thermite experiements and paper

\subsection{A Joint Cooling and IHSM Envelope Powertrain}

We have determined that our requirements are an IHSM envelope large enough to fit a small server mainboard, and that
provides air cooling to the payload. In this section, we will sketch out a solution that solves the engineering issue of
moving such an IHSM envelope while simultaneously providing cooling to the payload.

% FIXME picture!
Our proposed design is based on the idea of using the cooling fans' airflow to power the rotation of the IHSM envelope.
Using the basic cylindrical design, the IHSM envelope consists of two discs above and below the payload that are
connected through vertical struts containing part of the tamper-sensing mesh on the outside of the payload. We propose
widening these vertical connecting struts, and angling them such that the entire envelope becomes a centrifugal
impeller. By letting air flow into the envelope from the side, and back out through its top and bottom, the envelope
assumes the same configuration used in centrifugal cooling fans. A secondary advantage of this concept is that we do not
need a motor on the envelope's shaft, saving vertical space and one difficult to source part. Furthermore, the cooling
fans can be located on the outside of the envelope in an easily accessible location, and can be set up in a redundant
way such that a failed cooling fan can be replaced while the system continues operation. The only disadvantage of this
solution over a direct motor drive is noise. To achieve the speed necessary for sufficient security at the large
envelope diameter of an MPC accelerator application, high-airflow fans must be used, which are very noisy when at full
speed. We consider this a valid tradeoff since such a system would be deployed in a datacenter where high noise levels
are acceptable.

\section{Outlook}

In this chapter we briefly introduced the challenges raised by MPC at scale, and we outlined a practical solution based
on an IHSM-protected sever that can be used to perform MPC with an unique combination of high bandwidth and low latency
that was previously considered impractical due to physical security concerns.