master-thesis/paper/safety-reset-paper.tex

\documentclass[sigconf]{acmart}

\usepackage[binary-units]{siunitx}
\DeclareSIUnit{\baud}{Bd}
\DeclareSIUnit{\year}{a}
\usepackage{graphicx,color}
\usepackage{subcaption}
\usepackage{array}
\usepackage{hyperref}
\usepackage{enumitem}

\renewcommand{\floatpagefraction}{.8}
\newcommand{\degree}{\ensuremath{^\circ}}
\newcolumntype{P}[1]{>{\centering\arraybackslash}p{#1}}
\newcommand{\partnum}[1]{\texttt{#1}}

% https://eepublicdownloads.entsoe.eu/clean-documents/pre2015/publications/entsoe/Operation_Handbook/Policy_1_Appendix%20_final.pdf

%\keywords{Security, privacy and resilience in critical infrastructures \and Security and privacy in ``internet of
%things'' \and Cyber-physical systems \and Hardware security \and Network Security \and Energy systems \and Signal theory}

\copyrightyear{2022}
\acmYear{2022}
\setcopyright{rightsretained}
\acmConference[ACSAC]{Annual Computer Security Applications Conference}{December 5--9, 2022}{Austin, TX, USA}
\acmBooktitle{Annual Computer Security Applications Conference (ACSAC), December 5--9, 2022, Austin, TX, USA}
\acmDOI{10.1145/3564625.3564640}
\acmISBN{978-1-4503-9759-9/22/12}

\begin{document}

\acmConference[ACSAC '22]{Annual Computer Security Applications
Conference}{December 5--9}{Austin, TX, USA}

\title{
  Ripples in the Pond: Transmitting Information through Grid Frequency Modulation
}

\author{Jan Sebastian Götte}
\affiliation{
  \institution{Technische Universität Darmstadt}
  \city{Darmstadt}
  \country{Germany}
}
\email{research@jaseg.de}

\author{Liran Katzir}
\affiliation{
  \institution{Tel Aviv University}
  \city{Tel Aviv}
  \country{Israel}
}
\email{lirankat@tau.ac.il}

\author{Björn Scheuermann}
\affiliation{
  \institution{Technische Universität Darmstadt}
  \city{Darmstadt}
  \country{Germany}
}
\email{scheuermann@kom.tu-darmstadt.de}

\renewcommand{\shortauthors}{Götte, Katzir and Scheuermann}
\begin{CCSXML}
<ccs2012>
<concept>
<concept_id>10010583.10010662.10010668.10010671</concept_id>
<concept_desc>Hardware~Power networks</concept_desc>
<concept_significance>500</concept_significance>
</concept>
<concept>
<concept_id>10010583.10010662.10010668.10010672</concept_id>
<concept_desc>Hardware~Smart grid</concept_desc>
<concept_significance>300</concept_significance>
</concept>
<concept>
<concept_id>10010583.10010750.10010769</concept_id>
<concept_desc>Hardware~Safety critical systems</concept_desc>
<concept_significance>500</concept_significance>
</concept>
<concept>
<concept_id>10010520.10010553.10010562.10010561</concept_id>
<concept_desc>Computer systems organization~Firmware</concept_desc>
<concept_significance>300</concept_significance>
</concept>
<concept>
<concept_id>10010520.10010553.10010562.10010563</concept_id>
<concept_desc>Computer systems organization~Embedded hardware</concept_desc>
<concept_significance>300</concept_significance>
</concept>
<concept>
<concept_id>10002978.10002997.10002998</concept_id>
<concept_desc>Security and privacy~Malware and its mitigation</concept_desc>
<concept_significance>300</concept_significance>
</concept>
<concept>
<concept_id>10002978.10003001.10003003</concept_id>
<concept_desc>Security and privacy~Embedded systems security</concept_desc>
<concept_significance>500</concept_significance>
</concept>
<concept>
<concept_id>10002978.10003001.10003599.10011621</concept_id>
<concept_desc>Security and privacy~Hardware-based security protocols</concept_desc>
<concept_significance>300</concept_significance>
</concept>
</ccs2012>
\end{CCSXML}

\ccsdesc[500]{Hardware~Power networks}
\ccsdesc[300]{Hardware~Smart grid}
\ccsdesc[500]{Hardware~Safety critical systems}
\ccsdesc[300]{Security and privacy~Malware and its mitigation}
\ccsdesc[500]{Security and privacy~Embedded systems security}
\ccsdesc[300]{Security and privacy~Hardware-based security protocols}

\begin{abstract}
    The growing heterogenous ecosystem of networked consumer devices such as smart meters or IoT-connected appliances
    such as air conditioners is difficult to secure, unlike the utility side of the grid which can be defended
    effectively through rigorous IT security measures such as isolated control networks. In this paper, we consider a
    crisis scenario in which an attacker compromises a large number of consumer-side devices and modulates their
    electrical power to destabilize the grid and cause an electrical
    outage~\cite{ctap+11,wu01,zlmz+21,kgma21,smp18,hcb19}.

    In this paper propose a broadcast channel based on the modulation of grid frequency through which utility operators
    can issue commands to devices at the consumer premises both during an attack for mitigation and in its wake to aid
    recovery. Our proposed grid frequency modulation (GFM) channel is independent of other telecommunication networks.
    It is resilient towards localized blackouts and it is operational immediately after power is restored.

    Based on our GFM broadcast channel we propose a ``safety reset'' system to mitigate an ongoing attack by disabling a
    device's network interfaces and resetting its control functions. It can also be used in the wake of an attack to aid
    recovery by shutting down non-essential loads to reduce strain on the grid.

    To validate our proposed design, we conducted simulations based on measured grid frequency behavior. Based on these
    simulations, we performed an experimental validation on simulated grid voltage waveforms using a smart meter
    equipped with a prototype safety reset system based on a commodity microcontroller.
\end{abstract}

\maketitle

\section{Introduction}

With the rollout of the smart grid, the IT security of electrical infrastructure has attracted increased attention in
the last years. Smart Grid security has two major components: The security of central SCADA systems, and the security
of equipment at the consumer premises such as smart meters and IoT devices. While there is previous work on both sides,
their interactions have not yet received much attention.

We consider the previously proposed scenario where a large number of compromised consumer devices is used alone or in
conjunction with an attack on the grid's central SCADA systems to destabilize the grid by rapidly modulating the total
connected load~\cite{ctap+11,wu01,zlmz+21,kgma21,smp18,hcb19}. Several devices have been identified as likely targets
for such an attack including smart meters with integrated remote disconnect switches~\cite{ctap+11,anderson01}, large
IoT-connected appliances~\cite{smp18,hcb19,chl20,olkd20} and electric vehicle chargers~\cite{kgma21,zlmz+21,olkd20}.
Such attacks are hard to mitigate, and existing literature focuses on hardening grid control
systems~\cite{kgma21,lzlw+20,lam21,zlmz+21} and device firmware\cite{mpdm+10,smp18,zb20,yomu+20} to prevent compromise.
Despite the infeasibility of perfect firmware security, there is little research on \emph{post-compromise} mitigation
approaches. A core issue with post-attack mitigation is that network connections such as internet and cellular networks
between the utility and devices on consumer premises may not work due to the attack. Thus, mitigation strategies that
involve devices on the consumer premises will need an out-of-band communication channel.

In this paper, we propose a novel, resilient, grid-wide communication technique based on \emph{grid frequency
modulation} (GFM) that can be used to broadcast short messages to all devices connected to the electrical grid. The grid
frequency modulation channel is robust and can be used even during an ongoing attack. Based on our channel we propose
the \emph{safety reset} controller, an attack mitigation technique that is compatible with most smart meter and IoT
device designs. A safety reset controller is a separate controller integrated with the device that awaits an out-of-band
reset command transmitted through GFM. Upon reception of the reset command, it puts the device into a safe state (e.g.
\emph{heater off} or \emph{light on}) that interrupts attacker control over the device. To reduce attack surface and
cost, the safety reset controller is separated from the system's main application controller and does not have any
conventional network interfaces.

The grid frequency modulation channel can be operated by transmission system operators (TSOs) even during black-start
recovery procedures and it bridges the gap between the TSO's private control network and consumer devices that can not
economically be equipped with other resilient communication techniques such as satellite transceivers. To demonstrate
our proposed channel, we have implemented a system that transmits error-corrected and cryptographically secured commands
through an emulated grid frequency-modulated voltage waveform to an off-the-shelf smart meter equipped with a prototype
safety reset controller based on a small off-the-shelf microcontroller.

The frequency behavior of the electrical grid can be analyzed by examining the grid as a large collection of mechanical
oscillators coupled through the grid via the electromotive force~\cite{rogers01,wcje+12}. The generators and motors that
are electromagnetically coupled through the grid's transmission lines and transformers run synchronously with each
other, with only minor localized variations in their rotation angle. The dynamic behavior of grid frequency is a direct
product of this electromechanical coupling: With increasing load, frequency drops because turbines move slower under
higher torque, and consequentially with decreasing load frequency rises. Industrial control systems keep frequency close
to its nominal value over time spans of minutes or hours, but over shorter time spans the combined inertia of all
grid-connected generators and motors is what regulates frequency.

Grid frequency modulation works by quickly modulating the power of a large, grid-connected load or generator. When this
modulation is at low amplitude and high frequency, it is below the thresholds set for the grid's automated control
systems and monitoring systems and it will directly affect frequency according to the grid's inertia. GFM differs from
traditional Powerline Communication (PLC) systems in that it works at much lower frequencies, it directly modulates the
grid's fundamental frequency instead of superimposing an additional signal on top of it, and by nature it reaches every
device within one synchronous area as the signal is embedded into the fundamental grid frequency. Traditional PLC uses a
superimposed voltage, which is quickly attenuated across long distances. Practically speaking, using GFM a single large
transmitter can cover an entire synchronous area, while in traditional PLC hundreds or thousands of smaller transmitters
would be necessary. Unlike traditional PLC, any large industrial load that allows for fast computer control with slew
rates in the order of several percent of total load per second can act as a GFM transmitter with minimal or no hardware
modifications.

\begin{figure}
    \centering
    \includegraphics[width=0.4\textwidth]{flowchart}
    \caption{Structural overview of our concept. 1 - Government authority or utility operations center. 2 - Emergency
    radio link. 3 - Aluminium smelter. 4 - Electrical grid. 5 - Target smart meter.}
    \Description{A schematic overview of the safety reset system with its parts represented by icons. A signal is sent
    from a radio tower next to a government building to a radio tower next to a factory. The factory forwards this
    signal to the electrical grid, where it is transmitted through a series of transformers to a smart meter at a
    residential building.}
    \label{fig_intro_flowchart}
\end{figure}

Figure~\ref{fig_intro_flowchart} shows an overview of our concept using a smart meter as the target device and a large
aluminium smelter temporarily re-purposed as a GFM transmitter.  Two scenarios for its application are before or during
a cyber attack, to stop an attack on the electrical grid in its tracks, and after an attack while power is being
restored to prevent a repeated attack. In both scenarios, our concept is independent of telecommunication networks (such
as the internet or cellular networks) as well as broadcast systems (such as cable television or terrestrial broadcast
radio) while requiring only inexpensive signal processing hardware and no external antennas (such as are needed for
satellite communication). A grid frequency-based system can function as long as power is still available, or as soon as
power is restored after the attack. One powerful function this allows is ``flushing out`` an attacker from compromised
smart meters after an attack, before restoring smart meter internet connectivity.

Using simulations we have determined that control of a $\SI{25}{\mega\watt}$ load such as a large aluminium smelter,
load bank or photovoltaic farm would allow for the transmission of a cryptographically secured safety reset signal
within $15$ minutes. We have designed and constructed a proof-of-concept prototype receiver that demonstrates the
feasibility of decoding such signals on a resource-constrained microcontroller.

\subsection{Motivation}

Consumer devices are increasingly becoming \emph{smart}. Large numbers of IoT devices are connected through the public
internet, and in several countries internet-connected Smart Meters can disconnect entire households from the grid in
case of unpaid bills~\cite{anderson01}. The increasing proliferation of smart devices on the consumer side presents an
opportunity to grid operators, who rely on forecasts for the cost-optimized control of generation and power flow. The
core of the \emph{Smart Grid} vision is that utilities can now gather detailed data for more accurate consumption
forecasts, and in some cases can even adjust parameters of large devices like water heaters to smooth out load spikes.

However, this increased degree of visibility and control comes with an increased IT security risk. In this paper we
focus on scenarios where an attacker compromises a large number of grid-connected remote-controllable devices. This may
be simple smart home devices such as IoT-connected air conditioners, but it may also include Smart Meters that are
outfitted with a remote disconnect switch as is common in some countries. By rapidly switching large numbers of such
devices in a coordinated manner, the attacker has the opportunity to de-stabilize the electrical
grid~\cite{zlmz+21,kgma21,smp18,hcb19}.

In this paper, we focus on assisting the recovery procedure after a successful attack because we estimate that this
approach will yield a better return of investment in overall grid stability versus resources spent on security
measures compared to bug hunting in device firmware. Previous work on IoT and Smart Grid security has focused on the
prevention of attacks though firmware security measures. While research on prevention is important, we estimate that its
practical impact will be limited by the diversity of implementations found in the field~\cite{nbck+19,zlmz+21,smp18}. We
predict that it would be a Sisyphean task to secure the firmware of a number of devices devices sufficient to deny an
attacker the critical mass needed to cause trouble. Even if all flaws in the firmware of a broad range of devices would
be fixed, users still have to update. In smart grid and IoT devices, this presents a difficult problem since user
awareness is low~\cite{nbck+19}.

\subsection{Attacker model}

According to the above criteria, our attacker model has the following key features:

\begin{itemize}
    \item The attacker cannot compromise the utility operators' SCADA systems.
    \item The attacker can compromise and subsequently control a large number of target devices at the customer's
        premises such as smart meters or large IoT devices such as air conditioners or central heating systems.
    \item Devices that may become targets of attacks can be designed to include a separate firmware and factory reset
        function that the attacker cannot circumvent. In the simplest case, this could be a separate microcontroller
        that is connected to an in-system programming interface of the device's application processor.
\end{itemize}

\subsection{Contents}

Starting from a high level architecture, we have carried out simulations of our concept's performance under real-world
conditions using measured grid frequency data. Based on these simulations we implemented an end-to-end prototype of our
proposed safety reset controller as part of a realistic smart meter demonstrator. Finally, we experimentally validated
our results based on a simulated mains voltage signal and we will conclude with an outline of further steps towards a
practical implementation.

This work contains the following contributions:
\begin{enumerate}[topsep=4pt]
    \item We introduce Grid Frequency Modulation (GFM) as a communication primitive. % FIXME done before in that one paper
    \item We elaborate the fundamental physics underlying GFM and theorize on the constrains of a practical
        implementation.
    \item We design a communication system based on GFM.
    \item We carry out extensive simulations of our systems to determine its performance characteristics.
\end{enumerate}

%\subsection{Notation}
% FIXME drop or rework this section ; actually update notation to be consistent throughout
%To a computer scientist there is one confusing aspect to the theory of grid frequency modulation. GFM can be seen as a
%frequency modulation (FM) with a baseband signal in the band below approximately $f_m = \SI{5}{\hertz}$ that is
%modulated on top of a carrier signal at $f_c = \SI{50}{\hertz}$ in case of the European electrical grid. The frequency
%deviation $f_\Delta$ that the modulated carrier deviates from its nominal value of $f_m$ is very small at only a few
%milli-Hertz.
%
%When grid frequency is measured by first digitizing the mains voltage waveform, then de-modulating digitally, the FM's
%signal-to-noise ratio (SNR) is very high and is dominated by the ADC's quantization noise and nearby mains voltage noise
%sources such as resistive droop due to large inrush current of nearby machines.
%
%Note that both the carrier signal at $f_c$ and the modulation signal at $f_m$ both have unit Hertz. To disambiguate
%them, in this paper we will use \textbf{bold} letters to refer to the carrier waveform $\mathbf{U}$ or frequency
%$\mathbf{f_c}$ as well as its deviation $\mathbf{f_\Delta}$, and we will use normal weight for the actual modulation
%signal and its properties such as $f_m$.

\section{Background on the electrical grid}
\subsection{Components and interactions}

The electrical grid transmits electrical power from generators to loads through alternating current. Any device that is
connected to the grid must run \emph{synchronous} with the grid, i.e.\ it must produce or consume power following the
grid's voltage waveform. In generators and motors, the electromotive force acts to synchronize the device with the grid.
Connecting a generator that has not been synchronized to the grid leads to large currents flowing through the
generator's windings, inducing extreme forces that can mechanically destroy the generator. Similarly, if the inverters
of a solar power station would try to fight the grid, the grid would win and the inverters' power semiconductors would
release their magic smoke.

Originally, all power sources on the grid were synchronous rotating generators. Today, the shift towards renewable
energy and the introduction of high-voltage DC links has led to some of the grid's generating capacity being replaced
with inverters that electronically emulate the grid's voltage waveform to efficiently convert a DC input to the grid's
alternating current.

The generators and loads on the grid are linked through a complex network of transmission lines. Transformers are used
to couple between transmission lines operating at different voltage levels, and several types of switches allow
utilities to steer power flow throughout this network. Through the electromotive force, all synchronous generators
connected to the grid are electromechanically coupled. Transmission lines introduce a (small) phase delay to the
electric fields traversing the grid, but besides local differences in phase, all parts of the grid are synchronous.

\subsection{Grid frequency behavior}

On the electrical grid, generation and consumption of energy must be precisely matched at all times for the grid to stay
at a constant, synchronous frequency. If generation outpaces consumption, generators would provide less mechanical
resistance to their source of mechanical power, or \emph{prime mover}, which would lead the generators to spin faster
and faster. Similarly, if consumption outpaced production, the increased mechanical load would slow down generators,
ultimately leading to a collapse.

In day-to-day operation, the frequency of the electrical grid is maintained at a fixed, stable level through several
layers of control systems on top of the grid's inherent mechanical inertia. Fast-acting automatic primary control
stabilizes temporary frequency excursions, while slower automatic secondary control and manual tertiary control
re-adjust device's operating points back to their nominal values after they have shifted due to primary control action.

\subsection{Black-start recovery}

To function, the grid relies on a delicate balance between electricity generation, transmission and consumption.  When
this balance is disturbed, cascading failures can occur and because this balance must be kept in balance at all times,
the recovery from a large-scale power outage is a complex operational challenge. Since all consumers and producers that
are connected to the electrical grid are physically coupled through the electromotive force, a fault in one part of the
grid affects all devices connected across the grid. A transmission line shutting off can lead other, nearby lines to
overload and shut off, and a generator or consumer suddenly shutting off causes a transient in the grid's frequency. If
the frequency goes too far out of bounds, protection devices take power plants and large industrial loads offline.

The recovery from a large-scale outage requires the grid's operators to bring generators and loads back online one by
one while continuously maintaining balance between generation and consumption to avoid their protection devices shutting
them down again. To coordinate this process, transmission system operators cannot rely on the public internet or
cellular networks, as they may not work during a large-scale power outage. Instead, they maintain private communication
infrastructure using dedicated lines rented from telecommunication providers, fibers run along transmission lines, and
dedicated radio links.

To start from a complete outage, first a number of \emph{black start}-capable power stations that can start by
themselves without any external power are brought online. With their help, other power stations and consumers are
gradually brought online until a part of the grid has been restored to nominal operation. This process can be performed
simultaneously in different parts of the grid. After these \emph{islands} have been restored, they can then be
synchronized and re-joined to restore the grid to its normal state.

\subsection{Demand-side response and Smart Metering}

Maintaining the balance between electricity generation and consumption under varying load conditions is critical.
Utilities can access different energy sources, each of which have their own trade-off in response speed versus energy
cost. For instance, the availability of wind and solar power cannot be controlled at all, while hydroelectric power
plants can quickly regulate the speed and power output of their turbines. Combined with the complex layout of the grid's
infrastructure such as transmission lines, these economical factors lead to a complex optimization problem, the quality
of whose solution directly manifests itself in the utility's bottom line.

For decades, one solution to this issue has been demand-side response (DSR)~\cite{rs48}. In DSR, large loads such as
water heaters are centrally controlled by the utility to switch on outside of peak demand. Since the precise timing of
these loads is of no consequence to their user, users are happy to get slightly better prices from their utility while
utilities gain a degree of control allowing them to optimize their network's performance. As part of the smart grid
vision, DSR will be utilized in a larger fraction of consumer devices.

A core component of the smart grid is the rollout of ``Advanced Metering Infrastructure'' (AMI), colloquially known as
smart meters. Smart meters are electricity meters that use a real-time communication interface to automatically transmit
high-resolution measurements to the utility. In contrast to the yearly reading schedule of traditional electricity
meters, smart meters can provide near-realtime data that the utility can use for more accurate load forecasting.

\subsection{Powerline Communication (PLC)}

A core issue in smart metering and demand-side response is the communication channel from the meter to the greater
world. Smart meters are cost-constrained devices, which limits the use of landline internet or cellular connections.
Additionally, electricity meters are often installed in basements, far away from the customer's router and with soil and
concrete blocking radio signals. For these reasons, in some AMI deployments, powerline communication (PLC) has been
chosen for the meters' uplink.

Since the early days of the electrical grid, powerline communication has been used to control devices spread throughout
the grid from a central transmitter~\cite{rs48}. PLC systems super-impose a modulated higher-frequency signal on top of
the grid voltage. When the carrier frequency of this modulation is in the audible frequency range, low data
rates can be transmitted over distances of several tens of kilometers. By using a radio frequency carrier, higher data
rates can be achieved across shorter distances\cite{pvyh03}. Audio frequency PLC, called ``ripple control'', is still
used today by utilities for demand-side response, remote-controlling special water heaters to avoid times of
peak electricity demand.

Powerline communication systems are usually uni-directional, but there are instances of bi-directional powerline
communication for smart meter reading~\cite{ec03,rs48,gungor01,agf16}.

\section{Related work}
\label{sec_related_work}

\subsection{IoT and Smart Grid security}

The security of IoT devices as well as the smart grid has received extensive attention in the
literature~\cite{nbck+19,acsc20,smp18,ykll17,anderson01,anderson02,zlmz+21,kgma21,hcb19,mpdm+10,lzlw+20,chl20,lam21,olkd20,yomu+20}.
The challenges of IoT device security and the security of smart meters and other smart grid devices are similar because
smart grid devices are essentially IoT devices in a particularly sensitive location~\cite{zheng01,ifixit01,acsc20}. In
both device types, the challenge is that securing embedded firmware is difficult, and adding network interfaces and cost
constraints only makes the task harder.

In some countries, smart meters can have a built-in off-switch that is used to disconnect customers who do not pay their
electricity bill. An attack scenario in which the attacker compromises a large number of such meters has been discussed
by Anderson and Fuloria in~\cite{anderson01}. In meters that do not have such a switch, an attacker can still use their
access to manipulate the meter's energy accounting, leading to financial impact on the utility operating the meter. This
scenario has received research attention~\cite{anderson02,mcdaniel01} and comes with the most direct industry
incentives.

In~\cite{smp18}, Soltan, Mittal and Poor investigated an attack scenario where an attacker first gains control over a
large number of high wattage devices through an IoT security vulnerability, then uses this control to cause rapid load
spikes. The researchers performed computer simulations for a range of parameters and concluded that an attacker
controlling 200 - 300 devices of $\SI{1}{\kilo\watt}$ each per megawatt of total grid power (equivalent to
30\% of total connected power) can cause a large-scale blackout in a healthy grid, while 10 such compromised
devices per megawatt (1\% of total power) are enough to cause cascading line failures that may ultimately lead
up to a large-scale blackout.

In~\cite{hcb19}, Huang, Cardenas and Baldick raised a counter-point to the conclusions of Soltan et al., arguing that
limitations of their simulations in~\cite{smp18} have lead them to over-estimate the severity of an attack. Using a
model tailored to accurately represent the grid's protection mechanisms, they found that due to the action of protection
systems such as load shedding and over frequency protection, large attacks of 30\% of total grid power are likely to
cause only localized blackouts and the decay of the grid into islands, instead of a large-scale blackout. Smaller attack
sizes between 1\% and 10\% were mostly harmless in their simulations.

From literature, we get the overall impression that both IoT and Smart Grid security are challenging. Both lack behind
the security standard of state of the art desktop, server and smartphone operating systems. Reasons for this are the
relatively recent nature of the IoT software ecosystem and the large number of independent implementations. A unique
challenge to Smart Grid security is that due to the fragmentation of markets along national borders, certain devices
such as smart meters or DSR implementations exist in large monocultures.

Smart meters are consumer devices built down to a price and manufacturers' firmware security R\&D budgets are limited by
the high degree of market fragmentation that is caused by mutually incompatible national smart metering standards.
Landis+Gyr, a large utility meter manufacturer, state in their 2019 annual report that they invested \SI{36}{\percent}
of their total R\&D budget on embedded software while spending only \SI{24}{\percent} on hardware
R\&D~\cite{landisgyr01,landisgyr02}, which indicates tension between firmware security and the manufacturers's bottom
line.

Compared to IoT and Smart Grid devices, the embedded firmware foundations of modern smartphones have received more
attention both from the industry and from academia. Pinto and Santos in~\cite{pinto01} conducted a survey of
implementations based on ARM's TrustZone embedded virtualization architecture and found a significant number of reported
vulnerabilities across different implementations. For instance, Rosenberg in~\cite{rosenberg01} found critical issues in
Qualcomm's QSEE hypervisor, and Kanonov and Wool in~\cite{kanonov01} identified a number of design weaknesses and
security vulnerabilities in Samsung's competing KNOX virtualization product. To us, the state of the field of embedded
security indicates that even if significant effort is spent on the security of IoT and Smart Grid devices to catch up
with desktop, server and smartphone security, significant vulnerabilities are likely to remain for some time to come.
In this instance, market forces do not align with the interest of the public at large. Vulnerabilities remain likely,
especially in code implementing complex network protocols such as TLS~\cite{georgiev01}, which may even be mandated by
national standards in some devices such as smart meters.

%\subsection{Reliably resetting an IoT or Smart Grid device}

\subsection{Oscillations in the electrical grid}

Common to the attacks on the electrical grid proposed in the papers discussed above is their approach of overloading
parts of the grid. However, scenarios have been proposed that go beyond a simple overload condition, in which an
attacker instead carefully exploits the physical characteristics of the grid to cause oscillations of increasing
amplitude, ultimately triggering a cascade of protection mechanisms. The purpose of this type of attack is to use a
small controllable load to cause outsized damage.

Electro-mechanical oscillation modes between different geographical areas of an electrical grid are a well-known
phenomenon. In their book~\cite{rogers01}, Rogers and Graham provide an in-depth analysis of these oscillations and
their mitigation. In~\cite{grebe01}, Grebe, Kabouris, López Barba et al.\ analyzed modes inherent to the
continental European grid. A report on an event where an oscillation on one such mode caused a problem can be found in
\cite{entsoe01}.

In~\cite{zlmz+21}, Zou, Liu, Ma et al.\ analyzed the possibility of a modal attack in which electric vehicle chargers
rapidly modulate their power to force an oscillation of a poorly dampened wide-area electromechanical mode. In their
model an attacker compromises a backend smart grid control system that controls a large number of EV chargers. Using
mathematical analysis, small-scale simulations and limited practical experiments they validated the attack scenario and
developed a countermeasure that can be implemented as part of generator control systems and that when activated can
suppress forced oscillations of wide-area electromechanical modes.

\subsection{Proposed Countermeasures}

In parallel with research on theoretical attacks, countermeasures to these have also been proposed in academic
literature. In~\cite{kgma21}, the authors propose an extension to grid control algorithms aimed at increasing the grid's
robustness towards forced oscillations. In~\cite{smp18}, the authors propose that utility operators use a detailed
attacker model to engineer additional safety margins into the grid while minimizing the economic inefficiency of these
measures. On the IoT side, they note that due to the wide implementation diversity, the problem cannot be solved by
individual measures and propose additional fundamental research on IoT device security.

In~\cite{hcb19}, the authors conclude that simple demand attacks where compromised loads suddenly increase demand are
adequately mitigated by existing safety measures, in particular \emph{Under-Frequency Load Shedding} (UFLS), which forms
the basis of any grid's automatic emergency response. As part of UFLS, during a contingency the utility will
progressively disconnected loads according to set priorities until the production / generation balance has been restored
and a blackout has been averted.

% FIXME more sources!

\section{Grid Frequency as a Communication Channel}

The countermeasures discussed above are fully automatic. Such systems can provide a good first line of defense, but they
must be complemented by means of manual intervention since not every eventuality can be anticipated. During a
large-scale cyber attack, availability of internet and cellular connectivity cannot be relied upon. An attacker may
already have disabled such systems in a separate attack, or they may go down along with parts of the electrical grid.
Powerline communication systems will likely be unaffected by an attack, but at a range of no more than several tens of
kilometers, covering the entire grid would require a large upfront infrastructure investment for transmitters.

We propose to approach the problem of broadcasting an emergency control signal to all grid-connected devices such as
smart meters or IoT appliances within a synchronous area by using grid frequency as a communication channel.  Despite
the technological complexity of the grid, the physics underlying its response to changes in load and generation is
surprisingly simple. Individual machines (loads and generators) can be approximated by a small number of differential
equations describing their control systems' interaction with the machines' physics, and the entire grid can be modelled
by aggregating these approximations into a large system of differential equations. As a consequence, small signal
changes in generation/consumption power balance cause an approximately proportional change in
frequency~\cite{kundur01,crastan03,entsoe02,entsoe04}. The slope of this first-order approximation is known as
\emph{Power Frequency Characteristic}, and in case of the continental European synchronous area happens to be about
\SI{25}{\giga\watt\per\hertz}  according to the European electricity grid authority, ENTSO-E.

If we modulate the power consumption of a large load, this modulation will result in a small change in frequency
according to that characteristic. As long as we stay within the operational limits set by
ENTSO-E~\cite{entsoe02,entsoe03}, this change will not degrade the operation of other parts of the grid. The advantages
of grid frequency modulation are the fact that a single transmitter can cover an entire synchronous area as well as low
receiver hardware complexity.

To the best of the authors' knowledge, grid frequency modulation has only ever been proposed as a communication channel
at very small scales in microgrids before~\cite{urtasun01} and has not yet been considered for large-scale application.

\subsection{Comparison to other communication channels}

Compared to traditional channels such as Fiber To The Home (FTTH), 5G or LoraWAN, grid frequency as a communication
channel has a resiliency advantage. It can start transmission as soon as a power island with a connected transmitter is
powered up, while communication networks such as FTTH or 5G are still rebooting or waiting for their centralized
infrastructure to come back online. Mesh networks such as LoraWAN can cover short distances up to $\SI{20}{\kilo\meter}$
without requiring infrastructure to be available, but for longer distances LoraWAN relies on the public internet for its
network backbone. Additionally, systems such as FTTH, 5G and LoraWAN are built around a point-to-point communication
model and usually do not support a global broadcast primitive. During times when a large number of devices must be
reached simultaneously this can lead to congestion of cellular towers and servers. Therefore, during an ongoing cyber
attack, grid frequency is promising as a communication channel because only a single transmitter facility must be
operational for it to function, and this single transmitter can reach all connected devices simultaneously.

\subsection{Characterizing Grid Frequency}
\label{grid-freq-characterization}

To prepare our analysis of grid frequency modulation, we developed a device that allows us to collect measurements of
actual grid frequency behavior through safely recording the grid voltage waveform.  Our system consists of an
\texttt{STM32F030F4P6} ARM Cortex M0 microcontroller that records mains voltage using its internal 12-bit ADC and
transmits measured values through a galvanically isolated USB/serial bridge to a host computer. We derive our system's
sampling clock from a crystal oven to avoid frequency measurement noise due to thermal drift of a regular crystal:
\SI{1}{ppm} of crystal drift would cause a grid frequency error of $\SI{50}{\micro\hertz}$. We compared our
oven-stabilized clock against a GPS 1 pps reference and found that over a time span of 20 minutes both stayed stable
within 5 ppb of each other, which corresponds to the drift specification of a typical crystal oven.

In utility SCADA systems, Phasor Measurement Units (PMUs) are used to precisely measure grid frequency among other
parameters.  Details on the inner workings of commercial phasor measurement units are scarce but there is a large amount
of academic research on their measurement algorithms.  PMUs employ complex signal analysis algorithms to provide fast
and precise measurements even when given a heavily distorted input signal~\cite{narduzzi01,derviskadic01,belega01}.

In our application, we do not need the same level of precision. For the sake of simplicity, we use the universal
frequency estimation approach of Gasior and Gonzalez~\cite{gasior01}. In this algorithm, the windowed input signal is
processed using a Discrete Fourier Transform (DFT), then the signal's fundamental frequency is interpolated by fitting a
wavelet to the largest peak in the DFT result. The bias parameter of this curve fit is an accurate estimation of the
signal's fundamental frequency. This algorithm is similar to the interpolated DFT algorithm referenced by phasor
measurement literature~\cite{borkowski01}.

\begin{figure}
    \centering
    \includegraphics[width=0.45\textwidth]{../notebooks/fig_out/freq_meas_spectrum_new}
    \caption{The spectrum of grid frequency variations measured over 24 hours. The raw spectrum is shown in gray, and a
    smoothed spectrum is shown in red. The blue line is inversely proportional to frequency and illustrates the $1/f$
    nature of the spectrum. Distinctive peaks in the spectrum are marked with red crosses, and their locations
    are given on the bottom of the diagram.}
    \Description{A plot of power spectral density in Hertz squared per Hertz versus period in seconds. The plot shows
    the measured spectrum, a smoothed fit of the measured spectrum, and an one over f line for comparison. The measured
    spectrum is very noisy. The smoothed signal looks much cleaner, and roughly follows the one over f line. The
    smoothed data contains several notable features. At a period of about 80 seconds, its slope suddenly starts falling
    off faster than one over f to form a through shape towards higher frequencies. There are several narrow bumps at
    round number periods such as 10 seconds, 60 seconds, 300 seconds and 900 seconds. There are three wider bumps
    visible. Two, a larger and a smaller one, next to each other centered on 4.7 seconds for the larger one and 7.0
    seconds for the smaller one. The last wider bump is below 0.5 seconds.}
    \label{fig_freq_spec}
\end{figure}

Using our grid frequency recorder, we performed a two-day measurement series of grid frequency.
Figure~\ref{fig_freq_spec} shows the frequency spectrum of grid frequency over this two-day span. In this spectrum, we
observe a number of features. Across the frequency range, we observe a broad $1/f$ noise. Above a period of
$\SI{10}{\second}$, this $1/f$ noise dips to a flat noise floor. We estimate that this low-noise region is caused by the
self-regulating effect of loads. Above a $\SI{10}{\second}$ period, primary control is activated and thus the $1/f$
noise we observe is the result of the interaction between primary control and consumer demand. On top of this $1/f$
behavior, the spectrum shows several sharp peaks at time intervals with a ``round'' number such as $\SI{10}{\second}$,
$\SI{60}{\second}$ or multiples of $\SI{300}{\second}$. These peaks are due to loads turning on- or off depending on
wall-clock time, and demand forecasting not being able to precisely match the amplitude of these large changes in load.
Besides the narrow peaks caused by this effect we can also observe two wider bumps at $\SI{7.0}{\second}$ and
$\SI{4.7}{\second}$. These bumps closely correlate with continental European synchonous area's oscillation modes at
$\SI{0.15}{\hertz}$ (east-west) and $\SI{0.25}{\hertz}$ (north-south)~\cite{grebe01}.

\section{Grid Frequency Modulation}

A transmitter for grid frequency modulation would be a controllable load of several Megawatt that is located centrally
within the grid. A baseline implementation would be a spool of wire submerged in a body of cooling liquid (such as a
small lake) which is powered from a thyristor rectifier bank. Compared to this baseline solution, hardware and
maintenance investment can be decreased by repurposing a large industrial load as a transmitter. Going through a list of
energy-intensive industries in Europe~\cite{ec01}, we found that an aluminium smelter would be a good candidate. In
aluminium smelting, aluminium is electrolytically extracted from alumina solution. High-voltage mains power is
transformed, rectified and fed into approximately 100 series-connected electrolytic cells forming a \emph{potline}.
Inside these pots, alumina is dissolved in molten cryolite electrolyte at approximately \SI{1000}{\degreeCelsius} and
electrolysis is performed using a current of tens or hundreds of Kiloampère at a few Volt per cell. The resulting pure
aluminium settles at the bottom of the cell and is tapped off for further processing.

Aluminium smelters are operated around the clock, and due to the high financial stakes their behavior under power
outages has been carefully characterized. Power outages of tens of minutes up to two hours reportedly do not cause
problems in aluminium potlines~\cite{eisma01,oye01}. Recently, even techniques for intentional power modulation without
affecting cell lifetime or product quality have been developed to take advantage of variable energy
prices~\cite{duessel01,eisma01,depree01}.  An aluminium plant's power supply is controlled to constantly keep all
smelter cells under optimal operating conditions. Modern power supply systems employ large banks of diodes or thyristors
to rectify low-voltage AC to DC to be fed into the potline~\cite{ayoub01}. Potline voltage is controlled through a
combination of a tap changer and a transductor.  Individual cell voltages are controlled by changing the physical
distance between anode and cathode.  In this setup, power can be electronically modulated using the thyristor rectifier.
Since the system does not have any mechanical inertia, high modulation rates are possible.

In~\cite{depree01}, the authors describe a setup where a large Aluminium smelter in continental Europe is used as
primary control reserve for frequency regulation. In this setup, a rise time of $\SI{15}{\second}$ was achieved to meet
the $\SI{30}{\second}$ requirement posed by local standards for primary control. In their conclusion, the authors note
that for their system, an effective thermal energy storage capacity of $\SI{7.7}{\giga\watt\hour}$ is possible if all
plants of a single operator are used. Given the maximum modulation depth of $\SI{100}{\percent}$ for up to one hour that
is mentioned by the authors, this results in an effective modulation power of $\SI{7.7}{\giga\watt}$. Over a longer
time span of $\SI{48}{\hour}$, they have demonstrated a $\SI{33}{\percent}$ modulation depth which would correspond to a
modulation power of $\SI{2.5}{\giga\watt}$. We conclude that a modulation of part of an aluminium smelter's power
consumption is possible at no significant production impact and at low infrastructure cost. Aluminium smelters are
already connected to the grid in a way that they do not pose a danger to other nearby consumers when they turn off or on
parts of the plant, as this is commonplace during routine maintenance activities.

\subsection{The operational model of a GFM-based safety reset}

While a single large Aluminium smelter could conceivably provide sufficient modulation power to cover the entire
continental European synchronous area, we have to consider operation during a black start, when the grid temporarily
divides into a number of disconnected power islands. A single transmitter would only be able to reach receivers on the
same power island.

To alleviate this constraint, the system can use a number of transmitters that are distributed throughout the network.
Piggy-backing transmitters on existing industrial loads keeps the implementation cost of additional transmitters low. By
running transmitters from stable, synchronized frequency standards such as gps-disciplined rubidium standards,
transmissions can be precisely synchronized across power islands even after a holdover period of several days. This
allows a transmission to continue uninterrupted while the utility rejoins power island into the larger grid, since the
transmissions on both islands are precisely synchronized.

As illustrated in Figure~\ref{fig_intro_flowchart}, the transmitters are connected to a command center. For this
connection, a redundant set of long-range radio or satellite links can be used, as well as wired connections through the
utility's dedicated SCADA network. In an emergency, the command center can then trigger a transmission. Synchronized
through their gps-backed frequency standards, two transmitters will then constructively interfere as soon as they are
connected to the same power island.

\subsection{Parameterizing Modulation for GFM}

Given the grid characteristics we measured using our custom waveform recorder and using a model of our transmitter, we
can derive parameters for the modulation of our broadcast system.  The overall network power-frequency characteristic of
the continental European synchronous area is approximately $\SI{25}{\giga\watt\per\hertz}$~\cite{entsoe02}. Thus, the
main challenge for a GFM system will be poor signal-to-noise ratio (SNR) due to low transmission power. A second layer
of modulation yielding some modulation gain beyond the basic amplitude modulation of the transmitter will be necessary
to achieve sufficient overall SNR.

The grid's frequency noise has significant localized peaks that might interfere with this modulation. Further
complicating things are the oscillation modes. A GFM system must be designed to avoid exciting these modes. However,
since these modes are not static, a modulation method that is designed around a specific assumption of their location
would not be future proof. Given these concerns, the optimal second-level modulation technique for GFM is a
spread-spectrum technique. By spreading signal energy throughout a wide band, both the impact of local noise spikes is
minimized and the risk of mode excitation is reduced since spread-spectrum techniques minimize energy in any particular
sub-band.

The spread-spectrum technique that we chose is Direct Sequence Spread Spectrum for its simple implementation and good
overall performance. DSSS chip timing should be as fast as the transmitter's physics allow to exploit the low-noise
region between $\SI{0.2}{\hertz}$ to $\SI{2.0}{\hertz}$ in Figure~\ref{fig_freq_spec}. Going past
$\approx\SI{2}{\hertz}$ would complicate frequency measurement at the receiver side.

\subsubsection{Direct Sequence Spread Spectrum (DSSS) modulation}

Direct Sequence Spread Spectrum modulation is a common spread-spectrum technique that forms the basis of a number of
radio systems, most prominently all global navigation satellite systems (GNSS). As a spread-spectrum technique, DSSS
spreads out the signal's energy across a broad spectral range. This decreases the susceptibility of a DSSS signal to
narrowband interference. In GNSS, this allows the rejection of other nearby RF sources. In our use case, this makes the
signal immune to the many narrow peaks in the grid frequency's noise spectrum that are caused by control systems
sychronized to wall-clock time(cf.~Fig.~\ref{fig_freq_spec}). In addition to better interference immunity, DSSS has two
other important characteristics: It provides \emph{modulation gain}, i.e.~it allows a trade-off between data rate and
receiver sensitivity, and it allows for Code Division Multiple Access (CDMA). In CDMA, multiple DSSS-modulated signals
can be sent simultaneously through a shared channel with less impact to the resulting signal-to-noise ratio (SNR) than
would be the case for other modulation techniques.

A DSSS signal is made up from pseudo-random \emph{symbols}, which in turn are made up from individual physical layer
bits called \emph{chips}. Chips are encoded in the signal using a lower-layer modulation such as phase-shift keying
(e.g.~in GPS) or frequency-shift keying (in this work). In DSSS, a \emph{code} is a library of symbols that are
constructed to have minimal cross-correlation, i.e.\ they are near-orthogonal. A transmitter sends a symbol by
transmitting its particular pseudo-random chip sequence at a chosen polarity, conveying one bit of information. A
receiver demodulates the signal by directly correlating the incoming physical-layer signal with the symbol's chip
pattern, which results in a positive or negative peak when a symbol is received depending on its polarity.

By increasing the DSSS sequence length by a factor of $2$, SNR is improved by $\sqrt{2}$ assuming an additive white
gaussian noise (AWGN) channel. At the same time, when doubling the sequence length, common DSSS code construction
methods provide twice the number of distinctive symbols allowing for twice the number of CDMA participants. The trade
off between twice the sequence length (and transmission time) for approximately $\SI{1.5}{dB}$ in SNR is a steep
trade-off, but is necessary in systems where transmitter power cannot be increased further and the resulting signal has
a marginally low SNR.

\subsubsection{DSSS parametrization}

To find the parameters for our DSSS modulation, we simulated a proof-of-concept modulator and demodulator using data
captured from our grid frequency sensor. Our simulations covered a range of combinations of modulation amplitude, DSSS
sequence bit depth, chip duration and detection threshold. Figure~\ref{fig_ser_nbits} shows our simulation results for
symbol error rate (SER) as a function of modulation amplitude with Gold sequences of several bit depths. From these
graphs we conclude that the range of practical modulation amplitudes starts at approximately $\SI{1}{\milli\hertz}$,
which corresponds to a modulation power of approximately $\SI{25}{\mega\watt}$~\cite{entsoe02}.
Figure~\ref{fig_ser_thf} shows SER against detection threshold relative to background noise. Figure~\ref{fig_ser_chip}
shows SER against chip duration for a given fixed symbol length. As expected from looking at our measured grid frequency
noise spectrum, performance is best for short chip durations and worsens for longer chip durations since shorter chip
durations move our signals' bandwidth into the lower-noise region from $\SI{0.2}{\hertz}$ to $\SI{2}{\hertz}$.
%FIXME introduce term "chip" somewhere

\begin{figure}
    \centering
    \includegraphics[width=0.45\textwidth]{../notebooks/fig_out/dsss_gold_nbits_overview}
    \caption{Symbol Error Rate as a function of modulation amplitude for Gold sequences of several lengths.}
    \Description{A plot of symbol error rate versus amplitude in millihertz. The plot shows four lines, one each for 5
    bit, 6 bit, 7 bit and 8 bit. All four lines form smooth step functions, plateauing at a symbol error rate of 1.0 for
    low amplitudes and falling to a symbol error rate of 0.0 for high amplitudes. The low-amplitude plateau is widest
    for 5 bit and narrowest for 8 bit. The falloff is steepest for 8 bit, and slowest for 5 bit. For 8 bit, a symbol
    error rate of 0.5 is crossed at about 0.4 millihertz. For 7 bit at about 0.6 millihertz, for 6 bit at 0.8 millihertz
    and for 5 bit at 1.3 millihertz. For 7 and 8 bit, symbol error rate settles at zero above 1.0 millihertz. For 5 bit
    above 2.0 millihertz and for 8 bit at about 3.0 millihertz.
    }
    \label{fig_ser_nbits}
\end{figure}

\begin{figure}
    \centering
    \hspace*{-5mm}\includegraphics[width=0.5\textwidth]{../notebooks/fig_out/dsss_thf_amplitude_5678}
    \vspace*{-5mm}
    \caption{SER vs.\ Amplitude and detection threshold. Detection threshold is set as a factor of background noise
    level.}
    \Description{This figure shows four plots that are similar to the previous figure. Each plot shows symbol error rate
    plotted against signal amplitude in millihertz. Each of the four plots shows a different gold sequence length, from
    5 bit up to 8 bit. Each plot contains more than ten traces that are color-coded for a different detection threshold
    factor. All plots show that a high threshold factor going towards 10 shifts the symbol error rate curve towards
    higher amplitudes, implying a less sensitive receiver. For lower threshold factors the sensitivity improves,
    however, for very low threshold factors performance deterioates and the plotted curves suddenly become completely
    erratic, with several curves for low threshold factors around 2 at all bit lengths never reaching symbol error rates
    below 0.2. The middle ground between the two seems to be a threshold factor of around 5. The four plots show a clear
    dependency between receiver sensitivity and gold code length. For a 5 bit gold code, only a few graphs settle at all
    and those that do settle towards zero symbol error rate only between 3 and 4 millihertz in amplitude. For a 6 bit
    gold sequence, most graphs settle, and for the best threshold factor the graph settles to zero symbol error rate
    below 2 millihertz amplitude. For the 7 bit gold code, the best graph settles at approximately 1.2 millihertz, and
    for the 8 bit gold code at approximately 0.8 millihertz.}
    \label{fig_ser_thf}
\end{figure}

\begin{figure}
    \centering
    \hspace*{-5mm}\includegraphics[width=0.5\textwidth]{../notebooks/fig_out/chip_duration_sensitivity_6}
    \vspace*{-5mm}
    \caption{SER vs.\ DSSS chip duration.}
    \Description{The figure shows two plots. The first plot shows symbol error rate against signal amplitude in
    millihertz, but this time it shows a cohort of curves for different chip durations. The general amplitude behavior
    is similar to the previous figure showing threshold factor instead, with a plateau at a 1.0 symbol error rate for
    low amplitudes, and a smooth step settling to a 0.0 symbol error rate for large signal amplitude. The plot shows
    chip durations between 0.1 seconds, equivalent to 6.4 seconds symbol duration and 5.0 seconds, equivalent to 320
    seconds symbol duration. Most curves settle within the plotted range of 0 to 5 millihertz. Larger chip durations
    settle only at higher amplitudes, and the fastest settling chip durations are also the shortest. There is a cluster
    of fast-settling curves settling around 1.0 millihertz amplitude for chip durations below 1.0 seconds. A clear best
    candidate is hard to distinguish from this cluster.
    The second plot in the figure shows the minimum amplitude necessary for a symbol error rate of 0.5 plotted in
    millihertz against chip duration in seconds. The graph shows a nicely round curve bottoming out at approximately
    0.75 millihertz for a chip duration of 0.3 seconds. For lower chip durations, the curve slightly rises, while for
    longer chip durations it rises by a lot, reaching 4.0 millihertz for a chip duration of 5.0 seconds.}
    \label{fig_ser_chip}
\end{figure}

\subsection{Parameterizing a proof-of-concept ``Safety Reset'' System Based on GFM}

%FIXME introduce scenario
Taking these modulation parameters as a starting point, we proceeded to create a proof-of-concept smart meter emergency
reset system. On top of the modulation described in the previous paragraphs we layered simple Reed-Solomon error
correction~\cite{mackay01} and some cryptography. The goal of our PoC cryptographic implementation was to allow the
sender of an emergency reset broadcast to authorize a reset command to all listening smart meters. An additional
constraint of our setting is that due to the extremely slow communication channel all messages should be kept as short
as possible. The solution we chose for our PoC is a simplistic hash chain using the approach from the Lamport and
Winternitz One-time Signature (OTS) schemes~\cite{lamport02,merkle01}. Informally, the private key is a random
bit string. The public key is generated by recursively applying a hash function to this key a number of times. Each
smart meter reset command is then authorized by disclosing subsequent elements of this series. Unwinding the hash chain
from the public key at the end of the chain towards the private key at its beginning, at each step a receiver can
validate the current command by checking that it corresponds to the previously unknown input of the current step of the
hash chain. Replay attacks are prevented by the device memorizing the most recent valid command.  This simple scheme
does not afford much functionality but it results in very short messages and removes the need for computationally
expensive public key cryptography inside the smart meter.

Formally, we can describe our simple cryptographic protocol as follows. Given an $m$-bit cryptographic hash function $H
: \{0,1\}^*\rightarrow\{0,1\}^m$ and a private key $k_0 \in \{0,1\}^m$, we construct the public key as
$k_{n_\text{total}} = H^{n_\text{total}}(k_0)$ where $H^n(x)$ denotes the $n$-fold recursive application of $H$ to
itself, i.e.\ $H(H(\hdots H(x)))$. $n_\text{total}$ is the total number of signatures that the system can
issue over its lifetime. $n_\text{total}$ must be chosen with adequate safety margin to account for unpredictable future
use of the system. The choice of $n_\text{total}$ is of no consequence when a device checks reset authorization, but key
generation time grows linearly with $n_\text{total}$ since $H$ needs to applied $n_\text{total}$ times. In practice,
given the speed of modern computers, values of $n_\text{total} > 10^9$ should pose no problem during key generation. For
public key $k_{n_\text{total}}$, the system can authorize up to $n_\text{total}$ commands by successively disclosing the
$k_i$ starting at $i=n-1$ and counting down until finally disclosing $k_0$. Since we only want to transmit a single bit
of information, we do not need any payload. Instead, we simply send a message $m =  (k_i)$ consisting solely of $k_i$.
The receiver of a message $m$ can check that the message is a legitimate command by checking $\exists i<q: H^i(m) =
k_\text{last}$ where $k_\text{last}$ is the last valid command that was received. $q$ is the maximum lookup depth that
the device will accept as valid. To conserve processing power, $q$ should be chosen to be much smaller than
$n_\text{total}$. Choosing $q$ too small, a device might become out of sync with the transmitter when it is disconnected
from the electrical grid for a long enough time for at least $q$ commands to be issued in the meantime. In practice,
this should not be a concern since only few commands should be issued over the life time of the system.

During an emergency situation, not all safety reset controllers might be online at the same time. In case the electrical
grid is restored piece by piece with safety reset controllers coming back online in batches, an utility might repeatedly
transmit the same reset command. In our protocol, we handle this situation by memorizing the last valid received command
on the device side, and only acting \emph{once} when a new command is received. The transmission of one command thus
becomes idempotent, and the utility can repeat the command until sufficiently many devices have received the command and
performed a safety reset.

In our protocol, we define two commands, \emph{reset} and \emph{disarm}. We assign \emph{reset} and \emph{disarm} to the
$k_i$ in an alternating way. For odd $i$, $k_i$ is a reset command and for even $i$, $k_i$ is a \emph{disarm} command.
To trigger a safety reset, the utility transmits the next unused $k_{2i+1}$. The utility may transmit this command
repeatedly to also reset devices that have come online only after earlier transmissions have started. After a sufficient
number of devices have performed a safety reset, the utility then transmits the next disarm command, $k_{2i}$. When
devices receive the disarm command, they still update the last received command, but they do not perform any other
action. The initial private key, $k_0$, is a \emph{disarm} key.

The reason for interleaving two commands in this way is to prevent a specific attack scenario in which an attacker first
observes a safety reset command being transmitted, and then at a later time gains access to a large load that could act
as a grid frequency modulation transmitter. Without a \emph{disarm} command, this attacker could then later trigger a
safety reset in any device that has not received the original reset command yet. The \emph{disarm} command gives the
utility the option to revoke a prior \emph{reset} command before any devices that were offline during the original reset
without triggering them to reset.

% FIXME add more precise/formal description of crypto
% FIXME add description of targeting/scope function?
% FIXME somewhere above descirbe entire reset system architecture????!!!
% FIXME add description of disarm message (replay protection)

\subsection{Experimental results}

\begin{figure}
    \centering
    \includegraphics[width=0.45\textwidth]{prototype.jpg}
    \caption{The completed prototype setup. The board on the left is the safety reset microcontroller. It is connected
    to the smart meter in the middle through an adapter board. The top left contains a USB hub with debug interfaces to
    the reset microcontroller. The cables on the bottom left are the debug USB cable and the \SI{3.5}{\milli\meter}
    audio cable for the simulated mains voltage input.}
    \Description{A photo of the safety reset prototype. Visible is a stand made from plywood to which a smart meter is
    mounted in the middle. To one side of the smart meter a light switch and a socket are connected. To the other side,
    an orange power cable exits towards the back of the stand. The smart meter is connected to a prototype circuit board
    with colorful wires. The prototype circuit board is in turn connected to a microcontroller development board. The
    development board is connected to a USB hub with both an SWD programming adapter and a USB to serial converter. A
    usb cable from the USB hub as well as a 3.5 millimeter audio cable from the prototype circuit board are neatly
    coiled up and hang down from the stand.}
    \label{fig_proto_pic}
\end{figure}

For a realistic proof of concept, we decided to implement our signal processing chain from DSSS demodulator through
error correction up to our simple cryptography layer in microcontroller firmware and demonstrate this firmware on actual
smart meter hardware, shown in Figure~\ref{fig_proto_pic}. In our proof of concept a safety reset controller is
connected to the main application microcontroller of a smart meter. The reset controller is tasked with listening for
authenticated reset commands on the voltage waveform, and on reception of such a command resetting the smart meter
application controller by flashing a known-good firmware image to its memory.

The signal processing chain of our PoC is shown in Figure~\ref{fig_demo_sig_schema}. To interoperate with existing
implementations of SHA-512 and reed-solomon decoding, this implementation was written in the C programming language. To
demonstrate an application close to a field implementation, we chose an Easymeter \texttt{Q3DA1002} smart meter as our
reset target. This model is popular in the German market and readily available second-hand. The meter consists of three
isolated metering ASICs connected to a data logging and display PCB through infrared optical links. To demonstrate the
safety reset's firmware reset functionality, we connected our safety reset microcontroller to the Texas Instruments
\texttt{MSP430} microcontroller on the meter's display and data logging board through the JTAG debug interface that the
board's vendor had conveniently left accessible. We ported part of
\texttt{mspdebug}\footnote{\url{https://dlbeer.co.nz/mspdebug/}} to drive the meter microcontroller's JTAG interface and
wrote a piece of demonstrator code that overwrites the meter's firmware with one that displays an identifying string on
the meter's display after boot-up.

\begin{figure}
    \centering
    \includegraphics[width=0.45\textwidth]{prototype_schema}
    \caption{The signal processing chain of our demonstrator.}
    \Description{A diagram showing the signal processing flow. The diagram shows a number of steps going from grid
    voltage waveform to trigger decision. The diagram begins with the DMA-assisted ADC capture. At this point, the
    signal is a clean analog sine wave. The next step is grid frequency estimation, after which the signal is a
    noise-like ragged line. After grid frequency estimation follows DSSS demodulation, which itself is made up of three
    steps. The first step of DSSS demodulation is convolution, which produces a small noise signal with a large peak
    somewhere in the middle. The peak is roughly ten times the amplitude of the noise and has two prominent negative
    side lobes to the left and right. The following step, CWT peak contrast enhancement, cleans up this signal and
    removes the side-lobes leaving only the positive peak sticking out of the background noise. The final step of DSSS
    demodulation is maximum likelihood estimation, which produces a vector of n plus k discrete elements. After DSSS
    demodulation, this vector is passed through Reed-Solomon error correction, which transforms it into a vector of now
    only n discrete elements. This vector is then finally processed in the cryptographic trigger protocol, which
    produces the final trigger decision.}
    \label{fig_demo_sig_schema}
\end{figure}

To measure grid frequency in our demonstrator, we ported the same code we used in
Section~\label{grid-freq-characterization} to our demonstrator, again using the voltage measured using the
microcontroller's internal ADC but using a regular crystal instead of a crystal oven for the microcontroller's system
clock.  Since we did not have an aluminium smelter ready, we decided to feed our proof-of-concept reset controller with
an emulated grid voltage sine wave from a computer's headphone jack. Where in a real application this microcontroller
would  take ADC readings of input mains voltage divided down by a long resistive divider chain, we instead feed the ADC
from a $\SI{3.5}{\milli\meter}$ audio input. For operational safety, we disconnected the meter microcontroller from its
grid-referenced capacitive dropper power supply and connected it to our reset controller's debug USB power supply.

We performed several successful experiments using a signature truncated at 120 bit and a 5 bit DSSS sequence. Taking the
sign bit into account, the length of the encoded signature is 20 DSSS symbols. On top of this we used Reed-Solomon error
correction at a 2:1 ratio inflating total message length to 30 DSSS symbols. At the \SI{1}{\second} chip rate we used in
other simulations as well this equates to an overall transmission duration of approximately \SI{15}{\minute}. To give
the demodulator some time to settle and to produce more realistic conditions of signal reception we padded the modulated
signal with unmodulated noise on both ends.

\section{Lessons learned}

For our proof of concept, before settling on the commercial smart meter we first tried to use an \texttt{EVM430-F6779}
smart meter evaluation kit made by Texas Instruments. This evaluation kit did not turn out well for two main reasons.
One, it shipped with half the case missing and no cover for the high-voltage terminal blocks. Because of this some work
was required to get it electrically safe.  Even after mounting it in an electrically safe manner the safety reset
controller prototype would also have to be galvanically isolated to not pose an electrical safety risk since the main
MCU is not isolated from the grid and the JTAG port is also galvanically coupled. The second issue we ran into was that
the development board is based around a specific microcontroller from TI's \texttt{MSP430} series that is incompatible
with common JTAG programmers.

Our initial assumption that a development kit would be easier to program than a commercial meter did not prove to be
true. Contrary to our expectations the commercial meter had JTAG enabled allowing us to easily read out its stock
firmware requiring neither reverse-engineering vendor firmware update files nor circumventing code protection measures.
The fact that its firmware was only available in its compiled binary form was not much of a hindrance as it proved not
to be too complex and all we wanted to know we found with just a few hours of digging in
Ghidra\footnote{\url{https://ghidra-sre.org/}}.

In the firmware development phase we tested every module such as DSSS demodulator, Reed-Solomon decoder, or grid
frequency estimation individually. This approach proved particularly useful for debugging. The modular architecture
allowed us to directly compare our demodulator implementation to our Jupyter/Python prototype, where we found that our C
implementation outperformed the Python prototype. Despite the algorithms's complexity, the microcontroller C
implementation has no issues processing data in real-time due to the low sampling rate necessary.

\section{Conclusion}
\label{sec_conclusion}

In this paper we have developed an end-to-end design for a safety reset system that provides these capabilities.
Our novel broadcast data transmission system is based on intentional modulation of global grid frequency. Our system is
independent of normal communication networks and can operate during a cyber attack. We have shown the practical
viability of our end-to-end design through simulations. Using our purpose-designed grid frequency recorder, we can
capture and process real-time grid frequency data in an electrically safe way. We used data captured this way as the
basis for simulations of our proposed grid frequency modulation communication channel. In these simulations, our system
has proven feasible. From our simulations we conclude that a large consumer such as an aluminium smelter at a small cost
can be modified to act as an on-demand grid frequency modulation transmitter.

We have demonstrated our modulation system in a small-scale practical demonstration.  For this demonstration, we have
developed a simple cryptographic protocol ready for embedded implementation in resource-constrained systems that allows
triggering a safety reset with a response time of less than 30 minutes.  In this demonstration we use simulated grid
frequency data to trigger a commercial microcontroller to perform a firmware reset of an off-the-shelf smart meter. The
next step in our evaluation will be to conduct an experimental evaluation of our modulation scheme in collaboration with
an utility and an operator of a multi-megawatt load.

\subsection{Discussion}

During an emergency in the electrical grid, the ability to communicate to large numbers of end-point devices is a
valuable tool for restoring normal operation. When a resilient communication channel is available, loads such as smart
meters and IoT devices can be equipped with a supervisor circuit that allows for a remote ``safety reset'' that puts the
device into a safe operating state. Using this safety reset, an attacker that uses compromised smart meters or IoT
devices to attack grid stability can be interrupted before the can conclude their attack. During recovery from an
outage, a safety reset can be used to reduce stress on the system during a black start by temporarily disabling
non-essential loads such as air conditioners.

The safety reset controller does not require any peripherals except for an ADC. Thus we expect code size to be the main
factor affecting per-unit cost in an in-field deployment of our concept. At around \SI{64}{\kilo\byte}, our demonstrator
firmware implementation is viable on low-end microcontrollers. Given that modern smart meters and IoT devices usually
use complex Systems on Chip (SoCs), a safety reset controller could be integrated into the main application processor
itself at little added complexity. In summary, we expect safety reset controllers to be commercially viable.

Safety reset controllers can be adapted to most IoT device and smart meter designs. Because they are independent from
other public utilities such as the internet or cellular networks, we believe in their potential as a last line of
defense providing resilience under large-scale cyber attacks. The next steps towards a practical implementation will be
a practical demonstration of broadcast data transmission through grid frequency modulation using a megawatt-scale
controllable load as well as further optimization of the modulation and data encoding and the demodulator
implementation.

\subsection{Artifacts}

Source code for the demonstrator and simulations, as well as hardware EDA designs are available at the public git
repository at the following URL:

\begin{center}
    \url{https://git.jaseg.de/safety-reset.git}
\end{center}

\begin{acks}
    This work has been co-funded by the LOEWE initiative (Hesse, Germany) within the emergenCITY center.
\end{acks}

\bibliographystyle{ACM-Reference-Format}
\bibliography{\jobname}

\end{document}