
\documentclass[12pt,a4paper,notitlepage]{report}
\usepackage[ngerman, english]{babel}
\usepackage[utf8]{inputenc}
\usepackage[a4paper,textwidth=17cm, top=2cm, bottom=3.5cm]{geometry}
\usepackage[T1]{fontenc}
\usepackage[
    backend=biber,
    style=numeric,
    natbib=true,
    url=false,
    doi=true,
    eprint=false
    ]{biblatex}
\addbibresource{safety_reset.bib}
\usepackage{amssymb,amsmath}
\usepackage{listings}
\usepackage{eurosym}
\usepackage{wasysym}
\usepackage{amsthm}
\usepackage{tabularx}
\usepackage{multirow}
\usepackage{multicol}
\usepackage{tikz}
\usepackage{mathtools}
\DeclarePairedDelimiter{\ceil}{\lceil}{\rceil}
\DeclarePairedDelimiter{\paren}{(}{)}

\usetikzlibrary{arrows}
\usetikzlibrary{chains}
\usetikzlibrary{backgrounds}
\usetikzlibrary{calc}
\usetikzlibrary{decorations.markings}
\usetikzlibrary{decorations.pathreplacing}
\usetikzlibrary{fit}
\usetikzlibrary{patterns}
\usetikzlibrary{positioning}
\usetikzlibrary{shapes}

\usepackage[binary-units]{siunitx}
\DeclareSIUnit{\baud}{Bd}
\usepackage{hyperref}
\usepackage{commath}
\usepackage{graphicx,color}
\usepackage{ccicons}
\usepackage{subcaption}
\usepackage{float}
\usepackage{footmisc}
\usepackage{array}
\usepackage[underline=false]{pgf-umlsd}
%\usepackage[pdftex]{graphicx,color}
\usepackage{epstopdf}
\usepackage{pdfpages}
\usepackage{minted} % pygmentized source code
% Needed for murks.tex
\usepackage{setspace}
\usepackage[draft=false,babel,tracking=true,kerning=true,spacing=true]{microtype} % optischer Randausgleich etc.
% For german quotation marks

\newcommand{\degree}{\ensuremath{^\circ}}
\newcolumntype{P}[1]{>{\centering\arraybackslash}p{#1}}

\usepackage{fancyhdr}
\fancyhf{}
\fancyfoot[C]{\thepage}
\newcommand{\includenotebook}[2]{
    \fancyhead[C]{Included Jupyter notebook: #1}
    \includepdf[pages=1,
        pagecommand={\thispagestyle{fancy}\section{#1}\label{#2_notebook}}
        ]{resources/#2.pdf}
    \includepdf[pages=2-,
        pagecommand={\thispagestyle{fancy}}
        ]{resources/#2.pdf}
}

\begin{document}
\selectlanguage{ngerman}
\input{murks}
\titelen{A Post-Attack Recovery Architecture for Smart Electricity Meters}
\titelde{Eine Architektur zur Kontrollwiederherstellung nach Angriffen auf Smart Metering in Stromnetzen}
\typ{Masterarbeit}
\grad{Master of Science (M. Sc.)}
\autor{Jan Sebastian Götte}
\gebdatum{Aus Datenschutzgründen nicht abgedruckt} % Geburtsdatum des Autors
\gebort{Aus Datenschutzgründen nicht abgedruckt} % Geburtsort des Autors
\gutachter{Prof. Dr. Björn Scheuermann}{Prof. Dr.-Ing. Eckhard Grass}
\mitverteidigung
\makeTitel
\selbstaendigkeitserklaerung{\today}
\vfill
\selectlanguage{english}
{\center{
\begin{minipage}[t][10cm][b]{\textwidth}
    \center{\ccbysa}

    \center{This work is licensed under a Creative-Commons ``Attribution-ShareAlike 4.0 International'' license. The
    full text of the license can be found at:}

    \center{\url{https://creativecommons.org/licenses/by-sa/4.0/}}

    \center{For alternative licensing options, source files, questions or comments please contact the author at
    \texttt{masterarbeit@jaseg.de}}.

    \center{This is version \texttt{\input{version.tex}\unskip} generated on \today. The printed version of this
    document will be marked \texttt{-dirty} due to the private personal information on the title page that is not
    checked in to git. The git repository can be found at:}

    \center{\url{https://git.jaseg.de/master-thesis.git}}
\end{minipage}
}}
\newpage

% Hier folgt die eigentliche Arbeit (bei doppelseitigem Druck auf einem neuen Blatt):
\tableofcontents
\newpage

\chapter{Introduction}

%FIXME: sprinkle this section with citations.
As in all fields of engineering, there is an ongoing diffusion of information systems into industrial control systems
in the power grid. Automation of these control systems has been practised for the better part of a century already.
Until recently this automation was mostly limited to core components of the grid. Generators in power stations are
computer-controlled according to electromechanical and economic models. Switching in substations is automated to allow
for fast failure recovery. Humans are still vital to these systems, but their tasks have shifted from pure operation to
engineering, maintenance and surveillance.

A major trend in power systems is the move from a model of centralized generation built around massive
fossil and nuclear power plants towards a more heterogeneous model. In this new model large-scale fossil power plants
still serve a major role but two new factors come into play. One is the advance of renewable energies. The large-scale
use of wind and solar power in particular from a current standpoint seems unavoidable for our continued existence on
this planet. For the electrical grid however, these systems constitute a significant challenge. Fossil-fueled power
plants can be precisely controlled to match the expected energy consumption at any point in time. This tracking of
production and consumption is vital to the stability of the grid. Renewable energies such as wind and solar power do not
provide the same degree of controllability, and they introduce a large degree of uncertainty due to the
unpredictability of the forces of nature.

Along with this change in dynamic behavior renewable energies have brought forth the advance of distributed generation.
In distributed generation end-customers that previously only consumed energy have started to feed energy into the grid
from small solar installations on their property. Distributed generation is a chance for customers to gain autonomy and
shift from a purely passive role to being active participants of the electricity market\cite{crastan03}.

To match this new landscape of decentralized generation and unpredictable renewable resources the utility industry has
had to adapt itself in major ways. One aspect of this adaptation that is particularly visible to ordinary people is the
computerization of end-user energy metering. Despite the widespread use of industrial control systems inside the
electrical grid and the far-reaching diffusion of computers into people's everyday lives the energy meter has long been
one of the last remnants of an offline, analog time. Until the 2010s many households were still served through
electromechanical Ferraris-style meters that have their origin in the late 19th
century\cite{borlase01,ukgov04,bnetza02}.

Today under the umbrella term \emph{Smart Grid} the shift towards fully computerized, often networked meters has been
partially accomplished. The rollout of these \emph{Smart Meters} has not been very smooth overall, with some countries
severely lagging behind others. As a safety-critical technology smart meter technology is usually standardized
on a per-country basis. This leads to an inhomogeneous landscape with in some instances wildly incompatible systems.
Often vendors only serve a single country or have a separate model of their meter for each country. This complex
standardization landscape and market situation has led to a proliferation of highly complex, custom-coded
microcontroller firmware. The complexity and scale of this often network-connected firmware makes for a ripe substrate
for bugs to surface.

A remotely exploitable flaw inside a smart meter's firmware\footnote{
    There are several smart metering architectures that ascribe different roles to the component called \emph{smart
    meter}. Coarsely divided into two camps these are systems where all metering and communication code resides within
    one physical unit and systems where metering and communication are separated into two units, the \emph{smart meter}
    and the \emph{smart meter gateway}\cite{stuber01}. Examples of the former are setups in the USA; an example of
    the latter is the setup in Germany. For clarity in this introductory chapter we use \emph{smart meter} to describe the
    entire system at the customer premises including both the meter and a potential gateway.
} could have consequences ranging from impaired billing
functionality to an existential threat to grid stability\cite{anderson01,anderson02}.  A coördinated attack on meters in
a country where load switches are common could at worst cause widespread activation of grid safety systems by repeatedly
connecting and disconnecting megawatts of load capacity at just the wrong moments\cite{wu01}.

Mitigation of these attacks through firmware security measures is unlikely to yield satisfactory results. The enormous
complexity of smart meter firmware makes firmware security extremely labor-intensive. The diverse standardization
landscape makes a coördinated, comprehensive response unlikely.

In this thesis, instead of lamenting the state of firmware security, we introduce a pragmatic solution to what we
consider a likely scenario: a large-scale compromise of smart meter firmware. In our proposal the components of the
smart meter that are threatened by remote compromise are equipped with a physically separate \emph{safety reset
controller} that listens for a reset command transmitted through the electrical grid itself and on reception forcibly
resets the smart meter's entire firmware to a known-good state.  Our safety reset controller receives commands through
Direct Sequence Spread Spectrum (DSSS) modulation of the grid frequency, carried out through a large controllable load
such as an aluminium smelter. After forward error correction and cryptographic verification it re-flashes the target
application microcontroller over the standard JTAG interface.

In this thesis starting from a high-level architecture we have carried out extensive simulations of our proposal's
performance under real-world conditions. Based on these simulations we implemented an end-to-end prototype of our
proposed safety reset controller as part of a realistic smart meter demonstrator. Finally we experimentally validate our
results and give an outline of further steps towards practical implementation.

\chapter{Fundamentals}

\section{Structure and operation of the electrical grid}

Since this thesis is filed under \emph{computer science} we will provide a very brief overview of some basic aspects of
modern power grids.

\subsection{Structure of the electrical grid}

The electrical grid is composed of a large number of systems such as distribution systems, power stations and substations
interconnected by long transmission lines. Since ohmic losses\footnote{
    Power dissipation of a resistor of resistance $R [\Omega]$ given current $I [A]$ is $P_\text{loss} [W] =
    U_\text{drop} \cdot I = I^2 \cdot R$. Fixing power $P_\text{transmitted} [W] = U_\text{line} \cdot I$ this yields a
    dependency on line voltage $U_\text{line} [V]$ of $P_\text{loss} =
    \left(\frac{P_\text{transmitted}}{U_\text{line}}\right)^2 \cdot R$. Thus, ignoring other losses a $2\times$ increase
    in transmission voltage halves current and cuts ohmic losses to a quarter. In practice the economics of this are
    much more complicated due to the cost of better insulation for higher-voltage parts and the added factor of power
    factor compensation. }
fall with the square of voltage for a given transmitted power, long transmission lines become more efficient the higher
their voltage\cite{crastan01,simon01}. % simon01: p. 425, 9.4.1.1, crastan p.55, 3.1
In practice economic considerations take into account a reduction of the considerable transmission losses (about
\SI{6}{\percent} in case of Germany\cite{destatis01}) as well as the cost of equipment such as additional transformers
and the cost increase for the increased voltage rating of components such as transmission lines. Overall these
considerations have led to a hierarchical structure where large amounts of energy are transmitted over very long
distances (up to thousands of kilometers) at very high voltages (upwards of \SI{200}{\kilo\volt}) and voltages get lower
the closer one gets to end-customer premises. In Germany at the local level a substation will distribute
\SIrange{10}{30}{\kilo\volt} to large industrial consumers and to streets, where small transformer substations convert
this to the \SI{400}{\volt} three-phase AC that households are usually connected to\cite{crastan01}.
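
As a rough illustration of the scaling described in the footnote above, consider a simplified single-conductor line.
The values $P_\text{transmitted} = \SI{100}{\mega\watt}$ and $R = \SI{10}{\ohm}$ are assumptions chosen for this sketch
and are not taken from a real line.
\begin{align*}
    U_\text{line} = \SI{110}{\kilo\volt}: \quad
        I &= \frac{P_\text{transmitted}}{U_\text{line}} \approx \SI{909}{\ampere}, &
        P_\text{loss} &= I^2 \cdot R \approx \SI{8.3}{\mega\watt} \\
    U_\text{line} = \SI{220}{\kilo\volt}: \quad
        I &= \frac{P_\text{transmitted}}{U_\text{line}} \approx \SI{455}{\ampere}, &
        P_\text{loss} &= I^2 \cdot R \approx \SI{2.1}{\mega\watt}
\end{align*}
Under these assumptions doubling the line voltage reduces the ohmic loss from about \SI{8.3}{\percent} to about
\SI{2.1}{\percent} of the transmitted power. A real three-phase line with reactive currents and corona losses would
deviate from these figures.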

\subsubsection{Transmission lines, bus bars and tie lines}

The most fundamental components of the electrical grid are transmission lines. Short transmission lines that tightly couple
parts of a substation are called \emph{bus bars}. Transmission lines that couple otherwise independent grid segments are
called \emph{tie lines}. A tie line often connects grid segments operated by two different operators e.g.\ across a
country border.

\emph{Short} transmission lines can be approximated as a simple lumped-component
RLC\footnote{resistor-inductor-capacitor} circuit. In this case the effect of wave propagation along the line does not
have to be taken into consideration. In this lumped model the transmission line is represented by a circuit of one or
two inductors, one or two capacitors and some resistors. This representation simplifies analysis. For \emph{long}
transmission lines above \SI{50}{\kilo\meter} (cable) or \SI{250}{\kilo\meter} (overhead lines) this approximation
breaks down and wave propagation along the line's length has to be taken into account. The resulting model is what RF
engineering calls a \emph{transmission line} and models the line's parasitics\footnote{stray capacitance, ohmic
resistance and stray inductance} as being uniformly distributed along the length of the line. To approximate this model
in lumped-element evaluations the line is represented as a long chain of small lumped-component RLC sections. This
complex structure makes modelling more difficult in comparison to short lines\cite{crastan01}.
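
A back-of-the-envelope estimate gives some intuition for these length limits. As an assumption for this sketch, let the
propagation velocity along an overhead line be close to the vacuum speed of light $c \approx
\SI{3e8}{\meter\per\second}$, and let $\ell$ denote the line length. At $f_\text{nom} = \SI{50}{\hertz}$ this yields
\begin{align*}
    \lambda &= \frac{c}{f_\text{nom}} \approx \SI{6000}{\kilo\meter}, &
    \frac{\ell}{\lambda} &= \frac{\SI{250}{\kilo\meter}}{\SI{6000}{\kilo\meter}} \approx 0.042, &
    \Delta\varphi &= \SI{360}{\degree} \cdot \frac{\ell}{\lambda} \approx \SI{15}{\degree}.
\end{align*}
At the quoted limit for overhead lines the phase of the wave thus already shifts by roughly \SI{15}{\degree} along the
line. In cables the propagation velocity is lower, so the wavelength is shorter and the same phase shift is reached on a
shorter line.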

Almost all transmission lines used in the transmission and distribution grid use three-phase AC. Long-distance overland
lines are usually implemented as overhead lines due to their low cost and ease of maintenance. Underground cables are
much more expensive due to their insulation and are only used when overhead lines cannot be used for e.g.\ safety or
aesthetic reasons. In some specialized applications such as long, high-power undersea cables high-voltage DC (HVDC) is
used. In HVDC, converter stations at both ends of the line convert between three-phase AC and the line's DC voltage.
These converter stations are controlled electronically and do not exhibit any of the electromechanical effects
generators in a power plant do. Since HVDC re-synthesizes three-phase AC from DC at the receiving end of the line it can
be used to couple non-synchronous grids. This also allows for additional degrees of control over the transmission of
power compared to a regular transmission line. These technical benefits are offset by the high initial cost (mostly due
to the converter stations) leading to HVDC being used in specific situations only\cite{crastan03}.

\subsubsection{Generators}

Traditionally all generators in the power grid were synchronous machines. A synchronous machine is a generator that is
wound and connected in such a way that during normal operation its rotation is synchronous with the grid frequency. Grid
frequency and generator rotation speed are bidirectionally electromechanically coupled. If a generator were to lag behind
the grid, it would receive electrical energy from the grid and convert it into mechanical energy, acting as a motor.
Small deviations between rotational speed and grid frequency will be absorbed by the electromechanical coupling between
both. All generators connected to the grid operate synchronously. Maintaining this synchronization over time is the task
of complex control systems within each power station\cite{simon01,crastan01}.
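
The coupling between grid frequency and rotation speed can be made concrete with the standard relation for synchronous
machines. Here $p$ denotes the number of pole pairs and $n$ the mechanical speed in revolutions per minute. Both
symbols are introduced for this illustration only.
\begin{align*}
    n &= \frac{60 \cdot f_\text{nom}}{p}, &
    p = 1: \; n &= \SI{3000}{\per\minute}, &
    p = 2: \; n &= \SI{1500}{\per\minute}
    \qquad \text{at } f_\text{nom} = \SI{50}{\hertz}.
\end{align*}
A two-pole machine in a \SI{50}{\hertz} grid therefore turns at exactly $50$ revolutions per second as long as it
remains synchronized.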

Nowadays besides traditional rotating generators the grid also contains a large amount of electronically controlled
inverters. These inverters are used in photovoltaic installations and other setups where either DC or non-synchronous AC
is to be fed into the grid. Setups like this behave differently from rotating generators. In particular, \emph{inertia} in
these setups is either absent or a software parameter, potentially reducing their overload capacity compared to rotating
generators. The fundamentally different nature of electronically controlled inverters has to be taken into account in
planning and regulation\cite{crastan03}.

\subsubsection{Switchgear}

In the electrical grid switches perform various roles. The ones a computer scientist would recognize are used for
routing electricity between transmission lines and transformers and can be classified into ones that can be switched
under load (called load switches) and ones that cannot (called disconnectors). The latter are used to ensure parts of
the network are free from voltage. The former are used to re-route flows of electrical currents. A major difference in
their construction is that in contrast to disconnectors load switches have built-in components that extinguish the
high-power arc discharge that forms when the circuit is interrupted under load\footnote{
    While an arc discharge is considered a fault condition in most low-voltage systems including computers, in energy
    systems it is often part of normal operation.
}. Beyond this there are circuit breakers.  Circuit breakers are safety devices that can still switch even under failure
conditions at several times the circuit's nominal current. They are activated automatically on conditions such as
overcurrent or overvoltage. Fuses can be considered non-resettable switches. The fuse in a computer power supply is
barely more than a glass tube with some wire in it that is designed to melt at the designated current. In energy systems
fuses are often much more complex devices that in some cases even utilize explosives to quickly and decisively open the
circuit and extinguish the resulting arc discharge\cite{nelles01,crastan01,simon01}.
% disconnect switches, fuses, breakers -> crastan 1 (ch. 8)

\subsubsection{Transformers}

Along with transmission lines transformers are one of the main components most people will be thinking of when talking
about the electrical grid. Transformers connect grid segments at different voltage levels with one another.  In the
distribution grid transformers are used to provide standard end-user voltage levels to the customer (e.g.\
\SI{230}{\volt}/\SI{400}{\volt} in Europe) from a \SIrange{10}{25}{\kilo\volt} feeder. Transformers can also be used to convert between buses without a
fourth neutral conductor and buses with one.

Transformers are large and heavy devices consisting of thick copper wire or copper foil windings arranged around a core
made from thin stacked, insulated iron sheets. The entire core sits within a large metal enclosure that is filled with
liquid (usually a specialized oil) for both cooling and electrical insulation. This cooling liquid is cooled by means
such as radiator fins on the transformer enclosure itself or an external radiator. Depending on the design cooling may
rely on natural convection within the cooling liquid or on electrical pumps\cite{crastan01,simon01}.

Transformers come in a large variety of coil and wiring configurations. There exist autotransformers where the secondary
is part of the primary (or vice-versa) that are used to translate between voltage levels without galvanic isolation at
lower cost. Transformers used in parts of the electrical grid often have several taps and include \emph{tap changers}. A
tap changer is a system of mechanical switches that can be used to switch between several discrete transformer ratios to
adjust secondary voltage under load\cite{simon01}. Tap changers are used in the distribution grid to maintain the
specified voltage tolerances at the customer's connection.

\subsubsection{Instrument transformers}

While operating on the exact same physical principles instrument transformers are very different from regular
transformers in an energy system. Instrument transformers are specialized low-power transformers that are used as
transducers to measure voltage or current at very high voltages. They are part of the control and protection systems of
substations\cite{crastan01}.

\subsubsection{Chokes}

Chokes are large inductors. In power grid applications their construction is similar to the construction of a
transformer with the exception that they only have a single winding on the core. They are used for a variety of
purposes. A frequent use is as a series inductor on one of the phases or the neutral connection to limit transient fault
currents.  In addition to use as simple series inductances for current limiting, inductors are also used to tune LC
circuits. One such use is the Petersen coil, a large inductor in series with the earth connection at a transformer's
star point that is used to quickly extinguish arcs between phase and ground on a transmission line. The Petersen coil
forms a parallel LC resonant circuit with the transmission line's earth capacitance. Tuning this circuit through
adjusting the Petersen coil reduces earth fault current to levels low enough to quickly extinguish the arc\cite{simon01}.
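
The tuning condition can be sketched under idealized assumptions: a lossless, symmetric three-phase line with a
phase-to-earth capacitance $C_\text{E}$ per phase, so that the coil of inductance $L$ sees the three capacitances in
parallel. The symbols $L$ and $C_\text{E}$ and the example value $C_\text{E} = \SI{1}{\micro\farad}$ are assumptions made
for this illustration.
\begin{align*}
    \omega_\text{nom} L &= \frac{1}{3\, \omega_\text{nom} C_\text{E}}
    \quad\Longrightarrow\quad
    L = \frac{1}{3\, \omega_\text{nom}^2 C_\text{E}}
      = \frac{1}{3 \cdot \paren{2 \pi \cdot \SI{50}{\hertz}}^2 \cdot \SI{1}{\micro\farad}}
      \approx \SI{3.4}{\henry}
\end{align*}
At this inductance the inductive current through the coil cancels the capacitive earth fault current, leaving only a
small residual current due to losses and detuning that this idealized model ignores.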

\subsubsection{Power factor correction}

Power factor is a power engineering term that is used to describe how close the current waveform of a load is to that of
a purely resistive load. Given sinusoidal input voltage $V(t) = V_\text{pk} \sin \paren{\omega_\text{nom} t}$ with
$\omega_\text{nom} = 2 \pi f_\text{nom} = 2 \pi \cdot \SI{50}{\hertz}$ being the nominal angular frequency, the current
waveform of a resistor with resistance $R \left[\Omega\right]$ according to Ohm's law would be $I(t) = \frac{V(t)}{R} =
\frac{1}{R} V_\text{pk} \sin\paren{\omega_\text{nom} t}$. In this case voltage and current are perfectly in phase, i.e.\
the current at time $t$ is proportional to the voltage with constant factor $\frac{1}{R}$.

In contrast to this idealized scenario reality provides us with two common issues: One, the load may be reactive.  This
means its current waveform is an ideal sinusoid, but there is a phase difference between mains voltage and load current
like so: $I(t) = \frac{1}{\left|Z\right|} V_\text{pk} \sin\paren{\omega_\text{nom} t + \varphi}$. Here $Z$
is the load's complex impedance combining inductive, capacitive and resistive components and $\varphi$ is the phase
difference between the resulting current waveform and the mains voltage waveform. Common examples of such loads are motors
and the inductive ballasts in old fluorescent lighting fixtures.

The second potential issue is loads with non-sinusoidal current waveforms. There are many classes of these but the most
common one is switching-mode power supplies (SMPS). Most SMPS for modern electronic devices have an input stage
consisting of a bridge rectifier followed by a capacitor that provides high-voltage DC power to the following
switch-mode converter circuit. This rectifier-capacitor input stage under normal load draws a high current only at the
very peak of the input voltage sinusoid and draws almost zero current for most of the period.

These two cases are measured by \emph{displacement power factor} and \emph{distortion power factor} that when combined
yield the overall true power factor. The power factor is a key quantity in the design and operation of the power grid
since a high power factor (close to $1.0$, i.e.\ an in-phase sinusoidal current waveform) yields the lowest transmission
and generation losses.
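
The combination of the two factors can be written out in the common textbook form, assuming a purely sinusoidal supply
voltage. The symbols $\varphi_1$ (phase shift of the current's fundamental), $I_n$ (RMS value of the $n$-th current
harmonic) and $\mathit{THD}$ (total harmonic distortion of the current) are introduced for this illustration, and the
example values are arbitrary.
\begin{align*}
    \mathit{PF} &= \underbrace{\cos \varphi_1}_\text{displacement}
        \cdot \underbrace{\frac{1}{\sqrt{1 + \mathit{THD}^2}}}_\text{distortion},
    & \mathit{THD} &= \frac{\sqrt{\sum_{n \ge 2} I_n^2}}{I_1} \\
    \varphi_1 = \SI{30}{\degree},\ \mathit{THD} = 0.75: \quad
    \mathit{PF} &\approx 0.866 \cdot 0.8 \approx 0.69
\end{align*}
In this example both effects contribute in similar measure, and neither displacement nor distortion alone would reveal
how far the load is from an ideal resistive load.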

Reactive power (also referred to as \emph{VAR} after its unit Volt-Ampère Reactive) is an important variable in the
operation of electrical grids (see sec.\ \ref{frequency_estimation}). If reactive power generation and consumption are
mismatched and power factor is low, high currents develop that lead to high transmission losses.  For this reason grids
include circuits to compensate reactive power imbalances\cite{crastan01}. These circuits can be as simple as inductors
or capacitors connected to a power line but often can be switched to adapt to changing load conditions. Static Var
compensators are particularly fast-acting reactive power compensation devices whose purpose is to maintain bus
voltage\cite{rogers01}.
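
The link between low power factor and high transmission losses follows from a short calculation. As a simplified
sketch, consider a load drawing real power $P$ at RMS voltage $V_\text{rms}$ with power factor $\mathit{PF}$ through a
line of resistance $R$. These symbols are assumptions of this illustration.
\begin{align*}
    I_\text{rms} &= \frac{P}{V_\text{rms} \cdot \mathit{PF}}, &
    P_\text{loss} &= I_\text{rms}^2 \cdot R = \frac{P^2 R}{V_\text{rms}^2} \cdot \frac{1}{\mathit{PF}^2}
\end{align*}
For the same real power a power factor of $0.8$ thus raises the line current by a factor of $1.25$ and the ohmic loss
by a factor of $1 / 0.8^2 \approx 1.56$ compared to a power factor of $1.0$.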

\subsubsection{Loads}

Lastly, there are the loads that the electrical grid serves. Loads range from mains-powered indicator lights in devices
such as light switches or power strips weighing in at mere milliwatts to large smelters in industrial metal production
that can consume a good fraction of a gigawatt all on their own.

\subsection{Operational concerns}
\subsubsection{Modelling the electrical grid}

Modelling performs an important role in the engineering of a reliable power infrastructure. The grid is a complex,
highly dynamic system. To maintain operational parameters such as voltage in various parts of the grid, grid frequency
and currents inside their specified ranges complex control systems are necessary. To design and parametrize such control
systems simulations are a valuable tool.  Using model calculations the effects of control systems on operational
variables such as transmission efficiency or generation losses can be estimated. Model simulations can be used to
identify structural issues such as potential points of congestion. The same models can then be used to engineer
solutions to such issues, e.g.\ by simulating the effect of a new transmission line.

There are several aspects under which the grid or parts of the grid can be simulated. There are static analysis methods
such as modal analysis that yield information on electromechanical oscillations by computing the eigenvalues of a
large system of differential equations describing the collective behavior of all components of the grid. Modal analysis
is one example of simulations used in grid planning. Using modal analysis likely oscillatory modes can be identified and
ultimately these results can inform a decision to install additional stabilization systems in a particular location.
In contrast to static analysis, transient simulations calculate an approximation of the time-domain behavior of some
variable of interest under a given model. Transient simulations are used e.g.\ in the design of control systems.
Power flow equations describe the flow of electrical energy throughout the network from generator to load. Numerical
solutions of these equations are used to optimize control parameters to increase overall efficiency.
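
To make the two analysis methods more tangible, they can be sketched in their common textbook forms. The notation
below is an assumption of this sketch and not necessarily that of the cited literature. In modal analysis the grid
model is linearized around an operating point as $\dot{x} = A x$, and each complex eigenvalue pair of $A$ describes one
oscillatory mode:
\begin{align*}
    \lambda_i &= \sigma_i \pm j \omega_i, &
    f_i &= \frac{\omega_i}{2 \pi}, &
    \zeta_i &= \frac{-\sigma_i}{\sqrt{\sigma_i^2 + \omega_i^2}}
\end{align*}
For a hypothetical mode with $\lambda = -0.2 \pm j\, 3.14\,\si{\per\second}$ this gives an oscillation frequency of about
\SI{0.5}{\hertz} and a damping ratio of $\zeta \approx 0.064$. The AC power flow equations for bus $i$, with bus voltage
magnitudes $|V_i|$, voltage angle differences $\theta_{ik} = \theta_i - \theta_k$ and the bus admittance matrix entries
$Y_{ik} = G_{ik} + j B_{ik}$, read
\begin{align*}
    P_i &= \sum_k |V_i| |V_k| \paren{G_{ik} \cos \theta_{ik} + B_{ik} \sin \theta_{ik}}, &
    Q_i &= \sum_k |V_i| |V_k| \paren{G_{ik} \sin \theta_{ik} - B_{ik} \cos \theta_{ik}}.
\end{align*}
Since these equations are nonlinear in the voltage magnitudes and angles they are generally solved iteratively.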

% TODO decide what of this to keep.
% \subsubsection{Generator controls}
% \subsubsection{Load shedding}
% \subsubsection{System stability}
% \subsubsection{Power System Stabilizers}

\section{Smart meter technology}

Smart meters were a concept pushed by utility companies throughout the 2000s. Smart metering is one component of the
larger societal shift towards digitally interconnected technology. Old analog meters required that service personnel
physically come to read the meter. \emph{Smart} meters automatically transmit their readings through modern
technologies. Utility companies were very interested in this move not only because of the cost savings for meter reading
personnel. Beyond this, an always-connected meter allows several entirely new use cases that have not been possible
before. One often-cited one is utilizing the new high-resolution load data to improve load forecasting to allow for
greater generation efficiency. Computerizing the meter also allows for new fee models where electricity cost is no
longer fixed over time but adapts to market conditions. Models such as prepayment electricity plans where the customer
is automatically disconnected until they pay their bill are significantly aided by a fully electronic system that can be
controlled and monitored remotely\cite{anderson02}. A remotely controllable load switch can also be used to coerce
customers in situations where that was not previously economically possible\footnote{
    The Swiss association of electrical utility companies in sec.\ 7.2 par.\ (2)a of their 2010 whitepaper on the
    introduction of smart metering\cite{vseaes01} cynically writes that remotely controllable load switches ``lead a new
    tenant to swiftly register'' with the utility company. This whitepaper completely vanished from their website some
    time after publication, but the Internet Archive has a copy.
}. Figure \ref{fig_smgw_schema} shows a schema of the smart metering installation in a typical household\cite{stuber01}.

\begin{figure}
    \centering
    \includegraphics{resources/smgw_usage_scenario}
    \caption{A typical usage scenario of a smart metering system in a home.}
    \label{fig_smgw_schema}
\end{figure}

To the customer the utility of a smart meter is largely limited to the convenience of being able to read it without
going to the basement. In the long term it is said that there will be second-order savings to the customer since
electricity prices adapting to the market situation along with this convenience will lead them to consume less
electricity and to consume it in a way that is more amenable to utilities, both leading to reduced
cost\cite{borlase01,bmwi03,anderson02}.

Traditional Ferraris counters with their distinctive rotating aluminium disc are simple electromechanical devices. Since
they do not include any failure-prone semiconductors or other high technology, cheap Ferraris-style meters can easily
last decades. In contrast to this, smart meters are complex high technology. They are vastly more expensive to develop
in the first place since they require the development and integration of large amounts of complex, custom firmware. Once
deployed, their lifetime is severely limited by this very complexity. Complex semiconductor devices tend to fail, and
firmware that needs to communicate with the outside world tends to not age well\cite{borkar01}.
This combination of higher unit cost and lower expected lifetime leads to grossly increased costs per household. This
cost is usually shared between utility and customer.

As part of its smart metering rollout the German government in 2013 had a study conducted on the economics of smart
meter installations. This study came to the conclusion that for the majority of households computerizing an existing
Ferraris meter is uneconomical. For larger consumers or new installations the higher cost of installation over time is
offset by the resulting savings in electricity cost\cite{bmwi03}.

\subsection{Human-Computer Interaction aspects of smart meter technology}

A fundamental aspect in realizing the cost and energy savings promised by the smart metering revolution is that it
requires a paradigm shift in consumer interaction. Previously most consumers would only confront their energy use when
their monthly or yearly electricity bill arrived. All of the cost savings smart meters promise over traditional metering
infrastructure\footnote{
    We are excluding savings from Demand-Side Response (DSR) implemented through smart meters here: Traditional ripple
    control systems already allowed for these, and due to the added cost of high-power relays many smart meters do not
    include such features.
} critically depend on the consumer regularly interacting with the meter through an in-home display or app. We live in
an era where our attention is already highly contested. A myriad of apps and platforms compete for our attention through
our smartphones and other devices. Introducing an entirely new service into this already complex battleground is a large
endeavour. On the one hand it is not clear how this new service would compete with everything else. On the other hand if
it does manage to capture our attention and lead us to modify our behavior, what are the side effects? For instance,
does an in-home display increase financial anxiety in economically disadvantaged customers?

Human Computer Interaction research has touched the topic of smart metering several times and has many insights to offer
for technologists\cite{pierce01,rodden01,lupton01,costanza01,fell01}. An issue pointed out in \cite{rodden01} is that at
least in some countries consumers fundamentally distrust their utility companies. This trust issue is exacerbated by
smart meters being unilaterally forced onto consumers by utility companies. Much of the success of smart metering's
ubiquitous promises of energy savings fundamentally depends on consumer coöperation. Here, the aforementioned trust
issue calls into question smart metering's chances of long-term success.

As \cite{pierce01} pointed out, smart metering developments could benefit greatly from early involvement of HCI research.
HCI research certainly would not have overlooked entire central issues such as privacy, as happened in the Dutch
case\cite{cuijpers01}. The current corporate-driven approach to a technological advance forced through national
standardization bears a serious risk of failing to meet its ostensible objectives for consumers. The role of consumers
and the complex sociotechnological environment posed by this new technology is seriously considered nowhere in the
standardization process. While certainly no one will admit to outright ignoring consumers in smart meter standardization,
their role is largely limited to the occasional public consultation. At the same time the standards are written by
technologists--it seems largely without input on their practicality or socio-technological implications from fields such
as HCI. % TODO citation? too much burn?

\subsection{Common components}
\label{sm-cpu}

Smart meters are usually built around an off-the-shelf microcontroller. Some meters use specialized smart metering
SoCs\cite{ifixit01} while others use standard microcontrollers with core metering functions implemented in external
circuitry (cf.\ sec.\ \ref{sec-easymeter} where we detail the meter in our demonstration setup). Specialized SoCs
usually contain a segment LCD driver along with some high-resolution analog-to-digital converters for the actual
measurement functions. In many smart meter designs used outside of Germany the metering SoC will be connected to another
full-featured SoC acting as the modem. At a casual glance this might seem to be a security measure, but it may be more
likely that this is done to ease integration of one metering platform with several different communication stacks (e.g.\
proprietary sub-gigahertz wireless, powerline communication (PLC) or Ethernet). In these architectures there is a clear
line of functional demarcation between the metering SoC and the modem. As evidenced by over-the-air software update
functionality (see e.g.\ \cite{honeywell01}), this does not, however, extend to an actual security boundary.

Energy usage is calculated by measuring both voltage and current at high resolution and then integrating the
measurements. Current measurements are usually made with either a current transformer or a shunt in a four-wire
configuration. Voltage is measured by dividing the AC input voltage down with a resistor chain. The product of both
measurements, the instantaneous power, is integrated digitally using the MCU's time base as a reference.
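
As a minimal sketch of this computation, assume that the meter samples voltage $v$ and current $i$ simultaneously at a
fixed interval $T_s$ derived from the MCU's time base. These symbols are our own notation for illustration and are not
taken from any particular meter's documentation. The energy $E$ consumed over $N$ samples is then approximated by a sum
over the instantaneous power:
\begin{equation*}
    E = \int_0^{N T_s} v(t)\, i(t)\, \mathrm{d}t \approx T_s \sum_{k=0}^{N-1} v(k T_s)\, i(k T_s)
\end{equation*}
Since $T_s$ enters this sum as a linear factor, a relative error in the time base translates to first order into a
relative error of the same magnitude in the measured energy.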

Whereas legacy electromechanical energy meters only provided a display of aggregate energy use through a decimal counter
as well as an indirect indication of power through a rotating wheel, one of the selling points of smart meters is their
ability to calculate advanced statistics on energy use. These statistics are supposed to help customers better target
energy conservation measures\cite{bmwi03}.

Beyond pure measurement and data aggregation, smart meters can perform additional functions. One is
to serve as a gateway between the utility company's control systems and large controllable loads in the consumer's
household for Demand-Side Management (DSM)\cite{borlase01}.  In DSM the utility company can control when exactly a
high-power device such as a water storage heater is turned on. To the customer the precise timing does not matter since
the storage heater is set so that it has enough hot water in its reservoir at all times. The utility company however can
use this degree of control to reduce load variations during temporary imbalances such as peaks. The efficiency gains
realized with this system translate into lower electricity prices for DSM-enabled loads for the customer. Traditionally
DSM was realized on a local level using ripple control systems. In ripple control, control data is encoded by modulating a
carrier at a low frequency such as \SI{400}{\hertz} on top of the regular mains voltage. These systems require
high-power transmitters at tens of kilowatts and still can only bridge regional distances\cite{dzung01}.
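
The superposition of carrier and mains voltage can be sketched as follows. As simplifying assumptions of ours, we take
the data to be on-off keyed onto the carrier and ignore phase offsets. The symbols are our own notation for
illustration:
\begin{equation*}
    v(t) = \hat{V}_m \sin(2\pi f_m t) + b(t)\, \hat{V}_r \sin(2\pi f_r t)
\end{equation*}
Here $f_m$ is the mains frequency, $f_r$ the ripple carrier frequency (e.g.\ \SI{400}{\hertz}), $\hat{V}_m$ and
$\hat{V}_r$ are the respective amplitudes, and $b(t) \in \{0, 1\}$ is the slowly varying data signal.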

Another important additional function is that in some countries some smart meters can be used to remotely disconnect
consumer households with outstanding bills. Using euphemisms such as \emph{utility revenue protection}\cite{kamstrup01}
or \emph{reducing nontechnical losses}\cite{brown01} while cynically claiming \emph{Consumer
Empowerment}\cite{kamstrup01}, these systems allow a utility company to remotely disconnect a customer at any time.
Whereas before smart metering this required either additional hardware or an expensive site visit by a qualified
technician, smart meters have ushered in an era of frictionless control\footnote{
    Note that in some countries such as the UK non-networked mechanical prepayment meters did exist. In such systems the
    user inserts coins into a coin slot that activates a load switch at the household's main electricity connection.
    These systems were non-networked and did not allow for remote control. Disadvantages of such systems compared to
    modern \emph{smart} systems are the high cost of the coin acceptor and the overhead of site visits required to empty
    the coin box\cite{anderson02}.
}.

\subsection{Cryptographic coprocessors}

Just like in legacy electricity meters, physical security is still a key component of the overall system design in smart
meters. Since in both types of meter cost depends on physical quantities being measured at the customer premises,
customers can save cost if they are able to falsify the meter's measurements without being
detected\cite{anderson02}. For this reason both types of meters employ countermeasures against physical intrusion.
Compared to high-risk devices such as card payment processing terminals or ATMs the tamper proofing used in smart meters
is only basic\cite{anderson02}. Common measures include sealing the case by irreversibly ultrasonically welding front
and back plastic shells together or the use of security seals on the lid covering the input/output screw terminals.
Low-tech attacks using magnets to saturate the current transformer's ferrite cores are detected using Hall
sensors\cite{anderson02,anderson03,itron01,hager01,easymeter01}.  German smart metering standards specify the use of a
smartcard-like security module to provide transport encryption and other cryptographic
services\cite{bsi-tr-03109-2,bsi-tr-03109-2-a}. During our literature review we did not find many references to similar
requirements in other national standards, though this does not mean that individual manufacturers do not use smartcards
for engineering reasons or due to pressure from utilities. The limited documentation on meter internals that we did find
such as \cite{ifixit01} suggests that where no such regulation exists, manufacturers and utilities likely choose to forgo
such advanced measures and instead settle on simple software implementations.

\subsection{Physical structure and installation}

Smart meters are installed like traditional electricity meters. In Japan this means they are usually installed on an
exterior wall and need to be resistant against weather and extreme environmental conditions (direct sunlight, high
temperature, high humidity). In Germany the meter is always installed either indoors or in an outdoor utility closet
that is sealed to keep out the weather. In most countries the meter is connected through large integrated screw
terminals. In the US meters compliant with the domestic ANSI C12 standard are round and plug into a large socket that is
wired into the house or apartment's electrical connection.

Modern smart meters are usually made with plastic cases. Ferraris meters often used cases stamped from sheet metal with
glass windows on them. Smart meters now look much more like other modern electronic devices. A common construction style
is to separate the case into a front and back half with both halves clipped or ultrasonically welded together. Ultrasonic
welding gives a robust, airtight interface. This interface cannot easily be separated and re-connected without leaving
visible traces, which helps with tamper evidence properties. As an industry-standard process common in various consumer
goods ultrasonic welding is a cheap and accessible technology\cite{easymeter01,ifixit01}.

Communication interfaces sometimes are brought out through regular electromechanical connectors but often also are
optical interfaces. A popular style here is to use a regular UART connected to an LED/phototransistor optocoupler
mounted on the side of the case. The user interface is usually limited to an LCD. For cost and ingress
protection reasons smart meters rarely use mechanical buttons. Some smart meters use a phototransistor mounted behind the
faceplate that can be activated with a flashlight as a crude contactless input device\cite{easymeter01}.

All meters provide several options for security seals to be installed to detect opening of the meter or access to its
terminal block. The shape and type of these security seals vary. Factory-installed seals are used to detect tampering
with the meter itself while seals applied by the utility during meter installation are used to guard the meter's terminal
block and detect attempts at bypassing\cite{czechowski01}.

\section{Regulatory frameworks around the world}

Smart metering regulation varies from country to country as it is tightly coupled to the overall regulation of the
electrical grid. The standardization of the physical form factor and metrological parameters of a meter is usually
separate from the standardization of its \emph{smart} functionality. Most countries base the standard for their meters'
outward-facing communication interface on a family of standards unified under the IEC as DLMS/COSEM. Given this common
base protocol, country-specific standardization only covers which precise variant of it is spoken and which features are
supported.

\subsection{International standards}

The family of standards one encounters most in smart metering applications is IEC 62056, which specifies the Device
Language Message Specification (DLMS) and the Companion Specification for Energy Metering (COSEM). DLMS/COSEM are
application-layer standards describing a request/response schema similar to e.g.\ HTTP. DLMS/COSEM are mapped onto a
multitude of wire protocols. They can be spoken over TCP/IP or mapped onto low-speed UART serial interfaces
\cite{sato01,stuber01}.  Besides DLMS/COSEM there are a multitude of standards usually specifying how DLMS/COSEM are to
be applied.

DLMS/COSEM show some amount of feature creep. They do not adhere to the age-old systems design adage that a tool should
\emph{do one thing and do it well}. Instead they try to capture the convex hull of all possible applications. This led
to a complicated design that requires extensive additional specification and testing to maintain even basic
interoperability. In particular in the area of transport security it becomes evident that the IEC, as an electrical
engineering standards body, overstretched its area of expertise and that resorting to established standard protocols
would have improved the situation\cite{weith01}. Compared to industry-standard transport security the IEC standards provide
a simplistic key management framework based on a static shared key with unlimited lifetime and provide sub-optimal
transport security properties (e.g.\ lack of forward-secrecy)\cite{khurana01,sato01}.
% TODO maybe expand this?

\subsection{The regulatory situation in selected countries}

In this section we will give an overview of the situation in a number of countries. This list of countries is not
representative: it notably does not include any developing countries and is geographically biased. We selected these
countries for illustration only and based our selection in large part on the availability of information in a language
we read. We will conclude this section with a summary of common themes.

\subsubsection{Germany}

Germany standardized smart metering on a national level. Apart from the calibration standards applying to any type of
meter, smart meters are covered by a set of communications and security standards developed by the German Federal Office
for Information Security (BSI). Germany mandates smart meter installations for newly constructed buildings and during
major renovations but does not require most legacy residential installations to be upgraded. This is a consequence of a
2013 cost-benefit analysis that found these upgrades to be uneconomical for the majority of residential
customers\cite{bmwi03,bmwi1,bmwe01,brown01}.

The German standards strictly separate metering and communication functions. The two are split into separate
devices, the \emph{meter} and the \emph{gateway} (called \emph{smart meter gateway} in full and often abbreviated
\emph{SMGW}). One or several meters connect to a gateway through a COSEM-derived protocol. The communication interface
between meter and gateway can optionally be physically unidirectional. A unidirectional interface eliminates any
possibility of meter firmware compromise via this interface. The gateway contains a cryptographic security module similar to a
smartcard\cite{mahlknecht01} that is entrusted with signing of measurements and maintaining an authenticated and
encrypted communication channel with its authorities. Security of the system is certified according to a Common Criteria
process.

The German specification does not include any support for load switches outside of demand-side management as they are
common in some other countries. It does not prohibit the installation of one behind the smart meter installation. This
makes it theoretically possible for a utility company to still install a load switch to disconnect a customer, but this
would be a separate installation from the smart meter. In Germany there are significant barriers that have to be overcome
before a utility company may cut power to a household\cite{delaw01}. The elision of a load switch means attacks on
German meters will be limited in influence to billing irregularities and attacks using DSM equipment.

% TODO elaborate DSM attacks vs. whole-household attacks in attacks section

\subsubsection{The Netherlands}
The Netherlands were early to take initiative to roll out smart metering after its recognition by the European
Commission in 2006\cite{cuijpers01,ec04}. After overcoming political issues the Netherlands were above the European
median in 2018, having replaced almost half of all meters\cite{cuijpers01,ec03}. Dutch smart meters are standardized by a
consortium of distribution system operators. They integrate gateway and metrology functions into one device. The
utility-facing interface is an IEC DLMS/COSEM-based interface over cellular radio such as GPRS or
LTE\cite{aubel01}. Like e.g.\ the German standard, the Dutch standard precisely specifies all communication
interfaces of the meter\cite{dsmrp3}. Another parallel is that the Dutch standard also does not cover any functionality
for remotely disconnecting a household. This absence of a load switch limits attacks on Dutch smart meters to causing
billing irregularities.

\subsubsection{The UK}

The UK is currently undergoing a smart metering rollout. Meters in the UK are nationally standardized to provide both
Zigbee ZSE-based and IEC DLMS/COSEM connectivity. UK smart metering specifications are shared between electrical and gas
meters. Unlike other countries' specifications, the UK national specifications require electrical meters to have an
integrated load switch and gas meters to have an integrated valve. In Northern Ireland most consumers use prepaid
electricity contracts\cite{anderson02}.  Prepayment and credit functionality are also specified in the UK's national
smart metering standard, as is remote firmware update functionality\cite{ukgov02}. Outside communication in these
standards is performed through a gateway (there called \emph{communications hub}) that can be shared between several
meters\cite{ukgov01,ukgov02,ukgov03,brown01,sato01}. The combination of both gas and electricity metering into one
family of standards and the exceptionally large set of \emph{required} features make the UK regulations the most
maximalist among the ones in this section. The mandatory inclusion of both load switches and remote connectivity up to
remote firmware update makes UK meters an interesting attack target.

\subsubsection{Italy}

Italy was among the first countries to legally mandate the widespread installation of smart meters in households. In
2006 and 2007 Italy set by law a starting date of 2008 for the rollout\cite{brown01}. The Italian electricity market was
recently privatized. While the privatization of the wholesale market and transmission network has advanced, the vast
majority of retail customers continued to use the incumbent distribution system operator ENEL as their supplier\cite{ec03}.
This dominant position allowed ENEL to orchestrate the large-scale rollout of smart meters in Italy. Almost every meter in
Italy had been replaced by a smart meter by 2018\cite{ec03}. A unique feature of the Italian smart metering
infrastructure is that it relies on Powerline Communication (PLC) to bridge distances between meters and cellular radio
gateways\cite{gungor01}.

\subsubsection{Japan}

Japan is currently rolling out smart metering infrastructure. Compared to other countries in Japan significant
standardization effort has been spent on smart home integration\cite{usitc01,sato01,brown01}. Japan has domestic
standards (JIS) for metrology and physical dimensions. The TEPCO deployment currently being rolled out is based on the
IEC DLMS/COSEM standards suite for remote meter reading in conjunction with the Japanese ECHONET protocol for the
home-area network. Smart meters are connected to TEPCO's backend systems through the customer's internet connection,
sub-gigahertz radio based on 802.15.4 framing, regular landline internet or PLC\cite{toshiba01,sato01}.

A unique point in the Japanese utility metering landscape is that the current practice is monthly manual readings. In
Japan residential utility meters are usually mounted outside the building on an exterior wall and every month someone
with a mirror on a long stick will come and read the meter. The meter reader then makes a thermal paper print-out of the
updated utility bill and puts it into the resident's post box. This practice gives consumers good control over their
consumption but does incur significant personnel overhead. % TODO decide on citation. Maybe the toshiba one?

\subsubsection{The USA}

In the USA the rollout of smart meters has been promoted by law as early as 2005. The US electricity market is highly
complex with states having significant authority to decide on their own policies\cite{brown01}. Unlike the IEC
standards used in a large fraction of the rest of the world, the USA has its own domestic set of standards for smart
meters developed by ANSI\cite{sato01}. The main difference between IEC and ANSI-standard meters is that ANSI-standard
meters are round devices that plug into a wall-mounted socket while IEC devices are usually rectangular and connected
directly to the mains wiring through large screw terminals\cite{ifixit01}.

\subsection{Common themes}

Researching the current situation around the world for the above sections we were able to distill some common themes.
First, smart metering is slowly advancing on a global scale and despite significant reservations from privacy-conscious
people and consumer advocates it seems it is here to stay. There are some notable exceptions of countries that have
decided to scale back an ongoing rollout effort after subsequent analysis showed economic or other
issues\footnote{cf.\ the Netherlands and Germany}.

\subsubsection{The introduction of smart metering}

The smart meter rollout is largely driven by utility companies. Utility companies field a variety of arguments for the
rollout. The most prominent argument is a general increase in energy-efficiency along with a reduction of emissions.
This argument is based on the estimation that smart metering will increase private customers' awareness of their own
consumption and this will lead them to reduce their consumption. The second highly popular argument for smart metering
is that it is necessary for the widespread adoption of renewable energies. This argument again builds on the trend
towards \emph{green} energy to rationalize smart metering. Often it is formulated as an \emph{inevitability} instead of
a choice.

Academic reception of smart metering is marked by an almost unanimous enthusiasm. In particular smart meter
communication infrastructure has received a large amount of research
attention\cite{dzung01,gungor01,kabalci01,lloret01,mahmood01,yan01,anderson01,anderson02}. Outside of human-computer
interaction claims that smart meters will reduce customer energy consumption have often been uncritically accepted.

\subsubsection{Standardization and reality of smart devices}

Regulators, utilities and academics meet in their enthusiasm on the issue of smart home integration of smart metering. A
feature of many setups is that the meter acts as the centerpiece of a modern, fully integrated smart
home\cite{aubel01,geelen01,bsi-tr-03109-1,abdallah01}. The smart meter serves as a communication hub between a new class
of grid-aware loads and the utility company's control center. Large (usually thermal) loads such as dishwashers,
refrigerators and air conditioners are forecast to intelligently adapt their heating/cooling cycles to better match
the grid's supply. A frequent scenario is that in which the meter bills the customer using near-real-time pricing, and
supplies large loads in the customer's household with this pricing information. These loads then intelligently schedule
their operation to minimize cost\cite{sato01}. At the time in the mid-2000s when smart metering proposals were first
advanced this vision might have been an effect of the \emph{law of the instrument}\cite{kaplan01,anderson02}. Back then
outside of specialty applications household devices were not usually networked\cite{merz01}. Smart meters at the time
may have seemed the obvious choice for a smart home communications hub.

From today's perspective, this idea is obviously outdated. Smart \emph{things} now have found their way into many homes.
Only these things are directly interconnected through the internet--foregoing the home-area network (HAN) technologies
anticipated by the smart metering pioneers. The simple reason for this is that nowadays almost everyone has Wifi, and Wifi
transceivers have become inexpensive enough to disappear in the bill of materials (BOM) cost of a large home device such
as a washing machine. Smart meters are usually situated in the basement--physically far away from most of one's devices.
This makes connecting them to said devices awkward, and connecting them via the local Wifi raises the question why the
smart devices should not simply use the internet in the first place.

Connecting things to a smart meter through a local bus is academically appealing. It promises cost-savings from a
simpler physical layer (such as ZigBee instead of Wifi) and it neatly separates concerns into \emph{home infrastructure}
and the regular internet. Communication between smart meter and devices never leaves the house. This gives potential
additional tolerance to utility backend systems breaking. It also physically keeps communication inside the house,
bypassing the utility's eyes and improving both customer privacy and agency. The presently popular model of a device as
simple as a light switch proxying its every action through a manufacturer's servers somewhere on the public internet is
in stark contrast to this scenario. Alas, the reason that this model is so popular is that in most cases it simply
works. Device manufacturers simply integrate one of many off-the-shelf Wifi modules. The resulting device will work
anywhere on earth\footnote{For some places channel assignments may have to be updated. This is a configuration-level
change and in some devices is done by the end-user during provisioning.}. A HAN-connected device would have several
variants with different modems for different standards. Some might work across countries, but some might not. And in
some countries there might not even be a standard for smart grid HANs.

Looking at the situation like this raises the question why this realization has not yet found its way into mainstream
acceptance by smart metering implementors. The customer-facing functionality promised through smart meters would be
simple to implement as part of a now-standard \emph{internet of things} application. An in-home display that shows
real-time energy consumption and cost statistics would simply be an Android tablet fetching summarized data from the
utility's billing backend. Demand-side response by large loads would be as simple as an HTTP request with a token
identifying the customer's contract that returns the electricity price the meter is currently charging along with a
recommendation to switch on or off. It seems the smart home has already arrived while smart metering standardization is
still getting off the starting blocks\cite{anderson02}.
% TODO is this too critical? Is maybe the modern smart home compatible with smart meters? Is maybe the local-only path
% of data, avoiding utility clouds a design feature? (may be true in DE, NL, probably not anywhere else)

\section{Security in smart distribution grids}

The smart grid in practice is nothing more or less than an aggregation of embedded control and measurement devices that
are part of a large control system. This implies that all the same security concerns that apply to embedded systems in
general also apply to most components of a smart grid in some way. Where programmers have been struggling for decades
now with input validation\cite{leveson01}, the same potential issue raises security concerns in smart grid scenarios as
well\cite{mo01, lee01}. Only, in the smart grid we have two complicating factors present: Many components are embedded
systems, and as such inherently hard to update. Also, the smart grid and its control algorithms act as a large
(partially-)distributed system, making problems such as input validation or authentication difficult to
implement\cite{blaze01} and adding a host of distributed systems problems on top\cite{lamport01}.

Given that the electrical grid is a major piece of essential infrastructure in modern civilization, these problems
amount to significant issues in practice. Attacks on the electrical grid may have grave
consequences\cite{anderson01,lee01} all the while the long maintenance cycles of various components make the system slow
to adapt. Thus, components for the smart grid need to be built to a much higher standard of security than most consumer
devices to ensure they stand up to well-funded attackers even decades down the road. This requirement intensifies the
challenges of embedded security and distributed systems security among others that are inherent in any modern complex
technological system. The safety-critical nature of modern smart metering ecosystems in particular was quickly
recognized by security experts\cite{anderson01}.

A point we will not consider in much depth is theft of electricity. An incentive for the introduction of smart metering
that is frequently cited in utility industry publications outside of the general public's view is the reduction of
electricity theft\cite{czechowski01}. Academic papers tend to either focus on other benefits such as generation
efficiency gains through better forecasting or try to rationalize the fundamentally anti-consumer nature of smart
metering with strenuous claims of ``enormous social benefits''\cite{mcdaniel01}. Academics rarely point out the large
economic incentive such \emph{revenue protection} mechanisms provide\cite{anderson01,anderson02}.

This thesis will focus entirely on grid stability and disregard electricity theft. For the attack scenarios we lay out,
billing inaccuracies of utility companies are of very low urgency compared to grid stability. In fact stability is a
precondition for billing to happen. Additionally utility companies can already limit the volume of theft by
cross-referencing meter readings against trusted readings from upstream sections of the grid. This capability works even
without smart meters and only gains speed from smart meters. A smart meter cannot prevent the customer from bypassing it
with a section of wire.  Due to the limit on its volume, electricity theft using smart meter hacking would not scale.
Hackers would quickly be triangulated with no damage to consumers and limited damage to utility companies.
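
The cross-referencing argument can be made concrete with a simple energy balance. As an illustrative sketch in our own
notation, let $E_\text{up}$ be the energy recorded by a trusted upstream meter over some interval, $E_i$ the readings of
the $n$ customer meters downstream of it over the same interval, and $E_\text{loss}$ an estimate of the technical losses
in that grid section:
\begin{equation*}
    E_\text{unacc} = E_\text{up} - \sum_{i=1}^{n} E_i - E_\text{loss}
\end{equation*}
Any energy stolen in this section, whether through a manipulated meter or through a bypass, shows up in the unaccounted
remainder $E_\text{unacc}$. Up to measurement and loss estimation errors, this remainder bounds the volume of theft in
the section.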

\subsection{Privacy in the smart grid}

A serious issue in smart metering setups is customer privacy. Even though the meter ``only'' collects aggregate energy
consumption of a whole household this data is highly sensitive\cite{markham01}. This counterintuitive fact was initially
overlooked in smart meter deployments leading to outrage, delays and reduced features\cite{cuijpers01}. The root cause
for this is that given sufficient timing resolution these aggregate measurements contain ample entropy. Through
disaggregation individual loads can be identified and through pattern matching even complex usage patterns can be
discerned with alarming accuracy\cite{greveler01}. Similar privacy issues arise in many other areas of modern life
through pervasive tracking and surveillance\cite{zuboff01}. What makes the case of smart metering worse is that even the
fig leaf of consent such practices hide behind does not apply here. If I as a citizen do not consent to Google's privacy
policy Google says I can choose not to use their service. In today's world this may not be a free choice making this
argument totally invalid, but it is at least technically possible. Smart metering on the other hand is mandated by law.
In some countries such as Germany a customer unwilling to accept the accompanying privacy violation cannot legally
evade it\cite{bmwi04}.
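
To illustrate why aggregate measurements reveal individual loads, consider the following deliberately simplified model
in our own notation. Assume a household has $m$ loads, where load $j$ draws a constant power $P_j$ while it is switched
on. The meter then observes
\begin{equation*}
    P(t) = \sum_{j=1}^{m} s_j(t)\, P_j + \varepsilon(t)
\end{equation*}
with $s_j(t) \in \{0, 1\}$ the on/off state of load $j$ and $\varepsilon(t)$ collecting measurement noise and unmodeled
consumption. Disaggregation amounts to estimating the states $s_j(t)$ from $P(t)$. If the timing resolution is high
enough that loads rarely switch within the same sample, a step of approximately $\pm P_j$ between consecutive samples
identifies load $j$ switching on or off, provided the $P_j$ are sufficiently distinct.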

\subsection{Smart grid components as embedded devices}

A fundamental challenge in smart grid implementations is the central role smart electricity meters play. Smart meters
are used both for highly-granular load measurement and (in some countries) load switching\cite{zheng01}.
Smart electricity meters are effectively consumer devices. They are built down to a certain price point that is measured
by the burden it puts on consumers. The cost of a smart meter is ultimately limited by it being a major factor in the
economics of a smart meter rollout\cite{bmwi03}. Cost requirements preclude some hardware features such as the use of a
standard hardened software environment on a high-powered embedded system (such as a hypervisor-based embedded Linux
setup) that would both increase resilience against attacks and simplify updates. Combined with the small market sizes in
smart grid deployments\footnote{
    Most vendors of smart electricity meters only serve a handful of markets. For the most part, smart meter development
    cost lies in the meter's software. % TODO cite?
    There exist multiple competing standards applicable to various parts of a smart electricity meter. In addition,
    most countries have their own certification regime\cite{cenelec01}. This complexity creates a large development
    burden for new market entrants\cite{perez01}.
}
this produces a high cost pressure on the software development process for smart electricity meters.

\subsection{The state of the art in embedded security}

Embedded software security generally is much harder than security of higher-level systems. This is due to a combination
of the unique constraints of embedded devices (hard to update, usually small quantity) and their lack of capabilities
(processing power, memory protection functions, user interface devices). Even very well-funded companies continue to
have serious problems securing their embedded systems. A spectacular example of this difficulty is the recently-exposed
flaw in Apple's iPhone SoC first-stage ROM bootloader\footnote{
    Modern system-on-chips integrate one or several CPUs with a multitude of peripherals, from memory and DMA
    controllers through 3D graphics accelerators down to general-purpose IO modules for controlling things like indicator
    LEDs. Most SoCs boot from one of several boot devices such as flash memory, Ethernet or USB according to a
    configuration set e.g.\ by connecting some SoC pins a certain way or by device-internal write-once fuse bits.

    Physically, one of the processing cores of the SoC (usually one of the main CPU cores) is connected such that it is
    taken out of reset before all other devices, and is tasked with switching on and configuring all other devices of
    the SoC. In order to run later initialization code or more advanced bootloaders, this core on startup runs a very
    small piece of code hard-burned into the SoC in the factory. This ROM loader initializes the most basic peripherals
    such as internal SRAM memory and selects a boot device for the next bootloader stage.

    Apple's ROM loader performs some authorization checks, to ensure no unauthorized software is loaded. The present
    flaw allows an attacker to circumvent these checks, booting code not authorized by Apple on a USB-connected iPhone,
    compromising Apple's chain of trust from ROM loader to userland right at its root.
}, which allows a full compromise of any iPhone before the iPhone X. The iPhone 8, one of the affected models, was still
being manufactured and sold by Apple until April 2020.  In another instance in 2016 researchers found multiple flaws in the
secure-world firmware used by Samsung in their mobile phone SoCs.  These included both severe architectural
flaws such as secret user input being passed through untrusted userspace processes without any protection and shocking
cryptographic flaws such as CVE-2016-1919\footnote{\url{http://cve.circl.lu/cve/CVE-2016-1919}}\cite{kanonov01}.  And
Samsung is not the only large multinational corporation having trouble securing their secure world firmware
implementation. In 2014 researchers found an embarrassing integer overflow flaw in the low-level code handling untrusted
input in Qualcomm's QSEE firmware\cite{rosenberg01}. For an overview of ARM TrustZone including a survey of academic
work and past security vulnerabilities of TrustZone-based firmware see \cite{pinto01}.

If all of these very large companies have trouble securing parts of their secure embedded software stacks measuring a
mere few hundred bytes in Apple's case or a few kilobytes in Qualcomm's, what is a smart electricity meter manufacturer
to do? For their mass-market phones, these companies have R\&D budgets that dwarf some countries' national budgets.

Since thorough formal verification of code is not yet within reach for either large-scale software development or code
heavy in side-effects such as embedded firmware or industrial control software\cite{pariente01}, the two most effective
measures for embedded security are reducing the amount of code on the one hand, and labour-intensively checking and
double-checking this code on the other. A smart electricity meter manufacturer does not have a say in the former since it
is bound by the official regulations it has to comply with, and will likely not have sufficient resources for the
latter. We are left with an impasse: Manufacturers in this field likely do not have the security resources to keep up with
complex standards requirements. At the same time they have no option to reduce the scope of their implementation to
alleviate the burden on firmware security.

\subsection{Attack avenues in the smart grid}

If we model the smart grid as a control system responding to changes in inputs by regulating outputs, on a very high
level we can see two general categories of attacks: Attacks that directly change the state of the outputs, and attacks
that try to influence the outputs indirectly by changing the system's view of its inputs. The former would be an attack
such as one that shuts down a power plant to decrease generation capacity\cite{lee01}. The latter would be an attack
such as one that forges grid frequency measurements where they enter a power plant's control systems to provoke
increasing oscillation in the amount of power generated by the plant according to the control systems'
directions\cite{kosut01,wu01,kim01}.
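
The two categories can be written down compactly. The following is only a notational sketch: the symbols $x$,
$\hat{x}$, $u$, $K$, $a_x$ and $a_u$ are our own illustrative assumptions and are not taken from the cited works. Let
$x(t)$ denote the true grid state, $\hat{x}(t)$ the state as seen by the control system, $K$ the control law and $u(t)$
the resulting output.
\begin{align*}
    \hat{x}(t) &= x(t) + a_x(t) && \text{(attack on the inputs: falsified measurements)} \\
    u(t) &= K(\hat{x}(t)) + a_u(t) && \text{(attack on the outputs: forced actuation)}
\end{align*}
In the unattacked system $a_x = a_u = 0$. An attack of the first category corresponds to $a_u \neq 0$. An attack of the
second category corresponds to $a_x \neq 0$, in which case the unmodified control law $K$ itself turns the falsified
input into a harmful output.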

\subsubsection{Communication channel attacks}

Communication channel attacks are attacks on the communication links between smart grid components. These could be
attacks on IP-connected parts of the core network or attacks on shared buses between smart meters and IP gateways in
substations. Generally, these attacks can be mitigated by securing the aforementioned communication links using modern
cryptography. IP links can be protected using TLS, and lower-level buses can be protected using more lightweight
protocols based on Noise\cite{perrin01}.

Cryptographic security reduces an attacker's ability to manipulate communication contents to a mere denial-of-service
attack. Thus, in addition to cryptographic security, safe behavior under DoS conditions is required to guarantee
continued system performance under attack. This safety property is identical to the safety required to withstand
random outages of components, such as communications link outages due to physical damage from storms or
flooding\cite{sato01}. In general, attacks at the meter level are hard to weaponize.  Meters primarily serve billing purposes.
The use of smart meter data for load forecasting is not yet common practice.  Additionally, smart meter data will only be
used to refine existing forecasting models based on aggregate data collected at higher vantage points in the
distribution grid. This combination of smart metering data with more trusted aggregate data from sensors within the grid
infrastructure limits the potential impact of a data falsification attack on smart meters. It also allows the utility to
identify potentially corrupt meter readings and thus detect manipulation above a certain threshold.  In order for an
attack to have more far-reaching consequences the attacker would need to compromise additional grid
infrastructure\cite{kim01,kosut01}.

\subsubsection{Exploiting centralized control systems}

The type of smart grid attack most often cited in popular discourse, and to the author's knowledge the only type that
has so far been conducted in practice, is a direct attack on centralized control systems. In this attack, computer
components of control systems are compromised by the same techniques used to compromise any other kind of computer
system such as spear phishing, exploiting insecure services running on internet-exposed ports and using one compromised
system to compromise other systems on the same ostensibly secure internal network. These attacks are very powerful as
they give the attacker direct control over whatever outputs the control systems are controlling. If an attacker manages
to compromise the right set of control computers, they may even be able to cause a blackout\cite{lee01}.

Despite their potentially large impact, these attacks are only moderately interesting from a scientific perspective. For
one, their mitigation mostly consists of a straightforward application of security practices well-known for decades.
Though there is room for the implementation of genuinely new, application-specific security systems in this field, the
general state of the art is lagging behind other fields of embedded security. Against this background low-hanging fruit
should take priority\cite{heise02}.

Given political will these systems can readily be fortified. There is only a comparatively small number of them and
having a technician drive to every one of them in turn to install a firmware security update is feasible.

\subsubsection{Control function exploits}

Control function exploits are attacks on the mathematical control loops used by the centralized control system. One
example of this type of attack is the resonance attack described in \cite{wu01}.  In this kind of attack, inputs from
peripheral sensors indicating grid load to the centralized control system are carefully modified to cause a
disproportionately large oscillation in control system action. This type of attack relies on complex resonance effects
that arise when mechanical generators are electrically coupled. These resonances, colloquially called ``modes'', are
well-studied in power system engineering\cite{rogers01,grebe01,entsoe01,crastan03}.  Even disregarding modern attack
scenarios, electrical grids are designed with measures in place to dampen any resonances inherent to grid
structure for the sake of stability. Still, because their analysis requires an accurate grid model, these resonances are
hard to analyze and unlikely to be noticed under normal operating conditions.

Mitigation of these attacks can be achieved by ensuring unmodified sensor inputs to the control systems in the first
place. Carefully designing control systems not to exhibit exploitable behavior such as oscillations is also possible but
harder.

\subsubsection{Endpoint exploits}

One rather interesting attack on smart grid systems is one exploiting the grid's endpoint devices such as smart
electricity meters.  These meters are deployed on a massive scale, with at least one meter per household on
average\footnote{Households rarely share a meter but some households may have a separate meter for detached properties
such as a detached garage or basement.}.  Once meters are compromised, restoring them to an uncompromised state can
potentially be very difficult if it requires physical access to thousands of devices hidden away in private homes.

By compromising smart electricity meters, an attacker can trivially forge the distributed energy measurements these
devices perform. In a best-case scenario, this might only affect billing and lead to customers being under- or
over-charged if the attack is not noticed in time. In a less ideal scenario falsified energy measurements reported by
these devices could impede the correct operation of centralized control systems.

In some countries such as the UK smart meters have one additional function that is highly useful to an attacker: They
contain high-current load switches to disconnect the entire household or business in case electricity bills are left
unpaid for a certain period. In countries that use these kinds of systems on a widespread level, the load disconnect
switch is controlled by the smart meter's central microcontroller. This allows anyone compromising this
microcontroller's firmware to actuate the load switch at will.  Given control over a large number of network-connected
smart meters, an attacker might thus be able to cause large-scale disruptions of power
consumption\cite{anderson01,temple01}.  Combined with an attack method such as the resonance attack from \cite{wu01}
that was mentioned above, this scenario poses a serious danger to grid stability.

In places where Demand-Side Management (DSM) is common this functionality may be abused in a similar way. In DSM the
smart metering system directly controls power to certain devices such as heaters. The utility can remotely switch
these devices on and off to smooth out the load curve. In exchange the customer is billed a lower price
for the energy consumed by these loads. DSM was traditionally done with de-centralized systems mostly through
low-frequency PLC over the distribution grid. Smart metering systems no longer require large, resource-intensive
transmitters in substations and thus potentially allow the rollout of such technology on a much wider scale than before.
This leads to a potentially significant role of DSM systems in the impact calculation of an attack on a smart metering
system. Although DSM does not control as much load capacity as remote disconnect switches do, the attacks cited in the
preceding paragraph still fundamentally apply.

\subsection{Practical threats}

As a highly integrated system the electrical grid is vulnerable to attacks from several angles. One way to classify
attacks is by their motivation. Along this axis we found the following motives:

\begin{description}
    \item[Service disruption.] An attack aimed at disrupting service could e.g.\ aim at causing a blackout. It could
        also be more subtle and target a degradation of parameters such as power quality (voltage,
        frequency and waveform). It could target a particular customer, geographic area or all parts of the grid.
        Possible motivations range from a bored teenage hacker to actual cyberwar\cite{cleveland01,lee01}.
    \item[Commercial disruption.] Simple commercial motives already motivate a wide variety of attacks on grid
        infrastructure\cite{czechowski01}. Though mostly harmless from a cybersecurity point of view, there are
        instances where these attacks put the lives of both the attacker and bystanders at grave risk\cite{anderson01}.
        Such attacks generally aim at the meter itself but a more sophisticated attacker might also target the
        utility's backend computer bureaucracy.
    \item[Data extraction.] The smart grid collects large amounts of data on both individual consumers and on an
        aggregate level. The privacy risk in individual consumers' data is obvious. On the web,
        data collection practices ranging from questionable to flat-out illegal have widely proliferated for various
        purposes up to manipulation of elections\cite{heise03}. Assuming criminals in this field would eschew fertile
        ground such as this due to legal or ethical concerns is optimistic. Even taking the risk to individual customers'
        data out of the equation, aggregate data is still highly attractive to some. Aggregate real-time electricity
        usage data is a potential source of timely information on things such as national social events (through TV set
        energy consumption\cite{greveler01}) or simply the state of the economy.
\end{description}

A factor to consider in all these cases is that one actor's attacks have the potential to weaken system security
overall. An attacker might add new backdoors to gain persistence or they might disable existing mitigations to enable
further steps of their attack.

In this paper we will largely concentrate on attacks of the first type because they both have the most serious
consequences and the most motivated attackers. Attackers that may want to disrupt service include cyberwar operations of
enemy nation states. This type of attacker is both highly skilled and highly funded.

\subsection{Conclusion or, why we are doomed}

We can conclude that a compromise of a large number of smart electricity meters cannot be ruled out. The complexity of
network-connected smart meter firmware makes it exceedingly unlikely that it is in fact flawless. Large-scale
deployments of these devices under some circumstances such as where they are used with load disconnect relays make them
an attractive target for attackers interested in causing grid instability. The attacker model for these devices includes
nation states, who have considerable resources at their disposal.

For a reasonable guarantee that no large-scale compromises of hardware and software built today will happen over a span of
some decades, we would have to radically simplify its design and limit attack surface. Unfortunately, the complexity of
smart electricity meter implementations mostly stems from the large list of requirements these devices have to conform
to. Alas, the standards have already been written, political will has been cast into law and changes that reduce scope
or functionality have become exceedingly unlikely at this point.

A general observation with smart grid systems of any kind is that they constitute a departure from the decentralized
control structure of yesterday's dumb grid and the advent of centralization at an enormous scale. This modern,
centralized infrastructure has been carefully designed to defend against malicious actors and all involved parties have
an interest in keeping it secure. Nevertheless, scaling attacks is inherently harder in decentralized systems than in
centralized systems\cite{anderson02}. Centralization makes for an attractive attack target, since an attacker can employ
the centralized control to their advantage.  From this perspective the centralization of smart metering control systems,
sometimes at a national level\cite{anderson01,anderson02}, poses a security risk.

\chapter{Restoring endpoint safety in an age of smart devices}

As laid out in the previous chapter we cannot fully rule out a large-scale compromise of smart energy meters at some
point in the long-term future. We have to rephrase our claim to security. We cannot rule out exploitation: We have to
limit its impact. Assuming that we cannot strip any functionality from smart meters (it may be required by standards or
for enormous social benefits\cite{mcdaniel01}), all we can do is to flush out an attacker once they are in, i.e.\
mitigation instead of prevention.

In a worst-case scenario an attacker would gain unconstrained code execution (e.g.\ by exploiting a flaw in a network
protocol implementation). Smart meters use standard microcontrollers that do not have advanced memory protection functions
(cf.\ Section~\ref{sm-cpu}). We can assume the attacker has full control over the main microcontroller given any such
flaw. With this control they can actuate the load switch if present. They can transmit data through the device's
communication interfaces or use the user interface components such as LEDs and the LCD. Using the self-programming
capabilities of flash microcontrollers an attacker may even gain persistence. Note that in systems separating
cryptographic functions into some form of cryptographic module\footnote{such as systems used in
Germany\cite{bsi-tr-03109}.} we can be optimistic and assume the attacker has not yet compromised this cryptographic
co-processor.

With the meter's core microcontroller under attacker control we cannot use this microcontroller to restore control over
the system. We have no way of ensuring the attacker does not simply delete a security mechanism we include in the core
microcontroller's firmware.

Our solution to this problem is to add another smaller microcontroller to the smart meter design. This microcontroller
will contain a small piece of software that receives cryptographically authenticated commands from utility companies. On
demand it can reset the meter's core microcontroller to a known-good state. To reliably flush out an attacker from a
compromised core microcontroller we re-program the core microcontroller in its entirety. We propose using JTAG to
re-program the core microcontroller with a known-good firmware image read from a sufficiently large SPI flash connected
to the reset controller. JTAG is supported by most microcontrollers complex enough to be used in a smart meter design.
JTAG programming functionality can be ported to a new microcontroller with relatively little work.

Our solution requires the core microcontroller's JTAG interface to be activated (i.e.\ not fused shut). For our solution
to work the core microcontroller firmware must not be able to permanently disable the JTAG interface by itself.  In
microcontrollers that do not yet provide this functionality this is a minor change that could be added to a custom
microcontroller variant at low cost. On most microcontrollers keeping JTAG open should not interfere with code readout
protection\footnote{Readout protection usually forces a device erase before allowing JTAG access.}. Code secrecy should
be of no concern\cite{schneier01} here but some manufacturers have strong preferences due to a fear of copyright
infringement.

\section{The theory of endpoint safety}
\label{sec_criteria}

In order to gain anything by adding our reset controller to the smart meter's already complex design we must satisfy two
interrelated conditions.
\begin{enumerate}
\item \emph{security} means our reset controller itself does not have any remotely exploitable flaws.
\item \emph{safety} means our reset controller will perform its job as intended.
\end{enumerate}

Note that our \emph{security} property includes only remote exploitation, and excludes any form of hardware attack.
Even though most smart meters provide some level of physical security, we do not wish to make any assumptions on this.
In the following section we will elaborate our attacker model and it will become apparent that sufficient physical
security to defend against all attackers in our model would be infeasible, and thus we will design our overall system
to remain secure even assuming some number of physically compromised devices.
% FIXME expand

\subsection{Attack characteristics}
The attacker model under which these two conditions must hold is as follows. We assume three angles of attack: Attacks
by the customer themselves, attacks by an insider within the utility company that controls the metering systems and
lastly attacks from third parties. Examples of these third parties are hobbyist hackers or outside cyber-criminals on
the one hand, but also other companies participating in the smart grid infrastructure besides the utility company such
as intermediary providers of meter-reading services.

Due to the critical nature of the electrical grid, we have to include hostile state actors in our attacker model. When
acting directly, these would be classified as third-party attackers by the above schema, but they can reasonably be
expected to be able to assume either of the other two roles as well, e.g.\ through infiltration or bribery.  In the
generalized attacker model in \cite{fraunholz01} the authors give a classification of attackers and provide a
taxonomy of attacker properties. In their threat/capability rating, criminals are still considered to have a higher threat
rating than state-sponsored attackers. The New York Times reported in 2016 that some states recruit their hacking
personnel in part from cyber-criminals. If this report is true, in a worst-case scenario we have to assume a
state-sponsored attacker to be the worst of both types. Comparing this against the other attacker types in
\cite{fraunholz01}, this state-sponsored attacker is strictly worse than any other type in both variables. We are left
with a highly-skilled, very well-funded, highly intentional and motivated attacker.

Based on the above classification of attack angles and our observations on state-sponsored attacks, we can adapt
\cite{fraunholz01} to our problem, yielding the following new attacker types:

\begin{enumerate}
    \item \textbf{Utility company insiders controlled by a state actor}
        We can ignore the other internal threats described in \cite{fraunholz01} since an insider cooperating with a
        state actor is strictly worse in every respect.
    \item \textbf{State-sponsored external attackers}
        A state actor can directly attack the system through the internet.
    \item \textbf{Customers controlled by a state actor}
        A state actor can very well compromise some customers for their purposes. They might either physically
        infiltrate the system posing as legitimate customers, or they might simply deceive or bribe existing customers
        into cooperation.
    \item \textbf{Regular customers}
        Though a hostile state actor might gain control of some number of customers through means such as voluntary
        cooperation, bribery or infiltration, they are limited in attack scale since they do not want to arouse premature
        attention. Though regular customers may not have the motivation, skill or resources of a state-sponsored
        attacker, potentially large numbers of them may try to attack a system out of financial incentives. To allow for
        this possibility, we consider regular customers separate from state actors posing as customers in some way.
\end{enumerate}

\subsection{Overall structural system security}

Considering overall security, we first introduce the \emph{reset authority}, a trusted party acting as the single
authority for issuing reset commands in our system. In practice this trusted party may be part of the utility company,
part of an external regulatory body or a hybrid setup requiring both to cooperate. We assume this party will be designed
to be secure against all of the above attacker types. The precise design of this trusted party is out of scope for this
work but we will list some practical suggestions on how to achieve security below. % FIXME do the list
% FIXME put up a large box on this limitation

Using an asymmetric cryptographic design centered around the \emph{reset authority}, we rule out all attacks except for
denial-of-service attacks on our system by any of the four attacker types. All reset commands in our system originate
from the \emph{reset authority} and are cryptographically secured to provide authentication and tamper detection.
Under this model, attacks on the electrical grid components between the \emph{reset authority} and the customer device
degrade into man-in-the-middle attacks. To ensure the \emph{safety} criterion from Section \ref{sec_criteria} holds we
must make sure our cryptography is secure against man-in-the-middle attacks and we must try to harden the system against
denial-of-service attacks by the attacker types listed above. Given our attacker model we cannot fully guard against
this sort of attack but we can at least choose a communication channel that is resilient against denial-of-service
attacks under the above model.

Finally, we have to consider the issue of hardware security. We will solve the problem of physical attacks on some small
number of devices by simply not programming any secret information into these devices. This also simplifies hardware
production. From consideration in this work we explicitly rule out any form of supply-chain attack as
out-of-scope.
% FIXME include considerations on production testing somewhere (is the device working? is the right key programmed?)
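
The role of asymmetric cryptography in this design can be summarized as a single acceptance rule. This is a sketch
under assumptions, not a protocol definition: the symbols $sk_{RA}$, $pk_{RA}$, $m$ and $\sigma$ as well as the abstract
functions $\mathrm{Sign}$ and $\mathrm{Verify}$ are placeholders for an unspecified signature-like scheme.
\begin{align*}
    \sigma &= \mathrm{Sign}(sk_{RA}, m) && \text{(computed by the reset authority only)} \\
    \text{accept } (m, \sigma) &\iff \mathrm{Verify}(pk_{RA}, m, \sigma) = 1 && \text{(checked by each reset controller)}
\end{align*}
Only the public value $pk_{RA}$ would be stored in a device, so extracting it from a physically compromised meter does
not enable an attacker to forge reset commands. A man-in-the-middle can still suppress $(m, \sigma)$ in transit, which
is the denial-of-service case discussed above.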

\subsection{Complex microcontroller firmware}

The \emph{security} property from Section~\ref{sec_criteria} relies in large part on the security of our reset
controller firmware. The best method to increase firmware security is to reduce attack surface by limiting external
interfaces as much as possible and by reducing code complexity as much as possible.
% FIXME formalize this as something like "Design Goal DG-023-42-1" ?
If we avoid the complexity of most modern microcontroller firmware we gain another benefit beyond implicitly reduced
attack surface: If the resulting design is small enough we may attempt formal verification of our security property.
Though formal verification tools are not yet suitable for highly complex tasks they are already adequate for small
amounts of code and simple interfaces.

\subsection{Modern microcontroller hardware}

Microcontrollers have gained enormously in both performance/efficiency as well as in peripheral support. Alas, these
gains have largely been driven by insatiable customer demand for faster, more powerful chips and for a long time
security has not been considered important outside of some specific niches such as smartcards. Traditionally a
microcontroller would spend its entire lifetime without ever being exposed to any networks. Though this trend has been
reversing with the increasing adoption of internet-of-things devices
and more advanced security features have started appearing in general-purpose microcontrollers, most still lack even
basic functionality found in processors for computers or smartphones.

One of the components lacking from most microcontrollers is strong memory protection or even a memory management unit as
it is found in all modern computer processors and SoCs for applications such as smartphones. Without an MPU/MMU some
mitigations for memory safety violations cannot be implemented.  This and the absence of virtualization tools such as
ARM's TrustZone make hardening microcontroller firmware a big task.  It is very important to ensure memory safety in
microcontroller firmware through tools such as defensive coding, extensive testing and formal verification.

In our design we achieve simplicity on two levels: One, we isolate the very complex metering firmware from our reset
controller by having both run on separate microcontrollers. Two, we keep the reset controller firmware itself extremely
simple to reduce attack surface there.

\subsection{Regulatory and economical constraints}
%FIXME

\subsection{Safety vs. security: Opting for restoration instead of prevention}

By implementing our reset system as a physically separate microcontroller we sidestep most security issues around the
main application microcontroller.  There are some simple measures that can be taken to harden the main application
firmware. Implementing industry best practices such as memory protection or stack canaries will harden the system and
increase the cost of an attack but it will not yield a system that we can be confident enough in to say it is fully
secure. The complexity of the main application controller firmware makes fully securing the system a formidable effort,
and one that would have to be repeated by every meter vendor for every one of their code bases.

In contrast to this our reset system does not provide any additional security. Any attack that could occur without it
can still occur with it in place. What it provides is a fail-safe mechanism that can quickly immobilize a malicious
actor even mid-attack. It does this in a way that can be adapted to any meter architecture and any microcontroller
platform with low effort since it relies on established standard interfaces such as JTAG and SWD. Concentrating
research and development resources on a single platform like this allows for a system that is more economical to
implement across device series and across vendors.

Attack resilience in the power grid can benefit from a safety-focused approach. The greater danger an attack on smart
meters poses is not the temporary denial of service of utility metering functions. Even in a highly integrated smart
grid as envisioned by utility companies, metering functions serve to increase efficiency and
reduce cost but are not necessary for the grid to function at all. % TODO citation
Thus if we can provide mere \emph{safety} with a fail-safe semantic instead of unattainable perfect \emph{security} we
have gained resilience against a large class of realistic attack scenarios.

\subsection{Technical outline of a safety reset system}

There are several ways our system could be practically implemented. The most basic way is to add a separate
microcontroller connected to the meter's main application MCU and optionally other embedded microcontrollers such as
modems. This discrete chip could either be placed on the metering board itself or it could be placed on a separate PCB
connected to the programming interface(s) of the metering board. In certain cases the latter might allow use in
otherwise unmodified legacy designs.

The safety reset controller would be a much simpler MCU than the meter's main application controller. Its software can
be kept simple, leading to low program flash and RAM requirements. Since it does not need to address rich peripherals such
as external parallel memory, LCDs etc.\ it can be a physically small, low-pin count device. If the main application
controller is supposed to be reset to a full factory image with little or no reduction in functionality, its firmware
image is certainly too large for the reset controller's embedded flash. Thus a realistic setup would likely use an
external SPI flash chip to store this image.

The most likely interface for resetting the main application controller and possibly other microcontrollers such as
modem chips would be each controller's integrated programming port, such as JTAG. There exist a variety of programming
interfaces for microcontrollers but for moderately complex ones JTAG has grown to be by far the most broadly supported
one. Parallel high-voltage flash programming has come to be uncommon in modern microcontrollers and most chips nowadays
use some form of a serial interface. Some vendors have their own proprietary serial in-system programming interfaces
that they use on certain parts instead of or in addition to JTAG. The reasons for this usually are either lower
complexity in parts that do not require full debugging capabilities as provided by JTAG or the high pin count of JTAG.

The kind of microcontroller that would likely be used as the main application controller in a smart meter application
will almost certainly support JTAG. These microcontrollers are high pin-count devices since they need to connect to a
large set of peripherals such as the LCD and the large program flash makes it likely for a proper debugging interface to
be present.

The one remaining issue in this coarse technical outline is what communication interface should be used to transmit the
trigger command to the reset controller. In the following section we will give an overview on communication interfaces
established in energy metering applications and evaluate each of them for our purpose.

\section{Communication channels on the grid}

There are a number of well-established technologies for communication on or along power lines. We can distinguish three
basic system categories: Systems using separate wires (such as DSL over landline telephone wiring), wireless radio
systems (such as LTE) and \emph{powerline communication} (PLC) systems that re-use the existing mains wiring and
superimpose data transmissions on the \SI{50}{\hertz} mains sine\cite{gungor01,kabalci01}.

For our scenario, we will ignore short-range communication systems. There exists a large number of \emph{wideband}
powerline communication systems that are popular with consumers for bridging ethernet between parts of an apartment or
house.  These systems transmit at up to several hundred megabits over distances up to several tens of
meters\cite{kabalci01}.  Technologically, these wideband PLC systems are very different from \emph{narrowband} systems
used by utilities for load management among other applications and they are not relevant to our analysis.

\subsection{Powerline communication (PLC) systems and their use}

In long-distance communications for applications such as load management, PLC systems are attractive since they allow
re-using the existing wiring infrastructure and have been used as early as the 1930s\cite{hovi01}. Narrowband PLC
systems are a potentially low-cost solution to the problem of transmitting data at small bandwidth over distances of
several hundred meters up to tens of kilometers.

Narrowband PLC systems transmit on the order of kilobits per second or slower.  A common use of this sort of system is
\emph{ripple control}. These systems superimpose a low-frequency signal at a carrier frequency of a few hundred hertz
on top of the \SI{50}{\hertz} mains sine. This low-frequency signal is used to encode switching commands for
non-essential residential or industrial loads. Ripple control systems provide utilities with the ability to actively
control demand while promising small savings in electricity cost to consumers\cite{dzung01}.

In any PLC system there is a strict tradeoff between bandwidth, power and distance. Higher bandwidth requires higher
power and reduces maximum transmission distance. Where ripple control systems usually use few transmitters to cover
the entire grid of a regional distribution utility, higher-bandwidth bidirectional systems used for automatic meter
reading (AMR) in places such as Italy or France require repeaters within a few hundred meters of a transmitter.

\subsection{Landline and wireless IP-based systems}

Especially in automated meter reading (AMR) infrastructure the cost-benefit tradeoff of powerline systems does not
always work out for utilities. A common alternative in these systems is to use the public internet for communication.
Using the public internet has the advantage of low initial investment on the part of the utility company as well as
quick commissioning. Disadvantages compared to a PLC system are potentially higher operational costs due to recurring
fees to network providers as well as lower reliability. Being integrated into power grid infrastructure, a PLC system's
failure modes are highly correlated with the overall grid. Put briefly, if the PLC interface is down, there is a good
chance that power is out, too. In contrast to this, general internet services exhibit a multitude of failures that are
entirely decorrelated from power grid stability.

For purposes such as meter reading for billing, this level of reliability is sufficient. However, for systems that need to
hold up in crisis situations such as the recovery system we are contemplating in this thesis, the public internet may
not provide sufficient reliability.

\subsection{Short-range wireless systems}

Smart meters contain copious amounts of firmware but still pale in comparison to the complexity of full-scale computers
such as smartphones. For short-range communication between a meter and a cellular radio gateway mounted nearby or
between a meter and a meter-reading operator in a vehicle on the street, a protocol such as Wi-Fi (IEEE 802.11) might be
too complex in most cases. Absent widely-used standards in this space, proprietary radio protocols instead grow very
attractive. These might be based on some standardized lower-level protocol such as ZigBee (based on IEEE 802.15.4) or
might be entirely home-grown. To a meter manufacturer a proprietary radio protocol has several advantages. It is easy to
implement and requires no external certification. It can be customized to its specific application. In addition it
provides some level of vendor lock-in to customers sharing infrastructure such as a cellular radio gateway between
multiple devices. In other fields such as home automation, a lack of standardization has led to a proliferation of
proprietary protocols and a fragmented protocol landscape. There, this is a large problem since consumers cannot easily
integrate products made by different manufacturers into one system. In advanced metering infrastructure this is
unlikely to be a disadvantage since usually there is only one distribution grid operator for an area. Additionally,
shared resources such as a cellular radio gateway would most likely only be shared within a single building, and within
a single building usually all meters are operated by the same provider.

Systems in Europe commonly support Wireless M-Bus, a European standardized protocol\cite{silabs01} that operates on
several ISM bands\footnote{
    Frequency bands that can be used for \emph{Industrial, Scientific and Medical} applications by anyone and that do
    not require obtaining a license for transmitter operation. Manufacturers can use whatever protocol they like on
    these bands as long as they obtain certification that their transmitters obey certain spectral and power
    limitations.
}. ZigBee is another popular standard and some vendors additionally support their own proprietary protocols\footnote{
    For an example see \cite{honeywell01}.
}.
% TODO expand this?

\subsection{Frequency modulation as a communication channel}

For our system, we chose grid frequency modulation (henceforth GFM) as a low-bandwidth uni-directional broadcast
communications channel.  Compared to traditional PLC, GFM requires only a small amount of additional hardware, works
reliably throughout the grid and is harder to manipulate by a malicious actor.

Grid frequency in Europe's synchronous areas is nominally \SI{50}{\hertz}, but there are small load-dependent variations from
this nominal value. Any device connected to the power grid (or even just within physical proximity of power wiring) can
reliably and accurately measure grid frequency at low hardware overhead. By intentionally modifying grid frequency, we
can create a very low-bandwidth broadcast communication channel. Grid frequency modulation has only ever been proposed
as a communications channel at very small scales in microgrids before\cite{urtasun01} but to our knowledge has not yet
been considered for large-scale application.

Advantages of using grid frequency for communication are low receiver hardware complexity as well as the fact that a
single transmitter can cover an entire synchronous area. Though the transmitter has to be very large and powerful, setup
of a single large transmitter faces lower bureaucratic hurdles than integration of hundreds of smaller ones into
hundreds of local systems each with autonomous governance.

\subsubsection{The load dependency of grid frequency}

Despite the awesome complexity of large power grids the physics underlying their response to changes in load and
generation is surprisingly simple. Individual machines (loads and generators) can be approximated by a small number of
differential equations and the entire grid can be modelled by aggregating these approximations into a large system of
nonlinear differential equations. Evaluating these systems it has been found that in large power grids small-signal
steady-state changes in generation/consumption power balance cause an approximately linear change in
frequency\cite{kundur01,crastan03,entsoe02,entsoe04}. \emph{Small signal} here describes changes in power balance that
are small compared to overall grid power.  \emph{Steady state} describes changes over a timeframe of multiple cycles as
opposed to transient events that only last a few milliseconds.

This approximately linear relationship allows the specification of a coefficient linking $\Delta P$ and $\Delta f$ with
unit \si{\watt\per\hertz}.  In this thesis we are using the European power grid as our model system. We are
using data provided by ENTSO-E (formerly UCTE), the governing association of European transmission system operators. In
our calculations we use data for the continental European synchronous area, the largest synchronous area. $\frac{\Delta
P}{\Delta f}$, called \emph{Overall Network Power Frequency Characteristic} by ENTSO-E, is around
\SI{25}{\giga\watt\per\hertz}.
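
As a rough worked example of this linear relationship, the following converts a load step into a frequency deviation.
The symbol $\lambda$ for the characteristic and the \SI{250}{\mega\watt} step are illustrative choices of ours and are
not taken from ENTSO-E documents.
\begin{equation*}
    |\Delta f| \approx \frac{|\Delta P|}{\lambda}
    = \frac{\SI{250}{\mega\watt}}{\SI{25}{\giga\watt\per\hertz}}
    = \SI{10}{\milli\hertz}
\end{equation*}
Within the small-signal regime described above, doubling the load step would double the resulting deviation.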

We can derive general design parameters for any system utilizing grid frequency as a communications channel from the
policies of ENTSO-E\cite{entsoe02,entsoe03}.  Any such system should stay below a modulation amplitude of
\SI{100}{\milli\hertz} which is the threshold defined in the ENTSO-E incidents classification scale for a Scale 0-1
(from ``Anomaly'' to ``Noteworthy Incident'' scale) frequency degradation incident\cite{entsoe03} in the Continental
Europe synchronous area.

\subsubsection{Control systems coupled to grid frequency}

The ENTSO-E Operations Handbook Policy 1 chapter defines the activation threshold of primary control to be
\SI{20}{\milli\hertz}. Ideally a modulation system would stay well below this threshold to avoid fighting the primary
control reserve. Modulation line rate should likely be on the order of at most a few hundred millibaud.  Modulation at
such high rates would outpace primary control action which is specified by ENTSO-E as acting within between ``a few
seconds'' and \SI{15}{\second}.

The effective \emph{Network Power Frequency Characteristic} of primary control in the European grid is reported by
ENTSO-E at around \SI{20}{\giga\watt\per\hertz}. Keeping modulation amplitude below the \SI{20}{\milli\hertz}
activation threshold would help to avoid spuriously triggering these control functions.  This works out to an upper
bound on modulation power of about \SI{20}{\mega\watt} per \si{\milli\hertz} of frequency deviation.
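
Spelled out under the assumption that the linear characteristic holds all the way up to the \SI{20}{\milli\hertz}
activation threshold of primary control, the arithmetic is:
\begin{equation*}
    \SI{20}{\giga\watt\per\hertz} = \SI{20}{\mega\watt\per\milli\hertz},
    \qquad
    \SI{20}{\mega\watt\per\milli\hertz} \cdot \SI{20}{\milli\hertz} = \SI{400}{\mega\watt}
\end{equation*}
Under this assumption a modulation power of roughly \SI{400}{\mega\watt} would correspond to a deviation right at the
activation threshold, so a practical transmitter would aim to stay below that figure.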

\subsubsection{An outline of practical transmitter implementation}

In its most basic form a transmitter for grid frequency modulation would be a very large controllable load connected to
the power grid at a suitable vantage point. A spool of wire submerged in a body of cooling water (such as a small lake
with a fence around it) along with a thyristor rectifier bank would likely suffice to perform this function during
occasional cybersecurity incidents.  We can however decrease hardware and maintenance investment even further compared
to this rather uncultivated solution by repurposing regular large industrial loads to our transmitter purposes in an
emergency situation.  For some preliminary exploration we went through a list of energy-intensive industries in
Europe\cite{ec01}.  The most electricity-intensive industries in this list are primary aluminium and steel production.
In primary production raw ore is converted into raw metal for further refinement such as casting, rolling or extrusion.
In steelmaking iron is melted in an electric arc furnace. In aluminium smelting aluminium is electrolytically extracted
from alumina. Both processes involve large amounts of electricity with electricity making up \SI{40}{\percent} of
production costs. Given these circumstances a steel mill or aluminium smelter would be good candidates as transmitters
in a grid frequency modulation system.

In aluminium smelting high-voltage mains is transformed, rectified and fed into about 100 series-connected cells forming
a \emph{potline}. Inside the pots alumina is dissolved in molten cryolite electrolyte at about
\SI{1000}{\degreeCelsius} and electrolysis is performed using a current of tens or hundreds of kiloamperes. Resulting
pure aluminium settles at the bottom of the cell and is tapped off for further processing.

Like steelworks, aluminium smelters are operated night and day without interruption. Aside from metallurgical issues the
large thermal mass and enormous heating power requirements do not permit power-cycling. Due to the high costs of
production inefficiencies or interruptions the behavior of aluminium smelters under power outages is a fairly
well-characterized phenomenon in the industry. The recent move away from nuclear power and to renewable energy has led
to an increase in fluctuations of electricity price throughout the day. These electricity price fluctuations have
provided enough economic incentive to aluminium smelters to develop techniques to modulate smelter power consumption
without affecting cell lifetime or the output product\cite{duessel01,eisma01}. Power outages of tens of minutes up to
two hours reportedly do not cause problems in aluminium potlines and are in fact part of routine operation for purposes
such as electrode changes\cite{eisma01,oye01}.

The power supply system of an aluminium plant is managed through a highly-integrated control system as keeping all cells
of a potline under optimal operating conditions is challenging. Modern power supply systems employ large banks of diodes
or SCRs to rectify low-voltage AC to DC to be fed into the potline\cite{ayoub01}. The potline voltage can be controlled
almost continuously through a combination of a tap changer and a transductor. The individual cell voltages can be
controlled by changing the anode to cathode distance (ACD) by physically lowering or raising the anode.  The potline
power supply is connected to the high voltage input and to the potline through isolators and breakers.

In an aluminium smelter most of the power is sunk into resistive losses and the electrolysis process. As such an
aluminium smelter does not have any significant electromechanical inertia compared to the large rotating machines used
in other industries. Depending on the capabilities of the rectifier controls high slew rates should be possible,
permitting modulation at high\footnote{Aluminium smelter rectifiers are \emph{pulse rectifiers}. This means instead of
simply rectifying the incoming three-phase voltage they use a special configuration of transformer secondaries and in
some cases additional coils to produce a large number (such as 6) of equally spaced phases. Where
a direct-connected three-phase rectifier would draw current in 6 pulses per cycle a pulse rectifier draws current in
more, smaller pulses to increase power factor. E.g. a 12-pulse rectifier will draw current in 12 pulses per cycle. In
the best case an SCR pulse rectifier switched at zero crossing should allow \SIrange{0}{100}{\percent} load changes from
one rectifier pulse to the next, i.e. within a fraction of a single cycle.} data rates.

% FIXME validate this \subsubsection with an expert

\subsubsection{Avoiding dangerous modes}

Modern power systems are complex electromechanical systems. Each component is controlled by several carefully tuned
feedback loops to ensure voltage, load and frequency regulation. Multiple components are coupled through transmission
lines that themselves exhibit complex dynamic behavior. The overall system is generally stable, but may exhibit some
instabilities to particular small-signal stimuli\cite{kundur01,crastan03}. These instabilities, called \emph{modes},
occur when due to mis-tuning of parameters or physical constraints the overall system exhibits oscillation at particular
frequencies.  These are separated into four categories in \cite{kundur01}:

\begin{description}
    \item[Local modes] where a single power station oscillates in some parameter
    \item[Interarea modes] where subsections of the overall grid oscillate w.r.t.\ each other due to weak coupling
        between them
    \item[Control modes] caused by imperfectly tuned control systems
    \item[Torsional modes] that originate from electromechanical oscillations in the generator itself
\end{description}

The oscillation frequencies associated with each of these modes are usually between a few tens of millihertz and a few
hertz\cite{grebe01,entsoe01,crastan03}. It is hard to predict the particular modes of a power system at the scale of the
central European interconnected system. Theoretical analysis and simulation may give rough indications but cannot yield
conclusive results. Due to the obvious danger as well as high economic impact due to inefficiencies, experimental
measurements are infeasible. Finally, modes are highly dependent on the power grid's structure and will change with
changes in the power grid over time. For all of these reasons, a grid frequency modulation system must be designed very
conservatively without relying on the absence (or presence) of modes at particular frequencies. A concrete design
guideline that we can derive from this situation is that the frequency spectrum of any grid frequency modulation system
should not exhibit any notable peaks and should avoid a concentration of spectral energy in certain frequency ranges.

\subsubsection{Overall system parameters}

In conclusion we end up with the following tunable parameters for a grid frequency modulation system based on a large
controllable load:

\begin{description}
    \item[Modulation amplitude.] Amplitude is proportionally related to modulation power. In a practical setup we might
        realize a modulation power up to a few hundred \si{\mega\watt} which would yield maybe a few tens of
        \si{\milli\hertz} of frequency amplitude.
    \item[Modulation pre-emphasis and slew-rate control.] Pre-emphasis might be necessary to ensure an adequate
        Signal-to-Noise ratio (SNR) at the receiver. Slew-rate control and other shaping measures might be necessary to
        reduce the impact of these sudden load changes on the transmitter's primary function (say, aluminium smelting)
        and to prevent disturbances to grid components.
    \item[Modulation frequency.] For a practical implementation a careful study would be necessary to determine an
        optimal frequency band for operation. On one hand we need to prevent disturbances to the grid such as through
        excitation of some local or inter-area modes. On the other hand we need to optimize Signal-to-Noise ratio (SNR)
        and data rate to achieve optimal latency between transmission start and successful reception and to reduce the
        overall burden on transmitter and grid.
    \item[Further modulation parameters.] The modulation itself has numerous parameters that are discussed in sec.\
        \ref{mod_params} below.
\end{description}

\section{From grid frequency to a reliable communication channel}

\subsection{Channel properties}
In this section we will explore how we can construct a reliable communication channel from the analog primitive we
outlined in the previous section. Our load control approach to grid frequency modulation leads to a channel with the
following properties.

\begin{description}
    \item[Slow-changing.] Accurate grid frequency measurements need several periods of the mains sine wave. Faster
        sampling rates can be achieved with more complex specialized synchrophasor estimation algorithms but this will
        result in a tradeoff between sampling rate and accuracy\cite{belega01}.
    \item[Analog.] Grid frequency is an analog signal.
    \item[Noisy.] While stable over long periods of time thanks to Load-Frequency Control\cite{entsoe04} it shows
        considerable random short-term variations. In addition our modulation amplitude is limited by technical and
        economic constraints so we have to find a system that will work at poor SNRs.
    \item[Polarized.] Grid frequency measurements have an inherent sense of \emph{up} (higher frequencies). We can use
        this in a polarized modulation scheme to encode information without first transmitting some reference signal to
        establish this polarization.
\end{description}

\subsection{Modulation and its parameters}
\label{mod_params}

In this section we will consider how to select a good set of parameters for a modulation scheme fitting grid frequency
modulation.

The sensitivity of the grid to oscillation at particular frequencies described above means we should avoid any
modulation technique that would concentrate a lot of energy in a small bandwidth. Taking this principle to its extreme
provides us with a useful pointer towards techniques that might work well: Spread-spectrum techniques. By employing
spread-spectrum modulation we can produce an almost ideal frequency-domain behavior that spreads the modulation energy
almost flat across the modulation bandwidth\cite{goiser01} while at the same time achieving some modulation gain,
increasing system sensitivity.  This modulation gain spread-spectrum techniques yield potentially allows us to use a
weaker stimulus, allowing further reduction of the probability of disturbance to the overall system. Spread-spectrum
techniques also inherently allow us to tune the tradeoff between receiver sensitivity and data rate. This tunability is
a highly useful parameter to have for the overall system design.
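
The following minimal sketch illustrates the spectral argument above. It compares a periodic on/off keying pattern
with a pseudo-random chip sequence and reports which fraction of the one-sided spectral energy ends up in the single
strongest FFT bin. The helper name \texttt{peak\_energy\_fraction}, the sequence length and the random seed are
assumptions made for illustration only.

\begin{minted}{python}
import numpy as np

def peak_energy_fraction(chips):
    """Fraction of the one-sided spectral energy found in the strongest FFT bin."""
    power = np.abs(np.fft.rfft(chips)) ** 2
    return power.max() / power.sum()

n = 1024                                   # illustrative sequence length
rng = np.random.default_rng(0)             # fixed seed for reproducibility

# Periodic +1/-1 keying: all energy is concentrated at one frequency.
square = np.where(np.arange(n) % 2 == 0, 1.0, -1.0)
# Pseudo-random bipolar chips: energy is spread across the whole band.
spread = rng.choice([-1.0, 1.0], size=n)

square_fraction = peak_energy_fraction(square)
spread_fraction = peak_energy_fraction(spread)
print(f"periodic: {square_fraction:.3f}, pseudo-random: {spread_fraction:.3f}")
\end{minted}

The periodic pattern places essentially all of its energy in one bin, which is exactly the kind of concentration that
could excite a grid mode, while the pseudo-random sequence leaves only a small share in any single bin.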

Spread spectrum covers a whole family of techniques. In \cite{goiser01} these techniques are divided into the coarse
categories of \emph{Direct Sequence Spread Spectrum}, \emph{Frequency Hopping Spread Spectrum} and \emph{Time Hopping
Spread Spectrum}.

In \cite{goiser01} a BPSK or similar modulation is assumed underlying the spread-spectrum technique. Our grid frequency
modulation channel effectively behaves more like a DC-coupled wire than a traditional radio channel: Any change in
excitation will cause a proportional change in the receiver's measurement. Using our FFT-based measurement methodology
we get a real-valued signed quantity. In this way grid frequency modulation is similar to a channel using coherent
modulation. We can transmit not only signal strength, but polarity too.
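
To illustrate what a real-valued signed measurement can look like, here is a minimal sketch of one possible FFT-based
frequency estimator. It is not necessarily the measurement method used elsewhere in this thesis. The function name
\texttt{estimate\_frequency}, the sampling rate, the window and the zero-padding length are all assumptions chosen for
illustration, and the input is a clean synthetic sine rather than a real mains recording.

\begin{minted}{python}
import numpy as np

def estimate_frequency(samples, fs, n_fft=2**20):
    """Estimate the dominant frequency (in Hz) of `samples` taken at `fs` Hz."""
    window = np.hanning(len(samples))                 # reduce spectral leakage
    # Zero-padding to n_fft interpolates the spectrum to a grid of fs / n_fft.
    spectrum = np.abs(np.fft.rfft(samples * window, n=n_fft))
    peak_bin = int(np.argmax(spectrum))
    return peak_bin * fs / n_fft

fs = 1000.0                                           # assumed sampling rate in Hz
t = np.arange(int(fs)) / fs                           # one second, i.e. 50 mains cycles
true_frequency = 50.02                                # nominal 50 Hz plus 20 mHz
samples = np.sin(2 * np.pi * true_frequency * t)

estimate = estimate_frequency(samples, fs)
deviation = estimate - 50.0                           # signed: positive means "up"
print(f"estimated deviation: {deviation * 1000:.1f} mHz")
\end{minted}

Subtracting the nominal frequency yields a quantity whose sign carries the polarity of the modulation, which is the
property the text above relies on.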

For our purposes we can discount both Time and Frequency Hopping Spread Spectrum techniques. Time hopping helps reduce
interference between multiple transmitters but does not help with SNR any more than Direct Sequence does since all it
does is allow other transmitters to transmit.  Our system is strictly limited to a single transmitter so we do not
gain anything through Time Hopping.

Frequency Hopping Spread Spectrum techniques require a carrier. Grid frequency modulation itself is very limited in
peak frequency deviation $\Delta f$. Frequency hopping could only be implemented as a second modulation on top of GFM,
but this would not yield any benefits while increasing system complexity and decreasing data bandwidth.

Direct Sequence Spread Spectrum is the only remaining approach for our application. Direct Sequence Spread Spectrum
works by directly modulating a long pseudorandom bit sequence onto the channel. The receiver must know the same
pseudo-random bit sequence and continuously calculates the correlation between the received signal and the pseudo-random
template sequence mapped from binary $\{0, 1\}$ to bipolar $\{1, -1\}$. Since the pseudorandom sequence has an
approximately equal number of $0$ and $1$ bits, the correlation between the sequence and uncorrelated noise is small.
The positive contribution of the $+1$ terms of the correlation template approximately cancels out with the $-1$ terms
when multiplied
with an uncorrelated signal such as white gaussian noise or another pseudo-random sequence.
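
The correlation receiver described above can be sketched in a few lines. This is a simplified illustration under
stated assumptions and not the receiver implementation of this thesis: the template is a random bipolar sequence
standing in for a proper pseudo-random code, the noise is white and gaussian, and the sequence length, noise level and
start offset are arbitrary example values.

\begin{minted}{python}
import numpy as np

rng = np.random.default_rng(1)             # fixed seed for reproducibility
n_chips = 1023                             # illustrative sequence length

# Stand-in pseudo-random sequence, already mapped to bipolar +1/-1.
template = rng.choice([-1.0, 1.0], size=n_chips)

offset = 400                               # hypothetical start of the transmission
received = rng.normal(0.0, 2.0, size=3000) # noise with twice the chip amplitude
received[offset:offset + n_chips] += template

# Sliding correlation of the received signal against the known template.
# Entry k is the correlation for a transmission starting at sample k.
correlation = np.correlate(received, template, mode="valid")
detected_offset = int(np.argmax(correlation))
print(f"detected start at sample {detected_offset}")
\end{minted}

Even though each individual chip is buried in noise, the $+1$ and $-1$ terms only add up coherently at the correct
offset, so the correlation peak stands out clearly from the background.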

By using a family of pseudo-random sequences with low cross-correlation, channel capacity can be increased. Either the
transmitter can encode data in the choice of sequence or multiple transmitters can use the same channel at once.  The
longer the pseudo-random sequence the lower its cross-correlation with noise or other pseudorandom sequences of the same
length. Choosing a long sequence we increase modulation gain while decreasing bandwidth. For any given application the
sweet spot will be the shortest sequence that is long enough to yield sufficient SNR for subsequent processing layers
such as channel coding.

Gold codes are a popular family of codes used in many DSSS systems. A set of Gold codes has small cross-correlations. For some
value $n$ a set of Gold codes contains $2^n + 1$ sequences of length $2^n - 1$. Gold codes are generated from two
different maximum length sequences generated by linear feedback shift registers (LFSRs). For any bit count $n$ there are
certain empirically determined preferred pairs of LFSRs that produce Gold codes with especially good cross-correlation.
The $2^n + 1$ Gold codes are defined as the XOR sum of both LFSR sequences with one of them shifted by $0$ to $2^n-2$
bits, as well as the two individual LFSR sequences. Given LFSR sequences \texttt{a} and \texttt{b} in numpy notation
this is \mintinline{python}{[a, b] + [ a ^ np.roll(b, shift) for shift in range(len(b)) ]}.
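
As a self-contained sketch of this construction, the following generates a Gold code set for $n = 5$. The helper names
\texttt{lfsr\_sequence} and \texttt{gold\_codes} are our own, and the tap sets correspond to the polynomials
$x^5 + x^2 + 1$ and $x^5 + x^4 + x^3 + x^2 + 1$, a commonly tabulated preferred pair for $n = 5$ that we assume here
for illustration. It is not necessarily the pair or the code length used in our prototype.

\begin{minted}{python}
import numpy as np

def lfsr_sequence(taps, n):
    """One period (2**n - 1 bits) of the sequence a[k+n] = XOR of a[k+t] for t in taps."""
    state = [1] * n                        # any non-zero seed works
    out = []
    for _ in range(2**n - 1):
        out.append(state[0])
        feedback = 0
        for t in taps:
            feedback ^= state[t]
        state = state[1:] + [feedback]     # shift and append the new bit
    return np.array(out, dtype=np.uint8)

def gold_codes(n, taps_a, taps_b):
    """Return the 2**n + 1 Gold codes built from two maximum length sequences."""
    a = lfsr_sequence(taps_a, n)
    b = lfsr_sequence(taps_b, n)
    return [a, b] + [a ^ np.roll(b, shift) for shift in range(len(b))]

# Assumed preferred pair for n = 5 (see lead-in).
codes = gold_codes(5, taps_a=(0, 2), taps_b=(0, 2, 3, 4))
print(len(codes), "codes of length", len(codes[0]))
\end{minted}

For longer codes only \texttt{n} and the two tap sets change. The tap sets have to form a preferred pair for the
resulting set to have the low cross-correlation that makes Gold codes attractive.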

In DSSS modulation the individual bits of the DSSS sequence are called \emph{chips}. Chip duration determines modulation
bandwidth\cite{goiser01}. In our system we are directly modulating DSSS chips on mains frequency without an underlying
modulation such as BPSK as it is commonly used in DSSS systems.

\subsection{Error-correcting codes}

To make our overall system reliable we have to layer some channel coding on top of our DSSS modulation. The messages we
expect to transmit are at least a few tens of bits long. We are highly constrained in SNR due to limited transmission
power. With lower SNR comes higher BER (bit error rate). For a given BER, the probability of receiving a packet without
any bit error decreases exponentially with transmission length.
For our relatively long transmissions we would realistically get unacceptable error rates.
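
To make this concrete, assume for simplicity that bit errors are independent and that no error correction is applied.
The packet error rate (PER) of an $N$-bit transmission is then
\begin{equation*}
    \mathrm{PER} = 1 - (1 - \mathrm{BER})^N .
\end{equation*}
The following numbers are purely illustrative and not measurements from our system: with $\mathrm{BER} = 10^{-2}$ and
$N = 64$ we get $\mathrm{PER} = 1 - 0.99^{64} \approx 0.47$, so nearly every second transmission would be corrupted.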

Error correcting codes are a very broad field with many options for specialization. Since we are implementing nothing
more than a prototype in this thesis we chose to not expend resources on optimization too much and settled on a basic
Reed-Solomon code. The state of the art has advanced considerably since the discovery of Reed-Solomon
codes\cite{mackay01}.  The main areas of improvement are overhead and decoding speed. Since message length in our system
limits system response time but we do not have a fixed target we can tolerate some degree of overhead.  Decoding speed
is of very low concern to us because our data rate is extremely low.

An important concern for our prototype implementation was the availability of reference implementations of our error
correcting code. We need a Python implementation for test signal generation on a regular computer and we need a small C
or C++ implementation that we can adapt to embedded firmware. Reed-Solomon codes are a popular textbook example of
error-correcting codes and we had no particular difficulty finding either.

\subsection{Cryptographic security}
\label{sec-crypto}

Informally the system we are looking for can be modelled as consisting of three parties: the trusted
\emph{transmitter}, one of a large number of untrusted \emph{receivers}, and an \emph{attacker}. These three play
according to the following rules:

\begin{description}
    \item[Access.] Both transmitter and attacker can transmit any bit sequence.
    \item[Indistinguishability.] The receiver receives any transmission by either but cannot distinguish between them.
    \item[Kerckhoffs's principle.] The attacker knows anything any receiver might know\cite{kerckhoff01,kerckhoff02}.
    \item[Priority.] The transmitter is stronger than an attacker and will ``win'' during simultaneous transmission.
    \item[Seeding.] Both transmitter and receiver can be seeded out-of-band with some information on each other such as
        public key fingerprints.
\end{description}

We are not considering situations where an attacker attempts to jam an ongoing transmission. In practice there are
several avenues to prevent such attempts. Compromised loads that are being abused by the attacker can be manually
disconnected by the utility. Error-correcting codes can be used to provide resiliency against small-scale disturbances.
Finally, the transmitter can be designed to have high enough power to be able to override any likely attacker.

Our goal is to find a cryptographic primitive that has the following properties:
\begin{description}
    \item[Authenticity.] The transmitter can produce a message bit sequence that a subset of receivers can identify as
        being generated by the transmitter. On reception of this sequence, all addressed receivers perform a safety
        reset.
    \item[Unforgeability.] The attacker cannot forge a message, i.e.\ find a bit sequence other than one of the
        transmitter's previous messages that a receiver would accept. This implies that the attacker also cannot modify
        an existing message.
    \item[Brevity.] The message should be short. Our communications channel is outrageously slow compared to anything
        else used in modern telecommunications and every bit counts.
\end{description}

On a protocol level we also have to ensure \emph{idempotence}. Our system should have an at-most-once semantic. This
means for a given message each receiver either performs exactly one safety reset or none at all, even if the message is
re-transmitted by either the transmitter or an attacker.  We cannot achieve the ideal exactly-once semantic with pure
protocol gymnastics since we are using a unidirectional lossy communication primitive. A receiver might be offline
(e.g.\ due to a local power outage) and then would not hear the transmission even if our broadcast primitive was
reliable. Since there is no back-channel, the transmitter has no way of telling when that happens.  The practical impact
of this can be mitigated by the transmitter by repeating the transmission a number of times.

It follows from the unforgeability requirement that we can trivially reach idempotence at the protocol level by keeping
a database of all previous messages and only accepting \emph{new} messages. By considering this in our cryptographic
design we can reduce the storage requirement for this ``database''.

Together, the authenticity and unforgeability requirements imply that we need a cryptographic
signature\cite{lamport01}.  However, we have relaxed constraints on this signature compared to cryptographic practice.
While cryptographic signatures need to work over arbitrary inputs, all we want to ``sign'' here is the instruction to
perform a safety reset. This is the only message we might ever want to transmit so our message space has only one
entry. The information content of our message thus is 0 bit! All the information we want to transmit is already
encoded \emph{in the fact that we are transmitting}. We do not require any further payload to be transmitted. We can
omit the entirety of the message and just transmit whatever ``signature'' we produce. This is useful to conserve
transmission bits so our transmission does not take an exceedingly long time over our extremely slow
communication channel.

We can modify this construction to allow for a small number of bits of information content in our message (say two or
three instead of zero) at no transmission overhead. We could transmit the cryptographic signature as usual but simply
omit the message. The message is only a few bits and we are dealing with minutes of transmission time so the receiver
can reconstruct the message through brute-force. Though this tradeoff between computation and data transmission might
seem inelegant it does work for our extremely slow link for very few bits.

There is an important limitation in the rules of our setup above: The attacker can always record the reset bit sequence
the transmitter transmits and replay that same sequence later. Even without cryptography we can trivially prevent an
attacker from violating the at-most-once criterion. If every receiver memorizes all bit sequences that have been
transmitted so far it can detect replays. With this mitigation, by replaying an older authentic transmission an attacker
can still cause receivers that were offline during the original transmission to reset at a later point. Considering that
our goal is to reset them in the first place, this should not pose a threat to the system's safety or security.

A possible scenario would be that an attacker first causes enough havoc for authorities to trigger a safety reset. The
attacker would record the trigger transmission. We can assume most meters were reset during the attack. Due to this the
attacker cannot cause a significant number of additional resets immediately afterwards.  However, the attacker could
wait several years for a number of new meters to be installed. These new meters might not yet have updated firmware
including the latest transmission. This means the attacker could cause them to reset by replaying the original
sequence.

A possible mitigation for this risk would be to introduce one bit of information into the trigger message that is
ignored by the replay protection mechanism.  This \emph{enable} bit would be $1$ for the actual reset trigger message.
After the attack the transmitter would then perform scheduled transmissions of a ``disarm'' message that has this bit
set to $0$. This message makes the original transmission known to all new meters and to meters that were offline when it
was sent, so that their replay protection covers it, without performing any further resets.

We could use any of several traditional asymmetric cryptographic primitives to produce these signatures. The
comparatively high computational effort required for signature verification would not be an issue. Transmissions take
several minutes anyway and we can afford to spend some tens of seconds even in signature verification. Transmission
length and by proxy system latency would be determined by the length of the signature. For RSA, signature length equals
the modulus length (i.e.\ larger than \SI{1000}{bit} for very basic contemporary security). For elliptic curve-based systems
curve length is approximately twice the security level and signature size is twice the curve length because two curve
points need to be encoded\cite{anderson02}.  For contemporary security this results in more than 300 bit transmission
length. Thanks to our unique setting we can do better than this. We can exploit that our effective message entropy is 0
bit to derive a more efficient scheme.

\subsubsection{Lamport signatures}

In 1979, Lamport\cite{lamport02} introduced a signature scheme that is based only on a one-way function such as a
cryptographic hash function. The basic observation is that by choosing a random secret input to a one-way function and
publishing the output, one can later prove knowledge of the input by simply publishing it. In the following paragraphs
we will describe a construction of a one-time signature scheme based on this observation. The scheme we describe is the
one usually called a ``Lamport Signature'' in modern literature but is slightly different from the variant described in
the 1979 paper. For our purposes we can consider both to be equivalent.

\paragraph{Setup.} In a Lamport signature, for an $n$-bit hash function $H$ the signer generates a private key $s =
\left(s_{b, i} \mid b\in\left\{0, 1\right\}, 0\le i<n\right)$ of $2n$ random strings of length $n$. The signer publishes a
public key $p = \left(p_{b, i} = H\left(s_{b, i}\right) \mid b\in\left\{0, 1\right\}, 0\le i<n\right)$ that is simply the
list of hashes of each of the random strings that make up the private key.

\paragraph{Signing.} To sign a message $m$, the signer publishes the signature $\sigma = \left(\sigma_i = s_{H(m)_i,
i} \mid 0\le i<n\right)$ where $H(m)_i$ is the $i$-th bit of $H$ applied to $m$. That is, for the $i$-th bit of the
message's hash $H(m)$ the signer publishes either $s_{0, i}$ or $s_{1, i}$ depending on the hash bit's value, keeping
the other entry of the private key $s$ secret.

\paragraph{Verification.} The verifier can compute $H(m)$ themselves and check that each signature entry $\sigma_i$
hashes to the matching public key entry, i.e.\ that $H\left(\sigma_i\right) = p_{H(m)_i, i}$ for all $0\le i<n$.

The above scheme is a one-time signature scheme only. After one signature has been published for a given key, the
corresponding key must not be re-used for other signatures. This is intuitively clear as we are effectively publishing
part of the private key as the signature, and if we were to publish a signature for another message an attacker could
derive additional signatures by ``mixing'' the two published signatures.
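
The three steps above can be illustrated with a minimal Python sketch. It assumes SHA-256 as the hash function $H$ (so
$n = 256$), and the helper names \texttt{keygen}, \texttt{sign} and \texttt{verify} are hypothetical names chosen for
this illustration only. It is a sketch of the textbook construction, not a hardened implementation.

\begin{minted}{python}
import hashlib
import secrets

N_BITS = 256  # assumption: H is SHA-256, so n = 256


def H(data):
    """One-way function H (SHA-256 in this sketch)."""
    return hashlib.sha256(data).digest()


def hash_bits(message):
    """Return the n bits of H(m), most significant bit of each byte first."""
    digest = H(message)
    return [(digest[i // 8] >> (7 - i % 8)) & 1 for i in range(N_BITS)]


def keygen():
    """Private key: 2n random n-bit strings s[b][i]. Public key: p[b][i] = H(s[b][i])."""
    secret = [[secrets.token_bytes(N_BITS // 8) for _ in range(N_BITS)]
              for _ in range(2)]
    public = [[H(s) for s in row] for row in secret]
    return secret, public


def sign(secret, message):
    """Reveal s[H(m)_i][i] for every bit position i."""
    return [secret[b][i] for i, b in enumerate(hash_bits(message))]


def verify(public, message, signature):
    """Check that every revealed entry hashes to the matching public key entry."""
    bits = hash_bits(message)
    if len(signature) != N_BITS:
        return False
    return all(H(sig_i) == public[b][i]
               for i, (b, sig_i) in enumerate(zip(bits, signature)))
\end{minted}

With these parameters a signature consists of $256$ strings of $256$ bit each, which illustrates why the plain scheme
is too long for a slow channel.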

\subsubsection{Winternitz signatures}

An improvement to basic Lamport signatures as described above are Winternitz signatures as detailed in
\cite{merkle01,dods01}. Winternitz signatures reduce public key length as well as signature length for hash length $n$
from $2n$ to $\mathcal O \left(n/t\right)$ for some choice of parameter $t$ (usually a small number such as 4).

\paragraph{Setup.} The signer generates a private key $s = \left(s_i\right)$ consisting of $\ceil{\frac{n}{t}}$ random
bit strings. The signer publishes a public key $p = \left(H^{2^t}\left(s_i\right)\right)$ where each element
$H^{2^t}\left(s_i\right)$ is the $2^t$-fold recursive application of $H$ to $s_i$.

\paragraph{Signing.} The signer splits $m$ padded to a multiple of $t$ bits into $\ceil{\frac{n}{t}}$ chunks $m_i$ of
$t$ bit each. The signer publishes the signature $\sigma = \left( \sigma_i = H^{m_i}\left(s_i\right) \right)$.

\paragraph{Verification.} The verifier can calculate for each $\sigma_i = H^{m_i}\left(s_i\right)$ that $H^{2^t -
m_i}\left(\sigma_i\right) = H^{2^t - m_i}\left(H^{m_i}\left(s_i\right)\right) = H^{2^t - m_i + m_i} \left(s_i\right) =
p_i$.

To prevent an attacker from forging additional signatures from one signature by calculating $\sigma_i' =
H\left(\sigma_i\right)$ matching $m_i' = m_i + 1$, this scheme is usually paired with a simple checksum as described in
\cite{merkle01}.
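
As a rough illustration of the chained construction, the following Python sketch implements setup, signing and
verification for $t = 4$. It assumes SHA-256 as $H$ and signs the digest of the message instead of the padded message
itself. It deliberately omits the checksum mentioned above, so it is forgeable as described and only shows the hash
chain mechanics. All function names are hypothetical.

\begin{minted}{python}
import hashlib
import secrets

T = 4                 # bits per chunk (parameter t)
N_BITS = 256          # assumption: the signed digest is a SHA-256 hash
CHUNKS = N_BITS // T  # number of chunks, ceil(n / t)


def H(data):
    return hashlib.sha256(data).digest()


def iterate(value, count):
    """Apply H to value count times, i.e. compute H^count(value)."""
    for _ in range(count):
        value = H(value)
    return value


def chunks(message):
    """Split the digest of the message into CHUNKS values m_i of T bits each."""
    digest = int.from_bytes(H(message), "big")
    return [(digest >> (T * i)) & (2 ** T - 1) for i in range(CHUNKS)]


def keygen():
    """Private key: random strings s_i. Public key: p_i = H^(2^t)(s_i)."""
    secret = [secrets.token_bytes(32) for _ in range(CHUNKS)]
    public = [iterate(s_i, 2 ** T) for s_i in secret]
    return secret, public


def sign(secret, message):
    """sigma_i = H^(m_i)(s_i). NOTE: no checksum, illustration only."""
    return [iterate(s_i, m_i) for s_i, m_i in zip(secret, chunks(message))]


def verify(public, message, signature):
    """Check H^(2^t - m_i)(sigma_i) == p_i for every chunk."""
    if len(signature) != CHUNKS:
        return False
    return all(iterate(sig_i, 2 ** T - m_i) == p_i
               for sig_i, m_i, p_i in zip(signature, chunks(message), public))
\end{minted}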

\subsubsection{Using hash-based signatures for trigger authentication}

The most basic possible trigger authentication scheme would be to simply generate a random bit string secret key $s$ and
publish $p = H(s)$ for some hash function $H$. To activate the trigger, $\sigma = s$ is published and receivers verify
that $H(\sigma) = p = H(s)$. This simplistic scheme has one main disadvantage: It is a fundamentally one-time
construction. To prevent an attacker from re-triggering a receiver a second time by replaying a valid trigger $\sigma$
all receivers have to blacklist any ``used'' $\sigma$. Alas, this means we can only ever trigger a receiver \emph{once}.
The good part is that any receiver that missed this trigger can still be triggered later, but the bad part is that once
$s$ is burned we are out of options. The trivial solution to this would be to simply provision each receiver with a
whole list of public keys in advance. This however takes $n$ times the amount of space for $n$-fold retriggerability and
we have to memorize separately for each one whether it has been used up. Luckily we can easily derive a scheme that
yields $n$-fold retriggerability and naturally memorizes replay state while using no more space than the original scheme
by taking some inspiration from Winternitz signatures above.

In this scheme the secret key $s$ is still a random bit string. The public key is $p = H^n(s)$ for $n$-times
retriggerability.  The $i$-th time the trigger is activated, $\sigma_i = H^{n-i}(s)$ is published, and every receiver can
verify that $\sigma_{i-1} = H\left(\sigma_i\right)$ with $\sigma_0 = p$. In case a receiver missed one or more previous
triggers it continues computing $H\left(H\left(\sigma_i\right)\right)$,
$H\left(H\left(H\left(\sigma_i\right)\right)\right)$ and so on until either reaching the $n$-th recursion level
(indicating an invalid signature) or finding $H^k\left(\sigma_i\right) = \sigma_j$ for some $k \le n$, with $\sigma_j$
being the last signature this receiver recorded, or $p$ in case there is none.

This scheme provides replay protection through receivers memorizing the last signature they acted on. Public key
length is equal to the length of the hash function $H$ used. Even for our embedded systems use case $n$ can
realistically be up to $\mathcal O\left(10^3\right)$, which is easily enough for our purposes.
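
The hash chain logic described above can be sketched in a few lines of Python. The sketch assumes SHA-256 as $H$, and
the class names \texttt{Transmitter} and \texttt{Receiver} are hypothetical; our prototype firmware is not written in
Python and a real transmitter would keep its key in secure storage.

\begin{minted}{python}
import hashlib
import secrets


def H(data):
    return hashlib.sha256(data).digest()


def iterate(value, count):
    """Compute H^count(value)."""
    for _ in range(count):
        value = H(value)
    return value


class Transmitter:
    def __init__(self, n):
        self.n = n
        self.secret = secrets.token_bytes(32)  # secret key s
        self.used = 0                          # number of triggers sent so far

    def public_key(self):
        """p = H^n(s), provisioned into every receiver."""
        return iterate(self.secret, self.n)

    def next_trigger(self):
        """Return sigma_i = H^(n-i)(s) for the next unused index i."""
        if self.used >= self.n:
            raise RuntimeError("hash chain exhausted")
        self.used += 1
        return iterate(self.secret, self.n - self.used)


class Receiver:
    def __init__(self, public_key, n):
        self.n = n
        self.last = public_key  # last accepted signature, initially p

    def accept(self, sigma):
        """Accept sigma if hashing it at most n times yields the stored value."""
        candidate = sigma
        for _ in range(self.n):
            candidate = H(candidate)
            if candidate == self.last:
                self.last = sigma  # remembering sigma is the replay protection
                return True
        return False
\end{minted}

In this sketch a receiver that has accepted $\sigma_i$ rejects both a replay of $\sigma_i$ and any older $\sigma_j$
with $j < i$, since hashing those values never leads back to $\sigma_i$.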

The ``disarm'' message we discussed above can be integrated into this scheme by encoding the ``enable'' bit into the
least significant bit of the chain position, i.e.\ of the exponent of $H$ in our $H^n$ construction. In the chain of
valid signatures every second one would be a
disarm signature. Reset and disarm signatures would alternate in this scheme. By skipping a disarm signature two resets
can still be triggered directly after one another.

In practice it may be useful to have some control over which particular meters reset. An attack exploiting a particular
network protocol implementation flaw might only affect one series of meters made by one manufacturer. Resetting
\emph{all} meters may be too much in this case. A simple solution for this is to define addressable subsets of meters.
``All meters'' along with ``meters made by manufacturer $x$'' and ``meters of model $y$'' are good choices for such
scopes. On the cryptographic level the protocol state is simply duplicated for each scope. This incurs memory and
computation overhead linear in the number of scopes. Device memory requirements are small at a few bytes only and
computation is of no concern due to the very slow channel so this simple solution is adequate. The transmitter has to
either store copies of all scopes' keys or derive these keys from a root key using the scope's identifier. Keys are
small and the transmitter would be using a regular server or hardware security module so either option is easily
feasible.
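
One possible way to derive per-scope keys from a root key is sketched below. The use of HMAC-SHA256 as the derivation
function and the textual scope identifiers are assumptions made for this illustration; the text above does not
prescribe a particular derivation function.

\begin{minted}{python}
import hashlib
import hmac


def derive_scope_secret(root_key, scope_id):
    """Derive the secret chain key of one scope from the root key.

    Assumption: HMAC-SHA256 keyed with the root key over the scope identifier.
    """
    return hmac.new(root_key, scope_id.encode("utf-8"), hashlib.sha256).digest()


def device_public_key(scope_secret, n):
    """Hash the scope secret n times to obtain the public device key."""
    key = scope_secret
    for _ in range(n):
        key = hashlib.sha256(key).digest()
    return key


def provision_device(root_key, scope_ids, n):
    """Return the public keys a device needs for all scopes it belongs to."""
    return {scope_id: device_public_key(derive_scope_secret(root_key, scope_id), n)
            for scope_id in scope_ids}
\end{minted}

A device belonging to three scopes would be provisioned with three public keys of one hash length each, while the
transmitter only needs to store the root key.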

A diagram of the key structure in this key management scheme is shown in Figure \ref{fig:sig_key_chain}. The
transmitter key management is shown in Figure \ref{fig:tx_scope_key_illu}. This scheme is simplistic but suffices for
our prototype in Section \ref{sec-prototype} and may even be useful in a practical implementation. During
standardization of a safety reset system the key management system would most likely have to be customized to the
particular application's requirements. Developing a universal solution is outside the scope of this work.
% FIXME revisit this section - 2020-05-26
\begin{figure}
    \centering
    \begin{minipage}[c]{0.5\textwidth}
        \includegraphics{resources/signature_key_chain}
    \end{minipage}
    \begin{minipage}[c]{0.45\textwidth}
        \caption{
            The hash chain between secret transmitter key and public device key. Each step represents one invocation of the
            hash function. To generate a new chain a random transmitter key is generated, then hashed $n$ times to
            generate the corresponding device key. A new trigger message can be generated by generating the key at depth
            $m-1$ where $m$ is the height of the last used trigger, or $n$ initially. Trigger messages alternate
            between disarm messages and reset messages. Depending on which is needed the other one may be skipped.
        }
        \label{fig:sig_key_chain}
    \end{minipage}
\end{figure}

\begin{figure}
    \centering
    \includegraphics{resources/transmitter_scope_key_illustration}
    \caption{
        An illustration of a key management system using a shared master key. The transmitter derives one secret key for
        each addressable group from the master key. Then public device keys are generated like in Figure
        \ref{fig:sig_key_chain}. Finally for each device the manufacturer picks the group public keys matching the
        device. In this example one device is a series A meter made by manufacturer B so it gets provisioned with the
        ``all devices'', ``manufacturer B'' and ``series A'' public device keys. The other device is also made by
        manufacturer B but is a series C device so it gets provisioned with the ``all devices'', ``manufacturer B'' and
        ``series C'' public device keys. In this example the transmitter stores (or is able to derive) all six shown
        group keys, but each device only needs to store the three applying to it for the three scopes ``all devices'',
        ``manufacturer'' and ``series''.
    }
    \label{fig:tx_scope_key_illu}
\end{figure}

\chapter{Practical implementation}

To validate the practical feasibility of the theoretical concepts we laid out in the previous chapter we decided to
build a prototype of a safety reset controller.  In this chapter we describe the reasoning behind the components of this
prototype and the engineering that went into its firmware. The prototype consists of a smart meter whose application
microcontroller is reset by a prototype reset controller on an external circuit board. We lay out how we extensively
tested all parts of our firmware implementation. We conclude with results of a practical end-to-end experiment
exercising every part of our prototype.

\section{Data collection for channel validation}

To design a solid system we need to characterize mains frequency variations under normal conditions. To set modulation
amplitude as well as parameters of our modulation scheme we need a frequency spectrum of mains frequency variations
(that is, $\mathcal F\left(f(V(t))\right)$: taking mains frequency $f$ as a variable over time, the frequency spectrum
of that variable, as opposed to the frequency spectrum of mains voltage $V(t)$ itself).

\subsection{Grid frequency estimation}
\label{frequency_estimation}

In commercial power systems Phasor Measurement Units (PMUs) are used to precisely measure parameters of a mains voltage
waveform. One of the parameters PMUs measure is mains frequency. PMUs are used as part of SCADA systems controlling
transmission networks to characterize the operational state of the network.

From a superficial viewpoint measuring mains frequency might seem like a simple problem. Take the mains voltage
waveform, measure time between two rising-edge (or falling-edge) zero-crossings and take the inverse $f = t^{-1}$. In
practice, phasor measurement units are significantly more complex than this. This discrepancy is due to the combination
of both high precision and quick response that is demanded from these units. High precision is necessary since
variations of mains frequency under normal operating conditions are quite small, in the range of
\SIrange{5}{10}{\milli\hertz} over short intervals of time. Relative to the nominal \SI{50}{\hertz} this is a deviation
on the order of \SI{100}{ppm}. Relative to the corresponding \SI{20}{\milli\second} period that means a time deviation
of about \SI{2}{\micro\second} from cycle to cycle. From this it is already obvious why a simplistic measurement cannot
yield the required precision for manageable averaging times: we would need either an ADC sampling rate on the order of
megasamples per second or, for a reconstruction through interpolated readings, an impractically high ADC resolution.
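
As a worked example of these magnitudes, consider the lower end of the range, $\Delta f = \SI{5}{\milli\hertz}$, and
write $T_\text{nom} = 1/f_\text{nom} = \SI{20}{\milli\second}$ for the nominal period:
\begin{align*}
    \frac{\Delta f}{f_\text{nom}} &= \frac{\SI{5}{\milli\hertz}}{\SI{50}{\hertz}} = 10^{-4} = \SI{100}{ppm} \\
    \Delta T &\approx T_\text{nom} \cdot \frac{\Delta f}{f_\text{nom}}
        = \SI{20}{\milli\second} \cdot 10^{-4} = \SI{2}{\micro\second}
\end{align*}
If we wanted to resolve such a shift to within ten percent by directly timing zero-crossings (the ten percent figure is
an illustrative assumption), we would need a time resolution of about \SI{0.2}{\micro\second}, which corresponds to an
effective sampling rate of roughly \SI{5}{\mega\hertz}.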

Detail on the inner workings of commercial phasor measurement units is scarce but given their essential role to SCADA
systems there is a large amount of academic research on such algorithms\cite{narduzzi01,derviskadic01,belega01}. A
popular approach to these systems is to perform a Short-Time Fourier Transform (STFT) on ADC data sampled at high
sampling rate (e.g.\ \SI{10}{\kilo\hertz}) and then perform some analysis on the frequency-domain data to precisely
locate the strong peak around \SI{50}{\hertz}. A key observation here is that FFT bin size is going to be much larger
than required frequency resolution. This fundamental limitation follows from the Nyquist criterion %FIXME maybe cite?
and if we had to process an \emph{arbitrary} signal this would highly limit our practical measurement accuracy
\footnote{
    Some software packages providing FFT or STFT primitives such as scipy\cite{virtanen01} allow the user to
    super-sample FFT output by specifying an FFT width larger than input data length, padding the input data with zeros
    on both sides. Note that in line with Nyquist this \emph{does not} actually provide finer output resolution but
    instead just amounts to an interpolation between output bins. Depending on the downstream analysis algorithm it may
    still be sensible to use this property of the DFT for interpolation, but in general it will be computationally
    expensive compared to other interpolation methods and in any case it will not yield any better frequency resolution
    aside from a hypothetical numerical advantage\cite{gasior02}.
}.
For this reason all approaches to mains frequency estimation are based on a model of the mains voltage waveform.
Nominally, this waveform would be a perfect sine at $f=\SI{50}{\hertz}$. In practice it is a sine at
$f\approx\SI{50}{\hertz}$ superimposed with some aperiodic noise (e.g.\ irregular spikes from inductive loads being
energized) as well as harmonic distortion that is caused by grid-topologically nearby devices with power factor
$\cos \theta \neq 1.0$. Under a continuous Fourier transform over a long period the frequency spectrum of a signal
distorted like this will be a low noise floor depending mainly on aperiodic noise on which a comb of harmonics as well
as some sub-harmonics of $f \approx f_\text{nom} = \SI{50}{\hertz}$ rides. The main peak at $f \approx f_\text{nom}$
will be very strong with the harmonics being approximately an order of magnitude weaker in energy and the noise floor
being at least another order of magnitude weaker. See Figure \ref{mains_voltage_spectrum} for a measured spectrum.  This
domain knowledge about the expected frequency spectrum of the signal can be employed in a number of interpolation
techniques to re-construct the precise frequency of the spectrum's main component despite comparatively coarse STFT
resolution and despite numerous distortions.

Published grid frequency estimation algorithms such as \cite{narduzzi01,derviskadic01} are rather sophisticated and use
a combination of techniques to reduce numerical errors in FFT calculation and peak fitting. Given that we do not need
reference standard-grade accuracy for our application we chose to start with a very basic algorithm instead. We chose to
use a general approach to estimate the precise fundamental frequency of an arbitrary signal that was published by
experimental physicists Gasior and Gonzalez at CERN\cite{gasior01}. This approach assumes a general sinusoidal signal
superimposed with harmonics and broadband noise.  Applicable to a wide spectrum of practical signal analysis tasks it is
a reasonable first-degree approximation of the much more sophisticated estimation algorithms developed specifically for
power systems.  Some algorithms have components such as Kalman filters\cite{narduzzi01} that require a physical model.
As a general algorithm \cite{gasior01} does not require this kind of application-specific tuning, eliminating one source
of error.

The Gasior and Gonzalez algorithm\cite{gasior01} passes the windowed input signal through a DFT, then interpolates the
signal's fundamental frequency by fitting a wavelet such as a Gaussian to the largest peak in the DFT results. The
location parameter (center) of this curve fit is an accurate estimate of the signal's fundamental frequency. This
algorithm is similar to the simpler interpolated DFT algorithm used as a reference in much of the synchrophasor
estimation literature\cite{borkowski01}. The three-term variant of the maximum sidelobe decay window often used there is
a Blackman window with parameter $\alpha = \frac{1}{4}$. Analysis has shown\cite{belega01} that the interpolated DFT
algorithm is worse than algorithms involving more complex models under some conditions but that there is \emph{no free
lunch}, meaning that more complex algorithms perform worse when the input signal deviates from their models.
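
The general idea of windowing, transforming and interpolating the main peak can be sketched as follows. This is a
minimal illustration in the spirit of the approach described above and not a reproduction of our notebook: it assumes
NumPy is available, uses a Gaussian window whose width (one tenth of the record length) is an arbitrary choice, and
interpolates with a three-point parabola on the logarithmic magnitudes, which matches a Gaussian peak shape. The
function name is hypothetical.

\begin{minted}{python}
import numpy as np


def estimate_frequency(samples, sample_rate, sigma_fraction=0.1):
    """Estimate the dominant frequency of samples by Gaussian peak interpolation."""
    samples = np.asarray(samples, dtype=float)
    n = len(samples)

    # Gaussian window centred on the record; its width is an assumed parameter.
    t = np.arange(n) - (n - 1) / 2
    window = np.exp(-0.5 * (t / (sigma_fraction * n)) ** 2)

    # Remove the DC offset, apply the window and take the magnitude spectrum.
    spectrum = np.abs(np.fft.rfft((samples - samples.mean()) * window))

    # Largest bin, excluding the first and last bin so both neighbours exist.
    k = int(np.argmax(spectrum[1:-1])) + 1

    # A Gaussian is a parabola in the log domain, so fit a parabola through
    # the peak bin and its two neighbours and take the position of its vertex.
    a, b, c = np.log(spectrum[k - 1:k + 2])
    delta = 0.5 * (a - c) / (a - 2 * b + c)

    return (k + delta) * sample_rate / n
\end{minted}

For a one-second record sampled at \SI{1}{\kilo\hertz} the FFT bin size is \SI{1}{\hertz}, yet for a clean test signal
the interpolated estimate lands far closer to the true frequency than one bin.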

\subsection{Frequency sensor hardware design}
% FIXME: link to schematics in appendix
% FIXME: include pics of finished board and device

\label{sec-fsensor}
Our safety reset controller will have to measure mains frequency to later demodulate a reset signal transmitted through
it. Since we have decided to build our own frequency measurement system here we can use this frequency measurement setup as
a prototype for the frequency measurement subcomponent of the demodulation system we will later develop. Since we do not
plan to do a large-scale field deployment of our measurement setup we can keep the hardware implementation simple by
moving most of the signal processing to a regular computer and concentrating our hardware efforts on raw signal capture.

\begin{figure}
    \begin{center}
        \begin{tikzpicture}[start chain = going below, node distance = 12mm and 50mm, every join/.style = {norm}]
            \tikzset{
                base/.style = {draw, on chain, on grid, align=center, minimum height = 4ex, font=\footnotesize},
                text/.style = {base},
                component/.style = {base, rectangle, text width=40mm},
                coord/.style = {coordinate, on chain, on grid, node distance=6mm and 25mm}
            }
            \node[text centered] (input)                                {Single-Phase Mains Input};
            \node[component] (safety)       [below = of input]          {Input Protection};
            \node[coord]     (safety-anchor) [below = of safety]        {};
            \node[component] (analog)       [below = of safety-anchor]  {Analog Signal Processing};
            \node[component] (powersupply)  [left = of analog]          {Power supply};
            \node[component] (adc)          [below = of analog]         {ADC};
            \node[component] (micro)        [below = of adc]            {Microcontroller};
            \node[component] (isol)         [below = of micro]          {Galvanic Digital Isolation};
            \node[coord]     (isol-left)    [left = 6cm of isol.west]   {};
            \node[coord]     (isol-right)   [right = 1cm of isol.east]  {};
            \node[component] (usb)          [below = of isol]           {USB interface};

            \draw[->] (input.south) -- (safety.north);
            \draw[-]  (safety.south) -- (safety-anchor);
            \draw[->] (safety-anchor) -| (powersupply.north);
            \draw[->] (safety-anchor) -| (analog.north);
            \draw[->] (powersupply.south) |- (adc.west);
            \draw[->] (powersupply.south) |- (micro.west);
            \draw[->] (analog.south) -- (adc.north);
            \draw[->] (adc.south) -- (micro.north);
            \draw[->] (micro.south) -- (isol.north);
            \draw[->] (isol.south) -- (usb.north);

            \draw[dashed] (isol.west) -- (isol-left.east);
            \draw[dashed] (isol.east) -- (isol-right.west);
        \end{tikzpicture}
    \end{center}
    \caption{Frequency sensor hardware diagram.}
    \label{fmeas-sens-diag}
\end{figure}

An overall block diagram of our system is shown in Figure \ref{fmeas-sens-diag}. The microcontroller we chose is an
\texttt{STM32F030F4P6} ARM Cortex-M0 microcontroller made by STMicroelectronics. The ADC in Figure
\ref{fmeas-sens-diag} in our design is the integrated 12-bit ADC of this microcontroller, which is sufficient for our
purposes. The USB interface is a simple USB to serial converter IC (\texttt{CH340G}) and the galvanic digital isolation
is accomplished with a pair of high-speed optocouplers on its \texttt{RX} and \texttt{TX} lines. The analog signal
processing is a simple voltage divider using high-power resistors to get the required creepage along with some
high-frequency filter capacitors and an op-amp buffer. The power supply is an off-the-shelf mains-input power module.
The system is implemented on a single two-layer PCB that is housed in an off-the-shelf industrial plastic case fitted
with a printed label and a few status lights on its front.

\subsection{Clock accuracy considerations}

Our measurement hardware will sample line voltage at some sampling rate $f_S$, e.g.\ \SI{1}{\kilo\hertz}. All downstream
processing is limited in accuracy by the accuracy of $f_S$\footnote{
We are not considering the effects of clock jitter. We are highly oversampling the signal and the FFT done in our
downstream processing will eliminate small jitter effects leaving only frequency stability to worry about.  }. We
generate our sampling clock in hardware by clocking the ADC from one of the microcontroller's timer blocks clocked from
the microcontroller's system clock. This means our ADC's sampling window will be synchronized cycle-accurate to the
microcontroller's system clock.

Our downstream measurement of mains frequency by nature is relative to our sampling frequency $f_S$. In the setup
described above this means we have to make sure our system clock is fairly stable. A frequency deviation of \SI{1}{ppm}
in our system clock causes a proportional grid frequency measurement error of $\Delta f = f_\text{nom} \cdot
10^{-6} = \SI{50}{\micro\hertz}$. In a worst case where our system is clocked from a particularly bad crystal that exhibits
\SI{100}{ppm} of instability over our measurement period we end up with an error of \SI{5}{\milli\hertz}. This is well
within our target measurement range, so we need a more stable clock source. Ideally we want to avoid writing our own
clock conditioning code where we try to change an oscillator's operating frequency to match some reference. Clock
conditioning algorithms are highly complex and in our case post-processing of measurement data and simply adding an
offset is simpler and less error-prone.

Our solution to these problems is to use a crystal oven\footnote{
    A crystal oven is a crystal oscillator thermally coupled closely to a heater and temperature sensor and enclosed in
    a thermally isolated case. The heater is controlled to hold the crystal oscillator at a near-constant temperature
    a few tens of degrees above ambient. Any ambient temperature variations will be absorbed by the temperature control.
    This yields a crystal frequency that is almost completely unaffected by ambient temperature variations below the
    oven temperature and whose main remaining instability is aging.
} as our main system clock source. Crystal ovens are expensive compared to ordinary crystal oscillators. Since any
crystal oven will be much more accurate than a standard room-temperature crystal we chose to reduce cost by using one
recycled from old telecommunications equipment.

To verify clock accuracy we routed an externally accessible SMA connector to a microcontroller pin that is routed to one
of the microcontroller's timer inputs. By connecting a GPS 1pps signal to this pin and measuring its period we can
calculate our system's Allan variance\footnote{
    Allan variance is a measure of frequency stability between two clocks.
}, thereby measuring both clock stability and clock accuracy.
We ran a 4-hour test of our frequency sensor that generated the histogram shown in Figure \ref{ocxo_freq_stability}.
These results show that while we get a systematic error of about \SI{10}{ppm} due to manufacturing tolerances the
random error at less than \SI{10}{ppb} is smaller than that of a room-temperature crystal oscillator by 3 to 4 orders of
magnitude. Since we are interested in grid frequency variations over time but not in the absolute value of grid
frequency the systematic error is of no consequence to us.  The random error at \SI{3.66}{ppb} corresponds to a
frequency measurement error of about \SI{0.2}{\micro\hertz}, well below what we can achieve at reasonable sampling rates
and ADC resolution.
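
For reference, the basic (non-overlapping) Allan deviation at the \SI{1}{\second} interval given by the 1pps signal can
be computed from the measured timer counts as sketched below. This is a generic textbook formulation with hypothetical
function names and is not necessarily the exact evaluation behind the histogram; it assumes one timer count per 1pps
period and a known nominal oscillator frequency.

\begin{minted}{python}
import math


def fractional_frequency(tick_counts, f_nominal):
    """Convert timer counts per 1pps period into fractional frequency offsets y_i."""
    return [count / f_nominal - 1.0 for count in tick_counts]


def allan_deviation(y):
    """Allan deviation at the basic sampling interval.

    sigma_y^2 = 1 / (2 (M - 1)) * sum_i (y_{i+1} - y_i)^2 for M samples y_i.
    """
    if len(y) < 2:
        raise ValueError("need at least two frequency samples")
    squared_diffs = [(b - a) ** 2 for a, b in zip(y, y[1:])]
    return math.sqrt(sum(squared_diffs) / (2 * len(squared_diffs)))
\end{minted}

A constant offset in all $y_i$, such as our systematic error, cancels in the differences and therefore does not
contribute to this stability measure.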

\begin{figure}
    \centering
    \includegraphics{../lab-windows/fig_out/ocxo_freq_stability}
    \caption{OCXO frequency deviation from nominal \SI{19.440}{\mega\hertz} measured against GPS 1pps.}
    \label{ocxo_freq_stability}
\end{figure}

\subsection{Firmware implementation}

The firmware uses one of the microcontroller's timers clocked from an external crystal oscillator to produce a
\SI{1}{\milli\second} tick that the internal ADC is triggered from for a sample rate of \SI{1}{\kilo sps}. Higher sample
rates would be possible but reliable data transmission over the opto-isolated serial interface might prove challenging
and \SI{1}{\kilo sps} corresponds to $20$ samples per cycle at $f_\text{nom}$. This figure exceeds the Nyquist
criterion by a factor of ten and is plenty for accurate measurements.

The ADC measurements are read using DMA and written into a circular buffer. Using some DMA controller features this
circular buffer is split in back and front halves with one being written to and the other being read at the same time.
Buffer contents are moved from the ADC DMA buffer into a packet-based reliable UART interface as they come in. The UART
packet interface keeps two ringbuffers: One byte-based ringbuffer for transmission data and one ringbuffer pointer
structure that keeps track of ADC data packet boundaries in the byte-based ringbuffer. Every time a chunk of data is
available from the ADC the data is framed into the byte-based ringbuffer and the packet boundaries are logged in the
packet pointer ringbuffer. If the UART transmitter is idle at this time a DMA-backed transmission of the oldest packet
in the packet ringbuffer is triggered at this point. Data is framed using Consistent Overhead Byte Stuffing
(COBS)\footnote{
COBS is a framing technique that allows encoding $n$ bytes of arbitrary data into at most $n + \lceil n/254 \rceil$
bytes (exactly $n+1$ bytes for packets of up to \SI{254}{\byte}) with no embedded
$0$-bytes that can then be delimited using $0$-bytes. COBS is simple to implement and allows both one-pass decoding and
encoding. The encoder either needs to be able to read up to \SI{254}{\byte} ahead or needs a buffer of that size.
COBS is very robust in that it allows self-synchronization. At any point a receiver can reliably synchronize itself
against a COBS data stream by waiting for the next $0$-byte. The constant overhead allows precise bandwidth and buffer
planning and provides constant, good efficiency close to the theoretical maximum.}\cite{cheshire01} along with a
CRC-32 checksum for error checking. When the host receives a new packet with a valid checksum it returns an
acknowledgement packet to the sensor. When the sensor receives the acknowledgement, the acknowledged packet is dropped
from the transmission packet ringbuffer. When the host detects an incorrect checksum it simply stays quiet and waits for
the sensor to resume with retransmission when the next ADC buffer has been received.
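
The framing step described above can be sketched in Python as follows. This is a minimal illustration and not the
firmware's C implementation: the function names, the use of the zlib CRC-32 polynomial, the little-endian placement of
the checksum after the payload and computing the checksum before byte stuffing are all assumptions made for this
example.

\begin{minted}{python}
import struct
import zlib


def cobs_encode(data: bytes) -> bytes:
    """Encode data so that the result contains no zero bytes."""
    out = bytearray()
    block = bytearray()                  # non-zero bytes seen since the last code byte
    for byte in data:
        if byte == 0:
            out.append(len(block) + 1)   # code byte: offset to the next (implicit) zero
            out += block
            block.clear()
        else:
            block.append(byte)
            if len(block) == 254:        # longest run a single code byte can describe
                out.append(0xFF)
                out += block
                block.clear()
    out.append(len(block) + 1)
    out += block
    return bytes(out)


def cobs_decode(encoded: bytes) -> bytes:
    """Invert cobs_encode. The delimiting zero byte must already be removed."""
    out = bytearray()
    i = 0
    while i < len(encoded):
        code = encoded[i]
        if code == 0:
            raise ValueError("zero byte inside COBS frame")
        block = encoded[i + 1:i + code]
        if len(block) != code - 1:
            raise ValueError("truncated COBS frame")
        out += block
        i += code
        if code < 0xFF and i < len(encoded):
            out.append(0)                # an implicit zero separated this block from the next
    return bytes(out)


def frame_packet(payload: bytes) -> bytes:
    """Append a CRC-32 (assumed layout), stuff the result and add the zero delimiter."""
    crc = zlib.crc32(payload) & 0xFFFFFFFF
    return cobs_encode(payload + struct.pack("<I", crc)) + b"\x00"


def unframe_packet(frame: bytes) -> bytes:
    """Return the payload of a frame or raise ValueError if it is damaged."""
    body = cobs_decode(frame.rstrip(b"\x00"))
    if len(body) < 4:
        raise ValueError("frame too short")
    payload, (crc,) = body[:-4], struct.unpack("<I", body[-4:])
    if zlib.crc32(payload) & 0xFFFFFFFF != crc:
        raise ValueError("CRC mismatch")
    return payload
\end{minted}

In this sketch a receiver would split the incoming byte stream at zero bytes, call \texttt{unframe\_packet} on each
chunk and only acknowledge chunks for which no exception is raised.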

The serial interface logic presents most of the complexity of the sensor firmware. This complexity is necessary since
we need reliable, error-checked transmission to the host. Though rare, bit errors on a serial interface do happen and
data corruption is unacceptable. The packet-layer queueing on the sensor is necessary since the host is not a realtime
system and unpredictable latency spikes of several hundred milliseconds are possible.

The host in our recording setup is a Raspberry Pi 3 model B running a Python script. The Python script handles serial
communication and logs data and errors into an SQLite database file. SQLite has been chosen for its simple yet flexible
interface and its good tolerance of system resets due to unexpected power loss. Overall our setup performed adequately
with IO contention on the Raspberry Pi/Linux side causing only 16 skipped sample packets over a 68-hour recording span.
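
To illustrate the kind of logging this involves, the following Python sketch stores received packets and errors using
the standard library's \texttt{sqlite3} module. The table layout, column names and function names are hypothetical and
chosen for this example only; they are not taken from our recording script.

\begin{minted}{python}
import sqlite3
import time

# Hypothetical schema: one table for sample packets, one for error events.
SCHEMA = """
CREATE TABLE IF NOT EXISTS samples (
    seq         INTEGER PRIMARY KEY,
    received_at REAL NOT NULL,
    payload     BLOB NOT NULL
);
CREATE TABLE IF NOT EXISTS errors (
    id          INTEGER PRIMARY KEY AUTOINCREMENT,
    received_at REAL NOT NULL,
    message     TEXT NOT NULL
);
"""


def open_log(path):
    """Open (or create) the database file and make sure both tables exist."""
    db = sqlite3.connect(path)
    db.executescript(SCHEMA)
    return db


def log_packet(db, seq, payload):
    """Store one received sample packet."""
    with db:  # commits on success, rolls back if the insert raises
        db.execute("INSERT INTO samples VALUES (?, ?, ?)",
                   (seq, time.time(), payload))


def log_error(db, message):
    """Store one error event such as a checksum failure."""
    with db:
        db.execute("INSERT INTO errors (received_at, message) VALUES (?, ?)",
                   (time.time(), message))
\end{minted}

Committing each packet in its own transaction is one simple way to limit data loss on unexpected power loss to the
packet currently being written.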

\subsection{Frequency sensor measurement results}

Captured raw waveform data has been processed in the Jupyter Lab environment\cite{kluyver01} and grid frequency
estimates are extracted as described in sec.\ \ref{frequency_estimation} using the Gasior and Gonzalez\cite{gasior01}
technique.  Appendix \ref{grid_freq_estimation_notebook} contains the Jupyter notebook we used for frequency
measurement. In Figure \ref{freq_meas_feedback} we fed the frequency estimator's own output back into it, giving us an
indication of its numerical performance. The result was \SI{1.3}{\milli\hertz} of RMS noise over a \SI{3600}{\second}
simulation time. This indicates performance is good enough for our purposes. In addition to this we validated our
algorithm's performance by applying it to the test waveforms from \cite{wright01}. In this test we got errors of
\SI{4.4}{\milli\hertz} for the \emph{noise} test waveform, \SI{0.027}{\milli\hertz} for the \emph{interharmonics} test
waveform and \SI{46}{\milli\hertz} for the \emph{amplitude and phase step} test waveform. Full results can be found in
Figure \ref{freq_meas_rocof_reference}.

Figures \ref{freq_meas_trace} and \ref{freq_meas_trace_mag} show our measurement results over a 24-hour and a 2-hour
window respectively.

\begin{figure}
    \centering
    \includegraphics[width=\textwidth]{../lab-windows/fig_out/freq_meas_feedback}
    \caption{
        The frequency estimation algorithm applied to a synthetic noise-less mains waveform generated from its own
        output. This feedback simulation gives an indication of numerical errors in our estimation algorithm. The top
        four graphs show a comparison of the original trace (blue) and the re-calculated trace (orange). The bottom
        trace shows the difference between the two. As we can tell both traces agree very well with an overall RMS
        deviation of about \SI{1.3}{\milli\hertz}. The bottom trace shows deviation growing over time. This is very
        likely an effect of numerical errors in our ad-hoc waveform generator.
    }
    \label{freq_meas_feedback}
\end{figure}

\begin{figure}
    \centering
    \includegraphics[width=\textwidth]{../lab-windows/fig_out/freq_meas_rocof_reference}
    \caption{
        Performance of our frequency estimation algorithm against the test suite specified in \cite{wright01}. Shown are
        standard deviation and variance measurements as well as time-domain traces of differences.
    }
    \label{freq_meas_rocof_reference}
\end{figure}

\begin{figure}
    \centering
    \includegraphics{../lab-windows/fig_out/freq_meas_trace_24h}
    \caption{Trace of grid frequency over a 24 hour window. One clearly visible feature is the large positive and
    negative transients at full hours. Times shown are UTC. Note that the European continental synchronous area that
    this sensor is placed in covers several time zones which may result in images of daily load peaks appearing in 1
    hour intervals. Figure \ref{freq_meas_trace_mag} contains two magnified intervals from this plot.}
    \label{freq_meas_trace}
\end{figure}

\begin{figure}
    \begin{subfigure}{\textwidth}
        \centering
        \includegraphics{../lab-windows/fig_out/freq_meas_trace_2h_1}
        \caption{A 2 hour window around 00:00 UTC.}
    \end{subfigure}
    \begin{subfigure}{\textwidth}
        \centering
        \includegraphics{../lab-windows/fig_out/freq_meas_trace_2h_2}
        \caption{A 2 hour window around 18:30 UTC.}
    \end{subfigure}
    \caption{Two magnified 2 hour windows of the trace from Figure \ref{freq_meas_trace}.}
    \label{freq_meas_trace_mag}
\end{figure}

\begin{figure}
    \centering
    \includegraphics{../lab-windows/fig_out/mains_voltage_spectrum}
    \caption{Power spectral density of the mains voltage trace in Figure \ref{freq_meas_trace}.  Data was captured using
    our frequency measurement sensor (sec.\ \ref{sec-fsensor}) and FFT'ed after applying a Blackman window. Vertical lines
    indicate \SI{50}{\hertz} and odd harmonics.  We can see the expected peak at \SI{50}{\hertz} along with smaller
    peaks at odd harmonics. We can also see a number of spurious tones both between harmonics and at low frequencies, as
    well as some bands containing high noise energy around \SI{0.1}{\hertz}. This graph demonstrates a high
    signal-to-noise ratio that is not very demanding on our frequency estimation algorithm.
    }
    \label{mains_voltage_spectrum}
\end{figure}

\section{Channel simulation and parameter validation}
\label{sec-ch-sim}

To validate all layers of our communication stack from modulation scheme to cryptography we built a prototype
implementation in Python. Implementing all components in a high-level language builds up familiarity with the concepts
while taking away much of the implementation complexity. For our demonstrator we will not be able to use Python since
our target platform is a cheap low-end microcontroller. Our demonstrator firmware will have to be written in a low-level
language such as C or Rust. For prototyping these languages lack flexibility compared to Python.

To validate our modulation scheme we first performed a series of simulations on our Python demodulator prototype
implementation. To simulate a modulated grid frequency signal we added noise to a synthetic modulation signal. For most
simulations we used measured frequency data gathered with our frequency sensor. We only have a limited amount of capture
data. Re-using segments of this data as background noise in multiple simulation runs could hypothetically lead to our
simulation results depending on individual features of this particular capture that would be common between all runs. To
estimate the impact of this problem we re-ran some of our simulations with artificial random noise synthesized with a
power spectral density matching that of our capture. To do this, we first measured our capture's PSD, then fitted a
low-resolution spline to the PSD curve in log-log coördinates. We then generated white noise, multiplied the resampled
spline with the DFT of the synthetic noise and performed an iDFT on the result. The resulting time-domain signal is our
synthetic grid frequency data. Figure \ref{freq_meas_spectrum} shows the PSD of our measured grid frequency signal. The
red line indicates the low-resolution log-log spline interpolation used for shaping our artificial noise. Figure
\ref{simulated_noise_spectrum} shows the PSD of our simulated signal overlaid with the same spline as a red line and
shows time-domain traces of both simulated (blue) and reference signals (orange) at various time scales. Visually both
signals look very similar, suggesting we have found a good synthetic approximation of our measurements.
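
The noise shaping procedure can be sketched in a few lines of numpy. This is a simplified illustration and not our
notebook code: the function name and parameters are hypothetical, linear interpolation in log-log coordinates stands
in for the spline fit, and the sketch multiplies the white noise spectrum by the square root of the target PSD since
a PSD scales with the squared amplitude.

\begin{minted}{python}
import numpy as np


def synthesize_noise(psd_freqs, psd_values, n_samples, sample_rate, rng=None):
    """Generate noise whose one-sided PSD follows a coarse log-log fit.

    psd_freqs must be positive and increasing; psd_values are the target
    PSD at these frequencies (e.g. in Hz^2/Hz for grid frequency data).
    """
    rng = np.random.default_rng() if rng is None else rng
    white = rng.standard_normal(n_samples)
    spectrum = np.fft.rfft(white)
    freqs = np.fft.rfftfreq(n_samples, d=1.0 / sample_rate)

    # Resample the target PSD onto the DFT bins in log-log coordinates.
    # The DC bin is skipped because log(0) is undefined; its gain stays zero.
    log_psd = np.interp(np.log(freqs[1:]), np.log(psd_freqs), np.log(psd_values))

    # Unit-variance white noise has a one-sided PSD of 2/sample_rate, so the
    # amplitude gain needed to reach the target is sqrt(PSD * sample_rate / 2).
    gain = np.zeros_like(freqs)
    gain[1:] = np.sqrt(np.exp(log_psd) * sample_rate / 2.0)

    return np.fft.irfft(spectrum * gain, n=n_samples)
\end{minted}

Outside the frequency range covered by \texttt{psd\_freqs} this sketch simply holds the first or last PSD value, which
is a choice made for brevity here.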

\begin{figure}
    \centering
    \includegraphics[width=\textwidth]{../lab-windows/fig_out/freq_meas_spectrum}
    \caption{Power spectral density of the 24 hour grid frequency trace in Figure \ref{freq_meas_trace} with some notable
    peaks annotated with the corresponding period in seconds. The $\frac{1}{f}$ line indicates a pink noise spectrum.
    Around a period of \SI{20}{\second} the PSD starts to fall off at about $\frac{1}{f^3}$ until we can make out some
    bumps at periods around $2$ and \SI{3}{\second}.  Starting at around \SI{1}{\hertz} we can see a white noise floor in
    the order of \si{\micro\hertz^2\per\hertz}.
    % TODO: where does this noise floor come from? Is it a fundamental property of the grid? Is it due to limitations of
    % our measurement setup (such as ocxo stability/phase noise) ???
    }
    \label{freq_meas_spectrum}
\end{figure}

\begin{figure}
    \centering
    \includegraphics[width=\textwidth]{../lab-windows/fig_out/simulated_noise_spectrum}
    \caption{Synthetic grid frequency in comparison with measured data. The topmost graph shows the synthetic spectrum
    compared to the spline approximation of the measured spectrum (red line). The other graphs show time-domain
    synthetic data (blue) in comparison with measured data (orange).
    }
    \label{simulated_noise_spectrum}
\end{figure}

In our simulations, we manipulated four main variables of our modulation scheme and demodulation algorithm and observed
their impact on symbol error rate (SER):

\begin{description}
    \item[Modulation amplitude.] Higher amplitude should correspond to a lower SER.
    \item[Modulation bit count.] Higher bit count $n$ means longer transmissions but yields higher theoretical decoding
        gain, and should increase demodulator sensitivity. Ultimately, we want to find a sweet spot of manageable
        transmission length at good demodulator sensitivity.
    \item[Decimation, or DSSS chip duration.] The chip time determines where in the grid frequency spectrum (Figure
        \ref{freq_meas_spectrum}) our modulated signal is located. Given our noise spectrum, shorter chip durations
        (shifting our signal upwards in the spectrum) should yield lower in-band background noise which should
        correspond to lower symbol error rates.
    \item[Demodulation correlator peak threshold factor.] The first step of our prototype demodulation algorithm is to
        calculate the correlation between the input data and all $2^n+1$ Gold sequences
        and to identify peaks corresponding to the input data containing a correctly aligned Gold sequence. The
        threshold factor determines how far a peak has to stand out from baseline noise levels to be considered in the
        following maximum likelihood estimation (MLE) decoding (cf.\ Figure \ref{fig_demo_sig_schema}).
\end{description}
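
To make the count of $2^n+1$ Gold sequences and the correlation step more concrete, the following Python sketch
generates a set of 5-bit Gold codes and correlates them against each other. It is an illustration under assumptions:
the function names are hypothetical and the tap positions are a commonly tabulated preferred pair for $n=5$, not
necessarily the polynomials used in our prototype.

\begin{minted}{python}
import numpy as np


def lfsr_sequence(taps, nbits):
    """One period of the sequence a[k] = XOR of a[k - t] for all t in taps."""
    state = [1] * nbits                  # state[i] holds a[k - 1 - i]; any non-zero seed works
    out = []
    for _ in range(2**nbits - 1):
        out.append(state[-1])
        feedback = 0
        for t in taps:
            feedback ^= state[t - 1]
        state = [feedback] + state[:-1]
    return np.array(out)


def gold_codes(nbits=5, taps_a=(2, 5), taps_b=(2, 3, 4, 5)):
    """Return all 2^n + 1 Gold codes as rows of +1/-1 chips (assumed tap pair)."""
    u = lfsr_sequence(taps_a, nbits)
    v = lfsr_sequence(taps_b, nbits)
    # The set consists of both m-sequences plus u XOR every cyclic shift of v.
    codes = [u, v] + [u ^ np.roll(v, shift) for shift in range(len(v))]
    return 1 - 2 * np.array(codes)       # map bits {0, 1} to chips {+1, -1}


def cyclic_correlation(x, y):
    """Correlation of x with every cyclic shift of y."""
    return np.array([np.dot(x, np.roll(y, s)) for s in range(len(x))])
\end{minted}

For a correctly aligned code the correlation reaches the full sequence length of $2^n-1$, while correlations between
different codes of the set stay small in comparison. This gap is what the peak threshold factor operates on.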

Our results indicate that symbol error rate is a good proxy of demodulation performance. With decreasing signal-to-noise
ratio, margins in various parts of the demodulator decrease which statistically leads to an increased symbol error rate.
Our simulations yield smooth, reproducible SER curves with adequately low error bounds. This shows SER is related
monotonically to the signal-to-noise margins inside our demodulator prototype.

\subsection{Sensitivity as a function of sequence length}

A basic parameter of our DSSS modulation is the length of the Gold codes used. The length of a Gold code is exponential
in the code's bit count.  Figure \ref{dsss_gold_nbits_overview} shows a plot of the symbol error rate of our demodulator
prototype depending on amplitude for each of five, six, seven and eight-bit Gold sequences. In regions where symbol
error rate is between $0$ and $1$ we can see the expected dependency that an $(n+1)$-bit Gold sequence at roughly twice
the length yields roughly one half the SER. We can also observe a saturation effect: At low amplitudes, increasing the
correlation length does not seem to yield much of a benefit in SER anymore. In particular there seems to be a level of
about \SI{2.5}{\milli\hertz} signal amplitude where even with asymptotically infinite sequence length our demodulator
would still not be able to produce a good demodulation. This is likely due to numerical errors in our demodulator. Since
Gold codes of more than 7 bit would yield unacceptably long transmission times this does not pose a problem in practice.
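
As a worked example of how transmission time grows with bit count, assume a decimation factor $d$ and a grid frequency
sampling rate $f_s$ (symbols introduced here for illustration only). The symbol duration and the amortized duration
per bit for an $n$-bit Gold code are then
\begin{align*}
    T_\text{symbol} &= \frac{(2^n - 1)\, d}{f_s}, &
    T_\text{bit} &= \frac{T_\text{symbol}}{n}.
\end{align*}
With $d = 10$ and $f_s = \SI{10}{\hertz}$ as used in Figure \ref{dsss_gold_nbits_overview}, this yields
$T_\text{symbol} = \SI{31}{\second}$ and $T_\text{bit} = \SI{6.2}{\second}$ for $n=5$,
$T_\text{symbol} = \SI{63}{\second}$ and $T_\text{bit} = \SI{10.5}{\second}$ for $n=6$,
$T_\text{symbol} = \SI{127}{\second}$ and $T_\text{bit} \approx \SI{18.1}{\second}$ for $n=7$, and
$T_\text{symbol} = \SI{255}{\second}$ and $T_\text{bit} \approx \SI{31.9}{\second}$ for $n=8$.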

Figure \ref{dsss_gold_nbits_sensitivity} shows, for each bit count, the minimum signal amplitude where our demodulator
crossed below $\text{SER}=0.5$. If we have sufficient transmitter power to allocate, selecting either a 5 bit or a 6 bit
Gold code looks to yield good enough performance at manageable data rates.
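
One way to extract such a crossing point from a simulated SER curve is linear interpolation between the two
neighbouring amplitude steps. The following Python sketch shows the idea; the function name and the choice of linear
interpolation are assumptions for this example and may differ from what our notebooks do.

\begin{minted}{python}
import numpy as np


def sensitivity_amplitude(amplitudes, ser, level=0.5):
    """Amplitude at which the SER curve first reaches `level`, or None.

    amplitudes must be sorted in increasing order; ser holds the mean
    symbol error rate measured at each amplitude.
    """
    amplitudes = np.asarray(amplitudes, dtype=float)
    ser = np.asarray(ser, dtype=float)
    below = np.flatnonzero(ser <= level)
    if below.size == 0:
        return None                      # level never reached in the simulated range
    i = below[0]
    if i == 0:
        return float(amplitudes[0])      # already below the level at the lowest amplitude
    a0, a1 = amplitudes[i - 1], amplitudes[i]
    s0, s1 = ser[i - 1], ser[i]
    # Linear interpolation between the last point above and the first point below.
    return float(a0 + (s0 - level) * (a1 - a0) / (s0 - s1))
\end{minted}

Returning \texttt{None} when the level is never reached keeps such parameter sets distinguishable from ones with very
poor but finite sensitivity.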

\begin{figure}
    \centering
    \includegraphics[width=0.6\textwidth]{../lab-windows/fig_out/dsss_gold_nbits_overview}
    \caption{
        Symbol Error Rate (SER) as a function of transmission amplitude. The line represents the mean of several
        measurements for each parameter set. The shaded areas indicate one standard deviation from the mean. Background
        noise for each trial is a random segment of measured grid frequency. Background noise amplitude is the same for
        all trials. Shown are four traces for four different DSSS sequence lengths. Using a 5-bit Gold code, one DSSS
        symbol measures 31 chips. 6 bit per symbol are 63 chips, 7 bit are 127 chips and 8 bit 255 chips. This
        simulation uses a decimation of 10, which corresponds to a \SI{1}{\second} chip length at our \SI{10}{\hertz} grid
        frequency sampling rate. At 5 bit per symbol, one symbol takes \SI{31}{\second} and one bit takes \SI{6.2}{\second}
        amortized. At 8 bit one symbol takes $\SI{255}{\second} = \SI{4}{\minute}\,\SI{15}{\second}$ and one bit takes \SI{31.9}{\second}
        amortized. Here, slower transmission speed buys coding gain. All else being the same this allows for a decrease
        in transmission power.
    }
    \label{dsss_gold_nbits_overview}
\end{figure}

\begin{figure}
    \centering
    \begin{minipage}[c]{0.5\textwidth}
        \includegraphics{../lab-windows/fig_out/dsss_gold_nbits_sensitivity}
    \end{minipage}
    \begin{minipage}[c]{0.45\textwidth}
        \caption{
            Amplitude at a SER of 0.5\ in mHz depending on symbol length. Here we can observe an increase of sensitivity
            with increasing symbol length, but we can clearly see diminishing returns above 6 bit (63 chips). Considering
            that each bit roughly doubles overall transmission time for a given data length it seems lower bit counts are
            preferable if the necessary transmitter power can be realized.
        }
        \label{dsss_gold_nbits_sensitivity}
    \end{minipage}
\end{figure}

\subsection{Sensitivity versus peak detection threshold factor}

One of the high-level parameters of our demodulation algorithm is the \emph{threshold factor}. This parameter is
an implementation detail specific to our algorithm and not general to all possible DSSS demodulation algorithms. After
correlating the input signal against the template Gold sequences our algorithm runs a single-channel discrete wavelet
transform (DWT) on the correlator output to better discriminate peaks from background noise. The output of this DWT is
then normalized against a running average and then fed into a simple threshold detector. The threshold of this detector
is our threshold factor. This threshold is the ratio by which a correlation peak after DWT has to stand out from the
long-term average background noise to be considered a peak.
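
The normalization and threshold steps can be sketched as follows. This Python example deliberately omits the DWT stage
and operates directly on a correlator output; the function name, the window length and the use of a simple moving
average as the running average are assumptions made for illustration.

\begin{minted}{python}
import numpy as np


def detect_peaks(correlation, threshold_factor, window=64):
    """Indices where |correlation| exceeds threshold_factor times its running average."""
    magnitude = np.abs(np.asarray(correlation, dtype=float))
    kernel = np.ones(window) / window
    # Moving average of the magnitude as an estimate of the background level.
    baseline = np.convolve(magnitude, kernel, mode="same")
    # Guard against division by zero in completely silent stretches.
    ratio = magnitude / np.maximum(baseline, 1e-12)
    return np.flatnonzero(ratio > threshold_factor)
\end{minted}

Note that the moving average is zero-padded at both ends of the input, so the baseline is underestimated within half a
window of the edges. A practical implementation would have to treat these regions separately.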

The threshold factor is an empirically determined parameter. Low threshold factors yield many false positives that in
the extreme ultimately overload our MLE's capacity to discard them. Moderate numbers of false positives do not pose
much of a challenge to our MLE since these spurious peaks have a random time distribution and are easily discarded by
our MLE's symbol chain detection.  High threshold factors lead the algorithm to completely ignore some valid peaks. To
some degree this can be compensated by our later interpolation step for missing peaks but in the extreme will also break
demodulation. In our simulations good values lie in the range from $4.0$ to $5.5$.

Figure \ref{dsss_thf_amplitude_5678} contains plots of demodulator sensitivity like the one in Figure
\ref{dsss_gold_nbits_overview}. This time there is one color-coded trace for each threshold factor between $1.5$ and
$10.0$ in steps of $0.5$. We can see a clear dependency of demodulation performance on threshold factor with both very
low and very high values breaking the demodulator. The ``runaway'' traces that we can see at low threshold factors are
artifacts of an implementation issue with our prototype code. We later fixed this issue in the demonstrator firmware
implementation in Section \ref{sec-demo-fw-impl}. For comparison purposes this issue does not matter.

\begin{figure}
    \centering
    \includegraphics{../lab-windows/fig_out/dsss_thf_amplitude_5678}
    \caption{
        SER vs.\ amplitude graph similar to Figure \ref{dsss_gold_nbits_overview} with one color-coded trace for each
        threshold factor between $1.5$ and $10.0$.  Each graph shows traces for a single DSSS symbol length.
    }
    \label{dsss_thf_amplitude_5678}
\end{figure}

If we again look at the intercept points where the amplitude traces cross $\text{SER}=0.5$ in these graphs we get the
plots in Figure \ref{dsss_thf_sensitivity_all_bits}. From this we can conclude that the range between $4.0$ and $5.0$ will
yield adequate threshold factors for our use case.

\begin{figure}
    \centering
    \includegraphics{../lab-windows/fig_out/dsss_thf_sensitivity_5678}
    \caption{
        Graphs of amplitude at $\text{SER}=0.5$ for each symbol length as well as asymptotic SER for large amplitudes.  Areas
        shaded red indicate that $\text{SER}=0.5$ was not reached for any amplitude in the simulated range. The bumps in the 7
        bit and 8 bit graphs are due to the convergence problem we identified above and do not exist in our demonstrator
        implementation. We see that smaller symbol lengths favor lower threshold factors, and that optimal threshold
        factors for all symbol lengths are between $4.0$ and $5.0$.
    }
    \label{dsss_thf_sensitivity_all_bits}
\end{figure}

\subsection{Chip duration and bandwidth}

A parameter of any DSSS system is the frequency band used for transmission. Instead of specifying absolute frequencies
in our simulations we expressed DSSS bandwidth through chip duration and Gold sequence length. In our prototype, chip
duration is specified in grid frequency sampling periods to ease implementation without loss of generality.

Figure \ref{chip_duration_sensitivity} shows the dependence of symbol error rate on chip duration at a fixed good
threshold factor. The color bars indicate both chip duration translated to seconds real-time and the resulting symbol
duration at the given Gold code length. In the lower graphs we show the trace of amplitude at $\text{SER}=0.5$ over chip
duration like we did in Figure \ref{dsss_thf_sensitivity_all_bits} for threshold factor. In both graphs we can just about
see an optimum for very short chips with a decrease of sensitivity for long chips. This effect is due to longer chips
moving the signal band into noisier spectral regions (cf.\ Figure \ref{freq_meas_spectrum}).
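
To make the link between chip duration and spectral position concrete, consider idealized rectangular chips. This is
an assumption made here for illustration, since the response of the grid will smooth the chip edges. With decimation
$d$ and grid frequency sampling rate $f_s$ the chip duration is $T_c = d / f_s$, and the power spectrum of a random
chip sequence has the envelope
\begin{equation*}
    S(f) \propto T_c \operatorname{sinc}^2(f T_c), \qquad \operatorname{sinc}(x) = \frac{\sin(\pi x)}{\pi x},
\end{equation*}
whose main lobe extends up to the first null at $f = 1/T_c$. For $T_c = \SI{1}{\second}$ the main lobe thus ends at
\SI{1}{\hertz}, while for $T_c = \SI{0.2}{\second}$ it extends up to \SI{5}{\hertz}. Longer chips concentrate the
signal energy at lower frequencies, which is where Figure \ref{freq_meas_spectrum} shows most of the noise energy.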

\begin{figure}
    \begin{subfigure}{\textwidth}
        \centering
        \includegraphics[width=\textwidth]{../lab-windows/fig_out/chip_duration_sensitivity_5}
        \label{chip_duration_sensitivity_5}
        \caption{
        5 bit Gold code.
        }
    \end{subfigure}
\end{figure}
\begin{figure}
    \ContinuedFloat
    \begin{subfigure}{\textwidth}
        \centering
        \includegraphics[width=\textwidth]{../lab-windows/fig_out/chip_duration_sensitivity_6}
        \label{chip_duration_sensitivity_6}
        \caption{
        6 bit Gold code.
        }
    \end{subfigure}
    \caption{
        Dependence of demodulator sensitivity on DSSS chip duration. Due to computational constraints this simulation is
        limited to 5 bit and 6 bit DSSS sequences. There is a clearly visible sensitivity maximum at fairly short chip
        lengths around \SI{0.2}{\second}. Short chip durations shift the entire transmission band up in frequency. In
        Figure \ref{freq_meas_spectrum} we can see that noise energy is mostly concentrated at lower frequencies, so
        shifting our signal up in frequency will reduce the amount of noise the decoder sees behind the correlator by
        shifting the band of interest into a lower-noise spectral region. For a practical implementation chip duration
        is limited by physical factors such as the maximum modulation slew rate ($\frac{\text{d}P}{\text{d}t}$) and the
        maximum Rate-Of-Change-Of-Frequency (ROCOF, $\frac{\text{d}f}{\text{d}t}$) the grid can tolerate.
    }
    \label{chip_duration_sensitivity}
\end{figure}

In the previous graphs we have used random clips of measured grid frequency noise as noise in our simulations. Comparing
between a simulation using measured noise and synthetic noise generated as we outlined in the beginning of Section
\ref{sec-ch-sim} we get the plots in Figure \ref{chip_duration_sensitivity_cmp}. We can see that while not perfect our
simulated noise is an adequate approximation of reality: Our prototype demodulator shows no significant difference in
behavior between measured and simulated noise. Simulated noise causes slightly worse performance for long chips. Overall
the results for both are very close in absolute value.

\begin{figure}
    \begin{subfigure}{\textwidth}
        \centering
        \includegraphics[width=\textwidth]{../lab-windows/fig_out/chip_duration_sensitivity_cmp_meas_6}
        \label{chip_duration_sensitivity_cmp_meas_6}
        \caption{
            Simulation using baseline frequency data from actual measurements.
        }
    \end{subfigure}
\end{figure}
\begin{figure}
    \ContinuedFloat
    \begin{subfigure}{\textwidth}
        \centering
        \includegraphics[width=\textwidth]{../lab-windows/fig_out/chip_duration_sensitivity_cmp_synth_6}
        \label{chip_duration_sensitivity_cmp_synth_6}
        \caption{
            Simulation using synthetic frequency data.
        }
    \end{subfigure}
    \caption{
        Chip duration/sensitivity simulation results like in Figure \ref{chip_duration_sensitivity} compared between a
        simulation using measured frequency data like previous graphs and one using artificially generated noise. There
        is little visible difference indicating that we have found a good model of reality in our noise synthesizer, but
        also that real grid frequency behaves like a frequency-shaped Gaussian noise process.
    }
    \label{chip_duration_sensitivity_cmp}
\end{figure}

\section{Implementation of a demonstrator unit}
\label{sec-prototype}

To demonstrate the viability of our reset architecture we decided to implement a demonstrator system. In this
demonstrator we use JTAG to reset part of a commodity smart meter from an externally-connected reset controller. The
reset controller receives its commands over the grid frequency modulation system we outlined in this thesis. To keep
implementation cost low the reset controller is fed a simulation of a modulated grid frequency signal through a standard
\SI{3.5}{\milli\meter} audio jack\footnote{
    By generously cutting two PCB traces the meter we chose to use can be easily modified to provide strong galvanic
    separation between grid and main application microcontroller. With this modification we have to supply power to its
    main application MCU externally along with the JTAG interface.
}. Measurement of actual grid frequency instead would simply require a voltage divider and depending on the setup an
analog optoisolator.

\subsection{Selecting a smart meter for demonstration purposes}
\label{sec-easymeter}

For our demonstrator to make sense we wanted to select a realistic reset target. In Germany where this thesis was
written a standards-compliant setup would consist of a fairly dumb smart meter and a smart meter gateway (SMGW)
containing all of the complex bidirectional protocol logic such as wireless or landline IP connectivity. The realistic
target for a setup in this architecture would be the components of an SMGW such as its communications modem or main
application processor. In the German architecture the smart meter does not even have to have a bi-directional data link
to the SMGW, effectively mitigating any attack vector for remote compromise.

Despite these considerations we still chose to reset the application MCU inside the smart meter for two reasons. One is that
SMGWs are much harder to come by on the second-hand market. The other is that SMGWs are a particular feature of the
German standardization landscape and in many other countries functions of an SMGW such as wireless protocol handling are
integrated into the meter itself (see e.g.\ \cite{honeywell01}).

In the end we settled on a Q3DA1002 three-phase \SI{60}{\ampere} meter made by German manufacturer EasyMeter. This meter is typical
of what would be found in an average German household and can be acquired very inexpensively as new old stock on online
marketplaces.

The meter consists of a plastic enclosure with a transparent polycarbonate top part and a grey ABS bottom part that are
ultrasonically welded shut. In the bottom part of the case a PCB we call the \emph{measurement} board is potted in
epoxide resin (see Figure \ref{easymeter_composites}). This PCB contains three separate energy measurement ASICs for the
three phases (see Figure \ref{easymeter_detail_xrays}). It also contains a capacitive dropper power supply for the meter
circuitry and external modules such as an SMGW.  The measurement board communicates through three infrared links (one
per phase) with a smaller unpotted PCB we call the \emph{display} board in the top of the case. This PCB handles
measurement logging and aggregation, controls a small segment LCD displaying totals and handles the externally
accessible \si{\kilo\watt\hour} impulse LED and serial IR links.

The measurement board does not contain any logging or outside communication interfaces. All of that is handled on the
display board by a Texas Instruments MSP430F2350 application MCU. This is a 16-bit RISC MCU with \SI{16}{\kilo\byte}
flash and \SI{2}{\kilo\byte} SRAM\footnote{
    The microcontroller might seem a bit overkill for such a simple application, but most of its \SI{16}{\kilo\byte}
    program flash is in fact used. A casual glance with Ghidra shows that a large part of program flash is expended on
    keeping multiple redundant copies of energy consumption aggregates including error recovery in case of data
    corruption and some effort has even been made to guard against data corruption using simple non-cryptographic
    checksums. Another large part of the MCU's firmware handles data transmission over the meter's externally accessible
    IR link through Smart Message Language\cite{bsi-tr-03109-1-IVb}.
}. There is an I2C EEPROM that is used in conjunction with the microcontroller's internal \SI{256}{\byte} data flash to
keep redundant copies of energy consumption aggregates. On the side of the base board is a 14-pin header containing both
a standard TI MSP430 JTAG pinout and a UART serial link for debugging. Conveniently the JTAG port was left enabled by
fuse in our particular production unit.

We chose to use this MSP430 series application MCU as our reset target. Though in this particular unit compromise is
impossible due to a lack of bi-directional communication links some of its sister models do contain bidirectional
communication links\cite{easymeter01} making compromise through communication interfaces at least a theoretical
possibility. In other countries meters with a similar architecture to the Q3DA1002 commonly include complex protocol
logic as part of the meter itself\cite{honeywell01,ifixit01}. As an example, the Honeywell REX2 uses a Maxim Integrated
71M6541 main application microcontroller along with a Texas Instruments CC1000 series radio transceiver and is
advertised to support both over-the-air firmware upgrades and a remotely accessible ``service control switch''.

% TODO add pics of the intact easymeter and of the one with the safety reset0r hooked up

\begin{figure}
    \centering
    \begin{subfigure}{\textwidth}
        \centering
        \includegraphics[width=0.6\textwidth]{resources/easymeter_board_composite.jpg}
        \label{easymeter_display_board_composite}
        \caption{
            \footnotesize
            Optical composite image of the display and data logging board in the top of the case. The six pins at the
            top are the SPI chip-on-glass segment LCD. Of the eight pads on the left six are unused and two carry the
            auxiliary power supply from the measurement board below. The bottom right section contains the
            \si{\kilo\watt\hour} impulse LED and the angled IR communication LED. The flying wires
            connect to the 14-pin JTAG and serial debug header.
        }
    \end{subfigure}
    \begin{subfigure}{\textwidth}
        \vspace{1cm}
        \centering
        \includegraphics[width=0.8\textwidth]{resources/easymeter_baseboard_composite.jpg}
        \label{easymeter_measurement_board_composite}
        \caption{
            \footnotesize
            Composite microfocus x-ray image of the potted measurement module in the bottom of the case. The ovals on
            the top left and right are power supply and data jumper connections for external modules such as SMGW
            interfaces. The bright parts at the bottom are the massive screw terminals with integrated current shunts.
            The circuitry right of the three independent measurement channels is the power supply circuit for the
            display board.
        }
    \end{subfigure}

    \caption{
        Composite images of the circuit boards inside the EasyMeter Q3DA1002 ``smart'' electricity meter used in our
        demonstration.
    }
    \label{easymeter_composites}
\end{figure}

\begin{figure}
    \centering
    \begin{subfigure}{0.45\textwidth}
        \centering
        \includegraphics[width=\textwidth]{resources/easymeter_baseboard_channel.jpg}
        \label{easymeter_channel_xray}
        \caption{Microfocus x-ray of one channel's data acquisition circuit.}
    \end{subfigure}\hspace*{5mm}
    \begin{subfigure}{0.45\textwidth}
        \centering
        \includegraphics[width=\textwidth]{resources/easymeter_baseboard_powersupply.jpg}
        \label{easymeter_powersupply_xray}
        \caption{Microfocus x-ray of the auxiliary power supply.}
    \end{subfigure}

    \caption{
        Microfocus x-rays of major sections of the EasyMeter Q3DA1002 measurement board.
    }
    \label{easymeter_detail_xrays}
\end{figure}

\subsection{Firmware implementation}
\label{sec-demo-fw-impl}

We based our safety reset demonstrator firmware on the grid frequency sensor firmware we developed in sec.\
\ref{sec-fsensor}. We implemented DSSS demodulation by translating the Python prototype code we developed in sec.\
\ref{sec-ch-sim} to embedded C code. After validating the C translation in extensive simulations we integrated our code
with a Reed-Solomon implementation and a libsodium-based implementation of the cryptographic protocol we designed in
sec.\ \ref{sec-crypto}.  To reprogram the target MSP430 microcontroller we ported over the low-level bitbang JTAG driver
of \texttt{mspdebug}\footnote{\url{https://github.com/dlbeer/mspdebug}}. See Figure \ref{fig_demo_sig_schema} for a
schematic overview of signal processing in our demonstrator.

For all computation-heavy high-level modules of our firmware such as the DSSS demodulator or the grid frequency
estimator we wrote test fixtures that allow the same code that runs on the microcontroller to be executed on the host
for testing. These test fixtures are very simple C programs that load input data from a file or the command line, run
the algorithm and print results on standard output.

\begin{figure}
    \centering
    \includegraphics[width=\textwidth]{resources/prototype_schema}
    \caption{The signal processing chain of our demonstrator.}
    \label{fig_demo_sig_schema}
\end{figure}

\section{Grid frequency modulation emulation}

To emulate a modulated grid frequency signal we superimposed a DSSS-modulated signal at the proper amplitude with
synthetic grid frequency noise generated according to the measurements we took in sec.\ \ref{sec-fsensor}. In this
primitive simulation we do not model the precise impulse response of the grid to a DSSS-modulated stimulus signal.
Our results still serve to illustrate the possibility of data transmission in this manner, since this impulse response
can be compensated for at the transmitter by selecting appropriate modulation parameters (e.g.\ chip rate and amplitude)
and at the receiver by equalization with a matched filter.
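
As a sketch of the superposition described above, the emulated signal can be written as follows. The symbol names are
introduced here for illustration only, and a bipolar chip waveform is assumed.

\begin{equation*}
    f_{\text{emu}}(t) = f_{\text{nom}} + A \cdot m(t) + n(t), \qquad m(t) \in \{-1, +1\}
\end{equation*}

Here $f_{\text{nom}}$ is the nominal grid frequency, $A$ the modulation amplitude expressed as a frequency deviation,
$m(t)$ the DSSS-modulated chip waveform and $n(t)$ the synthetic grid frequency noise. In a more detailed model, the
term $A \cdot m(t)$ would be replaced by its convolution with the impulse response of the grid.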

\section{Experimental results}

After extensive simulations and testing of the individual modules of our solution we proceeded to conduct a real-world
experiment. We ran the demonstrator setup with an emulated noisy DSSS signal in real time. The experiment completed
without any issues and the firmware implementation correctly reset the demonstrator's meter. We were happy to see that
our extensive testing paid off: the demonstrator setup worked on its first try.
% FIXME add pictures of the finished demo setup in action
% FIXME maybe add an SER curve here?

\section{Lessons learned}

Before settling on the commercial smart meter we first tried to use an EVM430-F6779 smart meter evaluation kit made by
Texas Instruments. This evaluation kit did not work out for two main reasons. First, it shipped with half the case
missing and no cover for the terminal blocks, so some work was required to maintain electrical safety. Even after
mounting it in an electrically safe manner, the main MCU is not isolated from the grid and the JTAG port is galvanically
coupled to it, so the safety reset controller prototype would also have to be galvanically isolated to not pose an
electrical safety risk. The second issue we ran into was that the EVM430-F6779 is based around an MSP430F6779
microcontroller. This microcontroller is a rather large part within the MSP430 series and uses a particularly new
revision of the CPU core and associated JTAG peripheral that are incompatible with all MSP430 programmers we tried to
use on it. \texttt{mspdebug} does not have support for it and porting TI's own JTAG programmer reference sources did not
yield any results either. Finally we tried a USB-based programmer made by TI themselves that turned out to either have
broken firmware or a hardware defect, leading to it frequently re-enumerating on the USB.

Overall our initial assumption that a development kit would certainly be easier to program than a commercial meter did
not prove to be true. Contrary to our expectations the commercial meter had JTAG enabled, allowing us to easily read out
its stock firmware without needing to reverse-engineer vendor firmware update files or circumvent code protection
measures. The fact that its firmware was only available in its compiled binary form was not much of a hindrance as it
proved not to be too complex and all we wanted to know could be found out with just a few hours of digging in Ghidra.

In the firmware development phase our approach of testing every module individually (e.g. DSSS demodulator, Reed-Solomon
decoder, grid frequency estimation) proved to be very useful. In particular debugging benefited greatly from being able
to run a couple thousand tests within seconds. In the case of our DSSS demodulator this modular testing and simulation
architecture allowed us to simulate many thousands of runs of our implementation on test data and directly compare it to
our Jupyter/Python prototype (see Figure \ref{fw_proto_comparison}). Since we spent more time polishing our embedded C
implementation it turned out to perform much better than our initial Python prototype. At the same time it shows a
fundamentally similar response to its parameters. One significant bug we fixed in the embedded C version is the Python
version's tendency towards incorrect decodings even at very large amplitudes.

\begin{figure}
    \centering
    \begin{subfigure}{\textwidth}
        \centering
        \includegraphics[trim={0 4cm 0 0},clip]{../lab-windows/fig_out/dsss_thf_amplitude_56_jupyter_impl}
        \caption{Python prototype.}
    \end{subfigure}
    \begin{subfigure}{\textwidth}
        \centering
        \includegraphics[trim={0 4cm 0 0},clip]{../lab-windows/fig_out/dsss_thf_amplitude_56_fw_impl}
        \caption{Embedded C implementation.}
    \end{subfigure}

    \caption{
        Symbol error rate plots versus threshold factor for both our Python prototype (above) and our firmware
        implementation (below) of our demodulation algorithm. Note the slightly different threshold factor color scales. Cf.\
        Figure \ref{dsss_thf_amplitude_5678}.
    }
    \label{fw_proto_comparison}
\end{figure}

In accordance with our initial estimations we did not run into any code space or computation bottlenecks from choosing
floating-point emulation instead of porting our algorithms to fixed-point calculations. The extremely slow sampling
rate of our systems makes even heavyweight processing such as FFT or our rather brute-force dynamic programming approach
to DSSS demodulation possible well within performance constraints.

Compiled code size of our firmware implementation is slightly larger than we would like at around \SI{64}{\kilo\byte}
for our firmware image including everything except the target microcontroller firmware image. See appendix
\ref{symbol_size_chart} for a graph illustrating the contribution of various parts of the signal processing toolchain to
this total. Overall the largest contributors by far are the SHA-512 implementation from libsodium and the FFT
from ARM's CMSIS signal processing library.

\chapter{Future work}

\section{Precise grid characterization}

We based our simulations on a linear relationship between generation/consumption power imbalance and grid frequency.
Our literature study suggests that this is an appropriate first-order approximation\cite{crastan03}.  We kept modulation
bandwidth in our simulations inside a \SIrange{100}{1000}{\milli\hertz} frequency band that we reason is most likely to
exhibit this linear behavior in practice. At lower frequencies primary control kicks in. With the frequency delta
thresholds specified for primary control systems\cite{entsoe04} this will likely lead to significant non-linear effects.
At higher frequencies grid frequency estimation at the receiver becomes more complex.  Higher frequencies also come
close to modes of mechanical oscillation in generators (usually at \SI{5}{\hertz} and above\cite{crastan03}).

An analysis of the above concerns can be performed using dynamic grid simulation models\cite{semerow01,entsoe05}.
Presumably out of safety concerns these models are only available under non-disclosure agreements. Integrating
NDA-encumbered results stemming from such a model in an open-source publication such as this one poses a logistical
challenge, which is why we decided to leave this topic for separate future work. After detailed model simulation we
ultimately aim to validate our results experimentally. Assuming linear grid behavior even under very small disturbances,
a small-scale experiment is an option. Such a small-scale experiment would require very long integration times.

Given a frequency characteristic of \SI{30}{\giga\watt\per\hertz} a stimulus of \SI{10}{\kilo\watt} yields $\Delta f =
\SI{0.33}{\micro\hertz}$. At an estimated \SI{20}{\milli\hertz} of RMS noise over a bandwidth of interest this results
in an SNR slightly better than \SI{-50}{\decibel}. The correlation time necessary to offset this with DSSS processing
gain at a chip rate of \SI{1}{\baud} would be in the order of days. With such long correlation times clock stability
starts to become a problem as during correlation transmitter and receiver must maintain close phase alignment w.r.t.\
one chip period. A $\leq \SI{10}{\degree}$ phase difference requirement over this period of time would translate into
clock stability better than \SI{10}{ppm}. Though certainly not impossible to achieve this does pose an engineering
challenge.
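
The orders of magnitude in the estimate above can be retraced as follows. This is a sketch only: the symbols $\Delta P$,
$K$, $G_p$, $N$, $R_c$ and $T_{\text{corr}}$ are introduced here for illustration, the SNR deficit of
\SI{50}{\decibel} is taken from the text above as given, and the processing gain is assumed to follow the idealized
rule of $10 \log_{10} N$ for coherent correlation over $N$ chips in white noise.

\begin{align*}
    \Delta f &= \frac{\Delta P}{K}
        = \frac{\SI{10}{\kilo\watt}}{\SI{30}{\giga\watt\per\hertz}}
        \approx \SI{0.33}{\micro\hertz} \\
    G_p &= 10 \log_{10} N \geq \SI{50}{\decibel}
        \quad\Longrightarrow\quad N \geq 10^{5} \\
    T_{\text{corr}} &= \frac{N}{R_c} \geq \frac{10^{5}}{\SI{1}{\baud}}
        = \SI{e5}{\second} \approx \SI{28}{\hour}
\end{align*}

Under these assumptions the minimum correlation time is about one day. Since measured grid frequency noise need not be
white, the figure required in practice may differ from this idealized bound.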

A possible way to maintain clock alignment is to use grid frequency itself as a reference. Instead of keying the DSSS
modulator/demodulator on a local crystal oscillator, chip timings would be described in fractions of a mains voltage
cycle. This would track grid frequency variations synchronously at both ends and would maintain phase alignment even
over long periods of time at the cost of a slight increase in system complexity.
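
One way to formalize this idea is to define chip boundaries by accumulated mains cycles instead of elapsed time. In the
following sketch the symbols $t_k$, $N_c$, $f_{\text{nom}}$ and $R_c$ are hypothetical, and a nominal grid frequency of
\SI{50}{\hertz} together with the chip rate of \SI{1}{\baud} from the estimate above is assumed.

\begin{equation*}
    \int_{t_0}^{t_k} f(t) \,\mathrm{d}t = k \cdot N_c,
    \qquad N_c = \frac{f_{\text{nom}}}{R_c} = \frac{\SI{50}{\hertz}}{\SI{1}{\baud}} = 50
\end{equation*}

Here $t_k$ is the start of the $k$-th chip, $f(t)$ the instantaneous grid frequency and $N_c$ the number of mains cycles
per chip. Transmitter and receiver would both count mains cycles, e.g.\ by counting zero crossings of the mains voltage,
so that the chip duration stretches and shrinks with grid frequency in the same way at both ends.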

\section{Technical standardization}

The description of a safety reset system provided in this work could be translated into a formalized technical standard
with relatively low effort. Our system is very simple compared to e.g. a full smart meter communication standard and
thus can conceivably be described in a single, concise document. The much more complicated side of standardization would
be the standardization of the backend operation including key management, coördination and command authorization.

\section{Regulatory adoption}

Since the proposed system adds significant cost and development overhead at no immediate benefit to either consumer or
utility company it is unlikely that it would be adopted voluntarily. Market forces limit what long-term planning utility
companies can do. An advanced mitigation such as this one might be out of their reach on their own and might require
regulatory intervention to be implemented. To regulatory authorities a system such as this one provides a primitive to
guard against attacks. Due to the low-level approach our system might allow a regulatory authority to restore meters to
a safe state without the need for fine-grained control of implementation details such as application network protocols.

A regulatory authority might specify that all smart meters must use a standardized reset controller that on command
resets to a minimal firmware image that disables external communication, continues basic billing functions and enables
any disconnect switches. This system would enable the \emph{reset authority} to directly preempt a large-scale attack
irrespective of implementation details of the various smart meter implementations.

Cryptographic key management for the safety reset system is not much different from the management of highly privileged
signing keys as they are used in many other systems already.  If the safety reset system is implemented with a
regulatory authority as the \emph{reset authority} they would likely be able to find a public entity that is already
managing root keys for other government systems to also manage safety reset keys. Availability and security requirements
of safety reset keys do not differ significantly from those for other types of root keys.

\section{Zones of trust}

In our design, we opted for a safety reset controller in the form of a microcontroller entirely separate from
whatever application microcontroller the smart meter design is already using.  This design nicely separates the meter
into an untrusted application (the core microcontroller) and the trusted reset controller. Since the interface between
the two is simple and logically one-way, it can be validated to a high standard of security.

Despite these security benefits, the cost of such a separate hardware device might prove high in a mass-market rollout.
In this case, one might attempt to integrate the reset controller into the core microcontroller in some way. Primarily,
there would be two ways to accomplish this. One is a solution that physically integrates an additional microcontroller
core into the main application microcontroller package either as a submodule on the same die or as a separate die in a
multi-chip module (MCM) with the main application microcontroller. A full-custom solution integrating both on a single
die might be a viable path for very large-scale deployments, but will most likely be too expensive in tooling costs
alone to justify its use. More likely for a medium- to large-scale deployment (millions of meters) would be an MCM
integrating an off-the-shelf smart metering microcontroller die with the reset controller running on another, much
smaller off-the-shelf microcontroller die. This solution might potentially save some cost compared to a solution using a
discrete microcontroller for the reset controller.

The more likely approach to reducing cost overhead of the reset controller would be to employ virtualization
technologies such as ARM's TrustZone in order to incorporate the reset controller firmware into the application firmware
on the same processor core without compromising the reset controller's security or disturbing the application firmware's
operation.

TrustZone is a virtualization technology that provides a hardware-assisted privileged execution domain on at least one
of the microcontroller's cores. In traditional virtualization setups a privileged hypervisor is managing several
unprivileged applications sharing resources between them. Separation between applications in this setup is longitudinal
between adjacent virtual machines. Two applications would both be running in unprivileged mode sharing the same CPU and
the hypervisor would merely schedule them, configure hardware resource access and coördinate communication. This
longitudinal virtualization simplifies application development since from the application's perspective the virtual
machine looks very similar to a physical one. In addition, in general this setup reciprocally isolates two applications
with neither one being able to gain control over the other.

In contrast to this, a TrustZone-like system in general does not provide several application virtual machines and
longitudinal separation. Instead, it provides lateral separation between two domains: The unprivileged application
firmware and a privileged hypervisor. Application firmware may communicate with the hypervisor through defined
interfaces but due to TrustZone's design it need not even be aware of the hypervisor's existence. This makes it a perfect
fit for our reset controller. The reset controller firmware would be running in privileged mode without exposing any
communication interfaces to application firmware. The application firmware would be running in unprivileged mode
without any modification. The main hurdles to the implementation of a system like this are the requirement for a
microcontroller providing this type of virtualization on the one hand and the complexity of correctly employing this
virtualization on the other hand. Virtualization systems such as TrustZone are still orders of magnitude more complex to
correctly configure than it is to simply use separate hardware and secure the interfaces in between.

\chapter{Conclusion}

In this thesis we have developed an end-to-end design of a reset system to restore smart meters to a safe operating
state during an ongoing large-scale cyberattack. We have laid out the fundamentals of smart metering infrastructure and
elaborated the need for an out-of-band method to reset device firmware due to the large attack surface of this complex
firmware.  To allow our system to be triggered even in the middle of a cyberattack we have developed a broadcast data
transmission system based on intentional modulation of global grid frequency. We have developed the theoretical
foundations of the process based on an established model of inertial grid frequency response to load variations and
shown the viability of our end-to-end design through extensive simulations. To properly ground these simulations we have
developed a grid frequency measurement methodology comprising a custom-designed hardware device for electrically safe
data capture and a set of software tools to archive and process captured data. Our simulations show good behavior of our
broadcast communication system and give an indication that coöperating with a large consumer such as an aluminium
smelter would be a feasible way to set up a transmitter at very low hardware overhead.  Based on our broadcast primitive
we have developed a cryptographic protocol ready for embedded implementation in resource-constrained systems that allows
quick (response time less than 30 minutes) triggering of all or a selected subset of devices. Finally, we have
experimentally validated our system using simulated grid frequency data in a demonstrator setup based on a commercial
microcontroller as our safety reset controller and an off-the-shelf smart meter. We have laid out a path for further
research and standardization related to our system.

\newpage

%\nocite{*} TODO: check unused references
\printbibliography[heading=bibintoc]
\newpage

\appendix
%\chapter{Transcripts of Jupyter notebooks used in this thesis}

%\includenotebook{Grid frequency estimation}{grid_freq_estimation}
%\includenotebook{Grid frequency estimation validation against ROCOF test suite}{freq_meas_validation_rocof_testsuite}
%\includenotebook{Frequency sensor clock stability analysis}{gps_clock_jitter_analysis}
%\includenotebook{DSSS modulation experiments}{dsss_experiments-ber}

\chapter{Frequency sensor schematics}
\fancyhead[C]{Frequency sensor schematics (1/3)}
\fancyfoot[C]{}
\fancyhead[R]{\thepage}
\includepdf[fitpaper,landscape,pagecommand={\thispagestyle{fancy}}]{resources/platform-export-pg1.pdf}
\fancyhead[C]{Frequency sensor schematics (2/3)}
\includepdf[fitpaper,pagecommand={\thispagestyle{fancy}}]{resources/platform-export-pg2.pdf}
\fancyhead[C]{Frequency sensor schematics (3/3)}
\includepdf[fitpaper,landscape,pagecommand={\thispagestyle{fancy}}]{resources/platform-export-pg3.pdf}
\fancyfoot[C]{\thepage}

%\chapter{Firmware source code excerpts}
%\section{DMA-backed ADC capture (adc.c)}
%\inputminted[fontsize=\footnotesize,linenos,firstline=18,lastline=115,breaklines]{C}{../gm_platform/fw/adc.c}
%
%\section{Frequency sensor packetized serial interface}
%\subsection{serial.c}
%\inputminted[fontsize=\footnotesize,linenos,breaklines]{C}{../gm_platform/fw/serial.c}
%\subsection{packet\_interface.c}
%\inputminted[fontsize=\footnotesize,linenos,breaklines]{C}{../gm_platform/fw/packet_interface.c}
%\subsection{cobs.c}
%\inputminted[fontsize=\footnotesize,linenos,breaklines]{C}{../gm_platform/fw/cobs.c}
%\subsection{Host data logging utility (tw\_test.py)}
%\inputminted[fontsize=\footnotesize,linenos,breaklines]{python}{../gm_platform/fw/tw_test.py}
%
%\section{Frequency estimation (freq\_meas.c)}
%\inputminted[fontsize=\footnotesize,linenos,breaklines]{C}{../controller/fw/src/freq_meas.c}
%\section{DSSS demodulation (dsss\_demod.c)}
%\inputminted[fontsize=\footnotesize,linenos,breaklines]{C}{../controller/fw/src/dsss_demod.c}
%\section{Cryptographic protocol handling}
%\subsection{protocol.c}
%\inputminted[fontsize=\footnotesize,linenos,breaklines]{C}{../controller/fw/src/protocol.c}
%\subsection{crypto.c}
%\inputminted[fontsize=\footnotesize,linenos,breaklines]{C}{../controller/fw/src/crypto.c}


\chapter{Demonstrator firmware symbol size map}
\label{symbol_size_chart}
\includepdf[fitpaper]{resources/safetyreset-symbol-sizes.pdf}

% FIXME
%\chapter{Economic viability of countermeasures}
%\section{Attack cost}
%\section{Countermeasure cost}
%\section{Conclusion}

% FIXME maybe include a standard for the technical side of a safety reset system here, e.g. in the style of an IETF draft?

\end{document}