Authors: Ximin Luo

Created: 29.11.2019


Authentication

In a public secure decentralised network, it’s important for some level of service to be provided to unauthenticated users, e.g. for transparency and auditing purposes. This, however, opens up avenues for malicious agents to attack the network.

Fortunately, this is straightforward to defend against, at least conceptually - simply constrain the amount of resources allocated to serving unauthenticated users. Extending this to a more general case, we can imagine that, depending on the needs of the higher layer (e.g. the application), there may be several possible roles available for authentication, where some roles have higher resource priorities than others.

In a computer system, the possible resource types are typically taken to be CPU, memory and bandwidth. For now we’ll focus only on the last of these, bandwidth, since it’s the simplest to control directly - and doing so should also help to implicitly constrain the other two, albeit with less accuracy than a more direct mechanism.

To clarify our terminology, note that there are two things commonly referred to as “authentication”:

  1. message authentication, where communication data are provably linked to some cryptographic identity such as the entity controlling a private key.

  2. identity authentication, where cryptographic identities are linked to some other source of authority that distinguishes “good” identities from “bad” or “unknown” ones. Part of the linking may be cryptographic, but at some point there is a non-cryptographic component that asserts something is “good”, e.g. the genesis block, a certificate authority, or a user-supplied credential.

(1) is straightforward: it is done automatically by the networking layer via the underlying transport protocol (e.g. TLS or QUIC), and requires no runtime configuration to achieve. (2) is the harder problem, especially because the set of “good” identities can change over time, and the decision of “good” vs “bad” is highly subjective and dependent on the surrounding system context.

(1) and (2) can be done separately from each other, but both must be done for security - if (1) is done but not (2), then the identity supplied by (1) may be false, since it may be controlled by an active MITM. Note also that, even though the two can be checked separately, something needs to bring the results together and react to the case where either check fails, by ensuring that no further resources are spent on the failed communication channel.

General

Proposal: fresh authentication signals

In well-layered protocol architectures, it’s typical for the networking layer to deal with (1). Since the networking layer is also the component that directly deals with resources, it is in charge of acting on the results of (2), e.g. by disconnecting a peer. However, to maintain suitable layer separation and software composability and reusability, we’d like some other component to decide the results of (2). Below is a proposal for doing this efficiently.

The networking layer receives some input data or signal from a higher layer, which instructs it how to validate all the different roles of authenticated peers. (For example, peers must present a certificate group-signed by a recent set of validators, or belong to a fixed set of identities supplied by the higher layer.)

This signal must include some period of validity, after which the networking layer will deauthenticate (disconnect and ignore) any existing authenticated peers, and reject any further attempts at authentication. In other words, the higher layer must continually refresh the validity instructions for the networking layer; otherwise, it will eventually revert to dealing only with unauthenticated peers under a very restricted resource policy.
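As a minimal sketch of what this could look like in Python (all names below are hypothetical, not an existing API), the higher layer might hand the networking layer a set of per-role validation predicates together with an absolute expiry time:

import time

class AuthInstructions:
  """Hypothetical container for the higher layer's validity instructions."""
  def __init__(self, validators, valid_until):
    self.validators = validators    # {role: predicate(credential) -> bool}
    self.valid_until = valid_until  # absolute expiry time, in seconds

  def is_fresh(self, now=None):
    return (now if now is not None else time.time()) < self.valid_until

class NetworkingLayer:
  def __init__(self):
    self.instructions = None

  def signal_auth(self, instructions):
    # called repeatedly by the higher layer to refresh the validity period
    self.instructions = instructions

  def authenticate(self, role, credential):
    ins = self.instructions
    if ins is None or not ins.is_fresh():
      return False  # expired: revert to serving unauthenticated peers only
    check = ins.validators.get(role)
    return check is not None and check(credential)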

One effect of this is that, if a node goes offline for a long enough period of time, the networking layer will expire its last authentication instructions. This will result in the disconnection of all authenticated peers; this is drastic, but better than continuing in an authenticated state that has expired. The higher layer should not let the situation reach this stage, but if it does, it should run e.g. a synchronisation protocol to retrieve the latest version of the authentication instructions, and signal these to the networking layer again.

Note that in practice, with epoch-based authentication schemes, the higher layer may want to send two staggered signals per epoch switch - first a signal to include instructions for the next epoch, then, after waiting a grace period for the external network to all switch to the next epoch, a second signal to exclude instructions for the previous epoch. A sketch of this staggering is given below.
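Continuing the hypothetical sketch above, and assuming each epoch supplies a validation predicate per role, the staggering might look like:

def switch_epoch(net, prev_epoch, next_epoch, grace_period, validity):
  # hypothetical helper; assumes both epochs define the same set of roles
  def either(p1, p2):
    return lambda cred: p1(cred) or p2(cred)
  # signal 1: during the transition, accept credentials from either epoch
  both = {role: either(prev_epoch.validators[role], next_epoch.validators[role])
          for role in next_epoch.validators}
  net.signal_auth(AuthInstructions(both, time.time() + validity))
  time.sleep(grace_period)  # grace period for the external network to switch
  # signal 2: exclude the previous epoch's instructions
  net.signal_auth(AuthInstructions(dict(next_epoch.validators), time.time() + validity))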

Proposal: bandwidth resource allocation

As mentioned in the introduction, the straightforward defence against malicious unauthenticated peers is good resource allocation. Below is a concrete proposal for such an algorithm, which should generalise to all foreseeable use-cases.

Given some roles (or more concretely, “peer connections”) we want to send data to or receive data from, how do we do this in a “fair” way?

Here’s one basic proposal. Suppose we have a list of roles R, where each role is associated with a priority and a relative use_limit, such that sum(use_limit[r] for r in R) == 1 and R is sorted by priority, “most urgent” first; this defines our resource limits and is to be configured by the higher layer. Then suppose that for every time-interval t we have a list of actual usages demanded by each role r in R, written demand[r], and we want to calculate what each role actually gets to use, to_use[r], such that sum(to_use) <= total_avail.

Then the algorithm we propose below (valid for both sending and receiving) satisfies the following properties:

  1. If sum(demand) <= total_avail, then to_use == demand. In other words, if demand is lower than availability then demand is fully satisfied, regardless of priorities or use-limits. A good algorithm should not special-case this explicitly, but simply degenerate to this behaviour when appropriate.

  2. If all(demand[r] > use_limit[r] * total_avail for r in R), then all(to_use[r] == use_limit[r] * total_avail for r in R). In other words if every role demands to use more than is available, then their actual use is simply the proportion that is defined by our resource limits. This avoids higher-priority roles starving lower-priority roles.

The algorithm is as follows, described as executable Python:

from math import floor, isclose

def calculate_to_use(R, use_limit, total_avail, demand):
  if not isclose(sum(use_limit.values()), 1.0):
    raise ValueError("invalid use_limit, sum not 1: %s" % use_limit)
  if any(not v > 0 for v in use_limit.values()):
    raise ValueError("invalid use_limit, not all +: %s" % use_limit)
  to_use = {r: 0 for r in R}
  remaining_avail = total_avail
  remaining_demand = demand.copy()
  while remaining_avail > 0:
    # normalise use_limits i.e. ignore roles with no remaining demand
    relevant_use_limits = sum(v for (r, v) in use_limit.items() if remaining_demand[r] > 0)
    # division below is safe; relevant_use_limits can only be 0 if normalised_use_limit is an empty dict
    normalised_use_limit = {r: (v / relevant_use_limits) for (r, v) in use_limit.items() if remaining_demand[r] > 0}
    used_this_round = 0
    for r in R:
      # remaining demand of r. we'll try to satisfy as much of this as possible
      v = remaining_demand[r]
      if v == 0: continue
      # u is the max that can be satisfied this round, given the constraints
      u = floor(float(remaining_avail) * normalised_use_limit[r])
      # x is what we'll actually satisfy, either u or v
      x = min(u, v)
      to_use[r] += x
      used_this_round += x
      remaining_demand[r] -= x
    remaining_avail -= used_this_round
    # due to floor() sometimes we are short by a few bytes, just ignore it and prevent infinite loop
    if used_this_round == 0: break
    # now we loop back.
    # if any r didn't use up its use_limit in this iteration (i.e. v < u) then
    # this is now available for another r to use up in the next iteration
  assert(sum(to_use.values()) <= total_avail)
  return to_use

R = [0,1,2]
U = {0:0.8, 1:0.1, 2:0.1}
A = 1000

# property 1
assert(calculate_to_use(R, U, A, {0:500, 1:238, 2:262}) == {0:500, 1:238, 2:262})
# property 2
assert(calculate_to_use(R, U, A, {0:900, 1:102, 2:102}) == {0:800, 1:100, 2:100})
# corner case with 0
assert(calculate_to_use(R, U, A, {0:0, 1:0, 2:0}) == {0:0, 1:0, 2:0})
# more complex case
assert(calculate_to_use(R, U, A, {0:900, 1:50, 2:120})[0] > 800)
assert(calculate_to_use(R, U, A, {0:900, 1:50, 2:120})[1] == 50)
assert(calculate_to_use(R, U, A, {0:900, 1:50, 2:120})[2] > 100)

In a real networking program, the above algorithm may have to be tweaked a bit. However, this should not be hard - it should be possible to know what the demand is at any given moment, both for sending and for receiving:

  • for sending: in order to deal with backpressure properly, you should always have a priority-heap of things to send, as opposed to trying to send something as soon as it becomes available.

    • (The priority-heap is populated as new things become available to send, and is popped when the recipient signals via some flow-control mechanism that they are unblocked for receiving again. This priority is unrelated to the role-priorities we introduced above for calculate_to_use.)

  • for receiving: normal networking implementations have the kernel buffer incoming data until the application is ready to process it, with any excess being dropped. It is normally possible to query how full this buffer is.

At each time-interval, it is straightforward to calculate demand from either the application’s send priority-heaps or the kernel’s receive buffers.
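For example, on the send side, a per-interval round might look like the following minimal sketch, reusing calculate_to_use from above (the queue structure and the send helper are hypothetical):

import heapq

def send_round(R, use_limit, total_avail, send_queues, send):
  # send_queues: {role: priority-heap of (priority, message_bytes)}
  # the demand for a role is simply the number of bytes queued for it
  demand = {r: sum(len(m) for (_, m) in send_queues[r]) for r in R}
  to_use = calculate_to_use(R, use_limit, total_avail, demand)
  for r in R:
    budget = to_use[r]
    # pop and send messages in priority order until the role's budget runs out
    while send_queues[r] and len(send_queues[r][0][1]) <= budget:
      _, msg = heapq.heappop(send_queues[r])
      send(r, msg)  # hypothetical transport call
      budget -= len(msg)

On the receive side, demand[r] could analogously be estimated from how full the kernel’s receive buffers are, e.g. via a socket ioctl such as FIONREAD.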

As applied to Polkadot

Unauthenticated operation

Collators without any fresh authentication instructions can only:

  • TODO…

Validators without any fresh authentication instructions can only:

  • TODO…

Authenticated roles

Collators need to authenticate the following:

  • Parachain validators

  • TODO…

Validators need to authenticate the following:

  • TODO…

etc