====================================================================

Authors: Ximin Luo

Created: 05.11.2019

====================================================================

# Discovery

As with many real-world secure p2p networks, nodes should deal with each other as cryptographic identities, and it should not matter too much what their physical (network) address(es) are, since these can easily be faked by the surrounding environment. Nevertheless, for the communications to get through to the other side[^1], we need to associate the correct addresses with the correct identities, and so we have an abstract “address book” service that provides this association. In the p2p space, this is commonly referred to as “node discovery” or simply “discovery”.

Typically in real-world p2p systems this service is implemented by a decentralised DHT, and the most commonly-used solution today is the Kademlia DHT. So this document has a lot of focus on that, but we also discuss non-Kademlia solutions and combined hybrid solutions.

[^1]: As always, the communication contents are protected by cryptography, so the worst attack then is simply to drop the communication, which we must assume doesn’t happen often enough to disrupt the whole network.

## Overview

Polkadot inherits its discovery design from older p2p networks that don’t care too much about security issues, deploying designs with known weaknesses and trivial methods of attacking them. Those networks have survived thus far, largely by being small enough that there is not much economic or political incentive for a large attacker (i.e. a well-funded corporation or government) to launch an attack, even though it would be technically straightforward to do so. However, we must take security a bit more seriously when trying to build a global decentralised financial network.

Polkadot, like these other networks, uses Kademlia as a basic distributed lookup service to associate cryptographic identities with their physical (IP) addresses. Peer identities are not authenticated - queries and responses are sent, received and handled without any consideration of whether one might be dealing with an attacker, and data from others is trusted mostly blindly. We’ll go into these weaknesses below and propose ways of solving them.

Additionally, Polkadot has more structure in its networks than a “common use” homogeneous public p2p network. There are different subnetworks where different levels of trust and service are expected - for example, the relay chain is expected to be the most secure and reliable, and we expect some parachains to contain hostile nodes. We can make use of this information in our discovery service as well, designing a certain level of structure into it.

For example, an initial simple proposal would be to run a separate DHT instance specifically for the relay chain, with a stricter access policy, and then run larger and less strict DHTs for different sets of parachains.

Some other systems have also used a certain amount of structure for performance - for example the Coral CDN has an interesting algorithm for its members to self-organise into many different DHTs, merging and splitting each DHT as necessary to try to take advantage of network latencies. Exploring this in more detail would be worthwhile for the future evolution of this system.

We can also imagine storing the address book on some other decentralised storage system that isn’t a DHT. In fact a blockchain itself is such a system - however this cannot be the only address book in our system, since then this would be a cyclic dependency where one needs to reach the blockchain in order to reach the blockchain. A brief comparison:

• DHTs - temporary storage, faster to query, more easily censored

• Blockchains - permanent storage, slower to query, less easily censored

A hybrid approach gives users more choice, based on what they consider more important. For example, some operators may prefer not to have their historical IP addresses published permanently on a blockchain. A hybrid approach should also make the system more resilient - since we only need one address book to succeed, using several address books, each with different security properties, means that we can try a different one if one of them is attacked.

Kademlia is one of the more popular DHTs (distributed hash tables) to implement in real-world p2p systems. Much of this is due to the fact that its mathematics is relatively simple to implement, involving essentially only the XOR operation, and its various bookkeeping tasks are also reasonably straightforward, mostly involving comparing timestamps and setting timeouts.
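As a small illustration of that simplicity (the function names here are ours, for illustration only), the entire metric, and the k-bucket index it induces, can be written in a few lines:

```python
def xor_distance(a: int, b: int) -> int:
    # Kademlia's notion of "distance" between two node IDs is just bitwise XOR
    return a ^ b

def bucket_index(self_id: int, other_id: int) -> int:
    # the k-bucket that other_id falls into, from self_id's point of view:
    # the position of the most significant bit in which the two IDs differ
    return xor_distance(self_id, other_id).bit_length() - 1
```

Everything else in the protocol - nearest-neighbour lookups, bucket maintenance - is bookkeeping layered on top of this metric.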

S-Kademlia is a commonly-cited extension to Kademlia that offers a couple of basic security improvements on top of Kademlia:

1. A proposal to increase the cost of generating new IDs. This is general and not specific to Kademlia. In fact the one they propose is not good at all, and there are much better ones available. There is ongoing work in this area in the literature; OTOH the effectiveness of the general approach has also been questioned. Nevertheless it can be useful as a “defense in depth” measure, and to raise the cost for an attacker.

2. A bug fix to the distance algorithm for nodes very close to yourself.

3. A security fix for the lookup algorithm where it’s trivial for a malicious node to “override” the responses of non-malicious nodes.

This last one (3) is often described as “disjoint paths” but the current author believes that this phrasing is a bit misleading, and suggests the term “fat multipaths” instead, see the section on Multipath routing below.

Both DHTs have weaknesses that are typical of general DHTs, namely being susceptible to the Sybil attack (global identity flooding) and the eclipse attack (targeted identity flooding). These types of attacks can often be defended against by making use of context-specific information that is outside the model of more general analyses. For example, Polkadot does include the concept of more trusted peers (e.g. validators), and we can make use of this information in the DHT layer as well.

We make some concrete proposals for improving security below. The wording is quite Kademlia-specific, simply because other DHTs are not in wide use today and it would be confusing to use overly-general language. However the underlying ideas can be re-used in other systems, as follows:

• Multipath routing -> this applies to any iterative routing algorithm that makes parallel queries, not just Kademlia, and in fact not just DHTs even.

• Routing table authentication -> applies to any system that needs to store routing tables for both authenticated and unauthenticated peers.

### Multipath routing

In original Kademlia, at each hop n, the results of all hop-n queries are accumulated and the closest d (to the eventual target) are selected as the nodes to query for hop (n+1). With this merging, a malicious hop-n peer can trivially control all of hop (n+1), by replying with fake nodes that seem closer to the eventual target than any of the other replies.

S-Kademlia fixes this by mandating that hop-(n+1) peers are chosen from the result sets of d different hop-n peers. That is, at least one result from every hop-n query peer must be a hop-(n+1) query peer. This ensures that no strict subset of peers from hop n is able to control the whole of hop (n+1), increasing the chance of success even if some peers are malicious.

The term “disjoint path” comes from the fact that out of the peers for hop n and (n+1) as chosen above, it is possible to assign them to d distinct pairs (u, v) where v is in the result-set for u, so that it can look like we are independently building d disjoint paths. However this is a bit of an artificial construction, since in general each hop-(n+1) peer could have been obtained via multiple hop-n peers. It is more accurate to think of the whole process as building a “fat multipath” (i.e. a multipath with an enforced minimum width) rather than as building several “disjoint” paths. Indeed the “disjoint paths” viewpoint can lead one to omit certain corner cases involving backtracking, that are naturally covered by the “fat multipath” viewpoint.

In more precise terms, here is an incremental version of the SKad recursive lookup algorithm, to be called whenever a reply is received at each step of the recursive lookup:

```python
from collections import defaultdict
from copy import deepcopy
from itertools import combinations
from pprint import pprint

def pprintd(d):
    pprint({k: {k_: v for (k_, v) in kv.items() if v != 0} for (k, kv) in d.items()})
```

```python
def dinic_bfs(Cap, Flow, src):
    queue = []
    queue.append(src)
    level = defaultdict(lambda: 0)
    level[src] = 1
    while queue:
        k = queue.pop(0)
        for i in Cap[k].keys():
            if Flow[k][i] < Cap[k][i] and level[i] == 0:
                level[i] = level[k] + 1
                queue.append(i)
    return level
```

```python
def dinic_dfs(Cap, Flow, dst, level, k, flow):
    if k == dst:
        return flow
    for i in Cap.keys(): # important to visit every node, not just successors
        if Flow[k][i] < Cap[k][i] and level[i] == level[k] + 1:
            curr_flow = min(flow, Cap[k][i] - Flow[k][i])
            temp_flow = dinic_dfs(Cap, Flow, dst, level, i, curr_flow)
            if temp_flow > 0:
                Flow[k][i] += temp_flow
                Flow[i][k] -= temp_flow
                return temp_flow
    return 0
```

```python
def max_flow_dinic(Cap, src, dst, max=float("inf")):
    # https://www.geeksforgeeks.org/dinics-algorithm-maximum-flow/
    Flow = defaultdict(lambda: defaultdict(lambda: 0)) # flow graph, successor-map
    flow = 0
    while True:
        level = dinic_bfs(Cap, Flow, src)
        if not level[dst]: break
        flow += dinic_dfs(Cap, Flow, dst, level, src, max)
        #pprintd(Flow)
    # residual graph is not explicitly expressed in this implementation of the
    # algorithm, but it can be calculated as (C - F) or residual_graph(C, F)
    #
    # note that Flow contains reverse edges with negative flows, one for every
    # forward edge in the actual flow. if you want to calculate the cost of the
    # flow against a rate graph, be sure to filter out the negative edges first.
    return (flow, Flow)
```

```python
def residual_graph(Cap, Flow):
    Res = deepcopy(Cap)
    for k, kf in Flow.items():
        for i, f in kf.items():
            Res[k][i] -= f
    return Res

def residual_rates(Rates):
    # rates in a residual graph: each edge also induces a reverse edge with the
    # negated rate. (NB: the symmetric assignment and the list() snapshots are
    # fixes - indexing a defaultdict while iterating it directly mutates it,
    # which raises RuntimeError in Python 3, and the reverse rate of the
    # circulation edge is needed for the cycle-canceling below to terminate
    # correctly.)
    ResRates = defaultdict(lambda: defaultdict(lambda: 0))
    for k, kr in list(Rates.items()):
        for i in list(kr.keys()):
            ResRates[k][i] = Rates[k][i] - Rates[i][k]
            ResRates[i][k] = Rates[i][k] - Rates[k][i]
    return ResRates
```

```python
def graph_costs(G, Rates):
    Cost = defaultdict(lambda: defaultdict(lambda: 0))
    for k, kv in G.items():
        for i, v in kv.items():
            # don't create edges of cost 0 that don't exist in the input graph, which
            # breaks the negative cycle detection later
            if v != 0:
                Cost[k][i] = v * Rates[k][i]
    return Cost

def graph_cost(G, Rates):
    cost = 0
    for k, kc in G.items():
        for i, c in kc.items():
            cost += c * Rates[k][i]
    return cost
```

```python
def get_negcycle(G, max=float("inf")):
    # https://cp-algorithms.com/graph/finding-negative-cycle-in-graph.html
    # based on Bellman-Ford
    #pprint(G)
    D = defaultdict(lambda: 0)
    pred = defaultdict(lambda: None)

    # relaxation
    changed = None # (added) guard against an empty graph
    for _ in G.keys():
        changed = None
        for k, kd in G.items():
            for i, d in kd.items():
                if D[i] > D[k] + d:
                    D[i] = D[k] + d
                    pred[i] = k
                    changed = i

    if changed is None:
        return []
    for _ in G.keys():
        changed = pred[changed]

    # get the cycle. most basic implementations of Bellman-Ford don't have this
    cycle = []
    v = changed
    while True:
        cycle.append(v)
        if v == changed and len(cycle) > 1:
            break
        v = pred[v]
    #print(cycle)
    return list(reversed(cycle))
```

```python
def max_flow_min_cost(Cap, Rates, src, dst, max=float("inf")):
    # https://courses.csail.mit.edu/6.854/06/scribe/s12-minCostFlowAlg.pdf
    (max_flow, Flow) = max_flow_dinic(Cap, src, dst)

    # to find min-cost, first convert into a circulation problem
    Cap[dst][src] = max_flow
    Rates[dst][src] = -(sum(r for kr in Rates.values() for r in kr.values()) + 1)
    Flow[dst][src] = max_flow
    Flow[src][dst] = -max_flow
    flow_cost = graph_cost(Flow, Rates)

    Res = residual_graph(Cap, Flow)
    ResRates = residual_rates(Rates)
    # keep adding negative-cost cycles to the flow, until there is none left
    while True:
        # construct residual cost graph
        ResCost = graph_costs(Res, ResRates)
        negcycle = get_negcycle(ResCost, max=max)
        if not negcycle: break

        # add the cycle to F, and update R
        edges = list(zip(negcycle[:-1], negcycle[1:]))
        flow = min(Res[u][v] for (u, v) in edges)
        for (u, v) in edges:
            Flow[u][v] += flow
            Flow[v][u] -= flow
            Res[u][v] -= flow
            Res[v][u] += flow

        # check that new cost of F is strictly lower than previous cost
        cost = graph_cost(Flow, Rates)
        assert (cost < flow_cost)
        flow_cost = cost

    # clean up F a bit before returning it
    del Flow[src][dst]
    del Flow[dst][src]
    min_cost = graph_cost(Flow, Rates)
    # reverse our changes to Cap & Rates
    del Cap[dst][src]
    del Rates[dst][src]
    #pprintd(Cap)
    #pprintd(Flow)
    return (max_flow, min_cost, Flow)
```

```python
def tests_max_flow_dinic():
    C = defaultdict(lambda: defaultdict(lambda: 0))
    C["S"][1] = 3
    C["S"][2] = 3
    C[1][2] = 2
    C[1][3] = 3
    C[2][4] = 2
    C[3][4] = 4
    C[3]["T"] = 2
    C[4]["T"] = 2
    assert (max_flow_dinic(C, "S", "T")[0] == 4)

    D = defaultdict(lambda: defaultdict(lambda: 0))
    D["S"][1] = 1
    D["S"][2] = 1
    D["S"][3] = 1
    D[1][4] = 1
    D[2][5] = 1
    D[3][6] = 1
    D[4][7] = 1
    D[4][8] = 1
    D[4][9] = 1
    D[5][7] = 1
    D[5][8] = 1
    D[5][9] = 1
    D[6][7] = 1
    D[6][8] = 1
    D[6][1] = 1
    D[7]["T"] = 1
    D[8]["T"] = 1
    D[9]["T"] = 1
    assert (max_flow_dinic(D, "S", "T")[0] == 3)

    E = defaultdict(lambda: defaultdict(lambda: 0))
    E["S"][1] = 11
    E["S"][2] = 12
    E[1][2] = 10
    E[1][3] = 12
    E[2][1] = 4
    E[2][4] = 14
    E[3][2] = 9
    E[3]["T"] = 20
    E[4][3] = 7
    E[4]["T"] = 4
    assert (max_flow_dinic(E, "S", "T")[0] == 23)

def tests_max_flow_min_cost():
    C = defaultdict(lambda: defaultdict(lambda: 0))
    C["S"][1] = 1
    C["S"][4] = 1
    C[1][2] = 1
    C[1][3] = 1
    C[2]["T"] = 1
    C[3]["T"] = 1
    C[4]["T"] = 1
    CR = defaultdict(lambda: defaultdict(lambda: 0))
    CR[2]["T"] = 2
    CR[3]["T"] = 1
    CR[4]["T"] = 3
    assert (max_flow_min_cost(C, CR, "S", "T")[0:2] == (2, 4))
```

```python
K = 20

def distance(a, b):
    return a ^ b

FLOW_SRC = "self"
FLOW_SINK = "target"

class Lookup:
    # NB: the class statement and a few lines below (marked "reconstructed")
    # were missing from the source listing; the name "Lookup" and those lines
    # have been filled in so that the methods and tests run as described

    def __init__(self, init_queries, target_key, use_min_cost=True, prioritise_unique_results=False):
        # query flow graph, capacities
        self.query_succ = { FLOW_SRC: set(init_queries) }

        # nodes queried that have given a reply
        self.query_result = set()
        # nodes queried that have failed, i.e. timed out
        self.query_failed = set()
        # nodes queried that are awaiting a reply
        # this should be size (d - query_gap)
        self.query_expect = set()
        # number of queries missing. this can happen when the initial nodes all
        # return too-few results to sustain a max-flow of d throughout the query
        self.query_gap = 0

        self.target_key = target_key
        # use max-flow-min-cost, otherwise use max-flow and brute-force over the costs
        self.use_min_cost = use_min_cost
        # TODO: this could be implemented by tweaking the costs, leave it out for now
        self.prioritise_unique_results = prioritise_unique_results
        if prioritise_unique_results:
            raise NotImplementedError()

        for nodeId in init_queries:
            self.query_succ[nodeId] = {}
            self.launch_query("init %3s" % target_key, nodeId)

    def distance_to(self, n):
        return distance(n, self.target_key)

    def launch_query(self, ctx, nodeId):
        self.query_expect.add(nodeId) # (reconstructed) nodeId now awaits a reply
        # fake send, for our testing purposes
        print(ctx, ": sending query to node %s" % nodeId)

    def warn_no_query(self, ctx):
        print(ctx, ": too-few results to select a next peer") # TODO more error detail

    def max_flow_wrong(self, matching, verbose=False):
        # construct the capacity graph
        # note that the order is important, we want the max_flow algorithm to
        # traverse the closest ones first.
        # FIXME: sadly this fails, including some asserts below
        # we need max-saturating-flow not max-flow
        MAX = 100000000000 # TODO: should calculate this better

        C = defaultdict(lambda: defaultdict(lambda: 0))
        for k, succs in self.query_succ.items():
            for succ in sorted(succs, key=self.distance_to):
                C[k][succ] = MAX
            if k != FLOW_SRC and matching(k):
                C[k][FLOW_SINK] = MAX - self.distance_to(k)

        # calculate max flow
        (f, max_flow) = max_flow_dinic(C, FLOW_SRC, FLOW_SINK)
        if verbose:
            print("cap & flow:")
            pprintd(C)
            pprintd(max_flow)

        # extract results
        results = set()
        for k in C.keys():
            if k not in (FLOW_SRC, FLOW_SINK) and matching(k):
                # FIXME: sadly this doesn't always work, partial flows can be maximal
                if max_flow[k][FLOW_SINK] == MAX - self.distance_to(k):
                    results.add(k) # (reconstructed)

        return sorted(results, key=self.distance_to)

    def max_flow_min_distance(self, matching, verbose=False):
        C = defaultdict(lambda: defaultdict(lambda: 0))
        for k, succs in self.query_succ.items():
            if k == FLOW_SRC:
                for succ in sorted(succs, key=self.distance_to):
                    C[k][succ] = 1
            elif succs:
                # restrict all nodes to only being on 1 flow, by converting them into
                # two nodes (k) and (k, "out") with capacity 1 between them
                C[k][(k, "out")] = 1
                for succ in sorted(succs, key=self.distance_to):
                    C[(k, "out")][succ] = 1

        candidates = list(k for k in self.query_succ.keys() if k != FLOW_SRC and matching(k))
        wanted = len(self.query_succ[FLOW_SRC]) - self.query_gap
        if verbose:
            print("candidates:", candidates, "wanted:", wanted)

        if not self.use_min_cost:
            # brute force version using only max-flow, also works
            for subset in combinations(sorted(candidates, key=self.distance_to), wanted):
                C_ = deepcopy(C)
                for k in subset:
                    C_[k][FLOW_SINK] = 1
                (max_flow, max_flow_graph) = max_flow_dinic(C_, FLOW_SRC, FLOW_SINK)
                if verbose:
                    print(subset)
                    pprintd(C_)
                    print("max flow:", max_flow)
                    pprintd(max_flow_graph)
                if max_flow == wanted:
                    return list(subset)
            return []
        else:
            # more efficient version using max-flow-min-cost just once
            Rates = defaultdict(lambda: defaultdict(lambda: 0))
            for k in candidates:
                C[k][FLOW_SINK] = 1
                Rates[k][FLOW_SINK] = self.distance_to(k)
            (max_flow, min_cost, F) = max_flow_min_cost(C, Rates, FLOW_SRC, FLOW_SINK)
            if max_flow < wanted: return []
            results = [k for k in candidates if F[k][FLOW_SINK] == 1]
            assert max_flow == len(results)
            # if (self.query_gap == 0) then we always get exactly what we want.
            # else, we might get more results than we want, but we cut those back
            # since those are from previously-failed queries.
            assert wanted <= max_flow <= wanted + self.query_gap
            return sorted(results)[:wanted]

    def select_next_query(self, verbose=False):
        next_to_query = self.max_flow_min_distance(
            lambda k: k not in self.query_result and
            k not in self.query_failed, verbose=verbose)
        return [q for q in next_to_query if q not in self.query_expect]

    def recv_result(self, peer, reply, verbose=False):
        # NB: the def line and several branch bodies of this method were
        # missing from the source listing; they are reconstructed here to
        # match the surrounding fragments and the tests below
        if verbose:
            print("recv %3s" % peer, ": got reply %s" % (reply,))

        if peer not in self.query_expect:
            return None # unsolicited reply, ignore it

        if peer in self.query_result:
            return None # duplicate reply, ignore it

        if peer in self.query_failed:
            return None # reply from a peer we already gave up on, ignore it

        if reply is None:
            # query failed, record it as failed
            self.query_failed.add(peer)
            self.query_succ[peer] = set()

        else:
            # check that all replies are closer to the target, with exemption for
            # peers that are already very close to the target
            old_d = self.distance_to(peer)
            if any(self.distance_to(r) >= old_d > K for r in reply):
                # probably malicious - drop the reply and record the query as
                # failed (the exact handling was elided in the source; failing
                # the query is one conservative option)
                self.query_failed.add(peer)
                self.query_succ[peer] = set()
            else:
                # everything OK, now store the reply in the query graph
                self.query_result.add(peer)
                self.query_succ[peer] = set(reply)
                for r in reply:
                    self.query_succ.setdefault(r, {}) # ensure r is in query_succ.keys()

        assert (len(self.query_expect) + self.query_gap == len(self.query_succ[FLOW_SRC]))
        self.query_expect.remove(peer)
        candidates = self.select_next_query(verbose=verbose)

        if not candidates:
            # not enough data returned by neighbours to select a next_peer.
            # record this to keep track of it, and emit a warning.
            self.warn_no_query("recv %3s" % peer)
            self.query_gap += 1
            return None
        else:
            next_peer = candidates[0]
            # make the actual query
            self.launch_query("recv %3s" % peer, next_peer)
            assert (len(self.query_expect) + self.query_gap == len(self.query_succ[FLOW_SRC]))
            return next_peer

    def current_best(self, include_waiting=False, verbose=False):
        return self.max_flow_min_distance(lambda k:
            (k in self.query_result or k in self.query_expect)
            if include_waiting else
            (k in self.query_result), verbose=verbose)
```

```python
def tests_lookup():
    # NB: the "q = Lookup(...)" instantiations were missing from the source
    # listing; the initial query sets below are reconstructed so that the
    # assertions hold (target key 0, so that distance(n, 0) == n)
    print("----")
    q = Lookup({4, 5, 6}, 0)
    assert(q.recv_result(4, {1,2,3}) == 1)
    assert(q.recv_result(5, {1,2,3}) == 2)
    assert(q.recv_result(6, {4,1,2}) == 3)
    # ^ corner case, correctly selects 3 even though 3 not part of 6's reply
    assert(q.current_best() == [4,5,6])
    assert(q.current_best(True) == [1,2,3]) # this assert fails if using max_flow_wrong

    print("----")
    q = Lookup({5, 6, 7, 8}, 0)
    assert(q.recv_result(5, {3,4}) == 3)
    assert(q.recv_result(6, {3,4}) == 4)
    assert(q.recv_result(7, {3,4}) == None)
    assert(q.recv_result(8, {3,4}) == None)
    assert(q.recv_result(3, {1}) == 1)
    assert(q.recv_result(4, {1}) == None)
    # ^ selects None since we actually don't have enough results
    assert(q.query_gap == 3)

    print("----")
    q = Lookup({5, 6, 7}, 0)
    assert(q.recv_result(5, {4}) == 4)
    assert(q.recv_result(4, {1,2,3}) == 1)
    assert(q.recv_result(6, {4}) == None) # e.g. not 2
    assert(q.recv_result(7, {4}) == None)
    # this test fails if "restrict nodes to only being on 1 flow" is not implemented
    assert(q.query_gap == 2)
    assert(q.current_best() == [4])
    assert(q.current_best(True) == [1])

    print("----")
    q = Lookup({10, 11, 12}, 0)
    assert(q.recv_result(10, {5,6}) == 5)
    assert(q.recv_result(11, {6,7}) == 6)
    assert(q.recv_result(12, {8}) == 8)
    assert(q.recv_result(5, {1,2}) == 1)
    assert(q.recv_result(1, None) == 2)
    assert(q.recv_result(2, None) == 7)
    assert(q.current_best() == [5, 11, 12])
    assert(q.current_best(True) == [5, 6, 8])
    # ^ corner case involving backtracking and failure - correctly selects 7,
    # even though it's unrelated to the "10 -> 5 -> 1" path, because 6 was
    # re-routed through 10's results, freeing 11 to supply 7

if __name__ == "__main__":
    tests_max_flow_dinic()
    tests_max_flow_min_cost()
    tests_lookup()
```


### Routing table drop policy

Nodes are dropped immediately if we PING and they time out. We may want to be a bit more generous, especially since Kademlia has an explicitly stated design intention of preferring known-long-term nodes to unknown nodes.

We should collect real-world logs about how often nodes drop vs whether this is short-term or long-term.

TODO

### Routing table authentication

See Authentication for some background on the general topic of authentication. Kademlia natively does not have a concept of authentication, but it is fairly straightforward to fit one in.

Specifically, Kademlia natively already treats recently-seen peers as more trustworthy than others - they have higher priority when being added into its k-buckets. So, we can generalise this idea to cover peers that have been marked as more trustworthy by a higher-layer application.

For Polkadot, this means that in the relay chain validators should pre-allocate some space in their k-buckets for other validators such that this space cannot be overridden by other non-validator nodes. This functionality should be configurable via a generic API offered by the kademlia implementation that does not specifically reference Polkadot or specific authenticated roles, so it can continue to be useful even if we (for example) add extra types of authenticated roles to the Polkadot relay chain. Details are given in the next section.

This ensures that a certain amount of resources is always allocated to those “more trustworthy” peers. It does not protect against an attack by the “more trustworthy” peers themselves, e.g. if a group of validators themselves decides to attack the kademlia network. However, it is intended that other mechanisms will be able to detect such attacks, and furthermore the slashing mechanisms in place elsewhere will be able to discourage these types of attacks, whereas the slashing mechanism by its nature cannot work against unauthenticated peers.

#### Proposal: authenticated stratified k-buckets

1. The kademlia implementation should offer an API for the higher-layer application to define different “authenticated roles” for peers that can potentially be inserted into k-buckets, along with the maximum fraction of the whole k-bucket that peers belonging to that role are allowed to occupy. Roles must be sortable. So for example a typical configuration might look like:

• { role 2: 0.5, role 1: 0.3 }

The sum of the fractions should be <= 1, with the remainder being allocated to the default unauthenticated role (“role 0”). So in the above example, role 0 would be left with a fraction of 0.2.

This should be called sporadically, e.g. at the start of the process run.

2. The kademlia implementation should offer an API for the higher-layer application to define which cryptographic identities belong to which roles. As described in Proposal: fresh authentication signals this should come with an expiry time and be refreshed continually. For example a typical API call might look like:

• { role 2: member_of { some set }, role 1: some predicate, expires: 1 hour }

When the k-bucket is not full, these configurations have no effect and the behaviour is the same as ordinary Kademlia. That is, a new node is simply added to the k-bucket, and an existing node has their timestamp refreshed.

When the k-bucket is full however, this configuration serves to affect which other node is potentially evicted in favour of the new node. In ordinary Kademlia it is the node with the oldest timestamp that is nominated for eviction - a PING is sent and if they don’t reply then they are evicted in favour of the new node. In our new proposal, the node to maybe-evict is instead selected as follows:

1. From lowest to highest role (starting with the unauthenticated role 0): if the k-bucket contains more than k*fraction entries for that role, then select the oldest node in that role for eviction.

2. If the above loop did not select any node, then select the oldest node in the same role as the new node for eviction.

This should serve to guarantee the following security properties:

1. If resources are available (i.e. the k-bucket is not full) then everyone is added, without prejudice.

2. If resources are constrained then higher-priority roles are added with greater priority than lower-priority roles, but without starving lower-priority roles of the guarantees they are given according to the fractional resource limits configuration.
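As a sketch of the selection logic (the helper names and the bucket representation - a list of (node_id, role, last_seen) entries - are ours, for illustration):

```python
def role_fraction(fractions, role):
    # role 0 (unauthenticated) is implicitly allocated whatever the
    # configured roles leave over
    if role == 0:
        return 1.0 - sum(fractions.values())
    return fractions.get(role, 0.0)

def select_for_eviction(bucket, fractions, k, new_node_role):
    # bucket: list of (node_id, role, last_seen); smaller last_seen = older
    by_role = {}
    for entry in bucket:
        by_role.setdefault(entry[1], []).append(entry)

    # 1. from lowest to highest role, if a role occupies more than its
    #    configured fraction of the bucket, nominate its oldest node
    for role in sorted(set(by_role) | {0}):
        members = by_role.get(role, [])
        if len(members) > k * role_fraction(fractions, role):
            return min(members, key=lambda e: e[2])

    # 2. otherwise, nominate the oldest node in the new node's own role
    members = by_role.get(new_node_role, [])
    return min(members, key=lambda e: e[2]) if members else None
```

Whether the nominated node is actually evicted then follows ordinary Kademlia: it is PINGed, and evicted in favour of the new node only if it fails to reply.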

## Implementation topics

This section discusses some implementation topics not commonly mentioned in research papers, that any “real world” implementation has to deal with. However, they are critical to actually creating a real working system, which is what we want to do at the end of the day.

We will approach these issues in a methodical and principled way so that implementors are not forced to come up with their own hacky solutions. We first give a general description of the issues, discuss some general strategies for solving them, and finally apply these strategies to come up with concrete proposals for Kademlia and Polkadot.

Most networking papers (including Kademlia and S-Kademlia) talk about node addresses as a trivial fact that each node inherently knows. In both the real internet and a more precise model of an abstract addressing scheme, this is not realistic at all. Existing Kademlia implementations are therefore forced to devise their own hacky non-standard methods of detecting node addresses: most of these are quite insecure, some are unable to deal with multiple addresses, and some get confused by local or NAT addresses. We present something that should be much more robust in all possible real-world scenarios.

Starting from first principles, an address is simply a piece of information used to receive things from others. You do not need to know your own address in order to send things to other people. If your house is teleported around at will by your landlord, you have no way of knowing what your address is unless you re-explore your surroundings by asking other people. Likewise, the analogue of this happens all the time on the internet - sometimes because the node owner did in fact decide to move the machine, but at other times because the ISP reconfigured their network outside of the node owner’s control. This might even happen within an organisation that wants to e.g. keep the networking department well-separated from the applications department.

On the internet, “from” addresses are added automatically as part of the delivery of a packet. However these may not always be correct, i.e. these addresses may not actually have the ability to receive replies. Nevertheless one can flip this around slightly to create a more robust and “obviously correct” protocol:

1. If node A sends to address B some unpredictable string, and

2. Node A later receives (from any address) the same unpredictable string, then

3. Node A knows that address B can be observed by some live node, which can reply with the aforementioned “from” address.

Adding signatures or some other cryptographic authentication makes the security more precise:

1. If node A sends to address AddrB some unpredictable string, and

2. Node A later receives (from any address) the same unpredictable string signed (or otherwise authenticated) by node B, then

3. Node A knows that node B is able to observe (some) information sent to AddrB, at least in the recent past and hopefully recent future.
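The authenticated exchange above can be sketched as follows; the class and callback names are illustrative, and signature creation/verification is abstracted behind a callback:

```python
import os
import time

class AddressVerifier:
    """Tracks "unpredictable strings" (nonces) sent to candidate addresses."""

    def __init__(self):
        self.pending = {}   # nonce -> (node_id, address, sent_at)
        self.verified = {}  # (node_id, address) -> local timestamp of the send

    def send_probe(self, node_id, address, send, now=time.time):
        # step 1: send an unpredictable string to the candidate address
        nonce = os.urandom(16)
        self.pending[nonce] = (node_id, address, now())
        send(address, nonce)
        return nonce

    def receive_echo(self, nonce, claimed_node_id, verify_sig):
        # steps 2-3: the same string comes back (from any address),
        # authenticated by node B; we now have concrete evidence that
        # claimed_node_id can observe information sent to that address
        entry = self.pending.pop(nonce, None)
        if entry is None or not verify_sig(claimed_node_id, nonce):
            return None
        node_id, address, sent_at = entry
        if node_id != claimed_node_id:
            return None
        # record against our own clock, using the timestamp of the original
        # send; we never compare our timestamps with anyone else's
        self.verified[(node_id, address)] = sent_at
        return address
```

A real implementation would also expire `pending` entries after a timeout, and feed successful confirmations into the address-book bookkeeping described below.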

Note that AddrB may or may not be the “actual address” for node B - whatever “actual address” means - it is possible there is a proxy in the way that is translating addresses, but this doesn’t matter too much - we have gained some concrete evidence that node B is reachable via AddrB, and that is all that matters for most protocols, and certainly for Kademlia. And in some sense, everywhere on the internet is proxied by routers anyway.

Out of this we can amend the Kademlia protocol as follows. The original paper describes each nodeId as being associated with a single address and a single timestamp. Our amendment associates each nodeId with a list of (address, seen) pairs. The “seen” type describes how we know about that address, and is itself a pair (verification-method, timestamp) where verification-method is either “Untrusted” or “ExplicitReply”, where “ExplicitReply” sorts higher than “Untrusted”. The overall “seen” value for the given nodeId, is simply the highest “seen” value over all of its (address, seen) pairs.

When sending an outgoing request, we try all of a node’s “ExplicitReply” addresses first, from most-recent to least-recent, in sequence. If these all time out (or don’t exist) then we try the “Untrusted” addresses, in parallel as reasonable, e.g. 3 at once. If these all time out then we fail the overall sending request, which is what higher-level functionality should wait on.
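A sketch of this amended bookkeeping (the types and names here are illustrative):

```python
from dataclasses import dataclass, field
from enum import IntEnum

class Verification(IntEnum):
    # "ExplicitReply" sorts higher than "Untrusted"
    UNTRUSTED = 0
    EXPLICIT_REPLY = 1

@dataclass
class NodeEntry:
    # each nodeId is associated with a list of (address, seen) pairs, where
    # seen = (verification-method, timestamp)
    addresses: dict = field(default_factory=dict)  # address -> (Verification, ts)

    def seen(self):
        # the overall "seen" value for the nodeId: the highest "seen" value
        # over all of its (address, seen) pairs
        return max(self.addresses.values(), default=(Verification.UNTRUSTED, 0))

    def send_order(self):
        # ExplicitReply addresses first, most-recent to least-recent (to be
        # tried in sequence); then Untrusted addresses (to be tried in
        # parallel, e.g. 3 at once)
        explicit = sorted((a for a, s in self.addresses.items()
                           if s[0] == Verification.EXPLICIT_REPLY),
                          key=lambda a: self.addresses[a][1], reverse=True)
        untrusted = [a for a, s in self.addresses.items()
                     if s[0] == Verification.UNTRUSTED]
        return explicit, untrusted
```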

This gracefully deals with the issue of NATs and firewalls since we simply use whatever address works, without being too fussed about which one is “the correct one”. This is very similar to how ICE (RFC 5245) does things.

The “unpredictable string” we mentioned above is simply the same as Kademlia’s “request IDs”, which are already part of the protocol. We store the ones we generate, and match these against incoming replies. We also store the timestamp of the original send, and use that as the timestamp associated with the ExplicitReply, so that all the timestamps used are local values according to our own clock. We cannot compare our own timestamps with others’ timestamps.

We do have to embed the sending address in the body of the request packet, and the replying node should read this and re-embed this in the body of the reply packet. This is necessary to distinguish which address(es) actually succeeded, when sending a packet to the same node but different addresses in parallel. Also, packet bodies should be authenticated, otherwise NATs and malicious MITMs could rewrite this information. (The replying node could also fake its own address in the reply, but this would simply be attacking itself.)

When we receive a reply, we update the relevant “seen” entries as prescribed by original Kademlia, but extended in the obvious way to deal with multiple addresses. I.e. either update the relevant address’s “seen” value to (ExplicitReply, timestamp) or insert it anew up to some bounded limit, if it doesn’t already exist in the list.

When we receive a FIND_NODE request, we only send the addresses and not our own “verification-method” markers, since these should not be trusted by others. In the future we may come up with a heuristic to make use of this information, but that is a hard problem so we omit it for now to avoid implementors attempting their own hacky solutions.

### Using stream-based sockets

Kademlia as commonly described runs over UDP, and includes within it a subprotocol for checking whether nodes are alive. Above, we extended this so that it supports multiple addresses across complex network topologies.

Some real-world Kademlia implementations on the other hand use TCP instead of UDP. The main advantage of this is that TCP includes a handshake already so that the alive-checking subprotocol (PING, PONG) is not required, and therefore the Kademlia implementation can be made simpler. Depending on how TCP is used, there are different downsides to this approach however:

1. If a new TCP connection is opened on every request (FIND_NODE, FIND_VALUE, STORE), then this increases latency by several round trips, i.e. at least 4x.

2. If TCP connections are kept open on a per-peer basis, this consumes OS resources (TCP ports) which might be better-used for other applications or other parts of the same application. Furthermore, care must be taken to ensure that relevant keepalives and heartbeats are set on the TCP connection to ensure it stays alive, otherwise one throws away the advantage of keeping these open in the first place. This is preferable to (1) in most real-world situations.

In (2) the disadvantage can be mitigated by using QUIC, which does not need to consume many OS ports: it consumes just one per usage-context, and multiplexes multiple connections through it, indexed by 64-bit connection IDs handled internally by QUIC. These cannot run out, at least before the machine itself runs out of memory.

In this case, the tradeoff remains that one has to keep these connections alive using QUIC pings which are done on a time-interval basis, rather than Kademlia native pings which are done in response to various network observations. The resulting simplification however is probably worth it in overall “real-world” development terms, assuming that one already has a good QUIC implementation (which at the time of writing, end 2019, is actually a bit of a stretch):

#### Proposal: multiple addresses, with QUIC

• We can omit PING/PONG entirely

• When we receive information about new nodes, we don’t have to check old nodes whether they are alive or not since this is already taken care of by the QUIC keepalive logic. Instead, we simply attempt to open up a QUIC connection to each new node that we know about, and add them to the k-bucket if this connection attempt succeeds.

• When doing this in conjunction with the “multiple addresses” proposal above, we do not need to embed the send address in the body of the request, since any discrepancies between to/from addresses are handled by QUIC or QUIC-TLS.

• QUIC is able to keep connections alive even if one endpoint changes IP address. Whenever we receive something from a node’s connection, we should check the current from-address of the connection and update the relevant (verification-method, timestamp) tag for this address specifically. This is effectively the same as in the “multiple address” proposal above, just re-worded for clarity to be more specific to QUIC.

As per the “drop policy” section above, if an existing QUIC connection times out, the analogous thing to do here would be to remove them immediately from the k-bucket. However it may be more prudent to attempt to re-establish the connection at least for a short while, before switching to a different node, to better effect the “long-term nodes are preferred” design intention of Kademlia.