\chapter{Attack-Resistant Metadata}
\label{metadata}


Systems should be designed so that metadata is attack-resistant, as well.

One important form of metadata: ``this song is beautiful.''

Another example: micropayments in a music distribution system,
commonly known as ``pay Lars.'' But: which Lars? Cite Scott McCloud's
discussion of micropayment systems for comics as well.

Another form of metadata that has received substantial recent
attention is relevance of Web search results. To answer this query, an
extremely rich trove of raw data is available: the link structure of
the several billion documents currently available on the World Wide
Web. Two systems have attracted particular attention: IBM's
Clever\cite{kleinberg99clever}, and the PageRank algorithm used by
Google\cite{page98pagerank}. The latter is claimed to have some attack
resistant properties, and the experience from Google's deployment is
encouraging in this regard: the relevance of Google's search results
is widely praised, and it does seem to be rather difficult for
spammers or others wishing to unfairly dominate the search results
to do so, except at considerable expense.

One interesting recent example is the dominance of pro-Scientology
listings in the Google results for a search on the keyword
``scientology.'' This is discussed in an online
essay\cite{operatingthetan}. To Google's credit, as of 2 Mar 2002,
this search yields one anti-Scientology site, in fact at
\#4. \emph{(probably cut this---the scieno thing has obviously become
far more a political issue than a technical one)}

\section{Brief analysis of PageRank}

A description of the PageRank algorithm was published in
1998\cite{page98pagerank}, but as the work has continued as a business
rather than an academic pursuit, there has been no published followup.
This section attempts to address this hole in the literature, by
providing a brief discussion of the PageRank's attack-resistance.

\subsection{Recap of PageRank}

For the purposes of analyzing the PageRank algorithm, it is most
useful to present it in its random walk formulation. Each walk begins
by choosing a node $u$, using an initial distribution $E(u)$. At each
step, with probability $||E||_1$, the walk ends. Otherwise, the next
node in the walk is chosen uniformly from the successors. Thus, the
probability that a walk is of length $k$ is $||E||_1 (1 -
||E||_1)^k$. The rank assigned to each node is the probability that
such a walk will end at that node.

Here, $||E||_1$ denotes the $L_1$ norm of $E$, informally ``Manhattan
distance'' and formally:

\[
||E||_1 = \sum_u |E(u)|
\]

Note that when $E$ is nonnegative, $||E||_1 = E \cdot {\bf 1}$, where
${\bf 1}$ is the vector of all ones.

The PageRank paper also presents an eigenvector formulation of the
algorithm. The adjacency graph $A$ is defined as follows: $A_i,j$ is
$1/N_i$ if there is an edge from $i$ to $j$, otherwise 0, where $N_i$
is the outdegree of node $i$. Let $R = c(A + E \times {\bf 1})R$,
where $||R||_1 = 1$, and $c$ is maximized. Thus, $R$ is an eigenvector
of $(A + E \times {\bf 1})$. The eigenvector formulation is equivalent
to the random walk formulation: $R(u)$ is equal to the probability
that a random walk ends in node $u$.

\subsection{Attack-resistance of PageRank}

What exactly do we mean to say that an algorithm such as PageRank is
attack resistant? In fact, the criteria are basically identical to the
Advogato group trust metric presented in Chapter~\ref{groupmetric}.
The input to the PageRank algorithm is a graph. In the Google
application of PageRank, nodes correspond to web pages, and edges
correspond to links.  The ``seed'' of the Advogato trust metric
corresponds to the initial vector $E$ of PageRank. Finally, the output
of PageRank is a real-valued ranking, while Advogato produces a
boolean accept/reject value for each node. Note that the Advogato
trust metric can be extended to multiple levels of trust by doing
multiple runs of the algorithm with different capacities.

There are a number of different ways to quantify the success of an
attacker. Here, largely for purposes of simplifying the proof outline,
we choose the sum of the rank accorded to all nodes under the
attacker's control. This measure is very similar to the one chosen
for the analysis of the Advogato trust metric: the number of ``bad''
nodes chosen.

We state the attack resistance property of PageRank more formally.
For any labelling of the nodes in the graph $G$ into ``good'' and
``bad'' nodes, the sum of the rank of the ``bad'' nodes is bounded as
follows: ...

The analysis of the attack-resistance of PageRank is dependent on the
choice of the initial source of rank, $E(u)$. As in \ref{TODO}, let
the nodes in the ``trust graph'' (ie web pages) be divided into three
classes: good, bad, and confused. To recap, a ``confused'' node is
one that is itself good, but may contain links to bad nodes. Let us
then assume that $E(u)$ is zero for all bad nodes $u$.

Now, it is possible to see that the resulting PageRank meets the
``bottleneck'' criterion of \ref{TODO}. A given walk will lead either
to a good page or a bad page. In the worst case, if the walk ever
touches a bad node, then the resulting page may be bad. Otherwise,
it is good.

\subsection{Significance of initial vector and walk length}

Here, we see that the choice of initial vector, as in the choice
of ``seed'' in the Advogato trust metric, has a critical effect on
the quality of the results.

If the fraction of good nodes in $E$ is $p$, then the total rank
accorded to good nodes is multiplied by $p$.

One of the choices of $E$ presented in the PageRank paper is a uniform
distribution over the set of all nodes in the graph. This value for
$E$ is not attack-resistant; an attacker can simply create a huge
number of web pages, and have them link to a ``bad'' page which will
then receive a high rank.

Thus, to achieve attack resistance, $E$ must be chosen with some
tendency toward good nodes. Obviously, if it is possible to do this
with perfect accuracy, the trust metric is not needed. In general, a
good strategy is to hand-pick a small subset of sites that are known
to be good, and let the walks cover the rest of the graph. In the
context of websites, one good way to choose start nodes is to use a
collaboratively edited link directory such as the ODP (dmoz.org).

Given this assumption, the choice of walk length becomes clearer.  The
shorter the walks, the more attack resistant. However, if walks are
too short, then good nodes are inadequately reachable from the start
nodes in $E$. The PageRank paper gives a value of 0.15 that the walk
ends at each step. Thus, walk lengths are a geometric distribution
with a mean of $(1 -||E||_1) / ||E||_1$, thus 5.67 for $||E||_1 =
0.15$. This probably represents a good real-world balance between
complete coverage and low vulnerability.

The PageRank paper presents another intriguing choice for $E$: the
set of ``root'' web pages. This set can be determined automatically,
reducing the dependence on manual labor. Since setting up a root web
site requires purchase of a domain name from a DNS cartel, typically
on the order of \$10 per year, mounting an attack requires money,
generally in an amount proportional to the sum of rank desired. It
is unclear whether Google uses this criterion for their own $E$. If
so, this analysis lends credence to the widespread belief that
cross-linked web sites with many different domain names are a good
way to increase Google rank.

\section{What makes a trust metric attack resistant?}

Now that we know of two different trust metrics that we consider
attack resistant, it is worth investigating what the similarities
and differences are.

Network flow and principal eigenvectors are two different computations.
They are not used interchangeably. However, there is a common thread
which allows both algorithms to serve as the basis of attack resistant
trust metrics. In this section, we propose the \emph{bottleneck
property} as a common feature of both trust metrics. We conjecture
that this property is satisfied by all attack resistant trust metrics,
although at the moment there is no proof.

The bottleneck property, informally stated, is that the total trust
quantity accorded to an $s \rightarrow t$ edge is not significantly
affected by changes to the successors of $t$. Informally, when $s$ is
a ``good'' node and $t$ is a ``bad'' node, this means that
manipulation on the part of the ``bad'' nodes does not affect the
trust value.

Scalar trust metrics, no matter how carefully tuned, all fail to
satisfy the bottleneck property, and are not attack resistant. It
is fairly easy to see why they do not: since the trust computation
is performed independently for each final target node, simply adding
more target nodes increases the total amount of trust accorded to
target nodes. To be attack resistant, a trust metric must dilute
the trust accorded to the successors of $t$ as more successors are
added, either by decreasing the value of each (as in the case of
PageRank), or decreasing the relative probability that any one of
these successors is accepted (as in Advogato).

The class of attack resistant trust metrics may have more instances
than Advogato and PageRank. Network flow and eigenvectors are both
compelling in their simplicity, resulting in relatively easy analysis
and implementation. Even so, it may be that some other, yet to be
discovered, trust metric is better suited to real applications.

\section{Generalized metadata}
\label{genmeta}

The results of a trust metric or Web page ranking represent a fairly
special class of metadata: essentially boolean or real ``confidence
value'' for inclusion in a group. This section describes how an
eigenvalue-based trust metric might be extended to more general
forms of metadata assertions.

The primary new concept is that of an ``assertion,'' which is simply a
statement of some sort. Examples of meaningful assertions include ``I
think the song hashing to \texttt{01..EF} rates 9 out of 10'', ``Tips
to be paid to the artist of song \texttt{01..EF} should be paid to
Swiss bank account 12345'', ``The song hashing to \texttt{01..EF} is
actually \emph{Man of Constant Sorrow}, recorded by the Soggy Bottom
Boys''.

The ranking of such pieces of metadata is given in terms of random
walks. Begin the walk at a start node (generally, this will correspond
to the user of the ranking system). At each step, if that node
contains a matching metadata record, the walk stops. If not, then the
walk with some fixed probability $p$ terminates, or proceeds to a
successor chosen at random.

The concept of ``matching'' metadata record requires a bit of further
elaboration. For example, it makes little sense to rate a song as both
4 out of 10 and 9 out of 10. Thus, the template for such a query is
``I believe song \texttt{01..EF} ranks \texttt{*} out of 10''. The
random walk will end at any node containing such a matching record.

\section{Distributed network model}
\label{net-meta}

This section presents a distributed network model of the eigenvector
approach. The initial presentation is based on PageRank, but it
extends straighforwardly to more general metadata.

Each node $i$ maintains a vector $R_i(j)$, assigning a real value to
each other node $j$, with $||R_i||_1 = 1$. Periodically (but
asynchronously), each node $i$ performs an \emph{update} operation, as
follows. For each predecessor node $s$ in the trust graph, node $i$
requests the current value of $R_s$ from node $s$. All such vectors
must satisfy $||R_s|| = 1$; if not, the vector is clearly bad and
should be discarded. A new $R_i$ is recomputed as follows:

\[
R_i(j) = {(1 - p)\over{|\mbox{pred}(i)|}}
{\sum_{s\in \mbox{\scriptsize pred}(i)} R_s(j)} +
(p\ \mbox{if}\ i=j, 0\ \mbox{otherwise})
\]

It should be clear that such a network will converge to the same value
as the random walk and eigenvector formulations as presented
above. Specifically, $R_i(j)$ is the probability that a random walk
beginning at $i$ will end at $j$, with $p$ representing the
probability that the walk terminates at each step.

Thus, this network computes the ``personalized PageRank'' for each
node in the network. Distributing the computation may ease the
computation and communication bandwidth bottlenecks inherent in a
centralized approach.

Further, the distributed model eliminates the need for global
knowledge of the trust graph. Each node need only query its immediate
predecessors.

To adapt this network model to generalized metadata as described in
Section~\ref{genmeta}, $R_i$ becomes a vector over assertions, rather
than over other nodes. For all metadata statements $j$ local to node
$i$, $R_i(j) = 1$, and $R_i(k) = 0$ for all $k \not = j$ and for which
the templates of $j$ and $k$ match. For example, if $j$ is ``Song X
ranks 7 out of 10'', then statements ``Song X ranks 4 out of 10''
receive a 0 value. For all metadata statements not covered by local
assertions, the propagation formula above is used.

In actual implementation, scaling becomes a problem as the number of
elements in the vector $R$ grows. For a small network, or even a large
network covering a relatively small number of metadata statements,
scaling may not be a serious problem. Some techniques worth exploring
include:

\begin{itemize}

\item Pruning the vector for near-zero values.

\item Data compression of the vector.

\item Adaptive transmission, so that slow-changing values of $R_i(j)$
are sent less frequently.

\end{itemize}


Analysis: why is such a network attack resistant? The ``bottleneck
property'' generalizes the capacity constrained flow network of
the group trust metric presented in Chapter~\ref{groupmetric}.

Discussion of dynamic behavior: can optimize for ``diamond in the
rough'', by selecting for high recommendations with low confidence
value. Then listen to song and inject a new recommendation into the
network. If song is excellent, then recommendations will propagate
quickly through the network. If song sucks, then low recommendations
will remove song from the diamond in the rough category. The goal is
to get good coverage of reviews while minimizing the number of people
forced to listen to mediocre songs.