Attack Resistant Trust Metric Metadata HOWTO

Attack Resistant Trust Metric Metadata HOWTO

Raph Levien

3 Jul 2002: Advogato is now running an implementation of the generic metadata engine described below. The code is available at the mod_virgule CVS, hosted on casper.
Please note that PageRank is covered by US Patent 6,285,999. The implementation linked below is for research only. Seriously.
1 Apr 2002

Don't worry about people stealing your ideas. If your ideas are any good, you'll have to ram them down people's throats.
-- Howard Aiken quoted by Ken Iverson quoted by Jim Horning, 1979

I've done a fair amount of thinking about attack-resistant trust metrics as part of my PhD research. I did a testbed implementation, Advogato, and now feel that the potential applications for this technology are much wider. I'm very busy with real work, though, and much as I would love to see my newer ideas implemented, I don't have time right now to do it myself.
Hence, I am writing this document to entice others with community websites into implementing some of the ideas, especially the ones having to do with generalized metadata.
Background

You don't have to be a mathematician or trust researcher to do a good job implementing a trust metric. However, understanding the basic ideas really helps.
The best writeup of Advogato's existing trust metric is fc.ps, my draft submission to the FC '00 conference. This describes the trust metric implemented in Advogato in fairly good detail.
For my newer thinking, my thesis-in-progress is the best bet. I am trying to make the writing more accessible than your average PhD thesis, so please don't be too intimidated. The metadata work is in Chapter 6.
I can also heartily recommend the original PageRank paper. It's fairly easy to read, and PageRank itself kicks ass.
For insight into actual implementations of trust metrics, I've split out testbed implementations of both Advogato's trust metric and PageRank into a small tarball. This testbed should also be fairly fun to play with.
You have to understand basic graph theory: nodes and edges, predecessor and successor sets, indegree and outdegree. If your interest is implementing a trust metric, rather than trying to analyze it, you don't have to understand network flows, random walks, or eigenvectors. Of course, it never hurts to understand more.
User-visible aspects

The core of any trust metric is peer certifications. In a community website context, these are assertions by users that other users are c00l d00ds. In the graph model, each user is a node, and a peer certification by A that B is c00l is an edge from A to B.
In Advogato, these peer certifications are the sole input to the trust metric. The output is a simple yes/no for each user. (Actually, Advogato implements three levels, but these are just three runs of the underlying algorithm). For the purposes of Advogato, that works pretty well, but here I'm going to concentrate on the problem of generalized metadata.
In this context, the user input consists of "assertions" in addition to the peer certifications. It's entirely up to you to determine the schema for these assertions. I'll give some concrete examples to get you started.
Songs

One form of assertion is: "song X is Y out of 10". A key concept is whether two assertions are mutually inconsistent: "X is 4 out of 10" and "X is 7 out of 10" match a template and are inconsistent, where "X is 4 out of 10" and "Y is 7 out of 10" don't match and are perfectly consistent.
The big challenge here is identifying songs. Should "X" be an ASCII string containing the name of the song? A cryptographic hash of the .ogg file of the song? A URL where the song can be downloaded? All of these are reasonable, and have their own tradeoffs.
Note that the links between these identifiers can also be metadata assertions. Thus, all of these are reasonable:

Song hashing to 01234567ABCDEF is "Smoke on the Water"
"Song hashing to 01234567ABCDEF can be downloaded from http://mp3server/...
"Smoke on the Water" can be downloaded from http://mp3server/...
"Smoke on the Water" has hash 01234567ABCDEF
http://mp3server/... is "Smoke on the Water"
http://mp3server/... hashes to 01234567ABCDEF
Note that the 2nd and 6th can be more-or-less automatically determined, so probably don't need manual entry.
Other applications

There are plenty of other applications. Writing them up remains a TODO.
Computing the trust metric

There is a very large space of trust metrics you could compute. Here, I'll narrow it down to just one. This one is appropriate for implementing on a community Web server. It's also eigenvector-based, like PageRank. The rest of the space is worth exploring, but probably would take more foo.
The goal of the algorithm is to compute a "confidence value" for each user and each metadata assertion. That tells you right there where the scaling problems are: the product of number of users and number of assertions can't grow too much. If the number of users is small, or the number of assertions is small, you're probably ok.
The algorithm starts with a crude approximation of these confidence values, then refines them iteratively. Actually, it doesn't matter what approximation you start with, it will converge on the same result in the end. The number of iterations needed depends on the exact data, but high dozens or low hundreds sounds right for typical usage.
Now for a little notation. Let R[i,j] be the confidence that user i has in assertion j. For efficient implementation, it probably helps for i and j to be numbers, but that's not absolutely required.
In the "step", we compute a new R'[i,j] based on the old R[i,j], the trust graph, and the assertions local to the users i.
For each i:
Find the successors of i (ie, the people i has issued a peer certificate for). Determine i's outdegree (ie the number of nodes in this set). You probably want to reject i->i edges at this point.
For each assertion j:
If i has a local assertion matching j, set R'[i,j] to 1 if it's an exact match, or 0 if it just matches the template for the assertion (ie if the assertion is inconsistent with j). Otherwise, compute the average of R[s,j] for all successors s of i. Multiply by a "damping factor", typically 0.85. This is the new value of R'[i, j]. If i has no successors, use zero.
Optionally, divide each R'[i,j] by the sum over j of R'[i,j] (if zero, use zero). This normalization step reduces the influence of people who issue lots and lots of local assertions. Whether or not it's a good idea isn't obviously clear, and is probably dependent on the specifics of the application.
Now, copy R'[i,j] to R[i,j] for all i, j.
After on the order of 100 iterations of this step, R[i,j] contains the confidence value for assertion j by user i. You can then display this information in user i's custom home page. Obviously, the details on how to present this information are up to you.
Some practical issues

You will care about the performance of this computation. Google has a massive cluster to compute their PageRank algorithm over a graph of roughly 3 billion nodes and 20 billion edges. Advogato uses a much smaller graph (a few thousand nodes) but has its own issues.
Generally, here are the parts you need to worry about:

1. Storing the peer certifications and local assertions.
2. Retrieving these into RAM.
3. Computing the trust metric.
4. Storing the results.
5. Retrieving the results for presentation.

The implementation of Advogato has some lessons. The big performance bottlenecks were (2) and (5). I expected the trust metric computation itself to dominate, but in fact it's trivial - about 180 ms to compute over the Advogato graph.
Advogato stores an XML file (called "profile.xml") for each user. This file contains the peer certifications and other info. Thus, step (2) requires reading thousands of XML files. On a cold cache, this takes about a minute. The trust graph, utterly uncompressed, is about 2 megabytes. I estimate that very simpleminded compression could bring it down to 200k or so. bzip2 gets it down to 130k. Obviously, if the trust graph were maintained in a single file, the step of reading it into memory would be much faster. On the other hand, it would complicate the step of updating the certs, and I'd also be concerned about fragility.
Advogato also used to store the results in an XML file. This file is read into RAM before rendering any page - it is used, among other things, to assign colors to user id's. In XML format, which is fairly verbose, it took a good fraction of a second and a megabyte to do this. Since Advogato serves on the order of one page per second on average, this was quite a bottleneck. Now the tmetric results are written out in a simple ASCII format, and no longer represent a performance bottleneck.
The Advogato trust metric itself is written in C, which helps with the speed. Writing the inner loops in an interpreted language like Python could be a problem - I've noticed it's about two orders of magnitude slower than optimized C for basic array manipulation.
In the case of Python, the NumPy extension may offer a happy medium between ease of programming and performance. An alternate formulation of the trust metric algorithm may be more efficient in this case, because the inner loops are straightforward vector add and scale.
Another optimization
If your assertions are of the form "X ranks Y out of 10", then you have 10 assertions for each X. If you're only going to report the mean value of these assertions, you can reduce the storage to 2. For each X, let R[i,X0] be the total confidence value for i's ranking of X. Let R[i, X1] be the mean ranking scaled by R[i,X0]. If i has a local ranking of X, then R[i,X0] is 1, and R[i,X1] is that ranking.
To report the rankings, R[i,X0] is the confidence value, and R[i,X1]/R[i,X0] is the mean ranking.
For the rocket scientists, this is basically a first-moment transform. If you care about mean and standard deviation, I believe a second-moment transform will do the trick. If this confuses you, don't worry; it's not really needed.
Copyright and patent

All my ideas and algorithms on trust metrics are in the public domain, and will remain so. I tend to release my trust metric code under GPL, but if you're seriously working on a system with a different license and would find the code useful, I'm open to releasing it under a different license.
There may be others with patents on trust metrics. In particular, Google is probably getting patents on PageRank. The trust metric presented here is fairly different from PageRank, so there's a good chance it can be implemented freely. In any case, if you're implementing trust metrics for research purposes, patents don't apply.
What next?
The next step, should you choose to accept it, is up to you. If you really want to implement a trust metric, and have trouble understanding this HOWTO, let me know, and I'll try to flesh it out.

Free software