I spent the past two weeks doing a few things. I hacked on ERNIE, and I sat in on a grad class at Drexel to hear Roger speak about Tor. It was enlightening to learn more about the social and political considerations surrounding it, as well as some of the technological aspects that I missed before.
The goals of my project have been significantly narrowed down. For the first two weeks, I focused on everything database related. The idea is to use a database approach to generate the graphs and statistics you see on the metrics portal. Databases are quite a bit more flexible than using a file-based approach (the current method), but often at the price of performance. Fortunately for me, there was already a lot of database functionality in ERNIE. It is fully configurable to write the descriptor and consensus data to a database, and it was well documented.
Perhaps soon, ERNIE will work something like this:
- Download relay descriptors and consensuses from various sources
- Parse and re-arrange data into various data sinks (database, archives)
- Query database with R – generate graphs
- Periodically repeat the above steps to update statistics
Short Term Goals
For the first two weeks, I planned to get the schema modeled and complete. We decided to use database denormalization for its performance benefits, and triggers to keep the data consistent. We also decided to use the materialized view technique to provide quick access to the data. Denormalization is a common technique in data warehouses, despite the drawbacks of data redundancy and anomalies. I decided to use row-level triggers to keep the unnormalized table. However, because of all of the anomalies denormalization introduces, it turned out to be a bit messy and complicated.
When all is said and done
So, it looks good on paper, but how does it hold up? Here is an example of running a query to find relay versions on normalized tables with joins, provided by the ERNIE docs. There are 510,000 rows of descriptors, and 3,400,000 consensus entries (Feb 2010-May2010 data):
kjb$ time psql -t -q tordir kjb << EOF
SELECT DATE(validafter) AS date,
SUBSTRING(platform, 5, 5) AS version,
COUNT(*) / relay_statuses_per_day.count AS count
FROM
(SELECT COUNT(*) AS count, DATE(validafter) AS date
FROM (SELECT DISTINCT validafter
FROM statusentry) distinct_consensuses
GROUP BY DATE(validafter)) relay_statuses_per_day
JOIN statusentry
ON relay_statuses_per_day.date = DATE(validafter)
LEFT JOIN descriptor
ON statusentry.descriptor = descriptor.descriptor
GROUP BY DATE(validafter), SUBSTRING(platform, 5, 5),
relay_statuses_per_day.count, relay_statuses_per_day.date
ORDER BY DATE(validafter), SUBSTRING(platform, 5, 5);
EOF
date | version | count
------------+---------+-------
2010-02-01 | 0.1.2 | 10
2010-02-01 | 0.2.0 | 217
2010-02-01 | 0.2.1 | 774
2010-02-01 | 0.2.2 | 75
2010-02-01 | | 505
...
1m56.121s
The same query on the denormalized table…
kjb$ time psql -t -q tordir kjb <<EOF
SELECT
DATE(validafter),
substring(platform, 5, 5) as version,
COUNT(*) / relay_statuses_per_day.count as count
FROM descriptor_statusentry
JOIN (SELECT COUNT(*) AS count, DATE(validafter) AS date
FROM (SELECT DISTINCT validafter FROM statusentry) distinct_consensuse
GROUP BY DATE(validafter)) relay_statuses_per_day
ON DATE(validafter) = relay_statuses_per_day.date
GROUP BY DATE(validafter), version, count
ORDER BY DATE(validafter);
EOF
date | version | count
------------+---------+-------
2010-02-01 | 0.1.2 | 10
2010-02-01 | 0.2.0 | 217
2010-02-01 | 0.2.1 | 774
2010-02-01 | 0.2.2 | 75
2010-02-01 | | 505
...
0m52.541s
Pretty good, a 220% speed increase! I expect this to scale as the tables get larger as well. Fortunately, the result sets were exactly the same, so no worries as far as data consistency. I also believe it can be optimized more. We should have no problems running the metrics portal from a database!
New Statistics
As I mentioned before, using a database driven approach allows us to have more flexible statistics. In my proposal I planned to implement node churn. I have this, but I think it can use some work. Its implementation in the github tree db/tordir.sql, branch db. I came up with a few more as well. Here is the list of materialized views we can query from quickly and reliably right now. Maybe this can be extended to bridges and the rest of metrics later.
network_size,
network_size_30_days,
network_size_90_days,
relay_platforms,
relay_platforms_30_days,
relay_platforms_90_days,
relay_versions,
relay_versions_30_days,
relay_versions_90_days,
relay_uptime,
relay_uptime_30_days,
relay_uptime_90_days,
relay_bandwidth,
relay_bandwidth_30_days,
relay_bandwidth_90_days,
total_bandwidth,
total_bandwidth_30_days,
total_bandwidth_90_days
Future considerations
All of this database stuff has the added benefit of allowing us to make a simple interface for someone that wants to find information about a specific relay in the past. Like I said in my proposal, we could now implement a search feature for the directory data, like ExoneraTor.