GSoC status report #4

July 26th, 2010

Hello again, everyone. Time for another update on Tor metrics! Some good stuff got done over the past two weeks. As those of you who got Karsten’s e-mail know, we are attempting to remove some or most of the file-based dependencies for metrics. I succeeded in moving the rest of the metrics portal to the database-driven system. Now the bridge users, torperf, and gettor data are logged in the database and graphed accordingly. All graphs are requested from one point (imageservlet – the user can request a parameterized .png file), which provides a pretty simple query interface to use around the rest of the site, and elsewhere if someone wants to. The only trouble I’ve had is that the code has turned into a rat’s nest of conditionals and error checking, because each graph accepts a different set of parameters. I guess I can improve this during the “clean up” week at the end.

Also, I have a basic “custom graph” page, which is just a simple HTML form with all of the graph parameter options available. I’ve also decided to provide a simple grammar for constructing a query URL, in case someone would like to script it or request multiple graphs at a time. A database version of ExoneraTor is also mostly implemented. Good thing I spent so much time on infrastructure work!
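A minimal sketch of how a single graph entry point could check per-graph parameters from one table instead of nested conditionals, and build a query URL from them. The graph names, parameter names, and the `/img` base path are all hypothetical; the post does not list the real ones, and the real servlet is written in Java.

```python
from urllib.parse import urlencode

# Hypothetical parameter sets: the real graphs and their options are not
# listed in the post, so these names are placeholders.
ALLOWED_PARAMS = {
    "networksize": {"start", "end"},
    "bridge-users": {"start", "end", "country"},
    "torperf": {"start", "end", "source", "size"},
}


def build_graph_url(base, graph, **params):
    """Return a query URL for one graph, rejecting unknown parameters."""
    if graph not in ALLOWED_PARAMS:
        raise ValueError(f"unknown graph: {graph}")
    unknown = set(params) - ALLOWED_PARAMS[graph]
    if unknown:
        raise ValueError(f"unsupported parameters for {graph}: {sorted(unknown)}")
    # Sort so the same request always yields the same URL (useful as a cache key).
    query = urlencode(sorted(params.items()))
    return f"{base}/{graph}.png" + (f"?{query}" if query else "")


url = build_graph_url("/img", "networksize", start="2010-06-01", end="2010-07-01")
```

Keeping the allowed parameters in one table means adding a graph is a one-line change, and a script can request several graphs simply by calling the builder in a loop.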

During the past coding session, I’ve discovered that this design lets us do a lot of the heavy lifting in the database instead of Java (which is nice, because the data fits the relational model well). However, there seems to be no way around the complexity of aggregating massive amounts of metrics data. I’ll probably have to spend more time optimizing and refining the schema, perhaps during the last coding week. For now, all of the “work” is done with the file-based design, with the database simply serving as a secondary data sink.

It is time to start thinking about new visualizations. There is a nearly unlimited number of variables that can be combined and graphed. A choropleth (country map) would be a fun (and difficult) one. We could do this with all of the bridge users data, or with the same data normalized by population to see which countries have the highest percentage of users. I also plan to look into other things like churn rates for bandwidth, platforms, and uptime, and perhaps combinations of these. Apparently some good ideas in this regard were shared at PETS, which I intend to look into.
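As a rough illustration of the population-normalized variant, the per-country value behind such a map could be computed like this. The function name, the country codes, and all figures are placeholders for two fictional countries, not real measurements.

```python
def users_per_100k(users_by_country, population_by_country):
    """Map country code -> users per 100,000 inhabitants.

    Countries without a known, positive population are skipped.
    """
    rates = {}
    for country, users in users_by_country.items():
        population = population_by_country.get(country)
        if population:
            rates[country] = users * 100_000 / population
    return rates


# Placeholder figures for fictional countries "aa" and "bb".
users = {"aa": 500, "bb": 2000}
population = {"aa": 1_000_000, "bb": 80_000_000}

rates = users_per_100k(users, population)
ranked = sorted(rates, key=rates.get, reverse=True)
```

Here "bb" has more users in absolute terms, but "aa" ranks first once population is taken into account, which is the difference the normalized map would bring out.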


GSoC Midterm

July 8th, 2010

Hello everyone. Back again with another two-week update on my project. The JSP version of the metrics portal has been pretty much refined and completed, and has been merged into ernie master to supersede the servlet-based design. This has taken a bit longer than expected because of e-mail delays and some small issues, but it has been “finished” for a while. The site is now a hybrid, with some of the older functionality still delegated to servlets (exonerator, consensus health, and the log), and the rest in JSP. It still works the same way, however.

The other development track has been part of my original proposal – database-driven dynamic graphs. I have a working prototype using Rserve, which is a back-end server for the R language that allows us to have a persistent instance of it. R then connects to the database to generate the graphs (with the RPostgreSQL driver, part of GSoC ’08!). I have most of the graphs working in this way for everything that is in the consensus and descriptors. I don’t know yet what I’m going to do for the directory and performance data, etc. However, my project is pretty much focusing on the consensus graphs.

The prototype works acceptably fast. However, ggplot2 is a bit slow, so maybe some optimizations can be put in place here to make it work faster, but I can’t think of any right now. It will take a few seconds if a user wants a custom graph.

So, what’s next? Time to finish the Java code so that all of this graphing stuff works nicely with the website (graph caching – I’ve said this a lot in the past weeks…), and to generalize and refine it some more. I have a pretty good idea in my head of how it’s going to work, and it is half implemented right now. Then, it is time to start thinking about the new statistics which are now possible with the database-driven design. This will be a matter of generating a schema (or a materialized view, more specifically), writing a small bit of R code, and putting it on the website. I will have to do more database stuff at the end of the project (see my first update). Documenting everything will also take a good chunk of time. There happen to be a lot of different parts that must work together right now (Tomcat, R, Rserve, Postgres…)!

If anyone has any feature requests or thoughts (New metrics?), feel free to let me know!


4 weeks

June 22nd, 2010

Hi. Last week, my plans were to get the dynamic graphing and caching mechanisms done, and to work on R. Well, it’s been a rough two weeks of trying to simultaneously absorb the endless expanse of j2ee and hack something useful together. Here is a short summary of what has been done:
  • Moved static, servlet-based metrics portal to JSP with a template inheritance design (this makes it a bit easier to re-use code, change things, stay organized, etc.)
  • Initial graph caching mechanisms (if a graph is over an hour old it will be regenerated on request)
  • Parameterizable network size graph, allowing both pre-defined and custom ranges.
  • R works with the database directly for network size graphs.
  • Graph controller somewhat finished.
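The caching rule from the list above (regenerate a graph on request once it is over an hour old) can be sketched as follows. This is only an illustration of the idea, not the portal's Java code; `get_graph`, `is_stale`, and the `render` callback are hypothetical names.

```python
import os
import time

MAX_AGE_SECONDS = 60 * 60  # one hour, as in the list above


def is_stale(path, now=None, max_age=MAX_AGE_SECONDS):
    """True if the cached file is missing or older than max_age seconds."""
    if not os.path.exists(path):
        return True
    now = time.time() if now is None else now
    return now - os.path.getmtime(path) > max_age


def get_graph(path, render):
    """Return the cached image bytes, calling render() only when stale."""
    if is_stale(path):
        data = render()  # the expensive step, e.g. asking R for a new plot
        with open(path, "wb") as handle:
            handle.write(data)
        return data
    with open(path, "rb") as handle:
        return handle.read()
```

Using the file's modification time as the cache timestamp avoids keeping any separate bookkeeping; the trade-off is that a user may see a graph that is up to an hour out of date.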

And what is left to do:

  • Implement any unfinished metrics on the database side (mostly done), and graph them accordingly.
  • Generalize network size graph controller to cover the rest of the information we need.
  • Generalize R code.
  • Refine – everything is pretty hackish so far. Getting there, though.

Some problems:

  • R is slow. It takes about 2.5 seconds to generate a new graph on each request.
    • How to fix? Perhaps use Rserve. This will give us a persistent R instance, along with a nice Java client. We won’t have to load an instance of R on each request. I think the majority of the time is spent loading the large R libraries, so this should help.

I feel like I may have spent too much time fiddling with j2ee and just trying to get a nice framework down that we can build on. Maybe I should have spent more time doing, and less time designing. I also wish I had traded my beloved vim for Eclipse a bit earlier to ease my j2ee pains. The really unfortunate side of j2ee (web stuff in particular) is the config and overhead complexity. Once all of this is in place it moves faster. Getting there, however, has been a bit of a journey. It’s still not as fast as throwing something together in Python or PHP, but Java is solid and really does have a ton of useful stuff. There is even relatively obscure stuff immediately available, like a fully fledged Rserve driver.


The first two weeks of GSoC

June 5th, 2010

I spent the past two weeks doing a few things. I hacked on ERNIE, and I sat in on a grad class at Drexel to hear Roger speak about Tor. It was enlightening to learn more about the social and political considerations surrounding it, as well as some of the technological aspects that I missed before.

The goals of my project have been significantly narrowed down. For the first two weeks, I focused on everything database-related. The idea is to use a database approach to generate the graphs and statistics you see on the metrics portal. A database is quite a bit more flexible than a file-based approach (the current method), but often at the price of performance. Fortunately for me, there was already a lot of database functionality in ERNIE. It is fully configurable to write the descriptor and consensus data to a database, and it was well documented.

Perhaps soon, ERNIE will work something like this:

  • Download relay descriptors and consensuses from various sources
  • Parse and re-arrange data into various data sinks (database, archives)
  • Query the database with R to generate graphs
  • Periodically repeat the above steps to update statistics

Short Term Goals

For the first two weeks, I planned to get the schema modeled and complete. We decided to use database denormalization for its performance benefits, and triggers to keep the data consistent. We also decided to use the materialized view technique to provide quick access to the data. Denormalization is a common technique in data warehouses, despite the drawbacks of data redundancy and update anomalies. I decided to use row-level triggers to keep the denormalized table in sync with the normalized ones. However, because of all of the anomalies denormalization introduces, it turned out to be a bit messy and complicated.
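To make the trigger idea concrete, here is a small self-contained sketch that uses SQLite (through Python's sqlite3 module) as a stand-in for PostgreSQL. The table and column names follow the queries in this post, but the schema is heavily simplified, the trigger names are hypothetical, and the sample rows are made up; the actual triggers live in ERNIE's db/tordir.sql and may look quite different.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE descriptor (descriptor TEXT PRIMARY KEY, platform TEXT);
CREATE TABLE statusentry (validafter TEXT, descriptor TEXT);
-- Denormalized copy: one row per status entry, with the platform attached.
CREATE TABLE descriptor_statusentry (
    validafter TEXT, descriptor TEXT, platform TEXT);

-- Row-level trigger: copy each new status entry together with its platform.
CREATE TRIGGER statusentry_insert AFTER INSERT ON statusentry
FOR EACH ROW BEGIN
    INSERT INTO descriptor_statusentry
    SELECT NEW.validafter, NEW.descriptor,
           (SELECT platform FROM descriptor
            WHERE descriptor.descriptor = NEW.descriptor);
END;

-- One of the messy cases: a descriptor that arrives after its status entry.
CREATE TRIGGER descriptor_insert AFTER INSERT ON descriptor
FOR EACH ROW BEGIN
    UPDATE descriptor_statusentry SET platform = NEW.platform
    WHERE descriptor = NEW.descriptor;
END;
""")

# Made-up sample rows. 'd2' is referenced before its descriptor is known.
conn.execute("INSERT INTO descriptor VALUES ('d1', 'Tor 0.2.1.26 on Linux')")
conn.execute("INSERT INTO statusentry VALUES ('2010-02-01 00:00:00', 'd1')")
conn.execute("INSERT INTO statusentry VALUES ('2010-02-01 01:00:00', 'd2')")
conn.execute("INSERT INTO descriptor VALUES ('d2', 'Tor 0.2.2.13-alpha on Linux')")

rows = conn.execute(
    "SELECT descriptor, platform FROM descriptor_statusentry ORDER BY descriptor"
).fetchall()
```

Even this toy version needs two triggers to cover both arrival orders, and it ignores updates and deletes entirely, which hints at why the real thing gets complicated.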

When all is said and done

So, it looks good on paper, but how does it hold up? Here is an example of running a query to find relay versions on normalized tables with joins, provided by the ERNIE docs. There are 510,000 rows of descriptors, and 3,400,000 consensus entries (Feb 2010 to May 2010 data):

kjb$ time psql -t -q tordir kjb << EOF
SELECT DATE(validafter) AS date,
  SUBSTRING(platform, 5, 5) AS version,
  COUNT(*) / relay_statuses_per_day.count AS count
FROM
  (SELECT COUNT(*) AS count, DATE(validafter) AS date
  FROM (SELECT DISTINCT validafter
    FROM statusentry) distinct_consensuses
  GROUP BY DATE(validafter)) relay_statuses_per_day
JOIN statusentry
  ON relay_statuses_per_day.date = DATE(validafter)
LEFT JOIN descriptor
  ON statusentry.descriptor = descriptor.descriptor
GROUP BY DATE(validafter), SUBSTRING(platform, 5, 5),
  relay_statuses_per_day.count, relay_statuses_per_day.date
ORDER BY DATE(validafter), SUBSTRING(platform, 5, 5);
EOF

    date    | version | count
------------+---------+-------
 2010-02-01 | 0.1.2   |    10
 2010-02-01 | 0.2.0   |   217
 2010-02-01 | 0.2.1   |   774
 2010-02-01 | 0.2.2   |    75
 2010-02-01 |         |   505
...

1m56.121s

The same query on the denormalized table…

kjb$ time psql -t -q tordir kjb <<EOF
SELECT
    DATE(validafter),
    substring(platform, 5, 5) as version,
    COUNT(*) / relay_statuses_per_day.count as count
FROM descriptor_statusentry
JOIN (SELECT COUNT(*) AS count, DATE(validafter) AS date
        FROM (SELECT DISTINCT validafter FROM statusentry) distinct_consensuses
        GROUP BY DATE(validafter)) relay_statuses_per_day
ON DATE(validafter) = relay_statuses_per_day.date
GROUP BY DATE(validafter), version, count
ORDER BY DATE(validafter);
EOF

    date    | version | count
------------+---------+-------
 2010-02-01 | 0.1.2   |    10
 2010-02-01 | 0.2.0   |   217
 2010-02-01 | 0.2.1   |   774
 2010-02-01 | 0.2.2   |    75
 2010-02-01 |         |   505
...

0m52.541s

Pretty good: the denormalized query runs about 2.2 times as fast (1m56s down to 53s), which cuts the run time by roughly 55%! I expect this to scale as the tables get larger as well. Fortunately, the result sets were exactly the same, so no worries as far as data consistency goes. I also believe it can be optimized more. We should have no problems running the metrics portal from a database!
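For reference, the arithmetic behind the comparison of the two timings above can be checked with a few lines (the variable names are just for illustration):

```python
normalized_s = 1 * 60 + 56.121    # 1m56.121s, query on the normalized tables
denormalized_s = 0 * 60 + 52.541  # 0m52.541s, query on the denormalized table

# Ratio of the two run times: how many times as fast the second query is.
speedup = normalized_s / denormalized_s

# Fraction of the original run time that was eliminated.
time_saved = 1 - denormalized_s / normalized_s

print(f"{speedup:.2f}x as fast, {time_saved:.0%} less run time")
```

The two figures describe the same measurement from different angles: a run-time ratio of a little over two corresponds to a bit more than half of the original run time being saved.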

New Statistics

As I mentioned before, using a database-driven approach allows us to have more flexible statistics. In my proposal I planned to implement node churn. I have this, but I think it can use some work. Its implementation is in db/tordir.sql on the db branch of the github tree. I came up with a few more as well. Here is the list of materialized views we can query from quickly and reliably right now. Maybe this can be extended to bridges and the rest of metrics later.

    network_size,
    network_size_30_days,
    network_size_90_days,
    relay_platforms,
    relay_platforms_30_days,
    relay_platforms_90_days,
    relay_versions,
    relay_versions_30_days,
    relay_versions_90_days,
    relay_uptime,
    relay_uptime_30_days,
    relay_uptime_90_days,
    relay_bandwidth,
    relay_bandwidth_30_days,
    relay_bandwidth_90_days,
    total_bandwidth,
    total_bandwidth_30_days,
    total_bandwidth_90_days
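One common way to get a "materialized view" is to keep an ordinary summary table and rebuild it from a query whenever new data arrives. The sketch below shows that pattern for a network_size table, again with SQLite standing in for PostgreSQL. The `avg_relays` column, the `refresh_network_size` function, and the sample rows are assumptions for illustration; the actual definitions are in db/tordir.sql.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE statusentry (validafter TEXT, descriptor TEXT);
-- Summary table playing the role of the materialized view.
CREATE TABLE network_size (date TEXT PRIMARY KEY, avg_relays REAL);
""")


def refresh_network_size(conn):
    """Rebuild the summary table from scratch inside one transaction."""
    with conn:
        conn.execute("DELETE FROM network_size")
        # Relays per day divided by consensuses per day = average network size.
        conn.execute("""
            INSERT INTO network_size
            SELECT DATE(validafter),
                   COUNT(*) * 1.0 / COUNT(DISTINCT validafter)
            FROM statusentry
            GROUP BY DATE(validafter)
        """)


# Made-up sample: two consensuses on the first day, one on the second.
conn.executemany("INSERT INTO statusentry VALUES (?, ?)", [
    ("2010-02-01 00:00:00", "d1"),
    ("2010-02-01 00:00:00", "d2"),
    ("2010-02-01 00:00:00", "d3"),
    ("2010-02-01 01:00:00", "d1"),
    ("2010-02-02 00:00:00", "d1"),
])
refresh_network_size(conn)
summary = conn.execute("SELECT * FROM network_size ORDER BY date").fetchall()
```

Graph queries then read from the small summary table instead of aggregating millions of status entries on every request; the cost is that the table is only as fresh as its last refresh.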

Future considerations

All of this database stuff has the added benefit of allowing us to make a simple interface for someone who wants to find information about a specific relay in the past. As I said in my proposal, we could now implement a search feature for the directory data, like ExoneraTor.


Nearing the end of the community bonding period

May 20th, 2010

Most of my time thus far has been spent learning and configuring software to start the summer. I’ve spent the last week learning a few of the ins and outs of ERNIE, git, and Tomcat. Git has been a beast, but since I’ll be using it a lot over the summer, hopefully I’ll come to appreciate its flexibility and power. But what is with stuff like “git checkout <branch>” and “git checkout -- <file>”? Strange.

Tomcat, although notorious for its “boilerplate hell,” wasn’t bad to set up. As of now, I have a stand-alone Tomcat setup running on port 8080 on this server. Since I will likely be working with this, check back on http://kjb.homeunix.com:8080/ernie/ for updates.

I’m pretty excited to get started. However, there are a few things that need to be done. The most important of these is the direction of my project! Some long term goals and decisions need to be made which will determine the outcome of my work.

I’ve added links to my github repository and my proposal on the side bar.


GSoC Blog 2010

May 12th, 2010

This is my new blog. The following updates will be mostly for Google Summer of Code 2010. I will be working with the Tor Project on network metrics. More to come.

The project’s repository will be hosted at github.