Geekstream runtime performance issue!

Whoa! Our first runtime performance issue with OrigoDB.

To populate Geekstream we run a background process which periodically polls a number of RSS feeds. The feeds are added manually, either individually or from OPML files. Initially, I was eager to pull in a lot of content quickly, so I gobbled up some OPML files found with Google, the bigger the better. This resulted in some quality issues: plenty of indexed feeds with content that has nothing to do with geek topics.

Today I started to do some manual cleaning by identifying feeds to remove using random searches like “fish” and “texas”, then removing them with a command line tool (included in the source).

```
$ geekstream.admin -r 2345
```

To my surprise, each invocation was taking several seconds to return. Are the commands slow to execute, or is something else going on? I could tinker with the logging to see command execution timings, but knowledge of the data model, together with the fact that queries issued at the same time were blocked, confirms that the commands themselves are slow. So let's have a look at the code to see what's going on and what we can do about it. The application sends a RemoveFeedCommand to the server. The engine executes the command, which calls this method:
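In essence it does something like this (a simplified sketch; Feed, FeedItem and IndexEntry are the Geekstream model types, and member names other than _mostRecentItems, _searchIndex and RemoveFeedFromSearchIndex are illustrative):

```csharp
using System.Collections.Generic;
using System.Linq;

// Simplified sketch of the relevant parts of the model, not the actual source.
public class GeekStreamModel
{
    private readonly List<Feed> _feeds = new List<Feed>();
    private readonly List<FeedItem> _mostRecentItems = new List<FeedItem>();
    private readonly Dictionary<string, HashSet<IndexEntry>> _searchIndex
        = new Dictionary<string, HashSet<IndexEntry>>();

    public void RemoveFeed(int feedId)
    {
        Feed feedToRemove = _feeds.Single(feed => feed.Id == feedId);

        // _mostRecentItems holds at most 10 items, so this part is cheap
        _mostRecentItems.RemoveAll(item => feedToRemove.Items.Contains(item));

        // this is where the time goes
        RemoveFeedFromSearchIndex(feedToRemove);

        _feeds.Remove(feedToRemove);
    }
    // ...
}
```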

The size of _mostRecentItems is 10, so it is surely insignificant. Following the call to RemoveFeedFromSearchIndex(feedToRemove) leads us to:
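Again just a sketch rather than the actual source, but it has the shape the rest of this post discusses: an outer loop over every keyword's entry set, and an inner pass over every entry in each set:

```csharp
private void RemoveFeedFromSearchIndex(Feed feedToRemove)
{
    // hash set of the feed's own items so the inner test is a single hash lookup
    var itemsToRemove = new HashSet<FeedItem>(feedToRemove.Items);

    // outer loop: every keyword in the index (roughly 300 000 of them right now)
    foreach (HashSet<IndexEntry> entrySet in _searchIndex.Values)
    {
        // inner pass: RemoveWhere visits every IndexEntry in the set, so across
        // all keywords the lambda runs once per index entry (over 11 million calls)
        int removed = entrySet.RemoveWhere(entry => itemsToRemove.Contains(entry.Item));

        // despite the name, this counter tracks index entries, not keywords
        _statistics.TotalKeywords -= removed;
    }
}
```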

Yeah, this looks interesting. The outer loop walks the entire dictionary of keywords, which at the time of writing holds slightly more than 300 000 entries. _statistics.TotalKeywords is a misleading name; it's actually the total number of IndexEntry objects linking keywords to articles. entrySet.RemoveWhere() performs a single pass over the items in the HashSet, so the total number of calls to the inner lambda is equal to the total number of index entries, more than 11 million. The inner lambda is a hash lookup, so it should be O(1), which is somewhat confirmed by the MSDN documentation for the HashSet<T>.Contains(T) method. Here's the index data structure for reference:

```csharp
Dictionary<string, HashSet<IndexEntry>> _searchIndex
```
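For context, a single-keyword search is essentially a dictionary lookup followed by enumerating the matching entries, something along these lines (a sketch, not the actual query code):

```csharp
// Sketch: queries are cheap, which is why we care about not blocking them for seconds
public IEnumerable<IndexEntry> Search(string keyword)
{
    HashSet<IndexEntry> hits;
    if (_searchIndex.TryGetValue(keyword, out hits)) return hits;
    return Enumerable.Empty<IndexEntry>();
}
```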

Before we start optimizing, let's do some math to figure out how this scales. First some basic assumptions and measures and how they scale:

  1. N, the total number of indexed articles, grows steadily as the background process keeps polling feeds.
  2. K, the number of distinct keywords, slightly more than 300 000 right now, grows slowly because most keywords in a new article have already been seen.
  3. L, the total number of index entries, more than 11 million right now, grows roughly linearly with N because each new article contributes a batch of entries.

Looking back at our RemoveFeed algorithm, we have a loop inside a loop, which suggests O(N^2), but it's actually linear: the total number of calls to the basic operation (the hash lookup) is L, which grows in proportion to N. We could actually write a projection over the command journal, calculate these numbers over time and plot them, but let's save that for a separate blog article.
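As a rough sanity check (the per-operation cost here is an assumption, not a measurement): 11 million hash lookups at somewhere in the range of 100 to 300 nanoseconds each works out to roughly one to three seconds, which is in the same ballpark as the several seconds each command was observed to take.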

So what do we do about it? There are several things we can improve.

  1. Add a reverse index, so that FeedItems have references back to each associated HashSet<IndexEntry> object. Then we could just iterate the FeedItems instead of the entire set of keywords. And given that most sources don't grow indefinitely, we're back to an O(1) algorithm at the expense of some extra memory, a so-called space-time trade-off (see the sketch after this list). Perhaps I'll elaborate on the details in a future blog article.
  2. Run the outer loop in parallel, utilizing multiple cores/processors. Sure, we could divide the execution time by the number of cores (almost), but we'll be back at square one when the number of articles has increased by the same factor. Also, we'll be hogging CPU, which could cause other problems. So this is just buying time.
  3. Redesign the algorithm to use a read phase and a write phase. The read phase iterates the index and builds a list of hashsets to be modified during the write phase. The read phase happens during Command<M>.Prepare(M db), which immediately precedes Command<M>.Execute() but allows concurrent queries.
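Here's roughly what alternative 1 could look like. This is a sketch, not the actual Geekstream code, and it assumes we also keep a reference to the IndexEntry itself next to each set so that every removal is a plain O(1) HashSet.Remove call; all member names are illustrative:

```csharp
// Sketch of alternative 1: every FeedItem remembers which keyword sets reference it,
// together with the IndexEntry in each set, so removing a feed only touches its own entries.
private readonly Dictionary<FeedItem, List<Tuple<HashSet<IndexEntry>, IndexEntry>>> _reverseIndex
    = new Dictionary<FeedItem, List<Tuple<HashSet<IndexEntry>, IndexEntry>>>();

private void AddToSearchIndex(string keyword, FeedItem item, IndexEntry entry)
{
    HashSet<IndexEntry> entrySet;
    if (!_searchIndex.TryGetValue(keyword, out entrySet))
        _searchIndex[keyword] = entrySet = new HashSet<IndexEntry>();
    entrySet.Add(entry);

    // the extra memory: one back-reference per index entry
    List<Tuple<HashSet<IndexEntry>, IndexEntry>> backRefs;
    if (!_reverseIndex.TryGetValue(item, out backRefs))
        _reverseIndex[item] = backRefs = new List<Tuple<HashSet<IndexEntry>, IndexEntry>>();
    backRefs.Add(Tuple.Create(entrySet, entry));
}

private void RemoveFeedFromSearchIndex(Feed feedToRemove)
{
    foreach (FeedItem item in feedToRemove.Items)
    {
        List<Tuple<HashSet<IndexEntry>, IndexEntry>> backRefs;
        if (!_reverseIndex.TryGetValue(item, out backRefs)) continue;

        // cost is proportional to this feed's own entries, not to the whole index
        foreach (var backRef in backRefs)
            backRef.Item1.Remove(backRef.Item2);

        _statistics.TotalKeywords -= backRefs.Count;
        _reverseIndex.Remove(item);
    }
}
```

The write path gets a bit more bookkeeping, but removal now scales with the size of the feed being removed rather than with the size of the whole index.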

So which path do we choose? Or can we do a combination? It all depends on what we want to optimize. We probably want to optimize for query throughput, which means fast command execution. In this sense alternatives 1 and 3 have the same effect: the latter just calculates on the fly what the former stores in memory. We can combine either of them with multi-core processing (alternative 2) during Command.Execute() to minimize the duration of the exclusive lock; see the sketch below. Hogging CPU during Command.Prepare() would shorten the total execution time but would have an impact on concurrent queries.
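A sketch of what that combination could look like. The GetFeed, FindEntrySetsFor and RemoveFeed helpers on the model are hypothetical, GeekStreamModel is the illustrative model class from earlier, and the exact Command base class signatures may differ slightly from what's shown here:

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using OrigoDB.Core; // namespace assumed

[Serializable]
public class RemoveFeedCommand : Command<GeekStreamModel>
{
    public int FeedId { get; set; }

    // collected during the read phase
    private List<HashSet<IndexEntry>> _setsToModify;

    public override void Prepare(GeekStreamModel db)
    {
        // read phase: concurrent queries are still being served while we scan
        Feed feed = db.GetFeed(FeedId);            // hypothetical helper
        _setsToModify = db.FindEntrySetsFor(feed); // hypothetical helper, the expensive scan
    }

    public override void Execute(GeekStreamModel db)
    {
        Feed feed = db.GetFeed(FeedId);
        var items = new HashSet<FeedItem>(feed.Items);

        // write phase: the exclusive lock is held here, so spread the work across cores
        Parallel.ForEach(_setsToModify, entrySet =>
            entrySet.RemoveWhere(entry => items.Contains(entry.Item)));

        db.RemoveFeed(feed); // remove the feed itself (hypothetical helper)
    }
}
```

Each hash set is touched by exactly one task, so the parallel loop needs no additional locking; the point is simply to hold the exclusive lock for as short a time as possible.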

As always you should do some profiling or other performance measurements and not base optimizations on assumptions.