On July 18th the openSUSE Conference 2013 will kick off in Thessaloniki. As always, there will be many talks about the technologies on which the openSUSE distribution is built, as well as workshops where one can learn further tricks. However, the conference also provides an opportunity to talk about the project itself. There will be a Project Meeting, various sessions on the Ambassador program, and many development teams will meet and discuss their work.
The openSUSE team will go a little more ‘meta’ with a presentation full of numbers about openSUSE. Alberto Planas tells you all about it!
Lies, damned lies, statistics…
The numbers were gathered by Alberto, a software developer by trade. He has a background in data mining and machine learning, and his first task after joining the openSUSE team was to dig into the treasure troves of data gathered from the openSUSE servers. The goal was to get an idea of the number of users and active installations, the contributor base, bug reporters and so on.
Counting the number of downloaded ISOs, grouped by distribution, seems an easy task: it is a matter of a well-placed regular expression in a simple shell script. In fact, this approach is good enough… if you do not have a deadline. Let’s do some math: we want to harvest 3.5 years of data (1275 days), stored in log files with a size of up to 2.3 GB per day (1.5 GB on average). This data is stored on a remote server, and the total amount is close to 2 TB of information that needs to be downloaded, uncompressed and processed by a number of scripts to gather the desired metrics. After some tests, Alberto’s best time was around 75 minutes per day-to-be-processed (doing multiple rounds of analysis over the same log lines with a naïve approach). That adds up to a total of 66 days just to process the data once. Ouch.
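The “well placed regular expression” part can be sketched like this. Note that the log format and file paths below are assumptions for illustration, not the actual layout of the openSUSE download servers:

```python
import re
from collections import Counter

# Hypothetical sketch: count ISO downloads per release from Apache-style
# access log lines. The URL scheme here is an assumption.
ISO_RE = re.compile(r'"GET /distribution/(?P<release>[\d.]+)/iso/[^" ]+\.iso HTTP')

def count_iso_downloads(lines):
    counts = Counter()
    for line in lines:
        m = ISO_RE.search(line)
        if m:
            counts[m.group('release')] += 1
    return counts

sample = [
    '1.2.3.4 - - [18/Jul/2013] "GET /distribution/12.3/iso/openSUSE-12.3-DVD-x86_64.iso HTTP/1.1" 200 -',
    '5.6.7.8 - - [18/Jul/2013] "GET /distribution/12.2/iso/openSUSE-12.2-DVD-i586.iso HTTP/1.1" 200 -',
    '9.9.9.9 - - [18/Jul/2013] "GET /robots.txt HTTP/1.1" 200 -',
]
print(dict(count_iso_downloads(sample)))   # {'12.3': 1, '12.2': 1}
```

Easy enough for one file; it is the 2 TB of them that turns this into a 66-day job.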
One of the good things about working at SUSE is that asking the right question of the right people helps you do the right thing. In this case the answer was obvious: we need to parallelize the analysis by splitting the data between different machines. Using the internal infrastructure of SUSE, Alberto managed to reserve a 24-core machine, so in theory the analysis would go from 66 days to under three.
But now another problem arose: data storage. If we are satisfied with counting ISO downloads and the number of users that access the openSUSE repositories, maybe we can remove some lines from the logs, like bot traffic or other information unrelated to our goals. That would speed things up. But as soon as we want to analyze something outside the goal of this initial project, we would have to wait another three days of processing for the results. The solution was to pre-process the log files and store them in a local database which can be used for flexible analysis.
Big Data and an Old Friend
Usually you’d use a database like MySQL or PostgreSQL, or better yet, a NoSQL database like MongoDB or Cassandra. Did I mention that we need to store more than 40 million lines per day, and that we only have 2 TB of free space reserved for this project? Those restrictions clearly rule out these kinds of data stores, because the indexes plus the data itself would overflow the available space.
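A quick back-of-envelope check shows how tight the budget is. The per-row overhead figure below is an assumption; only the line counts come from the article:

```python
# Why a conventional database will not fit in the space budget.
lines_per_day = 40_000_000
days = 1275
total_lines = lines_per_day * days
print(total_lines)               # 51000000000 rows

bytes_per_row = 50               # assumed: compact row plus index overhead
print(total_lines * bytes_per_row / 1e12)   # 2.55 (TB), over the 2 TB budget
```

Even at an optimistic 50 bytes per indexed row, 51 billion rows already overshoot the 2 TB that was reserved.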
How to solve this problem? Well, we might not need a full database with all the functionality (and overhead) that comes with it. It is enough if we can traverse the data linearly to make different aggregations quickly. So we need something more like a persistent queue than a database, and that is what BerkeleyDB is good for. It turned out that if we removed the redundancy found in every line, for example by indexing the path field of the log lines into a separate data store and compressing all the information, all the lines could be stored in less than 1.2 TB, even leaving us some space to make different groupings.
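The space-saving idea can be sketched in a few lines: replace the highly repetitive path field with a small integer key kept in a separate table, and compress what remains. This sketch uses Python’s stdlib `dbm` module as a stand-in for BerkeleyDB; the record layout and names are illustrative only:

```python
import dbm
import zlib

paths = {}                       # path -> integer id (separate index)

def pack_line(ip, timestamp, path):
    # Deduplicate the path, then compress the slimmed-down record.
    path_id = paths.setdefault(path, len(paths))
    record = f'{ip} {timestamp} {path_id}'
    return zlib.compress(record.encode())

with dbm.open('lines.db', 'c') as db:
    db[b'0'] = pack_line('1.2.3.4', '18/Jul/2013', '/distribution/12.3/iso/x.iso')
    db[b'1'] = pack_line('5.6.7.8', '18/Jul/2013', '/distribution/12.3/iso/x.iso')

with dbm.open('lines.db', 'r') as db:
    line = zlib.decompress(db[b'0']).decode()
    print(line)                  # 1.2.3.4 18/Jul/2013 0
```

Both records share one entry in the path index, which is exactly where the redundancy in real access logs lives.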
Now we can walk the data for a single day in less than 5 minutes, and the initial grouping we need for the analysis can be done in under 4 hours! Considering we have almost 51 billion rows of data (that is 51,000,000,000, yes) and seeing how long it took initially, this makes our workload embarrassingly parallel (in a few ways) for sure.
Having made a number of initial groupings and saved them in the 800 GB of space we had left, we have achieved a huge speed-up: any time we want to run an analysis, provided there is a grouping to support it, execution time can be measured in seconds.
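The idea behind those stored groupings can be sketched as follows. One expensive linear pass builds an aggregate table; after that, every question the table can answer is a lookup instead of a 51-billion-row scan. The record layout is invented for the example:

```python
from collections import defaultdict

# Toy stand-in for the real data: (day, release) pairs per download.
records = [
    ('2013-07-18', '12.3'), ('2013-07-18', '12.3'), ('2013-07-18', '12.2'),
    ('2013-07-19', '12.3'),
]

grouping = defaultdict(int)
for day, release in records:     # the expensive pass, done once
    grouping[(day, release)] += 1

# Later analyses against the stored grouping are instant lookups.
print(grouping[('2013-07-18', '12.3')])   # 2
```

In the real setup the pass takes hours and the grouping lives on disk, but the principle is the same: pay the traversal cost once, query for free afterwards.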
Now, after all this, you might want to know what the results were. Well, that’s a subject for the openSUSE Conference itself… If you can’t make it, remember, we’re recording and streaming it!
And now for something completely different… The top 10 contributors to Factory in week 26!
7. Cristian Rodríguez, Bjørn Lie
8. Dr. Werner Fink
9. Petr Gajdos, Marcus Meissner, Luiz Fernando Ranghetti