Thursday, May 5, 2016

My top author list for Planet MySQL

Who are the top individual authors of influential recent posts to planet MySQL?  The planet MySQL page includes a list of the top 20 authors as well as a list of the top 10 vendor blogs.  However, posts to the vendor blogs make up at least 1/4 of all of the posts, and the authors of vendor blog posts are not included in the top author list.  So I decided to compute my own top author list, one that includes the hidden authors from the vendor blogs.

The first problem is to identify the hidden authors for posts from each vendor blog.  This requires that the author information be extracted from the individual posts, and this requires a specialized parser for each vendor blog to extract the author name from the document.
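A minimal sketch of the "one parser per vendor blog" idea, in Python, might look like the following. Everything specific here is an assumption for illustration: the blog keys `example-vendor-a` and `example-vendor-b` are hypothetical, and the two markup patterns (an HTML `<meta name="author">` tag and a byline element with class `author`) are common conventions, not the actual markup of any particular vendor blog.

```python
import re
from typing import Callable, Dict, Optional


def author_from_meta_tag(html: str) -> Optional[str]:
    """Assumed pattern: <meta name="author" content="Some Name">."""
    m = re.search(r'<meta\s+name="author"\s+content="([^"]+)"', html, re.IGNORECASE)
    return m.group(1).strip() if m else None


def author_from_byline(html: str) -> Optional[str]:
    """Assumed pattern: an element with class="author", optionally prefixed by 'By'."""
    m = re.search(r'class="author"[^>]*>\s*(?:By\s+)?([^<]+)<', html, re.IGNORECASE)
    return m.group(1).strip() if m else None


# Hypothetical registry: one specialized parser per vendor blog.
PARSERS: Dict[str, Callable[[str], Optional[str]]] = {
    "example-vendor-a": author_from_meta_tag,
    "example-vendor-b": author_from_byline,
}


def extract_author(blog: str, html: str) -> Optional[str]:
    """Return the author name, or None if the blog is unknown or no name is found."""
    parser = PARSERS.get(blog)
    return parser(html) if parser else None
```

Returning `None` when no name is found makes it easy to skip posts that carry no individual author name.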

The second problem is to rank the authors using some criterion, such as the number of posts in a given recent time range.  I could run a PageRank algorithm if I had a map of the web, but I don't have one.  So I include the 'like' and 'dislike' counters that planet MySQL maintains for each post in the criteria.  In addition, I could use the comments on a post to indicate the post's interest level, but I did not process the comments.

I don't know what ranking algorithm planet MySQL uses (is the algorithm obvious?), so I use my own.  I rank the authors by assigning a score to each post and adding up the scores by author.  I select only posts within some recent time range, like the last 3 months.
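The "score each post, then add up the scores by author" step can be sketched in Python roughly as follows. The post layout (a dict with `author` and `post_date` keys), the function name `rank_authors`, and its parameters are assumptions made for illustration. The per-post score is passed in as a function so that the sketch does not depend on any particular scoring formula.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def rank_authors(posts, score, now, window_days=90, top_n=25):
    """Sum per-post scores by author for posts inside the recent time window.

    posts: iterable of dicts with 'author' and 'post_date' keys (assumed layout).
    score: function taking a post and returning a number.
    """
    cutoff = now - timedelta(days=window_days)
    totals = defaultdict(float)
    for post in posts:
        if post["author"] is None:
            continue  # no individual author could be identified
        if cutoff <= post["post_date"] <= now:
            totals[post["author"]] += score(post)
    # Highest total first; keep only the top_n authors.
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
```

With `now` set to May 5 2016 and a 90 day window, the cutoff lands on Feb 5 2016.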

My 'score' function is:

score(post) = (1 + A(like,dislike))*B(now,post_date)

A(like,dislike) = like_scale_factor*(abs(like) + abs(dislike))

B(now,post_date) = exp(-((now - post_date)*time_scale_factor))

Note that the constant '1' indicates the value of the post itself.  

The 'A' function computes the interest level of the post based on whether or not a reader took the time to poke the 'like' or 'dislike' button.  

The 'B' function is a negative exponential function that assigns precedence to recent posts.  Old posts will have a high score only if they have a lot of 'like' or 'dislike' counts.
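The 'score', 'A' and 'B' functions above translate fairly directly into Python. This is only a sketch: the values of the two scale factors are not given in this post, so the numbers below are placeholders, and measuring the age of a post in days is also an assumption.

```python
import math
from datetime import datetime

# Placeholder values; the real scale factors are not stated in the post.
LIKE_SCALE_FACTOR = 0.1
TIME_SCALE_FACTOR = 1.0 / 30.0  # assumed unit: per day


def interest(like, dislike):
    """The 'A' function: any reaction, positive or negative, counts as interest."""
    return LIKE_SCALE_FACTOR * (abs(like) + abs(dislike))


def recency(now, post_date):
    """The 'B' function: negative exponential decay with the age of the post."""
    age_days = (now - post_date).total_seconds() / 86400.0
    return math.exp(-age_days * TIME_SCALE_FACTOR)


def score(like, dislike, now, post_date):
    """score(post) = (1 + A(like, dislike)) * B(now, post_date)."""
    return (1.0 + interest(like, dislike)) * recency(now, post_date)
```

With these placeholder factors, a brand new post with no votes scores 1, and a 30 day old post with no votes scores exp(-1), or about 0.37.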

It will be interesting to fiddle with the 'score' function, especially the scale factors.

My top authors list for posts between Feb 5 2016 and May 5 2016 inclusive

Given the 'score' function described above and a time range for the last 3 months, my list of the top 25 individual authors on planet MySQL is:

+-------------------------+-------+
| author                  | score |
+-------------------------+-------+
| Joao Osorio             |    22 |
| Todd Farmer             |    20 |
| Valeriy Kravchuk        |    20 |
| Colin Charles           |    16 |
| Daniel Bartholomew      |    16 |
| Morgan Tocker           |    15 |
| Fahd Mirza              |    15 |
| Hrvoje Matijakovic      |    15 |
| Mark Callaghan          |    13 |
| Dave Avery              |    13 |
| Frederic Descamps       |    12 |
| Peter Zaitsev           |    11 |
| Shahriyar Rzayev        |    11 |
| Dave Stokes             |    10 |
| Valerie Parham-Thompson |    10 |
| Vadim Tkachenko         |    10 |
| Sveta Smirnova          |     9 |
| Giuseppe Maxia          |     8 |
| Marcelo Altmann         |     8 |
| Mario Beck              |     7 |
| Seppo Jaakola           |     7 |
| Dimitri Kravtchuk       |     7 |
| Shlomi Noach            |     6 |
| Rasmus Johansson        |     6 |
| Peter Gulutzan          |     6 |
+-------------------------+-------+

My list is similar to the planet MySQL top author list except that there are no team blogs in my list.  In addition, my list contains several authors who contribute heavily to the team blogs, since I extracted the authors from the posts to several team blogs including the 'MySQL Performance Blog', the 'Pythian Group' blog, the 'MariaDB' blogs, and several 'MySQL' blogs.  Note that since the 'Severalnines' blog posts do not contain individual author names, none of its posts are included in my list.

One can also compute the top authors for the last year or the last 10 years from the planet MySQL meta-data database using this post scoring process.  These results are not included in this post.

Posts by year

Let's change the topic.  One can compute interesting trends from the planet MySQL meta-data, like the number of posts to planet MySQL per year.  Obviously, there is some erroneous timestamp data (1970?) in the planet MySQL database.

+-----------------+----------+
| year(post_date) | count(*) |
+-----------------+----------+
|            1970 |        2 |
|            2001 |        2 |
|            2003 |        5 |
|            2004 |       67 |
|            2005 |     1409 |
|            2006 |     3489 |
|            2007 |     3715 |
|            2008 |     5744 |
|            2009 |     5052 |
|            2010 |     3192 |
|            2011 |     3137 |
|            2012 |     2987 |
|            2013 |     2695 |
|            2014 |     2308 |
|            2015 |     1686 |
|            2016 |      558 |
+-----------------+----------+
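The table above has the shape of a group-by-year count. An equivalent sketch in Python is shown below, assuming the post dates are already available as `datetime` values; the function names are made up for illustration. Since 1970 is the year of the Unix epoch, the second helper assumes that rows from that year are bad timestamps and drops them.

```python
from collections import Counter
from datetime import datetime

EPOCH_YEAR = 1970  # year of the Unix epoch


def posts_per_year(post_dates):
    """Return (year, count) pairs sorted by year, like the table above."""
    counts = Counter(d.year for d in post_dates)
    return sorted(counts.items())


def drop_epoch_rows(rows):
    """Drop the rows for 1970, assuming those timestamps are erroneous."""
    return [(year, n) for year, n in rows if year != EPOCH_YEAR]
```

For example, `drop_epoch_rows(posts_per_year(dates))` gives the per-year counts without the 1970 row.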