What is TeaBreak?
TeaBreak (http://teabreak.pk) is a Pakistani blog aggregator. It started in April 2008 with the aim of organizing the expanding Pakistani blogosphere. The engine supports automated tagging and categorization of posts into relevant topics like politics, sports, business, and technology.
Background
Initially the project was completely based on WordPress and used a feed aggregation plugin called WP-O-Matic.
But within a matter of months TeaBreak grew to over 500 local blogs, and the aggregation overhead, combined with serving growing traffic, was enough to exhaust the VPS resources. The site was running on a decent VPS, but even after a lot of MySQL and Apache tuning, the WordPress + WP-O-Matic setup was still not adequate.
The new distributed architecture
At that point I decided to split the monolithic system into stand-alone distributed systems. This way each system can be scheduled to run at a different time and possibly spread across multiple hosts.
So, the system was divided into:
1. WordPress Front-end
I didn’t want to re-invent the wheel, so I kept the front-end on WordPress because of its excellent support for managing posts and editorial-level features (like editing and tagging posts). The plugin base was another attractive reason to retain WordPress, since it means a new feature (like polling) can be launched almost instantly.
This system is live at http://teabreak.pk.
2. CDN for static content
We have roughly 50K+ posts in our system at any given time, and our traffic is not yet high enough to keep all of them in cache. The volume of images and other static content is also considerable, so I decided to offload static content (images, CSS, JavaScript, etc.) to a separate shared hosting account.
I achieved this using a modified version of the CDN Rewrite plugin. The CDN is live at: http://cdn.teabrk.com.
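The idea behind this kind of rewrite can be sketched in a few lines of Python. This is only an illustration of the technique, not the CDN Rewrite plugin's actual code (which is PHP): the function name and the list of static extensions are assumptions, and only the two host names come from this page.

```python
import re

# Hosts named on this page.
ORIGIN = "http://teabreak.pk"
CDN = "http://cdn.teabrk.com"

# Assumed list of static file extensions; the plugin's real list is not given here.
STATIC_EXTENSIONS = ("png", "jpg", "jpeg", "gif", "css", "js")

# Match an origin URL whose path ends in a static extension, followed by
# a quote, whitespace, '?', '>' or the end of the text.
_STATIC_URL = re.compile(
    re.escape(ORIGIN)
    + r"(/[^\s\"'<>]+\.(?:" + "|".join(STATIC_EXTENSIONS) + r"))"
    + r"(?=[\"'\s?>]|$)"
)


def rewrite_static_urls(html: str) -> str:
    """Point static asset URLs at the CDN host; leave page URLs untouched."""
    return _STATIC_URL.sub(lambda m: CDN + m.group(1), html)
```

Post permalinks keep pointing at the WordPress host, while images, stylesheets and scripts are fetched from the second account.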
3. Management System
This is a non-WordPress system that deals with sign-ups, registrations, and admin and editorial review of new blogs. The basic aim was to have a system that can be modified as our demands and requirements take shape when working with external partners.
Because this is a totally separate and stand-alone sub-system we can modify it without constraining ourselves in the WordPress world. This system dispenses knowledge about which blogs are registered, active, approved, etc.
This area is live at: http://site.teabreak.pk.
4. Aggregation Engine
This is the part of the system that does most of the heavy lifting. The engine is written in Java and runs on Google AppEngine infrastructure. Among its various functions, the engine primarily parses XML/RSS feeds, processes posts, tags and classifies them into relevant topics, and puts them in the publishing queue.
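To make the parse, classify, queue flow concrete, here is a minimal sketch in Python. It is not the real engine (which is Java on AppEngine), and the keyword lists, function names and the in-memory queue are all assumptions made for illustration; the actual classification method is not described on this page.

```python
import re
import xml.etree.ElementTree as ET
from collections import deque

# Hypothetical keyword lists standing in for the real classifier.
TOPIC_KEYWORDS = {
    "politics": {"election", "parliament", "minister"},
    "sports": {"cricket", "hockey", "match"},
    "technology": {"software", "internet", "mobile"},
}


def parse_rss(xml_text):
    """Yield (title, link, description) for each <item> in an RSS 2.0 feed."""
    root = ET.fromstring(xml_text)
    for item in root.iter("item"):
        yield (
            item.findtext("title", default=""),
            item.findtext("link", default=""),
            item.findtext("description", default=""),
        )


def classify(text):
    """Return the sorted list of topics whose keywords appear in the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return sorted(t for t, kws in TOPIC_KEYWORDS.items() if words & kws)


def aggregate(xml_text, queue):
    """Parse one feed, tag each post with topics, and append it to the queue."""
    for title, link, description in parse_rss(xml_text):
        queue.append({
            "title": title,
            "link": link,
            "body": description,
            "topics": classify(title + " " + description),
        })
```

A `deque` stands in for the publishing queue here; in a hosted setup this would be some persistent store so that aggregation and publishing can run at different times.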
5. Publishing Engine
This sub-system (part of the Aggregation Engine) picks up posts waiting in the queue and publishes them to TeaBreak’s WordPress front-end via the XML-RPC API.
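A rough Python sketch of this step follows. It assumes the `metaWeblog.newPost` method that WordPress exposes over XML-RPC; which method the real engine calls is not stated here. The queue item fields (`title`, `body`, `topics`), the function names and the endpoint URL in the usage note are likewise assumptions.

```python
import xmlrpc.client
from collections import deque


def connect(endpoint):
    """Create an XML-RPC proxy; no request is sent until a method is called."""
    return xmlrpc.client.ServerProxy(endpoint)


def to_metaweblog_struct(post):
    """Map a queued post (hypothetical fields) to a metaWeblog content struct."""
    return {
        "title": post["title"],
        "description": post["body"],
        "categories": post["topics"],
    }


def publish_queue(queue, server, username, password, blog_id=0):
    """Drain the queue, publishing each post; returns the new post IDs.

    No retry handling: a post that fails to publish raises and is not
    re-queued in this sketch.
    """
    post_ids = []
    while queue:
        post = queue.popleft()
        post_ids.append(
            server.metaWeblog.newPost(
                blog_id, username, password, to_metaweblog_struct(post), True
            )
        )
    return post_ids
```

Usage would look something like `publish_queue(queue, connect("http://teabreak.pk/xmlrpc.php"), user, password)`, where `xmlrpc.php` is the standard WordPress endpoint path and is assumed rather than confirmed for this site.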
Conclusion & Aftermath
After a soft launch in December 2010, this system went live in January 2011 when we deployed Version 3.0. The VPS now sits mostly idle, as it only serves WordPress requests. The CPU averages a meager 20% load, compared to 78% before.
Offloading aggregation and processing bits of our system to Google AppEngine proved to be a good investment. The website has been running quite smoothly since the transition to the distributed system with almost no down-time / unresponsive moments.
Amazing structure. I feel there is room for algorithmic management in the site admin part. That is subjective, though; overall, awesome work!
Thanks for dropping by, Amir 🙂
Yes, indeed, I believe you’re right. Automation on the admin side would be good. This is probably the most manual part of the system, and it often suffers from delays and backlogs.
However, the reason it’s kept manual is that we don’t want to aggregate low-quality content (like spam blogs) and need to make sure that the blogs being registered look genuine. This makes it a bit trickier to automate. I would be interested to hear any thoughts you may have.
Great insights. I went through a similar roller coaster of experiences with the blog aggregation at bloggers.pk, but the road hit a dead end: it grew larger than any off-the-shelf product could handle and required a customized solution. I think you have nailed it perfectly.
Well done – keep up the great work
Thanks 🙂 Yes, I can relate to your experience. I was being sucked into a lot of tweaking and optimization with little or no effect. And, as they say, if the only way your software scales is by adding more hardware muscle, then it’s time to rethink.
I don’t think WordPress is primarily designed for this use-case and most off-the-shelf addons are geared towards serving high traffic spikes rather than for a blog that hosts 50K-100K posts. I think this is where it starts to fall apart.
Out of curiosity, was the previous bloggers.pk system based on WordPress?
This is a great website, and everything I like to read the most is available here. Thanks, mate, for making such a wonderful website.
It’s a good site for increasing information 🙂 😛