If the web is going real-time (seems that way according to all the recent hype) why are marketers and Internet companies still using old, slow data warehouse technology to analyze interactions?
If Web 1.0 means replicating "old" systems online - a push of information from originator to consumer, then Web 2.0 is the next step - a back and forth discussion between originator and recipient.
Web 3.0 is here, and it's subtly different on the outside (which is why there's no fanfare), but its impact on Internet infrastructure is massive - the conversation goes from originator to user, and then to groups completely outside of the originator's understanding, intent or control. Conversations are completely decentralized, and rise and fall based on networks, not the source. Twitter, re-tweets, and completely asynchronous, unorganized group messaging are a new paradigm: they make interactions hard to follow, and the data they generate much more difficult to capture and understand at a high level.
Web marketers, online advertisers, and content distributors can't rely on data warehouses to provide analysis of these conversations - they are very complex and unbounded events. Business intelligence systems were designed as back-office applications and meant for data mining well-known data well after the fact. They are completely unsuited for this role. But we're using 'em because "that's what we got and that's what we know."
When you think about it, it's silly that you can have a million users visit your website in a day, but you have no idea what happened until a day later.
Business intelligence is much different from "Internet intelligence" or "social web intelligence" (you heard it here first). On the now social Internet, you need to know what's happening now to make an impact, optimize, or re-target based on activities. The next day, when a data warehouse could deliver that insight, is far too late.
Open-source projects like Hadoop, which implements the MapReduce model and eschews relational database storage techniques, are a different way to tackle the problem. They break data up into manageable chunks that can be distributed across servers to speed processing. But the basic idea that data must be pulled out of production systems to process and analyze across dozens or hundreds of servers - then re-centralized and shoved back in - still doesn't fit with the need to process data as it's generated. These systems still have significant lag times (hours or days), and are really complicated to manage. Internet intelligence requires analysis within the production environment, in real time - whether or not action is actually taken in real time.
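To make the chunk-and-recombine idea concrete, here is a minimal single-machine sketch in Python. It is not Hadoop's API; the function names and the sample page-view records are made up for illustration. It only shows the shape of the approach: split collected data into chunks, count each chunk independently, then merge the partial results. Note that nothing can be counted until the whole batch has been gathered, which is where the lag comes from.

```python
from collections import Counter
from functools import reduce

def split_into_chunks(records, chunk_size):
    """Break the collected batch into fixed-size chunks."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]

def map_chunk(chunk):
    """'Map' step: count page views within one chunk only."""
    return Counter(page for page, _user in chunk)

def reduce_counts(left, right):
    """'Reduce' step: merge two partial counts into one."""
    return left + right

def batch_page_counts(records, chunk_size=2):
    # In a real cluster each chunk would be sent to a different server;
    # here the chunks are simply processed one after another.
    partials = [map_chunk(chunk) for chunk in split_into_chunks(records, chunk_size)]
    return reduce(reduce_counts, partials, Counter())

# Hypothetical (page, user) records collected over some past period.
page_views = [
    ("/home", "u1"), ("/video", "u2"), ("/home", "u3"),
    ("/ad", "u1"), ("/home", "u2"),
]
totals = batch_page_counts(page_views)
```

The answer is correct, but it only describes the past: any visit that arrives after the batch was collected waits for the next run.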
There are a lot of smart people out there, so why hasn't anyone been able to solve this problem? Well, to be fair, it's only gotten to be a real problem in the last few years. Before that, in Web 1.0 and 2.0, the volume of data being generated was manageable and understanding interactions could wait until tomorrow.
Today, many well-funded vendors are trying to solve this problem (getting immediate analysis from massive amounts of data in a cost-effective manner), but it's a very hard problem to solve. Some Internet companies are building their own stuff (Facebook, Google, YouTube, Yahoo!), and some are leveraging the fastest third-party data warehouse products (Teradata, Netezza, even Oracle). But it's all based on the idea of batch processing, and whether it's being built in-house or bought from one of the super-scale data warehouse vendors, it's REALLY expensive ($2 million to $10 million just to start off with a Teradata system), and the cost goes up the more data you have and the faster you want results.
I found one company, Truviso, that stands out from these other vendors - they are actually able to deliver real-time data analysis in a production environment in a cost-effective manner. They haven't figured out how to make a data warehouse faster, but instead they process data in a different way. Truviso's Continuous Analytics software processes data in real time before it's stored in a database, so it completely eliminates the lag time in batching and loading and indexing, or chunking and distributing data across clusters. Analysis is done on the fly, decisions based on data can be automated, and people can actually see what's happening on their websites - and across conversations - at any time.
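For readers who want to see what "analysis on the fly" can look like, here is a small Python sketch of a sliding-window counter. To be clear, this is not Truviso's software or its API, and I don't know how their product is built; the class name, the 60-second window, and the sample events are all my own assumptions. It only illustrates the general idea of updating a result as each event arrives instead of waiting for a batch. It also assumes events arrive in timestamp order.

```python
from collections import Counter, deque

class SlidingWindowCounter:
    """Keeps per-key counts for events seen in the last `window_seconds`."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.events = deque()    # (timestamp, key), oldest first
        self.counts = Counter()  # live counts for the current window

    def _expire(self, now):
        # Drop events that have aged out of the window.
        while self.events and self.events[0][0] <= now - self.window_seconds:
            _, old_key = self.events.popleft()
            self.counts[old_key] -= 1
            if self.counts[old_key] == 0:
                del self.counts[old_key]

    def add(self, timestamp, key):
        # Update the answer the moment the event arrives.
        self._expire(timestamp)
        self.events.append((timestamp, key))
        self.counts[key] += 1

    def snapshot(self, now):
        # The current answer is always available; no batch job to wait for.
        self._expire(now)
        return dict(self.counts)

# Hypothetical page views, with timestamps in seconds.
window = SlidingWindowCounter(window_seconds=60)
window.add(0, "/home")
window.add(10, "/video")
window.add(65, "/home")
current = window.snapshot(now=65)
```

Each event is touched once when it arrives and once when it expires, so the result is ready at any moment, before anything has been loaded into a database.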
Truviso has created a scalable data analytics system for Internet production environments with real-time data analysis problems. Online ad networks, CDNs, social networks, and online video companies are producing massive amounts of data that they need to analyze to deliver better experiences for their users or customers. Their business depends on the analysis of this data for revenue generation - it's vitally important to them. If they can make a change today instead of tomorrow, that could result in tens or hundreds of thousands of dollars in additional revenue.
This type of Internet intelligence analysis is going to change the status quo. Once companies realize they don't have to wait for analysis anymore, there will be no going back. They'll expect it. They'll want it all the time. And that's the right way to go, especially since the group discussions that define Web 3.0 aren't going to go away.