What Goes Around Comes Around
Not long ago, a legacy system was enterprise software running on a mainframe. Wait, I thought those supercharged Teradata screaming machines were purchased just a few years ago? Did someone say legacy Teradata? Grab the gold coins son, we’re going off the grid. It’s a stirring revolution: some large industry leaders are taking their in-house MPP databases and shoving them, opting instead for the route of IaaS and PaaS. Part of what’s driving this movement is the desire to let someone else worry about infrastructure and platforms. These initiatives are not little toy sandbox data marts, but large-scale behemoths being deployed by the largest retail, consumer goods and high-tech companies in the world. Sound familiar? No, it’s not déjà vu. This really is a lot like time sharing on mainframes almost a half century ago.
The Cloud People
Whether on premise or off, you need to align the right set of skills when deploying big data in the cloud. So what kind of skills should you bring on board? Perhaps you’ll need experience in hosting and performance tuning Splunk applications on Azure to analyze product (machine) operations data for troubleshooting or analysis. Or maybe you’ll need knowledge of tools such as SOASTA for load testing and log analysis as teams spin up thousands of servers across your cloud platform. Going off-premise cloud may raise your bar on efficiency and effectiveness, but it may subsequently require some team retooling. SaaS experience should entail big queries on cloud services platforms. You’ll need to import data into cloud storage and thereafter ingest into that storage from real-time sources. IaaS and PaaS experience should involve virtual machine, virtual network and tools heavy lifting on cloud platforms. Though many big data tools are open source, not all are created equal across cloud platforms. Many skills are transferable, so don’t decide solely based on your architects’ cloud platform religions as to whether you’ll be standing up Virtual Private Clouds (AWS) or Virtual Networks (Azure).
It’s Real Time
In ongoing cloud operations and performance, there is a need to process significant amounts of data from web servers, application servers, databases and external sources, such as Akamai and SiteCatalyst. Additionally, data may need to be correlated with applications such as Apigee or an Order Management System. There is a need for real-time and near real-time visibility into operations. Basket analysis, for example, may use Apache Hadoop, with the data ingested via Flume; custom programs then do the analysis and prepare the results for reporting. Tools such as Splunk may be used to collect logs from hundreds of servers and correlate them with the data from the applications.
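To make the basket analysis idea concrete, here is a minimal sketch of what one of those custom programs might compute: how often pairs of items show up in the same basket. It uses plain Python rather than Hadoop, and the function name pair_counts and the sample baskets are hypothetical, made up purely for illustration.

```python
from collections import Counter
from itertools import combinations


def pair_counts(baskets):
    """Count how often each unordered pair of items appears in the same basket."""
    counts = Counter()
    for basket in baskets:
        # Deduplicate and sort so ("bread", "milk") and ("milk", "bread")
        # are counted under a single key.
        items = sorted(set(basket))
        counts.update(combinations(items, 2))
    return counts


# Hypothetical sample baskets; real input would come from ingested order data.
BASKETS = [
    ["milk", "bread", "eggs"],
    ["milk", "bread"],
    ["bread", "eggs"],
    ["milk", "eggs", "bread"],
]

if __name__ == "__main__":
    for pair, n in pair_counts(BASKETS).most_common():
        print(pair, n)
```

On a real cluster the same counting would be spread across many machines, but the logic per basket stays about this simple.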
Moving data in and out of cloud storage may end up being a blend of batch and real-time using cloud features such as Elasticsearch in an AWS cloud to search data lakes of mobile, social, consumer and business applications data. For Google Cloud, you’ll want experience with Google Pub/Sub and Dataflow for real-time data integration into BigQuery.
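One way to picture that blend of batch and real-time is micro-batching: events arrive one at a time, but they are written to storage in small groups. The sketch below is plain Python and deliberately does not use the Pub/Sub, Dataflow or Elasticsearch APIs; the function name micro_batches and the sample events are assumptions for illustration only.

```python
from typing import Iterable, Iterator, List


def micro_batches(events: Iterable[dict], max_size: int) -> Iterator[List[dict]]:
    """Group a stream of events into lists of at most max_size items."""
    if max_size < 1:
        raise ValueError("max_size must be at least 1")
    batch: List[dict] = []
    for event in events:
        batch.append(event)
        if len(batch) >= max_size:
            yield batch
            batch = []
    # Flush whatever is left once the stream ends.
    if batch:
        yield batch


if __name__ == "__main__":
    # Hypothetical event stream; a real one would come from a message queue.
    stream = ({"order_id": i} for i in range(5))
    for batch in micro_batches(stream, max_size=2):
        # A real pipeline would load each batch into cloud storage here.
        print(len(batch), batch)
```

A production pipeline would usually also flush on a timer so that a quiet stream does not leave events waiting; that is left out here to keep the sketch short.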
Nothing But Cloud
You wouldn’t do it of course just because the cool CIO getting all of the attention at the conference said she did, but between us, if you’re thinking about going off-premise cloud, you’re not alone. More than one top Fortune Company is reducing their in-house Teradata footprint and moving to cloud platforms. Note: Teradata is now offering Database as a service (DBaaS) in its own cloud, as well as on public cloud platforms.
Your plans may be to go all-in on a public cloud platform. You might end up, however, with some on premise deployment, even if that’s not the way it started out on the drawing board. So if you end up with an Elasticsearch cluster running on Azure with an on premise connection to MongoDB inside the firewall, you’re not a loser. And you can still tell your buddies at the bocce league that you’ve gone cloud.
Make sure to get architects with experience in designing data flows into Hadoop on a large-scale cloud platform. Develop best practices for the day-to-day loads, which may involve MongoDB or Cassandra Collections. Be prepared to have your team develop use case prototypes for demonstration to upper management. Data lake technology is still in its early stages, and the skills to extract meaningful business value are in rather short supply. Deployment guides don’t exist yet. Aligning the right skills and governance is key to keeping projects from getting bogged down.
Master the Data Lake
It’s intriguing that the new term data lake is named after a body of water rather than a place where we store stuff a la the data warehouse. That lets us use other water terms when the data gets all mucked up or otherwise permeated, for better or worse, enabling us to call them things like data bogs, swamps, creeks and oceans. Sticking to the metaphor of the lucent body of data, a data lake is similar to the persisted storage area of a data warehouse. The idea is to store massive amounts of data so that it is readily available.
Back in the data warehouse glory days, you only wanted to have the really important data immediately available, because you simply could not handle making all of the data persistent. It ended up though not being such a bad thing, because adding some governance to the process led to some discipline up front about what was more important and what was less important. True, technology limited storing ‘everything’ from a cost and capabilities perspective. The collective technology, which we might just call tools, really did not allow us to ‘have it all.’
Alas, technology has advanced, and it is now cost effective to store gargantuan amounts of data – yes, unstructured too – in data lakes. But is there a downside? Time will tell. The good news is, we get to store a bunch of data we don’t have a clue about. So maybe, over time, we can make some sense out of it. It’s sort of like data cryogenics, in that we want to keep the data alive until we find the cure to fix whatever ailments are being caused by something we know so little about.
Decoder for Big Data & Cloud Terms Referenced
SaaS – Software as a Service – function-specific on-demand application. Example: Salesforce.com
PaaS – Platform as a Service – middleware, virtualized databases, development platforms, operating systems. Examples: Azure and AWS
IaaS – Infrastructure as a Service – memory, disk, network, virtual machine (computer). Examples: Amazon EC2, Rackspace, IBM SoftLayer
API Management Platform – Apigee
Big Data Platforms – Hadoop, Hortonworks
Cloud Services Platforms – Amazon Web Services (AWS), Microsoft Azure, Google Cloud Platform
Content Delivery Network (CDN) – Akamai
Data Ingestion Tool – Flume
Massively Parallel Processing (MPP) DB – Teradata
Monitoring and Analyzing Platform – Splunk
NoSQL Databases – Cassandra, MongoDB
Performance Analytics – SOASTA
Search Analytics – Elasticsearch, Google BigQuery
Web Analytics – SiteCatalyst