As part of an assignment for the University of Washington’s (UW) cloud class, I played with BigQuery, Google’s BigData offering, and I am writing this blog post to share what I think about it. Please note that the views are my own and do not represent those of the instructor or my fellow students at UW. Also, I am not a BigData “expert”; think of me as a student trying to get my head around the various offerings out there. So if you feel otherwise about what I have written, just let me know in the comments section. Anyhow, read along to know what I think of BigQuery:
First up, what is BigQuery?
It’s a platform to analyze your data (lots of it) by running SQL-like queries. And it’s really SQL-like, so if you are from the SQL world like me, you will not face any issues getting up and running in seconds by referring to the nicely written documentation.
Another point to consider is that even though it’s SQL-like, you’ll be able to analyze a considerable number of rows in a few seconds. Let me give you an example: I played with a sample (called gsod) which had 115M rows, and in my experiments I was able to get answers to simple computations like max and avg (mean) in less than a couple of seconds. Slightly more complex queries with where, joins and group by took around 5-6 seconds. Your results may vary depending on the type of query you run, but the BOTTOM LINE is that it is FAST. That’s good news!
BigQuery is Fast!
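To give a feel for the two kinds of queries I mean (a plain aggregate versus one with where and group by), here is a minimal sketch. It does NOT talk to BigQuery: it uses Python’s built-in sqlite3 module as a local stand-in, and the table name, column names and the five rows are all made up for illustration, so treat it as a picture of the query shapes and not of the real gsod schema or of BigQuery’s speed.

```python
import sqlite3

# Made-up stand-in for the gsod sample. The columns (station, year, mean_temp)
# are assumptions for illustration, not the real gsod schema.
rows = [
    ("A", 2009, 50.0),
    ("A", 2010, 54.0),
    ("B", 2009, 70.0),
    ("B", 2010, 74.0),
    ("B", 2010, 78.0),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gsod (station TEXT, year INTEGER, mean_temp REAL)")
conn.executemany("INSERT INTO gsod VALUES (?, ?, ?)", rows)

# "Simple" query: aggregates over the whole table.
simple = conn.execute(
    "SELECT MAX(mean_temp), AVG(mean_temp) FROM gsod"
).fetchone()

# "Slightly more complex" query: filter with WHERE, then GROUP BY.
grouped = conn.execute(
    "SELECT station, COUNT(*), AVG(mean_temp) "
    "FROM gsod WHERE year = 2010 "
    "GROUP BY station ORDER BY station"
).fetchall()

print(simple)
print(grouped)
```

The point is only that the SQL you would write looks like the SQL you already know; the timings I quoted above came from the BigQuery sample, not from anything like this toy table.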
But what bothers me is this: how am I supposed to “UPLOAD” lots of data to the Google cloud? It takes time, right? I guess that’s an issue with every cloud-based BigData offering. But here’s what I am thinking: if your data is already in the cloud, e.g. Amazon’s or Microsoft’s, does it not make sense to run analytics on Amazon’s or Microsoft’s cloud instead of porting your data to Google’s?
[Sidenote: I like it that Hadoop on Azure allows Amazon S3 as a data source. Nice move!]
My concern: the time spent uploading a truckload of data to Google’s cloud just so that we can use it with BigQuery.
And even if you have your data in the GAE datastore, you’ll have to upload your data to BigQuery separately. Source
Zooming out for a moment, I feel the goal of BigQuery was to offer an easy-to-use BigData platform, and I feel that’s what they have delivered:
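To put a rough number on that concern, here is a back-of-the-envelope sketch. The function name, the data sizes and the 100 Mbps link speed are all hypothetical values I picked for illustration; it assumes the link sustains that speed for the whole transfer and ignores protocol overhead, so real uploads would likely be slower.

```python
def upload_hours(size_gb: float, mbps: float) -> float:
    """Hours needed to move size_gb gigabytes over a link sustaining mbps megabits/s."""
    megabits = size_gb * 1000 * 8  # decimal units: 1 GB = 1000 MB = 8000 megabits
    seconds = megabits / mbps
    return seconds / 3600


# Hypothetical data sizes on a hypothetical 100 Mbps uplink.
for size_gb in (10, 100, 1000):
    print(f"{size_gb} GB at 100 Mbps: {upload_hours(size_gb, 100):.1f} hours")
```

Under these assumptions a terabyte takes most of a day before you can run your first query, which is why data that already lives in another cloud is a real consideration.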
An easy-to-use + easy-to-setup “Hadoop+Hive”-like offering.
But this “easiness” means that it is NOT as advanced as a Hadoop installation (or Hadoop on Azure or Amazon’s Elastic MapReduce). But again, it’s easier and faster to get started with BigQuery. I guess it just depends on what you are trying to achieve, and based on that you’ll have to figure out which is the right tool for your scenario. No generic answer here, sorry!
And BTW, BigQuery supports only CSV. Talk about Variety (one of the V’s of BigData!). Let’s not get into that. I just wanted to point that out because if you’re looking to analyze data sets that cannot be converted to CSV for running SQL-like queries on top of them, then BigQuery is not for you.
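For data that can be converted, the conversion is often simple. Here is a minimal sketch, assuming your records are JSON with at most nested dictionaries: it flattens each record and writes plain CSV using only Python’s standard library. The sample records, the flatten helper and the underscore naming of nested fields are my own illustrative choices, not anything BigQuery prescribes.

```python
import csv
import io
import json

# Made-up sample records; the second one is missing a nested field on purpose.
records_json = (
    '[{"station": "A", "year": 2010, "temp": {"mean": 54.0, "max": 61.5}},'
    ' {"station": "B", "year": 2010, "temp": {"mean": 76.0}}]'
)


def flatten(record, prefix=""):
    """Turn nested dicts into one flat dict, joining key names with underscores."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=name + "_"))
        else:
            flat[name] = value
    return flat


records = [flatten(r) for r in json.loads(records_json)]
# Use the union of all keys as the header so every row has the same columns.
columns = sorted({key for r in records for key in r})

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=columns, restval="", lineterminator="\n")
writer.writeheader()
writer.writerows(records)  # missing fields are written as empty cells
csv_text = buffer.getvalue()
print(csv_text)
```

If your data has no sensible flat form like this (free text, images, deeply irregular structures), that is the case where the CSV-only restriction really bites.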
Try out BigQuery. It’s easy to get started. It’s powerful if SQL-like queries are all you need to analyze your data. If you are a BigData enthusiast/expert/student, it’ll be a nice exercise to mentally compare other BigData offerings with BigQuery.
If you decide to try BigQuery or have already tried it out, I’d love to hear what you think of it. Please leave a comment!
Republished from Paras Doshi's Blog.
Read the original version here.