In this latest episode of the Drill to Detail Podcast, Mark Rittman is joined by Gartner analyst and ex-Oracle Database Cloud Service PM Rick Greenwald to talk about IT’s continuing (and essential!) role in corporate BI&DW deployments and the debate around Mode1 vs. Mode2 Analytics, how we got here, and the future of data warehousing database platforms as we move into the cloud.
Mark Rittman is joined by Alex Olivier from Qubit on Episode 16 of Drill to Detail to talk about their platform journey from on-premise Hadoop to petabytes of data running in Google Cloud Platform, using Google Cloud Dataflow (aka Apache Beam), Google PubSub and Google BigQuery along with machine learning and analytics to deliver personalisation at-scale for digital retailers around the world.
I started the Drill to Detail Podcast series back in October this year with an inaugural episode featuring long-time friend and colleague Stewart Bryson talking about changes in the BI industry, and since then we’ve gone on to publish one episode a week featuring guests such as Oracle’s Paul Sonderegger on data capital, Cloudera’s Mike Percy on Apache Kudu followed shortly after by Mark Grover on Apache Spark, Dan McClary from Google talking about SQL-on-Hadoop and Neeraja Rentachintala from MapR telling us about Apache Drill and Apache Arrow, Tanel Poder from Gluent on the New World of Hybrid Data …
… along with many other guests including Jen Underwood, Kent Graziano, Pat Patterson from StreamSets and old friends and colleagues Andrew Bond, Graham Spicer and most recently for the Christmas and New Year special, Robin Moffatt.
In fact I’d only ever planned on publishing new episodes of Drill to Detail once every two weeks, along the lines of the two podcasts that inspired it (the Apple-focused The Talk Show by John Gruber, and Marco Arment, Casey Liss and John Siracusa’s Accidental Tech Podcast), but what with a number of episodes recorded over the summer waiting for the October launch and so many great guests coming on over the next few months, we ended up publishing new episodes every week.
So at the end of 2016 and with fourteen episodes published on this website and on the iTunes directory I’d like to take this opportunity to thank all the guests that came on the show along with friends in the industry such as Confluent’s Gwen Shapira who helped get the word out and make introductions, and of course most importantly I’d like to thank everyone who’s downloaded episodes of the show, mentioned it on Twitter and other social networks and increasingly, subscribed to the show on the iTunes store to the point where we’re typically hitting a thousand or more subscribers each week based on Squarespace’s estimate of overall RSS subscriber numbers including those coming in from iTunes and other feed aggregators.
And if you’re wondering which show had the highest audience numbers, it was November’s Episode 7 with Cloudera’s Mark Grover on Apache Spark and Hadoop Application Architectures, closely followed by October’s episode with Oracle’s Big Data Strategist Paul Sonderegger on data capital, both of which were great examples of what ended up being the recurring theme and area of discussion with every one of the guests and shows we recorded … the business, strategy, rationale and opportunities for competitive advantage coming out of innovations in the big data and analytics industry.
And now we’re going into 2017 and the second year of Drill to Detail, we’re going to double-down on this area of focus by updating the Drill to Detail website with a new look and launching the new Drill to Detail Blog to accompany the podcast series, each week posting a long-form blog post looking at the business and strategy behind what’s happening in the big data, analytics and cloud-based data processing industry.
We’ll still be continuing with the podcast series exactly as they are now with guests including Elastic’s Mark Walkom and Cindi Howson from Gartner due on the show in January, but these longer-form blog posts give us a chance to explore and explain in a more structured way the topics and questions raised by what’s been discussed on the podcast, analyzing and exploring the implications from trends and directions coming out of the industry.
Finally, going back to my original inspiration for the podcast that started all of this, a big part of the idea to focus on this particular theme came from what’s now become my new favourite blog and podcast series, Ben Thompson’s Stratechery website and the Exponent podcast that he co-authors with James Allworth, and if I manage to get even some way towards the insights and understanding he brings to the wider IT landscape and apply that to the part of the industry we work in … well that’ll be my evenings, commute time and weekend time well spent this coming year.
Fast-forward to today and it’s increasingly common to see what’s now Google Apps deployed across large corporations, universities and even government departments as they switch to buying email, calendaring and file storage as services rather than owning and managing the infrastructure themselves, and users benefit by having a more reliable service that always has enough capacity to meet their needs.
Contrast this with the state of Hadoop and on-premise big data systems today, where it’s not unusual to wait months for corporate IT to add a new cluster or expand capacity on existing ones as they in turn wait for procurement to negotiate a deal and then pass the work to an outsourced managed service provider to finally have it provisioned in the data centre – if there’s still some CapEx budget left, otherwise try again next year. Hadoop system administration is still a bit of a black art and few of us have the skills and experience to manage and keep secure clusters of thousands or even hundreds of server nodes … but that’s exactly what the likes of Google, Microsoft, Amazon Web Services and Oracle do for a living, and as I blogged about a few months ago they’re now selling object-level storage, Spark and streaming ingestion as services you connect to and consume, billed monthly based on numbers of users or some other metric, so you never have to worry about upgrades, cluster administration or replacing a faulty node.
And in the same way I worked out that my time was better spent learning Oracle development rather than running a mail server, the Hadoop cluster I’ve been tinkering with in the garage is likely to go the same way as I port my home development work over to Google’s cloud platform and think of my big data platform as elastic compute and storage services that always work and scale-up when needed … and as big data and data warehousing platforms converge as they transform into managed cloud services, big data doesn’t necessarily have to be Hadoop, HDFS and unstructured data stores.
Google famously invented the core Google File System and MapReduce technologies that Doug Cutting and Mike Cafarella then reimplemented as the open source Apache Hadoop project, but Google then went on to create Dremel, a distributed ad-hoc query system better suited to data warehouse-type workloads than batch-orientated MapReduce, which ran over tens of thousands of servers in Google’s data centres and stored its data in column-oriented format (and in turn inspired another open-source project, Apache Drill).
Google have now made a public version of Dremel, Google BigQuery, available as a service within Google Cloud Platform and conceptually it’s similar to the Apache Kudu and Apache Impala projects that Mike Percy and I discussed in Episode 3 of the Drill to Detail podcast; specialised storage optimised for analytic workloads with data stored columnar and organised as tables and columns, together with an SQL query engine designed for fast, ad-hoc analytics through BI tools that support BigQuery’s REST API or more recently, a standard JDBC interface and ANSI SQL.
As I showed in my tweet about BigQuery, I’ve been using it, and its ability to use Google Sheets as external table sources, to land and query my home IoT and wearables data, gradually building in real-time ingestion and loading using Google PubSub and Google Cloud Dataflow to receive and then transform incoming IoT events in a pipeline that feeds data into BigQuery as streaming row inserts.
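To make the "receive and then transform" step a little more concrete, here's a minimal sketch in plain Python of the kind of per-event transform such a pipeline might apply before a row is streamed into a table. It deliberately doesn't use the Dataflow or BigQuery client libraries, and the event field names (`device_id`, `ts`, `metric`, `value`) are made up for illustration rather than taken from my actual setup.

```python
import json
from datetime import datetime, timezone


def event_to_row(message_bytes):
    """Parse one JSON-encoded IoT event and shape it as a flat row dict.

    The field names used here are hypothetical; a real device would
    define its own payload format.
    """
    event = json.loads(message_bytes.decode("utf-8"))
    # Assume the device reports event time as epoch seconds in "ts"
    event_time = datetime.fromtimestamp(event["ts"], tz=timezone.utc)
    return {
        "device_id": event["device_id"],
        "event_time": event_time.isoformat(),
        "metric": event["metric"],
        # Sensors often send numbers as strings, so normalise to float
        "value": float(event["value"]),
    }


# A sample message as it might arrive from a messaging service
sample = json.dumps(
    {"device_id": "thermostat-1", "ts": 0, "metric": "temp_c", "value": "21.5"}
).encode("utf-8")

row = event_to_row(sample)
```

In a real pipeline a function like this would sit between the message subscription and the table write, with the pipeline framework taking care of batching, retries and scaling.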
What’s interesting is that Google Cloud Platform and Google BigQuery are the technologies that ended up powering the type of Customer 360 applications I talked about around this time last year.
While most Customer 360 and digital marketing initiatives start on in-house Hadoop clusters or cloud-based services such as Amazon Redshift, the petabyte-scale of event-based customer interaction records means that it’s easier, cheaper and far less work to hand this sort of workload off to Google and have developers concentrate on delivering new experiences and offers to customers rather than plugging in another truckload of servers into the cluster to try and keep up with demand. But the story behind that is one for another day … and it’s good.
Every guest on the Drill to Detail podcast has been a pleasure to interview, from Stewart Bryson on the inaugural episode through Dan McClary, Mike Percy, Kent Graziano, Andrew Bond and later this week Cloudera and Apache Spark’s Mark Grover, but one recording I was particularly looking forward to was last week’s guest Paul Sonderegger, ex-Endeca and currently Oracle’s Big Data Strategist talking to their customers about a concept he’s termed “Data Capital” … and what this new form of capital means for competitive strategy and company valuations.
If you (like me, secretly) thought Oracle’s previous “Digitisation and Datification” slide deck was a bit … handwavy and corporate marketing b*llocks, well this is where it all comes together and makes sense. If you work in consulting or are looking for some sort of economic rationale and underpinning for all this investment in big data technology, and sometimes wonder why Netflix and Google are valued higher than CBS and your local newspaper, here’s your answer. A great episode exploring the business value of big data, not just the technical benefits.
And coming soon on a future episode … MapR. Watch this space.
Episode 3 of the Drill to Detail podcast is now live and available for download on iTunes, and this week I’m very pleased to be joined by Cloudera’s Mike Percy, software engineer and lead evangelist within Cloudera for Apache Kudu, the new Cloudera-sponsored column-store data layer that takes the best features from HBase and Parquet and creates a storage layer specifically optimized for analytics.
The problem that Kudu solves is something that becomes apparent to most Hadoop developers creating analytic applications that need to support BI-type query workloads against data arriving in real-time from streaming sources; whilst column-orientated file formats like Apache Parquet are great for supporting BI-style workloads they’re not that good for handling streaming data, and while HBase adds support for single-row inserts, updates and deletes to Hive, queries that require aggregation up from cell level don’t perform all that well, such that most projects I’ve worked on copy data from HBase into a format such as Parquet before presenting that data out to users for query.
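The trade-off is easier to see with a toy example. The sketch below holds the same sensor readings in a row-oriented layout and a column-oriented one using plain Python lists and dicts; it’s purely an illustration of the two access patterns, not a representation of how HBase, Parquet or Kudu actually store data, and the sensor names and values are invented.

```python
# Row-oriented layout: one complete record per event
rows = [
    {"sensor": "a", "ts": 1, "temp": 20.0},
    {"sensor": "b", "ts": 1, "temp": 22.0},
    {"sensor": "a", "ts": 2, "temp": 21.0},
]

# Adding a newly-arrived event is a single append to one structure
rows.append({"sensor": "b", "ts": 2, "temp": 23.0})

# Aggregating one field means visiting every whole record
row_avg = sum(r["temp"] for r in rows) / len(rows)

# Column-oriented layout: one list per field, values stored together
columns = {key: [r[key] for r in rows] for key in rows[0]}

# Aggregating one field only reads that field's list
col_avg = sum(columns["temp"]) / len(columns["temp"])


def append_event(cols, event):
    """Adding one event to the columnar layout touches every column."""
    for key, value in event.items():
        cols[key].append(value)


append_event(columns, {"sensor": "a", "ts": 3, "temp": 19.0})
```

Both layouts give the same average, but the row layout makes single-event writes simple while the columnar one makes the aggregate cheap to compute, which is the gap a storage layer aimed at both workloads has to close.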
Apache Kudu, as Mike Percy explains in this video of one of his presentations on Kudu back in 2015, takes the “fast data” part of HBase and adds the “fast query” capability you get with column-store formats like Parquet, and for Hadoop platforms that need to support this type of workload the aim is that it replaces HDFS as a more optimized form of storage for such datasets.
In practice you tend to use Kudu as the storage format for Cloudera Impala queries, with Impala then gaining INSERT, UPDATE and DELETE capabilities, or you can do what I’ve been doing recently and use a tool such as StreamSets to load data into Kudu as just another destination type, as I’m doing in the screenshot below where home IoT sensor data lands in real-time into Kudu via StreamSets, and can be queried immediately using Impala SQL and a tool such as Hue or Oracle Data Visualization Desktop.
So thanks to Mike Percy and Cloudera for coming on this latest edition of the show, and you can read more about Kudu and the milestone 1.0 release on the Cloudera Vision blog.
This is a good one.
Most of you will know Dan McClary as the product manager at Oracle for Big Data SQL, and more recently he’s moved to Google to work on their storage and big data projects. If you’ve met Dan or heard him speak you’ll know he’s not only super-smart and very knowledgeable about Hadoop, but he’s great to get into a conversation with … which is why I was particularly pleased to have him on as the special guest on Episode 2 of my new podcast series, Drill to Detail.
In this new episode Dan and I discuss the state of the SQL-on-Hadoop market and where he’s seeing the innovation; how the mega-vendors are contributing to, extending and competing with the Hadoop ecosystem; and what he’s seeing coming out of the likes of Google, Yahoo and other Hadoop innovators that may well make its way into the next Apache Hadoop project.
You can download the podcast recording from this website, and it’s just about to go live on iTunes where you can subscribe and automatically receive future episodes — and you certainly won’t want to miss the next one, believe me.
Whilst I’m working out what my next adventure will be after leaving my role as CTO at Rittman Mead, the Oracle BI, DW + Data Integration consulting company I co-founded back in 2007, I’m pleased to launch the first episode of my new podcast series, “Drill to Detail”, available to subscribe and download from Apple iTunes.
Inspired by the Apple-focused podcast “The Talk Show” by John Gruber and with a similar informal but candid style, each episode I’m joined by one of the “movers and shakers” in the BI, DW and Big Data industry to talk about three topics relevant to that particular guest. In the inaugural episode I’m joined by none other than my old friend and colleague Stewart Bryson, ex-Rittman Mead and currently CTO of his own company Red Pill Analytics, where we talk about the recent Gartner BI & Analytics Magic Quadrant 2016 and what it means for vendors such as Oracle; what the rise of Bimodal IT and Mode 2 analytics means for agile BI methodologies; and we return to the Oracle BI, DW and Big Data Reference Architecture both of us contributed to back in 2014 and ask ourselves what worked, what’s changed and how relevant it is today.
Show notes for Episode 1 are also available on this website, and check back over the coming weeks for further episodes featuring the likes of Dan McClary (ex Oracle Big Data SQL PM, now at Google), Mike Percy (Cloudera Software Engineer working on Apache Kudu, a new distributed storage technology for analytic workloads sponsored by Cloudera), Kent Graziano (Snowflake DB), Paul Sonderegger (ex-Endeca CTO, now Oracle’s Big Data Evangelist) and Cameron Lackpour (Oracle ACE Director and world expert on Essbase). Subscribing is free through the iTunes Podcast Directory, and episodes will run once a week for a while, dropping back to once every two weeks as I get through the episodes I recorded over the summer.