Big data that serves real-time decisions and consumer services will still live in the domain of real-time big data. However, there are other intermediate analyses and decision-making steps for consumer interactions that sit between the raw data streams and the final single-purpose graphs. Machine learning methods are in the middle, though they are typed at the bottom of the graph.
If 'data' is a catch-all, then whatever makes sense of data is a catch-all for engaging the brain. If not 'decision', what is it? If 'decision' does not actively engage in the generation of 'data', the inevitable duality is already created. If there is going to be a paucity of scientists in the data area, think about the paucity in the decision area. However, this duality is unnecessary.
Here Dr. DJ Patil, in his talk "Keynote - Data Jujitsu w/Dr. DJ Patil", nicely connects data scientists and decision scientists.
Look at time mark 30:31, where he talks about who is a great data scientist. Look at the next slide, where he talks about "What are the key skills?". The following slide is about "Where do you look for them?". Given the way the logic flows, my attempt to differentiate between big data and power data is probably unnecessary. For the first time I see why anyone, including me, a traditionally trained statistician/econometrician, would like to adopt the term 'Data Scientist': he nicely mixes the data scientist and decision scientist functions. That works as long as the person traditionally described as a decision scientist knows how to speed up efficient decision making with voluminous data that comes in fast and furious, and handles it from the point of view of developing consumer information products. But for traditionally trained statisticians and computational scientists to call themselves modern 'data scientists', they need to know how to develop information products faster in the marketplace, using fast, furious, and voluminous real-time data.
Remember, not all problems in the world are of the type of opportunity that flows through LinkedIn, eBay, ... and other massive data companies.
I will be writing about information product development in a different article. Continuing with the core tool sets and how they are used: machine learning, broadly understood, and its Mahout implementation should be in the middle of this continuum.
I think the above distinction is important for meeting what McKinsey calls the demand for 150K analysts and managers of BIG data.
The proof of the above observations is in the course content of EMC's Data Sciences Associate certificate. Even though it is called a Data Sciences Certificate, it requires a very good mix of statistical, machine learning, and computer science courses. Based on the presentations I see, the general industry perspective seems to have been the computer science interpretation of BIG data. In that sense the certification contents are well balanced, so that decision scientists and data scientists work together to get the most from BIG data.
I think it is important to have the right mix of data sciences tools and decision sciences tools, along with the right mix of the respective scientists, to leverage all the open opportunities with the fewest inefficiencies in the labor market.
The rest of the material is from EMC (courtesy of EMC). There is one thing that is missing, which is a course on optimization.
See the comments below.
"EMC offers a Data Sciences Associate Certificate with the following courses.
The following background preparation materials are recommended. The quoted material is taken entirely from the EMC site.
"A strong quantitative background with a solid understanding of basic statistics, as would be found in a statistics 101 level course. For refresher materials, please see
or read through Edward Tufte’s text:
Experience with a scripting language, such as Java, Perl, or Python (or R). Many of the lab examples taught in the course use R (actually RStudio), which is an open source statistical tool and programming language that can be downloaded from:
There are many R tutorial packages, such as:
And this web site has R news as well as R tutorial materials:
Experience with SQL (some course examples use PSQL). To become familiar with these skills or to review this area, there are many online tutorials, such as:
Consider the above as a list of specific prerequisite (or refresher) training and reading to be completed prior to enrolling for or attending this course. Having this requisite background will help ensure a positive experience in the class, and enable students to build on their expertise to learn many of the more advanced tools and analytical methods taught in the course."
The following modules and lessons included in this course are designed to support the course objectives:
Introduction and Course Agenda
- Introduction to Big Data Analytics
▬ State of the Practice in Analytics
▬ The Data Scientist
▬ Big Data Analytics in Industry Verticals
- Data Analytics Lifecycle
▬ Data Preparation
▬ Model Planning
▬ Model Building
▬ Communicating Results
- Review of Basic Data Analytic Methods Using R
▬ Analyzing and Exploring the Data
▬ Statistics for Model Building and Evaluation
- Advanced Analytics – Theory And Methods
▬ Association Rules
▬ Linear Regression
▬ Logistic Regression
▬ Naïve Bayesian Classifier
▬ Decision Trees
▬ Time Series Analysis
▬ Text Analysis
- Advanced Analytics - Technologies and Tools
▬ The Hadoop Ecosystem
▬ In-database Analytics – SQL Essentials
▬ Advanced SQL and MADlib for In-database Analytics
- The Endgame, or Putting it All Together
▬ Creating the Final Deliverables
▬ Data Visualization Techniques
▬ Final Lab Exercise on Big Data Analytics"
"...concepts taught in the course in the context of the Data Analytics Lifecycle. The course prepares the student for the Proven™ Professional Data Scientist Associate (EMCDSA) certification exam, and establishes a baseline of Data Science skills that can be enhanced with additional training and further real-world experience."