Express Predictive Modeling Services





Monday, July 23, 2012

Data Scientists Vs. Decision Scientists

What is power data? It is the MapReduce-completed version of captured data streams (from the point of view of complex business hypotheses, BIG data stops here), with the capacity to handle all kinds of management hypotheses for managing and growing an organization in the new world of BIG data.  The end point of power data, in the extreme, is the single graph that captures what actions and follow-ups are needed to keep an organization within its KPI/KLI goals.  (By the way, KLIs are more important than KPIs; see: http://predictive-models.blogspot.com/2013/07/strategic-metric-kpis-and-klis-which.html ). Some people call power data smart data.

Real-time decisions and consumer services will still happen in the domain of real-time big data. However, there are other intermediate analysis and decision-making steps for consumer interactions, between the raw data streams and the final single-purpose graphs.  Machine learning methods belong in the middle, though they are typed at the bottom of the graph.
Why are the arrows shown as a continuum?

What is achieved today under Hadoop and MapReduce?  What happens afterwards?


If 'data' is a catch-all, then whatever makes sense of data is a catch-all for engaging the brain. If that is not 'decision', what is it?  If 'decision' does not actively engage in the generation of 'data', a duality has already been created. If there is going to be a paucity of scientists in the data area, think about the paucity in the decision area.  However, this duality is unnecessary.


While students will choose what they want, leading companies in the computing industry such as IBM, SAS, Oracle, and SAP have started building integrated platforms to address these trends.  I wanted one exhibit that captures the central duality of what happens in the marketplace vs. what the opportunities are.


Here Dr. DJ Patil, in the talk "Keynote - Data Jujitsu w/Dr. DJ Patil", nicely connects data scientists and decision scientists.






Look at the 30:31 mark, where he talks about who is a great data scientist.  Look at the next slide, where he talks about "What are the key skills?".  The following slide is about "Where do you look for them?".  Given the way the logic flows, my attempt to differentiate between big data and power data is probably unnecessary. For the first time I see why anyone, including me, a traditionally trained statistician/econometrician, would like to adopt the term 'Data Scientist': he nicely mixes the data scientist and decision scientist functions, as long as the traditional decision scientist knows how to speed up efficient decision making with voluminous data that is coming fast and furious, handled from the point of view of consumer information product development.  But for traditionally trained statisticians and computational scientists to call themselves modern 'data scientists', they need to know how to develop information products faster in the marketplace, using the fast, furious, and voluminous real-time data.

Remember, not all problems in the world are of the type of opportunity that flows through LinkedIn, eBay, ... and other massive data companies.



I will be writing about information product development in a different article.  Continuing with the core tool sets and how they are used: machine learning, broadly understood, and its Mahout implementation should be in the middle of this continuum.

The second exhibit I want to bring out is: (I shall modify the top legend info to read "Effectiveness band for broader collection of statistical, computational, and optimization solutions").


What this points out is the importance of analysts in the organization.  The term 'decision scientist' is not that popular today but will be in a few years.  This does not mean it will replace 'data scientist'.

The value from good-quality, sufficient data is still a function of the following: tools, systems and storage, management support/hypotheses, and the quality of analysts.  The value derived can be of a different order of magnitude if you have the right qualified analysts. I have shown them as low, moderate, and high levels of qualified decision scientists. They lift your organization to the best possible scenario.

I think the above distinction is important to fulfill the promises of what McKinsey calls the demand for 150K analysts and managers of BIG data.

The proof of the above observations is in the course content of EMC's Data Sciences Associate certificate. Even though it is called a Data Sciences certificate, it requires a very good mix of statistics, machine learning, and computer science courses.  Based on the presentations I see, the general industry perspective has been the computer science version of BIG data interpretation.  In that sense the certification contents are well balanced, so that decision scientists and data scientists work together to get the most from BIG data.


I think it is important to have the right mix of data science tools and decision science tools, along with the right mix of the respective scientists, to leverage all the open opportunities with the least inefficiency in the labor market.



The rest of the material is from EMC (courtesy of EMC). There is one thing that is missing, which is a course on optimization.  

See the comments below.




"EMC offers a Data Sciences Associate Certificate with the following courses.

The following background preparation materials are recommended.  The quoted material is entirely from the EMC site.

"A strong quantitative background with a solid understanding of basic statistics, as would be found in a statistics 101 level course. For refresher materials, please see

o http://www.macs.hw.ac.uk/modules/F21QM3/topic1.pdf

or read through Edward Tufte’s text:

o http://www.edwardtufte.com/tufte/dapp/

Experience with a scripting language, such as Java, Perl, or Python (or R). Many of the lab examples taught in the course use R (actually RStudio), which is an open source statistical tool and programming language that can be downloaded from:

o http://rstudio.org

There are many R tutorial packages, such as:

o http://math.illinoisstate.edu/dhkim/rstuff/rtutor.html

And this web site has R news as well as R tutorial materials:

o www.r-bloggers.com

Experience with SQL (some course examples use PSQL). To become familiar with these skills or to review this area, there are many online tutorials, such as:

o www.sql-tutorial.net/
or
o www.sqlzoo.net/

Consider the above as a list of specific prerequisite (or refresher) training and reading to be completed prior to enrolling for or attending this course. Having this requisite background will help ensure a positive experience in the class, and enable students to build on their expertise to learn many of the more advanced tools and analytical methods taught in the course."

"Course Outline
The following modules and lessons included in this course are designed to support the course objectives:
Introduction and Course Agenda
  • Introduction to Big Data Analytics
▬ Big Data Overview
▬ State of the Practice in Analytics
▬ The Data Scientist
▬ Big Data Analytics in Industry Verticals
  • Data Analytics Lifecycle
▬ Discovery
▬ Data Preparation
▬ Model Planning
▬ Model Building
▬ Communicating Results
▬ Operationalizing
  • Review of Basic Data Analytic Methods Using R
▬ Using R to Look at Data – Introduction to R
▬ Analyzing and Exploring the Data
▬ Statistics for Model Building and Evaluation
  •  Advanced Analytics – Theory And Methods
▬ K Means Clustering
▬ Association Rules
▬ Linear Regression
▬ Logistic Regression
▬ Naïve Bayesian Classifier
▬ Decision Trees
▬ Time Series Analysis
▬ Text Analysis
  • Advanced Analytics - Technologies and Tools
▬ Analytics for Unstructured Data - MapReduce and Hadoop
▬ The Hadoop Ecosystem
▬ In-database Analytics – SQL Essentials
▬ Advanced SQL and MADlib for In-database Analytics
  • The Endgame, or Putting it All Together
▬ Operationalizing an Analytics Project
▬ Creating the Final Deliverables
▬ Data Visualization Techniques
▬ Final Lab Exercise on Big Data Analytics"

"...concepts taught in the course in the context of the Data Analytics Lifecycle. The course prepares the student for the Proven™ Professional Data Scientist Associate (EMCDSA) certification exam, and establishes a baseline of Data Science skills that can be enhanced with additional training and further real-world experience."

4 comments:

  1. Decision scientists already exist, their work is known as Operations Research.

  2. True. 'Decision scientist' is not a new phrase, and perhaps it has already lost its brand value, having not delivered on expectations. However, 'decision scientists' can now apply more OR to more data opportunity areas. The distinction between power data and big data is also meant to find the opportunity areas that could get attention, for faster equilibration of labor market efficiency.

    If EMC had to include some key concepts and material from OR, which parts of OR would you recommend for the EMC course? Let us say we are looking for the top 5 problems every data scientist should know how to solve at a basic/intermediate level. This is great, the way we are shaping the question.

    I think a good blend of optimization/statistics/computation/behavioral sciences is central to applying oneself in practical problem solving. Thank you for bringing out this point.

  3. what are the Statistics for Model Building and Evaluation?

    1. If I understand your question well, the measures (statistics) for goodness of fit in model building and evaluation depend on what model you are building. Even within a specific model, each specific aspect of applicability is addressed by one specific measure among the many that are available.

      For example, for logistic regression, there are many measures that are available.

      See here:

      http://www.medicine.mcgill.ca/epidemiology/joseph/courses/epib-621/logfit.pdf
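
      To make the point about "many measures" concrete, here is a minimal Python sketch that computes a few commonly used fit statistics for a logistic regression from the observed 0/1 outcomes and the model's predicted probabilities. The function name logistic_fit_measures, the toy data, and the particular choice of measures (deviance, McFadden's pseudo R-squared, accuracy at a 0.5 cutoff, and AUC) are illustrative assumptions of this sketch, not taken from the linked notes or from the EMC course. It also assumes both outcome classes are present in the data.

```python
import math

def logistic_fit_measures(y, p, threshold=0.5):
    """Return a few goodness-of-fit measures for a fitted logistic model.

    y: observed 0/1 outcomes; p: the model's predicted probabilities of 1.
    Hypothetical helper for illustration; assumes both classes occur in y.
    """
    n = len(y)
    eps = 1e-12  # guard against log(0)
    clipped = [min(max(pi, eps), 1 - eps) for pi in p]

    # Log-likelihood of the fitted model.
    ll_model = sum(yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
                   for yi, pi in zip(y, clipped))

    # Log-likelihood of the intercept-only (null) model, which predicts the base rate.
    base = min(max(sum(y) / n, eps), 1 - eps)
    ll_null = sum(yi * math.log(base) + (1 - yi) * math.log(1 - base) for yi in y)

    # AUC: share of (positive, negative) pairs ranked correctly; ties count half.
    pos = [pi for yi, pi in zip(y, p) if yi == 1]
    neg = [pi for yi, pi in zip(y, p) if yi == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    auc = wins / (len(pos) * len(neg))

    # Classification accuracy at the chosen probability cutoff.
    accuracy = sum((pi >= threshold) == (yi == 1) for yi, pi in zip(y, p)) / n

    return {
        "deviance": -2 * ll_model,              # lower means a closer fit
        "mcfadden_r2": 1 - ll_model / ll_null,  # improvement over the null model
        "accuracy": accuracy,
        "auc": auc,
    }

# Toy data, made up for illustration only.
y = [0, 0, 1, 1]
p = [0.1, 0.4, 0.35, 0.8]
measures = logistic_fit_measures(y, p)
print(measures)
```

      Each number answers a different question: deviance and pseudo R-squared describe overall fit relative to a baseline, accuracy depends on the cutoff you choose, and AUC describes how well the model ranks cases regardless of cutoff. Which one matters depends on how the model will be used.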
