Express Predictive Modeling Services





Saturday, September 08, 2012

Building a Predictive Text Mining Model - Part 1

http://www.nytimes.com/interactive/2012/09/06/us/politics/convention-word-counts.html#! 

We have a way to approach the million-dollar question: what is the Reagan statement for Romney that will help him win the 2012 election, which is less than two months away? Seriously, it is not 'are you better off today compared to 4 years back'. So, what is it? Some glimpses are provided in the following analysis. See the statistical latent intelligence column in the second picture. It is just the beginning.

This note, and the extensions of the analysis in it, builds on the preliminary analysis that nytimes.com presents as a visual treat, along with some key points, in the graphic linked above.



Text analysis is often associated with frequency analysis of words. While that is a great start for profiling, it does not bring out the density (strength) of communication, nor is it a powerful way of predicting the nuances of communication, which are critical in political communication especially.

Think of what happened when Obama said 'you didn't build that' and how the RNC used it to its advantage, and think of how Biden brilliantly turned (or at least tried to turn) the RNC's negative campaign into 'not to bet against America'.
The NY Times has published the wonderful visual linked above, which separates DNC and RNC usage by the frequency of the key words that distinguish their campaigning styles. The idea is that the most frequently used words are the most important for each party's targeted population: for the DNC, that is women, the younger population, people who use Medicare, the middle class, and military/veterans, while for the RNC, the segment is richer Americans who openly express their interest in freedom, middle America (geographically speaking), and the business-oriented.
Messaging Strength to the targeted segment:



The density of words (the focused communication) is the ratio of the most significant words correlated with the segments. It is an indicator of communication strength to the targeted segment, and it is more favorable in the DNC communication (49%) than in the RNC communication (45%).
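To make the density idea concrete, here is a minimal sketch in Python. It assumes density means the share of all spoken words that fall in a list of segment-related key words; the post does not spell out the exact formula, so this is one plausible reading rather than the calculation behind the 49% and 45% figures. The function name `message_density`, the sample sentence, and the key-word set are all made up for illustration.

```python
import re

def message_density(text, segment_words):
    """Share of tokens in `text` that belong to `segment_words`.

    Assumption: density = (segment-related word mentions) / (all word mentions).
    """
    # Lowercase and split into simple word tokens (letters and apostrophes).
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in segment_words)
    return hits / len(tokens)

# Hypothetical sentence and key-word list, not taken from any convention speech.
speech = "Jobs and small business owners build the American dream; jobs matter."
segment = {"jobs", "business", "dream"}

density = message_density(speech, segment)
print(f"density = {density:.0%}")
```

Run over the full convention transcripts with a key-word list per targeted segment, a calculation of this kind would give one density figure per party that can be compared directly.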
The RNC has to do a lot more to connect unemployment, jobs, businesses, the American dream, God, and debt. I will be ranking the depth of segment-specific communication by the RNC and the DNC.
In the next part, I will write more on how to go about predictive text mining, where the overall classification of a message's content and its sentiment have to be combined.
As with variable selection, we select words on the basis of the ratio of RNC to DNC word frequency.
Note that the RNC/DNC ratio (without loss of generality as to which count goes in the denominator) is set to a high artificial value of 100 when the denominator is zero, so that it does not become infinite and exert undue influence.
These ratios, which are given in the fourth column, are used to identify the most important words. We also create the statements representing latent intelligence, which are provided in the last column.
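The ratio-based word selection can be sketched as follows. This is a minimal illustration, not the exact procedure behind the table: the word counts are invented, the names `frequency_ratio` and `rank_words` are hypothetical, and treating a word that neither party used as a neutral 1.0 is my own assumption. The only rule taken from the text is that the ratio is set to 100 when it would otherwise be infinite.

```python
def frequency_ratio(rnc_count, dnc_count, artificial_value=100.0):
    """RNC/DNC frequency ratio for one word.

    When the DNC count is zero the ratio would be infinite, so the
    artificial value (100) is used instead.
    """
    if dnc_count == 0:
        # Assumption: a word used by neither party is treated as neutral.
        return artificial_value if rnc_count > 0 else 1.0
    return rnc_count / dnc_count

def rank_words(counts, artificial_value=100.0):
    """Sort words from most RNC-leaning (high ratio) to most DNC-leaning (low ratio)."""
    ratios = {
        word: frequency_ratio(rnc, dnc, artificial_value)
        for word, (rnc, dnc) in counts.items()
    }
    return sorted(ratios.items(), key=lambda item: item[1], reverse=True)

# Hypothetical (RNC count, DNC count) pairs, not the NYT figures.
counts = {
    "business": (40, 10),
    "women": (5, 50),
    "freedom": (12, 0),
}

for word, ratio in rank_words(counts):
    print(f"{word:10s} {ratio:6.1f}")
```

Words at the top of the list lean toward the RNC and words at the bottom lean toward the DNC, so both ends of the ranking are candidates for the most important words.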
Now, with additional analysis, we have a way to figure out possible answers to the million-dollar question: what is the Reagan statement for Romney that will help him win the 2012 election, which is less than two months away?
