Wednesday, December 29, 2010

Social Media and statistical data

My interest in social media stems from the opportunity it offers as a new source of data for statistical analysis. Apart from this blog, I'm basically not involved in social media. I've sent three tweets in my life, two of which were probably never seen or read by anyone. My Facebook account consists of my name and nothing else. I have no friends or followers. I may set up a LinkedIn profile, but I haven't done so yet.

On the data angle, there are two types of analysis that I can see. Firstly, businesses that use social media want to monitor and measure the metrics associated with their social media sites: how many people visited my site, how long did they stay, where did they come from, and so on.

Secondly, the content of social media can itself be the source of a lot of interesting information to be analysed. That's why I like Tribalytic: http://tribalytic.com/

This website has an Australian focus (more useful to Australian researchers than using US data), and allows researchers to extract samples of Twitter messages (tweets).

Here are two links to the Tribalytic blog that explain how businesses can use Twitter data :-



The other thing I like about Tribalytic is that one of the two co-founders took the time to respond to an e-mail I sent to them.

(Although I doubt I'd get such a response at the moment (Dec 2010) – one of their other products, Trunk.ly, is riding a wave since rumours of the impending closure of Delicious started doing the rounds. Trunk.ly automagically collects links you share on Facebook, Twitter and Delicious, ... and makes them searchable.)

The ways of using Tribalytic tweet data include :-

1.       Tracking words used in tweets by day and time of day. The example of "daypart" analysis Tribalytic uses is tracking the usage of "ice cream" and "coffee machine".
2.       Keyword analysis: looking for word patterns in samples of up to 500 tweets that share a common word or words. What other words are associated, say, with "coffee machine"?

In my e-mail to Tribalytic, I queried whether their keyword analysis was statistically valid. Tribalytic takes an evenly distributed sample of 500 tweets that share a common word or words, and extracts the top 52 keywords, being those words that appear in more than 2% of tweets. As Tribalytic acknowledges, if you are going to suggest a correlation, you need to be reasonably sure the words don't co-occur by chance. My comment in my e-mail was: "I also wasn't sure about your logic of using 2% as the threshold for including or excluding keywords. Your reasoning is that the same word appearing in 10 separate tweets is unlikely to have occurred by chance. You might be surprised how many such samples contain common words, and how common they are. I'm guessing that the twitter vocabulary would be fairly limited, and given an average tweet size of ????, the odds of 10 or so tweets having common words might be quite large." The response was: "What's perhaps NOT evident is that this has already undergone processing with a Natural Language Parser which will filter out the most common words already."
It should be a relatively simple task to sample tweet data and get a handle on how many words appear multiple times in randomly selected tweets. However, I'll need to increase my skills in handling text data before I can do that.
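
As a rough sketch of the sort of check I have in mind (in Python, and assuming the sampled tweets are sitting in a plain text file, one per line; "tweets.txt" is just a made-up file name):

# Count how many distinct words appear in more than 2% of a random sample
# of tweets. "tweets.txt" (one tweet per line) is a hypothetical input file.
import random
import re
from collections import Counter

with open("tweets.txt", encoding="utf-8") as f:
    tweets = [line.strip() for line in f if line.strip()]

sample = random.sample(tweets, min(500, len(tweets)))

# For each word, count the number of tweets it appears in (not total occurrences).
tweet_counts = Counter()
for tweet in sample:
    tweet_counts.update(set(re.findall(r"[a-z']+", tweet.lower())))

threshold = 0.02 * len(sample)    # the 2% rule: more than 10 tweets out of 500
common = [(w, n) for w, n in tweet_counts.most_common() if n > threshold]
print(f"{len(common)} words appear in more than 2% of {len(sample)} tweets")
for word, n in common[:20]:
    print(word, n)

Running something like this over samples of tweets that share no seed word would give a baseline for how often words co-occur purely by chance.
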
I also queried the post-hoc nature of using the keyword analysis to construct a "story", as in this example:
-          Nugget: broken, working, fixed, buy (bought)
-          Where: office
-          When: week, today, morning
-          Emotion: love, carefully
-          Noise: china, #pawpawty, bone – these come from the same kinds of tweets and can be safely tossed away.
By sifting through the natural noise that comes with the tweets, we can now see a common theme of people talking about the physical status of their coffee machine.
Tribalytic's response was: "With regards to post-hoc analysis, I agree this is a potential issue; interestingly though, our "real world" tests - popping it in front of clients and researching topics they know something about - show that it does a pretty good job of throwing up enough of the expected keywords to confirm value for them on past events that they do know about. Often when we are confronted with keywords from a search we are stuck as to why they are there (without drilling in), but clients say things like 'Oh yeah, bear shows with scouts because Bear Grylls was at the Jamboree in the NT in March'."

With the benefit of further thought and reflection, I'm not sure there is a problem with "post hoc" analysis here, nor indeed whether it matters whether 10 or more common words is statistically significant.
This is because:
-          Twitter message analysis, using tools such as Tribalytic, is a quick and inexpensive way of conducting exploratory or prototype research.
-          Exploratory research is conducted to gain insights, and grouping common words with a seed word (like "coffee machine") is a great way to brainstorm, if you will, ideas and concepts that may be associated with your word of interest.
-          A media campaign can create a common word of its own: a hashtag.
So, in conclusion, a great idea and a great opportunity to conduct statistical research in the future.




What is Bayesian Statistics?

I've now worked my way through a fair part of Statistics : A Bayesian Perspective by Berry, and thought it was time to step back from the mechanics of calculation and see what my understanding of Bayesian statistics actually was.

-          How does Bayesian statistics differ from Frequentist statistics?
-          What are the differences in practice, for a student?
-          Is it possible / reasonable to combine both schools?

Some comments, not necessarily in any order:

1.       Bayesian statistics specifically acknowledges that statistical research does not exist in a vacuum : the same data can lead different researchers to different conclusions. Bayesian statistics also makes explicit the learning process in research : each statistical study builds on what has gone before.

2.       Generalised Bayes Rule : separation of prior information and current data.

3.       Is the difference more one of philosophy, which doesn't have to have an impact at a practical level? Is it the case that Bayesian statistics is "just" more explicit about "how we come to know things about the world"?

4.       Bayesian statistics formalises the evidentiary value of observations: people can change their beliefs in a rational way when confronted with new information. In fact, Bayesian statistics provides a method to assign numerical values to changing beliefs. If the prior probability is based on prior statistical studies, then comparing prior and posterior probabilities shows (and measures) the effect of the data (a small numerical sketch of this appears after this list).

5.       Both Frequentist and Bayesian statistics calculate a confidence (or probability) interval: to this student they seem very similar.

6.       A Bayesian analysis with a flat prior seems very similar to the Frequentist model.

7.       Sample size matters in both models. But is the Bayesian model more valuable with small samples, where prior knowledge can formally inform the analysis?

8.       Bayesian statistics doesn't include the frequentist "null hypothesis significance testing" that is becoming so outmoded (and reasonably so – for a science that is all about information, the significance test provides so much less information than a confidence interval).

9.       Most of the explanations of Bayesian statistics use very artificial examples to compare and contrast it with Frequentist statistics. I'd like to see an example that compares the two approaches on a study with real-world data.

10.   In some ways, Bayesian analysis seems to have similarities with meta-analysis (http://en.wikipedia.org/wiki/Meta-analysis): "The first meta-analysis was performed by Karl Pearson in 1904, in an attempt to overcome the problem of reduced statistical power in studies with small sample sizes; analyzing the results from a group of studies can allow more accurate data analysis."

11.   Need to better understand what likelihoods are.

12.   Berry has an example (Example 7.11) that involves defects in the manufacture of glass. The probability of a glass pane breaking due to a manufacturing defect is 5%; however, if a glass breakage is due to a manufacturing defect, then the probability of other breakages in glass from the same batch being due to manufacturing defects rises to 95%. I'm sure that in the Frequentist real world, that sort of prior information is taken into account.

13.   Is it necessary to spend so much time explaining probability theory as an introduction?

14.   What a pity the stock-standard statistics software (S-Plus, SPSS) does not allow for Bayesian analysis. The spreadsheet models used in Berry are OK, but do not allow for detailed analysis. WinBUGS looks difficult (and I have yet to learn how to use R).
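
To make points 4 and 6 concrete, here is a minimal sketch (in Python) of the kind of discrete prior-to-posterior update Berry works through for a single proportion. The 11 candidate proportions follow Berry's layout, but the data (7 successes in 10 trials) is made up purely for illustration:

# Discrete Bayesian update for a single proportion, with a flat prior.
# The data (7 successes in 10 trials) is illustrative only.
from math import comb

models = [i / 10 for i in range(11)]           # candidate proportions 0.0 .. 1.0
prior = [1 / len(models)] * len(models)        # flat prior (cf. point 6)

successes, trials = 7, 10
likelihood = [comb(trials, successes) * p**successes * (1 - p)**(trials - successes)
              for p in models]

# Posterior = prior x likelihood, normalised to sum to 1.
joint = [pr * li for pr, li in zip(prior, likelihood)]
total = sum(joint)
posterior = [j / total for j in joint]

for p, pr, po in zip(models, prior, posterior):
    print(f"p = {p:.1f}  prior = {pr:.3f}  posterior = {po:.3f}")

Comparing the prior and posterior columns shows, in numbers, how the data shifts belief towards proportions around 0.7 (point 4); and because the prior is flat, the posterior is just the normalised likelihood, which is why a flat-prior Bayesian analysis looks so much like the Frequentist one (point 6).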

Review : Social Media Marketing Practice (SMMP Stage 3) - Advanced Concepts Workshop - University of Technology Sydney

I recently attended this short course run by the University of Technology Sydney. The course outline is here : http://www.shortcourses.uts.edu.au/code/coursedetails.php?sc_code=SMMP-ACW

The course is conducted by Suresh Sood, and you can see his LinkedIn profile here : http://au.linkedin.com/in/sureshsood

I enrolled in this short (one day) course with high expectations, and I guess that's why I've come away not entirely satisfied.

Firstly, the delivery of the course was a bit flaky

-          The course was described as part computer workshop, giving participants the opportunity to try some software hands-on. There is nothing worse in these situations than when not all the computers are pre-loaded with the software, and the instructor stuffs around helping individuals get properly set up. You expect all the computers to have been checked beforehand, and perhaps a second assistant to be on hand to look after technical issues. Also, why not let participants know they can get individual assistance during the breaks, which in this course totalled 2 hours? The same lack of preparation showed up later on when Suresh attempted several times to install PostRank with Google Reader.
-          Suresh clearly knows a lot about the subject and shows his passion, yet I thought he rambled on a bit. It was only after lunch that I started to get a sense of the key concepts of the presentation.
-          Finally, the course guide could do with a makeover, and could perhaps be expanded.

Secondly, I felt the course itself was a bit unstructured, and tended to be a bit of a grab bag of internet goodies. After all, the aim of the course is:

The course not only focuses on the strategic outlook for social media marketing activities in 2011 and beyond but tactical advice to immediately participate in successful marketing activities on social media platforms.

however, there is also a "warning" which I should have taken more notice of: this is an interactive computer laboratory workshop providing access to a variety of free and low cost tools for use in SMMP activities (my emphasis).

The focus on free and low cost tools should have alerted me to the possibility that this course is in effect targeted at the SME rather than the Enterprise level, and that the insights to be gained are "interesting titbits" rather than a "holistic and robust understanding". For example, one is not going to gain the understanding that might be gained from a product like Prophecy (http://www.forethought.com.au/prophecy/).

                The "tit bits" observation can be illustrated by the inclusion in the course of a brief description of Page Rank algorithm, which included the term : eigenvector. If any of the attendees understood what an eigenvector was , I'd be surprised.


Criticisms aside, there were some positives for me  about the course:


-          One of the handouts was an academic paper on "Reflected Glory". I thought this was a useful concept for explaining why communities and networks form and develop. It explains for me why people become members of a football club.
-          Train of Thought analytics. Not mentioned a great deal, but it seemed an idea worth following up. To me, it may provide a way of using data that is not statistically valid or significant, but is still useful as being suggestive.
-          Psychology of influence. Provides an evidence-based theory of why some websites can be more successful than others.

These are the concepts I would have liked the course to focus on, because if we are using social media in business, we want to understand:

-          what makes social media successful as a business activity
-          what we would like to measure
-          what we can measure that best approximates it


Some of the tools mentioned include :-

-          NodeXL : an Excel add-in used to map networks
-          Gephi – an open source program for network mapping
-          Social Graph API : google this and look at the example applications
-          Netvizz for Facebook
-          Facebook Insights
-          Mentionmap (apps.asterisq.com/mentionmap)
-          Social Mention
-          Bit.ly : data can be obtained on how many times a link was accessed
-          Xinureturns.com/[url]
-          SocialToo
-          Google Insights for Search
-          PostRank with Google Reader
-          TagCrowd
-          WordStat
-          Alexa
-          Social Blade


Finally, an interesting example of a good website : garyvaynerchuk.com

It would be interesting to compare this course (which costs $770 GST inclusive for one day) with a two day course offered by the Australian Marketing Institute (for $980 GST inclusive, cheaper for members).







Series on Statistics

A useful series on statistics is the "Fundamentals of Clinical Research For Radiologists".

http://www.google.com.au/search?q=%22fundamentals+of+clinical+research+for+radiologists%22+filetype%3Apdf+site%3Aajronline.org&hl=en&biw=1920&bih=965&num=10&lr=&ft=i&cr=&safe=images&tbs=

There are 22 articles in the series.

Comparing two proportions : Statistics : A Bayesian Perspective : Chapter 8


Chapter 8 deals with comparisons between two populations: the subject population and a control or comparison group.
Assuming a range of 11 discrete proportions from 0.0 to 1.0 for each population, we build what is in effect a 3D model, and look at each slice of it individually. The concept is no different from that outlined in Chapter 6, although the mechanics are a little different.
The first slice holds the prior probabilities. With 11 models for each population, the combination of the two populations gives 11 × 11 = 121 models in total. In this example, a flat prior is used.
The prior slice, with the models for Dn as rows (1.0 down to 0.0) and the models for Dc as columns (0.0 up to 1.0), is a flat prior: every one of the 121 cells is 1/121 = 0.00826446, every row total and column total is 1/11 = 0.09090909, and the grand total is 1.00000000.
The diagonal represents those models where the proportion in each population is the same.
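
This first slice is easy to reproduce outside a spreadsheet; as a minimal sketch in Python (my own illustration, not Berry's):

# First slice: a flat prior over the 11 x 11 = 121 joint models.
models = [i / 10 for i in range(11)]       # candidate proportions for each population
prior = [[1 / 121 for _dc in models] for _dn in models]   # every cell = 0.00826446

row_totals = [sum(row) for row in prior]   # each equals 1/11 = 0.09090909
grand_total = sum(row_totals)              # 1.0
print(prior[0][0], row_totals[0], grand_total)
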
The second slice is the likelihood of each model. The likelihood of each of the 11 models is calculated for each population; the likelihood of each of the 121 joint models is then the product of the likelihoods of the corresponding pair of single-population models.

The likelihood slice has the same layout (rows for the Dn models, columns for the Dc models). Each cell is the product of the two single-population likelihoods for that pair of models, so the grid is effectively the outer product of the 11 Dn likelihoods and the 11 Dc likelihoods. In this example, the likelihood is concentrated in the models where Dn is high (around 0.8 to 0.9) and Dc is low (around 0.1 to 0.3), and it is zero wherever either proportion is 0.0 or 1.0.
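
The same slice can be sketched in Python. Berry's actual counts for the two samples are not reproduced in this post, so the counts below (9 successes in 10 trials for the subject group, 2 in 15 for the control group) are hypothetical stand-ins, used only to illustrate the mechanics:

# Second slice: likelihood of each joint model = product of the two
# single-population binomial likelihoods. The counts here are hypothetical
# stand-ins, not the data from Berry's example.
from math import comb

models = [i / 10 for i in range(11)]

def binom_like(successes, trials):
    return [comb(trials, successes) * p**successes * (1 - p)**(trials - successes)
            for p in models]

like_dn = binom_like(9, 10)    # subject population (hypothetical counts)
like_dc = binom_like(2, 15)    # control population (hypothetical counts)

likelihood = [[ln * lc for lc in like_dc] for ln in like_dn]   # 11 x 11 grid
print(likelihood[9][1])        # likelihood of the joint model (Dn = 0.9, Dc = 0.1)
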
The third slice is to multiply likelihoods by prior probabilities:

The resulting slice again has the same layout. Because the prior is flat, each cell is simply the corresponding likelihood multiplied by 1/121, so the pattern of the grid is unchanged; only the scale differs.
The fourth slice is to divide the product of likelihood * prior by the total of the products, which gives the posterior probability.

Posterior probabilities (rows are the models for Dn, columns the models for Dc; the columns for Dc = 0.6 to 1.0 and the rows for Dn = 0.4 down to 0.0 are zero to three decimal places and have been collapsed):

Dn \ Dc    0.0    0.1    0.2    0.3    0.4    0.5    0.6-1.0   Total
1.0        0.000  0.000  0.000  0.000  0.000  0.000  0.000     0.000
0.9        0.000  0.292  0.252  0.100  0.024  0.004  0.000     0.672
0.8        0.000  0.112  0.097  0.038  0.009  0.001  0.000     0.258
0.7        0.000  0.026  0.022  0.009  0.002  0.000  0.000     0.060
0.6        0.000  0.004  0.003  0.001  0.000  0.000  0.000     0.009
0.5        0.000  0.000  0.000  0.000  0.000  0.000  0.000     0.001
0.4-0.0    0.000  0.000  0.000  0.000  0.000  0.000  0.000     0.000
Total      0.000  0.434  0.376  0.149  0.036  0.005  0.000     1.000

The spreadsheet also accumulates the weighted sums Dn × P(Dn) and Dc × P(Dc) from the marginal totals, giving posterior means of approximately 0.859 for Dn and 0.180 for Dc.
From the spreadsheet above, it is possible to read off various probability subsets; for example, the probability that Dc = 0.3 is 14.9% (the column total for Dc = 0.3).
The null hypothesis (that the two proportions are equal) is represented by the diagonal of the table.
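
For completeness, the third and fourth slices can be sketched in Python as well, again with hypothetical counts standing in for Berry's data (so the numbers will not match the spreadsheet above, but the mechanics are identical):

# Third and fourth slices: multiply likelihood by the flat prior, normalise,
# then read off marginal probabilities, posterior means and the diagonal
# (the models where the two proportions are equal). Counts are hypothetical.
from math import comb

models = [i / 10 for i in range(11)]

def binom_like(successes, trials):
    return [comb(trials, successes) * p**successes * (1 - p)**(trials - successes)
            for p in models]

like_dn = binom_like(9, 10)              # subject population (hypothetical counts)
like_dc = binom_like(2, 15)              # control population (hypothetical counts)
prior = 1 / 121                          # flat prior over the 121 joint models

joint = [[ln * lc * prior for lc in like_dc] for ln in like_dn]   # third slice
total = sum(sum(row) for row in joint)
posterior = [[cell / total for cell in row] for row in joint]     # fourth slice

# Marginal probability of each control-group model Dc (column totals).
dc_marginal = [sum(posterior[i][j] for i in range(11)) for j in range(11)]
print("P(Dc = 0.3) =", round(dc_marginal[3], 3))

# Posterior means of Dn and Dc, and the probability on the diagonal.
mean_dn = sum(models[i] * sum(posterior[i]) for i in range(11))
mean_dc = sum(models[j] * dc_marginal[j] for j in range(11))
p_equal = sum(posterior[i][i] for i in range(11))
print(round(mean_dn, 3), round(mean_dc, 3), round(p_equal, 3))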