Wednesday, December 29, 2010

Social Media and statistical data

My interest in social media comes from the opportunity it offers as a new source of data for statistical analysis. Apart from this blog, I'm basically not involved in social media. I've sent 3 tweets in my life, 2 of which were probably not seen or read by anyone. My Facebook account consists of my name and that's it. I don't have friends or followers. I may set up a LinkedIn profile, but I haven't done so yet.

From the data angle, there are two types of analysis that I can see. Firstly, businesses that use social media want to monitor and measure the metrics associated with their social media sites: how many people visited my site, how long did they stay, where did they come from, and so on.

Secondly, the content of social media can itself be the source of a lot of interesting information to be analysed. That's why I like Tribalytic (http://tribalytic.com/).

This website has an Australian focus (more useful to Australian researchers than using US data), and allows researchers to extract samples of Twitter messages (tweets).

Here are two links to the Tribalytic blog that explain how businesses can use Twitter data:



The other thing I like about Tribalytic is that one of the two co-founders took the time to respond to an e-mail I sent to them.

(Although I doubt I'd get such a response at the moment (Dec 2010): one of their other products, Trunk.ly, has been riding a wave since rumours of the impending closure of Delicious started doing the rounds. Trunk.ly automagically collects links you share on Facebook, Twitter and Delicious, ... and makes them searchable.)

The ways of using Tribalytic tweet data include:

1. Tracking words used in tweets by day and time of day. The example of "daypart" analysis Tribalytic uses is tracking the usage of "ice cream" and "coffee-machine".
2. Keyword analysis: looking for word patterns in samples of up to 500 tweets that share a common word or words. What other words are associated, say, with "coffee machine"?
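
To make the "daypart" idea concrete, here is a minimal sketch in Python. It is not Tribalytic's method. It assumes tweets are available as (timestamp, text) pairs, and the function name `daypart_counts` and the sample tweets are made up for illustration.

```python
from collections import Counter
from datetime import datetime


def daypart_counts(tweets, phrase):
    """Count tweets mentioning `phrase`, grouped by hour of day (0-23).

    `tweets` is an iterable of (datetime, text) pairs (an assumed format).
    """
    phrase = phrase.lower()
    counts = Counter()
    for timestamp, text in tweets:
        if phrase in text.lower():
            counts[timestamp.hour] += 1
    return counts


# Invented sample data, for illustration only.
sample = [
    (datetime(2010, 12, 1, 8, 15), "Need coffee, the coffee machine is broken"),
    (datetime(2010, 12, 1, 8, 40), "New coffee machine at the office!"),
    (datetime(2010, 12, 1, 15, 5), "Ice cream weather today"),
    (datetime(2010, 12, 2, 20, 30), "ice cream after dinner"),
]

for hour, n in sorted(daypart_counts(sample, "coffee machine").items()):
    print(f"{hour:02d}:00  {n} tweet(s)")
```

Grouping by `timestamp.weekday()` instead of `timestamp.hour` would give the by-day version of the same tally.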

In my e-mail to Tribalytic, I queried whether their keyword analysis was statistically valid. Tribalytic takes an evenly distributed sample of 500 tweets that share a common word or words, and extracts the top 52 keywords, where those words appear in more than 2% of tweets. As Tribalytic acknowledge, if you are going to suggest a correlation, you need to be reasonably sure the words don't co-occur by chance.

My comment in my e-mail was: "I also wasn't sure about your logic of using 2% as the threshold for including or excluding keywords. Your reasoning is that the same word appearing in 10 separate tweets is unlikely to have occurred by chance. You might be surprised how many such samples contained common words, and how common they were. I'm guessing that the twitter vocabulary would be fairly limited, and given an average tweet size of ????, the odds of 10 or so tweets having common words might be quite large."

The response was: "What's perhaps NOT evident is that this has already undergone processing with a Natural Language Parser which will filter out the most common words already."

It should be a relatively simple task to sample tweet data and get a handle on how many words appear multiple times in randomly selected tweets. However, I'll need to increase my skills in handling text data before I can do that.
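
As a starting point, here is a rough sketch in Python of the kind of count I have in mind. It is my own guess at the logic, not Tribalytic's code: the function name `keyword_shares`, the crude regular-expression tokeniser and the tiny stopword list (standing in for their Natural Language Parser) are all assumptions. The 2% threshold and the limit of 52 keywords come from the description above.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real filter would be much larger.
STOPWORDS = {"the", "a", "an", "is", "my", "to", "and", "of", "in", "at", "i", "it"}

WORD_PATTERN = re.compile(r"[a-z0-9#']+")


def keyword_shares(tweets, seed="", min_share=0.02, top_n=52, stopwords=STOPWORDS):
    """Return (word, share of tweets containing it) for frequent words.

    Each word is counted at most once per tweet, so the share is the
    fraction of tweets in the sample that contain the word.
    """
    total = len(tweets)
    if total == 0:
        return []
    seed_words = set(WORD_PATTERN.findall(seed.lower()))
    doc_freq = Counter()
    for text in tweets:
        words = set(WORD_PATTERN.findall(text.lower()))
        doc_freq.update(words - seed_words - stopwords)
    frequent = [
        (word, count / total)
        for word, count in doc_freq.most_common()
        if count / total > min_share
    ]
    return frequent[:top_n]


# Invented sample tweets, for illustration only.
example = [
    "coffee machine broken",
    "coffee machine fixed",
    "my coffee machine is broken again",
]
for word, share in keyword_shares(example, seed="coffee machine"):
    print(f"{word}: {share:.0%}")
```

Running the same function on a random sample of tweets (with no seed word) would show how many words clear the 2% threshold purely by chance, which is the baseline I was asking about.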
I also queried the post-hoc nature of the keyword analysis to construct a "story":
- Nugget: broken, working, fixed, buy (bought)
- Where: office
- When: week, today, morning
- Emotion: love, carefully
- Noise: china, #pawpawty, bone are from the same kind of tweets like this which can be safely tossed away.
By sifting through the natural noise coming with the tweets, now we can see that a common theme of people talking about the physical status of their coffee machine.
Tribalytic's response was: "With regards to post-hoc analysis, I agree this is a potential issue. Interestingly though, our "real world" tests (popping it in front of clients and researching topics they know something about) show that it does a pretty good job of throwing up enough of the expected keywords to confirm value for them on past events that they do know about. Often when we are confronted with keywords from a search we are stuck as to why they are there (without drilling in) but clients say things like "Oh yeah, bear shows with scouts because Bear Grylls was at the Jamboree in the NT in March"."

With the benefit of further thought and reflection, I'm not sure if there is a problem with "post hoc" analysis here, nor indeed whether or not it matters whether 10 or more common words is statistically significant.
This is because:
- Twitter message analysis, using tools such as Tribalytic, is a quick and inexpensive way of conducting exploratory or prototype research.
- Exploratory research is conducted to gain insights, and the grouping of common words with a seed word (like "coffee machine") is a great way to brainstorm, if you will, ideas and concepts that may be associated with your word of interest.
- A media campaign can create its own common word: a hashtag.
So, in conclusion, a great idea and a great opportunity to conduct statistical research in future.



