Channel: SCN : Blog List - SAP HANA and In-Memory Computing

Creating a Rugby World Cup Sentiment Tracker


With the Rugby World Cup now on, I decided to put some of the SAP kit bag to the test.

The latest output of this *should* be automatically republished daily at 22:00 BST to Lumira Cloud, allowing you to interact with it.

http://tiny.cc/RWCTweets

Rugby Tweet Analysis v2.png

Between 18 and 23 September I captured 1.2 million tweets from the #RWC2015 Twitter feed. I hope to keep the data capture running throughout the tournament.

 

In this example I have used:

1. Smart Data Integration (SDI) within SAP HANA to acquire the tweets from Twitter in real time from the #RWC2015 feed

2. SAP HANA to store and process the data

3. Text Analysis to turn Tweets into a structured form

4. Text Mining to identify Relevant Terms

5. SAP HANA Studio to model

6. SAP Lumira Desktop to create some analytics

7. SAP Lumira Cloud to expose the output

 

 

1. Data Acquisition through the SDI Data Provisioning Agent

From HANA SPS 09, Smart Data Integration has been built directly into HANA. One of the data provisioning (DP) sources available is Twitter. I won't repeat the steps to set up the DP agent here, as Bob has created a great series of SAP HANA Academy videos covering this setup:

SAP HANA Academy - Smart Data Integration/Quality : Twitter Replication Pt 1 of 3 [SPS09] - YouTube

 

With the virtual table now available in HANA you can make this real-time by issuing the following SQL.

 

SET SCHEMA HANA_EIM;
--Create SDA Virtual Table
CREATE VIRTUAL TABLE "HANA_EIM"."RWC_R_STATUS" at
"TWITTER"."<NULL>"."<NULL>"."status";
--Create a target table
create COLUMN table "HANA_EIM"."RWC_T_STATUS" like "HANA_EIM"."RWC_R_STATUS";
--Create Subscriptions
create remote subscription "HANA_EIM"."rt_trig1"
as (select * from "HANA_EIM"."RWC_R_STATUS" where "Tweet" like '%#RWC2015%')
target table "HANA_EIM"."RWC_T_STATUS";
--SELECT * FROM "HANA_EIM"."RWC_T_STATUS";
--truncate table "HANA_EIM"."RWC_T_STATUS";
--Queue the subscription and start streaming.
alter remote subscription "HANA_EIM"."rt_trig1" queue;
alter remote subscription "HANA_EIM"."rt_trig1" distribute;
select count(*) from "HANA_EIM"."RWC_T_STATUS";
--Stop Subscription
--ALTER REMOTE SUBSCRIPTION "rt_trig1" RESET;

 

With the data now being acquired "automatically", it's possible to monitor the acquisition via the XS Monitoring URL: http://ukhana.mo.sap.corp:8000/sap/hana/im/dp/monitor/?view=DPSubscriptionMonitor

DPSubscriptionMonitor.png

3. Text Analysis

As I previously described in "Using Custom Dictionaries with Text Analysis in HANA SPS9" for Formula One Twitter analysis, creating custom dictionaries for your subject area is very easy.

I've added one that includes the rugby teams, with their Twitter handles and short names. This new dictionary was included in a new configuration.

HANA Web IDE.png

To turn on Text Analysis for the acquired Twitter data, use the following syntax:

CREATE FULLTEXT INDEX "RWC-TWEETS" ON "HANA_EIM"."RWC_T_STATUS"("Tweet")
CONFIGURATION 'RWC::RUGBY_SOCIAL_CONFIG'
FAST PREPROCESS OFF
LANGUAGE COLUMN "isoLanguageCode"
LANGUAGE DETECTION ('EN','FR','DE','ES','ZH','IT')
TEXT ANALYSIS ON
TEXT MINING ON
FUZZY SEARCH INDEX ON

 

Text Analysis is really clever and identifies some useful elements beyond the basics (who, where, when, etc.). The more advanced output is often known as fact extraction. Of these "facts", Sentiment, Emotion and Requests are three that could potentially be useful in the rugby tweet data.
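To make the shape of that output more concrete, here is a minimal Python sketch, run outside HANA, of how rows from the text analysis results table might be grouped by fact type. The TA_TOKEN and TA_TYPE names follow the columns of HANA's $TA table; the sample rows, the type labels and the facts_by_type helper are illustrative assumptions, not output from the real feed.

```python
from collections import defaultdict

# Hypothetical sample rows mimicking a $TA table: (tweet id, TA_TOKEN, TA_TYPE).
# The type labels are assumptions; check your own $TA table for the real values.
TA_ROWS = [
    (1, "England", "RUGBY_TEAM"),
    (1, "Twickenham", "LOCALITY"),
    (1, "brilliant", "StrongPositiveSentiment"),
    (2, "Wales", "RUGBY_TEAM"),
    (2, "please show the replay", "GeneralRequest"),
]


def facts_by_type(rows):
    """Group extracted tokens by their TA_TYPE."""
    grouped = defaultdict(list)
    for _tweet_id, token, ta_type in rows:
        grouped[ta_type].append(token)
    return dict(grouped)


if __name__ == "__main__":
    for ta_type, tokens in facts_by_type(TA_ROWS).items():
        print(ta_type, tokens)
```

In this toy data the custom dictionary entries (the teams) sit alongside the sentiment and request facts, which is what lets the later modelling step link a sentiment to a team.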

 

4. Text Mining the Tweets

Now I wanted to try something more than just sentiment, mentions and emotion. For this I decided to use Text Mining, which is also built into HANA and has been further enhanced in SPS10 with SQL access to the Text Mining functions. Activating Text Mining is very easy: it is done when creating the full-text index, by specifying TEXT MINING ON as in the syntax above.

 

Text Mining has multiple capabilities that apply at the document level. For this I treated each tweet as a document, which served its purpose. However, as tweets are by nature very short, you don't gain much additional insight from document-level analysis.

 

SELECT *
FROM TM_GET_RELEVANT_TERMS (
DOCUMENT IN FULLTEXT INDEX WHERE "Tweet" like '%England%'
SEARCH "Tweet" FROM "HANA_EIM"."RWC_T_STATUS"
RETURN
TOP 16
) AS T

 

After investigating the Text Mining functions TM_GET_RELEVANT_TERMS and TM_GET_RELATED_TERMS with Twitter data, I found the core Text Analysis functions to be more than capable for my analysis purposes. If, however, I were analysing news reports, blogs or documents, then Text Mining would be much more appropriate.
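One way to see why short documents limit term-level mining is a small Python sketch using plain TF-IDF. This is a stand-in only: the post does not describe how HANA weights terms inside TM_GET_RELEVANT_TERMS, and the sample tweets and the tfidf_top_terms helper are hypothetical.

```python
import math
from collections import Counter

# Hypothetical tweets, each treated as one document.
TWEETS = [
    "england win at twickenham",
    "england fans sing at twickenham",
    "wales beat england",
]


def tfidf_top_terms(docs, doc_index, top_n=3):
    """Rank the terms of one document by a plain TF-IDF weight."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    term_freq = Counter(tokenised[doc_index])
    weights = {
        term: count * math.log(n_docs / doc_freq[term])
        for term, count in term_freq.items()
    }
    # Sort by weight (highest first), then alphabetically for stable output.
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_n]


if __name__ == "__main__":
    for term, weight in tfidf_top_terms(TWEETS, 0, top_n=4):
        print(f"{term}: {weight:.3f}")
```

Because almost every term occurs only once in a tweet, the term-frequency part is flat and the ranking is driven entirely by how rare a term is across the collection. Longer texts such as news reports give the term frequencies something to work with.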

Text Mining Output.png

 

5. HANA Modelling

This piece took the longest and was fairly challenging, as you need to model the tweets with the final output in mind. The modelling turns the structured $TA table into a format suitable for analysis in Lumira (or another BI tool) by identifying the entities and the relationships between them: countries, tweets and sentiment.

 

I created two Calculation Views in HANA Studio. They are still a work in progress, but are sufficient to give some useful output.

I felt it easier to create two because they are at different levels of granularity: one is at the country level, the other at the country and key word level.
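The effect of the two granularities can be sketched in Python outside HANA. This is not the logic of the actual Calculation Views, which the post only shows as screenshots; the flattened rows and both helper functions are hypothetical, and the sketch only illustrates one plausible reason for keeping the levels apart.

```python
from collections import Counter

# Hypothetical flattened rows: (tweet id, country, sentiment, key word).
ROWS = [
    (1, "England", "Positive", "scrum"),
    (1, "England", "Positive", "try"),
    (2, "England", "Negative", "scrum"),
    (3, "Wales", "Positive", "try"),
]


def by_country(rows):
    """Count distinct tweets per (country, sentiment)."""
    seen = {(tweet_id, country, sentiment) for tweet_id, country, sentiment, _ in rows}
    return Counter((country, sentiment) for _, country, sentiment in seen)


def by_country_keyword(rows):
    """Count rows per (country, key word)."""
    return Counter((country, word) for _, country, _, word in rows)


if __name__ == "__main__":
    print(by_country(ROWS))
    print(by_country_keyword(ROWS))
```

Tweet 1 carries two key words, so it appears twice at the key word level but must be counted once at the country level. Serving both from a single view would risk double-counting tweets.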

Text_Analysis_Calc_View_Annotated.png

Text_Analysis_Words_CV_Annotated.png

6. SAP Lumira Desktop to create some visualisations

With the modelling and manipulation taken care of in HANA, using Lumira is then easy (although you can spend some time perfecting your final output).  Here we can build some visualisations as below and then encapsulate them into a story board.

Screen Shot 2015-09-23 at 10.34.32.png

My original visualisations have now been greatly enhanced by Daniel Davis into a great Lumira Story.

Daniel has also created an England Rugby wall chart, available for download from http://www.thedavisgang.com/

Screen Shot 2015-09-23 at 10.46.32.png

7. SAP Lumira Cloud

To share the output in an interactive way we can publish the visualisations, stories and dataset to SAP Lumira Cloud. One story option is crucial: "Refresh page on open", which is OFF by default, must be set to ON so that the visualisations within the story are updated along with the dataset.

 

Lumira Desktop has a scheduling agent built in. Once enabled, it can automatically refresh and republish to Lumira Cloud.

I have set this to refresh the Rugby Tweet Analysis every day at 22:00.

 

Within Lumira Cloud we now need to make the story public; this is set under the story options.

Lumira Cloud Share.png

Change Access.png

Public.png

We now have the URL, which can be shared with others. For ease of consumption I created a short URL pointing to this long URL with http://tiny.cc/

 

To view the full interactive Lumira story board, please use the link below:

http://tiny.cc/RWCTweets

