How useful are online video rankings?

You have probably seen this chart by comScore, which publishes analytical reports on digital media, including rankings of video delivery platforms:

Screen_shot_2012-03-30_at_1

Problem is, these metrics (at least the ones that are widely quoted by various articles and press releases) can be misleading.

Not all "views" are equal. If a video isn't interesting, you stop or fast-forward. Those actions can be measured, but they aren't reflected in any of the widely publicized online video rankings.

Then there is content that people actually want to watch and are even willing to pay for (Hulu, Netflix). But these views get lost in the sea of low-quality short-form video content.

Yahoo Video, for example, is reporting higher views lately.  Some speculate that Yahoo may be doing something right:

But those new shows were announced in early March and haven’t yet aired, so they can’t account for the 20% increase in online video viewership Yahoo gained from January 2012 to February 2012. Either Bill Maher attracted a lot of viewers or Yahoo is doing a better job of pushing its 177+ million unique monthly U.S. visitors toward watching more of its moving pictures online.

That's not necessarily the case. Note that the numbers reported by comScore come from "Yahoo! Sites", which includes many properties (News, Finance, Real Estate...), and I suspect the reason for these inflated view numbers may be simple:

AUTOPLAY

... a feature so annoying that people (the very viewers counted in these statistical reports) are constantly looking for ways to stop it. How many times have you clicked on a seemingly interesting news article, only to have a loud video ad start playing as soon as the page loads? You may stop playback instantly, but your "view" has already been registered and will become part of some statistical report.

In Yahoo's defense, they are not the only ones doing that. Many of the online video properties mentioned in the comScore reports show low "minutes per view" (not "minutes per viewer"):

Screen_shot_2012-04-12_at_12
(the data above is provided by comScore; I just added one calculated column on the right: Minutes per View)
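
A minimal sketch of that calculated column, assuming it is simply total minutes watched divided by total views. The property names and numbers below are made up for illustration and are not comScore figures:

```python
def minutes_per_view(total_minutes: float, total_views: float) -> float:
    """Average minutes watched per registered view (not per viewer)."""
    if total_views <= 0:
        raise ValueError("total_views must be positive")
    return total_minutes / total_views


# Hypothetical figures for illustration only; these are NOT comScore data.
# (property, total minutes watched in millions, total views in millions)
sample = [
    ("Property A", 9000.0, 1800.0),
    ("Property B", 400.0, 800.0),
]

for name, minutes, views in sample:
    print(f"{name}: {minutes_per_view(minutes, views):.1f} minutes per view")
```

A property with many autoplayed, quickly abandoned videos would show up here with a value well under one minute.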

You probably noticed that companies like Netflix are usually shown toward the bottom of these lists, or not shown at all. How can that be when, according to other reports, Netflix is responsible for one third of all downstream bandwidth consumption in the US?

Netflix provides long-form, highly engaging content (1.5-hour movies, for example). So if we take the duration of those views into account and base our rankings on minutes (hours) watched instead of views, then the overall distribution could look something like this, with Netflix and Hulu playing a much larger role:

Screen_shot_2012-04-12_at_1
(this chart is for illustration purposes only; it is based on publicly available view statistics for these properties, multiplied by the length of their typical content: films and TV shows)
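
The re-ranking behind a chart like this can be sketched in a few lines of Python. The two properties, their view counts, and their content lengths are hypothetical placeholders, not real statistics:

```python
# Hypothetical inputs for illustration only (not real statistics):
# property -> (monthly views, typical content length in minutes)
properties = {
    "Short-clip site": (1_000_000, 3),
    "Long-form service": (100_000, 90),
}


def estimated_minutes(views: int, typical_length_min: float) -> float:
    """Rough watch time: views multiplied by typical content length."""
    return views * typical_length_min


def rank_by(metric):
    """Return property names sorted by the given metric, highest first."""
    return sorted(
        properties,
        key=lambda name: metric(*properties[name]),
        reverse=True,
    )


by_views = rank_by(lambda views, length: views)
by_minutes = rank_by(estimated_minutes)
print("By views:  ", by_views)
print("By minutes:", by_minutes)
```

Multiplying by typical length assumes every view runs to the end, so it overstates watch time; measured minutes per view would be the better input where available.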

In conclusion, in online media delivery, "views" is a useless metric unless it's accompanied by other metrics that provide some insight into user behaviour: time watched and stop/pause/fast-forward/rewind events.

Personalized cancer therapy: two significant data sets made public

Good news for personalized medicine, especially as it relates to cancer treatment: two research teams have just released data on over 1,000 cell lines and the anticancer drugs that were tested on them.

One team identified biological markers of drug sensitivity to a broad range of cancer drugs:

"Our research has taken us down unknown paths to find associations that are completely novel," says Cyril Benes, PhD, senior author from Massachusetts General Hospital (MGH) Cancer Center. "We have identified hundreds of associations, many of which we still don't fully understand. We identified a novel indication for the use of PARP inhibitors – anticancer drugs currently used to treat breast and ovarian cancers – for the treatment of Ewing's sarcoma."

Abstract, published in Nature on March 29, 2012 (page includes links to supplemental data):

Clinical responses to anticancer therapies are often restricted to a subset of patients. In some cases, mutated cancer genes are potent biomarkers for responses to targeted agents. Here, to uncover new biomarkers of sensitivity and resistance to cancer therapeutics, we screened a panel of several hundred cancer cell lines—which represent much of the tissue-type and genetic diversity of human cancers—with 130 drugs under clinical and preclinical investigation. In aggregate, we found that mutated cancer genes were associated with cellular response to most currently available cancer drugs. Classic oncogene addiction paradigms were modified by additional tissue-specific or expression biomarkers, and some frequently mutated genes were associated with sensitivity to a broad range of therapeutic agents. Unexpected relationships were revealed, including the marked sensitivity of Ewing’s sarcoma cells harbouring the EWS (also known as EWSR1)-FLI1 gene translocation to poly(ADP-ribose) polymerase (PARP) inhibitors. By linking drug activity to the functional complexity of cancer genomes, systematic pharmacogenomic profiling in cancer cell lines provides a powerful biomarker discovery platform to guide rational cancer therapeutic strategies.

Source: http://www.nature.com/nature/journal/v483/n7391/full/nature11005.html

The other team, led by the Broad Institute and the Novartis Institutes for Biomedical Research and its Genomics Institute of the Novartis Research Foundation, published data on approximately 1,000 cell lines. The database is available on the Cancer Cell Line Encyclopedia (CCLE) website (according to an announcement posted on the site, it's currently experiencing heavy load):

http://www.broadinstitute.org/ccle/home

The CCLE provides public access to analysis and visualization of DNA copy number, mRNA expression, and mutation data.

Screenshot-2012-03-30_15

 

by Michael Alatortsev

Personalized medicine: merely scratching the surface

Last week's announcement by IBM describes a promising biomedical analytics platform:

Scientists from IBM Research are collaborating with the Fondazione IRCCS Istituto Nazionale dei Tumori, a major research and treatment cancer center in Italy, on the new decision support solution. This new analytics platform is being tested by the Institute's physicians to personalize treatment based on automated interpretation of pathology guidelines and intelligence from a number of past clinical cases, documented in the hospital information system.

Selecting the most effective treatment can depend on a number of characteristics including age, weight, family history, current state of the disease and general health. As a result, more informed and personalized decisions are needed to provide accurate and safe care.

We are clearly moving towards personalized medicine, but there are many challenges along the way.

Technology.  This one used to be problematic, but it no longer presents a significant challenge. Cloud services (both storage and computing power) are affordable, and there are many new technologies (Hadoop and various NoSQL frameworks) that simplify processing large amounts of loosely structured data - on cheap commodity servers.

Regulations. Various privacy guidelines complicate information gathering and make collaborative sharing difficult. Luckily, effective data models can be built using "de-identified" data (data that has been processed or summarized to strip personally identifying details, essentially rendering it anonymous). One problem is that sanitizing data often affects outliers the most (e.g. in de-identified data, an anonymous patient who remained hospitalized for 200 days may only show "6+ weeks" under Duration of Stay). While outliers are often deliberately ignored in statistical analysis (as they are often considered bad data points), they may still provide valuable information about the treatment or the related data gathering and recording process.
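
To make the outlier problem concrete, here is a minimal Python sketch of this kind of coarsening. Only the "6+ weeks" top bucket comes from the example above; the function name and the other bucket boundaries are assumptions for illustration:

```python
def bucket_length_of_stay(days: int) -> str:
    """Coarsen an exact length of stay into a top-coded bucket.

    Only the "6+ weeks" top bucket is taken from the text; the other
    boundaries are assumed for illustration.
    """
    if days < 0:
        raise ValueError("days must be non-negative")
    if days >= 42:  # 6 weeks or longer: everything collapses into one bucket
        return "6+ weeks"
    if days >= 7:
        return "1-5 weeks"
    return "under 1 week"


# A 43-day stay and a 200-day stay become indistinguishable.
print(bucket_length_of_stay(43), "|", bucket_length_of_stay(200))
```

Once the data is bucketed this way, no model can recover how extreme the 200-day stay was.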

Acceptance by "Big Pharma". Highly targeted medications (e.g. those that are highly effective in a small subset of the population with the right genetic makeup, but ineffective or potentially dangerous in others) are hard to get through the approval process. Moreover, such drugs may require the same (or a larger) amount of R&D spend, yet present a much smaller "market". Drugs that work (not very effectively) for most people win. Extending a lung cancer patient's life expectancy by 2 months is considered a success. This is likely to change as personalized medicine becomes more commonplace.

Genetic research. This is where things get a little depressing. Human beings have 20,000 to 25,000 [protein-coding] genes. I recently attended a Personalized Medicine panel discussion at Yale. According to the speakers, modern medicine knows nothing about 16,000 of those genes. Certain diseases (such as cancer) are caused by mutations in one or several key genes. Currently, about 500 gene mutations are linked to various forms of cancer. Of those 500, we know and understand 8 (eight). So, someone with one of those eight mutations can receive "personalized" treatment that will be extremely effective.

In conclusion, the next 5 years should present tremendous opportunities for data analysis. Some predictive models can be built simply by using existing de-identified data. For example, it is now possible to predict how many days a patient is likely to spend hospitalized next year, based on their prior claims data coupled with a certain amount of "personal data" (e.g. age group, sex, Charlson index, etc). Some speculate that this will help advance preventive care.

However, it appears that a patient's ability to receive highly targeted (and highly effective) treatment for cancer will continue to rely on advancements in genetic research. And that area is still dealing with technology limitations. There are companies working on making gene sequencing affordable, which should drive personalized medicine.

by Michael Alatortsev

Hadoop nodes won't start? Check your OS firewall settings

We recently configured a small three-node cluster (Dell servers + Dell managed switch) using Cloudera's distribution of Hadoop:

OS: CentOS 6.2
Hadoop: Cloudera CDH, Cloudera Manager Free Edition version 3.7.3.

Hadoop installation was a snap (a much better experience overall compared to version 3.6), but some services (HDFS, HBase, Hue) would not start on some nodes:

Services_bad

Note that on a typical CentOS 6.2 installation (which all our Hadoop nodes are), the firewall is enabled by default on all interfaces. This may prevent your nodes from talking to each other, which in turn prevents some services from starting in distributed mode.

We like to physically separate internal Hadoop chatter from other kinds of traffic by designating one interface (eth0) on each node as "Hadoop", giving it its own subnet with statically assigned IP addresses, and connecting them all via their own VLAN. This approach helps improve your cluster's performance, security, and ease of management.
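
One quick way to confirm that a firewall is what stands between your nodes is to try opening a TCP connection from one node to another. This is a minimal Python sketch rather than part of our setup: the node addresses are hypothetical, and port 8020 is only assumed here to be the HDFS NameNode port, so check your own configuration.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, unreachable, name lookup failed...
        return False


# Hypothetical node addresses and port: replace with your own.
# 8020 is assumed to be the HDFS NameNode port; verify it in your config.
NODES = ["10.0.0.11", "10.0.0.12", "10.0.0.13"]
PORT = 8020

if __name__ == "__main__":
    for node in NODES:
        status = "reachable" if can_connect(node, PORT) else "blocked or down"
        print(f"{node}:{PORT} {status}")
```

A "blocked or down" result only says that the connection failed; it cannot tell a firewall drop from a service that is not running, so check that the service is listening first.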

Vlans_on_switch

Photo

Because our Hadoop traffic is already restricted at several levels, we can simply designate each node's "Hadoop" interface (eth0) as "Trusted":

Firewall-trusted-port

This needs to be done on each node.

Once the firewall settings have been updated, you can restart the affected services using Cloudera Manager (HDFS first, followed by MapReduce, and finally HBase):

Screen_shot_2012-02-22_at_11

Scm_healthy

by Michael Alatortsev

Chrysler's Super Bowl ad takes the lead on Twitter

While other brands may have taken the top spots among "most effective" Super Bowl ads according to ad ranking agencies, it's Chrysler's "It's halftime in America" ad featuring Clint Eastwood that's now the most widely discussed commercial on Twitter. Skechers or Doritos? Not so much.

Screen_shot_2012-02-07_at_1

The discussion has now branched into political conversations, which seem to be very effective at generating social media buzz.

by Michael Alatortsev

#10BasicFactsAboutMe is trending, people talk about what they "like" or "hate"

Today's top trending tag on Twitter globally is #10BasicFactsAboutMe.

Keyword frequency analysis of the most recent 1,233 messages suggests that most posts are about what people "like" or "hate". The most popular keywords, in descending order of frequency:

  • people
  • time
  • music
  • friends
  • life
  • person
  • food
  • color
  • family
  • God
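
A keyword count of this kind can be sketched in Python with `collections.Counter`. This is not the exact procedure used for the numbers above: the stop-word list is a tiny assumed one, and the sample messages are made up for illustration.

```python
import re
from collections import Counter

# Tiny assumed stop-word list; a real analysis would use a fuller one.
STOP_WORDS = {"i", "a", "the", "my", "and", "to", "is", "of", "with", "in"}


def top_keywords(messages, n=10):
    """Count words across messages, skipping hashtags, mentions and stop words."""
    counts = Counter()
    for text in messages:
        # Drop hashtags and @mentions so the trending tag itself is not counted.
        text = re.sub(r"[#@]\w+", " ", text.lower())
        words = re.findall(r"[a-z']+", text)
        counts.update(w for w in words if w not in STOP_WORDS and len(w) > 2)
    return counts.most_common(n)


# Made-up sample messages for illustration (not the 1,233 analyzed posts).
sample = [
    "#10BasicFactsAboutMe I like music",
    "#10BasicFactsAboutMe I hate rude people",
    "#10BasicFactsAboutMe music is life, I like people",
]
print(top_keywords(sample, n=3))
```

Filtering out verbs such as "like" and "hate", as the list above does, would take one more step, for example a longer stop-word list or part-of-speech tagging.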

According to another trending tag, #WeAllDoThat, the most common thing people do is say they are tired when in reality they are sad. That is followed by checking the fridge for food (even if it was already checked recently and there was none):

by Michael Alatortsev