The role of R stuff in becoming data scientists

Dr James Black | Associate Group Director

Personalised Healthcare Data Science | Roche / Genentech

The underlying drivers

The growth of MDAS

MDAS within Roche

Data types in Personalised Health Care (PHC)

The expansion of toolsets

The infrastructure we use has evolved

Admin managed




Data Scientists

The challenges

  • Pharma is traditionally bad at sharing code
    • Code shares can be file servers or wikis...
    • Sharing code is seen as a risk
    • We need to share and collaborate
  • Data Scientists need to be flexible
    • Match the languages used in each analysis to the methods and data types
    • There is a lot of training overhead in being a polyglot
    • Infra code should be centralised

Step 1: Wrap infrastructure in a shared code base


Abstracting infrastructure



Making it easy for scientists to derive insights from the Flatiron data

For humans

No SQL (dbplyr)

Reuse code for variable definitions, tables, plots

Vignette-driven docs
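
A minimal sketch of what "No SQL (dbplyr)" can look like in practice: the analyst writes dplyr verbs and dbplyr translates them to SQL against the database. The in-memory SQLite database, the `patients` table and its columns are hypothetical stand-ins for illustration; they are not the Flatiron schema or the internal package described in this talk.

```r
library(DBI)
library(dplyr)
library(dbplyr)

# Hypothetical stand-in database: an in-memory SQLite instance
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Hypothetical table and columns, invented for this example
dbWriteTable(con, "patients", data.frame(
  patient_id = 1:4,
  age        = c(54, 67, 71, 49),
  stage      = c("I", "II", "IV", "IV")
))

# A lazy reference to the remote table; no rows are pulled yet
patients <- tbl(con, "patients")

# Plain dplyr verbs; dbplyr builds the SQL behind the scenes
older_stage_iv <- patients %>%
  filter(stage == "IV", age > 60) %>%
  select(patient_id, age)

# Print the SQL that dbplyr generated
show_query(older_stage_iv)

# Run the query on the database and bring the result into R
result <- collect(older_stage_iv)

dbDisconnect(con)
```

Because the query stays lazy until `collect()`, the filtering runs on the database rather than in the R session, and the same dplyr code can be pointed at a different backend by changing only the connection.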

Robust science

Embedded provenance

Extensible (by users)

Automated checks of data

High unit test coverage

Step 2: Share, collaborate and play


Collaborative community


Adopting the Tidyverse concept unified our non-methods codebase

Data science is evolving. Fast. Acknowledging and protecting time to learn and play with new languages and libraries is key.

Moving to a code-sharing culture requires work, but tools like Slack, GitHub and Discourse can foster that community