The role of R stuff in becoming data scientists

Dr James Black | Associate Group Director

Personalised Healthcare Data Science | Roche / Genentech

The underlying drivers

The growth of MDAS

MDAS within Roche

Data types in Personalised Health Care (PHC)

The expansion of toolsets

The infrastructure we used has evolved

Admin managed

r.roche.com

Epidemiologists / Analysts / Statisticians

Data Scientists

The challenges

  • Pharma is traditionally bad at sharing code
    • Code shares can be file servers or wikis...
    • Sharing code is seen as a risk
    • We need to share and collaborate
  • Data Scientists need to be flexible
    • Match the languages used in each analysis to the methods and data types
    • There is a lot of training overhead in being a polyglot
    • Infra code should be centralised

Step 1: Wrap infrastructure in a shared code base

 

Abstracting infrastructure

RWDSverse

RocheTemplates

Making it easy for scientists to derive insights from the Flatiron data

For humans

No SQL (dbplyr)

Reuse code for variable definitions, tables, plots

Vignette driven docs

Robust science

Embedded provenance

Extensible (by users)

Automated checks of data

High unit test coverage
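
To illustrate the "No SQL (dbplyr)" point above, here is a minimal sketch of how dplyr verbs can be run against a database table without writing SQL by hand. The in-memory SQLite database, the `patients` table and its columns are hypothetical stand-ins chosen for illustration; they are not the Flatiron schema or the RWDSverse API.

```r
library(dplyr)
library(dbplyr)

# Hypothetical stand-in data: an in-memory SQLite table, not real Flatiron data.
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbWriteTable(con, "patients", data.frame(
  patient_id = 1:6,
  age        = c(54, 67, 71, 45, 80, 66),
  diagnosis  = c("NSCLC", "NSCLC", "CRC", "CRC", "NSCLC", "CRC")
))

# tbl() creates a lazy reference; no rows are pulled into R yet.
patients <- tbl(con, "patients")

# Ordinary dplyr verbs; dbplyr translates them to SQL behind the scenes.
older_by_dx <- patients %>%
  filter(age >= 65) %>%
  count(diagnosis) %>%
  arrange(diagnosis)

show_query(older_by_dx)         # inspect the generated SQL
result <- collect(older_by_dx)  # run the query and return a local tibble

DBI::dbDisconnect(con)
```

The query is only sent to the database when `collect()` is called, so the same pipeline can be reused or extended before any data is moved.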

Step 2: Share, collaborate and play

 

Collaborative community

Summary

'ing the Tidyverse concept unified our non-methods codebase

Data science is evolving. Fast. Acknowledging and protecting time to learn and play with new languages and libraries is key.

Moving to a code sharing culture requires work, but tools like Slack, GitHub and Discourse can foster that community.