The GithubMetrics package

R package
Open Source
Easy access to GithubMetrics via a gh wrapper
Author

James Black

Published

January 28, 2021

Modified

January 28, 2021

Caution

This package has been superseded by the gitstats R package

At work I manage a data science team, and the backbone of our work is an on-premise GitHub server. It holds our research code, as well as pan-study code (e.g. packages and libraries). We use the GitHub API to keep on top of this codebase, and to make that easier I threw some of the functions I use into an R package called GithubMetrics.

The aim of this package is to provide a wrapper on gh that quickly gets you the key GitHub repo information you need. The code here is used within Roche to let me quickly answer simple questions about our repos.


Setup

# devtools::install_github("OpenPharma/GithubMetrics")
library(GithubMetrics)
library(glue)
library(tidyverse)

organisation <- "openpharma"

Info on the repos

Quickly pull info on all the repos in a particular org. Here I look at the organisation called OpenPharma.

repos <- organisation %>%
  gh_repos_get() %>%
  gh_repos_clean()

repos %>%
  mutate(days_since_updated = Sys.Date() - as.Date(updated_at)) %>%
  arrange(days_since_updated) %>%
  select(name, language, MB, days_since_updated) %>%
  knitr::kable()

| name | language | MB | days_since_updated |
|:--|:--|--:|:--|
| GithubMetrics | R | 0.1 | 0 days |
| BBS-causality-training | R | 0.0 | 1 days |
| visR | HTML | 20.8 | 1 days |
| facetsr | R | 2.1 | 61 days |
| CTP | R | 0.9 | 85 days |
| simaerep | R | 77.6 | 86 days |
| ReadStat | C | 1.8 | 126 days |
| visR-docs | Unsure | 5.3 | 131 days |
| sas7bdat | Python | 0.1 | 141 days |
| syntrial | R | 0.3 | 199 days |
| icd_hierarchies | Unsure | 0.0 | 267 days |
| pypharma_nlp | Jupyter Notebook | 28.0 | 289 days |
| RDO | R | 0.5 | 327 days |
| openpharma.github.io | JavaScript | 0.9 | 1315 days |
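
The staleness column above is plain date arithmetic. Here is a minimal sketch in base R, assuming `updated_at` arrives as an ISO 8601 string. The toy data frame and the fixed `today` are made up for illustration, so the result is reproducible:

```r
# Toy stand-in for the cleaned repo table; names and timestamps are invented.
repos_demo <- data.frame(
  name = c("repo-a", "repo-b", "repo-c"),
  updated_at = c(
    "2020-11-28T12:30:00Z",
    "2021-01-28T09:00:00Z",
    "2021-01-27T17:45:00Z"
  ),
  stringsAsFactors = FALSE
)

# Fixed reference date instead of Sys.Date() so the example is reproducible.
today <- as.Date("2021-01-28")

# as.Date() reads the leading YYYY-MM-DD and ignores the time part.
# Subtracting two Dates gives a difftime in days; as.integer() drops the unit.
repos_demo$days_since_updated <-
  as.integer(today - as.Date(repos_demo$updated_at))

# Most recently updated repos first.
repos_demo <- repos_demo[order(repos_demo$days_since_updated), ]
print(repos_demo)
```

Swapping `today` for `Sys.Date()` gives the live version used in the pipeline above.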

Get all commits

Now I can pull all the commits on the main branch across repos in that org.

repo_all_commits <- repos %>%
  filter(size > 0) %>% # make sure has some commits
  pull(full_name) %>%
  gh_commits_get(
    days_back = 365*10
  )

## Pulling commits looking back to 2011-02-02
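
The date in that message follows from `days_back`: 365 * 10 days is three days short of ten calendar years, because the window spans three leap days. Below is a minimal sketch of the arithmetic, assuming the page was rendered on 2021-01-30 (which the message implies). It also assumes the cutoff is passed on as an ISO 8601 timestamp, the format that the `since` parameter of GitHub's commits endpoint expects. `cutoff_date()` is a hypothetical helper, not part of GithubMetrics:

```r
# Hypothetical helper: turn a look-back window in days into a cutoff date.
cutoff_date <- function(days_back, today = Sys.Date()) {
  today - days_back
}

# Assumed render date, inferred from the message printed above.
cutoff <- cutoff_date(365 * 10, today = as.Date("2021-01-30"))
print(cutoff)

# ISO 8601 timestamp at midnight UTC, e.g. for a `since` query parameter.
since <- format(cutoff, "%Y-%m-%dT00:00:00Z")
print(since)
```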

repo_all_commits %>%
  filter(!author %in% c(".gitconfig missing email","actions-user")) %>%
  mutate(
    repo = gsub("openpharma/","",full_name)
  ) %>%
  group_by(repo) %>%
  summarise(
    commits = n(),
    contributors = n_distinct(author),
    last_commit = max(as.Date(datetime))
  ) %>%
  arrange(desc(commits)) %>%
  knitr::kable()

| repo | commits | contributors | last_commit |
|:--|--:|--:|:--|
| ReadStat | 992 | 13 | 2020-09-04 |
| visR | 380 | 14 | 2020-11-19 |
| pypharma_nlp | 110 | 1 | 2020-04-16 |
| sas7bdat | 86 | 7 | 2020-09-10 |
| RDO | 42 | 1 | 2020-03-09 |
| GithubMetrics | 19 | 1 | 2021-01-30 |
| CTP | 6 | 3 | 2020-10-19 |
| simaerep | 6 | 1 | 2020-11-05 |
| BBS-causality-training | 3 | 1 | 2021-01-29 |
| visR-docs | 3 | 1 | 2020-09-21 |
| openpharma.github.io | 2 | 1 | 2017-06-25 |
| syntrial | 2 | 1 | 2020-07-15 |
| facetsr | 1 | 1 | 2020-11-30 |

Get visR commits

Next, I dig into a single repo, the R package visR.

visr_all_commits <- "OpenPharma/visR" %>%
  gh_commits_get(
    days_back = 365*10
  ) %>%
  mutate(date = as.Date(datetime))

## Pulling commits looking back to 2011-02-02

visr_all_commits %>%
  filter(!author %in% c(".gitconfig missing email")) %>%
  ggplot(aes(x = date)) +
  stat_bin(aes(y = cumsum(after_stat(count))), geom = "step", binwidth = 1) +
  ggthemes::theme_hc() +
  labs(
    x = "Date",
    y = "Commits",
    title = "Cumulative commit count for OpenPharma/visR",
    subtitle =
      glue("{nrow(visr_all_commits)} commits were made to master since project started (First commit: {min(visr_all_commits$date)})"),
    caption = paste0("Data collected on ",Sys.Date())
  )
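
The `cumsum()` inside `stat_bin()` turns per-day commit counts into a running total. Here is a minimal sketch of the same calculation in base R, on made-up commit dates, in case you want the numbers rather than the plot:

```r
# Invented commit dates; in the real data this would be visr_all_commits$date.
commit_dates <- as.Date(c(
  "2020-01-01", "2020-01-01", "2020-01-03",
  "2020-01-06", "2020-01-06", "2020-01-06"
))

# Every calendar day in the range, so days without commits count as zero.
all_days <- seq(min(commit_dates), max(commit_dates), by = "day")

# Commits per day, including the empty days.
daily <- table(factor(format(commit_dates), levels = format(all_days)))

# Running total, the quantity drawn as a step line in the plot.
cumulative <- data.frame(
  date = all_days,
  commits = as.integer(daily),
  total = cumsum(as.integer(daily))
)
print(cumulative)
```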

Who has been contributing to visR?

contributors <- visr_all_commits %>%
  filter(!author %in% c(".gitconfig missing email","actions-user")) %>%
  group_by(author) %>%
  summarise(
    commits = n()
  )

contributors <- contributors %>%
  left_join(
    gh_user_get(contributors$author),
    by = c("author"="username")
  )

contributors %>%
  arrange(-commits) %>%
  mutate(
    last_on_github = Sys.Date() - last_active,
    contributor = glue('<img src="{avatar}" alt="" width="30"> {author}'),
    blog = case_when(
      blog == "" ~ "",
      TRUE ~ as.character(glue('<a href="{blog}">link</a>'))
      )
    ) %>%
  select(contributor,commits,name,last_on_github,company,location,blog) %>%
  knitr::kable(
    caption = "People that have contributed to visR master"
  )

| contributor | commits | name | last_on_github | company | location | blog |
|:--|--:|:--|:--|:--|:--|:--|
| SHAESEN2 | 127 | Steven Haesendonckx | 16 days | | | |
| bailliem | 109 | Mark Baillie | 0 days | | Basel, CH | link |
| epijim | 69 | James Black | 1 days | Roche | Basel, Switzerland | link |
| Jonnie-Bevan | 25 | | 59 days | | | |
| cschaerfe | 21 | Charlotta | 114 days | | | |
| diego-s | 12 | Diego S | 251 days | | | |
| rebecca-albrecht | 4 | | 1 days | | | |
| dazim | 3 | Tim Treis | 19 days | | Heidelberg, Germany | |
| kentm4 | 3 | Matt Kent | 4 days | Genesis Research | | |
| kawap | 2 | | 285 days | Roche / 7N | | |
| thomas-neitmann | 2 | Thomas Neitmann | 14 days | Roche | Basel, Switzerland | link |
| galachad | 1 | Adam Foryś | 16 days | @Roche | Warsaw, Poland | link |
| ginberg | 1 | | 10 days | | Remote | link |
| thanos-siadimas | 1 | | 68 days | | | |
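
The `left_join()` keeps every contributor, even when no profile comes back for them. Here is a minimal sketch of that behaviour with base R's `merge()`. The authors and profile fields below are invented:

```r
# Invented commit log: one row per commit.
commits_demo <- data.frame(
  author = c("alice", "bob", "alice", "carol", "alice"),
  stringsAsFactors = FALSE
)

# Commits per author.
counts <- as.data.frame(
  table(author = commits_demo$author),
  responseName = "commits",
  stringsAsFactors = FALSE
)

# Invented profile lookup; note there is no row for "carol".
profiles <- data.frame(
  username = c("alice", "bob"),
  company = c("Roche", ""),
  stringsAsFactors = FALSE
)

# all.x = TRUE is the left join: unmatched authors stay, with NA profile fields.
joined <- merge(counts, profiles,
                by.x = "author", by.y = "username", all.x = TRUE)
joined <- joined[order(-joined$commits), ]
print(joined)
```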

Explore the files present

Now use the API to explore files present in head across repos in this org. Just for fun I’ll compare R to Python files present.

repo_files <- gh_repo_files_get(
  repo_commits = repo_all_commits,
  only_last_commit = TRUE
)

## Pulling files in latest commit from 13 repos

repo_files %>%
  group_by(repo) %>%
  summarise(
    Files = n(),
    `R files` = sum(lang %in% "R"),
    `Python files` = sum(lang %in% c("Python","Jupyter Notebook"))
  ) %>%
  mutate(
    Language = case_when(
      `R files` > `Python files` ~ "R",
      `R files` < `Python files` ~ "Python",
      TRUE ~ "?"
    )
  ) %>%
  knitr::kable(
    caption = "Types of files in the organisation"
  )

| repo | Files | R files | Python files | Language |
|:--|--:|--:|--:|:--|
| openpharma/BBS-causality-training | 4 | 2 | 0 | R |
| openpharma/CTP | 100 | 30 | 0 | R |
| openpharma/facetsr | 63 | 13 | 0 | R |
| openpharma/GithubMetrics | 44 | 22 | 0 | R |
| openpharma/openpharma.github.io | 76 | 1 | 0 | R |
| openpharma/pypharma_nlp | 131 | 0 | 49 | Python |
| openpharma/RDO | 105 | 11 | 0 | R |
| openpharma/ReadStat | 207 | 0 | 0 | ? |
| openpharma/sas7bdat | 8 | 0 | 2 | Python |
| openpharma/simaerep | 145 | 32 | 0 | R |
| openpharma/syntrial | 67 | 24 | 0 | R |
| openpharma/visR | 177 | 81 | 0 | R |
| openpharma/visR-docs | 185 | 0 | 0 | ? |
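
The `?` rows are repos where neither R nor Python files dominate (ReadStat, for example, is listed as C in the repo table). Here is a rough sketch of the idea, assuming files are labelled by extension. How GithubMetrics actually fills the `lang` column is not shown here, and both `classify_lang()` and `dominant_language()` are hypothetical helpers with an invented extension mapping:

```r
# Hypothetical helper: label a file path by its extension.
classify_lang <- function(path) {
  ext <- tolower(tools::file_ext(path))
  lang <- rep("Other", length(path))
  lang[ext %in% c("r", "rmd")] <- "R"
  lang[ext == "py"] <- "Python"
  lang[ext == "ipynb"] <- "Jupyter Notebook"
  lang
}

# Hypothetical helper: same tie-break as the case_when() above.
dominant_language <- function(langs) {
  r_files <- sum(langs %in% "R")
  python_files <- sum(langs %in% c("Python", "Jupyter Notebook"))
  if (r_files > python_files) {
    "R"
  } else if (r_files < python_files) {
    "Python"
  } else {
    "?"
  }
}

files <- c("R/plot.R", "notebooks/eda.ipynb", "README.md", "setup.py")
labels <- classify_lang(files)
print(labels)
print(dominant_language(labels))
```

A repo with only C sources and a README ends up with zero R and zero Python files, which is why it falls through to `?`.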

Search for code

Finally, a toy example of searching for code. Note that it is a plain-text search, so there will be false positives, particularly if the package name is common (I think that’s less of an issue here).

helper_gh_repo_search <- function(x, org = "openpharma"){

  ## Slow it down, as search has a rate limit of 30 calls a minute.
  ## If you are on-prem, the search rate limit is higher, so this is usually not needed.
  if (interactive()) {
    message("Wait 5 seconds")
  }
  Sys.sleep(5)
  ## End slow down

  results <- gh_repo_search(
    code = x,
    organisation = org
  )

  ## No hits: the search comes back as NA rather than a data frame, so
  ## return nothing (map_df() drops NULL results).
  if (!is.data.frame(results)) {
    return(NULL)
  }
  results %>%
    mutate(Package = x, Organisation = org) %>%
    group_by(Organisation,Package) %>%
    summarise(
      Repos = n_distinct(full_name), .groups = "drop"
    )
}

packages <- c(
  "tidyverse","pkgdown","dplyr","data.table"
  )

package_use <- bind_rows(
  packages %>%
    map_df(
      helper_gh_repo_search, org = "openpharma"
    ),
  packages %>%
    map_df(
      helper_gh_repo_search, org = "AstraZeneca"
    ),
  packages %>%
    map_df(
      helper_gh_repo_search, org = "Roche"
    ),
  packages %>%
    map_df(
      helper_gh_repo_search, org = "Genentech"
    ),
  packages %>%
    map_df(
      helper_gh_repo_search, org = "Novartis"
    )
)

## tidyverse does not appear in AstraZeneca.
## pkgdown does not appear in AstraZeneca.
## data.table does not appear in AstraZeneca.
## query = 'data.table in:file  user:AstraZeneca'
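
The last message shows the query sent to GitHub's code search, and the 5 second pause comes from the rate limit mentioned in the helper. Here is a minimal sketch of both pieces, assuming the query is assembled by pasting the search term together with the `in:file` and `user:` qualifiers. `build_query()` and `min_delay()` are hypothetical helpers, not GithubMetrics functions:

```r
# Hypothetical helper: plain-text code search restricted to one organisation.
build_query <- function(code, org) {
  paste0(code, " in:file user:", org)
}

# Hypothetical helper: shortest pause (in seconds) that respects a rate limit.
min_delay <- function(calls_per_minute) {
  60 / calls_per_minute
}

query <- build_query("data.table", "AstraZeneca")
print(query)

delay <- min_delay(30)
print(delay)
```

With a limit of 30 calls a minute, any pause above 2 seconds stays under the limit, so the 5 second sleep is on the safe side.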

package_use %>%
  pivot_wider(names_from = "Package", values_from = "Repos") %>%
  mutate(Total = rowSums(.[,-1], na.rm = TRUE)) %>%
  arrange(-Total) %>%
  knitr::kable(
    caption = "Package use detected within repositories in Pharma orgs"
  )
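
| Organisation | tidyverse | pkgdown | dplyr | data.table | Total |
|:--|--:|--:|--:|--:|--:|
| Novartis | 4 | 6 | 10 | 12 | 32 |
| openpharma | 4 | 6 | 6 | 2 | 18 |
| Roche | 3 | 3 | 2 | 3 | 11 |
| Genentech | 3 | 2 | 3 | 3 | 11 |
| AstraZeneca | | | 1 | | 1 |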