Publish or perish - there's a model for that

The h-index

A rather contentious talking point in academia is the belief that it’s publish or perish. Fanning the flames of this debate are measures like the H-index [1], which attempt to distill a researcher’s impact on the body of knowledge into a single metric.

To explain the H-index, I’ll rely on the quote below, which was written by Daniel Acuna.

One popular measure of success is physicist Jorge Hirsch’s h-index, which captures the quality (citations) and quantity (number) of papers, thus representing scientific achievements better than either factor alone. A scientist has an h-index of n if he or she has published n articles receiving at least n citations each. Einstein, Darwin and Feynman, for example, have impressive h-indices of 96, 63 and 53, respectively. According to Hirsch, an h-index of 12 for a physicist — meaning 12 papers with at least 12 citations each — could qualify him or her for tenure at a major university.
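
To make that definition concrete, here is a minimal sketch in base R. The function name `h_index_from_cites` and the citation counts are made up for illustration; they don't come from Acuna's paper or from the scholar package.

```r
# Compute an h-index from a vector of per-paper citation counts.
h_index_from_cites <- function(cites) {
  # Rank papers from most to least cited
  cites <- sort(cites, decreasing = TRUE)
  # Count the ranks at which the paper has at least as many citations as its rank
  sum(cites >= seq_along(cites))
}

# Hypothetical citation counts for six papers
example_cites <- c(25, 8, 5, 3, 3, 0)
h_index_from_cites(example_cites)
```

Three of these papers have at least three citations each, but there are not four with at least four, so the h-index here is 3.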

Acuna and his co-authors developed a model to predict future H-index [2], based on previous publications and citations. Their sample contained 3,085 neuroscientists, 57 Drosophila researchers and 151 evolutionary scientists. From this population they built, and published, a model that explains about the same amount of variation in the outcome as the cardiovascular disease (CVD) risk score that is recommended for use in UK practices.

Unlike a CVD risk score though, there have been no external validations, so we don’t know how well this model performs outside of the 3,000ish neuroscientists…

So with a plethora of caveats in hand - like always - there is an R package to connect this model to people’s Google Scholar profiles.

I should stress now that I consider this model, and the H-index in general, a fun metric to play with, and not a true measure of the relevance and impact of someone’s research.

The package

So first the packages needed to run this analysis.

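library(devtools) # needed for install_github()
install_github("jkeirstead/scholar") # latest version of the Google Scholar package
library(scholar) # get_profile(), compare_scholar_careers(), predict_h_index()
library(xkcd) # xkcd graphs!
library(extrafont) # xkcd fonts
library(ggplot2) # best package ever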

And check the connection with Google Scholar. For the id, it’s the bit after user= in your Google Scholar profile URL.

id_j <- '-S7V41QAAAAJ'
james <- get_profile(id_j)
james$name        # me!
james$total_cites # 1129 (although only as I'm one of 300 people
                  # on the WHO Global Burden of Disease paper)
james$h_index     # 5
james$i10_index   # 3
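
If you only have the profile URL to hand, the id can be pulled out with a regular expression. This is a minimal sketch in base R: `extract_scholar_id` is a hypothetical helper of my own naming, not part of the scholar package, and it assumes the id sits in a `user=` query parameter as described above.

```r
# Pull the Google Scholar id out of a profile URL.
# Hypothetical helper: assumes the id is the value of the "user=" query parameter.
extract_scholar_id <- function(url) {
  # Capture everything after "user=" up to the next "&", "#" or end of string
  match <- regmatches(url, regexec("[?&]user=([^&#]+)", url))[[1]]
  if (length(match) < 2) NA_character_ else match[2]
}

url <- "https://scholar.google.com/citations?user=-S7V41QAAAAJ&hl=en"
extract_scholar_id(url)
```

The result can then be passed straight to `get_profile()`. URLs without a `user=` parameter return `NA` rather than an error.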

And now I’ll scrape the page of one of my two supervisors.

id_s <- '6WC1bewAAAAJ'
simon <- get_profile(id_s)
simon$name        # Simon Griffin MBBS MSc DM FRCGP
simon$total_cites # 9673
simon$h_index     # 50
simon$i10_index   # 118

Compare academics

The Google Scholar package allows you to compare the citations for different authors over time. In the following code block I pull the citations per year for myself, one of my supervisors, and the Winton Professor of the Public Understanding of Risk (an academic at Cambridge with one of my dream jobs).

ids <- c('oz7MFu0AAAAJ', id_s, id_j)

# Compare their career trajectories, based on year of first citation
df <- compare_scholar_careers(ids)

# Axis ranges for the xkcd-style axes
xrange <- range(df$career_year)
yrange <- range(df$cites)

ggplot(df, aes(x=career_year, y=cites)) +
  geom_line(aes(colour=name)) + # one line per academic
  xkcdaxis(xrange, yrange) +    # xkcd-style hand-drawn axes
  ylab("Citations that year") +
  xlab("Year since first citation") +
  ggtitle(expression(atop("Comparing citations",
                          atop(italic("Cam's Prof of Risk, my supervisor and me"), "")))) +
  theme(legend.position=c(0.3, 0.85))

Citations by year.

So my career is quite a bit behind - but as I’m only in the 2nd year of my PhD, I’m also not really an academic yet.
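
The comparison only works because each career is lined up on the year of its first citation. A minimal sketch of that alignment in base R, using made-up numbers: the data frame below is hypothetical, and I'm assuming the package computes `career_year` in roughly this way rather than quoting its source.

```r
# Hypothetical citations per calendar year for two authors
cites <- data.frame(
  id    = c("a", "a", "a", "b", "b"),
  year  = c(2001, 2002, 2003, 2012, 2013),
  cites = c(3, 10, 25, 1, 4)
)

# Career year = calendar year minus that author's first cited year
cites$career_year <- cites$year - ave(cites$year, cites$id, FUN = min)
cites
```

Both authors now start at career year 0, so their curves can share one x-axis even though their first citations are a decade apart.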

Predict into the future

Now to the H-index. The following code will use the Google Scholar profiles of myself and the two other people whose data I pulled to calculate the predicted trajectories of our H-indexes over the next 10 years. As I mentioned before - this model has not been validated in epidemiologists, so this is just for interest.

## Predict our H-indexes
simon_h <- predict_h_index(id_s)
simon_h$name <- "Prof Griffin"
james_h <- predict_h_index(id_j)
james_h$name <- "James Black"
david_h <- predict_h_index("oz7MFu0AAAAJ")
david_h$name <- "Prof Spiegelhalter"

# Stack the three predictions into one data frame
h_index <- rbind(simon_h, james_h, david_h)

Now I have a dataframe with predicted H-indexes, so I can plot them. Using an XKCD theme of course.

# Axis ranges for the xkcd-style axes
xrange <- range(h_index$years_ahead)
yrange <- range(h_index$h_index)

ggplot(h_index, aes(x=years_ahead, y=h_index)) +
  geom_line(aes(colour=name)) + # one line per academic
  xkcdaxis(xrange, yrange) +    # xkcd-style hand-drawn axes
  ylab("Predicted H index") +
  xlab("Years from 2014") +
  ggtitle(expression(atop("Predicting H index over the next 10 years",
                          atop(italic("For Cam's Prof of Risk, my supervisor and me"), "")))) +
  theme(legend.position=c(0.60, 0.40))

Predicted H-index from April 2014 onwards.

So according to this model my H-index should increase from its current level of 5 to 16 in 10 years’ time. Ten years from now it will be 8.5 years since I submitted my PhD, so I would be incredibly stoked to have 16 papers with at least 16 citations each. Better yet, it suggests my current trajectory seems to avoid perish. Sadly, I don’t have much faith in the prophetic abilities of this model, especially as it’s based on only a handful of publications.
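
For a sense of scale, the arithmetic behind that prediction is simple enough to sketch. The variable names are mine; the two H-index values are the ones quoted above.

```r
current_h   <- 5   # my H-index today
predicted_h <- 16  # the model's prediction 10 years ahead
horizon     <- 10  # years

# Average gain in H-index per year implied by the prediction
yearly_gain <- (predicted_h - current_h) / horizon
yearly_gain
```

That works out at roughly one extra point of H-index every year for a decade.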

I’m also really hoping no-one at a funding body or fellowship board gets carried away with this model. I’m afraid that the big winners from these sorts of scores becoming more important would be the vanity publishers like Hindawi and Dove Press, who will publish anything for a fee.

  1. Hirsch. Proc. Natl Acad. Sci. USA 102, 16569–16572 (2005)

  2. Acuna. Nature 489, 201–202 (2012)
