“Shakespeareomics” – how scientists are unlocking the secrets of the Bard

Science has an enormous and complementary role to play in humanities-based research.

The year 2016 is the 400th anniversary of the death of William Shakespeare. Many events are planned to celebrate his life and we can look forward to a new wave of scholarship and debate that dissects his work.

Of course, Shakespeare is at the heart of the “humanities”, those subjects which seek to explain humankind’s separate standing from the rest of life. Scientific research might appear, at first glance, to stand aside from the humanities. It is hard, objective scientific fact that we aspire to learn. Evolutionary biology’s greatest achievement was to find humankind’s place among other life forms, rather than stand us aside.

But science has an enormous and complementary role to play in humanities-based research. I direct an organisation called “Glasgow Polyomics”. Mainly we look at the underlying composition of different biological systems. Many of our projects, for example, seek differences in genomes, or other biological markers, that distinguish healthy and sick people. We aim to diagnose disease and facilitate its treatment.

However, we are increasingly applying similar techniques to the humanities. For example, we are about to embark upon a project with friends in the Hunterian Museum and College of Arts in Glasgow, seeking insight from sequencing microbial DNA present in anatomical specimens preserved from the eighteenth century in William Hunter’s extraordinary collection here. Who had tuberculosis? How widespread was syphilis?

We anticipate learning how the bacteria that caused those diseases several hundred years ago have evolved. We also have projects assessing the authenticity of some manuscripts purporting to come from Robert Burns and dissecting the provenance of key textile collections.

These kinds of advanced forensic and archaeological science have also been considered for Shakespeare. As part of the 400th anniversary celebrations, for example, his will is going to be displayed at Somerset House in London alongside findings from analyses of the document’s paper and ink that help authenticate its origin, in much the same way we have investigated possible forgeries of the poems of our very own Scottish bard, Robert Burns.

Elsewhere, since others have wondered whether his esoteric writing might have been influenced by his being high on drugs, a study of chemical traces found in clay pipes dug up in Stratford-upon-Avon revealed no sign of cannabis, providing some evidence against the “High Bard” theory.

The holes in the science (for example, there is no way of really linking those pipes directly to Shakespeare) don’t really challenge scientific rigour, but the study does show how the application of the kinds of profiling technologies we use has unbounded potential. Since the successful use of DNA sequencing technology to identify the bones of King Richard III found beneath a car park in Leicester, others have suggested that exhuming bones from beneath the chancel in the Holy Trinity Church in Stratford, where Shakespeare was buried, could be equally enlightening. The playwright himself, however, would have deplored the idea. His tombstone starkly warns:

“Good frend for Jesus sake forebeare,

To digg the dust encloased heare;

Bleste be the man that spares thes stones,

And curst be he that moves my bones.”

Towards the end of last year I saw Benedict Cumberbatch playing Hamlet and then Justin Kurzel’s fantastic film version of Macbeth. In chatting about the performances with literary friends afterwards, I was struck by the similarities in research methodology employed in the humanities and science. Literary scholars dissect texts seeking clues which they reinforce with meta-data coming from biographical details of the author and historical facts about contemporary knowledge and beliefs. Was, for example, Ophelia pregnant by Hamlet when she committed suicide?

Scholars can isolate phrases in the text that point to that, and then apply other data to try and prove the point. Why, for example, did she hold back the herb rue for herself when distributing flowers and herbs to members of the Court? Rue was a common abortifacient in Elizabethan times. Her apparently madness-driven ramblings also lend themselves to an interpretation that Hamlet had cajoled her into pre-marital sex. Hamlet too offers clues that he has sullied her virtue. And then Lady Macbeth. Could she, herself, have been a witch?

Essentially, Shakespearean scholars string together a multitude of fragmented clues until a credible hypothesis can be made. Most of science too has classically depended upon collating obscure clues gleaned from abstract experimentation until an incontrovertible narrative appears. Our ability to collect data on unprecedented scales has now led to an appreciation of how the “Big Data” approach to hypothesis generation works.

Instead of digging around for subtle clues as to what might cause a particular kind of cancer, for example, we can sequence the genomes of cancer cells from a multitude of people and see what features they have in common not found in people without that particular kind of cancer.

Big data research applies to the humanities too. In the case of “Shakespeareomics” the convergence of data-driven methodologies is beginning to yield fruit as computers mine through texts with unprecedented power.

Take, for example, the fundamental question as to the authorship of Shakespeare’s plays. For many years there has been debate as to whether a relatively poorly educated lad and lowly actor from Stratford could really have written such prescient drama. Cases have been made that it was one of his contemporaries, possibly Francis Bacon or Christopher Marlowe or the well-travelled aristocrat Edward de Vere, Earl of Oxford, who actually wrote the plays.

Recently, Sung-Hou Kim, a scientist based at the University of California, Berkeley, and his team applied a technique known as “feature frequency profiling” (FFP) to seek relationships between different types of text. Their method had already been shown to make very clear and accurate depictions of the relationships between different types of organism based on their genome sequences. After removing all punctuation and spaces from various texts (turning each into one long string of letters), they sought patterns by moving a window of eight letters across the text.

Features were ranked according to the frequency with which they appeared, then relationships were sought between the top-ranked features, yielding a score for how closely related different texts are. The approach readily clustered different types of book (religious texts, nineteenth century novels, philosophy, mythology, science and children’s fiction).
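The windowing and ranking steps described above can be sketched in a few lines of Python. This is only an illustration of the general idea, not the Berkeley team’s implementation: the function names, the default of 1,000 top-ranked features and the use of the Jensen-Shannon divergence as the relatedness score are all assumptions made for the sketch.

```python
import math
import re
from collections import Counter


def feature_profile(text, window=8):
    """Strip everything but letters, then count every run of `window` letters."""
    letters = re.sub(r"[^a-z]", "", text.lower())
    return Counter(letters[i:i + window] for i in range(len(letters) - window + 1))


def top_features(profile, top_n):
    """Keep the top_n most frequent features as relative frequencies."""
    ranked = profile.most_common(top_n)
    total = sum(count for _, count in ranked)
    return {feature: count / total for feature, count in ranked}


def profile_distance(text_a, text_b, window=8, top_n=1000):
    """Assumed score: Jensen-Shannon divergence (0 = identical, 1 = nothing shared)."""
    p = top_features(feature_profile(text_a, window), top_n)
    q = top_features(feature_profile(text_b, window), top_n)
    distance = 0.0
    for feature in set(p) | set(q):
        pa, qa = p.get(feature, 0.0), q.get(feature, 0.0)
        mid = (pa + qa) / 2
        if pa:
            distance += 0.5 * pa * math.log2(pa / mid)
        if qa:
            distance += 0.5 * qa * math.log2(qa / mid)
    return distance
```

Calling profile_distance on every pair of texts in a collection gives a table of scores from which a relationship tree could then be built. Very short texts need a smaller window than eight to share any features at all.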

Figure 1. FFP relationship trees depicting the relations between book genres. Taken from here.

Next they compared the writings of Shakespeare and a number of his contemporaries. Most of the “Shakespeare” plays clustered with each other while those attributable to Marlowe and Bacon were separate. At least one play credited to Shakespeare (“Pericles”) appeared to follow a different FFP, casting doubt on its provenance, while another (“The Two Noble Kinsmen”), for which doubt over his authorship has existed, appeared in the Bard’s cluster.

Figure 2. FFP relationship tree showing the links between Shakespeare’s plays and those of his contemporaries. Taken from here.

In seeking to explain the possibilities offered by “infinity” it is whimsically stated that a monkey, given a typewriter and infinite time, could create the complete works of Shakespeare. A few years ago students at the University of Plymouth went some way to testing this hypothesis when they set six macaques to work with a computer.

After a few days, long repeats of the letter “s” were the most popular occupants of the five pages of text created. Nothing resembling even a couple of Shakespearean words emerged. The experiment was abandoned when the keyboard gave out, having been urinated upon by one of the scholarly primates. Modern technology has enabled Jesse Anderson, an American computer programmer, to bypass such confounding simian behaviour by setting up millions of virtual monkeys typing random letters in nine-letter strings across millions of virtual keyboards, with constant monitoring of the letter strings against the complete works.

Every time a word appears that matches the complete works it is isolated from the random morass and placed into context. In this way the complete works did indeed emerge within just a few weeks of random typing back in 2011. I’d argue that the initial expectation from the infinite monkey was to get the words in order without having to pluck them from trillions of random letter strings. The power of massively parallel computing, however, can be appreciated.
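The checking step can be illustrated with a toy Python sketch. It is not Anderson’s program: the function name, its parameters and the cap on attempts are hypothetical, and the string length is left adjustable because true nine-letter strings over a 26-letter alphabet would take far too long to match on a single laptop.

```python
import random
import re


def monkey_coverage(target, string_length=9, alphabet="abcdefghijklmnopqrstuvwxyz",
                    max_attempts=100_000, seed=0):
    """Type random strings and tick off every place where they match the target."""
    letters = re.sub(r"[^a-z]", "", target.lower())
    # Index each position in the target by the letter string that starts there.
    positions = {}
    for i in range(len(letters) - string_length + 1):
        positions.setdefault(letters[i:i + string_length], []).append(i)
    covered = [False] * len(letters)
    rng = random.Random(seed)  # seeded so that a run can be repeated
    attempts = 0
    while not all(covered) and attempts < max_attempts:
        attempts += 1
        typed = "".join(rng.choice(alphabet) for _ in range(string_length))
        for start in positions.get(typed, []):
            for offset in range(string_length):
                covered[start + offset] = True
    coverage = sum(covered) / len(letters) if letters else 1.0
    return coverage, attempts
```

With a two-letter alphabet and two-letter strings, a target such as “abba” is covered almost at once. With the full alphabet and nine-letter strings, a thousand attempts will almost certainly match nothing, which is why millions of virtual monkeys were needed.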

In another extraordinary example of the ways in which Polyomics-based technologies can impact on the humanities, Shakespeare’s sonnets were translated by Nick Goldman’s team at the European Bioinformatics Institute, near Cambridge, into a code based entirely on the nucleotide sequences that make up DNA. The famous four bases that comprise DNA (A, T, C and G) were used to represent a complex but workable digital code, making it possible to rewrite the sonnets in that code. Furthermore, it was then possible to actually synthesise physical strands of DNA containing that code, send the DNA to a sequencing centre and then back-translate the code into a perfect copy of the sonnets. DNA thus offers a means to physically store masses of written text in a more compact and durable form than conventional computer storage of the same information!
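The principle of writing text in the four letters of DNA can be shown with a deliberately simplified Python sketch. It assumes a naive mapping of two bits per base, which is not the scheme Goldman’s team used; any practical scheme also has to cope with errors in DNA synthesis and sequencing, which this toy version ignores. The function names are hypothetical.

```python
BASES = "ACGT"  # assumed mapping: A=00, C=01, G=10, T=11


def text_to_dna(text):
    """Encode text as UTF-8 bytes, then write each byte as four bases."""
    dna = []
    for byte in text.encode("utf-8"):
        for shift in (6, 4, 2, 0):  # take the byte two bits at a time
            dna.append(BASES[(byte >> shift) & 0b11])
    return "".join(dna)


def dna_to_text(dna):
    """Reverse the mapping: four bases back into one byte, bytes back into text."""
    data = bytearray()
    for i in range(0, len(dna), 4):
        byte = 0
        for base in dna[i:i + 4]:
            byte = (byte << 2) | BASES.index(base)
        data.append(byte)
    return data.decode("utf-8")
```

Under this mapping the letter “S” becomes the four bases CCAT, and decoding the encoded form of a line from a sonnet returns the original line unchanged.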

Recently, too, another young scientist in the USA, J. Nathan Matias, employed a “machine learning” approach using Shakespeare’s Sonnets to train a computer on how to predict which words to string together in creating a sonnet depicting the agonies of a scorned lover. Matias had to work with the computer to provide narrative and vocabulary too, but the result wasn’t at all bad.
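The simplest form of next-word prediction can be sketched in Python as a table of which words follow which in the training text. This is an assumed stand-in for illustration, not a description of the system Matias used, and the function names are hypothetical.

```python
import random
from collections import Counter, defaultdict


def train_bigrams(lines):
    """Count, for each word, which words follow it in the training lines."""
    followers = defaultdict(Counter)
    for line in lines:
        words = line.lower().split()
        for current, following in zip(words, words[1:]):
            followers[current][following] += 1
    return followers


def suggest(followers, word, k=3):
    """Return up to k of the most frequent next words seen after `word`."""
    return [w for w, _ in followers.get(word.lower(), Counter()).most_common(k)]


def generate(followers, start, length=8, seed=0):
    """Chain words together, picking each next word in proportion to its count."""
    rng = random.Random(seed)
    words = [start.lower()]
    while len(words) < length:
        options = followers.get(words[-1])
        if not options:
            break  # the last word was never followed by anything in training
        choices, weights = zip(*options.items())
        words.append(rng.choices(choices, weights=weights)[0])
    return " ".join(words)
```

Trained on all of the sonnets rather than a couple of lines, a table like this offers a human writer a short list of plausible next words at each step, leaving the narrative to the person choosing among them.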

Machine learning has become critical in scientific research in seeking patterns across highly complex datasets that point to chemicals or genome sequences that correlate with medical conditions or responses to drugs.

The value in applying algorithms to plough through huge datasets in ways that can improve health is obvious. But can similar efforts be justified in contemplating whether or not Ophelia was pregnant by Hamlet? These are fictional characters whose virtual existence happened over 400 years ago. Does it matter if Marlowe wrote a few of the plays attributed to Shakespeare, or whether the Bard smoked pot or not? Well, the answer is a resounding yes, for many reasons. Technically, it was in linguistics that some of the earliest pattern-seeking algorithms emerged, and derivatives of these algorithms are those applied to finding meaning in genomes.

Much of today’s stratified medicine, which seeks cures tailored to specific lesions identified in individual patients, depends on algorithms with a direct link to computational humanities. But there is more to our need for humanities research than that. Rigorous human analysis of manuscripts can still offer insight way beyond the cold, almost careless objectivity of a computer. Shakespeare was human.

Maybe he did write a play or two while high on drugs after all, and in so doing his vocabulary deviated from his norm, which would confuse computational analysis. Literary scholarship provides scholars (stretching back to school children) with a forum through which to learn how to dissect an entity from multiple angles. That most human of all traits, imagination, helps to create hypotheses to test. It can be fun. It can be hugely satisfying and, above all, I think, understanding the humanities can help us learn to live with our brain, which after all drives our feelings and makes us all we are.

Take the case of the Australian poet, journalist and TV presenter Clive James, who was diagnosed with chronic lymphocytic leukaemia (CLL) in 2010. Modern science has shown that an enzyme called Bruton’s tyrosine kinase gets turned on constitutively in CLL. Moreover, drug developers have produced a chemical called ibrutinib that inhibits the offending enzyme.

Science has enabled doctors to prolong James’ life to a degree where he confesses to a sense of embarrassment that he is still alive. His writing, in particular his recent volume of poems “Sentenced to Life”, poignantly reveals how the humanities, even more than scientific intervention, relieve his pain. It is how we feel about things and cope with the slings and arrows of outrageous fortune that is central to all we do. Upon learning of the new therapeutic option offered by ibrutinib, James wrote the following lines. The last six words themselves offer alternative meanings for us to debate, and perhaps from which to take our own personalised meaning:

“The week before last my leukaemia came back out of remission, but there’s a new drug all set to fight it. The drug is called ibrutinib. Can you beat that? It’s poetry.”