We covered a fair amount of Natural Language Processing in my Computational Humanities course this spring. As our running example text, I used selections from the epic fantasy series The Wheel of Time. This proved to be a rich source of material for our explorations of how to quantify textual meaning and writing style using computational tools.
After visualizing letter frequency for The Eye of the World, the first book in the series, we set out to determine whether the word distribution in the text matched the distribution predicted by Zipf’s Law.
Word | Frequency |
---|---|
the | 19672 |
and | 8132 |
to | 7382 |
a | 6807 |
he | 6614 |
of | 6383 |
his | 4617 |
in | 4132 |
was | 3838 |
it | 3519 |
Frequency of words in The Eye of the World.
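To make the Zipf comparison concrete, here is a minimal sketch that takes the ten counts from the table above and sets them against the classic form of Zipf's Law, in which the word at rank r is expected to appear about 1/r as often as the top word. The exponent of 1 and the function names are my assumptions for illustration; a careful check would fit the exponent to the whole vocabulary rather than the top ten.

```python
# Top-ten counts copied from the table above (The Eye of the World).
counts = [
    ("the", 19672), ("and", 8132), ("to", 7382), ("a", 6807), ("he", 6614),
    ("of", 6383), ("his", 4617), ("in", 4132), ("was", 3838), ("it", 3519),
]


def zipf_prediction(top_count, rank, s=1.0):
    """Predicted count at `rank` under Zipf's Law with exponent s (assumed 1)."""
    return top_count / rank ** s


def compare_to_zipf(counts, s=1.0):
    """Return (word, rank, observed, predicted) rows; rank is list position."""
    top_count = counts[0][1]
    return [
        (word, rank, observed, zipf_prediction(top_count, rank, s))
        for rank, (word, observed) in enumerate(counts, start=1)
    ]


for word, rank, observed, predicted in compare_to_zipf(counts):
    print(f"{rank:>2} {word:<4} observed={observed:>6} predicted={predicted:>8.0f}")
```

Plotting rank against count on log-log axes is the usual next step, since a Zipf-like distribution shows up there as a roughly straight line.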
For a given text, we calculated how often each word is used. But as is typical, the top words did not reveal much about the content of the document. Removing the most common English words, known as stop words, let us uncover more of the content. From there we moved on to the algorithms behind drawing word clouds, where words are plotted in an image with their size proportional to their frequency in the document.
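The counting and filtering steps can be sketched in a few lines of Python. This is not the course notebook's code: the tokenizer, the tiny stop-word set (just the ten words from the table above), and the sample sentence are all assumptions chosen to keep the example self-contained.

```python
import re
from collections import Counter

# Illustrative stop-word set only (the ten words from the frequency table);
# a real stop-word list is much longer.
STOP_WORDS = {"the", "and", "to", "a", "he", "of", "his", "in", "was", "it"}


def tokenize(text):
    """Lowercase the text and extract runs of letters, keeping inner apostrophes."""
    return re.findall(r"[a-z]+(?:['’][a-z]+)*", text.lower())


def word_frequencies(text, stop_words=frozenset()):
    """Count every token that is not in stop_words."""
    return Counter(word for word in tokenize(text) if word not in stop_words)


# Made-up sample sentence, not a quotation from the books.
sample = "Rand walked to the farm, and the wind followed Rand down the road."
print(word_frequencies(sample).most_common(3))
print(word_frequencies(sample, STOP_WORDS).most_common(3))
```

Keeping apostrophes inside tokens means names like al’Thor and contractions like he’d survive as single words.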
In the end, we developed a first approximation to the Wordle algorithm, using a monospaced font and ignoring the possibility of nesting words inside the nooks and crannies of other letters. And by using a wordlist of English words, we could highlight in red those unique words that typically denote characters or locations. You can follow along with the development and code in this Jupyter Notebook.
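For readers who want the gist without opening the notebook, here is a rough sketch of that kind of layout. It is my own simplification rather than the notebook's code: each word is treated as a rectangle (which is what the monospaced font and the no-nesting rule amount to), words are placed from most to least frequent, and each one walks outward along a spiral until its rectangle stops colliding with those already placed. The glyph aspect ratio, the size limits, the spiral parameters, and the sample counts are all assumed values.

```python
import math

CHAR_ASPECT = 0.6  # assumed width-to-height ratio of one monospaced glyph


def font_size(count, max_count, min_size=10, max_size=80):
    """Size proportional to frequency, with an assumed floor for legibility."""
    return max(min_size, max_size * count / max_count)


def bounding_box(word, size, cx, cy):
    """Rectangle (left, top, right, bottom) for a word centred at (cx, cy)."""
    width = len(word) * size * CHAR_ASPECT
    return (cx - width / 2, cy - size / 2, cx + width / 2, cy + size / 2)


def overlaps(a, b):
    """True if two axis-aligned rectangles share any interior area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def layout(frequencies, english_words, step=0.1, spacing=2.0, max_steps=200_000):
    """Place words, most frequent first, along a spiral starting at the origin.

    Words missing from english_words are coloured red. A word that finds no
    free spot within max_steps is skipped.
    """
    placed = []
    max_count = max(frequencies.values())
    for word, count in sorted(frequencies.items(), key=lambda kv: -kv[1]):
        size = font_size(count, max_count)
        for i in range(max_steps):
            angle = i * step
            radius = spacing * angle
            box = bounding_box(
                word, size, radius * math.cos(angle), radius * math.sin(angle)
            )
            if not any(overlaps(box, other["box"]) for other in placed):
                color = "black" if word in english_words else "red"
                placed.append({"word": word, "size": size, "box": box, "color": color})
                break
    return placed


# Hypothetical counts and wordlist, for illustration only.
cloud = layout({"rand": 50, "said": 40, "moiraine": 20, "door": 10}, {"said", "door"})
for item in cloud:
    print(item["word"], round(item["size"]), item["color"])
```

The result is a list of boxes and colours that any drawing library can render; checking every new word against every placed word is quadratic, which is fine for a cloud of a few hundred words.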
I’m including a word cloud that I generated for each book in the series. A few things to note: the main character of the series, Rand al’Thor, is prominent in each of the clouds, although you can see when the attention shifts from him to the side-stories of other characters. Also, the system of magic in the world is very gendered, hence the high frequency of man and woman in the books. I’ll focus on the shifting cast of characters in a later post, then pick up on the rise of contractions like he’d, i’ve, and you’re.
Spoiler Alert
While these clouds don’t convey anything about the plot directly, they might give away some relevant details.