Creating a Word Cloud from an Academic Paper!
Word clouds are a fun and creative way to show what the most used words are in a book, document, website or basically anything containing words. I wanted to create a word cloud from one of my papers that were published from my PhD. To take it a step further, I wanted to turn it into a lobster cloud because the paper is on a species of spiny lobster. Here is a walkthrough of how I did this…
First, let’s see what we will need to import. Academic papers are usually in a column format, so I used the pdf_layout_scanner package by Yusuke Shinyama to import the pdf into a format that can be read by the computer. It extracts text from pdf’s with multiple columns.
from pdf_layout_scanner import layout_scanner
from nltk import word_tokenize
from nltk.corpus import stopwords
Then we want to import the WordCloud library.
from wordcloud import WordCloud
from PIL import Image
Finally, we will import numpy, pandas and matplotlib
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
Step 1: Parse the pdf file using layout_scanner
pages = layout_scanner.get_pages('lobster.pdf')
Check how many pages we working with
print(len(pages))
44
Created a variable called text to store the first 23 pages of the paper (the rest are references, or images)
text = pages[0:23]
type(text)
list
Our text is currently a list, which we will have to convert into a string. We have 52251 words
text2 = ' '.join(text)
len(text2)
52251
After tokenizing and removing stop words we see that the number of words is now reduced to 6711
stop_words = set(stopwords.words('english'))
text_tokens = word_tokenize(text2)
filtered_words = [w for w in text_tokens if not w in stop_words]
len(filtered_words)
6711
filtered_words = ' '.join(filtered_words)
Now for the fun part! To make the wordcloud in the shape of a lobster, we will need a vector .png image of a lobster. The best resource for finding pictures of biological creatures is PhyloPic. Download the image to your working folder and assign it to a variable.
LOB_FILE = 'Spiny2.png'
Here is an example of how an ordinary word cloud looks
word_cloud = WordCloud().generate(filtered_words)
plt.imshow(word_cloud, interpolation = 'bilinear')
plt.axis('off')
plt.show()

Now we will use pillow to read the image and do some manipulations. We have to create an image mask from the lobster image which will be a canvas for the wordcloud
icon = Image.open(LOB_FILE)
image_mask = Image.new(mode='RGB', size = icon.size, color = (255, 255, 255))
image_mask.paste(icon, box = icon)
rgb_array = np.array(image_mask)
word_cloud = WordCloud(mask = rgb_array, background_color = 'white',
max_words = 1000, colormap = 'ocean', max_font_size = 300)
word_cloud.generate(filtered_words.upper())
plt.figure(figsize=[20, 20])
plt.imshow(word_cloud, interpolation = 'bilinear')
plt.axis('off')
plt.show()

And there you have it! A beautiful lobster wordcloud created from an academic paper