Gathering Word Clouds of Language Data – Library Research Service

As hubs for diverse communities, our Colorado public libraries are proud to serve people of many different backgrounds, ethnicities, cultures, and nationalities. Newcomers to Colorado can benefit greatly from services offered at public libraries and contribute valuable ideas and perspectives to our communities. However, language barriers can create significant challenges to reaching these populations. According to the U.S. Census Bureau, languages other than English are spoken in around 16% of homes across Colorado, and close to 6% of Colorado residents speak English “less than very well”. Public libraries can welcome these segments of the population by finding ways to communicate with people in their preferred language.

Gathering the Data

In 2024, the Public Library Annual Report (PLAR) gathered three kinds of language information for the first time: which languages multilingual library staff speak, whether staff are compensated for these language skills, and the languages in which library programs are conducted. Library Research Service (LRS) only adds questions to the PLAR if the responses would contribute meaningfully to our repository of information about Colorado public libraries. Language data meets this requirement by helping library staff learn which languages are used across the state and how libraries can break down language barriers.

The new questions added to the PLAR include:

Do you have positions at your library that require the person in the position to be able to communicate in languages other than English? If yes, how many and which languages?
Do you have multilingual people on staff using languages other than English to help patrons, but that is not an official part of their job? If yes, how many and which languages?
Does your library offer a stipend or differential pay for multilingual speakers on staff?
Does your library offer programs in a language other than English? If yes, which languages?

These questions were not widely answered this first year, so the data collected does not reflect all Colorado public libraries. In fact, 32 of Colorado’s 112 public library systems did not respond to any of these questions. For PLAR data, which nearly all Colorado public libraries submit annually, this is a low response rate. This data should therefore be used with caution, particularly when totaling its quantitative aspects. Still, the open-ended questions asking which languages staff use to help patrons or offer programs gave us a glimpse into the many languages in which conversations are taking place in Colorado public libraries.

There were 23 different languages shared in response to these questions, and 20 of these languages were listed as languages that multilingual staff use to help patrons though not as an official part of their job. Thirteen of these same languages were also listed as languages other than English in which programs are offered. In response to the question of which languages are spoken by staff in positions that require them to be able to communicate in languages other than English, eight different languages were reported. Predictably, every library that listed one or more languages in response to these questions reported Spanish. One of the many exciting things about collecting this data is that it opens doors for different types of qualitative data visualizations, one of the most common of which is the word cloud!

Languages Used by Staff in Colorado Public Libraries

A word cloud containing the 23 different languages reported in the PLAR by Colorado public libraries arranged in different directions. — Figure A

A word cloud containing the 23 different languages reported in the PLAR by Colorado public libraries arranged diagonally. — Figure B

Word clouds are a flashy way to attract people to qualitative data. In the two word clouds above, the size of a language’s name is based on the number of times it appears throughout the entire PLAR 2023 language data set. Spanish, of course, was listed many more times than any other language, which is why it appears significantly larger.
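For readers who want to try this sizing rule themselves, here is a minimal Python sketch of how a count could be turned into a font size. The responses are made up for illustration and are not PLAR data, and the linear scaling, the function names, and the 10 to 60 point range are my own assumptions rather than the settings used for Figures A and B.

```python
from collections import Counter

# Hypothetical responses (NOT actual PLAR data): one list of languages per library.
responses = [
    ["Spanish", "French"],
    ["Spanish"],
    ["Spanish", "Russian", "French"],
    ["Spanish"],
]

def language_counts(responses):
    """Count how many times each language appears across all responses."""
    return Counter(lang for answer in responses for lang in answer)

def font_sizes(counts, min_size=10, max_size=60):
    """Map counts linearly onto a font-size range (an assumed scaling rule)."""
    lo, hi = min(counts.values()), max(counts.values())
    if hi == lo:
        # Every language has the same count, so every word gets the same size.
        return {lang: max_size for lang in counts}
    scale = (max_size - min_size) / (hi - lo)
    return {lang: round(min_size + (n - lo) * scale) for lang, n in counts.items()}

counts = language_counts(responses)
sizes = font_sizes(counts)
print(counts.most_common())
print(sizes)
```

With a linear rule like this one, a single very frequent word takes the largest size and pushes everything else toward the small end of the range, which is the effect Spanish has in the figures above.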

The Many Shapes of Word Clouds

The only difference between these two word clouds is the direction of the words. In this post, I share a few different design examples to spark ideas and highlight the pros and cons of word clouds. Do you find Figure A or B easier to read? Personally, I prefer the diagonal lettering in Figure B. Another creative spin on word clouds is to shape them into a figure related to the data’s topic. For this language data, that might be a speech bubble. Keep in mind, though, that the wilder the design, the less accessible it will be for people with vision impairments. Even for readers without vision impairments, tilted or sideways words can take more effort to decipher. So, as we often say at LRS, keeping it simple is better. Color can also make a big difference in the appearance of a word cloud.

Languages Used by Staff in Colorado Public Libraries

A word cloud showing the 23 different languages reported in the PLAR with horizontal lettering and colors applied to help compare the most common languages. — Figure C

Figure C above is our third and final word cloud example and contains a few significant adjustments. Although the color palette is actually very similar to Figures A and B, I applied colors strategically in this word cloud to help readers differentiate between more commonly reported languages. The second adjustment to Figure C is that the languages are all listed horizontally. These two design adjustments, the color and the horizontal lettering, make this word cloud easier to read. In my opinion, they also make Figure C more visually appealing than Figures A and B.

There is, however, one more change in Figure C that skews the data. You probably noticed that certain languages, such as French and Russian, appear larger in this third word cloud. While making Figures A and B, I realized that because Spanish was reported so much more often than any other language, each of the other languages appears much smaller in comparison, and the differences in size between them are difficult to distinguish. In an attempt to fix this problem, I was tempted to adjust the data to account for the number of questions a language was listed in response to. Languages included in these word clouds were listed in the PLAR as answers to three different questions: languages used in positions that require staff to be multilingual, languages used by multilingual staff but not as an official part of their job, and languages in which programs are conducted. For Figure C, if a language was reported in two of these categories, I multiplied the total number of times it was reported by two. If a language was reported in all three of these categories, I multiplied its total by three. However, there was one exception: the total count for Spanish in Figure C was not multiplied like the other languages! This brings me to my final point about word clouds, which is that they can be an unreliable type of data visualization.
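To make the distortion concrete, here is a small Python sketch of the Figure C adjustment as described above. The totals and category counts are invented for illustration and are not the PLAR numbers, and the function names are hypothetical. The only rules taken from this post are the multiplication by the number of categories and the exemption for Spanish.

```python
# Hypothetical numbers for illustration only; these are NOT the PLAR counts.
totals = {"Spanish": 60, "French": 6, "Russian": 4}
# Number of question categories (1 to 3) in which each language was reported.
categories = {"Spanish": 3, "French": 3, "Russian": 2}

def figure_c_weights(totals, categories, exempt=("Spanish",)):
    """Multiply each total by its category count, leaving exempt languages as-is."""
    weights = {}
    for lang, total in totals.items():
        multiplier = 1 if lang in exempt else categories[lang]
        weights[lang] = total * multiplier
    return weights

def share_of_largest(weights):
    """Express each weight as a fraction of the largest weight."""
    largest = max(weights.values())
    return {lang: value / largest for lang, value in weights.items()}

adjusted = figure_c_weights(totals, categories)
before = share_of_largest(totals)
after = share_of_largest(adjusted)
for lang in totals:
    print(f"{lang}: {before[lang]:.2f} of Spanish before, {after[lang]:.2f} after")
```

In this made-up example, French goes from one tenth the weight of Spanish to three tenths even though nothing about the underlying responses changed, so a reader comparing word sizes would come away with the wrong ratio.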

Word Cloud Controversies

If the data is not provided along with the word cloud, it’s difficult to know what sort of artistic liberties the creator took when making it. If you decide that a word cloud is a suitable way to visualize your data, it’s important to clean your data properly beforehand. Qualitative data often contains lots of filler words, such as “and”, which should be removed so they don’t clutter the visualization. Additionally, when using word clouds to present themes within a larger data set, there may be similar words that could be combined to better represent the frequency of these themes. For example, mentions of a single theme might be split between words such as “speak” and “talk”. Counted separately, neither word may look prominent, but if the counts are combined, the theme might stand out much more within a word cloud. The person designing the visualization decides how to combine the data and which data to include. This can bring quite a bit of bias into word clouds.
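The cleaning steps described above can be sketched in a few lines of Python. The stop word list, the synonym map, and the sample comments below are all hypothetical choices made for illustration. A different designer could reasonably pick different ones, which is exactly where the bias comes in.

```python
import re
from collections import Counter

# Hypothetical filler words to drop; a real project would likely use a longer list.
STOP_WORDS = {"and", "the", "to", "a", "of", "in", "we"}
# Hypothetical merge map: which words count as the same theme is a designer's choice.
SYNONYMS = {"talk": "speak", "talks": "speak", "speaks": "speak"}

def clean_counts(texts):
    """Lowercase the text, drop filler words, and merge synonyms before counting."""
    counts = Counter()
    for text in texts:
        for word in re.findall(r"[a-z']+", text.lower()):
            if word in STOP_WORDS:
                continue
            counts[SYNONYMS.get(word, word)] += 1
    return counts

# Made-up comments, not survey responses.
comments = [
    "Staff speak Spanish and talk to patrons",
    "We talk in the library",
]
word_counts = clean_counts(comments)
print(word_counts.most_common())
```

Sharing the stop word list and the merge map alongside the finished word cloud is one way to let readers see which liberties were taken.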

The lists of languages reported by libraries made for fairly straightforward qualitative data to transform into a word cloud. I did not have to clean the data extensively or remove extraneous words, so I thought it might be a good fit for a word cloud. However, the number of times that Spanish was reported in relation to the rest of the languages made it difficult to use size as a comparison tool for much of the data. Attempting to adjust for this outlier, as I did in Figure C, only skews the data and creates a misleading visualization because the count for Spanish was not scaled up along with the rest of the data. I outlined Figure C in red for this reason. Word clouds can be a fun way to work with qualitative data, and the visualizations above do show the variety of languages spoken in public libraries. On the other hand, word clouds are not the best visualization method for showing nuances within data. Factors such as the color and length of words, as well as people’s own biases, can cause different words to stand out to different people, meaning they might take away significantly different messages from the same word cloud.

Although none of these three word clouds capture the nuances of this language data, and there was a relatively low response rate to these questions, I still chose to share this new PLAR data to shed light on the number of different languages in which staff assist patrons in public libraries. Language barriers are challenging to overcome, but bringing awareness to the languages spoken across the state is the first step to providing services and resources to all members of our communities.

LRS’s Colorado Public Library Data Users Group (DUG) mailing list provides instructions on data analysis and visualization, LRS news, and PLAR updates. To receive posts via email, please complete this form.