Understanding BHL Through Metadata: Patterns of Bio-Diverse Knowledge Production

During my time as an intern for the Biodiversity Heritage Library (July-August 2021), I worked on a project that, I hope, will help engender important and critical conversations around the Library’s work and responsibilities vis-à-vis the sometimes harmful and problematic origins of its materials, as well as around the possibilities for the decolonization of its collection and archival practices. The main goal of this project, which focuses on the case of Latin America and her biodiversity, was to identify patterns in the metadata of BHL’s collection that can inform decolonial policies and strategies for the diversification of the Library’s catalogue.

To identify such patterns, I extracted and analyzed the metadata of materials that include a subject related to Latin America in their subject lists. These analyses shed important light on diversification issues, specifically in the case of this region. The first part of this project was to perform subject-based extractions from BHL’s collection. Following a similar methodology to that of Chris Freeland (Freeland, steps 1–3), I extracted all records from BHL’s online datasets[i] (as of July 1st, 2021[ii]) that included specific subjects[iii] to build five subsets:

  1. Greater Regions (GR) subset. This subset incorporates all BHL records (4465 total[iv]) that include one of the following subjects—all of which refer to Latin America as a whole or in subcontinental terms—in their subject list: Latin America, Central America, South America, West Indies.
  2. Latin American Countries (LAC) subset. This subset incorporates all BHL records (6801 total) that include the name of at least one Latin American country (except Mexico) in their subject list: Argentina, Belize, Bolivia, Brazil,[v] British Guiana, Chile, Colombia, CostaRica/Costa-Rica, Ecuador, El Salvador, French Guiana, Guatemala, Guiana/Guyana, Honduras, Nicaragua, Panama, Paraguay, Peru, Surinam/e, Uruguay, Venezuela.
  3. Mexico (MEX) subset. This subset incorporates all BHL records (2995 total) that include the subjects Mexico; Mexico, North; and/or Mexico, Northern in their subject lists. Mexico was treated as a separate subset given that a large portion of materials about Mexican biodiversity in BHL comes from the BHL México project. Thus, considering these materials as a separate subset can illuminate the impact of global collaboration in the diversification of the Library’s collection.
  4. Indigenous Peoples – General (IP-G) subset. This subset incorporates all BHL records (1052 total) that include one of the following subjects—all of which refer to Latin American Indigenous peoples in generalizing/homogenizing terms—in their subject list: Indians of Central America, Indians of Mexico, Indians of South America, Indians of the West Indies. Although of a different nature, the subjects Aztecs and Incas are also included in this subset, as they refer to the Indigenous empires that existed at the arrival of European colonizers in the Americas but that no longer exist and were conglomerates of several Indigenous groups of these regions.
  5. Indigenous Peoples – Specific (IP-S) subset. This subset incorporates all BHL records (135 total) that include one of the following subjects—all of which refer to specific Latin American Indigenous peoples—in their subject list: Carib Indians, Choco Indians, Cuna Indians, Diaquita Indians, Goajiro Indians, Huichol Indians, Kickapoo Indians, Mapuche Indians, Mayas, Mayoruna Indians, Mojo Indians, Shipibo-Conibo Indians, Taino Indians, Tairona Indians, Tarahumara Indians, Yahgan Indians.
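The subject-based extraction described in the list above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the workflow actually used for the project: the column names (TitleID, Subject), the sample rows, and the function name build_subset are hypothetical, and only two of the five subject lists are shown.

```python
import csv
import io

# Subject lists for two of the five subsets, copied from the descriptions above.
SUBSET_SUBJECTS = {
    "GR": {"Latin America", "Central America", "South America", "West Indies"},
    "MEX": {"Mexico", "Mexico, North", "Mexico, Northern"},
}

# Made-up sample standing in for a tab-separated subject file.
# The column names are assumptions, not a documented schema.
SAMPLE_SUBJECT_TSV = (
    "TitleID\tSubject\n"
    "101\tSouth America\n"
    "101\tBotany\n"
    "102\tMexico\n"
    "103\tBirds\n"
    "104\tWest Indies\n"
    "104\tMexico, North\n"
)


def build_subset(tsv_text, subjects):
    """Return (title ID, matched subject) pairs whose subject is in `subjects`."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [
        (int(row["TitleID"]), row["Subject"])
        for row in reader
        if row["Subject"].strip() in subjects
    ]


subsets = {
    name: build_subset(SAMPLE_SUBJECT_TSV, subjects)
    for name, subjects in SUBSET_SUBJECTS.items()
}
print(subsets["GR"])
print(subsets["MEX"])
```

Matching on the exact subject string mirrors the decision to use subjects exactly as they appear in the subject dataset; a looser match (for example, a substring search) would pull in unrelated subjects.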

Once the subsets were created, I used Microsoft Access to extract, from the different BHL datasets available online, specific information for each of the records included in these subsets:[vi] title ID (used as primary key), full title, author, holding institution, year of publication (including start and end year of publication for periodicals), publication details, language, title URL, and the subject that qualified the record for the subset.
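As a rough illustration of this step, the sketch below attaches title-level metadata to subset rows using the title ID as the key. It is a Python stand-in for the kind of lookup a database query performs, not the Access queries used in the project; the field names, the sample titles, and the function attach_metadata are all invented for the example.

```python
# Hypothetical title metadata keyed by title ID (the primary key).
# Titles and values are made up for illustration.
titles = {
    101: {"full_title": "Example flora of the Andes", "language": "English"},
    102: {"full_title": "Ejemplo de aves de México", "language": "Spanish"},
}

# Hypothetical subset rows: (title ID, subject that placed the record in the subset).
subset_rows = [(101, "South America"), (102, "Mexico"), (999, "Mexico")]


def attach_metadata(rows, title_table):
    """Join each subset row to its title metadata; skip IDs with no title record."""
    records = []
    for title_id, subject in rows:
        meta = title_table.get(title_id)
        if meta is None:
            continue  # no matching title record for this ID
        records.append({"title_id": title_id, "subject": subject, **meta})
    return records


records = attach_metadata(subset_rows, titles)
for record in records:
    print(record)
```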

The next step of this project was to prepare the data, especially for geolocation. In this regard, it is important to mention that BHL’s metadata do not have separate categories for place of publication and publisher. Instead, both are included under the category Publication details, which sometimes includes the year of publication as well. Therefore, to understand the geopolitical affiliations of the records included in each subset, I cleaned the data[vii] to have a separate category for place of publication. In some cases, this required the modernization and/or translation of place names, given that BHL’s metadata often include publication information in the original language and format of the material. For example, the publication details of a volume of José de Acosta’s Historia natural y moral de las Indias (included in the LAC subset) read “Impresso en Seuilla :en casa de Iuan de Leon.,Año de 1590.” This record, then, required the isolation of the place of publication (Seuilla), its modernization from 16th-century Spanish spelling to the modern form Sevilla, and its translation into English, Seville, for software readability. After the cleaning phase, I ran Microsoft Excel’s geographical data identification to obtain coordinates (longitude and latitude) for each place of publication included in the subsets.[viii]
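The cleaning steps for the Acosta example (isolate, modernize, translate) could be sketched as follows. The actual cleaning was done by hand in Excel, so this is only a minimal Python approximation: the lookup tables, the list of imprint prefixes, and the function names are assumptions, and real records would need far longer tables and manual review.

```python
# Hypothetical lookup tables, just large enough for the example.
IMPRINT_PREFIXES = ("Impresso en ", "Impreso en ", "Printed at ")
MODERNIZED = {"Seuilla": "Sevilla"}  # historical spelling -> modern spelling
ENGLISH = {"Sevilla": "Seville"}     # modern local name -> English name


def extract_place(publication_details):
    """Isolate the place of publication.

    Assumes the common "place : publisher, year" layout, so the place is
    whatever precedes the first colon, minus any known imprint prefix.
    """
    place = publication_details.split(":", 1)[0].strip(" .,;[]")
    for prefix in IMPRINT_PREFIXES:
        if place.startswith(prefix):
            place = place[len(prefix):]
            break
    return place.strip()


def normalize_place(place):
    """Modernize the spelling, then translate into English when a mapping exists."""
    modern = MODERNIZED.get(place, place)
    return ENGLISH.get(modern, modern)


details = "Impresso en Seuilla :en casa de Iuan de Leon.,Año de 1590."
raw_place = extract_place(details)
print(raw_place, "->", normalize_place(raw_place))
```

Places missing from the lookup tables pass through unchanged, which makes it easy to list the values that still need attention.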

After preparing the datasets, the next step was to generate statistics and visualizations, mostly in Tableau. For geographical data, I also employed Google Maps, which makes it easier to map the locations of holding institutions, as they are already recorded in Google. Given Google Maps’ restrictions on the number of nodes that can be mapped at a time, only unique IDs were included in those maps:[ix] GR, LAC, MEX, IP-G, IP-S.[x] In Tableau, however, I generated separate maps and graphs for two groups per subset, one for all title IDs and one for unique IDs only. For most subsets, the generated data visualizations[xi] include a density map for places of publication; graphs for number of titles (unique and non-unique) per place of publication, per language of publication, and per year of publication; graphs for frequency of subject and of holding institution based on number of titles (unique and non-unique); a graph for language versus year of publication; a graph for included subject versus place of publication; and a graph for included subject versus holding institution.[xii]
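The distinction between all title IDs and unique IDs only matters most for periodicals, which contribute one row per volume or issue. The sketch below shows the two ways of counting on made-up records; the data and the function name count_by_language are hypothetical, and the same idea would apply to places of publication or holding institutions.

```python
from collections import Counter

# Made-up records as (title ID, language). A periodical appears once per volume.
records = [
    (1, "Spanish"), (1, "Spanish"), (1, "Spanish"),  # one periodical, three volumes
    (2, "English"),
    (3, "English"),
    (4, "Spanish"),
]


def count_by_language(rows, unique_only=False):
    """Count records per language, optionally keeping one row per title ID."""
    if unique_only:
        first_seen = {}
        for title_id, language in rows:
            # Keep only the first language recorded for each title ID.
            first_seen.setdefault(title_id, language)
        languages = first_seen.values()
    else:
        languages = (language for _, language in rows)
    return Counter(languages)


all_counts = count_by_language(records)
unique_counts = count_by_language(records, unique_only=True)
print("All title IDs:", dict(all_counts))
print("Unique IDs only:", dict(unique_counts))
```

In this toy sample, Spanish leads when every row is counted but only ties with English once each title is counted a single time, which is the kind of shift the two groups of visualizations are meant to expose.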

In almost all subsets,[xiii] mapping metadata categories evidences a discrepancy between the object of study, in this case, the biodiversity and peoples of Latin America, and the networks of knowledge production about said object. These colonial epistemic dynamics are particularly important for the IP-G and IP-S subsets, which not only include Indigenous peoples as objects of study but also show—arguably more than any other subsets—a clear distinction of the Global North as producer of knowledge about Indigenous peoples in Latin America (Figure 1).

Global map showing places of publication in IP-G dataset

Figure 1 Map showing places of publication (density per number of records) in the IP-G subset. Generated on Tableau in August 2021. Data from https://www.biodiversitylibrary.org/data as of July 1st, 2021.

Moreover, the IP-G subset is anchored in homogenizing terms such as “Indians of…” The very fact that the word Indians is being employed to refer to Indigenous peoples in Latin America perpetuates colonial homogenizing discourses that began with the first arrival of Europeans in the Americas (and not India) in the late 15th century, thus flattening the cultural diversity of the Global South. Additionally, employing this formula alongside general geographical areas further blurs the cultural diversity of Indigenous peoples in the region. For instance, there are 70 identified Indigenous peoples in Mexico (Sistema de Información Cultural SIC México), at least four in Belize, 24 in Guatemala, seven in Honduras, three in El Salvador, nine in Nicaragua, eight in Costa Rica, eight in Panama, 30 in Argentina, 36 in Bolivia, 241 in Brazil, 83 in Colombia, nine in Chile, 12 in Ecuador, nine in Guyana, six in French Guiana, 20 in Paraguay, 43 in Peru, five in Suriname, and 37 in Venezuela (UNICEF España). However, these more than 660 Indigenous cultures are homogenized in the generalizing terms Indians of Mexico, Indians of Central America, and Indians of South America. In sum, the use of the formula Indians of… in subject lists requires a critical reformation that allows for a better representation of Indigenous peoples and topics in metadata practices.

In contrast to the IP-G subset, the IP-S subset is built around subjects that specify the Indigenous group they are referring to, which could potentially mean that the cultural specificity of these communities is being more accurately represented. Nevertheless, the IP-S subset is the only subset where all materials were published and are held in the Global North (Figure 2). Therefore, despite its apparent cultural diversity and specificity, this subset reveals that the Library’s materials about Indigenous peoples continue to be anchored in a stark contrast between the North as epistemic subject (those who know) and the South as a passive object (that which is to be known). Furthermore, the subject-object dichotomy perpetuates the historical colonial association between nonhuman animals and Indigenous peoples. The animal-human dichotomy is the primeval binary on which other power-based categories are constructed and that characterizes the relationship between colonizers and colonized (Wolfe xx; Rajamannar 5). In this sense, several critics have identified a profound connection between colonization in the Americas and the establishment of a racial system for the classification of human groups (Greer et al.). Therefore, a link between species and race is constructed in the world-making process of colonization, a link that is still palpable in the use of subjects such as Indians of… in BHL’s metadata. These colonial dynamics can only be counteracted by an acknowledgement of the colonial roots of such materials as well as a stronger diversification of the Library’s collection that increases the presence of Indigenous peoples as agents of knowledge production.

Global map showing places of publication in IP-S dataset

Figure 2 Map showing places of publication (density per number of records) in the IP-S subset. Generated on Tableau in August 2021. Data from https://www.biodiversitylibrary.org/data as of July 1st, 2021.

Even in subsets such as LAC, where there is a considerable diversity of places of publication,[xiv] the holding institutions of the materials are almost exclusively located in the Global North (Figure 3). In fact, there are only three works (out of 1982 unique IDs in the LAC subset) that are held in institutions outside of Europe and Anglo North America: Jean Baptiste Boussingault and Francois Desire Raulin’s Viajes científicos a los Andes Ecuatoriales ó Colección de memorias sobre física, química e historia natural de la Nueva Granada, Ecuador y Venezuela (1849), translated by Joaquín Acosta and held in the Universidad Autónoma de Nuevo León in Mexico; the Anales del Museo de Historia Natural de Valparaíso, authored and held by the Museo de Historia Natural de Valparaíso in Chile; and Evangelina Schwindt, Nicolás Battini, Clara Giachetti, Karen Castro, and Alejandro Bortolus’s Especies exóticas marino-costeras: Argentina (2018), edited and contributed to BHL by authors Schwindt and Bortolus.

Global map showing places of publication in LAC dataset

Figure 3 Map showing places of publication (density per number of records) in the LAC subset. Generated on Tableau in August 2021. Data from https://www.biodiversitylibrary.org/data as of July 1st, 2021.

Boussingault and Raulin’s work and, especially, Acosta’s translation are, in fact, a notable example of historical efforts to decentralize biodiversity-related knowledge. Acosta himself, in his prologue, states that the goal of his translation is to share the knowledge produced by Boussingault and Raulin in French, especially so that it can be accessed by “Granadinos, Venezolanos y Ecuatorianos” (Acosta 10), that is, the people from the places established as the object of study of this work: Colombia, Venezuela, and Ecuador. As the translator mentions, the people from these countries had, during that time, little to no access to these volumes due to their limited distribution and high costs. Furthermore, Acosta emphasizes that his translation is the product of a strong collaboration between him and the authors and that the edition was sponsored by a French editor (10-11). Thus, already in 1844, Acosta’s translation of this work was an example of collaboration between the Global South and North. Moreover, his goals echo in the access provided to these volumes by BHL, even more so given that they are contributed by a Mexican institution, meaning that a fundamental part of the collaborative network of bio-diverse knowledge production of this text, from the 19th century to today, is notably located across the Global South.

In turn, both the volumes of the Anales of the Museo de Historia Natural de Valparaíso in Chile and Schwindt et al.’s book are powerful examples of local biodiversity-related knowledge production. The latter is also a more recent example of global collaboration and the role of BHL in promoting it, and it exemplifies the fruitful outcomes of a truly diverse and trans-geopolitical network of bio-diverse knowledge production. The result of an extensive local project to understand marine species in Argentina, Especies exóticas marino-costeras is also bilingual (English and Spanish) and includes the voices of researchers in different parts of the world. It incorporates, for instance, three prologues by researchers from the US, South Africa, and Argentina respectively. Similarly, this book includes a remarkably diverse list of acknowledgements, with individuals from Canada, Argentina, Uruguay, Brazil, France, Sweden, South Africa, the US, the UK, Spain, the Netherlands, Colombia, and Chile, whom the authors thank for “provid[ing] valuable assistance and help during the entire creative process … by supplying administrative assistance, photographs and specimens, as well as commenting and improving the text” (Schwindt et al. 15–16). Finally, this work was also featured in a BHL blog post published in October 2020 and written by former Outreach and Communication Manager for BHL, Grace Costantino. At the moment of publication, the blog post was also shared on the Library’s accounts on Twitter and Facebook, thus promoting this bio-diverse trans-geopolitical collaborative work through other important online avenues, a decolonial and truly global representation of Latin America for which I myself have argued elsewhere.

Despite the richness and diverse geopolitical affiliations of these materials, however, their small numbers blur their presence and continue to evidence an overwhelming predominance of the Global North as the housing site of biodiversity-related knowledge production, thus positing it as a sort of legitimizing epistemic centre for such production. While local and collaborative knowledge production is present in its collection, BHL still has a long way to go to achieve a more equitable and diversified collection. In sum, what the statistics and maps of these subsets show is a colonial object-subject relationship between the Global South and North that requires a profound diversification of the networks of knowledge production that constitute BHL’s catalogue.

In the case of BHL, global partnerships are a strong strategy to achieve such diversification, both in terms of its collection and of access to materials. A notable example of this is the MEX subset. A significant portion of the materials about Mexico in BHL is contributed to its collection through the BHL México project, which is the result of a partnership between the Library and several Mexican institutions led by Mexico’s Comisión Nacional para el Conocimiento y Uso de la Biodiversidad (CONABIO). As a result, the MEX subset shows significant diversification in comparison to other subsets, especially regarding places of publication and language distribution. For instance, Mexico City is, by far, the most frequent place of publication in this subset, meaning that the MEX subset is the only one where the most frequent place of publication is not in the Global North and is located in the same country that is included as subject (Figure 4).

Bar chart listing the frequencies of places of publication in MEX dataset

Figure 4 Frequencies of places of publication (non-unique IDs) in the MEX subset. Generated on Tableau in August 2021. Data from https://www.biodiversitylibrary.org/data as of July 1st, 2021.

Such a strong presence of Mexico in the geographical affiliations of these texts thus posits the country not only as an object of study but, especially, as an agent of knowledge production, particularly of local knowledge production. Furthermore, although in most[xv] subsets—including the MEX subset—English continues to be the most frequent language of publication, the MEX subset shows a considerable number of materials in Spanish, with the difference between the two languages being significantly smaller than in other subsets (Figure 5). Therefore, at least to a certain extent, the meaningful presence of Spanish in materials about Mexico—which are part of both the BHL México project and the collection Publicaciones en español—counteracts the English dominance found throughout BHL’s collections, as I have also shown before.

Table listing frequency of English and Spanish publications in all datasets

Figure 5 Frequency of English and Spanish in all subsets. Data from https://www.biodiversitylibrary.org/data as of July 1st, 2021.

It is important to note, however, that the substantial presence of Spanish as a fundamental language of publication in the MEX subset holds only for non-unique IDs. By contrast, when considering only unique IDs in this subset, the gap between English and Spanish increases dramatically, showing, once more, an overwhelming predominance of the former (Figure 6). Given that many of the materials contributed through BHL México and CONABIO are also periodicals, the difference in distribution between unique and non-unique IDs further strengthens the hypothesis that linguistic diversification in the MEX subset is strongly related to this partnership. Thus, the diversification of BHL’s collection proves to be deeply related to the diversification of its global partnerships, the BHL México project being a case in point.

Two bar charts comparing the language distribution for non-unique and unique ID titles in MEX dataset

Figure 6 Language distribution for non-unique (left) and unique (right) ID titles in the MEX subset. Graph generated on Tableau in August 2021. Data from https://www.biodiversitylibrary.org/data as of July 1st, 2021.

The Biodiversity Heritage Library is a rich and invaluable resource for researchers across the world. Nevertheless, to maximize access to its collection and an equitable and decolonial representation of diverse human and nonhuman subjects, the Library must continue its efforts regarding decoloniality, diversity, multiculturalism, and multilingualism. Diversification must be a primary goal of BHL’s outreach strategies. As I argued in a previous post on BHL’s blog, increasing multicultural representation can lead to diversified and equitable access that can increase the participation of audiences from the Global South including, at the very least in the case of Latin America, Indigenous communities. Therefore, a critical understanding of metadata, representational paradigms, and collection practices can highlight the dimensions of the Library where more work needs to be done as well as the benefits of previous strategies undertaken towards diversification. For instance, in the case of the subsets I analyzed during my internship with BHL, a stronger collaboration between the Library and institutions in the Global South seems to have a positive impact on the diversification and multicultural representation of topics related to these regions of the world. Thus, the establishment of more and stronger BHL nodes in the Global South can potentially lead to a more decolonial, equitable, and truly global collection anchored in bio-diverse networks of knowledge production, so that the Library can continue to honour its mission of providing global and open access to biodiversity-related content in collaborative and ethical ways.


[i] The files used for the extraction and building of the subsets were creator.txt, item.txt, subject.txt, and title.txt.

[ii] As of this date, the total number of BHL records was 146,806.

[iii] All subjects are presented here and used in the analyses exactly as they appear in BHL’s subject dataset (as of July 1st, 2021).

[iv] Total numbers for these subsets include non-unique ID records, meaning that some title IDs are repeated when the work includes several numbers, volumes, or issues.

[v] Includes the subjects Brazil 0 and Brazil 7 as they appear in BHL’s subject dataset.

[vi] Tables for these subsets (including cleaned geographical data as explained later in this text) can be found in BHL’s GitHub repository.

[vii] Data were cleaned directly on Microsoft Excel, separating values by punctuation and replacing terms that needed modernization and/or translation.

[viii] See note vi.

[ix] This means that works in volumes, issues, and/or numbers were treated as a single record and grouped under their shared title ID.

[x] Additional details can be found in the description of each map on Google. For the IP-G and IP-S subsets, additional information about Indigenous groups is included in each node on the map.

[xi] All visualizations for all subsets can be found in BHL’s GitHub repository.

[xii] Given the one-subject nature of the MEX subset, no subject-based visualizations are included for it.

[xiii] The exception is the MEX subset, which is discussed later in this text.

[xiv] Meaning that they include several places of publication throughout Latin America. It is important to note, however, that there is almost no knowledge production about Latin America from other regions in the Global South.

[xv] The most frequent language in the IP-G subset is French, followed by English. In the GR, LAC, and IP-S subsets, French is the second most frequent language, after English.

[xvi] It is notable that the IP-S subset, along with its dichotomy between the Global South and North in terms of places of publication, is also the only subset with no materials in Spanish.


Works Cited

Acosta, Joaquín, translator. ‘Advertencia Preliminar’. Viajes Científicos a Los Andes Ecuatoriales ó Colección de Memorias Sobre Física, Química e Historia Natural de La Nueva Granada, Ecuador y Venezuela Presentada a La Academia de Ciencias de Francia, by Jean Baptiste Boussingault and Francois Desire Raulin, Librería Castellana, 1849, pp. 10–11, https://www.biodiversitylibrary.org/item/235575#page/6/mode/1up.

Freeland, Chris. ‘BHL Poster for AETFAT2010’. ChrisFreeland, 19 Apr. 2010, http://blog.chrisfreeland.com/2010/04/.

Greer, Margaret, et al., editors. Rereading the Black Legend: The Discourses of Religious and Racial Difference in the Renaissance Empires. 2008.

Rajamannar, Shefali. Reading the Animal in the Literature of the British Raj. Palgrave Macmillan, 2012.

Schwindt, Evangelina, et al. Especies Exóticas Marino-Costeras: Argentina. Marine-Coastal Exotic Species: Argentina. Edited by Evangelina Schwindt and Alejandro Bortolus, Vázquez Mazzini Editores, 2018, https://www.biodiversitylibrary.org/bibliography/169323.

Sistema de Información Cultural SIC México. ‘Pueblos Indígenas’. Gobierno de México, 2021, https://sic.cultura.gob.mx/lista.php?table=grupo_etnico&disciplina=&estado_id=.

UNICEF España. UNICEF Presenta El Atlas Sociolingüístico de Pueblos Indígenas En América Latina | UNICEF. https://www.unicef.es/prensa/unicef-presenta-el-atlas-sociolinguistico-de-pueblos-indigenas-en-america-latina. Accessed 29 June 2020.

Wolfe, Cary. ‘Introduction’. Zoontologies: The Question of the Animal, edited by Cary Wolfe, University of Minnesota Press, 2003, pp. ix–xxiii.


Lidia Ponce de la Vega is a Ph.D. Candidate in Hispanic Studies at McGill University. She holds an Honours Bachelor of Arts (Gabino Barreda Medal) in Hispanic Language and Literature from the National Autonomous University of Mexico, and a Master of Arts in Hispanic Studies from McGill University. In her research, she explores topics of digital archives, archival practices, and decolonisation of online epistemologies in their intersection with ecocriticism and interspecies relationships.