National bibliographic data for studies of social sciences and humanities: towards interoperability

National bibliographic data bring numerous opportunities for science studies, especially when integrating data from multiple data sources. The use of multiple data sources, however, is hindered by the lack of interoperability. Although progress has been made in developing persistent international identifiers such as ISBN, DOI, and GRID, the interoperability between different data sources still poses challenges at several levels. We reflect upon these challenges with a focus on conceptual and methodological aspects with respect to the Academic Book Publisher Register (ABP), a comprehensive international list of publishers that is created by integrating multiple publisher lists used in different countries. This register, currently in development, is primarily meant to be used in research evaluation settings. At the same time, it is potentially a valuable source of data for studies focused on publishing in different knowledge domains. In discussing the challenges encountered while making the ABP, we focus on two main issues: the delineation of publishers and the establishment of connections between local lists and the ABP. In this paper we discuss possible ways to overcome these obstacles and draw conclusions in relation to other data sources that can be of use in research within the social sciences and humanities.


Introduction
In 1963 Derek de Solla Price famously asked "Why should we not turn the tools of science on science itself?" [1], thus setting in motion the field of quantitative studies of science, also known as bibliometrics or scientometrics. Quantitative studies of science, reliant on various quantitative methods, advance knowledge on different aspects of research: for example, on research or publishing activities or their relationship with various micro or macro characteristics [2,3]. Within this knowledge domain, the social sciences and humanities (SSH), as a research object, occupy an uneasy position. The well-known literature reviews by Anton Nederhof and Diana Hicks have vividly shown that publication and citation activities in SSH differ from those in the natural sciences and medicine [4,5]. One of the central SSH characteristics is their use of a wide range of publishing forms: SSH scholars are more likely than researchers in other fields to publish monographs and book chapters and to make use of discipline-specific genres such as critical editions.
The key challenge when studying SSH is the lack of data sources that accurately portray the diversity within SSH, the so-called problem of coverage [6][7][8]. We argue that the problem of coverage can be overcome by using multiple data sources, many of which are of national scope. For example, Sivertsen has argued for the use of Current Research Information Systems (CRISs) for research purposes [9]. Similarly, recent years have witnessed increased use of national, regional, and institutional bibliographic databases [8,10,11] as well as of lists of publishers and journals that offer yet another direction for science studies [12,13]. All these sources, with their more comprehensive coverage and more accurate depiction of research and publishing activities, are good alternatives to the often-used international databases, especially when combining multiple national databases and/or supplementing them with additional data. The use of data from different data sources, however, is challenged by their lack of interoperability.
In this paper we present a conceptually and methodologically oriented reflection on the use of national bibliographic data for studies of the SSH. Our focus is on interoperability as the main challenge when aiming to use multiple data sources. To successfully combine multiple data sources, first, one has to delineate entities under scrutiny to make the combined datasets consistent and therefore usable for research. Second, there is the more practical aspect of connecting multiple sources. Not all entities of interest are linked with persistent identifiers. In those cases, connections have to be made using alternative methods.
To illustrate our points, we use the making of the Academic Book Publisher Register (ABP), an initiative launched within the European Network of Research Evaluation in the Social Sciences and Humanities (ENRESSH, www.enressh.eu), as a case study. This is a specific data source of which the main purpose is not restricted to research (see section 3.1). Nevertheless, it clearly illustrates the key issues of interoperability that in one way or another are applicable to a variety of bibliographic sources of data.
The structure of this paper is as follows: we begin with an outline of our thinking about interoperability and challenges that one encounters when pursuing quantitative research based on data from multiple data sources. Then we continue with our case study: first, we offer a brief description of the ABP and highlight the opportunities it offers for quantitative studies of SSH. Then, we describe how the challenges of delineation and connection play out with respect to ABP and what could be the ways to go about it. Finally, we summarise the key ideas presented and broaden them in a discussion on how they relate to the use of national and other bibliographic databases for studies of SSH and to use of digital data sources in SSH scholarship more broadly.

Interoperability
Interoperability, in its common-sense understanding, denotes the ability of two or more systems to exchange information and to use the exchanged information. On the one hand, considerable progress has been made through the increasingly widespread use of persistent identifiers to denote various data points and their characteristics. Yet not all entities can be described with persistent identifiers, and the existing identifiers are not equally available across different contexts. On the other hand, interoperability issues manifest not only on a technical level but also on a conceptual level, and the latter is often harder to tackle. As was argued by Sīle [10], the construction of databases is context-dependent [see also 14,15]. The shaping of research output and records thereof takes place within a multitude of socio-political contexts and under culture-specific circumstances, often leading to varying ways of representing research output. Specific practices of data input, transfer, and processing stem from specific understandings of how research is best represented in data. Explorations of the inclusion criteria for publications show that there is by no means a single understanding that is shared across different contexts [16]. This also applies to the delineation of publishers, as we will show in the sections that follow. All these aspects are crucial when using data from multiple sources.

ABP in context
The ABP register is envisioned as a reliable source of information on academic book publishers that adhere to the highest academic standards [17]. As such, it is comparable to existing data sources on scholarly journals, such as the Directory of Open Access Journals (DOAJ). Specifically targeted at academic book publishers, the register aims to document the disciplinary profiles of publishers, procedures implemented for peer review, open access policies, statistics based on bibliographic information, and the profiles of authors. When operational, the register could be both a guide for authors searching for a publisher suitable for their writing and an important information source about, for example, national and international open access policies.
ABP is created by integrating multiple existing national registers of publishers. At the first stage of integration, the lists from Spain, Finland, and Italy have been integrated. In terms of data structure, ABP consists of a master list of unique academic book publishers and multiple corresponding lists derived from national bibliographic databases (local lists). This structure allows expansion with a wide range of data sources such as databases maintained by publishers, as well as institutional, regional and national bibliographic databases, Current Research Information Systems, and legal deposit libraries wherein research output is registered.
As noted, the ABP is an ongoing project initiated within ENRESSH. The work on the project began in Spring 2019. The implementation, however, is far from easy, as numerous issues are encountered when one aims to integrate data about scholarly publishers originating from different sources. Sections 3.3 and 3.4 describe two central challenges along with their potential solutions: delineation of unique publishers and connection between local lists and the master list. While the former challenge addresses the need to specify in conceptual and operational terms what counts as a publisher, the latter tackles the commonplace problem of the same publisher being recorded differently across different lists.

ABP as a data source for studies of social sciences and humanities
Even though the ABP is envisioned as a tool supporting research evaluation, its value for quantitative studies of SSH is manifold. Firstly, it can act as an auxiliary database that facilitates the use of consistent and up-to-date information on publishers in bibliographic metadata for book publications in national, regional, and institutional databases and repositories. Secondly, ABP can be used as a reliable source to delineate unique publishers, thus enabling studies that focus more on the publishing side of research [12,18]. In particular, the inclusion of additional information about publishers (e.g., type, location, OA policies, market relations) enables new kinds of quantitative studies of book publishing [19][20][21]. Thirdly, ABP lends itself as a tool to be used in the delineation of scholarly and/or peer-reviewed literature, thus helping to better understand the different publishing practices and to record the diversity of academic book publishers. Quantitative studies of science are often concerned with quantitative aspects of scientific communication via written text, but what makes a document 'scientific'? The criterion of peer review is considered distinctive in this regard. Identifying peer-reviewed publications is particularly difficult for books. While the status of academic book publishers might inform us to some extent about the quality of the editorial policies and control mechanisms, in many cases there is no clear delineation criterion that makes it easy to distinguish between peer-reviewed and non-peer-reviewed books. Different labels for peer-reviewed books have been introduced to this end, but there is no international standard in place across countries or even disciplines [22]. ABP can be used as a guide in this complex and continuously evolving landscape of academic publishing and its standards.

Delineation
One of the key challenges encountered in the process of constructing the ABP concerns the delineation of publishers: what is a publisher and how ought it to be represented in data? It is not self-evident how to define the term 'publisher', especially if such a definition needs to be valid across different institutional and national settings. We could treat publishers as companies or persons 'that prepare[s] and issues books, journals, or music for sale' [23]. However, this definition is insufficient for the needs of a register of publishers, as several related entities can be referred to with it.
To illustrate our point, consider the renowned social sciences and humanities publisher Routledge. In 2020 Routledge issued the book 'Sociological Theory in the Digital Age' by Gabe Ignatow. Figure 1 shows a fragment of the colophon of this book [24]. We can see that this book indeed is published by Routledge, but we can also read that Routledge is an imprint of the Taylor & Francis Group, an Informa business. An 'imprint', according to the dictionary by the Oxford University Press, is "[a] printer's or publisher's name, address, and other details in a book or other publication" as well as "[a] brand name under which books are published, typically the name of a former publishing house that is now part of a larger group" [23]. Which of these entities should stand for a publisher and be recorded in a register? These questions can be answered and expressed in rules and guidelines to be followed in the maintenance of a register of publishers. However, the complexity of this challenge increases immensely when integrating registers from different countries, which in turn follow differing principles of delineation that are often not even made explicit. Thus, typically, there is no information on whether the entries in national registers refer to imprints or publishers, and for which time period this information is valid. For this reason, we argue that it is important to specify at the outset of making a register of this type what entity each unique record refers to.
There are multiple alternatives available. First, there is an option to focus only on unique imprints. In this case each unique record in a register refers to publishers that are mentioned as publishers on a publication. With respect to the example shown in Figure 1, the publisher to be taken into account is Routledge. Thus, each unique publisher record in ABP refers to an imprint. Second, a unique record in ABP can refer to unique publishers. In the case of the example in Figure 1, a record should be made for Taylor & Francis Group. The main benefit with both these approaches is that they offer a principled approach to delineation and, in addition, it is relatively simple in technical terms to create a register based on such principles. However, the choice between a unique imprint list and unique publisher list depends on the envisioned usage of registers. For quantitative science studies, it is likely that the list of imprints is more suitable as it refers to those publisher names that are more familiar within scholarly communities.
A problem with the two approaches emerges, however, when the status of a publisher changes. In other words, there is also the issue of temporality. Using the previous example, Routledge is the name of an independent publisher (pre-1998) and the name of an imprint (post-1998). If the register referred only to unique imprints, this would result in the loss of information about the independent publisher that owns the imprint. This can be important in situations where editorial practices for an imprint are changed after it has been acquired by another publisher, or, on the contrary, where the prestige of a particular imprint in a particular scholarly community differs from the prestige held by the acquiring publisher.
These examples point to the third alternative: a register that is based on two distinct but connected registers, namely a register of imprints and a register of independent publishers. This, on the one hand, enables accurate representation of the complex relations between imprints and publishing houses that are commonplace in the contemporary publishing landscape. On the other hand, by using two lists instead of one, it is possible to accurately represent temporal and hierarchical changes such as mergers, acquisitions, or discontinuation. At the same time, this approach has a number of drawbacks. The construction of a register that is based on two distinct but connected registers is a time-consuming process, since most often the local lists of publishers do not contain information about the status of the entity recorded as publisher. It is impossible to know without consulting external sources whether an entry refers to an imprint or an independent publisher, and for which time period this information is valid. Missing information has to be identified and added manually. The maintenance is also more demanding in technical terms and requires specialised expertise.
With respect to ABP, it is not known at the moment which of the solutions will be implemented. Each of the solutions we have outlined has positive and negative sides. While the main goal of the ABP will be decisive for which solution is picked, this choice also has an impact on its potential use for quantitative studies of SSH.
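The third alternative can be made concrete with a small data-structure sketch. It is only an illustration under stated assumptions, not the ABP data model: the identifiers (`IMP-1`, `PUB-1`, `PUB-2`), the `Ownership` record, and the exact boundary date of 1 January 1998 are hypothetical and were chosen to mirror the Routledge example.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Ownership:
    """Time-bounded link between an imprint and the publisher that owns it."""
    imprint_id: str
    publisher_id: str
    valid_from: Optional[date]  # None means the start is unknown or open
    valid_to: Optional[date]    # None means the link is still valid


def owner_on(links, imprint_id, day):
    """Return the publisher_id that owns imprint_id on the given day, or None."""
    for link in links:
        if link.imprint_id != imprint_id:
            continue
        starts_ok = link.valid_from is None or link.valid_from <= day
        ends_ok = link.valid_to is None or day <= link.valid_to
        if starts_ok and ends_ok:
            return link.publisher_id
    return None


# Two separate registers (hypothetical identifiers).
imprints = {"IMP-1": "Routledge"}
publishers = {"PUB-1": "Routledge", "PUB-2": "Taylor & Francis Group"}

# The connection between the registers carries the temporal information.
# The boundary date is an assumption made for this example only.
links = [
    Ownership("IMP-1", "PUB-1", None, date(1997, 12, 31)),
    Ownership("IMP-1", "PUB-2", date(1998, 1, 1), None),
]

print(publishers[owner_on(links, "IMP-1", date(1990, 1, 1))])
print(publishers[owner_on(links, "IMP-1", date(2020, 6, 1))])
```

Keeping the validity period on the link, rather than on the imprint or the publisher record, is one way to record an acquisition without overwriting the earlier state.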

Connection
To construct the ABP, each record of a publisher in a local list needs to be connected to one in the master list. Each publisher can normally be identified by its ISBN prefix(es). For example, Routledge UK can be identified with 15 prefixes, such as 978-0-203, 978-0-415, and 978-0-7007 [25]. Since there are multiple prefixes for one entry in either the local or the master list, and since prefixes do not systematically distinguish between imprint and publisher, they are not suitable for creating straightforward one-to-one relationships between different lists.
In addition, the structure of local lists is diverse. Each record is described with a different set of categories, and each of the categories is understood and recorded in a slightly different way, as is often observed in integration projects [26][27][28]. Therefore, each local list requires a customized approach to connect it to the master list. This indicates that the construction of a comprehensive master list requires a well-defined connection strategy. Below, we offer an example of such a strategy.
In short, our approach makes use of textual matching of publisher names in combination with a cross-validation of the matched names with additional characteristics of publishers (e.g. ISBN prefixes, country of headquarters). The implementation of this approach proceeds in five steps: (1) exploration of the structure of local lists, (2) assignment of internal ABP identifiers to local lists, (3) textual publisher name matching, (4) validation of matched names, and (5) the integration of the local lists with the master list. In what follows, we provide a brief description of each of the steps.
To elaborate a suitable approach to connecting lists, the first step involves exploration of the structure of local lists and mapping of equivalent data elements. In the case of the ABP, we have the following range of variables at our disposal, depending on availability: internal identifier of the publisher in the national bibliographic database, country hosting the headquarters of the publisher, abbreviation, ISBN prefix(es), and a variable indicating whether the publisher is a university press. Next, a new internal ABP identifier has to be assigned to each entry in each local list. This enables a straightforward connection between the master list and the different local lists after the contents of the local lists have been matched with the master list.
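Why prefixes alone do not yield one-to-one links can be shown with a brief sketch. The three prefixes are those quoted above; everything else is an assumption made for illustration: the record identifiers (`L-1`, `M-1`, `M-2`), the assignment of prefixes to records, and the made-up ISBN, whose check digit is not verified.

```python
from collections import defaultdict


def registrant_prefix(hyphenated_isbn):
    """Return the first three hyphen-separated elements of a hyphenated ISBN-13."""
    parts = hyphenated_isbn.split("-")
    if len(parts) != 5:
        raise ValueError("expected a hyphenated ISBN-13 with five elements")
    return "-".join(parts[:3])


def prefix_index(records):
    """Map each ISBN prefix to the set of record identifiers that list it."""
    index = defaultdict(set)
    for record_id, prefixes in records.items():
        for prefix in prefixes:
            index[prefix].add(record_id)
    return index


def candidates(local_prefixes, master_index):
    """Collect all master records sharing at least one prefix with a local record."""
    found = set()
    for prefix in local_prefixes:
        found |= master_index.get(prefix, set())
    return found


# Hypothetical records: one local entry and two master entries, where the
# master list happens to hold both an imprint-level and a group-level record.
local = {"L-1": {"978-0-415", "978-0-203"}}
master = {
    "M-1": {"978-0-415"},
    "M-2": {"978-0-415", "978-0-203", "978-0-7007"},
}

print(registrant_prefix("978-0-415-00000-0"))  # made-up ISBN, check digit not verified
print(sorted(candidates(local["L-1"], prefix_index(master))))
```

In this constructed case the local record shares prefixes with both master records, so the prefix can narrow down the candidates but cannot decide between them.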
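The first two steps might be sketched as follows. This is a rough illustration rather than the procedure used for the ABP: the source names (`list_a`, `list_b`), the local column names, the common field names, and the identifier format are all hypothetical.

```python
# Common data elements named in the text; the field names are assumptions.
COMMON_FIELDS = ("local_id", "name", "country", "abbreviation",
                 "isbn_prefixes", "university_press")

# Hypothetical mappings from local column names to the common fields.
FIELD_MAPS = {
    "list_a": {"id": "local_id", "publisher": "name", "country_code": "country"},
    "list_b": {"code": "local_id", "label": "name", "prefixes": "isbn_prefixes"},
}


def harmonise(records, source):
    """Map local records to the common fields and assign an internal identifier."""
    mapping = FIELD_MAPS[source]
    rows = []
    for number, raw in enumerate(records, start=1):
        # Every common field is present; it stays None when not available.
        row = dict.fromkeys(COMMON_FIELDS)
        for local_field, common_field in mapping.items():
            if local_field in raw:
                row[common_field] = raw[local_field]
        # Hypothetical internal identifier: source name plus running number.
        row["abp_local_id"] = f"{source}:{number}"
        rows.append(row)
    return rows


rows = harmonise(
    [{"id": 7, "publisher": "Routledge", "country_code": "GB"}],
    "list_a",
)
print(rows[0])
```

Recording unavailable elements explicitly as missing, instead of dropping them, makes it visible at the validation stage which checks can be run for which list.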
In the step that follows, publisher names in local lists are matched with those in the master list. There are numerous ways to execute textual name matching. However, publisher names are often rather short and additional identification variables are mostly absent. Therefore, complex matching algorithms that make use of string distance calculation, for example, are not always suitable. Moreover, two different publishers sometimes use the same or similar names. For example, there is a Belgian publisher Academia Press, but the name Academia is also used by a Russian publisher. In addition, publisher names may (or may not) contain multiple components: apart from the mere name (e.g. 'Routledge'), a name can contain the type of legal entity ("Inc", "Ltd", …), an abbreviation, "Press" or "Publisher", subtitles, etc. There are no fixed criteria for the composition of a publisher name. In order to solve this problem, we propose a gradual procedure of text normalization: 1. Name matching based on the exact name as mentioned in both lists; 2. Name matching based on the exact name as mentioned in both lists after removing spaces and diacritic signs; 3. Name matching after removing legal entity type ("Inc", "Ltd", "GmbH", …); 4. Name matching after removing "Editor", "Press", "Publisher" (or variants) and their translations ("Forlag", "Uitgever", "Edizioni", …); 5. Name matching after removing subtitles; 6. Name matching after replacing an abbreviation by the full name if only the abbreviation is mentioned in the list. After each step, all new matches need to be checked manually to remove false positives, since the number of matched names increases at each step.
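Steps 1 to 5 of this gradual procedure can be sketched in a few lines. This is a minimal sketch under several assumptions, not the ABP implementation: the word lists are illustrative and far from exhaustive, lowercasing and the removal of punctuation at level 2 are our additions, subtitles are assumed to follow a colon or a spaced hyphen, and step 6 (abbreviations) is left out because it needs a separate lookup table. The publisher "Example Books" is invented.

```python
import re
import unicodedata

LEGAL_FORMS = {"inc", "ltd", "gmbh"}                      # illustrative only
GENERIC_WORDS = {"editor", "press", "publisher", "publishers",
                 "forlag", "uitgever", "edizioni"}        # illustrative only


def strip_diacritics(text):
    """Remove combining marks after Unicode decomposition."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalise(name, level):
    """Return the matching key for a name at normalisation level 1 to 5."""
    if level == 1:
        return name                                       # exact name
    text = name
    if level >= 5:                                        # drop the subtitle
        text = re.split(r"\s*:\s*|\s+-\s+", text, maxsplit=1)[0]
    text = strip_diacritics(text).lower()                 # lowercasing is an assumption
    tokens = re.findall(r"[\w&]+", text)                  # also drops punctuation
    if level >= 3:
        tokens = [t for t in tokens if t not in LEGAL_FORMS]
    if level >= 4:
        tokens = [t for t in tokens if t not in GENERIC_WORDS]
    return "".join(tokens)                                # spaces removed


def gradual_match(local_names, master_names, max_level=5):
    """Match names level by level; keep ambiguous cases apart for manual checks."""
    matches, ambiguous = {}, {}
    for level in range(1, max_level + 1):
        index = {}
        for master_name in master_names:
            index.setdefault(normalise(master_name, level), []).append(master_name)
        for name in local_names:
            if name in matches:
                continue
            key = normalise(name, level)
            found = index.get(key, []) if key else []
            if len(found) == 1:
                matches[name] = (found[0], level)
                ambiguous.pop(name, None)
            elif len(found) > 1:
                ambiguous[name] = found
    return matches, ambiguous


master_names = ["Routledge", "Academia Press", "Academia", "Example Books"]
local_names = ["Routledge", "Routledge Ltd.", "Example Books GmbH",
               "Academia Publishers"]
matches, ambiguous = gradual_match(local_names, master_names)
print(matches)
print(ambiguous)
```

In this toy run the name "Academia Publishers" is not matched automatically, because after step 4 it collides with two different master entries. Such cases are exactly the ones that the manual check after each step is meant to catch.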
After the textual matching has been carried out, the matched names need to be validated making use of other data elements present in the local lists and the master list (e.g. country of headquarters, ISBN prefixes). This, however, depends on the availability of data. In addition, the connections between the local list and the master list are manually validated by the respective local teams.
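A possible form of this cross-validation is sketched below. The field names (`country`, `isbn_prefixes`), the three outcome labels, and the rule that every available check must agree are assumptions made for illustration; the ISBN prefix in the example is invented.

```python
def validate_match(local_record, master_record):
    """Classify a name match as 'confirmed', 'conflict', or 'unverified'.

    Only data elements available in both records are compared; if none
    are available, the match cannot be verified automatically.
    """
    checks = []

    local_country = local_record.get("country")
    master_country = master_record.get("country")
    if local_country and master_country:
        checks.append(local_country == master_country)

    local_prefixes = set(local_record.get("isbn_prefixes") or ())
    master_prefixes = set(master_record.get("isbn_prefixes") or ())
    if local_prefixes and master_prefixes:
        # At least one shared prefix counts as agreement.
        checks.append(bool(local_prefixes & master_prefixes))

    if not checks:
        return "unverified"
    return "confirmed" if all(checks) else "conflict"


# Same name, different country of headquarters: flagged as a conflict.
print(validate_match({"name": "Academia", "country": "BE"},
                     {"name": "Academia", "country": "RU"}))

# Same country and one shared (invented) prefix: confirmed.
print(validate_match({"country": "GB", "isbn_prefixes": ["978-0-0000"]},
                     {"country": "GB", "isbn_prefixes": ["978-0-0000", "978-0-415"]}))

# No comparable data elements: left for manual validation.
print(validate_match({"name": "Routledge"}, {"name": "Routledge"}))
```

The 'unverified' outcome reflects the dependence on data availability noted above: such matches would rely entirely on manual validation by the local teams.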
Finally, after validation, those publishers from local lists that have not been identified in the master list can be added to the master list with a new master list identifier.
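The integration step could look roughly as follows. The identifier format (`ABP-` followed by six digits), the local identifiers, and the publisher "Example Books" are hypothetical; the sketch assumes that validated links are available as a mapping from local identifiers to master list identifiers.

```python
from itertools import count


def integrate(master, local, validated_links):
    """Add unmatched local publishers to the master list.

    master: dict of master identifier -> publisher name
    local: dict of local identifier -> publisher name
    validated_links: dict of local identifier -> master identifier
    Returns updated copies of the master list and of the links.
    """
    master = dict(master)
    links = dict(validated_links)
    numbers = count(1)
    for local_id, name in local.items():
        if local_id in links:
            continue  # already connected to a master record
        # Hypothetical identifier format; skip identifiers already in use.
        new_id = f"ABP-{next(numbers):06d}"
        while new_id in master:
            new_id = f"ABP-{next(numbers):06d}"
        master[new_id] = name
        links[local_id] = new_id
    return master, links


master = {"ABP-000001": "Routledge"}
local = {"FI-17": "Routledge", "FI-42": "Example Books"}
new_master, new_links = integrate(master, local, {"FI-17": "ABP-000001"})
print(new_master)
print(new_links)
```

Returning copies instead of modifying the inputs keeps the previous state of the master list available, which can help when local teams later report incorrect connections.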
To ensure that a register such as ABP is continuously updated, it is recommended to establish continuous feedback and updates from the local teams via a specific module in the online database application. This enables local teams to modify names, signal incorrect connections between their local list and the master list, or add country information, for example. This way, we believe, a register created on the basis of multiple local registers can remain fit for purpose in the long term and address the needs of its users.

Discussion and conclusion
Our aim in this paper was to offer for discussion a reflection on conceptual and methodological issues that emerge when using national bibliographic data for studies of the social sciences and humanities. Our focus was on interoperability as the main challenge, which we illustrated with examples drawn from the on-going work on the ABP register that is created through the integration of data from multiple data sources.
Some of the challenges can be overcome by the emerging infrastructure enabled by the open science movement (e.g. open data, software, etc.). Nevertheless, many of the interoperability issues result from the socio-cultural embeddedness of national (and other) bibliographic databases and hence have to be addressed conceptually. This has to be taken into account when engaging in more fine-grained and theory-motivated bibliometric studies of SSH.
In more practical terms, this means careful consideration of the rationale behind the setup of databases in relation to the research one has in mind. Firstly, does the reason for which a database was set up have implications for its content? Which publishers are regarded as being 'academic' and hence are traced in the data? Are there limitations for research that can be pursued with such data? Secondly, at a more fine-grained level, similar questions can be asked about each metadata category one intends to use in an analysis (e.g. what does the category 'academic publisher' mean? Who assigned it, how, and following what considerations?). We believe that considerations of this kind are a step towards a more accurate understanding of publication practices in SSH and research practices in these knowledge domains more broadly.
As discussed in this paper, specifically for a register of book publishers, there are multiple ways to address these challenges. With the example of the delineation of publishers, we showed that there are multiple alternatives available, each with their benefits and drawbacks. Therefore, the choice of the approach to be implemented, ideally, should be guided by considerations of the purpose of the register as well as the available resources.
Bibliometric studies of the SSH have benefited from the availability of data that originate in national bibliographic databases. First, from the researcher perspective, detailed studies of changes in publication patterns (across countries) have become a popular line of inquiry [8]. Many aspects of research evaluation have received considerable attention as well, for example conceptions and labelling of peer review in different contexts [13]. Research from the perspective of book publishers is less common. Some work has been done on the consolidation and concentration of the publishing industry, both in international and national contexts [2,11,29,30]. Thus the ABP can be a data source that opens up research opportunities on less explored topics. At the same time, for such a register to be useful for science studies, the transparency of its creation is of utmost importance.
ITM Web of Conferences 33, 02002 (2020), ICTeSSH 2020, https://doi.org/10.1051/itmconf/20203302002
The challenges posed by low interoperability across different data sources (bibliographic or other) will likely remain at the core of data integration initiatives. Thus, as our concluding note, we wish to emphasise that these challenges can be treated as spaces for discussion on the goals of data integration as well as the pros and cons of different integration solutions. Often, the main solution that is called for is a new standard to be used across the different sources. We highlight that a single standard is not the only option available. It is possible to take a more creative route and aim for a pluralistic approach that makes use of multiple standards at the same time. This, on the one hand, decreases the loss of information and, on the other hand, caters for proponents of differing standards, thus expanding the usability of a data source.