Biodiversity Data Journal 13: e141562
doi: 10.3897/BDJ.13.e141562

Commentary on "Preliminary Species Hypotheses" in Entomological Taxonomy: A Global Data and FAIR Infrastructure Perspective

Sharif Islam ‡, §
‡ Naturalis Biodiversity Center, Leiden, Netherlands
§ DiSSCo, Leiden, Netherlands

Corresponding author: Sharif Islam (sharif.islam@naturalis.nl)
Academic editor: Lyubomir Penev
Received: 11 Nov 2024 | Accepted: 21 Jan 2025 | Published: 10 Feb 2025
Citation: Islam S (2025) Commentary on "Preliminary Species Hypotheses" in Entomological Taxonomy: A Global Data and FAIR Infrastructure Perspective. Biodiversity Data Journal 13: e141562. https://doi.org/10.3897/BDJ.13.e141562

Abstract

What if early taxonomic findings were treated like preprints, open to iterative improvement, or managed with practices from the open-source community, such as Git branching, merging and patch management? Prompted by Buckley's article "Charting a Future for Entomological Taxonomy in New Zealand" (2024), this commentary explores these possibilities in the context of biodiversity informatics. In response to the need for rapid, scalable biodiversity monitoring, Buckley introduces preliminary species hypotheses (PSH) as a bridge between quick identification tools and the rigorous Linnaean system, leveraging DNA barcoding and AI-assisted image recognition to produce provisional classifications that can later be validated. Expanding on Buckley’s framework, this commentary emphasises the critical role of data linking, versioning and integration to support evolving taxonomic data. Borrowing from software and open-source practices, I explore the idea of managing PSH with an infrastructure that treats each taxonomic update as a versioned "commit", which can be tracked, refined and integrated over time.
Drawing insights from FAIR (Findable, Accessible, Interoperable, Reusable) principles and Digital Extended Specimens, I identify infrastructure requirements for PSH, including robust data standards, persistent identifiers and interoperability to support global biodiversity repositories. Additionally, Taxonomic Data Objects offer a model for dynamically integrating PSH into adaptable taxonomies that can evolve with new data and tools. By positioning PSH within an open, infrastructure-focused framework, this commentary advocates for scalable, hypothesis-driven biodiversity data that meets modern conservation needs, bridging traditional and emerging practices in taxonomy.

Keywords

taxonomy, species, interoperability, FAIR, data integration, open source

© Islam S. This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Introduction

In "Charting a Future for Entomological Taxonomy in New Zealand", published in the journal New Zealand Entomologist, T.R. Buckley (2024) proposes the concept of preliminary species hypotheses (PSH) as a way to bridge the gap between the need for rapid species identification and the rigorous Linnaean taxonomy. Buckley argues that PSH can address biodiversity monitoring needs by utilising the output of rapid identification tools - such as DNA barcoding and AI-assisted image recognition - as provisional classifications that serve as an intermediate stage before formal taxonomic classification. Although based in New Zealand and focused on entomology, this proposal has implications for other regions and fields within taxonomy and biodiversity research. Buckley's proposal envisions scalable, hypothesis-driven biodiversity data that can evolve as new information emerges.
Inspired by this approach, we might ask: What if early taxonomic findings were treated like preprints (Verma and Detsky 2020) - open to iterative improvement? Or managed through practices adapted from open-source software development, such as Git branching, merging and patch management, where each PSH acts as a versioned "commit"? Although this iterative process echoes how taxonomic science has traditionally progressed, the approach could offer a flexible framework for tracking and refining taxonomic data over time.

In this commentary, I explore the data linking and integration infrastructure required to support Buckley’s vision, emphasising how PSH fits within the broader framework of the Species Hypotheses (SH) and Taxon Hypotheses (TH) paradigm. These concepts also align with evolving Biodiversity Information Standards (TDWG) standards like the Taxon Concept Schema (TCS), which separates Taxon Concepts from Taxon Names to enhance data interoperability (Klazenga and Liljeblad 2024). Building on these conceptual frameworks, I draw on knowledge infrastructure studies such as Christine Borgman's work on data systems (Borgman and Brand 2024) and Sterner et al.'s pluralistic framework for biodiversity data sharing (Sterner et al. 2023). I also consider recent proposals, such as Digital Extended Specimens (Hardisty et al. 2022) and Taxonomic Data Objects (Upham and Poelen 2024), as potential models for integrating PSH and other hypothesis-driven insights (from both molecular- and media-based multimodal workflows) as data products within global biodiversity infrastructures. This infrastructure-based approach can help sustain taxonomy's relevance in conservation and research.
My focus is not on assessing the scientific rigour of PSH, but rather on the data linking and integration strategies that could underpin its implementation, offering a scalable pathway for the evolution of taxonomic knowledge.

Summary of paper

Buckley’s proposal introduces PSH as a practical and flexible way to address the gap between rapid biodiversity identification needs and the more formal Linnaean classification system. The paper presents the proposal with a historical background of entomological taxonomy in New Zealand, discussing the reasons for declining taxonomy funding and the importance of maintaining scientific rigour. For relevance, the summary here highlights the key concepts of PSH.

This provisional approach of PSH aligns well with similar concepts, such as Operational Taxonomic Units (OTUs) in DNA barcoding, which serve as proxies to categorise unidentified taxa for integration across different biodiversity datasets and use cases. According to Buckley, OTUs are typically molecular-based groupings (often not derived from DNA-sequenced specimens, particularly for environmental DNA data). As Buckley (2024: 9) states:

“...it is difficult to reconcile these OTUs [OTUs that are not derived from DNA sequencing specimens and do not have a physical reference specimen] with other types of character data. From a hypothesis testing perspective, these OTUs can also be considered ‘preliminary species hypotheses’, but with a weaker degree of support than from specimen-based DNA sequencing approaches (as outlined earlier). This approach will require a large-scale eDNA survey of New Zealand, focusing on the sampling of soil, water, air and insect trap residues. Achieving this goal would also be a 5- to 10-year project with a moderate financial investment. The output would be a comprehensive database of OTUs that, over time, could be connected to described species or to DNA sequences obtained from individual specimens”.
In contrast, PSH are structured as an intermediate classification that is less formal than Linnaean taxonomy, but aspires to achieve it over time. Unlike OTUs, PSH are not simply molecular clusters; they are hypotheses that can later be validated and incorporated into formal taxonomy as additional data become available. Buckley also reminds us in the paper that similar methods are commonly used in the fields of mycology (Kõljalg et al. 2013) and bacteriology. While OTUs offer a rapid and flexible tool for biodiversity estimation, PSH are designed to be a step closer to formal species recognition, enabling hypothesis-driven research and prioritisation without bypassing rigorous taxonomic standards entirely:

“The goal is not to replace the Linnean system, or to lower its scientific robustness, but to provide a framework for describing biodiversity more quickly than Linnean taxonomy can. DNA data can characterise lineages that, in turn, can be considered as ‘preliminary species hypotheses’. These hypothesised species can be tested, verified and described by taxonomists later if resources become available. In the meantime, these hypothesised species can be used as a basis in downstream conservation actions or ecological studies that require biodiversity to be divided into scientifically meaningful entities. However, it must be remembered that these hypothesised species have not been subject to robust testing and, therefore, any downstream inference will not be as reliable as that from a fully revised taxon” (Buckley 2024:8).

Furthermore, the robustness and benefit of the hypothesis-driven and iterative approach come not just from a single data type, but from integrating a variety of data types. For instance, combining molecular-based methods and multi-modal AI techniques can significantly reduce uncertainties in the inference of observations:

"If we want robust species hypotheses, then large numbers of characters will continue to be needed.
There are technologies emerging that promise to greatly increase the rate of data collection without sacrificing scientific robustness. The approach adopting these technologies is known as large-scale integrative taxonomy (Hartop et al. 2022; Salili-James et al. 2023; Karbstein et al. 2024). Briefly, this approach comprises two steps. First, high throughput methods are used to collect character data and perform a provisional grouping of specimens into putative species. Second, another character type, with a high a priori probability of being incongruent with the first character set, is used to test those putative species (Hartop et al. 2022). A key feature is the use of technology to accelerate the rate and scale of data collection" (Buckley 2024:7).

The demand for taxonomic information for a variety of use cases (such as environmental monitoring and biosecurity) is rising, making traditional insect sampling and identification methods increasingly impractical, especially amidst a shortage of experts. New technologies, including DNA barcoding, eDNA for community assessments and automated image recognition, offer promising alternatives that can democratise species identification. Automated image recognition, in particular, enables non-specialists to identify insects, making taxonomy more accessible. However, according to Buckley, successful adoption of these tools requires extensive digitisation of specimen records and integration with images, DNA sequences and geo-referenced data.

Key Terms, Definitions and Alignment with Existing Concepts

The practice of taxonomy and nomenclature deals with different concepts and terms beyond naming species (see Favret (2024) for the 5 'D's of taxonomy: delimitation, diagnosis, description, determination and discovery), where the aspect of testable hypotheses intersects all of these concepts.
While detailing every aspect is beyond the scope of this commentary, this section defines key terms and situates them within the evolving landscape of biodiversity informatics. The following concepts are briefly stated here to facilitate the discussion and lay the foundation for understanding how PSH can integrate into taxonomic workflows and biodiversity data infrastructures.

Barcode Index Numbers (BINs): BINs are molecular-based clusters derived from DNA barcoding, primarily serving as proxies for species identification using genetic divergence thresholds. BINs are similar to OTUs, but are specific to DNA barcoding. Unlike OTUs, which are often used as an intermediate step requiring further species-level identification, BINs are dynamic: the boundaries of what sequences can be associated with a particular BIN can change with new sampling data (Ratnasingham and Hebert 2013; Lue et al. 2022) and one BIN can cover more than one taxon (Huemer and Mutanen 2022).

Species Hypotheses (SH): SH is the main building block of UNITE (a database and sequence management environment centred on the eukaryotic nuclear ribosomal ITS region), which groups similar sequences into provisional species-level clusters, typically comprising two or more sequences to avoid excessive inflation (Kõljalg et al. 2013). Representative sequences for each SH are chosen through consensus computation or expert designation. These SHs, along with their representative sequences and annotations, are made available as reference datasets. Buckley's paper discusses SH used in mycology and explores how entomology can adapt similar ideas. This discussion also opens up the possibility of integrating SH concepts for broader use beyond mycology and zoology, not necessarily limited to DNA-based identification methods.

Taxon Hypotheses (TH) paradigm: Expanding on the SH concept, Kõljalg et al.
(2020) introduce the TH paradigm, a framework for linking sequence-based identifications to taxonomic concepts. By assigning Digital Object Identifiers (DOIs) to these hypotheses, THs enable transparent and reproducible connections between molecular data and taxonomic classifications. Kõljalg et al. (2020) also highlight that, while molecular data are becoming increasingly common, differences in sampling, genetic markers and analytical methods often lead to competing and sometimes conflicting classifications. The reference datasets and DOIs provided by UNITE offer a unique reference point that remains consistent even as underlying data and conclusions evolve. This system allows users to reference the data, enabling modifications and augmentations while preserving original versions.

All of these frameworks have one thing in common: they acknowledge the dynamic and "preliminary" nature of initial insights into species identification. Thus, PSH or SH could emerge as a new "data type" that can be used not just in mycology or zoology, but across domains. Furthermore, this approach supports integrative methods that apply multiple types of characters, leading to robust hypothesis tests and, therefore, greater confidence in the acceptance or rejection of a species hypothesis.

Recent discussions (see Karbstein et al. (2024)) on species delimitation and AI also underscore the importance of incorporating multiple data types and frameworks, such as the unified species concept, morphological and phylogenetic (genetic relationships and shared ancestry) approaches and DNA clustering methods, which are moving towards a more integrative approach (genetics/genomics + morphology + ecology). AI-based identification methods, including multimodal approaches involving sound and vision, are also becoming increasingly prevalent (Wäldchen and Mäder 2018; Yang et al. 2021).
Each approach has limitations; thus, integrative approaches that combine multiple lines of evidence align with the dynamic nature of species hypotheses.

By situating PSH, SH, TH, BINs and OTUs within a unified conceptual framework, this commentary underscores the value of treating species hypotheses as dynamic, evolving data objects. Each concept - BINs, OTUs, SH, TH and PSH - has distinct origins rooted in specific fields, such as molecular biology, fungal taxonomy and entomology. These approaches complement the Linnaean classification by integrating preliminary taxonomic data into an iterative process that refines and validates hypotheses over time. Expanding their application to encompass diverse data types will enhance their utility across taxonomic domains. A holistic and integrative approach supports the iterative refinement of taxonomies while balancing the need for rapid discovery with the production of robust, high-quality data.

The role of infrastructures

The summary of Buckley’s PSH proposal makes clear that data integration and linking are central; the successful implementation and sustainability of PSH therefore require a robust digital infrastructure. This infrastructure not only enables data sharing, but also supports the evolution of taxonomic knowledge in a scalable and accessible way. The PSH model is comparable to preprints in scholarly publishing: it provides a way to make new insights accessible, citable and linkable, even if they require further refinement and validation. When viewed through the lens of the Digital Extended Specimen (DES) paradigm (Hardisty et al. 2022) and the FAIR (Findable, Accessible, Interoperable, Reusable) principles, the PSH concept highlights the need for infrastructure that can support both provisional classifications and long-term taxonomic research.
The intersection of PSH with DES and FAIR principles underscores the challenges - and critical importance - of establishing, maintaining and scaling digital infrastructure to meet the demands of modern biodiversity research. This is not an argument for a new type of digital infrastructure, but for improving existing infrastructures and aligning global and regional funding schemes so that such a proposal can be implemented. Similar to Buckley, Meier et al. (2024) also emphasise that achieving integrative taxonomy (combining morphological, whole-organism study with molecular data) requires reliable data handling, including efficient voucher storage, standardised data practices and FAIR-compliant infrastructure to support the evolution of taxonomic hypotheses as new data are added.

For biodiversity data to be effective, including taxonomic and nomenclature information, a resilient infrastructure is crucial to maintain links amongst evolving species hypotheses, underlying specimens, environmental observations and genetic data. Efforts to create such infrastructures have accelerated globally as we confront biodiversity and climate crises (Devictor and Bensaude-Vincent 2016). Although global data infrastructures that support biodiversity data and research funding are unevenly distributed, the DES and PSH approach could mitigate disparities by providing an inclusive, interoperable system that enables biodiversity data sharing across regions and disciplines.

The DES, as proposed, is a paradigm for digitally linking specimen data from global natural science collections to related taxonomic, ecological and environmental data. DES enables the transformation of physical specimen data into digital objects, making them accessible and FAIR. This approach not only broadens usability, but also enhances the value of collections by integrating them into global data infrastructures that can be leveraged for large-scale, multifactor analysis (Heberling et al. 2021).
Thinking about DES, PSH and FAIR in a holistic framework brings up the notion of pluralistic data pooling advocated by Sterner et al. (2023: 2):

“We define ‘data pooling’ for biodiversity data as a process that combines data from multiple sources into one taxonomically standardized body of information, provides infrastructure for managing and accessing the combined data and governs it as a shared resource for a community of users and stakeholders beyond a single research project or lab. We define ‘taxonomic standardization’ as a set of processes for verifying and re-identifying a collection of species observations as needed to ensure that they are classified in a standardized way according to a single, coherent taxonomy of choice. More generally, ‘data standardization’ (also known as data harmonization) is an established term in academic and industry data science practices.”

Part of this set of processes can be a PSH data element that can accommodate evolving taxonomic concepts, while ensuring reliable links between data sources. It allows for both the robustness of Linnaean taxonomy and the flexibility of documenting hypotheses, thereby fostering a dynamic approach to biodiversity research. Echoing Sterner (also Leonelli (2020) and Borgman and Wofford (2021)), the challenges of biodiversity data collection, sharing and preservation are as much social as technical, thus: “...making biodiversity data comprehensively available and reusable will likely require major changes to the cultures, organizations and infrastructures of the research communities involved” (Sterner et al. 2023: 2).

This also brings up the notion of maintenance and support. As Borgman et al. (2016) note, "durability" in infrastructure requires continuous maintenance across technical and human resources. Applying this insight to biodiversity data infrastructure highlights that building a sustainable, FAIR-compliant system requires not only technical innovation, but also governance and investment.
Borgman’s work in astronomy shows that even well-established systems still face fragility without regular support - an important reminder as we build infrastructures that will support biodiversity data on a global scale.

Integration with Global Data Standards and Networks

As mentioned already, PSH can expand beyond New Zealand and entomology; it has potential for integration with global biodiversity data initiatives. Organisations and platforms such as the Catalogue of Life, GBIF, BCON, ALA, INSDC, BOLD, UNITE and DiSSCo provide frameworks, tools and services for aggregating and curating biodiversity data, which could be expanded to incorporate PSH as a new type of digital object. By embedding provisional species data into the global biodiversity network, PSH could become widely accessible and actionable across regions and disciplines. As Moersberger et al. (2024) emphasise in their study on European biodiversity monitoring, integrating biodiversity data is crucial for reducing fragmentation and filling taxonomic gaps. Aligning PSH with the shift toward digital taxonomy could further bridge the divide between morphological and molecular approaches, providing traceable, reusable links to each hypothesis’s provenance. This would enable a more cohesive and adaptable taxonomy, supporting dynamic updates as new data become available.

Enhancing PSH with FAIR Compliance

To fully realise PSH, we need infrastructure that is both accessible and FAIR-compliant. These hypotheses will function as data points or nodes within a knowledge graph (Page 2019, Penev et al. 2024) and, because they could be stored across multiple infrastructures (Sterner et al. 2020), data linking and interoperability are essential. The Upham and Poelen (2024) concept of Taxonomic Data Objects aligns with this need by offering machine-readable digital packages that encode metadata, enabling the tracking of evolving species concepts over time.
Initial taxonomic data can also be compared to a software commit in Git: each PSH represents a specific "state" of species classification, preserving the evolution of taxonomic understanding without overwriting earlier hypotheses. This approach provides a clear pathway for reviewing and merging provisional classifications with established taxonomies, strengthening taxonomic workflows by ensuring data integrity and interoperability across different taxonomic systems (see Fig. 1 for a simple schematic comparing Git merging with the process described using PSH).

Practical Requirements for Preliminary Species Hypotheses Implementation

For PSH to serve as a valuable tool in taxonomy and biodiversity informatics, certain key elements are essential. This is an initial proposal and will benefit from further discussion:

1. Persistent Identifiers (PIDs): Each PSH digital object should be assigned a PID to ensure reliable tracking and referencing, similar to the approach used for Digital Extended Specimens within the FAIR Digital Object framework (Islam et al. 2023). As suggested by Upham and Poelen (2024), versioning and hashing could be incorporated as part of the metadata to support tracking changes over time. Assigning PIDs to taxonomic data and hypotheses is not a new concept; for example, the Catalogue of Life assigns identifiers for name usage and checklists (Bánki et al. 2023) and UNITE assigns DOIs to species hypotheses (Kõljalg et al. 2020). The discussion should not focus on which specific PID mechanism is optimal - though implementation details are important - but rather on establishing a consensus and an actionable plan to assign PIDs to these entities at a granular level. This will enable effective tracking and linking, but it requires dedicated infrastructure and ongoing maintenance support.
By assigning transparent and persistent identifiers to contributors across all stages of a species hypothesis’ evolution, the framework could foster equitable recognition while maintaining rigorous standards for formal naming.

[Figure 1: diagram with two parallel panels, "Taxonomic Workflow" and "Git Ideas". Recoverable labels, Taxonomic Workflow: Specimen Collection & Morphological Data; Environmental DNA & Multi Modal Data; Initial Identification; Species Hypothesis SH1 and SH2; SH Integrative Data; Provisional Data Integration; FAIR and Data Standards Compliance / Taxon Concept Schema Usage; Linked with PIDs; Long Term Data Repository; Revised SH with Additional Data; Reviewed & Accepted; Refined Taxonomic Data. Recoverable labels, Git Ideas: Main Taxonomic Data Branch; Hypotheses Branch 1 and 2; Hypotheses Sub-branch 2.1; Combined Hypotheses Branch; FAIR and Data Standards Alignment / Taxon Concept Schema Usage; Linked with PIDs; Long term Data Repository; Reviewed & Accepted; Refined Taxonomic Data.]

Figure 1. A simplified conceptual framework for version-controlled taxonomic data management. This diagram illustrates the parallel between hypothesis-driven taxonomic workflows and Git-based version control systems. Drawing inspiration from software development practices, the framework demonstrates how version control concepts could be applied to manage and track the evolution of taxonomic hypotheses. The actual processes involved are much more complex, as described in Pyle's paper "An Introduction to Scientific Names of Organisms and the Taxon Concepts they Represent" (Pyle 2022).

2. Interoperable Data Standards: Standards like Darwin Core and Taxon Concept Schema (TCS) are necessary to harmonise species hypothesis data with other biodiversity data types, such as observation and occurrence data. Consistent standards enable smoother integration and reuse of taxonomic information across platforms.
How a preliminary concept could become part of Darwin Core and other standards frameworks will need careful consideration. For instance, the “dwc:previousIdentifications” property in Darwin Core could store the reference to preliminary data. PSH, SH and TH could have their own data model and metadata, but this also needs global consensus. As new data and insights are being generated, standards and schemas are essential for usability in diverse contexts. While Darwin Core is widely used, TCS’s separation of Taxon Concepts from Taxon Names allows greater flexibility for mapping and resolving taxonomic data. TCS could possibly accommodate dynamic states such as "Preliminary" and "Final" as new insights emerge. It could also address provenance and attribution, akin to the Linnaean tradition of authorship, requiring each state to have a source ("accordingTo") (Klazenga and Liljeblad 2024).

3. FAIR Principles: Along with PIDs, machine-readable formats and data standards will enhance accessibility, interoperability and reusability, supporting transparent and evolving taxonomic classifications. Similar ideas have been proposed by Miralles et al. (2020) in the context of alpha taxonomy repositories. Taxonomic Data Objects (Upham and Poelen 2024) could standardise PSH data in a machine-readable format, preserving their structure and allowing flexible data use.

4. Global Coordination and open source practices: Collaborative efforts with established networks are essential for integrating PSH into a global biodiversity framework. Beyond achieving consensus on metadata standards, the accessibility and publication of these data must remain a priority. Funders, research institutions and collection-holding organisations need to recognise the importance of APIs (Addink et al. 2023), repositories, data stewardship (De Prins 2019, Bentley et al. 2024) and other foundational infrastructure and commit both human and technological resources to support them.
This is especially crucial given that many countries, despite their reliance on biodiversity data for modelling and monitoring, often lack the necessary capacity, expertise or funding to fully exploit its potential (Moersberger et al. 2024). As illustrated by New Zealand's example, where a small population and limited taxonomic expertise hinder the development of comprehensive taxonomic research, many countries depend on international collaboration for taxonomic knowledge. Addressing this taxonomic impediment calls for capacity building, knowledge exchange and the creation of sustainable, FAIR-aligned taxonomic services through coordinated efforts (Buckley 2024).

A unified global solution may be impractical, yet stronger coordination in the software and standards that support taxonomic services is critical. This can facilitate the effective use of new data elements like PSH and promote shared governance structures. For instance, the discussions by Sandall et al. (2023) on checklist maintenance can be extended to taxonomic software and service development, where PSH could be tested and refined. Capacity management and funding challenges also require open dialogue, especially given the voluntary nature of many contributions in taxonomy and also in biodiversity informatics and data stewardship. Metrics from open-source projects, such as the "Contributor Absence Factor" (or "Bus Factor") - which assesses how many contributors can be lost before a project is impacted - could help guide efforts towards sustainability. By learning from open-source practices and research software sustainability principles (Cohen et al. 2021), we can enhance taxonomy's resilience and interoperability across regions. While taxonomic expertise remains indispensable, adopting insights from open-source and other data ecosystems will help us to overcome challenges in data infrastructure and interoperability.
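The commit analogy and the PID, versioning and hashing requirements discussed in this section can be made concrete with a small sketch. The Python below is only an illustration under stated assumptions, not an implementation of Git, of Taxonomic Data Objects or of any existing biodiversity service: the class names (HypothesisCommit, HypothesisStore), the field names and the example identifiers such as "PSH-0001" are all hypothetical. Each hypothesis state is immutable, is identified by a SHA-256 hash of its content (standing in for a PID), records who asserted it, and points to its parent states, so a merge adds a new state without overwriting earlier ones.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class HypothesisCommit:
    """One immutable state of a species hypothesis (hypothetical model)."""
    label: str            # made-up local label, e.g. "PSH-0001"
    status: str           # e.g. "preliminary" or "accepted"
    evidence: tuple       # identifiers of supporting data (illustrative strings)
    according_to: str     # source asserting this state
    parents: tuple = ()   # content hashes of the states this one builds on

    @property
    def content_hash(self) -> str:
        # Hash a canonical JSON form so identical content always gets the same id.
        payload = json.dumps(
            {
                "label": self.label,
                "status": self.status,
                "evidence": sorted(self.evidence),
                "according_to": self.according_to,
                "parents": list(self.parents),
            },
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class HypothesisStore:
    """Append-only store: states are added, never edited in place."""

    def __init__(self):
        self._commits = {}

    def add(self, commit: HypothesisCommit) -> str:
        for parent in commit.parents:
            if parent not in self._commits:
                raise KeyError(f"unknown parent state: {parent}")
        self._commits[commit.content_hash] = commit
        return commit.content_hash

    def get(self, commit_hash: str) -> HypothesisCommit:
        return self._commits[commit_hash]

    def merge(self, left: str, right: str, according_to: str,
              status: str = "preliminary") -> str:
        """Create a new state that pools the evidence of two earlier states."""
        a, b = self._commits[left], self._commits[right]
        merged = HypothesisCommit(
            label=a.label,
            status=status,
            evidence=tuple(sorted(set(a.evidence) | set(b.evidence))),
            according_to=according_to,
            parents=(left, right),
        )
        return self.add(merged)

    def history(self, head: str) -> list:
        """Return every state reachable from head, starting with head itself."""
        seen, stack, order = set(), [head], []
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            order.append(current)
            stack.extend(self._commits[current].parents)
        return order


# Usage: two independent lines of evidence, later merged after review.
store = HypothesisStore()
dna = store.add(HypothesisCommit(
    "PSH-0001", "preliminary", ("barcode:example-1",), "DNA barcoding workflow"))
img = store.add(HypothesisCommit(
    "PSH-0001", "preliminary", ("image:example-7",), "image recognition workflow"))
merged = store.merge(dna, img, according_to="reviewing taxonomist", status="accepted")
```

In this sketch the two preliminary states remain retrievable after the merge, which mirrors the idea that refinement should not erase historical hypotheses. A production system would more likely mint resolvable PIDs (for example DOIs) and keep the hash as integrity metadata instead of using the hash as the identifier itself.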
Conclusion

Buckley's concept of PSH, primarily proposed within entomology, parallels existing frameworks like SH in mycology and BINs and OTUs from molecular methods. Despite their overlaps and distinctions, the need for standardised frameworks to manage preliminary and evolving taxonomic data remains crucial. These frameworks address challenges across diverse taxonomic domains, emphasising their potential to create interoperable and dynamic taxonomic practices, but a wider and global discussion is needed to find a holistic solution.

In the context of New Zealand, Buckley advocates for shifting entomological taxonomy away from the primary focus on completing Linnaean classification. Instead, his proposal highlights achievable objectives aligned with realistic funding and timelines, incorporating DNA data and AI methods as preliminary steps towards formal classification. This commentary connects Buckley's proposal to broader initiatives, such as FAIR principles, Digital Extended Specimens, Taxon Concept Schema, Taxonomic Data Objects and open-source software practices. By treating PSH as data points - similar to versioned Git "commits" or "preprints" - species identification and classifications can be iteratively refined without losing historical data. This fosters a more adaptable and integrative approach to taxonomy, bridging morphological and molecular data and AI-based identification, while enhancing global biodiversity conservation efforts.

Conflicts of interest

The authors have declared that no competing interests exist.

References

• Addink W, Kyriakopoulou N, Penev L, Fichtmueller D, Norton B, Shorthouse D (2023) Deliverable D1.3 Best practice manual for findability, re-use and accessibility of infrastructures. ARPHA Preprints. https://doi.org/10.3897/arphapreprints.e107169
• Bánki O, Döring M, Jeppesen T (2023) Name IDs and Name Matching for Catalogue of Life: Existing Services and Prospects.
Biodiversity Information Science and Standards 7 https ://doi.org/10.3897/biss.7.111662 Bentley A, Thiers B, Moser WE, Watkins-Colwell GJ, Zimkus BM, Monfils AK, Franz NM, Bates JM, Boundy-Mills K, Lomas MW, Ellwood ER, Poo S, Contreras DL, Webster MS, Nelson G, Pandey JL (2024) Community Action: Planning for Specimen Management in Funding Proposals. BioScience 74 (7): 435-439. https://doi.org/10.1093/biosci/biae032 Borgman C, Sands A, Darch P, Golshan M (2016) The durability and fragility of knowledge infrastructures: Lessons learned from astronomy. Proceedings of the Association for Information Science and Technology 53 (1): 1-10. https://doi.org/10.1002/ pra2.2016.14505301057 Borgman C, Wofford M (2021) From Data Processes to Data Products: Knowledge Infrastructures in Astronomy. Harvard Data Science Review https ://doi.org/ 10.1162/99608f92.4e792052 Borgman C, Brand A (2024) The Future of Data in Research Publishing: From Nice to Have to Need to Have? Harvard Data Science Review https://doi.org/ 10.1162/99608f92.b73aae77 Buckley TR (2024) Charting a future for entomological taxonomy in New Zealand. New Zealand Entomologist1-17. https://doi.org/10.1080/00779962.2024.2407230 Cohen J, Katz D, Barker M, Chue Hong N, Haines R, Jay C (2021) The Four Pillars of Research Software Engineering. IEEE Software 38 (1): 97-105. httos://doi.org/10.1109/ ms .2020.2973362 De Prins J (2019) Global Open Biodiversity Data: Future Vision of FAIR Biodiversity Data Access, Management, Use and Stewardship. Biodiversity Information Science and Standards 3 https://doi.org/10.3897/biss.3.37190 Devictor V, Bensaude-Vincent B (2016) From ecological records to big data: the invention of global biodiversity. History and Philosophy of the Life Sciences 38 (4). https://doi.org/ 10.1007/s40656-016-0113-2 Favret C (2024) The 5 ‘D’s of Taxonomy: A User’s Guide. The Quarterly Review of Biology 99 (3): 131-156. 
https ://doi.org/10.1086/732044 Hardisty AR, Ellwood ER, Nelson G, Zimkus B, Buschbom J, Addink W, Rabeler RK, Bates J, Bentley A, Fortes JAB, Hansen S, Macklin JA, Mast AR, Miller JT, Monfils AK, Paul DL, Wallis E, Webster M (2022) Digital Extended Specimens: Enabling an Extensible Network of Biodiversity Data Records as Integrated Digital Objects on the Internet. BioScience 72 (10): 978-987. https://doi.org/10.1093/biosci/biacO60 Heberling JM, Miller J, Noesgaard D, Weingart S, Schigel D (2021) Data integration enables global biodiversity synthesis. Proceedings of the National Academy of Sciences 118 (6). https ://doi.org/10.1073/pnas.2018093118 Huemer P, Mutanen M (2022) An Incomplete European Barcode Library Has a Strong Impact on the Identification Success of Lepidoptera from Greece. Diversity 14 (2). https:// doi.org/10.3390/d14020118 Islam S, Beach J, Ellwood E, Fortes J, Lannom L, Nelson G, Plale B (2023) Assessing the FAIR Digital Object Framework for Global Biodiversity Research. Research Ideas and Outcomes 9 https://doi.org/10.3897/rio.9.e108808 Karbstein K, Késters L, Hoda¢ L, Hofmann M, Hérandl E, Tomasello S, Wagner N, Emerson B, Albach D, Scheu S, Bradler S, de Vries J, Irisarri 1, Li H, Soltis P, Mader P, Waldchen J (2024) Species delimitation 4.0: integrative taxonomy meets artificial 13 intelligence. Trends in Ecology & Evolution 39 (8): 771-784. httos://doi.org/10.1016/j.tree. 2023.11.002 Klazenga N, Liljeblad J (2024) Expressing Circumscription in the Taxon Concept Schema (TCS). Biodiversity Information Science and Standards 8 htips://doi.org/10.3897/biss. 
8.140738 KOljalg U, Nilsson RH, Abarenkov K, Tedersoo L, Taylor AS, Bahram M, Bates S, Bruns T, Bengtsson-Palme J, Callaghan T, Douglas B, Drenkhan T, Eberhardt U, Duenas M, Grebenc T, Griffith G, Hartmann M, Kirk P, Kohout P, Larsson E, Lindahl B, LUcking R, Martin M, Matheny PB, Nguyen N, Niskanen T, Oja J, Peay K, Peintner U, Peterson M, Példmaa K, Saag L, Saar I, Schuler A, Scott J, Senés C, Smith M, Suija A, Taylor DL, Telleria MT, Weiss M, Larsson K (2013) Towards a unified paradigm for sequence-based identification of fungi. Molecular Ecology 22 (21): 5271-5277. htips://doi.org/10.1111/mec. 12481 K6ljalg U, Nilsson H, Schigel D, Tedersoo L, Larsson K, May T, Taylor AS, Jeppesen TS, Frgslev TG, Lindahl B, Poldmaa K, Saar |, Suija A, Savchenko A, Yatsiuk |, Adojaan K, Ivanov F, Piirmann T, P6hdnen R, Zirk A, Abarenkov K (2020) The Taxon Hypothesis Paradigm—On the Unambiguous Detection and Communication of Taxa. Microorganisms 8 (12). https://doi.org/10.3390/microorganisms8121910 Leonelli S (2020) Learning from Data Journeys. Data Journeys in the Sciences1-24. https ://doi.org/10.1007/978-3-030-37177-7_1 Lue C, Abram P, Hrcek J, Buffington M, Staniczenko PA (2022) Metabarcoding and applied ecology with hyperdiverse organisms: Recommendations for biological control research. Molecular Ecology 32 (23): 6461-6473. https://doi.org/10.1111/mec.16677 Meier R, Lawniczak MN, Srivathsan A (2024) Illuminating Entomological Dark Matter with DNA Barcodes in an Era of Insect Decline, Deep Learning, and Genomics. Annual Review of Entomology _https://doi.org/10.1146/annurev-ento-040124-014001 Miralles A, Bruy T, Wolcott K, Scherz MD, Begerow D, Beszteri B, Bonkowski M, Felden J, Gemeinholzer B, Glaw F, Gléckner FO, Hawlitschek O, Kostadinov |, Nattkemper TW, Printzen C, Renz J, Rybalka N, Stadler M, Weibulat T, Wilke T, Renner SS, Vences M (2020) Repositories for Taxonomic Data: Where We Are and What is Missing. Systematic Biology 69 (6): 1231-1253. 
https://doi.org/10.1093/sysbio/syaa026 Moersberger H, Valdez J, Martin JC, Junker J, Georgieva |, Bauer S, Beja P, Breeze T, Fernandez M, Fernandez N, Brotons L, Jandt U, Bruelheide H, Kissling WD, Langer C, Liquete C, Lumbierres M, Solheim AL, Maes J, Moran-Ordofiez A, Moreira F, Pe'er G, Santana J, Shamoun-Baranes J, Smets B, Capinha C, McCallum |, Pereira H, Bonn A (2024) Biodiversity monitoring in Europe: User and policy needs. Conservation Letters 17 (5). https://doi.org/10.1111/conl.13038 Page RM (2019) Ozymandias: a biodiversity knowledge graph. PeerJ 7 https://doi.org/ 10.7717/peerj.6739 Penev L, Koureas D, Groom Q, Lanfear J, Agosti D, Casino A, Miller J, Cochrane G, Ba n, O. K&, ljalg U, Ruch P (2024) Beyond BiCIKL: Towards Building an Al-Assisted "Biodiversity Supergraph". Biodiversity Information Science and Standards 8: 135550. https ://doi.org/10.3897/biss.8.135550 Pyle R (2022) An Introduction to Scientific Names of Organisms, and the Taxon Concepts they Represent. Biodiversity Information Science and Standards 6 https://doi.org/10.3897/ biss.6.93926 14 Islam S Ratnasingham S, Hebert PN (2013) A DNA-Based Registry for All Animal Species: The Barcode Index Number (BIN) System. PLoS ONE 8 (7). https://doi.org/10.1371/ journal.pone.0066213 Sandall E, Maureaud A, Guralnick R, McGeoch M, Sica Y, Rogan M, Booher D, Edwards R, Franz N, Ingenloff K, Lucas M, Marsh C, McGowan J, Pinkert S, Ranipeta A, Uetz P, Wieczorek J, Jetz W (2023) A globally integrated structure of taxonomy to support biodiversity science and conservation. Trends in Ecology & Evolution 38 (12): 1143-1153. https ://doi.org/10.1016/j.tree.2023.08.004 Sterner B, Gilbert E, Franz N (2020) Decentralized but Globally Coordinated Biodiversity Data. Frontiers in Big Data 3 https ://doi.org/10.3389/fdata.2020.519133 Sterner B, Elliott S, Gilbert EE, Franz NM (2023) Unified and pluralistic ideals for data sharing and reuse in biodiversity. 
Database 2023 https ://doi.org/10.1093/database/ baad048 Upham N, Poelen J (2024) Taxonomic Data Objects for Communicating the Meaning of Species Names. Biodiversity Information Science and Standards 8 httos://doi.org/ 10.3897/biss.8.139413 Verma A, Detsky A (2020) Preprints: a Timely Counterbalance for Big Data—Driven Research. Journal of General Internal Medicine 35 (7): 2179-2181. httos://doi.org/10.1007/ $11606-020-05746-w Waldchen J, Mader P (2018) Machine learning for image based species identification. Methods in Ecology and Evolution 9 (11): 2216-2225. httos://doi.org/10.1111/2041-210x. 13075 Yang B, Zhang Z, Yang C, Wang Y, Orr MC, Wang H, Zhang A (2021) Identification of Species by Combining Molecular and Morphological Data Using Convolutional Neural Networks. Systematic Biology 71 (3): 690-705. https ://doi.org/10.1093/sysbio/syab076