The Problem & Solution

This section explains the motivation behind the bib-ami tool. It details the common challenges faced when managing academic references and outlines the principled approach the tool takes to solve them.

The Challenge of Messy Bibliographies

Managing bibliographic references is a foundational task for researchers and academics. However, BibTeX files, especially those aggregated from multiple sources or generated by automated tools, frequently suffer from data quality issues that require significant manual intervention. These problems include:

  • Duplicate Entries: The same publication may appear multiple times with minor variations in title, author formatting, or citation key, leading to a redundant and unprofessional bibliography.

  • Missing or Invalid DOIs: A Digital Object Identifier (DOI) is the standard persistent identifier for a publication and the most reliable way to verify a reference. Entries often lack a DOI or contain one that is incorrect or outdated, hindering readers’ ability to locate the source material.

  • Inconsistent Metadata: Key fields like journal names, publication years, or author lists may be incomplete, inconsistently formatted, or contain errors. This requires manual correction to ensure citations are accurate and adhere to publication standards.

  • High Manual Effort: The cumulative effect of these issues is that researchers must spend a considerable amount of time cleaning their bibliographies, detracting from their primary work of research and writing.

Together, these problems amount to a classic data science challenge known as entity resolution: identifying and merging multiple records that refer to the same real-world entity, which here is a single academic publication.
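As a rough illustration of entity resolution on BibTeX data, the sketch below groups entries that appear to describe the same publication. It is not bib-ami's actual algorithm: the function names (`normalize_title`, `resolution_key`, `group_duplicates`), the dict-based entry layout, and the sample DOI are all assumptions made for this example. The only outside fact it relies on is that DOIs are case-insensitive, so lowercasing them is a safe normalization.

```python
import re
from collections import defaultdict


def normalize_title(title):
    """Lowercase, drop BibTeX braces and punctuation, collapse whitespace."""
    title = title.replace("{", "").replace("}", "")
    title = re.sub(r"[^a-z0-9 ]+", " ", title.lower())
    return " ".join(title.split())


def resolution_key(entry):
    """Prefer the DOI as the identity; fall back to the normalized title."""
    doi = entry.get("doi", "").strip().lower()
    if doi:
        return "doi:" + doi
    return "title:" + normalize_title(entry.get("title", ""))


def group_duplicates(entries):
    """Map each resolution key to the citation keys that share it."""
    groups = defaultdict(list)
    for entry in entries:
        groups[resolution_key(entry)].append(entry["ID"])
    return dict(groups)


# Hypothetical input: two pairs of near-duplicate entries.
entries = [
    {"ID": "smith2020", "title": "Deep {L}earning for Citations", "doi": "10.1000/XYZ123"},
    {"ID": "smith20a", "title": "Deep learning for citations.", "doi": "10.1000/xyz123"},
    {"ID": "doe2019", "title": "A Study of {BibTeX} Hygiene"},
    {"ID": "doe19", "title": "A study of BibTeX hygiene"},
]

groups = group_duplicates(entries)
for key, ids in groups.items():
    print(key, ids)
```

With this input, the four entries collapse into two groups: one keyed by the shared DOI and one keyed by the shared normalized title. Exact-match keys like these miss duplicates whose titles differ by a typo, which is one reason a real workflow would also consult an authoritative source.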

The bib-ami Philosophy: Creating a “Golden” Bibliography

The objective of bib-ami is to implement a robust, automated workflow that transforms a collection of untrusted BibTeX files into a single, clean, and verified “golden” bibliography. This process is guided by a clear definition of the desired outcome and a set of non-negotiable principles.

The Desired Outcome

The final, curated bibliography produced by the tool must meet the following five criteria:

Trustworthy

The final set of references is reliable, with its core metadata validated against authoritative sources.

Complete

The process does not discard any legitimate reference, even one that cannot be automatically verified. Such references are flagged for review instead.

Auditable

The final status of every single input record is traceable, with a clear reason why it was included, merged, or flagged as suspect.

Enriched

Each trusted reference is populated with a complete set of verifiable metadata fields available from the authoritative source.

High-Fidelity

User-generated content from the source files (e.g., notes, annotations, file links) is correctly preserved and merged into the final, clean records.

Guiding Principles

To achieve this outcome, the workflow is built on four core principles that inform every step of the process.

  1. Authoritative Data Reigns Supreme: Data from an external, authoritative source like CrossRef is treated as the ground truth. Data from a local source file is treated as an unverified claim. The workflow must always prioritize authoritative data when defining a record’s core identity and descriptive fields.

  2. Every Input Must Be Accounted For: No source record can ever be silently dropped or lost. The final output must be traceable back to every single initial input, ensuring a complete audit trail.

  3. Preserve User Intent: The system must distinguish between descriptive metadata that can be authoritatively corrected (e.g., a misspelled title) and intentional user-generated content (e.g., note, file). The latter represents work done by the user and must be preserved and correctly reconciled.

  4. Triage, Don’t Discard: The system’s job is not to make a final judgment on ambiguous records. Its primary role is to cleanly separate high-quality, verified records from questionable ones, empowering a human to make the final call on the exceptions.
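The four principles above can be read as a small reconcile-and-triage routine. The following is a minimal sketch under stated assumptions, not bib-ami's implementation: the `USER_FIELDS` tuple, the `reconcile` and `triage` functions, the `stub_lookup` helper, and the sample records are hypothetical, and the authoritative lookup is a local dictionary standing in for a real query to a service such as CrossRef.

```python
# Assumption: these field names are treated as user-generated content.
USER_FIELDS = ("note", "file", "annote")


def reconcile(local, authoritative):
    """Return (record, status, reason) for one local entry."""
    if authoritative is None:
        # Principle 4: keep the unverified entry, but mark it as suspect.
        return dict(local), "suspect", "no authoritative match found"
    # Principle 1: authoritative metadata defines the descriptive fields.
    merged = dict(authoritative)
    merged["ID"] = local["ID"]
    # Principle 3: carry user-generated fields over unchanged.
    for field in USER_FIELDS:
        if field in local:
            merged[field] = local[field]
    return merged, "verified", "metadata replaced from authoritative source"


def triage(local_entries, lookup):
    """Split entries into verified and suspect, with one audit row per input."""
    verified, suspect, audit = [], [], []
    for local in local_entries:
        record, status, reason = reconcile(local, lookup(local))
        (verified if status == "verified" else suspect).append(record)
        # Principle 2: every input is accounted for in the audit trail.
        audit.append({"source_id": local["ID"], "status": status, "reason": reason})
    return verified, suspect, audit


# Hypothetical stand-in for an authoritative source, keyed by DOI.
AUTHORITATIVE = {
    "10.1000/xyz123": {
        "title": "Deep Learning for Citations",
        "year": "2020",
        "doi": "10.1000/xyz123",
    },
}


def stub_lookup(entry):
    """Return the authoritative record for the entry's DOI, or None."""
    return AUTHORITATIVE.get(entry.get("doi", "").lower())


local_entries = [
    {"ID": "smith2020", "title": "Deep lerning for citations",
     "doi": "10.1000/xyz123", "note": "Read chapter 3"},
    {"ID": "mystery", "title": "An unverifiable report"},
]

verified, suspect, audit = triage(local_entries, stub_lookup)
for row in audit:
    print(row)
```

In this sketch the misspelled local title is overwritten by the authoritative one while the user's `note` survives, the entry with no match lands in the suspect list rather than being dropped, and the audit list contains exactly one row per input record.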