The Problem & Solution
This section explains the motivation behind the bib-ami tool. It
details the common challenges faced when managing academic references
and outlines the principled approach the tool takes to solve them.
The Challenge of Messy Bibliographies
Managing bibliographic references is a foundational task for researchers and academics. However, BibTeX files, especially those aggregated from multiple sources or generated by automated tools, frequently suffer from data quality issues that require significant manual intervention. These problems include:
Duplicate Entries: The same publication may appear multiple times with minor variations in title, author formatting, or citation key.
Missing or Invalid DOIs: A Digital Object Identifier (DOI) is the standard for ensuring a reference is persistent and verifiable. Entries often lack a DOI or contain one that is incorrect or outdated.
Inconsistent Metadata: Key fields like journal names, publication years, or author lists may be incomplete, inconsistently formatted, or contain errors.
High Manual Effort: The cumulative effect of these issues is that researchers must spend a considerable amount of time cleaning their bibliographies, detracting from their primary work of research and writing.
These problems create a classic data science challenge known as entity resolution: identifying and merging multiple records that refer to the same real-world entity.
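As a rough illustration of entity resolution on bibliography data, the sketch below groups entries whose titles differ only in case, braces, accents, or punctuation. It is a minimal example and not bib-ami's actual matching logic: the function names `normalize_title` and `group_duplicates`, the dictionary layout of an entry, and the sample entries are all assumptions made for illustration.

```python
from __future__ import annotations

import re
import unicodedata
from collections import defaultdict


def normalize_title(title: str) -> str:
    """Reduce a title to a comparison key (hypothetical helper)."""
    # Decompose accented characters, then drop the combining marks.
    text = unicodedata.normalize("NFKD", title)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Remove BibTeX protective braces and ignore case.
    text = text.replace("{", "").replace("}", "").lower()
    # Collapse every run of punctuation or whitespace into one space.
    return re.sub(r"[^a-z0-9]+", " ", text).strip()


def group_duplicates(entries: list[dict]) -> list[list[dict]]:
    """Group entries that share a normalized title (illustrative only)."""
    groups: dict[str, list[dict]] = defaultdict(list)
    for entry in entries:
        groups[normalize_title(entry.get("title", ""))].append(entry)
    return list(groups.values())


# Invented sample data: the first two entries describe the same work.
sample = [
    {"ID": "smith2020", "title": "{Deep} Learning: A Review"},
    {"ID": "Smith_2020a", "title": "Deep learning - a review."},
    {"ID": "jones2019", "title": "Graph Theory"},
]
for group in group_duplicates(sample):
    print([entry["ID"] for entry in group])
```

Title normalization alone catches only near-identical duplicates. A real workflow would likely combine it with stronger evidence such as a shared DOI.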
The bib-ami Philosophy: From Cleaning to Quality Gating
The objective of bib-ami is to implement a robust, automated
workflow that transforms a collection of untrusted BibTeX files into a
single, clean, and high-quality bibliography. This process is
guided by a clear definition of what constitutes a “golden” record,
and a set of non-negotiable principles.
The Desired Outcome
The final, curated bibliography produced by the tool must meet the following five criteria:
- Verified
The final set of references is reliable, with its core metadata validated against authoritative sources (e.g., CrossRef, DataCite) and its DOI confirmed as active and resolvable.
- Complete
The process does not discard any legitimate references. Instead, every entry is assigned a quality level, and entries that don’t meet the user’s standard are separated for review.
- Auditable
The final status and assigned Quality Level of every single input record is traceable, with a clear comment in the output file explaining why it was included, merged, or flagged as suspect.
- Enriched
Each trusted reference is populated with a complete set of verifiable metadata fields available from the authoritative source, including full author lists, publication years, and ISBNs.
- High-Fidelity
User-generated content from the source files (e.g., note, file) is correctly preserved and merged into the final, clean records.
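The Enriched and High-Fidelity criteria can be pictured as a merge in which authoritative metadata replaces descriptive fields while user fields are carried over from the local entries. The sketch below is one possible way to do this, not the tool's implementation: `merge_records`, the `USER_FIELDS` tuple, the "; " separator for combining distinct notes, and the sample DOI are all assumptions.

```python
from __future__ import annotations

# Assumption: the user-generated fields named in the text above.
USER_FIELDS = ("note", "file")


def merge_records(authoritative: dict, local_entries: list[dict]) -> dict:
    """Build one clean record from authoritative data plus user fields."""
    # Descriptive metadata comes from the authoritative source only.
    merged = dict(authoritative)
    for field in USER_FIELDS:
        values: list[str] = []
        for entry in local_entries:
            value = entry.get(field, "").strip()
            # Keep each distinct user value once, in first-seen order.
            if value and value not in values:
                values.append(value)
        if values:
            # Assumption: distinct values are joined rather than dropped.
            merged[field] = "; ".join(values)
    return merged


# Invented sample data; the DOI is a placeholder, not a real identifier.
authoritative = {
    "title": "Deep Learning: A Review",
    "year": "2020",
    "doi": "10.1234/example",
}
local_entries = [
    {"title": "Deep lerning: a review", "note": "Read chapter 2"},
    {"title": "Deep Learning", "note": "Read chapter 2", "file": "smith.pdf"},
]
print(merge_records(authoritative, local_entries))
```

In this sketch the misspelled local title is discarded in favor of the authoritative one, while the note and file path survive the merge.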
Guiding Principles
To achieve this outcome, the workflow is built on four core principles:
Authoritative Data Establishes Ground Truth: Data from an external, authoritative source like CrossRef is treated as the ground truth. Data from a local source file is treated as an unverified claim. The workflow must always prioritize authoritative data when defining a record’s identity.
Quality is Quantified and Layered: bib-ami does not view entries as simply “good” or “bad.” It assigns each entry a specific Quality Level (e.g., Verified, Confirmed, Accepted) based on the cumulative evidence it gathers. An entry with a resolved DOI is of higher quality than one that only passed a metadata check.
Filtering is a Configurable, User-Driven Choice: The tool’s job is to analyze and score every entry. The user’s job is to decide what level of quality is acceptable. By default, bib-ami is inclusive, labeling everything. The user can then create a config.json file to define their own rules, such as "min_quality_for_final_bib": "Verified", to apply stricter filtering.
Preserve User Intent: The system must distinguish between descriptive metadata that can be authoritatively corrected (e.g., a misspelled title) and intentional user-generated content (e.g., note, file). The latter represents work done by the user and must be preserved and correctly reconciled.
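The layered quality levels and the user-driven threshold could be modeled roughly as follows. This is a sketch under stated assumptions: only the level names and the "min_quality_for_final_bib" key come from the text above, while the `QualityLevel` enum, its numeric ordering (Verified above Confirmed above Accepted), the `split_by_quality` function, and the "quality" key on each entry are hypothetical.

```python
from __future__ import annotations

import json
from enum import IntEnum


class QualityLevel(IntEnum):
    # Level names come from the text; the ordering is an assumption.
    ACCEPTED = 1
    CONFIRMED = 2
    VERIFIED = 3


def split_by_quality(entries: list[dict], config: dict) -> tuple[list[dict], list[dict]]:
    """Separate entries into the final bibliography and a review pile."""
    name = config.get("min_quality_for_final_bib")
    # With no threshold configured, stay inclusive: every entry passes.
    threshold = QualityLevel[name.upper()] if name else min(QualityLevel)
    final: list[dict] = []
    review: list[dict] = []
    for entry in entries:
        # Entries below the threshold are set aside, never discarded.
        (final if entry["quality"] >= threshold else review).append(entry)
    return final, review


# The config fragment quoted above, as it might appear in config.json.
config = json.loads('{"min_quality_for_final_bib": "Verified"}')

# Invented sample entries, already scored by an earlier step.
entries = [
    {"ID": "smith2020", "quality": QualityLevel.VERIFIED},
    {"ID": "jones2019", "quality": QualityLevel.CONFIRMED},
    {"ID": "lee2018", "quality": QualityLevel.ACCEPTED},
]
final, review = split_by_quality(entries, config)
print("final:", [e["ID"] for e in final])
print("review:", [e["ID"] for e in review])
```

Keeping scoring and filtering separate means the same scored entries can be split again under a different threshold without re-running validation.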