The Problem & Solution

This section explains the motivation behind the bib-ami tool. It details the common challenges faced when managing academic references and outlines the principled approach the tool takes to solve them.

The Challenge of Messy Bibliographies

Managing bibliographic references is a foundational task for researchers and academics. However, BibTeX files, especially those aggregated from multiple sources or generated by automated tools, frequently suffer from data quality issues that require significant manual intervention. These problems include:

  • Duplicate Entries: The same publication may appear multiple times with minor variations in title, author formatting, or citation key.

  • Missing or Invalid DOIs: A Digital Object Identifier (DOI) is the standard persistent identifier that makes a reference resolvable and verifiable. Entries often lack a DOI or contain one that is incorrect or outdated.

  • Inconsistent Metadata: Key fields like journal names, publication years, or author lists may be incomplete, inconsistently formatted, or contain errors.

  • High Manual Effort: The cumulative effect of these issues is that researchers must spend a considerable amount of time cleaning their bibliographies, detracting from their primary work of research and writing.

Taken together, these problems amount to a classic data-quality task known as entity resolution: identifying and merging multiple records that refer to the same real-world entity, in this case a single publication.
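To make the idea of entity resolution concrete, here is a minimal sketch of how near-duplicate entries could be grouped. It is not bib-ami's actual matching logic: the function names, the dict-based entry shape, and the choice of "normalized title plus year" as the match key are all assumptions made for illustration.

```python
import re
import unicodedata
from collections import defaultdict


def normalize_title(title: str) -> str:
    """Reduce a title to a comparison key (hypothetical helper)."""
    # Strip accents so that "é" and "e" compare equal.
    text = unicodedata.normalize("NFKD", title)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Drop BibTeX case-protection braces, then lowercase.
    text = text.replace("{", "").replace("}", "").lower()
    # Collapse every run of punctuation or whitespace into one space.
    return re.sub(r"[^a-z0-9]+", " ", text).strip()


def group_duplicates(entries: list[dict]) -> list[list[dict]]:
    """Group entries that share a normalized title and year.

    The match key is an assumption for this sketch; a real resolver
    would also compare DOIs and authors.
    """
    groups = defaultdict(list)
    for entry in entries:
        key = (normalize_title(entry.get("title", "")), entry.get("year", ""))
        groups[key].append(entry)
    return list(groups.values())


entries = [
    {"ID": "smith2020", "title": "{Deep} Learning: A Survey", "year": "2020"},
    {"ID": "Smith20", "title": "Deep learning - a survey.", "year": "2020"},
    {"ID": "jones2019", "title": "Graph Methods", "year": "2019"},
]
groups = group_duplicates(entries)
```

In this made-up input, the first two entries differ in braces, capitalization, and punctuation but normalize to the same key, so they land in one group, while the third stays on its own.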

The bib-ami Philosophy: From Cleaning to Quality Gating

The objective of bib-ami is to implement a robust, automated workflow that transforms a collection of untrusted BibTeX files into a single, clean, high-quality bibliography. This process is guided by a clear definition of what constitutes a “golden” record and by a set of non-negotiable principles.

The Desired Outcome

The final, curated bibliography produced by the tool must meet the following five criteria:

Verified

The final set of references is reliable, with its core metadata validated against authoritative sources (e.g., CrossRef, DataCite) and its DOI confirmed as active and resolvable.

Complete

The process does not discard any legitimate references. Instead, every entry is assigned a quality level, and entries that don’t meet the user’s standard are separated for review.

Auditable

The final status and assigned Quality Level of every input record are traceable, with a clear comment in the output file explaining why the record was included, merged, or flagged as suspect.

Enriched

Each trusted reference is populated with the full set of verifiable metadata fields that the authoritative source provides, such as full author lists, publication years, and ISBNs.

High-Fidelity

User-generated content from the source files (e.g., note, file) is correctly preserved and merged into the final, clean records.
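The Enriched and High-Fidelity criteria can be pictured together as one merge step: authoritative metadata supplies the descriptive fields, and user-generated fields from every source entry are carried over. The sketch below is an illustration under stated assumptions, not the tool's implementation. The function name, the "; " join rule, and the sample DOI are invented; only the note and file field names come from the text above.

```python
# Fields treated as user-generated; note and file are the examples named above.
USER_FIELDS = ("note", "file")


def build_golden_record(authoritative: dict, sources: list[dict]) -> dict:
    """Combine authoritative metadata with user content (hypothetical helper)."""
    # Descriptive fields come from the authoritative source only.
    record = dict(authoritative)
    for field in USER_FIELDS:
        # Collect distinct non-empty values in first-seen order.
        values = []
        for source in sources:
            value = source.get(field, "").strip()
            if value and value not in values:
                values.append(value)
        if values:
            # Joining with "; " is an assumed reconciliation rule.
            record[field] = "; ".join(values)
    return record


# Made-up example data; the DOI is a placeholder, not a real identifier.
authoritative = {"title": "Deep Learning: A Survey", "year": "2020",
                 "doi": "10.1000/example"}
sources = [
    {"title": "Deep lerning: a survey", "note": "read ch. 3", "file": "a.pdf"},
    {"title": "Deep Learning", "note": "read ch. 3"},
    {"title": "Deep Learning: A Survey", "note": "cite in intro"},
]
golden = build_golden_record(authoritative, sources)
```

The misspelled local title is replaced by the authoritative one, while both distinct notes and the file path survive in the merged record.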

Guiding Principles

To achieve this outcome, the workflow is built on four core principles:

  1. Authoritative Data Establishes Ground Truth: Data from an external, authoritative source like CrossRef is treated as the ground truth. Data from a local source file is treated as an unverified claim. The workflow must always prioritize authoritative data when defining a record’s identity.

  2. Quality is Quantified and Layered: bib-ami does not view entries as simply “good” or “bad.” It assigns each entry a specific Quality Level (e.g., Verified, Confirmed, Accepted) based on the cumulative evidence it gathers. An entry with a resolved DOI is of higher quality than one that only passed a metadata check.

  3. Filtering is a Configurable, User-Driven Choice: The tool’s job is to analyze and score every entry. The user’s job is to decide what level of quality is acceptable. By default, bib-ami is inclusive, labeling everything. The user can then create a config.json file to define their own rules, such as "min_quality_for_final_bib": "Verified", to apply stricter filtering.

  4. Preserve User Intent: The system must distinguish between descriptive metadata that can be authoritatively corrected (e.g., a misspelled title) and intentional user-generated content (e.g., note, file). The latter represents work done by the user and must be preserved and correctly reconciled.
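Principles 2 and 3 can be sketched as a small scoring and gating routine. Everything here beyond the level names and the "min_quality_for_final_bib" key is an assumption: the Suspect level, the numeric ordering, the mapping from evidence to levels, and the function names are hypothetical and may not match bib-ami's internals.

```python
import json
from enum import IntEnum


class Quality(IntEnum):
    # Ordering and the SUSPECT level are assumptions for illustration.
    SUSPECT = 0
    ACCEPTED = 1
    CONFIRMED = 2
    VERIFIED = 3


def assess(doi_resolved: bool, metadata_matched: bool) -> Quality:
    """Map gathered evidence to a level (hypothetical mapping)."""
    if doi_resolved:
        return Quality.VERIFIED
    if metadata_matched:
        return Quality.CONFIRMED
    return Quality.SUSPECT


def split_by_quality(entries: list[dict], config: dict):
    """Separate entries into (final, review); nothing is discarded."""
    name = config.get("min_quality_for_final_bib")
    # With no rule configured, every labeled entry is kept (inclusive default).
    threshold = Quality[name.upper()] if name else min(Quality)
    final = [e for e in entries if e["quality"] >= threshold]
    review = [e for e in entries if e["quality"] < threshold]
    return final, review


entries = [
    {"ID": "a", "quality": assess(doi_resolved=True, metadata_matched=True)},
    {"ID": "b", "quality": assess(doi_resolved=False, metadata_matched=True)},
    {"ID": "c", "quality": assess(doi_resolved=False, metadata_matched=False)},
]
config = json.loads('{"min_quality_for_final_bib": "Verified"}')
final, review = split_by_quality(entries, config)
```

With the strict rule above, only entry "a" reaches the final list and the other two are set aside for review. Passing an empty configuration keeps all three, which mirrors the inclusive default described in principle 3.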