"SCIENTIA publication-worthiness and SSOT unification (research 2026)"

SCIENTIA publication-worthiness and SSOT unification (research 2026)

This document captures the current research-plan deliverables for improving publication-worthiness generation and detection, while unifying single-source-of-truth (SSOT) metadata across legacy and modern publication pathways.

Scope:

  • AI and software engineering publication requirements,
  • Canonical metadata SSOT for transformation into multiple venue formats,
  • Automation boundaries that preserve scientific and ethical accountability.

It is a research and design artifact, not an implementation blueprint.

Baseline assumptions

  • Canonical publication lifecycle remains manifest-centered (publication_manifests, publication_approvals, scholarly_submissions, publication_status_events).
  • Existing worthiness/preflight controls remain authoritative until replaced by versioned contracts.
  • External bibliometric and policy APIs remain assistive, not sole publication gates.

Deliverable 1: standards-to-signals matrix

The matrix maps external standards into machine-checkable Vox signals.

| Standard source | Requirement class | Signal class | Vox check today | Gap | Proposed machine check |
| --- | --- | --- | --- | --- | --- |
| COPE/ICMJE/Nature/Elsevier/JAMA/BMJ/IEEE | AI-use disclosure, no AI authorship | hard_gate + metadata_required | Partial policy/preflight fields | Granularity by tool/version/scope | Add ai_disclosure_profile block with policy-profile validation |
| Crossref/DataCite | DOI-grade metadata completeness | metadata_required | Partial metadata mapper coverage | Inconsistent normalized field set | Add canonical metadata completeness score + adapter-specific required-field checks |
| JATS/legacy journal workflows | Structured article/package interchange | metadata_recommended + diagnostic | Limited package scaffolding | No unified JATS readiness profile | Add jats_export_readiness signal and profile checks |
| TMLR/JMLR/AAAI/NeurIPS reproducibility practices | Evidence support and reproducibility | soft_gate + diagnostic | Existing evidence/preflight scoring | Weak variance/seed/ablation specificity | Add seed_count_transparency, uncertainty_reporting, ablation_adequacy signals |
| arXiv policies | Source package and moderation constraints | hard_gate + metadata_required | arXiv-assist and handoff contract | No full format preflight profile | Add arxiv_format_profile and package static checks |
| ACM/EMSE open science artifact norms | Replication package quality | soft_gate + diagnostic | Partial through evidence fields | No explicit artifact quality taxonomy | Add artifact_replay_bundle_quality score and reason codes |
| FAIR/RSMD principles | Rich, reusable metadata | metadata_recommended | Some structured fields | No explicit FAIR coverage metric | Add fair_metadata_coverage metric as non-blocking diagnostic |
| Integrity research on fabricated references | Citation verification | hard_gate | Existing citation checks are partial | Confidence and provenance under-specified | Add citation_verification_confidence and unresolved_reference_count hard-fail thresholds |
| Contamination/benchmark leakage research | Evaluation integrity | soft_gate + diagnostic | Partial benchmark evidence controls | No contamination-risk signal | Add contamination_risk_flag with traceable rationale |
| Peer-review ethics guidance | Human accountability boundaries | never_automate ledger | Existing boundary matrix | Needs explicit binding to system actions | Add action-level boundary policy IDs in runtime reports |

Normalized signal catalog

  • hard_gate: mandatory pass before publication submission attempt.
  • soft_gate: failure does not block by default, but raises next_actions.
  • diagnostic: explainability signal for operators and reviewers.
  • metadata_required: route-specific required metadata.
  • metadata_recommended: quality-improving, non-blocking metadata.
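A minimal sketch of how these five signal classes could drive a submission decision. The `Signal` container and `evaluate_gates` helper are illustrative, not an existing Vox API; the assumption that `metadata_required` blocks its route follows the catalog above.

```python
from dataclasses import dataclass

@dataclass
class Signal:
    name: str
    signal_class: str  # hard_gate | soft_gate | diagnostic | metadata_required | metadata_recommended
    passed: bool

def evaluate_gates(signals: list[Signal]) -> dict:
    # hard_gate and metadata_required failures block the submission attempt
    blocked = [s.name for s in signals
               if s.signal_class in ("hard_gate", "metadata_required") and not s.passed]
    # soft_gate failures surface as next_actions, not blocks
    next_actions = [s.name for s in signals
                    if s.signal_class == "soft_gate" and not s.passed]
    # diagnostic and metadata_recommended failures are explainability output only
    diagnostics = [s.name for s in signals
                   if s.signal_class in ("diagnostic", "metadata_recommended") and not s.passed]
    return {"can_submit": not blocked, "blocking": blocked,
            "next_actions": next_actions, "diagnostics": diagnostics}
```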

Deliverable 2: canonical SSOT metadata graph proposal

Canonical graph objective

Use one manifest-centered metadata graph (metadata_json.scientific_publication and adjacent blocks) as the single authoring source, then compile outward to route-specific payloads.

```mermaid
flowchart LR
  canonicalManifest[CanonicalPublicationManifest] --> coreMetadata[CoreMetadataGraph]
  coreMetadata --> worthinessView[WorthinessAndPreflightView]
  coreMetadata --> crossrefMap[CrossrefMapper]
  coreMetadata --> dataciteMap[DataCiteMapper]
  coreMetadata --> zenodoMap[ZenodoMapper]
  coreMetadata --> arxivMap[arXivHandoffMapper]
  coreMetadata --> openreviewMap[OpenReviewMapper]
  coreMetadata --> socialMap[SyndicationMapper]
```

Proposed canonical graph domains

  1. identity
    • title, abstract, keywords, domain tags, venue target profile.
  2. contributors
    • authors array, ORCID, affiliations (ROR), contributor roles.
  3. provenance
    • manifest digest, evidence pack digest, repository/commit context, run IDs.
  4. evidence
    • claim-evidence links, benchmark pair summary, seed/variance report, contradiction summary.
  5. policy
    • AI-use disclosure, ethics/broader-impact statements, anonymization attestation.
  6. rights_and_funding
    • license, funding references, COI declaration, access rights.
  7. distribution
    • route intents (journal/preprint/repository/social), required profile variants.
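The seven domains above can be sketched as one typed container. The class and its field comments mirror the list; the shape itself is an illustrative assumption, not the current manifest schema.

```python
from dataclasses import dataclass, field

@dataclass
class CanonicalPublicationGraph:
    identity: dict = field(default_factory=dict)            # title, abstract, keywords, venue target profile
    contributors: dict = field(default_factory=dict)        # authors, ORCID, ROR affiliations, roles
    provenance: dict = field(default_factory=dict)          # manifest digest, commit context, run IDs
    evidence: dict = field(default_factory=dict)            # claim-evidence links, benchmark summary
    policy: dict = field(default_factory=dict)              # AI-use disclosure, ethics statements
    rights_and_funding: dict = field(default_factory=dict)  # license, funding, COI, access rights
    distribution: dict = field(default_factory=dict)        # route intents, required profile variants
```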

Adapter crosswalk policy

  • Adapters do not own canonical truth.
  • Adapters only transform from canonical graph into target payload shape.
  • Required fields per route are checked twice:
    • in canonical preflight,
    • in adapter pre-submit validation.
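A sketch of the two-stage required-field check: once against the canonical graph at preflight, once inside the adapter before submit. The field sets, `missing_fields` helper, and Crossref payload shape are illustrative assumptions.

```python
# Illustrative required-field sets; real route profiles would be versioned contracts.
CANONICAL_REQUIRED = {"identity.title", "contributors.authors", "rights_and_funding.license"}
CROSSREF_REQUIRED = {"identity.title", "identity.abstract", "contributors.authors"}

def get_path(graph: dict, dotted: str):
    node = graph
    for key in dotted.split("."):
        node = node.get(key) if isinstance(node, dict) else None
        if node is None:
            return None
    return node

def missing_fields(graph: dict, required: set) -> set:
    # Used both in canonical preflight and in adapter pre-submit validation.
    return {path for path in required if not get_path(graph, path)}

def crossref_adapter(graph: dict) -> dict:
    # The adapter only transforms; it owns no canonical truth.
    gaps = missing_fields(graph, CROSSREF_REQUIRED)
    if gaps:
        raise ValueError(f"adapter pre-submit validation failed: {sorted(gaps)}")
    return {"title": graph["identity"]["title"],
            "abstract": graph["identity"]["abstract"],
            "contributors": graph["contributors"]["authors"]}
```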

Deliverable 3: worthiness detection-quality research protocol

Objective

Improve publication-worthiness triage precision/recall without converting uncertain external signals into brittle hard gates.

Candidate signals to evaluate

  • seed_count_transparency
  • uncertainty_reporting
  • ablation_adequacy
  • contamination_risk_flag
  • citation_verification_confidence
  • claim_evidence_density
  • fair_metadata_coverage

Experimental design (offline research stage)

  1. Build stratified evaluation set:
    • accepted-quality exemplars,
    • borderline submissions requiring evidence,
    • known low-integrity patterns (fabricated citations, weak evidence links).
  2. Replay current worthiness scoring as baseline.
  3. Add candidate signals incrementally and evaluate:
    • precision/recall/F1 for Publish vs AskForEvidence vs Abstain,
    • false-positive rate for hard-gate triggers,
    • explanation quality via operator audit sampling.
  4. Calibrate thresholds by route profile (journal, preprint, repository, social).
  5. Keep external bibliometric signals assistive unless confidence and stability meet governance thresholds.
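Step 3's per-class precision/recall/F1 over the three-way triage decision can be computed with a small helper. The decision labels come from the design above; the function and synthetic examples are illustrative.

```python
def per_class_prf(gold: list, pred: list, label: str):
    # One-vs-rest precision/recall/F1 for a single decision class
    # (Publish, AskForEvidence, or Abstain).
    tp = sum(1 for g, p in zip(gold, pred) if g == p == label)
    fp = sum(1 for g, p in zip(gold, pred) if p == label and g != label)
    fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

Running this per route profile (journal, preprint, repository, social) gives the stratified view the calibration step needs.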

Calibration guardrails

  • Never hard-fail solely on one external API datum.
  • Require provenance stamp (source, retrieved_at, confidence) for external-derived signals.
  • Require periodic drift checks for API field changes and coverage drops.
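The provenance stamp and staleness check can be sketched as follows; the field names match the snapshot payload concept below, but the seven-day drift window and both helpers are assumptions, not current contracts.

```python
import time

def stamp(source: str, confidence: float, notes: str = "") -> dict:
    # Every external-derived signal carries source, retrieval time, and confidence.
    return {"source": source, "retrieved_at": int(time.time()),
            "confidence": confidence, "notes": notes}

def is_stale(provenance: dict, max_age_seconds: int = 7 * 24 * 3600) -> bool:
    # A periodic drift check can re-fetch anything older than the window,
    # guarding against API field changes and coverage drops.
    return time.time() - provenance["retrieved_at"] > max_age_seconds
```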

Deliverable 4: Codex persistence blueprint (research snapshot model)

Persistence principles

  • Store research snapshots as additive, typed payloads linked to publication_id.
  • Preserve immutable audit trails through status events for each recomputation.
  • Keep backward compatibility with existing manifest lifecycle.

Proposed persisted artifact shape (concept)

```json
{
  "version": "v1-research-snapshot",
  "publication_id": "pub_...",
  "policy_profile": "journal_double_blind",
  "signals": {
    "hard_gate": {},
    "soft_gate": {},
    "diagnostic": {}
  },
  "coverage": {
    "metadata_required": 0.0,
    "metadata_recommended": 0.0
  },
  "citation_verification": {
    "verified_count": 0,
    "unresolved_count": 0,
    "confidence": 0.0
  },
  "external_signal_provenance": [
    {
      "source": "openalex",
      "retrieved_at": 0,
      "confidence": 0.0,
      "notes": ""
    }
  ]
}
```

Event semantics proposal

  • Add status-event detail payload variants:
    • worthiness_snapshot_computed
    • worthiness_snapshot_recomputed
    • worthiness_snapshot_superseded
  • Include previous snapshot hash in recompute events for chain-of-custody.
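The hash chain for recompute events can be sketched as below. The event names come from the proposal above; the canonical-JSON hashing scheme and helper names are assumptions.

```python
import hashlib
import json

def snapshot_hash(snapshot: dict) -> str:
    # Canonical serialization (sorted keys, no whitespace) makes the hash
    # independent of dict ordering.
    canonical = json.dumps(snapshot, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()

def recompute_event(new_snapshot: dict, previous_hash=None) -> dict:
    # The previous snapshot hash gives recompute events chain-of-custody.
    return {
        "event": ("worthiness_snapshot_recomputed" if previous_hash
                  else "worthiness_snapshot_computed"),
        "snapshot_hash": snapshot_hash(new_snapshot),
        "previous_snapshot_hash": previous_hash,  # None for the first computation
    }
```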

Read-model expectations (CLI/MCP)

  • publication-status and MCP lifecycle tools should expose:
    • latest snapshot summary,
    • delta from previous snapshot,
    • unresolved hard/soft gate reasons,
    • source provenance completeness.
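The "delta from previous snapshot" item could be computed as a simple set difference over gate reasons; the `unresolved_hard_gates` summary key is hypothetical, not part of the current payload.

```python
def snapshot_delta(prev: dict, curr: dict) -> dict:
    prev_gates = set(prev.get("unresolved_hard_gates", []))
    curr_gates = set(curr.get("unresolved_hard_gates", []))
    return {
        "resolved": sorted(prev_gates - curr_gates),    # gates cleared since last snapshot
        "introduced": sorted(curr_gates - prev_gates),  # newly failing gates
    }
```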

Deliverable 5: automation boundaries ledger (explicit)

| Workflow action | Automate | Assist | Never automate | Rationale |
| --- | --- | --- | --- | --- |
| Hashing, digests, evidence pack indexing | yes | n/a | no | deterministic and auditable |
| Metadata normalization and schema checks | yes | n/a | no | deterministic validation |
| Citation syntax, DOI shape, resolvability checks | yes | n/a | no | integrity hardening |
| Claim-evidence link extraction and scoring | yes | yes | no | machine supports triage, human validates interpretation |
| Novelty scoring and impact projection | no | yes | yes (autonomous final decision) | epistemic judgment remains human-accountable |
| Ethics/safety acceptance decision | no | yes | yes (autonomous acceptance) | policy/legal responsibility |
| Final manuscript framing and significance claim | no | yes | yes (autonomous authorship) | authorship accountability |
| Final submission action on external account-bound portals | no | yes | yes (unless explicit approved HITL control) | legal/account-level control |
| Venue policy profile recommendations | no | yes | no | advisory only |
| Reviewer-facing evidence summaries | yes | yes | no | structured aid with human verification |

Risks and research constraints

  • Policy drift risk: journal and publisher rules change faster than static docs.
  • Signal overfitting risk: venue-specific heuristics may fail cross-domain generalization.
  • API reliability risk: external metadata sparsity and schema drift reduce confidence.
  • Over-automation risk: scoring can be mistaken for scientific judgment.

Conversion criteria for implementation planning

Proceed to implementation planning only when all are true:

  1. Signal catalog approved (hard_gate, soft_gate, diagnostic, metadata classes).
  2. Canonical metadata graph ownership boundaries approved.
  3. Snapshot payload and event semantics accepted as backward-compatible.
  4. Boundary ledger accepted by governance owners for human-accountability controls.

External research anchors used in this cycle

  • TMLR/JMLR/AAAI/NeurIPS reproducibility and submission guidance.
  • COPE/ICMJE/Nature/Elsevier/arXiv/IEEE/BMJ/JAMA AI-use policies.
  • Crossref/DataCite/JATS/CFF/CodeMeta/ORCID/ROR metadata and interoperability surfaces.
  • FAIR/RSMD metadata principles.
  • Reproducibility and integrity literature on citation hallucination, contamination risk, and claim-evidence attribution.