Documentation

Help & User Manual

How to search PCYPCatDB, interpret quality tiers, and understand cheminformatics validation badges.

A. Quick Start

Step 1. Search from the homepage or the Reactions page

Enter an enzyme name, Latin species name, compound name, SMILES string, or PMID. Results open in the Reactions grid with live filters.

Step 2. Refine with sidebar filters (tier, taxonomy, CYP family)
[Illustration: sidebar with Data tier options (Gold, Silver), a Keywords field containing "taxol", and a reaction matrix showing 42 rows.]

Tier filters restrict the grid to export-quality subsets. Click any row to open the full evidence trail.

Step 3. Read badges and the ΔMass chip on the reaction detail page
[Illustration: reaction detail header with Gold and ΔMass ✓ badges; a central chip reading "+15.99 Da hydroxylation" sits between Substrate and Product.]

Gold = highest-confidence plant CYP record. ΔMass ✓ means RDKit computed a stoichiometrically consistent mass shift between substrate and product SMILES. The central chip shows the observed Δm and the inferred transformation rule.
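The check behind the ΔMass badge can be illustrated with a dependency-free sketch. PCYPCatDB computes the shift with RDKit from substrate and product SMILES; the version below instead works from molecular formulas so that it runs with the standard library only. The rule table, the 0.01 Da tolerance, the function names, and the example formulas are assumptions made for illustration, not the database's actual configuration.

```python
import re

# Monoisotopic masses (Da) of the most abundant isotope of each element.
MONOISOTOPIC = {
    "C": 12.0,
    "H": 1.00782503,
    "N": 14.00307401,
    "O": 15.99491462,
    "S": 31.97207100,
}

# Illustrative rule table (assumption): expected mass shift per transformation.
RULES = {
    "hydroxylation (+O)": MONOISOTOPIC["O"],
    "desaturation (-2H)": -2 * MONOISOTOPIC["H"],
}

# Hypothetical matching tolerance in Da.
TOLERANCE_DA = 0.01


def exact_mass(formula):
    """Monoisotopic mass of a simple formula such as 'C20H32O'."""
    total = 0.0
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += MONOISOTOPIC[element] * (int(count) if count else 1)
    return total


def classify_delta_mass(substrate, product):
    """Return (delta, rule name) or (delta, None) if no rule matches."""
    delta = exact_mass(product) - exact_mass(substrate)
    for name, expected in RULES.items():
        if abs(delta - expected) <= TOLERANCE_DA:
            return delta, name
    return delta, None


# A C20H32 substrate gaining one oxygen atom.
delta, rule = classify_delta_mass("C20H32", "C20H32O")
print(f"{delta:+.2f} Da -> {rule}")  # +15.99 Da -> hydroxylation (+O)
```

A pair whose shift matches no rule returns `None` for the rule name, which corresponds to the fail flag rather than the ✓ badge.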

B. Data Tiering Guide

Every reaction is assigned a data tier after LLM extraction, literature cross-checks, and RDKit gates. Tiers are frozen at each monthly release and documented in the pipeline report on the homepage.

Gold Standard

Zero-noise training subset: valid plant species, experimental or high-confidence pathway evidence, resolved SMILES for both substrate and product, and an RDKit ΔMass pass. Recommended for ML, retrosynthesis benchmarks, and cheminformatics pipelines. Download it as CSV from the Downloads page.

Silver

High-quality curated records that miss one Gold criterion, e.g. species pending salvage, partial structure resolution, or ΔMass-exempt mechanisms. They are still included in the full corpus export; use them when breadth matters more than chemical strictness.

Bronze

Literature-supported candidates with weaker gates: pathway inference, missing SMILES, or a ΔMass mismatch flagged for review. Valuable for hypothesis generation, but not suitable for automated learning without manual inspection.

Special (atypical P450)

Mechanistically distinct plant enzymes, chiefly CYP74 hydroperoxide lyases and related NADPH-independent cleavage chemistry. Mass-balance rules differ, so records carry a ΔMass exempt flag where appropriate. Filter with ?tier=Special in the reaction grid.
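The ?tier=Special filter can also be built programmatically, for example when generating links to a filtered grid. This is a minimal sketch: only the `tier` query parameter comes from the text above, while the base URL and the helper name are placeholders to replace with the real reaction grid address.

```python
from urllib.parse import urlencode

# Hypothetical base URL; substitute the real address of the reaction grid.
BASE_URL = "https://example.org/reactions"


def reaction_grid_url(tier, base_url=BASE_URL):
    """Return the reaction-grid URL filtered to a single data tier."""
    return f"{base_url}?{urlencode({'tier': tier})}"


special_url = reaction_grid_url("Special")
print(special_url)  # https://example.org/reactions?tier=Special
```

Using `urlencode` rather than string concatenation keeps the query string valid if a tier name ever contains spaces or other reserved characters.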

Quarantine set (isolated)

Deliberately excluded from curated corpus downloads: literature nomenclature errors, negative catalytic evidence, non-P450 pathway partners, species NLP traps, and PDF–text mismatches after manual review. Published transparently so text-mining benchmarks can avoid false positives. Browse quarantine reactions or download the quarantine CSV.
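Once a corpus CSV has been downloaded, selecting tiers takes only a few lines. The sketch below assumes a `tier` column and a `reaction_id` column; both names and the sample rows are hypothetical, so check the header row of the real export before reusing it.

```python
import csv
import io

# Stand-in for a downloaded full-corpus CSV. The column names and
# reaction IDs are invented for illustration.
SAMPLE_CSV = """reaction_id,tier
R0001,Gold
R0002,Silver
R0003,Bronze
R0004,Special
"""


def select_tiers(csv_text, wanted):
    """Return the rows whose tier column is one of the wanted tiers."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if row["tier"] in wanted]


# Strict subset for model training versus a broader subset for coverage.
strict = select_tiers(SAMPLE_CSV, {"Gold"})
broad = select_tiers(SAMPLE_CSV, {"Gold", "Silver"})
print([row["reaction_id"] for row in strict])  # ['R0001']
print([row["reaction_id"] for row in broad])   # ['R0001', 'R0002']
```

For a file on disk, pass an open file handle to `csv.DictReader` instead of wrapping a string in `io.StringIO`.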

C. Frequently Asked Questions

How often is PCYPCatDB updated?
We publish a versioned release on the 1st of each month, in line with the NAR continuous maintenance policy. See the pipeline chart on the homepage.
How was the data extracted?
Multi-agent LLM pipelines read PubMed-linked full text, extract reaction statements with chain-of-thought guardrails, enrich species from taxonomic context, resolve structures via PubChem/RDKit, and assign tiers using automated ΔMass and manual review gates.
What does ΔMass validation mean?
For standard oxidation/hydroxylation chemistry we compare exact-mass differences between substrate and product SMILES. A pass badge means the shift matches a known biotransformation rule within tolerance. Special mechanisms may be exempt; Bronze rows may show a fail flag for curator follow-up.
Can I download bulk data?
Yes. The Gold standard set, the full corpus (Gold + Silver + Bronze + Special), the quarantine log, and enzyme FASTA bundles are all on the Downloads page, version-stamped per release.
I found an error in a record. What should I do?
Please open a GitHub issue that includes the reaction ID or PMID. We triage reports monthly and apply fixes in the next release. You can also email us via the Contact page.
How should I cite PCYPCatDB?
Use the recommended citation on our Contact page and include the release version (e.g. v2026.06) for reproducibility.