Documentation
Help & User Manual
How to search PCYPCatDB, interpret quality tiers, and understand cheminformatics validation badges.
A. Quick Start
Enter enzyme names, Latin species, compound names, SMILES, or PMID. Results open in the Reactions grid with live filters.
Tier filters restrict the grid to export-quality subsets. Click any row to open the full evidence trail.
Gold = highest-confidence plant CYP record. ΔMass ✓ means RDKit computed a stoichiometrically consistent mass shift between substrate and product SMILES. The central chip shows the observed Δm and the inferred transformation rule.
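The ΔMass idea can be illustrated with a minimal, dependency-free sketch. It is not the PCYPCatDB implementation: the database derives masses from SMILES with RDKit, whereas this example takes molecular formulas as input so that it runs without extra packages. The rule table, the tolerance of 0.005 Da, and the function names are assumptions made for illustration only.

```python
import re

# Monoisotopic masses (Da) for the few elements this sketch supports.
MONOISOTOPIC = {
    "C": 12.0,
    "H": 1.0078250,
    "N": 14.0030740,
    "O": 15.9949146,
    "S": 31.9720707,
}

# Illustrative transformation rules (assumed; the manual does not list the real rule set).
RULES = {
    "hydroxylation/epoxidation (+O)": MONOISOTOPIC["O"],
    "dehydrogenation (-2H)": -2 * MONOISOTOPIC["H"],
    "demethylation (-CH2)": -(MONOISOTOPIC["C"] + 2 * MONOISOTOPIC["H"]),
}


def exact_mass(formula):
    """Sum monoisotopic masses for a simple formula such as 'C10H16O'."""
    mass = 0.0
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        mass += MONOISOTOPIC[element] * (int(count) if count else 1)
    return mass


def delta_mass_check(substrate, product, tolerance=0.005):
    """Return (observed delta, matching rule name or None)."""
    delta = exact_mass(product) - exact_mass(substrate)
    for rule, expected in RULES.items():
        if abs(delta - expected) <= tolerance:
            return delta, rule
    return delta, None


# Example: a C10H16 substrate gaining one oxygen matches the +O rule.
delta, rule = delta_mass_check("C10H16", "C10H16O")
print(f"observed delta = {delta:.4f} Da, rule = {rule}")
```

A result of `None` for the rule corresponds to a shift that matches no known transformation, which is the situation the manual describes as a ΔMass mismatch.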
B. Data Tiering Guide
Every reaction is assigned a data tier after LLM extraction, literature cross-checks, and RDKit gates. Tiers are frozen at each monthly release and documented in the pipeline report on our homepage.
Gold Standard
Zero-noise training subset: valid plant species, experimental or high-confidence pathway evidence, dual resolved SMILES, and RDKit ΔMass pass. Recommended for ML, retrosynthesis benchmarks, and cheminformatics pipelines. Download as CSV from Downloads.
Silver
High-quality curated records that miss one Gold criterion, e.g. species pending salvage, partial structure resolution, or ΔMass-exempt mechanisms. Still included in the full corpus export; use when breadth matters more than chemical strictness.
Bronze
Literature-supported candidates with weaker gates: pathway inference, missing SMILES, or ΔMass mismatch flagged for review. Valuable for hypothesis generation but not validated for automated learning without manual inspection.
Special (atypical P450)
Mechanistically distinct plant enzymes, chiefly CYP74 hydroperoxide lyases and related NADPH-independent cleavage chemistry. Mass-balance rules differ; records carry ΔMass exempt where appropriate.
Filter with ?tier=Special in the reaction grid.
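If you build filter links programmatically, the query string can be assembled as in the short sketch below. The base URL is a placeholder rather than the real PCYPCatDB address, and only the `tier=Special` value is documented in this manual; other values are assumed to follow the tier names.

```python
from urllib.parse import urlencode

# Placeholder address (assumption); substitute the actual reaction grid URL.
BASE_URL = "https://example.org/reactions"


def tier_url(tier, base_url=BASE_URL):
    """Append a ?tier=... query string to the reaction grid URL."""
    return f"{base_url}?{urlencode({'tier': tier})}"


print(tier_url("Special"))
```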
Quarantine set (isolated)
Deliberately excluded from curated corpus downloads: literature nomenclature errors, negative catalytic evidence, non-P450 pathway partners, species NLP traps, and PDF–text mismatches after manual review. Published transparently so text-mining benchmarks can avoid false positives. Browse quarantine reactions or download the quarantine CSV.
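The tier definitions above can be summarised as a rough decision sketch. This is a simplification for orientation, not the curation pipeline: the field names are hypothetical, and counting passed Gold criteria to separate Silver from Bronze is an assumption that ignores manual review.

```python
from dataclasses import dataclass


@dataclass
class Record:
    # Hypothetical field names; they mirror the four Gold criteria in this guide.
    valid_plant_species: bool
    strong_evidence: bool      # experimental or high-confidence pathway evidence
    dual_smiles: bool          # substrate and product SMILES both resolved
    delta_mass_pass: bool      # RDKit ΔMass gate passed
    atypical_mechanism: bool = False  # e.g. CYP74 chemistry
    quarantined: bool = False         # excluded after manual review


def assign_tier(record):
    """Return a tier name using a simplified reading of the guide."""
    if record.quarantined:
        return "Quarantine"
    if record.atypical_mechanism:
        return "Special"
    passed = sum([
        record.valid_plant_species,
        record.strong_evidence,
        record.dual_smiles,
        record.delta_mass_pass,
    ])
    if passed == 4:
        return "Gold"
    if passed == 3:
        return "Silver"  # misses exactly one Gold criterion
    return "Bronze"      # weaker gates; needs manual inspection


print(assign_tier(Record(True, True, True, True)))
```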
C. Frequently Asked Questions
- How often is PCYPCatDB updated?
- We publish a versioned release on the 1st of each month, aligned with NAR continuous maintenance policy. See the pipeline chart on the homepage.
- How was the data extracted?
- Multi-agent LLM pipelines read PubMed-linked full text, extract reaction statements with chain-of-thought guardrails, enrich species from taxonomic context, resolve structures via PubChem/RDKit, and assign tiers using automated ΔMass and manual review gates.
- What does ΔMass validation mean?
- For standard oxidation/hydroxylation chemistry we compare exact-mass differences between substrate and product SMILES. A pass badge means the shift matches a known biotransformation rule within tolerance. Special mechanisms may be exempt; Bronze rows may show a fail flag for curator follow-up.
- Can I download bulk data?
- Yes — Gold standard, full corpus (Gold+Silver+Bronze+Special), quarantine log, and enzyme FASTA bundles are on the Downloads page, version-stamped per release.
- I found an error in a record. What should I do?
- Please open a GitHub Issue with the reaction ID or PMID. We triage reports monthly and update the next release. You can also email us via the Contact page.
- How should I cite PCYPCatDB?
- Use the recommended citation on our Contact page and include the release version (e.g. v2026.06) for reproducibility.
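Once you have a bulk export, you may want to keep only certain tiers before analysis. The sketch below shows one way to do that with the Python standard library. The column names `reaction_id` and `tier` and the three sample rows are made-up placeholders; check the header of the actual CSV from the Downloads page before reusing this.

```python
import csv
import io


def rows_for_tiers(csv_text, tiers, tier_column="tier"):
    """Return the rows whose tier column is in the requested set."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if row[tier_column] in tiers]


# Placeholder data for illustration only, not real PCYPCatDB records.
SAMPLE = "reaction_id,tier\nR1,Gold\nR2,Bronze\nR3,Silver\n"

selected = rows_for_tiers(SAMPLE, {"Gold", "Silver"})
print([row["reaction_id"] for row in selected])
```

For a file on disk, read its contents with `open(path, newline="", encoding="utf-8").read()` and pass the text to `rows_for_tiers`.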