🏆 Gold Standard
The Gold Standard Set
Zero-noise training corpus: every reaction has a valid plant species, experimental evidence, SMILES for both substrate and product, and a passing RDKit ΔMass check.
Built for AI engineers who need chemically consistent reaction triplets (enzyme, substrate, product).
811 reactions
Included fields (17)
enzyme_name, organism, cyp_family, uniprot_id, genbank_id, ncbi_taxid, substrate_name, substrate_smiles, product_name, product_smiles, mass_delta_da, mass_check, mass_rule, reaction_type, evidence_type, pubmed_id, doi
Download CSV ↓
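After downloading, a quick check can confirm that the file matches the 17 fields listed above and that every row carries both SMILES strings. The following is a minimal sketch using only Python's standard library; the function name `check_gold_columns` and the in-memory sample row are hypothetical placeholders, not records from the set.

```python
import csv
import io

# The 17 field names listed for the Gold Standard Set.
GOLD_FIELDS = [
    "enzyme_name", "organism", "cyp_family", "uniprot_id", "genbank_id",
    "ncbi_taxid", "substrate_name", "substrate_smiles", "product_name",
    "product_smiles", "mass_delta_da", "mass_check", "mass_rule",
    "reaction_type", "evidence_type", "pubmed_id", "doi",
]


def check_gold_columns(handle):
    """Return (missing_columns, incomplete_rows) for an open CSV file object.

    incomplete_rows holds the 1-based index of each data row that lacks a
    substrate or product SMILES string.
    """
    reader = csv.DictReader(handle)
    header = reader.fieldnames or []
    missing = [name for name in GOLD_FIELDS if name not in header]
    incomplete = []
    for index, row in enumerate(reader, start=1):
        substrate = (row.get("substrate_smiles") or "").strip()
        product = (row.get("product_smiles") or "").strip()
        if not substrate or not product:
            incomplete.append(index)
    return missing, incomplete


# Placeholder data standing in for the downloaded file (not a real record).
sample = io.StringIO(
    ",".join(GOLD_FIELDS) + "\n"
    + ",".join(["PLACEHOLDER"] * len(GOLD_FIELDS)) + "\n"
)
missing, incomplete = check_gold_columns(sample)
print("missing columns:", missing)
print("rows lacking a SMILES:", incomplete)
```

With a real download, the `io.StringIO` object would be replaced by `open(path, newline="", encoding="utf-8")`, where `path` is wherever the CSV was saved.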
📚 Full Corpus
The Full Curated Corpus
Gold, Silver, Bronze, and Special tiers — every reaction that passed automated and manual review,
excluding the Quarantine set. Includes tier labels, structure status, and literature provenance.
2239 reactions
Included fields (25)
data_tier, display_status, enzyme_name, organism, cyp_family, taxonomy_family, uniprot_id, genbank_id, ncbi_taxid, substrate_name, substrate_smiles, substrate_structure_status, product_name, product_smiles, product_structure_status, mass_delta_da, mass_check, mass_rule, reaction_type, evidence_type, confidence, catalytic_mechanism, literature_title, pubmed_id, doi
Download CSV ↓
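Because the Full Corpus mixes tiers, most uses start by counting or filtering on `data_tier`. The sketch below shows one way to do that with the standard library. It assumes the column holds labels such as "Gold" and "Silver" as named in the description; the exact spelling in the file is not confirmed here, so the filter compares case-insensitively. The function names and the sample rows are hypothetical.

```python
import csv
import io
from collections import Counter


def tier_counts(handle):
    """Count rows per distinct value of the data_tier column."""
    return Counter(row["data_tier"].strip() for row in csv.DictReader(handle))


def select_tiers(handle, wanted):
    """Yield rows whose data_tier matches one of `wanted`, ignoring case."""
    wanted_lower = {name.lower() for name in wanted}
    for row in csv.DictReader(handle):
        if row["data_tier"].strip().lower() in wanted_lower:
            yield row


# Placeholder rows standing in for the downloaded file (not real records).
sample_text = (
    "data_tier,enzyme_name\n"
    "Gold,ENZYME_A\n"
    "Silver,ENZYME_B\n"
    "Gold,ENZYME_C\n"
)
counts = tier_counts(io.StringIO(sample_text))
gold_rows = list(select_tiers(io.StringIO(sample_text), ["gold"]))
print(dict(counts))
print([row["enzyme_name"] for row in gold_rows])
```

Running `tier_counts` first on the real file reveals the actual tier labels before any filter is hard-coded.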
☠️ Quarantine set
Quarantine set (CSV)
Deliberately rejected rows: literature nomenclature errors, negative catalytic evidence, non-P450 partners,
and species NLP traps. Use this file to avoid false positives in text-mining benchmarks.
251 total
1 literature error
0 negative evidence
Included fields (12)
display_status, data_tier_note, enzyme_name, organism, substrate_name, product_name, reaction_type, evidence_type, ai_evidence_snippet, pubmed_id, doi, literature_title
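One way to use the Quarantine set in a text-mining benchmark is to flag any extracted reaction that matches a quarantined row. The sketch below is an illustration under stated assumptions: the matching key (enzyme, organism, substrate, product, compared case-insensitively) is a choice made for this example, not a rule defined by the dataset, and the function names and sample rows are hypothetical.

```python
import csv
import io

# Assumed matching key; all four names appear in the Quarantine field list.
KEY_FIELDS = ("enzyme_name", "organism", "substrate_name", "product_name")


def make_key(row):
    """Build a normalized, case-insensitive key from a row dictionary."""
    return tuple((row.get(field) or "").strip().lower() for field in KEY_FIELDS)


def load_quarantine_keys(handle):
    """Read the Quarantine CSV and return the set of keys it contains."""
    return {make_key(row) for row in csv.DictReader(handle)}


def split_predictions(predictions, quarantine_keys):
    """Partition predictions into (kept, flagged) lists."""
    kept, flagged = [], []
    for prediction in predictions:
        target = flagged if make_key(prediction) in quarantine_keys else kept
        target.append(prediction)
    return kept, flagged


# Placeholder rows standing in for the real file and a model's output.
quarantine_csv = io.StringIO(
    "enzyme_name,organism,substrate_name,product_name\n"
    "ENZYME_X,Species one,compound A,compound B\n"
)
predictions = [
    {"enzyme_name": "enzyme_x", "organism": "Species one",
     "substrate_name": "Compound A", "product_name": "compound B"},
    {"enzyme_name": "ENZYME_Y", "organism": "Species two",
     "substrate_name": "compound C", "product_name": "compound D"},
]
keys = load_quarantine_keys(quarantine_csv)
kept, flagged = split_predictions(predictions, keys)
print(len(kept), "kept;", len(flagged), "flagged as known false positives")
```

Exact string matching will miss synonyms and spelling variants, so the flagged count is a lower bound on the overlap.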
🧬 Sequences
Enzyme FASTA bundle
Protein sequences sourced from UniProt and NCBI GenBank (not novel experimental data).
Headers encode gene, organism, and public accessions for BLAST and phylogenetics.
976 unique enzymes with sequence
Format: >CYP71AV1|Artemisia annua|Q1PS23|UniProt:…|GenBank:…
Download FASTA ↓
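The pipe-delimited header can be split into its parts before running BLAST or building a tree. The sketch below parses the example header shown above, leaving its elided values as they appear. It assumes the first two fields are always gene and organism, that fields containing a colon are `source:value` pairs, and that any other field is a bare accession; the function name `parse_header` and the dictionary layout are hypothetical.

```python
def parse_header(line):
    """Split a FASTA header of the form >gene|organism|... into a dict."""
    text = line.strip()
    if text.startswith(">"):
        text = text[1:]
    fields = text.split("|")
    record = {
        "gene": fields[0],
        "organism": fields[1] if len(fields) > 1 else "",
        "accessions": [],  # bare identifiers with no source prefix
        "tagged": {},      # source -> value, e.g. "UniProt" -> "..."
    }
    for field in fields[2:]:
        source, separator, value = field.partition(":")
        if separator:
            record["tagged"][source] = value
        else:
            record["accessions"].append(field)
    return record


# The example header from the format line, with its elisions left untouched.
header = ">CYP71AV1|Artemisia annua|Q1PS23|UniProt:…|GenBank:…"
record = parse_header(header)
print(record["gene"], "/", record["organism"], "/", record["accessions"])
```

Keeping tagged and bare accessions separate avoids guessing which database a bare identifier belongs to.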