[OC] Jaccard similarity scores across 31 ingredients — real data from a flavor chemistry database built from 30,000 food science papers

Posted by mark124mjj

11 Comments

  1. I built CompKitchen (compkitchen.com) — a molecular flavor pairing engine built from 30,000+ research papers and 14 scientific databases.

    Each cell = shared volatile compounds ÷ total unique compounds across both ingredients. All values are real Jaccard scores from the database.
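For readers unfamiliar with the metric, the cell definition above can be sketched in a few lines of Python. This is only an illustration of the Jaccard formula as described; the compound sets below are hypothetical placeholders, not values from the CompKitchen database.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Shared compounds divided by total unique compounds across both sets."""
    union = a | b
    if not union:
        # Two empty compound lists: define the score as 0 rather than divide by zero.
        return 0.0
    return len(a & b) / len(union)


# Hypothetical compound sets, for illustration only (not real database entries).
ingredient_a = {"benzaldehyde", "linalool", "hexanal", "nonanal"}
ingredient_b = {"benzaldehyde", "linalool", "hexanal", "filbertone"}

score = jaccard(ingredient_a, ingredient_b)
print(round(score, 3))  # 3 shared / 5 unique
```

Note that the score only counts presence or absence: a compound found at trace level counts the same as one that dominates the aroma.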

    Things I found interesting in this data:

    * Hazelnut × Almond scores 0.493, the highest pair. Almond is a Rosaceae seed and hazelnut belongs to Betulaceae, but the two have near-identical benzaldehyde and linalool profiles.
    * Thyme × Rosemary (0.466) and Cinnamon × Thyme (0.421) — the Mediterranean herb cluster is molecularly very tight.
    * Coconut × Almond (0.435) — surprising to most people but coconut shares a lot of lactone and ester compounds with tree nuts.
    * Black pepper row is almost entirely pale — despite being one of the most used spices, it’s molecularly isolated from most ingredients in this set.
    * Olive oil is the palest row — very low volatile compound overlap across the board. Its flavor impact comes more from fatty acid oxidation products than shared volatiles with other ingredients.
    * Green Tea × Black Tea (0.355) — tight pair despite very different flavor profiles, because many base catechin-derived volatiles are shared before oxidation diverges them.

    Data sources: FlavorDB 2.0, FooDB, TGSC, VCF, PubMed NLP pipeline, Van Gemert (2011), Maarse (1991), ChemTastesDB, Dr. Duke, Ahn et al., PubChem, Flavornet

    Tools: Python, custom NLP extraction pipeline, HTML5 Canvas

    Happy to answer questions about methodology, specific pairings, or the database.

  2. Not entirely sure I’m understanding what I’m digesting here (no pun intended). So this is not flavour pairings that feel right, but molecular similarities?

    In either case this is super interesting and presented in a cool way. How do raspberry and cinnamon work? They seem different, but their cell has a fairly solid colour.

  3. I don’t understand what “Jaccard similarity” could mean, based on this chart.

    Surely you’re not saying “cinnamon” tastes like “coconut” or “basil” more than “wine” tastes like “grape.”

  4. Wow, that looks like a great resource. I’ll pass the link on to a mate for a look.

  5. absofruitly202 on

    Not beautiful because it is not easy to understand. What are you comparing? Why is the text so small? Where is the key or any indication of context? Boo

  6. You could convert to perceived odor strength, crudely, by dividing by the odor threshold, but last time I looked, the data quality was horrible and machine learning of the odor threshold very unreliable. Is the situation any better now?
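The conversion described in this comment (concentration divided by odor threshold, often called an odor activity value) could look roughly like the sketch below. The compound names and numbers are made-up placeholders, and the cutoff of 1.0 for "perceptible" is an assumption for illustration; it says nothing about how reliable real threshold data is.

```python
def odor_activity(
    concentration: dict[str, float], threshold: dict[str, float]
) -> dict[str, float]:
    """Odor activity value = concentration / odor threshold (same units for both).

    Compounds without a usable threshold are skipped rather than guessed.
    """
    return {
        name: conc / threshold[name]
        for name, conc in concentration.items()
        if name in threshold and threshold[name] > 0
    }


# Placeholder values, not measured data.
conc = {"compound_a": 50.0, "compound_b": 2.0, "compound_c": 10.0}
thr = {"compound_a": 5.0, "compound_b": 20.0}  # no threshold known for compound_c

oav = odor_activity(conc, thr)
# Assumed rule of thumb: a value of at least 1 means the compound sits above its threshold.
perceptible = {name for name, value in oav.items() if value >= 1.0}
print(oav, perceptible)
```

A similarity score could then be computed over the `perceptible` sets instead of all catalogued compounds, at the cost of dropping every compound whose threshold is missing.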

  7. Man, Reddit has really gone downhill with all the AI slop apps and nonstop promotion of them

  8. I have some questions before reading anything into it. What does the correlation actually measure? Co-occurrence in recipes, shared flavor compounds, sensory scores? Because those produce completely different matrices. The obvious clusters (basil/thyme/rosemary, beef/pork, green tea/black tea) just tell me “things used in the same context” which a cookbook index would also give you. More suspicious: olive oil shows near-zero correlation with almost everything, which is a red flag about the data rather than a culinary insight. No scale, no baseline, no null distribution either, so a 0.3 means nothing here.
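One way to get the kind of baseline this comment asks for is a simple simulated null: draw two random compound sets of the same sizes from a shared pool and see what Jaccard scores chance alone produces. The sketch below assumes every compound is equally likely to appear, which is almost certainly too generous to the real data (a few compounds show up in many foods), and the set sizes and pool size are hypothetical.

```python
import random


def null_jaccard(
    n_a: int, n_b: int, universe_size: int, trials: int = 2000, seed: int = 0
) -> list[float]:
    """Sorted Jaccard scores for random compound sets of sizes n_a and n_b."""
    rng = random.Random(seed)
    universe = range(universe_size)
    scores = []
    for _ in range(trials):
        a = set(rng.sample(universe, n_a))
        b = set(rng.sample(universe, n_b))
        scores.append(len(a & b) / len(a | b))
    return sorted(scores)


# Hypothetical sizes: 40 and 60 catalogued compounds out of a pool of 500.
scores = null_jaccard(n_a=40, n_b=60, universe_size=500)
p95 = scores[int(0.95 * len(scores))]
print(f"95th percentile of null Jaccard: {p95:.3f}")
```

Comparing an observed score against `p95` for the matching set sizes gives a rough sense of whether it stands out from chance overlap. A more realistic null would sample compounds in proportion to how often they occur across all ingredients.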

  9. ok so i jumped on my computer fully intending to do this myself, then realised getting the data out is non-trivial and now i’m here telling you what i would’ve done instead.
    classic.

    really cool project. one suggestion: have you thought about running PCA or UMAP on this? not on the jaccard matrix itself (the distances aren’t euclidean so it gets weird), but on the underlying ingredient x compound presence matrix. PCA would give you interpretable axes: PC1 and PC2 loadings would literally tell you which compound families drive the separation, so you’d get something like ‘the benzaldehyde/linalool axis’ falling out naturally. UMAP would be the move if you just want clean clusters for visualisation.

    my bet is you’d see a tight terpene cluster (thyme, rosemary, basil, cinnamon), a lactone/nut cluster (almond, hazelnut, coconut), and black pepper + olive oil sitting alone as outliers. basically the heatmap story but in 2d and easier to read at a glance.

    one gotcha: normalise first, otherwise pc1 will just end up being ‘ingredients with more catalogued compounds’ and you’ll learn nothing.
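A rough sketch of the suggestion above, on a toy ingredient x compound presence matrix. Everything here is an assumption for illustration: the matrix is invented, "normalise" is taken to mean scaling each ingredient row to unit length, and a small pure-Python power iteration stands in for a library PCA (such as scikit-learn's `PCA`) so the example has no dependencies.

```python
import math

# Invented presence matrix: rows are ingredients, columns are compounds.
compounds = ["terpene_1", "terpene_2", "terpene_3", "lactone_1", "lactone_2"]
ingredients = ["thyme", "rosemary", "almond", "hazelnut"]
presence = [
    [1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
]


def row_normalise(matrix):
    """Scale each row to unit length so compound count alone does not dominate."""
    out = []
    for row in matrix:
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        out.append([v / norm for v in row])
    return out


def centre_columns(matrix):
    """Subtract each column's mean, as PCA expects."""
    means = [sum(col) / len(matrix) for col in zip(*matrix)]
    return [[v - m for v, m in zip(row, means)] for row in matrix]


def first_component(x, iters=100):
    """Leading eigenvector of X^T X by power iteration (the PC1 loadings)."""
    d = len(x[0])
    v = [float(i + 1) for i in range(d)]  # arbitrary non-uniform starting vector
    for _ in range(iters):
        xv = [sum(a * b for a, b in zip(row, v)) for row in x]
        w = [sum(row[j] * s for row, s in zip(x, xv)) for j in range(d)]
        norm = math.sqrt(sum(c * c for c in w))
        if norm == 0.0:
            break
        v = [c / norm for c in w]
    return v


x = centre_columns(row_normalise(presence))
pc1 = first_component(x)
scores = [sum(a * b for a, b in zip(row, pc1)) for row in x]

for name, s in zip(ingredients, scores):
    print(f"{name:9s} PC1 score {s:+.3f}")
for name, w in zip(compounds, pc1):
    print(f"{name:10s} PC1 loading {w:+.3f}")
```

On this toy matrix the two herbs land on one side of PC1 and the two nuts on the other, and the loadings show which compound columns drive the split. The overall sign of a principal component is arbitrary, so only the relative signs are meaningful.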

    again, super cool idea!