One of the key challenges in predicting odor from molecular structure is unarguably our limited understanding of the odor space and the complexity of the underlying structure-odor relationships. Here, we show that the predictive performance of machine learning models for structure-based odor prediction can be improved using both an expert and a data-driven odor taxonomy. The expert taxonomy is based on semantic and perceptual similarities, while the data-driven taxonomy is based on clustering co-occurrence patterns of odor descriptors directly from the prepared dataset. Both taxonomies improve the predictions of different machine learning models and outperform random groupings of descriptors that do not reflect existing relations between odor descriptors. We assess the quality of both taxonomies through their predictive performance across different odor classes and perform an in-depth error analysis that highlights the complexity of odor-structure relationships and identifies potential inconsistencies within the taxonomies, using pear odorants from perfumery as a showcase. The data-driven taxonomy allows us to critically evaluate our expert taxonomy and to better understand the molecular odor space. Both taxonomies, as well as a full dataset, are made available to the community, providing a stepping stone for future community-driven exploration of the molecular basis of smell. In addition, we provide a detailed multi-layer expert taxonomy comprising a total of 777 different descriptors from the Pyrfume repository.
Exploring molecular odor taxonomies for structure-based odor predictions using machine learning
Submitted to arXiv, 11 August 2025
      
Type: Report
Date: 2025-08-11
Department: Data Science
Eurecom Ref: 8345
Copyright: © EURECOM. Personal use of this material is permitted. The definitive version of this paper was published in Submitted to arXiv, 11 August 2025 and is available at : 
      PERMALINK : https://www.eurecom.fr/publication/8345