Semantic representation of neurochemical molecules — An unsupervised approach to predict drug effectiveness

Apr 13, 2022

This project was carried out as part of the TechLabs “Digital Shaper Program” in Düsseldorf (Winter Term 2021/22).

Abstract

The blood-brain barrier (BBB) is one of the key protective elements of the brain. It separates the central nervous system (CNS) from the circulatory system and protects the brain against intrusive chemicals and foreign particles, including some therapeutic agents. One reason for the relatively low success rate of neuropharmaceuticals is that the BBB blocks the drug’s entry into the brain, resulting in insufficient CNS exposure. Thus, most of these drugs fail to reach the market. Traditional experimental approaches to evaluating the BBB permeability of a drug are expensive and time-consuming. Therefore, we aimed to estimate the propensity of compounds to penetrate the BBB. Using mol2vec, an unsupervised machine learning approach that learns vector representations of molecular substructures, we derived a vector representation for each drug in the blood-brain barrier penetration (BBBP) dataset. We calculated cosine similarities to measure how close each drug is to every other drug. Moreover, by drawing the molecular structures from their SMILES strings, we checked whether drugs with similar vectors are in fact structurally similar. For any ineffective neurochemical drug (one that is unable to cross the blood-brain barrier), we can use our vector representation to find the most similar drugs that are effective. The approach can also be extended to non-neurochemical drugs.

Introduction

Neuropharmaceuticals are one of the most demanding areas of the global pharmaceutical market. Their success rate is much lower than that of other therapeutic areas. One reason for this relatively low success rate is that the BBB blocks the drug’s entry into the brain, resulting in insufficient CNS exposure. Thus, most of these drugs fail to reach the market. The major challenge in the field of CNS pharmacokinetics and pharmacodynamics is meeting the permeability criteria of the BBB.

Problems and goals

We aimed to cluster compounds according to their blood-brain barrier penetration.

Methods

By using mol2vec, an unsupervised machine learning approach to learn vector representations of molecular substructures, we derived a vector representation for each of the drugs present in the blood-brain barrier penetration (BBBP) dataset.
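To make the idea of a compound vector concrete: mol2vec treats the substructures of a molecule like words in a sentence and obtains the vector of a compound by summing the vectors of its substructures. The following is a minimal sketch of that summation step only, in plain Python. The substructure identifiers, the 3-dimensional values, and the `UNK` fallback for unseen identifiers are assumptions made for illustration; they are not taken from the project or from a trained model.

```python
# Toy substructure embeddings. In mol2vec these would come from a trained
# Word2Vec model keyed by Morgan substructure identifiers; the keys and the
# 3-dimensional values below are made up for illustration.
SUBSTRUCTURE_VECTORS = {
    "id_1": [0.2, -0.1, 0.5],
    "id_2": [0.0, 0.3, -0.2],
    "id_3": [-0.4, 0.1, 0.1],
    "UNK": [0.0, 0.0, 0.0],  # assumed fallback for identifiers not in the vocabulary
}


def embed_compound(substructure_ids, vectors=SUBSTRUCTURE_VECTORS):
    """Sum the vectors of a compound's substructures into one compound vector."""
    total = [0.0] * len(vectors["UNK"])
    for sub_id in substructure_ids:
        vec = vectors.get(sub_id, vectors["UNK"])
        total = [t + v for t, v in zip(total, vec)]
    return total


# A hypothetical compound described by three substructure identifiers,
# one of which ("id_9") is not in the vocabulary.
compound_vector = embed_compound(["id_1", "id_2", "id_9"])
```

Because the compound vector is a plain sum, two molecules that share many substructures end up with vectors pointing in similar directions, which is what the similarity comparison later relies on.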

Datasets and clustering tools used for this project:

The blood-brain barrier penetration (BBBP) dataset contains:

- “name”: Name of the compound

- “smiles”: SMILES representation of the molecular structure

- “p_np”: Binary labels for penetration/non-penetration
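As a rough illustration of how a table with these three columns could be read and split by label, here is a small sketch using only the Python standard library. The rows are placeholders written for this example and are not records from the BBBP dataset. The sketch also assumes that “p_np” is stored as 1 for penetrating and 0 for non-penetrating compounds.

```python
import csv
import io

# Illustrative rows only: the names and labels are placeholders,
# not records copied from the BBBP dataset.
SAMPLE_CSV = """name,smiles,p_np
compound_a,CCO,1
compound_b,CC(=O)O,0
compound_c,c1ccccc1,1
"""


def load_bbbp(handle):
    """Read rows with the three columns and convert p_np to an int."""
    rows = []
    for row in csv.DictReader(handle):
        rows.append(
            {"name": row["name"], "smiles": row["smiles"], "p_np": int(row["p_np"])}
        )
    return rows


rows = load_bbbp(io.StringIO(SAMPLE_CSV))

# Assumption: 1 marks a penetrating compound and 0 a non-penetrating one.
penetrating = [r["name"] for r in rows if r["p_np"] == 1]
non_penetrating = [r["name"] for r in rows if r["p_np"] == 0]
```

To read an actual file, the same function could be called with `open(path, newline="")`, where `path` is wherever the dataset was saved.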

Experimental steps:

Results

We calculated cosine similarities to measure how close each drug is to every other drug. Moreover, by drawing the molecular structures from their SMILES strings, we checked whether drugs with similar vectors are in fact structurally similar.
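For readers unfamiliar with the measure: the cosine similarity of two vectors is their dot product divided by the product of their lengths, so it is 1 for vectors pointing in the same direction and 0 for orthogonal ones. A minimal sketch of the all-against-all comparison could look like this. The drug vectors are hypothetical and only three-dimensional, and returning 0 for a zero vector is a convention chosen for this example.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def similarity_matrix(vectors):
    """Compare every vector with every other vector (and with itself)."""
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]


# Hypothetical drug vectors; real embeddings have many more dimensions.
drug_vectors = [
    [1.0, 0.0, 0.0],
    [2.0, 0.0, 0.0],  # same direction as the first vector
    [0.0, 1.0, 0.0],  # orthogonal to the first vector
]
matrix = similarity_matrix(drug_vectors)
```

Entry `matrix[i][j]` holds the similarity between drug `i` and drug `j`. The matrix is symmetric, and the measure ignores vector length, so only the direction of the embedding matters.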

Outlook and conclusion

For any ineffective neurochemical drug (one that is unable to cross the blood-brain barrier), we can use our vector representation to find the most similar drugs that are effective. The approach can also be extended in the other direction: for non-neurochemical drugs that are able to penetrate the BBB, it can identify the most similar drugs that would not have effects on the brain.
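The lookup described here can be sketched as a nearest-neighbour search restricted to drugs with the desired label. The example below is an illustration under stated assumptions, not the project's code: the drug names, vectors, and labels are placeholders, `most_similar_with_label` is a hypothetical helper, and “p_np” is assumed to be 1 for penetrating and 0 for non-penetrating drugs.

```python
import math


def cosine_similarity(u, v):
    """Dot product divided by the product of the vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar_with_label(query_name, drugs, wanted_label, top_k=2):
    """Rank drugs carrying `wanted_label` by cosine similarity to the query drug."""
    query_vec = drugs[query_name]["vector"]
    scored = [
        (name, cosine_similarity(query_vec, info["vector"]))
        for name, info in drugs.items()
        if name != query_name and info["p_np"] == wanted_label
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical drugs: names, vectors, and labels are placeholders.
drugs = {
    "drug_a": {"vector": [1.0, 0.1, 0.0], "p_np": 0},
    "drug_b": {"vector": [0.9, 0.2, 0.0], "p_np": 1},
    "drug_c": {"vector": [0.0, 1.0, 0.0], "p_np": 1},
    "drug_d": {"vector": [1.0, 0.0, 0.1], "p_np": 0},
}

# For the non-penetrating drug_a, list the closest penetrating drugs.
candidates = most_similar_with_label("drug_a", drugs, wanted_label=1)
```

Calling the same helper with a penetrating query drug and `wanted_label=0` covers the reverse direction, where the goal is a similar drug that stays out of the brain.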