I am interested in assigning 3D coordinates to (the atoms in) some 10K molecules that I have, currently represented by SMILES. This is because, as has been shown by many cheminformatics papers, 3D and 2D structures together can better represent a molecule. This is also emphasized in large-scale molecular representation learning challenges, where 3D structures are computed with DFT.

However, computing DFT with publicly available libraries (e.g. pyscf) seems time-consuming, or maybe I am missing something. And the faster approach used by RDKit or OpenBabel (force field-based, e.g. MMFF, UFF) generates less accurate 3D coordinates (yes, I did notice this post). I have been thinking of some ways: (1) map my compounds to database identifiers (e.g. DrugBank ID) and extract 3D structures from those databases; (2) just run pyscf; (3) find some papers that predict 3D structure from SMILES (do such papers even exist?)

I'm not from a chemistry background but have just started to look into interesting problems in computational chemistry. I would appreciate suggestions from you experts!

jasperhyp
  • 4
    Welcome. For part 1, while I don't know DrugBank in particular, most of these databases use RDKit, Open Babel, or OMEGA (a commercial product) to generate coordinates using force fields. One challenge, if you're not from a chemistry background, is to know that for most molecules, there are many energetically accessible conformers (i.e., you need multiple 3D geometries). – Geoff Hutchison Apr 28 '23 at 16:51
  • 2
    You don't have to go full DFT; you can try semiempirical models. But you have to start somewhere for that, too. Also, the larger the molecules get, the more important Geoff's comment becomes. Here's a practical recommendation though: xtb. – Martin - マーチン Apr 28 '23 at 19:51
  • 2
    Even though I'm on board with this question (it is on topic here), you might find a better audience at [mattermodeling.se]. If you want, we can push it there; please do not cross-post the question. – Martin - マーチン Apr 28 '23 at 19:55
  • 1
    Which physical/chemical/biological property of the molecules or of the material should the model learn? Or are there multiple to account for? Perhaps there already are data compilations available (see an earlier post) to narrow the question. Is the interest in the conformations of global energetic minima (per molecule), or equally in shallower minima (which, depending on the property in mind, can be more interesting / more relevant)? – Buttonwood May 03 '23 at 14:18
  • 1
    @GeoffHutchison Thank you for the suggestions! I have sent an inquiry to DrugBank since they didn't provide any information re. the source of 3D structures. I am aware of conformers, and I'm happy to inject those equivariances into my encoder, so that as long as there is one valid (not sure how to define this chemically) 3D structure it should be fine. – jasperhyp May 04 '23 at 18:44
  • @Martin-マーチン Thanks for the suggestions! Would look into that after doing a search in databases as Buttonwood suggests.

    I'm not super familiar with these two substacks but I'd be happy to move this to Matter Modeling if that's more suitable! It looks like you folks who answered my question are active on both substacks :)

    – jasperhyp May 04 '23 at 18:49
  • @Buttonwood Totally agree. I'm interested in essentially modeling protein-ligand docking, which is inherently 3D. In this context, many drugs/compounds that are included in datasets such as PDBBind are only represented by SMILES. It might be possible to do a search in other 3D databases based on SMILES, though I am not sure whether such gold-standard 3D databases exist in the chemistry space. – jasperhyp May 04 '23 at 18:52
  • @jasperhyp For docking, the reduced representation by SMILES lacks some crucial information: the molecules' conformation (how groups of atoms are rotated around bonds), which is given by coordinates. But then, the most stable conformation of an isolated molecule (e.g. DFT computed) can still differ from the one adopted when a fitting partner (which could already be water, an ion, the ligand, or the protein) is close enough. The consistent assignment of SMILES (see some formats e.g. here) however can help to identify the molecules in question. – Buttonwood May 04 '23 at 20:26

1 Answer

I'd say that the way to approach this problem greatly depends on what you want to do and why you need the 3D structure. What grade of precision do you need and what do you need to measure?

Molecules move all the time unless crystallized. Do you need the crystal structure, or do you need a representative conformation that is often assumed in a solvent at room temperature? If you just need to represent the molecules in 3D in a conformation that "makes sense" for small molecules, a force field based method will generally work and produce conformations that work well enough; otherwise you can try some semiempirical methods, which are quite fast and cheap as well. DFT will allow you to predict with better precision the conformation such a molecule would assume in vacuum at 0 K (unless you go through several additional steps to try and model something else). A third option could be using experimentally measured crystal structures, though I don't know how practical it would be to find and clean up 10,000 of them. I would not use databases containing computed structures such as DrugBank, as you'd have no control over the process nor the possibility to check the quality of the results.
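As a rough illustration of the force field route, here is a minimal sketch using RDKit's ETKDG embedding followed by MMFF94 optimization (falling back to UFF when MMFF parameters are missing). The function name `smiles_to_3d`, the number of conformers, the iteration limit, and the random seed are arbitrary choices for illustration, not a recommended protocol; the lowest force field energy is also not necessarily the most relevant conformer.

```python
def smiles_to_3d(smiles, n_confs=10, seed=42):
    """Embed several conformers, relax them with a force field,
    and return (molecule, id of lowest-energy conformer, its energy)."""
    # Imported here so the sketch can be pasted without RDKit at import time.
    from rdkit import Chem
    from rdkit.Chem import AllChem

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles!r}")
    mol = Chem.AddHs(mol)  # explicit hydrogens are needed for sensible 3D geometry

    params = AllChem.ETKDGv3()
    params.randomSeed = seed  # fixed seed for reproducibility (arbitrary value)
    conf_ids = list(AllChem.EmbedMultipleConfs(mol, n_confs, params))
    if not conf_ids:
        raise RuntimeError(f"embedding failed for {smiles!r}")

    # Each result is a (not_converged_flag, energy) pair, one per conformer.
    if AllChem.MMFFHasAllMoleculeParams(mol):
        results = AllChem.MMFFOptimizeMoleculeConfs(mol, maxIters=500)
    else:
        results = AllChem.UFFOptimizeMoleculeConfs(mol, maxIters=500)

    energies = [energy for _, energy in results]
    best = min(range(len(conf_ids)), key=lambda i: energies[i])
    return mol, conf_ids[best], energies[best]
```

Coordinates can then be read with `mol.GetConformer(conf_id).GetPositions()`. The optimized geometries could also serve as starting points for a semiempirical or DFT refinement.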

You may obtain quite different conformations using the different methodologies. Your model may work on some molecules because they have a particular conformation and fail on others that would normally have a similar one, but that also have a different one which comes out slightly more favorable in the calculations. 3D QSAR is particularly complex because of this and because of the difficulties in encoding molecular information in a comparable way. In a three-dimensional object it's much more difficult to look for similarities and dissimilarities.
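To see why a "slightly favorable" conformer is a fragile basis for a model, it can help to convert relative energies into Boltzmann populations. The sketch below assumes relative energies in kcal/mol and a temperature of 298.15 K; the energy values in the usage line are made up for illustration and do not come from any calculation.

```python
import math

R_KCAL = 0.0019872  # gas constant in kcal/(mol*K)


def boltzmann_populations(energies_kcal, temperature=298.15):
    """Return the fractional population of each conformer
    from its energy (kcal/mol) at the given temperature (K)."""
    e_min = min(energies_kcal)  # shift so the lowest energy is zero
    weights = [
        math.exp(-(e - e_min) / (R_KCAL * temperature)) for e in energies_kcal
    ]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical conformer energies: 0.0, 0.5 and 3.0 kcal/mol
populations = boltzmann_populations([0.0, 0.5, 3.0])
```

With these made-up numbers, the conformer lying only 0.5 kcal/mol above the minimum still accounts for roughly 30% of the population, and a difference that small is well within the typical disagreement between methods.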

Regarding DFT computation time, that greatly depends on the sampling method you use to search the conformational space. Performing a local optimization (steepest descent/BFGS) may give you results in a couple of minutes when we're talking about small molecules. However, that won't work when searching for a global minimum. I'd say that computing 10,000 structures using DFT, with no access to a large cluster on which to parallelize the computations, would take too much time.
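A back-of-the-envelope estimate makes the scale concrete. The sketch below only multiplies job counts by a per-job time; the two minutes per optimization echoes the "couple of minutes" figure above, while the number of starting conformers and of parallel jobs are hypothetical inputs you would replace with your own.

```python
import math


def total_wall_time_hours(n_molecules, minutes_per_job,
                          n_parallel_jobs=1, n_starting_conformers=1):
    """Estimate wall time in hours, assuming every job takes the same
    time and jobs run in full batches of n_parallel_jobs."""
    total_jobs = n_molecules * n_starting_conformers
    batches = math.ceil(total_jobs / n_parallel_jobs)
    return batches * minutes_per_job / 60


# One local optimization per molecule, run serially
serial_hours = total_wall_time_hours(10_000, 2)

# Same workload with 20 starting conformers per molecule (hypothetical)
conformer_search_hours = total_wall_time_hours(10_000, 2, n_starting_conformers=20)
```

Under these assumptions a single optimization per molecule already takes about two weeks of serial compute, and any conformer search multiplies that by the number of starting geometries. Larger molecules will also take much longer than a couple of minutes each.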

Since you're new to chemistry, I'd advise you to stay away from 3D QSAR for the time being. It does provide advantages, but it carries a large number of complexities that can simply be skipped when working with 2D structures.

EDIT: I just noticed that on the website you link to, they use models to predict the structure of a molecule as modeled by DFT. In that case you necessarily have to use DFT to obtain the structure. It appears as if they were suggesting using the 3D structure as input to the model, but I doubt that's the case; they evidently meant that you should train a model on 3D data so that it will later be able to predict other 3D structures from the 2D information.

stanton63
  • 1
    Thank you for the very insightful answer! I'm not sure how much precision I actually need -- I am working on a protein-ligand docking problem and it seems natural to model 3D molecular structure in addition to 3D protein structure. I am pretty sure I need just one conformation since I can inject equivariances into the molecular encoder. – jasperhyp May 04 '23 at 18:58
  • 1
    "If you just need to represent the molecules in 3D in a conformation that "makes sense" for small molecules, a force field based method will generally work and produce conformations that work well enough" -- this sounds very encouraging! Perhaps I'll start with those structures. Thanks! Re. the drugbank comment, I am not sure if the 3d structures for drugs are sort of more experimentally-based since many drugs are pretty well-studied. Have sent an inquiry to them. Totally agree that 3D QSAR might be too hard, but I guess my use case is 3D inherently so that might make sense to include. – jasperhyp May 04 '23 at 19:00
  • 1
    For the website's use of model predictions -- I noticed this "3D molecular structures provided. We additionally provide 3D structures for training molecules. These structures are calculated by DFT and are obtained together with the HOMO-LUMO gap." Does this mean they still used DFT? – jasperhyp May 04 '23 at 19:03
  • I'm not sure what you're trying to do with protein ligand docking; if you're predicting the conformation of the ligand it would make more sense to use a docked conformation during the training. You definitely can use 2D otherwise. I'm not sure what you mean by injecting equivariance; molecular conformations can change a lot depending on the conditions. In that website they build a model which predicts the minimum DFT conformation. They do indeed compute it with DFT for the training data, there's no way around it. Unfortunately, they do not explain HOW they obtain such structures... – stanton63 May 05 '23 at 13:01
  • 2
    @jasperhyp - it looks like they're using geometries from PubChemQC which were calculated with B3LYP DFT. One warning is that these are not necessarily the low-energy or best binding conformations. PubChemQC does not include conformers. – Geoff Hutchison May 05 '23 at 18:07
  • I guess I am not really concerned about conformation but more about the binding site (on protein) & affinity, which is sort of more "abstract". Perhaps it could be fine to still use 2D, or just use DFT although the actual binding conformation could change. By "equivariance" I mean modeling the molecules in a way that all conformers of one molecule would have the same representations. Might not be the most ideal way but given the more abstract task maybe this is fine. – jasperhyp May 07 '23 at 01:41
  • @GeoffHutchison Thanks for the information! I guess I can start with these DFT-based conformer-unaware 3D structures and see if they actually add some value over using only 2D molecular graphs! – jasperhyp May 07 '23 at 01:43
  • 2
    @jasperhyp if all conformations have the same representation, then you're not really providing any conformational information to your model, so I don't see how that would improve it. Optimizing the structures through DFT can take time, so you should also confirm that the effort is worth it in terms of speed or efficiency compared to other methods such as molecular docking or molecular dynamics. If your model requires significant amounts of preprocessing time, why not just use those methods? – stanton63 May 08 '23 at 07:26