SELFIES and the future of molecular string representations

Mario Krenn*, Qianxiang Ai, Senja Barthel, Nessa Carson, Angelo Frei, Nathan C. Frey, Pascal Friederich, Théophile Gaudin, Alberto Alexander Gayle, Kevin Maik Jablonka, Rafael F. Lameiro, Dominik Lemm, Alston Lo, Seyed Mohamad Moosavi, José Manuel Nápoles-Duarte, Akshat Kumar Nigam, Robert Pollice, Kohulan Rajan, Ulrich Schatzschneider, Philippe Schwaller, Marta Skreta, Berend Smit, Felix Strieth-Kalthoff, Chong Sun, Gary Tom, Guido Falk von Rudorff, Andrew Wang, Andrew D. White, Adamo Young, Rose Yu, Alán Aspuru-Guzik

*Corresponding author for this work

Research output: Contribution to Journal, Article, Academic, peer-review

Abstract

Artificial intelligence (AI) and machine learning (ML) are expanding in popularity for broad applications to challenging tasks in chemistry and materials science. Examples include the prediction of properties, the discovery of new reaction pathways, or the design of new molecules. The machine needs to read and write fluently in a chemical language for each of these tasks. Strings are a common tool to represent molecular graphs, and the most popular molecular string representation, SMILES, has powered cheminformatics since the late 1980s. However, in the context of AI and ML in chemistry, SMILES has several shortcomings; most pertinently, most combinations of symbols lead to invalid results with no valid chemical interpretation. To overcome this issue, a new language for molecules was introduced in 2020 that guarantees 100% robustness: SELFIES (SELF-referencIng Embedded Strings). SELFIES has since simplified and enabled numerous new applications in chemistry. In this manuscript, we look to the future and discuss molecular string representations, along with their respective opportunities and challenges. We propose 16 concrete Future Projects for robust molecular representations. These involve the extension toward new chemical domains, exciting questions at the interface of AI and robust languages, and interpretability for both humans and machines. We hope that these proposals will inspire several follow-up works exploiting the full potential of molecular string representations for the future of AI in chemistry and materials science.
Original language: English
Article number: 100588
Pages (from-to): 1-27
Number of pages: 27
Journal: Patterns
Volume: 3
Issue number: 10
DOIs
Publication status: Published - 14 Oct 2022

Bibliographical note

Funding Information:
The authors thank Greg Landrum, Daniel Flam-Shepherd, Suliman Sharif, and Bettina Lier for valuable comments on the manuscript. The authors also thank Sara Bebbington of IOP Publishing and Zamyla Chan and Erin Warner of the University of Toronto Acceleration Consortium for helping to organize the SELFIES workshop. M.K. acknowledges support from the FWF (Austrian Science Fund) via the Erwin Schrödinger fellowship no. J4309. R.F.L. received a PhD Scholarship from the São Paulo Research Foundation (FAPESP) – grant #2021/01633-3. This study was financed in part by CAPES – Finance Code 001. R.P. acknowledges funding through a Postdoc.Mobility fellowship by the Swiss National Science Foundation (SNSF; project no. 191127). A.W. would like to thank the Natural Sciences and Engineering Research Council of Canada (NSERC) for financial support via a CGS-M scholarship. G.T. acknowledges financial support from NSERC via the PGS-D scholarship. R.Y. acknowledges support from the US Department of Energy, Office of Science, AWS Machine Learning Research Award, and NSF grant #2037745. D.L. and G.F.v.R. were supported by the von Lilienfeld lab at the University of Vienna. A.D.W. was supported by the National Institute of General Medical Sciences of the National Institutes of Health under award number R35GM137966. K.M.J. and B.S. acknowledge funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement no. 666983, MaGic). J.M.N.-D. acknowledges support by the National Council for Science and Technology (CONACYT) under award number CVU 105568. P.S. acknowledges support from the NCCR Catalysis (grant number 180544), a National Centre of Competence in Research funded by the Swiss National Science Foundation. S.M.M. was supported by the Swiss National Science Foundation (SNSF) under grant P2ELP2_195155. U.S. acknowledges support from the Deutsche Forschungsgemeinschaft (DFG) within NFDI4Chem (grant no. NFDI4-1). Q.A. acknowledges support from the National Science Foundation (grant no. DMR-1928882). A.A.G. acknowledges support from the Canada 150 Research Chairs Program, the Google Focused Award, and Dr. Anders G. Frøseth.

Publisher Copyright:
© 2022 The Author(s)

Funding

Funders and funder numbers
Canada 150 Research Chairs Program
NCCR Catalysis: 180544
National Science Foundation: DMR-1928882, 2037745
National Institutes of Health: R35GM137966
U.S. Department of Energy
National Institute of General Medical Sciences
Office of Science
Amazon Web Services
Horizon 2020 Framework Programme
Natural Sciences and Engineering Research Council of Canada
European Research Council
Deutsche Forschungsgemeinschaft: NFDI4-1
Schweizerischer Nationalfonds zur Förderung der Wissenschaftlichen Forschung: 191127
Fundação de Amparo à Pesquisa do Estado de São Paulo: 2021/01633-3
Coordenação de Aperfeiçoamento de Pessoal de Nível Superior
Austrian Science Fund: J4309
Universität Wien
Consejo Nacional de Ciencia y Tecnología: CVU 105568
Horizon 2020: 666983
National Centre of Competence in Research Robotics: P2ELP2_195155

Keywords

• DSML 3: Development/pre-production: Data science output has been rolled out/validated across multiple domains/problems
