A CP solver designs a self-assembling protein | Association for Constraint Programming

The Ika4 protein (PDB 6g6q)

Designing a stable and self-assembling protein requires to organize thousands of atoms in a favorable configuration in space.

Researchers at INRA, Toulouse (France) worked in collaboration with biochemists in Belgium and Japan and used constraint programming technology, embodied in the toulbar2 solver, to optimize the intermolecular forces (captured by the Talaris14 force-field) of the full-atom representation of the targeted protein shape. Toulbar2 was able to identify a most stable molecular organization and to prove that no better organization exists.

Proteins are the main molecules of life. They govern much of how cells work, in humans, animals, plants, fungi and microbes. They are able to bind to other molecules, and to self assemble to build complex nano-structures. They can catalyze chemical reactions at ambient temperature and pressure while being biodegradable. Newly designed proteins have therefore a large potential for applications in medicine, green chemistry, biofuels, oil-based products recycling, ... Having the ability to design new tailored proteins is thus of prime importance for health but also to reduce our environmental footprint.

Contact: Thomas Schiex (thomas.schiex@inra.fr)

Proteins are linear molecules formed by bonding successives amino-acids. There are 20 natural amino-acids and each of them is defined by a constant part (identical in all amino acids) and a variable part that sticks out of the side of the constant part. The variable part is called the side-chain. It defines the specific physico-chemical properties of the amino acid and it is usually very flexible (robotic arms algorithms have been used to describe and predict these moves). The constant part is the part which is used to build up the protein, by repeated creation of so-called peptidic bonds between amino acids, defining a long chain called the backbone of the protein. The fundamental paradigm in structural biochemistry is that the structure of the protein (how it organizes various atoms in space) defines its function. To design a protein for a specific function, one starts therefore from a targeted shape of one or several interacting backbones. The aim is then to populate all the side-chains with those that will stabilize the backbone when it is in water, hoping that when the protein is created and exists in a cell, it will fold along this backbone shape. To achieve this and account for side-chain movements, the collection of 20 available side-chains is multiplied by considering a collection of possible positions (or conformations) for each side chain. In the end, one typically needs to consider that several hundreds of side-chains are possible at each position. A protein of length n therefore typically requires to explore a space of size 400^n. To choose the suitable amino acids, interatomic forces are modeled in a so-called force field (or score function) that captures interactions between atoms. The most usual force fields such as CHARM or AMBER are pairwise decomposable meaning that they decompose into a sum of functions involving at most two atoms or two positions. Ultimately, the problem of designing a new protein therefore reduces to a huge optimization problem with as many variables as there are amino acids to design, with few hundreds discrete values for each variable and unary and binary functions between variables, whose sum must be minimized. This is precisely a Cost Function Network (CFN) and finding its minimum cost (or energy) assignment is a Weighted Constraint Satisfaction Problem. Since Computational Protein Design (CPD) is NP-hard, as the WCSP, traditional approaches use Monte Carlo (Simulated annealing) methods to identify low energy solutions. These methods are very general but offer no finite time guarantee and have difficulties taking into account hard constraints. Based on enhanced constraint satisfaction and automated reasoning algorithms, the CFN solver toulbar2 can solve these problems very efficiently. On these problems, toulbar2 was shown to outperform alternative exact solvers based on integer linear programming, quadratic programming, Partial Weighted Maximum Satisfiability or existing CPD-dedicated solvers (1). It was later shown to be able to prove that a highly specialized implementation of Simulated annealing targeted at Computational Protein Design is often unable to identify an optimal solution, with a gap to optimality that increases very rapidly as the problem gets harder. In the end, using CFN technology, it becomes possible to reliably get an optimal solution faster than with Simulated annealing. The capacity of toulbar2 to easily accept constraints in its input makes it also easier to integrate specific requirements on the protein to design (on its composition, charge...)

To see if these good properties were useful in the real world, we collaborated with biochemists in Belgium and Japan to design a new symmetric protein for real, ideally one that could self assemble from pieces. The design was done using a traditional approach and a CFN based approach. Although both approaches produced a protein that correctly folded in the targeted shape, only the CFN-based design could self assemble from smaller components. It is also more stable, resisting temperatures close to 90C°, or a pH close to 1. This protein, which appears on the journal cover of (3), could traverse your stomach undigested (see https://www.linkedin.com/pulse/ai-designs-self-assembling-protein-thomas-schiex).

We are now working on new targets with collaborators in medicine and green chemistry. But tackling CPD problems also requires new automated reasoning algorithms to answer new questions: how can I find a sequence of amino acids that would accept to fold in two alternative shapes (so-called conformational switches) or that should also absolutely avoid some extra shapes. Some of these problems are extremely challenging reasoning problems, at the second level of the polynomial hierarchy. They also raise questions in Machine Learning, the more obvious one being to learn how to design (or help automated reasoning solvers to design) new proteins from existing natural proteins.

(1) Allouche, D., André, I., Barbe, S., Davies, J., De Givry, S., Katsirelos, G., O'Sullivan, B., Prestwich, S., Schiex, T. and Traoré, S., 2014. Computational protein design as an optimization problem. Artificial Intelligence, 212, pp.59-79.

(2) Simoncini, D., Allouche, D., de Givry, S., Delmas, C., Barbe, S. and Schiex, T., 2015. Guaranteed discrete energy optimization on large protein design problems. Journal of chemical theory and computation, 11(12), pp.5980-5989.

(3) Noguchi, H., Addy, C., Simoncini, D., Wouters, S., Mylemans, B., Van Meervelt, L., Schiex, T., Zhang, K.Y., Tame, J.R.H. and Voet, A.R.D., 2019. Computational design of symmetrical eight-bladed β-propeller proteins. IUCrJ, 6(1).