Contents
- 1 FLIGHTED: Inferring Fitness from High-Throughput Experiments (Prof. Kevin Esvelt, MIT)
- 2 Predicting Permeability with Machine Learning and Molecular Dynamics Simulations (Inductive Bio)
- 3 Diffusion Models for Protein Binders (Generate Biomedicines)
- 4 TERMinator: Structure-Based Protein Design with Tertiary Repeating Motifs (Prof. Amy Keating, MIT)
- 5 Machine Learning on DNA-Encoded Libraries (Google Accelerated Sciences)
- 6 Protein/Ligand Binding (Dr. Lucy Colwell, University of Cambridge)
- 7 Nuclear Quantum Effects in Force Fields (Prof. Alan Aspuru-Guzik, Harvard University)
FLIGHTED: Inferring Fitness from High-Throughput Experiments (Prof. Kevin Esvelt, MIT)
My main PhD research has focused on developing FLIGHTED, a tool for inference of fitness landscapes from high-throughput experiments. FLIGHTED is a Bayesian model that uses variational inference and biological noise models to infer fitness from a given set of high-throughput experimental data. Thus far, I have developed and released two versions of FLIGHTED for two different experiment classes:
- FLIGHTED-Selection, which focuses on single-step selection experiments like phage display, mRNA display, and FACS. I showed through simulation that single-step selection experiments are very noisy, especially for highly active variants, and that FLIGHTED-Selection can accurately correct for this noise regardless of experimental conditions. Preliminary benchmarking on the popular GB1 dataset indicates that accounting for noise substantially changes downstream ML model performance, suggesting that benchmarking studies should use methods like FLIGHTED to ensure accurate model rankings.
- FLIGHTED-DHARMA, designed for a new protein fitness assay named DHARMA developed by my colleague Boqiang Tu. DHARMA measures protein fitness by linking fitness to transcription of a base editor and measuring the resulting edits. The edits can be read out efficiently by high-throughput sequencing, but the readout is very noisy. Using FLIGHTED, we make DHARMA measurements reliable and provide calibrated error estimates, allowing experimentalists to run high-throughput DHARMA screens even with very small numbers of reads per variant. FLIGHTED-DHARMA is well-calibrated, robust, and interpretable.
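As a rough illustration of where the noise in a single-step selection experiment comes from, the sketch below simulates a toy selection followed by sequencing and computes the naive enrichment ratio for each variant. It is not FLIGHTED's model: the function names, the binomial selection and sampling steps, and all parameter values (`cells_per_variant`, `read_fraction`) are assumptions chosen only for illustration.

```python
import random
import statistics


def simulate_enrichment(true_survival, cells_per_variant=200,
                        read_fraction=0.1, rng=None):
    """Simulate one toy selection round and return naive enrichment ratios.

    true_survival: hypothetical per-variant probability of passing selection.
    Returns one ratio per variant, or None when the variant has no
    pre-selection reads (the ratio is undefined in that case).
    """
    rng = rng or random.Random()
    pre_reads, post_reads = [], []
    for p in true_survival:
        # Selection: each cell survives independently with probability p.
        survivors = sum(rng.random() < p for _ in range(cells_per_variant))
        # Sequencing: each cell is read independently with probability read_fraction.
        pre_reads.append(
            sum(rng.random() < read_fraction for _ in range(cells_per_variant)))
        post_reads.append(
            sum(rng.random() < read_fraction for _ in range(survivors)))
    pre_total, post_total = sum(pre_reads), sum(post_reads)
    ratios = []
    for pre, post in zip(pre_reads, post_reads):
        if pre == 0 or post_total == 0:
            ratios.append(None)
        else:
            # Naive enrichment: post-selection frequency over pre-selection frequency.
            ratios.append((post / post_total) / (pre / pre_total))
    return ratios


def enrichment_spread(true_survival, variant, n_repeats=200, seed=0):
    """Mean and standard deviation of one variant's ratio across repeated experiments."""
    rng = random.Random(seed)
    values = []
    for _ in range(n_repeats):
        ratio = simulate_enrichment(true_survival, rng=rng)[variant]
        if ratio is not None:
            values.append(ratio)
    return statistics.mean(values), statistics.stdev(values)


if __name__ == "__main__":
    library = [0.05, 0.3, 0.9]  # hypothetical survival probabilities
    for index in range(len(library)):
        mean, spread = enrichment_spread(library, index)
        print(f"variant {index}: mean ratio {mean:.2f}, std {spread:.2f}")
```

Repeating the same toy experiment many times shows how much the naive enrichment ratio of a single variant fluctuates from sampling alone, which is the kind of noise a probabilistic model has to account for.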
I am currently using DHARMA and FLIGHTED to generate large protein fitness landscapes (up to a million variants) and perform large-scale protein engineering. A four-site fitness landscape of TEV protease (a sequence-specific protease) cleaving its wild-type substrate is now available here.
A preprint describing FLIGHTED is available here, and the work has also been presented at NeurIPS and ICLR workshops. FLIGHTED has been released on GitHub here and is pip installable. Please contact me if you would like to use FLIGHTED in your research or if you run into any issues.
Predicting Permeability with Machine Learning and Molecular Dynamics Simulations (Inductive Bio)
During a summer internship at Inductive Bio, I worked on predicting permeability by combining machine learning and molecular dynamics simulations. I used fast, efficient molecular dynamics simulations to generate candidate permeability training sets for machine learning models.
Diffusion Models for Protein Binders (Generate Biomedicines)
During a summer internship, I worked at Generate Biomedicines on using protein diffusion models for generating protein binders. My work extended early protein diffusion models to design miniprotein binders to other proteins.
TERMinator: Structure-Based Protein Design with Tertiary Repeating Motifs (Prof. Amy Keating, MIT)
As part of my first PhD rotation, I helped in the development of TERMinator, a neural network designed to predict protein sequence from a given structure. It used TERMs (tertiary repeating motifs), structural motifs found in the PDB, to generate a Potts model for a given protein that could be optimized to generate a sequence. We demonstrated that the use of TERMs and Potts models showed small advantages over previously available methods. My primary contribution was aiding in benchmarking and ablation studies of TERMinator. Our work was presented at a NeurIPS workshop here and published in Protein Science here. TERMinator is available for public use here.
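To make the Potts-model step concrete, the sketch below scores a sequence with a generic Potts energy, E(s) = sum_i h_i(s_i) + sum_{i<j} J_ij(s_i, s_j), and then lowers that energy with a simple greedy search. This is not TERMinator's code: the toy parameters `EXAMPLE_H` and `EXAMPLE_J`, the function names, and the greedy optimizer are assumptions for illustration, and in the real method the fields and couplings would come from the trained network.

```python
import random


def potts_energy(seq, h, J):
    """Potts energy: single-site fields plus pairwise couplings.

    seq: list of state indices, one per position.
    h:   h[i][a] is the field for state a at position i.
    J:   dict mapping (i, j) with i < j to a matrix J[(i, j)][a][b].
    """
    energy = sum(h[i][a] for i, a in enumerate(seq))
    for (i, j), coupling in J.items():
        energy += coupling[seq[i]][seq[j]]
    return energy


def greedy_minimize(h, J, seed=0, max_sweeps=100):
    """Accept single-site changes that strictly lower the energy until none remain."""
    rng = random.Random(seed)
    seq = [rng.randrange(len(field)) for field in h]
    energy = potts_energy(seq, h, J)
    for _ in range(max_sweeps):
        improved = False
        for i, field in enumerate(h):
            for state in range(len(field)):
                trial = seq[:i] + [state] + seq[i + 1:]
                trial_energy = potts_energy(trial, h, J)
                if trial_energy < energy:
                    seq, energy = trial, trial_energy
                    improved = True
        if not improved:
            break  # local minimum: no single-site change helps
    return seq, energy


# Hypothetical toy model: 3 positions, 2 states per position.
EXAMPLE_H = [[0.0, 1.0], [0.5, 0.0], [0.2, 0.0]]
EXAMPLE_J = {
    (0, 1): [[0.0, -1.0], [-1.0, 0.0]],
    (1, 2): [[-0.5, 0.0], [0.0, -0.5]],
}

if __name__ == "__main__":
    best_seq, best_energy = greedy_minimize(EXAMPLE_H, EXAMPLE_J)
    print(best_seq, best_energy)
```

A greedy search only guarantees a local minimum. Practical sequence design over a Potts model would more likely use a stochastic optimizer such as simulated annealing, which is left out here to keep the sketch short.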
Machine Learning on DNA-Encoded Libraries (Google Accelerated Sciences)
As an AI resident at Google Accelerated Sciences (GAS), I worked on a team focused on applying machine learning to drug discovery. Specifically, we aimed to use molecular data from DNA-encoded libraries to predict protein/ligand binding for pharmaceutically relevant targets (a previous paper summarizing the work can be found here). My work focused on uncertainty quantification for these models: I developed methods to estimate the number of hits in a selected list of molecules and incorporated applicability domain modeling into our pipeline to improve overall performance.
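One generic way to think about estimating the number of hits in a selected list is sketched below. It is an assumption-laden illustration, not the method developed in this project: it assumes each molecule comes with a calibrated hit probability and that hits are independent, so the hit count follows a Poisson-binomial distribution. The function names are hypothetical.

```python
import math


def expected_hits(probabilities):
    """Mean and standard deviation of the hit count under independence."""
    mean = sum(probabilities)
    variance = sum(p * (1.0 - p) for p in probabilities)
    return mean, math.sqrt(variance)


def hit_count_distribution(probabilities):
    """Exact Poisson-binomial pmf: result[k] = P(exactly k hits)."""
    dist = [1.0]  # with no molecules, zero hits has probability 1
    for p in probabilities:
        updated = [0.0] * (len(dist) + 1)
        for k, mass in enumerate(dist):
            updated[k] += mass * (1.0 - p)  # this molecule is not a hit
            updated[k + 1] += mass * p      # this molecule is a hit
        dist = updated
    return dist


if __name__ == "__main__":
    scores = [0.9, 0.6, 0.3, 0.1]  # hypothetical calibrated probabilities
    mean, std = expected_hits(scores)
    print(f"expected hits: {mean:.2f} +/- {std:.2f}")
    print([round(x, 4) for x in hit_count_distribution(scores)])
```

If the model's probabilities are poorly calibrated, or the molecules lie outside the model's applicability domain, both the mean and the spread computed this way can be misleading.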
Protein/Ligand Binding (Dr. Lucy Colwell, University of Cambridge)
I worked in Dr. Lucy Colwell’s lab at the University of Cambridge on machine learning as applied to protein/ligand binding for my master's degree. Our work generated three key results:
- We demonstrated that standard debiasing algorithms for protein/ligand binding datasets, such as AVE and MUV debiasing, neither improve the generalization of models nor accurately measure it. We proposed using distant held-out test sets as performance metrics instead of debiasing approaches. The paper has been published in JCIM and can be found here.
- Focusing on the problem of predicting protein/ligand binding without prior experimental screens, we developed a data augmentation strategy that dramatically improves the performance of drug-target interaction (DTI) models, which seek to predict interactions between multiple proteins and ligands. Our approach performs well on challenging datasets consisting of proteins and ligands for which the model is given no known interactions. The paper has been posted on bioRxiv here and a previous version was presented at a NeurIPS workshop here.
- We used attribution methods to identify serious flaws in chemical fingerprints as a molecular representation. Specifically, we showed that the structure of fingerprints gives rise to spurious correlations in fingerprint space. These correlations are masked by the hashing process but become relevant and degrade model performance on protein/ligand binding datasets. A version of this paper has been presented at an ICML workshop here.
Nuclear Quantum Effects in Force Fields (Prof. Alan Aspuru-Guzik, Harvard University)
I worked with Dr. David Gelbwaser and Prof. Alan Aspuru-Guzik at Harvard for my undergraduate senior thesis on efficiently incorporating nuclear quantum effects into force fields. We developed a method, called a force-field functor, that takes any accurate original force field and generates an effective force field which, when used in simulation, reproduces thermodynamic observables with nuclear quantum effects accounted for. We computed the force-field functor using the Wigner expansion and demonstrated its effectiveness on a system of liquid neon. The paper has been published in JPCL and is available here.
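For readers unfamiliar with the Wigner expansion, the textbook leading-order (Wigner-Kirkwood) result gives a sense of what an effective force field of this kind looks like. The form below assumes identical particles of mass m at inverse temperature β = 1/(k_B T) and keeps only the first quantum correction; whether the paper uses exactly this truncation is not stated here, so treat it as background rather than as the published functor.

```latex
% Configurational density to first order in \hbar^2
\rho(q) \propto e^{-\beta V(q)}
\left[ 1 - \frac{\hbar^2 \beta^2}{12 m} \nabla^2 V(q)
         + \frac{\hbar^2 \beta^3}{24 m} \lvert \nabla V(q) \rvert^2
         + O(\hbar^4) \right]

% Equivalent effective potential, defined by \rho(q) \propto e^{-\beta V_{\mathrm{eff}}(q)}
V_{\mathrm{eff}}(q) = V(q)
  + \frac{\hbar^2 \beta}{12 m} \nabla^2 V(q)
  - \frac{\hbar^2 \beta^2}{24 m} \lvert \nabla V(q) \rvert^2
  + O(\hbar^4)
```

Sampling with V_eff in place of V reproduces configurational averages to order ħ², which is why the correction matters most for light atoms at low temperatures.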