Overview
Syllabus
Intro
Drug discovery is hard
AstraZeneca introduced the "5R" framework
5R has had a significant impact in improving our efficiency
We are investing in new sources of data and faster validation
We need tools to make sense of data & make better and faster decisions
Finding a drug target can be formulated as a hybrid recommendation problem • Scientists need to parse large amount of information and make a ranking prediction • Different formats, data models, locations
Multiple objective optimization
Traditional recsys approaches
We assemble a large scale knowledge graph from public and AZ internal data
KG pipeline on
Pipeline - series of notebooks
Pipeline stages
Node dictionary
Mappings table
Edge assertions
Keep evidence & context for each assertion
Focus on NLP
Use natural language processing to extract precise information at scale
NLP Termite on Spark
Syntax parsing increases precision of entity recognition
Relationship from literatures reduce sparsity of biological KG
Language models lead to improvements in recall and precision
Learned sentence representation can be used for downstream tasks
Graph embedding pipeline
Approximate nearest neighbor search
Lessons learned
Acknowledgements
Taught by
Databricks