Master’s project in data mining and protein sequence analysis
The evolutionary adaptations of living organisms are shaped by the environmental factors defining their habitat, but data describing those factors is often inaccessible. In this Master’s project we will mine microbial culturing protocols for growth condition data and use it for protein sequence analysis.
The physiology and evolutionary adaptations of every living organism are shaped by the environmental factors defining the habitat in which they live. Gaining insights regarding such adaptations by leveraging the ever-increasing quantity of genomic data is important. Unfortunately, this is currently difficult due to limitations in accessing data describing the environment of each habitat. This is true for such properties as growth temperature, growth media composition, pH, salinity and many more.
Growth data can be obtained in the form of microbial culturing protocols. However, these are locked away in formats that are hard to process automatically, such as PDF, and are distributed over different repositories. In this 6- to 12-month Master’s project we will unlock the potential of this type of information by mining PDF files for growth condition data and storing it in an accessible file format. This process will entail writing scripts for automatic extraction of data, combined with human curation of the results. This will give us a unique dataset of growth conditions for tens of thousands of microbes. In a second stage of this project we will leverage the growth condition data for protein sequence analysis, with the goal of understanding how microbes adapt to growth in diverse conditions. I recently concluded a similar project, dealing with growth temperatures, and made it available on bioRxiv (https://doi.org/10.1101/271569).
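To give a feel for the extraction step, here is a minimal sketch in Python. It assumes the text of a protocol has already been extracted from its PDF, and it only looks for temperature and pH values. The function name, the regular expressions, and the example sentences are hypothetical illustrations, not the scripts that will be used in the project; real protocols vary widely in wording, which is why records with missing or multiple candidate values are flagged for human curation.

```python
import re

# Hypothetical patterns for illustration only; real protocols need broader rules.
TEMPERATURE_RE = re.compile(r"(\d+(?:\.\d+)?)\s*°?\s*C\b")
PH_RE = re.compile(r"\bpH\s*(?:of|to|=|:)?\s*(\d+(?:\.\d+)?)", re.IGNORECASE)


def extract_growth_conditions(protocol_text):
    """Return candidate temperature (°C) and pH values found in the text."""
    temperatures = [float(m) for m in TEMPERATURE_RE.findall(protocol_text)]
    ph_values = [float(m) for m in PH_RE.findall(protocol_text)]
    return {
        "temperature_c": temperatures,
        "ph": ph_values,
        # Anything other than exactly one candidate per property goes to a human.
        "needs_review": len(temperatures) != 1 or len(ph_values) != 1,
    }


if __name__ == "__main__":
    # Made-up example sentences, not taken from a real protocol.
    print(extract_growth_conditions("Adjust pH to 7.2. Incubate at 37 °C for 48 hours."))
    print(extract_growth_conditions("Grow at 30 °C or 37 °C."))
```

The first example yields a single temperature and a single pH value, so it would pass straight through; the second has two candidate temperatures and no pH, so it would be set aside for manual curation.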
You will be given ample time to learn or improve your programming skills in Python and R. These skills are highly sought-after in both industry and academia. You will learn best practices in bioinformatics and computational biology, how to collaborate on code using GitHub, and how to run computationally intensive analyses on the C3SE cluster. Finally, you will learn how to perform clustering and principal component analysis, as well as methods for analyzing protein sequences and their annotation in the context of the physical parameters of a habitat.
You are curious and eager to learn, and you like to solve problems and collaborate with others. Previous experience with programming in Python and R is required if opting for a 6-month project. For a 12-month project there are no such requirements.
If you are interested please send me a letter of motivation at martin.engqvist@chalmers.se
Assistant Professor Martin Engqvist
Chalmers University of Technology
Department of Biology and Biological Engineering,
Division of Systems and Synthetic Biology
Each year the IDEA League offers the students of its partner universities over 180 monthly grants for a short-term research exchange. In general, these grants are awarded based on academic merit. For more information visit http://idealeague.org/student-grant/