Portland State math prof wins $2.1M grant to expand data-driven research, training

The Research Training Group grant is a testament to PSU's ability to develop a leading program in computational math, statistics

Bruno Jedynak, one of the mathematics and statistics professors who is part of the NSF-funded Research Training Group, stands in front of the Coeus high-performance computing cluster on campus. (Credit: NashCo Photo)

As employers clamor for more data scientists and industrial and government labs take on more data-driven research, a new federal grant will help a group of Portland State faculty continue impactful research projects while also training the next generation of researchers to meet the demand. 

The five-year, $2.1 million Research Training Group in Computation- and Data-Enabled Science grant from the National Science Foundation will allow eight Mathematics + Statistics faculty to integrate research and training for as many as five postdoctoral researchers and 30 undergraduate and graduate students at PSU. 

"Everyone wants people who have data acumen, people who can do deep research but who can also work with real data," said lead principal investigator Jay Gopalakrishnan, professor of mathematics. "We have strengths among our faculty in various areas of computational science. We are uniquely positioned to produce these new workforce additions with deep knowledge in the area where they do Ph.D. research, plus a broad understanding of current issues in data-driven science."

Gopalakrishnan said the highly selective grant from the National Science Foundation is a testament to PSU's ability to develop a leading program in computational mathematics and statistics. 

Those efforts began in 2010 with a $3.9 million investment from alum Fariborz Maseeh supporting the recruitment of mid-career faculty from reputed universities. They continued in 2016 with the launch of the Portland Institute for Computational Science and the acquisition and deployment of the first open supercomputing cluster in the state of Oregon — a valuable research and training resource for over 200 members. This coming fall, four new faculty members will join the department as part of a cluster hire in "Computational Science for a Sustainable Future."

"We clearly showed that we have a trajectory of growth and all that began with Dr. Maseeh," Gopalakrishnan said.

The NSF grant will support research on foundational theory in mathematics and statistics as well as on topics as diverse as simulating the optical fibers that drive today's internet; forecasting weather, air quality, and drought; understanding the progression of diseases such as cancer and dementia; and optimizing warehouse locations and wireless services.

Gopalakrishnan said the vertical integration of faculty, postdocs, and students in the group allows for shared learning and mentoring. Postdoctoral researchers can act as faculty multipliers, assisting faculty and, in turn, assisting graduate and undergraduate students in carrying out the group's research and training activities. Graduate students will work closely with faculty mentors while themselves being mentors for undergraduate students.

A major focus of the group will be providing students with opportunities to work with real-world data. Doctoral students will work in a consulting lab on projects from regional clients and have at least two external internships. Postdocs and selected undergraduates will also have the opportunity to participate in client projects and lab activities.

Other elements of the research group include a new seminar series that favors dialogue over monologue, the inclusion of a top external faculty member on each doctoral student's Ph.D. committee, summer boot camps to overcome the anticipated lack of trainee prerequisites, and city-based and community-serving research experiences for undergraduates.

"These are all pretty radical ideas that this grant is enabling us to embark on," Gopalakrishnan said. "I don’t know of any math department in the country with all the innovative training structures we proposed, like the consulting lab, the experimental seminars, the community service aspects, the boot camps, etc."

Gaia discovers strange stars in the most detailed Milky Way survey to date

Today, ESA’s Gaia mission released its new treasure trove of data about our home galaxy. Astronomers describe strange ‘starquakes’, stellar DNA, asymmetric motions, and other fascinating insights in this most detailed Milky Way survey to date.

Gaia is ESA’s mission to create the most accurate and complete multi-dimensional map of the Milky Way. This allows astronomers to reconstruct our home galaxy’s structure and past evolution over billions of years, and to better understand the lifecycle of stars and our place in the Universe.  This image shows four sky maps made with the new ESA Gaia data released on 13 June 2022.

What’s new in data release 3?

Gaia’s data release 3 contains new and improved details for almost two billion stars in our galaxy. The catalog adds new information such as chemical compositions, stellar temperatures, colors, masses, ages, and the speed at which stars move towards or away from us (radial velocity). Much of this information was revealed by the newly released spectroscopy data. Spectroscopy is a technique in which starlight is split into its constituent colors (like a rainbow). The data also includes special subsets of stars, like those that change brightness over time.
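How a spectrum yields a radial velocity can be illustrated with the Doppler shift of a single spectral line. The snippet below is a minimal sketch, not Gaia's processing pipeline: it uses the non-relativistic approximation v = c × (observed − rest) / rest, and the wavelengths in the example are hypothetical values chosen only for illustration.

```python
# Speed of light in km/s (exact by definition of the metre).
C_KM_S = 299_792.458


def radial_velocity_km_s(observed_nm: float, rest_nm: float) -> float:
    """Non-relativistic Doppler estimate of radial velocity.

    Positive values mean the line is redshifted (the star is receding);
    negative values mean it is blueshifted (the star is approaching).
    Only valid when the speed is much smaller than the speed of light.
    """
    return C_KM_S * (observed_nm - rest_nm) / rest_nm


# Hypothetical example: a line with rest wavelength 656.28 nm seen at 656.50 nm.
velocity = radial_velocity_km_s(observed_nm=656.50, rest_nm=656.28)
print(f"Estimated radial velocity: {velocity:.1f} km/s")
```

A real measurement fits many lines at once and corrects for the motion of the spacecraft and the Earth, which this sketch leaves out.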

Also new in this data set is the largest catalog yet of binary stars, thousands of Solar System objects such as asteroids and moons of planets, and millions of galaxies and quasars outside the Milky Way. 

There are six Gaia data processing centers: the Institute of Astronomy in Cambridge, United Kingdom; the University of Geneva in Switzerland; the Barcelona Supercomputing Centre in Spain; the University of Torino in Italy; the Centre National d'Etudes Spatiales in Toulouse, France; and the European Space Astronomy Centre in Madrid, Spain. Each data processing center is responsible for a specific part of the processing and collaborates with the rest of the Gaia consortium to ensure the best scientific data products are obtained.

Starquakes

One of the most surprising discoveries coming out of the new data is that Gaia can detect starquakes – tiny motions on the surface of a star – that change the shapes of stars, something the observatory was not originally built for.

Gaia had previously found radial oscillations that cause stars to swell and shrink periodically while keeping their spherical shape. But Gaia has now also spotted other vibrations that are more like large-scale tsunamis. These nonradial oscillations change the global shape of a star and are therefore harder to detect.

Gaia found strong nonradial starquakes in thousands of stars. Gaia also revealed such vibrations in stars where they have seldom been seen before. According to current theory, these stars should not have any quakes, yet Gaia detected them at their surface.

“Starquakes teach us a lot about stars, notably their internal workings. Gaia is opening a goldmine for ‘asteroseismology' of massive stars,” says Conny Aerts of KU Leuven in Belgium, who is a member of the Gaia collaboration.

The DNA of stars

What stars are made of can tell us about their birthplace and their journey afterward, and therefore about the history of the Milky Way. With today’s data release, Gaia is revealing the largest chemical map of the galaxy coupled to 3D motions, from our solar neighborhood to smaller galaxies surrounding ours.

Some stars contain more ‘heavy metals’ than others. During the Big Bang, only light elements were formed (hydrogen and helium). All other heavier elements – called metals by astronomers – are built inside stars. When stars die, they release these metals into the gas and dust between the stars called the interstellar medium, out of which new stars form. Active star formation and death will lead to an environment that is richer in metals. Therefore, a star’s chemical composition is a bit like its DNA, giving us crucial information about its origin. 

With Gaia, we see that some stars in our galaxy are made of primordial material, while others like our Sun are made of matter enriched by previous generations of stars. Stars that are closer to the center and plane of our galaxy are richer in metals than stars at larger distances. Gaia also identified stars that originally came from different galaxies than our own, based on their chemical composition. 

“Our galaxy is a beautiful melting pot of stars,” says Alejandra Recio-Blanco of the Observatoire de la Côte d’Azur in France, who is a member of the Gaia collaboration. 

“This diversity is extremely important because it tells us the story of our galaxy’s formation. It reveals the processes of migration within our galaxy and accretion from external galaxies. It also clearly shows that our Sun, and we, all belong to an ever-changing system, formed thanks to the assembly of stars and gas of different origins.”

Binary stars, asteroids, quasars, and more

Other papers that are published today reflect the breadth and depth of Gaia's discovery potential. A new binary star catalog presents the mass and evolution of more than 800 thousand binary systems, while a new asteroid survey comprising 156 thousand rocky bodies is digging deeper into the origin of our Solar System. Gaia is also revealing information about 10 million variable stars, mysterious macro-molecules between stars, as well as quasars and galaxies beyond our cosmic neighborhood.

“Unlike other missions that target specific objects, Gaia is a survey mission. This means that while surveying the entire sky with billions of stars multiple times, Gaia is bound to make discoveries that other more dedicated missions would miss. This is one of its strengths, and we can’t wait for the astronomy community to dive into our new data to find out even more about our galaxy and its surroundings than we could’ve imagined,” says Timo Prusti, Project Scientist for Gaia at ESA.


UAB's proteomic analysis of 2,002 tumors identifies 11 pan-cancer molecular subtypes across 14 types of cancer

A new study that analyzed protein levels in 2,002 primary tumors from 14 tissue-based cancer types identified 11 distinct molecular subtypes, providing systematic knowledge that greatly expands a searchable online database that has become a go-to platform for cancer data analysis by users worldwide. To facilitate gene-level queries of data from more than 10,000 cancer patient transcriptome sequences and proteomics data from 2,000 patients, researchers have developed a user-friendly cancer data analysis web platform called UALCAN.

The University of Alabama at Birmingham Cancer Data analysis portal, or UALCAN, was developed and released for public use in 2017 as a user-friendly portal for pan-cancer omics data analysis, including transcriptomics, epigenetics, and proteomics. UALCAN has had nearly 920,000 site visits from researchers in more than 100 countries, and it has been cited more than 2,750 times.

“UALCAN is an effort to distribute comprehensive cancer data to researchers and clinicians in a user-friendly format to make discoveries and find needles in the haystack,” said Sooryanarayana Varambally, Ph.D., professor in the UAB Department of Pathology Division of Molecular and Cellular Pathology and director of UAB’s Translational Oncologic Pathology Research program. “Cancer detection, diagnosis, treatment, cure, and research need a global team effort, and making sense of the huge amount of data involved needs a way to analyze and interpret these data.”

Cancer is a complex disease, and its initiation, progression, and metastasis (the spread to distant organs) involve dynamic molecular changes in each type of cancer. Individual cancer patients show variations apart from some of the common genomic events.

In the new study, Varambally worked with longtime collaborator Chad Creighton, Ph.D., Baylor College of Medicine, Houston, Texas. Creighton led the proteomic study, “Proteogenomic characterization of 2002 human cancers reveals pan-cancer molecular subtypes and associated pathways.” This extends two earlier proteomics studies published in 2019 and 2021.

Previously, the team performed RNA transcript analysis, providing the data to researchers through UALCAN, to determine which pathways the myriad forms of cancer use to aid growth, spread, and aggressiveness. In this recent study, the team performed and incorporated a large-scale proteomics analysis. The data and results provide new ideas for further research and possible therapeutic interventions.

A proteome is the complement of proteins expressed in a cell or tissue, and these can be measured quantitatively through recent technological advances in mass spectrometry. In cells, DNA makes mRNA, and mRNA makes protein, processes known as the central dogma of molecular biology. Proteins are major functional moieties of cells, crucial in cell metabolism, structure, growth, signaling, and movement.

The cancer types represented in the UALCAN proteomic dataset include breast, colorectal, gastric, glioblastoma, head and neck, liver, lung adenocarcinoma, lung squamous, ovarian, pancreatic, pediatric brain, prostate, renal, and uterine cancers. The number of tumors in each cancer type in the study ranged from 76 to 230, with an average of 143. Intriguingly, the pan-cancer, proteome-based subtypes found in the current study cut across tumor lineages.

The compendium proteomic dataset came from 17 individual studies. Corresponding multi-omics data were available for most of these tumors, including mRNA levels, DNA somatic small mutations and insertions/deletions, and DNA somatic copy number alterations.

In general, the researchers found the protein expression of genes across tumors broadly correlated with corresponding mRNA levels or copy number alterations. However, there were some notable exceptions.

They identified 11 distinct proteome-based pan-cancer subtypes — named s1 through s11 — that can provide insights into the deregulated pathways and processes in tumors that make them cancerous. Each subtype spanned multiple tissue-based cancer types, though subtype s11 was specific to brain tumors, spanning glioblastomas and pediatric brain tumors. 

Each subtype expressed specific gene categories, some seen before in a previous, less comprehensive proteomic study. Three subtypes showed new gene categories: subtype s7 with “axon guidance” and “frizzled binding” genes, subtype s10 with “DNA repair” and “chromatin organization” genes, and subtype s11 with “synapse,” “dendrite” and “axon” genes.

At the DNA level, the study detailed differences among the proteome-based subtypes in overall copy number alterations of genes, and somatic mutations in subtypes associated with higher pathway activity, as inferred by proteome or transcriptome data.

“Our study results provide a framework for understanding the molecular landscape of cancers at the proteome level to integrate and compare the data with other molecular correlates of cancers,” Varambally said. “The associated datasets and gene-level associations represent a resource for the research community, including helping to identify gene candidates for functional studies and further develop candidates as diagnostic markers or therapeutic targets for a specific subset of cancers.

“Furthermore, this study reinforces the notion that cancers should be comprehensively surveyed at the protein level, though expression profiling on tumors has historically been mostly limited to the RNA transcript level. Many of the analyses in this ever-evolving cancer data analysis platform are based on user or expert requests, and the team is indebted to the support and encouragement from the researchers who use this platform to make discoveries that make a difference in cancer research.” 

Some of the large datasets for the UAB site are generated by consortiums like The Cancer Genome Atlas, or TCGA, and the Clinical Proteomic Tumor Analysis Consortium, or CPTAC, of the National Cancer Institute. Since the researchers also strive to address cancer health disparities, UALCAN provides an option to analyze the data based on patient race or ethnicity, where it is available.

Precision targeting of cancer requires the identification of individual or subclass-specific genomic and molecular alterations. To help cancer researchers perform various data analyses for a better understanding of these large datasets, Darshan Shimoga Chandrashekar, Ph.D., led the development of the UALCAN portal under the mentorship of Varambally. Updates to this continuously evolving portal were recently published in Neoplasia.

The UALCAN initiative and its continuous development involve contributions from a team of experts including bioinformaticians, computer scientists, statisticians, cancer biologists, pathologists, and oncologists. “It is a team science approach to enable the global cancer research team to tackle cancer,” Varambally said.

Co-first authors of this study are Yiqun Zhang and Fengju Chen, Baylor College of Medicine, and Chandrashekar, UAB Department of Pathology Division of Molecular and Cellular Pathology. 

Pathology is a department in the Marnix E. Heersink School of Medicine at UAB. Varambally is a senior scientist in the O’Neal Comprehensive Cancer Center and the Informatics Institute at UAB and is co-director of the Cancer Biology Theme of Graduate Biomedical Sciences at UAB. He holds an adjunct position at the Michigan Center for Translational Pathology, the University of Michigan, Ann Arbor.

Chinese scientists observe large-scale, ordered, tunable Majorana-zero-mode lattice

In a study, a joint research team led by Prof. GAO Hongjun from the Institute of Physics of the Chinese Academy of Sciences (CAS) has reported the observation of a large-scale, ordered and tunable Majorana-zero-mode (MZM) lattice in the iron-based superconductor LiFeAs, providing a new pathway towards future topological quantum computation.

Fig. 1. Characterization of biaxial CDW region. (Image by Institute of Physics)

MZMs are zero-energy bound states confined in the topological defects of crystals, such as line defects and magnetic field-induced vortices. They are characterized by scanning tunneling microscopy/spectroscopy (STM/S) as zero-bias conductance peaks. They obey non-Abelian statistics and are considered building blocks for future topological quantum computation. 

MZMs have been observed in several topologically nontrivial iron-based superconductors, such as Fe(Te0.55Se0.45), (Li0.84Fe0.16)OHFeSe, and CaKFe4As4. However, these materials suffer from issues with alloying-induced disorder, uncontrollable and disordered vortex lattices, and the low yield of topological vortices, all of which hinder their further study and application.

In this study, the researchers observed the formation of an ordered and tunable MZM lattice in the naturally strained superconductor LiFeAs. Using STM/S equipped with magnetic fields, the researchers found that local strain naturally exists in LiFeAs. Biaxial charge density wave (CDW) stripes along the Fe-Fe and As-As directions are produced by the strain, with periods of λ1~2.7 nm and λ2~24.3 nm. The CDW with period λ2 strongly modulates the superconductivity of LiFeAs.

Fig. 2. MZM in vortices. (Image by Institute of Physics)

Under a magnetic field perpendicular to the sample surface, vortices emerge and are forced to align exclusively along the As-As CDW stripes, forming an ordered lattice. The reduced crystal symmetry leads to a drastic change in the topological band structures at the Fermi level, thus transforming the vortices into topological ones hosting MZMs and forming an ordered MZM lattice. Moreover, the MZM lattice density and geometry are tunable by an external magnetic field. The MZMs start to couple with each other under high magnetic fields.
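Why the field sets the lattice density can be seen from flux quantization: each superconducting vortex carries one flux quantum Φ0 = h/(2e), so the areal vortex density is roughly B/Φ0. The snippet below is a rough sketch of that textbook relation only. The field values are hypothetical and not taken from the study, and the spacing it prints is a simple 1/√n estimate that ignores the stripe-aligned geometry reported for LiFeAs.

```python
import math

# Exact SI values of the Planck constant (J s) and the elementary charge (C).
PLANCK_H = 6.62607015e-34
ELEMENTARY_CHARGE = 1.602176634e-19

# Superconducting flux quantum h / (2e), in webers.
FLUX_QUANTUM = PLANCK_H / (2 * ELEMENTARY_CHARGE)


def vortex_density_per_m2(field_tesla: float) -> float:
    """Areal vortex density, assuming one flux quantum per vortex."""
    return field_tesla / FLUX_QUANTUM


def mean_spacing_nm(field_tesla: float) -> float:
    """Characteristic vortex separation 1/sqrt(n), converted to nanometres."""
    return 1e9 / math.sqrt(vortex_density_per_m2(field_tesla))


# Hypothetical field values, for illustration only.
for field in (0.5, 1.0, 2.0, 4.0):
    density_per_um2 = vortex_density_per_m2(field) * 1e-12
    print(f"B = {field:.1f} T: {density_per_um2:7.1f} vortices per square micron, "
          f"spacing about {mean_spacing_nm(field):.1f} nm")
```

Because the spacing scales as 1/√B, quadrupling the field halves the typical distance between vortices, which is consistent with neighboring MZMs beginning to couple at high fields.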

This observation of a large-scale, ordered and tunable MZM lattice in LiFeAs expands the MZM family found in iron-based superconductors, thus providing a promising platform for manipulating and braiding MZMs in the future, according to the researchers. 

These findings may shed light on the study of topological quantum computation using iron-based superconductors.

Fig. 3. Majorana mechanism in LiFeAs. (Image by Institute of Physics)

Fig. 4. Tuning the MZM lattice with magnetic field. (Image by Institute of Physics)

New Technion research integrating biology, computer science sheds light on the process of protein folding

A study integrating biological ideas and new computer science tools has uncovered novel associations between genetic coding and protein structure, which could potentially change the way we think about protein production in the ribosome – the cell’s “protein assembly line.” The research was conducted by Professor Alex Bronstein, Dr. Ailie Marx, and Ph.D. student Aviv Rosenberg at the Technion – Israel Institute of Technology.

L-R: Prof. Alex Bronstein, Dr. Ailie Marx, Ph.D. student Aviv Rosenberg

Proteins, the complex molecules that play critical roles in virtually every biological mechanism, are produced by ribosomes in a process called translation. The ribosome decodes incoming “genetic instructions” to synthesize chains of amino acids – the building blocks of proteins. When amino acids are sequentially bound together into a long chain, they fold into a unique three-dimensional structure that grants the protein its biological properties and functionality. Translation errors can lead to misfolding and subsequently physiological disorders, both mild and major.

Protein production instructions are delivered to the ribosome as codons, sequences of three “letters” from the genetic nucleotide code, which specify the identity and order of amino acids to be added by the ribosome to the protein chain. For example, the codon UUU signals the addition of the amino acid phenylalanine, whereas the codon UAC calls for the addition of tyrosine. In this way, the codon sequence encodes the unique sequence of amino acids characteristic of each protein. This mapping of genetic codons to amino acids used in translation is common to all living creatures on the planet and is considered a primeval mechanism.

As if all of this were not complicated enough, it is important to point out that 61 codons are decoded into just 20 amino acids. In other words, all but two amino acids are encoded by multiple codons.
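The codon arithmetic above can be checked directly against the standard genetic code. The snippet below is a small illustrative sketch, not code from the study: it builds the standard codon table (one-letter amino acid codes, with `*` marking the three stop codons), translates two short made-up sequences, and counts how many codons encode each amino acid.

```python
from collections import Counter
from itertools import product

BASES = "UCAG"
# Standard genetic code in UCAG order; '*' marks a stop codon.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna: str) -> str:
    """Translate an mRNA string codon by codon, stopping at a stop codon."""
    protein = []
    for start in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[start:start + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Codons that encode an amino acid (everything except the three stops).
sense_codons = {c: aa for c, aa in CODON_TABLE.items() if aa != "*"}
codons_per_amino_acid = Counter(sense_codons.values())
single_codon_amino_acids = sorted(
    aa for aa, count in codons_per_amino_acid.items() if count == 1
)

print(len(sense_codons), "codons encode", len(codons_per_amino_acid), "amino acids")
print("Encoded by a single codon:", single_codon_amino_acids)
# Two made-up synonymous sequences that yield the same two-residue chain.
print(translate("UUUUAC"), translate("UUCUAU"))
```

The two amino acids with a single codon are methionine (M) and tryptophan (W). The last line shows two different codon sequences producing the same phenylalanine-tyrosine chain, which is exactly the kind of “silent” difference the study examines.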

This is where the present research comes into the picture. Based on experiments carried out in the 1960s and 1970s, the accepted dogma states that proteins carry no “memory” of the specific codon from which each amino acid was translated as long as the amino acid identity remains unchanged. These early experiments into protein folding used chemical denaturants to unfold fully formed proteins and then demonstrated that upon removal of these chemicals the protein chain could refold spontaneously to regain its original structure and function. These experiments suggested that only the amino acid sequence, and not the specific codon sequence, determines a protein’s structure. Given this dogma, mutations that change the genetic coding without changing the amino acid are widely termed “silent” and considered inconsequential for protein structure and function.

The Technion research team has uncovered an association between the identity of the codon and the local structure of the translated protein, which suggests that this may not be the general case and that proteins may indeed “remember” the specific instructions from which they were synthesized. The research team analyzed thousands of three-dimensional protein structures using dedicated tools they developed, which integrate advanced computer science methods, machine learning, and statistics. In this way, they accurately compared the distributions of angles formed in these structures for different synonymous codons. Their findings show that for certain codons, there is a significant statistical dependence between the identity of the codon and the local structure of the protein at the position of the amino acid encoded by that codon.
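To give a sense of what comparing angle distributions between synonymous codons can look like, here is a minimal sketch of a two-sample permutation test on circular data. It is not the team's method, which the article does not describe in detail. The statistic (distance between mean direction vectors), the function names, and the synthetic von Mises samples are all assumptions made for illustration, and the angles are assumed to be in radians.

```python
import math
import random


def mean_direction(angles):
    """Mean of the unit vectors (cos a, sin a) for a sample of angles."""
    n = len(angles)
    return (sum(math.cos(a) for a in angles) / n,
            sum(math.sin(a) for a in angles) / n)


def separation(sample_a, sample_b):
    """Distance between the mean direction vectors of two samples."""
    (ax, ay), (bx, by) = mean_direction(sample_a), mean_direction(sample_b)
    return math.hypot(ax - bx, ay - by)


def permutation_p_value(sample_a, sample_b, n_perm=999, seed=0):
    """Share of random relabelings at least as separated as the real labels."""
    rng = random.Random(seed)
    observed = separation(sample_a, sample_b)
    pooled = list(sample_a) + list(sample_b)
    split = len(sample_a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if separation(pooled[:split], pooled[split:]) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)


# Synthetic stand-ins for angles observed at two hypothetical synonymous codons.
data_rng = random.Random(42)
codon_a_angles = [data_rng.vonmisesvariate(1.0, 8.0) for _ in range(200)]
codon_b_angles = [data_rng.vonmisesvariate(1.5, 8.0) for _ in range(200)]

print("p-value:", permutation_p_value(codon_a_angles, codon_b_angles))
```

A small p-value says only that the two angle distributions differ more than chance relabeling would explain. It says nothing about which way any causal effect runs.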

The researchers emphasize that the findings are still unable to shed light on the direction of the causal relationship, meaning that it is not yet possible to say whether a change in genetic coding can cause a change in the local protein structure or whether structural changes may cause different coding, for example through evolutionary processes. This question is the foundation for a subsequent research study now being carried out by the group. According to Dr. Marx, a biologist by training and education, “If we find in subsequent research that the codon indeed has a causal effect on protein folding, this is likely to have a huge impact on our understanding of protein folding, as well as on future applications, such as engineering new proteins.”

Dr. Marx emphasizes that the discovery presented in the article would not have been possible without Prof. Bronstein’s computer and analysis skills. “This research is truly interdisciplinary, because biology alone cannot cope with such vast quantities of data without the help of data science, and computer scientists cannot themselves perform research of this kind since they lack familiarity with the complex biological processes being probed. Therefore, our research highlights the huge advantage of interdisciplinary research that integrates skills from different fields to create a whole that is greater than the sum of its parts.”